0ct0pu5/ladybird

Author	SHA1	Message	Date
Andreas Kling	cc4109c03b	LibWeb: Move the HTML parser into HTML/Parser/	2020-07-28 19:23:18 +02:00
Andreas Kling	c46439f240	LibWeb: Move HTML classes into the Web::HTML namespace	2020-07-28 18:55:48 +02:00
Luke	19d6884529	LibWeb: Implement quirks mode detection This allows us to determine which mode to render the page in. Exposes "doctype" and "compatMode" on Document. Exposes "name", "publicId" and "systemId" on DocumentType.	2020-07-21 01:08:32 +02:00
Luke	2df69317f1	LibWeb: Implement almost all missing tokenizer cases	2020-06-28 16:56:26 +02:00
Kevin Meyer	22b20c381f	LibWeb: Implement remaining missing tokenizer EOF cases	2020-06-27 13:27:10 +02:00
Andreas Kling	8e6522d034	LibWeb: Implement some missing tokenizer cases for EOF handling	2020-06-26 22:47:07 +02:00
Andreas Kling	c33d17d363	LibWeb: Fix tokenization of attributes with URL query strings in them <a href="/foo&amp=bar"> was being tokenized into <a href="/foo&=bar">. The spec mentions this but I had overlooked it. The bug happens because we interpreted the "&amp" as a named character reference.	2020-06-23 16:45:01 +02:00
stelar7	5eb39a5f61	LibWeb: Update parser with more insertion modes :^) Implements handling of InHeadNoScript, InSelectInTable, InTemplate, InFrameset, AfterFrameset, and AfterAfterFrameset.	2020-06-21 10:13:31 +02:00
Luke	a1838f676e	LibWeb: Implement all CDATA tokenizer states Even though we haven't implemented any switches to these states yet, we may as well have them ready for when we do implement the switches.	2020-06-14 13:47:19 +02:00
Luke	821312729a	LibWeb: Fully implement all DOCTYPE tokenizer states Also fixes TagOpen having a separate emit and reconsume in ANYTHING_ELSE.	2020-06-14 13:47:19 +02:00
Luke	ab1df177d8	LibWeb: Fully implement all comment tokenizer states	2020-06-14 13:47:19 +02:00
Andreas Kling	47df0cbbc8	LibWeb: Fix broken tokenization of hexadecimal character references We were interpreting 'A'-'F' as decimal digits which didn't work right.	2020-06-13 13:46:12 +02:00
Andreas Kling	ab4c03ce2d	LibWeb: Fix tokenizer swallowing an extra token after a named entity	2020-06-07 19:09:03 +02:00
Luke	61d5bec739	LibWeb: Fully implement all script tokenizer states Also fixes RAWTEXTLessThanSign having a separate emit and reconsume.	2020-06-06 09:55:15 +02:00
Andreas Kling	4e71684a3a	LibWeb: Fix missing tokenizer state change in RCDATALessThanSign We can't RECONSUME_IN after we've used EMIT_CHARACTER since we'll have returned from the function.	2020-06-05 12:02:30 +02:00
Andreas Kling	b59f4632d5	LibWeb: Unbreak character reference and DOCTYPE parsing post-UTF-8 Oops, these were still using the byte-offset cursor. My goodness is it unergonomic to index into UTF-8 strings, but Dr. Bugaev says it's good. There is lots of room for improvement here. Just like the rest of the tokenizer and parser. We'll have to do a few optimization passes over them once they mature.	2020-06-04 22:09:36 +02:00
Andreas Kling	b6288163f1	LibWeb: Make the new HTML parser parse input as UTF-8 We already convert the input to UTF-8 before starting the tokenizer, so all this patch had to do was switch the tokenizer to use an Utf8View for its input (and to emit 32-bit codepoints.)	2020-06-04 21:12:17 +02:00
Andreas Kling	19190267a6	LibWeb: Fix incorrectly consumed characters after reference tokens The NumericCharacterReferenceEnd tokenizer state should not advance the input stream.	2020-06-04 16:49:21 +02:00
Andreas Kling	ca33bc7895	LibWeb: Fix tokenization of attributes with empty attributes We were neglecting to emit start tags for tags where the last attribute had no value. Also fix a parse error TODO that I hit while looking at this.	2020-06-04 12:00:09 +02:00
Andreas Kling	a3936f10eb	LibWeb: Fix tokenizing scripts with '<' in them The EMIT_CHARACTER_AND_RECONSUME_IN was emitting the current token instead of the specified codepoint.	2020-06-02 14:27:53 +02:00
Andreas Kling	77a3710e9d	LibWeb: Tokenize "anything else" in CommentLessThanSignBangDashDash	2020-06-01 20:14:23 +02:00
Andreas Kling	db93db8100	LibWeb: Put whining about tokenizer errors behind an #ifdef Real web content has tons of tokenizer errors and we don't need to complain every time as that makes the debug log unbearable.	2020-06-01 18:46:11 +02:00
Andreas Kling	a775c2c717	LibWeb: Handle more cases in the SelfClosingStartTag tokenizer state	2020-06-01 18:46:11 +02:00
Andreas Kling	f3b09ddd8e	LibWeb: Implement more of the ScriptDataEndTagName tokenizer state Some of this is extremely repetitive. We'll need to rethink how we do queue/emit to improve this.	2020-05-30 23:00:35 +02:00
Andreas Kling	756829555a	LibWeb: Parse "textarea" tags during the "in body" insertion mode Had to handle some more cases in the tokenizer to support this.	2020-05-30 18:40:23 +02:00
Andreas Kling	c9dd459822	LibWeb: Implement some more RAWTEXT stuff in the tokenizer	2020-05-30 17:47:50 +02:00
TheDumpap	d92c9d3772	LibWeb: Implement more of the tokenizer states Slowly adding more unimplemented options for tokenizer states.	2020-05-30 17:47:50 +02:00
Andreas Kling	62885b5646	LibWeb: Fix accidental swallow of self-closing tag tokens Instead of dropping self-closing tags on the floor, we now emit them into the token stream. :^)	2020-05-30 11:31:49 +02:00
Andreas Kling	851a0f983a	LibWeb: Tokenizing a semicolon-less HTML entity is (just a) parse error No need to blow chunks over this.	2020-05-30 11:31:49 +02:00
Andreas Kling	1ef5d609d9	AK+LibC: Add TODO() as an alternative to ASSERT_NOT_REACHED() I've been using this in the new HTML parser and it makes it much easier to understand the state of unfinished code branches. TODO() is for places where it's okay to end up but we need to implement something there. ASSERT_NOT_REACHED() is for places where it's not okay to end up, and something has gone wrong.	2020-05-30 11:31:49 +02:00
Andreas Kling	bb2f22577b	LibWeb: Implement a bunch more script-related tokenization states	2020-05-28 18:44:17 +02:00
Andreas Kling	5e53c45113	LibWeb: Plumb content encoding into the new HTML parser We still don't handle non-ASCII input correctly, but at least now we'll convert e.g ISO-8859-1 to UTF-8 before starting to tokenize. This patch also makes "view source" work with the new parser. :^)	2020-05-28 12:35:19 +02:00
Andreas Kling	5c35f3c9ba	LibWeb: Support named character references (e.g "&amp;")	2020-05-28 11:44:19 +02:00
Andreas Kling	39b5494aeb	LibWeb: Implement the "after attribute name" tokenizer state One little step at a time towards parsing the monster blob of HTML we get from twitter.com :^)	2020-05-27 18:30:29 +02:00
Andreas Kling	1de29e3f59	LibWeb: Implement the "self closing start tag" tokenizer state	2020-05-27 18:30:29 +02:00
Andreas Kling	a5ce09f8e3	LibWeb: Implement partial support for numeric character references	2020-05-27 18:30:27 +02:00
Andreas Kling	ecd25ce6c7	LibWeb: Allow HTML tokenizer to emit more than one token Tokens are now put on a queue when emitted, and we always pop from that queue when returning from next_token().	2020-05-26 15:50:05 +02:00
Andreas Kling	406fd95f32	LibWeb: Flesh out the remaining DOCTYPE related tokenizer states We can now parse public and system identifiers! Not super useful, but at least we can do it :^)	2020-05-25 19:51:23 +02:00
Andreas Kling	556a6eea61	LibWeb: Checking for "DOCTYPE" should be case insensitive in tokenizer	2020-05-25 19:51:23 +02:00
Andreas Kling	45da08a1e6	LibWeb: A whole bunch of work towards spec-compliant <script> elements This is still very unfinished, but there's at least a skeleton of code.	2020-05-24 23:54:22 +02:00
Andreas Kling	5d332c1f11	LibWeb: Parse enough to handle a <style> inside a <head> :^)	2020-05-24 23:54:22 +02:00
Andreas Kling	20911efd4d	LibWeb: More work on the HTML parser and tokenizer The parser can now switch the state of the tokenizer! Very webby. :^)	2020-05-24 23:54:22 +02:00
Andreas Kling	96cc1138c0	LibWeb: Remove tokenizer's premature character buffering optimization	2020-05-24 23:54:22 +02:00
Emanuele Torre	3f2158bbfe	LibWeb: HtmlTokenizer.cpp: fix ON_WHITESPACE macro The "audible bell" character ('\a' U+0007) was treated as whitespace while the "line feed" character ('\n' U+000a) was not. '\a' is no longer considered whitespace. '\n' is now considered whitespace.	2020-05-24 09:47:28 +02:00
Andreas Kling	e44c87cfff	LibWeb: Implement enough HTML parsing to handle a small simple DOM :^) We can now parse a little DOM like this: <!DOCTYPE html> <html> <head></head> <body> <div></div> </body> </html> This is pretty slow work, but the incremental progress is satisfying!	2020-05-24 00:49:22 +02:00
Andreas Kling	fd1b31d0ff	LibWeb: Start building the tree building part of the new HTML parser This patch adds a new HTMLDocumentParser class. It keeps a tokenizer object internally and feeds itself with one token at a time from it. The names and idioms in this class are expressed as closely to the actual HTML parsing spec as possible, to make development as easy and bug free as possible. :^) This is going to become pretty large, but it's pretty cool!	2020-05-24 00:14:23 +02:00
Andreas Kling	e45c8b842c	LibWeb: Implement a bit more of DOCTYPE tokenization	2020-05-23 21:08:25 +02:00
Andreas Kling	7be36366be	LibWeb: Emit character/comment tokens lazily to accumulate more data Instead of emitting data-bearing tokens immediately, do it lazily at the next state change. This allows us to accumulate full bursts of text in between tags instead of having one token per character. :^)	2020-05-23 18:44:32 +02:00
Andreas Kling	45450c7edc	LibWeb: Make BEGIN_STATE and END_STATE include some {{{ and }}} This makes it a compile error to omit the END_STATE. Also add some more missing END_STATE's exposed by this (nice!) Thanks to @predmond for suggesting the multi-pair trick! :^)	2020-05-23 15:25:43 +02:00
Andreas Kling	2e4147d0fc	LibWeb: Add missing END_STATE for TagName Fixes #2339.	2020-05-23 10:33:23 +02:00
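
Note on 47df0cbbc8 (hexadecimal character references): the message says 'A'-'F' were being interpreted as decimal digits. A minimal sketch of digit handling that avoids this kind of bug is shown below. It is not the LibWeb code; hex_digit_value and accumulate_hex_reference are hypothetical names chosen for illustration, and overflow and out-of-range code points are deliberately not handled.

```cpp
#include <cstdint>
#include <optional>
#include <string_view>

// Hypothetical helper: numeric value of a single ASCII hex digit,
// or std::nullopt if the character is not a hex digit.
std::optional<std::uint32_t> hex_digit_value(char c)
{
    if (c >= '0' && c <= '9')
        return static_cast<std::uint32_t>(c - '0');
    // Letters contribute 10..15, not their offset from '0'.
    if (c >= 'a' && c <= 'f')
        return static_cast<std::uint32_t>(c - 'a' + 10);
    if (c >= 'A' && c <= 'F')
        return static_cast<std::uint32_t>(c - 'A' + 10);
    return std::nullopt;
}

// Hypothetical helper: accumulate the digits that follow "&#x",
// stopping at the first character that is not a hex digit.
std::uint32_t accumulate_hex_reference(std::string_view digits)
{
    std::uint32_t code = 0;
    for (char c : digits) {
        auto value = hex_digit_value(c);
        if (!value)
            break;
        code = code * 16 + *value;
    }
    return code;
}
```

With these helpers, the digits of "&#x41;" accumulate to 0x41, whichever case the letters use.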


53 commits
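
Note on c33d17d363 (attributes with URL query strings): the HTML tokenizer rules keep a matched name such as "&amp" as literal text when it appears inside an attribute value, lacks the trailing semicolon, and is followed by "=" or an ASCII alphanumeric. A minimal sketch of that check follows. It is not the LibWeb implementation; the function and parameter names are hypothetical.

```cpp
// Hypothetical helper: ASCII-only alphanumeric test (locale independent).
bool is_ascii_alphanumeric(char c)
{
    return (c >= '0' && c <= '9') || (c >= 'a' && c <= 'z') || (c >= 'A' && c <= 'Z');
}

// Hypothetical helper: returns true when a matched named character
// reference should be left as literal text instead of being decoded.
bool keep_reference_as_literal_text(bool in_attribute, bool ended_with_semicolon, char next_char)
{
    // Outside attribute values, or with a terminating ';', decode as usual.
    if (!in_attribute || ended_with_semicolon)
        return false;
    // Historical rule: "&amp=bar" and "&ampx" inside attributes stay literal.
    return next_char == '=' || is_ascii_alphanumeric(next_char);
}
```

For href="/foo&amp=bar" the matched name is "amp", no semicolon follows, and the next character is "=", so the check returns true and the attribute value stays "/foo&amp=bar".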