0ct0pu5/ladybird

Author	SHA1	Message	Date
Andreas Kling	1de29e3f59	LibWeb: Implement the "self closing start tag" tokenizer state	2020-05-27 18:30:29 +02:00
Andreas Kling	a5ce09f8e3	LibWeb: Implement partial support for numeric character references	2020-05-27 18:30:27 +02:00
Andreas Kling	ecd25ce6c7	LibWeb: Allow HTML tokenizer to emit more than one token Tokens are now put on a queue when emitted, and we always pop from that queue when returning from next_token().	2020-05-26 15:50:05 +02:00
Andreas Kling	406fd95f32	LibWeb: Flesh out the remaining DOCTYPE related tokenizer states We can now parse public and system identifiers! Not super useful, but at least we can do it :^)	2020-05-25 19:51:23 +02:00
Andreas Kling	556a6eea61	LibWeb: Checking for "DOCTYPE" should be case insensitive in tokenizer	2020-05-25 19:51:23 +02:00
Andreas Kling	45da08a1e6	LibWeb: A whole bunch of work towards spec-compliant <script> elements This is still very unfinished, but there's at least a skeleton of code.	2020-05-24 23:54:22 +02:00
Andreas Kling	5d332c1f11	LibWeb: Parse enough to handle a <style> inside a <head> :^)	2020-05-24 23:54:22 +02:00
Andreas Kling	20911efd4d	LibWeb: More work on the HTML parser and tokenizer The parser can now switch the state of the tokenizer! Very webby. :^)	2020-05-24 23:54:22 +02:00
Andreas Kling	96cc1138c0	LibWeb: Remove tokenizer's premature character buffering optimization	2020-05-24 23:54:22 +02:00
Emanuele Torre	3f2158bbfe	LibWeb: HtmlTokenizer.cpp: fix ON_WHITESPACE macro The "audible bell" character ('\a' U+0007) was treated as whitespace while the "line feed" character ('\n' U+000a) was not. '\a' is no longer considered whitespace. '\n' is now considered whitespace.	2020-05-24 09:47:28 +02:00
Andreas Kling	e44c87cfff	LibWeb: Implement enough HTML parsing to handle a small simple DOM :^) We can now parse a little DOM like this: <!DOCTYPE html> <html> <head></head> <body> <div></div> </body> </html> This is pretty slow work, but the incremental progress is satisfying!	2020-05-24 00:49:22 +02:00
Andreas Kling	fd1b31d0ff	LibWeb: Start building the tree building part of the new HTML parser This patch adds a new HTMLDocumentParser class. It keeps a tokenizer object internally and feeds itself with one token at a time from it. The names and idioms in this class are expressed as closely to the actual HTML parsing spec as possible, to make development as easy and bug free as possible. :^) This is going to become pretty large, but it's pretty cool!	2020-05-24 00:14:23 +02:00
Andreas Kling	e45c8b842c	LibWeb: Implement a bit more of DOCTYPE tokenization	2020-05-23 21:08:25 +02:00
Andreas Kling	7be36366be	LibWeb: Emit character/comment tokens lazily to accumulate more data Instead of emitting data-bearing tokens immediately, do it lazily at the next state change. This allows us to accumulate full bursts of text in between tags instead of having one token per character. :^)	2020-05-23 18:44:32 +02:00
Andreas Kling	45450c7edc	LibWeb: Make BEGIN_STATE and END_STATE include some {{{ and }}} This makes it a compile error to omit the END_STATE. Also add some more missing END_STATE's exposed by this (nice!) Thanks to @predmond for suggesting the multi-pair trick! :^)	2020-05-23 15:25:43 +02:00
Andreas Kling	2e4147d0fc	LibWeb: Add missing END_STATE for TagName Fixes #2339.	2020-05-23 10:33:23 +02:00
Andreas Kling	a58500fdc5	LibWeb: Teach HTMLTokenizer how to tokenize comments We can now correctly tokenize the welcome.html test page. :^)	2020-05-23 01:54:26 +02:00
Andreas Kling	6caa5661f3	LibWeb: Teach HTMLTokenizer how to tokenize attributes Properly tokenize single-quoted, double-quoted and unquoted attributes!	2020-05-23 01:22:15 +02:00
Andreas Kling	272b35d2e1	LibWeb: Begin work on a spec-compliant HTML parser In order to actually view the web as it is, we're gonna need a proper HTML parser. So let's build one! This patch introduces the Web::HTMLTokenizer class, which currently operates on a StringView input stream where it fetches (ASCII only atm) codepoints and tokenizes according to the HTML spec tokenization algo. The tokenizer state machine looks a bit weird but is written in a way that tries to mimic the spec as closely as possible, in order to make development easier and bugs less likely. This initial version is far from finished, but it can parse a trivial document with a DOCTYPE and open/close tags. :^)	2020-05-22 21:46:13 +02:00

19 commits
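
Commit 3f2158bbfe describes a whitespace check that accepted '\a' (U+0007) and rejected '\n' (U+000A). The sketch below shows what the corrected check might look like as a plain function. It is not the ON_WHITESPACE macro from HtmlTokenizer.cpp: the name is_tokenizer_whitespace is hypothetical, and the character set is assumed from the HTML tokenization spec (tab, line feed, form feed, space) rather than taken from the patch itself.

```cpp
#include <cassert>

// Hypothetical helper for illustration only; the real code uses an
// ON_WHITESPACE macro whose exact body is not shown in this log.
// Assumed whitespace set, per the HTML tokenization spec:
// tab, line feed, form feed, space.
constexpr bool is_tokenizer_whitespace(char32_t code_point)
{
    return code_point == U'\t'  // U+0009 CHARACTER TABULATION
        || code_point == U'\n'  // U+000A LINE FEED
        || code_point == U'\f'  // U+000C FORM FEED
        || code_point == U' ';  // U+0020 SPACE
}
```

With this set, a call such as is_tokenizer_whitespace(U'\n') is true and is_tokenizer_whitespace(U'\a') is false, which matches the behavior the commit message describes after the fix.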