74 commits

Author SHA1 Message Date
Luke
821312729a LibWeb: Fully implement all DOCTYPE tokenizer states
Also fixes TagOpen having a separate emit and reconsume in
ANYTHING_ELSE.
2020-06-14 13:47:19 +02:00
Andreas Kling
9b17bf3dcd LibWeb: Use HTML::TagNames globals in the new HTML parser 2020-06-07 23:53:16 +02:00
Andreas Kling
be6abce44f LibWeb: Handle EOF tokens during "text" insertion 2020-06-06 16:36:18 +02:00
Andreas Kling
3337365000 LibWeb: Parse param/source/track start tags during "in body" insertion 2020-06-05 21:59:46 +02:00
Andreas Kling
b4591f0037 LibWeb: Fix parsing of "<textarea></textarea>"
When handling a "textarea" start tag, we have to ignore the next token
if it's an LF ('\n'). However, we were not switching the tokenizer
state before fetching the lookahead token, and this caused us to force
the tokenizer into the RCDATA state too late, effectively getting it
stuck in that state for way longer than it should be.

Fixes #2508.
2020-06-05 12:05:42 +02:00
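
The ordering issue described in b4591f0037 can be illustrated with a toy model. This is only a sketch under stated assumptions: `ToyTokenizer`, `Token`, and `textarea_text` are hypothetical names invented for illustration and are not LibWeb's actual classes. The point it shows is that the tokenizer state is switched to RCDATA before the lookahead token is fetched, so the lookahead itself is already tokenized as text.

```cpp
#include <cassert>
#include <cstddef>
#include <string>

// Hypothetical toy types for illustration only; not LibWeb's real tokenizer.
enum class State { Data, RCDATA };

struct Token {
    enum class Type { Character, Tag, EndOfFile };
    Type type;
    std::string data;
};

class ToyTokenizer {
public:
    explicit ToyTokenizer(std::string input) : m_input(std::move(input)) {}

    void switch_to(State state) { m_state = state; }

    Token next_token()
    {
        if (m_pos >= m_input.size())
            return { Token::Type::EndOfFile, "" };

        if (m_state == State::RCDATA) {
            // In RCDATA, only the matching end tag is recognized as a tag.
            const std::string end_tag = "</textarea>";
            if (m_input.compare(m_pos, end_tag.size(), end_tag) == 0) {
                m_pos += end_tag.size();
                m_state = State::Data;
                return { Token::Type::Tag, "/textarea" };
            }
            return { Token::Type::Character, std::string(1, m_input[m_pos++]) };
        }

        // Data state: anything of the form <...> becomes a tag token.
        if (m_input[m_pos] == '<') {
            std::size_t close = m_input.find('>', m_pos);
            if (close != std::string::npos) {
                std::string name = m_input.substr(m_pos + 1, close - m_pos - 1);
                m_pos = close + 1;
                return { Token::Type::Tag, name };
            }
        }
        return { Token::Type::Character, std::string(1, m_input[m_pos++]) };
    }

private:
    std::string m_input;
    std::size_t m_pos = 0;
    State m_state = State::Data;
};

// Collects the text of a <textarea>, given the input that follows its start tag.
std::string textarea_text(const std::string& input_after_start_tag)
{
    ToyTokenizer tokenizer(input_after_start_tag);

    // Switch the state first, then fetch the lookahead token.
    tokenizer.switch_to(State::RCDATA);
    Token token = tokenizer.next_token();

    // A single leading LF right after the start tag is ignored.
    if (token.type == Token::Type::Character && token.data == "\n")
        token = tokenizer.next_token();

    std::string text;
    while (token.type == Token::Type::Character) {
        text += token.data;
        token = tokenizer.next_token();
    }
    return text;
}
```

With this ordering, `textarea_text("</textarea>")` yields an empty string, and markup-like content such as `<b>` inside the element is kept as text.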
Kyle McLean
b9549078cc LibWeb: Handle "html" end tag during "in body" 2020-06-04 09:09:33 +02:00
Kyle McLean
a3bf3a5d68 LibWeb: Handle "xmp" start tag during "in body" 2020-06-04 09:09:33 +02:00
Kyle McLean
c70bd0ba58 LibWeb: Handle "nobr" start tag during "in body" 2020-06-04 09:09:33 +02:00
Kyle McLean
22521e57fd LibWeb: Handle "form" end tag during "in body" if stack of open elements does not contain "template" 2020-06-04 09:09:33 +02:00
Kyle McLean
4edd0643a6 LibWeb: Handle NULL character during "in body" 2020-06-04 09:09:33 +02:00
Kyle McLean
5e3972a946 LibWeb: Parse "body" end tags during "in body" 2020-06-04 09:09:33 +02:00
Kyle McLean
1ad81e4833 LibWeb: Parse "br" end tags during "in body" 2020-06-04 09:09:33 +02:00
Kyle McLean
9fca4b56d3 LibWeb: Parse end tags for "applet", "marquee", and "object" during "in body" 2020-06-04 09:09:33 +02:00
Andreas Kling
3c2fbc825c LibWeb: Call children_changed() on text nodes when flushing characters
Now that we flush characters in a single place, we can call the Text's
children_changed() from there instead of having a goofy targeted hack
for <style> elements. :^)
2020-06-03 22:13:29 +02:00
Andreas Kling
c40de9275a LibWeb: Buffer text node character insertions in the new parser
Instead of appending character-at-a-time, we now buffer character
insertions in a StringBuilder, and flush them to the relevant node
whenever we start inserting into a new node (and when parsing ends.)
2020-06-03 21:53:08 +02:00
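
The buffering approach from c40de9275a might look roughly like the sketch below. It assumes a plain `std::string` in place of the StringBuilder named in the commit, and `TextNode` and `CharacterBuffer` are hypothetical names, not LibWeb's API.

```cpp
#include <cassert>
#include <string>

// Hypothetical stand-in for a DOM text node; not LibWeb's actual Text class.
struct TextNode {
    std::string data;
    int flush_count = 0; // how many times buffered text was appended
};

// Buffers character insertions and flushes them when the target node changes.
class CharacterBuffer {
public:
    void insert_character(TextNode& target, char ch)
    {
        if (m_target != &target) {
            // Starting to insert into a new node: flush what we have so far.
            flush();
            m_target = &target;
        }
        m_pending.push_back(ch);
    }

    // Also has to be called once when parsing ends.
    void flush()
    {
        if (m_target && !m_pending.empty()) {
            m_target->data += m_pending;
            ++m_target->flush_count;
        }
        m_pending.clear();
    }

private:
    TextNode* m_target = nullptr;
    std::string m_pending;
};
```

Inserting five characters into one node and then one into another results in a single append to the first node rather than five.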
Andreas Kling
410fa5abe0 LibWeb: Parse barebones document without doctype, <html>, etc.
Last night I tried making a little test page that had a bunch of <img>
elements and nothing else. It didn't work.

Fix this by correctly adding a synthesized <html> element to the
document if we get something else in the "before html" insertion mode.
2020-06-02 08:50:33 +02:00
Andreas Kling
e5ddb76a67 LibWeb: Support "td" and "th" start tags during "in table body"
This makes it possible to load Google Image Search results. You can't
see the images yet, but it's still something. :^)
2020-06-01 22:09:09 +02:00
Andreas Kling
8766e49a7c LibWeb+Browser: Use the new HTML parser by default
You can still run the old parser with "br -O", but the new one is good
enough to be the default parser now. We'll fix issues as we go and
eventually remove the old one completely. :^)
2020-06-01 19:08:31 +02:00
Andreas Kling
5944abf31c LibWeb: More parser cases in the "in body" and "after after body" modes 2020-06-01 18:46:11 +02:00
Andreas Kling
8429551368 LibWeb: Implement more of the "after head" insertion mode 2020-06-01 18:46:11 +02:00
Andreas Kling
d058addd74 LibWeb: Handle "dd" and "dt" end tags during "in body" 2020-05-30 23:00:35 +02:00
Andreas Kling
ca6fbefbc9 LibWeb: Support parsing "select" elements (outside of tables) 2020-05-30 19:58:52 +02:00
Andreas Kling
60352c7b9b LibWeb: Hack the parser to dodge <template> elements in <head> for now 2020-05-30 19:23:04 +02:00
Andreas Kling
ca23db10ef LibWeb: Don't crash when encountering <svg> or <math> elements
Just treat them like unknown elements for now. :^)
2020-05-30 18:46:39 +02:00
Andreas Kling
756829555a LibWeb: Parse "textarea" tags during the "in body" insertion mode
Had to handle some more cases in the tokenizer to support this.
2020-05-30 18:40:23 +02:00
Andreas Kling
f4778d1ba0 LibWeb: Add missing special tag case in the "in body" insertion mode 2020-05-30 18:26:44 +02:00
Andreas Kling
5818ef2c80 LibWeb: Implement more table-related insertion modes 2020-05-30 18:26:44 +02:00
Andreas Kling
8c96b8174b LibWeb: Handle AAA situation where there's no formatting element found
In this case, we're supposed to return from the AAA and then jump to a
different behavior in the "in body" insertion mode. So now we do that.
2020-05-30 17:47:50 +02:00
Andreas Kling
f662b1ea37 LibWeb: Implement enough parsing to parse the HTML spec front page :^)
We can now actually open http://html.spec.whatwg.org/ in Browser.
2020-05-30 13:07:47 +02:00
Andreas Kling
770372ad02 LibWeb: Handle end-of-file token during "in body" insertion mode 2020-05-30 12:40:12 +02:00
Andreas Kling
368044eabd LibWeb: Flesh out the "in head" insertion mode and add missing cases 2020-05-30 12:28:12 +02:00
Andreas Kling
e82226f3fb LibWeb: Handle two kinds of deferred script executions
This patch adds two script lists to Document:

- Scripts to execute when parsing has finished
- Scripts to execute as soon as possible

Since we don't actually load scripts asynchronously yet (we just do a
synchronous load when parsing the <script> element for simplicity),
these are already loaded by the time we get to "The end" of parsing.
2020-05-30 12:26:15 +02:00
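
A minimal sketch of the two script lists described in e82226f3fb follows. `ToyDocument`, `Script`, and the method names are hypothetical, and the order in which the two lists are drained (parsing-finished list first) is an assumption of this sketch rather than something the commit message states.

```cpp
#include <cassert>
#include <string>
#include <vector>

// Hypothetical stand-in for a script element that has already been loaded.
struct Script {
    std::string name;
};

class ToyDocument {
public:
    void add_script_to_execute_when_parsing_has_finished(Script script)
    {
        m_when_parsing_has_finished.push_back(std::move(script));
    }

    void add_script_to_execute_as_soon_as_possible(Script script)
    {
        m_as_soon_as_possible.push_back(std::move(script));
    }

    // Called at "The end" of parsing. Since loading is synchronous in this
    // model, every script is ready and can simply be run in list order.
    std::vector<std::string> run_pending_scripts()
    {
        std::vector<std::string> executed;
        for (const auto& script : m_when_parsing_has_finished)
            executed.push_back(script.name);
        for (const auto& script : m_as_soon_as_possible)
            executed.push_back(script.name);
        m_when_parsing_has_finished.clear();
        m_as_soon_as_possible.clear();
        return executed;
    }

private:
    std::vector<Script> m_when_parsing_has_finished;
    std::vector<Script> m_as_soon_as_possible;
};
```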
Andreas Kling
fbd52047bb LibWeb: Parse "form" tags during the "in body" insertion mode 2020-05-30 11:31:49 +02:00
Andreas Kling
b9d5d45eff LibWeb: Handle an error condition for "a" start tag during "in body"
If we have an <a> element on the list of active formatting elements
when hitting another "a" start tag, that's a parse error. Recover by
using the AAA.
2020-05-30 11:31:49 +02:00
Andreas Kling
1ef5d609d9 AK+LibC: Add TODO() as an alternative to ASSERT_NOT_REACHED()
I've been using this in the new HTML parser and it makes it much easier
to understand the state of unfinished code branches.

TODO() is for places where it's okay to end up but we need to implement
something there.

ASSERT_NOT_REACHED() is for places where it's not okay to end up, and
something has gone wrong.
2020-05-30 11:31:49 +02:00
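
The distinction drawn in 1ef5d609d9 can be sketched as follows. The macro bodies here are stand-in definitions written for this example, not the ones in AK or LibC, and `InsertionMode` and `describe` are hypothetical.

```cpp
#include <cassert>
#include <cstdio>
#include <cstdlib>
#include <string>

// Stand-in definitions for illustration; the real macros live in AK/LibC.
#define ASSERT_NOT_REACHED()                                                        \
    do {                                                                            \
        std::fprintf(stderr, "ASSERT_NOT_REACHED at %s:%d\n", __FILE__, __LINE__);  \
        std::abort();                                                               \
    } while (0)

#define TODO()                                                      \
    do {                                                            \
        std::fprintf(stderr, "TODO at %s:%d\n", __FILE__, __LINE__); \
        std::abort();                                               \
    } while (0)

enum class InsertionMode { Initial, InBody, InFrameset };

std::string describe(InsertionMode mode)
{
    switch (mode) {
    case InsertionMode::Initial:
        return "initial";
    case InsertionMode::InBody:
        return "in body";
    case InsertionMode::InFrameset:
        // A legitimate place to end up; it just isn't implemented yet.
        TODO();
    }
    // Every enumerator is handled above, so getting here means a bug.
    ASSERT_NOT_REACHED();
}
```

Both macros abort in this sketch; the difference is what the message tells the reader about the state of the code.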
Andreas Kling
cfbd95f42a LibWeb: Turn a bunch of ASSERT_NOT_REACHED() in the parser into TODO() 2020-05-30 11:31:49 +02:00
Andreas Kling
6854f726ce LibWeb: Improve support for "a" and "li" during "in body" insertion
We can now parse welcome.html once again, without resorting to hacks
or fallbacks during "in body" :^)
2020-05-30 11:31:49 +02:00
Andreas Kling
30d64fccde LibWeb: Parse "li" start tags in the "in body" insertion mode 2020-05-30 11:31:49 +02:00
Andreas Kling
2b1517f215 LibWeb: Add all branches from the parsing spec to "in body"
This makes us crash in TODO() more often, but it's better that we know
what's missing instead of incorrectly ending up on the fallback path.
2020-05-30 11:31:49 +02:00
Andreas Kling
68b1bdc234 LibWeb: Add a way to stop the new HTML parser
Some things are specced to "stop parsing", which basically just means
to stop fetching tokens and jump to "The end".
2020-05-28 18:55:18 +02:00
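
The "stop parsing" behavior from 68b1bdc234 could be modeled as a flag checked by the token loop. This is a sketch only: `ToyParser`, its string tokens, and the `"stop"` trigger are all hypothetical.

```cpp
#include <cassert>
#include <string>
#include <vector>

class ToyParser {
public:
    void run(const std::vector<std::string>& tokens)
    {
        for (const auto& token : tokens) {
            // Once something has asked us to stop, fetch no more tokens.
            if (m_stop_parsing)
                break;
            process(token);
        }
        the_end();
    }

    const std::vector<std::string>& processed() const { return m_processed; }
    bool reached_the_end() const { return m_reached_the_end; }

private:
    void process(const std::string& token)
    {
        m_processed.push_back(token);
        // In this toy model, a "stop" token is what triggers "stop parsing".
        if (token == "stop")
            m_stop_parsing = true;
    }

    void the_end() { m_reached_the_end = true; }

    std::vector<std::string> m_processed;
    bool m_stop_parsing = false;
    bool m_reached_the_end = false;
};
```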
Andreas Kling
00b44ab148 LibWeb: Implement more of the "after body" insertion mode 2020-05-28 18:52:32 +02:00
Andreas Kling
cba5d59adc LibWeb: Parse comments in the "in body" insertion mode 2020-05-28 18:46:39 +02:00
Andreas Kling
5f8cbe6a1b LibWeb: Fix HTMLDocumentParser build 2020-05-28 18:20:55 +02:00
Andreas Kling
308cb69329 LibWeb: Remove a misplaced call to close_a_p_element() in "in body"
This should only be done for the corresponding start tags.
2020-05-28 18:18:20 +02:00
Andreas Kling
c84212aaba LibWeb: Add a StackOfOpenElements helper for "popping until a tag name" 2020-05-28 18:18:20 +02:00
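
A helper of the kind added in c84212aaba might look like this sketch. It stores tag names as strings instead of element pointers, and both the method name and the inclusive behavior (the matching element is popped too) are assumptions of this example.

```cpp
#include <cassert>
#include <cstddef>
#include <string>
#include <vector>

// Hypothetical simplified stack; the real one holds elements, not tag names.
class ToyStackOfOpenElements {
public:
    void push(std::string tag_name) { m_elements.push_back(std::move(tag_name)); }

    // Pops elements until one with the given tag name has been popped.
    void pop_until_an_element_with_tag_name_has_been_popped(const std::string& tag_name)
    {
        while (!m_elements.empty()) {
            std::string popped = std::move(m_elements.back());
            m_elements.pop_back();
            if (popped == tag_name)
                break;
        }
    }

    std::size_t size() const { return m_elements.size(); }
    const std::string& current_node() const { return m_elements.back(); }

private:
    std::vector<std::string> m_elements;
};
```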
Andreas Kling
5e53c45113 LibWeb: Plumb content encoding into the new HTML parser
We still don't handle non-ASCII input correctly, but at least now we'll
convert e.g. ISO-8859-1 to UTF-8 before starting to tokenize.
This patch also makes "view source" work with the new parser. :^)
2020-05-28 12:35:19 +02:00
Andreas Kling
772b51038e LibWeb: Parse "input" tags during the "in body" insertion mode 2020-05-28 12:19:18 +02:00
Andreas Kling
7aa7a2078f LibWeb: Parse "td" start tags during "in cell" insertion mode 2020-05-28 11:46:08 +02:00
Andreas Kling
ebb1649a52 LibWeb: Implement more table support in the new HTML parser
This is enough to parse the Google front page! (Note: I did have to
hack the tokenizer while parsing Google, in order to avoid named
character references screwing everything up. We'll fix that too soon
enough!)
2020-05-28 00:27:46 +02:00
Andreas Kling
7f18c51f4c LibWeb: Flesh out "reset the insertion mode appropriately" algorithm 2020-05-28 00:27:00 +02:00