When handling a "textarea" start tag, we have to ignore the next token
if it's an LF ('\n'). However, we were not switching the tokenizer
state before fetching the lookahead token, and this caused us to force
the tokenizer into the RCDATA state too late, effectively getting it
stuck in that state for way longer than it should be.
Fixes #2508.
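
To illustrate why the ordering matters, here is a minimal sketch with a toy tokenizer. Everything in it (ToyTokenizer, Token, parse_textarea_contents) is hypothetical and much simpler than the real code; it only shows that the state switch has to happen before the lookahead token is fetched.

```cpp
#include <string>
#include <string_view>
#include <utility>

// Toy model only: these names are hypothetical, not the real tokenizer API.
enum class State { Data, RCDATA };

struct Token {
    enum class Type { Character, Tag, EndOfFile };
    Type type;
    std::string data;
};

class ToyTokenizer {
public:
    explicit ToyTokenizer(std::string input) : m_input(std::move(input)) {}

    void switch_to(State state) { m_state = state; }

    Token next_token()
    {
        if (m_pos >= m_input.size())
            return { Token::Type::EndOfFile, "" };
        std::string_view rest = std::string_view(m_input).substr(m_pos);

        if (m_state == State::RCDATA) {
            // In RCDATA, only the matching end tag counts as markup.
            constexpr std::string_view end_tag = "</textarea>";
            if (rest.substr(0, end_tag.size()) == end_tag) {
                m_pos += end_tag.size();
                m_state = State::Data;
                return { Token::Type::Tag, "/textarea" };
            }
            return { Token::Type::Character, std::string(1, m_input[m_pos++]) };
        }

        // Data state: "<name>" is emitted as a tag token.
        if (rest.front() == '<') {
            auto close = rest.find('>');
            if (close != std::string_view::npos) {
                std::string name(rest.substr(1, close - 1));
                m_pos += close + 1;
                return { Token::Type::Tag, name };
            }
        }
        return { Token::Type::Character, std::string(1, m_input[m_pos++]) };
    }

private:
    std::string m_input;
    size_t m_pos = 0;
    State m_state = State::Data;
};

// Called right after a "textarea" start tag; returns the element's text.
std::string parse_textarea_contents(ToyTokenizer& tokenizer)
{
    // Switch state *before* fetching the lookahead token, so that the
    // lookahead itself is tokenized as RCDATA.
    tokenizer.switch_to(State::RCDATA);

    std::string text;
    Token token = tokenizer.next_token();
    // Ignore a single leading LF.
    if (token.type == Token::Type::Character && token.data == "\n")
        token = tokenizer.next_token();

    while (token.type == Token::Type::Character) {
        text += token.data;
        token = tokenizer.next_token();
    }
    return text;
}
```

With the switch placed after the lookahead fetch, an input like "<i>y</textarea>" would have its first token produced in the Data state, so "<i>" would come out as a tag instead of text.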
Now that we flush characters in a single place, we can call the Text's
children_changed() from there instead of having a goofy targeted hack
for <style> elements. :^)
Instead of appending character-at-a-time, we now buffer character
insertions in a StringBuilder, and flush them to the relevant node
whenever we start inserting into a new node, and when parsing ends.
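
The buffering scheme can be sketched roughly like this. It is a standalone approximation: std::string stands in for StringBuilder, and TextNode and CharacterBuffer are hypothetical names, not the parser's actual types.

```cpp
#include <string>

// Hypothetical stand-in for a DOM text node.
struct TextNode {
    std::string data;
    int flush_count = 0; // how many times buffered text was written to this node
};

class CharacterBuffer {
public:
    void insert_character(TextNode& node, char ch)
    {
        // Starting to insert into a different node: flush what we have first.
        if (m_target != &node) {
            flush();
            m_target = &node;
        }
        m_pending += ch;
    }

    // Also meant to be called once when parsing ends.
    void flush()
    {
        if (!m_target || m_pending.empty())
            return;
        m_target->data += m_pending;
        // Single flush point: a natural place for a change notification.
        ++m_target->flush_count;
        m_pending.clear();
    }

private:
    TextNode* m_target = nullptr;
    std::string m_pending; // stand-in for StringBuilder
};
```

Each node receives its text in one append per run of characters rather than one append per character.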
Last night I tried making a little test page that had a bunch of <img>
elements and nothing else. It didn't work.
Fix this by correctly adding a synthesized <html> element to the
document if we get something else in the "before html" insertion mode.
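
As a rough sketch of that fix (ToyTreeBuilder and Element are hypothetical, and only start tags are modelled), the "anything else" branch creates the <html> element itself and then asks for the token to be reprocessed in the next mode:

```cpp
#include <memory>
#include <string>
#include <utility>
#include <vector>

struct Element {
    std::string tag_name;
    std::vector<std::unique_ptr<Element>> children;
};

enum class InsertionMode { BeforeHTML, BeforeHead };

// Hypothetical, heavily simplified tree builder.
struct ToyTreeBuilder {
    Element document { "#document", {} };
    std::vector<Element*> open_elements;
    InsertionMode mode = InsertionMode::BeforeHTML;

    // Handles a start tag in the "before html" insertion mode.
    // Returns true if the token must be reprocessed in the new mode.
    bool handle_before_html(const std::string& start_tag_name)
    {
        // Either the page gave us <html>, or we synthesize one.
        auto html = std::make_unique<Element>();
        html->tag_name = "html";
        open_elements.push_back(html.get());
        document.children.push_back(std::move(html));
        mode = InsertionMode::BeforeHead;

        // An explicit <html> tag is consumed here; anything else
        // (e.g. <img>) still has to be handled by the next mode.
        return start_tag_name != "html";
    }
};
```

A page consisting only of <img> elements then ends up with a root element to hang them under.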
You can still run the old parser with "br -O", but the new one is good
enough to be the default parser now. We'll fix issues as we go and
eventually remove the old one completely. :^)
This patch adds two script lists to Document:
- Scripts to execute when parsing has finished
- Scripts to execute as soon as possible
Since we don't actually load scripts asynchronously yet (we just do a
synchronous load when parsing the <script> element for simplicity),
these are already loaded by the time we get to "The end" of parsing.
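
A minimal sketch of the two lists, assuming every script is already loaded when "The end" runs. ToyDocument and Script are hypothetical names, and running the "parsing has finished" list before the "as soon as possible" list is an assumption of this sketch rather than something stated above.

```cpp
#include <functional>
#include <string>
#include <utility>
#include <vector>

struct Script {
    std::string name;
    std::function<void()> run;
};

// Hypothetical stand-in for Document, reduced to the two script lists.
class ToyDocument {
public:
    void add_script_to_execute_when_parsing_has_finished(Script script)
    {
        m_when_parsing_has_finished.push_back(std::move(script));
    }

    void add_script_to_execute_as_soon_as_possible(Script script)
    {
        m_as_soon_as_possible.push_back(std::move(script));
    }

    // "The end" of parsing: scripts are already loaded, so just run them.
    void the_end()
    {
        // Take each list before running it, so that it is left empty.
        for (auto& script : std::exchange(m_when_parsing_has_finished, {}))
            script.run();
        for (auto& script : std::exchange(m_as_soon_as_possible, {}))
            script.run();
    }

private:
    std::vector<Script> m_when_parsing_has_finished;
    std::vector<Script> m_as_soon_as_possible;
};
```

Once loading becomes asynchronous, the second loop would have to wait for scripts that are still in flight instead of running everything immediately.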
If we have an <a> element on the list of active formatting elements
when hitting another "a" start tag, that's a parse error. Recover by
using the adoption agency algorithm (AAA).
I've been using this in the new HTML parser and it makes it much easier
to understand the state of unfinished code branches.
TODO() is for places where it's okay to end up but we need to implement
something there.
ASSERT_NOT_REACHED() is for places where it's not okay to end up, and
something has gone wrong.
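
For illustration, the two markers might be used like this. The macro definitions below are a guess at the general shape (print a location, then abort), not the actual definitions, and crash_with_marker and insertion_mode_name are hypothetical.

```cpp
#include <cstdio>
#include <cstdlib>
#include <string_view>

// Hypothetical helper: report which marker was hit, then abort.
[[noreturn]] inline void crash_with_marker(const char* marker, const char* file, int line)
{
    std::fprintf(stderr, "%s reached at %s:%d\n", marker, file, line);
    std::abort();
}

// Assumed shape of the two macros; the real ones may differ.
#define TODO() crash_with_marker("TODO", __FILE__, __LINE__)
#define ASSERT_NOT_REACHED() crash_with_marker("ASSERT_NOT_REACHED", __FILE__, __LINE__)

enum class InsertionMode { Initial, BeforeHTML, InFrameset };

std::string_view insertion_mode_name(InsertionMode mode)
{
    switch (mode) {
    case InsertionMode::Initial:
        return "initial";
    case InsertionMode::BeforeHTML:
        return "before html";
    case InsertionMode::InFrameset:
        // A legitimate state that simply isn't written yet.
        TODO();
    }
    // Every enumerator is handled above, so ending up here is a bug.
    ASSERT_NOT_REACHED();
}
```

Both stop the program, but the crash message tells you whether you hit unfinished work or a broken invariant.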
We still don't handle non-ASCII input correctly, but at least now we'll
convert e.g. ISO-8859-1 to UTF-8 before starting to tokenize.
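
As an example of what such a conversion involves, here is a small standalone sketch for ISO-8859-1 specifically. latin1_to_utf8 is a hypothetical helper, not the decoder the parser actually uses; it relies on the fact that every ISO-8859-1 byte value equals its Unicode code point.

```cpp
#include <string>
#include <string_view>

// Hypothetical helper: converts ISO-8859-1 bytes to UTF-8.
std::string latin1_to_utf8(std::string_view input)
{
    std::string output;
    output.reserve(input.size());
    for (unsigned char byte : input) {
        if (byte < 0x80) {
            // ASCII range is identical in both encodings.
            output += static_cast<char>(byte);
        } else {
            // U+0080..U+00FF needs two UTF-8 bytes: 110xxxxx 10xxxxxx.
            output += static_cast<char>(0xC0 | (byte >> 6));
            output += static_cast<char>(0x80 | (byte & 0x3F));
        }
    }
    return output;
}
```

For instance, the single byte 0xE9 ("é" in ISO-8859-1) becomes the two bytes 0xC3 0xA9.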
This patch also makes "view source" work with the new parser. :^)
This is enough to parse the Google front page! (Note: I did have to
hack the tokenizer while parsing Google, in order to avoid named
character references screwing everything up. We'll fix that too soon
enough!)