Commit graph

13 commits

Author SHA1 Message Date
Andreas Kling
b59f4632d5 LibWeb: Unbreak character reference and DOCTYPE parsing post-UTF-8
Oops, these were still using the byte-offset cursor. My goodness is it
unergonomic to index into UTF-8 strings, but Dr. Bugaev says it's good.

There is lots of room for improvement here. Just like the rest of the
tokenizer and parser. We'll have to do a few optimization passes over
them once they mature.
2020-06-04 22:09:36 +02:00
Andreas Kling
b6288163f1 LibWeb: Make the new HTML parser parse input as UTF-8
We already convert the input to UTF-8 before starting the tokenizer,
so all this patch had to do was switch the tokenizer to use a Utf8View
for its input (and to emit 32-bit codepoints.)
2020-06-04 21:12:17 +02:00
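As a rough illustration of what reading input as UTF-8 and emitting 32-bit codepoints involves, here is a minimal standalone sketch. It is not the Utf8View class the commit refers to: decode_utf8 is a hypothetical helper, it assumes substituting U+FFFD for malformed bytes is acceptable, and it does not reject overlong encodings or surrogates.

```cpp
#include <cstddef>
#include <string_view>
#include <vector>

// Hypothetical helper (not the real Utf8View): decodes UTF-8 bytes into
// 32-bit code points. Malformed sequences become U+FFFD. Overlong encodings
// and surrogates are NOT rejected in this sketch.
std::vector<char32_t> decode_utf8(std::string_view bytes)
{
    constexpr char32_t replacement = 0xFFFD;
    std::vector<char32_t> code_points;
    std::size_t i = 0;
    while (i < bytes.size()) {
        auto lead = static_cast<unsigned char>(bytes[i]);
        std::size_t length = 0;
        char32_t code_point = 0;
        // The lead byte determines the sequence length and carries the high bits.
        if (lead < 0x80) {
            length = 1;
            code_point = lead;
        } else if ((lead & 0xE0) == 0xC0) {
            length = 2;
            code_point = lead & 0x1F;
        } else if ((lead & 0xF0) == 0xE0) {
            length = 3;
            code_point = lead & 0x0F;
        } else if ((lead & 0xF8) == 0xF0) {
            length = 4;
            code_point = lead & 0x07;
        } else {
            // Stray continuation byte or invalid lead byte: skip one byte.
            code_points.push_back(replacement);
            ++i;
            continue;
        }
        if (i + length > bytes.size()) {
            // Sequence is truncated at the end of the input.
            code_points.push_back(replacement);
            break;
        }
        bool well_formed = true;
        for (std::size_t k = 1; k < length; ++k) {
            auto continuation = static_cast<unsigned char>(bytes[i + k]);
            if ((continuation & 0xC0) != 0x80) {
                well_formed = false;
                break;
            }
            // Each continuation byte contributes six more bits.
            code_point = (code_point << 6) | (continuation & 0x3F);
        }
        if (!well_formed) {
            code_points.push_back(replacement);
            ++i;
            continue;
        }
        code_points.push_back(code_point);
        i += length;
    }
    return code_points;
}
```

With a decoder like this, a tokenizer's cursor advances by whole code points rather than by bytes, which is the distinction the later character reference and DOCTYPE fix is about.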
Andreas Kling
5e53c45113 LibWeb: Plumb content encoding into the new HTML parser
We still don't handle non-ASCII input correctly, but at least now we'll
convert e.g. ISO-8859-1 to UTF-8 before starting to tokenize.
This patch also makes "view source" work with the new parser. :^)
2020-05-28 12:35:19 +02:00
Andreas Kling
4c9c6b3a7b LibWeb: Bring up basic external script execution in the new parser
This only works in some narrow cases, but should be enough for our own
welcome.html at least. :^)
2020-05-27 23:02:03 +02:00
Andreas Kling
a5ce09f8e3 LibWeb: Implement partial support for numeric character references
2020-05-27 18:30:27 +02:00
Andreas Kling
ecd25ce6c7 LibWeb: Allow HTML tokenizer to emit more than one token
Tokens are now put on a queue when emitted, and we always pop from that
queue when returning from next_token().
2020-05-26 15:50:05 +02:00
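The queue-then-pop arrangement described in this commit message can be sketched roughly as follows. This is a toy under stated assumptions, not the LibWeb code: Token, QueueingTokenizer and step() are hypothetical names, and the "state machine" here just emits one token per character plus an extra "EOF" token when the last character is consumed.

```cpp
#include <cstddef>
#include <deque>
#include <optional>
#include <string>
#include <utility>

// Hypothetical stand-in for a token; a real HTML token carries much more.
struct Token {
    std::string text;
};

// Toy tokenizer (names are illustrative, not from LibWeb).
class QueueingTokenizer {
public:
    explicit QueueingTokenizer(std::string input)
        : m_input(std::move(input))
    {
    }

    // Always pops from the queue. The state machine only runs when the
    // queue is empty and there is still input left to consume.
    std::optional<Token> next_token()
    {
        while (m_queue.empty() && m_cursor < m_input.size())
            step();
        if (m_queue.empty())
            return std::nullopt;
        Token token = std::move(m_queue.front());
        m_queue.pop_front();
        return token;
    }

private:
    // A single step may emit more than one token: consuming the final
    // character emits both that character and an "EOF" token.
    void step()
    {
        m_queue.push_back(Token { std::string(1, m_input[m_cursor]) });
        ++m_cursor;
        if (m_cursor == m_input.size())
            m_queue.push_back(Token { "EOF" });
    }

    std::string m_input;
    std::size_t m_cursor { 0 };
    std::deque<Token> m_queue;
};
```

Because callers only ever see the front of the queue, a step that emits two tokens needs no special handling on the consumer side.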
Andreas Kling
556a6eea61 LibWeb: Checking for "DOCTYPE" should be case insensitive in tokenizer
2020-05-25 19:51:23 +02:00
Andreas Kling
20911efd4d LibWeb: More work on the HTML parser and tokenizer
The parser can now switch the state of the tokenizer! Very webby. :^)
2020-05-24 23:54:22 +02:00
Andreas Kling
96cc1138c0 LibWeb: Remove tokenizer's premature character buffering optimization
2020-05-24 23:54:22 +02:00
Andreas Kling
e44c87cfff LibWeb: Implement enough HTML parsing to handle a small simple DOM :^)
We can now parse a little DOM like this:

<!DOCTYPE html>
<html>
    <head></head>
    <body>
        <div></div>
    </body>
</html>

This is pretty slow work, but the incremental progress is satisfying!
2020-05-24 00:49:22 +02:00
Andreas Kling
fd1b31d0ff LibWeb: Start building the tree building part of the new HTML parser
This patch adds a new HTMLDocumentParser class. It keeps a tokenizer
object internally and feeds itself with one token at a time from it.

The names and idioms in this class are expressed as closely to the
actual HTML parsing spec as possible, to make development as easy
and bug free as possible. :^)

This is going to become pretty large, but it's pretty cool!
2020-05-24 00:14:23 +02:00
Andreas Kling
7be36366be LibWeb: Emit character/comment tokens lazily to accumulate more data
Instead of emitting data-bearing tokens immediately, do it lazily at
the next state change. This allows us to accumulate full bursts of
text in between tags instead of having one token per character. :^)
2020-05-23 18:44:32 +02:00
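The lazy emission idea from this commit message, sketched as a self-contained toy. The function name tokenize_lazily and the two-state text/tag model are assumptions made for illustration; the real tokenizer has many more states and token types.

```cpp
#include <string>
#include <vector>

// Hypothetical sketch: accumulates characters in a pending buffer and only
// emits a token when the state changes (text -> tag or tag -> text), so a
// run of text becomes one token instead of one token per character.
std::vector<std::string> tokenize_lazily(const std::string& input)
{
    std::vector<std::string> tokens;
    std::string pending;
    bool in_tag = false;

    // Emits whatever has been accumulated so far, if anything.
    auto flush = [&] {
        if (!pending.empty()) {
            tokens.push_back(pending);
            pending.clear();
        }
    };

    for (char c : input) {
        if (!in_tag && c == '<') {
            flush(); // state change: the text run ends here
            in_tag = true;
            pending.push_back(c);
        } else if (in_tag && c == '>') {
            pending.push_back(c);
            flush(); // state change: the tag ends here
            in_tag = false;
        } else {
            pending.push_back(c);
        }
    }
    flush(); // emit any trailing text at end of input
    return tokens;
}
```

For the input `<b>hi there</b>` this yields three tokens: `<b>`, `hi there`, and `</b>`.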
Andreas Kling
272b35d2e1 LibWeb: Begin work on a spec-compliant HTML parser
In order to actually view the web as it is, we're gonna need a proper
HTML parser. So let's build one!

This patch introduces the Web::HTMLTokenizer class, which currently
operates on a StringView input stream where it fetches (ASCII only atm)
codepoints and tokenizes according to the HTML spec tokenization algo.

The tokenizer state machine looks a bit weird but is written in a way
that tries to mimic the spec as closely as possible, in order to make
development easier and bugs less likely.

This initial version is far from finished, but it can parse a trivial
document with a DOCTYPE and open/close tags. :^)
2020-05-22 21:46:13 +02:00