Commit graph

11 commits

Author SHA1 Message Date
Andreas Kling
96cc1138c0 LibWeb: Remove tokenizer's premature character buffering optimization 2020-05-24 23:54:22 +02:00
Emanuele Torre
3f2158bbfe LibWeb: HtmlTokenizer.cpp: fix ON_WHITESPACE macro
The "audible bell" character ('\a' U+0007) was treated as whitespace
while the "line feed" character ('\n' U+000a) was not.

'\a' is no longer considered whitespace.
'\n' is now considered whitespace.
2020-05-24 09:47:28 +02:00
Andreas Kling
e44c87cfff LibWeb: Implement enough HTML parsing to handle a small simple DOM :^)
We can now parse a little DOM like this:

<!DOCTYPE html>
<html>
    <head></head>
    <body>
        <div></div>
    </body>
</html>

This is pretty slow work, but the incremental progress is satisfying!
2020-05-24 00:49:22 +02:00
Andreas Kling
fd1b31d0ff LibWeb: Start building the tree building part of the new HTML parser
This patch adds a new HTMLDocumentParser class. It keeps a tokenizer
object internally and feeds itself with one token at a time from it.

The names and idioms in this class are expressed as closely to the
actual HTML parsing spec as possible, to make development as easy
and bug free as possible. :^)

This is going to become pretty large, but it's pretty cool!
2020-05-24 00:14:23 +02:00
Andreas Kling
e45c8b842c LibWeb: Implement a bit more of DOCTYPE tokenization 2020-05-23 21:08:25 +02:00
Andreas Kling
7be36366be LibWeb: Emit character/comment tokens lazily to accumulate more data
Instead of emitting data-bearing tokens immediately, do it lazily at
the next state change. This allows us to accumulate full bursts of
text in between tags instead of having one token per character. :^)
2020-05-23 18:44:32 +02:00
Andreas Kling
45450c7edc LibWeb: Make BEGIN_STATE and END_STATE include some {{{ and }}}
This makes it a compile error to omit the END_STATE. Also add some more
missing END_STATE's exposed by this (nice!)

Thanks to @predmond for suggesting the multi-pair trick! :^)
2020-05-23 15:25:43 +02:00
Andreas Kling
2e4147d0fc LibWeb: Add missing END_STATE for TagName
Fixes #2339.
2020-05-23 10:33:23 +02:00
Andreas Kling
a58500fdc5 LibWeb: Teach HTMLTokenizer how to tokenize comments
We can now correctly tokenize the welcome.html test page. :^)
2020-05-23 01:54:26 +02:00
Andreas Kling
6caa5661f3 LibWeb: Teach HTMLTokenizer how to tokenize attributes
Properly tokenize single-quoted, double-quoted and unquoted attributes!
2020-05-23 01:22:15 +02:00
Andreas Kling
272b35d2e1 LibWeb: Begin work on a spec-compliant HTML parser
In order to actually view the web as it is, we're gonna need a proper
HTML parser. So let's build one!

This patch introduces the Web::HTMLTokenizer class, which currently
operates on a StringView input stream where it fetches (ASCII only atm)
codepoints and tokenizes acccording to the HTML spec tokenization algo.

The tokenizer state machine looks a bit weird but is written in a way
that tries to mimic the spec as closely as possible, in order to make
development easier and bugs less likely.

This initial version is far from finished, but it can parse a trivial
document with a DOCTYPE and open/close tags. :^)
2020-05-22 21:46:13 +02:00