Andreas Kling
cc4109c03b
LibWeb: Move the HTML parser into HTML/Parser/
2020-07-28 19:23:18 +02:00
Andreas Kling
c46439f240
LibWeb: Move HTML classes into the Web::HTML namespace
2020-07-28 18:55:48 +02:00
Luke
19d6884529
LibWeb: Implement quirks mode detection
...
This allows us to determine which mode to render the page in.
Exposes "doctype" and "compatMode" on Document.
Exposes "name", "publicId" and "systemId" on DocumentType.
2020-07-21 01:08:32 +02:00
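For reference, the "compatMode" value mentioned in 19d6884529 can be sketched as a mapping from the document's mode to a string. This is a minimal illustration, not LibWeb's code: the `QuirksMode` enum and the `compat_mode` function are hypothetical names. The two returned strings are the ones the DOM standard defines for `Document.compatMode`.

```cpp
#include <string_view>

// Hypothetical enum for illustration; the names are not taken from LibWeb.
enum class QuirksMode {
    No,
    Limited,
    Yes,
};

// Document.compatMode reports "BackCompat" only in full quirks mode;
// both no-quirks and limited-quirks documents report "CSS1Compat".
constexpr std::string_view compat_mode(QuirksMode mode)
{
    return mode == QuirksMode::Yes ? "BackCompat" : "CSS1Compat";
}
```

A caller would pass whatever mode the parser settled on after seeing (or not seeing) the DOCTYPE token.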
Luke
2df69317f1
LibWeb: Implement almost all missing tokenizer cases
2020-06-28 16:56:26 +02:00
Kevin Meyer
22b20c381f
LibWeb: Implement remaining missing tokenizer EOF cases
2020-06-27 13:27:10 +02:00
Andreas Kling
8e6522d034
LibWeb: Implement some missing tokenizer cases for EOF handling
2020-06-26 22:47:07 +02:00
Andreas Kling
c33d17d363
LibWeb: Fix tokenization of attributes with URL query strings in them
...
<a href="/foo&amp=bar"> was being tokenized into <a href="/foo&=bar">.
The spec mentions this but I had overlooked it. The bug happens because
we interpreted the "&amp" as a named character reference.
2020-06-23 16:45:01 +02:00
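The rule behind c33d17d363 can be sketched as a single predicate. The HTML tokenizer spec says that a named character reference matched inside an attribute value, without a trailing semicolon, and followed by "=" or an ASCII alphanumeric, is left as literal text for historical reasons. The function and parameter names below are hypothetical and only illustrate that condition; they are not LibWeb's.

```cpp
#include <optional>

constexpr bool is_ascii_alphanumeric(char32_t c)
{
    return (c >= U'0' && c <= U'9') || (c >= U'A' && c <= U'Z') || (c >= U'a' && c <= U'z');
}

// Returns true when the matched name must be flushed as literal text
// instead of being replaced by the character it names.
// All three parameters are assumptions about what the caller tracks.
bool should_flush_as_literal_text(
    bool consumed_as_part_of_attribute,
    bool last_matched_is_semicolon,
    std::optional<char32_t> next_input_character)
{
    if (!consumed_as_part_of_attribute || last_matched_is_semicolon)
        return false;
    if (!next_input_character.has_value())
        return false; // End of input: the reference is expanded as usual.
    return *next_input_character == U'=' || is_ascii_alphanumeric(*next_input_character);
}
```

Assuming an attribute value such as /foo&amp=bar, the matched name lacks a semicolon and is followed by "=", so the predicate returns true and the text stays untouched.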
stelar7
5eb39a5f61
LibWeb: Update parser with more insertion modes :^)
...
Implements handling of InHeadNoScript, InSelectInTable, InTemplate,
InFrameset, AfterFrameset, and AfterAfterFrameset.
2020-06-21 10:13:31 +02:00
Luke
a1838f676e
LibWeb: Implement all CDATA tokenizer states
...
Even though we haven't implemented any switches to these states yet,
we may as well have them ready for when we do implement the switches.
2020-06-14 13:47:19 +02:00
Luke
821312729a
LibWeb: Fully implement all DOCTYPE tokenizer states
...
Also fixes TagOpen having a separate emit and reconsume in
ANYTHING_ELSE.
2020-06-14 13:47:19 +02:00
Luke
ab1df177d8
LibWeb: Fully implement all comment tokenizer states
2020-06-14 13:47:19 +02:00
Andreas Kling
47df0cbbc8
LibWeb: Fix broken tokenization of hexadecimal character references
...
We were interpreting 'A'-'F' as decimal digits which didn't work right.
2020-06-13 13:46:12 +02:00
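The class of bug fixed in 47df0cbbc8 is easy to see in a small sketch: 'A'-'F' and 'a'-'f' need their own offset of 10 rather than being treated like '0'-'9'. Both helpers below are hypothetical, operate on ASCII input for brevity, and do not guard against overflow on very long digit runs.

```cpp
#include <cstdint>
#include <optional>
#include <string_view>

// Value of one hexadecimal digit, or std::nullopt if it is not a hex digit.
std::optional<uint32_t> hex_digit_value(char32_t code_point)
{
    if (code_point >= U'0' && code_point <= U'9')
        return code_point - U'0';
    if (code_point >= U'A' && code_point <= U'F')
        return code_point - U'A' + 10; // Letters start at 10, not 0.
    if (code_point >= U'a' && code_point <= U'f')
        return code_point - U'a' + 10;
    return std::nullopt;
}

// Accumulates the digits of a hexadecimal character reference,
// e.g. the "41" in "&#x41;". Returns std::nullopt on empty or invalid input.
std::optional<uint32_t> parse_hex_reference_digits(std::string_view digits)
{
    if (digits.empty())
        return std::nullopt;
    uint32_t code = 0;
    for (char c : digits) {
        auto value = hex_digit_value(static_cast<unsigned char>(c));
        if (!value.has_value())
            return std::nullopt;
        code = code * 16 + *value;
    }
    return code;
}
```

With this, the digits "41" yield 0x41 and "1F600" yields 0x1F600, while a non-hex character such as "G" is rejected.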
Andreas Kling
ab4c03ce2d
LibWeb: Fix tokenizer swallowing an extra token after a named entity
2020-06-07 19:09:03 +02:00
Luke
61d5bec739
LibWeb: Fully implement all script tokenizer states
...
Also fixes RAWTEXTLessThanSign having a separate emit and reconsume.
2020-06-06 09:55:15 +02:00
Andreas Kling
4e71684a3a
LibWeb: Fix missing tokenizer state change in RCDATALessThanSign
...
We can't RECONSUME_IN after we've used EMIT_CHARACTER since we'll have
returned from the function.
2020-06-05 12:02:30 +02:00
Andreas Kling
b59f4632d5
LibWeb: Unbreak character reference and DOCTYPE parsing post-UTF-8
...
Oops, these were still using the byte-offset cursor. My goodness is it
unergonomic to index into UTF-8 strings, but Dr. Bugaev says it's good.
There is lots of room for improvement here. Just like the rest of the
tokenizer and parser. We'll have to do a few optimization passes over
them once they mature.
2020-06-04 22:09:36 +02:00
Andreas Kling
b6288163f1
LibWeb: Make the new HTML parser parse input as UTF-8
...
We already convert the input to UTF-8 before starting the tokenizer,
so all this patch had to do was switch the tokenizer to use a Utf8View
for its input (and to emit 32-bit code points).
2020-06-04 21:12:17 +02:00
Andreas Kling
19190267a6
LibWeb: Fix incorrectly consumed characters after reference tokens
...
The NumericCharacterReferenceEnd tokenizer state should not advance
the input stream.
2020-06-04 16:49:21 +02:00
Andreas Kling
ca33bc7895
LibWeb: Fix tokenization of tags with empty attributes
...
We were neglecting to emit start tags for tags where the last attribute
had no value.
Also fix a parse error TODO that I hit while looking at this.
2020-06-04 12:00:09 +02:00
Andreas Kling
a3936f10eb
LibWeb: Fix tokenizing scripts with '<' in them
...
The EMIT_CHARACTER_AND_RECONSUME_IN was emitting the current token
instead of the specified codepoint.
2020-06-02 14:27:53 +02:00
Andreas Kling
77a3710e9d
LibWeb: Tokenize "anything else" in CommentLessThanSignBangDashDash
2020-06-01 20:14:23 +02:00
Andreas Kling
db93db8100
LibWeb: Put whining about tokenizer errors behind an #ifdef
...
Real web content has *tons* of tokenizer errors and we don't need to
complain every time as that makes the debug log unbearable.
2020-06-01 18:46:11 +02:00
Andreas Kling
a775c2c717
LibWeb: Handle more cases in the SelfClosingStartTag tokenizer state
2020-06-01 18:46:11 +02:00
Andreas Kling
f3b09ddd8e
LibWeb: Implement more of the ScriptDataEndTagName tokenizer state
...
Some of this is extremely repetitive. We'll need to rethink how we
do queue/emit to improve this.
2020-05-30 23:00:35 +02:00
Andreas Kling
756829555a
LibWeb: Parse "textarea" tags during the "in body" insertion mode
...
Had to handle some more cases in the tokenizer to support this.
2020-05-30 18:40:23 +02:00
Andreas Kling
c9dd459822
LibWeb: Implement some more RAWTEXT stuff in the tokenizer
2020-05-30 17:47:50 +02:00
TheDumpap
d92c9d3772
LibWeb: Implement more of the tokenizer states
...
Slowly adding more unimplemented options for tokenizer states.
2020-05-30 17:47:50 +02:00
Andreas Kling
62885b5646
LibWeb: Fix accidental swallow of self-closing tag tokens
...
Instead of dropping self-closing tags on the floor, we now emit them
into the token stream. :^)
2020-05-30 11:31:49 +02:00
Andreas Kling
851a0f983a
LibWeb: Tokenizing a semicolon-less HTML entity is (just a) parse error
...
No need to blow chunks over this.
2020-05-30 11:31:49 +02:00
Andreas Kling
1ef5d609d9
AK+LibC: Add TODO() as an alternative to ASSERT_NOT_REACHED()
...
I've been using this in the new HTML parser and it makes it much easier
to understand the state of unfinished code branches.
TODO() is for places where it's okay to end up but we need to implement
something there.
ASSERT_NOT_REACHED() is for places where it's not okay to end up, and
something has gone wrong.
2020-05-30 11:31:49 +02:00
Andreas Kling
bb2f22577b
LibWeb: Implement a bunch more script-related tokenization states
2020-05-28 18:44:17 +02:00
Andreas Kling
5e53c45113
LibWeb: Plumb content encoding into the new HTML parser
...
We still don't handle non-ASCII input correctly, but at least now we'll
convert e.g. ISO-8859-1 to UTF-8 before starting to tokenize.
This patch also makes "view source" work with the new parser. :^)
2020-05-28 12:35:19 +02:00
Andreas Kling
5c35f3c9ba
LibWeb: Support named character references (e.g. "&amp;")
2020-05-28 11:44:19 +02:00
Andreas Kling
39b5494aeb
LibWeb: Implement the "after attribute name" tokenizer state
...
One little step at a time towards parsing the monster blob of HTML we
get from twitter.com :^)
2020-05-27 18:30:29 +02:00
Andreas Kling
1de29e3f59
LibWeb: Implement the "self closing start tag" tokenizer state
2020-05-27 18:30:29 +02:00
Andreas Kling
a5ce09f8e3
LibWeb: Implement partial support for numeric character references
2020-05-27 18:30:27 +02:00
Andreas Kling
ecd25ce6c7
LibWeb: Allow HTML tokenizer to emit more than one token
...
Tokens are now put on a queue when emitted, and we always pop from that
queue when returning from next_token().
2020-05-26 15:50:05 +02:00
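The queueing scheme described in ecd25ce6c7 might look roughly like the following. This is a sketch under assumptions: `Token`, `QueueingTokenizer`, and `emit` are hypothetical stand-ins, and the real next_token() also drives the state machine when the queue is empty, which is omitted here.

```cpp
#include <deque>
#include <optional>
#include <string>
#include <utility>

// Hypothetical token type; a real token carries a kind, attributes, etc.
struct Token {
    std::string data;
};

class QueueingTokenizer {
public:
    // A state handler may call emit() several times for one input step.
    void emit(Token token) { m_queued_tokens.push_back(std::move(token)); }

    // Always hands out the oldest queued token, so emission order is kept.
    std::optional<Token> next_token()
    {
        if (m_queued_tokens.empty())
            return std::nullopt;
        Token token = std::move(m_queued_tokens.front());
        m_queued_tokens.pop_front();
        return token;
    }

private:
    std::deque<Token> m_queued_tokens;
};
```

Emitting two tokens and then calling next_token() three times returns them in emission order, followed by an empty optional.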
Andreas Kling
406fd95f32
LibWeb: Flesh out the remaining DOCTYPE related tokenizer states
...
We can now parse public and system identifiers! Not super useful, but
at least we can do it :^)
2020-05-25 19:51:23 +02:00
Andreas Kling
556a6eea61
LibWeb: Checking for "DOCTYPE" should be case insensitive in tokenizer
2020-05-25 19:51:23 +02:00
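The case-insensitive "DOCTYPE" check from 556a6eea61 amounts to an ASCII-only comparison. A minimal sketch, with hypothetical helper names rather than the ones LibWeb uses:

```cpp
#include <cstddef>
#include <string_view>

constexpr char to_ascii_lowercase(char c)
{
    return (c >= 'A' && c <= 'Z') ? static_cast<char>(c - 'A' + 'a') : c;
}

// Compares two strings while ignoring ASCII case only,
// so "doctype", "DocType" and "DOCTYPE" all match each other.
bool equals_ignoring_ascii_case(std::string_view a, std::string_view b)
{
    if (a.size() != b.size())
        return false;
    for (std::size_t i = 0; i < a.size(); ++i) {
        if (to_ascii_lowercase(a[i]) != to_ascii_lowercase(b[i]))
            return false;
    }
    return true;
}
```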
Andreas Kling
45da08a1e6
LibWeb: A whole bunch of work towards spec-compliant <script> elements
...
This is still very unfinished, but there's at least a skeleton of code.
2020-05-24 23:54:22 +02:00
Andreas Kling
5d332c1f11
LibWeb: Parse enough to handle a <style> inside a <head> :^)
2020-05-24 23:54:22 +02:00
Andreas Kling
20911efd4d
LibWeb: More work on the HTML parser and tokenizer
...
The parser can now switch the state of the tokenizer! Very webby. :^)
2020-05-24 23:54:22 +02:00
Andreas Kling
96cc1138c0
LibWeb: Remove tokenizer's premature character buffering optimization
2020-05-24 23:54:22 +02:00
Emanuele Torre
3f2158bbfe
LibWeb: HtmlTokenizer.cpp: fix ON_WHITESPACE macro
...
The "audible bell" character ('\a' U+0007) was treated as whitespace
while the "line feed" character ('\n' U+000a) was not.
'\a' is no longer considered whitespace.
'\n' is now considered whitespace.
2020-05-24 09:47:28 +02:00
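The corrected whitespace set from 3f2158bbfe can be written as a plain predicate instead of a macro. The function name is hypothetical; the four characters are the ones the HTML tokenizer states treat as whitespace (tab, line feed, form feed, space).

```cpp
// True for the characters the tokenizer states treat as whitespace:
// U+0009 TAB, U+000A LINE FEED, U+000C FORM FEED, U+0020 SPACE.
// Notably, U+0007 ('\a', the audible bell) is not in the set.
constexpr bool is_tokenizer_whitespace(char32_t code_point)
{
    return code_point == U'\t'
        || code_point == U'\n'
        || code_point == U'\f'
        || code_point == U' ';
}
```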
Andreas Kling
e44c87cfff
LibWeb: Implement enough HTML parsing to handle a small simple DOM :^)
...
We can now parse a little DOM like this:
<!DOCTYPE html>
<html>
<head></head>
<body>
<div></div>
</body>
</html>
This is pretty slow work, but the incremental progress is satisfying!
2020-05-24 00:49:22 +02:00
Andreas Kling
fd1b31d0ff
LibWeb: Start building the tree building part of the new HTML parser
...
This patch adds a new HTMLDocumentParser class. It keeps a tokenizer
object internally and feeds itself with one token at a time from it.
The names and idioms in this class are expressed as closely to the
actual HTML parsing spec as possible, to make development as easy
and bug free as possible. :^)
This is going to become pretty large, but it's pretty cool!
2020-05-24 00:14:23 +02:00
Andreas Kling
e45c8b842c
LibWeb: Implement a bit more of DOCTYPE tokenization
2020-05-23 21:08:25 +02:00
Andreas Kling
7be36366be
LibWeb: Emit character/comment tokens lazily to accumulate more data
...
Instead of emitting data-bearing tokens immediately, do it lazily at
the next state change. This allows us to accumulate full bursts of
text in between tags instead of having one token per character. :^)
2020-05-23 18:44:32 +02:00
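The lazy emission idea in 7be36366be can be illustrated with a deliberately simplified splitter: text is collected in a pending buffer and flushed only when the state changes at "<", so a run of characters becomes one token. The function below is a hypothetical toy, not the tokenizer's state machine; it treats anything between "<" and ">" as a tag.

```cpp
#include <string>
#include <string_view>
#include <vector>

// Splits input into text runs and "<...>" tags. Pending text is flushed
// only at a state change, so "hello" is one token rather than five.
std::vector<std::string> tokenize_with_lazy_text(std::string_view input)
{
    std::vector<std::string> tokens;
    std::string pending;
    bool in_tag = false;

    for (char c : input) {
        if (!in_tag && c == '<') {
            // State change: flush the accumulated text first.
            if (!pending.empty()) {
                tokens.push_back(pending);
                pending.clear();
            }
            in_tag = true;
            pending.push_back(c);
        } else if (in_tag && c == '>') {
            pending.push_back(c);
            tokens.push_back(pending);
            pending.clear();
            in_tag = false;
        } else {
            pending.push_back(c);
        }
    }
    // Flush whatever is left at end of input.
    if (!pending.empty())
        tokens.push_back(pending);
    return tokens;
}
```

For the input "<p>hello</p>" this produces three tokens: "<p>", "hello", and "</p>".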
Andreas Kling
45450c7edc
LibWeb: Make BEGIN_STATE and END_STATE include some {{{ and }}}
...
This makes it a compile error to omit the END_STATE. Also add some more
missing END_STATE's exposed by this (nice!)
Thanks to @predmond for suggesting the multi-pair trick! :^)
2020-05-23 15:25:43 +02:00
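The brace trick in 45450c7edc can be sketched as follows. Each macro carries unbalanced braces, so only a matching BEGIN_STATE/END_STATE pair balances them, and leaving out an END_STATE leaves the braces unbalanced and breaks the build. The macro bodies, the `State` enum, and `handle` below are illustrative assumptions, not the actual LibWeb definitions.

```cpp
enum class State {
    Data,
    TagOpen,
};

// BEGIN_STATE opens three braces; END_STATE closes the same three.
#define BEGIN_STATE(state) \
    case State::state: { \
        { \
            {

#define END_STATE \
            } \
        } \
        break; \
    }

// Toy dispatcher: each state body sits between a BEGIN/END pair.
int handle(State state)
{
    int result = 0;
    switch (state) {
        BEGIN_STATE(Data)
        result = 1;
        END_STATE

        BEGIN_STATE(TagOpen)
        result = 2;
        END_STATE
    }
    return result;
}
```

Deleting either END_STATE line from `handle` makes the function fail to compile, which is the property the commit is after.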
Andreas Kling
2e4147d0fc
LibWeb: Add missing END_STATE for TagName
...
Fixes #2339.
2020-05-23 10:33:23 +02:00