The spec for each of these state:
-> EOF:
This is an eof-in-comment parse error. Emit the current comment
token. Emit an end-of-file token.
We were neglecting to emit the current comment token before emitting an
EOF token. Note the existing EMIT_CURRENT_TOKEN macro was unused.
If a call to `document.write` inserts an incomplete HTML tag, e.g.:
document.write("<p");
we would previously continue parsing the document until we reached a
closing angle bracket. However, the spec states we should stop once we
reach the new insertion point.
This commit un-deprecates DeprecatedString, and repurposes it as a byte
string.
As the null state has already been removed, there are no other
particularly hairy blockers in repurposing this type as a byte string
(what it _really_ is).
This commit is auto-generated:
$ xs=$(ack -l \bDeprecatedString\b\|deprecated_string AK Userland \
Meta Ports Ladybird Tests Kernel)
$ perl -pie 's/\bDeprecatedString\b/ByteString/g;
s/deprecated_string/byte_string/g' $xs
$ clang-format --style=file -i \
$(git diff --name-only | grep \.cpp\|\.h)
$ gn format $(git ls-files '*.gn' '*.gni')
This required dealing with a *lot* of fallout, but it's all basically
just switching from DeprecatedFlyString to either FlyString or
Optional<FlyString> in a hundred places to accommodate the change.
Which pretty much needs to be done together due to the amount of places
where they are compared together.
This also involves porting over StackOfOpenElements over to FlyString
from DeprecatedFly string to prevent a gazillion calls to
`.to_deprecated_fly_string` calls in HTMLParser.
We currently track the [line, column] position of every HTMLToken, as
this is what is needed for LibGUI's syntax highlighting. Some non-LibGUI
purposes (e.g. highlighting HTML with HTML) require a byte offset. Track
both during tokenization.
We were previously setting the end position of attribute names in self-
closing HTML tags to the end of the attribute value. To illustrate the
previous behavior, consider this tag and its attribute's start and end
positions (shown inclusively below):
<meta charset="UTF-8" />
^ name start
^ value start
^ value end
^ name end
Rather than setting the end position of the attribute name when we parse
the closing slash, ensure the end position is already set while we are
in the AttributeName state. We now have:
<meta charset="UTF-8" />
^ name start
^ name end
^ value start
^ value end
The tokenizer unit test has been extended to test these positions.
To illustrate the previous behavior, consider these tags and their start
and end positions (shown inclusively below):
Start tag: End tag:
<span> </span>
^ start ^ start
^end ^end
The start position of a tag is the first ASCII-alpha code point after
the opening brace. The start position of a close tag is the slash just
before the first ASCII-alpha code point. And the end position of both
is the closing brace. So the opening brace is not included in the
emitted tag, but the closing brace is. And the end tag including the
slash is an oddity that had to be worked around in its only use case
(syntax highlighting).
We now consistently exclude the braces from the emitted tag, and also
exclude the slash from the end tag, so that it does not need to be
accounted for in syntax highlighting. That is, we now have:
Start tag: End tag:
<span> </span>
^ start ^ start
^end ^end
The tokenizer unit test has been extended to test these positions.
Having an alias function that only wraps another one is silly, and
keeping the more obvious name should flush out more uses of deprecated
strings.
No behavior change.
This will make it easier to support both string types at the same time
while we convert code, and tracking down remaining uses.
One big exception is Value::to_string() in LibJS, where the name is
dictated by the ToString AO.
We have a new, improved string type coming up in AK (OOM aware, no null
state), and while it's going to use UTF-8, the name UTF8String is a
mouthful - so let's free up the String name by renaming the existing
class.
Making the old one have an annoying name will hopefully also help with
quick adoption :^)
Once we know that the current code point is an ASCII character, we can
just check if it's one of the HTML whitespace characters.
Before this patch, we were using the generic StringView::contains(u32)
path that splats a code point into a StringBuilder and then searches
for it with memmem().
This reduces time spent in the HTML tokenizer from 16% to 6% when
loading the ECMA-262 spec.
Each of these strings would previously rely on StringView's char const*
constructor overload, which would call __builtin_strlen on the string.
Since we now have operator ""sv, we can replace these with much simpler
versions. This opens the door to being able to remove
StringView(char const*).
No functional changes.
This implements basic support for dynamic markup insertion, adding
* Document::open()
* Document::write(Vector<String> const&)
* Document::writeln(Vector<String> const&)
* Document::close()
The HTMLParser is modified to make it possible to create a
script-created parser which initially only contains a HTMLTokenizer
without any data. Aditionally the HTMLParser::run method gains an
overload which does not modify the Document and does not run
HTMLParser::the_end() so that we can reenter the parser at a later time.
Furthermore all FIXMEs that consern the insertion point are implemented
wich is defined in the HTMLTokenizer. Additionally the following
member-variables of the HTMLParser are now exposed by getter funcions:
* m_tokenizer
* m_aborted
* m_script_nesting_level
The HTMLTokenizer is modified so that it contains an insertion
point which keeps track of where the next input from the Document::write
functions will be inserted. The insertion point is implemented as the
charakter offset into m_decoded_input and a boolean describing if the
insertion point is defined. Functions to update, check and {re}store the
insertion point are also added.
The function HTMLTokenizer::insert_eof is added to tell a script-created
parser that document::close was called and HTMLParser::the_end() should
be called.
Lastly an explicit default constructor is added to HTMLTokenizer to
create a empty HTMLTokenizer into which data can be inserted.
Also, update the expected hash in the LibWeb TestHTMLTokenizer
regression test.
This is due to the "This comment has a few too many dashes." comment
token being updated.
Newline normalization will replace \r and \r\n with \n.
The spec specifically states
> Before the tokenization stage, the input stream must be preprocessed
> by normalizing newlines.
wheras this is implemented the processing during the tokenization
itself.
This should still exhibit the same behaviour, while keeping the
tokenization logic in the same place.
In 'NamedCharacterReference' we attempt to lookup the code point by a
identifier, eg apos; becomes '
This is done by passing the entire rest of the document to the
`HTML::code_points_from_entity` function.
However, before this change we didn't sent the final character which
meant if the document ended in a named character reference the lookup
would fail.
Emitting tokens on EOF caused an infinite loop, freezing the app, which
could be a bit annoying when writing an HTML comment at the end of
the file in Text Editor. :^)
Commit b193351a99 caused the HTML comments to flash when changing
the text cursor. Also, when double-clicking on a comment, the selection
started from the beginning of the file instead.
The following message was displaying when `TOKENIZER_TRACE_DEBUG`
was enabled:
(Tokenizer::nth_last_position) Invalid position requested: 4th-last
of 4. Returning (0-0).
Changing the `nth_last_position` to 3 fixes this. I'm guessing that's
because the parser is at that moment on the second hyphen of the `<!--`
string, so it has to go back only by three characters.
The difference should be between m_utf8_iterator and the
the new position, if m_prev_utf8_iterator is used one fewer
source position is popped than required.
This issue was not apparent on most pages since restore_to
used for tokens such <!doctype> that are normally
followed by a newline that resets the column to zero,
but it can be seen on pages with minified HTML.
This is in preparation for an upcoming storage change of HTMLToken. In
contrast to the other token types, the accessor can hand out a mutable
reference to allow users to change parts of the DoctypeData easily.
Previously, HTMLToken would expose the Vector<Attribute> directly to
its users. In preparation for a future change, all users now use
implementation-agnostic APIs which do not expose the Vector directly.
This fixes parsing the following regular expression: /</g;
It also adds a simple script element to the HTMLTokenizer regression
test, which also contains that specific regex.