Instead of constructing a Tokenizer and then calling parse() on it, we
now call `Tokenizer::tokenize(...)` directly. (Renamed from `parse()`
because this is a Tokenizer, not a Parser.)
This will make it easier to support both string types at the same time
while we convert code, and tracking down remaining uses.
One big exception is Value::to_string() in LibJS, where the name is
dictated by the ToString AO.
We have a new, improved string type coming up in AK (OOM aware, no null
state), and while it's going to use UTF-8, the name UTF8String is a
mouthful - so let's free up the String name by renaming the existing
class.
Making the old one have an annoying name will hopefully also help with
quick adoption :^)
This could potentially be sped up by tracking the up to three different
ranges of characters known to be digits. This would save the double
parser from checking whether these are digits and because it has the
size it can use the fast parsing method.
Each of these strings would previously rely on StringView's char const*
constructor overload, which would call __builtin_strlen on the string.
Since we now have operator ""sv, we can replace these with much simpler
versions. This opens the door to being able to remove
StringView(char const*).
No functional changes.
Dimension tokens don't make use of the m_value string for anything else,
so we can sneak the unit string in there.
- Token goes from 72 to 64 bytes
- StyleComponentValueRule goes from 80 to 72 bytes
These three are all integers - we just repeatedly multiply them by 10
and then add a digit - so using an integer here is both faster and more
accurate. :^)
`static const` variables can be computed and initialized at run-time
during initialization or the first time a function is called. Change
them to `static constexpr` to ensure they are computed at
compile-time.
This allows some removal of `strlen` because the length of the
`StringView` can be used which is pre-computed at compile-time.
Raw whitespace is not allowed inside a name, but escaped whitespace is,
for example `\9`, which is the tab character.
This stops yakzz.com from crashing the Browser, since it was using `\9`
in various places as a hack to only apply those properties to IE8/9.
This means we can get rid of the hacks where we were peeking a code
point instead of getting the next one so that we could peek_twin()
later. Now, we follow the spec more closely. :^)
This fixes the crash that Luke found using Domato:
```css
. foo {
mso-border-alt: solid .-1pt;
}
```
The spec distinguishes between "If the next 3 code points would
start..." and "If the input stream starts with..." but we were treating
them the same way, skipping the first code point in the process.
These correspond to "If the input stream starts with..." in the spec,
which up until now we were not handling correctly, which led to some fun
bugs.
As noted, reconsuming the input code point in order to read its value is
hacky, but works. Keeping track of the current code point in Tokenizer
would be nicer, when I'm feeling brave enough to mess with it!
This was specifically causing the string "0" to be parsed as an invalid
Dimension token with no units, instead of as a Number. That then caused
out generated `property_initial_value()` function to fail for those
values.
Previously, the loops would stop before reaching EOF, meaning that the
values that should have been set to EOF were left with their 0 initial
values. Now, we initialize to EOFs instead. The if/else inside the loops
always ran the else branch so I have removed the if branches.
Also added css-import.html, which tests the 3 syntax variations on
`@import` statements. Note that the optional media-query parameter to
`@import` is not handled yet.
Calling is_valid_escape_sequence() with no arguments hides what it
is operating on, so I have removed that, so that you must explicitly
tell it what you are testing.
The call from consume_a_token() was using the wrong tokens, so it
returned false incorrectly. This was resulting in corrupted output
when faced with this code from Acid2. (Abbreviated)
```css
.parser { error: \}; }
.parser { }
```
The entry points for CSS parsing in the spec are defined as accepting
any of a stream of Tokens, or a stream of ComponentValues, or a String.
TokenStream is an attempt to reduce the duplication of code for that.
The end goal here is to make the two classes mostly interchangeable, as
the CSS spec requires that the various parser algorithms can take a
stream of either class, and we want to have that functionality without
needing to duplicate all of the code.
Optional seems like a good idea, but in many places we were not
checking if it had a value, which was causing crashes when the
Tokenizer was given malformed input. Using an EOF value along with
is_eof() makes things a lot simpler.
This replaces ctype.h with CharacterType.h everywhere I could find
issues with narrowing conversions. While using it will probably make
sense almost everywhere in the future, the most critical places should
have been addressed.