Commit graph

59 commits

Author SHA1 Message Date
Sam Atkins
89c5f25016 LibWeb/CSS: Remove tiny-oom propagation from CSS Tokenizer 2024-07-26 17:29:20 +02:00
Andreas Kling
dba6216caa LibWeb: Skip CSS tokenizer filtering when string has no '\r' or '\f'
When loading a canned version of reddit.com, we end up parsing many many
shadow tree style sheets of roughly ~170 KiB text each.

None of them have '\r' or '\f', yet we spend 2-3 ms for each sheet just
looping over and reconstructing the text to see if we need to normalize
any newlines.

This patch makes the common case faster in two ways:

- We use TextCodec::Decoder::to_utf8() instead of process()
  This way, we do a one-shot fast validation and conversion to UTF-8,
  instead of using the generic code-point-at-a-time callback API.

- We scan for '\r' and '\f' before filtering, and if neither is present,
  we simply use the unfiltered string.

With these changes, we now spend 0 ms in the filtering function for the
vast majority of style sheets I've seen so far.
2024-07-20 15:35:30 +02:00
Andreas Kling
8d7a1e5654 LibWeb: Skip some redundant UTF-8 validation in CSS tokenizer
If we're just adding code points to a StringBuilder, there's no need to
revalidate the result.
2024-03-24 13:28:24 +01:00
Shannon Booth
e2e7c4d574 Everywhere: Use to_number<T> instead of to_{int,uint,float,double}
In a bunch of cases, this actually ends up simplifying the code as
to_number will handle something such as:

```
Optional<I> opt;
if constexpr (IsSigned<I>)
    opt = view.to_int<I>();
else
    opt = view.to_uint<I>();
```

For us.

The main goal here however is to have a single generic number conversion
API between all of the String classes.
2023-12-23 20:41:07 +01:00
Sam Atkins
1a5533e528 LibWeb: Tokenize CSS numbers as doubles
Every later stage uses doubles, so dropping that precision right at the
start of parsing is a little silly. :^)
2023-08-20 14:25:18 +01:00
Sam Atkins
c138845013 LibWeb: Store the original representation of CSS tokens
This is required for the `<urange>` type, and custom properties, to work
correctly, as both need to know exactly what the original text was.
2023-03-22 19:45:40 +01:00
Sam Atkins
a3d6d9db37 LibWeb: Correct logic when consuming a CSS number in scientific notation
Before, we were classifying the number as a "number" type if it had an
"E", even if that was not followed by an exponent.
2023-03-22 19:45:40 +01:00
Sam Atkins
84af8dd9ed LibWeb: Propagate errors from CSS Tokenizer 2023-03-07 00:43:36 +01:00
Sam Atkins
17618989a3 LibWeb: Propagate errors from CSS Tokenizer construction
Instead of constructing a Tokenizer and then calling parse() on it, we
now call `Tokenizer::tokenize(...)` directly. (Renamed from `parse()`
because this is a Tokenizer, not a Parser.)
2023-03-07 00:43:36 +01:00
Sam Atkins
2db168acc1 LibTextCodec+Everywhere: Port Decoders to new Strings 2023-02-19 17:15:47 +01:00
Sam Atkins
f2a9426885 LibTextCodec+Everywhere: Return Optional<Decoder&> from decoder_for() 2023-02-19 17:15:47 +01:00
Sam Atkins
3685a8813a LibWeb: Port CSS Tokenizer to new Strings
Specifically, this uses FlyString, because the data gets held long-term
as a FlyString anyway.
2023-02-15 12:48:26 -05:00
Sam Atkins
8af65108e4 LibWeb: Construct CSS Tokenizer and Parser with a StringView encoding
This doesn't need to be a full (Deprecated)String, so let's not force it
to be.
2023-02-15 12:48:26 -05:00
Sam Atkins
7fc72d3838 LibWeb: Convert CSS Token value to new FlyString 2023-02-13 14:35:40 +00:00
Linus Groh
57dc179b1f Everywhere: Rename to_{string => deprecated_string}() where applicable
This will make it easier to support both string types at the same time
while we convert code, and tracking down remaining uses.

One big exception is Value::to_string() in LibJS, where the name is
dictated by the ToString AO.
2022-12-06 08:54:33 +01:00
Linus Groh
6e19ab2bbc AK+Everywhere: Rename String to DeprecatedString
We have a new, improved string type coming up in AK (OOM aware, no null
state), and while it's going to use UTF-8, the name UTF8String is a
mouthful - so let's free up the String name by renaming the existing
class.
Making the old one have an annoying name will hopefully also help with
quick adoption :^)
2022-12-06 08:54:33 +01:00
davidot
8abd4f6102 LibWeb: Make the CSS parser use the new double parser
This could potentially be sped up by tracking the up to three different
ranges of characters known to be digits. This would save the double
parser from checking whether these are digits and because it has the
size it can use the fast parsing method.
2022-10-23 15:48:45 +02:00
Sam Atkins
164094e161 LibWeb: Bring CSS tokenization preprocessing closer to spec
This is based on an editorial change in the December 2021 version of
SYNTAX-3: https://www.w3.org/TR/2021/CRD-css-syntax-3-20211224/

They named this step "filter code points", so let's use that name.
2022-10-03 17:09:41 +01:00
Sam Atkins
97e174afcd LibWeb: Use the term "ident sequence" instead of "name"
This is an editorial change in the December 2021 version of SYNTAX-3:
https://www.w3.org/TR/2021/CRD-css-syntax-3-20211224/
2022-10-03 17:09:41 +01:00
sin-ack
3f3f45580a Everywhere: Add sv suffix to strings relying on StringView(char const*)
Each of these strings would previously rely on StringView's char const*
constructor overload, which would call __builtin_strlen on the string.
Since we now have operator ""sv, we can replace these with much simpler
versions. This opens the door to being able to remove
StringView(char const*).

No functional changes.
2022-07-12 23:11:35 +02:00
stelar7
cd73d5c1d0 LibWeb: Add missing preprocessing step to the css tokenizer 2022-05-08 16:29:46 +02:00
Sam Atkins
bf786d66b1 LibWeb: Move Token and Tokenizer into Parser namespace 2022-04-12 23:03:46 +02:00
Idan Horowitz
086969277e Everywhere: Run clang-format 2022-04-01 21:24:45 +01:00
Sam Atkins
13e1232d79 LibWeb: Remove separate Token::m_unit field
Dimension tokens don't make use of the m_value string for anything else,
so we can sneak the unit string in there.

- Token goes from 72 to 64 bytes
- StyleComponentValueRule goes from 80 to 72 bytes
2022-03-22 15:47:36 +01:00
Sam Atkins
fe372cd073 LibWeb: Use CSS::Number for Token numeric values 2022-03-22 15:47:36 +01:00
Sam Atkins
0795b9f7bb LibWeb: Use floats instead of doubles for CSS numbers
Using doubles isn't necessary, and they make things slightly bigger and
slower, so let's use floats instead.
2022-03-22 15:47:36 +01:00
Sam Atkins
1f5b5d3f99 LibWeb: Use intermediate ints when converting strings to numbers in CSS
These three are all integers - we just repeatedly multiply them by 10
and then add a digit - so using an integer here is both faster and more
accurate. :^)
2022-03-22 15:47:36 +01:00
Karol Kosek
fd235d8a06 LibWeb: Don't put a backslash after escape sequences in text-like tokens
Previously, a string token like '\41' would be tokenized to 'A\'. This
could be seen on Wikipedia headlines.
2022-03-19 13:10:00 -07:00
Lenny Maiorani
f912a48315 Userland: Change static const variables to static constexpr
`static const` variables can be computed and initialized at run-time
during initialization or the first time a function is called. Change
them to `static constexpr` to ensure they are computed at
compile-time.

This allows some removal of `strlen` because the length of the
`StringView` can be used which is pre-computed at compile-time.
2022-03-18 19:58:57 +01:00
Sam Atkins
2a7a8d2cab LibWeb: Don't verify that a dimension unit isn't whitespace
Raw whitespace is not allowed inside a name, but escaped whitespace is,
for example `\9`, which is the tab character.

This stops yakzz.com from crashing the Browser, since it was using `\9`
in various places as a hack to only apply those properties to IE8/9.
2022-02-02 18:29:05 +01:00
Sam Atkins
5d0851cb0e LibWeb: Use start_of_input_stream_twin() for is_valid_escape_sequence()
This means we can get rid of the hacks where we were peeking a code
point instead of getting the next one so that we could peek_twin()
later. Now, we follow the spec more closely. :^)
2021-12-27 22:56:08 +01:00
Sam Atkins
269a24d4ca LibWeb: Pass correct values to would_start_an_identifier()
Same as with would_start_a_number(), we were skipping a code point.
2021-12-27 22:56:08 +01:00
Sam Atkins
bb82ee5530 LibWeb: Pass correct values to would_start_a_number()
This fixes the crash that Luke found using Domato:
```css
. foo {
    mso-border-alt: solid  .-1pt;
}
```

The spec distinguishes between "If the next 3 code points would
start..." and "If the input stream starts with..." but we were treating
them the same way, skipping the first code point in the process.
2021-12-27 22:56:08 +01:00
Sam Atkins
981badb45f LibWeb: Add CSS::Tokenizer::start_of_input_stream_[twin|triplet]()
These correspond to "If the input stream starts with..." in the spec,
which up until now we were not handling correctly, which led to some fun
bugs.

As noted, reconsuming the input code point in order to read its value is
hacky, but works. Keeping track of the current code point in Tokenizer
would be nicer, when I'm feeling brave enough to mess with it!
2021-12-27 22:56:08 +01:00
Sam Atkins
85e5586a27 LibWeb: Add spec comments to CSS Tokenizer
Some of the code has been slightly rearranged to match the spec order,
but otherwise I've tried not to mess with it.
2021-11-19 22:35:05 +01:00
Sam Atkins
9403cc42f9 LibWeb: Convert CSS Token::m_value from StringBuilder to FlyString
Again, this value does not change once we have finished creating the
Token, so it can be more lightweight.
2021-11-19 22:35:05 +01:00
Sam Atkins
75e7c2c5c0 LibWeb: Convert CSS Token::m_unit from StringBuilder to FlyString
This value doesn't change once it's assigned to the Token, so it can be
more lightweight than a StringBuilder.
2021-11-19 22:35:05 +01:00
Sam Atkins
f6869797a7 LibWeb: Convert numeric tokens to numbers in CSS Tokenizer
The spec wants us to produce numeric values as the Tokenizer sees them,
rather than waiting until the parse stage. This is a first step towards
that.
2021-11-19 22:35:05 +01:00
Andreas Kling
8b1108e485 Everywhere: Pass AK::StringView by value 2021-11-11 01:27:46 +01:00
Sam Atkins
ecf5368535 LibWeb: Record position information in CSS Tokens
This is a requirement to be able to use the Tokens for syntax
highlighting.
2021-10-23 19:07:44 +02:00
Sam Atkins
9a2eecaca4 LibWeb: Add CSS Tokenizer::consume_as_much_whitespace_as_possible()
This is a step in the spec in 3 places, and we had it implemented
differently in each one. This unifies them and makes it clearer what
we're doing.
2021-10-23 19:07:44 +02:00
Sam Atkins
dfbdc20f87 LibWeb: Add spec links to CSS Tokenizer
Also renamed `starts_with_a_number()` -> `would_start_a_number()` to
better match spec terminology.
2021-10-23 19:07:44 +02:00
Sam Atkins
bb1cc99750 LibWeb: Stop treating EOF as a valid part of an identifier
This was specifically causing the string "0" to be parsed as an invalid
Dimension token with no units, instead of as a Number. That then caused
out generated `property_initial_value()` function to fail for those
values.
2021-09-17 23:06:45 +02:00
sin-ack
d9900ece2f LibWeb: Preprocess the CSS stream in the Tokenizer
This commit implements the input preprocessing algorithm that CSS Syntax
Module Level 3 defines.
2021-08-30 00:08:40 +02:00
Sam Atkins
74c9587798 LibWeb: Fix EOF handling in CSS Tokenizer peek_{twin,triplet}()
Previously, the loops would stop before reaching EOF, meaning that the
values that should have been set to EOF were left with their 0 initial
values. Now, we initialize to EOFs instead. The if/else inside the loops
always ran the else branch so I have removed the if branches.
2021-08-04 19:04:12 +04:30
Sam Atkins
e54531244f LibWeb: Define proper debug symbols for CSS Parser and Tokenizer
You can now turn debug logging for them on using `CSS_PARSER_DEBUG` and
`CSS_TOKENIZER_DEBUG`.
2021-07-31 00:18:11 +02:00
Sam Atkins
7439fbd896 LibWeb: Get CSS @import rules working in new parser
Also added css-import.html, which tests the 3 syntax variations on
`@import` statements. Note that the optional media-query parameter to
`@import` is not handled yet.
2021-07-31 00:18:11 +02:00
Sam Atkins
c249fbd17c LibWeb: Correct escape handling in CSS Tokenizer
Calling is_valid_escape_sequence() with no arguments hides what it
is operating on, so I have removed that, so that you must explicitly
tell it what you are testing.

The call from consume_a_token() was using the wrong tokens, so it
returned false incorrectly. This was resulting in corrupted output
when faced with this code from Acid2. (Abbreviated)

```css
.parser { error: \}; }
.parser { }
```
2021-07-11 23:19:56 +02:00
Sam Atkins
b7116711bf LibWeb: Add TokenStream class to CSS Parser
The entry points for CSS parsing in the spec are defined as accepting
any of a stream of Tokens, or a stream of ComponentValues, or a String.
TokenStream is an attempt to reduce the duplication of code for that.
2021-07-11 23:19:56 +02:00
Sam Atkins
6c03123b2d LibWeb: Give CSS Token and StyleComponentValueRule matching is() funcs
The end goal here is to make the two classes mostly interchangeable, as
the CSS spec requires that the various parser algorithms can take a
stream of either class, and we want to have that functionality without
needing to duplicate all of the code.
2021-07-11 23:19:56 +02:00