0ct0pu5/ladybird

Author	SHA1	Message	Date
Sam Atkins	89c5f25016	LibWeb/CSS: Remove tiny-oom propagation from CSS Tokenizer	2024-07-26 17:29:20 +02:00
Andreas Kling	dba6216caa	LibWeb: Skip CSS tokenizer filtering when string has no '\r' or '\f' When loading a canned version of reddit.com, we end up parsing many many shadow tree style sheets of roughly ~170 KiB text each. None of them have '\r' or '\f', yet we spend 2-3 ms for each sheet just looping over and reconstructing the text to see if we need to normalize any newlines. This patch makes the common case faster in two ways: - We use TextCodec::Decoder::to_utf8() instead of process() This way, we do a one-shot fast validation and conversion to UTF-8, instead of using the generic code-point-at-a-time callback API. - We scan for '\r' and '\f' before filtering, and if neither is present, we simply use the unfiltered string. With these changes, we now spend 0 ms in the filtering function for the vast majority of style sheets I've seen so far.	2024-07-20 15:35:30 +02:00
Andreas Kling	8d7a1e5654	LibWeb: Skip some redundant UTF-8 validation in CSS tokenizer If we're just adding code points to a StringBuilder, there's no need to revalidate the result.	2024-03-24 13:28:24 +01:00
Shannon Booth	e2e7c4d574	Everywhere: Use to_number<T> instead of to_{int,uint,float,double} In a bunch of cases, this actually ends up simplifying the code as to_number will handle something such as: ``` Optional<I> opt; if constexpr (IsSigned<I>) opt = view.to_int<I>(); else opt = view.to_uint<I>(); ``` For us. The main goal here however is to have a single generic number conversion API between all of the String classes.	2023-12-23 20:41:07 +01:00
Sam Atkins	1a5533e528	LibWeb: Tokenize CSS numbers as doubles Every later stage uses doubles, so dropping that precision right at the start of parsing is a little silly. :^)	2023-08-20 14:25:18 +01:00
Sam Atkins	c138845013	LibWeb: Store the original representation of CSS tokens This is required for the `<urange>` type, and custom properties, to work correctly, as both need to know exactly what the original text was.	2023-03-22 19:45:40 +01:00
Sam Atkins	a3d6d9db37	LibWeb: Correct logic when consuming a CSS number in scientific notation Before, we were classifying the number as a "number" type if it had an "E", even if that was not followed by an exponent.	2023-03-22 19:45:40 +01:00
Sam Atkins	84af8dd9ed	LibWeb: Propagate errors from CSS Tokenizer	2023-03-07 00:43:36 +01:00
Sam Atkins	17618989a3	LibWeb: Propagate errors from CSS Tokenizer construction Instead of constructing a Tokenizer and then calling parse() on it, we now call `Tokenizer::tokenize(...)` directly. (Renamed from `parse()` because this is a Tokenizer, not a Parser.)	2023-03-07 00:43:36 +01:00
Sam Atkins	2db168acc1	LibTextCodec+Everywhere: Port Decoders to new Strings	2023-02-19 17:15:47 +01:00
Sam Atkins	f2a9426885	LibTextCodec+Everywhere: Return Optional<Decoder&> from `decoder_for()`	2023-02-19 17:15:47 +01:00
Sam Atkins	3685a8813a	LibWeb: Port CSS Tokenizer to new Strings Specifically, this uses FlyString, because the data gets held long-term as a FlyString anyway.	2023-02-15 12:48:26 -05:00
Sam Atkins	8af65108e4	LibWeb: Construct CSS Tokenizer and Parser with a StringView encoding This doesn't need to be a full (Deprecated)String, so let's not force it to be.	2023-02-15 12:48:26 -05:00
Sam Atkins	7fc72d3838	LibWeb: Convert CSS Token value to new FlyString	2023-02-13 14:35:40 +00:00
Linus Groh	57dc179b1f	Everywhere: Rename to_{string => deprecated_string}() where applicable This will make it easier to support both string types at the same time while we convert code, and tracking down remaining uses. One big exception is Value::to_string() in LibJS, where the name is dictated by the ToString AO.	2022-12-06 08:54:33 +01:00
Linus Groh	6e19ab2bbc	AK+Everywhere: Rename String to DeprecatedString We have a new, improved string type coming up in AK (OOM aware, no null state), and while it's going to use UTF-8, the name UTF8String is a mouthful - so let's free up the String name by renaming the existing class. Making the old one have an annoying name will hopefully also help with quick adoption :^)	2022-12-06 08:54:33 +01:00
davidot	8abd4f6102	LibWeb: Make the CSS parser use the new double parser This could potentially be sped up by tracking the up to three different ranges of characters known to be digits. This would save the double parser from checking whether these are digits and because it has the size it can use the fast parsing method.	2022-10-23 15:48:45 +02:00
Sam Atkins	164094e161	LibWeb: Bring CSS tokenization preprocessing closer to spec This is based on an editorial change in the December 2021 version of SYNTAX-3: https://www.w3.org/TR/2021/CRD-css-syntax-3-20211224/ They named this step "filter code points", so let's use that name.	2022-10-03 17:09:41 +01:00
Sam Atkins	97e174afcd	LibWeb: Use the term "ident sequence" instead of "name" This is an editorial change in the December 2021 version of SYNTAX-3: https://www.w3.org/TR/2021/CRD-css-syntax-3-20211224/	2022-10-03 17:09:41 +01:00
sin-ack	3f3f45580a	Everywhere: Add sv suffix to strings relying on StringView(char const*) Each of these strings would previously rely on StringView's char const* constructor overload, which would call __builtin_strlen on the string. Since we now have operator ""sv, we can replace these with much simpler versions. This opens the door to being able to remove StringView(char const*). No functional changes.	2022-07-12 23:11:35 +02:00
stelar7	cd73d5c1d0	LibWeb: Add missing preprocessing step to the css tokenizer	2022-05-08 16:29:46 +02:00
Sam Atkins	bf786d66b1	LibWeb: Move Token and Tokenizer into Parser namespace	2022-04-12 23:03:46 +02:00
Idan Horowitz	086969277e	Everywhere: Run clang-format	2022-04-01 21:24:45 +01:00
Sam Atkins	13e1232d79	LibWeb: Remove separate Token::m_unit field Dimension tokens don't make use of the m_value string for anything else, so we can sneak the unit string in there. - Token goes from 72 to 64 bytes - StyleComponentValueRule goes from 80 to 72 bytes	2022-03-22 15:47:36 +01:00
Sam Atkins	fe372cd073	LibWeb: Use CSS::Number for Token numeric values	2022-03-22 15:47:36 +01:00
Sam Atkins	0795b9f7bb	LibWeb: Use floats instead of doubles for CSS numbers Using doubles isn't necessary, and they make things slightly bigger and slower, so let's use floats instead.	2022-03-22 15:47:36 +01:00
Sam Atkins	1f5b5d3f99	LibWeb: Use intermediate ints when converting strings to numbers in CSS These three are all integers - we just repeatedly multiply them by 10 and then add a digit - so using an integer here is both faster and more accurate. :^)	2022-03-22 15:47:36 +01:00
Karol Kosek	fd235d8a06	LibWeb: Don't put a backslash after escape sequences in text-like tokens Previously, a string token like '\41' would be tokenized to 'A\'. This could be seen on Wikipedia headlines.	2022-03-19 13:10:00 -07:00
Lenny Maiorani	f912a48315	Userland: Change static const variables to static constexpr `static const` variables can be computed and initialized at run-time during initialization or the first time a function is called. Change them to `static constexpr` to ensure they are computed at compile-time. This allows some removal of `strlen` because the length of the `StringView` can be used which is pre-computed at compile-time.	2022-03-18 19:58:57 +01:00
Sam Atkins	2a7a8d2cab	LibWeb: Don't verify that a dimension unit isn't whitespace Raw whitespace is not allowed inside a name, but escaped whitespace is, for example `\9`, which is the tab character. This stops yakzz.com from crashing the Browser, since it was using `\9` in various places as a hack to only apply those properties to IE8/9.	2022-02-02 18:29:05 +01:00
Sam Atkins	5d0851cb0e	LibWeb: Use start_of_input_stream_twin() for is_valid_escape_sequence() This means we can get rid of the hacks where we were peeking a code point instead of getting the next one so that we could peek_twin() later. Now, we follow the spec more closely. :^)	2021-12-27 22:56:08 +01:00
Sam Atkins	269a24d4ca	LibWeb: Pass correct values to would_start_an_identifier() Same as with would_start_a_number(), we were skipping a code point.	2021-12-27 22:56:08 +01:00
Sam Atkins	bb82ee5530	LibWeb: Pass correct values to would_start_a_number() This fixes the crash that Luke found using Domato: ```css . foo { mso-border-alt: solid .-1pt; } ``` The spec distinguishes between "If the next 3 code points would start..." and "If the input stream starts with..." but we were treating them the same way, skipping the first code point in the process.	2021-12-27 22:56:08 +01:00
Sam Atkins	981badb45f	LibWeb: Add CSS::Tokenizer::start_of_input_stream_[twin|triplet]() These correspond to "If the input stream starts with..." in the spec, which up until now we were not handling correctly, which led to some fun bugs. As noted, reconsuming the input code point in order to read its value is hacky, but works. Keeping track of the current code point in Tokenizer would be nicer, when I'm feeling brave enough to mess with it!	2021-12-27 22:56:08 +01:00
Sam Atkins	85e5586a27	LibWeb: Add spec comments to CSS Tokenizer Some of the code has been slightly rearranged to match the spec order, but otherwise I've tried not to mess with it.	2021-11-19 22:35:05 +01:00
Sam Atkins	9403cc42f9	LibWeb: Convert CSS Token::m_value from StringBuilder to FlyString Again, this value does not change once we have finished creating the Token, so it can be more lightweight.	2021-11-19 22:35:05 +01:00
Sam Atkins	75e7c2c5c0	LibWeb: Convert CSS Token::m_unit from StringBuilder to FlyString This value doesn't change once it's assigned to the Token, so it can be more lightweight than a StringBuilder.	2021-11-19 22:35:05 +01:00
Sam Atkins	f6869797a7	LibWeb: Convert numeric tokens to numbers in CSS Tokenizer The spec wants us to produce numeric values as the Tokenizer sees them, rather than waiting until the parse stage. This is a first step towards that.	2021-11-19 22:35:05 +01:00
Andreas Kling	8b1108e485	Everywhere: Pass AK::StringView by value	2021-11-11 01:27:46 +01:00
Sam Atkins	ecf5368535	LibWeb: Record position information in CSS Tokens This is a requirement to be able to use the Tokens for syntax highlighting.	2021-10-23 19:07:44 +02:00
Sam Atkins	9a2eecaca4	LibWeb: Add CSS Tokenizer::consume_as_much_whitespace_as_possible() This is a step in the spec in 3 places, and we had it implemented differently in each one. This unifies them and makes it clearer what we're doing.	2021-10-23 19:07:44 +02:00
Sam Atkins	dfbdc20f87	LibWeb: Add spec links to CSS Tokenizer Also renamed `starts_with_a_number()` -> `would_start_a_number()` to better match spec terminology.	2021-10-23 19:07:44 +02:00
Sam Atkins	bb1cc99750	LibWeb: Stop treating EOF as a valid part of an identifier This was specifically causing the string "0" to be parsed as an invalid Dimension token with no units, instead of as a Number. That then caused our generated `property_initial_value()` function to fail for those values.	2021-09-17 23:06:45 +02:00
sin-ack	d9900ece2f	LibWeb: Preprocess the CSS stream in the Tokenizer This commit implements the input preprocessing algorithm that CSS Syntax Module Level 3 defines.	2021-08-30 00:08:40 +02:00
Sam Atkins	74c9587798	LibWeb: Fix EOF handling in CSS Tokenizer peek_{twin,triplet}() Previously, the loops would stop before reaching EOF, meaning that the values that should have been set to EOF were left with their 0 initial values. Now, we initialize to EOFs instead. The if/else inside the loops always ran the else branch so I have removed the if branches.	2021-08-04 19:04:12 +04:30
Sam Atkins	e54531244f	LibWeb: Define proper debug symbols for CSS Parser and Tokenizer You can now turn debug logging for them on using `CSS_PARSER_DEBUG` and `CSS_TOKENIZER_DEBUG`.	2021-07-31 00:18:11 +02:00
Sam Atkins	7439fbd896	LibWeb: Get CSS @import rules working in new parser Also added css-import.html, which tests the 3 syntax variations on `@import` statements. Note that the optional media-query parameter to `@import` is not handled yet.	2021-07-31 00:18:11 +02:00
Sam Atkins	c249fbd17c	LibWeb: Correct escape handling in CSS Tokenizer Calling is_valid_escape_sequence() with no arguments hides what it is operating on, so I have removed that, so that you must explicitly tell it what you are testing. The call from consume_a_token() was using the wrong tokens, so it returned false incorrectly. This was resulting in corrupted output when faced with this code from Acid2. (Abbreviated) ```css .parser { error: \}; } .parser { } ```	2021-07-11 23:19:56 +02:00
Sam Atkins	b7116711bf	LibWeb: Add TokenStream class to CSS Parser The entry points for CSS parsing in the spec are defined as accepting any of a stream of Tokens, or a stream of ComponentValues, or a String. TokenStream is an attempt to reduce the duplication of code for that.	2021-07-11 23:19:56 +02:00
Sam Atkins	6c03123b2d	LibWeb: Give CSS Token and StyleComponentValueRule matching is() funcs The end goal here is to make the two classes mostly interchangeable, as the CSS spec requires that the various parser algorithms can take a stream of either class, and we want to have that functionality without needing to duplicate all of the code.	2021-07-11 23:19:56 +02:00


59 commits
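
For readers unfamiliar with the fast path described in commit dba6216caa, the idea can be sketched roughly as follows. This is an illustration only, not Ladybird's code: it uses the C++ standard library rather than AK types, `filter_code_points` here is a hypothetical free function, and it covers only the newline normalization the commit message mentions ('\r', '\f', and "\r\n" each become a single '\n'). The U+0000 and surrogate replacement that the CSS Syntax spec also performs in this step is left out.

```cpp
#include <cstddef>
#include <string>
#include <string_view>

// Hypothetical sketch, assuming the input is already valid UTF-8.
// Normalizes CSS newlines, skipping all work when the input has none to fix.
std::string filter_code_points(std::string_view input)
{
    // Fast path: with no '\r' or '\f' present, there is nothing to normalize,
    // so the text is used unfiltered.
    if (input.find_first_of("\r\f") == std::string_view::npos)
        return std::string(input);

    // Slow path: rebuild the text, normalizing newlines as we go.
    std::string output;
    output.reserve(input.size());
    for (std::size_t i = 0; i < input.size(); ++i) {
        char c = input[i];
        if (c == '\r') {
            // A "\r\n" pair collapses into a single '\n'.
            if (i + 1 < input.size() && input[i + 1] == '\n')
                ++i;
            output.push_back('\n');
        } else if (c == '\f') {
            output.push_back('\n');
        } else {
            output.push_back(c);
        }
    }
    return output;
}
```

Because '\r' and '\f' are single-byte ASCII values that never occur inside a multi-byte UTF-8 sequence, scanning bytes is enough for this check. Most style sheets should take the early return, which matches the commit's observation that the filtering cost drops to roughly nothing in the common case.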