beenull/ladybird

mirror of https://github.com/LadybirdBrowser/ladybird.git synced 2024-11-22 15:40:19 +00:00

Author	SHA1	Message	Date
Timothy Flynn	986ff984cc	LibUnicode: Replace code point general categories with ICU	2024-06-22 14:56:39 +02:00
Timothy Flynn	c804bda5fd	LibUnicode: Replace code point properties with ICU	2024-06-22 14:56:39 +02:00
Ali Mohammad Pur	27a38932da	LibRegex: Account for extra explicit And/Or in class parser assertion Fixes #23691.	2024-03-24 08:24:46 +01:00
Ali Mohammad Pur	e265d81277	LibRegex: Correct And/Or and inversion interplay semantics This commit also fixes an incorrect test case from very early on, our behaviour now matches the ECMA262 spec in this case. Fixes #21786.	2024-01-11 11:36:09 +01:00
Ali Mohammad Pur	267040dde7	LibRegex: Error out on Eof when parsing nonempty class range elements Fixes #22507.	2023-12-31 15:36:42 +01:00
Shannon Booth	e2e7c4d574	Everywhere: Use to_number<T> instead of to_{int,uint,float,double} In a bunch of cases, this actually ends up simplifying the code as to_number will handle something such as `Optional<I> opt; if constexpr (IsSigned<I>) opt = view.to_int<I>(); else opt = view.to_uint<I>();` for us. The main goal here however is to have a single generic number conversion API between all of the String classes.	2023-12-23 20:41:07 +01:00
Ali Mohammad Pur	5e1499d104	Everywhere: Rename {Deprecated => Byte}String This commit un-deprecates DeprecatedString, and repurposes it as a byte string. As the null state has already been removed, there are no other particularly hairy blockers in repurposing this type as a byte string (what it _really_ is). This commit is auto-generated: $ xs=$(ack -l \bDeprecatedString\b\|deprecated_string AK Userland Meta Ports Ladybird Tests Kernel) $ perl -pie 's/\bDeprecatedString\b/ByteString/g; s/deprecated_string/byte_string/g' $xs $ clang-format --style=file -i $(git diff --name-only | grep \.cpp\|\.h) $ gn format $(git ls-files '.gn' '.gni')	2023-12-17 18:25:10 +03:30
Ali Mohammad Pur	2d6f50932b	LibRegex: Assign unique serial IDs to checkpoints This makes the compiler assign a serial ID to each checkpoint instead of using the IP as the identifier. This will be used in a future commit to replace the backing store of checkpoints with a vector.	2023-07-14 08:59:19 +02:00
Timothy Flynn	8b668da9d5	LibRegex: Bail parsing class set characters upon early EOF Otherwise, we reach a skip() invocation at the end of this function, which crashes due to EOF. Caught by test262.	2023-06-23 20:22:45 +02:00
Ali Mohammad Pur	cdec23a68c	LibRegex: Treat \<ORD_CHAR> as unescaped in POSIX BRE/ERE This is undefined according to the spec, but glibc ignores the backslash and some applications seem to prefer this behaviour (e.g. sed).	2023-06-14 12:38:10 +02:00
Ali Mohammad Pur	eba466b8e7	LibRegex: Avoid calling GenericLexer::consume() past EOF The consume(size_t) overload consumes "at most" as many bytes as requested, but consume() consumes exactly one byte. This commit makes sure to avoid consuming past EOF. Fixes #18324. Fixes #18325.	2023-04-14 12:33:54 +02:00
Linus Groh	6e7459322d	AK: Remove StringBuilder::build() in favor of to_deprecated_string() Having an alias function that only wraps another one is silly, and keeping the more obvious name should flush out more uses of deprecated strings. No behavior change.	2023-01-27 20:38:49 +00:00
Timothy Flynn	f3db548a3d	AK+Everywhere: Rename FlyString to DeprecatedFlyString DeprecatedFlyString relies heavily on DeprecatedString's StringImpl, so let's rename it to A) match the name of DeprecatedString, B) write a new FlyString class that is tied to String.	2023-01-09 23:00:24 +00:00
Ben Wiederhake	8a331d4fa0	Everywhere: Move AK/Debug.h include to using files or remove	2023-01-02 20:27:20 -05:00
Linus Groh	6e19ab2bbc	AK+Everywhere: Rename String to DeprecatedString We have a new, improved string type coming up in AK (OOM aware, no null state), and while it's going to use UTF-8, the name UTF8String is a mouthful - so let's free up the String name by renaming the existing class. Making the old one have an annoying name will hopefully also help with quick adoption :^)	2022-12-06 08:54:33 +01:00
Linus Groh	babfc13c84	Everywhere: Remove 'clang-format off' comments that are no longer needed https://github.com/SerenityOS/serenity/pull/15654#issuecomment-1322554496	2022-12-03 23:52:23 +00:00
Ali Mohammad Pur	660d2b53b1	LibRegex: Account for eof after \<x> when 'x' leads to legacy behaviour	2022-09-12 16:03:57 +04:30
Ali Mohammad Pur	48442059fc	LibRegex: Consume exactly two chars for escaped characters We were previously consuming an extra char afterwards, which could be the charclass terminator, leading to possible OOB accesses.	2022-09-12 16:03:57 +04:30
Ali Mohammad Pur	598dc74a76	LibRegex: Partially implement the ECMAScript unicodeSets proposal This skips the new string unicode properties additions, along with \q{}.	2022-07-20 21:25:59 +01:00
Ali Mohammad Pur	7734914909	LibRegex: Refactor parsing 'CharacterEscape' out of 'AtomEscape' The ECMA262 spec has this as a separate production, and we need it to be split up for a future commit.	2022-07-20 21:25:59 +01:00
Ali Mohammad Pur	b908f9f6ef	LibRegex: Pass parse flags as a struct instead of multiple arguments	2022-07-20 21:25:59 +01:00
sin-ack	fbc771efe9	Everywhere: Use default StringView constructor over nullptr While null StringViews are just as bad, these prevent the removal of StringView(char const*) as that constructor accepts a nullptr. No functional changes.	2022-07-12 23:11:35 +02:00
sin-ack	3f3f45580a	Everywhere: Add sv suffix to strings relying on StringView(char const*) Each of these strings would previously rely on StringView's char const* constructor overload, which would call __builtin_strlen on the string. Since we now have operator ""sv, we can replace these with much simpler versions. This opens the door to being able to remove StringView(char const*). No functional changes.	2022-07-12 23:11:35 +02:00
Ali Mohammad Pur	97a333608e	LibRegex: Make codegen+optimisation for alternatives much faster Just a little thinking outside the box, and we can now parse and optimise a million copies of "a|" chained together in just a second :^)	2022-02-20 11:53:59 +01:00
Ali Mohammad Pur	4be7239626	LibRegex: Make parse_disjunction() consume all disjunctions in one frame This helps us not blow up when too many disjunctions are chained together in the regex we're parsing. Fixes #12615.	2022-02-20 11:53:59 +01:00
Ali Mohammad Pur	627bbee055	LibRegex: Allow quantifiers after quantifiable assertions While quantifying assertions is very much meaningless, the specification allows them with annex B's extended grammar for browsers, so read and apply the quantifiers. Fixes #12373.	2022-02-20 11:53:59 +01:00
Ali Mohammad Pur	5fac41f733	LibRegex: Implement ECMA262 multiline matching without splitting lines As ECMA262 regex allows `[^]` and literal newlines to match newlines in the input string, we shouldn't split the input string into lines, rather simply make boundaries and catchall patterns capable of checking for these conditions specifically.	2022-01-26 00:53:09 +03:30
Ali Mohammad Pur	c11be92e23	LibRegex: Implement an ECMA262 Regex quirk with negative lookarounds This implements the quirk defined by "Note 3" in section "Canonicalize" (https://tc39.es/ecma262/#sec-runtime-semantics-canonicalize-ch). Crosses off another quirk from #6042.	2022-01-21 18:14:08 +03:30
Hendiadyoin1	303af07df8	LibRegex: Use AK::any_of in Parser::lookahead_any Equivalent to std::ranges::any_of, which clang-tidy suggests.	2021-12-21 18:17:28 -08:00
Hendiadyoin1	ca69ded9a5	LibRegex: Collapse some `if(...) return true; else return false;` blocks	2021-12-21 18:17:28 -08:00
Hendiadyoin1	a2563496f5	LibRegex: Remove some else-after-returns	2021-12-21 18:17:28 -08:00
davidot	154ed3994c	LibRegex: Parse capture group names according to the ECMA262 spec	2021-12-21 14:04:23 +01:00
davidot	733a70671b	LibRegex: Disallow duplicate named capture groups in ECMA262 parser	2021-12-21 14:04:23 +01:00
Tim Schumacher	ff38062318	LibRegex: Correctly translate BRE pattern end anchors Previously we were always choosing the "nothing special" code path, even if the dollar symbol was at the end of the pattern (and therefore should have been considered special). Fix that by actually checking if the pattern end follows, and emitting the correct instruction if necessary.	2021-11-13 15:06:52 +03:30
Andreas Kling	8b1108e485	Everywhere: Pass AK::StringView by value	2021-11-11 01:27:46 +01:00
Nico Weber	de72332920	Libraries: Fix typos	2021-10-01 01:06:40 +01:00
Ben Wiederhake	32e98d0924	Libraries: Use AK::Variant default initialization where appropriate	2021-09-21 04:22:52 +04:30
Ali Mohammad Pur	8e3fe80c06	LibRegex: Avoid using GenericLexer::consume() when at eof Fixes #10027.	2021-09-14 22:02:25 +02:00
Ali Mohammad Pur	7fefb8148b	LibRegex: Use the correct capture group index in ERE bytecode generation Otherwise the left and right capture instructions wouldn't point to the same capture group if there was another nested group there.	2021-09-07 20:01:58 +02:00
Ali Mohammad Pur	dd82c2e9b4	LibRegex: Correctly handle failing in the middle of explicit repeats - Make sure that all the Repeat ops are reset (otherwise the operation would not be correct when going over the Repeat op a second time) - Make sure that all matches that are allowed to fail are backed by a fork, otherwise the last failing fork would not have anywhere to return to. Fixes #9707.	2021-09-01 13:36:53 +02:00
Ali Mohammad Pur	05c65f9b5d	LibRegex: Limit the number of nested capture groups allowed in BRE Found by OSS-Fuzz: https://oss-fuzz.com/testcase?key=4869334212673536	2021-08-31 16:37:49 +02:00
Timothy Flynn	562d4e497b	LibRegex: Treat pattern string characters as unsigned For example, consider the following pattern: new RegExp('\ud834\udf06', 'u') With this pattern, the regex parser should insert the UTF-8 encoded bytes 0xf0, 0x9d, 0x8c, and 0x86. However, because these characters are currently treated as normal char types, they have a negative value since they are all > 0x7f. Then, due to sign extension, when these characters are cast to u64, the sign bit is preserved. The result is that these bytes are inserted as 0xfffffffffffffff0, 0xffffffffffffff9d, etc. Fortunately, there are only a few places where we insert bytecode with the raw characters. In these places, be sure to treat the bytes as u8 before they are cast to u64.	2021-08-20 19:16:33 +02:00
Timothy Flynn	4f2cbe119b	LibRegex: Allow Unicode escape sequences in capture group names Unfortunately, this requires a slight divergence in the way the capture group names are stored. Previously, the generated byte code would simply store a view into the regex pattern string, so no string copying was required. Now, the escape sequences are decoded into a new string, and all parsed capture group names are stored in a vector in the parser result structure. The byte code then stores a view into the corresponding string in that vector.	2021-08-19 23:49:25 +02:00
Timothy Flynn	6131c0485e	LibRegex: Use GenericLexer to consume escaped code points	2021-08-19 23:49:25 +02:00
Timothy Flynn	5ff9596678	LibRegex: Convert regex::Lexer to inherit from GenericLexer This will allow regex::Lexer users to invoke GenericLexer consumption methods, such as GenericLexer::consume_escaped_codepoint(). This also allows for de-duplicating common methods between the lexers.	2021-08-19 23:49:25 +02:00
Timothy Flynn	02e3633b7f	AK: Move FormatParser definition from header to implementation file This is primarily to be able to remove the GenericLexer include out of Format.h as well. A subsequent commit will add AK::Result to GenericLexer, which will cause naming conflicts with other structures named Result. This can be avoided (for now) by preventing nearly every file in the system from implicitly including GenericLexer. Other changes in this commit are to add the GenericLexer include to files where it is missing.	2021-08-19 23:49:25 +02:00
Timothy Flynn	a9716ad44e	LibRegex: In non-Unicode mode, parse \u{4} as a repetition pattern	2021-08-18 09:47:09 +04:30
Timothy Flynn	9509433e25	LibRegex: Implement and use a REPEAT operation for bytecode repetition Currently, when we need to repeat an instruction N times, we simply add that instruction N times in a for-loop. This doesn't scale well with extremely large values of N, and ECMA-262 allows up to N = 2^53 - 1. Instead, add a new REPEAT bytecode operation to defer this loop from the parser to the runtime executor. This allows the parser to complete sans any loops (for this instruction), and allows the executor to bail early if the repeated bytecode fails. Note: The templated ByteCode methods are to allow the Posix parsers to continue using u32 because they are limited to N = 2^20.	2021-08-15 11:43:45 +01:00
Timothy Flynn	f1ce998d73	LibRegex+LibJS: Combine named and unnamed capture groups in MatchState Combining these into one list helps reduce the size of MatchState, and as a result, reduces the amount of memory consumed during execution of very large regex matches. Doing this also allows us to remove a few regex byte code instructions: ClearNamedCaptureGroup, SaveLeftNamedCaptureGroup, and NamedReference. Named groups now behave the same as unnamed groups for these operations. Note that SaveRightNamedCaptureGroup still exists to cache the matched group name. This also removes the recursion level from the MatchState, as it can exist as a local variable in Matcher::execute instead.	2021-08-15 11:43:45 +01:00
Timothy Flynn	1a173be29d	LibRegex: Disallow unescaped quantifiers in Unicode mode	2021-08-15 11:43:45 +01:00
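
The sign-extension problem described in commit 562d4e497b (LibRegex: Treat pattern string characters as unsigned) can be reproduced outside the regex parser. The following is a minimal sketch, not code from LibRegex: `widen_signed` and `widen_unsigned` are hypothetical helper names, standard `uint64_t`/`uint8_t` stand in for AK's `u64`/`u8`, and `signed char` is used explicitly because the signedness of plain `char` is implementation-defined. It assumes a two's-complement platform.

```cpp
#include <cstdint>

// Hypothetical helpers for illustration only; these are not LibRegex functions.

// Widening a signed 8-bit value directly to 64 bits sign-extends it,
// so every byte above 0x7f gains 56 leading one-bits.
constexpr uint64_t widen_signed(signed char byte)
{
    return static_cast<uint64_t>(byte);
}

// Converting to an unsigned 8-bit value first keeps only the low byte,
// which is the fix the commit message describes.
constexpr uint64_t widen_unsigned(signed char byte)
{
    return static_cast<uint64_t>(static_cast<uint8_t>(byte));
}

// 0xf0 is the first UTF-8 byte listed in the commit message.
constexpr signed char first_byte = static_cast<signed char>(0xf0);

static_assert(widen_signed(first_byte) == 0xfffffffffffffff0ULL, "sign-extended value");
static_assert(widen_unsigned(first_byte) == 0xf0ULL, "intended value");
```

The two `static_assert` lines check the values quoted in the commit message at compile time, so the snippet needs no runtime harness.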

100 commits