In 7c5e30daaa, the focus was "only" on
Userland/Libraries/, whereas this commit cleans up the remaining
headers in the repo, and any new badly-formatted include.
This will make it easier to support both string types at the same time
while we convert code, and tracking down remaining uses.
One big exception is Value::to_string() in LibJS, where the name is
dictated by the ToString AO.
We have a new, improved string type coming up in AK (OOM aware, no null
state), and while it's going to use UTF-8, the name UTF8String is a
mouthful - so let's free up the String name by renaming the existing
class.
Making the old one have an annoying name will hopefully also help with
quick adoption :^)
Previously, for a regex such as /[a-sy-z]/i, we would incorrectly think
the character "u" fell into the range "a-s" because neither of the
conditions "u > s && U > s" or "u < a && U < a" would be true, resulting
in the lookup falling back to assuming the character is in the range.
Instead, first explicitly check if the character falls into the range,
rather than checking if it falls outside the range. If the explicit
checks fail, then we know the character is outside the range.
This prevents us from needing a sv suffix, and potentially reduces the
need to run generic code for a single character (as contains,
starts_with, ends_with etc. for a char will be just a length and
equality check).
No functional changes.
Each of these strings would previously rely on StringView's char const*
constructor overload, which would call __builtin_strlen on the string.
Since we now have operator ""sv, we can replace these with much simpler
versions. This opens the door to being able to remove
StringView(char const*).
No functional changes.
[^XYZ] is not(X | Y | Z), we used to translate this to
not(X) | not(Y) | not(Z), this commit makes LibRegex interpret this
pattern as not(X) & not(Y) & not(Z).
The lowercase version of a range is not required to be a valid range,
instead of casefolding the range and making it invalid, check twice with
both cases of the input character (which are the same as the input if
not insensitive).
This time includes an actual test :^)
We had a really naive and simplistic implementation, which lead to
various issues where the optimiser incorrectly rewrote the regex to use
atomic groups; this commit fixes that.
While quantifying assertions is very much meaningless, the specification
allows them with annex B's extended grammar for browsers, so read and
apply the quantifiers.
Fixes#12373.
ECMA-262 defines \s as:
Return the CharSet containing all characters corresponding to a code
point on the right-hand side of the WhiteSpace or LineTerminator
productions.
The LineTerminator production is simply: U+000A, U+000D, U+2028, or
U+2029. Unfortunately there isn't a Unicode property that covers just
those code points.
The WhiteSpace production is: U+0009, U+000B, U+000C, U+FEFF, or any
code point with the Space_Separator general category.
If the Unicode generators are disabled, this will fall back to ASCII
space code points.
LibRegex already implements this loop in a more performant way, so all
LibJS has to do here is to return things in the right shape, and not
loop over the input string.
Previously this was a quadratic operation on string length, which lead
to crazy execution times on failing regexps - now it's nice and fast :^)
Note that a Regex test has to be updated to remove the stateful flag as
it repeats matching on multiple strings.
As ECMA262 regex allows `[^]` and literal newlines to match newlines in
the input string, we shouldn't split the input string into lines, rather
simply make boundaries and catchall patterns capable of checking for
these conditions specifically.
Instead of leaking all capture groups and selectively clearing some,
simply avoid leaking things and only "define" the ones that need to
exist.
This *actually* implements the capture groups ECMA262 quirk.
Also adds the test removed in the previous commit (to avoid messing up
test runs across bisects).
This partially reverts commit c11be92e23.
That commit fixes one thing and breaks many more, a next commit will
implement this quirk in a more sane way.
Previously we were jumping to the new end of the previous block (created
by the newly inserted ForkStay), correct the offset to jump to the
correct block as shown in the comments.
Fixes#12033.
The instructions can have dependencies (e.g. Repeat), so only unify
equal blocks instead of consecutive instructions.
Fixes#11247.
Also adds the minimal test case(s) from that issue.
The initial `ForkStay` is only needed if the looping block has a
following block, if there's no following block or the following block
does not attempt to match anything, we should not insert the ForkStay,
otherwise we would be rewriting `a+` as `a*` by allowing the 'end' to be
executed.
Fixes#10952.
Generate a sorted, compressed series of ranges in a match table for
character classes, and use a binary search to find the matches.
This is about a 3-4x speedup for character class match performance. :^)
For example, consider the following pattern:
new RegExp('\ud834\udf06', 'u')
With this pattern, the regex parser should insert the UTF-8 encoded
bytes 0xf0, 0x9d, 0x8c, and 0x86. However, because these characters are
currently treated as normal char types, they have a negative value since
they are all > 0x7f. Then, due to sign extension, when these characters
are cast to u64, the sign bit is preserved. The result is that these
bytes are inserted as 0xfffffffffffffff0, 0xffffffffffffff9d, etc.
Fortunately, there are only a few places where we insert bytecode with
the raw characters. In these places, be sure to treat the bytes as u8
before they are cast to u64.
Unfortunately, this requires a slight divergence in the way the capture
group names are stored. Previously, the generated byte code would simply
store a view into the regex pattern string, so no string copying was
required.
Now, the escape sequences are decoded into a new string, and a vector
of all parsed capture group names are stored in a vector in the parser
result structure. The byte code then stores a view into the
corresponding string in that vector.