Commit graph

113 commits

Author SHA1 Message Date
Timothy Flynn
1e10d6d7ce LibRegex: Support property escapes of Unicode General Categories
This changes LibRegex to parse the property escape as a Variant of
Unicode Property & General Category values. A byte code instruction is
added to perform matching based on General Category values.
2021-08-02 21:02:09 +04:30
Ali Mohammad Pur
85d87cbcc8 LibRegex: Add some tests for Fork{Stay,Jump} performance
Without the previous fixes, these will blow up the stack.
2021-08-02 17:22:50 +04:30
Timothy Flynn
d485cf29d7 LibRegex+LibUnicode: Begin implementing Unicode property escapes
This supports some binary property matching. It does not support any
properties not yet parsed by LibUnicode, nor does it support value
matching (such as Script_Extensions=Latin).
2021-07-30 21:26:31 +01:00
Timothy Flynn
345ef6abba LibRegex: Support ECMA-262 Unicode escapes of the form "\u{code_point}"
When the Unicode flag is set, regular expressions may escape code points
by surrounding the hexadecimal code point with curly braces, e.g. \u{41}
is the character "A".

When the Unicode flag is not set, this should be considered a repetition
symbol - \u{41} is the character "u" repeated 41 times. This is left as
a TODO for now.
2021-07-23 23:06:57 +01:00
Timothy Flynn
47f6bb38a1 LibRegex: Support UTF-16 RegexStringView and improve Unicode matching
When the Unicode option is not set, regular expressions should match
based on code units; when it is set, they should match based on code
points. To do so, the regex parser must combine surrogate pairs when
the Unicode option is set. Further, RegexStringView needs to know if
the flag is set in order to return code point vs. code unit based
string lengths and substrings.
2021-07-23 23:06:57 +01:00
Ali Mohammad Pur
f364fcec5d LibRegex+Everywhere: Make LibRegex more unicode-aware
This commit makes LibRegex (mostly) capable of operating on any of
the three main string views:
- StringView for raw strings
- Utf8View for utf-8 encoded strings
- Utf32View for raw unicode strings

As a result, regexps with unicode strings should be able to properly
handle utf-8 and not stop in the middle of a code point.
A future commit will update LibJS to use the correct type of string
depending on the flags.
2021-07-18 21:10:55 +04:30
Ali Mohammad Pur
e5af15a6e9 LibRegex: Don't do out-of-bound match accesses when a test fails 2021-07-18 21:10:55 +04:30
Ali Mohammad Pur
1c584e9d80 LibRegex: Correctly parse BRE bracket expressions
Commonly, bracket expressions are in fact, enclosed in brackets.
2021-07-10 22:58:24 +04:30
Ali Mohammad Pur
daa6d99e6e LibRegex: Add support for non-extended regular expressions in regcomp()
Fixes part of #8506.
2021-07-10 13:33:08 +02:00
Timothy Flynn
65003241e4 LibRegex: Allow dollar signs in ECMA262 named capture groups
Fixes 1 test262 test.
2021-07-06 22:33:17 +01:00
sin-ack
9a9e7f03f2 Tests: Add test for case-insensitive matching 2021-06-16 16:30:12 +04:30
Andrew Kaster
55d338b66f Tests: Free all memory allocated with regcomp in RegexLibC tests
The C interface (posix interface?) for regexes has no "initialize"
function, only a free function. The comment in regcomp in
LibRegex/C/Regex.cpp notes that calling regcomp without a regfree is an
error, and will leak memory. Every single time regcomp is called on a
regex_t*, it will allocate new memory.

Make sure that all the regcomp calls are paired with a regfree in the
tests program
2021-05-14 08:34:00 +01:00
Brian Gianforcaro
6e918e4e02 Tests: Move LibRegex tests to Tests/LibRegex 2021-05-06 17:54:28 +02:00