The underlying CPU-specific instructions for operating on UTF-16 strings
behave differently for null inputs. Add an explicit check for this state
for consistency.
Utf16View currently assumes host endianness. Add support for specifying
either big or little endianness (which we mostly just pipe through to
simdutf). This will allow using simdutf facilities with LibTextCodec.
The one behavior difference is that we will now actually fail on invalid
code units with Utf16View::to_utf8(AllowInvalidCodeUnits::No). It was
arguably a bug that this wasn't already the case.
The [UTF-8](https://datatracker.ietf.org/doc/html/rfc3629#page-5)
standard says to reject strings with upper or lower surrogates. However,
in many standards, ECMAScript included, unpaired surrogates (and
therefore UTF-8 surrogates) are allowed in strings. So, this commit
extends the UTF-8 validation API with `AllowSurrogates`, which will
reject upper and lower surrogate characters.
This commit un-deprecates DeprecatedString, and repurposes it as a byte
string.
As the null state has already been removed, there are no other
particularly hairy blockers in repurposing this type as a byte string
(what it _really_ is).
This commit is auto-generated:
$ xs=$(ack -l \bDeprecatedString\b\|deprecated_string AK Userland \
Meta Ports Ladybird Tests Kernel)
$ perl -pie 's/\bDeprecatedString\b/ByteString/g;
s/deprecated_string/byte_string/g' $xs
$ clang-format --style=file -i \
$(git diff --name-only | grep \.cpp\|\.h)
$ gn format $(git ls-files '*.gn' '*.gni')
We have a new, improved string type coming up in AK (OOM aware, no null
state), and while it's going to use UTF-8, the name UTF8String is a
mouthful - so let's free up the String name by renaming the existing
class.
Making the old one have an annoying name will hopefully also help with
quick adoption :^)
It's very common to encounter single-character strings in JavaScript on
the web. We can make such strings significantly lighter by having a
1-character inline capacity on the Vectors.
The current method of iterating through the string to access a code
point hurts performance quite badly for very large strings. The test262
test "RegExp/property-escapes/generated/Any.js" previously took 3 hours
to complete; this one change brings it down to under 10 seconds.
To be used as a RegexStringView variant, Utf16View must provide a couple
more helper methods. It must also not default its assignment operators,
because that implicitly deletes move/copy constructors.