These don't have to worry about the input not being valid UTF-8 and
so can be infallible (and can even return self if no changes needed.)
We use this instead of Infra::to_ascii_{upper,lower}_case in LibWeb.
First, this isn't actually helpful, as we no longer store 32-bit values
in JsonValue. They are stored as 64-bit values anyways.
But more imporatantly, there was a bug here when trying to coerce an i64
to an i32. All negative values were cast to an i32, without checking if
the value is below NumericLimits<i32>::min.
Some callers (LibJS) will want to control the size of the output buffer,
to decode up to a maximum length. They will also want to receive partial
results in the case of an error. This patch adds a method to provide
those capabilities, and makes the existing implementation use it.
This requires pulling in some of the STL, but the result is that our
iterator is now STL Approved ™️ and our containers can be
auto-conformed to Swift protocols.
This isn't an issue now because this is only invoked from a macro that
is expanded within this file. But in an upcoming commit, it will be
invoked from a helper function in the Test namespace. At that point, the
compiler complains about the comparitor not being found (and helpfully
indicates we should move this one to the AK namespace to allow ADL to
succeed).
Otherwise, the following code would not compile:
constexpr Array<int, 3> array { 4, 5, 6 };
Vector<int> vector { 4, 5, 6 };
if (array == vector.span()) { }
We do such comparisons in tests quite a bit. But it currently doesn't
become an issue because of the way EXPECT_EQ copies its input parameters
to non-const locals. In a future patch, that copying will be removed,
and the compiler would otherwise complain about not finding a suitable
comparison operator.
The underlying CPU-specific instructions for operating on UTF-16 strings
behave differently for null inputs. Add an explicit check for this state
for consistency.
The underlying CPU-specific instructions for operating on UTF-8 strings
behave differently for null inputs. Add an explicit check for this state
for consistency.
Utf16View currently assumes host endianness. Add support for specifying
either big or little endianness (which we mostly just pipe through to
simdutf). This will allow using simdutf facilities with LibTextCodec.
The one behavior difference is that we will now actually fail on invalid
code units with Utf16View::to_utf8(AllowInvalidCodeUnits::No). It was
arguably a bug that this wasn't already the case.
We currently have 2 base64 coders: one in AK, another in LibWeb for a
"forgiving" implementation. ECMA-262 has an upcoming proposal which will
require a third implementation.
Instead, let's use the base64 implementation that is used by Node.js and
recommended by the upcoming proposal. It handles forgiving decoding as
well.
Our users of AK's implementation should be fine with the forgiving
implementation. The AK impl originally had naive forgiving behavior, but
that was removed solely for performance reasons.
Using http://mattmahoney.net/dc/enwik8.zip (100MB unzipped) as a test,
performance of our old home-grown implementations vs. the simdutf
implementation (on Linux x64):
Encode Decode
AK base64 0.226s 0.169s
LibWeb base64 N/A 1.244s
simdutf 0.161s 0.047s
There are a couple of differences here due to using ICU:
1. Titlecasing behaves slightly differently. We previously transformed
"123dollars" to "123Dollars", as we would use word segmentation to
split a string into words, then transform the first cased character
to titlecase. ICU doesn't go quite that far, and leaves the string
as "123dollars". While this is a behavior change, the only user of
this API is the `text-transform: capitalize;` CSS rule, and we now
match the behavior of other browsers.
2. There isn't an API to compare strings with case insensitivity without
allocating case-folded strings for both the left- and right-hand-side
strings. Our implementation was previously allocation-free; however,
in a benchmark, ICU is still ~1.4x faster.
The [UTF-8](https://datatracker.ietf.org/doc/html/rfc3629#page-5)
standard says to reject strings with upper or lower surrogates. However,
in many standards, ECMAScript included, unpaired surrogates (and
therefore UTF-8 surrogates) are allowed in strings. So, this commit
extends the UTF-8 validation API with `AllowSurrogates`, which will
reject upper and lower surrogate characters.
Instead of attempting a stack use-after-free by reading an out-of-scope
object's data member, let's keep a flag that checks if the destructor
had been called in the outer scope.
Fixes#64
`ceil_div(-1, 2)` used to return -1.
Now it returns 0, which is the correct ceil(-0.5).
(C++'s division semantics have floor semantics for numbers > 0,
but ceil semantics for numbers < 0.)
This will be important for the JPEG2000 decoder eventually.
The following command was used to clang-format these files:
clang-format-18 -i $(find . \
-not \( -path "./\.*" -prune \) \
-not \( -path "./Base/*" -prune \) \
-not \( -path "./Build/*" -prune \) \
-not \( -path "./Toolchain/*" -prune \) \
-not \( -path "./Ports/*" -prune \) \
-type f -name "*.cpp" -o -name "*.mm" -o -name "*.h")
There are a couple of weird cases where clang-format now thinks that a
pointer access in an initializer list, e.g. `m_member(ptr->foo)`, is a
lambda return statement, and it puts spaces around the `->`.
These changes allow lines of arbitrary length to be read with
BufferedStream. When the user supplied buffer is smaller than
the line, it will be resized to fit the line. When the internal
buffer in BufferedStream is smaller than the line, it will be
read into the user supplied buffer chunk by chunk with the
buffer growing accordingly.
Other behaviors match the behavior of the existing read_line method.
This was added in commit f2663f477f as a
partial implementation of what is now LibWeb's forgiving Base64 decoder.
All use cases within LibWeb that require whitespace skipping now use
that implementation instead.
Removing this feature from AK allows us to know the exact output size of
a decoded Base64 string. We can still trim whitespace at the start and
end of the input though; for example, this is useful when reading from a
file that may have a newline at the end of the file.