First, this isn't actually helpful, as we no longer store 32-bit values
in JsonValue; they are stored as 64-bit values anyway.
But more importantly, there was a bug here when trying to coerce an i64
to an i32: all negative values were cast to an i32 without checking
whether the value is below NumericLimits<i32>::min().
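A minimal sketch of the corrected coercion (the helper name is
illustrative, not AK's actual API):

    // Only narrow when the value round-trips; the old code skipped the
    // lower-bound check, so e.g. -5'000'000'000 wrapped silently.
    Optional<i32> coerce_to_i32(i64 value)
    {
        if (value < NumericLimits<i32>::min() || value > NumericLimits<i32>::max())
            return {};
        return static_cast<i32>(value);
    }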
Some callers (LibJS) will want to control the size of the output buffer,
to decode up to a maximum length. They will also want to receive partial
results in the case of an error. This patch adds a method to provide
those capabilities, and makes the existing implementation use it.
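A hedged sketch of the shape such a method might take; the name and
result type here are assumptions, not the actual AK signature:

    // Decodes up to `output.size()` bytes' worth of `input`, reporting
    // how much input was consumed and output produced. Stopping early
    // (full buffer or invalid input) still yields the partial result,
    // which is what callers like LibJS want.
    struct Base64DecodeResult {
        size_t consumed { 0 }; // input characters processed
        size_t written { 0 };  // output bytes produced
    };
    Base64DecodeResult decode_base64_into(StringView input, Bytes output);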
This requires pulling in some of the STL, but the result is that our
iterator is now STL Approved ™️ and our containers can be
auto-conformed to Swift protocols.
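Roughly, "STL Approved" means the iterator exposes the member types
std::iterator_traits looks for; a minimal sketch (the iterator itself is
illustrative):

    #include <cstddef>
    #include <iterator>

    struct SimpleIterator {
        // The five aliases the STL (and thus the Swift C++ interop
        // machinery) expects to find on a conforming iterator.
        using iterator_category = std::forward_iterator_tag;
        using value_type = int;
        using difference_type = std::ptrdiff_t;
        using pointer = int*;
        using reference = int&;

        reference operator*() const { return *m_ptr; }
        SimpleIterator& operator++() { ++m_ptr; return *this; }
        bool operator==(SimpleIterator const&) const = default;

        int* m_ptr { nullptr };
    };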
This isn't an issue now because this is only invoked from a macro that
is expanded within this file. But in an upcoming commit, it will be
invoked from a helper function in the Test namespace. At that point, the
compiler complains about the comparator not being found (and helpfully
indicates we should move this one to the AK namespace to allow ADL to
succeed).
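A sketch of the lookup problem (expect_equal stands in for the upcoming
Test helper):

    namespace AK {

    // Declared in AK rather than the global namespace, so argument-
    // dependent lookup can find it from any calling namespace.
    template<typename T>
    bool operator==(Vector<T> const& a, Vector<T> const& b);

    }

    namespace Test {

    template<typename T>
    void expect_equal(T const& a, T const& b)
    {
        VERIFY(a == b); // resolves via ADL, since T lives in AK
    }

    }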
Otherwise, the following code would not compile:
    constexpr Array<int, 3> array { 4, 5, 6 };
    Vector<int> vector { 4, 5, 6 };

    if (array == vector.span()) { }
We do such comparisons in tests quite a bit, but it currently isn't an
issue because EXPECT_EQ copies its input parameters to non-const locals.
In a future patch, that copying will be removed, and the compiler would
otherwise complain about not finding a suitable comparison operator.
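The missing piece is a comparator along these lines (AK's actual
constraints may differ):

    // Lets an Array be compared against any Span with a compatible
    // element type, without copying either operand.
    template<typename T, size_t Size, typename U>
    constexpr bool operator==(Array<T, Size> const& array, Span<U> other)
    {
        return array.span() == other;
    }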
The underlying CPU-specific instructions for operating on UTF-16 strings
behave differently for null inputs. Add an explicit check for this state
for consistency.
The underlying CPU-specific instructions for operating on UTF-8 strings
behave differently for null inputs. Add an explicit check for this state
for consistency.
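A sketch of the guard both of these commits describe (the wrapper is
illustrative; the simdutf entry point is real):

    bool validate_utf8(char const* data, size_t byte_length)
    {
        // simdutf's CPU-specific backends disagree on null input, so
        // answer the trivial case before calling into them.
        if (data == nullptr)
            return true;
        return simdutf::validate_utf8(data, byte_length);
    }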
Utf16View currently assumes host endianness. Add support for specifying
either big or little endianness (which we mostly just pipe through to
simdutf). This will allow using simdutf facilities with LibTextCodec.
The one behavior difference is that we will now actually fail on invalid
code units with Utf16View::to_utf8(AllowInvalidCodeUnits::No). It was
arguably a bug that this wasn't already the case.
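A rough sketch of the plumbing; the enum and wrapper are illustrative,
while the le/be entry points are simdutf's real API:

    enum class Endianness { Host, Big, Little };

    // Returns the number of UTF-8 bytes written, or 0 on invalid input,
    // matching simdutf's conventions. Host is assumed little-endian
    // here for brevity.
    size_t to_utf8(char16_t const* input, size_t length, char* output, Endianness endianness)
    {
        if (endianness == Endianness::Big)
            return simdutf::convert_utf16be_to_utf8(input, length, output);
        return simdutf::convert_utf16le_to_utf8(input, length, output);
    }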
We currently have 2 base64 coders: one in AK, another in LibWeb for a
"forgiving" implementation. ECMA-262 has an upcoming proposal which will
require a third implementation.
Instead, let's use the base64 implementation that is used by Node.js and
recommended by the upcoming proposal. It handles forgiving decoding as
well.
Our users of AK's implementation should be fine with the forgiving
implementation. The AK impl originally had naive forgiving behavior, but
that was removed solely for performance reasons.
Using http://mattmahoney.net/dc/enwik8.zip (100MB unzipped) as a test,
performance of our old home-grown implementations vs. the simdutf
implementation (on Linux x64):
                     Encode    Decode
    AK base64        0.226s    0.169s
    LibWeb base64    N/A       1.244s
    simdutf          0.161s    0.047s
There are a couple of differences here due to using ICU:
1. Titlecasing behaves slightly differently. We previously transformed
"123dollars" to "123Dollars", as we would use word segmentation to
split a string into words, then transform the first cased character
to titlecase. ICU doesn't go quite that far, and leaves the string
as "123dollars". While this is a behavior change, the only user of
this API is the `text-transform: capitalize;` CSS rule, and we now
match the behavior of other browsers.
2. There isn't an API to compare strings with case insensitivity without
allocating case-folded strings for both the left- and right-hand-side
strings. Our implementation was previously allocation-free; however,
in a benchmark, ICU is still ~1.4x faster.
The [UTF-8](https://datatracker.ietf.org/doc/html/rfc3629#page-5)
standard says to reject strings with upper or lower surrogates. However,
in many standards, ECMAScript included, unpaired surrogates (and
therefore UTF-8 surrogates) are allowed in strings. So, this commit
extends the UTF-8 validation API with an `AllowSurrogates` option;
passing `AllowSurrogates::No` rejects upper and lower surrogate
characters, while `Yes` permits them.
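A sketch of the check the option controls (the enum name is from this
commit; the helper is illustrative):

    enum class AllowSurrogates { Yes, No };

    // U+D800..U+DBFF are upper (leading) surrogates and U+DC00..U+DFFF
    // are lower (trailing) surrogates; RFC 3629 forbids both in UTF-8.
    constexpr bool is_surrogate(u32 code_point)
    {
        return code_point >= 0xD800 && code_point <= 0xDFFF;
    }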
Instead of attempting a stack use-after-free by reading an out-of-scope
object's data member, let's keep a flag that records whether the
destructor has been called in the outer scope.
Fixes #64
`ceil_div(-1, 2)` used to return -1.
Now it returns 0, which is the correct ceil(-0.5).
(C++'s integer division truncates toward zero, which matches floor for
positive quotients but ceil for negative quotients.)
This will be important for the JPEG2000 decoder eventually.
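A sketch of the corrected helper, mirroring the description above (AK's
exact template constraints are omitted):

    template<typename T>
    constexpr T ceil_div(T a, T b)
    {
        T result = a / b; // truncates toward zero
        // Round up only when there's a remainder and the exact quotient
        // is positive; ceil_div(-1, 2) thus stays at 0.
        if ((a % b) != 0 && ((a < 0) == (b < 0)))
            ++result;
        return result;
    }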
The following command was used to clang-format these files:
    clang-format-18 -i $(find . \
        -not \( -path "./\.*" -prune \) \
        -not \( -path "./Base/*" -prune \) \
        -not \( -path "./Build/*" -prune \) \
        -not \( -path "./Toolchain/*" -prune \) \
        -not \( -path "./Ports/*" -prune \) \
        -type f -name "*.cpp" -o -name "*.mm" -o -name "*.h")
There are a couple of weird cases where clang-format now thinks that a
pointer access in an initializer list, e.g. `m_member(ptr->foo)`, is a
lambda return statement, and it puts spaces around the `->`.
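For illustration, the misformatting looks like this (names made up):

    Foo::Foo(Bar* ptr)
        : m_member(ptr -> foo) // clang-format treats `-> foo` as a
    {                          // lambda's trailing return type
    }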
These changes allow lines of arbitrary length to be read with
BufferedStream. When the user-supplied buffer is smaller than the line,
it will be resized to fit the line. When the internal buffer in
BufferedStream is smaller than the line, the line will be read into the
user-supplied buffer chunk by chunk, with the buffer growing
accordingly.
Other behaviors match those of the existing read_line method.
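A hedged usage sketch; `read_line_with_resize` is my reading of the new
method's shape, and the surrounding LibCore calls may differ:

    auto file = TRY(Core::File::open("/tmp/input.txt"sv, Core::File::OpenMode::Read));
    auto buffered = TRY(Core::InputBufferedFile::create(move(file)));

    // Deliberately tiny; it grows whenever a line doesn't fit.
    auto buffer = TRY(ByteBuffer::create_uninitialized(16));

    while (TRY(buffered->can_read_line())) {
        auto line = TRY(buffered->read_line_with_resize(buffer));
        outln("{}", line);
    }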
This was added in commit f2663f477f as a
partial implementation of what is now LibWeb's forgiving Base64 decoder.
All use cases within LibWeb that require whitespace skipping now use
that implementation instead.
Removing this feature from AK allows us to know the exact output size of
a decoded Base64 string. We can still trim whitespace at the start and
end of the input, though; for example, this is useful when reading from
a file that may end with a newline.
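For example, the exact size now falls out of the input length and the
trailing padding; a sketch:

    size_t decoded_base64_size(StringView input)
    {
        if (input.is_empty())
            return 0;
        size_t padding = 0;
        if (input.ends_with("=="sv))
            padding = 2;
        else if (input.ends_with('='))
            padding = 1;
        return input.length() / 4 * 3 - padding;
    }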
This encoding scheme comes from section 5 of RFC 4648, as an
alternative to the standard base64 encode/decode methods.
The only difference is that the last two characters are replaced
with '-' and '_', as '+' and '/' are not safe in URLs or filenames.
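A quick illustration with arbitrary bytes whose encoding happens to hit
the last two alphabet entries:

    u8 bytes[] { 0xfb, 0xef, 0xff };
    // standard base64: "++//"
    // base64url:       "--__"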
This URL library ends up being a relatively fundamental base library of
the system, as LibCore depends on LibURL.
This change has two main benefits:
* Moving AK back more towards being an agnostic library that can
be used between the kernel and userspace. URL has never really fit
that description - and is not used in the kernel.
* URL _should_ depend on LibUnicode, as it needs punycode support.
However, it's not really possible to do this inside of AK as it can't
depend on any external library. This change brings us a little closer
to being able to do that, but unfortunately we aren't there quite
yet, as the code generators depend on LibCore.
Otherwise, we percent-encode negative signed chars incorrectly. For
example, https://www.strava.com/login contains the following hidden
<input> field:
<input name="utf8" type="hidden" value="✓" />
On submitting the form, we would percent-encode that field as:
utf8=%-1E%-64%-6D
Which would cause us to receive an HTTP 500 response. We now properly
percent-encode that field as:
utf8=%E2%9C%93
And can log in to Strava :^)
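A sketch of the underlying fix (the helper is illustrative, not
LibURL's actual code):

    // With a signed char, the byte 0xE2 is -30 and formats as "-1E";
    // casting to u8 first yields the intended "E2".
    void append_percent_encoded(StringBuilder& builder, char ch)
    {
        builder.appendff("%{:02X}", static_cast<u8>(ch));
    }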
We already have a helper to split a StringView by line while considering
"\n", "\r", and "\r\n". Add an analogous method to just count the number
of lines in the same manner.
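A sketch of the counting logic, treating the three separators the same
way the splitter does (the signature is illustrative):

    size_t count_lines(StringView text)
    {
        size_t lines = 1;
        for (size_t i = 0; i < text.length(); ++i) {
            if (text[i] == '\n') {
                ++lines;
            } else if (text[i] == '\r') {
                ++lines;
                if (i + 1 < text.length() && text[i + 1] == '\n')
                    ++i; // "\r\n" counts as one line break
            }
        }
        return lines;
    }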
If the BufferedStream is able to fill its entire circular buffer in
populate_read_buffer() and is later asked to read a line or read until
a delimiter, it could erroneously return EMSGSIZE if the caller's buffer
was smaller than the internal buffer. In this case, all we really care
about is whether the caller's buffer is big enough for however much data
we're going to copy into it, which needs to take the delimiter candidate
into account.
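Sketched as conditions (the names here are illustrative, not the actual
BufferedStream internals):

    // Before: fails whenever the caller's buffer is smaller than the
    // full internal buffer, even though the line would fit.
    if (buffer.size() < m_buffer.used_space())
        return Error::from_errno(EMSGSIZE);

    // After: only fail if the bytes up to and including the delimiter
    // candidate don't fit.
    if (buffer.size() < offset_of_candidate + candidate.length())
        return Error::from_errno(EMSGSIZE);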