Some callers (LibJS) will want to control the size of the output buffer,
to decode up to a maximum length. They will also want to receive partial
results in the case of an error. This patch adds a method to provide
those capabilities, and makes the existing implementation use it.
We currently have 2 base64 coders: one in AK, another in LibWeb for a
"forgiving" implementation. ECMA-262 has an upcoming proposal which will
require a third implementation.
Instead, let's use the base64 implementation that is used by Node.js and
recommended by the upcoming proposal. It handles forgiving decoding as
well.
Our users of AK's implementation should be fine with the forgiving
implementation. The AK impl originally had naive forgiving behavior, but
that was removed solely for performance reasons.
Using http://mattmahoney.net/dc/enwik8.zip (100MB unzipped) as a test,
performance of our old home-grown implementations vs. the simdutf
implementation (on Linux x64):
Encode Decode
AK base64 0.226s 0.169s
LibWeb base64 N/A 1.244s
simdutf 0.161s 0.047s
This was added in commit f2663f477f as a
partial implementation of what is now LibWeb's forgiving Base64 decoder.
All use cases within LibWeb that require whitespace skipping now use
that implementation instead.
Removing this feature from AK allows us to know the exact output size of
a decoded Base64 string. We can still trim whitespace at the start and
end of the input though; for example, this is useful when reading from a
file that may have a newline at the end of the file.
Caught by clang-format-17. Note that clang-format-16 is fine with this
as well (it leaves the const placement alone), it just doesn't perform
the formatting to east-const itself.
There's no need to copy the result. We can also avoid increasing the
size of the output buffer by 1 for each written byte.
This reduces the runtime of `./bin/base64 -d enwik8.base64 >/dev/null`
from 0.917s to 0.632s.
(enwik8 is a 100MB test file from http://mattmahoney.net/dc/enwik8.zip)
We don't really need the features provided by StringBuilder here, since
we know the exact size of the output. Avoiding StringBuilder avoids the
recurring capacity/size checks both within StringBuilder itself and its
internal ByteBuffer.
This reduces the runtime of `./bin/base64 enwik8 >/dev/null` from
0.976s to 0.428s.
(enwik8 is a 100MB test file from http://mattmahoney.net/dc/enwik8.zip)
We know we are only appending ASCII characters to the StringBuilder, so
do not bother validating the result.
This reduces the runtime of `./bin/base64 enwik8 >/dev/null` from
1.192s to 0.976s.
(enwik8 is a 100MB test file from http://mattmahoney.net/dc/enwik8.zip)
This encoding scheme comes from section 5 of RFC 4648, as an
alternative to the standard base64 encode/decode methods.
The only difference is that the last two characters are replaced
with '-' and '_', as '+' and '/' are not safe in URLs or filenames.
This will make it easier to support both string types at the same time
while we convert code, and tracking down remaining uses.
One big exception is Value::to_string() in LibJS, where the name is
dictated by the ToString AO.
We have a new, improved string type coming up in AK (OOM aware, no null
state), and while it's going to use UTF-8, the name UTF8String is a
mouthful - so let's free up the String name by renaming the existing
class.
Making the old one have an annoying name will hopefully also help with
quick adoption :^)
Apologies for the enormous commit, but I don't see a way to split this
up nicely. In the vast majority of cases it's a simple change. A few
extra places can use TRY instead of manual error checking though. :^)
In the long-term, we should probably have a way to signal decoding
failure. For now, it should suffice to at least not crash. This is
particularly relevant because apparently this can be triggered while
parsing a PEM certificate, which happens during every TLS connection.
Found by OSS Fuzz
https://bugs.chromium.org/p/oss-fuzz/issues/detail?id=38979
It's prone to finding "technically uninitialized but can never happen"
cases, particularly in Optional<T> and Variant<Ts...>.
The general case seems to be that it cannot infer the dependency
between Variant's index (or Optional's boolean state) and a particular
alternative (or Optional's buffer) being untouched.
So it can flag cases like this:
```c++
if (index == StaticIndexForF)
new (new_buffer) F(move(*bit_cast<F*>(old_buffer)));
```
The code in that branch can _technically_ make a partially initialized
`F`, but that path can never be taken since the buffer holding an
object of type `F` and the condition being true are correlated, and so
will never be taken _unless_ the buffer holds an object of type `F`.
This commit also removed the various 'diagnostic ignored' pragmas used
to work around this warning, as they no longer do anything.
We were accidentally calling calculate_base64_decoded_length instead,
which resulted in extra allocations during the StringBuilder::append
calls that can be avoided.
Adding -fno-semantic-interposition to the GCC command
line caused this new warning.
I don't see how output.data() could be uninitialized here. Also,
commenting out the ensure_capacity() call for the Vector
also gets rid of this warning.
SPDX License Identifiers are a more compact / standardized
way of representing file license information.
See: https://spdx.dev/resources/use/#identifiers
This was done with the `ambr` search and replace tool.
ambr --no-parent-ignore --key-from-file --rep-from-file key.txt rep.txt *
Mostly due to the fact that clang-format allows aligned comments via
AlignTrailingComments.
We could also use raw string literals in inline asm, which clang-format
deals with properly (and would be nicer in a lot of places).
Problem:
- Output of decode and encode grow as the decode and encode
happen. This is inefficient because a large size will require many
reallocations.
- `const` qualifiers are missing on variables which are not intended
to change.
Solution:
- Since the size of the decoded or encoded message is known prior to
starting, calculate the size and set the output to that size
immediately. All appends will not incur the reallocation overhead.
- Add `const` qualifiers to show intent.
Problem:
- The Base64 alphabet and lookup table are initialized at
run-time. This results in an initial start-up cost as well as a
boolean evaluation and branch every time the function is called.
Solution:
- Provide `constexpr` functions which initialize the alphabet and
lookup table at compile-time. These can be called and assigned to a
`constexpr` variable so that there is no run-time cost associated
with the initialization or lookup.
This enables a nice warning in case a function becomes dead code. Also, add forgotten
header to Base64.cpp, which would cause an issue later when we enable -Wmissing-declarations.