UTF8Decoder was already converting invalid data into replacement
characters while converting, so we know for sure we have valid UTF-8
by the time conversion is finished.
This patch adds a new StringBuilder::to_string_without_validation()
and uses it to make UTF8Decoder avoid half the work it was doing.
Instead of using a StringBuilder, add a String::repeated(String, N)
overload that takes advantage of knowing it's already all UTF-8.
This makes the following microbenchmark go 4x faster:
"foo".repeat(100_000_000)
And for single character strings, we can even go 10x faster:
"x".repeat(100_000_000)
This commit un-deprecates DeprecatedString, and repurposes it as a byte
string.
As the null state has already been removed, there are no other
particularly hairy blockers in repurposing this type as a byte string
(what it _really_ is).
This commit is auto-generated:
$ xs=$(ack -l \bDeprecatedString\b\|deprecated_string AK Userland \
Meta Ports Ladybird Tests Kernel)
$ perl -pie 's/\bDeprecatedString\b/ByteString/g;
s/deprecated_string/byte_string/g' $xs
$ clang-format --style=file -i \
$(git diff --name-only | grep \.cpp\|\.h)
$ gn format $(git ls-files '*.gn' '*.gni')
Now that ""_string is infallible, the only benefit of explicitly
constructing a short string is the ability to do it at compile-time. But
we never do that, so let's simplify the API and remove this
implementation detail from it.
This was missed in 02b74e5a70
We need to disable consteval in AK::String as well as AK::StringView,
and we need to disable it when building both the tools build and the
fuzzer build.
Calling `from_utf8` with a DeprecatedString will hide the fact that we
have a DeprecatedString, while using `from_deprecated_string` with a
StringView will silently and needlessly allocate a DeprecatedString,
so let's forbid that.
We currently fully casefold the left- and right-hand sides to compare
two strings with case-insensitivity. Now, we casefold one code point at
a time, storing the result in a view for comparison, until we exhaust
both strings.
We briefly discussed this when adding the new String type but couldn't
settle on a name. However, having to use String::from_utf8() on every
literal string is a bit unwieldy, so let's have these options available!
Naming-wise '_string' is not as short as 'sv' but should be relatively
clear; it also matches '_bigint' and '_ubigint' in length.
'_short_string' may be longer than the actual string itself, but it's
still an improvement over the static function :^)
Since our C++ source files are UTF-8 encoded anyway, it should be
impossible to create a string literal with invalid UTF-8, so including
that in the name is not as important as in the function that can receive
arbitrary data.
In cases where we know a string literal will fit in the short string
storage, we can do so at compile time without needing to handle error
propagation. If the provided string literal is too long, a compilation
error will be emitted due to the failed VERIFY statement being a non-
constant expression.
The Unicode spec defines much more complicated caseless matching
algorithms in its Collation spec. This implements the "basic" case
folding comparison.
These are formatters that can only be used with debug print
functions, such as dbgln(). Currently this is limited to
Formatter<ErrorOr<T>>. With this you can still debug log ErrorOr
values (good for debugging), but trying to use them in any
String::formatted() call will fail (which prevents .to_string()
errors with the new failable strings being ignored).
You make a formatter debug only by adding a constexpr method like:
static constexpr bool is_debug_only() { return true; }