The Unicode spec defines much more complicated caseless matching
algorithms in its Collation spec. This implements the "basic" case
folding comparison.
These are formatters that can only be used with debug print
functions, such as dbgln(). Currently this is limited to
Formatter<ErrorOr<T>>. With this you can still debug log ErrorOr
values (good for debugging), but trying to use them in any
String::formatted() call will fail (which prevents .to_string()
errors with the new failable strings being ignored).
You make a formatter debug only by adding a constexpr method like:
static constexpr bool is_debug_only() { return true; }
This implements a FlyString that will de-duplicate String instances. The
FlyString will store the raw encoded data of the String instance: If the
String is a short string, FlyString holds the String::ShortString bytes;
otherwise FlyString holds a pointer to the Detail::StringData.
FlyString itself does not know about String's storage or how to refcount
its Detail::StringData. It defers to String to implement these details.
Since AK can't refer to LibUnicode directly, the strategy here is that
if you need case transformations, you can link LibUnicode and receive
them. If you try to use either of these methods without linking it, then
you'll of course get a linker error (note we don't do any fallbacks to
e.g. ASCII case transformations). If you don't need these methods, you
don't have to link LibUnicode.
s p a c e s h i p o p e r a t o r
Comparing UTF-8 can be done by simple byte lexicographic comparison per
definition, so we just piggy-back on StringView's high-performance
comparator.
DeprecatedString (formerly String) has been with us since the start,
and it has served us well. However, it has a number of shortcomings
that I'd like to address.
Some of these issues are hard if not impossible to solve incrementally
inside of DeprecatedString, so instead of doing that, let's build a new
String class and then incrementally move over to it instead.
Problems in DeprecatedString:
- It assumes string allocation never fails. This makes it impossible
to use in allocation-sensitive contexts, and is the reason we had to
ban DeprecatedString from the kernel entirely.
- The awkward null state. DeprecatedString can be null. It's different
from the empty state, although null strings are considered empty.
All code is immediately nicer when using Optional<DeprecatedString>
but DeprecatedString came before Optional, which is how we ended up
like this.
- The encoding of the underlying data is ambiguous. For the most part,
we use it as if it's always UTF-8, but there have been cases where
we pass around strings in other encodings (e.g ISO8859-1)
- operator[] and length() are used to iterate over DeprecatedString one
byte at a time. This is done all over the codebase, and will *not*
give the right results unless the string is all ASCII.
How we solve these issues in the new String:
- Functions that may allocate now return ErrorOr<String> so that ENOMEM
errors can be passed to the caller.
- String has no null state. Use Optional<String> when needed.
- String is always UTF-8. This is validated when constructing a String.
We may need to add a bypass for this in the future, for cases where
you have a known-good string, but for now: validate all the things!
- There is no operator[] or length(). You can get the underlying data
with bytes(), but for iterating over code points, you should be using
an UTF-8 iterator.
Furthermore, it has two nifty new features:
- String implements a small string optimization (SSO) for strings that
can fit entirely within a pointer. This means up to 3 bytes on 32-bit
platforms, and 7 bytes on 64-bit platforms. Such small strings will
not be heap-allocated.
- String can create substrings without making a deep copy of the
substring. Instead, the superstring gets +1 refcount from the
substring, and it acts like a view into the superstring. To make
substrings like this, use the substring_with_shared_superstring() API.
One caveat:
- String does not guarantee that the underlying data is null-terminated
like DeprecatedString does today. While this was nifty in a handful of
places where we were calling C functions, it did stand in the way of
shared-superstring substrings.
We have a new, improved string type coming up in AK (OOM aware, no null
state), and while it's going to use UTF-8, the name UTF8String is a
mouthful - so let's free up the String name by renaming the existing
class.
Making the old one have an annoying name will hopefully also help with
quick adoption :^)
This patch adds the `USING_AK_GLOBALLY` macro which is enabled by
default, but can be overridden by build flags.
This is a step towards integrating Jakt and AK types.
C++20 can automatically synthesize `operator!=` from `operator==`, so
there is no point in writing such functions by hand if all they do is
call through to `operator==`.
This fixes a compile error with compilers that implement P2468 (Clang
16 currently). This paper restores the C++17 behavior that if both
`T::operator==(U)` and `T::operator!=(U)` exist, `U == T` won't be
rewritten in reverse to call `T::operator==(U)`. Removing `!=` operators
makes the rewriting possible again.
See https://reviews.llvm.org/D134529#3853062
These are guarded with #ifndef KERNEL, since doubles (and floats) are
not allowed in KERNEL mode.
In StringUtils there is convert_to_floating_point which does have a
template parameter incase you have a templated type.
During the removal of StringView(char const*), all users of these
functions were removed, and they are of dubious value (relying on
implicit StringView conversion).
This commit has no behavior changes.
In particular, this does not fix any of the wrong uses of the previous
default parameter (which used to be 'false', meaning "only replace the
first occurence in the string"). It simply replaces the default uses by
String::replace(..., ReplaceMode::FirstOnly), leaving them incorrect.
This removes the awkward String::replace API which was the only String
API which mutated the String and replaces it with a new immutable
version that returns a new String with the replacements applied. This
also fixes a couple of UAFs that were caused by the use of this API.
As an optimization an equivalent StringView::replace API was also added
to remove an unnecessary String allocations in the format of:
`String { view }.replace(...);`
This was needlessly copying StringView arguments, and was also using
strstr internally, which meant it was doing a bunch of unnecessary
strlen calls on it. This also moves the implementation to StringUtils
to allow API consistency between String and StringView.
This implements StringUtils::find_any_of() and uses it in
String::find_any_of() and StringView::find_any_of(). All uses of
find_{first,last}_of have been replaced with find_any_of(), find() or
find_last(). find_{first,last}_of have subsequently been removed.
This adds the String::find_last() as wrapper for StringUtils::find_last,
which is another step in harmonizing the String and StringView APIs
where possible.
This also inlines the find() methods, as they are simple wrappers around
StringUtils functions without any additional logic.
This implements the StringView::find_all() method by re-implemeting the
current method existing for String in StringUtils, and using that
implementation for both String and StringView.
The rewrite uses memmem() instead of strstr(), so the String::find_all()
argument type has been changed from String to StringView, as the null
byte is no longer required.
Previously this was generating a crazy number of symbols, and it was
also pretty-damn-slow as it was defined recursively, which made the
compiler incapable of inlining it (due to the many many layers of
recursion before it terminated).
This commit replaces the recursion with a pack expansion and marks it
always-inline.
We had two functions for doing mostly the same thing. Combine both
of them into String::find() and use that everywhere.
Also add some tests to cover basic behavior.
This allows everybody to create a String version of their number
in a arbitrary bijective base. Bijective base meaning that the mapping
doesn't have a 0. In the usual mapping to the alphabet the follower
after 'Z' is 'AA'.
The mapping using the (uppercase) alphabet is used as a standard but
can be overridden specifying 'base' and 'map'.
The code was directly yanked from the Spreadsheet.
SPDX License Identifiers are a more compact / standardized
way of representing file license information.
See: https://spdx.dev/resources/use/#identifiers
This was done with the `ambr` search and replace tool.
ambr --no-parent-ignore --key-from-file --rep-from-file key.txt rep.txt *
This commit makes the user-facing StdLibExtras templates and utilities
arguably more nice-looking by removing the need to reach into the
wrapper structs generated by them to get the value/type needed.
The C++ standard library had to invent `_v` and `_t` variants (likely
because of backwards compat), but we don't need to cater to any codebase
except our own, so might as well have good things for free. :^)
This is basically just for consistency, it's quite strange to see
multiple AK container types next to each other, some with and some
without the namespace prefix - we're 'using AK::Foo;' a lot and should
leverage that. :^)
This is an improved version of WrapperGenerator's snake_name(), which
seems like the kind of thing that could be useful elsewhere but would
end up getting duplicated - so let's add this to AK::String instead,
like to_{lowercase,uppercase}().
This checks the following things:
- No unclosed braces in format string
`dbgln("a:{}}", a)` where the '}}' would be interpreted as a
literal '}'
`dbgln("a:{", a)` where someone with a faulty keyboard like mine
could generate
- No extra closed braces in format string
`dbgln("a:{{}", a)` where the '{{' would interpreted as a literal '{'
`dbgln("a:}", a)` where someone with a faulty keyboard could
generate
- No references to nonexistent arguments
`dbgln("a:{} b:{}", a)` where the value of `b` is not in the
arguments list
- No unconsumed argument
`dbgln("a:{1}", not_used, 1)` where `not_used` is extraneous
Problem:
- Many constructors are defined as `{}` rather than using the ` =
default` compiler-provided constructor.
- Some types provide an implicit conversion operator from `nullptr_t`
instead of requiring the caller to default construct. This violates
the C++ Core Guidelines suggestion to declare single-argument
constructors explicit
(https://isocpp.github.io/CppCoreGuidelines/CppCoreGuidelines#c46-by-default-declare-single-argument-constructors-explicit).
Solution:
- Change default constructors to use the compiler-provided default
constructor.
- Remove implicit conversion operators from `nullptr_t` and change
usage to enforce type consistency without conversion.
Add requires clauses to constraints on InputStream and OutputStream
operator<< / operator>>. Make the constraint on String::number a
requires clause instead of SFINAE. Also, fix some unecessary IsSame in
Trie where specialized traits exist for the given use cases.
Use SFINAE to enforce the fact that it's supposed to only be called for
Arithmetic types, rather than counting on the linker to tell us that an
instantiation of String::number(my_arg) was not found. This also adds
String::number for floating point types as a side-effect.
This is a convenience API when you just want the rest of the string
starting at some index. We already had substring_view() in the same
flavor, so this is a complement to that.
With this commit, <AK/Format.h> has a more supportive role and isn't
used directly.
Essentially, there now is a public 'vformat' function ('v' for vector)
which takes already type erased parameters. The name is choosen to
indicate that this function behaves similar to C-style functions taking
a va_list equivalent.
The interface for frontend users are now 'String::formatted' and
'StringBuilder::appendff'.
Consider the following snippet:
void foo(InputStream& stream) {
if(!stream.eof()) {
u8 byte;
stream >> byte;
}
}
There is a very subtle bug in this snippet, for some input streams eof()
might return false even if no more data can be read. In this case an
error flag would be set on the stream.
Until now I've always ensured that this is not the case, but this made
the implementation of eof() unnecessarily complicated.
InputFileStream::eof had to keep a ByteBuffer around just to make this
possible. That meant a ton of unnecessary copies just to get a reliable
eof().
In most cases it isn't actually necessary to have a reliable eof()
implementation.
In most other cases a reliable eof() is avaliable anyways because in
some cases like InputMemoryStream it is very easy to implement.
This is a strcpy()-like method with actually sane semantics:
* It accepts a non-empty buffer along with its size in bytes.
* It copies as much of the string as fits into the buffer.
* It always null-terminates the result.
* It returns, as a non-discardable boolean, whether the whole string has been
copied.
Intended usage looks like this:
bool fits = string.copy_characters_to_buffer(buffer, sizeof(buffer));
and then either
if (!fits) {
fprintf(stderr, "The name does not fit!!11");
return nullptr;
}
or, if you're sure the buffer is large enough,
// I'm totally sure it fits because [reasons go here].
ASSERT(fits);
or if you're feeling extremely adventurous,
(void)fits;
but don't do that, please.
Get rid of the weird old signature:
- int StringType::to_int(bool& ok) const
And replace it with sensible new signature:
- Optional<int> StringType::to_int() const
FileSystemPath::has_extension was jumping through hoops and allocating
memory to do a case insensitive comparison needlessly. Extend the
existing String::ends_with method to allow the caller to specify the
case sensitivity required.
As suggested by @awesomekling in a code review and (initially) ignored
by me :^)
Implementation is roughly based on LibJS's trim_string(), but with a fix
for trimming all-whitespace strings.