Commit graph

50 commits

Author SHA1 Message Date
Dennis Camera
b54a1c6284 AK: Implement ShortString for big-endian 2024-07-05 09:49:23 -06:00
Timothy Flynn
5cf818e305 LibUnicode: Replace case transformations and comparison with ICUs
There are a couple of differences here due to using ICU:

1. Titlecasing behaves slightly differently. We previously transformed
   "123dollars" to "123Dollars", as we would use word segmentation to
   split a string into words, then transform the first cased character
   to titlecase. ICU doesn't go quite that far, and leaves the string
   as "123dollars". While this is a behavior change, the only user of
   this API is the `text-transform: capitalize;` CSS rule, and we now
   match the behavior of other browsers.

2. There isn't an API to compare strings with case insensitivity without
   allocating case-folded strings for both the left- and right-hand-side
   strings. Our implementation was previously allocation-free; however,
   in a benchmark, ICU is still ~1.4x faster.
2024-06-20 10:59:55 +02:00
Timothy Flynn
fe3fde2411 AK+LibUnicode: Implement a case-insensitive variant of find_byte_offset
The existing String::find_byte_offset is case-sensitive. This variant
allows performing searches using Unicode-aware case folding.
2024-06-01 07:37:54 +02:00
Shannon Booth
d777b279e3 LibUnicode+Tests: Remove now unused to_unicode_*_full methods
Relocating all of the tests for these in LibUnicode over to the AK
String testsuite.
2023-11-28 17:15:27 -05:00
Timothy Flynn
6aa334767f AK: Ensure assigned-to Strings are dereferenced if needed
If we assign to an existing non-short string, we must dereference its
StringData object to prevent leaking that data.
2023-11-28 16:38:18 +01:00
Lucas CHOLLET
fde26c53f0 AK: Remove the API to explicitly construct short strings
Now that ""_string is infallible, the only benefit of explicitly
constructing a short string is the ability to do it at compile-time. But
we never do that, so let's simplify the API and remove this
implementation detail from it.
2023-08-08 07:37:21 +02:00
Lucas CHOLLET
3f35ffb648 Userland: Prefer _string over _short_string
As `_string` can't fail anymore (since 3434412), there are no real
benefits to use the short variant in most cases.
2023-08-08 07:37:21 +02:00
Andreas Kling
34344120f2 AK: Make "foo"_string infallible
Stop worrying about tiny OOMs.

Work towards #20405.
2023-08-07 16:03:27 +02:00
Tim Schumacher
ae51c1821c Everywhere: Remove unintentional partial stream reads and writes 2023-03-13 15:16:20 +00:00
Tim Schumacher
d5871f5717 AK: Rename Stream::{read,write} to Stream::{read_some,write_some}
Similar to POSIX read, the basic read and write functions of AK::Stream
do not have a lower limit of how much data they read or write (apart
from "none at all").

Rename the functions to "read some [data]" and "write some [data]" (with
"data" being omitted, since everything here is reading and writing data)
to make them sufficiently distinct from the functions that ensure to
use the entire buffer (which should be the go-to function for most
usages).

No functional changes, just a lot of new FIXMEs.
2023-03-13 15:16:20 +00:00
Timothy Flynn
1393ed2000 AK+LibUnicode: Implement String::equals_ignoring_case without allocating
We currently fully casefold the left- and right-hand sides to compare
two strings with case-insensitivity. Now, we casefold one code point at
a time, storing the result in a view for comparison, until we exhaust
both strings.
2023-03-08 18:57:53 +00:00
Timothy Flynn
515fca4f7a AK: Make String::contains(code_point) handle non-ASCII
We currently only accept a char, instead of a full code point.
2023-03-08 14:16:47 +00:00
Timothy Flynn
f882581e91 AK: Make String::{starts,ends}_with(code_point) handle non-ASCII
We currently pass the code point to StringView::{starts,ends}_with,
which actually accepts a single char, thus cannot handle non-ASCII
code points.
2023-03-08 14:16:47 +00:00
Timothy Flynn
da0d000909 AK: Ensure short String instances are valid UTF-8
We are currently only validating long strings.
2023-03-03 11:46:42 -05:00
Linus Groh
09d40bfbb2 Everywhere: Use _{short_,}string to create Strings from literals 2023-02-25 20:51:49 +01:00
Andrew Kaster
0ea697ace5 AK: Add String::from_stream method
The caller is responsible for determining how long the string is that
they want to read.
2023-02-21 10:57:44 +01:00
Andreas Kling
d0697d350d AK: Fix 64-bit alignment issue in shared-superstring substrings
Thanks to Timothy Flynn for the test!

Fixes #17141
2023-02-18 09:12:46 -05:00
Timothy Flynn
5cbf054651 LibUnicode: Fix typos causing text segmentation on mid-word punctuation
For example the words "can't" and "32.3" should not have boundaries
detected on the "'" and "." code points, respectively.

The String test cases fixed here are because "b'ar" is now considered
one word.
2023-02-15 12:36:47 +01:00
Timothy Flynn
c59268d15b AK: Add String::trim 2023-01-28 00:13:46 +00:00
Timothy Flynn
cccaa94767 AK: Add String::join 2023-01-28 00:13:46 +00:00
Timothy Flynn
c35b1371a3 AK: Add an overload of String::find_byte_offset for StringView 2023-01-27 18:00:17 +00:00
Timothy Flynn
427b82065c AK: Add a method to create a String with a repeated code point 2023-01-24 16:23:50 -05:00
Timothy Flynn
d50724956e AK: Add a method to find the byte offset of a code point 2023-01-24 16:23:50 -05:00
Timothy Flynn
12c8bc3e85 AK: Add a String factory to create a string from a single code point 2023-01-22 01:03:13 +00:00
martinfalisse
aec2dadfdd AK: Add split() for String 2023-01-21 14:35:00 +01:00
Timothy Flynn
d48266a420 AK: Support creating known short string literals at compile time
In cases where we know a string literal will fit in the short string
storage, we can do so at compile time without needing to handle error
propagation. If the provided string literal is too long, a compilation
error will be emitted due to the failed VERIFY statement being a non-
constant expression.
2023-01-20 14:24:12 -05:00
Timothy Flynn
537fcaf59e AK+LibUnicode: Provide Unicode-aware caseless String matching
The Unicode spec defines much more complicated caseless matching
algorithms in its Collation spec. This implements the "basic" case
folding comparison.
2023-01-18 14:43:40 +00:00
Timothy Flynn
d6ddca0c0f AK+LibUnicode: Provide Unicode-aware String titlecase transformation 2023-01-16 18:33:44 -05:00
Timothy Flynn
bd9b65e82f AK: Add String::is_one_of for variadic string comparison 2023-01-15 01:00:20 +00:00
Timothy Flynn
9db9b2f9be AK: Add a somewhat naive implementation of String::reverse
This will reverse the String's code points (i.e. not just its bytes),
but is not aware of grapheme clusters.
2023-01-15 01:00:20 +00:00
Timothy Flynn
6fcc1c7426 AK+LibUnicode: Provide Unicode-aware String case transformations
Since AK can't refer to LibUnicode directly, the strategy here is that
if you need case transformations, you can link LibUnicode and receive
them. If you try to use either of these methods without linking it, then
you'll of course get a linker error (note we don't do any fallbacks to
e.g. ASCII case transformations). If you don't need these methods, you
don't have to link LibUnicode.
2023-01-09 19:23:46 -07:00
Maciej
58f5deba70 AK: Unref old m_data in String's move assignment
We were overridding the data pointer without unreffing it,
causing a memory leak when assigning a String.
2022-12-09 00:02:53 +01:00
Andreas Kling
a3e82eaad3 AK: Introduce the new String, replacement for DeprecatedString
DeprecatedString (formerly String) has been with us since the start,
and it has served us well. However, it has a number of shortcomings
that I'd like to address.

Some of these issues are hard if not impossible to solve incrementally
inside of DeprecatedString, so instead of doing that, let's build a new
String class and then incrementally move over to it instead.

Problems in DeprecatedString:

- It assumes string allocation never fails. This makes it impossible
  to use in allocation-sensitive contexts, and is the reason we had to
  ban DeprecatedString from the kernel entirely.

- The awkward null state. DeprecatedString can be null. It's different
  from the empty state, although null strings are considered empty.
  All code is immediately nicer when using Optional<DeprecatedString>
  but DeprecatedString came before Optional, which is how we ended up
  like this.

- The encoding of the underlying data is ambiguous. For the most part,
  we use it as if it's always UTF-8, but there have been cases where
  we pass around strings in other encodings (e.g ISO8859-1)

- operator[] and length() are used to iterate over DeprecatedString one
  byte at a time. This is done all over the codebase, and will *not*
  give the right results unless the string is all ASCII.

How we solve these issues in the new String:

- Functions that may allocate now return ErrorOr<String> so that ENOMEM
  errors can be passed to the caller.

- String has no null state. Use Optional<String> when needed.

- String is always UTF-8. This is validated when constructing a String.
  We may need to add a bypass for this in the future, for cases where
  you have a known-good string, but for now: validate all the things!

- There is no operator[] or length(). You can get the underlying data
  with bytes(), but for iterating over code points, you should be using
  an UTF-8 iterator.

Furthermore, it has two nifty new features:

- String implements a small string optimization (SSO) for strings that
  can fit entirely within a pointer. This means up to 3 bytes on 32-bit
  platforms, and 7 bytes on 64-bit platforms. Such small strings will
  not be heap-allocated.

- String can create substrings without making a deep copy of the
  substring. Instead, the superstring gets +1 refcount from the
  substring, and it acts like a view into the superstring. To make
  substrings like this, use the substring_with_shared_superstring() API.

One caveat:

- String does not guarantee that the underlying data is null-terminated
  like DeprecatedString does today. While this was nifty in a handful of
  places where we were calling C functions, it did stand in the way of
  shared-superstring substrings.
2022-12-06 15:21:26 +01:00
Linus Groh
6e19ab2bbc AK+Everywhere: Rename String to DeprecatedString
We have a new, improved string type coming up in AK (OOM aware, no null
state), and while it's going to use UTF-8, the name UTF8String is a
mouthful - so let's free up the String name by renaming the existing
class.
Making the old one have an annoying name will hopefully also help with
quick adoption :^)
2022-12-06 08:54:33 +01:00
demostanis
3e8b5ac920 AK+Everywhere: Turn bool keep_empty to an enum in split* functions 2022-10-24 23:29:18 +01:00
sin-ack
3f3f45580a Everywhere: Add sv suffix to strings relying on StringView(char const*)
Each of these strings would previously rely on StringView's char const*
constructor overload, which would call __builtin_strlen on the string.
Since we now have operator ""sv, we can replace these with much simpler
versions. This opens the door to being able to remove
StringView(char const*).

No functional changes.
2022-07-12 23:11:35 +02:00
DexesTTP
7ceeb74535 AK: Use an enum instead of a bool for String::replace(all_occurences)
This commit has no behavior changes.

In particular, this does not fix any of the wrong uses of the previous
default parameter (which used to be 'false', meaning "only replace the
first occurence in the string"). It simply replaces the default uses by
String::replace(..., ReplaceMode::FirstOnly), leaving them incorrect.
2022-07-06 11:12:45 +02:00
Daniel Bertalan
e15d6125b2 Tests: Move sprintf test from AK/ to LibC/
This test doesn't test AK::String, but LibC's sprintf instead, so it
does not belong in `Tests/AK`. This also means this test won't be ran on
Lagom using the host OS's printf implementation.

Fixes a deprecated declaration warning when compiling with macOS SDK 13.
2022-07-04 21:46:02 +02:00
Daniel Bertalan
8473f6caee AK+Tests: Make null strings compare less than non-null strings
This behavior regressed in ca58c71faa.

Fixes #12213
2022-01-30 17:23:02 +00:00
Andreas Kling
79ee846f3d AK: Disable the empty-string-vs-null-string test until we have a fix 2022-01-30 16:21:59 +01:00
networkException
1921a166e5 Tests: Add test for null string and empty string to be unequal
See #12213
2022-01-30 15:24:35 +01:00
Matt Jacobson
47e8d58553 AK: Fix logic in String::operator>(const String&)
Null strings should not compare greater than non-null strings.

Add tests for >, <, >=, and <= comparison involving null strings.
2022-01-16 11:08:23 +01:00
Andreas Kling
5f7d008791 AK+Everywhere: Stop including Vector.h from StringView.h
Preparation for using Error.h from Vector.h. This required moving some
things out of line.
2021-11-10 21:58:58 +01:00
Idan Horowitz
6704961c82 AK: Replace the mutable String::replace API with an immutable version
This removes the awkward String::replace API which was the only String
API which mutated the String and replaces it with a new immutable
version that returns a new String with the replacements applied. This
also fixes a couple of UAFs that were caused by the use of this API.

As an optimization an equivalent StringView::replace API was also added
to remove an unnecessary String allocations in the format of:
`String { view }.replace(...);`
2021-09-11 20:36:43 +03:00
Mandar Kulkarni
aaf232f903 Tests: Add test for String::bijective_base_from() 2021-08-09 14:14:07 +04:30
Tobias Christiansen
f35c25a7eb Tests: Add test for String::roman_number_from() 2021-07-04 22:17:03 +02:00
Max Wipfli
4b87dd5c5c Tests: Add test for String::find with empty needle
This adds a test case for String::find and String::find_all with empty
needles. The expected behavior is in line with what the C++ standard
library (and other languages standard libraries) expect.
2021-07-02 21:54:21 +02:00
Andreas Kling
de395a3df2 AK+Everywhere: Consolidate String::index_of() and String::find()
We had two functions for doing mostly the same thing. Combine both
of them into String::find() and use that everywhere.

Also add some tests to cover basic behavior.
2021-05-24 11:59:18 +02:00
Maciej Zygmanowski
80077cea86 AK: Add String::find_all() and String::count() 2021-05-19 20:51:51 +01:00
Brian Gianforcaro
67322b0702 Tests: Move AK tests to Tests/AK 2021-05-06 17:54:28 +02:00
Renamed from AK/Tests/TestString.cpp (Browse further)