2020-03-02 13:23:11 +00:00
|
|
|
/*
|
AK: Introduce the new String, replacement for DeprecatedString
DeprecatedString (formerly String) has been with us since the start,
and it has served us well. However, it has a number of shortcomings
that I'd like to address.
Some of these issues are hard if not impossible to solve incrementally
inside of DeprecatedString, so instead of doing that, let's build a new
String class and then incrementally move over to it instead.
Problems in DeprecatedString:
- It assumes string allocation never fails. This makes it impossible
to use in allocation-sensitive contexts, and is the reason we had to
ban DeprecatedString from the kernel entirely.
- The awkward null state. DeprecatedString can be null. It's different
from the empty state, although null strings are considered empty.
All code is immediately nicer when using Optional<DeprecatedString>
but DeprecatedString came before Optional, which is how we ended up
like this.
- The encoding of the underlying data is ambiguous. For the most part,
we use it as if it's always UTF-8, but there have been cases where
we pass around strings in other encodings (e.g ISO8859-1)
- operator[] and length() are used to iterate over DeprecatedString one
byte at a time. This is done all over the codebase, and will *not*
give the right results unless the string is all ASCII.
How we solve these issues in the new String:
- Functions that may allocate now return ErrorOr<String> so that ENOMEM
errors can be passed to the caller.
- String has no null state. Use Optional<String> when needed.
- String is always UTF-8. This is validated when constructing a String.
We may need to add a bypass for this in the future, for cases where
you have a known-good string, but for now: validate all the things!
- There is no operator[] or length(). You can get the underlying data
with bytes(), but for iterating over code points, you should be using
an UTF-8 iterator.
Furthermore, it has two nifty new features:
- String implements a small string optimization (SSO) for strings that
can fit entirely within a pointer. This means up to 3 bytes on 32-bit
platforms, and 7 bytes on 64-bit platforms. Such small strings will
not be heap-allocated.
- String can create substrings without making a deep copy of the
substring. Instead, the superstring gets +1 refcount from the
substring, and it acts like a view into the superstring. To make
substrings like this, use the substring_with_shared_superstring() API.
One caveat:
- String does not guarantee that the underlying data is null-terminated
like DeprecatedString does today. While this was nifty in a handful of
places where we were calling C functions, it did stand in the way of
shared-superstring substrings.
2022-12-01 12:27:43 +00:00
|
|
|
* Copyright (c) 2018-2022, Andreas Kling <awesomekling@gmail.com>
|
2020-03-02 13:23:11 +00:00
|
|
|
* Copyright (c) 2020, Fei Wu <f.eiwu@yahoo.com>
|
|
|
|
*
|
2021-04-22 08:24:48 +00:00
|
|
|
* SPDX-License-Identifier: BSD-2-Clause
|
2020-03-02 13:23:11 +00:00
|
|
|
*/
|
|
|
|
|
2024-06-17 22:12:53 +00:00
|
|
|
#include <AK/ByteString.h>
|
2021-07-06 11:46:46 +00:00
|
|
|
#include <AK/CharacterTypes.h>
|
2024-06-17 22:12:53 +00:00
|
|
|
#include <AK/FloatingPointStringConversions.h>
|
2021-01-12 19:58:45 +00:00
|
|
|
#include <AK/MemMem.h>
|
2020-06-12 19:07:52 +00:00
|
|
|
#include <AK/Optional.h>
|
AK: Introduce the new String, replacement for DeprecatedString
DeprecatedString (formerly String) has been with us since the start,
and it has served us well. However, it has a number of shortcomings
that I'd like to address.
Some of these issues are hard if not impossible to solve incrementally
inside of DeprecatedString, so instead of doing that, let's build a new
String class and then incrementally move over to it instead.
Problems in DeprecatedString:
- It assumes string allocation never fails. This makes it impossible
to use in allocation-sensitive contexts, and is the reason we had to
ban DeprecatedString from the kernel entirely.
- The awkward null state. DeprecatedString can be null. It's different
from the empty state, although null strings are considered empty.
All code is immediately nicer when using Optional<DeprecatedString>
but DeprecatedString came before Optional, which is how we ended up
like this.
- The encoding of the underlying data is ambiguous. For the most part,
we use it as if it's always UTF-8, but there have been cases where
we pass around strings in other encodings (e.g ISO8859-1)
- operator[] and length() are used to iterate over DeprecatedString one
byte at a time. This is done all over the codebase, and will *not*
give the right results unless the string is all ASCII.
How we solve these issues in the new String:
- Functions that may allocate now return ErrorOr<String> so that ENOMEM
errors can be passed to the caller.
- String has no null state. Use Optional<String> when needed.
- String is always UTF-8. This is validated when constructing a String.
We may need to add a bypass for this in the future, for cases where
you have a known-good string, but for now: validate all the things!
- There is no operator[] or length(). You can get the underlying data
with bytes(), but for iterating over code points, you should be using
an UTF-8 iterator.
Furthermore, it has two nifty new features:
- String implements a small string optimization (SSO) for strings that
can fit entirely within a pointer. This means up to 3 bytes on 32-bit
platforms, and 7 bytes on 64-bit platforms. Such small strings will
not be heap-allocated.
- String can create substrings without making a deep copy of the
substring. Instead, the superstring gets +1 refcount from the
substring, and it acts like a view into the superstring. To make
substrings like this, use the substring_with_shared_superstring() API.
One caveat:
- String does not guarantee that the underlying data is null-terminated
like DeprecatedString does today. While this was nifty in a handful of
places where we were calling C functions, it did stand in the way of
shared-superstring substrings.
2022-12-01 12:27:43 +00:00
|
|
|
#include <AK/String.h>
|
2021-02-20 21:39:22 +00:00
|
|
|
#include <AK/StringBuilder.h>
|
2020-02-26 07:25:24 +00:00
|
|
|
#include <AK/StringUtils.h>
|
|
|
|
#include <AK/StringView.h>
|
2020-10-25 05:34:39 +00:00
|
|
|
#include <AK/Vector.h>
|
2024-06-17 22:12:53 +00:00
|
|
|
#include <string.h>
|
2022-02-15 22:24:43 +00:00
|
|
|
|
2020-02-26 07:25:24 +00:00
|
|
|
namespace AK {
|
|
|
|
|
|
|
|
namespace StringUtils {
|
|
|
|
|
2021-11-10 23:55:02 +00:00
|
|
|
bool matches(StringView str, StringView mask, CaseSensitivity case_sensitivity, Vector<MaskSpan>* match_spans)
|
2020-03-22 12:04:04 +00:00
|
|
|
{
|
2020-10-25 05:34:39 +00:00
|
|
|
auto record_span = [&match_spans](size_t start, size_t length) {
|
|
|
|
if (match_spans)
|
|
|
|
match_spans->append({ start, length });
|
|
|
|
};
|
|
|
|
|
2020-03-22 12:04:04 +00:00
|
|
|
if (str.is_null() || mask.is_null())
|
|
|
|
return str.is_null() && mask.is_null();
|
|
|
|
|
2021-09-27 17:17:56 +00:00
|
|
|
if (mask == "*"sv) {
|
2020-10-25 05:34:39 +00:00
|
|
|
record_span(0, str.length());
|
|
|
|
return true;
|
|
|
|
}
|
|
|
|
|
2022-04-01 17:58:27 +00:00
|
|
|
char const* string_ptr = str.characters_without_null_termination();
|
|
|
|
char const* string_start = str.characters_without_null_termination();
|
|
|
|
char const* string_end = string_ptr + str.length();
|
|
|
|
char const* mask_ptr = mask.characters_without_null_termination();
|
|
|
|
char const* mask_end = mask_ptr + mask.length();
|
2020-03-22 12:04:04 +00:00
|
|
|
|
2020-10-25 05:34:39 +00:00
|
|
|
while (string_ptr < string_end && mask_ptr < mask_end) {
|
|
|
|
auto string_start_ptr = string_ptr;
|
|
|
|
switch (*mask_ptr) {
|
|
|
|
case '*':
|
2021-09-27 17:17:56 +00:00
|
|
|
if (mask_ptr == mask_end - 1) {
|
2020-10-25 05:34:39 +00:00
|
|
|
record_span(string_ptr - string_start, string_end - string_ptr);
|
2020-03-22 12:04:04 +00:00
|
|
|
return true;
|
2020-10-25 05:34:39 +00:00
|
|
|
}
|
2021-09-27 17:17:56 +00:00
|
|
|
while (string_ptr < string_end && !matches({ string_ptr, static_cast<size_t>(string_end - string_ptr) }, { mask_ptr + 1, static_cast<size_t>(mask_end - mask_ptr - 1) }, case_sensitivity))
|
2020-10-25 05:34:39 +00:00
|
|
|
++string_ptr;
|
|
|
|
record_span(string_start_ptr - string_start, string_ptr - string_start_ptr);
|
|
|
|
--string_ptr;
|
|
|
|
break;
|
|
|
|
case '?':
|
|
|
|
record_span(string_ptr - string_start, 1);
|
|
|
|
break;
|
2022-09-10 16:14:52 +00:00
|
|
|
case '\\':
|
2022-12-16 18:20:53 +00:00
|
|
|
// if backslash is last character in mask, just treat it as an exact match
|
|
|
|
// otherwise use it as escape for next character
|
|
|
|
if (mask_ptr + 1 < mask_end)
|
|
|
|
++mask_ptr;
|
|
|
|
[[fallthrough]];
|
2020-10-25 05:34:39 +00:00
|
|
|
default:
|
2021-09-27 17:17:56 +00:00
|
|
|
auto p = *mask_ptr;
|
|
|
|
auto ch = *string_ptr;
|
|
|
|
if (case_sensitivity == CaseSensitivity::CaseSensitive ? p != ch : to_ascii_lowercase(p) != to_ascii_lowercase(ch))
|
2020-10-25 05:34:39 +00:00
|
|
|
return false;
|
2020-03-22 12:04:04 +00:00
|
|
|
break;
|
2020-02-26 07:25:24 +00:00
|
|
|
}
|
2020-10-25 05:34:39 +00:00
|
|
|
++string_ptr;
|
|
|
|
++mask_ptr;
|
2020-03-22 12:04:04 +00:00
|
|
|
}
|
2020-02-26 07:25:24 +00:00
|
|
|
|
2020-12-28 20:34:12 +00:00
|
|
|
if (string_ptr == string_end) {
|
|
|
|
// Allow ending '*' to contain nothing.
|
|
|
|
while (mask_ptr != mask_end && *mask_ptr == '*') {
|
|
|
|
record_span(string_ptr - string_start, 0);
|
|
|
|
++mask_ptr;
|
|
|
|
}
|
|
|
|
}
|
|
|
|
|
2020-10-25 05:34:39 +00:00
|
|
|
return string_ptr == string_end && mask_ptr == mask_end;
|
2020-03-22 12:04:04 +00:00
|
|
|
}
|
2020-02-26 07:25:24 +00:00
|
|
|
|
2020-12-10 13:17:30 +00:00
|
|
|
template<typename T>
|
2021-11-10 23:55:02 +00:00
|
|
|
Optional<T> convert_to_int(StringView str, TrimWhitespace trim_whitespace)
|
2020-03-22 12:04:04 +00:00
|
|
|
{
|
2021-06-18 16:21:27 +00:00
|
|
|
auto string = trim_whitespace == TrimWhitespace::Yes
|
|
|
|
? str.trim_whitespace()
|
|
|
|
: str;
|
|
|
|
if (string.is_empty())
|
2020-06-12 19:07:52 +00:00
|
|
|
return {};
|
2020-02-26 07:25:24 +00:00
|
|
|
|
2020-12-20 05:27:33 +00:00
|
|
|
T sign = 1;
|
2020-03-22 12:04:04 +00:00
|
|
|
size_t i = 0;
|
2022-04-01 17:58:27 +00:00
|
|
|
auto const characters = string.characters_without_null_termination();
|
2020-03-22 12:04:04 +00:00
|
|
|
|
|
|
|
if (characters[0] == '-' || characters[0] == '+') {
|
2021-06-18 16:21:27 +00:00
|
|
|
if (string.length() == 1)
|
2020-06-12 19:07:52 +00:00
|
|
|
return {};
|
2020-03-22 12:04:04 +00:00
|
|
|
i++;
|
2020-12-20 05:27:33 +00:00
|
|
|
if (characters[0] == '-')
|
|
|
|
sign = -1;
|
2020-03-22 12:04:04 +00:00
|
|
|
}
|
2020-03-02 13:19:33 +00:00
|
|
|
|
2020-12-10 13:17:30 +00:00
|
|
|
T value = 0;
|
2021-06-18 16:21:27 +00:00
|
|
|
for (; i < string.length(); i++) {
|
2020-06-12 19:07:52 +00:00
|
|
|
if (characters[i] < '0' || characters[i] > '9')
|
|
|
|
return {};
|
2020-12-20 05:27:33 +00:00
|
|
|
|
|
|
|
if (__builtin_mul_overflow(value, 10, &value))
|
|
|
|
return {};
|
|
|
|
|
|
|
|
if (__builtin_add_overflow(value, sign * (characters[i] - '0'), &value))
|
|
|
|
return {};
|
2020-03-22 12:04:04 +00:00
|
|
|
}
|
2020-12-20 05:27:33 +00:00
|
|
|
return value;
|
2020-03-22 12:04:04 +00:00
|
|
|
}
|
2020-03-02 13:19:33 +00:00
|
|
|
|
2021-11-10 23:55:02 +00:00
|
|
|
template Optional<i8> convert_to_int(StringView str, TrimWhitespace);
|
|
|
|
template Optional<i16> convert_to_int(StringView str, TrimWhitespace);
|
|
|
|
template Optional<i32> convert_to_int(StringView str, TrimWhitespace);
|
|
|
|
template Optional<long> convert_to_int(StringView str, TrimWhitespace);
|
|
|
|
template Optional<long long> convert_to_int(StringView str, TrimWhitespace);
|
2020-12-10 13:17:30 +00:00
|
|
|
|
|
|
|
template<typename T>
|
2021-11-10 23:55:02 +00:00
|
|
|
Optional<T> convert_to_uint(StringView str, TrimWhitespace trim_whitespace)
|
2020-03-22 12:04:04 +00:00
|
|
|
{
|
2021-06-18 16:21:27 +00:00
|
|
|
auto string = trim_whitespace == TrimWhitespace::Yes
|
|
|
|
? str.trim_whitespace()
|
|
|
|
: str;
|
|
|
|
if (string.is_empty())
|
2020-06-12 19:07:52 +00:00
|
|
|
return {};
|
2020-03-02 13:19:33 +00:00
|
|
|
|
2020-12-10 13:17:30 +00:00
|
|
|
T value = 0;
|
2022-04-01 17:58:27 +00:00
|
|
|
auto const characters = string.characters_without_null_termination();
|
2020-03-22 12:04:04 +00:00
|
|
|
|
2021-06-18 16:21:27 +00:00
|
|
|
for (size_t i = 0; i < string.length(); i++) {
|
2020-06-12 19:07:52 +00:00
|
|
|
if (characters[i] < '0' || characters[i] > '9')
|
|
|
|
return {};
|
|
|
|
|
2020-12-20 05:27:33 +00:00
|
|
|
if (__builtin_mul_overflow(value, 10, &value))
|
|
|
|
return {};
|
|
|
|
|
|
|
|
if (__builtin_add_overflow(value, characters[i] - '0', &value))
|
|
|
|
return {};
|
2020-03-02 13:19:33 +00:00
|
|
|
}
|
2020-03-22 12:04:04 +00:00
|
|
|
return value;
|
|
|
|
}
|
2020-03-02 13:19:33 +00:00
|
|
|
|
2021-11-10 23:55:02 +00:00
|
|
|
template Optional<u8> convert_to_uint(StringView str, TrimWhitespace);
|
|
|
|
template Optional<u16> convert_to_uint(StringView str, TrimWhitespace);
|
|
|
|
template Optional<u32> convert_to_uint(StringView str, TrimWhitespace);
|
|
|
|
template Optional<unsigned long> convert_to_uint(StringView str, TrimWhitespace);
|
|
|
|
template Optional<unsigned long long> convert_to_uint(StringView str, TrimWhitespace);
|
2020-12-10 13:17:30 +00:00
|
|
|
|
|
|
|
template<typename T>
|
2021-11-10 23:55:02 +00:00
|
|
|
Optional<T> convert_to_uint_from_hex(StringView str, TrimWhitespace trim_whitespace)
|
2020-05-20 18:20:43 +00:00
|
|
|
{
|
2021-06-18 16:21:27 +00:00
|
|
|
auto string = trim_whitespace == TrimWhitespace::Yes
|
|
|
|
? str.trim_whitespace()
|
|
|
|
: str;
|
|
|
|
if (string.is_empty())
|
2020-06-12 19:07:52 +00:00
|
|
|
return {};
|
2020-05-20 18:20:43 +00:00
|
|
|
|
2020-12-10 13:17:30 +00:00
|
|
|
T value = 0;
|
2022-04-01 17:58:27 +00:00
|
|
|
auto const count = string.length();
|
2024-01-03 21:43:01 +00:00
|
|
|
T const upper_bound = NumericLimits<T>::max();
|
2020-05-20 18:20:43 +00:00
|
|
|
|
|
|
|
for (size_t i = 0; i < count; i++) {
|
2021-06-18 16:21:27 +00:00
|
|
|
char digit = string[i];
|
2020-05-20 18:20:43 +00:00
|
|
|
u8 digit_val;
|
2020-12-20 05:27:33 +00:00
|
|
|
if (value > (upper_bound >> 4))
|
|
|
|
return {};
|
2020-05-20 18:20:43 +00:00
|
|
|
|
|
|
|
if (digit >= '0' && digit <= '9') {
|
|
|
|
digit_val = digit - '0';
|
|
|
|
} else if (digit >= 'a' && digit <= 'f') {
|
|
|
|
digit_val = 10 + (digit - 'a');
|
|
|
|
} else if (digit >= 'A' && digit <= 'F') {
|
|
|
|
digit_val = 10 + (digit - 'A');
|
|
|
|
} else {
|
2020-06-12 19:07:52 +00:00
|
|
|
return {};
|
2020-05-20 18:20:43 +00:00
|
|
|
}
|
|
|
|
|
|
|
|
value = (value << 4) + digit_val;
|
|
|
|
}
|
|
|
|
return value;
|
|
|
|
}
|
|
|
|
|
2021-11-10 23:55:02 +00:00
|
|
|
template Optional<u8> convert_to_uint_from_hex(StringView str, TrimWhitespace);
|
|
|
|
template Optional<u16> convert_to_uint_from_hex(StringView str, TrimWhitespace);
|
|
|
|
template Optional<u32> convert_to_uint_from_hex(StringView str, TrimWhitespace);
|
|
|
|
template Optional<u64> convert_to_uint_from_hex(StringView str, TrimWhitespace);
|
2020-12-10 13:17:30 +00:00
|
|
|
|
2021-12-20 20:06:54 +00:00
|
|
|
template<typename T>
|
|
|
|
Optional<T> convert_to_uint_from_octal(StringView str, TrimWhitespace trim_whitespace)
|
|
|
|
{
|
|
|
|
auto string = trim_whitespace == TrimWhitespace::Yes
|
|
|
|
? str.trim_whitespace()
|
|
|
|
: str;
|
|
|
|
if (string.is_empty())
|
|
|
|
return {};
|
|
|
|
|
|
|
|
T value = 0;
|
2022-04-01 17:58:27 +00:00
|
|
|
auto const count = string.length();
|
2024-01-03 21:43:01 +00:00
|
|
|
T const upper_bound = NumericLimits<T>::max();
|
2021-12-20 20:06:54 +00:00
|
|
|
|
|
|
|
for (size_t i = 0; i < count; i++) {
|
|
|
|
char digit = string[i];
|
|
|
|
u8 digit_val;
|
|
|
|
if (value > (upper_bound >> 3))
|
|
|
|
return {};
|
|
|
|
|
|
|
|
if (digit >= '0' && digit <= '7') {
|
|
|
|
digit_val = digit - '0';
|
|
|
|
} else {
|
|
|
|
return {};
|
|
|
|
}
|
|
|
|
|
|
|
|
value = (value << 3) + digit_val;
|
|
|
|
}
|
|
|
|
return value;
|
|
|
|
}
|
|
|
|
|
|
|
|
template Optional<u8> convert_to_uint_from_octal(StringView str, TrimWhitespace);
|
|
|
|
template Optional<u16> convert_to_uint_from_octal(StringView str, TrimWhitespace);
|
|
|
|
template Optional<u32> convert_to_uint_from_octal(StringView str, TrimWhitespace);
|
|
|
|
template Optional<u64> convert_to_uint_from_octal(StringView str, TrimWhitespace);
|
|
|
|
|
2022-10-10 22:48:45 +00:00
|
|
|
template<typename T>
|
|
|
|
Optional<T> convert_to_floating_point(StringView str, TrimWhitespace trim_whitespace)
|
|
|
|
{
|
|
|
|
static_assert(IsSame<T, double> || IsSame<T, float>);
|
|
|
|
auto string = trim_whitespace == TrimWhitespace::Yes
|
|
|
|
? str.trim_whitespace()
|
|
|
|
: str;
|
|
|
|
|
|
|
|
char const* start = string.characters_without_null_termination();
|
2023-10-14 08:42:59 +00:00
|
|
|
return parse_floating_point_completely<T>(start, start + string.length());
|
2022-10-10 22:48:45 +00:00
|
|
|
}
|
|
|
|
|
|
|
|
template Optional<double> convert_to_floating_point(StringView str, TrimWhitespace);
|
|
|
|
template Optional<float> convert_to_floating_point(StringView str, TrimWhitespace);
|
|
|
|
|
2023-03-10 07:48:54 +00:00
|
|
|
bool equals_ignoring_ascii_case(StringView a, StringView b)
|
2020-03-22 12:07:45 +00:00
|
|
|
{
|
|
|
|
if (a.length() != b.length())
|
|
|
|
return false;
|
|
|
|
for (size_t i = 0; i < a.length(); ++i) {
|
2021-07-06 11:46:46 +00:00
|
|
|
if (to_ascii_lowercase(a.characters_without_null_termination()[i]) != to_ascii_lowercase(b.characters_without_null_termination()[i]))
|
2020-03-22 12:07:45 +00:00
|
|
|
return false;
|
|
|
|
}
|
|
|
|
return true;
|
|
|
|
}
|
|
|
|
|
2021-11-10 23:55:02 +00:00
|
|
|
bool ends_with(StringView str, StringView end, CaseSensitivity case_sensitivity)
|
2020-05-26 09:58:34 +00:00
|
|
|
{
|
|
|
|
if (end.is_empty())
|
|
|
|
return true;
|
|
|
|
if (str.is_empty())
|
|
|
|
return false;
|
|
|
|
if (end.length() > str.length())
|
|
|
|
return false;
|
2020-05-26 09:12:18 +00:00
|
|
|
|
|
|
|
if (case_sensitivity == CaseSensitivity::CaseSensitive)
|
|
|
|
return !memcmp(str.characters_without_null_termination() + (str.length() - end.length()), end.characters_without_null_termination(), end.length());
|
|
|
|
|
|
|
|
auto str_chars = str.characters_without_null_termination();
|
|
|
|
auto end_chars = end.characters_without_null_termination();
|
|
|
|
|
|
|
|
size_t si = str.length() - end.length();
|
|
|
|
for (size_t ei = 0; ei < end.length(); ++si, ++ei) {
|
2021-07-06 11:46:46 +00:00
|
|
|
if (to_ascii_lowercase(str_chars[si]) != to_ascii_lowercase(end_chars[ei]))
|
2020-05-26 09:12:18 +00:00
|
|
|
return false;
|
|
|
|
}
|
|
|
|
return true;
|
2020-05-26 09:58:34 +00:00
|
|
|
}
|
|
|
|
|
2021-11-10 23:55:02 +00:00
|
|
|
bool starts_with(StringView str, StringView start, CaseSensitivity case_sensitivity)
|
2020-07-18 16:59:38 +00:00
|
|
|
{
|
|
|
|
if (start.is_empty())
|
|
|
|
return true;
|
|
|
|
if (str.is_empty())
|
|
|
|
return false;
|
|
|
|
if (start.length() > str.length())
|
|
|
|
return false;
|
|
|
|
if (str.characters_without_null_termination() == start.characters_without_null_termination())
|
|
|
|
return true;
|
|
|
|
|
|
|
|
if (case_sensitivity == CaseSensitivity::CaseSensitive)
|
|
|
|
return !memcmp(str.characters_without_null_termination(), start.characters_without_null_termination(), start.length());
|
|
|
|
|
|
|
|
auto str_chars = str.characters_without_null_termination();
|
|
|
|
auto start_chars = start.characters_without_null_termination();
|
|
|
|
|
|
|
|
size_t si = 0;
|
|
|
|
for (size_t starti = 0; starti < start.length(); ++si, ++starti) {
|
2021-07-06 11:46:46 +00:00
|
|
|
if (to_ascii_lowercase(str_chars[si]) != to_ascii_lowercase(start_chars[starti]))
|
2020-07-18 16:59:38 +00:00
|
|
|
return false;
|
|
|
|
}
|
|
|
|
return true;
|
|
|
|
}
|
|
|
|
|
2021-11-10 23:55:02 +00:00
|
|
|
bool contains(StringView str, StringView needle, CaseSensitivity case_sensitivity)
|
2020-10-20 21:07:03 +00:00
|
|
|
{
|
|
|
|
if (str.is_null() || needle.is_null() || str.is_empty() || needle.length() > str.length())
|
|
|
|
return false;
|
|
|
|
if (needle.is_empty())
|
|
|
|
return true;
|
|
|
|
auto str_chars = str.characters_without_null_termination();
|
|
|
|
auto needle_chars = needle.characters_without_null_termination();
|
|
|
|
if (case_sensitivity == CaseSensitivity::CaseSensitive)
|
|
|
|
return memmem(str_chars, str.length(), needle_chars, needle.length()) != nullptr;
|
|
|
|
|
2021-07-06 11:46:46 +00:00
|
|
|
auto needle_first = to_ascii_lowercase(needle_chars[0]);
|
2020-11-12 23:44:32 +00:00
|
|
|
for (size_t si = 0; si < str.length(); si++) {
|
2021-07-06 11:46:46 +00:00
|
|
|
if (to_ascii_lowercase(str_chars[si]) != needle_first)
|
2020-10-20 21:07:03 +00:00
|
|
|
continue;
|
2020-11-12 23:44:32 +00:00
|
|
|
for (size_t ni = 0; si + ni < str.length(); ni++) {
|
2021-07-06 11:46:46 +00:00
|
|
|
if (to_ascii_lowercase(str_chars[si + ni]) != to_ascii_lowercase(needle_chars[ni])) {
|
2022-03-18 18:02:07 +00:00
|
|
|
if (ni > 0)
|
|
|
|
si += ni - 1;
|
2020-10-20 21:07:03 +00:00
|
|
|
break;
|
2020-11-12 23:44:32 +00:00
|
|
|
}
|
|
|
|
if (ni + 1 == needle.length())
|
|
|
|
return true;
|
2020-10-20 21:07:03 +00:00
|
|
|
}
|
|
|
|
}
|
|
|
|
return false;
|
|
|
|
}
|
|
|
|
|
2021-11-10 23:55:02 +00:00
|
|
|
bool is_whitespace(StringView str)
|
2020-09-20 13:35:04 +00:00
|
|
|
{
|
2021-07-25 21:05:48 +00:00
|
|
|
return all_of(str, is_ascii_space);
|
2021-01-02 23:26:02 +00:00
|
|
|
}
|
2020-09-20 13:35:04 +00:00
|
|
|
|
2021-11-10 23:55:02 +00:00
|
|
|
StringView trim(StringView str, StringView characters, TrimMode mode)
|
2021-01-02 23:26:02 +00:00
|
|
|
{
|
2020-09-20 13:35:04 +00:00
|
|
|
size_t substring_start = 0;
|
|
|
|
size_t substring_length = str.length();
|
|
|
|
|
|
|
|
if (mode == TrimMode::Left || mode == TrimMode::Both) {
|
|
|
|
for (size_t i = 0; i < str.length(); ++i) {
|
|
|
|
if (substring_length == 0)
|
2022-07-11 17:32:29 +00:00
|
|
|
return ""sv;
|
2021-05-25 07:42:01 +00:00
|
|
|
if (!characters.contains(str[i]))
|
2020-09-20 13:35:04 +00:00
|
|
|
break;
|
|
|
|
++substring_start;
|
|
|
|
--substring_length;
|
|
|
|
}
|
|
|
|
}
|
|
|
|
|
|
|
|
if (mode == TrimMode::Right || mode == TrimMode::Both) {
|
2022-10-11 13:38:09 +00:00
|
|
|
for (size_t i = str.length(); i > 0; --i) {
|
2020-09-20 13:35:04 +00:00
|
|
|
if (substring_length == 0)
|
2022-07-11 17:32:29 +00:00
|
|
|
return ""sv;
|
2022-10-11 13:38:09 +00:00
|
|
|
if (!characters.contains(str[i - 1]))
|
2020-09-20 13:35:04 +00:00
|
|
|
break;
|
|
|
|
--substring_length;
|
|
|
|
}
|
|
|
|
}
|
|
|
|
|
|
|
|
return str.substring_view(substring_start, substring_length);
|
|
|
|
}
|
2021-01-12 19:58:45 +00:00
|
|
|
|
2021-11-10 23:55:02 +00:00
|
|
|
StringView trim_whitespace(StringView str, TrimMode mode)
|
2021-05-25 07:42:01 +00:00
|
|
|
{
|
2022-07-11 17:32:29 +00:00
|
|
|
return trim(str, " \n\t\v\f\r"sv, mode);
|
2021-05-25 07:42:01 +00:00
|
|
|
}
|
|
|
|
|
2021-11-10 23:55:02 +00:00
|
|
|
Optional<size_t> find(StringView haystack, char needle, size_t start)
|
2021-01-12 19:58:45 +00:00
|
|
|
{
|
2021-07-01 12:58:37 +00:00
|
|
|
if (start >= haystack.length())
|
|
|
|
return {};
|
|
|
|
for (size_t i = start; i < haystack.length(); ++i) {
|
|
|
|
if (haystack[i] == needle)
|
|
|
|
return i;
|
|
|
|
}
|
|
|
|
return {};
|
|
|
|
}
|
|
|
|
|
2021-11-10 23:55:02 +00:00
|
|
|
Optional<size_t> find(StringView haystack, StringView needle, size_t start)
|
2021-07-01 12:58:37 +00:00
|
|
|
{
|
|
|
|
if (start > haystack.length())
|
|
|
|
return {};
|
|
|
|
auto index = AK::memmem_optional(
|
|
|
|
haystack.characters_without_null_termination() + start, haystack.length() - start,
|
2021-01-12 19:58:45 +00:00
|
|
|
needle.characters_without_null_termination(), needle.length());
|
2021-07-01 12:58:37 +00:00
|
|
|
return index.has_value() ? (*index + start) : index;
|
|
|
|
}
|
|
|
|
|
2021-11-10 23:55:02 +00:00
|
|
|
Optional<size_t> find_last(StringView haystack, char needle)
|
2021-07-01 12:58:37 +00:00
|
|
|
{
|
|
|
|
for (size_t i = haystack.length(); i > 0; --i) {
|
|
|
|
if (haystack[i - 1] == needle)
|
|
|
|
return i - 1;
|
|
|
|
}
|
|
|
|
return {};
|
2021-01-12 19:58:45 +00:00
|
|
|
}
|
2021-02-20 21:39:22 +00:00
|
|
|
|
2022-12-15 21:20:14 +00:00
|
|
|
Optional<size_t> find_last(StringView haystack, StringView needle)
|
|
|
|
{
|
2024-01-03 21:48:42 +00:00
|
|
|
if (needle.length() > haystack.length())
|
|
|
|
return {};
|
|
|
|
|
|
|
|
for (size_t i = haystack.length() - needle.length();; --i) {
|
|
|
|
if (haystack.substring_view(i, needle.length()) == needle)
|
|
|
|
return i;
|
|
|
|
|
|
|
|
if (i == 0)
|
|
|
|
break;
|
2022-12-15 21:20:14 +00:00
|
|
|
}
|
|
|
|
|
|
|
|
return {};
|
|
|
|
}
|
|
|
|
|
2022-09-30 19:19:53 +00:00
|
|
|
Optional<size_t> find_last_not(StringView haystack, char needle)
|
|
|
|
{
|
|
|
|
for (size_t i = haystack.length(); i > 0; --i) {
|
|
|
|
if (haystack[i - 1] != needle)
|
|
|
|
return i - 1;
|
|
|
|
}
|
|
|
|
return {};
|
|
|
|
}
|
|
|
|
|
2021-11-10 23:55:02 +00:00
|
|
|
Vector<size_t> find_all(StringView haystack, StringView needle)
|
2021-07-01 15:00:34 +00:00
|
|
|
{
|
|
|
|
Vector<size_t> positions;
|
|
|
|
size_t current_position = 0;
|
|
|
|
while (current_position <= haystack.length()) {
|
|
|
|
auto maybe_position = AK::memmem_optional(
|
|
|
|
haystack.characters_without_null_termination() + current_position, haystack.length() - current_position,
|
|
|
|
needle.characters_without_null_termination(), needle.length());
|
|
|
|
if (!maybe_position.has_value())
|
|
|
|
break;
|
|
|
|
positions.append(current_position + *maybe_position);
|
|
|
|
current_position += *maybe_position + 1;
|
|
|
|
}
|
|
|
|
return positions;
|
|
|
|
}
|
|
|
|
|
2021-11-10 23:55:02 +00:00
|
|
|
Optional<size_t> find_any_of(StringView haystack, StringView needles, SearchDirection direction)
|
2021-07-01 16:12:21 +00:00
|
|
|
{
|
|
|
|
if (haystack.is_empty() || needles.is_empty())
|
|
|
|
return {};
|
|
|
|
if (direction == SearchDirection::Forward) {
|
|
|
|
for (size_t i = 0; i < haystack.length(); ++i) {
|
|
|
|
if (needles.contains(haystack[i]))
|
|
|
|
return i;
|
|
|
|
}
|
|
|
|
} else if (direction == SearchDirection::Backward) {
|
|
|
|
for (size_t i = haystack.length(); i > 0; --i) {
|
|
|
|
if (needles.contains(haystack[i - 1]))
|
|
|
|
return i - 1;
|
|
|
|
}
|
|
|
|
}
|
|
|
|
return {};
|
|
|
|
}
|
|
|
|
|
2023-12-16 14:19:34 +00:00
|
|
|
ByteString to_snakecase(StringView str)
|
2021-02-20 21:39:22 +00:00
|
|
|
{
|
|
|
|
auto should_insert_underscore = [&](auto i, auto current_char) {
|
|
|
|
if (i == 0)
|
|
|
|
return false;
|
|
|
|
auto previous_ch = str[i - 1];
|
2021-07-06 11:46:46 +00:00
|
|
|
if (is_ascii_lower_alpha(previous_ch) && is_ascii_upper_alpha(current_char))
|
2021-02-20 21:39:22 +00:00
|
|
|
return true;
|
|
|
|
if (i >= str.length() - 1)
|
|
|
|
return false;
|
|
|
|
auto next_ch = str[i + 1];
|
2021-07-06 11:46:46 +00:00
|
|
|
if (is_ascii_upper_alpha(current_char) && is_ascii_lower_alpha(next_ch))
|
2021-02-20 21:39:22 +00:00
|
|
|
return true;
|
|
|
|
return false;
|
|
|
|
};
|
|
|
|
|
|
|
|
StringBuilder builder;
|
|
|
|
for (size_t i = 0; i < str.length(); ++i) {
|
|
|
|
auto ch = str[i];
|
|
|
|
if (should_insert_underscore(i, ch))
|
|
|
|
builder.append('_');
|
2021-07-06 11:46:46 +00:00
|
|
|
builder.append_as_lowercase(ch);
|
2021-02-20 21:39:22 +00:00
|
|
|
}
|
2023-12-16 14:19:34 +00:00
|
|
|
return builder.to_byte_string();
|
2021-02-20 21:39:22 +00:00
|
|
|
}
|
|
|
|
|
2023-12-16 14:19:34 +00:00
|
|
|
ByteString to_titlecase(StringView str)
|
2021-08-26 17:55:41 +00:00
|
|
|
{
|
|
|
|
StringBuilder builder;
|
|
|
|
bool next_is_upper = true;
|
|
|
|
|
|
|
|
for (auto ch : str) {
|
|
|
|
if (next_is_upper)
|
2022-10-20 12:44:18 +00:00
|
|
|
builder.append(to_ascii_uppercase(ch));
|
2021-08-26 17:55:41 +00:00
|
|
|
else
|
2022-10-20 12:44:18 +00:00
|
|
|
builder.append(to_ascii_lowercase(ch));
|
2021-08-26 17:55:41 +00:00
|
|
|
next_is_upper = ch == ' ';
|
|
|
|
}
|
|
|
|
|
2023-12-16 14:19:34 +00:00
|
|
|
return builder.to_byte_string();
|
2021-08-26 17:55:41 +00:00
|
|
|
}
|
|
|
|
|
2023-12-16 14:19:34 +00:00
|
|
|
ByteString invert_case(StringView str)
|
2022-05-19 05:23:45 +00:00
|
|
|
{
|
|
|
|
StringBuilder builder(str.length());
|
|
|
|
|
|
|
|
for (auto ch : str) {
|
|
|
|
if (is_ascii_lower_alpha(ch))
|
|
|
|
builder.append(to_ascii_uppercase(ch));
|
|
|
|
else
|
|
|
|
builder.append(to_ascii_lowercase(ch));
|
|
|
|
}
|
|
|
|
|
2023-12-16 14:19:34 +00:00
|
|
|
return builder.to_byte_string();
|
2022-05-19 05:23:45 +00:00
|
|
|
}
|
|
|
|
|
2023-12-17 16:52:45 +00:00
|
|
|
// Finishes the replacing algorithm once it is known that ita least one
|
|
|
|
// replacemnet is going to be done. Otherwise the caller may want to follow a
|
|
|
|
// different route to construct its output.
|
|
|
|
static StringBuilder replace_into_builder(StringView str, StringView needle, StringView replacement, ReplaceMode replace_mode, size_t first_replacement_position)
|
|
|
|
{
|
|
|
|
StringBuilder replaced_string;
|
|
|
|
|
|
|
|
replaced_string.append(str.substring_view(0, first_replacement_position));
|
|
|
|
replaced_string.append(replacement);
|
|
|
|
|
|
|
|
StringView remaining = str.substring_view(first_replacement_position + needle.length());
|
|
|
|
|
|
|
|
switch (replace_mode) {
|
|
|
|
case ReplaceMode::All:
|
|
|
|
while (!remaining.is_empty()) {
|
|
|
|
auto maybe_pos = remaining.find(needle);
|
|
|
|
if (!maybe_pos.has_value())
|
|
|
|
break;
|
|
|
|
replaced_string.append(remaining.substring_view(0, *maybe_pos));
|
|
|
|
replaced_string.append(replacement);
|
|
|
|
remaining = remaining.substring_view(*maybe_pos + needle.length());
|
|
|
|
}
|
|
|
|
break;
|
|
|
|
case ReplaceMode::FirstOnly:
|
|
|
|
// We already made the first replacement.
|
|
|
|
break;
|
|
|
|
}
|
|
|
|
|
|
|
|
// The remaining bits either don't contain the needle or are ignored due to
|
|
|
|
// `replace_mode` being `ReplaceMode::FirstOnly`.
|
|
|
|
replaced_string.append(remaining);
|
|
|
|
|
|
|
|
return replaced_string;
|
|
|
|
}
|
|
|
|
|
|
|
|
ByteString replace(StringView str, StringView needle, StringView replacement,
|
|
|
|
ReplaceMode replace_mode)
|
2021-09-10 23:15:44 +00:00
|
|
|
{
|
|
|
|
if (str.is_empty())
|
|
|
|
return str;
|
|
|
|
|
2023-12-17 16:52:45 +00:00
|
|
|
auto maybe_first = str.find(needle);
|
|
|
|
if (!maybe_first.has_value())
|
|
|
|
return str;
|
2021-09-10 23:15:44 +00:00
|
|
|
|
2023-12-17 16:52:45 +00:00
|
|
|
auto resulting_builder = replace_into_builder(str, needle, replacement, replace_mode, *maybe_first);
|
|
|
|
return resulting_builder.to_byte_string();
|
2021-09-10 23:15:44 +00:00
|
|
|
}
|
AK: Introduce the new String, replacement for DeprecatedString
DeprecatedString (formerly String) has been with us since the start,
and it has served us well. However, it has a number of shortcomings
that I'd like to address.
Some of these issues are hard if not impossible to solve incrementally
inside of DeprecatedString, so instead of doing that, let's build a new
String class and then incrementally move over to it instead.
Problems in DeprecatedString:
- It assumes string allocation never fails. This makes it impossible
to use in allocation-sensitive contexts, and is the reason we had to
ban DeprecatedString from the kernel entirely.
- The awkward null state. DeprecatedString can be null. It's different
from the empty state, although null strings are considered empty.
All code is immediately nicer when using Optional<DeprecatedString>
but DeprecatedString came before Optional, which is how we ended up
like this.
- The encoding of the underlying data is ambiguous. For the most part,
we use it as if it's always UTF-8, but there have been cases where
we pass around strings in other encodings (e.g ISO8859-1)
- operator[] and length() are used to iterate over DeprecatedString one
byte at a time. This is done all over the codebase, and will *not*
give the right results unless the string is all ASCII.
How we solve these issues in the new String:
- Functions that may allocate now return ErrorOr<String> so that ENOMEM
errors can be passed to the caller.
- String has no null state. Use Optional<String> when needed.
- String is always UTF-8. This is validated when constructing a String.
We may need to add a bypass for this in the future, for cases where
you have a known-good string, but for now: validate all the things!
- There is no operator[] or length(). You can get the underlying data
with bytes(), but for iterating over code points, you should be using
an UTF-8 iterator.
Furthermore, it has two nifty new features:
- String implements a small string optimization (SSO) for strings that
can fit entirely within a pointer. This means up to 3 bytes on 32-bit
platforms, and 7 bytes on 64-bit platforms. Such small strings will
not be heap-allocated.
- String can create substrings without making a deep copy of the
substring. Instead, the superstring gets +1 refcount from the
substring, and it acts like a view into the superstring. To make
substrings like this, use the substring_with_shared_superstring() API.
One caveat:
- String does not guarantee that the underlying data is null-terminated
like DeprecatedString does today. While this was nifty in a handful of
places where we were calling C functions, it did stand in the way of
shared-superstring substrings.
2022-12-01 12:27:43 +00:00
|
|
|
|
|
|
|
ErrorOr<String> replace(String const& haystack, StringView needle, StringView replacement, ReplaceMode replace_mode)
|
|
|
|
{
|
|
|
|
if (haystack.is_empty())
|
|
|
|
return haystack;
|
|
|
|
|
2023-12-17 16:52:45 +00:00
|
|
|
auto const source_bytes = haystack.bytes_as_string_view();
|
AK: Introduce the new String, replacement for DeprecatedString
DeprecatedString (formerly String) has been with us since the start,
and it has served us well. However, it has a number of shortcomings
that I'd like to address.
Some of these issues are hard if not impossible to solve incrementally
inside of DeprecatedString, so instead of doing that, let's build a new
String class and then incrementally move over to it instead.
Problems in DeprecatedString:
- It assumes string allocation never fails. This makes it impossible
to use in allocation-sensitive contexts, and is the reason we had to
ban DeprecatedString from the kernel entirely.
- The awkward null state. DeprecatedString can be null. It's different
from the empty state, although null strings are considered empty.
All code is immediately nicer when using Optional<DeprecatedString>
but DeprecatedString came before Optional, which is how we ended up
like this.
- The encoding of the underlying data is ambiguous. For the most part,
we use it as if it's always UTF-8, but there have been cases where
we pass around strings in other encodings (e.g ISO8859-1)
- operator[] and length() are used to iterate over DeprecatedString one
byte at a time. This is done all over the codebase, and will *not*
give the right results unless the string is all ASCII.
How we solve these issues in the new String:
- Functions that may allocate now return ErrorOr<String> so that ENOMEM
errors can be passed to the caller.
- String has no null state. Use Optional<String> when needed.
- String is always UTF-8. This is validated when constructing a String.
We may need to add a bypass for this in the future, for cases where
you have a known-good string, but for now: validate all the things!
- There is no operator[] or length(). You can get the underlying data
with bytes(), but for iterating over code points, you should be using
an UTF-8 iterator.
Furthermore, it has two nifty new features:
- String implements a small string optimization (SSO) for strings that
can fit entirely within a pointer. This means up to 3 bytes on 32-bit
platforms, and 7 bytes on 64-bit platforms. Such small strings will
not be heap-allocated.
- String can create substrings without making a deep copy of the
substring. Instead, the superstring gets +1 refcount from the
substring, and it acts like a view into the superstring. To make
substrings like this, use the substring_with_shared_superstring() API.
One caveat:
- String does not guarantee that the underlying data is null-terminated
like DeprecatedString does today. While this was nifty in a handful of
places where we were calling C functions, it did stand in the way of
shared-superstring substrings.
2022-12-01 12:27:43 +00:00
|
|
|
|
2023-12-17 16:52:45 +00:00
|
|
|
auto maybe_first = source_bytes.find(needle);
|
|
|
|
if (!maybe_first.has_value())
|
|
|
|
return haystack;
|
|
|
|
|
|
|
|
auto resulting_builder = replace_into_builder(source_bytes, needle, replacement, replace_mode, *maybe_first);
|
|
|
|
return resulting_builder.to_string();
|
AK: Introduce the new String, replacement for DeprecatedString
DeprecatedString (formerly String) has been with us since the start,
and it has served us well. However, it has a number of shortcomings
that I'd like to address.
Some of these issues are hard if not impossible to solve incrementally
inside of DeprecatedString, so instead of doing that, let's build a new
String class and then incrementally move over to it instead.
Problems in DeprecatedString:
- It assumes string allocation never fails. This makes it impossible
to use in allocation-sensitive contexts, and is the reason we had to
ban DeprecatedString from the kernel entirely.
- The awkward null state. DeprecatedString can be null. It's different
from the empty state, although null strings are considered empty.
All code is immediately nicer when using Optional<DeprecatedString>
but DeprecatedString came before Optional, which is how we ended up
like this.
- The encoding of the underlying data is ambiguous. For the most part,
we use it as if it's always UTF-8, but there have been cases where
we pass around strings in other encodings (e.g ISO8859-1)
- operator[] and length() are used to iterate over DeprecatedString one
byte at a time. This is done all over the codebase, and will *not*
give the right results unless the string is all ASCII.
How we solve these issues in the new String:
- Functions that may allocate now return ErrorOr<String> so that ENOMEM
errors can be passed to the caller.
- String has no null state. Use Optional<String> when needed.
- String is always UTF-8. This is validated when constructing a String.
We may need to add a bypass for this in the future, for cases where
you have a known-good string, but for now: validate all the things!
- There is no operator[] or length(). You can get the underlying data
with bytes(), but for iterating over code points, you should be using
an UTF-8 iterator.
Furthermore, it has two nifty new features:
- String implements a small string optimization (SSO) for strings that
can fit entirely within a pointer. This means up to 3 bytes on 32-bit
platforms, and 7 bytes on 64-bit platforms. Such small strings will
not be heap-allocated.
- String can create substrings without making a deep copy of the
substring. Instead, the superstring gets +1 refcount from the
substring, and it acts like a view into the superstring. To make
substrings like this, use the substring_with_shared_superstring() API.
One caveat:
- String does not guarantee that the underlying data is null-terminated
like DeprecatedString does today. While this was nifty in a handful of
places where we were calling C functions, it did stand in the way of
shared-superstring substrings.
2022-12-01 12:27:43 +00:00
|
|
|
}
|
2021-09-10 23:15:44 +00:00
|
|
|
|
2021-09-10 22:02:24 +00:00
|
|
|
// TODO: Benchmark against KMP (AK/MemMem.h) and switch over if it's faster for short strings too
|
2021-11-10 23:55:02 +00:00
|
|
|
size_t count(StringView str, StringView needle)
|
2021-09-10 22:02:24 +00:00
|
|
|
{
|
|
|
|
if (needle.is_empty())
|
|
|
|
return str.length();
|
|
|
|
|
|
|
|
size_t count = 0;
|
|
|
|
for (size_t i = 0; i < str.length() - needle.length() + 1; ++i) {
|
|
|
|
if (str.substring_view(i).starts_with(needle))
|
|
|
|
count++;
|
|
|
|
}
|
|
|
|
return count;
|
|
|
|
}
|
|
|
|
|
2023-08-15 01:55:36 +00:00
|
|
|
size_t count(StringView str, char needle)
|
|
|
|
{
|
|
|
|
size_t count = 0;
|
|
|
|
for (size_t i = 0; i < str.length(); ++i) {
|
|
|
|
if (str[i] == needle)
|
|
|
|
count++;
|
|
|
|
}
|
|
|
|
return count;
|
|
|
|
}
|
|
|
|
|
2020-02-26 07:25:24 +00:00
|
|
|
}
|
|
|
|
|
|
|
|
}
|