Commit graph

280 commits

Author SHA1 Message Date
Andreas Kling
df547bb321 LibUnicode: Avoid redundant UTF-8 validation in AK::String helpers 2024-04-21 19:32:49 +02:00
Idan Horowitz
945c58c7c1 LibUnicode: Generate and use code point composition mappings
These allow us to binary search the code point compositions based on
the first code point being combined, which makes the search close to
O(log N) instead of O(N).
2024-04-06 14:21:04 -04:00
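The lookup strategy described in commit 945c58c7c1 can be sketched roughly as follows. This is only an illustration in Python, not the generated LibUnicode code: the table contents, `FIRSTS`, `GROUPS`, and `compose` are hypothetical names, and the sample holds just a handful of hand-written composition pairs.

```python
from bisect import bisect_left

# Hypothetical sample data: first code points, sorted, with the compositions
# that start with each of them stored at the same index in GROUPS.
FIRSTS = [0x0041, 0x0045, 0x0065, 0x006F]
GROUPS = [
    [(0x0300, 0x00C0), (0x0301, 0x00C1)],  # A + grave, A + acute
    [(0x0301, 0x00C9)],                    # E + acute
    [(0x0301, 0x00E9)],                    # e + acute
    [(0x0308, 0x00F6)],                    # o + diaeresis
]


def compose(first, second):
    """Return the composed code point for (first, second), or None."""
    # Binary search on the first code point: O(log N) over all mappings.
    index = bisect_left(FIRSTS, first)
    if index == len(FIRSTS) or FIRSTS[index] != first:
        return None
    # Each first code point has only a few partners, so a linear scan is cheap.
    for candidate, composed in GROUPS[index]:
        if candidate == second:
            return composed
    return None
```

For example, `compose(0x65, 0x301)` yields `0xE9` with this sample table, while a first code point that is not in the table returns `None` after the binary search alone.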
Idan Horowitz
e227bf0f71 LibUnicode: Optimize the canonical composition algorithm implementation
It now takes O(N) time instead of O(N^2) time. Additionally some always
false conditions are removed.
2024-04-06 14:21:04 -04:00
Timothy Flynn
576c2f4f4d LibURL+LibUnicode+LibWebView: Handle punycode directly in LibURL
We had defined punycode handling in LibUnicode when LibURL (AK at the
time) was unable to depend on LibUnicode. This is no longer the case.
2024-03-26 12:25:21 -04:00
Shannon Booth
e800605ad3 AK+LibURL: Move AK::URL into a new URL library
This URL library ends up being a relatively fundamental base library of
the system, as LibCore depends on LibURL.

This change has two main benefits:
 * Moving AK back more towards being an agnostic library that can
   be used between the kernel and userspace. URL has never really fit
   that description - and is not used in the kernel.
 * URL _should_ depend on LibUnicode, as it needs punycode support.
   However, it's not really possible to do this inside of AK as it can't
   depend on any external library. This change brings us a little closer
   to being able to do that, but unfortunately we aren't there quite
   yet, as the code generators depend on LibCore.
2024-03-18 14:06:28 -04:00
Timothy Flynn
aa0a6d58b2 Userland: Remove LibCore dependency from libraries that do not use it 2024-01-22 08:48:34 -05:00
Ali Mohammad Pur
5e1499d104 Everywhere: Rename {Deprecated => Byte}String
This commit un-deprecates DeprecatedString, and repurposes it as a byte
string.
As the null state has already been removed, there are no other
particularly hairy blockers in repurposing this type as a byte string
(what it _really_ is).

This commit is auto-generated:
  $ xs=$(ack -l \bDeprecatedString\b\|deprecated_string AK Userland \
    Meta Ports Ladybird Tests Kernel)
  $ perl -pie 's/\bDeprecatedString\b/ByteString/g;
    s/deprecated_string/byte_string/g' $xs
  $ clang-format --style=file -i \
    $(git diff --name-only | grep \.cpp\|\.h)
  $ gn format $(git ls-files '*.gn' '*.gni')
2023-12-17 18:25:10 +03:30
Timothy Flynn
43e9dc0500 LibUnicode: Use weak symbols to provide default IDNA definitions
Rather than using #ifdef blocks, update the fallback IDNA definitions to
use weak symbols to match the rest of LibUnicode / LibLocale.
2023-12-10 10:19:14 -05:00
Timothy Flynn
1f0e24bc3b LibUnicode: Fix compilation when ENABLE_UNICODE_DATABASE_DOWNLOAD is OFF 2023-12-10 10:19:14 -05:00
Simon Wanner
58f08107b0 AK+LibUnicode: Add Unicode::create_unicode_url
This is a workaround for the fact that AK::URLParser can't call into
LibUnicode directly.
2023-12-10 08:04:58 -05:00
Simon Wanner
5bcb019106 LibUnicode: Add IDNA::to_ascii
This implements the ToASCII operation of Unicode Technical Standard 46
2023-12-10 08:04:58 -05:00
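As a rough illustration of what Punycode and a ToASCII-style conversion produce, here is a sketch using Python's standard codecs. Note the assumption: Python's built-in `idna` codec follows the older IDNA 2003 rules (RFC 3490), not UTS 46, so it only approximates the operation added in commit 5bcb019106. The helper names are hypothetical.

```python
def punycode_encode(label):
    """Encode one Unicode label with Punycode (no "xn--" prefix)."""
    return label.encode("punycode").decode("ascii")


def punycode_decode(encoded):
    """Decode a Punycode label back to Unicode."""
    return encoded.encode("ascii").decode("punycode")


def to_ascii(domain):
    """Approximate ToASCII with Python's IDNA 2003 codec (not UTS 46)."""
    return domain.encode("idna").decode("ascii")
```

With these helpers, `punycode_encode("bücher")` gives `"bcher-kva"`, and `to_ascii("bücher.example")` gives `"xn--bcher-kva.example"`.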
Simon Wanner
7d9fe44039 LibUnicode: Download and parse IDNA data 2023-12-10 08:04:58 -05:00
Simon Wanner
cfd0a60863 LibUnicode: Add Punycode::encode 2023-12-10 08:04:58 -05:00
Simon Wanner
299d35aadc LibUnicode: Add Punycode::decode 2023-12-10 08:04:58 -05:00
Shannon Booth
d777b279e3 LibUnicode+Tests: Remove now unused to_unicode_*_full methods
This relocates all of the tests for these from LibUnicode to the AK
String test suite.
2023-11-28 17:15:27 -05:00
Shannon Booth
6b32a1f18f AK+LibUnicode: Expose TrailingCodePointTransformation in to_titlecase
Relocating the definition of this enum from LibUnicode to AK.
2023-11-28 17:15:27 -05:00
Timothy Flynn
6070df40f3 LibUnicode: Define case-insensitive string comparison more generically
The only user is currently String::equals_ignoring_case, but LibRegex
will need to do the same case-folded comparison with UTF-32 data. As it
turns out, the comparison works with all Unicode view types without much
fuss.
2023-11-08 12:54:26 -05:00
Cr4xy
bbfe0d3a82 LibWeb: Implement text-transform: capitalize 2023-10-03 09:47:17 -04:00
Timothy Flynn
139c575cc9 LibUnicode: Update to Unicode version 15.1.0
https://unicode.org/versions/Unicode15.1.0/

This update includes a new set of code point properties, Indic Conjunct
Break. These may have the values Consonant, Linker, or Extend. These are
used in text segmentation to prevent breaking on some extended grapheme
cluster sequences.
2023-09-15 18:30:26 +02:00
Timothy Flynn
02a8683266 LibUnicode+LibJS: Stop propagating small OOM errors from normalization
This API only performs small allocations, and is only used by LibJS.
2023-09-09 13:03:25 -04:00
Sam Atkins
0d021a63c7 LibUnicode: Generate data for bidirectional character types
This will let us examine code points to determine the rtl/ltr direction
of a piece of text.
2023-08-20 16:21:35 -04:00
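One way bidirectional character types can be used to guess the direction of a piece of text is a "first strong character" scan. The sketch below is an assumption about usage, not LibUnicode's API: it relies on Python's `unicodedata.bidirectional` as a stand-in for the generated data, and `first_strong_direction` is a hypothetical helper that ignores the isolate and embedding handling of the full bidirectional algorithm.

```python
import unicodedata


def first_strong_direction(text):
    """Return "ltr" or "rtl" based on the first strongly-typed character,
    or None if the text contains no strong character."""
    for ch in text:
        bidi_class = unicodedata.bidirectional(ch)
        if bidi_class == "L":
            return "ltr"
        if bidi_class in ("R", "AL"):
            return "rtl"
    return None
```

Digits and whitespace are weak or neutral types, so a string such as "123" alone yields `None`, while "123 " followed by Arabic text yields `"rtl"`.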
Timothy Flynn
456211932f LibUnicode: Perform code point case conversion lookups in constant time
Similar to commit 0652cc4, we now generate 2-stage lookup tables for
case conversion information. Only about 1500 code points are actually
cased. This means that case information is rather highly compressible,
as the blocks we break the code points into will generally all have no
casing information at all.

In total, this change:

    * Does not change the size of libunicode.so (which is nice because,
      generally, the 2-stage lookup tables are expected to trade a bit
      of size for performance).

    * Reduces the runtime of the new benchmark test case added here from
      1.383s to 1.127s (about an 18.5% improvement).
2023-07-28 05:28:50 +02:00
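The 2-stage lookup table idea from commit 456211932f can be sketched as below. This is a simplified illustration under several assumptions: it covers only the Basic Multilingual Plane, uses Python's own `str.upper` as the data source instead of the UCD files, stores a simple uppercase delta per code point, and all names (`BLOCK_SIZE`, `build_tables`, `to_upper`) are hypothetical.

```python
BLOCK_SIZE = 256
LIMIT = 0x10000  # BMP only, to keep the sketch small


def simple_upper(cp):
    """Stand-in data source: single-code-point uppercase mapping, else identity."""
    upper = chr(cp).upper()
    return ord(upper) if len(upper) == 1 else cp


def build_tables(limit, block_size, value_of):
    stage1 = []  # block number -> index into stage2
    stage2 = []  # unique blocks of per-code-point values
    seen = {}    # block contents -> index in stage2
    for start in range(0, limit, block_size):
        block = tuple(value_of(cp) for cp in range(start, start + block_size))
        if block not in seen:
            seen[block] = len(stage2)
            stage2.append(block)
        stage1.append(seen[block])
    return stage1, stage2


# Storing deltas makes every uncased block all zeros, so those blocks are shared.
STAGE1, STAGE2 = build_tables(LIMIT, BLOCK_SIZE, lambda cp: simple_upper(cp) - cp)


def to_upper(cp):
    """Constant-time lookup: two array indexing operations, no searching."""
    block = STAGE2[STAGE1[cp // BLOCK_SIZE]]
    return cp + block[cp % BLOCK_SIZE]
```

Because most blocks contain no cased code points, they collapse into a single shared all-zero block, which is why the tables compress well.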
Timothy Flynn
cb128dcf75 LibUnicode: Move the CodePointRangeComparator struct to a public header
Move it out of the generated code so that it may be used by the code
generator itself.
2023-07-26 08:36:20 +02:00
Timothy Flynn
c950f88611 LibUnicode: Stop generating Block property data
We started generating this data in commit 0505e03, but it was unused.
It's still not used, so let's remove it, rather than bloating the size
of libunicode.so with unused data. If we need it in the future, it's
trivial to add back.

Note we *have* always used the block name data from that commit, and
that is still present here.
2023-07-26 08:36:20 +02:00
Timothy Flynn
1393ed2000 AK+LibUnicode: Implement String::equals_ignoring_case without allocating
We currently fully casefold the left- and right-hand sides to compare
two strings with case-insensitivity. Now, we casefold one code point at
a time, storing the result in a view for comparison, until we exhaust
both strings.
2023-03-08 18:57:53 +00:00
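The approach in commit 1393ed2000, casefolding one code point at a time instead of building two fully folded strings, can be approximated with lazy generators. This is only a sketch: `casefolded` and `equals_ignoring_case` here are hypothetical Python helpers that lean on `str.casefold`, not the AK::String implementation.

```python
from itertools import zip_longest


def casefolded(text):
    """Lazily yield the case-folded code points of text, one input code point at a time."""
    for ch in text:
        # A single code point may fold to several (e.g. U+00DF folds to "ss").
        yield from ch.casefold()


def equals_ignoring_case(lhs, rhs):
    """Compare two strings caselessly without materializing folded copies."""
    sentinel = object()  # marks exhaustion of the shorter folded stream
    pairs = zip_longest(casefolded(lhs), casefolded(rhs), fillvalue=sentinel)
    return all(a == b for a, b in pairs)
```

The comparison stops at the first mismatch, and a length difference after folding shows up as a mismatch against the sentinel.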
Timothy Flynn
f8a0365002 LibUnicode: Detect ZWJ sequences when filtering by emoji presentation
This was preventing some unqualified emoji sequences from rendering
properly, such as the custom SerenityOS flag. We rendered the flag
correctly when given the fully qualified sequence:

    U+1F3F3 U+FE0F U+200D U+1F41E

But were not detecting the unqualified sequence as an emoji when also
filtering for emoji-presentation sequences:

    U+1F3F3 U+200D U+1F41E
2023-03-05 20:21:57 +01:00
Timothy Flynn
42c272c059 LibUnicode: Allow ignoring text presentation emoji in sequence detection
This adds an option to only detect emoji that should always present as
emoji. For example, the copyright symbol (unless followed by an emoji
presentation selector) should render as text.
2023-02-28 13:22:58 +00:00
Timothy Flynn
fa96811a22 LibUnicode: Skip over emoji sequences in grapheme boundary segmentation
Emoji sequences in the grapheme segmentation spec are a bit tricky:

    \p{Extended_Pictographic} Extend* ZWJ × \p{Extended_Pictographic}

Our current strategy of tracking a boolean to indicate if we are in an
emoji sequence was causing us to break up emoji made of multiple sub-
sequences. For example, in the "family: man, woman, girl, boy" sequence:

    U+1F468 U+200D U+1F469 U+200D U+1F467 U+200D U+1F466

We would break at indices 0 (correctly) and 6 (incorrectly).

Instead of tracking a boolean, it's quite a bit simpler to reason about
emoji sequences by just skipping past them entirely. Note that in cases
like the above emoji, we skip one sub-sequence at a time.
2023-02-25 22:23:39 +01:00
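The "skip past emoji sub-sequences" idea from commit fa96811a22 might look roughly like this. The sketch is heavily simplified: it handles only the emoji rule quoted above plus "do not break before Extend or ZWJ", breaks everywhere else, and uses tiny hand-picked property sets. `EXTEND`, `EXTENDED_PICTOGRAPHIC`, and both functions are hypothetical and not the LibUnicode implementation.

```python
ZWJ = 0x200D
EXTEND = {0xFE0F}  # sample only: emoji presentation selector
EXTENDED_PICTOGRAPHIC = {0x1F468, 0x1F469, 0x1F467, 0x1F466, 0x1F3F3, 0x1F41E}


def skip_emoji_sub_sequence(cps, i):
    """If cps[i:] starts with ExtPict Extend* ZWJ ExtPict, return the index
    of the trailing ExtPict; otherwise return None."""
    if cps[i] not in EXTENDED_PICTOGRAPHIC:
        return None
    j = i + 1
    while j < len(cps) and cps[j] in EXTEND:
        j += 1
    if j + 1 < len(cps) and cps[j] == ZWJ and cps[j + 1] in EXTENDED_PICTOGRAPHIC:
        return j + 1
    return None


def grapheme_boundaries(text):
    cps = [ord(ch) for ch in text]
    boundaries = [0] if cps else []
    i = 0
    while i < len(cps):
        # Skip one emoji sub-sequence at a time until none remains.
        skipped_to = skip_emoji_sub_sequence(cps, i)
        while skipped_to is not None:
            i = skipped_to
            skipped_to = skip_emoji_sub_sequence(cps, i)
        i += 1
        # Never break before Extend or ZWJ.
        while i < len(cps) and (cps[i] in EXTEND or cps[i] == ZWJ):
            i += 1
        boundaries.append(i)
    return boundaries
```

For the seven-code-point family sequence above, this sketch reports boundaries only at indices 0 and 7, with no break at index 6.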
Timothy Flynn
1484d3d9f5 LibUnicode: Add a method to check if a code point could start an emoji 2023-02-24 19:48:47 +01:00
Timothy Flynn
8c38d46c1a LibUnicode: Generate the path to emoji images alongside emoji data
This will provide for quicker emoji lookups, rather than having to
discover and allocate these paths at runtime before we find out if they
even exist.
2023-02-24 19:48:47 +01:00
Timothy Flynn
32a01a60e7 LibUnicode: Remove non-iterative text segmentation algorithms
They are now unused.
2023-02-16 11:18:53 +01:00
Timothy Flynn
6ce7ec2eb3 LibUnicode: Use iterative text segmentation algorithms for titlecasing 2023-02-16 11:18:53 +01:00
Timothy Flynn
5cbf054651 LibUnicode: Fix typos causing text segmentation on mid-word punctuation
For example the words "can't" and "32.3" should not have boundaries
detected on the "'" and "." code points, respectively.

The String test cases fixed here are because "b'ar" is now considered
one word.
2023-02-15 12:36:47 +01:00
Timothy Flynn
6e7a6e2d02 LibUnicode: Support finding the next/previous text segmentation boundary 2023-02-15 12:36:47 +01:00
Timothy Flynn
abe7786a81 LibUnicode: Allow iterating over text segmentation boundaries
This will be useful for e.g. finding the next boundary after a specific
index - we can just stop iterating once a condition is satisfied.
2023-02-15 12:36:47 +01:00
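The benefit of iterating over boundaries lazily, as described in commit abe7786a81, can be sketched with a generator. The boundary rule here is a deliberately crude assumption (a boundary wherever "alphanumeric-ness" changes) standing in for the real segmentation algorithms, and `boundaries` and `next_boundary` are hypothetical names.

```python
def boundaries(text):
    """Lazily yield boundary indices; simplified rule, not real text segmentation."""
    if not text:
        return
    yield 0
    for i in range(1, len(text)):
        if text[i].isalnum() != text[i - 1].isalnum():
            yield i
    yield len(text)


def next_boundary(text, index):
    """Return the first boundary after index, stopping iteration once found."""
    return next((b for b in boundaries(text) if b > index), None)
```

Because the generator is consumed only until the condition is satisfied, `next_boundary("hello world", 2)` returns 5 without examining the rest of the string.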
Timothy Flynn
dd4c47456e LibUnicode: Implement text segmentation algorithms for all UTF encodings
Similar to commit 6d710eeb43. Rather than
picking and choosing what to support, let's just support all encodings now,
as it is trivial. For example, LibGUI will want the UTF-32 overloads.
2023-02-15 12:36:47 +01:00
Timothy Flynn
2d487e4e4c LibUnicode+LibJS: Move text segmentation algorithms to their own files
These algorithms are quite chonky, and more APIs around them are to be
added, so let's move them to their own files for a bit of organization.
2023-02-15 12:36:47 +01:00
MacDue
63b11030f0 Everywhere: Use ReadonlySpan<T> instead of Span<T const> 2023-02-08 19:15:45 +00:00
Timothy Flynn
537fcaf59e AK+LibUnicode: Provide Unicode-aware caseless String matching
The Unicode spec defines much more complicated caseless matching
algorithms in its Collation spec. This implements the "basic" case
folding comparison.
2023-01-18 14:43:40 +00:00
Timothy Flynn
8f2589b3b0 LibUnicode: Parse and generate case folding code point data
Case folding rules have a similar mapping style as special casing rules,
where one code point may map to zero or more case folding rules. These
will be used for case-insensitive string comparisons. To see how case
folding can differ from other casing rules, consider "ß" (U+00DF):

    >>> "ß".lower()
    'ß'

    >>> "ß".upper()
    'SS'

    >>> "ß".title()
    'Ss'

    >>> "ß".casefold()
    'ss'
2023-01-18 14:43:40 +00:00
Timothy Flynn
8d9fb898d7 LibUnicode: Update out-of-date spec links
And remove links that aren't adding much value but will often get out of
date (i.e. links to UCD files, which are already all listed in
unicode_data.cmake).
2023-01-18 14:43:40 +00:00
Timothy Flynn
d6ddca0c0f AK+LibUnicode: Provide Unicode-aware String titlecase transformation 2023-01-16 18:33:44 -05:00
Timothy Flynn
bc51017a03 LibUnicode: Support full case folding for titlecasing a string
Unicode declares that to titlecase a string, the first cased code point
after each word boundary should be transformed to its titlecase mapping.
All other code points are transformed to their lowercase mapping.
2023-01-16 18:33:44 -05:00
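The titlecasing rule described in commit bc51017a03 can be sketched as follows. Two assumptions are baked in: whitespace is used as a crude stand-in for proper word boundary detection, and Python's per-character `str.title` and `str.lower` stand in for the generated titlecase and lowercase mappings. `is_cased` and `to_titlecase` are hypothetical helpers.

```python
def is_cased(ch):
    """Approximate the Cased property for a single code point."""
    return ch.islower() or ch.isupper() or ch.istitle()


def to_titlecase(text):
    result = []
    at_word_start = True
    for ch in text:
        if ch.isspace():
            # Simplification: treat whitespace as the only word boundary.
            at_word_start = True
            result.append(ch)
        elif at_word_start and is_cased(ch):
            # First cased code point after a boundary gets its titlecase mapping.
            result.append(ch.title())
            at_word_start = False
        else:
            # Every other code point gets its lowercase mapping.
            result.append(ch.lower())
    return "".join(result)
```

Note that the titlecase mapping is not always the uppercase mapping: the digraph U+01C6 titlecases to U+01C5 rather than to its uppercase form U+01C4.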
Timothy Flynn
b562348d31 LibUnicode: Generate simple case folding mappings for titlecase
Note we already generate the special case foldings for titlecase.
2023-01-16 18:33:44 -05:00
Timothy Flynn
6d710eeb43 LibUnicode: Add an overload of word segmentation for UTF-8 strings 2023-01-16 18:33:44 -05:00
Timothy Flynn
58bc831750 LibUnicode: Return a String from Unicode normalization 2023-01-15 01:00:20 +00:00
Timothy Flynn
6fcc1c7426 AK+LibUnicode: Provide Unicode-aware String case transformations
Since AK can't refer to LibUnicode directly, the strategy here is that
if you need case transformations, you can link LibUnicode and receive
them. If you try to use either of these methods without linking it, then
you'll of course get a linker error (note we don't do any fallbacks to
e.g. ASCII case transformations). If you don't need these methods, you
don't have to link LibUnicode.
2023-01-09 19:23:46 -07:00
Timothy Flynn
12f6793223 LibUnicode: Move Unicode-aware case transformations to a helper file
These will be needed by AK::String as well, so move them to a helper
file where they can be re-used.
2023-01-09 19:23:46 -07:00
Timothy Flynn
3d22efccca LibUnicode+LibJS: Propagate OOM from Unicode normalization 2023-01-09 22:48:15 +00:00
Timothy Flynn
1ff29afc45 LibUnicode+LibJS+LibWeb: Propagate OOM from Unicode case transformations 2023-01-09 22:48:15 +00:00