Commit graph

32 commits

Author SHA1 Message Date
Pavel Shliak
5fe1d38955 Tests: Remove duplicated test for UnicodeCharacterTypes 2024-11-06 09:22:44 +00:00
Timothy Flynn
aa3a30870b LibUnicode: Replace code point bidirectional classes with ICU 2024-06-22 14:56:39 +02:00
Timothy Flynn
5cf818e305 LibUnicode: Replace case transformations and comparison with ICUs
There are a couple of differences here due to using ICU:

1. Titlecasing behaves slightly differently. We previously transformed
   "123dollars" to "123Dollars", as we would use word segmentation to
   split a string into words, then transform the first cased character
   to titlecase. ICU doesn't go quite that far, and leaves the string
   as "123dollars". While this is a behavior change, the only user of
   this API is the `text-transform: capitalize;` CSS rule, and we now
   match the behavior of other browsers.

2. There isn't an API to compare strings with case insensitivity without
   allocating case-folded strings for both the left- and right-hand-side
   strings. Our implementation was previously allocation-free; however,
   in a benchmark, ICU is still ~1.4x faster.
2024-06-20 10:59:55 +02:00
Timothy Flynn
1feef17bf7 LibUnicode: Remove completely unused code point name & block name data
These were used for e.g. the Character Map on Serenity, but are not used
at all for Ladybird.
2024-06-18 21:07:56 +02:00
Shannon Booth
d777b279e3 LibUnicode+Tests: Remove now unused to_unicode_*_full methods
Relocating all of the tests for these in LibUnicode over to the AK
String testsuite.
2023-11-28 17:15:27 -05:00
Sam Atkins
0d021a63c7 LibUnicode: Generate data for bidirectional character types
This will let us examine code points to determine the rtl/ltr direction
of a piece of text.
2023-08-20 16:21:35 -04:00
Timothy Flynn
456211932f LibUnicode: Perform code point case conversion lookups in constant time
Similar to commit 0652cc4, we now generate 2-stage lookup tables for
case conversion information. Only about 1500 code points are actually
cased. This means that case information is rather highly compressible,
as the blocks we break the code points into will generally all have no
casing information at all.

In total, this change:

    * Does not change the size of libunicode.so (which is nice because,
      generally, the 2-stage lookup tables are expected to trade a bit
      of size for performance).

    * Reduces the runtime of the new benchmark test case added here from
      1.383s to 1.127s (about an 18.5% improvement).
2023-07-28 05:28:50 +02:00
Timothy Flynn
0652cc48c0 LibUnicode: Perform code point property lookups in constant time
We currently produce a single table for all categories of code point
properties (GeneralCategory, Script, etc.). Each row contains a field
indicating the range of code points to which that property applies. At
runtime, we then do a binary search through that table to decide if a
code point has a property.

This changes our approach to generate a 2-stage lookup table for each of
those categories. There is an in-depth explanation of these tables above
the new `create_code_point_tables` method. The end effect is that code
point property lookup is reduced from a binary search to constant-time
array lookups.

In total, this change:

    * Increases the size of libunicode.so from 2.7 MB to 2.9 MB.

    * Reduces the runtime of the new benchmark test case added here from
      3.576s to 1.020s (a 3.5x speedup).

    * In a profile of resizing a TextEditor window with a 3MB file open,
      the runtime of checking if a code point has a word break property
      reduces from ~81% to ~56%.
2023-07-26 08:36:20 +02:00
Timothy Flynn
c950f88611 LibUnicode: Stop generating Block property data
We started generating this data in commit 0505e03, but it was unused.
It's still not used, so let's remove it, rather than bloating the size
of libunicode.so with unused data. If we need it in the future, it's
trivial to add back.

Note we *have* always used the block name data from that commit, and
that is still present here.
2023-07-26 08:36:20 +02:00
Timothy Flynn
5cbf054651 LibUnicode: Fix typos causing text segmentation on mid-word punctuation
For example the words "can't" and "32.3" should not have boundaries
detected on the "'" and "." code points, respectively.

The String test cases fixed here are because "b'ar" is now considered
one word.
2023-02-15 12:36:47 +01:00
Timothy Flynn
8f2589b3b0 LibUnicode: Parse and generate case folding code point data
Case folding rules have a similar mapping style as special casing rules,
where one code point may map to zero or more case folding rules. These
will be used for case-insensitive string comparisons. To see how case
folding can differ from other casing rules, consider "ß" (U+00DF):

    >>> "ß".lower()
    'ß'

    >>> "ß".upper()
    'SS'

    >>> "ß".title()
    'Ss'

    >>> "ß".casefold()
    'ss'
2023-01-18 14:43:40 +00:00
Timothy Flynn
bc51017a03 LibUnicode: Support full case folding for titlecasing a string
Unicode declares that to titlecase a string, the first cased code point
after each word boundary should be transformed to its titlecase mapping.
All other codepoints are transformed to their lowercase mapping.
2023-01-16 18:33:44 -05:00
Timothy Flynn
b562348d31 LibUnicode: Generate simple case folding mappings for titlecase
Note we already generate the special case foldings for titlecase.
2023-01-16 18:33:44 -05:00
Timothy Flynn
1ff29afc45 LibUnicode+LibJS+LibWeb: Propagate OOM from Unicode case transformations 2023-01-09 22:48:15 +00:00
Timothy Flynn
f38c68177b LibUnicode: Update code point ideographic replacements for Unicode 15 2022-10-07 18:17:40 +01:00
thankyouverycool
5658524aa3 Tests: Add Unicode tests for CharacterType block properties 2022-02-15 10:13:19 -05:00
Timothy Flynn
6efbafa6e0 Everywhere: Update copyrights with my new serenityos.org e-mail :^) 2022-01-31 18:23:22 +00:00
Timothy Flynn
7e6ad172a4 LibUnicode: Support code point names that apply to ranges of code points
For example, consider the following adjacent entries in UnicodeData.txt:

    3400;<CJK Ideograph Extension A, First>;Lo;0;L;;;;;N;;;;;
    4DBF;<CJK Ideograph Extension A, Last>;Lo;0;L;;;;;N;;;;;

Our current implementation would assign the display name "CJK Ideograph
Extension A" to code points U+3400 & U+4DBF, but not to the code points
in between. Not only should those code points be assigned a name, but
the Unicode spec also has formatting rules on what the names should be
(the names for these ranged code points are not as they appear in
UnicodeData.txt).

The spec also defines names for code point ranges that actually are
listed individually in UnicodeData.txt. For example:

    2F800;CJK COMPATIBILITY IDEOGRAPH-2F800;Lo;0;L;4E3D;;;;N;;;;;
    2F801;CJK COMPATIBILITY IDEOGRAPH-2F801;Lo;0;L;4E38;;;;N;;;;;
    2F802;CJK COMPATIBILITY IDEOGRAPH-2F802;Lo;0;L;4E41;;;;N;;;;;

Code points are only coalesced into a range if all fields after the name
are equivalent. Our parser will insert the range and its name formatting
pattern when it comes across the first code point in that range, then
ignore other code points in that range. This reduces the number of names
we generated by nearly 2,000.
2021-11-30 11:24:02 +01:00
Timothy Flynn
50158abaf1 LibUnicode: Implement locale-aware BEFORE_DOT special casing
Note that the algorithm in the Unicode spec is for checking that a code
point precedes U+0307, but the special casing condition NotBeforeDot is
interested in the inverse of this rule.
2021-09-06 15:24:27 +01:00
Timothy Flynn
436faf9fd9 LibUnicode: Implement locale-aware MORE_ABOVE special casing 2021-09-06 15:24:27 +01:00
Timothy Flynn
1427ebc622 LibUnicode: Implement locale-aware AFTER_SOFT_DOTTED special casing 2021-09-06 15:24:27 +01:00
Timothy Flynn
0053d48c41 LibUnicode: Implement locale-aware AFTER_I special casing 2021-09-06 15:24:27 +01:00
Timothy Flynn
1e91334008 LibUnicode: Handle edge-case script extensions, Common and Inherited
These script extensions have some peculiar behavior in the Unicode spec.
The UCD ScriptExtension file does not contain these scripts. Rather, it
is implied the code points which have these scripts as an extension are
the code points that both:

  1. Have Common or Inherited as their primary script value
  2. Do not have any other script value in their script extension lists

Because these are not explictly listed in the UCD, we must manually form
these script extensions.
2021-08-11 13:11:01 +02:00
Timothy Flynn
47bb350ebd LibUnicode: Generate separate tables for scripts and script extensions
Notice that unlike the note in populate_general_category_unions(),
script extension do indeed have code point ranges which overlap. Thus,
this commit adds code to handle that, and hooks it into the GC unions.
2021-08-11 13:11:01 +02:00
Timothy Flynn
5ac23d244d LibUnicode: Generate separate tables for Unicode properties
Similar to General Categories, this generates separate tables for the
Property list.
2021-08-11 13:11:01 +02:00
Timothy Flynn
b06c104076 LibUnicode: Include Unassigned code points in the Other General Category
Now that the generator parses unassigned General Category properties, it
can include Unassigned (Cn) in the Other (C) category.
2021-08-11 13:11:01 +02:00
Timothy Flynn
7dce2bfe23 LibUnicode: Generate separate tables for General Category properties
Previously, each code point's General Category was part of the generated
UnicodeData structure. This ultimately presented two problems, one
functional and one performance related:

  * Some General Categories are applied to unassigned code points, for
    example the Unassigned (Cn) category. Unassigned code points are
    strictly excluded from UnicodeData.txt, so by relying on that file,
    the generator is unable to handle these categories.

  * Lookups for General Categories are slower when searching through the
    large UnicodeData hash map. Even though lookups are O(1), the hash
    function turned out to be slower than binary searching through a
    category-specific table.

So, now a table is generated for each General Category. When querying a
code point for a category, a binary search is done on each code point
range in that category's table to check if code point has that category.

Further, General Categories are now parsed from the UCD file
DerivedGeneralCategory.txt. This file is a normal "prop list" file and
contains the categories for unassigned code points.
2021-08-11 13:11:01 +02:00
Timothy Flynn
c4bfda7f7f LibUnicode: Handle code points that are both cased and case-ignorable
Apparently, some code points fit both categories, for example U+0345
(COMBINING GREEK YPOGEGRAMMENI). Handle this fact when determining if
a code point is a final code point in a string.
2021-07-28 23:42:29 +02:00
Timothy Flynn
7827aede6f LibUnicode: Check word break when deciding on case-ignorable code points 2021-07-28 23:42:29 +02:00
Timothy Flynn
c45a014645 LibUnicode: Check property list when deciding if a code point is cased 2021-07-28 23:42:29 +02:00
Timothy Flynn
39f971e42b LibUnicode: Begin implementing special Unicode case folding
This implements unconditional special case folding, and conditional
folding for non-locale cases. Worth noting that the only conditional,
non-locale special case is for converting an uppercase sigma to
lowercase.
2021-07-27 21:04:36 +01:00
Timothy Flynn
4dda3edc9e LibUnicode: Introduce a Unicode library for interacting with UCD files
The Unicode standard publishes the Unicode Character Database (UCD) with
information about every code point, such as each code point's upper case
mapping. LibUnicode exists to download and parse UCD files at build time
and to provide accessors to that data.

As a start, LibUnicode includes upper- and lower-case code point
converters.
2021-07-26 17:03:55 +01:00