Commit graph

88 commits

Author SHA1 Message Date
Timothy Flynn
ffb3ba3079 Tests: Link some tests directly against LibUnicodeData
These were missed in 565a880ce5.

This wasn't an issue because these tests don't pledge/unveil anything,
so they could happily dlopen() the library at runtime. But this is now
needed in order to migrate LibUnicode towards weak symbols instead.
2022-01-04 22:49:43 +00:00
Timothy Flynn
7e6ad172a4 LibUnicode: Support code point names that apply to ranges of code points
For example, consider the following adjacent entries in UnicodeData.txt:

    3400;<CJK Ideograph Extension A, First>;Lo;0;L;;;;;N;;;;;
    4DBF;<CJK Ideograph Extension A, Last>;Lo;0;L;;;;;N;;;;;

Our current implementation would assign the display name "CJK Ideograph
Extension A" to code points U+3400 & U+4DBF, but not to the code points
in between. Not only should those code points be assigned a name, but
the Unicode spec also has formatting rules on what the names should be
(the names for these ranged code points are not as they appear in
UnicodeData.txt).

The spec also defines names for code point ranges that actually are
listed individually in UnicodeData.txt. For example:

    2F800;CJK COMPATIBILITY IDEOGRAPH-2F800;Lo;0;L;4E3D;;;;N;;;;;
    2F801;CJK COMPATIBILITY IDEOGRAPH-2F801;Lo;0;L;4E38;;;;N;;;;;
    2F802;CJK COMPATIBILITY IDEOGRAPH-2F802;Lo;0;L;4E41;;;;N;;;;;

Code points are only coalesced into a range if all fields after the name
are equivalent. Our parser will insert the range and its name formatting
pattern when it comes across the first code point in that range, then
ignore other code points in that range. This reduces the number of names
we generated by nearly 2,000.
2021-11-30 11:24:02 +01:00
Timothy Flynn
93ee922027 LibUnicode: Support locales-without-script aliases for ECMA-402
As noted by ECMA-402, if a supported locale contains all of a language,
script, and region subtag, then the implementation must also support the
locale without the script subtag. The most complicated example of this
is the zh-TW locale.

The list of locales in the CLDR database does not include zh-TW or its
maximized zh-Hant-TW variant. Instead, it inlcudes the zh-Hant locale.
However, zh-Hant-TW is listed in the default-content locale list in the
cldr-core package. This defines an alias from zh-Hant-TW to zh-Hant. We
must then also support the zh-Hant-TW alias without the script subtag:
zh-TW. This transitively maps zh-TW to zh-Hant, which is a case quite
heavily tested by test262.
2021-11-19 11:45:35 +01:00
Timothy Flynn
357c97dfa8 LibUnicode: Parse the CLDR's defaultContent.json locale list
This file contains the list of locales which default to their parent
locale's values. In the core CLDR dataset, these locales have their own
files, but they are empty (except for identity data). For example:

https://github.com/unicode-org/cldr/blob/main/common/main/en_US.xml

In the JSON export, these files are excluded, so we currently are not
recognizing these locales just by iterating the locale files.

This is a prerequisite for upgrading to CLDR version 40. One of these
default-content locales is the popular "en-US" locale, which defaults to
"en" values. We were previously inferring the existence of this locale
from the "en-US-POSIX" locale (many implementations, including ours,
strip variants such as POSIX). However, v40 removes the "en-US-POSIX"
locale entirely, meaning that without this change, we wouldn't know that
"en-US" exists (we would default to "en").

For more detail on this and other v40 changes, see:
https://cldr.unicode.org/index/downloads/cldr-40#h.nssoo2lq3cba
2021-11-09 20:44:52 +01:00
Timothy Flynn
4f2bcebe74 LibUnicode+LibJS: Store locale keyword values as a single string
Previously, LibUnicode would store the values of a keyword as a Vector.
For example, the locale "en-u-ca-abc-def" would have its keyword "ca"
stored as {"abc, "def"}. Then, canonicalization would occur on each of
the elements in that Vector.

This is incorrect because, for example, the keyword value "true" should
only be dropped if that is the entire value. That is, the canonical form
of "en-u-kb-true" is "en-u-kb", but "en-u-kb-abc-true" does not change
for canonicalization. However, we would canonicalize that locale as
"en-u-kb-abc".
2021-09-08 21:08:48 +01:00
Timothy Flynn
50158abaf1 LibUnicode: Implement locale-aware BEFORE_DOT special casing
Note that the algorithm in the Unicode spec is for checking that a code
point precedes U+0307, but the special casing condition NotBeforeDot is
interested in the inverse of this rule.
2021-09-06 15:24:27 +01:00
Timothy Flynn
436faf9fd9 LibUnicode: Implement locale-aware MORE_ABOVE special casing 2021-09-06 15:24:27 +01:00
Timothy Flynn
1427ebc622 LibUnicode: Implement locale-aware AFTER_SOFT_DOTTED special casing 2021-09-06 15:24:27 +01:00
Timothy Flynn
0053d48c41 LibUnicode: Implement locale-aware AFTER_I special casing 2021-09-06 15:24:27 +01:00
Timothy Flynn
a05419db55 LibUnicode: Add lexer to test if a string matches the "type" production 2021-09-02 17:56:42 +01:00
Andrew Kaster
58797a1289 Tests: Remove all file(GLOB) from CMakeLists in Tests
Using a file(GLOB) to find all the test files in a directory is an easy
hack to get things started, but has some drawbacks. Namely, if you add
a test, it won't be found again without re-running CMake. `ninja` seems
to do this automatically, but it would be nice to one day stop seeing it
rechecking our globbed directories.
2021-09-02 09:08:23 +02:00
Timothy Flynn
fd0011989a LibUnicode: Resolve the most likely territory alias when there are many 2021-09-01 14:14:47 +01:00
Timothy Flynn
72f49e42b4 LibUnicode: Perform complex Unicode locale alias substitution 2021-09-01 14:14:47 +01:00
Timothy Flynn
da89cf9afb LibUnicode: Canonicalize calendar subtags
Calendar subtags are a bit of an odd-man-out in that we must match the
variants "ethiopic-amete-alem" in that order, without any other variant
in the locale. So a separate method is needed for this, and we now defer
sorting the variant list until after other canonicalization is done.
2021-09-01 14:14:47 +01:00
Timothy Flynn
8458f477a4 LibUnicode: Canonicalize timezone subtags 2021-09-01 14:14:47 +01:00
Timothy Flynn
335f985b31 LibUnicode: Canonicalize the subtag "imperial" to "uksystem" 2021-09-01 14:14:47 +01:00
Timothy Flynn
2d90144888 LibUnicode: Canonicalize the subtag "primary" and "tertiary" to "levelN" 2021-09-01 14:14:47 +01:00
Timothy Flynn
409f39b336 LibUnicode: Canonicalize the subtag "names" to "prprname" 2021-09-01 14:14:47 +01:00
Timothy Flynn
f907a7dc38 LibUnicode: Canonicalize the subtag "yes" to "true" 2021-09-01 14:14:47 +01:00
Timothy Flynn
556374a904 LibUnicode: Substitute Unicode locale aliases during canonicalization
Unicode TR35 defines how locale subtag aliases should be emplaced when
converting a locale to canonical form. For most subtags, it is a simple
substitution. Language subtags depend on context; for example, the
language "sh" should become "sr-Latn", but if the original locale has a
script subtag already ("sh-Cyrl"), then only the language subtag of the
alias should be taken ("sr-Latn").

To facilitate this, we now make two passes when canonicalizing a locale.
In the first pass, we convert the LocaleID structure to canonical syntax
(where the conversions all happen in-place). In the second pass, we form
the canonical string based on the canonical syntax.
2021-09-01 14:14:47 +01:00
Timothy Flynn
d13142f015 LibJS+LibUnicode: Store parsed Unicode locale data as full strings
Originally, it was convenient to store the parsed Unicode locale data as
views into the original string being parsed. But to implement locale
aliases will require mutating the data that was parsed. To prepare for
that, store the parsed data as proper strings.
2021-09-01 14:14:47 +01:00
Timothy Flynn
f897c2edb3 LibUnicode: Canonicalize locale private use extensions 2021-08-30 19:42:40 +01:00
Timothy Flynn
6f0cb52dc4 LibUnicode: Canonicalize locale extensions 2021-08-30 19:42:40 +01:00
Timothy Flynn
30855e6663 LibUnicode: Parse locale private use extensions 2021-08-30 19:42:40 +01:00
Timothy Flynn
29f76ef7c8 LibUnicode: Parse locale extensions of the other extension form 2021-08-30 19:42:40 +01:00
Timothy Flynn
d2d304fcf8 LibUnicode: Parse locale extensions of the transformed extension form 2021-08-30 19:42:40 +01:00
Timothy Flynn
eda92d15e4 LibUnicode: Parse locale extensions of the Unicode locale extension form 2021-08-30 19:42:40 +01:00
Timothy Flynn
b7a95cba65 LibUnicode: Implement grammar validators for Unicode TR-35
ECMA-402 requires validating user input against the EBNF grammar for
Unicode locales described in TR-35: https://www.unicode.org/reports/tr35

This commit adds validators for that grammar, as well as other helper to
e.g. canonicalize a locale string.
2021-08-26 22:04:09 +01:00
Timothy Flynn
1e91334008 LibUnicode: Handle edge-case script extensions, Common and Inherited
These script extensions have some peculiar behavior in the Unicode spec.
The UCD ScriptExtension file does not contain these scripts. Rather, it
is implied the code points which have these scripts as an extension are
the code points that both:

  1. Have Common or Inherited as their primary script value
  2. Do not have any other script value in their script extension lists

Because these are not explictly listed in the UCD, we must manually form
these script extensions.
2021-08-11 13:11:01 +02:00
Timothy Flynn
47bb350ebd LibUnicode: Generate separate tables for scripts and script extensions
Notice that unlike the note in populate_general_category_unions(),
script extension do indeed have code point ranges which overlap. Thus,
this commit adds code to handle that, and hooks it into the GC unions.
2021-08-11 13:11:01 +02:00
Timothy Flynn
5ac23d244d LibUnicode: Generate separate tables for Unicode properties
Similar to General Categories, this generates separate tables for the
Property list.
2021-08-11 13:11:01 +02:00
Timothy Flynn
b06c104076 LibUnicode: Include Unassigned code points in the Other General Category
Now that the generator parses unassigned General Category properties, it
can include Unassigned (Cn) in the Other (C) category.
2021-08-11 13:11:01 +02:00
Timothy Flynn
7dce2bfe23 LibUnicode: Generate separate tables for General Category properties
Previously, each code point's General Category was part of the generated
UnicodeData structure. This ultimately presented two problems, one
functional and one performance related:

  * Some General Categories are applied to unassigned code points, for
    example the Unassigned (Cn) category. Unassigned code points are
    strictly excluded from UnicodeData.txt, so by relying on that file,
    the generator is unable to handle these categories.

  * Lookups for General Categories are slower when searching through the
    large UnicodeData hash map. Even though lookups are O(1), the hash
    function turned out to be slower than binary searching through a
    category-specific table.

So, now a table is generated for each General Category. When querying a
code point for a category, a binary search is done on each code point
range in that category's table to check if code point has that category.

Further, General Categories are now parsed from the UCD file
DerivedGeneralCategory.txt. This file is a normal "prop list" file and
contains the categories for unassigned code points.
2021-08-11 13:11:01 +02:00
Timothy Flynn
c4bfda7f7f LibUnicode: Handle code points that are both cased and case-ignorable
Apparently, some code points fit both categories, for example U+0345
(COMBINING GREEK YPOGEGRAMMENI). Handle this fact when determining if
a code point is a final code point in a string.
2021-07-28 23:42:29 +02:00
Timothy Flynn
7827aede6f LibUnicode: Check word break when deciding on case-ignorable code points 2021-07-28 23:42:29 +02:00
Timothy Flynn
c45a014645 LibUnicode: Check property list when deciding if a code point is cased 2021-07-28 23:42:29 +02:00
Timothy Flynn
39f971e42b LibUnicode: Begin implementing special Unicode case folding
This implements unconditional special case folding, and conditional
folding for non-locale cases. Worth noting that the only conditional,
non-locale special case is for converting an uppercase sigma to
lowercase.
2021-07-27 21:04:36 +01:00
Timothy Flynn
4dda3edc9e LibUnicode: Introduce a Unicode library for interacting with UCD files
The Unicode standard publishes the Unicode Character Database (UCD) with
information about every code point, such as each code point's upper case
mapping. LibUnicode exists to download and parse UCD files at build time
and to provide accessors to that data.

As a start, LibUnicode includes upper- and lower-case code point
converters.
2021-07-26 17:03:55 +01:00