Commit graph

90 commits

Author SHA1 Message Date
Timothy Flynn
e6334cb856 LibUnicode: Add some data related to currency codes
This data is published under ISO-4217 as an XML file. Since we can't
parse XML files yet, and the data isn't very large, it was translated to
C++ manually here.
2021-09-11 11:05:50 +01:00
Timothy Flynn
3ae4ff109f LibUnicode: Extract canonicalization of Unicode extension values
LibJS will need to canonicalize Unicode extension values, so extract the
lambda that was doing this work to its own function. This also changes
the helpers it invokes to take the provided key as a StringView because
we don't need (and won't always have) full String objects here.
2021-09-11 11:05:50 +01:00
Timothy Flynn
b1d4bcf364 LibUnicode: Generate numeric keyword values for each locale
This is needed for Intl.NumberFormat's usage of the ResolveLocale AO,
where the [[RelevantExtensionKeys]] internal slot will be "nu".
2021-09-11 11:05:50 +01:00
Timothy Flynn
4f2bcebe74 LibUnicode+LibJS: Store locale keyword values as a single string
Previously, LibUnicode would store the values of a keyword as a Vector.
For example, the locale "en-u-ca-abc-def" would have its keyword "ca"
stored as {"abc, "def"}. Then, canonicalization would occur on each of
the elements in that Vector.

This is incorrect because, for example, the keyword value "true" should
only be dropped if that is the entire value. That is, the canonical form
of "en-u-kb-true" is "en-u-kb", but "en-u-kb-abc-true" does not change
for canonicalization. However, we would canonicalize that locale as
"en-u-kb-abc".
2021-09-08 21:08:48 +01:00
Timothy Flynn
75657b79c6 LibUnicode: Update comment with link to related upstream issue
LibUnicode has to hard-code some aliases because the related data is not
available in the JSON export of CLDR. Turns out there is a ticket to add
this data in an upcoming CLDR release. Add a link to that ticket for
reference.
2021-09-08 21:08:48 +01:00
Timothy Flynn
3f64a14e06 LibUnicode: Parse and generate the Unicode locale list patterns dataset
This data informs consumers how to join lists of values. For example,
in en-US, the list ["a", "b", "c"] formatted to a string should become
"a, b, and c".
2021-09-06 23:49:56 +01:00
Timothy Flynn
40ea659282 LibUnicode+LibJS: Return removed extensions from remove_extension_type
Some callers will need to hold onto the removed extensions.
2021-09-06 15:24:27 +01:00
Timothy Flynn
50158abaf1 LibUnicode: Implement locale-aware BEFORE_DOT special casing
Note that the algorithm in the Unicode spec is for checking that a code
point precedes U+0307, but the special casing condition NotBeforeDot is
interested in the inverse of this rule.
2021-09-06 15:24:27 +01:00
Timothy Flynn
436faf9fd9 LibUnicode: Implement locale-aware MORE_ABOVE special casing 2021-09-06 15:24:27 +01:00
Timothy Flynn
1427ebc622 LibUnicode: Implement locale-aware AFTER_SOFT_DOTTED special casing 2021-09-06 15:24:27 +01:00
Timothy Flynn
0053d48c41 LibUnicode: Implement locale-aware AFTER_I special casing 2021-09-06 15:24:27 +01:00
Timothy Flynn
68b2680040 LibUnicode: Ensure case conversion methods increment the current index
There was one branch in these methods (the branch where a special casing
was found) that neglected to update the current index.
2021-09-06 15:24:27 +01:00
Timothy Flynn
12ae0a44d7 LibUnicode: Add public wrapper for the generated locale_from_string 2021-09-06 15:24:27 +01:00
Timothy Flynn
a77f323dfb LibUnicode: Implement the Remove Likely Subtags method
Unlike Add Likely Subtags, this method doesn't require generated data.
Instead, it is defined in terms of Add Likely Subtags.
2021-09-04 13:51:40 +01:00
Timothy Flynn
e6a2ab1202 LibUnicode: Generate an implementation of the Add Likely Subtags method 2021-09-04 13:51:40 +01:00
Timothy Flynn
ca90231794 LibUnicode: Define is_unicode_*_subtag helpers inline in their header
The UnicodeLocale generator will need to parse canonicalized locale
strings, and will require using these methods. However, the generator
cannot depend on LibUnicode because Locale.cpp within LibUnicode already
depends on the generated file. Instead, defining the methods that the
generator needs inline allows the generator to use them without linking
against LibUnicode.
2021-09-04 13:51:40 +01:00
Timothy Flynn
21c4922ac0 LibUnicode: Add helper methods to LocaleID and LanguageID for LibJS
Add a method to remove an extension type from the locale's extension set
and methods to convert a locale and language to a string without
canonicalization. Each of these will be used by LibJS.
2021-09-02 17:56:42 +01:00
Timothy Flynn
a05419db55 LibUnicode: Add lexer to test if a string matches the "type" production 2021-09-02 17:56:42 +01:00
Timothy Flynn
113bf4a9dd LibUnicode: Add missing structures to forwarding header 2021-09-02 17:56:42 +01:00
Timothy Flynn
fd0011989a LibUnicode: Resolve the most likely territory alias when there are many 2021-09-01 14:14:47 +01:00
Timothy Flynn
1fbc5dba08 LibUnicode: Generate Unicode locale likely subtag data
CLDR contains a set of likely subtag data where, given a locale, you can
resolve what is the most likely language, script, or territory of that
locale. This data is needed for resolving territory aliases. These
aliases might contain multiple territories, and we need to resolve which
of those territories is most likely correct for a locale.

Note that the likely subtag data is quite huge (a few thousand entries).
As an optimization encouraged by the spec, we only generate the smallest
subset of this data that we actually need (about 150 entries).
2021-09-01 14:14:47 +01:00
Timothy Flynn
72f49e42b4 LibUnicode: Perform complex Unicode locale alias substitution 2021-09-01 14:14:47 +01:00
Timothy Flynn
9ae7ac4c87 LibUnicode: Generate complex Unicode locale alias matching
Most alias substitutions are "simple", meaning that alias matching is
done by examining a single locale subtag. However, there are a handful
of "complex" aliases where matching is done by examining multiple
subtags. For example, the variant subtag "lojban" causes the locale
"art-lojban" to be canonicalized to "jbo", but only when the language
subtag is "art" (i.e. this should not occur for the locale "en-lojban").

This generates a method to perform complex alias matching.
2021-09-01 14:14:47 +01:00
Timothy Flynn
da89cf9afb LibUnicode: Canonicalize calendar subtags
Calendar subtags are a bit of an odd-man-out in that we must match the
variants "ethiopic-amete-alem" in that order, without any other variant
in the locale. So a separate method is needed for this, and we now defer
sorting the variant list until after other canonicalization is done.
2021-09-01 14:14:47 +01:00
Timothy Flynn
8458f477a4 LibUnicode: Canonicalize timezone subtags 2021-09-01 14:14:47 +01:00
Timothy Flynn
335f985b31 LibUnicode: Canonicalize the subtag "imperial" to "uksystem" 2021-09-01 14:14:47 +01:00
Timothy Flynn
2d90144888 LibUnicode: Canonicalize the subtag "primary" and "tertiary" to "levelN" 2021-09-01 14:14:47 +01:00
Timothy Flynn
409f39b336 LibUnicode: Canonicalize the subtag "names" to "prprname" 2021-09-01 14:14:47 +01:00
Timothy Flynn
f907a7dc38 LibUnicode: Canonicalize the subtag "yes" to "true" 2021-09-01 14:14:47 +01:00
Timothy Flynn
556374a904 LibUnicode: Substitute Unicode locale aliases during canonicalization
Unicode TR35 defines how locale subtag aliases should be emplaced when
converting a locale to canonical form. For most subtags, it is a simple
substitution. Language subtags depend on context; for example, the
language "sh" should become "sr-Latn", but if the original locale has a
script subtag already ("sh-Cyrl"), then only the language subtag of the
alias should be taken ("sr-Latn").

To facilitate this, we now make two passes when canonicalizing a locale.
In the first pass, we convert the LocaleID structure to canonical syntax
(where the conversions all happen in-place). In the second pass, we form
the canonical string based on the canonical syntax.
2021-09-01 14:14:47 +01:00
Timothy Flynn
9b118f1f06 LibUnicode: Generate Unicode locale alias data
CLDR contains a set of aliases for languages, territories, etc. that no
longer are meant to be used (e.g. due to deprecation). For example, the
language "aam" is deprecated and should be canonicalized as "aas".
2021-09-01 14:14:47 +01:00
Timothy Flynn
d13142f015 LibJS+LibUnicode: Store parsed Unicode locale data as full strings
Originally, it was convenient to store the parsed Unicode locale data as
views into the original string being parsed. But to implement locale
aliases will require mutating the data that was parsed. To prepare for
that, store the parsed data as proper strings.
2021-09-01 14:14:47 +01:00
Timothy Flynn
f897c2edb3 LibUnicode: Canonicalize locale private use extensions 2021-08-30 19:42:40 +01:00
Timothy Flynn
6f0cb52dc4 LibUnicode: Canonicalize locale extensions 2021-08-30 19:42:40 +01:00
Timothy Flynn
671eaa0c59 LibUnicode: Add helper lambda for appending canonicalized strings
Once canonical extensions are implemented, the number of:

    if (optional_string.has_value() {
        builder.append('-');
        builder.append(optional_string->to_lowercase_string());
    }

Will be quite large. This commit just adds a helper lambda to handle
this pattern to prevent this function from becoming even more enormous.
2021-08-30 19:42:40 +01:00
Timothy Flynn
30855e6663 LibUnicode: Parse locale private use extensions 2021-08-30 19:42:40 +01:00
Timothy Flynn
29f76ef7c8 LibUnicode: Parse locale extensions of the other extension form 2021-08-30 19:42:40 +01:00
Timothy Flynn
d2d304fcf8 LibUnicode: Parse locale extensions of the transformed extension form 2021-08-30 19:42:40 +01:00
Timothy Flynn
eda92d15e4 LibUnicode: Parse locale extensions of the Unicode locale extension form 2021-08-30 19:42:40 +01:00
Timothy Flynn
dd89901b07 LibUnicode: Use GenericLexer to parse Unicode language IDs
This is preparatory work to read locale extensions. The parser currently
enforces that the entire string is consumed. But to parse extensions,
parse_unicode_locale_id() will need parse_unicode_language_id() to just
stop parsing on the first segment that does not match the language ID
grammar. It will also need to know where the parsing stopped. Both of
these needs are fulfilled by GenericLexer.

The caveat is that we can no longer simply split the parsed string on
separator characters. So parse_unicode_language_id() now operates as a
small state machine.
2021-08-30 19:42:40 +01:00
Andrew Kaster
63956b36d0 Everywhere: Move all host tools into the Lagom/Tools subdirectory
This allows us to remove all the add_subdirectory calls from the top
level CMakeLists.txt that referred to targets linking LagomCore.

Segregating the host tools and Serenity targets helps us get to a place
where the main Serenity build can simply use a CMake toolchain file
rather than swapping all the compiler/sysroot variables after building
host libraries and tools.
2021-08-28 08:44:17 +01:00
Andrew Kaster
e88761b2b9 Meta+LibUnicode: Move unicode_data helper to Meta/CMake
Moving this helper CMake file to the centralized Meta/CMake folder helps
to get a better grasp on what extra files are required for the build,
and what files are generated.

While we're at it, don't use add_compile_definitions for
ENABLE_UNICODE_DATA, which only needs to be seen by LibUnicode sources.
2021-08-28 08:44:17 +01:00
Robert Syring
4f2dc8db26 LibUnicode: Change unzip commands to also extract subdirectories
Changed unzip commands from * to ** in order to also extract subdirectories from cldr.zip.
2021-08-28 08:13:32 +01:00
Timothy Flynn
8b93d51212 LibUnicode: Parse Unicode CLDR currencies and generate locale mappings 2021-08-27 12:32:24 +01:00
Timothy Flynn
297db925fc LibUnicode: Extract cldr-numbers dataset from CLDR database
This dataset holds the values needed to handle DisplayNames.prototype.of
with a type option of "currency".
2021-08-27 12:32:24 +01:00
Timothy Flynn
0f02def3c2 LibUnicode: Parse Unicode CLDR scripts and generate locale mappings 2021-08-27 12:32:24 +01:00
Timothy Flynn
ab7a1dd89e LibUnicode: Parse Unicode CLDR languages and generate locale mappings 2021-08-27 12:32:24 +01:00
Timothy Flynn
6719e5cb17 LibUnicode: Generate locale subtag data as multiple smaller tables
This commit is preemptive to upcoming commits which add more subtags to
the CLDR generator. Rather than generating a giant HashMap containing
all data, generate more (smaller) Array-based tables. This mimics the
UCD generator. This also allows simpler lookups at runtime since we can
generate index-based lookups into the smaller tables rather easily.

Without this change, adding the remaining locale subtags would result
in the generation and compilation of UnicodeLocale.cpp taking about 30s
on my machine. With this change, it takes about half that. Additionally,
the size of the generated file reduces by about 1.5MB.
2021-08-27 12:32:24 +01:00
Timothy Flynn
b8ad4d302e LibUnicode: Move Locale enumeration from generated UCD data to CLDR data
The UCD set of data contained a very small subset of all locales just to
handle some special casing rules. This enumeration will be needed within
the CLDR generator as well. So rather than duplicate the enum, remove it
from the UCD generator in favor of the full list of locales known by the
CLDR generator.
2021-08-27 12:32:24 +01:00
Timothy Flynn
a57615c2b4 Meta: Ensure cmake fails if we are unable to unzip the CLDR database 2021-08-26 23:40:23 +02:00