For example, there isn't a unique set of data for the en-US locale;
rather, it defaults to the data for the en locale. See this commit for
much more detail: 357c97dfa8
Currently, LibUnicode is only parsing and generating the "long" style of
currency display names. However, the CLDR contains "short" and "narrow"
forms as well that need to be handled. Parse these, and update LibJS to
actually respect the "style" option provided by the user for displaying
currencies with Intl.DisplayNames.
Note: There are some discrepancies between the engines in how style is
handled. In particular, running:
new Intl.DisplayNames('en', {type:'currency', style:'narrow'}).of('usd')
Gives:
SpiderMonkey: "USD"
V8: "US Dollar"
LibJS: "$"
And running:
new Intl.DisplayNames('en', {type:'currency', style:'short'}).of('usd')
Gives:
SpiderMonkey: "$"
V8: "US Dollar"
LibJS: "$"
My best guess is that V8 isn't handling style at all, and just returns
the long form (which is what LibJS did before this commit). SpiderMonkey
can handle some styles, but if it doesn't have a value for the requested
style, it falls back to the canonicalized code passed into of().
The data used for number formatting is going to grow quite a bit when
the cldr-units package is parsed. To prevent the generated UnicodeLocale
file from growing outrageously large, the number formatting data can go
into its own file. To prepare for this, move code that will be common
between the generators for UnicodeLocale and UnicodeNumberFormat to the
utility header.
This will be needed for the ComputeExponentForMagnitude AO for compact
formatting, namely step 5b:
    Let exponent be an implementation- and locale-dependent (ILD) integer
    by which to scale a number of the given magnitude in compact notation
    for the current locale.
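Roughly, the lookup ends up behaving like the following sketch, with a
hypothetical table derived from the en compact-short patterns ("1K"
through "100K" share exponent 3, "1M" through "100M" share exponent 6);
this is illustrative, not the actual LibJS/LibUnicode data structures:

    #include <array>

    struct CompactExponent {
        int magnitude; // floor(log10(number))
        int exponent;  // ILD exponent by which to scale the number
    };

    constexpr std::array<CompactExponent, 6> en_short_exponents { {
        { 3, 3 }, { 4, 3 }, { 5, 3 },
        { 6, 6 }, { 7, 6 }, { 8, 6 },
    } };

    int compute_exponent_for_magnitude(int magnitude)
    {
        for (auto const& entry : en_short_exponents) {
            if (entry.magnitude == magnitude)
                return entry.exponent;
        }
        return 0; // No compact pattern for this magnitude: don't scale.
    }

    // compute_exponent_for_magnitude(4) == 3, so 45,000 formats as "45K".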
A number formatting pattern in the CLDR contains one or two entries,
delimited by a semi-colon. Previously, LibUnicode was just storing the
entire pattern as one string. This changes the generator to split the
pattern on that delimiter and generate the 3 unique patterns expected by
ECMA-402.
The rules for generating the 3 patterns are as follows:
* If the pattern contains 1 entry, it is the zero pattern. The positive
pattern is the zero pattern prepended with {plusSign}. The negative
pattern is the zero pattern prepended with {minusSign}.
* If the pattern contains 2 entries, the first is the zero pattern, and
the second is the negative pattern. The positive pattern is the zero
pattern prepended with {plusSign}.
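A minimal sketch of that splitting logic, in standalone C++ (not the
actual generator code):

    #include <string>

    struct NumberFormatPatterns {
        std::string zero;
        std::string positive;
        std::string negative;
    };

    NumberFormatPatterns split_cldr_pattern(std::string const& pattern)
    {
        auto semicolon = pattern.find(';');
        if (semicolon == std::string::npos) {
            // One entry: it is the zero pattern; derive the other two.
            return { pattern, "{plusSign}" + pattern, "{minusSign}" + pattern };
        }
        auto zero = pattern.substr(0, semicolon);
        auto negative = pattern.substr(semicolon + 1);
        return { zero, "{plusSign}" + zero, negative };
    }

    // split_cldr_pattern("#,##0.00;(#,##0.00)") yields:
    //   zero:     "#,##0.00"
    //   positive: "{plusSign}#,##0.00"
    //   negative: "(#,##0.00)"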
The number system data in the CLDR contains information on how to format
numbers in a locale-dependent manner. Start parsing this data, beginning
with numeric symbol strings. For example, the symbol NaN maps to "NaN"
in the en-US locale, and to "非數值" in the zh-Hant locale.
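The kind of lookup this data enables looks roughly like this sketch
(the enum and function names are illustrative, not the actual
LibUnicode API):

    #include <string_view>

    enum class NumericSymbol { Decimal, Group, MinusSign, NaN };

    // Hypothetical hard-coded subset of the parsed CLDR symbol data.
    std::string_view numeric_symbol(std::string_view locale, NumericSymbol symbol)
    {
        if (symbol == NumericSymbol::NaN)
            return locale == "zh-Hant" ? "非數值" : "NaN";
        return {};
    }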
Some locales in the CLDR have alternate default numbering systems listed
under "defaultNumberingSystem-alt-*", e.g.:
"defaultNumberingSystem": "arab",
"defaultNumberingSystem-alt-latn": "latn",
"otherNumberingSystems": {
"native": "arab"
},
We were previously only parsing "defaultNumberingSystem" and
"otherNumberingSystems". This odd format appears to be an artifact of
converting from XML.
This isn't particularly important because this generates code that is
quite hidden from outside callers. But when viewing the generated code,
it's a bit nicer to read enum identifiers such as "MinusSign" rather
than "Minussign".
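The naming fix amounts to something like this sketch (hypothetical
helper; the real generator does more):

    #include <cctype>
    #include <string>

    // Upper-case only the first character, keeping interior capitals, so
    // "minusSign" becomes "MinusSign" rather than "Minussign".
    std::string to_enum_identifier(std::string name)
    {
        if (!name.empty())
            name[0] = static_cast<char>(std::toupper(static_cast<unsigned char>(name[0])));
        return name;
    }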
This file contains the list of locales which default to their parent
locale's values. In the core CLDR dataset, these locales have their own
files, but they are empty (except for identity data). For example:
https://github.com/unicode-org/cldr/blob/main/common/main/en_US.xml
In the JSON export, these files are excluded, so we currently do not
recognize these locales just by iterating over the locale files.
This is a prerequisite for upgrading to CLDR version 40. One of these
default-content locales is the popular "en-US" locale, which defaults to
"en" values. We were previously inferring the existence of this locale
from the "en-US-POSIX" locale (many implementations, including ours,
strip variants such as POSIX). However, v40 removes the "en-US-POSIX"
locale entirely, meaning that without this change, we wouldn't know that
"en-US" exists (we would default to "en").
For more detail on this and other v40 changes, see:
https://cldr.unicode.org/index/downloads/cldr-40#h.nssoo2lq3cba
Typically size_t is used for indices, but we can take advantage of the
knowledge that there are only approximately 46K unique strings in the
generated UnicodeLocale.cpp file. Therefore, we can get away with using
u16 to store indices. There is a VERIFY that will fail if we ever exceed
the limits of u16.
On x86_64 builds, this reduces libunicode.so from 9.2 MiB to 7.3 MiB.
On i686 builds, this reduces libunicode.so from 3.9 MiB to 3.3 MiB.
These savings are entirely in the .rodata section of the shared library.
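The idea, sketched with standard types (AK's VERIFY replaced by assert;
the type alias and helper name are illustrative):

    #include <cassert>
    #include <cstdint>
    #include <limits>

    using StringIndexType = std::uint16_t;

    // Called at generation time when assigning an index to a unique string;
    // fails loudly if we ever exceed what u16 can address.
    StringIndexType to_string_index(std::size_t index)
    {
        assert(index <= std::numeric_limits<StringIndexType>::max());
        return static_cast<StringIndexType>(index);
    }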
The *_from_string() and resolve_*_alias() generated methods are the last
remaining users of HashMap in the LibUnicode generated files (read: the
last methods not using compile-time structures). This converts these
methods to use an array of pairs mapping hash values to the desired
lookup values.
Because this code generation is the same between GenerateUnicodeData.cpp
and GenerateUnicodeLocale.cpp, this adds a GeneratorUtil.h header to the
LibUnicode generators to contain the method that generates the methods.
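The generated shape is roughly as follows (hash values and names are
made up here; the real tables are emitted by the generator):

    #include <algorithm>
    #include <array>
    #include <cstdint>
    #include <optional>

    enum class Language : std::uint16_t { En, Fr, Zh };

    struct HashValuePair {
        std::uint32_t hash { 0 };
        Language value {};
    };

    // Emitted sorted by hash so lookups can binary-search in .rodata
    // instead of populating a HashMap at startup.
    constexpr std::array<HashValuePair, 3> language_pairs { {
        { 0x11111111u, Language::En },
        { 0x22222222u, Language::Fr },
        { 0x33333333u, Language::Zh },
    } };

    std::optional<Language> language_from_hash(std::uint32_t hash)
    {
        auto it = std::lower_bound(language_pairs.begin(), language_pairs.end(), hash,
            [](HashValuePair const& pair, std::uint32_t h) { return pair.hash < h; });
        if (it != language_pairs.end() && it->hash == hash)
            return it->value;
        return std::nullopt;
    }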
The list-format strings used for Intl.ListFormat are small, but quite
heavily duplicated. For example, the string "{0}, {1}" appears 6,519
times. Generate unique strings for this data to avoid duplication.
In the generated UnicodeLocale.cpp file, there are 296,408 strings for
localizations of languages, territories, scripts, currencies & keywords.
Of these, only 43,848 (14.8%) are actually unique, so there are quite a
large number of duplicated strings.
This generates a single compile-time array to store these strings. The
arrays for the localizations now store an index into this single array
rather than duplicating any strings.
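The generator-side deduplication is essentially this sketch (names are
hypothetical, but the idea is the same):

    #include <cstdint>
    #include <string>
    #include <unordered_map>
    #include <vector>

    struct UniqueStringStorage {
        std::vector<std::string> strings;
        std::unordered_map<std::string, std::uint16_t> index_of;

        // Returns the index for this string, storing it only on first sight.
        std::uint16_t ensure(std::string const& string)
        {
            if (auto it = index_of.find(string); it != index_of.end())
                return it->second;
            auto index = static_cast<std::uint16_t>(strings.size());
            strings.push_back(string);
            index_of.emplace(string, index);
            return index;
        }
    };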
Some CLDR languages.json / territories.json files contain localizations
for some languages/territories that are otherwise not present in the CLDR
database. We already don't generate anything in UnicodeLocale.cpp for
these anomalies, but this will stop us from even storing that data in
the generator's memory.
This doesn't affect the output of the generator, but will have an effect
after an upcoming commit to unique-ify all of the strings in the CLDR.
This removes the awkward String::replace API, which was the only String
API that mutated the String, and replaces it with a new immutable
version that returns a new String with the replacements applied. This
also fixes a couple of UAFs that were caused by the use of this API.
As an optimization, an equivalent StringView::replace API was also added
to remove an unnecessary String allocation in the form of:
`String { view }.replace(...);`
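For reference, the behavior of an immutable replace, sketched with
standard types rather than AK's String/StringView:

    #include <string>
    #include <string_view>

    // Returns a new string; the original is left untouched, avoiding the
    // use-after-free hazards of mutating in place.
    std::string replace_all(std::string_view haystack, std::string_view needle,
        std::string_view replacement)
    {
        if (needle.empty())
            return std::string(haystack);
        std::string result;
        std::size_t position = 0;
        while (true) {
            auto found = haystack.find(needle, position);
            if (found == std::string_view::npos) {
                result.append(haystack.substr(position));
                return result;
            }
            result.append(haystack.substr(position, found - position));
            result.append(replacement);
            position = found + needle.length();
        }
    }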
There are only a couple of cases like this, but some locale paths in
the CLDR contain variants. For example, there isn't an en-US path, but
there is an en-US-POSIX path. This interferes with the
operation to search for locales by name. The algorithm is such that
searching for en-US will not result in en-US-POSIX being found. To
resolve this, we should remove variants from the locale name.
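Stripping variants can be sketched like this in standard C++ (the BCP 47
rule is that a variant subtag is 5-8 alphanumerics, or a digit followed
by 3 more characters; corner cases are glossed over here):

    #include <cctype>
    #include <sstream>
    #include <string>

    static bool is_variant_subtag(std::string const& subtag)
    {
        // "POSIX" matches (5 characters); "en" and "US" do not.
        if (subtag.size() >= 5 && subtag.size() <= 8)
            return true;
        return subtag.size() == 4 && std::isdigit(static_cast<unsigned char>(subtag[0]));
    }

    std::string remove_variants(std::string const& locale)
    {
        std::istringstream stream(locale);
        std::string subtag;
        std::string result;
        while (std::getline(stream, subtag, '-')) {
            if (is_variant_subtag(subtag))
                continue;
            if (!result.empty())
                result += '-';
            result += subtag;
        }
        return result; // remove_variants("en-US-POSIX") == "en-US"
    }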
This data informs consumers how to join lists of values. For example,
in en-US, the list ["a", "b", "c"] formatted to a string should become
"a, b, and c".
The number of aliases in the likely-subtags dataset is quite large, so
this also needed to change the way the data is generated. Otherwise, the
compiler would complain about the size of the generated code.
Previously, a static method was generated that would effectively parse
the dataset into a HashMap of Unicode::LanguageID at runtime. We now
perform that parsing at generation-time, and instead generate an Array
of a structure similar to Unicode::LanguageID (we cannot use the same
structure because it contains String and Optional, which cannot be used
at compile-time).
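A minimal sketch of why a parallel structure is needed and what it
might look like (names hypothetical; the real one is emitted by the
generator):

    #include <string_view>

    // Unicode::LanguageID holds String and Optional members, which cannot
    // be constructed at compile time. Views plus "empty means absent" can.
    struct CompileTimeLanguageID {
        std::string_view language;
        std::string_view script; // empty if not present
        std::string_view region; // empty if not present
    };

    constexpr CompileTimeLanguageID likely_subtag_values[] {
        // Hypothetical entries resembling the likely-subtags data.
        { "en", "Latn", "US" },
        { "zh", "Hans", "CN" },
    };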
CLDR contains a set of likely subtag data where, given a locale, you can
resolve the most likely language, script, or territory of that
locale. This data is needed for resolving territory aliases. These
aliases might contain multiple territories, and we need to resolve which
of those territories is most likely correct for a locale.
Note that the likely subtag data is quite huge (a few thousand entries).
As an optimization encouraged by the spec, we only generate the smallest
subset of this data that we actually need (about 150 entries).
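The resolution step can be sketched as follows (standard C++, names
hypothetical), following TR-35's rule of preferring the likely
territory when it appears in the replacement list:

    #include <algorithm>
    #include <string_view>
    #include <vector>

    std::string_view resolve_territory_alias(
        std::string_view likely_territory, // from likely subtags, e.g. "RU" for "ru"
        std::vector<std::string_view> const& replacements)
    {
        if (std::find(replacements.begin(), replacements.end(), likely_territory) != replacements.end())
            return likely_territory;
        return replacements.front(); // fall back to the first listed territory
    }

    // The deprecated territory "SU" maps to "RU AM AZ BY ..."; for "ru-SU",
    // the likely territory of "ru" is "RU", so it canonicalizes to "ru-RU".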
Most alias substitutions are "simple", meaning that alias matching is
done by examining a single locale subtag. However, there are a handful
of "complex" aliases where matching is done by examining multiple
subtags. For example, the variant subtag "lojban" causes the locale
"art-lojban" to be canonicalized to "jbo", but only when the language
subtag is "art" (i.e. this should not occur for the locale "en-lojban").
This generates a method to perform complex alias matching.
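The generated check amounts to something like this sketch (hypothetical
names, simplified to a single variant subtag):

    #include <optional>
    #include <string_view>

    struct LocaleID {
        std::string_view language;
        std::string_view variant; // simplified: at most one variant subtag
    };

    // "art-lojban" canonicalizes to "jbo", but "en-lojban" is left alone
    // because the language subtag doesn't match.
    std::optional<std::string_view> resolve_complex_language_alias(LocaleID const& locale)
    {
        if (locale.language == "art" && locale.variant == "lojban")
            return "jbo";
        return std::nullopt;
    }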
CLDR contains a set of aliases for languages, territories, etc. that
are no longer meant to be used (e.g. due to deprecation). For example, the
language "aam" is deprecated and should be canonicalized as "aas".
This allows us to remove all the add_subdirectory calls from the top
level CMakeLists.txt that referred to targets linking LagomCore.
Segregating the host tools and Serenity targets helps us get to a place
where the main Serenity build can simply use a CMake toolchain file
rather than swapping all the compiler/sysroot variables after building
host libraries and tools.