To prepare for using plural rules within number & duration format, this
removes the NumberFormat::Plurality enumeration.
This also adds PluralCategory::ExactlyZero & PluralCategory::ExactlyOne.
These are used in locales like French, where PluralCategory::One really
means any value from 0.00 to 1.99. PluralCategory::ExactlyOne means only
the value 1, as the name implies. These exact rules are not known by the
general plural rules, they are explicitly for number / currency format.
The PluralCategory enum is currently generated for plural rules. Instead
of generating it, this moves the enum to the public LibUnicode header.
While it was nice to auto-discover these values, they are well defined
by TR-35, and we will need their values from within the number format
code generator (which can't rely on the plural rules generator having
run yet). Further, number format will require additional values in the
enum that plural rules doesn't know about.
Plural rules in the CLDR are of the form:
"cs": {
"pluralRule-count-one": "i = 1 and v = 0 @integer 1",
"pluralRule-count-few": "i = 2..4 and v = 0 @integer 2~4",
"pluralRule-count-many": "v != 0 @decimal 0.0~1.5, 10.0, 100.0 ...",
"pluralRule-count-other": "@integer 0, 5~19, 100, 1000, 10000 ..."
}
The syntax is described here:
https://unicode.org/reports/tr35/tr35-numbers.html#Plural_rules_syntax
There are up to 2 sets of rules for each locale, a cardinal set and an
ordinal set. The approach here is to generate a C++ function for each
set of rules. Each condition in the rules (e.g. "i = 1 and v = 0") is
transpiled to a C++ if-statement within its function. Then lookup tables
are generated to match locales to their generated functions.
NOTE: -Wno-parentheses-equality is added to the LibUnicodeData compile
flags because the generated plural rules have lots of extra parentheses
(because e.g. we need to selectively negate and combine rules). The code
to generate only exactly the right number of parentheses is quite hairy,
so this just tells the compiler to ignore the extras.
This includes:
* The minimum number of days in a week for that week to count as the
first week of a new year.
* The day to be shown as the first day of the week in a calendar.
* The start/end days of the weekend.
Like the existing hour cycle data, week data is presented per-region in
the CLDR, rather than per-locale. The method to add likely subtags to a
locale to perform region lookups is the same.
The list of regions in the CLDR for hour cycle, minimum days, first day,
and weekend days are quite different. So rather than changing the
existing HourCycleRegion enum to a generic Region enum, we generate
separate enums for each of the week data fields. This allows each lookup
into these fields to remain simple array-based index access, without any
"jumps" for regions that don't have CLDR data for a field.
Currently contains just each locale's character order, but is set up to
easily add other text layout fields from the CLDR if ECMA-402 eventually
requires them.
This commit has no behavior changes.
In particular, this does not fix any of the wrong uses of the previous
default parameter (which used to be 'false', meaning "only replace the
first occurence in the string"). It simply replaces the default uses by
String::replace(..., ReplaceMode::FirstOnly), leaving them incorrect.
BCP 47 will be the single source of truth for known calendar and number
system keywords, and their aliases (e.g. "gregory" is an alias for
"gregorian"). Move the generation of available keywords to where we
parse the BCP 47 data, so that hard-coded aliases may be removed from
other generators.
We have a fair amount of hard-coded keywords / aliases that can now be
replaced with real data from BCP 47. As a result, the also changes the
awkward way we were previously generating keys. Before, we were more or
less generating keywords as a CSV list of keys, e.g. for the "nu" key,
we'd generate "latn,arab,grek" (ordered by locale preference). Then at
runtime, we'd split on the comma. We now just generate spans of keywords
directly.
As we didn't (and still don't) have Intl.PluralRules when we implemented
Intl.NumberFormat, we use a locale-unaware basic implementation to pick
a pattern based on a number's value. Templatize this method for now to
work other other format-like structures (will be used for relative-time
formatting).
Relative-time format patterns are of one of two forms:
* Tensed - refer to the past or the future, e.g. "N years ago" or
"in N years".
* Numbered - refer to a specific numeric value, e.g. "in 1 year"
becomes "next year" and "in 0 years" becomes "this year".
In ECMA-402, tensed and numbered refer to the numeric formatting options
of "always" and "auto", respectively.
Previously, we were breaking up digits into groups without regard for
the locale's minimumGroupingDigits value in the CLDR. This value is 1 in
most locales, but is 2 in locales such as pl-PL. What this means is that
in those locales, the group separator should only be inserted if the
thousands group has at least 2 digits. So 1000 is formatted as "1,000"
in en-US, but "1000" in pl-PL. And 10000 is "10,000" in en-US and
"10 000" in pl-PL.
This is no longer needed now that LibTimeZone is included within LibC.
Remove the direct linkage so that others do not mistakenly copy-paste
the CMakeLists text elsewhere.
Currently, the UnicodeLocale generator collects a list of known locales
from the CLDR before processing language display names. For each locale,
the identifier is broken into language, script, and region subtags, and
we create a list of seen languages. When processing display names, we
skip languages we hadn't seen in that first step.
This is insufficient for language display names like "en-GB", which do
not have an locale entry in the CLDR, and thus are skipped. So instead,
create the list of known languages by actually reading through the list
of languages which have a display name.
These patterns indicate how to display locale strings when that locale
contains multiple subtags. For example, "en-US" would be displayed as
"English (United States)".
Before LibUnicode generated methods were weakly linked, we had a public
method (get_locale_currency_mapping) for retrieving currency mappings.
That method invoked one of several style-specific methods that only
existed in the generated UnicodeLocale.
One caveat of weakly linked functions is that every such function must
have a public declaration. The result is that each of those styled
methods are declared publicly, which makes the wrapper redundant
because it is just as easy to invoke the method for the desired style.
Note there's a bit of an unfortunate duplication in the calendar enum
generated by UnicodeLocale and the existing enum generated by
UnicodeDateTimeFormat. The former contains every calendar known to the
CLDR, whereas the latter contains the calendars we've actually parsed
for DateTimeFormat (currently only Gregorian). The new enum generated
here can be removed once DateTimeFormat knows about all calendars.
We had a hard-coded table of number system digits copied from ECMA-402.
Turns out these digits are in the CLDR, so let's parse the digits from
there instead of hard-coding them.
The following table in TR-35 includes a web of fall back rules when the
requested time zone style is unavailable:
https://unicode.org/reports/tr35/tr35-dates.html#dfst-zone
Conveniently, the subset of styles supported by ECMA-402 (and therefore
LibUnicode) all either fall back to GMT offset or to a style that is
unsupported but itself falls back to GMT offset.
This adds an API to use LibTimeZone to convert a time zone such as
"America/New_York" to a GMT offset string like "GMT-5" (short form) or
"GMT-05:00" (long form).
There are a few algorithms in TR-35 that need to replace digits before
returning any results to callers. For example, when formatting time zone
offsets, a string like "GMT+12:34" must have its digits replaced with
the default numbering system for the desired locale.
LibUnicode no longer needs to generate a list of time zone names that it
parsed from metaZones.json. We can defer to the TZDB for a golden list
of time zones.
The generator parses metaZones.json to form a mapping of meta zones to
time zones (AKA "golden zone" in TR-35). This parser errantly assumed
this was a 1-to-1 mapping.
Currently, we load the generated Unicode symbols with dlopen at runtime.
This is unnecessary as of 565a880ce5.
Applications that want Unicode data now link directly against the shared
library holding that data. So the same functionality can be achieved
with weak symbols.
ECMA-402 now supports short-offset, long-offset, short-generic, and
long-generic time zone name formatting. For example, in the en-US locale
the America/Eastern time zone would be formatted as:
short-offset: GMT-5
long-offset: GMT-05:00
short-generic: ET
long-generic: Eastern Time
We currently only support the UTC time zone, however. Therefore, this
very minimal implementation does not consider GMT offset or generic
display names. Instead, the CLDR defines specific strings for UTC.
The generated data for libunicodedata.so is quite large, and loading it
is a price paid by nearly every application by way of depending on
LibRegex. In order to defer this cost until an application actually uses
one of the surrounding APIs, dynamically load the generated symbols.
To be able to load the symbols dynamically, the generated methods must
have demangled names. Typically, this is accomplished with `extern "C"`
blocks. The clang toolchain complains about this here because the types
returned from the generators are strictly C++ types. So to demangle the
names, we use the asm() compiler directive to manually define a symbol
name; the caveat is that we *must* be sure the symbols are unique. As an
extra precaution, we prefix each symbol name with "unicode_". For more
details, see: https://gcc.gnu.org/onlinedocs/gcc/Asm-Labels.html
This symbol loader used in this implementation provides the additional
benefit of removing many [[maybe_unused]] attributes from the LibUnicode
methods. Internally, if ENABLE_UNICODE_DATABASE_DOWNLOAD is OFF, the
loader is able to stub out the function pointers it returns.
Note that as of this commit, LibUnicode is still directly linked against
LibUnicodeData. This commit is just a first step towards removing that.
This breaks LibUnicode into two libraries: LibUnicode containing the
public APIs for accessing the library, and LibUnicodeData containing the
generated source files. LibUnicodeData has compile options optimized for
size, which save about 1MB of data in total.
Similar to commit 2a7f36b392, this change moves the generated
CalendarSymbol enumeration to the public LibUnicode/NumberFormat.h
header with a pre-defined set of symbols that we need. This is to
prepare for uniquely generating the CalendarSymbols structure.
There are 443 number system objects generated, each of which held an
array of number system symbols. Of those 443 arrays, only 39 are unique.
To uniquely store these, this change moves the generated NumericSymbol
enumeration to the public LibUnicode/NumberFormat.h header with a pre-
defined set of symbols that we need. This is to ensure the generated,
unique arrays are created in a known order with known symbols. While it
is unfortunate to no longer discover these symbols at generation time,
it does allow us to ignore unwanted symbols and perform less string-to-
enumeration conversions at lookup time.
In the CLDR, there aren't "night" values, there are "night1" & "night2"
values. This is for locales which use a different name for nighttime
depending on the hour. For example, the ja locale uses "夜" between the
hours of 19:00 and 23:00, and "夜中" between the hours of 23:00 and
04:00. Our CLDR parser is currently ignoring "night2", so this rename
is to prepare for that.
We could probably come up with better names, but in the end, the API in
LibUnicode will be such that outside callers won't even see Night1, etc.
Pattern skeletons are more or less the "key" of format patterns. Every
format pattern is assigned a skeleton. Interval patterns (which are not
yet parsed) are also assigned a skeleton - this is used to match them to
an "owning" format pattern. So we will use the skeleton generated here
to match format patterns at runtime with their available interval
patterns.
An alternative approach would be to append interval patterns directly to
their owning format pattern, but this has some draw backs:
1. Skeletons aren't totally unique. A skeleton may appear in both
the "dateFormats" and "availableFormats" objects, in which case
the same interval formats would be generated more than once.
2. Otherwise unique format patterns may only differ by the interval
patterns assigned to them. This would cause the UniqueStorage for
the format patterns to increase in size, impacting both compile
times and libunicode.so size.
The parsing in parse_calendar_symbols() might be a bit more verbose than
it really needs to be, but it is to ensure the symbols are generated in
a known order that we can control with enumerations.
For example, consider the following adjacent entries in UnicodeData.txt:
3400;<CJK Ideograph Extension A, First>;Lo;0;L;;;;;N;;;;;
4DBF;<CJK Ideograph Extension A, Last>;Lo;0;L;;;;;N;;;;;
Our current implementation would assign the display name "CJK Ideograph
Extension A" to code points U+3400 & U+4DBF, but not to the code points
in between. Not only should those code points be assigned a name, but
the Unicode spec also has formatting rules on what the names should be
(the names for these ranged code points are not as they appear in
UnicodeData.txt).
The spec also defines names for code point ranges that actually are
listed individually in UnicodeData.txt. For example:
2F800;CJK COMPATIBILITY IDEOGRAPH-2F800;Lo;0;L;4E3D;;;;N;;;;;
2F801;CJK COMPATIBILITY IDEOGRAPH-2F801;Lo;0;L;4E38;;;;N;;;;;
2F802;CJK COMPATIBILITY IDEOGRAPH-2F802;Lo;0;L;4E41;;;;N;;;;;
Code points are only coalesced into a range if all fields after the name
are equivalent. Our parser will insert the range and its name formatting
pattern when it comes across the first code point in that range, then
ignore other code points in that range. This reduces the number of names
we generated by nearly 2,000.
For other keywords, allowed values per locale are generated at compile
time. But since the CLDR doesn't present hour cycles on a per-locale
basis, and hour cycles lookups depend on runtime data, we must handle
hour cycle keyword lookups differently than other keywords.
Unlike most data in the CLDR, hour cycles are not stored on a per-locale
basis. Instead, they are keyed by a string that is usually a region, but
sometimes is a locale. Therefore, given a locale, to determine the hour
cycles for that locale, we:
1. Check if the locale itself is assigned hour cycles.
2. If the locale has a region, check if that region is assigned hour
cycles.
3. Otherwise, maximize that locale, and if the maximized locale has
a region, check if that region is assigned hour cycles.
4. If the above all fail, fallback to the "001" region.
Further, each locale's default hour cycle is the first assigned hour
cycle.
Currently, we generate separate data files for locale and number format
related tables/methods, but provide public accessors for all of the data
in one Locale.h file. Rather than continuing this trend for date-time,
relative time, etc. formatting, it's a bit easier to reason about if the
public accessors are also in separate files.
As noted at the top of this method, this is a naive implementation of
the Unicode plurality specification. But for now, we should tweak the
defintion of "many" to be "more than 2" (which is what I had in mind
when I wrote this, but forgot about fractions).
This wasn't the case for compact patterns, but unit patterns can contain
multiple (up to 2, really) identifiers that must each be recognized by
LibJS.
Each generated NumberFormat object now stores an array of identifiers
parsed. The format pattern itself is encoded with the index into this
array for that identifier, e.g. the compact format string "0K" will
become "{number}{compactIdentifier:0}".
This field is currently used to store the StringView into the compact
name/symbol in the format string. Units will need to store a similar
field, so rename the field to be more generic, and extract the parser
for it.
Instead of currency pattern lookups within select_currency_unit_pattern,
rename the method to select_pattern_with_plurality and accept any list
of patterns. This method will be needed for units.
The compact scale of each formatting rule was precomputed in commit:
be69eae651
Using the formula: compact scale = magnitude - pattern scale
This computation was off-by-one.
For example, consider the format key "10000-count-one", which maps to
"00 thousand" in en-US. What we are really after is the exponent that
best represents the string "thousand" for values greater than 10000
and less than 100000 (the next format key). We were previously doing:
log10(10000) - "00 thousand".count("0") = 2
Which clearly isn't what we want. Instead, if we do:
log10(10000) + 1 - "00 thousand".count("0") = 3
We get the correct exponent for each format key for each locale.
This commit also renames the generated variable from "compact_scale" to
"exponent" to match the terminology used in ECMA-402.
For example, in en-US, the decimal, long compact pattern for numbers
between 10,000 and 100,000 is "00 thousand". In that pattern, "thousand"
is the compact identifier, and the generated format pattern is now
"{number} {compactIdentifier}". This also generates that identifier as
its own field in the NumberFormat structure.
Most locales have a single grouping size (the number of integer digits
to be written before inserting a grouping separator). However some have
a primary and secondary size. We parse the primary size as the size used
for the least significant integer digits, and the secondary size for the
most significant.
In order to implement Intl.NumberFormat.prototype.formatToParts, do not
replace {currency} keys in the format pattern before ECMA-402 tells us
to. Otherwise, the array return by formatToParts will not contain the
expected currency key.
Early replacement was done to avoid resolving the currency display more
than once, as it involves a couple of round trips to search through
LibUnicode data. So this adds a non-standard method to NumberFormat to
do this resolution and cache the result.
Another side effect of this change is that LibUnicode must replace unit
format patterns of the form "{0} {1}" during code generation. These were
previously skipped during code generation because LibJS would just
replace the keys with the currency display at runtime. But now that the
currency display injection is delayed, any {0} or {1} keys in the format
pattern will cause PartitionNumberPattern to abort.
Previously, we were checking if the code point immediately before/after
the {currency} key was U+00A0 (non-breaking space). Instead, to handle
other spacing code points, we must check if the surrounding code point
has the separator general category.
When I originally wrote this method, I had it in LibJS, where we can't
refer to the GeneralCategory enumeration directly. This is a big TODO,
anyone outside of LibUnicode can't assume the generated enumerations
exist and must get these values by string lookup. But this function
ended up living in LibUnicode, who can reference the enumeration.
Currencies are a bit strange; the layout of currency data in the CLDR is
not particularly compatible with what ECMA-402 expects. For example, the
currency format in the "en" and "ar" locales for the Latin script are:
en: "¤#,##0.00"
ar: "¤\u00A0#,##0.00"
Note how the "ar" locale has a non-breaking space after the currency
symbol (¤), but "en" does not. This does not mean that this space will
appear in the "ar"-formatted string, nor does it mean that a space won't
appear in the "en"-formatted string. This is a runtime decision based on
the currency display chosen by the user ("$" vs. "USD" vs. "US dollar")
and other rules in the Unicode TR-35 spec.
ECMA-402 shies away from the nuances here with "implementation-defined"
steps. LibUnicode will store the data parsed from the CLDR however it is
presented; making decisions about spacing, etc. will occur at runtime
based on user input.