Commit graph

14 commits

Author SHA1 Message Date
Timothy Flynn
c24a350a18 LibUnicode: Ignore U+200F when parsing format identifiers
Noticed this while implementing multiple identifier support. We were
errantly parsing U+200F as a lone identifier in some Hebrew formats.
2021-11-16 23:14:09 +00:00
Timothy Flynn
04b8b87c17 LibJS+LibUnicode: Support multiple identifiers within format pattern
This wasn't the case for compact patterns, but unit patterns can contain
multiple (up to 2, really) identifiers that must each be recognized by
LibJS.

Each generated NumberFormat object now stores an array of identifiers
parsed. The format pattern itself is encoded with the index into this
array for that identifier, e.g. the compact format string "0K" will
become "{number}{compactIdentifier:0}".
2021-11-16 23:14:09 +00:00
Timothy Flynn
3b68370212 LibJS+LibUnicode: Rename the generated compact_identifier to identifier
This field is currently used to store the StringView into the compact
name/symbol in the format string. Units will need to store a similar
field, so rename the field to be more generic, and extract the parser
for it.
2021-11-16 23:14:09 +00:00
Timothy Flynn
1f546476d5 LibJS+LibUnicode: Fix computation of compact pattern exponents
The compact scale of each formatting rule was precomputed in commit:
be69eae651

Using the formula: compact scale = magnitude - pattern scale

This computation was off-by-one.

For example, consider the format key "10000-count-one", which maps to
"00 thousand" in en-US. What we are really after is the exponent that
best represents the string "thousand" for values greater than 10000
and less than 100000 (the next format key). We were previously doing:

    log10(10000) - "00 thousand".count("0") = 2

Which clearly isn't what we want. Instead, if we do:

    log10(10000) + 1 - "00 thousand".count("0") = 3

We get the correct exponent for each format key for each locale.

This commit also renames the generated variable from "compact_scale" to
"exponent" to match the terminology used in ECMA-402.
2021-11-16 00:56:55 +00:00
Timothy Flynn
48d5684780 LibUnicode: Parse compact identifiers and replace them with a format key
For example, in en-US, the decimal, long compact pattern for numbers
between 10,000 and 100,000 is "00 thousand". In that pattern, "thousand"
is the compact identifier, and the generated format pattern is now
"{number} {compactIdentifier}". This also generates that identifier as
its own field in the NumberFormat structure.
2021-11-16 00:56:55 +00:00
Timothy Flynn
30fbb7d9cd LibUnicode: Parse and generate scientific formatting rules 2021-11-14 17:00:35 +00:00
Timothy Flynn
3645f6a0fc LibUnicode: Fix typo in percent format parser
Just by sheer luck this had no actual effect because the decimal format
prefix has the same length as the percent format prefix.
2021-11-14 17:00:35 +00:00
Timothy Flynn
3b7f5af042 LibUnicode: Generate primary and secondary number grouping sizes
Most locales have a single grouping size (the number of integer digits
to be written before inserting a grouping separator). However some have
a primary and secondary size. We parse the primary size as the size used
for the least significant integer digits, and the secondary size for the
most significant.
2021-11-14 10:35:19 +00:00
Timothy Flynn
c65dea64bd LibJS+LibUnicode: Don't remove {currency} keys in GetNumberFormatPattern
In order to implement Intl.NumberFormat.prototype.formatToParts, do not
replace {currency} keys in the format pattern before ECMA-402 tells us
to. Otherwise, the array return by formatToParts will not contain the
expected currency key.

Early replacement was done to avoid resolving the currency display more
than once, as it involves a couple of round trips to search through
LibUnicode data. So this adds a non-standard method to NumberFormat to
do this resolution and cache the result.

Another side effect of this change is that LibUnicode must replace unit
format patterns of the form "{0} {1}" during code generation. These were
previously skipped during code generation because LibJS would just
replace the keys with the currency display at runtime. But now that the
currency display injection is delayed, any {0} or {1} keys in the format
pattern will cause PartitionNumberPattern to abort.
2021-11-13 19:01:25 +00:00
Timothy Flynn
a701ed52fc LibJS+LibUnicode: Fully implement currency number formatting
Currencies are a bit strange; the layout of currency data in the CLDR is
not particularly compatible with what ECMA-402 expects. For example, the
currency format in the "en" and "ar" locales for the Latin script are:

    en: "¤#,##0.00"
    ar: "¤\u00A0#,##0.00"

Note how the "ar" locale has a non-breaking space after the currency
symbol (¤), but "en" does not. This does not mean that this space will
appear in the "ar"-formatted string, nor does it mean that a space won't
appear in the "en"-formatted string. This is a runtime decision based on
the currency display chosen by the user ("$" vs. "USD" vs. "US dollar")
and other rules in the Unicode TR-35 spec.

ECMA-402 shies away from the nuances here with "implementation-defined"
steps. LibUnicode will store the data parsed from the CLDR however it is
presented; making decisions about spacing, etc. will occur at runtime
based on user input.
2021-11-13 11:52:45 +00:00
Timothy Flynn
e9493a2cd5 LibUnicode: Ensure UnicodeNumberFormat is aware of default content
For example, there isn't a unique set of data for the en-US locale;
rather, it defaults to the data for the en locale. See this commit for
much more detail: 357c97dfa8
2021-11-13 11:52:45 +00:00
Timothy Flynn
9421d5c0cf LibUnicode: Generate currency unit-pattern number formats
These are used when formatting a number as currency with a display
option of "name" (e.g. for USD, the name is "US Dollars" in en-US).

These patterns appear in the CLDR in a different manner than other
number formats that are pluralized. They are of the form "{0} {1}",
therefore do not undergo subpattern replacements.
2021-11-13 11:52:45 +00:00
Timothy Flynn
6cfd63e5bd LibUnicode: Parse numbers in number formats a bit more leniently
The parser was previously expecting number sections within a pattern to
start with "#", but they may also begin with "0".
2021-11-13 11:52:45 +00:00
Timothy Flynn
1f2ac0ab41 LibUnicode: Move number formatting code generator to UnicodeNumberFormat 2021-11-12 20:46:38 +00:00