This uses ICU for the Intl.NumberFormat `formatRange` and
`formatRangeToParts` prototypes.
Note: All of the changes to the test files in this patch are now aligned
with both Chrome and Safari.
This uses ICU for the Intl.NumberFormat `format` and `formatToParts`
prototypes. It does not yet port the range formatter prototypes.
Most of the new code in LibLocale/NumberFormat is simply mapping from
ECMA-402 types to ICU types. Beyond that, the only algorithmic change is
that we have to mutate the output from ICU for `formatToParts` to match
what is expected by ECMA-402. This is explained in NumberFormat.cpp in
`flatten_partitions`.
This lets us remove most data from our number format generator. All that
remains are numbering system digits and symbols, which are relied upon
still for other interfaces (e.g. Intl.DateTimeFormat). So they will be
removed in a future patch.
Note: All of the changes to the test files in this patch are now aligned
with both Chrome and Safari.
Note: We keep locale parsing and syntactic validation as-is. ECMA-402
places additional restrictions on locales above what is required by the
Unicode spec. ICU doesn't provide methods that let us easily check those
restrictions, whereas LibLocale does. Other browsers also implement
their own validators here.
This introduces a locale cache to re-use parsed locale data and various
related structures (not doing so has a non-negligible performance impact
on Intl tests).
The existing APIs for canonicalization and display names are pretty
intertwined, so they must both be adapted at once here. The results of
canonicalization are slightly different on some edge cases. But the
changed results are actually now aligned with Chrome and Safari.
This patch adds two macros to declare per-type allocators:
- JS_DECLARE_ALLOCATOR(TypeName)
- JS_DEFINE_ALLOCATOR(TypeName)
When used, they add a type-specific CellAllocator that the Heap will
delegate allocation requests to.
The result of this is that GC objects of the same type always end up
within the same HeapBlock, drastically reducing the ability to perform
type confusion attacks.
It also improves HeapBlock utilization, since each block now has cells
sized exactly to the type used within that block. (Previously we only
had a handful of block sizes available, and most GC allocations ended
up with a large amount of slack in their tails.)
There is a small performance hit from this, but I'm sure we can make
up for it elsewhere.
Note that the old size-based allocators still exist, and we fall back
to them for any type that doesn't have its own CellAllocator.
These APIs only perform small allocations, and are only used by LibJS.
Callers which could only have failed from these APIs are also made to
be infallible here.
First, this adds an overload of PrimitiveString::create for StringView.
This overload will throw an OOM completion if creating a String fails.
This is not only a bit more convenient, but it also ensures at compile
time that all PrimitiveString::create(string_view) invocations will be
handled as String and OOM-aware.
Next, this wraps all invocations to PrimitiveString::create(string_view)
with MUST_OR_THROW_OOM.
A small PrimitiveString::create(DeprecatedFlyString) overload also had
to be added to disambiguate between the StringView and DeprecatedString
overloads.
This was converted to an enumeration for Intl.NumberFormat V3 in commit
33698b9615, but the default value was not
updated (and it's a bit surprising it compiled at all, given that this
is an 'enum class').
We have a new, improved string type coming up in AK (OOM aware, no null
state), and while it's going to use UTF-8, the name UTF8String is a
mouthful - so let's free up the String name by renaming the existing
class.
Making the old one have an annoying name will hopefully also help with
quick adoption :^)
Instead of passing a GlobalObject everywhere, we will simply pass a VM,
from which we can get everything we need: common names, the current
realm, symbols, arguments, the heap, and a few other things.
In some places we already don't actually need a global object and just
do it for consistency - no more `auto& vm = global_object.vm();`!
This will eventually automatically fix the "wrong realm" issue we have
in some places where we (incorrectly) use the global object from the
allocating object, e.g. in call() / construct() implementations. When
only ever a VM is passed around, this issue can't happen :^)
I've decided to split this change into a series of patches that should
keep each commit down do a somewhat manageable size.
After the Intl MV is implemented, returning a copy of the desired value
here may involve copying non-trivial data. Instead, return an enum to
indicate which decision was made.
This contains minimal changes to parse newly added and modified options
from the Intl.NumberFormat V3 proposal, while maintaining main spec
behavior in Intl.NumberFormat.prototype.format. The parsed options are
reflected only in Intl.NumberFormat.prototype.resolvedOptions and the js
REPL.
This also allows removing a bit of a BigInt hack to resolve plurality of
BigInt numbers (because the AOs used in ResolvePlural support BigInt,
wherease the naive Unicode::select_pattern_with_plurality did not).
We use cardinal form here; the number format patterns in the CLDR align
with the cardinal form of the plural rules.
Intl.NumberFormat is meant to format both Number and BigInt types. To
prepare for formatting BigInt types, this generalizes our NumberFormat
implementation to operate on Value instances rather than doubles. All
arithmetic is moved to static helpers that can now be updated with
BigInt semantics.
Other Intl objects, such as PluralRules, are to be treated as a
NumberFormat object in some AOs. There's only a handful of fields which
are to be shared between those objects - move them to a base class for
shared reuse.
This also updates the couple of NumberFormat AOs that are meant to
operate on these NumberFormat-like objects.
Alternatively, we could just have objects like PluralRules inherit from
NumberFormat directly. But that messes up the is<NumberFormat> runtime
checks, so this feels safer.
Currently, we generate separate data files for locale and number format
related tables/methods, but provide public accessors for all of the data
in one Locale.h file. Rather than continuing this trend for date-time,
relative time, etc. formatting, it's a bit easier to reason about if the
public accessors are also in separate files.
Finding the best number format to use for compact notation involves
creating a Vector of all compact formats for the locale and looking for
the one that best matches the number's magnitude. ECMA-402 wants this
number format to be found multiple times, so cache the result for future
use.
In order to implement Intl.NumberFormat.prototype.formatToParts, do not
replace {currency} keys in the format pattern before ECMA-402 tells us
to. Otherwise, the array return by formatToParts will not contain the
expected currency key.
Early replacement was done to avoid resolving the currency display more
than once, as it involves a couple of round trips to search through
LibUnicode data. So this adds a non-standard method to NumberFormat to
do this resolution and cache the result.
Another side effect of this change is that LibUnicode must replace unit
format patterns of the form "{0} {1}" during code generation. These were
previously skipped during code generation because LibJS would just
replace the keys with the currency display at runtime. But now that the
currency display injection is delayed, any {0} or {1} keys in the format
pattern will cause PartitionNumberPattern to abort.
Currencies are a bit strange; the layout of currency data in the CLDR is
not particularly compatible with what ECMA-402 expects. For example, the
currency format in the "en" and "ar" locales for the Latin script are:
en: "¤#,##0.00"
ar: "¤\u00A0#,##0.00"
Note how the "ar" locale has a non-breaking space after the currency
symbol (¤), but "en" does not. This does not mean that this space will
appear in the "ar"-formatted string, nor does it mean that a space won't
appear in the "en"-formatted string. This is a runtime decision based on
the currency display chosen by the user ("$" vs. "USD" vs. "US dollar")
and other rules in the Unicode TR-35 spec.
ECMA-402 shies away from the nuances here with "implementation-defined"
steps. LibUnicode will store the data parsed from the CLDR however it is
presented; making decisions about spacing, etc. will occur at runtime
based on user input.
There is quite a lot to be done here so this is just a first pass at
number formatting. Decimal and percent formatting are mostly working,
but only for standard and compact notation (engineering and scientific
notation are not implemented here). Currency formatting is parsed, but
there is more work to be done to handle e.g. using symbols instead of
currency codes ("$" instead of "USD"), and putting spaces around the
currency symbol ("USD 2.00" instead of "USD2.00").
This method represents the Intl.NumberFormat's [[RelevantExtensionKeys]]
internal slot, so it makes more sense for this to be directly in the
class itself.