73 commits

Author SHA1 Message Date
Timothy Flynn
becec3578f LibTimeZone+LibUnicode: Generate string data with run-length encoding
Currently, the unique string lists are stored in the initialized data
sections of their shared libraries. In order to move the data to the
read-only section, generate the strings using RLE arrays.

We generate two arrays: the first is the RLE data itself, the second is
a list of indices into the RLE array for each string. We then generate a
decoding method to convert an RLE string to a StringView.
2022-08-16 16:56:17 +02:00
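
A minimal sketch of the two-array scheme described in becec3578f, using only the C++ standard library. The array names, the (character, run length) pair layout, the trailing end-of-data index, and the `decode_string` helper are all assumptions made for illustration; the real generated code returns a StringView and may encode runs differently.

```cpp
#include <array>
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <string>

// Hypothetical generated data: (character, run length) pairs for every
// string, concatenated. This encodes "aaab" followed by "zz".
static constexpr std::array<std::uint8_t, 6> s_encoded_strings { 'a', 3, 'b', 1, 'z', 2 };

// Hypothetical index table: the offset into s_encoded_strings at which each
// string starts, plus one trailing entry marking the end of the data.
static constexpr std::array<std::uint32_t, 3> s_encoded_string_indices { 0, 4, 6 };

// Expand the RLE pairs belonging to the string at `index`.
// Returns std::string here because this sketch has no backing buffer that a
// view could point into.
std::string decode_string(std::size_t index)
{
    assert(index + 1 < s_encoded_string_indices.size());

    auto const begin = s_encoded_string_indices[index];
    auto const end = s_encoded_string_indices[index + 1];

    std::string result;
    for (auto i = begin; i < end; i += 2)
        result.append(s_encoded_strings[i + 1], static_cast<char>(s_encoded_strings[i]));

    return result;
}
```

Both tables are `constexpr` arrays of integers with no pointers in them, which is the property that lets a linker place such data in a read-only section.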
Timothy Flynn
f8f7015419 LibUnicode: Generate a method to lookup locale-preferred keyword values 2022-07-15 12:31:43 +02:00
Timothy Flynn
80568d5776 LibUnicode: Generate a method to lookup available keyword values 2022-07-15 12:31:43 +02:00
Timothy Flynn
c2e5b20eb6 LibUnicode: Generate available values for the keywords co, kf, kn, hc
This also ensures we only include values we actually support in the
generated list of available values.
2022-07-15 12:31:43 +02:00
sin-ack
3f3f45580a Everywhere: Add sv suffix to strings relying on StringView(char const*)
Each of these strings would previously rely on StringView's char const*
constructor overload, which would call __builtin_strlen on the string.
Since we now have operator ""sv, we can replace these with much simpler
versions. This opens the door to being able to remove
StringView(char const*).

No functional changes.
2022-07-12 23:11:35 +02:00
Timothy Flynn
4868b888be LibUnicode: Generate per-locale text layout information
Currently contains just each locale's character order, but is set up to
easily add other text layout fields from the CLDR if ECMA-402 eventually
requires them.
2022-07-06 16:56:42 +02:00
DexesTTP
7ceeb74535 AK: Use an enum instead of a bool for String::replace(all_occurences)
This commit has no behavior changes.

In particular, this does not fix any of the wrong uses of the previous
default parameter (which used to be 'false', meaning "only replace the
first occurence in the string"). It simply replaces the default uses by
String::replace(..., ReplaceMode::FirstOnly), leaving them incorrect.
2022-07-06 11:12:45 +02:00
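
To illustrate why 7ceeb74535 prefers an enum over a bool at call sites, here is a sketch written against `std::string` rather than AK's String. The free function `replace` and its signature are hypothetical; only the name `ReplaceMode::FirstOnly` comes from the commit message, and `ReplaceMode::All` is assumed as its counterpart.

```cpp
#include <cstddef>
#include <string>
#include <string_view>

enum class ReplaceMode {
    All,       // assumed counterpart to FirstOnly
    FirstOnly, // named in the commit message
};

// Hypothetical stand-in for String::replace(): returns a copy of `haystack`
// with `needle` replaced according to `mode`.
std::string replace(std::string_view haystack, std::string_view needle,
    std::string_view replacement, ReplaceMode mode)
{
    if (needle.empty())
        return std::string(haystack);

    std::string result;
    std::size_t position = 0;

    while (true) {
        auto const match = haystack.find(needle, position);
        if (match == std::string_view::npos)
            break;

        // Copy the text before the match, then the replacement.
        result.append(haystack.substr(position, match - position));
        result.append(replacement);
        position = match + needle.size();

        if (mode == ReplaceMode::FirstOnly)
            break;
    }

    // Copy whatever remains after the last replaced match.
    result.append(haystack.substr(position));
    return result;
}
```

A call such as `replace(text, "-", "+", ReplaceMode::FirstOnly)` states its intent where a bare `false` would not, which also makes incorrect uses easier to spot in review.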
Timothy Flynn
63c3437274 LibUnicode: Use BCP 47 data to generate available calendars and numbers
BCP 47 will be the single source of truth for known calendar and number
system keywords, and their aliases (e.g. "gregory" is an alias for
"gregorian"). Move the generation of available keywords to where we
parse the BCP 47 data, so that hard-coded aliases may be removed from
other generators.
2022-02-16 07:23:07 -05:00
Timothy Flynn
89ead8c00a LibJS+LibUnicode: Parse Unicode keywords from the BCP 47 CLDR package
We have a fair amount of hard-coded keywords / aliases that can now be
replaced with real data from BCP 47. As a result, this also changes the
awkward way we were previously generating keys. Before, we were more or
less generating keywords as a CSV list of keys, e.g. for the "nu" key,
we'd generate "latn,arab,grek" (ordered by locale preference). Then at
runtime, we'd split on the comma. We now just generate spans of keywords
directly.
2022-02-16 07:23:07 -05:00
Timothy Flynn
d0fc61e79b LibUnicode: Extract the BCP 47 package from the CLDR
This package was originally meant to be included in CLDR version 40, but
was missed in their release scripts. This has been resolved:
https://unicode-org.atlassian.net/browse/CLDR-15158

Unfortunately, the CLDR was re-released with the same version number. So
to bust the build's CLDR cache, change the "version" used to detect that
we need to redownload the CLDR.
2022-02-16 07:23:07 -05:00
Timothy Flynn
a338e9403b LibUnicode: Port the CLDR locale generator to the stream API
This adds a generator utility to read an entire file and parse it as a
JSON value. This is heavily used by the CLDR generators. The idea here
is to put the file reading details in the utility so that when we have a
good story for generically reading an entire stream in LibCore, we can
update the generators to use that by only touching this helper.
2022-02-14 11:39:46 -05:00
Timothy Flynn
6efbafa6e0 Everywhere: Update copyrights with my new serenityos.org e-mail :^) 2022-01-31 18:23:22 +00:00
Timothy Flynn
bb0f548614 LibUnicode: Generate a list of available currencies 2022-01-31 00:32:41 +00:00
Timothy Flynn
4d43aeae30 LibUnicode: Fill in case-first and numeric BCP47 keywords
Unlike other BCP47 keywords that we are parsing, these only appear in
the BCP47 XML file itself within the CLDR. The values are very simple
though, so just hard code them until the Unicode org re-releases the
CLDR with BCP47: https://unicode-org.atlassian.net/browse/CLDR-15158
2022-01-29 20:27:24 +00:00
Timothy Flynn
bced4e9324 LibJS+LibUnicode: Convert Intl.ListFormat to use Unicode::Style
Remove ListFormat's own definition of the Style enum, which was further
duplicated by a generated ListPatternStyle enum with the same values.
2022-01-25 19:02:59 +00:00
Timothy Flynn
c86f7a675d LibUnicode: Do not limit language display names to known locales
Currently, the UnicodeLocale generator collects a list of known locales
from the CLDR before processing language display names. For each locale,
the identifier is broken into language, script, and region subtags, and
we create a list of seen languages. When processing display names, we
skip languages we hadn't seen in that first step.

This is insufficient for language display names like "en-GB", which do
not have a locale entry in the CLDR, and thus are skipped. So instead,
create the list of known languages by actually reading through the list
of languages which have a display name.
2022-01-13 23:05:31 +01:00
Timothy Flynn
91acc2e9c5 LibUnicode: Parse and generate locale display patterns
These patterns indicate how to display locale strings when that locale
contains multiple subtags. For example, "en-US" would be displayed as
"English (United States)".
2022-01-13 23:05:31 +01:00
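
A rough sketch of how a display pattern from 91acc2e9c5 could be applied. The pattern string `"{0} ({1})"` and the helper `apply_locale_pattern` are assumptions for illustration, not the generated LibUnicode API.

```cpp
#include <string>
#include <string_view>

// Substitute the language name for {0} and the territory name for {1} in a
// locale display pattern. Each placeholder is replaced at most once.
std::string apply_locale_pattern(std::string pattern, std::string_view language,
    std::string_view territory)
{
    auto substitute = [&](std::string_view placeholder, std::string_view value) {
        auto const position = pattern.find(placeholder);
        if (position != std::string::npos)
            pattern.replace(position, placeholder.size(), value);
    };

    substitute("{0}", language);
    substitute("{1}", territory);
    return pattern;
}
```

With the assumed pattern, `apply_locale_pattern("{0} ({1})", "English", "United States")` yields the "English (United States)" form quoted in the commit message.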
Timothy Flynn
0d75949827 LibUnicode: Parse and generate locale display names for date fields 2022-01-13 13:43:57 +01:00
Timothy Flynn
7f162c471d LibUnicode: Parse and generate locale display names for calendars
Note there's a bit of an unfortunate duplication in the calendar enum
generated by UnicodeLocale and the existing enum generated by
UnicodeDateTimeFormat. The former contains every calendar known to the
CLDR, whereas the latter contains the calendars we've actually parsed
for DateTimeFormat (currently only Gregorian). The new enum generated
here can be removed once DateTimeFormat knows about all calendars.
2022-01-13 13:43:57 +01:00
Timothy Flynn
6da1bfeeea Meta: Support generating case-insensitive value-from-string methods
This also extracts the default parameters for generate_value_from_string
to a structure. This is just to make it cleaner to add new options.
2022-01-11 00:36:45 +01:00
Timothy Flynn
f576142fe8 LibJS+LibUnicode: Convert UnicodeLocale to link with weak symbols 2022-01-04 22:49:43 +00:00
Timothy Flynn
cf8e11a562 LibUnicode: Add temporary overload of value-from-string generator
This is a temporary mechanism while LibUnicode is in an in-between state
where some symbols are weakly linked and others are dynamically loaded.
The latter require an asm() label to be loaded.
2022-01-04 22:49:43 +00:00
Timothy Flynn
52394deece LibUnicode: Remove now unused value-from-string generator overload
The generate_value_from_string_for_dynamic_loading() overload was just
temporary until all generators were switched over to dynamic loading.
2021-12-21 13:09:49 -08:00
Timothy Flynn
09be26b5d2 LibUnicode: Dynamically load the generated UnicodeLocale symbols 2021-12-21 13:09:49 -08:00
Timothy Flynn
ce6c515873 LibUnicode: Generate unique list patterns and lists of list patterns 2021-12-13 21:28:56 -08:00
Timothy Flynn
0ad2decd04 LibUnicode: Generate unique list of keyword values 2021-12-13 21:28:56 -08:00
Timothy Flynn
0c6cc4ad96 LibUnicode: Generate unique lists of localized currencies 2021-12-13 21:28:56 -08:00
Timothy Flynn
a45f2ccc25 LibUnicode: Generate unique lists of languages, territories, and scripts 2021-12-13 21:28:56 -08:00
Timothy Flynn
bf79c73158 LibUnicode: Do not generate data for "generic" calendars
This is not a calendar supported by ECMA-402, so let's not waste space
with its data.

Further, don't generate "gregorian" as a valid Unicode locale extension
keyword. It's an invalid type identifier, thus cannot be used in locales
such as "en-u-ca-gregorian".
2021-12-01 16:36:26 +00:00
Timothy Flynn
71903ea7e1 LibUnicode: Parse and generate calendar (ca) Unicode keywords
Also removes a few fly-by "StringView x = nullptr;" unnecessary
initializers.
2021-11-29 22:48:46 +00:00
Timothy Flynn
0aa3e5c2ea LibUnicode: Port generator utility methods to ErrorOr
Most of these were VERIFY-ing for success, but propagating an error
message up to serenity_main() is much nicer than just a SIGABRT.
2021-11-23 22:58:05 +01:00
Timothy Flynn
8c5f19f7c8 LibUnicode: Port GenerateUnicodeLocale to ErrorOr and LibMain 2021-11-23 22:58:05 +01:00
Timothy Flynn
93ee922027 LibUnicode: Support locales-without-script aliases for ECMA-402
As noted by ECMA-402, if a supported locale contains all of a language,
script, and region subtag, then the implementation must also support the
locale without the script subtag. The most complicated example of this
is the zh-TW locale.

The list of locales in the CLDR database does not include zh-TW or its
maximized zh-Hant-TW variant. Instead, it includes the zh-Hant locale.
However, zh-Hant-TW is listed in the default-content locale list in the
cldr-core package. This defines an alias from zh-Hant-TW to zh-Hant. We
must then also support the zh-Hant-TW alias without the script subtag:
zh-TW. This transitively maps zh-TW to zh-Hant, which is a case quite
heavily tested by test262.
2021-11-19 11:45:35 +01:00
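
The transitive mapping described in 93ee922027 can be sketched as a small alias table that is followed until a non-alias locale is reached. The table contents mirror the zh-TW example from the commit message; the names `s_locale_aliases`, `without_script`, and `resolve_locale` are hypothetical, and the real generator emits its aliases in a different form.

```cpp
#include <cassert>
#include <cstddef>
#include <map>
#include <string>

// Hypothetical alias table for the example in the commit message.
static std::map<std::string, std::string> const s_locale_aliases {
    { "zh-Hant-TW", "zh-Hant" }, // default-content alias from cldr-core
    { "zh-TW", "zh-Hant-TW" },   // the same locale without its script subtag
};

// Drop the middle subtag of a language-script-region locale. Inputs with a
// different number of subtags are returned unchanged.
std::string without_script(std::string const& locale)
{
    auto const first = locale.find('-');
    auto const last = locale.rfind('-');
    if (first == std::string::npos || first == last)
        return locale;
    return locale.substr(0, first) + locale.substr(last);
}

// Follow aliases until reaching a locale that is not itself an alias.
std::string resolve_locale(std::string locale)
{
    // An acyclic table needs at most size() hops, so the bound guards
    // against accidental cycles.
    for (std::size_t hops = 0; hops <= s_locale_aliases.size(); ++hops) {
        auto const it = s_locale_aliases.find(locale);
        if (it == s_locale_aliases.end())
            return locale;
        locale = it->second;
    }

    assert(false && "cycle in locale alias table");
    return locale;
}
```

Here `without_script("zh-Hant-TW")` produces the zh-TW key, and `resolve_locale("zh-TW")` walks zh-TW to zh-Hant-TW to zh-Hant.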
Timothy Flynn
a13fa15a30 LibUnicode: Generate default-content locales as aliases
Previously, we were just copying the locale data into default-content
locales (for example, copying the "en" data into "en-US"). Instead, we
can just define the default-content locales as aliases to their main
locales.
2021-11-19 11:45:35 +01:00
Andreas Kling
587f9af960 AK: Make JSON parser return ErrorOr<JsonValue> (instead of Optional)
Also add slightly richer parse errors now that we can include a string
literal with returned errors.

This will allow us to use TRY() when working with JSON data.
2021-11-17 00:21:10 +01:00
Timothy Flynn
e9493a2cd5 LibUnicode: Ensure UnicodeNumberFormat is aware of default content
For example, there isn't a unique set of data for the en-US locale;
rather, it defaults to the data for the en locale. See this commit for
much more detail: 357c97dfa8
2021-11-13 11:52:45 +00:00
Timothy Flynn
39e031c4dd LibJS+LibUnicode: Generate all styles of currency localizations
Currently, LibUnicode is only parsing and generating the "long" style of
currency display names. However, the CLDR contains "short" and "narrow"
forms as well that need to be handled. Parse these, and update LibJS to
actually respect the "style" option provided by the user for displaying
currencies with Intl.DisplayNames.

Note: There are some discrepancies between the engines on how style is
handled. In particular, running:

new Intl.DisplayNames('en', {type:'currency', style:'narrow'}).of('usd')

Gives:

  SpiderMonkey: "USD"
  V8: "US Dollar"
  LibJS: "$"

And running:

new Intl.DisplayNames('en', {type:'currency', style:'short'}).of('usd')

Gives:

  SpiderMonkey: "$"
  V8: "US Dollar"
  LibJS: "$"

My best guess is V8 isn't handling style, and just returning the long
form (which is what LibJS did before this commit). And SpiderMonkey can
handle some styles, but if they don't have a value for the requested
style, they fall back to the canonicalized code passed into of().
2021-11-13 11:52:45 +00:00
Timothy Flynn
1f2ac0ab41 LibUnicode: Move number formatting code generator to UnicodeNumberFormat 2021-11-12 20:46:38 +00:00
Timothy Flynn
04e6b43f05 LibUnicode: Move (soon-to-be) common code out of GenerateUnicodeLocale
The data used for number formatting is going to grow quite a bit when
the cldr-units package is parsed. To prevent the generated UnicodeLocale
file from growing outrageously large, the number formatting data can go
into its own file. To prepare for this, move code that will be common
between the generators for UnicodeLocale and UnicodeNumberFormat to the
utility header.
2021-11-12 20:46:38 +00:00
Timothy Flynn
be69eae651 LibUnicode: Precompute the compact scale of each number formatting rule
This will be needed for the ComputeExponentForMagnitude AO for compact
formatting, namely step 5b:

  Let exponent be an implementation- and locale-dependent (ILD) integer
  by which to scale a number of the given magnitude in compact notation
  for the current locale.
2021-11-12 09:17:08 +00:00
Timothy Flynn
230b133ee3 LibUnicode: Parse number formats into zero/positive/negative patterns
A number formatting pattern in the CLDR contains one or two entries,
delimited by a semi-colon. Previously, LibUnicode was just storing the
entire pattern as one string. This changes the generator to split the
pattern on that delimiter and generate the 3 unique patterns expected by
ECMA-402.

The rules for generating the 3 patterns are as follows:

* If the pattern contains 1 entry, it is the zero pattern. The positive
  pattern is the zero pattern prepended with {plusSign}. The negative
  pattern is the zero pattern prepended with {minusSign}.

* If the pattern contains 2 entries, the first is the zero pattern, and
  the second is the negative pattern. The positive pattern is the zero
  pattern prepended with {plusSign}.
2021-11-12 09:17:08 +00:00
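
The two rules listed in 230b133ee3 can be sketched as follows. The struct and function names are hypothetical, and the sketch only performs the split and the prepending described above; any further rewriting of the pattern contents that the real generator performs is not modeled.

```cpp
#include <string>
#include <string_view>

// Hypothetical container for the 3 patterns expected by ECMA-402.
struct NumberFormatPatterns {
    std::string zero;
    std::string positive;
    std::string negative;
};

// Split a CLDR number pattern on ';' and derive the zero, positive, and
// negative patterns using the rules from the commit message.
NumberFormatPatterns split_number_pattern(std::string_view pattern)
{
    NumberFormatPatterns patterns;
    auto const delimiter = pattern.find(';');

    // The first (or only) entry is always the zero pattern.
    patterns.zero = std::string(pattern.substr(0, delimiter));

    // The positive pattern is always derived from the zero pattern.
    patterns.positive = "{plusSign}" + patterns.zero;

    if (delimiter == std::string_view::npos) {
        // 1 entry: derive the negative pattern from the zero pattern.
        patterns.negative = "{minusSign}" + patterns.zero;
    } else {
        // 2 entries: the second entry is the negative pattern.
        patterns.negative = std::string(pattern.substr(delimiter + 1));
    }

    return patterns;
}
```

For a two-entry input such as `"#,##0.00;(#,##0.00)"`, the negative pattern is taken verbatim from the second entry, while a one-entry input gets both signed variants derived from the zero pattern.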
Timothy Flynn
1244ebcd4f LibUnicode: Parse and generate standard accounting formatting rules
Also known as "currency-accounting" in some CLDR documentation.
2021-11-12 09:17:08 +00:00
Timothy Flynn
967afc1b84 LibUnicode: Parse and generate standard currency formatting rules 2021-11-12 09:17:08 +00:00
Timothy Flynn
bffd73e0d4 LibUnicode: Parse and generate standard decimal formatting rules 2021-11-12 09:17:08 +00:00
Timothy Flynn
feb8c22a62 LibUnicode: Parse and generate standard percentage formatting rules 2021-11-12 09:17:08 +00:00
Timothy Flynn
4317a1b552 LibUnicode: Parse and generate compact currency formatting rules 2021-11-12 09:17:08 +00:00
Timothy Flynn
604a596c90 LibUnicode: Parse and generate compact decimal formatting rules 2021-11-12 09:17:08 +00:00
Timothy Flynn
12b468a588 LibUnicode: Begin parsing and generating locale number systems
The number system data in the CLDR contains information on how to format
numbers in a locale-dependent manner. Start parsing this data, beginning
with numeric symbol strings. For example the symbol NaN maps to "NaN" in
the en-US locale, and "非數值" in the zh-Hant locale.
2021-11-12 09:17:08 +00:00
Timothy Flynn
d3e83c9934 LibUnicode: Parse alternate default numbering systems
Some locales in the CLDR have alternate default numbering systems listed
under "defaultNumberingSystem-alt-*", e.g.:

    "defaultNumberingSystem": "arab",
    "defaultNumberingSystem-alt-latn": "latn",
    "otherNumberingSystems": {
      "native": "arab"
    },

We were previously only parsing "defaultNumberingSystem" and
"otherNumberingSystems". This odd format appears to be an artifact of
converting from XML.
2021-11-12 09:17:08 +00:00
Timothy Flynn
ae66188d43 LibUnicode: Capitalize generated identifiers in lieu of full title case
This isn't particularly important because this generates code that is
quite hidden from outside callers. But when viewing the generated code,
it's a bit nicer to read e.g. enum identifiers such as "MinusSign"
rather than "Minussign".
2021-11-12 09:17:08 +00:00
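
The difference that ae66188d43 describes can be shown with two small helpers. Both function names are hypothetical, and the input spelling "minusSign" is an assumption about how the source key is written.

```cpp
#include <cassert>
#include <cctype>
#include <string>

// Uppercase only the first character, leaving the rest untouched.
std::string capitalize(std::string identifier)
{
    assert(!identifier.empty());
    identifier[0] = static_cast<char>(std::toupper(static_cast<unsigned char>(identifier[0])));
    return identifier;
}

// Full title case for a single word: lowercase everything, then uppercase
// the first character. This loses any interior capitals.
std::string to_full_title_case(std::string identifier)
{
    assert(!identifier.empty());
    for (auto& ch : identifier)
        ch = static_cast<char>(std::tolower(static_cast<unsigned char>(ch)));
    identifier[0] = static_cast<char>(std::toupper(static_cast<unsigned char>(identifier[0])));
    return identifier;
}
```

Given "minusSign", `capitalize` keeps the interior capital and produces "MinusSign", whereas `to_full_title_case` produces "Minussign".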