192 commits

Author SHA1 Message Date
Idan Horowitz
573061e76c LibUnicode: Extract the timeSeparator numeric symbol from CLDR
This will be used by Intl.DurationFormat
2022-07-01 01:00:05 +03:00
Timothy Flynn
1f2542247f LibUnicode: Upgrade to CLDR version 41.0.0
Release notes: https://cldr.unicode.org/index/downloads/cldr-41

Note that the HourCycleRegion enum now contains 272 entries, thus needs
to be bumped from u8 to u16.
2022-04-07 08:29:10 -04:00
Timothy Flynn
70ede2825e LibUnicode: Use BCP 47 data to filter valid calendar names 2022-02-16 07:23:07 -05:00
Timothy Flynn
71d86261c3 LibUnicode: Use BCP 47 data to filter valid numbering system names
There isn't too much of an effective difference here other than that the
BCP 47 data contains some aliases we would otherwise not handle.
2022-02-16 07:23:07 -05:00
Timothy Flynn
63c3437274 LibUnicode: Use BCP 47 data to generate available calendars and numbers
BCP 47 will be the single source of truth for known calendar and number
system keywords, and their aliases (e.g. "gregory" is an alias for
"gregorian"). Move the generation of available keywords to where we
parse the BCP 47 data, so that hard-coded aliases may be removed from
other generators.
2022-02-16 07:23:07 -05:00
Timothy Flynn
89ead8c00a LibJS+LibUnicode: Parse Unicode keywords from the BCP 47 CLDR package
We have a fair amount of hard-coded keywords / aliases that can now be
replaced with real data from BCP 47. As a result, this also changes the
awkward way we were previously generating keys. Before, we were more or
less generating keywords as a CSV list of keys, e.g. for the "nu" key,
we'd generate "latn,arab,grek" (ordered by locale preference). Then at
runtime, we'd split on the comma. We now just generate spans of keywords
directly.
2022-02-16 07:23:07 -05:00
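
The difference between the two keyword representations described in 89ead8c00a can be pictured roughly as below. This is a minimal sketch, not the generator's actual output: `split_keywords`, `KeywordSpan`, `s_nu_keywords`, and `keywords_for_nu` are hypothetical names, and `KeywordSpan` is a stand-in for whatever span type the real code uses.

```cpp
#include <array>
#include <cassert>
#include <cstddef>
#include <string_view>
#include <vector>

// Old approach, as the commit message describes it: the keywords for a key
// are stored as one comma-separated string and split on ',' at runtime.
std::vector<std::string_view> split_keywords(std::string_view csv)
{
    std::vector<std::string_view> keywords;
    while (!csv.empty()) {
        auto comma = csv.find(',');
        keywords.push_back(csv.substr(0, comma));
        if (comma == std::string_view::npos)
            break;
        csv.remove_prefix(comma + 1);
    }
    return keywords;
}

// Hypothetical stand-in for a span: a pointer plus a length, no ownership.
struct KeywordSpan {
    std::string_view const* data;
    std::size_t size;
};

// New approach: the keywords are emitted as an array at generation time,
// so a lookup only hands out a view of it and nothing is parsed at runtime.
static constexpr std::array<std::string_view, 3> s_nu_keywords { "latn", "arab", "grek" };

KeywordSpan keywords_for_nu()
{
    return { s_nu_keywords.data(), s_nu_keywords.size() };
}
```

Both forms yield the same ordered keywords for "nu"; the second avoids the runtime split and the temporary vector.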
thankyouverycool
0505e031f1 Meta+LibUnicode: Download and parse Unicode block properties
This parses Blocks.txt for CharacterType properties and creates
a global display array for use in apps.
2022-02-15 10:13:19 -05:00
Idan Horowitz
4967bcd4ce LibUnicode: Implement sentence segmentation 2022-01-31 21:05:04 +02:00
Idan Horowitz
a593a5c8ab LibUnicode: Implement word segmentation 2022-01-31 21:05:04 +02:00
Idan Horowitz
58b0eed6a7 LibUnicode: Implement grapheme segmentation 2022-01-31 21:05:04 +02:00
Idan Horowitz
2d50c08f34 LibUnicode: Download and parse {Grapheme,Word,Sentence} break props 2022-01-31 21:05:04 +02:00
Timothy Flynn
6efbafa6e0 Everywhere: Update copyrights with my new serenityos.org e-mail :^) 2022-01-31 18:23:22 +00:00
Timothy Flynn
bb0f548614 LibUnicode: Generate a list of available currencies 2022-01-31 00:32:41 +00:00
Timothy Flynn
481ced53d8 LibUnicode: Generate a list of available numbering systems 2022-01-31 00:32:41 +00:00
Timothy Flynn
ebd33e580b LibUnicode: Generate a list of available calendars 2022-01-31 00:32:41 +00:00
Timothy Flynn
f8892fdea2 LibUnicode: Templatize our naive implementation of plurality selection
As we didn't (and still don't) have Intl.PluralRules when we implemented
Intl.NumberFormat, we use a locale-unaware basic implementation to pick
a pattern based on a number's value. Templatize this method for now to
work with other format-like structures (will be used for relative-time
formatting).
2022-01-27 21:16:44 +00:00
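
A locale-unaware plurality selection of the kind f8892fdea2 describes might look like the sketch below. All names here (`Plurality`, `NumberPattern`, `RelativeTimePattern`, `pick_pattern`) are hypothetical, and the "1 is singular, everything else is plural" rule is an assumption standing in for the naive logic the message mentions. The template only requires that each format type has a `plurality` member.

```cpp
#include <cassert>
#include <string_view>
#include <vector>

// Hypothetical plural categories; real plural rules have more of them.
enum class Plurality { One, Other };

// Two unrelated format-like structures that both carry a plurality.
struct NumberPattern {
    Plurality plurality;
    std::string_view pattern;
};

struct RelativeTimePattern {
    Plurality plurality;
    std::string_view pattern;
};

// Naive, locale-unaware selection: exactly 1 picks the "one" pattern,
// any other value picks "other". Returns nullptr if nothing matches.
template<typename FormatType>
FormatType const* pick_pattern(std::vector<FormatType> const& formats, double value)
{
    auto wanted = (value == 1.0) ? Plurality::One : Plurality::Other;
    FormatType const* fallback = nullptr;

    for (auto const& format : formats) {
        if (format.plurality == wanted)
            return &format;
        if (format.plurality == Plurality::Other)
            fallback = &format;
    }

    return fallback;
}
```

Because the selection is a template, the same function serves number patterns and relative-time patterns without duplicating the loop.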
Timothy Flynn
789f093b2e LibUnicode: Parse and generate relative-time format patterns
Relative-time format patterns are of one of two forms:

    * Tensed - refer to the past or the future, e.g. "N years ago" or
      "in N years".
    * Numbered - refer to a specific numeric value, e.g. "in 1 year"
      becomes "next year" and "in 0 years" becomes "this year".

In ECMA-402, tensed and numbered refer to the numeric formatting options
of "always" and "auto", respectively.
2022-01-27 21:16:44 +00:00
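
To illustrate the tensed versus numbered distinction from 789f093b2e, here is a minimal sketch with hard-coded English strings for the "year" unit. `NumericDisplay` and `format_relative_years` are hypothetical names, and the strings are written out by hand for illustration rather than read from the CLDR.

```cpp
#include <cassert>
#include <cstdlib>
#include <string>

// Mirrors the ECMA-402 "numeric" option: Always selects tensed patterns,
// Auto allows numbered patterns for the values that have one.
enum class NumericDisplay { Always, Auto };

std::string format_relative_years(int value, NumericDisplay numeric)
{
    // Numbered patterns: specific values map to fixed phrases.
    if (numeric == NumericDisplay::Auto) {
        if (value == -1)
            return "last year";
        if (value == 0)
            return "this year";
        if (value == 1)
            return "next year";
    }

    // Tensed patterns: past or future, with a naive English plural.
    int magnitude = std::abs(value);
    std::string unit = (magnitude == 1) ? "year" : "years";

    if (value < 0)
        return std::to_string(magnitude) + " " + unit + " ago";
    return "in " + std::to_string(magnitude) + " " + unit;
}
```

With `NumericDisplay::Auto`, values without a numbered pattern still fall through to the tensed form.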
Timothy Flynn
2d2f713426 LibUnicode: Generate per-locale minimum grouping digit values
Previously, we were breaking up digits into groups without regard for
the locale's minimumGroupingDigits value in the CLDR. This value is 1 in
most locales, but is 2 in locales such as pl-PL. What this means is that
in those locales, the group separator should only be inserted if the
thousands group has at least 2 digits. So 1000 is formatted as "1,000"
in en-US, but "1000" in pl-PL. And 10000 is "10,000" in en-US and
"10 000" in pl-PL.
2022-01-27 20:30:52 +00:00
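
The minimumGroupingDigits rule from 2d2f713426 can be sketched as follows. This assumes a fixed group size of 3 and takes the separator as a parameter; `group_integer_digits` is a hypothetical name and the code is not taken from LibUnicode. A plain space is used for pl-PL to match the example in the commit message.

```cpp
#include <cassert>
#include <cstddef>
#include <string>
#include <string_view>

// Inserts `separator` between groups of three digits, but only when the
// leading (thousands) group would have at least `minimum_grouping_digits`
// digits. `digits` must contain only the integer digits of the number.
std::string group_integer_digits(std::string_view digits, std::string_view separator, std::size_t minimum_grouping_digits)
{
    constexpr std::size_t group_size = 3;

    // Too few digits for this locale: leave the number ungrouped.
    if (digits.size() < group_size + minimum_grouping_digits)
        return std::string(digits);

    std::string result;
    for (std::size_t i = 0; i < digits.size(); ++i) {
        // A separator goes before every digit that starts a full group.
        if (i != 0 && (digits.size() - i) % group_size == 0)
            result += separator;
        result += digits[i];
    }
    return result;
}
```

With a minimum of 1, "1000" becomes "1,000"; with a minimum of 2, "1000" stays ungrouped while "10000" becomes "10 000".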
Timothy Flynn
bced4e9324 LibJS+LibUnicode: Convert Intl.ListFormat to use Unicode::Style
Remove ListFormat's own definition of the Style enum, which was further
duplicated by a generated ListPatternStyle enum with the same values.
2022-01-25 19:02:59 +00:00
Timothy Flynn
e261132e8b LibUnicode: Add helper methods to convert a Style to and from a string
This conversion is duplicated a few times in our Intl implementation, so
let's just define these once and be done with it.
2022-01-25 19:02:59 +00:00
Timothy Flynn
7f6edb7976 LibUnicode: Remove the Unicode::Style::Numeric value
It is unused.
2022-01-25 19:02:59 +00:00
Timothy Flynn
0a4430fc41 LibJS+LibTimeZone+LibUnicode: Remove direct linkage to LibTimeZone
This is no longer needed now that LibTimeZone is included within LibC.
Remove the direct linkage so that others do not mistakenly copy-paste
the CMakeLists text elsewhere.
2022-01-23 12:48:26 +00:00
Timothy Flynn
4400150cd2 LibJS+LibUnicode: Return the appropriate time zone name depending on DST 2022-01-19 21:20:41 +00:00
Timothy Flynn
70f49d0696 LibJS+LibTimeZone+LibUnicode: Indicate whether a time zone is in DST
Return whether the time zone is in DST during the provided time from
TimeZone::get_time_zone_offset.
2022-01-19 21:20:41 +00:00
Timothy Flynn
701b7810ba LibUnicode: Generate code point abbreviations 2022-01-18 15:13:25 +00:00
Timothy Flynn
c86f7a675d LibUnicode: Do not limit language display names to known locales
Currently, the UnicodeLocale generator collects a list of known locales
from the CLDR before processing language display names. For each locale,
the identifier is broken into language, script, and region subtags, and
we create a list of seen languages. When processing display names, we
skip languages we hadn't seen in that first step.

This is insufficient for language display names like "en-GB", which do
not have a locale entry in the CLDR, and thus are skipped. So instead,
create the list of known languages by actually reading through the list
of languages which have a display name.
2022-01-13 23:05:31 +01:00
Timothy Flynn
b0671ceb74 LibUnicode: Add a method to combine locale subtags into a display string
This is just a convenience wrapper around the underlying generated APIs.
2022-01-13 23:05:31 +01:00
Timothy Flynn
91acc2e9c5 LibUnicode: Parse and generate locale display patterns
These patterns indicate how to display locale strings when that locale
contains multiple subtags. For example, "en-US" would be displayed as
"English (United States)".
2022-01-13 23:05:31 +01:00
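
Applying a locale display pattern as in 91acc2e9c5 amounts to placeholder substitution. The sketch below assumes a pattern of the form "{0} ({1})", where {0} is the language name and {1} the region name. Both the pattern string and the function name `apply_locale_pattern` are assumptions for illustration, not the generated API.

```cpp
#include <cassert>
#include <string>
#include <string_view>

// Substitutes the language display name for "{0}" and the region display
// name for "{1}" in the given pattern. Only the first occurrence of each
// placeholder is replaced.
std::string apply_locale_pattern(std::string_view pattern, std::string_view language, std::string_view region)
{
    std::string result(pattern);

    auto replace_first = [&](std::string_view placeholder, std::string_view value) {
        auto index = result.find(placeholder);
        if (index != std::string::npos)
            result.replace(index, placeholder.size(), value);
    };

    replace_first("{0}", language);
    replace_first("{1}", region);
    return result;
}
```

Called with "{0} ({1})", "English", and "United States", this produces "English (United States)", the example given in the commit message.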
Timothy Flynn
8126cb2545 LibJS+LibUnicode: Remove unnecessary locale currency mapping wrapper
Before LibUnicode generated methods were weakly linked, we had a public
method (get_locale_currency_mapping) for retrieving currency mappings.
That method invoked one of several style-specific methods that only
existed in the generated UnicodeLocale.

One caveat of weakly linked functions is that every such function must
have a public declaration. The result is that each of those styled
methods is declared publicly, which makes the wrapper redundant
because it is just as easy to invoke the method for the desired style.
2022-01-13 13:43:57 +01:00
Timothy Flynn
0d75949827 LibUnicode: Parse and generate locale display names for date fields 2022-01-13 13:43:57 +01:00
Timothy Flynn
7f162c471d LibUnicode: Parse and generate locale display names for calendars
Note there's a bit of an unfortunate duplication in the calendar enum
generated by UnicodeLocale and the existing enum generated by
UnicodeDateTimeFormat. The former contains every calendar known to the
CLDR, whereas the latter contains the calendars we've actually parsed
for DateTimeFormat (currently only Gregorian). The new enum generated
here can be removed once DateTimeFormat knows about all calendars.
2022-01-13 13:43:57 +01:00
Timothy Flynn
c5138f0f2b LibUnicode: Parse number system digits from the CLDR
We had a hard-coded table of number system digits copied from ECMA-402.
Turns out these digits are in the CLDR, so let's parse the digits from
there instead of hard-coding them.
2022-01-12 10:49:07 +01:00
Timothy Flynn
d50f5e14f8 LibUnicode: Fall back to GMT offset when a time zone name is unavailable
The following table in TR-35 includes a web of fallback rules when the
requested time zone style is unavailable:
https://unicode.org/reports/tr35/tr35-dates.html#dfst-zone

Conveniently, the subset of styles supported by ECMA-402 (and therefore
LibUnicode) all either fall back to GMT offset or to a style that is
unsupported but itself falls back to GMT offset.
2022-01-11 23:56:35 +01:00
Timothy Flynn
8d35563f28 LibUnicode: Implement TR-35's localized GMT offset formatting
This adds an API to use LibTimeZone to convert a time zone such as
"America/New_York" to a GMT offset string like "GMT-5" (short form) or
"GMT-05:00" (long form).
2022-01-11 23:56:35 +01:00
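
The two GMT offset forms mentioned in 8d35563f28 can be sketched as below. This takes the offset in seconds directly instead of looking up a time zone, hard-codes the "GMT" prefix and ASCII sign characters, and omits localization. `OffsetStyle` and `format_gmt_offset` are hypothetical names, and rendering a zero offset as a bare "GMT" is an assumption.

```cpp
#include <cassert>
#include <cstdio>
#include <cstdlib>
#include <string>

enum class OffsetStyle { Short, Long };

// Formats a UTC offset in seconds as "GMT-5" (short) or "GMT-05:00" (long).
// The short form keeps the minutes only when they are non-zero.
std::string format_gmt_offset(int offset_seconds, OffsetStyle style)
{
    if (offset_seconds == 0)
        return "GMT";

    char sign = offset_seconds < 0 ? '-' : '+';
    int total_minutes = std::abs(offset_seconds) / 60;
    int hours = total_minutes / 60;
    int minutes = total_minutes % 60;

    char buffer[32];
    if (style == OffsetStyle::Long)
        std::snprintf(buffer, sizeof(buffer), "GMT%c%02d:%02d", sign, hours, minutes);
    else if (minutes == 0)
        std::snprintf(buffer, sizeof(buffer), "GMT%c%d", sign, hours);
    else
        std::snprintf(buffer, sizeof(buffer), "GMT%c%d:%02d", sign, hours, minutes);

    return buffer;
}
```

An offset of -18000 seconds, five hours behind UTC, gives "GMT-5" in the short form and "GMT-05:00" in the long form.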
Timothy Flynn
6409900a5b LibUnicode: Add an API to retrieve a locale's default numbering system 2022-01-11 23:56:35 +01:00
Timothy Flynn
cc5e9f0579 LibJS+LibUnicode: Move replacement of number system digits to LibUnicode
There are a few algorithms in TR-35 that need to replace digits before
returning any results to callers. For example, when formatting time zone
offsets, a string like "GMT+12:34" must have its digits replaced with
the default numbering system for the desired locale.
2022-01-11 23:56:35 +01:00
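
Digit replacement as described in cc5e9f0579 can be sketched as a per-character substitution over a table of ten digit strings. The table below hard-codes the Arabic-Indic digits (U+0660 to U+0669) as UTF-8 byte sequences purely for illustration, and `s_arab_digits` and `replace_digits` are hypothetical names. Per c5138f0f2b, the real digit tables come from the CLDR.

```cpp
#include <array>
#include <cassert>
#include <string>
#include <string_view>

// UTF-8 encodings of U+0660 (ARABIC-INDIC DIGIT ZERO) through U+0669.
static constexpr std::array<std::string_view, 10> s_arab_digits {
    "\xD9\xA0", "\xD9\xA1", "\xD9\xA2", "\xD9\xA3", "\xD9\xA4",
    "\xD9\xA5", "\xD9\xA6", "\xD9\xA7", "\xD9\xA8", "\xD9\xA9"
};

// Replaces every ASCII digit in `text` with the matching entry of `digits`
// and copies all other bytes through unchanged.
std::string replace_digits(std::string_view text, std::array<std::string_view, 10> const& digits)
{
    std::string result;
    for (char ch : text) {
        if (ch >= '0' && ch <= '9')
            result += digits[static_cast<std::size_t>(ch - '0')];
        else
            result += ch;
    }
    return result;
}
```

Applied to "GMT+12:34", only the four digits change; the prefix, sign, and colon pass through as-is.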
Timothy Flynn
15947aa1f0 LibUnicode: Add an hour-cycle field to DateTimeFormat's format pattern 2022-01-10 16:18:05 +01:00
Timothy Flynn
498b741434 LibUnicode: Use LibTimeZone's list of time zone names
LibUnicode no longer needs to generate a list of time zone names that it
parsed from metaZones.json. We can defer to the TZDB for a golden list
of time zones.
2022-01-08 12:45:34 +01:00
Timothy Flynn
6d7d9dd324 LibUnicode: Do not assume time zones & meta zones have a 1-to-1 mapping
The generator parses metaZones.json to form a mapping of meta zones to
time zones (AKA "golden zone" in TR-35). This parser errantly assumed
this was a 1-to-1 mapping.
2022-01-06 22:28:01 +01:00
Timothy Flynn
1116a29c19 LibUnicode: Remove now unused Unicode symbol loader
All generated sources are now linked via weak symbols.
2022-01-04 22:49:43 +00:00
Timothy Flynn
437b9fe204 LibUnicode: Convert UnicodeData to link with weak symbols 2022-01-04 22:49:43 +00:00
Timothy Flynn
f576142fe8 LibJS+LibUnicode: Convert UnicodeLocale to link with weak symbols 2022-01-04 22:49:43 +00:00
Timothy Flynn
ba4cdf34f8 LibUnicode: Convert UnicodeDateTimeFormat to link with weak symbols 2022-01-04 22:49:43 +00:00
Timothy Flynn
98709d9be1 LibUnicode: Convert UnicodeNumberFormat to link with weak symbols
Currently, we load the generated Unicode symbols with dlopen at runtime.
This is unnecessary as of 565a880ce5.

Applications that want Unicode data now link directly against the shared
library holding that data. So the same functionality can be achieved
with weak symbols.
2022-01-04 22:49:43 +00:00
Timothy Flynn
126a3fe180 LibUnicode: Add minimal support for generic & offset-based time zones
ECMA-402 now supports short-offset, long-offset, short-generic, and
long-generic time zone name formatting. For example, in the en-US locale
the America/Eastern time zone would be formatted as:

    short-offset: GMT-5
    long-offset: GMT-05:00
    short-generic: ET
    long-generic: Eastern Time

We currently only support the UTC time zone, however. Therefore, this
very minimal implementation does not consider GMT offset or generic
display names. Instead, the CLDR defines specific strings for UTC.
2022-01-03 15:11:59 +01:00
Timothy Flynn
c417374dd6 LibUnicode: Remove linkage from LibUnicode to LibUnicodeData
LibUnicodeData can now be loaded dynamically at runtime.
2021-12-21 13:09:49 -08:00
Timothy Flynn
15e1498419 LibUnicode: Dynamically load the generated UnicodeDateTimeFormat symbols 2021-12-21 13:09:49 -08:00
Timothy Flynn
a1f0ca59ae LibUnicode: Dynamically load the generated UnicodeNumberFormat symbols 2021-12-21 13:09:49 -08:00
Timothy Flynn
09be26b5d2 LibUnicode: Dynamically load the generated UnicodeLocale symbols 2021-12-21 13:09:49 -08:00
Timothy Flynn
3fd53baa25 LibUnicode: Dynamically load the generated UnicodeData symbols
The generated data for libunicodedata.so is quite large, and loading it
is a price paid by nearly every application by way of depending on
LibRegex. In order to defer this cost until an application actually uses
one of the surrounding APIs, dynamically load the generated symbols.

To be able to load the symbols dynamically, the generated methods must
have demangled names. Typically, this is accomplished with `extern "C"`
blocks. The clang toolchain complains about this here because the types
returned from the generators are strictly C++ types. So to demangle the
names, we use the asm() compiler directive to manually define a symbol
name; the caveat is that we *must* be sure the symbols are unique. As an
extra precaution, we prefix each symbol name with "unicode_". For more
details, see: https://gcc.gnu.org/onlinedocs/gcc/Asm-Labels.html

The symbol loader used in this implementation provides the additional
benefit of removing many [[maybe_unused]] attributes from the LibUnicode
methods. Internally, if ENABLE_UNICODE_DATABASE_DOWNLOAD is OFF, the
loader is able to stub out the function pointers it returns.

Note that as of this commit, LibUnicode is still directly linked against
LibUnicodeData. This commit is just a first step towards removing that.
2021-12-21 13:09:49 -08:00
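
The asm-label technique described in 3fd53baa25 can be illustrated with a small sketch. It relies on the GCC/Clang asm label extension linked from the commit message. The function, its namespace, and its body are hypothetical; only the "unicode_" symbol prefix follows the commit message.

```cpp
#include <cassert>
#include <string_view>

namespace Sketch {

// The asm label goes on a declaration that precedes the definition. It
// fixes the linker-visible symbol name to "unicode_block_display_name"
// instead of the mangled C++ name, even though the function returns a
// C++ type and lives in a namespace.
std::string_view block_display_name(unsigned code_point) asm("unicode_block_display_name");

// Hypothetical body; a generated implementation would consult real data.
std::string_view block_display_name(unsigned code_point)
{
    return code_point < 0x80 ? "Basic Latin" : "Unknown";
}

}
```

Within C++ the function is still called as `Sketch::block_display_name(0x41)`. A loader could in principle look the symbol up by its plain name, for example with POSIX `dlsym(handle, "unicode_block_display_name")`, which is why the commit message stresses that such names must be unique.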