This includes:
* The minimum number of days in a week for that week to count as the
first week of a new year.
* The day to be shown as the first day of the week in a calendar.
* The start/end days of the weekend.
Like the existing hour cycle data, week data is presented per-region in
the CLDR, rather than per-locale. The method to add likely subtags to a
locale to perform region lookups is the same.
The list of regions in the CLDR for hour cycle, minimum days, first day,
and weekend days are quite different. So rather than changing the
existing HourCycleRegion enum to a generic Region enum, we generate
separate enums for each of the week data fields. This allows each lookup
into these fields to remain simple array-based index access, without any
"jumps" for regions that don't have CLDR data for a field.
This commit has no behavior changes.
In particular, this does not fix any of the wrong uses of the previous
default parameter (which used to be 'false', meaning "only replace the
first occurence in the string"). It simply replaces the default uses by
String::replace(..., ReplaceMode::FirstOnly), leaving them incorrect.
BCP 47 will be the single source of truth for known calendar and number
system keywords, and their aliases (e.g. "gregory" is an alias for
"gregorian"). Move the generation of available keywords to where we
parse the BCP 47 data, so that hard-coded aliases may be removed from
other generators.
Our generator is currently preferring the DST variant of the time zone
display names over the non-DST variant. LibTimeZone currently does not
have DST support, and operates in a mode that basically assumes DST does
not exist. Swap the display names for now just to be consistent until we
have DST support.
Note we will need to generate both of these variants and select the
appropriate one at runtime once we have DST support.
This adds an API to use LibTimeZone to convert a time zone such as
"America/New_York" to a GMT offset string like "GMT-5" (short form) or
"GMT-05:00" (long form).
The generate_mapping helper generates a series of structs like:
Array<SomeType, 1> s_mapping_key_0 {};
Array<SomeType, 2> s_mapping_key_1 {};
Array<SomeType, 3> s_mapping_key_2 {};
Array<Span<SomeType const>> s_mapping { {
s_mapping_key_0.span(),
s_mapping_key_1.span(),
s_mapping_key_2.span(),
} };
Where the names of the struct were generated by the format_mapping_name
lambda inside the helper. Rather than this lambda making assumptions on
how each generator wants to name its structs, add a parameter for the
caller to provide a naming formatter.
This is because the TimeZoneData generator will want pretty specific
identifier formatting rules.
LibUnicode no longer needs to generate a list of time zone names that it
parsed from metaZones.json. We can defer to the TZDB for a golden list
of time zones.
The generator parses metaZones.json to form a mapping of meta zones to
time zones (AKA "golden zone" in TR-35). This parser errantly assumed
this was a 1-to-1 mapping.
In Unicode::get_time_zone_name(), we don't need to require that the time
zone is UTC for long- and short-style name lookups. This is required for
other styles, because they will depend on TZDB data - so move the VERIFY
to that scope.
When searching for the locale-specific flexible day period for a given
hour, we were neglecting to handle cases where the period crosses 00:00.
For example, the en locale defines a day period range of [21:00, 06:00).
When given the hour of 05:00, we were checking if (21 <= 5 && 5 < 6),
thus not recognizing that the hour falls in that period.
ECMA-402 now supports short-offset, long-offset, short-generic, and
long-generic time zone name formatting. For example, in the en-US locale
the America/Eastern time zone would be formatted as:
short-offset: GMT-5
long-offset: GMT-05:00
short-generic: ET
long-generic: Eastern Time
We currently only support the UTC time zone, however. Therefore, this
very minimal implementation does not consider GMT offset or generic
display names. Instead, the CLDR defines specific strings for UTC.
The variable `s_time_zone_list_index_type` seems to be unused (detected
when compiling with clang), and it seems logical to bind it even it if
it is not used for now.
Similar to commit 2a7f36b392, this change moves the generated
CalendarSymbol enumeration to the public LibUnicode/NumberFormat.h
header with a pre-defined set of symbols that we need. This is to
prepare for uniquely generating the CalendarSymbols structure.
Each of the 374 generated calendars include 4 sets of symbols, each of
which have 3 lists of symbols (narrow, short, long). Of these 4488
lists, only 819 are unique.
In the CLDR, there aren't "night" values, there are "night1" & "night2"
values. This is for locales which use a different name for nighttime
depending on the hour. For example, the ja locale uses "夜" between the
hours of 19:00 and 23:00, and "夜中" between the hours of 23:00 and
04:00. Our CLDR parser is currently ignoring "night2", so this rename
is to prepare for that.
We could probably come up with better names, but in the end, the API in
LibUnicode will be such that outside callers won't even see Night1, etc.
Pattern skeletons are more or less the "key" of format patterns. Every
format pattern is assigned a skeleton. Interval patterns (which are not
yet parsed) are also assigned a skeleton - this is used to match them to
an "owning" format pattern. So we will use the skeleton generated here
to match format patterns at runtime with their available interval
patterns.
An alternative approach would be to append interval patterns directly to
their owning format pattern, but this has some draw backs:
1. Skeletons aren't totally unique. A skeleton may appear in both
the "dateFormats" and "availableFormats" objects, in which case
the same interval formats would be generated more than once.
2. Otherwise unique format patterns may only differ by the interval
patterns assigned to them. This would cause the UniqueStorage for
the format patterns to increase in size, impacting both compile
times and libunicode.so size.
The parsing in parse_calendar_symbols() might be a bit more verbose than
it really needs to be, but it is to ensure the symbols are generated in
a known order that we can control with enumerations.
TR-35's Matching Skeleton algorithm dictates how user requests including
fractional second digits should be handled when the CLDR format pattern
does not include that field. When the format pattern contains {second},
but does not contain {fractionalSecondDigits}, generate a second pattern
which appends "{decimal}{fractionalSecondDigits}" to the {second} field.
TR-35 does define lengths for {ampm}, but they are unused by ECMA-402.
To the contrary, defining the day_period length for this segment will
prevent BasicFormatMatcher from ever selecting a pattern that contains
this segment. Instead, ECMA-402 will only use the short length for
{ampm} segments.
TR-35 describes how to combine date, time, and available formats with
date-time format patterns to generate more available format patterns:
https://unicode.org/reports/tr35/tr35-dates.html#Missing_Skeleton_Fields
Use these steps to generate ~400 new patterns for each calendar. These
are required for ECMA-402's BasicFormatMatcher to produce reasonable
results.
Similar to NumberFormat, replace the segments of date-time patterns with
partitions that can be split at runtime. Also generate the pattern style
fields for e.g. era, day, hour, etc.