Each of these strings would previously rely on StringView's char const*
constructor overload, which would call __builtin_strlen on the string.
Since we now have operator ""sv, we can replace these with much simpler
versions. This opens the door to being able to remove
StringView(char const*).
No functional changes.
Error::from_string_literal now takes direct char const*s, while
Error::from_string_view does what Error::from_string_literal used to do:
taking StringViews. This change will remove the need to insert `sv`
after error strings when returning string literal errors once
StringView(char const*) is removed.
No functional changes.
These are mostly minor mistakes I've encountered while working on the
removal of StringView(char const*). The usage of builder.put_string over
Format<FormatString>::format is preferrable as it will avoid the
indirection altogether when there's no formatting to be done. Similarly,
there is no need to do format(builder, "{}", number) when
builder.put_u64(number) works equally well.
Additionally a few Strings where only constant strings were used are
replaced with StringViews.
To prepare for using plural rules within number & duration format, this
removes the NumberFormat::Plurality enumeration.
This also adds PluralCategory::ExactlyZero & PluralCategory::ExactlyOne.
These are used in locales like French, where PluralCategory::One really
means any value from 0.00 to 1.99. PluralCategory::ExactlyOne means only
the value 1, as the name implies. These exact rules are not known by the
general plural rules, they are explicitly for number / currency format.
The PluralCategory enum is currently generated for plural rules. Instead
of generating it, this moves the enum to the public LibUnicode header.
While it was nice to auto-discover these values, they are well defined
by TR-35, and we will need their values from within the number format
code generator (which can't rely on the plural rules generator having
run yet). Further, number format will require additional values in the
enum that plural rules doesn't know about.
Plural rules in the CLDR are of the form:
"cs": {
"pluralRule-count-one": "i = 1 and v = 0 @integer 1",
"pluralRule-count-few": "i = 2..4 and v = 0 @integer 2~4",
"pluralRule-count-many": "v != 0 @decimal 0.0~1.5, 10.0, 100.0 ...",
"pluralRule-count-other": "@integer 0, 5~19, 100, 1000, 10000 ..."
}
The syntax is described here:
https://unicode.org/reports/tr35/tr35-numbers.html#Plural_rules_syntax
There are up to 2 sets of rules for each locale, a cardinal set and an
ordinal set. The approach here is to generate a C++ function for each
set of rules. Each condition in the rules (e.g. "i = 1 and v = 0") is
transpiled to a C++ if-statement within its function. Then lookup tables
are generated to match locales to their generated functions.
NOTE: -Wno-parentheses-equality is added to the LibUnicodeData compile
flags because the generated plural rules have lots of extra parentheses
(because e.g. we need to selectively negate and combine rules). The code
to generate only exactly the right number of parentheses is quite hairy,
so this just tells the compiler to ignore the extras.
This includes:
* The minimum number of days in a week for that week to count as the
first week of a new year.
* The day to be shown as the first day of the week in a calendar.
* The start/end days of the weekend.
Like the existing hour cycle data, week data is presented per-region in
the CLDR, rather than per-locale. The method to add likely subtags to a
locale to perform region lookups is the same.
The list of regions in the CLDR for hour cycle, minimum days, first day,
and weekend days are quite different. So rather than changing the
existing HourCycleRegion enum to a generic Region enum, we generate
separate enums for each of the week data fields. This allows each lookup
into these fields to remain simple array-based index access, without any
"jumps" for regions that don't have CLDR data for a field.
Currently contains just each locale's character order, but is set up to
easily add other text layout fields from the CLDR if ECMA-402 eventually
requires them.
This matches the target names for the main serenity build, and will make
simplifying the Lagom build much easier going forward.
The LagomFoo name came from a time when we had both library builds in
the same CMake generated project and needed to deconflict the names.
This commit has no behavior changes.
In particular, this does not fix any of the wrong uses of the previous
default parameter (which used to be 'false', meaning "only replace the
first occurence in the string"). It simply replaces the default uses by
String::replace(..., ReplaceMode::FirstOnly), leaving them incorrect.
Similar reasoning to making Core::Stream::read() return Bytes, except
that every user of read_line() creates a StringView from the result, so
let's just return one right away.
A mistake I've repeatedly made is along these lines:
```c++
auto nread = TRY(source_file->read(buffer));
TRY(destination_file->write(buffer));
```
It's a little clunky to have to create a Bytes or StringView from the
buffer's data pointer and the nread, and easy to forget and just use
the buffer. So, this patch changes the read() function to return a
Bytes of the data that were just read.
The other read_foo() methods will be modified in the same way in
subsequent commits.
Fixes#13687
This is an editorial change in the Intl spec. See:
https://github.com/tc39/ecma402/commit/087995chttps://github.com/tc39/ecma402/commit/233d29c
This also adds a missing spec link for the sanctioned units and fixes a
broken spec link for IsSanctionedSingleUnitIdentifier. In LibUnicode,
the NumberFormat generator is updated to use the constexpr helper to
retrieve sanctioned units.
BCP 47 will be the single source of truth for known calendar and number
system keywords, and their aliases (e.g. "gregory" is an alias for
"gregorian"). Move the generation of available keywords to where we
parse the BCP 47 data, so that hard-coded aliases may be removed from
other generators.
We have a fair amount of hard-coded keywords / aliases that can now be
replaced with real data from BCP 47. As a result, the also changes the
awkward way we were previously generating keys. Before, we were more or
less generating keywords as a CSV list of keys, e.g. for the "nu" key,
we'd generate "latn,arab,grek" (ordered by locale preference). Then at
runtime, we'd split on the comma. We now just generate spans of keywords
directly.
This package was originally meant to be included in CLDR version 40, but
was missed in their release scripts. This has been resolved:
https://unicode-org.atlassian.net/browse/CLDR-15158
Unfortunately, the CLDR was re-released with the same version number. So
to bust the build's CLDR cache, change the "version" used to detect that
we need to redownload the CLDR.
This adds a generator utility to read an entire file and parse it as a
JSON value. This is heavily used by the CLDR generators. The idea here
is to put the file reading details in the utility so that when we have a
good story for generically reading an entire stream in LibCore, we can
update the generators to use that by only touching this helper.
This also moves the open_file helper to the utility file. It's currently
a lambda redefined in each TZDB/Unicode generator. It used to display
the missing command line flag and other info local to each generator.
After switching to LibMain, it just returns a generic error message, and
is duplicated several times.
Unlike other BCP47 keywords that we are parsing, these only appear in
the BCP47 XML file itself within the CLDR. The values are very simple
though, so just hard code them until the Unicode org re-releases the
CLDR with BCP47: https://unicode-org.atlassian.net/browse/CLDR-15158
Relative-time format patterns are of one of two forms:
* Tensed - refer to the past or the future, e.g. "N years ago" or
"in N years".
* Numbered - refer to a specific numeric value, e.g. "in 1 year"
becomes "next year" and "in 0 years" becomes "this year".
In ECMA-402, tensed and numbered refer to the numeric formatting options
of "always" and "auto", respectively.
This sets up the generator plumbing to create the relative-time data
files. This data could probably be included in the date-time generator,
but that generator is large enough that I'd rather put this tangentially
related data in its own file.
Previously, we were breaking up digits into groups without regard for
the locale's minimumGroupingDigits value in the CLDR. This value is 1 in
most locales, but is 2 in locales such as pl-PL. What this means is that
in those locales, the group separator should only be inserted if the
thousands group has at least 2 digits. So 1000 is formatted as "1,000"
in en-US, but "1000" in pl-PL. And 10000 is "10,000" in en-US and
"10 000" in pl-PL.
Currently, the UnicodeLocale generator collects a list of known locales
from the CLDR before processing language display names. For each locale,
the identifier is broken into language, script, and region subtags, and
we create a list of seen languages. When processing display names, we
skip languages we hadn't seen in that first step.
This is insufficient for language display names like "en-GB", which do
not have an locale entry in the CLDR, and thus are skipped. So instead,
create the list of known languages by actually reading through the list
of languages which have a display name.
These patterns indicate how to display locale strings when that locale
contains multiple subtags. For example, "en-US" would be displayed as
"English (United States)".