0ct0pu5/ladybird

Author	SHA1	Message	Date
Timothy Flynn	3ae4ff109f	LibUnicode: Extract canonicalization of Unicode extension values LibJS will need to canonicalize Unicode extension values, so extract the lambda that was doing this work to its own function. This also changes the helpers it invokes to take the provided key as a StringView because we don't need (and won't always have) full String objects here.	2021-09-11 11:05:50 +01:00
Timothy Flynn	b1d4bcf364	LibUnicode: Generate numeric keyword values for each locale This is needed for Intl.NumberFormat's usage of the ResolveLocale AO, where the [[RelevantExtensionKeys]] internal slot will be "nu".	2021-09-11 11:05:50 +01:00
Timothy Flynn	4f2bcebe74	LibUnicode+LibJS: Store locale keyword values as a single string Previously, LibUnicode would store the values of a keyword as a Vector. For example, the locale "en-u-ca-abc-def" would have its keyword "ca" stored as {"abc", "def"}. Then, canonicalization would occur on each of the elements in that Vector. This is incorrect because, for example, the keyword value "true" should only be dropped if that is the entire value. That is, the canonical form of "en-u-kb-true" is "en-u-kb", but "en-u-kb-abc-true" is unchanged by canonicalization. However, we would canonicalize that locale as "en-u-kb-abc".	2021-09-08 21:08:48 +01:00
Timothy Flynn	75657b79c6	LibUnicode: Update comment with link to related upstream issue LibUnicode has to hard-code some aliases because the related data is not available in the JSON export of CLDR. Turns out there is a ticket to add this data in an upcoming CLDR release. Add a link to that ticket for reference.	2021-09-08 21:08:48 +01:00
Timothy Flynn	3f64a14e06	LibUnicode: Parse and generate the Unicode locale list patterns dataset This data informs consumers how to join lists of values. For example, in en-US, the list ["a", "b", "c"] formatted to a string should become "a, b, and c".	2021-09-06 23:49:56 +01:00
Timothy Flynn	12ae0a44d7	LibUnicode: Add public wrapper for the generated locale_from_string	2021-09-06 15:24:27 +01:00
Timothy Flynn	a77f323dfb	LibUnicode: Implement the Remove Likely Subtags method Unlike Add Likely Subtags, this method doesn't require generated data. Instead, it is defined in terms of Add Likely Subtags.	2021-09-04 13:51:40 +01:00
Timothy Flynn	e6a2ab1202	LibUnicode: Generate an implementation of the Add Likely Subtags method	2021-09-04 13:51:40 +01:00
Timothy Flynn	ca90231794	LibUnicode: Define is_unicode_*_subtag helpers inline in their header The UnicodeLocale generator will need to parse canonicalized locale strings, and will require using these methods. However, the generator cannot depend on LibUnicode because Locale.cpp within LibUnicode already depends on the generated file. Instead, defining the methods that the generator needs inline allows the generator to use them without linking against LibUnicode.	2021-09-04 13:51:40 +01:00
Timothy Flynn	21c4922ac0	LibUnicode: Add helper methods to LocaleID and LanguageID for LibJS Add a method to remove an extension type from the locale's extension set and methods to convert a locale and language to a string without canonicalization. Each of these will be used by LibJS.	2021-09-02 17:56:42 +01:00
Timothy Flynn	a05419db55	LibUnicode: Add lexer to test if a string matches the "type" production	2021-09-02 17:56:42 +01:00
Timothy Flynn	fd0011989a	LibUnicode: Resolve the most likely territory alias when there are many	2021-09-01 14:14:47 +01:00
Timothy Flynn	1fbc5dba08	LibUnicode: Generate Unicode locale likely subtag data CLDR contains a set of likely subtag data where, given a locale, you can resolve what is the most likely language, script, or territory of that locale. This data is needed for resolving territory aliases. These aliases might contain multiple territories, and we need to resolve which of those territories is most likely correct for a locale. Note that the likely subtag data is quite huge (a few thousand entries). As an optimization encouraged by the spec, we only generate the smallest subset of this data that we actually need (about 150 entries).	2021-09-01 14:14:47 +01:00
Timothy Flynn	72f49e42b4	LibUnicode: Perform complex Unicode locale alias substitution	2021-09-01 14:14:47 +01:00
Timothy Flynn	da89cf9afb	LibUnicode: Canonicalize calendar subtags Calendar subtags are a bit of an odd-man-out in that we must match the variants "ethiopic-amete-alem" in that order, without any other variant in the locale. So a separate method is needed for this, and we now defer sorting the variant list until after other canonicalization is done.	2021-09-01 14:14:47 +01:00
Timothy Flynn	8458f477a4	LibUnicode: Canonicalize timezone subtags	2021-09-01 14:14:47 +01:00
Timothy Flynn	335f985b31	LibUnicode: Canonicalize the subtag "imperial" to "uksystem"	2021-09-01 14:14:47 +01:00
Timothy Flynn	2d90144888	LibUnicode: Canonicalize the subtag "primary" and "tertiary" to "levelN"	2021-09-01 14:14:47 +01:00
Timothy Flynn	409f39b336	LibUnicode: Canonicalize the subtag "names" to "prprname"	2021-09-01 14:14:47 +01:00
Timothy Flynn	f907a7dc38	LibUnicode: Canonicalize the subtag "yes" to "true"	2021-09-01 14:14:47 +01:00
Timothy Flynn	556374a904	LibUnicode: Substitute Unicode locale aliases during canonicalization Unicode TR35 defines how locale subtag aliases should be emplaced when converting a locale to canonical form. For most subtags, it is a simple substitution. Language subtags depend on context; for example, the language "sh" should become "sr-Latn", but if the original locale has a script subtag already ("sh-Cyrl"), then only the language subtag of the alias should be taken ("sr-Cyrl"). To facilitate this, we now make two passes when canonicalizing a locale. In the first pass, we convert the LocaleID structure to canonical syntax (where the conversions all happen in-place). In the second pass, we form the canonical string based on the canonical syntax.	2021-09-01 14:14:47 +01:00
Timothy Flynn	9b118f1f06	LibUnicode: Generate Unicode locale alias data CLDR contains a set of aliases for languages, territories, etc. that no longer are meant to be used (e.g. due to deprecation). For example, the language "aam" is deprecated and should be canonicalized as "aas".	2021-09-01 14:14:47 +01:00
Timothy Flynn	d13142f015	LibJS+LibUnicode: Store parsed Unicode locale data as full strings Originally, it was convenient to store the parsed Unicode locale data as views into the original string being parsed. But implementing locale aliases will require mutating the data that was parsed. To prepare for that, store the parsed data as proper strings.	2021-09-01 14:14:47 +01:00
Timothy Flynn	f897c2edb3	LibUnicode: Canonicalize locale private use extensions	2021-08-30 19:42:40 +01:00
Timothy Flynn	6f0cb52dc4	LibUnicode: Canonicalize locale extensions	2021-08-30 19:42:40 +01:00
Timothy Flynn	671eaa0c59	LibUnicode: Add helper lambda for appending canonicalized strings Once canonical extensions are implemented, the number of: if (optional_string.has_value()) { builder.append('-'); builder.append(optional_string->to_lowercase_string()); } will be quite large. This commit just adds a helper lambda to handle this pattern to prevent this function from becoming even more enormous.	2021-08-30 19:42:40 +01:00
Timothy Flynn	30855e6663	LibUnicode: Parse locale private use extensions	2021-08-30 19:42:40 +01:00
Timothy Flynn	29f76ef7c8	LibUnicode: Parse locale extensions of the other extension form	2021-08-30 19:42:40 +01:00
Timothy Flynn	d2d304fcf8	LibUnicode: Parse locale extensions of the transformed extension form	2021-08-30 19:42:40 +01:00
Timothy Flynn	eda92d15e4	LibUnicode: Parse locale extensions of the Unicode locale extension form	2021-08-30 19:42:40 +01:00
Timothy Flynn	dd89901b07	LibUnicode: Use GenericLexer to parse Unicode language IDs This is preparatory work to read locale extensions. The parser currently enforces that the entire string is consumed. But to parse extensions, parse_unicode_locale_id() will need parse_unicode_language_id() to just stop parsing on the first segment that does not match the language ID grammar. It will also need to know where the parsing stopped. Both of these needs are fulfilled by GenericLexer. The caveat is that we can no longer simply split the parsed string on separator characters. So parse_unicode_language_id() now operates as a small state machine.	2021-08-30 19:42:40 +01:00
Timothy Flynn	8b93d51212	LibUnicode: Parse Unicode CLDR currencies and generate locale mappings	2021-08-27 12:32:24 +01:00
Timothy Flynn	0f02def3c2	LibUnicode: Parse Unicode CLDR scripts and generate locale mappings	2021-08-27 12:32:24 +01:00
Timothy Flynn	ab7a1dd89e	LibUnicode: Parse Unicode CLDR languages and generate locale mappings	2021-08-27 12:32:24 +01:00
Timothy Flynn	6719e5cb17	LibUnicode: Generate locale subtag data as multiple smaller tables This commit is preemptive to upcoming commits which add more subtags to the CLDR generator. Rather than generating a giant HashMap containing all data, generate more (smaller) Array-based tables. This mimics the UCD generator. This also allows simpler lookups at runtime since we can generate index-based lookups into the smaller tables rather easily. Without this change, adding the remaining locale subtags would result in the generation and compilation of UnicodeLocale.cpp taking about 30s on my machine. With this change, it takes about half that. Additionally, the size of the generated file reduces by about 1.5MB.	2021-08-27 12:32:24 +01:00
Timothy Flynn	137e98cb6f	LibUnicode: Add public accessors to generated locale data	2021-08-26 22:04:09 +01:00
Timothy Flynn	b7a95cba65	LibUnicode: Implement grammar validators for Unicode TR-35 ECMA-402 requires validating user input against the EBNF grammar for Unicode locales described in TR-35: https://www.unicode.org/reports/tr35 This commit adds validators for that grammar, as well as other helpers to e.g. canonicalize a locale string.	2021-08-26 22:04:09 +01:00

37 commits
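
The rule described in commit 4f2bcebe74 (a keyword value of "true" is dropped only when it is the entire value) can be sketched roughly as follows. This is a minimal illustration using the standard library, not LibUnicode's actual code; the function name canonical_keyword is hypothetical, and the value is assumed to be lowercased already.

```cpp
#include <string>

// Hypothetical helper, not LibUnicode's API. Joins a Unicode extension
// keyword with its value, where the value is kept as one whole string
// (e.g. "abc-def") rather than a list of segments.
std::string canonical_keyword(std::string const& key, std::string const& value)
{
    // "true" is dropped only when it is the entire value, so "kb-true"
    // becomes "kb" while "kb-abc-true" is left alone.
    if (value.empty() || value == "true")
        return key;

    return key + "-" + value;
}
```

Because the comparison is made against the whole value, canonical_keyword("kb", "abc-true") keeps its trailing "true", which is the case the per-segment approach got wrong.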
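
To illustrate how list pattern data like that generated in commit 3f64a14e06 might be consumed, here is a rough sketch of joining a list with CLDR-style "start", "middle", "end" and "pair" patterns. The struct and function names are hypothetical and do not reflect LibUnicode's generated API; the placeholders {0} and {1} follow the CLDR pattern convention.

```cpp
#include <cstddef>
#include <string>
#include <vector>

// Hypothetical container for one locale's list patterns.
struct ListPatterns {
    std::string start;
    std::string middle;
    std::string end;
    std::string pair;
};

// Substitutes "{0}" and "{1}" in the pattern in a single left-to-right pass,
// so placeholder-like text inside the arguments is never re-expanded.
static std::string apply_pattern(std::string const& pattern, std::string const& first, std::string const& second)
{
    std::string result;
    for (std::size_t i = 0; i < pattern.size();) {
        if (pattern.compare(i, 3, "{0}") == 0) {
            result += first;
            i += 3;
        } else if (pattern.compare(i, 3, "{1}") == 0) {
            result += second;
            i += 3;
        } else {
            result += pattern[i++];
        }
    }
    return result;
}

std::string format_list(ListPatterns const& patterns, std::vector<std::string> const& items)
{
    if (items.empty())
        return {};
    if (items.size() == 1)
        return items[0];
    if (items.size() == 2)
        return apply_pattern(patterns.pair, items[0], items[1]);

    // Build from the tail: the last two items use "end", the interior items
    // use "middle", and the first item uses "start".
    std::size_t count = items.size();
    std::string result = apply_pattern(patterns.end, items[count - 2], items[count - 1]);
    for (std::size_t i = count - 2; i-- > 1;)
        result = apply_pattern(patterns.middle, items[i], result);
    return apply_pattern(patterns.start, items[0], result);
}
```

With English patterns (start and middle "{0}, {1}", end "{0}, and {1}", pair "{0} and {1}"), the list ["a", "b", "c"] formats as "a, b, and c", matching the example in the commit message.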