This is preparatory work to read locale extensions. The parser currently
enforces that the entire string is consumed. But to parse extensions,
parse_unicode_locale_id() will need parse_unicode_language_id() to just
stop parsing on the first segment that does not match the language ID
grammar. It will also need to know where the parsing stopped. Both of
these needs are fulfilled by GenericLexer.
The caveat is that we can no longer simply split the parsed string on
separator characters. So parse_unicode_language_id() now operates as a
small state machine.
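
As a rough illustration of the approach, here is a minimal standalone sketch of
a prefix parser that walks the input segment by segment, stops at the first
segment that does not fit the language ID grammar, and reports the offset where
it stopped. It uses std::string_view in place of GenericLexer, and every name in
it (LanguageID, ParseResult, parse_language_id_prefix) is hypothetical rather
than the actual LibUnicode API. It also handles only the common
language[-script][-region][-variant]* form, not "root" or script-first IDs.

```cpp
#include <cstddef>
#include <optional>
#include <string>
#include <string_view>
#include <vector>

// Hypothetical result types; not the real LibUnicode structures.
struct LanguageID {
    std::string language;
    std::optional<std::string> script;
    std::optional<std::string> region;
    std::vector<std::string> variants;
};

struct ParseResult {
    LanguageID id;
    std::size_t consumed; // Offset into the input where parsing stopped.
};

static bool is_ascii_alpha(char c) { return (c >= 'a' && c <= 'z') || (c >= 'A' && c <= 'Z'); }
static bool is_ascii_digit(char c) { return c >= '0' && c <= '9'; }

static bool all_alpha(std::string_view s)
{
    for (char c : s)
        if (!is_ascii_alpha(c))
            return false;
    return true;
}

static bool all_digit(std::string_view s)
{
    for (char c : s)
        if (!is_ascii_digit(c))
            return false;
    return true;
}

static bool all_alnum(std::string_view s)
{
    for (char c : s)
        if (!is_ascii_alpha(c) && !is_ascii_digit(c))
            return false;
    return true;
}

// Subtag shapes as given by the TR-35 grammar.
static bool is_language(std::string_view s) { return all_alpha(s) && ((s.size() >= 2 && s.size() <= 3) || (s.size() >= 5 && s.size() <= 8)); }
static bool is_script(std::string_view s) { return s.size() == 4 && all_alpha(s); }
static bool is_region(std::string_view s) { return (s.size() == 2 && all_alpha(s)) || (s.size() == 3 && all_digit(s)); }
static bool is_variant(std::string_view s)
{
    if (s.size() >= 5 && s.size() <= 8)
        return all_alnum(s);
    return s.size() == 4 && is_ascii_digit(s[0]) && all_alnum(s);
}

std::optional<ParseResult> parse_language_id_prefix(std::string_view input)
{
    enum class State { Language, Script, Region, Variant };
    State state = State::Language;
    ParseResult result {};
    std::size_t pos = 0;

    while (pos < input.size()) {
        // After the first segment, pos always rests on a separator.
        std::size_t start = (state == State::Language) ? pos : pos + 1;
        std::size_t end = input.find_first_of("-_", start);
        if (end == std::string_view::npos)
            end = input.size();
        std::string_view segment = input.substr(start, end - start);

        bool matched = false;
        switch (state) {
        case State::Language:
            if (!is_language(segment))
                return std::nullopt; // A language subtag is mandatory here.
            result.id.language = std::string(segment);
            state = State::Script;
            matched = true;
            break;
        case State::Script:
            if (is_script(segment)) {
                result.id.script = std::string(segment);
                state = State::Region;
                matched = true;
                break;
            }
            [[fallthrough]]; // Script is optional; try the segment as a region.
        case State::Region:
            if (is_region(segment)) {
                result.id.region = std::string(segment);
                state = State::Variant;
                matched = true;
                break;
            }
            [[fallthrough]]; // Region is optional; try the segment as a variant.
        case State::Variant:
            if (is_variant(segment)) {
                result.id.variants.emplace_back(segment);
                state = State::Variant;
                matched = true;
            }
            break;
        }

        if (!matched)
            break; // Leave pos on the separator before the unmatched segment.
        pos = end;
    }

    if (state == State::Language)
        return std::nullopt; // Empty input.
    result.consumed = pos;
    return result;
}
```

With this sketch, parsing "en-Latn-US-u-ca-gregory" yields language "en",
script "Latn", region "US", and a stop offset pointing at the separator before
"u", which is where an extension parser would pick up.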

This commit prepares for upcoming commits that add more subtags to
the CLDR generator. Rather than generating a giant HashMap containing
all data, generate more (smaller) Array-based tables. This mimics the
UCD generator. This also allows simpler lookups at runtime since we can
generate index-based lookups into the smaller tables rather easily.
Without this change, adding the remaining locale subtags would result
in the generation and compilation of UnicodeLocale.cpp taking about 30s
on my machine. With this change, it takes about half that. Additionally,
the generated file shrinks by about 1.5 MB.
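
To give an idea of the shape of the generated output, here is a minimal sketch
of per-locale array tables with index-based lookup. The enum names, table names
and the two-locale data set are made up for illustration and use std::array
rather than AK's Array; the real generated tables are much larger and may be
laid out differently.

```cpp
#include <array>
#include <cstddef>
#include <string_view>

// Hypothetical generated enums; enumerator values double as table indices.
enum class Locale : std::size_t { De, En, Count };
enum class Territory : std::size_t { DE, FR, Count };

template<typename E>
constexpr std::size_t to_index(E value)
{
    return static_cast<std::size_t>(value);
}

using TerritoryNames = std::array<std::string_view, to_index(Territory::Count)>;

// One small table per locale, instead of one large map keyed by strings.
static constexpr TerritoryNames s_territories_de { "Deutschland", "Frankreich" };
static constexpr TerritoryNames s_territories_en { "Germany", "France" };

// Top-level table mapping a locale index to that locale's table.
static constexpr std::array<TerritoryNames const*, to_index(Locale::Count)> s_territories {
    &s_territories_de,
    &s_territories_en,
};

// Lookup is two array indexing operations; no hashing or string comparison.
// Callers must not pass the Count sentinels.
constexpr std::string_view territory_display_name(Locale locale, Territory territory)
{
    return (*s_territories[to_index(locale)])[to_index(territory)];
}
```

Because the tables are constexpr arrays of string views, the compiler has far
less work to do than when constructing nested hash maps at startup, which is
consistent with the compile time difference noted above.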

ECMA-402 requires validating user input against the EBNF grammar for
Unicode locales described in TR-35: https://www.unicode.org/reports/tr35
This commit adds validators for that grammar, as well as other helpers,
e.g. to canonicalize a locale string.
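
As an illustration of one small part of canonicalization, here is a sketch
that normalizes only the casing and separators of an already-validated language
ID: lowercase language and variants, title-case script, uppercase region, and
"-" as the separator. The function name is hypothetical, it guesses the subtag
kind from position and length alone, and it deliberately ignores extensions and
alias replacement, so it is not a full canonicalizer.

```cpp
#include <cstddef>
#include <string>
#include <string_view>

static char ascii_to_lower(char c) { return (c >= 'A' && c <= 'Z') ? static_cast<char>(c - 'A' + 'a') : c; }
static char ascii_to_upper(char c) { return (c >= 'a' && c <= 'z') ? static_cast<char>(c - 'a' + 'A') : c; }

// Assumes `input` already passed validation as a language ID without extensions.
std::string normalize_language_id_case(std::string_view input)
{
    std::string out;
    out.reserve(input.size());

    std::size_t start = 0;
    bool first = true;
    while (start <= input.size()) {
        std::size_t end = input.find_first_of("-_", start);
        if (end == std::string_view::npos)
            end = input.size();
        std::string_view segment = input.substr(start, end - start);

        if (!first)
            out.push_back('-');

        for (std::size_t i = 0; i < segment.size(); ++i) {
            char c = segment[i];
            if (!first && segment.size() == 2)
                out.push_back(ascii_to_upper(c)); // Two-letter region, e.g. "US".
            else if (!first && segment.size() == 4 && i == 0)
                out.push_back(ascii_to_upper(c)); // Script initial, e.g. "Latn"; digits are unchanged.
            else
                out.push_back(ascii_to_lower(c)); // Language, variants, rest of script.
        }

        first = false;
        start = end + 1;
    }
    return out;
}
```

For example, "EN_latn_us" becomes "en-Latn-US", while "es-419" and "de-1996"
come back unchanged.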