0ct0pu5/ladybird

Author	SHA1	Message	Date
Timothy Flynn	c24a350a18	LibUnicode: Ignore U+200F when parsing format identifiers Noticed this while implementing multiple identifier support. We were errantly parsing U+200F as a lone identifier in some Hebrew formats.	2021-11-16 23:14:09 +00:00
Timothy Flynn	04b8b87c17	LibJS+LibUnicode: Support multiple identifiers within format pattern This wasn't the case for compact patterns, but unit patterns can contain multiple (up to 2, really) identifiers that must each be recognized by LibJS. Each generated NumberFormat object now stores an array of identifiers parsed. The format pattern itself is encoded with the index into this array for that identifier, e.g. the compact format string "0K" will become "{number}{compactIdentifier:0}".	2021-11-16 23:14:09 +00:00
Timothy Flynn	3b68370212	LibJS+LibUnicode: Rename the generated compact_identifier to identifier This field is currently used to store the StringView into the compact name/symbol in the format string. Units will need to store a similar field, so rename the field to be more generic, and extract the parser for it.	2021-11-16 23:14:09 +00:00
Timothy Flynn	1f546476d5	LibJS+LibUnicode: Fix computation of compact pattern exponents The compact scale of each formatting rule was precomputed in commit: `be69eae651` Using the formula: compact scale = magnitude - pattern scale This computation was off-by-one. For example, consider the format key "10000-count-one", which maps to "00 thousand" in en-US. What we are really after is the exponent that best represents the string "thousand" for values greater than 10000 and less than 100000 (the next format key). We were previously doing: log10(10000) - "00 thousand".count("0") = 2 Which clearly isn't what we want. Instead, if we do: log10(10000) + 1 - "00 thousand".count("0") = 3 We get the correct exponent for each format key for each locale. This commit also renames the generated variable from "compact_scale" to "exponent" to match the terminology used in ECMA-402.	2021-11-16 00:56:55 +00:00
Timothy Flynn	48d5684780	LibUnicode: Parse compact identifiers and replace them with a format key For example, in en-US, the decimal, long compact pattern for numbers between 10,000 and 100,000 is "00 thousand". In that pattern, "thousand" is the compact identifier, and the generated format pattern is now "{number} {compactIdentifier}". This also generates that identifier as its own field in the NumberFormat structure.	2021-11-16 00:56:55 +00:00
Timothy Flynn	30fbb7d9cd	LibUnicode: Parse and generate scientific formatting rules	2021-11-14 17:00:35 +00:00
Timothy Flynn	3645f6a0fc	LibUnicode: Fix typo in percent format parser Just by sheer luck this had no actual effect because the decimal format prefix has the same length as the percent format prefix.	2021-11-14 17:00:35 +00:00
Timothy Flynn	3b7f5af042	LibUnicode: Generate primary and secondary number grouping sizes Most locales have a single grouping size (the number of integer digits to be written before inserting a grouping separator). However some have a primary and secondary size. We parse the primary size as the size used for the least significant integer digits, and the secondary size for the most significant.	2021-11-14 10:35:19 +00:00
Timothy Flynn	c65dea64bd	LibJS+LibUnicode: Don't remove {currency} keys in GetNumberFormatPattern In order to implement Intl.NumberFormat.prototype.formatToParts, do not replace {currency} keys in the format pattern before ECMA-402 tells us to. Otherwise, the array returned by formatToParts will not contain the expected currency key. Early replacement was done to avoid resolving the currency display more than once, as it involves a couple of round trips to search through LibUnicode data. So this adds a non-standard method to NumberFormat to do this resolution and cache the result. Another side effect of this change is that LibUnicode must replace unit format patterns of the form "{0} {1}" during code generation. These were previously skipped during code generation because LibJS would just replace the keys with the currency display at runtime. But now that the currency display injection is delayed, any {0} or {1} keys in the format pattern will cause PartitionNumberPattern to abort.	2021-11-13 19:01:25 +00:00
Timothy Flynn	a701ed52fc	LibJS+LibUnicode: Fully implement currency number formatting Currencies are a bit strange; the layout of currency data in the CLDR is not particularly compatible with what ECMA-402 expects. For example, the currency format in the "en" and "ar" locales for the Latin script are: en: "¤#,##0.00" ar: "¤\u00A0#,##0.00" Note how the "ar" locale has a non-breaking space after the currency symbol (¤), but "en" does not. This does not mean that this space will appear in the "ar"-formatted string, nor does it mean that a space won't appear in the "en"-formatted string. This is a runtime decision based on the currency display chosen by the user ("$" vs. "USD" vs. "US dollar") and other rules in the Unicode TR-35 spec. ECMA-402 shies away from the nuances here with "implementation-defined" steps. LibUnicode will store the data parsed from the CLDR however it is presented; making decisions about spacing, etc. will occur at runtime based on user input.	2021-11-13 11:52:45 +00:00
Timothy Flynn	e9493a2cd5	LibUnicode: Ensure UnicodeNumberFormat is aware of default content For example, there isn't a unique set of data for the en-US locale; rather, it defaults to the data for the en locale. See this commit for much more detail: `357c97dfa8`	2021-11-13 11:52:45 +00:00
Timothy Flynn	9421d5c0cf	LibUnicode: Generate currency unit-pattern number formats These are used when formatting a number as currency with a display option of "name" (e.g. for USD, the name is "US Dollars" in en-US). These patterns appear in the CLDR in a different manner than other number formats that are pluralized. They are of the form "{0} {1}", therefore do not undergo subpattern replacements.	2021-11-13 11:52:45 +00:00
Timothy Flynn	39e031c4dd	LibJS+LibUnicode: Generate all styles of currency localizations Currently, LibUnicode is only parsing and generating the "long" style of currency display names. However, the CLDR contains "short" and "narrow" forms as well that need to be handled. Parse these, and update LibJS to actually respect the "style" option provided by the user for displaying currencies with Intl.DisplayNames. Note: There are some discrepancies between the engines on how style is handled. In particular, running: new Intl.DisplayNames('en', {type:'currency', style:'narrow'}).of('usd') Gives: SpiderMonkey: "USD" V8: "US Dollar" LibJS: "$" And running: new Intl.DisplayNames('en', {type:'currency', style:'short'}).of('usd') Gives: SpiderMonkey: "$" V8: "US Dollar" LibJS: "$" My best guess is V8 isn't handling style, and just returning the long form (which is what LibJS did before this commit). And SpiderMonkey can handle some styles, but if they don't have a value for the requested style, they fall back to the canonicalized code passed into of().	2021-11-13 11:52:45 +00:00
Timothy Flynn	6cfd63e5bd	LibUnicode: Parse numbers in number formats a bit more leniently The parser was previously expecting number sections within a pattern to start with "#", but they may also begin with "0".	2021-11-13 11:52:45 +00:00
Timothy Flynn	1f2ac0ab41	LibUnicode: Move number formatting code generator to UnicodeNumberFormat	2021-11-12 20:46:38 +00:00
Timothy Flynn	04e6b43f05	LibUnicode: Move (soon-to-be) common code out of GenerateUnicodeLocale The data used for number formatting is going to grow quite a bit when the cldr-units package is parsed. To prevent the generated UnicodeLocale file from growing outrageously large, the number formatting data can go into its own file. To prepare for this, move code that will be common between the generators for UnicodeLocale and UnicodeNumberFormat to the utility header.	2021-11-12 20:46:38 +00:00
Timothy Flynn	be69eae651	LibUnicode: Precompute the compact scale of each number formatting rule This will be needed for the ComputeExponentForMagnitude AO for compact formatting, namely step 5b: Let exponent be an implementation- and locale-dependent (ILD) integer by which to scale a number of the given magnitude in compact notation for the current locale.	2021-11-12 09:17:08 +00:00
Timothy Flynn	230b133ee3	LibUnicode: Parse number formats into zero/positive/negative patterns A number formatting pattern in the CLDR contains one or two entries, delimited by a semi-colon. Previously, LibUnicode was just storing the entire pattern as one string. This changes the generator to split the pattern on that delimiter and generate the 3 unique patterns expected by ECMA-402. The rules for generating the 3 patterns are as follows: * If the pattern contains 1 entry, it is the zero pattern. The positive pattern is the zero pattern prepended with {plusSign}. The negative pattern is the zero pattern prepended with {minusSign}. * If the pattern contains 2 entries, the first is the zero pattern, and the second is the negative pattern. The positive pattern is the zero pattern prepended with {plusSign}.	2021-11-12 09:17:08 +00:00
Timothy Flynn	1244ebcd4f	LibUnicode: Parse and generate standard accounting formatting rules Also known as "currency-accounting" in some CLDR documentation.	2021-11-12 09:17:08 +00:00
Timothy Flynn	967afc1b84	LibUnicode: Parse and generate standard currency formatting rules	2021-11-12 09:17:08 +00:00
Timothy Flynn	bffd73e0d4	LibUnicode: Parse and generate standard decimal formatting rules	2021-11-12 09:17:08 +00:00
Timothy Flynn	feb8c22a62	LibUnicode: Parse and generate standard percentage formatting rules	2021-11-12 09:17:08 +00:00
Timothy Flynn	4317a1b552	LibUnicode: Parse and generate compact currency formatting rules	2021-11-12 09:17:08 +00:00
Timothy Flynn	604a596c90	LibUnicode: Parse and generate compact decimal formatting rules	2021-11-12 09:17:08 +00:00
Timothy Flynn	12b468a588	LibUnicode: Begin parsing and generating locale number systems The number system data in the CLDR contains information on how to format numbers in a locale-dependent manner. Start parsing this data, beginning with numeric symbol strings. For example the symbol NaN maps to "NaN" in the en-US locale, and "非數值" in the zh-Hant locale.	2021-11-12 09:17:08 +00:00
Timothy Flynn	d3e83c9934	LibUnicode: Parse alternate default numbering systems Some locales in the CLDR have alternate default numbering systems listed under "defaultNumberingSystem-alt-*", e.g.: "defaultNumberingSystem": "arab", "defaultNumberingSystem-alt-latn": "latn", "otherNumberingSystems": { "native": "arab" }, We were previously only parsing "defaultNumberingSystem" and "otherNumberingSystems". This odd format appears to be an artifact of converting from XML.	2021-11-12 09:17:08 +00:00
Timothy Flynn	ae66188d43	LibUnicode: Capitalize generated identifiers in lieu of full title case This isn't particularly important because this generates code that is quite hidden from outside callers. But when viewing the generated code, it's a bit nicer to read e.g. enum identifiers such as "MinusSign" rather than "Minussign".	2021-11-12 09:17:08 +00:00
Andreas Kling	8b1108e485	Everywhere: Pass AK::StringView by value	2021-11-11 01:27:46 +01:00
Timothy Flynn	357c97dfa8	LibUnicode: Parse the CLDR's defaultContent.json locale list This file contains the list of locales which default to their parent locale's values. In the core CLDR dataset, these locales have their own files, but they are empty (except for identity data). For example: https://github.com/unicode-org/cldr/blob/main/common/main/en_US.xml In the JSON export, these files are excluded, so we currently are not recognizing these locales just by iterating the locale files. This is a prerequisite for upgrading to CLDR version 40. One of these default-content locales is the popular "en-US" locale, which defaults to "en" values. We were previously inferring the existence of this locale from the "en-US-POSIX" locale (many implementations, including ours, strip variants such as POSIX). However, v40 removes the "en-US-POSIX" locale entirely, meaning that without this change, we wouldn't know that "en-US" exists (we would default to "en"). For more detail on this and other v40 changes, see: https://cldr.unicode.org/index/downloads/cldr-40#h.nssoo2lq3cba	2021-11-09 20:44:52 +01:00
Timothy Flynn	3ad159537e	LibUnicode: Use u16 for unique string indices instead of size_t Typically size_t is used for indices, but we can take advantage of the knowledge that there are only approximately 46K unique strings in the generated UnicodeLocale.cpp file. Therefore, we can get away with using u16 to store indices. There is a VERIFY that will fail if we ever exceed the limits of u16. On x86_64 builds, this reduces libunicode.so from 9.2 MiB to 7.3 MiB. On i686 builds, this reduces libunicode.so from 3.9 MiB to 3.3 MiB. These savings are entirely in the .rodata section of the shared library.	2021-10-15 00:06:18 +01:00
Timothy Flynn	f91d63af83	LibUnicode: Generate enum/alias from-string methods without a HashMap The _from_string() and resolve__alias() generated methods are the last remaining users of HashMap in the LibUnicode generated files (read: the last methods not using compile-time structures). This converts these methods to use an array containing pairs of hash values to the desired lookup value. Because this code generation is the same between GenerateUnicodeData.cpp and GenerateUnicodeLocale.cpp, this adds a GeneratorUtil.h header to the LibUnicode generators to contain the method that generates the methods.	2021-10-13 16:38:51 +02:00
Timothy Flynn	597379e864	LibUnicode: Generate and use unique locale-related alias strings Almost all of these are already in the unique string list.	2021-10-10 22:21:48 +02:00
Timothy Flynn	acb7bd917f	LibUnicode: Generate and use unique subtag and complex alias strings	2021-10-10 22:21:48 +02:00
Timothy Flynn	3d67f6bd29	LibUnicode: Generate and use unique list-format strings The list-format strings used for Intl.ListFormat are small, but quite heavily duplicated. For example, the string "{0}, {1}" appears 6,519 times. Generate unique strings for this data to avoid duplication.	2021-10-10 22:21:48 +02:00
Timothy Flynn	f9e605397c	LibUnicode: Generate and use a set of unique locale-related strings In the generated UnicodeLocale.cpp file, there are 296,408 strings for localizations of languages, territories, scripts, currencies & keywords. Of these, only 43,848 (14.8%) are actually unique, so there are quite a large number of duplicated strings. This generates a single compile-time array to store these strings. The arrays for the localizations now store an index into this single array rather than duplicating any strings.	2021-10-10 22:21:48 +02:00
Timothy Flynn	3f0095b57a	LibUnicode: Skip unknown languages and territories Some CLDR languages.json / territories.json files contain localizations for some languages/territories that are otherwise not present in the CLDR database. We already don't generate anything in UnicodeLocale.cpp for these anomalies, but this will stop us from even storing that data in the generator's memory. This doesn't affect the output of the generator, but will have an effect after an upcoming commit to unique-ify all of the strings in the CLDR.	2021-10-10 22:21:48 +02:00
Timothy Flynn	79707d83d3	LibUnicode: Stop generating large UnicodeData hash map The data in this hash map is now available by way of much smaller arrays and is now unused.	2021-10-10 13:49:37 +02:00
Timothy Flynn	d83b262e64	LibUnicode: Generate standalone compile-time array for combining class	2021-10-10 13:49:37 +02:00
Timothy Flynn	9f83774913	LibUnicode: Generate standalone compile-time array for special casing There are only 112 code points with special casing rules, so this array is quite small (compared to the size 34,626 UnicodeData hash map that is also storing this data). Removing all casing rules from UnicodeData will happen in a subsequent commit.	2021-10-10 13:49:37 +02:00
Timothy Flynn	da4b8897a7	LibUnicode: Generate standalone compile-time arrays for simple casing Currently, all casing information (simple and special) is stored in a compile-time array of size 34,626, then statically copied to a hash map at runtime. In an effort to reduce the resulting memory usage, store the simple casing rules in standalone compile-time arrays. The uppercase map is size 1,450 and the lowercase map is size 1,433. Any code point not in a map will implicitly have an identity mapping.	2021-10-10 13:49:37 +02:00
Nico Weber	9ec9886b04	Meta: Fix typos	2021-10-01 01:06:40 +01:00
Timothy Flynn	c8dbcdb0bc	LibUnicode: Do not compare generated file contents before writing This is now covered by unicode_data.cmake after the superbuild changes.	2021-09-30 17:37:57 +01:00
Andrew Kaster	a6d83e02d2	Meta: Define and use lagom_tool() CMake helper function for all Tools We'll use this to prevent repeating common tool dependencies. They all depend on LibCore and AK only. We also want to encapsulate common install rules for them.	2021-09-15 19:04:52 +04:30
Idan Horowitz	6704961c82	AK: Replace the mutable String::replace API with an immutable version This removes the awkward String::replace API, which was the only String API that mutated the String, and replaces it with a new immutable version that returns a new String with the replacements applied. This also fixes a couple of UAFs that were caused by the use of this API. As an optimization, an equivalent StringView::replace API was also added to remove unnecessary String allocations of the form: `String { view }.replace(...);`	2021-09-11 20:36:43 +03:00
Timothy Flynn	b1d4bcf364	LibUnicode: Generate numeric keyword values for each locale This is needed for Intl.NumberFormat's usage of the ResolveLocale AO, where the [[RelevantExtensionKeys]] internal slot will be "nu".	2021-09-11 11:05:50 +01:00
Timothy Flynn	32a2a02489	LibUnicode: Fix typo in listPatterns.json parsing method	2021-09-08 21:08:48 +01:00
Timothy Flynn	4ad2159812	LibUnicode: Remove Unicode locale variants from CLDR path names There are only a couple of cases like this, but there are some locale paths in the CLDR that contain variants. For example, there isn't an en-US path, but there is an en-US-POSIX path. This interferes with the operation to search for locales by name. The algorithm is such that searching for en-US will not result in en-US-POSIX being found. To resolve this, we should remove variants from the locale name.	2021-09-06 23:49:56 +01:00
Timothy Flynn	3f64a14e06	LibUnicode: Parse and generate the Unicode locale list patterns dataset This data informs consumers how to join lists of values. For example, in en-US, the list ["a", "b", "c"] formatted to a string should become "a, b, and c".	2021-09-06 23:49:56 +01:00
Timothy Flynn	9cd986d8c0	LibUnicode: Extract cldr-misc dataset from CLDR database	2021-09-06 23:49:56 +01:00
Timothy Flynn	077a693de6	LibUnicode: Sort special casing array by locale specificity This is to simplify the Default Case Conversion implementation. Otherwise, the implementation would need to determine which special casing rule to apply, instead of just picking the first match.	2021-09-06 15:24:27 +01:00
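
Two of the rules described in the log above (the compact exponent formula from `1f546476d5` and the zero/positive/negative pattern split from `230b133ee3`) are easy to misread in prose, so here is a minimal sketch of both. It is written in Python purely for illustration and is not the LibUnicode generator code (which is C++). The function names `compact_exponent` and `split_sign_patterns` are hypothetical, the format key is assumed to begin with a power of ten such as "10000-count-one", and special cases in the CLDR data (and the later replacement of the numeric part with `{number}`) are not handled.

```python
def compact_exponent(format_key: str, pattern: str) -> int:
    """Exponent for a compact format rule, per the corrected formula:
    log10(threshold) + 1 - (number of "0" characters in the pattern).

    Assumption: format_key starts with a power of ten, e.g. "10000-count-one".
    """
    threshold = format_key.split("-", 1)[0]
    if not threshold.isdigit() or threshold.strip("0") != "1":
        raise ValueError(f"expected a power-of-ten format key, got {format_key!r}")
    # For a power of ten, log10 is simply the number of trailing zeros.
    magnitude = len(threshold) - 1
    return magnitude + 1 - pattern.count("0")


def split_sign_patterns(pattern: str) -> dict:
    """Split a CLDR number pattern on ';' into zero/positive/negative patterns.

    One entry:   it is the zero pattern; the positive and negative patterns
                 are the zero pattern prepended with {plusSign} / {minusSign}.
    Two entries: the first is the zero pattern, the second is the negative
                 pattern; the positive pattern is still derived from zero.
    """
    entries = pattern.split(";")
    if len(entries) not in (1, 2):
        raise ValueError(f"expected one or two entries, got {pattern!r}")
    zero = entries[0]
    negative = entries[1] if len(entries) == 2 else "{minusSign}" + zero
    return {"zero": zero, "positive": "{plusSign}" + zero, "negative": negative}


if __name__ == "__main__":
    # The example from the commit message: 4 + 1 - 2 = 3.
    print(compact_exponent("10000-count-one", "00 thousand"))
    print(split_sign_patterns("#,##0.00;(#,##0.00)"))
```

With the old formula (without the `+ 1`), the "10000-count-one" example would yield 2 rather than 3, which is the off-by-one the commit describes.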

58 commits