0ct0pu5/ladybird

Author	SHA1	Message	Date
Timothy Flynn	3b7f5af042	LibUnicode: Generate primary and secondary number grouping sizes Most locales have a single grouping size (the number of integer digits to be written before inserting a grouping separator). However, some have a primary and secondary size. We parse the primary size as the size used for the least significant integer digits, and the secondary size for the most significant.	2021-11-14 10:35:19 +00:00
Timothy Flynn	feb8c22a62	LibUnicode: Parse and generate standard percentage formatting rules	2021-11-12 09:17:08 +00:00
Timothy Flynn	604a596c90	LibUnicode: Parse and generate compact decimal formatting rules	2021-11-12 09:17:08 +00:00
Timothy Flynn	e6334cb856	LibUnicode: Add some data related to currency codes This data is published under ISO-4217 as an XML file. Since we can't parse XML files yet, and the data isn't very large, it was translated to C++ manually here.	2021-09-11 11:05:50 +01:00
Timothy Flynn	3f64a14e06	LibUnicode: Parse and generate the Unicode locale list patterns dataset This data informs consumers how to join lists of values. For example, in en-US, the list ["a", "b", "c"] formatted to a string should become "a, b, and c".	2021-09-06 23:49:56 +01:00
Timothy Flynn	113bf4a9dd	LibUnicode: Add missing structures to forwarding header	2021-09-02 17:56:42 +01:00
Timothy Flynn	9ae7ac4c87	LibUnicode: Generate complex Unicode locale alias matching Most alias substitutions are "simple", meaning that alias matching is done by examining a single locale subtag. However, there are a handful of "complex" aliases where matching is done by examining multiple subtags. For example, the variant subtag "lojban" causes the locale "art-lojban" to be canonicalized to "jbo", but only when the language subtag is "art" (i.e. this should not occur for the locale "en-lojban"). This generates a method to perform complex alias matching.	2021-09-01 14:14:47 +01:00
Timothy Flynn	6719e5cb17	LibUnicode: Generate locale subtag data as multiple smaller tables This commit is preemptive to upcoming commits which add more subtags to the CLDR generator. Rather than generating a giant HashMap containing all data, generate more (smaller) Array-based tables. This mimics the UCD generator. This also allows simpler lookups at runtime since we can generate index-based lookups into the smaller tables rather easily. Without this change, adding the remaining locale subtags would result in the generation and compilation of UnicodeLocale.cpp taking about 30s on my machine. With this change, it takes about half that. Additionally, the size of the generated file reduces by about 1.5MB.	2021-08-27 12:32:24 +01:00
Timothy Flynn	b8ad4d302e	LibUnicode: Move Locale enumeration from generated UCD data to CLDR data The UCD set of data contained a very small subset of all locales just to handle some special casing rules. This enumeration will be needed within the CLDR generator as well. So rather than duplicate the enum, remove it from the UCD generator in favor of the full list of locales known by the CLDR generator.	2021-08-27 12:32:24 +01:00
Timothy Flynn	ea21573ed8	LibUnicode: Download Unicode's CLDR database and generate locale data The Unicode standard publishes a database known as the Common Locale Data Repository (CLDR). This is a massive set of data from which anyone implementing Unicode's Technical Standard #35 may generate their implementation: https://www.unicode.org/reports/tr35/ This commit updates LibUnicode to download the compressed database and extract a small subset. That subset is used to generate a list of available locales and the territories (AKA regions) associated with each locale.	2021-08-26 22:04:09 +01:00
Timothy Flynn	5ac23d244d	LibUnicode: Generate separate tables for Unicode properties Similar to General Categories, this generates separate tables for the Property list.	2021-08-11 13:11:01 +02:00
Timothy Flynn	7dce2bfe23	LibUnicode: Generate separate tables for General Category properties Previously, each code point's General Category was part of the generated UnicodeData structure. This ultimately presented two problems, one functional and one performance related: * Some General Categories are applied to unassigned code points, for example the Unassigned (Cn) category. Unassigned code points are strictly excluded from UnicodeData.txt, so by relying on that file, the generator is unable to handle these categories. * Lookups for General Categories are slower when searching through the large UnicodeData hash map. Even though lookups are O(1), the hash function turned out to be slower than binary searching through a category-specific table. So, now a table is generated for each General Category. When querying a code point for a category, a binary search is done on each code point range in that category's table to check if the code point has that category. Further, General Categories are now parsed from the UCD file DerivedGeneralCategory.txt. This file is a normal "prop list" file and contains the categories for unassigned code points.	2021-08-11 13:11:01 +02:00
Timothy Flynn	f5c1bbc00b	LibUnicode: Parse UCD Scripts.txt and generate as a Unicode property There are a couple of minor nuances with parsing script values, compared to other properties. In Scripts.txt, the UCD file lists the full name of each script; other properties, like General Category, list the shorter name in their primary files. This means that the aliases listed in PropertyValueAliases.txt are reversed for script values.	2021-08-04 13:50:32 +01:00
Timothy Flynn	16e86ae743	LibUnicode: Generate General Category unions and aliases This downloads the PropertyValueAliases.txt UCD file, which contains a set of General Category aliases. This changes the General Category enumeration to now be generated as a bitmask. This is to easily allow General Category unions. For example, the LC (Cased_Letter) category is the union of the Ll, Lu, and Lt categories.	2021-08-02 21:02:09 +04:30
Timothy Flynn	f1809db994	LibUnicode: Add public methods to compare and lookup Unicode properties Adds methods to retrieve a Unicode property from a string and to check if a code point matches a Unicode property. Also adds a <LibUnicode/Forward.h> header.	2021-07-30 21:26:31 +01:00

15 commits
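
The primary and secondary grouping sizes described in 3b7f5af042 can be illustrated with a minimal sketch. The function name group_digits, its parameters, and the default comma separator are hypothetical and are not LibUnicode's API; the sketch assumes the input is a plain string of integer digits.

```cpp
#include <cstddef>
#include <string>

// Hypothetical helper, not part of LibUnicode. Inserts a separator into a
// string of integer digits. primary_size applies to the least significant
// group; secondary_size applies to every group to its left.
std::string group_digits(std::string const& digits, std::size_t primary_size, std::size_t secondary_size, char separator = ',')
{
    // Nothing to group if the sizes are unusable or the number is too short.
    if (primary_size == 0 || secondary_size == 0 || digits.size() <= primary_size)
        return digits;

    // The least significant group always uses the primary size.
    std::size_t split = digits.size() - primary_size;
    std::string result = digits.substr(split);

    // Walk leftwards, taking secondary-sized groups from the remaining digits.
    while (split > 0) {
        std::size_t take = split < secondary_size ? split : secondary_size;
        split -= take;
        result = digits.substr(split, take) + separator + result;
    }

    return result;
}
```

With a single grouping size of 3 (primary and secondary equal), "1234567" becomes "1,234,567". With a primary size of 3 and a secondary size of 2, the same digits become "12,34,567".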
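
The per-category range tables and binary search described in 7dce2bfe23 might look roughly like the following sketch. The CodePointRange type, the table name, and the lookup function are assumptions made for illustration; the real generated tables are far larger and may be laid out differently. The three ranges shown are a small excerpt of the Uppercase_Letter (Lu) category.

```cpp
#include <algorithm>
#include <array>
#include <cstddef>
#include <cstdint>

// Hypothetical range type for illustration; both ends are inclusive.
struct CodePointRange {
    std::uint32_t first;
    std::uint32_t last;
};

// A tiny excerpt of Lu ranges, sorted and non-overlapping.
constexpr std::array<CodePointRange, 3> s_uppercase_letter_ranges { {
    { 0x0041, 0x005A }, // A-Z
    { 0x00C0, 0x00D6 },
    { 0x00D8, 0x00DE },
} };

template<std::size_t N>
bool code_point_in_ranges(std::array<CodePointRange, N> const& ranges, std::uint32_t code_point)
{
    // Binary search for the first range that ends at or after the code point.
    auto it = std::lower_bound(ranges.begin(), ranges.end(), code_point,
        [](CodePointRange const& range, std::uint32_t value) { return range.last < value; });

    // The code point has the category only if that range also starts at or before it.
    return it != ranges.end() && it->first <= code_point;
}
```

Because the table only stores ranges, a category that covers unassigned code points (such as Cn) needs no per-code-point entries; it only needs its ranges listed.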
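
The bitmask approach to General Category unions from 16e86ae743 can be sketched as follows. The enumerator values and the has_category helper are hypothetical, only a handful of categories are shown, and the generated enumeration in the repository may differ.

```cpp
#include <cstdint>

// Hypothetical subset of General Categories, one bit per category.
enum class GeneralCategory : std::uint32_t {
    Lu = 1u << 0, // Uppercase_Letter
    Ll = 1u << 1, // Lowercase_Letter
    Lt = 1u << 2, // Titlecase_Letter
    Nd = 1u << 3, // Decimal_Number

    // A union is the bitwise OR of its member categories.
    LC = Lu | Ll | Lt, // Cased_Letter
};

// Returns true if the code point's category is the queried category or a
// member of the queried union.
constexpr bool has_category(GeneralCategory actual, GeneralCategory queried)
{
    return (static_cast<std::uint32_t>(actual) & static_cast<std::uint32_t>(queried)) != 0;
}
```

Under this layout, a code point categorized as Ll matches a query for LC with a single bitwise AND, while a code point categorized as Nd does not.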