0ct0pu5/ladybird

Author	SHA1	Message	Date
Timothy Flynn	a98d3a1a85	LibUnicode: Download and parse DerivedNormalizationProps UCD file This file contains the last properties that LibUnicode is not parsing. Much of the data in this file is not currently used; that is left as a FIXME for when String.prototype.normalize is implemented. Until then, only the code point properties are utilized for regular expression pattern escapes.	2021-08-11 13:11:01 +02:00
Timothy Flynn	1e91334008	LibUnicode: Handle edge-case script extensions, Common and Inherited These script extensions have some peculiar behavior in the Unicode spec. The UCD ScriptExtension file does not contain these scripts. Rather, it is implied that the code points which have these scripts as an extension are the code points that both: 1. Have Common or Inherited as their primary script value 2. Do not have any other script value in their script extension lists Because these are not explicitly listed in the UCD, we must manually form these script extensions.	2021-08-11 13:11:01 +02:00
Timothy Flynn	47bb350ebd	LibUnicode: Generate separate tables for scripts and script extensions Notice that unlike the note in populate_general_category_unions(), script extensions do indeed have code point ranges which overlap. Thus, this commit adds code to handle that, and hooks it into the GC unions.	2021-08-11 13:11:01 +02:00
Timothy Flynn	e6e462249f	LibUnicode: Generate *_from_string methods using a hash map Rather than a long series of string comparisons, generate each of these methods using a hash map of the enumeration name to its value.	2021-08-11 13:11:01 +02:00
Timothy Flynn	5ac23d244d	LibUnicode: Generate separate tables for Unicode properties Similar to General Categories, this generates separate tables for the Property list.	2021-08-11 13:11:01 +02:00
Timothy Flynn	b06c104076	LibUnicode: Include Unassigned code points in the Other General Category Now that the generator parses unassigned General Category properties, it can include Unassigned (Cn) in the Other (C) category.	2021-08-11 13:11:01 +02:00
Timothy Flynn	7dce2bfe23	LibUnicode: Generate separate tables for General Category properties Previously, each code point's General Category was part of the generated UnicodeData structure. This ultimately presented two problems, one functional and one performance related: * Some General Categories are applied to unassigned code points, for example the Unassigned (Cn) category. Unassigned code points are strictly excluded from UnicodeData.txt, so by relying on that file, the generator is unable to handle these categories. * Lookups for General Categories are slower when searching through the large UnicodeData hash map. Even though lookups are O(1), the hash function turned out to be slower than binary searching through a category-specific table. So, now a table is generated for each General Category. When querying a code point for a category, a binary search is done on the code point ranges in that category's table to check if the code point has that category. Further, General Categories are now parsed from the UCD file DerivedGeneralCategory.txt. This file is a normal "prop list" file and contains the categories for unassigned code points.	2021-08-11 13:11:01 +02:00
Timothy Flynn	4e546cee97	LibUnicode: Remove WordBreakProperty from generated Unicode data This was originally used for the "is_final_code_point" algorithm in LibUnicode/CharacterTypes.cpp. However, it has since been superseded by DerivedCoreProperties and is now unused. Remove it as it is currently a waste of time to process the data, and is trivial to add back if we need it again.	2021-08-11 13:11:01 +02:00
Timothy Flynn	6f2640d031	LibUnicode: Parse UCD DerivedBinaryProperties.txt and generate property	2021-08-04 13:50:32 +01:00
Timothy Flynn	9113f892a7	LibUnicode: Parse UCD emoji-data.txt and generate Unicode property	2021-08-04 13:50:32 +01:00
Timothy Flynn	5edd458420	LibUnicode: Parse UCD ScriptExtensions.txt and generate property	2021-08-04 13:50:32 +01:00
Timothy Flynn	6bdb19fe21	LibUnicode: Remove unused parameter from Unicode data generator	2021-08-04 13:50:32 +01:00
Timothy Flynn	f5c1bbc00b	LibUnicode: Parse UCD Scripts.txt and generate as a Unicode property There are a couple of minor nuances with parsing script values, compared to other properties. In Scripts.txt, the UCD file lists the full name of each script; other properties, like General Category, list the shorter name in their primary files. This means that the aliases listed in PropertyValueAliases.txt are reversed for script values.	2021-08-04 13:50:32 +01:00
Timothy Flynn	1bb6404a19	LibUnicode: Invoke Unicode data generator a single time It takes a non-negligible amount of time to parse all of the UCD files and generate the Unicode data files. To help compile times, only invoke the generator once.	2021-08-04 11:18:24 +02:00
Timothy Flynn	9413c3a0d1	LibUnicode: Generate a map of code points to their Unicode table index The current strategy of searching for a code point within the generated table is slow for code points > U+0377 (the last code point whose index is the same value as the code point itself). For larger code points, we are doing a linear search through the table. Instead, generate a HashMap of each code point to its entry in the table for faster runtime lookups. This had the added benefit of being able to remove a fair amount of code from the generator. We no longer need to track that last contiguous code point (U+0377) nor each code point's index in the generated table.	2021-08-04 11:18:24 +02:00
Timothy Flynn	5de6d3dd90	LibUnicode: Add public methods to compare and lookup General Categories Adds methods to retrieve a General Category from a string and to check if a code point matches a General Category.	2021-08-02 21:02:09 +04:30
Timothy Flynn	f63287cd63	LibUnicode: Initialize manually created Unicode properties inline Using initializer lists directly in the UnicodeData struct definition feels a bit cleaner than invoking HashMap::set in main().	2021-08-02 21:02:09 +04:30
Timothy Flynn	16e86ae743	LibUnicode: Generate General Category unions and aliases This downloads the PropertyValueAliases.txt UCD file, which contains a set of General Category aliases. This changes the General Category enumeration to now be generated as a bitmask. This is to easily allow General Category unions. For example, the LC (Cased_Letter) category is the union of the Ll, Lu, and Lt categories.	2021-08-02 21:02:09 +04:30
Timothy Flynn	f1809db994	LibUnicode: Add public methods to compare and lookup Unicode properties Adds methods to retrieve a Unicode property from a string and to check if a code point matches a Unicode property. Also adds a <LibUnicode/Forward.h> header.	2021-07-30 21:26:31 +01:00
Timothy Flynn	3f80791ed5	LibUnicode: Manually assign special code point properties The Unicode standard defines a few extra properties that are not defined in any UCD file, so we must assign them manually.	2021-07-30 21:26:31 +01:00
Timothy Flynn	bba3152104	LibUnicode: Parse and generate PropertyAliases These are all used for Unicode property escapes.	2021-07-30 21:26:31 +01:00
Timothy Flynn	761c16d873	LibUnicode: Parse and utilize DerivedCoreProperties DerivedCoreProperties are pseudo-properties that are the union of other categories and properties. For example, the derived property Math is the union of the general category Sm and the property Other_Math. Parsing these is necessary for implementing Unicode property escapes. But it also has the added benefit that LibUnicode now does not need to derive some of these properties at runtime.	2021-07-30 21:26:31 +01:00
Timothy Flynn	4eb4b06688	LibUnicode: Do not replace underscores in property names Originally, this was done to make the generated enums look more like the rest of Serenity's enums. But for Unicode property escapes, LibUnicode will need to compare property names from a RegExp.prototype object to these parsed property names, which will be easier without this modification.	2021-07-30 21:26:31 +01:00
Timothy Flynn	5d09a00189	LibUnicode: Generate PropList enumeration as a bitmask Rather than generating the PropList as a list of enums, generate it as a bitmask. Not only will this be better for runtime property searching, this will allow parsing of the DerivedCoreProperties list more easily.	2021-07-30 21:26:31 +01:00
Timothy Flynn	dff156b7c6	LibUnicode: Reduce Unicode data generator boilerplate There's a fair amount of boilerplate when e.g. adding a new UCD file to parse or a new enumeration to generate. Reduce the overhead by adding helper lambdas. Also adds a couple missing spec links with UCD field information.	2021-07-28 23:42:29 +02:00
Timothy Flynn	12fb3ae033	LibUnicode: Download and parse the word break property list UCD file Note that unlike the main property list, each code point has only one word break property. Code points that do not have a word break property are to be assigned the property "Other".	2021-07-28 23:42:29 +02:00
Timothy Flynn	38adfd8874	LibUnicode: Download and parse the property list UCD file	2021-07-28 23:42:29 +02:00
Timothy Flynn	5b110034dd	LibUnicode: Produce each code point's general category This will be needed for the Unicode Standard's Default Case Algorithm. Generate the field as an enumeration rather than a string for easier comparison.	2021-07-27 21:04:36 +01:00
Timothy Flynn	32ea461385	LibUnicode: Download and parse the special casing UCD file This adds a SpecialCasing structure to the generated UnicodeData.h/cpp files. This structure contains casing rules for code points which have non-1-to-1 upper-to-lower case code point mappings. Further, these rules may be limited to specific locales or other context.	2021-07-27 21:04:36 +01:00
Timothy Flynn	4dda3edc9e	LibUnicode: Introduce a Unicode library for interacting with UCD files The Unicode standard publishes the Unicode Character Database (UCD) with information about every code point, such as each code point's upper case mapping. LibUnicode exists to download and parse UCD files at build time and to provide accessors to that data. As a start, LibUnicode includes upper- and lower-case code point converters.	2021-07-26 17:03:55 +01:00

30 commits
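
Commit 7dce2bfe23 describes replacing a hash map lookup with a binary search over a per-category table of code point ranges. The following is a minimal sketch of that idea, not the generated LibUnicode code: the `CodePointRange` struct, the `s_uppercase_letter_sample` table, and the `code_point_in_table` function are all hypothetical names, and the table holds only a small hand-picked subset of the Uppercase_Letter (Lu) ranges. It assumes the ranges are sorted and do not overlap.

```cpp
#include <array>
#include <cstddef>
#include <cstdint>

// Hypothetical inclusive code point range; not LibUnicode's actual type.
struct CodePointRange {
    std::uint32_t first;
    std::uint32_t last;
};

// Illustrative subset of the Lu (Uppercase_Letter) ranges.
// Assumed to be sorted by `first` with no overlapping ranges.
constexpr std::array<CodePointRange, 3> s_uppercase_letter_sample { {
    { 0x0041, 0x005A }, // A-Z
    { 0x00C0, 0x00D6 }, // U+00C0 through U+00D6
    { 0x00D8, 0x00DE }, // U+00D8 through U+00DE (U+00D7 is a math symbol)
} };

// Binary search over the sorted ranges: narrow [low, high) until the
// code point falls inside a range or no candidate ranges remain.
template<std::size_t N>
constexpr bool code_point_in_table(std::array<CodePointRange, N> const& table, std::uint32_t code_point)
{
    std::size_t low = 0;
    std::size_t high = N;
    while (low < high) {
        std::size_t mid = low + (high - low) / 2;
        if (code_point < table[mid].first)
            high = mid;
        else if (code_point > table[mid].last)
            low = mid + 1;
        else
            return true;
    }
    return false;
}
```

With one such table per category, checking a category costs O(log n) comparisons in the number of ranges and involves no hashing, which matches the trade-off the commit message describes.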