Commit graph

31 commits

Author SHA1 Message Date
Timothy Flynn
b8ad4d302e LibUnicode: Move Locale enumeration from generated UCD data to CLDR data
The UCD set of data contained a very small subset of all locales just to
handle some special casing rules. This enumeration will be needed within
the CLDR generator as well. So rather than duplicate the enum, remove it
from the UCD generator in favor of the full list of locales known by the
CLDR generator.
2021-08-27 12:32:24 +01:00
Timothy Flynn
a98d3a1a85 LibUnicode: Download and parse DerivedNormalizationProps UCD file
This file contains the last properties that LibUnicode is not parsing.
Much of the data in this file is not currently used; that is left as a
FIXME for when String.prototype.normalize is implemented. Until then,
only the code point properties are utilized for regular expression
pattern escapes.
2021-08-11 13:11:01 +02:00
Timothy Flynn
1e91334008 LibUnicode: Handle edge-case script extensions, Common and Inherited
These script extensions have some peculiar behavior in the Unicode spec.
The UCD ScriptExtension file does not contain these scripts. Rather, it
is implied the code points which have these scripts as an extension are
the code points that both:

  1. Have Common or Inherited as their primary script value
  2. Do not have any other script value in their script extension lists

Because these are not explicitly listed in the UCD, we must manually form
these script extensions.
2021-08-11 13:11:01 +02:00
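
The two conditions above can be read as a simple predicate. The following is a minimal sketch of that reading, not the generator's actual code: the `Script` enumerators, the `CodePointScripts` struct, and `has_implied_extension` are all hypothetical names, and "no other script value" is interpreted here as "no entry listed in the ScriptExtensions file".

```cpp
#include <cstdint>
#include <vector>

// Hypothetical, heavily abbreviated script list for illustration only.
enum class Script { Common, Inherited, Latin, Devanagari, Bengali };

// Hypothetical per-code-point record: the primary script plus whatever
// script extensions the UCD file lists explicitly (possibly none).
struct CodePointScripts {
    uint32_t code_point;
    Script primary;
    std::vector<Script> listed_extensions;
};

// Returns true if `script` is Common or Inherited and must be manually
// attached to this code point as a script extension:
//   1. the primary script is that same Common/Inherited value, and
//   2. no other script appears in the explicit extension list.
bool has_implied_extension(CodePointScripts const& entry, Script script)
{
    if (script != Script::Common && script != Script::Inherited)
        return false;
    return entry.primary == script && entry.listed_extensions.empty();
}
```

Under this reading, a code point whose primary script is Common and which has no listed extensions gains Common as an extension, while one that lists, say, Devanagari and Bengali does not.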
Timothy Flynn
47bb350ebd LibUnicode: Generate separate tables for scripts and script extensions
Notice that unlike the note in populate_general_category_unions(),
script extensions do indeed have code point ranges which overlap. Thus,
this commit adds code to handle that, and hooks it into the GC unions.
2021-08-11 13:11:01 +02:00
Timothy Flynn
e6e462249f LibUnicode: Generate *_from_string methods using a hash map
Rather than a long series of string comparisons, generate each of these
methods using a hash map of the enumeration name to its value.
2021-08-11 13:11:01 +02:00
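
As a rough illustration of the hash map approach, here is a minimal sketch. It assumes the standard library's `std::unordered_map` rather than Serenity's own HashMap, and the `Script` enumeration and `script_from_string` name are hypothetical stand-ins for whatever the generator emits.

```cpp
#include <optional>
#include <string_view>
#include <unordered_map>

// Hypothetical, abbreviated enumeration for illustration only.
enum class Script { Latin, Greek, Cyrillic };

// Looks up an enumeration value by name with a single hash lookup,
// instead of comparing against every known name in turn.
std::optional<Script> script_from_string(std::string_view name)
{
    // Built once on first use; keys are views into string literals.
    static const std::unordered_map<std::string_view, Script> s_scripts {
        { "Latin", Script::Latin },
        { "Greek", Script::Greek },
        { "Cyrillic", Script::Cyrillic },
    };

    auto it = s_scripts.find(name);
    if (it == s_scripts.end())
        return std::nullopt;
    return it->second;
}
```

Returning an empty optional for unknown names is a design choice of this sketch; the generated methods may report failure differently.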
Timothy Flynn
5ac23d244d LibUnicode: Generate separate tables for Unicode properties
Similar to General Categories, this generates separate tables for the
Property list.
2021-08-11 13:11:01 +02:00
Timothy Flynn
b06c104076 LibUnicode: Include Unassigned code points in the Other General Category
Now that the generator parses unassigned General Category properties, it
can include Unassigned (Cn) in the Other (C) category.
2021-08-11 13:11:01 +02:00
Timothy Flynn
7dce2bfe23 LibUnicode: Generate separate tables for General Category properties
Previously, each code point's General Category was part of the generated
UnicodeData structure. This ultimately presented two problems, one
functional and one performance related:

  * Some General Categories are applied to unassigned code points, for
    example the Unassigned (Cn) category. Unassigned code points are
    strictly excluded from UnicodeData.txt, so by relying on that file,
    the generator is unable to handle these categories.

  * Lookups for General Categories are slower when searching through the
    large UnicodeData hash map. Even though lookups are O(1), the hash
    function turned out to be slower than binary searching through a
    category-specific table.

So, now a table is generated for each General Category. When querying a
code point for a category, a binary search is done on each code point
range in that category's table to check if the code point has that category.

Further, General Categories are now parsed from the UCD file
DerivedGeneralCategory.txt. This file is a normal "prop list" file and
contains the categories for unassigned code points.
2021-08-11 13:11:01 +02:00
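
The per-category lookup described above might look roughly like the sketch below. The `CodePointRange` struct, the table name, and `code_point_in_table` are hypothetical, and the table is a tiny hand-written excerpt of Uppercase_Letter (Lu) ranges rather than generated data. It assumes the ranges are sorted and do not overlap.

```cpp
#include <algorithm>
#include <array>
#include <cstddef>
#include <cstdint>

// Hypothetical range type; both ends are inclusive.
struct CodePointRange {
    uint32_t first;
    uint32_t last;
};

// Hand-written excerpt of Lu ranges, sorted and non-overlapping.
static constexpr std::array<CodePointRange, 3> s_uppercase_letter { {
    { 0x0041, 0x005A },
    { 0x00C0, 0x00D6 },
    { 0x00D8, 0x00DE },
} };

// Binary search for the first range ending at or after the code point,
// then check that the range also starts at or before it.
template<size_t N>
bool code_point_in_table(std::array<CodePointRange, N> const& table, uint32_t code_point)
{
    auto it = std::lower_bound(table.begin(), table.end(), code_point,
        [](CodePointRange const& range, uint32_t value) { return range.last < value; });
    return it != table.end() && it->first <= code_point;
}
```

Because a code point that has no entry simply falls between ranges, this shape also handles categories that cover unassigned code points, as long as those ranges are present in the table.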
Timothy Flynn
4e546cee97 LibUnicode: Remove WordBreakProperty from generated Unicode data
This was originally used for the "is_final_code_point" algorithm in
LibUnicode/CharacterTypes.cpp. However, it has since been superseded by
DerivedCoreProperties and is now unused. Remove it as it is currently a
waste of time to process the data, and is trivial to add back if we need
it again.
2021-08-11 13:11:01 +02:00
Timothy Flynn
6f2640d031 LibUnicode: Parse UCD DerivedBinaryProperties.txt and generate property
2021-08-04 13:50:32 +01:00
Timothy Flynn
9113f892a7 LibUnicode: Parse UCD emoji-data.txt and generate Unicode property
2021-08-04 13:50:32 +01:00
Timothy Flynn
5edd458420 LibUnicode: Parse UCD ScriptExtensions.txt and generate property
2021-08-04 13:50:32 +01:00
Timothy Flynn
6bdb19fe21 LibUnicode: Remove unused parameter from Unicode data generator
2021-08-04 13:50:32 +01:00
Timothy Flynn
f5c1bbc00b LibUnicode: Parse UCD Scripts.txt and generate as a Unicode property
There are a couple of minor nuances with parsing script values, compared
to other properties. In Scripts.txt, the UCD file lists the full name of
each script; other properties, like General Category, list the shorter
name in their primary files. This means that the aliases listed in
PropertyValueAliases.txt are reversed for script values.
2021-08-04 13:50:32 +01:00
Timothy Flynn
1bb6404a19 LibUnicode: Invoke Unicode data generator a single time
It takes a non-negligible amount of time to parse all of the UCD files and
generate the Unicode data files. To help compile times, only invoke the
generator once.
2021-08-04 11:18:24 +02:00
Timothy Flynn
9413c3a0d1 LibUnicode: Generate a map of code points to their Unicode table index
The current strategy of searching for a code point within the generated
table is slow for code points > U+0377 (the last code point whose index
is the same value as the code point itself). For larger code points, we
are doing a linear search through the table.

Instead, generate a HashMap of each code point to its entry in the table
for faster runtime lookups.

This had the added benefit of being able to remove a fair amount of code
from the generator. We no longer need to track that last contiguous code
point (U+0377) nor each code point's index in the generated table.
2021-08-04 11:18:24 +02:00
Timothy Flynn
5de6d3dd90 LibUnicode: Add public methods to compare and lookup General Categories
Adds methods to retrieve a General Category from a string and to check
if a code point matches a General Category.
2021-08-02 21:02:09 +04:30
Timothy Flynn
f63287cd63 LibUnicode: Initialize manually created Unicode properties inline
Using initializer lists directly in the UnicodeData struct definition
feels a bit cleaner than invoking HashMap::set in main().
2021-08-02 21:02:09 +04:30
Timothy Flynn
16e86ae743 LibUnicode: Generate General Category unions and aliases
This downloads the PropertyValueAliases.txt UCD file, which contains a
set of General Category aliases.

This changes the General Category enumeration to now be generated as a
bitmask. This is to easily allow General Category unions. For example,
the LC (Cased_Letter) category is the union of the Ll, Lu, and Lt
categories.
2021-08-02 21:02:09 +04:30
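
To show why a bitmask makes unions easy, here is a minimal sketch. The enumerator values, the abbreviated category list, and the `has_category` helper are assumptions for illustration; the generated enumeration is not reproduced here.

```cpp
#include <cstdint>

// Hypothetical, abbreviated bitmask enumeration: each base category owns
// one bit, and a union is the bitwise OR of its members.
enum class GeneralCategory : uint32_t {
    Lu = 1u << 0,
    Ll = 1u << 1,
    Lt = 1u << 2,
    Nd = 1u << 3,

    // Union category and its long-form alias.
    LC = Lu | Ll | Lt,
    Cased_Letter = LC,
};

// A code point assigned one base category matches a queried category
// (base or union) if the two masks share any bit.
constexpr bool has_category(GeneralCategory assigned, GeneralCategory queried)
{
    return (static_cast<uint32_t>(assigned) & static_cast<uint32_t>(queried)) != 0;
}
```

With this layout, an alias is just a second enumerator with the same value, so `Cased_Letter` and `LC` compare equal without any extra lookup.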
Timothy Flynn
f1809db994 LibUnicode: Add public methods to compare and lookup Unicode properties
Adds methods to retrieve a Unicode property from a string and to check
if a code point matches a Unicode property.

Also adds a <LibUnicode/Forward.h> header.
2021-07-30 21:26:31 +01:00
Timothy Flynn
3f80791ed5 LibUnicode: Manually assign special code point properties
The Unicode standard defines a few extra properties that are not defined
in any UCD file, so we must assign them manually.
2021-07-30 21:26:31 +01:00
Timothy Flynn
bba3152104 LibUnicode: Parse and generate PropertyAliases
These are all used for Unicode property escapes.
2021-07-30 21:26:31 +01:00
Timothy Flynn
761c16d873 LibUnicode: Parse and utilize DerivedCoreProperties
DerivedCoreProperties are pseudo-properties that are the union of other
categories and properties. For example, the derived property Math is the
union of the general category Sm and the property Other_Math.

Parsing these is necessary for implementing Unicode property escapes.
But it also has the added benefit that LibUnicode now does not need to
derive some of these properties at runtime.
2021-07-30 21:26:31 +01:00
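
For context, the Math example above amounts to the union below. This is only a sketch of the definition that would otherwise have to be evaluated at runtime; the `CodePointData` struct and `is_math` helper are hypothetical and not part of LibUnicode.

```cpp
// Hypothetical, abbreviated category list for illustration only.
enum class GeneralCategory { Sm, Lu, Nd };

// Hypothetical per-code-point record holding just the two inputs that
// the derived Math property depends on.
struct CodePointData {
    GeneralCategory general_category;
    bool other_math;
};

// Math is the union of general category Sm and the property Other_Math.
constexpr bool is_math(CodePointData const& data)
{
    return data.general_category == GeneralCategory::Sm || data.other_math;
}
```

Parsing the derived property from the UCD instead means this union is already resolved in the generated data.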
Timothy Flynn
4eb4b06688 LibUnicode: Do not replace underscores in property names
Originally, this was done to make the generated enums look more like the
rest of Serenity's enums. But for Unicode property escapes, LibUnicode
will need to compare property names from a RegExp.prototype object to
these parsed property names, which will be easier without this
modification.
2021-07-30 21:26:31 +01:00
Timothy Flynn
5d09a00189 LibUnicode: Generate PropList enumeration as a bitmask
Rather than generating the PropList as a list of enums, generate it as
a bitmask. Not only will this be better for runtime property searching,
this will allow parsing of the DerivedCoreProperties list more easily.
2021-07-30 21:26:31 +01:00
Timothy Flynn
dff156b7c6 LibUnicode: Reduce Unicode data generator boilerplate
There's a fair amount of boilerplate when e.g. adding a new UCD file to
parse or a new enumeration to generate. Reduce the overhead by adding
helper lambdas. Also adds a couple missing spec links with UCD field
information.
2021-07-28 23:42:29 +02:00
Timothy Flynn
12fb3ae033 LibUnicode: Download and parse the word break property list UCD file
Note that unlike the main property list, each code point has only one
word break property. Code points that do not have a word break property
are to be assigned the property "Other".
2021-07-28 23:42:29 +02:00
Timothy Flynn
38adfd8874 LibUnicode: Download and parse the property list UCD file
2021-07-28 23:42:29 +02:00
Timothy Flynn
5b110034dd LibUnicode: Produce each code point's general category
This will be needed for the Unicode Standard's Default Case Algorithm.
Generate the field as an enumeration rather than a string for easier
comparison.
2021-07-27 21:04:36 +01:00
Timothy Flynn
32ea461385 LibUnicode: Download and parse the special casing UCD file
This adds a SpecialCasing structure to the generated UnicodeData.h/cpp
files. This structure contains casing rules for code points which have
non-1-to-1 upper-to-lower case code point mappings. Further, these rules
may be limited to specific locales or other context.
2021-07-27 21:04:36 +01:00
Timothy Flynn
4dda3edc9e LibUnicode: Introduce a Unicode library for interacting with UCD files
The Unicode standard publishes the Unicode Character Database (UCD) with
information about every code point, such as each code point's upper case
mapping. LibUnicode exists to download and parse UCD files at build time
and to provide accessors to that data.

As a start, LibUnicode includes upper- and lower-case code point
converters.
2021-07-26 17:03:55 +01:00