0ct0pu5/ladybird

Author	SHA1	Message	Date
Timothy Flynn	5cbf054651	LibUnicode: Fix typos causing text segmentation on mid-word punctuation For example the words "can't" and "32.3" should not have boundaries detected on the "'" and "." code points, respectively. The String test cases fixed here are because "b'ar" is now considered one word.	2023-02-15 12:36:47 +01:00
Timothy Flynn	8f2589b3b0	LibUnicode: Parse and generate case folding code point data Case folding rules have a similar mapping style as special casing rules, where one code point may map to zero or more case folding rules. These will be used for case-insensitive string comparisons. To see how case folding can differ from other casing rules, consider "ß" (U+00DF): >>> "ß".lower() 'ß' >>> "ß".upper() 'SS' >>> "ß".title() 'Ss' >>> "ß".casefold() 'ss'	2023-01-18 14:43:40 +00:00
Timothy Flynn	bc51017a03	LibUnicode: Support full case folding for titlecasing a string Unicode declares that to titlecase a string, the first cased code point after each word boundary should be transformed to its titlecase mapping. All other codepoints are transformed to their lowercase mapping.	2023-01-16 18:33:44 -05:00
Timothy Flynn	b562348d31	LibUnicode: Generate simple case folding mappings for titlecase Note we already generate the special case foldings for titlecase.	2023-01-16 18:33:44 -05:00
Timothy Flynn	1ff29afc45	LibUnicode+LibJS+LibWeb: Propagate OOM from Unicode case transformations	2023-01-09 22:48:15 +00:00
Timothy Flynn	f38c68177b	LibUnicode: Update code point ideographic replacements for Unicode 15	2022-10-07 18:17:40 +01:00
thankyouverycool	5658524aa3	Tests: Add Unicode tests for CharacterType block properties	2022-02-15 10:13:19 -05:00
Timothy Flynn	6efbafa6e0	Everywhere: Update copyrights with my new serenityos.org e-mail :^)	2022-01-31 18:23:22 +00:00
Timothy Flynn	7e6ad172a4	LibUnicode: Support code point names that apply to ranges of code points For example, consider the following adjacent entries in UnicodeData.txt: 3400;<CJK Ideograph Extension A, First>;Lo;0;L;;;;;N;;;;; 4DBF;<CJK Ideograph Extension A, Last>;Lo;0;L;;;;;N;;;;; Our current implementation would assign the display name "CJK Ideograph Extension A" to code points U+3400 & U+4DBF, but not to the code points in between. Not only should those code points be assigned a name, but the Unicode spec also has formatting rules on what the names should be (the names for these ranged code points are not as they appear in UnicodeData.txt). The spec also defines names for code point ranges that actually are listed individually in UnicodeData.txt. For example: 2F800;CJK COMPATIBILITY IDEOGRAPH-2F800;Lo;0;L;4E3D;;;;N;;;;; 2F801;CJK COMPATIBILITY IDEOGRAPH-2F801;Lo;0;L;4E38;;;;N;;;;; 2F802;CJK COMPATIBILITY IDEOGRAPH-2F802;Lo;0;L;4E41;;;;N;;;;; Code points are only coalesced into a range if all fields after the name are equivalent. Our parser will insert the range and its name formatting pattern when it comes across the first code point in that range, then ignore other code points in that range. This reduces the number of names we generated by nearly 2,000.	2021-11-30 11:24:02 +01:00
Timothy Flynn	50158abaf1	LibUnicode: Implement locale-aware BEFORE_DOT special casing Note that the algorithm in the Unicode spec is for checking that a code point precedes U+0307, but the special casing condition NotBeforeDot is interested in the inverse of this rule.	2021-09-06 15:24:27 +01:00
Timothy Flynn	436faf9fd9	LibUnicode: Implement locale-aware MORE_ABOVE special casing	2021-09-06 15:24:27 +01:00
Timothy Flynn	1427ebc622	LibUnicode: Implement locale-aware AFTER_SOFT_DOTTED special casing	2021-09-06 15:24:27 +01:00
Timothy Flynn	0053d48c41	LibUnicode: Implement locale-aware AFTER_I special casing	2021-09-06 15:24:27 +01:00
Timothy Flynn	1e91334008	LibUnicode: Handle edge-case script extensions, Common and Inherited These script extensions have some peculiar behavior in the Unicode spec. The UCD ScriptExtension file does not contain these scripts. Rather, it is implied the code points which have these scripts as an extension are the code points that both: 1. Have Common or Inherited as their primary script value 2. Do not have any other script value in their script extension lists Because these are not explicitly listed in the UCD, we must manually form these script extensions.	2021-08-11 13:11:01 +02:00
Timothy Flynn	47bb350ebd	LibUnicode: Generate separate tables for scripts and script extensions Notice that unlike the note in populate_general_category_unions(), script extensions do indeed have code point ranges which overlap. Thus, this commit adds code to handle that, and hooks it into the GC unions.	2021-08-11 13:11:01 +02:00
Timothy Flynn	5ac23d244d	LibUnicode: Generate separate tables for Unicode properties Similar to General Categories, this generates separate tables for the Property list.	2021-08-11 13:11:01 +02:00
Timothy Flynn	b06c104076	LibUnicode: Include Unassigned code points in the Other General Category Now that the generator parses unassigned General Category properties, it can include Unassigned (Cn) in the Other (C) category.	2021-08-11 13:11:01 +02:00
Timothy Flynn	7dce2bfe23	LibUnicode: Generate separate tables for General Category properties Previously, each code point's General Category was part of the generated UnicodeData structure. This ultimately presented two problems, one functional and one performance related: * Some General Categories are applied to unassigned code points, for example the Unassigned (Cn) category. Unassigned code points are strictly excluded from UnicodeData.txt, so by relying on that file, the generator is unable to handle these categories. * Lookups for General Categories are slower when searching through the large UnicodeData hash map. Even though lookups are O(1), the hash function turned out to be slower than binary searching through a category-specific table. So, now a table is generated for each General Category. When querying a code point for a category, a binary search is done on each code point range in that category's table to check if code point has that category. Further, General Categories are now parsed from the UCD file DerivedGeneralCategory.txt. This file is a normal "prop list" file and contains the categories for unassigned code points.	2021-08-11 13:11:01 +02:00
Timothy Flynn	c4bfda7f7f	LibUnicode: Handle code points that are both cased and case-ignorable Apparently, some code points fit both categories, for example U+0345 (COMBINING GREEK YPOGEGRAMMENI). Handle this fact when determining if a code point is a final code point in a string.	2021-07-28 23:42:29 +02:00
Timothy Flynn	7827aede6f	LibUnicode: Check word break when deciding on case-ignorable code points	2021-07-28 23:42:29 +02:00
Timothy Flynn	c45a014645	LibUnicode: Check property list when deciding if a code point is cased	2021-07-28 23:42:29 +02:00
Timothy Flynn	39f971e42b	LibUnicode: Begin implementing special Unicode case folding This implements unconditional special case folding, and conditional folding for non-locale cases. Worth noting that the only conditional, non-locale special case is for converting an uppercase sigma to lowercase.	2021-07-27 21:04:36 +01:00
Timothy Flynn	4dda3edc9e	LibUnicode: Introduce a Unicode library for interacting with UCD files The Unicode standard publishes the Unicode Character Database (UCD) with information about every code point, such as each code point's upper case mapping. LibUnicode exists to download and parse UCD files at build time and to provide accessors to that data. As a start, LibUnicode includes upper- and lower-case code point converters.	2021-07-26 17:03:55 +01:00
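
Commit 7dce2bfe23 above describes generating one table of code point ranges per General Category and binary searching that table on lookup. The following is a rough Python sketch of that idea, not the generated LibUnicode code: the table name `LU_RANGES`, the helper `code_point_in_ranges`, and the three-entry excerpt of Uppercase Letter (Lu) ranges are assumptions made for illustration.

```python
from bisect import bisect_right

# Hypothetical hand-written excerpt of a per-category table: sorted,
# non-overlapping (first, last) code point ranges for Uppercase Letter (Lu).
LU_RANGES = [(0x0041, 0x005A), (0x00C0, 0x00D6), (0x00D8, 0x00DE)]

# One past the largest valid code point; used only as a tuple tie-breaker.
_ABOVE_MAX = 0x110000


def code_point_in_ranges(code_point, ranges):
    """Binary search the sorted ranges for one that contains code_point."""
    # Find the last range whose first code point is <= code_point.
    index = bisect_right(ranges, (code_point, _ABOVE_MAX)) - 1
    if index < 0:
        return False
    # The candidate matches only if code_point does not exceed its end.
    return code_point <= ranges[index][1]


print(code_point_in_ranges(ord("A"), LU_RANGES))
print(code_point_in_ranges(ord("a"), LU_RANGES))
```

Storing ranges rather than individual code points keeps each table small, and the lookup needs no hash computation, which is the trade-off the commit message describes.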

23 commits
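
Commit 7e6ad172a4 in the list above notes that code points inside a range get a name built from a formatting pattern rather than the text in UnicodeData.txt. A minimal Python sketch of that lookup follows, assuming a small hypothetical table `NAME_RANGES` and helper `ranged_name`; neither name comes from LibUnicode, and only two ranges are included.

```python
# Hypothetical range table: (first, last, name prefix). Per the Unicode name
# derivation rules, the displayed name is the prefix followed by the code
# point in uppercase hexadecimal.
NAME_RANGES = [
    (0x3400, 0x4DBF, "CJK UNIFIED IDEOGRAPH-"),
    (0x2F800, 0x2FA1D, "CJK COMPATIBILITY IDEOGRAPH-"),
]


def ranged_name(code_point):
    """Return the derived name for a ranged code point, or None."""
    for first, last, prefix in NAME_RANGES:
        if first <= code_point <= last:
            # Format with at least four uppercase hex digits.
            return f"{prefix}{code_point:04X}"
    return None


print(ranged_name(0x3401))
print(ranged_name(0x2F801))
```

One table entry covers every code point in its range, which is how the commit avoids generating a separate name for each of them.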
