beenull/ladybird

mirror of https://github.com/LadybirdBrowser/ladybird.git synced 2024-11-26 09:30:24 +00:00

Author	SHA1	Message	Date
Sam Atkins	0d021a63c7	LibUnicode: Generate data for bidirectional character types This will let us examine code points to determine the rtl/ltr direction of a piece of text.	2023-08-20 16:21:35 -04:00
Timothy Flynn	456211932f	LibUnicode: Perform code point case conversion lookups in constant time Similar to commit `0652cc4`, we now generate 2-stage lookup tables for case conversion information. Only about 1500 code points are actually cased. This means that case information is rather highly compressible, as the blocks we break the code points into will generally all have no casing information at all. In total, this change: * Does not change the size of libunicode.so (which is nice because, generally, the 2-stage lookup tables are expected to trade a bit of size for performance). * Reduces the runtime of the new benchmark test case added here from 1.383s to 1.127s (about an 18.5% improvement).	2023-07-28 05:28:50 +02:00
Timothy Flynn	0652cc48c0	LibUnicode: Perform code point property lookups in constant time We currently produce a single table for all categories of code point properties (GeneralCategory, Script, etc.). Each row contains a field indicating the range of code points to which that property applies. At runtime, we then do a binary search through that table to decide if a code point has a property. This changes our approach to generate a 2-stage lookup table for each of those categories. There is an in-depth explanation of these tables above the new `create_code_point_tables` method. The end effect is that code point property lookup is reduced from a binary search to constant-time array lookups. In total, this change: * Increases the size of libunicode.so from 2.7 MB to 2.9 MB. * Reduces the runtime of the new benchmark test case added here from 3.576s to 1.020s (a 3.5x speedup). * In a profile of resizing a TextEditor window with a 3MB file open, the runtime of checking if a code point has a word break property reduces from ~81% to ~56%.	2023-07-26 08:36:20 +02:00
Timothy Flynn	c950f88611	LibUnicode: Stop generating Block property data We started generating this data in commit `0505e03`, but it was unused. It's still not used, so let's remove it, rather than bloating the size of libunicode.so with unused data. If we need it in the future, it's trivial to add back. Note we have always used the block name data from that commit, and that is still present here.	2023-07-26 08:36:20 +02:00
Timothy Flynn	f8a0365002	LibUnicode: Detect ZWJ sequences when filtering by emoji presentation This was preventing some unqualified emoji sequences from rendering properly, such as the custom SerenityOS flag. We rendered the flag correctly when given the fully qualified sequence: U+1F3F3 U+FEOF U+200D U+1F41E But were not detecting the unqualified sequence as an emoji when also filtering for emoji-presentation sequences: U+1F3F3 U+200D U+1F41E	2023-03-05 20:21:57 +01:00
Timothy Flynn	73239fdd82	LibUnicode: Add a unit test for Unicode grapheme and word segmentation These include tests for previously broken boundary conditions.	2023-02-25 22:23:39 +01:00
Timothy Flynn	1484d3d9f5	LibUnicode: Add a method to check if a code point could start an emoji	2023-02-24 19:48:47 +01:00
Timothy Flynn	5cbf054651	LibUnicode: Fix typos causing text segmentation on mid-word punctuation For example the words "can't" and "32.3" should not have boundaries detected on the "'" and "." code points, respectively. The String test cases fixed here are because "b'ar" is now considered one word.	2023-02-15 12:36:47 +01:00
Timothy Flynn	8f2589b3b0	LibUnicode: Parse and generate case folding code point data Case folding rules have a similar mapping style as special casing rules, where one code point may map to zero or more case folding rules. These will be used for case-insensitive string comparisons. To see how case folding can differ from other casing rules, consider "ß" (U+00DF): >>> "ß".lower() 'ß' >>> "ß".upper() 'SS' >>> "ß".title() 'Ss' >>> "ß".casefold() 'ss'	2023-01-18 14:43:40 +00:00
Timothy Flynn	bc51017a03	LibUnicode: Support full case folding for titlecasing a string Unicode declares that to titlecase a string, the first cased code point after each word boundary should be transformed to its titlecase mapping. All other codepoints are transformed to their lowercase mapping.	2023-01-16 18:33:44 -05:00
Timothy Flynn	b562348d31	LibUnicode: Generate simple case folding mappings for titlecase Note we already generate the special case foldings for titlecase.	2023-01-16 18:33:44 -05:00
Timothy Flynn	3d22efccca	LibUnicode+LibJS: Propagate OOM from Unicode normalization	2023-01-09 22:48:15 +00:00
Timothy Flynn	1ff29afc45	LibUnicode+LibJS+LibWeb: Propagate OOM from Unicode case transformations	2023-01-09 22:48:15 +00:00
Timothy Flynn	f38c68177b	LibUnicode: Update code point ideographic replacements for Unicode 15	2022-10-07 18:17:40 +01:00
matcool	104b51b912	LibUnicode: Fix Hangul syllable composition for specific cases This fixes `combine_hangul_code_points` which would try to combine a LVT syllable with a trailing consonant, resulting in a wrong character. Also added a test for this specific case.	2022-10-07 07:53:27 -04:00
matcool	c8d7b0a33a	Tests: Add tests for LibUnicode's normalize	2022-10-06 08:24:39 -04:00
Timothy Flynn	9e860d973e	LibLocale: Move locale source files to the LibLocale library Everything is now setup to create the LibLocale library and link it where needed.	2022-09-05 14:37:16 -04:00
Timothy Flynn	b2d2bb43ce	LibLocale: Move locale test files to the LibLocale folder	2022-09-05 14:37:16 -04:00
Timothy Flynn	43a3471298	LibLocale: Move locale source files to the LibLocale folder These are still included in LibUnicode, but this updates their location and the include paths of other files which include them.	2022-09-05 14:37:16 -04:00
Timothy Flynn	ff48220dca	Userland: Move files destined for LibLocale to the Locale namespace	2022-09-05 14:37:16 -04:00
Timothy Flynn	fc8bf7ac3e	LibUnicode+Userland: Migrate generated CLDR data to LibLocaleData Currently, LibUnicodeData contains the generated UCD and CLDR data. Move the UCD data to the main LibUnicode library, and rename LibUnicodeData to LibLocaleData. This is another prepatory change to migrate to LibLocale.	2022-09-05 14:37:16 -04:00
sin-ack	3f3f45580a	Everywhere: Add sv suffix to strings relying on StringView(char const) Each of these strings would previously rely on StringView's char const constructor overload, which would call __builtin_strlen on the string. Since we now have operator ""sv, we can replace these with much simpler versions. This opens the door to being able to remove StringView(char const*). No functional changes.	2022-07-12 23:11:35 +02:00
thankyouverycool	5658524aa3	Tests: Add Unicode tests for CharacterType block properties	2022-02-15 10:13:19 -05:00
Timothy Flynn	6efbafa6e0	Everywhere: Update copyrights with my new serenityos.org e-mail :^)	2022-01-31 18:23:22 +00:00
Timothy Flynn	4400150cd2	LibJS+LibUnicode: Return the appropriate time zone name depending on DST	2022-01-19 21:20:41 +00:00
Timothy Flynn	bdf02c21e1	LibUnicode: Swap the preferred order of standard time zone display names Our generator is currently preferring the DST variant of the time zone display names over the non-DST variant. LibTimeZone currently does not have DST support, and operates in a mode that basically assumes DST does not exist. Swap the display names for now just to be consistent until we have DST support. Note we will need to generate both of these variants and select the appropriate one at runtime once we have DST support.	2022-01-12 15:43:12 +01:00
Timothy Flynn	e2dfbe8f67	LibUnicode: Parse and generate long and short generic time zone names This implements the CalendarPatternStyle::{Long,Short}Generic styles of time zone name formatting.	2022-01-11 23:56:35 +01:00
Timothy Flynn	d50f5e14f8	LibUnicode: Fall back to GMT offset when a time zone name is unavailable The following table in TR-35 includes a web of fall back rules when the requested time zone style is unavailable: https://unicode.org/reports/tr35/tr35-dates.html#dfst-zone Conveniently, the subset of styles supported by ECMA-402 (and therefore LibUnicode) all either fall back to GMT offset or to a style that is unsupported but itself falls back to GMT offset.	2022-01-11 23:56:35 +01:00
Timothy Flynn	8d35563f28	LibUnicode: Implement TR-35's localized GMT offset formatting This adds an API to use LibTimeZone to convert a time zone such as "America/New_York" to a GMT offset string like "GMT-5" (short form) or "GMT-05:00" (long form).	2022-01-11 23:56:35 +01:00
Timothy Flynn	6d7d9dd324	LibUnicode: Do not assume time zones & meta zones have a 1-to-1 mapping The generator parses metaZones.json to form a mapping of meta zones to time zones (AKA "golden zone" in TR-35). This parser errantly assumed this was a 1-to-1 mapping.	2022-01-06 22:28:01 +01:00
Timothy Flynn	ffb3ba3079	Tests: Link some tests directly against LibUnicodeData These were missed in `565a880ce5`. This wasn't an issue because these tests don't pledge/unveil anything, so they could happily dlopen() the library at runtime. But this is now needed in order to migrate LibUnicode towards weak symbols instead.	2022-01-04 22:49:43 +00:00
Timothy Flynn	7e6ad172a4	LibUnicode: Support code point names that apply to ranges of code points For example, consider the following adjacent entries in UnicodeData.txt: 3400;<CJK Ideograph Extension A, First>;Lo;0;L;;;;;N;;;;; 4DBF;<CJK Ideograph Extension A, Last>;Lo;0;L;;;;;N;;;;; Our current implementation would assign the display name "CJK Ideograph Extension A" to code points U+3400 & U+4DBF, but not to the code points in between. Not only should those code points be assigned a name, but the Unicode spec also has formatting rules on what the names should be (the names for these ranged code points are not as they appear in UnicodeData.txt). The spec also defines names for code point ranges that actually are listed individually in UnicodeData.txt. For example: 2F800;CJK COMPATIBILITY IDEOGRAPH-2F800;Lo;0;L;4E3D;;;;N;;;;; 2F801;CJK COMPATIBILITY IDEOGRAPH-2F801;Lo;0;L;4E38;;;;N;;;;; 2F802;CJK COMPATIBILITY IDEOGRAPH-2F802;Lo;0;L;4E41;;;;N;;;;; Code points are only coalesced into a range if all fields after the name are equivalent. Our parser will insert the range and its name formatting pattern when it comes across the first code point in that range, then ignore other code points in that range. This reduces the number of names we generated by nearly 2,000.	2021-11-30 11:24:02 +01:00
Timothy Flynn	93ee922027	LibUnicode: Support locales-without-script aliases for ECMA-402 As noted by ECMA-402, if a supported locale contains all of a language, script, and region subtag, then the implementation must also support the locale without the script subtag. The most complicated example of this is the zh-TW locale. The list of locales in the CLDR database does not include zh-TW or its maximized zh-Hant-TW variant. Instead, it inlcudes the zh-Hant locale. However, zh-Hant-TW is listed in the default-content locale list in the cldr-core package. This defines an alias from zh-Hant-TW to zh-Hant. We must then also support the zh-Hant-TW alias without the script subtag: zh-TW. This transitively maps zh-TW to zh-Hant, which is a case quite heavily tested by test262.	2021-11-19 11:45:35 +01:00
Timothy Flynn	357c97dfa8	LibUnicode: Parse the CLDR's defaultContent.json locale list This file contains the list of locales which default to their parent locale's values. In the core CLDR dataset, these locales have their own files, but they are empty (except for identity data). For example: https://github.com/unicode-org/cldr/blob/main/common/main/en_US.xml In the JSON export, these files are excluded, so we currently are not recognizing these locales just by iterating the locale files. This is a prerequisite for upgrading to CLDR version 40. One of these default-content locales is the popular "en-US" locale, which defaults to "en" values. We were previously inferring the existence of this locale from the "en-US-POSIX" locale (many implementations, including ours, strip variants such as POSIX). However, v40 removes the "en-US-POSIX" locale entirely, meaning that without this change, we wouldn't know that "en-US" exists (we would default to "en"). For more detail on this and other v40 changes, see: https://cldr.unicode.org/index/downloads/cldr-40#h.nssoo2lq3cba	2021-11-09 20:44:52 +01:00
Timothy Flynn	4f2bcebe74	LibUnicode+LibJS: Store locale keyword values as a single string Previously, LibUnicode would store the values of a keyword as a Vector. For example, the locale "en-u-ca-abc-def" would have its keyword "ca" stored as {"abc, "def"}. Then, canonicalization would occur on each of the elements in that Vector. This is incorrect because, for example, the keyword value "true" should only be dropped if that is the entire value. That is, the canonical form of "en-u-kb-true" is "en-u-kb", but "en-u-kb-abc-true" does not change for canonicalization. However, we would canonicalize that locale as "en-u-kb-abc".	2021-09-08 21:08:48 +01:00
Timothy Flynn	50158abaf1	LibUnicode: Implement locale-aware BEFORE_DOT special casing Note that the algorithm in the Unicode spec is for checking that a code point precedes U+0307, but the special casing condition NotBeforeDot is interested in the inverse of this rule.	2021-09-06 15:24:27 +01:00
Timothy Flynn	436faf9fd9	LibUnicode: Implement locale-aware MORE_ABOVE special casing	2021-09-06 15:24:27 +01:00
Timothy Flynn	1427ebc622	LibUnicode: Implement locale-aware AFTER_SOFT_DOTTED special casing	2021-09-06 15:24:27 +01:00
Timothy Flynn	0053d48c41	LibUnicode: Implement locale-aware AFTER_I special casing	2021-09-06 15:24:27 +01:00
Timothy Flynn	a05419db55	LibUnicode: Add lexer to test if a string matches the "type" production	2021-09-02 17:56:42 +01:00
Andrew Kaster	58797a1289	Tests: Remove all file(GLOB) from CMakeLists in Tests Using a file(GLOB) to find all the test files in a directory is an easy hack to get things started, but has some drawbacks. Namely, if you add a test, it won't be found again without re-running CMake. `ninja` seems to do this automatically, but it would be nice to one day stop seeing it rechecking our globbed directories.	2021-09-02 09:08:23 +02:00
Timothy Flynn	fd0011989a	LibUnicode: Resolve the most likely territory alias when there are many	2021-09-01 14:14:47 +01:00
Timothy Flynn	72f49e42b4	LibUnicode: Perform complex Unicode locale alias substitution	2021-09-01 14:14:47 +01:00
Timothy Flynn	da89cf9afb	LibUnicode: Canonicalize calendar subtags Calendar subtags are a bit of an odd-man-out in that we must match the variants "ethiopic-amete-alem" in that order, without any other variant in the locale. So a separate method is needed for this, and we now defer sorting the variant list until after other canonicalization is done.	2021-09-01 14:14:47 +01:00
Timothy Flynn	8458f477a4	LibUnicode: Canonicalize timezone subtags	2021-09-01 14:14:47 +01:00
Timothy Flynn	335f985b31	LibUnicode: Canonicalize the subtag "imperial" to "uksystem"	2021-09-01 14:14:47 +01:00
Timothy Flynn	2d90144888	LibUnicode: Canonicalize the subtag "primary" and "tertiary" to "levelN"	2021-09-01 14:14:47 +01:00
Timothy Flynn	409f39b336	LibUnicode: Canonicalize the subtag "names" to "prprname"	2021-09-01 14:14:47 +01:00
Timothy Flynn	f907a7dc38	LibUnicode: Canonicalize the subtag "yes" to "true"	2021-09-01 14:14:47 +01:00
Timothy Flynn	556374a904	LibUnicode: Substitute Unicode locale aliases during canonicalization Unicode TR35 defines how locale subtag aliases should be emplaced when converting a locale to canonical form. For most subtags, it is a simple substitution. Language subtags depend on context; for example, the language "sh" should become "sr-Latn", but if the original locale has a script subtag already ("sh-Cyrl"), then only the language subtag of the alias should be taken ("sr-Latn"). To facilitate this, we now make two passes when canonicalizing a locale. In the first pass, we convert the LocaleID structure to canonical syntax (where the conversions all happen in-place). In the second pass, we form the canonical string based on the canonical syntax.	2021-09-01 14:14:47 +01:00

1 2

68 commits