0ct0pu5/ladybird

Author	SHA1	Message	Date
Andreas Kling	df547bb321	LibUnicode: Avoid redundant UTF-8 validation in AK::String helpers	2024-04-21 19:32:49 +02:00
Idan Horowitz	945c58c7c1	LibUnicode: Generate and use code point composition mappings These allow us to binary search the code point compositions based on the first code point being combined, which makes the search close to O(log N) instead of O(N).	2024-04-06 14:21:04 -04:00
Idan Horowitz	e227bf0f71	LibUnicode: Optimize the canonical composition algorithm implementation It now takes O(N) time instead of O(N^2) time. Additionally some always false conditions are removed.	2024-04-06 14:21:04 -04:00
Timothy Flynn	576c2f4f4d	LibURL+LibUnicode+LibWebView: Handle punycode directly in LibURL We had defined punycode handling in LibUnicode when LibURL (AK at the time) was unable to depend on LibUnicode. This is no longer the case.	2024-03-26 12:25:21 -04:00
Shannon Booth	e800605ad3	AK+LibURL: Move AK::URL into a new URL library This URL library ends up being a relatively fundamental base library of the system, as LibCore depends on LibURL. This change has two main benefits: * Moving AK back more towards being an agnostic library that can be used between the kernel and userspace. URL has never really fit that description - and is not used in the kernel. * URL _should_ depend on LibUnicode, as it needs punycode support. However, it's not really possible to do this inside of AK as it can't depend on any external library. This change brings us a little closer to being able to do that, but unfortunately we aren't there quite yet, as the code generators depend on LibCore.	2024-03-18 14:06:28 -04:00
Timothy Flynn	aa0a6d58b2	Userland: Remove LibCore dependency from libraries that do not use it	2024-01-22 08:48:34 -05:00
Ali Mohammad Pur	5e1499d104	Everywhere: Rename {Deprecated => Byte}String This commit un-deprecates DeprecatedString, and repurposes it as a byte string. As the null state has already been removed, there are no other particularly hairy blockers in repurposing this type as a byte string (what it _really_ is). This commit is auto-generated: $ xs=$(ack -l \bDeprecatedString\b\\|deprecated_string AK Userland \ Meta Ports Ladybird Tests Kernel) $ perl -pie 's/\bDeprecatedString\b/ByteString/g; s/deprecated_string/byte_string/g' $xs $ clang-format --style=file -i \ $(git diff --name-only \| grep \.cpp\\|\.h) $ gn format $(git ls-files '.gn' '.gni')	2023-12-17 18:25:10 +03:30
Timothy Flynn	43e9dc0500	LibUnicode: Use weak symbols to provide default IDNA definitions Rather than using #ifdef blocks, update the fallback IDNA definitions to use weak symbols to match the rest of LibUnicode / LibLocale.	2023-12-10 10:19:14 -05:00
Timothy Flynn	1f0e24bc3b	LibUnicode: Fix compilation when ENABLE_UNICODE_DATABASE_DOWNLOAD is OFF	2023-12-10 10:19:14 -05:00
Simon Wanner	58f08107b0	AK+LibUnicode: Add Unicode::create_unicode_url This is a workaround for the fact that AK::URLParser can't call into LibUnicode directly.	2023-12-10 08:04:58 -05:00
Simon Wanner	5bcb019106	LibUnicode: Add IDNA::to_ascii This implements the ToASCII operation of Unicode Technical Standard 46	2023-12-10 08:04:58 -05:00
Simon Wanner	7d9fe44039	LibUnicode: Download and parse IDNA data	2023-12-10 08:04:58 -05:00
Simon Wanner	cfd0a60863	LibUnicode: Add Punycode::encode	2023-12-10 08:04:58 -05:00
Simon Wanner	299d35aadc	LibUnicode: Add Punycode::decode	2023-12-10 08:04:58 -05:00
Shannon Booth	d777b279e3	LibUnicode+Tests: Remove now unused `to_unicode_*_full` methods Relocating all of the tests for these in LibUnicode over to the AK String testsuite.	2023-11-28 17:15:27 -05:00
Shannon Booth	6b32a1f18f	AK+LibUnicode: Expose TrailingCodePointTransformation in to_titlecase Relocating the definition of this enum from LibUnicode to AK.	2023-11-28 17:15:27 -05:00
Timothy Flynn	6070df40f3	LibUnicode: Define case-insensitive string comparison more generically The only user is currently String::equals_ignoring_case, but LibRegex will need to do the same case-folded comparison with UTF-32 data. As it turns out, the comparison works with all Unicode view types without much fuss.	2023-11-08 12:54:26 -05:00
Cr4xy	bbfe0d3a82	LibWeb: Implement `text-transform: capitalize`	2023-10-03 09:47:17 -04:00
Timothy Flynn	139c575cc9	LibUnicode: Update to Unicode version 15.1.0 https://unicode.org/versions/Unicode15.1.0/ This update includes a new set of code point properties, Indic Conjunct Break. These may have the values Consonant, Linker, or Extend. These are used in text segmentation to prevent breaking on some extended grapheme cluster sequences.	2023-09-15 18:30:26 +02:00
Timothy Flynn	02a8683266	LibUnicode+LibJS: Stop propagating small OOM errors from normalization This API only performs small allocations, and is only used by LibJS.	2023-09-09 13:03:25 -04:00
Sam Atkins	0d021a63c7	LibUnicode: Generate data for bidirectional character types This will let us examine code points to determine the rtl/ltr direction of a piece of text.	2023-08-20 16:21:35 -04:00
Timothy Flynn	456211932f	LibUnicode: Perform code point case conversion lookups in constant time Similar to commit `0652cc4`, we now generate 2-stage lookup tables for case conversion information. Only about 1500 code points are actually cased. This means that case information is rather highly compressible, as the blocks we break the code points into will generally all have no casing information at all. In total, this change: * Does not change the size of libunicode.so (which is nice because, generally, the 2-stage lookup tables are expected to trade a bit of size for performance). * Reduces the runtime of the new benchmark test case added here from 1.383s to 1.127s (about an 18.5% improvement).	2023-07-28 05:28:50 +02:00
Timothy Flynn	cb128dcf75	LibUnicode: Move the CodePointRangeComparator struct to a public header Move it out of the generated code so that it may be used by the code generator itself.	2023-07-26 08:36:20 +02:00
Timothy Flynn	c950f88611	LibUnicode: Stop generating Block property data We started generating this data in commit `0505e03`, but it was unused. It's still not used, so let's remove it, rather than bloating the size of libunicode.so with unused data. If we need it in the future, it's trivial to add back. Note we have always used the block name data from that commit, and that is still present here.	2023-07-26 08:36:20 +02:00
Timothy Flynn	1393ed2000	AK+LibUnicode: Implement String::equals_ignoring_case without allocating We currently fully casefold the left- and right-hand sides to compare two strings with case-insensitivity. Now, we casefold one code point at a time, storing the result in a view for comparison, until we exhaust both strings.	2023-03-08 18:57:53 +00:00
Timothy Flynn	f8a0365002	LibUnicode: Detect ZWJ sequences when filtering by emoji presentation This was preventing some unqualified emoji sequences from rendering properly, such as the custom SerenityOS flag. We rendered the flag correctly when given the fully qualified sequence: U+1F3F3 U+FE0F U+200D U+1F41E But were not detecting the unqualified sequence as an emoji when also filtering for emoji-presentation sequences: U+1F3F3 U+200D U+1F41E	2023-03-05 20:21:57 +01:00
Timothy Flynn	42c272c059	LibUnicode: Allow ignoring text presentation emoji in sequence detection This adds an option to only detect emoji that should always present as emoji. For example, the copyright symbol (unless followed by an emoji presentation selector) should render as text.	2023-02-28 13:22:58 +00:00
Timothy Flynn	fa96811a22	LibUnicode: Skip over emoji sequences in grapheme boundary segmentation Emoji sequences in the grapheme segmentation spec are a bit tricky: \p{Extended_Pictographic} Extend* ZWJ × \p{Extended_Pictographic} Our current strategy of tracking a boolean to indicate if we are in an emoji sequence was causing us to break up emoji made of multiple sub- sequences. For example, in the "family: man, woman, girl, boy" sequence: U+1F468 U+200D U+1F469 U+200D U+1F467 U+200D U+1F466 We would break at indices 0 (correctly) and 6 (incorrectly). Instead of tracking a boolean, it's quite a bit simpler to reason about emoji sequences by just skipping past them entirely. Note that in cases like the above emoji, we skip one sub-sequence at a time.	2023-02-25 22:23:39 +01:00
Timothy Flynn	1484d3d9f5	LibUnicode: Add a method to check if a code point could start an emoji	2023-02-24 19:48:47 +01:00
Timothy Flynn	8c38d46c1a	LibUnicode: Generate the path to emoji images alongside emoji data This will provide for quicker emoji lookups, rather than having to discover and allocate these paths at runtime before we find out if they even exist.	2023-02-24 19:48:47 +01:00
Timothy Flynn	32a01a60e7	LibUnicode: Remove non-iterative text segmentation algorithms They are now unused.	2023-02-16 11:18:53 +01:00
Timothy Flynn	6ce7ec2eb3	LibUnicode: Use iterative text segmentation algorithms for titlecasing	2023-02-16 11:18:53 +01:00
Timothy Flynn	5cbf054651	LibUnicode: Fix typos causing text segmentation on mid-word punctuation For example the words "can't" and "32.3" should not have boundaries detected on the "'" and "." code points, respectively. The String test cases fixed here are because "b'ar" is now considered one word.	2023-02-15 12:36:47 +01:00
Timothy Flynn	6e7a6e2d02	LibUnicode: Support finding the next/previous text segmentation boundary	2023-02-15 12:36:47 +01:00
Timothy Flynn	abe7786a81	LibUnicode: Allow iterating over text segmentation boundaries This will be useful for e.g. finding the next boundary after a specific index - we can just stop iterating once a condition is satisfied.	2023-02-15 12:36:47 +01:00
Timothy Flynn	dd4c47456e	LibUnicode: Implement text segmentation algorithms for all UTF encodings Similar to commit `6d710eeb43`. Rather than picking and choosing what to support, let's just support all encodings now, as it is trivial. For example, LibGUI will want the UTF-32 overloads.	2023-02-15 12:36:47 +01:00
Timothy Flynn	2d487e4e4c	LibUnicode+LibJS: Move text segmentation algorithms to their own files These algorithms are quite chonky, and more APIs around them are to be added, so let's move them to their own files for a bit of organization.	2023-02-15 12:36:47 +01:00
MacDue	63b11030f0	Everywhere: Use ReadonlySpan<T> instead of Span<T const>	2023-02-08 19:15:45 +00:00
Timothy Flynn	537fcaf59e	AK+LibUnicode: Provide Unicode-aware caseless String matching The Unicode spec defines much more complicated caseless matching algorithms in its Collation spec. This implements the "basic" case folding comparison.	2023-01-18 14:43:40 +00:00
Timothy Flynn	8f2589b3b0	LibUnicode: Parse and generate case folding code point data Case folding rules have a similar mapping style as special casing rules, where one code point may map to zero or more case folding rules. These will be used for case-insensitive string comparisons. To see how case folding can differ from other casing rules, consider "ß" (U+00DF): >>> "ß".lower() 'ß' >>> "ß".upper() 'SS' >>> "ß".title() 'Ss' >>> "ß".casefold() 'ss'	2023-01-18 14:43:40 +00:00
Timothy Flynn	8d9fb898d7	LibUnicode: Update out-of-date spec links And remove links that aren't adding much value but will often get out of date (i.e. links to UCD files, which are already all listed in unicode_data.cmake).	2023-01-18 14:43:40 +00:00
Timothy Flynn	d6ddca0c0f	AK+LibUnicode: Provide Unicode-aware String titlecase transformation	2023-01-16 18:33:44 -05:00
Timothy Flynn	bc51017a03	LibUnicode: Support full case folding for titlecasing a string Unicode declares that to titlecase a string, the first cased code point after each word boundary should be transformed to its titlecase mapping. All other codepoints are transformed to their lowercase mapping.	2023-01-16 18:33:44 -05:00
Timothy Flynn	b562348d31	LibUnicode: Generate simple case folding mappings for titlecase Note we already generate the special case foldings for titlecase.	2023-01-16 18:33:44 -05:00
Timothy Flynn	6d710eeb43	LibUnicode: Add an overload of word segmentation for UTF-8 strings	2023-01-16 18:33:44 -05:00
Timothy Flynn	58bc831750	LibUnicode: Return a String from Unicode normalization	2023-01-15 01:00:20 +00:00
Timothy Flynn	6fcc1c7426	AK+LibUnicode: Provide Unicode-aware String case transformations Since AK can't refer to LibUnicode directly, the strategy here is that if you need case transformations, you can link LibUnicode and receive them. If you try to use either of these methods without linking it, then you'll of course get a linker error (note we don't do any fallbacks to e.g. ASCII case transformations). If you don't need these methods, you don't have to link LibUnicode.	2023-01-09 19:23:46 -07:00
Timothy Flynn	12f6793223	LibUnicode: Move Unicode-aware case transformations to a helper file These will be needed by AK::String as well, so move them to a helper file where they can be re-used.	2023-01-09 19:23:46 -07:00
Timothy Flynn	3d22efccca	LibUnicode+LibJS: Propagate OOM from Unicode normalization	2023-01-09 22:48:15 +00:00
Timothy Flynn	1ff29afc45	LibUnicode+LibJS+LibWeb: Propagate OOM from Unicode case transformations	2023-01-09 22:48:15 +00:00
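
Commits `8f2589b3b0` and `1393ed2000` above describe case folding and a caseless comparison that folds one code point at a time rather than folding both strings up front. The following is a minimal Python sketch of that idea only, not LibUnicode's implementation: the function names are hypothetical, and Python's built-in `str.casefold` stands in for the generated case folding data.

```python
from itertools import zip_longest


def folded_code_points(text):
    """Lazily yield the case-folded code points of `text`.

    Each source code point is folded on its own, so no fully folded copy
    of the string is ever built. A single code point may fold to several
    (for example, U+00DF folds to "ss").
    """
    for code_point in text:
        yield from code_point.casefold()


def equals_ignoring_case(lhs, rhs):
    """Compare two strings under case folding, one code point at a time.

    Hypothetical helper for illustration; it is not the AK::String API.
    """
    exhausted = object()  # marks the shorter stream once it runs out
    pairs = zip_longest(
        folded_code_points(lhs), folded_code_points(rhs), fillvalue=exhausted
    )
    # all() stops at the first mismatch, so unequal strings exit early.
    return all(a == b for a, b in pairs)


if __name__ == "__main__":
    print(equals_ignoring_case("Straße", "STRASSE"))
    print(equals_ignoring_case("abc", "abcd"))
```

Because the folded streams are compared rather than the original code points, strings whose lengths differ before folding (such as "Straße" and "STRASSE") can still compare equal.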


280 commits