https://unicode.org/versions/Unicode15.1.0/
This update includes a new code point property, Indic Conjunct Break
(InCB), whose values may be Consonant, Linker, or Extend. The property
is used in text segmentation to prevent breaking within certain
extended grapheme cluster sequences.
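For example (assuming the standard UAX #29 rules as extended in Unicode
15.1), a Devanagari conjunct like "क्ष" stays in one extended grapheme
cluster because of these values:

U+0915 DEVANAGARI LETTER KA      InCB=Consonant
U+094D DEVANAGARI SIGN VIRAMA    InCB=Linker
U+0937 DEVANAGARI LETTER SSA     InCB=Consonant

The new rule forbids a grapheme break between the linker and the
following consonant.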
These properties were used to generate specialized tables. Now that
those tables have been migrated to general 2-stage lookup tables, the
fields are all unused.
Similar to commit 0652cc4, we now generate 2-stage lookup tables for
case conversion information. Only about 1500 code points are actually
cased. This makes case information highly compressible: most of the
blocks we break the code points into contain no casing information at
all.
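For a sense of scale (using an illustrative block size of 256 code
points): Unicode's 0x110000 code points split into 4352 blocks, and
with only ~1500 cased code points clustered in a few of them, nearly
every block deduplicates into a single shared "no casing" block.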
In total, this change:
* Does not change the size of libunicode.so (which is nice because,
generally, the 2-stage lookup tables are expected to trade a bit
of size for performance).
* Reduces the runtime of the new benchmark test case added here from
1.383s to 1.127s (about an 18.5% improvement).
There is no functional change here. This information will make up the
multistage casing tables in an upcoming patch. Extract it to
its own struct to prepare for that.
There is no functional change here. This just adjusts the changes made
in commit 0652cc4 to be a bit more generic for code point casing tables.
We currently only generate property tables, which boil down to a vector
of booleans. Casing tables will be a struct of varying types, so this
generalizes some of the generator to prepare for that ahead of time, to
make the upcoming casing patch smaller / easier to grok.
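As a rough sketch of the difference (field names here are illustrative,
modeled on the UCD's simple casing fields, not the actual upcoming
struct): a property table row boils down to one bool, while a casing
row mixes several types:

#include <cstdint>

// Hypothetical casing row; the real struct in the follow-up patch may
// differ.
struct CasingTable {
    uint32_t simple_uppercase_mapping { 0 };
    uint32_t simple_lowercase_mapping { 0 };
    uint32_t special_casing_start_index { 0 }; // into a flat rule list
    uint32_t special_casing_size { 0 };
};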
When generating code point property tables, we currently binary search
the code point range lists for each property to decide if a code point
has that property. However, we iterate over both the code points and
the sorted property ranges in ascending order. This means we never need
to search code point ranges that lie below the current code point. We
can even remove those ranges as we go, since no future code point will
fall inside them.
On my machine, this reduces the run time of GenerateUnicodeData from
3.4 seconds to 1.2 seconds.
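A minimal sketch of the idea (simplified; the actual generator code
differs):

#include <cstddef>
#include <cstdint>
#include <vector>

struct CodePointRange {
    uint32_t first { 0 };
    uint32_t last { 0 };
};

// `ranges` is sorted and `code_point` only increases across calls, so
// the cursor never moves backwards: ranges we have passed are skipped
// once and never examined again.
static bool has_property(std::vector<CodePointRange> const& ranges,
    size_t& cursor, uint32_t code_point)
{
    while (cursor < ranges.size() && ranges[cursor].last < code_point)
        ++cursor;
    return cursor < ranges.size() && ranges[cursor].first <= code_point;
}

This turns the per-code-point cost from repeated binary searches into
amortized constant-time work.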
We currently produce a single table for all categories of code point
properties (GeneralCategory, Script, etc.). Each row contains a field
indicating the range of code points to which that property applies. At
runtime, we then do a binary search through that table to decide if a
code point has a property.
This changes our approach to generate a 2-stage lookup table for each of
those categories. There is an in-depth explanation of these tables above
the new `create_code_point_tables` method. The end effect is that code
point property lookup is reduced from a binary search to constant-time
array lookups.
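A minimal sketch of the layout (block size and table names here are
illustrative; the generated code differs):

#include <cstdint>

// Stage 1: one entry per 256-code-point block (0x110000 / 256 = 0x1100
// blocks), mapping each block to an index into stage 2. Identical
// blocks are deduplicated, so many entries point at the same block.
static constexpr uint16_t s_stage1[0x1100] = { /* generated */ };

// Stage 2: the deduplicated 256-entry blocks of per-code-point values.
static constexpr bool s_stage2[][256] = { { /* generated */ } };

static constexpr bool has_word_break_property(uint32_t code_point)
{
    // Two constant-time array lookups replace the old binary search.
    return s_stage2[s_stage1[code_point >> 8]][code_point & 0xFF];
}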
In total, this change:
* Increases the size of libunicode.so from 2.7 MB to 2.9 MB.
* Reduces the runtime of the new benchmark test case added here from
3.576s to 1.020s (a 3.5x speedup).
* In a profile of resizing a TextEditor window with a 3MB file open,
  the time spent checking if a code point has a word break property
  drops from ~81% to ~56% of the profile.
The next commit will need a type from LibUnicode/CharacterTypes.h. To
avoid conflicts between that header's CodePointRange and the one that is
defined in the code generator, just use the public definition.
We started generating this data in commit 0505e03, but it was unused.
It's still not used, so let's remove it, rather than bloating the size
of libunicode.so with unused data. If we need it in the future, it's
trivial to add back.
Note we *have* always used the block name data from that commit, and
that is still present here.
Similar to POSIX read, the basic read and write functions of AK::Stream
do not have a lower limit on how much data they read or write (apart
from "none at all").
Rename the functions to "read some [data]" and "write some [data]" (with
"data" being omitted, since everything here is reading and writing data)
to make them sufficiently distinct from the functions that ensure the
entire buffer is used (which should be the go-to for most usages).
No functional changes, just a lot of new FIXMEs.
`Stream` will be qualified as `AK::Stream` until we remove the
`Core::Stream` namespace. `IODevice` now reuses the `SeekMode` that is
defined by `SeekableStream`, since defining its own would require us to
qualify it with `AK::SeekMode` everywhere.
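To illustrate the contract (a sketch with illustrative names, not the
actual AK::Stream API): a "read some" call may legally return fewer
bytes than requested, so a caller that needs the whole buffer loops:

#include <cstddef>
#include <cstdint>

// "Read some" contract, like POSIX read(): returns 0..size bytes.
size_t read_some(uint8_t* buffer, size_t size);

// The "whole buffer" variant loops until the buffer is filled,
// returning false if the stream ends first.
bool read_entire_buffer(uint8_t* buffer, size_t size)
{
    size_t total = 0;
    while (total < size) {
        size_t nread = read_some(buffer + total, size - total);
        if (nread == 0)
            return false; // end of stream before the buffer was full
        total += nread;
    }
    return true;
}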
Case folding rules have a similar mapping style to special casing rules,
where one code point may map to zero or more case folding rules. These
will be used for case-insensitive string comparisons. To see how case
folding can differ from other casing rules, consider "ß" (U+00DF):
>>> "ß".lower()
'ß'
>>> "ß".upper()
'SS'
>>> "ß".title()
'Ss'
>>> "ß".casefold()
'ss'
And remove links that aren't adding much value but will often get out of
date (i.e. links to UCD files, which are already all listed in
unicode_data.cmake).
This will make it easier to support both string types at the same time
while we convert code, and to track down the remaining uses.
One big exception is Value::to_string() in LibJS, where the name is
dictated by the ToString AO.
We have a new, improved string type coming up in AK (OOM aware, no null
state), and while it's going to use UTF-8, the name UTF8String is a
mouthful - so let's free up the String name by renaming the existing
class.
Making the old one have an annoying name will hopefully also help with
quick adoption :^)
We started hand-picking the smallest index type that fits a particular
generated array in commit 3ad159537e, to reduce the size of the
generated library.
Since then, the number of types using UniqueStorage has grown a ton,
creating a long list of types for which index types are manually picked.
When a new UCD/CLDR/TZDB is released, and the current index type no
longer fits the generated data, we fail to generate. Tracking down which
index caused the failure is a pretty annoying process.
Instead, we can just use size_t within the generators themselves, then
automatically pick the smallest index type the generated code needs.
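A minimal sketch of the automatic selection (illustrative; the
generator's actual helper may differ):

#include <cstddef>
#include <cstdint>
#include <string_view>

// Emit the smallest unsigned type that can represent every index the
// generated array will need.
static std::string_view smallest_index_type(size_t max_index)
{
    if (max_index <= UINT8_MAX)
        return "u8";
    if (max_index <= UINT16_MAX)
        return "u16";
    if (max_index <= UINT32_MAX)
        return "u32";
    return "u64";
}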
Previously the s_decomposition_mappings variable would refer to other
data in s_decomposition_mappings_data. This would cause thousands of
avoidable relocations at load time.
This saves about 128kB RAM for each process which uses LibUnicode.
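One common way to avoid such relocations (a sketch; the actual change
may differ in detail) is to store offsets into the flat data array
instead of pointers:

#include <cstdint>

// Before: each entry points into s_decomposition_mappings_data, so the
// dynamic loader must patch every pointer at load time, and the table
// lands in writable memory.
struct MappingBefore {
    uint32_t code_point;
    uint32_t const* mapping; // pointer => one relocation per entry
};

// After: entries store an offset and length into the flat array. No
// pointers means no relocations, so the table can be mapped read-only
// and shared between processes.
struct MappingAfter {
    uint32_t code_point;
    uint32_t mapping_offset;
    uint32_t mapping_length;
};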
The mappings are exposed via `Unicode::code_point_decomposition(u32)`
and `Unicode::code_point_decompositions()`, the latter being useful for
reverse searching a code point from its decomposition.
The normalization code does not make use of the `Quick_Check` properties
(https://www.unicode.org/reports/tr44/#Decompositions_and_Normalization),
so there are no quick-check optimizations.
Doesn't use them in libc headers so that those don't have to pull in
AK/Platform.h.
AK_COMPILER_GCC is set _only_ for gcc, not for clang too. (__GNUC__ is
defined in clang builds as well.) Using AK_COMPILER_GCC simplifies
things somewhat.
AK_COMPILER_CLANG isn't as much of a win, other than that it's
consistent with AK_COMPILER_GCC.
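A sketch of the distinction (the actual AK/Platform.h definitions may
differ in detail):

// __clang__ must be checked first: clang also defines __GNUC__, so a
// plain __GNUC__ test would match both compilers. With the #elif,
// AK_COMPILER_GCC ends up defined only for real gcc.
#if defined(__clang__)
#    define AK_COMPILER_CLANG
#elif defined(__GNUC__)
#    define AK_COMPILER_GCC
#endif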
Without this, GenerateUnicodeData crashes when run during the build.
With this, `serenity.sh run` brings up a running SerenityOS.
Since GenerateUnicodeData doesn't take a lot of time to run, just
disable optimizations to work around the problem for now.
Works around #15449.
The UCD only cares about a few locales for special casing rules (az, lt,
and tr). Unfortunately, LibUnicode cannot use LibLocale once the
libraries are separate because LibLocale will need to use LibUnicode for
many more things; thus there would be a circular dependency. Instead,
just generate the small enum needed for this one use case.
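Roughly the shape of that generated enum (names here are illustrative;
the generator's output may differ):

// Only the locales with language-sensitive casing rules in the UCD
// need entries.
enum class Locale {
    None,
    Azerbaijani, // az
    Lithuanian,  // lt
    Turkish,     // tr
};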
To prepare for placing all CLDR generated data in a new library,
LibLocale, this moves the code generators for the CLDR data to the
LibLocale subfolder.
Similar to commit becec35, our code point display name data was a large
list of StringViews. RLE can be used here as well to move about 32 MB
from the initialized data section to the read-only section.
Some of the refactoring to store strings as indices into an RLE array
also lets us clean up some of the code point name generators.
Each of these strings would previously rely on StringView's char const*
constructor overload, which would call __builtin_strlen on the string.
Since we now have operator ""sv, we can replace these with much simpler
versions. This opens the door to removing StringView(char const*).
No functional changes.
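For illustration (a sketch; AK's StringView and its ""sv operator
behave analogously to the std ones shown here):

#include <string_view>
using namespace std::string_view_literals;

// Before: construction from char const* measures the string at run
// time (AK's overload calls __builtin_strlen).
std::string_view before { "Latin Extended-A" };

// After: the ""sv literal bakes the length in at compile time.
auto after = "Latin Extended-A"sv;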
This commit has no behavior changes.
In particular, this does not fix any of the wrong uses of the previous
default parameter (which used to be 'false', meaning "only replace the
first occurrence in the string"). It simply replaces the default uses
with String::replace(..., ReplaceMode::FirstOnly), leaving them
incorrect.
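A sketch of the resulting API shape (ReplaceMode::FirstOnly comes from
the text above; the rest is illustrative):

// The old bool parameter defaulted to false ("first match only"); an
// enum makes every call site spell out its intent.
enum class ReplaceMode {
    FirstOnly, // previously `false`, the old default
    All,       // previously `true`
};

// Before: string.replace(needle, replacement);  // implicitly first-only
// After:  string.replace(needle, replacement, ReplaceMode::FirstOnly);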
Similar reasoning to making Core::Stream::read() return Bytes, except
that every user of read_line() creates a StringView from the result, so
let's just return one right away.