https://unicode.org/versions/Unicode15.1.0/
This update includes a new code point property, Indic Conjunct Break
(InCB), whose values may be Consonant, Linker, or Extend. The property
is used in text segmentation to prevent breaking within certain
extended grapheme cluster sequences.
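For example (assuming the standard UAX #29 rules as extended in Unicode
15.1), a Devanagari conjunct like "क्ष" stays in one extended grapheme
cluster because of these values:

U+0915 DEVANAGARI LETTER KA      InCB=Consonant
U+094D DEVANAGARI SIGN VIRAMA    InCB=Linker
U+0937 DEVANAGARI LETTER SSA     InCB=Consonant

The new rule forbids a grapheme break between the linker and the
following consonant.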
These properties were used to generate specialized tables. Now that
those tables have been migrated to general 2-stage lookup tables, the
fields are all unused.
Similar to commit 0652cc4, we now generate 2-stage lookup tables for
case conversion information. Only about 1500 code points are actually
cased. This makes case information highly compressible: most of the
blocks we break the code points into contain no casing information at
all.
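For a sense of scale (using an illustrative block size of 256 code
points): Unicode's 0x110000 code points split into 4352 blocks, and
with only ~1500 cased code points clustered in a few of them, nearly
every block deduplicates into a single shared "no casing" block.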
In total, this change:
* Does not change the size of libunicode.so (which is nice because,
generally, the 2-stage lookup tables are expected to trade a bit
of size for performance).
* Reduces the runtime of the new benchmark test case added here from
1.383s to 1.127s (about an 18.5% improvement).
There is no functional change here. This information will make up the
multistage casing tables in an upcoming patch. Extract it to
its own struct to prepare for that.
There is no functional change here. This just adjusts the changes made
in commit 0652cc4 to be a bit more generic for code point casing tables.
We currently only generate property tables, which boil down to a vector
of booleans. Casing tables will be a struct of varying types, so this
generalizes some of the generator to prepare for that ahead of time, to
make the upcoming casing patch smaller / easier to grok.
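As a rough sketch of the difference (field names here are illustrative,
modeled on the UCD's simple casing fields, not the actual upcoming
struct): a property table row boils down to one bool, while a casing
row mixes several types:

#include <cstdint>

// Hypothetical casing row; the real struct in the follow-up patch may
// differ.
struct CasingTable {
    uint32_t simple_uppercase_mapping { 0 };
    uint32_t simple_lowercase_mapping { 0 };
    uint32_t special_casing_start_index { 0 }; // into a flat rule list
    uint32_t special_casing_size { 0 };
};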
When generating code point property tables, we currently binary search
the code point range lists for each property to decide if a code point
has that property. However, we iterate over both the code points and
the sorted property ranges in ascending order. This means we never need
to search code point ranges that lie below the current code point. We
can even remove those ranges as we go, since no future code point will
fall inside them.
On my machine, this reduces the run time of GenerateUnicodeData from
3.4 seconds to 1.2 seconds.
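A minimal sketch of the idea (simplified; the actual generator code
differs):

#include <cstddef>
#include <cstdint>
#include <vector>

struct CodePointRange {
    uint32_t first { 0 };
    uint32_t last { 0 };
};

// `ranges` is sorted and `code_point` only increases across calls, so
// the cursor never moves backwards: ranges we have passed are skipped
// once and never examined again.
static bool has_property(std::vector<CodePointRange> const& ranges,
    size_t& cursor, uint32_t code_point)
{
    while (cursor < ranges.size() && ranges[cursor].last < code_point)
        ++cursor;
    return cursor < ranges.size() && ranges[cursor].first <= code_point;
}

This turns the per-code-point cost from repeated binary searches into
amortized constant-time work.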
We currently produce a single table for all categories of code point
properties (GeneralCategory, Script, etc.). Each row contains a field
indicating the range of code points to which that property applies. At
runtime, we then do a binary search through that table to decide if a
code point has a property.
This changes our approach to generate a 2-stage lookup table for each of
those categories. There is an in-depth explanation of these tables above
the new `create_code_point_tables` method. The end effect is that code
point property lookup is reduced from a binary search to constant-time
array lookups.
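A minimal sketch of the layout (block size and table names here are
illustrative; the generated code differs):

#include <cstdint>

// Stage 1: one entry per 256-code-point block (0x110000 / 256 = 0x1100
// blocks), mapping each block to an index into stage 2. Identical
// blocks are deduplicated, so many entries point at the same block.
static constexpr uint16_t s_stage1[0x1100] = { /* generated */ };

// Stage 2: the deduplicated 256-entry blocks of per-code-point values.
static constexpr bool s_stage2[][256] = { { /* generated */ } };

static constexpr bool has_word_break_property(uint32_t code_point)
{
    // Two constant-time array lookups replace the old binary search.
    return s_stage2[s_stage1[code_point >> 8]][code_point & 0xFF];
}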
In total, this change:
* Increases the size of libunicode.so from 2.7 MB to 2.9 MB.
* Reduces the runtime of the new benchmark test case added here from
3.576s to 1.020s (a 3.5x speedup).
* In a profile of resizing a TextEditor window with a 3MB file open,
  the time spent checking if a code point has a word break property
  drops from ~81% to ~56% of the profile.
The next commit will need a type from LibUnicode/CharacterTypes.h. To
avoid conflicts between that header's CodePointRange and the one that is
defined in the code generator, just use the public definition.
We started generating this data in commit 0505e03, but it was unused.
It's still not used, so let's remove it, rather than bloating the size
of libunicode.so with unused data. If we need it in the future, it's
trivial to add back.
Note we *have* always used the block name data from that commit, and
that is still present here.
Similar to POSIX read, the basic read and write functions of AK::Stream
do not have a lower limit on how much data they read or write (apart
from "none at all").
Rename the functions to "read some [data]" and "write some [data]" (with
"data" being omitted, since everything here is reading and writing data)
to make them sufficiently distinct from the functions that ensure the
entire buffer is used (which should be the go-to for most usages).
No functional changes, just a lot of new FIXMEs.
`Stream` will be qualified as `AK::Stream` until we remove the
`Core::Stream` namespace. `IODevice` now reuses the `SeekMode` that is
defined by `SeekableStream`, since defining its own would require us to
qualify it with `AK::SeekMode` everywhere.
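To illustrate the contract (a sketch with illustrative names, not the
actual AK::Stream API): a "read some" call may legally return fewer
bytes than requested, so a caller that needs the whole buffer loops:

#include <cstddef>
#include <cstdint>

// "Read some" contract, like POSIX read(): returns 0..size bytes.
size_t read_some(uint8_t* buffer, size_t size);

// The "whole buffer" variant loops until the buffer is filled,
// returning false if the stream ends first.
bool read_entire_buffer(uint8_t* buffer, size_t size)
{
    size_t total = 0;
    while (total < size) {
        size_t nread = read_some(buffer + total, size - total);
        if (nread == 0)
            return false; // end of stream before the buffer was full
        total += nread;
    }
    return true;
}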
Case folding rules have a similar mapping style to special casing rules,
where one code point may map to zero or more case folding rules. These
will be used for case-insensitive string comparisons. To see how case
folding can differ from other casing rules, consider "ß" (U+00DF):
>>> "ß".lower()
'ß'
>>> "ß".upper()
'SS'
>>> "ß".title()
'Ss'
>>> "ß".casefold()
'ss'
And remove links that aren't adding much value but will often get out of
date (i.e. links to UCD files, which are already all listed in
unicode_data.cmake).
This will make it easier to support both string types at the same time
while we convert code, and to track down the remaining uses.
One big exception is Value::to_string() in LibJS, where the name is
dictated by the ToString AO.
We have a new, improved string type coming up in AK (OOM aware, no null
state), and while it's going to use UTF-8, the name UTF8String is a
mouthful - so let's free up the String name by renaming the existing
class.
Making the old one have an annoying name will hopefully also help with
quick adoption :^)
We started hand-picking the smallest index type that fits a particular
generated array in commit 3ad159537e, to reduce the size of the
generated library.
Since then, the number of types using UniqueStorage has grown a ton,
creating a long list of types for which index types are manually picked.
When a new UCD/CLDR/TZDB is released, and the current index type no
longer fits the generated data, we fail to generate. Tracking down which
index caused the failure is a pretty annoying process.
Instead, we can just use size_t within the generators themselves, then
automatically pick the smallest index type the generated code needs.
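A minimal sketch of the automatic selection (illustrative; the
generator's actual helper may differ):

#include <cstddef>
#include <cstdint>
#include <string_view>

// Emit the smallest unsigned type that can represent every index the
// generated array will need.
static std::string_view smallest_index_type(size_t max_index)
{
    if (max_index <= UINT8_MAX)
        return "u8";
    if (max_index <= UINT16_MAX)
        return "u16";
    if (max_index <= UINT32_MAX)
        return "u32";
    return "u64";
}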
Previously the s_decomposition_mappings variable would refer to other
data in s_decomposition_mappings_data. This would cause thousands of
avoidable relocations at load time.
This saves about 128kB RAM for each process which uses LibUnicode.
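One common way to avoid such relocations (a sketch; the actual change
may differ in detail) is to store offsets into the flat data array
instead of pointers:

#include <cstdint>

// Before: each entry points into s_decomposition_mappings_data, so the
// dynamic loader must patch every pointer at load time, and the table
// lands in writable memory.
struct MappingBefore {
    uint32_t code_point;
    uint32_t const* mapping; // pointer => one relocation per entry
};

// After: entries store an offset and length into the flat array. No
// pointers means no relocations, so the table can be mapped read-only
// and shared between processes.
struct MappingAfter {
    uint32_t code_point;
    uint32_t mapping_offset;
    uint32_t mapping_length;
};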
The mappings are exposed via `Unicode::code_point_decomposition(u32)`
and `Unicode::code_point_decompositions()`, the latter being useful for
reverse searching a code point from its decomposition.
The normalization code does not make use of the `Quick_Check` properties
(https://www.unicode.org/reports/tr44/#Decompositions_and_Normalization),
so there are no quick-check optimizations.
Doesn't use them in libc headers so that those don't have to pull in
AK/Platform.h.
AK_COMPILER_GCC is set _only_ for gcc, not for clang too. (__GNUC__ is
defined in clang builds as well.) Using AK_COMPILER_GCC simplifies
things somewhat.
AK_COMPILER_CLANG isn't as much of a win, other than that it's
consistent with AK_COMPILER_GCC.
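A sketch of the distinction (the actual AK/Platform.h definitions may
differ in detail):

// __clang__ must be checked first: clang also defines __GNUC__, so a
// plain __GNUC__ test would match both compilers. With the #elif,
// AK_COMPILER_GCC ends up defined only for real gcc.
#if defined(__clang__)
#    define AK_COMPILER_CLANG
#elif defined(__GNUC__)
#    define AK_COMPILER_GCC
#endif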
Without this, GenerateUnicodeData crashes when run during the build.
With this, `serenity.sh run` brings up a running SerenityOS.
Since GenerateUnicodeData doesn't take a lot of time to run, just
disable optimizations to work around the problem for now.
Works around #15449.
The UCD only cares about a few locales for special casing rules (az, lt,
and tr). Unfortunately, LibUnicode cannot use LibLocale once the
libraries are separate because LibLocale will need to use LibUnicode for
many more things; thus there would be a circular dependency. Instead,
just generate the small enum needed for this one use case.
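Roughly the shape of that generated enum (names here are illustrative;
the generator's output may differ):

// Only the locales with language-sensitive casing rules in the UCD
// need entries.
enum class Locale {
    None,
    Azerbaijani, // az
    Lithuanian,  // lt
    Turkish,     // tr
};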
To prepare for placing all CLDR generated data in a new library,
LibLocale, this moves the code generators for the CLDR data to the
LibLocale subfolder.
Similar to commit becec35, our code point display name data was a large
list of StringViews. RLE can be used here as well to move about 32 MB
from the initialized data section to the read-only section.
Some of the refactoring to store strings as indices into an RLE array
also lets us clean up some of the code point name generators.
Each of these strings would previously rely on StringView's char const*
constructor overload, which would call __builtin_strlen on the string.
Since we now have operator ""sv, we can replace these with much simpler
versions. This opens the door to removing StringView(char const*).
No functional changes.
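For illustration (a sketch; AK's StringView and its ""sv operator
behave analogously to the std ones shown here):

#include <string_view>
using namespace std::string_view_literals;

// Before: construction from char const* measures the string at run
// time (AK's overload calls __builtin_strlen).
std::string_view before { "Latin Extended-A" };

// After: the ""sv literal bakes the length in at compile time.
auto after = "Latin Extended-A"sv;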
This commit has no behavior changes.
In particular, this does not fix any of the wrong uses of the previous
default parameter (which used to be 'false', meaning "only replace the
first occurrence in the string"). It simply replaces the default uses
with String::replace(..., ReplaceMode::FirstOnly), leaving them
incorrect.
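A sketch of the resulting API shape (ReplaceMode::FirstOnly comes from
the text above; the rest is illustrative):

// The old bool parameter defaulted to false ("first match only"); an
// enum makes every call site spell out its intent.
enum class ReplaceMode {
    FirstOnly, // previously `false`, the old default
    All,       // previously `true`
};

// Before: string.replace(needle, replacement);  // implicitly first-only
// After:  string.replace(needle, replacement, ReplaceMode::FirstOnly);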
Similar reasoning to making Core::Stream::read() return Bytes, except
that every user of read_line() creates a StringView from the result, so
let's just return one right away.