Commit graph

73 commits

Author SHA1 Message Date
Timothy Flynn
139c575cc9 LibUnicode: Update to Unicode version 15.1.0
https://unicode.org/versions/Unicode15.1.0/

This update includes a new set of code point properties, Indic Conjunct
Break. These may have the values Consonant, Linker, or Extend. These are
used in text segmentation to prevent breaking on some extended grapheme
cluster sequences.
2023-09-15 18:30:26 +02:00
Andreas Kling
8b936b5912 AK: Make SourceGenerator::set() infallible 2023-08-22 13:08:24 +02:00
Sam Atkins
0d021a63c7 LibUnicode: Generate data for bidirectional character types
This will let us examine code points to determine the rtl/ltr direction
of a piece of text.
2023-08-20 16:21:35 -04:00
Lucas CHOLLET
3f35ffb648 Userland: Prefer _string over _short_string
As `_string` can't fail anymore (since 3434412), there are no real
benefits to use the short variant in most cases.
2023-08-08 07:37:21 +02:00
Timothy Flynn
b91af3c6a0 LibUnicode: Remove a few generator tracking fields that are now unused
These were used to generate specialized tables. Now that those tables
have been migrated to general 2-stage lookup tables, these fields are
all unused.
2023-07-28 05:28:50 +02:00
Timothy Flynn
456211932f LibUnicode: Perform code point case conversion lookups in constant time
Similar to commit 0652cc4, we now generate 2-stage lookup tables for
case conversion information. Only about 1500 code points are actually
cased. This means that case information is rather highly compressible,
as the blocks we break the code points into will generally all have no
casing information at all.

In total, this change:

    * Does not change the size of libunicode.so (which is nice because,
      generally, the 2-stage lookup tables are expected to trade a bit
      of size for performance).

    * Reduces the runtime of the new benchmark test case added here from
      1.383s to 1.127s (about an 18.5% improvement).
2023-07-28 05:28:50 +02:00
Timothy Flynn
0ee133af90 LibUnicode: Separate code point case information into its own structure
There is no functional change here. This information will compose the
upcoming multistage casing tables in an upcoming patch. Extract it to
its own struct to prepare for that.
2023-07-28 05:28:50 +02:00
Timothy Flynn
a332a8ad19 LibUnicode: Prepare Unicode data generator for multistage casing tables
There is no functional change here. This just adjusts the changes made
in commit 0652cc4 to be a bit more generic for code point casing tables.
We currently only generate property tables, which boil down to a vector
of booleans. Casing tables will be a struct of varying types, so this
generalizes some of the generator to prepare for that ahead of time, to
make the upcoming casing patch smaller / easier to grok.
2023-07-28 05:28:50 +02:00
Timothy Flynn
3fae92eea2 LibUnicode: Search code point properties sequentially at compile time
When generating code point property tables, we currently binary search
the code point range lists for each property to decide if a code point
has that property. However, we are both iterating over the code points
and through the sorted properties in order. This means we do not need
to search code point ranges that are below the current code point at
all. We can even remove the code point ranges that fall below the
current code point, as we will not see a code point in those ranges
again.

On my machine, this reduces the run time of GenerateUnicodeData from
3.4 seconds to 1.2 seconds.
2023-07-28 05:28:50 +02:00
Timothy Flynn
0652cc48c0 LibUnicode: Perform code point property lookups in constant time
We currently produce a single table for all categories of code point
properties (GeneralCategory, Script, etc.). Each row contains a field
indicating the range of code points to which that property applies. At
runtime, we then do a binary search through that table to decide if a
code point has a property.

This changes our approach to generate a 2-stage lookup table for each of
those categories. There is an in-depth explanation of these tables above
the new `create_code_point_tables` method. The end effect is that code
point property lookup is reduced from a binary search to constant-time
array lookups.

In total, this change:

    * Increases the size of libunicode.so from 2.7 MB to 2.9 MB.

    * Reduces the runtime of the new benchmark test case added here from
      3.576s to 1.020s (a 3.5x speedup).

    * In a profile of resizing a TextEditor window with a 3MB file open,
      the runtime of checking if a code point has a word break property
      reduces from ~81% to ~56%.
2023-07-26 08:36:20 +02:00
Timothy Flynn
8f1d73abde LibUnicode: Use the public CodePointRange in the code generator
The next commit will need a type from LibUnicode/CharacterTypes.h. To
avoid conflicts between that header's CodePointRange and the one that is
defined in the code generator, just use the public definition.
2023-07-26 08:36:20 +02:00
Timothy Flynn
cb128dcf75 LibUnicode: Move the CodePointRangeComparator struct to a public header
Move it out of the generated code so that it may be used by the code
generator itself.
2023-07-26 08:36:20 +02:00
Timothy Flynn
c950f88611 LibUnicode: Stop generating Block property data
We started generating this data in commit 0505e03, but it was unused.
It's still not used, so let's remove it, rather than bloating the size
of libunicode.so with unused data. If we need it in the future, it's
trivial to add back.

Note we *have* always used the block name data from that commit, and
that is still present here.
2023-07-26 08:36:20 +02:00
Ben Wiederhake
5cfa883b9f LibUnicode: Explicitly mark HashMap copy 2023-05-19 22:33:57 +02:00
Lucas CHOLLET
8c34959b53 AK: Add the Input word to input-only buffered streams
This concerns both `BufferedSeekable` and `BufferedFile`.
2023-05-09 11:18:46 +02:00
gustrb
5141c86587 AK: Rename CaseInsensitiveStringViewTraits to reflect intent
Now it is called `CaseInsensitiveASCIIStringViewTraits`, so we can be
more specific about what data structure does it operate onto. ;)
2023-03-14 21:34:32 +00:00
Tim Schumacher
8032724574 CodeGenerators: Ensure that we always print the entire generated output 2023-03-13 15:16:20 +00:00
Tim Schumacher
d5871f5717 AK: Rename Stream::{read,write} to Stream::{read_some,write_some}
Similar to POSIX read, the basic read and write functions of AK::Stream
do not have a lower limit of how much data they read or write (apart
from "none at all").

Rename the functions to "read some [data]" and "write some [data]" (with
"data" being omitted, since everything here is reading and writing data)
to make them sufficiently distinct from the functions that ensure to
use the entire buffer (which should be the go-to function for most
usages).

No functional changes, just a lot of new FIXMEs.
2023-03-13 15:16:20 +00:00
Tim Schumacher
874c7bba28 LibCore: Remove Stream.h 2023-02-13 00:50:07 +00:00
Tim Schumacher
606a3982f3 LibCore: Move Stream-based file into the Core namespace 2023-02-13 00:50:07 +00:00
MacDue
63b11030f0 Everywhere: Use ReadonlySpan<T> instead of Span<T const> 2023-02-08 19:15:45 +00:00
Tim Schumacher
8464da1439 AK: Move Stream and SeekableStream from LibCore
`Stream` will be qualified as `AK::Stream` until we remove the
`Core::Stream` namespace. `IODevice` now reuses the `SeekMode` that is
defined by `SeekableStream`, since defining its own would require us to
qualify it with `AK::SeekMode` everywhere.
2023-01-29 19:16:44 -07:00
Timothy Flynn
8f2589b3b0 LibUnicode: Parse and generate case folding code point data
Case folding rules have a similar mapping style as special casing rules,
where one code point may map to zero or more case folding rules. These
will be used for case-insensitive string comparisons. To see how case
folding can differ from other casing rules, consider "ß" (U+00DF):

    >>> "ß".lower()
    'ß'

    >>> "ß".upper()
    'SS'

    >>> "ß".title()
    'Ss'

    >>> "ß".casefold()
    'ss'
2023-01-18 14:43:40 +00:00
Timothy Flynn
9226cf7272 LibUnicode: Rename a special casing variable name in the UCD generator
This name will soon be a bit ambiguous with a similar case folding
variable name.
2023-01-18 14:43:40 +00:00
Timothy Flynn
8d9fb898d7 LibUnicode: Update out-of-date spec links
And remove links that aren't adding much value but will often get out of
date (i.e. links to UCD files, which are already all listed in
unicode_data.cmake).
2023-01-18 14:43:40 +00:00
Timothy Flynn
b562348d31 LibUnicode: Generate simple case folding mappings for titlecase
Note we already generate the special case foldings for titlecase.
2023-01-16 18:33:44 -05:00
Timothy Flynn
12f6793223 LibUnicode: Move Unicode-aware case transformations to a helper file
These will be needed by AK::String as well, so move them to a helper
file where they can be re-used.
2023-01-09 19:23:46 -07:00
Thomas Queiroz
6debd967ba Lagom/CodeGenerators: Use HashMap::try_ensure_capacity 2022-12-10 14:29:46 +01:00
Linus Groh
57dc179b1f Everywhere: Rename to_{string => deprecated_string}() where applicable
This will make it easier to support both string types at the same time
while we convert code, and tracking down remaining uses.

One big exception is Value::to_string() in LibJS, where the name is
dictated by the ToString AO.
2022-12-06 08:54:33 +01:00
Linus Groh
6e19ab2bbc AK+Everywhere: Rename String to DeprecatedString
We have a new, improved string type coming up in AK (OOM aware, no null
state), and while it's going to use UTF-8, the name UTF8String is a
mouthful - so let's free up the String name by renaming the existing
class.
Making the old one have an annoying name will hopefully also help with
quick adoption :^)
2022-12-06 08:54:33 +01:00
Timothy Flynn
b2164ad979 Meta: Do not hard-code index types for UCD/CLDR/TZDB code generators
Hand-picking the smallest index type that fits a particular generated
array started with commit 3ad159537e. This
was to reduce the size of the generated library.

Since then, the number of types using UniqueStorage has grown a ton,
creating a long list of types for which index types are manually picked.
When a new UCD/CLDR/TZDB is released, and the current index type no
longer fits the generated data, we fail to generate. Tracking down which
index caused the failure is a pretty annoying process.

Instead, we can just use size_t while in the generators themselves, then
automatically pick the size needed for the generated code.
2022-11-18 17:00:51 +00:00
Gunnar Beutner
2d3567ee92 Meta+LibUnicode: Avoid relocations for static unicode data
Previously the s_decomposition_mappings variable would refer to other
data in s_decomposition_mappings_data. This would cause thousands of
avoidable relocations at load time.

This saves about 128kB RAM for each process which uses LibUnicode.
2022-11-06 17:34:06 +01:00
demostanis
3e8b5ac920 AK+Everywhere: Turn bool keep_empty to an enum in split* functions 2022-10-24 23:29:18 +01:00
Timothy Flynn
f08a979b96 LibUnicode: Remove GCC codegen workaround
Reverts commits:
ffbf5596cd
f190e394b3
2022-10-07 18:21:40 +01:00
Timothy Flynn
f38c68177b LibUnicode: Update code point ideographic replacements for Unicode 15 2022-10-07 18:17:40 +01:00
Andreas Kling
f190e394b3 LibUnicode: Let's use the GCC 11/12 workaround on all platforms
I seem to be getting some miscompiles on Linux as well, so let's make
the hitherto macOS-specific workaround universal.
2022-10-06 17:15:28 +02:00
matcool
70d0c1616f LibUnicode: Add decomposition mappings and Unicode normalization
The mappings are exposed via `Unicode::code_point_decomposition(u32)`
and `Unicode::code_point_decompositions()`, the latter being useful for
reverse searching a code point from its decomposition.

The normalization code does not make use of `Quick_Check` props (https://www.unicode.org/reports/tr44/#Decompositions_and_Normalization),
meaning no quick check optimizations.
2022-10-06 08:24:39 -04:00
Nico Weber
2af028132a AK+Everywhere: Add AK_COMPILER_{GCC,CLANG} and use them most places
Doesn't use them in libc headers so that those don't have to pull in
AK/Platform.h.

AK_COMPILER_GCC is set _only_ for gcc, not for clang too. (__GNUC__ is
defined in clang builds as well.) Using AK_COMPILER_GCC simplifies
things some.

AK_COMPILER_CLANG isn't as much of a win, other than that it's
consistent with AK_COMPILER_GCC.
2022-10-04 23:35:07 +01:00
Nico Weber
ffbf5596cd Lagom: Work around gcc codegen bug
Without this, GenerateUnicodeData crashes when run during the build.
With this, `serenity.sh run` brings up a running SerenityOS.
Since GenerateUnicodeData doesn't take a lot of time to run, just
disable optimizations to work around the problem for now.

Works around #15449.
2022-10-03 15:30:51 +01:00
Timothy Flynn
f082b6ae48 LibUnicode: Generate a separate Locale enumeration for special casing
The UCD only cares about a few locales for special casing rules (az, lt,
and tr). Unfortunately, LibUnicode cannot use LibLocale once the
libraries are separate because LibLocale will need to use LibUnicode for
many more things; thus there would be a circular dependency. Instead,
just generate the small enum needed for this one use case.
2022-09-05 14:37:16 -04:00
Timothy Flynn
ff48220dca Userland: Move files destined for LibLocale to the Locale namespace 2022-09-05 14:37:16 -04:00
Timothy Flynn
1e0276f541 LibLocale+LibUnicode: Move generated CLDR data files to LibLocale folder
They are still included into LibUnicode, but this moves their generated
location to be under LibLocale.
2022-09-05 14:37:16 -04:00
Timothy Flynn
89d1813b5d LibUnicode: Move CLDR data generators to a LibLocale subfolder
To prepare for placing all CLDR generated data in a new library,
LibLocale, this moves the code generators for the CLDR data to the
LibLocale subfolder.
2022-09-05 14:37:16 -04:00
Timothy Flynn
ca92e37ae0 LibUnicode: Generate code point display names with run-length encoding
Similar to commit becec35, our code point display name data was a large
list of StringViews. RLE can be used here as well to remove about 32 MB
from the initialized data section to the read-only section.

Some of the refactoring to store strings as indices into an RLE array
also lets us clean up some of the code point name generators.
2022-08-17 15:42:12 +01:00
sin-ack
3f3f45580a Everywhere: Add sv suffix to strings relying on StringView(char const*)
Each of these strings would previously rely on StringView's char const*
constructor overload, which would call __builtin_strlen on the string.
Since we now have operator ""sv, we can replace these with much simpler
versions. This opens the door to being able to remove
StringView(char const*).

No functional changes.
2022-07-12 23:11:35 +02:00
DexesTTP
7ceeb74535 AK: Use an enum instead of a bool for String::replace(all_occurences)
This commit has no behavior changes.

In particular, this does not fix any of the wrong uses of the previous
default parameter (which used to be 'false', meaning "only replace the
first occurence in the string"). It simply replaces the default uses by
String::replace(..., ReplaceMode::FirstOnly), leaving them incorrect.
2022-07-06 11:12:45 +02:00
Sam Atkins
d564cf1e89 LibCore+Everywhere: Make Core::Stream read_line() return StringView
Similar reasoning to making Core::Stream::read() return Bytes, except
that every user of read_line() creates a StringView from the result, so
let's just return one right away.
2022-04-16 13:27:51 -04:00
thankyouverycool
0505e031f1 Meta+LibUnicode: Download and parse Unicode block properties
This parses Blocks.txt for CharacterType properties and creates
a global display array for use in apps.
2022-02-15 10:13:19 -05:00
Timothy Flynn
a64a7940e4 LibUnicode: Port the UCD generator to the stream API 2022-02-14 11:39:46 -05:00
Idan Horowitz
2d50c08f34 LibUnicode: Download and parse {Grapheme,Word,Sentence} break props 2022-01-31 21:05:04 +02:00