Commit graph

17 commits

Author SHA1 Message Date
Timothy Flynn
7e6ad172a4 LibUnicode: Support code point names that apply to ranges of code points
For example, consider the following adjacent entries in UnicodeData.txt:

    3400;<CJK Ideograph Extension A, First>;Lo;0;L;;;;;N;;;;;
    4DBF;<CJK Ideograph Extension A, Last>;Lo;0;L;;;;;N;;;;;

Our current implementation would assign the display name "CJK Ideograph
Extension A" to code points U+3400 & U+4DBF, but not to the code points
in between. Not only should those code points be assigned a name, but
the Unicode spec also has formatting rules on what the names should be
(the names for these ranged code points are not as they appear in
UnicodeData.txt).

The spec also defines names for code point ranges that actually are
listed individually in UnicodeData.txt. For example:

    2F800;CJK COMPATIBILITY IDEOGRAPH-2F800;Lo;0;L;4E3D;;;;N;;;;;
    2F801;CJK COMPATIBILITY IDEOGRAPH-2F801;Lo;0;L;4E38;;;;N;;;;;
    2F802;CJK COMPATIBILITY IDEOGRAPH-2F802;Lo;0;L;4E41;;;;N;;;;;

Code points are only coalesced into a range if all fields after the name
are equivalent. Our parser will insert the range and its name formatting
pattern when it comes across the first code point in that range, then
ignore other code points in that range. This reduces the number of names
we generated by nearly 2,000.
2021-11-30 11:24:02 +01:00
Timothy Flynn
f2f4980f15 LibUnicode: Remove unused field from UnicodeData generator 2021-11-30 11:24:02 +01:00
Timothy Flynn
88dbf3c348 LibUnicode: Port GenerateUnicodeData to ErrorOr and LibMain
Also store command line arguments as StringViews rather than pointers.
2021-11-23 22:58:05 +01:00
Ben Wiederhake
b06b54772e Meta+LibUnicode: Provide code point names through library 2021-11-20 00:31:55 +01:00
Timothy Flynn
9d1519e21c LibUnicode: Move GenerateUnicodeData's Alias struct to generator header
This will be used for locale aliases as well. Also rename the "property"
field in this struct to "name", as it no longer is only used for
property aliases.
2021-11-19 11:45:35 +01:00
Andreas Kling
8b1108e485 Everywhere: Pass AK::StringView by value 2021-11-11 01:27:46 +01:00
Timothy Flynn
f91d63af83 LibUnicode: Generate enum/alias from-string methods without a HashMap
The *_from_string() and resolve_*_alias() generated methods are the last
remaining users of HashMap in the LibUnicode generated files (read: the
last methods not using compile-time structures). This converts these
methods to use an array containing pairs of hash values to the desired
lookup value.

Because this code generation is the same between GenerateUnicodeData.cpp
and GenerateUnicodeLocale.cpp, this adds a GeneratorUtil.h header to the
LibUnicode generators to contain the method that generates the methods.
2021-10-13 16:38:51 +02:00
Timothy Flynn
79707d83d3 LibUnicode: Stop generating large UnicodeData hash map
The data in this hash map is now available by way of much smaller arrays
and is now unused.
2021-10-10 13:49:37 +02:00
Timothy Flynn
d83b262e64 LibUnicode: Generate standalone compile-time array for combining class 2021-10-10 13:49:37 +02:00
Timothy Flynn
9f83774913 LibUnicode: Generate standalone compile-time array for special casing
There are only 112 code points with special casing rules, so this array
is quite small (compared to the size 34,626 UnicodeData hash map that is
also storing this data). Removing all casing rules from UnicodeData will
happen in a subsequent commit.
2021-10-10 13:49:37 +02:00
Timothy Flynn
da4b8897a7 LibUnicode: Generate standalone compile-time arrays for simple casing
Currently, all casing information (simple and special) are stored in a
compile-time array of size 34,626, then statically copied to a hash map
at runtime. In an effort to reduce the resulting memory usage, store the
simple casing rules in standalone compile-time arrays. The uppercase map
is size 1,450 and the lowercase map is size 1,433. Any code point not in
a map will implicitly have an identity mapping.
2021-10-10 13:49:37 +02:00
Nico Weber
9ec9886b04 Meta: Fix typos 2021-10-01 01:06:40 +01:00
Timothy Flynn
c8dbcdb0bc LibUnicode: Do not compare generated file contents before writing
This is now covered by unicode_data.cmake after the superbuild changes.
2021-09-30 17:37:57 +01:00
Idan Horowitz
6704961c82 AK: Replace the mutable String::replace API with an immutable version
This removes the awkward String::replace API which was the only String
API which mutated the String and replaces it with a new immutable
version that returns a new String with the replacements applied. This
also fixes a couple of UAFs that were caused by the use of this API.

As an optimization an equivalent StringView::replace API was also added
to remove an unnecessary String allocations in the format of:
`String { view }.replace(...);`
2021-09-11 20:36:43 +03:00
Timothy Flynn
077a693de6 LibUnicode: Sort special casing array by locale specificity
This is to simply the Default Case Conversion implementation. Otherwise,
the implementation would need to determine which special casing rule to
apply, instead of just picking the first match.
2021-09-06 15:24:27 +01:00
Timothy Flynn
91db61ae8d LibUnicode: Generate canonical combining class in Unicode data
Will be used by special casing rules.
2021-09-06 15:24:27 +01:00
Andrew Kaster
63956b36d0 Everywhere: Move all host tools into the Lagom/Tools subdirectory
This allows us to remove all the add_subdirectory calls from the top
level CMakeLists.txt that referred to targets linking LagomCore.

Segregating the host tools and Serenity targets helps us get to a place
where the main Serenity build can simply use a CMake toolchain file
rather than swapping all the compiler/sysroot variables after building
host libraries and tools.
2021-08-28 08:44:17 +01:00
Renamed from Userland/Libraries/LibUnicode/CodeGenerators/GenerateUnicodeData.cpp (Browse further)