Commit graph

27 commits

Author SHA1 Message Date
Timothy Flynn
2d487e4e4c LibUnicode+LibJS: Move text segmentation algorithms to their own files
These algorithms are quite chonky, and more APIs around them are to be
added, so let's move them to their own files for a bit of organization.
2023-02-15 12:36:47 +01:00
Timothy Flynn
6fcc1c7426 AK+LibUnicode: Provide Unicode-aware String case transformations
Since AK can't refer to LibUnicode directly, the strategy here is that
if you need case transformations, you can link LibUnicode and receive
them. If you try to use either of these methods without linking it, then
you'll of course get a linker error (note we don't do any fallbacks to
e.g. ASCII case transformations). If you don't need these methods, you
don't have to link LibUnicode.
2023-01-09 19:23:46 -07:00
Timothy Flynn
12f6793223 LibUnicode: Move Unicode-aware case transformations to a helper file
These will be needed by AK::String as well, so move them to a helper
file where they can be re-used.
2023-01-09 19:23:46 -07:00
Tim Schumacher
ce2f1b845f Everywhere: Mark dependencies of most targets as PRIVATE
Otherwise, we end up propagating those dependencies into targets that
link against that library, which creates unnecessary link-time
dependencies.

Also included are changes to readd now missing dependencies to tools
that actually need them.
2022-11-01 14:49:09 +00:00
Andrew Kaster
b8e51425e9 Lagom+CMake: Propagate dependencies for generated custom targets
We have logic for serenity_generated_sources which works well for source
files that are specified in GENERATED_SOURCES prior to calling
serenity_lib or serenity_bin. However, code generated with
invoke_generator, and the LibWeb generators do not always follow the
pattern of the IDL and GML files.

For the LibWeb generators, we can just add_dependencies to LibWeb at the
time we declare the generate_Foo custom target. However for LibLocale,
LibTimeZone, and LibUnicode, we don't have the name of the target
available, so export the name in a variable to set into
GENERATED_SOURCES.

To make this work for Lagom, we need to make sure that lagom_lib and
serenity_bin in Lagom/CMakeLists.txt call serenity_generated_sources on
the target.

This enables the Xcode generator on macOS hosts, at least for Lagom.
2022-10-17 15:55:55 +02:00
matcool
70d0c1616f LibUnicode: Add decomposition mappings and Unicode normalization
The mappings are exposed via `Unicode::code_point_decomposition(u32)`
and `Unicode::code_point_decompositions()`, the latter being useful for
reverse searching a code point from its decomposition.

The normalization code does not make use of `Quick_Check` props (https://www.unicode.org/reports/tr44/#Decompositions_and_Normalization),
meaning no quick check optimizations.
2022-10-06 08:24:39 -04:00
Timothy Flynn
b61eca0a1e LibUncode: Parse and generate emoji code point data
According to TR #51, the "best definition of the full set [of emojis] is
in the emoji-test.txt file". This defines not only the emoji themselves,
but the order in which they should be displayed, and what "group" of
emojis they belong to.
2022-09-08 23:12:31 +01:00
Timothy Flynn
9e860d973e LibLocale: Move locale source files to the LibLocale library
Everything is now setup to create the LibLocale library and link it
where needed.
2022-09-05 14:37:16 -04:00
Timothy Flynn
43a3471298 LibLocale: Move locale source files to the LibLocale folder
These are still included in LibUnicode, but this updates their location
and the include paths of other files which include them.
2022-09-05 14:37:16 -04:00
Timothy Flynn
1e0276f541 LibLocale+LibUnicode: Move generated CLDR data files to LibLocale folder
They are still included into LibUnicode, but this moves their generated
location to be under LibLocale.
2022-09-05 14:37:16 -04:00
Timothy Flynn
fc8bf7ac3e LibUnicode+Userland: Migrate generated CLDR data to LibLocaleData
Currently, LibUnicodeData contains the generated UCD and CLDR data. Move
the UCD data to the main LibUnicode library, and rename LibUnicodeData
to LibLocaleData. This is another prepatory change to migrate to
LibLocale.
2022-09-05 14:37:16 -04:00
Timothy Flynn
89d1813b5d LibUnicode: Move CLDR data generators to a LibLocale subfolder
To prepare for placing all CLDR generated data in a new library,
LibLocale, this moves the code generators for the CLDR data to the
LibLocale subfolder.
2022-09-05 14:37:16 -04:00
Timothy Flynn
ea78bac36d LibUnicode: Parse and generate per-locale plural rules from the CLDR
Plural rules in the CLDR are of the form:

"cs": {
    "pluralRule-count-one": "i = 1 and v = 0 @integer 1",
    "pluralRule-count-few": "i = 2..4 and v = 0 @integer 2~4",
    "pluralRule-count-many": "v != 0 @decimal 0.0~1.5, 10.0, 100.0 ...",
    "pluralRule-count-other": "@integer 0, 5~19, 100, 1000, 10000 ..."
}

The syntax is described here:
https://unicode.org/reports/tr35/tr35-numbers.html#Plural_rules_syntax

There are up to 2 sets of rules for each locale, a cardinal set and an
ordinal set. The approach here is to generate a C++ function for each
set of rules. Each condition in the rules (e.g. "i = 1 and v = 0") is
transpiled to a C++ if-statement within its function. Then lookup tables
are generated to match locales to their generated functions.

NOTE: -Wno-parentheses-equality is added to the LibUnicodeData compile
flags because the generated plural rules have lots of extra parentheses
(because e.g. we need to selectively negate and combine rules). The code
to generate only exactly the right number of parentheses is quite hairy,
so this just tells the compiler to ignore the extras.
2022-07-08 11:51:54 +02:00
Timothy Flynn
789f093b2e LibUnicode: Parse and generate relative-time format patterns
Relative-time format patterns are of one of two forms:

    * Tensed - refer to the past or the future, e.g. "N years ago" or
      "in N years".
    * Numbered - refer to a specific numeric value, e.g. "in 1 year"
      becomes "next year" and "in 0 years" becomes "this year".

In ECMA-402, tensed and numbered refer to the numeric formatting options
of "always" and "auto", respectively.
2022-01-27 21:16:44 +00:00
Timothy Flynn
0a4430fc41 LibJS+LibTimeZone+LibUnicode: Remove direct linkage to LibTimeZone
This is no longer needed now that LibTimeZone is included within LibC.
Remove the direct linkage so that others do not mistakenly copy-paste
the CMakeLists text elsewhere.
2022-01-23 12:48:26 +00:00
Timothy Flynn
8d35563f28 LibUnicode: Implement TR-35's localized GMT offset formatting
This adds an API to use LibTimeZone to convert a time zone such as
"America/New_York" to a GMT offset string like "GMT-5" (short form) or
"GMT-05:00" (long form).
2022-01-11 23:56:35 +01:00
Timothy Flynn
498b741434 LibUnicode: Use LibTimeZone's list of time zone names
LibUnicode no longer needs to generate a list of time zone names that it
parsed from metaZones.json. We can defer to the TZDB for a golden list
of time zones.
2022-01-08 12:45:34 +01:00
Timothy Flynn
1116a29c19 LibUnicode: Remove now unused Unicode symbol loader
All generated sources are now linked via weak symbols.
2022-01-04 22:49:43 +00:00
Timothy Flynn
c417374dd6 LibUnicode: Remove linkage from LibUnicode to LibUnicodeData
LibUnicodeData can now be loaded dynamically at runtime.
2021-12-21 13:09:49 -08:00
Timothy Flynn
3fd53baa25 LibUnicode: Dynamically load the generated UnicodeData symbols
The generated data for libunicodedata.so is quite large, and loading it
is a price paid by nearly every application by way of depending on
LibRegex. In order to defer this cost until an application actually uses
one of the surrounding APIs, dynamically load the generated symbols.

To be able to load the symbols dynamically, the generated methods must
have demangled names. Typically, this is accomplished with `extern "C"`
blocks. The clang toolchain complains about this here because the types
returned from the generators are strictly C++ types. So to demangle the
names, we use the asm() compiler directive to manually define a symbol
name; the caveat is that we *must* be sure the symbols are unique. As an
extra precaution, we prefix each symbol name with "unicode_". For more
details, see: https://gcc.gnu.org/onlinedocs/gcc/Asm-Labels.html

This symbol loader used in this implementation provides the additional
benefit of removing many [[maybe_unused]] attributes from the LibUnicode
methods. Internally, if ENABLE_UNICODE_DATABASE_DOWNLOAD is OFF, the
loader is able to stub out the function pointers it returns.

Note that as of this commit, LibUnicode is still directly linked against
LibUnicodeData. This commit is just a first step towards removing that.
2021-12-21 13:09:49 -08:00
Timothy Flynn
92233660b8 LibUnicode: Compile generated sources optimized for size
This breaks LibUnicode into two libraries: LibUnicode containing the
public APIs for accessing the library, and LibUnicodeData containing the
generated source files. LibUnicodeData has compile options optimized for
size, which save about 1MB of data in total.
2021-12-15 13:26:03 +00:00
Timothy Flynn
f471ecdbe9 LibUnicode: Parse and generate date, time, and date-time format patterns 2021-11-29 22:48:46 +00:00
Timothy Flynn
914675e826 LibJS+LibUnicode: Separate number formatting methods from Locale.h
Currently, we generate separate data files for locale and number format
related tables/methods, but provide public accessors for all of the data
in one Locale.h file. Rather than continuing this trend for date-time,
relative time, etc. formatting, it's a bit easier to reason about if the
public accessors are also in separate files.
2021-11-29 22:48:46 +00:00
Timothy Flynn
e6334cb856 LibUnicode: Add some data related to currency codes
This data is published under ISO-4217 as an XML file. Since we can't
parse XML files yet, and the data isn't very large, it was translated to
C++ manually here.
2021-09-11 11:05:50 +01:00
Andrew Kaster
e88761b2b9 Meta+LibUnicode: Move unicode_data helper to Meta/CMake
Moving this helper CMake file to the centralized Meta/CMake folder helps
to get a better grasp on what extra files are required for the build,
and what files are generated.

While we're at it, don't use add_compile_definitions for
ENABLE_UNICODE_DATA, which only needs to be seen by LibUnicode sources.
2021-08-28 08:44:17 +01:00
Timothy Flynn
b7a95cba65 LibUnicode: Implement grammar validators for Unicode TR-35
ECMA-402 requires validating user input against the EBNF grammar for
Unicode locales described in TR-35: https://www.unicode.org/reports/tr35

This commit adds validators for that grammar, as well as other helper to
e.g. canonicalize a locale string.
2021-08-26 22:04:09 +01:00
Timothy Flynn
4dda3edc9e LibUnicode: Introduce a Unicode library for interacting with UCD files
The Unicode standard publishes the Unicode Character Database (UCD) with
information about every code point, such as each code point's upper case
mapping. LibUnicode exists to download and parse UCD files at build time
and to provide accessors to that data.

As a start, LibUnicode includes upper- and lower-case code point
converters.
2021-07-26 17:03:55 +01:00