Commit graph

65 commits

Author SHA1 Message Date
Timothy Flynn
9220a89d2f CI+LibUnicode: Remove the UCD from the system 2024-06-22 14:56:39 +02:00
Timothy Flynn
069bed5d47 LibUnicode+LibGfx: Remove superfluous emoji metadata
For SerenityOS, we parse emoji metadata from the UCD to learn emoji
groups, subgroups, names, etc. We used this information only in the
emoji picker dialog. It is entirely unused within Ladybird.

This removes our dependence on the UCD emoji file, as we no longer
need any of its information. All we need to know is the file path to
our custom emoji, which we get from Meta/emoji-file-list.txt.
2024-06-22 14:56:39 +02:00
Timothy Flynn
aa3a30870b LibUnicode: Replace code point bidirectional classes with ICU 2024-06-22 14:56:39 +02:00
Timothy Flynn
e77dafc987 LibUnicode: Replace code point scripts and script extensions with ICU 2024-06-22 14:56:39 +02:00
Timothy Flynn
986ff984cc LibUnicode: Replace code point general categories with ICU 2024-06-22 14:56:39 +02:00
Timothy Flynn
c804bda5fd LibUnicode: Replace code point properties with ICU 2024-06-22 14:56:39 +02:00
Timothy Flynn
ab56b8c8dc LibUnicode: Remove the locale-unaware text segmentation implementation 2024-06-20 13:46:54 +02:00
Timothy Flynn
5cf818e305 LibUnicode: Replace case transformations and comparison with ICUs
There are a couple of differences here due to using ICU:

1. Titlecasing behaves slightly differently. We previously transformed
   "123dollars" to "123Dollars", as we would use word segmentation to
   split a string into words, then transform the first cased character
   to titlecase. ICU doesn't go quite that far, and leaves the string
   as "123dollars". While this is a behavior change, the only user of
   this API is the `text-transform: capitalize;` CSS rule, and we now
   match the behavior of other browsers.

2. There isn't an API to compare strings with case insensitivity without
   allocating case-folded strings for both the left- and right-hand-side
   strings. Our implementation was previously allocation-free; however,
   in a benchmark, ICU is still ~1.4x faster.
2024-06-20 10:59:55 +02:00
Timothy Flynn
8d7216f4e0 LibUnicode: Replace IDNA ASCII conversion with ICU 2024-06-18 21:07:56 +02:00
Timothy Flynn
1feef17bf7 LibUnicode: Remove completely unused code point name & block name data
These were used for e.g. the Character Map on Serenity, but are not used
at all for Ladybird.
2024-06-18 21:07:56 +02:00
Timothy Flynn
4ce7f49f2f Meta: Use SHA-256 verification for downloaded UCD files 2024-05-24 08:47:26 -04:00
Timothy Flynn
91cd43a7ac Meta: Add a file containing a list of all emoji file names
And add a verification step to the emoji data generator to ensure all
emoji are listed in this file. This file will be used as a sources list
in both the CMake and GN build systems.

It is probably possible to generate this list. But in a first attempt,
the CMake code to set the file as a dependency of a pseudo target, which
would then parse the file and install the listed emoji was getting quite
verbose and complicated. So for now, let's just maintain this list.
2024-03-23 17:26:31 -04:00
Andrew Kaster
e0f990f1cb CMake: Don't download IDNA files when ENABLE_NETWORK_DOWNLOAD is OFF
Also tweak the debug message for the Emoji test file.
2023-12-13 10:51:27 -07:00
Simon Wanner
7d9fe44039 LibUnicode: Download and parse IDNA data 2023-12-10 08:04:58 -05:00
Timothy Flynn
139c575cc9 LibUnicode: Update to Unicode version 15.1.0
https://unicode.org/versions/Unicode15.1.0/

This update includes a new set of code point properties, Indic Conjunct
Break. These may have the values Consonant, Linker, or Extend. These are
used in text segmentation to prevent breaking on some extended grapheme
cluster sequences.
2023-09-15 18:30:26 +02:00
Andrew Kaster
941a9846a3 Meta: Assume files already extracted for ENABLE_NETWORK_DOWNLOADS=OFF
This allows external meta build systems to extract a cached archive into
our Cache directory without having to also copy the .tar.gz file.
2023-08-10 20:10:05 -06:00
Timothy Flynn
fd1fbad1d2 LibGfx+LibUnicode: Support specifying the path to search for emoji
Similar to the FontDatabase, this will be needed for Ladybird to find
emoji images. We now generate just the file name of emoji image in
LibUnicode, and look for that file in the specified path (defaulting to
/res/emoji) at runtime.
2023-03-01 14:54:16 +00:00
Timothy Flynn
8c38d46c1a LibUnicode: Generate the path to emoji images alongside emoji data
This will provide for quicker emoji lookups, rather than having to
discover and allocate these paths at runtime before we find out if they
even exist.
2023-02-24 19:48:47 +01:00
Timothy Flynn
8f2589b3b0 LibUnicode: Parse and generate case folding code point data
Case folding rules have a similar mapping style as special casing rules,
where one code point may map to zero or more case folding rules. These
will be used for case-insensitive string comparisons. To see how case
folding can differ from other casing rules, consider "ß" (U+00DF):

    >>> "ß".lower()
    'ß'

    >>> "ß".upper()
    'SS'

    >>> "ß".title()
    'Ss'

    >>> "ß".casefold()
    'ss'
2023-01-18 14:43:40 +00:00
Timothy Flynn
2334b4cebd Meta: Move UCD/CLDR/TZDB downloaded artifacts to Build/caches
They currently reside under Build/<arch>, meaning that they would be
redownloaded for each architecture/toolchain build combo. Move them to a
location that can be re-used for all builds.
2022-12-24 09:46:28 -05:00
Timothy Flynn
ed84a6f6ee LibUnicode: Use www.unicode.org domain to download emoji-test.txt
The non-www domain does not appear to be available now. We use the www
domain for UCD.zip already.

Co-authored-by: Stephan Unverwerth <s.unverwerth@serenityos.org>
2022-12-20 12:09:16 -05:00
Timothy Flynn
bd592480e4 Meta: Replace Bash script for generating emoji.txt with C++ generator
We currently have two build-time parsers for the UCD's emoji-test.txt
file. To prepare for future changes, this removes the Bash parser and
moves its functionality to the newer C++ parser.
2022-10-27 12:59:56 +02:00
Timothy Flynn
9fad23018a Meta: Remove unused "prefix" variable from invoke_generator() helper
This became unused after 1ae0cfd.
2022-10-16 21:16:48 +02:00
Andrew Kaster
1ae0cfd08b CMake+Userland: Use CMakeLists from Userland to build Lagom Libraries
Also do this for Shell.

This greatly simplifies the CMakeLists in Lagom, replacing many glob
patterns with a big list of libraries. There are still a few special
libraries that need some help to conform to the pattern, like LibELF and
LibWebView.

It also lets us remove essentially all of the Serenity or Lagom binary
directory detection logic from code generators, as now both projects
directories enter the generator logic from the same place.
2022-10-16 16:36:39 +02:00
Timothy Flynn
51854e345a LibUnicode: Update to Unicode version 15.0.0
https://unicode.org/versions/Unicode15.0.0/
2022-09-21 14:04:22 +01:00
Timothy Flynn
b7ef36aa36 LibUnicode: Parse and generate custom emoji added for SerenityOS
Parse emoji from emoji-serenity.txt to allow displaying their names and
grouping them together in the EmojiInputDialog.

This also adds an "Unknown" value to the EmojiGroup enum. This will be
useful for emoji that aren't found in the UCD, or for when UCD downloads
are disabled.
2022-09-11 20:33:57 +01:00
Timothy Flynn
b61eca0a1e LibUncode: Parse and generate emoji code point data
According to TR #51, the "best definition of the full set [of emojis] is
in the emoji-test.txt file". This defines not only the emoji themselves,
but the order in which they should be displayed, and what "group" of
emojis they belong to.
2022-09-08 23:12:31 +01:00
Timothy Flynn
89d1813b5d LibUnicode: Move CLDR data generators to a LibLocale subfolder
To prepare for placing all CLDR generated data in a new library,
LibLocale, this moves the code generators for the CLDR data to the
LibLocale subfolder.
2022-09-05 14:37:16 -04:00
Tim Schumacher
854792c340 Meta: Don't generate emoji.txt into the source tree 2022-09-05 09:50:31 -04:00
Timothy Flynn
2e0b20ef01 Meta: Only run the emoji generator for Serenity builds
It is not needed on Lagom, and was incidentally run twice.
2022-08-23 19:03:43 +01:00
Timothy Flynn
6dd8161002 Meta: Ensure the emoji generator depends on its own script
If the script changes, it better be re-run.
2022-08-23 19:03:43 +01:00
Timothy Flynn
d86b25c460 Meta: Move downloading of emoji-test.txt to unicode_data.cmake
The current emoji_txt.cmake does not handle download errors (which were
a common source of issues in the build problems channel) or Unicode
versioning. These are both handled by unicode_data.cmake. Move the
download to unicode_data.cmake so that we can more easily handle next
month's Unicode 15 release.
2022-08-22 16:00:29 +01:00
Timothy Flynn
ea78bac36d LibUnicode: Parse and generate per-locale plural rules from the CLDR
Plural rules in the CLDR are of the form:

"cs": {
    "pluralRule-count-one": "i = 1 and v = 0 @integer 1",
    "pluralRule-count-few": "i = 2..4 and v = 0 @integer 2~4",
    "pluralRule-count-many": "v != 0 @decimal 0.0~1.5, 10.0, 100.0 ...",
    "pluralRule-count-other": "@integer 0, 5~19, 100, 1000, 10000 ..."
}

The syntax is described here:
https://unicode.org/reports/tr35/tr35-numbers.html#Plural_rules_syntax

There are up to 2 sets of rules for each locale, a cardinal set and an
ordinal set. The approach here is to generate a C++ function for each
set of rules. Each condition in the rules (e.g. "i = 1 and v = 0") is
transpiled to a C++ if-statement within its function. Then lookup tables
are generated to match locales to their generated functions.

NOTE: -Wno-parentheses-equality is added to the LibUnicodeData compile
flags because the generated plural rules have lots of extra parentheses
(because e.g. we need to selectively negate and combine rules). The code
to generate only exactly the right number of parentheses is quite hairy,
so this just tells the compiler to ignore the extras.
2022-07-08 11:51:54 +02:00
Timothy Flynn
1f2542247f LibUnicode: Upgrade to CLDR version 41.0.0
Release notes: https://cldr.unicode.org/index/downloads/cldr-41

Note that the HourCycleRegion enum now contains 272 entires, thus needs
to be bumped from u8 to u16.
2022-04-07 08:29:10 -04:00
Timothy Flynn
8a46794ff8 LibUnicode: Replace individual UCD file downloads with single UCD.zip
Instead of downloading nearly 20 files individually, we can download a
single .zip file similar to how we download a single CLDR .zip. This is
to reduce the number of connections/downloads to/from unicode.org.
2022-04-06 17:12:08 -07:00
Brian Gianforcaro
66e7ac1954 Meta: Error out on find_program errors with CMake less than 3.18
We have seen some cases where the build fails for folks, and they are
missing unzip/tar/gzip etc. We can catch some of these in CMake itself,
so lets make sure to handle that uniformly across the build system.

The REQUIRED flag to `find_program` was only added on in CMake 3.18 and
above, so we can't rely on that to actually halt the program execution.
2022-03-19 15:01:22 -07:00
Timothy Flynn
d0fc61e79b LibUnicode: Extract the BCP 47 package from the CLDR
This package was originally meant to be included in CLDR version 40, but
was missed in their release scripts. This has been resolved:
https://unicode-org.atlassian.net/browse/CLDR-15158

Unfortunately, the CLDR was re-released with the same version number. So
to bust the build's CLDR cache, change the "version" used to detect that
we need to redownload the CLDR.
2022-02-16 07:23:07 -05:00
thankyouverycool
0505e031f1 Meta+LibUnicode: Download and parse Unicode block properties
This parses Blocks.txt for CharacterType properties and creates
a global display array for use in apps.
2022-02-15 10:13:19 -05:00
Idan Horowitz
2d50c08f34 LibUnicode: Download and parse {Grapheme,Word,Sentence} break props 2022-01-31 21:05:04 +02:00
Timothy Flynn
27eda77c97 LibUnicode: Create a nearly empty generator for relative-time formatting
This sets up the generator plumbing to create the relative-time data
files. This data could probably be included in the date-time generator,
but that generator is large enough that I'd rather put this tangentially
related data in its own file.
2022-01-27 21:16:44 +00:00
Timothy Flynn
c7ef86f5d9 Meta: Download UCD and CLDR data with fallible download function 2022-01-26 00:22:53 +00:00
Timothy Flynn
c5138f0f2b LibUnicode: Parse number system digits from the CLDR
We had a hard-coded table of number system digits copied from ECMA-402.
Turns out these digits are in the CLDR, so let's parse the digits from
there instead of hard-coding them.
2022-01-12 10:49:07 +01:00
Timothy Flynn
9ba386a7bb Meta: Move invoke_generator to utils.cmake 2022-01-08 12:45:34 +01:00
Timothy Flynn
d5f14b5ff9 Meta: Move remove_unicode_data_if_version_changed to utils.cmake
This function will be used by the time zone database parser. Move it to
the common utilities file, and rename it remove_path_if_version_changed
to be more generic.
2022-01-08 12:45:34 +01:00
Timothy Flynn
71903ea7e1 LibUnicode: Parse and generate calendar (ca) Unicode keywords
Also removes a few fly-by "StringView x = nullptr;" unnecessary
initializers.
2021-11-29 22:48:46 +00:00
Timothy Flynn
48ce72e472 LibUnicode: Parse and generate regional hour cycles
Unlike most data in the CLDR, hour cycles are not stored on a per-locale
basis. Instead, they are keyed by a string that is usually a region, but
sometimes is a locale. Therefore, given a locale, to determine the hour
cycles for that locale, we:

    1. Check if the locale itself is assigned hour cycles.
    2. If the locale has a region, check if that region is assigned hour
       cycles.
    3. Otherwise, maximize that locale, and if the maximized locale has
       a region, check if that region is assigned hour cycles.
    4. If the above all fail, fallback to the "001" region.

Further, each locale's default hour cycle is the first assigned hour
cycle.
2021-11-29 22:48:46 +00:00
Timothy Flynn
5c57341672 LibUnicode: Create a nearly empty generator for date-time formatting
Similar to number formatting, the data for date-time formatting will be
located in its own generated file. This extracts the cldr-dates package
from the CLDR and sets up the generator plumbing to create the date-time
data files.
2021-11-29 22:48:46 +00:00
Timothy Flynn
1539ed12f1 LibUnicode: Functionalize the Unicode generator CMake commands
Makes it a bit easier to add a new generator.
2021-11-23 22:58:05 +01:00
Ben Wiederhake
b06b54772e Meta+LibUnicode: Provide code point names through library 2021-11-20 00:31:55 +01:00
Timothy Flynn
4b535ce1c8 LibUnicode: Stop passing the cldr-core package to UnicodeNumberFormat
This is no longer needed now that this generator isn't parsing the
default-content locales.
2021-11-19 11:45:35 +01:00