29 commits

Author SHA1 Message Date
sin-ack
3f3f45580a Everywhere: Add sv suffix to strings relying on StringView(char const*)
Each of these strings would previously rely on StringView's char const*
constructor overload, which would call __builtin_strlen on the string.
Since we now have operator ""sv, we can replace these with much simpler
versions. This opens the door to being able to remove
StringView(char const*).

No functional changes.
2022-07-12 23:11:35 +02:00
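The difference described in this commit can be sketched with the standard library's std::string_view, used here only as a stand-in for AK::StringView (an assumption for illustration; this is not AK code). The helper name length_via_pointer and the sample strings are hypothetical.

```cpp
#include <cstddef>
#include <string_view>

using namespace std::literals::string_view_literals;

// Constructing a view from a bare char const* has to scan for the
// terminating NUL to learn the length (a strlen-style walk).
inline std::size_t length_via_pointer(char const* text)
{
    return std::string_view(text).size();
}

// With the sv suffix, the compiler passes the literal's length to
// operator""sv, so no scan is needed and the view is usable in constant
// expressions.
constexpr std::string_view greeting = "well hello friends"sv;
static_assert(greeting.size() == 18);
```

One visible consequence: a literal with an embedded NUL, such as "a\0b"sv, keeps all three characters, while the pointer-based constructor stops at the first NUL.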
DexesTTP
7ceeb74535 AK: Use an enum instead of a bool for String::replace(all_occurences)
This commit has no behavior changes.

In particular, this does not fix any of the wrong uses of the previous
default parameter (which used to be 'false', meaning "only replace the
first occurrence in the string"). It simply replaces the default uses by
String::replace(..., ReplaceMode::FirstOnly), leaving them incorrect.
2022-07-06 11:12:45 +02:00
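A minimal sketch of the enum-instead-of-bool idea, written against std::string rather than AK::String. ReplaceMode::FirstOnly comes from the commit message; the ReplaceMode::All enumerator and the free function replace() are assumptions made for this example.

```cpp
#include <cstddef>
#include <string>
#include <string_view>

enum class ReplaceMode {
    All,       // assumed name for the "replace every occurrence" case
    FirstOnly, // named in the commit message
};

// Returns a copy of `input` with `needle` replaced by `replacement`.
// The mode is spelled out at every call site, so a reader does not have
// to remember what a bare `true` or `false` would have meant.
inline std::string replace(std::string_view input, std::string_view needle,
    std::string_view replacement, ReplaceMode mode)
{
    if (needle.empty())
        return std::string(input);

    std::string result;
    std::size_t position = 0;
    while (true) {
        std::size_t const match = input.find(needle, position);
        if (match == std::string_view::npos)
            break;
        result.append(input.substr(position, match - position));
        result.append(replacement);
        position = match + needle.size();
        if (mode == ReplaceMode::FirstOnly)
            break;
    }
    // Copy whatever follows the last replaced match.
    result.append(input.substr(position));
    return result;
}
```

A call such as replace("a-b-c", "-", "+", ReplaceMode::FirstOnly) makes the "first match only" behavior visible in the source, which is the point of the change.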
Sam Atkins
d564cf1e89 LibCore+Everywhere: Make Core::Stream read_line() return StringView
Similar reasoning to making Core::Stream::read() return Bytes, except
that every user of read_line() creates a StringView from the result, so
let's just return one right away.
2022-04-16 13:27:51 -04:00
thankyouverycool
0505e031f1 Meta+LibUnicode: Download and parse Unicode block properties
This parses Blocks.txt for CharacterType properties and creates
a global display array for use in apps.
2022-02-15 10:13:19 -05:00
Timothy Flynn
a64a7940e4 LibUnicode: Port the UCD generator to the stream API 2022-02-14 11:39:46 -05:00
Idan Horowitz
2d50c08f34 LibUnicode: Download and parse {Grapheme,Word,Sentence} break props 2022-01-31 21:05:04 +02:00
Timothy Flynn
6efbafa6e0 Everywhere: Update copyrights with my new serenityos.org e-mail :^) 2022-01-31 18:23:22 +00:00
Timothy Flynn
701b7810ba LibUnicode: Generate code point abbreviations 2022-01-18 15:13:25 +00:00
Timothy Flynn
437b9fe204 LibUnicode: Convert UnicodeData to link with weak symbols 2022-01-04 22:49:43 +00:00
Timothy Flynn
cf8e11a562 LibUnicode: Add temporary overload of value-from-string generator
This is a temporary mechanism while LibUnicode is in an in-between state
where some symbols are weakly linked and others are dynamically loaded.
The latter require an asm() label to be loaded.
2022-01-04 22:49:43 +00:00
Timothy Flynn
52394deece LibUnicode: Remove now unused value-from-string generator overload
The generate_value_from_string_for_dynamic_loading() overload was just
temporary until all generators were switched over to dynamic loading.
2021-12-21 13:09:49 -08:00
Timothy Flynn
3fd53baa25 LibUnicode: Dynamically load the generated UnicodeData symbols
The generated data for libunicodedata.so is quite large, and loading it
is a price paid by nearly every application by way of depending on
LibRegex. In order to defer this cost until an application actually uses
one of the surrounding APIs, dynamically load the generated symbols.

To be able to load the symbols dynamically, the generated methods must
have demangled names. Typically, this is accomplished with `extern "C"`
blocks. The clang toolchain complains about this here because the types
returned from the generators are strictly C++ types. So to demangle the
names, we use the asm() compiler directive to manually define a symbol
name; the caveat is that we *must* be sure the symbols are unique. As an
extra precaution, we prefix each symbol name with "unicode_". For more
details, see: https://gcc.gnu.org/onlinedocs/gcc/Asm-Labels.html

The symbol loader used in this implementation provides the additional
benefit of removing many [[maybe_unused]] attributes from the LibUnicode
methods. Internally, if ENABLE_UNICODE_DATABASE_DOWNLOAD is OFF, the
loader is able to stub out the function pointers it returns.

Note that as of this commit, LibUnicode is still directly linked against
LibUnicodeData. This commit is just a first step towards removing that.
2021-12-21 13:09:49 -08:00
Timothy Flynn
7e6ad172a4 LibUnicode: Support code point names that apply to ranges of code points
For example, consider the following adjacent entries in UnicodeData.txt:

    3400;<CJK Ideograph Extension A, First>;Lo;0;L;;;;;N;;;;;
    4DBF;<CJK Ideograph Extension A, Last>;Lo;0;L;;;;;N;;;;;

Our current implementation would assign the display name "CJK Ideograph
Extension A" to code points U+3400 & U+4DBF, but not to the code points
in between. Not only should those code points be assigned a name, but
the Unicode spec also has formatting rules on what the names should be
(the names for these ranged code points are not as they appear in
UnicodeData.txt).

The spec also defines names for code point ranges that actually are
listed individually in UnicodeData.txt. For example:

    2F800;CJK COMPATIBILITY IDEOGRAPH-2F800;Lo;0;L;4E3D;;;;N;;;;;
    2F801;CJK COMPATIBILITY IDEOGRAPH-2F801;Lo;0;L;4E38;;;;N;;;;;
    2F802;CJK COMPATIBILITY IDEOGRAPH-2F802;Lo;0;L;4E41;;;;N;;;;;

Code points are only coalesced into a range if all fields after the name
are equivalent. Our parser will insert the range and its name formatting
pattern when it comes across the first code point in that range, then
ignore other code points in that range. This reduces the number of names
we generate by nearly 2,000.
2021-11-30 11:24:02 +01:00
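The range handling in this commit can be sketched as a small table of ranges, each with a name prefix, where the display name is the prefix followed by the code point in uppercase hexadecimal. This is not the generator's actual output: the struct, the table, and the function name are hypothetical, and the second range covers only the three entries quoted in the commit message.

```cpp
#include <array>
#include <cstdint>
#include <cstdio>
#include <optional>
#include <string>
#include <string_view>

// Hypothetical shape for a coalesced range of code points.
struct CodePointRangeName {
    std::uint32_t first;
    std::uint32_t last;
    std::string_view prefix;
};

constexpr std::array<CodePointRangeName, 2> s_range_names { {
    // From the <CJK Ideograph Extension A, First/Last> pair.
    { 0x3400, 0x4DBF, "CJK UNIFIED IDEOGRAPH-" },
    // Only the three individually listed entries quoted in the message.
    { 0x2F800, 0x2F802, "CJK COMPATIBILITY IDEOGRAPH-" },
} };

// Returns the formatted name for a code point inside a known range,
// or an empty optional if no range covers it.
inline std::optional<std::string> code_point_display_name(std::uint32_t code_point)
{
    for (auto const& range : s_range_names) {
        if (code_point < range.first || code_point > range.last)
            continue;
        char hex[9];
        std::snprintf(hex, sizeof(hex), "%04X", static_cast<unsigned>(code_point));
        return std::string(range.prefix) + hex;
    }
    return std::nullopt;
}
```

Storing one pattern per range instead of one string per code point is what allows the generated name table to shrink.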
Timothy Flynn
f2f4980f15 LibUnicode: Remove unused field from UnicodeData generator 2021-11-30 11:24:02 +01:00
Timothy Flynn
88dbf3c348 LibUnicode: Port GenerateUnicodeData to ErrorOr and LibMain
Also store command line arguments as StringViews rather than pointers.
2021-11-23 22:58:05 +01:00
Ben Wiederhake
b06b54772e Meta+LibUnicode: Provide code point names through library 2021-11-20 00:31:55 +01:00
Timothy Flynn
9d1519e21c LibUnicode: Move GenerateUnicodeData's Alias struct to generator header
This will be used for locale aliases as well. Also rename the "property"
field in this struct to "name", as it no longer is only used for
property aliases.
2021-11-19 11:45:35 +01:00
Andreas Kling
8b1108e485 Everywhere: Pass AK::StringView by value 2021-11-11 01:27:46 +01:00
Timothy Flynn
f91d63af83 LibUnicode: Generate enum/alias from-string methods without a HashMap
The *_from_string() and resolve_*_alias() generated methods are the last
remaining users of HashMap in the LibUnicode generated files (read: the
last methods not using compile-time structures). This converts these
methods to use an array containing pairs of hash values to the desired
lookup value.

Because this code generation is the same between GenerateUnicodeData.cpp
and GenerateUnicodeLocale.cpp, this adds a GeneratorUtil.h header to the
LibUnicode generators to contain the method that generates the methods.
2021-10-13 16:38:51 +02:00
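A rough sketch of a from-string lookup backed by a compile-time array of (hash, value) pairs instead of a HashMap. The enum, the FNV-1a hash, and all names below are assumptions chosen for the example; the commit message does not say which hash function or search strategy the generated code uses.

```cpp
#include <array>
#include <cstdint>
#include <optional>
#include <string_view>
#include <utility>

// Hypothetical enum standing in for a generated one.
enum class GeneralCategory {
    Letter,
    Number,
    Punctuation,
};

// FNV-1a, picked here only because it is short and constexpr-friendly.
constexpr std::uint32_t hash_string(std::string_view text)
{
    std::uint32_t hash = 2166136261u;
    for (char ch : text) {
        hash ^= static_cast<unsigned char>(ch);
        hash *= 16777619u;
    }
    return hash;
}

// The whole table is built at compile time; nothing is copied into a
// hash map when the program starts.
constexpr std::array<std::pair<std::uint32_t, GeneralCategory>, 3> s_general_category_hashes { {
    { hash_string("Letter"), GeneralCategory::Letter },
    { hash_string("Number"), GeneralCategory::Number },
    { hash_string("Punctuation"), GeneralCategory::Punctuation },
} };

constexpr std::optional<GeneralCategory> general_category_from_string(std::string_view name)
{
    auto const hash = hash_string(name);
    // Linear scan for brevity; a table emitted in sorted order could be
    // binary searched instead.
    for (auto const& entry : s_general_category_hashes) {
        if (entry.first == hash)
            return entry.second;
    }
    return std::nullopt;
}
```

Because only hashes are stored, an unrelated string whose hash collides with a table entry would be accepted; a design like this trades that risk for a smaller, allocation-free table.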
Timothy Flynn
79707d83d3 LibUnicode: Stop generating large UnicodeData hash map
The data in this hash map is now available by way of much smaller arrays
and is now unused.
2021-10-10 13:49:37 +02:00
Timothy Flynn
d83b262e64 LibUnicode: Generate standalone compile-time array for combining class 2021-10-10 13:49:37 +02:00
Timothy Flynn
9f83774913 LibUnicode: Generate standalone compile-time array for special casing
There are only 112 code points with special casing rules, so this array
is quite small (compared to the size 34,626 UnicodeData hash map that is
also storing this data). Removing all casing rules from UnicodeData will
happen in a subsequent commit.
2021-10-10 13:49:37 +02:00
Timothy Flynn
da4b8897a7 LibUnicode: Generate standalone compile-time arrays for simple casing
Currently, all casing information (simple and special) is stored in a
compile-time array of size 34,626, then statically copied to a hash map
at runtime. In an effort to reduce the resulting memory usage, store the
simple casing rules in standalone compile-time arrays. The uppercase map
is size 1,450 and the lowercase map is size 1,433. Any code point not in
a map will implicitly have an identity mapping.
2021-10-10 13:49:37 +02:00
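The "standalone array plus implicit identity mapping" idea might look roughly like the following. The struct, the three sample entries, and the function name are hypothetical; the real generated arrays hold the 1,450 and 1,433 entries mentioned above.

```cpp
#include <algorithm>
#include <array>
#include <cstdint>

// Hypothetical entry type: one code point and its simple case mapping.
struct CaseMapping {
    std::uint32_t code_point;
    std::uint32_t mapping;
};

// A tiny hand-picked subset, sorted by code_point so that it can be
// binary searched.
constexpr std::array<CaseMapping, 3> s_uppercase_mappings { {
    { 0x0061, 0x0041 }, // LATIN SMALL LETTER A -> LATIN CAPITAL LETTER A
    { 0x00E9, 0x00C9 }, // LATIN SMALL LETTER E WITH ACUTE -> capital form
    { 0x03B1, 0x0391 }, // GREEK SMALL LETTER ALPHA -> GREEK CAPITAL LETTER ALPHA
} };

inline std::uint32_t simple_uppercase(std::uint32_t code_point)
{
    auto const* begin = s_uppercase_mappings.data();
    auto const* end = begin + s_uppercase_mappings.size();
    auto const* it = std::lower_bound(begin, end, code_point,
        [](CaseMapping const& entry, std::uint32_t value) {
            return entry.code_point < value;
        });
    if (it != end && it->code_point == code_point)
        return it->mapping;
    // Not in the table: the code point maps to itself.
    return code_point;
}
```

Only code points that actually change under case conversion need an entry, which is why the arrays can be far smaller than a table covering every code point.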
Nico Weber
9ec9886b04 Meta: Fix typos 2021-10-01 01:06:40 +01:00
Timothy Flynn
c8dbcdb0bc LibUnicode: Do not compare generated file contents before writing
This is now covered by unicode_data.cmake after the superbuild changes.
2021-09-30 17:37:57 +01:00
Idan Horowitz
6704961c82 AK: Replace the mutable String::replace API with an immutable version
This removes the awkward String::replace API which was the only String
API which mutated the String and replaces it with a new immutable
version that returns a new String with the replacements applied. This
also fixes a couple of UAFs that were caused by the use of this API.

As an optimization, an equivalent StringView::replace API was also added
to remove unnecessary String allocations of the form:
`String { view }.replace(...);`
2021-09-11 20:36:43 +03:00
Timothy Flynn
077a693de6 LibUnicode: Sort special casing array by locale specificity
This is to simplify the Default Case Conversion implementation. Otherwise,
the implementation would need to determine which special casing rule to
apply, instead of just picking the first match.
2021-09-06 15:24:27 +01:00
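A simplified sketch of why sorting by locale specificity lets the lookup take the first match. The struct, the two sample rules, and the function names are hypothetical, and the data is reduced to a single lowercase mapping per rule (real special casing rules carry conditions and multi-code-point mappings).

```cpp
#include <algorithm>
#include <array>
#include <cstdint>
#include <string_view>

struct SpecialCasing {
    std::uint32_t code_point;
    std::string_view locale; // empty means "applies to every locale"
    std::uint32_t lowercase;
};

// Sorts rules so that, for the same code point, locale-specific rules
// come before locale-independent ones.
inline std::array<SpecialCasing, 2> sorted_special_casing()
{
    std::array<SpecialCasing, 2> rules { {
        { 0x0049, "", 0x0069 },   // I -> i in general
        { 0x0049, "tr", 0x0131 }, // I -> dotless i in Turkish
    } };
    std::stable_sort(rules.begin(), rules.end(),
        [](SpecialCasing const& a, SpecialCasing const& b) {
            if (a.code_point != b.code_point)
                return a.code_point < b.code_point;
            return !a.locale.empty() && b.locale.empty();
        });
    return rules;
}

inline std::uint32_t to_lowercase(std::uint32_t code_point, std::string_view locale)
{
    static auto const rules = sorted_special_casing();
    for (auto const& rule : rules) {
        if (rule.code_point != code_point)
            continue;
        // The most specific rule is seen first, so the first applicable
        // rule is the one to use.
        if (rule.locale.empty() || rule.locale == locale)
            return rule.lowercase;
    }
    return code_point;
}
```

Without the sort, the lookup would have to collect every applicable rule and then rank them before choosing one.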
Timothy Flynn
91db61ae8d LibUnicode: Generate canonical combining class in Unicode data
Will be used by special casing rules.
2021-09-06 15:24:27 +01:00
Andrew Kaster
63956b36d0 Everywhere: Move all host tools into the Lagom/Tools subdirectory
This allows us to remove all the add_subdirectory calls from the top
level CMakeLists.txt that referred to targets linking LagomCore.

Segregating the host tools and Serenity targets helps us get to a place
where the main Serenity build can simply use a CMake toolchain file
rather than swapping all the compiler/sysroot variables after building
host libraries and tools.
2021-08-28 08:44:17 +01:00
Renamed from Userland/Libraries/LibUnicode/CodeGenerators/GenerateUnicodeData.cpp