0x4261756D
96de4ef7e0
LibTextCodec: Add SingleByteEncoders
...
They are similar to their already existing decoder counterparts.
2024-10-10 10:39:28 +02:00
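Note on the commit above: a single-byte encoder can be thought of as the inverse of a table-driven single-byte decoder. The sketch below is only an illustration in Python, not the LibTextCodec code; `encode_single_byte`, `make_encode_map` and `TOY_INDEX` are hypothetical names, and the one-entry table is invented for the example.

```python
from typing import Dict, List, Optional


def make_encode_map(index: List[Optional[int]]) -> Dict[int, int]:
    """Invert a 128-entry decoder index (pointer -> code point) into code point -> byte."""
    return {cp: pointer + 0x80 for pointer, cp in enumerate(index) if cp is not None}


def encode_single_byte(text: str, index: List[Optional[int]]) -> bytes:
    """Encode text with a single-byte table; ASCII passes through unchanged."""
    encode_map = make_encode_map(index)
    out = bytearray()
    for ch in text:
        cp = ord(ch)
        if cp < 0x80:
            out.append(cp)              # ASCII code points map to themselves
        elif cp in encode_map:
            out.append(encode_map[cp])  # table hit: the byte is pointer + 0x80
        else:
            raise ValueError(f"U+{cp:04X} is not encodable with this table")
    return bytes(out)


# Toy table (an assumption for illustration): only pointer 0 is defined, as U+20AC.
TOY_INDEX: List[Optional[int]] = [None] * 128
TOY_INDEX[0x00] = 0x20AC
```

With this toy table, `encode_single_byte("A\u20ac", TOY_INDEX)` yields `b"A\x80"`, and any other non-ASCII code point raises `ValueError`.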
Andreas Kling
cc4b3cbacc
Meta: Update my e-mail address everywhere
2024-10-04 13:19:50 +02:00
Shannon Booth
0b864bef60
LibTextCodec: Implement UTF8Decoder::to_utf8 using AK::String
...
String::from_utf8_with_replacement_character is equivalent to
https://encoding.spec.whatwg.org/#utf-8-decode from the encoding spec,
so we can simply call through to it.
2024-08-12 06:38:58 -04:00
BenJilks
0ca5675d59
LibTextCodec: Implement `iso-2022-jp` encoder
...
Implements the `iso-2022-jp` encoder, as specified by
https://encoding.spec.whatwg.org/#iso-2022-jp-encoder
2024-08-08 17:49:58 +01:00
BenJilks
08a8d67a5b
LibTextCodec: Implement `shift_jis` encoder
...
Implements the `shift_jis` encoder, as specified by
https://encoding.spec.whatwg.org/#shift_jis-encoder
2024-08-08 17:49:58 +01:00
BenJilks
d80575a410
LibTextCodec: Implement `gb18030` and `gbk` encoders
...
Implements the `gb18030` and `gbk` encoders, as specified by
https://encoding.spec.whatwg.org/#gb18030-encoder
https://encoding.spec.whatwg.org/#gbk-encoder
2024-08-08 17:49:58 +01:00
BenJilks
34c8c559c1
LibTextCodec: Implement `big5` encoder
...
Implements the `big5` encoder, as specified by
https://encoding.spec.whatwg.org/#big5-encoder
2024-08-08 17:49:58 +01:00
BenJilks
826292536c
LibTextCodec: Implement `euc-kr` encoder
...
Implements the `euc-kr` encoder, as specified by
https://encoding.spec.whatwg.org/#euc-kr-encoder
2024-08-08 17:49:58 +01:00
BenJilks
72d0e3284b
LibTextCodec+LibURL: Implement `utf-8` and `euc-jp` encoders
...
Implements the corresponding encoders and selects the appropriate one when
encoding URL search params. If an encoder for the given encoding cannot be
found, fall back to utf-8.
2024-08-08 17:49:58 +01:00
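Note on the commit above: the "pick an encoder, else fall back to utf-8" rule can be sketched in a few lines. This is a Python illustration using the standard library, not the LibURL code; `pick_encoding` and `encode_search_params` are hypothetical names, and Python's codec registry stands in for the encoder lookup.

```python
import codecs
from typing import Dict
from urllib.parse import urlencode


def pick_encoding(name: str) -> str:
    """Return name if a codec is registered for it, otherwise fall back to utf-8."""
    try:
        codecs.lookup(name)
        return name
    except LookupError:
        return "utf-8"


def encode_search_params(params: Dict[str, str], encoding_name: str) -> str:
    """Percent-encode query parameters using the requested encoding when available."""
    return urlencode(params, encoding=pick_encoding(encoding_name))
```

For example, `encode_search_params({"q": "\u3042"}, "euc-jp")` gives `q=%A4%A2`, while an unknown encoding name gives the utf-8 form `q=%E3%81%82`.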
Andreas Kling
1a46d8df5f
LibTextCodec: Use String::from_utf8() when decoding UTF-8 to UTF-8
...
This way, we still perform UTF-8 validation, but don't go through the
slow generic code path that rebuilds the decoded string one code point
at a time.
This was a bottleneck when loading a canned copy of reddit.com, which
ended up being ~120 MiB large.
- Time spent decoding UTF-8 before this change: 1192 ms
- Time spent decoding UTF-8 after this change: 154 ms
That's still a long time, but 7.7x faster is nothing to sneeze at! :^)
Note that if the input fails UTF-8 validation, we still fall back to
the slow path and insert replacement characters per the WHATWG Encoding
spec: https://encoding.spec.whatwg.org/#utf-8-decode
2024-07-20 14:29:37 +02:00
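Note on the commit above: the two-path strategy it describes (validate the whole buffer first, and only rebuild the string with replacement characters if validation fails) can be sketched as follows. This is a Python illustration, not the AK or LibTextCodec code; `decode_utf8` is a hypothetical name, and Python's built-in `replace` error handler is assumed to be a close enough stand-in for the spec's U+FFFD insertion.

```python
def decode_utf8(data: bytes) -> str:
    """Decode UTF-8, taking a strict fast path and a lenient slow path."""
    try:
        # Fast path: one call that both validates and converts the buffer.
        return data.decode("utf-8")
    except UnicodeDecodeError:
        # Slow path: invalid sequences are replaced with U+FFFD.
        return data.decode("utf-8", errors="replace")
```

Well-formed input never reaches the slow path, which is where the speedup reported in the commit message comes from.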
Timothy Flynn
368dad54ef
LibTextCodec: Use AK facilities to validate and convert UTF-16 to UTF-8
...
This allows LibTextCodec to make use of simdutf, and also reduces the
number of places with manual UTF-16 implementations.
2024-07-18 19:43:57 +02:00
Simon Wanner
0ab4722cee
LibTextCodec: Use generated lookup tables for all single byte decoders
2024-06-04 10:21:07 +02:00
Simon Wanner
6b2c459901
LibTextCodec: Fix ISO-8859-1 vs. windows-1252 handling in web contexts
...
The Encoding specification maps ISO-8859-1 to windows-1252 and expects
the windows-1252 translation table to be used, which differs from
ISO-8859-1 for 0x80-0x9F.
Other contexts expect to get the actual ISO-8859-1 encoding, with 1-to-1
mapping to U+0000-U+00FF, when requesting it.
`decoder_for_exact_name` is introduced, which skips the mapping from
aliases to the encoding name done by `get_standardized_encoding`.
2024-06-04 10:21:07 +02:00
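Note on the commit above: the difference between resolving a label through the alias mapping and requesting an exact encoding name shows up for bytes 0x80-0x9F. The sketch below is a Python illustration only; the function names echo the commit message, but the signatures, the tiny alias table and the use of Python's `cp1252` and `latin-1` codecs as stand-ins are assumptions. Python's `cp1252` also leaves a few bytes undefined that the Encoding specification maps to control characters, so it is not an exact match.

```python
# Hypothetical, heavily reduced alias table: web labels for Latin-1 resolve to windows-1252.
ALIASES = {"iso-8859-1": "windows-1252", "latin1": "windows-1252"}

# Python codecs used as stand-ins for the two decoders.
PYTHON_CODECS = {"windows-1252": "cp1252", "iso-8859-1": "latin-1"}


def get_standardized_encoding(label: str) -> str:
    """Map a label to its encoding name, applying aliases (web behaviour)."""
    key = label.strip().lower()
    return ALIASES.get(key, key)


def decode_for_label(label: str, data: bytes) -> str:
    """Decode after alias resolution, as a web context would."""
    return data.decode(PYTHON_CODECS[get_standardized_encoding(label)])


def decode_for_exact_name(name: str, data: bytes) -> str:
    """Decode with exactly the named encoding, skipping alias resolution."""
    return data.decode(PYTHON_CODECS[name.strip().lower()])
```

Here `decode_for_label("ISO-8859-1", b"\x80")` returns U+20AC (the euro sign, per windows-1252), whereas `decode_for_exact_name("ISO-8859-1", b"\x80")` returns U+0080.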
Simon Wanner
46d5cf0443
LibTextCodec: Fix some incorrect encoding aliases
2024-06-04 10:21:07 +02:00
Simon Wanner
09f2d79cb1
LibTextCodec: Bring TextCodec::get_standardized_encoding closer to spec
2024-06-04 10:21:07 +02:00
Simon Wanner
11bb216912
LibTextCodec: Add replacement decoder
2024-05-31 07:56:26 +02:00
Simon Wanner
7f3b457e62
LibTextCodec: Add EUC-KR decoder
2024-05-31 07:56:26 +02:00
Simon Wanner
ded6512ca8
LibTextCodec: Add Shift_JIS decoder
2024-05-31 07:56:26 +02:00
Simon Wanner
06f7c393b2
LibTextCodec: Add ISO-2022-JP decoder
2024-05-31 07:56:26 +02:00
Simon Wanner
45f0ae52be
LibTextCodec: Add EUC-JP decoder
2024-05-31 07:56:26 +02:00
Simon Wanner
9943bb1d8e
LibTextCodec: Add Big5 decoder
2024-05-31 07:56:26 +02:00
Simon Wanner
2ce61fe6ea
LibTextCodec: Add GBK/GB18030 decoder
...
Includes changes from GB-18030-2022, which are not yet included in the
Encoding Specification, but WebKit, Blink and WPT are already updated.
2024-05-31 07:56:26 +02:00
Simon Wanner
9ed52504ab
LibTextCodec: Delegate to process() in default validate() implementation
2024-05-31 07:56:26 +02:00
Simon Wanner
88c2586f25
LibTextCodec: Remove unused decoder classes
2024-05-31 07:56:26 +02:00
Simon Wanner
b79815c5a5
LibTextCodec: Add x-mac-cyrillic decoder
2024-05-27 20:50:50 +02:00
Simon Wanner
07a9435da5
LibTextCodec: Add windows-1258 decoder
2024-05-27 20:50:50 +02:00
Simon Wanner
275b89720b
LibTextCodec: Add windows-1257 decoder
2024-05-27 20:50:50 +02:00
Simon Wanner
c76308c7e6
LibTextCodec: Add windows-1256 decoder
2024-05-27 20:50:50 +02:00
Simon Wanner
eb9ed10573
LibTextCodec: Add windows-1253 decoder
2024-05-27 20:50:50 +02:00
Simon Wanner
2d35687db0
LibTextCodec: Add windows-874 decoder
2024-05-27 20:50:50 +02:00
Simon Wanner
1b6878b6ca
LibTextCodec: Add KOI8-U decoder
2024-05-27 20:50:50 +02:00
Simon Wanner
1fd3a6f48c
LibTextCodec: Add ISO-8859-16 decoder
2024-05-27 20:50:50 +02:00
Simon Wanner
3e882f26db
LibTextCodec: Sort checks in decoder_for mostly alphabetically
...
Keeps checks for common encodings (Latin1 & UTF-*) at the top.
2024-05-27 20:50:50 +02:00
Simon Wanner
56241df604
LibTextCodec: Add ISO-8859-14 decoder
2024-05-27 20:50:50 +02:00
Simon Wanner
4188e328ac
LibTextCodec: Add ISO-8859-13 decoder
2024-05-27 20:50:50 +02:00
Simon Wanner
cc640f4363
LibTextCodec: Add ISO-8859-10 decoder
2024-05-27 20:50:50 +02:00
Simon Wanner
d73220837e
LibTextCodec: Add ISO-8859-8(-I) decoder
2024-05-27 20:50:50 +02:00
Simon Wanner
24028e353e
LibTextCodec: Add ISO-8859-7 decoder
2024-05-27 20:50:50 +02:00
Simon Wanner
01c3b8091a
LibTextCodec: Add ISO-8859-6 decoder
2024-05-27 20:50:50 +02:00
Simon Wanner
763d904ad5
LibTextCodec: Add ISO-8859-5 decoder
2024-05-27 20:50:50 +02:00
Simon Wanner
c6b17320db
LibTextCodec: Add ISO-8859-4 decoder
2024-05-27 20:50:50 +02:00
Simon Wanner
6c84edaaa2
LibTextCodec: Add ISO-8859-3 decoder
2024-05-27 20:50:50 +02:00
Simon Wanner
fc783199f1
LibTextCodec: Add IBM866 decoder
2024-05-27 20:50:50 +02:00
Simon Wanner
96b3c35358
LibTextCodec: Implement table based decoders as SingleByteDecoder
...
Instead of copy-pasting the implementation, let's use a single class.
This "Single Byte Decoder" concept even exists in the Encoding Spec :^)
2024-05-27 20:50:50 +02:00
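Note on the commit above: the idea of one table-driven class shared by all single-byte encodings can be sketched like this. It is a Python illustration of the general technique, not the SingleByteDecoder implementation; `decode_single_byte` and `TOY_INDEX` are hypothetical, and the table holds a single invented entry.

```python
from typing import List, Optional


def decode_single_byte(data: bytes, index: List[Optional[int]]) -> str:
    """Decode bytes with a 128-entry table covering 0x80-0xFF; ASCII passes through."""
    out = []
    for byte in data:
        if byte < 0x80:
            out.append(chr(byte))  # ASCII bytes map to themselves
            continue
        code_point = index[byte - 0x80]
        # Unmapped bytes become U+FFFD instead of aborting the decode.
        out.append("\ufffd" if code_point is None else chr(code_point))
    return "".join(out)


# Toy table (an assumption for illustration): only byte 0x80 is mapped, to U+20AC.
TOY_INDEX: List[Optional[int]] = [None] * 128
TOY_INDEX[0x00] = 0x20AC
```

Supporting another encoding then only requires another 128-entry table, which is the duplication the commit removes.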
Michal Grich
7a6d84d036
LibTextCodec: Add Windows-1250 text decoder
...
This commit adds Windows-1250 decoding based on the unicode.org
mapping table.
2024-04-23 16:26:16 +02:00
Andreas Kling
3c039903fb
LibTextCodec+AK: Don't validate UTF-8 strings twice
...
UTF8Decoder was already converting invalid data into replacement
characters while converting, so we know for sure we have valid UTF-8
by the time conversion is finished.
This patch adds a new StringBuilder::to_string_without_validation()
and uses it to make UTF8Decoder avoid half the work it was doing.
2023-12-30 13:49:50 +01:00
Nico Weber
8f47acee6a
LibTextCodec: Add PDFDocEncoding decoder
2023-11-22 09:08:06 -07:00
Idan Horowitz
079c96376c
LibTextCodec: Support validating encoded inputs
2023-11-17 16:02:36 +01:00
Luke Wilde
eaa4048870
LibTextCodec: Add "get output encoding" from the Encoding specification
2023-06-19 06:12:26 +02:00
Timothy Flynn
00fa23237a
LibTextCodec: Change UTF-8's decoder to replace invalid code points
...
The UTF-8 decoder will currently crash if it is provided invalid UTF-8
input. Instead, change its behavior to match that of all other decoders
to replace invalid code points with U+FFFD. This is required by the web.
2023-05-12 05:47:36 +02:00