beenull/ladybird

mirror of https://github.com/LadybirdBrowser/ladybird.git synced 2024-11-22 15:40:19 +00:00

Author	SHA1	Message	Date
Sam Atkins	3f10a5701d	AK: Add Utf8View::for_each_split_view() method Returns one Utf8View at a time, using a callback function to identify code points to split on.	2024-11-15 23:18:29 +01:00
Timothy Flynn	144452d638	AK: Explicitly check for null data in Utf8View The underlying CPU-specific instructions for operating on UTF-8 strings behave differently for null inputs. Add an explicit check for this state for consistency.	2024-07-21 19:57:07 +02:00
Diego	7560b640f3	AK: Add `AllowSurrogates` to UTF-8 validator The [UTF-8](https://datatracker.ietf.org/doc/html/rfc3629#page-5) standard says to reject strings with upper or lower surrogates. However, in many standards, ECMAScript included, unpaired surrogates (and therefore UTF-8 surrogates) are allowed in strings. So, this commit extends the UTF-8 validation API with an `AllowSurrogates` option, which lets callers choose whether upper and lower surrogate code points are rejected.	2024-06-09 12:16:32 +02:00
Ali Mohammad Pur	5e1499d104	Everywhere: Rename {Deprecated => Byte}String This commit un-deprecates DeprecatedString, and repurposes it as a byte string. As the null state has already been removed, there are no other particularly hairy blockers in repurposing this type as a byte string (what it _really_ is). This commit is auto-generated: $ xs=$(ack -l \bDeprecatedString\b\|deprecated_string AK Userland \ Meta Ports Ladybird Tests Kernel) $ perl -pie 's/\bDeprecatedString\b/ByteString/g; s/deprecated_string/byte_string/g' $xs $ clang-format --style=file -i \ $(git diff --name-only | grep \.cpp\|\.h) $ gn format $(git ls-files '.gn' '.gni')	2023-12-17 18:25:10 +03:30
Timothy Flynn	c4d78c29a2	AK: Invalidate overlong UTF-8 code point encodings For example, the code point U+002F could be encoded as UTF-8 with the bytes 0xC0 0xAF. This trick has historically been used to bypass security checks.	2023-03-03 11:46:42 -05:00
Linus Groh	6e19ab2bbc	AK+Everywhere: Rename String to DeprecatedString We have a new, improved string type coming up in AK (OOM aware, no null state), and while it's going to use UTF-8, the name UTF8String is a mouthful - so let's free up the String name by renaming the existing class. Making the old one have an annoying name will hopefully also help with quick adoption :^)	2022-12-06 08:54:33 +01:00
Ben Wiederhake	ff8f3814cc	AK+Tests: Avoid creating invalid code points from malformed UTF-8 Instead of doing anything reasonable, Utf8CodePointIterator returned invalid code points, for example U+123456. However, many callers of this iterator assume that a code point is always at most 0x10FFFF. In fact, this is one of two reasons for the following OSS Fuzz issue: https://bugs.chromium.org/p/oss-fuzz/issues/detail?id=49184 This is probably a very old bug. In the particular case of URLParser, AK::is_url_code_point got confused: return /* ... */ || code_point >= 0xA0; If code_point is a "code point" beyond 0x10FFFF, this violates the condition given in the preceding comment, but satisfies the given condition, which eventually causes URLParser to crash. This commit fixes *only* the erroneous UTF-8 decoding, and does not fully resolve OSS-Fuzz#49184.	2022-10-09 10:37:20 -06:00
sin-ack	c70f45ff44	Everywhere: Explicitly specify the size in StringView constructors This commit moves the length calculations out to be directly on the StringView users. This is an important step towards the goal of removing StringView(char const*), as it moves the responsibility of calculating the size of the string to the user of the StringView (which will prevent naive uses causing OOB access).	2022-07-12 23:11:35 +02:00
Timothy Flynn	9e5abec6f1	AK: Invalidate UTF-8 encoded code points larger than U+10ffff On oss-fuzz, the LibJS REPL is provided a file encoded with Windows-1252 with the following contents: /ô¡°½/ The REPL assumes the input file is UTF-8. So in Windows-1252, the above is represented as [0x2f 0xf4 0xa1 0xb0 0xbd 0x2f]. The inner 4 bytes are actually a valid UTF-8 encoding if we only look at the most significant bits to parse leading/continuation bytes. However, it decodes to the code point U+121c3d, which is not a valid code point. This commit adds additional validation to ensure the decoded code point itself is also valid.	2022-04-05 00:14:29 +01:00
Andreas Kling	1be4cbd639	AK: Make Utf8View constructors inline and remove C string constructor Using StringView instead of C strings is basically always preferable. The only reason to use a C string is because you are calling a C API.	2021-09-18 19:54:24 +02:00
Timothy Flynn	87848cdf7d	AK: Track byte length, rather than code point length, in Utf8View::trim Utf8View::trim uses Utf8View::substring_view to return its result, which requires the input to be a byte offset/length rather than code point length.	2021-07-17 16:59:59 +01:00
DexesTTP	e01f1c949f	AK: Do not VERIFY on invalid code point bytes in UTF8View The previous behavior was to always VERIFY that the UTF-8 bytes were valid when iterating over the code points of a UTF8View. This change makes it so we instead output the 0xFFFD 'REPLACEMENT CHARACTER' code point when encountering invalid bytes, and keep iterating the view after skipping one byte. Leaving the decision to the consumer would break symmetry with the UTF32View API, which would in turn require heavy refactoring and/or code duplication in generic code such as the one found in Gfx::Painter and the Shell. To make it easier for the consumers to detect the original bytes, we provide a new method on the iterator that returns a Span over the data that has been decoded. This method is immediately used in the TextNode::compute_text_for_rendering method, which previously did this in an ad-hoc way. This also adds tests for the new behavior in TestUtf8.cpp, as well as reinforcements to the existing tests to check if the underlying bytes match up with their expected values.	2021-06-03 18:28:27 +04:30
Andreas Kling	407d6cd9e4	AK: Rename Utf8CodepointIterator => Utf8CodePointIterator	2021-06-01 09:45:52 +02:00
Max Wipfli	14506e8f5e	AK: Implement Utf8CodepointIterator::peek(size_t) This adds a peek method for Utf8CodepointIterator, which enables it to be used in some parsing cases where peeking is necessary. peek(0) is equivalent to operator*, except that peek() does not contain any assertions and will just return an empty Optional<u32>. This also implements a test case for iterating UTF-8.	2021-06-01 09:28:05 +02:00
Brian Gianforcaro	67322b0702	Tests: Move AK tests to Tests/AK	2021-05-06 17:54:28 +02:00

15 commits
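
Several of the commits above tighten UTF-8 validation: overlong encodings, code points above U+10FFFF, and (optionally) surrogates are rejected, and invalid bytes decode to U+FFFD while skipping one byte. The following is a minimal standalone sketch of those rules in standard C++, not AK's actual implementation. The names `decode_one`, `is_valid_utf8`, `decode_lossy`, and the `Decoded` struct are hypothetical, and the `AllowSurrogates` enum here only borrows its name from the commit message; its real signature and default in AK are not shown in this log.

```cpp
#include <cstddef>
#include <cstdint>
#include <optional>
#include <string_view>
#include <vector>

// Hypothetical names throughout; this is an illustration, not AK's code.
enum class AllowSurrogates { Yes, No };

struct Decoded {
    std::uint32_t code_point;
    std::size_t length; // number of bytes consumed
};

// Decodes the code point at the start of `bytes`, or returns nullopt if the
// sequence is malformed under the rules described in the commits above.
std::optional<Decoded> decode_one(std::string_view bytes, AllowSurrogates allow_surrogates)
{
    if (bytes.empty())
        return std::nullopt;
    auto byte = [&](std::size_t i) { return static_cast<std::uint8_t>(bytes[i]); };

    std::uint8_t first = byte(0);
    std::size_t length = 0;
    std::uint32_t code_point = 0;
    std::uint32_t minimum = 0; // smallest code point that needs this many bytes

    if (first < 0x80) {
        length = 1; code_point = first; minimum = 0;
    } else if ((first & 0xE0) == 0xC0) {
        length = 2; code_point = first & 0x1F; minimum = 0x80;
    } else if ((first & 0xF0) == 0xE0) {
        length = 3; code_point = first & 0x0F; minimum = 0x800;
    } else if ((first & 0xF8) == 0xF0) {
        length = 4; code_point = first & 0x07; minimum = 0x10000;
    } else {
        return std::nullopt; // stray continuation byte or invalid lead byte
    }

    if (bytes.size() < length)
        return std::nullopt; // truncated sequence
    for (std::size_t i = 1; i < length; ++i) {
        if ((byte(i) & 0xC0) != 0x80)
            return std::nullopt; // not a continuation byte
        code_point = (code_point << 6) | (byte(i) & 0x3F);
    }

    if (code_point < minimum)
        return std::nullopt; // overlong encoding, e.g. 0xC0 0xAF for U+002F
    if (code_point > 0x10FFFF)
        return std::nullopt; // beyond the Unicode range, e.g. U+121C3D
    if (allow_surrogates == AllowSurrogates::No && code_point >= 0xD800 && code_point <= 0xDFFF)
        return std::nullopt; // surrogate code point
    return Decoded { code_point, length };
}

// Returns true if every code point in `bytes` decodes successfully.
bool is_valid_utf8(std::string_view bytes, AllowSurrogates allow_surrogates = AllowSurrogates::No)
{
    while (!bytes.empty()) {
        auto decoded = decode_one(bytes, allow_surrogates);
        if (!decoded)
            return false;
        bytes.remove_prefix(decoded->length);
    }
    return true;
}

// Decodes leniently: each invalid byte yields U+FFFD and is skipped on its own.
std::vector<std::uint32_t> decode_lossy(std::string_view bytes)
{
    std::vector<std::uint32_t> code_points;
    while (!bytes.empty()) {
        if (auto decoded = decode_one(bytes, AllowSurrogates::No)) {
            code_points.push_back(decoded->code_point);
            bytes.remove_prefix(decoded->length);
        } else {
            code_points.push_back(0xFFFD);
            bytes.remove_prefix(1);
        }
    }
    return code_points;
}
```

In this sketch the overlong check compares the decoded value against the smallest code point that requires the given sequence length, which is one common way to express the rule; AK may implement it differently.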