0ct0pu5/ladybird

Author	SHA1	Message	Date
Timothy Flynn	feddecde5b	LibWeb: Emit the current token before EOF on invalid comments The spec for each of these states says: -> EOF: This is an eof-in-comment parse error. Emit the current comment token. Emit an end-of-file token. We were neglecting to emit the current comment token before emitting an EOF token. Note the existing EMIT_CURRENT_TOKEN macro was unused.	2024-03-23 20:58:31 +01:00
Timothy Flynn	af57bd5cca	LibWeb: Stop parsing after `document.write` at the insertion point If a call to `document.write` inserts an incomplete HTML tag, e.g.: document.write("<p"); we would previously continue parsing the document until we reached a closing angle bracket. However, the spec states we should stop once we reach the new insertion point.	2024-02-20 17:04:36 +01:00
Timothy Flynn	64dcd3f1f4	LibWeb: Restore the previous tokenizer iterator after inserting input Otherwise, m_prev_utf8_iterator still points at the old source.	2024-02-20 17:04:36 +01:00
Timothy Flynn	fcf83a8ed0	LibWeb: Allocate fewer strings during `document.write`	2024-02-20 17:04:36 +01:00
Ali Mohammad Pur	5e1499d104	Everywhere: Rename {Deprecated => Byte}String This commit un-deprecates DeprecatedString, and repurposes it as a byte string. As the null state has already been removed, there are no other particularly hairy blockers in repurposing this type as a byte string (what it _really_ is). This commit is auto-generated: $ xs=$(ack -l \bDeprecatedString\b\\|deprecated_string AK Userland \ Meta Ports Ladybird Tests Kernel) $ perl -pie 's/\bDeprecatedString\b/ByteString/g; s/deprecated_string/byte_string/g' $xs $ clang-format --style=file -i \ $(git diff --name-only \| grep \.cpp\\|\.h) $ gn format $(git ls-files '.gn' '.gni')	2023-12-17 18:25:10 +03:30
Andreas Kling	3ff81dcb65	LibWeb: Make Web::Namespace::Foo strings be FlyString This required dealing with a lot of fallout, but it's all basically just switching from DeprecatedFlyString to either FlyString or Optional<FlyString> in a hundred places to accommodate the change.	2023-11-04 21:28:30 +01:00
Andreas Kling	b341aeb5c1	LibWeb: Switch HTMLToken and HTMLTokenizer to String & FlyString	2023-11-04 21:28:30 +01:00
Shannon Booth	d8635fe541	LibWeb: Port HTMLParser local name and value from DeprecatedString	2023-10-08 08:11:48 -04:00
Shannon Booth	9303e9e76f	LibWeb: Port Element::local_name and TagNames from DeprecatedString These pretty much need to be done together due to the number of places where they are compared with each other. This also involves porting StackOfOpenElements from DeprecatedFlyString to FlyString to prevent a gazillion calls to `.to_deprecated_fly_string` in HTMLParser.	2023-10-03 14:47:53 +01:00
Shannon Booth	49eb3bfb1d	LibWeb: Make Document::run_the_document_write_steps take a StringView Which flows on down into HTMLTokenizer::insert_input_at_insertion_point.	2023-09-13 07:26:35 +02:00
Timothy Flynn	fea440055a	LibWeb: Track the byte offset of an HTMLToken's position We currently track the [line, column] position of every HTMLToken, as this is what is needed for LibGUI's syntax highlighting. Some non-LibGUI purposes (e.g. highlighting HTML with HTML) require a byte offset. Track both during tokenization.	2023-08-29 08:11:11 -04:00
Timothy Flynn	5a2bf7fdd1	LibWeb: Set the correct end position of HTML attribute names We were previously setting the end position of attribute names in self- closing HTML tags to the end of the attribute value. To illustrate the previous behavior, consider this tag and its attribute's start and end positions (shown inclusively below): <meta charset="UTF-8" /> ^ name start ^ value start ^ value end ^ name end Rather than setting the end position of the attribute name when we parse the closing slash, ensure the end position is already set while we are in the AttributeName state. We now have: <meta charset="UTF-8" /> ^ name start ^ name end ^ value start ^ value end The tokenizer unit test has been extended to test these positions.	2023-08-25 08:22:24 +02:00
Timothy Flynn	5b2bc90b50	LibWeb: Set consistent positions for the start and end of HTML tags To illustrate the previous behavior, consider these tags and their start and end positions (shown inclusively below): Start tag: End tag: <span> </span> ^ start ^ start ^end ^end The start position of a tag is the first ASCII-alpha code point after the opening brace. The start position of a close tag is the slash just before the first ASCII-alpha code point. And the end position of both is the closing brace. So the opening brace is not included in the emitted tag, but the closing brace is. And the end tag including the slash is an oddity that had to be worked around in its only use case (syntax highlighting). We now consistently exclude the braces from the emitted tag, and also exclude the slash from the end tag, so that it does not need to be accounted for in syntax highlighting. That is, we now have: Start tag: End tag: <span> </span> ^ start ^ start ^end ^end The tokenizer unit test has been extended to test these positions.	2023-08-25 08:22:24 +02:00
Andreas Kling	70db40c9b0	LibWeb: Don't include Layout/Node.h from DOM/Element.h This required moving the CSS::StyleProperty destruct out of line.	2023-05-08 09:29:44 +02:00
Sam Atkins	2db168acc1	LibTextCodec+Everywhere: Port Decoders to new Strings	2023-02-19 17:15:47 +01:00
Sam Atkins	f2a9426885	LibTextCodec+Everywhere: Return Optional<Decoder&> from `decoder_for()`	2023-02-19 17:15:47 +01:00
Linus Groh	6e7459322d	AK: Remove StringBuilder::build() in favor of to_deprecated_string() Having an alias function that only wraps another one is silly, and keeping the more obvious name should flush out more uses of deprecated strings. No behavior change.	2023-01-27 20:38:49 +00:00
Linus Groh	57dc179b1f	Everywhere: Rename to_{string => deprecated_string}() where applicable This will make it easier to support both string types at the same time while we convert code, and tracking down remaining uses. One big exception is Value::to_string() in LibJS, where the name is dictated by the ToString AO.	2022-12-06 08:54:33 +01:00
Linus Groh	6e19ab2bbc	AK+Everywhere: Rename String to DeprecatedString We have a new, improved string type coming up in AK (OOM aware, no null state), and while it's going to use UTF-8, the name UTF8String is a mouthful - so let's free up the String name by renaming the existing class. Making the old one have an annoying name will hopefully also help with quick adoption :^)	2022-12-06 08:54:33 +01:00
Andreas Kling	c79e8aab0a	LibWeb: Make ON_WHITESPACE less heavy in HTML tokenizer Once we know that the current code point is an ASCII character, we can just check if it's one of the HTML whitespace characters. Before this patch, we were using the generic StringView::contains(u32) path that splats a code point into a StringBuilder and then searches for it with memmem(). This reduces time spent in the HTML tokenizer from 16% to 6% when loading the ECMA-262 spec.	2022-11-05 00:31:11 +01:00
Andreas Kling	ab8432783e	LibWeb: Implement aborting the HTML parser This is roughly on-spec, although I had to invent a simple "aborted" state for the tokenizer.	2022-09-20 23:44:59 +02:00
sin-ack	3f3f45580a	Everywhere: Add sv suffix to strings relying on StringView(char const) Each of these strings would previously rely on StringView's char const constructor overload, which would call __builtin_strlen on the string. Since we now have operator ""sv, we can replace these with much simpler versions. This opens the door to being able to remove StringView(char const*). No functional changes.	2022-07-12 23:11:35 +02:00
stelar7	e547f5887e	LibWeb: Fix Array OOBs in the HTMLTokenizer Accessing last() if there are no elements makes WebContent crash :^)	2022-06-03 12:29:11 +01:00
Andreas Kling	1061c863f8	LibWeb: Fix issue where double-quoted doctype system ID was not captured We were storing double-quoted system ID's in the public ID field. 1% progression on ACID3. :^)	2022-03-02 12:30:15 +01:00
Lorenz Steinert	db789813c9	LibWeb: Add basic support for dynamic markup insertion This implements basic support for dynamic markup insertion, adding * Document::open() * Document::write(Vector<String> const&) * Document::writeln(Vector<String> const&) * Document::close() The HTMLParser is modified to make it possible to create a script-created parser which initially only contains an HTMLTokenizer without any data. Additionally, the HTMLParser::run method gains an overload which does not modify the Document and does not run HTMLParser::the_end(), so that we can reenter the parser at a later time. Furthermore, all FIXMEs that concern the insertion point, which is defined in the HTMLTokenizer, are implemented. Additionally, the following member variables of the HTMLParser are now exposed by getter functions: * m_tokenizer * m_aborted * m_script_nesting_level The HTMLTokenizer is modified so that it contains an insertion point which keeps track of where the next input from the Document::write functions will be inserted. The insertion point is implemented as the character offset into m_decoded_input and a boolean describing if the insertion point is defined. Functions to update, check and {re}store the insertion point are also added. The function HTMLTokenizer::insert_eof is added to tell a script-created parser that document::close was called and HTMLParser::the_end() should be called. Lastly, an explicit default constructor is added to HTMLTokenizer to create an empty HTMLTokenizer into which data can be inserted.	2022-02-21 18:26:43 +01:00
Adam Hodgen	b6eaefa87d	LibWeb: Fix 'Comment end state' in HTML Tokenizer Also, update the expected hash in the LibWeb TestHTMLTokenizer regression test. This is due to the "This comment has a few too many dashes." comment token being updated.	2022-02-21 16:31:45 +01:00
Adam Hodgen	d73bb2633c	LibWeb: Implement tokenization newline preprocessing Newline normalization will replace \r and \r\n with \n. The spec specifically states > Before the tokenization stage, the input stream must be preprocessed > by normalizing newlines. whereas this implements the processing during the tokenization itself. This should still exhibit the same behaviour, while keeping the tokenization logic in the same place.	2022-02-21 16:31:45 +01:00
Adam Hodgen	c6fcdd0f93	LibWeb: Fix off by one error in HTML Tokenizer In 'NamedCharacterReference' we attempt to look up the code point by an identifier, e.g. apos; becomes ' This is done by passing the entire rest of the document to the `HTML::code_points_from_entity` function. However, before this change we didn't send the final character, which meant that if the document ended in a named character reference the lookup would fail.	2022-02-21 16:31:45 +01:00
Andreas Kling	25504f6a1b	LibWeb: Use Vector::clear_with_capacity() in HTMLTokenizer This avoids constantly reallocating the Vector<HTMLToken>.	2022-02-19 14:45:59 +01:00
Linus Groh	892f6394b8	LibWeb: Implement state switch for "[CDATA[" in HTML parser	2022-02-15 23:24:34 +01:00
Linus Groh	f61fb08492	LibWeb: Add spec links to each HTML tokenizer state section I didn't add full spec comments this time, but this is better than nothing :^)	2022-02-15 23:24:34 +01:00
Karol Kosek	c157c2148f	LibWeb: Don't emit current token on EOF in HTML Tokenizer Emitting tokens on EOF caused an infinite loop, freezing the app, which could be a bit annoying when writing an HTML comment at the end of the file in Text Editor. :^)	2022-02-14 12:50:44 +03:30
Karol Kosek	fb5e2670d6	LibWeb: Fix highlighting HTML comments Commit `b193351a99` caused the HTML comments to flash when changing the text cursor. Also, when double-clicking on a comment, the selection started from the beginning of the file instead. The following message was displaying when `TOKENIZER_TRACE_DEBUG` was enabled: (Tokenizer::nth_last_position) Invalid position requested: 4th-last of 4. Returning (0-0). Changing the `nth_last_position` to 3 fixes this. I'm guessing that's because the parser is at that moment on the second hyphen of the `<!--` string, so it has to go back only by three characters.	2022-02-14 12:50:44 +03:30
MacDue	b193351a99	LibWeb: Fix off-by-one in HTMLTokenizer::restore_to() The difference should be between m_utf8_iterator and the new position; if m_prev_utf8_iterator is used, one fewer source position is popped than required. This issue was not apparent on most pages since restore_to is used for tokens such as <!doctype> that are normally followed by a newline that resets the column to zero, but it can be seen on pages with minified HTML.	2022-02-13 14:51:09 +00:00
Sam Atkins	197759e30f	LibWeb: Fix off-by-one error when highlighting unquoted HTML attributes This fixes #11166	2021-12-10 21:27:13 +01:00
Andreas Kling	8b1108e485	Everywhere: Pass AK::StringView by value	2021-11-11 01:27:46 +01:00
Andreas Kling	f67648f872	LibWeb: Rename HTMLDocumentParser => HTMLParser	2021-09-25 23:36:43 +02:00
ovf	898b8ffcb6	LibWeb: Avoid assertion failure on parsing numeric character references	2021-07-28 18:32:22 +02:00
ovf	13c7d55320	LibWeb: Fix parsing of character references in attribute values	2021-07-27 00:03:43 +02:00
Max Wipfli	ccae0cae45	LibWeb: Rename HTMLToken::doctype_data() => ensure_doctype_data() This renames the accessor to better reflect what it does, as this will allocate a DoctypeData struct if there is none.	2021-07-17 16:24:57 +04:30
Max Wipfli	25cba4387b	LibWeb: Add HTMLToken(Type) constructor and use it	2021-07-17 16:24:57 +04:30
Max Wipfli	f2e3c770f9	LibWeb: Use setter for HTMLToken::m_{start,end}_position	2021-07-17 16:24:57 +04:30
Max Wipfli	8b31e41692	LibWeb: Change HTMLToken::m_doctype into named DoctypeData struct This is in preparation for an upcoming storage change of HTMLToken. In contrast to the other token types, the accessor can hand out a mutable reference to allow users to change parts of the DoctypeData easily.	2021-07-17 16:24:57 +04:30
Max Wipfli	918bde98b1	LibWeb: Hide implementation details of HTMLToken attribute list Previously, HTMLToken would expose the Vector<Attribute> directly to its users. In preparation for a future change, all users now use implementation-agnostic APIs which do not expose the Vector directly.	2021-07-17 16:24:57 +04:30
Max Wipfli	15d8635afc	LibWeb: Use getter+setter for HTMLToken tag name and self-closing flag	2021-07-17 16:24:57 +04:30
Max Wipfli	1aeafcc58b	LibWeb: Use getter and setter for Character type HTMLTokens While storing the code point in a UTF-8 encoded String is horrendously inefficient, this problem will be addressed at a later stage.	2021-07-17 16:24:57 +04:30
Max Wipfli	e8e9426b4f	LibWeb: Use getter and setter for Comment type HTMLTokens	2021-07-17 16:24:57 +04:30
Max Wipfli	f886aa15b8	LibWeb: Rename HTMLToken::AttributeBuilder struct to Attribute This does not contain StringBuilders anymore, so it can do with a simpler name: Attribute.	2021-07-17 16:24:57 +04:30
Max Wipfli	e22a34badb	LibWeb: Fix assertion failures in HTMLTokenizer The *TagName states are all very similar, so it seems to be correct to apply the fix from #8761 to all of those states. This fixes #8788.	2021-07-16 11:55:55 +02:00
Max Wipfli	2404ad6897	LibWeb: Fix assertion failure when tokenizing JS regex literals This fixes parsing the following regular expression: /</g; It also adds a simple script element to the HTMLTokenizer regression test, which also contains that specific regex.	2021-07-15 01:47:22 +02:00
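
Note on d73bb2633c (newline preprocessing): the commit message says \r and \r\n are replaced with \n, and that the replacement happens while tokenizing rather than in a separate pass. The snippet below is only a rough standalone sketch of the replacement rule itself, written against the C++ standard library rather than AK. The function name normalize_newlines is hypothetical and does not come from the Ladybird sources, and it works on a whole buffer at once, which is the spec's "preprocess first" shape and not the commit's in-tokenizer approach.

```cpp
#include <cassert>
#include <cstddef>
#include <string>
#include <string_view>

// Hypothetical helper (not Ladybird's API): returns a copy of `input`
// in which every "\r\n" pair and every lone "\r" is replaced by "\n".
std::string normalize_newlines(std::string_view input)
{
    std::string output;
    output.reserve(input.size());

    for (std::size_t i = 0; i < input.size(); ++i) {
        if (input[i] == '\r') {
            // Both "\r\n" and a lone "\r" collapse to a single "\n".
            output.push_back('\n');
            // Skip the '\n' of a "\r\n" pair so it is not emitted twice.
            if (i + 1 < input.size() && input[i + 1] == '\n')
                ++i;
        } else {
            output.push_back(input[i]);
        }
    }

    return output;
}
```

A tokenizer that normalizes on the fly, as the commit describes, would presumably apply the same two-case check each time it fetches the next code point instead of copying the buffer up front.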

75 commits