This implements basic support for dynamic markup insertion, adding
* Document::open()
* Document::write(Vector<String> const&)
* Document::writeln(Vector<String> const&)
* Document::close()
The HTMLParser is modified to make it possible to create a
script-created parser which initially only contains a HTMLTokenizer
without any data. Aditionally the HTMLParser::run method gains an
overload which does not modify the Document and does not run
HTMLParser::the_end() so that we can reenter the parser at a later time.
Furthermore all FIXMEs that consern the insertion point are implemented
wich is defined in the HTMLTokenizer. Additionally the following
member-variables of the HTMLParser are now exposed by getter funcions:
* m_tokenizer
* m_aborted
* m_script_nesting_level
The HTMLTokenizer is modified so that it contains an insertion
point which keeps track of where the next input from the Document::write
functions will be inserted. The insertion point is implemented as the
charakter offset into m_decoded_input and a boolean describing if the
insertion point is defined. Functions to update, check and {re}store the
insertion point are also added.
The function HTMLTokenizer::insert_eof is added to tell a script-created
parser that document::close was called and HTMLParser::the_end() should
be called.
Lastly an explicit default constructor is added to HTMLTokenizer to
create a empty HTMLTokenizer into which data can be inserted.
Also, update the expected hash in the LibWeb TestHTMLTokenizer
regression test.
This is due to the "This comment has a few too many dashes." comment
token being updated.
Newline normalization will replace \r and \r\n with \n.
The spec specifically states
> Before the tokenization stage, the input stream must be preprocessed
> by normalizing newlines.
wheras this is implemented the processing during the tokenization
itself.
This should still exhibit the same behaviour, while keeping the
tokenization logic in the same place.
In 'NamedCharacterReference' we attempt to lookup the code point by a
identifier, eg apos; becomes '
This is done by passing the entire rest of the document to the
`HTML::code_points_from_entity` function.
However, before this change we didn't sent the final character which
meant if the document ended in a named character reference the lookup
would fail.
Emitting tokens on EOF caused an infinite loop, freezing the app, which
could be a bit annoying when writing an HTML comment at the end of
the file in Text Editor. :^)
Commit b193351a99 caused the HTML comments to flash when changing
the text cursor. Also, when double-clicking on a comment, the selection
started from the beginning of the file instead.
The following message was displaying when `TOKENIZER_TRACE_DEBUG`
was enabled:
(Tokenizer::nth_last_position) Invalid position requested: 4th-last
of 4. Returning (0-0).
Changing the `nth_last_position` to 3 fixes this. I'm guessing that's
because the parser is at that moment on the second hyphen of the `<!--`
string, so it has to go back only by three characters.
The difference should be between m_utf8_iterator and the
the new position, if m_prev_utf8_iterator is used one fewer
source position is popped than required.
This issue was not apparent on most pages since restore_to
used for tokens such <!doctype> that are normally
followed by a newline that resets the column to zero,
but it can be seen on pages with minified HTML.
This is in preparation for an upcoming storage change of HTMLToken. In
contrast to the other token types, the accessor can hand out a mutable
reference to allow users to change parts of the DoctypeData easily.
Previously, HTMLToken would expose the Vector<Attribute> directly to
its users. In preparation for a future change, all users now use
implementation-agnostic APIs which do not expose the Vector directly.
This fixes parsing the following regular expression: /</g;
It also adds a simple script element to the HTMLTokenizer regression
test, which also contains that specific regex.
This patch aims to fix wrong highlighting for some cases in HTML's
syntax highlighter. The values were somewhat experimentally determined
are are subject to change. Regardless, it should be more correct with
this patch than without it. :^)
This patch changes HTMLTokenizer::nth_last_position to not fail if the
requested position is not available. Rather, it will just return (0-0).
While this is not the correct solution, it prevents the tokenizer from
crashing just because it cannot find a source position. This should only
affect SyntaxHighlighter.
This replaces ctype.h with CharacterType.h everywhere I could find
issues with narrowing conversions. While using it will probably make
sense almost everywhere in the future, the most critical places should
have been addressed.
skip() is supposed to end up keeping the previous iterator only one
index behind the current one, and restore_to() should actually do the
restore instead of just removing the now-useless source positions.
Fixes#7331.
SPDX License Identifiers are a more compact / standardized
way of representing file license information.
See: https://spdx.dev/resources/use/#identifiers
This was done with the `ambr` search and replace tool.
ambr --no-parent-ignore --key-from-file --rep-from-file key.txt rep.txt *
(...and ASSERT_NOT_REACHED => VERIFY_NOT_REACHED)
Since all of these checks are done in release builds as well,
let's rename them to VERIFY to prevent confusion, as everyone is
used to assertions being compiled out in release.
We can introduce a new ASSERT macro that is specifically for debug
checks, but I'm doing this wholesale conversion first since we've
accumulated thousands of these already, and it's not immediately
obvious which ones are suitable for ASSERT.