Commit graph

53 commits

Author SHA1 Message Date
Andreas Kling
34344120f2 AK: Make "foo"_string infallible
Stop worrying about tiny OOMs.

Work towards #20405.
2023-08-07 16:03:27 +02:00
Simon Wanner
a2efecac03 LibJS: Parse slashes after reserved identifiers correctly
Previously we were unable to parse code like `yield/2` because `/2`
was parsed as a regex. At the same time `for (a in / b/)` was parsed
as a division.

This is solved by defaulting to division in the lexer, but calling
`force_slash_as_regex()` from the parser whenever an IdentifierName
is parsed as a ReservedWord.
2023-06-10 07:20:33 +02:00
Linus Groh
09d40bfbb2 Everywhere: Use _{short_,}string to create Strings from literals 2023-02-25 20:51:49 +01:00
Evan Smal
3226ce3d83 LibJS: Remove some usage of DeprecatedString usage from Lexer
This changes the filename member from DeprecatedString to String. Parser
has also been updated to meet the updated Lexer interface.
2023-01-26 20:25:25 +00:00
Evan Smal
cfa6b4d815 LibJS: Remove DeprecatedString usage from Token 2023-01-26 20:25:25 +00:00
Timothy Flynn
f3db548a3d AK+Everywhere: Rename FlyString to DeprecatedFlyString
DeprecatedFlyString relies heavily on DeprecatedString's StringImpl, so
let's rename it to A) match the name of DeprecatedString, B) write a new
FlyString class that is tied to String.
2023-01-09 23:00:24 +00:00
Linus Groh
6e19ab2bbc AK+Everywhere: Rename String to DeprecatedString
We have a new, improved string type coming up in AK (OOM aware, no null
state), and while it's going to use UTF-8, the name UTF8String is a
mouthful - so let's free up the String name by renaming the existing
class.
Making the old one have an annoying name will hopefully also help with
quick adoption :^)
2022-12-06 08:54:33 +01:00
davidot
b3edd94869 LibJS: Treat '\\' as an escaped character in template literals
Before this change we would ignore that the second backslash is escaped
and template strings ending with ` \\` would be unterminated as the
second slash was used to escape the closing quote.
2022-11-15 12:00:36 +00:00
sin-ack
fbc771efe9 Everywhere: Use default StringView constructor over nullptr
While null StringViews are just as bad, these prevent the removal of
StringView(char const*) as that constructor accepts a nullptr.

No functional changes.
2022-07-12 23:11:35 +02:00
sin-ack
3f3f45580a Everywhere: Add sv suffix to strings relying on StringView(char const*)
Each of these strings would previously rely on StringView's char const*
constructor overload, which would call __builtin_strlen on the string.
Since we now have operator ""sv, we can replace these with much simpler
versions. This opens the door to being able to remove
StringView(char const*).

No functional changes.
2022-07-12 23:11:35 +02:00
Andreas Kling
92e0378dbd LibJS: Always inline Lexer::current_code_point()
This gives a ~1% speedup when parsing the largest Discord JS file.
2022-02-13 14:44:36 +01:00
Andreas Kling
3d9f2e2f94 LibJS: Add ASCII fast path to Lexer::current_code_point()
If the current character under the lexer cursor is ASCII, we don't need
to create a Utf8View to consume a full code point.

This gives a ~3% speedup when parsing the largest Discord JS file.
2022-02-13 14:44:36 +01:00
Linus Groh
95a9f12b97 LibJS: Set Token's m_offset to the value's start index
This makes much more sense than the current way of setting it to the
Lexer's m_position after consuming the full value.
2022-01-19 20:33:08 +00:00
davidot
56c425eec1 LibJS: Detect invalid unicode and stop lexing at that point
Previously we might swallow invalid unicode point which would skip valid
ascii characters. This could be dangerous as we might skip a '"' thus
not closing a string where we should.
This might have been exploitable as it would not have been clear what
code gets executed when looking at a script.

Another approach to this would be simply replacing all invalid
characters with the replacement character (this is what v8 does). But
our lexer and parser are currently not set up for such a change.
2021-12-29 16:57:23 +01:00
davidot
a1308bfc60 LibJS: Make new lines in block comments reset line has token
Before this a closing html comment would not be treated as a comment if
directly following a block comment which was not the first token of its
first line.
2021-12-21 14:04:23 +01:00
davidot
e751dcea43 LibJS: Treat private identifier as divisible token
And also make sure private identifiers are correctly checked when
synthesizing a binding pattern.
2021-11-30 17:05:32 +00:00
davidot
afde1821b5 LibJS: Disallow numerical separators in octal numbers and after '.' 2021-11-30 17:05:32 +00:00
Idan Horowitz
681787de76 LibJS: Add support for async functions
This commit adds support for the most bare bones version of async
functions, support for async generator functions, async arrow functions
and await expressions are TODO.
2021-11-10 08:48:27 +00:00
davidot
eeb42c21d1 LibJS: Lex private identifiers, identifiers prefixed with a '#' 2021-10-20 23:19:17 +01:00
Nico Weber
b8dc3661ac Libraries: Fix -Wunreachable-code warnings from clang 2021-10-08 23:33:46 +02:00
davidot
ac2c3a73b1 LibJS: Add a specific test for invalid unicode characters in the lexer
Also fixes that it tried to make substrings past the end of the source
if we overran the source length.
2021-10-03 17:42:05 +02:00
Luke Wilde
ae0bdda86e LibJS: Remove read buffer overflow in Lexer::consume
The position is added to manually in the line terminator and Unicode
character cases. While it checks for EOF after doing so, the EOF check
used `!=` instead of `<`, meaning if the position went _over_ the
source length, it wouldn't think it was EOF and would cause read buffer
overflows.

For example, `0xea` followed by `0xfd` would cause this.
2021-10-02 17:16:09 +02:00
Andreas Kling
76bafe5542 LibJS: Always inline two hot (and trivial) functions in JS::Lexer
This improves parsing time on a large chunk of JS by ~5%.
2021-09-18 19:54:24 +02:00
Andreas Kling
8bde4e94d8 LibJS: Make Lexer::s_keywords store keywords as FlyString
This allows O(1) comparison against lexed keywords, since we lex to
FlyString.
2021-09-18 19:54:24 +02:00
Andreas Kling
bf46845819 LibJS: Avoid a temporary AK::String when lexing already-seen identifiers
By using the FlyString(StringView) constructor instead of the
FlyString(String) one, we can dodge a temporary String construction.

This improves parsing time on a large chunk of JS by ~1.6%.
2021-09-18 19:54:24 +02:00
Linus Groh
a50e33abe3 LibJS: Skip ID_{Start,Continue} property lookup for any ASCII characters
Before this change, Lexer::is_identifier_{start,middle}() would do a
Unicode property lookup via Unicode::code_point_has_property() quite
frequently, especially for common characters like .,;{}[]() etc.

Since these and any other ASCII characters not covered by the alpha /
alphanumeric check are known to not have the ID_Start / ID_Continue
(except '_', which is special-cased now) properties, we can easily
avoid this function call.
2021-09-14 02:48:57 +02:00
Andreas Kling
d7578ddebb LibJS: Share "parsed identifiers" between copied JS::Lexer instances
When we save/load state in the parser, we preserve the lexer state by
simply making a copy of it. This was made extremely heavy by the lexer
keeping a cache of all parsed identifiers.

It keeps the cache to ensure that StringViews into parsed Unicode escape
sequences don't become dangling views when the Token goes out of scope.

This patch solves the problem by replacing the Vector<FlyString> which
was used to cache the identifiers with a ref-counted
HashTable<FlyString> instead.

Since the purpose of the cache is just to keep FlyStrings alive, it's
fine for all Lexer instances to share the cache. And as a bonus, using a
HashTable instead of a Vector replaces the O(n) accesses with O(1) ones.

This makes a 1.9 MiB JavaScript file parse in 0.6s instead of 24s. :^)
2021-09-10 23:18:00 +02:00
davidot
bbddfeef4b LibJS: Clean up token constructor and use method instead for identifiers
Having two large constructor with just one parameter difference in the
middle seems quite dangerous so just do it with a method.
2021-09-06 08:43:38 +01:00
Brian Gianforcaro
77d8a65498 LibJS: Fix incorrect Lexer VERIFY when parsing Unicode characters
This bug was discovered via OSS fuzz, it's possible to fall through
to this assert with a char_size == 1, so we need to account for that
in the VERIFY(..).

Repro test case can be found in the OSS fuzz bug:
https://bugs.chromium.org/p/oss-fuzz/issues/detail?id=37296
2021-08-25 09:21:23 +01:00
davidot
c108c8ff24 LibJS: Disallow yield expression correctly in formal parameters
And add ZERO WIDTH NO BREAK SPACE to valid whitespace.
2021-08-24 07:42:37 +01:00
davidot
7bcffd1b6a LibJS: Fix some small remaining issues with parsing unicode escapes
Added a test to ensure the behavior stays the same.
We now throw on a direct usage of an escaped keywords with a specific
error to make it more clear to the user.
2021-08-24 07:42:37 +01:00
Timothy Flynn
1259dc3623 LibJS: Allow Unicode escape sequences in identifiers
For example, "property.br\u{64}wn" should resolve to "property.brown".

To support this behavior, this commit changes the Token class to hold
both the evaluated identifier name and a view into the original source
for the unevaluated name. There are some contexts in which identifiers
are not allowed to contain Unicode escape sequences; for example, export
statements of the form "export {} from foo.js" forbid escapes in the
identifier "from".

The test file is added to .prettierignore because prettier will replace
all escaped Unicode sequences with their unescaped value.
2021-08-19 23:49:25 +02:00
davidot
47bc72bcf6 LibJS: Correctly handle Unicode characters in JS source text
Also recognize additional white space characters.
2021-08-16 23:20:04 +01:00
davidot
106f9e30d7 LibJS: Force the lexer to parse a regex when expecting a statement 2021-08-16 23:20:04 +01:00
davidot
4cc95ae39d LibJS: Fix that a windows-style new line was not escaped properly 2021-08-16 23:20:04 +01:00
davidot
7613c22b06 LibJS: Add a mode to parse JS as a module
In a module strict mode should be enabled at the start of parsing and we
allow import and export statements.
2021-08-15 23:51:47 +01:00
Ali Mohammad Pur
1a9518ebe3 LibJS: Implement parsing and evaluation for AssignmentPatterns
e.g. `[...foo] = bar` can now be evaluated :^)
2021-07-11 21:41:54 +01:00
Ali Mohammad Pur
0292ad33eb LibJS: Make a slash after a curly close mean not-division
There's no grammar rule that allows this.
2021-07-02 14:59:03 +02:00
Andreas Kling
49018553d3 LibJS+LibCrypto: Allow '_' as a numeric literal separator :^)
This patch adds support for the NumericLiteralSeparator concept from
the ECMAScript grammar.
2021-06-26 16:30:35 +02:00
Linus Groh
714a96619f LibJS: Disallow whitespace or comments between regex literal and flags
If we consumed whitespace and/or comments after a RegexLiteral token,
the following token must not be RegexFlags - no whitespace or comments
are allowed between the closing / and the flag characters.

Fixes #8201.
2021-06-22 14:08:40 +01:00
Linus Groh
597cf88c08 LibJS: Implement the 'Hashbang Grammar for JS' proposal
Stage 3 since August 2019 - we already have shebang stripping
implemented in js(1), so this removes it from there in favor of adding
support to the lexer directly.

Most straightforward proposal and implementation I've ever seen :^)

https://github.com/tc39/proposal-hashbang
2021-06-18 20:35:23 +01:00
Idan Horowitz
690eb3bb8a LibJS: Add support for hex, octal & binary big integer literals 2021-06-14 01:45:04 +01:00
Andreas Kling
39ad705c13 LibJS: Use the new is_ascii_foo() helpers from AK
These constexpr helpers generate nicer code than the LibC ctype.h
variants, so let's make use of them. :^)
2021-06-13 19:11:29 +02:00
Gunnar Beutner
d476144565 Userland: Allow building SerenityOS with -funsigned-char
Some of the code assumed that chars were always signed while that is
not the case on ARM hosts.

Also, some of the code tried to use EOF (-1) in a way similar to what
fgetc() does, however instead of storing the characters in an int
variable a char was used.

While this seemed to work it also meant that character 0xFF would be
incorrectly seen as an end-of-file.

Careful reading of fgetc() reveals that fgetc() stores character
data in an int where valid characters are in the range of 0-255 and
the EOF value is explicitly outside of that range (usually -1).
2021-06-13 18:52:58 +02:00
Stephan Unverwerth
10ceeb092f Everywhere: Use s.unverwerth@serenityos.org :^) 2021-05-29 12:30:08 +01:00
Linus Groh
ebdeed087c Everywhere: Use linusg@serenityos.org for my copyright headers 2021-04-22 22:51:19 +02:00
Brian Gianforcaro
1682f0b760 Everything: Move to SPDX license identifiers in all files.
SPDX License Identifiers are a more compact / standardized
way of representing file license information.

See: https://spdx.dev/resources/use/#identifiers

This was done with the `ambr` search and replace tool.

 ambr --no-parent-ignore --key-from-file --rep-from-file key.txt rep.txt *
2021-04-22 11:22:27 +02:00
Linus Groh
a178255a8b LibJS: Use 'if constexpr' / dbgln_if() instead of '#if LEXER_DEBUG' 2021-04-18 18:14:50 +02:00
Jean-Baptiste Boric
0039ecb189 LibJS: Keep track of file names, lines and columns inside the AST 2021-03-01 11:14:36 +01:00
Andreas Kling
635a5eec75 LibJS: Remove a whole bunch of unnecessary #includes 2021-02-10 09:13:29 +01:00