20 commits

Author SHA1 Message Date
Evan Smal
3226ce3d83 LibJS: Remove some usage of DeprecatedString from Lexer
This changes the filename member from DeprecatedString to String. Parser
has also been updated to meet the updated Lexer interface.
2023-01-26 20:25:25 +00:00
Timothy Flynn
f3db548a3d AK+Everywhere: Rename FlyString to DeprecatedFlyString
DeprecatedFlyString relies heavily on DeprecatedString's StringImpl, so
let's rename it so that we can A) match the name of DeprecatedString, and
B) write a new FlyString class that is tied to String.
2023-01-09 23:00:24 +00:00
Linus Groh
6e19ab2bbc AK+Everywhere: Rename String to DeprecatedString
We have a new, improved string type coming up in AK (OOM aware, no null
state), and while it's going to use UTF-8, the name UTF8String is a
mouthful - so let's free up the String name by renaming the existing
class.
Making the old one have an annoying name will hopefully also help with
quick adoption :^)
2022-12-06 08:54:33 +01:00
Andreas Kling
b0b022507b LibJS: Reduce AST memory usage by shrink-wrapping source range info
Before this change, each AST node had a 64-byte SourceRange member.
This SourceRange had the following layout:

    filename:       StringView (16 bytes)
    start:          Position (24 bytes)
    end:            Position (24 bytes)

The Position structs have { line, column, offset }, all members size_t.

To reduce memory consumption, AST nodes now only store the following:

    source_code:    NonnullRefPtr<SourceCode> (8 bytes)
    start_offset:   u32 (4 bytes)
    end_offset:     u32 (4 bytes)

SourceCode is a new ref-counted data structure that keeps the filename
and original parsed source code in a single location, and all AST nodes
have a pointer to it.

The start_offset and end_offset can be turned into (line, column) when
necessary by calling SourceCode::range_from_offsets(). This will walk
the source code string and compute line/column numbers on the fly, so
it's not necessarily fast, but it should be rare since this information
is primarily used for diagnostics and exception stack traces.

With this, ASTNode shrinks from 80 bytes to 32 bytes. This gives us a
~23% reduction in memory usage when loading twitter.com/awesomekling
(330 MiB before, 253 MiB after!) :^)
2022-11-22 21:13:35 +01:00
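The on-demand offset-to-position conversion described above can be sketched roughly as follows. This is a minimal illustration in standard C++, not the LibJS implementation: `LineColumn` and `line_column_from_offset` are hypothetical names, positions are assumed to be 1-based, and only '\n' is treated as a line terminator.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <string_view>

// Hypothetical type for illustration; not the LibJS Position struct.
struct LineColumn {
    std::size_t line { 1 };
    std::size_t column { 1 };
};

// Walks the source from the start and counts newlines up to `offset`.
// The cost is linear in `offset`, which is acceptable when the result is
// only needed for diagnostics and stack traces.
LineColumn line_column_from_offset(std::string_view source, std::uint32_t offset)
{
    assert(offset <= source.size());
    LineColumn position;
    for (std::uint32_t i = 0; i < offset; ++i) {
        if (source[i] == '\n') {
            ++position.line;
            position.column = 1;
        } else {
            ++position.column;
        }
    }
    return position;
}
```

Storing two 32-bit offsets and recomputing the position lazily trades a little CPU time at error-reporting time for a smaller footprint on every node.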
sin-ack
3f3f45580a Everywhere: Add sv suffix to strings relying on StringView(char const*)
Each of these strings would previously rely on StringView's char const*
constructor overload, which would call __builtin_strlen on the string.
Since we now have operator ""sv, we can replace these with much simpler
versions. This opens the door to being able to remove
StringView(char const*).

No functional changes.
2022-07-12 23:11:35 +02:00
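As a rough standard-library analogue (AK::StringView is a separate class, so this is only an illustration), `std::string_view` offers the same two construction paths. The function names below are made up for the sketch.

```cpp
#include <cstddef>
#include <string_view>

using namespace std::literals; // brings operator""sv into scope

// The char const* constructor has to scan for the terminating NUL,
// so it stops at the first embedded '\0'.
std::size_t length_via_c_string_constructor()
{
    std::string_view view("a\0b");
    return view.size();
}

// The literal operator receives the length from the compiler,
// so no scan is needed and embedded NULs are preserved.
std::size_t length_via_sv_literal()
{
    constexpr std::string_view view = "a\0b"sv;
    return view.size();
}
```

For literals without embedded NUL bytes both forms yield the same view, which is consistent with the change having no functional effect.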
davidot
56c425eec1 LibJS: Detect invalid unicode and stop lexing at that point
Previously we might swallow an invalid Unicode code point, which would
skip valid ASCII characters. This could be dangerous as we might skip a
'"', thus not closing a string where we should.
This might have been exploitable as it would not have been clear what
code gets executed when looking at a script.

Another approach to this would be simply replacing all invalid
characters with the replacement character (this is what v8 does). But
our lexer and parser are currently not set up for such a change.
2021-12-29 16:57:23 +01:00
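One way to detect the point at which lexing should stop is to validate the input as UTF-8 and report the offset of the first malformed sequence. The sketch below assumes UTF-8 input and uses a hypothetical function name; it is not taken from the LibJS lexer.

```cpp
#include <cstddef>
#include <cstdint>
#include <optional>
#include <string_view>

// Returns the byte offset of the first malformed UTF-8 sequence,
// or std::nullopt if the whole input is well-formed.
std::optional<std::size_t> first_invalid_utf8_offset(std::string_view source)
{
    // Smallest code point that legitimately needs a sequence of each length.
    static constexpr std::uint32_t minimum_for_length[] = { 0, 0, 0x80, 0x800, 0x10000 };

    std::size_t i = 0;
    while (i < source.size()) {
        auto lead = static_cast<unsigned char>(source[i]);
        if (lead < 0x80) {
            ++i; // plain ASCII
            continue;
        }

        std::size_t length = 0;
        std::uint32_t code_point = 0;
        if ((lead & 0xE0) == 0xC0) {
            length = 2;
            code_point = lead & 0x1F;
        } else if ((lead & 0xF0) == 0xE0) {
            length = 3;
            code_point = lead & 0x0F;
        } else if ((lead & 0xF8) == 0xF0) {
            length = 4;
            code_point = lead & 0x07;
        } else {
            return i; // stray continuation byte or invalid lead byte
        }

        if (i + length > source.size())
            return i; // truncated sequence

        for (std::size_t k = 1; k < length; ++k) {
            auto continuation = static_cast<unsigned char>(source[i + k]);
            if ((continuation & 0xC0) != 0x80)
                return i;
            code_point = (code_point << 6) | (continuation & 0x3F);
        }

        bool overlong = code_point < minimum_for_length[length];
        bool surrogate = code_point >= 0xD800 && code_point <= 0xDFFF;
        if (overlong || surrogate || code_point > 0x10FFFF)
            return i;

        i += length;
    }
    return std::nullopt;
}
```

Stopping at the reported offset, instead of skipping a guessed number of bytes, means a following '"' can never be consumed as part of a broken sequence.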
Andreas Kling
8b1108e485 Everywhere: Pass AK::StringView by value
2021-11-11 01:27:46 +01:00
Andreas Kling
8bde4e94d8 LibJS: Make Lexer::s_keywords store keywords as FlyString
This allows O(1) comparison against lexed keywords, since we lex to
FlyString.
2021-09-18 19:54:24 +02:00
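The O(1) comparison follows from string interning: equal strings share one stored copy, so equality reduces to a pointer comparison. The sketch below shows the general idea in standard C++. `InternedString` is a hypothetical class, not AK's FlyString, and unlike a real implementation its pool is never pruned and is not thread-safe.

```cpp
#include <string>
#include <string_view>
#include <unordered_set>

// Hypothetical interned string for illustration only.
class InternedString {
public:
    explicit InternedString(std::string_view text)
        : m_text(&*pool().emplace(text).first) // reuse the pooled copy if one exists
    {
    }

    // Equal contents imply the same pooled object, so compare addresses.
    bool operator==(InternedString const& other) const { return m_text == other.m_text; }

    std::string const& text() const { return *m_text; }

private:
    static std::unordered_set<std::string>& pool()
    {
        // Elements of an unordered_set keep their address across rehashing.
        static std::unordered_set<std::string> strings;
        return strings;
    }

    std::string const* m_text;
};
```

The hashing cost is paid once when a string is interned; every later comparison against a keyword table entry is a single pointer check.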
Andreas Kling
d7578ddebb LibJS: Share "parsed identifiers" between copied JS::Lexer instances
When we save/load state in the parser, we preserve the lexer state by
simply making a copy of it. This was made extremely heavy by the lexer
keeping a cache of all parsed identifiers.

It keeps the cache to ensure that StringViews into parsed Unicode escape
sequences don't become dangling views when the Token goes out of scope.

This patch solves the problem by replacing the Vector<FlyString> which
was used to cache the identifiers with a ref-counted
HashTable<FlyString> instead.

Since the purpose of the cache is just to keep FlyStrings alive, it's
fine for all Lexer instances to share the cache. And as a bonus, using a
HashTable instead of a Vector replaces the O(n) accesses with O(1) ones.

This makes a 1.9 MiB JavaScript file parse in 0.6s instead of 24s. :^)
2021-09-10 23:18:00 +02:00
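The sharing scheme can be sketched with standard containers, using `std::shared_ptr` in place of a ref-counted AK type. `SketchLexer` and its members are hypothetical names; the point is only that copying the lexer copies a pointer, not the cache.

```cpp
#include <cstddef>
#include <memory>
#include <string>
#include <string_view>
#include <unordered_set>
#include <utility>

// Hypothetical lexer skeleton; only the identifier cache is modelled.
class SketchLexer {
public:
    SketchLexer()
        : m_parsed_identifiers(std::make_shared<std::unordered_set<std::string>>())
    {
    }

    // Stores the identifier (if new) and returns a view that stays valid
    // for as long as any lexer copy keeps the shared cache alive.
    std::string_view remember_identifier(std::string identifier)
    {
        auto result = m_parsed_identifiers->insert(std::move(identifier));
        return *result.first;
    }

    std::size_t cached_identifier_count() const { return m_parsed_identifiers->size(); }

    bool shares_cache_with(SketchLexer const& other) const
    {
        return m_parsed_identifiers == other.m_parsed_identifiers;
    }

private:
    // Copying a SketchLexer copies this pointer only, so save/load of
    // parser state no longer duplicates every cached identifier.
    std::shared_ptr<std::unordered_set<std::string>> m_parsed_identifiers;
};
```

Because the cache only needs to keep strings alive, it does not matter which copy inserted an entry, and a hash set also deduplicates repeated identifiers.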
davidot
7bcffd1b6a LibJS: Fix some small remaining issues with parsing unicode escapes
Added a test to ensure the behavior stays the same.
We now throw a specific error on direct usage of an escaped keyword, to
make the problem clearer to the user.
2021-08-24 07:42:37 +01:00
Timothy Flynn
1259dc3623 LibJS: Allow Unicode escape sequences in identifiers
For example, "property.br\u{6f}wn" should resolve to "property.brown".

To support this behavior, this commit changes the Token class to hold
both the evaluated identifier name and a view into the original source
for the unevaluated name. There are some contexts in which identifiers
are not allowed to contain Unicode escape sequences; for example, export
statements of the form "export {} from foo.js" forbid escapes in the
identifier "from".

The test file is added to .prettierignore because prettier will replace
all escaped Unicode sequences with their unescaped value.
2021-08-19 23:49:25 +02:00
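Evaluating an identifier that contains escapes can be sketched as below. This is an illustration with a hypothetical function name, not the LibJS code: it decodes `\uXXXX` and `\u{...}` into UTF-8, and it does not combine surrogate pairs or check that the decoded code points are valid identifier characters.

```cpp
#include <cstddef>
#include <cstdint>
#include <optional>
#include <string>
#include <string_view>

static std::optional<std::uint32_t> hex_digit_value(char c)
{
    if (c >= '0' && c <= '9') return c - '0';
    if (c >= 'a' && c <= 'f') return c - 'a' + 10;
    if (c >= 'A' && c <= 'F') return c - 'A' + 10;
    return std::nullopt;
}

static void append_utf8(std::string& out, std::uint32_t cp)
{
    if (cp < 0x80) {
        out.push_back(static_cast<char>(cp));
    } else if (cp < 0x800) {
        out.push_back(static_cast<char>(0xC0 | (cp >> 6)));
        out.push_back(static_cast<char>(0x80 | (cp & 0x3F)));
    } else if (cp < 0x10000) {
        out.push_back(static_cast<char>(0xE0 | (cp >> 12)));
        out.push_back(static_cast<char>(0x80 | ((cp >> 6) & 0x3F)));
        out.push_back(static_cast<char>(0x80 | (cp & 0x3F)));
    } else {
        out.push_back(static_cast<char>(0xF0 | (cp >> 18)));
        out.push_back(static_cast<char>(0x80 | ((cp >> 12) & 0x3F)));
        out.push_back(static_cast<char>(0x80 | ((cp >> 6) & 0x3F)));
        out.push_back(static_cast<char>(0x80 | (cp & 0x3F)));
    }
}

// Returns the evaluated name, or std::nullopt on a malformed escape.
std::optional<std::string> evaluate_identifier_escapes(std::string_view raw)
{
    std::string evaluated;
    std::size_t i = 0;
    while (i < raw.size()) {
        if (raw[i] != '\\') {
            evaluated.push_back(raw[i++]);
            continue;
        }
        if (i + 1 >= raw.size() || raw[i + 1] != 'u')
            return std::nullopt; // only \u escapes are allowed here
        i += 2;

        std::uint32_t code_point = 0;
        if (i < raw.size() && raw[i] == '{') {
            ++i;
            std::size_t digits = 0;
            while (i < raw.size() && raw[i] != '}') {
                auto value = hex_digit_value(raw[i]);
                if (!value)
                    return std::nullopt;
                code_point = code_point * 16 + *value;
                if (code_point > 0x10FFFF)
                    return std::nullopt;
                ++digits;
                ++i;
            }
            if (i >= raw.size() || digits == 0)
                return std::nullopt; // missing '}' or empty braces
            ++i; // skip '}'
        } else {
            if (i + 4 > raw.size())
                return std::nullopt;
            for (int n = 0; n < 4; ++n) {
                auto value = hex_digit_value(raw[i++]);
                if (!value)
                    return std::nullopt;
                code_point = code_point * 16 + *value;
            }
        }
        append_utf8(evaluated, code_point);
    }
    return evaluated;
}
```

A token can then keep both `raw` (a view into the source) and the evaluated string, so contexts that forbid escapes can compare the two. For instance, `br\u{6f}wn` evaluates to `brown` because U+006F is 'o'.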
davidot
47bc72bcf6 LibJS: Correctly handle Unicode characters in JS source text
Also recognize additional white space characters.
2021-08-16 23:20:04 +01:00
davidot
106f9e30d7 LibJS: Force the lexer to parse a regex when expecting a statement
2021-08-16 23:20:04 +01:00
davidot
7613c22b06 LibJS: Add a mode to parse JS as a module
In a module, strict mode should be enabled at the start of parsing, and
we allow import and export statements.
2021-08-15 23:51:47 +01:00
Andreas Kling
49018553d3 LibJS+LibCrypto: Allow '_' as a numeric literal separator :^)
This patch adds support for the NumericLiteralSeparator concept from
the ECMAScript grammar.
2021-06-26 16:30:35 +02:00
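The placement rule for separators (only between two digits) can be sketched for a plain run of decimal digits. The function name is hypothetical and the sketch is deliberately simplified: it ignores prefixes, fractions, and exponents, and it does not model the grammar's restriction on a separator after a leading 0.

```cpp
#include <optional>
#include <string>
#include <string_view>

// Returns the digits with separators removed, or std::nullopt if a
// separator is leading, trailing, or doubled, or if a non-digit appears.
std::optional<std::string> strip_numeric_separators(std::string_view literal)
{
    std::string digits;
    bool previous_was_digit = false;
    for (char c : literal) {
        if (c >= '0' && c <= '9') {
            digits.push_back(c);
            previous_was_digit = true;
        } else if (c == '_') {
            if (!previous_was_digit)
                return std::nullopt; // leading or doubled separator
            previous_was_digit = false;
        } else {
            return std::nullopt;
        }
    }
    if (!previous_was_digit)
        return std::nullopt; // empty input or trailing separator
    return digits;
}
```

Under these assumptions `1_000_000` is accepted and yields `1000000`, while `_1`, `1_`, and `1__0` are rejected.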
Gunnar Beutner
d476144565 Userland: Allow building SerenityOS with -funsigned-char
Some of the code assumed that chars were always signed while that is
not the case on ARM hosts.

Also, some of the code tried to use EOF (-1) in a way similar to what
fgetc() does; however, instead of storing the characters in an int
variable, a char was used.

While this seemed to work, it also meant that character 0xFF would be
incorrectly seen as an end-of-file.

Careful reading of fgetc() reveals that fgetc() stores character
data in an int where valid characters are in the range of 0-255 and
the EOF value is explicitly outside of that range (usually -1).
2021-06-13 18:52:58 +02:00
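The difference between the two storage types can be shown with a small sketch. The function names are made up for illustration, and the result of the `char` variant depends on whether plain `char` is signed on the target, which is exactly the portability problem described above.

```cpp
#include <cstdio>

// Narrowing to char first: where char is signed, byte 0xFF becomes -1 and
// compares equal to the usual EOF value; where char is unsigned, it becomes
// 255 and no stored value can ever equal EOF.
bool looks_like_eof_when_stored_in_char(unsigned char byte)
{
    char narrowed = static_cast<char>(byte);
    return narrowed == EOF;
}

// Keeping the value in an int, as fgetc() does: bytes stay in 0..255 and
// can never collide with the negative EOF value.
bool looks_like_eof_when_stored_in_int(unsigned char byte)
{
    int widened = byte;
    return widened == EOF;
}
```

Only the `int` variant behaves the same regardless of `-funsigned-char` or the host architecture.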
Stephan Unverwerth
10ceeb092f Everywhere: Use s.unverwerth@serenityos.org :^)
2021-05-29 12:30:08 +01:00
Brian Gianforcaro
1682f0b760 Everything: Move to SPDX license identifiers in all files.
SPDX License Identifiers are a more compact / standardized
way of representing file license information.

See: https://spdx.dev/resources/use/#identifiers

This was done with the `ambr` search and replace tool.

 ambr --no-parent-ignore --key-from-file --rep-from-file key.txt rep.txt *
2021-04-22 11:22:27 +02:00
Jean-Baptiste Boric
0039ecb189 LibJS: Keep track of file names, lines and columns inside the AST
2021-03-01 11:14:36 +01:00
Andreas Kling
13d7c09125 Libraries: Move to Userland/Libraries/
2021-01-12 12:17:46 +01:00
Renamed from Libraries/LibJS/Lexer.h