Instead of displaying locals as "locN", we now show them as "name~N".
This makes it a lot easier to follow bytecode dumps, especially in
longer functions.
Note that we keep displaying the local index, to avoid confusion in case
there are multiple separate locals with the same name in one executable.
Linking LibLocale publicly ensures that libicudata.a is also available
in all embedders of LibJS. Otherwise, ICU crashes in hard-to-track-down
ways at runtime when the data is not available.
The proposal has undergone quite a few normative changes since we last
synced with it. There was a time when it could not be implemented as it
was written, which is no longer the case. The resulting proposal has had
so many changes compared to our implementation, that it wouldn't make
sense to implement them commit-by-commit as we normally do. So instead,
this just implements the HEAD revision of the spec in one pass.
This uses ICU for the Intl.DateTimeFormat `format`, `formatToParts`,
`formatRange`, and `formatRangeToParts`.
This lets us remove most data from our date-time format generator. All
that remains are time zone data and locale week info, which other
interfaces still rely on. They will be removed in a future patch.
Note: All of the changes to the test files in this patch are now aligned
with other browsers. This includes:
* Some very incorrect formatting of Japanese symbols. (Looking at the
old results now, it's very obvious they were wrong.)
* Old FIXMEs regarding range formatting not including the start/end date
when only time fields were requested, but the dates differ.
* Day period inconsistencies.
Properties marked with the [[Unimplemented]] attribute behave as normal
but invoke the `VM::on_unimplemented_property_access` callback when
they are accessed.
The IntlMV is meant to be arbitrarily precise. If the user provides a
string value to be formatted, we lose precision by converting extremely
large values to a double. We were never able to address this, as support
for arbitrary precision was a big FIXME. But ICU can handle it by just
passing the raw string on through.
This uses ICU for the Intl.NumberFormat `formatRange` and
`formatRangeToParts` prototypes.
Note: All of the changes to the test files in this patch are now aligned
with both Chrome and Safari.
This uses ICU for the Intl.NumberFormat `format` and `formatToParts`
prototypes. It does not yet port the range formatter prototypes.
Most of the new code in LibLocale/NumberFormat is simply mapping from
ECMA-402 types to ICU types. Beyond that, the only algorithmic change is
that we have to mutate the output from ICU for `formatToParts` to match
what is expected by ECMA-402. This is explained in NumberFormat.cpp in
`flatten_partitions`.
This lets us remove most data from our number format generator. All that
remains are numbering system digits and symbols, which other interfaces
(e.g. Intl.DateTimeFormat) still rely on. They will be removed in a
future patch.
Note: All of the changes to the test files in this patch are now aligned
with both Chrome and Safari.
Note: We keep locale parsing and syntactic validation as-is. ECMA-402
places additional restrictions on locales above what is required by the
Unicode spec. ICU doesn't provide methods that let us easily check those
restrictions, whereas LibLocale does. Other browsers also implement
their own validators here.
This introduces a locale cache to re-use parsed locale data and various
related structures (not doing so has a non-negligible performance impact
on Intl tests).
The existing APIs for canonicalization and display names are pretty
intertwined, so they must both be adapted at once here. The results of
canonicalization are slightly different on some edge cases. But the
changed results are actually now aligned with Chrome and Safari.
NetBSD and FreeBSD get upset when we don't set the fd to an invalid
value when using a non-shared mapping.
Reported-By: Thomas Klausner <wiz@gatalith.at>
Instead of scanning through the list of seen constants, we now have a
more structured storage of the constants true, false, null, undefined,
and every possible Int32 value.
This fixes an O(n^2) issue found by Kraken/json-stringify-tinderbox.js
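A minimal sketch of the idea, assuming a hypothetical `ConstantPool` type
(the names here are illustrative and not the generator's real API): the
four singleton constants get a fixed slot each, and Int32 values are
looked up in a hash map, so adding a constant no longer scans the list.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <unordered_map>
#include <vector>

// Hypothetical constant kinds; only the cases named in the commit message.
enum class Kind { True, False, Null, Undefined, Int32 };

struct Constant {
    Kind kind;
    int32_t int_value; // only meaningful for Kind::Int32
};

class ConstantPool {
public:
    // Returns the index of an equal existing constant, or appends a new one.
    size_t add(Constant constant)
    {
        if (constant.kind == Kind::Int32) {
            auto it = m_int32_slots.find(constant.int_value);
            if (it != m_int32_slots.end())
                return it->second;
            size_t index = append(constant);
            m_int32_slots.emplace(constant.int_value, index);
            return index;
        }
        // true, false, null and undefined each own one fixed slot.
        size_t& slot = m_singleton_slots[static_cast<size_t>(constant.kind)];
        if (slot == unset)
            slot = append(constant);
        return slot;
    }

    size_t size() const { return m_constants.size(); }

private:
    static constexpr size_t unset = static_cast<size_t>(-1);

    size_t append(Constant constant)
    {
        m_constants.push_back(constant);
        return m_constants.size() - 1;
    }

    std::vector<Constant> m_constants;
    size_t m_singleton_slots[4] { unset, unset, unset, unset };
    std::unordered_map<int32_t, size_t> m_int32_slots;
};
```

Each lookup is a constant-time array access or an expected constant-time
hash lookup, instead of a scan over every constant seen so far.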
This turns expressions like `(2 + 3) * 8 / 2` into a constant (20)
at bytecode compilation time instead of generating instructions
to calculate the value.
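As a rough illustration only, here is a bottom-up fold over a toy
expression tree. The `Expr` type and the helper names are made up for
this sketch and do not correspond to the real AST or generator classes.

```cpp
#include <cassert>
#include <memory>
#include <utility>

// Hypothetical miniature AST: numeric literals and binary operators only.
struct Expr {
    char op; // 'n' for a numeric literal, otherwise '+', '-', '*' or '/'
    double value;
    std::unique_ptr<Expr> lhs;
    std::unique_ptr<Expr> rhs;
};

std::unique_ptr<Expr> number(double value)
{
    return std::unique_ptr<Expr>(new Expr { 'n', value, nullptr, nullptr });
}

std::unique_ptr<Expr> binary(char op, std::unique_ptr<Expr> lhs, std::unique_ptr<Expr> rhs)
{
    return std::unique_ptr<Expr>(new Expr { op, 0, std::move(lhs), std::move(rhs) });
}

// Returns true and writes the folded constant to `out` when the whole
// expression is built from literals; returns false otherwise.
bool try_fold(Expr const& expr, double& out)
{
    if (expr.op == 'n') {
        out = expr.value;
        return true;
    }
    double lhs = 0;
    double rhs = 0;
    if (!try_fold(*expr.lhs, lhs) || !try_fold(*expr.rhs, rhs))
        return false;
    switch (expr.op) {
    case '+': out = lhs + rhs; return true;
    case '-': out = lhs - rhs; return true;
    case '*': out = lhs * rhs; return true;
    case '/': out = lhs / rhs; return true;
    default: return false;
    }
}
```

Folding the tree for `(2 + 3) * 8 / 2` with this sketch yields 20, so a
generator could emit a single constant operand for it.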
This is a new Bytecode::Generator helper that takes an operand and
returns the same operand, or a copy of it, in case a copy is required
to preserve correct evaluation order.
This can be used in a bunch of places where we're worried about
clobbering some value after obtaining it.
Practically, locals are always copied, and temporary registers as well
as constants are returned as-is.
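The rule can be sketched like this. The `Generator` below is a
hypothetical stand-in that only records Mov instructions; the helper is
named `copy_if_needed_to_preserve_evaluation_order` purely for
illustration, and the real signature may differ.

```cpp
#include <cassert>
#include <cstdint>
#include <vector>

struct Operand {
    enum class Type { Register, Local, Constant } type;
    uint32_t index;
};

// Hypothetical stand-in for the bytecode generator.
struct Generator {
    struct Mov {
        Operand dst;
        Operand src;
    };
    std::vector<Mov> instructions;
    uint32_t next_register { 0 };

    Operand allocate_register() { return { Operand::Type::Register, next_register++ }; }

    Operand copy_if_needed_to_preserve_evaluation_order(Operand operand)
    {
        // Temporary registers and constants are returned as-is.
        if (operand.type != Operand::Type::Local)
            return operand;
        // A local may be modified by a later subexpression (e.g. `a++`),
        // so snapshot its current value into a fresh temporary.
        Operand copy = allocate_register();
        instructions.push_back({ copy, operand });
        return copy;
    }
};
```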
Functions that don't have a FunctionEnvironment will get their `this`
value from the ExecutionContext. This patch stops generating
ResolveThisBinding instructions at all for functions like that, and
instead pre-populates the `this` register when entering a new bytecode
executable.
We already have a dedicated register slot for `this`, so instead of
having ResolveThisBinding take a `dst` operand, just write the value
directly into the `this` register every time.
...that maintains a list of allocated execution contexts so malloc()
does not have to be called every time we need to get a new one.
20% improvement in Octane/typescript.js
16% improvement in Kraken/imaging-darkroom.js
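A simplified sketch of such a pool, with a placeholder
`ExecutionContext` and a hypothetical `ExecutionContextPool` class (the
real allocator is likely shaped differently): released contexts go onto
a free list and are handed back out before any new allocation happens.

```cpp
#include <cassert>
#include <memory>
#include <utility>
#include <vector>

struct ExecutionContext {
    int placeholder_state { 0 }; // stand-in for the real members
};

class ExecutionContextPool {
public:
    std::unique_ptr<ExecutionContext> allocate()
    {
        // Only hit the heap when the free list is empty.
        if (m_free.empty())
            return std::unique_ptr<ExecutionContext>(new ExecutionContext);
        auto context = std::move(m_free.back());
        m_free.pop_back();
        *context = ExecutionContext {}; // reset state before reuse
        return context;
    }

    void release(std::unique_ptr<ExecutionContext> context)
    {
        m_free.push_back(std::move(context));
    }

private:
    std::vector<std::unique_ptr<ExecutionContext>> m_free;
};
```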
By using a separate struct, we can avoid updating the AST node and
ECMAScriptFunctionObject constructors every time there is a need to
add or remove some additional information collected during parsing.
Allows skipping function environment allocation for non-arrow functions
if the only reason it is needed is to hold the `this` binding.
The parser is changed to do the following:
- If a function is an arrow function and uses `this`, then all functions
in the scope chain are marked to allocate a function environment for
the `this` binding.
- If a function uses `new.target`, then all functions in the scope chain
are marked to allocate a function environment.
`ordinary_call_bind_this()` is changed to put the `this` value in the
execution context when function environment allocation is skipped.
35% improvement in Octane/typescript.js
50% improvement in Octane/deltablue.js
19% improvement in Octane/raytrace.js
This allows us to skip allocating a function environment in cases where
it was previously impossible because the arguments object needed a
binding.
This change does not bring visible improvement in Kraken or Octane
benchmarks but seems useful to have anyway.
By doing this, we can remove all the special-case checks for `length`
from the generic GetById and GetByIdWithThis code paths, making every
other property lookup a bit faster.
Small progressions on most benchmarks, and some larger progressions:
- 12% on Octane/crypto.js
- 6% on Kraken/ai-astar.js
The VERIFY assumed that we either have a finally block which we already
visited through a previous boundary, or have no finally block
whatsoever. But since 301a1fc763 we propagate finalizers through unwind
scopes, breaking that assumption.
It wasn't safe to use addition_would_overflow(a, -b) to check if
subtraction (a - b) would overflow, since it doesn't cover this case.
I don't know why we didn't have subtraction_would_overflow(), so this
patch adds it. :^)
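For illustration, a standalone 32-bit version could look like the sketch
below. This is an assumption about the shape of the helper, not the
actual implementation. One input where negating `b` is itself a problem
is `b == INT32_MIN`, since `-b` is not representable; whether that is
the exact case this commit refers to is also an assumption.

```cpp
#include <cassert>
#include <cstdint>
#include <limits>

// Hypothetical standalone helper; the real one may be generic over types.
constexpr bool subtraction_would_overflow(int32_t a, int32_t b)
{
    // Widen to 64 bits so that the subtraction itself cannot overflow,
    // then check whether the result still fits in 32 bits.
    return static_cast<int64_t>(a) - static_cast<int64_t>(b) > std::numeric_limits<int32_t>::max()
        || static_cast<int64_t>(a) - static_cast<int64_t>(b) < std::numeric_limits<int32_t>::min();
}
```

Because `b` is never negated, `subtraction_would_overflow(0, INT32_MIN)`
is handled without any special casing.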
With this, only `ContinuePendingUnwind` needs to dynamically check if a
scheduled return needs to go through a `finally` block, making the
interpreter loop a bit nicer.
Instead of SetVariable having 2x2 modes for variable/lexical and
initialize/set, those 4 modes are now separate instructions, which
makes each instruction much less branchy.
The last completion value in a function is not exposed to the language,
since functions always either return something, or undefined.
Given this, we can avoid emitting code that propagates the completion
value from various statements, as long as we know we're generating code
for a context where the completion value is not accessible. In practical
terms, this means that function code gets to do less completion
shuffling, while global and eval code has to keep doing it.
This means that SetVariable instructions will now remember which
(relative) environment contains the targeted binding, letting it bypass
the full binding resolution machinery on subsequent accesses.
Merging registers, constants and locals into a single vector means:
- Better data locality
- No need to check type in Interpreter::get() and Interpreter::set()
which are very hot functions
Performance improvement is visible in almost all Octane and Kraken
tests.
Filling a typed array with an integer shouldn't have to go through the
generic Set for every element.
This knocks a 7% item down to <1% in profiles of Another World JS at
https://cyxx.github.io/another_js/
When comparing two numbers, we can avoid a lot of implicit type
conversion nonsense and go straight to comparison, saving time in the
most common case.
These were out-of-line because we had some ideas about marking
instruction streams PROT_READ only, but that seems pretty arbitrary and
there's a lot of performance to be gained by putting these inline.
Instead, generate bytecode to execute their AST nodes and save the
resulting operands inside the NewClass instruction.
Moving property expression evaluation to happen before NewClass
execution also moves the creation of the new private environment, and
its population with private members, to before NewClass (private
members should be visible during property evaluation).
Before:
- NewClass
After:
- CreatePrivateEnvironment
- AddPrivateName
- ...
- AddPrivateName
- NewClass
- LeavePrivateEnvironment
This patch stops emitting the BlockDeclarationInstantiation instruction
when there are no locals, and no function declarations in the scope.
We were spending 20% of CPU time on https://ventrella.com/Clusters/ just
creating empty environments for no reason.
Instead of relying on native stack overflows to kick us out of circular
proxy chains, we now keep track of the recursion depth and kick
ourselves out if it exceeds 10'000.
This fixes an issue where compiler tail/sibling call optimizations would
turn infinite recursion into infinite loops, and thus never hit a stack
overflow to kick itself out.
For whatever reason, we've only seen the issue on SerenityOS with UBSAN,
but it could theoretically happen on any platform.
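A toy model of the depth check, assuming a made-up `Object` type where a
non-null `proxy_target` marks a proxy (this is not the real ProxyObject
code): the explicit counter terminates a circular chain even if the
compiler turns the recursion into a loop.

```cpp
#include <cassert>
#include <cstddef>
#include <stdexcept>

struct Object {
    Object* proxy_target { nullptr }; // non-null means "this is a proxy"
    int value { 0 };
};

constexpr size_t max_proxy_depth = 10'000;

int get_value(Object const& object, size_t depth = 0)
{
    // Bail out explicitly instead of waiting for a native stack overflow.
    if (depth > max_proxy_depth)
        throw std::runtime_error("proxy chain is too deep");
    if (object.proxy_target)
        return get_value(*object.proxy_target, depth + 1);
    return object.value;
}
```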
If the minimum number of required bindings is known in advance, it can
be used to ensure capacity, avoiding resizes of the internal vector
that holds bindings.
By doing that, all instructions required for instantiation are emitted
once at compilation and then reused for subsequent calls, instead of
running the generic instantiation process for each call.
Instead of storing a full JS::Completion for the "throw completion"
case, we now store a simple JS::Value (wrapped in ErrorValue for the
type system).
This shrinks TCO<void> and TCO<Value> (the most common TCOs) by 8 bytes,
which has a non-trivial impact on performance.
If the callee is already a temporary register, we don't need to copy it
to *another* temporary before evaluating arguments. None of the
arguments will clobber the existing temporary anyway.
We only need to make copies of locals here, in case the locals are
modified by something like increment/decrement expressions.
Registers and constants can slip right through, without being Mov'ed
into a temporary first.
We know that `undefined` in the global scope is always the proper
undefined value. This commit takes advantage of that by simply emitting
a constant undefined value instead.
Unfortunately we can't be so sure in other scopes.
This helps some of the Cloudflare Turnstile stuff run faster, since they
are deliberately screwing with JS engines by asking us to do a bunch of
bitwise operations on e.g. 65535.56
By rounding such values in bytecode generation, the interpreter can stay
on the happy path while executing, and finish quite a bit faster.
We now fuse sequences like [LessThan, JumpIf] to JumpLessThan.
This is only allowed for temporaries (i.e. VM registers) with no other
references to them.
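A sketch of such a peephole pass over a flat instruction list. The
`Instruction` layout and the `use_count` table are assumptions made for
this example; the real pass works on the generator's own data
structures.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

enum class Op { LessThan, JumpIf, JumpLessThan };

// Hypothetical flat instruction record; fields an opcode does not use stay zero.
struct Instruction {
    Op op { Op::LessThan };
    uint32_t dst { 0 };       // LessThan: register receiving the result
    uint32_t lhs { 0 };       // LessThan / JumpLessThan: left operand
    uint32_t rhs { 0 };       // LessThan / JumpLessThan: right operand
    uint32_t condition { 0 }; // JumpIf: register being tested
    uint32_t true_target { 0 };
    uint32_t false_target { 0 };
};

// `use_count[r]` is the number of instructions that read register r.
std::vector<Instruction> fuse_compare_and_jump(
    std::vector<Instruction> const& input, std::vector<uint32_t> const& use_count)
{
    std::vector<Instruction> output;
    for (size_t i = 0; i < input.size(); ++i) {
        Instruction const& current = input[i];
        if (current.op == Op::LessThan && i + 1 < input.size()) {
            Instruction const& next = input[i + 1];
            bool feeds_jump = next.op == Op::JumpIf && next.condition == current.dst;
            // Fuse only when the jump is the sole reader of the result.
            if (feeds_jump && use_count[current.dst] == 1) {
                Instruction fused;
                fused.op = Op::JumpLessThan;
                fused.lhs = current.lhs;
                fused.rhs = current.rhs;
                fused.true_target = next.true_target;
                fused.false_target = next.false_target;
                output.push_back(fused);
                ++i; // skip the JumpIf that was folded in
                continue;
            }
        }
        output.push_back(current);
    }
    return output;
}
```

If the comparison result has another reader, the pair is left alone,
since dropping the LessThan would leave that reader without a value.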
Before this change, switch codegen would interleave bytecode like this:
(test for case 1)
(code for case 1)
(test for case 2)
(code for case 2)
This meant that we often had to make many large jumps while looking for
the matching case, since code for each case can be huge.
It now looks like this instead:
(test for case 1)
(test for case 2)
(code for case 1)
(code for case 2)
This way, we can just fall through the tests until we hit one that fits,
without having to make any large jumps.
This removes a layer of indirection in the bytecode where we had to make
sure all the initializer elements were laid out in sequential registers.
Array expressions no longer clobber registers permanently, and they can
be reused immediately afterwards.
This patch adds a register freelist to Bytecode::Generator and switches
all operands inside the generator to a new ScopedOperand type that is
ref-counted and automatically frees the register when nothing uses it.
This dramatically reduces the size of bytecode executable register
windows, which were often in the several thousands of registers for
large functions. Most functions now use fewer than 100 registers.
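The mechanism can be approximated with a `std::shared_ptr` and a custom
deleter. This is only a sketch: `RegisterAllocator` and `ScopedRegister`
are hypothetical names, and the real ScopedOperand is its own
ref-counted type rather than a shared_ptr.

```cpp
#include <cassert>
#include <cstdint>
#include <memory>
#include <vector>

class RegisterAllocator {
public:
    // Ref-counted handle; the register index goes back onto the freelist
    // when the last copy of the handle is destroyed.
    using ScopedRegister = std::shared_ptr<uint32_t const>;

    ScopedRegister allocate()
    {
        uint32_t index;
        if (!m_free.empty()) {
            index = m_free.back();
            m_free.pop_back();
        } else {
            index = m_next++;
        }
        return ScopedRegister(new uint32_t(index), [this](uint32_t const* reg) {
            m_free.push_back(*reg);
            delete reg;
        });
    }

    // Size of the register window needed so far.
    uint32_t register_count() const { return m_next; }

private:
    std::vector<uint32_t> m_free;
    uint32_t m_next { 0 };
};
```

Handles must not outlive the allocator in this sketch, since the deleter
captures `this`.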
The first block in every executable will always execute first, so if it
ends up doing a ResolveThisBinding, it's fine for all other blocks
within the same executable to use the same `this` value.
In the common case, parseInt() is being called with strings that don't
have leading whitespace. By checking for that first, we can avoid the
cost of trimming and multiple malloc/GC allocations.
Once executed, this instruction will always produce the same result
in subsequent executions, so it's okay to cache it.
Unfortunately it may throw, so we can't just hoist it to the top of
every executable, since that would break observable execution order.
If we don't have enough stack space, throw an exception while we still
can, and give the caller a chance to recover.
This particular problem will go away once we make calls non-recursive.
This means inlining all the things. This yields a 40% speedup on the for
loop microbenchmark, and everything else gets faster as well. :^)
This makes compilation take foreeeever with GCC, so I'm only enabling it
for Clang in this commit. We should figure out how to make GCC compile
this without timing out CI, since the speedup is amazing.
This commit converts the main loop in Bytecode::Interpreter to use a
label table and computed goto for fast instruction dispatch.
This yields roughly 35% speedup on the for loop microbenchmark,
and makes everything else faster as well. :^)
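For readers unfamiliar with the technique, here is a toy interpreter
using a label table and computed goto. It relies on the "labels as
values" GNU extension (GCC and Clang only), and the three-instruction
set is invented for this sketch; it is not the real instruction set.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

enum Op : uint8_t { Increment, JumpIfLessThan, Halt };

struct Instruction {
    Op op;
    int32_t operand;
    uint32_t target;
};

int32_t run(std::vector<Instruction> const& program)
{
    // One label address per opcode, indexed by the opcode value.
    static void* const labels[] = { &&handle_Increment, &&handle_JumpIfLessThan, &&handle_Halt };
    int32_t accumulator = 0;
    size_t pc = 0;

// Jump straight to the handler of the instruction at `pc`.
#define DISPATCH() goto *labels[program[pc].op]

    DISPATCH();

handle_Increment:
    accumulator += program[pc].operand;
    ++pc;
    DISPATCH();

handle_JumpIfLessThan:
    pc = accumulator < program[pc].operand ? program[pc].target : pc + 1;
    DISPATCH();

handle_Halt:
    return accumulator;

#undef DISPATCH
}
```

Each handler ends with its own indirect jump, rather than returning to a
shared `switch` at the top of a loop.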
Now that the interpreter is unrolled, we can advance the program counter
manually based on the current instruction type.
This makes most instructions a bit smaller. :^)
This commit adds a HANDLE_INSTRUCTION macro that expands to everything
needed to handle a single instruction (invoking the handler function,
checking for exceptions, and advancing the program counter).
This gives a ~15% speed-up on a for loop microbenchmark, and makes
basically everything faster.
Instead of storing a BasicBlock* and forcing the size of Label to be
sizeof(BasicBlock*), we now store the basic block index as a u32.
This means the final version of the bytecode is able to keep labels
at sizeof(u32), shrinking the size of many instructions. :^)
Instead of storing source offsets with each instruction, we now keep
them in a side table in Executable.
This shrinks each instruction by 8 bytes, further improving locality.
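One possible shape for such a side table, assuming it is kept sorted by
bytecode offset (the `SourceRecord` and `SourceMap` names are
hypothetical): a lookup is then a binary search for the last record
that starts at or before the given offset.

```cpp
#include <algorithm>
#include <cassert>
#include <cstdint>
#include <vector>

struct SourceRecord {
    uint32_t bytecode_offset; // first bytecode offset this record covers
    uint32_t source_start;
    uint32_t source_end;
};

// Side table, sorted by bytecode_offset.
struct SourceMap {
    std::vector<SourceRecord> records;

    SourceRecord const* find(uint32_t bytecode_offset) const
    {
        // Find the first record starting after the offset, then step back one.
        auto it = std::upper_bound(records.begin(), records.end(), bytecode_offset,
            [](uint32_t offset, SourceRecord const& record) {
                return offset < record.bytecode_offset;
            });
        if (it == records.begin())
            return nullptr;
        return &*(it - 1);
    }
};
```

The lookup cost only matters on slow paths such as building error stack
traces, while every instruction gets smaller.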
Instead of keeping bytecode as a set of disjoint basic blocks on the
malloc heap, bytecode is now a contiguous sequence of bytes(!)
The transformation happens at the end of Bytecode::Generator::generate()
and the only really hairy part is rerouting jump labels.
This required solving a few problems:
- The interpreter execution loop had to change quite a bit, since we
were storing BasicBlock pointers all over the place, and control
transfer was done by redirecting the interpreter's current block.
- Exception handlers & finalizers are now stored per-bytecode-range
in a side table in Executable.
- The interpreter now has a plain program counter instead of a stream
iterator. This actually makes error stack generation a bit nicer
since we just have to deal with a number instead of reaching into
the iterator.
This yields a 25% performance improvement on this microbenchmark:
for (let i = 0; i < 1_000_000; ++i) { }
But basically everything gets faster. :^)
Before this change, all JumpFoo instructions inherited from Jump, which
forced the unconditional Jump to have an unused "false target" member.
Also, labels were unnecessarily wrapped in Optional<>.
By defining each jump instruction separately, they all shrink in size,
and all ambiguity is removed.
If all lexical declarations use local variables, then there is no need
to allocate a declarative environment.
With this change we skip ~3x more environment allocations on GitHub.
We already had fast access to own properties via shape-based IC.
This patch extends the mechanism to properties on the prototype chain,
using the "validity cell" technique from V8.
- Prototype objects now have a unique shape
- Each prototype has an associated PrototypeChainValidity
- When a prototype shape is mutated, every prototype shape "below" it
in any prototype chain is invalidated.
- Invalidation happens by marking the validity object as invalid,
and then replacing it with a new validity object.
- Property caches keep a pointer to the last seen valid validity.
If there is no validity, or the validity is invalid, the cache
misses and gets repopulated.
This is very helpful when using JavaScript to access DOM objects,
as we frequently have to traverse 4+ prototype objects before finding
the property we're interested in on e.g. EventTarget or Node.
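A heavily simplified sketch of the validity mechanism, reduced to a
single prototype and a single cached property. The types below are
illustrative only: the shape check and the propagation to prototypes
further down the chain are left out.

```cpp
#include <cassert>
#include <memory>

struct PrototypeChainValidity {
    bool valid { true };
};

struct Prototype {
    std::shared_ptr<PrototypeChainValidity> validity = std::make_shared<PrototypeChainValidity>();
    int property_value { 0 };

    void mutate(int new_value)
    {
        property_value = new_value;
        // Mark the old validity object invalid so caches holding it miss,
        // then replace it with a fresh one.
        validity->valid = false;
        validity = std::make_shared<PrototypeChainValidity>();
    }
};

struct PropertyCache {
    std::shared_ptr<PrototypeChainValidity> validity;
    int cached_value { 0 };
    int misses { 0 };

    int get(Prototype const& prototype)
    {
        // Hit: we have a validity object and it is still valid.
        if (validity && validity->valid)
            return cached_value;
        // Miss: do the slow lookup and repopulate the cache.
        ++misses;
        cached_value = prototype.property_value;
        validity = prototype.validity;
        return cached_value;
    }
};
```

The cache never has to be found and cleared by the mutating code; it
notices the invalidation on its next access.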
We can now tell the difference between an own property access and a
subsequent (automatic) prototype chain access.
This will be used to implement caching of prototype chain accesses.
If a function has the following properties:
- uses only local variables and registers
- does not use `this`
- does not use `new.target`
- does not use `super`
- does not use direct eval() calls
then it is possible to entirely skip function environment allocation
because it will never be used.
This change gathers information about whether a function needs to
access `this` from the environment and updates
`prepare_for_ordinary_call()` to skip allocation when possible.
For now, this optimisation is too aggressively blocked; e.g. if `this`
is used in a function scope, then all functions in outer scopes have to
allocate an environment. It could be improved in the future, although
this implementation already allows skipping >80% of environment
allocations on Discord, GitHub and Twitter.
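Expressed as a predicate, the conditions listed above look roughly like
this. The `FunctionFlags` struct and its field names are hypothetical
and do not match the parser's real data structures.

```cpp
#include <cassert>

// Hypothetical per-function flags gathered during parsing.
struct FunctionFlags {
    bool uses_only_locals_and_registers;
    bool uses_this;
    bool uses_new_target;
    bool uses_super;
    bool contains_direct_eval;
};

constexpr bool can_skip_function_environment(FunctionFlags const& flags)
{
    // Every condition from the list must hold at once.
    return flags.uses_only_locals_and_registers
        && !flags.uses_this
        && !flags.uses_new_target
        && !flags.uses_super
        && !flags.contains_direct_eval;
}
```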
This does two things:
* Clear exceptions when transferring control out of a finalizer
Otherwise they would resurface at the end of the next finalizer
(see the new test case), or at the end of a function
* Pop one scheduled jump when transferring control out of a finalizer
This removes one old FIXME
Before this change both ExecutionContext and CallFrame were created
before executing function/module/script with a couple exceptions:
- executable created for default function argument evaluation has to
run in function's execution context.
- `execute_ast_node()` where executable compiled for ASTNode has to be
executed in running execution context.
This change moves all members previously owned by CallFrame into
ExecutionContext, and makes two exceptions where an executable that does
not have a corresponding execution context saves and restores registers
before running.
Now, all execution state lives in a single entity, which makes it a bit
easier to reason about and opens opportunities for optimizations, such
as moving registers and local variables into a single array.
For this case to work correctly in the current bytecode world:
func(a, a++)
We have to put the function arguments in temporaries instead of allowing
the postfix increment to modify `a` in place.
This fixes a problem where jQuery.each() would skip over items.
The following command was used to clang-format these files:
clang-format-18 -i $(find . \
-not \( -path "./\.*" -prune \) \
-not \( -path "./Base/*" -prune \) \
-not \( -path "./Build/*" -prune \) \
-not \( -path "./Toolchain/*" -prune \) \
-not \( -path "./Ports/*" -prune \) \
-type f -name "*.cpp" -o -name "*.mm" -o -name "*.h")
There are a couple of weird cases where clang-format now thinks that a
pointer access in an initializer list, e.g. `m_member(ptr->foo)`, is a
lambda return statement, and it puts spaces around the `->`.
This allows skipping the iteration through all allocated blocks in
`find_min_and_max_block_addresses()`.
With this change, `collect_garbage()` in profiles of Discord goes down
from 17% to 8%.
There's no need to capture things as Handle when using HeapFunction.
In this case, it was even creating a strong reference cycle, which ended
up leaking.
currently crashes with an assertion failure in `String::repeated` if
malloc can't serve a `count * input_size` sized request, so add
`String::repeated_with_error` to propagate the error.
These changes are compatible with clang-format 16 and will be mandatory
when we eventually bump the clang-format version. So, since there are
no real downsides, let's commit them now.
When running with --log-all-js-exceptions, we will print the message
and backtrace for every single JS exception that is thrown, not just
the ones nobody caught.
This can sometimes be very helpful in debugging sites that swallow
important exceptions.