Merging registers, constants and locals into single vector means:
- Better data locality
- No need to check type in Interpreter::get() and Interpreter::set()
which are very hot functions
Performance improvement is visible in almost all Octane and Kraken
tests.
By doing that all instructions required for instantiation are emitted
once in compilation and then reused for subsequent calls, instead of
running generic instantiation process for each call.
Instead of keeping bytecode as a set of disjoint basic blocks on the
malloc heap, bytecode is now a contiguous sequence of bytes(!)
The transformation happens at the end of Bytecode::Generator::generate()
and the only really hairy part is rerouting jump labels.
This required solving a few problems:
- The interpreter execution loop had to change quite a bit, since we
were storing BasicBlock pointers all over the place, and control
transfer was done by redirecting the interpreter's current block.
- Exception handlers & finalizers are now stored per-bytecode-range
in a side table in Executable.
- The interpreter now has a plain program counter instead of a stream
iterator. This actually makes error stack generation a bit nicer
since we just have to deal with a number instead of reaching into
the iterator.
This yields a 25% performance improvement on this microbenchmark:
for (let i = 0; i < 1_000_000; ++i) { }
But basically everything gets faster. :^)
This does two things:
* Clear exceptions when transferring control out of a finalizer
Otherwise they would resurface at the end of the next finalizer
(see test the new test case), or at the end of a function
* Pop one scheduled jump when transferring control out of a finalizer
This removes one old FIXME
Before this change both ExecutionContext and CallFrame were created
before executing function/module/script with a couple exceptions:
- executable created for default function argument evaluation has to
run in function's execution context.
- `execute_ast_node()` where executable compiled for ASTNode has to be
executed in running execution context.
This change moves all members previously owned by CallFrame into
ExecutionContext, and makes two exceptions where an executable that does
not have a corresponding execution context saves and restores registers
before running.
Now, all execution state lives in a single entity, which makes it a bit
easier to reason about and opens opportunities for optimizations, such
as moving registers and local variables into a single array.
Instead of looking these up in the VM execution context stack whenever
we need them, we now just cache them in the interpreter when entering
a new call frame.
This patch moves us away from the accumulator-based bytecode format to
one with explicit source and destination registers.
The new format has multiple benefits:
- ~25% faster on the Kraken and Octane benchmarks :^)
- Fewer instructions to accomplish the same thing
- Much easier for humans to read(!)
Because this change requires a fundamental shift in how bytecode is
generated, it is quite comprehensive.
Main implementation mechanism: generate_bytecode() virtual function now
takes an optional "preferred dst" operand, which allows callers to
communicate when they have an operand that would be optimal for the
result to go into. It also returns an optional "actual dst" operand,
which is where the completion value (if any) of the AST node is stored
after the node has "executed".
One thing of note that's new: because instructions can now take locals
as operands, this means we got rid of the GetLocal instruction.
A side-effect of that is we have to think about the temporal deadzone
(TDZ) a bit differently for locals (GetLocal would previously check
for empty values and interpret that as a TDZ access and throw).
We now insert special ThrowIfTDZ instructions in places where a local
access may be in the TDZ, to maintain the correct behavior.
There are a number of progressions and regressions from this test:
A number of async generator tests have been accidentally fixed while
converting the implementation to the new bytecode format. It didn't
seem useful to preserve bugs in the original code when converting it.
Some "does eval() return the correct completion value" tests have
regressed, in particular ones related to propagating the appropriate
completion after control flow statements like continue and break.
These are all fairly obscure issues, and I believe we can continue
working on them separately.
The net test262 result is a progression though. :^)
The number of registers in a call frame never changes, so we can
allocate it at the end of the CallFrame object and save ourselves the
cost of allocating separate Vector storage for every call frame.
Until now, the unwind context stack has not been maintained by jitted
code, which meant we were unable to support the `with` statement.
This is a first step towards supporting that by making jitted code
call out to C++ to update the unwind context stack when entering/leaving
unwind contexts.
We also introduce a new "Catch" bytecode instruction that moves the
current exception into the accumulator. It's always emitted at the start
of a "catch" block.
The previous implementation was calling `backtrace()` for every
function call, which is quite slow.
Instead, this implementation provides VM::stack_trace() which unwinds
the native stack, maps it through NativeExecutable::get_source_range
and combines it with source ranges from interpreted call frames.
This works by walking a backtrace until the currently executing
native executable is found, and then mapping the native address
to its bytecode instruction.
These are then restored upon `ContinuePendingUnwind`.
This stops us from forgetting where we needed to jump when we do extra
try-catches in finally blocks.
Co-Authored-By: Jesús "gsus" Lapastora <cyber.gsuscode@gmail.com>
Instead of calling out to helper functions for flow control (and then
checking control flags on every iteration), we now simply inline those
ops in the interpreter loop directly.
This works by adding source start/end offset to every bytecode
instruction. In the future we can make this more efficient by keeping
a map of bytecode ranges to source ranges in the Executable instead,
but let's just get traces working first.
Co-Authored-By: Andrew Kaster <akaster@serenityos.org>
Because "this" value cannot be changed during function execution it is
safe to compute it once and then use for future access.
This optimization makes ai-astar.js run 8% faster.
This fixes an issue where returning inside a `try` block and then
calling a function inside `finally` would clobber the saved return
value from the `try` block.
Note that we didn't need to change the base of register allocation,
since it was already 1 too high.
With this fixed, https://microsoft.com/edge loads in bytecode mode. :^)
Thanks to Luke for reducing the issue!
These passes have not been shown to actually optimize any JS, and tests
have become very flaky with optimizations enabled. Until some measurable
benefit is shown, remove the optimization passes to reduce overhead of
maintaining bytecode operations and to reduce CI churn. The framework
for optimizations will live on in git history, and can be restored once
proven useful.
The instructions GetById and GetByIdWithThis now remember the last-seen
Shape, and if we see the same object again, we reuse the property offset
from last time without doing a new lookup.
This allows us to use Object::get_direct(), bypassing the entire lookup
machinery and saving lots of time.
~23% speed-up on Kraken/ai-astar.js :^)
The var environments will unwind as needed with the ExecutionContext
and there's no need to include it in the unwind info.
We still need to do this for lexical environments though, since they
can have short local lifetimes inside a function.
Since the relationship between VM and Bytecode::Interpreter is now
clear, we can have VM ask the Interpreter for roots in the GC marking
pass. This avoids having to register and unregister handles and
MarkedVectors over and over.
Since GeneratorObject can also own a RegisterWindow, we share the code
in a RegisterWindow::visit_edges() helper.
~4% speed-up on Kraken/stanford-crypto-ccm.js :^)
While this would be useful in the future for implementing a multi-tiered
optimization strategy, currently a binary on/off is enough for us. This
removes the confusingly on-by-default `OptimizationLevel::None` option
which made the optimization pipeline a no-op even if
`Bytecode::Interpreter::set_optimizations_enabled` had been called.
Fixes#15982
The JS::VM now owns the one Bytecode::Interpreter. We no longer have
multiple bytecode interpreters, and there is no concept of a "current"
bytecode interpreter.
If you ask for VM::bytecode_interpreter_if_exists(), it will return null
if we're not running the program in "bytecode enabled" mode.
If you ask for VM::bytecode_interpreter(), it will return a bytecode
interpreter in all modes. This is used for situations where even the AST
interpreter switches to bytecode mode (generators, etc.)
Don't try to implement this AO in bytecode. Instead, the bytecode
Interpreter class now has a run() API with the same inputs as the AST
interpreter. It sets up the necessary environments etc, including
invoking the GlobalDeclarationInstantiation AO.