If the minimal amount of required bindings is known in advance, it could
be used to ensure capacity to avoid resizing the internal vector that
holds bindings.
By doing that all instructions required for instantiation are emitted
once in compilation and then reused for subsequent calls, instead of
running generic instantiation process for each call.
We now fuse sequences like [LessThan, JumpIf] to JumpLessThan.
This is only allowed for temporaries (i.e VM registers) with no other
references to them.
This removes a layer of indirection in the bytecode where we had to make
sure all the initializer elements were laid out in sequential registers.
Array expressions no longer clobber registers permanently, and they can
be reused immediately afterwards.
If we don't have enough stack space, throw an exception while we still
can, and give the caller a chance to recover.
This particular problem will go away once we make calls non-recursive.
This means inlining all the things. This yields a 40% speedup on the for
loop microbenchmark, and everything else gets faster as well. :^)
This makes compilation take foreeeever with GCC, so I'm only enabling it
for Clang in this commit. We should figure out how to make GCC compile
this without timing out CI, since the speedup is amazing.
This commit converts the main loop in Bytecode::Interpreter to use a
label table and computed goto for fast instruction dispatch.
This yields roughly 35% speedup on the for loop microbenchmark,
and makes everything else faster as well. :^)
This commit adds a HANDLE_INSTRUCTION macro that expands to everything
needed to handle a single instruction (invoking the handler function,
checking for exceptions, and advancing the program counter).
This gives a ~15% speed-up on a for loop microbenchmark, and makes
basically everything faster.
Instead of keeping bytecode as a set of disjoint basic blocks on the
malloc heap, bytecode is now a contiguous sequence of bytes(!)
The transformation happens at the end of Bytecode::Generator::generate()
and the only really hairy part is rerouting jump labels.
This required solving a few problems:
- The interpreter execution loop had to change quite a bit, since we
were storing BasicBlock pointers all over the place, and control
transfer was done by redirecting the interpreter's current block.
- Exception handlers & finalizers are now stored per-bytecode-range
in a side table in Executable.
- The interpreter now has a plain program counter instead of a stream
iterator. This actually makes error stack generation a bit nicer
since we just have to deal with a number instead of reaching into
the iterator.
This yields a 25% performance improvement on this microbenchmark:
for (let i = 0; i < 1_000_000; ++i) { }
But basically everything gets faster. :^)
Before this change, all JumpFoo instructions inherited from Jump, which
forced the unconditional Jump to have an unusued "false target" member.
Also, labels were unnecessarily wrapped in Optional<>.
By defining each jump instruction separately, they all shrink in size,
and all ambiguity is removed.
This does two things:
* Clear exceptions when transferring control out of a finalizer
Otherwise they would resurface at the end of the next finalizer
(see test the new test case), or at the end of a function
* Pop one scheduled jump when transferring control out of a finalizer
This removes one old FIXME
Before this change both ExecutionContext and CallFrame were created
before executing function/module/script with a couple exceptions:
- executable created for default function argument evaluation has to
run in function's execution context.
- `execute_ast_node()` where executable compiled for ASTNode has to be
executed in running execution context.
This change moves all members previously owned by CallFrame into
ExecutionContext, and makes two exceptions where an executable that does
not have a corresponding execution context saves and restores registers
before running.
Now, all execution state lives in a single entity, which makes it a bit
easier to reason about and opens opportunities for optimizations, such
as moving registers and local variables into a single array.
When a PutById / PutByValue bytecode operation results in accessing a
nullish object, we now include the name of the property and the object
being accessed in the exception message (if available). This should make
it easier to debug live websites.
For example, the following errors would all previously produce a generic
error message of "ToObject on null or undefined":
> foo = null
> foo.bar = 1
Uncaught exception:
[TypeError] Cannot access property "bar" on null object "foo"
at <unknown>
> foo = { bar: undefined }
> foo.bar.baz = 1
Uncaught exception:
[TypeError] Cannot access property "baz" on undefined object "foo.bar"
at <unknown>
Note we certainly don't capture all possible nullish property write
accesses here. This just covers cases I've seen most on live websites;
we can cover more cases as they arise.
When a GetById / GetByValue bytecode operation results in accessing a
nullish object, we now include the name of the property and the object
being accessed in the exception message (if available). This should make
it easier to debug live websites.
For example, the following errors would all previously produce a generic
error message of "ToObject on null or undefined":
> foo = null
> foo.bar
Uncaught exception:
[TypeError] Cannot access property "bar" on null object "foo"
at <unknown>
> foo = { bar: undefined }
> foo.bar.baz
Uncaught exception:
[TypeError] Cannot access property "baz" on undefined object "foo.bar"
at <unknown>
Note we certainly don't capture all possible nullish property read
accesses here. This just covers cases I've seen most on live websites;
we can cover more cases as they arise.
As long as the inputs are Int32, we can convert them to UInt32 in a
spec-compliant way with a simple static_cast<u32>.
This allows calculations like `-3 >>> 2` to take the fast path as well,
which is extremely valuable for stuff like crypto code.
While we're doing this, also remove the fast paths from the generic
shift functions in Value.cpp, since we only end up there if we *didn't*
take the same fast path in the interpreter.
This patch adds a new "Peephole" pass for performing small, local
optimizations to bytecode.
We also introduce the first such optimization, fusing a sequence of
some comparison instruction FooCompare followed by a JumpIf into a
new set of JumpFooCompare instructions.
This gives a ~50% speed-up on the following microbenchmark:
for (let i = 0; i < 10_000_000; ++i) {
}
But more traditional benchmarks see a pretty sizable speed-up as well,
for example 15% on Kraken/ai-astar.js and 16% on Kraken/audio-dft.js :^)
Instead of emitting a NewBigInt instruction to construct a primitive
bigint from a parsed literal, we now instantiate the BigInt on the heap
during codegen.
Instead of emitting a NewString instruction to construct a primitive
string from a parsed literal, we now instantiate the PrimitiveString on
the heap during codegen.
Instead of looking these up in the VM execution context stack whenever
we need them, we now just cache them in the interpreter when entering
a new call frame.
Comparing two Values has to call the generic same_value() helper,
and we can avoid this by simply using a stronger type for built-in
native function handlers.
Instead of having Call refer to a range of VM registers, it now has
a trailing list of argument operands as part of the instruction.
This means we no longer have to shuffle every argument value into
a register before making a call, making bytecode smaller & faster. :^)
By handling common cases like Int32 arithmetic directly in the
instruction handler, we can avoid the cost of calling the generic helper
functions in Value.cpp.
Instead of splitting the postfix variants into ToNumeric + Inc/Dec,
we now have dedicated PostfixIncrement and PostfixDecrement instructions
that handle both outputs in one go.
This patch moves us away from the accumulator-based bytecode format to
one with explicit source and destination registers.
The new format has multiple benefits:
- ~25% faster on the Kraken and Octane benchmarks :^)
- Fewer instructions to accomplish the same thing
- Much easier for humans to read(!)
Because this change requires a fundamental shift in how bytecode is
generated, it is quite comprehensive.
Main implementation mechanism: generate_bytecode() virtual function now
takes an optional "preferred dst" operand, which allows callers to
communicate when they have an operand that would be optimal for the
result to go into. It also returns an optional "actual dst" operand,
which is where the completion value (if any) of the AST node is stored
after the node has "executed".
One thing of note that's new: because instructions can now take locals
as operands, this means we got rid of the GetLocal instruction.
A side-effect of that is we have to think about the temporal deadzone
(TDZ) a bit differently for locals (GetLocal would previously check
for empty values and interpret that as a TDZ access and throw).
We now insert special ThrowIfTDZ instructions in places where a local
access may be in the TDZ, to maintain the correct behavior.
There are a number of progressions and regressions from this test:
A number of async generator tests have been accidentally fixed while
converting the implementation to the new bytecode format. It didn't
seem useful to preserve bugs in the original code when converting it.
Some "does eval() return the correct completion value" tests have
regressed, in particular ones related to propagating the appropriate
completion after control flow statements like continue and break.
These are all fairly obscure issues, and I believe we can continue
working on them separately.
The net test262 result is a progression though. :^)