We now defer looking up the various identifiers by IdentifierTableIndex
until the last moment. This allows us to avoid the retrieval in common
cases like when a property access is cached.
Knocks a ~12% item off the profile on https://ventrella.com/Clusters/
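A minimal sketch of the shape this takes, with stand-in types rather
than the real LibJS classes (the cache here is just an optional value):

    #include <cstdint>
    #include <optional>
    #include <string>
    #include <vector>

    struct GetById {
        std::uint32_t identifier_index;  // index into the identifier table
        std::optional<int> inline_cache; // stand-in for the property cache
    };

    struct Executable {
        std::vector<std::string> identifier_table;
    };

    int execute_get_by_id(Executable const& executable, GetById const& op)
    {
        // Common case: the cache hits and the identifier is never looked up.
        if (op.inline_cache.has_value())
            return *op.inline_cache;
        // Slow path: only now do we pay for the table retrieval.
        auto const& name = executable.identifier_table[op.identifier_index];
        return static_cast<int>(name.size()); // stand-in for a real [[Get]]
    }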
Instead of displaying locals as "locN", we now show them as "name~N".
This makes it a lot easier to follow bytecode dumps, especially in
longer functions.
Note that we keep displaying the local index, to avoid confusion in case
there are multiple separate locals with the same name in one executable.
Instead of scanning through the list of seen constants, we now have a
more structured storage of the constants true, false, null, undefined,
and every possible Int32 value.
This fixes an O(n^2) issue found by Kraken/json-stringify-tinderbox.js
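A sketch of the structured storage under simplified stand-in types (the
real pool stores JS::Value constants; only the Int32 map is shown, and
true/false/null/undefined get fixed slots in the same spirit):

    #include <cstdint>
    #include <unordered_map>
    #include <vector>

    struct ConstantPool {
        std::vector<std::int64_t> constants; // stand-in for Vector<Value>
        std::unordered_map<std::int32_t, std::uint32_t> int32_to_index;

        // Each distinct Int32 constant is added once; repeats are O(1) hash
        // lookups instead of a linear scan over every constant seen so far.
        std::uint32_t intern_int32(std::int32_t value)
        {
            if (auto it = int32_to_index.find(value); it != int32_to_index.end())
                return it->second;
            auto index = static_cast<std::uint32_t>(constants.size());
            constants.push_back(value);
            int32_to_index.emplace(value, index);
            return index;
        }
    };

Repeated literals like 0 and 1, which are extremely common in real
code, now cost a hash lookup instead of a scan over the whole pool.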
This is a new Bytecode::Generator helper that takes an operand and
returns the same operand, or a copy of it, in case a copy is required
to preserve correct evaluation order.
This can be used in a bunch of places where we're worried about
clobbering some value after obtaining it.
Practically, locals are always copied, and temporary registers as well
as constants are returned as-is.
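A sketch of the helper under a simplified operand model (the helper and
type names here are illustrative, not the exact LibJS signatures):

    #include <cstdint>

    enum class OperandType { Local, Register, Constant };
    struct Operand { OperandType type; std::uint32_t index; };

    struct Generator {
        std::uint32_t next_register = 0;

        Operand allocate_register() { return { OperandType::Register, next_register++ }; }
        void emit_mov(Operand, Operand) { /* would append a Mov instruction */ }

        // Locals can be reassigned by later code before the value is consumed,
        // so they get snapshotted; temporaries and constants cannot change, so
        // they are returned untouched.
        Operand copy_if_needed_to_preserve_evaluation_order(Operand operand)
        {
            if (operand.type != OperandType::Local)
                return operand;
            auto copy = allocate_register();
            emit_mov(copy, operand);
            return copy;
        }
    };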
Functions that don't have a FunctionEnvironment will get their `this`
value from the ExecutionContext. This patch stops generating
ResolveThisBinding instructions at all for functions like that, and
instead pre-populates the `this` register when entering a new bytecode
executable.
We already have a dedicated register slot for `this`, so instead of
having ResolveThisBinding take a `dst` operand, just write the value
directly into the `this` register every time.
This allows us to skip allocating a function environment in cases where
it was previously impossible because the arguments object needed a
binding.
This change does not bring visible improvement in Kraken or Octane
benchmarks but seems useful to have anyway.
By giving `length` lookups a dedicated instruction, we can remove all
the special-case checks for `length` from the generic GetById and
GetByIdWithThis code paths, making every other property lookup a bit
faster.
Small progressions on most benchmarks, and some larger progressions:
- 12% on Octane/crypto.js
- 6% on Kraken/ai-astar.js
The VERIFY assumed that we either have a finally block which we already
visited through a previous boundary, or have no finally block
whatsoever. But since 301a1fc763 we propagate
finalizers through unwind scopes, breaking that assumption.
With this, only `ContinuePendingUnwind` needs to dynamically check
whether a scheduled return needs to go through a `finally` block,
making the interpreter loop a bit nicer.
Instead of SetVariable having 2x2 modes for variable/lexical and
initialize/set, those 4 modes are now separate instructions, which
makes each instruction much less branchy.
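A sketch of the before/after shape (instruction names are illustrative):

    // Before: one instruction, every execution re-checks both mode flags.
    enum class Scope { Variable, Lexical };
    enum class Mode { Initialize, Set };

    void set_variable(Scope scope, Mode mode)
    {
        if (scope == Scope::Lexical) {
            if (mode == Mode::Initialize) { /* initialize lexical binding */ }
            else                          { /* set lexical binding */ }
        } else {
            if (mode == Mode::Initialize) { /* initialize variable binding */ }
            else                          { /* set variable binding */ }
        }
    }

    // After: the decision is made once at codegen time, and each
    // instruction's execute() is a straight line with no mode branches.
    void initialize_lexical_binding()  { /* ... */ }
    void set_lexical_binding()         { /* ... */ }
    void initialize_variable_binding() { /* ... */ }
    void set_variable_binding()        { /* ... */ }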
The last completion value in a function is not exposed to the language,
since functions always either return something or undefined.
Given this, we can avoid emitting code that propagates the completion
value from various statements, as long as we know we're generating code
for a context where the completion value is not accessible. For
example, eval code must keep it: eval("if (true) { 42 }") evaluates
to 42. In practical terms, this means that function code gets to do
less completion shuffling, while global and eval code has to keep
doing it.
Merging registers, constants and locals into a single vector means:
- Better data locality
- No need to check type in Interpreter::get() and Interpreter::set()
which are very hot functions
Performance improvement is visible in almost all Octane and Kraken
tests.
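A sketch of the merged layout with a stand-in value type:

    #include <cstdint>
    #include <vector>

    // One flat vector addressed as [registers | constants | locals], so
    // get()/set() are plain indexing with no operand-type branch.
    struct Interpreter {
        std::vector<std::int64_t> registers_constants_and_locals;

        std::int64_t get(std::uint32_t operand_index) const
        {
            return registers_constants_and_locals[operand_index];
        }

        void set(std::uint32_t operand_index, std::int64_t value)
        {
            registers_constants_and_locals[operand_index] = value;
        }
    };

The assumption here is that operand indices are biased at codegen time
so registers, constants, and locals occupy disjoint ranges of the same
vector.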
When comparing two numbers, we can avoid a lot of implicit type
conversion nonsense and go straight to comparison, saving time in the
most common case.
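A sketch of the fast path with a stand-in Value type:

    #include <string>
    #include <variant>

    using Value = std::variant<double, std::string>; // stand-in for JS::Value

    // Stand-in for the spec's full abstract relational comparison.
    bool generic_less_than(Value const&, Value const&) { return false; }

    bool less_than(Value const& lhs, Value const& rhs)
    {
        // Common case: both sides are already numbers, so compare directly
        // and skip the ToPrimitive/ToNumeric machinery entirely.
        if (std::holds_alternative<double>(lhs) && std::holds_alternative<double>(rhs))
            return std::get<double>(lhs) < std::get<double>(rhs);
        return generic_less_than(lhs, rhs);
    }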
These were out-of-line because we had some ideas about marking
instruction streams PROT_READ only, but that seems pretty arbitrary and
there's a lot of performance to be gained by putting these inline.
This patch stops emitting the BlockDeclarationInstantiation instruction
when there are no locals and no function declarations in the scope.
We were spending 20% of CPU time on https://ventrella.com/Clusters/ just
creating empty environments for no reason.
If the minimum number of required bindings is known in advance, it can
be used to ensure capacity up front, avoiding resizes of the internal
vector that holds bindings.
By emitting the instantiation logic as bytecode, all instructions
required for instantiation are generated once at compile time and then
reused for subsequent calls, instead of running the generic
instantiation process on each call.
We now fuse sequences like [LessThan, JumpIf] to JumpLessThan.
This is only allowed for temporaries (i.e. VM registers) with no other
references to them.
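A sketch of the peephole on a toy instruction form (operand layout and
jump-target bookkeeping are simplified away):

    #include <cstdint>
    #include <vector>

    enum class Op : std::uint8_t { LessThan, JumpIf, JumpLessThan /* fused */ };
    struct Instruction { Op op; std::uint32_t dst; std::uint32_t lhs; std::uint32_t rhs; };

    // Fuse a trailing [LessThan, JumpIf] pair into one JumpLessThan. The
    // single-use-temporary check is the safety condition from the commit.
    void fuse_compare_and_jump(std::vector<Instruction>& block,
                               bool dst_is_single_use_temporary)
    {
        if (block.size() < 2 || !dst_is_single_use_temporary)
            return;
        auto& compare = block[block.size() - 2];
        auto const& jump = block.back();
        if (compare.op == Op::LessThan && jump.op == Op::JumpIf
            && jump.lhs == compare.dst) { // the jump tests the compare result
            compare.op = Op::JumpLessThan; // the compare's operands carry over
            block.pop_back();              // the separate JumpIf disappears
        }
    }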
This removes a layer of indirection in the bytecode where we had to make
sure all the initializer elements were laid out in sequential registers.
Array expressions no longer clobber registers permanently; the
registers can be reused immediately afterwards.
This patch adds a register freelist to Bytecode::Generator and switches
all operands inside the generator to a new ScopedOperand type that is
ref-counted and automatically frees the register when nothing uses it.
This dramatically reduces the size of bytecode executable register
windows, which were often in the several thousands of registers for
large functions. Most functions now use less than 100 registers.
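A sketch of the mechanism in plain C++, using a shared_ptr deleter to
play the role of the ref-count (the real ScopedOperand is its own
ref-counted type):

    #include <cstdint>
    #include <memory>
    #include <vector>

    struct Generator {
        std::uint32_t next_register = 0;
        std::vector<std::uint32_t> register_freelist;

        std::uint32_t allocate_register()
        {
            if (!register_freelist.empty()) {
                auto index = register_freelist.back();
                register_freelist.pop_back();
                return index; // reuse beats growing the register window
            }
            return next_register++;
        }
    };

    // Copies share one ref-count; when the last ScopedOperand dies, the
    // register index goes back on the freelist for immediate reuse.
    class ScopedOperand {
    public:
        ScopedOperand(Generator& generator, std::uint32_t index)
            : m_index(new std::uint32_t(index), [&generator](std::uint32_t* p) {
                generator.register_freelist.push_back(*p);
                delete p;
            })
        {
        }

        std::uint32_t index() const { return *m_index; }

    private:
        std::shared_ptr<std::uint32_t> m_index;
    };

Copying a ScopedOperand bumps the count, so a register stays alive
exactly as long as something still refers to it.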
The first block in every executable will always execute first, so if it
ends up doing a ResolveThisBinding, it's fine for all other blocks
within the same executable to use the same `this` value.
Once executed, this instruction will always produce the same result
in subsequent executions, so it's okay to cache it.
Unfortunately it may throw, so we can't just hoist it to the top of
every executable, since that would break observable execution order.
Instead of storing a BasicBlock* and forcing the size of Label to be
sizeof(BasicBlock*), we now store the basic block index as a u32.
This means the final version of the bytecode is able to keep labels
at sizeof(u32), shrinking the size of many instructions. :^)
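In sketch form:

    #include <cstdint>

    // Label is now just a block index, patched to a real target when the
    // blocks are laid out, instead of a full BasicBlock pointer.
    struct Label {
        std::uint32_t basic_block_index;
    };
    static_assert(sizeof(Label) == sizeof(std::uint32_t));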
Instead of storing source offsets with each instruction, we now keep
them in a side table in Executable.
This shrinks each instruction by 8 bytes, further improving locality.
Instead of keeping bytecode as a set of disjoint basic blocks on the
malloc heap, bytecode is now a contiguous sequence of bytes(!)
The transformation happens at the end of Bytecode::Generator::generate()
and the only really hairy part is rerouting jump labels.
This required solving a few problems:
- The interpreter execution loop had to change quite a bit, since we
  were storing BasicBlock pointers all over the place, and control
  transfer was done by redirecting the interpreter's current block.
- Exception handlers & finalizers are now stored per-bytecode-range
  in a side table in Executable.
- The interpreter now has a plain program counter instead of a stream
  iterator. This actually makes error stack generation a bit nicer
  since we just have to deal with a number instead of reaching into
  the iterator.
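A sketch of what the flattened main loop looks like (heavily
simplified; decode, dispatch, and alignment are the real encoder's
business):

    #include <cstddef>
    #include <cstdint>
    #include <vector>

    struct Instruction {
        std::uint32_t length; // encoded size of this instruction in bytes
        // ... opcode and operands follow in the real encoding
    };

    void run(std::vector<std::uint8_t> const& bytecode)
    {
        std::size_t program_counter = 0; // also all we need for error locations
        while (program_counter < bytecode.size()) {
            auto const& instruction =
                *reinterpret_cast<Instruction const*>(bytecode.data() + program_counter);
            // execute(instruction); jumps simply assign program_counter
            program_counter += instruction.length;
        }
    }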
This yields a 25% performance improvement on this microbenchmark:
    for (let i = 0; i < 1_000_000; ++i) { }
But basically everything gets faster. :^)
This does two things:
* Clear exceptions when transferring control out of a finalizer
  Otherwise they would resurface at the end of the next finalizer
  (see the new test case), or at the end of a function
* Pop one scheduled jump when transferring control out of a finalizer
  This removes one old FIXME
When a GetById / GetByValue bytecode operation results in accessing a
nullish object, we now include the name of the property and the object
being accessed in the exception message (if available). This should make
it easier to debug live websites.
For example, the following errors would all previously produce a generic
error message of "ToObject on null or undefined":
    > foo = null
    > foo.bar
    Uncaught exception:
    [TypeError] Cannot access property "bar" on null object "foo"
        at <unknown>

    > foo = { bar: undefined }
    > foo.bar.baz
    Uncaught exception:
    [TypeError] Cannot access property "baz" on undefined object "foo.bar"
        at <unknown>
Note we certainly don't capture all possible nullish property read
accesses here. This just covers cases I've seen most on live websites;
we can cover more cases as they arise.
Instead of emitting a NewString instruction to construct a primitive
string from a parsed literal, we now instantiate the PrimitiveString on
the heap during codegen.
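A sketch of the idea with stand-in types (a shared_ptr stands in for a
GC handle; the real code stores JS::PrimitiveString constants):

    #include <cstdint>
    #include <memory>
    #include <string>
    #include <vector>

    struct PrimitiveString { std::string value; };

    struct Generator {
        // Stand-in for the executable's table of GC-allocated constants.
        std::vector<std::shared_ptr<PrimitiveString>> string_constants;

        // The string object is created once, at codegen time; at runtime the
        // literal is just a constant reference, with no instruction to run.
        std::uint32_t add_string_constant(std::string literal)
        {
            string_constants.push_back(
                std::make_shared<PrimitiveString>(PrimitiveString { std::move(literal) }));
            return static_cast<std::uint32_t>(string_constants.size() - 1);
        }
    };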
This patch moves us away from the accumulator-based bytecode format to
one with explicit source and destination registers.
The new format has multiple benefits:
- ~25% faster on the Kraken and Octane benchmarks :^)
- Fewer instructions to accomplish the same thing
- Much easier for humans to read(!)
Because this change requires a fundamental shift in how bytecode is
generated, it is quite comprehensive.
Main implementation mechanism: generate_bytecode() virtual function now
takes an optional "preferred dst" operand, which allows callers to
communicate when they have an operand that would be optimal for the
result to go into. It also returns an optional "actual dst" operand,
which is where the completion value (if any) of the AST node is stored
after the node has "executed".
One notable change: because instructions can now take locals directly
as operands, we got rid of the GetLocal instruction.
A side-effect of that is we have to think about the temporal deadzone
(TDZ) a bit differently for locals (GetLocal would previously check
for empty values and interpret that as a TDZ access and throw).
We now insert special ThrowIfTDZ instructions in places where a local
access may be in the TDZ, to maintain the correct behavior.
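A sketch of the guard with a stand-in value type (nullopt plays the
special empty marker):

    #include <optional>
    #include <stdexcept>
    #include <string>

    using Value = std::optional<int>; // nullopt marks an uninitialized local

    // Emitted only where a local read might happen before initialization;
    // ordinary local reads no longer pay for this check.
    void throw_if_tdz(Value const& local, std::string const& name)
    {
        if (!local.has_value())
            throw std::runtime_error("ReferenceError: " + name + " used before initialization");
    }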
There are a number of test262 progressions and regressions from this
change:
A number of async generator tests have been accidentally fixed while
converting the implementation to the new bytecode format. It didn't
seem useful to preserve bugs in the original code when converting it.
Some "does eval() return the correct completion value" tests have
regressed, in particular ones related to propagating the appropriate
completion after control flow statements like continue and break.
These are all fairly obscure issues, and I believe we can continue
working on them separately.
The net test262 result is a progression though. :^)
This is pure prep work for refactoring the bytecode to use operand
types other than just registers.
generate_bytecode() virtuals now return an Optional<Operand>, and the
idea is to return an Operand referring to the value produced by this
AST node.
They also take an Optional<Operand> "preferred_dst" input. This is
intended to communicate the caller's preference for an output operand,
if any. This will be used to elide temporaries when we can store the
result directly in a local, for example.
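In sketch form, with simplified types (error propagation via
CodeGenerationError is omitted):

    #include <optional>

    struct Generator;                 // stand-in for Bytecode::Generator
    struct Operand { unsigned index; };

    struct ASTNode {
        virtual ~ASTNode() = default;

        // Returns the operand holding this node's value, if it produces one.
        // 'preferred_dst' lets the caller suggest where the result should go,
        // so codegen can write straight into a local instead of a temporary.
        virtual std::optional<Operand> generate_bytecode(
            Generator&, std::optional<Operand> preferred_dst = {}) const = 0;
    };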
Previously, attempting to load a value from an invalid reference would
cause a crash. We now return a CodeGenerationError rather than hitting
an assertion. This is not a complete solution, as ideally we would want
to return a ReferenceError, but this now matches the behavior we see
when we attempt to store something to an invalid reference.
When iterating over an iterable, we get back a JS object with the fields
"value" and "done".
Before this change, we had two dedicated instructions for retrieving
the two fields: IteratorResultValue and IteratorResultDone. These had no
fast path whatsoever and just did a generic [[Get]] access to fetch the
corresponding property values.
By replacing the instructions with GetById("value") and GetById("done"),
they instantly get caching and JIT fast paths for free, making iterating
over iterables much faster. :^)
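A sketch of the codegen change with illustrative helper names (not the
actual Generator API):

    #include <string>
    #include <vector>

    struct Generator {
        std::vector<std::string> identifier_table;

        unsigned intern_identifier(std::string name)
        {
            identifier_table.push_back(std::move(name));
            return static_cast<unsigned>(identifier_table.size() - 1);
        }

        void emit_get_by_id(unsigned /*dst*/, unsigned /*base*/, unsigned /*identifier*/)
        {
            // would append a GetById instruction with a fresh inline cache slot
        }
    };

    // Formerly a dedicated IteratorResultValue instruction; now a plain
    // cached property load that shares every GetById fast path.
    void emit_iterator_value(Generator& generator, unsigned dst, unsigned iterator_result)
    {
        generator.emit_get_by_id(dst, iterator_result, generator.intern_identifier("value"));
    }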
26% speed-up on this microbenchmark:
    function go(a) {
        for (const p of a) {
        }
    }
    const a = [];
    a.length = 1_000_000;
    go(a);