Merging registers, constants and locals into single vector means:
- Better data locality
- No need to check type in Interpreter::get() and Interpreter::set()
which are very hot functions
Performance improvement is visible in almost all Octane and Kraken
tests.
By doing that all instructions required for instantiation are emitted
once in compilation and then reused for subsequent calls, instead of
running generic instantiation process for each call.
Instead of keeping bytecode as a set of disjoint basic blocks on the
malloc heap, bytecode is now a contiguous sequence of bytes(!)
The transformation happens at the end of Bytecode::Generator::generate()
and the only really hairy part is rerouting jump labels.
This required solving a few problems:
- The interpreter execution loop had to change quite a bit, since we
were storing BasicBlock pointers all over the place, and control
transfer was done by redirecting the interpreter's current block.
- Exception handlers & finalizers are now stored per-bytecode-range
in a side table in Executable.
- The interpreter now has a plain program counter instead of a stream
iterator. This actually makes error stack generation a bit nicer
since we just have to deal with a number instead of reaching into
the iterator.
This yields a 25% performance improvement on this microbenchmark:
for (let i = 0; i < 1_000_000; ++i) { }
But basically everything gets faster. :^)
If all lexical declaration use local variables then there is no need
to allocate declarative environment.
With this change we skip ~3x more environment allocations on Github.
We already had fast access to own properties via shape-based IC.
This patch extends the mechanism to properties on the prototype chain,
using the "validity cell" technique from V8.
- Prototype objects now have unique shape
- Each prototype has an associated PrototypeChainValidity
- When a prototype shape is mutated, every prototype shape "below" it
in any prototype chain is invalidated.
- Invalidation happens by marking the validity object as invalid,
and then replacing it with a new validity object.
- Property caches keep a pointer to the last seen valid validity.
If there is no validity, or the validity is invalid, the cache
misses and gets repopulated.
This is very helpful when using JavaScript to access DOM objects,
as we frequently have to traverse 4+ prototype objects before finding
the property we're interested in on e.g EventTarget or Node.
If a function has the following properties:
- uses only local variables and registers
- does not use `this`
- does not use `new.target`
- does not use `super`
- does not use direct eval() calls
then it is possible to entirely skip function environment allocation
because it will never be used
This change adds gathering of information whether a function needs to
access `this` from environment and updates `prepare_for_ordinary_call()`
to skip allocation when possible.
For now, this optimisation is too aggressively blocked; e.g. if `this`
is used in a function scope, then all functions in outer scopes have to
allocate an environment. It could be improved in the future, although
this implementation already allows skipping >80% of environment
allocations on Discord, GitHub and Twitter.
Before this change both ExecutionContext and CallFrame were created
before executing function/module/script with a couple exceptions:
- executable created for default function argument evaluation has to
run in function's execution context.
- `execute_ast_node()` where executable compiled for ASTNode has to be
executed in running execution context.
This change moves all members previously owned by CallFrame into
ExecutionContext, and makes two exceptions where an executable that does
not have a corresponding execution context saves and restores registers
before running.
Now, all execution state lives in a single entity, which makes it a bit
easier to reason about and opens opportunities for optimizations, such
as moving registers and local variables into a single array.
This patch moves us away from the accumulator-based bytecode format to
one with explicit source and destination registers.
The new format has multiple benefits:
- ~25% faster on the Kraken and Octane benchmarks :^)
- Fewer instructions to accomplish the same thing
- Much easier for humans to read(!)
Because this change requires a fundamental shift in how bytecode is
generated, it is quite comprehensive.
Main implementation mechanism: generate_bytecode() virtual function now
takes an optional "preferred dst" operand, which allows callers to
communicate when they have an operand that would be optimal for the
result to go into. It also returns an optional "actual dst" operand,
which is where the completion value (if any) of the AST node is stored
after the node has "executed".
One thing of note that's new: because instructions can now take locals
as operands, this means we got rid of the GetLocal instruction.
A side-effect of that is we have to think about the temporal deadzone
(TDZ) a bit differently for locals (GetLocal would previously check
for empty values and interpret that as a TDZ access and throw).
We now insert special ThrowIfTDZ instructions in places where a local
access may be in the TDZ, to maintain the correct behavior.
There are a number of progressions and regressions from this test:
A number of async generator tests have been accidentally fixed while
converting the implementation to the new bytecode format. It didn't
seem useful to preserve bugs in the original code when converting it.
Some "does eval() return the correct completion value" tests have
regressed, in particular ones related to propagating the appropriate
completion after control flow statements like continue and break.
These are all fairly obscure issues, and I believe we can continue
working on them separately.
The net test262 result is a progression though. :^)
For parameters that exist strictly as "locals", we can save time and
space by not adding them to the function environment.
This is a speed-up across the board on basically every test.
For example, ~11% on Octane/typescript.js :^)
This commit un-deprecates DeprecatedString, and repurposes it as a byte
string.
As the null state has already been removed, there are no other
particularly hairy blockers in repurposing this type as a byte string
(what it _really_ is).
This commit is auto-generated:
$ xs=$(ack -l \bDeprecatedString\b\|deprecated_string AK Userland \
Meta Ports Ladybird Tests Kernel)
$ perl -pie 's/\bDeprecatedString\b/ByteString/g;
s/deprecated_string/byte_string/g' $xs
$ clang-format --style=file -i \
$(git diff --name-only | grep \.cpp\|\.h)
$ gn format $(git ls-files '*.gn' '*.gni')
For now, we handle this by creating a synthetic async function to wrap
the top-level module code. This allows us to piggyback on the async
function driver wrapper mechanism.
The number of registers in a call frame never changes, so we can
allocate it at the end of the CallFrame object and save ourselves the
cost of allocating separate Vector storage for every call frame.
Instead of allocating these in a mixture of ways, we now always put
them on the malloc heap, and keep an intrusive linked list of them
that we can iterate for GC marking purposes.
This required setting things up so that all function objects can plop
a PrimitiveString there instead of an AK string.
This is a step towards making ExecutionContext easier to allocate.
(Instead of MarkedVector<Value>.) This is a step towards not storing
argument lists in MarkedVector<Value> at all. Note that they still end
up in MarkedVectors since that's what ExecutionContext has.
This greatly reduces the number of compilations necessary when functions
declaring local functions are re-executed.
For example Octane/typescript.js goes from 58080 bytecode executables
to 960.
This patch adds two macros to declare per-type allocators:
- JS_DECLARE_ALLOCATOR(TypeName)
- JS_DEFINE_ALLOCATOR(TypeName)
When used, they add a type-specific CellAllocator that the Heap will
delegate allocation requests to.
The result of this is that GC objects of the same type always end up
within the same HeapBlock, drastically reducing the ability to perform
type confusion attacks.
It also improves HeapBlock utilization, since each block now has cells
sized exactly to the type used within that block. (Previously we only
had a handful of block sizes available, and most GC allocations ended
up with a large amount of slack in their tails.)
There is a small performance hit from this, but I'm sure we can make
up for it elsewhere.
Note that the old size-based allocators still exist, and we fall back
to them for any type that doesn't have its own CellAllocator.
This patch makes it possible for JS::Object::internal_set() to populate
a CacheablePropertyMetadata, and uses this to implement a basic
monomorphic cache for the most common form of property write access.
We can use `ensure_capacity` for binding vectors if we know their sizes
in advance. This ensures that binding vectors aren't reallocated during
the `function_declaration_instantiation` execution.
With this change, `try_grow_capacity()` and `shrink_to_fit()` are no
longer visible in the `function_declaration_instantiation()` profiles
when running React-Redux-TodoMVC from Speedometer.
We don't need to check if a function parameter is already declared
while creating bindings for them because we deduplicate their names by
storing them in a hash table in one of the previous steps.
This change makes React-Redux-TodoMVC test in Speedometer run 2%
faster.
This change moves steps that can be executed only once and then reused
in subsequent function instantiations from
`function_declaration_instantiation` to the ECMAScriptFunctionObject:
- Determine if there are any parameters with duplicate names.
- Determine if there are any parameters with expressions.
- Determine if an arguments object needs to be created.
- Create a list of distinct function names for which bindings need to
be created.
- Create a list of distinct variable names for which bindings need to
be created.
This change makes React-Redux-TodoMVC test in Speedometer
run 10% faster :)
This works by adding source start/end offset to every bytecode
instruction. In the future we can make this more efficient by keeping
a map of bytecode ranges to source ranges in the Executable instead,
but let's just get traces working first.
Co-Authored-By: Andrew Kaster <akaster@serenityos.org>
Stop worrying about tiny OOMs. Work towards #20449.
While going through these, I also changed the function signature in many
places where returning ThrowCompletionOr<T> is no longer necessary.
This loosens the connection to the AST interpreter and will allow us to
generate SourceRanges for the Bytecode interpreter in the future as well
Moves UnrealizedSourceRanges from TracebackFrame to the JS namespace for
this
These passes have not been shown to actually optimize any JS, and tests
have become very flaky with optimizations enabled. Until some measurable
benefit is shown, remove the optimization passes to reduce overhead of
maintaining bytecode operations and to reduce CI churn. The framework
for optimizations will live on in git history, and can be restored once
proven useful.