For example, for 7z7c.gif, we now store one 500x500 frame and then
a 94x78 frame at (196, 208) and a 91x78 frame at (198, 208).
This reduces how much data we have to store.
We currently store all pixels in the bounding rect of the changed pixels.
We could in the future store pixels that are equal in that rect
as transparent pixels. When inputs are gif files, this would
guarantee that new frames have at most 256 distinct colors
(since GIFs require that), which would help a future color indexing
transform. For now, we don't do that though.
The API I'm adding here is a bit ugly:
* WebPs can only store x/y offsets that are a multiple of 2. This
currently leaks into the AnimationWriter base class; see the sketch
after this list for the alignment this implies.
(Since we potentially have to make a webp frame 1 pixel wider
and higher due to this, it's possible to have a frame that has
<= 256 colors in a gif input but > 256 colors in the webp,
if we do the technique above.)
* Every client writing animations has to have logic to track
previous frames, decide which of the two functions to call, etc.
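A minimal sketch of the even-offset alignment mentioned in the first
bullet above; the helper is hypothetical, not the actual
AnimationWriter code:

```cpp
// Align a changed-pixel rect (x, y, width, height) to the even x/y
// offsets WebP requires. Rounding the offsets down can grow the rect
// by one pixel per dimension, which is how a webp frame can pick up
// more distinct colors than the corresponding gif frame.
void align_to_even_offsets(int& x, int& y, int& width, int& height)
{
    int shift_x = x % 2;
    int shift_y = y % 2;
    x -= shift_x;
    y -= shift_y;
    width += shift_x;
    height += shift_y;
}
```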
This also adds an opt-out flag to `animation`, because:
1. Some clients apparently assume the size of the last VP8L
chunk is the size of the image
(see https://github.com/discord/lilliput/issues/159).
2. Having incremental frames is good for filesize and for
playing the animation start-to-end, but it makes it hard
to extract arbitrary frames (have to extract all frames
from start to target frame) -- but this is meant to be a
delivery codec, not an editing codec. It's also more vulnerable to
corrupted bytes in the middle of the file -- but transport
protocols are good these days.
(It'd also be an idea to write a full frame every N frames.)
For https://giphy.com/gifs/XT9HMdwmpHqqOu1f1a (a 184K gif),
output webp size goes from 21M to 11M.
For 7z7c.gif (an 11K gif), output webp size goes from 2.1M to 775K.
(The webp image data still isn't compressed at all.)
This means that SetVariable instructions will now remember which
(relative) environment contains the targeted binding, letting them bypass
the full binding resolution machinery on subsequent accesses.
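A minimal sketch of the caching idea with simplified types; the real
LibJS environments, Value type, and SetVariable instruction differ:

```cpp
#include <cstdint>
#include <optional>
#include <string>
#include <unordered_map>
#include <vector>

struct EnvironmentCoordinate {
    uint32_t hops { 0 };  // how many parent environments to skip
    uint32_t index { 0 }; // slot within that environment
};

struct Environment {
    std::unordered_map<std::string, uint32_t> slots; // name -> index into values
    std::vector<double> values;                      // double stands in for Value
    Environment* parent { nullptr };
};

// Full resolution: walk the chain once and note where the binding lives.
std::optional<EnvironmentCoordinate> resolve(Environment* env, std::string const& name)
{
    uint32_t hops = 0;
    for (; env; env = env->parent, ++hops) {
        if (auto it = env->slots.find(name); it != env->slots.end())
            return EnvironmentCoordinate { hops, it->second };
    }
    return std::nullopt;
}

// Fast path for a SetVariable-like operation: the cache is per
// instruction and relative to the current environment, so subsequent
// executions skip the walk entirely.
void set_variable(Environment* env, std::optional<EnvironmentCoordinate>& cache,
                  std::string const& name, double value)
{
    if (!cache)
        cache = resolve(env, name);
    if (!cache)
        return; // unresolved binding; the real code would throw
    auto* target = env;
    for (uint32_t i = 0; i < cache->hops; ++i)
        target = target->parent;
    target->values[cache->index] = value;
}
```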
Truncating the value is mathematically incorrect, and this error made
the conversion to grayscale unstable. In other words, calling
`to_grayscale` on a gray value would return a different value. For
example, `Color::from_string("#686868ff"sv).to_grayscale()` used to
return #676767ff.
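A minimal sketch of the difference; the weights below are illustrative
and may not match the ones Gfx::Color actually uses:

```cpp
#include <cmath>
#include <cstdint>

// Hypothetical luma computation with illustrative weights.
uint8_t luma_truncated(uint8_t r, uint8_t g, uint8_t b)
{
    return static_cast<uint8_t>(r * 0.2126 + g * 0.7152 + b * 0.0722); // truncates
}

uint8_t luma_rounded(uint8_t r, uint8_t g, uint8_t b)
{
    return static_cast<uint8_t>(std::round(r * 0.2126 + g * 0.7152 + b * 0.0722));
}

// For r = g = b = 0x68 (104), the weighted sum can land just below 104
// due to floating-point error. Truncation then yields 103 (0x67), while
// rounding yields 104 (0x68) and keeps gray values stable under
// repeated to_grayscale() calls.
```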
Merging registers, constants and locals into a single vector means:
- Better data locality
- No need to check type in Interpreter::get() and Interpreter::set()
which are very hot functions
Performance improvement is visible in almost all Octane and Kraken
tests.
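A minimal sketch of the merged layout with hypothetical names; the
real Interpreter uses the engine's Value type and sizes the vector
from the executable's metadata:

```cpp
#include <cstddef>
#include <vector>

// One flat vector holds registers, constants, and locals; an operand
// is just an index into it (hypothetical layout:
// [registers | constants | locals]). get()/set() no longer need to
// branch on which kind of operand they were handed.
struct Frame {
    std::vector<double> slots; // double stands in for the engine's Value

    double get(size_t operand_index) const { return slots[operand_index]; }
    void set(size_t operand_index, double value) { slots[operand_index] = value; }
};
```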
Instead of copying the image data pixel-by-pixel, we can memcpy full
scanlines at a time.
This knocks a 4% item down to <1% in profiles of Another World JS.
Filling a typed array with an integer shouldn't have to go through the
generic Set for every element.
This knocks a 7% item down to <1% in profiles of Another World JS at
https://cyxx.github.io/another_js/
When comparing two numbers, we can avoid a lot of implicit type
conversion nonsense and go straight to comparison, saving time in the
most common case.
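A minimal sketch of the fast path with a toy Value type; the real
engine's Value representation and abstract relational comparison
differ:

```cpp
#include <string>
#include <variant>

// Toy JS-like value: either a number or something else (a string here).
using Value = std::variant<double, std::string>;

// Greatly simplified stand-in for the generic ToNumber conversion.
double to_number(Value const& value)
{
    if (auto const* d = std::get_if<double>(&value))
        return *d;
    return std::stod(std::get<std::string>(value));
}

bool less_than(Value const& lhs, Value const& rhs)
{
    // Fast path: both sides are already numbers, so compare directly
    // and skip the implicit conversion machinery.
    if (auto const* a = std::get_if<double>(&lhs)) {
        if (auto const* b = std::get_if<double>(&rhs))
            return *a < *b;
    }
    // Slow path: run the (simplified) conversions first.
    return to_number(lhs) < to_number(rhs);
}
```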
To align the cursor theme tab with how the DisplaySettings themes tab
works, this change forces the theme combo box to disallow free-text
input. Currently, a click puts the text cursor in the box and allows
typing anything, rather than acting as a dropdown when clicking
anywhere on the field.
Fixes #24306
These were out-of-line because we had some ideas about marking
instruction streams PROT_READ only, but that seems pretty arbitrary and
there's a lot of performance to be gained by putting these inline.
Performing a lookup in the blob URL registry does not work in the case
of a web worker, as the registry is not shared between processes.
However, the URL passed to a worker has the blob attached to it, which
we can pull out of the URL on a fetch.
Implement this function to the spec, and use the full-blown URL parser
that handles blob URLs instead of the basic URL parser.
Also clean up a FIXME that no longer seems relevant.
Instead, generate bytecode to execute their AST nodes and save the
resulting operands inside the NewClass instruction.
Moving property expression evaluation to happen before NewClass
execution also moves along creation of new private environment and
its population with private members (private members should be visible
during property evaluation).
Before:
- NewClass
After:
- CreatePrivateEnvironment
- AddPrivateName
- ...
- AddPrivateName
- NewClass
- LeavePrivateEnvironment
This matches the IDL definition. Without this we experience a compile
time error when adding the WheelEvent constructor due to a mismatched
type between the enum class and UnsignedLong that the prototype is
passing through to the event init struct for the constructor.
This was an unfortunate artifact from development. I originally added
this class in the DOM namespace alongside HTMLCollection until I
realized it was defined by the HTML spec instead. To remind myself to
move it over to the HTML namespace, I left this comment. It was moved
to the correct namespace before upstreaming to the main repo, but I
forgot to remove this comment!
Once we see a frame with transparent pixels, we now toggle the
"has alpha" bit in the header.
To not require a SeekableStream opened for reading, we now pass the
unmodified original flag bit to WebPAnimationWriter.
The high-level design is that we have a static method on WebPWriter that
returns an AnimationWriter object. AnimationWriter has a virtual method
for writing individual frames. This allows streaming animations to disk,
without having to buffer up the entire animation in memory first.
The semantics of this function, add_frame(), are that data is flushed
to disk every time the function is called, so that no explicit `close()`
method is needed.
For some formats that store animation length at the start of the file,
including WebP, this means that this needs to write to a SeekableStream,
so that add_frame() can seek to the start and update the size when a
frame is written.
This design should work for GIF and APNG writing as well. We can move
AnimationWriter to a new header if we add writers for these.
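A minimal sketch of the shape of this API; add_frame() is named in the
text above, but the other names and signatures here are assumptions,
and the real code uses the LibGfx/AK types:

```cpp
#include <memory>

// Stand-ins for the engine's bitmap and seekable stream types.
struct Bitmap;
struct SeekableStream;

// Abstract interface: one virtual call per frame; data is flushed on
// every call, so no explicit close() is needed.
class AnimationWriter {
public:
    virtual ~AnimationWriter() = default;
    virtual bool add_frame(Bitmap const& frame, int duration_ms) = 0;
};

class WebPWriter {
public:
    // Hypothetical factory: writes the container headers up front and
    // returns a writer that seeks back to patch sizes as frames are
    // appended, which is why a SeekableStream is needed.
    static std::unique_ptr<AnimationWriter> start_animation(
        SeekableStream& stream, int loop_count);
};
```

A client would obtain the writer once from the static method and then
call add_frame() for each decoded input frame.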
Currently, `animation` can read any animated image format we can read
(apng, gif, webp) and convert it to an animated webp file.
The written animated webp file is not compressed whatsoever, so this
creates large output files at the moment.
This patch stops emitting the BlockDeclarationInstantiation instruction
when there are no locals, and no function declarations in the scope.
We were spending 20% of CPU time on https://ventrella.com/Clusters/ just
creating empty environments for no reason.
Instead of relying on native stack overflows to kick us out of circular
proxy chains, we now keep track of the recursion depth and kick
ourselves out if it exceeds 10'000.
This fixes an issue where compiler tail/sibling call optimizations would
turn infinite recursion into infinite loops, and thus never hit a stack
overflow to kick itself out.
For whatever reason, we've only seen the issue on SerenityOS with UBSAN,
but it could theoretically happen on any platform.
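A minimal sketch of the guard; the names are hypothetical, the counter
would live on the VM in practice, and a C++ exception stands in for
returning a throw completion:

```cpp
#include <stdexcept>

static constexpr int max_proxy_recursion_depth = 10'000;

// Bumps a recursion-depth counter for the duration of one proxy
// internal method call and bails out once the chain gets too deep,
// instead of waiting for a native stack overflow that tail-call
// optimization might never let us reach.
struct RecursionDepthGuard {
    int& depth;

    explicit RecursionDepthGuard(int& d)
        : depth(d)
    {
        if (depth >= max_proxy_recursion_depth)
            throw std::runtime_error("Proxy recursion depth exceeded");
        ++depth;
    }
    ~RecursionDepthGuard() { --depth; }
};
```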
If the minimum number of required bindings is known in advance, it can
be used to ensure capacity up front and avoid resizing the internal
vector that holds bindings.
By doing that all instructions required for instantiation are emitted
once in compilation and then reused for subsequent calls, instead of
running generic instantiation process for each call.
Fix the incorrect parsing in `Chess::Move::from_algebraic`
of moves with a capture and promotion (e.g. gxf8=Q) due to
the promoted piece type not being passed through to the
`Board::is_legal` method. This addresses a FIXME regarding
this issue in `ChessWidget.cpp`.
Instead of storing a full JS::Completion for the "throw completion"
case, we now store a simple JS::Value (wrapped in ErrorValue for the
type system).
This shrinks TCO<void> and TCO<Value> (the most common TCOs) by 8 bytes,
which has a non-trivial impact on performance.
If the callee is already a temporary register, we don't need to copy it
to *another* temporary before evaluating arguments. None of the
arguments will clobber the existing temporary anyway.
We only need to make copies of locals here, in case the locals are
modified by something like increment/decrement expressions.
Registers and constants can slip right through, without being Mov'ed
into a temporary first.
We know that `undefined` in the global scope is always the proper
undefined value. This commit takes advantage of that by simply emitting
a constant undefined value instead.
Unfortunately we can't be so sure in other scopes.
This helps some of the Cloudflare Turnstile stuff run faster, since they
are deliberately screwing with JS engines by asking us to do a bunch of
bitwise operations on e.g. 65535.56
By rounding such values in bytecode generation, the interpreter can stay
on the happy path while executing, and finish quite a bit faster.
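A minimal sketch of the pre-rounding, assuming the generator applies
ECMAScript ToInt32 semantics to numeric constants that feed bitwise
operators; the codegen hook itself is not shown:

```cpp
#include <cmath>
#include <cstdint>

// ToInt32 for doubles: truncate, wrap modulo 2^32, reinterpret as
// signed. 65535.56 becomes 65535, so the interpreter sees an integer
// operand and can stay on its integer fast path.
int32_t to_int32(double value)
{
    if (!std::isfinite(value) || value == 0.0)
        return 0;
    double truncated = std::trunc(value);
    double wrapped = std::fmod(truncated, 4294967296.0);
    if (wrapped < 0)
        wrapped += 4294967296.0;
    return static_cast<int32_t>(static_cast<uint32_t>(wrapped));
}
```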
We now fuse sequences like [LessThan, JumpIf] to JumpLessThan.
This is only allowed for temporaries (i.e. VM registers) with no other
references to them.
Specify the time in seconds to wait for a reply for each packet sent.
Replies received out of order will not be printed as replies, but they
will still be counted as replies when calculating statistics. Setting
it to 0 means an infinite timeout.
In flood ping mode, the time interval between each request is set
to zero to provide a rapid display of how many packets are being
dropped. For each request a period '.' is printed, while for every
reply a backspace is printed.
Before this change, switch codegen would interleave bytecode like this:
(test for case 1)
(code for case 1)
(test for case 2)
(code for case 2)
This meant that we often had to make many large jumps while looking for
the matching case, since code for each case can be huge.
It now looks like this instead:
(test for case 1)
(test for case 2)
(code for case 1)
(code for case 2)
This way, we can just fall through the tests until we hit one that fits,
without having to make any large jumps.
This removes a layer of indirection in the bytecode where we had to make
sure all the initializer elements were laid out in sequential registers.
Array expressions no longer clobber registers permanently, and they can
be reused immediately afterwards.
This patch adds a register freelist to Bytecode::Generator and switches
all operands inside the generator to a new ScopedOperand type that is
ref-counted and automatically frees the register when nothing uses it.
This dramatically reduces the size of bytecode executable register
windows, which were often in the several thousands of registers for
large functions. Most functions now use less than 100 registers.
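A minimal sketch of the idea with hypothetical names; the real
ScopedOperand also covers constants and locals and is tied to
Bytecode::Generator:

```cpp
#include <cstdint>
#include <memory>
#include <vector>

// Hands out register indices and recycles them through a freelist.
class RegisterAllocator {
public:
    uint32_t allocate()
    {
        if (!m_freelist.empty()) {
            auto index = m_freelist.back();
            m_freelist.pop_back();
            return index;
        }
        return m_next_register++;
    }
    void free(uint32_t index) { m_freelist.push_back(index); }
    uint32_t register_count() const { return m_next_register; }

private:
    uint32_t m_next_register { 0 };
    std::vector<uint32_t> m_freelist;
};

// Ref-counted handle: when the last copy referring to a register goes
// away, the register returns to the freelist and can be reused by
// later operands. The allocator must outlive all ScopedOperands.
class ScopedOperand {
public:
    ScopedOperand(RegisterAllocator& allocator, uint32_t index)
        : m_state(new uint32_t(index), [&allocator](uint32_t* value) {
            allocator.free(*value);
            delete value;
        })
    {
    }
    uint32_t index() const { return *m_state; }

private:
    std::shared_ptr<uint32_t> m_state;
};
```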
The first block in every executable will always execute first, so if it
ends up doing a ResolveThisBinding, it's fine for all other blocks
within the same executable to use the same `this` value.
In the common case, parseInt() is being called with strings that don't
have leading whitespace. By checking for that first, we can avoid the
cost of trimming and multiple malloc/GC allocations.
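A minimal sketch of the check; isspace() stands in for the JS
whitespace set and the helper name is hypothetical:

```cpp
#include <cctype>
#include <string_view>

// Only pay for trimming when the string actually starts with
// whitespace; the common parseInt() input has none.
std::string_view trimmed_for_parse_int(std::string_view input)
{
    if (input.empty() || !std::isspace(static_cast<unsigned char>(input.front())))
        return input; // fast path: no scan, no allocation
    std::size_t i = 0;
    while (i < input.size() && std::isspace(static_cast<unsigned char>(input[i])))
        ++i;
    return input.substr(i);
}
```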
Once executed, this instruction will always produce the same result
in subsequent executions, so it's okay to cache it.
Unfortunately it may throw, so we can't just hoist it to the top of
every executable, since that would break observable execution order.
Two bugs:
1. Correctly set bits in VP8X header.
Turns out these were set in the wrong order (see the sketch after this
list).
2. Correctly set the `has_alpha` flag.
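A minimal sketch of the flag byte per the WebP container spec
(MSB-first: two reserved bits, ICC, alpha, EXIF, XMP, animation, one
reserved bit); the enum and helper are hypothetical, not the actual
WebPWriter code:

```cpp
#include <cstdint>

// VP8X flag bits, MSB-first per the WebP container spec.
enum VP8XFlags : uint8_t {
    VP8X_ICC       = 0x20,
    VP8X_ALPHA     = 0x10,
    VP8X_EXIF      = 0x08,
    VP8X_XMP       = 0x04,
    VP8X_ANIMATION = 0x02,
};

uint8_t vp8x_flags(bool has_icc, bool has_alpha, bool has_animation)
{
    uint8_t flags = 0;
    if (has_icc)
        flags |= VP8X_ICC;
    if (has_alpha)
        flags |= VP8X_ALPHA;
    if (has_animation)
        flags |= VP8X_ANIMATION;
    return flags;
}
```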
Also add a test for writing webp files with icc data. With the
additional checks in other commits in this PR, this test catches
the bug in WebPWriter.
Rearrange some existing functions to make it easier to write this test:
* Extract encode_bitmap() from get_roundtrip_bitmap().
encode_bitmap() allows passing extra_args that the test uses to pass
in ICC data.
* Extract expect_bitmaps_equal() from test_roundtrip()
If this turns out to be too strict in practice, we can replace it with
a `dbgln("VP8X and VP8L headers disagree about alpha; ignoring VP8X");`
instead.
Also update catdog-alert-13-alpha-used-false.webp to not trigger this.
I had manually changed the VP8L alpha flag at offset 0x2a in
da48238fbd to clear it, but I hadn't changed the VP8X flag.
This changes the byte at offset 0x14 from 0x10 (has_alpha) to 0x00
(no alpha) as well, to match.
If this turns out to be too strict in practice, we can replace it with
a dbgln() with the same message, but with ", ignoring header."
tacked on at the end.
The stream passed to LineProgram::create is a temporary, so rather than
keeping an (unused) dangling reference, pass this stream explicitly to
all functions used by LineProgram::create.
In particular, define a static LibC library *in LibC's CMakeLists* and
use it in DynamicLoader. This is similar to the way LibELF is included
in DynamicLoader.
Additionally, compile DynamicLoader with -ffunction-sections,
-fdata-sections, and -Wl,--gc-sections. This brings the loader size from
~2 MB to ~1 MB with debug symbols and from ~500 KB to ~150 KB without.
Also, this makes linking DynamicLoader with LibTimeZone unnecessary.
Static libc on Serenity is broken in more than one way and requires a
lot of patches to bring it to a usable and useful state. Therefore,
instead of keeping it around (and breaking even more) during the
upcoming libc build refactor, let's just delete it.
Previously, if DynamicLoader was invoked directly, it couldn't perform
syscalls past the point just before the call to the main executable's
entry point. This made dlopen() crash if it was invoked from within an
executable that was itself executed using
`/usr/lib/Loader.so <executable>`.
The fix can be tested by executing
`cd /usr/Tests/LibELF; /usr/lib/Loader.so /usr/Tests/LibELF/TestDlOpen`
in Serenity's terminal.
That way, we can write 0 instead of 8 bits for every alpha byte.
Reduces the size of sunset-retro.png when saved as a webp file
from 3 MiB to 2.25 MiB, without affecting encode speed.
Once we use CanonicalCodes we'll get this for free for all channels,
but opaque images are common enough that it feels worth it to do this
before then.
If we don't have enough stack space, throw an exception while we still
can, and give the caller a chance to recover.
This particular problem will go away once we make calls non-recursive.
This means inlining all the things. This yields a 40% speedup on the for
loop microbenchmark, and everything else gets faster as well. :^)
This makes compilation take foreeeever with GCC, so I'm only enabling it
for Clang in this commit. We should figure out how to make GCC compile
this without timing out CI, since the speedup is amazing.
This commit converts the main loop in Bytecode::Interpreter to use a
label table and computed goto for fast instruction dispatch.
This yields roughly 35% speedup on the for loop microbenchmark,
and makes everything else faster as well. :^)
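A minimal, self-contained sketch of the technique using the GCC/Clang
labels-as-values extension; the real interpreter dispatches over the
full LibJS instruction set:

```cpp
#include <cstdio>

// Toy instruction set: push an immediate, add the top two values, halt.
enum Op : unsigned char { OP_PUSH, OP_ADD, OP_HALT };

int run(unsigned char const* code)
{
    // One label per opcode; dispatch is a single indexed indirect jump.
    void* const jump_table[] = { &&op_push, &&op_add, &&op_halt };
    int stack[16];
    int sp = 0;
    unsigned char const* pc = code;

#define DISPATCH() goto *jump_table[*pc++]

    DISPATCH();

op_push:
    stack[sp++] = *pc++; // immediate operand follows the opcode
    DISPATCH();
op_add:
    --sp;
    stack[sp - 1] += stack[sp];
    DISPATCH();
op_halt:
    return stack[sp - 1];

#undef DISPATCH
}

int main()
{
    unsigned char const program[] = { OP_PUSH, 2, OP_PUSH, 3, OP_ADD, OP_HALT };
    std::printf("%d\n", run(program)); // prints 5
}
```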
Now that the interpreter is unrolled, we can advance the program counter
manually based on the current instruction type.
This makes most instructions a bit smaller. :^)
This commit adds a HANDLE_INSTRUCTION macro that expands to everything
needed to handle a single instruction (invoking the handler function,
checking for exceptions, and advancing the program counter).
This gives a ~15% speed-up on a for loop microbenchmark, and makes
basically everything faster.
Instead of storing a BasicBlock* and forcing the size of Label to be
sizeof(BasicBlock*), we now store the basic block index as a u32.
This means the final version of the bytecode is able to keep labels
at sizeof(u32), shrinking the size of many instructions. :^)
Instead of storing source offsets with each instruction, we now keep
them in a side table in Executable.
This shrinks each instruction by 8 bytes, further improving locality.
Instead of keeping bytecode as a set of disjoint basic blocks on the
malloc heap, bytecode is now a contiguous sequence of bytes(!)
The transformation happens at the end of Bytecode::Generator::generate()
and the only really hairy part is rerouting jump labels.
This required solving a few problems:
- The interpreter execution loop had to change quite a bit, since we
were storing BasicBlock pointers all over the place, and control
transfer was done by redirecting the interpreter's current block.
- Exception handlers & finalizers are now stored per-bytecode-range
in a side table in Executable.
- The interpreter now has a plain program counter instead of a stream
iterator. This actually makes error stack generation a bit nicer
since we just have to deal with a number instead of reaching into
the iterator.
This yields a 25% performance improvement on this microbenchmark:
for (let i = 0; i < 1_000_000; ++i) { }
But basically everything gets faster. :^)
Before this change, all JumpFoo instructions inherited from Jump, which
forced the unconditional Jump to have an unused "false target" member.
Also, labels were unnecessarily wrapped in Optional<>.
By defining each jump instruction separately, they all shrink in size,
and all ambiguity is removed.
For bitmap fonts, we will often not have an exact match for requested
sizes. Return the closest match instead of a nullptr.
LibWeb is currently the only user of this API. If it needs to be
configurable in the future to only allow exact matches, we can add a
parameter or another method at that time.
This commit replaces the `import_pgn` implementation
with a more resilient parser to handle various edge
cases and give helpful error messages instead of crashing.
This doesn't use any transforms yet (in particular not the predictor
transform), and doesn't do anything else that actually compresses the
data.
It also gives all 256 values code length 8 unconditionally. This means
the huffman trees are not data-dependent at all and provide no
compression. It also means we can just write out the image data
unmodified.
So the output is fairly large. But it _is_ a valid lossless webp file.
Possible follow-ups, to improve compression later:
1. Use actual byte distributions to create huffman trees, to get
huffman compression.
2. If the distribution has just 1 element, write a simple code length
code (that way, images that have alpha 0xff everywhere need to store
no data for alpha).
3. Add backref support, to get full deflate(ish) compression of pixels.
4. Add predictor transform.
5. Consider writing different sets of prefix codes for every 16x16 tile
(by writing a meta prefix code image).
(It might be nice to make the follow-ups optional, so that this can also
be used as a webp example file generator.)
If all lexical declarations use local variables, then there is no need
to allocate a declarative environment.
With this change we skip ~3x more environment allocations on GitHub.
This makes sure `get_mask_type_of_svg()` finds the mask or clipPath by
looking at its child layout nodes. Previously, it went via the DOM node
of the mask or clipPath, which is not always correct as there is not a
1-to-1 mapping from mask DOM node to SVGMaskBox (or SVGClipBox).
Fixes #24186