Deflate and WebP can store at most 15 bits per symbol, meaning their
huffman trees can be at most 15 levels deep.
During construction, when we hit this limit, we used to try again
with an ever-lower frequency cap per symbol. This had the effect
of giving the symbols with the highest frequency lower frequencies
first, causing the most-frequent symbols to be merged. For example,
maybe the most-frequent symbol had 1 bit, and the 2nd-most-frequent
two bits (and everything else at least 3). With the cap, the two
most frequent symbols might both end up with 2 bits, freeing up bits
for the lower levels of the tree.
This has the effect of making the most-frequent symbols longer at
first, which isn't great for file size.
Instead of using a frequency cap, ignore ever more of the low
bits of the frequencies. This sacrifices resolution where it hurts
less: it affects the lower levels of the tree first, and the symbols
there are stored less frequently.
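The idea, roughly (a simplified sketch with made-up names, not the actual
implementation):

    #include <algorithm>
    #include <cstdint>
    #include <vector>

    // Drop ever more low bits of the frequencies before building the tree.
    // Non-zero symbols must stay non-zero so they still get a code assigned.
    std::vector<uint32_t> coarsen_frequencies(std::vector<uint32_t> const& frequencies, unsigned shift)
    {
        std::vector<uint32_t> coarse(frequencies.size());
        std::transform(frequencies.begin(), frequencies.end(), coarse.begin(),
            [shift](uint32_t f) { return f == 0 ? 0u : std::max<uint32_t>(1, f >> shift); });
        return coarse;
    }

The caller would retry with shift = 1, 2, ... until the tree built from the
coarsened frequencies is at most 15 levels deep.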
For deflate, the 64 kiB block size means this doesn't make much of a
difference, but for WebP it can have a big effect:
sunset-retro.png (876K): 2.02M -> 1.73M -- now (very slightly) smaller
than twice the input size! Maybe we'll be competitive one day.
(For wow.webp and 7z7c.webp, it has no effect, since we don't hit
the "tree too deep" case there; those images have relatively few
colors.)
No behavior change other than smaller file size. (No performance
cost either, and it's less code too.)
For our deflate, block size is limited to less than 64 kiB, so the sum
of all byte frequencies always fits in a u16 by construction.
While I haven't hit this overflow in practice, it can conceivably
happen when writing WebP files, which currently use a single huffman
tree (per channel) for a whole image -- often much larger than
64 kiB.
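Just to illustrate the magnitude (not the commit's actual change): one channel
of even a modest whole image already exceeds what a u16 can sum, while a
single deflate block cannot:

    #include <cstdint>
    #include <cstdio>

    int main()
    {
        uint32_t deflate_max_sum = 64 * 1024 - 1; // frequencies in one < 64 kiB block
        uint32_t webp_channel = 1024 * 768;       // symbols in one channel of a 1024x768 image
        std::printf("%u <= 65535, but %u > 65535\n", deflate_max_sum, webp_channel);
        return 0;
    }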
No dramatic behavior change in practice, just feels more correct.
To be used in WebPWriter.
JPEGWriter currently hardcodes huffman tables; maybe it can use this
to build data-dependent huffman tables in the future as well.
Pure code move (except for removing the `DeflateCompressor::` prefix
on the function's name, and putting the default value for the 4th
argument in the function's definition), no behavior change.
This moves both Gfx::CanonicalCode::write_symbol() and
Compress::CanonicalCode::write_symbol() inline.
It also adds `__attribute__((always_inline))` on the lambdas passed
to visit() in the latter. (ALWAYS_INLINE doesn't work on lambdas.)
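For reference, the construct looks roughly like this (a self-contained sketch
using std::variant, not the actual LibCompress code):

    #include <cstddef>
    #include <cstdint>
    #include <cstdio>
    #include <variant>

    // The GNU attribute goes after the lambda's parameter list; that's where
    // the compiler attaches it to the lambda's call operator.
    static std::size_t symbol_width(std::variant<uint8_t, uint16_t> const& symbol)
    {
        return std::visit(
            [](auto value) __attribute__((always_inline)) { return sizeof(value); },
            symbol);
    }

    int main()
    {
        std::printf("%zu\n", symbol_width(uint16_t { 300 }));
        return 0;
    }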
Numbers with `ministat`: I ran this once:
Build/lagom/bin/image -o test.bmp Base/res/wallpapers/sunset-retro.png
and then ran this to benchmark:
~/src/hack/bench.py -n 20 -o bench_foo1.txt \
Build/lagom/bin/image -o test.webp test.bmp
...and then `ministat bench_foo1.txt bench_foo2.txt` to compare.
The previous commit increased the time for this command by 38% compared
to the before state.
With this, it's an 8.6% regression. So still a regression, but a smaller
one.
Or, in other words, this commit reduces times by 21% compared to the
previous commit.
Numbers with hyperfine are similar -- with this change on top of the
previous commit, it's a 7-11% regression, instead of an almost 50%
regression.
(A local branch that changes how we compute CanonicalCodes so that we
actually compress a bit is perf-neutral since the image writing code
doesn't change.)
`hyperfine 'image -o test.webp test.bmp'`:
* Before: 23.7 ms ± 0.7 ms (116 runs)
* Previous commit: 33.2 ms ± 0.8 ms (82 runs)
* This commit: 25.5 ms ± 0.7 ms (102 runs)
`hyperfine 'animation -o wow.webp giphy.gif'`:
* Before: 85.5 ms ± 2.0 ms (34 runs)
* Previous commit: 127.7 ms ± 4.4 ms (22 runs)
* This commit: 95.3 ms ± 2.1 ms (31 runs)
`hyperfine 'animation -o wow.webp 7z7c.gif'`:
* Before: 12.6 ms ± 0.6 ms (198 runs)
* Previous commit: 16.5 ms ± 0.9 ms (153 runs)
* This commit: 13.5 ms ± 0.6 ms (186 runs)
These changes are compatible with clang-format 16 and will be mandatory
when we eventually bump clang-format version. So, since there are no
real downsides, let's commit them now.
We previously skipped updating the lookback buffer when copying
uncompressed data, which resulted in a wrong total byte count.
With a wrong total byte count, our decompressor implementation
ended up choosing a wrong offset into the dictionary.
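A simplified sketch of the idea (generic code with made-up names, not the
actual implementation): bytes that are copied verbatim still have to go
through the lookback buffer so the byte count and later match offsets line up.

    #include <cstddef>
    #include <cstdint>
    #include <vector>

    struct Writer {
        std::vector<uint8_t> output;
        std::vector<uint8_t> lookback; // the window that match offsets refer to
        std::size_t total_bytes { 0 };

        void copy_uncompressed(std::vector<uint8_t> const& bytes)
        {
            output.insert(output.end(), bytes.begin(), bytes.end());
            // This part used to be skipped for uncompressed copies, leaving
            // the lookback buffer (and the total byte count) out of date.
            lookback.insert(lookback.end(), bytes.begin(), bytes.end());
            total_bytes += bytes.size();
        }
    };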
Previously, we would calculate the index of the first parent node as
heap.size() (which starts out as non_zero_freqs), so in the edge
case in which all symbols had a non-zero frequency, we would use the
entry at index Size in the array for both the first symbol's leaf
node and the first parent node.
The result would either be a non-optimal huffman code (bad) or an
illegal huffman code that would then go on to crash due to an error
check in CanonicalCode::from_bytes (worse).
We now store parent nodes starting at heap.size() - 1, which eliminates
the potential overlap, and resolves the issue.
This method takes bytes as input and decompresses everything to a
ByteBuffer. It uses two control codes (clear and end of data) as
described in the GIF, TIFF and PDF specifications.
Some users of the LZW algorithm use a different value to determine when
the size of the code changes. GIF increments the size when the number of
elements in the table is equal to 2^code_size, while TIFF does it at a
count of 2^code_size - 1.
This patch adds the parameter m_offset_for_size_change with a default
value of 0; our decoder will increment the code size when the table
length reaches 2^code_size + m_offset_for_size_change. This allows us
to support both situations.
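The check itself boils down to something like this (a sketch following the
description above, written as a free function rather than the actual member):

    #include <cstddef>

    // GIF uses an offset of 0: grow when the table holds 2^code_size entries.
    // TIFF uses an offset of -1: grow one entry earlier.
    static bool should_grow_code_size(std::size_t table_length, int code_size, int offset_for_size_change)
    {
        return static_cast<long long>(table_length)
            >= (1LL << code_size) + offset_for_size_change;
    }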
XZ writes filters in the order that they are used during compression, so
we need to process them in the reverse order during decompression.
This wasn't noticed earlier because we only supported the LZMA2 filter.
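In other words (an illustrative sketch, not the actual XZ code): if the data
went through a delta filter and then LZMA2 during compression, the decoder
has to undo LZMA2 first and the delta filter last.

    #include <string>
    #include <vector>

    // Takes the filter list as recorded in the block header (compression
    // order) and returns the order in which decoders have to be applied.
    static std::vector<std::string> decoder_order(std::vector<std::string> const& filters)
    {
        return std::vector<std::string>(filters.rbegin(), filters.rend());
    }

e.g. decoder_order({ "delta", "lzma2" }) yields { "lzma2", "delta" }.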
Since it will become a stream in a little bit, it should behave like all
non-trivial stream classes, which are not primarily intended to have
shared ownership, so that closing behavior stays predictable. Across all
uses of MappedFile, there is only one use case of shared mapped files in
LibVideo, which now uses the thin SharedMappedFile wrapper.
The class was an inner class of `BrotliDecompressionStream`; let's move
it outside the `Stream` object in order to ease access for users only
interested in this part.
These routines:
- read_prefix_code
- read_simple_prefix_code
- read_complex_prefix_code
were methods of `BrotliDecompressionStream` taking a `CanonicalCode` as
an out parameter. This patch puts them in `CanonicalCode` as static
methods.
This now searches the memory in blocks, which should be slightly more
efficient. However, it doesn't make much difference (e.g. ~1% in LZMA
compression) in most real-world applications, as the non-hint function
is more expensive by orders of magnitude.
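The general shape of a block-wise scan (a generic sketch, not the actual
CircularBuffer code): each contiguous block gets handed to memchr instead of
being walked byte by byte.

    #include <cstddef>
    #include <cstring>
    #include <optional>
    #include <span>
    #include <vector>

    static std::optional<std::size_t> find_byte_in_blocks(
        std::vector<std::span<unsigned char const>> const& blocks, unsigned char needle)
    {
        std::size_t offset = 0;
        for (auto block : blocks) {
            if (block.empty())
                continue;
            if (auto const* hit = static_cast<unsigned char const*>(
                    std::memchr(block.data(), needle, block.size())))
                return offset + static_cast<std::size_t>(hit - block.data());
            offset += block.size();
        }
        return std::nullopt;
    }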
The "operation modes" of this function have very different focuses, and
trying to combine both in a way where we share the most amount of code
probably results in the worst performance.
Instead, split up the function into "existing distances" and "no
existing distances" so that we can optimize either case separately.
We will be adding extra logic to the CircularBuffer to optimize
searching, but this would negatively impact the performance of
CircularBuffer users that don't need that functionality.