Instead of checking the header in ZlibDecompressor::create(), we now
check it in read_some() when it is called for the first time. This
resolves a FIXME in the new DecompressionStream implementation.
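Roughly, the change looks like this (just a sketch, not the exact LibCompress code; m_header_validated, validate_header() and m_deflate_stream are made-up names):

ErrorOr<Bytes> ZlibDecompressor::read_some(Bytes bytes)
{
    if (!m_header_validated) {
        // Moved out of create(): reject a bad zlib header on the
        // first read instead of up front.
        TRY(validate_header());
        m_header_validated = true;
    }
    return m_deflate_stream->read_some(bytes);
}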
These compressors will be used by the W3C CompressionStream API, which
can run arbitrary JS, so they may never reach their "finish" steps.
Let's not crash the WebContent process if that happens.
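In practice that mostly means the teardown path must tolerate a stream that never finished; a sketch under that assumption:

GzipCompressor::~GzipCompressor()
{
    // An unfinished stream must not trip an assertion here anymore:
    // JS may simply abandon the CompressionStream, so any pending
    // (unflushed) data is dropped instead of crashing WebContent.
}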
GzipCompressor is currently written assuming that its write_some()
method is only called once. When we use this class for LibWeb, we may
very well receive the data to compress in small chunks. So this patch
makes us write the gzip header and footer only once, which makes the
class resemble the zlib and deflate compressors.
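In sketch form (m_wrote_header and write_header() are illustrative names, not necessarily the real ones):

ErrorOr<size_t> GzipCompressor::write_some(ReadonlyBytes bytes)
{
    if (!m_wrote_header) {
        // Emit the gzip header exactly once, on the first write.
        TRY(write_header());
        m_wrote_header = true;
    }
    return m_deflate_stream->write_some(bytes);
}
// The footer (CRC32 + ISIZE) is likewise emitted exactly once, from
// the finish step.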
Compared to version 10, this fixes a bunch of formatting issues, mostly
around structs/classes with attributes like [[gnu::packed]] and the
incorrect insertion of spaces in parameter types ("T &"/"T &&").
I also removed a bunch of // clang-format off/on and FIXME comments
that are no longer relevant. On the other hand, the new version tried
to destroy a couple of neatly formatted comments, so I had to add a few
new markers as well.
OutputMemoryStream was originally a proxy for DuplexMemoryStream that
did not expose any reading API.
Now I need to add another class that is like OutputMemoryStream, but
only for static buffers. My first idea was to make OutputMemoryStream
handle that too, but I think it's much better to have a distinct class
for it.
I originally wanted to call that class FixedOutputMemoryStream, but
that name is really cumbersome, and it's a bit unintuitive because
InputMemoryStream already reads from a fixed buffer.
So let's just use DuplexMemoryStream instead of OutputMemoryStream for
anything dynamic and create a new OutputMemoryStream for static
buffers.
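As a rough sketch (not the actual AK code), the new class is little more than a write offset over a caller-provided buffer:

class OutputMemoryStream {
public:
    explicit OutputMemoryStream(Bytes bytes)
        : m_bytes(bytes)
    {
    }

    size_t write(ReadonlyBytes bytes)
    {
        // Anything that doesn't fit into the static buffer is trimmed.
        auto nwritten = bytes.copy_trimmed_to(m_bytes.slice(m_offset));
        m_offset += nwritten;
        return nwritten;
    }

    ReadonlyBytes bytes() const { return m_bytes.trim(m_offset); }

private:
    Bytes m_bytes;
    size_t m_offset { 0 };
};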
Consider the following snippet:
void foo(InputStream& stream) {
    if (!stream.eof()) {
        u8 byte;
        stream >> byte;
    }
}
There is a very subtle bug in this snippet: for some input streams,
eof() might return false even though no more data can be read. In that
case, the failed read sets an error flag on the stream.
Until now I've always ensured that this is not the case, but this made
the implementation of eof() unnecessarily complicated.
InputFileStream::eof had to keep a ByteBuffer around just to make this
possible. That meant a ton of unnecessary copies just to get a reliable
eof().
In most cases it isn't actually necessary to have a reliable eof()
implementation. And in most of the remaining cases a reliable eof() is
available anyway, because for streams like InputMemoryStream it is very
easy to implement.
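The pattern callers should use instead is to read first and check the stream afterwards; a sketch (treat handle_any_error() as my recollection of the AK error-flag API, not gospel):

void foo(InputStream& stream) {
    u8 byte;
    stream >> byte;
    if (stream.handle_any_error()) {
        // eof() may have lied; no byte was actually available.
    }
}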
The streaming operator doesn't short-circuit; consider the following
snippet:
void foo(InputStream& stream) {
    int a, b;
    stream >> a >> b;
}
If the first read fails, the second is still executed. What happens in
that case should be well defined: nothing.
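In sketch form, that means a failed stream turns every following read into a no-op (has_any_error() and read_or_error() are assumed from the AK stream API):

InputStream& operator>>(InputStream& stream, int& value)
{
    if (stream.has_any_error())
        return stream; // a prior failure makes this read a no-op
    stream.read_or_error({ &value, sizeof(value) });
    return stream;
}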
I suspected an error in CircularDuplexStream::read(Bytes, size_t). That
does not appear to be the case, but this test case is useful regardless.
The following script was used to generate the test:
import gzip

uncompressed = []
for _ in range(0x100):
    uncompressed.append(1)
for _ in range(0x7e00):
    uncompressed.append(0)
for _ in range(0x100):
    uncompressed.append(1)

compressed = gzip.compress(bytes(uncompressed))
compressed = ", ".join(f"0x{byte:02x}" for byte in compressed)

print(f"""\
TEST_CASE(gzip_decompress_repeat_around_buffer)
{{
    const u8 compressed[] = {{
        {compressed}
    }};

    u8 uncompressed[0x8011];
    Bytes{{ uncompressed, sizeof(uncompressed) }}.fill(0);
    uncompressed[0x8000] = 1;

    const auto decompressed = Compress::GzipDecompressor::decompress_all({{ compressed, sizeof(compressed) }});

    EXPECT(compare({{ uncompressed, sizeof(uncompressed) }}, decompressed.bytes()));
}}
""", end="")
Now we have an actual stream implementation that can read arbitrary
deflate-encoded data (dynamic codes aren't supported yet), even if the
blocks are really large. And all of that happens with a single 32KiB
buffer: DEFLATE back-references never reach more than 32KiB into the
past, so that one buffer is all the window we need. DEFLATE is amazing!
Previously, the implementation would produce one Vector<u8> containing
the whole decompressed data. That can be a lot, and it can even exhaust
memory. With these changes it is still necessary to store the whole
input data in one piece (I am working on that next), but the output can
be read block by block. (That's not optimal either, because blocks can
be arbitrarily large, but it's good enough for now.)
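Hypothetical usage under the new model (compressed_stream is any InputStream; the exact names, like unreliable_eof(), are my best recollection of this API, so treat them as assumptions):

u8 buffer[4096];
Compress::DeflateDecompressor decompressor { compressed_stream };
while (!decompressor.unreliable_eof()) {
    auto nread = decompressor.read({ buffer, sizeof(buffer) });
    if (decompressor.handle_any_error())
        break;
    // Consume buffer[0 .. nread) here; only the 32KiB window plus this
    // small buffer stays resident, not the whole decompressed output.
}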