...but only as long as REFAGGNINST == 1. That's enough for 0000337.pdf.
Except that it also needs GRTEMPLATE=1 support in the generic
refinement region decoding procedure, so no behavior change yet.
...instead of a lambda that checks the template on every call.
Doesn't make a performance difference locally, but seems maybe nicer?
No behavior change.
Template 2 is needed by some symbols in 0000372.pdf page 11 and
0000857.pdf pages 1-4. Implement the others too while here. (The
mentioned pages in those two PDFs also use the "end of stripe" segment,
so they still don't render yet.)
We still don't support EXTTEMPLATE.
...and make text_region_decoding_procedure() call it.
generic_refinement_region_decoding_procedure() still just returns
"unimplemented", so no behavior change yet.
Text segments using refinement are still rejected later, by
text_region_decoding_procedure(). But we deserialize the input data now,
and the error when this feature is used is now slightly different.
It seems to do the right thing already, and nothing in the spec says
not to do this as far as I can tell.
With this, we can finally decode the test input from #23659.
See f391c7822d for a similar change for generic regions and
lossless generic regions.
Text segments conceptually store (x,y,id) triples. (x,y) are a
coordinate pair, and id refers to an id from a symbol segment.
A text segment has the effect of drawing some of the bitmaps stored
in a symbol segment to the output bitmap.
For example, the symbol segment might contain a small bitmap that
happens to look like the letter 'A', and the text segment might
draw that everywhere a scanned page has an 'A'. (The JBIG2 format
only treats it as an abstract bitmap. It doesn't know that this
small bitmap is an 'A'.)
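A rough sketch of that idea, not the actual LibGfx code: `Bitmap`, `SymbolInstance` and `draw_text_region()` are made-up names, pixels are stored one per byte for brevity, and (x,y) is assumed to be the top-left corner of the symbol (the real format has more placement options).

```cpp
#include <cstddef>
#include <cstdint>
#include <vector>

// Hypothetical bilevel bitmap; one byte per pixel keeps the sketch short.
struct Bitmap {
    int width;
    int height;
    std::vector<uint8_t> pixels; // row-major, 0 = white, 1 = black

    Bitmap(int w, int h)
        : width(w), height(h), pixels(static_cast<size_t>(w) * static_cast<size_t>(h), 0)
    {
    }
    uint8_t get(int x, int y) const { return pixels[static_cast<size_t>(y) * static_cast<size_t>(width) + static_cast<size_t>(x)]; }
    void set(int x, int y, uint8_t v) { pixels[static_cast<size_t>(y) * static_cast<size_t>(width) + static_cast<size_t>(x)] = v; }
};

// One (x,y,id) triple: draw symbol `id` with its top-left corner at (x, y).
struct SymbolInstance {
    int x;
    int y;
    size_t id;
};

// ORs every referenced symbol into `page`. Unknown ids and pixels that
// fall outside the page are skipped.
void draw_text_region(Bitmap& page, std::vector<Bitmap> const& symbols, std::vector<SymbolInstance> const& instances)
{
    for (auto const& instance : instances) {
        if (instance.id >= symbols.size())
            continue;
        auto const& symbol = symbols[instance.id];
        for (int sy = 0; sy < symbol.height; ++sy) {
            for (int sx = 0; sx < symbol.width; ++sx) {
                int px = instance.x + sx;
                int py = instance.y + sy;
                if (px < 0 || py < 0 || px >= page.width || py >= page.height)
                    continue;
                if (symbol.get(sx, sy))
                    page.set(px, py, 1);
            }
        }
    }
}
```

The symbol bitmap is stored once and referenced as often as needed, which is where the format gets its compression for scanned text.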
This is missing support for many things:
* Huffman-coded input (not used in practice)
* Symbol refinement
* Transposed symbols
* Colors (not used in practice)
Still, we now have basic symbol/text segment support. This is enough
to decode the downloadable PDF here:
https://www.google.com/books/edition/Paradise_Lost/6qdbAAAAQAAJ
It doesn't lead to any progression on my 1000 file test PDF set.
The 7 files in there that use JBIG2 with symbol and text segments
now fail to load for other reasons (4 need symbol refinement for
text segments, one needs end-of-stripe segment support, one needs
support for symbol segments referring to other segments).
(And possibly, many other PDFs from Google Books, but that's the
only one I've tried so far.)
This extracts the bitbuffer combining code we had into a new function
composite_bitbuffer() and adds the following features:
* Real support for combination operators (which also lets us allow black
as background color again, even if that's never used in practice)
* Clipping support (not used here yet, but will be needed elsewhere
soon)
We're going to need this for text segment handling.
No behavior change.
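For illustration only, here is roughly what compositing with combination operators and a clip rect can look like. This is not the actual composite_bitbuffer(): `Bits`, `Rect` and `composite_sketch()` are hypothetical, and pixels are one byte each to keep it short.

```cpp
#include <algorithm>
#include <cstddef>
#include <cstdint>
#include <vector>

enum class CombinationOperator { Or, And, Xor, XNor, Replace };

// Hypothetical bilevel image, one byte (0 or 1) per pixel.
struct Bits {
    int width;
    int height;
    std::vector<uint8_t> pixels;

    Bits(int w, int h, uint8_t fill)
        : width(w), height(h), pixels(static_cast<size_t>(w) * static_cast<size_t>(h), fill)
    {
    }
    uint8_t& at(int x, int y) { return pixels[static_cast<size_t>(y) * static_cast<size_t>(width) + static_cast<size_t>(x)]; }
};

struct Rect {
    int x, y, width, height;
};

static uint8_t combine(uint8_t dst, uint8_t src, CombinationOperator op)
{
    switch (op) {
    case CombinationOperator::Or:
        return static_cast<uint8_t>(dst | src);
    case CombinationOperator::And:
        return static_cast<uint8_t>(dst & src);
    case CombinationOperator::Xor:
        return static_cast<uint8_t>(dst ^ src);
    case CombinationOperator::XNor:
        return static_cast<uint8_t>((dst ^ src) ^ 1);
    case CombinationOperator::Replace:
        return src;
    }
    return src;
}

// Combines `src` into `dst` with src's top-left corner at (x, y).
// Only pixels inside `clip` (in dst coordinates) and inside dst are touched.
void composite_sketch(Bits& dst, Bits& src, int x, int y, CombinationOperator op, Rect clip)
{
    int x0 = std::max({ x, clip.x, 0 });
    int y0 = std::max({ y, clip.y, 0 });
    int x1 = std::min({ x + src.width, clip.x + clip.width, dst.width });
    int y1 = std::min({ y + src.height, clip.y + clip.height, dst.height });
    for (int py = y0; py < y1; ++py) {
        for (int px = x0; px < x1; ++px)
            dst.at(px, py) = combine(dst.at(px, py), src.at(px - x, py - y), op);
    }
}
```

With a black background, `Or` with a white source pixel leaves the pixel black while `Replace` turns it white, which is why plain overwriting is not a valid stand-in for `Or` in that case.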
A symbol segment defines a bunch of small bitmaps and associates them
with numeric IDs.
This only implements reading symbols encoded with the arithmetic coder.
It does not support Huffman coding. (In practice, everything seems to
use arithmetic coding.)
Support for refinement or aggregate coding isn't implemented yet.
Support for retaining bitmap coding contexts isn't implemented yet.
Support for symbol segments referring to other symbol segments isn't
implemented yet.
But all produce diagnostics if encountered, so we won't forget about
them. (I haven't seen any of them being used in the wild.)
No visible behavior change yet, but with JBIG2_DEBUG turned on,
it produces all kinds of debug output.
The symbol segment decoding procedure will read generic regions
that aren't at a byte boundary, and that share contexts across
several regions.
No behavior change.
The existing ArithmeticDecoder (from Annex E) reads one bit at a
time.
ArithmeticIntegerDecoder (from Annex A) builds on top of that to
read integer values.
This will be used by both the symbol segment and the text segment
readers.
(This does not yet implement the IAID decoding procedure in A.3.
We only need that one in the text segment decoder at the moment,
and it's pretty small, so I'll put it inline there for now.)
Not used yet, so no behavior change yet.
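To show the layering, here is a sketch of the Annex A.2 flow as I understand it (sign bit, a unary-ish prefix selecting a range, then a fixed number of value bits). The bit source is a hypothetical stand-in for the real arithmetic decoder: it just gets the context index and returns a bit. `BitSource`, `replay_bits()` and `decode_integer()` are made-up names, not the real class interface.

```cpp
#include <cstddef>
#include <cstdint>
#include <functional>
#include <optional>
#include <vector>

// Hypothetical stand-in for the Annex E decoder: returns the next bit,
// given the index of the context to decode it with.
using BitSource = std::function<bool(uint16_t context_index)>;

// Test helper: replays a fixed list of bits and ignores the context.
BitSource replay_bits(std::vector<bool> bits)
{
    size_t position = 0;
    return [bits, position](uint16_t) mutable -> bool { return bits.at(position++); };
}

// Returns the decoded integer, or nullopt for the out-of-band (OOB) value.
std::optional<int64_t> decode_integer(BitSource const& next_bit)
{
    uint16_t prev = 1;
    auto decode_bit = [&]() -> uint32_t {
        bool d = next_bit(prev & 0x1FF);
        // PREV keeps track of the bits decoded so far; it selects the context.
        if (prev < 256)
            prev = static_cast<uint16_t>((prev << 1) | (d ? 1 : 0));
        else
            prev = static_cast<uint16_t>((((prev << 1) | (d ? 1 : 0)) & 511) | 256);
        return d ? 1 : 0;
    };
    auto decode_bits = [&](int count) -> uint64_t {
        uint64_t value = 0;
        for (int i = 0; i < count; ++i)
            value = (value << 1) | decode_bit();
        return value;
    };

    uint32_t sign = decode_bit();
    uint64_t magnitude;
    if (!decode_bit())
        magnitude = decode_bits(2);
    else if (!decode_bit())
        magnitude = decode_bits(4) + 4;
    else if (!decode_bit())
        magnitude = decode_bits(6) + 20;
    else if (!decode_bit())
        magnitude = decode_bits(8) + 84;
    else if (!decode_bit())
        magnitude = decode_bits(12) + 340;
    else
        magnitude = decode_bits(32) + 4436;

    // "Negative zero" is how the out-of-band value is encoded.
    if (sign && magnitude == 0)
        return std::nullopt;
    int64_t value = static_cast<int64_t>(magnitude);
    return sign ? -value : value;
}
```

For example, replaying the bits 0, 0, 1, 1 decodes to 3 (positive sign, shortest prefix, two value bits), and 1, 0, 0, 0 decodes to OOB.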
It seems to do the right thing already, and nothing in the spec says
not to do this as far as I can tell.
With this, we can finally decode
Tests/LibGfx/test-inputs/jbig2/bitmap.jbig2 and add a test for
decoding simple arithmetic-coded images.
This errors out on many special cases. None of those seem to be hit
in practice (with the exception of TPGDON, which is used in a handful
of PDFs. I have an implementation of that locally, but I'll put that
in a separate PR. The code for it is straightforward, but adding a
test for it is a bit involved.)
With this, we can decode about half of the JBIG2 images in my PDF
test dataset.
In practice, everything uses white backgrounds and operators `or`
or `xor` to turn them black, at least for the simple images we're
about to be able to decode.
To make sure we don't forget implementing this for real once needed,
reject other ops, and also reject black backgrounds (because 1 | 0
is 1, not 0 like our overwrite implementation will produce).
This means we have to remove a test, but since this scenario doesn't
seem to happen in practice, that seems ok.
The context can vary for every bit we read.
This does not affect the one use in the test which reuses the same
context for all bits, but it is necessary for future changes.
I think the context normally changes for every bit. But this here
is enough to correctly decode the test bitstream in Annex H.2 in
the spec, which seems like a good checkpoint.
The internals of the decoder use spec naming, to make the code
look virtually identical to what's in the spec. (Even so, I managed
to put in several typos that took a while to track down.)
EXTTEMPLATE=1 was added later and doesn't seem to be used much in
practice -- it doesn't appear in any simple generic regions in any PDF
I tested so far at least. Since the spec contradicts itself on what
to do with these as far as I can tell, error out on them for now and
then add support once we find actual files using this, so that we can
check our implementation actually works.
Deduplicate the data reading for the different cases, and
zero-initialize all adaptive template pixels to make that
possible.
Other than prohibiting EXTTEMPLATE=1, no behavior change.
The memmem() call passes `data.size() - 19 - sizeof(u32)` for big_len
(18 prefix bytes skipped, the flag byte, and the trailing u32), so the
buffer needs to be at least `19 + sizeof(u32)` bytes large for that
subtraction not to underflow.
Should fix https://bugs.chromium.org/p/oss-fuzz/issues/detail?id=67332
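A sketch of the kind of guard this needs. It uses std::search as a portable stand-in for memmem(), and `find_in_payload()` plus the two constants are names made up for this example; only the 19 and the trailing u32 come from the description above.

```cpp
#include <algorithm>
#include <cstddef>
#include <cstdint>
#include <optional>
#include <vector>

// Assumed layout: 18 prefix bytes + 1 flag byte up front, a u32 at the end.
constexpr size_t header_size = 19;
constexpr size_t trailer_size = sizeof(uint32_t);

// Searches for `needle` between the header and the trailer and returns its
// offset from the start of `data`, or nullopt if it isn't there.
std::optional<size_t> find_in_payload(std::vector<uint8_t> const& data, std::vector<uint8_t> const& needle)
{
    // Without this check, data.size() - header_size - trailer_size would
    // wrap around to a huge unsigned value for short inputs.
    if (data.size() < header_size + trailer_size)
        return std::nullopt;

    auto begin = data.begin() + static_cast<std::ptrdiff_t>(header_size);
    auto end = data.end() - static_cast<std::ptrdiff_t>(trailer_size);
    auto it = std::search(begin, end, needle.begin(), needle.end());
    if (it == end)
        return std::nullopt;
    return static_cast<size_t>(it - data.begin());
}
```

The important part is that the size comparison happens before any subtraction on the unsigned size.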
If the "MMR" bit is set, the generic region decoding procedure just
uses ITU-T T.6 (2D CCITT), which we already have an implementation of.
In practice, this is used almost never in .jbig2 files and in none of
the PDFs I have.
The two files that do use MMR are:
1.) JBIG2_ConformanceData-A20180829/F01_200_TT9.jb2
2.) 042_3.jb2 from the ghostpdl tests
The first uses an immediate _lossless_ generic region, which we don't
implement yet (but I think it should just forward to the normal
immediate generic region code? Not in this commit, though). The second
uses a regular immediate generic region, and we can decode it now:
Build/lagom/bin/image -o out.png \
path/to/ghostpdl/tests/jbig2/042_3.jb2
With this, `image` can convert any jbig2 file, as long as it's
black (or white), and LibPDF can draw jbig2 files (again, as long
as they only contain a single color stored in just a
PageInformation segment).
Also make scan_for_page_size() not early return, so that it has the
same behavior as the main decoding loop. (Multiple page information
segments for a single page are likely invalid and don't happen in
practice, so this is mostly an academic change.)
Add a BitBuffer class to store the bit image data, and introduce a
Page struct for storing data associated with a page. We currently
only handle a single page, but a) this makes it easier to decode
multiple pages in the future if we want, and b) it makes the code easier
to understand.
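For reference, a bit buffer of this kind can be as small as the sketch below. This is not the actual class: `PackedBitBuffer` is a made-up name, and the layout (rows padded to whole bytes, most significant bit first) is an assumption of this example.

```cpp
#include <cstddef>
#include <cstdint>
#include <vector>

// Hypothetical packed 1-bit-per-pixel buffer. Each row starts on a byte
// boundary; within a byte, the leftmost pixel is the most significant bit.
class PackedBitBuffer {
public:
    PackedBitBuffer(size_t width, size_t height)
        : m_width(width)
        , m_height(height)
        , m_pitch((width + 7) / 8)
        , m_bits(m_pitch * height, 0)
    {
    }

    size_t width() const { return m_width; }
    size_t height() const { return m_height; }
    size_t pitch() const { return m_pitch; } // bytes per row

    bool get_bit(size_t x, size_t y) const
    {
        uint8_t byte = m_bits[y * m_pitch + x / 8];
        return (byte >> (7 - x % 8)) & 1;
    }

    void set_bit(size_t x, size_t y, bool value)
    {
        uint8_t mask = static_cast<uint8_t>(1u << (7 - x % 8));
        uint8_t& byte = m_bits[y * m_pitch + x / 8];
        if (value)
            byte = static_cast<uint8_t>(byte | mask);
        else
            byte = static_cast<uint8_t>(byte & ~mask);
    }

    // Sets every pixel (including row padding) to `value`.
    void fill(bool value)
    {
        for (auto& byte : m_bits)
            byte = value ? 0xFF : 0x00;
    }

private:
    // Declaration order matters: m_pitch must be initialized before m_bits.
    size_t m_width;
    size_t m_height;
    size_t m_pitch;
    std::vector<uint8_t> m_bits;
};
```

A 10 pixel wide buffer ends up with a pitch of 2 bytes, so 6 bits per row are padding.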
7.4.8.2 Page bitmap height:
"In some cases, this value may not be known at the time that the page
information segment is written. In this case, this field must contain
0xFFFFFFFF, and the actual page height may be communicated later, once
it is known."
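A small sketch of how a reader might surface that sentinel. `PageSize` and `read_page_size()` are hypothetical names; the sketch assumes `bytes` points at the page bitmap width and height fields, stored as two big-endian u32s with the width first.

```cpp
#include <cstdint>
#include <optional>

// Reads a big-endian u32 from four bytes.
static uint32_t read_u32_be(uint8_t const* bytes)
{
    return (static_cast<uint32_t>(bytes[0]) << 24) | (static_cast<uint32_t>(bytes[1]) << 16)
        | (static_cast<uint32_t>(bytes[2]) << 8) | static_cast<uint32_t>(bytes[3]);
}

struct PageSize {
    uint32_t width;
    std::optional<uint32_t> height; // empty if the height is not known yet
};

// `bytes` must point at at least 8 readable bytes: width, then height.
PageSize read_page_size(uint8_t const* bytes)
{
    PageSize size;
    size.width = read_u32_be(bytes);
    uint32_t raw_height = read_u32_be(bytes + 4);
    if (raw_height != 0xFFFFFFFF)
        size.height = raw_height;
    return size;
}
```

Using an optional keeps callers from accidentally treating 0xFFFFFFFF as a real height, for example when allocating the page bitmap.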
Sounds like the spec guarantees that that's the number of the first
page.
(In practice, all but one of the jbig2 files I've found contain just
page 1. PDFs almost always contain just page 1, and very rarely a
page 0 for globally shared parameters.)
This allows `file` to correctly print the dimensions of a .jbig2 file,
and it allows us to write a test that covers much of the code
written so far.
Several ramifications:
* /JBIG2Globals is an indirect reference, which means we now need
a Document for unfiltering. (Technically, other decode parameters
can also be indirect objects and we should use the Document to
resolve() those too, but in practice it only seems to be needed
for /JBIG2Globals.)
* Since /JBIG2Globals are so rare, we just parse once for each
image that uses them, and decode_embedded() now receives a
Vector<ReadonlyBytes> with all sections of sequences of
segments.
* Internally, decode_segment_headers() is now called several times
for embedded JBIG2s with multiple such sections (e.g. PDFs with
/JBIG2Globals).
* That means `data` is now no longer part of JBIG2LoadingContext
and things get slightly reshuffled due to this.
This completes the LibPDF part of JBIG2 support. Once LibGfx
implements actual decoding of JBIG2s, things should start to
Just Work in PDFs.
They're in different places for Sequential/Embedded (right after
the header) and RandomAccess (which has all headers first, followed
by all data bits next).
We don't do anything with the data yet, but now everything's in
place to actually process segment data.
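The difference between the two layouts, as a sketch. `SegmentHeader` here is a simplified, hypothetical struct, offsets are relative to the first segment header, and segments with an unknown data length are ignored.

```cpp
#include <cstddef>
#include <vector>

enum class Organization { Sequential, Embedded, RandomAccess };

// Simplified stand-in for a parsed segment header.
struct SegmentHeader {
    size_t header_size; // bytes taken up by this segment's header
    size_t data_length; // bytes of data belonging to this segment
};

// Returns, for each segment, the offset at which its data starts.
std::vector<size_t> compute_data_offsets(std::vector<SegmentHeader> const& headers, Organization organization)
{
    std::vector<size_t> offsets;
    size_t offset = 0;
    if (organization == Organization::RandomAccess) {
        // All headers come first, followed by all data in the same order.
        for (auto const& header : headers)
            offset += header.header_size;
        for (auto const& header : headers) {
            offsets.push_back(offset);
            offset += header.data_length;
        }
    } else {
        // Sequential and Embedded: each header is directly followed by its data.
        for (auto const& header : headers) {
            offset += header.header_size;
            offsets.push_back(offset);
            offset += header.data_length;
        }
    }
    return offsets;
}
```

Either way, all headers have to be parsed before the data offsets are known, since segment headers vary in size.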