When upsampling e.g. the 4-bit value 0b1101 to 8-bit, we used to repeat
the value to fill the full 8-bits, e.g. 0b11011101. This maps RGB colors
to 8-bit nicely, but is the wrong thing to do for palette indices.
Stop doing this for palette indices.
Fixes "Indexed color space index out of range" for 11 files in the
PDF/A 0000.zip test set now that we correctly handle palette indices
as of the previous commit:
Malformed PDF file: Indexed color space lookup table doesn't match
size, in 4 files, on 8 pages, 73 times
path/to/0000/0000206.pdf 2 4 (2x) 5 (3x) 6 (4x)
path/to/0000/0000364.pdf 5 6
path/to/0000/0000918.pdf 5
path/to/0000/0000683.pdf 8
Page 1 of 0000277.pdf does:
BT 22 0 0 22 59 28 Tm /TT2 1 Tf
(Presented at Photonics West OPTO, February 17, 2016) Tj ET
BT 32 0 0 32 269 426 Tm /TT1 1 Tf
(Robert W. Boyd) Tj ET
BT 22 0 0 22 253 357 Tm /TT2 1 Tf
(Department of Physics and) Tj ET
BT 22 0 0 22 105 326 Tm /TT2 1 Tf
(Max-Planck Centre for Extreme and Quantum Photonics) Tj ET
Every line begins a text operation, then updates the font matrix,
selects a font (TT2, TT1, TT2, TT1), draws some text and ends the text
operation.
`Tm` (which sets the font matrix) contains a scale, and uses that
to update the font size of the currently-active font (cf #20084).
But in this file, we `Tm` first and `Tf` (font selection) second,
so this updates the size of the old font. So when we pull it out
of the cache again on line 3, it would still have the old size
from the `Tm` on line 2.
(The whole text scaling logic in LibPDF imho needs a rethink; the
current approach also causes issues with zero-width glyphs which
currently lead to divisions by zero. But that's for another PR.)
Fixes another regression from c8510b58a3 (which I've accidentally
referred to by 2340e834cd in another commit).
SMasks are greyscale images that get used as alpha channel for a
different image.
JPEGs in PDFs are stored as streams with /DCTDecode filters, and
we have a separate code path for loading those in the PDF renderer.
That code path just calls our JPEG decoder, which creates bitmaps
with format BGRx8888.
So when we process an SMask for such a bitmap, we have to change
the bitmap's format to BGRA8888 in addition to setting alpha values
on all pixels.
This is a very inefficient implementation: Every time a type 3 font
glyph is drawn, we parse its operator stream and execute all the
operators therein.
We'll want to instead cache the glyphs in bitmaps (at least in most
cases), like we do for other fonts. But it's a good first step, and
all the coordinate math seems to work in the files I've tested.
Good test files from pdfa dataset 0000.zip:
- 0000559.pdf page 1 (and 2): Has a non-default font matrix;
text appears mirrored if the font matrix isn't handled correctly
- 0000425.pdf, page 1: Draws several glyphs in a single run;
glyphs overlap if Renderer::render_type3_glyph() ignores the
passed-in point
- 0000211.pdf, any page: Uses type 3 glyphs for all text.
Good perf test (already "reasonably fast")
- 0000521.pdf, page 5 (or 7 or or 16): The little red flag in the
purple box is a type 3 font glyph, and it's colored (which in part
means the first operator is `d0`, while all the other documents above
use `d1`)
Type 3 font glyphs begin with either `d0` or `d1`. If we bail out
with an "unsupported" error on the very first operator in a glyph,
we'll never paint the glyph.
Just stub these out for now. We probably want to do more in here in
the future (see "TABLE 5.10 Type 3 font operators" in the 1.7 spec).
It's a bit unfortunate that fonts need to know about the renderer,
but type 3 fonts contain PDF drawing operators, so it's necessary.
On the bright side, it makes it possible to pass fewer parameters
around and compute things locally as needed.
(As we implement more fonts, we'll probably want to create some
functions to do these computations in a central place, eventually.)
No behavior change.
In the main page contents, /T0 might refer to a different font than
it might refer to in an XObject. So don't use the `Tf` argument as
font cache key. Instead, use the address of the font dictionary object.
Fixes false cache sharing, and also allows us to share cache entries
if the same font dict is referred to by two different names.
Fixes a regression from 2340e834cd (but keeps the speed-up intact).
TJ acts on a list of either strings or numbers.
The strings are drawn, and the numbers are treated as offsets.
Previously, we'd only apply the last-seen number as offset when
we saw a string. That had the effect of us ignoring all but the
last number in front of a string, and ignoring numbers at the
end of the list.
Now, we apply all numbers as offsets.
Our rendering of Tests/LibPDF/text.pdf now matches other PDF viewers.
Images can have multiple filters, each one of them is processed
sequentially. Only the last one will be relevant for the image format
(DCT or JPXDecode), so use the last filter instead of the first one to
detect that property.
0000101.pdf from 0000.zip from the pdfa dataset has /Height set to
an indirect object that contains an int.
Make that work, and make sure various other similar places getting
values of the image dict also resolve indirect references.
If a PDF uses `/CustomName cs` and `/CustomName` then points at just a
name like `/DeviceGray` instead of an array, that's ok. Just using
`/DeviceGray cs` is simpler, so this extra level of indirection is
somewhat rare in practice, but it's valid and it does happen. So support
it.
We already have a helper that does the right thing that we just need to
call.
Together with #21524 and #21525, reduces number of crashes on 300 random
PDFs from the web (the first 300 from 0000.zip from
https://pdfa.org/new-large-scale-pdf-corpus-now-publicly-available/)
from 29 (9%) to 25 (8%).
It used to be called ColorSpaceFamily::never_needs_parameters().
But in the cpp file, the macro arg was called ever_needs_parameters,
and the spec says
"If the color space is one that can be specified by a name and no
additional parameters (DeviceGray, DeviceRGB, DeviceCMYK, and certain
cases of Pattern), the name may be specified directly."
so let's use that language here.
No behavior change.
Implements the `ri` operator, and the `RI` key in a graphics state
dictionary.
We don't do anything yet with the color rendering intent except
store it.
No behavior change except removing a few "not yet implemented"
messages.
Follow-up to #21489. There, I made us use a RAII object.
That's great, but if the embedded instruction stream pushes
its own graphics state, then an early return would cause us to
not process graphics state pop instructions in the embedded stream.
To fix this, remember the graphics stack depth before entering
the nested instruction stream, and explicitly shrink the stack back
to that size upon exit.
Enables us to render all pages of
https://devstreaming-cdn.apple.com/videos/wwdc/2017/821kjtggolzxsv/821/821_get_started_with_display_p3.pdf
without crashing.
Previously, if one operator returned an error, the TRY() would cause
us to return without restoring the outer graphics state, leading to
problems such as handing a 3-tuple to a grayscale color space
(because the inner object set up a grayscale color space that we
failed to dispose of).
Makes us crash later on page 43 of
https://devstreaming-cdn.apple.com/videos/wwdc/2017/821kjtggolzxsv/821/821_get_started_with_display_p3.pdf
Previously, every time a page switched fonts, we'd completely
re-parse the font.
Now, we cache fonts in Renderer, effectively caching them per page.
It'd be nice to have an LRU cache across pages too, but that's a
bigger change, and this already helps a lot.
Font size is part of the cache key, which means we re-parse the same
font at different font sizes. That could be better too, but again,
it's a big help as-is already.
Takes rendering the 1310 pages of the PDF 1.7 reference with
Build/lagom/bin/pdf --debugging-stats \
~/Downloads/pdf_reference_1-7.pdf
from 71 s to 11s :^)
Going through pages especially in the index is noticeably snappier.
(On the PDF 2.0 spec, ISO_32000-2-2020_sponsored.pdf, it's less
dramatic: From 19s to 16s.)
We now track it in the graphics state. It isn't used for anything yet.
Fixes the one thing that rendering the first 100 pages of
pdf_reference_1-7.pdf complains about.
This class had slightly confusing semantics and the added weirdness
doesn't seem worth it just so we can say "." instead of "->" when
iterating over a vector of NNRPs.
This patch replaces NonnullRefPtrVector<T> with Vector<NNRP<T>>.
The PDFFont class hierarchy was very simple (a top-level PDFFont class,
followed by all the children classes that derived directly from it).
While this design was good enough for some things, it didn't correctly
model the actual organization of font types:
* PDF fonts are first divided between "simple" and "composite" fonts.
The latter is the Type0 font, while the rest are all simple.
* PDF fonts yield a glyph per "character code". Simple fonts char codes
are always 1 byte long, while Type0 char codes are of variable size.
To this effect, this commit changes the hierarchy of Font classes,
introducing a new SimpleFont class, deriving from PDFFont, and acting as
the parent of Type1Font and TrueTypeFont, while Type0 still derives from
PDFFont directly. This distinction allows us now to:
* Model string rendering differently from simple and composite fonts:
PDFFont now offers a generic draw_string method that takes a whole
string to be rendered instead of a single char code. SimpleFont
implements this as a loop over individual bytes of the string, with
T1 and TT implementing draw_glyph for drawing a single char code.
* Some common fields between T1 and TT fonts now live under SimpleFont
instead of under PDFfont, where they previously resided.
* Some other interfaces specific to SimpleFont have been cleaned up,
with u16/u32 not appearing on these classes (or in PDFFont) anymore.
* Type0Font's rendering still remains unimplemented.
As part of this exercise I also took the chance to perform the following
cleanups and restructurings:
* Refactored the creation and initialisation of fonts. They are all
centrally created at PDFFont::create, with a virtual "initialize"
method that allows them to initialise their inner members in the
correct order (parent first, child later) after creation.
* Removed duplicated code.
* Cleaned up some public interfaces: receive const refs, removed
unnecessary ctro/dtors, etc.
* Slightly changed how Type1 and TrueType fonts are implemented: if
there's an embedded font that takes priority, otherwise we always
look for a replacement.
* This means we don't do anything special for the standard fonts. The
only behavior previously associated to standard fonts was choosing an
encoding, and even that was under questioning.
Errors can (and do) occur when trying to render text, and so far we've
silently ignored them, making us think that all is well when it isn't.
Letting show_text return errors will allow us to inform the user about
these errors instead of having to hiding them.
While the clipping logic was correct (current v/s new clipping path),
the clipping path contents weren't. This commit fixed that.
We calculate the clipping path in two places: when we set it to be the
whole page at graphics state creation time, and when we perform clipping
path intersection to calculate a new clipping path. The clipping path is
then used to limit painting by passing it to the painter (more
precisely, but passing its bounding box to the painter, as the latter
doesn't support arbitrary path clipping). For this last point the
clipping path must be in device coordinates.
There was however a mix of coordinate systems involved in the creation,
update and usage of the clipping path:
* The initial values of the path (i.e., the whole page) were in user
coordinates.
* Clipping path intersection was performed against m_current_path,
which is in device coordinates.
* To perform the clipping operation, the current clipping path was
assumed to be in user coordinates.
This mix resulted in the clipping not working correctly depending on the
zoom level at which one visualised a page.
This commit fixes the issue by always keeping track of the clipping path
in device coordinates. This means that the initial full-page contents
are now converted to device coordinates before putting them in the
graphics state, and that no mapping is performed when applied the
clipping to the painter.