Commit graph

667 commits

Author SHA1 Message Date
Nico Weber
0374c1eb3b LibPDF: Handle indirect reference resolving during parsing more robustly
If `Document::resolve()` was called during parsing, it'd change the
reader's current position, so the parsing code that called it would
then end up at an unexpected position in the file.

Parser.cpp already had special-case recovery when a stream's length
was stored in an indirect reference.

Commit ead02da98ac70c ("/JBIG2Globals") in #23503 added another case
where we could resolve indirect reference during parsing, but wasn't
aware of having to save and restore the reader position for that.

Put the save/restore code in `DocumentParser::parse_object_with_index`
instead, right before the place that ultimately changes the reader's
position during `Document::resolve`. This fixes `/JBIG2Globals` and
lets us remove the special-case code for `/Length` handling.

Since this is kind of subtle, include a test.
2024-03-19 19:20:01 -04:00
Nico Weber
495aaa295c LibPDF: Add some logging behind PDF_DEBUG
I've added these two lines a bunch of times by now. Let's check
them in. If they turn out to be annoying, we can remove them again.
2024-03-19 19:20:01 -04:00
Lucas CHOLLET
1e023a589d LibPDF: Plug in the CCITT3 1D decoder and pass corresponding options 2024-03-19 12:22:28 +01:00
MacDue
8057542dea LibGfx: Simplify path storage and tidy up APIs
Rather than make path segments virtual and refcounted let's store
`Gfx::Path`s as a list of `FloatPoints` and a separate list of commands.

This reduces the size of paths, for example, a `MoveTo` goes from 24
bytes to 9 bytes (one point + a single byte command), and removes a
layer of indirection when accessing segments. A nice little bonus is
transforming a path can now be done by applying the transform to all
points in the path (without looking at the commands).

Alongside this there's been a few minor API changes:

- `path.segments()` has been removed
  * All current uses could be replaced by a new `path.is_empty()` API
  * There's also now an iterator for looping over `Gfx::Path` segments
- `path.add_path(other_path)` has been removed
  * This was a duplicate of `path.append_path(other_path)`
- `path.ensure_subpath(point)` has been removed
  * Had one use and is equivalent to an `is_empty()` check + `move_to()`
- `path.close()` and `path.close_all_subpaths()` assume an implicit
  `moveto 0,0` if there's no `moveto` at the start of a path (for
  consistency with `path.segmentize_path()`).

Only the last point could change behaviour (though in LibWeb/SVGs all
paths start with a `moveto` as per the spec, it's only possible to
construct a path without a starting `moveto` via LibGfx APIs).
2024-03-18 07:09:37 +01:00
Nico Weber
21917e7b1e LibPDF+PDFViewer+MacPDF: Don't draw hidden text by default
Text can be rendered in various ways in PDFs: Filled, stroked,
both filled and stroked, set as clipping path, hidden, or
some combinations thereof.

We don't implement any of this at the moment except "filled".

Hidden text is used in scanned documents: The image of the scan is
drawn in the background, and then OCRd text is "drawn" as hidden
on top of the scanned bitmap. That way, the (hidden) text can be
selected and copied, and it looks like you're selecting text from
the scanned bitmap. Find-in-page also works similarly. (We currently
have neither text selection nor find-in-page, but one day we will.)

Now that we have pretty good support for CCITT and are growing some
support for JBIG2, we now draw both the scanned background image
as well as the foreground text. They're not always perfectly aligned.

This change makes it so that we don't render text that's marked as
hidden. (We still do most of the coordinate math, which will probably
come in handy at some point when we implement text selection.)

This makes these scanned documents appear as they're supposed to
appear (at least in documents where we manage to decode the background
bitmap).

This also adds a debug option to force rendering of hidden text.
2024-03-16 13:10:48 -04:00
Nico Weber
be9a6caa0a LibPDF: In Filter::decode_jbig2(), invert bits
See included comment.
2024-03-10 10:10:55 -04:00
Nico Weber
1eaaa8c3e9 LibPDF+LibGfx: Support JBIG2s with /JBIG2Globals set
Several ramifications:

* /JBIG2Globals is an indirect reference, which means we now need
  a Document for unfiltering. (Technically, other decode parameters
  can also be indirect objects and we should use the Document to
  resolve() those too, but in practice it only seems to be needed
  for /JBIG2Globals.)

* Since /JBIG2Globals are so rare, we just parse once for each
  image that use them, and decode_embedded() now receives a
  Vector<ReadonlyBytes> with all sections of sequences of
  segments.

* Internally, decode_segment_headers() is now called several times
  for embedded JBIG2s with multiple such sections (e.g. PDFs with
  /JBIG2Globals).

* That means `data` is now no longer part of JBIG2LoadingContext
  and things get slightly reshuffled due to this.

This completes the LibPDF part of JBIG2 support. Once LibGfx
implements actual decoding of JBIG2s, things should start to
Just Work in PDFs.
2024-03-09 16:01:22 +01:00
Nico Weber
953f6c5d9b LibPDF+LibGfx: Pass jbig2-filtered data to JBIG2ImageDecoderPlugin
Except for /JBIG2Globals, which we bail out on for now. In my 1000
files, 13 use JBIG2, and of those, 2 use JBIG2Globals (0000372.pdf e.g.
page 11 and 0000857.pdf e.g. page 1), and only one (the latter) of the
two uses the same JBIG2Globals stream for more than a single image.

JBIG2ImageDecoderPlugin cannot decode the data yet, so no behavior
change, but with `#define JBIG2_DEBUG 1` at the top of that file,
it now prints segment header info for PDFs containing JBIG2 data :^)
2024-03-09 16:01:22 +01:00
Nico Weber
24951a039e LibPDF: Clip stroke for B / B* operators
Fixes pages 17-19 on
https://www.iro.umontreal.ca/~feeley/papers/ChevalierBoisvertFeeleyECOOP15.pdf

Calling the fill handler after painting the stroke as previously doesn't
work, since we need to set up the clip before both stroke and fill, and
unset it after both. The duplication is a bit unfortunate, but also
minor.
2024-03-08 10:45:28 -05:00
Nico Weber
3a39939995 LibPDF: Make truetype fonts marked as symbol fonts actually work
Turns out the spec didn't mean that the whole range is populated,
but that one of these ranges is populated. So take the argmax.

As fallout, explicitly mark the Liberation fonts as nonsymbolic
when we use them for the 14 standard fonts. Else, we'd regress
"PostScrõpt", since the Liberation fonts would otherwise go down
the "is symbolic or doesn't have explicit encoding" codepath,
since the standard fonts usually don't have an explicit encoding.

As a fallout from _that_, since the 14 standard fonts now go down
the regular truetype rendering path, and since we don't implement
lookup by postscript name yet, glyphs not present in Liberation
now cause text to stop rendering with a diag, instead of rendering
a "glyph not found" symbol. That isn't super common, only an
additional 4 files appear for the "'post' table not yet implemented"
diag. Since we'll implement that soon, this seems fine until then.
2024-03-07 11:29:47 -05:00
Lucas CHOLLET
be5e7a360f LibGfx/CCITT: Add support for images with an unknown number of lines 2024-03-07 11:07:20 -05:00
Kyle Lanmon
a099d0e140 PDFViewer: Hide the rendering diagnostics window by default
You can enable it in the Debug menu and we will remember your choice.
2024-03-04 10:43:41 +01:00
Nico Weber
c6b484a728 LibPDF: Make image creation in Renderer::load_image() fallible 2024-03-03 11:18:37 -05:00
Nico Weber
9bff8abcc7 LibPDF: Add support for array image masks
An array image mask contains a min/max range for each channel,
and if each channel of a given pixel is in that channel's range,
that pixel is masked out (i.e. transparent). (It's similar to
having a single color or palette index be transparent, but it
supports a range of transparent colors if desired.)

What makes this a bit awkward is that the range is relative to the
origin bits per pixel and the inputs to the image's color space.

So an indexed (palettized) image with 4bpp has a 2-element mask
array where both entries are between 0 and 15.

We currently apply masks after converting images to a Gfx::Bitmap,
that is after converting to 8bpp sRGB. And we do this by mapping
everything to 8bpp very early on in load_image().

This leaves us with a bunch of options that are all a bit awkward:

1. Make load_image() store the up- (or for 16bpp inputs, down-)
   sampled-to-8bpp pixel data. And also return if we expanded the
   pixel range while resampling (for color values) or not (for
   palettized images). Then, when applying the image filter,
   resample the array bounds in exactly the same way. This requires
   passing around more stuff.

2. Like 1, but pass in the mask array to load_image() and apply
   the mask right there and then. This means we'd apply mask arrays
   at a different time than other masks.

3. Make the function that computes the mask from the mask array
   work from the original, unprocessed image data. This is the most
   local change, but probably also requires the largest amount of
   code (in return, the color mask for 16bpp images is precise, in
   addition that it separates concerns the most nicely).

This goes with 3 for now.
2024-03-03 11:18:37 -05:00
Nico Weber
98e272ce15 LibPDF: Silently ignore BX / EX operators
See the added comment for reasoning.
2024-03-02 17:43:53 -05:00
Nico Weber
9e502dcfe4 LibPDF: Honor writing mode in TJ operator as well 2024-03-02 12:25:09 +01:00
Nico Weber
c69797fda9 LibPDF: Implement support for vertical text for Type0
For Identity-V only for now.
2024-03-02 12:25:09 +01:00
Nico Weber
6348a857ea LibPDF: Prepare for more encodings than just Identity-H in Type0 code
Introduces CIDIterator, an iterator type for iterating over CIDs.

Also introduces Type0CMap which can return a CIDIterator given some
bytes.

The existing code of treating the bytes as an identity map of
big-endian u16s is now implemented in IdentityType0CMap.

No behavior change.
2024-03-02 12:25:09 +01:00
Nico Weber
b9a4689af3 LibPDF: In Type0Font, read metrics /DW2 and /W2 for vertical text
Not used for anything yet.
2024-03-02 12:25:09 +01:00
Nico Weber
ef5d7b685d LibPDF: In Type0::initialize(), move variable increment next to cause
No behavior change.
2024-03-02 12:25:09 +01:00
Nico Weber
fc9b2440bd LibPDF: Add some spec comments in Type0Font::initialize() 2024-03-02 12:25:09 +01:00
Nico Weber
004e47df88 LibPDF: Remove minor duplication in Renderer::text_show_string_array()
This "regressed" in #10080 (back then, the branches were smaller).

No behavior change.
2024-03-02 12:25:09 +01:00
Hendiadyoin1
773a280bdf LibPDF: Use a struct for the subsection in parse_xref_stream 2024-03-01 14:05:53 -07:00
Hendiadyoin1
fe0fde2154 Userland+Tests: Remove unused <AK/Tuple.h> includes 2024-03-01 14:05:53 -07:00
Nico Weber
c3980eda9e LibPDF: Give Type0 CIDFontType2 a ScaledFont instead of a Font
...with the same reasoning as the previous commit.

No behavior change.
2024-03-01 17:56:59 +01:00
Nico Weber
f374ad50a1 LibPDF: Give TrueTypePainter a ScaledFont instead of a Font
This will allow us to get at the font's glyphs as paths, which will
eventually enable us to implement glyph rotation. We'll have to do our
own caching then, but we can then hopefully share the caching across the
Type0 / Type1 / TrueType codepaths.

It also gives us access to a font's glyphs by glyph id, which will help
us implementing looking up glyph ids by postscript name.  (Else we'd
have to plumb through a whole Painter::draw_glyph_by_postscript_name()
API just for LibPDF.)

No behavior change.
2024-03-01 17:56:59 +01:00
Nico Weber
5dad8b693e LibPDF: Make PDFFont::replacement_for() return a ScaledFont
We only want to load non-bitmap fallback fonts as PDF fallback fonts,
so let's make the return type represent that.

No behavior change.
2024-03-01 17:56:59 +01:00
Nico Weber
2bbdfe0fba LibPDF: Treat "Oblique" as italic indicator
The standard 14 fonts include e.g. "CourierBoldOblique" and
"HelveticaOblique". Let's map them to italic fonts :^)
2024-03-01 14:17:42 +01:00
Nico Weber
8e3c54f203 LibPDF: Implement ZapfDingbats clause of the adobe glphy list algorithm
Liberation Sans still doesn't have the vast majority of the
Zapf Dingbats glyphs, but now we map the Zapf Dingbats names to good
unicode values.  So we only need to use a different font and all should
work.  (And Liberation Sans has _some_ of the glyphs, like 13 of the
223.) And we now render empty squares instead of wrong glyphs for the
ones we don't have.

I haven't seen any PDFs using ZapfDingbats in the wild, but they
probably exist somewhere.
(Tests/LibPDF/standard-14-fonts.pdf is a synthetic PDF using it.)
2024-03-01 14:17:42 +01:00
Nico Weber
2eb099aabe LibPDF: Implement some of the AdobeGlyphList algorithm
Turns out there's a spec that goes with the table.

The big change here is that we can now map `uni1234` to 0x1234 and
`u123456` to 0x123456.

The parts where we split a name on `_` and map each component
and the part where we're supposed to allow multiple groups of 4
after `uni` aren't implemented yet.

The ZapfDingbats lookup is also still missing.

I haven't seen this have an effect in practice, but it's easy to
construct a PDF with a custom encoding where it would make a
difference.
2024-03-01 14:17:42 +01:00
Nico Weber
9aa31157d5 LibPDF: Use right encoding for standard fonts Symbol and ZapfDingbats
We use Liberation Sans for the actual glyph for these, and that's
missing some (Symbol) / all (ZapfDingbats) of the glyphs we need
for these two standard fonts (...or at least the mapping from
name to glyph, not sure). But still, better rendering squares than
completely incorrect glpyhs.

Our code deciding what to do when a value isn't found in an encoding,
or when the name doesn't map to a glpyh, also needs work, but that's
mostly independent of this change. I think this is a nice small
standalone progression.
2024-02-27 17:42:08 -05:00
Nico Weber
76105d5d7f LibPDF: Resize images to the larger of image and mask dimensions
Makes text show up on 0000646.pdf pages 87-92, which for some reason
renders all text using 2x2 images with huge masks that contain
rendered text outlines.
2024-02-27 17:39:13 -05:00
Nico Weber
472bc367d3 LibPDF: Do not have redundant variables for image size
This way, the size of the bitmap cannot become out of sync with these
variables.

No behavior change.
2024-02-27 17:39:13 -05:00
Nico Weber
83d29b3e45 LibPDF: Hack around a FIXME in TrueTypePainter::get_glyph_width()
This will need further thought once we implement support for the
truetype 'post' table, but for now it's correct most of the time,
and better than not doing it.
2024-02-27 07:02:27 +01:00
Nico Weber
448eaa2966 LibPDF: Let Type1Font use TrueTypePainter for standard fonts
...and for fallback fonts too.

We use Liberation Sans (a truetype font) for standard and fallback
fonts. So we should use the standard PDF algorithm for mapping bytes
to truetype glyphs. TrueTypePainter knows how to do this.

Makes the "fi" ligature in the title on page 1 of 5014.CIDFont_Spec.pdf
or the dotless-i in the title of page 2 of ThinkingInPostScript.pdf
show up. They use Helvetica and TImes, and Helvetica and Symbol
respecitively (with -Bold variants).
2024-02-27 07:02:27 +01:00
Nico Weber
86a7753d65 LibPDF: Move TrueType painting into a new class
No behavior change.
2024-02-27 07:02:27 +01:00
Nico Weber
84d1e3956f LibPDF: Make truetype ascent adjustment more local
It's only used in this function.

No behavior change.
2024-02-27 07:02:27 +01:00
Nico Weber
03fab7089a LibPDF+PDFViewer: Extract Renderer::apply_page_rotation()
No behavior change.
2024-02-27 07:02:02 +01:00
Nico Weber
cafaaa0e76 LibPDF: Don't crash on zero-width characters in type1 fonts
Since ScaledFont bakes the size of the font into the font type, we
do the same for Type1 fonts, and then have to divide by the font height
when figuring out what to scale by. For a target width of 0, chances are
the source width is also 0, and we end up with NaN due to dividing
0 by 0. This then triggered the `VERIFY(isfinite(error))` in
can_approximate_bezier_curve() in Painter.cpp.

Check for this case and scale by 0 instead of dividing.

It could happen that the denominator is 0 without the numerator being 0,
but it's not clear what that's supposed to mean. In this case we'd end
up with +inf/-inf, which would also trigger the assert. I haven't seen
this case in practice, so let's not worry about that for now.

(A nicer longer-term fix is probably to make LibPDF use VectorFont
instead of ScaledFont, so that we don't have to bake the font size into
the font type. Then we won't need this division at all. In the meantime,
this fixes the crash.)

Fixes a crash on page 66 of
https://developer.apple.com/library/archive/documentation/mac/pdf/Text.pdf

Fixes a crash on page 37 of
https://open-std.org/jtc1/sc22/wg14/www/docs/n3220.pdf

Fixes crashes in `0000310.pdf`, `0000430.pdf`, `0000229.pdf`.

Brings down the number of crashes on my 1000 file test set from
5 with 3 distinct stacks to 2 with 1 distinct stack.

(The number went up from 3 crashes with 2 distinct stacks to 5/3 when we
started rendering much more text when Type0 font support was added.
This fixes the crashes we had before Type0 support.)
2024-02-27 07:01:05 +01:00
Nico Weber
83128d093e LibPDF: Implement most of the spec algorithm for picking TrueType glyphs
Non-CID-keyed fonts in PDFs have 8-bit codepoints which are mapped from
bytes to character names via encoding.

TrueType fonts don't index glyphs by name (Type1 fonts do), so the fix
(codified in the spec) was to make a list of all possible glyph names
and map those to (16-bit) unicode values, and then pass those into the
truetype cmap.

(As a fallback, we're supposed to look at the optional names in the
font's "post" table. That part isn't implemented here yet.)

(Note that this affects the behavior of fallback fonts for TrueType
fonts, but not yet fallback fonts for Type1 fonts, and neither the
behavior of the 14 built-in Type1 fonts (which we implement as
fallback fonts), since the TrueType fallback in Type1Font.cpp does
not use this algorithm yet. This will be fixed in a future patch.)
2024-02-25 15:15:20 +01:00
Nico Weber
207717982c LibPDF: Read /Flags off font descriptors 2024-02-25 15:15:20 +01:00
Lucas CHOLLET
cb03ab4a5a LibPDF: Handle the BlackIs1 parameter of the CCITTFaxDecode Filter 2024-02-24 16:24:45 -07:00
Lucas CHOLLET
6b3bab5c8a LibPDF: Plug in the CCITTFaxDecode filter to our CCITT decoder
We only call the decoder for Group 4 images. We do support Group 3
images, but let's wait to find a PDF with these before adding support.
2024-02-24 16:24:45 -07:00
Nico Weber
b258ba2767 LibPDF: Use decode_hex_digit() more
For `:#xx` in names, we now also handle lower-case hex digits.
The spec is silent on the case of these hex digits.
Our previous check (isxdigit(), and now is_ascii_hex_digit()) lets
through lower-case hex digits, so it seems better to handle them
rather than computing e.g. `'a' - 'A' + 10` (== 42 -- off by 32!).
I don't know if this has any visible effect on any files, but it's
more correct, and less code, and the code looks more like the code
in Filter::decode_ascii_hex().
2024-02-23 12:11:25 -05:00
Nico Weber
783b1d1c11 LibPDF: Use is_ascii_hex_digit() instead of isxdigit()
See description of #7684 for motivation.

Also, makes this code look more like the hex code in
Filter::decode_ascii_hex().

No behavior change.
2024-02-23 12:11:25 -05:00
Nico Weber
c9234f35f1 LibPDF/CFF: Clear stack after "endchar" commands
Both type 1 and type 2 spec tell us to do this.

I haven't observed a difference from this, but I noticed it in the
spec while I was touching this code. Probably good to do what the
spec tells us to do.
2024-02-22 06:59:28 +01:00
Nico Weber
020c00ede2 LibPDF/CFF: Use offset in accented_character() data
Without this, the dieresis above an a is all the way to the left
instead of over the letter.
2024-02-22 06:59:28 +01:00
Nico Weber
12859dfde5 LibPDF/CFF: Treat endchar in type 2 as type 2 "seac" when requested
With this, a character can be defined that uses two existing glyphs.
This is useful for umlauts and the like, which then just need to
reference e.g. the glyphs named "a" and "dieresis" and provide a
translation.

Makes umlauts appear on some PDFs using CFF type2 data in Type 1
fonts.
2024-02-22 06:59:28 +01:00
Nico Weber
cade76d240 LibPDF+LibGfx: Do not try to read "OS/2" table for PDFs
It is sometimes truncated in fonts embedded in PDFs, and the data
is not needed to render PDFs. 2 of my 1000 test PDFs used to
complain "Could not load OS2 v1: Not enough data" and 1
"Could not load OS2 v2: Not enough data" before.

Increases number of PDFs that render without diagnostics from
764 to 765 (and decreases the number of distinct error messages
from 27 to 25).
2024-02-21 13:38:33 +01:00
Nico Weber
0dee94ef40 LibPDF+LibGfx: Do not try to read "hmtx" table for PDFs
It is sometimes truncated in fonts embedded in PDFs, and the data
is not needed to render PDFs. 26 of my 1000 test files complained
"Could not load Hmtx: Not enough data" before.

Increases number of PDFs that render without diagnostics from
743 to 764.
2024-02-21 13:38:33 +01:00