Commit graph

41 commits

Author SHA1 Message Date
Nico Weber
0374c1eb3b LibPDF: Handle indirect reference resolving during parsing more robustly
If `Document::resolve()` was called during parsing, it'd change the
reader's current position, so the parsing code that called it would
then end up at an unexpected position in the file.

Parser.cpp already had special-case recovery when a stream's length
was stored in an indirect reference.

Commit ead02da98ac70c ("/JBIG2Globals") in #23503 added another case
where we could resolve an indirect reference during parsing, but it wasn't
aware of having to save and restore the reader position for that.

Put the save/restore code in `DocumentParser::parse_object_with_index`
instead, right before the place that ultimately changes the reader's
position during `Document::resolve`. This fixes `/JBIG2Globals` and
lets us remove the special-case code for `/Length` handling.

Since this is kind of subtle, include a test.
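
A minimal sketch of the save/restore idea, assuming a toy `Reader` with just an offset. `Reader`, `ScopedOffsetRestore` and `parse_object_at` are hypothetical names for illustration, not the LibPDF classes:

```cpp
#include <cassert>
#include <cstddef>

// Hypothetical stand-in for the parser's reader; only the offset matters here.
struct Reader {
    std::size_t offset = 0;
    void move_to(std::size_t new_offset) { offset = new_offset; }
};

// RAII guard: remembers the reader offset on construction and
// puts it back on destruction, even on early return.
class ScopedOffsetRestore {
public:
    explicit ScopedOffsetRestore(Reader& reader)
        : m_reader(reader)
        , m_saved_offset(reader.offset)
    {
    }
    ~ScopedOffsetRestore() { m_reader.move_to(m_saved_offset); }
    ScopedOffsetRestore(ScopedOffsetRestore const&) = delete;
    ScopedOffsetRestore& operator=(ScopedOffsetRestore const&) = delete;

private:
    Reader& m_reader;
    std::size_t m_saved_offset;
};

// Pretends to parse an object stored elsewhere in the file: seeks there,
// "consumes" 10 bytes, and returns the offset where parsing stopped.
// The caller's reader position is unchanged afterwards.
std::size_t parse_object_at(Reader& reader, std::size_t object_offset)
{
    ScopedOffsetRestore restore(reader);
    reader.move_to(object_offset);
    reader.offset += 10;
    return reader.offset;
}
```

With the guard placed right where the seek happens, every caller that resolves a reference mid-parse gets the restore without having to know about it.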
2024-03-19 19:20:01 -04:00
Hendiadyoin1
773a280bdf LibPDF: Use a struct for the subsection in parse_xref_stream 2024-03-01 14:05:53 -07:00
Nico Weber
87112dcbdc LibPDF: Return null for invalid refs, tolerate null objects as outline
https://llvm.org/devmtg/2022-11/slides/TechTalk5-WhatDoesItTakeToRunLLVMBuildbots.pdf
has an xref table that starts like so:

```
xref
0 214
0000000002 65535 f
0000924663 00000 n
0000000003 00000 f
0000000000 00000 f
0000000016 00000 n
0000000160 00000 n
0000000263 00000 n
```

This is a list of objects in the PDF file. A line ending with 'f'
means that the object is "free", that is, it's not stored in the file.
In this file, objects 0, 2, 3 are free. For free objects, the first
number is the object number of the next free object: Object 0 refers to
object 2, 2 to 3, and 3 back to 0 (since it's the last free object).
The lines ending with 'n' are actual objects; here the first number is
a byte offset to where that object is stored in the file.
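
To make the entry layout concrete, here is a rough sketch of reading one such line. `XRefEntry` and `parse_xref_entry` are made-up names, and the parsing is far more lenient than a real parser should be:

```cpp
#include <cassert>
#include <cstdint>
#include <optional>
#include <sstream>
#include <string>

// One line of a classic xref table (illustrative type, not LibPDF's).
struct XRefEntry {
    std::uint64_t first_field; // 'n': byte offset; 'f': number of the next free object
    std::uint32_t generation;
    bool in_use; // true for 'n', false for 'f'
};

// Parses a line such as "0000924663 00000 n".
// Returns nullopt if the line does not have the expected three fields.
std::optional<XRefEntry> parse_xref_entry(std::string const& line)
{
    std::istringstream in(line);
    std::uint64_t first_field = 0;
    std::uint32_t generation = 0;
    char kind = 0;
    if (!(in >> first_field >> generation >> kind))
        return std::nullopt;
    if (kind != 'n' && kind != 'f')
        return std::nullopt;
    return XRefEntry { first_field, generation, kind == 'n' };
}
```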

Furthermore, the file contains

```
/Outlines
2
0
R
```

in its root object, meaning that object 2 stores the page outlines.

Since object 2 is set as free, there is no object 2. But the spec
says that an invalid object reference is just the null object.

This patch makes us return null objects for references to free
objects, and it also makes us treat a null object as /Outlines value
the same as not having /Outlines in the first place.

Fixes #23023 -- we can now open that file. (We don't render it super
well, but only for already-known reasons.)

Since I found it a bit confusing: XRefTable has two related methods
here:

1. has_object() returns whether an object was explicitly listed in an
   xref table. The first number right after `xref` is the start
   index. So if an xref table were to start with `10`, we'd implicitly
   create 10 leading objects for which has_object() would return false
2. is_object_in_use() returns true if an object that was in a table
   (i.e. one where has_object() returns true) was listed with 'n' and
   false if it was listed with 'f'.
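
The difference between the two can be sketched with a toy table. `MiniXRefTable` is a hypothetical illustration of the semantics described above, not the real XRefTable (in particular, returning false from is_object_in_use() for unlisted objects is an assumption of this sketch):

```cpp
#include <cassert>
#include <cstddef>
#include <optional>
#include <vector>

// Toy xref table: each slot is either "not listed" (nullopt),
// listed as 'n' (true), or listed as 'f' (false).
class MiniXRefTable {
public:
    // Adds a subsection starting at object number `start`.
    // Slots below `start` that were never listed stay nullopt.
    void add_subsection(std::size_t start, std::vector<bool> const& in_use_flags)
    {
        if (m_entries.size() < start + in_use_flags.size())
            m_entries.resize(start + in_use_flags.size());
        for (std::size_t i = 0; i < in_use_flags.size(); ++i)
            m_entries[start + i] = in_use_flags[i];
    }

    // True if the object was explicitly listed in some subsection.
    bool has_object(std::size_t index) const
    {
        return index < m_entries.size() && m_entries[index].has_value();
    }

    // True only for objects that were listed with 'n'.
    bool is_object_in_use(std::size_t index) const
    {
        return has_object(index) && *m_entries[index];
    }

private:
    std::vector<std::optional<bool>> m_entries;
};
```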

DocumentParser::parse_object_with_index() should probably return a null
object for the `!has_object()` case as well instead of VERIFY()ing
that has_object() is true. But I haven't seen this in the wild yet,
so keeping as-is for now.
2024-01-31 12:10:19 -05:00
Tim Ledbetter
459fa8b840 LibPDF: Ensure that xref subsection numbers are u32
Previously, parsing an xref entry with a floating point subsection
number would cause a crash.
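
For illustration, a strict check along these lines could look as follows. This sketch uses `std::from_chars` and a made-up `parse_u32` helper; the actual fix works on parsed PDF values, not on raw text:

```cpp
#include <cassert>
#include <charconv>
#include <cstdint>
#include <optional>
#include <string_view>
#include <system_error>

// Accepts only text that is entirely an unsigned 32-bit decimal integer.
// "1.5", "-1", "" and out-of-range values all yield nullopt.
std::optional<std::uint32_t> parse_u32(std::string_view text)
{
    std::uint32_t value = 0;
    char const* begin = text.data();
    char const* end = begin + text.size();
    auto [parsed_end, error] = std::from_chars(begin, end, value);
    if (error != std::errc() || parsed_end != end)
        return std::nullopt;
    return value;
}
```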
2024-01-18 15:11:42 +01:00
Nico Weber
e16345555b LibPDF: Port 59b50fa43f8c2 to xref and object streams
0000440.pdf contains an xref stream object (at offset 3643676) starting:

```
294 0 obj <<
/Type /XRef
/Index [0 295]
/Size 295
```

and an object stream object (at offset 3640121) starting:

```
230 0 obj <<
/Type /ObjStm
/N 73
/First 614
```

In both cases, the `obj` and the `<<` are separated by non-newline
whitespace.

633e1632d0 made parse_indirect_value() tolerate this, but it updated
neither parse_xref_stream() (which parses xref streams) nor
parse_compressed_object_with_index() (which parses object streams),
despite all three changes being part of #14873.

Make parse_xref_stream() and parse_compressed_object_with_index()
call parse_indirect_value() to pick up the fix over there. It's a bit
less code too.

(0000440.pdf is the only PDF in my 1000 test PDFs that this helps,
somewhat surprisingly.)
2024-01-04 11:27:24 +01:00
Nico Weber
9d69c5d434 LibPDF: Tolerate trailing whitespace after %%EOF marker
At first I tried implementing the quirk from PDF 1.7 Appendix H,
3.4.4, "File Trailer": """Acrobat viewers require only that the %%EOF
marker appear somewhere within the last 1024 bytes of the file."""
This would've been like #22548 but at end-of-file instead of at
start-of-file.

This helped a bunch of files, but also broke a bunch of files that
had more than 1024 bytes of stuff at the end, and it wouldn't have
helped 0000059.pdf, which has over 40k of \0 bytes after the %%EOF.
So just tolerate whitespace after the %%EOF line, and keep ignoring
an arbitrary amount of other stuff after that like before.

This helps:
* 0000599.pdf
  One trailing \0 byte after %%EOF. Due to that byte, the
  is_linearized() check fails and we go down the non-linearized
  codepath. But with this fix, that code path succeeds.
* 0000937.pdf
  Same.
* 0000055.pdf
  Has one space followed by a \n after %%EOF
* 0000059.pdf
  Has over 40kB of trailing \0 bytes

The following files keep working with it:
* 0000242.pdf
  5586 bytes of trailing HTML
* 0000336.pdf
  5586 bytes of trailing HTML fragment
* 0000136.pdf
  2054 bytes of trailing space characters
  This one kind of only worked by accident before since it found
  the %%EOF block before the final %%EOF block. Maybe this is
  even an intentional XRefStm compat hack? Anyways, now it
  finds the final block instead.
* 0000327.pdf
  11044 bytes of trailing HTML
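
A rough sketch of the resulting behavior, under the assumption that it boils down to "skip trailing whitespace first, otherwise search backwards". `find_eof_marker` is a hypothetical helper, not the code in this commit:

```cpp
#include <cassert>
#include <cstddef>
#include <optional>
#include <string>
#include <string_view>

// PDF whitespace characters: NUL, tab, LF, FF, CR, space.
bool is_pdf_whitespace(unsigned char c)
{
    return c == 0 || c == '\t' || c == '\n' || c == '\f' || c == '\r' || c == ' ';
}

// Returns the offset of the %%EOF marker to use, or nullopt if there is none.
// 1. Skip any amount of trailing whitespace; if the file then ends in
//    %%EOF, use that marker.
// 2. Otherwise fall back to the last %%EOF anywhere in the file, which
//    tolerates arbitrary trailing junk such as HTML.
std::optional<std::size_t> find_eof_marker(std::string_view data)
{
    constexpr std::string_view marker = "%%EOF";
    std::size_t end = data.size();
    while (end > 0 && is_pdf_whitespace(static_cast<unsigned char>(data[end - 1])))
        --end;
    if (end >= marker.size() && data.substr(end - marker.size(), marker.size()) == marker)
        return end - marker.size();
    std::size_t position = data.rfind(marker);
    if (position == std::string_view::npos)
        return std::nullopt;
    return position;
}
```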
2024-01-04 11:19:15 +01:00
Nico Weber
2d12647e29 LibPDF: Add FIXME for "was linearized PDF incrementally updated" check
It's pretty tricky to do, and also tricky with respect to skipping
trailing bytes after %%EOF: The check requires knowing the full size of
the PDF (which means web servers not sending content lengths are out),
but that size has to be after stripping trailing bytes, which normal
static file servers won't do. So PDF viewers would have to download the
last couple bytes of the PDF unconditionally, then strip trailing bytes
and use the count to figure out the final actual PDF size.

Luckily, we don't incrementally download PDFs from the net but
instead require all data to be available in one chunk, so it's
not currently a problem.
2024-01-04 11:19:15 +01:00
Nico Weber
1b45c3e127 LibPDF: Tolerate whitespace after xref and startxref
The spec isn't super clear on if this is allowed:

"""Each cross-reference section shall begin with a line containing the
keyword xref. Following this line..."""

"""The two preceding lines shall contain, one per line and in order, the
keyword startxref and..."""

It kind of sounds like anything goes on both lines as long as they
contain `xref` and `startxref`.

In practice, both seem to always occur at the start of their line,
but in 0000780.pdf (and nowhere else), there's one space after each
keyword before the following linebreak, and this makes that file load.
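
As a sketch of what "tolerate" means here: match the keyword, skip spaces and tabs, then require an end-of-line. `consume_keyword_line` is a made-up helper for illustration only:

```cpp
#include <cassert>
#include <cstddef>
#include <optional>
#include <string_view>

// Matches `keyword` at `offset`, allows spaces/tabs after it, and then
// requires an end-of-line (CR, LF, or CRLF).
// Returns the offset of the first byte of the next line, or nullopt.
std::optional<std::size_t> consume_keyword_line(std::string_view data, std::size_t offset, std::string_view keyword)
{
    if (offset > data.size() || data.substr(offset, keyword.size()) != keyword)
        return std::nullopt;
    std::size_t position = offset + keyword.size();
    while (position < data.size() && (data[position] == ' ' || data[position] == '\t'))
        ++position;
    if (position >= data.size())
        return std::nullopt;
    if (data[position] == '\r') {
        ++position;
        if (position < data.size() && data[position] == '\n')
            ++position;
        return position;
    }
    if (data[position] == '\n')
        return position + 1;
    return std::nullopt;
}
```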
2024-01-04 10:14:30 +01:00
Nico Weber
0bb0c7dac2 LibPDF: Scan for PDF file start in first 1024 bytes
Other readers do this too, and files depend on this.

Fixes opening these four files from the PDFA 0000.zip dataset:

* 0000015.pdf
  Starts with `C:\web\webeuncet\_cat\_docs\_publics\` before header
* 0000408.pdf
  Starts with UTF-8 BOM
* 0000524.pdf
  Starts with 867 bytes of HTML containing a PHP backtrace
* 0000680.pdf
  Starts with `C:\web\webeuncet\_cat\_docs\_publics\` too
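
The scan itself is simple. A minimal sketch, assuming the whole `%PDF-` magic has to fit inside the first 1024 bytes (`find_pdf_header` is a hypothetical name):

```cpp
#include <cassert>
#include <cstddef>
#include <optional>
#include <string>
#include <string_view>

// Looks for "%PDF-" within the first 1024 bytes of the file.
// Returns the offset of the header, or nullopt if it is not found there.
std::optional<std::size_t> find_pdf_header(std::string_view data)
{
    constexpr std::string_view magic = "%PDF-";
    std::string_view window = data.substr(0, 1024); // substr clamps to the data size
    std::size_t position = window.find(magic);
    if (position == std::string_view::npos)
        return std::nullopt;
    return position;
}
```

All later byte offsets would presumably need to be interpreted relative to wherever the header was found, but that part is not shown here.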
2024-01-03 10:12:35 +01:00
Ali Mohammad Pur
5e1499d104 Everywhere: Rename {Deprecated => Byte}String
This commit un-deprecates DeprecatedString, and repurposes it as a byte
string.
As the null state has already been removed, there are no other
particularly hairy blockers in repurposing this type as a byte string
(what it _really_ is).

This commit is auto-generated:
  $ xs=$(ack -l '\bDeprecatedString\b|deprecated_string' AK Userland \
    Meta Ports Ladybird Tests Kernel)
  $ perl -pi -e 's/\bDeprecatedString\b/ByteString/g;
    s/deprecated_string/byte_string/g' $xs
  $ clang-format --style=file -i \
    $(git diff --name-only | grep '\.cpp\|\.h')
  $ gn format $(git ls-files '*.gn' '*.gni')
2023-12-17 18:25:10 +03:30
Tim Ledbetter
5c0c55d2c0 LibPDF: Ensure xref stream field widths are within expected range
Previously, an xref stream with a field width larger than 8 would
result in an undefined shift occurring. We now ensure that each field
width is a number and is less than or equal to 8.
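
A sketch of reading one big-endian field with that bound enforced. `read_xref_field` is a hypothetical helper, and the data layout is simplified to a flat byte vector:

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <optional>
#include <vector>

// Reads `width` bytes at `offset` as a big-endian unsigned integer.
// Widths above 8 cannot fit into 64 bits and are rejected, as are reads
// past the end of the data. A width of 0 yields 0 here; a real parser
// would substitute the field's default value in that case.
std::optional<std::uint64_t> read_xref_field(std::vector<std::uint8_t> const& data, std::size_t offset, std::size_t width)
{
    if (width > 8)
        return std::nullopt;
    if (offset > data.size() || width > data.size() - offset)
        return std::nullopt;
    std::uint64_t value = 0;
    for (std::size_t i = 0; i < width; ++i)
        value = (value << 8) | data[offset + i];
    return value;
}
```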
2023-10-28 13:17:09 -04:00
Tim Ledbetter
b4296e1c9b LibPDF: Don't use unsanitized values in error messages
Previously, constructing error messages with unsanitized input could
fail because error message strings must be UTF-8.
2023-10-26 11:05:32 +02:00
Nico Weber
e7f7c434f7 LibPDF: Don't check for startxref after trailer dict
Several files have a comment after the trailer dict and the
`startxref` after it.

We really should add a consume_whitespace_and_comments() function
and call that in most places we currently call consume_whitespace().

But in this case, for non-linearized files, we first jump to the
end of the file, read `startxref`, then jump to `xref` from the
offset there, and then read the trailer after the `xref`,
only to read `startxref` again. So we can just not do that.

(For linearized files, we now completely ignore `startxref`.
But we don't use the data in `startxref` in linearized files
anyways, so it's fine to not read it there too.)

Reduces number of crashes on 300 random PDFs from the web (the first 300
from 0000.zip from
https://pdfa.org/new-large-scale-pdf-corpus-now-publicly-available/)
from 25 (8%) to 23 (7%).
2023-10-24 13:32:01 -04:00
Nico Weber
cf26fc2393 LibPDF: Make parser skip whitespace after header
0000990.pdf from 0000.zip from
https://pdfa.org/new-large-scale-pdf-corpus-now-publicly-available/
starts like so:

```
%PDF-1.7

4 0 obj
```

parse_header() used to put the cursor at the start of the 2nd,
empty, line. initialize_linearization_dict() would then check
if `m_reader.matches_number()` to see if there could possibly
be a linearization dict.

In this case, there isn't one, but we should detect linearization
dicts even if they're separated by whitespace from the first line.
2023-10-21 09:09:53 +02:00
Matthew Olsson
edd7de3c77 LibPDF: Fix incorrectly parsing subsections in xref stream
Subsections are generally not contiguous, however this logic assumed
that they were, and kept a persistent "entry_index" count while looping
through all subsections. This commit rewrites the logic to be more
straightforward; just loop through all of the subsections and handle
each one separately.
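
Roughly, the per-subsection loop maps stream entries to object numbers like this. `Subsection` and `object_numbers_for_entries` are illustrative names, and the sketch assumes the (start, count) pairs come from the stream's /Index array:

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <optional>
#include <vector>

// One (start, count) pair from an xref stream's /Index array.
struct Subsection {
    std::uint32_t start;
    std::uint32_t count;
};

// Returns, for each entry in stream order, the object number it describes.
// Each subsection restarts numbering at its own `start`, so gaps between
// subsections are preserved. Returns nullopt if the subsections do not
// account for exactly `entry_count` entries.
std::optional<std::vector<std::uint32_t>> object_numbers_for_entries(std::vector<Subsection> const& subsections, std::size_t entry_count)
{
    std::vector<std::uint32_t> numbers;
    for (auto const& subsection : subsections) {
        // Never produce more numbers than there are entries in the stream.
        if (subsection.count > entry_count - numbers.size())
            return std::nullopt;
        for (std::uint32_t i = 0; i < subsection.count; ++i)
            numbers.push_back(subsection.start + i);
    }
    if (numbers.size() != entry_count)
        return std::nullopt;
    return numbers;
}
```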
2023-07-18 00:51:23 +02:00
Matthew Olsson
bfd8faedf9 LibPDF: Assert compressed xref's 2nd field is non-zero 2023-07-18 00:51:23 +02:00
Matthew Olsson
f9c1d11380 LibPDF: Do not crash when linearized length is incorrect
This is a perfectly valid situation, and in this case we should just
parse a standard non-linearized xref table.
2023-07-18 00:51:23 +02:00
Nico Weber
323d76fbb9 LibPDF: Make encrypted object streams work
There were two problems:
1. parse_compressed_object_with_index() parses indirect objects
   without going through Parser::parse_indirect_value(), so
   push_reference() / pop_reference() weren't called.
   Manually call them, both for the indirect object containing
   the object stream and for the indirect object within the
   object stream.
2. The indirect object within the object stream got decrypted
   twice: Once when the object stream data itself got decrypted,
   and then incorrectly a second time when the object data within
   the stream was read. To fix, disable encryption while parsing
   object stream data (since it's already decrypted).

The test is from http://opf-labs.org/format-corpus/pdfCabinetOfHorrors/
which according to readme.md at the same location is CC0.
2023-07-12 17:16:25 +02:00
Nico Weber
67d8c8badb LibPDF: Use more direct method to access linearization dict
We know indirect_value_or_error.value contains an IndirectObject,
so there's no need to go through resolve().

No behavior change.
2023-07-12 06:28:15 +02:00
Nico Weber
39b2eed3f6 LibPDF: Do not crash on encrypted files that start unluckily
PDF files can be linearized. In that case, they start with a
"linearization dict" that stores the key `/Linearized` and the value
`1`. To check if a file is linearized, we just read the first dict, and
then checked if it has that key.

If the first object of a PDF was a stream with a compression filter
and the input PDF was encrypted and not linearized, then us trying to
decode the linearization dict could crash due to stream contents being
encrypted, decryption state not yet being initialized, and us trying
to decompress stream data before decrypting it.

To prevent this, disable decompression when parsing the first object
to determine if it's a linearization dictionary.

(A linearization dict never stores string values, so decryption
not yet being initialized is not a problem. Integer values aren't
encrypted in encrypted PDF files.)
2023-07-12 06:28:15 +02:00
Nico Weber
ea89053c12 LibPDF: Make PDF version accessible on Document 2023-07-11 13:49:17 -04:00
Julian Offenhäuser
fd78875662 LibPDF: Fix navigate_to_before_eof_marker() for PDFs not ending in EOL
The way this was factored before, we would miss the %%EOF marker if it
didn't have a valid end-of-line sequence after it.
2023-03-22 09:04:00 +01:00
Julian Offenhäuser
93062e2b78 LibPDF: Be more cautious of errors when looking for linearization dict
We would previously assume that, following the header, there must be a
valid PDF object that could be a linearization dict.

However, if the file is not linearized, this is not necessarily true.
We now try to detect if there even is an object, and don't treat
parsing errors as fatal.
2023-03-22 09:04:00 +01:00
Julian Offenhäuser
6c0f7d83bb LibPDF: Don't treat a broken document header as a fatal error
As the current goal is to make our best effort loading documents, we
might as well ignore a broken header and power through, giving the user
a warning.
2023-03-22 09:04:00 +01:00
Julian Offenhäuser
34350ee9e7 LibPDF: Allow reading documents with incremental updates
The PDF spec allows incremental changes of a document by appending a
new XRef table and file trailer to it. These will only contain the
changed objects and will point back to the previous change, forming an
arbitrarily long chain of XRef sections and file trailers.

Every one of those XRef sections may be encoded as an XRef stream as
well, in which case the trailer is part of the stream dictionary as
usual. To make this easier, I made it so every XRef table may "own" a
trailer. This means that the main file trailer is now part of the main
XRef table.
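
To illustrate the chain, here is a simplified sketch that walks from the newest section back through its predecessors and lets newer entries win. `XRefSection` and `merge_xref_chain` are hypothetical, and the cycle check is an addition of this sketch, not something the commit describes:

```cpp
#include <cassert>
#include <cstdint>
#include <map>
#include <optional>
#include <set>

// One xref section plus the back-pointer from the trailer it "owns".
struct XRefSection {
    std::map<std::uint32_t, std::uint64_t> offsets; // object number -> byte offset
    std::optional<std::uint64_t> prev;              // file offset of the previous section
};

// Starts at the section `startxref` points to and follows the `prev`
// links. Entries from newer sections take precedence over older ones.
// Returns nullopt on a dangling link or a cycle.
std::optional<std::map<std::uint32_t, std::uint64_t>> merge_xref_chain(
    std::map<std::uint64_t, XRefSection> const& sections_by_offset, std::uint64_t startxref)
{
    std::map<std::uint32_t, std::uint64_t> merged;
    std::set<std::uint64_t> visited;
    std::optional<std::uint64_t> current = startxref;
    while (current.has_value()) {
        if (!visited.insert(*current).second)
            return std::nullopt; // already seen this section: cycle
        auto it = sections_by_offset.find(*current);
        if (it == sections_by_offset.end())
            return std::nullopt; // link points at nothing we know
        for (auto const& [number, offset] : it->second.offsets)
            merged.emplace(number, offset); // emplace keeps the newer entry
        current = it->second.prev;
    }
    return merged;
}
```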
2023-02-12 10:55:37 +00:00
MacDue
63b11030f0 Everywhere: Use ReadonlySpan<T> instead of Span<T const> 2023-02-08 19:15:45 +00:00
Tim Schumacher
220fbcaa7e AK: Remove the fallible constructor from FixedMemoryStream 2023-02-08 17:44:32 +00:00
Tim Schumacher
261d62438f AK: Remove the fallible constructor from LittleEndianInputBitStream 2023-02-08 17:44:32 +00:00
Tim Schumacher
093cf428a3 AK: Move memory streams from LibCore 2023-01-29 19:16:44 -07:00
Tim Schumacher
2470dd3bb5 AK: Move bit streams from LibCore 2023-01-29 19:16:44 -07:00
Tim Schumacher
ae64b68717 AK: Deprecate the old AK::Stream
This also removes a few cases where the respective header wasn't
actually required to be included.
2023-01-29 19:16:44 -07:00
Tim Schumacher
b1bfeb391e LibPDF: Use Core::Stream to parse the page offset hint table 2023-01-21 00:45:33 +00:00
Linus Groh
57dc179b1f Everywhere: Rename to_{string => deprecated_string}() where applicable
This will make it easier to support both string types at the same time
while we convert code, and to track down remaining uses.

One big exception is Value::to_string() in LibJS, where the name is
dictated by the ToString AO.
2022-12-06 08:54:33 +01:00
Linus Groh
6e19ab2bbc AK+Everywhere: Rename String to DeprecatedString
We have a new, improved string type coming up in AK (OOM aware, no null
state), and while it's going to use UTF-8, the name UTF8String is a
mouthful - so let's free up the String name by renaming the existing
class.
Making the old one have an annoying name will hopefully also help with
quick adoption :^)
2022-12-06 08:54:33 +01:00
Julian Offenhäuser
d1bc89e30b LibPDF: Try to repair XRef tables with broken indices
An XRef table usually starts with an object number of zero. While it
could technically start at any other number, this is a tell-tale sign
of a broken table.

For the "broken" documents I encountered, this always meant that some
objects must have been removed from the start of the table, without
updating the following indices. When this is the case, the document is
not able to be read normally.

However, most other PDF parsers seem to know of this quirk and fix the
XRef table automatically.

Likewise, we now check for this exact case, and if it matches up with
what we expect, we update the XRef table such that all object numbers
match the actual objects found in the file again.
2022-11-25 22:44:47 +01:00
Julian Offenhäuser
4b1a72ff7a LibPDF: Fix loop condition in parse_xref_stream()
We previously compared two unrelated values to determine if we parsed
the xref table to completion. We now check if we added every subsection
instead, and double check to make sure we never read past the end.
2022-11-19 15:42:08 +01:00
Julian Offenhäuser
a17a23a3f0 LibPDF: Make some variable names in parse_xref_stream() more clear
I found these to be a bit misleading.
2022-11-19 15:42:08 +01:00
Julian Offenhäuser
77f5f7a6f4 LibPDF: Support parsing page tree nodes that are in object streams
conditionally_parse_page_tree_node used to assume that the xref table
contained a byte offset, even for compressed objects. It now uses the
common facilities for parsing objects, at the expense of some
performance.
2022-09-17 10:07:14 +01:00
Julian Offenhäuser
563d91b6c4 LibPDF: Implement loading compressed objects from object streams
Now, whenever the xref table points to a compressed object,
parse_object_with_index will look it up in the corresponding object
stream as if it were a regular object.

With this, our parser gains the bare minimum support for xref streams.
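
For reference, the lookup inside an already decoded object stream can be sketched like this: the data starts with /N pairs of "object number, offset", and the objects themselves begin at /First. `extract_compressed_object` is a made-up helper that returns the raw text of one object, assuming offsets are listed in increasing order:

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <optional>
#include <sstream>
#include <string>
#include <utility>
#include <vector>

// `decoded` is the decompressed object stream data, `n` is /N and
// `first` is /First. Returns the raw bytes of object `wanted`, or nullopt.
std::optional<std::string> extract_compressed_object(std::string const& decoded, std::size_t n, std::size_t first, std::uint32_t wanted)
{
    if (first > decoded.size())
        return std::nullopt;

    // Read the header: n pairs of (object number, offset relative to first).
    std::istringstream header(decoded.substr(0, first));
    std::vector<std::pair<std::uint32_t, std::size_t>> pairs;
    for (std::size_t i = 0; i < n; ++i) {
        std::uint32_t number = 0;
        std::size_t offset = 0;
        if (!(header >> number >> offset))
            return std::nullopt;
        pairs.emplace_back(number, offset);
    }

    for (std::size_t i = 0; i < pairs.size(); ++i) {
        if (pairs[i].first != wanted)
            continue;
        // The object runs until the next object's start, or the end of data.
        std::size_t start = first + pairs[i].second;
        std::size_t end = (i + 1 < pairs.size()) ? first + pairs[i + 1].second : decoded.size();
        if (start > end || end > decoded.size())
            return std::nullopt;
        return decoded.substr(start, end - start);
    }
    return std::nullopt;
}
```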
2022-09-17 10:07:14 +01:00
Julian Offenhäuser
f9beff7b5e LibPDF: Initial work on parsing xref streams
Since PDF version 1.5, a document may omit the xref table in favor of
a new kind of xref stream object. This is used to reference so-called
"compressed" objects that are part of an object stream.

With this patch we are able to parse this new kind of xref object, but
we'll have to implement object streams to use them correctly.
2022-09-17 10:07:14 +01:00
Julian Offenhäuser
4887aacec7 LibPDF: Move document-specific parsing functionality into its own class
The Parser class is now a generic PDF object parser, from which the new
DocumentParser class derives. DocumentParser now takes over all
functions relating to linearization, pages, xref and trailer handling.

This allows the use of multiple parsers in the same document's
context, which will be needed in order to handle PDF object streams.
2022-09-17 10:07:14 +01:00