Commit graph

101 commits

Author SHA1 Message Date
Ava Rider
3a7bea7402 LibPDF: Added empty read check to parse_hex_string 2024-05-05 06:45:42 +01:00
Nico Weber
0374c1eb3b LibPDF: Handle indirect reference resolving during parsing more robustly
If `Document::resolve()` was called during parsing, it'd change the
reader's current position, so the parsing code that called it would
then end up at an unexpected position in the file.

Parser.cpp already had special-case recovery when a stream's length
was stored in an indirect reference.

Commit ead02da98ac70c ("/JBIG2Globals") in #23503 added another case
where we could resolve indirect reference during parsing, but wasn't
aware of having to save and restore the reader position for that.

Put the save/restore code in `DocumentParser::parse_object_with_index`
instead, right before the place that ultimately changes the reader's
position during `Document::resolve`. This fixes `/JBIG2Globals` and
lets us remove the special-case code for `/Length` handling.

Since this is kind of subtle, include a test.
2024-03-19 19:20:01 -04:00
Nico Weber
495aaa295c LibPDF: Add some logging behind PDF_DEBUG
I've added these two lines a bunch of times by now. Let's check
them in. If they turn out to be annoying, we can remove them again.
2024-03-19 19:20:01 -04:00
Nico Weber
1eaaa8c3e9 LibPDF+LibGfx: Support JBIG2s with /JBIG2Globals set
Several ramifications:

* /JBIG2Globals is an indirect reference, which means we now need
  a Document for unfiltering. (Technically, other decode parameters
  can also be indirect objects and we should use the Document to
  resolve() those too, but in practice it only seems to be needed
  for /JBIG2Globals.)

* Since /JBIG2Globals are so rare, we just parse once for each
  image that use them, and decode_embedded() now receives a
  Vector<ReadonlyBytes> with all sections of sequences of
  segments.

* Internally, decode_segment_headers() is now called several times
  for embedded JBIG2s with multiple such sections (e.g. PDFs with
  /JBIG2Globals).

* That means `data` is now no longer part of JBIG2LoadingContext
  and things get slightly reshuffled due to this.

This completes the LibPDF part of JBIG2 support. Once LibGfx
implements actual decoding of JBIG2s, things should start to
Just Work in PDFs.
2024-03-09 16:01:22 +01:00
Nico Weber
b258ba2767 LibPDF: Use decode_hex_digit() more
For `:#xx` in names, we now also handle lower-case hex digits.
The spec is silent on the case of these hex digits.
Our previous check (isxdigit(), and now is_ascii_hex_digit()) lets
through lower-case hex digits, so it seems better to handle them
rather than computing e.g. `'a' - 'A' + 10` (== 42 -- off by 32!).
I don't know if this has any visible effect on any files, but it's
more correct, and less code, and the code looks more like the code
in Filter::decode_ascii_hex().
2024-02-23 12:11:25 -05:00
Nico Weber
783b1d1c11 LibPDF: Use is_ascii_hex_digit() instead of isxdigit()
See description of #7684 for motivation.

Also, makes this code look more like the hex code in
Filter::decode_ascii_hex().

No behavior change.
2024-02-23 12:11:25 -05:00
Nico Weber
9495f64f91 LibPDF: Improve hex string parsing
A local (non-public) PDF I have lying around contains this in
a page's operator stream:

```
[<00b4003e> 3 <002600480051> 3 <005700550044004f0003> -29
<00330044> 3 <0055> -3 <004e0040> 4 <0003> -29 <004c00560003> -31
<0057004b> 4 <00480003> -37 <0050
>] TJ
```

That is, there's a newline in a hexstring after a character.

This led to `Parser error at offset 5184: Unexpected character`.

The spec says in 3.2.3 String Objects, Hexadecimal Strings:
"""Each pair of hexadecimal digits defines one byte of the string.
White-space characters (such as space, tab, carriage return, line feed,
and form feed) are ignored."""

But we didn't ignore whitespace before or after a character, only
in between the bytes.

The spec also says:
"""If the final digit of a hexadecimal string is missing—that is, if
there is an odd number of digits—the final digit is assumed to be 0."""

In that case, we were skipping the closing `>` twice -- or, more
accurately, we ignored the character after it too. This has been
wrong all the way back in #6974.

Add a test that fails if either of the two changes isn't present.
2024-01-02 22:13:21 +01:00
Nico Weber
6723552e95 LibPDF: Add a spec comment and remove a FIXME
I think the ASCIIHexDecode / ASCII85Decode unfilter functions handle
what this FIXME was about already.
2023-12-22 10:58:54 +01:00
Nico Weber
3d07684891 LibPDF: Extract Parser::parse_inline_image()
Pure code move, no intended behavior change.

The motivation is just to make Parser::parse_operators() less nested
and more focused.
2023-12-22 10:58:54 +01:00
Nico Weber
022fce75a6 LibPDF: Get inline image data from parser to renderer
We create a inline_image_end operator that has all the relevant data
in a synthetic StreamObject.

inline_image_end is still a RENDERER_TODO(), so no real behavior
change. (Previously we'd call only inline_image_begin, so string the
todo message is about is now a bit different. But no interesting
behavior change.)
2023-12-20 12:19:08 +01:00
Nico Weber
3285502ec6 LibPDF: Extract a Parser::unfilter_stream() method
No behavior change.
2023-12-20 12:19:08 +01:00
Ali Mohammad Pur
5e1499d104 Everywhere: Rename {Deprecated => Byte}String
This commit un-deprecates DeprecatedString, and repurposes it as a byte
string.
As the null state has already been removed, there are no other
particularly hairy blockers in repurposing this type as a byte string
(what it _really_ is).

This commit is auto-generated:
  $ xs=$(ack -l \bDeprecatedString\b\|deprecated_string AK Userland \
    Meta Ports Ladybird Tests Kernel)
  $ perl -pie 's/\bDeprecatedString\b/ByteString/g;
    s/deprecated_string/byte_string/g' $xs
  $ clang-format --style=file -i \
    $(git diff --name-only | grep \.cpp\|\.h)
  $ gn format $(git ls-files '*.gn' '*.gni')
2023-12-17 18:25:10 +03:30
Nico Weber
11354dbf9e LibPDF: Remember inline image stream bytes
We still don't process inline images, but now we have the pieces we need
for doing it (`map` and `stream_bytes`).
2023-12-11 10:50:39 +01:00
Nico Weber
cabc6a9d80 LibPDF: Add a comment that PDF 2.0 added a length key for inline images
In practice, basically no file has it, since it was only added in 2.0,
and 1.7 explicitly said "in particular, the Type, Subtype, and Length
entries normally found in a stream or image dictionary are unnecessary."
2023-12-11 10:50:39 +01:00
Nico Weber
071f890847 LibPDF: Require whitespace in front of inline image marker EI
Fixes a crash on page 3 of 0000450.pdf of 0000.zip, where we previously
started interpreting the middle of an inline image content stream as
operators, since it contained `EI` in its pixel data.
2023-12-11 10:50:39 +01:00
Nico Weber
27aae7e2b1 LibPDF: Parse inline image key-value pairs
Not used for anything yet.
2023-12-11 10:50:39 +01:00
Nico Weber
0912896ae0 LibPDF: Extract Parser::parse_dict_contents_until()
No behavior change.
2023-12-11 10:50:39 +01:00
Kyle Pereira
e4b8d68039 LibPDF: Permit comments at the end of a stream 2023-12-10 16:44:24 +01:00
Nico Weber
e39a790c82 LibPDF: Stop converting encodings in object parser
Per 1.7 spec 3.8.1, there are multiple logical text string types:
* text strings
* ASCII strings
* byte strings

Text strings can be in UTF-16BE, PDFDocEncoding, or (since PDF 2.0)
UTF-8.

But byte strings shouldn't be converted but treated as binary
data.

This makes us no longer convert strings used for drawing page text.
TABLE 5.6 "Text-showing operators" lists the operands for text-showing
operators as just "string", not "text string" (even though these strings
confusingly are called "text strings" in the body text), so not doing
this there is correct (and matches other viewers).

We also no longer incorrectly convert strings used for cypto data
(such as passwords), if they start with an UTF-16BE or UTF-8 marker.

No behavior change for outlines and info dict entries.

https://pdfa.org/understanding-utf-8-in-pdf-2-0/ has a good overview of
this.

(ASCII strings only contain ASCII characters and behave the same
anyways.)
2023-11-22 09:08:06 -07:00
Nico Weber
14bcb5219d LibPDF: Tolerate comments before drawing operators
Necessary to be able to render
https://github.com/pdf-association/pdf20examples/blob/master/pdf20-utf8-test.pdf
2023-11-22 08:56:43 +00:00
Nico Weber
9e8cf4fc1a LibPDF: Tolerate comment after last dict item
Necessary to be able to open
https://github.com/pdf-association/pdf20examples/blob/master/pdf20-utf8-test.pdf
2023-11-22 08:56:43 +00:00
Nico Weber
54c98a46d8 LibPDF: Correctly parse the d0 and d1 operators
They are the first operator in a type 3 charproc.
Operator.h already knew about them, but we didn't manage to parse
them, since they're the only two operators that contain a digit.
2023-11-17 19:47:53 +00:00
Tim Ledbetter
b4296e1c9b LibPDF: Don't use unsanitized values in error messages
Previously, constructing error messages with unsanitized input could
fail because error message strings must be UTF-8.
2023-10-26 11:05:32 +02:00
Nico Weber
4549d6cf1b LibPDF: Add a FIXME comment to the inline image data skipping path 2023-10-26 10:59:45 +02:00
Nico Weber
54cdcd0d06 LibPDF: Reject non-hexdigits in hex string with error
...instead of VERIFY()ing input data.

I haven't seen this in the wild, but since I'm here anyways,
might as well fix this.
2023-10-25 10:44:26 +02:00
Nico Weber
4675700057 LibPDF: Reject unterminated literal strings with an error
0000459.pdf in 0000.zip in the pdfa dataset contains this as the
very first object:

```
1 0 obj
<<
/Creator (Developer 2000)
/CreatorDate (
/Author (Oracle Reports)
/Producer (Oracle PDF driver)
/Title (2021_06_29 Tutoritzacions APTES.PDF)
>>
endobj
```

The `/CreatorDate` value string is unterminated.

Before, we'd assert when trying to check if the first object is
a linearization dict.

Now, we never read the first object (an error during the linearization
dict reading is treated as "file is not linearized") unless we try
to print the document's metadata -- and there we now show an error
instead of asserting.
2023-10-25 10:44:26 +02:00
Nico Weber
c0f3f1674c LibPDF: Make string literal parsing fallible
...and make running out of data after a \ an error instead of silently
returning an empty string.
2023-10-25 10:44:26 +02:00
Nico Weber
6153dd7b84 LibPDF: Tolerate comments after dict values
Makes 0000607.pdf from 0000.zip from the pdfa dataset load.
2023-10-23 09:28:00 -04:00
Nico Weber
a1f17bd643 LibPDF: Skip inline image data in operator stream
Inline images can contain arbitrary binary data in the operator stream,
greatly confusing the operator parser.

Just skip them for now. They'll produce a
`Rendering of feature not supported: draw operation: inline_image_begin`
diag as usual, so we won't forget about it.

After #21536, reduces number of crashes on 300 random PDFs from the web
(the first 300 from 0000.zip from
https://pdfa.org/new-large-scale-pdf-corpus-now-publicly-available/)
from 23 (7%) to 22 (7%).

On a larger sample (`Meta/test_pdf.py -n 500 ~/Downloads/0000`),
reduces number of crashes from 53 (10.6%) with 36 distinct crash
stacks to 46 (9.2%) with 33 distinct stacks.
2023-10-23 07:51:08 +02:00
Ali Mohammad Pur
aeee98b3a1 AK+Everywhere: Remove the null state of DeprecatedString
This commit removes DeprecatedString's "null" state, and replaces all
its users with one of the following:
- A normal, empty DeprecatedString
- Optional<DeprecatedString>

Note that null states of DeprecatedFlyString/StringView/etc are *not*
affected by this commit. However, DeprecatedString::empty() is now
considered equal to a null StringView.
2023-10-13 18:33:21 +03:30
Nico Weber
532230c0e4 LibPDF: Extract a Document::read_filters() method
No behavior change.
2023-07-24 09:50:45 -04:00
Nico Weber
77e6dbab33 LibPDF: Fix symbol for text_next_line_show_string_set_spacing operator
It's `"`, not `''`.

Now the `text_next_line_show_string_set_spacing` gets called and logs
a TODO at page render time if `"` is used in a PDF:

    warning: Rendering of feature not supported:
        draw operation: text_next_line_show_string_set_spacing

It caused a parse error (also at page render time) previously:

    [parse_value @ .../LibPDF/Parser.cpp:104]
        Parser error at offset 611: Unexpected char """
2023-07-22 12:25:30 -04:00
Nico Weber
39b2eed3f6 LibPDF: Do not crash on encrypted files that start unluckily
PDF files can be linearized. In that case, they start with a
"linearization dict" that stores the key `/Linearized` and the value
`1`. To check if a file is linearized, we just read the first dict, and
then checked if it has that key.

If the first object of a PDF was a stream with a compression filter
and the input PDF was encrypted and not linearized, then us trying to
decode the linearization dict could crash due to stream contents being
encrypted, decryption state not yet being initialized, and us trying
to decompress stream data before decrypting it.

To prevent this, disable uncompression when parsing the first object
to determine if it's a lineralization dictionary.

(A linearization dict never stores string values, so decryption
not yet being initialized is not a problem. Integer values aren't
encrypted in encrypted PDF files.)
2023-07-12 06:28:15 +02:00
Nico Weber
63670f27de LibPDF: Rename m_disable_encryption to m_enable_encryption
Double negation is confusing.

No behavior change.
2023-07-12 06:28:15 +02:00
Nico Weber
93357a8b70 LibPDF: Fix a typo in a function name
...and while here, a comment typo too.
2023-07-05 18:42:39 +01:00
Ben Wiederhake
f866c80222 LibPDF: Avoid unnecessary HashMap copy, mark other copies 2023-05-19 22:33:57 +02:00
Sam Atkins
2db168acc1 LibTextCodec+Everywhere: Port Decoders to new Strings 2023-02-19 17:15:47 +01:00
Sam Atkins
d6075ef5b5 LibTextCodec+Everywhere: Make TextCodec::decoder_for() take a StringView
We don't need a full String/DeprecatedString inside this function, so we
might as well not force users to create one.
2023-02-15 12:48:26 -05:00
Julian Offenhäuser
96064ec5af LibPDF: Allow filter DecodeParms array entries to be null
Filters will use the default values in this case.
2023-02-12 10:55:37 +00:00
Rodrigo Tobar
a533ea7ae6 LibPDF: Improve stream parsing
When parsing streams we rely on a /Length item being defined in the
stream's dictionary to know how much data comprises the stream. Its
value is usually a direct value, but it can be indirect. There was
however a contradiction in the code: the condition that allowed it to
read and use the /Length value required it to be a direct value, but the
actual code using the value would have worked with indirect ones. This
meant that indirect /Length values triggered the fallback, "manual"
stream parsing code.

On the other hand, this latter code was also buggy, because it relied on
the "endstream" keyword to appear on a separate line, which isn't always
the case.

This commit both fixes the bug in the manual stream parsing scenario,
while also allowing for indirect /Length values to be used to parse
streams more directly and avoid the manual approach. The main caveat to
this second change is that for a brief period of time the Document is
not able to resolve references (i.e., before the xref table itself is
not parsed). Any parsing happening before that (e..g, the linearization
dictionary) must therefore use the manual stream parsing approach.
2023-02-08 19:47:15 +01:00
Timothy Flynn
f3db548a3d AK+Everywhere: Rename FlyString to DeprecatedFlyString
DeprecatedFlyString relies heavily on DeprecatedString's StringImpl, so
let's rename it to A) match the name of DeprecatedString, B) write a new
FlyString class that is tied to String.
2023-01-09 23:00:24 +00:00
Julian Offenhäuser
a37f3390dc LibPDF: Allow numbers to start with whitespace 2023-01-09 22:54:36 +00:00
Rodrigo Tobar
d9718064d1 LibPDF: Add support for multi-line comments
The code parsing comments parsed only a single line of comments, but
callers assumed they parsed all comments that appeared contiguously in a
block. The latter is an easier to understand API, so this commit changes
the parse_comment function to parse entire blocks of comments instead of
single lines.
2022-12-16 10:04:23 +01:00
Linus Groh
57dc179b1f Everywhere: Rename to_{string => deprecated_string}() where applicable
This will make it easier to support both string types at the same time
while we convert code, and tracking down remaining uses.

One big exception is Value::to_string() in LibJS, where the name is
dictated by the ToString AO.
2022-12-06 08:54:33 +01:00
Linus Groh
6e19ab2bbc AK+Everywhere: Rename String to DeprecatedString
We have a new, improved string type coming up in AK (OOM aware, no null
state), and while it's going to use UTF-8, the name UTF8String is a
mouthful - so let's free up the String name by renaming the existing
class.
Making the old one have an annoying name will hopefully also help with
quick adoption :^)
2022-12-06 08:54:33 +01:00
Rodrigo Tobar
e776048309 LibPDF: Ignore whitespace on hex strings
The spec says that whitespaces should be ignored, but we weren't. PDFs
with whitespaces in their hex strings were thus crushing the parser.
2022-11-30 14:51:14 +01:00
Julian Offenhäuser
0bc3333740 LibPDF: Parse integer numbers with atoi() instead of strtof()
strtof() produces rounding errors for very large numbers, which we
don't want for integers, as they may have to be precise.
2022-11-19 15:42:08 +01:00
Julian Offenhäuser
c2ad29c85f LibPDF: Implement png predictor decoding for flate filter
For flate and lzw filters, the data can be transformed by this
predictor function to make it compress better. For us this means that
we have to undo this step in order to get the right result.

Although this feature is meant for images, I found at least a few
documents that use it all over the place, making this step very
important.
2022-11-19 15:42:08 +01:00
Julian Offenhäuser
16ed407c01 LibPDF: Support cascading stream filters
You can specify multiple filters as an array, where each one is fed the
output of the one before it.
2022-11-19 15:42:08 +01:00
Julian Offenhäuser
becd648a78 LibPDF: Parse hexadecimal values in name objects correctly 2022-11-19 15:42:08 +01:00