64 commits

Author SHA1 Message Date
Sam Atkins
d6075ef5b5 LibTextCodec+Everywhere: Make TextCodec::decoder_for() take a StringView
We don't need a full String/DeprecatedString inside this function, so we
might as well not force users to create one.
2023-02-15 12:48:26 -05:00
Julian Offenhäuser
96064ec5af LibPDF: Allow filter DecodeParms array entries to be null
Filters will use the default values in this case.
2023-02-12 10:55:37 +00:00
Rodrigo Tobar
a533ea7ae6 LibPDF: Improve stream parsing
When parsing streams we rely on a /Length item being defined in the
stream's dictionary to know how much data comprises the stream. Its
value is usually a direct value, but it can be indirect. There was
however a contradiction in the code: the condition that allowed it to
read and use the /Length value required it to be a direct value, but the
actual code using the value would have worked with indirect ones. This
meant that indirect /Length values triggered the fallback, "manual"
stream parsing code.

On the other hand, this latter code was also buggy, because it relied on
the "endstream" keyword to appear on a separate line, which isn't always
the case.

This commit fixes the bug in the manual stream parsing scenario and also
allows indirect /Length values to be used to parse streams more directly
and avoid the manual approach. The main caveat to this second change is
that for a brief period of time the Document is not able to resolve
references (i.e., before the xref table itself has been parsed). Any
parsing happening before that (e.g., the linearization dictionary) must
therefore use the manual stream parsing approach.
2023-02-08 19:47:15 +01:00
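
The two strategies described above can be sketched roughly as follows. This is a minimal illustration in standard C++, not the LibPDF code: `extract_stream_data` and its parameters are hypothetical, and the caller is assumed to have already resolved /Length (direct or indirect) where that is possible.

```cpp
#include <cstddef>
#include <optional>
#include <string_view>

// Hypothetical helper: returns the stream payload that starts at `data_start`.
// If a resolved /Length is available it is used directly; otherwise the
// function falls back to scanning for the "endstream" keyword.
std::optional<std::string_view> extract_stream_data(std::string_view input,
                                                    std::size_t data_start,
                                                    std::optional<std::size_t> length)
{
    if (data_start > input.size())
        return std::nullopt;

    if (length.has_value()) {
        if (*length > input.size() - data_start)
            return std::nullopt; // /Length points past the end of the file
        return input.substr(data_start, *length);
    }

    // Manual fallback: "endstream" may directly follow the data on the same
    // line, so search for the keyword itself rather than for a separate line.
    std::size_t end = input.find("endstream", data_start);
    if (end == std::string_view::npos)
        return std::nullopt;

    // Drop one end-of-line marker (LF, CR or CRLF) preceding the keyword.
    if (end > data_start && input[end - 1] == '\n')
        --end;
    if (end > data_start && input[end - 1] == '\r')
        --end;
    return input.substr(data_start, end - data_start);
}
```

The keyword search can be fooled by binary data that happens to contain "endstream", which is one reason to prefer /Length whenever it can be resolved.
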
Timothy Flynn
f3db548a3d AK+Everywhere: Rename FlyString to DeprecatedFlyString
DeprecatedFlyString relies heavily on DeprecatedString's StringImpl, so
let's rename it to A) match the name of DeprecatedString, B) write a new
FlyString class that is tied to String.
2023-01-09 23:00:24 +00:00
Julian Offenhäuser
a37f3390dc LibPDF: Allow numbers to start with whitespace 2023-01-09 22:54:36 +00:00
Rodrigo Tobar
d9718064d1 LibPDF: Add support for multi-line comments
The code parsing comments parsed only a single line of comments, but
callers assumed they parsed all comments that appeared contiguously in a
block. The latter is an easier to understand API, so this commit changes
the parse_comment function to parse entire blocks of comments instead of
single lines.
2022-12-16 10:04:23 +01:00
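
As a rough illustration of the "whole block" behaviour, the sketch below consumes every contiguous comment line in one call. It uses standard C++ only; `parse_comment_block` is a hypothetical name and signature, not the actual parse_comment function.

```cpp
#include <cstddef>
#include <string>
#include <string_view>

// Hypothetical sketch: consumes a block of contiguous '%' comment lines
// starting at `offset`, advances `offset` past the block, and returns the
// comment text without the leading '%', one line per comment.
std::string parse_comment_block(std::string_view input, std::size_t& offset)
{
    std::string comments;
    while (offset < input.size() && input[offset] == '%') {
        ++offset; // skip the '%' marker
        std::size_t line_end = input.find_first_of("\r\n", offset);
        if (line_end == std::string_view::npos)
            line_end = input.size();
        comments.append(input.substr(offset, line_end - offset));
        comments.push_back('\n');
        offset = line_end;
        // Skip the end-of-line bytes so the loop can see a following comment.
        while (offset < input.size() && (input[offset] == '\r' || input[offset] == '\n'))
            ++offset;
    }
    return comments;
}
```

With this shape, a caller that expects "all comments here are gone" after one call gets exactly that, which is the assumption the commit message says callers were already making.
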
Linus Groh
57dc179b1f Everywhere: Rename to_{string => deprecated_string}() where applicable
This will make it easier to support both string types at the same time
while we convert code, and to track down remaining uses.

One big exception is Value::to_string() in LibJS, where the name is
dictated by the ToString AO.
2022-12-06 08:54:33 +01:00
Linus Groh
6e19ab2bbc AK+Everywhere: Rename String to DeprecatedString
We have a new, improved string type coming up in AK (OOM aware, no null
state), and while it's going to use UTF-8, the name UTF8String is a
mouthful - so let's free up the String name by renaming the existing
class.
Making the old one have an annoying name will hopefully also help with
quick adoption :^)
2022-12-06 08:54:33 +01:00
Rodrigo Tobar
e776048309 LibPDF: Ignore whitespace on hex strings
The spec says that whitespace should be ignored, but we weren't doing so.
PDFs with whitespace in their hex strings were thus crashing the parser.
2022-11-30 14:51:14 +01:00
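
A minimal sketch of whitespace-tolerant hex string decoding, written in standard C++ and not taken from LibPDF. `decode_hex_string` is a hypothetical helper that receives the text between `<` and `>`; the whitespace set and the "odd trailing digit is padded with 0" rule follow the PDF specification.

```cpp
#include <optional>
#include <string>
#include <string_view>

// PDF whitespace characters: NUL, HT, LF, FF, CR and SP.
static bool is_pdf_whitespace(char c)
{
    return c == '\0' || c == '\t' || c == '\n' || c == '\f' || c == '\r' || c == ' ';
}

// Hypothetical sketch: decodes the contents of a hex string, ignoring
// whitespace between digits. Returns nullopt on a non-hex character.
std::optional<std::string> decode_hex_string(std::string_view hex)
{
    auto digit_value = [](char c) -> int {
        if (c >= '0' && c <= '9')
            return c - '0';
        if (c >= 'a' && c <= 'f')
            return c - 'a' + 10;
        if (c >= 'A' && c <= 'F')
            return c - 'A' + 10;
        return -1;
    };

    std::string out;
    int high = -1; // pending high nibble, or -1 if none
    for (char c : hex) {
        if (is_pdf_whitespace(c))
            continue;
        int value = digit_value(c);
        if (value < 0)
            return std::nullopt;
        if (high < 0) {
            high = value;
        } else {
            out.push_back(static_cast<char>(high * 16 + value));
            high = -1;
        }
    }
    // An odd number of digits means the final digit is assumed to be 0.
    if (high >= 0)
        out.push_back(static_cast<char>(high * 16));
    return out;
}
```
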
Julian Offenhäuser
0bc3333740 LibPDF: Parse integer numbers with atoi() instead of strtof()
strtof() produces rounding errors for very large numbers, which we
don't want for integers, as they may have to be precise.
2022-11-19 15:42:08 +01:00
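
The rounding problem is easy to reproduce outside LibPDF. The sketch below assumes a 32-bit IEEE 754 `float` and an `int` of at least 32 bits; the two function names are hypothetical and exist only for the comparison.

```cpp
#include <cstdlib>

// Parses the text directly as an integer, as the commit does with atoi().
int parse_as_integer(char const* text)
{
    return std::atoi(text);
}

// Parses the text as a float first, then converts. A 32-bit float has a
// 24-bit significand, so integers above 2^24 may be rounded.
int parse_via_float(char const* text)
{
    return static_cast<int>(std::strtof(text, nullptr));
}
```

For the input "16777217" (2^24 + 1), `parse_as_integer` yields 16777217 while `parse_via_float` yields 16777216. Note that atoi() has undefined behaviour for values that do not fit in an `int`, so a stricter parser may prefer strtol() with range checks.
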
Julian Offenhäuser
c2ad29c85f LibPDF: Implement png predictor decoding for flate filter
For flate and lzw filters, the data can be transformed by this
predictor function to make it compress better. For us this means that
we have to undo this step in order to get the right result.

Although this feature is meant for images, I found at least a few
documents that use it all over the place, making this step very
important.
2022-11-19 15:42:08 +01:00
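
To make "undoing the predictor" concrete, here is a hedged sketch of reversing PNG row filters in standard C++. It is not the LibPDF implementation: `undo_png_prediction` is a hypothetical function, and it assumes the caller has already derived `bytes_per_row` and `bytes_per_pixel` from the filter parameters. Each input row is one filter-type byte followed by the filtered bytes, as in the PNG specification.

```cpp
#include <cstddef>
#include <cstdint>
#include <cstdlib>
#include <optional>
#include <vector>

// Paeth predictor as defined by the PNG specification.
static std::uint8_t paeth(std::uint8_t a, std::uint8_t b, std::uint8_t c)
{
    int p = int(a) + int(b) - int(c);
    int pa = std::abs(p - a), pb = std::abs(p - b), pc = std::abs(p - c);
    if (pa <= pb && pa <= pc)
        return a;
    return pb <= pc ? b : c;
}

// Hypothetical sketch: returns the unfiltered rows, or nullopt if the input
// size does not match the row layout or a filter type is unknown.
std::optional<std::vector<std::uint8_t>> undo_png_prediction(
    std::vector<std::uint8_t> const& input, std::size_t bytes_per_row, std::size_t bytes_per_pixel)
{
    if (bytes_per_row == 0 || bytes_per_pixel == 0 || input.size() % (bytes_per_row + 1) != 0)
        return std::nullopt;

    std::vector<std::uint8_t> out;
    std::vector<std::uint8_t> previous(bytes_per_row, 0); // row above the first is all zeros
    std::vector<std::uint8_t> current(bytes_per_row);

    for (std::size_t row = 0; row < input.size(); row += bytes_per_row + 1) {
        std::uint8_t filter = input[row];
        for (std::size_t i = 0; i < bytes_per_row; ++i) {
            std::uint8_t a = i >= bytes_per_pixel ? current[i - bytes_per_pixel] : 0;  // left
            std::uint8_t b = previous[i];                                              // above
            std::uint8_t c = i >= bytes_per_pixel ? previous[i - bytes_per_pixel] : 0; // upper left
            std::uint8_t predicted = 0;
            switch (filter) {
            case 0: predicted = 0; break;                                      // None
            case 1: predicted = a; break;                                      // Sub
            case 2: predicted = b; break;                                      // Up
            case 3: predicted = static_cast<std::uint8_t>((a + b) / 2); break; // Average
            case 4: predicted = paeth(a, b, c); break;                         // Paeth
            default: return std::nullopt;
            }
            // Filtering is done modulo 256, so the cast wraps intentionally.
            current[i] = static_cast<std::uint8_t>(input[row + 1 + i] + predicted);
        }
        out.insert(out.end(), current.begin(), current.end());
        previous = current;
    }
    return out;
}
```
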
Julian Offenhäuser
16ed407c01 LibPDF: Support cascading stream filters
You can specify multiple filters as an array, where each one is fed the
output of the one before it.
2022-11-19 15:42:08 +01:00
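
The chaining itself is simple to sketch. The code below is an illustration in standard C++ with hypothetical names (`Bytes`, `Filter`, `apply_filters`); real filters such as flate or LZW decoders would take the place of the callables.

```cpp
#include <cstdint>
#include <functional>
#include <vector>

using Bytes = std::vector<std::uint8_t>;
using Filter = std::function<Bytes(Bytes const&)>;

// Applies the filters in array order: each one receives the output of the
// one before it, and the first receives the raw stream data.
Bytes apply_filters(Bytes data, std::vector<Filter> const& filters)
{
    for (auto const& filter : filters)
        data = filter(data);
    return data;
}
```

Because each stage feeds the next, the order of the array matters: swapping two filters generally produces different output.
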
Julian Offenhäuser
becd648a78 LibPDF: Parse hexadecimal values in name objects correctly 2022-11-19 15:42:08 +01:00
Julian Offenhäuser
2f71e0f09a LibPDF: Allow text operator sequences to start with whitespace 2022-10-16 17:44:54 +02:00
Julian Offenhäuser
633e1632d0 LibPDF: Allow whitespace other than EOL after an object marker 2022-09-17 10:07:14 +01:00
Julian Offenhäuser
65e83bed53 LibPDF: Disallow parsing indirect values as operands
An operation like 0 0 0 RG would have been confused for [ 0, 0 0 R ] G
2022-09-17 10:07:14 +01:00
Julian Offenhäuser
4887aacec7 LibPDF: Move document-specific parsing functionality into its own class
The Parser class is now a generic PDF object parser, from which the new
DocumentParser class derives. DocumentParser now takes over all
functions relating to linearization, pages, xref and trailer handling.

This allows the use of multiple parsers in the same document's
context, which will be needed in order to handle PDF object streams.
2022-09-17 10:07:14 +01:00
Julian Offenhäuser
9f4659cc63 LibPDF: Move consume and match helper functions to the Reader class 2022-09-17 10:07:14 +01:00
sin-ack
3f3f45580a Everywhere: Add sv suffix to strings relying on StringView(char const*)
Each of these strings would previously rely on StringView's char const*
constructor overload, which would call __builtin_strlen on the string.
Since we now have operator ""sv, we can replace these with much simpler
versions. This opens the door to being able to remove
StringView(char const*).

No functional changes.
2022-07-12 23:11:35 +02:00
Idan Horowitz
086969277e Everywhere: Run clang-format 2022-04-01 21:24:45 +01:00
Matthew Olsson
468ceb1b48 LibPDF: Rename Command to Operator
This is the correct name, according to the spec
2022-03-31 18:10:45 +02:00
Matthew Olsson
4e81663b31 LibPDF: Attempt to decrypt strings and streams 2022-03-29 02:52:57 +02:00
Matthew Olsson
60c3e786be LibPDF: Require Document* in Parser constructor
This makes it a bit easier to avoid calling parser->set_document, an
issue which cost me ~30 minutes to find.
2022-03-29 02:52:57 +02:00
Matthew Olsson
a8de9cf541 LibPDF: Keep track of the current object index/generation while Parsing
This information is required to decrypt encrypted strings/streams.
2022-03-29 02:52:57 +02:00
Matthew Olsson
c98bda8ce6 LibPDF: Get rid of PlainText/Encoded StreamObject
This was a small optimization to allow a stream object to simply hold
a reference to the bytes in a PDF document rather than duplicating
them. However, as we move into features such as encryption, this
optimization does more harm than good. This can be revisited in the
future if necessary.
2022-03-29 02:52:57 +02:00
Matthew Olsson
6133acb8c0 LibPDF: Allow newlines between xref table and "trailer" keyword 2022-03-07 10:53:57 +01:00
Matthew Olsson
544e44eec1 LibPDF: Fix bad hex string parsing logic 2022-03-07 10:53:57 +01:00
Matthew Olsson
3cfecc3d3b LibPDF: Remove useless hex string substring call 2022-03-07 10:53:57 +01:00
Matthew Olsson
73cf8205b4 LibPDF: Propagate errors in Parser and Document 2022-03-07 10:53:57 +01:00
Matthew Olsson
c1aa8c4a44 LibPDF: Remove unused function in Parser 2022-03-07 10:53:57 +01:00
Sam Atkins
fa3c61cf5a LibPDF: Make Filter::decode() return ErrorOr 2022-01-24 22:36:09 +01:00
Sam Atkins
45cf40653a Everywhere: Convert ByteBuffer factory methods from Optional -> ErrorOr
Apologies for the enormous commit, but I don't see a way to split this
up nicely. In the vast majority of cases it's a simple change. A few
extra places can use TRY instead of manual error checking though. :^)
2022-01-24 22:36:09 +01:00
Simon Woertz
c857b5d22f LibPDF: Convert PDF::Parser::m_document from RefPtr to WeakPtr
Otherwise both `PDF::Document` and `PDF::Parser` have a `RefPtr`
pointing to each other which leads to a memory leak due to a circular
dependency.
2022-01-08 18:57:55 +01:00
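
The same ownership pattern can be shown with the standard library, using `std::shared_ptr` and `std::weak_ptr` as stand-ins for AK's `RefPtr` and `WeakPtr`. The `Document` and `Parser` structs below are simplified, hypothetical versions, not the LibPDF classes.

```cpp
#include <memory>

struct Document;

struct Parser {
    // Non-owning back-reference: the parser does not keep the document alive.
    std::weak_ptr<Document> document;
};

struct Document {
    // Owning reference: the document keeps its parser alive.
    std::shared_ptr<Parser> parser;
};

std::shared_ptr<Document> make_document()
{
    auto document = std::make_shared<Document>();
    document->parser = std::make_shared<Parser>();
    document->parser->document = document;
    return document;
}
```

If `Parser::document` were a `std::shared_ptr` instead, the two objects would hold each other's reference counts above zero forever, which is the leak the commit describes.
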
Andreas Kling
216e21a1fa AK: Convert AK::Format formatting helpers to returning ErrorOr<void>
This isn't a complete conversion to ErrorOr<void>, but a good chunk.
The end goal here is to propagate buffer allocation failures to the
caller, and allow the use of TRY() with formatting functions.
2021-11-17 00:21:13 +01:00
Simon Woertz
b87ab989a3 LibPDF: Check if there is data left before consuming
Add a check to `Parser::consume_eol` to ensure that there is more data
to read before actually consuming any data. Without this check, running
out of data trips an assertion, e.g., when reading a truncated PDF
file.
2021-11-16 00:16:57 +01:00
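
A small sketch of the guarded version, in standard C++. The `Reader` struct here is a hypothetical stand-in with only the members needed for the example; it is not the LibPDF Reader or Parser class.

```cpp
#include <cstddef>
#include <string_view>

// Hypothetical minimal reader over an in-memory buffer.
struct Reader {
    std::string_view bytes;
    std::size_t offset { 0 };

    bool done() const { return offset >= bytes.size(); }

    // Consumes one end-of-line marker (CR, LF or CRLF). Returns false and
    // reads nothing if no data is left or the next byte is not an EOL byte.
    bool consume_eol()
    {
        if (done())
            return false; // the check added by the commit: nothing left to read
        if (bytes[offset] == '\r') {
            ++offset;
            if (!done() && bytes[offset] == '\n')
                ++offset;
            return true;
        }
        if (bytes[offset] == '\n') {
            ++offset;
            return true;
        }
        return false;
    }
};
```
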
Andreas Kling
80d4e830a0 Everywhere: Pass AK::ReadonlyBytes by value 2021-11-11 01:27:46 +01:00
Andreas Kling
a15ed8743d AK: Make ByteBuffer::try_* functions return ErrorOr<void>
Same as Vector, ByteBuffer now also signals allocation failure by
returning an ENOMEM Error instead of a bool, allowing us to use the
TRY() and MUST() patterns.
2021-11-10 21:58:58 +01:00
Brendan Coles
6ccfa3e75e LibPDF: Parser::parse_header() return false if remaining bytes is zero 2021-10-30 17:34:56 +02:00
Ben Wiederhake
f84a7e2e22 LibPDF: Replace Value class by AK::Variant
This decreases the memory consumption by LibPDF by 4 bytes per Value,
compensating exactly for the increase in an earlier commit. :^)
2021-09-20 17:39:36 +04:30
Ben Wiederhake
d344253b08 LibPDF: Extract reference bitpacking into dedicated class 2021-09-20 17:39:36 +04:30
Ben Wiederhake
da170997d5 LibPDF: Move inline function definition
This breaks the dependency cycle between Parser and Document.
2021-09-20 17:39:36 +04:30
Ali Mohammad Pur
97e97bccab Everywhere: Make ByteBuffer::{create_*,copy}() OOM-safe 2021-09-06 01:53:26 +02:00
Ali Mohammad Pur
3a9f00c59b Everywhere: Use OOM-safe ByteBuffer APIs where possible
If we can easily communicate failure, let's avoid asserting and report
failure instead.
2021-09-06 01:53:26 +02:00
Hendiadyoin1
ed46d52252 Everywhere: Use AK/Math.h if applicable
AK's version should see better inlining behavior than the LibM one.
We avoid mixed usage for now though.

Also clean up some stale math includes and improper floating-point usage.
2021-07-19 16:34:21 +04:30
Wesley Moret
1b8f73b6b3 LibPDF: Fix treating not finding the linearized dict as a fatal error
We now try to parse the first indirect value and see
if it's the `Linearization Parameter Dictionary`. If it's not, we
fall back to reading the xref table from the end of the document.
2021-07-16 20:44:10 +02:00
Wesley Moret
5d4d70355e LibPDF: Fix checking minor_ver instead of major_ver 2021-07-16 20:44:10 +02:00
Matthew Olsson
612b183703 LibPDF: Convert to east-const to comply with the recent style changes 2021-06-12 22:45:01 +04:30
Matthew Olsson
ea3abb14fe LibPDF: Parse hint tables
This code isn't _actually_ used as of right now, but I wrote it at the
same time as all of the code in the previous commit. I realized after
I wrote it that these hint tables aren't super useful if the parser
already has access to the full file. However, this will be useful if
we ever want to stream PDFs from the web (and possibly view them in
the browser).
2021-06-12 22:45:01 +04:30
Matthew Olsson
e23bfd7252 LibPDF: Parse linearized PDF files
This is a big step, as most PDFs which are downloaded online will be
linearized. Pretty much the only difference is that the xref structure
is slightly different.
2021-06-12 22:45:01 +04:30
Matthew Olsson
be1be47613 LibPDF: Fix two parser bugs
- A newline was assumed to follow the "stream" keyword, when it can also
  be a Windows-style line break (CRLF)
- Fix not consuming the "endobj" at the end of every indirect object
2021-06-12 22:45:01 +04:30
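
The first fix can be sketched as follows, in standard C++ with a hypothetical `skip_stream_keyword` helper. It accepts either a line feed or a carriage return plus line feed after the keyword, which are the two forms the PDF specification allows there.

```cpp
#include <cstddef>
#include <optional>
#include <string_view>

// Hypothetical sketch: given the offset of the "stream" keyword, returns the
// offset of the first data byte, or nullopt if the keyword or the
// end-of-line marker after it is missing.
std::optional<std::size_t> skip_stream_keyword(std::string_view input, std::size_t offset)
{
    constexpr std::string_view keyword = "stream";
    if (offset > input.size() || input.substr(offset, keyword.size()) != keyword)
        return std::nullopt;
    offset += keyword.size();

    if (input.substr(offset, 2) == "\r\n")
        return offset + 2; // Windows-style line break
    if (input.substr(offset, 1) == "\n")
        return offset + 1; // plain line feed
    return std::nullopt;
}
```
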