When parsing streams we rely on a /Length item being defined in the
stream's dictionary to know how much data comprises the stream. Its
value is usually a direct value, but it can be indirect. There was
however a contradiction in the code: the condition that allowed it to
read and use the /Length value required it to be a direct value, but the
actual code using the value would have worked with indirect ones. This
meant that indirect /Length values triggered the fallback, "manual"
stream parsing code.
On the other hand, this latter code was also buggy, because it relied on
the "endstream" keyword to appear on a separate line, which isn't always
the case.
This commit both fixes the bug in the manual stream parsing scenario,
while also allowing for indirect /Length values to be used to parse
streams more directly and avoid the manual approach. The main caveat to
this second change is that for a brief period of time the Document is
not able to resolve references (i.e., before the xref table itself is
not parsed). Any parsing happening before that (e..g, the linearization
dictionary) must therefore use the manual stream parsing approach.
DeprecatedFlyString relies heavily on DeprecatedString's StringImpl, so
let's rename it to A) match the name of DeprecatedString, B) write a new
FlyString class that is tied to String.
The code parsing comments parsed only a single line of comments, but
callers assumed they parsed all comments that appeared contiguously in a
block. The latter is an easier to understand API, so this commit changes
the parse_comment function to parse entire blocks of comments instead of
single lines.
This will make it easier to support both string types at the same time
while we convert code, and tracking down remaining uses.
One big exception is Value::to_string() in LibJS, where the name is
dictated by the ToString AO.
We have a new, improved string type coming up in AK (OOM aware, no null
state), and while it's going to use UTF-8, the name UTF8String is a
mouthful - so let's free up the String name by renaming the existing
class.
Making the old one have an annoying name will hopefully also help with
quick adoption :^)
For flate and lzw filters, the data can be transformed by this
predictor function to make it compress better. For us this means that
we have to undo this step in order to get the right result.
Although this feature is meant for images, I found at least a few
documents that use it all over the place, making this step very
important.
The Parser class is now a generic PDF object parser, of which the new
DocumentParser class derives. DocumentParser now takes over all
functions relating to linearization, pages, xref and trailer handling.
This allows the use of multiple parsers in the same document's
context, which will be needed in order to handle PDF object streams.
Each of these strings would previously rely on StringView's char const*
constructor overload, which would call __builtin_strlen on the string.
Since we now have operator ""sv, we can replace these with much simpler
versions. This opens the door to being able to remove
StringView(char const*).
No functional changes.
This was a small optimization to allow a stream object to simply hold
a reference to the bytes in a PDF document rather than duplicating
them. However, as we move into features such as encryption, this
optimization does more harm than good. This can be revisited in the
future if necessary.
Apologies for the enormous commit, but I don't see a way to split this
up nicely. In the vast majority of cases it's a simple change. A few
extra places can use TRY instead of manual error checking though. :^)
This isn't a complete conversion to ErrorOr<void>, but a good chunk.
The end goal here is to propagate buffer allocation failures to the
caller, and allow the use of TRY() with formatting functions.
Add a check to `Parser::consume_eol` to ensure that there is more data
to read before actually consuming any data. Not checking if there is
data left leads to failing an assertion in case of e.g., a truncated
pdf file.
Same as Vector, ByteBuffer now also signals allocation failure by
returning an ENOMEM Error instead of a bool, allowing us to use the
TRY() and MUST() patterns.
AK's version should see better inlining behaviors, than the LibM one.
We avoid mixed usage for now though.
Also clean up some stale math includes and improper floatingpoint usage.
We now try to parse the first indirect value and see
if it's the `Linearization Parameter Dictionary`. if it's not, we
fallback to reading the xref table from the end of the document
This code isn't _actually_ used as of right now, but I wrote it at the
same time as all of the code in the previous commit. I realized after
I wrote it that these hint tables aren't super useful if the parser
already has access to the full file. However, this will be useful if
we ever want to stream PDFs from the web (and possibly view them in
the browser).
This is a big step, as most PDFs which are downloaded online will be
linearized. Pretty much the only difference is that the xref structure
is slightly different.
- A newline was assumed to follow the "stream" keyword, when it can also
be a windows-style line break
- Fix not consuming the "endobj" at the end of every indirect object