LibPDF: Don't accidentally form new tokens on pages with contents arrays

A page's /Contents can be an array of streams, in which case the page's contents are treated as if those streams were concatenated. Most of the time, a stream ends with whitespace. But in some cases (e.g. 0000642.pdf from 0000.zip from the pdfa dataset), the first stream ends with an operator (`Q`) and the next stream starts with one (`q`), so plain concatenation would form a new, unknown operator (`Qq`).

Separate the streams' contents with a space to prevent that.

Reduces the number of PDF files we fail to open in the -n 500 case from 11 to 10 (in either case, we then crash on 18 of the PDFs that we do manage to open).
Author: https://github.com/nico Commit: https://github.com/SerenityOS/serenity/commit/3fe9f8e48d Pull-request: https://github.com/SerenityOS/serenity/pull/21561
2024-12-04 05:20:30 +00:00 · 2023-10-23 09:28:41 -07:00 · 2023-10-23 09:28:41 -07:00 · 3fe9f8e48d · 2024-07-17 06:00:02 +09:00
commit 3fe9f8e48d
parent f885839ba5
1 changed files with 7 additions and 2 deletions
--- a/Userland/Libraries/LibPDF/Page.cpp
+++ b/Userland/Libraries/LibPDF/Page.cpp
@@ -19,13 +19,18 @@ PDFErrorOr<ByteBuffer> Page::page_contents(Document& document) const

    // "The value may be either a single stream or an array of streams. If the value
    //  is an array, the effect is as if all the streams in the array were concatenated,
-    //  in order, to form a single stream."
+    //  in order, to form a single stream. The division between streams may occur only at
+    //  the boundaries between lexical tokens"
    if (contents->is<StreamObject>())
        return TRY(ByteBuffer::copy(contents->cast<StreamObject>()->bytes()));

+    // If one stream ends with (say) a `Q` and the next starts with `q`, that should be
+    // two distinct tokens. Insert spaces between stream contents to ensure that.
    ByteBuffer byte_buffer;
-    for (auto& ref : *contents->cast<ArrayObject>())
+    for (auto& ref : *contents->cast<ArrayObject>()) {
        TRY(byte_buffer.try_append(TRY(document.resolve_to<StreamObject>(ref))->bytes()));
+        TRY(byte_buffer.try_append(' '));
+    }
    return byte_buffer;
 }
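
To illustrate the effect of the change outside of LibPDF, here is a minimal standalone sketch. It is not LibPDF code: `join_streams` and `split_tokens` are hypothetical helpers, `std::string` stands in for `ByteBuffer`, and the tokenizer splits on whitespace only, which is far simpler than a real PDF content-stream parser.

```cpp
#include <sstream>
#include <string>
#include <vector>

// Hypothetical stand-in for the array branch of Page::page_contents():
// concatenates the stream payloads, optionally appending a space after each one.
std::string join_streams(std::vector<std::string> const& streams, bool insert_separator)
{
    std::string joined;
    for (auto const& stream : streams) {
        joined += stream;
        if (insert_separator)
            joined += ' ';
    }
    return joined;
}

// Deliberately simplified tokenizer (an assumption for this sketch):
// tokens are maximal runs of non-whitespace characters.
std::vector<std::string> split_tokens(std::string const& content)
{
    std::vector<std::string> tokens;
    std::istringstream input(content);
    std::string token;
    while (input >> token)
        tokens.push_back(token);
    return tokens;
}
```

With two streams `"q Q"` and `"q Q"`, joining without a separator gives `q Qq Q`, where the middle token is the bogus `Qq`. Joining with the separator gives four tokens, `q`, `Q`, `q`, `Q`. The trailing space after the last stream is harmless here because whitespace only delimits tokens.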