0ct0pu5/ladybird

Author	SHA1	Message	Date
Matthew Olsson	edd7de3c77	LibPDF: Fix incorrectly parsing subsections in xref stream Subsections are generally not contiguous, however this logic assumed that they were, and kept a persistent "entry_index" count while looping through all subsections. This commit rewrites the logic to be more straightforward; just loop through all of the subsections and handle each one separately.	2023-07-18 00:51:23 +02:00
Matthew Olsson	bfd8faedf9	LibPDF: Assert compressed xref's 2nd field is non-zero	2023-07-18 00:51:23 +02:00
Matthew Olsson	f9c1d11380	LibPDF: Do not crash when linearized length is incorrect This is a perfectly valid situation, and in this case we should just parse a standard non-linearized xref table.	2023-07-18 00:51:23 +02:00
Nico Weber	323d76fbb9	LibPDF: Make encrypted object streams work There were two problems: 1. parse_compressed_object_with_index() parses indirect objects without going through Parser::parse_indirect_value(), so push_reference() / pop_reference() weren't called. Manually call them, both for the indirect object containing the object stream and for the indirect object within the object stream. 2. The indirect object within the object stream got decrypted twice: Once when the object stream data itself got decrypted, and then incorrectly a second time when the object data within the stream was read. To fix, disable encryption while parsing object stream data (since it's already decrypted). The test is from http://opf-labs.org/format-corpus/pdfCabinetOfHorrors/ which according to readme.md at the same location is CC0.	2023-07-12 17:16:25 +02:00
Nico Weber	67d8c8badb	LibPDF: Use more direct method to access linearization dict We know indirect_value_or_error.value contains an IndirectObject, so there's no need to go through resolve(). No behavior change.	2023-07-12 06:28:15 +02:00
Nico Weber	39b2eed3f6	LibPDF: Do not crash on encrypted files that start unluckily PDF files can be linearized. In that case, they start with a "linearization dict" that stores the key `/Linearized` and the value `1`. To check if a file is linearized, we just read the first dict, and then checked if it has that key. If the first object of a PDF was a stream with a compression filter and the input PDF was encrypted and not linearized, then us trying to decode the linearization dict could crash due to stream contents being encrypted, decryption state not yet being initialized, and us trying to decompress stream data before decrypting it. To prevent this, disable decompression when parsing the first object to determine if it's a linearization dictionary. (A linearization dict never stores string values, so decryption not yet being initialized is not a problem. Integer values aren't encrypted in encrypted PDF files.)	2023-07-12 06:28:15 +02:00
Nico Weber	ea89053c12	LibPDF: Make PDF version accessible on Document	2023-07-11 13:49:17 -04:00
Julian Offenhäuser	fd78875662	LibPDF: Fix navigate_to_before_eof_marker() for PDFs not ending in EOL The way this was factored before, we would miss the %%EOF marker if it didn't have a valid end-of-line sequence after it.	2023-03-22 09:04:00 +01:00
Julian Offenhäuser	93062e2b78	LibPDF: Be more cautious of errors when looking for linearization dict We would previously assume that, following the header, there must be a valid PDF object that could be a linearization dict. However, if the file is not linearized, this is not necessarily true. We now try to detect if there even is an object, and don't treat parsing errors as fatal.	2023-03-22 09:04:00 +01:00
Julian Offenhäuser	6c0f7d83bb	LibPDF: Don't treat a broken document header as a fatal error As the current goal is to make our best effort loading documents, we might as well ignore a broken header and power through, giving the user a warning.	2023-03-22 09:04:00 +01:00
Julian Offenhäuser	34350ee9e7	LibPDF: Allow reading documents with incremental updates The PDF spec allows incremental changes of a document by appending a new XRef table and file trailer to it. These will only contain the changed objects and will point back to the previous change, forming an arbitrarily long chain of XRef sections and file trailers. Every one of those XRef sections may be encoded as an XRef stream as well, in which case the trailer is part of the stream dictionary as usual. To make this easier, I made it so every XRef table may "own" a trailer. This means that the main file trailer is now part of the main XRef table.	2023-02-12 10:55:37 +00:00
MacDue	63b11030f0	Everywhere: Use ReadonlySpan<T> instead of Span<T const>	2023-02-08 19:15:45 +00:00
Tim Schumacher	220fbcaa7e	AK: Remove the fallible constructor from `FixedMemoryStream`	2023-02-08 17:44:32 +00:00
Tim Schumacher	261d62438f	AK: Remove the fallible constructor from `LittleEndianInputBitStream`	2023-02-08 17:44:32 +00:00
Tim Schumacher	093cf428a3	AK: Move memory streams from `LibCore`	2023-01-29 19:16:44 -07:00
Tim Schumacher	2470dd3bb5	AK: Move bit streams from `LibCore`	2023-01-29 19:16:44 -07:00
Tim Schumacher	ae64b68717	AK: Deprecate the old `AK::Stream` This also removes a few cases where the respective header wasn't actually required to be included.	2023-01-29 19:16:44 -07:00
Tim Schumacher	b1bfeb391e	LibPDF: Use `Core::Stream` to parse the page offset hint table	2023-01-21 00:45:33 +00:00
Linus Groh	57dc179b1f	Everywhere: Rename to_{string => deprecated_string}() where applicable This will make it easier to support both string types at the same time while we convert code, and tracking down remaining uses. One big exception is Value::to_string() in LibJS, where the name is dictated by the ToString AO.	2022-12-06 08:54:33 +01:00
Linus Groh	6e19ab2bbc	AK+Everywhere: Rename String to DeprecatedString We have a new, improved string type coming up in AK (OOM aware, no null state), and while it's going to use UTF-8, the name UTF8String is a mouthful - so let's free up the String name by renaming the existing class. Making the old one have an annoying name will hopefully also help with quick adoption :^)	2022-12-06 08:54:33 +01:00
Julian Offenhäuser	d1bc89e30b	LibPDF: Try to repair XRef tables with broken indices An XRef table usually starts with an object number of zero. While it could technically start at any other number, this is a tell-tale sign of a broken table. For the "broken" documents I encountered, this always meant that some objects must have been removed from the start of the table, without updating the following indices. When this is the case, the document is not able to be read normally. However, most other PDF parsers seem to know of this quirk and fix the XRef table automatically. Likewise, we now check for this exact case, and if it matches up with what we expect, we update the XRef table such that all object numbers match the actual objects found in the file again.	2022-11-25 22:44:47 +01:00
Julian Offenhäuser	4b1a72ff7a	LibPDF: Fix loop condition in parse_xref_stream() We previously compared two unrelated values to determine if we parsed the xref table to completion. We now check if we added every subsection instead, and double check to make sure we never read past the end.	2022-11-19 15:42:08 +01:00
Julian Offenhäuser	a17a23a3f0	LibPDF: Make some variable names in parse_xref_stream() more clear I found these to be a bit misleading.	2022-11-19 15:42:08 +01:00
Julian Offenhäuser	77f5f7a6f4	LibPDF: Support parsing page tree nodes that are in object streams conditionally_parse_page_tree_node used to assume that the xref table contained a byte offset, even for compressed objects. It now uses the common facilities for parsing objects, at the expense of some performance.	2022-09-17 10:07:14 +01:00
Julian Offenhäuser	563d91b6c4	LibPDF: Implement loading compressed objects from object streams Now, whenever the xref table points to a compressed object, parse_object_with_index will look it up in the corresponding object stream as if it were a regular object. With this, our parser gains the bare minimum support for xref streams.	2022-09-17 10:07:14 +01:00
Julian Offenhäuser	f9beff7b5e	LibPDF: Initial work on parsing xref streams Since PDF version 1.5, a document may omit the xref table in favor of a new kind of xref stream object. This is used to reference so-called "compressed" objects that are part of an object stream. With this patch we are able to parse this new kind of xref object, but we'll have to implement object streams to use them correctly.	2022-09-17 10:07:14 +01:00
Julian Offenhäuser	4887aacec7	LibPDF: Move document-specific parsing functionality into its own class The Parser class is now a generic PDF object parser, of which the new DocumentParser class derives. DocumentParser now takes over all functions relating to linearization, pages, xref and trailer handling. This allows the use of multiple parsers in the same document's context, which will be needed in order to handle PDF object streams.	2022-09-17 10:07:14 +01:00

27 commits
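
The xref stream commits above (edd7de3c77, 4b1a72ff7a, f9beff7b5e) describe handling each subsection separately instead of keeping one running entry index across all of them. The sketch below illustrates that idea in Python. It is not the LibPDF code, which is C++. The names XRefEntry, read_field and parse_xref_stream are hypothetical, and the function assumes the stream data is already decoded and that the /W and /Index arrays are passed in as plain integer lists.

```python
from typing import NamedTuple


class XRefEntry(NamedTuple):
    # Hypothetical record type for this sketch, not a LibPDF type.
    object_number: int
    kind: int    # 0 = free, 1 = uncompressed, 2 = compressed (in an object stream)
    field2: int  # meaning depends on kind, e.g. a byte offset for kind 1
    field3: int  # meaning depends on kind, e.g. an index within the object stream for kind 2


def read_field(data: bytes, offset: int, width: int, default: int) -> int:
    """Read a big-endian integer of `width` bytes; a zero width means the field is absent."""
    if width == 0:
        return default
    return int.from_bytes(data[offset:offset + width], "big")


def parse_xref_stream(data: bytes, widths: list[int], index: list[int]) -> list[XRefEntry]:
    """Parse decoded xref stream data.

    `widths` stands in for the /W array (three field widths in bytes) and
    `index` for the /Index array (pairs of first object number and count).
    """
    if len(widths) != 3:
        raise ValueError("/W must contain exactly three widths")
    if len(index) % 2 != 0:
        raise ValueError("/Index must contain (first, count) pairs")

    entry_size = sum(widths)
    entries: list[XRefEntry] = []
    offset = 0  # byte position in the stream data; this is the only shared state

    # Handle each subsection on its own: object numbers restart at `first`
    # for every subsection, because subsections need not be contiguous.
    for i in range(0, len(index), 2):
        first, count = index[i], index[i + 1]
        for n in range(count):
            if offset + entry_size > len(data):
                raise ValueError("xref stream data ends before all subsections were read")
            # An absent type field defaults to 1 (uncompressed object).
            kind = read_field(data, offset, widths[0], default=1)
            field2 = read_field(data, offset + widths[0], widths[1], default=0)
            field3 = read_field(data, offset + widths[0] + widths[1], widths[2], default=0)
            entries.append(XRefEntry(first + n, kind, field2, field3))
            offset += entry_size
    return entries
```

With widths [1, 2, 1] and index [0, 1, 5, 2], the second subsection yields object numbers 5 and 6 rather than 1 and 2, which is the non-contiguous case that edd7de3c77 addresses.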