0ct0pu5/ladybird

Author	SHA1	Message	Date
Nico Weber	3d07684891	LibPDF: Extract Parser::parse_inline_image() Pure code move, no intended behavior change. The motivation is just to make Parser::parse_operators() less nested and more focused.	2023-12-22 10:58:54 +01:00
Nico Weber	022fce75a6	LibPDF: Get inline image data from parser to renderer We create an inline_image_end operator that has all the relevant data in a synthetic StreamObject. inline_image_end is still a RENDERER_TODO(), so no real behavior change. (Previously we'd call only inline_image_begin, so the string the todo message is about is now a bit different. But no interesting behavior change.)	2023-12-20 12:19:08 +01:00
Nico Weber	3285502ec6	LibPDF: Extract a Parser::unfilter_stream() method No behavior change.	2023-12-20 12:19:08 +01:00
Ali Mohammad Pur	5e1499d104	Everywhere: Rename {Deprecated => Byte}String This commit un-deprecates DeprecatedString, and repurposes it as a byte string. As the null state has already been removed, there are no other particularly hairy blockers in repurposing this type as a byte string (what it _really_ is). This commit is auto-generated: $ xs=$(ack -l \bDeprecatedString\b\\|deprecated_string AK Userland \ Meta Ports Ladybird Tests Kernel) $ perl -pie 's/\bDeprecatedString\b/ByteString/g; s/deprecated_string/byte_string/g' $xs $ clang-format --style=file -i \ $(git diff --name-only \| grep \.cpp\\|\.h) $ gn format $(git ls-files '.gn' '.gni')	2023-12-17 18:25:10 +03:30
Nico Weber	11354dbf9e	LibPDF: Remember inline image stream bytes We still don't process inline images, but now we have the pieces we need for doing it (`map` and `stream_bytes`).	2023-12-11 10:50:39 +01:00
Nico Weber	cabc6a9d80	LibPDF: Add a comment that PDF 2.0 added a length key for inline images In practice, basically no file has it, since it was only added in 2.0, and 1.7 explicitly said "in particular, the Type, Subtype, and Length entries normally found in a stream or image dictionary are unnecessary."	2023-12-11 10:50:39 +01:00
Nico Weber	071f890847	LibPDF: Require whitespace in front of inline image marker EI Fixes a crash on page 3 of 0000450.pdf of 0000.zip, where we previously started interpreting the middle of an inline image content stream as operators, since it contained `EI` in its pixel data.	2023-12-11 10:50:39 +01:00
Nico Weber	27aae7e2b1	LibPDF: Parse inline image key-value pairs Not used for anything yet.	2023-12-11 10:50:39 +01:00
Nico Weber	0912896ae0	LibPDF: Extract Parser::parse_dict_contents_until() No behavior change.	2023-12-11 10:50:39 +01:00
Kyle Pereira	e4b8d68039	LibPDF: Permit comments at the end of a stream	2023-12-10 16:44:24 +01:00
Nico Weber	e39a790c82	LibPDF: Stop converting encodings in object parser Per 1.7 spec 3.8.1, there are multiple logical text string types: * text strings * ASCII strings * byte strings Text strings can be in UTF-16BE, PDFDocEncoding, or (since PDF 2.0) UTF-8. But byte strings shouldn't be converted but treated as binary data. This makes us no longer convert strings used for drawing page text. TABLE 5.6 "Text-showing operators" lists the operands for text-showing operators as just "string", not "text string" (even though these strings confusingly are called "text strings" in the body text), so not doing this there is correct (and matches other viewers). We also no longer incorrectly convert strings used for crypto data (such as passwords), if they start with a UTF-16BE or UTF-8 marker. No behavior change for outlines and info dict entries. https://pdfa.org/understanding-utf-8-in-pdf-2-0/ has a good overview of this. (ASCII strings only contain ASCII characters and behave the same anyways.)	2023-11-22 09:08:06 -07:00
Nico Weber	14bcb5219d	LibPDF: Tolerate comments before drawing operators Necessary to be able to render https://github.com/pdf-association/pdf20examples/blob/master/pdf20-utf8-test.pdf	2023-11-22 08:56:43 +00:00
Nico Weber	9e8cf4fc1a	LibPDF: Tolerate comment after last dict item Necessary to be able to open https://github.com/pdf-association/pdf20examples/blob/master/pdf20-utf8-test.pdf	2023-11-22 08:56:43 +00:00
Nico Weber	54c98a46d8	LibPDF: Correctly parse the d0 and d1 operators They are the first operator in a type 3 charproc. Operator.h already knew about them, but we didn't manage to parse them, since they're the only two operators that contain a digit.	2023-11-17 19:47:53 +00:00
Tim Ledbetter	b4296e1c9b	LibPDF: Don't use unsanitized values in error messages Previously, constructing error messages with unsanitized input could fail because error message strings must be UTF-8.	2023-10-26 11:05:32 +02:00
Nico Weber	4549d6cf1b	LibPDF: Add a FIXME comment to the inline image data skipping path	2023-10-26 10:59:45 +02:00
Nico Weber	54cdcd0d06	LibPDF: Reject non-hexdigits in hex string with error ...instead of VERIFY()ing input data. I haven't seen this in the wild, but since I'm here anyways, might as well fix this.	2023-10-25 10:44:26 +02:00
Nico Weber	4675700057	LibPDF: Reject unterminated literal strings with an error 0000459.pdf in 0000.zip in the pdfa dataset contains this as the very first object: ``` 1 0 obj << /Creator (Developer 2000) /CreatorDate ( /Author (Oracle Reports) /Producer (Oracle PDF driver) /Title (2021_06_29 Tutoritzacions APTES.PDF) >> endobj ``` The `/CreatorDate` value string is unterminated. Before, we'd assert when trying to check if the first object is a linearization dict. Now, we never read the first object (an error during the linearization dict reading is treated as "file is not linearized") unless we try to print the document's metadata -- and there we now show an error instead of asserting.	2023-10-25 10:44:26 +02:00
Nico Weber	c0f3f1674c	LibPDF: Make string literal parsing fallible ...and make running out of data after a \ an error instead of silently returning an empty string.	2023-10-25 10:44:26 +02:00
Nico Weber	6153dd7b84	LibPDF: Tolerate comments after dict values Makes 0000607.pdf from 0000.zip from the pdfa dataset load.	2023-10-23 09:28:00 -04:00
Nico Weber	a1f17bd643	LibPDF: Skip inline image data in operator stream Inline images can contain arbitrary binary data in the operator stream, greatly confusing the operator parser. Just skip them for now. They'll produce a `Rendering of feature not supported: draw operation: inline_image_begin` diag as usual, so we won't forget about it. After #21536, reduces number of crashes on 300 random PDFs from the web (the first 300 from 0000.zip from https://pdfa.org/new-large-scale-pdf-corpus-now-publicly-available/) from 23 (7%) to 22 (7%). On a larger sample (`Meta/test_pdf.py -n 500 ~/Downloads/0000`), reduces number of crashes from 53 (10.6%) with 36 distinct crash stacks to 46 (9.2%) with 33 distinct stacks.	2023-10-23 07:51:08 +02:00
Ali Mohammad Pur	aeee98b3a1	AK+Everywhere: Remove the null state of DeprecatedString This commit removes DeprecatedString's "null" state, and replaces all its users with one of the following: - A normal, empty DeprecatedString - Optional<DeprecatedString> Note that null states of DeprecatedFlyString/StringView/etc are not affected by this commit. However, DeprecatedString::empty() is now considered equal to a null StringView.	2023-10-13 18:33:21 +03:30
Nico Weber	532230c0e4	LibPDF: Extract a Document::read_filters() method No behavior change.	2023-07-24 09:50:45 -04:00
Nico Weber	77e6dbab33	LibPDF: Fix symbol for text_next_line_show_string_set_spacing operator It's `"`, not `''`. Now the `text_next_line_show_string_set_spacing` gets called and logs a TODO at page render time if `"` is used in a PDF: warning: Rendering of feature not supported: draw operation: text_next_line_show_string_set_spacing It caused a parse error (also at page render time) previously: [parse_value @ .../LibPDF/Parser.cpp:104] Parser error at offset 611: Unexpected char """	2023-07-22 12:25:30 -04:00
Nico Weber	39b2eed3f6	LibPDF: Do not crash on encrypted files that start unluckily PDF files can be linearized. In that case, they start with a "linearization dict" that stores the key `/Linearized` and the value `1`. To check if a file is linearized, we just read the first dict, and then checked if it has that key. If the first object of a PDF was a stream with a compression filter and the input PDF was encrypted and not linearized, then us trying to decode the linearization dict could crash due to stream contents being encrypted, decryption state not yet being initialized, and us trying to decompress stream data before decrypting it. To prevent this, disable uncompression when parsing the first object to determine if it's a linearization dictionary. (A linearization dict never stores string values, so decryption not yet being initialized is not a problem. Integer values aren't encrypted in encrypted PDF files.)	2023-07-12 06:28:15 +02:00
Nico Weber	63670f27de	LibPDF: Rename m_disable_encryption to m_enable_encryption Double negation is confusing. No behavior change.	2023-07-12 06:28:15 +02:00
Nico Weber	93357a8b70	LibPDF: Fix a typo in a function name ...and while here, a comment typo too.	2023-07-05 18:42:39 +01:00
Ben Wiederhake	f866c80222	LibPDF: Avoid unnecessary HashMap copy, mark other copies	2023-05-19 22:33:57 +02:00
Sam Atkins	2db168acc1	LibTextCodec+Everywhere: Port Decoders to new Strings	2023-02-19 17:15:47 +01:00
Sam Atkins	d6075ef5b5	LibTextCodec+Everywhere: Make TextCodec::decoder_for() take a StringView We don't need a full String/DeprecatedString inside this function, so we might as well not force users to create one.	2023-02-15 12:48:26 -05:00
Julian Offenhäuser	96064ec5af	LibPDF: Allow filter DecodeParms array entries to be null Filters will use the default values in this case.	2023-02-12 10:55:37 +00:00
Rodrigo Tobar	a533ea7ae6	LibPDF: Improve stream parsing When parsing streams we rely on a /Length item being defined in the stream's dictionary to know how much data comprises the stream. Its value is usually a direct value, but it can be indirect. There was however a contradiction in the code: the condition that allowed it to read and use the /Length value required it to be a direct value, but the actual code using the value would have worked with indirect ones. This meant that indirect /Length values triggered the fallback, "manual" stream parsing code. On the other hand, this latter code was also buggy, because it relied on the "endstream" keyword to appear on a separate line, which isn't always the case. This commit both fixes the bug in the manual stream parsing scenario, while also allowing for indirect /Length values to be used to parse streams more directly and avoid the manual approach. The main caveat to this second change is that for a brief period of time the Document is not able to resolve references (i.e., before the xref table itself is parsed). Any parsing happening before that (e.g., the linearization dictionary) must therefore use the manual stream parsing approach.	2023-02-08 19:47:15 +01:00
Timothy Flynn	f3db548a3d	AK+Everywhere: Rename FlyString to DeprecatedFlyString DeprecatedFlyString relies heavily on DeprecatedString's StringImpl, so let's rename it to A) match the name of DeprecatedString, B) write a new FlyString class that is tied to String.	2023-01-09 23:00:24 +00:00
Julian Offenhäuser	a37f3390dc	LibPDF: Allow numbers to start with whitespace	2023-01-09 22:54:36 +00:00
Rodrigo Tobar	d9718064d1	LibPDF: Add support for multi-line comments The code parsing comments parsed only a single line of comments, but callers assumed they parsed all comments that appeared contiguously in a block. The latter is an easier to understand API, so this commit changes the parse_comment function to parse entire blocks of comments instead of single lines.	2022-12-16 10:04:23 +01:00
Linus Groh	57dc179b1f	Everywhere: Rename to_{string => deprecated_string}() where applicable This will make it easier to support both string types at the same time while we convert code, and tracking down remaining uses. One big exception is Value::to_string() in LibJS, where the name is dictated by the ToString AO.	2022-12-06 08:54:33 +01:00
Linus Groh	6e19ab2bbc	AK+Everywhere: Rename String to DeprecatedString We have a new, improved string type coming up in AK (OOM aware, no null state), and while it's going to use UTF-8, the name UTF8String is a mouthful - so let's free up the String name by renaming the existing class. Making the old one have an annoying name will hopefully also help with quick adoption :^)	2022-12-06 08:54:33 +01:00
Rodrigo Tobar	e776048309	LibPDF: Ignore whitespace on hex strings The spec says that whitespaces should be ignored, but we weren't. PDFs with whitespaces in their hex strings were thus crashing the parser.	2022-11-30 14:51:14 +01:00
Julian Offenhäuser	0bc3333740	LibPDF: Parse integer numbers with atoi() instead of strtof() strtof() produces rounding errors for very large numbers, which we don't want for integers, as they may have to be precise.	2022-11-19 15:42:08 +01:00
Julian Offenhäuser	c2ad29c85f	LibPDF: Implement png predictor decoding for flate filter For flate and lzw filters, the data can be transformed by this predictor function to make it compress better. For us this means that we have to undo this step in order to get the right result. Although this feature is meant for images, I found at least a few documents that use it all over the place, making this step very important.	2022-11-19 15:42:08 +01:00
Julian Offenhäuser	16ed407c01	LibPDF: Support cascading stream filters You can specify multiple filters as an array, where each one is fed the output of the one before it.	2022-11-19 15:42:08 +01:00
Julian Offenhäuser	becd648a78	LibPDF: Parse hexadecimal values in name objects correctly	2022-11-19 15:42:08 +01:00
Julian Offenhäuser	2f71e0f09a	LibPDF: Allow text operator sequences to start with whitespace	2022-10-16 17:44:54 +02:00
Julian Offenhäuser	633e1632d0	LibPDF: Allow whitespace other than EOL after an object marker	2022-09-17 10:07:14 +01:00
Julian Offenhäuser	65e83bed53	LibPDF: Disallow parsing indirect values as operands An operation like 0 0 0 RG would have been confused for [ 0, 0 0 R ] G	2022-09-17 10:07:14 +01:00
Julian Offenhäuser	4887aacec7	LibPDF: Move document-specific parsing functionality into its own class The Parser class is now a generic PDF object parser, of which the new DocumentParser class derives. DocumentParser now takes over all functions relating to linearization, pages, xref and trailer handling. This allows the use of multiple parsers in the same document's context, which will be needed in order to handle PDF object streams.	2022-09-17 10:07:14 +01:00
Julian Offenhäuser	9f4659cc63	LibPDF: Move consume and match helper functions to the Reader class	2022-09-17 10:07:14 +01:00
sin-ack	3f3f45580a	Everywhere: Add sv suffix to strings relying on StringView(char const*) Each of these strings would previously rely on StringView's char const* constructor overload, which would call __builtin_strlen on the string. Since we now have operator ""sv, we can replace these with much simpler versions. This opens the door to being able to remove StringView(char const*). No functional changes.	2022-07-12 23:11:35 +02:00
Idan Horowitz	086969277e	Everywhere: Run clang-format	2022-04-01 21:24:45 +01:00
Matthew Olsson	468ceb1b48	LibPDF: Rename Command to Operator This is the correct name, according to the spec	2022-03-31 18:10:45 +02:00

93 commits