This is required by the CFF spec, and is consistent with what we do for
the encoding 24 lines down.
As far as I can tell, nothing in `Type1FontProgram::rasterize_glyph()`
or in Type1Font.cpp implements the "If an encoding maps to a character
name that does not exist in the Type 1 font program, the .notdef glyph
is substituted." line from the PDF 1.7 spec (in 5.5.5 Character
Encoding, Encodings for Type 1 Fonts) yet, so this does not yet have an
effect.
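To spell out roughly what that spec sentence asks for (with made-up accessor
names, and again: not implemented at this point):

    // If the encoding names a glyph the font program doesn't define,
    // the .notdef glyph should be drawn instead of failing.
    auto glyph = font_program->glyph_for_name(glyph_name); // assumed accessor
    if (!glyph.has_value())
        glyph = font_program->glyph_for_name(".notdef"sv);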
Together with the already-merged #23122, #23128, #23135, #23136, #23162,
#23167, #23179, #23190, and #23194, this adds initial support for
rendering some CFF-based Type0 fonts :^)
There's a long list of things that still need improving after this:
* A small number of CFF programs contain the charstring command 0,
which is invalid. Currently, this makes us reject the whole font.
* Type1FontProgram::rasterize_glyph() is name-based. For CID-based
fonts, we want a version that takes CIDs (character IDs) instead.
For now, I'm printing the CID to a string and using that, yuck.
(I looked into doing this nicely. I do want to do that, but I
need to read up on how the `seac` type1 charstring command uses
character names to identify parts of an accented character.
Also, it looks like `seac`'s accented character handling moved
over to `endchar` in type2 charstring commands (i.e. in CFF data),
and it looks like we don't implement that at all. So I need to do
more reading first, and I didn't want to block this on that.)
* The name for the first string in name-based CFF fonts looks wrong;
added a FIXME for that for now.
* This supports the named Identity-H cmap only for now. Identity-H
maps UTF16-BE values to glyph IDs with the identity function, and
assumes it's horizontal text (see the sketch after this list). Other
named cmaps in my test files are UniJIS-UCS2-H, UniCNS-UCS2-H,
Identity-V, UniGB-UCS2-H, UniKS-UCS2-H.
(There are also 2 files using the stream-based cmaps instead of the
name-based ones.)
* In particular, we can't draw vertical text (`-V`) yet
* Passing in the encoding to CFF::create() is awkward (it's nullptr
for CID-keyed fonts), and it's also not necessary since
`Type1Font::draw_glyph()` already does the "take encoding from PDF,
and only from font if the PDF doesn't store one" dance.
* This doesn't cache glyphs but re-rasterizes them each time. Easy
to add, but maybe I want to look at rotation first. And things
don't feel glacial as-is.
* Type0Font::draw_glyph() is pretty similar to the second half of
Type1Font::draw_glyph()
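A rough sketch of what the Identity-H mapping in the list above amounts to; the
function name and the ReadonlyBytes-based signature are assumptions, not the
actual Type0Font code:

    #include <AK/Span.h>
    #include <AK/Vector.h>

    // Identity-H: read the string as 2-byte big-endian codes and use each code
    // directly as the CID / glyph ID, laid out as horizontal text.
    static Vector<u32> cids_for_identity_h(ReadonlyBytes bytes)
    {
        Vector<u32> cids;
        for (size_t i = 0; i + 1 < bytes.size(); i += 2)
            cids.append((static_cast<u32>(bytes[i]) << 8) | bytes[i + 1]);
        return cids;
    }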
Make TopDict's defaultWidthX and nominalWidthX Optional<>s so that
we can check if they're set per fdselect-selected font dict, and
if so use the value from there in CID-keyed fonts. Otherwise, keep
using the value in the top dict.
* FDArray, FDSelect must be present
* Encoding must not be present
* Charset maps from GID (Glyph ID) to CID (Character ID),
instead of to character name
The fdselect array (that we already read) maps each glyph ID
to an fdarray index. The font dict at that index then stores
information for that glyph.
In practice, this is used to assign different defaultWidthX /
nominalWidthX values to blocks of glyphs in CID-keyed fonts.
We don't do anything yet with the data, and we also don't send
data of CID-keyed CFFs into this parser either, so no behavior
change.
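A minimal sketch of the lookup described above, with made-up names and types
(the real parser stores this differently): use the width default from the
FDSelect-selected font dict if it has one, otherwise fall back to the top dict.

    #include <AK/Optional.h>
    #include <AK/Vector.h>

    struct WidthDefaults {
        Optional<float> default_width_x;
        Optional<float> nominal_width_x;
    };

    static float default_width_for_glyph(size_t glyph_id,
        Vector<u8> const& fdselect,           // glyph ID -> font dict index
        Vector<WidthDefaults> const& fdarray, // per-font-dict width defaults
        WidthDefaults const& top_dict)
    {
        if (glyph_id < fdselect.size() && fdselect[glyph_id] < fdarray.size()) {
            auto const& font_dict = fdarray[fdselect[glyph_id]];
            if (font_dict.default_width_x.has_value())
                return *font_dict.default_width_x;
        }
        // The spec default for defaultWidthX is 0.
        return top_dict.default_width_x.value_or(0.0f);
    }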
This happens for CFFs that contain multiple fonts. This doesn't
happen in practice, but the same code will be used for fdarray
parsing, which will contain several dicts.
No behavior change.
This is one of the two top dict entries we need for CID-keyed fonts.
We don't send any CID-keyed font data into the CFF parser yet,
so no behavior change.
No behavior change, except that we now dbgln() if we see a
PrivDictOperator we don't know about. (I haven't seen this in
practice, but I found this useful while debugging things.)
...and do string expansion at the call site.
CID-keyed fonts treat the charset as CIDs instead of as SIDs,
so having access to the SIDs in numeric form will be useful
when we implement support for CID-keyed CFF fonts.
No behavior change.
I implemented CFF charset format 2 in 6f783929dd with the note
"I haven't seen this being used in the wild". Now that I have
seen it (0000658.pdf), I can say that this has never worked,
despite me claiming "it's easy to implement".
But now it works!
This commit un-deprecates DeprecatedString, and repurposes it as a byte
string.
As the null state has already been removed, there are no other
particularly hairy blockers in repurposing this type as a byte string
(what it _really_ is).
This commit is auto-generated:
$ xs=$(ack -l \bDeprecatedString\b\|deprecated_string AK Userland \
Meta Ports Ladybird Tests Kernel)
$ perl -pie 's/\bDeprecatedString\b/ByteString/g;
s/deprecated_string/byte_string/g' $xs
$ clang-format --style=file -i \
$(git diff --name-only | grep \.cpp\|\.h)
$ gn format $(git ls-files '*.gn' '*.gni')
`left` might be a number bigger than the number of glyphs actually in
the CFF.
The spec says "The number of ranges is not explicitly specified in the
font. Instead, software utilizing this data simply processes ranges
until all glyphs in the font are covered." Apparently we have to check
for this within each range as well.
Needed for example in 0000054.pdf and 0000354.pdf in 0000.zip in the
pdfa dataset.
Together with the previous commit:
From 21 (7%) to 20 (6%) crashes on the 300 first PDFs, and from
41 (8.2%) to 39 (7.8%) on the 500-random PDFs test.
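A sketch of the range handling described above, with assumed names (format 1
shown; format 2 only differs in the width of the "left" field): stop as soon as
every glyph is covered, even in the middle of a range.

    #include <AK/Vector.h>

    struct CharsetRange {
        u16 first_sid;
        u8 glyphs_left; // additional glyphs in this range; may overshoot the glyph count
    };

    static Vector<u16> expand_charset_ranges(Vector<CharsetRange> const& ranges, size_t glyph_count)
    {
        Vector<u16> sids;
        sids.append(0); // glyph 0 is always .notdef and isn't stored in the charset
        for (auto const& range : ranges) {
            for (u16 i = 0; i <= range.glyphs_left; ++i) {
                if (sids.size() >= glyph_count)
                    return sids; // every glyph covered: ignore the rest of this range
                sids.append(range.first_sid + i);
            }
        }
        return sids;
    }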
...and replace template instantiations with a loop, to make this
easily possible.
Vaguely nice for code size as well.
Needed for example in 0000054.pdf and 0000354.pdf in 0000.zip in the
pdfa dataset.
We used to use a u8 as loop counter, which would overflow
if there were more than 255 glyphs, producing hundreds of megabytes
of
Couldn't find string for SID x, going with space
output in the process, while all data until the end of the CFF
section got interpreted as SIDs, until a try_read() would finally
fail.
We now no longer fail miserably trying to render page 2 of
0000352.pdf of 0000.zip from the pdfa dataset.
Fixes just one crash of the larger 500-document test set, but
when I tweak test_pdf.py to print all stacks instead of just the
top 5, it no longer produces 260 MB of output.
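The gist of the fix, with made-up names: the loop counter just has to be wide
enough for fonts with more than 255 glyphs.

    // Before: `for (u8 i = 0; i < glyph_count; ++i)` wraps around once
    // glyph_count > 255 and never terminates.
    for (size_t glyph_index = 0; glyph_index < glyph_count; ++glyph_index) {
        // read one charset entry (SID) for this glyph ...
    }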
Font programs are bytecode programs defining glyphs. If several glyphs
share a piece of outline, that opcode sequence can be put in a
subroutine ("subr") table and the definition of those glyphs can then
call that subroutine by number, to reduce file size.
CFF fonts can in theory contain multiple fonts, and so there's a global
subr table shared by all the fonts in one CFF, and a local per-font
subr table. We used to only implement the local subr table, now we
implement both.
(We only support one font per CFF, and at least in PDF files, that's
all that's ever used. So a global subr table isn't very useful.
But the spec explicitly allows it -- "Global subroutines may be used in
a FontSet even if it only contains one font." -- and it happens in
practice.)
This was the last piece of data we didn't read yet.
(We also don't yet support multiple fonts per CFF, but I haven't
found a PDF using that yet.)
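One detail worth noting about subr calls (this is from the Type 2 charstring
spec; the helper name is made up): the operand of callsubr/callgsubr is biased
by an amount that depends on the size of the subr table it indexes, so both the
global and the local table need the same bias computation.

    // Type 2 charstrings: "n callsubr" actually refers to subrs[n + bias].
    static int subr_bias(size_t subroutine_count)
    {
        if (subroutine_count < 1240)
            return 107;
        if (subroutine_count < 33900)
            return 1131;
        return 32768;
    }

    // callgsubr indexes the global table, callsubr the per-font local one;
    // each call uses the bias computed from its own table's size.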
We still don't do anything with it, but now we at least print a
warning if this data is there and we ignore it.
With this, all tables from the spec appendixes are in CFF.cpp.
This fixes a crash reading page 2 (and onward) of
2ThestructureoftheCIE1997ColourAppearanceModelCIECAM97s.pdf in
the pdffiles repo.
The encoding offset defaults to 0, i.e. the Standard Encoding.
That means reading the encoding only if the tag is present causes
us to not read it if a font uses the Standard Encoding.
Now, we always read an encoding, even if it's the (implicit) default
one.
The main encoding data maps glyph ID ("GID") to its codepoint.
If a glyph has several codepoints, then a secondary table mapping
codepoint to string ID ("SID") of the glyph's name is present.
(A separate table associates each glyph with its name already.)
I haven't seen this used in the wild, but the structure of the
supplemental data is also going to be needed for built-in encodings.
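A sketch of the layout just described, using an assumed big-endian Reader API
(read_u8/read_u16); the actual CFF.cpp code is structured differently:

    // Built-in encoding, format 0: one code per glyph, glyph 0 (.notdef) implicit.
    // If the high bit of the format byte is set, supplemental entries follow,
    // each mapping an extra code to the SID of a glyph's name.
    static void read_builtin_encoding(Reader& reader)
    {
        u8 format = reader.read_u8();
        if ((format & 0x7f) == 0) {
            u8 code_count = reader.read_u8();
            for (u8 i = 0; i < code_count; ++i) {
                u8 code = reader.read_u8(); // code for glyph i + 1
                // ... record GID i + 1 -> code ...
            }
        }
        if (format & 0x80) {
            u8 supplement_count = reader.read_u8();
            for (u8 i = 0; i < supplement_count; ++i) {
                u8 code = reader.read_u8();
                u16 glyph_name_sid = reader.read_u16();
                // ... record code -> the glyph whose name has SID glyph_name_sid ...
            }
        }
    }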
I haven't seen this being used in the wild (yet), but it's easy
to implement, and with this we support all charset formats.
So we can now mention if we see a format we don't know about.
From "10 String INDEX":
"Further space saving is obtained by allocating commonly occurring
strings to predefined SIDs. These strings, known as the standard
strings, describe all the names used in the ISOAdobe and Expert
character sets along with a few other strings common to Type 1 fonts. A
complete list of standard strings is given in Appendix A. The client
program will contain an array of standard strings with nStdStrings
elements. Thus, the standard strings take SIDs in the range 0 to
(nStdStrings-1)."
And "13 Charsets" says that charsets store SIDs.
Fixes all
"Couldn't find string for SID $n, going with space"
messages when going through the encoding pages (page 1010 and
thereabouts) in the PDF 1.7 spec.
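The lookup the quoted text implies is roughly this (the CFF spec defines 391
standard strings; the identifiers below are assumptions, not LibPDF's actual
names):

    #include <AK/StringView.h>
    #include <AK/Vector.h>

    static constexpr StringView s_standard_strings[] = {
        ".notdef"sv, "space"sv, "exclam"sv, "quotedbl"sv,
        // ... the remaining names from Appendix A, 391 entries in total ...
    };
    static constexpr size_t number_of_standard_strings = 391;

    static StringView resolve_sid(size_t sid, Vector<StringView> const& string_index)
    {
        if (sid < number_of_standard_strings)
            return s_standard_strings[sid]; // predefined standard string
        size_t index = sid - number_of_standard_strings;
        if (index < string_index.size())
            return string_index[index]; // string from the font's String INDEX
        return "space"sv; // the fallback the warning above complains about
    }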
Only really useful for reading SIDs in the Top DICT (copyright
text etc), which we currently don't do.
I haven't seen a difference from looking things up in the string
table. The only real effect from the commit that I need is that
it pulls a local resolve() lambda into a real function
resolve_sid(), which I want to call in a future commit.
But it makes things more spec-compliant, and if we ever want to
read SIDs in metadata in the future, now we can.
We'd unconditionally get the int from a Variant<int, float> here,
but PDFs often have a float for defaultWidthX and nominalWidthX.
Fixes crash opening Bakke2010a.pdf from pdffiles (but while the
file loads ok, it looks completely busted).
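The gist of the fix, sketched with AK::Variant (the helper name is made up):
convert whichever alternative is actually stored instead of unconditionally
calling get<int>().

    #include <AK/Variant.h>

    static float operand_to_float(Variant<int, float> const& operand)
    {
        if (operand.has<int>())
            return static_cast<float>(operand.get<int>());
        return operand.get<float>();
    }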
These are not yet actually parsed, but detecting them means we at least
don't fail to understand the *actual* format value, which was causing
some CFF fonts to fail to load.
The first iteration has enough SIDs to display simple documents, but
when trying more and more documents we started to need more of these
SIDs to be properly defined. This is a copy/paste exercise from the CFF
document, which is tedious, so it will continue in small drops.
This commit fills all the gaps until SID 228, which covers all the
ISOAdobe space, and should be enough for most use cases. Since this is a
contiguous space starting at 0, we now use an Array instead of a Map to
store these names, which should be more performant. Also to simplify
things I've moved the Array out of the CFF class, making it a simpler
static variable, which allows us to use template type deduction.
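A sketch of what the storage change amounts to (identifier names are
assumptions): since the SIDs form a contiguous range starting at 0, a constexpr
Array outside the class works, and the deduction guide infers the element count
from the initializer list.

    #include <AK/Array.h>
    #include <AK/StringView.h>

    // Lookups become s_cff_standard_strings[sid] instead of a hash map probe.
    static constexpr Array s_cff_standard_strings {
        ".notdef"sv, "space"sv, "exclam"sv, "quotedbl"sv, "numbersign"sv,
        // ... continues through the ISOAdobe names up to SID 228 ...
    };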