This is required by the CFF spec, and is consistent with what we do for
the encoding 24 lines down.
As far as I can tell, nothing in `Type1FontProgram::rasterize_glyph()`
or in Type1Font.cpp implements the "If an encoding maps to a character
name that does not exist in the Type 1 font program, the .notdef glyph
is substituted." line from the PDF 1.7 spec (in 5.5.5 Character
Encoding, Encodings for Type 1 Fonts) yet, so this does not yet have an
effect.
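To spell out roughly what that spec sentence asks for (with made-up accessor
names, and again: not implemented at this point):

    // If the encoding names a glyph the font program doesn't define,
    // the .notdef glyph should be drawn instead of failing.
    auto glyph = font_program->glyph_for_name(glyph_name); // assumed accessor
    if (!glyph.has_value())
        glyph = font_program->glyph_for_name(".notdef"sv);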
Together with the already-merged #23122, #23128, #23135, #23136, #23162,
#23167, #23179, #23190, and #23194, this adds initial support for
rendering some CFF-based Type0 fonts :^)
There's a long list of things that still need improving after this:
* A small number of CFF programs contain the charstring command 0,
which is invalid. Currently, this makes us reject the whole font.
* Type1FontProgram::rasterize_glyph() is name-based. For CID-based
fonts, we want a version that takes CIDs (character IDs) instead.
For now, I'm printing the CID to a string and using that, yuck.
(I looked into doing this nicely. I do want to do that, but I
need to read up on how the `seac` type1 charstring command uses
character names to identify parts of an accented character.
Also, it looks like `seac`'s accented character handling moved
over to `endchar` in type2 charstring commands (i.e. in CFF data),
and it looks like we don't implement that at all. So I need to do
more reading first, and I didn't want to block this on that.)
* The name for the first string in name-based CFF fonts looks wrong;
added a FIXME for that for now.
* This supports the named Identity-H cmap only for now. Identity-H
maps UTF16-BE values to glyph IDs with the identity function, and
assumes it's horizontal text (see the sketch after this list). Other
named cmaps in my test files are UniJIS-UCS2-H, UniCNS-UCS2-H,
Identity-V, UniGB-UCS2-H, UniKS-UCS2-H.
(There are also 2 files using the stream-based cmaps instead of the
name-based ones.)
* In particular, we can't draw vertical text (`-V`) yet
* Passing in the encoding to CFF::create() is awkward (it's nullptr
for CID-keyed fonts), and it's also not necessary since
`Type1Font::draw_glyph()` already does the "take encoding from PDF,
and only from font if the PDF doesn't store one" dance.
* This doesn't cache glyphs but re-rasterizes them each time. Easy
to add, but maybe I want to look at rotation first. And things
don't feel glacial as-is.
* Type0Font::draw_glyph() is pretty similar to the second half of
Type1Font::draw_glyph()
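A rough sketch of what the Identity-H mapping in the list above amounts to; the
function name and the ReadonlyBytes-based signature are assumptions, not the
actual Type0Font code:

    #include <AK/Span.h>
    #include <AK/Vector.h>

    // Identity-H: read the string as 2-byte big-endian codes and use each code
    // directly as the CID / glyph ID, laid out as horizontal text.
    static Vector<u32> cids_for_identity_h(ReadonlyBytes bytes)
    {
        Vector<u32> cids;
        for (size_t i = 0; i + 1 < bytes.size(); i += 2)
            cids.append((static_cast<u32>(bytes[i]) << 8) | bytes[i + 1]);
        return cids;
    }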
Make TopDict's defaultWidthX and nominalWidthX Optional<>s so that
we can check if they're set per fdselect-selected font dict, and
if so use the value from there in CID-keyed fonts. Otherwise, keep
using the value in the top dict.
* FDArray, FDSelect must be present
* Encoding must not be present
* Charset maps from GID (Glyph ID) to CID (Character ID),
instead of to character name
The fdselect array (that we already read) maps each glyph ID
to an fdarray index. The font dict at that index then stores
information for that glyph.
In practice, this is used to assign different defaultWidthX /
nominalWidthX values to blocks of glyphs in CID-keyed fonts.
We don't do anything yet with the data, and we also don't send
data of CID-keyed CFFs into this parser either, so no behavior
change.
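A minimal sketch of the lookup described above, with made-up names and types
(the real parser stores this differently): use the width default from the
FDSelect-selected font dict if it has one, otherwise fall back to the top dict.

    #include <AK/Optional.h>
    #include <AK/Vector.h>

    struct WidthDefaults {
        Optional<float> default_width_x;
        Optional<float> nominal_width_x;
    };

    static float default_width_for_glyph(size_t glyph_id,
        Vector<u8> const& fdselect,           // glyph ID -> font dict index
        Vector<WidthDefaults> const& fdarray, // per-font-dict width defaults
        WidthDefaults const& top_dict)
    {
        if (glyph_id < fdselect.size() && fdselect[glyph_id] < fdarray.size()) {
            auto const& font_dict = fdarray[fdselect[glyph_id]];
            if (font_dict.default_width_x.has_value())
                return *font_dict.default_width_x;
        }
        // The spec default for defaultWidthX is 0.
        return top_dict.default_width_x.value_or(0.0f);
    }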
This happens for CFFs that contain multiple fonts. This doesn't
happen in practice, but the same code will be used for fdarray
parsing, which will contain several dicts.
No behavior change.
This is one of the two top dict entries we need for CID-keyed fonts.
We don't send any CID-keyed font data into the CFF parser yet,
so no behavior change.
No behavior change, except that we now dbgln() if we see a
PrivDictOperator we don't know about. (I haven't seen this in
practice, but I found this useful while debugging things.)
...and do string expansion at the call site.
CID-keyed fonts treat the charset as CIDs instead of as SIDs,
so having access to the SIDs in numeric form will be useful
when we implement support for CID-keyed CFF fonts.
No behavior change.
I implemented CFF charset format 2 in 6f783929dd with the note
"I haven't seen this being used in the wild". Now that I have
seen it (0000658.pdf), I can say that this has never worked,
despite me claiming "it's easy to implement".
But now it works!
This commit un-deprecates DeprecatedString, and repurposes it as a byte
string.
As the null state has already been removed, there are no other
particularly hairy blockers in repurposing this type as a byte string
(what it _really_ is).
This commit is auto-generated:
$ xs=$(ack -l \bDeprecatedString\b\|deprecated_string AK Userland \
Meta Ports Ladybird Tests Kernel)
$ perl -pie 's/\bDeprecatedString\b/ByteString/g;
s/deprecated_string/byte_string/g' $xs
$ clang-format --style=file -i \
$(git diff --name-only | grep \.cpp\|\.h)
$ gn format $(git ls-files '*.gn' '*.gni')
`left` might be a number bigger than the number of glyphs actually in
the CFF.
The spec says "The number of ranges is not explicitly specified in the
font. Instead, software utilizing this data simply processes ranges
until all glyphs in the font are covered." Apparently we have to check
for this within each range as well.
Needed for example in 0000054.pdf and 0000354.pdf in 0000.zip in the
pdfa dataset.
Together with the previous commit:
From 21 (7%) to 20 (6%) crashes on the 300 first PDFs, and from
41 (8.2%) to 39 (7.8%) on the 500-random PDFs test.
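A sketch of the range handling described above, with assumed names (format 1
shown; format 2 only differs in the width of the "left" field): stop as soon as
every glyph is covered, even in the middle of a range.

    #include <AK/Vector.h>

    struct CharsetRange {
        u16 first_sid;
        u8 glyphs_left; // additional glyphs in this range; may overshoot the glyph count
    };

    static Vector<u16> expand_charset_ranges(Vector<CharsetRange> const& ranges, size_t glyph_count)
    {
        Vector<u16> sids;
        sids.append(0); // glyph 0 is always .notdef and isn't stored in the charset
        for (auto const& range : ranges) {
            for (u16 i = 0; i <= range.glyphs_left; ++i) {
                if (sids.size() >= glyph_count)
                    return sids; // every glyph covered: ignore the rest of this range
                sids.append(range.first_sid + i);
            }
        }
        return sids;
    }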
...and replace template instantiations with a loop, to make this
easily possible.
Vaguely nice for code size as well.
Needed for example in 0000054.pdf and 0000354.pdf in 0000.zip in the
pdfa dataset.
We used to use a u8 as loop counter, which would overflow
if there were more than 255 glyphs, producing hundreds of megabytes
of
Couldn't find string for SID x, going with space
output in the process, while all data until the end of the CFF
section got interpreted as SIDs, until a try_read() would finally
fail.
We now no longer fail miserably trying to render page 2 of
0000352.pdf of 0000.zip from the pdfa dataset.
Fixes just one crash of the larger 500-document test set, but
when I tweak test_pdf.py to print all stacks instead of just the
top 5, it no longer produces 260 MB of output.
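The gist of the fix, with made-up names: the loop counter just has to be wide
enough for fonts with more than 255 glyphs.

    // Before: `for (u8 i = 0; i < glyph_count; ++i)` wraps around once
    // glyph_count > 255 and never terminates.
    for (size_t glyph_index = 0; glyph_index < glyph_count; ++glyph_index) {
        // read one charset entry (SID) for this glyph ...
    }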
Font programs are bytecode programs defining glyphs. If several glyphs
share a piece of outline, that opcode sequence can be put in a
subroutine ("subr") table and the definition of those glyphs can then
call that subroutine by number, to reduce file size.
CFF fonts can in theory contain multiple fonts, and so there's a global
subr table shared by all the fonts in one CFF, and a local per-font
subr table. We used to only implement the local subr table, now we
implement both.
(We only support one font per CFF, and at least in PDF files, that's
all that's ever used. So a global subr table isn't very useful.
But the spec explicitly allows it -- "Global subroutines may be used in
a FontSet even if it only contains one font." -- and it happens in
practice.)
This was the last piece of data we didn't read yet.
(We also don't yet support multiple fonts per CFF, but I haven't
found a PDF using that yet.)
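One detail worth noting about subr calls (this is from the Type 2 charstring
spec; the helper name is made up): the operand of callsubr/callgsubr is biased
by an amount that depends on the size of the subr table it indexes, so both the
global and the local table need the same bias computation.

    // Type 2 charstrings: "n callsubr" actually refers to subrs[n + bias].
    static int subr_bias(size_t subroutine_count)
    {
        if (subroutine_count < 1240)
            return 107;
        if (subroutine_count < 33900)
            return 1131;
        return 32768;
    }

    // callgsubr indexes the global table, callsubr the per-font local one;
    // each call uses the bias computed from its own table's size.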
We still don't do anything with it, but now we at least print a
warning if this data is there and we ignore it.
With this, all tables from the spec appendixes are in CFF.cpp.
This fixes a crash reading page 2 (and onward) of
2ThestructureoftheCIE1997ColourAppearanceModelCIECAM97s.pdf in
the pdffiles repo.
The encoding offset defaults to 0, i.e. the Standard Encoding.
That means reading the encoding only if the tag is present causes
us to not read it if a font uses the Standard Encoding.
Now, we always read an encoding, even if it's the (implicit) default
one.
The main encoding data maps glyph ID ("GID") to its codepoint.
If a glyph has several codepoints, then a secondary table mapping
codepoint to string ID ("SID") of the glyph's name is present.
(A separate table associates each glyph with its name already.)
I haven't seen this used in the wild, but the structure of the
supplemental data is also going to be needed for built-in encodings.
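A sketch of the layout just described, using an assumed big-endian Reader API
(read_u8/read_u16); the actual CFF.cpp code is structured differently:

    // Built-in encoding, format 0: one code per glyph, glyph 0 (.notdef) implicit.
    // If the high bit of the format byte is set, supplemental entries follow,
    // each mapping an extra code to the SID of a glyph's name.
    static void read_builtin_encoding(Reader& reader)
    {
        u8 format = reader.read_u8();
        if ((format & 0x7f) == 0) {
            u8 code_count = reader.read_u8();
            for (u8 i = 0; i < code_count; ++i) {
                u8 code = reader.read_u8(); // code for glyph i + 1
                // ... record GID i + 1 -> code ...
            }
        }
        if (format & 0x80) {
            u8 supplement_count = reader.read_u8();
            for (u8 i = 0; i < supplement_count; ++i) {
                u8 code = reader.read_u8();
                u16 glyph_name_sid = reader.read_u16();
                // ... record code -> the glyph whose name has SID glyph_name_sid ...
            }
        }
    }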
I haven't seen this being used in the wild (yet), but it's easy
to implement, and with this we support all charset formats.
So we can now mention if we see a format we don't know about.
From "10 String INDEX":
"Further space saving is obtained by allocating commonly occurring
strings to predefined SIDs. These strings, known as the standard
strings, describe all the names used in the ISOAdobe and Expert
character sets along with a few other strings common to Type 1 fonts. A
complete list of standard strings is given in Appendix A. The client
program will contain an array of standard strings with nStdStrings
elements. Thus, the standard strings take SIDs in the range 0 to
(nStdStrings-1)."
And "13 Charsets" says that charsets store SIDs.
Fixes all
"Couldn't find string for SID $n, going with space"
messages when going through the encoding pages (page 1010 and
thereabouts) in the PDF 1.7 spec.
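The lookup the quoted text implies is roughly this (the CFF spec defines 391
standard strings; the identifiers below are assumptions, not LibPDF's actual
names):

    #include <AK/StringView.h>
    #include <AK/Vector.h>

    static constexpr StringView s_standard_strings[] = {
        ".notdef"sv, "space"sv, "exclam"sv, "quotedbl"sv,
        // ... the remaining names from Appendix A, 391 entries in total ...
    };
    static constexpr size_t number_of_standard_strings = 391;

    static StringView resolve_sid(size_t sid, Vector<StringView> const& string_index)
    {
        if (sid < number_of_standard_strings)
            return s_standard_strings[sid]; // predefined standard string
        size_t index = sid - number_of_standard_strings;
        if (index < string_index.size())
            return string_index[index]; // string from the font's String INDEX
        return "space"sv; // the fallback the warning above complains about
    }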
Only really useful for reading SIDs in the Top DICT (copyright
text etc), which we currently don't do.
I haven't seen a difference from looking things up in the string
table. The only real effect from the commit that I need is that
it pulls a local resolve() lambda into a real function
resolve_sid(), which I want to call in a future commit.
But it makes things more spec-compliant, and if we ever want to
read SIDs in metadata in the future, now we can.
We'd unconditionally get the int from a Variant<int, float> here,
but PDFs often have a float for defaultWidthX and nominalWidthX.
Fixes crash opening Bakke2010a.pdf from pdffiles (but while the
file loads ok, it looks completely busted).
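The gist of the fix, sketched with AK::Variant (the helper name is made up):
convert whichever alternative is actually stored instead of unconditionally
calling get<int>().

    #include <AK/Variant.h>

    static float operand_to_float(Variant<int, float> const& operand)
    {
        if (operand.has<int>())
            return static_cast<float>(operand.get<int>());
        return operand.get<float>();
    }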
These are not yet actually parsed, but detecting them means we at least
don't fail to understand the *actual* format value, which was causing
some CFF fonts to fail to load.
The first iteration has enough SIDs to display simple documents, but
when trying more and more documents we started to need more of these
SIDs to be properly defined. This is a copy/paste exercise from the CFF
document, which is tedious, so it will continue in small drops.
This commit fills all the gaps until SID 228, which covers all the
ISOAdobe space, and should be enough for most use cases. Since this is a
contiguous space starting at 0, we now use an Array instead of a Map to
store these names, which should be more performant. Also to simplify
things I've moved the Array out of the CFF class, making it a simpler
static variable, which allows us to use template type deduction.
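A sketch of what the storage change amounts to (identifier names are
assumptions): since the SIDs form a contiguous range starting at 0, a constexpr
Array outside the class works, and the deduction guide infers the element count
from the initializer list.

    #include <AK/Array.h>
    #include <AK/StringView.h>

    // Lookups become s_cff_standard_strings[sid] instead of a hash map probe.
    static constexpr Array s_cff_standard_strings {
        ".notdef"sv, "space"sv, "exclam"sv, "quotedbl"sv, "numbersign"sv,
        // ... continues through the ISOAdobe names up to SID 228 ...
    };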