All "Simple Fonts" in PDF (all but Type0 fonts) have the property that
glyphs are selected with single byte character codes. This means that
the Encoding objects should use u8 for representing these character
codes. Moreover, and as mentioned in a previous commit, there is no need
to store the unicode code point associated with a character (which was
in turn wrongly associated to a glyph).
This commit greatly simplifies the Encoding class. Namely it:
* Removes the unnecessary CharDescriptor class.
* Changes the internal maps to be u8 -> FlyString and vice-versa,
effectively providing two-way lookups.
* Adds a new method to set a two-way u8 -> FlyString mapping and uses
it in all possible places.
* Simplified the creation of Encoding objects.
* Changes how the WinAnsi special treatment for bullet points is
implemented.
The initial values were fine, but those starting at 100 were wrong: they
are all octal values, but since they were missing an initial 0 they were
interpreted as decimals.
In PDF's fonts, encoding objects are used to translate bytes into fonts'
glyphs. Glyphs (in the fonts we currently support) organise their glyphs
in such a way that they are accessed by name, and thus encoding
translate between a byte sequence and a glyph name.
Note that an no point this translation includes a Unicode character, and
therefore assigning a character to a glyph in the Encoding object is the
wrong thing to do. Moreover, using the code point for this character
during the byte-sequence-to-glyph translation sequence is double-wrong.
This commit removes the characters associated to each translation in the
built-in Encoding objects. In order to keep commits short and sweet, I'm
currently simply removing the character from the enumeration, leaving
the old structure this information was held on intact. Instead, I'm
filling the "code_point" member with a zero, and filling both mappings
(which will be changed later on too) with the glyph name and the
associated char code.
Fonts with the encoding name "WinAnsiEncoding" should render missing
characters above character code 040 (octal) as a "bullet" character.
This patch adds Encoding::should_map_to_bullet(char_code) which is then
called by char_code_to_code_point() to check if the given char code
should be displayed as a bullet instead.
I didn't have a good way to test this, so I've only verified that it
works by manually overriding inputs to the function during the rendering
stage.
This takes care of a FIXME in the Annex D part of the PDF specification.
We have a new, improved string type coming up in AK (OOM aware, no null
state), and while it's going to use UTF-8, the name UTF8String is a
mouthful - so let's free up the String name by renaming the existing
class.
Making the old one have an annoying name will hopefully also help with
quick adoption :^)
Previously we would draw all text, no matter what font type, as
Liberation Serif, which results in things like ugly character spacing.
We now have partial support for drawing Type 1 glyphs, which are part of
a PostScript font program. We completely ignore hinting for now, which
results in ugly looking characters at low resolutions, but gain support
for a large number of typefaces, including most of the default fonts
used in TeX.
When looking up differences in the specified encoding, we previously
didn't recognize a lot of characters, namely those that are referred to
by a string in the PDF itself, like "/germandbls".
We now create a mapping of those characters to the code points they are
referring to, and correctly look them up when needed.