[PDF spec] Fonts
With the goal of getting `tests\viewer.py` to render fonts appropriately, I've been reading about how to extract font programs from PDF documents so that they can be displayed in `tkinter`. Loading fonts into `tkinter` is a non-trivial task, but I have a solution for loading fonts on Windows, provided they are in a recognized format and we know the name of the family (see my Stack Overflow answer for details).
The question then becomes: how do we extract the font family name and the embedded font program (if any) from the PDF document? I'm putting together this wiki page to keep track of my efforts towards that question.
When drawing text on a PDF page, the application keeps track of what's known as the text state. In the text state, there is a parameter Tf called the text font. Whenever text is drawn on the page, it is drawn using the font stored in the Tf field of the text state. The text font is set and updated through the use of the `Tf` graphics operator.
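As a rough illustration of how `Tf` shows up in a page's content stream, here is a toy scanner. It is only a sketch: the function name `find_font_selections` and the sample stream are made up for this page, and splitting on whitespace would break on real streams (string operands can contain spaces), so anything serious should go through pdfminer's interpreter instead. The second operand of `Tf` is the font size.

```python
def find_font_selections(content_stream):
    """Return (font_resource_name, size) for every Tf operator found.

    Toy tokenizer: it splits on whitespace, so it is only valid for content
    streams with no spaces inside string operands and no comments.
    """
    tokens = content_stream.split()
    selections = []
    for i, token in enumerate(tokens):
        # Tf takes two operands: a name (e.g. /F0) and a number (the size).
        if token == "Tf" and i >= 2 and tokens[i - 2].startswith("/"):
            name = tokens[i - 2][1:]      # drop the leading slash
            size = float(tokens[i - 1])
            selections.append((name, size))
    return selections


# Hypothetical content stream, written by hand for this example.
sample = "BT /F0 12 Tf 72 712 Td (Hello) Tj /F1 9.5 Tf (world) Tj ET"
print(find_font_selections(sample))
# [('F0', 12.0), ('F1', 9.5)]
```

The names returned here (`F0`, `F1`) are the keys to look up in the page's `Font` resource subdictionary.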
When using the `Tf` operator, the first argument is "the name of a font resource in the Font subdictionary of the current resource dictionary" (p. 398). These font resources are themselves dictionaries, identified by having their `Type` set to `/Font` [1]. Using `minecart` and `pdfminer`, we can explore these structures with the following code:
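import minecart
import pdfminer.pdfpage

# Open the PDF and grab its first page from the underlying pdfminer document
doc = minecart.Document(open("path/to/sample.pdf", 'rb'))
page = next(pdfminer.pdfpage.PDFPage.create_pages(doc.doc))

# The Font subdictionary maps font resource names to indirect references
fonts = page.resources['Font']
print(fonts)
# {'F0': <PDFObjRef:7>}

# Resolve the reference to get the font dictionary itself
font = fonts['F0'].resolve()
print(font)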
# {'Encoding': /Identity-H,
# 'BaseFont': /HDIABS+AlbanyWTTC-Identity-H,
# 'DescendantFonts': [<PDFObjRef:26>],
# 'Subtype': /Type0,
# 'ToUnicode': <PDFObjRef:25>,
# 'Type': /Font}
At this point, the exercise becomes more of a choose-your-own-adventure, since it will largely depend on the fonts that are referenced in your document.
PDF allows documents to use a variety of font formats, which can be embedded with the document, included in the viewer application, or found elsewhere in the system. Font types are identified by the `/Subtype` entry in the dictionary; fonts can be in the following formats:
`/Subtype = /Type1`. Type 1 fonts fall into two categories, distinguished by their `/BaseFont` key.
- The 14 standard fonts: There are 14 fonts (`Times` roman, bold, italic, and bold/italic; `Helvetica` plain, bold, oblique, and bold/oblique; `Courier` plain, bold, oblique, and bold/oblique; `Symbol` and `ZapfDingbats`) that get special treatment in PDF. [Tk takes care of](http://core.tcl.tk/tk/artifact/8b7760dce509891d) aliasing the first three families properly across platforms. `Symbol` looks like it [is frowned upon nowadays](https://en.wikipedia.org/wiki/Symbol_(typeface)#HTML), so I won't worry about it for now (see [issue #7](/felipeochoa/minecart/issues/7)). `ZapfDingbats` can, I think, be incorporated by buying the ITC Zapf Dingbats® Std Medium, but I'm not sure the license would allow me to distribute it.
- Everything else: The `/BaseFont` key is in theory the font's family name, which can be used as the argument to the `Font(family=XXX)` call in `tkinter`. [2] If the font program is embedded, it is a stream referenced from the font descriptor subdictionary (`font['FontDescriptor']`) in one of the following places:
  - Under the `/FontFile` key, in the non-compact Type 1 format.
  - Under the `/FontFile3` key, in the compact Type 1 format. In this case, the stream must contain a `/Subtype` key with value `/Type1C`.
  - Under the `/FontFile3` key, as an OpenType font program. In this case, the stream must contain a `/Subtype` key with value `/OpenType`.
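The two Type 1 categories above could be told apart along these lines. This is a sketch under loud assumptions: plain dicts and strings stand in for resolved pdfminer objects (pdfminer hands back its own name and stream objects, not `str` and `dict`), and `describe_type1_font` is a name I made up for illustration. The 14 PostScript names are the standard ones from the spec.

```python
STANDARD_14 = frozenset([
    "Times-Roman", "Times-Bold", "Times-Italic", "Times-BoldItalic",
    "Helvetica", "Helvetica-Bold", "Helvetica-Oblique",
    "Helvetica-BoldOblique",
    "Courier", "Courier-Bold", "Courier-Oblique", "Courier-BoldOblique",
    "Symbol", "ZapfDingbats",
])


def describe_type1_font(font):
    """Classify a Type 1 font dictionary (a plain dict stand-in here)."""
    base_font = font["BaseFont"]
    if base_font in STANDARD_14:
        return ("standard14", base_font)

    descriptor = font.get("FontDescriptor", {})
    if "FontFile" in descriptor:
        # Non-compact Type 1 program.
        return ("embedded", "Type1")
    font_file3 = descriptor.get("FontFile3")
    if font_file3 is not None:
        # The stream's own Subtype says which format it holds:
        # Type1C (compact Type 1) or OpenType.
        return ("embedded", font_file3["Subtype"])
    # Nothing embedded: fall back to looking the name up on the system.
    return ("external", base_font)


print(describe_type1_font({"BaseFont": "Helvetica-Bold"}))
# ('standard14', 'Helvetica-Bold')

# Hypothetical font with an embedded compact Type 1 program.
embedded = {
    "BaseFont": "SomeFamily",
    "FontDescriptor": {"FontFile3": {"Subtype": "Type1C"}},
}
print(describe_type1_font(embedded))
# ('embedded', 'Type1C')
```

Checking the standard-14 names first is a design choice for the viewer, since Tk already aliases most of them; a document could still embed its own copy of one of these fonts.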
`/Subtype = /MMType1`. Per Wikipedia, "Current application support for these fonts is sparse, if not entirely absent." So we won't support them either.
`/Subtype = /TrueType`.
TrueType fonts must also have a `/BaseFont` key, whose value may be used to look up the font in the central repository. In some (rare?) cases [3], the name may be mangled and thus not usable. The entire font program can be embedded under the `/FontFile2` key of the `/FontDescriptor` subdictionary. It can also appear under the `/FontFile3` key as an OpenType font program if the stream has subtype `/OpenType`.
`/Subtype = /Type3`.
Type 3 fonts have no font program to embed or reference. Instead, they specify PDF graphics procedures for rendering each character as a PDF shape. Rendering the text is thus a job for the shape engine and not for the text engine. I'd have to investigate how `pdfminer` handles Type 3 fonts, since it's possible this is taken care of already. If not, it would require adding support for Type 3 fonts through the interpreter class.
`/Subtype = /Type0`.
Type 0 fonts are also called "composite fonts" in the spec. They have a "subfont" that's stored in the `/DescendantFonts` entry of the main font dictionary. [4] Type 0 fonts can contain two types of embedded subfonts, distinguished by the value of their `/Subtype` entry:
- Type 0 subfont: `/Subtype = /CIDFontType0`. Must appear under `/FontFile3`. If the stream has subtype `/CIDFontType0C`, the font is in Compact Font Format. Otherwise, the stream must have subtype `/OpenType`, and the font program will be in OpenType form.
- Type 2 subfont: `/Subtype = /CIDFontType2`. Can appear under `/FontFile3` in OpenType form if the stream has subtype `/OpenType`. Otherwise, it must appear under `/FontFile2` and will be in TrueType format.
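The rules for composite fonts can be sketched as a small lookup. As before, this is an illustration rather than tested viewer code: plain dicts and strings stand in for resolved pdfminer objects, and `locate_cid_font_program` is a hypothetical helper name.

```python
def locate_cid_font_program(type0_font):
    """Return (descriptor_key, program_format) for a composite font.

    Returns None when no embedded program is found or when the combination
    of subtypes is not one of the cases described above.
    """
    # /DescendantFonts is a one-element array holding the subfont.
    (subfont,) = type0_font["DescendantFonts"]
    subtype = subfont["Subtype"]
    descriptor = subfont.get("FontDescriptor", {})

    font_file3 = descriptor.get("FontFile3")
    if font_file3 is not None:
        stream_subtype = font_file3.get("Subtype")
        if stream_subtype == "OpenType":
            # Allowed for both CIDFontType0 and CIDFontType2 subfonts.
            return ("FontFile3", "OpenType")
        if subtype == "CIDFontType0" and stream_subtype == "CIDFontType0C":
            return ("FontFile3", "CFF")
    if subtype == "CIDFontType2" and "FontFile2" in descriptor:
        return ("FontFile2", "TrueType")
    return None


# Hypothetical Type 0 font with an embedded TrueType-flavoured subfont.
example = {
    "Subtype": "Type0",
    "DescendantFonts": [{
        "Subtype": "CIDFontType2",
        "FontDescriptor": {"FontFile2": {}},
    }],
}
print(locate_cid_font_program(example))
# ('FontFile2', 'TrueType')
```

The tuple unpacking on `DescendantFonts` raises a `ValueError` if the array does not hold exactly one element, which seems like the right failure mode for a malformed document.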
1. Strictly speaking, it's not sufficient, since Type 2 CIDFonts also have `Type` set to `/Font`, but aren't actually PDF font instances. (back to content)
2. For both Type 1 and TrueType fonts, the `/BaseFont` entry may begin with 6 uppercase letters followed by a `+` sign that are extraneous to the font's family name and should be stripped out. This naming style indicates that only a subset of the font is used. I'm still not sure how to deal with these. (back to content)
3. Namely, when the font doesn't include the optional PostScript name in the `name` table and the font family name has spaces in it. The name can also be mangled if "the font in a source document uses a bold or italic style but there is no font data for that style" (p. 418). ([back to content](#reference3))
4. The actual value stored is a 1-element array containing the subfont as its only element. (back to content)
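Stripping the subset tag described in note 2 could look like the following. This is only a sketch of the string handling (the helper name `split_subset_tag` is made up), and it does not answer the open question of what to do with a subsetted font once the tag is gone. The first input is the `BaseFont` value from the sample output earlier on this page.

```python
import re

# Six uppercase letters, a plus sign, then the rest of the name.
_SUBSET_TAG = re.compile(r"^([A-Z]{6})\+(.+)$")


def split_subset_tag(base_font):
    """Return (tag, name); tag is None when there is no subset prefix."""
    match = _SUBSET_TAG.match(base_font)
    if match is None:
        return (None, base_font)
    return (match.group(1), match.group(2))


print(split_subset_tag("HDIABS+AlbanyWTTC-Identity-H"))
# ('HDIABS', 'AlbanyWTTC-Identity-H')
print(split_subset_tag("Helvetica-Bold"))
# (None, 'Helvetica-Bold')
```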