Felipe edited this page Jun 5, 2015 · 10 revisions

Fonts in PDF

With the goal of getting tests\viewer.py to render fonts appropriately, I've been reading about how to extract font programs from PDF documents so that they can be displayed in tkinter. Loading fonts into tkinter is a non-trivial task, but I have a solution for loading fonts on Windows, provided they are in a recognized format and we have the name of the family (see my Stack Overflow answer for details).

The question then becomes: how do we extract the font family name and the embedded font program (if any) from the PDF document? I'm putting together this wiki page to keep track of my efforts towards answering that question.

How are fonts stored/referenced in PDF?

When drawing text on a PDF page, the application keeps track of what's known as the text state. In the text state, there is a parameter Tf called the text font. Whenever text is drawn on the page, it is drawn using the font stored in the Tf field of the text state. The text font is set and updated through the use of the Tf graphics operator.
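
As a rough illustration, a decoded content stream sets the text font with a fragment such as `/F0 12 Tf`, where `/F0` names the font resource and `12` is the size. The sketch below pulls those operands out with a regular expression. The sample stream and the helper name `find_tf_operators` are made up for illustration, and a regex can be fooled by text inside string operands, so real content streams should go through a proper parser such as pdfminer's interpreter.

```python
import re

# Matches "/Name size Tf" in an already-decoded content stream.
# Assumption: operands are a simple name followed by a plain number.
TF_PATTERN = re.compile(r"/([^\s/\[\]()<>{}%]+)\s+([-+]?\d*\.?\d+)\s+Tf\b")


def find_tf_operators(content):
    """Return (font_resource_name, size) pairs for each Tf operator found."""
    return [(name, float(size)) for name, size in TF_PATTERN.findall(content)]


# Hypothetical content stream that switches fonts once.
sample = "BT /F0 12 Tf 72 712 Td (Hello) Tj /F1 9.5 Tf (world) Tj ET"
print(find_tf_operators(sample))
```

Each name returned this way is a key to look up in the Font subdictionary of the page's resources.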

When using the Tf operator, the first argument is "the name of a font resource in the Font subdictionary of the current resource dictionary" (p. 398). These font resources are themselves dictionaries, identified by having their 'Type' set to /Font [1]. Using minecart and pdfminer, we can explore these structures with the following code:

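import minecart
import pdfminer.pdfpage

# Open the PDF in binary mode and wrap it in a minecart Document
doc = minecart.Document(open("path/to/sample.pdf", 'rb'))
# Grab the first page from the underlying pdfminer document
page = next(pdfminer.pdfpage.PDFPage.create_pages(doc.doc))
# Font resources, keyed by the names used with the Tf operator
fonts = page.resources['Font']
print(fonts)
# {'F0': <PDFObjRef:7>}
# Follow the indirect reference to get the font dictionary itself
font = fonts['F0'].resolve()
print(font)
# {'Encoding': /Identity-H,
#  'BaseFont': /HDIABS+AlbanyWTTC-Identity-H,
#  'DescendantFonts': [<PDFObjRef:26>],
#  'Subtype': /Type0,
#  'ToUnicode': <PDFObjRef:25>,
#  'Type': /Font}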

At this point, the exercise becomes more of a choose-your-own-adventure, since it will largely depend on the fonts that are referenced in your document.

The different types of PDF fonts

PDF allows documents to use a variety of font formats, which can be embedded with the document, included in the viewer application, or found elsewhere in the system. Font types are identified by the /Subtype entry in the dictionary; fonts can be in the following formats:

Type 1

/Subtype = /Type1. Type 1 fonts fall into two categories, distinguished by their /BaseFont key.

  • The 14 standard fonts: There are 14 fonts (Times Roman, bold, italic, and bold/italic; Helvetica plain, bold, oblique, and bold/oblique; Courier plain, bold, oblique, and bold/oblique; Symbol and ZapfDingbats) that get special treatment in PDF. [Tk takes care of](http://core.tcl.tk/tk/artifact/8b7760dce509891d) aliasing the first three families properly across platforms. Using Symbol [seems to be frowned upon nowadays](https://en.wikipedia.org/wiki/Symbol_(typeface)#HTML), so I won't worry about it for now (see [issue #7](/felipeochoa/minecart/issues/7)). I think ZapfDingbats could be incorporated by buying ITC Zapf Dingbats® Std Medium, but I'm not sure the license would allow me to distribute it.

  • Everything else: The /BaseFont key is in theory the font's family name, which can be used as the argument to the Font(family=XXX) call in tkinter.[2] If the font program is embedded, it is a stream referenced from the font descriptor subdictionary (font['FontDescriptor'] in pdfminer, which drops the leading slash from keys) in one of the following places:

    • Under the /FontFile key, in the non-compact Type 1 format

    • Under the /FontFile3 key, in the compact Type 1 format. In this case, the stream must contain a /Subtype key with value /Type1C

    • Under the /FontFile3 key, as an OpenType font program. In this case, the stream must contain a /Subtype key with value /OpenType
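
The three cases above can be sketched as a small lookup. This is only an illustration: `type1_font_program` is a hypothetical helper, and it assumes the font descriptor and stream attributes have already been resolved into plain dicts with string values. With pdfminer you would first have to resolve the object references and convert the name literals yourself.

```python
# Stream /Subtype values listed above for /FontFile3 in a Type 1 font.
FONTFILE3_FORMATS = {"Type1C": "compact Type 1", "OpenType": "OpenType"}


def type1_font_program(descriptor):
    """Return (key, format description) for the embedded program, or None.

    `descriptor` is a plain dict standing in for a resolved /FontDescriptor;
    the /FontFile3 value is a dict of that stream's attributes.
    """
    if "FontFile" in descriptor:
        return "FontFile", "Type 1"
    stream_attrs = descriptor.get("FontFile3")
    if stream_attrs is not None:
        fmt = FONTFILE3_FORMATS.get(stream_attrs.get("Subtype"))
        if fmt is not None:
            return "FontFile3", fmt
    return None  # not embedded, or an unexpected stream subtype


# Hypothetical descriptors, one per case.
print(type1_font_program({"FontFile": {}}))
print(type1_font_program({"FontFile3": {"Subtype": "Type1C"}}))
print(type1_font_program({"FontFile3": {"Subtype": "OpenType"}}))
print(type1_font_program({"FontName": "Helvetica"}))
```

Returning None for the last case corresponds to a font that is not embedded and has to be found by name instead.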

Multiple Master

/Subtype = /MMType1. Per Wikipedia, "Current application support for these fonts is sparse, if not entirely absent." So we won't support them either.

TrueType

/Subtype = /TrueType.

TrueType fonts must also have a /BaseFont key, whose value may be used to look up the font in the central repository. In some (possibly rare) cases [3], the name may be mangled and thus not usable. The entire font program can be embedded under the /FontFile2 key of the /FontDescriptor subdictionary. It can also appear under the /FontFile3 key as an OpenType font program if the stream has subtype /OpenType.
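
Before a /BaseFont value is used for a lookup, the subset tag described in footnote 2 (six uppercase letters followed by a + sign) can be removed. A minimal sketch, with `strip_subset_tag` as a made-up helper name and the name passed in as a plain string without its leading slash:

```python
import re

# Six uppercase letters followed by "+", anchored at the start of the name.
SUBSET_TAG = re.compile(r"^[A-Z]{6}\+")


def strip_subset_tag(base_font):
    """Remove a leading subset tag from a /BaseFont name, if present."""
    return SUBSET_TAG.sub("", base_font)


print(strip_subset_tag("HDIABS+AlbanyWTTC-Identity-H"))
print(strip_subset_tag("Helvetica-Bold"))
```

This only cleans up the name. It does not address the fact that a subsetted font program contains just some of the glyphs.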

Type 3

/Subtype = /Type3.

Type 3 fonts have no font program to embed or reference. Instead, they specify PDF graphics procedures for rendering each character as a PDF shape. Rendering the text is thus a job for the shape engine and not for the text engine. I'd have to investigate how pdfminer handles Type 3 fonts, since it's possible this is taken care of already. If not, it would require adding support for Type 3 fonts through the interpreter class.

Type 0

/Subtype = /Type0.

Type 0 fonts are also called "composite fonts" in the spec. They have a "subfont" that's stored in the /DescendantFonts entry of the main font dictionary.[4] Type 0 fonts can contain two types of embedded subfonts, distinguished by the value of their /Subtype entry:

  • Type 0 subfont: /Subtype = /CIDFontType0. Must appear under /FontFile3. If the stream has subtype /CIDFontType0C, the font is in Compact Font Format. Otherwise, the stream must have subtype /OpenType, and the font program will be in OpenType form.

  • Type 2 subfont: /Subtype = /CIDFontType2. Can appear under /FontFile3 in OpenType form if the stream has subtype /OpenType. Otherwise, it must appear under /FontFile2 and will be in TrueType format.
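
The two subfont cases can be sketched in the same spirit. As before, this is an illustration rather than working minecart code: `cid_font_program` is a hypothetical helper, and the font, its descendant, and the descriptor are assumed to be already-resolved plain dicts with string values.

```python
def cid_font_program(type0_font):
    """Return (key, format) for a Type 0 font's embedded subfont, or None."""
    # /DescendantFonts is a one-element array holding the subfont.
    descendant = type0_font["DescendantFonts"][0]
    descriptor = descendant.get("FontDescriptor", {})
    subtype = descendant.get("Subtype")
    file3 = descriptor.get("FontFile3")  # stream attributes, if present

    if subtype == "CIDFontType0":
        if file3 is not None:
            if file3.get("Subtype") == "CIDFontType0C":
                return "FontFile3", "Compact Font Format"
            if file3.get("Subtype") == "OpenType":
                return "FontFile3", "OpenType"
        return None
    if subtype == "CIDFontType2":
        if file3 is not None and file3.get("Subtype") == "OpenType":
            return "FontFile3", "OpenType"
        if "FontFile2" in descriptor:
            return "FontFile2", "TrueType"
    return None  # not embedded, or an unrecognized subfont type


# Hypothetical Type 0 font with a TrueType-based subfont.
font = {
    "Subtype": "Type0",
    "DescendantFonts": [
        {"Subtype": "CIDFontType2", "FontDescriptor": {"FontFile2": {}}}
    ],
}
print(cid_font_program(font))
```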

Footnotes

  1. Strictly speaking, it's not sufficient, since Type 2 CIDFonts also have Type set to /Font, but aren't actually PDF font instances. (back to content)

  2. For both Type 1 and TrueType fonts, the /BaseFont entry may begin with 6 uppercase letters followed by a + sign that are extraneous to the font's family name and should be stripped out. This naming style indicates that only a subset of the font is used. I'm still not sure how to deal with these. (back to content)

  3. Namely, when the font doesn't include the optional PostScript name in the name table and the font family name has spaces in it. The name can also be mangled if "the font in a source document uses a bold or italic style but there is no font data for that style" (p. 418). (back to content)

  4. The actual value stored is a 1-element array containing the subfont as its only element. (back to content)