Amadan Amadan - 1 month ago 4x
Ruby Question

Unrecognised glyphs in PDF (summationdisplay, summationtext)

I am trying to process a PDF with pdf-reader gem. It is mostly fine, but where there should be a summation symbol, I'm getting

instead of
. The relevant font object is:

:Widths=>[1444, 1056],
:Encoding=>{:Type=>:Encoding, :Differences=>[1, :summationdisplay, :summationtext]},
:FontBBox=>[0, -1400, 1387, 0],
"H\x89bd`ab`dd\xE4s\f\xF0\xF0v\xF7\xD3v\xF6u\x8D04\x00\x89(\xFD\x90e\xFC!\xCE\xF2C\x8EG\xACX\xE6K\x81\f\xEB\xBA\x9F3X\xBF;\xF1\x7Fw\x13\xF8\xEE%\xB8\xE2\x87\xA7\x10\x03\vP\x9F\\rfqinnbIf~^IjE\t\x9C\x93\x92Y\\\x90\x93X\xE9\x9C_PY\x94\x99\x9EQ\xA2\xA0\xE1\xAC\xA9`hii\xAE\xE0\x98\x9BZ\x94\x99\x9C\x98\xA7\xE0\x9BX\x92\x91\nR\x9D\x9C\x98\xA3\x10\x9C\x9F\x9C\x99ZR\xA9\xA7\xE0\x98\x93\xA3\x10\x04\xD2Q\xAC\x10\x94Z\x9CZT\x96\x9A\x02u\x15\xD0Y\xED\x8C\fL\x01\x11\f\xCC\x8C\x8C\xECE?\xFF3\xFA\x86\x86\xF1\xFDg\x91\xEFO\xF8Ws\xE8\x97\xECf\xC6\x1F\xD5\x7Ff\x88N\x9A\xD2\xDB\xD7/\xD5\xDF\xD5\xD3:E\xEE\xF7\xCD\x1FA\xAC?\x14\xD8\xBE\xB3}\xAFj\xF9\xED\x7FQ~\t\x9B\xE9\xF7:\xD6\xBF\x17\xD9\n\xBA\xBAr\xE4\x7F0\xFE\xE9\xFA\xFD\xFD\x8F7kscWg\xBBT\xC3\x94\xEE\xB9r?/\xB2=\xFC\xDE\xCBZ\xC4V\xE4\xE0\xE1g\x96\xC7\xD1V\xEDV\xFC[]\xFA\x8F-\e\xDF\x7F\xD6%\x85'd~u<\x92a\xF9\xB8\x9BQ\x86\xE5\x13\x90-\xFA\x9D\xF7\xFB\x15\xA0\xEA\x14eE\xF7\xDF\xEC\xB9\x1Cme\x9A\x85\xBFC\xA4\xFF\xBCg\xFB1\xF1\xC7K\xD6I\x93{\xFB&H\xF5v\xF7\xB5L\x95\xFB\x93\xF6S\x90\xF5\xC7\x0E\xB6\xEFR\xCFj;\xA7\xC8\x1Fl~Tu+rI\xF5\xF9\xB8\xB5V\x1CK\xD8~\xF3~_\xCB*\xF3;\x89\xAD\xA4\xAB\xAB\xB5C\xBE\xAB\xA3\xBB\xA2A\xEA\xC7\xD2\xBF\x19\x7Ff\xFD\xF9\xCC\xDAX\xDF\xDD\xD6\x05q _\xF9|6\x99\xDF\x95\xF3\xD9\xE5\x16\xB8O\x9D9\xE3?\x0F\xE7.\xAE]\xDC\x9B'\xF1\xF0\x001/@\x80\x01\x00J\xBC\xBFN\n",
@hash={:Filter=>:FlateDecode, :Length=>464, :Subtype=>:Type1C},

Since the Adobe
(replicated at
) only includes
, and not
don't get applied to
, and
fails to fetch the correct glyph (it returns the glyphcode as a fallback, which is why I end up with
). I.e. The font mapping differences inside the PDF font object should (according to my understanding) reference glyphs on the master glyph list by name, but these two don't match.

What am I missing? If
are not on Adobe's
, how do other PDF readers render this font correctly?


This is defining a font subset with custom encoding and non-standard glyph names. Furthermore it does not include a ToUnicode reverse mapping from the custom encoding.

The PDF-32000 Specification covers this scenario:

9.10 Extraction of Text Content

9.10.1 General


When extracting character content, a conforming reader can easily convert text to Unicode values if a font’s characters are identified according to a standard character set that is known to the conforming reader. This character identification can occur if either the font uses a standard named encoding or the characters in the font are identified by standard character names or CIDs in a well-known collection. 9.10.2, "Mapping Character Codes to Unicode Values", describes in detail the overall algorithm for mapping character codes to Unicode values.

If a font is not defined in one of these ways, the glyphs can still be shown, but the characters cannot be converted to Unicode values without additional information:

• This information can be provided as an optional ToUnicode entry in the font dictionary (PDF 1.2; see 9.10.3, "ToUnicode CMaps"), whose value shall be a stream object containing a special kind of CMap file that maps character codes to Unicode values.

pdf-reader does seem to be conforming to the above. There is a custom sub-set encoding with /summationdisplay mapped to \u0001. There enough information to render, but not to reverse map the font back to Unicode.