Amadan - 3 months ago
Ruby Question

Unrecognised glyphs in PDF (summationdisplay, summationtext)

I am trying to process a PDF with the pdf-reader gem. It is mostly fine, but where there should be a summation symbol, I'm getting \u0001 instead of \u2211. The relevant font object is:

{:Type=>:Font,
:Subtype=>:Type1,
:FirstChar=>1,
:LastChar=>2,
:Widths=>[1444, 1056],
:Encoding=>{:Type=>:Encoding, :Differences=>[1, :summationdisplay, :summationtext]},
:BaseFont=>:"APHKGN+CMEX10",
:FontDescriptor=>
{:Type=>:FontDescriptor,
:Ascent=>0,
:CapHeight=>0,
:Descent=>0,
:Flags=>4,
:FontBBox=>[0, -1400, 1387, 0],
:FontName=>:"APHKGN+CMEX10",
:ItalicAngle=>0,
:StemV=>47,
:StemH=>47,
:CharSet=>"/summationdisplay/summationtext",
:FontFile3=>
#<PDF::Reader::Stream:0x007faab138a528
@data=
"H\x89bd`ab`dd\xE4s\f\xF0\xF0v\xF7\xD3v\xF6u\x8D04\x00\x89(\xFD\x90e\xFC!\xCE\xF2C\x8EG\xACX\xE6K\x81\f\xEB\xBA\x9F3X\xBF;\xF1\x7Fw\x13\xF8\xEE%\xB8\xE2\x87\xA7\x10\x03\vP\x9F\\rfqinnbIf~^IjE\t\x9C\x93\x92Y\\\x90\x93X\xE9\x9C_PY\x94\x99\x9EQ\xA2\xA0\xE1\xAC\xA9`hii\xAE\xE0\x98\x9BZ\x94\x99\x9C\x98\xA7\xE0\x9BX\x92\x91\nR\x9D\x9C\x98\xA3\x10\x9C\x9F\x9C\x99ZR\xA9\xA7\xE0\x98\x93\xA3\x10\x04\xD2Q\xAC\x10\x94Z\x9CZT\x96\x9A\x02u\x15\xD0Y\xED\x8C\fL\x01\x11\f\xCC\x8C\x8C\xECE?\xFF3\xFA\x86\x86\xF1\xFDg\x91\xEFO\xF8Ws\xE8\x97\xECf\xC6\x1F\xD5\x7Ff\x88N\x9A\xD2\xDB\xD7/\xD5\xDF\xD5\xD3:E\xEE\xF7\xCD\x1FA\xAC?\x14\xD8\xBE\xB3}\xAFj\xF9\xED\x7FQ~\t\x9B\xE9\xF7:\xD6\xBF\x17\xD9\n\xBA\xBAr\xE4\x7F0\xFE\xE9\xFA\xFD\xFD\x8F7kscWg\xBBT\xC3\x94\xEE\xB9r?/\xB2=\xFC\xDE\xCBZ\xC4V\xE4\xE0\xE1g\x96\xC7\xD1V\xEDV\xFC[]\xFA\x8F-\e\xDF\x7F\xD6%\x85'd~u<\x92a\xF9\xB8\x9BQ\x86\xE5\x13\x90-\xFA\x9D\xF7\xFB\x15\xA0\xEA\x14eE\xF7\xDF\xEC\xB9\x1Cme\x9A\x85\xBFC\xA4\xFF\xBCg\xFB1\xF1\xC7K\xD6I\x93{\xFB&H\xF5v\xF7\xB5L\x95\xFB\x93\xF6S\x90\xF5\xC7\x0E\xB6\xEFR\xCFj;\xA7\xC8\x1Fl~Tu+rI\xF5\xF9\xB8\xB5V\x1CK\xD8~\xF3~_\xCB*\xF3;\x89\xAD\xA4\xAB\xAB\xB5C\xBE\xAB\xA3\xBB\xA2A\xEA\xC7\xD2\xBF\x19\x7Ff\xFD\xF9\xCC\xDAX\xDF\xDD\xD6\x05q _\xF9|6\x99\xDF\x95\xF3\xD9\xE5\x16\xB8O\x9D9\xE3?\x0F\xE7.\xAE]\xDC\x9B'\xF1\xF0\x001/@\x80\x01\x00J\xBC\xBFN\n",
@hash={:Filter=>:FlateDecode, :Length=>464, :Subtype=>:Type1C},
@udata=nil>}}


Since the Adobe glyphlist.txt (replicated at pdf-reader/lib/pdf/reader/glyphlist.txt) only includes summation, and neither summationtext nor summationdisplay, the @differences entries don't get applied to @mapping in PDF::Reader::Encoding#differences=, and @state.current_font.to_utf8(1) fails to fetch the correct glyph (it returns the glyph code as a fallback, which is why I end up with \u0001). That is, the font mapping differences inside the PDF font object should (according to my understanding) reference glyphs on the master glyph list by name, but these two names don't match any entry.
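
To make the failure concrete, here is a minimal, self-contained sketch of the lookup as I understand it. It is not pdf-reader's actual code: GLYPH_LIST, apply_differences and the standalone to_utf8 are hypothetical stand-ins, and the glyph table is cut down to two entries.

```ruby
# Hypothetical stand-in for the name => codepoint table built from
# glyphlist.txt. The real list is far longer; the point is only that it
# has :summation but neither :summationdisplay nor :summationtext.
GLYPH_LIST = { summation: 0x2211, a: 0x0061 }.freeze

# Walk a /Differences array: an Integer sets the current character code,
# and each following Symbol names the glyph for that code, after which
# the code advances by one.
def apply_differences(differences, glyph_list)
  mapping = {}
  code = nil
  differences.each do |entry|
    if entry.is_a?(Integer)
      code = entry
    else
      codepoint = glyph_list[entry]
      mapping[code] = codepoint if codepoint # unknown names are skipped
      code += 1
    end
  end
  mapping
end

# Fall back to the raw character code when no mapping exists.
def to_utf8(code, mapping)
  [mapping.fetch(code, code)].pack("U")
end

mapping = apply_differences([1, :summationdisplay, :summationtext], GLYPH_LIST)
mapping             # => {}
to_utf8(1, mapping) # => "\u0001"
```

Because neither name resolves, the mapping stays empty and code 1 falls through unchanged.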

What am I missing? If summationdisplay and summationtext are not on Adobe's glyphlist.txt, how do other PDF readers render this font correctly?

Answer

The font object defines a font subset with a custom encoding and non-standard glyph names. Furthermore, it does not include a ToUnicode entry that would map the custom character codes back to Unicode.
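
As a rough diagnostic, a font dictionary like the one in the question can be checked for exactly these two conditions. This is only a sketch over a plain Ruby hash: KNOWN_GLYPH_NAMES is a tiny hypothetical stand-in for the full glyph list, and unmapped_glyph_names is a helper name invented for illustration, not part of pdf-reader.

```ruby
# Hypothetical, heavily truncated stand-in for the names in glyphlist.txt.
KNOWN_GLYPH_NAMES = %i[summation a b].freeze

# Returns the glyph names from /Differences that cannot be resolved by name.
# If the font carries a ToUnicode entry, names do not matter for extraction,
# so nothing is reported.
def unmapped_glyph_names(font, known_names)
  return [] if font.key?(:ToUnicode)

  encoding = font[:Encoding]
  return [] unless encoding.is_a?(Hash)

  Array(encoding[:Differences]).grep(Symbol).reject do |name|
    known_names.include?(name)
  end
end

font = {
  Type: :Font,
  Subtype: :Type1,
  Encoding: { Type: :Encoding,
              Differences: [1, :summationdisplay, :summationtext] }
}

unmapped_glyph_names(font, KNOWN_GLYPH_NAMES)
# => [:summationdisplay, :summationtext]
```

A non-empty result means the text drawn with that font can be rendered but not reliably extracted.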

The PDF-32000 Specification covers this scenario:

9.10 Extraction of Text Content

9.10.1 General

...

When extracting character content, a conforming reader can easily convert text to Unicode values if a font’s characters are identified according to a standard character set that is known to the conforming reader. This character identification can occur if either the font uses a standard named encoding or the characters in the font are identified by standard character names or CIDs in a well-known collection. 9.10.2, "Mapping Character Codes to Unicode Values", describes in detail the overall algorithm for mapping character codes to Unicode values.

If a font is not defined in one of these ways, the glyphs can still be shown, but the characters cannot be converted to Unicode values without additional information:

• This information can be provided as an optional ToUnicode entry in the font dictionary (PDF 1.2; see 9.10.3, "ToUnicode CMaps"), whose value shall be a stream object containing a special kind of CMap file that maps character codes to Unicode values.

pdf-reader does seem to be conforming to the above. The font uses a custom subset encoding in which character code 1 is assigned to /summationdisplay, and that code surfaces as \u0001. There is enough information to render the glyph, but not to map it back to Unicode.
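
If you need the text anyway, one possible workaround is to supply the missing names yourself. The sketch below is an assumption-laden illustration, not pdf-reader's API: EXTRA_GLYPHS is a hand-written table (mapping both CMEX10 names to U+2211 is my own choice), STANDARD_GLYPHS is a one-entry stand-in for glyphlist.txt, and code_to_unicode is a hypothetical helper.

```ruby
# One-entry stand-in for the table derived from glyphlist.txt.
STANDARD_GLYPHS = { summation: 0x2211 }.freeze

# Assumption: a hand-maintained supplement for non-standard glyph names.
# Both the display-size and text-size sums are treated as U+2211.
EXTRA_GLYPHS = { summationdisplay: 0x2211, summationtext: 0x2211 }.freeze

# Build a character code => codepoint map from a /Differences array,
# skipping any glyph name the table does not know.
def code_to_unicode(differences, glyph_table)
  code = nil
  differences.each_with_object({}) do |entry, map|
    if entry.is_a?(Integer)
      code = entry
    else
      map[code] = glyph_table[entry] if glyph_table.key?(entry)
      code += 1
    end
  end
end

table = STANDARD_GLYPHS.merge(EXTRA_GLYPHS)
map = code_to_unicode([1, :summationdisplay, :summationtext], table)

# Decode character codes 1 and 2, falling back to the raw code if unmapped.
text = [1, 2].map { |c| [map.fetch(c, c)].pack("U") }.join
text # => "\u2211\u2211"
```

The obvious drawback is that such a table is guesswork for every font that invents its own names; a ToUnicode entry in the PDF is the reliable fix.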
