pratik solanki pratik solanki - 1 month ago 6
C Question

how to exract the text in pdf if encoding and ToUnicode both are present in pdf? how to map it?

here, i used qpdf tool to uncompress data and below is output. if you see there are encoding and ToUnicode both are present in pdf. i know if only ToUnicode is there so how to map individual char with Cmap file.but if you see output of Content stream is following

Tf
0.999402 0 0 1 71.9995 759.561 Tm
[()-2.11826()-1.14177()2.67786()-2.11826()8.55269()-5.44998()-4.70186()2.67786()-2.32338()2.67786()12.679( )-3.75591()9.73429()]TJ

in break-at there are some garbag data that is not visible. so how to link data to cmap file ?

and one another question is that in /Encoding what are values contain in Difference ?

10 0 obj
<< /BaseEncoding /WinAnsiEncoding /Differences [ 1 /g100 /g28 /g94 /g3 /g87 /g24 /g38 /g47 /g62 ] /Type /Encoding >>

even i pass one by one values of Difference array into one of FreeType function is named as FT_Get_Name_Indek. this function return values like [ 100 28 94 3 87 24 38 47 62]

what is those values ? how to map those Value ?

here is pdf

run following cmd

qpdf --stream-data=uncompress input.pdf output.text

output.text

same output i got if i pass contents stream data into zlib. kindly check output.txt file from link

mkl mkl
Answer

Firstly the general question

how to exract the text in pdf if encoding and ToUnicode both are present in pdf? how to map it?

[...] if you see there are encoding and ToUnicode both are present in pdf. i know if only ToUnicode is there so how to map individual char with Cmap file.

In such a case, i.e. when you have both a sufficiently complete and correct ToUnicode map and an Encoding for a font, you can ignore the Encoding and only use the ToUnicode map.

This follows from the PDF specification which in section 9.10.2 "Mapping Character Codes to Unicode Values" states that the methods to map a character code to a Unicode value with the highest priority is

If the font dictionary contains a ToUnicode CMap (see 9.10.3, "ToUnicode CMaps"), use that CMap to convert the character code to Unicode.

Thus, if you (as you say) already know how to extract text if there only is a ToUnicode map, you can use the same algorithm unchanged. And as a corollary, if that doesn't work, the ToUnicode map in question is insufficiently complete or incorrect, or your knowledge itself on how to extract text using only a ToUnicode map actually is incomplete.

Secondly the sample document

You wrote

[()-2.11826()-1.14177()2.67786()-2.11826()8.55269()-5.44998()-4.70186()2.67786()-2.32338()2.67786()12.679( )-3.75591()9.73429()]TJ

in break-at there are some garbag data that is not visible. so how to link data to cmap file ?

In the brackets there are the values identifying your glyphs, so they definitively are not garbage.

Thus, here are the byte values from within the brackets:

[(
    01
)-2.11826(
    02
)-1.14177(
    03
)2.67786(
    01
)-2.11826(
    04
)8.55269(
    05
)-5.44998(
    06
)-4.70186(
    07
)2.67786(
    04
)-2.32338(
    07
)2.67786(
    08
)12.679(
    09
)-3.75591(
    02
)9.73429(
    04
)]TJ

Using the ToUnicode map of the font in question

/CIDInit /ProcSet findresource begin
12 dict begin
begincmap
/CMapType 2 def
1 begincodespacerange
<00><ff>
endcodespacerange
9 beginbfrange
<01><01><0054>
<02><02><0045>
<03><03><0053>
<04><04><0020>
<05><05><0050>
<06><06><0044>
<07><07><0046>
<08><08><0049>
<09><09><004c>
endbfrange
endcmap
CMapName currentdict /CMap defineresource pop
end end 

the byte values from within the brackets map to:

    01    0054    "T"
    02    0045    "E"
    03    0053    "S"
    01    0054    "T"
    04    0020    " "
    05    0050    "P"
    06    0044    "D"
    07    0046    "F"
    04    0020    " "
    07    0046    "F"
    08    0049    "I"
    09    004c    "L"
    02    0045    "E"
    04    0020    " "

Thus,

"TEST PDF FILE "

which matches the rendered file just fine:

Screenshot

Thirdly the encoding

and one another question is that in /Encoding what are values contain in Difference ?

10 0 obj << /BaseEncoding /WinAnsiEncoding /Differences [ 1 /g100 /g28 /g94 /g3 /g87 /g24 /g38 /g47 /g62 ] /Type /Encoding >>

According to the PDF specification,

The value of the Differences entry shall be an array of character codes and character names organized as follows:

code1 name1,1 name1,2

code2 name2,1 name2,2

coden namen,1 namen,2

Each code shall be the first index in a sequence of character codes to be changed. The first character name after the code becomes the name corresponding to that code. Subsequent names replace consecutive code indices until the next code appears in the array or the array ends. These sequences may be specified in any order but shall not overlap.

Thus, the encoding entry in your case says that the encoding basically is WinAnsiEncoding with the difference that the codes 1, ..., 9 instead represent the glyphs named /g100, /g28, /g94, /g3, /g87, /g24, /g38, /g47, and /g62 respectively.

As these glyph names are no standard glyph names, the PDF specification does not consider this encoding helpful for text extraction because it only describes a method for a simple font

that has an encoding whose Differences array includes only character names taken from the Adobe standard Latin character set and the set of named characters in the Symbol font (see Annex D)

The "/gXX" names in your sample clearly are not among them.

Comments