Neeraj Neeraj - 4 years ago 344
Java Question

How to get font color using pdfbox

I am trying to extract text with all information from the pdf using pdfbox. I got all the information i want, except color. I tried different ways to get the fontcolor (including Getting Text Colour with PDFBox). But not working. And now I copied code from PageDrawer class of pdfBox. But then also the RGB value is not correct.

protected void processTextPosition(TextPosition text) {

Composite com;
Color col;
switch(this.getGraphicsState().getTextState().getRenderingMode()) {
com = this.getGraphicsState().getNonStrokeJavaComposite();
int r = this.getGraphicsState().getNonStrokingColor().getJavaColor().getRed();
int g = this.getGraphicsState().getNonStrokingColor().getJavaColor().getGreen();
int b = this.getGraphicsState().getNonStrokingColor().getJavaColor().getBlue();
int rgb = this.getGraphicsState().getNonStrokingColor().getJavaColor().getRGB();
float []cosp = this.getGraphicsState().getNonStrokingColor().getColorSpaceValue();
PDColorSpace pd = this.getGraphicsState().getNonStrokingColor().getColorSpace();
//basic support for text rendering mode "invisible"
Color nsc = this.getGraphicsState().getStrokingColor().getJavaColor();
float[] components = {,,};
Color c1 = new Color(nsc.getColorSpace(),components,0f);

I am using the above code. The values getting are
r = 0, g = 0, b = 0, inside cosp object value is [0.0], inside pd object array = null and colorSpace = null. and RGB value is always -16777216. Please help me. Thanks in advance.

Answer Source

I tried the code in the link you posted and it worked for me. The colors I get back are 148.92, 179.01001 and 214.965. I wish I could give you my PDF to work with, maybe if I store it externally to SO? My PDF used a sort of palish blue color and that seems to match. It was just one page of text created in Word 2010 and exported, nothing too intense.

A couple of suggestions ....

  1. Recall that the value returned is a float between 0 and 1. If a value is accidentally cast to int, then of course the values will end up containing nearly all 0. The linked to code multiples by 255 to get a range of 0 to 255.
  2. As the commenter said, the most common color for a PDF file is black which is 0 0 0

That is all I can think of now, otherwise I have version of 1.7.1 of pdfbox and fontbox and like I said I pretty much followed the link you gave.


Based upon my comments, here perhaps is a minorly invasive way of doing it for pdf files like color.pdf?

In in the processOperator method one can do inside the try block

if (operation.equals("RG")) {
   // stroking color space
} else if (operation.equals("rg")) {
   // non-stroking color space
} else if (operation.equals("BT")) {
} else if (operation.equals("ET")) {

This will show you the information, then it is up to you to process the color information for each section according to your needs. Here is a snippet from the beginning of the output of the above code when run on color.pdf ...

BT rG [COSInt(1), COSInt(0), CosInt(0)] RG [COSInt(1), COSInt(0), CosInt(0)] ET BT ET BT rG [COSFloat{0.573}, COSFloat{0.816}, COSFloat{0.314}] RG [COSFloat{0.573}, COSFloat{0.816}, COSFloat{0.314}] ET ......

You see in the above output an empty BT ET section, this being a section which is marked DEVICEGRAY. All the other give you [0,1] values for the R, G and B components

Recommended from our users: Dynamic Network Monitoring from WhatsUp Gold from IPSwitch. Free Download