KVohra95 - 3 months ago
Java Question

How to extract hyperlink information with PDFBox

I am trying to extract the hyperlink information from a PDF using PDFBox, but I am unsure how to get it. This is what I have so far:

for( Object p : pages ) {
    PDPage page = (PDPage)p;

    List<?> annotations = page.getAnnotations();
    for( Object a : annotations ) {
        PDAnnotation annotation = (PDAnnotation)a;

        if( annotation instanceof PDAnnotationLink ) {
            PDAnnotationLink link = (PDAnnotationLink)annotation;
            System.out.println(link.toString());
            System.out.println(link.getDestination());
        }
    }
}


I want to extract the URL of the hyperlink destination and the text of the hyperlink. How can I do this?

Thanks

Answer

Use this code, taken from the PrintURLs example in the PDFBox source code download:

for( PDPage page : doc.getPages() )
{
    pageNum++;
    PDFTextStripperByArea stripper = new PDFTextStripperByArea();
    List<PDAnnotation> annotations = page.getAnnotations();
    //first setup text extraction regions
    for( int j=0; j<annotations.size(); j++ )
    {
        PDAnnotation annot = annotations.get(j);
        if( annot instanceof PDAnnotationLink )
        {
            PDAnnotationLink link = (PDAnnotationLink)annot;
            PDRectangle rect = link.getRectangle();
            //need to reposition link rectangle to match text space
            float x = rect.getLowerLeftX();
            float y = rect.getUpperRightY();
            float width = rect.getWidth();
            float height = rect.getHeight();
            int rotation = page.getRotation();
            if( rotation == 0 )
            {
                PDRectangle pageSize = page.getMediaBox();
                y = pageSize.getHeight() - y;
            }
            else if( rotation == 90 )
            {
                //do nothing
            }

            Rectangle2D.Float awtRect = new Rectangle2D.Float( x,y,width,height );
            stripper.addRegion( "" + j, awtRect );
        }
    }

    stripper.extractRegions( page );

    for( int j=0; j<annotations.size(); j++ )
    {
        PDAnnotation annot = annotations.get(j);
        if( annot instanceof PDAnnotationLink )
        {
            PDAnnotationLink link = (PDAnnotationLink)annot;
            PDAction action = link.getAction();
            String urlText = stripper.getTextForRegion( "" + j );
            if( action instanceof PDActionURI )
            {
                PDActionURI uri = (PDActionURI)action;
                System.out.println( "Page " + pageNum +":'" + urlText.trim() + "'=" + uri.getURI() );
            }
        }
    }
}

It works in two parts. Getting the URL is the easy part: it comes from the link's action when that action is a PDActionURI. Getting the link text takes more work: it is done by running a text extraction over the rectangle of the annotation.
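The "reposition link rectangle" step is the part that tends to trip people up, so here is a minimal standalone sketch of just that conversion, using only the JDK. It mirrors the rotation == 0 branch of the code above: the annotation rectangle has its origin at the bottom-left of the page, while the region passed to the stripper is measured from the top. The class name LinkRegion and the method toTextSpace are hypothetical names made up for this illustration, not part of PDFBox, and rotated pages are not handled.

```java
import java.awt.geom.Rectangle2D;

public class LinkRegion {

    /**
     * Hypothetical helper (not a PDFBox API): converts a link rectangle given
     * in PDF page coordinates (origin bottom-left, y grows upwards) into a
     * rectangle whose y is measured from the top of the page, as the
     * rotation == 0 branch above does. Assumes an unrotated page.
     */
    static Rectangle2D.Float toTextSpace(float lowerLeftX, float upperRightY,
                                         float width, float height,
                                         float pageHeight) {
        // The top edge of the link, measured downwards from the top of the page.
        float top = pageHeight - upperRightY;
        return new Rectangle2D.Float(lowerLeftX, top, width, height);
    }

    public static void main(String[] args) {
        // Example values: a 792 pt tall page with a link box whose
        // lower-left x is 100 and whose upper-right y is 700.
        Rectangle2D.Float r = toTextSpace(100f, 700f, 50f, 12f, 792f);
        System.out.println(r.x + ", " + r.y + ", " + r.width + ", " + r.height);
        // prints "100.0, 92.0, 50.0, 12.0"
    }
}
```

In the real code the inputs would come from link.getRectangle() and page.getMediaBox().getHeight(), and the result is what gets passed to stripper.addRegion().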
