pratz pratz - 2 months ago 12
Apache Configuration Question

How to detect images in a document

How can I detect images in a document, such as a doc, xls, ppt or pdf file?

I came across Apache Tika, and I am trying its command line option:
http://tika.apache.org/1.2/gettingstarted.html

But I'm not quite sure how to use it to detect images.

Any help is appreciated.

Thanks

Answer

You've said you want to use a command line solution and not write any Java code, so this isn't going to be the prettiest way to do it... If you are happy to write a little bit of Java, and create a new program to call from Python, then you can do it much more cleanly!

The first thing to do is to have the Tika App extract any embedded resources from your file. Use the --extract option for this, and run the extraction in a dedicated temp directory that your app controls, e.g.

$ java -jar tika.jar --extract ../testWORD_embedded_pdf.doc
Extracting 'image1.emf' (application/x-emf)
Extracting '_1402837031.pdf' (application/pdf)

Grab the output of the extraction if you can, and parse it looking for images (but be aware that some images have an application/ prefix on their canonical mimetype, such as application/x-emf above!). You might need to run a second --detect step on a few of the extracted files. I'm not sure, so test how the parsers get on with the extraction.
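As a rough sketch of that parsing step in Python: the line format is assumed from the sample output above (other Tika versions may print something slightly different), and the function names and the set of application/ image types are illustrative, not part of Tika.

```python
import re

# Matches lines such as: Extracting 'image1.emf' (application/x-emf)
# Assumption: the format is taken from the sample output shown above.
EXTRACT_LINE = re.compile(r"^Extracting '(?P<name>.*?)' \((?P<mime>[^()]+)\)")

# Image types whose mimetype does not start with "image/".
# Only the type seen in the sample output is listed; extend as needed.
APPLICATION_IMAGE_TYPES = {"application/x-emf"}


def parse_extract_output(output):
    """Return (filename, mimetype) pairs found in the --extract output."""
    pairs = []
    for line in output.splitlines():
        match = EXTRACT_LINE.match(line.strip())
        if match:  # lines in any other format are ignored
            pairs.append((match.group("name"), match.group("mime")))
    return pairs


def is_image(mimetype):
    """True for image/* types and the known application/* image types."""
    return mimetype.startswith("image/") or mimetype in APPLICATION_IMAGE_TYPES


def find_images(output):
    """Return the names of extracted files whose mimetype looks like an image."""
    return [name for name, mime in parse_extract_output(output) if is_image(mime)]
```

If find_images returns a non-empty list, the document contained at least one embedded image.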

Now, if there were images, they'll be in your temp dir. Process them as you want. Finally, zap the temp dir when you're done with the file!
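The whole round trip (temp dir, extraction, processing, cleanup) might be wrapped up roughly like this in Python. It is only a sketch: extract_embedded and the handle callback are hypothetical names, tika_cmd is whatever command starts your copy of the Tika App, and stderr is merged into stdout because I'm not certain which stream the "Extracting" lines go to.

```python
import subprocess
import tempfile
from pathlib import Path


def extract_embedded(document, handle, tika_cmd):
    """Run `<tika_cmd> --extract <document>` inside a throwaway directory.

    tika_cmd: command prefix, e.g. ["java", "-jar", "/path/to/tika.jar"].
              Use an absolute jar path, since the command runs from the temp dir.
    handle:   hypothetical callback, called as handle(output, files) while the
              extracted files still exist; its return value is passed back.
    """
    # Resolve first: the working directory changes for the subprocess.
    document = Path(document).resolve()
    with tempfile.TemporaryDirectory() as tmp:
        result = subprocess.run(
            [*tika_cmd, "--extract", str(document)],
            cwd=tmp,                   # extracted files land in the temp dir
            stdout=subprocess.PIPE,
            stderr=subprocess.STDOUT,  # assumption: capture both streams
            text=True,
            check=True,                # raise if the command fails
        )
        files = sorted(p for p in Path(tmp).iterdir() if p.is_file())
        return handle(result.stdout, files)
    # Leaving the with-block deletes the temp dir and everything in it.
```

The callback has to finish with the files (or copy them elsewhere) before it returns, because the directory is gone as soon as extract_embedded returns.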
