Chumbawamba - 10 months ago
C# Question

Removing text from PDF

I'm looking for a way to remove ALL text from a PDF. I've been using iTextSharp for a while now, and extracting text from a PDF with it is easy (without using OCR). However, I can't find an option to delete the text.

This solution frankly doesn't work for me.


returns null for me, also when using
and some others I've tried.

The library doesn't really matter; I just think iTextSharp should be able to do this. However, if there is another (free) solution, bring it.

EDIT: Just to make clear why I want to remove all text from the PDFs:

I want to reduce the size of the PDFs. I do this by reducing the resolution of the images in the PDF. However, in a lot of cases the vector images take up most of the space. So I thought of the following:
Remove all text, then convert the remaining PDF (with only the images and vectors) to a bitmap (JPEG). After that I paste the text over it again.
Another option would be to make the text invisible, but I don't think this is any easier.

Answer
  1. The /Contents of a page dictionary doesn't always consist of an array. It should be evident that GetAsArray() returns null if the content is stored as a stream.
  2. Suppose you use GetAsStream() and you remove all the text contents from the stream, then you may still have text content in XObjects. That text won't be referenced from a content stream, but iText won't be able to remove the XObjects as 'unused objects' because the objects will still be referenced from the /Resources in the page dictionary.
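
The two points above can be checked with a small helper. This is only a sketch, assuming iTextSharp 5.x; the class name PageTextSources and its two methods are made up for illustration and are not part of the library. It reports how /Contents is stored for a page and lists the form XObjects in the page's /Resources, which may carry text of their own (nested XObjects inside those forms are not followed here).

```csharp
using System;
using System.Collections.Generic;
using iTextSharp.text.pdf;

// Hypothetical helper class, not part of iTextSharp.
public static class PageTextSources
{
    // Reports how /Contents is stored: a single stream, an array of streams, or absent.
    public static string ContentsKind(PdfDictionary page)
    {
        // GetDirectObject resolves an indirect reference, if there is one.
        PdfObject contents = page.GetDirectObject(PdfName.CONTENTS);
        if (contents == null) return "none";
        if (contents.IsStream()) return "stream";
        if (contents.IsArray()) return "array";
        return "other";
    }

    // Lists the form XObjects named in the page's /Resources.
    // Each of these has its own content stream that can contain text.
    public static List<PdfName> FormXObjectNames(PdfDictionary page)
    {
        var names = new List<PdfName>();
        PdfDictionary resources = page.GetAsDict(PdfName.RESOURCES);
        PdfDictionary xobjects =
            resources == null ? null : resources.GetAsDict(PdfName.XOBJECT);
        if (xobjects == null) return names;

        foreach (PdfName name in xobjects.Keys)
        {
            PdfStream xobject = xobjects.GetAsStream(name);
            // Image XObjects have /Subtype /Image; only /Form can hold text operators.
            if (xobject != null &&
                PdfName.FORM.Equals(xobject.GetAsName(PdfName.SUBTYPE)))
            {
                names.Add(name);
            }
        }
        return names;
    }
}
```

You would pass in the dictionary returned by reader.GetPageN(pageNumber) on a PdfReader. If you only need the page's content bytes and don't care how they are stored, reader.GetPageContent(pageNumber) returns them for both the stream and the array case.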

Please read ISO-32000-1 to find out what you're doing wrong.