Chumbawamba - 1 month ago
C# Question

Removing text from PDF

I'm looking for a way to remove ALL text from a PDF. I've been using iTextSharp for a while now, and extracting text from a PDF with it is easy (without the use of OCR). However, I can't find an option to delete the text.

This solution frankly doesn't work for me.

page.GetAsArray(PdfName.CONTENTS);


returns null for me, as it does when I use PdfName.Text and some others I've tried.

The library doesn't really matter; I just think iTextSharp should be able to do this. However, if there is another (free) solution, bring it on.

EDIT: Just to make clear why I want to remove all text from the PDFs:

I want to reduce the size of the PDFs. I do this by reducing the resolution of the images in the PDF. However, in a lot of cases the vector images take up most of the space. So I thought of the following:
Remove all text, then convert the remaining PDF (with only the images and vectors) to a bitmap (JPEG). After that, I paste the text over it again.
Another option would be to make the text invisible, but I don't think this is any easier.

Answer
  1. The /Contents of a page dictionary doesn't always consist of an array. It should be evident that GetAsArray() returns null if the content is stored as a stream.
  2. Suppose you use GetAsStream() and you remove all the text contents from the stream, then you may still have text content in XObjects. That text won't be referenced from a content stream, but iText won't be able to remove the XObjects as 'unused objects' because the objects will still be referenced from the /Resources in the page dictionary.
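
To illustrate what "removing the text contents from the stream" might look like once you have the decoded bytes of a content stream as a string, here is a minimal, library-free sketch. It drops everything between the BT and ET operators, which delimit text objects in ISO-32000-1. The class and method names (ContentStreamSketch, StripTextObjects) are hypothetical and are not part of iTextSharp. The regular expression is naive on purpose: it assumes that no string literal or inline image data happens to contain a standalone "ET", and it does nothing about text inside form XObjects, which would need the same treatment on their own streams.

```csharp
using System;
using System.Text.RegularExpressions;

public static class ContentStreamSketch
{
    // Naive pattern: a standalone BT operator, then the shortest run of
    // anything (including newlines) up to the next standalone ET operator.
    // Assumption: "ET" never appears as a separate word inside a string literal.
    private static readonly Regex TextObject =
        new Regex(@"\bBT\b.*?\bET\b", RegexOptions.Singleline);

    // Returns the content stream with every BT ... ET text object removed.
    // Graphics operators outside text objects are left untouched.
    public static string StripTextObjects(string decodedContent)
    {
        return TextObject.Replace(decodedContent, string.Empty);
    }

    public static void Main()
    {
        // Hand-written sample content: an image, a text object, a filled rectangle.
        string content =
            "q 1 0 0 1 0 0 cm /Im1 Do Q\n" +
            "BT /F1 12 Tf 72 700 Td (Hello) Tj ET\n" +
            "0 0 100 50 re f";

        // Prints the sample with the text object gone; the image and
        // rectangle operators remain.
        Console.WriteLine(StripTextObjects(content));
    }
}
```

A real implementation would use a proper content stream parser rather than a regular expression, and would have to apply the same filtering to each stream when /Contents is an array.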

Please read ISO-32000-1 to find out what you're doing wrong.
