
Extracting text with greater font weight

I have a number of documents with predicted placement of certain text which I'm trying to extract. For the most part, it works very well, but I'm having difficulties with a certain fraction of documents which have slightly thicker text.

Thin text:

[Image: sample document with thin text]

Thick text:

[Image: sample document with thicker text]

I know it's hard to tell the difference at this resolution, but if you look at the MO DAY YEAR TIME (2400) portion, you can tell that the second one is thicker.

The thin text gives me exactly what is expected:

09/28/2015
0820

However, the thick version gives me every character in triplicate, with whitespace between the duplicated characters:

1 1 11 1 1/ / /1 1 19 9 9/ / /2 2 20 0 01 1 15 5 5
1 1 17 7 70 0 02 2 2

I'm using the following code to extract text from documents:

// Requires: using iTextSharp.text.pdf; using iTextSharp.text.pdf.parser;
// and using System.util; (for RectangleJ).
public static Document GetDocumentInfo(string fileName)
{
    // Using 11 in x 8.5 in page dimensions at 72 dpi.
    var boundingBoxes = new[]
    {
        new RectangleJ(446, 727, 85, 14),
        new RectangleJ(396, 702, 43, 14),
        new RectangleJ(306, 680, 58, 7),
        new RectangleJ(378, 680, 58, 7),
        new RectangleJ(446, 680, 45, 7),
        new RectangleJ(130, 727, 29, 10),
        new RectangleJ(130, 702, 29, 10)
    };

    var data = GetPdfData(fileName, 1, boundingBoxes);

    // I would populate the new document with the extracted data
    // here, but it's not important for the example.
    var doc = new Document();
    return doc;
}

public static string[] GetPdfData(string fileName, int pageNum, RectangleJ[] boundingBoxes)
{
    // Omitted safety checks, as they're not important for the example.

    var data = new string[boundingBoxes.Length];

    using (var reader = new PdfReader(fileName))
    {
        if (reader.NumberOfPages < 1)
        {
            return null;
        }

        RenderFilter filter;
        ITextExtractionStrategy strategy;

        for (var i = 0; i < boundingBoxes.Length; ++i)
        {
            filter = new RegionTextRenderFilter(boundingBoxes[i]);
            strategy = new FilteredTextRenderListener(new LocationTextExtractionStrategy(), filter);
            data[i] = PdfTextExtractor.GetTextFromPage(reader, pageNum, strategy);
        }

        return data;
    }
}


Obviously, if nothing else works, I can strip the duplicate characters after reading them in, as there is a very apparent pattern, but I'd rather find a proper way than resort to a hack. I've spent the past few hours looking around, but couldn't find anyone encountering a similar issue.

EDIT:

I finally came across this SO question:

Text Extraction Duplicate Bold Text

...and in the comments it's indicated that some lower-quality PDF producers duplicate text to simulate boldness, so that may be what's happening here. However, there is a mention of omitting duplicate text at the same location, which I don't know how to achieve, since this portion of my code...

data[i] = PdfTextExtractor.GetTextFromPage(reader, pageNum, strategy);


...reads in the duplicated text completely in any of the specified locations.

EDIT:

I have now come across documents that duplicate contents up to four times to simulate thickness. That's a very strange way of doing things, but I'm sure the designers of that method had their reasons.

EDIT:

I produced a solution (see my answer). It processes the data after it's already been extracted and removes any repetitions. Ideally this would be done during the extraction process, but that can get pretty complicated, and this seemed like a very clean and easy way of accomplishing the same thing.

Answer

As @mkl has suggested, one way of tackling this issue is to override LocationTextExtractionStrategy; however, things get pretty complicated, since that requires comparing the locations of every character found within the specified boundaries. I did some research in order to accomplish that, but due to poor documentation, it was getting a bit out of hand.
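For the record, the direction I was exploring looked roughly like the sketch below. Treat it as an untested outline rather than working code: it assumes an iTextSharp 5.x build where LocationTextExtractionStrategy.RenderText is virtual and Vector exposes Subtract() and a Length property, and the half-point tolerance is an arbitrary guess. It also needs using System; and using System.Collections.Generic;

public class DeduplicatingStrategy : LocationTextExtractionStrategy
{
    // Baseline start points and text of the chunks accepted so far.
    private readonly List<Tuple<Vector, string>> seen =
        new List<Tuple<Vector, string>>();

    public override void RenderText(TextRenderInfo renderInfo)
    {
        var start = renderInfo.GetBaseline().GetStartPoint();
        var text = renderInfo.GetText();

        // Drop the chunk if the same text was already drawn within
        // half a point of this spot (a simulated-bold duplicate).
        foreach (var chunk in seen)
        {
            if (chunk.Item2 == text &&
                chunk.Item1.Subtract(start).Length < 0.5f)
            {
                return;
            }
        }

        seen.Add(Tuple.Create(start, text));
        base.RenderText(renderInfo);
    }
}

This would drop in where new LocationTextExtractionStrategy() is instantiated in GetPdfData(). I abandoned this direction in favor of the post-processing described below.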

So instead, I created a post-processing method, loosely based on what @TheMuffinMan suggested, to clean up any repetitions. I decided not to deal with pixels, but rather with character-count anomalies at known static locations. In my case, I know that the second data piece extracted can never be longer than three characters, so it makes a good reference point. If you know the document layout, you can use anything on it that is guaranteed to have a fixed length.

After I extract the data with the method listed in my original post, I check whether the second data piece is longer than three characters. If it is, I divide its length by three, as that's the most characters the field can legitimately have; because every character is duplicated in the same way, the division comes out exact and gives me the number of characters that now occupy each original character's place (the duplicates plus the whitespace between them):

var data = GetPdfData(fileName, 1, boundingBoxes);

if (data[1].Length > 3)
{
    var count = data[1].Length / 3;
    for (var i = 0; i < data.Length; ++i)
    {
        data[i] = RemoveRepetitions(data[i], count);
    }
}

As you can see, I then loop over the data and pass each piece into the RemoveRepetitions() method:

public static string RemoveRepetitions(string original, int count)
{
    // The length must be an exact multiple of the repetition count.
    if (original.Length % count != 0)
    {
        return null;
    }

    // Keep the first character of every group of count characters;
    // the rest are duplicates and separating whitespace.
    var temp = new char[original.Length / count];
    for (int i = 0; i < original.Length; i += count)
    {
        temp[i / count] = original[i];
    }

    return new string(temp);
}

This method takes the string and the repetition count we calculated earlier. One thing to note is that I don't have to worry about the whitespace inserted during the duplication process (shown in the example in my original post), because count represents the total number of characters that appear where only one should have been.
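To illustrate with the sample data from the question: assuming the three-character reference field came back 15 characters long, count works out to 15 / 3 = 5, since each original character now occupies five characters ("1 1 1" in place of "1"). The thick-text values then clean up like this:

// The loop keeps positions 0, 5, 10, 15, and so on, skipping the
// duplicates and the whitespace in between.
var time = RemoveRepetitions("1 1 17 7 70 0 02 2 2", 5);
// time == "1702"

var date = RemoveRepetitions("1 1 11 1 1/ / /1 1 19 9 9/ / /2 2 20 0 01 1 15 5 5", 5);
// date == "11/19/2015"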