Nicholas - 28 days ago
Python Question

Removing the background noise of a captcha image by replicating the chopping filter of TesserCap

I have a captcha image that looks like this:



Using a utility called TesserCap from McAfee, I could apply a "chopping" filter to the image. (Before running it, I made sure there were only two colors in the image, white and black.) I was very impressed with the results of using that filter with a value of 2 in the text box. It accurately removed most of the noise but kept the main text, resulting in this:



I wanted to implement something like this on one of my own scripts, so I tried to find out what image processing library TesserCap used. I couldn't find anything; it turns out it uses its own code to process the image. I then read this whitepaper that explains exactly how the program works. It gave me the following description of what this chopping filter does:


If the contiguous number of pixels for given grayscale values are less
than the number provided in the numeric box, the
chopping filter replaces these sequences with 0 (black) or 255 (white)
as per user choice. The CAPTCHA is analyzed in both horizontal and
vertical directions and corresponding changes are made.


I am not sure I understand what it is doing. My script is in Python, so I tried using PIL to manipulate the pixels the way the quote describes. It sounds simple, but my attempt failed, probably because I didn't know exactly what the filter was doing:


(This is made from a slightly different captcha that uses a circular pattern.)

I also tried seeing if it could easily be done with ImageMagick's convert.exe. Their -chop option is something completely different. Using -median along with some -morphology commands helped to reduce some of the noise, but nasty dots appeared and the letters became very distorted. It wasn't nearly as simple as doing the chopping filter with TesserCap.

So, my question is as follows: how do I implement the chopping filter of TesserCap in Python, be it using PIL or ImageMagick? That chopping filter works much better than any of the alternatives I've tried, but I can't seem to replicate it. I've been working on this for hours and haven't figured anything out yet.

Answer Source

The algorithm finds each run of consecutive target pixels (in this case, non-white pixels) in a row, and replaces the whole run if its length is less than or equal to the chop factor.

For example, in a sample row of pixels, where # is black and - is white, applying a chop factor of 2 would transform --#--###-##---#####---#-# into -----###------#####------. This is because there are sequences of black pixels that are 2 pixels long or shorter, and these sequences are replaced with white. The contiguous sequences longer than 2 pixels remain.
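The single-row case can be sketched on a plain string before touching any image. This is only an illustration of the rule described above: the function name `chop_row` and its `dark` and `light` parameters are made up for this example and are not part of TesserCap or PIL.

```python
import re


def chop_row(row: str, chop: int, dark: str = "#", light: str = "-") -> str:
    """Replace every run of `dark` that is at most `chop` long with `light`.

    Hypothetical helper for illustration; it mirrors the "less than or
    equal to the chop factor" rule used in this answer.
    """
    def replace_short(match):
        run = match.group(0)
        # Short runs are treated as noise; longer runs are kept untouched.
        return light * len(run) if len(run) <= chop else run

    # One or more consecutive dark characters form a run.
    return re.sub(re.escape(dark) + "+", replace_short, row)


print(chop_row("--#--###-##---#####---#-#", 2))
# prints -----###------#####------
```

The output has the same length as the input, since a chopped run is overwritten in place rather than deleted.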

This is the result of the chop algorithm, as implemented in my Python code (below), applied to the original image in your post:

'Chopped' image

In order to apply this to the whole image, you simply perform this algorithm on every row and on every column. Here's Python code that accomplishes that:

import PIL.Image
import sys

# python chop.py [chop-factor] [in-file] [out-file]

chop = int(sys.argv[1])
image = PIL.Image.open(sys.argv[2]).convert('1')
width, height = image.size
data = image.load()

# Iterate through the rows.
# A while loop is used because reassigning the variable of a
# "for ... in range()" loop does not skip any iterations.
for y in range(height):
    x = 0
    while x < width:

        # Skip light pixels; a run can only start on a dark pixel.
        if data[x, y] >= 128:
            x += 1
            continue

        # Count the contiguous dark pixels starting at x.
        total = 0
        for c in range(x, width):

            # If the pixel is dark, add it to the total.
            if data[c, y] < 128:
                total += 1

            # If the pixel is light, stop the sequence.
            else:
                break

        # If the run is no longer than the chop factor, replace it with white.
        if total <= chop:
            for c in range(total):
                data[x + c, y] = 255

        # Move past the run we just examined.
        x += total


# Iterate through the columns.
for x in range(width):
    y = 0
    while y < height:

        # Skip light pixels; a run can only start on a dark pixel.
        if data[x, y] >= 128:
            y += 1
            continue

        # Count the contiguous dark pixels starting at y.
        total = 0
        for c in range(y, height):

            # If the pixel is dark, add it to the total.
            if data[x, c] < 128:
                total += 1

            # If the pixel is light, stop the sequence.
            else:
                break

        # If the run is no longer than the chop factor, replace it with white.
        if total <= chop:
            for c in range(total):
                data[x, y + c] = 255

        # Move past the run we just examined.
        y += total

image.save(sys.argv[3])
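
For experimenting without an image file, the same row-then-column pass can be sketched on a plain list of lists. Everything here is an assumption made for illustration: the names `chop_line` and `chop_grid`, and the `target`, `fill` and `inclusive` parameters, do not come from TesserCap. The `inclusive` flag exists because the whitepaper quote says "less than" while the script above uses "less than or equal to"; which one TesserCap actually applies is not confirmed here.

```python
from itertools import groupby


def chop_line(line, chop, target=0, fill=255, inclusive=True):
    """Return a copy of `line` with short runs of `target` replaced by `fill`.

    Hypothetical helper: inclusive=True chops runs of length <= chop,
    inclusive=False chops only runs of length < chop.
    """
    out = []
    for value, group in groupby(line):
        run = list(group)
        short = len(run) <= chop if inclusive else len(run) < chop
        if value == target and short:
            run = [fill] * len(run)
        out.extend(run)
    return out


def chop_grid(grid, chop, **kwargs):
    """Apply chop_line to every row, then to every column, of a 2D list."""
    rows = [chop_line(row, chop, **kwargs) for row in grid]
    # zip(*rows) transposes the grid so each column can be handled as a line.
    cols = [chop_line(list(col), chop, **kwargs) for col in zip(*rows)]
    # Transpose back to row-major order.
    return [list(row) for row in zip(*cols)]


B, W = 0, 255
noisy = [
    [B, W, W, W, W],
    [W, B, B, B, W],
    [W, B, B, B, W],
    [W, B, B, B, W],
    [W, W, W, W, B],
]
cleaned = chop_grid(noisy, 2)
for row in cleaned:
    print("".join("#" if p == B else "-" for p in row))
```

With a chop factor of 2, the two isolated corner pixels are removed while the 3x3 block survives, because each of its rows and columns is a run of 3.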