Trying to solve a problem of preventing duplicate images to be uploaded.
I have two JPGs. Looking at them I can see that they are in fact identical. But for some reason they have different file size (one is pulled from a backup, the other is another upload) and so they have a different md5 checksum.
How can I efficiently and confidently compare two images in the same sense as a human would be able to see that they are clearly identical?
Example: http://static.peterbe.com/a.jpg and http://static.peterbe.com/b.jpg
I wrote this script:
import math, operator
from PIL import Image
def compare(file1, file2):
image1 = Image.open(file1)
image2 = Image.open(file2)
h1 = image1.histogram()
h2 = image2.histogram()
rms = math.sqrt(reduce(operator.add,
map(lambda a,b: (a-b)**2, h1, h2))/len(h1))
file1, file2 = sys.argv[1:]
print compare(file1, file2)
There is a OSS project that uses WebDriver to take screen shots and then compares the images to see if there are any issues (http://code.google.com/p/fighting-layout-bugs/)). It does it by openning the file into a stream and then comparing every bit.
You may be able to do something similar with PIL.
After more research I found
h1 = Image.open("image1").histogram() h2 = Image.open("image2").histogram() rms = math.sqrt(reduce(operator.add, map(lambda a,b: (a-b)**2, h1, h2))/len(h1))