Fabio - 4 months ago
Python Question

MD5 hash of file calculated incorrectly in Python

I have a function for calculating the md5 hashes of all the files on a drive. A hash is calculated, but it differs from the hash I get from other programs or online services that are designed for that.

def md5_files(path, blocksize = 2**20):
    hasher = hashlib.md5()
    hashes = {}
    for root, dirs, files in os.walk(path):
        for file in files:
            file_path = os.path.join(root, file)
            print(file_path)
            with open(file_path, "rb") as f:
                data = f.read(blocksize)
                if not data:
                    break
                hasher.update(data)
                hashes[file_path] = hasher.hexdigest()
    return hashes


The path provided is the drive letter, for example "K:\". I then walk through the files and open each file for binary reading. I read chunks of data of the size specified in blocksize. Then I store the filename and md5 hash of every file in a dictionary called hashes. The code looks okay to me, and I also checked other questions on Stack Overflow. I don't know why the generated md5 hash is wrong.

Answer

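You need to construct a new md5 object for each file and read each file completely. In your version a single hasher is shared across all files, so every digest also covers the data of the files hashed before it, and only the first block of each file is read. For example: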

import hashlib
import os

def md5_files(path, blocksize = 2**20):
    hashes = {}
    for root, dirs, files in os.walk(path):
        for file in files:
            file_path = os.path.join(root, file)
            print(file_path)
            # Fresh hasher per file, so digests do not accumulate across files
            hasher = hashlib.md5()
            with open(file_path, "rb") as f:
                # Keep reading until f.read() returns b"" (end of file)
                data = f.read(blocksize)
                while data:
                    hasher.update(data)
                    data = f.read(blocksize)
            hashes[file_path] = hasher.hexdigest()
    return hashes
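
To see why sharing one hasher gives the wrong result, here is a small sketch rather than a definitive implementation. It relies on the fact that update() is cumulative: calling it with a and then b is equivalent to a single call with a + b. The helper name file_md5 and the two temporary demo files are my own assumptions for illustration; they are not from the question.

```python
import hashlib
import os
import tempfile


def file_md5(file_path, blocksize=2**20):
    """Return the MD5 hex digest of one file, read in blocksize chunks."""
    hasher = hashlib.md5()  # fresh object, used for this file only
    with open(file_path, "rb") as f:
        # Two-argument iter() calls f.read(blocksize) until it returns b""
        for chunk in iter(lambda: f.read(blocksize), b""):
            hasher.update(chunk)
    return hasher.hexdigest()


# Hypothetical demo data: two small temporary files
contents = [b"first file", b"second file"]

with tempfile.TemporaryDirectory() as tmp:
    paths = []
    for i, payload in enumerate(contents):
        p = os.path.join(tmp, f"demo_{i}.bin")
        with open(p, "wb") as f:
            f.write(payload)
        paths.append(p)
    # A tiny blocksize forces several reads per file
    per_file = [file_md5(p, blocksize=4) for p in paths]

# Reference digests computed from the raw bytes in one go
expected = [hashlib.md5(c).hexdigest() for c in contents]

# One hasher shared across both payloads, as in the question
shared = hashlib.md5()
shared_digests = []
for payload in contents:
    shared.update(payload)
    shared_digests.append(shared.hexdigest())

print(per_file == expected)              # per-file hashers match
print(shared_digests[0] == expected[0])  # first file still matches
print(shared_digests[1] == expected[1])  # second file does not
# The second shared digest is really the hash of both payloads joined
print(shared_digests[1] == hashlib.md5(b"".join(contents)).hexdigest())
```

With a shared hasher only the first file comes out right. Every later digest is the hash of everything fed in so far, which is why the values never matched what other tools reported.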