Agostino Agostino - 11 days ago 6
Python Question

Compare 2 files line by line ignoring newline differences

I'm using Python 2.7 to compare two text files line by line, ignoring:


  1. different line endings ('\r\n' vs '\n')

  2. number of empty lines at the end of the files



Below is the code I have. It works for point 2., but it does not work for point 1. The files I'm comparing can be big, so I'm reading them line by line. Please, don't suggest zip or similar libraries.

def compare_files_by_line(fpath1, fpath2):
    # notice the opening mode 'r'
    with open(fpath1, 'r') as file1, open(fpath2, 'r') as file2:
        file1_end = False
        file2_end = False
        found_diff = False
        while not file1_end and not file2_end and not found_diff:
            try:
                # reasons for stripping explained below
                f1_line = next(file1).rstrip('\n')
            except StopIteration:
                f1_line = None
                file1_end = True
            try:
                f2_line = next(file2).rstrip('\n')
            except StopIteration:
                f2_line = None
                file2_end = True

            if f1_line != f2_line:
                if file1_end or file2_end:
                    if not (f1_line == '' or f2_line == ''):
                        found_diff = True
                        break
                else:
                    found_diff = True
                    break

    return not found_diff


You can see this code fail on point 1 by feeding it two files, one with a line ending in a UNIX newline

abc\n


the other having a line ending with a Windows newline

abc\r\n


I'm stripping the newline characters before each comparison to account for point 2. That way, when two files contain the same series of lines, the code recognizes them as "not different" even if one file ends with an empty line and the other does not.

According to this answer, opening the files in 'r' mode (instead of 'rb') should take care of the OS-specific line endings and read them all as '\n'. This is not happening.
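For context, in Python 2 plain 'r' mode leaves newline handling to the platform's C library, so on Linux or macOS a '\r\n' is passed through unchanged. Universal-newline translation has to be requested explicitly, for example with mode 'rU' or with io.open and newline=None. The following is only a small sketch of the io.open route; the helper name read_lines_universal and the ASCII encoding are assumptions made for the demonstration, and it reads the whole file into a list purely to keep the example short.

```python
import io
import os
import tempfile

def read_lines_universal(path):
    # Hypothetical helper: newline=None enables universal-newline
    # translation, so '\r\n' and '\r' both come back as '\n'.
    with io.open(path, 'r', newline=None, encoding='ascii') as f:
        return list(f)

# Write Windows line endings in binary mode so the bytes on disk
# are exactly the ones given here.
fd, path = tempfile.mkstemp()
with os.fdopen(fd, 'wb') as f:
    f.write(b'abc\r\ndef\r\n')

lines = read_lines_universal(path)
os.remove(path)
```

Iterating over the file object instead of calling list() gives the same translated lines one at a time, which suits large files better.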

How can I make this work to treat line endings '\r\n' just as '\n' endings?

I'm using Python 2.7.12 with the Anaconda distribution 4.2.0.

Answer

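The problem is the argument to rstrip. rstrip('\n') removes only trailing '\n' characters, so a line read as 'abc\r\n' becomes 'abc\r', which is not equal to 'abc'. Strip both characters instead: f1_line.rstrip('\r\n'), and the same for f2_line.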
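A quick way to see the difference, using the two sample lines from the question (the variable names are just for illustration):

```python
unix_line = 'abc\n'
windows_line = 'abc\r\n'

# rstrip('\n') removes only trailing '\n', so the '\r' survives
only_lf = (unix_line.rstrip('\n'), windows_line.rstrip('\n'))

# rstrip('\r\n') treats its argument as a set of characters and
# removes any trailing run of '\r' and '\n'
both = (unix_line.rstrip('\r\n'), windows_line.rstrip('\r\n'))
```

Here only_lf holds 'abc' and 'abc\r', which compare unequal, while both holds 'abc' twice.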

Also I think your program can be simplified:

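from itertools import izip_longest

def compare_files(fpath1, fpath2):
    with open(fpath1, 'r') as file1, open(fpath2, 'r') as file2:
        # izip_longest is lazy, so the files are still read line by line.
        # The shorter file is padded with '' so that extra empty lines
        # at the end of the longer file are ignored (point 2).
        for linef1, linef2 in izip_longest(file1, file2, fillvalue=''):
            linef1 = linef1.rstrip('\r\n')
            linef2 = linef2.rstrip('\r\n')

            if linef1 != linef2:
                return False
    return True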
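
If the same check has to run under both Python 2 and Python 3, one possible sketch is below. It assumes binary mode ('rb') so that no newline translation happens on any platform, and it pads the shorter file with empty lines so that trailing empty lines are ignored. The names files_equal_ignoring_newlines and _write_temp are hypothetical, and the temporary files exist only to exercise the function.

```python
import os
import tempfile

try:
    from itertools import izip_longest as zip_longest  # Python 2
except ImportError:
    from itertools import zip_longest  # Python 3

def files_equal_ignoring_newlines(fpath1, fpath2):
    # Binary mode: lines are split on b'\n' and returned untranslated.
    with open(fpath1, 'rb') as file1, open(fpath2, 'rb') as file2:
        # Pad the shorter file with b'' so extra trailing empty
        # lines in the longer file compare as equal.
        for line1, line2 in zip_longest(file1, file2, fillvalue=b''):
            if line1.rstrip(b'\r\n') != line2.rstrip(b'\r\n'):
                return False
    return True

def _write_temp(data):
    # Hypothetical test helper: write raw bytes to a temporary file.
    fd, path = tempfile.mkstemp()
    with os.fdopen(fd, 'wb') as f:
        f.write(data)
    return path

unix_path = _write_temp(b'abc\ndef\n')
windows_path = _write_temp(b'abc\r\ndef\r\n\r\n\r\n')
other_path = _write_temp(b'abc\ndef\nghi\n')

same = files_equal_ignoring_newlines(unix_path, windows_path)
different = files_equal_ignoring_newlines(unix_path, other_path)

for p in (unix_path, windows_path, other_path):
    os.remove(p)
```

With these inputs, same is True (only the line endings and the trailing empty lines differ) and different is False (the third file has an extra non-empty line).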