Ferdinand Ferdinand - 1 month ago 21
Python Question

Compare files line by line to see if they are the same, if so output them

How would I go about this, I have files which I have sorted the information in, I want to compare a certain index in that file with an index in another, one problem is that the files are enormously large, millions of lines. I want to compare line by line the files I have, if they match I want to input both those values along with other values using an index method.


Let me clarify, I want to take say line[x] the x will remain the same as it is formatted uniformly, I want to run line[x] against line[y] in another file, I want to do this to the whole file and output every matching pair to another file. In that other file I also want to be able to include other pieces from the first file which would be like just adding more indexes such as; line[a],line[b],line[c],line[d], and finally line[y] as the match to that information.

Try 3:

I have a file with information in this format:

#x is a line

x= data,data,data,data,data,data

there is millions of lines of that.

I have another file, same format:

xis a line
x= data,data,data,data

I want to use x[#] from first file and x[#] from second file, I want to see if those two values match, if they do I want to output those, along with several other x[#] values from the second file, which are on the same line.

Did that help at all to understand?
The format the files are in are like i said:(but there is millions, and I want to find the pairs in the two files because they all should match up)

line 1 data,data,data,data
line 2 data,data,data,data

data from file 1:

(N'068D556A1A665123A6DD2073A36C1CAF', N'A76EEAF6D310D4FD2F0BD610FAC02C04DFE6EB67',

data from file 2:

00000040f2213a27ff74019b8bf3cfd1|index.docbook|Redhat 7.3 (32bit)|Linux
00000040f69413a27ff7401b8bf3cfd1|index.docbook|Redhat 8.0 (32bit)|Linux
00000965b3f00c92a18b2b31e75d702c|Localizable.strings|Mac OS X 10.4|OSX
0000162d57845b6512e87db4473c58ea|SYSTEM|Windows 7 Home Premium (32bit)|Windows
000011b20f3cefd491dbc4eff949cf45|totem.devhelp|Linux Ubuntu Desktop 9.10 (32bit)|Linux

The order it is sorted in is alphanumeric, and I want to use a slider method. By that I mean if file1[x] is < file2[x] move the slider down or up depending on whether one value is greater than the other, until a match is found, when and if so, print the output along with other values that will identify that hash.

What I want as a result would be:

file1[x] and its corresponding match on file2[x] outputted to a file, as well as other file1[x] where x can be any index from the line.


What I got from the clarification:

  • file1 and file2 are in the same format, where each line looks like

    {32 char hex key}|{text1}|{text2}|{text3}
  • the files are sorted in ascending order by key

  • for each key that appears in both file1 and file2, you want merged output, so each line looks like

    {32 char hex key}|{text11}|{text12}|{text13}|{text21}|{text22}|{text23}

You basically want the collisions from a merge sort:

import csv

def getnext(csvfile, key=lambda row: int(row[0], 16)):
    row = csvfile.next()
    return key(row),row

with open('file1.dat','rb') as inf1, open('file2.dat','rb') as inf2, open('merged.dat','wb') as outf:
    a = csv.reader(inf1, delimiter='|')
    b = csv.reader(inf2, delimiter='|')
    res = csv.writer(outf, delimiter='|')

    a_key, b_key = -1, 0
        while True:
            while a_key < b_key:
                a_key, a_row = getnext(a)
            while b_key < a_key:
                b_key, b_row = getnext(b)
            if a_key==b_key:
                res.writerow(a_row + b_row[1:])
    except StopIteration:
        # reached the end of an input file

I still have no idea what you are trying to communicate by 'as well as other file1[x] where x can be any index from the line'.