Plaiska - 9 months ago
Java Question

Efficient way to find the differences in 2 csv files

I have 2 large (say 300,000 × 100) rectangular csv files which have the same number of corresponding rows and columns. I need to find the differences, if any, between each corresponding cell in the 2 files.
The program should output the row and column number of each cell that differs, along with the differing contents.

Since the number of rows/columns is very large, I'm looking for the most efficient way to do this.

Initially I started exploring with awk, and this looked promising, but I haven't worked with awk before and was unsuccessful in extending the example from 2 columns to 100.

Next, I tried a sort of brute-force approach using Java: load the files into two 2-D arrays, start 100 threads, each working on a given column, and when a difference is found, have each thread put the row, cell, and diff values into a HashMap (with the column number as key). I did try optimizing it by comparing while reading the second file into the array, but since I'm visiting each cell anyway, there is no way it can be fast (it took almost 8 hours to complete the comparison).
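For reference, a minimal single-threaded streaming sketch of the comparison described above (class and method names are my own; it avoids loading both files into 2-D arrays and skips splitting rows that are byte-for-byte equal, which is usually the vast majority):

```java
import java.util.ArrayList;
import java.util.List;

public class CsvDiff {
    // Compare two equally sized tables row by row and return
    // one "row,col,left,right" record per differing cell (1-based indices).
    static List<String> diff(List<String> left, List<String> right) {
        List<String> out = new ArrayList<>();
        for (int r = 0; r < left.size(); r++) {
            String la = left.get(r), lb = right.get(r);
            if (la.equals(lb)) continue;          // cheap whole-row check first
            String[] ca = la.split(",", -1);      // -1 keeps trailing empty cells
            String[] cb = lb.split(",", -1);
            for (int i = 0; i < ca.length; i++) {
                if (!ca[i].equals(cb[i])) {
                    out.add((r + 1) + "," + (i + 1) + "," + ca[i] + "," + cb[i]);
                }
            }
        }
        return out;
    }
}
```

To stream from disk instead of holding both tables in memory, the same loop body can be driven by two `BufferedReader.readLine()` calls per iteration; the whole-row `equals` short-circuit is what makes this I/O-bound rather than CPU-bound.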

I'm ok with either awk or Java. And open to any other entirely different approach.

Answer Source

awk to the rescue!

It's a no-brainer to do this in awk

$ paste -d, file.1 file.2 | 
  awk -F, '{m=NF/2; for(i=1;i<=m;i++) if($i!=$(i+m)) print NR,i,$i,$(i+m)}'

prints "row# column# left right" values of the unequal cells.
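To sanity-check the one-liner, a toy run on two 2×3 files (file names assumed) where only row 1, column 2 differs:

```shell
# Build two tiny sample files that differ in one cell.
printf 'a,b,c\n1,2,3\n' > file.1
printf 'a,x,c\n1,2,3\n' > file.2

paste -d, file.1 file.2 |
  awk -F, '{m=NF/2; for(i=1;i<=m;i++) if($i!=$(i+m)) print NR,i,$i,$(i+m)}'
# prints: 1 2 b x
```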

If you want to print a key column in addition to the row number, you can add it easily

$ paste -d, file.1 file.2 | 
  awk -F, -v key=8 '{m=NF/2; 
                     for(i=1;i<=m;i++) 
                       if($i!=$(i+m)) print $key,NR,i,$i,$(i+m)}'