schaffioverflow schaffioverflow -4 years ago 105
MySQL Question

Compare and match 2 csv-files

I have 2 files(both files are > 1,000,000 rows):

First file just contains an md5 hash

second file contains md5;emailadress

now I have to compare this two files and if the md5 hash is the same -> write the emailadress in a separate file.

I tried it with mysql and join

SELECT `email` FROM `intern` INNER JOIN `extern` ON `intern`.`md5` = `extern`.`md5`


and reading the 2 csv and compare it in 2 loops. nodejs:

fs.readFile('public/md5-data/reachadout.csv', (err, internFile) => {
fs.readFile('public/md5-data/customer.csv', (err, externFile) => {
var internLines = internFile.toString().split('\n');
var externLines = externFile.toString().split('\n');
internLines.forEach(function(iLine){
var internCells = iLine.split(';');
if(externLines.indexOf(internCells[0]) > -1){
fs.appendFileSync('public/md5-data/blacklist.csv', internCells[1] + '\n');
}
})
})
})


Both ways are horrible slow. Has anyone another algorithm or solution for me?

Answer Source

Instead make a lookup object, where you can lookup whether a hash is in the externFile I modified your code to do this

fs.readFile('public/md5-data/reachadout.csv', (err, internFile) => {
    fs.readFile('public/md5-data/customer.csv', (err, externFile) => {
      var internLines = internFile.toString().split('\n');
      var externLines = externFile.toString().split('\n');
      var externLookup = {};   //Added line
      externLines.forEach(function (eLine){   //Added line
          externLookup[eLine] = true;         //Added line
      });                                     //Added line
      internLines.forEach(function(iLine){
        var internCells = iLine.split(';');
        if(externLookup[internCells[0]]){  //Changed line
          fs.appendFileSync('public/md5-data/blacklist.csv', internCells[1] + '\n');
        }
      })
    })
  })

The biggest problem with your code is that for each line in "InternFile", it has to check every line in the externfile and see if there is a match. If you make an object like I did, it is way faster to query whether an object exists

EDIT: Alternatively, you could use SQL like you did, but you have to add an INDEX to the hash_column.

Recommended from our users: Dynamic Network Monitoring from WhatsUp Gold from IPSwitch. Free Download