IcyFlame - 4 months ago
Python Question

Removing duplicate rows from a csv file using a python script

Goal

I have downloaded a CSV file from Hotmail, but it contains a lot of duplicate rows. The duplicates are exact copies, and I don't know why my phone created them.

I want to get rid of the duplicates.

Approach

Write a Python script to remove the duplicates.

Technical specification



Windows XP SP 3
Python 2.7
CSV file with 400 contacts


Answer

A more efficient version of @IcyFlame's solution:

with open('1.csv', 'r') as in_file, open('2.csv', 'w') as out_file:
    seen = set()  # lines already written; set membership tests are O(1) on average
    for line in in_file:
        if line in seen:
            continue  # skip duplicate line
        seen.add(line)
        out_file.write(line)
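
One caveat with comparing raw lines: a last row with no trailing newline will not match an earlier copy that has one, and a quoted field containing a line break spans several lines. A minimal sketch that compares parsed rows instead, assuming Python 3 (the question uses 2.7); the helper name `dedupe_rows` is made up for illustration and is not part of the original answer:

```python
import csv


def dedupe_rows(in_path, out_path):
    """Copy in_path to out_path, keeping only the first copy of each row.

    Hypothetical helper for illustration. Returns the number of rows kept.
    """
    seen = set()  # rows already written, stored as tuples so they are hashable
    kept = 0
    # newline='' lets the csv module handle line endings itself (Python 3)
    with open(in_path, 'r', newline='') as in_file, \
            open(out_path, 'w', newline='') as out_file:
        reader = csv.reader(in_file)
        writer = csv.writer(out_file)
        for row in reader:
            key = tuple(row)
            if key in seen:
                continue  # skip duplicate row
            seen.add(key)
            writer.writerow(row)
            kept += 1
    return kept


# Example call, using the file names from the answer above:
# dedupe_rows('1.csv', '2.csv')
```

Because the rows are re-serialized by `csv.writer`, quoting and line endings in the output may differ from the input even though the field values are the same.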

To edit the same file in place, you could use this:

import fileinput

seen = set()  # lines already written; set membership tests are O(1) on average
for line in fileinput.FileInput('1.csv', inplace=1):
    if line in seen:
        continue  # skip duplicate line
    seen.add(line)
    # Python 2 print statement; standard output is now redirected to the file,
    # and the trailing comma avoids adding a second newline
    print line,
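
Since in-place editing overwrites the original, it may be worth keeping a backup copy. A rough sketch of the same idea, assuming Python 3 (where `print` is a function) and using the `backup` argument of `fileinput.input`; the function name `dedupe_in_place` and the `.bak` suffix are only illustrative choices:

```python
import fileinput


def dedupe_in_place(path, backup_suffix='.bak'):
    """Remove duplicate lines from path, keeping the original as path + backup_suffix.

    Hypothetical helper for illustration, written for Python 3.
    """
    seen = set()  # lines already written back to the file
    # inplace=True redirects print() into the file being read;
    # passing backup keeps the untouched original next to it
    with fileinput.input(path, inplace=True, backup=backup_suffix) as lines:
        for line in lines:
            if line in seen:
                continue  # skip duplicate line
            seen.add(line)
            print(line, end='')  # line already ends with its own newline


# Example call, using the file name from the answer above:
# dedupe_in_place('1.csv')
```

After a run, `1.csv` holds the deduplicated lines and `1.csv.bak` holds the file as it was before.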