My requirement is to remove duplicate rows from a CSV file, but the file is 11.3 GB in size. So I benchmarked pandas against a plain Python file generator.
Python File Generator:
with open(r'D:\my-file.csv') as fp, open(r'D:\mining.csv', 'w') as mg:
    dups = set()
    for i, line in enumerate(fp):
        if i == 0:
            continue                 # skip the header row
        cols = line.split(',')       # split every column
        if cols[0] in dups:          # dedupe on the first column (NPI)
            continue
        dups.add(cols[0])
        mg.write(line)
        mg.write('\n')
Pandas:
import pandas as pd
df = pd.read_csv(r'D:\my-file.csv', sep=',', iterator=True, chunksize=1024*128)
for d in df:
    d_clean = d.drop_duplicates('NPI')
Pandas is not a good choice for this task. Even when you read in chunks, it still parses the entire 11.3 GB file, builds a DataFrame for every chunk, and does string-to-number conversions on all of the columns. I'm not surprised that your machine bogged down!
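For comparison, a chunked pandas version that actually deduplicates across chunks and writes the result out might look roughly like the sketch below (dtype=str is my assumption, used here only to skip the numeric conversions; the 'NPI' column name and the paths come from your code). It still does far more work per row than a plain loop:

import pandas as pd

seen = set()
first_chunk = True
for chunk in pd.read_csv(r'D:\my-file.csv', dtype=str, chunksize=1024 * 128):
    # keep only rows whose NPI has not been seen in any earlier chunk
    chunk = chunk[~chunk['NPI'].isin(seen)].drop_duplicates('NPI')
    seen.update(chunk['NPI'])
    # append each cleaned chunk to the output file
    chunk.to_csv(r'D:\mining.csv', mode='w' if first_chunk else 'a',
                 header=first_chunk, index=False)
    first_chunk = False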
The line-by-line version is much leaner. It doesn't do any conversions, doesn't bother looking at unimportant columns and doesn't keep a large dataset in memory. It is the better tool for the job.
def fileTestInPy():
    with open(r'D:\my-file.csv') as fp, open(r'D:\mining.csv', 'w') as mg:
        dups = set()
        next(fp)  # <-- advance fp so you don't need to check each line,
                  #     or use enumerate
        for line in fp:
            col = line.split(',', 1)[0]  # <-- only split what you need
            if col in dups:
                continue
            dups.add(col)
            mg.write(line)
            # mg.write('\n')  # <-- line still has its \n, did you
            #                       want another?
Also, if this is Python 3.x and you know your file is ASCII or UTF-8, you could open both files in binary mode and save a conversion.
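A minimal sketch of that binary-mode variant could look like this (the function name is just for illustration, and it assumes the key you deduplicate on is the first column):

def fileTestInPyBinary():
    with open(r'D:\my-file.csv', 'rb') as fp, open(r'D:\mining.csv', 'wb') as mg:
        dups = set()
        next(fp)                          # skip the header row
        for line in fp:
            col = line.split(b',', 1)[0]  # key stays as raw bytes, never decoded
            if col in dups:
                continue
            dups.add(col)
            mg.write(line)                # bytes in, bytes out, no encoding step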