
Is pandas read_csv really slow compared to python open?

My requirement is to remove duplicate rows from a CSV file, but the file is 11.3 GB. So I benchmarked pandas against a plain Python file generator.

Python File Generator:

def fileTestInPy():
    with open(r'D:\my-file.csv') as fp, open(r'D:\mining.csv', 'w') as mg:
        dups = set()
        for i, line in enumerate(fp):
            if i == 0:
                continue
            cols = line.split(',')
            if cols[0] in dups:
                continue
            dups.add(cols[0])
            mg.write(line)
            mg.write('\n')



Using Pandas read_csv:

import pandas as pd

df = pd.read_csv(r'D:\my-file.csv', sep=',', iterator=True, chunksize=1024*128)

def fileInPandas():
    for d in df:
        d_clean = d.drop_duplicates('NPI')
        d_clean.to_csv(r'D:\mining1.csv', mode='a')



Details:
Size: 11.3 GB
Rows: 100 million, of which about 50 million are duplicates
Python version: 3.5.2
pandas version: 0.19.0
RAM: 8 GB
CPU: Core i5 2.60 GHz

What I observed: the Python file generator took 643 seconds, but pandas took 1756 seconds.

My system also stayed responsive when I used the Python file generator, but it hung when I used pandas.

Am I using pandas the right way?
Also, I want to sort the 11.3 GB file; how can I do that?

Answer

Pandas is not a good choice for this task. Even read in chunks, it still parses every column of every row, does string-to-number conversion on all of them, and materializes each chunk as a full DataFrame. I'm not surprised that your machine bogged down! There is also a correctness problem: drop_duplicates only sees one chunk at a time, so a row whose first occurrence landed in an earlier chunk will not be removed, and to_csv(mode='a') appends the header row again for every chunk.
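
If you do want to stay in pandas, here is a minimal sketch of what a correct chunked version could look like: read every column as a string to skip the numeric conversion, carry the set of already-seen keys across chunks, and write the header only once. (dedupe_with_pandas is a hypothetical name; the 'NPI' column comes from your code.)

import pandas as pd

def dedupe_with_pandas(src=r'D:\my-file.csv', dst=r'D:\mining1.csv', key='NPI'):
    seen = set()          # keys already written, across *all* chunks
    first = True
    reader = pd.read_csv(src, sep=',', dtype=str,  # dtype=str skips type inference
                         chunksize=1024 * 128)
    for chunk in reader:
        chunk = chunk.drop_duplicates(key)         # dedupe within this chunk
        chunk = chunk[~chunk[key].isin(seen)]      # drop keys seen in earlier chunks
        seen.update(chunk[key])
        chunk.to_csv(dst, mode='a', index=False, header=first)
        first = False

Even fixed up like this, it holds the same seen set in memory as the plain-Python version and still pays for building a DataFrame per chunk, so don't expect it to win the benchmark.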

The line-by-line version is much leaner. It doesn't do any conversions, doesn't bother looking at unimportant columns and doesn't keep a large dataset in memory. It is the better tool for the job.

def fileTestInPy():
    with open(r'D:\my-file.csv') as fp, open(r'D:\mining.csv', 'w') as mg:
        dups = set()
        next(fp) # <-- advance fp so you don't need to check each line
                 # or use enumerate
        for line in fp:
            col = line.split(',', 1)[0]  # <-- only split what you need
            if col in dups:
                continue
            dups.add(col)
            mg.write(line)
            # mg.write('\n')  # <-- line still has its \n, did you
                              # want another?

Also, if this is Python 3.x and you know your file is ASCII or UTF-8, you could open both files in binary mode and save a decode/encode pass.
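
A minimal sketch of that binary-mode variant, same logic as above but with bytes instead of str (note the b',' separator):

def fileTestInPyBinary():
    with open(r'D:\my-file.csv', 'rb') as fp, open(r'D:\mining.csv', 'wb') as mg:
        dups = set()
        next(fp)                          # skip the header row
        for line in fp:
            col = line.split(b',', 1)[0]  # split bytes, only the first field
            if col in dups:
                continue
            dups.add(col)
            mg.write(line)                # line already ends with b'\n'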
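
As for your second question: an 11.3 GB file won't fit in 8 GB of RAM, so the standard approach is an external merge sort: sort the file in chunks, write each sorted run to a temporary file, then do a k-way merge with heapq.merge (its key argument needs Python 3.5+, which you have). A minimal sketch, assuming you sort on the first column and the file has one header row; external_sort and lines_per_chunk are hypothetical names:

import heapq
import os
import tempfile

def external_sort(src=r'D:\my-file.csv', dst=r'D:\sorted.csv', lines_per_chunk=1000000):
    sort_key = lambda line: line.split(',', 1)[0]     # sort on the first column
    runs = []
    with open(src) as fp:
        header = next(fp)                             # keep the header out of the sort
        while True:
            chunk = [line for _, line in zip(range(lines_per_chunk), fp)]
            if not chunk:
                break
            chunk.sort(key=sort_key)                  # in-memory sort of one run
            tmp = tempfile.NamedTemporaryFile('w', delete=False, suffix='.csv')
            tmp.writelines(chunk)
            tmp.close()
            runs.append(tmp.name)
    with open(dst, 'w') as out:                       # k-way merge of the sorted runs
        out.write(header)
        files = [open(name) for name in runs]
        out.writelines(heapq.merge(*files, key=sort_key))
        for f in files:
            f.close()
    for name in runs:
        os.remove(name)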
