Jane Lee - 2 days ago
Python Question

Speed up my data reading in python?

My current code looks like this:

import pandas as pd
import csv
import matplotlib.pyplot as plt

def data_reader(filename, rowname):
    with open(filename, newline='') as fp:
        yield from (row[1:] for row in csv.reader(fp, skipinitialspace=True)
                    if row[0] == rowname)

File = 'data.csv'
ASA = pd.DataFrame.from_records(data_reader(File, 'ASA'))
GDS = pd.DataFrame.from_records(data_reader(File, 'GDS'))
SCD = pd.DataFrame.from_records(data_reader(File, 'SCD'))
ASF = pd.DataFrame.from_records(data_reader(File, 'ASF'))
ADC = pd.DataFrame.from_records(data_reader(File, 'ADC'))
DFS = pd.DataFrame.from_records(data_reader(File, 'DFS'))
DCS = pd.DataFrame.from_records(data_reader(File, 'DCS'))
DFDS = pd.DataFrame.from_records(data_reader(File, 'DFDS'))


It is reading data that looks like this:

legend, useless data, useless data, DCS, useless data, sped, air, xds, sas, dac
legend, useless data, useless data, GDS, useless data, sped, air
Legend, useless data, useless data, ASA, useless data, sped, air, gnd
ASA, 231, 123, 12
GDS, 12, 1
DCS, 13, 12, 123, 12, 4
ASA, 123, 132, 12
and so on for couple of millions....


I am trying to write an IF statement that looks something like this:

pd.DataFrame.from_records(data_reader(
if rowname = 'ASA'
ASA.append(row)
elif rowname = 'GDS'
GDS.append(row)


and so on. Would this be faster? Currently it takes about a minute to run my code and plot one graph, and I expect it to take much longer once I have about 10-15 plots to do. I have tried different ways of writing the if/elif statement but have not had any luck.

Answer

Reading from disk is the bottleneck here: your current code re-opens and re-scans the file once per DataFrame, eight times in total, so we should avoid reading the file more than once. If you have enough memory to parse the entire CSV into a dict of lists, then you could use

import collections
import csv

import pandas as pd

def data_reader(filename):
    dfs = collections.defaultdict(list)
    columns = dict()
    with open(filename, newline='') as fp:
        for row in csv.reader(fp, skipinitialspace=True):
            key = row[0].upper()
            if key == 'LEGEND':
                name = row[3]
                columns[name] = row
            else:
                dfs[key].append(row[1:])

    for key in dfs:
        num_cols = len(dfs[key][0])
        dfs[key] = pd.DataFrame(dfs[key], columns=columns[key][-num_cols:])
    return dfs

filename = 'data.csv'
dfs = data_reader(filename)
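If the per-row Python loop is still too slow, an alternative (a sketch only, using a hypothetical in-memory sample and made-up column names c1..c3, since the real files pad rows to different lengths) is to let pandas scan the file once and split it afterwards with groupby:

```python
import io

import pandas as pd

# Hypothetical sample mirroring the question's data rows (legend rows
# omitted for brevity; placeholder column names are passed explicitly).
raw = io.StringIO(
    "ASA, 231, 123, 12\n"
    "GDS, 12, 1\n"
    "ASA, 123, 132, 12\n"
)

# Read every line once; rows shorter than the names list are padded
# with NaN, so ragged rows parse without errors.
df = pd.read_csv(raw, header=None, skipinitialspace=True,
                 names=['key', 'c1', 'c2', 'c3'])

# Split into one DataFrame per key without re-reading the input.
dfs = {key: group.drop(columns='key').dropna(axis=1, how='all')
       for key, group in df.groupby('key')}
```

This pushes the parsing into pandas' C-level CSV reader, which is typically much faster than a pure-Python csv loop for files with millions of rows.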

The loop

for row in csv.reader(fp, skipinitialspace=True):
    key = row[0].upper()
    ...
    dfs[key].append(row[1:])

loads the CSV into a dict, dfs. The dict keys are strings like 'ASA', 'GDS' and 'DCS'. The dict values are lists of lists.
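A minimal illustration of why collections.defaultdict(list) is convenient here: appending to a key that does not exist yet creates the empty list automatically, so no membership check is needed before each append.

```python
import collections

dfs = collections.defaultdict(list)

# No need for "if key not in dfs: dfs[key] = []" -- the defaultdict
# creates the list on first access.
dfs['ASA'].append(['231', '123', '12'])
dfs['ASA'].append(['123', '132', '12'])
```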

The other loop

for key in dfs:
    ...
    dfs[key] = pd.DataFrame(dfs[key], columns=columns[key][-num_cols:])

converts the lists of lists to DataFrames.
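As a small standalone sketch of that conversion step, using the 'ASA' rows from the sample data and the trailing names from its legend row:

```python
import pandas as pd

# The accumulated 'ASA' rows (keys already stripped off).
rows = [['231', '123', '12'], ['123', '132', '12']]

# The last num_cols entries of the matching legend row become headers.
df = pd.DataFrame(rows, columns=['sped', 'air', 'gnd'])
```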


The if-statement:

if key == 'LEGEND':
    name = row[3]
    columns[name] = row
else:
    dfs[key].append(row[1:])

records the row in the columns dict if the row begins with LEGEND (with or without capitalization), or otherwise records the row in the dfs dict.

Later in the for-loop:

for key in dfs:
    num_cols = len(dfs[key][0])
    dfs[key] = pd.DataFrame(dfs[key], columns=columns[key][-num_cols:])

The keys are strings such as 'ASA'. For each key, the number of columns is obtained by finding the length of the first row in dfs[key]. Presumably, every row will have the same length -- otherwise an exception will be raised by pd.DataFrame.

columns[key] returns the corresponding legend row for key. columns[key][-num_cols:] returns the last num_cols values from that row.
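For example, with the ASA legend row from the sample data:

```python
legend = ['Legend', 'useless data', 'useless data', 'ASA',
          'useless data', 'sped', 'air', 'gnd']
num_cols = 3  # length of an ASA data row minus its key

# A negative slice keeps the last num_cols entries, which are
# exactly the column names.
print(legend[-num_cols:])  # → ['sped', 'air', 'gnd']
```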


The result returned by data_reader is a dict of DataFrames:

In [211]: dfs['ASA']
Out[211]: 
  sped  air gnd 
0  231  123   12
1  123  132   12

In [212]: dfs['GDS']
Out[212]: 
  sped air
0   12   1

In [213]: dfs['DCS']
Out[213]: 
  sped air  xds sas dac
0   13  12  123  12   4