
The Pythonic way to grow a list of lists

I have a large file (2 GB) of categorical data (mostly "NaN", but populated here and there with actual values) that is too large to read into a single data frame. I had a rather difficult time coming up with an object to store all the unique values for each column (which is my goal; eventually I need to factorize this for modeling).

What I ended up doing was reading the file in chunks into a dataframe, then getting the unique values of each column and storing them in a list of lists. My solution works, but seems most un-Pythonic. Is there a cleaner way to accomplish this in Python (3.5)? I do know the number of columns (~2100).

import pandas as pd

# large file of comma-separated text data
data = pd.read_csv("./myratherlargefile.csv", chunksize=100000, dtype=str)

collist = []
master = []
i = 0
initialize = 0
for chunk in data:
    # the first time through, build the "master" list
    if initialize == 0:
        for col in chunk:
            # thinking about this, I should have just dropped this col
            if col == 'Id':
                continue
            else:
                # use pd.unique as a built-in solution to get unique values
                collist = chunk[col][chunk[col].notnull()].unique().tolist()
                master.append(collist)
                i = i + 1
    # but after the first loop, just append to the master list at
    # each master-list element
    if initialize == 1:
        for col in chunk:
            if col == 'Id':
                continue
            else:
                collist = chunk[col][chunk[col].notnull()].unique().tolist()
                master[i] = master[i] + collist
                i = i + 1
    initialize = 1
    i = 0


After that, my final task for all the unique values is as follows:

i = 0
names = chunk.columns.tolist()
for item in master:
    master[i] = list(set(item))
    # note: list.append returns None, so don't assign its result
    master[i].append(names[i + 1])
    i = i + 1


Thus master[i] gives me the column name followed by a list of its unique values. Crude, but it does work; my main concern is building the list in a "better" way if possible.

Answer

I would suggest instead of a list of lists, using a collections.defaultdict(set).

Say you start with

uniques = collections.defaultdict(set)

Now the loop can become something like this:

for chunk in data: 
    for col in chunk:
        uniques[col] = uniques[col].union(chunk[col].unique())

Note that:

  1. defaultdict always has a set for uniques[col] (that's what it's there for), so you can skip the initialize flag and the first-chunk special case entirely.

  2. For a given col, you simply update the entry with the union of the current set (which initially is empty, but it doesn't matter) and the new unique elements.
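To see why no initialization step is needed, a minimal sketch (the "color" key and its values are hypothetical):

```python
from collections import defaultdict

# defaultdict(set) hands back a fresh empty set the first time a key
# is touched, so there is no "is this the first chunk?" bookkeeping.
uniques = defaultdict(set)

uniques["color"].update(["red", "blue", "red"])   # key created on first access
uniques["color"].update(["blue", "green"])        # later chunks just merge in

print(sorted(uniques["color"]))  # ['blue', 'green', 'red']
```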

Edit

As Raymond Hettinger notes (thanks!), it is better to use

    uniques[col].update(chunk[col].unique())
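Putting the pieces together, the whole chunked pass might look like the sketch below. The in-memory CSV stands in for the asker's 2 GB file, dropna() plays the role of the notnull() filtering in the question, and the 'Id' column is skipped as the asker does:

```python
import io
from collections import defaultdict

import pandas as pd

# Tiny in-memory stand-in for the large on-disk file.
csv_text = "Id,colA,colB\n1,x,\n2,y,p\n3,x,q\n4,,p\n"

uniques = defaultdict(set)
for chunk in pd.read_csv(io.StringIO(csv_text), chunksize=2, dtype=str):
    for col in chunk:
        if col == "Id":  # skip the identifier column, as in the question
            continue
        # dropna() removes the NaN placeholders before collecting uniques
        uniques[col].update(chunk[col].dropna().unique())

print({col: sorted(vals) for col, vals in uniques.items()})
# {'colA': ['x', 'y'], 'colB': ['p', 'q']}
```

The column name is now the dict key, so the final pass that pairs each list with its name in the question's code becomes unnecessary.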