eleanora - 2 months ago
Python Question

Can dask be used to groupby and recode out of core?

I have 8 GB CSV files and 8 GB of RAM. Each file has two strings per row, in this form:

a,c
c,a
f,g
a,c
c,a
b,f
c,a


For smaller files, I remove duplicate rows, count how many copies of each row there were, and then recode the strings to integers as follows:

import pandas as pd
from sklearn.preprocessing import LabelEncoder
# Name the two columns explicitly (the prefix= option was removed in pandas 2.0).
df = pd.read_csv("file.txt", header=None, names=["ID_0", "ID_1"])

# Perform the groupby (before converting letters to digits).
df = df.groupby(['ID_0', 'ID_1']).size().rename('count').reset_index()

# Initialize the LabelEncoder.
le = LabelEncoder()
le.fit(df[['ID_0', 'ID_1']].values.flat)

# Convert to digits.
df[['ID_0', 'ID_1']] = df[['ID_0', 'ID_1']].apply(le.transform)


This gives:

   ID_0  ID_1  count
0     0     2      2
1     1     3      1
2     2     0      3
3     3     4      1


which is exactly what I need for this toy example.

For the larger file I can't take these steps because of lack of RAM.

I can imagine combining Unix sort with a bespoke Python solution that makes multiple passes over the data (see the sketch below). But someone suggested dask might be suitable, and having read the docs I am still not clear whether it is.
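
For example, the counting step alone could be pushed out of core with something like the following rough sketch. GNU sort is an external merge sort, so it spills to temporary files rather than RAM, and uniq -c prefixes each distinct row with its count (the file names here are placeholders, not from my actual setup):

import subprocess

# GNU sort spills to temporary files, so this counts duplicate rows
# without loading the whole file into RAM.
# "file.txt" and "counts.txt" are placeholder names.
with open("counts.txt", "w") as out:
    subprocess.run("sort file.txt | uniq -c", shell=True,
                   stdout=out, check=True)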


Can dask be used for this sort of out-of-core processing, or is there some other out-of-core pandas solution?

sim
Answer

Assuming that the grouped dataframe fits in your memory, the changes you would have to make to your code are pretty minor. Here's my attempt:

import pandas as pd
from dask import dataframe as dd
from sklearn.preprocessing import LabelEncoder

# Import the data as a dask dataframe, 100 MB per partition.
# Note that at this point no data is read yet; dask will read the file
# once compute() is called.
df = dd.read_csv("file.txt", header=None, names=["ID_0", "ID_1"],
                 blocksize=100_000_000)

# Perform the groupby (before converting letters to digits).
# For better understanding, let's split this into two parts:
#     (i) define the groupby aggregation lazily on the dask dataframe
#     (ii) compute() executes it and returns a plain pandas object,
#          which we can then use for further analysis
# Only the aggregated counts (one row per distinct pair) are ever
# materialized in memory, not the raw rows.
pandas_df = df.groupby(['ID_0', 'ID_1']).size().compute()
pandas_df = pandas_df.rename('count').reset_index()

# Initialize the LabelEncoder.
le = LabelEncoder()
le.fit(pandas_df[['ID_0', 'ID_1']].values.flat)

# Convert to digits.
pandas_df[['ID_0', 'ID_1']] = pandas_df[['ID_0', 'ID_1']].apply(le.transform)

One possible solution in pandas would be to read the file in chunks (passing the chunksize argument to read_csv), run the groupby on the individual chunks, and combine the results.
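
A minimal sketch of that chunked approach (the chunksize of one million rows is an arbitrary choice, not a value from the question):

import pandas as pd

# Count pairs chunk by chunk; only the partial counts stay in memory.
chunks = pd.read_csv("file.txt", header=None, names=["ID_0", "ID_1"],
                     chunksize=1_000_000)
partials = [chunk.groupby(["ID_0", "ID_1"]).size() for chunk in chunks]

# Concatenate the partial counts and sum them per (ID_0, ID_1) pair;
# this matches the result of a single groupby over the whole file.
counts = (pd.concat(partials)
          .groupby(level=[0, 1]).sum()
          .rename("count")
          .reset_index())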


Here's how you can solve the problem in pure Python:

import pandas as pd
from sklearn.preprocessing import LabelEncoder

# Stream over the file line by line; only the pair counts are kept in memory.
counts = {}
with open('data') as fp:
    for line in fp:
        id1, id2 = line.rstrip().split(',')
        counts[(id1, id2)] = 1 + counts.get((id1, id2), 0)

df = pd.DataFrame(data=[(k[0], k[1], v) for k, v in counts.items()],
                  columns=['ID_0', 'ID_1', 'count'])
# apply label encoding etc.
le = LabelEncoder()
le.fit(df[['ID_0', 'ID_1']].values.flat)

# Convert to digits.
df[['ID_0', 'ID_1']] = df[['ID_0', 'ID_1']].apply(le.transform)