fbrundu - 4 months ago
Python Question

How to discretize large dataframe by columns with variable bins in Pandas/Dask

I am able to discretize a Pandas dataframe by columns with this code:

import numpy as np
import pandas as pd

def discretize(X, n_scale=1):

    for c in X.columns:
        loc = X[c].median()

        # median absolute deviation of the column
        scale = mad(X[c])

        bins = [-np.inf, loc - (scale * n_scale),
                loc + (scale * n_scale), np.inf]
        X[c] = pd.cut(X[c], bins, labels=[-1, 0, 1])

    return X


I want to discretize each column using two per-column parameters: loc (the median of the column) and scale (its median absolute deviation).
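As a sanity check, the three-bin cut behaves like this on a single column (a standalone sketch with made-up values; mad here is the plain, unscaled median absolute deviation):

```python
import numpy as np
import pandas as pd

# hypothetical column values, including one outlier
s = pd.Series([1.0, 2.0, 3.0, 4.0, 100.0])

loc = s.median()                    # median of the column
scale = np.median(np.abs(s - loc))  # unscaled median absolute deviation
n_scale = 1

bins = [-np.inf, loc - (scale * n_scale),
        loc + (scale * n_scale), np.inf]
labels = pd.cut(s, bins, labels=[-1, 0, 1])
```

Values below loc - scale map to -1, values above loc + scale map to 1, and everything in between maps to 0, so the outlier 100.0 ends up in the top bin regardless of its magnitude.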

With small dataframes the running time is acceptable, even though this is a single-threaded solution.

However, with larger dataframes I want to exploit more threads (or processes) to speed up the computation.

I am no expert in Dask, which seems to be the right tool for this problem.

However, in my case the discretization should be feasible with code like this:

import dask.dataframe as dd
import numpy as np
import pandas as pd

def discretize(X, n_scale=1):

    # I'm using only 2 partitions for this example
    X_dask = dd.from_pandas(X, npartitions=2)

    # FIXME:
    # how can I define bins to compute loc and scale
    # for each column?
    bins = [-np.inf, loc - (scale * n_scale),
            loc + (scale * n_scale), np.inf]

    X = X_dask.apply(pd.cut, axis=1, args=(bins,),
                     labels=[-1, 0, 1]).compute()

    return X


but the problem here is that loc and scale depend on the column values, so they should be computed for each column, either before or during the apply.

How can it be done?

Answer

I've never used Dask, but I guess you can define a new function to be used inside apply.

import dask.dataframe as dd
import multiprocessing as mp
import numpy as np
import pandas as pd

def discretize(X, n_scale=1):

    X_dask = dd.from_pandas(X.T, npartitions=mp.cpu_count()+1)
    X = X_dask.apply(_discretize_series,
                     axis=1, args=(n_scale,),
                     columns=X.columns).compute().T

    return X

def _discretize_series(x, n_scale=1):

    loc = x.median()
    # mad() is assumed to be in scope (e.g. statsmodels.robust.mad)
    scale = mad(x)
    bins = [-np.inf, loc - (scale * n_scale),
            loc + (scale * n_scale), np.inf]
    x = pd.cut(x, bins, labels=[-1, 0, 1])

    return x
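The transpose trick above exists so that each row seen by apply is one original column. The single-threaded pandas equivalent (with a plain, unscaled mad helper filled in as an assumption, since the original post does not show its mad implementation) is just a column-wise apply:

```python
import numpy as np
import pandas as pd

def mad(x):
    # unscaled median absolute deviation (assumed helper;
    # the original post does not show its definition)
    return np.median(np.abs(x - np.median(x)))

def _discretize_series(x, n_scale=1):
    loc = x.median()
    scale = mad(x)
    bins = [-np.inf, loc - (scale * n_scale),
            loc + (scale * n_scale), np.inf]
    return pd.cut(x, bins, labels=[-1, 0, 1])

# single-threaded reference: apply works column by column (axis=0)
df = pd.DataFrame({'a': [1.0, 2.0, 3.0, 4.0, 100.0],
                   'b': [10.0, 20.0, 30.0, 40.0, 50.0]})
out = df.apply(_discretize_series)
```

The Dask version should produce the same labels; the transpose plus row-wise apply is only there because partitioning happens along the row axis, so transposing lets whole columns be processed in parallel across partitions.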