fbrundu - 1 year ago 219

Python Question

I am able to discretize a Pandas dataframe by columns with this code:

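```
import numpy as np
import pandas as pd

# NOTE: `mad` is not defined or imported in the post; it is assumed to be a
# median-absolute-deviation helper (for example `statsmodels.robust.mad`).

def discretize(X, n_scale=1):
    for c in X.columns:
        loc = X[c].median()
        # median absolute deviation of the column
        scale = mad(X[c])
        bins = [-np.inf, loc - (scale * n_scale),
                loc + (scale * n_scale), np.inf]
        X[c] = pd.cut(X[c], bins, labels=[-1, 0, 1])
    return X
```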

I want to discretize each column using as parameters: loc (the median of the column) and scale (the median absolute deviation of the column).
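To make the binning rule concrete, here is a small worked example for a single column. It is only a sketch: the `mad` helper below is an assumption (an unscaled median absolute deviation), since the post does not show where `mad` comes from, and a library version that applies a normalization constant would give different bin edges.

```python
import numpy as np
import pandas as pd


def mad(values):
    """Unscaled median absolute deviation: median(|x - median(x)|).

    Hypothetical stand-in for the `mad` used in the post.
    """
    values = np.asarray(values, dtype=float)
    return float(np.median(np.abs(values - np.median(values))))


col = pd.Series([1.0, 2.0, 3.0, 4.0, 100.0])
n_scale = 1

loc = col.median()   # 3.0
scale = mad(col)     # 1.0, the outlier 100.0 barely affects it

# Three bins: (-inf, loc - scale], (loc - scale, loc + scale], (loc + scale, inf)
bins = [-np.inf, loc - scale * n_scale, loc + scale * n_scale, np.inf]
labels = pd.cut(col, bins, labels=[-1, 0, 1])

print(labels.tolist())  # [-1, -1, 0, 0, 1]
```

Note that `pd.cut` uses right-closed intervals by default, so a value exactly equal to `loc - scale` gets the label `-1` and a value equal to `loc + scale` gets `0`.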

With small dataframes the running time is acceptable, given that this is a single-threaded solution.

With larger dataframes, however, I want to use multiple threads (or processes) to speed up the computation.

I am no expert in Dask, but it should provide a solution to this problem.

In my case, the discretization should be feasible with code along these lines:

```
import dask.dataframe as dd
import numpy as np
import pandas as pd

def discretize(X, n_scale=1):
    # I'm using only 2 partitions for this example
    X_dask = dd.from_pandas(X, npartitions=2)
    # FIXME:
    # how can I define bins to compute loc and scale
    # for each column?
    bins = [-np.inf, loc - (scale * n_scale),
            loc + (scale * n_scale), np.inf]
    X = X_dask.apply(pd.cut, axis=1, args=(bins,), labels=[-1, 0, 1]).compute()
    return X
```

but the problem here is that `loc` and `scale` depend on the column values, so they have to be computed for each column.

How can it be done?
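One possible way to get per-column `loc` and `scale` without any loop is to compute them for the whole frame at once and compare against them directly. The sketch below is not from the post: `discretize_vectorized` is a hypothetical name, and it assumes `mad` means the unscaled median absolute deviation. It reproduces the right-closed intervals that `pd.cut` uses by default, but returns plain integers rather than categoricals.

```python
import pandas as pd


def discretize_vectorized(X, n_scale=1):
    """Label every cell -1, 0 or 1 using per-column median and MAD."""
    loc = X.median()                  # Series: one median per column
    scale = (X - loc).abs().median()  # Series: unscaled MAD per column
    lower = loc - scale * n_scale
    upper = loc + scale * n_scale

    # Same intervals as pd.cut's default (right-closed):
    # (-inf, lower] -> -1, (lower, upper] -> 0, (upper, inf) -> 1
    above = X.gt(upper, axis=1).astype(int)
    below = X.le(lower, axis=1).astype(int)
    # Caveat: NaN cells end up as 0 here, whereas pd.cut would keep them NaN.
    return above - below


X = pd.DataFrame({
    "a": [1.0, 2.0, 3.0, 4.0, 100.0],
    "b": [10.0, 20.0, 30.0, 40.0, 50.0],
})
out = discretize_vectorized(X)
print(out)
```

Because all the work happens in whole-frame operations, this may already be fast enough on larger frames before any threads or processes are involved, though that depends on the data and has not been measured here.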

Answer

I've never used `dask`, but I guess you can define a new function to be used in `apply`.

```
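import dask.dataframe as dd
import multiprocessing as mp
import numpy as np
import pandas as pd

# NOTE: `mad` is not defined in the post; it is assumed to be a
# median-absolute-deviation helper (for example `statsmodels.robust.mad`).

def discretize(X, n_scale=1):
    # Transpose so each original column becomes a row, then apply row-wise
    X_dask = dd.from_pandas(X.T, npartitions=mp.cpu_count()+1)
    # `columns=` is the keyword as written in the post; newer Dask releases
    # appear to expect the output metadata via `meta=` instead
    X = X_dask.apply(_discretize_series,
                     axis=1, args=(n_scale,),
                     columns=X.columns).compute().T
    return X

def _discretize_series(x, n_scale=1):
    # Per-series location (median) and scale (median absolute deviation)
    loc = x.median()
    scale = mad(x)
    bins = [-np.inf, loc - (scale * n_scale),
            loc + (scale * n_scale), np.inf]
    x = pd.cut(x, bins, labels=[-1, 0, 1])
    return x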
```
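The same "one function per series" idea can also be tried with only the standard library, which may be handy for checking the logic without Dask. This is a sketch under assumptions, not the answer's method: `mad`, `discretize_series` and `discretize_parallel` are hypothetical names, and `mad` is taken to be the unscaled median absolute deviation. Whether threads actually help depends on how much of the work releases the GIL; swapping in `ProcessPoolExecutor` is an option if they do not.

```python
from concurrent.futures import ThreadPoolExecutor

import numpy as np
import pandas as pd


def mad(values):
    """Unscaled median absolute deviation (assumed definition)."""
    values = np.asarray(values, dtype=float)
    return float(np.median(np.abs(values - np.median(values))))


def discretize_series(x, n_scale=1):
    """Cut one column into -1 / 0 / 1 around its own median."""
    loc = x.median()
    scale = mad(x)
    bins = [-np.inf, loc - scale * n_scale, loc + scale * n_scale, np.inf]
    return pd.cut(x, bins, labels=[-1, 0, 1])


def discretize_parallel(X, n_scale=1, max_workers=4):
    """Discretize each column in a thread pool; X itself is not modified."""
    columns = [X[c] for c in X.columns]
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        results = list(pool.map(lambda s: discretize_series(s, n_scale), columns))
    # Each result keeps its column name, so concat rebuilds the frame
    return pd.concat(results, axis=1)


X = pd.DataFrame({
    "a": [1.0, 2.0, 3.0, 4.0, 100.0],
    "b": [10.0, 20.0, 30.0, 40.0, 50.0],
})
out = discretize_parallel(X)
print(out)
```

Working column by column avoids the transpose, since each task receives a whole column and can compute its own `loc` and `scale`.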