user189035 - 1 year ago 147
Python Question

# assign hash to row of categorical data in pandas

So I have many pandas data frames with 3 columns of categorical variables:

``````
D              F     False
T              F     False
D              F     False
T              F     False
``````

The first and second columns can each take one of three values, and the third is binary, so there are 3 × 3 × 2 = 18 possible rows in total (not all combinations need appear in every data frame).

I would like to assign a number from 1 to 18 to each row, so that rows with the same combination of factors get the same number and vice versa (no hash collisions).

What is the most efficient way to do this in pandas?

So `all_combination_df` is a data frame with all possible combinations of the factors. I am trying to turn a data frame such as `big_df` into a Series of numbers, one per distinct combination:

``````
import itertools

import pandas


def expand_grid(data_dict):
    """Create a dataframe from every combination of given values."""
    rows = itertools.product(*data_dict.values())
    return pandas.DataFrame.from_records(rows, columns=list(data_dict.keys()))


all_combination_df = expand_grid(
    {'variable_1': ['D', 'A', 'T'],
     'variable_2': ['C', 'A', 'B'],
     'variable_3': [True, False]})

# Stack three copies so that every combination appears more than once.
big_df = pandas.concat([all_combination_df, all_combination_df, all_combination_df])
``````

I would try the `factorize` method:

``````
In [135]: df['category'] = pd.factorize(df.a + '~' + df.b + '~' + df.c.astype(str))[0]

In [136]: df
Out[136]:
   a  b      c  category
0  A  X   True         0
1  B  Y  False         1
2  A  X   True         0
3  C  Z  False         2
4  A  Z   True         3
5  C  Z   True         4
6  B  Y  False         1
7  C  Z  False         2
``````

Explanation: in this simple way we can glue all the columns into a single Series, which `factorize` then maps to integer codes:

``````
In [137]: df.a + '~' + df.b + '~' + df.c.astype(str)
Out[137]:
0     A~X~True
1    B~Y~False
2     A~X~True
3    C~Z~False
4     A~Z~True
5     C~Z~True
6    B~Y~False
7    C~Z~False
dtype: object
``````
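For anyone who wants to run the idea end to end, here is a minimal self-contained sketch. The sample frame is rebuilt by hand from the output shown above, and the `'~'` separator is assumed never to occur inside the values themselves (otherwise two different rows could produce the same key).

```python
import pandas as pd

# Sample frame rebuilt from the printed output above (columns a, b, c).
df = pd.DataFrame({
    'a': ['A', 'B', 'A', 'C', 'A', 'C', 'B', 'C'],
    'b': ['X', 'Y', 'X', 'Z', 'Z', 'Z', 'Y', 'Z'],
    'c': [True, False, True, False, True, True, False, False],
})

# Glue the three columns into one string key per row.
# Assumption: '~' never appears inside a value.
key = df['a'] + '~' + df['b'] + '~' + df['c'].astype(str)

# factorize returns (codes, uniques); codes start at 0 and follow
# the order in which each distinct key first appears.
codes, uniques = pd.factorize(key)
df['category'] = codes

print(df)
```

Adding 1 to `codes` gives 1-based numbers if that is preferred.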

A somewhat more general (but slower) solution, which works for any number of columns, is the following:

``````
In [141]: df.apply(lambda x: '~'.join(x.astype(str)), axis=1)
Out[141]:
0     A~X~True
1    B~Y~False
2     A~X~True
3    C~Z~False
4     A~Z~True
5     C~Z~True
6    B~Y~False
7    C~Z~False
dtype: object
``````
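One caveat: `factorize` numbers keys in order of first appearance, so the same combination can get a different code in a different data frame. If the numbers must be consistent across many frames, one possible approach (a sketch, not part of the answer above) is to number the rows of the full grid once and left-merge each frame against it. The names `lookup`, `row_id` and `assign_ids` below are made up for illustration; the factor levels are taken from the question.

```python
import itertools

import pandas as pd

# Factor levels from the question.
levels = {
    'variable_1': ['D', 'A', 'T'],
    'variable_2': ['C', 'A', 'B'],
    'variable_3': [True, False],
}
key_cols = list(levels)

# Lookup table: every combination once, numbered 1..18 (hypothetical column name).
lookup = pd.DataFrame(list(itertools.product(*levels.values())), columns=key_cols)
lookup['row_id'] = range(1, len(lookup) + 1)


def assign_ids(df, lookup):
    """Return a Series of row ids aligned with df's index (hypothetical helper)."""
    # A left merge keeps the rows of df in their original order;
    # validate= raises if the lookup keys are not unique.
    merged = df.merge(lookup, on=key_cols, how='left', validate='many_to_one')
    return pd.Series(merged['row_id'].to_numpy(), index=df.index, name='row_id')


# Same construction as big_df in the question: three stacked copies of the grid.
grid = lookup[key_cols]
big_df = pd.concat([grid, grid, grid])
ids = assign_ids(big_df, lookup)
print(ids.head())
```

A combination that is missing from `lookup` would come back as `NaN` here, which also makes unexpected rows easy to spot.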