Zhubarb Zhubarb - 2 months ago 19
Python Question

Pandas Ordinal Variable Treatment in Similarity Calculation

I have a Pandas version 0.15.2 dataframe as below with an ordinal column

rate
, represented initially as strings. My end goal is to find the similarities of different rows in the
df
(in reality I have a lot more rows and more ordinal variables).

Currently, to factorize() while enforcing the ordering I want, I first convert
rate
to Categorical, then sort the
df
and finally factorize so
Bad
gets the smallest integer value and
Good
gets the highest. In this fashion I create the
factor_rate
column, which I can then use as a dimension (among others) to calculate similarities / distances. These are given in the code below.

However, I feel this is a memory inefficient and slow hack. What is the fastest and most memory-efficient way to achieve this?

df = pd.DataFrame({'id' : range(1,9),
'rate' : ['bad', 'bad', 'neutral', 'good',
'neutral', 'good', 'bad', 'neutral'],
'amount' : np.random.randn(8)}, columns= ['id','rate','amount'])

df.rate = pd.Categorical(df.rate, categories=["bad","neutral","good"], ordered=True)

df = df.sort('rate', ascending=True)
df.loc[:,'factor_rate'] = pd.factorize(df.rate)[0]


Note 1: This question is linked to my previous one.

Note 2: I know that
pandas.factorize()
had an
order
parameter which is now or soon will be deprecated, so I cannot use it directly.

Note 3: I am aware of the Gower metric and that ordinal variables should be treated differently to purely numeric ones in similarity calculations, but that is not the point of my question.

Answer

If you're concerned or reliant on categorical then another approach is to define your categories in a list and an order, use this to create a dict to map the order to the categories and pass this dict to map:

In [560]:

df = pd.DataFrame({'id' : range(1,9),
                   'rate' : ['bad', 'bad', 'neutral', 'good',
                                'neutral', 'good', 'bad', 'neutral'],
                   'amount' : np.random.randn(8)},  columns= ['id','rate','amount'])
cat_list = ["bad","neutral","good"]
cat_dict = dict(zip(cat_list, np.arange(len(cat_list))))

df['factor_rate'] = df['rate'].map(cat_dict)
df
Out[560]:
   id     rate    amount  factor_rate
0   1      bad -1.436574            0
1   2      bad  0.210717            0
2   3  neutral -0.896904            1
3   4     good -1.724393            2
4   5  neutral  1.792339            1
5   6     good -1.312713            2
6   7      bad  0.555877            0
7   8  neutral -0.681881            1

sorting by the new 'factor_rate' column produces the same output as your current solution:

In [562]:

df.sort('factor_rate')
Out[562]:
   id     rate    amount  factor_rate
0   1      bad -1.436574            0
1   2      bad  0.210717            0
6   7      bad  0.555877            0
2   3  neutral -0.896904            1
4   5  neutral  1.792339            1
7   8  neutral -0.681881            1
3   4     good -1.724393            2
5   6     good -1.312713            2

EDIT

Actually the underlying coding is stored in the Category dtype:

In [569]:

df = pd.DataFrame({'id' : range(1,9),
                   'rate' : ['bad', 'bad', 'neutral', 'good',
                                'neutral', 'good', 'bad', 'neutral'],
                   'amount' : np.random.randn(8)},  columns= ['id','rate','amount'])
‚Äč
df['rate'] = pd.Categorical(df.rate, categories=["bad","neutral","good"], ordered=True)
df['rate'].cat.codes
Out[569]:
0    0
1    0
2    1
3    2
4    1
5    2
6    0
7    1
dtype: int8

So my answer may not be necessary, the link to the docs is empty at the moment.