Zhubarb - 1 month ago 11
Python Question

# Pandas Ordinal Variable Treatment in Similarity Calculation

I have a Pandas version 0.15.2 dataframe as below with an ordinal column

`rate`
, represented initially as strings. My end goal is to find the similarities of different rows in the
`df`
(in reality I have a lot more rows and more ordinal variables).

Currently, to factorize() while enforcing the ordering I want, I first convert
`rate`
to Categorical, then sort the
`df`
and finally factorize so
`Bad`
gets the smallest integer value and
`Good`
gets the highest. In this fashion I create the
`factor_rate`
column, which I can then use as a dimension (among others) to calculate similarities / distances. These are given in the code below.

However, I feel this is a memory inefficient and slow hack. What is the fastest and most memory-efficient way to achieve this?

``````df = pd.DataFrame({'id' : range(1,9),
'amount' : np.random.randn(8)},  columns= ['id','rate','amount'])

df = df.sort('rate', ascending=True)
df.loc[:,'factor_rate'] = pd.factorize(df.rate)[0]
``````

Note 1: This question is linked to my previous one.

Note 2: I know that
`pandas.factorize()`
`order`
parameter which is now or soon will be deprecated, so I cannot use it directly.

Note 3: I am aware of the Gower metric and that ordinal variables should be treated differently to purely numeric ones in similarity calculations, but that is not the point of my question.

If you're concerned or reliant on categorical then another approach is to define your categories in a list and an order, use this to create a dict to map the order to the categories and pass this dict to `map`:

``````In [560]:

df = pd.DataFrame({'id' : range(1,9),
'amount' : np.random.randn(8)},  columns= ['id','rate','amount'])
cat_dict = dict(zip(cat_list, np.arange(len(cat_list))))

df['factor_rate'] = df['rate'].map(cat_dict)
df
Out[560]:
id     rate    amount  factor_rate
2   3  neutral -0.896904            1
3   4     good -1.724393            2
4   5  neutral  1.792339            1
5   6     good -1.312713            2
7   8  neutral -0.681881            1
``````

sorting by the new 'factor_rate' column produces the same output as your current solution:

``````In [562]:

df.sort('factor_rate')
Out[562]:
id     rate    amount  factor_rate
2   3  neutral -0.896904            1
4   5  neutral  1.792339            1
7   8  neutral -0.681881            1
3   4     good -1.724393            2
5   6     good -1.312713            2
``````

EDIT

Actually the underlying coding is stored in the Category dtype:

``````In [569]:

df = pd.DataFrame({'id' : range(1,9),
'amount' : np.random.randn(8)},  columns= ['id','rate','amount'])
​
df['rate'].cat.codes
Out[569]:
0    0
1    0
2    1
3    2
4    1
5    2
6    0
7    1
dtype: int8
``````

So my answer may not be necessary, the link to the docs is empty at the moment.