Brian Keegan - 7 months ago 62

Python Question

Given a dataframe with different categorical variables, how do I return a cross-tabulation with percentages instead of frequencies?

`df = pd.DataFrame({'A' : ['one', 'one', 'two', 'three'] * 6,`

'B' : ['A', 'B', 'C'] * 8,

'C' : ['foo', 'foo', 'foo', 'bar', 'bar', 'bar'] * 4,

'D' : np.random.randn(24),

'E' : np.random.randn(24)})

pd.crosstab(df.A,df.B)

B A B C

A

one 4 4 4

three 2 2 2

two 2 2 2

Using the margins option in crosstab to compute row and column totals gets us close enough to think that it should be possible using an aggfunc or groupby, but my meager brain can't think it through.

`B A B C`

A

one .33 .33 .33

three .33 .33 .33

two .33 .33 .33

Answer

```
pd.crosstab(df.A, df.B).apply(lambda r: r/r.sum(), axis=1)
```

Basically you just have the function that does `row/row.sum()`

, and you use `apply`

with `axis=1`

to apply it by row.

(If doing this in Python 2, you should use `from __future__ import division`

to make sure division always returns a float.)

Source (Stackoverflow)