max - 1 year ago 126
Python Question

# COUNT DISTINCT / nunique within groups

I want to count the number of distinct tuples within each group:

``````df = pd.DataFrame({'a': ['A', 'A', 'A', 'A', 'B', 'B', 'B', 'B'],
'b': [1, 2, 1, 2, 1, 2, 1, 2],
'c': [1, 1, 2, 2, 2, 1, 2, 1]})
counts = count_distinct(df, by='a', columns=['b', 'c'])
assert counts == pd.Series([4, 2], index=['A', 'B'])
``````

In other words, counts should report that for group 'A', there are four distinct tuples and for group 'B', there are two.

I tried using
`df.groupby('a')['b', 'c'].nunique()`
, but
`nunique`
works only with a single column.

I know I could count distinct tuples by df.groupby(['b', 'c']), but that means I have use a slow
`apply`
with a pure python function (the number of groups of column 'a' is large).

I could convert the 'b' and 'c' columns into a single column of tuples, but that would be super slow since it will no longer use vectorized operations.

I think your logic is equivalent to count the size of data frames grouped by column `a` after dropping the duplicated values of combined columns `a`, `b` and `c`, since duplicated tuples within each group must also be duplicated records in the data frame assuming your data frame contains only columns `a`, `b` and `c` and vice versa:
``````df.drop_duplicates().groupby('a').size()