max - 1 year ago 105

Python Question

I want to count the number of distinct tuples within each group:

`df = pd.DataFrame({'a': ['A', 'A', 'A', 'A', 'B', 'B', 'B', 'B'],`

'b': [1, 2, 1, 2, 1, 2, 1, 2],

'c': [1, 1, 2, 2, 2, 1, 2, 1]})

counts = count_distinct(df, by='a', columns=['b', 'c'])

assert counts == pd.Series([4, 2], index=['A', 'B'])

In other words, counts should report that for group 'A', there are four distinct tuples and for group 'B', there are two.

I tried using

`df.groupby('a')['b', 'c'].nunique()`

`nunique`

I know I could count distinct tuples by df.groupby(['b', 'c']), but that means I have use a slow

`apply`

I could convert the 'b' and 'c' columns into a single column of tuples, but that would be super slow since it will no longer use vectorized operations.

Answer Source

I think your logic is equivalent to count the size of data frames grouped by column `a`

after dropping the duplicated values of combined columns `a`

, `b`

and `c`

, since duplicated tuples within each group must also be duplicated records in the data frame assuming your data frame contains only columns `a`

, `b`

and `c`

and vice versa:

```
df.drop_duplicates().groupby('a').size()
# a
# A 4
# B 2
# dtype: int64
```