mar tin mar tin - 1 month ago 12
SQL Question

Pandas distinct count as a DataFrame

Suppose I have a Pandas DataFrame called

df
with columns
a
and
b
and what I want is the number of distinct values of
b
per each
a
. I would do:

distcounts = df.groupby('a')['b'].nunique()


which gives the desidered result, but it is as Series object rather than another DataFrame. I'd like a DataFrame instead. In regular SQL, I'd do:

SELECT a, COUNT(DISTINCT(b)) FROM df


and haven't been able to emulate this query in Pandas exactly. How to?

Answer

I think you need reset_index:

distcounts = df.groupby('a')['b'].nunique().reset_index()

Sample:

df = pd.DataFrame({'a':[7,8,8],
                   'b':[4,5,6]})

print (df)
   a  b
0  7  4
1  8  5
2  8  6

distcounts = df.groupby('a')['b'].nunique().reset_index()
print (distcounts)
   a  b
0  7  1
1  8  2