ajkumar25 - 8 days ago 4
Python Question

# Finding count of distinct elements in DataFrame in each column

I am trying to find the count of distinct values in each column using Pandas. This is what I did.

``````import pandas as pd

# print(df)

a = pd.unique(df.values.ravel())
print(a)
``````

It counts unique elements in the DataFrame irrespective of rows/columns, but I need to count for each column with output formatted as below.

``````policyID              0
statecode             0
county                0
eq_site_limit         0
hu_site_limit         454
fl_site_limit         647
fr_site_limit         0
tiv_2011              0
tiv_2012              0
eq_site_deductible    0
hu_site_deductible    0
fl_site_deductible    0
fr_site_deductible    0
point_latitude        0
point_longitude       0
line                  0
construction          0
point_granularity     0
``````

What would be the most efficient way to do this, as this method will be applied to files which have size greater than 1.5GB?

`df.apply(lambda x: len(x.unique()))`
is the fastest.

``````In[23]: %timeit df.apply(pd.Series.nunique)
1 loops, best of 3: 1.45 s per loop
In[24]: %timeit df.apply(lambda x: len(x.unique()))
1 loops, best of 3: 335 ms per loop
In[25]: %timeit df.T.apply(lambda x: x.nunique(), axis=1)
1 loops, best of 3: 1.45 s per loop
``````

You could do a transpose of the df and then using `apply` call `nunique` row-wise:

``````In [205]:
df = pd.DataFrame({'a':[0,1,1,2,3],'b':[1,2,3,4,5],'c':[1,1,1,1,1]})
df

Out[205]:
a  b  c
0  0  1  1
1  1  2  1
2  1  3  1
3  2  4  1
4  3  5  1

In [206]:
df.T.apply(lambda x: x.nunique(), axis=1)

Out[206]:
a    4
b    5
c    1
dtype: int64
``````

EDIT

As pointed out by @ajcr the transpose is unnecessary:

``````In [208]:
df.apply(pd.Series.nunique)

Out[208]:
a    4
b    5
c    1
dtype: int64
``````