ajkumar25 - 1 year ago 138

Python Question

I am trying to find the count of distinct values in each column using Pandas. This is what I did.

`import pandas as pd`

df = pd.read_csv('train.csv')

# print(df)

a = pd.unique(df.values.ravel())

print(a)

It counts unique elements in the DataFrame irrespective of rows/columns, but I need to count for each column with output formatted as below.

`policyID 0`

statecode 0

county 0

eq_site_limit 0

hu_site_limit 454

fl_site_limit 647

fr_site_limit 0

tiv_2011 0

tiv_2012 0

eq_site_deductible 0

hu_site_deductible 0

fl_site_deductible 0

fr_site_deductible 0

point_latitude 0

point_longitude 0

line 0

construction 0

point_granularity 0

What would be the most efficient way to do this, as this method will be applied to files which have size greater than 1.5GB?

Based upon the answers,

`df.apply(lambda x: len(x.unique()))`

`In[23]: %timeit df.apply(pd.Series.nunique)`

1 loops, best of 3: 1.45 s per loop

In[24]: %timeit df.apply(lambda x: len(x.unique()))

1 loops, best of 3: 335 ms per loop

In[25]: %timeit df.T.apply(lambda x: x.nunique(), axis=1)

1 loops, best of 3: 1.45 s per loop

Answer Source

You could do a transpose of the df and then using `apply`

call `nunique`

row-wise:

```
In [205]:
df = pd.DataFrame({'a':[0,1,1,2,3],'b':[1,2,3,4,5],'c':[1,1,1,1,1]})
df
Out[205]:
a b c
0 0 1 1
1 1 2 1
2 1 3 1
3 2 4 1
4 3 5 1
In [206]:
df.T.apply(lambda x: x.nunique(), axis=1)
Out[206]:
a 4
b 5
c 1
dtype: int64
```

**EDIT**

As pointed out by @ajcr the transpose is unnecessary:

```
In [208]:
df.apply(pd.Series.nunique)
Out[208]:
a 4
b 5
c 1
dtype: int64
```