Jabb Jabb - 17 days ago 8
Python Question

How to get the distinct count of values in a python pandas dataframe

I would like to get the distinct count of values in a python pandas dataframe and write the result to a new column. This is what I have so far.

import pandas as pd

df = pd.DataFrame( {
'OrderNo': [1,1,1,1,2,2,2,3,3],
'Barcode': [1234,2345,3456,3456,1234,1234,2345,1234,3456]
} );

df['barcodeCountPerOrderNo'] = df.groupby(['OrderNo', 'Barcode'])['Barcode'].transform('count')

df['distinctBarcodesPerOrderNo'] = '?'

print df


This gives:

Barcode OrderNo barcodeCountPerOrderNo distinctBarcodesPerOrder
0 1234 1 1 ?
1 2345 1 1 ?
2 3456 1 2 ?
3 3456 1 2 ?
4 1234 2 2 ?
5 1234 2 2 ?
6 2345 2 1 ?
7 1234 3 1 ?
8 3456 3 1 ?


But how can I get the distinctBarcodesPerOrder?

Barcode OrderNo barcodeCountPerOrderNo distinctBarcodesPerOrder
0 1234 1 1 3
1 2345 1 1 3
2 3456 1 2 3
3 3456 1 2 3
4 1234 2 2 2
5 1234 2 2 2
6 2345 2 1 2
7 1234 3 1 2
8 3456 3 1 2

Answer

I would use map to get unique values and directly merge them into the original dataframe:

df['distinctBarcodesPerOrder'] = df['OrderNo'].map(df.groupby('OrderNo')['Barcode'].nunique())

which returns:

   Barcode  OrderNo  barcodeCountPerOrderNo  distinctBarcodesPerOrder
0     1234        1                       1                         3
1     2345        1                       1                         3
2     3456        1                       2                         3
3     3456        1                       2                         3
4     1234        2                       2                         2
5     1234        2                       2                         2
6     2345        2                       1                         2
7     1234        3                       1                         2
8     3456        3                       1                         2