user5779223 user5779223 - 2 months ago 6
Python Question

Quick way to find the duplicate cell in a certain column of data frame in python-pandas?

Given a data frame

df
in the following form:

item attr
1 {1, 2, 3, 4}
2 {2, 4, 3, 2, 10}
3 {4, 37}
4 {1, 2, 3, 4}


I want to find the item-pair with same
attr
, like,
item 1
and
item 2
. Please notice that the
df
has
200,000
items totally. And I want a fastest way to find them. Do you know how to do it? Thanks in advance!

Answer

You can first convert set to tuple and then aggregate nunique and unique. Last use boolean indexing:

df = pd.DataFrame({'item':[1,2,3,4],
                  'attr':[set({1, 2, 3, 4}),set({2, 4, 3, 2, 10}),
                          set({4, 37}), set({1, 2, 3, 4})]})

print (df)
            attr  item
0   {1, 2, 3, 4}     1
1  {3, 10, 2, 4}     2
2        {4, 37}     3
3   {1, 2, 3, 4}     4

df.attr = df.attr.apply(tuple)
print (df)
            attr  item
0   (1, 2, 3, 4)     1
1  (3, 10, 2, 4)     2
2        (4, 37)     3
3   (1, 2, 3, 4)     4

df1 = df.item.groupby(df['attr']).agg(['nunique', 'unique'])
df1 = df1[df1['nunique'] == 2]
print (df1)
              nunique  unique
attr                         
(1, 2, 3, 4)        2  [1, 4]

Another solution if there are only one or pair values in DataFrame with duplicated:

df = pd.DataFrame({'item':[1,2,3,4],
                  'attr':[set({1, 2, 3, 4}),set({4, 37}),
                          set({4, 37}), set({1, 2, 3, 4})]})

print (df)
           attr  item
0  {1, 2, 3, 4}     1
1       {4, 37}     2
2       {4, 37}     3
3  {1, 2, 3, 4}     4

df.attr = df.attr.apply(tuple)


df1 = df[df.duplicated('attr', keep=False)]
df1 = df1.groupby('attr')['item'].apply(lambda x: x.tolist())
print (df1)
(1, 2, 3, 4)    [1, 4]
(4, 37)         [2, 3]
Name: item, dtype: object

EDIT by comment:

Use melt for reshaping:

df = pd.DataFrame({'item':[1,2,3,4,5],
                  'attr1':[set({1, 2, 3, 4}),set({4, 37}),set({4, 37}), 
                           set({1, 2, 3, 4}), set({4,8})],
                  'attr2':[set({1, 2 }),set({4, 37}),
                           set({4, 3}), set({1, 2}), set({4,8})]})

print (df)
          attr1    attr2  item
0  {1, 2, 3, 4}   {1, 2}     1
1       {4, 37}  {4, 37}     2
2       {4, 37}   {3, 4}     3
3  {1, 2, 3, 4}   {1, 2}     4
4        {8, 4}   {8, 4}     5

df = pd.melt(df, id_vars='item', value_name='attr').drop('variable', axis=1)
df.attr = df.attr.apply(tuple)
print (df)
   item          attr
0     1  (1, 2, 3, 4)
1     2       (4, 37)
2     3       (4, 37)
3     4  (1, 2, 3, 4)
4     5        (8, 4)
5     1        (1, 2)
6     2       (4, 37)
7     3        (3, 4)
8     4        (1, 2)
9     5        (8, 4)
Comments