emax emax - 7 months ago 18
Python Question

Python: how to count collaborations between pairs in pandas dataframe?

I have a dataframe like this

import numpy as np
import pandas as pd

df = pd.DataFrame({'Item': ['A', 'A', 'A', 'B', 'B', 'C', 'C', 'C', 'C'],
                   'Name': ['Tom', 'John', 'Paul', 'Tom', 'Frank',
                            'Tom', 'John', 'Richard', 'James'],
                   'Weight': [4, 4, 4, 3, 3, 5, 5, 5, 5]})
df
  Item     Name  Weight
0    A      Tom       4
1    A     John       4
2    A     Paul       4
3    B      Tom       3
4    B    Frank       3
5    C      Tom       5
6    C     John       5
7    C  Richard       5
8    C    James       5


For each person, I want the list of people who share an item with them, together
with counts weighted by the inverse of each shared item's weight


df1
Name     People                               Times
Tom      [John, Paul, Frank, Richard, James]  [(1/4+1/5), 1/4, 1/3, 1/5, 1/5]
John     [Tom, Paul, Richard, James]          [(1/4+1/5), 1/4, 1/5, 1/5]
Paul     [Tom, John]                          [1/4, 1/4]
Frank    [Tom]                                [1/3]
Richard  [Tom, John, James]                   [1/5, 1/5, 1/5]
James    [Tom, John, Richard]                 [1/5, 1/5, 1/5]
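To spell out the Times arithmetic: every item two people share contributes 1/weight, and the contributions are summed. A quick check of the two non-trivial cases (the variable names are mine):

```python
# Each shared item contributes 1/weight to a pair's Times value.
# Tom and John share item A (weight 4) and item C (weight 5):
tom_john = 1.0 / 4 + 1.0 / 5
print(tom_john)  # 0.45

# Tom and Frank share only item B (weight 3):
tom_frank = 1.0 / 3
print(round(tom_frank, 6))  # 0.333333
```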


To count the collaborations without considering the weight, I did:

#M:N self-merge on column Item
df1 = pd.merge(df, df, on=['Item'])

#drop self-pairs where Name_x == Name_y
df1 = df1[df1['Name_x'] != df1['Name_y']]

#collect each person's collaborators into a list
df1 = df1.groupby('Name_x')['Name_y'].apply(lambda x: x.tolist()).reset_index()
print(df1)
Name_x Name_y
0 Frank [Tom]
1 James [Tom, John, Richard]
2 John [Tom, Paul, Tom, Richard, James]
3 Paul [Tom, John]
4 Richard [Tom, John, James]
5 Tom [John, Paul, Frank, John, Richard, James]


#get unique collaborators and their counts with np.unique
df1['People'] = df1['Name_y'].apply(lambda a: np.unique(a, return_counts=True)[0])
df1['times'] = df1['Name_y'].apply(lambda a: np.unique(a, return_counts=True)[1])

#drop column Name_y and rename Name_x
df1 = df1.drop('Name_y', axis=1).rename(columns={'Name_x': 'Name'})
print(df1)
Name People times
0 Frank [Tom] [1]
1 James [John, Richard, Tom] [1, 1, 1]
2 John [James, Paul, Richard, Tom] [1, 1, 1, 2]
3 Paul [John, Tom] [1, 1]
4 Richard [James, John, Tom] [1, 1, 1]
5 Tom [Frank, James, John, Paul, Richard] [1, 1, 2, 1, 1]
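As a side note, the same unweighted counts can be obtained without a self-join by counting pairs item by item; a minimal sketch (the name `pair_counts` is mine, not from the question):

```python
from collections import Counter, defaultdict
from itertools import combinations

import pandas as pd

df = pd.DataFrame({'Item': ['A', 'A', 'A', 'B', 'B', 'C', 'C', 'C', 'C'],
                   'Name': ['Tom', 'John', 'Paul', 'Tom', 'Frank',
                            'Tom', 'John', 'Richard', 'James']})

# For every item, each unordered pair of its names counts as one collaboration.
pair_counts = defaultdict(Counter)
for _, names in df.groupby('Item')['Name']:
    for a, b in combinations(names, 2):
        pair_counts[a][b] += 1
        pair_counts[b][a] += 1

print(pair_counts['Tom']['John'])  # 2  (shared items A and C)
print(pair_counts['Frank'])        # Counter({'Tom': 1})
```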


The last dataframe gives the number of collaborations for every pair; however, I would like the weighted count of collaborations instead.

Answer

You could group the merged frame by the name pair, sum the inverse weights with .apply(), and use .unstack() for the wide format (here df1 is the merged frame right after dropping the self-pairs, before it is collapsed into lists):

collab = (df1.groupby(['Name_x', 'Name_y'])['Weight_x']
             .apply(lambda w: np.sum(1.0 / w))
             .unstack())

Name_y      Frank  James  John  Paul  Richard       Tom
Name_x
Frank         NaN    NaN   NaN   NaN      NaN  0.333333
James         NaN    NaN   0.2   NaN      0.2  0.200000
John          NaN    0.2   NaN  0.25      0.2  0.450000
Paul          NaN    NaN  0.25   NaN      NaN  0.250000
Richard       NaN    0.2   0.2   NaN      NaN  0.200000
Tom      0.333333    0.2  0.45  0.25      0.2       NaN

then iterate over the rows and convert to lists:

result = pd.DataFrame(columns=['People', 'Times'])
for name, row in collab.iterrows():
    s = row.dropna()
    result.loc[name] = [s.index.tolist(), s.values]

                                      People  \
Frank                                  [Tom]   
James                   [John, Richard, Tom]   
John             [James, Paul, Richard, Tom]   
Paul                             [John, Tom]   
Richard                   [James, John, Tom]   
Tom      [Frank, James, John, Paul, Richard]   

                                        Times  
Frank                        [0.333333333333]  
James                         [0.2, 0.2, 0.2]  
John                     [0.2, 0.25, 0.2, 0.45]  
Paul                               [0.25, 0.25]  
Richard                         [0.2, 0.2, 0.2]  
Tom      [0.333333333333, 0.2, 0.45, 0.25, 0.2]
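For reference, the whole pipeline can be condensed into one runnable sketch. The variable names merged, collab and result are mine, and the weights 4/3/5 are the ones shown in the question's printed frame:

```python
import numpy as np
import pandas as pd

# Rebuild the question's input frame.
df = pd.DataFrame({'Item': ['A', 'A', 'A', 'B', 'B', 'C', 'C', 'C', 'C'],
                   'Name': ['Tom', 'John', 'Paul', 'Tom', 'Frank',
                            'Tom', 'John', 'Richard', 'James'],
                   'Weight': [4, 4, 4, 3, 3, 5, 5, 5, 5]})

# Self-join on Item, then drop self-pairs.
merged = pd.merge(df, df, on='Item')
merged = merged[merged['Name_x'] != merged['Name_y']]

# Sum 1/weight per (Name_x, Name_y) pair and pivot to a wide matrix.
collab = (merged.groupby(['Name_x', 'Name_y'])['Weight_x']
                .apply(lambda w: np.sum(1.0 / w))
                .unstack())

# Collapse each row's non-NaN entries into the People/Times lists.
people = {name: row.dropna().index.tolist() for name, row in collab.iterrows()}
times = {name: row.dropna().round(4).tolist() for name, row in collab.iterrows()}
result = pd.DataFrame({'People': pd.Series(people), 'Times': pd.Series(times)})
print(result)
```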