Christian - 1 month ago
Python Question

Count pair id matches via proxy id - MongoDB etc.?

Suppose I have a very big data.frame/collection with the fields id and proxy id:

1 a
2 a
1 b
3 b
1 c
3 c
4 d


Now I'd like to count, for each pair of ids, how many proxies they share:

1 2 1 #id1 id2 count
1 3 2


OK, with some Python itertools.combinations and lookups this works, but it feels cumbersome. Is there a more appropriate, simple, fast approach/technology?
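For reference, the itertools.combinations route mentioned above stays fairly compact if the mapping is inverted first (a sketch on the toy data above, not on the MongoDB collection):

```python
from collections import Counter, defaultdict
from itertools import combinations

# toy data matching the example above; rows are (id, proxy) pairs
rows = [(1, 'a'), (2, 'a'), (1, 'b'), (3, 'b'), (1, 'c'), (3, 'c'), (4, 'd')]

# invert: proxy -> set of ids that share it
by_proxy = defaultdict(set)
for id_, proxy in rows:
    by_proxy[proxy].add(id_)

# each proxy contributes one count to every unordered id pair sharing it
pair_counts = Counter()
for ids in by_proxy.values():
    pair_counts.update(combinations(sorted(ids), 2))

print(pair_counts)  # -> Counter({(1, 3): 2, (1, 2): 1})
```

Sorting the ids before taking combinations is what makes (1, 2) and (2, 1) land on the same key.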

My approach, appended later:


  • I filtered the ids which appear > x times, because I have millions.

    from collections import Counter

    def matchings(id):
        # mapping is the MongoDB collection
        match = mapping.find({'id': id})
        valid_proxies = [doc['proxy'] for doc in match]
        other_ids = [doc['id'] for doc in mapping.find({'proxy': {'$in': valid_proxies}})]
        c = Counter([(id, id2) for id2 in other_ids if id2 != id])
        # possible filter
        # c_filtered = {k: v for k, v in c.items() if v > 3}
        # some stats
        # s1 = [id, len(valid_proxies), len(other_ids)]
        s2 = [[k[0], k[1], v] for k, v in c.items()]
        return s2


    res = [matchings(id) for id in list(df_id_filtered['id'])]
    df_final_matching_counts = pd.DataFrame(list(itertools.chain(*res)))



Thanks!

Answer

Here is my approach:

df = pd.DataFrame({'id': {0: 1, 1: 2, 2: 1, 3: 3, 4: 1, 5: 3, 6: 4}, 'proxy': {0: 'a', 1: 'a', 2: 'b', 3: 'b', 4: 'c', 5: 'c', 6: 'd'}})
print(df)

group by and aggregate into sets

df_g = df.groupby('id').apply(lambda x: set(x['proxy'])).to_frame()
df_g.columns = ['proxy']

a trick to make a cartesian product in pandas

df_g['X'] = 'X'
merged = pd.merge(df_g.reset_index(), df_g.reset_index(), on=['X'])

which gives:

    id_x    proxy_x  X  id_y    proxy_y
0      1  {a, c, b}  X     1  {a, c, b}
1      1  {a, c, b}  X     2        {a}
2      1  {a, c, b}  X     3     {c, b}
3      1  {a, c, b}  X     4        {d}
4      2        {a}  X     1  {a, c, b}
5      2        {a}  X     2        {a}
6      2        {a}  X     3     {c, b}
7      2        {a}  X     4        {d}
8      3     {c, b}  X     1  {a, c, b}
9      3     {c, b}  X     2        {a}
10     3     {c, b}  X     3     {c, b}
11     3     {c, b}  X     4        {d}
12     4        {d}  X     1  {a, c, b}
13     4        {d}  X     2        {a}
14     4        {d}  X     3     {c, b}
15     4        {d}  X     4        {d}
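As an aside, on pandas 1.2+ the dummy `X` column isn't needed: `merge` supports a cross join directly. A sketch, assuming a recent pandas:

```python
import pandas as pd

df = pd.DataFrame({'id': [1, 2, 1, 3, 1, 3, 4],
                   'proxy': ['a', 'a', 'b', 'b', 'c', 'c', 'd']})
df_g = df.groupby('id').apply(lambda x: set(x['proxy'])).to_frame()
df_g.columns = ['proxy']

# how='cross' builds the full cartesian product without a join key
merged = pd.merge(df_g.reset_index(), df_g.reset_index(), how='cross')
print(merged.shape)  # 4 ids -> 16 pairs
```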

some tidy up

# we don't care about (1,1), (2,2), (3,3) etc.
merged_filtered = merged[merged['id_x'] != merged['id_y']].copy(deep=True)

# use intersection on the sets, sorted list (or set) for the keys & len()
merged_filtered['intersect']  = merged_filtered.apply(lambda row: len(row['proxy_x'].intersection(row['proxy_y'])), axis=1)

# for us (1,2) = (2,1) etc. sorting or set, then drop duplicates will address that.
merged_filtered['keys'] = merged_filtered.apply(lambda row: sorted([row['id_x'],row['id_y']]),axis=1)
merged_filtered = merged_filtered[['keys','intersect']]

there are other ways to split the column of list objects into two columns

merged_filtered['key1'] = merged_filtered['keys'].map(lambda x: x[0])
merged_filtered['key2'] = merged_filtered['keys'].map(lambda x: x[1])
merged_filtered.drop('keys', axis=1, inplace=True)
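One such alternative is to expand the list column into both key columns in a single step (a sketch on a small stand-in frame):

```python
import pandas as pd

# stand-in for merged_filtered after the 'keys' column has been built
d = pd.DataFrame({'keys': [[1, 2], [1, 3], [2, 3]],
                  'intersect': [1, 2, 0]})

# build both key columns at once from the list objects
d[['key1', 'key2']] = pd.DataFrame(d['keys'].tolist(), index=d.index)
d = d.drop('keys', axis=1)
print(d)
```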

now drop the duplicates

merged_filtered = merged_filtered.drop_duplicates().set_index(['key1','key2'])
print(merged_filtered)

           intersect
key1 key2           
1    2             1
     3             2
     4             0
2    3             0
     4             0
3    4             0

and if you want to drop the zeros:

print(merged_filtered[merged_filtered['intersect'] != 0])

           intersect
key1 key2           
1    2             1
     3             2
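Since the question also asks about MongoDB: the same pair counting can be pushed server-side with the aggregation framework. This is only a sketch (untested against a live server; `$expr` inside `$match` needs MongoDB 3.6+), to be run as e.g. `mapping.aggregate(pipeline)`:

```python
# aggregation pipeline sketch: count shared proxies per id pair server-side
pipeline = [
    # 1. collect the distinct ids that share each proxy
    {'$group': {'_id': '$proxy', 'ids': {'$addToSet': '$id'}}},
    # 2. duplicate the array so it can be unwound twice
    {'$project': {'id1': '$ids', 'id2': '$ids'}},
    # 3.-4. expand to all ordered (id1, id2) pairs per proxy
    {'$unwind': '$id1'},
    {'$unwind': '$id2'},
    # 5. keep each unordered pair once (and drop self-pairs)
    {'$match': {'$expr': {'$lt': ['$id1', '$id2']}}},
    # 6. count how many proxies each pair shares
    {'$group': {'_id': {'id1': '$id1', 'id2': '$id2'},
                'count': {'$sum': 1}}},
]
```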