Python Question

Function in Python for finding possible combinations of specific names in a big file

I have a huge file (50,000 rows) with 2 columns (id and name). One id can have several names, but I am only looking for specific names and their ids. These specific names might appear in any combination with each other, so I need to check the whole file for every possible combination in which they come together.
I wrote the following function; it gives me no error, but it is not working either.

I also want to count every combination of these names.

By the way, I am using pandas and importing the data as a DataFrame.

for example:

id name
a TD
a NB
a LB
b LR
b NB
c LR
c NB
d LB


I want a result like:

a TD,NB,LR # they might come in any combination; I just wrote these as an example
b NB,LR
c NB,LR
d LB


and for counting I want:

TD,NB,LR 1
NB,LR 2
LB 1


def Gene_count(df_file):
    df_group_id = df.groupby('id').name
    for j in df_group_id:
        j = df.id
        for i in df_group_id:
            if i == 'TD' or i == 'NB' or i == 'LR' or i == 'LB':
                print(i, j)


Thanks

Answer
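
All snippets below assume pandas is imported as pd and that the question's sample data is available as df. A minimal setup to reproduce them (the data is built inline here instead of being read from the real 50,000-row file):

import pandas as pd

# sample data from the question; with the real file this would come from
# something like pd.read_csv with the two columns id and name
df = pd.DataFrame({
    'id':   ['a', 'a', 'a', 'b', 'b', 'c', 'c', 'd'],
    'name': ['TD', 'NB', 'LB', 'LR', 'NB', 'LR', 'NB', 'LB'],
})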

You can first use groupby with apply and ','.join:

df1 = df.groupby('id')['name'].apply(','.join)
print (df1)
id
a    TD,NB,LB
b       LR,NB
c       LR,NB
d          LB
Name: name, dtype: object
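
If you prefer a two-column table like the one sketched in the question instead of a Series, you can chain reset_index (a small optional step; out is just an illustrative name):

# turn the grouped Series back into an id/name DataFrame
out = df1.reset_index()
print (out)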

and then count the combinations with value_counts:

print (df1.value_counts())
LR,NB       2
LB          1
TD,NB,LB    1
Name: name, dtype: int64
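
If you only need the count for one specific combination, you can index into the value_counts result (the string has to match the joined order shown above):

print (df1.value_counts()['LR,NB'])
2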

If you want to keep only the combinations that contain certain values, use str.contains with '|'.join (regex OR) and boolean indexing:

df1 = df.groupby('id')['name'].apply(','.join)

df2 = df1[df1.str.contains('|'.join(['LR','NB']))]
print (df2)
id
a    TD,NB,LB
b       LR,NB
c       LR,NB
Name: name, dtype: object

print (df2.value_counts())
LR,NB       2
TD,NB,LB    1
Name: name, dtype: int64

Another possible solution is to filter first, using isin twice:

# get all ids where the value is LR or NB (unique is for better performance)
ids = df.loc[df.name.isin(['LR','NB']), 'id'].unique()
print (ids)
['a' 'b' 'c']

#filter by ids
df3 = df[df.id.isin(ids)]
print (df3)
  id name
0  a   TD
1  a   NB
2  a   LB
3  b   LR
4  b   NB
5  c   LR
6  c   NB

df4 = df3.groupby('id')['name'].apply(','.join)
print (df4)
id
a    TD,NB,LB
b       LR,NB
c       LR,NB
Name: name, dtype: object

print (df4.value_counts())
LR,NB       2
TD,NB,LB    1
Name: name, dtype: int64
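
Putting it together, a corrected version of the question's Gene_count function could look like this (only a sketch based on the second approach above; the target names are the ones listed in the question):

import pandas as pd

def Gene_count(df_file, targets=('TD', 'NB', 'LR', 'LB')):
    # ids that have at least one of the target names
    ids = df_file.loc[df_file['name'].isin(targets), 'id'].unique()
    # keep only those ids, join their names per id and count each combination
    combos = df_file[df_file['id'].isin(ids)].groupby('id')['name'].apply(','.join)
    counts = combos.value_counts()
    return combos, counts

combos, counts = Gene_count(df)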

Out of curiosity I also timed two variants of the join step - the performance is the same:

import numpy as np
import pandas as pd

np.random.seed(123)
N = 1000000
L1 = list("abcdefghijklmnopqrstuvwxyz")
df = pd.DataFrame({'id': np.random.choice(L1, N),
                   'name': np.random.choice(L1, N)})

In [31]: %timeit (df.groupby('id')['name'].apply(','.join))
10 loops, best of 3: 130 ms per loop

In [32]: %timeit (df.groupby('id')['name'].apply(lambda x: ','.join(x.tolist())))
10 loops, best of 3: 131 ms per loop