S Ringne - 23 days ago
Python Question

Create bigrams from a column in a pandas DataFrame

I have this test table in a pandas DataFrame:

   Leaf_category_id  session_id  product_id
0               111           1         987
3               111           4         987
4               111           1         741
1               222           2         654
2               333           3         321


This is an extension of my previous question, which was answered by @jazrael.

Suppose that, after that step, the product_id column holds values like the following (this is just an assumption and differs a little from the output of my previous question):

|product_id |
---------------------------
|111,987,741,34,12 |
|987,1232 |
|654,12,324,465,342,324 |
|321,741,987 |
|324,654,862,467,243,754 |
|6453,123,987,741,34,12 |


and so on.
I want to create a new column in which each value in a row is paired with the next one to form a bigram, and the last number in the row is paired with the first one. For example:

|product_id |Bigram
-------------------------------------------------------------------------
|111,987,741,34,12 |(111,987),**(987,741)**,(741,34),(34,12),(12,111)
|987,1232 |(987,1232),(1232,987)
|654,12,324,465,342,32 |(654,12),(12,324),(324,465),(465,342),(342,32),(32,654)
|321,741,987 |(321,741),**(741,987)**,(987,321)
|324,654,862 |(324,654),(654,862),(862,324)
|123,987,741,34,12 |(123,987),(987,741),(34,12),(12,123)


Ignore the ** markers for now (I will explain further down why I starred those bigrams).
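
The wrap-around pairing described above can be sketched in plain Python. This is only an illustration under assumptions: the helper name circular_bigrams is hypothetical, and the row is assumed to be a comma-separated string as in the table.

```python
def circular_bigrams(values):
    """Pair each value with the next one, and the last value with the first."""
    values = list(values)
    if len(values) < 2:
        # A single value has no neighbour to pair with.
        return []
    # values[1:] + values[:1] is the same list rotated left by one position.
    return list(zip(values, values[1:] + values[:1]))


row = "111,987,741,34,12"
product_ids = [int(v) for v in row.split(",")]
bigrams = circular_bigrams(product_ids)
print(bigrams)
```

For a two-value row such as 987,1232 the same helper yields (987, 1232) and (1232, 987), which matches the second row of the example table.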

The code I use to build the bigrams is:

for i in df.Leaf_category_id.unique():
    print(df[df.Leaf_category_id == i].groupby('session_id')['product_id'].apply(lambda x: list(zip(x, x[1:]))).reset_index())


From this DataFrame, I want to take the Bigram column and add one more column named frequency, which gives the number of times each bigram occurred.


Note: (987,741) and (741,987) are to be considered the same bigram; the duplicate entry should be removed, so the frequency of (987,741) should be 2.
Similarly, (34,12) occurs twice, so its frequency should also be 2.


|Bigram
---------------
|(111,987),
|**(987,741)**
|(741,34)
|(34,12)
|(12,111)
|**(741,987)**
|(987,321)
|(34,12)
|(12,123)


The final result should be:

|Bigram | frequency |
--------------------------
|(111,987) | 1
|(987,741) | 2
|(741,34) | 1
|(34,12) | 2
|(12,111) | 1
|(987,321) | 1
|(12,123) | 1
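
The counting rule from the note above (order inside a pair does not matter, and the first orientation seen is the one that is kept) can be sketched as follows. The function name count_unordered_bigrams is hypothetical, and the input is simply the bigram list shown earlier typed in by hand.

```python
from collections import Counter


def count_unordered_bigrams(bigrams):
    """Count bigrams, treating (a, b) and (b, a) as the same pair."""
    counts = Counter()
    first_seen = {}
    for a, b in bigrams:
        key = tuple(sorted((a, b)))          # order-independent key
        first_seen.setdefault(key, (a, b))   # remember the first orientation
        counts[key] += 1
    # Counter keeps insertion order, so rows come out in first-seen order.
    return [(first_seen[key], n) for key, n in counts.items()]


bigrams = [(111, 987), (987, 741), (741, 34), (34, 12), (12, 111),
           (741, 987), (987, 321), (34, 12), (12, 123)]
result = count_unordered_bigrams(bigrams)
for bigram, frequency in result:
    print(bigram, frequency)
```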


I hope to find an answer here. Please help; I have tried to explain the problem in as much detail as possible.

Answer

Try this code for consecutive bigrams:

from collections import Counter
import pandas as pd

# DataFrame.from_csv was removed in pandas 1.0; read_csv with index_col=0 replaces it.
df = pd.read_csv("data.csv", index_col=0)

# Consecutive bigrams within each (Leaf_category_id, session_id) group.
grouped_consecutive_product_ids = df.groupby(['Leaf_category_id','session_id'])['product_id'].apply(lambda x: list(zip(x, x[1:]))).reset_index()

# Sort each pair so that (987, 741) and (741, 987) count as the same bigram.
bigram_counts = Counter(tuple(sorted(bigram)) for bigrams in grouped_consecutive_product_ids.product_id for bigram in bigrams)
bigram_frequency_consecutive = pd.DataFrame(list(bigram_counts.items()), columns=['Bigram', 'frequency'])

For combinations (all possible bigrams within a group), use:

from collections import Counter
from itertools import combinations
import pandas as pd

# DataFrame.from_csv was removed in pandas 1.0; read_csv with index_col=0 replaces it.
df = pd.read_csv("data.csv", index_col=0)

# Every 2-element combination within each (Leaf_category_id, session_id) group.
grouped_combination_product_ids = df.groupby(['Leaf_category_id','session_id'])['product_id'].apply(lambda x: list(combinations(x, 2))).reset_index()

# Sort each pair so that (987, 741) and (741, 987) count as the same bigram.
bigram_counts = Counter(tuple(sorted(bigram)) for bigrams in grouped_combination_product_ids.product_id for bigram in bigrams)
bigram_frequency_combinations = pd.DataFrame(list(bigram_counts.items()), columns=['Bigram', 'frequency'])

where data.csv contains (the first, unnamed field of each row is the index):

Leaf_category_id,session_id,product_id
0,111,1,111
3,111,4,987
4,111,1,741
1,222,2,654
2,333,3,321
5,111,1,87
6,111,1,34
7,111,1,12
8,111,1,987
9,111,4,1232
10,222,2,12
11,222,2,324
12,222,2,465
13,222,2,342
14,222,2,32
15,333,3,321
16,333,3,741
17,333,3,987
18,333,3,324
19,333,3,654
20,333,3,862
21,222,1,123
22,222,1,987
23,222,1,741
24,222,1,34
25,222,1,12
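
As a quick check of how the two variants differ, here is a small sketch on a single group from the data above (Leaf_category_id 222, session_id 1), with the product ids typed in by hand rather than read from data.csv. The variable names are illustrative only.

```python
from itertools import combinations

# product_id values for Leaf_category_id 222, session_id 1, in file order.
products = [123, 987, 741, 34, 12]

# Consecutive variant: each value paired with the one that follows it.
consecutive = [tuple(sorted(p)) for p in zip(products, products[1:])]

# Combinations variant: every pair of values in the group.
all_pairs = [tuple(sorted(p)) for p in combinations(products, 2)]

print(len(consecutive), consecutive)
print(len(all_pairs), all_pairs)
```

Five values give four consecutive pairs but ten combinations, which is why the combinations table is so much longer than the consecutive one.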

The resultant bigram_frequency_consecutive will be

         Bigram  frequency
0   (987, 1232)          1
1     (87, 741)          1
2    (123, 987)          1
3     (12, 987)          1
4      (12, 34)          2
5    (342, 465)          1
6    (654, 862)          1
7    (324, 465)          1
8    (324, 654)          1
9    (324, 987)          1
10     (34, 87)          1
11   (321, 741)          1
12    (12, 324)          1
13   (741, 987)          2
14    (32, 342)          1
15    (34, 741)          1
16   (111, 741)          1
17   (321, 321)          1
18    (12, 654)          1

The resultant bigram_frequency_combinations will be

         Bigram  frequency
0     (87, 741)          1
1    (123, 987)          1
2      (12, 32)          1
3    (321, 654)          2
4     (32, 324)          1
5    (324, 654)          2
6    (324, 987)          1
7      (34, 87)          1
8    (321, 741)          2
9    (324, 342)          1
10   (741, 987)          3
11    (34, 741)          2
12    (34, 987)          2
13    (12, 342)          1
14   (862, 987)          1
15   (111, 987)          1
16   (654, 741)          1
17   (324, 741)          1
18    (32, 654)          1
19   (654, 862)          1
20  (987, 1232)          1
21   (654, 987)          1
22    (12, 324)          1
23     (12, 87)          1
24    (34, 123)          1
25   (321, 321)          1
26   (741, 862)          1
27     (12, 34)          2
28    (12, 111)          1
29    (12, 654)          1
30    (12, 987)          2
31    (32, 342)          1
32   (342, 465)          1
33   (321, 324)          2
34    (87, 111)          1
35   (111, 741)          1
36    (32, 465)          1
37    (12, 123)          1
38   (321, 862)          2
39   (321, 987)          2
40   (324, 465)          1
41   (342, 654)          1
42    (12, 741)          2
43    (12, 465)          1
44   (123, 741)          1
45    (34, 111)          1
46    (87, 987)          1
47   (465, 654)          1
48   (324, 862)          1

In both cases above, the rows are grouped by both Leaf_category_id and session_id.
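
The code above pairs only neighbouring values and does not add the wrap-around pair (last value with the first) that the question asks for. A minimal sketch of one way to add it follows, assuming a small inline sample instead of data.csv; the helper name circular_pairs and the sample values are hypothetical.

```python
from collections import Counter

import pandas as pd

# Small inline sample (hypothetical values) standing in for data.csv.
df = pd.DataFrame({
    "Leaf_category_id": [111, 111, 111, 111, 222, 222, 222],
    "session_id":       [1,   1,   1,   4,   2,   2,   2],
    "product_id":       [987, 741, 34,  987, 34,  741, 987],
})


def circular_pairs(products):
    """Pair each value with its successor, and the last value with the first."""
    values = products.tolist()  # plain Python ints
    if len(values) < 2:
        return []
    return list(zip(values, values[1:] + values[:1]))


counts = Counter()
for _, products in df.groupby(["Leaf_category_id", "session_id"])["product_id"]:
    for pair in circular_pairs(products):
        # Sorting makes (987, 741) and (741, 987) the same key.
        counts[tuple(sorted(pair))] += 1

bigram_frequency = pd.DataFrame({
    "Bigram": list(counts.keys()),
    "frequency": list(counts.values()),
})
print(bigram_frequency)
```

Groups with a single product contribute no bigrams in this sketch; whether such a group should instead yield a self-pair is a choice the question does not settle.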