ritam deb ritam deb - 4 months ago 8x
Python Question

Extracting data from 2 data frames and creating new one python

Its a bit long question have patience.
firstly I have 2 data frames one in which i have name of a guy and pages liked by him in columns. So no. of columns will be different for different person here is the example. 1st column is the name of user.Then pages liked by him is stored across the row.So no. of columns for 'random guy' will be different from 'mank rion'. 'BlackBick' , '500 Startups' e.t.c are name of the page. let say name of this data frame is User_page

random guy BlackBuck GiveMeSport Analytics Ninja
mank nion DJ CHETAS Celina Jaitly Gurkeerat Singh
pop rajuel WOW Editions 500 Startups Biswapati Sarkar
Roshan ghai MensXP No Abuse the smartian

Now I have another Data frame in which is kind of same as upper one but in the place of page's name there is a category of page.you might now there are different category of pages on fb. so let say 'BlacBuck''s category is 'Transport/Freight'. There are pages with same name and different category.That is why i cant use name directly as key this is how my data frame looks like.Let say name of this data frame User_category.

random guy Transport/Freight Sport Insurance Company
mank nion Arts/Entertainment Actress Actor/Director
pop rajuel Concert Tour App Page Actor/Director
Roshan ghai News/Media Website Community Public Figure

Now I have two more Data frames. one in which I have name of fb pages as 1st column and 162 more columns with some tag for each page there is value 1 for i*j element if ith page comes in to jth tag otherwise left empty so it will look like.let say name of this dataframe is Page_tag

name of page tag 1 tag2 tag3
BlackBuck 1 1
GiveMeSport 1 1
Analytics Ninja 1 1

the another one have name of categories as 1st column and same 162 as further. like this. let say name of this dataframe is Category_tag.

category_name tag 1 tag2 tag3
Sport 1 1
App Page 1 1
Actor/Director 1
Public Figure 1 1

Now what I have to get the tag counts for each user from pages he has liked. for that first I have to first check that the page which he has liked where exist in data frame of Page_tag which is 3rd dataframe in my question if it exist there take the counts of tags that how many times a specific tags appeared for that user.this is first step if not found the name of page as no. of pages in Page_tag dataframe(3rd one) is limited. I will go to category of page (from 2nd dataframe in this question) for the pages which are left out and for that category i will count the tags count for the specific user from dataframe named Category_tags(4th dataframe in this question) and sum the tag count and my output something like this.

username tag1 tag2 tag3
random guy 1 2 2
mank nion 2 1 3
pop rajuel 4 0 2
Roshan ghai 0 2 1

a i*j element on this dataframe shows no. times that the jth tag appears for ith user. I have written code for this and more in R i am stuck in this particular step. The code of R wasnt optimal as i used loops many time. I wanted to rhis optimally, hopefully can be done in pandas. Please me know if clarification is needed. Any help will be appreciated. Thank you.


Consider melting your original two data frames then merge with latter two with aggregation on tags. Additionally, a final aggregation on tags to combine both pages and categories datasets. Do note: results do not match exactly with your desired results.

import pandas as pd
import numpy as np
df1 = pd.melt(df1, id_vars=[0])
df1.columns = ['user', 'variable', 'value']

df2 = pd.melt(df2, id_vars=[0])
df2.columns = ['user', 'variable', 'value']

df1 = pd.merge(df1, df3, left_on=['value'], 
               right_on=['name of page'], how='outer')
df2 = pd.merge(df2, df4, left_on=['value'], 
               right_on=['category_name'], how='outer')

df = pd.merge(df1, df2, left_on=['user', 'variable'], 
              right_on=['user','variable'], how='outer')

df['tag1'] = np.where(df['tag1_x'].notnull(), df['tag1_x'], df['tag1_y'])
df['tag2'] = np.where(df['tag2_x'].notnull(), df['tag2_x'], df['tag2_y'])
df['tag3'] = np.where(df['tag3_x'].notnull(), df['tag3_x'], df['tag3_y'])

finaldf = df[['user', 'tag1', 'tag2', 'tag3']].groupby(['user']).agg(sum).reset_index()
print finaldf
#           user  tag1  tag2  tag3
# 0  Roshan ghai   0.0   1.0   1.0
# 1    mank nion   1.0   1.0   1.0
# 2   pop rajuel   2.0   0.0   1.0
# 3   random guy   2.0   1.0   1.0