SanBan - 1 month ago
Python Question

Pandas : Making Decision on groupby size()

I am trying to do a 'Change Data Capture' using two spreadsheets.
I have grouped my resulting dataframe and am stuck on a strange problem.
Requirement:

Case 1) size of a group == 2: do certain tasks

Case 2) size of a group == 1: do certain tasks

Case 3) size of a group > 2: do certain tasks

The problem is that no matter what I try, I cannot break the groupby result down by group size and then loop through each group.

I would like to do something like :

if group_by_1.filter(lambda x: len(x) == 2):
    for grp, rows in sub_group:   # (??) how do I get at the size-2 subgroups?
        for j in range(len(rows) - 1):
            # check rows[j, 'column1'] != rows[j+1, 'column1']:
            do_something()
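A runnable version of this idea, on a toy frame with made-up column names (`key`, `column1` are placeholders for the real grouping and comparison columns), might look like:

```python
import pandas as pd

# toy data standing in for the spreadsheet rows
df = pd.DataFrame({
    "key": ["a", "a", "b", "c", "c", "c"],
    "column1": [1, 2, 3, 4, 4, 5],
})

outcomes = {}
for name, rows in df.groupby("key"):
    n = len(rows)
    if n == 1:
        # Case 2: singleton group
        outcomes[name] = "single"
    elif n == 2:
        # Case 1: compare the two rows of the pair
        vals = rows["column1"].to_list()
        outcomes[name] = "pair_differs" if vals[0] != vals[1] else "pair_same"
    else:
        # Case 3: more than two rows
        outcomes[name] = "multi"

print(outcomes)
```

Iterating the groupby directly like this sidesteps `filter` entirely: each group arrives as its own DataFrame, so you can branch on `len(rows)` and compare consecutive rows inside the branch.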


Here is my code snippet. Any help is much appreciated.

import pandas as pd
import numpy as np

pd.set_option('display.height', 1000)
pd.set_option('display.max_rows', 500)
pd.set_option('display.max_columns', 500)
pd.set_option('display.width', 1000)
print("reading wolverine xlsx")


# defining metadata

df_header = ['DisplayName','StoreLanguage','Territory','WorkType','EntryType','TitleInternalAlias',
'TitleDisplayUnlimited','LocalizationType','LicenseType','LicenseRightsDescription',
'FormatProfile','Start','End','PriceType','PriceValue','SRP','Description',
'OtherTerms','OtherInstructions','ContentID','ProductID','EncodeID','AvailID',
'Metadata', 'AltID', 'SuppressionLiftDate','SpecialPreOrderFulfillDate','ReleaseYear','ReleaseHistoryOriginal','ReleaseHistoryPhysicalHV',
'ExceptionFlag','RatingSystem','RatingValue','RatingReason','RentalDuration','WatchDuration','CaptionIncluded','CaptionExemption','Any','ContractID',
'ServiceProvider','TotalRunTime','HoldbackLanguage','HoldbackExclusionLanguage']
df_w01 = pd.read_excel("wolverine_1.xlsx", names = df_header)

df_w02 = pd.read_excel("wolverine_2.xlsx", names = df_header)





df_w01['version'] = 'OLD'
df_w02['version'] = 'NEW'

#print(df_w01)
df_m_d = pd.concat([df_w01, df_w02], ignore_index = True).reset_index()

#print(df_m_d)

first_pass_get_duplicates = df_m_d[df_m_d.duplicated(['StoreLanguage','Territory','TitleInternalAlias','LocalizationType','LicenseType',
'LicenseRightsDescription','FormatProfile','Start','End','PriceType','PriceValue','ContentID','ProductID',
'AltID','ReleaseHistoryPhysicalHV','RatingSystem','RatingValue','CaptionIncluded'], keep='first')] # This dataframe has records which are DUPES on NEW and OLD
#print(first_pass_get_duplicates)

first_pass_drop_duplicate = df_m_d.drop_duplicates(['StoreLanguage','Territory','TitleInternalAlias','LocalizationType','LicenseType',
'LicenseRightsDescription','FormatProfile','Start','End','PriceType','PriceValue','ContentID','ProductID',
'AltID','ReleaseHistoryPhysicalHV','RatingSystem','RatingValue','CaptionIncluded'], keep=False) # This dataframe has records which are unique on desired values even the first time

#print(first_pass_drop_duplicate)


group_by_1 = first_pass_drop_duplicate.groupby(['StoreLanguage','Territory','TitleInternalAlias','LocalizationType','LicenseType','FormatProfile'],as_index=False)
#Best Case group_by has 2 elements on big key and at least one row is 'new'
#print(group_by_1.grouper.group_info[0])
#for i,rows in group_by_1:

#if(.transform(lambda x : len(x)==2)):
#print(group_by_1.grouper.group_info[0])

#print(group_by_1.describe())

'''for i, rows in group_by_1:
    temp_rows = rows.reset_index()
    temp_rows.reindex(index=range(0, len(rows)))
    print("group has: ", len(temp_rows))
    for j in range(len(rows) - 1):
        print(j)
        print("this iteration: ", temp_rows.loc[j, 'Start'])
        print("next iteration: ", temp_rows.loc[j+1, 'Start'])
        if temp_rows.loc[j+1, 'Start'] == temp_rows.loc[j, 'Start']:
            print("Match")
        else:
            print("no_match")
            print(temp_rows.loc[j, 'Start'])
    print("++++-----++++")'''



Answer

This is a case where using a new index might make your life easier, depending on the operations you need to perform. I tried to mimic what some of your data might look like:

In [1]:
   ...: pd.set_option('display.max_rows', 10)
   ...: pd.set_option('display.max_columns', 50)
   ...:
   ...:
   ...: df_header = ['DisplayName','StoreLanguage','Territory','WorkType','EntryType','TitleInternalAlias',
   ...:          'TitleDisplayUnlimited','LocalizationType','LicenseType','LicenseRightsDescription',
   ...:          'FormatProfile','Start','End','PriceType','PriceValue','SRP','Description',
   ...:          'OtherTerms','OtherInstructions','ContentID','ProductID','EncodeID','AvailID',
   ...:          'Metadata', 'AltID', 'SuppressionLiftDate','SpecialPreOrderFulfillDate','ReleaseYear','ReleaseHistoryOriginal','ReleaseHistoryPhysicalHV',
   ...:           'ExceptionFlag','RatingSystem','RatingValue','RatingReason','RentalDuration','WatchDuration','CaptionIncluded','CaptionExemption','Any','ContractID',
   ...:           'ServiceProvider','TotalRunTime','HoldbackLanguage','HoldbackExclusionLanguage']
   ...:
   ...:
   ...: import itertools as it
   ...:
   ...: catcols = 'StoreLanguage','Territory','TitleInternalAlias','LocalizationType','LicenseType','FormatProfile'
   ...:
   ...: headers = list(catcols) + [chr(c + 65) for c in range(10)]
   ...:
   ...: df = pd.DataFrame(data=np.random.rand(100, len(headers)), columns=headers)
   ...:
   ...: df.StoreLanguage = list(it.islice((it.cycle(["en", "fr"])), 100))
   ...:
   ...: df.Territory = list(it.islice(it.cycle(["us", "fr", "po", "nz", "au"]), 100))
   ...:
   ...: df.TitleInternalAlias = list(it.islice(it.cycle(['a', 'b', 'c']), 100))
   ...:
   ...: df.LocalizationType = list(it.islice(it.cycle(['d', 'g']), 100))
   ...:
   ...: df.LicenseType = list(it.islice(it.cycle(["free", "com", "edu", "home"]), 100))
   ...:
   ...: df.FormatProfile = list(it.islice(it.cycle(["g", "q"]), 100))
   ...:

Here's the trick:

   ...: gb = df.groupby(catcols, as_index=False)
   ...:
   ...: reindexed = (df.assign(group_size = gb['A'].transform(lambda x: x.shape[0]))
   ...:              .set_index("group_size")
   ...:              )
   ...:

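As a side note, the `lambda x: x.shape[0]` can likely be replaced with the built-in `'size'` transform, which broadcasts each group's row count onto every row without a Python-level call per group. A minimal sketch on a toy frame (`k1`, `k2`, `A` are placeholder columns, not your real headers):

```python
import pandas as pd
import numpy as np

# toy frame; k1/k2 stand in for the real grouping columns
df = pd.DataFrame({
    "k1": ["x", "x", "y", "y", "y", "z"],
    "k2": ["u", "u", "v", "v", "v", "w"],
    "A":  np.arange(6.0),
})

# 'size' gives every row its group's row count, same result as the lambda
sizes = df.groupby(["k1", "k2"])["A"].transform("size")

# use those counts as the index, named group_size as in the answer above
reindexed = df.set_index(sizes.rename("group_size"))
```

Here `reindexed.index` ends up as `[2, 2, 3, 3, 3, 1]`, so `reindexed.loc[2]` pulls out exactly the rows belonging to size-2 groups.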

In [2]: reindexed.head()
Out[2]:
           StoreLanguage Territory TitleInternalAlias LocalizationType  \
group_size
2.0                   en        us                  a                d
2.0                   fr        fr                  b                g
2.0                   en        po                  c                d
2.0                   fr        nz                  a                g
2.0                   en        au                  b                d

           LicenseType FormatProfile         A         B         C         D  \
group_size
2.0               free             g  0.312705  0.346577  0.910688  0.317494
2.0                com             q  0.575515  0.627054  0.025820  0.943633
2.0                edu             g  0.489421  0.518020  0.988816  0.833306
2.0               home             q  0.146965  0.823234  0.155927  0.865554
2.0               free             g  0.327784  0.107795  0.678729  0.178454

                   E         F         G         H         I         J
group_size
2.0         0.032420  0.232436  0.279712  0.167969  0.847725  0.777870
2.0         0.833150  0.261634  0.832250  0.511341  0.865027  0.850981
2.0         0.924992  0.129079  0.419342  0.603113  0.705015  0.683255
2.0         0.560832  0.434411  0.260553  0.208577  0.259383  0.997590
2.0         0.431881  0.729873  0.606323  0.806250  0.000556  0.793380

In [3]: reindexed.loc[2, "FormatProfile"].head()
Out[3]:
group_size
2.0    g
2.0    q
2.0    g
2.0    q
2.0    g
Name: FormatProfile, dtype: object

You can drop duplicates here...

In [4]: reindexed.loc[2, "FormatProfile"].drop_duplicates()
Out[4]:
group_size
2.0    g
2.0    q
Name: FormatProfile, dtype: object

And recombine the slices as you see fit.
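For the three cases in the question, the recombination could then be a boolean mask on the new index. A small self-contained sketch (the frame below is a made-up stand-in for `reindexed`):

```python
import pandas as pd

# minimal stand-in for the reindexed frame above (values are made up)
reindexed = pd.DataFrame(
    {"FormatProfile": ["g", "q", "g", "g", "q"]},
    index=pd.Index([1, 2, 2, 3, 3], name="group_size"),
)

singles = reindexed[reindexed.index == 1]  # Case 2: groups of exactly 1
pairs   = reindexed[reindexed.index == 2]  # Case 1: groups of exactly 2
larger  = reindexed[reindexed.index > 2]   # Case 3: groups larger than 2
```

Each slice is an ordinary DataFrame, so the per-case "do certain tasks" logic can run on each one independently.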