helloB - 2 months ago
Python Question

Pandas relabel rows to recognize unique values within a groupby

I have a Pandas DataFrame that describes some version testing and looks like this:

MailingName  EmailSubject  MailingID
Promo_v1s1   Hello!        A8FEFE
Promo_v1s2   Line 2        A8FEFE
Promo_v2s1   Line 2        A8FEFE
Promo_v2s2   Yo!           A8FEFE
Promo_v2S3   Hello!        A8FEFE
deal_v2s1    Line 2        bbb
deal_v2s2    Yo!           bbb
deal_v2ss    Hello         bbb


The same mailing campaign, with different version tests, can be identified by the MailingID (so that would be the groupby key for computing further characteristics).

The naming convention for MailingName is that v + a number indicates the email body version that was tested, and s + a number indicates the email subject line that was tested in a particular combination. However, the convention is not helpful, in the sense that the subject line from a v1s1 is not necessarily the same as a subject line in v2s2, even when the MailingID is shared.

Within each MailingID group, I want all email subject lines that are actually identical to share the same 'subject line version number'. So I'd like to create another column that would result in something like this:

MailingName  EmailSubject  MailingID  TrueEmailVersionNumber
Promo_v1s1   Hello!        A8FEFE     1
Promo_v1s2   Line 2        A8FEFE     2
Promo_v2s1   Line 2        A8FEFE     2
Promo_v2s2   Yo!           A8FEFE     3
Promo_v2S3   Hello!        A8FEFE     1
deal_v2s1    Line 2        bbb        1
deal_v2s2    Yo!           bbb        2
deal_v2ss    Hello         bbb        3


Basically, I want to add unique labels, per group, to a column. How can I do this with Pandas?

I had an idea for getting started in a clunky way, like so:

def processThis(x):
    uni = list(set(x))
    keys = {x_i: uni.index(x_i) for x_i in x}
    return keys

ab_data.groupby('mailing_id')['subject'].apply(processThis)


But this did not actually give me back a list of dictionaries, so even my first step is a non-starter. Thanks for any advice!

Answer
In [217]: import itertools

In [218]: df
Out[218]: 
  MailingName EmailSubject MailingID
0  Promo_v1s1       Hello!    A8FEFE
1  Promo_v1s2       Line 2    A8FEFE
2  Promo_v2s1       Line 2    A8FEFE
3  Promo_v2s2          Yo!    A8FEFE
4  Promo_v2S3       Hello!    A8FEFE
5   deal_v2s1       Line 2       bbb
6   deal_v2s2          Yo!       bbb
7   deal_v2ss        Hello       bbb

In [219]: def f(x):
     ...:     unq = list(x['EmailSubject'].unique())
     ...:     return [unq.index(y) + 1 for y in x['EmailSubject']]
     ...: 

In [220]: versions = df.groupby('MailingID').apply(f)

In [221]: df['TrueEmailVersionNumber'] = list(itertools.chain(*versions))

In [222]: df
Out[222]: 
  MailingName EmailSubject MailingID  TrueEmailVersionNumber
0  Promo_v1s1       Hello!    A8FEFE                       1
1  Promo_v1s2       Line 2    A8FEFE                       2
2  Promo_v2s1       Line 2    A8FEFE                       2
3  Promo_v2s2          Yo!    A8FEFE                       3
4  Promo_v2S3       Hello!    A8FEFE                       1
5   deal_v2s1       Line 2       bbb                       1
6   deal_v2s2          Yo!       bbb                       2
7   deal_v2ss        Hello       bbb                       3
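
One caveat about the session above: flattening the per-group lists with itertools.chain only lines up with the rows if the frame is already ordered so that each MailingID group is contiguous and the groups appear in sorted key order, which happens to be true for this sample. A possible alternative that does not depend on row order is to let pandas align the result for you with groupby(...).transform and pd.factorize. This is only a sketch under the assumption that labels should be numbered by first appearance within each group; the helper name label_in_order_of_appearance is made up for illustration.

```python
import pandas as pd

# Sample data copied from the question.
df = pd.DataFrame({
    "MailingName": ["Promo_v1s1", "Promo_v1s2", "Promo_v2s1", "Promo_v2s2",
                    "Promo_v2S3", "deal_v2s1", "deal_v2s2", "deal_v2ss"],
    "EmailSubject": ["Hello!", "Line 2", "Line 2", "Yo!",
                     "Hello!", "Line 2", "Yo!", "Hello"],
    "MailingID": ["A8FEFE"] * 5 + ["bbb"] * 3,
})


def label_in_order_of_appearance(subjects):
    # Hypothetical helper: pd.factorize returns 0-based integer codes in
    # order of first appearance, so add 1 to get 1-based version numbers.
    codes, _ = pd.factorize(subjects)
    return codes + 1


# transform returns a result aligned to the original index, so the
# assignment is correct even when the groups are interleaved.
df["TrueEmailVersionNumber"] = (
    df.groupby("MailingID")["EmailSubject"]
      .transform(label_in_order_of_appearance)
)

print(df)
```

On the sample data this should give the same column as the session above, since both approaches number subjects by first appearance within each MailingID.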