ldevyataykina - 1 month ago
Python Question

Pandas: Union strings in dataframe

I have a dataframe df:

ID active_seconds domain subdomain search_engine search_term
0120bc30e78ba5582617a9f3d6dfd8ca 35 vk.com vk.com None None
0120bc30e78ba5582617a9f3d6dfd8ca 54 vk.com vk.com None None
0120bc30e78ba5582617a9f3d6dfd8ca 34 vk.com vk.com None None
16c28c057720ab9fbbb5ee53357eadb7 4 facebook.com facebook.com None None
16c28c057720ab9fbbb5ee53357eadb7 4 facebook.com facebook.com None None
16c28c057720ab9fbbb5ee53357eadb7 8 facebook.com facebook.com None None
0120bc30e78ba5582617a9f3d6dfd8ca 16 megarand.ru megarand.ru None None
0120bc30e78ba5582617a9f3d6dfd8ca 6 vk.com vk.com None None


I need to transform df. If two consecutive rows have the same ID and subdomain[i] == subdomain[i-1], I should merge them into one row and sum their active seconds: active_seconds[i-1] + active_seconds[i].
From this df I want to get:

ID active_seconds domain subdomain search_engine search_term
0120bc30e78ba5582617a9f3d6dfd8ca 123 vk.com vk.com None None
16c28c057720ab9fbbb5ee53357eadb7 16 facebook.com facebook.com None None
0120bc30e78ba5582617a9f3d6dfd8ca 16 megarand.ru megarand.ru None None
0120bc30e78ba5582617a9f3d6dfd8ca 6 vk.com vk.com None None


What should I use to do it?

Answer

This gets really close. I'm not sure whether getting the row order exactly right is important to you.

I also assumed that I should group by ID. This means that if the rows for one ID are interrupted by rows for another ID but stay in the same subdomain, their active_seconds are aggregated together.

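def proc_id(df):
    # True wherever the subdomain differs from the previous row
    cond = df.subdomain != df.subdomain.shift()
    # Running count of changes: one label per run of equal subdomains
    part = cond.cumsum()
    # Keep the first row of each run, then overwrite active_seconds with the run total
    df_ = df.groupby(part).first()
    df_['active_seconds'] = df.groupby(part).active_seconds.sum()
    return df_

# Process each ID separately and discard the group index
df.groupby('ID').apply(proc_id).reset_index(drop=True)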
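
If the original row order does matter, one possible variation is to skip the per-ID groupby and instead start a new run whenever either ID or subdomain changes from the previous row. This is only a sketch of that idea, not the method used above: the names new_run, run_id and agg are my own, and the frame is rebuilt by hand from the sample in the question.

```python
import pandas as pd

# Sample data copied from the question
a = "0120bc30e78ba5582617a9f3d6dfd8ca"
b = "16c28c057720ab9fbbb5ee53357eadb7"
sites = ["vk.com", "vk.com", "vk.com",
         "facebook.com", "facebook.com", "facebook.com",
         "megarand.ru", "vk.com"]
df = pd.DataFrame({
    "ID": [a, a, a, b, b, b, a, a],
    "active_seconds": [35, 54, 34, 4, 4, 8, 16, 6],
    "domain": sites,
    "subdomain": sites,
    "search_engine": [None] * 8,
    "search_term": [None] * 8,
})

# A new run starts whenever ID or subdomain differs from the previous row
new_run = (df["ID"] != df["ID"].shift()) | (df["subdomain"] != df["subdomain"].shift())
run_id = new_run.cumsum()

# Take the first value of every column per run, except active_seconds, which is summed
agg = {col: "first" for col in df.columns}
agg["active_seconds"] = "sum"

result = df.groupby(run_id).agg(agg).reset_index(drop=True)
print(result[["ID", "active_seconds", "subdomain"]])
```

Because the run labels increase from top to bottom, the merged rows come out in the order they first appear in df. Unlike the per-ID version, rows of one ID that are interrupted by another ID are not merged here.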
