XsjakieX XsjakieX - 5 months ago 23
Python Question

Python pandas loop efficient through two dataframes with different lengths

I have two dataframes with different lengths(df,df1). They share one similar label

"collo_number"
. I want to search the second dataframe for every
collo_number
in the first data frame. Problem is that the second date frame contains multiple rows for different dates for every
collo_nummer
. So i want to sum these dates and add this in a new column in the first database.

I now use a loop but it is rather slow and has to perform this operation for al 7 days in a week. Is there a way to get a better performance? I tried multiple solutions but keep getting the error that i cannot use the equal sign for two databases with different lenghts. Help would really be appreciated! Here is an example of what is working but with a rather bad performance.

df5=[df1.loc[(df1.index == nasa) & (df1.afleverdag == x1) & (df1.ind_init_actie=="N"), "aantal_colli"].sum() for nasa in df.collonr]

Answer

Your description is a bit vague (hence my comment). First what you good do is to select the rows of the dataframe that you want to search:

dftmp = df1[(df1.afleverdag==x1) & (df1.ind_init_actie=='N')]

so that you don't do this for every item in the loop. Second, use .groupby.

newseries = dftmp['aantal_colli'].groupby(dftmp.index).sum()  
newseries = newseries.ix[df.collonr.unique()]
Comments