abou abou - 1 year ago 52
Python Question

Copying set of specific columns between Pandas dfs where specific values match

I am sure this is going to be a 'doh' moment, but I am having a difficult time trying to copy a set of columns between dataframes where the value of a specific column in df1 is also found in df2.

A simplified version of df1 looks like this:

A simplified version of df2 looks like this:

From here I'm building a list of columns (cols) that does not include the 'p_people_id' field from df1 and creating those fields in df2 and assigning a nan value.
cols=columns from df1 != p_people_id
After which df2 looks like this:
df2 w appended cols
Working with these dfs I'm trying to look find all instances where df2.a_people_id == df1.p_people_id and assign the values of df1[cols] to the df2 instance.

Finding the instance using .loc is simple enough. I've managed to be able to select the cols I want to target using .loc as well


This works fine and returns: enter image description here

But, if I try something like this to set/assign those specific columns where the id field in df1 matches the id field in df2:


Nothing happens and I'm not sure why. df2 assignment results

I've attempted to utilize .ix, .loc, .iloc, .where, .select, .set in various ways, but this has to be one of those areas where I'm "just not doin it right." I can post other examples where I've managed to get the syntax right to find [cols] where the specific id matches and no error or 'view vs copy' warning is printed, but no assignment happens either. Where am I going wrong here?

Answer Source

If I'm not mistaken, I think you're looking for a join operation.

In particular, this statement in your description:


Means "look in df2 for all rows where p_people_id matches the first p_people_id and for those rows, select the cols columns.

This produces a set of rows and a columns (6 in your example, all containing NA's) and, if I get things correctly, you then want to put those 6 lines together with the corresponding line in df1, plus do that for all p_people_id in df1.

If my assumption above is correct, then this can be done with a simple join. Like so:

pd.merge(left=df1, left_on="p_people_id",
         right=df2, right_on="p_people_id")

If there are 1000 different p_people_id in df1 and each of them had 6 lines in df2, the above statement would produce a dataframe with 6000 rows.

You can then select the desired columns in the result.