Josh Josh - 3 months ago 17

Re-assignment in Pandas: Copy or view?

Say we have the following dataframe:

import pandas as pd
from numpy.random import randn

df = pd.DataFrame({'A': ['foo', 'bar', 'foo', 'bar',
                         'foo', 'bar', 'foo', 'foo'],
                   'B': ['one', 'one', 'two', 'three',
                         'two', 'two', 'one', 'three'],
                   'C': randn(8), 'D': randn(8)})


shown below:

> df
     A      B         C         D
0  foo    one  0.846192  0.478651
1  bar    one  2.352421  0.141416
2  foo    two -1.413699 -0.577435
3  bar  three  0.569572 -0.508984
4  foo    two -1.384092  0.659098
5  bar    two  0.845167 -0.381740
6  foo    one  3.355336 -0.791471
7  foo  three  0.303303  0.452966


And then I do the following:

df2 = df
df = df[df['C']>0]


If you now look at df and df2, you will see that df2 holds the original data, whereas df was updated to keep only the rows where C was greater than 0.

I thought Pandas wasn't supposed to make a copy in an assignment like df2 = df, and that it would only make copies with either:


  1. df2 = df.copy(deep=True)

  2. df2 = copy.deepcopy(df)



What happened above then? Did df2 = df make a copy? I presume that the answer is no, so it must have been df = df[df['C']>0] that made a copy, and I presume that, if I didn't have df2 = df above, there would have been a copy without any reference to it floating in memory. Is that correct?

Note: I read through "Returning a view versus a copy" and I wonder if the following:


Whenever an array of labels or a boolean vector are involved in the indexing operation, the result will be a copy.


explains this behavior.

Answer

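df2 = df does not make a copy; it only binds a second name to the same DataFrame. It is df = df[df['C'] > 0] that returns a new object (a copy), and the name df is then rebound to that new object while df2 still refers to the original.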

Just print out the ids and you'll see:

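print(id(df))          # id of the original DataFrame
df2 = df
print(id(df2))         # same id: df2 is another name for the same object
df = df[df['C'] > 0]
print(id(df))          # different id: boolean indexing returned a new object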
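
For a check that doesn't rely on reading ids by eye, the same idea can be sketched with the `is` operator. This is only a minimal illustration: the small frame and the names `same_before`, `same_after`, `rows_original` and `rows_filtered` are made up for the example and are not the data from the question.

```python
import pandas as pd

# Small deterministic frame; the values are illustrative only.
df = pd.DataFrame({"A": ["foo", "bar", "foo", "bar"],
                   "C": [0.5, -1.0, 2.0, -0.3]})

df2 = df                     # binds a second name to the same object, no copy
same_before = df2 is df      # both names refer to one DataFrame

df = df[df["C"] > 0]         # boolean mask builds a new DataFrame; df is rebound
same_after = df2 is df       # df now names the filtered result, df2 the original

rows_original = len(df2)     # the original still has every row
rows_filtered = len(df)      # only the rows with C > 0

print(same_before, same_after, rows_original, rows_filtered)
```

As for the "floating in memory" part of the question: if df2 had not been kept, nothing would refer to the original frame after the reassignment (assuming no other references exist), so Python would be free to reclaim it rather than leave it around.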