kdragger - 1 year ago
Python Question

Pandas dataframe scope and changes

I'm using Python 3.6 and Pandas 0.20.3.

I'm sure this must be addressed somewhere, but I can't seem to find it. I alter a dataframe inside a function by adding columns; then I restore the dataframe to the original columns. I don't return the dataframe. The added columns stay.
I could understand it if the columns added inside the function were not permanent and reassigning the dataframe also had no effect. I would also understand it if adding columns altered the dataframe and reassigning it stuck as well.
Here is the code:

import numpy as np
import pandas as pd
df = pd.DataFrame(np.random.randn(10, 5))
df


which gives

0 1 2 3 4
0 0.406779 -0.481733 -1.187696 -0.210456 -0.608194
1 0.732978 -0.079787 -0.051720 1.097441 0.089850
2 1.859737 -1.422845 -1.148805 0.254504 1.207134
3 0.074400 -1.352875 -1.341630 -1.371050 0.005505
4 -0.102024 -0.905506 -0.165681 2.424180 0.761963
5 0.400507 -0.069214 0.228971 -0.079805 -1.059972
6 1.284812 0.843705 -0.885566 1.087703 -1.006714
7 0.135243 0.055807 -1.217794 0.018104 -1.571214
8 -0.524320 -0.201561 1.535369 -0.840925 0.215584
9 -0.495721 0.284237 0.235668 -1.412262 -0.002418


Now, I create a function:

def mess_around(df):
    cols = df.columns
    df['extra'] = 'hi'
    df = df[cols]


then run it and display the dataframe:

mess_around(df)
df


which gives:

0 1 2 3 4 extra
0 0.406779 -0.481733 -1.187696 -0.210456 -0.608194 hi
1 0.732978 -0.079787 -0.051720 1.097441 0.089850 hi
2 1.859737 -1.422845 -1.148805 0.254504 1.207134 hi
3 0.074400 -1.352875 -1.341630 -1.371050 0.005505 hi
4 -0.102024 -0.905506 -0.165681 2.424180 0.761963 hi
5 0.400507 -0.069214 0.228971 -0.079805 -1.059972 hi
6 1.284812 0.843705 -0.885566 1.087703 -1.006714 hi
7 0.135243 0.055807 -1.217794 0.018104 -1.571214 hi
8 -0.524320 -0.201561 1.535369 -0.840925 0.215584 hi
9 -0.495721 0.284237 0.235668 -1.412262 -0.002418 hi


I know I can solve the problem by returning df, so I can fix it, but I want to understand where I am going wrong. I suspect that the variable df is local to the function: it is given a pointer to the dataframe, and reassigning it inside the function changes nothing outside. Yet the column assignment uses the pointer that was passed in and therefore changes the dataframe "directly". Is that correct?

EDIT:
For those that might want to address the dataframe in place, I've added:

for c in df.columns:
    if c not in cols:
        del df[c]


I'm guessing if I return the new dataframe, then there will be a potentially large dataframe that will have to be dealt with by garbage collection.

Answer Source

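To understand what happens, you should know the difference between passing arguments to functions by value and passing them by reference. Python passes a reference to the object, so the function and the caller start out sharing the same dataframe.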


You pass the variable df to your function mess_around. The function modifies the original dataframe in place by adding a column.

This subsequent line of code seems to be the cause for confusion here:

df = df[cols]

What happens here is that the variable df originally held a reference to your dataframe. The reassignment makes the local variable df point to a different object; your original dataframe is not changed by it, which is why the added column is still there after the call.
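
One way to see the rebinding is to compare object identity before and after each step. The sketch below is only an illustration: mess_around_traced and frame are hypothetical names that are not in the question, and the function is a variant of mess_around that reports what it observes.

```python
import numpy as np
import pandas as pd

def mess_around_traced(df):
    """Hypothetical variant of mess_around that reports what each step does."""
    original = df                     # a second name for the caller's object
    cols = list(df.columns)           # column labels before anything is added
    df['extra'] = 'hi'                # mutates the caller's object in place
    same_after_mutation = df is original
    df = df[cols]                     # rebinds the local name to a new DataFrame
    same_after_rebinding = df is original
    return same_after_mutation, same_after_rebinding

frame = pd.DataFrame(np.random.randn(3, 2))
after_mutation, after_rebinding = mess_around_traced(frame)

print(after_mutation)                 # True: still the caller's object
print(after_rebinding)                # False: the local name now refers to a new object
print('extra' in frame.columns)       # True: the in-place change is visible outside
```

The column assignment goes through the shared object, while df = df[cols] only changes what the local name refers to.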

Here's a simpler example:

import numpy as np

def foo(l):
    l.insert(0, np.nan)   # original modified
    l = [4, 5, 6]         # reassignment - no change to the original,
                          # but the variable l points to something different

lst = [1, 2, 3]
foo(lst)

print(lst)
# prints [nan, 1, 2, 3]: the insert modified the original,
# but the reassignment did not
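
Applied back to the dataframe, there are two ways to make the caller see the restored columns: return the new object and rebind the name outside, or remove the added columns from the shared object in place. The sketch below assumes hypothetical function names (restore_by_return, restore_in_place) and uses the drop(labels, axis=1, inplace=True) form rather than the newer columns= keyword, so it should also work on older pandas releases such as the 0.20.3 mentioned in the question.

```python
import numpy as np
import pandas as pd

def restore_by_return(df):
    """Return a trimmed DataFrame; the caller has to capture the result."""
    cols = list(df.columns)
    df['extra'] = 'hi'                # still mutates the caller's object
    return df[cols]                   # a new object without the extra column

def restore_in_place(df):
    """Drop the added columns from the caller's object itself."""
    cols = list(df.columns)
    df['extra'] = 'hi'
    added = [c for c in df.columns if c not in cols]
    df.drop(added, axis=1, inplace=True)

frame_a = pd.DataFrame(np.random.randn(3, 2))
frame_a = restore_by_return(frame_a)  # rebinding happens in the caller's scope

frame_b = pd.DataFrame(np.random.randn(3, 2))
restore_in_place(frame_b)             # nothing to capture

print('extra' in frame_a.columns)     # False
print('extra' in frame_b.columns)     # False
```

Which variant is preferable depends on whether other code holds references to the same dataframe: the in-place version changes what all of them see, while the returning version leaves the object that was passed in with the extra column.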