I'm using Python 3.6 and Pandas 0.20.3.
I'm sure this must be addressed somewhere, but I can't seem to find it. I alter a dataframe inside a function by adding columns; then I restore the dataframe to the original columns. I don't return the dataframe. The added columns stay.
I could understand it if neither change were permanent: the added columns did not persist and the reassignment had no effect outside the function. I could also understand it if both persisted: the added columns stayed and the reassignment stuck. What I see is a mix of the two.
Here is the code:
import numpy as np
import pandas as pd
df = pd.DataFrame(np.random.randn(10, 5))
df
0 1 2 3 4
0 0.406779 -0.481733 -1.187696 -0.210456 -0.608194
1 0.732978 -0.079787 -0.051720 1.097441 0.089850
2 1.859737 -1.422845 -1.148805 0.254504 1.207134
3 0.074400 -1.352875 -1.341630 -1.371050 0.005505
4 -0.102024 -0.905506 -0.165681 2.424180 0.761963
5 0.400507 -0.069214 0.228971 -0.079805 -1.059972
6 1.284812 0.843705 -0.885566 1.087703 -1.006714
7 0.135243 0.055807 -1.217794 0.018104 -1.571214
8 -0.524320 -0.201561 1.535369 -0.840925 0.215584
9 -0.495721 0.284237 0.235668 -1.412262 -0.002418
def mess_around(df):
    cols = df.columns
    df['extra'] = 'hi'
    df = df[cols]

mess_around(df)
df
0 1 2 3 4 extra
0 0.406779 -0.481733 -1.187696 -0.210456 -0.608194 hi
1 0.732978 -0.079787 -0.051720 1.097441 0.089850 hi
2 1.859737 -1.422845 -1.148805 0.254504 1.207134 hi
3 0.074400 -1.352875 -1.341630 -1.371050 0.005505 hi
4 -0.102024 -0.905506 -0.165681 2.424180 0.761963 hi
5 0.400507 -0.069214 0.228971 -0.079805 -1.059972 hi
6 1.284812 0.843705 -0.885566 1.087703 -1.006714 hi
7 0.135243 0.055807 -1.217794 0.018104 -1.571214 hi
8 -0.524320 -0.201561 1.535369 -0.840925 0.215584 hi
9 -0.495721 0.284237 0.235668 -1.412262 -0.002418 hi
for c in ts.columns:
    if c not in cols:
        del ts[c]
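To understand what happens, you should know the difference between passing arguments to functions by value and passing them by reference. Python passes a reference to the object, so mutating that object inside the function is visible to the caller, while rebinding the parameter name is not.
You pass a variable df to your function mess_around. The function modifies the original dataframe in place by adding a column.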
The next line of code seems to be the cause of the confusion here:
df = df[cols]
What happens here is that the variable df originally held a reference to your dataframe. The reassignment makes the local variable df point to a different object, the new dataframe returned by df[cols]. Your original dataframe is not changed by that line, so it keeps the added column.
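One way to see the rebinding directly is to compare object identities with `is`. This is only a sketch: the helper `rebinds` and the names `caller_obj`, `local_obj` and `original` are hypothetical and not part of the question's code.

```python
import numpy as np
import pandas as pd

def rebinds(df):
    # Hypothetical helper, for illustration only.
    caller_obj = df        # a second name for the object the caller passed in
    cols = df.columns      # column index before anything is added
    df['extra'] = 'hi'     # in-place change: mutates the caller's object
    df = df[cols]          # rebinding: the local name now refers to a new dataframe
    return caller_obj, df

original = pd.DataFrame(np.random.randn(3, 2))
caller_obj, local_obj = rebinds(original)

print(caller_obj is original)    # the object passed in is the caller's object
print(local_obj is original)     # the reassigned local is a different object
print(list(original.columns))    # the caller's dataframe still has 'extra'
print(list(local_obj.columns))   # only the new object is restricted to cols
```

The column selection never touches the caller's dataframe; it builds a new one that only the local name refers to.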
Here's a simpler example:
def foo(l):
    l.insert(0, np.nan)  # original modified
    l = [4, 5, 6]        # reassignment - no change to the original,
                         # but the variable l points to something different

lst = [1, 2, 3]
foo(lst)
print(lst)
[nan, 1, 2, 3] # notice here that the insert modifies the original,
# but not the reassignment
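If the goal is for the caller's dataframe to end up with only its original columns, two common options are to drop the extra columns in place or to return the new dataframe and rebind it at the call site. The following is a minimal sketch under those assumptions; the function names `restore_in_place` and `restore_by_return` are hypothetical.

```python
import numpy as np
import pandas as pd

def restore_in_place(df):
    # Hypothetical variant: undo the change on the caller's own object.
    cols = df.columns
    df['extra'] = 'hi'
    extra = [c for c in df.columns if c not in cols]
    df.drop(extra, axis=1, inplace=True)   # mutates the object the caller holds

def restore_by_return(df):
    # Hypothetical variant: hand the new dataframe back to the caller.
    cols = df.columns
    df['extra'] = 'hi'    # note: this still mutates the object passed in
    return df[cols]       # new dataframe with only the original columns

a = pd.DataFrame(np.random.randn(3, 2))
restore_in_place(a)
print(list(a.columns))

b = pd.DataFrame(np.random.randn(3, 2))
b = restore_by_return(b)   # the caller must rebind the name to see the result
print(list(b.columns))
```

The in-place version changes the object itself, so no return value is needed. The returning version only helps if the caller assigns the result, and the dataframe that was passed in still carries the extra column unless the function works on a copy.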