cqcn1991 cqcn1991 - 3 months ago 7
Python Question

Python & Pandas: Will using many df.copy affect code's performance?

I'm doing some data analysis, and the data are in pandas

DataFrame
,
df
.

There are several function that I defined to do process on the
df
.

For encapsulation purpose, I define the functions like this:

def df_process(df):
df=df.copy()
# do some process work on df
return df


In Jupyter Notebook, I use the function as

df = df_process(df)


The reason for using
df.copy()
is that otherwise the original
df
would be modified, whether you assign it back or not. (see Python & Pandas: How to return a copy of a dataframe?)

My question is:


  1. Is using
    df=df.copy()
    proper here? If not, how should a function handle data be defined?

  2. Since I use several such data processing function, will it affect my program's performance? And how much?


msw msw
Answer

Far better would be:

def df_process(df):
    # do some process work on df

def df_another(df):
    # other processing

def df_more(df):
    # yet more processing

def process_many(df):
    for frame_function in (df_process, df_another, df_more):
        df_copy = df.copy()
        frame_function(df_copy)
        # emit the results to a file or screen or whatever

The key here is if you must make a copy, make only one, process it, stash the results somewhere and then dispose of it by reassigning df_copy. Your question made no mention of why you are hanging onto processed copies so this assumes you need not.

Comments