cqcn1991 - 8 months ago
Python Question

Python & Pandas: Will using many df.copy affect code's performance?

I'm doing some data analysis, and the data are in a pandas DataFrame, df.

There are several functions that I defined to process the df.

For encapsulation purposes, I define the functions like this:

def df_process(df):
    df = df.copy()
    # do some process work on df
    return df

In a Jupyter Notebook, I call the function like this:

df = df_process(df)

The reason for using df.copy() is that otherwise the original df would be modified, whether you assign it back or not. (see Python & Pandas: How to return a copy of a dataframe?)
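To make the concern concrete, here is a minimal sketch of the behaviour described above. The function names, the column names x and doubled, and the sample data are all hypothetical and chosen only for illustration.

```python
import pandas as pd

def process_in_place(df):
    # Hypothetical processing step: adds a column directly to the caller's frame.
    df["doubled"] = df["x"] * 2
    return df

def process_on_copy(df):
    # Same step, but performed on a copy, so the caller's frame is left untouched.
    df = df.copy()
    df["doubled"] = df["x"] * 2
    return df

original = pd.DataFrame({"x": [1, 2, 3]})

# With the copy, the new column appears only in the returned frame.
result = process_on_copy(original)
copy_left_original_alone = "doubled" not in original.columns

# Without the copy, the original changes even though the return value is ignored.
process_in_place(original)
in_place_changed_original = "doubled" in original.columns

print(copy_left_original_alone, in_place_changed_original)
```

Both flags should come out true: the frame passed into the function is the same object the caller holds, so assigning a column inside the function is visible outside it unless a copy is made first.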

My question is:

  1. Is using df = df.copy() proper here? If not, how should a function that handles data be defined?

  2. Since I use several such data processing functions, will this affect my program's performance? And by how much?

msw

Far better would be:

def df_process(df):
    # do some process work on df
    pass

def df_another(df):
    # other processing
    pass

def df_more(df):
    # yet more processing
    pass

def process_many(df):
    for frame_function in (df_process, df_another, df_more):
        # one fresh copy per function; reassigning df_copy releases the previous one
        df_copy = df.copy()
        frame_function(df_copy)
        # emit the results to a file or screen or whatever

The key here is that if you must make a copy, you should hold only one at a time: process it, stash the results somewhere, and then dispose of it by reassigning df_copy. Your question made no mention of why you are hanging onto processed copies, so this assumes you need not.
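As for how much a copy costs, the simplest way to find out is to measure it on data shaped like your own. The following is only a rough sketch: the frame size, the column names, and the number of runs are assumptions, and the result will vary with dtypes, memory, and pandas version.

```python
import timeit

import pandas as pd

# Hypothetical frame: 100,000 rows and 5 integer columns.
n_rows = 100_000
df = pd.DataFrame({name: range(n_rows) for name in "abcde"})

# Average wall-clock time of one df.copy() call over several runs.
n_runs = 20
seconds_per_copy = timeit.timeit(df.copy, number=n_runs) / n_runs
print(f"df.copy() took about {seconds_per_copy * 1000:.3f} ms per call")

# The copy holds the same values but is a separate object.
df_copy = df.copy()
print(df_copy.equals(df), df_copy is df)
```

Multiplying the per-call time by the number of processing functions gives a rough upper bound on what the copies add to a run, which you can then compare with the time spent in the processing itself.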