tumbler tumbler - 7 months ago 50
Python Question

Python Pandas: Reuse stored means correctly to replace nan

I computed column-wise means over a dataset.

Let's say the data looks like this:

A B C ... Z
0.1 0.2 0.15 ... 0.17
. . . .
. . . .
. . . .

I used DataFrame.mean() and got the following result:

A some_mean_A
B some_mean_B
Z some_mean_Z

To replace NaN values, I use fillna(). This works when I compute the means and use them during the same execution.

But as soon as I save the means to a file and read them back in a different .py file, I get rubbish. The reason is that the file with the means is not interpreted correctly. In the new dataset, each NaN in column A should be replaced by some_mean_A, and likewise for B and the rest up to Z. This does not happen, because reading the means with read_csv() gives me the following:

0 1
A some_mean_A
B some_mean_B
Z some_mean_Z

When I pass this to fillna(), I do not get the expected result.

I hope the problem is clear. Do you know how to solve it?

EDIT 1.0:

This is how I compute and store the means:

df_mean = df.mean()
df.fillna(df_mean, inplace=True)  # df is the DataFrame for the dataset where it works

This is how I read the means:

df_mean = pd.read_csv('mean.csv', header=None)
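A minimal sketch of the mismatch, assuming the means were written as plain "name,mean" lines without a header row. The toy data and the in-memory buffer (used here instead of 'mean.csv') are illustrative assumptions, not the original dataset:

```python
import io

import numpy as np
import pandas as pd

# Hypothetical toy data standing in for the real dataset.
df = pd.DataFrame({"A": [1.0, 3.0, np.nan], "B": [2.0, 4.0, 6.0]})
df_mean = df.mean()  # Series indexed by column name

# Write the means without a header row (assumed file layout).
buf = io.StringIO()
df_mean.to_csv(buf, header=False)
buf.seek(0)

# Reading with header=None alone yields a DataFrame with columns 0 and 1.
loaded = pd.read_csv(buf, header=None)
print(type(loaded).__name__, list(loaded.columns))

# fillna aligns a DataFrame argument on index and column labels.
# Columns 0 and 1 do not match "A" and "B", so nothing is filled.
not_filled = df.fillna(loaded)
remaining_nans = int(not_filled.isna().sum().sum())
print(remaining_nans)
```

The column names end up as values in column 0 rather than as an index, which is why fillna() has nothing to match against.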


df.mean() returns a Series: a one-dimensional structure whose values are the column means and whose index holds the column names. With its default parameters, however, pd.read_csv reads the saved file as a DataFrame, with one column for the column names and another for the means. To get the same data structure back, specify the index column and squeeze the single remaining column into a Series. On older pandas you can pass squeeze=True (this keyword was deprecated in pandas 1.4 and removed in 2.0, where you call .squeeze("columns") on the result instead):

df_mean = pd.read_csv('mean.csv', header=None, index_col=0, squeeze=True)

This gives you the same Series for the mean vector. You can append rename_axis(None) to get rid of the index name (I think this requires pandas 0.18.0 or later):

df_mean = pd.read_csv('mean.csv', header=None, index_col=0).squeeze().rename_axis(None)
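As a rough end-to-end sketch of the round trip, assuming the means are saved without a header row: the toy frames below and the in-memory buffer (standing in for 'mean.csv') are hypothetical. It uses .squeeze("columns") rather than a bare .squeeze(), since the bare call would collapse a file with a single mean to a scalar:

```python
import io

import numpy as np
import pandas as pd

# Hypothetical dataset on which the means are computed.
train = pd.DataFrame({"A": [1.0, 3.0, np.nan], "B": [2.0, np.nan, 6.0]})
means = train.mean()  # A -> 2.0, B -> 4.0

# Save the Series as "name,mean" lines with no header row.
buf = io.StringIO()
means.to_csv(buf, header=False)
buf.seek(0)

# Read it back: first column becomes the index, the remaining
# single column is squeezed into a Series.
loaded = (
    pd.read_csv(buf, header=None, index_col=0)
    .squeeze("columns")
    .rename_axis(None)
)

# Hypothetical new dataset, filled in a separate run.
new_data = pd.DataFrame({"A": [np.nan, 1.0], "B": [2.0, np.nan]})
filled = new_data.fillna(loaded)
print(filled)
```

Because the loaded Series is indexed by column name, fillna() matches each mean to its column, just as it does with the result of df.mean() in the same execution.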