clog14 clog14 - 22 days ago 4x
Python Question

Python filling string column "forward" and groupby attaching groupby result to dataframe

I have a dataframe looking generated by:

df = pd.DataFrame([[100, ' tes t ', 3], [100, np.nan, 2], [101, ' test1', 3 ], [101,' ', 4]])

It looks like

0 1 2
0 100 tes t 3
1 100 NaN 2
2 101 test1 3
3 101 4

I would like to a fill column 1 "forward" with test and test1. I believe one approach would be to work with replacing whitespace by np.nan, but it is difficult since the words contain whitespace as well. I could also groupby column 0 and then use the first element of each group to fill forward. Could you provide me with some code for both alternatives I do not get it coded?

Additionally, I would like to add a column that contains the group means that is
the final dataframe should look like this

0 1 2 3
0 100 tes t 3 2.5
1 100 tes t 2 2.5
2 101 test1 3 3.5
3 101 test1 4 3.5

Could you also please advice how to accomplish something like this?

Many thanks please let me know in case you need further information.


IIUC, you could use str.strip and then check if the stripped string is empty. Then, perform groupby operations and filling the Nans by the method ffill and calculating the means using groupby.transform function as shown:

df[1] = df[1].str.strip().dropna().apply(lambda x: np.NaN if len(x) == 0 else x)

df[1] = df.groupby(0)[1].fillna(method='ffill')
df[3] = df.groupby(0)[2].transform(lambda x: x.mean())


Note: If you must forward fill NaN values with first element of that group, then you must do this:

df.groupby(0)[1].apply(lambda x: x.fillna(x.iloc[0]))

Breakup of steps:

Since we want to apply the function only on strings, we drop all the NaN values present before, else we would be getting the TypeError due to both floats and string elements present in the column and complains of float having no method as len.


0    tes  t    # operates only on indices where strings are present(empty strings included)
2     test1
Name: 1, dtype: object

The reindexing part isn't a necessary step as it only computes on the indices where strings are present.

Also, the reset_index(drop=True) part was indeed unwanted as the groupby object returns a series after fillna which could be assigned back to column 1.