Philip - 1 year ago 107

Python Question

I have a df arranged like follows:

`x y z`

0 a jj Nan

1 b ii mm

2 c kk nn

3 d ii NaN

4 e Nan oo

5 f jj mm

6 g Nan nn

The desired output is:

`x y z w`

0 a jj Nan a

1 b ii mm a

2 c kk nn c

3 d ii NaN a

4 e Nan oo e

5 f jj mm a

6 g Nan nn c

The logic is

- to take union of column y & z : since in index 1 and 5, they both have
`ii == jj`

in column z`mm`

- group this union :
*index 0,1,3,5 are a group, index 2,6 are another group* - within the group, randomly take one cell in column x and assign it to column w for the whole group

I have no clue at all about this problem.

Can somebody help me?

I was first post a perfectly sorted column y and column z like follows:

`x y z w`

0 a ii NaN a

1 b ii mm a

2 c jj mm a

3 d jj Nan a

4 e kk nn e

5 f Nan nn e

6 g Nan oo g

For this case, piRSquared's solution works perfect.

Nickil Maveli's solution works perfect for my problem. However, I noted that there's a situation that the solution can not handle, that is :

`x y z`

0 a ii mm

1 b ii nn

2 c jj nn

3 d jj oo

4 e kk oo

By Nickil Maveli's solution, the result would be like follows:

`0 1 2 w`

0 a ii mm a

1 b ii mm a

2 c jj nn c

3 d jj nn c

4 e kk oo e

However, the desired output should be w = ['a', 'a', 'a', 'a', 'a'].

Answer Source

In the general case this is a set consolidation/connected components problem. While if we assume certain things about your data we can solve a reduced case, it's just a bit of bookkeeping to do the whole thing.

scipy has a connected components function we can use if we do some preparation:

```
import scipy.sparse
def via_cc(df_in):
df = df_in.copy()
# work with ranked version
dfr = df[["y","z"]].rank(method='dense')
# give nans their own temporary rank
dfr = dfr.fillna(dfr.max().fillna(0) + dfr.isnull().cumsum(axis=0))
# don't let x and y get mixed up; have separate nodes per column
dfr["z"] += dfr["y"].max()
# build the adjacency matrix
size = int(dfr.max().max()) + 1
m = scipy.sparse.coo_matrix(([1]*len(dfr), (dfr.y, dfr.z)),
(size, size))
# do the work to find the groups
_, cc = scipy.sparse.csgraph.connected_components(m)
# get the group codes
group = pd.Series(cc[dfr["y"].astype(int).values], index=dfr.index)
# fill in w from x appropriately
df["w"] = df["x"].groupby(group).transform(min)
return df
```

which gives me

```
In [230]: via_cc(df0)
Out[230]:
x y z w
0 a jj NaN a
1 b ii mm a
2 c kk nn c
3 d ii NaN a
4 e NaN oo e
5 f jj mm a
6 g NaN nn c
In [231]: via_cc(df1)
Out[231]:
x y z w
0 a ii mm a
1 b ii nn a
2 c jj nn a
3 d jj oo a
4 e kk oo a
```

If you have a set consolidation recipe around, like the one here, you can simplify some of the above at the cost of an external function.

(Aside: note that in my df0, the "Nan"s are really NaNs. If you have a *string* "Nan" (note how it's different from NaN), then the code will think it's just another string and will assume that you want all "Nan"s to be in the same group.)