Philip - 8 months ago 52

Python Question

I have a df arranged like follows:

`x y z`

0 a jj Nan

1 b ii mm

2 c kk nn

3 d ii NaN

4 e Nan oo

5 f jj mm

6 g Nan nn

The desired output is:

`x y z w`

0 a jj Nan a

1 b ii mm a

2 c kk nn c

3 d ii NaN a

4 e Nan oo e

5 f jj mm a

6 g Nan nn c

The logic is

- to take union of column y & z : since in index 1 and 5, they both have
`ii == jj`

in column z`mm`

- group this union :
*index 0,1,3,5 are a group, index 2,6 are another group* - within the group, randomly take one cell in column x and assign it to column w for the whole group

I have no clue at all about this problem.

Can somebody help me?

I was first post a perfectly sorted column y and column z like follows:

`x y z w`

0 a ii NaN a

1 b ii mm a

2 c jj mm a

3 d jj Nan a

4 e kk nn e

5 f Nan nn e

6 g Nan oo g

For this case, piRSquared's solution works perfect.

Nickil Maveli's solution works perfect for my problem. However, I noted that there's a situation that the solution can not handle, that is :

`x y z`

0 a ii mm

1 b ii nn

2 c jj nn

3 d jj oo

4 e kk oo

By Nickil Maveli's solution, the result would be like follows:

`0 1 2 w`

0 a ii mm a

1 b ii mm a

2 c jj nn c

3 d jj nn c

4 e kk oo e

However, the desired output should be w = ['a', 'a', 'a', 'a', 'a'].

Answer

In the general case this is a set consolidation/connected components problem. While if we assume certain things about your data we can solve a reduced case, it's just a bit of bookkeeping to do the whole thing.

scipy has a connected components function we can use if we do some preparation:

```
import scipy.sparse
def via_cc(df_in):
df = df_in.copy()
# work with ranked version
dfr = df[["y","z"]].rank(method='dense')
# give nans their own temporary rank
dfr = dfr.fillna(dfr.max().fillna(0) + dfr.isnull().cumsum(axis=0))
# don't let x and y get mixed up; have separate nodes per column
dfr["z"] += dfr["y"].max()
# build the adjacency matrix
size = int(dfr.max().max()) + 1
m = scipy.sparse.coo_matrix(([1]*len(dfr), (dfr.y, dfr.z)),
(size, size))
# do the work to find the groups
_, cc = scipy.sparse.csgraph.connected_components(m)
# get the group codes
group = pd.Series(cc[dfr["y"].astype(int).values], index=dfr.index)
# fill in w from x appropriately
df["w"] = df["x"].groupby(group).transform(min)
return df
```

which gives me

```
In [230]: via_cc(df0)
Out[230]:
x y z w
0 a jj NaN a
1 b ii mm a
2 c kk nn c
3 d ii NaN a
4 e NaN oo e
5 f jj mm a
6 g NaN nn c
In [231]: via_cc(df1)
Out[231]:
x y z w
0 a ii mm a
1 b ii nn a
2 c jj nn a
3 d jj oo a
4 e kk oo a
```

If you have a set consolidation recipe around, like the one here, you can simplify some of the above at the cost of an external function.

(Aside: note that in my df0, the "Nan"s are really NaNs. If you have a *string* "Nan" (note how it's different from NaN), then the code will think it's just another string and will assume that you want all "Nan"s to be in the same group.)

Source (Stackoverflow)