Philip - 2 months ago 7
Python Question

# Take union of two columns, Python + Pandas

I have a df arranged like follows:

``````   x    y    z
0  a   jj  Nan
1  b   ii   mm
2  c   kk   nn
3  d   ii  NaN
4  e  Nan   oo
5  f   jj   mm
6  g  Nan   nn
``````

The desired output is:

``````   x    y    z   w
0  a   jj  Nan   a
1  b   ii   mm   a
2  c   kk   nn   c
3  d   ii  NaN   a
4  e  Nan   oo   e
5  f   jj   mm   a
6  g  Nan   nn   c
``````

The logic is

1. to take union of column y & z :
`ii == jj`
since in index 1 and 5, they both have
`mm`
in column z

2. group this union : index 0,1,3,5 are a group, index 2,6 are another group

3. within the group, randomly take one cell in column x and assign it to column w for the whole group

Can somebody help me?

EDITNOTE:

I was first post a perfectly sorted column y and column z like follows:

``````   x    y    z   w
0  a   ii  NaN   a
1  b   ii   mm   a
2  c   jj   mm   a
3  d   jj  Nan   a
4  e   kk   nn   e
5  f  Nan   nn   e
6  g  Nan   oo   g
``````

For this case, piRSquared's solution works perfect.

EDITNOTE2:

Nickil Maveli's solution works perfect for my problem. However, I noted that there's a situation that the solution can not handle, that is :

``````   x   y   z
0  a  ii  mm
1  b  ii  nn
2  c  jj  nn
3  d  jj  oo
4  e  kk  oo
``````

By Nickil Maveli's solution, the result would be like follows:

``````   0   1   2  w
0  a  ii  mm  a
1  b  ii  mm  a
2  c  jj  nn  c
3  d  jj  nn  c
4  e  kk  oo  e
``````

However, the desired output should be w = ['a', 'a', 'a', 'a', 'a'].

In the general case this is a set consolidation/connected components problem. While if we assume certain things about your data we can solve a reduced case, it's just a bit of bookkeeping to do the whole thing.

scipy has a connected components function we can use if we do some preparation:

``````import scipy.sparse

def via_cc(df_in):
df = df_in.copy()

# work with ranked version
dfr = df[["y","z"]].rank(method='dense')
# give nans their own temporary rank
dfr = dfr.fillna(dfr.max().fillna(0) + dfr.isnull().cumsum(axis=0))
# don't let x and y get mixed up; have separate nodes per column
dfr["z"] += dfr["y"].max()

size = int(dfr.max().max()) + 1
m = scipy.sparse.coo_matrix(([1]*len(dfr), (dfr.y, dfr.z)),
(size, size))

# do the work to find the groups
_, cc = scipy.sparse.csgraph.connected_components(m)

# get the group codes
group = pd.Series(cc[dfr["y"].astype(int).values], index=dfr.index)
# fill in w from x appropriately
df["w"] = df["x"].groupby(group).transform(min)

return df
``````

which gives me

``````In [230]: via_cc(df0)
Out[230]:
x    y    z  w
0  a   jj  NaN  a
1  b   ii   mm  a
2  c   kk   nn  c
3  d   ii  NaN  a
4  e  NaN   oo  e
5  f   jj   mm  a
6  g  NaN   nn  c

In [231]: via_cc(df1)
Out[231]:
x   y   z  w
0  a  ii  mm  a
1  b  ii  nn  a
2  c  jj  nn  a
3  d  jj  oo  a
4  e  kk  oo  a
``````

If you have a set consolidation recipe around, like the one here, you can simplify some of the above at the cost of an external function.

(Aside: note that in my df0, the "Nan"s are really NaNs. If you have a string "Nan" (note how it's different from NaN), then the code will think it's just another string and will assume that you want all "Nan"s to be in the same group.)

Source (Stackoverflow)