Philip Philip - 23 days ago 4x
Python Question

Take union of two columns, Python + Pandas

I have a df arranged like follows:

x y z
0 a jj Nan
1 b ii mm
2 c kk nn
3 d ii NaN
4 e Nan oo
5 f jj mm
6 g Nan nn

The desired output is:

x y z w
0 a jj Nan a
1 b ii mm a
2 c kk nn c
3 d ii NaN a
4 e Nan oo e
5 f jj mm a
6 g Nan nn c

The logic is

  1. to take union of column y & z :
    ii == jj
    since in index 1 and 5, they both have
    in column z

  2. group this union : index 0,1,3,5 are a group, index 2,6 are another group

  3. within the group, randomly take one cell in column x and assign it to column w for the whole group

I have no clue at all about this problem.
Can somebody help me?


I was first post a perfectly sorted column y and column z like follows:

x y z w
0 a ii NaN a
1 b ii mm a
2 c jj mm a
3 d jj Nan a
4 e kk nn e
5 f Nan nn e
6 g Nan oo g

For this case, piRSquared's solution works perfect.


Nickil Maveli's solution works perfect for my problem. However, I noted that there's a situation that the solution can not handle, that is :

x y z
0 a ii mm
1 b ii nn
2 c jj nn
3 d jj oo
4 e kk oo

By Nickil Maveli's solution, the result would be like follows:

0 1 2 w
0 a ii mm a
1 b ii mm a
2 c jj nn c
3 d jj nn c
4 e kk oo e

However, the desired output should be w = ['a', 'a', 'a', 'a', 'a'].


In the general case this is a set consolidation/connected components problem. While if we assume certain things about your data we can solve a reduced case, it's just a bit of bookkeeping to do the whole thing.

scipy has a connected components function we can use if we do some preparation:

import scipy.sparse

def via_cc(df_in):
    df = df_in.copy()

    # work with ranked version
    dfr = df[["y","z"]].rank(method='dense')
    # give nans their own temporary rank
    dfr = dfr.fillna(dfr.max().fillna(0) + dfr.isnull().cumsum(axis=0))
    # don't let x and y get mixed up; have separate nodes per column
    dfr["z"] += dfr["y"].max() 

    # build the adjacency matrix
    size = int(dfr.max().max()) + 1
    m = scipy.sparse.coo_matrix(([1]*len(dfr), (dfr.y, dfr.z)),
                                (size, size))

    # do the work to find the groups
    _, cc = scipy.sparse.csgraph.connected_components(m)

    # get the group codes
    group = pd.Series(cc[dfr["y"].astype(int).values], index=dfr.index)
    # fill in w from x appropriately
    df["w"] = df["x"].groupby(group).transform(min)

    return df

which gives me

In [230]: via_cc(df0)
   x    y    z  w
0  a   jj  NaN  a
1  b   ii   mm  a
2  c   kk   nn  c
3  d   ii  NaN  a
4  e  NaN   oo  e
5  f   jj   mm  a
6  g  NaN   nn  c

In [231]: via_cc(df1)
   x   y   z  w
0  a  ii  mm  a
1  b  ii  nn  a
2  c  jj  nn  a
3  d  jj  oo  a
4  e  kk  oo  a

If you have a set consolidation recipe around, like the one here, you can simplify some of the above at the cost of an external function.

(Aside: note that in my df0, the "Nan"s are really NaNs. If you have a string "Nan" (note how it's different from NaN), then the code will think it's just another string and will assume that you want all "Nan"s to be in the same group.)