K_B1 K_B1 -4 years ago 103
Python Question

Pandas: combine columns without duplicates/ find unique words after combining

I have a dataframe where I would like to concatenate certain columns.

My issue is that the text in these columns may or may not contain duplicate information. I would like to strip out the duplicates in order to retain only the relevant information.

For example, if I had a data frame such as:

pd.read_csv("animal.csv")

animal1 animal2 label
1 cat dog dolphin 19
2 dog cat cat 72
3 pilchard 26 koala 26
4 newt bat 81 bat 81


I want to combine the columns but retain only unique information from each of the strings.

You can see that in row 2, 'cat' is contained in both columns 'Animal1' and 'Animal2'. In row 3, the number 26 is in both column 'Animal1' and 'Label'. Whereas in row 4, information that is in columns 'Animal2' and 'Label' are already contained in order in 'Animal1'.

I combine the columns by doing the following

animals["detail"] = animals["animal1"].map(str) + animals["animal2"].map(str) + animals["label"].map(str)

animal1 animal2 label detail
1 cat dog dolphin 19 cat dog dolphin 19
2 dog cat cat 72 dog cat cat 72
3 pilchard 26 koala 26 pilchard 26 koala 26
4 newt bat 81 bat 81 newt bat 81 bat 81


Row 1 is fine, but the other rows, of course, contain duplicates as described above.

The output I would desire is:

animal1 animal2 label detail
1 cat dog dolphin 19 cat dog dolphin 19
2 dog cat cat 72 dog cat 72
3 pilchard 26 koala 26 pilchard koala 26
4 newt bat 81 bat 81 newt bat 81


or if I could retain only the first unique instance of each word/ number per row in the detail column, this would also be suitable i.e.:

detail
1 cat dog dolphin 19
2 dog cat 72
3 pilchard koala 26
4 newt bat 81


I've had a look at doing this for a string in python e.g. How can I remove duplicate words in a string with Python?, How to get all the unique words in the data frame?, show distinct column values in pyspark dataframe: python
but can't figure out how to apply this to individual rows within the detail column. I've looked at splitting the text after I've combined the columns, then using apply and lambda, but haven't got this to work yet. Or is there perhaps a way to do it when combining the columns?

I have the solution in R but want to recode in python.

Would greatly appreciate any help or advice. I'm currently using Spyder(Python 3.5)

Answer Source

You can add custom function where first split by whitespace, then get unique values by pandas.unique and last join to string back:

animals["detail"] = animals["animal1"].map(str) + ' ' + 
                    animals["animal2"].map(str) + ' ' +
                    animals["label"].map(str)

animals["detail"] = animals["detail"].apply(lambda x: ' '.join(pd.unique(x.split())))
print (animals)
       animal1  animal2  label              detail
1      cat dog  dolphin     19  cat dog dolphin 19
2      dog cat      cat     72          dog cat 72
3  pilchard 26    koala     26   pilchard 26 koala
4  newt bat 81      bat     81         newt bat 81

Also is possible join values in apply:

animals["detail"] = animals.astype(str)
                           .apply(lambda x: ' '.join(pd.unique(' '.join(x).split())),axis=1)
print (animals)
       animal1  animal2  label              detail
1      cat dog  dolphin     19  cat dog dolphin 19
2      dog cat      cat     72          dog cat 72
3  pilchard 26    koala     26   pilchard 26 koala
4  newt bat 81      bat     81         newt bat 81

Solution with set, but it change order:

animals["detail"] = animals.astype(str)
                           .apply(lambda x: ' '.join(set(' '.join(x).split())), axis=1)
print (animals)
       animal1  animal2  label              detail
1      cat dog  dolphin     19  cat dolphin 19 dog
2      dog cat      cat     72          cat dog 72
3  pilchard 26    koala     26   26 pilchard koala
4  newt bat 81      bat     81         bat 81 newt
Recommended from our users: Dynamic Network Monitoring from WhatsUp Gold from IPSwitch. Free Download