Mike Mike - 1 year ago 78
Python Question

Pandas: drop rows based on duplicated values in a list

I would like to drop rows within my dataframe based on if a piece of a string is duplicated within that string. For example, if the string is jkl-ghi-jkl, I would drop this row because jkl is repeated twice. I figured that creating a list and checking the list for duplicates would be the ideal approach.

My dataframe for this example consist of 1 column and two data points:

df1 = pd.DataFrame({'Col1' : ['abc-def-ghi-jkl', 'jkl-ghi-jkl-mno'],})

My first step I take is to apply a split to my data, and split of "-"

List = df1['Col1].str.split('-')

Which is yields the output:

0 [abc, def, ghi, jkl]
1 [jkl, ghi, jkl, mno]
Name: Col1, dtype: object

My second step I take is to convert my output into lists:

List = List.tolist()

Which yields:

[['abc', 'def', 'ghi', 'jkl'], ['jkl', 'ghi', 'jkl', 'mno']]

My last step I wish to accomplish is to compare a full list with a distinct list of unique values:

len(List) > len(set(List))

Which yields the error:

TypeError: unhashable type: 'list'

I am aware that my .tolist() creates a list of 2 series. Is there a way to convert these series into a list in order to test for duplicates? I wish to use this piece of code:

len(List) > len(set(List)

with a drop in order to drop all rows with a duplicated value within each cell.

Is this the correct way of approaching, or is there a simpler way?

My end output should look like:


Because string jkl-ghi-jkl-mno gets dropped due to "jkl" repeating twice

Answer Source

Here is another option, using set and len:

df1 = pd.DataFrame({'Col1' : ['abc-def-ghi-jkl', 'jkl-ghi-jkl-mno'],})

df1['length'] = df1['Col1'].str.split('-').apply(set).apply(len)

print( df1 )

              Col1  length
0  abc-def-ghi-jkl       4
1  jkl-ghi-jkl-mno       3

df1 = df1.loc[ df1['length'] < 4 ]


              Col1  length
1  jkl-ghi-jkl-mno       3