LoneRanger LoneRanger - 10 days ago 7
Python Question

pandas drop duplicates from a dataframe(grouped) with some specific condition

Hi all I have a dataframe whose content will be like

name,mv_str
abc,Exorsist part1
abc,doc str 2D
abc,doc str 3D
abc,doc str QA
abc,doc flash
def,plastic
def,plastic income
def,doc str 2D ###i added this row for better clarity


My expected o/p should have .... per group get unique record-rows in the sense--for each mailid mv_str should not be similar type i:e 1st 2 words from one 'mv_str' should not be there in 2nd/any row for that particular User name.

Note: comparision should happen per user name level.

name,mv_str
abc,Exorist part1
abc,doc str 2D ###3D and QA removes as 1st 2 words "doc str" matched
abc, doc flash ###only 1st word is matching, 2nd word does not
def,plastic
def,plastic income #It should be present as only one word is matching
def,doc str 2D ###this row should be there as this is for another User


please anybody help me forming the logic, or code sample will be great Help. Thanks.

Answer

I think you need first split strings in column mv_str by whitespace and create new DataFrame df1:

df1 = df.mv_str.str.split(expand=True)
print (df1)
          0       1     2
0  Exorsist   part1  None
1       doc     str    2D
2       doc     str    3D
3       doc     str    QA
4       doc   flash  None
5   plastic    None  None
6   plastic  income  None
7       doc     str    2D

Add original DataFrame df by concat:

df = pd.concat([df, df1], axis=1)
print (df)
  name          mv_str         0       1     2
0  abc  Exorsist part1  Exorsist   part1  None
1  abc      doc str 2D       doc     str    2D
2  abc      doc str 3D       doc     str    3D
3  abc      doc str QA       doc     str    QA
4  abc       doc flash       doc   flash  None
5  def         plastic   plastic    None  None
6  def  plastic income   plastic  income  None
7  def      doc str 2D       doc     str    2D

Then drop_duplicates by columns name, 0 and 1, first value remains:

print (df.drop_duplicates(['name',0,1]))
  name          mv_str         0       1     2
0  abc  Exorsist part1  Exorsist   part1  None
1  abc      doc str 2D       doc     str    2D
4  abc       doc flash       doc   flash  None
5  def         plastic   plastic    None  None
6  def  plastic income   plastic  income  None
7  def      doc str 2D       doc     str    2D

Remove columns 0, 1, 2 by drop:

print (df.drop_duplicates(['name',0,1]).drop([0,1,2], axis=1))
  name          mv_str
0  abc  Exorsist part1
1  abc      doc str 2D
4  abc       doc flash
5  def         plastic
6  def  plastic income
7  def      doc str 2D

Or better remove columns by selecting only name and mv_str columns:

print (df.drop_duplicates(['name',0,1])[['name','mv_str']])
  name          mv_str
0  abc  Exorsist part1
1  abc      doc str 2D
4  abc       doc flash
5  def         plastic
6  def  plastic income
7  def      doc str 2D
Comments