vks vks - 28 days ago 17

Python Pandas Memory Error during Drop

I have a DataFrame of 825468 rows and run the following on it:

to_drop = (frame.loc[frame['RR'].str.contains(r"^([23])[^-]*-\1[^-]*$"), 'RR']
           .str.replace("[23]([^-]*)-[23]([^-]*)", r"\1-\2", regex=True)
           .isin(series1.str.replace("1([^-]*)-1([^-]*)", r"\1-\2", regex=True))
           [lambda d: d].index)
frame = frame.drop(to_drop)


where

series1 = frame.loc[frame['RR'].str.contains("^1[^-]*-1"), 'RR']


What it does: it builds a series of the RR values that look like 1abc-1bcd. Then, if the frame contains an RR like 2abc-2bcd, which becomes abc-bcd after the replacement, and abc-bcd is also present in the series after its own replacement, that row is dropped.
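The matching rule described above can be illustrated on plain strings. This is only a sketch: `normalize`, `STRIP_1` and `STRIP_23` are hypothetical names, and the patterns are anchored with ^ and $ here, which is an assumption (the replacement patterns in the question are unanchored).

```python
import re

# Hypothetical helpers for illustration; the anchoring is an assumption.
STRIP_23 = re.compile(r"^[23]([^-]*)-[23]([^-]*)$")
STRIP_1 = re.compile(r"^1([^-]*)-1([^-]*)$")

def normalize(value, pattern):
    """Return value without its leading digits, or None if it does not match."""
    match = pattern.match(value)
    return "-".join(match.groups()) if match else None

# Keys produced by the rows that start with 1.
keys_from_ones = {normalize("1abc-1bcd", STRIP_1)}

# A row starting with 2 or 3 is dropped when its key is among those keys.
candidate = normalize("2abc-2bcd", STRIP_23)
should_drop = candidate in keys_from_ones
```

Here both values reduce to the same key, so the 2abc-2bcd row would be dropped.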

But it raises a MemoryError. Is there a more efficient way to do the same thing?

For example, given this df:

RR
0 2abc-2abc
1 1abc-1abc
2 3abc-3abc
3 2def-2def
4 3def-3def
5 def-dfd
6 sdsd-sdsd
7 1def-1def


From this frame, 2abc-2abc and 3abc-3abc should be dropped: after removing the leading 2 or 3 they become abc-abc, and removing the 1 from 1abc-1abc also gives abc-abc.
2def-2def would not be dropped if there were no 1def-1def.


Output:

RR
0 1abc-1abc
1 def-dfd
2 sdsd-sdsd
3 1def-1def

Answer

UPDATE2:

In [176]: df
Out[176]:
          RR
0  2abc-2abc
1  3abc-3abc
2  2def-2def
3  3def-3def
4    def-dfd
5  sdsd-sdsd
6  1def-1def
7    abc-abc
8    def-def

In [177]: df[['d1','s','s2']] = df.RR.str.extract(r'^(?P<d1>\d+)(?P<s1>[^-]*)-\1(?P<s2>[^-]*)', expand=True)

In [178]: df
Out[178]:
          RR   d1    s   s2
0  2abc-2abc    2  abc  abc
1  3abc-3abc    3  abc  abc
2  2def-2def    2  def  def
3  3def-3def    3  def  def
4    def-dfd  NaN  NaN  NaN
5  sdsd-sdsd  NaN  NaN  NaN
6  1def-1def    1  def  def
7    abc-abc  NaN  NaN  NaN
8    def-def  NaN  NaN  NaN

In [179]: df.s += df.pop('s2')

In [180]: df
Out[180]:
          RR   d1       s
0  2abc-2abc    2  abcabc
1  3abc-3abc    3  abcabc
2  2def-2def    2  defdef
3  3def-3def    3  defdef
4    def-dfd  NaN     NaN
5  sdsd-sdsd  NaN     NaN
6  1def-1def    1  defdef
7    abc-abc  NaN     NaN
8    def-def  NaN     NaN

In [181]: result = df.loc[~df.s.isin(df.loc[df.d1 == '1', 's']) | (~df.d1.isin(['2','3'])), 'RR']

In [182]: result
Out[182]:
0    2abc-2abc
1    3abc-3abc
4      def-dfd
5    sdsd-sdsd
6    1def-1def
7      abc-abc
8      def-def
Name: RR, dtype: object
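For reuse, the steps above could be wrapped in a function along these lines. This is a sketch rather than the exact code from the session: `drop_matched_23` is a hypothetical name, the pattern is anchored with $, and the key keeps the hyphen between the two halves (an assumption, so that values such as ab-cabc and abc-abc would not produce the same key).

```python
import pandas as pd

def drop_matched_23(frame, col="RR"):
    """Drop rows starting with 2/3 whose digit-stripped value also exists with a leading 1."""
    # Split e.g. "2abc-2abc" into the digit and the two halves; other rows become NaN.
    parts = frame[col].str.extract(
        r"^(?P<d>\d+)(?P<a>[^-]*)-(?P=d)(?P<b>[^-]*)$"
    )
    key = parts["a"] + "-" + parts["b"]
    # Keys contributed by the rows whose digit is 1.
    ones = key[parts["d"] == "1"].dropna().unique()
    drop = parts["d"].isin(["2", "3"]) & key.isin(ones)
    return frame[~drop]

df = pd.DataFrame({"RR": [
    "2abc-2abc", "3abc-3abc", "2def-2def", "3def-3def", "def-dfd",
    "sdsd-sdsd", "1def-1def", "abc-abc", "def-def",
]})
kept = drop_matched_23(df)
```

On the frame shown above this should keep the same rows as `result`, and it leaves the original frame without extra columns.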

UPDATE:

In [171]: df
Out[171]:
          RR
0  2abc-2abc
1  1abc-1abc
2  3abc-3abc
3  2def-2def
4  3def-3def
5    def-dfd
6  sdsd-sdsd
7  1def-1def
8    abc-abc

NOTE: I have intentionally added row 8 (abc-abc), which should NOT be dropped (if I understood your question correctly).

Solution 1: using .str.replace() and drop_duplicates() methods:

In [178]: (df.sort_values('RR')
     ...:    .RR
     ...:    .str.replace("[23]([^-]*)-[23]([^-]*)", r"1\1-1\2", regex=True)
     ...:    .drop_duplicates()
     ...: )
     ...:
Out[178]:
1    1abc-1abc
7    1def-1def
8      abc-abc
5      def-dfd
6    sdsd-sdsd
Name: RR, dtype: object

Solution 2: using .str.replace() and .str.contains() methods and boolean indexing:

In [172]: df.loc[~df.sort_values('RR')
     ...:           .RR
     ...:           .str.replace("[23]([^-]*)-[23]([^-]*)", r"_\1-_\2", regex=True)
     ...:           .str.contains(r"^_[^-]*-_")]
     ...:
Out[172]:
          RR
1  1abc-1abc
5    def-dfd
6  sdsd-sdsd
7  1def-1def
8    abc-abc

NOTE: you may want to replace '_' with another symbol (or symbols) that never occurs in the RR column.
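If choosing a safe placeholder is a concern, the same rows can probably be selected without any replacement by testing the original values directly. This is a sketch under the assumption that no RR value already contains '_'; `starts_with_23` is a hypothetical name. Like Solution 2, it filters out every row whose two halves start with 2 or 3, whether or not a matching row starting with 1 exists.

```python
import pandas as pd

df = pd.DataFrame({"RR": [
    "2abc-2abc", "1abc-1abc", "3abc-3abc", "2def-2def", "3def-3def",
    "def-dfd", "sdsd-sdsd", "1def-1def", "abc-abc",
]})

# True for rows whose two halves both start with 2 or 3.
starts_with_23 = df["RR"].str.contains(r"^[23][^-]*-[23]", regex=True)

# Keep everything else; no placeholder character is needed.
kept = df[~starts_with_23]
```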
