Justin Justin - 1 month ago 10x
Python Question

Slicing pandas dataframe by looking for character "in" string

I want to extract a set of rows from a dataframe based on whether a string in that row contains a given substring.

For example, say I have

testdf = pd.DataFrame({'A':['abc','efc','abz'], 'B':[4,5,6]})

I want to get the rows containing the substring
in column

I tried
testdf.loc[lambda df: 'ab' in df['A'], :]
, but got the following error:

KeyError Traceback (most recent call last)
<ipython-input-12-86c436129f94> in <module>()
----> 1 testdf.loc[lambda df: 'a' in df['A'], :]

/Users/justinpounders/anaconda/lib/python2.7/site-packages/pandas/core/indexing.pyc in __getitem__(self, key)
1293 if type(key) is tuple:
-> 1294 return self._getitem_tuple(key)
1295 else:
1296 return self._getitem_axis(key, axis=0)

/Users/justinpounders/anaconda/lib/python2.7/site-packages/pandas/core/indexing.pyc in _getitem_tuple(self, tup)
782 def _getitem_tuple(self, tup):
783 try:
--> 784 return self._getitem_lowerdim(tup)
785 except IndexingError:
786 pass

/Users/justinpounders/anaconda/lib/python2.7/site-packages/pandas/core/indexing.pyc in _getitem_lowerdim(self, tup)
906 for i, key in enumerate(tup):
907 if is_label_like(key) or isinstance(key, tuple):
--> 908 section = self._getitem_axis(key, axis=i)
910 # we have yielded a scalar ?

/Users/justinpounders/anaconda/lib/python2.7/site-packages/pandas/core/indexing.pyc in _getitem_axis(self, key, axis)
1465 # fall thru to straight lookup
1466 self._has_valid_type(key, axis)
-> 1467 return self._get_label(key, axis=axis)

/Users/justinpounders/anaconda/lib/python2.7/site-packages/pandas/core/indexing.pyc in _get_label(self, label, axis)
91 raise IndexingError('no slices here, handle elsewhere')
---> 93 return self.obj._xs(label, axis=axis)
95 def _get_loc(self, key, axis=0):

/Users/justinpounders/anaconda/lib/python2.7/site-packages/pandas/core/generic.pyc in xs(self, key, axis, level, copy, drop_level)
1747 drop_level=drop_level)
1748 else:
-> 1749 loc = self.index.get_loc(key)
1751 if isinstance(loc, np.ndarray):

/Users/justinpounders/anaconda/lib/python2.7/site-packages/pandas/indexes/base.pyc in get_loc(self, key, method, tolerance)
1945 return self._engine.get_loc(key)
1946 except KeyError:
-> 1947 return self._engine.get_loc(self._maybe_cast_indexer(key))
1949 indexer = self.get_indexer([key], method=method, tolerance=tolerance)

pandas/index.pyx in pandas.index.IndexEngine.get_loc (pandas/index.c:4154)()

pandas/index.pyx in pandas.index.IndexEngine.get_loc (pandas/index.c:3977)()

pandas/index.pyx in pandas.index.Int64Engine._check_type (pandas/index.c:7634)()

KeyError: False

What confuses me is that
testdf.loc[lambda df: df['A'] == 'abc', :]
does not five an error: it returns the one row containing the value
. So it appears that there something about the
'ab' in df['A']
boolean that is not correct...

I am using python 2.7 and pandas 0.18.1 in a Jupyter (4.0.6) notebook.


use str.contains:

In [67]:

     A  B
0  abc  4
2  abz  6

What you tried doesn't make sense firstly:

In [70]:
'ab' in testdf['A']


but what you're really trying to do is test 'ab' in each element of that column:

In [71]:
testdf['A'].apply(lambda x: 'ab' in x)

0     True
1    False
2     True
Name: A, dtype: bool

However, there is no need for apply here when there is a vectorised method

What you tried here:

testdf.loc[lambda df: 'ab' in testdf['A']]

raised a keyerror because the lambda returned a scalar False which can't be used to index the whole df, but testdf.loc[lambda df: df['A'] == 'abc', :] works because df['A'] == 'abc' returns a boolean mask which can be used to mask the entire df

Also the lambda is unnecessary in the loc:

testdf.loc[testdf['A'] == 'abc', :]

would've worked, if you think about it, all you did was provide a lambda for your df which is no different to the above