AJG519 AJG519 - 7 days ago 5
Python Question

Pandas - identifying dataframe values that start with value in a list

Say I have the following dataframe:

>>> import pandas as pd
>>> d=pd.DataFrame()
>>> d['A']=['12345','12354','76','4']
>>> d['B']=['4442','2345','33','5']
>>> d['C']=['5553','4343','33','5']
>>> d
A B C
0 12345 4442 5553
1 12354 2345 4343
2 76 33 33
3 4 5 5


And say I have 3 values of interest:

>>> vals=['123','76']


I am interested in determining which values in my dataframe start with any of the values in my list. There are 3 cases in my example: (0,A) starts with 123; (1,A) starts with 123; and (2,A) starts with 76.

Is there a way I can do this without looping through each of my values?

If I were interested in matching values exactly I could just do:

>>> d.isin(vals)
A B C
0 False False False
1 False False False
2 True False False
3 False False False
>>>


And if I was interested in whether the values start with 1 particular value I could do:

>>> d.applymap(lambda x:x.startswith('123'))
A B C
0 True False False
1 True False False
2 False False False
3 False False False
>>>


But how can I combine these two to find any value that starts with any value in my list?

Answer

You can construct a regex pattern and test each column in turn using apply with a lambda calling str.contains:

In [9]:
vals=['123','76']
v = ['^' + x for x in vals]
d.apply(lambda x: x.str.contains('|'.join(v)))

Out[9]:
       A      B      C
0   True  False  False
1   True  False  False
2   True  False  False
3  False  False  False

The resulting regex pattern:

In [10]:
'|'.join(v)

Out[10]:
'^123|^76'

update

Actually you can do this using stack and unstack so that you initially stack all columns into a single column, call str.contains with the regex pattern and then unstacking back to the original form:

In [9]:
vals=['123','76']
v = ['^' + x for x in vals]
d.stack().str.contains('|'.join(v)).unstack()

Out[9]:
       A      B      C
0   True  False  False
1   True  False  False
2   True  False  False
3  False  False  False

This is a cleaner way of doing it compared to using apply