Tyler Rinker - 4 months ago
Python Question

Pandas extract multicharacter regex

I would like to extract every occurrence of a regular expression from each element of a pandas DataFrame column, but I get an error whenever the expression is more than one character long. Why am I getting this error, and how do I make the extraction work as expected?

MWE



import pandas as pd

wiki = ["In theoretical computer the like operations.",
        "The a filter.",
        "In the.",
        "the dog is the one",
        "See below for details."]
wiki

x = pd.DataFrame(wiki, columns=['wiki'])
x


Error for multicharacter expression



x.wiki.str.extractall('(the)')

## x.wiki.str.extractall('(the)')
## Traceback (most recent call last):
##
## File "<ipython-input-7-ca5d102219f3>", line 1, in <module>
## x.wiki.str.extractall('(the)')
##
## File "C:\WinPython-64bit-3.5.2.1Qt5\python-3.5.2.amd64\lib\site-packages\pandas\core\strings.py", line 1621, in extractall
## return str_extractall(self._orig, pat, flags=flags)
##
## File "C:\WinPython-64bit-3.5.2.1Qt5\python-3.5.2.amd64\lib\site-packages\pandas\core\strings.py", line 716, in str_extractall
## result = DataFrame(match_list, index, columns)
##
## File "C:\WinPython-64bit-3.5.2.1Qt5\python-3.5.2.amd64\lib\site-packages\pandas\core\frame.py", line 263, in __init__
## arrays, columns = _to_arrays(data, columns, dtype=dtype)
##
## File "C:\WinPython-64bit-3.5.2.1Qt5\python-3.5.2.amd64\lib\site-packages\pandas\core\frame.py", line 5352, in _to_arrays
## dtype=dtype)
##
## File "C:\WinPython-64bit-3.5.2.1Qt5\python-3.5.2.amd64\lib\site-packages\pandas\core\frame.py", line 5431, in _list_to_arrays
## coerce_float=coerce_float)
##
## File "C:\WinPython-64bit-3.5.2.1Qt5\python-3.5.2.amd64\lib\site-packages\pandas\core\frame.py", line 5489, in _convert_object_array
## 'columns' % (len(columns), len(content)))
##
## AssertionError: 1 columns passed, passed data had 3 columns


A single-character expression works as expected



x.wiki.str.extractall('(t)')

## x.wiki.str.extractall('(t)')
## Out[8]:
##          0
##   match
## 0 0      t
##   1      t
##   2      t
##   3      t
##   4      t
## 1 0      t
## 2 0      t
## 3 0      t
##   1      t
## 4 0      t


I was expecting this:



  match
0 0      the
  1      the
2 0      the
3 0      the
  1      the

Answer

The extractall() method has a bug, which should be fixed in pandas 0.18.2. That release should be out pretty soon, so either be patient or take a small risk and use a beta 0.18.2 version ... ;)
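
Until a fixed version is installed, one possible workaround is to collect the matches with `Series.str.findall` and build the two-level index by hand. This is only a minimal sketch, not how pandas implements `extractall()`: the helper name `extract_all_matches` and the column names `row`, `match` and `text` are made up for illustration, and the pattern is passed without a capture group so that `findall` returns whole matches.

```python
import pandas as pd

wiki = ["In theoretical computer the like operations.",
        "The a filter.",
        "In the.",
        "the dog is the one",
        "See below for details."]
x = pd.DataFrame(wiki, columns=['wiki'])


def extract_all_matches(series, pattern):
    """Hypothetical helper: one output row per regex match in each element."""
    # str.findall gives a list of matches for every element of the Series
    found_per_row = series.str.findall(pattern)
    records = [
        (row_label, match_number, text)
        for row_label, found in zip(found_per_row.index, found_per_row)
        if isinstance(found, list)  # skip missing values, which stay NaN
        for match_number, text in enumerate(found)
    ]
    # Index by (original row label, match number), like the expected output
    frame = pd.DataFrame(records, columns=['row', 'match', 'text'])
    return frame.set_index(['row', 'match'])


result = extract_all_matches(x.wiki, 'the')
print(result)
```

For the sample data this returns matches for rows 0, 2 and 3 only. "The a filter." yields nothing because the pattern is case-sensitive, and "theoretical" counts as a match in row 0 because the pattern is not anchored to word boundaries.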