CF84 - 15 days ago
Python Question

Pandas: select rows based on a condition applied to string

Working with a dictionary of dataframes, each key is an integer 0, ..., 999, and each value is a dataframe like this:

          A   B
1  10010001  17
2  10020001   5
3  10020002  11
4  10020003   2
5  10030001  86
...


I need to iterate through the entire dictionary and put together, in a new dataframe, all rows whose 3rd and 4th digits in column A are equal to 02. In my example, only rows 2, 3, and 4 would form the new dataframe. All values of column A are strings.

What would be the most efficient way of doing this within pandas?

Answer

How about something like the following, where d is your dict:

pd.concat((v[v.A.str[2:4] == '02'] for v in d.values()))

With a dict consisting of your sample dataframe repeated 3 times under keys 0-2

d = dict(zip(range(3), [df]*3))

this yields:

          A   B
2  10020001   5
3  10020002  11
4  10020003   2
2  10020001   5
3  10020002  11
4  10020003   2
2  10020001   5
3  10020002  11
4  10020003   2

This should be lighter on memory than first building a list of rows or using a list comprehension, because the filtered frames are fed to pd.concat through a generator expression. It should also be faster than a regex, since it slices the strings directly (assuming every value in A has the same fixed-width layout).
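To make the comparison concrete, here is a minimal, self-contained sketch that rebuilds the sample data from the question and checks that the slice-based filter picks the same rows as a regex-based one. The names by_slice and by_regex are just illustrative, and the pattern ..02 is one assumed way to write the regex version, not necessarily what you would have used.

```python
import pandas as pd

# Sample frame from the question; column A holds strings.
df = pd.DataFrame(
    {
        "A": ["10010001", "10020001", "10020002", "10020003", "10030001"],
        "B": [17, 5, 11, 2, 86],
    },
    index=[1, 2, 3, 4, 5],
)

# Toy dict: the same sample frame stored under keys 0, 1 and 2.
d = dict(zip(range(3), [df] * 3))

# Direct slicing: positions 2 and 3 are the 3rd and 4th characters.
by_slice = pd.concat(v[v.A.str[2:4] == "02"] for v in d.values())

# Regex version for comparison: any two characters, then "02".
# str.match anchors the pattern at the start of each string.
by_regex = pd.concat(v[v.A.str.match(r"..02")] for v in d.values())

# Both filters should select exactly the same rows.
print(by_slice.equals(by_regex))
```

Timing the two variants on your real data (for example with timeit) is the only reliable way to see how much the direct slice actually saves.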


If you don't like the index of the combined dataframe, you can always reset it with reset_index(). For example:

y = pd.concat((v[v.A.str[2:4] == '02'] for v in d.values()))
y.reset_index(drop=True)

          A   B
0  10020001   5
1  10020002  11
2  10020003   2
3  10020001   5
4  10020002  11
5  10020003   2
6  10020001   5
7  10020002  11
8  10020003   2
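As an alternative to resetting the index afterwards, pd.concat accepts ignore_index=True, which numbers the combined rows 0..n-1 in one step. The sketch below assumes the same sample data as in the question; the variable names are only for illustration.

```python
import pandas as pd

# Sample frame from the question; column A holds strings.
df = pd.DataFrame(
    {
        "A": ["10010001", "10020001", "10020002", "10020003", "10030001"],
        "B": [17, 5, 11, 2, 86],
    },
    index=[1, 2, 3, 4, 5],
)

# Toy dict: the same sample frame stored under keys 0, 1 and 2.
d = dict(zip(range(3), [df] * 3))

# ignore_index=True discards the original row labels and
# assigns a fresh 0..n-1 index to the concatenated result.
y = pd.concat(
    (v[v.A.str[2:4] == "02"] for v in d.values()),
    ignore_index=True,
)
print(y)
```

This drops the original row labels entirely, so use it only if you do not need to trace rows back to their position in the source dataframes.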