WhoIsJack WhoIsJack - 2 months ago 55
Python Question

Selecting string columns in pandas df (equivalent to df.select_dtypes)

Given a Pandas

df
with different data types,
df.select_dtypes
can be very useful to keep only desired columns or to get rid of unwanted columns for a particular application.

However, there seems to be no way of addressing
string
dtypes using this method.

From the docs (emphasis mine):


ValueError

  If both of include and exclude are empty

  If include and exclude have overlapping elements

  If any kind of string dtype is passed in.


and


To select strings you must use the object dtype, but note that this will return all object dtype columns


Indeed, using
df.select_dtypes(exclude=['str'])
raises an error (although it is a
TypeError
and not a
ValueError
as the docs claim) and using
df.select_dtypes(exclude=['object'])
removes all
object
columns, not just
string
columns.


Given a
df
like this:

df = pd.DataFrame({'int_col':[0,1,2,3,4],
'dict_col':[dict() for i in range(5)],
'str_col':list('abcde')})


and considering that

df.dtypes


is
object
for both
str_col
and
dict_col
:


What is the best way of excluding or including all string columns?



Answer Source

Option 1

Using df.applymap and type, and equating to str:

In [377]: (df.applymap(type) == str).all(0)
Out[377]: 
dict_col    False
int_col     False
str_col      True
dtype: bool

Each element in each column converted to its type and then equated to str. After that, just call .all(0) or .min(0) to get a per-column verdict.


Option 2

Using df.applymap and isinstance:

In [342]: df.applymap(lambda x: isinstance(x, str)).all(0)
Out[342]: 
dict_col    False
int_col     False
str_col      True

To include these string columns, you can boolean index on the columns:

idx = ... # one of the two methods above
df_new = df[df.columns[idx]]

Exclusion would be

df_new = df[df.columns[~idx]]