Kartik - 3 months ago
Python Question

pandas DataFrame columns named True and False work just fine

The sample code:

import pandas as pd
import numpy as np

df = pd.DataFrame(np.random.rand(5,2), columns=[True, False])
# All of the following work fine, just as you would expect
# them to if the columns had any other (string) name.
# (Because True == True, True == False and False == False are
# valid comparisons -- they have to be.)
df.loc[:, True]
df.loc[:, False]
df.loc[:, [col for col in df.columns if col]]
df.loc[:, :]

# However, the line below returns only column `True`. If
# the names were strings, it would return both columns.
df.loc[:, [True, False]]


What witchcraft makes this possible? I expected some check on the keys to fail. None did, which is why I am asking.

So, rephrasing my question: how does pandas (or Python, for that matter) decide between Boolean and non-Boolean (for lack of a better expression) indexing? How does it avoid confusion, and what prevents misbehavior? Had the first line (df = pd.DataFrame(np.random.rand(5,2), columns=[True, False])) returned a single column (True), I would have been less surprised.

Answer

There is no witchcraft. As far as I know, columns can be labeled by any hashable type. Given that bool is a subclass of int, is it really any stranger than:

In [7]: df1 = pd.DataFrame(np.random.rand(5,2), columns=[0, 1])

In [8]: df1
Out[8]: 
          0         1
0  0.706135  0.307180
1  0.713418  0.006204
2  0.308810  0.688868
3  0.582871  0.738771
4  0.418600  0.948231
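
To make the "booleans are ints" point concrete, here is a small plain-Python sketch (no pandas involved; the dictionaries and their values are made up for illustration). It shows that True and False are hashable and behave like 1 and 0 when used as keys:

```python
# bool is a subclass of int, so True and False hash and compare like 1 and 0.
print(issubclass(bool, int))
print(hash(True) == hash(1), hash(False) == hash(0))

# Booleans are therefore perfectly valid dictionary keys (or column labels).
labels = {True: "first", False: "second"}
print(labels[True])

# Because True == 1 and both hash alike, they collide as keys:
# the first key is kept and its value is overwritten.
collided = {True: "bool", 1: "int"}
print(collided)  # {True: 'int'}
```

This is only meant to show why a hashable-label container has no reason to reject booleans; it says nothing about how pandas stores its column index internally.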

However, since .loc lets you select by label, there is one case where boolean labels are ambiguous. Consider what I can do with my int-labelled columns:

In [10]: df1.loc[:, [1, 0]]
Out[10]: 
          1         0
0  0.307180  0.706135
1  0.006204  0.713418
2  0.688868  0.308810
3  0.738771  0.582871
4  0.948231  0.418600

However, if I try to do the same thing with the boolean-labelled columns:

In [11]: df
Out[11]: 
      True      False
0  0.487752  0.545283
1  0.921928  0.715808
2  0.618667  0.946385
3  0.975142  0.078050
4  0.994348  0.468887

In [12]: df.loc[:, [False, True]]
Out[12]: 
      False
0  0.545283
1  0.715808
2  0.946385
3  0.078050
4  0.468887

Whoops! Now it reverts to boolean indexing behavior. Still, you can always use .iloc:

In [13]: df.iloc[:, [1, 0]]
Out[13]: 
      False     True 
0  0.545283  0.487752
1  0.715808  0.921928
2  0.946385  0.618667
3  0.078050  0.975142
4  0.468887  0.994348
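
If you would rather keep selecting by label than by position, one possible alternative is reindex, which (as far as I know) always treats its argument as labels and never as a mask. The following is a sketch under that assumption; the variable names masked and by_label are just for illustration:

```python
import numpy as np
import pandas as pd

df = pd.DataFrame(np.random.rand(5, 2), columns=[True, False])

# A list made up entirely of booleans is read by .loc as a mask over the
# columns: position 0 is dropped, position 1 (the column labelled False) is kept.
masked = df.loc[:, [False, True]]

# reindex interprets the same list as labels, so it selects and reorders
# both columns by name.
by_label = df.reindex(columns=[False, True])

print(list(masked.columns))
print(list(by_label.columns))
```

The .iloc approach above stays the most explicit option; reindex is merely convenient when you already have the labels in hand.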

Edit to address OP edit

Notice, however, that df = pd.DataFrame(np.random.rand(5,2), columns=[True, False]) works fine because it isn't an indexing or selection operation; it is creating a DataFrame. Finally, notice that:

In [17]: df.loc[:, [False]]
Out[17]: 
Empty DataFrame
Columns: []
Index: [0, 1, 2, 3, 4]

This also uses boolean indexing on the columns, as expected. So, as far as I can tell, a list of booleans passed to .loc is always treated as a boolean mask.
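
The rule that seems to be at work can be sketched in plain Python. Note that looks_like_boolean_mask is a hypothetical helper written for this answer, not a pandas function, and the real check inside pandas also covers boolean arrays and Series:

```python
def looks_like_boolean_mask(key):
    """Hypothetical helper: True when key is a non-empty list holding only booleans."""
    return (
        isinstance(key, list)
        and len(key) > 0
        and all(isinstance(item, bool) for item in key)
    )

print(looks_like_boolean_mask([True, False]))  # True: would be read as a mask
print(looks_like_boolean_mask([False]))        # True: still a mask, just a short one
print(looks_like_boolean_mask([1, 0]))         # False: plain ints are labels
print(looks_like_boolean_mask(True))           # False: a scalar is a single label
```

Under this reading, the scalar lookups df.loc[:, True] and df.loc[:, False] stay label-based, while any list consisting solely of booleans is handled as a mask.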

Edit by asker

Do also see this answer to the question to get the other part of the story.