lanery - 11 months ago
Python Question

Pandas DataFrame with tuple of strings as index

I'm seeing some weird behavior here. I have a DataFrame that looks like

import pandas as pd

df = pd.DataFrame(columns=['Col 1', 'Col 2', 'Col 3'],
                  index=[('1', 'a'), ('2', 'a'), ('1', 'b'), ('2', 'b')])

In [14]: df
Col 1 Col 2 Col 3
(1, a) NaN NaN NaN
(2, a) NaN NaN NaN
(1, b) NaN NaN NaN
(2, b) NaN NaN NaN

I can set the value of an arbitrary element

In [15]: df['Col 2'].loc[('1', 'b')] = 6

In [16]: df
Col 1 Col 2 Col 3
(1, a) NaN NaN NaN
(2, a) NaN NaN NaN
(1, b) NaN 6 NaN
(2, b) NaN NaN NaN

But when I go to reference the element that I just set using the same syntax, I get

In [17]: df['Col 2'].loc[('1', 'b')]
KeyError: 'the label [1] is not in the [index]'

Can someone tell me what I'm doing wrong or why this behavior occurs? Am I simply not allowed to set the index as a multi-element tuple?


Apparently, wrapping the tuple index in a list works.

In [38]: df['Col 2'].loc[[('1', 'b')]]
(1, b) 6
Name: Col 2, dtype: object

I'm still getting some weird behavior in my actual use case, though, so it'd be nice to know whether this usage is not recommended.

Answer Source

The tuple inside the selection brackets is interpreted as a sequence of labels to retrieve, as if you had passed ['1', 'b']. Hence the KeyError: pandas looks for the key '1' and does not find it, because the only labels in the index are whole tuples such as ('1', 'b').

That's why it works when you add additional brackets, as now the argument becomes a sequence of one element - your tuple.
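To make the list-versus-tuple distinction concrete, here is a minimal sketch. It assumes a Series with made-up values and builds the flat index with pd.Index(..., tupleize_cols=False) so that the result does not depend on how the constructor infers the index. The bare-tuple lookup from the question is left out, since the exact exception it raises may depend on the pandas version.

```python
import pandas as pd

keys = [('1', 'a'), ('2', 'a'), ('1', 'b'), ('2', 'b')]

# Flat (single-level) index whose labels happen to be tuples.
# The values 10..40 are made up for illustration.
flat_index = pd.Index(keys, tupleize_cols=False)
s = pd.Series([10, 20, 30, 40], index=flat_index, name='Col 2')

is_multi = isinstance(s.index, pd.MultiIndex)  # one level, not a MultiIndex
n_levels = s.index.nlevels

# A list means "several labels": pandas looks for '1' and for 'b',
# neither of which is a label of this index.
try:
    s.loc[['1', 'b']]
    list_lookup_failed = False
except KeyError:
    list_lookup_failed = True

# A list holding one tuple means "one label, which is a tuple".
one_row = s.loc[[('1', 'b')]]
print(one_row)
```

Note that the list-wrapped lookup returns a one-element Series rather than a scalar, which is one more reason this form is awkward for everyday use.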

It is best to avoid the ambiguity between list and tuple arguments in selection. The behavior can also differ depending on whether the index is a simple Index or a MultiIndex.

As for a recommendation: avoid building a simple index out of tuples. pandas works better, and gives you more powerful selection, if you build a MultiIndex instead:

df = pd.DataFrame(columns=['Col 1', 'Col 2', 'Col 3'],
                  index=pd.MultiIndex.from_tuples([('1', 'a'), ('2', 'a'), ('1', 'b'), ('2', 'b')]))

# Assign through a single .loc call; chained assignment may write to a copy
df.loc[('1', 'b'), 'Col 2'] = 6

df['Col 2'].loc[('1', 'b')]
Out[13]: 6

df
    Col 1 Col 2 Col 3
1 a   NaN   NaN   NaN
2 a   NaN   NaN   NaN
1 b   NaN     6   NaN
2 b   NaN   NaN   NaN
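
If the frame already exists with a flat index of tuples, the same recommendation can be applied after the fact. The sketch below is one possible way to do it, not the only one: the values and the level names 'num' and 'letter' are hypothetical, chosen only for illustration.

```python
import pandas as pd

keys = [('1', 'a'), ('2', 'a'), ('1', 'b'), ('2', 'b')]

# Existing frame with a flat index of tuples (values are made up).
flat = pd.DataFrame({'Col 2': [10, 20, 30, 40]},
                    index=pd.Index(keys, tupleize_cols=False))

# Rebuild the index as a two-level MultiIndex.
# The level names 'num' and 'letter' are hypothetical.
multi = flat.copy()
multi.index = pd.MultiIndex.from_tuples(flat.index.tolist(),
                                        names=['num', 'letter'])

# A bare tuple now addresses exactly one row.
value = multi.loc[('1', 'b'), 'Col 2']

# Selecting on a single level is something the flat index cannot do.
b_rows = multi.xs('b', level='letter')
print(value)
print(b_rows)
```

Here xs returns the rows whose second level is 'b', indexed by the remaining 'num' level.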