jdzejdzej - 1 month ago
Python Question

pandas loc modifies dataFrame with multi index?

I've found some interesting behavior (a bug?) of loc on a DataFrame with a MultiIndex whose first level holds a single value. After loc is used for the first time, the first level of the MultiIndex disappears from the original DataFrame!

example:

In [1]: import pandas as pd

In [2]: x = pd.DataFrame({'idx1':[1]*10, 'idx2':[1]*5+[2]*5, 'idx3':list(range(5))+list(range(5)), 'data': [1]*10})

In [3]: x = x.set_index(['idx1', 'idx2', 'idx3']).sortlevel()


My DataFrame:

In [4]: x
Out[4]:
                data
idx1 idx2 idx3
1    1    0        1
          1        1
          2        1
          3        1
          4        1
     2    0        1
          1        1
          2        1
          3        1
          4        1


loc used for the first time:

In [5]: x.loc[1,:,:]
Out[5]:
           data
idx2 idx3
1    0        1
     1        1
     2        1
     3        1
     4        1
2    0        1
     1        1
     2        1
     3        1
     4        1


Now the DataFrame itself has only two index levels:

In [6]: x
Out[6]:
           data
idx2 idx3
1    0        1
     1        1
     2        1
     3        1
     4        1
2    0        1
     1        1
     2        1
     3        1
     4        1


This doesn't happen when 'idx1' has more than one value:

In [7]: x = pd.DataFrame({'idx1':[1]*3+[2]*7, 'idx2':[1]*5+[2]*5, 'idx3':list(range(5))+list(range(5)), 'data': [1]*10})

In [8]: x = x.set_index(['idx1', 'idx2', 'idx3']).sortlevel()

In [9]: x
Out[9]:
                data
idx1 idx2 idx3
1    1    0        1
          1        1
          2        1
2    1    3        1
          4        1
     2    0        1
          1        1
          2        1
          3        1
          4        1

In [10]: x.loc[1,:,:]
Out[10]:
                data
idx1 idx2 idx3
1    1    0        1
          1        1
          2        1

In [11]: x
Out[11]:
                data
idx1 idx2 idx3
1    1    0        1
          1        1
          2        1
2    1    3        1
          4        1
     2    0        1
          1        1
          2        1
          3        1
          4        1


Is this normal behavior? How can I avoid it?

python 2.7 32bit, pandas==0.16.2, numpy==1.11.1+mkl

Answer

I think it is better to select with slicers (pd.IndexSlice). Then loc returns the same kind of output, with all index levels kept, in both cases:

x = pd.DataFrame({'idx1':[1]*10, 'idx2':[1]*5+[2]*5, 'idx3':list(range(5))+list(range(5)), 'data': [1]*10})
x = x.set_index(['idx1', 'idx2', 'idx3']).sortlevel()
print (x)
                data
idx1 idx2 idx3      
1    1    0        1
          1        1
          2        1
          3        1
          4        1
     2    0        1
          1        1
          2        1
          3        1
          4        1

idx = pd.IndexSlice
print (x.loc[idx[1,:,:],:])
                data
idx1 idx2 idx3      
1    1    0        1
          1        1
          2        1
          3        1
          4        1
     2    0        1
          1        1
          2        1
          3        1
          4        1
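
As a rough check of the slicer approach, the sketch below rebuilds the first sample and confirms that the selection keeps all three levels and that x itself still has them afterwards. It assumes a recent pandas, so sort_index() stands in for sortlevel(); the names selected, selected_names and original_names are only illustrative.

```python
import pandas as pd

# Same data as the first sample: 'idx1' holds a single value.
x = pd.DataFrame({
    'idx1': [1] * 10,
    'idx2': [1] * 5 + [2] * 5,
    'idx3': list(range(5)) + list(range(5)),
    'data': [1] * 10,
})
# Assumption: a recent pandas, where sort_index() is used instead of sortlevel().
x = x.set_index(['idx1', 'idx2', 'idx3']).sort_index()

idx = pd.IndexSlice
# One slicer entry per index level, then ':' for all columns.
selected = x.loc[idx[1, :, :], :]

# Compare the index level names of the selection and of the original frame.
selected_names = list(selected.index.names)
original_names = list(x.index.names)
print(selected_names)
print(original_names)
```

If both printed lists contain 'idx1', 'idx2' and 'idx3', the selection kept every level and the original frame was left untouched.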

If you need to remove the level, use xs with the drop_level parameter:

print (x.xs(1, level=0, drop_level=True))
           data
idx2 idx3      
1    0        1
     1        1
     2        1
     3        1
     4        1
2    0        1
     1        1
     2        1
     3        1
     4        1

print (x.xs(1, level=0, drop_level=False))
                data
idx1 idx2 idx3      
1    1    0        1
          1        1
          2        1
          3        1
          4        1
     2    0        1
          1        1
          2        1
          3        1
          4        1
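
To make the choice explicit at the call site, the two xs calls above could be wrapped in a small helper. This is only a sketch: select_level and its keep_level argument are hypothetical names, not part of pandas, and sort_index() is used on the assumption of a recent pandas version.

```python
import pandas as pd

def select_level(df, key, level=0, keep_level=True):
    """Hypothetical helper: rows where `level` equals `key`.

    keep_level=True keeps the selected level in the result's index,
    keep_level=False drops it (this maps onto xs's drop_level flag).
    """
    return df.xs(key, level=level, drop_level=not keep_level)

# Same data as the first sample: 'idx1' holds a single value.
x = pd.DataFrame({
    'idx1': [1] * 10,
    'idx2': [1] * 5 + [2] * 5,
    'idx3': list(range(5)) + list(range(5)),
    'data': [1] * 10,
})
# Assumption: a recent pandas, where sort_index() is used instead of sortlevel().
x = x.set_index(['idx1', 'idx2', 'idx3']).sort_index()

kept = select_level(x, 1)                       # all three levels remain
dropped = select_level(x, 1, keep_level=False)  # 'idx1' is removed

print(list(kept.index.names))
print(list(dropped.index.names))
```

Either way the result is a new object, so the number of levels in the output is decided by the caller rather than by how many distinct values the first level happens to hold.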

Second sample:

x = pd.DataFrame({'idx1':[1]*3+[2]*7, 'idx2':[1]*5+[2]*5, 'idx3':list(range(5))+list(range(5)), 'data': [1]*10})

x = x.set_index(['idx1', 'idx2', 'idx3']).sortlevel()
print (x)
                data
idx1 idx2 idx3      
1    1    0        1
          1        1
          2        1
2    1    3        1
          4        1
     2    0        1
          1        1
          2        1
          3        1
          4        1

idx = pd.IndexSlice
print (x.loc[idx[1,:,:],:])
                data
idx1 idx2 idx3      
1    1    0        1
          1        1
          2        1

print (x.xs(1, level=0, drop_level=True))
           data
idx2 idx3      
1    0        1
     1        1
     2        1

print (x.xs(1, level=0, drop_level=False))
                data
idx1 idx2 idx3      
1    1    0        1
          1        1
          2        1