JC_CL JC_CL - 8 days ago 7
Python Question

pandas .sortlevel cant sort negative numbers

I have a pandas dataframe that I constructed by reading in various CSV files. It looks like this:

ID V2 H1-b V3 H2 V1 H1a
position -50.0 600.0 -125.0 -720.0 23.0 450.0
2000-01-01 -1.057609 1.160002 1.082879 -1.304001 -0.259283 1.285260
2000-02-01 -1.133474 -0.385869 0.756780 2.311465 1.060337 -1.059041
2000-03-01 1.209086 -0.774133 0.018603 0.969665 -1.221080 1.717816
...


When I try sorting it with
df_sort = df.sortlevel(level=1,axis=1)
(or
sortlevel='position'
, same result), I get the following result:

ID V3 V2 H2 V1 H1-a H1-b
position -125.0 -50.0 -720.0 23.0 450.0 600.0
2000-01-01 1.082879 -1.057609 -1.304001 -0.259283 1.285260 1.160002
2000-02-01 0.756780 -1.133474 2.311465 1.060337 -1.059041 -0.385869
2000-03-01 0.018603 1.209086 0.969665 -1.221080 1.717816 -0.774133


The positive Numbers are sorted the correct way (23<450<600), but the negative numbers are "random".

As far as I can tell, all my CSV files are the same (no spaces before the numbers or something), and all the entries in the dataframe are produced by the same script.

But when I tried to see if I can reproduce that with a simple synthetic dataframe, sorting works:

header=pd.MultiIndex.from_product([[-3,-300,4,100,34,-324],['s']],names=['loc','X'])
df = pd.DataFrame(np.random.randn(5, 6), index=['a','b','c','d','e'], columns = header)


results in

In [6]: df.head()
Out[6]:
loc -3 -300 4 100 34 -324
S s s s s s s
a -0.444521 -0.616153 2.261075 -1.857406 0.367582 1.212705
b -1.389062 -0.741163 0.512457 1.013495 -2.003147 0.651232
c -0.376925 -0.271408 -0.854247 0.355438 -0.791896 -1.359056
d -2.929450 0.228446 1.287110 -1.117579 -0.501250 1.340859
e -0.653089 0.245901 0.036066 0.776839 -1.112828 -0.476782

In [9]: df_sort = df.sortlevel('loc',axis=1)

In [10]: df_sort.head()
Out[10]:
loc -324 -300 -3 4 34 100
S s s s s s s
a 1.212705 -0.616153 -0.444521 2.261075 0.367582 -1.857406
b 0.651232 -0.741163 -1.389062 0.512457 -2.003147 1.013495
c -1.359056 -0.271408 -0.376925 -0.854247 -0.791896 0.355438
d 1.340859 0.228446 -2.929450 1.287110 -0.501250 -1.117579
e -0.476782 0.245901 -0.653089 0.036066 -1.112828 0.776839


as does
sortlevel(level = 0


First idea was that the the other things in my index disturb the sorting, but
df_sort = df_GW.sortlevel(level='location',axis=1,sort_remaining=False)
does not change anything in the sorting.

What am I doing wrong?

I suspect that for whatever reason something gets treated as a string or something, but I can't find any indication for that.

EDIT
output of
df.dtypes
:
real df:

In [29]: df_GW.dtypes
Out[29]:
ID Position
V2 -50.0 float64
H1-b 600.0 float64
V3 -125.0 float64
H2 -720.0 float64
V1 23.0 float64
H1-a 450.0 float64
dtype: object


synthetic:

AttributeError: 'DataFrame' object has no attribute 'dtype'


df.columns

real:

MultiIndex(levels=[[u'H1-a', u'H1-b', u'H2', u'V1', u'V2', u'V3'], [u'-125.0', u'-50.0', u'-720.0', u'23.0', u'450.0', u'600.0']],
labels=[[4, 1, 5, 2, 3, 0], [1, 5, 0, 2, 3, 4], [4, 1, 5, 2, 3, 0], [0, 0, 0, 0, 0, 0]], #not sure what's happening here. The original df is a bit bigger, and I'm cutting it to size
names=[u'ID', u'position'])


synthetic:

MultiIndex(levels=[[-720.0, -125.0, -50.0, 23.0, 450.0, 600.0], [u's']],
labels=[[2, 5, 1, 0, 3, 4], [0, 0, 0, 0, 0, 0]],
names=[u'loc', u'S'])

Answer

I think there is problem types of numbers in first level of MultiIndex are not float, but string:

np.random.seed(0)
header=pd.MultiIndex.from_product([['-125','-50','4','100','34','-720'],
                                   ['s']],names=['loc','X'])
df = pd.DataFrame(np.random.randn(5, 6), index=['a','b','c','d','e'], columns = header)
print (df)
loc      -125       -50         4       100        34      -720
X           s         s         s         s         s         s
a    1.764052  0.400157  0.978738  2.240893  1.867558 -0.977278
b    0.950088 -0.151357 -0.103219  0.410599  0.144044  1.454274
c    0.761038  0.121675  0.443863  0.333674  1.494079 -0.205158
d    0.313068 -0.854096 -2.552990  0.653619  0.864436 -0.742165
e    2.269755 -1.454366  0.045759 -0.187184  1.532779  1.469359

df.sortlevel('loc',axis=1, inplace=True)
print (df)
loc      -125       -50      -720       100        34         4
X           s         s         s         s         s         s
a    1.764052  0.400157 -0.977278  2.240893  1.867558  0.978738
b    0.950088 -0.151357  1.454274  0.410599  0.144044 -0.103219
c    0.761038  0.121675 -0.205158  0.333674  1.494079  0.443863
d    0.313068 -0.854096 -0.742165  0.653619  0.864436 -2.552990
e    2.269755 -1.454366  1.469359 -0.187184  1.532779  0.045759

If need cast string level to float, need change values and assign to new column names:

#change multiindex
cols = list(zip(df.columns.get_level_values('loc').astype(float),
                df.columns.get_level_values('X')))
df.columns = pd.MultiIndex.from_tuples(cols, names = df.columns.names)


df.sortlevel('loc',axis=1, inplace=True)
print (df)
loc      -720      -125      -50        4         34        100
X           s         s         s         s         s         s
a   -0.977278  1.764052  0.400157  0.978738  1.867558  2.240893
b    1.454274  0.950088 -0.151357 -0.103219  0.144044  0.410599
c   -0.205158  0.761038  0.121675  0.443863  1.494079  0.333674
d   -0.742165  0.313068 -0.854096 -2.552990  0.864436  0.653619
e    1.469359  2.269755 -1.454366  0.045759  1.532779 -0.187184