toom toom - 4 months ago 39
Python Question

Sorting python pandas dataframe with respect to a column of type object

I've got a dataframe that looks like the following (subset):

0 2014-01-23 17:01:09
1000 2014-01-23 17:38:28
2000 2014-01-23 18:08:28
3000 2014-01-23 18:33:10
4000 2014-01-24 14:19:28
5000 2014-01-24 14:31:04
6000 2014-01-23 18:34:53
7000 2014-01-23 17:15:36
8000 2014-01-23 19:31:45
9000 2014-01-23 19:23:06
10000 2014-01-23 17:58:15
Name: date, dtype: object


As you can see the column type of column
date
is
object
. So, I'd like to sort this column chronologically. Therefore in my code I call

data = data.sort(['date'])


however, the result looks like this:

0 2014-01-23 17:01:09
1000 2014-01-23 17:38:28
2000 2014-01-23 18:08:28
3000 2014-01-23 18:33:10
4000 2014-01-24 14:19:28
5000 2014-01-24 14:31:04
6000 2014-01-23 18:34:53
7000 2014-01-23 17:15:36
8000 2014-01-23 19:31:45
9000 2014-01-23 19:23:06
10000 2014-01-23 17:58:15
Name: date, dtype: object


So, nothing was sorted. But why does it not work?

UPDATE: I now found the following stuff on the internet:

import dateutil
data['date'] = data['date'].apply(dateutil.parser.parse)
print str(data.dtypes)


yields:

date datetime64[ns]


However, even now that the type is explicitly set to a datetime object I'm still not able to sort it chronologically. Now, I'm really confused.

UPDATE 2:
Now I did the following in the code

data['date'] = pd.to_datetime(data['date'])
data = data.sort(['date'])


However, it still has no effect. No sorting w.r.t. date.

Answer

Okay, I now found out. The index of the dataframe is also permutated according to the sorting order of the dataframe. So, the output of

print str(data['date'].ix[np.array(range(0,11))*1000])
print str(data[:20])

actually is different. Therefore I did not see any effect on the sorting, although everything was properly sorted already. However, this is extremly counter intuitive!

In order to create a corrected index order one can repair the permutation by:

data = data.sort(['date'])
data.index = range(0,len(data))

ADDITIONAL COMMENT: Actually it would be great if in the documentation of pandas there would be a remark on such a pitfall (and others). It nearly drove me crazy and only a collegue knew this by coincidence.