harbun, 9 months ago
Python Question

Setting datetime64 series as pandas dataframe index automatically adds timezone offset

I am reading a CSV with datetimes that carry no timezone data, but once I use the datetime column as the index, an (incorrect) timezone offset is added. How can I prevent this from happening?

The data:

Time (UTC),Open,High,Low,Close,Volume
2005.01.03 00:00:00,1.8275,1.858,1.7971,1.819,41998.5
2005.01.10 00:00:00,1.8095,1.8376,1.771,1.766,46353.9

It is weekly OHLC data.

import pandas as pd
df = pd.read_csv("test.csv", parse_dates=["Time (UTC)"])

After reading in the data, there is no timezone offset:

df["Time (UTC)"].head(2)
0 1973-02-26
1 1973-03-05
Name: Time (UTC), dtype: datetime64[ns]

But when I set this data as index, a timezone offset is added:

df.index = df["Time (UTC)"]
df.index.values[:1]
array(['1973-02-26T01:00:00.000000000+0100'], dtype='datetime64[ns]')

When I check the index's timezone, I get back that none is set, so no timezone has been added even though an offset is displayed (which, by the way, seems to follow daylight saving time too). If I set the timezone to UTC with
df = df.tz_localize("UTC")
the index shows dtype='datetime64[ns, UTC]'. However, this has no effect on the offsets.

Since I know what timezone the data is in, I don't need a timezone offset, much less an incorrect one that is probably based on my machine's timezone.
I would rather have the ["Time (UTC)"] column set as the index directly in pd.read_csv for performance reasons, but I get the same behavior when doing that.

How can I prevent a timezone offset from being added, or set the correct one?

My Python version is 2.7.11 (Anaconda 2.5.0, 64-bit), pandas version is 0.17.1, NumPy 1.10.4.

Answer

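This is solely a display issue: your dates are still timezone-naive. NumPy versions before 1.11 render datetime64 values in the machine's local timezone in the repr, which is where the offset comes from. The stored values are unchanged.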
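One way to confirm this without relying on the NumPy repr is to inspect the index through pandas itself. The following is a minimal sketch, assuming a recent pandas on Python 3; the inline `csv_text` stands in for test.csv, and the explicit `format` string is an assumption made to match the sample rows from the question.

```python
import io

import pandas as pd

# Inline stand-in for test.csv, using the two sample rows from the question.
csv_text = """Time (UTC),Open,High,Low,Close,Volume
2005.01.03 00:00:00,1.8275,1.858,1.7971,1.819,41998.5
2005.01.10 00:00:00,1.8095,1.8376,1.771,1.766,46353.9
"""

df = pd.read_csv(io.StringIO(csv_text))

# Parse with an explicit format so the dotted dates are not left to inference.
df["Time (UTC)"] = pd.to_datetime(df["Time (UTC)"], format="%Y.%m.%d %H:%M:%S")
df = df.set_index("Time (UTC)")

# A timezone-naive index has no tz attached.
index_is_naive = df.index.tz is None

# The stored value is still midnight, whatever an old NumPy repr displays.
first_timestamp = df.index[0]

print(index_is_naive)
print(first_timestamp)
```

If `index_is_naive` is true and the first timestamp is still midnight, the data itself is untouched and only the array repr is misleading.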

If you upgrade to a more recent numpy (1.11+), it will fix the display issue.

In [31]: np.__version__
Out[31]: '1.11.1'

In [32]: df.index.values[:1]
Out[32]: array(['2005-01-03T00:00:00.000000000'], dtype='datetime64[ns]')
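If upgrading NumPy is not an option, or if you want the index to be explicitly marked as UTC, the timestamps can be formatted through pandas rather than through the NumPy array repr. This is a small sketch under the same assumptions as the question's sample data; the variable names are illustrative only.

```python
import pandas as pd

# The two sample timestamps from the question, parsed as timezone-naive.
naive_index = pd.to_datetime(
    ["2005.01.03 00:00:00", "2005.01.10 00:00:00"],
    format="%Y.%m.%d %H:%M:%S",
)

# Attach UTC without shifting the wall-clock values.
utc_index = naive_index.tz_localize("UTC")

# Timestamp.isoformat() does not depend on the machine's local timezone.
naive_labels = [ts.isoformat() for ts in naive_index]
utc_labels = [ts.isoformat() for ts in utc_index]

print(naive_labels)
print(utc_labels)
```

Localizing only attaches the timezone; the wall-clock values stay at midnight, so the UTC labels differ from the naive ones only by the trailing `+00:00`.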