Methamortix Methamortix - 3 months ago 24
R Question

Conversion with as.date() of csv file false

I have a 1gb csv file with Dates and according values. Now is the Dates are in "undefined Format" - so they are diplayed as numbers in Excel like this:

DATE FXVol.DKK.EUR,0.75,4
38719 0.21825


I cannot open the csv file and change it to the date format I like since I would lose data in this way.

If I now import the data to R and convert the Dates:

as.Date( workingfilereturns[,1], format = "%Y-%m-%d")


It always yields dates that are 70 years + so 2076 instead of 2006. I really have no idea what goes wrong or how to fix this issue.

Answer

Going by your sample data, 38719 appears to be the number of days which have elapsed since January 1, 1900. So you can just add this number of days to January 1, 1900 to arrive at the correct Date object which you want:

as.Date("1900-01-01") + workingfilereturns[,1]

or

as.Date("1900-01-01") + workingfilereturns$DATE

Example:

> as.Date("1900-01-01") + 38719
[1] "2006-01-04"

Update:

As @Roland correctly pointed out, you could also use as.Date.numeric while specifying an origin of January 1, 1900:

> as.Date.numeric(38719, origin="1900-01-01")
[1] "2006-01-04"

Bug warning:

As the asker @Methamortix pointed out, my solution, namely using January 1, 1900, as the origin, yields a date which is two days too late in R. There are two reasons for this:

  • In R, the origin is indexed with 0, meaning that as.Date.numeric(0, origin="1900-01-01") is January 1, 1900, in R, but Excel starts counting at 1, meaning that formatting the number 1 in Excel yields January 1, 1900. This explains why R is one day ahead of Excel.
  • (Hold your breath) It appears that R's as.Date function has a bug in the year 1900, specifically that it ignores the fact that 1900 is a leap year, where February has a 29th day. As a result, when dealing with dates greater than February 28, 1900, R is a second day ahead of Excel.


As evidence of this, consider the following code:

> as.Date.numeric(57, origin="1900-01-01")
[1] "1900-02-27"
> as.Date.numeric(58, origin="1900-01-01")
[1] "1900-02-28"
> as.Date.numeric(59, origin="1900-01-01")
[1] "1900-03-01"

In other words, R's as.Date() skipped over February 29th. My guess is that this has been reported somewhere, possibly on Stack Overflow or elsewhere, but let this serve as another reference point.

So, going back to the original question, the origin needs to be offset by 2 days when dealing with Excel dates in R, where the date is greater than February 28, 1900 (which is the case of the original problem). So he should use his date data frame in the following way:

as.Date.numeric(workingfilereturns$DATE - 2, origin="1900-01-01")

where the date column has been rolled back by two days to sync up with the values in Excel.

Comments