LongGamma77 LongGamma77 - 11 months ago 49
R Question

Reading zipped folder containing non-traditional spreadsheet

I'm trying to read a zipped folder called etfreit.zip contained in Purchases from April 2016 onward.

Inside the zipped folder is a file called 2016.xls which is difficult to read as it contains empty rows along with Japanese text.

I have tried various ways of reading the xls from R, but I keep getting errors. This is the code I tried:

download.file("http://www3.boj.or.jp/market/jp/etfreit.zip", destfile="etfreit.zip")
data <- read.csv(text=readLines("2016.xls")[-(1:10)])

I'm trying to skip the first 10 rows as I simply wish to read the data in the xls file. The code works only to the extent that it runs, but the data looks truly bizarre.

Would greatly appreciate any help on reading the spreadsheet properly in R for purposes of performing analysis.

Answer Source

There is more than one bizzare thing going on here I think, but I had some success with (somewhat older) gdata package:

data = gdata::read.xls("2016.xls")

By the way, treating xls file as csv seldom works. Actually it shouldn't work at all :) Find out a proper import function for your type of data and then use it, don't assume that read.csv is going to take care about anything else than csv (properly).

As per your comment: I'm not sure what you mean by "not properly aligned", but here is some code that cleans the data a bit, and gives you numeric variables instead of factors (note I'm using tidyr for that):

data2 = data[-c(1:7), -c(1, 6)]
names(data2) = c("date", "var1", "var2", "var3")
data2[, c(2:4)] = sapply(data2[, c(2:4)], tidyr::extract_numeric)

# Optionally convert the column with factor dates to Posixct
data2$date = as.POSIXct(data2$date)

Also, note that I am removing only 7 upper rows - this seems to be the portion of the data that contains the header with Japanese.