Jingnan Lai Jingnan Lai -3 years ago 53
R Question

How to get data according to stock code and date while avoiding for loops?

I have two data frame.

data1
consist of three columns, one is stock code like 600287, another is its earning announcement day like 2015-09-07, the third is its announcement earning (earning per share) like 0.8. This data frame is in random order, it consists of different stock and different announcement days because it covers a time from 2014 to 2016. Every stock could announce 4 times a year and I have 2400 stocks in this data frame.

code1 day1 announcement
600181 2015-09-08 0.9


data2
are stock daily performance data. It has daily return rates of 2500 stocks from 2014-2015. So it has over 2 million rows which is the reason I'm looking for an efficient solution.
data2
also has code numbers and dates.

code2 day2 return
600298 2016-08-09 0.03


I'm researching company post-announcement stock price reaction. Basically, for example, if company "A" announces its earning on 2016-09-08, I have to know every return rate of stock "A" in the next 5 dealing days (including 2015-9-8 if it is a dealing day). Dealing days for every stock are different but if and only if this day appears in
data2
it is a dealing day of stock "A".

The difficulty here is that stock "A" announces on 2015-06-09 but this day does not appear in
data2
of stock "A" (this can be due to the fact that 2015-06-09 is a Sunday which is not a dealing day in China). What I did is using the
difftime()
function and then order it but this is slow!

The final data frame I want to get is like this (7 columns)

code announce-day d1 d2 d3 d4 d5
600287 2015-08-07 0.08 0.06 0.02 0.01 -0.02


(again I want to say day1 could be 2015-08-07 if this day is a dealing day. It could also be 2015-08-09. The only judge is it appears first in
data2
after 2015-08-07)

I have been working on this problem for so long and I can't fix it.
I give a brief example.

code1<-"600187"
day1<-as.Date("2016-10-09") ##stock 600187 announce on 2016-10-09
announcement<-0.8
data1<-data.frame(code1, day1,announcement)
code2<-c(rep("600187",10),"600234")
x<-as.Date("2016-07-08")
x<-seq(x,x+4,by=1)
y<-as.Date("2016-10-11")
y<-seq(y,y+4,by=1)
day2<-c(x,y,as.Date("2016-12-30"))
return<-"whatever"
data2<-data.frame(code2,day2,return)


In this case
data1
only consist of one announcement of one stock. The announce day is 2016-10-09 but the next day to appear in
data2
is 2016-10-11.

Here is my for-loop code, I still use test data because I don't know how to uplode the whole data.

require(snow)


code1<-c("600187","600111","600111")

day1<-as.Date(c("2016-10-09","2011-02-02","2011-09-09"))

Answer Source

According to the OP, data2 has daily return rates of 2500 stocks from 2014-2015 with over 2 million rows.

I recommend to use the data.table package for this task for two reasons: It's designed for fast joins on large data, and it it allows us to use a rolling join. There is no need to use for loops for this task.

So, with sample data sets given by the OP, the data.table solution

library(data.table)   # CRAN version 1.10.4 used
# coerce to data.table, 
# set keys to make sure data are properly ordered,
# add column to join on
setDT(data1, key = c("code1", "day1"))[, join_day := day1] # announcements
setDT(data2, key = c("code2", "day2"))[, join_day := day2] # returns

# join on stock code and do a rolling join on day
data1[data2, on = c(code1 = "code2", "join_day"), roll = TRUE]

returns

     code1       day1 announcement   join_day       day2   return
 1: 600187       <NA>           NA 2016-07-08 2016-07-08 whatever
 2: 600187       <NA>           NA 2016-07-09 2016-07-09 whatever
 3: 600187       <NA>           NA 2016-07-10 2016-07-10 whatever
 4: 600187       <NA>           NA 2016-07-11 2016-07-11 whatever
 5: 600187       <NA>           NA 2016-07-12 2016-07-12 whatever
 6: 600187 2016-10-09          0.8 2016-10-11 2016-10-11 whatever
 7: 600187 2016-10-09          0.8 2016-10-12 2016-10-12 whatever
 8: 600187 2016-10-09          0.8 2016-10-13 2016-10-13 whatever
 9: 600187 2016-10-09          0.8 2016-10-14 2016-10-14 whatever
10: 600187 2016-10-09          0.8 2016-10-15 2016-10-15 whatever
11: 600234       <NA>           NA 2016-12-30 2016-12-30 whatever

The rolling join has copied the announcement of 2016-10-09 to all returns of the matching stock after that day. This is called LOCF or last observation carried forward. It will do so until the next announcement for that particular stock is encountered.

The rows with NAs can be removed from the result with:

data1[data2, on = c(code1 = "code2", "join_day"), roll = TRUE, nomatch = 0]

which yields

    code1       day1 announcement   join_day       day2   return
1: 600187 2016-10-09          0.8 2016-10-11 2016-10-11 whatever
2: 600187 2016-10-09          0.8 2016-10-12 2016-10-12 whatever
3: 600187 2016-10-09          0.8 2016-10-13 2016-10-13 whatever
4: 600187 2016-10-09          0.8 2016-10-14 2016-10-14 whatever
5: 600187 2016-10-09          0.8 2016-10-15 2016-10-15 whatever

Now, the OP has requested to study returns of the next five trading days on and after the day of announcement. The rolling join allows to limit how far values are carried forward but that works on the difference (in days here) but not on the number of rows. As there are gaps in the sequence of trading days, this feature can't be used directly for that purpose. Instead, head() is used to pick the first n_days rows in each group.

As the given data samples are limited in size, let's assume we want to observe returns on the next two trading days for demonstration and test:

n_days <- 2L
joined <- data1[data2, on = c(code1 = "code2", "join_day"), roll = TRUE, nomatch = 0][
  order(day2), head(.SD, n_days), .(code1, day1)]

joined is now reduced to contain n_days of return values on or after each announcement day of each stock:

    code1       day1 announcement   join_day       day2   return
1: 600187 2016-10-09          0.8 2016-10-11 2016-10-11 whatever
2: 600187 2016-10-09          0.8 2016-10-12 2016-10-12 whatever

Finally, the OP wants the result to be reshaped from long to wide format. This can be done using dcast():

dcast(joined, code1 + day1 ~ paste0("d", rowid(code1, day1)), 
      value.var = "return")

which yields:

    code1       day1       d1       d2
1: 600187 2016-10-09 whatever whatever

Potential performance improvement by reducing the data volume

As already mentioned above, the roll parameter allows us to limit how far values are carried forward. Together with nomatch = 0, this can be used to reduce the data volume resulting from the rolling join operation. However, the roll parameter must be choosen carefully due to gaps in the sequence of return days. Therefore, the length of the longest gap plus the number of trading days is used:

max_gap <- data2[order(day2), max(diff(day2))]
joined <- data1[data2, on = c(code1 = "code2", "join_day"), roll = max_gap + n_days,
                nomatch = 0][
                  order(day2), head(.SD, n_days), .(code1, day1)]

Data

data1 <- structure(list(code1 = structure(1L, .Label = "600187", class = "factor"), 
    day1 = structure(17083, class = "Date"), announcement = 0.8, 
    join_day = structure(17083, class = "Date")), .Names = c("code1", 
"day1", "announcement", "join_day"), row.names = c(NA, -1L), class = "data.frame")
data2 <- structure(list(code2 = structure(c(1L, 1L, 1L, 1L, 1L, 1L, 1L, 
1L, 1L, 1L, 2L), .Label = c("600187", "600234"), class = "factor"), 
    day2 = structure(c(16990, 16991, 16992, 16993, 16994, 17085, 
    17086, 17087, 17088, 17089, 17165), class = "Date"), return = structure(c(1L, 
    1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L), class = "factor", .Label = "whatever"), 
    join_day = structure(c(16990, 16991, 16992, 16993, 16994, 
    17085, 17086, 17087, 17088, 17089, 17165), class = "Date")), .Names = c("code2", 
"day2", "return", "join_day"), row.names = c(NA, -11L), class = "data.frame")
Recommended from our users: Dynamic Network Monitoring from WhatsUp Gold from IPSwitch. Free Download