MathsQuant525 - 7 months ago 26

R Question

I am trying to clean a large dataset. I have a matrix of prices in date order with the most recent date in the first row and different stocks along the columns. If a price is missing or NA on a given day for a given stock, I use the price from the day before. If the price on the final day is NA, I leave it.

I start by looping over the whole matrix and using an IF statement for each (i,j) pair. This was extremely slow. The next approach is as follows for the matrix of prices b: I use indexing to find indices which are NA, and only deal with these.

`for(j in 1:ncol(b))`

{

Index<-which( is.na(b[,j]) | b[,j]==0)

if(length(Index)==0)

{print("0 Missings")

Index<-c(1)#to ensure its not empty}

for(k in length(Index):1 )#backwards to fill from the bottom

{

i=Index[k]

#If the oldest date is missing, then set it to N/A so that N/A is passed forward as opposed to 0.

if( i==nrow(b) & ( b[[i,j]]==0 | is.na(b[[i,j]]) ) )

{

b[[i,j]]<-'#N/A'

}

else( b[[i,j]]==0 | is.na(b[[i,j]]) )

{

b[[i,j]] <- b[[i+1,j]]#Take the price from the date before

}

}

}

This is a little faster, but not much. It still takes over an hour for a 400x6000 matrix. I was hoping for a fully vectorised approach, where I did something like:

`b[[Index,j]]<-b[[Index+1,j]]`

However, I don't think R will use sequentially updated values. By this, I mean it wont progressively update from the bottom so that new values are used. This is important when I have 2 NA entries in a row, since the vectorised approach above will only fill one. But some sort of efficient sequential vectorised code will update the first, and use this to update the second. Any ideas?

Many thanks for your efforts

Answer

Here's an possibility using the `MESS`

package and is essentially not different from @Roland's comment above so I'm only including it here so you can see the formatting. The `filldown`

function is written in C++ so it's rather fast:

```
x <- matrix(c(1, 2, 3, 4, NA, 6, NA, NA, NA, NA, 11, 12, 13, 14, 15, NA, 17, 18, NA, 20), nrow=5)
x
[,1] [,2] [,3] [,4]
[1,] 1 6 11 NA
[2,] 2 NA 12 17
[3,] 3 NA 13 18
[4,] 4 NA 14 NA
[5,] NA NA 15 20
```

and then use

```
library(MESS)
apply(x, 2, filldown)
```

which produces

```
[,1] [,2] [,3] [,4]
[1,] 1 6 11 NA
[2,] 2 6 12 17
[3,] 3 6 13 18
[4,] 4 6 14 18
[5,] 4 6 15 20
```