Andrew Pruet Andrew Pruet - 7 days ago 6
R Question

R vectorize findInterval over large data sets

I have two data frames that I'm using findInterval on. Wellbore data is data of the x,y, and z of a wellbore to produce oil (VSS = vertical subsea-depth, md = measured depth a.k.a. actual distance the drill bit traveled down the well). Perfs data is data where a wellbore has been perforated to allow for flow (top_perf = md, bot_perf = md).

Perfs:

Well_ID top_perf bot_perf well_name surface ID x y VSS
056-W 2808 2958 056-W Ranger 2 0 0 0
056-W 3150 3250 056-W Ranger 1 0 0 0
056-W 3150 3250 056-W Ranger 2 0 0 0
056-W 3559 3664 056-W UT 1 1 0 0 0
056-W 3559 3664 056-W UT 2 2 0 0 0
057-W 2471 2952 057-W Tar 1 0 0 0
057-W 2471 2952 057-W Tar 2 0 0 0
058-W 2615 2896 058-W Ranger 1 0 0 0
058-W 2615 2896 058-W Ranger 2 0 0 0


Wellbore:

well_name well_id md vss x y
056-W 056-W 3260 -3251.46 4221436 4030454
056-W 056-W 3280 -3271.45 4221436 4030454
056-W 056-W 3300 -3291.45 4221435 4030453
056-W 056-W 3320 -3311.44 4221435 4030453
056-W 056-W 3340 -3331.44 4221434 4030453
056-W 056-W 3360 -3351.43 4221434 4030453
056-W 056-W 3380 -3371.43 4221433 4030453
056-W 056-W 3400 -3391.42 4221433 4030453


The goal is to find Perfs$top_perf and Perfs$bot_perf that are closest in value to Wellbore$md where Perfs$Well_ID = Wellbore$well_id and then extract the vss, x, and y from Wellbore and add it to Perfs. (I don't care about interpolating if it's in between, just need something that's close).

Here is my code to do this:

for(i in 1:dim(Perfs)[1]){
if(Perfs$ID[i] == 1){
Wellbore_temp <- Wellbore[which(Wellbore$well_id == Perfs[i,"Well_ID"]),]
interval <- findInterval(Perfs[i,"top_perf"], Wellbore_temp$md)
Perfs[i,c("x","y","VSS")] <- Wellbore_temp[interval, c("x","y","vss")]
}else{
Wellbore_temp <- Wellbore[which(Wellbore$well_id == Perfs[i,"Well_ID"]),]
interval <- findInterval(Perfs[i,"bot_perf"], Wellbore_temp$md)
Perfs[i,c("x","y","VSS")] <- Wellbore_temp[interval, c("x","y","vss")]
}
}


This code does work, it's just far too slow for the application this will be used in. How can I get rid of the loop and do this in a more vectorized manner to speed things up? Also open to suggestions outside of findInterval.

Answer

Found the answer to the question here: Join R data.tables where key values are not exactly equal--combine rows with closest times

Based on the thoughts of a data.table provided by @ds440

Here is the code I used and it runs very fast:

Perf.Data <- Perfs


Wellbore.Perfs <- data.table(Wellbore[,c("well_id","md","vss")])
Spotfire.Top.Perf <- data.table(Perf.Data[,c("Well_ID","top_perf", "bot_perf")])
Spotfire.Bot.Perf <- data.table(Perf.Data[,c("Well_ID","bot_perf", "top_perf")])

#Change the column names to match up with Wellbore.Perfs
#Add in the bot_perf to .top.perf and the top_perf to the .bot.perf is done to make these unique and ensure everything is captured from the perfs table
colnames(Spotfire.Top.Perf) <- c("well_id","md", "bot_perf")
colnames(Spotfire.Bot.Perf) <- c("well_id","md","top_perf")

#set key to join on
setkey(Wellbore.Perfs, "well_id","md")

#roll = "nearest" will take the nearest value of md in .top.perf or .bot.perf and match it to the md in wellbore.perfs where Well_ID = Well_ID
Perfs.Wellbore.Top <- Wellbore.Perfs[Spotfire.Top.Perf, roll = "nearest"]
Perfs.Wellbore.Bot <- Wellbore.Perfs[Spotfire.Bot.Perf, roll = "nearest"]