Kryo - 2 months ago
R Question

How to map segment boundaries to the closest position in a reference file in R

How can I map the coordinate positions (start and end positions) in a segment file to the nearest position in a reference file?

seg <- Sample Chromosome Start End Num_markers LogRatio
Nf1 1 3020000.5 195340000.5 4732 0.2981
Nf2 2 3100000.5 181980000.5 4091 0.2986

Ref <- Name Chromosome Position
1:3010000.5 1 3010000.5
1:195330000.5 1 195330000.5
2:3090000.5 2 3090000.5
2:181970000.5 2 181970000.5


Desired output

result <- Sample Chromosome Start End Num_markers LogRatio
Nf1 1 3010000.5 195330000.5 4732 0.2981
Nf2 2 3090000.5 181970000.5 4091 0.2986
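
To make the example reproducible, the two tables above can be built as ordinary data frames. This is only a sketch: the column types (integer chromosomes, numeric positions) are an assumption, since the question shows the values but not how they were read in.

```r
# Rebuild the segment table from the question.
# Assumption: Chromosome is integer and Start/End are numeric (double).
seg <- data.frame(
  Sample      = c("Nf1", "Nf2"),
  Chromosome  = c(1L, 2L),
  Start       = c(3020000.5, 3100000.5),
  End         = c(195340000.5, 181980000.5),
  Num_markers = c(4732L, 4091L),
  LogRatio    = c(0.2981, 0.2986),
  stringsAsFactors = FALSE
)

# Rebuild the reference table from the question.
# Chromosome and Position must have the same types as in seg,
# otherwise a join on those columns may fail or coerce.
Ref <- data.frame(
  Name       = c("1:3010000.5", "1:195330000.5", "2:3090000.5", "2:181970000.5"),
  Chromosome = c(1L, 1L, 2L, 2L),
  Position   = c(3010000.5, 195330000.5, 3090000.5, 181970000.5),
  stringsAsFactors = FALSE
)
```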

Answer

Using data.table, you could perform two rolling joins while specifying roll = "nearest". You would need to do this twice, because each join matches Position against a different column (Start, then End), but it should still be very efficient. Here's a possible implementation:

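library(data.table)

# Convert both data frames to data.tables by reference
setDT(seg)
setDT(Ref)

# For each row of seg, get the row number in Ref that has the same
# Chromosome and the Position nearest to Start (then to End)
StartInd <- Ref[seg, on = c(Chromosome = "Chromosome", Position = "Start"), which = TRUE, roll = "nearest"]
EndInd <- Ref[seg, on = c(Chromosome = "Chromosome", Position = "End"), which = TRUE, roll = "nearest"]

# Overwrite the segment boundaries with the matched reference positions
seg[, `:=`(Start = Ref[StartInd, Position], End = Ref[EndInd, Position])]
print(seg, digits = 10)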
#    Sample Chromosome     Start         End Num_markers LogRatio
# 1:    Nf1          1 3010000.5 195330000.5        4732   0.2981
# 2:    Nf2          2 3090000.5 181970000.5        4091   0.2986
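
If you want to sanity-check the result on a small input, the same "nearest position on the same chromosome" logic can be written in base R. This is a different, much slower technique than the rolling join (it scans the reference once per segment boundary), and `nearest_position` is a hypothetical helper name used only for illustration. On exact ties it simply takes the first candidate.

```r
# Small copies of the example data so the sketch runs on its own.
seg_chk <- data.frame(
  Sample     = c("Nf1", "Nf2"),
  Chromosome = c(1L, 2L),
  Start      = c(3020000.5, 3100000.5),
  End        = c(195340000.5, 181980000.5)
)
ref_chk <- data.frame(
  Chromosome = c(1L, 1L, 2L, 2L),
  Position   = c(3010000.5, 195330000.5, 3090000.5, 181970000.5)
)

# Hypothetical helper: for each (chromosome, position) pair, return the
# reference position on the same chromosome with the smallest absolute
# distance. Returns NA if the chromosome is absent from the reference.
nearest_position <- function(chrom, pos, ref) {
  mapply(function(ch, p) {
    candidates <- ref$Position[ref$Chromosome == ch]
    if (length(candidates) == 0L) return(NA_real_)
    candidates[which.min(abs(candidates - p))]
  }, chrom, pos)
}

# Snap both boundaries to the reference.
seg_chk$Start <- nearest_position(seg_chk$Chromosome, seg_chk$Start, ref_chk)
seg_chk$End   <- nearest_position(seg_chk$Chromosome, seg_chk$End, ref_chk)
```

On the example data this gives the same Start and End values as the desired output in the question.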