Xizam Xizam - 9 months ago 54
R Question

Will defining a start position when doing multiple matches on a sorted vector be faster?

I have a vector with 1 million integers, in ascending order and a vector with a subset of 1000 of these integers, also sorted.

What would be faster? Will the second version become faster if samplevec becomes bigger?

samplevec=sort(sample(1:10000000, 1000000))
matchvec=sort(sample(samplevec, 10000))

for (i in matchvec) {
index=match(i, samplevec)
print(index)
}


Or

samplevec=sort(sample(1:10000000, 1000000))
matchvec=sort(sample(samplevec, 10000))

previous=1
for (i in matchvec) {
index=match(i, samplevec[previous:length(samplevec)])
previous=index
print(index)
}

Answer Source

It's easy to benchmark. Here are just two points in time. Feel free to pimp this and automate to increase the number of points in time.

library(microbenchmark)

set.seed(357)

samplevec = sort(sample(1:1000, 1000))
matchvec = sort(sample(samplevec, 1000))

microbenchmark(
  version1 = {
    previous=1
    for (i in matchvec) {
      index=match(i, samplevec[previous:length(samplevec)])
      previous=index
    }},
  version2 = {
    for (i in matchvec) {
      index = match(i, samplevec)
    }}
)

Unit: milliseconds
     expr       min        lq      mean    median       uq
 version1 10.619105 10.711438 12.057713 10.811051 12.71902
 version2  2.419441  2.487062  2.853868  2.506603  2.56024

Here is the second point. This one runs a tad longer.

set.seed(357)

samplevec = sort(sample(1:100000, 100000))
matchvec = sort(sample(samplevec, 100000))

microbenchmark(
  version1 = {
    previous=1
    for (i in matchvec) {
      index=match(i, samplevec[previous:length(samplevec)])
      previous=index
    }},
  version2 = {
    for (i in matchvec) {
      index=match(i, samplevec)
    }}
)

Unit: seconds
     expr       min        lq      mean    median        uq
 version1 108.96069 109.61137 110.87308 110.70554 111.61337
 version2  15.63668  15.71792  16.20434  15.84646  16.07487