Xizam - 9 months ago 54

R Question

I have a vector with 1 million integers, in ascending order and a vector with a subset of 1000 of these integers, also sorted.

What would be faster? Will the second version become faster if samplevec becomes bigger?

`samplevec=sort(sample(1:10000000, 1000000))`

matchvec=sort(sample(samplevec, 10000))

for (i in matchvec) {

index=match(i, samplevec)

print(index)

}

Or

`samplevec=sort(sample(1:10000000, 1000000))`

matchvec=sort(sample(samplevec, 10000))

previous=1

for (i in matchvec) {

index=match(i, samplevec[previous:length(samplevec)])

previous=index

print(index)

}

Answer Source

It's easy to benchmark. Here are just two points in time. Feel free to pimp this and automate to increase the number of points in time.

```
library(microbenchmark)
set.seed(357)
samplevec = sort(sample(1:1000, 1000))
matchvec = sort(sample(samplevec, 1000))
microbenchmark(
version1 = {
previous=1
for (i in matchvec) {
index=match(i, samplevec[previous:length(samplevec)])
previous=index
}},
version2 = {
for (i in matchvec) {
index = match(i, samplevec)
}}
)
Unit: milliseconds
expr min lq mean median uq
version1 10.619105 10.711438 12.057713 10.811051 12.71902
version2 2.419441 2.487062 2.853868 2.506603 2.56024
```

Here is the second point. This one runs a tad longer.

```
set.seed(357)
samplevec = sort(sample(1:100000, 100000))
matchvec = sort(sample(samplevec, 100000))
microbenchmark(
version1 = {
previous=1
for (i in matchvec) {
index=match(i, samplevec[previous:length(samplevec)])
previous=index
}},
version2 = {
for (i in matchvec) {
index=match(i, samplevec)
}}
)
Unit: seconds
expr min lq mean median uq
version1 108.96069 109.61137 110.87308 110.70554 111.61337
version2 15.63668 15.71792 16.20434 15.84646 16.07487
```