Zico - 2 months ago 13

R Question

I have below dataset

`w`

`x`

`Case 1:`

x = 4

w = c(1,2,4,4,4,4,6,7,8,9,10,11,12,14,15)

Case2:

x = 12

w = c(1,2,4,4,4,4,6,7,8,9,10,11,12,14,15)

I want to create a function which will search for

`x`

`w`

`x`

`w`

`create_chunk <- function(val, tab, L=1L, H=length(tab))`

{

if(H >= L)

{

mid = L + ((H-L)/2)

## If the element is present within middle length

if(tab[mid] > val)

{

## subset the original data in reduced size and again do mid position value checking

## then subset the data

} else

{

mid = mid + (mid/2)

## Increase the mid position to go for right side checking

}

}

}

In the output I am looking for below:

`Output for Case 1:`

Dataset containing: 1,2,4,4,4,4

Output for Case 2:

Dataset containing: 1,2,4,4,4,4,6,7,8,9,10,11,12

Please note:

1. Dataset may contain duplicate values for search key and

all the duplicate values are expected in the output dataset.

2. I have huge size datasets (say around 2M rows) from

where I am trying to subset smaller dataset as per my requirement of search key.

Input Data:

`date value size stockName`

1 2016-08-12 12:44:43 10093.40 4 HWA IS Equity

2 2016-08-12 12:44:38 10093.35 2 HWA IS Equity

3 2016-08-12 12:44:47 10088.00 2 HWA IS Equity

4 2016-08-12 12:44:52 10089.95 1 HWA IS Equity

5 2016-08-12 12:44:53 10089.95 1 HWA IS Equity

6 2016-08-12 12:44:54 10088.95 1 HWA IS Equity

Search Key is:

`10089.95`

Expected Output is:

`date value size stockName`

1 2016-08-12 12:44:47 10088.00 2 HWA IS Equity

2 2016-08-12 12:44:54 10088.95 1 HWA IS Equity

3 2016-08-12 12:44:52 10089.95 1 HWA IS Equity

4 2016-08-12 12:44:53 10089.95 1 HWA IS Equity

Answer

You could do this which takes care of duplicate values. In case of duplicates, the highest position of which will be returned. Please note that `A`

should be in non-decreasing order.

```
binSearch <- function(A, value, left=1, right=length(A)){
if (left > right)
return(-1)
middle <- (left + right) %/% 2
if (A[middle] == value){
while (A[middle] == value)
middle<-middle+1
return(middle-1)
}
else {
if (A[middle] > value)
return(binSearch(A, value, left, middle - 1))
else
return(binSearch(A, value, middle + 1, right))
}
}
w[1:binSearch(w,x1)]
# [1] 1 2 4 4 4 4
w[1:binSearch(w,x2)]
# [1] 1 2 4 4 4 4 6 7 8 9 10 11 12
```

However, as its mentioned in the comments, you could simply use `findInterval`

to achieve the same:

```
w[1:findInterval(x1,w)]
```

As you know, binary search has order of `log(n)`

but as stated in `?findInterval`

, it also benefits from `log(n)`

since the length of the first argument is one:

The function findInterval finds the index of one vector x in another, vec, where the latter must be non-decreasing. Where this is trivial, equivalent to apply( outer(x, vec, ">="), 1, sum), as a matter of fact, the internal algorithm uses interval search ensuring O(n * log(N)) complexity where n <- length(x) (and N <- length(vec)). For (almost) sorted x, it will be even faster, basically O(n).

**EDIT**

As per your edit and your new setting, you could do this (suppose your data is in `df`

):

```
o <- order(df$value)
rows <- o[1:findInterval(key, df$value[o])]
df[rows,]
```

**data**

```
x1 <- 4
x2 <- 12
w <- c(1,2,4,4,4,4,6,7,8,9,10,11,12,14,15)
key <- 10089.95
```