sherbert-lemon - 1 month ago 5x

R Question

I have a data.table that is quite large and have been attempting to return a list or vector of counts of specific default values in each column (they vary per column). It is organized as such:

`set.seed(1);`

DT = as.data.table(matrix(sample(1:100, 100*100, TRUE), 100, 100))

#DT output below

param1 param2 param3 ... param100 #column names

1 1 1 ... 1 #first row = default values

. #elems in remaining rows are random

. # a param can be set to non default

1 666 1 ... 143 # or default values within a column

.

.

10000 1 1 ... 420

I am curious to know what a data.table way of doing this is? I have been sifting through past documentation, and am attempting to avoid for loops and methods that are intensive in memory and computation (I've attempted to use filter, lapply, and grouping, without luck).

An analogous example of what I am ideally looking for is with counting the number of non-NA values that exist per column:

`count <- colSums(!is.na(DT))`

#which outputs the following:

print(count)

param1 param2 param3 ... param177

1 292 0 7

Is there a way to do this similar to the

`colSums(!is.na(DT))`

Answer

Well, I think this is what's meant:

```
set.seed(1)
DT = as.data.table(matrix(sample(1:1e5, 1e8, TRUE), ncol = 10))
# vector scan
sapply(DT, function(x) sum(x == x[1L]))
# V1 V2 V3 V4 V5 V6 V7 V8 V9 V10
# 124 100 111 101 113 101 94 108 79 112
# binary search
sapply(names(DT), function(x){
q = substitute(x == x[1L], list(x=as.symbol(x)))
DT[eval(q), .N]
})
# V1 V2 V3 V4 V5 V6 V7 V8 V9 V10
# 124 100 111 101 113 101 94 108 79 112
```

Benchmarks:

- Vector scan takes .22 seconds
- Binary search
*initially*takes 1.8 seconds - Binary search
*every time after that*is instant

The speedup is thanks to *indices* on the data.table, more specifically on *auto indexing*. See `?indices`

and read through the vignettes (this one appears in the fourth). Indices will speed up other operations on the data as well. To create them from the start, do `for (nm in names(DT)) setindexv(DT, nm)`

, though of course this will cost ~1.8 seconds.

*Note:* If your data are floating-point, you will run into trouble no matter how you approach this. Floats don't play nice with equality tests like `x==x[1L]`

.

Source (Stackoverflow)

Comments