Question

Improve speed of queries using complex survey design in R

I have a large data set (more than 20 million obs) that I analyze with the survey package, and it is taking me ages to run simple queries. I have tried to find ways to speed up my code, but I would like to know if there are better ways to make this more efficient.

In my benchmark, I compare the speed of three commands using svyby / svytotal:


  1. A simple svyby / svytotal command

  2. Parallel computing with foreach %dopar% using 7 cores

  3. A compiled version of option 2



Spoiler: Option 3 is more than twice as fast as the first option, BUT it is not suitable for large data sets because it relies on parallel computing, which quickly hits memory limits. I run into this problem even with 16 GB of RAM. There are a few solutions to this memory limitation, but none of them are applicable to survey design objects.

Any ideas on how to make it faster and not crash because of memory limits?

My code with a reproducible example:



# Load Packages
library(survey)
library(data.table)
library(compiler)
library(foreach)
library(doParallel)
options(digits=3)

# Load Data
data(api)

# Convert data to data.table format (mostly to increase speed of the process)
apiclus1 <- as.data.table(apiclus1)

# Replicate each observation 1,000 times to enlarge the data set
apiclus1 <- apiclus1[rep(seq_len(nrow(apiclus1)), 1000), ]

# create a count variable
apiclus1[, Vcount := 1]

# create survey design
dclus1 <- svydesign(id=~dnum, weights=~pw, data=apiclus1, fpc=~fpc)


1) Simple code



t1 <- Sys.time()
table1 <- svyby(~Vcount,
                ~stype+dnum+cname,
                design = dclus1,
                svytotal)
T1 <- Sys.time() - t1


2) Parallel computing with foreach %dopar% using 7 cores



# in this option, I create a list with different subsets of the survey design
# that will be passed to different CPU cores to work at the same time

subdesign <- function(i) { subset(dclus1, dnum == i) }
groups <- unique(apiclus1$dnum)
list_subsets <- lapply(groups, subdesign) # apply the function to get all subsets in a list
i <- NULL

# Start Parallel
registerDoParallel(cores=7)

t2 <- Sys.time()
table2 <- foreach(i = list_subsets, .combine = rbind, .packages = "survey") %dopar% {
  options(survey.lonely.psu = "remove")
  svyby(~Vcount,
        ~stype+dnum+cname,
        design = i,
        svytotal)
}
T2 <- Sys.time() - t2
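
One possible way to soften the memory pressure described above is to run the same job over list_subsets in smaller chunks, so that only a few subsets and partial results are in flight at once. This is a minimal sketch rather than part of the original benchmark; the chunk size of 7 (one subset per registered core) is an assumption, and whether chunking helps depends on how the parallel backend shares data with its workers.

# hedged sketch: same query as option 2, processed chunk by chunk
# (chunk size of 7 is an assumption, one subset per registered core)
chunks <- split(list_subsets, ceiling(seq_along(list_subsets) / 7))

table2_chunked <- do.call(rbind, lapply(chunks, function(chunk) {
  foreach(i = chunk, .combine = rbind, .packages = "survey") %dopar% {
    options(survey.lonely.psu = "remove")
    svyby(~Vcount,
          ~stype+dnum+cname,
          design = i,
          svytotal)
  }
}))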


3) A compiled version of option 2



# make a function of the previous query
query2 <- function(list_subsets) {
  foreach(i = list_subsets, .combine = rbind, .packages = "survey") %dopar% {
    svyby(~Vcount,
          ~stype+dnum+cname,
          design = i,
          svytotal)
  }
}

# Compile the function to increase speed
query3 <- cmpfun(query2)

t3 <- Sys.time()
table3 <- query3(list_subsets)
T3 <- Sys.time() - t3


Results



> T1: 1.9 secs
> T2: 1.13 secs
> T3: 0.58 secs

barplot(c(T1, T2, T3),
        names.arg = c("1) simple table", "2) parallel", "3) compiled parallel"),
        ylab = "Seconds")



Answer

Thanks for laying this question out so well. Working efficiently with large survey datasets in R probably requires some basic SQL syntax (which is much easier to learn than R). MonetDB is the only large-data option compatible with the survey package; exploring other high-performance packages is (probably) not going to be fruitful. Generally, when I am exploring a huge dataset, I write SQL queries directly rather than using the survey package, because the standard error computations are computationally intensive (and variances aren't as useful during interactive data exploration). Notice how the final SQL timestamp blows away all of the other options. To calculate a quick weighted mean, use something like

SELECT by_column , SUM( your_column * the_weight ) / SUM( the_weight ) FROM yourdata GROUP BY by_column
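
For example, once the data are loaded into MonetDBLite (as done in the code further down), that sort of query can be sent straight from R with dbGetQuery(); this is only an illustration: the api00 column and the wtd_mean_api00 alias are examples chosen here, not something from the question.

# illustrative only: weighted mean of api00 by stype, computed in SQL
# against the apiclus1 table created in the code below
dbGetQuery( db ,
    "SELECT stype ,
            SUM( api00 * pw ) / SUM( pw ) AS wtd_mean_api00
     FROM apiclus1
     GROUP BY stype" )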

When you do need standard errors interactively, linearization (svydesign) is often more computationally intensive than replication (svrepdesign), but creating the replication designs (as I have done with jk1w_dclus1 below) sometimes requires more survey-methods familiarity than some users are comfortable with.
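
If constructing the replicate weights by hand is more than you need, one shortcut (a minimal sketch using the in-memory dclus1 built below, not tuned for a database-backed design) is as.svrepdesign(), which converts a linearization design into a replication design in one call:

# hedged sketch: convert the linearization design to a JK1 jackknife
# replication design instead of building the replicate weights manually
jk1_auto <- as.svrepdesign( dclus1 , type = "JK1" )
svytotal( ~vcount , jk1_auto )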

# Load Packages
library(MonetDB.R)
library(MonetDBLite)
library(DBI)   # suggested in comments and needed on OSX
library(survey)

# Load Data
data(api)

# Replicate each observation 10,000 times to enlarge the data set
apiclus1 <- apiclus1[rep(seq_len(nrow(apiclus1)), 10000), ]

# create a count variable
apiclus1$vcount <- 1

# create survey design
dclus1 <- svydesign(id=~dnum, weights=~pw, data=apiclus1, fpc=~fpc)


dbfolder <- tempdir()

db <- dbConnect( MonetDBLite() , dbfolder )
dbWriteTable( db , 'apiclus1' , apiclus1 )


db_dclus1 <-
    svydesign(
        weight = ~pw ,
        id = ~dnum ,
        data = "apiclus1" , 
        dbtype = "MonetDBLite" ,
        dbname = dbfolder ,
        fpc = ~fpc
    )

# you provided a design without strata,
# so type="JK1" matches that most closely.
# but see survey:::as.svrepdesign for other linearization-to-replication options
jk1w <- jk1weights( psu = apiclus1$dnum , fpc = apiclus1$fpc )

# after the replicate-weights have been constructed,
# here's the `svrepdesign` call..
jk1w_dclus1 <-
    svrepdesign(
        weight = ~pw ,
        type = "JK1" ,
        repweights = jk1w$repweights ,
        combined.weights = FALSE ,
        scale = jk1w$scale ,
        rscales = jk1w$rscales ,
        data = 'apiclus1' ,
        dbtype = "MonetDBLite" ,
        dbname = dbfolder
    )

# slow
system.time(res1 <- svyby(~vcount,~stype+dnum+cname,design = dclus1,svytotal))
# > system.time(res1 <- svyby(~vcount,~stype+dnum+cname,design = dclus1,svytotal))
   # user  system elapsed 
  # 17.40    2.86   20.27 


# faster
system.time(res2 <- svyby(~vcount,~stype+dnum+cname,design = db_dclus1,svytotal))
# > system.time(res2 <- svyby(~vcount,~stype+dnum+cname,design = db_dclus1,svytotal))
   # user  system elapsed 
  # 13.00    1.20   14.18 


# fastest
system.time(res3 <- svyby(~vcount,~stype+dnum+cname,design = jk1w_dclus1,svytotal))
# > system.time(res3 <- svyby(~vcount,~stype+dnum+cname,design = jk1w_dclus1,svytotal))
   # user  system elapsed 
  # 10.75    1.19   11.96 

# same standard errors across the board
all.equal( SE( res1 ) , SE( res2 ) )
all.equal( SE( res2 ) , SE( res3 ) )
# NOTE: the replicate-weighted design will be slightly different
# for certain designs.  however this technique is defensible
# and gets used in 
# https://github.com/ajdamico/asdfree/tree/master/Censo%20Demografico


# when you do not care about standard errors at all,
# learn some sql:
system.time( res4 <- dbGetQuery( db , "SELECT stype , dnum , cname , SUM( pw ) FROM apiclus1 GROUP BY stype , dnum , cname" ) )
# because this is near-instantaneous, no matter how much data you have.

# same numbers as res1:
all.equal( as.numeric( sort( coef( res1 ) ) ) , sort( res4$L1 ) )
# > system.time( res4 <- dbGetQuery( db , "SELECT stype , dnum , cname , SUM( pw ) FROM apiclus1 GROUP BY stype , dnum , cname" ) )
   # user  system elapsed 
   # 0.15    0.20    0.23