20salmon 20salmon - 2 months ago 10
R Question

Making my R function faster

EDIT: Sorry for the low quality post. I should have taken more time to present this to you. The post has been edited, I've added a working syntax example for the whole thing. Thank you to everybody who has offered advice so far.

EDIT2: Found that script is only slow on the other computer. Probably caused by some local issue, or the REPL.

I made this function. It produces frequency tables of values in a labelled (labelled & haven packages) data frame. It works, but I intend to use it on data frames with a lot of columns and I think it runs a bit slow; a user may think R has crashed when running it over 100+ columns, so I would like to speed it up.

The point of this script is to produce output that helps me look for processing errors in a survey dataset. It's a bit fiddly because I want to know about answer frequencies, and evaluate the shape of the value labels at the same time. So this script produces one frequency table per variable, revealing frequencies, unused labels, and values without value labels. This will hopefully be clearer when looking at the output from the script.

I would be grateful if you could point out some ways to make this more efficient:

# demonstration dataset
library(knitr)
library(data.table)
library(labelled)

df <- data.frame(q1 = rep(1:6, 3), q2 = rep(6:1, 3))
val_labels(df[, c("q1", "q2")]) <- c(YES = 1, MAYBE = 2, NO = 3, DK = 4, MISSING=5)
val_label(df$q2, 1) <- NULL

# Produce a frequency table over values and labels in a labelled-class dataframe object
# --------------------------------------------------------------------------------------------------
# Example: freqlab(ds[[1]]) or freqlab(ds[1:10]) or freqlab(ds)
# Wrong: freqlab(ds[1])

freqlab <- function(x){

# If the function is called on double brackets, eg. freqlab(ds[[11]])
if (!is.list(x)){

# Make a frequency distribution, put it in a data.table
xFreq <- data.table(table(x))
names(xFreq) <- c("Value", "Frequency")
class(xFreq[[1]]) <- "numeric"
setkey(xFreq, Value)

# Put the value labels in another data.table
if (!is.null(val_labels(x))){
xLab <- data.table(val_labels(x), names(val_labels(x)))
names(xLab) <- c("Value", "Label")
setkey(xLab, Value)
} else {
# If the variable does not have labels, create one to avoid errors
xLab <- data.table(xFreq[[1,1]], "** UNLABELLED **")
names(xLab) <- c("Value", "Label")
setkey(xLab, Value)
}

# Perform a FULL OUTER JOIN
outTable <- merge(xFreq, xLab, all = TRUE)

# Arrange values in ascending order of absolute value
outTable <- arrange(outTable, abs(outTable[[1]]))

# Edit the Label column for value cases with no label
outTable[[2]][is.na(outTable[[2]])] <- 0
outTable[[3]][is.na(outTable[[3]])] <- "** UNLABELLED **"

# If the output has more than 25 rows, cut it short
if (dim(outTable)[1] > 25){
outTable <- outTable[1:25]
}

# Output the table
print(kable(outTable, format = "rst", align = "l"))


# If the function is called on a list of variables, eg. freqlab(ds[10:11]),
# do the same steps as above, looping through all the input variables
} else {

for (y in 1:length(x)){

xFreq <- data.table(table(x[[y]]))
names(xFreq) <- c("Value", "Frequency")
class(xFreq[[1]]) <- "numeric"
setkey(xFreq, Value)

if (!is.null(val_labels(x[[y]]))){
xLab <- data.table(val_labels(x[[y]]), names(val_labels(x[[y]])))
names(xLab) <- c("Value", "Label")
setkey(xLab, Value)
} else {
xLab <- data.table(xFreq[[1,1]], "** UNLABELLED **")
names(xLab) <- c("Value", "Label")
setkey(xLab, Value)
}

outTable <- merge(xFreq, xLab, all = TRUE)
outTable <- arrange(outTable, abs(outTable[[1]]))
outTable[[2]][is.na(outTable[[2]])] <- 0
outTable[[3]][is.na(outTable[[3]])] <- "** UNLABELLED **"

if (dim(outTable)[1] > 25){
outTable <- outTable[1:25]
}

# Extra information printed when function is called on a list of variables
cat("Name:\t", names(x[y]),"\n")
print(kable(outTable, format = "rst", align = "l"))
cat(rep("-", 80), sep='', "\n\n")
}
}
}


Example of output:

> freqlab(df)
Name: q1


===== ========= ================
Value Frequency Label
===== ========= ================
1 3 YES
2 3 MAYBE
3 3 NO
4 3 DK
5 3 MISSING
6 3 ** UNLABELLED **
===== ========= ================
--------------------------------------------------------------------------------

Name: q2


===== ========= ================
Value Frequency Label
===== ========= ================
1 3 ** UNLABELLED **
2 3 MAYBE
3 3 NO
4 3 DK
5 3 MISSING
6 3 ** UNLABELLED **
===== ========= ================
--------------------------------------------------------------------------------

Answer Source

It is not easy to help you without toy data, more simple code, and a clear explanation of input and output. Anyway, a first step is typically to profile your code in order determine the bottlenecks which consume time. See ?Rprof for the the Rprof()-function which provides profiling information.

This small example illustrates how to use it:

square <- function (x) {
 Sys.sleep(3)
 return(x^2)
}

add <- function (x, y) {
 Sys.sleep(1)
  return(x + y)
}

complicatedFunction <- function(x, y) {
  res <- square(add(square(x), square(y)))
  return(res)
}

# Try to profile out "complicated" function
Rprof()  # Start of profiling
res <- complicatedFunction(2, 5)  # Function to profile
Rprof(NULL) # End of profiling
summaryRprof() # Show results
#$by.self
#            self.time self.pct total.time total.pct
#"Sys.sleep"      9.54      100       9.54       100
#
#$by.total
#                      total.time total.pct self.time self.pct
#"Sys.sleep"                 9.54    100.00      9.54      100
#"complicatedFunction"       9.54    100.00      0.00        0
#"square"                    9.54    100.00      0.00        0
#"add"                       6.58     68.97      0.00        0
#
#$sample.interval
#[1] 0.02
#
#$sampling.time
#[1] 9.54

Here you see how long time is spent inside the called functions of the function --- in this example Sys.sleep clearly takes up all the time. See ?summaryRprof for more info on how to understand this output.