Alwin Alwin - 2 months ago 17
R Question

How to add multiple columns to a dataframe from a custom function in R

I've created code that will take an input vector, create a dataframe based on the input, optimise some values and return some of these values. I'm now turning this into a function that will apply the calculations rowwise on an input dataframe. Below is a minimum working example of what I would like to achieve (my actual function would be too long to share here!):

# Randomly generated dataframe
df <- data.frame(a = rnorm(10, 0, 1), x = rnorm(10, 1, 3), y = rnorm(10, 2, 3))

# Function that takes multiple arguments and returns multiple values in a list
zsummary <- function(x, y) {
  # No valid standard deviation available: return missing values
  if (y < 0) return(list(NA, NA))
  z <- rnorm(10, x, abs(y))
  return(list(mean(z), sd(z)))
}

# Example of something that works using dplyr
# However, this results in a lot of function calls...
# especially if there were a lot of columns in the list...
library(dplyr)
df %>% rowwise() %>%
  mutate(mean = zsummary(x, y)[[1]], sd = zsummary(x, y)[[2]])

As you can see, I can't apply individual functions to each new column, as they all depend on a z vector that can only be generated once. I've looked around on SO already, but I haven't been able to find an answer yet. I think a solution would be to use one of the apply functions rather than something from dplyr, but I've honestly never fully understood the apply functions. I would also prefer to avoid solutions that use for loops, as I've tried this in previous projects and it becomes very slow for large dataframes!


We can use mapply for this. Since zsummary takes two arguments, mapply is a natural option: it passes the corresponding elements of 'x' and 'y' to zsummary, one pair at a time.

t(mapply(zsummary, df$x, df$y))
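To actually attach the result to the dataframe as new columns, one option is to have the function return a named numeric vector and bind the transposed matrix back on. This is only a sketch: zsummary_vec is a hypothetical variant that is not part of the original post, and the seed is there purely to make the run repeatable.

```r
set.seed(42)  # assumption: fixed seed so the sketch is repeatable
df <- data.frame(a = rnorm(10, 0, 1), x = rnorm(10, 1, 3), y = rnorm(10, 2, 3))

# Hypothetical variant of zsummary that returns a named numeric vector
zsummary_vec <- function(x, y) {
  if (y < 0) return(c(mean = NA_real_, sd = NA_real_))
  z <- rnorm(10, x, abs(y))
  c(mean = mean(z), sd = sd(z))
}

# mapply gives a 2 x 10 matrix (rows "mean" and "sd"); transpose to 10 x 2
res <- t(mapply(zsummary_vec, df$x, df$y))

# Bind the two new columns onto the original dataframe
out <- cbind(df, res)
```

Because z is drawn once per row inside the function, the mean and sd in each row come from the same sample.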

We can also change the function slightly and get the output with dplyr

zsummary <- function(x, y) {
  # Return a one-row dataframe so the columns arrive already named
  if (y < 0) return(data.frame(mean = NA, sd = NA))
  z <- rnorm(10, x, abs(y))
  data.frame(mean = mean(z), sd = sd(z))
}

library(dplyr)
df %>%
  rowwise() %>%
  do(data.frame(., zsummary(.$x, .$y)))
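If dplyr is not available, the same dataframe-returning function could be used from base R. The following is a rough sketch rather than the approach from the answer: it swaps in Map to collect one-row dataframes and stacks them with a single rbind call, which avoids growing a dataframe inside a loop. The seed is an assumption added for repeatability.

```r
set.seed(1)  # assumption: fixed seed so the sketch is repeatable
df <- data.frame(a = rnorm(10, 0, 1), x = rnorm(10, 1, 3), y = rnorm(10, 2, 3))

zsummary <- function(x, y) {
  if (y < 0) return(data.frame(mean = NA_real_, sd = NA_real_))
  z <- rnorm(10, x, abs(y))
  data.frame(mean = mean(z), sd = sd(z))
}

# Map calls zsummary once per row and returns a list of one-row dataframes
rows <- Map(zsummary, df$x, df$y)

# Stack them in a single rbind call, then attach to the original columns
out <- cbind(df, do.call(rbind, rows))
```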

Or, as we discussed in the comments, instead of having the function take multiple arguments, give it a single argument and use apply with MARGIN = 1 to apply it to each row.

zsummary2 <- function(v1) {
  # v1 is one row of the input: v1[1] is x, v1[2] is y
  if (v1[2] < 0) return(c(mean = NA, sd = NA))
  z <- rnorm(10, v1[1], abs(v1[2]))
  c(mean = mean(z), sd = sd(z))
}

t(apply(df[-1], 1, zsummary2))
#         mean        sd
# [1,]  1.403066 0.8757504
# [2,]  5.058188 5.1401507
# [3,]  4.288365 1.4194393
# [4,]  1.932829 6.7587054
# [5,] -1.864236 3.7587462
# [6,]        NA        NA
# [7,]  3.328629 1.3711950
# [8,] -2.347699 5.0449958
# [9,]  2.936615 1.7332283
#[10,]        NA        NA

NOTE: The values will differ on each run, as no seed was set before calling rnorm.
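For repeatable output, set.seed can be called before the random draws. The sketch below wraps the whole pipeline in a hypothetical helper, run_once (not part of the original answer), and assumes the summary is meant to be computed on z; running it twice with the same seed gives the same matrix.

```r
zsummary2 <- function(v1) {
  if (v1[2] < 0) return(c(mean = NA_real_, sd = NA_real_))
  z <- rnorm(10, v1[1], abs(v1[2]))
  c(mean = mean(z), sd = sd(z))
}

# Hypothetical helper: seed, build the dataframe, apply zsummary2 to each row
run_once <- function(seed) {
  set.seed(seed)
  df <- data.frame(a = rnorm(10, 0, 1), x = rnorm(10, 1, 3), y = rnorm(10, 2, 3))
  t(apply(df[-1], 1, zsummary2))
}

first  <- run_once(123)
second <- run_once(123)
identical(first, second)  # TRUE: same seed, same draws
```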