Neil Neil - 3 months ago 23
R Question

Filtering NA values in dplyr pipe and summerising only numeric column means

I have dataframe structure like below

$ Lead.Score : int 105 120 150 60 80 0 80 0 80 145
$ Average.Time.Per.Visit : num 0 83.8 4 0 0 ...
$ TotalVisits : int 0 5 2 1 2 0 2 0 2 4 ...
$ Page.Views.Per.Visit : num 0 2.5 2 1 1 0 2 0 2 4 ...
$ Average.Time.Per.Visit.1 : num 0 83.8 4 0 0 ...
$ Last.Activity : chr "Page Visited on Website" "Email
$ Last.Activity.Date : POSIXct, format: NA NA ...
$ First.Landing.Page.Submission.Date: POSIXct, format: NA NA ...
$ Created.On : POSIXct, format: "2016-07-31 17:11:00"


I want to calculate mean of all the numeric and int columns. How to do it in dplyr?

I did something like this

train_webdata %>%
select(which(sapply(., is.numeric))) %>%
group_by(Lead.Stage) %>%
summarise_each(funs(mean,na.rm=TRUE))


But it doesn't work.

Answer

We can use summarise_if

train_webdata %>%
      group_by(Lead.Stage) %>%
      summarise_if(is.numeric, mean, na.rm = TRUE)

Using a reproducible example

data(iris)  
iris[1:3, 1] <- NA #create some NA elements
iris$Sepal.Length <- as.character(iris$Sepal.Length) #for testing
iris %>% 
     group_by(Species) %>%
     summarise_if(is.numeric, mean, na.rm = TRUE)    
#    Species Sepal.Width Petal.Length Petal.Width
#      <fctr>       <dbl>        <dbl>       <dbl>
#1     setosa       3.428        1.462       0.246
#2 versicolor       2.770        4.260       1.326
#3  virginica       2.974        5.552       2.026

Regarding the OP's error, the group_by operation is called after the select. As the 'Lead.Stage' column is not showed in the OP's str, it is not clear whether it is a non-numeric column or not. If it is non-numeric, it gets removed after the select. So, we can do the select operation after the group_by step

iris %>%
    group_by(Species) %>%
    select(which(sapply(., is.numeric))) %>%
    summarise_each(funs(mean(., na.rm = TRUE)))
#        Species Sepal.Width Petal.Length Petal.Width
#      <fctr>       <dbl>        <dbl>       <dbl>
#1     setosa       3.428        1.462       0.246
#2 versicolor       2.770        4.260       1.326
#3  virginica       2.974        5.552       2.026