aju_k aju_k - 3 months ago 9
R Question

R select email addresses with > 2 similar start dates in a year

I would like to set unique email addresses with more than 2 similar start years in a dataframe in R to NA in a new column.

start_year email
2016 a@a.com
2016 a@a.com
2016 a@a.com
2015 a@a.com
2015 a@a.com
2014 a@a.com
2015 b@b.com
2014 b@b.com
2014 b@b.com
2015 c@c.com


Result (a@a.com has 3 similar start years 2016 and is therefore set to NA in a new column):

start_year email email_new
2016 a@a.com NA
2016 a@a.com NA
2016 a@a.com NA
2015 a@a.com NA
2015 a@a.com NA
2014 a@a.com NA
2015 b@b.com b@b.com
2014 b@b.com b@b.com
2014 b@b.com b@b.com
2015 c@c.com c@c.com


So far I have this which gives an error: All select() inputs must resolve to integer column positions.:

result <- df %>%
group_by(email) %>%
select(length(unique(start_year)) > 2)


Any help would be greatly appreciated.

Answer

Using dplyr, as far as I understood it, you have two conditions for converting email to NA.

1)At least 3 of the start_year are the same

2) There are more than 2 observations

 df %>% 
    group_by(email) %>% 
    mutate(new = ifelse(length(which(table(start_year) > 2)) > 0 & n()>2, 'NA', as.character(email)))

#Source: local data frame [7 x 3]
#Groups: email [3]

#  start_year   email     new
#       <int>   <chr>   <chr>
#1       2016 a@a.com    <NA>
#2       2016 a@a.com    <NA>
#3       2016 a@a.com    <NA>
#4       2015 b@b.com b@b.com
#5       2014 b@b.com b@b.com
#6       2014 b@b.com b@b.com
#7       2015 c@c.com c@c.com

Adding another 2014 for b@b.com, thus making 3 same years for that email, then

df1 %>% 
     group_by(email) %>% 
     mutate(new = ifelse(length(which(table(start_year) > 2)) > 0 & n()>2, 'NA', as.character(email)))

#Source: local data frame [8 x 3]
#Groups: email [3]

#  start_year   email     new
#       <dbl>   <chr>   <chr>
#1       2016 a@a.com      NA
#2       2016 a@a.com      NA
#3       2016 a@a.com      NA
#4       2015 b@b.com      NA
#5       2014 b@b.com      NA
#6       2014 b@b.com      NA
#7       2014 b@b.com      NA
#8       2015 c@c.com c@c.com

DATA

dput(df)
structure(list(start_year = c(2016L, 2016L, 2016L, 2015L, 2014L, 
2014L, 2015L), email = structure(c(1L, 1L, 1L, 2L, 2L, 2L, 3L
), .Label = c("a@a.com", "b@b.com", "c@c.com"), class = "factor")), .Names = c("start_year", 
"email"), class = "data.frame", row.names = c(NA, -7L))

dput(df1)
structure(list(start_year = c(2016, 2016, 2016, 2015, 2014, 2014, 
2014, 2015), email = structure(c(1L, 1L, 1L, 2L, 2L, 2L, 2L, 
3L), .Label = c("a@a.com", "b@b.com", "c@c.com"), class = "factor")), row.names = c(NA, 
-8L), .Names = c("start_year", "email"), class = "data.frame")
Comments