Kasia Kulma Kasia Kulma - 2 months ago 12
R Question

creating new variable using conditional mutating returns NAs for a subset of rows

I'm trying to create a new factor variable based on logical conditioning on the date variable:

'data.frame': 364458 obs. of 2 variables:

$ first_order_date: Date, format: "2015-11-24" "2015-12-15" "2015-06-10" "2015-12-22" ...
$ order_date : Date, format: "2016-02-09" "2016-03-15" "2015-12-22" "2015-12-28" ...



library(moisaic)

bb =
df %>%
mutate(days_since_first = as.integer(order_date - first_order_date),
time_after_first = derivedFactor(
"<3months" = order_date <= first_order_date +months(3),
"3-6months" = (order_date <= first_order_date +months(6) & order_date > first_order_date +months(3)),

...

"15-18months" = (order_date <= first_order_date +months(18) & order_date > first_order_date +months(15)),
"18-21months" = (order_date <= first_order_date +months(21) & order_date > first_order_date +months(18)),
.default = "21month+"))


after running it, I received warnings:

Warning messages:
1: In base::max(x, ..., na.rm = na.rm) :
no non-missing arguments to max; returning -Inf


it still worked fairly well in most cases,

> head(bb[!is.na(bb$time_after_first), ])

first_order_date order_date days_since_first time_after_first
1 2015-11-24 2016-02-09 77 <3months
2 2015-12-15 2016-03-15 91 <3months
3 2015-06-10 2015-12-22 195 6-9months
4 2015-12-22 2015-12-28 6 <3months
5 2016-01-21 2016-05-29 129 3-6months
6 2016-03-12 2016-05-30 79 <3months


but not all

sum(is.na(bb$time_after_first))
[1] 7174


I can't see any pattern why these particular entries don't work

summary(bb[is.na(bb$time_after_first), ])

first_order_date order_date days_since_first
Min. :2015-01-31 Min. :2015-01-31 Min. : 0.0
1st Qu.:2015-08-31 1st Qu.:2016-02-07 1st Qu.: 41.5
Median :2015-11-30 Median :2016-03-31 Median :106.0
Mean :2015-11-05 Mean :2016-03-12 Mean :128.3
3rd Qu.:2016-01-31 3rd Qu.:2016-05-13 3rd Qu.:175.0
Max. :2016-05-31 Max. :2016-09-04 Max. :546.0

time_after_first
<3months : 0
3-6months : 0
6-9months : 0
9-12months : 0
12-15months: 0
(Other) : 0
NA's :6455


Also, I tried to use ordinary ifelse() statements to achieve this,

bb2 =
all_orders3 %>% select(user_id, order_id, first_order_date, order_date2, category) %>%
mutate(days_since_first = as.integer(order_date - first_order_date),
time_after_first= as.factor(ifelse(order_date <= first_order_date +months(3), "<3months",
ifelse(order_date <= first_order_date +months(6) & order_date > first_order_date +months(3), "3-6months",


....

ifelse(order_date <= first_order_date +months(24) & order_date > first_order_date +months(21), "21-24months",
"24months+"))))))))))


with no warnings received but with more NA's generated and still no clear pattern why this is happening

summary(bb2[is.na(bb2$time_after_first), ])

first_order_date order_date days_since_first
Min. :2015-01-31 Min. :2015-01-31 Min. : 0.0
1st Qu.:2015-08-31 1st Qu.:2016-02-10 1st Qu.: 52.0
Median :2015-11-30 Median :2016-04-08 Median :123.0
Mean :2015-10-27 Mean :2016-03-25 Mean :150.2
3rd Qu.:2016-01-31 3rd Qu.:2016-05-29 3rd Qu.:211.0
Max. :2016-05-31 Max. :2016-09-04 Max. :582.0

time_after_first
<3months : 0
12-15months: 0
15-18months: 0
18-21months: 0
3-6months : 0
(Other) : 0
NA's :7407


here's the link with a sample of my data

Any useful suggestions how to overcome this will be welcome, thanks!

Answer

Updated for sample v2:

#Read csv...
dat = read.csv("data/07092016-sample_V2.csv")

library(dplyr)
dat = dat %>%
  mutate(t1 = as.Date(first_order_date, "%d/%m/%Y"),
         t2 = as.Date(order_date, "%d/%m/%Y"),
         days = as.numeric(difftime(t2, t1, units = "days"))) %>% 
  select(t1:days)

head(dat)
#           t1         t2 days
# 1 2016-03-02 2016-05-29   88
# 2 2015-04-25 2015-05-01    6
# 3 2015-06-29 2015-07-04    5
# 4 2015-09-09 2016-02-05  149
# 5 2016-01-08 2016-02-15   38
# 6 2016-04-17 2016-04-21    4

breaks = c(-Inf, 90, 180, Inf)
labels = c("<3 months", "3-6 months", "6+ months")

days = cut(dat$days, breaks, labels)
sum(is.na(days))
# [1] 0
Comments