I've been using quite a few bits of code that highlight when certain requirements aren't met, or tell me which entries are duplicates, but I haven't been able to figure out how to code if a requirement isn't met.
I am using a relatively average dataframe which includes dates. Normally there is supposed to be 24 entries per day, i.e. one made per hour. But in some occasions there is more or less than this. I am in need something that would tell me which entry number/data does not satisfy this criteria of 24 entries. Does anyone have any suggestions on how to approach this?
I've attached below an example of the code I've been using up until now (to fulfil the other functions).
td_1 <- read.csv("testdata_1.csv", header=TRUE)
td_1$OB_DATE <- as.Date(td_1$OB_DATE)
valueMissing <- seq(min(td_1$OB_DATE), max(td_1$OB_DATE), by = 1)
valueMissing[!valueMissing %in% td_1$OB_DATE]
countDup <- anyDuplicated(td_1$OB_DATE)
valueDup <- td1[duplicated(td_1$OB_DATE),]
OB_DATE AIR_TEMPERATURE
09/05/1973 00:00 10
09/05/1973 01:00 10.2
09/05/1973 02:00 10
09/05/1973 03:00 10
09/05/1973 04:00 9.9
09/05/1973 05:00 9.9
09/05/1973 06:00 10.2
09/05/1973 07:00 10.8
09/05/1973 08:00 12.2
09/05/1973 09:00 11.9
09/05/1973 10:00 12.7
09/05/1973 11:00 12.8
09/05/1973 12:00 13.4
09/05/1973 13:00 13.9
09/05/1973 14:00 14.6
09/05/1973 15:00 13.5
09/05/1973 16:00 13.5
09/05/1973 17:00 12.8
09/05/1973 18:00 12.2
09/05/1973 19:00 11.9
09/05/1973 20:00 11
09/05/1973 21:00 10.3
09/05/1973 22:00 10.2
09/05/1973 23:00 10
10/05/1973 00:00 10
10/05/1973 01:00 9.8
10/05/1973 02:00 9.6
10/05/1973 03:00 9.7
10/05/1973 04:00 9.5
10/05/1973 05:00 8.5
10/05/1973 06:00 7.5
10/05/1973 07:00 7.8
10/05/1973 08:00 8.8
10/05/1973 09:00 9.6
10/05/1973 10:00 10
10/05/1973 11:00 11
10/05/1973 12:00 8
10/05/1973 13:00 10.3
10/05/1973 14:00 12.2
10/05/1973 15:00 12.7
10/05/1973 16:00 12.7
10/05/1973 17:00 12.4
10/05/1973 17:00 12.4
10/05/1973 18:00 12
10/05/1973 18:00 12
10/05/1973 19:00 10.9
10/05/1973 20:00 9.4
10/05/1973 21:00 7.2
10/05/1973 22:00 6.7
10/05/1973 23:00 6.8
11/05/1973 00:00 5.7
11/05/1973 01:00 5.2
11/05/1973 02:00 4.7
11/05/1973 03:00 4.3
11/05/1973 04:00 4
11/05/1973 05:00 4.2
11/05/1973 06:00 5
11/05/1973 08:00 8.4
11/05/1973 09:00 9.2
11/05/1973 10:00 10.8
11/05/1973 11:00 11.7
11/05/1973 12:00 11.4
11/05/1973 13:00 12.9
11/05/1973 14:00 13.3
11/05/1973 15:00 13.3
11/05/1973 16:00 13.5
11/05/1973 17:00 13.6
11/05/1973 18:00 12.6
11/05/1973 19:00 11.8
11/05/1973 20:00 10.3
11/05/1973 21:00 9.7
11/05/1973 22:00 8.8
11/05/1973 23:00 7.6
Using dplyr
,
library(dplyr)
df %>%
group_by(dates = gsub('\\s+.*', '', OB_DATE)) %>%
summarise(new = n())
# A tibble: 3 × 2
# dates new
# <chr> <int>
#1 1973-09-05 24
#2 1973-10-05 26
#3 1973-11-05 23
or similarly you can do something like,
df %>%
group_by(dates = gsub('\\s+.*', '', OB_DATE)) %>%
mutate(new = ifelse(n() == 24, 0, 1)) #Will give value of 1 to dates that don't satisfy the 24 criterion
select(-dates)