Visser Visser - 2 months ago 7
R Question

Flag if a requirement isn't met

I've been using quite a few bits of code that highlight when certain requirements aren't met, or tell me which entries are duplicates, but I haven't been able to figure out how to code if a requirement isn't met.

I am using a relatively average dataframe which includes dates. Normally there is supposed to be 24 entries per day, i.e. one made per hour. But in some occasions there is more or less than this. I am in need something that would tell me which entry number/data does not satisfy this criteria of 24 entries. Does anyone have any suggestions on how to approach this?

I've attached below an example of the code I've been using up until now (to fulfil the other functions).

td_1 <- read.csv("testdata_1.csv", header=TRUE)
td_1$OB_DATE <- as.Date(td_1$OB_DATE)

valueMissing <- seq(min(td_1$OB_DATE), max(td_1$OB_DATE), by = 1)
valueMissing[!valueMissing %in% td_1$OB_DATE]

countDup <- anyDuplicated(td_1$OB_DATE)
valueDup <- td1[duplicated(td_1$OB_DATE),]


Below is an example of the data (note that in reality there are over 500,000 rows, this is just a small sample)

OB_DATE AIR_TEMPERATURE
09/05/1973 00:00 10
09/05/1973 01:00 10.2
09/05/1973 02:00 10
09/05/1973 03:00 10
09/05/1973 04:00 9.9
09/05/1973 05:00 9.9
09/05/1973 06:00 10.2
09/05/1973 07:00 10.8
09/05/1973 08:00 12.2
09/05/1973 09:00 11.9
09/05/1973 10:00 12.7
09/05/1973 11:00 12.8
09/05/1973 12:00 13.4
09/05/1973 13:00 13.9
09/05/1973 14:00 14.6
09/05/1973 15:00 13.5
09/05/1973 16:00 13.5
09/05/1973 17:00 12.8
09/05/1973 18:00 12.2
09/05/1973 19:00 11.9
09/05/1973 20:00 11
09/05/1973 21:00 10.3
09/05/1973 22:00 10.2
09/05/1973 23:00 10
10/05/1973 00:00 10
10/05/1973 01:00 9.8
10/05/1973 02:00 9.6
10/05/1973 03:00 9.7
10/05/1973 04:00 9.5
10/05/1973 05:00 8.5
10/05/1973 06:00 7.5
10/05/1973 07:00 7.8
10/05/1973 08:00 8.8
10/05/1973 09:00 9.6
10/05/1973 10:00 10
10/05/1973 11:00 11
10/05/1973 12:00 8
10/05/1973 13:00 10.3
10/05/1973 14:00 12.2
10/05/1973 15:00 12.7
10/05/1973 16:00 12.7
10/05/1973 17:00 12.4
10/05/1973 17:00 12.4
10/05/1973 18:00 12
10/05/1973 18:00 12
10/05/1973 19:00 10.9
10/05/1973 20:00 9.4
10/05/1973 21:00 7.2
10/05/1973 22:00 6.7
10/05/1973 23:00 6.8
11/05/1973 00:00 5.7
11/05/1973 01:00 5.2
11/05/1973 02:00 4.7
11/05/1973 03:00 4.3
11/05/1973 04:00 4
11/05/1973 05:00 4.2
11/05/1973 06:00 5
11/05/1973 08:00 8.4
11/05/1973 09:00 9.2
11/05/1973 10:00 10.8
11/05/1973 11:00 11.7
11/05/1973 12:00 11.4
11/05/1973 13:00 12.9
11/05/1973 14:00 13.3
11/05/1973 15:00 13.3
11/05/1973 16:00 13.5
11/05/1973 17:00 13.6
11/05/1973 18:00 12.6
11/05/1973 19:00 11.8
11/05/1973 20:00 10.3
11/05/1973 21:00 9.7
11/05/1973 22:00 8.8
11/05/1973 23:00 7.6


In this situation, the data for the 10th is complete with 24 entries, but for the 11th there are only 26 entries, and the 12th there are 23 entries. I need something that will alert me to this fact, e.g. giving the dates 11/05/1973 and 12/05/1973 (similar to the output I have generated for my missing values code).

Answer

Using dplyr,

library(dplyr)
df %>% 
  group_by(dates = gsub('\\s+.*', '', OB_DATE)) %>% 
  summarise(new = n())
# A tibble: 3 × 2
#       dates   new
#       <chr> <int>
#1 1973-09-05    24
#2 1973-10-05    26
#3 1973-11-05    23

or similarly you can do something like,

df %>% 
 group_by(dates = gsub('\\s+.*', '', OB_DATE)) %>% 
 mutate(new = ifelse(n() == 24, 0, 1)) #Will give value of 1 to dates that don't satisfy the 24 criterion 
 select(-dates)
Comments