Xavier Prudent Xavier Prudent - 1 month ago 9
R Question

Merging complementary rows of a dataframe with R

I have such a data frame

0 weekday day month year hour basal bolus carb period.h
1 Tuesday 01 03 2016 0.0 0.25 NA NA 0
2 Tuesday 01 03 2016 10.9 NA NA 67 10
3 Tuesday 01 03 2016 10.9 NA 4.15 NA 10
4 Tuesday 01 03 2016 12.0 0.30 NA NA 12
5 Tuesday 01 03 2016 17.0 0.50 NA NA 17
6 Tuesday 01 03 2016 17.6 NA NA 33 17
7 Tuesday 01 03 2016 17.6 NA 1.35 NA 17
8 Tuesday 01 03 2016 18.6 NA NA 44 18
9 Tuesday 01 03 2016 18.6 NA 1.80 NA 18
10 Tuesday 01 03 2016 18.9 NA NA 17 18
11 Tuesday 01 03 2016 18.9 NA 0.70 NA 18
12 Tuesday 01 03 2016 22.0 0.40 NA NA 22
13 Wednesday 02 03 2016 0.0 0.25 NA NA 0
14 Wednesday 02 03 2016 9.7 NA NA 39 9
15 Wednesday 02 03 2016 9.7 NA 2.65 NA 9
16 Wednesday 02 03 2016 11.2 NA NA 13 11
17 Wednesday 02 03 2016 11.2 NA 0.30 NA 11
18 Wednesday 02 03 2016 12.0 0.30 NA NA 12
19 Wednesday 02 03 2016 12.0 NA NA 16 12
20 Wednesday 02 03 2016 12.0 NA 0.65 NA 12


If you look at the lines 2 and 3, you notice that they correspond exactly to the same day & time: just for the line #2 the "carb" is not NA, and the "bolus" is not NA (These are data about diabete).

I want to merge such lines into a single one:

2 Tuesday 01 03 2016 10.9 NA NA 67 10
3 Tuesday 01 03 2016 10.9 NA 4.15 NA 10


->

2 Tuesday 01 03 2016 10.9 NA 4.15 67 10


I could of course do a brutal double loop over each line, but I look for a cleverer and faster way.

Answer

You can group your data frame by the common identifier columns weekday, day, month, year, hour, period.h here and then sort and take the first element from the remaining columns which you would like to merge, sort() function by default will remove NAs in the vector to be sorted and thus you will end up with non-NA elements for each column within each group; if all elements in a column are NA, sort(col)[1] returns NA:

library(dplyr)
df %>% 
       group_by(weekday, day, month, year, hour, period.h) %>% 
       summarise_all(funs(sort(.)[1]))

#      weekday   day month  year  hour period.h basal bolus  carb
#       <fctr> <int> <int> <int> <dbl>    <int> <dbl> <dbl> <int>
# 1    Tuesday     1     3  2016   0.0        0  0.25    NA    NA
# 2    Tuesday     1     3  2016  10.9       10    NA  4.15    67
# 3    Tuesday     1     3  2016  12.0       12  0.30    NA    NA
# 4    Tuesday     1     3  2016  17.0       17  0.50    NA    NA
# 5    Tuesday     1     3  2016  17.6       17    NA  1.35    33
# 6    Tuesday     1     3  2016  18.6       18    NA  1.80    44
# 7    Tuesday     1     3  2016  18.9       18    NA  0.70    17
# 8    Tuesday     1     3  2016  22.0       22  0.40    NA    NA
# 9  Wednesday     2     3  2016   0.0        0  0.25    NA    NA
# 10 Wednesday     2     3  2016   9.7        9    NA  2.65    39
# 11 Wednesday     2     3  2016  11.2       11    NA  0.30    13
# 12 Wednesday     2     3  2016  12.0       12  0.30  0.65    16

Instead of sort(), maybe a more appropriate function to use here is na.omit():

df %>% group_by(weekday, day, month, year, hour, period.h) %>% 
       summarise_all(funs(na.omit(.)[1]))
Comments