Rory Shaw - 3 months ago
R Question

R: changing the variable value for one factor level to the daily mean of the other factor levels

I have the following dataframe:

> df
Time_Start Time_End Cut Plot Inlet_NH4N Outlet_NH4N Pump_reading Anemometer_reading
1 2016-05-05 11:19:00 2016-05-06 09:30:00 1 1 0.2336795 0.30786350 79846.9 6296343
2 2016-05-05 11:25:00 2016-05-06 09:35:00 1 3 1.0905045 0.50816024 78776.5 333116
3 2016-05-05 11:33:00 2016-05-06 09:39:00 1 6 1.3538576 0.34866469 79585.1 8970447
4 2016-05-05 11:37:00 2016-05-06 09:51:00 1 7 0.6862018 0.34124629 80043.1 8436546
5 2016-05-05 11:43:00 2016-05-06 09:43:00 1 9 0.2633531 0.73813056 79227.7 9007387
6 2016-05-05 11:48:00 2016-05-06 09:47:00 1 12 0.5934718 1.10905045 79121.5 8070785
7 2016-05-06 09:33:00 2013-05-07 10:13:00 1 1 0.5213904 2.46791444 88800.2 7807792
8 2016-05-06 09:38:00 2013-05-07 10:23:00 1 3 0.1684492 0.22905526 89123.0 14127
9 2016-05-06 09:42:00 2013-05-07 10:28:00 1 6 0.4393939 0.09001782 89157.6 9844162
10 2016-05-06 09:53:00 2013-05-07 10:34:00 1 7 0.1470588 1.03832442 88852.6 9143733
11 2016-05-06 09:45:00 2013-05-07 10:40:00 1 9 0.1114082 0.32531194 89635.6 10122720
12 2016-05-06 09:50:00 2013-05-07 10:43:00 1 12 0.6853832 2.51426025 89582.6 8924198


Here is the str output:

> str(df)

'data.frame': 12 obs. of 8 variables:
$ Time_Start : POSIXct, format: "2016-05-05 11:19:00" "2016-05-05 11:25:00" "2016-05-05 11:33:00" ...
$ Time_End : POSIXct, format: "2016-05-06 09:30:00" "2016-05-06 09:35:00" "2016-05-06 09:39:00" ...
$ Cut : Factor w/ 1 level "1": 1 1 1 1 1 1 1 1 1 1 ...
$ Plot : Factor w/ 8 levels "1","3","6","7",..: 1 2 3 4 5 6 1 2 3 4 ...
$ Inlet_NH4N : num 0.234 1.091 1.354 0.686 0.263 ...
$ Outlet_NH4N : num 0.308 0.508 0.349 0.341 0.738 ...
$ Pump_reading : num 79847 78777 79585 80043 79228 ...
$ Anemometer_reading: int 6296343 333116 8970447 8436546 9007387 8070785 7807792 14127 9844162 9143733 ...


This is a small segment of a larger dataset.
I have a problem with these data in that the Anemometer_reading for plot "3" is always much lower than for the other plots. This is due to a mechanical problem. I want to remove this artifact and think that the best way to do this is to take an average of the Anemometer_reading for all the plots outwith plot "3". I want to calculate this average on a daily basis.

I can calculate the daily Anemometer_reading average, excluding plot "3", like this:

library(dplyr)
> df_avg <- df %>% filter(Plot != "3") %>% group_by(as.Date(Time_End)) %>% summarise(Anemometer_mean = mean(Anemometer_reading))
> df_avg
Source: local data frame [2 x 2]


as.Date(Time_End) Anemometer_mean
<date> <dbl>
1 2013-05-07 9168521
2 2016-05-06 8156302


I'm not sure how to go about using the resulting dataframe to replace the Anemometer_reading values from plot "3".
Can anyone point me in the right direction please?
Thanks

Answer

I would follow @roland's comment. However, if you want to see how to do what you asked with dplyr:

result <- df %>% group_by(as.Date(Time_End)) %>% 
                 mutate(Anemometer_mean = mean(Anemometer_reading[Plot != "3"])) %>% 
                 mutate(Anemometer_reading = replace(Anemometer_reading, Plot == "3", first(Anemometer_mean))) %>%
                 ungroup() %>% select(-`as.Date(Time_End)`, -Anemometer_mean)
print(result)
## A tibble: 12 x 8
##            Time_Start            Time_End   Cut  Plot Inlet_NH4N Outlet_NH4N Pump_reading Anemometer_reading
##                <fctr>              <fctr> <int> <int>      <dbl>       <dbl>        <dbl>              <dbl>
##1  2016-05-05 11:19:00 2016-05-06 09:30:00     1     1  0.2336795  0.30786350      79846.9            6296343
##2  2016-05-05 11:25:00 2016-05-06 09:35:00     1     3  1.0905045  0.50816024      78776.5            8156302
##3  2016-05-05 11:33:00 2016-05-06 09:39:00     1     6  1.3538576  0.34866469      79585.1            8970447
##4  2016-05-05 11:37:00 2016-05-06 09:51:00     1     7  0.6862018  0.34124629      80043.1            8436546
##5  2016-05-05 11:43:00 2016-05-06 09:43:00     1     9  0.2633531  0.73813056      79227.7            9007387
##6  2016-05-05 11:48:00 2016-05-06 09:47:00     1    12  0.5934718  1.10905045      79121.5            8070785
##7  2016-05-06 09:33:00 2013-05-07 10:13:00     1     1  0.5213904  2.46791444      88800.2            7807792
##8  2016-05-06 09:38:00 2013-05-07 10:23:00     1     3  0.1684492  0.22905526      89123.0            9168521
##9  2016-05-06 09:42:00 2013-05-07 10:28:00     1     6  0.4393939  0.09001782      89157.6            9844162
##10 2016-05-06 09:53:00 2013-05-07 10:34:00     1     7  0.1470588  1.03832442      88852.6            9143733
##11 2016-05-06 09:45:00 2013-05-07 10:40:00     1     9  0.1114082  0.32531194      89635.6           10122720
##12 2016-05-06 09:50:00 2013-05-07 10:43:00     1    12  0.6853832  2.51426025      89582.6            8924198

Instead of using filter and summarise, use mutate to create a new column, Anemometer_mean, that holds the mean of Anemometer_reading over the rows where Plot != "3" within each day. Then replace the Anemometer_reading in the rows where Plot == "3" with this mean.
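
To check the logic on something small, here is a minimal sketch on made-up data. The toy data frame, its Day column and its values are assumptions for illustration only and are not taken from your dataset; the idea is just that the grouped mean skips plot "3" and replace() overwrites only the plot "3" rows.

```r
library(dplyr)

# Hypothetical toy data: two days, three plots per day,
# with plot "3" reading far too low on both days.
toy <- data.frame(
  Day  = as.Date(c("2016-05-06", "2016-05-06", "2016-05-06",
                   "2016-05-07", "2016-05-07", "2016-05-07")),
  Plot = factor(c("1", "3", "6", "1", "3", "6")),
  Anemometer_reading = c(100, 5, 200, 300, 7, 500)
)

toy_fixed <- toy %>%
  group_by(Day) %>%
  mutate(
    # Mean of the trusted plots (everything except "3") within the current day
    Anemometer_mean = mean(Anemometer_reading[Plot != "3"]),
    # Overwrite only the plot "3" rows with that daily mean
    Anemometer_reading = replace(Anemometer_reading, Plot == "3",
                                 first(Anemometer_mean))
  ) %>%
  ungroup() %>%
  select(-Anemometer_mean)

print(toy_fixed)
```

With these toy values the plot "3" rows should end up as the mean of the other two plots on the same day, and all other rows should be unchanged.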

In fact, you can do all this with just one mutate:

result <- df %>% group_by(as.Date(Time_End)) %>% 
                 mutate(Anemometer_reading = replace(Anemometer_reading, Plot == "3", mean(Anemometer_reading[Plot != "3"]))) %>%
                 ungroup() %>% select(-`as.Date(Time_End)`)
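
If you would rather reuse the summarised data frame from your question, one option is to join it back onto the original rows by day. This is only a sketch under assumptions: the toy data, the Day helper column and the name daily_avg are made up for illustration, and I give the date column an explicit name so that it can serve as the join key.

```r
library(dplyr)

# Hypothetical toy data with a POSIXct end time, mimicking the question's layout
toy <- data.frame(
  Time_End = as.POSIXct(c("2016-05-06 09:30:00", "2016-05-06 09:35:00",
                          "2016-05-06 09:39:00", "2016-05-07 10:13:00",
                          "2016-05-07 10:23:00", "2016-05-07 10:28:00"),
                        tz = "UTC"),
  Plot = factor(c("1", "3", "6", "1", "3", "6")),
  Anemometer_reading = c(100, 5, 200, 300, 7, 500)
)

# Daily mean excluding plot "3", as in the question, but with a named date column
daily_avg <- toy %>%
  filter(Plot != "3") %>%
  group_by(Day = as.Date(Time_End)) %>%
  summarise(Anemometer_mean = mean(Anemometer_reading))

joined <- toy %>%
  mutate(Day = as.Date(Time_End)) %>%   # same key as in daily_avg
  left_join(daily_avg, by = "Day") %>%  # attach the daily mean to every row
  mutate(Anemometer_reading = ifelse(Plot == "3", Anemometer_mean,
                                     Anemometer_reading)) %>%
  select(-Day, -Anemometer_mean)        # drop the helper columns

print(joined)
```

The join version takes more steps than the single mutate, but it keeps the daily means in a separate table that you can inspect before overwriting anything.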

Hope this helps.