user6821911 user6821911 - 2 months ago 15
R Question

R- Several linear regressions in one dataframe with factors and NA's

I'm very new to R and I have to work with a dataset of more than 100 columns, simplified below:

Station time data1 data2 data3 data4.....
1 0.0 35.02430310 44.2229390 NA
1 0.8 -68.75294241 -85.5847503 NA
1 1.8 -43.10200333 -62.8035400 NA
3 0.0 0.02217693 0.1336396 0.03203031
3 0.9 7.84203118 -6.4854953 6.22910506
3 2.2 -0.41682970 -7.7022785 0.92807170
17 0.0 4.24864888 4.2104517 0.00000000
17 0.9 1.79933934 -6.6360999 -10.10756894
17 2.1 1.99226283 2.2676248 -13.15887674


With every
data
column I would like to do a linear regression with
time
, but I need the coefficients for every Station (which are factors). From the
plyr
package I used

ddply(dataframe, .(Station), function(z) coef(lm(data1 ~ time, data=z)))


for example for
data1
:

Station (Intercept) t.h.
1 1 9.674588 -40.5399850
2 37 3.130705 -0.6284611
3 48 3.657316 -0.9474062


This would be the way I need the coefficients, but for every
data
column. Now, even if I would use this code for every single
data
column, I get problems with the columns that have NA values. I would like to simply drop these stations, but only for the specific column (in this case only for
data3
. For
data1
and
data2
I would like to keep Station 1.

Is there a solution for this? Any suggestion would be appreciated.

data
dput
:

structure(list(Station = structure(c(1L, 1L, 1L, 2L, 2L, 2L,
3L, 3L, 3L), .Label = c("1", "3", "17"), class = "factor"), time = c(0,
0.8, 1.8, 0, 0.9, 2.2, 0, 0.9, 2.2), data1 = c(35.0243031, -68.75294241,
-43.10200333, 0.02217693, 7.84203118, -0.4168297, 4.24864888,
1.79933934, 1.99226283), data2 = c(44.222939, -85.5847503, -62.80354,
0.1336396, -6.4854953, -7.7022785, 4.2104517, -6.6360999, 2.2676248
), data3 = c(NA, NA, NA, 0.1410939, 30.0332505, 11.449285, 0.1161954,
-2.061781, 0.2289149)), .Names = c("Station", "time", "data1",
"data2", "data3"), row.names = c(NA, -9L), class = "data.frame")

Answer

We need tp reshape your data.frame to long format first, then omit NA values, and consequently apply the model per unique key ('data' and Station), and finally tidy up the output from the lm() call.

library(tidyr)
library(broom)

df %>% gather(data, value, -c(Station, time)) %>%
  na.omit() %>%
  group_by(data, Station) %>% 
  do(tidy(coef(lm(value ~ time, data = .)))) %>%
  spread(names, x) 

#   data Station `(Intercept)`        time
#* <chr>  <fctr>         <dbl>       <dbl>
#1 data1       1     9.5534021 -40.5734035
#2 data1       3     3.1391280  -0.6354857
#3 data1      17     3.6539549  -0.9424560
#4 data2       1    13.8883780 -56.0886482
#5 data2       3    -1.1964287  -3.3757574
#6 data2      17     0.2938263  -0.3353234
#7 data3       3     9.9859146   3.7631889
#8 data3      17    -0.7504115   0.1724399

The example data used is what you shared up to column data3.