hsmith - 6 months ago 27

R Question

So I'm taking a class in R Statistics and need to complete a series of questions to pass my first assignment. I've done fine so far but am starting to get a bit stuck.

The assessment works with a dataset on conservation and its influence on biomass, in which fifty plots of land, each one hectare, were sampled at random from a ten thousand hectare area in Northern England.

For each plot of land, the following variables were recorded:

• biomass: an estimate of the biomass of vegetation in kg per square metre.

• alt: the mean altitude of the plot in metres above sea level.

• cons: a categorical variable, which was coded 1 if the plot was part of a conservation area, and 2 otherwise.

• soil a categorical variable crudely classifying soil type as 1 for chalk, 2 for clay and 3 for loam.

At the moment I am struggling with two things in particular:

How to calculate the average difference in biomass between clay (soil2) and loam (soil3) soils based on my fitted model (model1) and calculate 95% confidence interval for this mean predicted value.

And how to calculate the mean predicted biomass for a plot located in a conservation area with predominantly clay soil at an altitude of 300m?

This is a summary of the linear model that I'm working with.

`Call:`

lm(formula = biomass ~ alt + soil + cons, data = conservation)

Residuals:

Min 1Q Median 3Q Max

-0.183105 -0.052926 0.005593 0.061844 0.194402

Coefficients:

Estimate Std. Error t value Pr(>|t|)

(Intercept) 2.2928629 0.0357850 64.073 < 2e-16 ***

alt -0.0029068 0.0001302 -22.318 < 2e-16 ***

soil2 -0.0862220 0.0342955 -2.514 0.0156 *

soil3 -0.2309939 0.0354480 -6.516 5.33e-08 ***

cons2 0.0488634 0.0292075 1.673 0.1013

---

Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

Residual standard error: 0.09428 on 45 degrees of freedom

Multiple R-squared: 0.9459, Adjusted R-squared: 0.9411

F-statistic: 196.7 on 4 and 45 DF, p-value: < 2.2e-16

I'm sure there are very simple answers to these questions but I'm completely new to statistics and R, so any help is much appreciated!

Please ask if you need more information to be able to help.

And here's the data:

`dput(conservation)`

structure(list(biomass = c(2.01, 2.06, 1.7, 2.07, 1.88, 2.11,

0.98, 2.14, 1.75, 1.81, 2.15, 1.68, 2.23, 2.04, 1.67, 1.77, 1.74,

1.53, 1.79, 2.15, 1.39, 2.19, 2.14, 2.29, 1.91, 1.73, 2.21, 1.96,

2.07, 2.01, 2.2, 2.24, 1.33, 1.05, 1.36, 1.72, 1.44, 1.52, 2.09,

1.42, 1.64, 0.92, 1.65, 1.37, 0.77, 1.57, 2.25, 2.23, 2.03, 1.18

), alt = c(116L, 21L, 130L, 65L, 117L, 82L, 359L, 5L, 86L, 91L,

64L, 178L, 79L, 70L, 209L, 110L, 161L, 248L, 146L, 23L, 237L,

84L, 40L, 7L, 161L, 122L, 25L, 146L, 67L, 118L, 42L, 57L, 277L,

338L, 331L, 153L, 239L, 237L, 67L, 171L, 206L, 371L, 107L, 236L,

482L, 240L, 56L, 42L, 68L, 436L), cons = structure(c(2L, 2L,

1L, 2L, 2L, 1L, 1L, 2L, 2L, 1L, 2L, 2L, 2L, 2L, 1L, 1L, 2L, 1L,

1L, 2L, 2L, 2L, 1L, 2L, 1L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 1L,

1L, 1L, 2L, 2L, 1L, 1L, 2L, 2L, 1L, 2L, 1L, 2L, 2L, 2L, 2L, 2L

), .Label = c("1", "2"), class = "factor"), soil = structure(c(2L,

3L, 2L, 2L, 2L, 1L, 2L, 2L, 3L, 3L, 2L, 1L, 1L, 2L, 2L, 3L, 2L,

3L, 2L, 2L, 3L, 1L, 3L, 2L, 1L, 3L, 1L, 1L, 1L, 1L, 1L, 1L, 3L,

3L, 2L, 1L, 3L, 2L, 1L, 3L, 3L, 3L, 3L, 3L, 2L, 2L, 1L, 1L, 1L,

2L), .Label = c("1", "2", "3"), class = "factor"), alt.factor =

structure(c(1L,

1L, 2L, 1L, 1L, 1L, 2L, 1L, 1L, 1L, 1L, 2L, 1L, 1L, 2L, 1L, 2L,

2L, 2L, 1L, 2L, 1L, 1L, 1L, 2L, 2L, 1L, 2L, 1L, 2L, 1L, 1L, 2L,

2L, 2L, 2L, 2L, 2L, 1L, 2L, 2L, 2L, 1L, 2L, 2L, 2L, 1L, 1L, 1L,

2L), .Label = c("below median", "above median"), class = "factor")),

.Names = c("biomass",

"alt", "cons", "soil", "alt.factor"), row.names = c(NA, -50L), class =

"data.frame")

Answer

How to calculate the average difference in biomass between clay (soil2) and loam (soil3) soils based on my fitted model (model1) and calculate 95% confidence interval for this mean predicted value.

Strictly speaking, this is a special case of something we call "linear hypothesis test". But I think it is not the intention of your assignment, so I will not go for this approach. If you are interested in it, read Get p-value for group mean difference without refitting linear model with a new reference level.

What I will do here, is to simply use a different factor level as contrast and refit your model. At the moment, you have "soil1" as contrast level; I will reset "soil2" as contrast level. Have a look at How to set contrasts for my variable in regression analysis with R? for a general treatment.

```
fit <- lm(biomass ~ alt + soil + cons, data = conservation,
contrasts = list(soil = contr.treatment(n = 3, base = 2)))
#Coefficients:
# Estimate Std. Error t value Pr(>|t|)
#(Intercept) 2.2066409 0.0400572 55.087 < 2e-16 ***
#alt -0.0029068 0.0001302 -22.318 < 2e-16 ***
#soil1 0.0862220 0.0342955 2.514 0.0156 *
#soil3 -0.1447719 0.0325295 -4.450 5.59e-05 ***
#cons2 0.0488634 0.0292075 1.673 0.1013
#---
#Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
#
#Residual standard error: 0.09428 on 45 degrees of freedom
#Multiple R-squared: 0.9459, Adjusted R-squared: 0.9411
#F-statistic: 196.7 on 4 and 45 DF, p-value: < 2.2e-16
```

Now, the coefficient for "soil3" is giving the difference of group mean for "soil3" and the contrast level "soil2". It is pretty straightforward to obtain confidence interval for this coefficient, from the standard error and residual degree of freedom of the model, but again, that might be too technical to you. Consider using `confint`

:

```
confint(fit, "soil3", level = 0.95)
# 2.5 % 97.5 %
#soil3 -0.2102896 -0.0792541
```

And how to calculate the mean predicted biomass for a plot located in a conservation area with predominantly clay soil at an altitude of 300m?

For prediction on response `biomass`

, we can use `predict`

:

```
predict(fit, newdata = list(alt = 300, soil = "2", cons = "1"))
# 1
#1.334606
```

So the prediction mean is about `1.3346`

.