Categorical variables, interactions, confounding and variable selection for linear regression models.
This worksheet is based on the sixth lecture of the Statistical Modelling in Stata course, created by Dr Mark Lunt and offered by the Centre for Epidemiology Versus Arthritis at the University of Manchester. The original Stata exercises and solutions are here translated into their R equivalents.
Refer to the original slides and Stata worksheets here.
Load the auto dataset distributed with Stata by downloading it from the company’s web site.
We are going to investigate whether US-made vehicles tend to be heavier than non-US-made vehicles (in 1978).
auto <- foreign::read.dta('http://www.stata-press.com/data/r8/auto.dta')
Fit a linear model with the function call lm(weight ~ foreign).
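For example, a minimal sketch (the object name usa_model is arbitrary):
usa_model <- lm(weight ~ foreign, data = auto)
summary(usa_model)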
From the summary output we can see that non-US vehicles are on average 1000 lbs lighter than those made in the USA. The difference is statistically significant at the 0.1% level.
Repeat the analysis with lm(weight ~ factor(foreign)). Does this make any difference? No, because foreign is already a factor and R handles these natively.
summary.lm gives the same results as t.test. Look at the difference between the means and its standard error: are they the same as those obtained by the linear model?
t.test(weight ~ foreign, data = auto)
The test statistics are similar, but for them to be exactly the same as in the summary output, you would need to set var.equal = TRUE, i.e. perform a test that assumes the variances of the two samples are equal. By default, t.test assumes they are different.
t.test(weight ~ foreign, data = auto, var.equal=TRUE)
Does the variance of weight look the same for the two categories?
In base R graphics:
boxplot(weight ~ foreign, data = auto)
Or in ggplot2:
library(ggplot2)
ggplot(auto) +
aes(foreign, weight) +
geom_boxplot() +
theme_classic()
In Stata you might use hettest, which assumes normality by default. In R, you can use var.test or bartlett.test, which assume normality, or ansari.test and mood.test, which are non-parametric.
ansari.test(weight ~ foreign, data = auto)
mood.test(weight ~ foreign, data = auto)
There does not appear to be sufficient evidence to reject the hypothesis that the scales of the two samples are the same. In other words: the variances are not significantly different.
The dataset soap.csv gives information on the scores given to 90 bars of soap for their appearance.
The scores are on a scale of 1–10; the higher the score, the better.
Three operators each produced 30 bars of soap.
We wish to determine if the operators are all equally proficient.
soap <- read.csv('soap.csv')
Which operator appears to have the highest scores?
boxplot(appearance ~ operator, data = soap)
ggplot(soap) +
aes(operator, appearance, group = operator) +
geom_boxplot() +
theme_classic()
Make sure that operator is treated as a categorical variable.
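A minimal sketch of the fit (the object name soap_model matches the anova call below):
soap$operator <- factor(soap$operator) # treat operator as categorical
soap_model <- lm(appearance ~ operator, data = soap)
summary(soap_model)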
The parameter estimates are relative to the first operator.
Operators 2 and 3 are both significantly different from operator 1, at the 1% level.
But the summary output above does not tell us about the differences between operators 2 and 3.
Rather than test all the combinations, we can perform an \(F\)-test, using the anova function.
anova(soap_model)
There is no equivalent in base R for Stata’s lincom command, which performs hypothesis tests on linear combinations of parameters. The R package multcomp provides the function glht (general linear hypothesis test), which serves a similar purpose to lincom.
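For reference, a hedged sketch with glht, assuming the multcomp package is installed and that operator entered the model as a factor:
library(multcomp)
# All pairwise comparisons of operators, including operator 3 versus operator 2
summary(glht(soap_model, linfct = mcp(operator = "Tukey")))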
However, we will do the computations by hand.
The linear combination we want to consider, in this case, is \[\beta_0 + \beta_1,\] where \(\beta_0\) is the intercept or baseline term, and \(\beta_1\) is the difference between the baseline and the mean of operator 2, in the above model.
The mean is \[\mathbf{c}^T \boldsymbol\beta = \pmatrix{1 & 1 & 0} \cdot \pmatrix{\beta_0 \\ \beta_1 \\ \beta_2},\] which yields the sample mean of operator 2’s appearance scores, as expected.
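A sketch of this computation by hand, assuming the coefficients of soap_model are ordered (Intercept), operator 2, operator 3:
cvec <- c(1, 1, 0)
sum(cvec * coef(soap_model)) # fitted mean for operator 2
mean(soap$appearance[soap$operator == levels(soap$operator)[2]]) # sample mean, for comparison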
As in the previous question, we are computing a linear combination. In this case, the equation is \[(\beta_0 + \beta_1) - (\beta_0 + \beta_2) = \beta_1 - \beta_2 = 0,\] corresponding to the linear combination \[\mathbf{c}^T\boldsymbol\beta = \pmatrix{0 & 1 & -1} \cdot \pmatrix{\beta_0 \\ \beta_1 \\ \beta_2} = 0,\] under the null hypothesis. The mean difference can then be computed.
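Reusing the same approach, a sketch of the estimate (again assuming the coefficient order (Intercept), operator 2, operator 3):
cvec <- c(0, 1, -1)
est <- sum(cvec * coef(soap_model)) # estimated difference between operators 2 and 3
est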
And the variance of the estimate is \[\operatorname{Var}(\mathbf{c}^T\hat{\boldsymbol\beta}) = \mathbf{c}^T \Sigma \mathbf{c},\] where \(\Sigma = \operatorname{Var}(\hat{\boldsymbol\beta})\) is the variance-covariance matrix of the estimated parameter vector \(\hat{\boldsymbol\beta}\). In R, this can be computed as follows.
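A sketch, using the contrast vector cvec from above and the fitted model’s variance-covariance matrix:
Sigma <- vcov(soap_model)
var_est <- as.numeric(t(cvec) %*% Sigma %*% cvec)
var_est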
The test statistic is then \[t = \frac{\mathbf{c}^T\hat{\boldsymbol\beta}}{\sqrt{\mathbf{c}^T \Sigma \mathbf{c}}},\]
which has a \(t\)-distribution with \(n-3\) degrees of freedom. The two-sided \(p\)-value is:
t_stat <- est / sqrt(var_est)                 # est and var_est as computed above
2 * pt(-abs(t_stat), df.residual(soap_model)) # two-sided p-value
which is very small, so we reject the null hypothesis—in other words, there is a significant difference between the appearance scores of operators 2 and 3.
The dataset cadmium gives data on the ages, vital capacities and durations of exposure to cadmium (in three categories) in 88 workers. We wish to see if exposure to cadmium has a detrimental effect on vital capacity.
However, we know that vital capacity decreases with age, and that the subjects with the longest exposures will tend to be older than those with shorter exposures.
Thus, age could confound the relationship between exposure to cadmium and vital capacity.
cadmium <- read.csv('cadmium.csv', stringsAsFactors = TRUE)
plot(capacity ~ age, data = cadmium)
ggplot(cadmium) +
aes(age, capacity) +
geom_point() +
theme_classic()
Use linear regression (the lm function) to predict vital capacity from age.
cd_model <- lm(capacity ~ age, data = cadmium)
It would be nice to be able to tell to which exposure group each point on the plot of vital capacity against age belonged.
We can simply plot ages by exposure group directly:
plot(age ~ exposure, data = cadmium)
which shows that the group with the highest exposure tend to be older.
But we could also do this while keeping the capacity data in the visualisation at the same time.
To understand how we will do this, have a look at the column exposure. It is a factor, which behind the scenes is really a vector of integers with associated names for each value (or level).
levels(cadmium$exposure) # the possible factor levels
as.integer(cadmium$exposure) # underlying numerical values
So we can select a colour scheme and then colour and shape each point accordingly.
(Try typing c('blue', 'red', 'grey')[cadmium$exposure] into the console and see what you get.)
Rather than write out a vector of colours each time, you can assign them to the default colour palette, and then retrieve them by index when making your next plot.
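A sketch of that approach (the colour choices are illustrative):
palette(c('blue', 'red', 'grey')) # set the default palette
plot(capacity ~ age, data = cadmium,
     col = cadmium$exposure,              # colour by exposure group
     pch = as.numeric(cadmium$exposure))  # and vary the plotting symbol
legend('bottomleft',
       legend = levels(cadmium$exposure),
       col = 1:3, pch = 1:3)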
(For a cheat sheet of base R graphical parameters, visit Quick-R.)
In ggplot2, these sorts of things are much more straightforward.
Legends are added automatically.
ggplot(cadmium) +
aes(age, capacity, colour = exposure, shape = exposure) +
geom_point() +
theme_classic()
That might be quite hard to read (especially if you are colour blind), so we can also consider a facetted (lattice, trellis, or small multiples) plot.
ggplot(cadmium) +
aes(age, capacity) +
geom_point() +
facet_grid(~exposure) +
theme_classic()
plot(capacity ~ exposure, data = cadmium)
Possibly not. The variance between the groups is not significantly different from the variance within the groups, even before we take age into account.
It looks like there are no significant differences between the groups.
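The next model adds exposure group to the age-only model; a sketch (the name cd_model3 matches the plotting code that follows):
cd_model3 <- lm(capacity ~ age + exposure, data = cadmium)
summary(cd_model3)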
To get a visual idea of the meaning of the previous regression model, we can plot the data with the lines of best fit. In base R graphics, we can use the abline command with the corresponding intercepts and slopes from the vector of coefficients.
# Scatter plot, as before.
plot(capacity ~ age, data = cadmium,
col = cadmium$exposure,
pch = as.numeric(cadmium$exposure))
legend('bottomleft',
pch = 1:nlevels(cadmium$exposure),
legend = levels(cadmium$exposure),
col = 1:nlevels(cadmium$exposure))
# < 10 years
abline(a = coef(cd_model3)['(Intercept)'], # baseline
b = coef(cd_model3)['age'], # slope
col = 1)
# > 10 years
abline(a = coef(cd_model3)['(Intercept)'] + coef(cd_model3)[3],
b = coef(cd_model3)['age'],
col = 2)
# No exposure
abline(a = coef(cd_model3)['(Intercept)'] + coef(cd_model3)[4],
b = coef(cd_model3)['age'],
col = 3)
Another way of doing this is to generate two (or more) ages, assign an exposure level, and predict the capacities for those observations. Then draw a line (segment) through the points.
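A sketch of that approach, reusing cd_model3 and adding the segments to the existing scatter plot (the two ages chosen are illustrative):
newdata <- expand.grid(age = c(20, 60), exposure = levels(cadmium$exposure))
newdata$fit <- predict(cd_model3, newdata = newdata)
for (i in seq_along(levels(cadmium$exposure))) {
  grp <- newdata$exposure == levels(cadmium$exposure)[i]
  lines(newdata$age[grp], newdata$fit[grp], col = i) # join the two predictions
}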
In R, interaction terms are represented by termA:termB. You can specify them explicitly along with the main effects, or you can use the syntax termA*termB, which is shorthand for the main and interaction effects termA + termB + termA:termB. Remember, it is not valid to estimate an interaction effect without also estimating the main effects.
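A sketch of the interaction model discussed below (the name cd_model4 matches the later code):
cd_model4 <- lm(capacity ~ age * exposure, data = cadmium)
summary(cd_model4)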
It appears, then, that though the intercepts of the lines are not significantly different, the slopes are, at the 5% level. We can draw them using similar syntax to before.
coefs <- coef(cd_model4) # to save typing
# < 10 years
abline(a = coefs['(Intercept)'], # baseline
b = coefs['age'], # baseline slope
col = 1)
# > 10 years
abline(a = coefs['(Intercept)'] + coefs['exposure> 10 years'],
b = coefs['age'] + coefs['age:exposure> 10 years'],
col = 2)
# No exposure
abline(a = coefs['(Intercept)'] + coefs['exposureNo exposure'],
b = coefs['age'] + coefs['age:exposureNo exposure'],
col = 3)
In ggplot2, a full model is implicitly fitted when you ask for a geom_smooth line using method = lm.
ggplot(cadmium) +
aes(age, capacity, colour = exposure) +
geom_smooth(method = lm, se = TRUE) +
geom_point()
It shows a bit more clearly that, whilst the fitted lines look quite different above, the standard error intervals reveal a large degree of uncertainty in the estimation of their slopes and intercepts, and formal hypothesis testing suggests the effect of exposure group, once age is taken into account, is not statistically significant.
summary(cd_model4)
The steepest line is attributed to the group with more than 10 years exposure. The ‘no exposure’ group has the least steep line. (Remember the lines are sloping downwards, so it is the magnitude of the slopes we are interested in.)
coefs['age'] + coefs['age:exposure> 10 years']
‘Have you read about the vast amount of evidence that variable selection causes severe problems of estimation and inference?’
— Frank Harrell
As we said in the lecture, automatic variable selection procedures are rarely a good idea, but they are very common in the medical literature so you need to understand them, even if you never use them yourself.
Use the dataset hald.csv.
This contains data on the amount of heat evolved as cement hardens, as a function of the proportions of four chemicals in the composition of the cement.
hald <- read.csv('hald.csv')
As a warning of things to come, let’s visualise the data, first.
plot(hald)
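A sketch of stepwise selection in both directions, by AIC, using the step function (assuming the columns are named y and x1 to x4):
null_model <- lm(y ~ 1, data = hald)
step(null_model, scope = ~ x1 + x2 + x3 + x4, direction = 'forward')  # forward selection
step(lm(y ~ ., data = hald), direction = 'backward')                  # backward elimination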
Yes, both directions lead to the same model: \[\hat Y = \hat\beta_0 + \hat\beta_1 x_1 + \hat\beta_2 x_2 + \hat\beta_4 x_4.\]
R’s step function performs stepwise model selection by AIC, so this question is not relevant, unless you want the students to perform stepwise regression via hypothesis tests; this is possible via repeated application of the add1 and drop1 functions.
Calculate the correlation between the predictors, using the cor function. What is the correlation between x2 and x4?
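A sketch of the answer, assuming those column names:
with(hald, cor(x2, x4))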
In this case we fitted the models by AIC, so the results are probably different from whatever Stata’s sw command says.
Slightly ambiguous question. The fit of the model compared to what?
full_model <- lm(y ~ ., data = hald)
If the exercise means ‘compared to a null model’, i.e. testing for the existence of regression, then we can compare the full model with an intercept-only model.
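A sketch of that comparison (equivalent to the overall \(F\)-statistic reported by summary):
anova(lm(y ~ 1, data = hald), full_model)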
The \(F\)-statistic is significant, suggesting the full model is a better fit than a null model.
summary(full_model)$r.squared
Yes, but the \(R^2\) statistic is not necessarily a good way to determine the quality of a model.
This question is extremely dubious. Without specifying a null hypothesis and a corresponding test statistic, it could mean anything.
The desired output can be produced with
summary(full_model)
but it would be unwise to draw any conclusions from this. Each of these tests amounts to comparing this five-parameter model with each of the possible four-parameter models induced by dropping just one parameter at a time. But we have already seen that two of the variables are highly correlated. And the visualisation suggests that blind testing may not reveal a great deal of insight.
Use the data in growth.csv.
This dataset gives the weight, in ounces, of a baby weighed weekly for the first 20 weeks of its life.
growth <- read.csv('growth.csv')
plot(weight ~ week, data = growth)
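The linear model referred to below can be fitted as follows (the name growth_model1 matches the diagnostic plot call further down):
growth_model1 <- lm(weight ~ week, data = growth)
summary(growth_model1)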
You can produce partial residual plots with crPlot (or crPlots) from the car package.
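A sketch, assuming the car package is installed:
library(car)
crPlots(growth_model1) # component + residual plot for each predictor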
Personally I find these a little bit challenging to read and would consider a plot of residuals against fitted values instead.
plot(growth_model1, which = 1)
The evidence of non-linearity is much clearer, here.
Add a quadratic term to the model, either with the formula weight ~ week + I(week^2), or by creating a new variable equal to week^2 and adding it as a term in the model. Does this improve the fit of the model?
The reason for the I() is that, in the formula interface, ^2 has a special meaning (to do with interactions) rather than acting as a mathematical operator (just as + and * have special roles, too).
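A sketch of the first form (the name growth_model2 matches the code near the end of this section):
growth_model2 <- lm(weight ~ week + I(week^2), data = growth)
summary(growth_model2)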
You can also just create a variable with value equal to the square of the original predictor:
growth$week2 <- growth$week^2
growth_model2a <- lm(weight ~ week + week2, data = growth)
Finally, you can use the poly function to add polynomial terms. By default it generates orthogonal polynomial terms; set raw = TRUE to use the raw powers instead.
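A sketch (the model name growth_model2b is illustrative):
growth_model2b <- lm(weight ~ poly(week, 2, raw = TRUE), data = growth)
summary(growth_model2b)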
Each of the three forms above yields the same model, with the same parameter estimates and test statistics.
Unlike the previous sections, the fitted curve is no longer a straight line, so instead we need to plot the fitted points and then join them together. In this case, the week values are all in order, so this is simple; but if they were not, we would have to sort the predictor and then make sure the corresponding fitted values had the same order. Alternatively, just make an evenly-spaced grid of points and predict values from it. This is handy if the observed \(x\)-values were not evenly spaced, or if you wanted your line to extend beyond the range of the data.
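A sketch of the grid approach, reusing growth_model2 (the range of weeks is deliberately wider than the data):
plot(weight ~ week, data = growth)
grid <- data.frame(week = seq(-5, 25, by = 0.5)) # extends beyond the observed weeks
lines(grid$week, predict(growth_model2, newdata = grid))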
The above is just for demonstration. Of course you can’t have negative weeks!
Don’t do this. Some form of regularisation must be employed to ensure model complexity does not increase too much. Simple hypothesis tests do nothing to account for this.
The slow decrease in AIC suggests that the quadratic model (polynomial of degree 2) is the best one to use.
As it happens, if we did use hypothesis tests then we would get the same result.
What is the correlation between week and week^2? Not much of a correlation matrix: it only contains one value!
It would be as well to run:
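A sketch, using the week2 variable created earlier:
with(growth, cor(week, week2)) # correlation between week and its square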
The poly function, described earlier, anticipated this. We can use the default setting (raw = FALSE) to perform orthogonal polynomial regression.
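A sketch of the orthogonal fit (the name growth_orthog2 matches the comparison below):
growth_orthog2 <- lm(weight ~ poly(week, 2), data = growth) # raw = FALSE is the default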
Compare the correlation of the predictors:
cor(model.matrix(growth_model2)[, -1])
cor(model.matrix(growth_orthog2)[, -1])
and we see the orthogonal terms are, well, orthogonal!