Limits of linear regression and how these motivate generalised linear models. Introduction to logistic regression, diagnostics, sensitivity and specificity. Alternative models.
This worksheet is based on the seventh lecture of the Statistical Modelling in Stata course, created by Dr Mark Lunt and offered by the Centre for Epidemiology Versus Arthritis at the University of Manchester. The original Stata exercises and solutions are here translated into their R equivalents.
Refer to the original slides and Stata worksheets here.
Use the dataset epicourse.csv.
epicourse <- read.csv('epicourse.csv')
Generate a table of counts using the table() function. What is the prevalence of hip pain in men and in women?
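The table itself is not shown in this transcription; a minimal sketch, assuming the object name hip_tab that the later commands use:

```r
# Cross-tabulate hip pain (hip_p) against sex; 'hip_tab' is the name
# assumed by the prop.table() and chisq.test() calls that follow
hip_tab <- with(epicourse, table(hip_p, sex))
hip_tab
```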
And you can convert this into a table of proportions within each sex (2nd margin) using:
prop.table(hip_tab, margin = 2)
In R 4.0.0 and later, you can also use the new proportions() function.
proportions(hip_tab, margin = 2)
chisq.test(hip_tab)
Recall the formula for an odds ratio is \[\text{OR} = \frac{D_E / H_E}{D_N / H_N},\] where \(D\) denotes ‘diseased’, \(H\) is ‘healthy’, \(E\) is ‘exposed’ (i.e. female) and \(N\) is ‘not exposed’ (i.e. male).
odds_ratio <- (hip_tab['yes', 'F'] / hip_tab['no', 'F']) /
(hip_tab['yes', 'M'] / hip_tab['no', 'M'])
odds_ratio
And the Wald 95% confidence interval for the odds ratio given by \[\exp \left\{ \log \text{OR}~\pm~z_{\alpha/2} \times \sqrt{\tfrac1{D_E} + \tfrac1{D_N} +\tfrac1{H_E} + \tfrac1{H_N}} \right\},\] where \(z_{\alpha/2}\) is the critical value of the standard normal distribution at the \(\alpha\) significance level. In R, this interval can be computed as follows:
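The code block is missing from this transcription; a sketch of the computation, assuming the 2×2 table hip_tab (rows 'yes'/'no', columns 'F'/'M') from earlier:

```r
# Log odds ratio for women (exposed) versus men (not exposed)
log_or <- log((hip_tab['yes', 'F'] / hip_tab['no', 'F']) /
              (hip_tab['yes', 'M'] / hip_tab['no', 'M']))
# Standard error of the log OR: sqrt(1/D_E + 1/H_E + 1/D_N + 1/H_N)
se_log_or <- sqrt(sum(1 / hip_tab))
# Wald 95% confidence interval on the odds-ratio scale
exp(log_or + c(-1, 1) * qnorm(0.975) * se_log_or)
```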
The relative risk is \[\text{RR} = \frac{D_E / N_E}{D_N / N_N},\] where \(N_E = D_E + H_E\) and \(N_N = D_N + H_N\) are the respective total number of exposed (female) and not-exposed (male) people in the sample. In R, it is
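The corresponding code block is also missing here; a sketch, again assuming hip_tab:

```r
# Risk of hip pain in women divided by risk in men
rr <- (hip_tab['yes', 'F'] / sum(hip_tab[, 'F'])) /
      (hip_tab['yes', 'M'] / sum(hip_tab[, 'M']))
rr
```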
I can’t see any point in computing a relative risk confidence interval here, even if the Stata command cs reports it.
Compute the odds ratio again, this time using the glm function. How does the result compare to that computed above?

Note. The glm function needs responses to be between 0 and 1, so it won’t work with a character vector of ‘yes’ and ‘no’.
An easy solution is to replace hip_p with TRUE and FALSE (which in R is the same as 1 and 0). Either do this before fitting the model, or inline:
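A sketch of the inline approach; the object name model1 is my choice, not the original worksheet’s:

```r
# The comparison hip_p == 'yes' converts the character response
# to TRUE/FALSE inline, inside the model formula
model1 <- glm(hip_p == 'yes' ~ sex, data = epicourse, family = binomial)
summary(model1)
```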
In this model, the intercept parameter estimate represents the log-odds of having hip pain for women (taken to be the baseline). The sexM coefficient is the additive effect (on these log-odds) of being a man. Thus the odds ratio for women relative to men is simply the exponential of the negation of this second parameter:
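A sketch, assuming the fitted glm object is stored as model1:

```r
# Odds ratio for women relative to men:
# negate the sexM coefficient before exponentiating
exp(-coef(model1)['sexM'])
```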
Age may well affect the prevalence of hip pain, as well as gender. To test this, create an agegroup
variable with the following code:
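The code block itself is missing from this transcription; a plausible sketch using cut(), with hypothetical decade-wide bands (the original worksheet’s exact cut-points may differ):

```r
# Hypothetical age bands; adjust the breaks to match the original worksheet
epicourse$agegroup <- cut(epicourse$age,
                          breaks = c(0, 30, 40, 50, 60, 70, Inf))
table(epicourse$agegroup)
```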
Tabulate hip pain against age group using the table() function. Is there evidence that the prevalence of hip pain increases with age?

age_tab <- with(epicourse, table(hip_p, agegroup))
age_tab
chisq.test(age_tab)
Add age to the logistic regression model for hip pain. Is age a significant predictor of hip pain?

You can refit the model from scratch with glm(), or use this convenient shorthand:
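A sketch of the update() shorthand, assuming the sex-only model was stored as model1 and using the name model2 that appears later in this worksheet:

```r
# Add age to the existing model without retyping the whole glm() call
model2 <- update(model1, . ~ . + age)
summary(model2)
```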
It is a logistic model, so the parameter estimate represents the increase in log-odds for a one-year increase in age.
It might naively be thought that the only justification for fitting such a model would be if the underlying continuous data were not available. But it can also be useful for reporting purposes if the categories are clinically meaningful, or if the log-odds of pain do not increase linearly with age (spoiler alert: that is the case here). Admittedly, in that case, doing something clever with the continuous predictor might be at least as good as using categories, but let’s see what happens with the categories anyway.
Look back at the logistic regression model that was linear in age. (There is no need to refit the model if it is still stored as a variable, e.g. model2 in our case.)
This test is not available in base R, so we need to install one of the (many) packages that provides it: we’ll use glmtoolbox.
if (!require("glmtoolbox", quietly = TRUE)) {
  install.packages("glmtoolbox")
  library(glmtoolbox)
}
hltest(model2)
I also had to look this up.
There are various R packages for computing and plotting ROC curves. We will use pROC, since it makes producing confidence intervals easy, and they are vital to any statistical inference.
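A sketch with pROC, assuming the age-and-sex model is stored as model2:

```r
library(pROC)

# ROC curve of the fitted probabilities against the observed outcomes
roc2 <- roc(model2$y, fitted(model2))
auc(roc2)
ci.auc(roc2)   # 95% confidence interval for the AUC
plot(roc2)
```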
hltest(model4)
Obviously it will improve the fit, but whether it is worth the added complexity is another question.
Using the same procedure as above, we obtain the AUC:
It is not easy to replicate Stata’s diagnostics in R.

Plot Pregibon’s \(\Delta\hat\beta\) statistic against the fitted values for the previous regression model. Is there any evidence of an influential point?

Pregibon’s \(\Delta\hat\beta_j\) statistic, according to the Stata documentation, has the formula \[\Delta\hat\beta_j = \frac{r_j^2 h_j}{(1 - h_j)^2},\] where \(r_j\) is the Pearson residual for the \(j^\text{th}\) observation and \(h_j\) is the \(j^\text{th}\) diagonal element of the hat matrix.
Unfortunately, calculating it is not straightforward, as R and Stata differ about what a Pearson residual is and what the hat matrix is, and I could not find a package that does it. Don’t worry if you don’t understand what the function pregibon is doing below: I’m not sure I do (‘I’ here being ML, not DS :-)). However, it does duplicate the statistics produced by Stata.
pregibon <- function(model) {
  # Very similar in spirit to Cook's distance, but computed on data
  # aggregated into covariate patterns, as Stata does
  Y <- tapply(model$y, list(model$model$sex, model$model$age), sum)
  M <- tapply(model$y, list(model$model$sex, model$model$age), length)
  vY <- diag(Y[as.character(model$model$sex), as.character(model$model$age)])
  vM <- diag(M[as.character(model$model$sex), as.character(model$model$age)])
  # Pearson residuals, computed per covariate pattern
  pr <- (vY - vM*model$fitted.values)/sqrt(vM*model$fitted.values*(1-model$fitted.values))
  V <- vcov(model)
  # Design matrix: intercept, sex dummy, age, age squared
  X <- cbind(1, model$model$sex=="M", model$model$age, model$model$age*model$model$age)
  uhat <- diag(X %*% V %*% t(X))
  # Diagonal of the hat matrix for the aggregated data
  hat <- uhat*vM*model$fitted.values*(1-model$fitted.values)
  # Return Pregibon's delta-beta for each observation
  (pr^2*hat)/(1-hat)^2
}
plot(fitted(modelq), pregibon(modelq), type = 'h')
# You could compare this to Cook's distance using the command below, but Cook's distance does not have an
# interpretation for logistic regression, only linear regression
# plot(modelq, which = 4)
Deviance residuals are accessible via residuals(model, type = 'deviance').
You can use loess(), or geom_smooth() in ggplot2. If you are having trouble, try adjusting the smoothing parameter (span), and make sure the variables are in a numeric/binary format, as loess() doesn’t really like factors.
if (!require("ggplot2", quietly = TRUE)) {
  install.packages("ggplot2")
  library(ggplot2)
}
ggplot(epicourse) +
  aes(age, as.numeric(hip_p == 'yes')) +
  geom_point(aes(y = fitted(modelq))) +
  geom_smooth(aes(colour = sex), method = 'loess', se = FALSE, span = .75) +
  ylab('Pr(hip pain)')
nonpar <- loess(data = transform(epicourse,
pain = as.numeric(hip_p == 'yes'),
male = as.numeric(sex == 'M')),
pain ~ age * male, span = .75)
plot(epicourse$age, fitted(modelq), xlab = 'Age', ylab = 'Pr(hip pain)')
lines(1:100, predict(nonpar, newdata = data.frame(age = 1:100, male = 1)), col = 'steelblue')
lines(1:100, predict(nonpar, newdata = data.frame(age = 1:100, male = 0)), col = 'tomato2')
legend('topleft', lty = 1, col = c('steelblue', 'tomato2'), legend = c('M', 'F'))
This section of the practical gives you the opportunity to work through the CHD example that was used in the notes.
chd <- read.csv('chd.csv')
The following commands will reproduce Figure 1.2 using base R:
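The base R commands are missing from this transcription; a sketch using aggregate(), assuming the grouping column is named agegrp as in the dplyr version:

```r
# Mean age and proportion with CHD within each age group
chd_means <- aggregate(cbind(age, chd) ~ agegrp, data = chd, FUN = mean)
plot(chd_means$age, chd_means$chd,
     xlab = 'Mean age', ylab = 'Proportion of subjects with CHD',
     main = 'Scatter plot of proportion of CHD against mean age')
```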
It can also be done in dplyr and ggplot2:
if (!require("dplyr", quietly = TRUE)) {
  install.packages("dplyr")
  library(dplyr)
}
chd %>%
group_by(agegrp) %>%
summarise(agemean = mean(age),
chdprop = mean(chd)) %>%
ggplot() + aes(agemean, chdprop) +
geom_point() +
labs(x = 'Mean age', y = 'Proportion of subjects with CHD',
     title = 'Scatter plot of proportion of CHD against mean age')
Fit a logistic regression model of chd ~ age. What is the odds ratio for a 1-year increase in age?

chd_pregibon <- function(model) {
Y <- tapply(model$y, list(model$model$age), sum)
M <- tapply(model$y, list(model$model$age), length)
vY <- Y[as.character(model$model$age)]
vM <- M[as.character(model$model$age)]
pr <- (vY - vM*model$fitted.values)/sqrt(vM*model$fitted.values*(1-model$fitted.values))
V <- vcov(model)
X <- cbind(1, model$model$age)  # intercept and age columns of the design matrix
uhat <- diag(X %*% V %*% t(X))
hat <- uhat*vM*model$fitted.values*(1-model$fitted.values)
# Return Pregibon's delta-beta for each observation
(pr^2*hat)/(1-hat)^2
}
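The model fit itself does not appear in this transcription; a sketch, using the name chd_model assumed by the commands below:

```r
# Logistic regression of CHD on age; exponentiating the age coefficient
# gives the odds ratio for a 1-year increase in age
chd_model <- glm(chd ~ age, data = chd, family = binomial)
exp(coef(chd_model)['age'])
```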
db <- chd_pregibon(chd_model)
plot(fitted(chd_model), db, type = 'h')
# plot(chd_model, which = 4)