Poisson regression, constraints, overdispersion. Negative binomial regression.
This worksheet is based on the ninth lecture of the Statistical Modelling in Stata course, created by Dr Mark Lunt and offered by the Centre for Epidemiology Versus Arthritis at the University of Manchester. The original Stata exercises and solutions are here translated into their R equivalents.
Refer to the original slides and Stata worksheets here.
Though the original datasets were in the Stata-specific .dta format, I have converted them for you into the more universal .csv format. Thus you can import them using the built-in read.csv or read.table functions (or the method of your choice).
In this section you will be analysing the dataset ships provided in the MASS (Modern Applied Statistics with S) package. This is data from Lloyd's of London concerning the rate at which damage occurred at different times to different types of ship. There are five types of ship (labelled "A" to "E"), which could have been built in any one of four time periods (year), and sailed during one of two time periods (period). The aggregate duration of operation of each type of ship (in months) is given by service, and the number of incidents of damage is given by incidents.
data(ships, package = 'MASS')
help(ships, package = 'MASS')
Convert the columns period, type and year to factors (if they are not already) and set the reference level of type to "E" and year to "75" (1975–1979) with the relevel function.
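A minimal sketch of that conversion (assuming the ships data frame loaded above):
ships$type   <- relevel(as.factor(ships$type), ref = 'E')
ships$year   <- relevel(as.factor(ships$year), ref = '75')
ships$period <- as.factor(ships$period)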
Fit a generalised linear model with a logarithmic link function. The R equivalent of Stata's exposure in a log-linear model is to wrap the log of the variable in offset() on the right-hand side of the model formula. Obviously we can't take logarithms of zero, so ignore classes of ships that did not see any service (using subset in glm, or otherwise by removing these rows from the data frame). So we don't have to write this out repeatedly, we'll create a model just for the offset term and then generate our other models by adding terms with update().
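Something like the following sketch, where the model names are chosen to match the code used later in this worksheet:
model_null <- glm(incidents ~ offset(log(service)),
                  family = poisson, data = ships,
                  subset = service > 0)
model_type <- update(model_null, ~ . + type)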
Unlike with lm objects, when you apply anova to glm objects in R it doesn't automatically perform the test for you (unless you pass a test argument, e.g. anova(model_null, model_type, test = 'Chisq')). But you can carry it out yourself. The \(\chi^2\) test statistic is 55.4 on 4 degrees of freedom, giving a p-value of
pchisq(deviance(model_null) - deviance(model_type), # = 55.4
df.residual(model_null) - df.residual(model_type), # = 4
lower.tail = FALSE)
which is effectively zero, hence the type term is highly significant.
Note. In this case, anova(model_type) gives equivalent output to anova(model_null, model_type).
The year term is also highly significant, once again with a near-zero p-value. (You can verify this using the method given in the previous question.)
Again the term is highly significant.
Applying the function anova(model) implicitly compares model with the null model. Here we want to compare the full model with one that includes every term except type, so we use the anova(model1, model2) syntax to get the corresponding \(\chi^2\) test statistic.
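Here model_built_sailed (every term except type) and model_full are assumed to be defined along these lines, extending model_null from above:
model_built_sailed <- update(model_null, ~ . + year + period)
model_full <- update(model_built_sailed, ~ . + type)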
The resulting p-value is therefore
pchisq(deviance(model_built_sailed) - deviance(model_full),
df.residual(model_built_sailed) - df.residual(model_full),
lower.tail = FALSE)
which implies that the effect of ship type is still significant after the other factors are taken into account.
Use fitted (or predict) to obtain expected numbers of damage incidents. Compare the observed and expected numbers of incidents. For which type of ship and which time periods are the predicted values furthest from the observed values?
The canonical Stata-like solution translated to R is something like:
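ships2 <- subset(ships, service > 0)               # drop ships with no service
ships2$expected <- fitted(model_full)              # expected numbers of incidents
ships2$difference <- ships2$incidents - ships2$expected
ships2 <- ships2[order(abs(ships2$difference)), ]  # sort by absolute difference
(The data frame name ships2 is illustrative.)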
Then inspect the last rows of the resulting data frame, using View or tail, to see which type and time periods had the largest-magnitude residuals.
In dplyr, the equivalent operations are something like the following sketch (column names are illustrative):
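library(dplyr)
ships %>%
  filter(service > 0) %>%
  mutate(expected = fitted(model_full),
         difference = incidents - expected) %>%
  arrange(abs(difference)) %>%
  tail()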
You can also get the residuals directly (rather than doing the subtraction yourself) using the residuals (a.k.a. resid) accessor function on the model object:
residuals(model_full, type = 'response')
So you can retrieve the desired row using
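# e.g. the row with the largest absolute response residual
subset(ships, service > 0)[which.max(abs(residuals(model_full, type = 'response'))), ]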
We see that the least accurate prediction was for type D ships built in 1970–1974 and operated in 1975–1979, with 11 damage incidents observed but only 6.2 predicted by the model.
The Stata command estat gof provides goodness-of-fit statistics for the model, testing whether the observed and expected values are significantly different from one another. The residual deviance is mentioned in the last line of the output from summary(model_full).
The resulting p-value from the \(\chi_{25}^2\) test is
pchisq(deviance(model_full), # sum(resid(model_full, 'deviance')^2)
df.residual(model_full),
lower.tail = FALSE)
Pearson's cumulative test statistic (the other output of Stata's estat gof) is the sum of the squared Pearson residuals:
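sum(residuals(model_full, type = 'pearson')^2)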
which has 25 degrees of freedom, giving a p-value of 0.017.
We therefore reject the null hypothesis that the observed data are drawn from the distribution specified by the fitted model. In other words, the model is not a great fit.
Note. The original Stata solution sheet appears to be wrong here, as it interacts ship type with built, rather than sailed, as asked in the question. First we will add the type:built interaction term, then later on we will answer the original question.
inter_built <- update(model_full, ~ . + type:year)
anova(model_full, inter_built)
pchisq(deviance(model_full) - deviance(inter_built),
df.residual(model_full) - df.residual(inter_built),
lower.tail = FALSE)
The interaction between year built and type of ship is significant. The goodness-of-fit statistic yields \(\chi_{13}^2\) = 14.6, with p-value
pchisq(deviance(inter_built),
df.residual(inter_built),
lower.tail = FALSE)
implying that the model is now a good fit.
Testing for the year of operation (as was originally asked in the question):
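Assuming the interaction model is defined analogously to inter_built above:
inter_sailed <- update(model_full, ~ . + type:period)
anova(model_full, inter_sailed)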
The interaction term between year of operation and type of ship is not significant, so it does not appear to be worth adding that to the model.
pchisq(deviance(model_full) - deviance(inter_sailed),
df.residual(model_full) - df.residual(inter_sailed),
lower.tail = FALSE)
This section uses data concerning childhood mortality in three cohorts, from the dataset nbreg.csv. The children were divided into seven age bands¹, and the number of deaths and the person-months of exposure are recorded in deaths and exposure, respectively.
nbreg <- read.csv('nbreg.csv')
Firstly, convert cohort to a factor, so it doesn’t get misinterpreted as a continuous variable.
nbreg$cohort <- as.factor(nbreg$cohort)
Then, fit the generalised linear model.
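A sketch of the fit (the name pois_mod matches the code below):
pois_mod <- glm(deaths ~ cohort, offset = log(exposure),
                family = poisson, data = nbreg)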
There’s no point actually trying to interpret the coefficients, for reasons explained in the next question.
We want to check for overdispersion in the model. A Poisson model is overdispersed (or underdispersed) if the variance is significantly greater (or smaller) than the mean. This is indicated by the ratio of deviance to residual degrees of freedom being different from 1.
You can calculate the overdispersion parameter, which has definition \[\hat\phi = \frac{1}{n - p} \sum_{i=1}^n r_i^2,\] the sum of the squared Pearson residuals, \(r_i\), divided by the residual degrees of freedom, \(n - p\). Here we have
sum(residuals(pois_mod, 'pearson')^2) / df.residual(pois_mod)
which you can also verify by fitting the same model with family = quasipoisson and looking at the dispersion parameter given in the model summary output.
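For example (the name qpois_model is reused later in this worksheet):
qpois_model <- update(pois_mod, family = quasipoisson)
summary(qpois_model)$dispersion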
Clearly, our model is very overdispersed, meaning the variance is much larger than the mean, and so the Poisson model is not appropriate.
We can use the function glm.nb from the MASS package.
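A sketch of the fit, with the model name nb_model matching the code below:
library(MASS)
nb_model <- glm.nb(deaths ~ cohort + offset(log(exposure)), data = nbreg)
summary(nb_model)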
There are no significant differences between cohorts.
(Similarly, anova(nb_model) shows variation between cohorts is not significantly greater than within cohorts.)
This refers to the model formulation \[\text{Var}(Y) = \mu(1 + \alpha \mu)\] mentioned in the lecture notes, hence \(\alpha\) is the dispersion parameter for the quasi-negative binomial model.
Be aware that the MASS::glm.nb summary output defines the dispersion parameter differently, with
\[\text{Var}(Y) = \mu + \frac{\mu^2}{\theta}.\]
A full explanation is given here.
The dispersion parameter \(\alpha\) is therefore:
1 / nb_model$theta
or you can compute it yourself:
A confidence interval for \(\theta = 1/\alpha\) is
nb_model$theta + c(-1, 1) * 1.96 * nb_model$SE.theta
From the lecture notes, this refers to the model parametrisation \[\text{Var}(Y) = \mu(1 + \delta),\] which is similar to the quasi-Poisson \[\text{Var}(Y) = \phi\mu,\] for a dispersion parameter \(\phi\).
According to Brian Ripley, the Stata command nbreg with option dispersion(constant) is not fitting a GLM at all. The constant-dispersion negative binomial described in Stata is possibly(?) analogous to a quasi-Poisson. However, the dispersion parameter given in summary(qpois_model) is much larger than that given in the Stata solutions, so I'm not 100% sure. More details are in the Stata documentation.
(Since cohort was not significant, it is dropped for this model.)
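The age-only model referenced below is assumed to be defined along these lines:
nb_age <- glm.nb(deaths ~ age_gp + offset(log(exposure)), data = nbreg)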
The likelihood ratio test statistic is 2 × 44 on 6 degrees of freedom, which is highly significant, suggesting mortality is different between age groups.
No, because \(\theta = \alpha^{-1} = 33 ~ (= 0.03^{-1})\), indicating overdispersion.
As in Stata, this model struggles to converge. You can try to tweak the model-fitting hyperparameters in glm.control.
nb_age_cohort <- update(nb_age, ~ . + cohort) # throws lots of warnings
Or better yet, fit the model using Markov chain Monte Carlo (MCMC) via Stan.
library(rstanarm)
nb_age_cohort <- stan_glm.nb(deaths ~ age_gp + cohort, offset = log(exposure),
data = nbreg, algorithm = 'sampling')
The 90% Bayesian posterior uncertainty (“credible”) intervals are:
posterior_interval(nb_age_cohort)
which do not contain zero for the age or cohort parameters, implying that age and cohort have an effect on mortality.
The 95% credible interval for the reciprocal dispersion is
posterior_interval(nb_age_cohort, par = 'reciprocal_dispersion', prob = .95)
which are both greater than one, implying there is, if anything, underdispersion in the model, but no longer overdispersion.
Fit the same model using family = poisson. Does this model agree with the negative binomial model?
"Fit the same model using (different model spec)" is maybe an odd way of phrasing this! Anyway, the equivalent Poisson model (back to using stats::glm again) is:
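# Assumed definition; the name matches the code below
pois_age_cohort <- glm(deaths ~ age_gp + cohort, offset = log(exposure),
                       family = poisson, data = nbreg)
summary(pois_age_cohort)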
This appears to agree with the negative binomial model: all terms are significant and the parameter estimates are almost identical to those given in the model nb_age_cohort.
Moreover, the residual deviance implies no overdispersion (though possibly underdispersion).
pchisq(deviance(pois_age_cohort), df.residual(pois_age_cohort),
lower.tail = FALSE)
No longer significant, implying a good fit.
This section uses the data on damage to ships from the dataset MASS::ships again.
In R, the standard output returns log-incidence ratios, so you can either apply exp to the coefficients to get relative incidence ratios, or else, on the logarithmic scale, check which parameters' confidence intervals contain zero (\(\log 1 = 0\)).
If you’re working in the same R session as before, you should still have type “E” and construction year 1975–1979 as reference levels, resulting in output like below. (If your reference levels are construction year 1960–1964 and type “A”, you can still make the same inferences, but the output will look different.)
Construction year 1960–1964 is not significantly different from 1975–1979 (though it’s borderline). Types A and D are not significantly different from type E.
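Assuming ship_model is the main-effects model from the first section, a sketch is:
ship_model <- glm(incidents ~ type + year + period + offset(log(service)),
                  family = poisson, data = ships, subset = service > 0)
exp(cbind(estimate = coef(ship_model), confint(ship_model)))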
ships$pred_n <- fitted(ship_model)
In R, you accomplish this using contrasts. The default contrasts are contr.treatment, which means all comparisons are with the baseline factor level (type "A" by default, though we switched it to "E" in the first section).
contrasts(ships$type)
ship_model$contrasts
Assuming the question is not just asking us to change the reference level again (in which case we'd employ relevel()), we can construct a contrasts matrix where the reference level and type D have rows equal to zero:
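# A sketch: start from the default treatment contrasts (reference 'E')
ctr_type <- contrasts(ships$type)
ctr_type['D', ] <- 0                # force type D to act like the reference
contrasts(ships$type) <- ctr_type
ship_model_D <- update(ship_model)  # refit with the new contrasts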
summary(ship_model_D)
The typeD row is now effectively blanked out, because the estimate is constrained to zero on the log scale (a relative incidence of 1) and has no standard error. The other parameter estimates have changed slightly, though not by a large amount. (Recall from earlier that the effect of type "D" was not statistically significant.)
Both give very similar results and indicate a lack of fit:
# Original model:
pchisq(deviance(ship_model), df.residual(ship_model), lower.tail = FALSE)
# With type D constraint:
pchisq(deviance(ship_model_D), df.residual(ship_model_D), lower.tail = FALSE)
If type “E” is still your reference category, it does not make much sense to force it to be zero, as it is already. Hence we will relevel the factor so that type “A” is the new reference category, then add the constraint.
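A sketch, relevelling to type "A" and zeroing out the rows for both D and E:
ships$type <- relevel(ships$type, ref = 'A')
ctr_type <- contr.treatment(levels(ships$type))
ctr_type[c('D', 'E'), ] <- 0
contrasts(ships$type) <- ctr_type
ship_model_DE <- update(ship_model)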
pchisq(deviance(ship_model_DE),
df.residual(ship_model_DE),
lower.tail = F)
Lack of fit has got worse: this p-value is smaller than before.
As building contrasts matrices becomes more complex, you might like to look up packages that simplify the process. Here I'll do it in base R. (I hope this is right!)
Fit a Poisson regression model with all three constraints:
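One way to do this, as a sketch on top of ship_model_DE (the year coding below is an assumption that matches the description that follows):
ctr_year <- contr.treatment(levels(ships$year))
ctr_year['70', ] <- ctr_year['65', ]  # constrain year70 to equal year65
contrasts(ships$year) <- ctr_year
ship_model_constr <- update(ship_model_DE)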
In Stata's model summary output, the lines for the constrained construction years would be identical. In R, summary(ship_model_constr) will instead show year70 as NA, since it is redundant. (Or, if we omit columns of zeros from the contrasts matrix, those rows will simply not appear.) Here, year65 still has a valid estimate and standard error, since it was not constrained to take a specific value (only to be equal to that of year70).
exp(cbind(estimate = coefficients(ship_model_constr),
confint(ship_model_constr)))
You can verify that they are treated the same (and that types “D” and “E” have no effect) by predicting on new data, for example:
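# A sketch: hypothetical ships differing only in the constrained factors
newdata <- data.frame(type = c('D', 'E'),
                      year = c('65', '70'),
                      period = '75',
                      service = 1000)
predict(ship_model_constr, newdata, type = 'response')  # identical predictions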
They were not constrained to take a specific value.
pchisq(deviance(ship_model_constr),
df.residual(ship_model_constr),
lower.tail = F)
The fit has deteriorated slightly.
ships$pred_cn <- fitted(ship_model_constr)
We can use a scatterplot matrix.
Tip. See help(pairs) for tips on how to get correlations in the same figure.
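A minimal sketch, assuming the prediction columns created above (and that ships has been subset to exclude the zero-service rows):
pairs(ships[, c('incidents', 'pred_n', 'pred_cn')])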
From this, it appears the constraints have had little effect on the fit of the model; the correlation between predicted and observed values is still very high (though it has dropped slightly).
If you wish, you can examine the observed and predicted values directly with View(ships) (or otherwise by printing them to the console).
Constraints can be applied to many different types of regression model.
However, applying constraints when using nnet::multinom can be tricky because there are several equations.
The syntax is then similar to the syntax we saw last week for lincom
.
For this part of the practical, we are using the same alligators.dta dataset that we saw last week.
Use levels or str to remind yourself of the different variables:
lapply(alligators, levels)
Fit a multinomial logistic regression model of primary food choice against lake, using nnet::multinom():
mn_model <- nnet::multinom(food ~ lake, data = alligators)
Are there significant differences between lakes in the primary food choice?
Yes.
Fish is the reference food choice. So the odds ratios are
exp(coefficients(mn_model))['Invertebrate', ]
In R, this is again done using contrasts. Fit the constrained model.
Look at the new parameter estimates:
exp(coefficients(mn_model2))
(Unlike glm objects, redundant parameters in a multinom model are repeated, rather than marked as NA.)
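Assuming lake_contrasts began as a copy of the factor's existing contrasts matrix (a guess, based on the assignment below):
lake_contrasts <- contrasts(alligators$lake)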
lake_contrasts[] <- c(0, 1, 1, 1) / 3
lake_contrasts
Fit a multinomial logistic regression model with both of these constraints.
How does the common odds ratio for all three lakes compare to the 3 separate odds ratios you calculated previously?
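A sketch of the constrained fit; the name mn_model3 matches the code below:
contrasts(alligators$lake) <- lake_contrasts
mn_model3 <- nnet::multinom(food ~ lake, data = alligators)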
exp(coefficients(mn_model3))
The common odds ratio of 6.68 is somewhere in between the three previous estimates. Verify it with a single logistic regression:
logistic <- glm(food == 'Invertebrate' ~ lake != 'Hancock',
data = alligators,
subset = food %in% c('Invertebrate', 'Fish'),
family = binomial)
exp(coefficients(logistic))
¹ The original Stata 13 nbreg.dta file contained age_gp as a column of labelled floating-point numbers. For cross-compatibility, I've converted this to a character vector.