Fitting binomial, multinomial and probit regression models for discrete and categorical responses with R.
This worksheet is based on the eighth lecture of the Statistical Modelling in Stata course, created by Dr Mark Lunt and offered by the Centre for Epidemiology Versus Arthritis at the University of Manchester. The original Stata exercises and solutions are here translated into their R equivalents.
Refer to the original slides and Stata worksheets here.
The datasets for this practical can be accessed via http from within R.
The file path is very long, so you can save yourself some typing by saving it to a variable.
The function file.path joins strings together with slashes, for example if you want to combine a web address with a directory or file name.
As Mark has saved these datasets in the Stata-specific .dta format, we import them into R using the read.dta function from the foreign package. Usually, however, data are stored in CSV format, and can be imported with the base function read.csv.
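For example, the first dataset below could be imported along these lines (a sketch: the directory is the one used later in this worksheet for housing.dta, and alligators.dta is assumed to sit alongside it):

library(foreign)

# Save the long base URL to a variable; file.path() joins the pieces with slashes
datadir <- file.path('http://personalpages.manchester.ac.uk/staff/mark.lunt',
                     'stats/8_Categorical/data')

# read.dta() accepts a URL as well as a local file name
alligators <- read.dta(file.path(datadir, 'alligators.dta'))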
The data used for this section were collected as part of a survey of alligator food choices in four lakes in Florida. The largest contributor to the volume of the stomach contents was used as the outcome variable food, and the characteristics of the alligators are their length (dichotomised as ≤ 2.3m and > 2.3m), their sex and which of the four lakes they were caught in.
Read the alligators.dta data into R from the path above. Familiarise yourself with the factor labels in the dataset. For example, since every column is a factor, you can list all the possible levels for each variable with:
lapply(alligators, levels)
If your data are classed as character vectors rather than factors, you could get the same with
lapply(alligators, unique)
Create a variable invertebrate, which takes the level “Invertebrate” if the main food was invertebrates, “Fish” if the main food was fish and “Other” for all other groups.
Here we see how Stata differs from R in its treatment of categorical variables.
For the most part you can model categorical variables directly, either as character vectors or as factors (which can take on text-based or numerical labels).
Numerical codes make things easier for computers to compute, but harder for humans to understand and read.
The factor class in R (and its implicit treatment of character vectors as categorical variables) helps abstract away integer coding, so your code and data are more human-readable.
What’s the difference between a character and a factor, then? A factor can have more levels than appear in the data. That is, there can be factor levels with zero observations. By contrast, summarising a character vector will only show those values that actually appear in the data.
class(alligators$food)
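A quick sketch of the difference, using a hypothetical two-level factor in which only one level is observed:

x <- factor(c('Fish', 'Fish'), levels = c('Fish', 'Invertebrate'))
table(x)                # 'Invertebrate' is listed with a count of zero
table(as.character(x))  # only 'Fish' appears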
To merge the “Reptile”, “Bird” and “Other” values under a single label, while leaving the “Fish” and “Invertebrate” levels unchanged, there are several approaches. Some methods are specific to data classed as factors; others work directly on character vectors.
As a factor, the food column has underlying values (integers 1 to 5) and printed labels for each level.
Inspecting the levels:
levels(alligators$food)
We can change the last three levels to have the same name. Here they are ordered by first appearance, but don’t rely on this; sometimes, depending on how the data are imported, they might be in alphabetical order, or another ordering entirely. So we will subset those levels which are neither ‘Fish’ nor ‘Invertebrate’ and set them equal to ‘Other’.
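A sketch of this factor-based approach, storing the result in a new column invertebrate (the name the later code assumes); assigning the same label to several levels merges them into one:

# Copy the factor, then rename every level that is neither 'Fish' nor 'Invertebrate'
alligators$invertebrate <- alligators$food
levels(alligators$invertebrate)[
  !levels(alligators$invertebrate) %in% c('Fish', 'Invertebrate')] <- 'Other'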
The tidyverse package forcats (“for categorical variables”) offers a whole suite of tools for these sorts of applications, designed with a consistent interface. This is an alternative to the base R way; which you choose is down to personal preference.
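For instance, fct_other collapses every level outside keep into a single other_level; a sketch equivalent to the base approach above:

library(forcats)
alligators$invertebrate <- fct_other(alligators$food,
                                     keep = c('Fish', 'Invertebrate'),
                                     other_level = 'Other')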
There are countless other ways of achieving the same result. In fact, it’s often simpler just to treat the variables as character vectors rather than factors:
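(A sketch of the character-vector approach, mirroring the data.table version below:)

alligators$invertebrate <- as.character(alligators$food)
alligators$invertebrate[!alligators$invertebrate %in% c('Fish', 'Invertebrate')] <- 'Other'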
(The x %in% c(a, b) syntax is a shorthand way of writing x == a | x == b.)
You could also do this in data.table:
library(data.table)
setDT(alligators)
alligators[, invertebrate := as.character(food)]
alligators[!(food %in% c('Fish', 'Invertebrate')),
           invertebrate := 'Other']
Whatever your approach, you should now have a dataset with a new invertebrate column taking the values ‘Fish’, ‘Invertebrate’ and ‘Other’.
Cross-tabulate the food choice against alligator size, using table or xtabs. You should see that whilst fish and invertebrates are equally common in the smaller alligators, the larger ones are more likely to eat fish than invertebrates.
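For example, a sketch with xtabs:

xtabs(~ food + size, data = alligators)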
In the original Stata worksheet we were asked to replace ‘Other’ with missing values (i.e. NA) but we haven’t done that here, because it is easy to model only a subset of the data using the subset argument to glm.
However, if you’d followed the original instructions, then invertebrate would be a vector equal to TRUE (1) if the main food was invertebrates, FALSE (0) if fish and NA for anything else. Here we kept the text values, so within the model we convert to logical/numeric with the == syntax you see below.
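A sketch of such a logistic regression, using the invertebrate column defined earlier:

size_model <- glm(invertebrate == 'Invertebrate' ~ size,
                  data = alligators,
                  subset = invertebrate != 'Other',
                  family = binomial)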
The coefficients give the log-odds, so the odds ratios are simply
exp(coefficients(size_model))
As you might have seen already, there’s no need to do this: most of the steps above were not necessary for the previous model, and you can fit such a model in a single function call, as follows.
size_model2 <- glm(food == 'Reptile' ~ size,
                   data = alligators,
                   subset = food %in% c('Reptile', 'Fish'),
                   family = binomial)
exp(coefficients(size_model2))
Use multinom from the package nnet to obtain the odds ratios for the effect of size on all food choices. Which food category is the comparison (baseline/reference) group? Check that the odds ratios for the invertebrate vs. fish and reptile vs. fish comparisons are the same as before.
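A sketch of the multinomial fit (trace = FALSE suppresses the iteration log):

library(nnet)
size_model3 <- multinom(food ~ size, data = alligators, trace = FALSE)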
exp(coefficients(size_model3))
Again, there is no base R equivalent of Stata’s lincom command, but you can compute the linear combination by hand.
From the output above, you can see we want to test if \(\beta_0 = \beta_1\), or
\[\mathbf{c}^T \boldsymbol\beta = \begin{pmatrix} -1 & 1 & 0 & 0 \end{pmatrix} \begin{pmatrix} \beta_0 \\ \beta_1 \\ \beta_2 \\ \beta_3 \end{pmatrix} = 0.\]
In R we use the matrix multiplication operator %*%, then apply the exponential function to get the odds ratio.
exp(c(-1, 1, 0, 0) %*% coefficients(size_model3))
Thus the odds ratio is 6.1. We can verify this with a logistic regression comparing reptiles with invertebrates directly:
size_model4 <- glm(food == 'Reptile' ~ size,
                   data = alligators,
                   subset = food %in% c('Reptile', 'Invertebrate'),
                   family = binomial)
exp(coefficients(size_model4))
Tabulate the primary food choice against lake, using table (or xtabs). Does the primary food differ between the four lakes?
Tip. If you want column and row totals in your counts table you can use addmargins().
with(alligators, addmargins(table(food, lake)))
You could also plot the data for a better overview:
library(ggplot2)
ggplot(alligators) +
  aes(lake, fill = invertebrate) +
  geom_bar(position = 'fill') +
  ylab('proportion') +
  scale_fill_brewer('Food choice', type = 'qual') +
  theme_classic()
From the proportions table above: 7%.
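(Such a table can be computed along these lines; a sketch using proportions(), the modern name for prop.table(), with margin = 2 giving within-lake proportions:)

with(alligators, proportions(table(food, lake), margin = 2))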
This particular summary doesn’t give a \(\chi^2\) statistic but you can compute one yourself. Does this suggest that the primary food differs between the lakes?
with(alligators, chisq.test(food, lake))
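To obtain odds ratios for each lake, fit a multinomial model. A sketch, assuming nnet is loaded as above:

lake_model <- multinom(food ~ lake, data = alligators, trace = FALSE)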
Fish and Lake Hancock are both reference levels in this case, so we can just read off the coefficients.
exp(coefficients(lake_model))['Invertebrate', 'lakeOklawaha']
lake_model2 <- glm(food == 'Invertebrate' ~ lake,
                   data = alligators,
                   subset = food %in% c('Invertebrate', 'Fish'),
                   family = binomial)
exp(coefficients(lake_model2))['lakeOklawaha']
As before, this agrees (approximately) with the corresponding odds ratio from nnet::multinom.
This section uses the dataset politics.dta, which contains information on the effect of gender and race on political party identification in the United States.
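Assuming politics.dta lives in the same directory as the other datasets, it can be imported with:

politics <- foreign::read.dta(file.path(datadir, 'politics.dta'))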
Use levels to find out the meanings of the variables.
lapply(politics, levels)
library(nnet)
party_model <- multinom(party ~ race, data = politics,
                        trace = FALSE)
exp(coefficients(party_model))
Black people are less likely to identify as Republican than as Democrat; the odds ratio is 0.1.
Note. “White” and “Democrat” are both reference levels in this model.
From the output above, the odds ratio of being an independent rather than a Democrat is 0.3 for those people identified as black.
Use table or xtabs to check whether your answers to the previous questions are sensible.
xtabs(~ party + race, data = politics)
party_model2 <- multinom(party ~ gender, data = politics, trace = FALSE)
exp(coefficients(party_model2))
The odds ratio is 0.6.
Verify this with a logistic regression model. (Here we column-bind the parameter estimates with their 95% confidence intervals.)
party_model3 <- glm(party == 'Republican' ~ gender,
                    data = politics,
                    subset = party != 'Independent',
                    family = binomial)
exp(
  cbind(estimate = coefficients(party_model3),
        confint(party_model3))
)
party_model4 <- multinom(party ~ race + gender, data = politics,
                         trace = FALSE)
Either fit the model again with another call to multinom, or add the term to the model using update:
party_model5 <- update(party_model4, ~ . + race:gender)
Compare the two models with a \(\chi^2\) test:
anova(party_model5, party_model4)
The difference is not statistically significant, implying that the influence of race does not differ between men and women.
Ordered logistic/probit regression models are implemented by the polr function from the MASS (Modern Applied Statistics with S) package.
This section uses the data in housing.dta. These data concern levels of satisfaction among tenants of different types of housing, according to how much contact they have with other residents and how much influence they feel they have over the management of their housing.
housing <- foreign::read.dta('http://personalpages.manchester.ac.uk/staff/mark.lunt/stats/8_Categorical/data/housing.dta')
Use levels to inspect the different factor levels.
lapply(housing, levels)
Fit an ordered logistic regression model for satisfaction (using MASS::polr).
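A sketch of such a model; the column names satisfaction, housing, influence and contact are assumptions based on the description of the data above:

library(MASS)
# Hess = TRUE stores the Hessian so that summary() need not re-fit the model
satisfaction_model <- polr(satisfaction ~ housing + influence + contact,
                           data = housing, Hess = TRUE)
summary(satisfaction_model)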
From the output above, all of the terms are negative, so the reference level (tower blocks) has the most satisfied tenants.
Confidence intervals for the odds ratios do not contain 1 (equivalently, those for the log-odds do not contain 0), suggesting that the predictors are significant at the 5% level.
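Profile-likelihood intervals on the odds scale can be computed like so (a sketch, assuming the model fitted above):

exp(confint(satisfaction_model))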
Does the effect of influence depend on which type of housing a subject lives in? (i.e. is the influence:housing interaction significant?)
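One way to test this is to add the interaction term and compare the two models with a likelihood ratio test (a sketch, assuming the objects defined above):

satisfaction_model2 <- update(satisfaction_model, ~ . + influence:housing)
anova(satisfaction_model, satisfaction_model2)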
The \(\chi^2\) test statistic is significant at the 5% level, implying there is an interaction effect.