Analytical Epidemiology II: Statistical Inference

Sampling. Confidence intervals. Hypothesis tests.

David Selby (Centre for Epidemiology Versus Arthritis)
https://www.research.manchester.ac.uk/portal/david.selby.html
https://www.cfe.manchester.ac.uk
2021-03-30

This lecture forms part of the Introduction to Epidemiology course offered by the Centre for Epidemiology Versus Arthritis at the University of Manchester. There are two sessions on analytical epidemiology: the first session (last week) covered descriptive statistics; this second session introduces statistical inference.

Much of the material overlaps with the first few lectures in the Statistical Modelling with Stata course, for which materials are available online. There is less emphasis on statistical software here. Examples, where provided, will be in the R programming language. Unlike Stata, R is completely free, so you don’t need an institutional software licence to follow along at home. Download it from the Comprehensive R Archive Network.

If you have questions or comments about any of these materials, please contact the author.

Links to example datasets (or code to generate them) are provided in the respective sections below.

Sampling

Often, it is not practical to measure every subject in a population. A reduced number of subjects—a sample—is measured instead. This is cheaper, quicker and can be more thorough (i.e. allowing more detailed study of each subject). The sample needs to be chosen in such a way as to be representative of the target population. A uniform random sample is the simplest way to achieve this: every subject in the population has the same probability of being included in the sample.

Suppose we have a population in which a variable \(X\) has a mean \(\mu\) and standard deviation \(\sigma\). Assume we don’t know what the true value of \(\mu\) actually is, and want to estimate it. We take a random sample of size \(n\) from the population. Then the sample mean, \(\bar x\), is our estimate of \(\mu\); a different random sample would give a different value of \(\bar x\), so the sample mean \(\bar X\) is itself a random variable with its own sampling distribution.

How much the means of different samples differ depends on:

  1. sample size — the mean of a small sample will vary more than the mean of a large sample
  2. variance in the population — if the variable measured only varies slightly across the whole population, the sample mean can only vary slightly as well.

In other words, the variance of \(\bar X\) depends on the variance of \(X\) and on the sample size \(n\).

For example, consider a population consisting of 1000 copies of each of the digits 0, 1, …, 9. The frequency distribution of the values in this population is:

library(tidyverse)  # provides ggplot2 and dplyr
uniform <- data.frame(digit = rep(0:9, each = 1000))
ggplot(uniform) + aes(x = digit) +
  geom_bar(stat = 'count') +
  scale_x_continuous(breaks = 0:9)

We can draw random samples of size \(n = 5, 25, 100\) from this population. Let’s draw 2000 samples of each of these three sizes and calculate the mean of \(x\) (i.e. \(\bar x\)) for each sample. The distributions of sample means are visualised below.

library(tidyverse)
# For each sample size, for 2000 samples
samples <- expand_grid(n = c(5, 25, 100),
                       sample_id = 1:2000) %>%
  group_by(n, sample_id) %>%
  # Draw the random samples
  summarise(samples = sample(uniform$digit, n)) %>%
  # Calculate the sample means
  summarise(x_bar = mean(samples),
            s = sd(samples))

# Visualise the distribution
ggplot(samples) + aes(x_bar) + geom_histogram() +
  facet_wrap(~ n, labeller = label_both) +
  xlab('sample mean')

Properties of \(\bar X\)

The simulation above illustrates the key properties of the sample mean (which theory confirms):

  1. on average, \(\bar X\) equals the population mean, \(\mu\)
  2. the standard deviation of \(\bar X\) is \(\sigma / \sqrt{n}\), so it shrinks as the sample size grows
  3. the distribution of \(\bar X\) becomes approximately normal as \(n\) increases, even if the distribution of \(X\) is not

Standard error

The standard error is fundamental to frequentist (classical, non-Bayesian) statistics. The standard error is the standard deviation of the sampling distribution of a statistic. The sampling distribution is the distribution of a statistic as sampling is repeated. All statistics have sampling distributions. Statistical inference is based on the standard error.

For example, consider our uniformly-distributed digits population, above. It follows a discrete uniform distribution, so we can calculate the theoretical mean and standard deviation exactly: \[\mu = \frac{a + b}{2} = 4.5, \qquad \sigma = \sqrt{\frac{(b - a + 1)^2 - 1}{12}} \approx 2.87,\] where \(a=0\) and \(b=9\) are the boundaries of the distribution. We can verify this by calculating the mean and standard deviation of the entire 10,000-strong population (uniform) in our example:

(mu <- mean(uniform$digit))
[1] 4.5
(sigma <- sd(uniform$digit))
[1] 2.872425

In practice, of course, we don’t know an underlying theoretical distribution, nor can we measure the entire population. So we have to estimate \(\mu\) and \(\sigma\) from samples.

Let’s compare the expected (theoretical) versus observed (sample) values of the mean and standard deviation for each sample size.

samples %>%
  summarise(mean_observed = mean(x_bar),
            sd_observed = sd(x_bar)) %>%
  mutate(mean_expected = mu,
         sd_expected = sigma / sqrt(n))
Sample size   Mean                    S.D.
(n)           expected   observed     expected   observed
  5           4.5        4.47         1.28       1.31
 25           4.5        4.48         0.57       0.58
100           4.5        4.49         0.29       0.29

You can see that the average sample means (‘observed’) are very close to the true population mean (‘expected’), and get closer as the sample size, \(n\), increases. Similarly, the standard deviations of the sample means (‘observed’) are very close to the theoretical value of \(\sigma/\sqrt{n}\) (‘expected’) for each sample size, \(n\).

This shows that the properties of \(\bar X\), described above, are true.

Confidence intervals

In real life, we don’t know the true distribution of \(X\), nor can we observe the entire population, so we cannot compare \(\bar X\) with \(\mu\) or \(s\) (the sample standard deviation) with \(\sigma\).

However, recall from the previous lecture that we can summarise distributions using quantiles. And by the central limit theorem, the distribution of \(\bar X\) is asymptotically normal as the sample size increases (even if the distribution of \(X\) is not, as in our digits example). Thus we can summarise the spread of \(\bar X\) via the quantiles of the corresponding normal distribution.

Reference ranges

In a box plot you might wish to summarise a dataset using the median and lower and upper quartiles, i.e. the 25th, 50th and 75th centiles. For a normal distribution you can derive these quantiles using the mean and standard deviation.

If \(x\) is normally distributed with mean \(\mu\) and standard deviation \(\sigma\), the lower quartile, median and upper quartile are \[\mu - 0.674\sigma, \quad \mu \quad \text{and} \quad \mu + 0.674\sigma,\] because the median, lower and upper quartiles of the standard normal distribution (mean 0 and standard deviation 1) are 0, -0.674 and 0.674, respectively. You can check this in a table of normal distribution quantiles, or using R:

qnorm(c(.25, .5, .75), mean = 0, sd = 1)
[1] -0.6744898  0.0000000  0.6744898

That would give us the inter-quartile range of a normal distribution, containing the middle 50% of the population. In what interval do 95% of the population lie? Instead of the first and third quartiles (25th and 75th centiles) we now want the 2.5th and 97.5th centiles, which are

qnorm(c(.025, .975))
[1] -1.959964  1.959964

so the 95% reference range of a normal distribution, defined as the interval containing 95% of the population, is given by \[(\mu - 1.96 \sigma, \mu + 1.96 \sigma).\]
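For example, for a hypothetical normally-distributed measurement with mean 100 and standard deviation 15 (purely illustrative values, not from any dataset here), the 95% reference range can be computed in R as:

qnorm(c(.025, .975), mean = 100, sd = 15)  # approximately 70.6 and 129.4
100 + c(-1.96, 1.96) * 15                  # the same range, calculated manually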

For our digits data, a 95% reference range is not appropriate because the digits are not normally distributed.

Confidence intervals

In frequentist statistics, \(\mu\) is assumed to be a fixed (but unknown) value. By the central limit theorem, 95% of sample means fall within \(\mu \pm 1.96\frac{\sigma}{\sqrt n}\); rearranging, 95% of the intervals \(\bar x \pm 1.96 \frac{\sigma}{\sqrt{n}}\) contain \(\mu\).

That is, if you were able to draw samples repeatedly from the population, and calculate the sample mean for each of them, then 95% of those intervals would contain \(\mu\). Typically we have just one of these samples (and of course don’t know whether it is one that contains \(\mu\) or not) and we call the interval a 95% confidence interval for the population mean.

A reference range tells us about individual values in the distribution (but only assuming it’s normally distributed); a confidence interval tells us about the mean value (which is normally distributed for large \(n\) thanks to the central limit theorem).
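As a quick sketch of the distinction, using a single sample of 100 digits (the reference range here is purely illustrative, since the digits are not normally distributed):

x <- sample(uniform$digit, 100)
# 95% reference range: where about 95% of individual values would lie (if normal)
mean(x) + c(-1.96, 1.96) * sd(x)
# 95% confidence interval: the plausible range for the population mean
mean(x) + c(-1.96, 1.96) * sd(x) / sqrt(100)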

You don’t have to pick 95%, of course, though it’s conventional; you can pick any proportion you like, e.g. 50%, 80%, 99%. The larger this percentage (confidence level), the wider the interval will be, for us to be able to say the intervals contain \(\mu\) with that level of confidence.
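The corresponding normal multipliers (in standard errors) grow with the confidence level:

# Half-widths, in standard errors, of 50%, 80%, 95% and 99% intervals
qnorm(1 - (1 - c(.5, .8, .95, .99)) / 2)
[1] 0.6744898 1.2815516 1.9599640 2.5758293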

A 100% confidence interval would be the entire possible range of the data: in the digits case it would be 0 to 9; for continuous data on the real number line it would be \(-\infty\) to \(+\infty\), which obviously isn’t terribly helpful.

If \(\sigma\) is unknown then you can use the sample standard deviation, \(s\) (sometimes denoted \(\hat\sigma\)), \[s = \sqrt{\frac{\sum_{i=1}^n (x_i - \bar x)^2}{n-1}}.\]
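As a small check of this formula against R’s built-in sd function, on one random sample of 25 digits:

x <- sample(uniform$digit, 25)
sqrt(sum((x - mean(x))^2) / (25 - 1))  # same value as...
sd(x)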

Simulated example

Let’s construct 95% confidence intervals around all of our sample means in the digits data.

confints <- samples %>%
  mutate(lower = x_bar - 1.96 * s / sqrt(n),
         upper = x_bar + 1.96 * s / sqrt(n),
         contains_mu = 4.5 >= lower & 4.5 <= upper)

We would expect \(\mu\) to fall within (approximately) 95% of the confidence intervals around the sample means. What proportion of the intervals actually contained the true population mean, in our simulation?

confints %>%
  summarise(confidence = mean(contains_mu))
# A tibble: 3 x 2
      n confidence
  <dbl>      <dbl>
1     5      0.862
2    25      0.931
3   100      0.943

As the sample size increases, the distribution of the sample mean approaches normality and the confidence intervals contain the true population mean about 95% of the time. However, for small sample sizes (\(n = 5\)) the normal approximation is poor, so the nominal 95% confidence intervals contain \(\mu\) rather less than 95% of the time. In fact, for such small samples with a standard deviation estimated from the data, we shouldn’t be using the normal distribution at all: there is a more appropriate distribution for this situation, introduced below.

Here is an illustrative animation. Each frame is one sample from the population.

And here are all of the intervals shown together:

confints %>%
  # Reorder the bars
  ungroup() %>% arrange(n, x_bar) %>%
  group_by(n) %>%
  mutate(order = row_number()) %>%
  # Visualise the data
  ggplot() +
  aes(xmin = lower, xmax = upper, y = order,
      colour = contains_mu) +
  facet_wrap(~ n, labeller = label_both, ncol = 1) +
  geom_errorbarh() +
  geom_vline(xintercept = 4.5, linetype = 'dashed') +
  labs(title = 'Confidence intervals for mean digit',
       subtitle = 'Which contain the true mean?',
       x = NULL, y = NULL,
       colour = expression('contains' ~ mu)) +
  scale_x_continuous(breaks = c(0, 4.5, 9)) +
  theme(axis.text.y = element_blank())

Is your sample’s confidence interval one that contains \(\mu\), or not? You can never be sure! But assuming the sample was drawn uniformly from the population, the procedure used to construct the interval captures the population mean for 95% of such samples.

Real-world example

In 216 patients with primary biliary cirrhosis, serum albumin had a mean value of 34.46 g/l and a standard deviation of 5.84 g/l.

What is the 95% confidence interval for the mean serum albumin level?

  1. The standard error is \(\sigma/\sqrt{n} = 5.84 / \sqrt{216} = 0.397\).
  2. Hence the 95% confidence interval for \(\mu\) is \[\begin{aligned} \bar x \pm 1.96 \sigma/\sqrt{n} &= 34.46 \pm 1.96 \times 0.397 \\ &= (33.68, 35.24) ~g/l. \end{aligned}\]

So the mean value of serum albumin in the population of patients with primary biliary cirrhosis is probably between 33.68 g/l and 35.24 g/l.

In R,

# Directly get quantiles of normal distribution:
qnorm(c(.025, .975), mean = 34.46, sd = 5.84 / sqrt(216))
[1] 33.68119 35.23881
# Or manually calculate the interval:
34.46 + c(-1.96, 1.96) * 5.84 / sqrt(216)
[1] 33.68117 35.23883

Proportional data

Consider a different example of a binary variable (i.e. can take one of two values), for example presence of a disease or a particular characteristic. Let the proportion in the population with this quality be denoted by \(p\)—this is our unknown population parameter.

Suppose \(X\) is the number of such people in the population. Then \(X\) is a discrete random variable with a binomial distribution with parameters \(n\) (the total number of subjects) and \(p\) (the underlying probability of having the characteristic).

For small \(n\), you can compute the probability of \(X\) taking on particular values. For example, if we have a fair coin (\(p = 0.5\)) and flip it 20 times then the probability of getting exactly 11 heads is \[\begin{aligned} \operatorname{Pr}(X = 11) &= \textstyle\binom{20}{11} 0.5^{11} \times 0.5^9 \\ &= 0.16 \end{aligned}\] i.e. about 1 in 6. In R, use the density (mass) function:

dbinom(11, size = 20, prob = .5)
[1] 0.1601791

As with continuous variables, usually we’re interested in intervals or ranges of possible values: what is the probability that we get 15 or more heads in 20 flips? \[\begin{aligned} \operatorname{Pr}(X \geq 15) &= 1 - \operatorname{Pr}(X < 15) \\ &= 1 - \sum_{i=0}^{14} \operatorname{Pr}(X = i) \\ &= 0.021 \end{aligned}\]

Or in R:

1 - sum(dbinom(0:14, 20, .5))
[1] 0.02069473
pbinom(14, 20, .5, lower.tail = FALSE)
[1] 0.02069473

With large populations, we can approximate the binomial distribution with a normal distribution that has mean \(np\) and variance \(np(1-p)\). Typically we’re interested in the proportion, not the absolute count, so we divide by the constant \(n\) to get a distribution with mean \(p\) and standard deviation \(\sigma = \sqrt{p(1-p)/n}\).

For example, imagine 100 subjects each receive two analgesics, \(A\) and \(B\), for one week in a randomly determined order. They then state a preference for one drug: 65 prefer \(A\) and 35 prefer \(B\). Calculate a 95% confidence interval for the proportion preferring \(A\).

Using a normal approximation, the standard error of \(p\) is \(\sqrt{0.65 \times 0.35 / 100} = 0.0477\), thus the 95% confidence interval is \[0.65 \pm 1.96 \times 0.0477 = (0.56, 0.74),\] so we estimate that in the general population, between 56% and 74% of people tend to prefer \(A\) (equivalently: the underlying probability of preferring \(A\) is likely to fall between 0.56 and 0.74).

In R,

0.65 + qnorm(c(.025, .975)) * sqrt(.65 * .35 / 100)
[1] 0.5565157 0.7434843

Or you could look directly at quantiles of the binomial count distribution, with \(p\) fixed at the sample estimate:

qbinom(c(.025, .975), size = 100, prob = 0.65)
[1] 56 74

Dividing these counts by \(n = 100\) gives an interval very similar to the normal approximation above (though note this is not a confidence interval in the strict sense, since it treats the observed proportion as if it were the true \(p\)).
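If you want a genuinely ‘exact’ (Clopper–Pearson) confidence interval for the underlying proportion, base R provides binom.test; a minimal sketch:

binom.test(65, 100)$conf.int  # slightly wider than the normal approximation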

Confidence intervals in R

To produce confidence intervals in R, you can specify the mathematical formula yourself as in the examples above, use the function confint() on a fitted model object, or perform a specific kind of test using the dedicated function, e.g. t.test().

# Generate one sample
x <- sample(uniform$digit, 100)
# Manually construct the 95% confidence interval:
mean(x) + qt(c(.025, .975), 99) * sd(x) / sqrt(100)
[1] 3.822711 4.977289
# Get it via a linear model object
confint(lm(x ~ 1))
               2.5 %   97.5 %
(Intercept) 3.822711 4.977289
# Via a t-test function
t.test(x)

    One Sample t-test

data:  x
t = 15.123, df = 99, p-value < 2.2e-16
alternative hypothesis: true mean is not equal to 0
95 percent confidence interval:
 3.822711 4.977289
sample estimates:
mean of x 
      4.4 

In this case, we are estimating the standard deviation from the data so we need to use quantiles from the \(t\)-distribution with \(n-1\) degrees of freedom via qt(); using qnorm (i.e. ±1.96) would give a similar (but not identical) result.

Other sampling distributions

We have already seen the normal and binomial distributions, as sampling distributions for location parameters in continuous and proportional data.

But there are many more possible questions that we can ask under the classical frequentist framework, and each test statistic has its own sampling distribution. Two of the most common are described below.

Student’s \(t\)-distribution

Many statistics, like the sample mean, can be considered normally distributed if the sample size is large enough. If the statistic \(T\) has mean \(\mu\) and standard error \(\sigma\), then \(\frac{T - \mu}{\sigma}\) has a normal distribution with mean 0 and standard deviation 1. In practice, however, we rarely know the population variance; we only have an estimate in the form of the sample standard deviation, \(s\). For a sample mean based on \(n\) observations, the statistic \(\frac{\bar X-\mu}{s/\sqrt n}\) has a Student’s \(t\)-distribution with \(n-1\) degrees of freedom.

Why “Student”? William Sealy Gosset, a brewer and statistician at Guinness in Dublin, developed the theory for analysing small samples of barley for brewing. When publishing academic papers, scientists at the company had to use pseudonyms and avoid mentioning ‘beer’ or ‘Guinness’, to avoid divulging trade secrets to competitors.

Plaque to William Sealy Gosset in the Guinness Storehouse museum, Dublin

The Student’s \(t\)-distribution is a bit like a normal distribution but has fatter tails, allowing for more extreme values in small samples with unknown variance. As \(n\) gets larger, the corresponding \(t\)-distribution (with \(n-1\) degrees of freedom) approximates a normal distribution, so for larger sample sizes the distinction is less important.

You can access quantiles of the \(t\) distribution in R using the qt function. They will generally give slightly wider confidence intervals than a normal distribution on the same data.
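For example, comparing the multipliers used for a 95% interval at our three sample sizes:

# t multipliers for n = 5, 25, 100 (df = n - 1) versus the normal multiplier
qt(.975, df = c(4, 24, 99))  # approximately 2.78, 2.06 and 1.98
qnorm(.975)                  # 1.96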

In our digits data, if we assume the population standard deviation \(\sigma\) is not known, then the ‘correct’ way to generate confidence intervals is to use the \(t\)-distribution rather than the normal. This is why the normal-based confidence intervals from earlier fell short of 95% coverage for the smaller sample sizes.

samples %>%
  mutate(lower = x_bar + qt(.025, n - 1) * s / sqrt(n),
         upper = x_bar + qt(.975, n - 1) * s / sqrt(n),
         contains_mu = 4.5 >= lower & 4.5 <= upper) %>%
  summarise(confidence = mean(contains_mu))
# A tibble: 3 x 2
      n confidence
  <dbl>      <dbl>
1     5      0.922
2    25      0.944
3   100      0.944

Using the more appropriate choice of sampling distribution, the confidence intervals capture the true population mean more reliably when the sample size is small.

\(\chi^2\) distribution

The \(\chi_k^2\) (chi-squared) distribution is the distribution of the sum of squares of \(k\) independent standard normal random variables, \[Q = \sum_{i=1}^k Z_i^2.\]

We use it when measuring the difference between expected and observed frequencies (\(E_i\) and \(O_i\)) in a contingency table.

Why? Because the test statistic for such a difference, summed over the \(k\) cells of the table, is \[X^2 = \sum_{i=1}^k \frac{(O_i - E_i)^2}{E_i} = \sum_{i=1}^k \frac{O_i^2}{E_i} - n,\] where \(n\) is the total count, and this is asymptotically \(\chi^2\)-distributed (for a \(2\times2\) table, with 1 degree of freedom).
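As a quick arithmetic check of that identity, using made-up observed and expected counts (with the same total):

O <- c(25, 30, 45)  # hypothetical observed counts
E <- c(30, 30, 40)  # hypothetical expected counts
sum((O - E)^2 / E)
[1] 1.458333
sum(O^2 / E) - sum(O)
[1] 1.458333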

Consider a new example. Suppose we observe two groups of people: one exposed to a particular treatment (e.g. a drug or environmental condition) and the other not. We wish to investigate whether the exposure appears to increase the incidence of a particular outcome (a disease or event) or not.

The data are organised in a \(2\times 2\) contingency table of the form

Exposed Unexposed
Outcome \(n_{11}\) \(n_{21}\)
No outcome \(n_{12}\) \(n_{22}\)

If there were no effect (i.e. the likelihood of experiencing the event is the same for exposed and unexposed subjects) then the \(X^2\) statistic is asymptotically distributed as \(\chi_1^2\), i.e. with 1 degree of freedom (in general, an \(r \times c\) table has \((r-1)(c-1)\) degrees of freedom).

As in our digits example, we simulate samples of size \(n = 5, 25, 100\) subjects (5000 samples of each size), with an underlying baseline odds of 1.3 of experiencing the outcome: i.e. you are 30% more likely than not to experience the outcome in general (equivalently: the chance of experiencing the outcome is about 57%).

The empirical distribution of the \(X^2\) statistics is plotted below and compared with the density of a theoretical \(\chi_1^2\) distribution.

# Simulate the contingency tables
samples2 <- expand_grid(n = c(5, 25, 100),
                        sample_id = 1:5000) %>%
  group_by(n, sample_id) %>%
  summarise(exposure = as.integer(runif(n) > .5),
            outcome = runif(n) < plogis(log(1.3))) %>%
  group_by(n, sample_id, exposure) %>%
  summarise(yes = sum(outcome), no = sum(!outcome))

# Calculate the X^2 statistics
X2 <- samples2 %>%
  mutate(p = sum(yes) / sum(yes + no),
         Eyes = p * (yes + no),
         Eno = (1 - p) * (yes + no)) %>%
  summarise(Xsq = sum((yes - Eyes)^2 / Eyes +
                        (no - Eno)^2 / Eno))

# Visualise the distribution (compared with chi^2, df = 1)
ggplot(X2) + aes(Xsq) +
  stat_function(fun = function(x) dchisq(x, df = 1),
                linetype = 'dashed', alpha = .5) +
  geom_density(colour = 'purple', fill = 'purple', alpha = .5) +
  facet_wrap(~n, ncol = 1, labeller = label_both) +
  coord_cartesian(xlim = c(0, 6)) +
  labs(x = expression(X^2), y = 'density') +
  theme(axis.text.y = element_blank())

Because there is no true effect in our population, the mode of the distribution is zero: you’d expect, in the long run, the difference in outcomes between the ‘exposed’ and ‘unexposed’ groups to be about zero, with some random fluctuation.

If you really want to confirm the simulated \(X^2\) statistics are distributed \(\chi_1^2\), then recall from the last lecture that the most reliable way to compare observed data with theoretical quantiles of a distribution is via a quantile–quantile plot:

ggplot(X2) +
  aes(sample = Xsq) +
  stat_qq_line(distribution = qchisq, dparams = list(df = 1)) +
  stat_qq(distribution = qchisq, dparams = list(df = 1)) +
  facet_wrap(~n, scales = 'free_y', labeller = label_both) +
  labs(x = expression('theoretical'~chi[1]^2~'quantiles'),
       y = expression(X^2))

If the test statistic falls into a part of the \(\chi^2\) distribution with very low density, we conclude that observing such a value would be unlikely if the null hypothesis were true.

Hypothesis testing and why it’s a bad idea

We can never say exactly what a population-level parameter is without measuring the entire population. Confidence intervals tell us that a particular range probably contains the population value, with some level of confidence.

Null hypothesis significance testing is a way of comparing a simple theory against the information conveyed by our dataset (via the test statistic).

The null hypothesis

The null hypothesis is a very simple, uninteresting model that we seek to disprove or falsify: for example, that a drug has exactly zero effect on an outcome, or that two groups have exactly the same population mean.

The idea is then to show that the observed data would be so unlikely under such a model that the null model must be wrong. This supposedly lends credence to the idea that a proposed alternative hypothesis is therefore a better explanation. However, this is often a false dichotomy:

  1. Make two bad explanations
  2. Show first bad explanation is bad
  3. ???
  4. Therefore, second bad explanation must be better
  5. Profit

Nobody seriously expects these hypotheses to be exactly true in the first place: they are straw men. We already know that all models are wrong; the goal should be to think about how to improve a model, rather than simply ‘rejecting’ it.

This criticism is most commonly levelled in the social sciences; Meehl (1997) argues that in a technological rather than theoretical context (e.g. simply evaluating the effect of a drug on a specific outcome measure) then null hypothesis significance testing may be more appropriate, but less for explaining more open-ended phenomena such as behaviour, where variables are less clearly defined.

Nonetheless, null hypothesis significance testing is common in the literature, so you should understand it when you see it.

The alternative hypothesis

The alternative hypothesis is, supposedly, what you are trying to show, by rejecting the null hypothesis. It represents everything that isn’t the null.

For example, suppose we want to investigate the hypothesis that Drug A lowers blood pressure more than Drug B. Then, the null hypothesis, denoted \(H_0\), is that Drugs A and B have exactly the same effect on blood pressure. If we can show that, under various assumptions, the null hypothesis is an implausible explanation of our data, then we reject the null hypothesis, and try to use this to support the alternative hypothesis, \(H_1\), that therefore Drugs A and B have different effects on blood pressure.

Testing hypotheses using confidence intervals

Recall that a confidence interval summarises what we think about the population parameter (such as the mean), based on our observed data.

Suppose we had a null hypothesis that the population mean of our uniform digits, \(\mu\), is exactly equal to 3.0.

  1. Assume the null hypothesis is true, i.e. \(\mu_0 = 3.0\)
  2. Construct a \(100(1-\alpha)\%\) confidence interval around \(\bar x - \mu_0\)
  3. Is the difference between observed sample mean and (proposed) population mean plausible?

(Equivalently: you can draw a confidence interval around \(\mu_0\) and see if it contains \(\bar x\), or draw a confidence interval around \(\bar x\) and see if it contains \(\mu_0\).)

On our digits data, we see that the more data we have, the more likely we are to (correctly in this case) reject the null hypothesis (obviously we know the true mean is \(\mu = 4.5\) here, not 3).

test3 <- samples %>%
  mutate(lower = x_bar - 3.0 + qt(.025, n - 1) * s / sqrt(n),
         upper = x_bar - 3.0 + qt(.975, n - 1) * s / sqrt(n),
         plausible = 0 >= lower & 0 <= upper)

test3 %>%
  arrange(n, x_bar) %>%
  mutate(order = row_number()) %>%
  ggplot() +
  aes(xmin = lower, xmax = upper, y = order) +
  geom_linerange(aes(colour = plausible)) +
  facet_wrap(~n, labeller = label_both, ncol = 1) +
  geom_vline(xintercept = 0, linetype = 'dashed') +
  xlab(expression(bar(x) - mu[0])) +
  scale_y_continuous(NULL, breaks = NULL) +
  scale_colour_discrete(NULL,
                        labels = c(expression('reject'~H[0]),
                                   expression('do not reject'~H[0])))

Because \(\mu \neq 3.0\), every time we fail to reject \(H_0\) in this example, we are committing a type II error: failing to reject a false null hypothesis, i.e. a false negative conclusion. As we can see, the type II error rate (usually denoted \(\beta\)) decreases as the sample size (and hence the statistical power) increases. A sample size of \(n=5\) appears severely underpowered, while \(n=100\) digits per sample gives good statistical power for detecting this difference.

test3 %>%
  summarise(typeII = mean(plausible))
# A tibble: 3 x 2
      n typeII
  <dbl>  <dbl>
1     5  0.874
2    25  0.319
3   100  0    

Type I error (\(\alpha\))

A type I error means the null hypothesis is true, but we reject it based on the data. In other words, a type I error is a false positive result. The type I error rate is equal to the significance level of the test, i.e. one minus the confidence level: a 95% confidence interval corresponds to a 5% type I error rate, and a 99% confidence interval to a 1% rate.
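We can see this empirically in the digits simulation from earlier: for the largest samples (where the normal approximation is good), the proportion of 95% intervals that missed the true mean of 4.5 was close to the nominal 5%.

confints %>%
  filter(n == 100) %>%
  summarise(type_I = mean(!contains_mu))  # roughly 1 - 0.943 = 0.057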

Type II error (\(\beta\))

A type II error means the null hypothesis is not true, but our sample does not provide enough evidence to reject it. In other words, a type II error is a false negative result. The type II error rate depends on the study size and the size of the true effect: large effects are easier to detect than small ones. The power of a study, denoted \(1 - \beta\), is the probability of detecting a given effect, if it exists (i.e. the sensitivity).
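Power calculations are usually performed before collecting data. As a rough sketch, using base R’s power.t.test with purely illustrative inputs (a true difference of 1.5 and a standard deviation of 2.87, roughly that of the digits population):

# Power of a one-sample t-test with n = 25 to detect a difference of 1.5
power.t.test(n = 25, delta = 1.5, sd = 2.87, sig.level = 0.05,
             type = 'one.sample')
# Sample size needed for 90% power to detect the same difference
power.t.test(delta = 1.5, sd = 2.87, power = 0.9, sig.level = 0.05,
             type = 'one.sample')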

Example: one sample t-test

The following data are uterine weights (in mg) for a sample of 20 rats. Previous work suggests that the mean uterine weight for the stock from which the sample was drawn was 24mg. Does this sample confirm that suggestion?

x <- c(9, 14, 15, 15, 16, 18, 18, 19, 19, 20,
       21, 22, 22, 24, 24, 26, 27, 29, 30, 32)
mean(x)
[1] 21
sd(x)
[1] 5.91163

Here, we have a sample mean, \(\bar x = 21\), and wish to compare it with a hypothetical value, \(\mu_0 = 24\). The population standard deviation \(\sigma\) is not known, so we estimate it from the data to be \(s = 5.91\).

Hence, we construct a confidence interval around the difference \(\bar x - 24\), using the quantiles of the Student’s \(t\)-distribution with 19 degrees of freedom. This procedure is called a one sample t-test.

The 95% confidence interval for the difference in means is

(mean(x) - 24) + qt(c(.025, .975), 19) * sd(x) / sqrt(20)
[1] -5.766728 -0.233272

and it does not contain zero. This means that, if the true population mean were really \(\mu_0 = 24\), we would expect to see a sample mean this far from 24 less than 5% of the time. The data offer evidence against the null hypothesis: rather, we’d say with 95% confidence that the population mean lies in the interval

mean(x) + qt(c(.025, .975), 19) * sd(x) / sqrt(20)
[1] 18.23327 23.76673

You can speed up this procedure in R by simply calling the t.test function or fitting a (null) linear model:

t.test(x - 24)

    One Sample t-test

data:  x - 24
t = -2.2695, df = 19, p-value = 0.03507
alternative hypothesis: true mean is not equal to 0
95 percent confidence interval:
 -5.766728 -0.233272
sample estimates:
mean of x 
       -3 
t.test(x) # interval does not contain 24

    One Sample t-test

data:  x
t = 15.886, df = 19, p-value = 1.996e-12
alternative hypothesis: true mean is not equal to 0
95 percent confidence interval:
 18.23327 23.76673
sample estimates:
mean of x 
       21 
confint(lm(x ~ 1)) # to get confidence interval only
               2.5 %   97.5 %
(Intercept) 18.23327 23.76673

Example: unpaired (two-sample) t-test

Rather than comparing a hypothetical value for the mean with sample observations, we can compare two samples, to investigate whether they are drawn from a population with the same mean value.

For two samples, \(x\) and \(y\), of sample sizes \(n_x\) and \(n_y\) respectively, the difference in sample means, \(\bar y - \bar x\), estimates the difference between the population means; under the null hypothesis of a common population mean, its expected value is zero.

What is the sampling distribution of the statistic \(\bar y - \bar x\)? Assuming the population standard deviation(s) is unknown, the mean difference, divided by its standard error, has a Student’s \(t\)-distribution on \(n_x + n_y - 2\) degrees of freedom.

What is the standard error of \(\bar y - \bar x\)? We can either assume the two populations share a common variance and pool the two sample variances, or allow each group its own variance (as in Welch’s test, the default for t.test in R).

In the latter, more general case, the estimated standard error of \(\bar y - \bar x\) is \[\text{se}(\bar y - \bar x) = \sqrt{\frac{s_y^2}{n_y} + \frac{s_x^2}{n_x}}.\]

Hence, we can draw a confidence interval around our statistic to see if the difference in sample means is within the range we’d expect from a population with true common mean.

Using the height and weight dataset (htwt.csv), let’s investigate whether the height in centimetres (as measured by the nurse) is equal between male and female patients.

We can perform the test in three ways and compare the result:

  1. By hand, using mathematical formulae
  2. Using the t.test function
  3. Using a linear regression model
# Method 1
htwt %>%
  filter(!is.na(nurseht)) %>%
  group_by(sex) %>%
  # Compute summary statistics
  summarise(mean = mean(nurseht),
            sd = sd(nurseht),
            n = n()) %>%
  # Generate a confidence interval for the mean difference
  summarise(stderror = sqrt(sum(sd^2 / n)),
            lower = diff(mean) + qt(.025, sum(n) - 2) * stderror,
            upper = diff(mean) + qt(.975, sum(n) - 2) * stderror)
# A tibble: 1 x 3
  stderror lower upper
     <dbl> <dbl> <dbl>
1    0.673  11.9  14.5
# Method 2
with(htwt, t.test(nurseht ~ sex))

    Welch Two Sample t-test

data:  nurseht by sex
t = -19.579, df = 359.17, p-value < 2.2e-16
alternative hypothesis: true difference in means is not equal to 0
95 percent confidence interval:
 -14.50728 -11.85899
sample estimates:
mean in group female   mean in group male 
            159.7740             172.9571 
# Method 3
confint(lm(nurseht ~ sex, data = htwt))
                2.5 %    97.5 %
(Intercept) 158.90933 160.63869
sexmale      11.87259  14.49368

All three methods give essentially the same result: if the populations of male and female patients really had a common mean height, it would be very unlikely that we would see a difference in sample means as large as the one observed here. Thus, we conclude that male and female patients in this population probably have different mean heights (and that this mean difference is between about 11.9 and 14.5 cm).

Example: comparing proportions

We wish to compare sample proportions \(p_1 = \frac a{n_1}\) and \(p_2 = \frac b{n_2}\).

The null hypothesis is the populations have a common value for the proportion: \(\pi_1 = \pi_2 = \pi\).

The standard error of \(p_1 - p_2\) is \(\sqrt{\pi (1- \pi)(\frac1{n_1}+\frac1{n_2})}\), where we estimate \(\pi\) with \(p = \frac{a + b}{n_1 + n_2}\).

Then \[\frac{p_1 - p_2}{\sqrt{p(1-p)(\frac1{n_1} + \frac1{n_2})}}\] can be compared to a standard normal distribution.

(Equivalently, we can compare \(p_1 - p_2\) with a normal distribution with standard deviation equal to the standard error given in the denominator.)

Use the dataset epicourse.csv, which describes a sample of patients and whether or not they experience various types of pain.

Let’s investigate whether the proportion of patients complaining of back pain (back_p) is different between the sexes.

(xtab <- with(epicourse, table(sex, back_p)))
   back_p
sex   no  yes
  F 1694  637
  M 1739  445

Using the formulae above, we can calculate the test statistic (the difference in observed proportions) and produce a confidence interval for the difference.

n <- rowSums(xtab)
p_MF <- xtab[, 'yes'] / n
p <- sum(xtab[, 'yes']) / sum(n)
stderror <- sqrt(p * (1 - p) * (1 / n[1] + 1 / n[2]))
qnorm(c(.025, .975), mean = diff(p_MF), sd = stderror)
[1] -0.09443435 -0.04460304

The 95% confidence interval for the difference in proportions does not contain zero, implying that such a difference in observed proportions would be unlikely if there really were no difference in risk between the sexes in the population. Thus we reject the null hypothesis at the 5% level.

The above procedure used a normal approximation for the difference in two proportions, which is equivalent to a \(\chi^2\) test on the \(2\times2\) table. We can also perform the \(\chi^2\) test directly on the contingency table using dedicated R functions:

prop.test(xtab)

    2-sample test for equality of proportions with continuity
    correction

data:  xtab
X-squared = 29.525, df = 1, p-value = 5.519e-08
alternative hypothesis: two.sided
95 percent confidence interval:
 -0.09471380 -0.04432359
sample estimates:
   prop 1    prop 2 
0.7267267 0.7962454 
chisq.test(xtab) # same result but less detailed output

    Pearson's Chi-squared test with Yates' continuity correction

data:  xtab
X-squared = 29.525, df = 1, p-value = 5.519e-08

\(p\)-values

In some of the R output above you may have seen mention of \(p\)-values. When performing a null hypothesis significance test, a \(p\)-value is a less informative summary of the same procedure. Unlike a confidence interval, it conflates sample size with effect size, and is generally inferior, as well as being harder to interpret and less transparent about the underlying process. However, for some reason, many researchers in the literature like to report \(p\)-values, so you should know what they are when you see them.

A \(p\)-value is the probability, under the null hypothesis, of observing a test statistic at least as extreme as the one actually observed (for a two-sided test; a one-sided test gives the probability that the statistic is ‘at least this small’ or ‘at least this large’). If we drew a confidence interval whose boundary fell exactly on our test statistic, its significance level would be \(p\). In other words, the \(p\)-value locates the test statistic within the tails of the null sampling distribution.

The critical region is the tail (usually 2.5% or 5%) region in which a test statistic would result in a \(p\)-value that would reject the null hypothesis. The critical value is the boundary that defines the region: typically the 2.5th or 97.5th quantile of the null sampling distribution.
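For a test statistic compared against a standard normal null distribution, the two-sided \(p\)-value is just the probability mass in both tails beyond the observed value; a sketch with a hypothetical statistic:

z <- 2.1  # hypothetical observed test statistic
2 * pnorm(abs(z), lower.tail = FALSE)  # about 0.036, so inside the 5% critical region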

Digits and \(p\)-values

Let’s demonstrate this graphically, using example of the uniformly-sampled digits data, and the null hypothesis that \(\mu_0 = 3.0\). Under the null hypothesis, the sampling distribution of the sample mean is given below, the 5% critical region shaded in pink. Think of this as the complement of a confidence interval: if the sample mean falls inside the critical region (equivalently: the null mean falls outside the confidence interval around the observed mean) then we would reject the null hypothesis.

Suppose we observe a sample mean of \(\bar x = 4.64\) with \(s = 3.04\) and sample size \(n = 25\). The sample mean is marked with the vertical red line. We can already see it falls inside the critical region (equivalently, \(\mu_0=3\) falls outside a confidence interval around the sample mean), so an observation like this would cause us to reject the null hypothesis.

The \(p\)-value is the total area to the right of the line. Because this area (probability mass) is less than 2.5%, we reject the null hypothesis. The confidence interval gives us the same information; all the \(p\)-value tells us is how far into the critical region the test statistic lies (which you could see anyway by comparing the statistic with the boundaries of the confidence interval). It’s therefore arguable that the \(p\)-value imparts no useful information that isn’t already provided by the confidence interval.

Significance tests are appropriate in technological contexts, such as evaluating the efficacy of interventions, but setting up confidence intervals is preferable. In scientific communication, authors should always present confidence intervals, so that readers who prefer them will have the information they desire without depriving the significance testers of what they want to know.

Meehl (1997)

Here is another example. Suppose we observe \(\bar x = 2.8\) with \(s = 3.56\) and \(n = 25\). Clearly, this value falls outside the critical region, so we would not reject the null hypothesis. The \(p\)-value is \(p = 0.39\): under the null hypothesis, there is a 39% chance of observing a sample mean at least this small (or roughly an 80% chance of one ‘at least this extreme’, for a two-sided test).
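That quoted \(p\)-value can be reproduced (approximately) from the summary statistics above; a sketch using the \(t\)-distribution with 24 degrees of freedom:

# One-sided probability of a sample mean of 2.8 or smaller, if mu really were 3
t_stat <- (2.8 - 3.0) / (3.56 / sqrt(25))
pt(t_stat, df = 24)  # approximately 0.39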

Proportions and \(p\)-values

Let’s look at the back pain data, again.

xtab
   back_p
sex   no  yes
  F 1694  637
  M 1739  445

To derive the \(\chi^2\) test result given by prop.test or chisq.test, we calculate the \(X^2\) statistic manually and compare it with the quantiles of the \(\chi_1^2\) distribution.

(expected <- as.table(rowSums(xtab) %*% t(colSums(xtab) / sum(xtab))))
        no      yes
A 1772.386  558.614
B 1660.614  523.386
(Xsq <- sum((xtab - expected)^2 / expected))
[1] 29.90576

Or with Yates’ correction for continuity,

(Xsq_Yates <- sum((abs(xtab - expected) - 0.5)^2 / expected))
[1] 29.52546

Then we compare this with the theoretical \(\chi_1^2\) distribution to get the \(p\)-value (or, equivalently, just see that it falls outside the confidence interval for the test statistic).

qchisq(c(.025, .975), df = 1) # conf. interval does not contain 30
[1] 0.0009820691 5.0238861873
pchisq(Xsq_Yates, df = 1, lower.tail = FALSE)
[1] 5.51871e-08

Graphically, we can see the \(\chi_1^2\) distribution (which is heavily right-skewed, even on this truncated graphic) has a 95% interval that ends at about 5, so the test statistic falls well outside it. The amount of probability mass to the right of the test statistic (marked by the red line), i.e. the \(p\)-value, is trivially small.

Equivalently, let’s look at the confidence interval for the difference in proportions. (We centre it at zero and then see whether the observed difference, positive or negative, falls within the region; it would be equivalent to centre the interval at the observed difference and see whether it contained zero.)

Once again we can clearly see the difference in means (in either direction—it’s a two-sided test) is well outside the 95% confidence interval, so the total amount of probability mass in the tails (the \(p\)-value) is vanishingly small:

pnorm(diff(p_MF), sd = stderror) +
  pnorm(-diff(p_MF), sd = stderror, lower.tail = FALSE)
           M 
4.535639e-08 

But clearly, we could work this out from looking at the confidence interval: knowing the exact amount of probability mass does not add any further insight. Meanwhile the confidence interval tells us that, if there were no difference in the population, we’d expect the observed difference in proportions in our sample to be smaller than about 2.5 percentage points in either direction, most (95%) of the time.

Additional example

Consider the following \(2 \times 2\) contingency tables of exposures and outcomes.

tab1 <- as.table(rbind(c(20, 11), c(6, 6)))
tab2 <- as.table(rbind(c(269, 205), c(731, 600)))
dimnames(tab1) <- dimnames(tab2) <-
  list(outcome = c('yes', 'no'),
       exposed = c('yes', 'no'))

tab1
       exposed
outcome yes no
    yes  20 11
    no    6  6
tab2
       exposed
outcome yes  no
    yes 269 205
    no  731 600

Assuming these samples are representative of the respective target populations, is exposure associated with an increased incidence of outcomes?

Example 1

In the first table, the counts are very small, so a \(\chi^2\) approximation would not be appropriate. Instead of a Pearson \(\chi^2\) test or a normal approximation, we can try Fisher’s exact test.

fisher.test(tab1)

    Fisher's Exact Test for Count Data

data:  tab1
p-value = 0.4922
alternative hypothesis: true odds ratio is not equal to 1
95 percent confidence interval:
 0.3777471 8.6272987
sample estimates:
odds ratio 
  1.792264 

The 95% confidence interval for the odds ratio contains 1, so we do not reject the null hypothesis (that the proportion experiencing the outcome is no different between exposed and unexposed subjects). The \(p\)-value is about 0.5.

(Note: fisher.test uses a conditional maximum-likelihood estimate of the odds ratio, so its estimate differs slightly from the simple sample odds ratio.)

Example 2

In the second contingency table, the counts are larger, so both the exact test and the \(\chi^2\) approximation are now appropriate.

fisher.test(tab2)

    Fisher's Exact Test for Count Data

data:  tab2
p-value = 0.5186
alternative hypothesis: true odds ratio is not equal to 1
95 percent confidence interval:
 0.8671679 1.3386358
sample estimates:
odds ratio 
   1.07697 
prop.test(tab2)

    2-sample test for equality of proportions with continuity
    correction

data:  tab2
X-squared = 0.40254, df = 1, p-value = 0.5258
alternative hypothesis: two.sided
95 percent confidence interval:
 -0.03512815  0.07172701
sample estimates:
   prop 1    prop 2 
0.5675105 0.5492111 

Confirming, the odds ratio is

tab2[1, 1] / tab2[1, 2] * tab2[2, 2] / tab2[2, 1]
[1] 1.077041

And we conclude that, because the 95% confidence interval for the odds ratio contains 1 (or, in the case of prop.test, the 95% confidence interval for the difference in proportions contains 0), we do not reject the null hypothesis that outcomes are no more likely in the exposed group than in the unexposed group: the odds ratio is estimated to lie between about 0.87 and 1.34 (difference in proportions roughly -3.5 to +7 percentage points), which includes the possibility of no difference. The \(p\)-value is again around 0.5.

Notice that the two datasets give very similar \(p\)-values, despite having vastly different confidence intervals.

Odds ratios in R

Rather than doing standalone tests, it’s better in R to fit a (generalized) linear model and interpret the output. For a logistic regression model, the parameter estimates are given as log-odds, which you can convert to odds using the exp function. You can extract confidence intervals using confint. The Intercept is the baseline log-odds of the outcome, before accounting for exposure.

df <- data.frame(
  yes = c(200, 269),
  no = c(600, 731),
  exposure = c(0, 1)
)

model <- glm(cbind(yes, no) ~ exposure, df, family = binomial)
log_odds <- coef(model)
exp(confint(model))
                2.5 %    97.5 %
(Intercept) 0.2834061 0.3903826
exposure    0.8930699 1.3662010

As you can see, we retrieve a 95% confidence interval for the odds ratio similar to the one given by fisher.test.

Applying plogis (the inverse logit function) to the intercept row gives a confidence interval for the baseline probability of the outcome in the unexposed group. Note, however, that applying it to the exposure row does not give a difference in proportions: that coefficient is on the (log) odds-ratio scale, so for an interval on the proportion-difference scale, use prop.test as above.

plogis(confint(model))
                2.5 %    97.5 %
(Intercept) 0.2208234 0.2807735
exposure    0.4717575 0.5773816

Practical exercises

All practical exercises are lifted from lecture 3 of the Statistical Modelling with Stata course.

https://personalpages.manchester.ac.uk/staff/david.selby/stata/2020-04-03-sampling/

I may write some new ones here, later.

Further reading

Robert P. Abelson (1995). Statistics as Principled Argument. https://doi.org/10.4324/9781410601155

Paul E. Meehl (1997). The problem is epistemology, not statistics: Replace significance tests by confidence intervals and quantify accuracy of risky numerical predictions. In L. L. Harlow, S. A. Mulaik, & J. H. Steiger (Eds.), What If There Were No Significance Tests? pp. 393–425. http://meehl.umn.edu/sites/meehl.dl.umn.edu/files/169problemisepistemology.pdf

Andrew Gelman et al. (1995–2020). Bayesian Data Analysis. http://www.stat.columbia.edu/~gelman/book/

David Trafimow et al. (2018). Manipulating the alpha level cannot cure significance testing. Frontiers in Psychology. https://doi.org/10.3389/fpsyg.2018.00699