Performing null hypothesis significance tests for means, proportions and variances with one or two samples, and using base R and community packages for power calculations.
This worksheet is based on the fourth lecture of the Statistical Modelling in Stata course, offered by the Centre for Epidemiology Versus Arthritis at the University of Manchester. The original Stata exercises and solutions are here translated into their R equivalents.
Refer to the original slides and Stata worksheets here.
This section is bookwork and does not require R or Stata.
In a random sample of 80 women from Manchester, 13 were asthmatic; this could be used to calculate a 95% confidence interval for the proportion of women in Manchester with asthma, which could then be compared to the suggested prevalence of 20% in Northern England. An alternative approach is to test the hypothesis that the true proportion, π, is 0.20.
What is the expected proportion of women with asthma under the null hypothesis?
What is the observed proportion of women with asthma?
What is the standard error of the expected proportion? Remember from last week that the standard error of a proportion p is given by \[\sqrt{\frac{p (1 - p)}{n}}.\]
The appropriate test statistic, T, is given by the formula: \[T = \frac{\text{observed proportion} - \text{expected proportion}}{\text{standard error of proportion}}.\] Calculate T.
T should be compared to a t-distribution with how many degrees of freedom?
From tables for the appropriate t-distribution, the corresponding p-value is 0.4. Is it reasonable to suppose that these women are a random sample from a population in which the prevalence of asthma is 20%?
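The bookwork above can be checked in R; here is a quick sketch of the same calculation, using the numbers from the question:

```r
n     <- 80            # sample size
x     <- 13            # number of asthmatic women
p0    <- 0.20          # expected proportion under the null hypothesis
p_hat <- x / n         # observed proportion: 0.1625

se <- sqrt(p0 * (1 - p0) / n)       # standard error under the null
T  <- (p_hat - p0) / se             # test statistic: about -0.84

# Two-sided p-value from a t-distribution with n - 1 = 79 degrees of freedom
p_value <- 2 * pt(-abs(T), df = n - 1)
round(c(T = T, p = p_value), 2)
```

The p-value of about 0.4 agrees with the tables, so there is no evidence against a prevalence of 20%.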
This section is bookwork and does not require R or Stata.
In the sample of heights and weights we have looked at, there were 412 individuals, of whom 234 were women. We wish to test whether there are equal numbers of men and women in the population.
What is the null hypothesis proportion of women?
What is the observed proportion of women?
What is the null hypothesis standard error for the proportion of women?
What is an appropriate statistic for testing the null hypothesis?
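Again, a sketch of the calculation in R, using the numbers from the question:

```r
n     <- 412           # number of individuals
x     <- 234           # number of women
p0    <- 0.5           # null hypothesis: equal numbers of men and women
p_hat <- x / n         # observed proportion of women: about 0.568

se <- sqrt(p0 * (1 - p0) / n)   # null hypothesis standard error
T  <- (p_hat - p0) / se         # test statistic: about 2.76

2 * pt(-abs(T), df = n - 1)     # two-sided p-value
```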
Load htwt.csv
into R either by saving the CSV file to your desktop and running the command
htwt <- read.csv('htwt.csv')
or else you can import it directly from the internet.
htwt <- read.csv('http://personalpages.manchester.ac.uk/staff/david.selby/stata/2020-04-02-summarising/htwt.csv')
We wish to test whether the mean height is the same in men and women.
What is the null hypothesis difference in height between men and women?
Use the function t.test()
to test whether the mean height differs between men and women.
What is the mean height in men?
What is the mean height in women?
What is the mean difference in height between men and women, with its 95% confidence interval?
Which of the three hypothesis tests is the appropriate one in this instance?
What is the p-value from the t-test?
What would you therefore conclude?
t.test(nurseht ~ sex, data = htwt)
Welch Two Sample t-test
data: nurseht by sex
t = -19.579, df = 359.17, p-value < 2.2e-16
alternative hypothesis: true difference in means is not equal to 0
95 percent confidence interval:
-14.50728 -11.85899
sample estimates:
mean in group female mean in group male
159.7740 172.9571
Unlike Stata, the R function stats::t.test does not perform all three tests at once. Check the documentation with help(t.test) to decide what to pass to the argument alternative. In this case it turns out that the default value, two.sided, is actually the correct choice.
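To see what the alternative argument does, here is a small sketch using simulated heights rather than the htwt data (the group means and standard deviations below are made up for illustration):

```r
set.seed(2020)
heights <- data.frame(
  sex    = rep(c('female', 'male'), each = 50),
  height = c(rnorm(50, mean = 160, sd = 6),    # simulated female heights
             rnorm(50, mean = 173, sd = 7))    # simulated male heights
)

# The default, two-sided: H1 is that the group means differ
t.test(height ~ sex, data = heights)

# One-sided: H1 is that the first group (female) has the smaller mean
t.test(height ~ sex, data = heights, alternative = 'less')
```

With the formula interface the groups are taken in alphabetical order, so alternative = 'less' tests whether the female mean is less than the male mean; when the observed difference lies in the hypothesised direction, the one-sided p-value is half the two-sided one.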
Compare BMI (based on the measured values, i.e. bmi) between men and women in htwt.csv, again using the t.test() function in R.
t.test(bmi ~ sex, data = htwt)
Welch Two Sample t-test
data: bmi by sex
t = -1.0911, df = 394.79, p-value = 0.2759
alternative hypothesis: true difference in means is not equal to 0
95 percent confidence interval:
-1.388285 0.397293
sample estimates:
mean in group female mean in group male
25.85984 26.35534
The means and confidence interval are given in the test output, above.
There are a few different tests we can do here.
var.test(bmi ~ sex, data = htwt)
bartlett.test(bmi ~ sex, data = htwt)
ansari.test(bmi ~ sex, data = htwt)
mood.test(bmi ~ sex, data = htwt)
In the Stata documentation, it appears that sdtest performs Levene's test, using the mean as the estimator of central tendency, but offers the option of the more robust test of Brown and Forsythe (1974), which uses the median or a trimmed mean instead. Levene's test is available in R via the car (Companion to Applied Regression) package.
library(car)
leveneTest(bmi ~ sex, data = htwt, center = mean)              # Stata default
leveneTest(bmi ~ sex, data = htwt, center = median)            # Stata alternative W_50
leveneTest(bmi ~ sex, data = htwt, center = mean, trim = 0.1)  # Stata alternative W_10
Each of these tests gives a different result, implying that there is not a lot of evidence either way and that the conclusions depend on the assumptions of each test, especially whether the data in each group are normally distributed. A visualisation allows us to check this, and suggests that the two groups have roughly similar scale, but are skewed by several outliers.
library(ggplot2)
ggplot(htwt) +
aes(x = bmi, fill = sex) +
geom_histogram(bins = 25) +
facet_wrap(~sex, nrow = 2)
I would probably conclude that there is insufficient evidence to support the hypothesis that the variances are unequal. The only tests that suggest otherwise depend on the assumption of normality, which does not appear to hold once the small number of very high BMIs is taken into account. Thus, an analysis that uses Stata's sdtest with its default settings could lead to an erroneous conclusion in this case.
Repeat the t-test using the var.equal option. Are your conclusions any different? In R, the default for a two-sample t-test is var.equal = FALSE, which means the variances are not assumed to be equal (a Welch test). To perform a test that assumes equal variances, set this argument to TRUE.
t.test(bmi ~ sex, data = htwt, var.equal = TRUE)
Two Sample t-test
data: bmi by sex
t = -1.0694, df = 398, p-value = 0.2855
alternative hypothesis: true difference in means is not equal to 0
95 percent confidence interval:
-1.4064007 0.4154089
sample estimates:
mean in group female mean in group male
25.85984 26.35534
The conclusion is the same, in any case.
Download the bpwide.csv
dataset to your computer and import it into R with the command
bpwide <- read.csv('bpwide.csv')
or download it directly from the internet with
bpwide <- read.csv('http://personalpages.manchester.ac.uk/staff/david.selby/stata/2020-04-03-sampling/bpwide.csv')
This consists of fictional blood pressure data, taken before and after an intervention. We wish to determine whether the intervention affected blood pressure.
summary(bpwide[, 4:5])
bp_before bp_after
Min. :138.0 Min. :125.0
1st Qu.:147.0 1st Qu.:140.8
Median :154.5 Median :149.5
Mean :156.4 Mean :151.4
3rd Qu.:164.0 3rd Qu.:161.0
Max. :185.0 Max. :185.0
First, create a new variable to hold the change in blood pressure:
bp_diff <- with(bpwide, bp_after - bp_before)
Use the t.test function to test whether the change in blood pressure is statistically significant. Is it?
t.test(bp_diff)
One Sample t-test
data: bp_diff
t = -3.3372, df = 119, p-value = 0.00113
alternative hypothesis: true mean is not equal to 0
95 percent confidence interval:
-8.112776 -2.070557
sample estimates:
mean of x
-5.091667
We can avoid creating the intermediate variable by performing an equivalent paired test:
with(bpwide, t.test(bp_after, bp_before, paired = TRUE))
Paired t-test
data: bp_after and bp_before
t = -3.3372, df = 119, p-value = 0.00113
alternative hypothesis: true difference in means is not equal to 0
95 percent confidence interval:
-8.112776 -2.070557
sample estimates:
mean of the differences
-5.091667
The difference appears to be statistically significant at the 5% level.
It is already given in the output above, but to be thorough, we can work through it step by step. As we are looking at a difference in means, we compare the sample blood pressure difference to a Student’s t distribution with \(n - 1\) degrees of freedom, where \(n\) is the number of patients.
By hand: the formula for a two-sided \((1-\alpha)\) confidence interval for a sample of size \(n\) is \[Pr\left(\bar{X}_n - t_{n-1}(\alpha/2) \frac{S_n}{\sqrt n} < \mu < \bar{X}_n + t_{n-1}(\alpha/2) \frac{S_n}{\sqrt n}\right) = 1 - \alpha.\]
In R, the q
family of functions returns quantiles of distributions (just as we have seen the r
family draws random samples from those distributions). So qnorm
finds the quantiles of the normal distribution, qt
the quantiles of the Student’s t-distribution, and so on. We can use it to retrieve the 2.5% and 97.5% quantiles of the t distribution with \(n-1\) degrees of freedom and then plug them into the above formula, like so:
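For instance, the familiar critical value 1.96 is the 97.5% quantile of the standard normal distribution, and its t-distribution counterpart is slightly larger:

```r
qnorm(.975)          # 97.5% quantile of the standard normal: about 1.96
qt(.975, df = 119)   # the same quantile of the t-distribution with 119 df
```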
n <- length(bp_diff)
mean(bp_diff) + qt(c(.025, .975), df = n - 1) * sd(bp_diff) / sqrt(n)
[1] -8.112776 -2.070557
This interval is the same as that given in the t.test
output in the previous question.
A near equivalent to Stata's sampsi command in this case is the power.prop.test function. There are also the analogous functions power.t.test and power.anova.test. They can be used to compute the power of a test, or the parameters needed to obtain a specified power.
power.prop.test(power = .9, p1 = .25, p2 = .4)
Two-sample comparison of proportions power calculation
n = 202.8095
p1 = 0.25
p2 = 0.4
sig.level = 0.05
power = 0.9
alternative = two.sided
NOTE: n is number in *each* group
The difference between R and Stata here is that sampsi (which was replaced by power in Stata 13) uses a continuity correction by default, whereas power.prop.test in R does not.
The Hmisc package contains some useful functions for power calculations for two-sample binomial tests.
power.prop.test(n = 100, p1 = .25, p2 = .4)
Two-sample comparison of proportions power calculation
n = 100
p1 = 0.25
p2 = 0.4
sig.level = 0.05
power = 0.6211764
alternative = two.sided
NOTE: n is number in *each* group
library(Hmisc)
bpower(.25, .4, n = 200)
Power
0.6211857
There is no argument for heteroscedastic groups in power.t.test, but all that is needed is to plug in the pooled standard deviation, using the formula \[\sigma_{\text{pooled}}^2 = \frac{n_1\sigma_1^2 + n_2\sigma_2^2}{n_1 + n_2}.\]
power.t.test(delta = 17 - 15,
sd = sqrt((5^2 + 6^2) / 2),
power = .9)
Two-sample t test power calculation
n = 161.2047
delta = 2
sd = 5.522681
sig.level = 0.05
power = 0.9
alternative = two.sided
NOTE: n is number in *each* group
power.t.test(delta = 17 - 15,
sd = sqrt((5^2 + 6^2) / 2),
power = .95)
Two-sample t test power calculation
n = 199.1349
delta = 2
sd = 5.522681
sig.level = 0.05
power = 0.95
alternative = two.sided
NOTE: n is number in *each* group
There is no base R function for power calculations for two-sample t-tests with unequal group sizes, but we can try to work it out ourselves.
Recall that power is the probability of rejecting the null hypothesis when it is false: that is, the probability, under the alternative hypothesis, of the test statistic exceeding the critical value.
The degrees of freedom of the \(t\) distribution for a two-sample \(t\)-test are given by the formula \[m = \frac{ \left(\frac{\sigma_1^2}{n_1} + \frac{\sigma_2^2}{n_2}\right)^2 }{ \dfrac{\left(\frac{\sigma_1^2}{n_1}\right)^2}{\small{n_1 - 1}} + \dfrac{\left(\frac{\sigma_2^2}{n_2}\right)^2}{\small{n_2 - 1}} },\] and the non-centrality parameter (of the \(t\)-distribution under the alternative hypothesis) is given by \[\lambda = \frac{\delta}{\sqrt{\frac{\sigma_1^2}{n_1} + \frac{\sigma_2^2}{n_2}}}.\]
So we can write functions to compute each, calculate the corresponding critical values of the central \(t\)-distribution and the \(p\)-values of the non-central \(t\)-distribution and we shall get the power for a given pair of group sizes.
degfreedom <- function(sd1, sd2, n1, n2) {
numerator <- (sd1^2 / n1 + sd2^2 / n2)^2
denom1 <- (sd1^2 / n1)^2 / (n1 - 1)
denom2 <- (sd2^2 / n2)^2 / (n2 - 1)
numerator / (denom1 + denom2)
}
noncentrality <- function(delta, sd1, sd2, n1, n2) {
delta / sqrt(sd1^2 / n1 + sd2^2 / n2)
}
tpower <- function(delta, sd1, sd2, n1, n2) {
  m <- degfreedom(sd1, sd2, n1, n2)
  lambda <- noncentrality(delta, sd1, sd2, n1, n2)
  critval <- qt(c(.025, .975), df = m)
  # Power: mass of the non-central t-distribution beyond the critical values
  pt(critval[1], lower.tail = TRUE, df = m, ncp = lambda) +
    pt(critval[2], lower.tail = FALSE, df = m, ncp = lambda)
}
Compute the power over a grid of candidate group sizes and plot the result, to see which gives approximately 90% power.
points <- data.frame(n2 = seq(150, 400, by = 10))
points$power <- sapply(points$n2,
                       function(n2) tpower(2, 5, 6, 100, n2))
plot(power ~ n2, data = points, type = 'l')
abline(h = .9, lty = 2, col = 'red')
abline(v = 283, lty = 2, col = 'red')  # approximate position where the lines meet
This gives an optimum size of approximately 280 for the second group.
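Instead of reading the answer off the plot, we can solve for the group size numerically with uniroot(). The sketch below restates the power calculation so that it is self-contained:

```r
# Power of a Welch two-sample t-test (the same calculation as tpower() above)
welch_power <- function(delta, sd1, sd2, n1, n2) {
  m <- (sd1^2 / n1 + sd2^2 / n2)^2 /
    ((sd1^2 / n1)^2 / (n1 - 1) + (sd2^2 / n2)^2 / (n2 - 1))
  lambda <- delta / sqrt(sd1^2 / n1 + sd2^2 / n2)
  critval <- qt(c(.025, .975), df = m)
  pt(critval[1], df = m, ncp = lambda) +
    pt(critval[2], df = m, ncp = lambda, lower.tail = FALSE)
}

# Find n2 such that the power is exactly 90%
root <- uniroot(function(n2) welch_power(2, 5, 6, 100, n2) - 0.9,
                interval = c(150, 400))
root$root   # about 283
```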
You don’t need to roll your own function every time you want to do this calculation! A bit of Googling reveals the MESS package has exactly the tool we seek.
Specify all but one of the parameters to power_t_test() and it will compute the remaining one. Here we know the size of the smaller group (n), the difference in means (delta), the standard deviation of the smaller group (sd), the desired power and the ratio of the standard deviations (sd.ratio). We set ratio, the relative size of the two groups, to NULL and it will be estimated for us.
# install.packages('MESS')
library(MESS)
power_t_test(n = 100, # size of smaller group
delta = 17 - 15, # mean difference
sd = 5, # sd of smaller group
ratio = NULL, # relative size of the 2 groups
power = .9, # power
sd.ratio = 6 / 5) # relative size of standard deviations
Two-sample t test power calculation with unequal sample sizes and unequal variances
n = 100.0000, 283.1456
delta = 2
sd = 5, 6
sig.level = 0.05
power = 0.9
alternative = two.sided
NOTE: n is vector of number in each group
The output sample size is similar to that indicated by our hand calculation, above.
This isn’t the only tool you could use. You might also like to try the pwr package.