Using R to generate random numbers, compute means and standard deviations, perform Student’s t-tests and calculate confidence intervals.
This worksheet is based on the third lecture of the Statistical Modelling in Stata course, created by Dr Mark Lunt and offered by the Centre for Epidemiology Versus Arthritis at the University of Manchester. The original Stata exercises and solutions are here translated into their R equivalents.
Refer to the original slides and Stata worksheets here.
In this part of the practical, you are going to repeatedly generate random samples of varying size from a population with known mean and standard deviation. You can then see for yourselves how changing the sample size affects the variability of the sample mean. Obviously, this is not something that you would ever do in data analysis, but since all of the frequentist statistics we cover in this course are based on the concept of repeated sampling, it is useful to see some repeated samples.
In RStudio: go to Session > New Session, or clear the current working environment with Session > Clear Workspace….
In the original RGui, use Misc > Remove all objects.
Or in any console, just type rm(list = ls()).
You could also create a new R Markdown document.
Create a variable, n, that is equal to 5.
n <- 5
Create a variable, x, of length n, drawn from a standard normal distribution.
The standard normal is a normal distribution with mean 0 and standard deviation 1. The rnorm command draws random normal samples of a specified size:
x <- rnorm(n, mean = 0, sd = 1)
In fact, in R the default arguments for rnorm are mean = 0 and sd = 1, so we can implicitly draw from the standard normal distribution with the more concise syntax:
x <- rnorm(n)
Calculate the mean of x.
mean(x)
Equivalently,
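For example, summing the values and dividing by the sample size gives the same answer (a sketch of one equivalent calculation):
sum(x) / n  # equivalent to mean(x)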
Repeat this process ten times, recording the mean of each sample in a variable called sample_means.
You can save yourself from mashing in the same command multiple times, and in any case you should never enter data manually if you can avoid it. You can repeat a non-deterministic function call using the function replicate.
So to draw the ten samples, we could write
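# a sketch of one way to do it: replicate evaluates rnorm(n) ten times and
# returns the results as a matrix with one sample per column
samples <- replicate(10, rnorm(n))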
This will produce all 10 samples for you with a single command. You can see the samples with
samples
You can now calculate the mean of each sample and save it in a new data frame as a column called n_5 with the command:
sample_means <- data.frame(n_5 = apply(samples, 2, mean))
The apply function tells R to apply the function mean to the columns (dimension 2, hence the 2) of the matrix samples.
Repeat the procedure with a larger sample size of n = 100.
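As before, draw the samples and append their means to the data frame as a new column (a sketch, assuming ten samples again and a column named n_100; any other sample sizes in the worksheet are handled the same way):
samples <- replicate(10, rnorm(100))          # ten samples of size 100
sample_means$n_100 <- apply(samples, 2, mean) # mean of each column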
The resulting data frame might look something like:
sample_means
Everything is in a data frame, so finding summary statistics for each column is straightforward. For each column, we can call something like
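mean(sample_means$n_5)  # e.g. the mean of the n_5 column
sd(sample_means$n_5)    # and its standard deviation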
But to save time typing that out three times, we can calculate for all the columns at once:
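sapply(sample_means, mean)  # mean of every column
sapply(sample_means, sd)    # standard deviation of every column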
A shorthand for calculating the means is:
colMeans(sample_means)
and while there isn’t a function called colVars or colSDs, you can compute the variance–covariance matrix and read the numbers off the diagonal:
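diag(var(sample_means))        # variances from the diagonal
sqrt(diag(var(sample_means)))  # standard deviations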
If you prefer dplyr:
library(dplyr)
summarise_all(sample_means, mean)
summarise_all(sample_means, sd)
And finally data.table:
library(data.table)
setDT(sample_means)
sample_means[, .(n = names(sample_means),
mean = sapply(.SD, mean),
sd = sapply(.SD, sd))]
This section is bookwork and does not require R or Stata. See the Stata solution for the answers.
If the standard deviation of the original distribution is σ, then the standard error of the sample mean is σ/√n, where n is the sample size.
This section is bookwork and does not require R or Stata. See the Stata solution for the answers.
Again using our height and weight dataset of 412 individuals, 234 (56.8%) are women and 178 (43.2%) are men.
If we take a number of smaller samples from this population, the proportion of women will vary, although they will tend to be scattered around 57%. The accompanying figure represents 50 samples, each of size 40.
What would you expect to happen if the sample sizes were bigger, say n=100?
In a sample of 40 individuals from a larger population, 25 are women. Calculate a 95% confidence interval for the proportion of women in the population.
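Although this is bookwork, you can check the arithmetic in R. A minimal sketch, using the normal approximation to the binomial:
p_hat <- 25 / 40                      # sample proportion of women
se <- sqrt(p_hat * (1 - p_hat) / 40)  # standard error of the proportion
p_hat + c(-1.96, 1.96) * se           # approximate 95% confidence interval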
Note. When sample sizes are small, the use of standard errors and the normal distribution does not work well for proportions. This is only really a problem if p (or (1−p)) is less than 5/n (i.e. there are fewer than 5 subjects in one of the groups).
Download the bpwide.csv dataset.
Load the blood pressure data in its wide form into R with the command
bpwide <- read.csv('bpwide.csv')
This is fictional data concerning blood pressure before and after a particular intervention.
It has been suggested that the variable bp_before is skewed. What do you think?
The original Stata solution calls for a histogram, for example:
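hist(bpwide$bp_before)  # histogram of blood pressure before the intervention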
However, a kernel density plot smooths out the histogram and may make it easier to visualise the distribution.
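For example:
plot(density(bpwide$bp_before))  # kernel density estimate of bp_before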
There is a slight positive (right) skew.
You can create a standalone variable:
bp_diff <- with(bpwide, bp_after - bp_before)
Or, since it is of the same length as bp_after and bp_before, you can store it as a new column in the bpwide data frame:
bpwide <- transform(bpwide, bp_diff = bp_after - bp_before)
What is the mean change in blood pressure?
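Using the bp_diff variable created above:
mean(bp_diff)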
As we are looking at a difference in means, we compare the sample blood pressure difference to a Student’s t distribution with n−1 degrees of freedom, where n is the number of patients.
By hand: the formula for a two-sided (1 − α) confidence interval for a sample of size n is
$$\Pr\left( \bar{X}_n - t_{n-1}(\alpha/2)\,\frac{S_n}{\sqrt{n}} < \mu < \bar{X}_n + t_{n-1}(\alpha/2)\,\frac{S_n}{\sqrt{n}} \right) = 1 - \alpha$$
In R, the q family of functions returns quantiles of distributions (just as we have seen the r family draws random samples from those distributions). So qnorm finds the quantiles of the normal distribution, qt the quantiles of the Student’s t-distribution, and so on.
We can use it to retrieve the 2.5% and 97.5% quantiles of the t distribution with n−1 degrees of freedom and then plug them into the above formula, like so:
n <- length(bp_diff) # or nrow(bpwide)
mean(bp_diff) + qt(c(.025, .975), df = n - 1) * sd(bp_diff) / sqrt(n)
Or, using the built-in t.test function, we get the same result:
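t.test(bpwide$bp_after, bpwide$bp_before, paired = TRUE)  # paired t-test on the original columns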
As the 95% confidence interval does not contain zero, we may reject the hypothesis that, on average, there was no change in blood pressure.
Notice that the paired t.test did not require us to calculate bp_diff. But we can pass bp_diff to t.test for a one-sample test and it will compute exactly the same test statistic, because testing whether two variables are equal is the same as testing whether their paired difference is zero.
t.test(bp_diff)
Plot a histogram of the changes in blood pressure, using hist.
hist(bp_diff, col = 'steelblue')
You could also try a kernel density plot, or even an old-fashioned stem and leaf diagram:
stem(bp_diff)
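For the kernel density plot, one option is:
plot(density(bp_diff))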
Does this confirm your answer to the previous question?
Create a variable indicating whether each subject’s blood pressure went down:
down <- with(bpwide, bp_after < bp_before)
Use the table command to see how many subjects showed a decrease in blood pressure.
table(down)
To get proportions, you can use prop.table, like so:
prop.table(table(down))
But it is just as quick to take the mean of the binary indicators (as TRUE == 1 and FALSE == 0):
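mean(down)  # proportion of subjects whose blood pressure fell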
Note. The point of this is to create a confidence interval around a proportion. Making inference about dichotomised categorical variables is rarely a good idea, although clinically meaningful classifications can be very useful in describing your data. The problem with inference is that you lose power by discarding a lot of information, and you can reverse the effect: if most people show a slight decrease in blood pressure, but a few show a large increase, the mean change may be greater than 0 even though most subjects show a decrease.
Nonetheless, given that the purpose of the exercise is to compute a binomial confidence interval, we have:
binom.test(sum(down), n)