Obtaining numerical and visual summaries of datasets in R. An introduction to the split-apply-combine approach to data analysis, using either base functions or the packages dplyr and data.table.
This worksheet is based on the second lecture of the Statistical Modelling in Stata course, offered by the Centre for Epidemiology Versus Arthritis at the University of Manchester. The original Stata exercises and solutions are here translated into their R equivalents.
Refer to the original slides and Stata worksheets here.
To distribute the pimax.dta
and htwt.dta
files in a more universal format, we use the foreign package to convert them from the Stata binary format into data frames, then save them as CSV.
datadir <- 'http://personalpages.manchester.ac.uk/staff/mark.lunt/stats/2_summarizing_data/data/'
library(foreign)
pimax <- read.dta(file.path(datadir, 'pimax.dta'))
htwt <- read.dta(file.path(datadir, 'htwt.dta'))
write.csv(pimax, 'pimax.csv', row.names = FALSE)
write.csv(htwt, 'htwt.csv', row.names = FALSE)
You don’t need to do this as I’ve converted the files for you:
This section gives you the chance to do some calculations for yourself and see how the concepts we saw in the lecture work in real life. Once upon a time, these calculations would have been done by hand: I’m sure you could do them in your head, but getting Stata R to do them for you will be quicker. However, we are going to go through the steps that you would have to perform if you were calculating them by hand, so that you can see how it works. In practice, you would simply ask Stata R to churn the results out rather than calculating them this way.
Load in the PImax dataset with the command
pimax <- read.csv('pimax.csv')
This dataset contains a single variable, pimax
, which is the maximal static inspiratory pressure, measured in cm H2O, of 25 cystic fibrosis patients.
Look at the data in the spreadsheet view using the command View
, or, in RStudio, by clicking on the pimax
Data variable in the Environment pane. You can also get a quick preview of the data set, its size and data types using the str
(structure) command, or just print the first few rows using head(pimax)
.
str(pimax)
'data.frame': 25 obs. of 2 variables:
$ id : int 1 2 3 4 5 6 7 8 9 10 ...
$ pimax: int 80 85 110 95 95 100 45 95 130 75 ...
Another way of getting a quick summary is the summary
function:
summary(pimax)
id pimax
Min. : 1 Min. : 40.0
1st Qu.: 7 1st Qu.: 75.0
Median :13 Median : 95.0
Mean :13 Mean : 92.6
3rd Qu.:19 3rd Qu.:110.0
Max. :25 Max. :150.0
You might notice that a few common summary statistics are already returned by the summary
function, but you can compute individual ones directly:
median(pimax$pimax)
[1] 95
If you don’t like the dollar sign ($
) syntax for selecting columns, you can access variables within a dataframe implicitly by wrapping the expression in with()
(a little bit like Stata’s use
, but just for one expression). In this case:
mean(pimax$pimax)
[1] 92.6
Whereas Stata’s sum
function computes a running (cumulative) sum (for some reason), R returns only the scalar total.
If you did want cumulative sums and means you could use the cumsum
function. There is no cummean
function in base R but you can easily compute it yourself. The R equivalent of Stata’s
gen n = _n
gen sum = sum(pimax)
gen mean = sum/n
is something like the following:
id pimax cummean
1 1 80 80.00000
2 2 85 82.50000
3 3 110 91.66667
4 4 95 92.50000
5 5 95 93.00000
6 6 100 94.16667
7 7 45 87.14286
8 8 95 88.12500
9 9 130 92.77778
10 10 75 91.00000
11 11 80 90.00000
12 12 70 88.33333
13 13 80 87.69231
14 14 100 88.57143
15 15 120 90.66667
16 16 110 91.87500
17 17 125 93.82353
18 18 75 92.77778
19 19 100 93.15789
20 20 40 90.50000
21 21 75 89.76190
22 22 110 90.68182
23 23 150 93.26087
24 24 75 92.50000
25 25 95 92.60000
where seq_along(x)
is just another way of writing 1:length(x)
.
The function transform
is used for adding and modifying columns in the data frame that is specified by the first argument (i.e. the data frame pimax
). But you can equally compute the same quantity as a standalone vector without adding it as a column to the original dataset:
[1] 80.00000 82.50000 91.66667 92.50000 93.00000 94.16667 87.14286
[8] 88.12500 92.77778 91.00000 90.00000 88.33333 87.69231 88.57143
[15] 90.66667 91.87500 93.82353 92.77778 93.15789 90.50000 89.76190
[22] 90.68182 93.26087 92.50000 92.60000
This particular example is a little bit confusing because the data frame and the main column of interest are both called the same name pimax
.
For this particular example, it’s not clear why you’d actually want to calculate a running sum or mean, however.
sd(pimax$pimax)
[1] 24.92154
Or, manually,
If the above expression looks a bit unwieldy, you can break it into steps as in the original Stata example.
with(pimax, {
mean <- mean(pimax)
n <- length(pimax)
diff <- pimax - mean
diff2 <- diff^2 # or diff * diff
variance <- sum(diff2) / n
sqrt(variance)
})
[1] 24.41803
Why are these answers slightly different to the value returned by sd()
above?
Hint: check the ‘Details’ section in help(sd)
.
Read the CSV file htwt.csv
by typing
htwt <- read.csv('htwt.csv')
This file includes two BMI values: bmi
, which was based on measured data and bmirep
, which was based on reported data.
If you didn’t want to save the CSV file to your desktop, you could instead read it directly from the web:
htwt <- read.csv('http://personalpages.manchester.ac.uk/staff/david.selby/stata/2020-04-02-summarising/htwt.csv')
One way to visualise a distribution of values might be by displaying a histogram, which (with base R graphics) you can achieve via the call
hist(htwt$bmi)
Or, using the popular ggplot2 visualisation package:
library(ggplot2)
ggplot(htwt) + aes(x = bmi) +
geom_histogram(bins = 20)
Another good way of showing the distribution is via a kernel density plot (which is essentially a special kind of histogram). Unlike Stata, R is less forgiving of missing values (NA
s) and will throw warnings or errors if you have not explicitly removed or specified how to deal with them.
Error in density.default(htwt$bmi): 'x' contains missing values
plot(density(htwt$bmi, na.rm = TRUE),
main = 'Distribution of measured BMI values (excluding missing)')
ggplot(htwt) + aes(bmi) + geom_density() +
ggtitle('Distribution of measured BMI values')
Warning: Removed 12 rows containing non-finite values (stat_density).
This function was mentioned briefly in the previous exercise.
Calculate summary measures of the entire htwt
dataset by simply passing the name of the dataframe:
summary(htwt)
caseno sex age nurseht
Min. :10006 Length:412 Min. :19.00 Min. :141.0
1st Qu.:11656 Class :character 1st Qu.:36.00 1st Qu.:158.5
Median :20646 Mode :character Median :48.00 Median :164.5
Mean :17938 Mean :48.41 Mean :165.5
3rd Qu.:22975 3rd Qu.:62.00 3rd Qu.:172.9
Max. :25224 Max. :76.00 Max. :192.0
NA's :10
nursewt htfeet htinch wtstone
Min. : 42.80 Min. :4.000 Min. : 0.000 Min. : 7.00
1st Qu.: 60.60 1st Qu.:5.000 1st Qu.: 2.000 1st Qu.: 9.00
Median : 70.40 Median :5.000 Median : 5.000 Median :10.00
Mean : 71.53 Mean :5.039 Mean : 5.241 Mean :10.63
3rd Qu.: 80.90 3rd Qu.:5.000 3rd Qu.: 8.000 3rd Qu.:12.00
Max. :125.00 Max. :6.000 Max. :11.000 Max. :18.00
NA's :10 NA's :2 NA's :2 NA's :5
wtlbs bmirep bmi
Min. : 0.00 Min. :15.67 Min. :17.34
1st Qu.: 0.00 1st Qu.:22.27 1st Qu.:22.65
Median : 4.00 Median :24.35 Median :25.71
Mean : 4.44 Mean :24.91 Mean :26.08
3rd Qu.: 7.00 3rd Qu.:27.06 3rd Qu.:28.66
Max. :13.00 Max. :45.28 Max. :45.47
NA's :5 NA's :6 NA's :12
Or you can hone in on a specific column, such as the measured bmi
:
summary(htwt$bmi)
Min. 1st Qu. Median Mean 3rd Qu. Max. NA's
17.34 22.65 25.71 26.08 28.66 45.47 12
Write down the mean BMI. You can refer to the summary
output above, or else compute the mean directly.
mean(htwt$bmi, na.rm = TRUE)
[1] 26.07662
Next, write down the median BMI. Again, you can use summary
, or calculate it explicitly.
median(htwt$bmi, na.rm = TRUE)
[1] 25.71006
What are the lower and upper quartiles of the data? These fairly common statistics are also given in the default summary output. To compute them yourself, use the quantile
function and specify which quantile(s) you would like.
To summarise the data separately for each sex, there are several approaches. For a single summary statistic you can use the base R command
aggregate(bmi ~ sex, data = htwt, FUN = mean)
sex bmi
1 female 25.85984
2 male 26.35534
aggregate(bmi ~ sex, data = htwt, FUN = median)
sex bmi
1 female 25.08112
2 male 26.25182
repeating for the quartiles, standard deviations and so on.
Or you can use the apply
family of functions to run summary
on each group in the data. The by
function (also known as tapply
) has the following syntax.
by(htwt$bmi, htwt$sex, FUN = summary)
htwt$sex: female
Min. 1st Qu. Median Mean 3rd Qu. Max. NA's
17.34 22.25 25.08 25.86 28.66 45.47 9
----------------------------------------------------
htwt$sex: male
Min. 1st Qu. Median Mean 3rd Qu. Max. NA's
17.96 23.57 26.25 26.36 28.65 44.92 3
This means “apply the function summary
to the variable bmi
, divided according to sex
”. Using the same syntax we can calculate a single value such as a median for each group. (Here I use with
to avoid writing htwt$
twice.)
sex: female
[1] 25.08112
----------------------------------------------------
sex: male
[1] 26.25182
However, most R users use packages such as dplyr or data.table for these common data analysis tasks, as they are slightly easier to use and more flexible. These packages will be introduced in later sections.
The distributions for the two sexes can be neatly compared graphically using box plots. The command to do this in base R is
boxplot(bmi ~ sex, data = htwt)
or, in ggplot2:
library(ggplot2)
ggplot(htwt) + aes(sex, bmi) + geom_boxplot()
However you can get a more granular look at the distribution with histograms or kernel density plots, and ggplot2 makes it easy to combine these superimposed on the same axes, or as small multiples (a lattice of separate plots).
# Superimposed
ggplot(htwt) +
aes(bmi, fill = sex) +
geom_density(alpha = 0.5) + # "alpha" means the opacity
ggtitle('Distribution of reported BMI, by sex')
# Stacked (harder to interpret)
ggplot(htwt) +
aes(bmi, fill = sex) +
geom_histogram(position = 'stack') +
ggtitle('Distribution of reported BMI, by sex')
# Small multiples
ggplot(htwt) +
aes(bmi, fill = sex) +
geom_density(alpha = 0.5) +
facet_grid(sex ~ .) +
ggtitle('Distribution of reported BMI, by sex')
# Small multiples (specify `group` else it will try to stack)
ggplot(htwt) +
aes(bmi, fill = sex, group = sex) +
geom_histogram() +
facet_grid(sex ~ .) +
ggtitle('Distribution of reported BMI, by sex')
These packages can be used to produce summary statistics, but they are much more flexible and powerful than the aggregate
and apply
commands from base R. They also tend to have more consistent syntax.
For example, in dplyr, the following gives the mean and SD of BMI.
mean sd
1 26.07662 4.597912
Importantly, there is a group_by
command, which enables you to split the data by subgroups, apply functions to each subset and then combine the results.
# A tibble: 2 x 3
sex mean sd
<chr> <dbl> <dbl>
1 female 25.9 4.90
2 male 26.4 4.17
(Here we also use the filter
command to exclude those rows where bmi
is NA
. This is equivalent to setting the argument na.rm = TRUE
in mean
and sd
calls.)
The pipe operator, %>%
is just a way of passing the output of one function to the first argument of another, to avoid death by nested brackets. That is,
%>% mean(na.rm = TRUE) x
is just another way of writing
mean(x, na.rm = TRUE)
Use dplyr’s group_by
and summarise
functions. (If you have forgotten the names of the variables to use for this, try typing str(htwt)
or colnames(htwt)
.)
htwt %>%
group_by(sex) %>%
summarise(mean_height = mean(nurseht, na.rm = T),
sd_height = sd(nurseht, na.rm = T),
mean_weight = mean(nursewt, na.rm = T),
sd_weight = sd(nursewt, na.rm = T))
# A tibble: 2 x 5
sex mean_height sd_height mean_weight sd_weight
<chr> <dbl> <dbl> <dbl> <dbl>
1 female 160. 6.40 65.9 12.8
2 male 173. 6.91 78.8 12.2
Another way avoids typing the same code twice, since we’re applying mean
and sd
to the same columns and with the same arguments.
htwt %>%
select(sex, nurseht, nursewt) %>%
group_by(sex) %>%
summarise_all(list(mean = mean, sd = sd), na.rm = TRUE)
# A tibble: 2 x 5
sex nurseht_mean nursewt_mean nurseht_sd nursewt_sd
<chr> <dbl> <dbl> <dbl> <dbl>
1 female 160. 65.9 6.40 12.8
2 male 173. 78.8 6.91 12.2
Finally, an advanced approach reshapes the data and stores it in a ‘long’ format. Sometimes this structure is useful for subsequent analyses, or for fitting lots of variables into a table for publication (where a wide table might not fit on the page).
htwt %>%
select(nurseht, nursewt, sex) %>%
tidyr::pivot_longer(-sex, names_to = 'measure') %>%
filter(!is.na(value)) %>%
group_by(measure, sex) %>%
summarise(mean = mean(value), sd = sd(value))
# A tibble: 4 x 4
# Groups: measure [2]
measure sex mean sd
<chr> <chr> <dbl> <dbl>
1 nurseht female 160. 6.40
2 nurseht male 173. 6.91
3 nursewt female 65.9 12.8
4 nursewt male 78.8 12.2
An alternative to dplyr is data.table, which can do the same thing, but with a different syntax. The package data.table is an extension to R’s default data.frame
structures, which makes filtering, grouping and manipulating tabular data easier.
The format is DT[filter, compute, group]
, i.e. the first argument in the square brackets determines which rows to include, the second determines which columns to select or which summary statistics to compute, and the last argument determines grouping variables.
library(data.table)
setDT(htwt) # convert from data.frame to data.table
htwt[!is.na(bmi), .(mean = mean(bmi), median = median(bmi), sd = sd(bmi)), by = sex]
sex mean median sd
1: female 25.85984 25.08112 4.903072
2: male 26.35534 26.25182 4.170245
Let’s calculate the same numerical summaries by group, but this time using the data.table package.
setDT(htwt)
htwt[, .(mean_height = mean(nurseht, na.rm = T),
sd_height = sd(nurseht, na.rm = T),
mean_weight = mean(nursewt, na.rm = T),
sd_weight = sd(nursewt, na.rm = T)),
by = sex]
sex mean_height sd_height mean_weight sd_weight
1: female 159.7740 6.398803 65.86416 12.7513
2: male 172.9571 6.911771 78.81250 12.2367
The data.table equivalent to dplyr::summarise_all
(see previous section) is to use the .SD
(‘Subset of Data’) and .SDcols
abbreviations, and loop over the columns in the table.
variable <- c(nurseht = 'height', nursewt = 'weight') # column labels
htwt[,
.(variable,
mean = sapply(.SD, mean, na.rm = TRUE),
sd = sapply(.SD, sd, na.rm = TRUE)),
.SDcols = names(variable),
by = sex]
sex variable mean sd
1: female height 159.77401 6.398803
2: female weight 65.86416 12.751296
3: male height 172.95714 6.911771
4: male weight 78.81250 12.236702
Here sapply
is a base function that means ‘loop over a variable, apply a function and simplify the result’.
mean(htwt$age)
[1] 48.41262
hist
or geom_histogram
. Do the ages follow a normal distribution?hist(htwt$age)
ggplot(htwt) + aes(age) + geom_histogram(binwidth = 5)
Note from David: A histogram may not be the best way of doing this. Probably more reliable to use a QQ plot:
Or in ggplot2:
ggplot(htwt) + aes(sample = age) +
stat_qq() + stat_qq_line()
htwt %>%
group_by(sex) %>%
summarise_at('age', list(Youngest = min, Oldest = max))
# A tibble: 2 x 3
sex Youngest Oldest
<chr> <int> <int>
1 female 19 74
2 male 19 76
Or, in base R:
by(htwt$age, htwt$sex, range)
htwt$sex: female
[1] 19 74
----------------------------------------------------
htwt$sex: male
[1] 19 76
Or, in data.table syntax:
sex youngest oldest
1: female 19 74
2: male 19 76
bmidiff <- with(htwt, bmi - bmirep)
Write down its mean value, standard deviation and the number of subjects for whom both BMI measures are available.
Whilst you can do this in R, arguably quantile–quantile plots are a better way of visualising the normality (or otherwise) of the sample distribution.
ggplot(htwt) +
aes(sample = nurseht) +
stat_qq() + stat_qq_line() +
facet_wrap(~ sex)
If you really want to super-impose a density curve over a histogram, you can do so as follows.
In base R, the following code split
s the dataset into two, according to the sex
of the patients. Then using these partial datasets (temporarily named subset
each time), a histogram is plotted, followed by a normal density curve on the same axes (with mean and standard deviation equal to the respective sample estimates from each subset).
You don’t have to use a loop for this—you could just plot each sex manually. But if you had a lot of groups then this approach saves a lot of repetition.
You can also do this in ggplot2, but if you want to use facet
s then you need to calculate the densities yourself.
library(dplyr)
densities <- htwt %>%
group_by(sex) %>%
filter(!is.na(nurseht)) %>%
summarise(mean = mean(nurseht), sd = sd(nurseht),
min = min(nurseht), max = max(nurseht),
nurseht = seq(min, max, length = 100),
density = dnorm(nurseht, mean, sd))
ggplot(htwt) +
aes(nurseht, ..density..) +
geom_histogram() +
geom_line(aes(y = density), data = densities, colour = 'tomato2') +
facet_wrap(~sex, nrow = 2, scales = 'free_y')
You’re still better off with a QQ plot!
While the original Stata worksheet suggests this, it is not a particularly reproducible workflow for writing an analysis report.
R Markdown (or Overleaf, or Jupyter) should be used instead. Then, if the original code is updated, the final graph will also be updated and no manual formatting is required.
If the output format must be Word (versus HTML, PDF), one can create an R Markdown document that renders a Word document as follows.
---
output: word_document
---
Text goes here.
```{r}
# R code here
```
Text continues.
```{r}
# Further R code here
```
More text.
See Literate Programming in R Markdown to learn more about dynamic document writing.