Lecture 2: Summarising Data

This worksheet is based on the second lecture of the Statistical Modelling in Stata course, created by Dr Mark Lunt and offered by the Centre for Epidemiology Versus Arthritis at the University of Manchester. The original Stata exercises and solutions are here translated into their R equivalents.

To distribute the pimax.dta and htwt.dta files in a more universal format, we use the foreign package to convert them from the Stata binary format into data frames, then save them as CSV.

datadir <- 'http://personalpages.manchester.ac.uk/staff/mark.lunt/stats/2_summarizing_data/data/'

library(foreign)
pimax <- read.dta(file.path(datadir, 'pimax.dta'))
htwt <- read.dta(file.path(datadir, 'htwt.dta'))

write.csv(pimax, 'pimax.csv', row.names = FALSE)
write.csv(htwt, 'htwt.csv', row.names = FALSE)

Hand calculations

This section gives you the chance to do some calculations for yourself and see how the concepts we saw in the lecture work in real life. Once upon a time, these calculations would have been done by hand: I’m sure you could do them in your head, but getting ~~Stata~~ R to do them for you will be quicker. However, we are going to go through the steps that you would have to perform if you were calculating them by hand, so that you can see how it works. In practice, you would simply ask ~~Stata~~ R to churn the results out rather than calculating them this way.

pimax <- read.csv('pimax.csv')

This dataset contains a single variable, pimax, which is the maximal static inspiratory pressure, measured in cm H₂O, of 25 cystic fibrosis patients.

Look at the data in the spreadsheet view using the command View, or, in RStudio, by clicking on the pimax Data variable in the Environment pane. You can also get a quick preview of the data set, its size and data types using the str (structure) command, or just print the first few rows using head(pimax).

str(pimax)

summary(pimax)

What is the median value of maximal static inspiratory pressure?

You might notice that a few common summary statistics are already returned by the summary function, but you can compute individual ones directly:

median(pimax$pimax)

If you don’t like the dollar sign ($) syntax for selecting columns, you can access variables within a dataframe implicitly by wrapping the expression in with() (a little bit like Stata’s use, but just for one expression). In this case:

with(pimax, median(pimax))

What are the lower and upper quartiles?

quantile(pimax$pimax, c(.25, .75))

Now we will calculate the mean.

mean(pimax$pimax)

Whereas Stata’s sum function computes a running (cumulative) sum (for some reason), R returns only the scalar total.

If you did want cumulative sums and means you could use the cumsum function. There is no cummean function in base R but you can easily compute it yourself. The R equivalent of Stata’s

gen n = _n
gen sum = sum(pimax)
gen mean = sum/n

transform(pimax, cummean = cumsum(pimax) / seq_along(pimax))

The function transform is used for adding and modifying columns in the data frame that is specified by the first argument (i.e. the data frame pimax). But you can equally compute the same quantity as a standalone vector without adding it as a column to the original dataset:

with(pimax, cumsum(pimax) / seq_along(pimax))

This particular example is a little bit confusing because the data frame and the main column of interest are both called the same name pimax.

For this particular example, it’s not clear why you’d actually want to calculate a running sum or mean, however.

Calculate the standard deviation.

sd(pimax$pimax)

with(pimax,
  sqrt( sum((pimax - mean(pimax))^2) / length(pimax) )
)

If the above expression looks a bit unwieldy, you can break it into steps as in the original Stata example.

with(pimax, {
  mean <- mean(pimax)
  n <- length(pimax)
  diff <- pimax - mean
  diff2 <- diff^2  # or diff * diff
  variance <- sum(diff2) / n
  sqrt(variance)
})

Hint: check the ‘Details’ section in help(sd). Hint from Mark: or wait until next week when we will discuss this in depth :-)

Summarising data in R

htwt <- read.csv('htwt.csv')

This file includes two BMI values: bmi, which was based on measured data and bmirep, which was based on reported data.

If you didn’t want to save the CSV file to your desktop, you could instead read it directly from the web:

htwt <- read.csv('http://personalpages.manchester.ac.uk/staff/david.selby/stata/2020-04-02-summarising/htwt.csv')

Examine the distribution of measured BMI scores

One way to visualise a distribution of values might be by displaying a histogram, which (with base R graphics) you can achieve via the call

hist(htwt$bmi)

library(ggplot2)
ggplot(htwt) + aes(x = bmi) +
  geom_histogram(bins = 20)

Another good way of showing the distribution is via a kernel density plot (which is essentially a special kind of histogram). Unlike Stata, R is less forgiving of missing values (NAs) and will throw warnings or errors if you have not explicitly removed or specified how to deal with them.

plot(density(htwt$bmi), main = 'Distribution of measured BMI values')

plot(density(htwt$bmi, na.rm = TRUE),
     main = 'Distribution of measured BMI values (excluding missing)')

ggplot(htwt) + aes(bmi) + geom_density() +
  ggtitle('Distribution of measured BMI values')

Basic summary functions

Calculate summary measures of the entire htwt dataset by simply passing the name of the dataframe:

summary(htwt)

summary(htwt$bmi)

Write down the mean BMI. You can refer to the summary output above, or else compute the mean directly.

mean(htwt$bmi, na.rm = TRUE)

Next, write down the median BMI. Again, you can use summary, or calculate it explicitly.

median(htwt$bmi, na.rm = TRUE)

What are the lower and upper quartiles of the data? These fairly common statistics are also given in the default summary output. To compute them yourself, use the quantile function and specify which quantile(s) you would like.

quantile(htwt$bmi, c(.25, .75), na.rm = TRUE)

Summarising by group

Numerical summaries

To summarise the data separately for each sex, there are several approaches. For a single summary statistic you can use the base R command

aggregate(bmi ~ sex, data = htwt, FUN = mean)
aggregate(bmi ~ sex, data = htwt, FUN = median)

Or you can use the apply family of functions to run summary on each group in the data. The by function (also known as tapply) has the following syntax.

by(htwt$bmi, htwt$sex, FUN = summary)

This means “apply the function summary to the variable bmi, divided according to sex”. Using the same syntax we can calculate a single value such as a median for each group. (Here I use with to avoid writing htwt$ twice.)

with(htwt, by(bmi, sex, median, na.rm = TRUE))

However, most R users use packages such as dplyr or data.table for these common data analysis tasks, as they are slightly easier to use and more flexible. These packages will be introduced in later sections.

Graphical summaries

The distributions for the two sexes can be neatly compared graphically using box plots. The command to do this in base R is

boxplot(bmi ~ sex, data =  htwt)

library(ggplot2)
ggplot(htwt) + aes(sex, bmi) + geom_boxplot()

However you can get a more granular look at the distribution with histograms or kernel density plots, and ggplot2 makes it easy to combine these superimposed on the same axes, or as small multiples (a lattice of separate plots).

# Superimposed
ggplot(htwt) +
  aes(bmi, fill = sex) +
  geom_density(alpha = 0.5) + # "alpha" means the opacity
  ggtitle('Distribution of reported BMI, by sex')

# Stacked (harder to interpret)
ggplot(htwt) +
  aes(bmi, fill = sex) +
  geom_histogram(position = 'stack') +
  ggtitle('Distribution of reported BMI, by sex')

# Small multiples
ggplot(htwt) +
  aes(bmi, fill = sex) +
  geom_density(alpha = 0.5) +
  facet_grid(sex ~ .) +
  ggtitle('Distribution of reported BMI, by sex')

# Small multiples (specify `group` else it will try to stack)
ggplot(htwt) +
  aes(bmi, fill = sex, group = sex) +
  geom_histogram() +
  facet_grid(sex ~ .) +
  ggtitle('Distribution of reported BMI, by sex')

Split-apply-combine with dplyr

These packages can be used to produce summary statistics, but they are much more flexible and powerful than the aggregate and apply commands from base R. They also tend to have more consistent syntax.

library(dplyr)
summarise(htwt,
          mean = mean(bmi, na.rm = TRUE),
          sd = sd(bmi, na.rm = TRUE))

Importantly, there is a group_by command, which enables you to split the data by subgroups, apply functions to each subset and then combine the results.

htwt %>%
  group_by(sex) %>%
  filter(!is.na(bmi)) %>%
  summarise(mean = mean(bmi),
            sd = sd(bmi))

(Here we also use the filter command to exclude those rows where bmi is NA. This is equivalent to setting the argument na.rm = TRUE in mean and sd calls.)

The pipe operator, %>% is just a way of passing the output of one function to the first argument of another, to avoid death by nested brackets. That is,

x %>% mean(na.rm = TRUE)

mean(x, na.rm = TRUE)

Compute the mean and standard deviation of height and weight (as measured by the nurse) for each sex.

Use dplyr’s group_by and summarise functions. (If you have forgotten the names of the variables to use for this, try typing str(htwt) or colnames(htwt).)

htwt %>%
  group_by(sex) %>%
  summarise(mean_height = mean(nurseht, na.rm = T),
            sd_height = sd(nurseht, na.rm = T),
            mean_weight = mean(nursewt, na.rm = T),
            sd_weight = sd(nursewt, na.rm = T))

Another way avoids typing the same code twice, since we’re applying mean and sd to the same columns and with the same arguments.

htwt %>%
  select(sex, nurseht, nursewt) %>%
  group_by(sex) %>%
  summarise_all(list(mean = mean, sd = sd), na.rm = TRUE)

Finally, an advanced approach reshapes the data and stores it in a ‘long’ format. Sometimes this structure is useful for subsequent analyses, or for fitting lots of variables into a table for publication (where a wide table might not fit on the page).

htwt %>%
  select(nurseht, nursewt, sex) %>%
  tidyr::pivot_longer(-sex, names_to = 'measure') %>%
  filter(!is.na(value)) %>%
  group_by(measure, sex) %>%
  summarise(mean = mean(value), sd = sd(value))

Split-apply-combine with data.table

An alternative to dplyr is data.table, which can do the same thing, but with a different syntax. The package data.table is an extension to R’s default data.frame structures, which makes filtering, grouping and manipulating tabular data easier.

The format is DT[filter, compute, group], i.e. the first argument in the square brackets determines which rows to include, the second determines which columns to select or which summary statistics to compute, and the last argument determines grouping variables.

library(data.table)
setDT(htwt) # convert from data.frame to data.table
htwt[!is.na(bmi), .(mean = mean(bmi), median = median(bmi), sd = sd(bmi)), by = sex]

Compute the mean and standard deviation of height and weight (as measured by the nurse) for each sex.

Let’s calculate the same numerical summaries by group, but this time using the data.table package.

setDT(htwt)
htwt[, .(mean_height = mean(nurseht, na.rm = T),
         sd_height   = sd(nurseht, na.rm = T),
         mean_weight = mean(nursewt, na.rm = T),
         sd_weight   = sd(nursewt, na.rm = T)),
     by = sex]

The data.table equivalent to dplyr::summarise_all (see previous section) is to use the .SD (‘Subset of Data’) and .SDcols abbreviations, and loop over the columns in the table.

variable <- c(nurseht = 'height', nursewt = 'weight') # column labels
htwt[,
     .(variable,
       mean = sapply(.SD, mean, na.rm = TRUE),
       sd = sapply(.SD, sd, na.rm = TRUE)),
     .SDcols = names(variable),
     by = sex]

Here sapply is a base function that means ‘loop over a variable, apply a function and simplify the result’.

Further exercises

What is the average age of the subjects?

Draw a histogram of the ages, using the command hist or geom_histogram. Do the ages follow a normal distribution?

Note from David: A histogram may not be the best way of doing this. Probably more reliable to use a QQ plot:

How old are the youngest and oldest males and females in the study?

What is the mean of the self reported BMI? Is this greater or less than the mean of the BMI as measured by the nurse?

Create a variable for the difference between measured BMI and self-reported BMI:

Write down its mean value, standard deviation and the number of subjects for whom both BMI measures are available.

Produce histograms of height in men and women and super-impose a normal distribution curve.

Whilst you can do this in R, arguably quantile–quantile plots are a better way of visualising the normality (or otherwise) of the sample distribution.

ggplot(htwt) +
  aes(sample = nurseht) +
  stat_qq() + stat_qq_line() +
  facet_wrap(~ sex)

If you really want to super-impose a density curve over a histogram, you can do so as follows.

In base R, the following code splits the dataset into two, according to the sex of the patients. Then using these partial datasets (temporarily named subset each time), a histogram is plotted, followed by a normal density curve on the same axes (with mean and standard deviation equal to the respective sample estimates from each subset).

for (subset in split(htwt, htwt$sex)) with(subset, {
  hist(nurseht, main = unique(sex), probability = TRUE)
  curve(dnorm(x, mean(nurseht, na.rm = T), sd(nurseht, na.rm = T)),
        add = TRUE)
})

You don’t have to use a loop for this—you could just plot each sex manually. But if you had a lot of groups then this approach saves a lot of repetition.

You can also do this in ggplot2, but if you want to use facets then you need to calculate the densities yourself.

library(dplyr)
densities <- htwt %>%
  group_by(sex) %>%
  filter(!is.na(nurseht)) %>%
  summarise(mean = mean(nurseht), sd = sd(nurseht),
            min = min(nurseht), max = max(nurseht),
            nurseht = seq(min, max, length = 100),
            density = dnorm(nurseht, mean, sd))

ggplot(htwt) +
  aes(nurseht, ..density..) +
  geom_histogram() +
  geom_line(aes(y = density), data = densities, colour = 'tomato2') +
  facet_wrap(~sex, nrow = 2, scales = 'free_y')

Copy your graphs into a Word document.

While the original Stata worksheet suggests this, it is not a particularly reproducible workflow for writing an analysis report.

R Markdown (or Overleaf, or Jupyter) should be used instead. Then, if the original code is updated, the final graph will also be updated and no manual formatting is required.

If the output format must be Word (versus HTML, PDF), one can create an R Markdown document that renders a Word document as follows.

---
output: word_document
---

Text goes here.

```{r}
# R code here
```

Text continues.

```{r}
# Further R code here
```

More text.

Lecture 2: Summarising Data

Author

Affiliation

Published

DOI

Hand calculations

What is the median value of maximal static inspiratory pressure?

What are the lower and upper quartiles?

Now we will calculate the mean.

Calculate the standard deviation.

Summarising data in R

Examine the distribution of measured BMI scores

Basic summary functions

Summarising by group

Numerical summaries

Graphical summaries

Split-apply-combine with dplyr

Compute the mean and standard deviation of height and weight (as measured by the nurse) for each sex.

Split-apply-combine with data.table

Compute the mean and standard deviation of height and weight (as measured by the nurse) for each sex.

Further exercises

What is the average age of the subjects?

Draw a histogram of the ages, using the command `hist` or `geom_histogram`. Do the ages follow a normal distribution?

How old are the youngest and oldest males and females in the study?

What is the mean of the self reported BMI? Is this greater or less than the mean of the BMI as measured by the nurse?

Create a variable for the difference between measured BMI and self-reported BMI:

Produce histograms of height in men and women and super-impose a normal distribution curve.

Copy your graphs into a Word document.

Footnotes

Lecture 2: Summarising Data

Author

Affiliation

Published

DOI

Hand calculations

What is the median value of maximal static inspiratory pressure?

What are the lower and upper quartiles?

Now we will calculate the mean.

Calculate the standard deviation.

Summarising data in R

Examine the distribution of measured BMI scores

Basic summary functions

Summarising by group

Numerical summaries

Graphical summaries

Split-apply-combine with dplyr

Compute the mean and standard deviation of height and weight (as measured by the nurse) for each sex.

Split-apply-combine with data.table

Compute the mean and standard deviation of height and weight (as measured by the nurse) for each sex.

Further exercises

What is the average age of the subjects?

Draw a histogram of the ages, using the command hist or geom_histogram. Do the ages follow a normal distribution?

How old are the youngest and oldest males and females in the study?

What is the mean of the self reported BMI? Is this greater or less than the mean of the BMI as measured by the nurse?

Create a variable for the difference between measured BMI and self-reported BMI:

Produce histograms of height in men and women and super-impose a normal distribution curve.

Copy your graphs into a Word document.

Footnotes

Draw a histogram of the ages, using the command `hist` or `geom_histogram`. Do the ages follow a normal distribution?