Lecture 2: Summarising Data

Obtaining numerical and visual summaries of datasets in R. An introduction to the split-apply-combine approach to data analysis, using either base functions or the packages dplyr and data.table.

David Selby https://www.research.manchester.ac.uk/portal/david.selby.html (Centre for Epidemiology Versus Arthritis)https://www.cfe.manchester.ac.uk
2020-04-02

This worksheet is based on the second lecture of the Statistical Modelling in Stata course, offered by the Centre for Epidemiology Versus Arthritis at the University of Manchester. The original Stata exercises and solutions are here translated into their R equivalents.

Refer to the original slides and Stata worksheets here.

To distribute the pimax.dta and htwt.dta files in a more universal format, we use the foreign package to convert them from the Stata binary format into data frames, then save them as CSV.

datadir <- 'http://personalpages.manchester.ac.uk/staff/mark.lunt/stats/2_summarizing_data/data/'

library(foreign)
pimax <- read.dta(file.path(datadir, 'pimax.dta'))
htwt <- read.dta(file.path(datadir, 'htwt.dta'))

write.csv(pimax, 'pimax.csv', row.names = FALSE)
write.csv(htwt, 'htwt.csv', row.names = FALSE)

You don’t need to do this as I’ve converted the files for you:

Hand calculations

This section gives you the chance to do some calculations for yourself and see how the concepts we saw in the lecture work in real life. Once upon a time, these calculations would have been done by hand: I’m sure you could do them in your head, but getting Stata R to do them for you will be quicker. However, we are going to go through the steps that you would have to perform if you were calculating them by hand, so that you can see how it works. In practice, you would simply ask Stata R to churn the results out rather than calculating them this way.

Load in the PImax dataset with the command

pimax <- read.csv('pimax.csv')

This dataset contains a single variable, pimax, which is the maximal static inspiratory pressure, measured in cm H2O, of 25 cystic fibrosis patients.

Look at the data in the spreadsheet view using the command View, or, in RStudio, by clicking on the pimax Data variable in the Environment pane. You can also get a quick preview of the data set, its size and data types using the str (structure) command, or just print the first few rows using head(pimax).

str(pimax)
'data.frame':   25 obs. of  2 variables:
 $ id   : int  1 2 3 4 5 6 7 8 9 10 ...
 $ pimax: int  80 85 110 95 95 100 45 95 130 75 ...

Another way of getting a quick summary is the summary function:

summary(pimax)
       id         pimax      
 Min.   : 1   Min.   : 40.0  
 1st Qu.: 7   1st Qu.: 75.0  
 Median :13   Median : 95.0  
 Mean   :13   Mean   : 92.6  
 3rd Qu.:19   3rd Qu.:110.0  
 Max.   :25   Max.   :150.0  

What is the median value of maximal static inspiratory pressure?

You might notice that a few common summary statistics are already returned by the summary function, but you can compute individual ones directly:

median(pimax$pimax)
[1] 95

If you don’t like the dollar sign ($) syntax for selecting columns, you can access variables within a dataframe implicitly by wrapping the expression in with() (a little bit like Stata’s use, but just for one expression). In this case:

with(pimax, median(pimax))
[1] 95

What are the lower and upper quartiles?

quantile(pimax$pimax, c(.25, .75))
25% 75% 
 75 110 

Now we will calculate the mean.

mean(pimax$pimax)
[1] 92.6

Whereas Stata’s sum function computes a running (cumulative) sum (for some reason), R returns only the scalar total.

If you did want cumulative sums and means you could use the cumsum function. There is no cummean function in base R but you can easily compute it yourself. The R equivalent of Stata’s

gen n = _n
gen sum = sum(pimax)
gen mean = sum/n

is something like the following:

transform(pimax, cummean = cumsum(pimax) / seq_along(pimax))
   id pimax  cummean
1   1    80 80.00000
2   2    85 82.50000
3   3   110 91.66667
4   4    95 92.50000
5   5    95 93.00000
6   6   100 94.16667
7   7    45 87.14286
8   8    95 88.12500
9   9   130 92.77778
10 10    75 91.00000
11 11    80 90.00000
12 12    70 88.33333
13 13    80 87.69231
14 14   100 88.57143
15 15   120 90.66667
16 16   110 91.87500
17 17   125 93.82353
18 18    75 92.77778
19 19   100 93.15789
20 20    40 90.50000
21 21    75 89.76190
22 22   110 90.68182
23 23   150 93.26087
24 24    75 92.50000
25 25    95 92.60000

where seq_along(x) is just another way of writing 1:length(x).

The function transform is used for adding and modifying columns in the data frame that is specified by the first argument (i.e. the data frame pimax). But you can equally compute the same quantity as a standalone vector without adding it as a column to the original dataset:

with(pimax, cumsum(pimax) / seq_along(pimax))
 [1] 80.00000 82.50000 91.66667 92.50000 93.00000 94.16667 87.14286
 [8] 88.12500 92.77778 91.00000 90.00000 88.33333 87.69231 88.57143
[15] 90.66667 91.87500 93.82353 92.77778 93.15789 90.50000 89.76190
[22] 90.68182 93.26087 92.50000 92.60000

This particular example is a little bit confusing because the data frame and the main column of interest are both called the same name pimax.

For this particular example, it’s not clear why you’d actually want to calculate a running sum or mean, however.

Calculate the standard deviation.

sd(pimax$pimax)
[1] 24.92154

Or, manually,

with(pimax,
  sqrt( sum((pimax - mean(pimax))^2) / length(pimax) )
)
[1] 24.41803

If the above expression looks a bit unwieldy, you can break it into steps as in the original Stata example.

with(pimax, {
  mean <- mean(pimax)
  n <- length(pimax)
  diff <- pimax - mean
  diff2 <- diff^2  # or diff * diff
  variance <- sum(diff2) / n
  sqrt(variance)
})
[1] 24.41803

Why are these answers slightly different to the value returned by sd() above?

Hint: check the ‘Details’ section in help(sd).

Summarising data in R

Read the CSV file htwt.csv by typing

htwt <- read.csv('htwt.csv')

This file includes two BMI values: bmi, which was based on measured data and bmirep, which was based on reported data.

If you didn’t want to save the CSV file to your desktop, you could instead read it directly from the web:

htwt <- read.csv('http://personalpages.manchester.ac.uk/staff/david.selby/stata/2020-04-02-summarising/htwt.csv')

Examine the distribution of measured BMI scores

One way to visualise a distribution of values might be by displaying a histogram, which (with base R graphics) you can achieve via the call

hist(htwt$bmi)

Or, using the popular ggplot2 visualisation package:

library(ggplot2)
ggplot(htwt) + aes(x = bmi) +
  geom_histogram(bins = 20)

Another good way of showing the distribution is via a kernel density plot (which is essentially a special kind of histogram). Unlike Stata, R is less forgiving of missing values (NAs) and will throw warnings or errors if you have not explicitly removed or specified how to deal with them.

plot(density(htwt$bmi), main = 'Distribution of measured BMI values')
Error in density.default(htwt$bmi): 'x' contains missing values
plot(density(htwt$bmi, na.rm = TRUE),
     main = 'Distribution of measured BMI values (excluding missing)')
ggplot(htwt) + aes(bmi) + geom_density() +
  ggtitle('Distribution of measured BMI values')
Warning: Removed 12 rows containing non-finite values (stat_density).

Basic summary functions

This function was mentioned briefly in the previous exercise.

Calculate summary measures of the entire htwt dataset by simply passing the name of the dataframe:

summary(htwt)
     caseno          sex                 age           nurseht     
 Min.   :10006   Length:412         Min.   :19.00   Min.   :141.0  
 1st Qu.:11656   Class :character   1st Qu.:36.00   1st Qu.:158.5  
 Median :20646   Mode  :character   Median :48.00   Median :164.5  
 Mean   :17938                      Mean   :48.41   Mean   :165.5  
 3rd Qu.:22975                      3rd Qu.:62.00   3rd Qu.:172.9  
 Max.   :25224                      Max.   :76.00   Max.   :192.0  
                                                    NA's   :10     
    nursewt           htfeet          htinch          wtstone     
 Min.   : 42.80   Min.   :4.000   Min.   : 0.000   Min.   : 7.00  
 1st Qu.: 60.60   1st Qu.:5.000   1st Qu.: 2.000   1st Qu.: 9.00  
 Median : 70.40   Median :5.000   Median : 5.000   Median :10.00  
 Mean   : 71.53   Mean   :5.039   Mean   : 5.241   Mean   :10.63  
 3rd Qu.: 80.90   3rd Qu.:5.000   3rd Qu.: 8.000   3rd Qu.:12.00  
 Max.   :125.00   Max.   :6.000   Max.   :11.000   Max.   :18.00  
 NA's   :10       NA's   :2       NA's   :2        NA's   :5      
     wtlbs           bmirep           bmi       
 Min.   : 0.00   Min.   :15.67   Min.   :17.34  
 1st Qu.: 0.00   1st Qu.:22.27   1st Qu.:22.65  
 Median : 4.00   Median :24.35   Median :25.71  
 Mean   : 4.44   Mean   :24.91   Mean   :26.08  
 3rd Qu.: 7.00   3rd Qu.:27.06   3rd Qu.:28.66  
 Max.   :13.00   Max.   :45.28   Max.   :45.47  
 NA's   :5       NA's   :6       NA's   :12     

Or you can hone in on a specific column, such as the measured bmi:

summary(htwt$bmi)
   Min. 1st Qu.  Median    Mean 3rd Qu.    Max.    NA's 
  17.34   22.65   25.71   26.08   28.66   45.47      12 

Write down the mean BMI. You can refer to the summary output above, or else compute the mean directly.

mean(htwt$bmi, na.rm = TRUE)
[1] 26.07662

Next, write down the median BMI. Again, you can use summary, or calculate it explicitly.

median(htwt$bmi, na.rm = TRUE)
[1] 25.71006

What are the lower and upper quartiles of the data? These fairly common statistics are also given in the default summary output. To compute them yourself, use the quantile function and specify which quantile(s) you would like.

quantile(htwt$bmi, c(.25, .75), na.rm = TRUE)
     25%      75% 
22.65145 28.66408 

Summarising by group

Numerical summaries

To summarise the data separately for each sex, there are several approaches. For a single summary statistic you can use the base R command

aggregate(bmi ~ sex, data = htwt, FUN = mean)
     sex      bmi
1 female 25.85984
2   male 26.35534
aggregate(bmi ~ sex, data = htwt, FUN = median)
     sex      bmi
1 female 25.08112
2   male 26.25182

repeating for the quartiles, standard deviations and so on.

Or you can use the apply family of functions to run summary on each group in the data. The by function (also known as tapply) has the following syntax.

by(htwt$bmi, htwt$sex, FUN = summary)
htwt$sex: female
   Min. 1st Qu.  Median    Mean 3rd Qu.    Max.    NA's 
  17.34   22.25   25.08   25.86   28.66   45.47       9 
---------------------------------------------------- 
htwt$sex: male
   Min. 1st Qu.  Median    Mean 3rd Qu.    Max.    NA's 
  17.96   23.57   26.25   26.36   28.65   44.92       3 

This means “apply the function summary to the variable bmi, divided according to sex”. Using the same syntax we can calculate a single value such as a median for each group. (Here I use with to avoid writing htwt$ twice.)

with(htwt, by(bmi, sex, median, na.rm = TRUE))
sex: female
[1] 25.08112
---------------------------------------------------- 
sex: male
[1] 26.25182

However, most R users use packages such as dplyr or data.table for these common data analysis tasks, as they are slightly easier to use and more flexible. These packages will be introduced in later sections.

Graphical summaries

The distributions for the two sexes can be neatly compared graphically using box plots. The command to do this in base R is

boxplot(bmi ~ sex, data =  htwt)

or, in ggplot2:

library(ggplot2)
ggplot(htwt) + aes(sex, bmi) + geom_boxplot()

However you can get a more granular look at the distribution with histograms or kernel density plots, and ggplot2 makes it easy to combine these superimposed on the same axes, or as small multiples (a lattice of separate plots).

# Superimposed
ggplot(htwt) +
  aes(bmi, fill = sex) +
  geom_density(alpha = 0.5) + # "alpha" means the opacity
  ggtitle('Distribution of reported BMI, by sex')
# Stacked (harder to interpret)
ggplot(htwt) +
  aes(bmi, fill = sex) +
  geom_histogram(position = 'stack') +
  ggtitle('Distribution of reported BMI, by sex')
# Small multiples
ggplot(htwt) +
  aes(bmi, fill = sex) +
  geom_density(alpha = 0.5) +
  facet_grid(sex ~ .) +
  ggtitle('Distribution of reported BMI, by sex')
# Small multiples (specify `group` else it will try to stack)
ggplot(htwt) +
  aes(bmi, fill = sex, group = sex) +
  geom_histogram() +
  facet_grid(sex ~ .) +
  ggtitle('Distribution of reported BMI, by sex')

Split-apply-combine with dplyr

These packages can be used to produce summary statistics, but they are much more flexible and powerful than the aggregate and apply commands from base R. They also tend to have more consistent syntax.

For example, in dplyr, the following gives the mean and SD of BMI.

library(dplyr)
summarise(htwt,
          mean = mean(bmi, na.rm = TRUE),
          sd = sd(bmi, na.rm = TRUE))
      mean       sd
1 26.07662 4.597912

Importantly, there is a group_by command, which enables you to split the data by subgroups, apply functions to each subset and then combine the results.

htwt %>%
  group_by(sex) %>%
  filter(!is.na(bmi)) %>%
  summarise(mean = mean(bmi),
            sd = sd(bmi))
# A tibble: 2 x 3
  sex     mean    sd
  <chr>  <dbl> <dbl>
1 female  25.9  4.90
2 male    26.4  4.17

(Here we also use the filter command to exclude those rows where bmi is NA. This is equivalent to setting the argument na.rm = TRUE in mean and sd calls.)

The pipe operator, %>% is just a way of passing the output of one function to the first argument of another, to avoid death by nested brackets. That is,

x %>% mean(na.rm = TRUE)

is just another way of writing

mean(x, na.rm = TRUE)

Compute the mean and standard deviation of height and weight (as measured by the nurse) for each sex.

Use dplyr’s group_by and summarise functions. (If you have forgotten the names of the variables to use for this, try typing str(htwt) or colnames(htwt).)

htwt %>%
  group_by(sex) %>%
  summarise(mean_height = mean(nurseht, na.rm = T),
            sd_height = sd(nurseht, na.rm = T),
            mean_weight = mean(nursewt, na.rm = T),
            sd_weight = sd(nursewt, na.rm = T))
# A tibble: 2 x 5
  sex    mean_height sd_height mean_weight sd_weight
  <chr>        <dbl>     <dbl>       <dbl>     <dbl>
1 female        160.      6.40        65.9      12.8
2 male          173.      6.91        78.8      12.2

Another way avoids typing the same code twice, since we’re applying mean and sd to the same columns and with the same arguments.

htwt %>%
  select(sex, nurseht, nursewt) %>%
  group_by(sex) %>%
  summarise_all(list(mean = mean, sd = sd), na.rm = TRUE)
# A tibble: 2 x 5
  sex    nurseht_mean nursewt_mean nurseht_sd nursewt_sd
  <chr>         <dbl>        <dbl>      <dbl>      <dbl>
1 female         160.         65.9       6.40       12.8
2 male           173.         78.8       6.91       12.2

Finally, an advanced approach reshapes the data and stores it in a ‘long’ format. Sometimes this structure is useful for subsequent analyses, or for fitting lots of variables into a table for publication (where a wide table might not fit on the page).

htwt %>%
  select(nurseht, nursewt, sex) %>%
  tidyr::pivot_longer(-sex, names_to = 'measure') %>%
  filter(!is.na(value)) %>%
  group_by(measure, sex) %>%
  summarise(mean = mean(value), sd = sd(value))
# A tibble: 4 x 4
# Groups:   measure [2]
  measure sex     mean    sd
  <chr>   <chr>  <dbl> <dbl>
1 nurseht female 160.   6.40
2 nurseht male   173.   6.91
3 nursewt female  65.9 12.8 
4 nursewt male    78.8 12.2 

Split-apply-combine with data.table

An alternative to dplyr is data.table, which can do the same thing, but with a different syntax. The package data.table is an extension to R’s default data.frame structures, which makes filtering, grouping and manipulating tabular data easier.

The format is DT[filter, compute, group], i.e. the first argument in the square brackets determines which rows to include, the second determines which columns to select or which summary statistics to compute, and the last argument determines grouping variables.

library(data.table)
setDT(htwt) # convert from data.frame to data.table
htwt[!is.na(bmi), .(mean = mean(bmi), median = median(bmi), sd = sd(bmi)), by = sex]
      sex     mean   median       sd
1: female 25.85984 25.08112 4.903072
2:   male 26.35534 26.25182 4.170245

Compute the mean and standard deviation of height and weight (as measured by the nurse) for each sex.

Let’s calculate the same numerical summaries by group, but this time using the data.table package.

setDT(htwt)
htwt[, .(mean_height = mean(nurseht, na.rm = T),
         sd_height   = sd(nurseht, na.rm = T),
         mean_weight = mean(nursewt, na.rm = T),
         sd_weight   = sd(nursewt, na.rm = T)),
     by = sex]
      sex mean_height sd_height mean_weight sd_weight
1: female    159.7740  6.398803    65.86416   12.7513
2:   male    172.9571  6.911771    78.81250   12.2367

The data.table equivalent to dplyr::summarise_all (see previous section) is to use the .SD (‘Subset of Data’) and .SDcols abbreviations, and loop over the columns in the table.

variable <- c(nurseht = 'height', nursewt = 'weight') # column labels
htwt[,
     .(variable,
       mean = sapply(.SD, mean, na.rm = TRUE),
       sd = sapply(.SD, sd, na.rm = TRUE)),
     .SDcols = names(variable),
     by = sex]
      sex variable      mean        sd
1: female   height 159.77401  6.398803
2: female   weight  65.86416 12.751296
3:   male   height 172.95714  6.911771
4:   male   weight  78.81250 12.236702

Here sapply is a base function that means ‘loop over a variable, apply a function and simplify the result’.

Further exercises

What is the average age of the subjects?

mean(htwt$age)
[1] 48.41262

Draw a histogram of the ages, using the command hist or geom_histogram. Do the ages follow a normal distribution?

hist(htwt$age)
ggplot(htwt) + aes(age) + geom_histogram(binwidth = 5)

Note from David: A histogram may not be the best way of doing this. Probably more reliable to use a QQ plot:

qqnorm(htwt$age)
qqline(htwt$age)

Or in ggplot2:

ggplot(htwt) + aes(sample = age) +
  stat_qq() + stat_qq_line()

How old are the youngest and oldest males and females in the study?

htwt %>%
  group_by(sex) %>%
  summarise_at('age', list(Youngest = min, Oldest = max))
# A tibble: 2 x 3
  sex    Youngest Oldest
  <chr>     <int>  <int>
1 female       19     74
2 male         19     76

Or, in base R:

by(htwt$age, htwt$sex, range)
htwt$sex: female
[1] 19 74
---------------------------------------------------- 
htwt$sex: male
[1] 19 76

Or, in data.table syntax:

htwt[, .(youngest = min(age), oldest = max(age)), by = sex]
      sex youngest oldest
1: female       19     74
2:   male       19     76

What is the mean of the self reported BMI? Is this greater or less than the mean of the BMI as measured by the nurse?

mean(htwt$bmirep, na.rm = TRUE)
[1] 24.90835
mean(htwt$bmi, na.rm = TRUE)
[1] 26.07662

Create a variable for the difference between measured BMI and self-reported BMI:

bmidiff <- with(htwt, bmi - bmirep)

Write down its mean value, standard deviation and the number of subjects for whom both BMI measures are available.

mean(bmidiff, na.rm = TRUE)
[1] 1.091431
sd(bmidiff, na.rm = TRUE)
[1] 1.845408
sum(!is.na(bmidiff))
[1] 394

Produce histograms of height in men and women and super-impose a normal distribution curve.

Whilst you can do this in R, arguably quantile–quantile plots are a better way of visualising the normality (or otherwise) of the sample distribution.

ggplot(htwt) +
  aes(sample = nurseht) +
  stat_qq() + stat_qq_line() +
  facet_wrap(~ sex)

If you really want to super-impose a density curve over a histogram, you can do so as follows.

In base R, the following code splits the dataset into two, according to the sex of the patients. Then using these partial datasets (temporarily named subset each time), a histogram is plotted, followed by a normal density curve on the same axes (with mean and standard deviation equal to the respective sample estimates from each subset).

for (subset in split(htwt, htwt$sex)) with(subset, {
  hist(nurseht, main = unique(sex), probability = TRUE)
  curve(dnorm(x, mean(nurseht, na.rm = T), sd(nurseht, na.rm = T)),
        add = TRUE)
})

You don’t have to use a loop for this—you could just plot each sex manually. But if you had a lot of groups then this approach saves a lot of repetition.

You can also do this in ggplot2, but if you want to use facets then you need to calculate the densities yourself.

library(dplyr)
densities <- htwt %>%
  group_by(sex) %>%
  filter(!is.na(nurseht)) %>%
  summarise(mean = mean(nurseht), sd = sd(nurseht),
            min = min(nurseht), max = max(nurseht),
            nurseht = seq(min, max, length = 100),
            density = dnorm(nurseht, mean, sd))

ggplot(htwt) +
  aes(nurseht, ..density..) +
  geom_histogram() +
  geom_line(aes(y = density), data = densities, colour = 'tomato2') +
  facet_wrap(~sex, nrow = 2, scales = 'free_y')

You’re still better off with a QQ plot!

Copy your graphs into a Word document.

While the original Stata worksheet suggests this, it is not a particularly reproducible workflow for writing an analysis report.

R Markdown (or Overleaf, or Jupyter) should be used instead. Then, if the original code is updated, the final graph will also be updated and no manual formatting is required.

If the output format must be Word (versus HTML, PDF), one can create an R Markdown document that renders a Word document as follows.

---
output: word_document
---

Text goes here.

```{r}
# R code here
```

Text continues.

```{r}
# Further R code here
```

More text.

See Literate Programming in R Markdown to learn more about dynamic document writing.