Types of data. Qualitative data. Quantitative data.
This lecture forms part of the Introduction to Epidemiology course offered by the Centre for Epidemiology Versus Arthritis at the University of Manchester. There are two sessions on analytical epidemiology: this first session covers descriptive statistics; the second session (next week) introduces statistical inference.
Much of the material overlaps with the first few lectures in the Statistical Modelling with Stata course, for which materials are available online. There is less emphasis on statistical software here. Examples, where provided, will be in the R programming language. Unlike Stata, R is completely free, so you don’t need an institutional software licence to follow along at home. Download it from the Comprehensive R Archive Network.
If you have questions or comments about any of these materials, please contact david.selby@manchester.ac.uk.
Links to example datasets are provided in the respective sections below.
Descriptive statistics and graphical summaries are essential for two tasks that bookend any quantitative research project.
Exploring and presenting data both involve similar tools.
Remember: visualise early and often. Before you can communicate your research, you first need to understand the data yourself. Visualising data very quickly reveals gaps, extreme values, skewness or unusual patterns that are much harder to diagnose using numerical methods alone. For many characteristics of a dataset, a visualisation is also a more efficient way of describing them than prose or tabulated values.
Data can be qualitative or quantitative.
Qualitative data have labels but no absolute numerical meaning; they can be simple groups (nominal), possibly with a nested structure (hierarchical), or relatively-ordered categories (ordinal).
Quantitative data have explicit numerical interpretation: if discrete, the outcome can take one of a countable set of possible values; if continuous, the possible values are uncountably infinite.
Examples include the number of children in a family (discrete) and height in centimetres (continuous).
Continuous values can be measured on different scales: a ratio scale or an interval scale.
A ratio scale can be added, subtracted, multiplied and divided, and has an absolute zero: for instance 20 kg is twice as massive as 10 kg, and 0 kg means zero mass.
An interval scale can be added or subtracted, but has no absolute zero, so multiplication and division are not meaningful: 30°C is more than 10°C but is not three times warmer, and 0°C is not zero heat. Temperature (except in Kelvin) is not a ratio scale.
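This distinction can be sketched with simple arithmetic:

```r
# Ratio scale: mass in kg has an absolute zero, so ratios are meaningful
20 / 10                          # 20 kg is twice as massive as 10 kg

# Interval scale: Celsius has no absolute zero, so convert to Kelvin
# (a ratio scale) before taking ratios
(30 + 273.15) / (10 + 273.15)    # roughly 1.07, not 3
```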
Nominal variables can have numeric labels, but that does not make them quantitative: e.g. the number 143 bus.
In the real world, measurements do not have infinite precision and are sometimes considered discrete even if the underlying phenomenon is continuous.
The type and scale of data determine the ways in which you can summarise, visualise and make inferences from them.
What types and scales are the following variables?
In R, quantitative data are represented by `numeric` or `integer` vectors. Qualitative data are given as `character` (text/string) vectors, as `factor`s (if the number of possible values is known) and, for ordinal data, as ordered factors. When reading in data from a file, be careful of nominal data with numeric labels, such as ID numbers; these may be automatically converted to `numeric` or `integer` when really you want `character` or `factor`.
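As a minimal sketch (the `severity` variable here is hypothetical, not from any course dataset):

```r
# Ordinal data as an ordered factor
severity <- factor(c('mild', 'severe', 'mild', 'moderate'),
                   levels = c('mild', 'moderate', 'severe'),
                   ordered = TRUE)
is.ordered(severity)   # TRUE; comparisons like severity > 'mild' now work

# Nominal data with numeric labels (e.g. ID numbers) should stay as text
ids <- as.character(c(101, 143, 207))
class(ids)             # "character", not "numeric"
```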
We summarise qualitative data by counting them, representing the number of observations at each level as a frequency or proportion.
We can then visualise these frequencies or proportions with bar charts, pictograms or heat maps, as described below.
To count discrete/qualitative data in R, use `table()` to get frequencies, then `proportions()` or `prop.table()` for relative proportions. On a single vector, `table()` will simply count the different values. Pass multiple arguments to generate a two-way (or higher-dimensional) table. Another way of generating multi-way tables is `xtabs()`. Alternatively, you can use `tally()` and `count()` from the dplyr package.
For example, try the `diamonds` dataset (bundled with the ggplot2 package).
table(diamonds$cut)

Fair Good Very Good Premium Ideal
1610 4906 12082 13791 21551
proportions(table(diamonds$cut))
Fair Good Very Good Premium Ideal
0.02984798 0.09095291 0.22398962 0.25567297 0.39953652
with(diamonds, table(color, cut))

cut
color Fair Good Very Good Premium Ideal
D 163 662 1513 1603 2834
E 224 933 2400 2337 3903
F 312 909 2164 2331 3826
G 314 871 2299 2924 4884
H 303 702 1824 2360 3115
I 175 522 1204 1428 2093
J 119 307 678 808 896
xtabs(~ color + cut, data = diamonds)
cut
color Fair Good Very Good Premium Ideal
D 163 662 1513 1603 2834
E 224 933 2400 2337 3903
F 312 909 2164 2331 3826
G 314 871 2299 2924 4884
H 303 702 1824 2360 3115
I 175 522 1204 1428 2093
J 119 307 678 808 896
proportions(table(diamonds$color, diamonds$cut), margin = 1)
Fair Good Very Good Premium Ideal
D 0.024 0.098 0.223 0.237 0.418
E 0.023 0.095 0.245 0.239 0.398
F 0.033 0.095 0.227 0.244 0.401
G 0.028 0.077 0.204 0.259 0.433
H 0.036 0.085 0.220 0.284 0.375
I 0.032 0.096 0.222 0.263 0.386
J 0.042 0.109 0.241 0.288 0.319
Using dplyr:

library(dplyr)
diamonds %>%
  count(color, cut) %>%
  group_by(cut) %>%
  mutate(`%` = 100 * n / sum(n))

# A tibble: 35 x 4
# Groups: cut [5]
color cut n `%`
<ord> <ord> <int> <dbl>
1 D Fair 163 10.1
2 D Good 662 13.5
3 D Very Good 1513 12.5
4 D Premium 1603 11.6
5 D Ideal 2834 13.2
6 E Fair 224 13.9
7 E Good 933 19.0
8 E Very Good 2400 19.9
9 E Premium 2337 16.9
10 E Ideal 3903 18.1
# ... with 25 more rows
Bar charts are also known as column charts. The length of each bar is proportional to the number of observations in each category. A grouped bar chart puts the bars side-by-side, so they have a common reference of zero. A stacked bar chart puts the bars on top of one another, which makes it easier to see the overall total in each category. A proportional bar chart standardises each group to have unit length, for easier visualisation of within-group proportions.
library(ggplot2)
ggplot(as.data.frame(Titanic)) +
aes(Class, Freq, fill = Survived) +
geom_col(position = 'dodge') +
labs(title = 'Grouped bar chart',
subtitle = 'Survival of passengers on RMS Titanic, by class')
ggplot(as.data.frame(Titanic)) +
aes(Class, Freq, fill = Survived) +
geom_col(position = 'stack') +
labs(title = 'Stacked bar chart',
subtitle = 'Survival of passengers on RMS Titanic, by class')
ggplot(as.data.frame(Titanic)) +
aes(Class, Freq, fill = Survived) +
geom_col(position = 'fill') +
labs(title = 'Stacked proportional bar chart', y = 'Prop',
subtitle = 'Survival of passengers on RMS Titanic, by class')
A pie chart is just a proportional bar chart that’s been transformed into polar coordinates.
ggplot(as.data.frame(Titanic)) +
aes(Freq, '', fill = Survived) +
geom_col(position = 'fill') +
labs(title = 'Stacked proportional bar chart', x = NULL, y = NULL,
subtitle = 'Survival of passengers on RMS Titanic, by class') +
facet_wrap(~Class, nrow = 1) +
coord_polar() +
theme(axis.text.x = element_blank())
Don’t use pie charts. Perceptually, it is easier to compare lengths than it is to compare angles or areas.
When counts are smaller, numbers can appear more concrete as pictograms, also known as isotype visualisations. Even without figurative icons you can represent subjects as stacked dots or squares, so that each shape represents a single person.
This can be useful to avoid reading too much into a dataset that contains a small number of people. It’s also an effective way of emphasising the impact that events have on individuals.
Use `ggplot2::geom_dotplot` or the waffle package to produce such charts.
remotes::install_github('hrbrmstr/waffle')
library(waffle)
ggplot(as.data.frame(Titanic)) +
aes(fill = Survived, values = Freq) +
geom_waffle(colour = 'white', n_rows = 20, flip = TRUE) +
coord_equal() +
facet_wrap(~Class, nrow = 1) +
theme_enhance_waffle() +
theme(legend.position = 'bottom') +
labs(title = 'Faceted waffle chart',
subtitle = 'Survival of passengers on RMS Titanic, by class')
As the number of observations increases, the individual shapes get smaller and smaller and the chart approximates a bar chart.
You can also make a proportional waffle chart. The individual shapes lose their meaning (no longer 1 square = 1 subject) but it can be useful for unbalanced distributions (emphasising the relative size of the smallest and largest categories).
ggplot(as.data.frame(Titanic)) +
aes(fill = Survived, values = Freq) +
geom_waffle(colour = 'white', n_rows = 10, flip = TRUE,
make_proportional = TRUE) +
coord_equal() +
facet_wrap(~Class, nrow = 1) +
theme_enhance_waffle() +
theme(legend.position = 'bottom') +
labs(title = 'Faceted proportional waffle chart',
subtitle = 'Survival of passengers on RMS Titanic, by class')
Where data have a two-way array structure, you can represent counts or proportions through colour, which is sometimes easier to read than a numeric table, especially when the number of rows and columns is very large and you are looking for patterns, extreme values or outliers.
In ggplot2, use `geom_tile` (if you've already computed the frequencies) or `geom_bin2d` (if you want the package to count the subjects for you).
library(ggplot2)
ggplot(diamonds) +
aes(cut, color) +
geom_bin2d(aes(fill = after_stat(count))) +
scale_fill_viridis_c() +
labs(title = 'Heat map',
subtitle = 'Diamonds counted according to colour and quality of cut')
Limits of human perception make it difficult to read raw values off colour scales, but easier to spot patterns among shapes of varying colour. If you are going to use a colour scale, use `scale_colour_viridis_c` (or `scale_fill_viridis_c`) or another colour-blind-friendly palette. Don't use rainbow colour palettes.
In the R package ggplot2, generally speaking `geom_col()` is for when you have already computed the counts, and `geom_bar()` is for when you want the package to calculate the counts for you while generating the plot. The resulting plot will look the same; indeed `geom_col()` is shorthand for `geom_bar(stat = 'identity')`. For example, the built-in `Titanic` dataset is already given as explicit counts, but a typical epidemiological dataset might have one row for each subject (as in the `diamonds` example dataset, or `ggmosaic::titanic`). You can switch between the two behaviours via the `stat` argument of `geom_bar()`: the default `stat = 'count'` counts the rows, while `stat = 'identity'` uses the values as given.
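A minimal sketch of the distinction, using the `diamonds` data (one row per subject) alongside its pre-computed counts:

```r
library(ggplot2)

# One row per subject: geom_bar() counts the rows for us
p1 <- ggplot(diamonds) + aes(cut) + geom_bar()

# Counts already computed: supply them directly to geom_col()
counts <- as.data.frame(table(cut = diamonds$cut))
p2 <- ggplot(counts) + aes(cut, Freq) + geom_col()

# p1 and p2 render the same bar chart;
# geom_bar(stat = 'identity') on `counts` would be equivalent to geom_col()
```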
The simplest (not necessarily good, let alone best) way of summarising quantitative data is to treat them as qualitative data. That is, by discretising (binning) the values into a small number of groups (e.g. age bands), then presenting the frequency distribution of those groups using a table or chart, as above. If there are a small number of unique values (e.g. number of children per family) then you can report the counts without binning.
A histogram might look like a bar chart, but it is very different. The area of each rectangle represents the frequency of observations in the interval. There are no gaps between bars, and bars can have varying widths.
How many bars, and of what width? This is a difficult question, and the answer has a profound influence on the resulting graphic. For this reason, ggplot2 will print a warning message if you do not specify the number of bins explicitly. It defaults to 30 bins, which is not intended to be a good default but to get you experimenting.
For this example we use the `htwt.csv` dataset of measured and reported body mass indices.
ggplot(htwt) +
aes(nurseht) +
geom_histogram() +
facet_wrap(~ sex) +
labs(x = 'measured height (cm)',
title = 'Histogram', subtitle = 'With arbitrary default 30 bins')
`stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
ggplot(htwt) +
aes(nurseht) +
geom_histogram(breaks = seq(140, 200, by = 2.5)) +
facet_wrap(~ sex) +
labs(x = 'measured height (cm)',
title = 'Histogram', subtitle = 'With 24 bins of width 2.5 cm')
By default the y-axis will represent counts. Change it to density using `aes(y = after_stat(density))`.
Histograms are not recommended unless, a priori, you can pre-specify break points along the x-axis (e.g. age bands), because they are highly sensitive to the positions of breakpoints and number of bins.
For example, a dataset with ten observations of every integer value of \(x\) from 1 to 40 should appear as a uniform distribution over this range, but it doesn't, because the default number of bins (in ggplot2) is 30:
uniform <- data.frame(x = rep(1:40, each = 10))
ggplot(uniform) +
aes(x) +
geom_histogram(colour = 'grey')
`stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
ggplot(uniform) +
aes(x) +
geom_histogram(breaks = c(0, 10, 20, 30, 40),
colour = 'grey')
For small numbers of observations you might like to use a dot plot. But you still need to choose the positions of the stacks of dots, just as with a histogram.
ggplot(uniform) +
aes(x) +
geom_dotplot(binwidth = 1) +
theme(axis.text.y = element_blank()) +
labs(title = 'Dot plot', subtitle = 'With bin width of 1', y = NULL)
Generally, avoid discretising data unless you have a compelling reason. (The ‘correct’ choice in this case is 40 bins, because we know the data are uniform. But if you already knew the data were uniform, what’s the purpose of the visualisation?)
If you have no prior idea of sensible bin widths or break points, use a kernel density plot and it should pick a reasonable bandwidth automatically. A kernel density plot is essentially just a smoothed version of a histogram. It tends to work better than an ordinary histogram, except where smoothing might not be appropriate (gaps or sudden jumps in the frequencies), in which case you might want to start with a kernel density plot, diagnose discontinuities and then switch to another plotting method.
ggplot(uniform) +
aes(x) +
geom_density(bw = 1) +
labs(title = 'Kernel density plot')
For the `htwt` data we get the following kernel density plot. If comparing groups, you can plot them on the same axes, or faceted in a way that allows direct comparison of the scales.
ggplot(htwt) +
aes(nurseht, fill = sex) +
geom_density(alpha = .75) +
labs(x = 'measured height (cm)',
title = 'Kernel density plot',
subtitle = 'Two groups on same axes')
ggplot(htwt) +
aes(nurseht, sex, fill = sex) +
ggridges::geom_density_ridges() +
labs(x = 'measured height (cm)',
title = 'Kernel density plot',
subtitle = 'Two groups on separate axes') +
theme(legend.position = 'none')
Better yet, identify outliers, if any, with a jittered scatter plot.
ggplot(uniform) +
aes(x, '') +
geom_jitter(width = 0) +
labs(title = 'Jittered scatter plot', y = NULL)
A dot plot with a binwidth of 1 is essentially the same thing as the jittered scatter plot but with all the random vertical gaps removed.
ggplot(htwt) +
aes(nurseht) +
geom_dotplot(stackdir = 'centerwhole', binwidth = 1, method = 'histodot') +
facet_wrap(~ sex, ncol = 1) +
labs(x = 'measured height (cm)', y = NULL,
title = 'Centred dot plots', subtitle = 'With bin width of 1 cm') +
theme(axis.text.y = element_blank())
Certain data distributions can be represented more compactly than via visualisations. Often (not always) we can concisely communicate the important characteristics of the distribution using summary statistics.
What is the average or the value of a ‘typical’ observation?
What is the spread of the data?
A confidence interval is a special case of an inter-quantile range: the interval between the lower and upper quantiles of a theoretical normal distribution.
In R, you can calculate the mean, quartiles and extreme values using `summary`. Combine it with `by` to compute summaries by a grouping variable.
by(htwt$nurseht, htwt$sex, summary)
htwt$sex: female
Min. 1st Qu. Median Mean 3rd Qu. Max. NA's
141 156 159 160 163 182 7
----------------------------------------------------
htwt$sex: male
Min. 1st Qu. Median Mean 3rd Qu. Max. NA's
154 168 174 173 177 192 3
In R, there are dedicated functions for most common statistics, including `mean`, `sd`, `median`, `quantile`, `IQR`, `min`, `max`, `mad`, `range` (min, max) and so on.
library(dplyr)
htwt %>%
filter(!is.na(nurseht)) %>% # exclude missing values
group_by(sex) %>%
summarise(mean = mean(nurseht),
sd = sd(nurseht),
Q1 = quantile(nurseht, .25),
median = median(nurseht),
Q3 = quantile(nurseht, .75),
IQR = IQR(nurseht),
min = min(nurseht),
max = max(nurseht),
MAD = mad(nurseht)) %>%
knitr::kable()
sex | mean | sd | Q1 | median | Q3 | IQR | min | max | MAD |
---|---|---|---|---|---|---|---|---|---|
female | 160 | 6.4 | 156 | 159 | 163 | 7.2 | 141 | 182 | 5.9 |
male | 173 | 6.9 | 168 | 174 | 177 | 9.2 | 154 | 192 | 6.7 |
You can generate publication-quality tables using `knitr::kable` and friends.
By convention, medical papers tend to report baseline characteristics of the study sample in the first table.
See the R-thritis worksheet Best practices for building baseline tables for more information.
A box and whisker plot is a very efficient summary of the distribution, assuming your summary measures of interest are the quartiles: i.e. first quartile (25th centile), the median (second quartile / 50th centile), third quartile (75th centile) and the minimum and maximum (possibly excluding outliers).
The “whiskers” extend to the minimum and maximum, or to the smallest and largest observations that lie within \(1.5 \times \text{IQR}\) of the lower and upper quartiles. Outliers, if any, are marked as separate dots.
A box plot can help show skewness but not bimodality. A single box plot is not terribly demonstrative (it's more efficient just to tabulate the quartiles instead) but it is a useful way of visualising how the respective quartiles differ between multiple groups.
ggplot(htwt) +
aes(nurseht, sex, fill = sex) +
geom_boxplot(show.legend = FALSE) +
labs(title = 'Box plots',
subtitle = 'Quartiles of measured height, by sex',
x = 'measured height (cm)', y = NULL)
The normal distribution (Gaussian distribution, ‘bell curve’) is symmetric, unimodal and mesokurtic. It is entirely defined by two parameters: the mean and the standard deviation.
A standard normal distribution has mean 0 and standard deviation 1. Other distributions are more complex to describe or interpret via parameters, and easier to represent graphically.
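A small simulation sketch (the mean and standard deviation here are arbitrary example values):

```r
# Simulate a normal sample, then standardise it to a standard normal
set.seed(42)
x <- rnorm(1000, mean = 160, sd = 6.5)

z <- (x - mean(x)) / sd(x)    # subtract the mean, divide by the sd
c(mean(z), sd(z))             # approximately 0 and exactly 1
```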
Asymmetric distributions are skewed.
It is easy to diagnose that a distribution is obviously not normal by looking at a density plot or histogram, but it is not so easy to verify that a distribution is approximately normal this way. In practice, the only reliable way to verify normality (and diagnose departures from it) is with quantile–quantile (QQ) plots.
A density plot or histogram might look approximately normal, but how different would it have to be before we said it wasn't? And would our conclusion change if we specified a different bandwidth (for density plots) or bin width (for histograms)?
With a QQ plot, at least we don’t have to worry about choice of bin width.
ggplot(htwt) +
aes(sample = nurseht) +
geom_qq() +
geom_qq_line() +
facet_wrap(~ sex) +
labs(title = 'Normal quantile–quantile plots',
subtitle = 'Sample quantiles of measured height (cm)')
The male patients’ heights have a near-perfect normal distribution. The female patients’ heights are approximately normal but the tails seem to be slightly heavier than a theoretical normal distribution with the same mean and standard deviation; in other words there are more extreme values than we might expect. But this deviation is very small and not worth worrying about in this case.
Note the line on a QQ plot can be whatever you want: often it's a line through the first and third quartiles, but it could also be a least-squares regression line, a line through the 1st and 99th percentiles, or something else. You are examining whether the points fall along a single straight line, whatever the equation of that line might be (i.e. the slope doesn't have to be exactly 1).
If there is more than one ‘mode’ (peak) in a distribution then it suggests the data would be better described by a mixture of normals (if not another distribution entirely).
In the height and weight data, the measurements might look slightly bimodal if you failed to segregate them by sex.
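As a sketch (the group means and standard deviations are loosely based on the measured heights above, not an exact reproduction of the dataset):

```r
# Pooling two groups can induce bimodality
set.seed(1)
heights <- c(rnorm(500, mean = 160, sd = 6.4),   # 'female' group
             rnorm(500, mean = 173, sd = 6.9))   # 'male' group

plot(density(heights))  # may show two peaks, or at least a shoulder,
                        # when the sexes are pooled
```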
If a distribution is not symmetric, it is skewed. Positive (negative) skewness implies more high (low) values than would otherwise be expected from a symmetric distribution.
A rough and ready test for skewness is to check if the median is approximately equal to the mean. If they are very different, the data may be skewed. However, the scale of ‘very different’ is dependent on sample size, so you should draw a plot as well.
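A quick sketch of this check on simulated log-normal (hence positively skewed) data:

```r
set.seed(1)
x <- rlnorm(500)       # log-normal sample: positively skewed

mean(x) > median(x)    # typically TRUE for positively skewed data:
                       # the long right tail pulls the mean above the median
```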
Kurtosis is a measure of how heavy (fat) the tails are. Not every symmetric, unimodal distribution is a normal distribution!
Leptokurtic (fat tailed) distributions might be a result of outliers, due to measurement error. Platykurtic (thin tailed) distributions can result from sample values being artificially bounded in some way (for example due to limits of a measurement range of a device).
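A sketch comparing a heavy-tailed sample against normal quantiles (using base R's `qqnorm` here, for brevity):

```r
set.seed(1)
heavy <- rt(500, df = 3)   # Student's t with 3 df: heavier tails than normal

qqnorm(heavy)
qqline(heavy)              # points in the tails bend away from the line
```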
There are null-hypothesis significance tests available to test for normality, for example the Shapiro–Wilk test. Avoid using these, except to verify your conclusions from graphical tests.
In conclusion: use any method you like to determine normality, so long as that method is a QQ plot.
Skewed distributions can be made symmetric using a mathematical transformation. Taking logs is the most common. Other transformations (e.g. square root, reciprocal) can be used, but are much more difficult to interpret. It may be better to transform back to original units when presenting results.
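A sketch on simulated positively skewed data:

```r
# Symmetrising a positively skewed variable by taking logs
set.seed(1)
x <- rlnorm(500)       # positively skewed sample

hist(log(x))           # roughly symmetric after the transformation

# Back-transforming the mean of the logs gives the geometric mean,
# reported in the original units
exp(mean(log(x)))
```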
PImax, the maximal static inspiratory pressure, is a measure of lung function, given in units cm H2O. The following is a list of lung function measurements for a set of cystic fibrosis patients.
80, 85, 110, 95, 95, 100, 45, 95, 130, 75, 80, 70, 80, 100, 120, 110, 125, 75, 100, 40, 75, 110, 150, 75, 95.
Try completing the following exercises by hand, then check your answers with a computer.
Find the median of the lung function measurements.
Hint: start by writing down the measurements in order.
Calculate the inter-quartile range.
Calculate the mean.
Calculate the standard deviation.
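One way to check your hand calculations in R (the variable name `pimax` is just for illustration):

```r
pimax <- c(80, 85, 110, 95, 95, 100, 45, 95, 130, 75, 80, 70, 80,
           100, 120, 110, 125, 75, 100, 40, 75, 110, 150, 75, 95)

sort(pimax)      # ordered measurements, for finding the median by hand
median(pimax)
IQR(pimax)       # note: R's default quantile method may differ slightly
                 # from the rule you used by hand
mean(pimax)
sd(pimax)
```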
Read the dataset `htwt.csv` into R by downloading it to your desktop or using

path <- 'https://personalpages.manchester.ac.uk/staff/david.selby/stata/2020-04-02-summarising/htwt.csv'
htwt <- read.csv(path)
This file includes two BMI variables: `bmi`, which was based on measured data, and `bmirep`, which was based on reported data.
library(ggplot2)
ggplot(htwt) + aes(bmi) + geom_histogram(binwidth = 1)
ggplot(htwt) + aes(sample = bmi) + geom_qq() + geom_qq_line()
summary(htwt$bmi)
Write down the mean BMI.
How does the mean compare to the median?
What are the lower and upper quartiles of the observed values?
Compare the distribution of BMI between the sexes, graphically and with grouped summary statistics (`by`, `aggregate`, `dplyr::group_by` etc.). For example:

ggplot(htwt) + aes(sex, bmi) + geom_boxplot()
ggplot(htwt) + aes(bmi, colour = sex) + geom_density()
ggplot(htwt) + aes(bmi) + geom_histogram(binwidth = 2) + facet_grid(sex ~ .)
The packages dplyr and/or data.table can be used to produce tables of summary statistics. They are similar to the base functions `summary`, `aggregate` and `by`, but more flexible. The basic syntax for split-apply-combine (split the data into groups, apply a summary function, combine the results) in dplyr is:
library(dplyr)
dataset %>% group_by(factorvar) %>% summarise(name = fun(variable))
where `dataset` is a data frame, `factorvar` is the name of an optional grouping variable, `name` is an optional output label and `fun` is a summary function (e.g. `mean`). For example,

htwt %>% summarise(mean = mean(bmi, na.rm = TRUE), sd = sd(bmi, na.rm = TRUE))

would give the mean and standard deviation of BMI for the whole sample (note R does not ignore missing values by default). Passing a variable to `group_by` enables you to obtain the statistics for different subgroups:

htwt %>%
  group_by(sex) %>%
  summarise(mean = mean(bmi, na.rm = TRUE), sd = sd(bmi, na.rm = TRUE))
You can do this in base R if you prefer, but the syntax is less flexible:

aggregate(bmi ~ sex, data = htwt, FUN = function(x) c(mean = mean(x), sd = sd(x)))
What is the average age of the subjects?
Draw a histogram or density plot of the ages, accompanied by a QQ plot. Do the ages follow a normal distribution?
How old are the youngest and oldest males and females in the study?
What is the mean of the self-reported BMI? Is this greater or less than the mean of the BMI as measured by the nurse?
Assign a variable for the difference between measured BMI and self-reported BMI:
htwt <- transform(htwt, bmidiff = bmi - bmirep)
Write down its mean value, standard deviation and the number of subjects for whom both BMI measures are available (hint: use the `summary` function).
ggplot(htwt) + aes(sample = nurseht) + facet_wrap(~ sex) +
geom_qq() + geom_qq_line()
and
ggplot(htwt) + aes(sample = nursewt) + facet_wrap(~ sex) +
geom_qq() + geom_qq_line()