Analytical Epidemiology I

# Analytical Epidemiology I
## Summarising Data
### David Selby
### 23<sup>rd</sup> March 2021

---

layout: true
background-image: url(cfe-logo.jpg)
background-position: 97% 97%
background-size: 70px

<style>
.purple { color: #644BA5; }
.pink { color: #EB64A0; }
.blue { color: #0073BE; }
.green { color: #37A53C; }
.yellow { color: #FAB900; }
.red { color: #E61E32; }
    
ul, ol {
  padding-left: 0;
}
ol li {
  margin-top: .5em;
}
ul ul, dl {
  padding-left: 1em;
}

code {
  color: #644BA5;
}

.remark-slide-number {
  position: inherit;
}

.remark-slide-number .progress-bar-container {
  position: absolute;
  bottom: 0;
  height: 4px;
  display: block;
  left: 0;
  right: 0;
}

.remark-slide-number .progress-bar {
  height: 100%;
  background-color: #0073BE;
}

</style>

---

## Stats in the basic epi course

- **Analysis 1**
  - today
  - descriptive statistics
- **Analysis 2**
  - next week (30 March)
  - inferential statistics
  
--

**Today**: different types of data and how to summarise them

- visual summaries
- numerical summaries

---

Data can be **qualitative** or **quantitative**.

### Qualitative

<dl>
<dt>nominal</dt>
<dd>named groups, no numerical interpretation</dd>
<dt>ordinal</dt>
<dd>groups with a <em>relative</em> ordering</dd>
</dl>

### Quantitative

<dl>
<dt>discrete</dt>
<dd>countable set of possible values</dd>
<dt>continuous</dt>
<dd>uncountably many possible values</dd>
</dl>

---

## Examples of data types

<dl>
<dt>nominal</dt>
<dd>blood group; hair colour</dd>
<dt>ordinal</dt>
<dd>strongly agree/agree/disagree/strongly disagree; education</dd>
<dt>discrete</dt>
<dd>number of children; date of birth</dd>
<dt>continuous</dt>
<dd>birthweight, height, body fat percentage</dd>
</dl>
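
In R, these four types map roughly onto unordered factors, ordered factors, integers and doubles. A minimal sketch with made-up values (none of these vectors come from the course datasets):

```r
# Illustrative values only; not course data.
blood_group <- factor(c("A", "O", "B", "O", "AB"))             # nominal
agreement   <- factor(c("agree", "disagree", "strongly agree"),
                      levels = c("strongly disagree", "disagree",
                                 "agree", "strongly agree"),
                      ordered = TRUE)                           # ordinal
n_children  <- c(0L, 2L, 1L, 3L)                                # discrete
birthweight <- c(3.42, 2.95, 3.80)                              # continuous (kg)

# Ordered factors support comparisons between levels.
agreement > "disagree"

# Counting is the natural summary for either kind of factor.
table(blood_group)
```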

---

## Caveats with data types

Distinctions between data types can be _subjective._

- Cancer staging I, II, III, IV: nominal or ordinal?

- Number of long-term conditions: discrete or ordinal?

- Continuous phenomena &rArr; discrete measurements

--
  
- .red[Red] and .blue[blue]? Or .red[\#FF0000] and .blue[\#0000FF]?

Depends on the **application** or **research question**.

---

## Examples of data types

What type of variable is each of the following?

1. number of visits to a GP this year
2. marital status
3. size of tumour in cm
4. pain (minimal, moderate, severe, unbearable)
5. blood pressure in mm Hg

---

# Qualitative data

---

layout: true
background-image: url(cfe-logo.jpg)
background-position: 97% 97%
background-size: 70px

---

# Qualitative data

**Count** the number of subjects/observations in each group

The count is called the *frequency*.

The proportion is called the *relative frequency*.

- **R:** use `table()`, `prop.table()` and `xtabs()` functions
  - **dplyr**: `count()` and `tally()`
  - **data.table**: `.N`
- **Stata:** use `tabulate` command

---

## Summarising counts

How many penguins, by *sex* and *species*?

```r
with(penguins, table(sex, species))
```

```
        species
sex      Adelie Chinstrap Gentoo
  female     73        34     58
  male       73        34     61
```

```r
with(penguins, proportions(table(sex, species), margin = 2))
```

```
        species
sex      Adelie Chinstrap Gentoo
  female   0.50      0.50   0.49
  male     0.50      0.50   0.51
```

---

## Visualising counts

We can communicate frequencies/proportions by representing them as:

1. **text:** tables
2. **shapes:** dot plots, waffle charts, pictograms
3. **length:** bar or column charts
4. **colour:** heat maps or choropleth maps
5. **area:** mosaic/spine plots, tree maps, ~~pie charts~~

---

## Visualising counts

Tables are **just one way of visualising data**.
They can be *precise*, but are often a poor way of spotting *trends* or *anomalies*.

```
           color
cut            D    E    F    G    H    I    J
  Fair       163  224  312  314  303  175  119
  Good       662  933  909  871  702  522  307
  Very Good 1513 2400 2164 2299 1824 1204  678
  Premium   1603 2337 2331 2924 2360 1428  808
  Ideal     2834 3903 3826 4884 3115 2093  896
```

---

## Visualising counts

Tables are **just one way of visualising data**.
They can be *precise*, but are often a poor way of spotting *trends* or *anomalies*.

---

## Indexical visualisation

---

## Bar charts

---

## Bar charts

---

## Bar charts

---

## Mosaic plots or spine plots

---

## R functions for visualising counts

- **base**/**stats**: 
  - `table`, `ftable`, `prop.table`/`proportions`, `xtabs`
- **dplyr**
  - `count`, `tally`, `summarise( n() )`
- **graphics**:
  - `plot`, `barplot`, `mosaicplot`, `spineplot`
- **ggplot2**:
  - built-in: `geom_bar`, `geom_col`, `geom_tile`
  - `ggmosaic::geom_mosaic`
  - `waffle::geom_waffle`
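
As a rough illustration of the **graphics** functions listed above, here is a sketch using the built-in `HairEyeColor` table (chosen here only because it ships with R; it is not one of the course datasets):

```r
# Built-in data: hair and eye colour of 592 students, collapsed over sex.
counts <- margin.table(HairEyeColor, c(1, 2))   # Hair x Eye contingency table

# Length encodes frequency: one group of bars per eye colour.
barplot(counts, beside = TRUE, legend.text = TRUE,
        xlab = "Eye colour", ylab = "Frequency")

# Area encodes frequency: tile widths and heights show relative frequencies.
mosaicplot(counts, main = "Hair colour by eye colour")
```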

---

# Intermission

Questions?

*After the break:* **quantitative data**

---

layout: true
background-image: url(cfe-logo.jpg)
background-position: 97% 97%
background-size: 70px

---

# Quantitative data

**Recall**: we report _qualitative_ data by counting them and printing/plotting the frequencies.

A simple way to summarise _quantitative_ data is to **treat them as qualitative**:
i.e. count the discrete values, or divide observations into bins, then count them.

Tread carefully: the resulting figures are _highly sensitive_ to the choice of bins.

---

## Binning data

Easier to count, _at cost of granularity_.

```r
age_band <- cut(htwt$age, c(18, 30, 40, 50, 60, 70, 80))
addmargins(table(htwt$sex, age_band))
```

```
        age_band
         (18,30] (30,40] (40,50] (50,60] (60,70] (70,80] Sum
  female      40      55      41      38      47      13 234
  male        16      34      38      34      36      20 178
  Sum         56      89      79      72      83      33 412
```
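
The choice of break points matters at the edges as well as in the middle. A small sketch on simulated ages (an assumption for illustration, since `htwt` is not reproduced here) shows how `cut()` treats the lowest boundary and how coarser bins change the picture:

```r
# Simulated ages; an assumption for illustration, not the htwt data.
set.seed(1)
age <- sample(18:80, size = 200, replace = TRUE)

# Default: intervals are open on the left, so anyone aged exactly 18 becomes NA.
decades <- cut(age, breaks = c(18, 30, 40, 50, 60, 70, 80))
sum(is.na(decades))

# include.lowest = TRUE closes the first interval, giving [18,30].
decades_closed <- cut(age, breaks = c(18, 30, 40, 50, 60, 70, 80),
                      include.lowest = TRUE)

# A coarser set of bins tells a different story from the same data.
table(decades_closed)
table(cut(age, breaks = c(18, 50, 80), include.lowest = TRUE))
```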

---

### Histograms

Like a bar chart of binned observations, _but_:

- label boundaries, not bars (no gaps between bars)
- _frequency_ on *y*-axis (frequency = height), **or**
- _density_ on *y*-axis (frequency = area)
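
The two *y*-axis conventions can be checked numerically. This sketch uses the built-in `faithful` data and a bin width of 5 minutes, both chosen here purely for illustration:

```r
# Built-in data: waiting times between Old Faithful eruptions (minutes).
x <- faithful$waiting

# Compute the histogram without drawing it, using bins 5 minutes wide.
h <- hist(x, breaks = seq(40, 100, by = 5), plot = FALSE)

h$counts                      # bar heights on the frequency scale
sum(h$counts) == length(x)    # heights add up to the sample size

widths <- diff(h$breaks)
sum(h$density * widths)       # areas add up to 1 on the density scale

# Draw both versions: same shape, different y-axis.
par(mfrow = c(1, 2))
plot(h, freq = TRUE,  main = "Frequency")
plot(h, freq = FALSE, main = "Density")
```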

---

### Histograms

How do you choose the number or position of bins?
**Impossible to say.**
_Don't_ just use the default!

```r
ggplot(htwt) + aes(nurseht) + facet_wrap(~sex) + geom_histogram() 
```

```
`stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
```

---

### Histograms

What could possibly go wrong?

```r
uniform <- data.frame(x = rep(1:40, each = 10))
ggplot(uniform) + aes(x) + geom_histogram()
```

```
`stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
```
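
The same effect can be reproduced with base R. In this sketch the bin edges (0.5 to 40.5) are an assumption and need not match where **ggplot2** places its 30 bins, but the mechanism is the same: 40 integer values cannot be shared evenly among 30 equal-width bins.

```r
x <- rep(1:40, each = 10)     # every value 1..40 appears exactly 10 times

# 30 equal-width bins over the data: some bins catch one integer, some two.
thirty <- hist(x, breaks = seq(0.5, 40.5, length.out = 31), plot = FALSE)
table(thirty$counts)

# One bin per integer recovers the flat shape.
unit <- hist(x, breaks = seq(0.5, 40.5, by = 1), plot = FALSE)
table(unit$counts)
```

Perfectly uniform data appear spiky under the first binning and flat under the second.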

---

### Histograms

A **kernel density plot** is a _smoothed histogram_.

- _Bandwidth_ is picked automatically
- Smooths out the noise
- May mask discontinuities (but so can a histogram)
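
A sketch of how the bandwidth affects the result, using base R's `density()` on the built-in `faithful` data (the dataset and the `adjust = 3` multiplier are choices made here for illustration):

```r
x <- faithful$eruptions               # built-in data, clearly bimodal

d_default <- density(x)               # bandwidth chosen by the default rule
d_smooth  <- density(x, adjust = 3)   # three times the default bandwidth

c(default = d_default$bw, smooth = d_smooth$bw)

hist(x, freq = FALSE, breaks = 20, main = "Histogram with kernel density")
lines(d_default, lwd = 2)
lines(d_smooth, lty = 2)              # a larger bandwidth blurs the two peaks
```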

---

### Histograms

Can use a **dot plot** for smaller data.

---

### Histograms

A **spinogram** is a spine plot with one binned continuous variable.
Both *x* and *y* axes represent relative frequency.

---

# Summary measures

---

layout: true
background-image: url(cfe-logo.jpg)
background-position: 97% 97%
background-size: 70px

---

### Summary measures

### Location

What is the _average_ or _typical_ observed value?

- Mean, median, ...

### Scale

What is the _spread_ of the data?

- Range, inter-qua*n*tile range, mean absolute deviation
- standard deviation, confidence intervals

---

## Measures of location

The **arithmetic mean** is

$$
`\begin{aligned}
\bar x &= \frac{x_1 + x_2 + \dots + x_n}{n} \\
       &= \frac{1}{n} \sum_{i=1}^n x_i.
\end{aligned}`
$$

- easy to compute
- location parameter for many probability distributions

---

## Measures of location

The **median**: &ldquo;Sort the values and pick the middle one&rdquo;

$$
\operatorname{median}(x) =
`\begin{cases}
x_{(n+1)/2} & n~\text{is odd} \\[1ex]
\dfrac{x_{n/2} + x_{(n/2) + 1}}{2} & n~\text{is even}
\end{cases}`
$$
- Essentially a (**heavily**) _trimmed_ mean
- Less sensitive to extreme outliers
- More ‘typical’ than mean, if data are skewed
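
A small made-up example of how one outlier affects each measure, and of the median as the limiting case of a trimmed mean:

```r
x <- c(2, 3, 3, 4, 5, 6, 90)   # made-up values with one extreme outlier

mean(x)                 # pulled towards the outlier
median(x)               # 4: the middle of the sorted values
mean(x, trim = 0.2)     # 4.2: drops one value from each end before averaging
mean(x, trim = 0.5)     # 4: maximal trimming returns the median
```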

---

## Quantiles

**Quantiles** are _cut points_ that divide data into equal proportions.

- **quartiles** split data into _quarters_
- **centiles** (percentiles) split data into *hundredths*

The median is the 2.sup[nd] quartile or the 50.sup[th] centile.

The zeroth quantile is the _minimum_ value; the last quantile is the _maximum_.
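
These relationships can be checked with `quantile()`. The sketch below uses the built-in `faithful` data purely as an example:

```r
x <- faithful$waiting   # built-in data

q <- quantile(x, probs = c(0, 0.25, 0.5, 0.75, 1))
q

# The end points and the middle match min(), median() and max().
unname(q["0%"])   == min(x)
unname(q["50%"])  == median(x)
unname(q["100%"]) == max(x)
```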

---

## Measures of variation

How close are our data to the ‘typical’ value?

- Range
- Inter-qua*n*tile range
  - Inter-quartile range (IQR)
  - `$100(1-\alpha)$`% confidence interval (CI)
- Variance (standard deviation)
- Mean absolute deviation

---

## Measures of variation

#### Range

`$$\text{range}(x) = \max(x) - \min(x)$$`

- depends on only two measurements
- can only increase with sample size

#### Inter-quartile range

`$$\text{IQR}(x) = Q_{3/4}(x) - Q_{1/4}(x)$$`

- less sensitive to extreme values
- not meaningful for very small datasets
- not uniquely defined!
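
To see why the IQR is not uniquely defined, compare two of the quantile definitions that R offers through the `type` argument (the data are simply the integers 1 to 10):

```r
x <- 1:10

diff(range(x))      # range() returns c(min, max); diff() gives max - min = 9

# Different quantile definitions give different quartiles, hence different IQRs.
IQR(x)              # default, type = 7: 4.5
IQR(x, type = 6)    # 5.5
```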

---

## Measures of variation

#### Standard deviation

$$\text{sd}(x) = \sqrt{\sum_{i=1}^n \frac{(x_i - \bar x)^2}{n-1}} $$

- _nearly_ the ‘average distance from the mean’
- uses information from every observation
- sensitive to outliers
- in the same units as the observations
- easy to use mathematically
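
A quick check of the formula against R's built-in `sd()`, on made-up values:

```r
x <- c(2, 4, 4, 4, 5, 5, 7, 9)   # made-up values; the mean is 5
n <- length(x)

by_hand <- sqrt(sum((x - mean(x))^2) / (n - 1))
by_hand
all.equal(by_hand, sd(x))        # TRUE: sd() uses the n - 1 divisor
all.equal(sd(x)^2, var(x))       # TRUE: variance is the squared SD
```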

---

## Measures of variation

#### Mean absolute deviation

$$\text{MAD}(x) = \sum_{i=1}^n \frac{|x_i - \bar x|}{n} $$

- the average distance from the mean (or median)
- less sensitive to outliers
- easy to compute
- always &leq; standard deviation
- _biased_ estimator; more difficult to use mathematically
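
The formula is a one-liner in R. Note that base R has no dedicated function for the *mean* absolute deviation, so the sketch below computes it by hand on made-up values:

```r
x <- c(2, 4, 4, 4, 5, 5, 7, 9)          # made-up values; the mean is 5

mean_abs_dev <- mean(abs(x - mean(x)))  # the formula above: 1.5
mean_abs_dev
mean_abs_dev <= sd(x)                   # TRUE

# Caution: stats::mad() is the *median* absolute deviation about the median,
# scaled by 1.4826 by default, so it generally gives a different number.
mad(x)
mad(x, constant = 1)                    # unscaled version: 0.5
```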

---

## Summary statistics in R

- `summary` will give quartiles, min, max & mean
  - or counts of each level, if qualitative
- `quantile` to compute quantiles
    - e.g. `quantile(x, c(.25, .75))` for lower, upper quartiles
- `mean`, `median`, `sd`, `var`, `mad`, `IQR`
  - note: `mad` computes the *median* (not mean) absolute deviation
- `range` gives minimum and maximum

```r
summary(htwt$nursewt)
```

```
   Min. 1st Qu.  Median    Mean 3rd Qu.    Max.    NA's 
     43      61      70      72      81     125      10 
```

Compute by group with `by`, `aggregate` or `dplyr::group_by`
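
A sketch of grouped summaries in base R, using the built-in `ToothGrowth` data as a stand-in for the course dataset:

```r
# Built-in data: tooth length in guinea pigs by supplement type and dose.
medians <- aggregate(len ~ supp, data = ToothGrowth, FUN = median)
medians

# tapply() returns a named vector instead of a data frame.
iqrs <- tapply(ToothGrowth$len, ToothGrowth$supp, IQR)
iqrs

# Several statistics at once, one row per group.
aggregate(len ~ supp, data = ToothGrowth,
          FUN = function(v) c(mean = mean(v), sd = sd(v)))
```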

---

## Numerical summary: Table 1

Overview of study sample.

Report counts/proportions or location/scale (with units) for each variable.

- **Normally distributed**: mean and SD
- **Skewed distribution**: median and IQR
- **Mixture/if in doubt**: median and IQR
- **Qualitative**: frequency and proportion

_Always_ align numbers by their decimal points.

---

# The normal distribution

---

layout: true
background-image: url(cfe-logo.jpg)
background-position: 97% 97%
background-size: 70px

---

## The normal distribution

The normal (Gaussian) distribution is _symmetric_, _unimodal_ and _mesokurtic_.
Described by **mean** and __standard deviation__.

---

## The normal distribution

Why care if data are normally distributed?

- **Asymmetric**: very high or low values are skewing data
  - mean no longer represents &lsquo;typical&rsquo; value
- **Multimodal**: more than one peak
  - indicates a mixture of groups
- **Platykurtic**: thin tails
  - unusually few / less extreme values
  - bounded measurements?
- **Leptokurtic**: fat tails
  - unusually many / more extreme values
  - anomalous measurements?
  
---

## Assessing normality

- Any way you like, so long as it's a **quantile–quantile plot**.

---

### Positively skewed distribution

Some extremely high values; long right tail

---

### Negatively skewed distribution

Some extremely small values; long left tail

---

### Bimodal distribution

Two peaks (modes): possible mixture of distributions

---

### Leptokurtic (fat tailed)

More extreme values (both large & small) than normal

---

### Platykurtic (thin tailed)

Fewer extreme values (both large & small) than normal

---

### Quantile–quantile plots in R

- In base graphics, call `qqnorm` and `qqline`.
- In **ggplot2**, use `geom_qq` and `geom_qq_line`

```r
ggplot(htwt) + aes(sample = nursewt) +
  geom_qq() + geom_qq_line()
```
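
The base graphics equivalent might look like the following sketch. The data are deterministic, positively skewed values invented here for illustration, so the points should bend away from the reference line:

```r
# Deterministic, positively skewed values (illustrative; not the htwt data).
x <- qexp(ppoints(100))

qqnorm(x, main = "Normal Q-Q plot of skewed data")
qqline(x)   # reference line through the first and third quartiles

# qqnorm() also returns the plotted coordinates.
qq <- qqnorm(x, plot.it = FALSE)
str(qq)     # x = theoretical quantiles, y = sample values
```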

---

## Box and whisker plots

Another kind of &lsquo;quantile plot&rsquo;

- Median, upper & lower quartiles
- Min, max (within 1.5 &times; IQR of the quartiles) & outliers
- Compare skewness between 2+ variables
- In R: `boxplot` or `ggplot2::geom_boxplot`
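
The numbers behind a box plot can be inspected with `boxplot.stats()`. In this sketch on made-up values, the whiskers stop at the most extreme observations within 1.5 times the box length of the box edges (R uses Tukey's hinges, which are close to the quartiles):

```r
x <- c(1, 2, 3, 4, 5, 6, 7, 8, 9, 30)   # made-up values with one outlier

b <- boxplot.stats(x)
b$stats   # lower whisker, lower hinge, median, upper hinge, upper whisker
b$out     # 30: further than 1.5 box lengths above the upper hinge

boxplot(x, horizontal = TRUE)
```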

---

## Transforming data

- Symmetrise data via transformation
- Most common transform: taking logs
- Others (e.g. `$1/x$`, `$\sqrt{x}$`) harder to interpret
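
A sketch of the log transform on deterministic, lognormal-shaped values (constructed here for illustration; not course data):

```r
# Deterministic lognormal-shaped values (illustrative, not course data).
x <- exp(qnorm(ppoints(200)))

c(mean = mean(x), median = median(x))             # mean well above median
c(mean = mean(log(x)), median = median(log(x)))   # both essentially zero

par(mfrow = c(1, 2))
hist(x, main = "Raw scale")
hist(log(x), main = "Log scale")

# Back-transforming the mean of the logs gives the geometric mean.
exp(mean(log(x)))
```

One reason logs are easier to interpret than other transforms is that the back-transformed mean has a name and a meaning: the geometric mean.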

---

# Further reading

---

layout: true
background-image: url(cfe-logo.jpg)
background-position: 97% 97%
background-size: 70px

---

# Further reading

- Lecture notes
- These slides
- Practical exercises

### Books on data visualisation

**Edward Tufte**, _The Visual Display of Quantitative Information_.

**William Cleveland**, _Visualizing Data_.