class: center, middle, inverse, title-slide

# Analytical Epidemiology I
## Summarising Data
### David Selby
### 23rd March 2021

---
layout: true
background-image: url(cfe-logo.jpg)
background-position: 97% 97%
background-size: 70px

<style type="text/css">
dd { font-size: 90%; }
.purple { color: #644BA5; }
.pink { color: #EB64A0; }
.blue { color: #0073BE; }
.green { color: #37A53C; }
.yellow { color: #FAB900; }
.red { color: #E61E32; }
ul, ol { padding-left: 0; }
ol li { margin-top: .5em; }
ul ul, dl { padding-left: 1em; }
code { color: #644BA5; }
.remark-slide-number { position: inherit; }
.remark-slide-number .progress-bar-container {
  position: absolute;
  bottom: 0;
  height: 4px;
  display: block;
  left: 0;
  right: 0;
}
.remark-slide-number .progress-bar {
  height: 100%;
  background-color: #0073BE;
}
</style>

---

## Stats in the basic epi course

- **Analysis 1**
    - today
    - descriptive statistics
- **Analysis 2**
    - next week (30 March)
    - inferential statistics

--

**Today**: different types of data and how to summarise them

- visual summaries
- numerical summaries

.small[https://personalpages.manchester.ac.uk/staff/david.selby/analysis.html]

---

Data can be **qualitative** or **quantitative**.

### Qualitative

<dl>
<dt>nominal</dt>
<dd>named groups, no numerical interpretation</dd>
<dt>ordinal</dt>
<dd>groups with a <em>relative</em> ordering</dd>
</dl>

### Quantitative

<dl>
<dt>discrete</dt>
<dd>countable set of possible values</dd>
<dt>continuous</dt>
<dd>uncountably many possible values</dd>
</dl>

---

## Examples of data types

<dl>
<dt>nominal</dt>
<dd>blood group; hair colour</dd>
<dt>ordinal</dt>
<dd>strongly agree/agree/disagree/strongly disagree; education</dd>
<dt>discrete</dt>
<dd>number of children; date of birth</dd>
<dt>continuous</dt>
<dd>birthweight, height, body fat percentage</dd>
</dl>

---

## Caveats with data types

Distinctions between data types can be _subjective._

--

- Cancer staging I, II, III, IV: nominal or ordinal?

--

- Number of long-term conditions: discrete or ordinal?

--

- Continuous phenomena ⇒ discrete measurements

--

- .red[Red] and .blue[blue]?
Or .red[\#FF0000] and .blue[\#0000FF]?

--

Depends on the **application** or **research question**.

---

## Examples of data types

What type of variable is each of the following?

1. number of visits to a GP this year
2. marital status
3. size of tumour in cm
4. pain (minimal, moderate, severe, unbearable)
5. blood pressure in mm Hg

---
layout: false
class: inverse, middle, center

# Qualitative data

---
layout: true
background-image: url(cfe-logo.jpg)
background-position: 97% 97%
background-size: 70px

---

# Qualitative data

--

**Count** the number of subjects/observations in each group

The count is called the *frequency*.
The proportion is called the *relative frequency*.

--

- **R:** use `table()`, `prop.table()` and `xtabs()` functions
    - **dplyr**: `count()` and `tally()`
    - **data.table**: `.N`
- **Stata:** use `tabulate` command

---

## Summarising counts

How many penguins, by *sex* and *species*?

```r
with(penguins, table(sex, species))
```

```
        species
sex      Adelie Chinstrap Gentoo
  female     73        34     58
  male       73        34     61
```

--

```r
with(penguins, proportions(table(sex, species), margin = 2))
```

```
        species
sex      Adelie Chinstrap Gentoo
  female   0.50      0.50   0.49
  male     0.50      0.50   0.51
```

---

## Visualising counts

We can communicate frequencies/proportions by representing them as:

1. **text:** tables
2. **shapes:** dot plots, waffle charts, pictograms
3. **length:** bar or column charts
4. **colour:** heat maps or choropleth maps
5. **area:** mosaic/spine plots, tree maps, ~~pie charts~~

---

## Visualising counts

Tables are **just one way of visualising data**.
They can be *precise*, but are often a poor way of spotting *trends* or *anomalies*.

--

```
           color
cut            D    E    F    G    H    I    J
  Fair       163  224  312  314  303  175  119
  Good       662  933  909  871  702  522  307
  Very Good 1513 2400 2164 2299 1824 1204  678
  Premium   1603 2337 2331 2924 2360 1428  808
  Ideal     2834 3903 3826 4884 3115 2093  896
```

---

## Visualising counts

Tables are **just one way of visualising data**.
They can be *precise*, but are often a poor way of spotting *trends* or *anomalies*.

<img src="lecture1_files/figure-html/diamonds-1.png" width="100%" />

---

## Indexical visualisation

<img src="lecture1_files/figure-html/waffle-1.png" width="100%" />

---

## Bar charts

<img src="lecture1_files/figure-html/grouped-bar-1.png" width="100%" />

---

## Bar charts

<img src="lecture1_files/figure-html/stacked-bar-1.png" width="100%" />

---

## Bar charts

<img src="lecture1_files/figure-html/stacked-prop-bar-1.png" width="100%" />

---

## Mosaic plots or spine plots

<img src="lecture1_files/figure-html/mosaicplot-1.png" width="100%" />

---

## R functions for visualising counts

- **base**/**stats**:
    - `table`, `ftable`, `prop.table`/`proportions`, `xtabs`
- **dplyr**:
    - `count`, `tally`, `summarise( n() )`
- **graphics**:
    - `plot`, `barplot`, `mosaicplot`, `spineplot`
- **ggplot2**:
    - built-in: `geom_bar`, `geom_col`, `geom_tile`
    - `ggmosaic::geom_mosaic`
    - `waffle::geom_waffle`

---
layout: false
class: inverse, middle, center

# Intermission

Questions?

*After the break:* **quantitative data**

---
layout: true
background-image: url(cfe-logo.jpg)
background-position: 97% 97%
background-size: 70px

---

# Quantitative data

**Recall**: we report _qualitative_ data by counting them and printing/plotting the frequencies.

A simple way to summarise _quantitative_ data is to **treat them as qualitative**: i.e. count the discrete values, or divide observations into bins, then count them.

Tread carefully: the resulting figures are _highly sensitive_ to the choice of bins.

---

## Binning data

Easier to count, _at cost of granularity_.
```r
age_band <- cut(htwt$age, c(18, 30, 40, 50, 60, 70, 80))
addmargins(table(htwt$sex, age_band))
```

```
        age_band
         (18,30] (30,40] (40,50] (50,60] (60,70] (70,80] Sum
  female      40      55      41      38      47      13 234
  male        16      34      38      34      36      20 178
  Sum         56      89      79      72      83      33 412
```

---

### Histograms

Like a bar chart of binned observations, _but_:

- label boundaries, not bars (no gaps between bars)
- _frequency_ on *y*-axis (frequency = height), **or**
- _density_ on *y*-axis (frequency = area)

<img src="lecture1_files/figure-html/histogram-1.png" width="90%" />

---

### Histograms

How do you choose the number or position of bins? **Impossible to say.**

_Don't_ just use the default!

```r
ggplot(htwt) +
  aes(nurseht) +
  facet_wrap(~sex) +
  geom_histogram()
```

```
`stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
```

<img src="lecture1_files/figure-html/histogram-ht-1.png" width="95%" />

---

### Histograms

What could possibly go wrong?

```r
uniform <- data.frame(x = rep(1:40, each = 10))
ggplot(uniform) + aes(x) + geom_histogram()
```

```
`stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
```

<img src="lecture1_files/figure-html/histogram-uniform-1.png" width="95%" />

---

### Histograms

A **kernel density plot** is a _smoothed histogram_.

- _Bandwidth_ is picked automatically
- Smooths out the noise
- May mask discontinuities (but so can a histogram)

<img src="lecture1_files/figure-html/density-1.png" width="90%" />

---

### Histograms

Can use a **dot plot** for smaller data.

<img src="lecture1_files/figure-html/dotplot-1.png" width="80%" />

---

### Histograms

A **spinogram** is a spine plot with one binned continuous variable.

Both *x* and *y* axes represent relative frequency.
<img src="lecture1_files/figure-html/spinogram-1.png" width="85%" />

---
layout: false
class: inverse, middle, center

# Summary measures

---
layout: true
background-image: url(cfe-logo.jpg)
background-position: 97% 97%
background-size: 70px

---

### Summary measures

### Location

What is the _average_ or _typical_ observed value?

- Mean, median, ...

### Scale

What is the _spread_ of the data?

- Range, inter-qua*n*tile range, mean absolute deviation
- standard deviation, confidence intervals

---

## Measures of location

The **arithmetic mean** is

$$
`\begin{aligned}
\bar x &= \frac{x_1 + x_2 + \dots + x_n}{n} \\
&= \frac{1}{n} \sum_{i=1}^n x_i.
\end{aligned}`
$$

- easy to compute
- location parameter for many probability distributions

---

## Measures of location

The **median**: “Sort the values and pick the middle one”

$$
\operatorname{median}(x) =
`\begin{cases}
x_{(n+1)/2} & n~\text{is odd} \\[1ex]
\dfrac{x_{n/2} + x_{(n/2) + 1}}{2} & n~\text{is even}
\end{cases}`
$$

- Essentially a (**heavily**) _trimmed_ mean
- Less sensitive to extreme outliers
- More ‘typical’ than mean, if data are skewed

---

## Quantiles

**Quantiles** are _cut points_ that divide data into equal proportions.

- **quartiles** split data into _quarters_
- **centiles** (percentiles) split data into *hundredths*

The median is the 2.sup[nd] quartile or the 50.sup[th] centile.

The zeroth quantile is the _minimum_ value; the last quantile is the _maximum_.

---

## Measures of variation

How close are our data to the ‘typical’ value?

- Range
- Inter-qua*n*tile range
- Inter-quartile range (IQR)
- `\(100(1-\alpha)\)`% confidence interval (CI)
- Variance (standard deviation)
- Mean absolute deviation

---

## Measures of variation

#### Range

`$$\text{range}(x) = \max(x) - \min(x)$$`

- depends on only two measurements
- can only increase with sample size

#### Inter-quartile range

`$$\text{IQR}(x) = Q_{3/4}(x) - Q_{1/4}(x)$$`

- less sensitive to extreme values
- not meaningful for very small datasets
- not uniquely defined!
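---

## Measures of variation

A minimal sketch of why the IQR is ‘not uniquely defined’: R's `quantile` supports several quantile definitions, selected with the `type` argument (`type = 7` is the default), and `IQR` passes that argument through. The vector `x` is made-up illustrative data, not one of the course datasets.

```r
# Made-up illustrative data (an assumption, not from the lecture datasets)
x <- c(2, 4, 4, 5, 7, 9, 10, 12)

# range() returns c(min, max); diff() turns that into max - min
diff(range(x))    # 10

# Inter-quartile range under R's default quantile definition (type = 7)
IQR(x)            # 5.25

# Same data, different quantile definition (type = 6): a different answer
IQR(x, type = 6)  # 5.75
```

Neither answer is wrong, so it helps to state which definition was used, especially for small samples.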
---

## Measures of variation

#### Standard deviation

$$\text{sd}(x) = \sqrt{\sum_{i=1}^n \frac{(x_i - \bar x)^2}{n-1}} $$

- _nearly_ the ‘average distance from the mean’
- uses information from every observation
- sensitive to outliers
- in the same units as the observations
- easy to use mathematically

---

## Measures of variation

#### Mean absolute deviation

$$\text{MAD}(x) = \sum_{i=1}^n \frac{|x_i - \bar x|}{n} $$

- the average distance from the mean (or median)
- less sensitive to outliers
- easy to compute
- always ≤ standard deviation
- _biased_ estimator; more difficult to use mathematically

---

## Summary statistics in R

- `summary` will give quartiles, min, max & mean
    - or counts of each level, if qualitative
- `quantile` to compute quantiles
    - e.g. `quantile(x, c(.25, .75))` for lower, upper quartiles
- `mean`, `median`, `sd`, `var`, `IQR`
    - careful: `mad` is the (scaled) _median_ absolute deviation, not the mean absolute deviation
- `range` gives minimum and maximum

```r
summary(htwt$nursewt)
```

```
   Min. 1st Qu.  Median    Mean 3rd Qu.    Max.    NA's 
     43      61      70      72      81     125      10 
```

Compute by group with `by`, `aggregate` or `dplyr::group_by`

---

## Numerical summary: Table 1

Overview of study sample. Report counts/proportions or location/scale (with units) for each variable.

- **Normally distributed**: mean and SD
- **Skewed distribution**: median and IQR
- **Mixture/if in doubt**: median and IQR
- **Qualitative**: frequency and proportion

_Always_ align numbers by their decimal points.

---
layout: false
class: inverse, middle, center

# The normal distribution

---
layout: true
background-image: url(cfe-logo.jpg)
background-position: 97% 97%
background-size: 70px

---

## The normal distribution

The normal (Gaussian) distribution is _symmetric_, _unimodal_ and _mesokurtic_.

Described by **mean** and __standard deviation__.

<img src="lecture1_files/figure-html/normal-1.png" width="80%" style="display: block; margin: auto;" />

---

## The normal distribution

Why care if data are normally distributed?
- **Asymmetric**: very high or low values are skewing data
    - mean no longer represents ‘typical’ value
- **Multimodal**: more than one peak
    - indicates a mixture of groups
- **Platykurtic**: thin tails
    - unusually few / less extreme values
    - bounded measurements?
- **Leptokurtic**: fat tails
    - unusually many / more extreme values
    - anomalous measurements?

---

## Assessing normality

- Any way you like, so long as it's a **quantile–quantile plot**.

<img src="lecture1_files/figure-html/normality-1.png" width="100%" />

---

### Positively skewed distribution

Some extremely high values; long right tail

<img src="lecture1_files/figure-html/positive-skew-1.png" width="100%" />

---

### Negatively skewed distribution

Some extremely small values; long left tail

<img src="lecture1_files/figure-html/negative-skew-1.png" width="100%" />

---

### Bimodal distribution

Two peaks (modes): possible mixture of distributions

<img src="lecture1_files/figure-html/multimodal-1.png" width="100%" />

---

### Leptokurtic (fat tailed)

More extreme values (both large & small) than normal

<img src="lecture1_files/figure-html/lepto-1.png" width="100%" />

---

### Platykurtic (thin tailed)

Fewer extreme values (both large & small) than normal

<img src="lecture1_files/figure-html/platy-1.png" width="100%" />

---

### Quantile–quantile plots in R

- In base graphics, call `qqnorm` and `qqline`.
- In **ggplot2**, use `geom_qq` and `geom_qq_line`

```r
ggplot(htwt) +
  aes(sample = nursewt) +
  geom_qq() +
  geom_qq_line()
```

<img src="lecture1_files/figure-html/unnamed-chunk-7-1.png" width="50%" />

---

## Box and whisker plots

Another kind of ‘quantile plot’

- Median, upper & lower quartiles
- Min, max (within 1.5 × IQR of the quartiles) & outliers
- Compare skewness between 2+ variables
- In R: `boxplot` or `ggplot2::geom_boxplot`

<img src="lecture1_files/figure-html/boxplot-1.png" width="100%" />

---

## Transforming data

- Symmetrise data via transformation
- Most common transform: taking logs
- Others (e.g.
`\(1/x\)`, `\(\sqrt{x}\)`) harder to interpret

<img src="lecture1_files/figure-html/log-transform-1.png" width="80%" />

.small[Logarithm of the positively-skewed data from earlier]

---
layout: false
class: inverse, middle, center

# Further reading

---
layout: true
background-image: url(cfe-logo.jpg)
background-position: 97% 97%
background-size: 70px

---

# Further reading

.small[https://personalpages.manchester.ac.uk/staff/david.selby/analysis/]

- Lecture notes
- These slides
- Practical exercises

### Books on data visualisation

**Edward Tufte**, _The Visual Display of Quantitative Information_.

**William Cleveland**, _Visualizing Data_.