Graphics. Summarising data. More syntax. Looping. Reshaping data.
This worksheet is based on the eleventh lecture of the Statistical Modelling in Stata course, created by Dr Mark Lunt and offered by the Centre for Epidemiology Versus Arthritis at the University of Manchester. The original Stata exercises and solutions are here translated into their R equivalents.
Refer to the original slides and Stata worksheets here.
Load the dataset uselifeexp
from the Stata web site. It concerns life expectancy in various subgroups in the United States from 1900–2000.
uslifeexp <- foreign::read.dta('https://www.stata-press.com/data/r8/uslifeexp.dta')
Similar datasets are available within R, but we load the Stata one for ease of comparison with the original worksheet.
Produce a simple scatter plot of life expectancy against year.
plot(le ~ year, data = uslifeexp) # or plot(uslifeexp$year, uslifeexp$le)
You should see life expectancy increasing steadily, with a blip of very low life expectancy in 1918. Now we will add a title to this graph. Go back to your script and add a main
argument to the plot
command. We can also change the y-axis label, ylab
.
plot(le ~ year, data = uslifeexp,
ylab = 'life expectancy',
main = 'US Life Expectancy')
Now we are going to extend the y-axis back to 0, rather than starting at 40. This is set by passing a vector of length 2 to the ylim
argument.
Now we will practise overlaying graphs, using the same dataset. We can compare male and female life expectancy. First plot the points for male life expectancy, then add the female points to the same plot using the points
function. Set the plotting characters to different symbols and colours using pch
and col
arguments. Use the following guide:
Add a legend, if you like.
You should see that life-expectancy is increasing over time in both sexes, but consistently higher in females. Add regression lines to the plot, after fitting the appropriate linear models, using the abline()
function. You may wish to adjust your ylim
.
You can add confidence intervals to the plot in the same way, but it’s a bit of a pain.
A better option is ggplot2, which fits models, adds legends and shades the confidence intervals automatically. However, the package demands data in a long format, so that we can treat sex as a factor, rather than as two separate variables. Reshaping data will be explained later, but here’s a preview.
library(tidyverse)
uslifeexp %>%
select(year, le_female, le_male) %>%
pivot_longer(le_female:le_male,
names_to = 'sex',
values_to = 'lifeexp') %>%
mutate(sex = str_remove(sex, 'le_')) %>% # clean up the labels
ggplot() + aes(year, lifeexp, colour = sex) +
geom_smooth(method = lm, formula = y ~ x) +
geom_point() +
labs(y = 'life expectancy', title = 'US Life Expectancy') +
theme_classic() # Just because I don't like the default grey theme
Read the cancer
dataset into R.
cancer <- foreign::read.dta('http://www.stata-press.com/data/r9/cancer.dta')
Find out what the dataset is about with str
.
str(cancer)
'data.frame': 48 obs. of 4 variables:
$ studytime: int 1 1 2 3 4 4 5 5 8 8 ...
$ died : int 1 1 1 1 1 1 1 1 1 0 ...
$ drug : int 1 1 1 1 1 1 1 1 1 1 ...
$ age : int 61 65 59 52 56 67 63 58 56 58 ...
- attr(*, "datalabel")= chr "Patient Survival in Drug Trial"
- attr(*, "time.stamp")= chr " 3 Mar 2005 16:09"
- attr(*, "formats")= chr [1:4] "%8.0g" "%8.0g" "%8.0g" "%8.0g"
- attr(*, "types")= int [1:4] 252 252 252 252
- attr(*, "val.labels")= chr [1:4] "" "" "" ""
- attr(*, "var.labels")= chr [1:4] "Months to death or end of exp." "1 if patient died" "Drug type (1=placebo)" "Patient's age at start of exp."
- attr(*, "version")= int 8
(These attr
ibutes are carried over from Stata.)
nrow(cancer)
[1] 48
summary()
function to get some idea of the values taken by the different variables. What was the longest followup time?summary(cancer)
studytime died drug age
Min. : 1.00 Min. :0.0000 Min. :1.000 Min. :47.00
1st Qu.: 7.75 1st Qu.:0.0000 1st Qu.:1.000 1st Qu.:50.75
Median :12.50 Median :1.0000 Median :2.000 Median :56.00
Mean :15.50 Mean :0.6458 Mean :1.875 Mean :55.88
3rd Qu.:23.00 3rd Qu.:1.0000 3rd Qu.:3.000 3rd Qu.:60.00
Max. :39.00 Max. :1.0000 Max. :3.000 Max. :67.00
See summary output above, or use
max(cancer$studytime)
[1] 39
When first imported via read.dta
this becomes an integer vector, so the summary above for drug
might not make much sense. If you don’t want to convert this variable to a factor, then you can use
unique(cancer$drug)
[1] 1 2 3
otherwise, convert to factor and summarise again.
studytime died drug age
Min. : 1.00 0:17 1:20 Min. :47.00
1st Qu.: 7.75 1:31 2:14 1st Qu.:50.75
Median :12.50 3:14 Median :56.00
Mean :15.50 Mean :55.88
3rd Qu.:23.00 3rd Qu.:60.00
Max. :39.00 Max. :67.00
You can also now use
nlevels(cancer$drug)
[1] 3
See summary output above, or use
range(cancer$age)
[1] 47 67
See summary output above, or use
mean(cancer$age)
[1] 55.875
We don’t get standard deviations in summary output, so use:
sd(cancer$studytime)
[1] 10.25629
table
(or xtabs
) to cross-tabulate the number and percentage of subjects who died on each treatment. How many subjects on placebo died?19 subjects died on placebo (= drug 1).
with(cancer, proportions(table(drug, died), margin = 1))
died
drug 0 1
1 0.0500000 0.9500000
2 0.5714286 0.4285714
3 0.5714286 0.4285714
43% of subjects on treatment 2 died.
Other methods include
aggregate(age ~ died, mean, data = cancer)
died age
1 0 54.17647
2 1 56.80645
by(cancer$age, cancer$died, mean)
cancer$died: 0
[1] 54.17647
----------------------------------------------------
cancer$died: 1
[1] 56.80645
mean
1 56.80645
cancer %>% group_by(died) %>% summarise(mean = mean(age))
# A tibble: 2 x 2
died mean
<fct> <dbl>
1 0 54.2
2 1 56.8
Many different ways of doing it, as above.
mean(cancer$studytime[cancer$drug == 1]) # or use with()
[1] 9
aggregate(studytime ~ drug, mean, data = cancer)
drug studytime
1 1 9.00000
2 2 14.92857
3 3 25.35714
by(cancer$studytime, cancer$drug, mean)
cancer$drug: 1
[1] 9
----------------------------------------------------
cancer$drug: 2
[1] 14.92857
----------------------------------------------------
cancer$drug: 3
[1] 25.35714
cancer %>% group_by(drug) %>% summarise(mean = mean(studytime))
# A tibble: 3 x 2
drug mean
<fct> <dbl>
1 1 9
2 2 14.9
3 3 25.4
[1] 55.94737
aggregate(age ~ died + drug, mean, data = cancer)
died drug age
1 0 1 58.00000
2 1 1 55.94737
3 0 2 54.75000
4 1 2 59.83333
5 0 3 53.12500
6 1 3 56.50000
died: 0
drug: 1
[1] 58
----------------------------------------------------
died: 1
drug: 1
[1] 55.94737
----------------------------------------------------
died: 0
drug: 2
[1] 54.75
----------------------------------------------------
died: 1
drug: 2
[1] 59.83333
----------------------------------------------------
died: 0
drug: 3
[1] 53.125
----------------------------------------------------
died: 1
drug: 3
[1] 56.5
mean
1 55.94737
cancer %>% group_by(died, drug) %>% summarise(mean = mean(age))
# A tibble: 6 x 3
# Groups: died [2]
died drug mean
<fct> <fct> <dbl>
1 0 1 58
2 0 2 54.8
3 0 3 53.1
4 1 1 55.9
5 1 2 59.8
6 1 3 56.5
agegrp
, dividing subjects into two more-or-less equally sized groups.You can try using cut(age, 2)
but this will split the data at 57 into equal lengths of years, rather than numbers of subjects.
You need to specify the median manually, and then either set lower and upper bounds (c(min(age), median(age), max(age))
or equivalently quantile(age, c(0, .5, 1))
).
Or use findInterval
, which lets you specify a single breakpoint (but doesn’t label the result).
# Generates a vector of 0s and 1s:
cancer$agegrp <- findInterval(cancer$age, median(cancer$age))
# Slightly better as you get labels:
cancer$agegrp <- cut(cancer$age,
quantile(cancer$age, c(0, .5, 1)),
include.lowest = TRUE)
If you prefer a specific pre-made function, use cut2
from the Hmisc package, which will cut into the number of quantile groups specified by argument g
.
cancer$agegrp <- Hmisc::cut2(cancer$age, g = 2)
Lots of ways of doing this.
Or in data.table:
library(data.table)
setDT(cancer)
cancer[, group_size := .N, by = agegrp]
All give the same result.
Assign the dataset to the symbol mycancer
, for later exercises.
mycancer <- cancer
Warning. You will rarely need to use loops of any kind in R. Nearly all tasks are better accomplished by using vectorised operations or using an apply (or map) type function. Exceptions to this are iterative processes that are performed for their side effects, such as saving to file, producing plots or printing to the console. (However, even these might be accomplished with functional programming verbs like purrr::walk
.) Nevertheless, here’s a direct R transliteration of the Stata worksheet.
For a simple illustration of looping, type
Note. You don’t need to wrap the loop expression in curly brackets {}
unless it spans multiple lines. Here it’s just a single print
operation, so they are omitted.
You should still have the dataset mycancer
from the last section. You can now enter the following code:
1 2 3
0 1 8 8
1 19 6 6
[47,57) [57,67]
0 12 5
1 15 16
This should produce two cross-tabulations, one for drug
against died
, and the other for agegrp
against died
.
Note. Expressions inside loops need an explicit call to print()
to appear in the console.
For a more complex example of using for()
, load the life-expectancy data (uslifeexp
) from earlier. We will create graphs for all of the variables:
In interactive use, you can subset a data frame using df$myvar
, but if you want to select a column by a string then you need df[[x]]
, which will give the same result as if you’d typed the value of x after the dollar sign. This is so you don’t mix up a column whose name is equal to the value of x
with a column whose name is literally “x”.
If you really did want to produce a set of plots like above, you might be better off using lattice or ggplot2 to produce a set of small multiples (facets); this is demonstrated at the bottom of the page. A prerequisite is reshaping the data from wide to long, which is described in the next section.
There are several ways to reshape data in R.
melt
and cast
functions (also built into data.table)pivot_longer
and pivot_wider
functions (formerly known as gather
and spread
)The main differences between these three approaches are that reshape
tries to do everything in a single function, reshape2 uses a set of functions to coerce the data to and from an intermediate (molten) format, and tidyr has a pair of functions that convert between wide and long directly.
Generally speaking, you want to use either reshape2 or tidyr, and if in doubt then start with tidyr and see how you get on.
Read the bplong
dataset into R with
bplong <- foreign::read.dta('https://www.stata-press.com/data/r8/bplong.dta')
Use an appropriate method to get the mean and standard deviation of the blood pressure for categories when = 'Before'
and when = 'After'
, separately. Make a note of these so you can check later that your reshaping was successful.
# A tibble: 2 x 3
when mean sd
<fct> <dbl> <dbl>
1 Before 156. 11.4
2 After 151. 14.2
To reshape these data into wide form, the unique identifier is patient
and the variable to ‘widen’ is when
.
First let’s see reshape:
bpwide <- reshape(bplong,
v.names = 'bp',
timevar = 'when',
idvar = 'patient',
direction = 'wide')
patient sex agegrp bp.Before bp.After
1 1 Male 30-45 143 153
3 2 Male 30-45 163 170
5 3 Male 30-45 153 168
7 4 Male 30-45 153 142
9 5 Male 30-45 146 141
11 6 Male 30-45 150 147
With reshape2, you can melt
an object into a ‘molten’ long-format data frame, then cast
it back into the desired wider format.
library(reshape2)
molten <- melt(bplong, measure.vars = 'bp')
bpwide <- dcast(molten, patient + sex + agegrp ~ when)
patient sex agegrp Before After
1 1 Male 30-45 143 153
2 2 Male 30-45 163 170
3 3 Male 30-45 153 168
4 4 Male 30-45 153 142
5 5 Male 30-45 146 141
6 6 Male 30-45 150 147
And finally the package tidyr provides pivot_wider
, which performs the reshape operation in a single function call (the dual function is pivot_longer
) without having to melt it first.
library(tidyr)
bpwide <- bplong %>% pivot_wider(names_from = 'when', values_from = 'bp')
# A tibble: 6 x 5
patient sex agegrp Before After
<int> <fct> <fct> <int> <int>
1 1 Male 30-45 143 153
2 2 Male 30-45 163 170
3 3 Male 30-45 153 168
4 4 Male 30-45 153 142
5 5 Male 30-45 146 141
6 6 Male 30-45 150 147
Check the summary statistics have not changed.
bpwide %>%
summarise_at(vars(Before:After),
list(mean = mean, sd = sd))
# A tibble: 1 x 4
Before_mean After_mean Before_sd After_sd
<dbl> <dbl> <dbl> <dbl>
1 156. 151. 11.4 14.2
Reshaping from wide to long is especially useful for packages like ggplot2, as it allows you to colour or facet according to different variables. It’s also handy when summarising lots of variables at once.
Reload the uslifeexp
life expectancy data from earlier.
There are a series of variables giving the life-expectancy in different subgroups of the US population over the years, and a variable le
containing the overall life-expectancy. We will change this to have several observations for each year, a single variable le
containing the life expectancy and a variable
group saying which subgroup the life-expectancy applies to. First, we need to change the name of the variable le
to le_total
, so that there is something to put in the group
variable for the overall life expectancy.
colnames(uslifeexp)[2] <- 'le_total'
Now, unique observations are identified by the variable year
, the variable we want to have in the long data is le
, and the variable we want to identify which variable in wide form corresponds to an observation in long form is called group
.
The different ways of achieving this are as follows. Firstly, for the masochists, reshape
.
life_long <-
reshape(uslifeexp,
varying = list(colnames(uslifeexp)[-1]), # everything except 'year'
v.names = 'le',
timevar = 'group',
idvar = 'year',
times = colnames(uslifeexp)[-1],
direction = 'long')
year group le
1900.le_total 1900 le_total 47.3
1901.le_total 1901 le_total 49.1
1902.le_total 1902 le_total 51.5
1903.le_total 1903 le_total 50.5
1904.le_total 1904 le_total 47.6
1905.le_total 1905 le_total 48.7
Next a reshape2 workflow. It’s a single melt
operation; no need to cast
afterwards because a long format is what we want.
life_long <- melt(uslifeexp,
id.vars = 'year',
variable.name = 'group',
value.name = 'le')
year group le
1 1900 le_total 47.3
2 1901 le_total 49.1
3 1902 le_total 51.5
4 1903 le_total 50.5
5 1904 le_total 47.6
6 1905 le_total 48.7
And the equivalent syntax using tidyr. Here we use the selector syntax “-year
” to mean ‘pivot all of the columns except this one’. You could also use starts_with('le')
to select the columns by name.
life_long <- uslifeexp %>%
pivot_longer(-year, names_to = 'group', values_to = 'le')
# A tibble: 6 x 3
year group le
<int> <chr> <dbl>
1 1900 le_total 47.3
2 1900 le_male 46.3
3 1900 le_female 48.3
4 1900 le_w 47.6
5 1900 le_wmale 46.6
6 1900 le_wfemale 48.7
Hence you can recreate the scatter plot from earlier.
Better yet, let’s plot all of the variables at once. No loops necessary!
ggplot(life_long) +
aes(year, le) +
geom_point() +
facet_wrap( ~ group)
Other packages are available, of course. Another option for small-multiples plots is lattice.