Lecture 11: Refinements of the Stata Language

Graphics. Summarising data. More syntax. Looping. Reshaping data.

Author

Affiliation

Centre for Epidemiology Versus Arthritis

Published

Feb. 2, 2021

DOI

This worksheet is based on the eleventh lecture of the Statistical Modelling in Stata course, created by Dr Mark Lunt and offered by the Centre for Epidemiology Versus Arthritis at the University of Manchester. The original Stata exercises and solutions are here translated into their R equivalents.

Refer to the original slides and Stata worksheets here.

Graphs

Load the dataset uselifeexp from the Stata web site. It concerns life expectancy in various subgroups in the United States from 1900–2000.

uslifeexp <- foreign::read.dta('https://www.stata-press.com/data/r8/uslifeexp.dta')

Similar datasets are available within R, but we load the Stata one for ease of comparison with the original worksheet.

Base R graphics

Produce a simple scatter plot of life expectancy against year.

plot(le ~ year, data = uslifeexp) # or plot(uslifeexp$year, uslifeexp$le)

You should see life expectancy increasing steadily, with a blip of very low life expectancy in 1918. Now we will add a title to this graph. Go back to your script and add a main argument to the plot command. We can also change the y-axis label, ylab.

plot(le ~ year, data = uslifeexp,
     ylab = 'life expectancy',
     main = 'US Life Expectancy')

Now we are going to extend the y-axis back to 0, rather than starting at 40. This is set by passing a vector of length 2 to the ylim argument.

plot(le ~ year, data = uslifeexp,
     ylab = 'life expectancy',
     main = 'US Life Expectancy',
     ylim = c(0, 80))

Now we will practise overlaying graphs, using the same dataset. We can compare male and female life expectancy. First plot the points for male life expectancy, then add the female points to the same plot using the points function. Set the plotting characters to different symbols and colours using pch and col arguments. Use the following guide:

Quick-R: Graphical Parameters

plot(le_male ~ year, data = uslifeexp,
     ylab = 'life expectancy',
     main = 'US Life Expectancy',
     ylim = c(0, 80),
     pch = 16, col = 'steelblue')
points(le_female ~ year, data = uslifeexp,
       pch = 16, col = 'tomato')

Add a legend, if you like.

legend('bottomright', c('Male', 'Female'), pch = 16,
       col = c('steelblue', 'tomato'))

You should see that life-expectancy is increasing over time in both sexes, but consistently higher in females. Add regression lines to the plot, after fitting the appropriate linear models, using the abline() function. You may wish to adjust your ylim.

model_m <- lm(le_female ~ year, data = uslifeexp)
model_f <- lm(le_male ~ year, data = uslifeexp)
legend('bottomright', c('Male', 'Female'), pch = 16, lty = 1,
       col = c('steelblue', 'tomato'))

abline(model_m, col = 'steelblue')
abline(model_f, col = 'tomato')

You can add confidence intervals to the plot in the same way, but it’s a bit of a pain.

ggplot2

A better option is ggplot2, which fits models, adds legends and shades the confidence intervals automatically. However, the package demands data in a long format, so that we can treat sex as a factor, rather than as two separate variables. Reshaping data will be explained later, but here’s a preview.

library(tidyverse)
uslifeexp %>%
  select(year, le_female, le_male) %>%
  pivot_longer(le_female:le_male,
               names_to = 'sex',
               values_to = 'lifeexp') %>%
  mutate(sex = str_remove(sex, 'le_')) %>% # clean up the labels
  ggplot() + aes(year, lifeexp, colour = sex) +
  geom_smooth(method = lm, formula = y ~ x) +
  geom_point() +
  labs(y = 'life expectancy', title = 'US Life Expectancy') +
  theme_classic() # Just because I don't like the default grey theme

Summarising data

Read the cancer dataset into R.

cancer <- foreign::read.dta('http://www.stata-press.com/data/r9/cancer.dta')

Find out what the dataset is about with str.

str(cancer)

'data.frame':   48 obs. of  4 variables:
 $ studytime: int  1 1 2 3 4 4 5 5 8 8 ...
 $ died     : int  1 1 1 1 1 1 1 1 1 0 ...
 $ drug     : int  1 1 1 1 1 1 1 1 1 1 ...
 $ age      : int  61 65 59 52 56 67 63 58 56 58 ...
 - attr(*, "datalabel")= chr "Patient Survival in Drug Trial"
 - attr(*, "time.stamp")= chr " 3 Mar 2005 16:09"
 - attr(*, "formats")= chr [1:4] "%8.0g" "%8.0g" "%8.0g" "%8.0g"
 - attr(*, "types")= int [1:4] 252 252 252 252
 - attr(*, "val.labels")= chr [1:4] "" "" "" ""
 - attr(*, "var.labels")= chr [1:4] "Months to death or end of exp." "1 if patient died" "Drug type (1=placebo)" "Patient's age at start of exp."
 - attr(*, "version")= int 8

(These attributes are carried over from Stata.)

How many observations are there in the dataset?

nrow(cancer)

[1] 48

Use the `summary()` function to get some idea of the values taken by the different variables. What was the longest followup time?

summary(cancer)

   studytime          died             drug            age       
 Min.   : 1.00   Min.   :0.0000   Min.   :1.000   Min.   :47.00  
 1st Qu.: 7.75   1st Qu.:0.0000   1st Qu.:1.000   1st Qu.:50.75  
 Median :12.50   Median :1.0000   Median :2.000   Median :56.00  
 Mean   :15.50   Mean   :0.6458   Mean   :1.875   Mean   :55.88  
 3rd Qu.:23.00   3rd Qu.:1.0000   3rd Qu.:3.000   3rd Qu.:60.00  
 Max.   :39.00   Max.   :1.0000   Max.   :3.000   Max.   :67.00

See summary output above, or use

max(cancer$studytime)

[1] 39

How many different treatments were in the study?

When first imported via read.dta this becomes an integer vector, so the summary above for drug might not make much sense. If you don’t want to convert this variable to a factor, then you can use

unique(cancer$drug)

[1] 1 2 3

otherwise, convert to factor and summarise again.

cancer <- transform(cancer, died = as.factor(died), drug = as.factor(drug))
summary(cancer)

   studytime     died   drug        age       
 Min.   : 1.00   0:17   1:20   Min.   :47.00  
 1st Qu.: 7.75   1:31   2:14   1st Qu.:50.75  
 Median :12.50          3:14   Median :56.00  
 Mean   :15.50                 Mean   :55.88  
 3rd Qu.:23.00                 3rd Qu.:60.00  
 Max.   :39.00                 Max.   :67.00

You can also now use

nlevels(cancer$drug)

[1] 3

How old were the oldest and youngest subjects in the study?

See summary output above, or use

range(cancer$age)

[1] 47 67

What was the mean age at the start of the study?

See summary output above, or use

mean(cancer$age)

[1] 55.875

What was the standard deviation of the followup time?

We don’t get standard deviations in summary output, so use:

sd(cancer$studytime)

[1] 10.25629

Use `table` (or `xtabs`) to cross-tabulate the number and percentage of subjects who died on each treatment. How many subjects on placebo died?

with(cancer, table(drug, died))

19 subjects died on placebo (= drug 1).

What percentage of subjects on treatment 2 died?

with(cancer, proportions(table(drug, died), margin = 1))

    died
drug         0         1
   1 0.0500000 0.9500000
   2 0.5714286 0.4285714
   3 0.5714286 0.4285714

43% of subjects on treatment 2 died.

Further syntax

What is the mean age of subjects in the cancer study who died?

with(cancer, mean(age[died == 1]))

[1] 56.80645

Other methods include

aggregate(age ~ died, mean, data = cancer)

  died      age
1    0 54.17647
2    1 56.80645

by(cancer$age, cancer$died, mean)

cancer$died: 0
[1] 54.17647
---------------------------------------------------- 
cancer$died: 1
[1] 56.80645

cancer %>% filter(died == 1) %>% summarise(mean = mean(age))

      mean
1 56.80645

cancer %>% group_by(died) %>% summarise(mean = mean(age))

# A tibble: 2 x 2
  died   mean
  <fct> <dbl>
1 0      54.2
2 1      56.8

Find the mean followup time among subjects on placebo.

Many different ways of doing it, as above.

mean(cancer$studytime[cancer$drug == 1]) # or use with()

[1] 9

aggregate(studytime ~ drug, mean, data = cancer)

  drug studytime
1    1   9.00000
2    2  14.92857
3    3  25.35714

by(cancer$studytime, cancer$drug, mean)

cancer$drug: 1
[1] 9
---------------------------------------------------- 
cancer$drug: 2
[1] 14.92857
---------------------------------------------------- 
cancer$drug: 3
[1] 25.35714

cancer %>% group_by(drug) %>% summarise(mean = mean(studytime))

# A tibble: 3 x 2
  drug   mean
  <fct> <dbl>
1 1       9  
2 2      14.9
3 3      25.4

What was the mean age among subjects who died after being treated on placebo?

with(cancer, mean(age[died == 1 & drug == 1]))

[1] 55.94737

aggregate(age ~ died + drug, mean, data = cancer)

  died drug      age
1    0    1 58.00000
2    1    1 55.94737
3    0    2 54.75000
4    1    2 59.83333
5    0    3 53.12500
6    1    3 56.50000

with(cancer, by(age, list(died = died, drug = drug), mean))

died: 0
drug: 1
[1] 58
---------------------------------------------------- 
died: 1
drug: 1
[1] 55.94737
---------------------------------------------------- 
died: 0
drug: 2
[1] 54.75
---------------------------------------------------- 
died: 1
drug: 2
[1] 59.83333
---------------------------------------------------- 
died: 0
drug: 3
[1] 53.125
---------------------------------------------------- 
died: 1
drug: 3
[1] 56.5

cancer %>% filter(died == 1, drug == 1) %>% summarise(mean = mean(age))

      mean
1 55.94737

cancer %>% group_by(died, drug) %>% summarise(mean = mean(age))

# A tibble: 6 x 3
# Groups:   died [2]
  died  drug   mean
  <fct> <fct> <dbl>
1 0     1      58  
2 0     2      54.8
3 0     3      53.1
4 1     1      55.9
5 1     2      59.8
6 1     3      56.5

Create a variable called `agegrp`, dividing subjects into two more-or-less equally sized groups.

You can try using cut(age, 2) but this will split the data at 57 into equal lengths of years, rather than numbers of subjects.

You need to specify the median manually, and then either set lower and upper bounds (c(min(age), median(age), max(age)) or equivalently quantile(age, c(0, .5, 1))).

Or use findInterval, which lets you specify a single breakpoint (but doesn’t label the result).

# Generates a vector of 0s and 1s:
cancer$agegrp <- findInterval(cancer$age, median(cancer$age))
# Slightly better as you get labels:
cancer$agegrp <- cut(cancer$age,
                     quantile(cancer$age, c(0, .5, 1)),
                     include.lowest = TRUE)

If you prefer a specific pre-made function, use cut2 from the Hmisc package, which will cut into the number of quantile groups specified by argument g.

cancer$agegrp <- Hmisc::cut2(cancer$age, g = 2)

Create a new variable containing the number of subjects in each age-group.

Lots of ways of doing this.

# Using factor `agegrp` to index table() output
cancer <- transform(cancer, group_size = table(agegrp)[agegrp])
# Using ave()
cancer <- transform(cancer, group_size = ave(age, agegrp, FUN = length))
# Using dplyr
cancer <- dplyr::add_count(cancer, agegrp, name = 'group_size')

Or in data.table:

library(data.table)
setDT(cancer)
cancer[, group_size := .N, by = agegrp]

All give the same result.

Assign the dataset to the symbol mycancer, for later exercises.

mycancer <- cancer

Looping

Warning. You will rarely need to use loops of any kind in R. Nearly all tasks are better accomplished by using vectorised operations or using an apply (or map) type function. Exceptions to this are iterative processes that are performed for their side effects, such as saving to file, producing plots or printing to the console. (However, even these might be accomplished with functional programming verbs like purrr::walk.) Nevertheless, here’s a direct R transliteration of the Stata worksheet.

For a simple illustration of looping, type

for (x in c('one', 'two', 'three'))
  print(x)

[1] "one"
[1] "two"
[1] "three"

Note. You don’t need to wrap the loop expression in curly brackets {} unless it spans multiple lines. Here it’s just a single print operation, so they are omitted.

You should still have the dataset mycancer from the last section. You can now enter the following code:

for (x in c('drug', 'agegrp')) {
  print(table(cancer$died, cancer[[x]]))
}

   
     1  2  3
  0  1  8  8
  1 19  6  6
   
    [47,57) [57,67]
  0      12       5
  1      15      16

This should produce two cross-tabulations, one for drug against died, and the other for agegrp against died.

Note. Expressions inside loops need an explicit call to print() to appear in the console.

For a more complex example of using for(), load the life-expectancy data (uslifeexp) from earlier. We will create graphs for all of the variables:

for (x in grep('^le', colnames(uslifeexp), value = TRUE)) {
  plot(uslifeexp$year, uslifeexp[[x]], ylab = x, main = x)
}

In interactive use, you can subset a data frame using df$myvar, but if you want to select a column by a string then you need df[[x]], which will give the same result as if you’d typed the value of x after the dollar sign. This is so you don’t mix up a column whose name is equal to the value of x with a column whose name is literally “x”.

If you really did want to produce a set of plots like above, you might be better off using lattice or ggplot2 to produce a set of small multiples (facets); this is demonstrated at the bottom of the page. A prerequisite is reshaping the data from wide to long, which is described in the next section.

Reshaping data

There are several ways to reshape data in R.

reshape: built into R, though isn’t so easy to use (in fact, it’s infuriatingly unintuitive!)
Package reshape2: designed to replace the base command with a better interface, via melt and cast functions (also built into data.table)
Package tidyr: part of the tidyverse family, provides pivot_longer and pivot_wider functions (formerly known as gather and spread)

The main differences between these three approaches are that reshape tries to do everything in a single function, reshape2 uses a set of functions to coerce the data to and from an intermediate (molten) format, and tidyr has a pair of functions that convert between wide and long directly.

Generally speaking, you want to use either reshape2 or tidyr, and if in doubt then start with tidyr and see how you get on.

Long to wide

Read the bplong dataset into R with

bplong <- foreign::read.dta('https://www.stata-press.com/data/r8/bplong.dta')

Use an appropriate method to get the mean and standard deviation of the blood pressure for categories when = 'Before' and when = 'After', separately. Make a note of these so you can check later that your reshaping was successful.

bplong %>% group_by(when) %>% summarise(mean = mean(bp), sd = sd(bp))

# A tibble: 2 x 3
  when    mean    sd
  <fct>  <dbl> <dbl>
1 Before  156.  11.4
2 After   151.  14.2

To reshape these data into wide form, the unique identifier is patient and the variable to ‘widen’ is when.

First let’s see reshape:

bpwide <- reshape(bplong,
                  v.names = 'bp',
                  timevar = 'when',
                  idvar = 'patient',
                  direction = 'wide')

   patient  sex agegrp bp.Before bp.After
1        1 Male  30-45       143      153
3        2 Male  30-45       163      170
5        3 Male  30-45       153      168
7        4 Male  30-45       153      142
9        5 Male  30-45       146      141
11       6 Male  30-45       150      147

With reshape2, you can melt an object into a ‘molten’ long-format data frame, then cast it back into the desired wider format.

library(reshape2)
molten <- melt(bplong, measure.vars = 'bp')
bpwide <- dcast(molten, patient + sex + agegrp ~ when)

  patient  sex agegrp Before After
1       1 Male  30-45    143   153
2       2 Male  30-45    163   170
3       3 Male  30-45    153   168
4       4 Male  30-45    153   142
5       5 Male  30-45    146   141
6       6 Male  30-45    150   147

And finally the package tidyr provides pivot_wider, which performs the reshape operation in a single function call (the dual function is pivot_longer) without having to melt it first.

library(tidyr)
bpwide <- bplong %>% pivot_wider(names_from = 'when', values_from = 'bp')

# A tibble: 6 x 5
  patient sex   agegrp Before After
    <int> <fct> <fct>   <int> <int>
1       1 Male  30-45     143   153
2       2 Male  30-45     163   170
3       3 Male  30-45     153   168
4       4 Male  30-45     153   142
5       5 Male  30-45     146   141
6       6 Male  30-45     150   147

Check the summary statistics have not changed.

bpwide %>%
  summarise_at(vars(Before:After),
               list(mean = mean, sd = sd))

# A tibble: 1 x 4
  Before_mean After_mean Before_sd After_sd
        <dbl>      <dbl>     <dbl>    <dbl>
1        156.       151.      11.4     14.2

Wide to long

Reshaping from wide to long is especially useful for packages like ggplot2, as it allows you to colour or facet according to different variables. It’s also handy when summarising lots of variables at once.

Reload the uslifeexp life expectancy data from earlier.

There are a series of variables giving the life-expectancy in different subgroups of the US population over the years, and a variable le containing the overall life-expectancy. We will change this to have several observations for each year, a single variable le containing the life expectancy and a variable group saying which subgroup the life-expectancy applies to. First, we need to change the name of the variable le to le_total, so that there is something to put in the group variable for the overall life expectancy.

colnames(uslifeexp)[2] <- 'le_total'

Now, unique observations are identified by the variable year, the variable we want to have in the long data is le, and the variable we want to identify which variable in wide form corresponds to an observation in long form is called group.

The different ways of achieving this are as follows. Firstly, for the masochists, reshape.

life_long <-
  reshape(uslifeexp,
          varying = list(colnames(uslifeexp)[-1]), # everything except 'year'
          v.names = 'le',
          timevar = 'group',
          idvar = 'year',
          times = colnames(uslifeexp)[-1],
          direction = 'long')

              year    group   le
1900.le_total 1900 le_total 47.3
1901.le_total 1901 le_total 49.1
1902.le_total 1902 le_total 51.5
1903.le_total 1903 le_total 50.5
1904.le_total 1904 le_total 47.6
1905.le_total 1905 le_total 48.7

Next a reshape2 workflow. It’s a single melt operation; no need to cast afterwards because a long format is what we want.

life_long <- melt(uslifeexp,
                  id.vars = 'year',
                  variable.name = 'group',
                  value.name = 'le')

  year    group   le
1 1900 le_total 47.3
2 1901 le_total 49.1
3 1902 le_total 51.5
4 1903 le_total 50.5
5 1904 le_total 47.6
6 1905 le_total 48.7

And the equivalent syntax using tidyr. Here we use the selector syntax “-year” to mean ‘pivot all of the columns except this one’. You could also use starts_with('le') to select the columns by name.

life_long <- uslifeexp %>%
  pivot_longer(-year, names_to = 'group', values_to = 'le')

# A tibble: 6 x 3
   year group         le
  <int> <chr>      <dbl>
1  1900 le_total    47.3
2  1900 le_male     46.3
3  1900 le_female   48.3
4  1900 le_w        47.6
5  1900 le_wmale    46.6
6  1900 le_wfemale  48.7

Hence you can recreate the scatter plot from earlier.

life_long %>%
  filter(group %in% c('le_male', 'le_female')) %>%
  mutate(group = str_remove(group, '^le_')) %>% # remove prefix
  ggplot() + aes(year, le, colour = group) +
  geom_smooth(method = lm) +
  geom_point() +
  labs(colour = 'sex', y = 'life expectancy',
       title = 'US Life Expectancy')

Better yet, let’s plot all of the variables at once. No loops necessary!

ggplot(life_long) +
  aes(year, le) +
  geom_point() +
  facet_wrap( ~ group)

Other packages are available, of course. Another option for small-multiples plots is lattice.

library(lattice)
xyplot(le ~ year | group, life_long, layout = c(3, 3))

Lecture 11: Refinements of the Stata Language

Author

Affiliation

Published

DOI

Graphs

Base R graphics

ggplot2

Summarising data

How many observations are there in the dataset?

Use the summary() function to get some idea of the values taken by the different variables. What was the longest followup time?

How many different treatments were in the study?

How old were the oldest and youngest subjects in the study?

What was the mean age at the start of the study?

What was the standard deviation of the followup time?

Use table (or xtabs) to cross-tabulate the number and percentage of subjects who died on each treatment. How many subjects on placebo died?

What percentage of subjects on treatment 2 died?

Further syntax

What is the mean age of subjects in the cancer study who died?

Find the mean followup time among subjects on placebo.

What was the mean age among subjects who died after being treated on placebo?

Create a variable called agegrp, dividing subjects into two more-or-less equally sized groups.

Create a new variable containing the number of subjects in each age-group.

Looping

Reshaping data

Long to wide

Wide to long

Footnotes

Use the `summary()` function to get some idea of the values taken by the different variables. What was the longest followup time?

Use `table` (or `xtabs`) to cross-tabulate the number and percentage of subjects who died on each treatment. How many subjects on placebo died?

Create a variable called `agegrp`, dividing subjects into two more-or-less equally sized groups.