Wrangling data the right way with R

Tips on data manipulation with the packages data.table, dplyr and tidyr, including reshaping data, creating lagged variables and preparing data for survival analysis.

Sian Bladon https://www.research.manchester.ac.uk/portal/sian.bladon.html (Centre for Health Informatics)https://www.herc.ac.uk/ , James Gwinnutt https://www.research.manchester.ac.uk/portal/james.gwinnutt.html (Centre for Epidemiology Versus Arthritis)https://www.cfe.manchester.ac.uk , David Selby https://www.research.manchester.ac.uk/portal/david.selby.html (Centre for Epidemiology Versus Arthritis)https://www.cfe.manchester.ac.uk
2021-02-26

Reshaping data (reshape, pivot, melt, cast)

By Sian Bladon and James Gwinnutt

See the R solutions for Lecture 11 of the Statistical Modelling with Stata course.

Lagged variables

By Sian Bladon and David Selby

To create lag (or lead) variables in R is very easy, and the syntax can be similar to Stata or different, depending on which package you choose to use. Consider the following data frame called observations.

time patient value
1 1 74
2 1 60
3 1 50
4 1 60
1 2 45
2 2 61
3 2 38
4 2 33

lead, lag and shift functions

The easiest way is to use the lead() and lag() functions built into the package dplyr.

library(dplyr)
observations %>%
  group_by(patient) %>%
  mutate(lag1 = lag(value, n = 1))
time patient value lag1
1 1 74
2 1 60 74
3 1 50 60
4 1 60 50
1 2 45
2 2 61 45
3 2 38 61
4 2 33 38
observations %>%
  group_by(patient) %>%
  mutate(lead1 = lead(value, n = 1))
time patient value lead1
1 1 74 60
2 1 60 50
3 1 50 60
4 1 60
1 2 45 61
2 2 61 38
3 2 38 33
4 2 33

Or if you prefer the package data.table’s syntax, the function is shift():

library(data.table)
setDT(observations)
observations[, .(lag2 = shift(value, 2, type = 'lag')), by = patient]
patient lag2
1
1
1 74
1 60
2
2
2 45
2 61

For quick analyses (without grouping variables) you can also do it yourself using head(x, -1), tail(x, -1) or basic subsetting using syntax x[-1] and x[-length(x)], but you’re less likely to make a mistake by using the dedicated functions.

Gaps between records

If you have gaps in your records and you only want to lag successive years (for example), you can use a conditional (e.g. ifelse) expression. Consider the dataset gap_obs:

time value
1 42
2 56
3 65
5 52
6 45
8 42

Consecutive records are those whose time points have lag-1 differences of 1 time unit.

gap_obs %>%
  mutate(lag1 = ifelse(time - lag(time) == 1, lag(value), NA))
time value lag1
1 42
2 56 42
3 65 56
5 52
6 45 52
8 42

You can also calculated lagged differences with the diff function, but the result will be of length 1 shorter than your input vector, so you need to prepend an NA.

Last observation carried forward

If data are missing (NA) and you want to fill it in with the last observation carried forward (LOCF) you can use the nafill from the data.table package. Change the type argument to change between last observation carried forward, next observation carried backward, or to fill in NAs with a specified constant value.

x <- c(10, 20, 25, NA, 30, 10)
nafill(x, type = 'locf')
[1] 10 20 25 25 30 10
nafill(x, type = 'nocb')
[1] 10 20 25 30 30 10
nafill(x, type = 'const', fill = 0)
[1] 10 20 25  0 30 10

The zoo package (Z’s Ordered Observations) provides a similar function called na.locf().

[1] 10 20 25 25 30 10
na.locf(x, fromLast = TRUE) # next obs carried backward
[1] 10 20 25 30 30 10

Equivalently, you can turn implicit missing values to explicit ones and then use a conditional expression. (Be aware this may not be very efficient if your records have lots of gaps.)

library(tidyr)
gap_obs %>%
  # Add explicit NAs for the missing time points
  tidyr::complete(time = seq(min(time), max(time), by = 1)) %>%
  # Last obs carried forward for missing values, using lag
  mutate(value2 = ifelse(is.na(value), lag(value), value),
         value3 = zoo::na.locf(value)) # same result
time value value2 value3
1 42 42 42
2 56 56 56
3 65 65 65
4 65 65
5 52 52 52
6 45 45 45
7 45 45
8 42 42 42

Rolling averages

Survival analysis

By David Selby

This example uses the dataset leukaemia.csv (right-click to download) from lecture 10 of the Statistical Modelling with Stata course.

leukaemia <- read.csv('leukaemia.csv')
weeks relapse treatment1 treatment2 wbc3cat
1 1 Standard Drug B Moderate
1 1 Standard Drug B High
2 1 Standard Drug B High
2 1 Standard Drug B High
3 1 Standard Drug B High
4 1 Standard Drug B High
4 1 Standard Drug B Moderate
5 1 Standard Drug B High
5 1 Standard Standard High
6 1 Drug A Standard Moderate
6 0 Drug A Standard High
6 1 Drug A Standard High
6 1 Drug A Drug B High
7 1 Drug A Standard High
8 1 Standard Standard Moderate
8 1 Standard Standard High
8 1 Standard Drug B High
8 1 Standard Standard High
9 0 Drug A Standard Moderate
10 1 Drug A Standard Moderate
10 0 Drug A Standard Moderate
11 1 Standard Standard High
11 0 Drug A Standard Moderate
11 1 Standard Standard Normal
12 1 Standard Standard High
12 1 Standard Standard Normal
13 1 Drug A Standard Moderate
15 1 Standard Standard Normal
16 1 Drug A Drug B High
17 1 Standard Standard Moderate
17 0 Drug A Standard Normal
19 0 Drug A Standard Normal
20 0 Drug A Drug B Normal
22 1 Standard Standard Moderate
22 1 Drug A Drug B Moderate
23 1 Standard Drug B Normal
23 1 Drug A Drug B Moderate
25 0 Drug A Drug B Normal
32 0 Drug A Drug B Moderate
32 0 Drug A Drug B Normal
34 0 Drug A Drug B Normal
35 0 Drug A Drug B Normal

Suppose we want to fit a Cox proportional hazards model to these data to model the effect of treatment2. The Kaplan–Meier survival curves look like this:

library(survival)
leuk_km <- survfit(Surv(weeks, relapse) ~ treatment2, data = leukaemia)

library(survminer)
ggsurvplot(leuk_km)
Kaplan-Meier survival curve comparing relapse in leukaemia patients on 'Drug B' versus standard treatment

Figure 1: Kaplan-Meier survival curve comparing relapse in leukaemia patients on ‘Drug B’ versus standard treatment

In the first 10 weeks, survival on standard treatment appears to be better than Drug B, but after 10 weeks, the reverse appears to be true.

A Kaplan–Meier plot with crossing survival curves suggests that a proportional hazards assumption may not be reasonable. You can also verify this by checking whether the Schoenfeld residuals for a Cox model (while accounting for white blood cell count) appear constant over time:

leuk_cox1 <- coxph(Surv(weeks, relapse) ~ treatment2 + wbc3cat, data = leukaemia)
plot(cox.zph(leuk_cox1)[1])
Schoenfeld residuals for a Cox proportional hazards model of leukaemia relapse time against white blood cell count and treatment

Figure 2: Schoenfeld residuals for a Cox proportional hazards model of leukaemia relapse time against white blood cell count and treatment

Note: due to a bug in survminer, its function ggcoxzph (for plotting Schoenfeld residuals) draws the confidence bands incorrectly. Instead, use the base plotting function (as above), set ggcoxzph(..., se = FALSE), or check GitHub to see when the issue will be fixed.

Let’s split the dataset at 10 weeks, to compare the different effects of Drug B before and after this cutpoint.

leuk_split <- survSplit(Surv(weeks, relapse) ~ treatment2 + wbc3cat,
                        data = leukaemia, cut = 10, episode = 'time_group')
treatment2 wbc3cat tstart weeks relapse time_group
Drug B Moderate 0 1 1 1
Drug B High 0 1 1 1
Drug B High 0 2 1 1
Drug B High 0 2 1 1
Drug B High 0 3 1 1
Drug B High 0 4 1 1
Drug B Moderate 0 4 1 1
Drug B High 0 5 1 1
Standard High 0 5 1 1
Standard Moderate 0 6 1 1
Standard High 0 6 0 1
Standard High 0 6 1 1
Drug B High 0 6 1 1
Standard High 0 7 1 1
Standard Moderate 0 8 1 1
Standard High 0 8 1 1
Drug B High 0 8 1 1
Standard High 0 8 1 1
Standard Moderate 0 9 0 1
Standard Moderate 0 10 1 1
Standard Moderate 0 10 0 1
Standard High 0 10 0 1
Standard High 10 11 1 2
Standard Moderate 0 10 0 1
Standard Moderate 10 11 0 2
Standard Normal 0 10 0 1
Standard Normal 10 11 1 2
Standard High 0 10 0 1
Standard High 10 12 1 2
Standard Normal 0 10 0 1
Standard Normal 10 12 1 2
Standard Moderate 0 10 0 1
Standard Moderate 10 13 1 2
Standard Normal 0 10 0 1
Standard Normal 10 15 1 2
Drug B High 0 10 0 1
Drug B High 10 16 1 2
Standard Moderate 0 10 0 1
Standard Moderate 10 17 1 2
Standard Normal 0 10 0 1
Standard Normal 10 17 0 2
Standard Normal 0 10 0 1
Standard Normal 10 19 0 2
Drug B Normal 0 10 0 1
Drug B Normal 10 20 0 2
Standard Moderate 0 10 0 1
Standard Moderate 10 22 1 2
Drug B Moderate 0 10 0 1
Drug B Moderate 10 22 1 2
Drug B Normal 0 10 0 1
Drug B Normal 10 23 1 2
Drug B Moderate 0 10 0 1
Drug B Moderate 10 23 1 2
Drug B Normal 0 10 0 1
Drug B Normal 10 25 0 2
Drug B Moderate 0 10 0 1
Drug B Moderate 10 32 0 2
Drug B Normal 0 10 0 1
Drug B Normal 10 32 0 2
Drug B Normal 0 10 0 1
Drug B Normal 10 34 0 2
Drug B Normal 0 10 0 1
Drug B Normal 10 35 0 2

Now fit the model to compare the effect of Drug B before and after 10 weeks.

leuk_cox2 <-
  coxph(Surv(tstart, weeks, relapse) ~ wbc3cat + treatment2 * strata(time_group),
        data = leuk_split)

plot(cox.zph(leuk_cox2)[2])

Is the hazard ratio for the effect of Drug B before 10 weeks significant? What about after 10 weeks?

cbind(hazard = coef(leuk_cox2), confint(leuk_cox2)) %>%
  exp
hazard 2.5% 97.5%
Moderate WBC 3.15 0.95 10.4
High WBC 16.30 4.63 57.4
Drug B < 10 wks 2.18 0.85 5.6
Drug B >= 10 wks 0.05 0.01 0.3

See the R solutions for Lecture 10 of the Statistical Modelling with Stata course for more information.

Another useful function for wrangling survival data is tmerge() (for combining time-dependent variables with baseline data).