David Selby: Wrangling data the right way with R

To create lag (or lead) variables in R is very easy, and the syntax can be similar to Stata or different, depending on which package you choose to use. Consider the following data frame called observations.

time	patient	value
1	1	74
2	1	60
3	1	50
4	1	60
1	2	45
2	2	61
3	2	38
4	2	33

lead, lag and shift functions

The easiest way is to use the lead() and lag() functions built into the package dplyr.

time	patient	value	lag1
1	1	74
2	1	60	74
3	1	50	60
4	1	60	50
1	2	45
2	2	61	45
3	2	38	61
4	2	33	38

time	patient	value	lead1
1	1	74	60
2	1	60	50
3	1	50	60
4	1	60
1	2	45	61
2	2	61	38
3	2	38	33
4	2	33

Or if you prefer the package data.table’s syntax, the function is shift():

patient	lag2
1
1
1	74
1	60
2
2
2	45
2	61

For quick analyses (without grouping variables) you can also do it yourself using head(x, -1), tail(x, -1) or basic subsetting using syntax x[-1] and x[-length(x)], but you’re less likely to make a mistake by using the dedicated functions.

Gaps between records

If you have gaps in your records and you only want to lag successive years (for example), you can use a conditional (e.g. ifelse) expression. Consider the dataset gap_obs:

time	value
1	42
2	56
3	65
5	52
6	45
8	42

Consecutive records are those whose time points have lag-1 differences of 1 time unit.

time	value	lag1
1	42
2	56	42
3	65	56
5	52
6	45	52
8	42

You can also calculated lagged differences with the diff function, but the result will be of length 1 shorter than your input vector, so you need to prepend an NA.

Last observation carried forward

If data are missing (NA) and you want to fill it in with the last observation carried forward (LOCF) you can use the nafill from the data.table package. Change the type argument to change between last observation carried forward, next observation carried backward, or to fill in NAs with a specified constant value.

The zoo package (Z’s Ordered Observations) provides a similar function called na.locf().

Equivalently, you can turn implicit missing values to explicit ones and then use a conditional expression. (Be aware this may not be very efficient if your records have lots of gaps.)

library(tidyr)
gap_obs %>%
  # Add explicit NAs for the missing time points
  tidyr::complete(time = seq(min(time), max(time), by = 1)) %>%
  # Last obs carried forward for missing values, using lag
  mutate(value2 = ifelse(is.na(value), lag(value), value),
         value3 = zoo::na.locf(value)) # same result

time	value	value2	value3
1	42	42	42
2	56	56	56
3	65	65	65
4		65	65
5	52	52	52
6	45	45	45
7		45	45
8	42	42	42

Rolling averages

Survival analysis

weeks	relapse	treatment1	treatment2	wbc3cat
1	1	Standard	Drug B	Moderate
1	1	Standard	Drug B	High
2	1	Standard	Drug B	High
2	1	Standard	Drug B	High
3	1	Standard	Drug B	High
4	1	Standard	Drug B	High
4	1	Standard	Drug B	Moderate
5	1	Standard	Drug B	High
5	1	Standard	Standard	High
6	1	Drug A	Standard	Moderate
6	0	Drug A	Standard	High
6	1	Drug A	Standard	High
6	1	Drug A	Drug B	High
7	1	Drug A	Standard	High
8	1	Standard	Standard	Moderate
8	1	Standard	Standard	High
8	1	Standard	Drug B	High
8	1	Standard	Standard	High
9	0	Drug A	Standard	Moderate
10	1	Drug A	Standard	Moderate
10	0	Drug A	Standard	Moderate
11	1	Standard	Standard	High
11	0	Drug A	Standard	Moderate
11	1	Standard	Standard	Normal
12	1	Standard	Standard	High
12	1	Standard	Standard	Normal
13	1	Drug A	Standard	Moderate
15	1	Standard	Standard	Normal
16	1	Drug A	Drug B	High
17	1	Standard	Standard	Moderate
17	0	Drug A	Standard	Normal
19	0	Drug A	Standard	Normal
20	0	Drug A	Drug B	Normal
22	1	Standard	Standard	Moderate
22	1	Drug A	Drug B	Moderate
23	1	Standard	Drug B	Normal
23	1	Drug A	Drug B	Moderate
25	0	Drug A	Drug B	Normal
32	0	Drug A	Drug B	Moderate
32	0	Drug A	Drug B	Normal
34	0	Drug A	Drug B	Normal
35	0	Drug A	Drug B	Normal

Suppose we want to fit a Cox proportional hazards model to these data to model the effect of treatment2. The Kaplan–Meier survival curves look like this:

library(survival)
leuk_km <- survfit(Surv(weeks, relapse) ~ treatment2, data = leukaemia)

library(survminer)
ggsurvplot(leuk_km)

Figure 1: Kaplan-Meier survival curve comparing relapse in leukaemia patients on ‘Drug B’ versus standard treatment

In the first 10 weeks, survival on standard treatment appears to be better than Drug B, but after 10 weeks, the reverse appears to be true.

A Kaplan–Meier plot with crossing survival curves suggests that a proportional hazards assumption may not be reasonable. You can also verify this by checking whether the Schoenfeld residuals for a Cox model (while accounting for white blood cell count) appear constant over time:

Figure 2: Schoenfeld residuals for a Cox proportional hazards model of leukaemia relapse time against white blood cell count and treatment

Note: due to a bug in survminer, its function ggcoxzph (for plotting Schoenfeld residuals) draws the confidence bands incorrectly. Instead, use the base plotting function (as above), set ggcoxzph(..., se = FALSE), or check GitHub to see when the issue will be fixed.

Let’s split the dataset at 10 weeks, to compare the different effects of Drug B before and after this cutpoint.

treatment2	wbc3cat	tstart	weeks	relapse	time_group
Drug B	Moderate	0	1	1	1
Drug B	High	0	1	1	1
Drug B	High	0	2	1	1
Drug B	High	0	2	1	1
Drug B	High	0	3	1	1
Drug B	High	0	4	1	1
Drug B	Moderate	0	4	1	1
Drug B	High	0	5	1	1
Standard	High	0	5	1	1
Standard	Moderate	0	6	1	1
Standard	High	0	6	0	1
Standard	High	0	6	1	1
Drug B	High	0	6	1	1
Standard	High	0	7	1	1
Standard	Moderate	0	8	1	1
Standard	High	0	8	1	1
Drug B	High	0	8	1	1
Standard	High	0	8	1	1
Standard	Moderate	0	9	0	1
Standard	Moderate	0	10	1	1
Standard	Moderate	0	10	0	1
Standard	High	0	10	0	1
Standard	High	10	11	1	2
Standard	Moderate	0	10	0	1
Standard	Moderate	10	11	0	2
Standard	Normal	0	10	0	1
Standard	Normal	10	11	1	2
Standard	High	0	10	0	1
Standard	High	10	12	1	2
Standard	Normal	0	10	0	1
Standard	Normal	10	12	1	2
Standard	Moderate	0	10	0	1
Standard	Moderate	10	13	1	2
Standard	Normal	0	10	0	1
Standard	Normal	10	15	1	2
Drug B	High	0	10	0	1
Drug B	High	10	16	1	2
Standard	Moderate	0	10	0	1
Standard	Moderate	10	17	1	2
Standard	Normal	0	10	0	1
Standard	Normal	10	17	0	2
Standard	Normal	0	10	0	1
Standard	Normal	10	19	0	2
Drug B	Normal	0	10	0	1
Drug B	Normal	10	20	0	2
Standard	Moderate	0	10	0	1
Standard	Moderate	10	22	1	2
Drug B	Moderate	0	10	0	1
Drug B	Moderate	10	22	1	2
Drug B	Normal	0	10	0	1
Drug B	Normal	10	23	1	2
Drug B	Moderate	0	10	0	1
Drug B	Moderate	10	23	1	2
Drug B	Normal	0	10	0	1
Drug B	Normal	10	25	0	2
Drug B	Moderate	0	10	0	1
Drug B	Moderate	10	32	0	2
Drug B	Normal	0	10	0	1
Drug B	Normal	10	32	0	2
Drug B	Normal	0	10	0	1
Drug B	Normal	10	34	0	2
Drug B	Normal	0	10	0	1
Drug B	Normal	10	35	0	2

leuk_cox2 <-
  coxph(Surv(tstart, weeks, relapse) ~ wbc3cat + treatment2 * strata(time_group),
        data = leuk_split)

plot(cox.zph(leuk_cox2)[2])

Is the hazard ratio for the effect of Drug B before 10 weeks significant? What about after 10 weeks?

	hazard	2.5%	97.5%
Moderate WBC	3.15	0.95	10.4
High WBC	16.30	4.63	57.4
Drug B < 10 wks	2.18	0.85	5.6
Drug B >= 10 wks	0.05	0.01	0.3

See the R solutions for Lecture 10 of the Statistical Modelling with Stata course for more information.

Another useful function for wrangling survival data is tmerge() (for combining time-dependent variables with baseline data).

Wrangling data the right way with R

Authors

Affiliations

Published

DOI

Reshaping data (reshape, pivot, melt, cast)

Lagged variables

`lead`, `lag` and `shift` functions

Gaps between records

Last observation carried forward

Rolling averages

Survival analysis

Footnotes

Wrangling data the right way with R

Authors

Affiliations

Published

DOI

Reshaping data (reshape, pivot, melt, cast)

Lagged variables

lead, lag and shift functions

Gaps between records

Last observation carried forward

Rolling averages

Survival analysis

Footnotes

`lead`, `lag` and `shift` functions