Tips on data manipulation with the packages data.table, dplyr and tidyr, including reshaping data, creating lagged variables and preparing data for survival analysis.
By Sian Bladon and James Gwinnutt
See the R solutions for Lecture 11 of the Statistical Modelling with Stata course.
By Sian Bladon and David Selby
To create lag (or lead) variables in R is very easy, and the syntax can be similar to Stata or different, depending on which package you choose to use. Consider the following data frame called observations
.
time | patient | value |
---|---|---|
1 | 1 | 74 |
2 | 1 | 60 |
3 | 1 | 50 |
4 | 1 | 60 |
1 | 2 | 45 |
2 | 2 | 61 |
3 | 2 | 38 |
4 | 2 | 33 |
lead
, lag
and shift
functionsThe easiest way is to use the lead()
and lag()
functions built into the package dplyr.
time | patient | value | lag1 |
---|---|---|---|
1 | 1 | 74 | |
2 | 1 | 60 | 74 |
3 | 1 | 50 | 60 |
4 | 1 | 60 | 50 |
1 | 2 | 45 | |
2 | 2 | 61 | 45 |
3 | 2 | 38 | 61 |
4 | 2 | 33 | 38 |
time | patient | value | lead1 |
---|---|---|---|
1 | 1 | 74 | 60 |
2 | 1 | 60 | 50 |
3 | 1 | 50 | 60 |
4 | 1 | 60 | |
1 | 2 | 45 | 61 |
2 | 2 | 61 | 38 |
3 | 2 | 38 | 33 |
4 | 2 | 33 |
Or if you prefer the package data.table’s syntax, the function is shift()
:
library(data.table)
setDT(observations)
observations[, .(lag2 = shift(value, 2, type = 'lag')), by = patient]
patient | lag2 |
---|---|
1 | |
1 | |
1 | 74 |
1 | 60 |
2 | |
2 | |
2 | 45 |
2 | 61 |
For quick analyses (without grouping variables) you can also do it yourself using head(x, -1)
, tail(x, -1)
or basic subsetting using syntax x[-1]
and x[-length(x)]
, but you’re less likely to make a mistake by using the dedicated functions.
If you have gaps in your records and you only want to lag successive years (for example), you can use a conditional (e.g. ifelse
) expression. Consider the dataset gap_obs
:
time | value |
---|---|
1 | 42 |
2 | 56 |
3 | 65 |
5 | 52 |
6 | 45 |
8 | 42 |
Consecutive records are those whose time points have lag-1 differences of 1 time unit.
time | value | lag1 |
---|---|---|
1 | 42 | |
2 | 56 | 42 |
3 | 65 | 56 |
5 | 52 | |
6 | 45 | 52 |
8 | 42 |
You can also calculated lagged differences with the diff
function, but the result will be of length 1 shorter than your input vector, so you need to prepend an NA
.
If data are missing (NA
) and you want to fill it in with the last observation carried forward (LOCF) you can use the nafill
from the data.table package. Change the type
argument to change between last observation carried forward, next observation carried backward, or to fill in NA
s with a specified constant value.
[1] 10 20 25 25 30 10
nafill(x, type = 'nocb')
[1] 10 20 25 30 30 10
nafill(x, type = 'const', fill = 0)
[1] 10 20 25 0 30 10
The zoo package (Z’s Ordered Observations) provides a similar function called na.locf()
.
Equivalently, you can turn implicit missing values to explicit ones and then use a conditional expression. (Be aware this may not be very efficient if your records have lots of gaps.)
library(tidyr)
gap_obs %>%
# Add explicit NAs for the missing time points
tidyr::complete(time = seq(min(time), max(time), by = 1)) %>%
# Last obs carried forward for missing values, using lag
mutate(value2 = ifelse(is.na(value), lag(value), value),
value3 = zoo::na.locf(value)) # same result
time | value | value2 | value3 |
---|---|---|---|
1 | 42 | 42 | 42 |
2 | 56 | 56 | 56 |
3 | 65 | 65 | 65 |
4 | 65 | 65 | |
5 | 52 | 52 | 52 |
6 | 45 | 45 | 45 |
7 | 45 | 45 | |
8 | 42 | 42 | 42 |
By David Selby
This example uses the dataset leukaemia.csv
(right-click to download) from lecture 10 of the Statistical Modelling with Stata course.
leukaemia <- read.csv('leukaemia.csv')
weeks | relapse | treatment1 | treatment2 | wbc3cat |
---|---|---|---|---|
1 | 1 | Standard | Drug B | Moderate |
1 | 1 | Standard | Drug B | High |
2 | 1 | Standard | Drug B | High |
2 | 1 | Standard | Drug B | High |
3 | 1 | Standard | Drug B | High |
4 | 1 | Standard | Drug B | High |
4 | 1 | Standard | Drug B | Moderate |
5 | 1 | Standard | Drug B | High |
5 | 1 | Standard | Standard | High |
6 | 1 | Drug A | Standard | Moderate |
6 | 0 | Drug A | Standard | High |
6 | 1 | Drug A | Standard | High |
6 | 1 | Drug A | Drug B | High |
7 | 1 | Drug A | Standard | High |
8 | 1 | Standard | Standard | Moderate |
8 | 1 | Standard | Standard | High |
8 | 1 | Standard | Drug B | High |
8 | 1 | Standard | Standard | High |
9 | 0 | Drug A | Standard | Moderate |
10 | 1 | Drug A | Standard | Moderate |
10 | 0 | Drug A | Standard | Moderate |
11 | 1 | Standard | Standard | High |
11 | 0 | Drug A | Standard | Moderate |
11 | 1 | Standard | Standard | Normal |
12 | 1 | Standard | Standard | High |
12 | 1 | Standard | Standard | Normal |
13 | 1 | Drug A | Standard | Moderate |
15 | 1 | Standard | Standard | Normal |
16 | 1 | Drug A | Drug B | High |
17 | 1 | Standard | Standard | Moderate |
17 | 0 | Drug A | Standard | Normal |
19 | 0 | Drug A | Standard | Normal |
20 | 0 | Drug A | Drug B | Normal |
22 | 1 | Standard | Standard | Moderate |
22 | 1 | Drug A | Drug B | Moderate |
23 | 1 | Standard | Drug B | Normal |
23 | 1 | Drug A | Drug B | Moderate |
25 | 0 | Drug A | Drug B | Normal |
32 | 0 | Drug A | Drug B | Moderate |
32 | 0 | Drug A | Drug B | Normal |
34 | 0 | Drug A | Drug B | Normal |
35 | 0 | Drug A | Drug B | Normal |
Suppose we want to fit a Cox proportional hazards model to these data to model the effect of treatment2
. The Kaplan–Meier survival curves look like this:
library(survival)
leuk_km <- survfit(Surv(weeks, relapse) ~ treatment2, data = leukaemia)
library(survminer)
ggsurvplot(leuk_km)
In the first 10 weeks, survival on standard treatment appears to be better than Drug B, but after 10 weeks, the reverse appears to be true.
A Kaplan–Meier plot with crossing survival curves suggests that a proportional hazards assumption may not be reasonable. You can also verify this by checking whether the Schoenfeld residuals for a Cox model (while accounting for white blood cell count) appear constant over time:
leuk_cox1 <- coxph(Surv(weeks, relapse) ~ treatment2 + wbc3cat, data = leukaemia)
plot(cox.zph(leuk_cox1)[1])
Note: due to a bug in survminer, its function ggcoxzph
(for plotting Schoenfeld residuals) draws the confidence bands incorrectly. Instead, use the base plotting function (as above), set ggcoxzph(..., se = FALSE)
, or check GitHub to see when the issue will be fixed.
Let’s split the dataset at 10 weeks, to compare the different effects of Drug B before and after this cutpoint.
leuk_split <- survSplit(Surv(weeks, relapse) ~ treatment2 + wbc3cat,
data = leukaemia, cut = 10, episode = 'time_group')
treatment2 | wbc3cat | tstart | weeks | relapse | time_group |
---|---|---|---|---|---|
Drug B | Moderate | 0 | 1 | 1 | 1 |
Drug B | High | 0 | 1 | 1 | 1 |
Drug B | High | 0 | 2 | 1 | 1 |
Drug B | High | 0 | 2 | 1 | 1 |
Drug B | High | 0 | 3 | 1 | 1 |
Drug B | High | 0 | 4 | 1 | 1 |
Drug B | Moderate | 0 | 4 | 1 | 1 |
Drug B | High | 0 | 5 | 1 | 1 |
Standard | High | 0 | 5 | 1 | 1 |
Standard | Moderate | 0 | 6 | 1 | 1 |
Standard | High | 0 | 6 | 0 | 1 |
Standard | High | 0 | 6 | 1 | 1 |
Drug B | High | 0 | 6 | 1 | 1 |
Standard | High | 0 | 7 | 1 | 1 |
Standard | Moderate | 0 | 8 | 1 | 1 |
Standard | High | 0 | 8 | 1 | 1 |
Drug B | High | 0 | 8 | 1 | 1 |
Standard | High | 0 | 8 | 1 | 1 |
Standard | Moderate | 0 | 9 | 0 | 1 |
Standard | Moderate | 0 | 10 | 1 | 1 |
Standard | Moderate | 0 | 10 | 0 | 1 |
Standard | High | 0 | 10 | 0 | 1 |
Standard | High | 10 | 11 | 1 | 2 |
Standard | Moderate | 0 | 10 | 0 | 1 |
Standard | Moderate | 10 | 11 | 0 | 2 |
Standard | Normal | 0 | 10 | 0 | 1 |
Standard | Normal | 10 | 11 | 1 | 2 |
Standard | High | 0 | 10 | 0 | 1 |
Standard | High | 10 | 12 | 1 | 2 |
Standard | Normal | 0 | 10 | 0 | 1 |
Standard | Normal | 10 | 12 | 1 | 2 |
Standard | Moderate | 0 | 10 | 0 | 1 |
Standard | Moderate | 10 | 13 | 1 | 2 |
Standard | Normal | 0 | 10 | 0 | 1 |
Standard | Normal | 10 | 15 | 1 | 2 |
Drug B | High | 0 | 10 | 0 | 1 |
Drug B | High | 10 | 16 | 1 | 2 |
Standard | Moderate | 0 | 10 | 0 | 1 |
Standard | Moderate | 10 | 17 | 1 | 2 |
Standard | Normal | 0 | 10 | 0 | 1 |
Standard | Normal | 10 | 17 | 0 | 2 |
Standard | Normal | 0 | 10 | 0 | 1 |
Standard | Normal | 10 | 19 | 0 | 2 |
Drug B | Normal | 0 | 10 | 0 | 1 |
Drug B | Normal | 10 | 20 | 0 | 2 |
Standard | Moderate | 0 | 10 | 0 | 1 |
Standard | Moderate | 10 | 22 | 1 | 2 |
Drug B | Moderate | 0 | 10 | 0 | 1 |
Drug B | Moderate | 10 | 22 | 1 | 2 |
Drug B | Normal | 0 | 10 | 0 | 1 |
Drug B | Normal | 10 | 23 | 1 | 2 |
Drug B | Moderate | 0 | 10 | 0 | 1 |
Drug B | Moderate | 10 | 23 | 1 | 2 |
Drug B | Normal | 0 | 10 | 0 | 1 |
Drug B | Normal | 10 | 25 | 0 | 2 |
Drug B | Moderate | 0 | 10 | 0 | 1 |
Drug B | Moderate | 10 | 32 | 0 | 2 |
Drug B | Normal | 0 | 10 | 0 | 1 |
Drug B | Normal | 10 | 32 | 0 | 2 |
Drug B | Normal | 0 | 10 | 0 | 1 |
Drug B | Normal | 10 | 34 | 0 | 2 |
Drug B | Normal | 0 | 10 | 0 | 1 |
Drug B | Normal | 10 | 35 | 0 | 2 |
Now fit the model to compare the effect of Drug B before and after 10 weeks.
Is the hazard ratio for the effect of Drug B before 10 weeks significant? What about after 10 weeks?
hazard | 2.5% | 97.5% | |
---|---|---|---|
Moderate WBC | 3.15 | 0.95 | 10.4 |
High WBC | 16.30 | 4.63 | 57.4 |
Drug B < 10 wks | 2.18 | 0.85 | 5.6 |
Drug B >= 10 wks | 0.05 | 0.01 | 0.3 |
See the R solutions for Lecture 10 of the Statistical Modelling with Stata course for more information.
Another useful function for wrangling survival data is tmerge()
(for combining time-dependent variables with baseline data).