Censoring. Survival curves and life tables. Comparing survival curves. Parametric regression. Cox regression.
This worksheet is based on the tenth lecture of the Statistical Modelling in Stata course, created by Dr Mark Lunt and offered by the Centre for Epidemiology Versus Arthritis at the University of Manchester. The original Stata exercises and solutions are here translated into their R equivalents.
Refer to the original slides and Stata worksheets here.
This section uses the dataset leukaemia.csv
.
leukaemia <- read.csv('leukaemia.csv')
First, load the survival package.
The time variable is weeks
, the number of weeks to relapse.
The outcome variable is relapse
, which is 1 if the subject had a relapse at that time and 0 if they did not.
Create a corresponding survival object using the syntax Surv(time, status)
.
This will be on the left hand side of your model formula.
First reorder the levels of wbc3cat
so that ‘Normal’ is the reference group (for ease of comparison with the Stata solutions sheet).
Similarly we can relevel treatment1
so that ‘Standard’ is the first level (as otherwise ‘Drug A’ comes first, alphabetically).
At 23 weeks, the survival function drops below 0.5.
table(leukaemia$relapse, leukaemia$treatment1)
12 subjects were lost to followup on Drug A.
The survival function drops below 0.5 after 8 weeks.
Zero (see table
output above).
Median survival before relapse is better on Drug A (23 weeks) than standard treatment (8 weeks).
There are some basic plotting functions in the survival package.
More tools for visualisation via ggplot2 are provided in the survminer package, for which a helpful cheat sheet is available. There’s less chance of mixing up the strata, as a legend is generated automatically by default:
library(survminer)
ggsurvplot(surv_t1, surv.median.line = 'hv')
In base R graphics you can use abline
and in ggplot2 there is geom_abline
or geom_hline
.
The ggsurvplot
function has a dedicated argument called surv.median.line
for this purpose.
See above.
This is slightly contrived because R doesn’t have a direct equivalent to the lost
command.
So here we’ll show what’s offered in survival and survminer.
See the survminer vignette for more tips.
ggsurvplot(surv_t1,
surv.median.line = 'hv',
censor = TRUE,
cumcensor = TRUE,
ncensor.plot = TRUE)
In base R, use the generic plotting functions text()
or points()
and extract the coordinates and values from the survfit
object.
# Original plot:
plot(surv_t1, xlab = 'weeks', ylab = 'survival',
col = c(2, 4), main = 'Kaplan-Meier survival estimates')
legend('topright', names(surv_t1$strata),
col = c(2, 4), lty = 1, bty = 'n')
abline(h = 0.5, lty = 2)
# Add censor labels:
text(surv_t1$time, surv_t1$surv,
labels = surv_t1$n.censor,
# make label transparent if n.censor == 0
col = rgb(0, 0, 0, 0:1)[1 + (surv_t1$n.censor > 0)])
Look in the documentation, depending on the plotting method you are using.
Both base plot.survfit
and survminer::ggsurv
offer arguments conf.int = TRUE
:
Why do the confidence bands get wider over time? Because they are based on smaller numbers.
survdiff
to perform a logrank test to compare the survival on Drug A to that on standard treatment. Is the difference between Drug A and standard treatment statistically significant?You can actually just do this test and add it to your survminer plot with the arguments pval
and pval.method
both set to TRUE
:
ggsurvplot(surv_t1,
conf.int = TRUE,
censor = TRUE,
pval = TRUE, pval.method = TRUE)
Or do it the old-fashioned way using survdiff()
:
The difference is statistically significant: there are fewer relapses on Drug A than expected.
surv_wbc <- survfit(Surv(weeks, relapse) ~ wbc3cat, data = leukaemia)
ggsurvplot(surv_wbc, conf.int = TRUE)
treatment1
against wbc3cat
. Are the proportions of subjects in each of the white blood cell counts categories the same in the two treatment arms?The standard treatment appears to have a greater proportion of people with high white blood cell counts, compared to those on Drug A.
Tip. The %T>%
pipe from magrittr lets us print and pass the result of xtabs
to the next function, without having to save the table to a temporary variable.
Decreased, because leukaemia is a disease that causes high numbers of abnormal blood cells.
coxph()
.) What is the hazard ratio for Drug A, and its 95% confidence interval?The hazard ratio and confidence interval are given by
Fit the adjusted model and extract the hazard ratio and its 95% confidence interval:
The beneficial effect of Drug A was exaggerated by the difference in white blood cell counts between the groups. The hazard ratio is larger (i.e. chance of relapse increased) after adjusting for it.
To check the proportional hazards assumption, we can use cox.zph()
from the survival package or ggcoxzph()
from the survminer package.
Check the documentation also for information on ggcoxdiagnostics()
and ggadjustedcurves()
.
For example:
ggcoxdiagnostics(cox_unadj, 'schoenfeld', ox.scale = 'time')
We can also plot the observed versus expected survival curves, using survfit
.
dummy <- list(treatment1 = levels(leukaemia$treatment1))
cox_surv <- survfit(cox_unadj, newdata = dummy)
plot(surv_t1, col = c(2, 4), lty = 1,
xlab = 'weeks', ylab = 'survival')
lines(cox_surv, col = c(2, 4), lty = 2)
legend('topright',
c('observed', 'expected', levels(leukaemia$treatment1)),
lty = c(1, 2, 4, 4), col = c(1, 1, 2, 4))
(If there is a ggplot2 or survminer approach, then it is left as an exercise.)
dummy <- list(wbc3cat = levels(leukaemia$wbc3cat))
cox_surv <- survfit(cox_wbc, newdata = dummy)
plot(surv_wbc, col = c(2:4), lty = 1,
xlab = 'weeks', ylab = 'survival')
lines(cox_surv, col = c(2:4), lty = 2)
legend('topright',
c('observed', 'expected', levels(leukaemia$wbc3cat)),
lty = c(1, 2, 4, 4, 4), col = c(1, 1, 2:4))
Note. You can regenerate both Schoenfeld residual plots at once with
To perform the test, simply print
cox.zph(cox_adj)
None of the test statistics is significant, thereby implying that the proportional hazards assumption holds for each of the variables.
The cox.zph()
function did both sets of tests at once, so the answer is already given above.
treatment2
. Compare the survival curves for Drug B and standard treatment. How does the survival on Drug B compare to that on standard treatment during the first 10 weeks?ggsurvplot(surv_t2)
In the first 10 weeks, Drug B has worse survival than the standard treatment.
After 10 weeks, survival on Drug B is better than the standard treatment.
cox_t2 <- coxph(Surv(weeks, relapse) ~ treatment2, data = leukaemia)
dummy <- list(treatment2 = levels(leukaemia$treatment2))
cox_surv2 <- survfit(cox_t2, newdata = dummy)
plot(surv_t2, col = c(2, 4))
lines(cox_surv2, col = c(2, 4), lty = 4, lwd = 2)
legend('topright', lty = 1, col = c(2, 4), names(surv_t2$strata))
legend('bottomleft', lty = c(1, 4), lwd = 1:2, c('observed', 'predicted'))
The observed survival curves intersect each other but the predicted curves do not (obviously, because they are based on a proportional hazards assumption).
treatment2
and wbc3cat
. Does Drug B have a significant effect on survival?The effect of Drug B on survival is not statistically significant (the hazard ratio’s 95% confidence interval contains 1).
cox.zph(cox_adj2)
The proportional hazards assumption does not hold overall; specifically not for treatment 2.
If you prefer visual tests:
summary(surv_t2)
There is a good tutorial here explaining the approach of using step functions for time-varying coefficients.
We can use the function survSplit()
for this.
No.
tstart == 0
and one with tstart == 10
.Look at leuk2
(or whatever you decided to call the output of survSplit
).
This has already been done for us with the episode
argument of survSplit
.
We have a variable called time_group
.
This involves an interaction between the time_group
strata and treatment2
.
(Explained here.)
t2
?See above.
Yes, that Drug B increased risk initially, then reduced it.
The model is now appropriate and none of the predictors appear to show non-proportionality.