Survival analysis III - Implementation in R

To wrap up this introduction to survival analysis, I used an example and R packages to demonstrate the theories in action. The same content can be found in this R markdown file, which you can download and play with. The highlights of this include

Plotting the survival curve from Kaplan-Meier estimator and its similarity to Nelson-Aalen estimator,
Demonstration of how the linear predictors, survival probability and Martingale residual can be computed only using fitted model parameters and KM estimate,
Diagnostic plots for Cox proportional hazard (coxph) models and some tricks to improve the model,
ROC curves for different models.

Setup

library(survival)
library(survminer)

library(splines)
library(lattice)
data("lung", package = "survival")
lung$sex <- factor(lung$sex, levels = 1:2, labels = c("male", "female"))

We use the lung dataset from the survival model, consisting of data from 228 patients.

Non-parametric model

The Surv() function from the survival package create a survival object, which is used in many other functions. In most cases, the first argument the observed survival times, and as second the event indicator. The type of censoring is also specified in this function.

To calculate the Kaplan-Meier (KM) estimate, we use survfit() function and its first argument is an R formula. The left-hand side of this formula is a survival object created by Surv(), and the right-hand side specifies the grouping variable. Here we would like to estimate a single survival curve of the entire population, so we put $1$ as below.

We can call the plot() method on the survfit object to display the estimated curve, where the 95% confidence interval is included by default:

KM_fit <- survfit(Surv(time, status) ~ 1, data = lung)
plot(KM_fit, xlab = "Time", ylab = "Survival Probability", 
    main = "Kaplan-Meier Estimate of S(t) for the lung Data")

The ggsurvplot() function from survminer creates ggplot2 plots from survfit objects.

ggsurvplot(KM_fit, data = lung)

We can also plot the cumulative events or cumulative hazard by specifying fun = "event" or fun = "cumhaz" in the ggsurvplot function.

The Nelson-Aalen estimator is also known as the Breslow and can be estimated by specifying the type argument to be "fleming-harrington". You can see below that the two estimators give nearly identical survival curves.

NA_fit <- survfit(Surv(time, status) ~ 1, data = lung, 
                  type = "fleming-harrington")
plot(KM_fit, conf.int = FALSE, xlab = "Time", ylab = "Survival Probability")
lines(NA_fit, conf.int = FALSE, col = 'red')

Cox proportional hazard model

Model fitting and significance test

coxph() fits a Cox proportional hazard model to the data and the syntax is similar to survfit(). Here, we fit a model using only the age predictor and called summary() to examine the details of the coxph fit. From the output, we can see that the coefficient for age is greater than $0$ and $\exp(\text{coef}) > 1$, meaning that the age contributes to increasing hazard.

cox_fit <- coxph(Surv(time, status) ~ age, data = lung)
summary(cox_fit)

## Call:
## coxph(formula = Surv(time, status) ~ age, data = lung)
## 
##   n= 228, number of events= 165 
## 
##         coef exp(coef) se(coef)     z Pr(>|z|)  
## age 0.018720  1.018897 0.009199 2.035   0.0419 *
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
##     exp(coef) exp(-coef) lower .95 upper .95
## age     1.019     0.9815     1.001     1.037
## 
## Concordance= 0.55  (se = 0.025 )
## Rsquare= 0.018   (max possible= 0.999 )
## Likelihood ratio test= 4.24  on 1 df,   p=0.04
## Wald test            = 4.14  on 1 df,   p=0.04
## Score (logrank) test = 4.15  on 1 df,   p=0.04

To test whether gender is an important factor, we can conduct a log rank test by calling survdiff() function. In this example, the test statistics $\chi^2 = 10.3$ is equivalent to $p = 0.001$, indicating that sex is a significant factor.

survdiff(Surv(time, status) ~ sex, data = lung)

## Call:
## survdiff(formula = Surv(time, status) ~ sex, data = lung)
## 
##              N Observed Expected (O-E)^2/E (O-E)^2/V
## sex=male   138      112     91.6      4.55      10.3
## sex=female  90       53     73.4      5.68      10.3
## 
##  Chisq= 10.3  on 1 degrees of freedom, p= 0.001

We can also use anova to compare two coxph model fitted to the same data. Here, the test results evaluate how significant sex is, given that we already have the age predictor.

cox_fit2 <- coxph(Surv(time, status) ~ sex + age, data = lung)
anova(cox_fit, cox_fit2)

## Analysis of Deviance Table
##  Cox model: response is  Surv(time, status)
##  Model 1: ~ age
##  Model 2: ~ sex + age
##    loglik  Chisq Df P(>|Chi|)   
## 1 -747.79                       
## 2 -742.85 9.8822  1  0.001669 **
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Prediction

To use coxph to make prediction, we can derive the following formula from the definition. In coxph model $S(t|X) = \exp[-\Lambda(t)\exp(X\beta)] = \exp[-\Lambda(t)]^{\exp(X\beta)} = S_{KM}(t)^{\exp(X\beta)}$ where $S_{KM}$ is the KM-estimator. In the following computation, we computed $X\beta$ for the first patient in the lung dataset. The mean (cox_fit2$means) is subtracted from $X$ and the linear combination of mean-adjusted $X$ given $\beta$ is also called linear predictors (lp). As we can see, lp computed manually using the definition is exactly the same in cox_fit2$linear.predictors.

data <- lung[1, c('sex', 'age')]
data$sex = (data$sex == 'female')
lp <- sum(t(data - cox_fit2$means) * cox_fit2$coefficients)
data.frame(lp_manual = lp, 
           lp_model = cox_fit2$linear.predictors[1])

##   lp_manual  lp_model
## 1 0.3995047 0.3995047

To obtain the individual survival curve, we can extract $S_{KM}$ from a KM estimate and the survival curve is calculated using the previous equation.

pred <- KM_fit$surv ^ exp(lp)
plot(KM_fit, conf.int = FALSE, xlab = "Time", ylab = "Survival Probability")
lines(KM_fit$time, pred, type = 's', col = "red")
legend("topright",
       legend = c('baseline', 'first patient'),
       lty = c(1,1),
       col = c('black', 'red'))

The alternative is to use cumulative hazard function, which can be obtained from calling basehaz() function. Here, we use the cumulative hazard function to verify the calculation of martingale residual, which is defined as $r_{Mi} = \delta_i - \Lambda_0(T_i)\exp(X_j\beta)$ where $\delta_i$ is the state ($0$ is alive/censored and $1$ is dead) and $\Lambda_0(T_i)\exp(\hat{\beta}Z_j)$ is the cumulative hazard function at time $T_i$. The $r_{Mi}$ is dependent on time but the one we are most interested in is the maximum survival time. In the following example, we use basehaz() to obtain $\Lambda_0$ and cox_fit2$linear.predictors for $X_j\beta$. Martingale residual is also accessible at cox_fit2$residuals.

# baseline hazard is the cumulative hazard function
cum_haz <- basehaz(cox_fit2)
# For the first patient in lung, he died at day = 306, the index is 109
# Thus the prediction is
pred1 <- cum_haz[109,1]*exp(cox_fit2$linear.predictors[1])
data.frame(residual_manual = 1 - pred1, 
           residual_model = cox_fit2$residuals[1])

##   residual_manual residual_model
## 1     0.004389994    0.004389994

Residual and effect plots

As martingal residual evaluates the goodness of fit, it can be used to examine non-linearity by plotting residuals against one of the predictors. In this example, we suspect that the survival may have a non-linear relationship with age.

lung$martingale <- residuals(cox_fit2, type = "martingale")
# this is equivalent to lung$martingale <- cox_fit2$residuals
xyplot(martingale ~ age | sex, data = lung, main='martingale residual of coxph')

Here, we also show the effect plots, further confirming the linear relationship.

new_data <- with(lung, expand.grid(
    age = seq(45, 75, 5),
    sex = levels(sex)
))
predictions <- predict(cox_fit2, newdata = new_data, type = "lp", se.fit = TRUE)
new_data$predict <- predictions$fit
new_data$se <- predictions$se.fit
new_data$lower <- predictions$fit - 1.96 * predictions$se.fit
new_data$upper <- predictions$fit + 1.96 * predictions$se.fit
head(new_data,3)

##   age  sex      predict         se      lower     upper
## 1  45 male -0.094809928 0.17730643 -0.4423305 0.2527107
## 2  50 male -0.009583269 0.13559528 -0.2753500 0.2561835
## 3  55 male  0.075643391 0.09791848 -0.1162768 0.2675636

xyplot(predict + lower + upper ~ age | sex, data = new_data, 
       type = "l", col = "black", lwd = 2, lty = c(1, 2, 2),
       abline = list(h = 0, lty = 2, lwd = 2, col = "red"),
       xlab = "Age (years)", ylab = "log Hazard Ratio")

Then we introduce non-linearity using a natural cubic spline on age variable.

cox_fit2_ns <- coxph(Surv(time, status) ~ sex + ns(age, 3), data = lung)
cox_fit2_ns

## Call:
## coxph(formula = Surv(time, status) ~ sex + ns(age, 3), data = lung)
## 
##                coef exp(coef) se(coef)      z       p
## sexfemale   -0.5037    0.6043   0.1679 -3.001 0.00269
## ns(age, 3)1  0.1285    1.1371   0.3595  0.357 0.72074
## ns(age, 3)2  1.7182    5.5744   1.1618  1.479 0.13916
## ns(age, 3)3  1.1237    3.0762   0.4796  2.343 0.01914
## 
## Likelihood ratio test=16.05  on 4 df, p=0.002956
## n= 228, number of events= 165

From the model summary of cox_fit2_ns, the non-linear terms of the age predictor are not all significant. We can see that there is not significant improvement on the residual plot.

lung$martingale <- residuals(cox_fit2_ns, type = "martingale")
xyplot(martingale ~ age | sex, data = lung, main='martingale residual of coxph')

Regardless, it is helpful to visualise the nonlinear relationships.

predictions <- predict(cox_fit2_ns, newdata = new_data, type = "lp", se.fit = TRUE)
new_data$predict <- predictions$fit
new_data$se <- predictions$se.fit
new_data$lower <- predictions$fit - 1.96 * predictions$se.fit
new_data$upper <- predictions$fit + 1.96 * predictions$se.fit

xyplot(predict + lower + upper ~ age | sex, data = new_data, 
       type = "l", col = "black", lwd = 2, lty = c(1, 2, 2),
       abline = list(h = 0, lty = 2, lwd = 2, col = "red"),
       xlab = "Age (years)", ylab = "log Hazard Ratio")

Testing proportionality assumption

Schonfeld residual is often used to test the proportionality assumption and we can use ggcoxzph() from survminer package to produce the diagnostic plot. As there is slight non-linear trend in the diagnostic plot with respect to sex, we can hypothesise that the proportionality may not hold for sex.

ftest <- cox.zph(cox_fit2)
ggcoxzph(ftest)

To address this potential violation of proportionality assumption, one way is to use stratification. If we startify using sex, we are essentially fitting two models regarding different sex.

cox_fit_strat <- coxph(Surv(time, status) ~ strata(sex)*age, data = lung)
predictions <- predict(cox_fit_strat, newdata = new_data, type = "lp", se.fit = TRUE)
new_data$predict <- predictions$fit
new_data$se <- predictions$se.fit
new_data$lower <- predictions$fit - 1.96 * predictions$se.fit
new_data$upper <- predictions$fit + 1.96 * predictions$se.fit
ggplot(new_data, aes(x = age, y = predict, col = sex)) +
    geom_line() + xlab("Age (years)") + ylab("log Hazard Ratio")

ROC and AUC

So far, we have looked at several ways to diagnose the fitted coxph model and tricks to improve the model by adding complexity, such as non-linearity and stratification. One missing piece is to have a single number that evaluates the goodness-of-fit. As I pointed out last time, survival analysis is equivalent to a classification problem at a fixed time point. Therefore, a useful metric is AUC for different models. Here, I used survivalROC package as a demonstration and many advanced AUC measures are implemented in survAUC.

library(survivalROC)

par(mfrow=c(2,2))
for (i in seq(1,4)){
    t = 200 * i
    roc <- survivalROC(Stime = lung$time,
                       status = lung$status,
                       marker = cox_fit2$linear.predictors,
                       predict.time = t,
                       method = "NNE",
                       span = 0.25 * nrow(lung)^(-0.20))
    fig_title <- sprintf("AUC at time %d: %.3f",t, roc$AUC)
    plot(roc$FP, roc$TP, xlab = "FP", ylab = "TP", main =  fig_title)
}

We then plot the ROC for cox_fit2_ns model to examine the difference.

par(mfrow=c(2,2))
for (i in seq(1,4)){
    t = 200 * i
    roc <- survivalROC(Stime = lung$time,
                       status = lung$status,
                       marker = cox_fit2_ns$linear.predictors,
                       predict.time = t,
                       method = "NNE",
                       span = 0.25 * nrow(lung)^(-0.20))
    fig_title <- sprintf("AUC at time %d: %.3f",t, roc$AUC)
    plot(roc$FP, roc$TP, xlab = "FP", ylab = "TP", main =  fig_title)
}

Comparing these two models, we can see that adding non-linearity improves the AUC for long-term prediction (see t = 600, 800) while the performance is worse at short-term (t = 200, 400).

Summary and future work

This blog concludes the series in Survival Analysis, putting knowledge into practice. Although I have only scratched the surface of this field, I find it satisfying and helpful to understand the fundamentals. I will jump on other learnings and I sure hope to come back and dig deeper in the near future! If there is any advice or question, please let me know :)

References

Besides the documentation for various packages, the R code for Prof Dimitris Rizopoulos’s course on Survival Anlaysis helps me greatly in creating this blog.