DeepSurv Using A Cox Proportional Hazards DeepNets 1652051740
Abstract
Background: Medical practitioners use survival models to explore and understand the relationships between
patients’ covariates (e.g. clinical and genetic features) and the effectiveness of various treatment options. Standard
survival models like the linear Cox proportional hazards model require extensive feature engineering or prior medical
knowledge to model treatment interaction at an individual level. While nonlinear survival methods, such as neural
networks and survival forests, can inherently model these high-level interaction terms, they have yet to be shown as
effective treatment recommender systems.
Methods: We introduce DeepSurv, a Cox proportional hazards deep neural network and state-of-the-art survival
method for modeling interactions between a patient’s covariates and treatment effectiveness in order to provide
personalized treatment recommendations.
Results: We perform a number of experiments training DeepSurv on simulated and real survival data. We
demonstrate that DeepSurv performs as well as or better than other state-of-the-art survival models and validate that
DeepSurv successfully models increasingly complex relationships between a patient’s covariates and their risk of
failure. We then show how DeepSurv models the relationship between a patient’s features and effectiveness of
different treatment options to show how DeepSurv can be used to provide individual treatment recommendations.
Finally, we train DeepSurv on real clinical studies to demonstrate how its personalized treatment recommendations
would increase the survival time of a set of patients.
Conclusions: The predictive and modeling capabilities of DeepSurv will enable medical researchers to use deep
neural networks as a tool in their exploration, understanding, and prediction of the effects of a patient’s characteristics
on their risk of failure.
Keywords: Deep learning, Survival analysis, Treatment recommendations
© The Author(s). 2018 Open Access This article is distributed under the terms of the Creative Commons Attribution 4.0
International License (http://creativecommons.org/licenses/by/4.0/), which permits unrestricted use, distribution, and
reproduction in any medium, provided you give appropriate credit to the original author(s) and the source, provide a link to the
Creative Commons license, and indicate if changes were made. The Creative Commons Public Domain Dedication waiver
(http://creativecommons.org/publicdomain/zero/1.0/) applies to the data made available in this article, unless otherwise stated.
Katzman et al. BMC Medical Research Methodology (2018) 18:24 Page 2 of 12
To model nonlinear survival data, researchers have applied three main types of neural networks to the problem of survival analysis. These include variants of: (i) classification methods (see details in [6, 7]), (ii) time-encoded methods (see details in [8, 9]), and (iii) risk-predicting methods (see details in [10]). This third type is a feed-forward neural network (NN) that estimates an individual’s risk of failure. In fact, Faraggi-Simon’s network is seen as a nonlinear extension of the Cox proportional hazards model.
Risk neural networks learn highly complex and nonlinear relationships between prognostic features and an individual’s risk of failure. In application, for example, when the success of a treatment option is affected by an individual’s features, the NN learns the relationship without prior feature selection or domain expertise. The network is then able to provide a personalized recommendation based on the computed risk of a treatment.
However, previous studies have demonstrated mixed results on NNs’ ability to predict risk. For instance, researchers have attempted to apply the Faraggi-Simon network with various extensions, but they have failed to demonstrate improvements beyond the linear Cox model, see [11–13]. One possible explanation is that the practice of NNs was not as developed as it is today. To the best of our knowledge, NNs have not outperformed standard methods for survival analysis (e.g. CPH). Our manuscript shows that this is no longer the case; with modern techniques, risk NNs have state-of-the-art performance and can be used for a variety of medical applications.
The goals of this paper are: (i) to show that the application of deep learning to survival analysis performs as well as or better than other survival methods in predicting risk; and (ii) to demonstrate that the deep neural network can be used as a personalized treatment recommender system and a useful framework for further medical research.
We propose a modern Cox proportional hazards deep neural network, henceforth referred to as DeepSurv, as the basis for a treatment recommender system. We make the following contributions. First, we show that DeepSurv performs as well as or better than other survival analysis methods on survival data with both linear and nonlinear effects from covariates. Second, we include an additional categorical variable representing a patient’s treatment group to illustrate how the network can learn complex relationships between an individual’s covariates and the effect of a treatment. Our experiments validate that the network successfully models the treatment’s risk within a population. Third, we use DeepSurv to provide treatment recommendations tailored to a patient’s observed features. We confirm our results on real clinical studies, which further demonstrates the power of DeepSurv. Finally, we show that the recommender system supports medical practitioners in providing personalized treatment recommendations that potentially could increase the median survival time for a set of patients.
The organization of the manuscript is as follows: in “Background” section, we provide a brief background on survival analysis. In “Methods” section, we present our contributions, including an explanation of our implementation of DeepSurv and our proposed recommender system. In “Results” section, we describe the experimental design and results. “Conclusion” and “Discussion” sections conclude the manuscript.
In this section, we define survival data and the approaches for modeling a population’s survival and failure rate. Additionally, we discuss linear and nonlinear survival models and their limitations.
Survival data
Survival data comprises three elements: a patient’s baseline data x, a failure event time T, and an event indicator E. If an event (e.g. death) is observed, the time interval T corresponds to the time elapsed between the time at which the baseline data was collected and the time of the event occurring, and the event indicator is E = 1. If an event is not observed, the time interval T corresponds to the time elapsed between the collection of the baseline data and the last contact with the patient (e.g. end of study), and the event indicator is E = 0. In this case, the patient is said to be right-censored. If one opts to use standard regression methods, the right-censored data is considered to be a type of missing data. Such data is typically discarded, which can introduce a bias in the model. Therefore, modeling right-censored data requires special consideration or the use of a survival model.
Survival and hazard functions are the two fundamental functions in survival analysis. The survival function is denoted by S(t) = Pr(T > t), which signifies the probability that an individual has ‘survived’ beyond time t. The hazard function λ(t) is defined as:
$$\lambda(t) = \lim_{\delta \to 0} \frac{\Pr(t \le T < t + \delta \mid T \ge t)}{\delta}. \tag{1}$$
The hazard function is the probability an individual will not survive an extra infinitesimal amount of time δ, given they have already survived up to time t. Thus, a greater hazard signifies a greater risk of death.
Linear survival models
The Cox proportional hazards model is a common method for modeling an individual’s survival given their baseline data x. In accordance with the standard R survival package coxph, we use notation from [14] to describe the Cox model. The model assumes that the hazard function is composed of two non-negative functions: a baseline
hazard function, λ0(t), and a risk score, r(x) = e^{h(x)}, defined as the effect of an individual’s observed covariates on the baseline hazard [14]. We denote h(x) as the log-risk function. The hazard function is assumed to have the form
$$\lambda(t \mid x) = \lambda_0(t) \cdot e^{h(x)}. \tag{2}$$
The CPH is a proportional hazards model that estimates the log-risk function, h(x), by a linear function ĥβ(x) = β^T x [or equivalently r̂β(x) = e^{β^T x}]. To perform Cox
Another popular machine learning approach to modeling patients’ hazard function is the random survival forest (RSF) [15, 16]. The random survival forest is a tree method that produces an ensemble estimate for the cumulative hazard function.
A more recent deep learning approach models the event time according to a Weibull distribution with parameters given by latent variables generated by a deep exponential family [17].
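To make the proportional hazards form in Eq. 2 concrete, the following Python sketch evaluates λ(t|x) = λ0(t) · e^{h(x)} with a linear log-risk h(x) = β^T x. The weights `beta`, the two patient vectors, and the baseline hazard values are illustrative assumptions, not values from the paper. The sketch shows that the hazard ratio between two patients does not depend on the baseline hazard.

```python
import math

def linear_log_risk(beta, x):
    """Linear log-risk h(x) = beta^T x, as assumed by the linear CPH."""
    return sum(b * xi for b, xi in zip(beta, x))

def cox_hazard(baseline_hazard, beta, x):
    """Hazard of Eq. 2: lambda(t|x) = lambda_0(t) * exp(h(x))."""
    return baseline_hazard * math.exp(linear_log_risk(beta, x))

# Hypothetical weights and patients, chosen only for illustration.
beta = [1.0, 2.0]
patient_a = [0.5, -0.2]
patient_b = [-0.3, 0.4]

def hazard_ratio(baseline_hazard):
    """Ratio of the two patients' hazards at a given baseline hazard value."""
    return (cox_hazard(baseline_hazard, beta, patient_a)
            / cox_hazard(baseline_hazard, beta, patient_b))
```

Because λ0(t) cancels in the ratio, `hazard_ratio` returns the same value for any positive baseline hazard, namely e^{h(x_a) − h(x_b)}.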
Evaluation
Survival data
To evaluate the models’ predictive accuracy on the survival data, we measure the concordance-index (C-index) c as outlined by [24]. The C-index is the most common metric used in survival analysis and reflects a measure of how well a model predicts the ordering of patients’ death times. For context, a c = 0.5 is the average C-index of a random model, whereas c = 1 is a perfect ranking of death times. We perform bootstrapping [25] and sample the test set with replacement to obtain confidence intervals.
Treatment recommendations
We determine the recommended treatment for each patient in the test set using DeepSurv and the RSF. We do not calculate the recommended treatment for CPH; without preselected treatment-interaction terms, the CPH model will compute a constant recommender function and recommend the same treatment option for all patients. This would effectively be comparing the survival rates between the control and experimental groups. DeepSurv and the RSF are capable of predicting an individual’s hazard per treatment because each computes relevant interaction terms. For DeepSurv, we choose the recommended treatment by calculating the recommender function (Eq. 11). Because the RSF predicts a cumulative hazard for each patient, we choose the treatment with the minimum cumulative hazard.
Once we determine the recommended treatment, we identify two subsets of patients: those whose treatment group aligns with the model’s recommended treatment (Recommendation) and those who do not undergo the recommended treatment (Anti-Recommendation). We calculate the median survival time of each subset to determine if a model’s treatment recommendations increase the survival rate of the patients. We then perform a log-rank test to validate whether the difference between the two subsets is significant.
Simulated survival data
In this section, we perform two experiments with simulated survival data: one with a linear log-risk function and one with a nonlinear (Gaussian) log-risk function. The advantage of using simulated datasets is that we can ascertain whether DeepSurv can successfully model the true log-risk function instead of overfitting random noise.
For each experiment, we generate a training, validation, and testing set of N = 4000, 1000, 1000 observations respectively. Each observation x represents a patient vector with d = 10 covariates. The ten variables are each drawn from a uniform distribution on [−1, 1). We then generate a patient’s death time T as a function of their covariates by using the exponential Cox model [26]:
$$T \sim \mathrm{Exp}(\lambda(t; x)) = \mathrm{Exp}\left(\lambda_0 \cdot e^{h(x)}\right). \tag{7}$$
In both experiments, the log-risk function h(x) only depends on two of the ten covariates. This allows us to verify that DeepSurv discerns the relevant covariates from the noise. Next, we choose a censoring time to represent the ‘end of study’ such that 50 percent of the patients have an observed event, E = 1, in the dataset. Further details of the simulated data generation are found in Appendix C.
Linear experiment
We first simulate patients to have a linear log-risk function for x ∈ R^d so that the linear proportional hazards assumption holds true:
$$h(x) = x_0 + 2x_1. \tag{8}$$
Because the linear proportional hazards assumption holds true, we expect the linear CPH to accurately model the log-risk function in Eq. 8.
Our results (see Table 1) demonstrate that DeepSurv performs as well as the standard linear Cox regression and better than RSF in predictive ability.
Figure 2 demonstrates how DeepSurv more accurately models the log-risk function compared to the linear CPH. Figure 2a plots the true log-risk function h(x) for all patients in the test set. As shown in Fig. 2b, the CPH’s estimated log-risk function ĥβ(x) does not perfectly model the true log-risk for a patient. In contrast, as shown in Fig. 2c, DeepSurv better estimates the true log-risk function.
To quantify these differences, Fig. 2d and e show that the CPH’s estimated log-risk has a significantly larger absolute error than that of DeepSurv, specifically for patients with a high positive log-risk. We calculate the mean-squared-error (MSE) between a model’s predicted log-risk and the true log-risk values. The MSEs of CPH and DeepSurv are 20.528057878872541 and 0.19268315, respectively. Even though DeepSurv and CPH have similar predictive abilities, this demonstrates that DeepSurv is superior to the CPH at modeling the true risk function of the population.
Nonlinear experiment
We set the log-risk function to be a Gaussian with λmax = 5.0 and a scale factor of r = 0.5:
$$h(x) = \log(\lambda_{\max}) \exp\left(-\frac{x_0^2 + x_1^2}{2r^2}\right). \tag{9}$$
The surface of the log-risk function is depicted in Fig. 3a. Because this log-risk function is nonlinear, we do not expect the CPH to predict the log-risk function properly without adding quadratic terms of the covariates to the model. We expect DeepSurv to reconstruct the Gaussian log-risk function and successfully predict a patient’s risk.
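The simulated data generation described above can be sketched in plain Python. This is a minimal illustration, not the authors’ implementation: the function names, the seed, and the baseline hazard of 0.2 (corresponding to the mean of 5 mentioned in Appendix C) are assumptions. Covariates are drawn uniformly on [−1, 1), death times follow Eq. 7, and the end-of-study time is set so that a chosen fraction of patients has an observed event.

```python
import math
import random

def linear_log_risk(x):
    """Eq. 8: h(x) = x0 + 2 * x1."""
    return x[0] + 2.0 * x[1]

def gaussian_log_risk(x, lambda_max=5.0, r=0.5):
    """Eq. 9: h(x) = log(lambda_max) * exp(-(x0^2 + x1^2) / (2 r^2))."""
    return math.log(lambda_max) * math.exp(-(x[0] ** 2 + x[1] ** 2) / (2.0 * r ** 2))

def simulate(n, log_risk, d=10, baseline_hazard=0.2, event_fraction=0.5, seed=0):
    """Return covariates, observed times, and event indicators for n patients."""
    rng = random.Random(seed)
    # Each covariate is uniform on [-1, 1); only x0 and x1 affect the risk.
    xs = [[2.0 * rng.random() - 1.0 for _ in range(d)] for _ in range(n)]
    # Eq. 7: T ~ Exp(rate = lambda_0 * e^{h(x)}).
    death_times = [
        rng.expovariate(baseline_hazard * math.exp(log_risk(x))) for x in xs
    ]
    # Pick the end-of-study time so that event_fraction of deaths are observed.
    n_events = max(1, int(event_fraction * n))
    end_time = sorted(death_times)[n_events - 1]
    observed = [min(t, end_time) for t in death_times]
    events = [1 if t <= end_time else 0 for t in death_times]
    return xs, observed, events
```

Calling `simulate(4000, gaussian_log_risk)` would give a training set of the size used in the nonlinear experiment; the other eight covariates act purely as noise.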
Table 1 Experimental results for all experiments, reported as C-index (95% confidence interval)
Experiment             CPH                        DeepSurv                   RSF
Simulated Linear       0.779239 (0.777, 0.781)    0.778065 (0.776, 0.780)    0.757863 (0.756, 0.760)
Simulated Nonlinear    0.486728 (0.484, 0.489)    0.652434 (0.650, 0.655)    0.626552 (0.624, 0.629)
WHAS                   0.816025 (0.813, 0.819)    0.866723 (0.863, 0.870)    0.892884 (0.890, 0.895)
SUPPORT                0.583076 (0.581, 0.585)    0.618907 (0.617, 0.621)    0.619302 (0.618, 0.621)
METABRIC               0.631674 (0.627, 0.636)    0.654452 (0.650, 0.659)    0.619517 (0.615, 0.624)
Simulated Treatment    0.516620 (0.514, 0.519)    0.575400 (0.573, 0.578)    0.550298 (0.548, 0.553)
Rotterdam & GBSG       0.658773 (0.655, 0.662)    0.676349 (0.673, 0.679)    0.647924 (0.644, 0.651)
The bold faced numbers signify the best performing algorithm
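The C-index values and bootstrap confidence intervals of the kind reported in Table 1 can be sketched as follows. This assumes Harrell’s pairwise definition of the C-index (a pair is comparable when the patient with the shorter time has an observed event) and a simple percentile bootstrap; the function names and the number of resamples are illustrative and not taken from the paper.

```python
import random

def concordance_index(times, events, risks):
    """Harrell's C-index: fraction of comparable pairs ranked correctly."""
    concordant = 0.0
    comparable = 0
    n = len(times)
    for i in range(n):
        if not events[i]:
            continue  # a pair is anchored on a patient with an observed event
        for j in range(n):
            if times[i] < times[j]:
                comparable += 1
                if risks[i] > risks[j]:
                    concordant += 1.0   # earlier death has the higher risk
                elif risks[i] == risks[j]:
                    concordant += 0.5   # tied predictions count as half
    if comparable == 0:
        raise ValueError("no comparable pairs")
    return concordant / comparable

def bootstrap_ci(times, events, risks, n_boot=200, alpha=0.05, seed=0):
    """Percentile confidence interval from resampling with replacement."""
    rng = random.Random(seed)
    n = len(times)
    scores = []
    for _ in range(n_boot):
        idx = [rng.randrange(n) for _ in range(n)]
        try:
            scores.append(concordance_index(
                [times[k] for k in idx],
                [events[k] for k in idx],
                [risks[k] for k in idx],
            ))
        except ValueError:
            continue  # skip degenerate resamples without comparable pairs
    scores.sort()
    lower = scores[int((alpha / 2) * len(scores))]
    upper = scores[int((1 - alpha / 2) * len(scores)) - 1]
    return lower, upper
```

The double loop is quadratic in the number of patients, which is acceptable for a sketch; a production implementation would normally use a library routine.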
Lastly, we expect the RSF and DeepSurv to accurately rank the order of patients’ deaths.
The C-index results in Table 1 show that DeepSurv outperforms the linear CPH and predicts as well as the RSF. In addition, DeepSurv correctly learns nonlinear relationships between a patient’s covariates and their log-risk. As shown in Fig. 3, DeepSurv is more successful than the linear CPH in modeling the true log-risk function. Figure 3b demonstrates that the linear CPH regression fails to determine the first two covariates as significant. The CPH has a C-index of 0.486728, which is equivalent to the performance of randomly ranking death times. Meanwhile, Fig. 3c demonstrates that DeepSurv reconstructs the Gaussian relationship between the first two covariates and a patient’s log-risk.
Real survival data experiments
We compare the performance of the CPH and DeepSurv on three datasets from real studies: the Worcester Heart Attack Study (WHAS), the Study to Understand Prognoses Preferences Outcomes and Risks of Treatment (SUPPORT), and The Molecular Taxonomy of Breast Cancer International Consortium (METABRIC). Because previous research shows that neural networks do not outperform the CPH, our goal is to demonstrate that DeepSurv does indeed have state-of-the-art predictive ability in practice on real survival datasets.
Worcester Heart Attack Study (WHAS)
The Worcester Heart Attack Study (WHAS) investigates the effects of a patient’s factors on acute myocardial infarction (MI) survival [27]. The dataset consists of 1638 observations and 5 features: age, sex, body-mass-index (BMI), left heart failure complications (CHF), and order of MI (MIORD). We reserve 20 percent of the dataset as a testing set. A total of 42.12 percent of patients died during the survey with a median death time of 516.0 days.
Fig. 2 Simulated Linear Experimental Log-Risk Surfaces. Predicted log-risk surfaces and errors for the simulated survival data with linear log-risk
function with respect to a patient’s covariates x0 and x1 . a The true log-risk h(x) = x0 + 2x1 for each patient. b The predicted log-risk surface of
ĥβ (x) from the linear CPH model parameterized by β. c The output of DeepSurv ĥθ (x) predicts a patient’s log-risk. d The absolute error between
true log-risk h(x) and CPH’s predicted log-risk ĥβ (x). e The absolute error between true log-risk h(x) and DeepSurv’s predicted log-risk ĥθ (x)
Fig. 3 Simulated Nonlinear Experimental Log-Risk Surfaces. Log-risk surfaces of the nonlinear test set with respect to patient’s covariates x0 and x1 .
a The calculated true log-risk h(x) (Eq. 9) for each patient. b The predicted log-risk surface of ĥβ (x) from the linear CPH model parameterized on β.
The linear CPH predicts a constant log-risk. c The output of DeepSurv ĥθ (x) is the estimated log-risk function
As shown in Table 1, DeepSurv outperforms the CPH; however, the RSF outperforms DeepSurv.
Study to Understand Prognoses Preferences Outcomes and Risks of Treatment (SUPPORT)
The Study to Understand Prognoses Preferences Outcomes and Risks of Treatment (SUPPORT) is a larger study that researches the survival time of seriously ill hospitalized adults [28]. The dataset consists of 9,105 patients and 14 features for which almost all patients have observed entries (age, sex, race, number of comorbidities, presence of diabetes, presence of dementia, presence of cancer, mean arterial blood pressure, heart rate, respiration rate, temperature, white blood cell count, serum’s sodium, and serum’s creatinine). We drop patients with any missing features and reserve 20 percent of the dataset as a testing set. A total of 68.10 percent of patients died during the survey with a median death time of 58 days.
As shown in Table 1, DeepSurv performs as well as the RSF and better than the CPH with a larger study. This validates DeepSurv’s ability to predict the ranking of patients’ risks on real survival data.
Molecular Taxonomy of Breast Cancer International Consortium (METABRIC)
The Molecular Taxonomy of Breast Cancer International Consortium (METABRIC) uses gene and protein expression profiles to determine new breast cancer subgroups in order to help physicians provide better treatment recommendations.
The METABRIC dataset consists of gene expression data and clinical features for 1,980 patients, and 57.72 percent have an observed death due to breast cancer with a median survival time of 116 months [29]. We prepare the dataset in line with the Immunohistochemical 4 plus Clinical (IHC4+C) test, which is a common prognostic tool for evaluating treatment options for breast cancer patients [30]. We join the 4 gene indicators (MKI67, EGFR, PGR, and ERBB2) with a patient’s clinical features (hormone treatment indicator, radiotherapy indicator, chemotherapy indicator, ER-positive indicator, age at diagnosis). We then reserve 20 percent of the patients as the test set.
Table 1 shows that DeepSurv performs better than both the CPH and RSF. This result not only demonstrates DeepSurv’s ability to model the risk effects of gene expression data but also shows the potential for future research of DeepSurv as a comparable prognostic tool to common medical tests such as the IHC4+C.
Treatment recommender system experiments
In this section, we perform two experiments to demonstrate the effectiveness of DeepSurv’s treatment recommender system. First, we simulate treatment data by including an additional covariate to the simulated data from “Nonlinear experiment” section. Second, after demonstrating DeepSurv’s modeling and recommendation capabilities, we apply the recommender system to a real dataset used to study the effects of hormone treatment on breast cancer patients. We show that DeepSurv can successfully provide personalized treatment recommendations. We conclude that if all patients follow the network’s recommended treatment options, we would gain a significant increase in patients’ lifespans.
Simulated treatment data
We uniformly assign a treatment group τ ∈ {0, 1} to each simulated patient in the dataset. All of the patients in group τ = 0 were ‘unaffected’ by the treatment (e.g. given a placebo) and have a constant log-risk function h0(x). The other group τ = 1 is prescribed a treatment with Gaussian effects (Eq. 9) and has a log-risk function h1(x) with λmax = 10 and r = 0.5.
Figure 4 illustrates the network’s success in modeling both treatments’ log-risk functions for patients. Figure 4a plots the true log-risk distribution h(x). As expected, Fig. 4b shows that the network models a constant log-risk for a patient in treatment τ = 0, independent of a patient’s covariates. Figure 4c shows how DeepSurv models the Gaussian effects of a patient’s covariates on
Fig. 4 Simulated Treatment Log-Risk Surface. Treatment Log-Risk Surfaces as a function of a patient’s relevant covariates x0 and x1 . a The true
log-risk h1 (x) if all patients in the test set were given treatment τ = 1. We then manually set all treatment groups to either τ = 0 or τ = 1. b The
predicted log-risk ĥ0 (x) for patients with treatment group τ = 0. c The network’s predicted log-risk ĥ1 (x) for patients in treatment group τ = 1
their treatment log-risk. To further quantify these results, Table 1 shows that DeepSurv has the largest concordance index. Because the network accurately reconstructs the risk function, we expect that it will provide accurate treatment recommendations for new patients.
In Fig. 5, we plot the Kaplan-Meier survival curves for both the Recommendation and Anti-Recommendation subset for each method. Figure 5a shows that the survival curve for the Recommendation subset is shifted to the right, which signifies an increase in survival time for the population following DeepSurv’s recommendations. This is further quantified by the median survival times summarized in Table 2. The p-value of DeepSurv’s recommendations is less than 0.000090, and we can reject the null hypothesis that DeepSurv’s recommendations would not affect the population’s survival time. As shown in Table 2, the subset of patients that follow RSF’s recommendations have a shorter survival time than those who do not follow RSF’s recommended treatment. Therefore, we could take the RSF’s recommendations and provide the patients with the opposite treatment option to increase median survival time; however, Fig. ?? shows that this improvement would not be statistically valid. While both DeepSurv and RSF are able to compute treatment interaction terms, DeepSurv is more successful in recommending personalized treatments.
Rotterdam & German Breast Cancer Study Group (GBSG)
We first train DeepSurv on breast cancer data from the Rotterdam tumor bank [31] and construct a recommender system to provide treatment recommendations to patients from a study by the German Breast Cancer Study Group (GBSG) [32]. The Rotterdam tumor bank dataset contains records for 1546 patients with node-positive breast cancer, and nearly 90 percent of the patients have an observed death time. The testing data from the GBSG contains complete data for 686 patients (56 percent are censored) in a randomized clinical trial that studied the effects of chemotherapy and hormone treatment on survival rate. We preprocess the data as outlined by [33].
Fig. 5 Simulated Treatment Survival Curves. Kaplan-Meier estimated survival curves with confidence intervals (α = .05) for the patients who were
given the treatment concordant with a method’s recommended treatment (Recommendation) and the subset of patients who were not
(Anti-Recommendation). We perform a log-rank test to validate the significance between each set of survival curves. a Effect of DeepSurv’s
Treatment Recommendations (Simulated Data), b Effect of RSF’s Treatment Recommendations (Simulated Data)
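The split into Recommendation and Anti-Recommendation subsets, and the median survival time of each, can be sketched as below. This is an illustration under stated assumptions: `toy_log_risk` is a hypothetical stand-in for a trained model’s predicted log-risk per treatment, the patient tuples are made up, and the median is read off a hand-written Kaplan-Meier estimate. For two treatments, picking the lower predicted log-risk is equivalent to checking the sign of the difference h_i(x) − h_j(x).

```python
from fractions import Fraction

def recommend(log_risk, x, treatments=(0, 1)):
    """Recommend the treatment with the lowest predicted log-risk."""
    return min(treatments, key=lambda tau: log_risk(x, tau))

def split_by_recommendation(patients, log_risk):
    """Split (x, tau, time, event) records by agreement with the model."""
    recommendation, anti_recommendation = [], []
    for x, tau, time, event in patients:
        target = recommendation if recommend(log_risk, x) == tau else anti_recommendation
        target.append((time, event))
    return recommendation, anti_recommendation

def km_median(subset):
    """First time at which the Kaplan-Meier survival estimate is <= 0.5."""
    ordered = sorted(subset)
    at_risk = len(ordered)
    survival = Fraction(1)
    i = 0
    while i < len(ordered):
        t = ordered[i][0]
        deaths = leaving = 0
        while i < len(ordered) and ordered[i][0] == t:
            deaths += ordered[i][1]   # event indicator: 1 = death observed
            leaving += 1
            i += 1
        if deaths:
            survival *= Fraction(at_risk - deaths, at_risk)
            if survival <= Fraction(1, 2):
                return t
        at_risk -= leaving
    return None  # median survival time not reached

def toy_log_risk(x, tau):
    """Hypothetical model: treatment 1 helps when x0 < 0 and harms when x0 > 0."""
    return tau * x[0]

patients = [
    ([-0.5], 1, 9.0, 1), ([-0.5], 0, 4.0, 1), ([0.7], 0, 8.0, 0),
    ([0.7], 1, 2.0, 1), ([-0.1], 1, 7.0, 1),
]
rec_subset, anti_subset = split_by_recommendation(patients, toy_log_risk)
```

Exact fractions are used for the survival estimate so that the comparison with 0.5 is not affected by floating-point rounding.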
Fig. 6 Rotterdam & German Breast Cancer Study Group (GBSG) Survival Curves. Kaplan-Meier estimated survival curves with confidence intervals
(α = .05) for the patients who were given the treatment concordant with a method’s recommended treatment (Recommendation) and the subset
of patients who were not (Anti-Recommendation). We perform a log-rank test to validate the significance between each set of survival curves.
a Effect of DeepSurv’s Treatment Recommendations (GBSG), b Effect of RSF’s Treatment Recommendations (GBSG)
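The log-rank test used to compare the Recommendation and Anti-Recommendation curves can be sketched with the standard library alone. This is a minimal two-group version of the textbook statistic, written for illustration; the paper does not state which implementation was used, and the function name and return format here are assumptions.

```python
import math

def logrank_test(times_a, events_a, times_b, events_b):
    """Two-group log-rank test; returns (chi-square statistic, p-value)."""
    data = [(t, e, 0) for t, e in zip(times_a, events_a)]
    data += [(t, e, 1) for t, e in zip(times_b, events_b)]
    event_times = sorted({t for t, e, _ in data if e})
    observed_minus_expected = 0.0
    variance = 0.0
    for t in event_times:
        n_a = sum(1 for s, _, g in data if s >= t and g == 0)  # at risk, group A
        n_b = sum(1 for s, _, g in data if s >= t and g == 1)  # at risk, group B
        d_a = sum(1 for s, e, g in data if s == t and e and g == 0)
        d_b = sum(1 for s, e, g in data if s == t and e and g == 1)
        n, d = n_a + n_b, d_a + d_b
        # Expected deaths in group A under the null of equal hazards.
        observed_minus_expected += d_a - d * n_a / n
        if n > 1:
            variance += d * (n_a / n) * (n_b / n) * (n - d) / (n - 1)
    if variance == 0.0:
        return 0.0, 1.0
    chi2 = observed_minus_expected ** 2 / variance
    # Survival function of a chi-square variable with one degree of freedom.
    p_value = math.erfc(math.sqrt(chi2 / 2.0))
    return chi2, p_value
```

A small p-value indicates that the two survival curves differ by more than would be expected by chance, which is the criterion the captions of Figs. 5 and 6 refer to.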
Theano with the Python package Lasagne. We use the R package randomForestSRC to fit RSFs. All experiments are run using Docker containers such that the experiments are easily reproducible. We use the FloydHub base image for the DeepSurv docker container.
The hyper-parameters of the network include: the depth and size of the network, learning rate, ℓ2 regularization coefficient, dropout rate, exponential learning rate decay constant, and momentum. We run the random hyper-parameter optimization search as proposed in [23] using the Python package Optunity. We use the Sobol solver [35, 36] to sample each hyper-parameter from a predefined range and evaluate the performance of the configuration using k-fold cross validation (k = 3). We then choose the configuration with the largest validation C-index to avoid models that overfit. The hyper-parameters we use in all experiments are summarized in the next “Model Hyper-parameters” section.
Model Hyper-parameters
As described in “Experimental details” section, we tune DeepSurv’s hyper-parameters by running a random hyper-parameter search using the Python package Optunity. The table below summarizes the hyper-parameters we use for each experiment’s DeepSurv network.
We applied inverse time decay to the learning rate at each epoch:
$$\text{decayed\_LR} := \frac{\text{LR}}{1 + \text{epoch} \cdot \text{lr\_decay\_rate}}. \tag{10}$$
Appendix B
CPH recommender function
Let each patient in the dataset have a set of n features xn, in which one feature is a treatment variable x0 = τ. The CPH model estimates the log-risk function as a linear combination of the patient’s features ĥβ(x) = β^T x = β0 τ + β1 x1 + . . . + βn xn. When we calculate the recommender function for the CPH model, we show that the model returns a constant function independent of the patient’s features:
$$\begin{aligned}
\mathrm{rec}_{ij}(x) &= \log\frac{\lambda(t; x \mid \tau = i)}{\lambda(t; x \mid \tau = j)} \\
&= \log\frac{\lambda_0(t) \cdot e^{\beta_0 i + \beta_1 x_1 + \ldots + \beta_n x_n}}{\lambda_0(t) \cdot e^{\beta_0 j + \beta_1 x_1 + \ldots + \beta_n x_n}} \\
&= \log e^{\beta_0 i + \beta_1 x_1 + \ldots + \beta_n x_n - (\beta_0 j + \beta_1 x_1 + \ldots + \beta_n x_n)} \\
&= \beta_0 i - \beta_0 j \\
&= \beta_0 (i - j).
\end{aligned} \tag{11}$$
The CPH will recommend all patients to choose the same treatment option based on whether the model calculates the weight β0 to be positive or negative. Thus, the CPH would not be providing personalized treatment recommendations. Instead, the CPH determines whether the treatment is effective and, if so, recommends it to all patients. In an experiment, when we calculate which patients took the CPH’s recommendation, the Recommendation and Anti-Recommendation subgroups will be equal to the control and treatment groups. Therefore, calculating treatment recommendations using the CPH provides little value to the experiments in terms of comparing the models’ recommendations.
Appendix C
Simulated data generation
Each patient’s baseline information x is drawn from a uniform distribution on [−1, 1)^d. For datasets that also involve treatment, the patient’s treatment status τx is drawn from a Bernoulli distribution with p = 0.5.
The Cox proportional hazard model assumes that the baseline hazard function λ0(t) is shared across all patients. The initial death time is generated according to an exponential random variable with a mean μ = 5, which we denote u ∼ Exp(5). The individual death time is then generated by
$$T = \frac{u}{e^{h(x)}}, \quad \text{when there is no treatment variable,}$$
$$T = \frac{u}{e^{\tau_x h(x)}}, \quad \text{when there is a treatment variable.}$$
These times are then right censored at an end time to represent the end of a trial. The end time T0 is chosen such that 90 percent of people have an observed death time. Because we cannot observe any T beyond the end time threshold, we denote the final observed outcome time
$$Z = \min(T, T_0).$$
Abbreviations
BMI: Body-mass-index; C-index: Concordance-index; CHF: Left heart failure complications; CPH: Cox proportional hazards model; GBSG: The German breast cancer study group; IHC4+C: Immunohistochemical 4 plus clinical; METABRIC: Molecular taxonomy of breast cancer international consortium; MI: Acute myocardial infarction; MIORD: Order of MI; MSE: Mean squared error; NN: Neural network; ReLU: Rectified linear unit; RSF: Random survival forest; SELU: Scaled exponential linear unit; SUPPORT: Study to understand prognoses preferences outcomes and risks of treatment; WHAS: The Worcester heart attack study
Acknowledgements
We express our thanks to Steven Ma for his comments.
Funding
This research was partially funded by a National Institutes of Health grant [1R01HG008383-01A1 to Y.K.] and supported by a National Science Foundation Award [DMS-1402254 to A.C.].
Availability of data and materials
Project Name: DeepSurv
Project home page: https://github.com/jaredleekatzman/DeepSurv
Archived version: https://doi.org/10.5281/zenodo.1134133
Operating system(s): Platform independent
Programming language: Python
Other requirements: Theano 0.8.2 or higher, Lasagne 0.2.dev1 or higher, and Lifelines 0.9.2 or higher
License: MIT
Any restrictions to use by non-academics: Licence needed
The data that support the findings of this study were published in earlier studies by others and are also available on DeepSurv’s GitHub repository.
Authors’ contributions
JLK, US, AC, JRB, and YK were responsible for the design of the project. TJ helped with data analysis in consultation with JLK and YK. JLK wrote an initial
Medicine, 333 Cedar Street, 06510 New Haven, CT, USA. 5 Center of Outcomes Research and Evaluation, Yale-New Haven Hospital, 06511 New Haven, CT, USA. 6 Interdepartmental Program in Computational Biology and Bioinformatics, Yale University, 06511 New Haven, CT, USA. 7 Department of Pathology and Yale Cancer Center, Yale University School of Medicine, 06511 New Haven, CT, USA. 8 Department of Mathematics, University of California, San Diego, 92093 La Jolla, CA, USA. 9 Final Research, Herzliya, Israel.
Received: 3 September 2017 Accepted: 7 February 2018
References
1. RW Y, EA S, DJ K, et al. Development and validation of a prediction rule for benefit and harm of dual antiplatelet therapy beyond 1 year after percutaneous coronary intervention. JAMA. 2016;315(16):1735–49. https://doi.org/10.1001/jama.2016.3775.
2. Royston P, Altman DG. External validation of a cox prognostic model: principles and methods. BMC Med Res Methodol. 2013;13(1):1.
3. Bair E, Tibshirani R. Semi-supervised methods to predict patient survival from gene expression data. PLoS Biol. 2004;2(4):108.
4. Cheng W-Y, Yang T-HO, Anastassiou D. Development of a prognostic model for breast cancer survival in an open challenge environment. Sci Total Environ. 2013;5(181):181–5018150.
5. Cox DR. In: Kotz S, Johnson NL, editors. Regression Models and Life-Tables. New York: Springer; 1992, pp. 527–41. https://doi.org/10.1007/978-1-4612-4380-9.
6. Liestøl K, Andersen PK, Andersen U. Survival analysis and neural nets. Stat Med. 1994;13(12):1189–200.
7. Street WN. A neural network model for prognostic prediction. In: Kaufmann M, editor. Proceedings of the Fifteenth International Conference on Machine Learning. San Francisco; 1998. p. 540–46.
8. Jerez JM, Franco L, Alba E, Llombart-Cussac A, Lluch A, Ribelles N, Munárriz B, Martín M. Improvement of breast cancer relapse prediction in high risk intervals using artificial neural networks. Breast Cancer Res Treat. 2005;94(3):265–72. https://doi.org/10.1007/s10549-005-9013-y.
9. Biganzoli E, Boracchi P, Mariani L, Marubini E. Feed forward neural networks for the analysis of censored survival data: a partial logistic regression approach. Stat Med. 1998;17(10):1169–86.
10. Faraggi D, Simon R. A neural network model for survival data. Stat Med. 1995;14(1):73–82.
11. Sargent DJ. Comparison of artificial neural networks with other statistical approaches. Cancer. 2001;91(S8):1636–42.
12. Xiang A, Lapuerta P, Ryutov A, Buckley J, Azen S. Comparison of the performance of neural network methods and cox regression for censored survival data. Comput Stat Data Anal. 2000;34(2):243–57.
13. Mariani L, Coradini D, Biganzoli E, Boracchi P, Marubini E, Pilotti S, Salvadori B, Silvestrini R, Veronesi U, Zucali R, et al. Prognostic factors for metachronous contralateral breast cancer: a comparison of the linear cox regression model and its artificial neural network extension. Breast Cancer
version of the manuscript and incorporated comments from US, AC, JRB, and Res Treat. 1997;44(2):167–78.
YK. JLK wrote the software in consultation with US and AC. All authors read 14. Therneau T, Grambsch PM. Modeling Survival Data : Extending the Cox
and approved the final manuscript. Model. New York: Springer; 2000.
15. Ishwaran H, Kogalur UB. Random survival forests for r. R News. 2007;7(2):25–31.
Ethics approval and consent to participate 16. Ishwaran H, Kogalur UB, Blackstone EH, Lauer MS. Random survival
Not applicable. forests. Ann Appl Statist. 2008;2(3):841–60.
17. Ranganath R, Perotte A, Elhadad N, Blei D. Deep survival analysis. In:
Consent for publication Doshi-Velez F, Fackler J, Kale D, Wallace B, Weins J, editors. Proceedings
Not applicable. of the 1st Machine Learning for Healthcare Conference. Proceedings of
Machine Learning Research, vol 56. Northeastern University, Boston, MA,
Competing interests USA: PMLR; 2016. p. 101–14. http://proceedings.mlr.press/v56/
The authors declare that they have no competing interests. Ranganath16.html.
18. Srivastava N, Hinton G, Krizhevsky A, Sutskever I, Salakhutdinov R.
Publisher’s Note Dropout: A simple way to prevent neural networks from overfitting.
Springer Nature remains neutral with regard to jurisdictional claims in J Mach Learn Res. 2014;15(1):1929–58.
published maps and institutional affiliations. 19. Klambauer G, Unterthiner T, Mayr A, Hochreiter S. Self-normalizing
neural networks. In: Advances in Neural Information Processing Systems;
Author details 2017. p. 972–81. arXiv preprint. 1706.02515.
1 Department of Computer Science, Yale University, 51 Prospect Street, 06511 20. Kingma D, Ba J. Adam: A method for stochastic optimization.
New Haven, CT, USA. 2 Department of Statistics, Yale University, 24 Hillhouse Proceedings of the 3rd International Conference on Learning
Avenue, 06511 New Haven, CT, USA. 3 Applied Mathematics Program, Yale Representations (ICLR 2015). 2015. arXiv preprint arXiv:1412.6980. https://
University, 51 Prospect Street, 06511 New Haven, CT, USA. 4 Yale School of dare.uva.nl/search?identifier=a20791d3-1aff-464a-8544-268383c33a75.