s12967 023 04004 X
Abstract
Background Identifying predictive non-invasive biomarkers of immunotherapy response is crucial to avoid premature
treatment interruptions or ineffective prolongation. Our aim was to develop a non-invasive biomarker for
predicting immunotherapy clinical durable benefit, based on the integration of radiomics and clinical data monitored
through early anti-PD-1/PD-L1 monoclonal antibodies treatment in patients with advanced non-small cell lung cancer
(NSCLC).
Methods In this study, 264 patients with pathologically confirmed stage IV NSCLC treated with immunotherapy
were retrospectively collected from two institutions. The cohort was randomly divided into a training (n = 221) and
an independent test set (n = 43), ensuring the balanced availability of baseline and follow-up data for each patient.
Clinical data corresponding to the start of treatment was retrieved from electronic patient records, and blood test
variables after the first and third cycles of immunotherapy were also collected. Additionally, traditional radiomics and
deep-radiomics features were extracted from the primary tumors of the computed tomography (CT) scans before
treatment and during patient follow-up. Random Forest was used to implement baseline and longitudinal models
using clinical and radiomics data separately, and then an ensemble model was built integrating both sources of
information.
Results The integration of longitudinal clinical and deep-radiomics data significantly improved clinical durable
benefit prediction at 6 and 9 months after treatment in the independent test set, achieving an area under the receiver
operating characteristic curve of 0.824 (95% CI: [0.658,0.953]) and 0.753 (95% CI: [0.549,0.931]). The Kaplan-Meier
survival analysis showed that, for both endpoints, the signatures significantly stratified high- and low-risk patients
(p-value < 0.05) and were significantly correlated with progression-free survival (PFS6 model: C-index 0.723, p-value =
0.004; PFS9 model: C-index 0.685, p-value = 0.030) and overall survival (PFS6 model: C-index 0.768, p-value = 0.002;
PFS9 model: C-index 0.736, p-value = 0.023).
*Correspondence:
Benito Farina
© The Author(s) 2023. Open Access This article is licensed under a Creative Commons Attribution 4.0 International License, which
permits use, sharing, adaptation, distribution and reproduction in any medium or format, as long as you give appropriate credit to the
original author(s) and the source, provide a link to the Creative Commons licence, and indicate if changes were made. The images or
other third party material in this article are included in the article’s Creative Commons licence, unless indicated otherwise in a credit line
to the material. If material is not included in the article’s Creative Commons licence and your intended use is not permitted by statutory
regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder. To view a copy of this
licence, visit http://creativecommons.org/licenses/by/4.0/. The Creative Commons Public Domain Dedication waiver
(http://creativecommons.org/publicdomain/zero/1.0/) applies to the data made available in this article, unless otherwise stated in a credit line to the data.
Farina et al. Journal of Translational Medicine (2023) 21:174 Page 2 of 15
Conclusions Integrating multidimensional and longitudinal data improved clinical durable benefit prediction to
immunotherapy treatment of advanced non-small cell lung cancer patients. The selection of effective treatment
and the appropriate evaluation of clinical benefit are important for better managing cancer patients with prolonged
survival and preserving quality of life.
Keywords Immunotherapy, Lung cancer, Clinical durable benefit, Deep-Radiomics, Clinical data, Longitudinal
analysis, Treatment monitoring
Fig. 1 Flowchart showing the inclusion and exclusion criteria considering the endpoint PFS6. Details of the number of patients in the training and
independent test set are provided
follow-up CT images using either the syngo.via Siemens Healthineers software or 3D Slicer [23]. The largest lesion was considered if a patient had an ambiguous primary tumor. Follow-up CT scans were discarded if the tumor found in the baseline CT scan was no longer visible.
For pre-processing, Hounsfield units of all CT images were clipped between -1000 and 3050, and z-score normalization was then applied.
Feature engineering
Radiomics analysis
Radiomics features were extracted using Pyradiomics (version 3.0.1) [24]. The voxel intensity values were discretized when computing some texture features using a bin width of 25 Hounsfield units [25]. To reduce the effect of low resolution along the z-axis in part of the data, the radiomics features were computed only by applying 2D filters.
Feature reproducibility and feature repeatability against segmentation were assessed using the QIN Lung CT Segmentation dataset, a random subset of the data, and the RIDER dataset (Additional file 1: S3). Reproducible and repeatable features are potentially more robust to variations in CT scanners, acquisition parameters, and segmentation.
After feature extraction and reproducibility selection, delta-radiomics features were calculated as the relative net change between features at baseline and first follow-up CTs. Patients without first follow-up CT were discarded from this analysis.
A standard scaler was applied to normalize each radiomics feature. The transformation was learned in training and then applied to the test set.
Deep feature extraction
To extract high-level, domain-related deep learning-based representations of the tumors (e.g., texture, morphology), the convolutional neural network (CNN) architecture NoduleX [26] was used as a reference implementation to predict the response to immunotherapy. NoduleX input consists of a small 3D volume of 47 × 47 pixels × 5 slices centered in the centroid of the tumor that was sampled and resized from a square of 10 × 10 cm². Image intensities were clipped to the range [-1000, 3050] and then normalized.
A transfer learning approach was used to pretrain NoduleX CNN architecture weights. Namely, the network was pre-trained to predict the malignancy of tumors collected from 719 patients of The Lung Image Database Consortium and Image Database Resource Initiative Data Set (LIDC-IDRI) [27] and 14 patients who did not meet the inclusion criteria of the immunotherapy dataset (1528 tumors, Additional file 1: S4). Then, the last two convolutional layers and the classification layers were fine-tuned to predict the response (defined by the endpoint PFS6). For network fine-tuning, all primary tumors of all available CT images from the immunotherapy training data set (357 tumors, 128 patients) were used. Fine-tuning allowed the efficient transfer of malignant-related spatial features to more complicated high-level semantic features related to immunotherapy response.
After training, deep features were extracted for each tumor from the first fully connected layers of the network (500 deep features), referred to as DF-imm. Similarly to delta-radiomics, delta DF-imm features were also calculated.
Clinical data
Baseline demographic, epidemiological, clinical and laboratory data were collected from electronic patient records, as well as hemogram-related data after the second and third treatment cycles. They included sex, age, body mass index, tumor histology, smoking, previous surgery, presence of metastases, and immune cell-related indexes, among others (Additional file 1: S5).
One-hot encoding was applied to categorical or constant variables. Z-score normalization was applied to continuous variables, and missing data were imputed using the k-means algorithm. Delta features were also calculated.
Model design and analysis
Random Forest (RF) models were built for each primary endpoint in the training set using stratified three-fold cross-validation. The number of training patients for each RF model is reported in Additional file 1: Tables S1 (PFS6) and S2 (PFS9). Feature selection and RF hyperparameter optimization were performed using a Bayesian optimization approach. The optimized hyperparameters were the number of estimators, the maximum depth, and the number of features.
Radiomics, deep features, and clinical data were used to implement baseline, delta, and longitudinal RF models trained for predicting the immunotherapy response. The inputs of the baseline models (RF-baseline) were only the data before the start of treatment, whereas longitudinal models used baseline and early treatment data. Patients who did not have follow-up data were excluded from the longitudinal analysis. Two types of longitudinal models were constructed: RF-delta and RF-longitudinal. The RF-delta model had delta features as input and considered only patients with baseline and first follow-up data. On the other hand, the RF-longitudinal input was the concatenation of all available features over time for each patient (number of features multiplied by the number of time points).
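The two longitudinal inputs described in this section can be illustrated with a minimal sketch. It is not the authors' implementation: the function names, the small `eps` guard against division by zero, the use of the absolute baseline value as denominator, and the tie-breaking rule (earlier time point wins) are all assumptions made for illustration. The fill rule for a missing time point follows the closest-in-time imputation that the Methods state for the RF-longitudinal model, with the position in the list standing in for acquisition time.

```python
from typing import List, Optional


def delta_features(baseline: List[float], follow_up: List[float],
                   eps: float = 1e-8) -> List[float]:
    """Relative net change per feature: (follow_up - baseline) / |baseline|.

    eps is a hypothetical guard against a zero baseline value.
    """
    return [(f1 - f0) / (abs(f0) + eps) for f0, f1 in zip(baseline, follow_up)]


def longitudinal_vector(time_points: List[Optional[List[float]]]) -> List[float]:
    """Concatenate the feature vectors of all time points for one patient.

    A missing time point (None) is filled with the closest available one;
    list position is used as a proxy for time (assumption).
    """
    available = [i for i, tp in enumerate(time_points) if tp is not None]
    if not available:
        raise ValueError("at least one time point is required")
    vector: List[float] = []
    for i, tp in enumerate(time_points):
        if tp is None:
            # Closest available index; ties resolved toward the earlier one.
            nearest = min(available, key=lambda j: (abs(j - i), j))
            tp = time_points[nearest]
        vector.extend(tp)
    return vector


# Toy patient: baseline and second follow-up available, first follow-up missing.
patient = [[1.0, 2.0], None, [3.0, 4.0]]
x_longitudinal = longitudinal_vector(patient)          # 2 features x 3 time points
x_delta = delta_features([2.0, 4.0], [3.0, 2.0])       # change vs. baseline
```

With two features and three time points the longitudinal vector has six entries, which matches the "number of features multiplied by the number of time points" stated above.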
Missing time points were imputed with the data from the closest available time point.
For comparison, the NoduleX architecture pre-trained for malignancy prediction was fine-tuned with the baseline training data of the immunotherapy dataset to predict treatment durable response (CNN-baseline).
For predicting PFS9, because the training set was imbalanced, a synthetic minority oversampling technique (SMOTE) was used during the training phase to resample the minority class (“responders”). SMOTE generated synthetic training samples using five nearest neighbors, so that the numbers of responders and nonresponders were equal.
Once the models were trained, ensemble RF models were implemented as the mean value of the predictions of the imaging and clinical models alone (ensemble RF). They allowed integrating both clinical and image information. The workflow is shown in Fig. 2.
Model interpretation
The SHAP (or SHapley Additive exPlanations) algorithm was employed to visualize each feature’s contribution to producing the final prediction of the model [28]. SHAP assigns an importance value to each feature for each individual predicted value based on concepts from Cooperative Game Theory and local explanations. We applied the SHAP algorithm to the clinical model of the ensemble RF model. SHAP values were calculated to understand how much each feature impacted the model output or how much it increased or decreased the probability of a single outcome. SHAP values allowed us to determine whether the relationship between a feature and the output was correlative or anticorrelative. SHAP analysis was performed in Python using the KernelExplainer in the SHAP module (version 0.40.0).
Statistical and survival analysis
Stratified three-fold cross-validation was performed in the training set to train all the implemented models and optimize the RF hyperparameters. Model performance was evaluated by the area under the receiver operating characteristic (ROC) curve (AUC), and the corresponding 95% confidence interval (CI) was estimated with a bootstrap resampling approach (1000 iterations). The differences between ROC curves were assessed using the DeLong test. Kaplan-Meier survival analysis was performed for patients’ stratification based on the model’s predictions (threshold = 0.5). The significance of differences between survival curves was assessed with the log-rank test. Hazard ratios (HRs) and concordance index were calculated using the Cox proportional-hazards model. p-values less than 0.05 (two-sided tests) were considered significant. R (version 4.1.1) and Python (version 3.7.10) were used for statistical analysis and model implementation.
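As an illustration of the evaluation steps described in this section, the sketch below averages the predicted probabilities of an imaging model and a clinical model, computes the AUC, and estimates a 95% CI by bootstrap resampling with 1000 iterations. It is a simplified stand-in, not the study code: the function names, the percentile method for the interval, the fixed random seed, and the toy probabilities are assumptions.

```python
import random
from typing import List, Sequence, Tuple


def ensemble_mean(p_imaging: Sequence[float], p_clinical: Sequence[float]) -> List[float]:
    """Ensemble score: mean of the two models' predicted probabilities."""
    return [(a + b) / 2.0 for a, b in zip(p_imaging, p_clinical)]


def auc(labels: Sequence[int], scores: Sequence[float]) -> float:
    """AUC as the fraction of (positive, negative) pairs ranked correctly."""
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    if not pos or not neg:
        raise ValueError("both classes are required")
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0 for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


def bootstrap_ci(labels: Sequence[int], scores: Sequence[float],
                 n_iter: int = 1000, alpha: float = 0.05,
                 seed: int = 0) -> Tuple[float, float]:
    """Percentile bootstrap CI for the AUC (percentile method is an assumption)."""
    rng = random.Random(seed)
    n = len(labels)
    stats: List[float] = []
    while len(stats) < n_iter:
        idx = [rng.randrange(n) for _ in range(n)]
        ys = [labels[i] for i in idx]
        if len(set(ys)) < 2:
            continue  # a resample with a single class has no defined AUC
        stats.append(auc(ys, [scores[i] for i in idx]))
    stats.sort()
    lower = stats[int((alpha / 2) * n_iter)]
    upper = stats[min(n_iter - 1, int((1 - alpha / 2) * n_iter))]
    return lower, upper


# Toy example with hypothetical probabilities for four patients.
y_true = [0, 0, 1, 1]
scores = ensemble_mean([0.2, 0.6, 0.4, 0.9], [0.4, 0.2, 0.8, 0.7])
groups = [int(s >= 0.5) for s in scores]   # stratification at threshold 0.5
ci_low, ci_high = bootstrap_ci(y_true, scores)
```

In this toy case neither single model separates the classes perfectly, while their mean does, which is the kind of complementarity the ensemble is meant to exploit. The DeLong test, log-rank test, and Cox model are not reproduced here.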
Table 1 Demographic and clinical characteristics of the patients in the baseline and longitudinal analyses. P-values above 0.05 indicate
no significant difference between the training and test sets (two-sample t-test for continuous variables, Chi-square
test for categorical variables). SD represents the standard deviation, and Q1 and Q3 represent the first and third quartiles, respectively
Characteristic Baseline analysis Longitudinal analysis
All patients Train set Test set P-value All patients Train set Test set p-value
(N= 264) (N = 221) (N = 43) (N= 200) (N = 167) (N = 33)
PFS, mean (SD) 9.0 (11.1) 9.3 (11.6) 7.6 (8.1) 0.242 11.1 (11.8) 11.6 (12.3) 9.0 (8.6) 0.147
OS, mean (SD) 13.3 (12.2) 13.3 (12.5) 13.5 (10.5) 0.903 16.0 (12.4) 16.0 (12.8) 15.7 (10.6) 0.889
Status
Alive 107 (40.5%) 91 (41.2%) 16 (37.2%) 0.753 91 (45.5%) 78 (46.7%) 13 (39.4%) 0.562
Dead 157 (59.5%) 130 (58.8%) 27 (62.8%) 109 (54.5%) 89 (53.3%) 20 (60.6%)
Response
Non-responders 148 (56.1%) 124 (56.1%) 24 (55.8%) 1.000 90 (45.0%) 75 (44.9%) 15 (45.5%) 1.000
Responders 116 (43.9%) 97 (43.9%) 19 (44.2%) 110 (55.0%) 92 (55.1%) 18 (54.5%)
Progression
No progression 45 (17.0%) 40 (18.1%) 5 (11.6%) 0.417 42 (21.0%) 38 (22.8%) 4 (12.1%) 0.256
Progression 219 (83.0%) 181 (81.9%) 38 (88.4%) 158 (79.0%) 129 (77.2%) 29 (87.9%)
Age, median [Q1,Q3] 65.0 [59.0,71.0] 65.0 [58.0,71.0] 67.0 [60.5,72.5] 0.204 65.0 [58.0,70.2] 64.0 [57.0,70.0] 67.0 [60.0,72.0] 0.266
Sex
Female 80 (30.3%) 66 (29.9%) 14 (32.6%) 0.865 58 (29.0%) 47 (28.1%) 11 (33.3%) 0.696
Male 184 (69.7%) 155 (70.1%) 29 (67.4%) 142 (71.0%) 120 (71.9%) 22 (66.7%)
IPA, mean (SD) 45.2 (33.4) 45.1 (33.8) 45.4 (31.5) 0.958 44.0 (34.1) 44.9 (34.6) 39.0 (31.2) 0.357
Smoking
Current smoker 55 (21.0%) 50 (22.7%) 5 (11.9%) 0.258 39 (19.7%) 35 (21.1%) 4 (12.5%) 0.530
Former smoker 180 (68.7%) 147 (66.8%) 33 (78.6%) 135 (68.2%) 111 (66.9%) 24 (75.0%)
Non-smoker 27 (10.3%) 23 (10.5%) 4 (9.5%) 24 (12.1%) 20 (12.0%) 4 (12.5%)
Tumour histology
Adenocarcinoma 203 (76.9%) 170 (76.9%) 33 (76.7%) 0.897 151 (75.5%) 126 (75.4%) 25 (75.8%) 0.896
Epidermoid carcinoma 52 (19.7%) 43 (19.5%) 9 (20.9%) 40 (20.0%) 33 (19.8%) 7 (21.2%)
Other 9 (3.4%) 8 (3.6%) 1 (2.3%) 9 (4.5%) 8 (4.8%) 1 (3.0%)
PDL1, mean (SD) 0.4 (0.4) 0.4 (0.4) 0.4 (0.4) 0.876 0.4 (0.4) 0.4 (0.4) 0.3 (0.3) 0.194
Surgery
No 227 (86.0%) 190 (86.0%) 37 (86.0%) 1.000 171 (85.5%) 142 (85.0%) 29 (87.9%) 0.792
Yes 37 (14.0%) 31 (14.0%) 6 (14.0%) 29 (14.5%) 25 (15.0%) 4 (12.1%)
Treatment
Combined immunological agents 39 (14.8%) 29 (13.1%) 10 (23.3%) 0.393 31 (15.5%) 24 (14.4%) 7 (21.2%) 0.276
Immunotherapy + chemotherapy 50 (18.9%) 41 (18.6%) 9 (20.9%) 39 (19.5%) 30 (18.0%) 9 (27.3%)
Immunotherapy + radiotherapy 17 (6.4%) 15 (6.8%) 2 (4.7%) 11 (5.5%) 11 (6.6%) 0 (0%)
Monotherapy 154 (58.3%) 132 (59.7%) 22 (51.2%) 116 (58.0%) 99 (59.3%) 17 (51.5%)
Other 4 (1.5%) 4 (1.8%) 0 (0%) 3 (1.5%) 3 (1.8%) 0 (0%)
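For readers who want to check the categorical p-values in Table 1, the following sketch computes a Chi-square test on a 2 × 2 table of counts (training vs. test set). The article only states that a Chi-square test was used; applying Yates' continuity correction is an assumption made here, and the function name is hypothetical.

```python
import math
from typing import Tuple


def chi2_yates_2x2(a: int, b: int, c: int, d: int) -> Tuple[float, float]:
    """Chi-square test with Yates' correction for the 2x2 table [[a, b], [c, d]].

    Rows are the two groups (e.g. training, test); columns are the two categories.
    Returns (chi2 statistic, p-value) for one degree of freedom.
    """
    n = a + b + c + d
    rows = (a + b, c + d)
    cols = (a + c, b + d)
    if 0 in rows or 0 in cols:
        raise ValueError("every row and column must contain at least one count")
    observed = ((a, b), (c, d))
    chi2 = 0.0
    for i in range(2):
        for j in range(2):
            expected = rows[i] * cols[j] / n
            diff = abs(observed[i][j] - expected)
            diff -= min(0.5, diff)          # continuity correction, never below zero
            chi2 += diff * diff / expected
    # Survival function of the chi-square distribution with 1 degree of freedom.
    p_value = math.erfc(math.sqrt(chi2 / 2.0))
    return chi2, p_value


# Counts taken from Table 1, baseline analysis (train row, then test row).
_, p_response = chi2_yates_2x2(124, 97, 24, 19)   # non-responders / responders
_, p_sex = chi2_yates_2x2(66, 155, 14, 29)        # female / male
```

Under this assumption the sketch gives a p-value of 1.0 for Response and about 0.865 for Sex, in line with the values reported in the table.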
months of treatment, while only 33.2% responded after 9 months; adenocarcinoma was the most prevalent histological variant of advanced NSCLC (76.9%); and 89.7% of the patients were current or former smokers. Immunotherapy treatment included monotherapy (58.3%), immunotherapy combined with radiation therapy (6.4%), immunotherapy combined with chemotherapy (18.9%) and a combination of different immunological agents (14.8%). No demographic or clinical characteristic differed significantly (significance level: p-value < 0.05) between the training and test sets according to two-sample t-tests for continuous variables and Chi-square tests for categorical variables.
For the subcohort of patients with imaging data (171 of 264 patients), the training and the independent test sets did not differ significantly in demographic and clinical characteristics (p > 0.05).
Model development and response prediction performance
From the initial set of 1365 radiomics features, only 173 (13%) passed both the reproducibility and the repeatability against segmentation tests. Furthermore, a total of 500 DF-imm were extracted for each tumor using the NoduleX architecture. The number of features used as input varied depending on each model. The number of features selected for each implemented model and the results in the training set are shown in Additional file 1: Tables S8 and S9, respectively.
Figures 3 and 4 compare the ROC curves of CNN-baseline and the baseline, delta and longitudinal RF models using clinical, radiomics and DF-imm data in the independent test cohort for PFS6 and PFS9, respectively. Longitudinal models performed better than baseline or delta models in the independent test cohort, achieving an AUC of 0.740 (95% CI: 0.563−0.833) with DF-imm and an AUC of 0.700 (95% CI: 0.508−0.877) with clinical
Fig. 3 Comparisons of the ROC curves for endpoint PFS6 prediction of response of the baseline (a), delta (b), and longitudinal RF models (c) based
on clinical, radiomics, or deep-radiomics data
Fig. 4 Comparisons of the ROC curves for endpoint PFS9 prediction of response of the baseline (a), delta (b), and longitudinal RF models (c) based
on clinical, radiomics, or deep-radiomics data
data for PFS6 and an AUC of 0.702 (95% CI: 0.515−0.867) with DF-imm and an AUC of 0.585 (95% CI: 0.367−0.783) with clinical data for PFS9. In both cases, the automatically extracted features performed better than the hand-crafted radiomics features and clinical data (Figs. 3 and 4).
Tables 2 and 3 compare the evaluation metrics of all implemented models, showing a marked improvement when using the longitudinal models.
Integration of imaging and clinical data
Table 4 shows the performance in the independent test set of the ensemble RF models that used both clinical and imaging information. The comparison with baseline and longitudinal RF models tested on the same patients is shown in Additional file 1: Tables S10 and S11 for endpoint PFS6 and PFS9, respectively. The ensemble RF-longitudinal achieved an AUC of 0.824 (95% CI: 0.658−0.953) for PFS6 with a 41% improvement compared to RF models with only clinical data (DeLong test: p-value = 0.001) and 13% compared to the RF model with deep features data (DeLong test: p-value = 0.013). When considering PFS9, the ensemble model achieved an AUC of 0.753 (95% CI: 0.549−0.931) with a 31% improvement compared to RF models with only clinical data (DeLong test: p-value = 0.053) and 5% compared to the RF model based on deep features data (DeLong test: p-value = 0.058) (Fig. 5). Furthermore, the ensemble models' scores were significantly associated with progression-free survival and overall survival in the independent test set (6 months: C-index 4.68, 95% CI: [1.52,7.84], p-value < 0.004; 9 months: C-index 2.38, 95% CI: [0.23,4.54], p-value < 0.030). The HRs with their corresponding 95% CIs and the C-indexes of longitudinal and ensemble RF models for PFS and OS are shown in Tables 5 (endpoint PFS6) and 6 (endpoint PFS9). The integration of clinical and DF-imm data appeared to be a more robust approach compared to the radiomics or clinical models.
Figure 6 shows the Kaplan-Meier survival curves for PFS and OS on the independent test set for the ensemble RF models. The ensemble RF could significantly stratify PFS and OS for both endpoints compared to the other models (p-value < 0.05). The comparisons between Kaplan-Meier curves for longitudinal RF and ensemble RF models are shown in Additional file 1: Figures S1 (endpoint PFS6) and S2 (endpoint PFS9).
Model interpretation
The SHAP algorithm was employed to visualize each feature’s contribution to producing the final prediction of the model. The SHAP algorithm was applied to the clinical model of the ensemble RF. A positive SHAP value indicated an increased risk of progression for each prediction. As observed in Fig. 7, the most important clinical variables were the neutrophils-to-lymphocytes ratio (NLR) and the systemic immune-inflammation index
Table 2 Response prediction performance comparison between baseline, delta and longitudinal models in the independent test set
for endpoint PFS6 by evaluating AUC, ACC, SENS, SPEC, PREC and bACC, respectively
Model Features N test AUC ACC SENS SPEC PREC bACC
[95% CI] [95% CI] [95% CI] [95% CI] [95% CI] [95% CI]
Table 3 Response prediction performance comparison between baseline, delta and longitudinal models in the independent test set
for endpoint PFS9 by evaluating AUC, ACC, SENS, SPEC, PREC and bACC, respectively
Model Features N test AUC ACC SENS SPEC PREC bACC
[95% CI] [95% CI] [95% CI] [95% CI] [95% CI] [95% CI]
(SII): for both endpoints, the higher the values in the second time step (around 1–2 months after treatment), the higher the probability of progression. Moreover, the presence of liver metastases appeared to be related to a worse outcome.
Discussion
In immuno-oncology, the traditional approach of manually measuring the size changes of the target lesions during treatment is no longer adequate because the tumor unconventionally responds to treatment [29]. Therefore, identifying unusual tumor response patterns could avoid premature treatment interruptions or ineffective prolongation. Automatic extraction of imaging biomarkers that capture changes in tumor radiophenotypes during treatment in association with clinical information can potentially aid in patient evaluation and ultimately monitor and adapt therapy dynamically.
In this two-institutional study, longitudinal information from clinical data and radiomics was used to predict clinical durable benefit at 6 and 9 months after the start of anti-PD-1/PD-L1 monoclonal antibodies treatment in advanced NSCLC patients using an ensemble approach. A deep-learning method was used to automatically extract spatial information from CT scans without manual or semiautomatic segmentation and with the
Table 4 Response prediction performance comparison between longitudinal and ensemble models in the independent test set for
endpoint PFS6 and PFS9 by evaluating AUC, ACC, SENS, SPEC, PREC and bACC, respectively
Endpoint Model Features N test AUC ACC SENS SPEC PREC bACC
[95% CI] [95% CI] [95% CI] [95% CI] [95% CI] [95% CI]
PFS6 Ensemble RF-baseline DF-imm 43 0.678 0.605 0.875 0.263 0.600 0.569
Clinical data [0.513,0.836] [0.442,0.744] [0.731,1.000] [0.071,0.467] [0.436,0.758] [0.448,0.684]
Ensemble RF-longitudinal DF-imm 32 0.824 0.750 0.733 0.765 0.733 0.749
Clinical data [0.658,0.953] [0.594,0.906] [0.500,0.938] [0.533,0.947] [0.471,0.933] [0.594,0.897]
PFS9 Ensemble RF-baseline DF-imm 43 0.560 0.581 0.793 0.143 0.657 0.468
Clinical data [0.377,0.731] [0.442,0.721] [0.643,0.933] [0.000,0.364] [0.487,0.811] [0.360,0.590]
Ensemble RF-longitudinal DF-imm 32 0.753 0.813 0.947 0.615 0.783 0.781
Clinical data [0.549,0.931] [0.656,0.938] [0.826,1.000] [0.357,0.889] [0.609,0.950] [0.631,0.923]
For each metric, the 95% confidence interval is shown and the highest value for each endpoint is highlighted in bold
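As a quick consistency check on Table 4, the balanced accuracy (bACC) can be recomputed from the reported sensitivity and specificity, assuming the usual definition bACC = (SENS + SPEC) / 2. The dictionary below simply transcribes the point estimates from the table; the variable and function names are illustrative.

```python
def balanced_accuracy(sensitivity: float, specificity: float) -> float:
    """Balanced accuracy as the mean of sensitivity and specificity."""
    return (sensitivity + specificity) / 2.0


# (SENS, SPEC, reported bACC) transcribed from Table 4.
table4 = {
    ("PFS6", "Ensemble RF-baseline"): (0.875, 0.263, 0.569),
    ("PFS6", "Ensemble RF-longitudinal"): (0.733, 0.765, 0.749),
    ("PFS9", "Ensemble RF-baseline"): (0.793, 0.143, 0.468),
    ("PFS9", "Ensemble RF-longitudinal"): (0.947, 0.615, 0.781),
}

# Recomputed bACC per row, rounded to three decimals like the table.
recomputed = {
    key: round(balanced_accuracy(sens, spec), 3)
    for key, (sens, spec, _) in table4.items()
}
```

All four recomputed values agree with the reported bACC to three decimals, so the table rows are internally consistent under this definition.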
Fig. 5 Comparisons of ROC curves of longitudinal and ensemble RF models with clinical and radiomics data. a ROC curves for PFS6: PFS> 6
months. b ROC curve for PFS9: PFS > 9 months
Table 5 Hazard ratios and C-indexes of longitudinal and ensemble models trained for endpoint PFS6 to predict PFS and OS in the
independent test set
PFS OS
Table 6 Hazard ratios and C-indexes of longitudinal and ensemble models trained for endpoint PFS9 to predict PFS and OS in the
independent test set
Model Features PFS OS
HR p-value C-index HR p-value C-index
[95% CI] [95% CI]
advantage of extracting features closely associated with response. Furthermore, deep-features compared to traditional radiomics may be more robust to noise variability introduced during image acquisition, making them more reproducible. Previous studies have demonstrated the ability of deep learning to capture higher-level features
Fig. 6 Kaplan-Meier survival curves on the independent test cohort for ensemble RF models trained for endpoint PFS6 (first row) and PFS9 (second
row). a and c represent the PFS Kaplan-Meier curves, while b and d represent the OS Kaplan-Meier curves
related to the immunotherapy response [20, 30–32]. The results of this study demonstrated that the deep features were more robust than traditional radiomics in predicting immunotherapy clinical durable benefit in advanced NSCLC, as well as in survival prediction and patient stratification. This confirms the hypothesis that deep-learning techniques allow the extraction of higher-level spatial features that are deeply related to response to treatment. They might represent properties of the tumors that are indicative of treatment response, such as changes in shape, size or intensity.
Moreover, a multiple time-point analysis was performed. Typically, only data before the start of treatment is used for prediction, without including any information during treatment. In previous studies, longitudinal data have been used to predict immunotherapy response from baseline and first follow-up CT scans [14, 15]. However, using data before treatment and up to four months after treatment (up to three time points per patient), we were able to improve the predictions of durable clinical benefit of immunotherapy.
To the best of our knowledge, no previous studies have demonstrated that the integration of complementary longitudinal clinical and imaging data can significantly improve immunotherapy clinical benefit prediction. The ensembles of longitudinal models with deep-radiomics (DF-imm) and clinical data significantly improved prediction performance, achieving an AUC of 0.824 for PFS6
Fig. 7 Clinical model interpretation using SHAP. The summary plots show the impact of each clinical variable on the longitudinal RF model for endpoint PFS6 (a)
and endpoint PFS9 (b). A positive SHAP value indicates an increased risk of progression. Each point in the summary plot represents a patient
and an AUC of 0.753 for PFS9. These models significantly stratified patients in high- and low-risk groups for both PFS and OS (p-value < 0.05), and their predictions significantly correlated with PFS (PFS6 model: C-index 0.723, p-value = 0.004; PFS9 model: C-index 0.685, p-value = 0.030) and OS (PFS6 model: C-index 0.768, p-value = 0.002; PFS9 model: C-index 0.736, p-value = 0.023). After attempting to identify any unique characteristics among the patients with better survival, we found no significant differences in their clinical data. As a result, we have determined that the accurate predictions result from the model effectively integrating information from both the deep-features and clinical variables. As a comparison, Vanguri et al. [21] showed that integrating baseline medical imaging, histopathological and genomic features (multimodal model) outperformed unimodal models, achieving an AUC of 0.80 for the immunotherapy response prediction.
The final ensemble models considered changes in imaging tumor radiophenotypes and clinical covariates during early treatment. The SHAP analysis shows that for both PFS6 and PFS9 endpoints, the most important clinical variables were the NLR and the SII. High values of NLR and SII after the second cycle of therapy were highly associated with poor prognosis, probably because of a reduced antitumor effect of the immune system. This is consistent with the literature in which baseline NLR is considered a prognostic factor associated with a lower likelihood of treatment response [34], and inflammation markers, such as SII, are related to tumor growth, progression, and poor OS [35]. In our study, both NLR and SII early follow-up values are shown to be important for the clinical durable benefit of the therapy. Furthermore, the models considered that the presence of metastases in the liver before treatment was related to a worse outcome. On the other hand, higher levels of hemoglobin before and during treatment were associated with a better response to treatment.
Our study had some limitations. First, the retrospective and multi-center nature of the work implies a heterogeneity of the cohort in terms of treatment and imaging protocols. Second, the sample size of the two cohorts (FJD and CUN) was relatively large, but a considerable number of cases did not have longitudinal imaging data. Third, there was an important imbalance between responders and nonresponders for PFS9. The SMOTE technique was used to partially reduce this imbalance during the model training, but it did not result in performance comparable to the PFS6 models. To further improve the prediction of treatment response, it may be necessary to collect more data from patients with prolonged responses to treatment and/or include more time points in the analysis. Fourth, the interpretation of the deep-features is often not straightforward since they are optimized to minimize the prediction error and are not designed to match human intuition or knowledge. Despite the limitations, they can still offer insights into the relationships between the tumors’ image information and response prediction and contribute to making accurate predictions. Finally, no comparison with other prognostic biomarkers was made, such as PDL1 or tumor mutational burden, due to their inaccessibility. Similarly, for the definition of radiological
progression, the iRECIST criteria were not quantitatively evaluated by the radiologists, so that no comparison could have been performed. In addition, the integration of these biomarkers, as well as other new molecular parameters from liquid biopsies such as circulating tumor DNA, circulating tumor cells, circulating endothelial cells or the changes in variant allele frequencies with the deep features and clinical data used in the study, may enhance the performance of the models even further [36, 37].
Conclusion
In conclusion, an ensemble of longitudinal deep-radiomics and clinical data has been used to predict the durable clinical benefit of immunotherapy at 6 and 9 months after treatment. Our results demonstrate that integrating multidimensional and longitudinal data improves prediction performance. The model may be used as a prognostic biomarker and decision-support tool that can assist oncologists in identifying patients for whom the therapy is effective, avoiding premature interruptions or, on the other hand, the lengthening of an ineffective treatment.
Abbreviations
NSCLC Non-small cell lung cancer
CT Computed tomography
CI Confidence interval
RECIST Response evaluation criteria in solid tumors
FJD Hospital Universitario Fundación Jiménez Díaz
CUN Clínica Universidad de Navarra
PFS Progression-free survival
OS Overall survival
CNN Convolutional neural network
LIDC-IDRI The Lung Image Database Consortium and Image Database Resource Initiative Data Set
RF Random forest
SMOTE Synthetic minority oversampling technique
0.05) between the training and test set after two samples T-test for continuous variables, and Chi-square test for categorical variables. SD represents the standard deviation, and Q1 and Q3 represent the first and third quartiles, respectively. Table S7. Demographic and clinical characteristics of the patients in the longitudinal analysis with imaging data. P-values of no significant difference analysis (p-value > 0.05) between the training and test set after two samples T-test for continuous variables, and Chi-square test for categorical variables. SD represents the standard deviation, and Q1 and Q3 represent the first and third quartiles, respectively. Table S8. Number of features selected for each RF model. Longitudinal models had as input the concatenation of features extracted from baseline, 1st and 2nd follow-up data (n time steps = 3). In the case of clinical models, only 12 variables had continuous values. Table S9. Results of the implemented models in the training set for PFS6 and PFS9. The results are presented in terms of the area under the ROC curve (AUC) for the 3-fold cross validation. Table S10. Response prediction performance comparison between longitudinal and ensemble models in the independent test set for endpoint PFS6 by evaluating AUC, ACC, SENS, SPEC, PREC and bACC, respectively. For each metric, the 95% confidence interval is shown. The highest value for each metric is highlighted in bold. Table S11. Response prediction performance comparison between longitudinal and ensemble models in the independent test set for endpoint PFS9 by evaluating AUC, ACC, SENS, SPEC, PREC and bACC, respectively. For each metric, the 95% confidence interval is shown and the highest value is highlighted in bold.
Acknowledgements
Not applicable.
Author contributions
Experimental design: BF, ADRG, and MJLC. Collection and curation of radiological and clinical data: all authors. Data Analysis and Interpretation: BF, ADRG, and MJLC. Project supervision and resource acquisition: MJLC. Manuscript writing: BF and MJLC. All authors contributed to the article and reviewed and approved the manuscript.
Funding
The authors acknowledge the support of Ministerio de Ciencia e Innovación, Agencia Estatal de Investigación, under grants PDC2022-133865-I00, RTI2018-098682-B-I00 and PID2019-109820RB-I00 AEI/10.13039/501100011033/(MCIN/AEI/ERDF, UE), co-financed by European Regional Development Fund (ERDF), ‘A way of making Europe’. Additionally, this work has been developed with the financial support of Instituto de Salud Carlos III (ISCIII) project INGENIO (PMP21/00107) and the Next Generation EU funds. This work was partially funded by the Leonardo grant to researchers and cultural creators 2019 from Fundación BBVA. BF was supported by an FPI grant from Spain’s Ministry of
SHAP SHapley Additive exPlanations Education.
ROC Receiver operating characteristic curve
AUC Area under the ROC curve Availability of data and materials
HR Hazard ratio The immunotherapy data that support the findings of this study are available
NLR Neutrophils-to-lymphocytes ratio from the corresponding author, BF, upon reasonable request. The data from
SII Systemic immune-inflammation index LIDC-IDRI dataset are available in a public repository at https://wiki.cancerim‑
agingarchive.net/pages/viewpage.action?pageId=1966254
Supplementary Information
The online version contains supplementary material available at https://doi. Declarations
org/10.1186/s12967-023-04004-x.
Ethics approval and consent to participate
Additional file 1: Table S1. Number of patients in the training and This study was approved by the institutional review boards of each institution
independent test set for each model considering the endpoint PFS6. (Hospital Universitario Fundación Jiménez Díaz and Clínica Universidad de
Table S2. Number of patients in the training and independent test set for Navarra) involved and informed consent was collected accordingly.
each model considering the endpoint PFS9. Table S3. CT image acquisi‑
tion and reconstruction parameters for the two institutions involved in the Consent for publication
study: FJD and CUN. Table S4. Results of the feature repeatability against Not applicable.
segmentation and feature reproducibility. Table S5. Clinical variables
used for the implementation of the clinical models. Table S6. Demo‑ Competing interests
graphic and clinical characteristics of the patients in the baseline analysis The authors declare that the research was carried out in the absence of
with imaging data. P-values of no significant difference analysis (p-value > commercial or financial relationships that could be construed as a potential
competing interests.
Farina et al. Journal of Translational Medicine (2023) 21:174 Page 14 of 15
Publisher’s Note
Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.