
MEDICAL QUESTION ANSWERING AND PATIENT COHORT RETRIEVAL

by

Travis R. Goodwin

APPROVED BY SUPERVISORY COMMITTEE:

Sanda M. Harabagiu, Chair

Vibhav Gogate

Nicholas Ruozzi

Vincent Ng
Copyright © 2018

Travis R. Goodwin

All rights reserved


This dissertation is dedicated to my family.
MEDICAL QUESTION ANSWERING AND PATIENT COHORT RETRIEVAL

by

TRAVIS R. GOODWIN, BS, MS

DISSERTATION

Presented to the Faculty of

The University of Texas at Dallas

in Partial Fulfillment

of the Requirements

for the Degree of

DOCTOR OF PHILOSOPHY IN

COMPUTER SCIENCE

THE UNIVERSITY OF TEXAS AT DALLAS

May 2018
ACKNOWLEDGMENTS

I am deeply grateful to my PhD advisor, Dr. Sanda Harabagiu, for her guidance throughout my
PhD candidacy. Without her support, I would not have finished my research. I am also thankful to
my PhD committee members, Dr. Vibhav Gogate, Dr. Nicholas Ruozzi, and Dr. Vincent Ng, for
their support and feedback.

To Ramon Maldonado and Stuart Taylor, I thank you both for your support and friendship. I learned
much while working with you and look forward to following your success as you each embark on
your own dissertation research.

To Dr. Michael Skinner, I thank you for your advice and perspective. Our discussions helped guide
my research to be more clinically relevant, and expanded my understanding of the roles of medical
informatics and medicine.

To Dr. Bryan Rink and Dr. Kirk Roberts, I thank you both for your advice and support. Your
guidance at the beginning of my PhD candidacy when we worked in the same lab was invaluable.
More so, I appreciate the continued support both of you provided even after your respective
graduations.

Finally, I am thankful to my parents not only for their unwavering encouragement, but for giving
me the opportunity for my education. I am deeply thankful to my entire family for their support,
and would like to specifically thank my paternal grandfather, who turned down a job at NASA to
follow his faith, and my maternal grandmother, who not only completed her master’s but taught
English for forty years. Without their inspiration, I may never have sought my PhD.

March 2018

MEDICAL QUESTION ANSWERING AND PATIENT COHORT RETRIEVAL

Travis R. Goodwin, PhD


The University of Texas at Dallas, 2018

Supervising Professor: Sanda M. Harabagiu, Chair

With the advent of the electronic health record (EHR), there has been an explosion of rich medical

information available for automatic and manual analyses. While the majority of current medical

informatics research focuses on easily accessible structured information stored in medical databases,

it is widely believed that the majority of information in EHRs remains locked within unstructured

text. This dissertation aims to present research that will unlock the knowledge encoded in clinical

texts by automatically (1) identifying clinical texts relevant to a specific information need and (2)

reasoning about the information encoded in clinical text to answer medical questions posed in

natural language. Specifically, we address the tasks of medical question answering – analyzing

the knowledge encoded by EHRs documenting medical practice and experience as well as medical

research articles to automatically produce answers to medical questions posed by a physician – and

patient cohort retrieval – identifying patients who satisfy a given natural language description of

specific inclusion and exclusion criteria. Novel systems addressing both of these tasks are presented

and discussed. Moreover, this dissertation presents a number of approaches for overcoming some

of the most significant complexities of processing electronic health records. We present new

approaches for (1) modeling the temporal aspects of electronic health records – that is, the fact that

the clinical picture of a patient varies throughout his or her medical care – and show how these

approaches can be used to infer, represent, and predict temporal interactions of clinical findings and

observations; (2) inferring underspecified information and recovering missing sections of records;

and (3) applying machine learning to learn an optimal set of relevance criteria for a specific
set of information needs and collection of clinical texts. Combined, this work demonstrates the
importance of harnessing the natural language content of electronic health records and highlights
the promise of medical question answering and patient cohort retrieval for enabling more informed
patient care and improved patient outcomes.

TABLE OF CONTENTS

ACKNOWLEDGMENTS . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . v
ABSTRACT . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . vi
LIST OF FIGURES . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . xii
LIST OF TABLES . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . xiv
CHAPTER 1 INTRODUCTION . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 1
1.1 Background . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 1
1.2 Overview of Contributions . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 4
CHAPTER 2 MEDICAL QUESTION ANSWERING . . . . . . . . . . . . . . . . . . . . 11
2.1 System Architecture for Medical Question Answering . . . . . . . . . . . . . . . . 19
2.1.1 Inferring Medical Answers with Medical Knowledge Sketches . . . . . . . 19
2.1.2 Architecture of Medical Q/A System used in Clinical Decision Support . . 21
2.2 Inferring Medical Answers . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 26
2.2.1 Representing Medical Knowledge in the Clinical Picture and Therapy Graph 27
2.2.2 Inference Methods . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 31
2.3 Extracting Medical Knowledge with Medical Language Processing . . . . . . . . . 37
2.3.1 Identification of Medical Concepts . . . . . . . . . . . . . . . . . . . . . . 37
2.3.2 Recognizing the Medical Assertions . . . . . . . . . . . . . . . . . . . . . 39
2.4 Experimental Results . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 40
2.4.1 Medical Answer Evaluation . . . . . . . . . . . . . . . . . . . . . . . . . 42
2.4.2 Medical Article Retrieval Evaluation . . . . . . . . . . . . . . . . . . . . . 47
2.4.3 Medical Knowledge Evaluation . . . . . . . . . . . . . . . . . . . . . . . 50
2.5 Summary and Lessons Learned . . . . . . . . . . . . . . . . . . . . . . . . . . . . 51
CHAPTER 3 PATIENT COHORT RETRIEVAL . . . . . . . . . . . . . . . . . . . . . . 53
3.1 Constructing the Qualified Medical Knowledge Graph . . . . . . . . . . . . . . . . 56
3.2 Generating the Nodes of the Qualified Medical Knowledge Graph . . . . . . . . . 59
3.2.1 Medical Concept Recognition . . . . . . . . . . . . . . . . . . . . . . . . 60
3.2.2 Assigning Belief Values to Medical Concepts . . . . . . . . . . . . . . . . 62
3.3 Constructing the Edges of the Qualified Medical Knowledge Graph . . . . . . . . . 64

3.3.1 A Map-Reduce Representation . . . . . . . . . . . . . . . . . . . . . . . . 67
3.3.2 First Order Similarity . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 69
3.3.3 Second Order Similarity . . . . . . . . . . . . . . . . . . . . . . . . . . . 71
3.3.4 n-Order Similarity . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 73
3.4 Patient Cohort Retrieval with the Qualified Medical Knowledge Graph . . . . . . . 74
3.4.1 A Patient Cohort Retrieval System . . . . . . . . . . . . . . . . . . . . . . 77
3.4.2 Query Expansion Informed by the QMKG . . . . . . . . . . . . . . . . . . 81
3.5 Experimental Results . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 83
3.5.1 Evaluation of the Techniques for Discovering QMKG Nodes . . . . . . . . 83
3.5.2 Evaluation of the Techniques for Discovering QMKG Edges . . . . . . . . 86
3.5.3 Discussion . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 88
3.6 Summary and Lessons Learned . . . . . . . . . . . . . . . . . . . . . . . . . . . . 89
CHAPTER 4 MULTIMODAL PATIENT COHORT RETRIEVAL . . . . . . . . . . . . . 91
4.1 Background . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 93
4.2 The Data . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 95
4.3 Multimodal Patient Cohort Retrieval . . . . . . . . . . . . . . . . . . . . . . . . . 96
4.3.1 Indexing the EEG Big Data . . . . . . . . . . . . . . . . . . . . . . . . . 98
4.3.2 Section Identification . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 99
4.3.3 Medical Language Processing . . . . . . . . . . . . . . . . . . . . . . . . 99
4.3.4 Generating Fingerprints of EEG Signal Recordings. . . . . . . . . . . . . . 100
4.3.5 Organizing EEG Fingerprints into a Similarity-based Hierarchy . . . . . . 104
4.4 Query Analysis . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 104
4.4.1 Relevance Models . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 105
4.5 Experimental Results . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 107
4.5.1 Evaluation of Patient Cohort Discovery . . . . . . . . . . . . . . . . . . . 108
4.5.2 Evaluation of Polarity Classification . . . . . . . . . . . . . . . . . . . . . 110
4.6 Lessons Learned . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 111
4.7 Summary . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 113
CHAPTER 5 ACCOUNTING FOR LONGITUDINAL INFORMATION . . . . . . . . . . 114
5.1 Lattice Markov Networks . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 115

5.1.1 Related Work . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 116
5.1.2 The 2014 i2b2/UTHealth Dataset . . . . . . . . . . . . . . . . . . . . . . 117
5.1.3 Predicting the Progression of Clinical Findings . . . . . . . . . . . . . . . 118
5.1.4 Experimental Results . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 126
5.1.5 Lessons Learned . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 128
5.2 Inferring Temporal Interactions involving Risk Factors and Medications . . . . . . 128
5.2.1 Related Work and Background . . . . . . . . . . . . . . . . . . . . . . . . 129
5.2.2 Risk Factors and Medications . . . . . . . . . . . . . . . . . . . . . . . . 131
5.2.3 Generating the Graphical Model . . . . . . . . . . . . . . . . . . . . . . . 132
5.2.4 Experimental Results . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 142
5.2.5 Discussion . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 144
5.2.6 Lessons Learned . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 147
5.3 Jointly Learning to Predict and Cluster . . . . . . . . . . . . . . . . . . . . . . . . 148
5.3.1 Related Work and Background . . . . . . . . . . . . . . . . . . . . . . . . 149
5.3.2 The Approach . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 152
5.3.3 Experimental Results . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 163
5.3.4 Lessons Learned . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 167
5.4 Summary . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 168
CHAPTER 6 ACCOUNTING FOR MISSING OR UNDERSPECIFIED INFORMATION 169
6.1 Inferring Unspecified Information . . . . . . . . . . . . . . . . . . . . . . . . . . 170
6.1.1 Previous and Related Work . . . . . . . . . . . . . . . . . . . . . . . . . . 171
6.1.2 Inferring the Over-all Impression of EEG Reports with Deep Learning . . . 174
6.1.3 Results . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 184
6.1.4 Discussion . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 187
6.1.5 Lessons Learned . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 188
6.2 Recovering Missing Information . . . . . . . . . . . . . . . . . . . . . . . . . . . 188
6.2.1 Background . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 191
6.2.2 Recovering the Clinical Correlation Section of EEG Reports . . . . . . . . 193
6.2.3 Experiments . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 203

6.2.4 Discussion . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 205
6.2.5 Lessons Learned . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 208
6.3 Summary . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 209
CHAPTER 7 LEARNING TO RANK FOR MEDICAL INFORMATION RETRIEVAL . . 210
7.1 Background . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 212
7.2 Problem Formulation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 214
7.3 Searching Clinical Trials . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 215
7.3.1 Representing Clinical Trials . . . . . . . . . . . . . . . . . . . . . . . . . 216
7.4 Searching MEDLINE Articles . . . . . . . . . . . . . . . . . . . . . . . . . . . . 216
7.4.1 Indexing MEDLINE Articles . . . . . . . . . . . . . . . . . . . . . . . . . 216
7.4.2 Query Formulation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 217
7.4.3 Scoring MEDLINE Articles . . . . . . . . . . . . . . . . . . . . . . . . . 218
7.5 Learning-to-Rank (L2R) . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 218
7.5.1 Feature Extraction from MEDLINE Articles and Clinical Trials . . . . . . 219
7.5.2 The Deep Highway Network . . . . . . . . . . . . . . . . . . . . . . . . . 221
7.5.3 Ranking MEDLINE Articles . . . . . . . . . . . . . . . . . . . . . . . . . 224
7.6 Experiments . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 224
7.6.1 Quality Metrics . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 226
7.6.2 Results . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 227
7.7 Discussion . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 228
7.7.1 Implementation Details . . . . . . . . . . . . . . . . . . . . . . . . . . . . 230
7.8 Summary and Lessons Learned . . . . . . . . . . . . . . . . . . . . . . . . . . . . 231
CHAPTER 8 CONCLUSIONS . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 232
REFERENCES . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 236
BIOGRAPHICAL SKETCH . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 253
CURRICULUM VITAE

LIST OF FIGURES

2.1 Medical Question Answering: TREC-CDS Topics . . . . . . . . . . . . . . . . . . . . 13


2.2 Medical Question Answering: Typical System Architecture . . . . . . . . . . . . . . . 13
2.3 Medical Question Answering: Proposed Architecture (Simplified) . . . . . . . . . . . 17
2.4 Medical Question Answering: Proposed System Architecture (Detailed) . . . . . . . . 22
2.5 Medical Question Answering: The Clinical Picture and Therapy Graph . . . . . . . . . 30
2.6 Medical Question Answering: Architecture for Medical Concept and Assertion Detection 38
2.7 Medical Question Answering: Medical Language Processing . . . . . . . . . . . . . . 40
2.8 Medical Question Answering: Topic-level Answer Inference Results . . . . . . . . . . 44
3.1 Cohort Retrieval: The Qualified Medical Knowledge Graph . . . . . . . . . . . . . . . 55
3.2 Cohort Retrieval: Example Graph Nodes . . . . . . . . . . . . . . . . . . . . . . . . . 57
3.3 Cohort Retrieval: Concept Identification . . . . . . . . . . . . . . . . . . . . . . . . . 61
3.4 Cohort Retrieval: Concept Detection Features . . . . . . . . . . . . . . . . . . . . . . 63
3.5 Cohort Retrieval: Assertion Classification . . . . . . . . . . . . . . . . . . . . . . . . 64
3.6 Cohort Retrieval: Assertion Classification Features . . . . . . . . . . . . . . . . . . . 65
3.7 Cohort Retrieval: Proposed Architecture . . . . . . . . . . . . . . . . . . . . . . . . . 77
3.8 Cohort Retrieval: Query Decomposition . . . . . . . . . . . . . . . . . . . . . . . . . 80
3.9 Cohort Retrieval: Query Expansion . . . . . . . . . . . . . . . . . . . . . . . . . . . 82
3.10 Cohort Retrieval: Distribution of Nodes in the Qualified Medical Knowledge Graph . . 83
4.1 Multimodal Retrieval: The MERCuRY System . . . . . . . . . . . . . . . . . . . . . 96
4.2 Multimodal Retrieval: The MERCuRY Index . . . . . . . . . . . . . . . . . . . . . . 97
4.3 Multimodal Retrieval: Recurrent Neural Network . . . . . . . . . . . . . . . . . . . . 102
4.4 Multimodal Retrieval: Long Short-Term Memory . . . . . . . . . . . . . . . . . . . . 102
5.1 Temporal Models: Clinical Findings and Temporal Signals in Longitudinal Records . . 120
5.2 Temporal Models: Chronological Ordering . . . . . . . . . . . . . . . . . . . . . . . 122
5.3 Temporal Models: Model of Clinical Finding Progression . . . . . . . . . . . . . . . . 123
5.4 Temporal Models: Predicted Clinical Finding Progressions . . . . . . . . . . . . . . . 127
5.5 Temporal Models: Mathematical Structures . . . . . . . . . . . . . . . . . . . . . . . 135
5.6 Temporal Models: Model of Patient Chronologies . . . . . . . . . . . . . . . . . . . . 136

5.7 Temporal Models: Observation and Elapsed Time Tensors . . . . . . . . . . . . . . . 154
5.8 Temporal Model: Model of Patients’ Histories . . . . . . . . . . . . . . . . . . . . . . 155
5.9 Temporal Model: Bayesian Model of Patients’ Clinical Histories . . . . . . . . . . . . 157
5.10 Temporal Model: Distribution of Clinical Finding Observations . . . . . . . . . . . . 163
6.1 Missing Information: Over-all Impressions of EEG Reports . . . . . . . . . . . . . . . 173
6.3 Missing Information: Skip-gram Model . . . . . . . . . . . . . . . . . . . . . . . . . 178
6.4 Missing Information: Deep Averaging Network . . . . . . . . . . . . . . . . . . . . . 180
6.5 Missing Information: Recurrent Neural Network . . . . . . . . . . . . . . . . . . . . 182
6.6 Missing Information: Deep Section Recovery Model (DSRM) . . . . . . . . . . . . . 193
6.7 Missing Information: DSRM Extractor . . . . . . . . . . . . . . . . . . . . . . . . . . 197
6.8 Missing Information: DSRM Generator . . . . . . . . . . . . . . . . . . . . . . . . . 200
6.9 Missing Information: Inferred and Gold-standard Clinical Correlation Sections . . . . 206
7.1 Learning-to-Rank: Architecture of NCT Link . . . . . . . . . . . . . . . . . . . . . . 214
7.2 Learning-to-Rank: Mapping between Clinical Trials and MEDLINE Articles . . . . . 217
7.3 Learning-to-Rank: Deep Highway Network . . . . . . . . . . . . . . . . . . . . . . . 222
7.4 Learning-to-Rank: Rectified Linear Unit with and without Highway Mechanism . . . . 223

LIST OF TABLES

1.1 Introduction: Example Question . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 3


2.1 Medical Question Answering: Comparison of Medical Knowledge Sketches . . . . . . 27
2.2 Medical Question Answering: Assertion Values . . . . . . . . . . . . . . . . . . . . . 41
2.3 Medical Question Answering: Aggregate Answer Inference Results . . . . . . . . . . 43
2.4 Medical Question Answering: Inferred Medical Answers . . . . . . . . . . . . . . . . 45
2.5 Medical Question Answering: Information Retrieval Results . . . . . . . . . . . . . . 48
3.1 Cohort Retrieval: Example TRECMed Topics . . . . . . . . . . . . . . . . . . . . . . 76
3.2 Cohort Retrieval: Hospital Status . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 78
3.3 Cohort Retrieval: Gender Lexica . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 79
3.4 Cohort Retrieval: Assertion Performance . . . . . . . . . . . . . . . . . . . . . . . . 85
3.5 Cohort Retrieval: Similarity Measure Performance . . . . . . . . . . . . . . . . . . . 87
4.1 Multimodal Retrieval: Example Queries . . . . . . . . . . . . . . . . . . . . . . . . . 108
4.2 Multimodal Retrieval: Evaluation of Patient Cohorts . . . . . . . . . . . . . . . . . . 109
4.3 Multimodal Retrieval: Polarity Evaluation . . . . . . . . . . . . . . . . . . . . . . . . 111
5.1 Temporal Models: Clinical Findings for Heart Disease . . . . . . . . . . . . . . . . . 118
5.2 Temporal Models: Temporal Signal . . . . . . . . . . . . . . . . . . . . . . . . . . . 119
5.3 Temporal Models: Clinical Finding Predictions . . . . . . . . . . . . . . . . . . . . . 126
5.4 Temporal Models: Medication Types . . . . . . . . . . . . . . . . . . . . . . . . . . . 132
5.5 Temporal Models: Risk Factor Predictions . . . . . . . . . . . . . . . . . . . . . . . . 143
5.6 Temporal Models: Inferred Interactions between Risk Factors . . . . . . . . . . . . . . 145
5.7 Temporal Models: Inferred Interactions between Medications and Risk Factors . . . . 145
5.8 Temporal Models: Inferred Interactions between Risk Factors and Medications . . . . 146
5.9 Temporal Models: Latent Patient Sub-group Performance . . . . . . . . . . . . . . . . 161
5.10 Temporal Models: Predictive Performance of Clinical Finding Observations . . . . . . 164
6.1 Missing Information: Evaluation of Inferred Over-all Impressions . . . . . . . . . . . 186
6.2 Missing Information: Quantitative Evaluation of Inferred Clinical Correlation Sections 204
6.3 Missing Information: Qualitative Scale for Inferred Clinical Correlation Sections . . . 207
7.1 Learning-to-Rank: NCT Link Features . . . . . . . . . . . . . . . . . . . . . . . . . . 219
7.2 Learning-to-Rank: Evaluation of NCT Link . . . . . . . . . . . . . . . . . . . . . . . 227

CHAPTER 1

INTRODUCTION

As hospitals and governments across the globe continue to promote and adopt electronic health
records (EHRs), the potential for secondary use of EHR information is constantly expanding.
Although much of the information in EHRs is represented as structured data (e.g., vital signs,
laboratory results), the information encoded in the free-text portions of EHRs provides a rich but
underutilized (Chapman et al., 2011) source of medical knowledge. Harnessing this valuable source
of knowledge can aid in a variety of clinical and clinical research applications, including clinical
decision support (Demner-Fushman et al., 2009) and patient cohort retrieval (Hersh, 2009). This
dissertation focuses on two major applications in medical informatics: medical question answering and
patient cohort retrieval. We show how harnessing the rich information present in clinical narratives
as well as the longitudinal aspects of EHRs enables novel, accurate and reliable approaches for
information retrieval and question answering specifically adapted to the medical domain. The
remainder of this chapter presents a brief introduction to the fields of question answering and
information retrieval, followed by an outline of the main contributions described in each chapter,
and an overview of planned future work.

1.1 Background

This section provides a high-level overview of the fields of question answering (Q/A) and in-
formation retrieval (IR). Note: additional details on medical question answering are provided in
Chapter 2, while additional explanations regarding information retrieval and patient cohort retrieval
are provided by Chapters 3 and 4.

Question Answering (Q/A)

At the intersection of natural language processing, information retrieval, and artificial intelligence,
question answering (Q/A) has been a major focus of research for nearly six decades (Kolomiyets

and Moens, 2011; Green Jr. et al., 1961). The role of a question answering system is, given an

information need posed as a natural language question, to identify and present the best answer from

a pre-specified information source (Manning et al., 2008). An information need is the real-world

catalyst that leads the user to pose his or her question. For example, if a user were planning a trip to

Italy to visit historic museums, it is likely that he or she may be interested in discovering the oldest

museum in Italy. Thus, in this scenario, the user’s information need would be to discover the oldest

museum in Italy. In order to satisfy their information need, the user would interact with the question

answering system by providing a question as input to the system. Thus, the question can be seen as

the natural language realization of the user’s information need. To continue our example, the user’s

information need may be expressed by the question “What is the oldest museum in Italy?”. The

role of the question answering system is to satisfy the user’s information need by providing the best

answer to the given question. The answer is the entity, concept, or expression in the information

source that is most likely to satisfy the user’s information need as expressed by the question. In

our running example, the question answering system would identify the oldest museum in Italy –

Musei Capitolini (established in 1471) – and return that information to the user. In order to produce

an answer, a typical question answering system relies on a pre-specified information source, which

is typically a knowledge base (Katz et al., 2002; Hovy et al., 2000; Xu et al., 2016; Yih et al., 2016;

Moreda et al., 2011), database (Androutsopoulos et al., 1995; Green Jr. et al., 1961; Woods et al.,

1972; Androutsopoulos et al., 1993), or text collection (Omari et al., 2016; Chen et al., 2017; Wang

et al., 2007; Punyakanok et al., 2004; Rao et al., 2016) from which all answers must be derived.

Recently, however, many Q/A systems have considered a question answering scenario in which

some aspect of the information source is dynamic and may change with each question. Specifically,

Wang et al. (2016); Seo et al. (2016); Lee et al. (2016); Iyyer et al. (2014); Sukhbaatar et al. (2015);

Kadlec et al. (2016); Hermann et al. (2015) have considered an alternative question answering

setting in which the question answering system is provided with both a question and a background

providing additional context used to produce an answer. The background is a passage of natural

Table 1.1. Example of a medical question, its background and the correct answer produced by a
Q/A system.

Question: What is the most likely Diagnosis?


Background: The patient is a 65 y/o male with no significant history of cardiovascular
disease who presents to the emergency room with acute onset of shortness of breath, tachypnea,
and left-sided chest pain that worsens with inspiration. Of note, he underwent a right total
hip replacement two weeks prior to presentation and was unable to begin physical therapy and
rehabilitation for several days following the surgery due to poor pain management. Relevant
physical exam findings include a respiratory rate of 35 and right calf pain.
Answer: Pulmonary Embolism

language text which may vary between questions and acts as a supplemental information source for

the question answering system.

In this dissertation, we focus on medical question answering problems with an emphasis on

medical question answering for clinical decision support in which the background of a question

corresponds to a description of a patient’s medical case and the questions correspond to identifying

the best medical treatment, medical test, or diagnosis for the patient. Table 1.1 gives an example

of a medical question with a background along with its answer. Clearly, answering the question in

Table 1.1 requires accounting for the information in the question’s background, as well as medical

knowledge from an information source. Chapter 2 provides details on how the answers to medical

questions like this can be automatically produced by combining knowledge from the question’s

background with knowledge from medical practice and scientific literature.

Information Retrieval

Information retrieval is highly related to question answering. In fact, many question answering

systems include an information retrieval component. The goal of information retrieval is to

identify relevant information from an information source given a (typically natural language) query

(Manning et al., 2008). Unlike question answering, which requires that a specific answer be returned

in response to a question, information retrieval typically considers only document collections

(Voorhees et al., 1999) as the information source; thus, information retrieval systems return a ranked
list of documents. The notion of ranking is important to information retrieval, with the idea that the
rank or position of a document in the ranked list should correspond to the relevance between the
document and the query provided by the user. Thus, the ability to accurately estimate the relevance
between a document and a query is the central problem addressed by information retrieval, with
every component of an information retrieval system contributing, in some way, to improving the
ability to estimate relevance. It should be noted that, while question answering systems consider
natural language questions, the query processed by information retrieval systems may range from
natural language sentences, e.g., “Show me hotels in Lisbon.”, to sentence fragments, e.g., “places
to eat near Frisco”, or just combinations of important words and phrases, e.g., “Mexican restaurants
no seafood”.
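Since estimating the relevance between a query and a document is the central problem just described, the following minimal Python sketch shows one standard way of producing such a relevance score: the BM25 ranking function, which is referenced again in Chapters 3, 4, and 7. It is included only as an illustration of ad-hoc relevance scoring with conventional default parameters; it is not one of the relevance models developed in this dissertation, and the toy documents and vocabulary are invented.

```python
import math
from collections import Counter

def bm25_score(query_terms, doc_terms, doc_freqs, num_docs, avg_doc_len, k1=1.2, b=0.75):
    """Score one document (a list of tokens) against a query with the standard BM25 formula."""
    tf = Counter(doc_terms)
    score = 0.0
    for term in query_terms:
        if term not in tf or term not in doc_freqs:
            continue
        # Inverse document frequency: rarer terms contribute more to relevance.
        idf = math.log(1 + (num_docs - doc_freqs[term] + 0.5) / (doc_freqs[term] + 0.5))
        # Term frequency, saturated by k1 and normalized by document length via b.
        norm = tf[term] * (k1 + 1) / (tf[term] + k1 * (1 - b + b * len(doc_terms) / avg_doc_len))
        score += idf * norm
    return score

# Toy usage: rank two short documents for the query "chronic wound".
docs = [["chronic", "wound", "of", "the", "lower", "extremity"],
        ["fracture", "of", "the", "wrist"]]
dfs = {"chronic": 1, "wound": 1, "fracture": 1, "wrist": 1}
avg_len = sum(len(d) for d in docs) / len(docs)
query = ["chronic", "wound"]
print(sorted(range(len(docs)), key=lambda i: -bm25_score(query, docs[i], dfs, len(docs), avg_len)))
```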
In this dissertation, we focus on a specific application of information retrieval, namely, patient
cohort retrieval, (Hersh, 2009). Patient cohort retrieval is an important problem in medical research
in which a patient cohort is identified from a set of electronic health records given a natural language
description of a patient cohort (Voorhees and Tong, 2011; Voorhees and Hersh, 2012). A patient
cohort corresponds to a group of patients who satisfy specific exclusion or inclusion criteria,
e.g., “Pregnant women over 30 who are not on insulin”, and typically describes candidates for a
medical research study or clinical trial. It should be noted that patient cohort retrieval systems
are responsible for ranking patients, and that each patient may correspond to multiple documents
or EHRs (Edinger et al., 2012; Chapman et al., 2011). In Chapters 3 and 4, patient cohorts are
represented by a ranked list of patients wherein the rank or relevance of each patient indicates the
degree to which the patient satisfies the criteria associated with the cohort.

1.2 Overview of Contributions

In this dissertation, we describe novel approaches to medical question answering and patient cohort
retrieval which obtain state-of-the-art performance on a variety of standard datasets (Voorhees and

Tong, 2011; Voorhees and Hersh, 2012; Simpson et al., 2014; Roberts et al., 2015; Uzuner et al.,

2011; Uzuner and Stubbs, 2015). An overview of the contributions made in each chapter of this

dissertation is as follows:

Chapter 2 presents a hybrid medical question answering and medical information retrieval system

(Goodwin and Harabagiu, 2016, 2017) used for clinical decision support. The system was inspired

by the Text REtrieval Conference (TREC) Clinical Decision Support (TREC-CDS) evaluations

conducted during 2014 and 2015 (Simpson et al., 2014; Roberts et al., 2015). The TREC-CDS

track was strictly an information retrieval challenge in which participants were given a query

corresponding to a medical case description and one of three expected medical answer types –

diagnosis, test, or treatment – and asked to return a ranked list of scientific articles from the

PubMed Central Open Access Subset (Varmus et al., 1999) which are likely to contain the answer.

In this chapter, we showed that information retrieval performance could be significantly improved

by casting the challenge as a question answering problem, and first trying to determine the most

likely answer from medical knowledge and then ranking articles based on the answers they contain.

Specifically, we describe a novel medical question answering system which automatically extracted

medical knowledge and beliefs accumulated through practice using natural language processing on

the electronic health records (EHRs) provided in the MIMIC-III critical care database (Johnson

et al., 2016) and represented it as a factorized Markov network (Koller and Friedman, 2009) which we

named the Clinical Picture and Therapy Graph (CPTG) (Goodwin and Harabagiu, 2016, 2017).

This chapter presents three different approaches for representing medical knowledge about the

medical case associated with each query, namely: (1) the medical concepts distilled from the

query; (2) the medical concepts distilled from both the query and each potentially-relevant PubMed

article; and (3) the medical concepts and assertions distilled from the query and each paragraph of a

potentially-relevant PubMed article. To infer the answer using the CPTG, we evaluated four different

inference techniques: exact inference using an inverted index, pair-wise smoothed inference, a novel

interpolated smoothed inference technique, and the Bethe free energy approximation. The

experimental results described in the chapter show that (1) combining medical question answering

with information retrieval and (2) incorporating knowledge from medical practice, experience, and

research obtains state-of-the-art performance (Goodwin and Harabagiu, 2016, 2017).
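To make the idea of inferring answers from knowledge distilled out of EHRs concrete, the sketch below ranks candidate answers by a smoothed co-occurrence estimate over patient records, treating each record as a set of medical concepts. This is only a toy approximation in the spirit of the exact inference over an inverted index mentioned above; the records, concept names, and additive smoothing are invented for illustration, and the actual CPTG factors and inference methods are defined in Chapter 2.

```python
from collections import Counter

def rank_candidate_answers(question_concepts, candidate_answers, patient_records, alpha=1.0):
    """Rank candidate answers (e.g., diagnoses) by a smoothed estimate of
    P(answer | question concepts), counting co-occurrences over patient records."""
    question = set(question_concepts)
    support = [r for r in patient_records if question <= set(r)]  # records matching all question concepts
    counts = Counter(a for r in support for a in candidate_answers if a in r)
    denom = len(support) + alpha * len(candidate_answers)
    scores = {a: (counts[a] + alpha) / denom for a in candidate_answers}
    return sorted(scores.items(), key=lambda kv: kv[1], reverse=True)

# Toy example with three invented patient records and two candidate diagnoses.
records = [{"dyspnea", "tachypnea", "pulmonary embolism"},
           {"dyspnea", "pneumonia"},
           {"dyspnea", "tachypnea", "calf pain", "pulmonary embolism"}]
print(rank_candidate_answers({"dyspnea", "tachypnea"},
                             ["pulmonary embolism", "pneumonia"], records))
```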

Chapter 3 discusses two systems we developed for patient cohort retrieval: Cohort Shepherd

(Goodwin et al., 2011), and Cohort Shepherd II (Goodwin et al., 2012). The Cohort Shepherd

systems were designed for the TREC Medical Records track (TRECMed) during 2011 (Voorhees

and Tong, 2011) and 2012 (Voorhees and Hersh, 2012). The goal of the track was to identify patients

from a collection of EHRs provided by the University of Pittsburgh BluLAB NLP repository 1 given

a query describing a specific patient cohort (e.g., “women with osteoporosis”). In this chapter,

we outline a novel approach for key-phrase detection which is able to account for the prevalence

of multi-word expressions in medical language by incorporating knowledge from Wikipedia and

PubMed (Goodwin et al., 2011, 2012; Goodwin and Harabagiu, 2013a,c). For example, given

the query “patients with lower extremity chronic wound”, our key-phrase detection method would

identify that the entire phrase “lower extremity chronic wound” corresponds to a single key-phrase.

We also discuss the role of key-phrase decomposition, in which each key-phrase is represented by

a separate sub-query (e.g., “lower extremity” and “chronic wound”) allowing scores for each EHR

to be computed by considering the relevance of all matched sub-queries. The chapter presents

four methods of query expansion based on (1) the Unified Medical Language System (UMLS)

(Bodenreider, 2004); (2) the Systematized Nomenclature of Medicine – Clinical Terms (SNOMED-

CT) (Stearns et al., 2001); (3) Wikipedia article titles and redirects; and (4) co-occurrence information

from PubMed (Varmus et al., 1999). In addition to query expansion, we investigate a number of

re-ranking techniques to enforce specific inclusion criteria such as hospital status, gender, and

age (Goodwin et al., 2012). Finally, and most importantly, we present and analyze the role of an

1 http://www.dbmi.pitt.edu/

automatically constructed qualified medical knowledge graph (QMKG) (Goodwin and Harabagiu,
2014, 2013b,c,a). This chapter outlines how a patient cohort retrieval system can be designed
and implemented, as well as a number of strategies for improving the accuracy and reliability of
automatically identified patient cohorts.
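As a small illustration of the key-phrase decomposition idea described above, the following sketch greedily splits a detected key-phrase into known shorter phrases and combines per-sub-query scores with a simple sum. The greedy matcher, the phrase dictionary, and the additive combination are simplifying assumptions made for this example; the detection and scoring methods actually used, which rely on Wikipedia and PubMed statistics, are described in Chapter 3.

```python
def decompose_keyphrase(keyphrase, known_phrases):
    """Split a key-phrase into sub-queries by greedily matching the longest known sub-phrase."""
    tokens = keyphrase.split()
    sub_queries, i = [], 0
    while i < len(tokens):
        # Prefer the longest known sub-phrase starting at position i; fall back to the single token.
        for j in range(len(tokens), i, -1):
            candidate = " ".join(tokens[i:j])
            if candidate in known_phrases or j == i + 1:
                sub_queries.append(candidate)
                i = j
                break
    return sub_queries

def score_patient(sub_query_scores):
    """Combine per-sub-query relevance scores for one patient's EHRs (here, a simple sum)."""
    return sum(sub_query_scores.values())

# Example with the query phrase discussed in this chapter.
phrases = {"lower extremity", "chronic wound"}
print(decompose_keyphrase("lower extremity chronic wound", phrases))
print(score_patient({"lower extremity": 1.7, "chronic wound": 2.4}))
```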

Chapter 4 provides additional insights on the task of patient cohort retrieval. In this chapter, a
multimodal patient cohort retrieval system, named MERCuRY (Goodwin and S, 2016), is intro-
duced. Unlike the Cohort Shepherd I and II systems described in Chapter 3, MERCuRY is a patient
cohort retrieval system specifically tailored for the neurology domain. MERCuRY relies on a novel
multimodal index containing the natural language content of EEG reports, as well as automatically
learned fingerprints of the EEG signal associated with each report. This chapter details how the
multi-modal index is constructed, as well as our deep learning techniques for producing EEG signal
fingerprints. Moreover, we detail two new relevance models extending the popular BM25F ranking
function (Zaragoza et al., 2004) to account for the role of polarity in EEG reports. We show how
the multimodal index can be used for a novel form of pseudo-relevance feedback based on the
geometric properties of the automatically inferred EEG fingerprints. Finally, we present a new
dataset of queries and relevance judgments based on the Temple University Hospital EEG Corpus
(Harati et al., 2013).
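The sketch below illustrates only the field-weighting idea behind BM25F-style scoring over the sections of an EEG report; the section names, weights, and the omission of per-section length normalization and of the polarity factor are simplifications made for this example, not the relevance models actually used by MERCuRY (Chapter 4).

```python
import math

def bm25f_like_score(query_terms, report_sections, section_weights, doc_freqs, num_reports, k1=1.2):
    """Weight term frequencies from each report section, combine them into a pseudo frequency,
    and apply the usual BM25 saturation and inverse document frequency."""
    score = 0.0
    for term in query_terms:
        # Occurrences in, e.g., the impression section count more than those in the history.
        pseudo_tf = sum(w * report_sections.get(name, []).count(term)
                        for name, w in section_weights.items())
        if pseudo_tf == 0 or term not in doc_freqs:
            continue
        idf = math.log(1 + (num_reports - doc_freqs[term] + 0.5) / (doc_freqs[term] + 0.5))
        score += idf * pseudo_tf * (k1 + 1) / (pseudo_tf + k1)
    return score

# Toy usage over one report with two sections and invented collection statistics.
report = {"impression": ["abnormal", "eeg", "focal", "slowing"],
          "history": ["seizure", "history"]}
weights = {"impression": 2.0, "history": 1.0}
print(bm25f_like_score(["focal", "slowing"], report, weights, {"focal": 40, "slowing": 120}, 1000))
```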

Chapter 5 discusses the role of longitudinal information in electronic health records. That is,
accounting for the fact that the information recorded in EHRs changes over time for each patient.
Through each patient’s care, clinicians generate multiple electronic medical records (EMRs) which
document a wide variety of clinical observations, such as the patient’s diagnoses, risk factors,
medications, and test results. In this chapter we explore the longitudinal information encoded in
the set of longitudinal EMRs provided by the organizers of the Challenges in Language Processing
for Clinical Data shared task sponsored by the 2014 Informatics for Integrating Biology and the
Bedside (i2b2) and The University of Texas Health Science Center (UTHealth) (Uzuner and Stubbs,

2015). Three separate probabilistic graphical models are presented for capturing longitudinal

information from this dataset. First (in Section 5.1), we present a simple lattice Markov network

for predicting the presence or absence of five risk factors for heart disease in successive EMRs for

diabetic patients (Goodwin and Harabagiu, 2015). The network relies on temporal inference based

on human-annotated temporal signals to produce a chronological ordering of risk factors for each

patient. Section 5.2 presents a significant extension to the lattice Markov network in which we

consider all seven risk factors as well as the set of twenty-two medication types evaluated in the

2014 i2b2/UTHealth shared task. Moreover, in addition to enabling prediction, this second model

is able to infer causal interactions across successive EMRs. Specifically, we investigate the role of

risk factors positively predicting successive risk factors, medications negatively predicting successive

risk factors, and risk factors positively predicting medications. These interactions are modeled

using Noisy-Or and Noisy-And distributions from Bayesian belief networks (Pearl, 1986), and

inference is achieved using Gibbs sampling (Geman and Geman, 1984). The third and final model

presented in the chapter encodes the progression of a patient’s clinical chronology and jointly learns

to predict clinical observations in time and cluster patients into latent sub-populations with similar

progressions. Specifically, a separate predictive model is learned for each sub-population, and

inference is achieved using Expectation Maximization (EM) to iteratively refine the patient cluster

assignments and update the parameters of the predictive models for each cluster. The three models

in this chapter make it possible to automatically infer and model longitudinal information from

EHRs, improving the ability of patient cohort retrieval and medical question answering systems to

account for longitudinal data.
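To illustrate the Noisy-Or interaction model mentioned above, the sketch below computes the probability that a clinical observation appears in a successive record given which candidate causes were observed previously, and then draws one sample from that conditional, which is the elementary step a Gibbs sampler repeats over all variables. The variables, activation probabilities, and leak term are invented for the example, and the full model and sampler of Section 5.2 are not reproduced here.

```python
import random

def noisy_or(parent_states, activation_probs, leak=0.01):
    """P(child = 1 | parents) under a Noisy-Or distribution: each active parent
    independently 'causes' the child with its own activation probability."""
    p_no_activation = 1.0 - leak
    for state, p in zip(parent_states, activation_probs):
        if state:  # only active parents contribute
            p_no_activation *= (1.0 - p)
    return 1.0 - p_no_activation

def sample_child(prob_child_active):
    """One Gibbs-style draw for a single binary variable given its conditional probability."""
    return random.random() < prob_child_active

# Example: two of three candidate causes were observed in the previous record;
# how likely is the observation in the next record? (all probabilities are invented)
p = noisy_or([1, 1, 0], [0.30, 0.20, 0.50])
print(round(p, 3))        # ~0.446
print(sample_child(p))    # True or False
```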

Chapter 6 explores strategies for overcoming missing and underspecified information in EHRs.

Data quality issues when adapting EHRs for retrospective analyses and medical information tasks

have been widely reported (Weiner, 2011; Hersh, 2012). One of the most pervasive data quality

issues when using EHRs for medical information retrieval problems such as patient cohort retrieval

and medical question answering is the "prevalence of missing (Smith et al., 2005), inconsistent, or

underspecified data (O’malley et al., 2005; Berlin and Stang, 2011)" (Goodwin and S, 2017). In this

chapter, we present a case study using the Temple University Hospital EEG Corpus (introduced in

Chapter 4) for which we explore two scenarios for inferring missing or underspecified information.

First, in Section 6.1, we present an approach for inferring the "over-all" impression of an EEG

report – i.e., whether the EEG indicates abnormal or normal brain activity. Section 6.2 extends

these ideas to consider a more challenging problem: recovering the information from missing

sections in EEG reports. Specifically, Section 6.2 describes a novel neural network architecture

capable of reasoning about an EEG report and generating the clinical correlation section – a natural

language description of how the findings documented in an EEG report correspond to the general

clinical picture of the patient. Together, the two models presented in this chapter demonstrate the

ability of deep learning techniques to identify patterns in a large dataset and recover missing or

underspecified information from individual documents. The automatically recovered information,

in turn, has the potential to improve recall for medical information retrieval systems operating on

those documents.
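As one example of how the over-all impression of an EEG report can be inferred, the sketch below follows a deep-averaging style (one of the architecture families explored in Section 6.1): it averages the word vectors of a report and applies a single hidden layer to produce P(abnormal). The vocabulary, dimensions, and random untrained parameters are purely illustrative; the models evaluated in Chapter 6 are trained on the Temple University Hospital EEG Corpus and are considerably larger.

```python
import numpy as np

def predict_overall_impression(report_tokens, embeddings, W_hidden, b_hidden, w_out, b_out):
    """Average the word vectors of a report, pass them through one ReLU layer,
    and emit the probability that the EEG is abnormal."""
    vecs = [embeddings[t] for t in report_tokens if t in embeddings]
    if not vecs:
        return 0.5  # no known words: stay uncommitted
    avg = np.mean(vecs, axis=0)                           # report representation
    hidden = np.maximum(0.0, W_hidden @ avg + b_hidden)   # hidden layer
    logit = w_out @ hidden + b_out
    return 1.0 / (1.0 + np.exp(-logit))                   # sigmoid -> P(abnormal)

# Toy usage with random (untrained) parameters and 8-dimensional embeddings.
rng = np.random.default_rng(0)
emb = {w: rng.normal(size=8) for w in ["sharp", "waves", "focal", "slowing", "normal"]}
W, b = rng.normal(size=(16, 8)), np.zeros(16)
w_o, b_o = rng.normal(size=16), 0.0
print(predict_overall_impression(["focal", "slowing", "and", "sharp", "waves"],
                                 emb, W, b, w_o, b_o))
```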

Chapter 7 demonstrates how a supervised machine learning framework known as learning-to-

rank (Liu, 2011) can be applied to medical information retrieval. While medical information

retrieval approaches typically rely on either a custom hand-tuned relevance model (e.g., Chapter 2)

or standard relevance models developed for ad-hoc retrieval such as BM25 (Robertson et al., 1996)

(e.g., Chapters 3 and 4), learning-to-rank enables an optimal relevance model to be learned for

a specific dataset and application. In Chapter 7, we explore the task of enriching the output of

a medical information retrieval system designed for clinical decision support, such as the system

described in Chapter 2. Specifically, we aim to augment the relevant scientific literature retrieved

for a patient’s medical case with pertinent clinical trials. We demonstrate how to design a learning-

to-rank system named NCT Link which is trained to automatically link clinical trials to published

scientific articles reporting their results. The experimental results and analysis reported in this
chapter provide evidence indicating that learning-to-rank can be successfully applied to medical
information retrieval problems.
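For readers unfamiliar with highway networks, the sketch below implements a single highway layer in the standard formulation: a sigmoid transform gate decides, per dimension, how much of a nonlinear transform of the input to keep and how much of the input to carry through unchanged, which eases the training of deep stacks. The dimensions, initialization, and negative gate bias shown here are common conventions used only for illustration, not the configuration of the deep highway network described in Chapter 7.

```python
import numpy as np

def highway_layer(x, W_h, b_h, W_t, b_t):
    """One highway layer: output = T(x) * H(x) + (1 - T(x)) * x."""
    h = np.maximum(0.0, W_h @ x + b_h)            # H(x): candidate transformation (ReLU)
    t = 1.0 / (1.0 + np.exp(-(W_t @ x + b_t)))    # T(x): transform gate in (0, 1)
    return t * h + (1.0 - t) * x                  # mix the transform with the carried input

# Toy usage: a 4-dimensional feature vector passed through one layer.
rng = np.random.default_rng(1)
x = rng.normal(size=4)
W_h, b_h = rng.normal(size=(4, 4)), np.zeros(4)
W_t, b_t = rng.normal(size=(4, 4)), np.full(4, -1.0)  # negative bias initially favors carrying the input
print(highway_layer(x, W_h, b_h, W_t, b_t))
```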
Finally, Chapter 8 summarizes the information provided in each chapter.

CHAPTER 2

MEDICAL QUESTION ANSWERING

Authors – Travis R. Goodwin, and Sanda M. Harabagiu

The Department of Computer Science, EC 31

The University of Texas at Dallas

800 West Campbell Road

Richardson, Texas 75080-3021

Minor revision, with permission, of Travis R. Goodwin and Sanda M. Harabagiu, Knowledge
Representations and Inference Techniques for Medical Question Answering, The Association for
Computing Machinery (ACM) Transactions on Intelligent Systems and Technology (TIST), Vol. 9,
Issue 2, October 2017. doi:10.1145/3106745.

In their everyday practice, physicians face a variety of clinical decisions regarding the care of

their patients, e.g., deciding the diagnosis, the test(s) or the treatment that is best suited for the

patient’s care. Clinical Decision Support (CDS) systems have been designed to help physicians

address the myriad of complex clinical decisions that might arise during a patient’s care (Garg et al.,

2005). By leveraging the fact that patient care is documented in electronic medical records (EMRs),

one of the goals of modern CDS systems is to anticipate the information needs of physicians by

linking EMRs with information relevant for patient care, retrieved from the bio-medical literature.

The special track on Clinical Decision Support in the Text REtrieval Conference (TREC-CDS)

(Simpson et al., 2014) has addressed the challenge of retrieving bio-medical articles relevant to a

medical case when answering one of three generic medical questions: (a) “What is the diagnosis?”;

(b) “What test(s) should be ordered?”; and (c) “Which treatment(s) should be administered?”. The

TREC-CDS track did not rely on a collection of EMRs, but instead it used an idealized representation

of medical records in the form of 30 short medical case reports, each describing a challenging

medical case. The medical case reports were presented in two formats: (a) a detailed narrative describing

the fragments from the patient’s EMRs that were pertinent to the case, or (b) a

summary of the case. Medical case reports (in both formats) along with one of the generic questions

were considered topics in TREC-CDS. Thus, systems developed for the TREC-CDS challenge were

provided with a list of topics and were expected to use either the medical case description or the

summary to answer the question by providing a ranked list of articles available from PubMed

Central (Varmus et al., 1999) containing the answers. As only one of the three generic questions

was asked in each topic, the expected medical answer type (EMAT) of the question was diagnosis,

test or treatment. Figure 2.1 illustrates three examples of topics evaluated in the 2015 TREC CDS,

one example per EMAT. Figure 2.1 also illustrates the correct answer of each of the questions.

Most systems that participated in TREC-CDS used architectures similar to the one illustrated

in Figure 2.2. The topics were processed to discover terms or concepts that were used to generate

a query. Queries were expanded to enhance the quality of retrieval enabled by a relevance model

Topic 33 (EMAT: Diagnosis)
Description: A 65 yo male with no significant history of cardiovascular disease presents to the
emergency room with acute onset of shortness of breath, tachypnea, and left-sided chest pain that
worsens with inspiration. Of note, he underwent a right total hip replacement two weeks prior to
presentation and was unable to begin physical therapy and rehabilitation for several days following
the surgery due to poor pain management. Relevant physical exam findings include a respiratory
rate of 35 and right calf pain.
Summary: A 65-year-old male presents with dyspnea, tachypnea, chest pain on inspiration, and
swelling and pain in the right calf.
Answer: Pulmonary Embolism

Topic 42 (EMAT: Test)
Description: A 44-year-old man was recently in an automobile accident where he sustained a skull
fracture. In the emergency room, he noted clear fluid dripping from his nose. The following day he
started complaining of severe headache and fever. Nuchal rigidity was found on physical examination.
Summary: A 44-year-old man complains of severe headache and fever. Nuchal rigidity was found
on physical examination.
Diagnosis: Bacterial Meningitis
Answer: Spinal Tap / Cerebrospinal Fluid Analysis

Topic 54 (EMAT: Treatment)
Description: A 31 yo male with no significant past medical history presents with productive cough
and chest pain. He reports developing cold symptoms one week ago that were improving until two
days ago, when he developed a new fever, chills, and worsening cough. He has right-sided chest pain
that is aggravated by coughing. His wife also had cold symptoms a week ago but is now feeling well.
Vitals signs include temperature 103.4, pulse 105, blood pressure 120/80, and respiratory rate 15.
Lung exam reveals expiratory wheezing, decreased breath sounds, and egophany in the left lower
lung field.
Summary: A 31 year old male presents with productive cough, chest pain, fever and chills. On exam
he has audible wheezing with decreased breath sounds and dullness to percussion.
Diagnosis: Community Acquired Pneumonia (CAP)
Answer: Antibiotics

Figure 2.1. Examples of topics evaluated in the 2015 TREC CDS track.

[Figure: a TREC-CDS topic (topic description, topic summary, and expected medical answer type) feeds topic processing and query expansion; a relevance model operating over the PubMed Central index produces a ranked list of relevant articles; the diagram also depicts the dependencies among the medical findings, diagnosis, test, and treatment.]

Figure 2.2. Architecture of a typical medical question answering system for clinical decision
support.

operating on the index of the PubMed Central collection. Roberts et al. (2016) details the variety

of topic processing, query expansions and relevance models used by the various systems. Notably,

Roberts et al. (2016) also discusses the relations between the three types of expected medical

answers (EMATs), namely the diagnosis, the tests and the treatments as well as the medical

findings pertaining to the difficult medical case addressed by the CDS topics. Figure 2.2 also

illustrates these relationships which influence the clinical decision making process, as stated in

Roberts et al. (2016). Medical findings mentioned in the medical case description and/or summary

require the inference of a diagnosis (differential or confirmed) and require tests to be ordered,

either preliminary or confirmatory. The tests may confirm the diagnosis, thus an additional relation

exists between the tests and the diagnosis. Treatments are dependent on both the diagnosis and

the results of the tests. As Figure 2.2 illustrates, the EMAT of a CDS topic cannot be considered

in isolation, given the dependencies between the three EMATs considered in TREC-CDS and the

medical findings pertaining to the medical case. Moreover, we believe that these dependencies

should be further explored, an insight that was not considered by any of the systems participating

in the TREC-CDS.

In this chapter, we present and evaluate multiple forms of medical knowledge representations

and show how they can be used for medical question answering (Q/A). We contemplated knowledge

representations in which the EMATs of questions evaluated in TREC-CDS are considered along

with additional medical concepts, as well as the connections shared by them. However, we constrained

these representations to take into account connections between medical concepts that are observed

in medical practice, and thus are infer-able from a vast EMR collection. Moreover, we considered

a probabilistic representation of the medical knowledge, in which answers to the medical questions

like those evaluated in TREC-CDS can be inferred instead of only being searched. Thus, we

believed that by focusing on medical knowledge representations and experimenting with inference

methods operating on them, we could not only find the optimal knowledge representations and

inference methods for medical Q/A, but we could also further improve the medical article retrieval

results reported in Goodwin and Harabagiu (2016), as it would allow us to produce new answer-

informed relevance models. Our belief that improved answer inference must lead to enhanced article

retrieval originated from observations of results of the 2015 TREC-CDS.

In the 2015 TREC-CDS track a new task was offered, in which for questions having the EMAT

∈ {test; treatment}, the patient’s diagnosis was provided (shown in Figure 2.1). In this way, some

of the dependencies related to clinical decision were exposed. The results for this new task, as

reported in Roberts et al. (2015) were superior to the results for the same topics when no diagnoses

were provided. This observation let us to believe that medical knowledge related to the EMAT

can be considered as a partial answer to the medical question of the CDS topic. The results from

the evaluation of the new TREC-CDS task indicated that knowledge of a partial answer leads to

significantly improved retrieval of relevant bio-medical literature. Thus, we asked ourselves if (1)

we could automatically assemble medical knowledge that could provide partial answers for any

CDS topic; and (2) if we could in fact identify the answers to medical questions from the CDS

topics with acceptable accuracy, if such medical knowledge would be available. More importantly,

we wondered if we should first try to find the answer and then rank the relevant scientific articles

for any given medical question.

It was clear to us from the beginning that answer identification would be a harder problem,

unless we could tap into a new form of knowledge and consider answering the questions directly

from a knowledge base (KB). Question answering (Q/A) from KBs has experienced a recent

revival. In the 60’s and 70’s, domain-specific knowledge bases were used to support Q/A, e.g.,

the Lunar Q/A system (Woods, 1973). With the recent growth of KBs such as DBPedia (Auer

et al., 2007) and Freebase (Bollacker et al., 2008), new promising methods for Q/A from KBs have

emerged (Dong et al., 2015; Yao and Van Durme, 2014; Bao et al., 2014). These methods map

questions into sophisticated meaning-representations which are used to retrieve the answers from

the KB. However, we believe that when probabilistic representations of the medical knowledge

are available, medical Q/A from KB can achieve significant accuracy when the inference methods

take advantage of the various forms of medical knowledge. It is equally important to capture the

background of medical questions in medical knowledge sketches. We considered three different

forms of medical knowledge sketches that combine in different ways the knowledge processed from

the description of the medical case with knowledge processed from clinical practice and medical

research. Therefore, in this chapter, we explore three different possible medical knowledge sketches

and four different inference methods to discover which combination of knowledge representation

and inference approach produces optimal results of medical Q/A on TREC-CDS data.

The ability to cast the problem of answering medical questions for CDS topics as a Q/A from

KB problem depends on the availability of a large medical knowledge base in which the clinical

picture (comprising the medical findings and the diagnoses) and therapy (comprising the tests

and treatments) of a vast population of patients is captured. Moreover, in this medical knowledge

base, the dependencies between diagnoses, medical findings, tests and treatments would be not

only available, but also captured at the level of each patient and medical case, providing the

knowledge granularity required by the medical questions evaluated in TREC-CDS (as illustrated

in Figure 2.2). To our knowledge, no such knowledge base is readily available. Widely used

medical ontologies such as the Unified Medical Language System (UMLS) (Bodenreider, 2004)

and MeSH (Lipscomb, 2000) encode a large number of medical concepts, but these ontologies do

not relate medical concepts to any specific medical case. We believe that a medical knowledge

base that could inform the medical questions asked in TREC-CDS should capture the knowledge

documented in a vast collection of medical records. For this purpose, we automatically generated

a very large medical knowledge graph from a publicly available collection of electronic medical

records (EMRs). Because, as reported in Roberts et al. (2016), the medical case descriptions from

the TREC-CDS topics were generated by consulting the EMRs from MIMIC-II (Lee et al., 2011),

we used all the publicly available EMRs provided by MIMIC-III (a more recent superset of the

EMRs in MIMIC-II) to automatically generate a very large knowledge graph designed to encode

knowledge acquired from medical practice.

[Figure: a TREC-CDS topic (topic description, topic summary, and expected medical answer type) undergoes topic processing; answer inference over the Clinical Picture & Therapy Graph, built from electronic health records, produces a ranked list of answers relating the medical findings, diagnosis, test, and treatment; a relevance model informed by these answers, operating over the article index, produces the ranked list of relevant articles.]

Figure 2.3. Architecture of our medical question answering system for clinical decision support.

We organized the medical knowledge acquired from the EMR collection into a clinical picture

and therapy graph (CPTG), which informed, along with the medical knowledge sketches resulting

from topic processing, answer inference – the central component of the new medical Q/A system

illustrated in Figure 2.3. The answers, ranked by their likelihood, enable novel answer-informed

relevance models to produce a ranked list of relevant biomedical articles, based on the index of

PubMed Central.

In the architecture illustrated in Figure 2.3, the CPTG is automatically generated by (1) pro-

cessing the language from the narratives of medical records to identify medical concepts (and their

assertions); and (2) inferring probabilistically the connections between medical concepts. TREC-

CDS topics can be processed to discern medical concepts and their assertions in the same format as

the one used in the CPTG. By identifying (a) several forms of medical concepts including signs and

symptoms, diagnoses, as well as tests and treatments; and (b) the way in which they are asserted

(e.g., PRESENT, ABSENT, POSSIBLE, etc.), we considered a richer semantic representation of

medical knowledge than the one provided by the three EMATs on the medical finding discussed in

Roberts et al. (2015). Furthermore, in this chapter we consider for the CPTG more complex repre-

sentations of connections between medical concepts than those discussed in Roberts et al. (2016).

Answers inferred from the CPTG were evaluated against correct answers made available to all par-
ticipants in the 2015 TREC-CDS challenge. In addition, evaluations of relevant biomedical articles
could be performed for the new form of medical Q/A, first reported in Goodwin and Harabagiu
(2016) and illustrated in Figure 2.3. The results confirmed our intuition that it is not necessary to
first discover relevant biomedical articles from which the answers can be extracted (as was the case
with previous textual Q/A systems) but we could infer the answers directly from the CPTG (i.e.,
the medical KB) and then discover the relevant articles that contained the already known answers
to provide additional information and context for the answers. In the extended research reported in
this chapter, we explore the ideal representations of medical knowledge that can be used
as well as the ideal inference methods that can be considered in a medical Q/A system similar
to the architecture illustrated in Figure 2.3. Furthermore, we experimented with another lesson
learned from textual Q/A, namely searching for an answer one paragraph at a time rather than
across the entire document. Our experiments show that this lesson is not applicable to
medical Q/A, as knowledge distilled only from one paragraph leads to inference of less accurate
answers. This may be explained by the fact that scientific articles organize knowledge about a
complex medical case across multiple paragraphs, and thus the knowledge distilled from only one
paragraph is insufficient.
In this chapter, we present the knowledge representations considered for medical Q/A used in
clinical decision support as well as the details of the probabilistic inference methods that were used,
providing the following main contributions:

(C1 ) A probabilistic representation of the medical knowledge processed from a vast EMR collec-
tion known as a Clinical Picture and Therapy Graph (CPTG). The CPTG is represented as
a Markov network which implements: (i) nodes for diagnoses, tests, treatments, symptoms
and signs; and (ii) factors for the connections between them;
(C2 ) Usage of assigned and latent random variables in the CPTG to account for medical concepts
explicitly expressed or inferable, respectively;

(C3 ) Using the likelihood of the automatically discovered answers to produce several novel answer-
informed rankings of the relevant scientific articles;
(C4 ) Locating the answers both at document- and paragraph-level and showcasing the impact that
paragraph indexing produces on the quality of article relevance; and
(C5 ) Designing and implementing a system architecture for answering medical questions that
can be used in CDS when considering any of the three medical knowledge sketches. In
this architecture, answer inference can be performed through four different methods. This
multi-case architecture allowed us to evaluate many possible knowledge representations and
inference methods to discover the optimal results on TREC-CDS data.

The remainder of the chapter is organized as follows. Section 2.1 details the new architecture for
answering medical questions. Section 2.2 on page 26 describes the CPTG used for capturing the
necessary medical knowledge and details the probabilistic inference methods that were used for
discovering answers. Section 2.3 on page 37 details the methods used for automatically generating
the CPTG while Section 2.4 on page 40 presents and discusses the experimental results. Section 2.5
on page 51 summarizes the lessons learned.

2.1 System Architecture for Medical Question Answering

The design of a Q/A architecture that operates on TREC-CDS topics and provides both answers and
relevant biomedical articles from PubMed needs to take into account (a) the medical knowledge
base that informs the answer inference as well as (b) the multiple ways in which, once the answer is
discovered, new relevance models can identify and rank the relevant PubMed articles and compare
them to the relevance results obtained when the answers are ignored.

2.1.1 Inferring Medical Answers with Medical Knowledge Sketches

The cornerstone of our medical Q/A method used for clinical decision support (CDS) is the
derivation of the answers to a topic’s question from a vast medical knowledge graph, generated

automatically from a collection of EMRs. The medical knowledge base contained approximately

634 thousand nodes and 14 billion edges, in which each node represents a medical concept (and its

belief value). We automatically identified four types of medical concepts: signs/symptoms, tests,

diagnoses and treatments (as detailed in Section 2.3.1 on page 37). However, identifying medical

concepts is not sufficient to capture all the subtleties of medical language used by physicians when

expressing medical knowledge. Medical science involves asking hypotheses, experimenting with

treatments, and formulating beliefs about the diagnoses and tests. Therefore, when writing about

medical concepts, physicians often use hedging as a linguistic means of expressing an opinion

rather than a fact. Consequently, clinical writing reflects this modus operandi with a rich set of

speculative statements. Hence, automatically discovering clinical knowledge from EMRs needs to

take into account the physician’s degree of belief by qualifying the medical concepts with assertions

indicating the physician’s belief value (e.g., HYPOTHETICAL, PRESENT, ABSENT) as detailed

in Section 2.3.2 on page 39. In Section 2.3 on page 37 we describe the methods used to identify

automatically medical concepts and their associated assertions, reflecting the belief values of the

physician that authored the clinical narrative. Section 2.2.1 on page 27 details the estimations used

in the CPTG, while Section 2.2.2 on page 31 describes the inference methods used for discovering

the answers to TREC-CDS topics.

However, answers to the TREC-CDS medical questions need also to (1) account for the medical

knowledge expressed in the description and/or summary of TREC-CDS topics; and (2) be mentioned

in relevant biomedical articles from PubMed Central. For the first desideratum, for any of the TREC-

CDS topics, t, a medical knowledge sketch Z1 (t) was discerned. Z1 (t) accounts for the clinical

picture and therapy of the complex medical case referred to by the topic t and it consists of medical

concepts and their inferred assertions. The second desideratum constrained us to discover only

answers to the question from t that can also be observed in biomedical articles from PubMed Central.

Thus, in order to find such answers, for any biomedical document l (from PubMed Central) which

was relevant to the question from topic t, we considered the medical knowledge sketch Z2 (t, l),

which combined Z1 (t) with medical concepts (qualified by their assertions) recognized in l. We

believe that Z2 (t, l) accounts for a more complete view of a possible clinical picture and therapy

of a medical case than the one discerned only from the topic. This belief was strengthened by the

observation that the joint distribution estimated from the CPTG favors more common medical

concepts, whereas the topics evaluated in the TREC-CDS correspond to complex medical cases,

rather than common cases.

We also wondered if an alternative medical knowledge sketch, which adds to Z1 (t) only qualified

medical concepts recognized within the same paragraph of biomedical documents relevant to

the topic, would enable the inference of better answers. Textual Q/A is known to produce

superior results when answers are extracted from paragraphs rather than documents (Harabagiu

and Maiorano, 1999). Thus, we also built a medical knowledge sketch denoted as Z3 (t, s) which

adds to Z1 (t) only asserted medical concepts from a paragraph s, belonging to a biomedical

document from PubMed Central known to be relevant to the question of the topic t. In this way,

we were able to consider three forms of knowledge sketches, namely Z1 (t), Z2 (t, l), or Z3 (t, s)

for identifying the answer to a TREC-CDS topic. If z ∈ {Z1 (t), Z2 (t, l), Z3 (t, s)} is any of the

medical knowledge sketches, we could discover the most likely answer â to the medical question

associated with t by discovering the medical concept encoded in the CPTG, which, when combined

with the sketch, produces the most likely clinical picture and therapy. Formally:

\hat{a} = \arg\max_{a \in A} P(a \mid z) = \arg\max_{a \in A} \frac{P(\{a\} \cup z)}{P(z)}    (2.1)

where the set A denotes all the concepts in the CPTG with the same type as the EMAT of t, and

P(•) refers to the probability estimate provided by the CPTG.
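
To make Equation (2.1) concrete, the following minimal Python sketch ranks candidate answers against a medical knowledge sketch. The probability function prob, the candidate set, and the data structures are hypothetical placeholders standing in for the CPTG estimates detailed in Section 2.2; this is an illustration of the ranking in Equation (2.1), not the system's actual implementation.

# Minimal sketch of Equation (2.1): rank candidate answers a by P(a | z).
# `prob` is a hypothetical callable returning the (unnormalized) CPTG estimate P(.) of a
# set of asserted medical concepts; `candidates` are CPTG concepts matching the topic's EMAT.
def rank_answers(candidates, sketch, prob):
    p_sketch = prob(frozenset(sketch))
    scored = []
    for a in candidates:
        p_joint = prob(frozenset(sketch) | {a})          # P({a} U z)
        scored.append((a, p_joint / p_sketch if p_sketch > 0 else 0.0))
    return sorted(scored, key=lambda pair: pair[1], reverse=True)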

2.1.2 Architecture of Medical Q/A System used in Clinical Decision Support

The architecture of the medical QA system which can be used in Clinical Decision Support (CDS)

is illustrated in Figure 2.4. We envisioned three cases or scenarios for using this architecture:

Figure 2.4. An architecture that implements three different cases for answering medical questions
for clinical decision support.

• Case 1 infers answers from the CPTG only and uses the answers to retrieve PubMed Central

documents relevant to the question implied by the topic’s EMAT (e.g., what is the most likely

diagnosis/treatment/test?);

• Case 2 combines the advantages of a vast medical knowledge base (provided by the CPTG)

with the document relevance model to infer answers, which are later used by an answer-

informed relevance model (different from the one used in Case 1) to identify PubMed Central

documents relevant to the question implied by the topic’s EMAT; and

• Case 3 combines the same medical knowledge base used in Cases 1 and 2 with a paragraph

relevance model to infer the answers, which are later used by an answer-informed relevance

model (different from the ones used in Cases 1 or 2) to identify PubMed Central documents

relevant to the question implied by the topic’s EMAT.

Details of each of these cases are provided below.

22
Case 1

In Case 1, as illustrated in Figure 2.4, the topic is processed with methods detailed in Section 2.3
on page 37 to discern the medical concepts mentioned in the topic’s description or summary
and to discover their assertions. Topic processing generates the medical knowledge sketch Z1 (t),
which is used to produce the ranked list of answers RLA1 according to Equation (2.1) based on
inference enabled by the CPTG. Additionally, we designed an answer-informed relevance model
to also identify the ranked list of documents RLD1 , relevant to the medical question from each
topic. When a document li from PubMed Central (retrieved from the document index) contains an
answer Yi ∈ RLA1 , we defined the answer-informed relevance to the topic t by:

Rel(li ) = P (Yi | Z1 (t)) ∝ P(Yi ∪ Z1 (t)) (2.2)

Equation (2.2) represents an answer-informed ranking of each scientific article li that contains
answers Yi from RLA1 based on the likelihood of the answers in the article, given the medical
knowledge sketch derived for the topic.
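
A minimal sketch of this Case-1 answer-informed relevance model is given below, under the assumption that documents and answers are identified by simple keys; retrieve_docs_containing and prob are hypothetical placeholders for the document index lookup and the CPTG probability estimate.

# Sketch of Equation (2.2): rank documents by Rel(l_i) proportional to P(Y_i U Z1(t)),
# where Y_i is the set of answers from RLA1 found in document l_i.
def rank_documents_case1(ranked_answers, sketch, retrieve_docs_containing, prob):
    doc_answers = {}                                     # document id -> answers it contains
    for answer, _score in ranked_answers:                # ranked_answers: RLA1 as (answer, score)
        for doc in retrieve_docs_containing(answer):
            doc_answers.setdefault(doc, set()).add(answer)
    scored = [(doc, prob(frozenset(sketch) | answers))
              for doc, answers in doc_answers.items()]
    return sorted(scored, key=lambda pair: pair[1], reverse=True)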

Case 2

In Case 2, illustrated in Figure 2.4, topic processing is also used to produce a query (as described
below), which can be further expanded, as it was typically done in the systems participating in
the TREC-CDS challenge (and illustrated in Figure 2.2). When processing the topic to generate a
query, deciding whether to use individual words or concepts is important. In TREC-CDS, there
were systems that used all the content words from the description to produce the query, while other
systems considered only medical concepts. Both the Unified Medical Language System (UMLS)
(Bodenreider, 2004) and MeSH (Lipscomb, 2000) were commonly used as ontological resources
for medical concepts. When generating the query, we opted to use medical concepts rather than
words and formed a disjunctive Boolean query by considering each medical concept as a key phrase.
When the query was expanded, additional medical concepts from UMLS which share the same

23
concept unique identifier (CUI) as any medical concept detected from the topic were added. These

concepts represent synonyms of the topic concepts; thus, the same assertion was extended to them

as well. It should be noted that some medical concepts (e.g., “heart failure”) or their synonyms (e.g.,

“cardiac insufficiency”) are phrases rather than single words. Consequently, the resulting expanded

query consists of a list of key-phrases representing medical concepts. For example, the initial

Boolean query produced for Topic 33 (from Figure 2.1) would include “cardiovascular disease”

OR “shortness of breath” OR “tachypnea”, etc., while the expanded Boolean query would include

“cardiovascular disease” OR “cardiovascular disorder” OR “CVD”, etc. The document relevance

model illustrated in Figure 2.4 makes use of the expanded query, the index of the PubMed Central

articles and a document relevance model to retrieve a list L of 1,000 relevant documents. In the

index we used a snapshot of PubMed Central articles from January 21, 2014 containing a total

of 733,138 articles which were provided by the TREC-CDS organizers. In TREC-CDS, systems

implemented a variety of relevance models, as reported in Roberts et al. (2015), to generate the

ranked list of relevant articles. We also experimented with several document relevance models

(discussed in Section 2.5 on page 51) to retrieve the first 1,000 most relevant documents, denoted

as L. We used L to produce the medical knowledge sketch Z2 (t, l), for each topic t and relevant

document l ∈ L.
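
The query generation and expansion step described above can be sketched as follows; cui_of and atoms_with_cui are hypothetical stand-ins for a UMLS lookup, and the example output is only indicative.

# Sketch of building the disjunctive Boolean query used in Case 2: each detected medical
# concept becomes a quoted key-phrase, and expansion adds UMLS atoms sharing the same CUI.
def build_expanded_query(topic_concepts, cui_of, atoms_with_cui):
    phrases = []
    for concept in topic_concepts:
        phrases.append(concept)
        phrases.extend(atoms_with_cui(cui_of(concept)))  # synonyms share the concept's CUI
    unique = list(dict.fromkeys(p.lower() for p in phrases))
    return " OR ".join(f'"{p}"' for p in unique)

# For Topic 33, a call such as
#   build_expanded_query(["cardiovascular disease", "shortness of breath", "tachypnea"], ...)
# might produce a query like:
#   "cardiovascular disease" OR "cardiovascular disorder" OR "cvd" OR "shortness of breath" OR ...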

A close inspection of the contents of the medical knowledge sketches Z2 (t, l) indicated the

inclusion of many medical concepts obtained from scientific articles that had no relevance to the

topic. This reflects the fact that many of the scientific articles in PubMed Central discuss unexpected

or unusual medical cases – often in non-human subjects. This created a serious problem in the

usage of the medical knowledge sketch Z2 (t, l) to infer answers from the CPTG to the question

from a topic t. Specifically, because the likelihood estimate of an answer enabled by the CPTG

is based on the observed clinical pictures and therapies of patients documented in the MIMIC

clinical database, non-relevant scientific articles which contained common diagnoses, treatments,

tests, signs, or symptoms had a disproportionately large impact on the ranking of answers. In order

24
to address this problem, we refined the ranking of answers provided by Equation (2.1) in order to
incorporate the relevance of the scientific article l ∈ L used for generating the medical knowledge
sketch Z2 (t, l). Thus, in Case 2, we produced the answer ranking by using a novel probabilistic
metric, namely the Reciprocal-Rank Article Score (RRAS). RRAS considers for each article lr ∈ L:
(1) the conditional probability of the answer given the medical knowledge sketch Z2 (t, lr ); as well
as (2) the relevance rank, r of the article lr in L. Formally, the new ranking of answers to a question
associated with topic t generated by the RRAS metric is defined as:
\mathrm{RRAS}(a) = \sum_{r=1}^{1,000} \frac{1}{r} \cdot P(a \mid Z_2(t, l_r)) = \sum_{r=1}^{1,000} \frac{1}{r} \cdot \frac{P(\{a\} \cup Z_2(t, l_r))}{P(Z_2(t, l_r))}    (2.3)
The list of ranked answers, denoted as RLA2 , was ranked by using the RRAS metric. The document
index was used to retrieve the set of PubMed Central documents that contain answers from RLA2 .
These documents were retrieved by using a Boolean query that used a disjunction of all medical
concepts from Z2 (t, l). However, this document set needed to be ranked in order to produce
the ranked list of documents RLD2 , relevant to the medical question from each topic in case 2.
We defined a new, answer-informed relevance model, which ranked the documents from RLD2 .
Specifically, when a document li from RLD2 contains answers Yi ∈ RLA2 , the answer-informed
relevance of li to the topic t is provided by:
\mathrm{Rel}(l_i) = P(Y_i \mid Z_2(t, l_i)) = \frac{P(Z_2(t, l_i))}{P(Z_2(t, l_i) - Y_i)}    (2.4)
In this way, the relevance of an article li responding to the question of a topic is computed by
comparing the likelihood of the medical knowledge sketch Z2 (t, li ), which includes the answers
found in the article li against the likelihood of a version of the medical knowledge sketch Z2 (t, li )
which does not contain the answers found in the article.
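
A minimal sketch of the RRAS ranking in Equation (2.3) is shown below; relevant_docs is the ranked list L, and sketch_of and prob are hypothetical placeholders for building Z2(t, l) and for the CPTG probability estimate.

# Sketch of the Reciprocal-Rank Article Score (Equation (2.3)) for one candidate answer.
def rras(answer, topic, relevant_docs, sketch_of, prob):
    score = 0.0
    for rank, doc in enumerate(relevant_docs, start=1):  # L, ordered by document relevance
        z2 = frozenset(sketch_of(topic, doc))            # Z2(t, l_r)
        p_sketch = prob(z2)
        if p_sketch > 0:
            score += (1.0 / rank) * prob(z2 | {answer}) / p_sketch
    return score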

Case 3

Finally, in Case 3, answers are inferred from the CPTG using Z3 (t, s). Thus, a new set of answers
are inferred (details of inference methods are provided in Section 2.2 on the following page) and

ranked in a new answer list denoted as RLA3 . In RLA3 , ranking is provided by a Reciprocal-Rank
Paragraph Score (RRPS) defined as:
\mathrm{RRPS}(a) = \sum_{r=1}^{1,000} \frac{1}{r} \times \max_{s_k \in l_r} P(a \mid Z_3(t, s_k))    (2.5)

where the article lr has the rank r in L and paragraph s k indicates the k-th paragraph in article
lr . In the definition of the ranking metric RRPS used in RLA3 , we took into account (a) the rank r
of the article from L that contained the answer; as well as (b) the most likely paragraph from the
same article that contained the answer. Furthermore, the ranking generated by RRPS on the list
of answers RLA3 was used by yet another answer-informed relevance model to produce RLD3 , the
ranked list of documents from PubMed Central relevant to the question from a TREC-CDS topic.
The ranking in this list of articles is generated by:

\mathrm{Rel}(l_i) = \max_{s_k \in l_i} P(Y_i \mid Z_3(t, s_k))    (2.6)

where Yi represents all answers from RLA3 found in an article ranked on position i of L. Because
not all answers from Yi may be found in each paragraph of the article li , the ranking favors the
articles which (a) contain most of the answers in a single paragraph; and (b) also contain most of
the concepts from the topic in the same paragraph.
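
Analogously, a minimal sketch of the RRPS ranking of Equation (2.5) is given below; paragraphs_of, sketch_of_paragraph (building Z3(t, s)), and prob are hypothetical placeholders.

# Sketch of the Reciprocal-Rank Paragraph Score (Equation (2.5)) used in Case 3.
def rrps(answer, topic, relevant_docs, paragraphs_of, sketch_of_paragraph, prob):
    score = 0.0
    for rank, doc in enumerate(relevant_docs, start=1):
        best = 0.0
        for paragraph in paragraphs_of(doc):
            z3 = frozenset(sketch_of_paragraph(topic, paragraph))   # Z3(t, s_k)
            p_sketch = prob(z3)
            if p_sketch > 0:
                best = max(best, prob(z3 | {answer}) / p_sketch)
        score += best / rank
    return score
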
In summary, the architecture illustrated in Figure 2.4 enabled us to use medical Q/A in a CDS
system in three different cases, summarized in Table 2.1. Each case generated a different list of
ranked answers (RLA1 , RLA2 and RLA3 , depending on the usage of medical knowledge sketches
Z1 (t), Z2 (t, l), or Z3 (t, s)) as well as a different list of relevant articles for each topic, namely RLD1 ,
RLD2 and RLD3 , produced by three different answer-informed relevance models.

2.2 Inferring Medical Answers

In this section, we first describe the ontological principles used in representing the medical knowl-
edge as a Markov network, a special form of probabilistic graphical models. We detail how medical

Table 2.1. Overview of the different medical knowledge sketches, document relevance models, and
answer ranking metrics used in each case illustrated in Figure 2.4.

  Case      Sketch      Document Relevance Model                Answer Ranking Metric
  Case 1    Z1(t)       P(Yi | Z1(t))      (Equation (2.2))     P(a | z)   (Equation (2.1))
  Case 2    Z2(t, l)    P(Yi | Z2(t, li))  (Equation (2.4))     RRAS(a)    (Equation (2.3))
  Case 3    Z3(t, s)    P(Yi | Z3(t, sk))  (Equation (2.6))     RRPS(a)    (Equation (2.5))

concepts and their inferred assertions are represented, as well as how relations between the vari-
ous medical concepts are considered. We present the way in which any combination of medical
concepts (including the medical knowledge sketches used for answering medical questions in the
architecture presented in Figure 2.4) can be represented such that answer inference can be achieved
through Equations (2.1) to (2.6). In the second part of the section we describe four distinct inference
methods that were used for inferring answers to the TREC-CDS questions from the CPTG.

2.2.1 Representing Medical Knowledge in the Clinical Picture and Therapy Graph

The ontological framework introduced in Scheuermann et al. (2009) considers (1) the clinical
picture of a patient, consisting of the medical problems, signs/symptoms, and medical tests that
might influence the diagnosis of the patient; and (2) the therapy of a patient, consisting of the
set of all treatments, cures, and preventions included within the management plan for the patient.
Medical language processing methods detailed in Section 2.3 on page 37 enable us to discern from
a vast EMR collection the medical concepts that represent the clinical picture and therapy (CPT)
mentioned in the narrative of each medical record. Moreover, the methods described in Section 2.3
allow us to associate with each medical concept an assertion value, indicating whether the medical
concept is PRESENT, ABSENT, etc. (the full list of assertion values and their definitions is
provided in Section 2.3.2). Thus, a CPT lists the set of medical concepts from the same medical
record, with each concept having its own assertion. It is important to note that the CPT varies
significantly between patients with the same disease (e.g., in one patient, a symptom may be
PRESENT, whereas in another ABSENT) and often varies across different points in time for the

27
same patient during the course of their care (e.g., a patient may have fever asserted as PRESENT at
some point and, after treatment with antibiotics, the fever resolves and is asserted as ABSENT).
The medical knowledge acquired automatically from the EMR collection was represented in
the CPTG such that each node corresponds to a medical concept observed in any CPT derived from
the EMR collection. We decided to represent the CPTG as a factorized Markov network (Koller
and Friedman, 2009) in which the nodes are partitioned into: (1) D representing all the diagnoses;
(2) S representing all the signs/symptoms; (3) E representing all the tests; and (4) R representing all
the medical treatments. It is important to note that factorized Markov networks encode knowledge
by using (i) statistical random variables and (ii) mathematical factors (or functions) measuring
the strength of the relationships between statistical random variables in the model. Hence, in the
CPTG, each medical concept (i.e., each node) is a binary random variable which is assigned the
value of 1 when the medical concept was asserted to be PRESENT, CONDUCTED, ORDERED,
or PRESCRIBED, a value of 0 if the medical concept was asserted as ABSENT, and was left as a
latent or unassigned variable, otherwise.
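
The assignment convention above can be sketched as follows; the (concept, assertion) pairing is a hypothetical simplification of the CPT data structure used by the system.

# Sketch of mapping a CPT onto binary assignments of the CPTG's random variables.
POSITIVE_ASSERTIONS = {"PRESENT", "CONDUCTED", "ORDERED", "PRESCRIBED"}

def assign_variables(cpt):
    assignment = {}
    for concept, assertion in cpt:                       # cpt: iterable of (concept, assertion)
        if assertion in POSITIVE_ASSERTIONS:
            assignment[concept] = 1
        elif assertion == "ABSENT":
            assignment[concept] = 0
        # other assertions (e.g., POSSIBLE, HYPOTHETICAL) leave the variable latent
    return assignment

# Example: assign_variables([("fever", "PRESENT"), ("chest pain", "ABSENT"),
#                            ("pneumonia", "POSSIBLE")])
# returns {"fever": 1, "chest pain": 0}; "pneumonia" remains latent.
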
Given a CPT derived from a medical record, it is important to note that (a) each medical
concept from the CPT has a random variable assigned to it; and (b) all the nodes from the CPTG
not recognized in the CPT are associated with latent variables, whose values can be later inferred
during answer inference. It is important to note that by relying on both assigned and latent random
variables, we enable the representation of any possible combination of medical concepts (and their
assertions), regardless of whether they were mentioned in the same electronic medical record or
not. Possible combinations of medical concepts include the three medical knowledge sketches that
we have considered in the architecture for answering medical questions that can be used in clinical
decision support, defined in Section 2.1 on page 19, namely Z1 (t), Z2 (t, l), and Z3 (t, s). All of these
medical knowledge sketches consist of medical concepts and their assertions. The probabilistic
representation of any of these medical knowledge sketches consists of (1) random variables assigned
to the medical concepts identified in the sketch; and (2) latent variables corresponding to all the
other nodes from the CPTG.

Formally, a combination of medical concepts (and their assertions) is denoted by C = D ∪ S ∪

E ∪ R, where D ⊆ D indicates the random variables corresponding to the diagnoses in C, S ⊆ S

indicates the random variables corresponding to signs/symptoms in C, E ⊆ E indicates the random

variables corresponding to tests in C, and R ⊆ R indicates the random variables corresponding

to treatments in C. The estimation of the probability of any combination of medical concepts

(and their assertions) C is made possible once the factors of the CPTG (represented as a Markov

network) are defined. We defined ten factors. The first four factors are: φ1 (D), the likelihood of a

CPT (derived from a medical record) containing the diagnoses from D; φ2 (S), the likelihood of a

CPT (derived from a medical record) containing the signs/symptoms from S; φ3 (E), the likelihood

of a CPT (derived from a medical record) containing the tests from E ; and φ4 (R), the likelihood

of a CPT (derived from a medical record) containing the treatments from R. In addition, the

CPTG encodes relations between medical concepts of different types. Six additional factors enable

the probabilistic representation of these relations: (1) ψ1 (D, S), the strength of the correlation

between all the diagnoses in D and all the signs/symptoms in S; (2) ψ2 (S, E), the strength of the

correlation between all the signs/symptoms in S and all the tests in E; (3) ψ3 (D, E), the strength

of the correlation between all the diagnoses in D and all the tests in E; (4) ψ4 (D, R), the strength

of the correlation between all the diagnoses in D and all the treatments in R; (5) ψ5 (E, R), the

strength of the correlation between all the tests in E and all the treatments in R; and (6) ψ6 (S, R),

the strength of the correlation between all the signs/symptoms in S and all the treatments in R. It

is to be noted that, unlike knowledge graphs that typically encode binary relations, the factors used

in the CPTG correspond to hyper-edges representing n-ary relations between many nodes in the

graph. Figure 2.5 illustrates the representation of the CPTG as a factorized Markov network in

which each partition of random variables D, S, E, and R is shown. Figure 2.5 also illustrates the

representation of a combination of medical concepts (and their assertions), C = D ∪ S ∪ E ∪ R. As

such, the random variables with assigned values in D, S, E, or R are represented as filled circles,

whereas the latent variables corresponding to nodes from the CPTG which are not present in C

Figure 2.5. Factorized Markov network representation of the Clinical Picture and Therapy Graph
(CPTG) and the representation of any combination of medical concepts (and their assertions)
C = D ∪ S ∪ E ∪ R in the CPTG.

are represented as empty circles. Finally, Figure 2.5 also illustrates the factor φ1 in D as an n-ary

relation involving all the assigned and latent variables in D. Factors φ2 , φ3 , and φ4 encode n-ary

relations involving all the random variables in S, E, and R, respectively; factors ψ1 , ψ2 , ψ3 , ψ4 , ψ5

and ψ6 encode the n-ary relations between medical concepts of different types.

The factorized Markov network representation of the CPTG illustrated in Figure 2.5 enables us

to compute the probability of any combination of medical concepts (and their assertions) C using:

P(C) = P(D, S, E, R) \propto \phi_1(D) \times \phi_2(S) \times \phi_3(E) \times \phi_4(R) \times \psi_1(D, S) \times \psi_2(S, E) \times \psi_3(D, E)
\times \psi_4(D, R) \times \psi_5(E, R) \times \psi_6(S, R)    (2.7)

The probability distribution provided in Equation (2.7) is determined by the product of the ten factors

used in the factorized Markov network representation of the CPTG. Unfortunately, evaluating any of

these factors directly can be intractably expensive. For example, pre-computing ψ1 (D, S) requires

storing 2^{|D|×|S|} probabilities (or counts). Beyond computational complexity, an additional problem

arises from the inherent sparsity of clinical data: for a given combination of medical concepts,

it is very unlikely that the given combination is reflected in the CPT of any patient in the

EMR collection, thus, the probability assigned to that combination would be zero. For example, if

the diagnoses in C are D = { [heart attack/PRESENT], [diabetes/PRESENT], [obesity/ABSENT],

[pneumonia/POSSIBLE] }, we may not find any patient documented in the EMR collection who

has all the diagnoses with the same assertions as in D. Consequently, we would infer the likelihood

of C as zero. If C were a medical knowledge sketch, we would not find any answers to the medical

question of a TREC-CDS topic. To address this problem, we decided to consider EMRs with

narratives describing clinical pictures and therapies which are similar to C (e.g., any of the medical

knowledge sketches). To do so, we relaxed the maximum likelihood estimation requirements.

This allowed us to infer the ten factors used to compute P(C) using four methods: (1) exact

inference, (2) pair-wise smoothing, (3) interpolated smoothing, and (4) applying the Bethe free-

energy approximation.
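
As a small illustration of how Equation (2.7) composes the ten factors, the sketch below multiplies hypothetical factor functions phi and psi; each of these functions would be estimated by one of the four methods described next.

# Sketch of the unnormalized probability of C = D u S u E u R under Equation (2.7).
# `phi` and `psi` are dictionaries of hypothetical factor functions, keyed 1..4 and 1..6.
def unnormalized_p(D, S, E, R, phi, psi):
    score = phi[1](D) * phi[2](S) * phi[3](E) * phi[4](R)
    score *= psi[1](D, S) * psi[2](S, E) * psi[3](D, E)
    score *= psi[4](D, R) * psi[5](E, R) * psi[6](S, R)
    return score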

2.2.2 Inference Methods

Exact Inference

The obvious approach for defining the factors used in the CPTG is to perform exact inference based

on maximum likelihood estimates (MLE). Specifically, we define the MLE of any combination of

medical concepts, C, as:

P_{MLE}(C) = \frac{\text{number of EMRs in the collection which contain } C \text{ in their CPTs}}{\text{total number of EMRs in the collection}}    (2.8)

which allows us to define each of the four φ factors as:

\phi_1(D) = P_{MLE}(D) \qquad \phi_2(S) = P_{MLE}(S) \qquad \phi_3(E) = P_{MLE}(E) \qquad \phi_4(R) = P_{MLE}(R)    (2.9)

and each of the six ψ factors as:

\psi_1(D, S) = P_{MLE}(D \cup S) \qquad \psi_2(S, E) = P_{MLE}(S \cup E) \qquad \psi_3(D, E) = P_{MLE}(D \cup E)
\psi_4(D, R) = P_{MLE}(D \cup R) \qquad \psi_5(E, R) = P_{MLE}(E \cup R) \qquad \psi_6(S, R) = P_{MLE}(S \cup R)    (2.10)

It is important to note that we discover a very large number of CPTs from the EMR collection, and,
hence, the CPTG contains a significantly large number of diagnoses, signs/symptoms, tests, and
treatments. Therefore, computing Equation (2.8) entails either (a) pre-computing P_{MLE}(C) for
every possible combination of medical concepts – requiring considering all 2^{|D∪S∪E∪R|} possible
combinations – which is prohibitively expensive; or (b) computing P_{MLE}(C) on demand. This
alternative approach takes advantage of a bag-of-medical-concepts model. In the bag-of-medical-
concepts model, akin to the bag-of-words model used in vector space retrieval, we generated an
index that uses as a dictionary all instances of qualified medical concepts processed from the EMR
collection, while the inverted list structures the linked lists of the CPTs that contain each instance
of a medical concept. This index is used to compute P_{MLE}(C) when a conjunctive Boolean query
is formed with all components of C. The estimations of the factors from Equations (2.9) and (2.10)
are all produced by using the bag-of-medical-concepts model and the index of CPTs.
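
The bag-of-medical-concepts index and the on-demand computation of P_{MLE}(C) can be sketched as follows; the in-memory dictionary is a simplification of the actual inverted index, and concepts are assumed to be hashable (concept, assertion) pairs.

# Sketch of the inverted index of CPTs and the exact MLE of Equation (2.8).
def build_index(cpts):
    """cpts: {emr_id: set of qualified (concept, assertion) pairs} -> inverted index."""
    index = {}
    for emr_id, concepts in cpts.items():
        for qualified_concept in concepts:
            index.setdefault(qualified_concept, set()).add(emr_id)
    return index

def p_mle(C, index, total_emrs):
    """P_MLE of a combination C, via a conjunctive Boolean query (set intersection)."""
    postings = [index.get(qualified_concept, set()) for qualified_concept in C]
    matching = set.intersection(*postings) if postings else set()
    return len(matching) / total_emrs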

Inference with Pair-wise Smoothing

While the exact inference method accurately estimates the number of EMRs with CPTs containing
a given combination of medical concepts, it cannot account for sparsity. For example if the
combination C contains eight medical concepts, although there may be CPTs in the EMR collection
which have six or seven medical concepts in common with C, if no CPT in the collection contains
all eight medical concepts in C with the same assertions, then the MLE probability will be zero.
To address this problem, we relax the maximum likelihood estimates to better handle sparsity. By
defining each factor as the product of the pair-wise associations between all concepts in C, we are
smoothing the MLE of each factor. In this way, the four same-typed factors (φ1 . . . φ4 ) can be
assigned to the product of pair-wise MLE estimates:
\phi_1(D) = \prod_{d_1 \in D} \prod_{d_2 \in D \setminus \{d_1\}} P_{MLE}(\{d_1, d_2\}) \qquad \phi_2(S) = \prod_{s_1 \in S} \prod_{s_2 \in S \setminus \{s_1\}} P_{MLE}(\{s_1, s_2\})
\phi_3(E) = \prod_{e_1 \in E} \prod_{e_2 \in E \setminus \{e_1\}} P_{MLE}(\{e_1, e_2\}) \qquad \phi_4(R) = \prod_{r_1 \in R} \prod_{r_2 \in R \setminus \{r_1\}} P_{MLE}(\{r_1, r_2\})    (2.11)

Likewise, the factors ψ1 . . . ψ6 can be similarly defined:
\psi_1(D, S) = \prod_{d \in D} \prod_{s \in S} P_{MLE}(\{d, s\}) \qquad \psi_2(S, E) = \prod_{s \in S} \prod_{e \in E} P_{MLE}(\{s, e\})
\psi_3(D, E) = \prod_{d \in D} \prod_{e \in E} P_{MLE}(\{d, e\}) \qquad \psi_4(D, R) = \prod_{d \in D} \prod_{r \in R} P_{MLE}(\{d, r\})    (2.12)
\psi_5(E, R) = \prod_{e \in E} \prod_{r \in R} P_{MLE}(\{e, r\}) \qquad \psi_6(S, R) = \prod_{s \in S} \prod_{r \in R} P_{MLE}(\{s, r\})

By using the pair-wise definitions from Equations (2.11) and (2.12), we estimate the joint distribu-

tion in Equation (2.7).
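
A minimal sketch of these pair-wise smoothed factors follows; p_mle is a one-argument callable returning the pair-level MLE (e.g., a partial application of the index-based estimate sketched earlier), and each unordered pair is multiplied once.

# Sketch of Equations (2.11) and (2.12): factors as products of pair-wise MLEs.
from itertools import combinations, product

def phi_pairwise(concepts, p_mle):
    """Same-type factor, e.g., phi_1(D)."""
    score = 1.0
    for c1, c2 in combinations(concepts, 2):
        score *= p_mle({c1, c2})
    return score

def psi_pairwise(concepts_a, concepts_b, p_mle):
    """Cross-type factor, e.g., psi_1(D, S)."""
    score = 1.0
    for c1, c2 in product(concepts_a, concepts_b):
        score *= p_mle({c1, c2})
    return score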

Inference with Interpolated Smoothing

The pair-wise smoothing method for answer inference still suffers from sparsity problems: if the

likelihood of any pair of medical concepts is zero, then (as with the exact inference method), the

joint probability will be zero. Moreover, the pairwise approach cannot distinguish the

level of similarity between a given combination of medical concepts (and their assertions) C and

the CPTs used to generate the CPTG. For example, if C contains eight concepts, and there are 50

EMRs that share seven concepts with C but 200 EMRs that share only one concept with C, the 200

EMRs with only a single concept in common would dominate the probability estimates. To account

for this, we define the level of similarity between two CPTs as the number of concepts contained

in both CPTs (with the same assertions). Thus, the levels of similarity range from perfectly similar

(all |C| concepts in common) to perfectly dissimilar (0 concepts in common). In order to account

for each of these levels of similarity, we interpolated the likelihood of C, with the likelihoods

of all subsets of C. This would typically require enumerating all 2^{|C|} subsets of C and, thus,

would appear to be computationally intractable. Fortunately, as with the exact inference method,

we can reduce the complexity to be linear in the size of the EMR collection by using the same

bag-of-medical-concepts model described in Section 2.2.2 on page 31.

Algorithm 2.1 Calculate the Smoothed Likelihood of C
Precondition: C is a combination of medical concepts, α is a smoothing parameter ∈ [0, 1].

1  function Smoothed-Likelihood(C, α)
2      Let m be a 1 × N zero-vector
3      for ci ∈ C do
4          h ← retrieve(ci)
5          m ← m + h
6      end for
7      Let c be a 1 × (|C| + 1) zero-vector
8      for mi ∈ m do
9          c_{mi} ← c_{mi} + 1
10     end for
11     Let s = α · c_{|C|}
12     for i ∈ [1, |C| − 1] do
13         s ← s + (1 − α)^{2^{|C|−i}} · c_i
14     end for
15     return s
16 end function

Postcondition: s represents the smoothed likelihood of C, i.e., P(C).

Specifically, using the same index created for exact inference (described in Section 2.2.2 on

page 31) we were able to compute the smoothed likelihood of a given combination of medical con-

cepts C through a series of constant-time Boolean retrieval operations, as shown in Algorithm 2.1.

Formally, for each medical concept (and its assertion) ci ∈ C we construct a separate Boolean

query consisting only of ci and identify the EMRs in the collection which are returned by that

query. This allows us to produce a binary vector h i (for each ci ∈ C) which indicates which EMRs

in the collection mentioned ci (with the given assertion). We can determine the number of medical

concepts in common (i.e., the level of similarity) between each EMR in the collection and C by

computing the element-wise sum over each of these binary vectors. We denote the element-wise

sum as m = Σ_i h_i. Using m, we can compute the number of EMRs in the EMR collection that have

each level of similarity with C. Formally, let n j indicate the number of EMRs in the collection that

have a j level of similarity with C. We computed n j by initializing n as a zero vector and then, for

each mk ∈ m, incrementing n mk by one. This allows the smoothed likelihood of C to be estimated
by interpolating the number of EMRs at each similarity level (n 0 . . . n |C| ):
P(C) \propto \alpha \cdot n_{|C|} + \sum_{i=1}^{|C|-1} \left[ (1 - \alpha)^{2^{|C|-i}} \cdot n_i \right]    (2.13)

where α ∈ [0, 1] is a scaling factor that determines how much smoothing is applied: when
α = 1, no smoothing is applied and Equation (2.13) reduces to the exact probability estimation
(given in Section 2.2.2 on page 31); when α = 0, the exact probability estimation is ignored and
only the interpolated similarity counts are used. In our experiments, we used α = 0.5.
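
A Python rendering of Algorithm 2.1 and Equation (2.13) is sketched below, under the assumption (reconstructed from the text) that the weight of each dissimilarity level i is (1 − α)^{2^{|C|−i}}; the index data structure follows the earlier sketch.

# Sketch of the interpolated-smoothing estimate of P(C) (up to normalization).
from collections import Counter

def smoothed_likelihood(C, index, alpha=0.5):
    overlap = Counter()                                  # EMR id -> number of concepts of C shared
    for qualified_concept in C:
        for emr_id in index.get(qualified_concept, set()):
            overlap[emr_id] += 1
    n = Counter(overlap.values())                        # similarity level j -> number of EMRs
    size = len(C)
    score = alpha * n[size]                              # EMRs whose CPT contains all of C
    for level in range(1, size):
        score += ((1 - alpha) ** (2 ** (size - level))) * n[level]
    return score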

Bethe Free-energy Approximation

Finally, we considered state-of-the-art methods for approximate inference. In contrast to the three
previous approaches for answer inference, approximate inference guarantees a constant upper
bound on the error between the approximate probability and the true probability of each factor.
The canonical example of an approximate inference algorithm is that of Loopy Belief Propagation
(Pearl, 1986) wherein variables and factors repeatedly exchange messages until convergence, at
which point, the full joint distribution can be estimated. However, recent work has considered
interpreting the distribution of a set of random variables as the information energy present in a
physical system. In this setting, the distribution of all possible clinical pictures and therapies given
in Equation (2.7) is cast as the energy J:
J(C) = \log \prod_{i=1}^{6} \psi_i(C) \prod_{j=1}^{4} \phi_j(C)    (2.14)

This allows us to then define the “Free Energy” of the system as follows:
F(C) = U(C) - H(C) = \overbrace{P(C)\,J(C)}^{\text{energy}} - \overbrace{P(C) \log P(C)}^{\text{entropy}}    (2.15)

where U(C) is the energy and H(C) is the entropy of C. It has been shown that the minimum fixed
points of Equation (2.15) are equivalent to fixed points of the iterative Loopy Belief Propagation

algorithm, as reported by (Vontobel, 2013) and (Yedidia et al., 2005). This observation indicates
that minimizing the free energy in Equation (2.15) obtains the same probability estimates as running
iterative loopy belief propagation on Equation (2.7) until convergence. We can take advantage of
the Bethe free energy approximation by transforming our original, potentially infinitely-looping
message passing problem into a convex linear programming problem. As with the pair-wise
smoothing approach described in Section 2.2.2 on page 32, the Bethe free energy approximation
relies on pair-wise interactions. Formally:

F_B(C, \tau) = U_B(C, \tau) - H_B(C, \tau)    (2.16a)

where

U_B(C, \tau) = -\sum_{x \in C} \sum_{v_x \in \{0,1\}} \tau_x(v_x) \log \phi(x) \; - \sum_{y \in C \setminus \{x\}} \sum_{v_y \in \{0,1\}} \tau_{x,y}(v_x, v_y) \log \psi(x, y)    (2.16b)

H_B(C, \tau) = -\sum_{x \in C} \sum_{v_x \in \{0,1\}} \tau_x(v_x) \log \tau_x(v_x) \; - \sum_{y \in C \setminus \{x\}} \sum_{v_y \in \{0,1\}} \tau_{x,y}(v_x, v_y) \log \frac{\tau_{x,y}(v_x, v_y)}{\tau_x(v_x)\, \tau_y(v_y)}    (2.16c)

This allows P(C) from Equation (2.7) to be estimated by finding the set of τ that minimize FB (C, τ):
P(C) \approx \exp\left[ -\min_{\tau} F_B(C, \tau) \right]    (2.17a)

where τ must satisfy the following conditions:


\forall x \in C,\; v_x \in \{0, 1\}: \quad \sum_{v_y \in \{0,1\}} \tau_{x,y}(v_x, v_y) = \tau_x(v_x)    (2.17b)

\forall x \in C,\; y \in C \setminus \{x\}: \quad \sum_{v_x \in \{0,1\}} \sum_{v_y \in \{0,1\}} \tau_{x,y}(v_x, v_y) = 1    (2.17c)

\forall x \in C: \quad \sum_{v_x \in \{0,1\}} \tau_x(v_x) = 1    (2.17d)

The constraints in Equations (2.17b) to (2.17d) can be represented by Lagrangian multipliers,


allowing us to estimate the joint probability of any clinical picture and therapy from Equation (2.7)
using gradient descent (or any other method for convex optimization). In our implementation, we

used the publicly available Hogwild software for parallel stochastic gradient descent (Recht et al.,
2011).
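
As an illustration, the sketch below evaluates the Bethe free-energy objective of Equation (2.16) for a given set of singleton and pairwise beliefs; the optimizer that minimizes it subject to Equations (2.17b) to (2.17d) is omitted, and the inputs (log_phi, log_psi, tau_x, tau_xy) are hypothetical simplifications that follow the form of the equations above.

# Sketch of evaluating F_B(C, tau) = U_B - H_B from Equation (2.16).
import math

def bethe_free_energy(nodes, log_phi, log_psi, tau_x, tau_xy):
    """nodes: variables in C; log_phi[x], log_psi[(x, y)]: log factor values;
    tau_x[x][v] and tau_xy[(x, y)][(vx, vy)]: singleton and pairwise beliefs."""
    energy, entropy = 0.0, 0.0
    for x in nodes:
        for v in (0, 1):
            b = tau_x[x][v]
            energy -= b * log_phi[x]
            if b > 0:
                entropy -= b * math.log(b)
    for (x, y), beliefs in tau_xy.items():
        for (vx, vy), b in beliefs.items():
            energy -= b * log_psi[(x, y)]
            if b > 0:
                entropy -= b * math.log(b / (tau_x[x][vx] * tau_x[y][vy]))
    return energy - entropy
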
Overall, we have considered four approaches for inferring the probability of any given combina-
tion of clinical pictures and therapies as defined by Equation (2.7), enabling us to use Equations (2.1)
to (2.6) to infer the answers to a medical topic and, consequently, to rank the scientific articles
containing the answers. Before computing P(C), however, it is necessary to first construct the CPTG
from a collection of EMRs.

2.3 Extracting Medical Knowledge with Medical Language Processing

While the probabilistic inference methods described in Section 2.2 on page 26 represent the (hyper-
)edges in the CPTG as factors, the nodes of the CPTG are identified by applying natural language
processing on the collection of electronic medical records (EMRs). The natural language processing
extracts the CPT from each medical record by automatically identifying every medical concept (and
its assertions) from the natural language narrative. In this section, we detail the natural language
processing approach used to automatically identify medical concepts and their assertions. The
system which identifies medical concepts and their assertions is illustrated in Figure 2.6. As will
be detailed in Sections 2.3.1 and 2.3.2, first medical concepts are identified and then their assertion
values are recognized by extracting and selecting features from multiple external resources, as
shown in Figure 2.6. It should be noted that the same natural language processing techniques are
applied to the medical topic and scientific articles when producing the medical knowledge sketch.
We first describe how medical concepts are discerned, then we detail how assertions are identified.

2.3.1 Identification of Medical Concepts

When designing our automatic approach to recognizing medical concepts in clinical texts we started
from the general framework developed during the 2010 shared-task on Challenges in Natural
Language Processing for Clinical Data (Uzuner et al., 2011) jointly organized by the Informatics

Figure 2.6. System for Medical Concept Detection and Assertion Recognition

for Integrating Biology at the Bedside (i2b2) and the United States Department of Veterans Affairs

(VA). In this shared-task, participants were asked to identify three types of medical concepts in

clinical texts: medical problems, treatments, and medical tests. For our work, we have extended

this framework by further classifying medical problems into: (1) observations from the patient

(known as symptoms) or from a physical exam (known as signs); and (2) the diagnoses, including

co-morbid diseases or disorders.

We cast the problem of identifying medical concepts in narratives of EMRs as a three-stage

classification:

Stage 1. Recognizing the boundaries of medical concepts.

Stage 2. Discriminating between medical problems, tests, and treatments.

Stage 3. Classifying medical problems as either SIGNS/SYMPTOMS or DIAGNOSES.

In stage one, a conditional random field (CRF) was used to determine the boundaries (starting

and ending tokens) of each medical concept. Stage two of the pipeline relied on a support vector

machine (SVM) to determine the type of medical concept. In stage 3, we automatically project each

identified medical concept onto the UMLS ontology and classify it as a SIGN/SYMPTOM if the

UMLS semantic type is Symptom or Sign or Finding, and as a diagnosis, otherwise. Stages one and
two relied on lexical information, concept type information from UMLS and Wikipedia, as well as
semantic information describing predicates and arguments. Automatic feature selection was used to
tune the number of features considered by the CRF and SVM separately, as documented in Roberts
and Harabagiu (2011). Overall, feature extraction relied on a number of external resources including
The Unified Medical Language System (UMLS) (Bodenreider, 2004), MetaMap (Aronson, 2001),
the GENIA project (Kim et al., 2003), WordNet (Fellbaum, 1998), and Wikipedia.
In addition to recognizing the boundaries and types of medical concepts, we also associated
each medical concept with a set of synonyms (including synonymous abbreviations). Synonyms
were generated by (1) collecting all UMLS atoms that share the same concept unique identifier
(CUI) and (2) sets of article titles in Wikipedia which all redirect to the same article. This allows
us to account for synonymous expressions of the same medical concept in the CPTG by combining
all the nodes corresponding to synonymous concepts into a single node representing the set of
synonymous concepts. Figure 2.7 illustrates examples of our approach for automatic medical
concept recognition.
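
The merging of synonymous mentions into a single CPTG node can be sketched as follows; cui_of is a hypothetical lookup from a surface form to its UMLS CUI (with Wikipedia redirects handled analogously).

# Sketch of grouping synonymous concept mentions under a shared key (e.g., a UMLS CUI).
def group_synonyms(mentions, cui_of):
    groups = {}
    for mention in mentions:
        key = cui_of(mention) or mention.lower()         # fall back to the surface form
        groups.setdefault(key, set()).add(mention)
    return groups

# Example: "heart failure", "cardiac insufficiency", and "myocardial failure" share a CUI
# and would therefore collapse into a single CPTG node.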

2.3.2 Recognizing the Medical Assertions

Our approach for automatically recognizing assertions for medical concepts extends the framework
reported by (Roberts and Harabagiu, 2011) in which the belief status (or assertion type) of a medical
concept is determined by a single SVM classifier. As with medical concepts, we trained our
classifier using the annotations produced during the 2010 i2b2/VA challenge (Uzuner et al., 2011).
However, in this challenge, only medical problems were annotated with assertions. Consequently,
we have extended the assertion values to also apply to tests and treatments as previously reported
in (Goodwin and Harabagiu, 2014). To this end, we annotated six new assertion values to describe the
degree of the physicians’ beliefs for treatments and tests as well. Table 2.2 lists all twelve assertion

1. Management dilemmas in acute pulmonary embolism; doi:10.1136/thoraxjnl-2013-204667.

Topic:
  Diagnoses: ∅
  Signs & Symptoms: { [dyspnea, shortness of breath, sob, . . . ]/PRESENT, [tachypnea, rapid breathing,
      increased respiratory rate, . . . ]/PRESENT, [chest pain, thoracic pain, pain in chest, . . . ]/PRESENT,
      [swollen calf, calf swelling, swelling of the calf, . . . ]/PRESENT, [pain of right calf, pain in the
      right calf, right calf pain, . . . ]/PRESENT }
  Tests: ∅
  Treatments: ∅

Article:
  Excerpt: . . . Patients over 75 years, bed rest over four days, cancer, chronic obstructive pulmonary
      disease, heart failure, kidney failure, tachycardia and syncope are all clinical and comorbidity
      indicators of poor prognosis in acute PE . . .
  Diagnoses: { [cancer, malignant neoplasms, malignant tumor, . . . ]/PRESENT, [chronic obstructive
      pulmonary disease, copd, small airway disease, . . . ]/PRESENT, [heart failure, myocardial failure,
      cardiac insufficiency, . . . ]/PRESENT, [kidney failure, renal failure, . . . ]/PRESENT,
      [acute pulmonary embolism, acute PE, . . . ]/PRESENT }
  Signs & Symptoms: { [tachycardia, rapid pulse, . . . ]/PRESENT, [syncope, fainting, . . . ]/PRESENT }
  Tests: ∅
  Treatments: { [bed rest, bedrest, . . . ]/PRESENT }

Figure 2.7. Example of medical concepts and their assertions discerned from the summary of medical
Topic 33 (illustrated in Figure 2.1) as well as from the relevant PubMed article PMC3913120¹.

values as well as their definitions. In Table 2.2 the six additional assertion values we annotated are
indicated with a ‘?’. As previously reported in (Goodwin and Harabagiu, 2014), we have annotated
a total of 2,349 medical concepts with the expanded set of assertion values.
The SVM used for automatic assertion classification considered the same set of features and
external resources as those reported in (Roberts and Harabagiu, 2011), namely, UMLS, MetaMap,
NegEx(Chapman et al., 2001) and the Harvard General Inquirer(Stone et al., 1966). As in (Roberts
and Harabagiu, 2011), we relied on (1) the above external resources, (2) lexical features, and (3)
statistical information about assertions classified for previous mentions of the same medical concept
to train a 12-class SVM.

2.4 Experimental Results

In our experiments we evaluated (1) the accuracy of answers produced by our system for each topic,
(2) the relevance of scientific articles retrieved for each topic, and (3) the structure and composition
of the automatically generated clinical picture and therapy graph (CPTG). When evaluating the

Table 2.2. Definitions and examples of assertion values. New assertion values are denoted with
a ‘?’. In this table, moment refers to the specific instant in time in which a medical concept was
written; the concept types each assertion value applies to are given in parentheses.

  ?HISTORICAL (problems, tests, treatments): The indicated medical concept occurred during a
      previous hospital visit; e.g., the patient’s past medical history is significant for CONGESTIVE
      HEART FAILURE.
  CONDITIONAL (problems, tests, treatments): The mention of the indicated medical concept asserts
      that it occurs only during certain conditions; e.g., [we will] likely readmit him for REHAB once
      the WOUND has HEALED.
  ?PRESCRIBED (treatments): The indicated treatment has been assigned and will begin sometime
      after this moment; e.g., she was given ROCEPHIN and ZITHROMAX.
  ABSENT (problems, tests, treatments): The note asserts that the indicated medical concept does not
      exist at this moment; e.g., the patient denies any CHEST PAIN at this time.
  ?SUGGESTED (tests, treatments): The indicated treatment or test is advised, though it cannot be
      assumed to actually occur; e.g., it was recommended that he be on ALLOPURINOL long-term.
  PRESENT (problems): The indicated problem is still active at this moment; e.g., there is a moderate
      PERICARDIAL EFFUSION.
  HYPOTHETICAL (problems): The note asserts the patient may develop the indicated problem; e.g.,
      she is to return for any WORSENING PAIN, FEVERS, or PERSISTENT VOMITING.
  ?ORDERED (tests): The indicated medical test has been scheduled and will be completed sometime
      after this moment; e.g., we will do a PULMONARY FUNCTION TEST with DESATURATION STUDY.
  ASSOCIATED WITH ANOTHER (problems): The mention of the medical problem is associated with
      someone other than the patient; e.g., father died of LUNG CANCER probably related to ASBESTOS
      EXPOSURE.
  POSSIBLE (problems): The note asserts that the patient may have a problem, but there is some
      degree of uncertainty; e.g., SHORTNESS OF BREATH: I believe that this may represent worsening
      for PULMONARY HYPERTENSION.
  ?ONGOING (problems, treatments): The indicated problem or treatment persists beyond this
      moment; e.g., as per nephrology, continue DIALYSIS.
  ?CONDUCTED (tests): The indicated medical test has been performed and completed as of this
      moment; e.g., UNASYN 3 GRAMS IV was given.

answers and scientific articles automatically produced by our system, we considered the 30 topics

(labeled 31-60) used during the 2015 TREC-CDS evaluation (Simpson et al., 2014) (for which

answers had been given).

2.4.1 Medical Answer Evaluation

The accuracy of the medical answers automatically produced by our approach was measured by

computing the Mean Reciprocal Rank (MRR). The MRR is the mean of the reciprocal of the rank

produced by our system for the (first) correct answer for each topic. To identify the correct answer for

each topic, we relied on a set of “candidate answers” manually produced by the authors of the 2015

TREC-CDS topics. The candidate answers were distributed to the TREC-CDS participants after

the evaluation had completed. It should be noted that these candidate answers were not provided

to the relevance assessors when evaluating document retrieval, and they were only provided to

participating teams after the evaluation had concluded. The 2015 TREC-CDS task was strictly

an information retrieval evaluation measuring only the performance of systems when retrieving

and ranking scientific articles from the PubMed Central open access subset. However, the candidate

answers provided after the conclusion of the evaluation allowed us to cast the TREC-CDS task as a

question-answering (Q/A) problem. The candidate answers produced by the topic creators indicate

one-or-more candidate answers that the topic author considered when producing each topic. It is

important to note that the candidate answers are not necessarily the “best” answers and are not

always well-represented in PubMed. We evaluated each of the three ranked lists of answers produced

by our system using the candidate answers as a gold-standard. Table 2.3 lists the performance of our

approach when considering each type of medical knowledge sketch (as described in Section 2.1 on

page 19) and when using each method of answer inference (as described in Section 2.2 on page 26).
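
For reference, the MRR used in this evaluation can be computed as in the sketch below; the per-topic inputs are hypothetical, and answer matching is simplified to exact string equality (rather than the synonym matching applied in practice).

# Sketch of the Mean Reciprocal Rank over a set of topics.
def mean_reciprocal_rank(ranked_answers_per_topic, gold_answers_per_topic):
    reciprocal_ranks = []
    for topic, ranked_answers in ranked_answers_per_topic.items():
        gold = gold_answers_per_topic.get(topic, set())
        rr = 0.0
        for rank, answer in enumerate(ranked_answers, start=1):
            if answer in gold:                           # first correct answer determines RR
                rr = 1.0 / rank
                break
        reciprocal_ranks.append(rr)
    return sum(reciprocal_ranks) / len(reciprocal_ranks) if reciprocal_ranks else 0.0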

As shown in Table 2.3, it is clear that the most accurate answers were obtained when using

(1) the Bethe Free-Energy Approximation method for answer inference applied to (2) the medical

knowledge sketch obtained by considering both the medical topic t and an entire scientific article

Table 2.3. The Mean Reciprocal Rank (MRR) of the medical answers inferred using each inference
method and each type of medical knowledge sketch.

  Inference Method            Case 1 (Z1, RLA1)    ?Case 2 (Z2, RLA2)    Case 3 (Z3, RLA3)
  Exact Inference             0.031                0.000                 0.220
  Pair-wise Smoothing         0.083                0.502                 0.329
  Interpolated Smoothing      0.124                0.601                 0.466
  ?Bethe Approximation        0.125                0.694                 0.464

l, Z2 (t, l). Clearly, exact inference was unable to produce accurate answers, as evidenced by the

low performance across all three types of medical knowledge sketch. When investigating the poor

performance when Z2 (t, l) was used with exact inference, we found that very few CPTs in our EMR

collection contained all the medical concepts with the same assertions as in Z2 , highlighting the high

degree of sparsity in clinical data. It is important to point out that there was no statistically significant

difference between the Bethe Free-Energy Approximation and the Interpolated Smoothing methods

of answer inference (using the Wilcoxon signed-rank test, where p < 0.001 and N = 30), but

that both methods did significantly outperform both the pair-wise smoothing and exact inference

methods of answer inference. This suggests that the accuracy of answers inferred using the

CPTG can be greatly improved by smoothing or approximation (rather than using exact estimates).

Overall, the answers obtained using Z1 (t) were of significantly poorer quality than those obtained

using either Z2 (t, l) or Z3 (t, s). Moreover, the answers obtained using Z3 (t, s) were less accurate than

those obtained using Z2 (t, l). While considering paragraphs from scientific articles can improve the

accuracy of answers automatically identified by our approach, the decrease in performance from Z2 (t, l)

to Z3 (t, s) suggests that the most accurate answers (and their rankings) are achieved by considering

the entire scientific article. Moreover, the high performance of Z2 (t, l) suggests that the questions

associated with medical topics are complex: the answer(s) are rarely described within a single

paragraph, and must be inferred from multiple paragraphs across the document. We analyzed the

answers obtained using each type of medical knowledge sketch and found that the answers produced

Figure 2.8. Reciprocal Rank for each topic evaluated in TREC-CDS 2015.

by considering Z1 (t) were dominated by the most common diseases, tests, or treatments mentioned
in the EMR collection (that is, Z1 (t) effectively reduced to the prior probability of each medical
concept). By contrast, Z2 (t, l) was able to identify reasonably accurate answers in most cases,
while Z3 (t, s) preferred answers that were related to only a small number of medical concepts in
the topic, and was often unable to account for the fact that some concepts in the medical topic
are more important than others when determining an answer. Overall, the performance measures
reported in Table 2.3 reinforce our hypothesis that combining knowledge from relevant scientific
articles with the medical topic can produce significantly more accurate answers than considering
the topic in isolation. Moreover, the MRR obtained when considering Z2 (t, l) and Z3 (t, s) shows
that answer inference over the clinical picture and therapy graph is reasonably accurate compared
to the candidate answers produced by the 2015 TREC-CDS topic creators.
In addition to the Mean Reciprocal Rank shown in Table 2.3, Figure 2.8 shows the reciprocal
rank of the gold-standard answer for each individual topic used in the 2015 TREC-CDS evaluation
when using the Interpolated Smoothing method for answer inference applied to (1) Z2 (t, l) and
(2) Z3 (t, s). As shown, for the majority of topics, for all medical knowledge sketch types, our
top-ranked answer was the same as (or synonymous with) the candidate answer produced by the topic
creators. In fact, we obtained the correct answer for the majority of topics with an EMAT of
treatment (i.e., topics 51-60) as well as for many of the topics with an EMAT of diagnosis (i.e.,
topics 31-40). Unfortunately, for many of the topics with an EMAT of medical test, (i.e., topics
41-50), our approach struggled to identify and rank the correct answers. Table 2.4 presents the ten

Table 2.4. Examples of answers discovered for the medical cases illustrated in Figure 2.1.

  Topic 33
    EMAT: Diagnosis
    Answers: acute pulmonary embolism; thrombolysis; ischaemia; dvt; pulmonary hypertension;
        myocardial infarction; tension pneumothorax; arrhythmia; cardiogenic shock; aortic dissection
    Gold Answer: pulmonary embolism

  Topic 42
    EMAT: Test
    Diagnosis: bacterial meningitis
    Answers: spinal puncture; gram’s stain; cerebrospinal fluid culture; latex fixation test; bacterial
        cultures; cervical puncture; CSF assessment; lymph node biopsy; cranial nerve assessment;
        computer tomographic angiogram
    Gold Answer: spinal tap / cerebrospinal fluid analysis

  Topic 54
    EMAT: Treatment
    Diagnosis: community acquired pneumonia (cap)
    Answers: continuous positive airway pressure; antibiotics; moxifloxacin; fentanyl; levofloxacin;
        zosyn; vancomycin; fluid resuscitation; combination therapy
    Gold Answer: antibiotics

highest-ranked answers produced by our approach for each topic previously shown in Figure 2.1 as
well as the (1) the candidate answer(s) produced by the TREC topic authors and (2) the held-out
diagnosis.
As shown, for Topics 33 and 42, we obtain the correct answer at the highest rank, while for
Topic 54 we obtain the correct answer at the second rank. In the case of Topic 33, this is because
pulmonary embolisms were mentioned with high frequency in relevant scientific articles, and were
often diagnosed for patients with many of the signs/symptoms and tests as indicated by the topic.
The answers obtained at lower ranks include other cardiovascular conditions such as deep vein
thrombosis (DVT) and heart attack (myocardial infarction), with a handful of less severe co-morbid
conditions such as low blood flow (ischaemia) and high blood pressure (hypertension). When
answering Topic 42, our system produced a synonym spinal puncture for the correct answer of
spinal tap. The second-ranked answer, Gram’s stain, is a test used to distinguish between types
of bacteria in a collected sample or culture. Note that cerebrospinal fluid (CSF) culture and CSF

assessment from the correct answer were obtained at ranks three and seven. This indicates that

although our method was able to identify the most important medical concept for diagnosing bacterial

meningitis, it was not as effective at ranking other highly-related concepts. It should be noted

that Topic 42 was one of only three topics with an expected medical answer type (EMAT) of Test

in which the correct answer was produced at rank one. For the majority of test topics, the most

commonly described tests related to the topic were ranked at the highest positions, rather than the

tests most related to the topic. Finally, for Topic 54, we obtained the correct answer antibiotics

at the second rank, followed by a large number of specific antibiotics (e.g., moxifloxacin, fentanyl,

levofloxacin, etc.). Interestingly, the highest-ranked treatment for Topic 54 was continuous positive

airway pressure, or CPAP. Although CPAP is a treatment used to keep the airways open for patients

with respiratory problems, it only treats the symptoms rather than the underlying pneumonia. This

suggests a possible area for future research: understanding which treatments target the disease itself

(such as antibiotics) and which treatments are designed to alleviate symptoms of the disease (such

as CPAP).

Overall, we found two main sources of errors in the answers produced by our system: (1) a

failure to account for more fine-grained relationships between individual tests and treatments and

the specific medical problems the treatments are targeting, as well as (2) the inability to account for

counter-indicated medical treatments – medical treatments which are known to produce adverse

effects in certain situations. Unlike electronic medical records (EMRs), many scientific articles

begin with a review of recent literature and the current knowledge of their subject, often presenting

negative findings or counter-indicated treatments, such as medications which are typically pre-

scribed for a disease but are known to produce adverse reactions in specific situations. This type of

counter-indicated information is not negated, or speculated, and suggests the need for identifying

either more fine-grained assertions, or detecting counter-indication relations in scientific articles

and electronic medical records.

2.4.2 Medical Article Retrieval Evaluation

The performance of our approach when automatically identifying and ranking scientific articles
relevant to each medical topic was measured by relying on the relevance judgments produced for the 2015 TREC-CDS topics by Oregon Health and Science University (OHSU). For the 2015 topics, a total of 37,807 topic-article pairs were judged as either (1) relevant, (2) partially relevant, or (3) non-relevant. Physician students provided relevance judgments over a pool formed, for each topic, from the twenty top-ranked articles as well as a 20% random sample of the articles retrieved between ranks 21 and 100 by any participating team in the TREC-CDS task.
We followed the official TREC-CDS evaluations, by not distinguishing between relevant and
partially relevant articles when measuring the performance of our system (i.e., we considered
only binary relevance). We measured the quality of the ranked list of scientific articles produced
by our approach using four information retrieval metrics also used in TREC: (1) the inferred
Average Precision (inf. AP), wherein retrieved articles were randomly sampled and the Average
Precision was calculated as in Yilmaz and Aslam (2006); (2) the inferred Normalized Discounted
Cumulative Gain (inf. NDCG), wherein retrieved articles were randomly sampled and the NDCG
was calculated as per Yilmaz et al. (2008); (3) the R-Precision, which measures the precision of the
R highest-ranked retrieved documents, where R is the total number of relevant documents for the topic;
and (4) the Precision of the first ten documents retrieved (P@10) (Manning et al., 2008).
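The two set-based metrics can be sketched directly, as shown below; the document identifiers are placeholders, binary relevance is assumed as in the official evaluation, and the sampling-based inferred AP and inferred NDCG estimators cited above are not reproduced here.

# Hedged sketch: R-Precision and Precision@10 for a single topic with binary relevance.
def precision_at_k(ranked_ids, relevant_ids, k):
    """Fraction of the top-k retrieved documents that are relevant."""
    return sum(1 for doc in ranked_ids[:k] if doc in relevant_ids) / k

def r_precision(ranked_ids, relevant_ids):
    """Precision of the top R documents, where R is the number of relevant documents."""
    r = len(relevant_ids)
    return precision_at_k(ranked_ids, relevant_ids, r) if r > 0 else 0.0

ranked = ["a17", "a02", "a33", "a41", "a08", "a19", "a55", "a60", "a11", "a73"]  # placeholder ids
relevant = {"a02", "a33", "a90", "a11"}
print("P@10   =", precision_at_k(ranked, relevant, 10))   # 0.3
print("R-Prec =", r_precision(ranked, relevant))          # 0.5 (2 of the top R = 4 are relevant)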
We compared the quality of ranked scientific articles produced by our system using each answer
inference method applied to each type of medical knowledge sketch, and found that the best
performance was obtained when using Interpolated Smoothing for answer inference and Z2 (t, l) as
the medical knowledge sketch. Consequently, for readability, in Table 2.5 we present the results
when (1) using interpolated smoothing for answer inference on each type of medical knowledge
sketch,a s well as (2) when using each type of answer inference for Z2 (t, l). Moreover, Table 2.5 also
indicates the performance obtained using five baseline information retrieval models: BM25 relied
on the Okapi-BM25 (Robertson et al., 1995) (k 1 = 1.2 and b = 0.75) relevance model; TF-IDF

Table 2.5. Performance results obtained for the system reported in this paper (Q/A-CDS) when
using each type of medical knowledge sketch, method for answer inference, and relevance model
as well as the iNDCG obtained by the State-of-the-Art (SotA) automatic and manual systems
submitted to TREC.

                             inf. AP   inf. NDCG   R-Prec   P@10
★ Baseline: BM25              .042       .204       .163    .387
  Baseline: TF-IDF            .041       .197       .169    .350
  Baseline: LMJM              .040       .193       .151    .357
  Baseline: LMDir             .043       .203       .170    .360
  Baseline: DFR               .039       .197       .167    .333
  Z1 (t)                      .006       .010       .020    .062
★ Z2 (t, l)                   .147       .434       .344    .722
  Z3 (t, s)                   .018       .114       .081    .190
  Exact Inference             .063       .167       .154    .410
  Pair-wise Smoothing         .128       .382       .330    .610
★ Interpolated Smoothing      .147       .434       .344    .722
  Bethe Approximation         .140       .432       .336    .701

used the standard term frequency-inverse document frequency vector retrieval relevance model;

LMJM and LMDir leveraged language-model ranking functions using Jelinek-Mercer (λ = 0.5)

or Dirichlet (µ = 2,000) smoothing (Zhai and Lafferty, 2001), respectively; and DFR considered

the Divergence from Randomness framework (Amati and Van Rijsbergen, 2002) with an inverse

expected document frequency model for information content, a Bernoulli-process normalization of

information gain, and Zipfian term frequency normalization. We also compare our performance

against the top-performing systems for the 2015 TREC-CDS evaluation for Tasks A and B. Note

that, as in the official evaluation, we distinguish between automatic systems which involved no

human intervention, and manual systems in which arbitrary human intervention was allowed.

For Task A (in which no explicit diagnosis was provided), the best inf. NDCG reported for an automatic system was 0.294 and the best for a manual system was 0.311, while for Task B (in which an explicit diagnosis was given for each topic focusing on a medical test or treatment), the

best inf. NDCG reported for an automatic system was 0.382 and the best reported for a manual

system was 0.381. Please note that our approach was designed for Task A and, consequently,

does not consider the explicit diagnoses given in Task B. Moreover, our approach incorporates a

very basic form of query expansion (described in Section 2.3 on page 37) and a simple relevance

model (BM25) while many of the top-performing systems submitted to the TREC-CDS task relied

on significantly more complex methods for query expansion and often incorporated additional

information retrieval components (e.g., pseudo-relevance feedback, rank fusion) which were not

considered in our approach (Roberts et al., 2016).

Clearly, the best performance obtained by our system (denoted with a ‘★’) relies on (1) the medical knowledge sketch obtained by considering the topic t and scientific article l (Z2 (t, l)) and (2) the interpolated-smoothing method for answer inference. Likewise, we found that the best relevance

model for our approach (based on the performance of each baseline) was the BM25 ranking function.

There was no statistically significant difference in the performance measured when applying the

Interpolated Smoothing or the Bethe Approximation methods for answer inference (as observed

when measuring the accuracy of answers produced by our approach). Clearly, as illustrated in

Table 2.5, our approach enabled significantly higher quality scientific article retrieval than the top

reported systems for each task (Simpson et al., 2014). Specifically, we measured a 49% increase in

inferred NDCG compared to the best reported automatic system (Balaneshin-kordan et al., 2015)

and measured a 40% increase in inferred NDCG compared to the best reported manual system

(Balaneshin-kordan et al., 2015) when considering Task A. When comparing our approach to Task

B, in which an explicit diagnosis was provided with every topic with an EMAT of test or treatment,

we measured a 14% increase in inferred NDCG compared to the best reported automatic (Song

et al., 2015) and manual (You et al., 2015) systems. We believe that the difference in performance

increase observed between our approach and the state-of-the-art approaches across task A and B

indicates that our approach was often able to infer the correct diagnosis in Task A. Moreover, we also

observed a clear increase in performance when comparing our approach to the state-of-the-art for

Task B, which suggests that our approach was able to infer additional semantically meaningful medical concepts beyond the explicit diagnosis, thereby improving the relevance of retrieved scientific articles. This, in turn, further suggests that the relevant articles in the TREC-CDS task

contain answers that were not always in the candidate answer set. Overall, we believe that the high

performance of our approach clearly demonstrates the impact of incorporating medical question

answering from knowledge bases to improve clinical decision support.

2.4.3 Medical Knowledge Evaluation

There is no clear way to measure the “accuracy” of the clinical picture and therapy graph (CPTG)

(as the edges between concepts do not indicate a single, direct semantic relationship). Con-

sequently, we report the structure and connectivity of the CPTG. The CPTG contained 634

thousand nodes and 13.9 billion edges where 31.2% of all nodes were diagnoses, 21.84% were

signs or symptoms, 23.62% were medical tests, and 23.34% of nodes were medical treatments.

The distribution of assertions associated with medical concepts in the CPTG is: 13.1% were

ABSENT, 0.01% were ASSOCIATED-WITH-SOMEONE-ELSE, 1.13% were CONDITIONAL,

33.31% were CONDUCTED, 17.05% were HISTORICAL, 0.72% were HYPOTHETICAL, 8.37%

were ONGOING, 1.04% were ORDERED, 0.55% were POSSIBLE, 1.12% were PRESCRIBED,

22.34% were PRESENT, and 0.89% were SUGGESTED. To evaluate the CPTG, we evaluated

only the quality of the nodes of the CPTG which represent medical concepts and their assertions.

We were unable to evaluate the edges of the CPTG because there are no medical knowledge bases

or ontologies that encode relations between medical concepts qualified by assertions.

The evaluation of the quality of the nodes encoded in the CPTG considered the F1-scores when (1) detecting the boundaries of medical concepts; (2) detecting the type of medical concepts; and (3) identifying assertions. To perform the evaluation we considered the 72,846 gold-standard annotations provided with the 2010 i2b2/VA shared-task. On that data, our system, as reported in (Roberts and Harabagiu, 2011), obtained an F1-score of 83.45% for boundary detection, 95.49%

for concept type detection and 93.94% for assertion recognition. However, as noted in Section 2.3

on page 37, the i2b2 annotations did not indicate whether medical problems were signs/symptoms
or diagnoses, and did not include assertions for medical tests or treatments. Consequently, we
performed 2,349 additional annotations on EMRs from MIMIC III. We performed 10-fold cross validation which allowed us to compute the F1-scores when detecting medical concept boundaries as 81.22%, whereas the F1-score when detecting medical concept types was 85.99% and the F1-score for identifying assertions was 75.99%. It is obvious that the new concept types and assertions
that were annotated impacted the performance of the automatic medical concept and assertion
identification system. We believe that the performance may be improved as more annotations
become available for training.

2.5 Summary and Lessons Learned

In this chapter, we detailed the knowledge representations considered in a novel medical Q/A
framework that can be used in CDS systems for recognizing relevant biomedical articles and
pinpointing the answers to questions about complex medical cases. In order to answer medical
questions about complex medical cases, we introduced the notion of medical knowledge sketches,
which capture the clinical background of the medical case. We have presented three forms of medical
knowledge sketches and shown how they can be used to infer answers. Moreover, we have shown
how four different probabilistic inference methods operate on the medical knowledge acquired from
a vast EMR collection, reflecting knowledge pertaining to medical practice. We also introduced
three novel article relevance models, informed by answers, which are used to retrieve relevant
biomedical articles.
In our experiments we considered all twelve combinations of medical knowledge sketches and
probabilistic inference methods and the results indicated surprisingly high MRR scores of the
answers when evaluating the questions from the 2015 TREC-CDS task. Although the questions
were related to complex medical cases, the results that were obtained rivaled the performance of
Q/A results obtained for simpler, factoid questions. The best results were obtained when the medical

knowledge sketch considered all the medical knowledge discerned from an entire biomedical article
as well as the medical knowledge discerned from the description of the medical case, while using
the Interpolated smoothing method of probabilistic inference (Bethe approximation performed
equally well). Moreover, when the answers of a medical question are known, they inform the ranking of relevant articles from PubMed with an 86.5% increase in inferred Average Precision compared to current state-of-the-art systems evaluated in the most recent TREC-CDS.
To our knowledge, this is the first work that considers medical Q/A from a knowledge base by employing four probabilistic inference methods. It is also the first attempt at representing knowledge
as (a) medical knowledge sketches and (b) a clinical picture and therapy graph. Possible avenues for
future work include (1) automatically recognizing semantic attributes such as severity, temporality,
etc. and incorporating them into the knowledge graph and (2) considering the roles of different
sections, rather than paragraphs or entire articles when constructing the medical knowledge sketch.

CHAPTER 3

PATIENT COHORT RETRIEVAL

Authors – Travis R. Goodwin, and Sanda M. Harabagiu

The Department of Computer Science, EC 31

The University of Texas at Dallas

800 West Campbell Road

Richardson, Texas 75080-3021

Minor revision, with permission, of Travis R. Goodwin and Sanda M. Harabagiu, Graphical
Induction of Qualified Medical Knowledge, International Journal of Semantic Computing (IJSC),
Vol. 7, Issue 4, pp. 377 – 405. ©World Scientific, 2013. doi:10.1142/S1793351X13400126.

Massive warehouses of electronic health records (EHRs) contain a wealth of medical knowledge

that is expressed by physicians and health care professionals when reporting on patient visits. Hospitals throughout the United States and other countries process millions of EHRs annually.

The notes within these EHRs typically include a variety of clinical information, including medical

history, physical exam findings, lab reports, radiology reports, operative reports, as well as discharge

summaries. Information about a patient’s medical problems, treatments, and clinical course is also

available from EHRs. This information is essential for conducting comparative effectiveness

research, defined in a brief report from the National Institute of Medicine published in 2009 as the

generation and synthesis of evidence that compares the benefits and harms of alternative methods to

prevent, diagnose, treat and monitor a clinical condition or to improve the delivery of care (Ratner

et al., 2009).

Essential to uncovering knowledge that enables comparative studies is the capability to auto-

matically process the large EHR repositories by identifying medical concepts and the relations

between them. However, medical concepts are not mentioned in EHRs without an associated degree of belief held by the physician. When physicians write about medical concepts, they often incorporate hedging or other linguistic means of expressing their opinion, in lieu of strict facts. Medical science involves formulating hypotheses, experimenting with treatments, and reasoning from medical evidence.

Consequently, clinical writing reflects this modus operandi with a rich set of speculative statements

(Edinger et al., 2012; Chapman et al., 2011; Cohen and Hersh, 2005).

By taking this observation into account, we decided to explore a knowledge representation that

(1) takes into account the physician’s degrees of belief – qualifications of the medical concepts

mentioned in EHRs; and that (2) can be acquired automatically from a large corpus of EHRs. Our

work considers that all medical concepts within an EHR fall within the categories of (1) medical

problems (e.g., LUNG CANCER), (2) medical tests (e.g., CT – indicating an X-ray computed

tomography scan), (3) medical treatments (e.g., TYLENOL), or (4) infectious agents (e.g., MRSA

– Methicillin-resistant Staphylococcus aureus). In order to capture the belief values that physicians

Figure 3.1. Visualization of the Qualified Medical Knowledge Graph

express with regards to medical concepts, we have considered (a) six types of assertions1 that

were used to qualify the state of a patient’s medical problem in the 2010 i2b2/VA challenge; (b)

three additional assertions that qualify a patient’s treatments, (c) an assertion that applies onto

medical tests, and (d) a new assertion that applies to medical problems, treatments, and tests. This

classification follows the framework devised in the 2010 i2b2/VA challenge (Uzuner et al., 2011),

which tasked participants with categorizing medical concepts as problems, treatments, or tests and

with classifying the assertion for each medical concept.

1In this chapter, we refer to assertions and belief values interchangeably. An “assertion” is considered to be one
possible belief value.

By capturing the assertions associated with medical concepts, we are able to build a novel form

of medical knowledge, which we call “qualified medical knowledge.” We organize this knowledge

into a graph, which we call the qualified medical knowledge graph (QMKG). As illustrated in

Figure 3.1, the edges of the QMKG have different weights which are derived automatically by

considering metrics of context cohesion through a battery of similarity measures. In this figure,

a hypothetical visualization is provided highlighting a few example nodes (such as [steroids /

TREATMENT / CONTINUING]) and the weighted edges between them.

In this chapter, we are also concerned with the utility of the QMKG. For this purpose, we

perform an extrinsic evaluation to ascertain the quality of the QMKG. We evaluate the extrinsic

utility of the graph by using the QMKG to improve the quality of patient cohort retrieval by enabling

a method for query expansion that is based on the weighted structure of the QMKG.

The remainder of this chapter is structured as follows. Section 3.1 presents the framework

for constructing the QMKG, and Section 3.2 illustrates the process of automatically identifying

medical concepts, their concept type, and their belief values from EHRs. Each medical concept,

along with associated concept type and assertion value becomes a node in our qualified medical

knowledge graph (QMKG). Section 3.3 details the way in which edges of the QMKG are learned

while Section 3.4 presents the evaluation of our assertion classification and the utility of QMKG

when applied to patient cohort retrieval. Section 3.5 presents the results and discussion while

Section 3.6 summarizes the lessons learned.

3.1 Constructing the Qualified Medical Knowledge Graph

In this chapter, we present an automatic method for generating the QMKG by processing a large

corpus of EHRs. The vertices in our QMKG are triples of the form: (1) lexicalized medical

concepts (e.g., PNEUMONIA), (2) the associated medical concept type (e.g., PROBLEM), and (3)

the belief value held by the author of the EHR concerning the associated medical concept (e.g.,

Figure 3.2. Examples of medical concepts and their associated graph nodes, which were generated by processing the EHRs. For instance, from the sentence “Cultures from the wound grew Pseudomonas aeruginosa and Enterococcus. We started vancomycin.”, the nodes cultures/ TEST/ CONDUCTED, wound/ PROBLEM/ PRESENT, pseudomonas aeruginosa/ INFECTIOUS AGENT/ PRESENT, enterococcus/ INFECTIOUS AGENT/ PRESENT, and vancomycin/ TREATMENT/ CONTINUING are generated.

HYPOTHETICAL). The edges in our graph are weighted in order to indicate the cohesive strengths

between their associated vertices, as indicated in EHRs.
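A minimal sketch of such a vertex is given below; the class and field names are illustrative choices for exposition and are not the data structures of our actual implementation.

# Hedged sketch: a QMKG vertex as a (concept, type, assertion) triple.
from dataclasses import dataclass

@dataclass(frozen=True)
class QualifiedMedicalConcept:
    concept: str      # lexicalized medical concept, e.g., "pneumonia"
    ctype: str        # PROBLEM, TREATMENT, TEST, or INFECTIOUS AGENT
    assertion: str    # belief value, e.g., PRESENT, ABSENT, HYPOTHETICAL

node = QualifiedMedicalConcept("pneumonia", "PROBLEM", "HYPOTHETICAL")
# Because the dataclass is frozen (and therefore hashable), such triples can be used directly
# as vertex keys in the adjacency and similarity matrices described later in this chapter.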

The content of each node of the graph is provided by (i) an automatic method of identifying

medical concepts in EHRs, and (ii) an automatic method of asserting the belief value of the

respective medical concept. Table 2.2 on page 41 lists the assertions that we considered for

qualifying the physician’s belief status as well as their definitions. Figure 3.2 illustrates sentences

from EHRs with their associated vertices in the QMKG. Clearly, belief values associated with

medical concepts mined from EHRs encode a new form of semantic knowledge which can enable

several forms of reasoning.

To connect the vertices from the QMKG, we devised a method that considers that ontological

relatedness can be captured from the cohesive properties of EHRs. When using medical concepts

to generate the narrative portion of an EHR, a physician creates cohesive text by mentioning

related medical concepts. Thus, we assume that an edge between two graph nodes exists if the

corresponding medical concepts co-occur within a window of λ tokens within the same EHR.

For our experiments, we set λ = 20 based on our own observations about the average sentence

length in our collection of EHRs. This idea was inspired by the SympGraph methodology reported

in Sondhi et al. (2012) which models symptom relationships in clinical notes. In addition to

symptoms, the medical concepts that we recognize include diseases, injuries, and other types of

concepts that represent a medical problem. In addition, the graph also encodes treatments and

medical tests within vertices. The co-occurrence relations indicate links between the nodes in the

QMKG, and these edges learn non-uniform weights. In our extrinsic evaluation of the QMKG, the

weights of the edges in the QMKG inform query expansion for patient cohort retrieval. Within

the QMKG, edges are generated according to co-occurrence information within the collection

of EHRs. The weights of these edges, however, are calculated according to various first-order

similarity measures, such as PMI, Lin’s discounted PMI, the normalized Google distance, and

Fisher’s exact test. Additionally, an n-order similarity model is introduced such that paths of n

transitive co-occurrences are considered. Due to the large dataset used, we cast the task as a Big Data

problem and present a method for constructing the edges and weights of the QMKG according to

the MapReduce model.

We are also concerned with the utility of the graph. For this reason we conduct an extrinsic

evaluation by investigating the performance impact of applying the QMKG for query expansion

on the task of patient cohort EHR retrieval. In order to expand such queries, we focused on

learning the weights of the edges within the QMKG. We did this because the selection of expanded

terms for enhancing a query is based on the assumption that the weight connecting a vertex to its

neighbors indicates the strength of the relationship between those concepts, and thus the relative utility of a potential expansion. Moreover, the same medical concept, when qualified by different belief states (assertions), will correspond to different vertices, which will in turn have varying weights on their edges.
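A minimal sketch of this weight-driven expansion idea is given below: each query concept proposes its most strongly connected neighbors as candidate expansion terms. The dictionary layout, weights, and cutoff are illustrative simplifications, not the random-walk expansion technique evaluated later in this chapter.

# Hedged sketch: proposing expansion terms from the weighted neighbors of each query concept.
def expand_query(query_nodes, edge_weights, top_k=5):
    """edge_weights maps a qualified concept to a {neighbor: weight} dictionary."""
    expansions = set()
    for node in query_nodes:
        neighbors = edge_weights.get(node, {})
        best = sorted(neighbors, key=neighbors.get, reverse=True)[:top_k]
        expansions.update(best)
    return expansions - set(query_nodes)

# Illustrative weights only; real weights come from the similarity measures of Section 3.3.
edge_weights = {
    ("heart failure", "PROBLEM", "PRESENT"): {
        ("nonischemic cardiomyopathy", "PROBLEM", "PRESENT"): 14.2,
        ("intravenous lasix", "TREATMENT", "ONGOING"): 13.7,
    }
}
print(expand_query([("heart failure", "PROBLEM", "PRESENT")], edge_weights))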

We use our cohort identification system (Goodwin et al., 2011, 2012) developed for the evalua-

tion of the TRECMed challenges in 2011 and 2012. We find that the queries expanded based on the

graph we present in this chapter produce cohorts that are 23.7% more accurate than those obtained

without access to the information encoded within the graph, according to the percent difference

between the performance of our system using the random-walk query expansion technique based

on the QMKG and the performance of our system without using any query expansion based on the

QMKG. To learn the weights of the co-occurring links, we have considered 4 different techniques

(discussed in Section 3.3) – PMI, Lin’s smoothed PMI (Pantel and Lin, 2002), Fisher’s exact test,

and the normalized Google distance (Cilibrasi and Vitányi, 2004) – and found that the best results

were obtained when the PMI similarity measure was used as reported in Goodwin and Harabagiu

(2013a).

Furthermore, we believe that the QMKG can also be used for learning how to best rank the

patient cohorts and to rely on the feedback from medical experts. Note that, like in Sondhi et al.

(2012), the graph can dynamically update when new EHRs are considered.

3.2 Generating the Nodes of the Qualified Medical Knowledge Graph

The automatic identification of medical concepts and their assertions in the narrative portion of

EHRs benefits from existing clinical ontological resources as well as several methods of automatically identifying concepts in EHRs. As medical concepts are expressed in natural language, the first

choice was to consider a resource where lexico-semantic medical knowledge is encoded, such as

the Unified Medical Language System (UMLS)2 (Schuyler et al., 1993). Open-source software,

such as MetaMap (Aronson, 2001) or, more recently, cTAKES (Savova et al., 2010), can parse the

topics and the EHRs to assign concept unique identifiers (CUIs) which correspond to entries in

UMLS. However, the semantic network available from UMLS involves a large set of concepts that

were organized by ontological principles, rather than the latent semantics that can be derived from

the large corpus of EHRs. In order to decide on the conceptual representation, we also considered

the more general framework developed by the 2010 i2b2 challenge reported in Uzuner et al. (2011).

The objective of this framework is to identify medical concepts in clinical texts and, moreover, to assign

several possible values to capture the degree of belief associated with the medical concepts. Because

so many lexico-semantic resources exist for processing clinical texts, i2b2 proposed a challenge to

2UMLS is a database of medical terms and the relationships between them sponsored by the NIH.

find which resources and which features produce the best results for recognizing medical concepts.

But, more importantly, the 2010 i2b2 challenge brought to the forefront of research in medical

informatics the problem that recognizing medical concepts alone is not sufficient. When medical

concepts are used in clinical documents, physicians also express assertions about those concepts,

namely that a medical problem is present or absent, that a treatment is conditional on a test,

or that the clinician is uncertain about a medical concept. The i2b2 2010 challenge considered

assertions only for medical problems. In our research, for retrieving patient cohorts, we have

extended the problem of assertion classification in two ways: first, we have produced assertions

(or belief values) for all medical concepts that we have automatically identified; second, we have

considered six additional values which are defined in Table 2.2. Moreover, we also considered four

different forms of medical concepts: (1) medical problems, (2) medical treatments, (3) medical

tests, and (4) infectious agents. The fourth type of concept, infectious agents, was introduced by

us due to our interest in preparing the QMKG for applications in the area of infectious diseases.

We are using these concept types after extensive discussion with clinicians at The University of

Texas Southwestern Medical Center. We have included in our discussions clinicians specializing in

infectious diseases, internal medicine, and surgery. We also involved the creators of the infectious

disease ontology3 which concurred with our concept type choices. For the purpose of this chapter,

infectious agents are considered to be a type of problem.

3.2.1 Medical Concept Recognition

Medical concepts were recognized in a four-step process based on the methods reported in Roberts

and Harabagiu (2011):

Step 1: Preprocessing and feature extraction

Step 2: Feature selection

3The infectious disease ontology (IDO) is available at http://infectiousdiseaseontology.org/.

Figure 3.3. The architecture of the concept identification method.

Step 3: Detection of textual boundaries within the text referring to each medical concept;

Step 4: Classification of each medical concept into (a) medical problems, (b) medical treat-

ments, (c) medical tests, or (d) infectious agents.

In Step 1, both the structured and unstructured text is preprocessed to produce typical natural language annotations such as token lemmas, part-of-speech information, and phrase chunks; in addition, a variety of entities are recognized, such as dates, dosages, measurements, percentages, names, times, diseases, lists, and ages. Features are then extracted based on combinations of this preprocessed information, as well as knowledge from several lexico-semantic resources, as illustrated in Figure 3.3.

In addition to UMLS (Bodenreider, 2004) and MetaMap (Aronson, 2001), the Genia annotations

(Kim et al., 2003) which were used for Biomedical text mining were incorporated. Lexico-semantic

information, especially aimed at identifying lemmas and multi-word expressions, was mined from

WordNet. Additional information about lemmas, part-of-speech, and phrasal chunks as well as

names of entities was provided by the Genia tools. Concept type information from Wikipedia was

also used. As we extended the types of medical concepts to include infectious agents, we also

incorporated a lexicon of infectious agents and acronyms. This information is evaluated both at an individual token level and for contexts of varying numbers of tokens (e.g., two-token spans, three-

token spans, etc.) to create an extremely large feature space (2^22 features). Because this space is

very large, Step 2 utilizes a feature selection method based on a greedy forward strategy. By taking

a “greedy approach,” we repeatedly selected the feature which produced the highest score when

added to the current feature set. Thus, we determined the feature set which achieves the highest

score on the i2b2 training dataset, as well as a small portion of annotations which we produced on

the EHR collection used during TRECMed. The selected feature set, illustrated in Figure 3.4, was,

surprisingly, identical to the one reported in Roberts and Harabagiu (2011).
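A minimal sketch of this greedy forward strategy appears below; the evaluation function is assumed to train the classifier with a candidate feature set and return a validation score (e.g., an F1-score on held-out annotations), and is only a placeholder here.

# Hedged sketch: greedy forward feature selection over candidate feature templates.
def greedy_forward_selection(candidate_features, evaluate):
    """Repeatedly add the single feature that most improves the validation score."""
    selected, best_score = [], float("-inf")
    improved = True
    while improved and candidate_features:
        improved = False
        scored = [(evaluate(selected + [feature]), feature) for feature in candidate_features]
        score, feature = max(scored)
        if score > best_score:
            best_score, improved = score, True
            selected.append(feature)
            candidate_features.remove(feature)
    return selected, best_score

# Usage (placeholder): selected, score = greedy_forward_selection(["G1", "G2", "G3"], evaluate)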

Using the reduced feature set determined by feature selection, we then detected concept boundaries both within the narrative of the report and within the structured fields (e.g., CHIEF COMPLAINT). Thus, Step 3 involves two different classifiers implemented as conditional random fields (CRFs), which were trained on the i2b2 annotations and applied to the TRECMed documents as test data to extract medical concepts.

Finally, the decision of the medical concept type is made in Step 4, where a single support

vector machine (SVM) classifier is used. The individual feature sets used by each classifier (the

two CRFs and the SVM) are shown in Figure 3.4. Additionally, this figure shows that the concept

type recognition benefits from features extracted from UMLS, as well as Wikipedia, along with

features provided by semantic parsing of the EHRs based on the PropBank annotations (Kingsbury

and Palmer, 2002).
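To illustrate how a concept-type classifier of this kind could be assembled from off-the-shelf components, a small sketch using scikit-learn is shown below; the feature dictionaries, labels, and UMLS semantic-type codes are toy placeholders loosely modeled on features H1–H8, and this is not the exact learning setup of our system.

# Hedged sketch: a linear SVM over sparse dictionary features for concept-type classification.
from sklearn.feature_extraction import DictVectorizer
from sklearn.pipeline import make_pipeline
from sklearn.svm import LinearSVC

train_features = [
    {"uncased_words": "pneumonia",  "umls_type": "dsyn", "prev_lemma": "have"},
    {"uncased_words": "vancomycin", "umls_type": "antb", "prev_lemma": "start"},
    {"uncased_words": "ct",         "umls_type": "diap", "prev_lemma": "order"},
]
train_labels = ["PROBLEM", "TREATMENT", "TEST"]

classifier = make_pipeline(DictVectorizer(sparse=True), LinearSVC())
classifier.fit(train_features, train_labels)
print(classifier.predict([{"uncased_words": "tylenol", "umls_type": "phsu", "prev_lemma": "give"}]))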

3.2.2 Assigning Belief Values to Medical Concepts

In order to properly encode the medical knowledge in the QMKG, we also needed to automatically

identify whether a medical concept is qualified by any of the assertions given in Table 2.2. To

Features for determining concept boundaries in “non-prose” (structured) sections: F1. uncased word; F2. pattern-based entity; F3. uncased prev. word; F4. prev. word POS; F5. 3-token POS context; F6. MetaMap type; F7. MetaMap CUI.

Features for determining concept boundaries in “prose” sections: G1. word lemma; G2. prev. word; G3. uncased prev. word; G4. 2-char suffix; G5. prev. POS; G6. 1-token POS context; G7. UMLS concept parents; G8. MetaMap type; G9. GENIA lemma; G10. GENIA entity type; G11. GENIA phrase chunk; G12. prev. GENIA POS; G13. prev. GENIA lemma; G14. prev. GENIA phrase chunk; G15. next GENIA lemma.

Features used for determining concept types in all sections: H1. uncased words; H2. 4-char prefix; H3. prev. lemma; H4. next lemma; H5. uncased prev. bigram; H6. SRL pred. + arg. type; H7. UMLS concept type; H8. Wiki. concept type.

Figure 3.4. Feature sets used for determining concept boundaries and types

be able to automatically identify such assertions, we cast this problem as a classification problem,
implemented as an SVM-based assertion classifier which uses a selected set of features aggregated
from (a) lexical features of the context of the medical concept, (b) the medical concept type identified
in the EHR on which the assertion is produced, (c) the meta data available in the section header
where the assertion is implied, (d) features available from UMLS (extracted by MetaMap), and (e)
features reflective of negated statements, disclosed through the NegEx negation detection package.
Additionally, a special class of features that provide belief values is available from the General Inquirer’s category information, which encodes uncertainty. The General Inquirer (Stone et al.,
1966) is the first general-purpose computerized text analysis resource developed in psychology,
relying on the Harvard psychological dictionaries. For the detection of assertions, we performed

Figure 3.5. The architecture of the assertion classification method

additional annotations on the TRECMed clinical data, to provide training data for six additional
values which are marked with an ‘*’ in Figure 3.5. The assertion classifier was re-trained and the
same 27 features reported in the state-of-the-art assertion identification method from Roberts and
Harabagiu (2011) were selected, as shown in Figure 3.6. Features A1-A7 used an assertion n-gram
function ANG(x, y, z) where x ∈ P, F is the correlation metric – point wise mutual information or
fisher’s exact test, respectively; y is the context, and z is the minimum count.

3.3 Constructing the Edges of the Qualified Medical Knowledge Graph

An intuitive means of constructing a medical knowledge graph is to create a node for each en-
countered medical concept and associated assertion value in the corpus (henceforth referred to as
a “qualified medical concept”), and an edge e = (u, v) between qualified medical concepts u and
v if and only if they co-occur within the same context. To capture the relations spanning medical
concepts in the QMKG, we address two problems: (1) whether or not to create an edge between a
pair of nodes in the QMKG, and (2) how to determine the strength of connections between every

Binary features: B1. whether sentence contains pattern-based entity; B2. medical concept is in UMLS; B3. medical concept is detected by MetaMap with score ≥ 800; B4. prev. word has category IF in Harvard inquirer; B5. uncased prev. bigram is a stopword; B6. prev. assertion in the document is hypothetical.

Non-binary features: N1. section name; N2. other medical concepts from the same sentence; N3. type of prev. assertion in the same sentence; N4. type of prev. assertion within 5 tokens; N5. type of prev. assertion along with all the tokens occurring between its associated concept and the current concept; N6. the modifier used by NegEx; N7. the part-of-speech label assigned to the next word; N8. tokens constituting the associated medical concept; N9. next word; N10. prev. word; N11. uncased next word; N12. uncased prev. word; N13. uncased prev. bigram; N14. all tokens in the sentence containing this concept.

N-gram features: A1. max ANG(P, sent, 10) is ABSENT; A2. max ANG(F, sent, 10) is ASSOCIATED WITH ANOTHER; A3. max ANG(P, sent, 3) is HYPOTHETICAL; A4. types in ANG(F, sent, 3) with log value ≤ −100; A5. types in ANG(P, sent, 3) with value ≤ −1; A6. types in ANG(F, 5 tokens, 3) with log value ≤ −2; A7. n-grams in ANG(F, 5 tokens, 3) with log value ≤ −2.

Figure 3.6. Feature set used for assertion classification

two adjacent nodes in the QMKG. These two problems are treated by (a) generating an adjacency

matrix which represents both the connections and their strength; and (b) by generating several sim-

ilarity matrices which allow us to use the QMKG for query expansion and thus evaluate the utility

of the QMKG. To generate the adjacency matrix, we consider all possible contexts in the EHR,

represented as windows of 20 words based on our own observations about the average sentence

length in our collection of EHRs. Thus, the EHR collection is viewed as a sequence of contexts

c1, c2, . . . , cN , where c1 starts at word w1 and ends at word w20 in EHR 1, c2 starts at word w2 and ends at word w21 in the same EHR, and so on.

Given that the total number of medical concepts identified in the EHRs represents |V |, where

the QMKG = (V, E), in order to generate the edges from E we devised an iterative manner of

computing the elements of the adjacency matrix A^{(i)} ∈ R^{|V|×|V|}. We initialize A^{(0)} by considering that no pairs of concepts are initially connected, and thus, ∀u, v ∈ V, A^{(0)}(u, v) = 0. The following iterations follow Equation (3.1):

A^{(i)}(u, v) = A^{(i-1)}(u, v) + \begin{cases} 1 & \text{if } u \text{ and } v \text{ co-occur in context } c_i \\ 0 & \text{otherwise} \end{cases}    (3.1)

To interpret the contents of this adjacency matrix at iteration i, we note that the u-th row of A^{(i)}, denoted as A^{(i)}(u, ∗), represents the vector listing the number of times that medical concept u has co-occurred with any other medical concept in the first i contexts extracted from the EHRs. Likewise, the v-th column of A^{(i)}, denoted as A^{(i)}(∗, v), represents the vector listing the number of times any medical concept has co-occurred with medical concept v.

To construct the edges of the QMKG = (V, E), we iterate the construction of the adjacency matrix N times, where N is the total number of contexts we considered across the entire EHR collection. We can infer that there is an edge between a concept u and a concept v if A^{(N)}(u, v) ≠ 0. Additionally, we also infer the strength of these connections.
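A minimal sketch of this counting procedure is shown below. It slides the context window over the sequence of extracted concept nodes rather than over raw tokens, and stores the counts in a sparse dictionary instead of a dense |V| × |V| matrix; these are illustrative simplifications.

# Hedged sketch: accumulating the co-occurrence counts of Equation (3.1) in a sparse map.
from collections import defaultdict
from itertools import combinations

def add_cooccurrences(adjacency, ehr_nodes, window=20):
    """Slide a window over one EHR's qualified-concept nodes and count co-occurring pairs."""
    for start in range(len(ehr_nodes)):
        context = ehr_nodes[start:start + window]
        for u, v in combinations(set(context), 2):
            adjacency[(u, v)] += 1
            adjacency[(v, u)] += 1   # keep the matrix symmetric

adjacency = defaultdict(int)
ehr_nodes = [("cultures", "TEST", "CONDUCTED"),          # illustrative nodes from one EHR
             ("wound", "PROBLEM", "PRESENT"),
             ("vancomycin", "TREATMENT", "CONTINUING")]
add_cooccurrences(adjacency, ehr_nodes)
# After all EHRs are processed, an edge (u, v) exists whenever adjacency[(u, v)] != 0.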

In order to further explore the strength of edges in the QMKG, we also considered that some

of the concepts may share a degree of semantic similarity, and thus similarity metrics can also be

used. For this purpose, we have encoded the similarity between two qualified medical concepts within another matrix, S ∈ R^{|V|×|V|}. Each element in this similarity matrix corresponds to an edge in the adjacency matrix A, such that the value of S(u, v) is determined according to some similarity function Similarity : (V × V) → R applied to the corresponding edge, A^{(i)}(u, v).

3.3.1 A Map-Reduce Representation

Constructing the QMKG adjacency matrix and similarity matrix requires storing and calculating
all the co-occurrences between all pairs of qualified medical concepts in a corpus. As the memory
requirements for such a representation are impractical for large data sets, we present a map-reduce
formulation of the QMKG edge construction. Map-Reduce is a programming model for large-scale
parallel, distributed processing of large data sets popularized by Google (Dean and Ghemawat,
2008). A computation in the Map-Reduce model can be generalized as a combination of two
classes of operations:

• Map: Input is divided into sub-problems and distributed across the cluster, wherein each tuple of data is somehow transformed.

• Reduce: Multiple tuples of data are somehow aggregated to create the desired output.

Algorithm 3.1 presents the processing for generating the QMKG edges according to the MapReduce
model. The entire process consists of three phases. First, the map operation Count-Concepts counts
the occurrences of qualified medical concepts. Second, three parallel reduce operations aggregate
these counts: Sum-Vocabulary calculates the total number of qualified medical concept mentions,
Sum-Co-occurrences calculates the total co-occurrences for each qualified medical concept pair,
and Sum-Occurrences counts the total occurrences for each individual qualified medical concept.
Lastly, a final reduce operation calculates the similarity for each pair of qualified medical concepts
according to some similarity function, Similarity, based on the occurrences of each concept, as
well as the co-occurrences and vocabulary size. As illustrated in Algorithm 3.1, the similarity
functions play an important role in generating the QMKG edges. In the remainder of this section,
we detail the methods we have considered for computing these similarities.

Algorithm 3.1 Map-Reduce Model for Constructing the QMKG
1 map Count-Concepts(docid i, doc d)
2 for all qualified concept (c1, a1 ) ∈ doc d do
3 for all qualified concept (c2, a2 ) ∈ doc d do
4 if (c1, a1 ) , (c2, a2 ) then
5 Emit(left (c1, a1 ), right (c2, a2 ), count 1)
6 end if
7 end for
8 end for
9 end map

10 reduce Sum-Vocabulary(, counts [c1, c2, . . .])


11 s←0
12 for all count c ∈ counts [c1, c2, . . .] do
13 s ← s+c
14 end for
15 Emit(vocabulary s)
16 end reduce

17 reduce Sum-Cooccurrences((left l, right r), counts [c1, c2, . . .])


18 s←0
19 for all count c ∈ counts [c1, c2, . . .] do
20 s ← s+c
21 end for
22 Emit(left l, right r, cooccurrences s)
23 end reduce

24 reduce Sum-Occurrences(left l, tuples [t1, t2, . . .])


25 s←0
26 for all (right r, count c) ∈ tuples [t1, t2, . . .] do
27 s ← s+c
28 end for
29 Emit(node l, occurrences s)
30 end reduce

31 reduce Compute-Similarity(, tuples [t1, t2, . . .])


32 for all tuple j ∈ tuples [t1, t2, . . .] do
33 for all tuple l ∈ tuples [t1, t2, . . .] do
34 for all tuple r ∈ tuples [t1, t2, . . .] do
35 if j.left = l.node and j.right = r.node then
36 s ← Similarity( j.coocurrences, l.occurrences, r.occurrences, j.vocabulary)
37 Emit(left j.left, right j.right, similarity s)
38 end if
39 end for
40 end for
41 end for
42 end reduce
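For readers unfamiliar with the decomposition above, the following single-machine sketch simulates the same counting and aggregation phases with in-memory dictionaries; it is a conceptual illustration of Algorithm 3.1, not the distributed implementation, and the similarity function is left abstract.

# Hedged sketch: a non-distributed simulation of the counting phases of Algorithm 3.1.
from collections import Counter

def map_count_concepts(doc):
    """Map phase: emit a count of 1 for every ordered pair of distinct qualified concepts."""
    for left in doc:
        for right in doc:
            if left != right:
                yield (left, right), 1

def build_edges(docs, similarity):
    cooccurrences, occurrences, vocabulary = Counter(), Counter(), 0
    for doc in docs:                                   # aggregation (reduce) phases
        for (left, right), count in map_count_concepts(doc):
            cooccurrences[(left, right)] += count      # Sum-Cooccurrences
            occurrences[left] += count                 # Sum-Occurrences
            vocabulary += count                        # Sum-Vocabulary
    return {                                           # Compute-Similarity
        (l, r): similarity(c, occurrences[l], occurrences[r], vocabulary)
        for (l, r), c in cooccurrences.items()
    }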

3.3.2 First Order Similarity

The strengths of the edges in the QMKG, S, can be also represented as similarity scores between

two qualified medical concepts. Although there are a variety of techniques for calculating the

semantic similarity between two spans of text, we consider only four such techniques. To define

the similarity measures, we use the following definitions:

C(n, k) denotes the binomial coefficient \binom{n}{k};

A^{N}(∗, v) = \sum_i A^{N}(i, v); i.e., the total occurrences of v;

A^{N}(u, ∗) = \sum_j A^{N}(u, j); i.e., the total occurrences of u;

|V| is the vocabulary size; i.e., the number of qualified medical concepts.

Similarity Method 1

The first technique we used for qualifying the similarity between two qualified medical concepts is

using the point-wise mutual-information (PMI) between two qualified medical concepts.

pmi(u, v) = \log \frac{A^{N}(u, v)}{A^{N}(u, ∗) \cdot A^{N}(∗, v)}    (3.2)

To compute the PMI between u and v, we make use of Equation (3.2). For example, the PMI

between heart failure/ PRESENT and new cardiac event/ PRESENT is 15.48. These PMI values

indicate that the independence between these pairs of qualified medical concepts is low, and, thus,

that they are likely to be related.
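A small sketch of Equation (3.2), applied literally to sparse counts, is shown below; the count values are illustrative and do not correspond to corpus statistics.

# Hedged sketch: point-wise mutual information from co-occurrence and marginal counts.
from math import log

def pmi(cooccur_uv, occur_u, occur_v):
    """Equation (3.2): log( A(u, v) / (A(u, *) * A(*, v)) ) with illustrative counts."""
    if cooccur_uv == 0:
        return float("-inf")
    return log(cooccur_uv / (occur_u * occur_v))

print(pmi(cooccur_uv=12, occur_u=40, occur_v=25))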

Similarity Method 2

However, point-wise mutual information has a well-known bias towards infrequent events. This is

clear when one attempts to extrapolate knowledge from the top-scoring PMI-weighted edges for

a given grounded concept. Consider that the PMI between heart failure/ PRESENT and the unrelated qualified medical concept divert colostomy/ CONDITIONAL is 15.08.

lin(u, v) = \frac{A^{N}(u, v)}{A^{N}(u, v) + 1} \times \frac{\min\left(A^{N}(u, ∗), A^{N}(∗, v)\right)}{\min\left(A^{N}(u, ∗), A^{N}(∗, v)\right) + 1} \times pmi(u, v)    (3.3)


Equation (3.3), often referred to as Lin’s modified PMI, addresses this bias by scaling the PMI

by a discounting factor given in Pantel and Lin (2002). This discounting factor considers the

frequency of each individual qualified medical concept in a way that discourages extremely rare

qualified medical concepts from having too much impact on the resulting weight. For comparison,

the highest scoring edges using Lin’s modified PMI for heart failure/ PRESENT are nonischemic

cardiomyopathy/ PRESENT, and intravaneous lasix/ ONGOING, which are commonly associated

with heart failure.

Similarity Method 3

We investigate Fisher’s exact test which measures the significance of association (contingency)

between two vertices in the graph, and given in Equation (3.4).

\log fisher(u, v) = \log \frac{C\left(A^{N}(u, ∗), A^{N}(u, v)\right) \times C\left(|V| - A^{N}(u, ∗), A^{N}(∗, v) - A^{N}(u, v)\right)}{C\left(|V|, A^{N}(u, ∗) + A^{N}(∗, v)\right)}    (3.4)

Fisher’s exact test is commonly used in statistics to evaluate the null hypothesis in situations where

the sample size is too small to evaluate using the Chi-squared test. This test is a distance measure,

thus, the least weight edges are the most similar. Continuing our example, the least distant neighbors

for heart failure/ PRESENT are hypertension/ PRESENT at −116.57, and congestive

heart failure/ PRESENT at −92.3.

Similarity Method 4

Our fourth technique, Equation (3.5), adapts the Normalized Google Distance (Cilibrasi and Vitányi,

2004), which is a way of measuring semantic similarity based on Google hits, into a similarity

measure for qualified medical concepts. We do this by replacing the Google frequency with the

number of associated contexts in our corpus.

ngd(u, v) = \frac{\log\left(\max\left(A^{N}(u, ∗), A^{N}(∗, v)\right)\right) - \log A^{N}(u, v)}{\log |V| - \log\left(\min\left(A^{N}(u, ∗), A^{N}(∗, v)\right)\right)}    (3.5)

As such, complete left bundle branch block/ PRESENT and nonischemic cardiomyopathy/

PRESENT are the two least distant neighbors for heart failure/ PRESENT.
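The remaining three weighting schemes follow Equations (3.3)–(3.5) and can be sketched with the same count arguments used for the PMI sketch above; all numeric inputs are illustrative.

# Hedged sketch: Lin's discounted PMI, the log form of Fisher's exact test, and the adapted
# normalized Google distance, following Equations (3.3)-(3.5). Counts are illustrative.
from math import comb, log

def lin(cooccur_uv, occur_u, occur_v):
    m = min(occur_u, occur_v)
    discount = (cooccur_uv / (cooccur_uv + 1)) * (m / (m + 1))
    return discount * log(cooccur_uv / (occur_u * occur_v))     # discounted PMI

def log_fisher(cooccur_uv, occur_u, occur_v, vocab_size):
    return (log(comb(occur_u, cooccur_uv))
            + log(comb(vocab_size - occur_u, occur_v - cooccur_uv))
            - log(comb(vocab_size, occur_u + occur_v)))

def ngd(cooccur_uv, occur_u, occur_v, vocab_size):
    return ((log(max(occur_u, occur_v)) - log(cooccur_uv))
            / (log(vocab_size) - log(min(occur_u, occur_v))))

print(lin(12, 40, 25), log_fisher(12, 40, 25, 10_000), ngd(12, 40, 25, 10_000))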

3.3.3 Second Order Similarity

Because of the incredible sparsity of qualified medical concepts in EHRs, there are a multitude

of qualified medical concepts that do not share the same context window, but still share semantic

similarity that would be of value to medical knowledge processing systems. For example, consider

the concepts atrial fibrillation and ventricular fibrillation. Although these concepts are unlikely

to co-occur directly, they represent the same medical phenomenon – an irregular heart beat – but

correspond to different anatomical locations of the heart (the atrium and the ventricles, respectively).

In order to capture this relationship, we generalize the notion of second-order PMI, which has been

exploited for learning synonymy, to build a measure of second-order similarity given any first-order

similarity function. We provide an algorithm, which we call Second-Order-Sim, for calculating the

second-order similarity based on any first-order similarity matrix W for a graph G = (V, E).

The second-order similarity can be viewed as an aggregation of the weights (first-order simi-

larities) on paths connecting any pairs of nodes. In this work, we only consider paths containing

a single intermediate node (e.g., u ← t ← v). In order to compute the second order similarity

between a pair of nodes (u, v) from the graph, we first need to determine the number of single-

intermediate-node paths we will consider. In our case, u and v encode triplets containing the

lexicalized medical concept, the concept type, and the assertion. Hence, we want to determine how

many intermediary medical concepts should be considered when determining the second-order

similarity between u and v. We call these numbers βu and βv . The second-order similarity of u

Algorithm 3.2 Computing the second-order similarity matrix given a graph and its first-order
similarity matrix
Precondition: G = (V, E) is a graph of qualified medical concepts
Precondition: W is a first-order similarity matrix of size |V | × |V |

1 function Second-Order-Sim(G = (V, E), W |V |2 )


2 initialize Z as a |V | × |V | matrix
3 for all e = (u, v) ∈ E do
4 βu ← floor( log( Σ_i W(u, i)² ) × log₂ |V| ÷ δ )
5 βv ← floor( log( Σ_j W(j, v)² ) × log₂ |V| ÷ δ )
6 zu ← Sum-Top-β(u, βu , W)
7 zv ← Sum-Top-β(v, βv , W)
8 Z(u, v) ← zu βu^{-1} + zv βv^{-1}
9 end for
10 return Z
11 end function

12 function Sum-Top-β(v, β, WV 2 )
13 initialize Y as a zero-vector of size |V |
14 for i = 1 to |V| do
15 Y[i] ← W(i, v)
16 end for
17 sort Y in descending order
18 z←0
19 for i = 1 to β do
20 z ← z + Y[i]^γ
21 end for
22 return z
23 end function

to v is then computed based on the first-order similarities along the most similar βu paths from

u to v (and vice-versa for the similarity from v to u). In calculating the values of βu and βv , we

determine (a) how many other medical concepts encoded in the QMKG may be used to semantically

describe the concepts from u and v, and (b) how many nodes should be considered when generating

paths between u and v in the QMKG. The algorithm above gives the details about how βu and βv

are computed, which enables the estimation of the second-order similarity for nodes u and v,

denoted as zu and zv . The function Sum-Top-β enables the computation of these values. Further

details of the motivations behind and computation of the β values are provided in Islam and Inkpen

(2006). Finally, we compute the second-order similarity between v and u as a normalized sum

of the first-order-similarities in the top βv and βu paths between u and v (the normalized sum of

zu and zv ). This second-order similarity encodes the indirect similarity between v and u given β

intermediate nodes in the QMKG. That is, if the sum of the top β weights between v and u is

significantly large, then the second order similarity between v and u will also be large, indicating

that v and u are highly similar.

3.3.4 n-Order Similarity

The notion of second order similarity can be further extended to capture the semantic similarity

between two concepts across any number of intermediate vertices. However, in order to allow for

arbitrary-length paths, it is necessary to modify the way in which the aggregate similarity of a path

is computed. In the original second-order similarity measure, a threshold value, β, is computed for each concept which indicates how many intermediate nodes should be considered as a function

of the frequency of that concept. To simplify the complexity of an n-order similarity measure, a

commutative, uniform threshold is needed, allowing us to build the n-order similarity recursively.

To generate the n-order similarity recursively, we start with (1) A^{(N)}(u, ∗), which represents all the edges (and strengths) from node u to all other nodes in the QMKG, and (2) A^{(N)}(∗, v), which encodes all the edges (and strengths) from all nodes to node v in the QMKG. We then define A^{(N)}(u, ∗, v) as the intersection of nodes adjacent to A^{(N)}(u, ∗) and A^{(N)}(∗, v). We also consider the vectors e_{u,∗} and e_{∗,v}, which encode all edges between node u and A^{(N)}(u, ∗, v) and all edges between A^{(N)}(u, ∗, v) and v, respectively. By connecting e_{u,∗} and e_{∗,v}, we obtain all 2-edge paths that connect nodes u and v in the QMKG.

Because of the massive number of such paths, we can reduce the complexity of later compu-

tations by reducing the dimensionality of these vectors. We perform this dimensionality reduction

by ordering the indices of both vectors in descending order according to min(e_{u,∗}, e_{∗,v}) and taking only the top k dimensions (in this work, we set k = 100). This has the effect of causing future computations to only consider the k paths from u to v that have the largest minimum edge weight. To determine the updated second-order cohesion between the two nodes, we want to determine the similarity of their first-order cohesions across the entire vectors. We calculate this updated second-order similarity between vertices u and v by considering the angle between e_{u,∗} and e_{∗,v}, as determined by the cosine similarity, given in Equation (3.6).
\cos(e_{u,∗}, e_{∗,v}) = \frac{e_{u,∗} \cdot e_{∗,v}}{\|e_{u,∗}\| \, \|e_{∗,v}\|} = \frac{\sum_{i=0}^{N} e_{u,i}\, e_{i,v}}{\sqrt{\sum_{i=0}^{N} e_{u,i}^2}\; \sqrt{\sum_{i=0}^{N} e_{i,v}^2}}    (3.6)

By using the cosine similarity, we are able to determine the degree of correlation between two nodes
across the entire distribution of the vocabulary (k = N), or a subset of the k-most correlated words.
We can then leverage the transitivity of the cosine function to calculate the n-order similarity
recursively by repeatedly composing the cosine function, as shown in Equation (3.7).
cosine^{(n)}(u, v) = \frac{\sum_{i=0}^{N} e_{u,i}\, cosine^{(n-1)}(i, v)}{\sqrt{\sum_{i=0}^{N} e_{u,i}^2}\; \sqrt{\sum_{i=0}^{N} cosine^{(n-1)}(i, v)^2}}    (3.7)

As a base case for the recursion, we define cosine^{(0)}(u, v) to be the original cosine similarity between e_{u,∗} and e_{∗,v} as determined by Equation (3.6).
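The recursion of Equation (3.7) can be sketched as shown below, with the edge vectors stored as dense lists over the k retained dimensions and the base case following Equation (3.6); the small matrix at the end is illustrative, and in practice the intermediate values of cosine^{(n-1)}(·, v) would be memoized.

# Hedged sketch: recursively composing the cosine of Equations (3.6)-(3.7).
from math import sqrt

def cosine0(eu, ev):
    """Equation (3.6): cosine between the edge vectors of two nodes."""
    dot = sum(a * b for a, b in zip(eu, ev))
    norm = sqrt(sum(a * a for a in eu)) * sqrt(sum(b * b for b in ev))
    return dot / norm if norm else 0.0

def cosine_n(n, u, v, edges):
    """Equation (3.7): n-order similarity; edges[u][i] is the reduced edge weight from u to i."""
    if n == 0:
        return cosine0(edges[u], edges[v])
    prev = [cosine_n(n - 1, i, v, edges) for i in range(len(edges[u]))]
    dot = sum(e * p for e, p in zip(edges[u], prev))
    norm = sqrt(sum(e * e for e in edges[u])) * sqrt(sum(p * p for p in prev))
    return dot / norm if norm else 0.0

edges = [[0.0, 0.3, 0.2], [0.3, 0.0, 0.6], [0.2, 0.6, 0.0]]   # illustrative 3-node example
print(cosine_n(2, 0, 1, edges))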

3.4 Patient Cohort Retrieval with the Qualified Medical Knowledge Graph

Because EHRs do not document the rationale for medical decisions, patient cohort studies need
to be undertaken for understanding the progression of disease as well as the factors that influence
clinical outcomes. Patient cohort identification has been the target of an information retrieval
challenge task performed in the Text REtrieval Conference (TREC) in 2011 and 2012, under the
medical records track (TRECMed) (Voorhees and Tong, 2011; Voorhees and Hersh, 2012). The
TRECMed organizers aimed to develop a retrieval problem pertinent to real-world clinical medicine

by (a) enabling access to a large corpus of de-identified EHRs available from the University of

Pittsburgh Medical Center and (b) a set of 85 retrieval queries called topics, reflecting patient

filtering criteria similar to those specified for participation in clinical studies and the list of priority conditions developed by the Institute of Medicine (Ratner et al., 2009).

This retrieval task considered 95,703 de-identified EHRs which were generated from multiple

hospitals during 2007. The EHRs were grouped into hospital visits consisting of one or more

medical reports from each patient’s hospital stay. Thus, the EHRs were organized into 17,199

different patient hospital visits, wherein each hospital visit consists of all the reports generated

during that patient’s hospital stay. These reports are composed of primarily free-text, and consist

of medical histories, physical examinations, radiology reports, operative reports, and discharge

summaries. Each report is lightly wrapped within eXtensible Markup Language (XML) containing

the patient’s admit diagnoses and discharge diagnoses as ICD-9 codes. Additionally, a mapping

from individual clinical reports to their associated patient’s hospital visit was provided.

Patient cohort identification is a retrieval task in which, given a characterization of patients

targeted by clinical research, a ranked list of patient hospital visits is generated. These patient

cohorts are characterized by various medical phenomena such as medical problems, treatments,

tests, or infectious agents, as well as individual traits such as age, gender, or hospital status. Cohorts

are identified by a ranked list of hospital visits, in which the first hospital visit pertains to the patient

deemed most relevant to the query’s topic, while the following hospital visits correspond to patients

from the same cohort in decreasing order of relevance.

The 35 topics evaluated in 2011 and the 50 topics evaluated in 2012 were characterized by (a)

usage of medical concepts (e.g., acute coronary syndrome or plavix) and (b) constraints imposed

on the patient population (e.g., children, female patients). A subset of the topics is illustrated in

Table 3.1.

Table 3.1. Examples of topics provided as part of the TRECMed evaluation (the topic numbers correspond to the topic numbers evaluated in TRECMed).

Topic Number   Topic Description
156            Patients with depression on anti-depressant medication.
160            Patients with low back pain who had imaging studies.
172            Patients with peripheral neuropathy and edema.
184            Patients with colon cancer who had chemotherapy.

Traditionally, retrieval models do not take into account belief values asserted about concepts. Moreover, semantic information such as word senses has been shown to not improve the accuracy

or completeness of retrieval results. This is because semantic information that is too fine-grained

seems not to be beneficial to retrieval quality. For example, Voorhees and Harman (1997) used

WordNet (Fellbaum, 1998) as a tool for query expansion (Voorhees, 1994) with the TREC collection.

By expanding query terms with WordNet synonyms, hypernyms, or hyponyms, documents which

were retrieved using the SMART retrieval system (Salton, 1971) were more relevant only when

queries were short. But, when WordNet was used for word sense disambiguation of the documents,

retrieval performance was, in fact, degraded (Voorhees, 1994).

Taking into account these lessons learned, we investigated if, for the problem of patient cohort

retrieval, the recognition of medically significant semantic information in the query topics could

improve retrieval by informing query expansion methods. For this purpose, we make use of the

patient cohort retrieval system reported in Goodwin et al. (2011, 2012); Goodwin and Harabagiu

(2013c), which was developed for evaluations in TRECMed 2011 and 2012. By using the QMKG

for producing query expansions of the topics evaluated in TRECMed 2011 and 2012, we were able

to produce an extrinsic evaluation of the QMKG.

This constitutes a novel application of document retrieval wherein an incredible amount of

medical knowledge must be processed in order to model the relevance of a given topic (Hersh, 2009).

Part of that knowledge consists of various medical concepts, such as medical problems, treatments,

symptoms, and conditions (Edinger et al., 2012). Another critical aspect of the knowledge encoded

in a given topic constrains the gender, age or hospital status of a patient.

3.4.1 A Patient Cohort Retrieval System

The architecture of our system is illustrated in Figure 3.7. Both topics and the EHRs are analyzed to

identify medical concepts and their assertions. Because topics convey multiple semantic constraints,

topic analysis aims to recognize additional semantic classes that are specific to patients, e.g., their

age, gender and hospital status. Special submodules of the topic analysis distill the patient age (e.g.,

elderly, children), patient gender (e.g., women, male patients), hospital status (e.g., presenting to the

emergency room, discharged from the hospital, admitted with), or medical assertion status which

captures the existence, absence or uncertainty of medical phenomena (e.g., without a diagnosis of

x, family history of x, recommended for possible x).

[Figure 3.7. The architecture for patient cohort retrieval: topics undergo topic analysis (medical concept recognition, medical assertion identification, and recognition of hospital status, patient age, and patient gender), keyphrase extraction, and QMKG-informed keyphrase expansion; hospital visits (EMRs) undergo EMR analysis (medical concept recognition and medical assertion identification) and are stored in a hospital visit index, over which retrieval and reranking produce the ranked hospital visits.]

We observed that there were three criteria that occurred frequently throughout the 2011 topics and the NLM practice topics: whether the report concerned the patient’s admission, the patient’s discharge,

or the Emergency Room. The desired hospital status was detected by comparing the lemmatized topic against a small set of simple patterns, described in Table 3.2.

Table 3.2. Examples of detected hospital status along with a sample topic and lemmatized patterns.

Hospital Status      Example Topic                                                                      Lemmatized Patterns
Hospital Admission   Patients admitted with a diagnosis of multiple sclerosis                           admit for; admit to the hospital for; present to the hospital
Hospital Discharge   Patients being discharged from the hospital on hemodialysis                        discharge
Emergency Room       Patients who presented to the Emergency Department with Acute Coronary Syndrome    Emergency Department; ED course; emergency room

Topics such as elderly patients with ventilator-associated pneumonia or patients in their 20s

and 30s admitted for overdose pose an additional requirement: patient’s age should lie within

a specific numeric range. Patient age information is detected according to a manually created

grammar inspired by manually reviewing the sixty practice topics provided by the National Library of Medicine:

⟨age-phrase⟩ ::= ⟨unqualified-prefix⟩ ⟨number⟩ ⟨age-qualifier⟩
              | ⟨qualified-prefix⟩ ⟨number⟩
              | ⟨number⟩ ⟨age-qualifier⟩ ⟨unqualified-suffix⟩
              | ⟨number⟩ ⟨qualified-suffix⟩
              | ⟨range-prefix⟩ ⟨number⟩ ⟨range-infix⟩ ⟨number⟩
              | ⟨known-age-expression⟩

where a ⟨number⟩ entity captures both English and numeric representations of ages, and a ⟨known-age-expression⟩ is one of a few dozen manually created expressions with known age ranges, such as elderly (ages 60 or older), children (aged 2 to 12), or adult (aged 20 or older). To ensure the captured ⟨number⟩ describes an age, and not a range of some other, arbitrary domain, each rule requires either a qualified prefix or suffix, or an age qualifier. For example, the sequence patients younger than 30 contains the qualified prefix younger than, denoting that the captured number, 30, is a description of age. The sequence patients of at most 30, by contrast, must be followed by an age qualifier such as years to establish that the captured range denotes an age range, and not, say, a BMI (body mass index) range. If an age range is found, the age requirements are stored as part of the query. Additionally, if any keyword extracted for this topic conveys the same patient requirements as the extracted age range (e.g., the keyword children), the keyword is discarded because the age requirement already conveys this restriction more directly.
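For illustration, a fragment of this grammar can be approximated with regular expressions, as sketched below; the word lists, patterns, and known-age expressions shown are small hypothetical samples rather than the full manually created grammar, and the mapping of a matched phrase onto a numeric age range is omitted.

```python
import re

NUMBER = r"(?:\d{1,3})"                       # English number words omitted for brevity
AGE_QUALIFIER = r"(?:years?(?:\s+old)?|y/?o)"
QUALIFIED_PREFIX = r"(?:older\s+than|younger\s+than|over\s+the\s+age\s+of|aged?)"

# A few known-age expressions and their ranges (illustrative samples only).
KNOWN_AGE_EXPRESSIONS = {"elderly": (60, None), "children": (2, 12), "adult": (20, None)}

AGE_PATTERNS = [
    re.compile(rf"{QUALIFIED_PREFIX}\s+{NUMBER}", re.I),        # qualified-prefix number
    re.compile(rf"{NUMBER}\s+{AGE_QUALIFIER}", re.I),           # number age-qualifier
    re.compile(r"in\s+their\s+\d0s(\s+and\s+\d0s)?", re.I),     # decade range expressions
]

def detect_age_phrase(topic: str):
    """Return the matched age phrase in a topic, or None.

    The full system maps the matched phrase onto a numeric age range and
    stores it as part of the query; that mapping is omitted here."""
    for expression in KNOWN_AGE_EXPRESSIONS:
        match = re.search(rf"\b{expression}\b", topic, re.I)
        if match:
            return match.group(0)
    for pattern in AGE_PATTERNS:
        match = pattern.search(topic)
        if match:
            return match.group(0)
    return None

# detect_age_phrase("patients in their 20s and 30s admitted for overdose") -> "in their 20s and 30s"
# detect_age_phrase("elderly patients with ventilator-associated pneumonia") -> "elderly"
```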

In addition to patient age requirements, some topics expressed gender requirements. For

example, the topic men with prostate cancer treated with surgery imposes the requirement that all

returned hospital visits pertain to male patients. In order to detect this information, we created a

high-precision lexicon of words that denote male subjects, and another that denotes female subjects.

These lexica are provided in Tables 3.3a and 3.3b, respectively. If more than one gender was detected in a topic (i.e., if both the words women and men occur), the associated query is assumed to have no gender requirements. As with age requirement extraction, keywords that indicate redundant patient traits to the extracted gender are removed. Thus, the possible gender requirements are ‘male’, ‘female’, or ‘either’.

Table 3.3. Lexica used to detect patients’ gender.
(a) Male gender words: man, men, boy, boys, dude, dudes, gentleman, gentlemen, guy, guys, lad, lads, he, him, his, himself, male.
(b) Female gender words: woman, women, female, females, girl, girls, dudette, dudettes, lady, ladies, gal, gals, lass, lasses, lassie, lassies, she, her, hers, herself.

Because medical phenomena are often represented through multi-token, complex nominal

phrases, our keyword extraction technique considers multi-word expressions that preserve the se-

mantics encoded by the syntactic structure of the topic. This requires determining which token

sequences constitute a keyword, and which sequences should be decomposed into separate key-

words. To address this problem, we recursively consider all sub-sequences of tokens from each

Figure 3.8. Example of query decomposition for lower extremity chronic wound.

query and check if that sequence corresponds to an article title in Wikipedia. This allows us to
capture virtually any medical concept as well as common abbreviations, misspellings, short-hand,
phrasal verbs, noun collocations and synonyms. However, many common phrases and stopwords
exist as Wikipedia articles. To combat this, we ensure that any matched sequence occurs fewer than a threshold number of times, λW,⁴ within the PubMed Central open access subset of biomedical text. Finally, each keyphrase is decomposed so that it contains, as sub-keyphrases, any phrases within it which would themselves satisfy the keyword criteria. For example, the keyword lower extremity chronic wound in Figure 3.8 contains the sub-keywords lower extremity and chronic wound; the sub-keyword lower extremity, in turn, contains the sub-sub-keywords lower and extremity, while the sub-keyword chronic wound contains the sub-sub-keywords chronic and wound.
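A minimal sketch of this keyphrase detection is shown below; wiki_titles and pmc_frequency are hypothetical stand-ins for the set of Wikipedia article titles and the PubMed Central occurrence counts described above.

```python
def find_keyphrases(tokens, wiki_titles, pmc_frequency, max_freq=30_000):
    """Return every token span that satisfies the keyphrase criteria:
    the span matches a Wikipedia article title and occurs fewer than
    `max_freq` times in the PubMed Central open access subset (which
    filters out stopwords and overly common phrases)."""
    keyphrases = []
    for length in range(len(tokens), 0, -1):               # longest spans first
        for start in range(len(tokens) - length + 1):
            phrase = " ".join(tokens[start:start + length])
            if phrase in wiki_titles and pmc_frequency(phrase) < max_freq:
                keyphrases.append(phrase)
    return keyphrases

# For the topic fragment "lower extremity chronic wound", this might yield:
# ['lower extremity chronic wound', 'lower extremity', 'chronic wound',
#  'lower', 'extremity', 'chronic', 'wound'] -- the (sub-)keyword structure
# of Figure 3.8, with nesting implied by span containment.
```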
The next subsection details the methods for query expansion based on the QMKG. After
extracting and expanding the keywords that characterize a patient cohort, we must retrieve all
relevant hospital visits that match the extracted keywords. This task is accomplished through the
use of Apache Lucene 4.0 (Hatcher and Gospodnetic, 2005). Prior to retrieval, we created an
index over all hospital visits by merging all the electronic health records associated with each
hospital visit into a single document. The various fields encoded in each EHR were retained when
indexed (admit diagnosis, chief complaint, etc.) so that per-field weights could be adjusted. For the
retrieval part, each topic is represented as an interpolation of its weighted expansions, and those of any subsumed keywords (e.g., chronic wound would also include wound).

⁴ In our case, λW = 30,000. This threshold was based on observed occurrences of keywords from the TREC 2011 queries.

More precisely, a topic

is represented as a weighted sum of keywords as given in Equation (3.8).


$$Q(k; \lambda) = \lambda \left[ \alpha\, \mathrm{UMLS}(k) + \beta\, \mathrm{Wiki}(k) + \gamma\, \mathrm{SNOMED}(k) + \delta\, \mathrm{QMKG}(k) \right] + \sum_{s \in S} Q(s;\ \mu \cdot \lambda) \qquad (3.8)$$

where λ is the initial keyword score; α, β, γ, and δ are the weights associated with the respective

keyword expansion method; S is the set of keywords subsumed by k; and µ is the discounting factor

such that 0 < µ < 1. In our experiments, we set λ = 16, α = 12, β = 10, γ = 8, δ = 14 and µ = 0.5.

These weighted expansions were then scored using the highly popular BM25 ranking function to create a ranked list of hospital visits. We address the additional cohort constraints (age, gender, status) and assertion values by iterative filtering, as described in Goodwin and Harabagiu (2013c).
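As a rough illustration of Equation (3.8), the sketch below recursively scores a keyphrase and the keyphrases it subsumes; the per-source expansion scores (UMLS, Wikipedia, SNOMED, QMKG) are hypothetical stand-ins, and the default weights mirror the values reported above.

```python
def score_topic(keyword, subsumed, expansions, lam=16.0, mu=0.5,
                alpha=12.0, beta=10.0, gamma=8.0, delta=14.0):
    """Recursively compute Q(k; lambda) following Equation (3.8).

    `subsumed` maps a keyword to the keywords it subsumes (e.g.,
    'chronic wound' -> ['chronic', 'wound']); `expansions` maps a
    keyword to its per-source expansion scores."""
    e = expansions.get(keyword, {})
    own = lam * (alpha * e.get("umls", 0.0) + beta * e.get("wiki", 0.0)
                 + gamma * e.get("snomed", 0.0) + delta * e.get("qmkg", 0.0))
    # Each subsumed keyword contributes with the discounted score mu * lambda.
    return own + sum(score_topic(s, subsumed, expansions, mu * lam, mu,
                                 alpha, beta, gamma, delta)
                     for s in subsumed.get(keyword, []))
```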

3.4.2 Query Expansion Informed by the QMKG

We incorporated the similarity information stored within the QMKG by expanding medical concepts

in a query such that they correspond to semantically similar concepts used in EHRs. To produce

these query expansions, we perform k single-step random walks to expand a query concept so

that it includes k neighbors from the QMKG according to the similarity model used to weight the

QMKG. Figure 3.9 illustrates a portion of the QMKG and the way in which it can be used to expand

a patient cohort query. The query term CELLULITIS is mapped to its associated vertex in the

QMKG – the node (cellulitis, PROBLEM, PRESENT). This node is connected to a set of related

concepts indicated by its neighboring nodes. A random walk of length j on the QMKG started at

vertex v is a stochastic process with random variables U1, U2, . . . , Uj where U1 = v and Ui+1 is

a qualified medical concept selected from the distribution of neighbors of Ui in the QMKG. At

each step of the random walk, we select the next node by associating all neighbors of Ui with their

normalized weight (as indicated by the n-order similarity between Ui and each neighbor according

to the QMKG) and sampling once from this distribution. For example, consider the query given in

[Figure 3.9. Example of query expansion terms provided by the QMKG for a topic representation of a query given to a patient cohort retrieval system. The topic Patients with cellulitis yields the query term cellulitis; its node (cellulitis, PROBLEM, PRESENT) is linked to neighbors such as (leg legion, PROBLEM, PRESENT), (leg ulcer, PROBLEM, PRESENT), (lymphadenitis, PROBLEM, PRESENT), (mild osteoarthritis, PROBLEM, PRESENT), and (unasyn, TREATMENT, ONGOING), producing the expanded query cellulitis/PRESENT, leg legion/PRESENT, leg ulcer/PRESENT, lymphadenitis/PRESENT, mild osteoarthritis/PRESENT, and unasyn/PRESENT.]

Figure 3.9. Using the root node (cellulitis, PROBLEM, PRESENT) we normalize the distribution

of weighted neighbors and randomly sample from it, yielding the node (leg ulcer, PROBLEM,

PRESENT). We repeat this process k times to generate k new weighted query terms.

Note that a random walk is merely a special case of a Markov chain. Additionally, the n-order

similarity encodes the normalized joint likelihood of visiting some qualified medical concept v

for all n combinations of intermediate nodes t1, t2, · · · , tn . This allows us to model the probability

that a random walk of length n starting at qualified medical concept v and ending at qualified

medical concept u as the n-order similarity between vertices v and u. By leveraging the Markovian

assumption inherent in a random walk (that the likelihood of visiting any node depends only on

the current node), we can simulate a j-step random walk over a potentially infinite-order QMKG

by instead performing a one-step random walk over the QMKG weighted by the n-order similarity

function. Note that by using a random walk for this purpose, the expansions generated may change

on each invocation of the query expansion algorithm. If this is undesired, one may instead simply

select the k largest-weighted edges in the QMKG and use those for query expansion.
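The single-step weighted sampling used for expansion can be sketched as follows; the neighbor weights are toy values echoing Figure 3.9, and the dictionary-based QMKG lookup is a simplified stand-in for the actual graph.

```python
import random

def expand_query_term(node, neighbors, k=5, deterministic=False):
    """Expand a qualified medical concept with k related concepts.

    `neighbors` maps each neighboring (concept, type, assertion) node to
    its n-order similarity weight with `node`. By default we simulate k
    one-step random walks over the normalized weights; setting
    `deterministic=True` instead returns the k largest-weighted neighbors,
    so the expansion is identical on every invocation."""
    if deterministic:
        ranked = sorted(neighbors, key=neighbors.get, reverse=True)
        return [(n, neighbors[n]) for n in ranked[:k]]

    nodes = list(neighbors)
    total = sum(neighbors.values())
    weights = [neighbors[n] / total for n in nodes]   # normalized distribution
    return [(n, neighbors[n]) for n in random.choices(nodes, weights, k=k)]

# Example: expanding "cellulitis" as in Figure 3.9 (toy weights).
cellulitis = ("cellulitis", "PROBLEM", "PRESENT")
neighbors = {("leg ulcer", "PROBLEM", "PRESENT"): 15.3,
             ("lymphadenitis", "PROBLEM", "PRESENT"): 14.6,
             ("unasyn", "TREATMENT", "ONGOING"): 15.3}
print(expand_query_term(cellulitis, neighbors, k=3))
```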

3.5 Experimental Results

3.5.1 Evaluation of the Techniques for Discovering QMKG Nodes

The QMKG that we have automatically generated contains 634 thousand nodes and 13.9 billion

edges (3.45% of nodes are connected). Figure 3.10a shows the distribution of automatically detected

medical concepts. The most common concept type was PROBLEM, accounting for nearly half the

medical concepts. Concepts of types TREATMENT and TEST each comprise roughly a quarter of the data. The concept type introduced by us, INFECTIOUS AGENT, occurs incredibly

infrequently, and can largely be ignored for these evaluations.

[Figure 3.10. Distributions of automatically identified concepts and assertions in our collection of EHRs. (a) Concepts: PROBLEM 50.02%, TEST 23.62%, TREATMENT 23.34%, INF. AGENT 3.02%. (b) Assertions: CONDUCTED 33.31%, PRESENT 22.34%, HISTORICAL 17.05%, ABSENT 13.1%, ONGOING 8.73%, CONDITIONAL 1.13%, PRESCRIBED 1.12%, ORDERED 1.04%, SUGGESTED 0.89%, HYPOTHETICAL 0.72%, POSSIBLE 0.55%, ASSOC. WITH ANOTHER 0.01%.]

As the QMKG is unique in its representation of qualified medical concepts, we were also

interested in the distribution of the assertions in the QMKG. Figure 3.10b illustrates the distribution.

The most common assertion class in our EHR collection is CONDUCTED, accounting for nearly

a third of all concepts. PRESENT follows closely, comprising nearly a fifth of the mentions. The

i2b2 distribution contains 69% of the concepts being labeled as PRESENT. To compare this to

our data, we must consider that we have divided PRESENT based on the concept type, so that

it now corresponds to PRESENT for problems, CONDUCTED for tests, and PRESCRIBED for

treatments. Thus, our comparable value to the type-less i2b2 PRESENT is 56.77%, significantly

lower than that of i2b2. Additionally, when comparing the 20% of ABSENT mentions in the i2b2

data to the 13.1% of our collection, it should be noted that we introduced the HISTORICAL class,

which accounts for 17.05%, creating a total of 30.15% of ABSENT or HISTORICAL mentions.

Because the QMKG is automatically generated, we were interested in evaluating the correctness

of the encoded information. The medical information from the nodes of the QMKG uses extensions

of the techniques reported in (Roberts and Harabagiu, 2011). The ability to detect lexicalized

medical concepts on the 2010 i2b2 data had an F1-score of 79.59% (compared to the top submission score of 85.23%). When the i2b2 assertions were used, the system’s ability to identify them obtained an F1-score of 92.75% (compared to the top submission score of 93.62%). The precision and recall of

assertion classification varied for each assertion, as illustrated in Table 3.4. Clearly, our assertion

classification methodology performs best on the classes that occur the most (CONDUCTED,

PRESENT, ABSENT), while rarer classes (SUGGESTED, and ORDERED) are harder to detect.

As we did not have the same amount of annotations as those used in the 2010 i2b2 challenge (where

25 thousand medical concepts were used for training and 40 thousand for testing), we have relied

on 2,349 new annotations for the new assertion classes that we have introduced. The distribution of the assertions is far from uniform within the i2b2 dataset, and so it was in our dataset as well. Evaluating the assertions was performed in two phases: first, we evaluated the assertions that were used in the i2b2 challenge, and then we evaluated the new assertions against our annotations (using

Table 3.4. Precision and recall for assertion types as evaluated against the 2010 i2b2 data, and our annotations for EHRs.

                                    i2b2 2010                EHRs
Assertion Class                P       R       F1       P       R       F1
ABSENT                       95.93   93.41   94.65    88.9    89.1    89.0
ASSOCIATED WITH ANOTHER      91.47   81.38   86.13      -       -       -
CONDITIONAL                  72.86   29.82   42.32    20.0    63.6    30.4
CONDUCTED                      -       -       -      89.8    80.7    85.0
HISTORICAL                     -       -       -      57.0    65.5    61.0
HYPOTHETICAL                 92.2    87.0    89.5       -       -       -
ONGOING                        -       -       -      81.3    64.2    71.7
ORDERED                        -       -       -      17.6    54.6    26.7
POSSIBLE                     81.63   58.89   68.42     4.3    25.0     7.3
PRESCRIBED                     -       -       -      13.6    27.6    18.2
PRESENT                      94.39   98.00   96.17    90.1    78.4    83.9
SUGGESTED                      -       -       -        -       -       -

10-fold cross validation) and achieved an accuracy of 75.99%. Table 3.4 illustrates these results.
As expected, the performance of our concept and assertion classification on the TRECMed EHRs
is significantly reduced compared to the original i2b2 dataset it was designed for. This performance
reduction is primarily due to significant changes in the dataset: unlike the i2b2 data, the TRECMed
EHRs are not sentence segmented, and contain an incredible amount of embedded formatting (such
as section titles, tables, lists, charts) encoded as text. Additionally, the distribution of hospital
notes is significantly different, as the TREC documents contain detailed surgery and laboratory
notes which consist of long lists of numerical findings. For this reason, features based on syntactic
preprocessing are significantly degraded, resulting in a decrease of performance on the TRECMed
dataset.
It is also clear that the performance of our assertion classification method was significantly
influenced by the amount of data available for each class. This is even more obvious when the new
assertion classes are compared to the original assertion classes, as the scope of our annotations (2
thousand) is significantly less than the 72 thousand used in the i2b2 evaluation. Additionally, certain
assertion values (ASSOCIATED WITH ANOTHER, HYPOTHETICAL, and SUGGESTED) were

encountered very rarely (< 10 occurrences) and were not correctly classified. Although these values

are incredibly rare, they constitute important semantic information that bears a significant distinction

in meaning from the other classes for medical processing systems (consider, for example, the utility

of the ASSOCIATED WITH ANOTHER assertion if one is interested in a patient’s family history).

As such, we conclude that our automatic generation of the QMKG nodes and edges achieves results

comparable to state-of-the-art techniques.

3.5.2 Evaluation of the Techniques for Discovering QMKG Edges

Because our QMKG was generated with the purpose of improving a patient cohort retrieval system,

we have also evaluated the results it enabled on our TRECMed system (Goodwin and Harabagiu,

2013c). Patient cohort identification was evaluated in TREC 2011 and 2012, within the medical

records track (known as TRECMed). This retrieval problem considered 95,703 de-identified EHRs

grouped into 17,199 sets corresponding to individual patients’ hospital visits. When retrieving

a patient cohort for a given patient topic, systems were tasked with producing a list of hospital

visits ranked by relevancy. Thirty-five topics were evaluated in 2011, and fifty more in 2012.

The top 100 results for participant systems were pooled, and expert judgments were created by

NIST assigning each document a score of 0 (NON-RELEVANT), 1 (PARTIALLY RELEVANT),

or 2 (RELEVANT). Table 3.5 displays these results. Three different metrics were used to evaluate

systems. Additionally, we have included the scores of the top performing manual system (meaning

that human intervention was used to create the ranked result set), and the top performing automatic

system.

The first metric, inferred average precision (infAP), is an extension of the notion of average precision

(AP) designed to be robust against incomplete judgments. Because precision and recall do not

consider the order of returned documents, the precision-recall curve plotting the precision and recall

at every position in the ranked documents is often used to evaluate retrieval systems. One of the

most popular metrics for evaluating IR systems is that of average precision (AP), which computes

Table 3.5. Scores achieved on the TRECMed 2012 topics.

Similarity Measure      infAP    infNDCG    P@10
None                    .276     .394       .145
Lin’s PMI               .326     .590       .209
PMI                     .346     .612       .232
Fisher’s Exact Test     .329     .594       .215
NGD                     .340     .609       .232
Top Manual              .366     .680       .749
Top Automatic           .286     .578       .592

the average value of the precision over all values of recall. It estimates the average precision by

sampling a relevant document, d, from the collection and determining the expected precision at

the retrieved rank of d by sampling the binary relevance (equating partial with complete relevance)

of a document from the 1, . . . , k ranked documents. Note that when all judgments are available,

inferred average precision exactly equals the average precision.

The second metric, inferred normalized discounted cumulative gain (infNDCG), is likewise

an extension of the immensely popular normalized discounted cumulative gain (NDCG) metric

designed to be robust against missing judgments. NDCG is, likewise, a transformation of the

discounted cumulative gain (DCG) wherein the DCG is normalized across queries. The DCG

metric penalizes (discounts) the cumulative gain (CG) by penalizing the CG value logarithmically

proportional to the position of the result. Finally, the cumulative gain (CG) is the sum of relevance

values for all results. Inferred NDCG is designed for scenarios with incomplete, non-binary relevance

judgments.

Finally, the third metric, the precision within the first 10 documents (P@10), corresponds to the number of correct results retrieved at rank 10 or higher. In general, these metrics agree on the relative rankings of each technique, revealing that point-wise mutual information best informs patient cohort identification systems.

3.5.3 Discussion

Automatically constructing a qualified medical knowledge graph (QMKG) entails two major steps:

(1) identifying qualified medical concepts and their assertions to constitute the nodes of the graph, and (2) determining whether two nodes should be adjacent, and what the strength of that edge should

be. Identifying the nodes comprising the QMKG achieved state-of-the-art results for classifying

medical concepts. When classifying medical concepts and detecting their boundaries, our method-

ology had only a 7% difference in F1 -measure compared to the top performing system on the i2b2

data. Our technique performed well on multi-token medical concepts, such as [atrial and ventric-

ular dysrhythmia-PROBLEM-PRESENT]. However, the EHRs we used had significant formatting

differences to those used in the i2b2 training data: the i2b2 data had clearly delineated sentence

boundaries; the TRECMed EHRs, however, contained raw free-form text that often incorporated

long lists and tables of medications, lab results, and other measurements. Additionally, the EHRs

had traces of structural information arbitrarily embedded within the text, such as titles, authors,

and headers. Finally, the EHRs contained excessive duplicated information, as many reports contained the verbatim texts of previous reports. As such, we had problems with pleonastic pronouns such as it being incorrectly categorized as problems, and likewise with relative pronouns such as [this-PROBLEM-ABSENT].

When detecting the assertion associated with a medical concept, our technique achieved within

0.9% of the top ranking system, and also achieved an average F1 score of 79.33% against our new

annotations. Automatic feature selection allowed us to consider a large number of features from

a significant number of resources. The success of our methodology varied proportionally to the

number of training instances available to us. This is likely a result of the difficulty of detecting

implied semantic relationships based on a limited set of lexical clues. Consider that the belief value

of a given concept often depends on the semantics of that particular concept, as shown in the excerpt

colorectal cancer screen normal, where the word normal indicates the absence of a problem.

Finally, we can discuss the quality of the results of methods for generating edges of the QMKG
by their impact on patient cohort retrieval results. Surprisingly, the basic point-wise mutual
information measure proved the most useful similarity measure, achieving a 115.5% increase in
inferred average precision, a 55.5% increase in inferred normalized discounted cumulative gain, and
an 85.6% increase in the precision within the first 10 documents. This is likely due to characteristics
within the domain of clinical texts. PMI’s known bias towards infrequent terms, although typically
viewed negatively, offers substantially higher recall which is well suited to the task of patient cohort
retrieval. By incorporating the QMKG in our patient cohort retrieval system, our infNDCG results
differ from the top performing manual system by only 0.01% (Demner-Fushman et al., 2012).
Further, our infNDCG outperforms the top performing automatic system by 0.05% (Callejas P.
et al., 2012). Clearly, incorporating the knowledge encoded within the QMKG yielded significantly
more relevant patient cohorts.

3.6 Summary and Lessons Learned

An extraordinary breadth of electronic health records is used throughout the world. These documents contain detailed narratives of the circumstances surrounding a patient’s treatment, such
as surgical reports, patient histories, discourses with the physician, or discharge summaries. Despite
the incredible depth of knowledge encoded in these records, they are not readily usable for machine
consumption (Chapman et al., 2011). This is due to the fact that physicians do not state their
reasoning behind certain actions, assuming that readers of their records have the domain knowledge
required to infer their motivations. In this chapter, we presented a framework for capturing
medical knowledge automatically both as qualified medical concepts and as connections between
them. We have evaluated the quality of this QMKG for the purpose of patient cohort retrieval.
The 2011 and 2012 Text REtrieval conference evaluated the task of retrieving patient cohorts
from electronic health records (EHRs). We showed that constructing a graph of medical concepts – medical problems, treatments, or tests – qualified by the physician’s belief (such as absent, present, or conditional) can greatly improve the relevance of patient cohorts when used
for query expansion. Due to the large data sets used, we incorporated BigData techniques by
presenting a process for constructing this graph according to the MapReduce model. Further, we
evaluated four different methods for determining the first-order similarity between qualified medical
concepts. We also provided a generalized technique for computing the second-order similarity from
a first-order similarity matrix. Additionally, we introduced a reformulation of this second order
similarity measure to recursively calculate an n-order similarity. Finally, we showed how this
n-order similarity can be used to simulate n-step random walks for query expansion, yielding
significantly improved results compared to state-of-the-art patient cohort retrieval systems. This
kind of knowledge – the nature of medical concepts such as problems, treatments, or tests as well as
their belief state (e.g., present, hypothetical) – constitutes a reasonable method for systems operating
in the medical domain to simulate the high degree of domain knowledge required to interpret the
findings in EHRs. Further, automatically learning this kind of knowledge allows for a secondary
use of readily abundant EHRs (Safran et al., 2007).

CHAPTER 4

MULTIMODAL PATIENT COHORT RETRIEVAL

Authors – Travis R. Goodwin, and Sanda M. Harabagiu

The Department of Computer Science, EC 31

The University of Texas at Dallas

800 West Campbell Road

Richardson, Texas 75080-3021

Minor revision, with permission, of Travis R. Goodwin and Sanda M. Harabagiu, Multi-modal Pa-
tient Cohort Identification from EEG Report and Signal Data, Proceedings of the American Medical
Informatics Association (AMIA) Annual Symposium, 2016, pp. 1,794 – 1,803. PMID:28269938.

Clinical electroencephalography (EEG) is an electrophysiological monitoring method used to

record electrical activity of the brain. Clinical EEG is the most important investigation in the

diagnosis and management of epilepsies and can also be used to evaluate other types of brain

disorders (Smith, 2005). An EEG records the electrical activity along the scalp and measures spon-

taneous electrical activity of the brain. Unfortunately, as noted in Beniczky et al. (2013), the EEG

signal is complex and inter-observer agreement for EEG interpretation is known to be moderate.

Such interpretations of EEG recordings are available in EEG reports. As more clinical EEG data

becomes available, the interpretation of EEG signals can be improved by providing neurologists

with results of search for patients that exhibit similar EEG characteristics. Searching the EEG

signals and reports results in the identification of patient cohorts that inform the clinical decision

of neurologists and enable comparative clinical effectiveness research (Edinger et al., 2012). For

example, a neurologist suspecting that her patient has epileptic potential formulated the query

(Q1 ) History of seizures and EEG with TIRDA without sharps, spikes, or electrographic seizures.

When inspecting the EEG signals and reports of the resulting patient cohort, the neurologist was

able to observe the specific features of the EEG for patients that exhibited epileptic potential. In

another search instance, a neurologist researcher was interested in one of the research priorities for

improving surveillance and prevention of epilepsy as reported by England et al. (2012), namely,

to identify effective interventions for epilepsy accompanied by mental health co-morbidities. This

researcher formulated the query (Q2 ) History of Alzheimer and abnormal EEG [sic]. The patient

cohort that was identified enabled the researcher to observe the treatment outcomes as well as the

clinical correlations documented in the EEG reports.

To ensure that patients from a cohort satisfy the criteria expressed in the natural language queries

formulated by neurologists, it is important to not only consider the narrative from EEG reports, but

also the EEG signal data. Searching for patient cohorts by considering EEG signals and reports

relies on (1) an index of the EEG clinical information and (2) relevance models that identify records

of relevant patients against a query. Indexing EEG clinical information requires organizing both

narratives from the EEG reports and signal data from the EEG recordings. Consequently, the EEG
index needs to capture multi-modal clinical knowledge processed both from the reports and the
signal recordings. While medical language processing enables the indexing of information from the
EEG reports, the index must also comprise a representation of EEG signal recordings. In addition,
the relevance models used by the patient cohort retrieval system must account for inclusion and
exclusion criteria inferred from processing the natural language query. To address these problems,
we have developed a patient cohort retrieval system which produces a multi-modal index of big
EEG data. We have also implemented two relevance models to identify the most relevant patients
based on their EEG reports and also based on the properties of the EEG signal recordings. The
patient cohort retrieval system, called MERCuRY (Multi-modal EncephalogRam patient Cohort
discoveRY), uses medical language processing to identify the inclusion and exclusion criteria from
the queries and to index the clinical knowledge from the EEG reports. In addition, MERCuRY
has two novel aspects not present in previous approaches for patient cohort retrieval: (1) it uses
deep learning to represent the EEG signal and to produce a multi-modal EEG index; and (2) it
operates based on two EEG relevance models – one that uses only the clinical information from the
EEG reports, and a second one which also considers the EEG signal information. We evaluated
MERCuRY by using expert judgments of queries against a collection of nearly 20,000 EEGs.

4.1 Background

The ability to automatically identify patient cohorts satisfying a wide range of criteria – including
clinical, demographic, and social information – has applications in numerous use cases, as pointed
out in Shivade et al. (2014) including (a) clinical trial recruitment; (b) outcome prediction; and (c)
survival analysis. Although the identification of patient cohorts is a complex task, many systems
aiming to resolve it automatically have used statistical techniques or machine learning methods
taking advantage of natural language processing (NLP) of the clinical documents (Shivade et al.,
2014). However, these systems cannot rank the identified patients based on the relevance of the

patient to the cohort criteria. This notion of relevance is at the core of information retrieval (IR)

systems. Thus, viewing the problem of patient cohort identification as an IR problem enables us to

not only identify which patients belong to a cohort, but to also rank patients based on relevance to

the inclusion and exclusion criteria used in the query.

Using information retrieval for patient cohort identification was considered in 2011 (Voorhees

and Tong, 2011) and 2012 (Voorhees and Hersh, 2012) by the Medical Records track in the annual

Text REtrieval Conference (TREC) hosted by the National Institute of Standards and Technology

(NIST). When patient cohort identification systems are presented with a query expressing the

inclusion/exclusion criteria for a desired patient cohort, a ranked list of patients representing the

cohort is produced where each patient may be associated with multiple health records. Thus,

identifying a ranked list of patients is equivalent to producing a ranked list of sets of health records,

each pertaining to a patient belonging to the cohort. In MERCuRY, we have adopted the same

framework of identifying patients that are relevant to a cohort and ranking them according to their

relevance to the given cohort criteria. However, unlike the TREC patient cohort retrieval systems,

which considered only the clinical texts available from a large set of electronic health records,

MERCuRY uses a multi-modal index that encodes textual data available from EEG reports as well

as signal data produced by EEG signal recordings.

Historically, the majority of multi-modal retrieval systems have operated on text and image

data. For example, Demner-Fushman et al. (2012) designed a biomedical article retrieval system

which allows users to not only discover biomedical articles relevant to a query, but to also discover

similar images to those found in any retrieved articles. Their approach clusters images using a

large number of visual features capturing color, edge, texture, and other image information. By

contrast, our approach relies on unsupervised deep learning to generate a fingerprint of EEG data.

As such, although designed for EEG data, our architecture can be easily adapted to support other

types of (physiological) waveform data (such as ECGs). Other forms of physiological waveform

data were previously investigated by the AALIM (Syeda-Mahmood et al., 2007) system which

enabled cardiac decision support by allowing physicians to locate similar patients according to the
ECG, echo, and audio data associated with the patient. However, unlike our approach, their system
does not support search: it can only identify similar patients to a provided ECG, echo, or audio
recording and thus cannot be used to discover patients matching arbitrary criteria.
The MERCuRY system presented in this chapter relies on medical language processing tech-
niques that were informed by the experience gained from participating in the 2010, 2011 and
2012 Informatics for Integrating Biology and the Bedside (i2b2) Challenges on NLP for Clinical
Records, which focused on the automatic identification of medical concepts and events in clinical
texts (Uzuner et al., 2011). In contrast, the EpiDEA patient cohort identification system (Cui
et al., 2012) which also retrieves EEG-specific patient cohorts, operates on discharge summaries
to recognize patients that are relevant only to some pre-defined queries obtained by relying on the
EpSO ontology (Sahoo et al., 2014).

4.2 The Data

The MERCuRY system was developed to identify patient cohorts from the big EEG data available
from the Temple University Hospital (TUH) EEG Corpus (Harati et al., 2013) (over 25,000 sessions
and 15,000 patients collected over 12 years). This dataset is unique because, in addition to
the raw signal information, physician’s EEG reports are provided for each EEG. Following the
American Clinical Neurophysiology Society Guidelines for writing EEG reports (American Clinical
Neurophysiology Society et al., 2006), the EEG reports from the TUH corpus start with a CLINICAL
HISTORY of the patient, describing the patient’s age, gender, and relevant medical conditions at
the time of the recording (e.g., “after cardiac arrest”) followed by a list of the medications which
may influence the EEG. The INTRODUCTION section is the depiction of the techniques used for
the EEG (e.g., “digital video EEG”, “using standard 10-20 system of electrode placement with 1
channel of EKG”), as well as the patient’s conditions prevalent at the time of the recording (e.g.,
fasting, sleep deprivation) and level of consciousness (e.g., “comatose”). The DESCRIPTION

section is the mandatory part of the EEG report, and it provides a description of any notable

epileptiform activity (e.g., “sharp wave”), patterns (e.g., “burst suppression pattern”) and events

(“very quick jerks of the head”). In the IMPRESSION section, the physician states whether the

EEG readings are normal or abnormal. If abnormal, then the contributing epileptiform phenomena

are listed. The final section of the EEG report, the CLINICAL CORRELATIONS section explains

what the EEG findings mean in terms of clinical interpretation (Kaplan and Benbadis, 2013) (e.g.,

“very worrisome prognostic features”). Each EEG report in the TUH corpus is associated with the

EEG signal recording it interprets. The signal information consists of 24 to 36 channels of signal

data as well as an additional annotation channel providing markers identifying events of interest to

physicians and technicians. EEG signals are sampled at a rate of 250 Hz or 256 Hz using 16-bits

per sample. Each EEG recording from the TUH EEG corpus contains roughly 20 megabytes of

raw data, stored in the European Data Format (EDF+) file schema (Kemp and Olivan, 2003).

[Figure 4.1. Overview of the MERCuRY Patient Cohort Discovery System: the EEG cohort description is analyzed (term filtering, query formulation, and query expansion); EEG reports are processed through section identification and medical language processing, and EEG signals are processed by a deep neural network to produce EEG signal fingerprints; the term/phrase dictionary, tiered inverted lists, and EEG signal fingerprints form the multi-modal EEG index, over which two relevance models (Case 1 and Case 2) produce the patient cohorts.]

4.3 Multimodal Patient Cohort Retrieval

MERCuRY is a multi-modal patient cohort discovery system which allows neurologists to inspect

the EEG records as well as the EEG signal recordings of patients deemed relevant to a query

expressing inclusion and exclusion criteria through natural language. As illustrated in Figure 4.1,

the neurologist query is analyzed to identify the inclusion and exclusion criteria. The results of

query analysis inform two different relevance models (illustrated as Case 1 and Case 2 in Figure 4.1)

which rely on the multi-modal EEG index encoding information identified in the EEG reports and

signal recordings. When EEG reports are indexed, the sections of the EEG reports are identified and

medical language processing is performed to identify the terms and phrases of the dictionary and to

create tiered inverted lists. When the EEG signal recordings are processed, they are represented by

EEG signal fingerprints which are produced by deep learning methods. EEG signal recordings are

converted into low-dimensional fingerprint vectors which are included in the multi-modal index.

Additional details of the index are provided later in this chapter. As shown in Figure 4.1, in

MERCuRY, we considered two relevance models designed to identify and rank patients based

on their relevance to the patient cohort query: Case 1, in which the EEG signal fingerprints are

ignored and only the EEG reports are used, and Case 2, in which both the EEG fingerprints and

reports are used. These two cases allowed us to experimentally evaluate the impact of the EEG

signal fingerprint representation on the overall performance of MERCuRY when identifying patient

cohorts.

[Figure 4.2. The Multi-Modal Tiered Index used in the MERCuRY Patient Cohort Discovery System: each entry of the term dictionary (e.g., alpha, hypertension, seizure, sharp, spike, wave) points to a positive-polarity and a negative-polarity tiered inverted list whose cells record the EEG report ID, the EEG signal fingerprint ID, the report section and position, and, when applicable, the medical concept ID and the term’s position in that concept; a medical concept dictionary links each concept (e.g., sharp and slow wave) to the term IDs composing it.]

4.3.1 Indexing the EEG Big Data

The multi-modal index used by the MERCuRY system organizes the information from the EEG
reports as well as the information from the EEG signal recordings. It contains both a term dictionary
and a medical concept dictionary, listing all the terms and medical concepts discerned from the
EEG reports. We considered five medical concept types: (1) medical problems; (2) medical tests;
(3) medical treatments (including medications); (4) EEG patterns and activities; as well as (5) EEG
events. Because medical concepts often are multi-term expressions (e.g., “spike and slow waves”),
the medical concept dictionary used term IDs to associate a concept with all terms expressing it
(e.g., “spike and slow waves” is associated with the terms “spike”, “slow” and “wave”). Moreover,
as illustrated in Figure 4.2, each entry from the term dictionary is linked to a pair of inverted lists:
the first corresponding to positive polarity associated with the term while the second corresponding
to negative polarity. By using polarity information (which is automatically processed from the
medical language used in the EEG reports), we have designed a multi-tiered index. Each of the
tiered inverted lists is implemented as a linked list. Each cell of those lists indicates, for every
occurrence of the term:
(1) in which EEG report the term was observed;
(2) the EEG signal fingerprint of that EEG report;
(3) in which section of the EEG report the term was observed;
(4) at which position within that section;
(5) whether the term belongs to a medical concept identified in the EEG report; and
(6) if so, what position does the term have in the concept.
The EEG signal fingerprints, by contrast, are representations of the EEG signal recordings obtained
through deep-learning techniques (described later in the chapter) and organized in a similarity-based
hierarchy which enables the discovery of relevant patients when the EEG signal recordings are also
considered (Case 2 illustrated in Figure 4.1). When the EEG signal recordings are not used for
identifying patient cohorts (Case 1 illustrated in Figure 4.1) only the term dictionary, the medical

concept dictionary, and the tiered inverted lists from the index are used. Creating the multi-modal

tiered index for the MERCuRY Patient Cohort Discovery System involves:

(1) the recognition of sections of the EEG reports;

(2) medical language processing to determine (a) the terms from the dictionary, (b) their polarity

and (c) the medical concepts;

(3) generating the fingerprints for the EEG recording; and

(4) organizing the EEG signal fingerprints in the similarity-based hierarchy.
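To make the index layout concrete, the sketch below models one tiered posting cell and the two polarity-specific posting lists attached to each dictionary term; it is an illustrative simplification of the structure in Figure 4.2, with field and function names of our own choosing.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Optional

@dataclass
class Posting:
    """One cell of a tiered inverted list (cf. Figure 4.2)."""
    report_id: str                            # which EEG report the term was observed in
    fingerprint_id: str                       # the EEG signal fingerprint of that report
    section: str                              # section of the EEG report (e.g., IMPRESSION)
    position: int                             # token position within the section
    concept_id: Optional[str] = None          # medical concept containing the term, if any
    concept_position: Optional[int] = None    # position of the term within that concept

@dataclass
class TermEntry:
    """Dictionary entry: one inverted list per polarity tier."""
    positive: List[Posting] = field(default_factory=list)
    negative: List[Posting] = field(default_factory=list)

index: Dict[str, TermEntry] = {}

def add_occurrence(term: str, posting: Posting, polarity: str) -> None:
    entry = index.setdefault(term, TermEntry())
    (entry.positive if polarity == "positive" else entry.negative).append(posting)
```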

4.3.2 Section Identification

Sections were identified through a rule-based section segmentation approach. Our rules were

defined after manually reviewing 300 randomly sampled EEG reports. We detected a set of

candidate headers by discovering all sequences of all capitalized words ending in a colon or line

break, and normalized section titles based on simple regular expressions. For example, “description

of the record”, “description of record”, and “description of the recording” would all be normalized

to DESCRIPTION.
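A minimal sketch of this rule-based normalization is shown below; the two regular expressions and the handful of rules are illustrative stand-ins for the rule set derived from the 300 reviewed reports.

```python
import re

# Illustrative normalization rules: candidate header -> canonical section name.
SECTION_RULES = [
    (re.compile(r"^DESCRIPTION OF (THE )?RECORD(ING)?S?\b", re.I), "DESCRIPTION"),
    (re.compile(r"^CLINICAL HISTORY\b", re.I), "CLINICAL HISTORY"),
    (re.compile(r"^IMPRESSION\b", re.I), "IMPRESSION"),
    (re.compile(r"^CLINICAL CORRELATIONS?\b", re.I), "CLINICAL CORRELATION"),
]

# Candidate headers: capitalized word sequences ending in a colon.
HEADER_PATTERN = re.compile(r"^([A-Z][A-Z ]+?):", re.M)

def normalize_section(header: str) -> str:
    for pattern, canonical in SECTION_RULES:
        if pattern.match(header.strip()):
            return canonical
    return header.strip().upper()

def find_sections(report_text: str):
    """Yield (canonical_section_name, header_offset) pairs for one report."""
    for match in HEADER_PATTERN.finditer(report_text):
        yield normalize_section(match.group(1)), match.start()
```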

4.3.3 Medical Language Processing

In order to build (1) the term dictionary; (2) the medical concept dictionary and (3) the two polarity-

informed posting lists we have used the following sequence of medical language processing steps:

(Step 1) Tokenizing the EEG reports: we relied on Stanford’s CoreNLP pipeline (Manning

et al., 2014) to detect sentences and tokens from every EEG report.

(Step 2) Discovering the Dictionary Terms: Each token was normalized in order to account

for any lexical variation (e.g., “waves” and “wave” or “markedly” and “marked”) using Stanford’s

CoreNLP lemmatizer (Manning et al., 2014). The resultant lemmatized terms formed the basis of

the dictionary.

(Step 3) Identifying the Polarity: Term polarity was cast as a classification problem implemented

as a conditional random field (CRF) (Lafferty et al., 2001). Leveraging our previous experience

with the i2b2 challenge, the CRF assigned a binary polarity value (i.e., positive or negative) to

each term based on a feature vector containing lexical information as well as information from

external resources, such as the NegEx negation detection system (Chapman et al., 2001), the

Harvard General Inquirer (Stone et al., 1962), the Unified Medical Language System (UMLS)

meta-thesaurus (Bodenreider, 2004), and MetaMap (Aronson, 2001). Specifically, we considered

nine features: (1) the section name, (2) whether the term was considered a modifier by NegEx, (3)

whether the term was within a NegEx negation span, (4) whether the term was in the ‘IF’ category

of the Harvard General Inquirer, (5) the part-of-speech tag assigned to the token by Stanford’s

CoreNLP part-of-speech tagger (Manning et al., 2014), (6) whether the term belonged to a UMLS

concept, (7) whether the term belongs to a MetaMap concept, (8) the original cased term before

lemmatization, and (9) the lowercased and lemmatized form of the term. The classifier was trained

using 2,349 manual annotations.
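For instance, the per-token feature vector for the polarity CRF could be assembled along the following lines; the token object and the NegEx, General Inquirer, UMLS, and MetaMap lookups are hypothetical stand-ins for the resources listed above.

```python
def polarity_features(token, section, pos_tag, negex, inquirer_if,
                      umls_terms, metamap_terms):
    """Build the nine polarity features for one token (cf. Step 3).

    `token` is assumed to carry .text and .lemma; `negex`, `inquirer_if`,
    `umls_terms`, and `metamap_terms` are stand-in lookups."""
    return {
        "section": section,                                # (1) section name
        "negex_modifier": token.lemma in negex.modifiers,  # (2) NegEx modifier?
        "negex_negated": negex.in_negation_span(token),    # (3) inside a negation span?
        "inquirer_if": token.lemma in inquirer_if,         # (4) Harvard GI 'IF' category?
        "pos": pos_tag,                                    # (5) part-of-speech tag
        "umls_concept": token.lemma in umls_terms,         # (6) part of a UMLS concept?
        "metamap_concept": token.lemma in metamap_terms,   # (7) part of a MetaMap concept?
        "cased": token.text,                               # (8) original cased term
        "lemma": token.lemma.lower(),                      # (9) lowercased lemma
    }
```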

(Step 4) Identifying Medical Concepts: An automatic system for medical concept recognition

previously developed for the 2010 i2b2 challenge (Uzuner et al., 2011) recognized medical prob-

lems, tests, treatments. In addition, we have produced 4,254 new annotations for EEG patterns and

events as well as EEG activities and re-trained the concept recognizer to identify all these types

of concepts. Concept extraction was cast as a classification task, in which a CRF was used to

detect medical concept boundaries. A support vector machine (SVM) (Cortes and Vapnik, 1995)

was used to classify each concept into one of five types: medical problem, medical test, medical

treatment, EEG activity, or EEG event.

4.3.4 Generating Fingerprints of EEG Signal Recordings.

In the TUH EEG corpus, the EEG signals are encoded as dense floating-point matrices of the form

D ∈ RN×L , where N ∈ [24, 36] is the number of electrode potential channels in the EEG and L

is the number of samples (such that the duration of the EEG recording in seconds is equal to L divided by the 250 Hz sampling rate). Thus, Dij encodes the magnitude of the potential recording on the i-th channel during the

j-th time sample. Both the number of channels and the number of samples vary not only across

patients, but also across EEG recording sessions. These variations, particularly when considered

with the large amount of data encoded in each matrix (typically 20 megabytes), make it difficult to

not only characterize the relevance of EEG signals to a particular patient cohort, but also to determine

the similarity between two EEG signals. For example, consider that a single naïve pass over the

TUH EEG corpus requires considering over 400 gigabytes worth of information. Consequently,

we devised a representation of the EEG recordings that not only requires less memory, but enables

rapid similarity detection between EEG signal recordings. This allows us to compactly encode the

information from the EEG signals (reducing 20 megabytes of signal data to a few hundred bytes).

Our representation is based on EEG signal recording fingerprints obtained with a recurrent neural

network. Recurrent neural networks are deep learning architectures which enabled us to generate

fingerprints for each EEG in the TUH EEG corpus in a matter of hours instead of weeks. Because

traditional neural networks have difficulty operating on sequential data (e.g., EEG signals) as they

cannot consider relationships between successive inputs (e.g., between successive samples), we

have used a recurrent neural network (Kosko, 1988) (a neural network with a loop) which allowed

information learned at each sample to persist between samples in the same EEG signal. The learned

information (known as the internal memory state) is updated with each sample until ultimately

becoming the fingerprint for that EEG recording. Figure 4.3 illustrates the recurrent neural network

used for EEG fingerprinting. As shown, the unrolled network (on the right) processes each sample

from the EEG signal and predicts the value of the next sample (ht+1 ) according to both the current

sample (xt ) and the current fingerprint ( f ).

Recurrent neural networks are able to connect information from the previous sample to the

current sample, allowing them to consider the structure of the EEG waves in each channel. However,

interpreting EEGs requires considering not just the immediately adjacent signal information but

Figure 4.3. The recurrent neural network used for generating EEG signal fingerprints

also long distance signal patterns (e.g., alpha waves being interrupted by a sharp and slow wave
complex or repeated bursts of high amplitude delta waves). In order to allow our EEG fingerprints
to consider this type of long-distance information – that is, to consider the context of the entire
signal when predicting the next sample, we adapt a special form of recurrent neural network cell –
the Long Short-Term Memory (Hochreiter and Schmidhuber, 1997) (LSTM) cell – which are able
to remember information for long periods of time.

[Figure 4.4. Details of the Long Short-Term Memory (LSTM) Cell.]

Figure 4.4 outlines the architecture of an LSTM. Small green boxes denote sub-layers in the
neural network, while purple circles denote point-wise vector operations, and the gray Z cell denotes
vector concatenation. The top horizontal line represents the current working EEG fingerprint, i.e.,

f . The bottom horizontal line captures the previous prediction concatenated with the signal at the

current sample, denoted as ht . Relying on h, the LSTM operates in three steps. First, a sigmoid

gate (denoted as σ) determines whether the fingerprint (or memory) should forget information.

In the second step, another sigmoid gate determines which dimensions of the fingerprint should

be updated, while the tanh gate determines what new values will be added to the fingerprint. In

the final step, the (possibly updated) fingerprint is used to predict the next sample in the sequence

ht = σ (ht ) × tanh ( f ). Both f and the next prediction, ht are passed to the next LSTM cell in the

chain until all samples in the EEG signal have been considered. After the final sample has been

considered, the value of f is used as the fingerprint for the entire EEG signal. Thus, as the neural

network processes each sample, the fingerprint will be continuously refined until the network is

best able to predict every sample based on the previous sample and the fingerprint.

The recurrent neural network shown in Figures 4.3 and 4.4 is formally defined as follows.

We define the parameter K as the fingerprint dimensionality, or the number of dimensions of

the finger-print vector such that K  N × L (where N is the number of channels and L is the

number of samples). This allows us to determine the fingerprint vector f ∈ R1×K for an EEG

D ∈ RN×L by using the fingerprint as the internal memory state of the LSTM chain. In order

to ensure that the fingerprint can be used to reconstruct portions of the EEG data, we define an

additional parameter, W, the sample window, which indicates the number of subsequent samples

which should be predicted from each cell. This allows us to learn the optimal fingerprint for

each document by determining the vector f which minimizes the cosine distance between the

predicted values for each sample (hi, hi+1, . . . , hi+W ) and the actual values (xi+1, xi+2, . . . , xi+W+1 ).

Note, that the fingerprint vector and prediction vector are both K-dimensional, while each sample

vector is only N-dimensional. Thus, before comparing, we must project the output vector hi into

N dimensions by defining a projection matrix W ∈ RK×N and a bias vector b ∈ R1×N . Unlike

the fingerprint vectors, which are optimized for each individual EEG, W and b are optimized over

the entire corpus. Thus, the optimal fingerprint vector f for each document was computed by

minimizing the cosine distance between the output of the LSTM cell and the next W samples:
$$f = \operatorname*{arg\,min}_{f'} \sum_{i=0}^{L-W} \sum_{j=i}^{i+W} \cos\!\left(\operatorname{lstm}\!\left(f', D^{\top}_{i}\right) \cdot W + b,\; D^{\top}_{j}\right) \qquad (4.1)$$

where $\operatorname{lstm}(f', D^{\top}_{i})$ refers to the output of the standard LSTM cell (Hochreiter and Schmidhuber, 1997).


As defined, the recurrent neural network allows us to generate the optimal fingerprint f for each

EEG signal by discovering the vector f which is best able to predict the progression of samples

in each EEG recording according to the LSTM. In this way, the fingerprint is able to provide

a substantially compressed view of the EEG signal while still retaining many of the long-term

interactions and characteristics of the signal.
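To illustrate how such a fingerprint can be obtained in practice, the following is a minimal sketch (written in PyTorch, which is not necessarily the toolkit used in this dissertation) of optimizing a per-recording fingerprint by minimizing the cosine distance between projected LSTM outputs and subsequent samples, in the spirit of Equation (4.1). As a simplification, the fingerprint is optimized directly as the LSTM's initial memory state, and the hyper-parameter values (K, the window size, the number of optimization steps) are illustrative only.

```python
import torch
import torch.nn.functional as F

def learn_fingerprint(eeg, K=128, window=8, steps=100, lr=0.01):
    """eeg: float tensor of shape (L, N) -- L samples from N channels."""
    L, N = eeg.shape
    lstm = torch.nn.LSTMCell(input_size=N, hidden_size=K)
    project = torch.nn.Linear(K, N)                      # projection matrix W and bias b
    fingerprint = torch.zeros(1, K, requires_grad=True)  # f, optimized per EEG recording
    optimizer = torch.optim.Adam(
        [fingerprint, *lstm.parameters(), *project.parameters()], lr=lr)

    for _ in range(steps):
        optimizer.zero_grad()
        h, c = torch.zeros(1, K), fingerprint            # the fingerprint acts as the memory state
        loss = torch.zeros(1)
        for i in range(L - window):
            h, c = lstm(eeg[i].unsqueeze(0), (h, c))
            prediction = project(h)                      # project the K-dim output into N dims
            for j in range(i + 1, i + window + 1):
                # cosine distance between the projected prediction and the j-th sample
                loss = loss + (1.0 - F.cosine_similarity(prediction, eeg[j].unsqueeze(0)))
        loss.backward()
        optimizer.step()
    return fingerprint.detach()
```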

4.3.5 Organizing EEG Fingerprints into a Similarity-based Hierarchy

Rapid computation of similarity between EEG signals (or their fingerprints) is facilitated by the Fast

Library for Approximate Nearest Neighbors (FLANN) (Muja and Lowe, 2009). FLANN provides

implementations of a variety of highly-efficient structures for computing (or approximating) the

nearest neighbors of vectors in high dimensions. This allowed us to not only compactly store the

EEG signal information, but also to retrieve, for any EEG fingerprint, the most similar EEGs (i.e.,

the nearest neighbors) as measured by cosine distance. We used a k-means tree which allows

for high precision retrieval of the nearest EEGs to any fingerprint by recursively clustering the

fingerprint vectors using k-means clustering. The number of clusters is determined by FLANN’s

auto-tuning mechanism.
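A hedged sketch of this indexing step is shown below, assuming the pyflann bindings for FLANN (the parameter values and the synthetic data are illustrative, not the MERCuRY configuration). Because FLANN's built-in metrics do not include cosine distance, the fingerprints are L2-normalized first, so that nearest neighbors under Euclidean distance coincide with nearest neighbors under cosine distance.

```python
import numpy as np
from pyflann import FLANN

fingerprints = np.random.rand(10000, 128).astype(np.float32)      # one row per EEG
fingerprints /= np.linalg.norm(fingerprints, axis=1, keepdims=True)  # unit length for cosine

flann = FLANN()
# Build a k-means tree over the fingerprints; FLANN's auto-tuning can choose the
# branching factor instead of the value fixed here.
params = flann.build_index(fingerprints, algorithm="kmeans",
                           branching=32, iterations=10)

query = fingerprints[:1]                                           # fingerprint of a query EEG
neighbors, distances = flann.nn_index(query, num_neighbors=5)
print(neighbors, distances)
```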

4.4 Query Analysis

The purpose of query analysis is to identify the inclusion and exclusion criteria expressed in the

query. For example, in the query “patients with shifting arrhythmic delta suspected of underlying

cerebrovascular disease” two separate inclusion criteria are detected: “shifting arrhythmic delta”

and “cerebrovascular disease”. Similarly, in the query “patients with dementia and no abnormal

EEG”, there is one inclusion criterion, “dementia”, and one exclusion criterion, “abnormal EEG”.

To automatically detect the criteria, the following steps are used:

(Step 1) Term Filtering: Tokenization, lemmatization, and part-of-speech tagging using Stan-

ford’s CoreNLP pipeline (Manning et al., 2014) enables the filtering of terms that are not identified

as a noun, verb, adverb, adjective, or preposition.

(Step 2) Query Formulation: Our approach for determining inclusion and exclusion criteria

in the query relied on the same polarity and medical concept classifiers used (and previously

described) for building the inverted index from EEG reports. Specifically, we considered two

methods for recognizing inclusion and exclusion criteria: (a) phrase chunking using Stanford’s

CoreNLP pipeline and (b) medical concept detection using the previously described classifier. In

both cases, we relied on the previously described polarity classifier to distinguish between inclusion

criteria (positive) and exclusion criteria (negative) based on the polarity of each phrase or concept.

(Step 3) Query Expansion: In order to account for the fact that many medical concepts can

be expressed in multiple ways, we perform query expansion using the Unified Medical Language

System (UMLS) to detect synonymous criteria. This is accomplished by expanding each criterion

to include the set of all atoms in UMLS which have the same concept unique identifier (CUI) as

the criterion. For example, “cerebrovascular disease” would be associated with 110 expansions,

including “cerebral aneurysm”, “vascular ischemia”, “brain stem hemorrhage”, etc.
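A minimal sketch of this expansion step is given below. It assumes a local copy of the UMLS MRCONSO.RRF file (pipe-delimited, with the CUI in column 0, the language in column 1, and the atom string in column 14) and a hypothetical helper get_cui that maps a criterion phrase to its concept unique identifier; neither is part of the MERCuRY system description.

```python
from collections import defaultdict

def load_cui_to_atoms(mrconso_path, language="ENG"):
    """Map each CUI to the set of atom strings (synonyms) listed in MRCONSO.RRF."""
    cui_to_atoms = defaultdict(set)
    with open(mrconso_path, encoding="utf-8") as f:
        for line in f:
            fields = line.rstrip("\n").split("|")
            if fields[1] == language:
                cui_to_atoms[fields[0]].add(fields[14].lower())
    return cui_to_atoms

def expand_criterion(criterion, get_cui, cui_to_atoms):
    """Return the criterion plus every UMLS atom sharing its CUI."""
    cui = get_cui(criterion)          # get_cui is a hypothetical phrase-to-CUI mapper
    return {criterion.lower()} | cui_to_atoms.get(cui, set())
```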

4.4.1 Relevance Models

The inclusion and exclusion criteria discerned from the query analysis were used by MERCuRY’s

relevance models to assess the relevance of each EEG report against the given query. Two relevance

models were considered (as illustrated in Figure 4.1): Case 1, which ignores the EEG signal

fingerprints and Case 2, which incorporates them.

Case 1. This relevance model assigns a score to an individual EEG report based on the BM25F
ranking function (Zaragoza et al., 2004). BM25F measures the relevance of an EEG report based on
the frequency of mentions of each inclusion criterion and the absence of each exclusion criterion.
Moreover, BM25F is capable of adjusting the score for each criterion based on the tiers in the
posting list: that is, a criterion mention is scored according to both the polarity and the section in
the document. Formally, for an EEG report r and a query q = {c1, c2, · · · } composed of individual
inclusion and exclusion criteria (c), the BM25F relevance score is computed as:
$$\mathrm{BM25F}(r; q) = \sum_{c \in q} \mathrm{idf}(c)\,\frac{\bar{x}_{r,c}}{K_1 + \bar{x}_{r,c}} \qquad (4.2)$$

where idf(c) is the inverse-document frequency of criterion c (i.e., the inverse of the number of
documents mentioning c), K1 is a structuring parameter (in our case set to the standard (Zaragoza
et al., 2004) value K1 = 1.2) and x̄r,c is a tier-normalizing criterion frequency measure. The
tier-normalizing criterion frequency measure, x̄r,c , adjusts the frequency of criterion c in report
r according to the polarity and section of each mention. Before defining this measure, we must
account for the fact that query analysis described above considers two ways of representing inclusion
and exclusion criteria – (a) by phrases and (b) by typed medical concepts; thus, the tier-normalizing
criterion frequency measure changes depending on which of these methods is used:

$$\bar{x}_{r,c} = \sum_{p,s} \frac{x_{r,c,s,p}}{1 + b\left(\frac{l_{r,s,p}}{\bar{l}_{s,p}} - 1\right)} \qquad (4.3\text{a}) \qquad\qquad \bar{x}_{r,c} = \sum_{p,t} \frac{x_{r,c,t,p}}{1 + b\left(\frac{l_{r,t,p}}{\bar{l}_{t,p}} - 1\right)} \qquad (4.3\text{b})$$

(Case 1a) When an inclusion or exclusion criterion is expressed as a phrase, we defined x̄_{r,c} (used by the BM25F function) in Equation (4.3a), where x_{r,c,s,p} is the number of occurrences of criterion c with polarity p in section s of report r; b is a normalizing parameter (in our case using the standard (Zaragoza et al., 2004) value b = 0.75), l_{r,s,p} is the number of terms with polarity p in section s of report r, and l̄_{s,p} is the average number of terms with polarity p in section s across all
reports.

(Case 1b) In this case, each criterion is represented as a medical concept. For this method, the

tier-normalizing criterion frequency measure is restricted only to the sections pertinent to the type

of medical concept. That is, medical problems, and medical tests are only searched in the HISTORY

and CORRELATION sections; medical treatments are searched in the MEDICATIONS and COR-

RELATION sections; while EEG activities and EEG events are searched only in the DESCRIPTION

and IMPRESSION sections. Consequently, the tier-normalizing criterion frequency measure (used

in the BM25F function) is computed using Equation (4.3b) where t indicates a section pertinent to

the type of the medical concept used to express the criterion c.
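A minimal sketch of this scoring function, following Equations (4.2) and (4.3a), is shown below; the nested-dictionary input format and the helper names are assumptions made for illustration rather than the actual tiered posting-list implementation. The Case 1b variant is obtained by restricting the (section, polarity) keys to the sections pertinent to the type of the medical concept.

```python
def bm25f_score(report, query_criteria, idf, avg_len, k1=1.2, b=0.75):
    """
    report: {criterion: {(section, polarity): count}}  plus
            report["_len"][(section, polarity)] -> number of terms l_{r,s,p}
    query_criteria: list of inclusion/exclusion criteria c
    idf: {criterion: inverse document frequency}
    avg_len: {(section, polarity): average length over all reports, i.e. l̄_{s,p}}
    """
    score = 0.0
    for c in query_criteria:
        x_bar = 0.0
        for (section, polarity), count in report.get(c, {}).items():
            l_rsp = report["_len"][(section, polarity)]
            l_sp = avg_len[(section, polarity)]
            x_bar += count / (1.0 + b * (l_rsp / l_sp - 1.0))   # Equation (4.3a)
        score += idf.get(c, 0.0) * x_bar / (k1 + x_bar)          # Equation (4.2)
    return score
```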

Case 2. The second relevance model considers both the information from EEG reports as well

as the EEG signal fingerprints. It starts with the candidate patients discovered based on Case 1. The

ranked list of patients is then updated based on the fingerprints associated with the most relevant

patients’ EEGs. The rank updating procedure relies on two parameters: (1) λ, the rank threshold

parameter indicating how many of the initially retrieved patients should be used for re-ranking (in

our experiments we set λ = 5), and (2) δ, the fingerprint selection parameter which determines the

number of similar fingerprints to consider for each patient (in our experiments we set δ = 3). The

updated patient ranking is obtained as follows: for each patient p x of the λ-highest ranked patients,

we (i) find the fingerprint f x associated with p x , (ii) use the hierarchy of EEG signal fingerprints

(illustrated in Figure 4.2) from the multi-modal index to discover the δ most-similar fingerprints

to f x , and (iii) insert the patients corresponding to these fingerprints into the ranked list of patients

immediately after the patient p x , thus generating a new ranked list of patients.
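The re-ranking procedure can be sketched as follows; the helper functions fingerprint_of and similar_patients are placeholders standing in for look-ups against the multi-modal index, and the default values of λ and δ match the settings reported above.

```python
def rerank(ranked_patients, fingerprint_of, similar_patients, lam=5, delta=3):
    """Insert the patients with the delta most-similar fingerprints after each of the
    lam highest-ranked patients, preserving the original order otherwise."""
    reranked, seen = [], set()
    for rank, patient in enumerate(ranked_patients):
        if patient not in seen:
            reranked.append(patient)
            seen.add(patient)
        if rank < lam:
            fx = fingerprint_of(patient)                    # fingerprint f_x of this patient
            for neighbor in similar_patients(fx, delta):    # patients with the δ nearest fingerprints
                if neighbor not in seen:
                    reranked.append(neighbor)
                    seen.add(neighbor)
    return reranked
```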

4.5 Experimental Results

We evaluated two aspects of the MERCuRY system: (1) the overall quality of patient cohorts

discovered by the system and (2) the quality of the polarity classifier used to process the EEG

reports and to detect exclusion criteria in queries.

Table 4.1. Example queries used to evaluate the MERCuRY system

Patient Cohort Descriptions (Queries)


1. History of seizures and EEG with TIRDA without sharps, spikes, or electrographic seizures
2. History of Alzheimer dementia and normal EEG
3. Patients with altered mental status and EEG showing nonconvulsive status epilepticus
(NCSE)
4. Patients under 18 years old with absence seizures
5. Patients over age 18 with history of developmental delay and EEG with electrographic
seizures

4.5.1 Evaluation of Patient Cohort Discovery

We primarily evaluated the MERCuRY system according to its ability to retrieve patient cohorts.

To this end, we asked three neurologists to generate a set of 5 evaluation queries each and then

used them for evaluation. A sample of these queries is illustrated in Table 4.1. For each query,

we retrieved the ten most relevant patients as well as a random sample of ten additional patients

retrieved between ranks eleven and one hundred. We asked six relevance assessors to judge whether

each of these patients belonged or did not belong to the given cohort. Moreover, the order of the

documents (and queries) was randomized and judges were not told the ranked position of each

patient. Each query and patient pair was judged by at least two relevance assessors, obtaining an

inter-annotator agreement of 80.1% (measured by Cohen’s kappa).

This experimental design allowed us to evaluate not only the set of patients retrieved for each

cohort, but also the individual rank assigned to them. Specifically, we adopted standard measures

for information retrieval effectiveness, where patients labeled as belonging to the cohort were

considered relevant to the cohort query, and patients labeled as not belonging to the cohort were

considered as non-relevant to the cohort query. Because the relevance of a patient to a particular

cohort can be difficult to automatically measure, we report multiple measures of retrieval quality.

Moreover, because our relevance assessments consider only a sample of the patients retrieved for

each topic, we adopted two measures of ranked retrieval quality: the Mean Average Precision

Table 4.2. Quality of Patient Cohorts Identified by the MERCuRY System

Relevance Model MAP NDCG P @ 10


Baseline 1: BM25 52.05% 66.41% 80.00%
Baseline 2: LMD 50.37% 65.90% 80.00%
Baseline 3: DFR 46.22% 59.35% 70.00%
MERCuRY: Case 1 (a) 58.59% 72.14% 90.00%
MERCuRY: Case 1 (b) 57.95% 70.34% 90.00%
MERCuRY: Case 2 (a) 70.43% 84.62% 100.00%
MERCuRY: Case 2 (b) 69.87% 83.21% 100.00%

(MAP) and the Normalized Discounted Cumulative Gain (NDCG) (Manning et al., 2008; Järvelin
and Kekäläinen, 2002). The MAP provides a single measurement of the quality of patients retrieved
at each rank for a particular topic. Likewise, the NDCG measures the gain in overall cohort quality
obtained by including the patients retrieved at each rank. This gain is accumulated from the
top-retrieved patient to the bottom-retrieved patient, with the gain of each patient discounted at
lower ranks. Lastly, we computed the “Precision at 10” metric (P@10), which measures the ratio of patients retrieved in the first ten ranks which belong to the patient cohort. Although less statistically meaningful, the precision is the easiest to interpret in terms of clinical application in that a 100.00% Precision at 10 indicates that all of the patients returned in the first ten ranks completely satisfy all the criteria of the given cohort. By comparison, the other measures indicate the quality of the ranking produced by our system such that the MAP and NDCG scores capture the degree to which a patient retrieved at each rank will more closely match the cohort criteria than patients retrieved at lower ranks.
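For reference, the following sketch shows the binary-relevance forms of these measures computed for a single cohort query; MAP is then the mean of the average precision across all queries. These are the standard textbook definitions, not MERCuRY-specific code.

```python
import math

def precision_at_k(ranked, relevant, k=10):
    """Fraction of the first k retrieved patients that belong to the cohort."""
    return sum(1 for p in ranked[:k] if p in relevant) / k

def average_precision(ranked, relevant):
    """Average of the precision values at each rank where a relevant patient appears."""
    hits, total = 0, 0.0
    for i, p in enumerate(ranked, start=1):
        if p in relevant:
            hits += 1
            total += hits / i
    return total / max(len(relevant), 1)

def ndcg(ranked, relevant, k=None):
    """Binary-gain NDCG: discounted gain of relevant patients, normalized by the ideal ranking."""
    ranked = ranked[:k] if k else ranked
    dcg = sum(1.0 / math.log2(i + 1) for i, p in enumerate(ranked, start=1) if p in relevant)
    ideal = sum(1.0 / math.log2(i + 1) for i in range(1, min(len(relevant), len(ranked)) + 1))
    return dcg / ideal if ideal > 0 else 0.0
```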
We measured the performance of the MERCuRY system configured for the two relevance models
illustrated in Figure 4.1: Case 1, in which only the EEG reports are considered and Case 2, in which
both the EEG reports and EEG signal information is considered. In both cases, we considered both
methods of representing inclusion and exclusion criteria: (a) using phrases composed of terms, and
(b) using typed medical concepts. We compared these four combinations against three competitive
baseline systems for text retrieval: Best Match 25 (BM25) (Robertson et al., 1995), language

model retrieval using Dirichlet smoothing (LMD) (Zhai and Lafferty, 2001), and the Divergence

from Randomness (DFR) (Amati and Van Rijsbergen, 2002) framework using Poisson smoothing,

Bernoulli and Zipfian normalization. Table 4.2 illustrates these results.

As shown, both configurations yield promising performance. Moreover, Case 2 obtains the

highest quality patient cohorts as measured by all three metrics. This shows that the multi-modal

capabilities enabled by the EEG fingerprinting approach are able to identify patients who were

missed when only the EEG reports were considered. The poor performance obtained by the

baseline systems highlights the difficulty of automatically discovering patient cohorts. Moreover,

the increase in performance obtained by MERCuRY Case 1 compared to Baseline 1 highlights

the importance of medical language processing on EEG reports – particularly the role of the

tiered index and the incorporation of exclusion spans. The highest performance was obtained by

MERCuRY Case 2, showing the promise of including EEG signal information when discovering

patient cohorts. This suggests that the content of EEG reports alone is not enough to adequately

determine if a patient satisfies particular inclusion criteria. This finding is not surprising, as EEG

reports were not written to completely replace the EEG signal, but rather to describe the important

characteristics of the EEG recording which may be of interest to other neurologists. As such,

EEG reports typically document only notable findings making it difficult to exclude patients based

only on the text in the EEG reports. The superior performance obtained by MERCuRY Case 2

indicates that EEG fingerprinting is able to supplement the information in the EEG reports and

bridge the gap between the high level description of EEG information in the text, and the low-level

electrode potentials recorded in the EEG signal.

4.5.2 Evaluation of Polarity Classification

We evaluated the quality of our automatic polarity detection approach by performing 10-fold cross

validation on the 2,349 manual annotations we produced and measured precision, recall and F1 -

measure, as shown in Table 4.3. We compared our classifier against two baseline classifiers, (a)

Table 4.3. Polarity classification performance

Label Precision Recall F 1 -Measure


Baseline: Word Only 24.50 29.65 59.35
Baseline: UMLS Only 37.08 14.22 20.55
MERCuRY 86.82 70.10 76.20

“Baseline: Word Only” which uses only word features, and (b) “Baseline: UMLS Only” which uses
only UMLS concept features. The MERCuRY classifier obtains substantially higher performance.
Moreover, the poor performance of the baseline systems suggests that determining exclusion spans
in text requires more information than lexical context and can be improved by incorporating NegEx
and medical ontologies.

4.6 Lessons Learned

In terms of polarity classification, the most common types of error were due to confusion regarding
the exact boundary of negative regions of text, for example, the sentence “No focal or epileptiform
features were identified in this record” was classified such that only “no focal” was negative,
rather than the entire phrase “no focal or epileptiform features.” This indicates a failure by the
incorporated standard natural language processing modules (part of speech and phrase chunking)
to adapt to the clinical domain. One obvious path to improvement would be to annotate basic
linguistic information on clinical documents – particularly on EEG reports.
Another common type of error was related to the binary granularity of polarity. For example,
in the excerpt “there is a suggestion of a generalized spike and wave discharge in association with photic stimulation” the phrase “generalized spike and wave discharge” was labeled as having a
negative polarity, despite the physician clearly indicating the possibility of such an activity. This
implies that future work would be well supported by a more fine-grained approach to capturing
the physicians’ beliefs, for example, by considering the assertions used in the 2010 i2b2 challenge
(Uzuner et al., 2011). Unfortunately, introducing assertions requires overcoming additional barriers

including increased risk of misclassification, and accounting for the degree of similarity between

different assertion values.

In terms of patient cohort retrieval, it is clear that the two methods of representing inclusion and

exclusion criteria – (a) using phrases of terms and (b) using typed concepts – do not provide any

significant changes to cohort performance. Based on our analysis, we believe this is primarily due

to the fact that there is little ambiguity in the types of concepts used in EEG reports: a particular

phrase or term (e.g., heart attack) was always associated with the same concept type. Moreover, the

types of concepts are almost completely restricted by the section they occur in (e.g., EEG activities

and events do not occur in the HISTORY , MEDICATION, or CORRELATION sections). This

suggests that considering EEG concepts alone does not provide any additional value to considering

terms directly. Moreover, because the index records positional information, multi-term concepts

(e.g., “slow and sharp wave”) are handled identically to multi-term phrases. Despite this, a number

of errors were observed. First, neither phrase chunking nor concept detection is sufficient to

fully capture the semantics of all inclusion criteria. For example, epileptiform activities were

often described as attributes of a particular wave (e.g., “slow rhythmic delta [waves]”) where the

individual concept (i.e., “delta [waves]”) is far less meaningful than its attributes (“slow” and

“rhythmic”). This suggests that performance can be improved by not only accounting for the

attributes of epileptiform activities but by adjusting the relevance model to ensure that mentions

of attributes actually modify the correct term – that is, to ensure that “slow” actually modifies the

same wave as “rhythmic”.

Finally, the substantial increase in performance when using the full multi-modal index shows that

EEG fingerprints are able to recover relevant information omitted from EEG reports. Unfortunately,

as the rank of retrieved patients decreases, the quality of the cohort obtained by finding similar

patients using EEG fingerprints decreases. We investigated multiple values of λ (the number of

patients to be used for re-ranking) and δ (the number of similar fingerprints to retrieve), and found

that increasing these values can result in a decrease in performance. Regardless, we did observe

that the fingerprints often identify patients which were not retrieved using the report text alone.
Moreover, we believe that by further refining the fingerprints we can improve the quality of patients
retrieved with higher values of λ.

4.7 Summary

In this chapter, we described a patient cohort retrieval system that relies on a multi-modal multi-
tiered index that organizes clinical information automatically processed from a big data resource of
EEGs. Generating the index involved not only medical language processing on EEG reports, but also a novel and highly efficient representation of the EEG signal recordings provided by a performant Long Short-Term Memory network. When evaluating the quality of patient cohorts obtained when considering both EEG reports and signal recordings, we obtained a Mean Average Precision of 70.43%. This high performance highlights the promise of multi-modal retrieval from text and
signal data. The remaining barriers of high-accuracy patient cohort identification from EEGs that
need to be removed will rely on: (1) incorporating a more fine-grained representation of inclusion
and exclusion semantics discerned from EEG reports, (2) extending medical language processing
for capturing spatial and temporal information, and (3) tightly correlating the information from
EEG reports with the EEG signal recordings. In future work, we plan to address these barriers
using recent developments in neural learning.

CHAPTER 5

ACCOUNTING FOR LONGITUDINAL INFORMATION

One of the most significant differences between information retrieval and question answering
systems operating on general domain text and medical texts is the role of longitudinal information.
That is, accounting for the fact that the information recorded in EHRs changes over time for each
patient. There are an estimated 136.3 million1 emergency department visits each year in the United
States. Of these emergency department visits, 12% (16.4 million) result in hospital admissions,
resulting in an average hospital stay of 4.8 days2. In each of these hospitalizations, clinicians
generate multiple electronic health records (EHRs) which document a wide variety of clinical
observations, such as the patient’s diagnoses, risk factors, medications, and test results. This
explosion of rich clinical information offers an exciting opportunity to substantially improve the
quality of health care. Specifically, the United States government has outlined four major goals for
widespread EHR adoption3:

[GOAL 1] Track data over time;

[GOAL 2] Identify patients who are due for preventive visits and screenings;

[GOAL 3] Monitor how patients measure up to certain parameters, such as vaccinations and blood
pressure readings; and

[GOAL 4] Improve overall quality of care in a practice

In this chapter, we describe how each of these goals can be addressed paving the way for more
accurate and reliable medical question answering and patient cohort retrieval systems. Specifically,

1According to the National Hospital Ambulatory Medical Care Survey: 2011 Emergency Department Summary
Tables. Tables 1, 4, 14, 24. See: http://www.cdc.gov/nchs/fastats/emergency-department.htm

2According to the National Hospital Discharge Survey: 2010 table, Number and rate of hospital discharge. See
http://www.cdc.gov/nchs/fastats/hospital.htm

3http://www.healthit.gov/providers-professionals/electronic-medical-records-emr

we investigate three probabilistic graphical models for capturing longitudinal information: Sec-

tion 5.1 describes a lattice Markov network for predicting risk factors for heart disease in diabetic

patients, Section 5.2 presents a novel graphical model for inferring the causal interactions among

risk factors and medications over time, and Section 5.3 presents a general Bayesian model that jointly

learns to predict clinical observations in time and to cluster patients into latent sub-populations.

5.1 Lattice Markov Networks4

The narrative clinical notes in electronic health records (EHRs) mention clinical findings (CFs) re-

lated to patients, and thus are a very important source of information which captures the progression

of the patients’ overall clinical picture. An additional important aspect of the information available

from clinical narratives is provided by temporal information, which enables temporal inference

related to CFs. The clinical information about CFs, and their associated temporal information, is

not structured; thus, it is available only when automatic extraction techniques based on natural lan-

guage processing are employed. As extraction techniques become available, they make possible the

development of prediction methods that can evaluate the likelihood that a certain patient develops

a new condition or clinical risk factor. These predictions can be used in the clinical management

of the patients, being essential in personalized medicine as they inform individual diagnostic and

treatment decision making.

Automated extraction techniques that are able to identify CFs as well as their temporal infor-

mation were developed in the recent 2014 Informatics for Integrating Biology and the Bedside

(i2b2) Challenges addressing Language Processing for Clinical Data5. This task made possible

4Minor revision, with permission, of Travis R. Goodwin, and Sanda M. Harabagiu, A Probabilistic Reasoning
Method for Predicting the Progression of Clinical Findings from Electronic Medical Records, Proceedings of the
American Medical Informatics Association (AMIA) Joint Summits on Translational Bioinformatics (TBI) and Clinical
Research Informatics (CRI), 2015. PMID:26306238.

5Information regarding the i2b2/UTHealth shared task is available at https://www.i2b2.org/NLP/


HeartDisease/

the development of multiple approaches which were able to recognize CFs related to coronary

artery disease (CAD). These CFs include diagnoses of related diseases, such as CAD itself, and

diabetes, as well as certain risk factors, such as hypertension, hyperlipidemia, and obesity. As

information about these risk factors related to CAD along with associated temporal information

can now be identified automatically from the narrative portion of EHRs, we are in a position to be

able to (1) perform temporal inference, which enables and informs (2) prediction techniques based

on state-of-the-art probabilistic knowledge representation and reasoning.

5.1.1 Related Work

Clinical prediction rules have been developed to reduce the uncertainty inherent in medical practice

by defining how to use CFs to make predictions (Wasson et al., 1985). However, these rules do not

capture the temporal aspects of the change in CFs, and thus cannot predict their progression. A vast

literature on mining association rules from EHRs has been published, e.g., Kost et al. (2012); Rashid,

Hoque, and Sattar (Rashid et al.), but it has been documented that these methods often produce many

superfluous rules, and even those that are useful for prediction do not rely on any form of temporal

inference. In consequence, they capture only a small portion of the medical knowledge that can

be inferred from EMRs, and only produce predictions that do not consider temporal information.

Most prediction models for Coronary Artery Disease (CAD) rely on statistical methods based

on Cox regression, as illustrated by the GRACE post-discharge prediction model (Fox et al.,

2006). As reported in Eagle et al. (2004), the results of this prediction model on development

and validation patient cohorts were promising but may be further improved by probabilistically

modeling the statistical inter-dependencies between risk factors and co-morbidities. The model

presented in this section is, to our knowledge, the first prediction model that uses an undirected

probabilistic graphical model capable of representing such inter-dependencies and enabling the

prediction of progressions of CFs. Although Bayesian networks have been used for many years

in predictive medicine (Ozdemir and Yildirim, 2014; Roberts et al., 2006), they operate on an

underlying causal assumption: that the probabilistic influence between two random variables is

represented by conditional probabilities. The model presented in this section leverages an alternative

class of probabilistic graphical models, known as Markov networks, which only assume correlation

between random variables (represented by joint probabilities) and allow for bi-directional influence.

Because we are interested in predicting the likely progression of CFs for any patient, we rely both

on the temporal information and on the bi-directional influence between any pair of CFs discovered

in the EHRs, thus Markov networks are ideal for our probabilistic representation.

5.1.2 The 2014 i2b2/UTHealth Dataset

In this chapter, we consider a dataset of 790 fully de-identified narrative electronic health records.

These records were provided by the 2014 Informatics for Integrating Biology and the Bedside

(i2b2) and The University of Texas Health Science Center at Houston (UTHealth) shared-tasks on

Challenges in Language Processing for Clinical Data. This dataset documents the progression

of heart disease over longitudinal EHRs for 128 diabetic patients where the number of individual

EHRs for each patient varied from three to five. Each EHR was manually annotated to indicate

the presence of certain clinical findings (CFs) deemed clinically relevant to heart disease or diabetes

and include both diseases and their associated risk factors. Table 5.1 illustrates examples of CFs

annotated as well as the criteria used for identifying them in EHRs. There were 6,302 such annotations: 1,695 for DIABETES, 433 for OBESITY, 1,926 for HYPERTENSION, 1,062 for HYPERLIPIDEMIA, and 1,186 for CAD. Each of the CFs was also annotated with a temporal

signal (TS) which indicates when the CF was inferred. Three temporal signals were used; their

definitions and examples are provided in Table 5.2.

In our work, we considered a potential secondary use of this dataset: to design prediction

models that can infer the progression of CFs over time, given a set of EHRs with the CFs and TSs

extracted from them.

Table 5.1. Clinical findings related to heart disease, based on risk factors annotated in the
i2b2/UTHealth 2014 dataset.

Clinical Finding: Criteria (Example)

CF1 = DIABETES (DM):
(1) diagnosis of type 1 or 2 diabetes (e.g., "patient has h/o DMII")
(2) A1c test over 6.5 (e.g., "7/18: A1c: 7.3")
(3) two fasting blood glucose measures over 126 (e.g., "(8:00AM) glu: 145 . . . (8:00PM) glu: 139")

CF2 = CORONARY ARTERY DISEASE (CAD):
(1) diagnosis of coronary artery disease (CAD) (e.g., "PMH: significant for CAD")
(2) myocardial infarction (MI, STEMI, NSTEMI) (e.g., "s/p STEMI in 2004")
(3) revascularization, cardiac arrest or ischemic cardiomyopathy (e.g., "CABG in 1999")
(4) stress test showing ischemia (e.g., "dolbutamine stress test revealing ischemia")
(5) abnormal cardiac catheterization showing coronary stenoses (e.g., "cath. of LAD revealed 50% lesion")
(6) chest pain consistent with angina (e.g., "treated for stable angina")

CF3 = HYPERLIPIDEMIA (HLA):
(1) diagnosis of Hyperlipidemia or Hypercholesterolemia (e.g., "control of his hypercholesterolemia")
(2) total cholesterol measure of over 240 (e.g., "result of latest chol. test is 250")
(3) LDL measurement of over 100 mg/dL (e.g., "latest LDL: 135")

CF4 = HYPERTENSION (HTN):
(1) diagnosis of Hypertension (e.g., "PMH: HTN")
(2) blood pressure measurement of over 140/90 mm/hg (e.g., "at admit, bp 140/100")

CF5 = OBESITY (OBY):
(1) a description of the patient as being obese (e.g., "57y/o obese white male")
(2) a body mass index (BMI) over 30 (e.g., "recommending lowering BMI (31.4 last August)")
(3) a waist circumference > 40 in. for males or 35 in. for females (e.g., "42in waist")

5.1.3 Predicting the Progression of Clinical Findings

We developed a probabilistic reasoning technique which is able to predict the progression of CFs
for any individual patient. To enable such predictions, we needed to encode knowledge about

Table 5.2. Temporal signals associated with risk factors in the i2b2/UTHealth 2014 dataset.

Temporal Signal: Definition (Example)

DURING: finding was present at the time this EHR was created (e.g., "today's lab values: Chol. 247")
BEFORE: finding was present before the creation of this EHR (e.g., "lab values from previous visit: LDL: 135")
AFTER: finding is present after the creation of this EHR (e.g., "confirmed as diabetic")

CFs, temporal information that allows for a chronological ordering (CO), as well as the statistical
inter-dependencies between CFs. Although we have used the i2b2/UTHealth dataset, our method
can operate on any set of EHRs with any arbitrary set of CFs as long as they have been extracted
with their associated TSs. The probabilistic knowledge that we encoded relied on (1) the CFs that
were extracted from the annotations available in the data; (2) the COs that resulted from a form
of temporal inference which assigned CFs to time intervals; and (3) statistical inter-dependencies
which were estimated based on the COs produced on the entire dataset. This knowledge was cast
in a graphical model on which probabilistic inference allowed us to produce predictions at any time
during the health management of a patient. To summarize, our approach consists of three steps: (1)
infer COs of CFs, (2) encode knowledge in a graphical model and (3) use probabilistic inference
on the graphical model to make predictions.

Chronological Ordering of Clinical Findings

The EHRs in our dataset document the clinical findings (CFs) for each patient at different times. As
such, there is an implicit temporal ordering between the EHRs for an individual patient. Moreover,
in each EHR, temporal signals (TSs) provide additional temporal information for each CF. For
example, when encountering a clinical finding CFi associated with the TS BEFORE within EHR j ,
we can infer that CFi was present in the time interval beginning at the creation of EHR j−1 and
ending when EHR j was created. These creation times (CTs) were parsed from EHRs. Figure 5.1
shows examples of CFs as well as their associated TSs for two patients.

Figure 5.1. Example clinical findings (CFs) and their associated temporal signals (TSs) across multiple EMRs/EHRs for two patients: (a) Patient 109 (four EMRs, dated 08/18/2098 through 01/07/2102) and (b) Patient 395 (five EMRs, dated 11/10/2104 through 09/08/2109).

By analyzing the dataset that was created for the 2014 i2b2/UTHealth Shared Task, we noticed

that the distribution of TSs associated with mentions of CFs is as follows: BEFORE was the

most predominant, associated with 37% of CFs, whereas DURING was observed for 35% of the

CFs, while AFTER was associated with 28% of the CFs. Moreover, we observed that the same

CF may be mentioned multiple times in the same EHR and that each of these mentions may be

associated with a different TS. In this way, the TSs associated with each CF vary within a single

EHR, across EHRs for the same patient, and between different patients from the population. When

analyzing the association between mentions of CFs and TSs, we discovered that 89% of the CFs

mentioned in an EHR were associated with all three possible TSs and only a very small percentage

of the CFs annotated in our dataset were associated with only one or two TSs. Motivated by these

observations, we based our chronological ordering on temporal inference which operates according

to the following assumptions:

(A1 ) If a CFi from EHR j for a patient is associated with TS = AFTER, we temporally infer that

CFi was present in the time interval TI(CT(EHR j ), CT(EHR j+1 )), denoting the time interval

(TI) between the creation times (CTs) of two successive EHRs for the same patient.

(A2 ) In the first EHR created for a patient, all CFs associated with TS = BEFORE are inferred to

have been present in the special time interval BEFORE-ALL.

(A3 ) In the last EHR created for a patient, all CFs associated with TS = AFTER are inferred to

have been present in the special time interval AFTER-ALL.

(A4 ) A CFi associated with TS = DURING is processed in the same way as if it were annotated

with TS = AFTER (as described in [A1]). Very few CFs occur only DURING the medical

visit (6%), while the vast majority of CFs (83%) occur both DURING and AFTER (and even

BEFORE) the creation time of their EHRs. Hence, given that the TS DURING does not

represent a statistically significant distribution in our data, we cast the temporal inference

for it to be similar to the one dictated by the TS AFTER. Clearly, for a different temporal

distribution of CFs, this assumption may not hold and additional temporal inference may be

required.

Based on these assumptions, we automatically inferred the CO of the CFs for each patient. Figure 5.2

illustrates the temporal inference for one patient documented in the dataset and the resulting CO of

CFs induced for that patient.

Figure 5.2. Chronological ordering (CO) of clinical findings (CFs) for a patient. The figure depicts (INPUT) the temporally ordered EHRs, (STEP 1) the derivation of time intervals from the EHR creation times, spanning BEFORE-ALL, TI(EHR1, EHR2), ..., AFTER-ALL, (STEP 2) the temporal inference of clinical finding mentions into time intervals based on temporal signals, and (OUTPUT) the resulting chronological ordering of mentions of clinical findings.

To devise the chronological order of CFs, we first take into account the creation times (CTs) of

each EHR. Given all the EHRs generated for a patient, we order the EHRs as shown in Figure 5.2.

This allows us to produce N + 1 time intervals where N is the number of EHRs produced for the

patient. These time intervals (TIs) are represented as: [BEFORE-ALL; TI(CT(EHR1 ), CT(EHR2 ));

TI(CT(EHR2 ), CT(EHR3 )); · · · ; AFTER-ALL]. In the next step, we applied the assumptions (A1 -

A4), in order to determine which CFs should be associated with each TI. For each EHRi, we map

each CF with TS = BEFORE into the time interval TI(CT(EHRi−1 ), CT(EHRi )), and each CF with

TS = AFTER into the time interval TI(CT(EHRi ), CT(EHRi+1 )).

As illustrated in Figure 5.2, a chronological ordering (CO) is a temporally-ordered sequence

of sets, where each set Si represents the combination of CFs which were temporally inferred as

belonging to the i-th time interval (TIi ). For example, the CO produced for the patient illustrated in

Figure 5.2 consists of the following sets: S0 = {DIABETES, CAD}; S1 = {CAD, OBESITY}; S2 = {DIABETES, HYPERTENSION}; S3 = {HYPERLIPIDEMIA, OBESITY}; and S4 = {CAD, HYPERTENSION}. When we infer the COs for all patients in the dataset, we enable the probabilistic

representation of the knowledge about CFs from the clinical dataset.
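The chronological ordering procedure can be sketched as follows; the input format (a list of EHRs, each with a creation time and a list of (finding, temporal signal) mentions) is an assumption made for illustration.

```python
def chronological_ordering(ehrs):
    """ehrs: list of dicts {"ct": creation_time, "mentions": [(finding, signal), ...]}.
    Returns N+1 sets of CFs: BEFORE-ALL, TI(EHR1, EHR2), ..., AFTER-ALL."""
    ehrs = sorted(ehrs, key=lambda e: e["ct"])        # order the EHRs by creation time
    n = len(ehrs)
    intervals = [set() for _ in range(n + 1)]
    for i, ehr in enumerate(ehrs):
        for finding, signal in ehr["mentions"]:
            if signal == "BEFORE":
                intervals[i].add(finding)             # assumption A2 when i == 0 (BEFORE-ALL)
            else:                                     # DURING is handled like AFTER (A4)
                intervals[i + 1].add(finding)         # A1, or A3 when i == n-1 (AFTER-ALL)
    return intervals
```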

Figure 5.3. A probabilistic graphical model encoding the likelihood of any possible progression of
clinical findings.

A Graphical Model for Representing Knowledge about Clinical Findings

We encoded knowledge using a probabilistic graphical model (PGM), illustrated in Figure 5.3. In
our PGM, nodes correspond to CFs and are represented as binary random variables. Our PGM
also encodes knowledge about the CO of CFs. COs are sequences of sets of CFs, denoted as
S0, S1, · · · , S L where L is the longest CO inferred from our dataset. Because the PGM encodes
knowledge about the entire patient population documented in the dataset, the PGM needed to
encode all the possible sets of CFs for each Si where 0 ≤ i ≤ L. This was achieved by assigning
a value of 1 to the random variable of a CF which was observed in the same Si and a value of 0
to any CF which was not observed in that same Si . An advantage of the knowledge representation
using the PGM stems from the ability to also assign a probability to the random variables, which
encapsulates the statistical distribution of the CFs corresponding to the COs across all patients.
A second advantage of this knowledge representation stems from the ability to capture statistical
dependencies between the random variables, which are represented as edges in the graph. Any
edge from a CF x in Si to a CF y in Si+1 indicates such a dependency. The statistical dependencies
between CFs across successive sets (Si to Si+1 ) allows us to represent all the possible ways in which
CFs may progress from one time interval to the next based on the properties of our clinical dataset.

Because we are only considering five different CFs in our dataset, there are 2^5 = 32 possible assignments of CFs for any set Si, and thus 32 possible transitions from any Si to Si+1.

Predictions as Probabilistic Inference

To make predictions about the progression of CFs of any patient, we used probabilistic inference

to estimate the most likely assignment of CFs, Ŝ, given the observed sets of CFs and their COs

encoded in the sets S0, S1, · · · , Si by finding the most likely assignment to the set Si+1 . When i = 1,

we make predictions about the progression of CFs given knowledge documented only in the first

EHR, whereas when i = LAST, we make predictions about the progression of CFs after the last visit

(documented in the last EHR) of the patient. Since there are 32 possible transitions from any Si to

Si+1 , we define X as the set of all 32 possible values for Si+1 . In this way, probabilistic inference

used the maximum a posterior (MAP) assignment which predicts the most likely progression, i.e.,

the set of CFs Si+1 provided by the assignment Ŝ:

$$\hat{S} = \operatorname*{arg\,max}_{S' \in X} P\left(S_{i+1} = S' \mid S_0, \cdots, S_i\right) \qquad (5.1)$$

To compute the MAP estimation, we also needed to estimate (a) the transition probability between

a set of CFs Si to a set Si+1 ; and (b) the prior probability of any set Si which indicates the

likelihood that the combination of CFs represented by Si was observed in any CO produced by

temporal inference in the patient population. To estimate the transition probability, P (Si+1 | Si )

we evaluated two functions: (a) Q1 (Si, Si+1 ) representing the number of times in which all CFs

observed in the set Si were temporally mapped to some time interval TI j in a CO, while all the

observed CFs from Si+1 were temporally mapped to the next time interval TI j+1 in the same CO;

and (b) Q2 (Si ) representing the number of times the CFs from Si were temporally mapped to the

same time interval in any of the COs for the entire patient population. This allowed us to estimate

P (Si+1 | Si ) = Q1 (Si, Si+1 )/Q2 (Si ). Similarly, to estimate P(Si ), we define the number Q3 which

represents the total number of COs induced for the entire dataset. Then, P(Si ) = Q2 (Si )/Q3 . Given

the definitions of the transition probability and the prior probability of any set of CFs, we can
compute the likelihood of any progression of CFs. We define the progression of CFs as a sequence
of sets of CFs, S0, S1, · · · , S j , where j represents the number of time intervals in the documented
care of the patient. This enables us to compute the likelihood of any progression of CFs as:
$$P(S_0, \cdots, S_j) = P(S_0) \times \prod_{i=0}^{j-1} P\left(S_{i+1} \mid S_i\right) \qquad (5.2)$$

As we were able to compute the likelihood of any arbitrary progression of CFs from the dataset,
we were also capable of predicting the progression of a new, unseen set of CFs, i.e., S j+1 , using
Equation (5.3).
$$P\left(S_{j+1} \mid S_0, \cdots, S_j\right) = \frac{P(S_0, \cdots, S_{j+1})}{\sum_{S' \in X} P(S_0, \cdots, S_j, S')} \qquad (5.3)$$
As the probability of a new progression of CFs is defined by Equation (5.3), the probabilistic
inference through MAP as given in Equation (5.1) makes predictions about the progression of CFs
in the dataset used in our experiments and allowed us to evaluate the model we have constructed. To
exemplify the probabilistic inference enabled by our model, we use the CO illustrated in Figure 5.2
to instantiate the five sets of CFs S0 , S1 , S2 , S3 , and S4 . This allows us to determine the probability
that the CFs for this patient will progress such that only HYPERTENSION is present in the future
by defining S5 = {HT N5 = 1, OBY5 = 0, H L A5 = 0, DBS5 = 0, C AD5 = 0}. Using Equation (5.3),
we can compute the posterior probability for the CFs in S5 as 09.8%, meaning that, for the patient,
there is an approximately 10% chance that he or she will no longer present with CAD in the next
hospital visit and will instead present with only hypertension. We can additionally predict the most
probable progression of CFs for the same patient, by determining (a) the most probable assignment
of random variables from S5 and (b) the likelihood of that assignment. The most probable next set
of CFs is HYPERTENSION and CAD with probability 17.5%, and the next most likely assignment
is HYPERTENSION at 9.8%. This shows that although there are many possible combinations of
CFs for the next time-step, our model predicts that the combination of both HYPERTENSION and
CAD is 78.6% more likely than just HYPERTENSION.
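A simplified sketch of this inference is shown below. Under the first-order factorization of Equation (5.2), the MAP assignment for the next set depends only on the last observed set, so the prediction reduces to selecting the set with the highest estimated transition probability; the input format (a list of chronological orderings, each represented as a list of frozensets of CFs) is an assumption made for illustration.

```python
from collections import Counter
from itertools import combinations

CFS = ["DIABETES", "CAD", "HYPERLIPIDEMIA", "HYPERTENSION", "OBESITY"]
# X: all 2^5 = 32 possible sets of clinical findings
X = [frozenset(c) for r in range(len(CFS) + 1) for c in combinations(CFS, r)]

def estimate_counts(orderings):
    """orderings: list of COs, each a list of frozensets of CFs."""
    q1, q2 = Counter(), Counter()             # Q1(Si, Si+1) and Q2(Si)
    for co in orderings:
        for s_i, s_next in zip(co, co[1:]):
            q1[(s_i, s_next)] += 1
        for s_i in co:
            q2[s_i] += 1
    return q1, q2

def predict_next(last_observed, q1, q2):
    """Return the MAP assignment for the next set of CFs (Equation (5.1))."""
    def transition(s_next):                   # P(Si+1 | Si) = Q1(Si, Si+1) / Q2(Si)
        return q1[(last_observed, s_next)] / q2[last_observed] if q2[last_observed] else 0.0
    return max(X, key=transition)
```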

Table 5.3. Performance results over all CFs for COs of length j, where Acc = Accuracy, PPV =
positive predictive value (Precision), FNR = false negative rate, FPR = false positive rate, TNR
= true negative rate (Specificity), TPR = true positive rate (Recall), F1 = F1 -measure defined as
2 × PPV × TPR / (PPV + TPR), TP = true positives, FP = false positives, FN = false negatives, TN = true negatives.

j Acc. PPV FNR FPR TNR TPR F1 TP FP FN TN


1 84.94 94.22 22.84 05.69 94.31 77.16 84.84 375 23 111 381
2 81.91 86.63 17.80 18.51 81.49 82.20 84.35 434 67 94 295
3 86.18 91.13 13.20 14.91 85.09 86.80 88.91 493 48 75 274
4 86.71 85.21 06.27 23.38 76.62 93.73 89.26 478 83 32 272
5 88.92 85.29 02.52 22.60 77.40 97.48 90.98 232 40 6 137

5.1.4 Experimental Results

We conducted an extensive set of experiments with the purpose of evaluating the quality of the
predictions of CF progressions. We considered predictions performed after any number of time
intervals had been observed, i.e., we evaluated the predictions for each possible value of 1 ≤ j ≤ L,
where L is the length of the longest CO obtained for any patient.
It is to be noted that the accuracy of our model is always very high (above 80%) while the F1 -
measure improves as more information from the COs of CFs becomes available. Table 5.3 details
the evaluations of the predictions produced by our system for all COs of length j from the entire
patient population. As shown, the highest precision for the prediction of any type of CF was obtained
when considering only a single previous time-interval. This confirms the conclusions hypothesized
in Bejan et al. (2013) which suggest that the presence of CFs in the immediately preceding EHR
is the best predictor for the presence of a CF. Note, however, that when considering COs of greater
length, the Recall tends to improve. This suggests that considering more chronological information
allows for more complex and rarer combinations of CFs to be predicted.
A more detailed evaluation of the prediction of the progression for each individual CF is
illustrated in Figure 5.4. Across all five CFs, our probabilistic model has the best F1 measure when
predicting DIABETES, with an F1 = 94.57. HYPERTENSION was a close second at F1 = 93.24, followed by Coronary Artery Disease (CAD) at F1 = 92.31, HYPERLIPIDEMIA at F1 = 85.42,

(a) Obesity (b) Hypertension (c) Hyperlipidemia

(d) Diabetes (e) CAD

Figure 5.4. Experimental results for the prediction of the progression of CFs for chronological
orderings of lengths 1 ≤ j ≤ 5, where A denotes the Accuracy, P denotes the Precision, R denotes
the Recall, and F1 denotes the F1 -measure.

and OBESITY at F1 = 78.8. We believe that the difference in predictive performance for these CFs

can be attributed to differences in the distribution of how often each type of CF was observed in our

dataset. For example, DIABETES, HYPERTENSION and CAD are the most frequently occurring

CFs (26%, 30%, and 19% respectively), while HYPERLIPIDEMIA and OBESITY are the least

frequently occurring CFs (accounting for 16% and 7%, respectively). Further, we notice that for

the most common CFs (HYPERTENSION and DIABETES), the F1 -measure tends to improve

consistently as COs of longer lengths are considered, while for less common CFs (OBESITY) the

trends are not as clear. This implies the need for additional data to balance the distribution of CFs.

It also shows that, for any given problem-based patient cohort, the progression of certain CFs may

be better predicted as more EHRs documenting that CF are provided.

5.1.5 Lessons Learned

This section has introduced a novel method of predicting the progression of CFs. The predictions

are based on probabilistic inference operating on a graphical model that encodes knowledge about

CFs extracted from EHRs as well as their inferred chronological orderings. The probabilistic

graphical model described in this chapter provided promising results in extensive experiments.

5.2 Inferring Temporal Interactions involving Risk Factors and Medications6

In this section, we describe how to infer longitudinal (i.e., temporal) interactions between risk

factors and medications. As defined by the World Health Organization (WHO), a risk factor is any

attribute, characteristic, or exposure of an individual that increases the likelihood of developing a

disease. Because risk factors are such powerful indicators of the likelihood of a patient developing

a disease, they play a critical role in the management and care of individual patients. Naturally, risk

factors are frequently explicitly documented in the Electronic Health Records (EHRs) associated

with a patient. However, as revealed by consultations conducted by the Informatics for Integrating

Biology at the Bedside (i2b2) and The University of Texas Health and Sciences Center (UTHealth)

with clinicians, many risk factors are not explicitly diagnosed; rather, they are merely implied

through natural language text in the EHR (Stubbs et al., 2015). For example, an EHR may omit

an explicit diagnosis of diabetes, instead stating an abnormally high blood glucose measurement

indicative of the disease. For this reason, it is important to consider both the explicitly mentioned

risk factors as well as the textual indicators that suggest them. In addition to risk factors, EHRs also

document other elements of the patient’s care, such as the medications prescribed to the patient.

The prescription of medications and the presence of risk factors play complementary roles in

the management and care of a patient: risk factors increase the likelihood of a patient having or

6Minor revision, with permission, of Travis R. Goodwin, and Sanda M. Harabagiu, Inferring the Interactions of
Risk Factors from EHRs, Proceedings of the American Medical Informatics Association (AMIA) Joint Summits on
Translational Bioinformatics (TBI) and Clinical Research Informatics (CRI), 2016. PMID:27595044.

developing a disease, while medications decrease the likelihood of the disease presenting in the
future. Unfortunately, the exact relationship between individual medications and the risk factors
they are targeting is rarely stated in EHRs. Moreover, many medications which target a particular
risk factor can interact with the other risk factors associated with a patient. These interactions
are difficult to anticipate without elaborate clinical trials and analysis, particularly for uncommon
combinations of risk factors. To make matters worse, there are a variety of complex interactions
between multiple risk factors (i.e., a patient with high blood pressure who also smokes is more
likely to develop coronary artery disease than a patient with only high blood pressure). However,
by exploiting the fact that EHRs document the risk factors and the medications given to patients at
different times during their clinical care, it is possible to construct a chronological model of how
the risk factors and medications interact over time. In this section, we define a novel data-driven
probabilistic model of the interactions between risk factors and medications which uses statistical
trends discovered across a large set of EHRs. We also show how this model can be used to (1)
predict the presence or absence of certain risk factors in a patient’s future, to (2) discover the
relationships between individual risk factors and medications, and to (3) identify patients with
irregular or unusual progressions of risk factors and medications.
In order to evaluate our model, we utilized the set of longitudinal EHRs provided by the
organizers of the Challenges in Language Processing for Clinical Data shared task sponsored by
the 2014 Informatics for Integrating Biology and the Bedside (I2B2) and The University of Texas
Health Science Center (UTHealth). These EHRs document the progression of heart disease for a
population of diabetic patients, and are particularly well-suited for our model because they were
manually annotated by physicians to denote the presence of risk factors and medications relevant
to diagnosing heart disease.

5.2.1 Related Work and Background

Historically, temporal models for clinical prediction use established criteria specific to an individual
disease and do not often generalize well to new diseases. For example, a regression model capable

of selecting patients who may become at risk for heart disease was developed in Amarasingham

et al. (2010), while a variety of different prediction models were analyzed based on their ability to

screen for individual types of cancer based on known antigen relationships in Vickers (2011). An

automatic system based entirely on narrative content was constructed in Bejan et al. (2012) and

evaluated for its ability to identify patients with pneumonia based on past mentions of the disease.

More recent models have focused on modeling multiple types of diseases jointly. A disease-subtype

prediction model was developed in Huopaniemi et al. (2014) which relies on mixture modeling and

a joint-disease risk prediction model using logistic regression was described in Wang et al. (2014).

However, these models cannot account for variations in the amount of time between successive

disease observations. Moreover, the more generalized models do not account for the common

semantics associated with diseases and medications (namely, that disease can predict disease, and

that medications can prevent disease). In order to advance predictive modeling past both of these

barriers, we developed a general multiple risk factor and medication prediction model based on

recent advances in statistical modeling. Specifically, we rely on a powerful probabilistic framework

known as Probabilistic Graphical Models (PGMs) (Koller and Friedman, 2009) which can be

viewed as a generalization of both mixture and regression modeling. Graphical models are able to

not only encode knowledge about multiple risk factors and medications at particular times, but can

also directly represent the interactions between these different points in time.

In this section, we leverage both the sequential modeling and probabilistic inference capabilities

of PGMs by defining a model of patients' chronologies which is general in the sense that it does

not rely on pre-specified knowledge about the relationships between risk factors and medications.

Our model is able to recover these relationships from a large body of EHRs, enabling us to not

only predict the way risk factors may progress for patients, but to discover the latent interactions

between risk factors and disease.

5.2.2 Risk Factors and Medications

When conducting our experiments, we used a collection of EHRs associated with 178 diabetic
patients, provided by the organizers of the shared-tasks on Challenges in Language Processing
for Clinical Data7 sponsored by the 2014 Informatics for Integrating Biology and the Bedside8
(i2b2) and The University of Texas Health Science Center at Houston9 (UTHealth) described in
Section 5.1.2. Note that in order to follow HIPAA guidelines and to protect patients’ privacy, the
patient information in these records was de-identified, meaning that patients’ names and, more
importantly, the timestamps associated with each individual discharge summary are obfuscated.
Fortunately, the timestamps were obfuscated in a way that preserved the relative elapsed time be-
tween successive discharge summaries for the same patient. That is, the de-identification procedure
merely adjusted all timestamps for a patient by a fixed amount, so that although the exact date of each
discharge summary cannot be recovered, the relative elapsed time between successive discharge
summaries is unchanged. The 2014 i2b2/UTHealth dataset was well suited for our experiments
because it contains gold-standard annotations explicitly documenting the presence of risk factors
and medications associated with heart disease. A total of 7 risk factors were considered, those
described in Table 5.1, as well as:
• Family history of premature CAD was indicated by a description of a first-degree relative
(i.e., parent, sibling or child) who was diagnosed prematurely (i.e., below the age of 55 for males
and 65 for females) with CAD.
• Smoking was indicated by a mention of the patient having smoked within the past year.
In addition to the seven risk factors, the discharge summaries were also annotated with medica-
tions prescribed for each patient which were related to diabetes, CAD, hyperlipidemia, hypertension,

7Information on these shared-tasks is available at https://www.i2b2.org/NLP/HeartDisease/.

8I2B2 is an NIH-funded National Center for Biomedical Computing. Additional information is available at https://www.i2b2.org/index.html.

9The UT Health Science Center is part of The University of Texas System. More information is available at https://www.uth.edu/.

or obesity. The exact risk factors targeted by each medication were not explicitly annotated. In

total, 22 medications and medication types were considered as listed in Table 5.4. Although each

Table 5.4. Medication types annotated in the i2b2/UTHealth 2014 dataset.


Medication Type Example
M1 ACE inhibitor Lisinopril
M2 amylin N/A
M3 anti-diabetes Glyset
M4 ARB Avapro
M5 aspirin Aspirin
M6 beta blocker Labetalol
M7 calcium channel blocker Norvasc
M8 diuretic Hydrochlorothiazide
M9 DPP4 inhibitor Januvia
M10 ezetimibe Zetia
M11 fibrate Tricor
M12 GLP1 agonist N/A
M13 insulin Novolog
M14 meglitinides N/A
M15 metformin Glucophage
M16 niacin Niacin
M17 nitrate Nitroglycerin
M18 anti-obesity N/A
M19 statin Simvastatin
M20 sulfonylureas Diabeta
M21 thiazolidinedione Actos
M22 thienopryidine Plavix

risk factor and medication was associated with a temporal signal, as reported in Section 5.1, 89%

of all risk factors and medications annotated were labeled with both present and during temporal

signals. For this reason, in this section, we discarded any risk factors and medications annotated as

occurring only after or before the timestamp of the discharge summary.

5.2.3 Generating the Graphical Model

In order to automatically model the interactions between the risk factors and medications in EHRs,

we define a probabilistic model. This model operates by discovering latent trends in the way in

which risk factors and medications changed in successive discharge summaries from a collection of

patient EHRs. Because this model relies on the trends present in a particular dataset, we will first

describe how to pre-process a collection of longitudinal EHRs by extracting the clinical chronologies

and encoding them into mathematical structures. Then, we will describe a probabilistic model over

these data structures and demonstrate how the model can be used to (1) apply these latent trends to

the chronology of a new patient in order to predict how his or her risk factors might progress, to

(2) infer the interactions between pairs of risk factors, or between risk factors and medications over

time, and to (3) identify patients whose risk factors and medications have an irregular progression.

Representing Clinical Chronologies

In order to model a collection of EHRs, we first define the following parameters which characterize

the data:

N = the number of patients in the EHR collection, (5.4a)

Ln = the number of discharge summaries associated with patient n in the data, (5.4b)

i.e., the length of the patient’s chronology,

V = the number of possible risk factors our model should consider, (5.4c)

i.e., the size of the risk factor vocabulary,

U = the number of possible medication types our model should consider, (5.4d)

i.e., the size of the medication lexicon.

Using the 2014 i2b2/UTHealth dataset, a total of N = 128 patients were used to train our model.

Each of these patients was associated with Ln ∈ [3, 5] discharge summaries which were chronologically

ordered according to their timestamps. In our experiments, we considered the annotated risk

factors and medications described in the previous section; thus, V = 7 and U = 22. Given

these parameters, we were able to represent the clinical chronologies of all patients in the data by

defining three mathematical structures, which for each patient n, encode the set of risk factors and
medications which were indicated during the i-th discharge summary (and i ranges from 1 to Ln for
each patient):

$$\begin{aligned}
\mathbb{R} &= \left\{ R_{n,v,i} \in \{0,1\} \right\}^{N \times V \times L_n} &\text{(5.5a)} \\
\mathbb{M} &= \left\{ M_{n,u,i} \in \{0,1\} \right\}^{N \times U \times L_n} &\text{(5.5b)} \\
\mathbb{E} &= \left\{ E_{n,i} \in \mathbb{R}^{+} \right\}^{N \times L_n} &\text{(5.5c)}
\end{aligned}$$

where Rn,v,i is an entry in the 3rd-order risk factor tensor10 R which indicates whether the v-th risk
factor was mentioned in the i-th discharge summary for patient n (we assigned a value of 1 when
the v-th risk factor was mentioned, and 0 otherwise); Mn,u,i is an entry in the 3rd-order medication
tensor M which indicates whether the u-th medication was mentioned in the i-th discharge summary
for patient n (we assigned a value of 1 when the u-th medication or medication type was mentioned,
and 0 otherwise); and En,i is an entry in the elapsed time matrix E which stores the number of
days elapsed between discharge summary i and the previous discharge summary, i − 1, for patient
n. Note that the elapsed time for the first discharge summary for each patient is defined as zero,
i.e., En,0 = 0. Figure 5.5 illustrates slices from the risk factor tensor R, the medication tensor
M and rows from the elapsed time matrix E which show the clinical chronology for individual
patients. In R and M, each slice corresponds to a patient (n), each row corresponds to a risk factor
or medication (respectively), and each column refers to the index of the corresponding discharge
summary (i). In E, each row refers to a patient (n), and each column refers to the index of the
associated discharge summary (i). As illustrated, R, M and E are all jagged structures, meaning
that the number of discharge summaries associated with each patient (n) may vary according to the
value of Ln . In this way, not only have we accounted for the de-identification of EHR timestamps,
but we can directly discover temporal patterns based on the relative time elapsed between successive
discharge summaries for each patient.

10A k-th order tensor is the k-dimensional analogue of a mathematical vector.
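As a concrete illustration of this encoding, the short Python sketch below builds the jagged structures R, M, and E from per-patient lists of annotated discharge summaries; the input layout, the toy values, and the index conventions are illustrative assumptions rather than the actual i2b2/UTHealth preprocessing pipeline.

from datetime import date
import numpy as np

# Hypothetical pre-extracted input: for each patient, a chronologically sorted list of
# (timestamp, set of risk-factor indices, set of medication indices) tuples.
patients = [
    [(date(2090, 1, 5), {1, 2}, {0, 4}),    # discharge summary 1 of patient 0
     (date(2090, 6, 9), {1},    {0})],      # discharge summary 2 of patient 0
    [(date(2087, 3, 2), {3},    {12}),
     (date(2088, 1, 1), {2, 3}, {12, 18}),
     (date(2088, 9, 4), {3},    set())],
]
V, U = 7, 22  # size of the risk-factor vocabulary and of the medication lexicon

R, M, E = [], [], []  # jagged: one slice (or row) of length L_n per patient
for summaries in patients:
    L_n = len(summaries)
    r = np.zeros((V, L_n), dtype=np.int8)   # slice of the risk-factor tensor
    m = np.zeros((U, L_n), dtype=np.int8)   # slice of the medication tensor
    e = np.zeros(L_n)                       # row of the elapsed-time matrix
    for i, (stamp, risks, meds) in enumerate(summaries):
        r[list(risks), i] = 1
        m[list(meds), i] = 1
        # elapsed days since the previous summary; defined as 0 for the first one
        e[i] = 0 if i == 0 else (stamp - summaries[i - 1][0]).days
    R.append(r); M.append(m); E.append(e)

print(E[1])  # e.g. [  0. 305. 247.]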

(a) The risk tensor R.

(b) The medication tensor M.

(c) The elapsed time matrix E.

Figure 5.5. Visualization of the (a) Risk Factor Tensor R, (b) Medication Tensor M and (c) Elapsed
Time matrix E with slices shown for the individual patients 1, 2, and N.

Modeling Chronological Interactions

Using the mathematical representation of patient chronologies obtained from a particular dataset,
we would like to discover patterns in how the risk factors and medications interacted over time. We

Figure 5.6. A Probabilistic Graphical Model of Patient Chronologies.

accomplished this by constructing a probabilistic graphical model (PGM) (Koller and Friedman,

2009), which can be viewed as a generalization of a traditional mixture model which allows us

to directly encode the dependencies between risk factors and medications over time. Probabilistic

Graphical Models, like mixture models, operate on a set of statistical random variables and allow us

to efficiently compute the joint distribution of these variables, from which any desired probability

can be derived (e.g., conditional probabilities, prior probabilities, etc.). In order to exploit the latent

statistical information present in the data, our model must be able to encode any arbitrary patient's

clinical chronology. To do this, we define a binary random variable for each entry in the risk factor

tensor Rn,v,i and each entry in the medication tensor Mn,u,i , as well as a continuous random variable

for each entry in the elapsed time matrix En,i . Thus, the joint distribution over these variables

captures the likelihood of observing any possible clinical chronology which may be associated with

a patient. Figure 5.6 illustrates this model using standard plate notation, wherein each shaded circle

denotes an observable variable, each edge represents a statistical dependency, and each plate (rectangular box) indicates that all variables contained in the box are copied or duplicated as many times as indicated by the quantity in the bottom-right of the plate. For example, the binary variable indicated

by Rn,v,1 is duplicated for each risk factor v ∈ [1, V] and each patient n ∈ [1, N]. The opaque

variables in the left plate correspond to latent statistical parameters which will be inferred from

the data, while each shaded column captures the elapsed time, risk factors, and medications which

were present and absent in each discharge summary. As shown, the risk factors in each discharge

summary are influenced by (1) the time elapsed since the previous discharge summary, (2) the risk

factors present in the previous discharge summary, as well as (3) the medications mentioned in

the previous discharge summary, while the medications depend only on the risk factors observed in

the same discharge summary. In order to define the full joint distribution, we must formally define

each of these four dependencies probabilistically.

We encode the fact that the presence of a particular risk factor v ∈ [1, V] in discharge summary i for patient n is likely to depend on the amount of time elapsed En,i since the previous discharge summary, by defining an Exponential distribution for each possible risk factor:

$$P\left(R_{n,v,i} \mid E_{n,i}\right) \approx \mathrm{Exponential}(E_{n,i};\, \lambda_v) = \lambda_v e^{-\lambda_v E_{n,i}} \tag{5.6}$$

where λv ∼ Gamma(λv ; αv, βv ) is the parameter of the exponential distribution over elapsed times

associated with risk factor v; αv is the number of patient chronologies with v; and βv is the sum

of elapsed times associated with discharge summaries mentioning v. Thus, Equation (5.6) states

that the likelihood of a particular risk factor given an arbitrary elapsed time follows an Exponential

distribution unique to that particular risk factor.
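For illustration, the sketch below derives a point estimate of λv from the counts αv and βv described above (using the mean of the Gamma distribution as the point estimate) and evaluates the Exponential density of Equation (5.6) for a hypothetical elapsed time; the counts are invented, not taken from the i2b2/UTHealth data.

import math

def lambda_for_risk_factor(alpha_v, beta_v):
    # Mean of Gamma(alpha_v, beta_v) used as a point estimate of lambda_v:
    # alpha_v = number of chronologies mentioning risk factor v,
    # beta_v  = total elapsed time (in days) over summaries mentioning v.
    return alpha_v / beta_v

def p_risk_given_elapsed(elapsed_days, lam):
    # Exponential density of Equation (5.6): lambda_v * exp(-lambda_v * E_{n,i}).
    return lam * math.exp(-lam * elapsed_days)

lam_v = lambda_for_risk_factor(alpha_v=94, beta_v=21500.0)   # hypothetical counts
print(round(lam_v, 5), round(p_risk_given_elapsed(180.0, lam_v), 5))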

In addition to the elapsed time, the risk factors in discharge summary i are influenced by the risk

factors in the previous discharge summary i − 1. For example, if a patient is diagnosed with diabetes

in discharge summary i, it is very likely that diabetes will be observed in the (i + 1)-th discharge

summary. To represent this type of positive correlation, we define a Noisy-Or distribution for each

risk factor, v. This enforces the semantics that risk factor v could have been triggered by each risk

factor w observed in the previous discharge summary with some probability (pv,w ). Moreover, the

Noisy-Or distribution states that the likelihood of a risk factor being present increases with each

additional risk factor which was present during the previous discharge summary. Thus, the likelihood of risk factor v being present in discharge summary i given the presence or absence of each risk factor in the previous discharge summary (i − 1) is:

$$P\left(R_{n,v,i} \mid R_{n,1,i-1}, \ldots, R_{n,V,i-1};\; p_{v,1}, \ldots, p_{v,V}, k_v\right) = 1 - \prod_{w=1}^{V} \begin{cases} 1, & \text{if } R_{n,w,i-1} = 0 \\ 1 - p_{v,w}, & \text{otherwise} \end{cases} \tag{5.7}$$

Equation (5.7) states that the likelihood of risk factor Rn,v,i given previous risk factors Rn,w,i−1 for

w ∈ [1, V] follows the Noisy-Or distribution parametrized by pv,w encoding the likelihood that the

presence of risk factor w in the previous discharge summary can predict the presence of risk factor

v in the current discharge summary. We can calculate the value pv,w ∼ Beta(γv,w, δv,w) by defining γv,w as the number of patient chronologies wherein risk factor v was present in a discharge summary immediately following a discharge summary in which risk factor w was present, and δv,w as the number of patient chronologies wherein risk factor v was absent in a discharge summary immediately following a discharge summary in which risk factor w was present.
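A minimal sketch of how the Noisy-Or of Equation (5.7) could be evaluated once the pv,w parameters have been estimated from the γv,w and δv,w counts; the counts, the number of risk factors, and the previous-summary observations below are invented for illustration.

import numpy as np

V = 3  # toy example with three risk factors

# Hypothetical Beta counts: gamma[v, w] = chronologies where v was present right after w,
#                           delta[v, w] = chronologies where v was absent right after w.
gamma = np.array([[40.0,  5.0,  6.0],
                  [10.0, 80.0,  9.0],
                  [ 6.0,  9.0, 55.0]])
delta = np.array([[ 2.0, 30.0, 28.0],
                  [20.0,  1.0, 22.0],
                  [30.0, 26.0,  3.0]])
p = gamma / (gamma + delta)  # mean of Beta(gamma, delta) as an estimate of p_{v,w}

def p_risk_present(v, prev_risks):
    # Noisy-Or of Equation (5.7): one factor (1 - p_{v,w}) for every risk factor w
    # that was present in the previous discharge summary.
    factors = np.where(prev_risks == 1, 1.0 - p[v], 1.0)
    return 1.0 - factors.prod()

prev = np.array([1, 0, 1])          # risk factors observed in summary i - 1
print(round(p_risk_present(2, prev), 3))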

The final indicator for the presence of a risk factor v in discharge summary i for patient n is the

set of medications prescribed to the patient in the previous discharge summary. This follows the

intuition that the previous prescription of a medication can prevent the presence of targeted risk

factors, and captures the negative correlation between medications and risk factors. To model this,

we utilize an inverted (i.e., (1 − p)) Noisy-Or distribution for each risk factor v which states that

absence of risk factor v can be predicted based on the presence of each medication u in the previous

discharge summary with some probability qv,u:

$$P\left(R_{n,v,i} \mid M_{n,1,i-1}, \ldots, M_{n,U,i-1};\; q_{v,1}, \ldots, q_{v,U}, a_v\right) = \prod_{u=1}^{U} \begin{cases} 1, & \text{if } M_{n,u,i-1} = 0 \\ 1 - q_{v,u}, & \text{otherwise} \end{cases} \tag{5.8}$$

Equation (5.8) defines the probability of observing risk factor Rn,v,i despite each medication pre-

scribed in the previous discharge summary. We can estimate the probability qv,u ∼ Beta(ηv,u, θ v,u )

by defining ηv,u as the number of patient chronologies in which risk factor Rn,v,i was present follow-

ing a discharge summary in which medication u was prescribed, and θ v,u as the number of discharge

summaries in which Rn,v,i was absent following a discharge summary in which medication u was

prescribed.

Together, Equations (5.6) to (5.8) capture the statistical dependencies governing the presence

or absence of each risk factor. However, the presence of a risk factor can also influence the set of

medications which are prescribed during the same discharge summary. Consider, for example, the

intuition that many medications are only prescribed after certain risk factors have been diagnosed.

To represent this type of interaction, we employ a Noisy-And distribution for each medication u

which assumes the presence of each medication mentioned in a discharge summary depends on one

or more risk factors being mentioned in the same discharge summary. Moreover, the Noisy-And

distribution states that as the number of diagnosed risk factors decreases, so must the probability

of each medication. Mathematically, this has the form:

$$P\left(M_{n,u,i} \mid R_{n,1,i}, \ldots, R_{n,V,i};\; s_{u,1}, \ldots, s_{u,V}, b_u\right) = \prod_{v=1}^{V} \begin{cases} 1 - s_{u,v}, & \text{if } R_{n,v,i} = 0 \\ 1, & \text{otherwise} \end{cases} \tag{5.9}$$

where su,v indicates the probability that medication u requires risk factor v to be diagnosed. As

with Equations (5.7) and (5.8), we estimate the probabilities su,v ∼ Beta(φu,v, ψu,v ) for u ∈ [1, U]

and v ∈ [1, V] by defining φu,v as the number of patient chronologies in which medication u was

prescribed in the same discharge summary in which risk factor v was present, and ψu,v as the

number of patient chronologies in which medication u was not prescribed in the same discharge

summary in which risk factor v was present.
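By analogy, the Noisy-And of Equation (5.9) contributes a factor (1 − su,v) for every risk factor that is absent in the same discharge summary, as in the small sketch below (the su,v values are invented for illustration).

import numpy as np

s_u = np.array([0.90, 0.35, 0.10])   # hypothetical s_{u,v} for one medication u, three risk factors
risks_now = np.array([0, 1, 1])      # risk factors present in the same discharge summary

# Noisy-And of Equation (5.9): a factor (1 - s_{u,v}) for every *absent* risk factor v.
p_med_present = np.prod(np.where(risks_now == 0, 1.0 - s_u, 1.0))
print(round(p_med_present, 3))       # only risk factor 0 is absent, so this is 1 - 0.90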

Using these four equations, we can define the joint probability of observing any possible patient chronology as:

$$P(\mathbb{E}, \mathbb{R}, \mathbb{M};\, \Theta) = \prod_{n=1}^{N} \Bigg[ \prod_{v=1}^{V} P\left(R_{n,v,1};\, k_v\right) \prod_{u=1}^{U} P\left(M_{n,u,1} \mid R_{n,1,1} \ldots R_{n,V,1}\right) \prod_{i=2}^{L_n} \prod_{v=1}^{V} P\left(R_{n,v,i} \mid R_{n,1,i-1} \ldots R_{n,V,i-1}\right) P\left(R_{n,v,i} \mid E_{n,i}\right) \prod_{u=1}^{U} P\left(M_{n,u,i} \mid R_{n,1,i} \ldots R_{n,V,i}\right) \Bigg] \tag{5.10}$$

where Θ refers to all the latent variables in our model, i.e., λv, αv, βv, γv,w, δv,w, ηv,u, θv,u, φu,v, and ψu,v. Thus, Equation (5.10) represents the joint distribution in terms of products of the previously defined conditional distributions given in Equations (5.6) to (5.9), and allows us to determine any arbitrary probability involving these variables by appealing to the basic laws of probability.

Discovering the Latent Interactions of Risk Factors and Medications

The probabilistic model represented by Equation (5.10) encodes multiple types of interactions between
risk factors and medications. The first latent interaction, as characterized by Equation (5.7), shows
the positive correlation or causal relationship between each pair of risk factors, v, and w in
successive discharge summaries through the latent variable pv,w . The second interaction, defined
through Equation (5.8), captures the negative correlation or inhibiting relationship between each
medication, u, and each risk factor v in the latent variable qv,u . The final interaction, embodied in
Equation (5.9), captures the associative strength or enabling relationship between each risk factor
v and each medication u with the latent variable su,v . We learned these variables by applying a
straightforward collapsed Gibbs sampler, as described in Liu (1994) and Porteous et al. (2008), using
the definitions provided in Pearl (1988) and in Friedman et al. (1998).

Predicting Patient Outcomes from their Histories

After discovering the latent interactions implied by the latent variables in our model, we are able
to predict the clinical outcomes (risk factors) for a new patient by determining the likelihood of

each possible risk factor v ∈ [1..V]. To enable such a prediction, we must perform the following steps: (1) encode the patient's history using binary random variables so that we can leverage our probabilistic model, and (2) use the joint probability to predict how the patient's observations may progress.
We can encode the clinical chronology for a new patient p̂ in a similar manner to the way we
represented the clinical chronologies pertaining to the original set of patients in our dataset. Let L̂
represent the number of longitudinal discharge summaries for patient p̂. This allows us to define
R̂ v,i ∈ {0, 1}V× L̂ to be the risk factor matrix, M̂u,i ∈ {0, 1}U× L̂ to be the medication matrix, and
Êi ∈ R L̂ to be the elapsed time vector. After sorting the discharge summaries for the patient in
ascending chronological ordering (according to their timestamps), we can set the value of R̂ v,i to 1
when risk factor v was mentioned in the i-th discharge summary, and 0 otherwise. Likewise, we
can set M̂u,i to 1 when medication u was mentioned in the i-th discharge summary, and 0 otherwise.
Finally, we assign to Êi the elapsed time in days between discharge summary i and the previous
discharge summary, i − 1 where Ê1 is set to 0. In this way, we have defined the risk factor and
medication matrices as well as the elapsed time vector in the same way that we defined each slice
of the risk factor and medication tensor and each row of the elapsed time matrix generated for our
original dataset.
This representation allows us to predict clinical outcomes for the patient by constructing latent
variables x1, . . . , xV indicating the presence or absence of each risk factor v ∈ [1, V] and by defining
y to be the time elapsed from the last discharge summary in the patient's chronology. To accomplish this, we compute the maximum a posteriori (MAP) assignment for each variable xv:

$$\hat{x}_v = \operatorname*{arg\,max}_{x' \in \{0,1\}} \frac{P\left(\hat{\mathbb{R}}, \hat{\mathbb{E}}, \hat{\mathbb{M}}, x_1, \ldots, x_V, y;\, \Theta\right)}{P\left(\hat{\mathbb{R}}, \hat{\mathbb{E}}, \hat{\mathbb{M}};\, \Theta\right)} = \operatorname*{arg\,max}_{x' \in \{0,1\}} \prod_{v=1}^{V} P\left(\hat{x}_v = x'\right) P\left(\hat{R}_{1,\hat{L}}, \ldots, \hat{R}_{V,\hat{L}};\, p_{v,1}, \ldots, p_{v,V}\right) \tag{5.11}$$

In this way, Equation (5.11) allows us to predict whether each risk factor v ∈ [1..V] will be present
or absent given the clinical chronology for the patient according to the latent interaction variables

(Θ) discovered for our dataset. This technique could also be easily extended to predict the presence
or absence of observations between discharge summaries – for example during long gaps in the
patient’s history.

Identifying Irregular Patients

Another potential application of the model arises when one wants to identify patients whose
clinical chronologies are unlike a particular patient population. This allows downstream clinical decision support systems to monitor patients who may present with unusual risk factors or disease progressions. To identify such patients, the model must first be initialized on some dataset
which does not already include the patient (that is, the patient’s EHR must be removed or ignored in
the dataset when learning the latent parameters). Then, let L̂ represent the number of longitudinal
discharge summaries for the target patient p̂. This allows us to define R̂ v,i ∈ {0, 1}V× L̂ to be the risk
factor matrix, M̂u,i ∈ {0, 1}U× L̂ to be the medication matrix, and Êi ∈ R L̂ to be the elapsed time
vector. After sorting the discharge summaries for the patient in ascending chronological ordering
(according to their timestamps), we can set the value of R̂ v,i to 1 when risk factor v was mentioned
in the i-th discharge summary, and 0 otherwise. Likewise, we can set M̂u,i to 1 when medication u
was mentioned in the i-th discharge summary, and 0 otherwise. Finally, we assign to Êi the elapsed
time in days between discharge summary i and the previous discharge summary, i − 1 where Ê1 is
set to 0. This allows us to determine how likely patient p̂’s chronology is by simply computing the
joint probability of that patient’s chronology using Equation (5.10), based on the latent variables
(Θ) discovered from the training corpus.
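A minimal sketch of this screening procedure, assuming the joint probability of Equation (5.10) has already been computed (for example by a hypothetical chronology_log_prob function) for every patient in a reference cohort and for the new patient; the percentile threshold and all numbers below are illustrative assumptions.

import numpy as np

def flag_irregular(cohort_scores, new_score, percentile=5.0):
    # Flag a patient whose chronology log-probability falls below the given
    # percentile of a reference cohort's scores.
    threshold = np.percentile(cohort_scores, percentile)
    return new_score < threshold, threshold

# Hypothetical per-patient log-probabilities under Equation (5.10):
cohort_scores = np.array([-14.2, -12.8, -15.1, -13.4, -16.0, -12.9, -14.7, -13.8])
new_patient_score = -19.3   # e.g. chronology_log_prob(R_hat, M_hat, E_hat)

is_irregular, cutoff = flag_irregular(cohort_scores, new_patient_score)
print(is_irregular, round(cutoff, 2))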

5.2.4 Experimental Results

In our experiments, we relied on the collection of annotated longitudinal EHRs provided in the
2014 shared task on Challenges in Language Processing on Clinical Data sponsored by the 2014
Informatics for Integrating Biology and the Bedside (i2b2) and The University of Texas Health

Table 5.5. Predictive performance for individual risk factors, as well as the micro-average over all risk
factors.11

Risk Factor Acc. PPV FNR FPR TNR TPR F1 TP FP FN TN


Obesity 0.864 1.0 0.941 0.0 1.0 0.058 0.111 1 0 16 101
Hypertension 0.958 0.958 0.0 1.0 0.0 1.0 0.978 113 5 0 0
Diabetes 0.788 0.812 0.115 0.4 0.6 0.885 0.847 69 16 9 24
Hyperlipidemia 0.729 0.663 0.0 0.582 0.482 1.0 0.797 63 32 0 23
CAD 0.746 0.485 0.448 0.191 0.809 0.551 0.516 16 17 13 72
Micro-average 0.794 0.617 0.172 0.221 0.779 0.828 0.707 735 456 153 1606

Science Center at Houston (UTHealth) created with the purpose of fostering the development of
automatic systems for detecting clinical findings, medications, and temporal signals. We re-purpose
this data in order to learn and evaluate our model of clinical histories. That said, for the sake of consistency and reproducibility, we report our performance using the same training and testing partitions given by the i2b2/UTHealth organizers. Note that we also evaluated our model using 10-fold cross validation; for the sake of brevity, these results are not reported in this section because the difference in performance was statistically insignificant (p = 0.04). Using this partitioning, our
training set consisted of EHRs documenting the progression of heart disease for 178 patients, and
our testing set consisted of EHRs for 118 patients.
In order to evaluate the predictions enabled by our model, we cast the problem of predicting
the presence or absence of risk factors as a binary classification problem. However, our evaluation
had to consider that each discharge summary was associated with multiple risk factors. Thus, we
leveraged the experimental methodology used for evaluating multi-label classification problems in
the machine learning community (Tsoumakas and Katakis, 2007). After training the latent variables in our model using the clinical chronologies extracted from the 168 patients in the training
set, we evaluated the accuracy of our model in predicting the risk factors present and absent in the
last discharge summary, given all the preceding discharge summaries for each patient. Specifically,

11We have omitted the predictive performance for the risk factors Smokes and Family History because they rarely
change and thus unduly inflate the micro-average performance of our model.

for each patient n with an EHR containing Ln chronologically ordered discharge summaries, we

used our trained model to predict the presence or absence of each risk factor v ∈ [1, V] given the

chronology in the first (Ln − 1) discharge summaries as well as the amount of time elapsed since the (Ln − 1)-th discharge summary (i.e., En,Ln). Then, we compared the predicted presence or absence of each risk factor against the actual values extracted from the Ln-th discharge summary. Formally, we

considered a predicted risk factor as a true positive (TP) if it was predicted by the model and was

present in the discharge summary, as a false positive (FP) if it was predicted by the model but

was absent in the discharge summary, as a false negative (FN) if it was not predicted by the model

but was present in the discharge summary, and as a true negative (TN) if it was not predicted by the model and was absent in the discharge summary. Table 5.5 presents these results. Overall

performance was high, although certain classes (such as Hyperlipidemia) proved more difficult

than others (e.g., Hypertension). Note that because the F1 -measure considers only true positive

(and not true negative) labels, the performance of the overwhelmingly absent risk factor Obesity is

better assessed by the accuracy measure. Interestingly, despite the entire patient cohort having a

diagnosis of Diabetes, a number of discharge summaries did not contain diagnoses of the disease,

suggesting that either the condition was managed, or not of primary interest to the physician. We

additionally compared our approach against a previously developed system. The baseline system,

reported in Goodwin and Harabagiu (2015), does not represent medications or the elapsed time between successive discharge summaries, and achieved a micro-average predictive accuracy of

only 54.3%. The superior performance achieved by the model outlined in this section demonstrates

the importance of encoding the semantics of the types of interactions present in patients' clinical

chronologies.
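For reference, micro-averaged figures of the kind reported in Table 5.5 are obtained by pooling the per-risk-factor confusion counts before taking ratios, as in the short sketch below (shown here with the TP, FP, FN, and TN counts of the first three rows of Table 5.5).

def micro_average(counts):
    # counts: list of (TP, FP, FN, TN) tuples, one per risk factor. Pools the counts,
    # then computes accuracy, PPV (precision), TPR (sensitivity), TNR (specificity), and F1.
    tp, fp, fn, tn = (sum(c[i] for c in counts) for i in range(4))
    acc = (tp + tn) / (tp + fp + fn + tn)
    ppv = tp / (tp + fp) if tp + fp else 0.0
    tpr = tp / (tp + fn) if tp + fn else 0.0
    tnr = tn / (tn + fp) if tn + fp else 0.0
    f1 = 2 * ppv * tpr / (ppv + tpr) if ppv + tpr else 0.0
    return {"Acc.": acc, "PPV": ppv, "TPR": tpr, "TNR": tnr, "F1": f1}

# (TP, FP, FN, TN) for Obesity, Hypertension, and Diabetes from Table 5.5:
print(micro_average([(1, 0, 16, 101), (113, 5, 0, 0), (69, 16, 9, 24)]))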

5.2.5 Discussion

In addition to the predictive performance, we also explored the latent interactions recovered by

our model. Table 5.6 presents the likelihood that a risk factor w in a discharge summary i will

Table 5.6. Likelihood that each risk factor in a (current) discharge summary will positively predict
each risk factor in a future discharge summary.

Current \ Future Obesity Hypertension Hyperlipidemia Diabetes CAD
Obesity 99.624 98.496 78.947 84.586 62.782
Hypertension 43.522 99.834 76.744 86.711 64.950

Hyperlipidemia 44.776 98.507 99.787 86.994 65.245


Diabetes 42.056 97.570 76.262 99.813 64.486
CAD 41.646 97.506 76.309 86.035 99.751
FamilyHistory 43.160 97.883 76.221 86.971 65.147
Smoker 43.160 97.883 76.221 86.971 65.147

Table 5.7. Likelihood of each medication preventing each risk factor in the immediately following
discharge summary.

Medication \ Risk Factor Obesity Hypertension Hyperlipidemia Diabetes CAD
ARB 56.627 0.602 18.072 9.036 33.133
beta_blocker 56.794 1.742 22.997 12.718 33.449
metformin 46.642 1.493 16.418 1.493 40.672
diuretic 42.188 0.521 19.792 23.438 44.792
aspirin 55.109 1.277 22.445 14.051 31.204
statin 54.696 2.394 19.705 12.707 34.991
sulfonylureas 53.394 1.810 20.362 1.810 38.009
thienopyridine 63.492 3.704 22.222 12.698 22.751


calcium_channel_blocker 52.222 0.370 21.111 9.259 36.667
ACE_inhibitor 54.265 1.659 20.853 12.322 35.071
insulin 54.276 4.276 28.618 0.329 39.145
nitrate 59.917 1.653 21.074 16.529 12.397
thiazolidinedione 37.975 1.266 10.127 1.266 31.646
fibrate 40.909 2.273 2.273 2.273 36.364
niacin 94.118 5.882 5.882 52.941 5.882
ezetimibe 75.000 5.000 5.000 5.000 20.000
anti_diabetes 16.667 16.667 16.667 16.667 16.667

Table 5.8. Likelihood of each medication being prescribed for each risk factor in the same discharge
summary

Medication \ Risk Factor Obesity Hypertension Hyperlipidemia Diabetes CAD
ARB 0.433 1.000 0.819 0.914 0.671
beta_blocker 0.431 0.984 0.770 0.874 0.665
metformin 0.531 0.988 0.840 0.988 0.592
diuretic 0.576 1.000 0.804 0.767 0.551
aspirin 0.447 0.989 0.776 0.861 0.688
statin 0.452 0.977 0.803 0.874 0.649
sulfonylureas 0.463 0.986 0.799 0.986 0.618
thienopyridine 0.364 0.967 0.781 0.876 0.773


calcium_channel_blocker 0.477 1.000 0.791 0.910 0.634
ACE_inhibitor 0.457 0.985 0.793 0.880 0.649
insulin 0.456 0.959 0.714 1.000 0.608
nitrate 0.399 0.987 0.789 0.838 0.880
thiazolidinedione 0.616 1.000 0.909 1.000 0.687
fibrate 0.593 1.000 1.000 1.000 0.648
niacin 0.000 1.000 1.000 0.474 1.000
ezetimibe 0.217 1.000 1.000 1.000 0.826
anti_diabetes 1.000 1.000 1.000 1.000 1.000

positively predict the presence of risk factor v in discharge summary (i + 1) (this corresponds to the latent variable pv,w in Equation (5.7)). As shown by the probabilities in the diagonal, each risk factor is unsurprisingly potent at predicting a recurrence of itself. More interestingly, the presence of any of the heart-disease-related risk factors is a strong predictor for the presence of Hypertension. The difference between the micro-average predictive performance and the correlations shown in Table 5.6 suggests that individual risk factors may not be valuable for predicting other risk factors in isolation; instead, the complete set of risk factors present (and absent) at each discharge summary must be considered.

We also investigated the role each medication has in preventing each risk factor in the imme-

diately following discharge summary, as shown in Table 5.7 (corresponding to the variable qv,u

from Equation (5.8)). Note that these results do not distinguish between situations in which the risk

factor was absent in both adjacent discharge summaries and situations in which the risk factor was

resolved. As observed, Niacin and Ezetimibe were the best predictors of the absence of Obesity.

This shows that modeling the set of medications can improve the ability of the model to negatively

predict future risk factors.

Finally, we analyzed the association between each medication and each risk factor (correspond-

ing to su,v from Equation (5.9)); that is, we present the strength of the recovered probability that

medication u would be prescribed to a patient presenting with risk factor v. These results are listed

in Table 5.8. Unsurprisingly, given that our dataset is a cohort of diabetic patients, each patient

was taking at least one anti-diabetes medication. More interestingly, our model was able to recover

that fibrates, niacins, and ezetimibes are used to treat Hyperlipidemia. Moreover, by normalizing

these values according to the average for each risk factor, and the average for each medication (i.e.,

by calculating the point-wise mutual information), we observed that our model was able to recover

additional interactions, such as Aspirin being prescribed for CAD, and Metformin being associated

with Diabetes.

5.2.6 Lessons Learned

We designed a data-driven probabilistic graphical model of how risk factors and medications interact throughout patients' clinical chronologies. This model operates by first learning the latent interactions

between successive pairs of risk factors and medications using semantically motivated probability

distributions. These latent variables, in turn, allow us to (1) predict the way a new patient's clinical chronology might progress as well as to (2) identify patients whose clinical chronology is

progressing unusually given some cohort of similar patients. We evaluated the individual risk factor

and micro-average performance when predicting how a patient’s risk factors progressed, compared

to the actual risk factors mentioned in their EHR. Experiments demonstrated an accuracy up to

95.8% for a single class, and a micro-average accuracy of 81.6%, illustrating the potential of

our model for predicting personalized patient outcomes from longitudinal EHRs. Moreover, we

presented and analyzed the interactions discovered from the 2014 i2b2/UTHealth collection of

diabetic patients’ EHRs. Future performance may be improved by (1) normalizing the statistical
information informing each latent variable in the model, (2) leveraging larger EHR collections, and
(3) employing more sophisticated inference techniques.

5.3 Jointly Learning to Predict and Cluster12

In this section, we generalize the probabilistic graphical models described in Sections 5.1 and 5.2
to address the four main goals of precision medicine (described at the beginning of this chapter
and enumerated below). The generalized model encodes a large number of general observations
concerning both the clinical picture and the therapy of each patient as documented by multiple
chronologically ordered EHRs. This allows us to [GOAL 1] track how the clinical picture and
therapy changes over time for each patient. Moreover, our model also considers the elapsed time
between successive EHRs for each patient, allowing us to [GOAL 2] identify patients who are
due for preventive visits and screenings by predicting when his or her clinical observations may
change. In order to facilitate more personalized predictions, our model discovers groups of patients
with similar chronologies. This enables us to [GOAL 3] monitor how a particular (e.g., new)
patient’s chronology compares to his or her peers. Of particular importance, is the fact that our
model does not require any pre-specified knowledge about the interactions between observations
of the patient’s clinical picture or therapy; instead it discovers the underlying relationships present
in a dataset. In this way, our model enables a variety of robust, data-driven predictions which can
be exported to [GOAL 4] improve the overall quality of clinical care in practice by informing a
variety of automated clinical decision support systems, such as PUFF (Aikins et al., 1983), HELP
(Kuperman et al., 2012), or APACHE (Knaus, 2002).
Clinical observations are made in EHRs at different times throughout the health management
of a patient. By using the timestamp, or creation time, associated with each EHR we can construct

12Minor revision, with permission, of Travis R. Goodwin, and Sanda M. Harabagiu, A Predictive Chronological
Model of Multiple Clinical Observations, Proceedings of the IEEE International Conference on Healthcare Informatics
(ICHI), 2015. ©2015 IEEE. DOI:10.1109/ICHI.2015.37.

a chronological ordering of the clinical observations associated with each patient, as reported in
Goodwin and Harabagiu (2015). However, the clinical course of a disease continues to progress
between the times when a physician examines the patient and generates a new EHR. Thus, we
developed a model that (1) takes into account the elapsed time between physicians’ notes in the
EHRs of a patient by defining a matrix representation of the elapsed times between successive EHRs
for each patient; and (2) is able to model the patient histories based on a tensor representation of
the clinical observations in each EHR. Using these structures, we defined a probabilistic graphical
model based on a Bayesian extension allowing for a more realistic and robust setting than the
closed-world assumption used in previous models. We present in detail the inference mechanisms
and show how the patient outcomes can be predicted based on the clinical histories inferred by our
model.
The model presented in this section (1) tracks information about multiple clinical observations
over time, (2) considers variations in the elapsed time between successive EHRs when modeling
changes in clinical observations, (3) discovers latent groups of patients whose clinical chronologies
progress similarly, and (4) enables personalized predictions for one or more clinical observations
for a new patient given his or her previous clinical chronology. In order to evaluate our model, we
utilized the set of longitudinal EHRs provided by the organizers of the Challenges in Language
Processing for Clinical Data shared task sponsored by the 2014 Informatics for Integrating Biology
and the Bedside (I2B2) and The University of Texas Health Science Center (UTHealth) described in
Section 5.1.2 on page 117. These EHRs document the progression of heart disease for a population
of diabetic patients, and are particularly well-suited for our model because they were manually
annotated by physicians to explicitly denote the presence of particular risk factors and medications
relevant to diagnosing heart disease.

5.3.1 Related Work and Background

Clinical predictive modeling typically focuses on identifying patients at risk for a particular
disease, based on established criteria specific to that disease. For example, in Amarasingham et al.

(2010), the authors designed a regression model for identifying patients at risk for heart failure

based on established risk factors for heart disease, in Vickers (2011) the authors evaluated a variety

of prediction models for individual types of cancer based on the presence of specific antigens in

the patient's bloodstream, and in Bejan et al. (2012) the authors identified patients with pneumonia

based on mentions of pneumonia in electronic health records. Although these approaches

achieve suitable performance, they are often difficult (or impossible) to adapt to new diseases or

domains. As such, more recent work has tried to advance clinical prediction modeling by jointly

predicting multiple diseases based on general clinical information. For example, in Huopaniemi

et al. (2014), the authors predict a variety of disease subtypes based on mixture modeling. Likewise,

in Wang et al. (2014), the authors designed a joint disease risk prediction model based on logistic

regression.

At the same time, other authors have worked to improve the quality of predictions generated by

these models by tailoring predictions to individual patients. Known as personalized medicine, the

idea is that global disease models have the problem of eclipsing less common information which

might not be important for the entire population but may be critically important for an individual

patient. Indeed, in 2008 the President's Council of Advisors on Science and Technology described

the goals of personalized medicine13:

[personalized medicine has the potential to improve] patient care and disease prevention [...
and] to positively impact two other important trends – the increasing cost of health care and
the decreasing rate of new medical product development. The ability to distinguish in advance
those patients who will benefit from a given treatment and those who are likely to suffer
important adverse effects could result in meaningful cost savings for the overall health care
system. Moreover, the ability to stratify patients by disease susceptibility or likely response to
treatment could also reduce the size, duration, and cost of clinical trials, thus facilitating the
development of new treatments, diagnostics, and prevention strategies.

Personalized predictive modeling attempts to realize these possibilities by learning individual

models for specific patients, or for groups of similar patients. For example, in Ng et al. (2015), the

13https://www.whitehouse.gov/files/documents/ostp/PCAST/pcast_report_v2.pdf

authors used a K-means clustering approach to partition patients into groups based on their diabetic

risk factors, in Overby et al. (2013) the authors segmented a population based on drug-induced

liver injury phenotyping, and in Nagin and Odgers (2010) the authors adapted work on group-based

trajectory modeling for a variety of clinical tasks using mixture modeling.

In order to address both of these emerging trends, we developed a multiple, personalized

observation prediction model based on recent advances in statistical modeling. Specifically, we rely

on a powerful probabilistic framework known as Probabilistic Graphical Models (PGMs) (Koller and

Friedman, 2009) which can be viewed as a generalization of both mixture and regression modeling.

In a graphical model, not only can we encode knowledge about multiple observations at a particular

time, but we can also represent the interactions between observations at different times. Indeed,

certain types of PGMs have been widely used for modeling sequences of observations, particularly

the Hidden Markov Model (Rabiner, 1989) and the Conditional Random Field (Lafferty et al.,

2001). These types of models have been used for an incredible number of sequential prediction

and modeling tasks, such as brain-wave segmentation (Zhang et al., 2001) and causal structure

discovery (Wingate et al., 2009). In addition to sequence modeling, PGMs have also been shown

to be incredibly successful at clustering tasks due to their ability to capture (and discover) latent

similarities between multiple observations. The canonical example of clustering with a PGM is that

of Latent Dirichlet Allocation (LDA) (Blei et al., 2003), in which words in documents are assigned

to latent “topics” based on their context. Due to both their simplicity and empirical performance,

LDA-based models have been applied to an incredible number of problems, such as multi-document

summarization (Arora and Ravindran, 2008) and identifying mRNA regulatory models (Liu et al.,

2010).

By contrast, we leverage both the clustering and sequential modeling capabilities of PGMs

by defining a model of patient’s chronologies which not only assigns patients to latent groups

based on the similarity of their clinical chronologies, but also discovers a sequential model of the

observations of a patient's clinical picture and therapy for each latent group. This allows us to

jointly model multiple clinical observations and generate specific personalized predictions for new
patients based on observations documented in a provided dataset.

5.3.2 The Approach

In order to automatically predict the way a patient’s clinical observations might progress based on
their medical history, we define a probabilistic temporal prediction model. This model operates by
(1) discovering latent trends in the way clinical observations progressed in a provided collection of
patient histories and then (2) applying these latent trends to the chronology of a new patient in order
to predict how his or her clinical findings might progress. As such, because our model is based
on automatically discovered trends in a particular dataset, we will first describe how to pre-process
a collection of longitudinal EHRs by extracting the clinical chronologies and encoding them into
mathematical structures. Then, we will describe a probabilistic model of the clinical chronologies
in a dataset and how that model can be used to both infer latent groups of patients and predict how
the clinical observations for new patients may progress.

Extracting Clinical Chronologies

Given a collection of longitudinal EHRs, we define the following parameters:

N = the number of patients in our dataset

Ln = the number of EHRs associated with patient n in our dataset

V = the number of clinical observations we are modeling

In our case, using the 2014 i2b2/UTHealth dataset, we used a total of N = 128 patients for training our model. Each of these patients was associated with Ln ∈ [3, 5] EHRs, which were chronologically ordered according to their creation dates. Additionally, because each of these EHRs
documents a variety of observations about the clinical picture and therapy of the patient at multiple
times, we limited our model to considering only V = 27 total observations; these observations

include the five clinical findings in Table 5.1 as well as the twenty-two medication types listed in

Table 5.4. Given these parameters, we were able to capture the clinical chronology of all patients

in a dataset by defining two mathematical structures (wherein t refers to the t-th EHR and not the

value of a timestamp):
$$\mathbb{O} = \left\{ O_{n,v,t} \in \{0,1\} \right\}^{N \times V \times L_n} \qquad \mathbb{E} = \left\{ E_{n,t} \in \mathbb{R} \right\}^{N \times L_n} \tag{5.12}$$


where On,v,t is an entry in the 3rd-order observation tensor O which indicates whether the v-th

observation was mentioned in the t-th EHR for patient n (we assigned a value of 1 when v was

mentioned, and 0 otherwise); and En,t is an entry in the elapsed time matrix E which stores the

number of days elapsed between EHR t and the previous EHR, t − 1 for patient n. Note that the

elapsed time for the first EHR for each patient is defined as zero: En,0 = 0. Figure 5.7 illustrates

slices from the observation tensor O and rows from the elapsed time matrix E which show the

clinical chronology for individual patients. As illustrated, both O and E are jagged structures,

meaning that the number of EHRs associated with each patient (n) may vary according to the value

of Ln. In this way, not only have we accounted for the de-identification of EHR timestamps, but

we can directly discover temporal patterns based on the relative time elapsed between successive

EHRs for each patient.

Modeling Patient Histories

Given the mathematical representation of patient histories for a provided dataset, we would like

to discover patterns in how the clinical observations changed over time. Traditionally, disease

progression has been modeled using probabilistic mixture models (Amarasingham et al., 2010;

Vickers, 2011; Huopaniemi et al., 2014; Wang et al., 2014). Although these models have been

shown to be reasonably effective, they cannot directly encode temporal information because they

assume that each observation is statistically independent of all other observations. Likewise, these

models do not directly account for phenotypic variety in the patient population. As such, we


Figure 5.7. Visualization of the Observation Tensor O and Elapsed Time matrix E with slices
shown for individual patients 1, 2, and N.

model our collection of clinical histories using a probabilistic graphical model (PGM) (Koller and
Friedman, 2009), which can be viewed as a generalization of traditional mixture models which
allow us to model dependencies between observations. We designed a probabilistic graphical
model which can not only determine the likelihood of encountering any possible patient history
(and thus the likelihood of any clinical observation at a given time) but will also allow us to
discover latent groups of patients who share similar clinical histories. That is, we would like
to model the intuition that, for an individual patient, we would only like to consider the clinical
histories of “similar” patients when trying to predict how his or her clinical observations might
change, rather than considering all the clinical histories in the dataset equally. A straight-forward

Figure 5.8. A Probabilistic Graphical Model of Patients’ Clinical Histories.

approach to this would be to simply learn a similarity function between all pairs of possible clinical

histories. However, even when considering only 27 observations, this would require modeling all

$\binom{T!\,V^{T_n}}{2} = 1.103 \times 10^{57}$ possible combinations, which is clearly unreasonable. As such, rather than

trying to discover the similarity between any possible pair of patient histories, we instead cluster

patients into a predefined number of latent groups, such that the patients in each group are more

“similar” to each other than to the patients assigned to other groups. In doing so, we must define

an additional model parameter, K, which denotes the number of latent groups we would like to

discover.

Probabilistic Graphical Models, like mixture models, operate on a set of statistical random

variables and allow us to efficiently compute the joint distribution of these variables. For our

purpose, this means representing the joint distribution of all possible clinical histories in the dataset.

To do this, we define a binary random variable for each entry in the observation tensor On,v,t and

define a continuous random variable for each entry in the elapsed time matrix En,t previously defined

for our dataset. We also define a latent group assignment variable, zn , which indicates which of

the K latent groups was assigned to each patient n. Figure 5.8 illustrates this model using standard

plate notation, where each box, or plate indicates that each contained variable is repeated for each

element in the set listed in the bottom-right of that plate. As shown, the observations in each EHR

depend only on the elapsed time, the group assignment, and the observations at the immediately

preceding EHR. In this way, we have introduced the so-called Markov assumption (Rabiner, 1989) which states that, for a given EHR, information beyond the immediately preceding EHR has no impact. Although this assumption may not always be exactly accurate, it allows us to significantly reduce the complexity of our model and, thus, the amount of training data needed to accurately learn patterns of observation progression (Krogh et al., 2001). Moreover, it has been shown that these kinds of approximate models can actually yield better performance than their exact counterparts when the dataset is small (Wainwright, 2006). In our case, this representation allows us to

model all the medical histories present in our dataset in terms of only three shared conditional

probability distributions, which we will define below.

We denote the transition probability of an observation u ∈ O being present given the presence

(or absence) of a preceding observation v ∈ O as

$$P_{\mathrm{trans}}(u \mid v, z) = \frac{F_{\mathrm{trans}}(u, v, z)}{F_{\mathrm{base}}(v, z)} \tag{5.13}$$

where Ftrans(u, v, z) encodes the number of patients in group z whose history contained observation u immediately following observation v, and Fbase(v, z) encodes the number of patients in group

z whose history contained observation v at any time. Thus, Equation (5.13) defines a maximum

likelihood estimate (MLE) of the transition probability based on the proportion of patients in group

z who had an EHR at time i mentioning v who also had an EHR at time i + 1 mentioning u.

In addition to the transition probability, we also define the temporal probability of an observation

v ∈ O being associated with an elapsed time x ∈ E:

$$P_{\mathrm{temp}}(v \mid x) \approx \mathrm{Exponential}(x;\, \lambda_v) = \lambda_v e^{-\lambda_v x} \tag{5.14}$$

where λv is the parameter of the exponential distribution of elapsed times associated with obser-

vation v. Thus, Equation (5.14) defines the elapsed time of a particular EHR given an observation

from that EHR according to an Exponential distribution. We can easily learn this distribution by

Figure 5.9. A Probabilistic Graphical Model of Patients’ Clinical Histories with Bayesian Priors.

calculating the maximum likelihood estimate of λv , which is simply the reciprocal of the average
elapsed time observed over all EHRs mentioning observation v.
Finally, we can define the base probability of an EHR mentioning an observation v as

$$P_{\mathrm{base}}(v \mid z) = \frac{F_{\mathrm{base}}(v, z)}{F_{\mathrm{group}}(z)} \tag{5.15}$$

where Fbase (v, z) is the number of patients in group z which had observation v during their history
and Fgroup (z) is the number of patients assigned to group z. In this way, we again rely on a
maximum likelihood estimate for the base probability of observing each observation v according
to the proportion of patients in the group z whose histories contained that observation.
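Because Equations (5.13) and (5.15) are simple count ratios, they can be computed directly from group assignments and per-EHR observation sets, as in the sketch below; the toy data and variable names are illustrative assumptions. The λv of Equation (5.14) is estimated separately as the reciprocal of the mean elapsed time over EHRs mentioning observation v.

from collections import Counter

# Toy data: each patient is (group z, chronologically ordered list of per-EHR observation sets).
patients = [
    (0, [{"Hypertension"}, {"Hypertension", "CAD"}]),
    (0, [{"Diabetes"}, {"Diabetes", "Hypertension"}, {"Hypertension"}]),
    (1, [{"Obesity"}, {"Obesity", "Diabetes"}]),
]

F_group, F_base, F_trans = Counter(), Counter(), Counter()
for z, history in patients:
    F_group[z] += 1
    seen, seen_pairs = set(), set()
    for t, obs in enumerate(history):
        seen.update((v, z) for v in obs)
        if t > 0:
            seen_pairs.update((u, v, z) for v in history[t - 1] for u in obs)
    F_base.update(seen)          # patients in group z whose history contains v
    F_trans.update(seen_pairs)   # patients in group z with u immediately after v

def p_trans(u, v, z):            # Equation (5.13)
    return F_trans[(u, v, z)] / F_base[(v, z)]

def p_base(v, z):                # Equation (5.15)
    return F_base[(v, z)] / F_group[z]

print(p_trans("CAD", "Hypertension", 0), p_base("Hypertension", 0))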

Bayesian Extension

The model we have described thus far operates according to the so-called closed-world assumption:
that the histories present in our dataset constitute all the possible clinical histories that may ever
occur (Minker, 1982). Clearly, this assumption is not always true. Thus, we relax this assumption
by introducing a number of prior distributions over the variables in our model and assume that
the clinical histories in our dataset were generated according to these prior distributions. Thus, we
define the following additional latent variables capturing these prior distributions:

ψv,k = parameter for the Binomial prior distribution of each observation v in group k

λv,k = parameter for the Exponential prior distribution of the elapsed time asso-

ciated with observation v in group k

θ= parameter for the Multinomial prior distribution of assigning each patient

to a latent group

where ψv,k is simply the base probability of the observation v in group k (i.e., Pbase (v | z = k)),
and θ is a vector of the number of patients assigned to each group (i.e., Fgroup (z = 1), . . . , Fgroup (z =
K)). This allows us to directly encode prior knowledge about each observation by defining an ad-
ditional set of second-order priors – that is, by defining a prior distribution for each of these prior
distributions:

αv, βv = hyper-parameters for the conjugate Beta prior of ψv,k


γv, δv = hyper-parameters for the conjugate Gamma prior of λv,k
η= hyper-parameter for the conjugate Dirichlet prior of θ

We have defined the second-order prior distributions for each of our latent parameters according to
the associated conjugate distribution. For example, ψv,k belongs to a Binomial distribution and the
conjugate prior of a Binomial distribution is the Beta distribution, which has the hyper-parameters
αv and βv . These conjugate prior distributions are widely used in Bayesian decision theory and
machine learning because they significantly reduce the underlying mathematical complexity of
learning the model (Raiffa and Schlaifer, 1961). The hyper-parameters associated with each of
these second-order prior distributions allow us to specify a priori knowledge about the prevalence
of certain observations. For example, we could define αv as the national average of patients with
observation v (e.g., patients with DIABETES) and βv as the national average of patients without
observation v (e.g., patients who don’t have DIABETES). In the absence of a priori knowledge
about an observation, we can simply use a uniform hyper-parameter for the associated distribution.
Figure 5.9 illustrates the final model when the second-order prior distributions have been included.
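To illustrate why conjugacy keeps the mathematics simple, the sketch below shows the standard closed-form posterior updates for the three prior families used here (Beta for ψv,k, Gamma for λv,k, and Dirichlet for θ); the pseudo-counts and observations are invented for illustration.

import numpy as np

def update_beta(alpha_v, beta_v, n_present, n_absent):
    # Beta prior with Binomial observations: add successes and failures to the pseudo-counts.
    return alpha_v + n_present, beta_v + n_absent

def update_gamma(gamma_v, delta_v, elapsed_times):
    # Gamma prior (shape, rate) with Exponential observations: add the number of
    # observations to the shape and their sum to the rate.
    return gamma_v + len(elapsed_times), delta_v + sum(elapsed_times)

def update_dirichlet(eta, group_counts):
    # Dirichlet prior with Multinomial (group-assignment) counts.
    return np.asarray(eta, dtype=float) + np.asarray(group_counts, dtype=float)

print(update_beta(2.0, 8.0, n_present=37, n_absent=63))        # (39.0, 71.0)
print(update_gamma(1.0, 30.0, elapsed_times=[145, 458, 316]))  # (4.0, 949.0)
print(update_dirichlet([1, 1, 1, 1], [40, 35, 30, 23]))        # [41. 36. 31. 24.]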

Learning Latent Groups of Patients

Before we can use our model to predict the outcomes for new patients, we must first infer the latent
group assignments for each patient in our dataset. We can infer the latent group assignment for an
individual patient by determining which value of zn was most likely to have generated the patient’s
history (On,1:V,1:Tn and En,1:Tn ). Although these latent group assignments can be estimated using any
probabilistic inference technique, we utilized a variation of collapsed Gibbs sampling (Liu, 1994)
which allows us to easily and quickly estimate not only the optimal group assignments, but also
the parameters of the latent prior distributions associated with each observation and elapsed time
variable (i.e., ψv,k and λv,k , respectively). In general, a Gibbs sampler can sample from an arbitrary
joint distribution by first initializing each variable by randomly sampling from its prior distribution,
and then by iterating for the desired number of samples and independently sampling a new value for
each variable from the conditional distribution of that variable given all the previous assignments
to all other variables. In this way, the Gibb’s sampler is a type of Markov chain Monte Carlo
technique because it effectively performs a random walk through all possible joint assignments,
where each sample is dependent only on the previous sample (i.e., the sequence of samples form
a Markov chain), and the likelihood of each assignment approaches the actual probability of that
assignment in the joint distribution as you increase the number of samples (i.e., follows a Monte
Carlo simulation).
For the probabilistic graphical model that we have defined, we need to calculate the conditional
probability of assigning the latent group zn to each patient, given that patient’s history (On,1:V,1:Tn
and En,1:Tn) and the parameter θ. Fortunately, we can utilize both Bayes' rule and the conditional
independence assumptions present in our model to easily compute this probability using:

P(zn | On,1:V,1:Tn, En,1:Tn, θ) ∝ P(zn | θ) × ∏_{t=1}^{Tn} [ ∑_{v=1}^{V} P(zn | On,v,t) × ∑_{v=1}^{V} P(zn | En,v,t) ]    (5.16)

where P(zn | On,v,t) is simply the proportion of patients with the observation On,v,t in group zn
compared to the patients with the observation On,v,t in any group, and P(zn | En,v,t) is the ratio of the

Algorithm 5.1 SAMPLE: Gibbs sampling for the patient chronology model.
Initialize latent variables:
 1: for v ∈ [1..V] do
 2:   for k ∈ [1..K] do
 3:     draw ψv,k^(0) ∼ Beta(αv, βv)
 4:     draw λv,k^(0) ∼ Gamma(γv, δv)
 5:   end for
 6: end for
 7: draw θ^(0) ∼ Dirichlet(η)
 8: for n ∈ [1..N] do
 9:   draw zn^(0) ∼ Multinomial(θ^(0))
10: end for
11: initialize all counts (Fgroup, Ftrans, Fbase) using z1:N^(0)
Sample G joint assignments:
12: for g ∈ [1..G] do
13:   for n ∈ [1..N] do
14:     update all counts by removing zn^(g−1)
15:     draw zn^(g) ∼ P(zn | On,1:V,1:Tn, En,1:Tn, θ^(g−1))
16:     update all counts by adding zn^(g)
17:   end for
18:   update θ^(g), λ1:V,1:K^(g), and ψ1:V,1:K^(g) according to z1:N^(g)
19:   yield z1:N^(g) as the g-th joint sample
20: end for

exponential probability of En,v,t given λv,zn compared to the total exponential probability of En,v,t
over all λv,1:K , using the exponential distributions defined for Equation (5.14).
Using the marginal distribution for zn given in Equation (5.16), we can easily define a collapsed
Gibb’s sampler for our model. Specifically, we can draw G samples from the joint distribution of
our model as shown by Algorithm 5.1. In this way, we initialize the group assignments by first
sampling θ from the Dirichlet prior on η, and then sample a group assignment for each patient from
θ. After we’ve assigned an initial group to each patients, we compute the counts Fgroup , Ftrans ,
and Fbase using their maximum likelihood estimates (described when introducing Equations (5.13)
to (5.15). Then, for however many samples we are interested in, G, we sample a joint assignment
(that is, an assignment to every patient in the population) by (1) removing that patient from the
counts of its previous assignment, (2) estimating the likelihood of assigning that patient to each

Table 5.9. Micro-average predictive performance for various numbers of latent patient groups (K).

K Time Acc. PPV FNR FPR TNR TPR F1 TP FP FN TN


1 n/a 0.654 0.601 0.05 0.651 0.348 0.95 0.736 285 189 15 101
2 5.00 0.784 0.731 0.09 0.344 0.655 0.91 0.811 273 100 27 190
4 10.01 0.794 0.731 0.05 0.358 0.641 0.94 0.823 283 104 17 186
6 15.07 0.816 0.789 0.13 0.241 0.759 0.87 0.829 262 70 38 220
8 20.13 0.768 0.727 0.13 0.338 0.662 0.87 0.792 261 98 39 192
10 25.48 0.778 0.785 0.22 0.221 0.779 0.78 0.781 233 64 67 226

group, (3) sampling a new group assignment from those likelihoods, and finally (4) updating the

counts according to the new group assignment. After collecting a certain number of samples –

in our case G = 100, 000 – we can estimate an optimal group assignment for each patient n by

determining the most common assignment (the mode) of zn for all the G sampled assignments.

Additionally, by updating all the counts at each iteration, we have essentially pre-computed the

base, transition, and temporal probabilities for each group (defined by Equations (5.13) to (5.15)),

which will allow us to quickly and easily predict patient outcomes for new patients.
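
To make the sampling procedure concrete, the following Python sketch (a simplification, not our actual implementation) mirrors the loop of Algorithm 5.1; the helpers conditional_group_probs (standing in for Equation (5.16)), add_to_counts, and remove_from_counts are hypothetical placeholders, and initialization is reduced to uniform random group assignments rather than draws from the prior distributions.

import numpy as np

def gibbs_sample(patients, K, G, conditional_group_probs,
                 add_to_counts, remove_from_counts, rng=None):
    # Simplified collapsed Gibbs sampling loop (cf. Algorithm 5.1).
    rng = rng or np.random.default_rng()
    z = rng.integers(0, K, size=len(patients))      # initial group assignments
    for n, patient in enumerate(patients):
        add_to_counts(patient, z[n])

    samples = []
    for g in range(G):
        for n, patient in enumerate(patients):
            remove_from_counts(patient, z[n])       # forget the current assignment
            probs = np.asarray(conditional_group_probs(patient))   # Eq. (5.16)
            z[n] = rng.choice(K, p=probs / probs.sum())
            add_to_counts(patient, z[n])            # record the new assignment
        samples.append(z.copy())                    # the g-th joint sample
    return samples

# The optimal assignment for patient n is then the mode of z_n across the samples
# (after any burn-in and skip-size are applied).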

Predicting Patient Outcomes from their Histories

By using the described Gibb’s sampler to learn the optimal group assignment for each patient

in our dataset, we have (as an intermediary step) also learned the base, temporal, and transition

probabilities of each observation for every latent group, as determined by the patient histories in

our dataset. This will ultimately allow us to predict the clinical outcomes for a new patient by

determining the likelihood of any observation given the new patient’s history. To enable such a

prediction, we must perform three steps: (1) encode the patient's history using binary random

variables so that we can leverage our probabilistic model, (2) use our model to assign a latent

group to the patient based on his or her medical history, and (3) use the transition, temporal, and

base probabilities associated with that latent group to predict how the patient’s observations may

progress.

We can encode the clinical history for a new patient x in a similar manner to the way we
represented the clinical histories pertaining to the original set of patients in our dataset. Let L
represent the number of longitudinal EHRs for x. This allows us to define X ∈ {0, 1}V×L to be
the observation matrix and Y ∈ RL to be the elapsed time vector for patient x. After sorting the
EHRs for the patient in ascending chronological ordering (according to their timestamps), we can
set the value of Xv,t to 1 when observation v was mentioned in EHR t and 0 otherwise. Likewise,
we can set Yt as the elapsed time in days between EHR t and the previous EHR, t − 1 where Y1 is
set to 0. In this way, we have defined the observation matrix and elapsed time vectors in the same
way that we defined each slice of the observation tensor and each row of the elapsed time matrix
for the patients in our dataset.
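
The following sketch illustrates one way this encoding could be computed, assuming each EHR is represented as a (timestamp, set of observation identifiers) pair; the data layout and function name are illustrative rather than taken from our implementation.

import numpy as np

def encode_history(ehrs, V):
    # ehrs: list of (timestamp, observation_ids) pairs for one patient,
    #       where timestamps are datetime objects (an assumption of this sketch).
    ehrs = sorted(ehrs, key=lambda ehr: ehr[0])     # ascending chronological order
    L = len(ehrs)
    X = np.zeros((V, L), dtype=np.int8)             # X[v, t] = 1 if v is mentioned in EHR t
    Y = np.zeros(L)                                 # Y[t] = days elapsed since EHR t - 1 (Y[0] = 0)
    for t, (timestamp, observations) in enumerate(ehrs):
        for v in observations:
            X[v, t] = 1
        if t > 0:
            Y[t] = (timestamp - ehrs[t - 1][0]).days
    return X, Y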
In order to use our model to predict the most likely group assignment, we simply introduce a
new latent variable z x and determine the maximum a posteriori (MAP) assignment to z x given X
and y.
ẑx = arg max_{z′} P(z′ | X, y)    (5.17)

In this way, we can leverage Equation (5.16) to assign the patient to the latent group which was most
likely to have generated the observation sequence X and elapsed time vector y. This optimal group
assignment allows us to predict clinical outcomes for the patient by constructing a latent variable
(w) indicating the presence or absence of any of the possible observations in V given the previous
observations in his or her medical history (X1:V,L ) and the number of days elapsed since these
observations, h. To accomplish this, we compute a second MAP assignment, this time to variable
w:
ŵ = arg max_{w∈{0,1}} Pbase(w | ẑn) × ∑_{v=1}^{V} Ptrans(w | Xv,L, h, ẑn)    (5.18)
In this way, Equation (5.18) allows us to predict whether any clinical observation we have modeled
will be present or absent at some point in the future for a patient, according to the patterns of how
the same observations changed for the patients in the same latent group. Note that our previous
Markov assumption allows us to easily predict the presence or absence of multiple observations by

Figure 5.10. Distribution of clinical finding observations in the training and testing sets.


simply computing the most likely value for each observation individually. This technique could

also be easily extended to predict the presence or absence of observations between EHRs – for

example during long gaps in the patient’s history.
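
A schematic sketch of these two MAP steps is given below; group_posterior, p_base, and p_trans are hypothetical stand-ins for the learned group, base, and transition distributions, and the code scores the presence or absence of a single target observation u.

def predict_observation(X, Y, h, u, K, V, group_posterior, p_base, p_trans):
    # Eq. (5.17): MAP latent group assignment for the new patient.
    z_hat = max(range(K), key=lambda k: group_posterior(k, X, Y))

    # Eq. (5.18): score w = 0 (absent) versus w = 1 (present) for observation u.
    def score(w):
        return p_base(w, u, z_hat) * sum(
            p_trans(w, u, X[v][-1], h, z_hat) for v in range(V))

    w_hat = max((0, 1), key=score)
    return z_hat, w_hat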

5.3.3 Experimental Results

Experimental Setup

In our experiments, we utilized the collection of annotated longitudinal EHRs provided in the

2014 shared task on Challenges in Language Processing on Clinical Data sponsored by the

Informatics for Integrating Biology and the Bedside (i2b2) and The University of Texas

Health Science Center at Houston (UTHealth). This dataset was designed to foster research in

automatically detecting clinical findings, medications, and temporal signals in EHRs; however, in

this section, we re-purpose it in order to learn and evaluate our model of clinical histories. As

such, in the name of consistency, we utilized the same training and testing partitions given by

Table 5.10. Predictive performance for individual observations using K = 6 latent patient groups.

Observation Acc. PPV FNR FPR TNR TPR F1 TP FP FN TN


Obesity 0.864 1.0 0.941 0.0 1.0 0.058 0.111 1 0 16 101
Hypertension 0.958 0.958 0.0 1.0 0.0 1.0 0.978 113 5 0 0
Diabetes 0.788 0.812 0.115 0.4 0.6 0.885 0.847 69 16 9 24
Hyperlipidemia 0.729 0.663 0.0 0.582 0.482 1.0 0.797 63 32 0 23
CAD 0.746 0.485 0.448 0.191 0.809 0.551 0.516 16 17 13 72
ARB 0.975 n/a 1.0 0.0 1.0 0.0 n/a 0 0 3 115
beta_blocker 0.136 0.136 0.0 1.0 0.0 1.0 0.239 16 102 0 0
metformin 0.975 n/a 1.0 0.0 1.0 0.0 n/a 0 0 3 115
diuretic 0.890 0.667 0.857 0.010 0.990 0.143 0.235 2 1 12 103
aspirin 0.873 0.873 0.0 1.0 0.0 1.0 0.932 103 15 0 0
statin 0.017 0.017 0.0 1.0 0.0 1.0 0.033 2 116 0 0
sulfonylureas 0.983 n/a 1.0 0.0 1.0 0.0 n/a 0 0 2 116
thienopyridine 0.636 n/a 1.0 0.0 1.0 0.0 n/a 0 0 43 75
calcium_channel_blocker 0.602 1.0 0.870 0.0 1.0 0.130 0.230 7 0 47 64
ACE_inhibitor 0.449 0.788 0.690 0.206 0.794 0.310 0.444 26 7 58 27
insulin 0.992 1.0 0.167 0.0 1.0 0.833 0.909 5 0 1 112
nitrate 0.847 n/a 1.0 0.0 1.0 0.0 n/a 0 0 18 100
thiazolidinedione 0.975 n/a 1.0 0.0 1.0 0.0 n/a 0 0 3 115
DPP4_inhibitors 1.0 n/a n/a 0.0 1.0 n/a n/a 0 0 0 118
fibrate 0.992 n/a 1.0 0.0 1.0 0.0 n/a 0 0 1 117
niacin 1.0 n/a n/a 0.0 1.0 n/a n/a 0 0 0 118
ezetimibe 1.0 n/a n/a 0.0 1.0 n/a n/a 0 0 0 118
anti_diabetes 1.0 n/a n/a 0.0 1.0 n/a n/a 0 0 0 118

the i2b2/UTHealth organizers. In this way, our training set consisted of 790 EHRs documenting

the clinical histories of 178 patients, and our testing set consisted of 514 EHRs documenting the

clinical histories of 118 patients. In order to evaluate the predictions enabled by our model, we

cast the prediction problem as a classification problem. However, unlike traditional classification

problems, where each instance (in our case an EHR for a patient) is associated with a single class

label, we instead must consider multiple class labels where each possible observation is associated

with a label indicating whether it is predicted as present or absent for that EHR. In this way, we can

evaluate the performance of the predictions enabled by our model by employing the same evaluation

methods used for other multi-label classification problems (Tsoumakas and Katakis, 2007). As

such, in our evaluations we trained our model using the clinical histories extracted from the i2b2

training set. Likewise, we evaluated our model using the clinical histories extracted from the i2b2

testing set. Specifically, for each patient with L chronologically ordered EHRs, we used our model

to predict the presence or absence of each observation given the history in the first (L − 1) EHRs,

and the elapsed time between the (L − 1)-th and L-th EHR. Then, we compared these predicted

observation values to the actual observations extracted from the L-th EHR. As such, for each

patient we considered each observation as a true positive (TP) if it was predicted by the model and

mentioned in the EHR, as a false positive (FP) if it was predicted by the model but not mentioned

in the EHR, as a false negative (FN) if it was not predicted by the model but was mentioned in the

EHR, and as a true negative (TN) if it was not predicted by the model and was not mentioned in the
EHR. In this way, we were able to directly measure the Accuracy (Acc., (TP+TN)/(TP+FP+FN+TN)); Positive
Predictive Value (PPV, also known as Precision, TP/(TP+FP)); False Negative Rate (FNR, also known as
the miss rate, FN/(FN+TP)); False Positive Rate (FPR, also known as the fall-out, FP/(FP+TN)); True Negative
Rate (TNR, also known as Specificity, TN/(FP+TN)); True Positive Rate (TPR, also known as the hit
rate or Recall, TP/(TP+FN)); and the F1 Measure (2TP/(2TP+FP+FN)).
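
For reference, the small helper below (illustrative, not our evaluation code) computes these metrics from the confusion counts; applying it to the HYPERTENSION row of Table 5.10 reproduces the reported values.

def metrics(tp, fp, fn, tn):
    # Returns the evaluation measures defined above; undefined ratios yield NaN.
    return {
        "Acc": (tp + tn) / (tp + fp + fn + tn),
        "PPV": tp / (tp + fp) if tp + fp else float("nan"),
        "FNR": fn / (fn + tp) if fn + tp else float("nan"),
        "FPR": fp / (fp + tn) if fp + tn else float("nan"),
        "TNR": tn / (fp + tn) if fp + tn else float("nan"),
        "TPR": tp / (tp + fn) if tp + fn else float("nan"),
        "F1": 2 * tp / (2 * tp + fp + fn) if 2 * tp + fp + fn else float("nan"),
    }

# Example: the HYPERTENSION row of Table 5.10 (TP=113, FP=5, FN=0, TN=0)
print(metrics(tp=113, fp=5, fn=0, tn=0))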

We first compared our approach against the baseline system reported in Goodwin and Harabagiu

(2015), wherein clinical observations are modeled without latent groups and without encoding

temporal offset information. Using 10-fold cross validation, our model achieved an average predic-

tive accuracy of 81.6%, while the baseline achieved an average predictive accuracy of only 54.3%.

This demonstrates the impact of considering temporal information.

The Impact of The Number of Latent Groups

The predictions enabled by our model depend on which latent group each patient was assigned to.

As such, as we increase the number of latent groups, K, that our model discovers, we increase not

only how personalized the predictions become, but also the complexity of the underlying probability

distribution. In order to evaluate the impact of the parameter K on performance, we determined

the micro-average performance across all observation types when using different numbers of latent

groups, K. Table 5.9 presents these results, where Time14 documents the elapsed time in hours
needed to collect 1,000 samples using Gibbs sampling with a burn-in15 of 1,000 and skip-size16
of 100 (thus, a total of 101,000 iterations are performed). Clearly, increasing the number of latent
groups significantly increases the time taken to perform a fixed number of iterations of Gibbs
sampling: when K = 10, it takes 25 hours to complete Gibbs sampling, whereas when K = 6,
it takes only 15 hours. When K = 1, no Gibbs sampling is needed as no group assignments are
necessary. As K approaches 6, the performance, both in terms of precision (PPV) and F1-measure,
increases consistently, with the best performance (PPV=0.789 and F1=0.829) achieved using K = 6.
Note that as K surpasses 6, the performance begins to decrease, with the performance of K = 8
and K = 10 being worse than for K = 2 clusters. This suggests that for our training set, K = 6 is
the optimal number of latent groups. Because our model learns different transition, temporal, and
base distributions for each latent group, as the number of latent groups increases, fewer and fewer
patients are assigned to each group. Thus, we believe that the decrease in performance with larger
values of K indicates that the size of our training set is not sufficient to learn more than six latent
groups.

Per-Observation Performance Evaluation

In addition to the (micro-) average performance, we also evaluated the performance for each
individual label in our dataset using K = 6 latent groups. Table 5.10 shows the predictive
performance for the five clinical findings listed in Table 5.1 as well as the medication types given
in Table 5.4. While we are primarily interested in the predictions for the five clinical findings in the
top of the table, we include the performance for medications because they highlight trends in the

14Training was performed using a single process at 1197 MHz.

15Burn-in refers to the practice of discarding the first X samples, owing to the fact that the Gibbs sampler becomes
more accurate with each iteration.

16Skip-size refers to the practice of considering only every Y-th sample due to the fact that the samples produced
by the Gibbs sampler depend on the previous sample and thus are not independent.

way our model forms latent patient groups. The highest Accuracy (Acc.) is achieved for the

HYPERTENSION observation, at 95.8%, followed by OBESITY at 86.4%. The lowest accuracy

was obtained by HYPERLIPIDEMIA and CORONARY ARTERY DISEASE (CAD) at 72.9%

and 74.6%, respectively. Interestingly, the F1 -measure reveals a significantly different distribution

of performance: OBESITY drops from second-best to worst performance, with an F1 score of

only 11.1%. The reason for this is explained by looking at the true positive and false negative

values: only 1 of the 17 present observations of OBESITY was correctly identified by our system

while all 101 negative or absent instances were correctly predicted. In order to hypothesize

why certain observations have better performance than others, we considered the distribution of

observations in our training and testing datasets, as illustrated in Figure 5.10. This distribution

provides some insight on why HYPERTENSION achieves the best F1 -measure while OBESITY

experiences the worst F1 -measure: HYPERTENSION occurs five times more often in the training

set than OBESITY. Moreover, consider that 14 of the 18 medication types listed are all assigned

the same value (present, or absent) in all EHRs. This suggests that the latent groups in our model

capture patients who share observations whose value does not change throughout their medical

history. These “stationary” observations can be viewed as shared traits describing each latent

group, such as “Diabetic patients taking Beta Blockers” or “patients taking Angiotensin II Receptor

Blockers (ARBs) with Hypertension”, illustrating the potential of our model for stratifying the

patient population by discovering similarities not only in their treatments, but also in their diseases.

5.3.4 Lessons Learned

We designed a data-driven probabilistic graphical model of how patients’ clinical observations

progress over time. Our model operates by first clustering patients into latent groups whose clinical

histories are similar, and then uses these groups to enable personalized temporal predictions for new

patients. We evaluated the per-observation and micro-average performance when predicting how a

patient’s clinical observations progressed, compared to the actual clinical observations mentioned

in their EHRs. Experiments demonstrated an accuracy up to 95.8% for a single class, and a micro-
average accuracy of 81.6%, illustrating the potential of our model for predicting personalized patient
outcomes from longitudinal EHRs. Future performance can be improved by (1) incorporating prior
knowledge about observations and (2) leveraging larger datasets of EHRs.

5.4 Summary

In this chapter, we described three different approaches to modeling temporal information from
longitudinal EHRs. We showed how these models could be used to not only predict clinical findings,
observations, and risk factors in future EHRs for each patient, but also how these models can be
used to infer temporal interactions and to induce latent sub-populations. Our experimental results
indicate not only the power of probabilistic graphical models for inferring temporal information
from EHRs, but also the promise of such methods for improving the ability of medical question
answering and patient cohort retrieval systems to consider temporal information. Moreover, we
believe that temporal models are an important step towards enabling automatic systems to realize the
potential of precision medicine.

CHAPTER 6

ACCOUNTING FOR MISSING OR UNDERSPECIFIED INFORMATION

In traditional information retrieval (IR) settings, the unit of information requested by the user may
be directly indexed by the system (as described in Chapter 1). For example, consider the setting of
web search, wherein the user of the IR system is interested in discovering web pages relevant to a
given query. The unit of information, in this setting, is an individual web page, and the role of the
IR system is to retrieve and rank web pages for the user. In the web search setting, the web pages
can be directly indexed and retrieved by the IR system.
Unfortunately, it is not always possible to directly index the unit of information requested by
the user of an IR system. In the case of patient cohort retrieval (PCR) systems, for example, the
unit of retrieval desired by the user is typically an individual patient: the role of the PCR system is
to identify and rank patients based on a given description of a patient cohort. Unlike web pages,
patients cannot be directly processed or indexed by an IR system. Instead, PCR systems index a
proxy of the patients they retrieve and rank: electronic health records (EHRs). While the distinction
between the unit of retrieval (the patient) and the information indexed by the system (EHRs) may
appear to be unnecessarily pedantic, by understanding the distinction we are able to improve the
quality of automatically retrieved patient cohorts. Consider that “EHRs have been shown to suffer
from myriad idiosyncrasies (Weiner, 2011; Hersh, 2012), chief among them the prevalence of
missing (Smith et al., 2005), inconsistent, or underspecified data (O'Malley et al., 2005; Berlin and
Stang, 2011)” (Goodwin and Harabagiu, 2017). While these idiosyncrasies may appear to limit the success
of PCR systems operating on EHRs, because the unit of retrieval is the patient, and not his or her
EHRs, we can overcome these idiosyncrasies by observing that it may be possible for a patient to
be relevant to a cohort even when his or her EHRs are not.
In this chapter, we explore two scenarios wherein underspecified or missing information can be
inferred or recovered automatically using the Temple University Hospital EEG Corpus described
in Chapter 4. In Section 6.1 we present a deep learning model for inferring the impression of an

EEG report – whether the EEG indicates normal or abnormal brain activity – when the impression

is underspecified in an EEG report. Section 6.2 extends this notion by considering the problem

of recovering the natural language content of an entire section of an EEG report – the clinical

correlation section.

6.1 Inferring Unspecified Information1

In this section, we present a novel model for automatically inferring underspecified information

from EHRs (Schlangen et al., 2003). While traditional medical informatics approaches rely on

specific, pre-specified features to predict information, our model harnesses the power of textual

data and deep learning to automatically extract features from EHRs while simultaneously predicting

the underspecified information. We present our model for the motivating task of recovering

underspecified over-all impressions (normal vs. abnormal) from electroencephalogram (EEG)

reports. In each EEG report, we removed the impression section written by the neurologist and

trained our model to infer the over-all impression from the remaining content in the report.

Inferring the over-all impression from EEG reports is a challenging problem because the over-

all impression is informed by the neurologist’s “Subjective interpretation” (American Clinical

Neurophysiology Society et al., 2006) of the EEG recording as well as his or her neurological

expertise and accumulated experience. In fact, it has been shown that the inter-interpreter agreement

between neurologists is only moderate (Gerber et al., 2008). Consequently, automatically inferring

the over-all impression requires accounting for the role of neurological knowledge and experience.

The deep learning model we present in this section is able to automatically infer such knowledge

by processing the natural language within EEG reports. Specifically, our model operates in three

steps:

1Minor revision, with permission, of Travis R. Goodwin, and Sanda M. Harabagiu, Deep Learning from EEG Re-
ports for Inferring Underspecified Information, Proceedings of the American Medical Informatics Association (AMIA)
Joint Summits on Translational Bioinformatics (TBI) and Clinical Research Informatics (CRI), 2017. PMID:28815118.

Step 1: word-level features are automatically extracted based on their context by incorporating
the skip-gram model popularized by the Word2Vec framework (Mikolov and Dean, 2013;
Guthrie et al., 2006);

Step 2: report-level features are automatically extracted using either (i) a deep averaging network
(DAN) (Iyyer et al., 2015), or (ii) a recurrent neural network (RNN) (Ba et al., 2015); and

Step 3: the most likely over-all impression is predicted from word- and report-level features through
densely-connected “deep” neural layers.

Our experimental results against a number of competitive baselines show the promise of our model.
Moreover, because our model learns to extract features automatically rather than relying on hand-
crafted features capturing specific aspects of EEG reports or their over-all impressions, we believe
that our approach may be used or enhanced to infer other types of missing information.

6.1.1 Previous and Related Work

A review of recent literature showed that the most common approach to handling missing or
underspecified information is to either ignore it, or simply use the information provided by the
most similar report (Sarkar and Leong, 2001). Unfortunately, this approach (known in the machine
learning community as approximate nearest neighbor) suffers from two major problems: (1) it
requires an accurate and complete metric for measuring the similarity between two patients or two
EHRs, and (2) it often produces information which is not consistent with the original report (Arya
et al., 1998). By contrast, the model we propose in this section is able to recover underspecified
information by examining the content of the report and, thus, is able to produce more consistent
information.
To our knowledge, our deep learning model is the first reported architecture for automatically
extracting high-level features from EEG reports. However, a number of neural architectures have
been previously proposed for extracting high-level features from natural language in general. The

Word2Vec (W2V) software produced by Google provides two mechanisms for learning high-level

feature representations of words: (i) the skip-gram model and (ii) the continuous bag-of-words

(CBOW) model (Mikolov and Dean, 2013). Although both models have been shown to achieve

high performance in a number of natural language processing applications, the CBOW model has

been shown to require significantly more data than the skip-gram model in order to learn meaningful

representations (Mikolov and Dean, 2013; Guthrie et al., 2006). Beyond W2V, the Global Vectors

for Word Representation (GloVe) software provided by Stanford also learns word-level feature

representations (Pennington et al., 2014). While W2V learns the best representation of a word

for predicting its context, GloVe learns a word representation through dimensionality reduction

over the co-occurrence counts obtained from a document collection. In general, both GloVe and

W2V have been shown to produce useful feature-representations of words in a variety of clinical

applications (Choi et al., 2016; Kilicoglu et al., 2015). In this section, we consider the skip-gram

model rather than CBOW or GloVe because it requires the least amount of training data and has

the lowest computational complexity.

One of the major promises of deep learning is the ability to consider complex, sequential

information, such as the order of words in an EEG report. This is typically accomplished by using

recurrent neural networks (RNNs). Unfortunately, RNNs have also been shown to struggle with

long documents (Iyyer et al., 2015) and to have difficulties accounting for long-distance interactions

between words (Hochreiter and Schmidhuber, 1997). Consequently, we have also considered the

Deep Averaging Network (DAN) proposed by Iyyer et al. (2015). While both RNNs and DANs can

be used to learn the effect of semantic composition between words in an EEG report, the DAN has

the advantage of reduced computational complexity and can more easily represent long-distance

interactions. In this section, we evaluate both approaches for learning report-level features from

EEG reports.

INTRODUCTION: The EEG was performed using the standard 10/20 electrode placement system
with an EKG electrode and anterior temporal electrodes. The EEG was recorded during wakefulness
and photic stimulation, as well as hyperventilation, activation procedures were performed.

MEDICATIONS: Depakote ER

HISTORY: A 21-year-old man with a history of seizures since age 15. Has had five episodes
since 2005, all tonic-clonic seizures with loss of consciousness lasting one to two minutes and postictal
confusion.

DESCRIPTION: The EEG opens to a well-formed 9 to 10Hz posterior dominant rhythm,


which is symmetrically reactive to eye opening and eye closing, There is a normal amount of frontal
central beta rhythm seen. The recording is only seen during wakefulness and he has normal response to
hyperventilation and photic stimulation.

IMPRESSION: Normal EEG in wakefulness.

CLINICAL CORRELATION: This awake EEG is normal. Please note that a normal EEG
does not exclude the diagnosis of epilepsy.

(a) EEG report with an over-all impression of NORMAL.

INTRODUCTION: Digital video EEG is performed at bedside using standard 10-20 system of
electrode placement with 1 channel of EKG. The patient is agitated

MEDICATIONS: Keppra.

HISTORY: An elderly woman with change in mental status, waxing and waning mental status,
COPD, morbid obesity, and markedly abnormal EEG. Digital 3EG was done on June 27, 2011.

DESCRIPTION: Much of the EEG includes muscle artifact. When she Is cooperative, there
is a theta pattern with bursts of frontal delta. Muscle artifact is remarkable when the patient becomes
a bit more agitated. As she goes off to sleep, the deltas slowed considerably. There are handful of
triphasic waves noted. Heart rate 84 BPM.

IMPRESSION: This is an abnormal EEG due to 1. Prominent versus frontally predominant


rhythmic delta. 2 Excess beta. 3. Excess theta.

(b) EEG report with an over-all impression of ABNORMAL.

Figure 6.1. Examples of EEG reports with specified and underspecified over-all impressions.

Figure 6.1 continued

INTRODUCTION: Digital video EEG is performed at the bedside using standard 10-20 system of
electrode placement with one channel of EKG. The patient is sitting out of her bed. She is very confused
and poorly cooperative.

MEDICATIONS: Keppra, Aricept, Senna, Aricept, ASA, famotidine

HISTORY: 84-year-old woman of unknown handedness with advanced dementia, failure to


thrive, change in mental status, TIA, dementia.

DESCRIPTION: As the tracing opens, the patient has a lot of muscle activity. She seems to
have facial twitching and grimacing and it almost looks like she has a suck or snout reflexes. Although
the patient does not appear to interact with the physician in any way, this produces an alerting response
with an increase in 5-7 hertz theta activity in the background. The overall background is 1 of shifting
asymmetries with theta from side as with beta sometimes better represented on either side, shifting
arrhythmic delta and intermittent, subtle attenuations in the background. Following admission of the
Ativan, the EEG becomes somewhat more discontinuous.

IMPRESSION: This EEG is similar to the 2 previous studies this year which demonstrated a
slow background. Each recording seems to demonstrate an increase in slowing. The administration of
Ativan produced a somewhat discontinuous pattern as may be anticipated in a patient with advanced
dementia.

CLINICAL CORRELATION: No epileptiform features were seen.

(c) EEG report with an underspecified impression.

6.1.2 Inferring the Over-all Impression of EEG Reports with Deep Learning

When writing an EEG report, the neurologist typically documents their over-all impression of
the EEG: whether it indicates normal or abnormal brain activity. However, this information is
not always explicitly stated in the impression section of an EEG report and must sometimes be
inferred by the reader. Figure 6.1 illustrates three EEG reports indicating (a) an over-all impression
of NORMAL, (b) an over-all impression of ABNORMAL, and (c) an underspecified over-all
impression. Note, in Figure 6.1, we have normalized the order and titles of the sections in each
EEG report; in reality, however, we observed a total of 1,176 unique section titles in our collection.
When producing an over-all impression, the neurologist interprets the EEG signal as well as the

patient’s clinical history, medications, and the setting of the EEG. For example, consider report
(b) from Figure 6.1: determining that the EEG was abnormal required identifying, among other
findings, the frontal delta rhythm, while in report (c) the impression involves the drug Ativan and the
patient's prior diagnoses of dementia. These examples show that automatically inferring the over-all
impression requires accounting for high-level semantic information in EEG reports capturing the
characteristics of the patient and the described EEG signal. Moreover, we observed that not all EEG
reports included an impression section. Consequently, we designed an approach for automatically
inferring the overall impression from an EEG report even when the impression section is omitted.
To train and evaluate our model, we considered only the reports with a clear over-all impression and
(1) identified the over-all impression (which was used as the gold-standard) and (2) removed the
impression section from the report. This allowed us to design a deep neural network to predict the
over-all impression for EEG reports without relying on the impression section. We used a standard
3:1:1 split for training, development, and testing.
When designing our deep neural network, we noticed that the natural language content of each
EEG report was far from uniform. The number of sections, the title of sections, the number of
sentences in each section, and the lengths of each sentence all varied between individual neurologists
and individual reports. Moreover, when describing an EEG recording, each neurologist wrote in
a different style: while some neurologists preferred terse economical language, others provided
meticulous multi-paragraph discussions. Thus, it was necessary to design the deep neural network
to be independent of the length (and style) of the language used by the neurologist. Our approach
for determining the over-all impression from EEG reports takes advantage of recent advances in
deep learning in order to (1) automatically perform high-level feature extraction from EEG reports
and (2) determine the most likely overall impression based on trends observed in a large collection
of EEG reports. High-level feature extraction was performed automatically and was accomplished
in two steps. In the first step, we learned word-level features for every word used in any EEG report.
In the second step, we learned how to combine and compose the word-level features to produce
high-level features characterizing the report itself.

Formally, we represent each EEG report as a tensor, R ∈ RN×V , where N is the number of words

in the report and V is the size of the vocabulary or number of unique words across all EEG reports in

the training set (in our case, V = 39, 131). Each row Ri is known as a one-hot vector which indicates

that the i th word in the report corresponds to the j th word in the vocabulary by assigning a value of

one to Rij and a value of zero to all other elements. The overall impression of an EEG report (obtained

from the removed impression section) is represented as c ∈ C where C = {normal, abnormal}. The

goal of the deep neural network presented in this section is to determine the optimal parameters θ

which are able to predict the correct assignment of c for a report R:

θ = arg max_{θ′} ∑_{(c,R)∈X} log P(c | R; θ′)    (6.1)

where X indicates the training set of EEG reports. Unfortunately, determining the over-all im-

pression directly from the words in each report is difficult. For example, spikes and sharp waves

typically indicate abnormal brain activity but can be non-pathologic if they occur in the temporal

regions of the brain during sleep: small sharp spikes in the temporal region of the brain during

sleep are known as benign epileptiform transients of sleep (BETS) and do not indicate an abnor-

mal EEG. Consequently, to correctly predict the overall impression c, it is important to consider

high-level features characterizing the content each report rather than individual words. We extract

these features automatically as part of our deep learning architecture. Specifically, we factorize the

distribution used in Equation (6.1) into three factors:

1 2 3
z }| { z }| { z }| {
P (c | R; θ) = P (W | R) · P (e | W) · P (c | e; θ) (6.2)

The three factors in Equation (6.2) correspond to the three steps used to train our deep learning

model:

1. produce a high-level feature representation W of every word in R, i.e., P (W | R);

2. create a single high-level feature representation e for the report itself by combining and
composing the high-level feature representations of every word in the report, i.e., P (e | W);
and

3. determine the most likely over-all impression c for the report based on its high-level feature
representation e, i.e., P (c | e; θ).

Next, we will describe each of these steps in detail followed by a description of the training and
application of our model to infer underspecified over-all impressions from EEG reports, as well as
details on how model parameters were selected and the model parameters used in our experiments.

Extracting Word-Level Features from EEG Reports

We determine a high-level feature representation for each possible word v ∈ [1, V] by examining
the context around that word in each report (where V is the size of the vocabulary). To do this, we
adapt a skip-gram model (Mikolov and Dean, 2013). The skip-gram model learns a single feature
representation for every word in the vocabulary based on all of its contexts across all EEG reports
in the training set. Specifically, we learn the projection matrix U ∈ RV×K where each row Uv is the
high-level feature representation of the v th word in the vocabulary. Figure 6.3 shows the architecture
of the skip-gram model when considering the word EEG from the context Digital video EEG is
performed from report (c) of Figure 6.1. The goal of the skip-gram model is to learn the projection matrix
U which, when multiplied with the one-hot vector for EEG, is best able to predict the one-hot
vectors associated with each context word, e.g., Digital, video, is, and performed. In this way, the
skip-gram model is able to learn a representation for the word EEG which captures the facts that (1)
an EEG can be performed and that (2) digital video is a type of EEG. We learn the optimal project
matrix U by training a separate neural network in which the input is every word Ri ∈ R in every
report R ∈ X, and the goal is to predict the n previous and n following words using the projection
matrix U:

U = arg max_{U′} ∑_{R∈X} ∑_{i=1}^{N} [ ∑_{t=−n}^{−1} P(Ri+t | Ri; U′) + ∑_{t=1}^{n} P(Ri+t | Ri; U′) ]    (6.3)

Figure 6.3. The skip-gram model used to learn word-level features for each word in an EEG report.
Illustrated using report (c) from Figure 6.1.

where

P(Ri+t | Ri; U′) = exp(Ri+t U′ · Ri U′) / ∑_{v=1}^{V} exp(Ri U′ · U′v)    (6.4)
In our experiments, we used n = 2. Learning the optimal projection matrix U allows the model
to produce a high-level feature representation of every word in the report, W ∈ RN×K , by simply
multiplying R with U:
W = RU (6.5)

where each Wi indicates the word-level feature vector associated with Ri . The word-level feature
vectors (W) learned by the skip-gram model have a number of useful algebraic properties. Of
particular note is their ability to capture semantic similarity: for example, the closest feature vector
to the word generalized is that of the word diffuse, and the closest feature vector to focal is that
of the word localized. This highlights the ability of the skip-gram model to capture the fact that

both generalized and diffuse refer to activity spread across a large area of the brain (e.g., both
hemispheres, multiple lobes), while focal and localized describe activity concentrated in one or
two regions of the brain.
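
For illustration, comparable skip-gram embeddings can be trained with off-the-shelf tools; the sketch below uses gensim's Word2Vec (not the implementation described in this section) with sg=1, a window of n = 2, and 200 dimensions, and the two-report toy corpus is purely a placeholder for the full collection of tokenized EEG reports.

from gensim.models import Word2Vec   # gensim >= 4.0

# Placeholder corpus: in practice, one token list per EEG report in the training set.
tokenized_reports = [
    ["digital", "video", "eeg", "is", "performed", "at", "the", "bedside"],
    ["the", "eeg", "was", "recorded", "during", "wakefulness"],
]

model = Word2Vec(
    sentences=tokenized_reports,
    vector_size=200,    # K = 200 embedding dimensions
    window=2,           # n = 2 context words on each side
    sg=1,               # skip-gram objective (rather than CBOW)
    min_count=1,
)

# Nearest neighbors in the learned space tend to be semantically related words.
print(model.wv.most_similar("eeg", topn=3))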

Extracting Report-Level Features from EEG Reports

Representing each word in a report as an independent feature vector is not sufficient to predict the
overall impression. Instead, it is necessary to learn how to combine and compose the word-level
feature vectors W to create a single high-level feature vector for the report, e. We considered
two different neural architectures for learning e. The first model is based on a Deep Averaging
Network (DAN) (Iyyer et al., 2015), while the second uses a Recurrent Neural Network (RNN). Both
architectures enable the model to learn a semantic composition but in different ways. Specifically,
a DAN learns an unordered composition of each word in the document, while an RNN learns an
ordered composition. However, the representation learned by an RNN often struggles to account for
long-distance interactions and favors the latter half of each document. Consequently, we evaluated
both models in order to determine the most effective architecture for learning report-level features
from EEG reports.
Deep Averaging Network. The Deep Averaging Network (DAN) (Iyyer et al., 2015) learns the
report-level feature representation e of a report based on its word-level features W. To understand
the need for report-level features, consider the following:
Excerpt 1:
. . . a well-formed 9 to 10Hz posterior dominant rhythm, which is symmetrically reactive to eye
opening and eye closing. . .
Interpreting Excerpt 1 requires understanding (1) that the words posterior dominant rhythm describe
a single EEG activity, and (2) that the posterior dominant rhythm is well-formed. Clearly, word-level
features are not sufficient to capture this information. Instead we would like to extract high-level
semantic features encoding information across words, sentences, and even sections of the report.
The DAN used in our model accomplishes these goals using five layers, as shown in Figure 6.4.

Figure 6.4. Architecture of the Deep Averaging Network (DAN) used to combine and compose
word-level feature vectors W 1, · · · , W N extracted from an EEG report. Illustrated using report (c)
from Figure 6.1.

The first two layers learn an encoding of each word Wi associated with the report, and the third
layer combines the resulting encodings to produce an encoding for the report itself. The final
two layers refine this encoding to produce e. To learn an encoding for each word, we apply two
densely-connected Rectified Linear Units (ReLUs) (Glorot et al., 2011). The rectifying activation
functions used in ReLUs have several notable advantages, in particular the ability to allow for sparse
activation. This enables learning which words in an EEG report have the largest impact on the over-all
impression. By using a ReLU for the first layer of our encoder, each word represented by feature
vector Wi is projected onto an initial encoding vector ri(1) . The ReLU used in the second layer of
the encoder produces a “more expressive” encoding ri(2) (Iyyer et al., 2015). Both encodings are
generated as:

ri(1) = max(S1 · Wi + b1, 0)    (6.6)


 
ri(2) = max(S2 · ri(1) + b2, 0)    (6.7)

where S1, S2 ∈ θ are the learned weights of the connections between the neurons in layers 1 and
2, and b1, b2 ∈ θ are bias vectors. While the encoding ri(2) represents information obtained from
each word vector Wi ∈ W, we are interested in producing a single representation that captures the
information about the entire EEG report. This is accomplished by layers 3 through 5. In layer 3,
the piece-wise average of all word vector encodings is produced:
a = (1/N) ∑_{i=1}^{N} ri(2)    (6.8)

Layers 4 and 5 act as additional “deep” layers which enhance the quality of the encoding (Iyyer
et al., 2015). To implement layers 4 and 5 we used two additional ReLUs:

r(3) = max(S3 · a + b3, 0)    (6.9)

e = max(Se · r(3) + be, 0)    (6.10)

where S3 , b3 , Se , be ∈ θ are the learned weights and biases used by each ReLU layer. Equations (6.6)
to (6.10) enable our model to generate a fixed-length high-level vector, e, which encodes semantic
information about the entire EEG report.
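
A minimal Keras sketch of this five-layer architecture is shown below; it is a schematic rather than the implementation used in our experiments, and the 200-unit layer sizes simply follow the setting reported later under Implementation Details.

import tensorflow as tf

N, K = 1000, 200                                     # maximum report length, word-vector size

word_vectors = tf.keras.Input(shape=(N, K))          # W_1, ..., W_N for one report
r1 = tf.keras.layers.Dense(200, activation="relu")(word_vectors)   # Eq. (6.6), applied per word
r2 = tf.keras.layers.Dense(200, activation="relu")(r1)             # Eq. (6.7)
a = tf.keras.layers.GlobalAveragePooling1D()(r2)                   # Eq. (6.8), average over words
r3 = tf.keras.layers.Dense(200, activation="relu")(a)              # Eq. (6.9)
e = tf.keras.layers.Dense(200, activation="relu")(r3)              # Eq. (6.10), report-level vector

dan_encoder = tf.keras.Model(inputs=word_vectors, outputs=e)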
Recurrent Neural Network. In contrast to the DAN, the recurrent neural network (RNN) used
in our model jointly learns how to (1) map a sequence of word-feature vectors (W1, · · · , WN ) to a
sequence of hidden memory states (m1, · · · , mN ) as well as to (2) map the hidden memory states
to a sequence of output vectors (y1, · · · , yN ), as illustrated in Figure 6.5. Formally, for each word
i ∈ [1, N] where N is the length of the EEG report,

mi = σ(Sm · [Wi + mi−1])    (6.11)

yi = σ(Sy · mi)    (6.12)

where Sm, S y ∈ θ are the learned weights connecting the neurons in each layer. Unfortunately,
RNNs are known to have difficulties learning long-range dependencies between words (Hochreiter
et al., 2001). For example, consider the excerpt:

Figure 6.5. Architecture of the Recurrent Neural Network (RNN) used to combine and compose
word-level feature vectors W 1, · · · , W N extracted from an EEG report. Illustrated using report (c)
from Figure 6.1.

Excerpt 2:
. . . periodic delta with associated periodic paroxysmal fast activity identified from the left
hemisphere with a generous field of spread including the centrotemporal and frontocentral
region.

A standard RNN would be unlikely to infer that the periodic delta activity was observed in the

centrotemporal and frontocentral regions of the brain due to the significant number of words

between them (19). In order to enable our RNN to overcome this barrier, we implement each of

our RNNs as a stacked series of long short-term memory units (Hochreiter and Schmidhuber, 1997)

(LSTMs) which are able to learn long-range dependencies by accumulating an internal memory.
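
For comparison, a correspondingly minimal sketch of the stacked-LSTM encoder is shown below; the layer sizes are again illustrative, and the final hidden state serves as the report-level vector e.

import tensorflow as tf

N, K = 1000, 200
word_vectors = tf.keras.Input(shape=(N, K))                          # W_1, ..., W_N
h = tf.keras.layers.LSTM(200, return_sequences=True)(word_vectors)   # first LSTM layer
e = tf.keras.layers.LSTM(200)(h)                                     # second LSTM layer; final state -> e

rnn_encoder = tf.keras.Model(inputs=word_vectors, outputs=e)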

Inferring the Over-all Impression from EEG Reports

The learned high-level feature vector e is used to determine the most likely over-all impression

associated with the EEG report. Given e, we approximated the likelihood of assigning the over-all

impression c to the EEG report associated with e, i.e., P (c | e; θ), with a densely connected logistic

sigmoid layer. The sigmoid layer computes a floating point number c̃ ∈ [0, 1] such that c̃ ≤ 0.5 if
c = NORMAL, and c̃ > 0.5 if c = ABNORMAL:

c̃ = σ(Sc · e + bc)    (6.13)

where Sc, bc ∈ θ are the learned weights and bias vector for the sigmoid layer, and σ is the standard
logistic sigmoid function, σ(x) = e^x / (e^x + 1). Equation (6.13) allows us to approximate the likelihood of
the over-all impression c ∈ C being assigned to the report associated with e as:

P(c | e; θ) = { 1 − c̃  if c = NORMAL;   c̃  if c = ABNORMAL }    (6.14)

Training the Model with EEG Reports

We train our model by learning the parameters θ which minimize the loss when computing the

over-all impression c for each report R in the training set X. In our experiments, we used the cross-

entropy loss between the predicted over-all impression c and the gold-standard value ĉ indicated by
the neurologist (in the removed impression section). Formally:

L(θ) ∝ ∑_{(R,ĉ)∈X} [P(c | e; θ) · P(e | W) · P(W | R)] · log P(ĉ)    (6.15)

where P(ĉ) = 1 if c = ABNORMAL, and zero otherwise. We trained our model using adaptive

moment estimation (ADAM) (Kingma and Ba, 2015).
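
Putting these pieces together, the simplified sketch below attaches the sigmoid output unit of Equation (6.13) to a reduced DAN encoder and compiles it with a cross-entropy loss and ADAM, using the default ADAM settings reported below under Implementation Details; it is a schematic, not our actual implementation.

import tensorflow as tf

N, K = 1000, 200
word_vectors = tf.keras.Input(shape=(N, K))                        # W_1, ..., W_N
hidden = tf.keras.layers.Dense(200, activation="relu")(word_vectors)
averaged = tf.keras.layers.GlobalAveragePooling1D()(hidden)        # reduced DAN encoder
e = tf.keras.layers.Dense(200, activation="relu")(averaged)        # report-level vector e
c_tilde = tf.keras.layers.Dense(1, activation="sigmoid")(e)        # Eq. (6.13)

classifier = tf.keras.Model(word_vectors, c_tilde)
classifier.compile(
    optimizer=tf.keras.optimizers.Adam(learning_rate=0.001, beta_1=0.9,
                                       beta_2=0.999, epsilon=1e-8),
    loss="binary_crossentropy",                                    # cross-entropy training objective
    metrics=["accuracy"],
)
# classifier.fit(train_word_vectors, train_labels, validation_data=(dev_word_vectors, dev_labels))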

Inferring Underspecified Information from EEG Reports

The optimal over-all impression c for a new EEG report R can be determined in three steps: (1)
transform R into a word-level feature matrix, W = RU, using the projection matrix U learned from
the training data; (2) transform the word-level feature matrix W into a single report-level feature
vector e using either the DAN or the RNN; and (3) determine the over-all impression c from the
report-level feature vector e.

Implementation Details

In our experiments, we implemented our model using TensorFlow (Abadi et al., 2016) (version
0.8). Because ADAM tunes the learning rate as it trains, we initialized ADAM using the default
parameters in TensorFlow (learning rate = 0.001, β1 = 0.9, β2 = 0.999, and ε = 1e−8). For the
purposes of our experiments, gradient clipping was not applied, and no regularization terms were
added. Model parameters were determined using a grid search as follows: skip-gram, ReLU, and
LSTM dimensionality were chosen from {100, 200, 500, 1000}. When performing grid search,
we constrained all ReLUs to share the same dimensionality. We found the optimal dimensionality
for the skip-gram embeddings, ReLU layers, and LSTM to each be 200 dimensions/units.
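
The grid search itself can be summarized by the schematic helper below, where build_and_evaluate is a hypothetical function that trains a model with the given dimensionality on the training split and returns its accuracy on the development split.

def grid_search(candidate_dims, build_and_evaluate):
    # Try each candidate dimensionality (shared by the skip-gram embeddings,
    # ReLU layers, and LSTM) and keep the one with the best development accuracy.
    best_dim, best_acc = None, -1.0
    for dim in candidate_dims:
        acc = build_and_evaluate(dim)
        if acc > best_acc:
            best_dim, best_acc = dim, acc
    return best_dim, best_acc

# Usage (illustrative): grid_search([100, 200, 500, 1000], build_and_evaluate)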

6.1.3 Results

Our model was evaluated by measuring its ability to correctly determine the over-all impression c
(i.e., normal or abnormal) for each report in the test set. To do this, we removed the impression
section from each report in the test set and compared the automatically produced over-all impressions
against the over-all impressions given by the neurologists (obtained from the removed impression
sections). Because our goal is to recover the over-all impression in reports which lack an over-all
impression (normal or abnormal), we filtered out any reports which contained the word “normal”
or “abnormal” after the impression section was removed. In our test set, we found that 76% of EEG
impressions were abnormal and 24% were normal. We evaluated the performance of our model

when incorporating either deep averaging network (DAN) or a recurrent neural network (RNN) to

learn report-level features as well as the performance of five competitive baseline systems.

• Support Vector Machine (SVM). We trained two support vector machines (SVMs) (Cortes

and Vapnik, 1995) to classify each EEG report as normal or abnormal based on the con-

tent of the report. The first SVM (SVM:BoW) was trained by transforming R into a

single “bag-of-words” vector. The second SVM (SVM:LDA) was trained by transform-

ing R into a topic vector using Latent Dirichlet Allocation (LDA). LDA was implemented

using Sci-kit Learn (Pedregosa et al., 2011) with symmetric priors. Parameters were de-

termined using grid search as follows: number of topics ∈ {100, 200, 500, 1000}, ker-

nel ∈ {linear, radial basis function (RBF), quadratic, cubic}, C ∈ {10−4, 10−3, 10−2, 10−1, 1,

10, 102, 103, 104 }. The optimal kernel for both SVMs was RBF, and the optimal value of C was

103 . For LDA, the optimal number of topics was 200.

• Approximate Nearest Neighbor (ANN:Lev). The traditional approach for recovering miss-

ing or underspecified information is to simply use the information given in the closest

document. In this baseline, the over-all impression was assigned using the over-all impres-

sion given by the closest report R0 in the training set to report R. We measured the “distance”

between reports using Levenshtein distance (with equal costs for insertion, deletion, and

substitution operations).

• Neural Bag-of-Words (NBOW). In order to compare the importance of considering high-

level feature information, we implemented a simple perceptron baseline which considers the

content of the report R (represented as a bag-of-words vector) to produce the conclusion

without automatically inferring any high-level features.

• Doc2Vec (D2V). Finally, we considered a baseline relying on Doc2Vec (D2V), a document-

level extension of Google’s Word2Vec model (Mikolov and Dean, 2013). Like our model,

D2V learns a high-level semantic representation of a document. However, unlike our model,
the high-level features learned by D2V are agnostic of any particular task and do not capture
any interaction between the content of the report and the over-all impression. We consid-
ered D2V vector dimensionality ∈ {100, 200, 500, 1000} and found the optimal number of
dimensions was 200.

The performance of our model for generating the over-all impression of an EEG report was measured
by casting it as a binary classification problem. This allowed us to compute the Accuracy, Precision,
Recall, and F1 -measure, as shown in Table 6.1. We also report the time taken to train each model.
We found that our deep neural network (DNN, denoted with a '*') significantly outperformed
the five baseline systems. The low performance of the NBoW and SVM systems highlights the
need to account for high-level contextual information in EEG reports, rather than individual words.
Interestingly, the ANN approach obtained very low performance, despite being the most common
technique reported in the literature. The low performance of Doc2Vec shows that capturing general
report-level information is not sufficient for recovering the over-all impression. By contrast, both
the RNN and DAN were able to learn meaningful high-level features from EEG reports. While the
high performance of the RNN suggests that RNNs are capable of inferring sequential information
from the EEG descriptions, the time taken to train the RNN was significantly longer than that taken
to learn the DAN, mirroring the suspicions reported by Iyyer et al. (2015).

Table 6.1. Performance when determining the over-all impression for EEG reports in the test set.

System Accuracy Precision Recall F1 -measure Time


SVM:BOW 0.8349 0.8503 0.8814 0.8656 4 min 44 s
SVM:LDA 0.6331 0.6245 0.9947 0.7673 9 min 56 s
ANN:Lev 0.7457 0.8069 0.7601 0.7829 38 s
NBOW 0.7491 0.8300 0.7346 0.7794 3 min 37 s
D2V 0.6587 0.7645 0.6275 0.6892 6 min 12 s
* DNN: DAN 0.9143 0.9443 0.9117 0.9277 8 min 14 s
* DNN: RNN 0.8941 0.9234 0.8991 0.9111 20 min 46 s

6.1.4 Discussion

Our experimental results show the clear and significant promise of our model. By taking advantage

of textual data through deep neural learning, our model was able to recover the over-all impression

with 91% accuracy. The poor performance of all baseline systems, when compared against the

performance of our model, suggests that both deep neural architectures were able to successfully

extract high-level features automatically from EEG reports. However, despite outperforming all

baseline systems, there were still instances in which the over-all impressions automatically recovered

by our model differed from those of the neurologist. To investigate the causes of these errors, we

manually inspected 100 randomly-selected misclassified EEG reports. We found that the majority

of errors (51%) produced by our model occurred when the impression section referred to EEG

characteristics not mentioned elsewhere in the report, or which had only negated mentions. For

example, the report from Figure 6.1b indicates that the EEG was abnormal due to “excess theta”

and “excess beta”; however, neither of these characteristics are described in the EEG description.

This suggests that the performance may be improved by jointly considering the EEG report and

the associated EEG signal. However, it should be noted that processing the EEG signal directly is

highly computationally expensive and is an open problem for which there are no clear preferred

methodologies (Lotte, 2014). Thus, the added value of incorporating EEG signal information

largely depends on the nature of the application.

The second most-common source of errors (24%) was EEG reports in which the over-all im-

pression relied on the patient’s age or pre-existing conditions. For example, impressions indicating

“normal EEG for a patient of this age” or “normal activity given recent trauma.” We believe these

errors resulted from a lack of sufficient training data for specific ages and pre-existing conditions,

and that overcoming these errors could be accomplished by providing the model with additional

context from the patient’s EHRs (such as notes from the referring physician). Unfortunately, the

TUH EEG corpus provides no additional EHR information beyond individual EEG reports. Never-

theless, we believe that incorporating additional context could be a valuable step towards improving

future performance.

The last major source of errors (9%) we observed were due to typographical mistakes and

grammatical inconsistencies. For example, we observed “$een in” rather than “seen in”; “eta

rhythm” rather than “beta rhythm”; and “& Hz” rather than “7 Hz”. While most of these mistakes

had little impact on the performance of the model, we believe that future work may benefit from pre-

processing EEG reports to remove typos and grammatical inconsistencies. There was no common

source of errors for the remaining 16% of misclassified documents.

6.1.5 Lessons Learned

In this section, we presented a deep learning approach for inferring missing or underspecified

information from electronic health records (EHRs) by taking advantage of textual data. Our

approach was evaluated based on its performance when recovering over-all EEG impressions from

EEG reports after the impression section had been removed. While traditional machine learning

approaches would require explicitly enumerating features characterizing the over-all impression,

our model relies on deep neural learning to automatically identify high-level features from EEG

reports while learning to predict the correct over-all impression. Our evaluation of over 3,000 EEG

reports showed promising results, with an F1 -measure of 93% (a 40% improvement over Doc2Vec).

Moreover, because our approach does not rely on any manual feature extraction nor representation

specific to EEG reports, we believe these results show the promise of our model for automatically

recovering underspecified information from EHRs in general.

6.2 Recovering Missing Information2

Diagnosing and managing neurological dysfunction often hinges on successful communication

between the neurologist performing a diagnostic test (such as an electroencephalogram, or EEG) and

the primary physician or other specialists. Glick et al. (2005) studied malpractice claims against
neurologists and found that 71% of the claims arose from “an evident failure of communication by
the neurologist” and that the majority of the claims resulted from deficient communication between
the neurologist and the primary physician or other specialists. In addition, Glick et al. found that
62.5% of claims included diagnostic errors and that 41.7% involved errors in “ordering, interpreting,
and reporting of diagnostic imaging, follow-through and reporting mechanisms.” It is expected
that these types of errors could be reduced, and communication could be improved by developing
tools capable of automatically analyzing medical reports (Cawsey et al., 1997). Moreover, a recent
Institute of Medicine Report (England et al., 2012) advocated the need for decision-support tools
operating on electronic health records for primary care and emergency room providers to manage
referral steps for further evaluation and care of persons with epilepsy. Specifically, the ability to
automatically extract and analyze the clinical correlations between any findings documented in a
neurological report and the over-all clinical picture of the patient, could enable future automatic
systems to identify patients requiring additional follow-up by the primary physician, neurologist,
or specialist. Furthermore, systems capable of automatic analysis of the clinical correlations
documented in a large number of reports could ultimately provide a foundation for automatically
identifying reports with incorrect, unusual, or poorly-communicated clinical correlations, mitigating
misdiagnoses and improving patient care (Cawsey et al., 1997). It should be noted, however, that
automatically identifying incorrect, unusual, or poorly-communicated clinical correlations has two
critical requirements: (1) inferring what the expected clinical correlations would be for the patient
and (2) quantifying the degree of disagreement or contradiction between the clinical correlations
documented in a report and the expected clinical correlations for the patient. In this section, we
focus on the first requirement by considering the clinical correlation sections documented in EEG
reports.

2Minor revision, with permission, of Travis R. Goodwin, and Sanda M. Harabagiu, Inferring Clinical Correlations
from EEG Reports with Deep Neural Learning, Proceedings of the American Medical Informatics Association (AMIA)
Annual Symposium, 2017.

The role of the clinical correlation section is not only to describe the relationships between

findings in the EEG report and the patient’s clinical picture, but to also explain and justify the

relationships so as to convince any interested health care professionals. Consequently, the clinical

correlation section of an EEG report is expressed through natural language, meaning that the clinical

correlations documented in the clinical correlation section are qualified and contextualized through

all the subtlety and nuance enabled by natural language expression (Albright et al., 2013). For this

reason, while it might appear sufficient to simply extract individual findings or medical concepts

from the clinical correlation section, describing and justifying the clinical correlations requires

producing coherent natural language (Cawsey et al., 1997). This requirement makes inferring the

expected clinical correlation section from an EEG report a challenging problem because it requires

not only identifying the correct clinical correlations, but also expressing those correlations through

natural language which is informed by the content of the EEG report as well as the neurologist’s medical

knowledge and accumulated experience.

In this section, we present a novel Deep Section Recovery Model (DSRM) which applies deep

neural learning on a large body of EEG reports in order to infer the expected clinical correlations

for a patient based solely on the natural language content in his or her EEG report. The DSRM

was trained and evaluated using the Temple University Hospital (TUH) EEG Corpus (Harati et al.,

2013) by (a) identifying and removing the clinical correlation section written by the neurologist

and (b) training the DSRM to infer the entire clinical correlation section from the remainder of the

report. At a high level, the DSRM can be viewed as operating through two general steps:

Step 1: word- and report- level features are automatically extracted from each EEG report to

capture contextual, semantic, and background knowledge; and

Step 2: the most likely clinical correlation section is jointly (a) inferred and (b) expressed through

automatically generated natural language.

Our experimental results against a number of competitive baseline models indicate the generative

power of the DSRM, as well as the promise of automatically recognizing unusual, incorrect, or

incomplete clinical correlations in the future. It should be noted that although we evaluated the
DSRM by recovering the clinical correlation sections from EEG reports, the model automatically
extracts its own features based on the words in a given report and (clinical correlation) section.
Consequently, we believe the DSRM could be easily adapted not only to process additional types
of medical reports, but also to infer and generate medical language for other purposes, e.g.,
generating explanations for CDS systems, providing automated second opinions, and assessing and
tracking documentation quality.

6.2.1 Background

The Deep Section Recovery Model (DSRM) presented in this section was originally envisioned as an
extension to the MERCuRY system described in Chapter 4 on page 91. This system assigns different
weights or importance to the different sections in an EEG report, with the clinical correlation
section being the most important. Unfortunately, we found that as many as 1 in 10 EEG reports
were missing a clinical correlation section. In Section 6.1, we designed a binary classification
model for automatically inferring the over-all impression (normal or abnormal) for an EEG report.
This model was extended and adapted to produce natural language, forming the basis for the DSRM
presented in this section. As a natural language generator, the DSRM incorporates advances from
Natural Language Generation, Machine Translation, and Automatic Summarization. We briefly
review each of these topics below.

Natural Language Generation

Natural language generation (NLG) is an area of study on how automatic systems can produce
high-quality natural language text from an internal representation (Varile and Zampolli, 1997).
Traditionally, NLG systems rely on a pipeline of sub-modules including content selection – de-
termining which information the model should generate – and surface realization – determining
how the model should express the information in natural language. These systems typically require

supervision at each individual stage and cannot scale to large domains (Iyer et al., 2016). In health
care, NLG has traditionally focused on surface realization through a number of applications (Cawsey
et al., 1997), including generating explanations (Swartout, 1985), advice (Carberry and Harvey,
1997) or critiques (Gertner et al., 1997) in expert systems, as well as generating explanatory ma-
terial for patients (Buchanan et al., 1995). These systems largely rely on templates and rule-based
mechanisms for producing natural language content. By contrast, the DSRM jointly performs con-
tent selection (via latent feature extraction) and surface realization (using a deep neural language
model) without requiring predefined rules or templates.

Machine Translation

Perhaps the most ubiquitous application of NLG, machine translation has been an active area of
research for the last 50 years (Slocum, 1985). While the earliest systems were largely rule-based,
statistical machine translation (SMT) systems have become the focus of the field. Statistical
machine translation systems typically rely on gold-standard word or sentence alignments between
parallel texts in a source and target language and use machine learning to train models which
can automatically translate between them (Brown et al., 1993). More recently, the advent of deep
learning has enabled the design of systems which jointly learn to align and translate (Bahdanau
et al., 2015). The canonical work of Bahdanau et al. (2015) introduces the notion of neural attention,
which allows the model to learn how words in the target language should be aligned to words in the
source language without supervision. The DSRM extends this idea by incorporating an attention
layer to learn the association between words in the clinical correlation section and those in the rest
of the report.

Automatic Summarization

Automatic summarization systems can be typically divided into two categories: extractive summa-
rization systems, which aim to select individual words or sentences from a document and “stitch”

them together to form a summary (Rush et al., 2015), and abstractive summarization systems

which consider structural and/or semantic information to produce a summary that can contain

words not mentioned in the document. It has been shown in Pivovarov et al. (2016) that extractive

summarization may not be sufficient for health care needs; rather, abstractive summarization efforts

should be preferred. Fortunately, as with SMT, advances in deep learning have allowed

summarization systems to learn an internal or embedded representation of a document which can be

used as the basis for NLG (Rush et al., 2015) using Sequence-to-Sequence models (Cho et al., 2014).

Consequently, the DSRM model adapts the notion of abstractive summarization and combines and

extends Sequence-to-Sequence models with the attention mechanisms used by SMT systems.


Figure 6.6. Simplified Architecture of the Deep Section Recovery Model (DSRM).

6.2.2 Recovering the Clinical Correlation Section of EEG Reports

When writing the clinical correlation section of an EEG report, the neurologist considers the infor-

mation described in the previous sections, such as relevant clinical history or notable epileptiform

activities, as well as their accumulated medical knowledge and experience with interpreting EEGs.

This type of background knowledge is difficult to capture with hand-crafted features because it is

rarely explicitly stated; rather, it is implied through the subtlety, context, and nuance afforded the

neurologist by natural language. Consequently, to approach this problem, we present a deep neural

network architecture which we refer to as the Deep Section Recovery Model (DSRM). Illustrated

in Figure 6.6, the DSRM consists of two major components:

• the Extractor which learns how to automatically extract (a) feature vectors representing
contextual and background knowledge associated with each word in a given EEG report as
well as (b) a feature vector encoding semantic, background, and domain knowledge about the
entire report; and
• the Generator which learns how to use the feature vectors extracted by the Extractor to produce
the most likely clinical correlation section for the given report while also considering the
semantics of the natural language it is generating.
In order to train and evaluate the DSRM, we identified all EEG reports in the TUH EEG Corpus
which contained a CLINICAL CORRELATION section and removed that section from the report.
The model was trained to recover the missing clinical correlation section in the training set and
evaluated based on the clinical correlation sections it inferred for reports in the test set. In the
remainder of this section, we describe (1) the natural language pre-processing steps applied to the
data, (2) the mathematical problem formulation, (3) the Extractor, (4) the Generator, (5) how the
parameters of the model are learned from the training set, and (6) how the learned parameters are
used to infer the most likely clinical correlation section for a (new) EEG report.

Natural Language Pre-processing

Before applying the Deep Section Recovery Model, we pre-processed each EEG report with
three basic natural language processing steps: (1) sentence boundaries were identified using the
OpenNLP3 sentence splitter; (2) word boundaries were detected using the GENIA (Tsuruoka et al.,
2005) tokenizer, and (3) section boundaries were identified using a simple regular expression
search for capitalized characters ending in a colon. These three pre-processing steps allowed us to
represent each section of an EEG report as a sequence of words in which the symbols ⟨s⟩ and ⟨/s⟩
were used to indicate the start and end of each sentence, ⟨p⟩ and ⟨/p⟩ were used to indicate the start
and end of each section, and ⟨d⟩ and ⟨/d⟩ were used to indicate the start and end of each report.

3https://opennlp.apache.org/
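To make these pre-processing steps concrete, the sketch below shows one way the section splitting and boundary annotation could be implemented in Python. It is only an illustration: the regular expression, the helper names, and the tokenize/split_sentences arguments stand in for the OpenNLP sentence splitter and GENIA tokenizer actually used, and do not reproduce the dissertation's implementation.

import re

# Illustrative heading pattern: capitalized text ending in a colon, e.g., "CLINICAL HISTORY:".
SECTION_HEADING = re.compile(r"^[A-Z][A-Z ]+:")

def split_sections(report_lines):
    """Group the lines of an EEG report into sections at heading lines."""
    sections, current = [], []
    for line in report_lines:
        if SECTION_HEADING.match(line) and current:
            sections.append(current)
            current = []
        current.append(line)
    if current:
        sections.append(current)
    return sections

def annotate(sections, tokenize, split_sentences):
    """Represent a report as a flat token sequence with <d>, <p>, and <s> boundary symbols."""
    tokens = ["<d>"]
    for section in sections:
        tokens.append("<p>")
        for sentence in split_sentences(" ".join(section)):
            tokens += ["<s>"] + tokenize(sentence) + ["</s>"]
        tokens.append("</p>")
    tokens.append("</d>")
    return tokens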

Problem Formulation

In order to formally define the problem, it is necessary to first define the vocabulary as the set of all

words observed at least once in any section (including the clinical correlation section) of any EEG

report in the training set. Let V indicate the size or number of words in the vocabulary. This allows

us to represent an EEG report as a sequence of V-length one-hot vectors corresponding to each word
in the report, i.e., $R \in \{0, 1\}^{N \times V}$, where $N$ is the length or number of words in the report. Likewise,
we also represent a clinical correlation section as a sequence of V-length one-hot vectors; in this
case, $S \in \{0, 1\}^{M \times V}$, where $M$ is the number of words in the clinical correlation section. The goal of

the Deep Section Recovery Model is to infer the most likely clinical correlation section for a given

EEG report. Let θ be the learn-able parameters of the model. Training the model equates to finding

the values of θ which assign the highest probabilities to the gold-standard (neurologist-written)

clinical correlation sections for each EEG report in the training set; formally:

$\theta = \arg\max_{\theta'} P(S \mid R; \theta') \quad (6.16)$

We decompose the probability of a particular clinical correlation section being produced for a given
EEG report (i.e., correctly identifying and describing the clinical correlations in the report) into
two factors:

$P(S \mid R; \theta) \approx \underbrace{P(e, h_1, \cdots, h_N \mid R; \theta)}_{\text{Extractor}} \cdot \underbrace{P(S \mid e, h_1, \cdots, h_N; \theta)}_{\text{Generator}} \quad (6.17)$

where the first factor is implemented by the Extractor and the second factor is implemented by
the Generator.
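As a concrete illustration of this encoding, the snippet below builds the one-hot matrix R for a short, hypothetical token sequence; the toy vocabulary and tokens are invented purely for the example.

import numpy as np

def one_hot_matrix(tokens, vocab):
    """Encode a token sequence as an N x V one-hot matrix (R or S in the notation above)."""
    matrix = np.zeros((len(tokens), len(vocab)), dtype=np.float32)
    for position, token in enumerate(tokens):
        matrix[position, vocab[token]] = 1.0
    return matrix

vocab = {"<d>": 0, "<p>": 1, "<s>": 2, "no": 3, "epileptiform": 4, "activity": 5,
         "</s>": 6, "</p>": 7, "</d>": 8}
R = one_hot_matrix(["<d>", "<p>", "<s>", "no", "epileptiform", "activity",
                    "</s>", "</p>", "</d>"], vocab)
print(R.shape)  # (9, 9): N = 9 words, V = 9 vocabulary entries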

The Extractor

The language in the clinical correlation section is intended to relate findings and observations

described in the previous sections of the record to the over-all clinical picture of the patient.

Consequently, in order to automatically produce the clinical correlation section, the goal of the

Extractor is to automatically (1) identify important neurological findings and observations (e.g.,

“background slowing”), (2) identify descriptions of the patient’s clinical picture (e.g., “previous

seizure”), and (3) determine the inferred relationship(s) between each finding and the clinical

picture as described by the EEG report or implied by medical knowledge and experience (e.g.,

“observed epileptiform activity is consistent with head trauma”). It should be noted that the length

and content of each EEG report varies significantly throughout the collection, both in terms of the

sections included in each report as well as the content in each section. Moreover, when producing

an EEG report, each neurologist writes in a different style, ranging from terse 12-word sections

to 600-word sections organized into multiple paragraphs. Consequently, the role of the Extractor

is to overcome these barriers and extract meaningful feature vectors which characterize semantic,

contextual, and domain knowledge. To address these requirements, we implemented the Extractor

using the deep neural architecture illustrated in Figure 6.7. The Extractor relies on five neural layers
to produce feature vectors for each word in the report ($h_1, \cdots, h_N$) as well as a feature vector
characterizing the entire report ($e$):

• Layer 1: Embedding. The role of the embedding layer is to embed each word in the
EEG report $R_i$ (represented as a V-length 1-hot vector) into a K-length continuous vector
$r_i^{(1)}$ (where $K \ll V$). This is accomplished by using a fully connected linear projection
layer, $r_i^{(1)} = R_i W_e + b_e$, where $W_e \in \mathbb{R}^{V \times K}$ and $b_e \in \mathbb{R}^{1 \times K}$ (both learned parameters in $\theta$) correspond to the vocabulary
projection matrix and bias vector learned by the Extractor.

• Layer 2: Bidirectional Recurrent Neural Network. Layer 2 implements a bidirectional

recurrent neural network (RNN) using two parallel RNNs trained on the same inputs: (1)

a forward RNN which processes words in the EEG report in left-to-right order and (2) a

backward RNN which processes words in the EEG report in right-to-left order. This allows

the forward RNN to extract features capturing any short- or long-range contextual information

about each word in R provided by any preceding words in the EEG report (e.g., that “slowing”


Figure 6.7. Detailed Architecture of the Extractor.

is negated in “no background slowing”). Likewise, the backward RNN extracts features
capturing any short- or long-range contextual information provided by successive words
in the EEG report (e.g., that “hyperventilation” described in the introduction section may
influence the inclusion of “spike and wave discharges” in the EEG impression or description
sections). Formally, the forward RNN maps the series word embeddings r (1) (1)
1 , · · · , r N to
(2 f ) (2 f )
a series of “forward” word-level feature vectors r 1 , · · · , r N , while the backward RNN
maps r (1) (1) (2b) (2b)
N , · · · , r 1 to a series of “backward” word-level feature vectors r N , · · · , r 1 . In

our model, the forward and backward RNNs were implemented as a series of shared Gated
Recurrence Units (GRUs)4 (Glorot et al., 2011).

• Layer 3: Concatenation. The concatenation layer combines the forward and backward
word-level feature vectors to produce a single feature vector for each word, namely,
$r_i^{(3)} = \left[ r_i^{(2f)}; r_i^{(2b)} \right]$, where $[x; y]$ indicates the concatenation of vectors $x$ and $y$.

• Layer 4: 2nd Bidirectional Recurrent Neural Network. In order to allow the model
to extract more expressive features, we use a second bidirectional RNN layer. This layer
operates identically to the bidirectional RNN in Layer 2, except that the word-level feature
vectors produced in Layer 3, i.e., $r_1^{(3)}, \cdots, r_N^{(3)}$, are used as the input to the bidirectional
RNN (instead of $r_1^{(1)}, \cdots, r_N^{(1)}$ used in Layer 2). Likewise, the memory states produced in
Layer 4 are denoted as $f_2$ and $b_2$, corresponding to the forward RNN and the backward
RNN, respectively. Unlike the bidirectional RNN used in Layer 2, we use the final memory
of the forward RNN (i.e., $f_2$) as the report-level feature vector $e$ which will be used by the
Generator.

• Layer 5: 2nd Concatenation. As in Layer 3, the second concatenation layer combines the
forward and backward word-level feature vectors produced in the previous layer. In the
case of Layer 5, however, we used the resulting feature vectors $h_1, \cdots, h_N$ as the word-level
feature vectors which will be provided to the Generator.

4A GRU is a block of coordinated sub-layers in a neural network which learn to transform an input vector (e.g.,
$r_i^{(1)}$) into an output vector (e.g., $r_i^{(2f)}$ or $r_i^{(2b)}$) by maintaining and updating an internal memory state. The memory
state used in the forward RNN is denoted by $f_1$, while the memory state used in the backward RNN is denoted by $b_1$.
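The five layers above can be approximated in a few lines with the Keras API, as sketched below. This is not the dissertation's implementation (which used TensorFlow 1.0 with its own GRU cells); the layer choices, the merge behavior of Bidirectional, and the hyperparameter values V, K, and H are illustrative assumptions.

import tensorflow as tf

V, K, H = 30000, 200, 256                       # vocabulary size, embedding size, GRU units (illustrative)
report_ids = tf.keras.Input(shape=(None,), dtype="int32")   # word indices of one EEG report

# Layer 1: embedding (plays the role of the one-hot projection R_i W_e)
embedded = tf.keras.layers.Embedding(V, K)(report_ids)

# Layers 2-3: first bidirectional GRU; merge_mode="concat" performs the concatenation layer
r3 = tf.keras.layers.Bidirectional(
    tf.keras.layers.GRU(H, return_sequences=True), merge_mode="concat")(embedded)

# Layers 4-5: second bidirectional GRU; its concatenated outputs are h_1, ..., h_N,
# and the final forward memory state serves as the report-level feature vector e
h, forward_state, backward_state = tf.keras.layers.Bidirectional(
    tf.keras.layers.GRU(H, return_sequences=True, return_state=True))(r3)
e = forward_state

extractor = tf.keras.Model(inputs=report_ids, outputs=[h, e])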

The Generator

The role of the Generator is to generate the most likely clinical correlation section for a given EEG
report using the feature vectors extracted by the Extractor. It is important to note that because the
clinical correlation sections vary both in terms of their length and content, the number of possible
clinical correlation sections that could be produced is intractably high ($V^{M_{\max}}$, where $M_{\max}$ is
the maximum length of a clinical correlation section). Consequently, we substantially reduce the
complexity of the problem by modeling the assumption that each word in the clinical correlation
section can be determined based solely on (1) the word-level feature vectors h 1, · · · , h N extracted by
the Extractor, (2) the report-level feature vector e extracted by the Extractor, and (3) any preceding
words produced by the Generator. This assumption allows us to define the probability of any
clinical correlation section, $S'$, having been produced by a neurologist for a given EEG report (i.e.,
the second factor in Equation (6.17)) as:
$P(S' \mid R) = \prod_{j=1}^{M} P\left(S'_j \mid S'_{j-1}, \cdots, S'_1, e, h_1, \cdots, h_N; \theta\right) \quad (6.18)$

To compute Equation (6.18), we designed the Generator to act as a type of Recurrent Neural
Language Model (RNLM) (Mikolov et al., 2010) which incorporates a Recurrent Neural Network
(RNN) to produce one word in the clinical correlation section at-a-time while maintaining and
updating an internal memory of which words have already been produced.
To improve training efficiency, the Generator has two similar but distinct configurations: one
for training, and one for inference (e.g., testing). Figure 6.8 illustrates the architecture of the
Generator under both configurations. The primary difference between each configuration is the
input to the RNN: when training, the model embeds the previous word from the gold-standard
clinical correlation section (i.e., $S_{j-1}$) to predict $S'_j$, while during inference the RNN operates on the


Figure 6.8. Detailed Architecture of the Generator under (a) Training and (b) Inference Configura-
tions.

embedding of the previously generated word (i.e., $S'_{j-1}$) to predict $S'_j$. The Generator produces the
natural language content of a clinical correlation section for a given EEG report using four layers
(with the preliminary embedding layer in the training configuration acting as an extra “zero”-th
layer):

• Layer 0: Embedding. The embedding layer, which is only used when the Generator is in
training configuration, embeds each word in the gold-standard clinical correlation section
$S_j$ (represented by V-length 1-hot vectors) into an L-length continuous vector space, $s_j^{(0)}$,
where $L \ll V$. This is accomplished by using a fully connected linear projection layer,
$s_j^{(0)} = S_j W_G + b_G$, where $W_G \in \mathbb{R}^{V \times L}$ and $b_G \in \mathbb{R}^{1 \times L}$ (both learned parameters in $\theta$) correspond to the vocabulary
projection matrix and vocabulary bias vector learned by the Generator.

• Layer 1: Concatenation. The first layer used in both configurations of the Generator is a
concatenation layer which combines the embedded representation of the previous word with
$e$, the report-level feature vector extracted by the Extractor: $s_j^{(1)} = \left[ s_{j-1}^{(0)}; e \right]$, where $[x; y]$
indicates the concatenation of vectors $x$ and $y$ and $s_0^{(0)}$ is defined as a zero vector.

• Layer 2: Gated Recurrent Unit. The second layer used by both configurations is a Gated
Recurrent Unit (GRU). The GRU allows the model to accumulate memories encoding long-
distance relationships between each produced word of the clinical correlation section, $S'$, and
any words previously produced by the model. This is performed by updating and maintaining
an internal memory within the GRU which is shared across all words in the clinical correlation
section. We denote the output of the GRU as $s_j^{(2)}$.

• Layer 3: Attention. In order to improve the quality and coherence of natural language
produced by the Generator, an attention mechanism was introduced. The attention mechanism
allows the Generator to consider all of the word-level feature vectors $h_1, \cdots, h_N$ produced
by the Extractor for the given report, and learns the degree that each word in the EEG report
influences the selection of (or aligns with) $S'_j$; formally:

$s_j^{(3)} = \sum_{i=1}^{N} \alpha_{ij} h_i \qquad \alpha_{ij} = \frac{\exp(\beta_{ij})}{\sum_{l=1}^{N} \exp(\beta_{lj})} \qquad \beta_{ij} = \sigma\left(W_\beta s_j^{(2)} + U_\beta h_i + b_\beta\right) \quad (6.19)$

such that $\alpha_{ij}$ is an alignment weight produced by the alignment model $\beta_{ij}$ which determines the
degree that the $i$-th word in the EEG report $R$ (represented by $h_i$) influences the $j$-th word of
the clinical correlation section $S'_j$ (represented by $s_j^{(2)}$).

• Layer 4: Addition. The role of the fourth layer is to combine the result of the previous
attention layer with the result of the GRU in Layer 2, i.e., $s_j^{(4)} = s_j^{(3)} + s_j^{(2)}$.

• Layer 5: Softmax Projection. In order to measure the probability of each word $S'_j$ being
produced for the given EEG report, we use a final softmax projection layer to produce
a vocabulary-length vector $s_j^{(5)}$ in which the $v$-th element indicates the probability that $S'_j$
should be generated as the $v$-th word in the vocabulary, $s_j^{(5)} = \mathrm{softmax}\left(s_j^{(4)} W_p + b_p\right)$, where
$\mathrm{softmax}(x)_v = \frac{\exp(x_v)}{\sum_{v'=1}^{V} \exp(x_{v'})}$ and $v \in [1, V]$. This allows us to complete the definition of
Equation (6.18):

$P\left(S'_j = v \mid S'_{j-1}, \cdots, S'_1, e, h_1, \cdots, h_N; \theta\right) = s_{jv}^{(5)} \quad (6.20)$
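To make the layer algebra above concrete, the sketch below walks through a single Generator step in plain NumPy. The gru argument and all of the weight arrays are placeholders (their shapes are assumed to be mutually compatible, e.g., the GRU output and the word-level vectors h_i sharing one dimensionality); this is an illustration of Layers 1-5, not the trained model.

import numpy as np

def softmax(x):
    z = np.exp(x - x.max())
    return z / z.sum()

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def generator_step(prev_embed, e, h, memory, gru, params):
    """One decoding step: concatenate, GRU, attend over h_1..h_N, add, and project to the vocabulary.
    `gru` is any callable mapping (input, memory) -> (output, new_memory)."""
    W_beta, U_beta, b_beta, W_p, b_p = params
    s1 = np.concatenate([prev_embed, e])        # Layer 1: [s_{j-1}^(0); e]
    s2, memory = gru(s1, memory)                # Layer 2: gated recurrent unit
    beta = np.array([sigmoid(W_beta @ s2 + U_beta @ h_i + b_beta) for h_i in h])
    alpha = softmax(beta)                       # Layer 3: attention weights alpha_ij (Equation 6.19)
    s3 = alpha @ h                              # attended combination of the word-level vectors
    s4 = s3 + s2                                # Layer 4: addition
    s5 = softmax(s4 @ W_p + b_p)                # Layer 5: vocabulary distribution s_j^(5)
    return s5, memory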

Training the Deep Section Recovery Model

Training the Deep Section Recovery Model (DSRM) is achieved by finding the parameters θ which

are most likely to produce the gold-standard clinical correlation sections for each EEG report

in the training set T. Formally, we model this by minimizing the cross-entropy loss between

the vocabulary-length probability vectors produced by the model ($s_j^{(5)}$) and the one-hot vectors
corresponding to each word in the gold-standard clinical correlation section ($S_j$):

$\mathcal{L}(\theta) \propto -\sum_{(R,S) \in T} \sum_{j=1}^{M} \left[ S_j \log s_j^{(5)} + \left(1 - S_j\right) \log\left(1 - s_j^{(5)}\right) \right] \quad (6.21)$
The model was trained using Adaptive Moment Estimation (ADAM) (Kingma and Ba, 2015) (with

an initial learning rate η = 0.001).
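Under these definitions, the loss for a single report can be sketched as below; teacher forcing is assumed, and the clipping constant is ours, added only to avoid log(0) in the illustration.

import numpy as np

def section_loss(predicted, gold, eps=1e-8):
    """Cross-entropy between the Generator's probability vectors s_j^(5) (an M x V matrix)
    and the gold-standard one-hot vectors S_j (also M x V) for one clinical correlation section."""
    predicted = np.clip(predicted, eps, 1.0 - eps)
    per_word = gold * np.log(predicted) + (1.0 - gold) * np.log(1.0 - predicted)
    return -per_word.sum()

# During training this quantity is summed over every (R, S) pair in the training set T
# and minimized with ADAM, as described above.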

Inferring Clinical Correlations

Given θ learned from the training set, the clinical correlation section S can be generated for a

new EEG report R using the inference configuration illustrated in Figure 6.8b. In contrast to

the training configuration in which $S'_j$ is selected using the previous word from the gold-standard
clinical correlation section ($S_{j-1}$), during inference, the model predicts $S'_j$ using the word previously
produced by the model ($S'_{j-1}$). It is important to note that, unlike training, we do not know the length

of the clinical correlation section we will generate. Consequently, the model continually generates

output until it produces the END-OF-SECTION symbol ⟨/p⟩. Thus, the length of the inferred
clinical correlation section $M$ is determined dynamically by the model. When inferring the most
likely clinical correlation section, it is necessary to convert the vocabulary probability vectors
$s_1^{(5)}, \cdots, s_M^{(5)}$ to one-hot vocabulary vectors $S'_j$ that can be directly mapped to natural language.5

5Let $\hat{s}_j = \arg\max(s_j^{(5)})$; $S'_j$ is defined as the one-hot vector in which the $\hat{s}_j$-th value is 1 and all other values are zero.
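Putting the pieces together, greedy decoding can be sketched as the loop below. It reuses the illustrative generator_step from the earlier sketch; the embedding matrix, the GRU, the parameter tuple, the inverse vocabulary, and the assumption that the GRU memory has the same size as e are all placeholders rather than the trained DSRM.

import numpy as np

def infer_section(e, h, embed, gru, params, inv_vocab, embed_dim, max_len=60):
    """Greedily generate a clinical correlation section, one word at a time, until </p> appears."""
    words, memory = [], np.zeros_like(e)        # assume the GRU memory has the same size as e
    prev_embed = np.zeros(embed_dim)            # s_0^(0) is defined as a zero vector
    for _ in range(max_len):                    # max_len guards against non-terminating output
        probabilities, memory = generator_step(prev_embed, e, h, memory, gru, params)
        word_id = int(np.argmax(probabilities)) # the most probable vocabulary entry (hat{s}_j)
        word = inv_vocab[word_id]
        if word == "</p>":                      # the END-OF-SECTION symbol terminates generation
            break
        words.append(word)
        prev_embed = embed[word_id]             # feed the generated word back in at the next step
    return words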

6.2.3 Experiments

We evaluated the performance of the Deep Section Recovery Model (DSRM) using the Temple

University Hospital EEG Corpus (Harati et al., 2013) (described in the Data section) with a

standard 3:1:1 split for training, validation, and testing sets. The performance of our model was

compared against four baseline systems:

1. NN:Cosine. In this nearest-neighbor baseline, we represented each EEG report as a bag-of-

words vector. This baseline infers the clinical correlation for a given EEG report by copying

the clinical correlation associated with the EEG report in the training set whose bag-of-words

vector had the least cosine distance to the bag-of-words vector representation of the given

EEG report.

2. NN:LDA. In the second nearest-neighbor baseline, we represented each EEG report as a

latent topic vector which was computed by applying Latent Dirichlet Allocation (Blei et al.,

2003) to the EEG reports in the training set. This allowed us to infer the clinical correlation
for a given EEG report by copying the clinical correlation associated with the EEG report
in the training set whose topic-vector representation has the least Euclidean distance to the

topic-vector representation of the given EEG report.

3. DL:Attn-RNLM. The first deep-learning baseline considers a recurrent neural language

model (Mikolov et al., 2010) (RNLM) using the standard attention mechanism operating on

the embedded word-representations of a given EEG report. This baseline closely resembles

the DSRM if the Extractor component were removed.

4. DL:Basic-S2S. The second deep-learning baseline uses a standard Sequence-to-Sequence (Cho

et al., 2014) model without attention. This baseline closely resembles the DSRM if word-

level feature vectors (i.e., h 1, · · · , h N ) were not extracted and only the report-level feature

vector is considered by the Generator.
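As an illustration of how simple the strongest non-neural baseline is, the NN:Cosine strategy can be approximated with a few lines of scikit-learn; the variables holding the training reports and their clinical correlation sections are placeholders.

from sklearn.feature_extraction.text import CountVectorizer
from sklearn.metrics.pairwise import cosine_similarity

def nn_cosine_baseline(train_reports, train_sections, test_report):
    """Copy the clinical correlation section of the most similar training report (bag-of-words cosine)."""
    vectorizer = CountVectorizer()
    train_vectors = vectorizer.fit_transform(train_reports)
    test_vector = vectorizer.transform([test_report])
    similarities = cosine_similarity(test_vector, train_vectors)[0]
    return train_sections[similarities.argmax()]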

Implementation Details

Our model and the two deep learning baselines were implemented in Tensorflow6 version 1.0. For

all deep learning models, we used a mini-batch size of 10 EEG reports, a maximum EEG report

length of 800 words, a maximum clinical correlation section length of 60 words, 200-dimensional

vectors for word embeddings, and 256 hidden units in all RNNs based on a grid search over the

validation set.

Table 6.2. Evaluation of automatically inferred clinical correlation sections.

System/Model    BLEU-1      BLEU-2      BLEU-3      ROUGE-1     ROUGE-2     ROUGE-3     WER

NN:Cosine       0.55334∗∗∗  0.40274∗∗∗  0.32137∗∗∗  0.54284∗∗∗  0.38516∗∗∗  0.31508∗∗∗  2.521∗∗∗
NN:LDA          0.51730∗∗   0.36316∗∗∗  0.28199∗∗∗  0.52389∗∗∗  0.36863∗∗∗  0.28686∗∗∗  2.891∗∗∗
DL:Attn-RNLM    0.57907∗∗   0.41619∗    0.32433∗    0.58196∗∗∗  0.41960∗    0.32575∗∗∗  2.315∗∗∗
DL:Basic-S2S    0.58992∗∗   0.36829∗∗∗  0.26806∗∗∗  0.47487∗∗∗  0.31170∗∗∗  0.23445∗∗∗  2.658∗∗∗
DSRM            0.68792     0.54686     0.46323     0.63523     0.50459     0.42894     1.631

∗ p < 0.05, ∗∗ p < 0.01, ∗∗∗ p < 0.001; statistical significance against DSRM using the Wilcoxon signed-rank test.

Experimental Setup and Results

Evaluating the quality of automatically produced natural language (such as the inferred clinical cor-

relation sections) is an open problem in the natural language processing community. Consequently,

to quantify the quality of the clinical correlation sections inferred by all four baseline systems as

well as the DSRM, we considered standard metrics used to evaluate machine translation, automatic

summarization, and speech recognition.

We measured the surface-level accuracy of an automatically inferred clinical correlation section

in two ways: (1) the Word Error Rate (Jurafsky and James, 2000) (WER) which measures how many

“steps” it takes to transform the inferred clinical correlation section into the gold-standard clinical

correlation section produced by the neurologist, where steps include (a) insertion, (b) deletion, or

6https://www.tensorflow.org/

(c) replacement of individual words in the inferred clinical correlation; (2) the Bilingual Evaluation
Understudy (Papineni et al., 2002) (BLEU) metric which is a commonly used analogue for Precision
in language generation tasks. The surface-level completeness of each inferred clinical correlation
section was measured using the Recall-Oriented Understudy for Gisting Evaluation (Lin, 2004)
(ROUGE), a commonly used analogue for Recall (i.e., Sensitivity) in language generation tasks.
Finally, we measured the surface-level coherence by additionally computing the bi-gram and tri-
gram variants of BLEU and ROUGE, which have been shown to correspond to human notions
of coherence. It is important to note that the WER, BLEU, and ROUGE metrics do not take
into account the similarity between individual words nor the semantics of multi-word expressions.
For example, if the gold-standard clinical correlation contains “absence of epileptiform features”,
then the excerpt “no epileptiform activity” would have BLEU-2 and ROUGE-2 scores of zero
and a WER of 2 despite the fact that both excerpts express the same information. Consequently,
these surface-level metrics should be interpreted as strict lower-bounds on the performance of each
evaluated system. Table 6.2 presents these results.
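For reference, the sketch below computes a word-level edit distance normalized by the length of the gold-standard section, which is one standard formulation of WER; the exact step counting and normalization behind Table 6.2 may differ.

def word_error_rate(inferred, gold):
    """Minimum insertions, deletions, and substitutions needed to turn `inferred` into `gold`,
    normalized by the number of gold-standard words."""
    rows, cols = len(inferred) + 1, len(gold) + 1
    distance = [[0] * cols for _ in range(rows)]
    for i in range(rows):
        distance[i][0] = i
    for j in range(cols):
        distance[0][j] = j
    for i in range(1, rows):
        for j in range(1, cols):
            cost = 0 if inferred[i - 1] == gold[j - 1] else 1
            distance[i][j] = min(distance[i - 1][j] + 1,        # deletion
                                 distance[i][j - 1] + 1,        # insertion
                                 distance[i - 1][j - 1] + cost) # substitution
    return distance[-1][-1] / len(gold)

print(word_error_rate("mild diffuse slowing".split(), "diffuse slowing".split()))  # 0.5: one deletion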
It can be seen that the DSRM achieved the best over-all performance. Moreover, it can be
observed that the attention-based RNLM baseline (DL:Attn-RNLM) achieved the second-best performance.
The Basic Sequence-to-Sequence model (Basic-S2S) as well as the Cosine and LDA nearest
neighbor approaches achieved comparable, but only moderate performance. The high performance
of the DSRM compared to the Basic S2S model indicates the importance of incorporating attention,
allowing the model to discover latent relationships between words in the EEG report and each word
in the clinical correlation section. Moreover, the improvement in performance shown by the DSRM
compared to the Attention Decoder indicates that the clinical correlation cannot be generated solely
from word-level features: report-level information should be considered as well.

6.2.4 Discussion

In order to analyze the automatically inferred clinical correlation sections produced by the DSRM,
we manually reviewed 100 randomly selected EEG reports from the test set by comparing the

Example 1

Report: 00005044_s03
Inferred: No epileptiform features are identified. If epilepsy is an important consideration, a repeat EEG
capturing deeper stages or sleep deprivation prior to the EEG may be helpful to identify epileptiform
activity.
Gold: There are no definitive epileptiform discharges, but there is an amplitude asymmetry and there is
an asymmetry of wicket activity. Additional recording capturing more extensive sleep may be helpful to
identify epileptiform activity.

Example 2

Record: 00010462_s01
Inferred: This EEG supports a severe underlying encephalopathy and diffuse disturbance of cerebral
dysfunction involving both gray and white matter. Contributing factors can include some of the renal
failure, acute or metabolic processes. The focal features described above should be correlated with
imaging.
Gold: This abnormal EEG demonstrates a severe, diffuse disturbance of cerebral function involving
both gray and subcortical white matter. This EEG pattern was communicated to the primary care team.

Example 3

Report: 000004928_s02
Inferred: This EEG is not suggestive of a metabolic or intermittent encephalopathy. The rare left with
focal feature suggests conforms with underlying metabolic pattern.
Gold: As discussed with the team on the date of this recording, this EEG is most compatible with a
metabolic encephalopathy.

Figure 6.9. Comparisons of inferred and gold-standard clinical correlation sections for three EEG
reports.

Table 6.3. Adjusted Likert scale used to assess over-all quality of inferred clinical correlation
sections.

1: (strongly disagree) clinical correlation section is incomprehensible
2: (disagree) clinical correlation section is not correct
3: (weakly agree) clinical correlation section is generally correct, but omits important information or contains additional false or inconsistent information
4: (agree) clinical correlation section is correct but omits minor details
5: (strongly agree) clinical correlation section is effectively equivalent to the gold-standard

inferred clinical correlation sections to the gold-standard clinical correlation sections written by

the neurologists. The over-all quality of the inferred clinical correlation sections was assessed

using the Likert scale illustrated in Table 6.3, with the DSRM obtaining an average score of 3.491,

indicating that the inferred clinical correlation sections are generally accurate, but may contain

minor additional erroneous information or have minor omissions.

Figure 6.9 illustrates the inferred clinical correlation as well as the gold-standard clinical

correlation section for three EEG reports in the test set. Example 1 illustrates an example of a

correct, but incomplete inferred clinical correlation section. Both the inferred and gold-standard

clinical correlation sections agree that (1) no epileptiform discharges were observed, and (2) that

a repeat EEG focusing on extensive sleep is needed. However, the gold-standard clinical correlation
includes additional details about an amplitude asymmetry and an asymmetry of wicket activity which the DSRM

omitted.

Example 2 illustrates an inferred clinical correlation section which accurately expresses the dif-

fuse disturbance of cerebral function. However, the inferred clinical correlation section additionally

indicates a “severe underlying encephalopathy” which was not expressed in the gold-standard clin-

ical correlation section. Moreover, the inferred clinical correlation section attempts to correlate the

findings with the patient’s “renal failure, acute or metabolic processes” and indicates that

these findings should be correlated with imaging. While these inclusions highlight the model’s

ability to accumulate knowledge across the large corpus of EEGs in the training set in order to sim-

ulate experience, they also demonstrate that the model occasionally struggles to determine which

information is (or is not) relevant.

The inferred clinical correlation in Example 3 illustrates a relatively rare (15% of

reviewed EEG reports) but significant error: contradiction within the inferred clinical correlation

sections. While the first sentence (incorrectly) states that the EEG does not suggest metabolic

encephalopathy, the second sentence indicates that it does. This error strongly suggests that the

performance of the model could be improved by developing and incorporating a more sophisticated

loss function: the average cross-entropy loss (shown in Equation (6.21)) considers each individual

word in the inferred clinical correlation equally; thus, the incorrect inclusion of “not” in the first

sentence has a very small impact on the loss despite it inverting the meaning of the entire sentence.

6.2.5 Lessons Learned

In this section, we have presented a deep learning approach for automatically inferring the clini-

cal correlation section for a given EEG report, which we call the Deep Section Recovery Model

(DSRM). While traditional approaches for inferring clinical correlations would require hand-

crafting a large number of sophisticated features, the DSRM learns to automatically extract word-

and report- level features from each EEG report. Our evaluation on over 3,000 EEG reports re-

vealed the promise of the DSRM: achieving an average of 17% improvement over the top-performing

baseline. These promising results provide a foundation towards automatically identifying unusual,

incorrect, or inconsistent clinical correlations from EEG reports in the future. Immediate avenues

for future work include (1) considering more sophisticated loss functions which incorporate con-

textual and semantic information and (2) an in-depth study and evaluation of metrics for quantifying

the degree of disagreement between a given clinical correlation section and the inferred or expected

clinical correlation section.

6.3 Summary

In this chapter, we described two techniques for overcoming underspecified or missing information
in EEG reports. We showed how these techniques could be used to infer the most likely value of an
underspecified variable (the over-all impression of an EEG report) or to recover (i.e., generate) the
most likely natural language content of a (missing) clinical correlation section in an EEG report.
Our experimental results indicate not only the power of deep learning techniques for processing the
information encoded in EHRs, but also suggest the promise of such techniques for allowing patient
cohort retrieval and medical question answering systems to overcome the barriers of missing and
underspecified information in EHRs by indexing the information inferred by both techniques. We
believe that the deep learning techniques presented in this chapter provide a promising step towards
enabling automatic systems to "read between the lines" of an EHR and reason directly about the
patient it represents.

CHAPTER 7

LEARNING TO RANK FOR MEDICAL INFORMATION RETRIEVAL

Authors – Travis R. Goodwin, Michael A. Skinner, and Sanda M. Harabagiu

The Department of Computer Science, EC 31

The University of Texas at Dallas

800 West Campbell Road

Richardson, Texas 75080-3021

Minor revision, with permission, of Travis R. Goodwin, Michael A. Skinner and Sanda M.
Harabagiu, Automatically Linking Registered Clinical Trials to their Published Results with Deep
Highway Networks, Proceedings of the American Medical Informatics Association (AMIA) Infor-
matics Summits, 2018.
Dr. Skinner, MD, provided interpretation of clinical texts which was used to design the system
reported in the chapter.

When designing an information retrieval (IR) system for medical applications such as medical

question answering, clinical decision support, or patient cohort retrieval, one of the first obstacles

encountered is the lack of established robust and reliable criteria for measuring relevance of medical

documents to given queries. Some systems incorporate custom task-specific relevance models,

such as the probabilistic factor-driven relevance model for clinical decision support described in

Chapter 2 on page 11. However, these customized relevance models can be difficult to generalize to

new applications or collections. Consequently, most medical IR systems rely on existing relevance

models designed for Ad-hoc information retrieval of general English texts. For example, the IR

systems described in Chapters 2 to 4 on page 11, on page 53 and on page 91 all incorporate the

BM25 (Robertson et al., 1995) relevance model.

There are a large number of established relevance models for Ad-hoc information retrieval,

including: Best Match 25 (BM25) (Robertson et al., 1995), language model approaches (Zhai and

Lafferty, 2004) (LMD), Axiomatic relevance (Fang and Zhai, 2005), and Divergence from Independence

(DFI) (Kocabaş et al., 2014). Unfortunately, there are no clear guidelines for choosing the correct

relevance model for a given application, making it difficult to know which relevance model should

be applied. This is exacerbated by the fact that each relevance model relies on subtle relevance

assumptions that may not always be appropriate for medical texts. For example, the BM25 model

operates under the assumption that the relevance of a document is directly related to the proportion

of content in the document that pertains to the query. While this is a reasonable assumption for

ad-hoc web search, electronic health records (EHRs) typically document the entire clinical picture

of a patient, and, thus, it is possible (and indeed, likely) that an EHR may be relevant to a query

even if only a small portion of its content pertains to the query.

Fortunately, the problem of determining optimal relevance criteria for a specific application

can be addressed by a supervised machine learning framework known as learning-to-rank (L2R)

(Liu, 2011). In this chapter, we describe how learning-to-rank can be used to enrich the results of

medical IR systems designed for clinical decision support. Specifically, we present NCT Link,1 a

system for automatically linking registered clinical trials to published scientific articles reporting

their results. NCT Link incorporates state-of-the-art deep learning techniques through a specialized

Deep Highway Network (DHN) (Srivastava et al., 2015b) designed to determine the likelihood that

a link exists between an article and a clinical trial by considering the variety of information about

the article, the trial, and the relationships (if any) between them. Our experiments demonstrate that

NCT Link provides a 30%-58% improvement over the automatic methods surveyed in Bashir et al.

(2017); consequently, we believe that NCT Link will provide a valuable tool for health care providers

seeking to obtain timely access to the publications reporting the results of clinical trials. Moreover,

we surmise that NCT Link may also benefit (a) researchers investigating selective publication and

reporting of clinical trial outcomes and (b) study designers aiming to avoid unnecessary duplication

of research efforts (De Angelis et al., 2004).

7.1 Background

Seeking to deliver best-practice medical care, clinicians increasingly rely on information provided

by published guidelines and systematic reviews. However, recent analyses have estimated that less

than 15 percent of major medical guidelines are supported by high-quality evidence (Tricoci et al.,

2009; Lee and Vielemeyer, 2011). To bridge this gap, health care professionals are increasingly

turning to evidence from clinical trials to help evaluate different treatment options (De Angelis

et al., 2004). To provide more convenient access to clinical trials for persons with serious medical

conditions and to make the results of clinical trial more available to health care providers, the United

States Congress mandated in 1997 the development of the online trial registry ClinicalTrials.gov.

In 2007, in accordance with the increasing role of evidence-based medicine, the mandate was

expanded by requiring the timely inclusion of clinical trial results within the registry for all sponsors

1NCT Link is named after the Clinical Trial identifier, NCT ID, used by ClinicalTrials.gov.

of non-phase-1 human trials seeking FDA approval for a new device or drug (Congress, US, 2007).

Moreover, to further increase the availability of study information to patients, physicians, and

investigators, the International Committee of Medical Journal Editors (ICMJE) mandated the

registration of trials before considering publication of trial results (De Angelis et al., 2004).

Unfortunately, despite the numerous policies intended to improve the timely accessibility of

clinical trial results to clinicians, there remain several barriers hindering effective use of these

important data. First, sponsors and investigators have inconsistently complied with the requirement

to update the registry with trial results. In an evaluation of eligible human studies registered at

ClinicalTrials.gov, Anderson et al. (2015) found that only 13.4% of the trials reported summary

results within 12 months of study completion, and only 38.3% of the registered studies reported

any results at any time. Moreover, once trial results are published in peer-reviewed literature, the

article citation is only provided to the ClinicalTrials.gov registry in about 23%-31% of cases (Ross

et al., 2009; Huser and Cimino, 2013). When registered trials with no reported publications were

manually reviewed, both Ross et al. (2009) as well as Huser and Cimino (2013) were able to find

relevant MEDLINE articles for 31%-45% of reviewed clinical trials. Finally, despite the ICMJE

recommendation that pertinent publications of trial results should contain a specific citation of the

trial registry number to allow simple retrieval of the article with a MEDLINE search, (International

Committee of Medical Journal Editors (ICMJE) et al., 2004) this information is included in only

about 7% of articles presenting trial results (Huser and Cimino, 2013).

Recently, Bashir et al. (2017) conducted a systematic review of studies examining “links” be-

tween registered clinical trials and the publications reporting their results and found that 83% of

studies required some level of manual (i.e., human) analysis (with 19% involving strictly manual

analyses, 64% involving both manual and automatic analyses and 17% involving automatic anal-

yses). They also observed that despite the increasing pressures from journal editors to provide

information about any clinical trials associated with a publication, the number of articles amenable

to being automatically linked to the clinical trials they report has not increased over time. Finally,

they found that automatic methods were only able to identify a median of 23% of articles reporting
the results of registered trials, leading them to conclude that identifying publications reporting the
results of a clinical trial remains an arduous, manual task. Clearly, there is a need for the creation
of robust methods to automatically link clinical trials with their results in the medical literature.

7.2 Problem Formulation

In previous studies examining links between registered clinical trials and published articles, inves-
tigators have described at least three ways that a published article may be considered linked to a
clinical trial. For example, an article may (1) relate in some way to the trial, e.g., by providing
supporting evidence for the intervention or highlighting limitations of previous, related studies; (2)
be cited in the summary or official description of the trial; or (3) report the results of the trial. In this
work, we focus exclusively on the third type of link: articles which report the results of a clinical
trial; consequently, we consider a publication to be linked to a clinical trial if and only if it reports
the results of the trial. Moreover, as in Huser and Cimino (2013) we only consider links between
clinical trials registered to ClinicalTrials.gov and published articles indexed by MEDLINE.


Figure 7.1. Architecture of NCT Link.

NCT Link, illustrated in Figure 7.1, operates in five steps:


1. Trial Search: given an NCT ID, the (meta)data associated with the trial, denoted as t, is
obtained from the registry at ClinicalTrials.gov;

2. Article Search: the information in t is used to obtain a subset of potentially-linked arti-

cles (along with their metadata), denoted as A = a1, a 2, · · · , a L , using a specialized local

MEDLINE index (where L is the maximum number of articles considered by NCT Link);

3. L2R: Feature Extraction: each article ai ∈ A retrieved for t is associated with a feature

vector vi encoding a number of complex features characterizing information about t, ai , and

the relationship between them;

4. L2R: Deep Highway Network: a Deep Highway Network (DHN) is used to infer a score si

for each article ai ∈ A quantifying the likelihood that ai should be linked to (i.e., reports the

results of) t;

5. L2R: Ranking: the score si associated with each article ai is used to produce a ranked list

of published articles such that the rank of each article corresponds to the likelihood that it

reports the results of t.

In the remainder of this section, we provide a detailed description of each of the five steps listed above.
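At a high level, the pipeline can be pictured as the following sketch; every object and helper here (the indexes, build_query, extract_features, and the trained network) is a placeholder for the components detailed in the following sections.

def nct_link(nct_id, trial_index, article_index, build_query, extract_features, dhn_model, L=2000):
    """Rank candidate MEDLINE articles by the likelihood that they report the results of a trial."""
    trial = trial_index.lookup(nct_id)                                    # Step 1: Trial Search
    articles = article_index.search(build_query(trial), limit=L)          # Step 2: Article Search
    vectors = [extract_features(trial, article) for article in articles]  # Step 3: feature extraction
    scores = dhn_model.predict(vectors)                                   # Step 4: Deep Highway Network
    ranked = sorted(zip(articles, scores), key=lambda pair: pair[1], reverse=True)
    return ranked                                                         # Step 5: ranked article list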

7.3 Searching Clinical Trials

NCT Link operates on an NCT ID specified by the user. The NCT ID is used to obtain all the

(meta)data stored in the ClinicalTrials.gov registry for the given trial. While the National Library of

Medicine (NLM) provides an online interface for programmatically obtaining data about a clinical

trial specified by an NCT ID, to reduce the burden on the NLM’s servers potentially imposed by

our experiments, we instead created and used our own offline index of all clinical trials registered

on ClinicalTrials.gov.

7.3.1 Representing Clinical Trials

Due to the significant variation in the amount of data associated with clinical trials, NCT Link

considers only eight key aspects of each clinical trial: (1) the set of investigators2 associated with

the trial, (2) the set of unique institutions associated with any investigators, (3) the NCT ID of the

trial, (4) the set of interventions studied in the trial, (5) the set of conditions studied in the trial,

(6) the set of keywords provided to the registry, (7) the set of Medical Subject Headings (MeSH)

terms provided to the registry, and (8) the completion date of the trial3. In the remainder of this

paper, we use t to simultaneously refer to a clinical trial as well as all eight aspects of information

associated with the trial.
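These eight aspects amount to a simple record, sketched below; the field names and types are our own illustration rather than the registry's schema.

from dataclasses import dataclass
from datetime import date
from typing import FrozenSet

@dataclass(frozen=True)
class ClinicalTrial:
    """The eight aspects of a registered clinical trial considered by NCT Link."""
    investigators: FrozenSet[str]
    institutions: FrozenSet[str]
    nct_id: str
    interventions: FrozenSet[str]
    conditions: FrozenSet[str]
    keywords: FrozenSet[str]
    mesh_terms: FrozenSet[str]
    completion_date: date   # the start date is used when the completion date is unspecified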

7.4 Searching MEDLINE Articles

Because MEDLINE contains over 14 million articles, rather than applying the learning-to-rank

component to process and score every article in MEDLINE, we first obtain a smaller, “high-recall”

sub-set of candidate MEDLINE articles that are likely to report the results of t. In this section,

we describe the MEDLINE searching strategy used for both (1) obtaining this high-recall set of

candidate MEDLINE articles as well as (2) feature extraction (described later).

7.4.1 Indexing MEDLINE Articles

To search MEDLINE, NCT Link incorporates its own internal, offline index of every article in

MEDLINE. This index encodes eight fields (i.e., metadata attributes) for each article in MEDLINE:

(1) the authors4 of the article (if any), (2) the investigators4 of the article (if any), (3) the PubMed

2Investigators are represented in the registry through three structured fields indicating the investigator’s first,
middle, and last names.

3If the completion date is unspecified, or if the trial is not yet complete, the start date of the trial is used.

4In MEDLINE, authors and investigators are encoded by structured fields corresponding to their first and last
names as well as their initials.

identifier (PMID) associated with the article, (4) the accession numbers (e.g., NCT IDs) of
any ClinicalTrials.gov entries in the list of “DataBanks” associated with the article, (5) the full
unstructured text of the abstract5, (6) the title of the article, (7) any MeSH terms associated with
the article, and (8) the publication date of the article.

[Figure 7.2 relates the eight clinical trial aspects (Investigators, Institutions, NCT ID, Interventions, Conditions, Keywords, MeSH Terms, Completion Date) from the trial index to the eight MEDLINE article fields (Authors, Investigators, PMID, DataBanks, Abstract, Title, MeSH Terms, Publication Date) from the article index.]

Figure 7.2. The aspects of medical trials, the fields indexed for MEDLINE articles, and the mapping
between them when searching MEDLINE articles. Note: although MEDLINE distinguishes
between investigators and authors of published articles, NCT Link currently treats the authors and
investigators of MEDLINE articles in the same way.

7.4.2 Query Formulation

A clinical trial t is represented by a disjunctive Boolean query in which each aspect corresponds to
a clause. Each clause, in turn, is represented by a disjunction of natural language terms (or phrases)
encoding the values (e.g., investigators, conditions, etc.) associated with that aspect. Interventions,
conditions, and keywords are expanded using synonyms provided by the Unified Medical Language
System (Bodenreider, 2004). To account for variations in the way affiliations were expressed, each
affiliation was represented by a sequence of “partial locations” by splitting the text of the affiliation
(e.g., “University of California, San Francisco”) on occurrences of commas (e.g., “University of
California” and “San Francisco”). Likewise, due to differences in how authors and investigators
are reported to MEDLINE by various journals, each author/investigator is represented by a sieve

5For structured abstracts, the content of all sections was combined to create a single unstructured passage of text.

consisting of four queries, each less specific than the previous: (1) first name, middle initial, and

last name, (2) first initial, middle initial and last name, (3) first name and last name, and (4) the

first initial and last name. To account for the progressive loss of specificity, we associated each

query with a weight of 1.0, 0.5, 0.3, and 0.2, respectively, which multiplicatively affects the score

(described below) of any article retrieved for the query. The clause associated with each aspect is

restricted to the set of semantically related fields illustrated in Figure 7.2.
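A sketch of this query formulation step is shown below. The function names and the plain-string query representation are illustrative assumptions rather than the exact Lucene query syntax used by NCT Link; only the four-query sieve, its weights, and the comma-based splitting of affiliations follow the description above.

    def investigator_sieve(first, middle, last):
        """Build the four progressively less specific name queries for one
        investigator, paired with the weights described above."""
        def name(*parts):
            return " ".join(p for p in parts if p)
        fi = first[:1] if first else ""
        mi = middle[:1] if middle else ""
        return [
            (name(first, mi, last), 1.0),  # (1) first name, middle initial, last name
            (name(fi, mi, last), 0.5),     # (2) first initial, middle initial, last name
            (name(first, last), 0.3),      # (3) first name, last name
            (name(fi, last), 0.2),         # (4) first initial, last name
        ]

    def partial_locations(affiliation):
        """Split an affiliation on commas into 'partial locations', e.g.
        'University of California, San Francisco' ->
        ['University of California', 'San Francisco']."""
        return [part.strip() for part in affiliation.split(",") if part.strip()]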

7.4.3 Scoring MEDLINE Articles

When searching our internal MEDLINE index, candidate articles are retrieved (i.e., selected) and

scored using the BM25 (Robertson et al., 1995) relevance model.6 This allows the high-recall set

of candidate articles A to be defined as the top ranking retrieved articles a1, a2, · · · , a L where L is

the number of candidate MEDLINE articles considered by NCT Link. Conceptually, L acts as an

upper bound on the number of articles the user of the system might be interested in examining. In

our experiments, to ensure thorough evaluations, we used L = 2, 000. However, for general use, a

smaller value of L should be sufficient, e.g., L = 100.
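The following simplified, self-contained sketch illustrates how candidate articles could be retrieved and scored with BM25 under the reduced document-length normalization noted in the footnote; it is not the Lucene-based implementation used by NCT Link.

    import math
    from collections import Counter

    def bm25_rank(query_terms, docs, k1=0.25, b=0.75, top_l=2000):
        """Score each tokenized document for a bag-of-words query with BM25
        and return the top_l (document index, score) pairs."""
        n = len(docs)
        doc_tfs = [Counter(d) for d in docs]                    # term frequencies per document
        avgdl = (sum(len(d) for d in docs) / n) if n else 1.0   # average document length
        df = Counter()                                          # document frequencies
        for tf in doc_tfs:
            df.update(tf.keys())

        def idf(term):
            return math.log((n - df[term] + 0.5) / (df[term] + 0.5) + 1.0)

        scores = []
        for i, tf in enumerate(doc_tfs):
            dl = len(docs[i])
            score = 0.0
            for t in query_terms:
                if tf[t]:
                    score += idf(t) * tf[t] * (k1 + 1) / (tf[t] + k1 * (1 - b + b * dl / avgdl))
            scores.append((i, score))
        return sorted(scores, key=lambda x: x[1], reverse=True)[:top_l]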

7.5 Learning-to-Rank (L2R)

Given a clinical trial t, and a set of articles A = a1, a 2, · · · , a L , the learning-to-rank module is

responsible for (1) extracting features encoding the relationship between each article ai and the

clinical trial t; (2) training (or using) state-of-the-art deep learning methods – a Deep Highway

Network – to score each article based on the likelihood that it reports the results of t; and (3)

produce a ranked list of MEDLINE articles sorted by their scores.

6To reduce the impact of abstract length on the ranking of candidate MEDLINE articles, we specified the BM25
document-length normalization term as k1 = 0.25 rather than the standard value of 0.75 (Robertson et al., 1995).

7.5.1 Feature Extraction from MEDLINE Articles and Clinical Trials

Determining whether a link exists between an article ai and a clinical trial t requires considering
a large variety of information which varies from trial to trial and article to article. For this reason,
deciding whether an article ai ∈ A reports the results of t requires access to a rich set of features.
When extracting features, we consider (1) the eight aspects of clinical trial t (described in Searching
Clinical Trials), (2) the eight fields associated with article ai (described in Searching MEDLINE
Articles), and (3) the mapping between aspects of t and the corresponding fields of ai illustrated
in Figure 7.2. Table 7.1 lists all the features extracted for each article ai retrieved for trial t as well
as the domain (i.e., number and type of values) of each feature, where N denotes the set of natural
numbers, R denotes the set of real numbers, and the exponent indicates the number of values (e.g.,
R5 corresponds to five distinct real numbers).

Table 7.1. Features extracted for each article ai retrieved for trial t.

Feature Description Domain


F1 number of investigators in t N
F2 number of interventions in t N
F3 number of conditions in t N
F4 completion date of t N
F5 days elapsed between the publication date N
of ai and the completion date of t
F6 BM25 from the NCT ID of t to ai R
F7 F2EXP from the NCT ID of t to ai R
F8 DFI from the NCT ID of t to ai R
F9 LMD from the NCT ID of t to ai R
F10 BM25 from all keywords of t to ai R
F11 F2EXP from all keywords of t to ai R
F12 DFI from all keywords of t to ai R
F13 LMD from all keywords of t to ai R
F14 BM25 from all MeSH terms of t to ai R
F15 F2EXP from all MeSH terms of t to ai R
F16 DFI from all MeSH terms of t to ai R
F17 LMD from all MeSH terms of t to ai R
F18 BM25 statistics from each investigator in t to ai R5
F19 F2EXP statistics from each investigator in t to ai R5
F20 DFI statistics from each investigator in t to ai R5
F21 LMD statistics from each investigator in t to ai R5
F22 BM25 statistics from each intervention in t to ai R5
F23 F2EXP statistics from each intervention in t to ai R5
F24 DFI statistics from each intervention in t to ai R5
F25 LMD statistics from each intervention in t to ai R5
F26 BM25 statistics from each condition in t to ai R5
F27 F2EXP statistics from each condition in t to ai R5
F28 DFI statistics from each condition in t to ai R5
F29 LMD statistics from each condition in t to ai R5
F30 number of authors in ai N
F31 number of investigators in ai N
F32 publication date of ai N
F33 publication type(s) of ai {0, 1}38

As shown in Table 7.1, three types of features are extracted: (1) trial features (F1 - F4 ),

encoding information about t which is independent of ai ; (2) dynamic features (F5 - F29 ), encoding

information about the relationship between ai and t; and (3) article features (F30 - F33 ), encoding

information about ai which is independent of t. Features F1 - F3 allow the model to account for

the fact that the more investigators, interventions, or conditions associated with t, the more likely

it is that an article will have an investigator, intervention, or condition in common with t. Features

F6 - F29 adapt four commonly used relevance models to act as similarity measures between an

aspect of t and an article ai . Specifically, we used: (1) the Best Match 25 (Robertson et al., 1995)

(BM25), (2) Dirichlet-Smoothed language model probability (Zhai and Lafferty, 2004) (LMD),

(3) Axiomatic relevance (Fang and Zhai, 2005) (F2Exp), and (4) Divergence from Independence

(Kocabaş et al., 2014) (DFI). To account for the significant variance in the number of investigators,

as well as the prevalence of common names, conditions, or interventions, F18 - F29 measure five

statistics capturing the similarity between each investigator, condition, or intervention in t and

ai , namely, the mean, minimum, maximum, variance, and sum. Feature F33 encodes the MeSH

publication type(s) associated with ai (in our experiments, we encountered only 38 different types

of publications). The values of features F1 - F33 are concatenated together to form a single vector vi

allowing the Deep Highway Network to consider and combine a variety of different interactions

between the aspects of t and the fields of article ai .
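A sketch of this step is given below; the function and argument names are illustrative assumptions, but the five statistics and the concatenation into vi follow the description above.

    import numpy as np

    def similarity_stats(scores):
        """Reduce a list of per-investigator / per-condition / per-intervention
        similarity scores to the five statistics used by features F18-F29:
        mean, minimum, maximum, variance, and sum."""
        if len(scores) == 0:
            return np.zeros(5)
        a = np.asarray(scores, dtype=float)
        return np.array([a.mean(), a.min(), a.max(), a.var(), a.sum()])

    def build_feature_vector(trial_features, dynamic_features, article_features):
        """Concatenate trial features (F1-F4), dynamic trial-article features
        (F5-F29), and article features (F30-F33) into the single vector v_i
        consumed by the Deep Highway Network."""
        return np.concatenate([
            np.asarray(trial_features, dtype=float),
            np.asarray(dynamic_features, dtype=float),
            np.asarray(article_features, dtype=float),
        ])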

7.5.2 The Deep Highway Network

Owing to the lack of clear and exact criteria for determining whether a link exists between t and ai ,

we were interested in applying deep learning techniques to automatically learn contextual high-level

and expressive “meta”-features by combining the elements of vi . However, a common problem

when designing deep learning networks is that there are no clear criteria or guidelines for

deciding the number (and configuration) of internal or “deep” layers in the network. Fortunately,

by taking advantage of recent advances in deep structure learning, we were able to define a deep

neural network which automatically tunes the number of internal layers used. Specifically, we

implemented a Deep Highway Network (DHN) (Srivastava et al., 2015a). Unlike traditional deep

networks, in which information flows through each layer of the network sequentially, DHNs allow

information to “skip” layers in the network by traveling along a so-called “information highway”7.

Thus, in a DHN, the number of specified internal layers acts as an upper bound on the number of

layers used by the model. In fact, the information highway allows DHNs to be constructed with

hundreds of intermediate layers – for example DHNs with over 1,000 intermediate layers have been

7Additionally, and perhaps more importantly, the information highway allows the gradient to directly influence
each layer during back propagation, effectively eliminating the vanishing gradient problem and allowing very deep
networks to be trained.

reported (Srivastava et al., 2015b). The DHN we have implemented within NCT Link considers a

maximum of 10 internal layers8 and is illustrated in Figure 7.3.

[Figure 7.3 shows the feature vector vi passing through a linear projection, a sequence of Highway Layers 1 through F (each built around a ReLU), a second linear projection, and a sigmoid (σ) that produces the score si.]

Figure 7.3. Architecture of the Highway Network used in NCT Link.

As shown, the main component of each intermediate layer, l, is a Rectified Linear Unit (ReLU)

(Glorot et al., 2011), i.e.,

x_{l+1} = \mathrm{ReLU}(x_l) = \max(x_l, 0) \qquad (7.1)

where x_l indicates the output of layer l and 0 denotes a zero-vector. In the DHN, each ReLU layer

is augmented with a highway mechanism composed of two gates: (1) a transform gate, T ∈ [0, 1],

which learns a weight that is applied to the output of the ReLU, and (2) a carry gate, C = 1 − T,

which learns whether to skip, apply, or partially apply the ReLU in the layer to x l . Thus, the

highway mechanism enables the network to learn how many (and which) layers should be applied.

Formally, we define each layer in our DHN as follows:

T(x_l) = \sigma\left(x_l \cdot W_T^{(l)}\right) \qquad (7.2)

x_{l+1} = T(x_l) \cdot \mathrm{ReLU}\left(W_H^{(l)} \cdot x_l\right) + \left(1 - T(x_l)\right) \cdot x_l \qquad (7.3)

where W_T^{(l)} and W_H^{(l)} ∈ θ correspond to the learned weights of the transform gate and ReLU used in

layer l. Figure 7.4 illustrates the difference between a standard ReLU layer and a ReLU layer

incorporating a highway mechanism.

To produce the score si ∈ [0, 1] (i.e., the likelihood that ai reports the results of t) associated

with ai, the output of the final highway layer is projected down to a single element and mapped into

8We also experimented with 100 internal layers and observed no discernible change in performance.

[Figure 7.4 contrasts (a) a ReLU layer without a highway mechanism, in which xl is transformed by WH and the ReLU to produce xl+1, with (b) a ReLU layer with a highway mechanism, in which the transform gate T (computed from WT and σ) and the carry gate C = 1 − T blend the ReLU output with xl to produce xl+1.]

Figure 7.4. Comparison of ReLU layers with and without highway mechanisms.

the range [0, 1] by a sigmoid layer:

s_i = \sigma(W_s \cdot x_F) \qquad (7.4)

where \sigma(x) = e^x / (e^x + 1) is the logistic sigmoid function, W_s ∈ θ corresponds to the learned weights
of the final projection layer, and x_F indicates the output of the final ReLU layer.
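The scoring computation of Equations 7.1 to 7.4 can be sketched as follows. This NumPy version, with assumed weight shapes and an explicit input projection, is provided only for illustration and is not the TensorFlow implementation used by NCT Link.

    import numpy as np

    def sigmoid(x):
        return 1.0 / (1.0 + np.exp(-x))

    def dhn_score(v_i, W_in, highway_weights, W_s):
        """Score one article: project v_i into the hidden dimension, apply the
        highway layers of Equations 7.2-7.3, and squash the final projection
        through a sigmoid (Equation 7.4). highway_weights is a list of
        (W_T, W_H) pairs, one pair per highway layer."""
        x = W_in @ v_i                          # initial linear projection of v_i
        for W_T, W_H in highway_weights:
            t = sigmoid(x @ W_T)                # transform gate T(x_l), Eq. 7.2
            h = np.maximum(W_H @ x, 0.0)        # ReLU(W_H . x_l), Eq. 7.1
            x = t * h + (1.0 - t) * x           # carry gate C = 1 - T, Eq. 7.3
        return float(sigmoid(W_s @ x))          # score s_i in [0, 1], Eq. 7.4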

Training the Deep Highway Network

Training the DHN was achieved by finding the parameters θ most likely to predict the correct score

si for every article ai ∈ A retrieved for each clinical trial t in the training set T (details about

the training set and relevance judgments used in our experiments are provided in the Experiments

section). Formally, let yt,i indicate the relevance judgment of article ai with respect to trial t such

that yt,i = 1 if ai is relevant to (i.e., reports the results of) t and yt,i = 0, otherwise. We minimize

the entropy loss between the score si assigned by the DHN and the relevance judgment yt,i :

L(\theta) = -\sum_{(t, A) \in T} \sum_{i=1}^{L} \left[ y_{t,i} \log s_i + (1 - y_{t,i}) \log\left(1 - s_i\right) \right] \qquad (7.5)

The model was trained using Adaptive Moment Estimation (Kingma and Ba, 2015) (ADAM) (using

the default initial learning rate η = 0.001).
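A minimal sketch of this objective for a single trial's candidate list is given below; in NCT Link, the loss is summed over all trials in T and minimized with ADAM.

    import numpy as np

    def trial_loss(scores, judgments, eps=1e-12):
        """Binary cross-entropy between the DHN scores s_i and the relevance
        judgments y_{t,i} for the L candidate articles of one trial."""
        s = np.clip(np.asarray(scores, dtype=float), eps, 1.0 - eps)
        y = np.asarray(judgments, dtype=float)
        return float(-np.sum(y * np.log(s) + (1.0 - y) * np.log(1.0 - s)))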

7.5.3 Ranking MEDLINE Articles

After producing a score si for each article ai ∈ A retrieved for trial t, the final ranked list of articles

is produced by sorting the articles a1, a2, · · · , aL in descending order according to their scores

s1, s2, · · · , sL.

7.6 Experiments

Each clinical trial in ClinicalTrials.gov was manually registered by a Study Record Manager (SRM)

and may be associated with two types of publications corresponding to distinct fields in the registry:

(1) “related articles”, articles the SRM deemed related to the trial (typically references) and (2)

“result articles”, articles the SRM indicated as reporting the results of the trial. To evaluate NCT

Link, we randomly selected 500 clinical trials which were each associated with at least one “result

article” in the registry. In our experiments, we used a standard 3:1:1 split for training, development,

and testing. Relevance judgments for all 500 trials were automatically produced using the “result

articles” encoded for each trial. Specifically, for each trial t, we assigned a judgment of RELEVANT

to all MEDLINE articles listed as “result articles” for t. We considered two strategies for producing

IRRELEVANT judgments. Initially, we applied the Closed World Assumption (Minker, 1982)

(CWA) by judging every MEDLINE article not explicitly listed in the “result articles” of t as

IRRELEVANT to t. We refer to this judgment strategy as CLOSED.

However, it has been shown that the SRM of a clinical trial does not always update the registry as

new articles are published (Bashir et al., 2017). Under the CWA, these articles would be mistakenly

labeled IRRELEVANT. To account for this, we considered a secondary judgment strategy intended

to minimize the likelihood of assigning an IRRELEVANT judgment to a MEDLINE article that may

report the results of a trial despite not being included in the “result articles” of the trial. Formally,

for each trial t we obtained a list A of 3,000 MEDLINE articles using the search strategy described

in the Searching MEDLINE Articles section (without applying the learning-to-rank component of

NCT Link). We determined the set of IRRELEVANT articles for t as: (1) articles which were

not listed in the “result articles” of t but were listed in the “result articles” of any other trial in

the registry, (2) 10 randomly selected articles between ranks 10 and 100, (3) 10 randomly selected

articles between ranks 1000 and 2000, (3) 10 randomly selected articles between ranks 2000 and

3000, and (4) 10 randomly selected articles from MEDLINE not in A. We refer to this second

judgment strategy as OPEN. We report the performance of NCT Link when trained using the OPEN

strategy and evaluated using both strategies9.
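The OPEN judgment strategy for a single trial can be sketched as follows; the function and argument names are illustrative assumptions rather than the exact code used in our experiments.

    import random

    def open_irrelevant_judgments(ranked_pmids, other_trials_result_pmids,
                                  all_medline_pmids, result_pmids, seed=0):
        """Assemble the IRRELEVANT set for one trial under the OPEN strategy:
        result articles of other trials plus random samples from several rank
        ranges of the 3,000 retrieved articles and from MEDLINE at large."""
        rng = random.Random(seed)
        retrieved = set(ranked_pmids)
        irrelevant = set(other_trials_result_pmids)                 # (1) results of other trials
        irrelevant |= set(rng.sample(ranked_pmids[10:100], 10))     # (2) ranks 10-100
        irrelevant |= set(rng.sample(ranked_pmids[1000:2000], 10))  # (3) ranks 1000-2000
        irrelevant |= set(rng.sample(ranked_pmids[2000:3000], 10))  # (4) ranks 2000-3000
        outside = [p for p in all_medline_pmids if p not in retrieved]
        irrelevant |= set(rng.sample(outside, 10))                  # (5) articles outside A
        return irrelevant - set(result_pmids)                       # never mark result articles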

Due to the paucity of published automatic systems for linking clinical trials to their results in

the literature, we measured the performance of NCT Link against two baseline systems as well as

four alternative configurations of NCT Link:

1. Exact Match: A system in which an article is considered to be linked to a clinical trial if it

specifically mentions the NCT ID of the trial in its abstract or metadata – this is an extension

of the automatic approach described by Bashir et al. (2017) which considers only metadata.

2. IR:BM25: An information retrieval (IR) system which represents all aspects of the clinical

trial as a single disjunctive Boolean query relying on the BM25 similarity function.

3. NCT Link: BM25. A configuration of NCT Link in which no learning-to-rank is performed;

that is, the system returns the ranked list of candidate articles described in the Searching

MEDLINE Articles section.

4. NCT Link:Linear Regression. A configuration of NCT Link that replaces the Deep High-

way Network (DHN) with a linear regression model to determine article scores.

5. NCT Link:Random Forests. A configuration of NCT Link that replaces the DHN with a

Random Forest (Breiman, 2001) model to determine article scores.

6. NCT Link:Gradient Boosting. A configuration of NCT Link that replaces the DHN with

a Gradient Boosting (Friedman, 2001) model to determine article scores; Gradient Boosting

9We found that training with the CLOSED strategy degraded performance on the test set in all cases.

can be viewed as a modern extension to Random Forests that incorporates boosting rather than

bagging to combine the scores predicted by each decision tree in the forest.

7.6.1 Quality Metrics

We measured the quality of ranked MEDLINE articles produced by all systems using standard

metrics for evaluating the performance of information retrieval systems. Formally, let X indicate

the test set, consisting of pairs of a clinical trial, t, and the final ranked list of L articles produced for

t, B1:L . To measure the overall ranking produced by each system, we measured the Mean Average

Precision (MAP):

\mathrm{MAP}(X) = \frac{1}{|X|} \sum_{(t, B_{1:L}) \in X} \mathrm{AP}(B_{1:L}; t) \qquad (7.6)

\mathrm{AP}(B_{1:L}; t) = \frac{1}{\mathrm{Num\ Rel}(t)} \sum_{k=1}^{L} P(B_{1:k}; t) \cdot \mathrm{Rel}(b_k; t) \qquad (7.7)

P(B_{1:k}; t) = \frac{1}{k} \sum_{j=1}^{k} \mathrm{Rel}(b_j; t) \qquad (7.8)

where AP(B1:L; t) indicates the Average Precision of B1:L with respect to t, P(B1:k; t) represents

the precision of the top-k ranked articles retrieved for trial t, Rel(bk; t) is an indicator function

returning the value 1 if article b k was judged as RELEVANT for trial t and returning 0, otherwise,

and Num Rel (t) returns the number of articles judged RELEVANT for t. In addition to the MAP,

we report the Mean Reciprocal Rank (MRR) which is the average of the multiplicative inverse of

the rank of the first relevant article produced for each trial. The MRR captures how many irrelevant

reports are ranked, on average, above the first relevant article for each trial. We also report the

average precision over all clinical trials at three different ranks: the R-Precision (R-Prec) which is

the precision of the first R-ranked articles, where for each trial t, R = Num Rel (t); the Precision of

the top-five ranked articles (P@5) and the Precision of the top-ten ranked articles (P@10).
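For reference, the per-trial computations behind these metrics can be sketched as follows (MAP and MRR are obtained by averaging average_precision and reciprocal_rank over all trials); the function names are illustrative.

    def average_precision(relevant, ranked):
        """Average Precision for one trial: 'relevant' is the set of RELEVANT
        article identifiers, 'ranked' the system's ranked list (Eqs. 7.7-7.8)."""
        hits, precision_sum = 0, 0.0
        for k, pmid in enumerate(ranked, start=1):
            if pmid in relevant:
                hits += 1
                precision_sum += hits / k          # precision of the top-k articles
        return precision_sum / len(relevant) if relevant else 0.0

    def reciprocal_rank(relevant, ranked):
        """Inverse rank of the first relevant article (0 if none is retrieved)."""
        for k, pmid in enumerate(ranked, start=1):
            if pmid in relevant:
                return 1.0 / k
        return 0.0

    def precision_at(relevant, ranked, k):
        """P@k, e.g. k = 5 (P@5) or k = 10 (P@10)."""
        return sum(1 for pmid in ranked[:k] if pmid in relevant) / k

    def r_precision(relevant, ranked):
        """Precision of the first R ranked articles, R = Num Rel(t)."""
        return precision_at(relevant, ranked, len(relevant)) if relevant else 0.0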

Table 7.2. Quality of ranked list of MEDLINE articles retrieved for each clinical trial.
(a) Performance when using the CLOSED judgment strategy.

System MAP MRR R-Prec P@5 P@10


Exact Match 0.001 0.001 0.0000 0.0000 0.000
IR: BM25 0.011 0.016 0.0032 0.0040 0.003
NCT Link: BM25 0.017 0.021 0.002 0.004 0.004
NCT Link: Linear Regression 0.269 0.302 0.236 0.082 0.046
NCT Link: Random Forests 0.196 0.219 0.154 0.072 0.051
NCT Link: Gradient Boosting 0.143 0.162 0.102 0.046 0.030
?NCT Link: DHN 0.308 0.342 0.244 0.123 0.082

(b) Performance when using the OPEN judgment strategy.

System MAP MRR R-Prec P@5 P@10


Exact Match 0.236 0.260 0.2203 0.0620 0.031
IR: BM25 0.258 0.294 0.1793 0.1220 0.095
NCT Link: BM25 0.586 0.610 0.549 0.210 0.115
NCT Link: Linear Regression 0.656 0.723 0.620 0.264 0.161
NCT Link: Random Forests 0.734 0.808 0.709 0.298 0.185
NCT Link: Gradient Boosting 0.717 0.791 0.684 0.282 0.182
?NCT Link: DHN 0.824 0.873 0.920 0.358 0.221

7.6.2 Results

Table 7.2 depicts the performance of all baseline systems as well as all configurations of NCT

Link when evaluated according to both judgment strategies and measured with all five metrics.

As expected, all systems obtained higher performance when using the OPEN judgment scheme

than when using the CLOSED scheme. The poorer performance of all systems when using

the CLOSED judgment scheme supports the notion that many relevant articles were incorrectly

labeled as IRRELEVANT. Consequently, the OPEN judgment scheme may be viewed as an upper

bound while the CLOSED judgment scheme may be viewed as a lower bound of each system’s

performance. Regardless of judgment scheme, NCT Link using the Deep Highway Network

(DHN) obtains the highest performance, followed by the three other NCT Link configurations

employing learning-to-rank. It is interesting to note that increasing the complexity of these three models
coincided with an increase in performance when using the OPEN judgment scheme and a decrease
in performance when using the CLOSED judgment scheme. This indicates that the more complex
models may have over-fit on the OPEN judgment scheme used for training. The lowest performance
was exhibited by the exact match baseline, reinforcing the observations reported by Bashir et al.
(2017) that considering the NCT ID alone is not sufficient to determine links between MEDLINE
articles and clinical trials. Likewise, the disparity in performance between the basic information
retrieval system (IR: BM25) and the BM25 configuration of NCT Link clearly indicates that
the search criteria described in the Searching MEDLINE Articles section obtain higher quality
results than when using a naïve retrieval strategy. Moreover, the increase in performance when
incorporating learning-to-rank within NCT Link suggests that the features extracted by NCT Link
are able to capture useful criteria for determining whether a link exists between an article and a
trial. When comparing the performance of NCT Link using the DHN against the performance when
using Random Forest, Linear Regression, and Gradient Boosting, it is clear that the DHN obtains
superior performance, suggesting that our DHN is able to successfully extract “meta”-features
capturing additional semantics about the relationship between a MEDLINE article and a clinical
trial.

7.7 Discussion

We manually analyzed the MEDLINE articles retrieved by NCT Link for 30 clinical trials in the test
set and found four main sources of error.
The most common source of errors we observed involved investigator and author
names. Specifically, we found that, in general, clinical trials represented investigator names with
three fields: first name, middle name, and last name. However, many journals in MEDLINE only
report the authors’ last names and the initials of first and sometimes middle names. This resulted
in scenarios in which the system incorrectly concluded that the investigator of a trial was the same

as the author of a paper. This error was most prevalent for common last names (e.g., Lin, Brown),

common first initials (e.g., J, M, S, D), and when the middle initial was unspecified. Moreover, we

observed that in several cases, the first and middle names in the clinical trial registry were blank

and the last name contained the full name of the primary investigator. In four cases, the first

and middle names were blank and the last name appeared to refer to the sponsoring company. In

future work, we believe some of these errors could be at least partially addressed by incorporating

some degree of citation analysis to help (1) disambiguate initials and/or (2) infer unspecified names

from previous work. The second most common source of errors was mismatched affiliations. We

found many cases in which the same institution was referenced in multiple ways (e.g., “UCLA”

and “University of California, Los Angeles”). Moreover, addresses were often specified with

different levels of detail (street names, cities, states, country). Unfortunately, resolving this kind

of ambiguity is a difficult problem involving world, spatial, and geographical knowledge as well as

prior knowledge about known institutions and their standard abbreviations.

The third most common source of errors was inconsistencies in the way clinical trial completion

dates were provided to the registry. Because the completion date was represented in natural

language, completion dates appeared in a wide variety of formats. For example, some

SRMs preferred formatting the dates in the European fashion (day-month-year), while others

preferred the American notation (month-day-year). In some cases, only the month and the

year were indicated. Individual months were specified using digits (e.g., “01”), the full name (e.g.,

“January”) as well as a variety of abbreviations (e.g., “J”, “Jan”, and “Jan.”). Years were specified

in both two and four digit varieties (e.g., “07”, and “2007”). In our study, we investigated applying

automatic tools for recognizing time expressions (e.g., SUTime (Chang and Manning, 2012)) but

found it increased processing time by two orders of magnitude.

The final source of errors appears to result from SRMs providing incorrect information to the

registry. We found cases in which the references provided as “result articles” for a clinical trial

were published before the trial’s start date (in some cases, decades before). It is unclear whether

incorrect citations were given, or whether there was confusion between the “related articles” and
“result articles” fields in the registry.
In addition to the errors described above, there were some limitations in our experiments.
First, we only considered the clinical trials registered on ClinicalTrials.gov despite the availability
of other registries such as the World Health Organization (WHO) International Clinical Trials
Registry Platform (ICTRP)10. Second, we limited our system to considering only articles published
on MEDLINE and did not consider other databases such as EMBASE11 or research conference
proceedings. Moreover, because MEDLINE itself only provides abstracts, NCT Link did not have
access to the full text of articles. In future work, it may be advantageous to consider the full text of
articles included in the PubMed Central (PMC) Open Access Subset12 (OAS); it should be noted,
however, that the PMC OAS contains just over 1 million articles while MEDLINE itself contains
over 14 million articles.

7.7.1 Implementation Details

NCT Link was implemented in both Java (version 8) and Python (version 2.7.13). Java was used for
(1) parsing the data from MEDLINE as well as ClinicalTrials.gov, relying on the Java Architecture
for XML Binding (JAXB, version 2.1), (2) indexing and searching clinical trials and MEDLINE
articles, relying on Apache Lucene13 (version 6.6.0), and (3) feature extraction. The Deep Highway
Network (DHN) was implemented in Python using TensorFlow14 (version 1.3). L2R baselines
relied on the implementation provided by the RankLib component of the Lemur Project15 using the

10http://apps.who.int/trialsearch/

11https://www.elsevier.com/solutions/embase-biomedical-research

12https://www.ncbi.nlm.nih.gov/pmc/tools/openftlist/

13https://lucene.apache.org/

14https://www.tensorflow.org/

15https://www.lemurproject.org/

recommended parameters. Our DHN used 200-dimensional internal layers. When designing the
network, we found that changing the dimensionality of internal layers had no discernible effects on
performance.

7.8 Summary and Lessons Learned

In this chapter, we have presented NCT Link, a system for automatically linking clinical trials
to MEDLINE articles reporting their results. While traditional approaches for linking trials to
their publications rely on arduous, manual analyses (Bashir et al., 2017), NCT Link learns to
automatically determine the likelihood that a published article reports the results of a clinical trial
by incorporating state-of-the-art deep learning and information retrieval techniques, obtaining a
30%-58% improvement over previously reported automatic systems (Bashir et al., 2017). These
promising results suggest that NCT Link will provide a useful tool for clinicians seeking to provide
timely, evidence-based care. Moreover, NCT Link can be used to enrich the results of clinical
decision support systems, such as the one described in Chapter 2 on page 11, by augmenting
retrieved scientific articles with relevant clinical trials. It should be noted that the learning-to-
rank framework described in this chapter may be adapted to other medical information retrieval
problems, such as patient cohort retrieval.

CHAPTER 8

CONCLUSIONS

With the advent of electronic health records (EHRs), the demand for promising secondary applica-
tions is increasing. Although some information in EHRs is easily accessed from structured fields,
it is believed that the richest source of information is locked within unstructured natural language text
(Demner-Fushman et al., 2009). In this dissertation, we present novel methods for two medical ap-
plications taking advantage of the knowledge encoded in unstructured natural language texts: medical
question answering and patient cohort retrieval. Medical language presents a plethora of unique
challenges which were met by a variety of techniques from the field of information retrieval as well
as the fields of natural language processing, machine learning, artificial intelligence, deep learn-
ing, probabilistic graphical models, knowledge representations, and “big data”. This dissertation
showed how novel paradigms for medical information retrieval can be successfully applied to both
medical question answering and patient cohort retrieval. Chapter 2 presented a system and retrieval
framework that combines question answering and information retrieval techniques for the purposes
of clinical decision support. Given a natural language description of a patient’s medical case, and
a question type (e.g., “What is the diagnosis?”), the system obtained a ranked list of answers for the
question as well as a ranking of relevant scientific articles supporting the answer(s). The system
showed how combining knowledge from medical practice (electronic health records) with medical
research (published scientific articles) can obtain state-of-the-art performance for clinical decision
support.
Chapter 3 explored medical information retrieval directly through the problem of patient cohort
retrieval. We presented a framework for patient cohort retrieval which, given a natural language
description of a patient cohort (i.e., a population of patients with common demographic and/or
medical characteristics), provided a ranked list of patients eligible for the cohort. In this chapter, we
explored a variety of approaches for multiple information retrieval techniques including key-phrase
detection, query expansion, ranking, and filtering. We found that standard ad-hoc information

retrieval techniques were not sufficient for medical texts and that the domain-specific strategies

described in the chapter obtained better performance.

The problem of patient cohort retrieval was further investigated in Chapter 4, wherein we

developed a multi-modal form of patient cohort retrieval operating on Electroencephalogram (EEG)

reports. The proposed system, named MERCuRY, is the first system to our knowledge to incorporate

a multi-modal index considering both the unstructured natural language text included in EEG reports

as well as automatically generated fingerprints encoding the information present in the EEG signal

recordings associated with each report. We demonstrated that a multi-modal retrieval strategy

outperformed the traditional strategy of considering only text.

The remainder of the dissertation explored methods for extending medical information re-

trieval systems to account for the complexities of electronic health records. Specifically, we explored

probabilistic graphical models for capturing longitudinal information – that is, for modeling cross-

document temporal relations between medical concepts, deep neural networks for inferring and

recovering missing or underspecified information, and the impact of learning-to-rank on medical

information retrieval.

In Chapter 5, we addressed one of the most significant barriers to analyzing patient informa-

tion from EHRs: the role of longitudinal information – that is, accounting for the fact that the

clinical picture and therapy of a patient changes over the course of their care and across individual

records. Specifically, we presented three probabilistic graphical models for capturing longitudinal

information: (1) a lattice Markov network for predicting risk factors for heart disease in diabetic

patients, (2) a probabilistic graphical model for inferring the causal interactions among risk factors

and medications over time, and (3) a general Bayesian model which jointly learns to predict clinical

observations in time and cluster patients into latent sub-populations. We showed how each of

these models can obtain high predictive accuracy, indicating that each model can accurately encode

and identify longitudinal information across successive EHRs, enhancing the ability of medical

question answering and patient cohort retrieval systems to reason about longitudinal data.

In addition to the complexity imposed by longitudinal information, EHRs exhibit a number of

other unique characteristics, particularly when it comes to data quality: electronic health records

are rife with incomplete, missing, or underspecified information. In Chapter 6, we explored two

scenarios wherein underspecified or missing information can be inferred or recovered automatically

using deep learning. Specifically, we presented novel deep learning models for inferring whether

or not an EEG report indicates cerebral dysfunction and for automatically generating the content

of missing sections, based on patterns distilled from a large collection of EHRs. This chapter

demonstrated that deep learning can be successfully applied to recover missing or underspecified

information, paving the way for more robust and reliable medical information retrieval systems.

One of the major obstacles when designing the medical information retrieval components of

the systems described in Chapters 2 to 4 was the lack of established clear and complete criteria for

judging the relevance of medical texts. Consequently, the system described in Chapter 2 employed

a custom relevance model, while the systems described in Chapters 3 and 4 extended and adapted

standard relevance models for ad-hoc information retrieval. In Chapter 7 we showed how supervised

machine learning could be used to automatically learn the optimal relevance model for a particular

set of queries and text collection. Specifically, we presented a case study for how the learning-to-

rank framework can be applied to the problem of identifying published scientific articles that report

the results of a given clinical trial. By combining information (i.e., features) about the query, a

document, and the relationships (if any) between them, we were able to produce a successful and

reliable method for automatically linking clinical trials to published articles reporting their results.

The success of the approach reported in this chapter indicates that learning-to-rank can be a

valuable tool for implementing medical informatics systems.

Overall, this dissertation advances the understanding of medical information retrieval in general

with particular emphasis on its application to medical question

answering and patient cohort retrieval. Possible directions for future work include developing novel

representations for EHR content to allow future systems to be more easily adapted for new datasets

and types of EHRs, and investigating how the techniques presented in this dissertation could be
integrated into EHR systems and clinical practice. It is my sincerest wish that this dissertation acts
as a stepping stone towards improving patient care and outcomes through enabling easier, more
robust, and more reliable access to the information encoded in medical texts.

REFERENCES

Abadi, M., A. Agarwal, P. Barham, E. Brevdo, Z. Chen, C. Citro, G. S. Corrado, A. Davis, J. Dean,
M. Devin, et al. (2016). Tensorflow: Large-scale machine learning on heterogeneous distributed
systems. arXiv preprint arXiv:1603.04467.

Aikins, J. S., J. C. Kunz, E. H. Shortliffe, and R. J. Fallat (1983). PUFF: an expert system for
interpretation of pulmonary function data. Computers and biomedical research 16(3), 199–208.

Albright, D., A. Lanfranchi, A. Fredriksen, W. F. Styler, C. Warner, J. D. Hwang, J. D. Choi,


D. Dligach, R. D. Nielsen, J. Martin, et al. (2013). Towards comprehensive syntactic and
semantic annotations of the clinical narrative. Journal of the American Medical Informatics
Association 20(5), 922–930.

Amarasingham, R., B. J. Moore, Y. P. Tabak, M. H. Drazner, C. A. Clark, S. Zhang, W. G. Reed,


T. S. Swanson, Y. Ma, and E. A. Halm (2010). An automated model to identify heart failure
patients at risk for 30-day readmission or death using electronic medical record data. Medical
care 48(11), 981–988.

Amati, G. and C. J. Van Rijsbergen (2002). Probabilistic models of information retrieval based
on measuring the divergence from randomness. ACM Transactions on Information Systems
(TOIS) 20(4), 357–389.

American Clinical Neurophysiology Society et al. (2006). Guideline 7: Guidelines for writing eeg
reports. Journal of Clinical Neurophysiology: Official Publication of the American Electroen-
cephalographic Society 23(2), 118.

Anderson, M. L., K. Chiswell, E. D. Peterson, A. Tasneem, J. Topping, and R. M. Califf


(2015). Compliance with results reporting at clinicaltrials. gov. New England Journal of
Medicine 372(11), 1031–1039.

Androutsopoulos, I., G. Ritchie, and P. Thanisch (1993). Masque/sql an efficient and portable natural
language query interface for relational databases. In Proc. of Sixth International Conference on
Industrial & Engineering Applications of Artificial Intelligence & Expert System.

Androutsopoulos, I., G. D. Ritchie, and P. Thanisch (1995). Natural language interfaces to


databases–an introduction. Natural language engineering 1(01), 29–81.

Aronson, A. R. (2001). Effective mapping of biomedical text to the umls metathesaurus: the
metamap program. In AMIA, pp. 17. American Medical Informatics Association.

Arora, R. and B. Ravindran (2008). Latent dirichlet allocation based multi-document summariza-
tion. In Proceedings of the second workshop on Analytics for noisy unstructured text data, pp.
91–97. ACM.

Arya, S., D. M. Mount, N. S. Netanyahu, R. Silverman, and A. Y. Wu (1998). An optimal algorithm
for approximate nearest neighbor searching fixed dimensions. Journal of the ACM (JACM) 45(6),
891–923.

Auer, S., C. Bizer, G. Kobilarov, J. Lehmann, R. Cyganiak, and Z. Ives (2007). Dbpedia: A nucleus
for a web of open data. Springer.

Ba, J., V. Mnih, and K. Kavukcuoglu (2015). Multiple object recognition with visual attention. In
ICLR.

Bahdanau, D., K. Cho, and Y. Bengio (2015). Neural machine translation by jointly learning to
align and translate. In ICLR.

Balaneshin-kordan, S., A. Kotov, and R. Xisto (2015). Wsu-ir at trec 2015 clinical decision support
track: Joint weighting of explicit and latent medical query concepts from diverse sources. In
Text Retrieval Conference, TREC.

Bao, J., N. Duan, M. Zhou, and T. Zhao (2014). Knowledge-based question answering as machine
translation. 2(6).

Bashir, R., F. T. Bourgeois, and A. G. Dunn (2017). A systematic review of the processes used to
link clinical trial registrations to their published results. Systematic reviews 6(1), 123.

Bejan, C. A., L. Vanderwende, et al. (2013, November). On-time clinical phenotype prediction
based on narrative reports. In AMIA, Volume 2013, pp. 103–110.

Bejan, C. A., L. Vanderwende, M. M. Wurfel, and M. Yetisgen-Yildiz (2012, November). Assessing


Pneumonia Identification from Time-Ordered Narrative Reports. AMIA Annual Symposium
Proceedings 2012, 1119–1128.

Beniczky, S., L. J. Hirsch, P. W. Kaplan, R. Pressler, G. Bauer, H. Aurlien, J. C. Brøgger, and


E. Trinka (2013). Unified eeg terminology and criteria for nonconvulsive status epilepticus.
Epilepsia 54(s6), 28–29.

Berlin, J. and P. Stang (2011). Clinical data sets that need to be mined. In Learning What
Works: Infrastructure Required for Comparative Effectiveness Research: Workshop Summary,
Volume 1.

Blei, D. M., A. Y. Ng, and M. I. Jordan (2003, March). Latent dirichlet allocation. Journal of
machine Learning research 3(Jan), 993–1022.

Bodenreider, O. (2004). The unified medical language system (umls): integrating biomedical
terminology. Nucleic acids research 32(suppl 1), D267–D270.

Bollacker, K., C. Evans, P. Paritosh, T. Sturge, and J. Taylor (2008). Freebase: a collabora-
tively created graph database for structuring human knowledge. In International Conference on
Management of Data, SIGMOD, pp. 1247–1250. ACM.

Breiman, L. (2001). Random forests. Machine learning 45(1), 5–32.

Brown, P. F., V. J. D. Pietra, S. A. D. Pietra, and R. L. Mercer (1993). The mathematics of statistical
machine translation: Parameter estimation. Computational linguistics 19(2), 263–311.

Buchanan, B. G., J. D. Moore, D. E. Forsythe, G. Carenini, S. Ohlsson, and G. Banks (1995).


An intelligent interactive system for delivering individualized information to patients. Artificial
intelligence in medicine 7(2), 117–154.

Callejas P., M. A., Y. Wang, and H. Fang (2012). Exploiting domain thesaurus for medical record
retrieval. In TREC-21.

Carberry, S. and T. Harvey (1997). Generating coherent messages in real-time decision support:
Exploiting discourse theory for discourse practice. In Nineteenth Annual Conference of the
Cognitive Science Society, pp. 79–84.

Cawsey, A. J., B. L. Webber, and R. B. Jones (1997). Natural language generation in health care.

Chang, A. X. and C. D. Manning (2012). Sutime: A library for recognizing and normalizing time
expressions. In LREC, Volume 2012, pp. 3735–3740.

Chapman, W. W., W. Bridewell, P. Hanbury, G. F. Cooper, and B. G. Buchanan (2001). A simple


algorithm for identifying negated findings and diseases in discharge summaries. Journal of
biomedical informatics 34(5), 301–310.

Chapman, W. W., P. M. Nadkarni, L. Hirschman, L. W. D’Avolio, G. K. Savova, and O. Uzuner


(2011). Overcoming barriers to nlp for clinical text: the role of shared tasks and the need for
additional creative solutions. Journal of the American Medical Informatics Association 18(5),
540–543.

Chen, Z., B. Gao, H. Zhang, Z. Zhao, and D. Cai (2017). User personalized satisfaction prediction
via multiple instance deep learning. In Proceedings of the 2017 International World Wide Web
Conference (WWW).

Cho, K., B. Van Merriënboer, C. Gulcehre, D. Bahdanau, F. Bougares, H. Schwenk, and Y. Bengio
(2014, October). Learning phrase representations using rnn encoder–decoder for statistical
machine translation. EMNLP, 1724–1734.

Choi, Y., C. Y.-I. Chiu, and D. Sontag (2016). Learning low-dimensional representations of medical
concepts. AMIA Summits on Translational Science Proceedings 2016, 41.

Cilibrasi, R. and P. M. B. Vitányi (2004). The google similarity distance. CoRR.

Cohen, A. M. and W. R. Hersh (2005). A survey of current work in biomedical text mining.
Briefings in bioinformatics 6(1), 57–71.

Congress, US (2007). Food and drug administration amendments act of 2007. Public Law, 115–85.

Cortes, C. and V. Vapnik (1995). Support vector machine. Machine learning 20(3), 273–297.

Cui, L., A. Bozorgi, S. D. Lhatoo, G.-Q. Zhang, and S. S. Sahoo (2012). Epidea: Extracting
structured epilepsy and seizure information from patient discharge summaries for cohort identi-
fication. In AMIA Annual Symposium Proceedings, Volume 2012, pp. 1191. American Medical
Informatics Association.

De Angelis, C., J. M. Drazen, F. A. Frizelle, C. Haug, J. Hoey, R. Horton, S. Kotzin, C. Laine,


A. Marusic, A. J. P. Overbeke, et al. (2004). Clinical trial registration: a statement from the
international committee of medical journal editors.

Dean, J. and S. Ghemawat (2008). Mapreduce: simplified data processing on large clusters.
ACM 51(1), 107–113.

Demner-Fushman, D., S. Abhyankar, A. Jimeno-Yepes, R. Loane, F. Lang, J. G. Mork, N. Ide, and


A. R. Aronson (2012). Nlm at trec 2012 medical records track. In TREC-21.

Demner-Fushman, D., S. Antani, M. Simpson, and G. R. Thoma (2012). Design and development
of a multimodal biomedical information retrieval system. Journal of Computing Science and
Engineering 6(2), 168–177.

Demner-Fushman, D., W. W. Chapman, and C. J. McDonald (2009). What can natural language
processing do for clinical decision support? Journal of biomedical informatics 42(5), 760–772.

Dong, L., F. Wei, M. Zhou, and K. Xu (2015). Question answering over freebase with multi-column
convolutional neural networks. In Association for Computational Linguistics, ACL, Volume 1,
pp. 260–269.

Eagle, K. A., M. J. Lim, et al. (2004). A validated prediction model for all forms of acute coronary
syndrome. JAMA.

Edinger, T., A. M. Cohen, S. Bedrick, K. Ambert, and W. Hersh (2012, November). Barriers
to retrieving patient information from electronic health record data: failure analysis from the
trec medical records track. In AMIA Annual Symposium Proceedings, Volume 2012, pp. 180.
American Medical Informatics Association.

England, M. J., C. T. Liverman, A. M. Schultz, and L. M. Strawbridge (2012). Epilepsy across the
spectrum: Promoting health and understanding: A summary of the institute of medicine report.
Epilepsy & Behavior 25(2), 266–276.

Fang, H. and C. Zhai (2005). An exploration of axiomatic approaches to information retrieval. In
Proceedings of the 28th annual international ACM SIGIR conference on Research and develop-
ment in information retrieval, pp. 480–487. ACM.

Fellbaum, C. (1998). WordNet: An Electronic Lexical Database. The MIT press.

Fox, K. A., O. H. Dabbous, R. J. Goldberg, K. S. Pieper, K. A. Eagle, F. Van de Werf, Á. Avezum,


S. G. Goodman, M. D. Flather, F. A. Anderson, et al. (2006). Prediction of risk of death
and myocardial infarction in the six months after presentation with acute coronary syndrome:
prospective multinational observational study (grace). bmj 333(7578), 1091.

Friedman, J. H. (2001). Greedy function approximation: a gradient boosting machine. Annals of


statistics, 1189–1232.

Friedman, N., K. Murphy, and S. Russell (1998). Learning the structure of dynamic probabilistic
networks. In Proceedings of the Fourteenth conference on Uncertainty in artificial intelligence,
pp. 139–147. Morgan Kaufmann Publishers Inc.

Garg, A. X., N. K. Adhikari, H. McDonald, M. P. Rosas-Arellano, P. Devereaux, J. Beyene,


J. Sam, and R. B. Haynes (2005). Effects of computerized clinical decision support systems on
practitioner performance and patient outcomes: a systematic review. 293(10), 1223–1238.

Geman, S. and D. Geman (1984). Stochastic relaxation, gibbs distributions, and the bayesian
restoration of images. IEEE Transactions on pattern analysis and machine intelligence (6),
721–741.

Gerber, P. A., K. E. Chapman, S. S. Chung, C. Drees, R. K. Maganti, Y.-t. Ng, D. M. Treiman, A. S.


Little, and J. F. Kerrigan (2008). Interobserver agreement in the interpretation of eeg patterns in
critically ill adults. Journal of Clinical Neurophysiology 25(5), 241–249.

Gertner, A. S., B. L. Webber, J. R. Clarke, C. Z. Hayward, T. A. Santora, and D. K. Wagner (1997).


On-line assurance in the initial definitive management of multiple trauma: evaluating system
potential. Artificial Intelligence in Medicine 9(3), 261–282.

Glick, T. H., L. D. Cranberg, R. B. Hanscom, and L. Sato (2005). Neurologic patient safety: an
in-depth study of malpractice claims. Neurology 65(8), 1284–1286.

Glorot, X., A. Bordes, and Y. Bengio (2011). Deep sparse rectifier neural networks. In International
Conference on Artificial Intelligence and Statistics, pp. 315–323.

Goodwin, T. and S. M. Harabagiu (2013a). Automatic generation of a qualified medical knowledge


graph and its usage for retrieving patient cohorts from electronic medical records. In International
Conference on Semantic Computing, ICSC 2013.

Goodwin, T. and S. M. Harabagiu (2013b). Graphical induction of qualified medical knowledge.


7(04), 377–405.

Goodwin, T. and S. M. Harabagiu (2013c). The impact of belief values on the identification of patient
cohorts. In Information Access Evaluation. Multilinguality, Multimodality, and Visualization,
pp. 155–166. Springer Berlin Heidelberg.

Goodwin, T. and S. M. Harabagiu (2014). Clinical data-driven probabilistic graph processing. In


LREC, pp. 101–108.

Goodwin, T. and S. M. Harabagiu (2015). A probabilistic reasoning method for predicting the
progression of clinical findings from electronic medical records. AMIA Summits on Translational
Science Proceedings 2015, 61.

Goodwin, T., B. Rink, K. Roberts, and S. M. Harabagiu (2011). Cohort shepherd: Discovering
cohort traits from hospital visits. In Proceedings of The 20th Text REtrieval Conference.

Goodwin, T., K. Roberts, B. Rink, and S. M. Harabagiu (2012). Cohort shepherd ii: Verifying
cohort constraints from hospital visits. In Proceedings of The 21st Text REtrieval Conference.

Goodwin, T. and H. S (2016). Multi-modal patient cohort identification from eeg report and signal
data. AMIA Annual Symposium, 1694–1803.

Goodwin, T. and H. S (2017). Deep learning from eeg reports for inferring underspecified infor-
mation. AMIA CRI 2017.

Goodwin, T. R. and S. M. Harabagiu (2016). Medical question answering for clinical decision
support. In Proceedings of the 25th ACM International on Conference on Information and
Knowledge Management, pp. 297–306. ACM.

Goodwin, T. R. and S. M. Harabagiu (2017). Knowledge representations and inference techniques


for medical question answering. ACM Transactions on Intelligent Systems and Technology
(TIST) 9(2), 14.

Green Jr., B. F., A. K. Wolf, C. Chomsky, and K. Laughery (1961). Baseball: An automatic
question-answerer. In Proceedings of Western Computing Conference, Volume 19, pp. 219–224.
ACM.

Guthrie, D., B. Allison, W. Liu, L. Guthrie, and Y. Wilks (2006). A closer look at skip-gram
modelling. In Proceedings of the 5th international Conference on Language Resources and
Evaluation (LREC-2006), pp. 1–4. sn.

Harabagiu, S. and S. Maiorano (1999). Finding answers in large collections of texts: Paragraph
indexing + abductive inference. In Proceedings of the AAAI Fall Symposium on Question
Answering Systems, pp. 63–71.

Harati, A., S.-M. Choi, M. Tabrizi, I. Obeid, J. Picone, and M. Jacobson (2013). The temple
university hospital eeg corpus. In Global Conference on Signal and Information Processing
(GlobalSIP), 2013 IEEE, pp. 29–32. IEEE.

Hatcher, E. and O. Gospodnetic (2005). Lucene in Action. Manning Publications.

Hermann, K. M., T. Kocisky, E. Grefenstette, L. Espeholt, W. Kay, M. Suleyman, and P. Blunsom


(2015). Teaching machines to read and comprehend. In Advances in Neural Information
Processing Systems, pp. 1693–1701.

Hersh, W. (2012). From implementation to analytics: The future work of informat-


ics. http://informaticsprofessor.blogspot.com/2012/04/from-implementation-
to-analytics-future.html.

Hersh, W. R. (2009). Information retrieval: a health and biomedical perspective. Springer.

Hochreiter, S., Y. Bengio, P. Frasconi, and J. Schmidhuber (2001). Gradient flow in recurrent nets:
the difficulty of learning long-term dependencies. In J. Kolen and S. Kremer (Eds.), Field Guide
to Dynamical Recurrent Networks. IEEE Press.

Hochreiter, S. and J. Schmidhuber (1997). Long short-term memory. Neural Computation 9(8),
1735–1780.

Hovy, E. H., L. Gerber, U. Hermjakob, M. Junk, and C.-Y. Lin (2000). Question answering in
webclopedia. In TREC, Volume 52, pp. 53–56.

Huopaniemi, I., G. N. Nadkarni, R. Nadukuru, S. B. Ellis, O. Gottesman, and E. Bottinger (2014).


Disease progression subtype discovery from longitudinal EMR data with a majority of missing
values and unknown initial time points. In AMIA Annual Symposium Proceedings, Volume 2014.

Huser, V. and J. J. Cimino (2013). Linking clinicaltrials. gov and pubmed to track results of
interventional human clinical trials. PloS one 8(7), e68409.

International Committee of Medical Journal Editors (ICMJE) et al. (2004). Uniform requirements
for manuscripts submitted to biomedical journals: writing and editing for biomedical publication.
Haematologica 89(3), 264.

Islam, A. and D. Inkpen (2006). Second order co-occurrence pmi for determining the semantic
similarity of words. In Proceedings of the International Conference on Language Resources and
Evaluation, Genoa, Italy, pp. 1033–1038.

Iyer, S., I. Konstas, A. Cheung, and L. Zettlemoyer (2016). Summarizing source code using a neural
attention model. In Proceedings of the 54th Annual Meeting of the Association for Computational
Linguistics, Volume 1, pp. 2073–2083.

Iyyer, M., J. L. Boyd-Graber, L. M. B. Claudino, R. Socher, and H. Daumé III (2014). A neural
network for factoid question answering over paragraphs. In Empirical Methods in Natural
Language Processing, pp. 633–644.

Iyyer, M., V. Manjunatha, J. Boyd-Graber, and H. D. III (2015). Deep unordered composition
rivals syntactic methods for text classification. In Proceedings of the 53rd Annual Meeting of the
Association for Computational Linguistics and the 7th International Joint Conference on Natural
Language Processing, Volume 1, pp. 1681–1691.

Järvelin, K. and J. Kekäläinen (2002). Cumulated gain-based evaluation of ir techniques. ACM


Transactions on Information Systems (TOIS) 20(4), 422–446.

Johnson, A. E., T. J. Pollard, L. Shen, L.-w. H. Lehman, M. Feng, M. Ghassemi, B. Moody,


P. Szolovits, L. A. Celi, and R. G. Mark (2016). Mimic-iii, a freely accessible critical care
database. Scientific data 3.

Jurafsky, D. and H. James (2000). Speech and Language Processing: an Introduction to Natural
Language Processing, Computational Linguistics, and Speech. Pearson Education.

Kadlec, R., M. Schmid, O. Bajgar, and J. Kleindienst (2016). Text understanding with the attention
sum reader network. ACL.

Kaplan, P. W. and S. R. Benbadis (2013). How to write an eeg report: Dos and don’ts. Neurol-
ogy 80(Supplement 1), S43–S46.

Katz, B., S. Felshin, D. Yuret, A. Ibrahim, J. Lin, G. Marton, A. J. McFarland, and B. Temelkuran
(2002). Omnibase: Uniform access to heterogeneous data for question answering. In Interna-
tional Conference on Application of Natural Language to Information Systems, pp. 230–234.
Springer.

Kemp, B. and J. Olivan (2003). European data format ‘plus’ (edf+), an edf-alike standard format
for the exchange of physiological data. Clinical Neurophysiology 114(9), 1755–1761.

Kilicoglu, H., M. Fiszman, K. Roberts, and D. Demner-Fushman (2015). An ensemble method


for spelling correction in consumer health questions. In AMIA Annual Symposium Proceedings,
Volume 2015, pp. 727. American Medical Informatics Association.

Kim, J.-D., T. Ohta, Y. Tateisi, and J. Tsujii (2003). Genia corpus – a semantically annotated corpus
for bio-textmining. Bioinformatics 19(suppl 1), i180–i182.

Kingma, D. and J. Ba (2015). Adam: A method for stochastic optimization. ICLR.

Kingsbury, P. and M. Palmer (2002). From treebank to propbank. In LREC. Citeseer.

Knaus, W. A. (2002). APACHE 1978-2001: the development of a quality assurance system based
on prognosis: milestones and personal reflections. Archives of Surgery 137(1), 37–41.

Kocabaş, İ., B. T. Dinçer, and B. Karaoğlan (2014). A nonparametric term weighting method
for information retrieval based on measuring the divergence from independence. Information
retrieval 17(2), 153–176.

Koller, D. and N. Friedman (2009). Probabilistic graphical models: principles and techniques.
MIT press.

Kolomiyets, O. and M.-F. Moens (2011). A survey on question answering technology from an
information retrieval perspective. Information Sciences 181(24), 5412–5434.

Kosko, B. (1988). Bidirectional associative memories. IEEE Transactions on Systems, Man, and
Cybernetics 18(1), 49–60.

Kost, R., B. Littenberg, and E. S. Chen (2012). Exploring generalized association rule mining for
disease co-occurrences. In AMIA Annual Symposium Proceedings, Volume 2012, pp. 1284–1293.
AMIA.

Krogh, A., B. Larsson, G. von Heijne, and E. L. L. Sonnhammer (2001, January). Predicting
transmembrane protein topology with a hidden Markov model: application to complete genomes.
Journal of Molecular Biology 305(3), 567–580.

Kuperman, G. J., R. M. Gardner, and T. A. Pryor (2012). HELP: a dynamic hospital information
system. Springer Publishing Company, Incorporated.

Lafferty, J., A. McCallum, F. Pereira, et al. (2001). Conditional random fields: Probabilistic models
for segmenting and labeling sequence data. In Proceedings of the eighteenth international
conference on machine learning, ICML, Volume 1, pp. 282–289.

Lee, D. H. and O. Vielemeyer (2011). Analysis of overall level of evidence behind infectious
diseases society of america practice guidelines. Archives of Internal Medicine 171(1), 18–22.

Lee, J., D. J. Scott, M. Villarroel, G. D. Clifford, M. Saeed, and R. G. Mark (2011). Open-access
mimic-ii database for intensive care research. In Engineering in Medicine and Biology Society,
EMBC, pp. 8315–8318. IEEE.

Lee, K., T. Kwiatkowski, A. Parikh, and D. Das (2016). Learning recurrent span representations
for extractive question answering. arXiv preprint arXiv:1611.01436.

Lin, C.-Y. (2004, July). Rouge: A package for automatic evaluation of summaries. In S. S.
Marie-Francine Moens (Ed.), Text Summarization Branches Out: Proceedings of the ACL-04
Workshop, Barcelona, Spain, pp. 74–81. Association for Computational Linguistics.

Lipscomb, C. E. (2000). Medical subject headings (mesh). 88(3), 265.

Liu, B., L. Liu, A. Tsykin, G. J. Goodall, J. E. Green, M. Zhu, C. H. Kim, and J. Li (2010).
Identifying functional miRNA–mRNA regulatory modules with correspondence latent dirichlet
allocation. Bioinformatics 26(24), 3105–3111.

Liu, J. S. (1994, September). The Collapsed Gibbs Sampler in Bayesian Computations with Appli-
cations to a Gene Regulation Problem. Journal of the American Statistical Association 89(427),
958–966.

Liu, T.-Y. (2011). Learning to rank for information retrieval. Springer Science & Business Media.

Lotte, F. (2014). A tutorial on eeg signal-processing techniques for mental-state recognition in
brain–computer interfaces. In Guide to brain-computer music interfacing, pp. 133–161. Springer.

Manning, C. D., P. Raghavan, and H. Schütze (2008). Introduction to Information Retrieval,
Volume 1. Cambridge University Press.

Manning, C. D., M. Surdeanu, J. Bauer, J. Finkel, S. J. Bethard, and D. McClosky (2014).
The Stanford CoreNLP natural language processing toolkit. In Association for Computational
Linguistics (ACL) System Demonstrations, pp. 55–60.

Mikolov, T., I. Sutskever, K. Chen, G. S. Corrado, and J. Dean (2013). Distributed representations of
words and phrases and their compositionality. Advances in neural information processing systems.

Mikolov, T., M. Karafiát, L. Burget, J. Cernockỳ, and S. Khudanpur (2010). Recurrent neural
network based language model. In INTERSPEECH, Volume 2, pp. 3.

Minker, J. (1982). On indefinite databases and the closed world assumption. In 6th Conference on
Automated Deduction, pp. 292–308. Springer.

Moreda, P., H. Llorens, E. Saquete, and M. Palomar (2011). Combining semantic information in
question answering systems. Information Processing & Management 47(6), 870–885.

Muja, M. and D. G. Lowe (2009). Fast approximate nearest neighbors with automatic algorithm
configuration. In VISAPP International Conference on Computer Vision Theory and Applications.

Nagin, D. S. and C. L. Odgers (2010). Group-Based Trajectory Modeling in Clinical Research.
Annual Review of Clinical Psychology 6(1), 109–138.

Ng, K., J. Sun, J. Hu, and F. Wang (2015). Personalized predictive modeling and risk factor
identification using patient similarity. In AMIA Summits on Translational Science Proceedings.

O’Malley, K. J., K. F. Cook, M. D. Price, K. R. Wildes, J. F. Hurdle, and C. M. Ashton (2005).
Measuring diagnoses: Icd code accuracy. Health services research 40(5p2), 1620–1639.

Omari, A., D. Carmel, O. Rokhlenko, and I. Szpektor (2016). Novelty based ranking of human an-
swers for community questions. In Proceedings of the 39th International ACM SIGIR conference
on Research and Development in Information Retrieval, pp. 215–224. ACM.

Overby, C. L., J. Pathak, O. Gottesman, K. Haerian, A. Perotte, S. Murphy, K. Bruce, S. Johnson,
J. Talwalkar, Y. Shen, and others (2013). A collaborative approach to developing an electronic
health record phenotyping algorithm for drug-induced liver injury. Journal of the American
Medical Informatics Association, e243–e252.

Ozdemir, N. and E. Yildirim (2014). Patient specific seizure prediction system using hilbert
spectrum and bayesian networks classifiers. CMMM 2014, 10.

Pantel, P. and D. Lin (2002). Discovering word senses from text. In Proceedings of the eighth
ACM SIGKDD international conference on Knowledge discovery and data mining, pp. 613–619.
ACM.

Papineni, K., S. Roukos, T. Ward, and W.-J. Zhu (2002). Bleu: a method for automatic evaluation of
machine translation. In Proceedings of the 40th annual meeting on association for computational
linguistics, pp. 311–318. Association for Computational Linguistics.

Pearl, J. (1986). Fusion, propagation, and structuring in belief networks. Artificial Intelligence 29(3), 241–288.

Pearl, J. (1988). Probabilistic reasoning in intelligent systems: Networks of plausible inference. Morgan Kaufmann.

Pedregosa, F., G. Varoquaux, A. Gramfort, V. Michel, B. Thirion, O. Grisel, M. Blondel, P. Pret-
tenhofer, R. Weiss, V. Dubourg, et al. (2011). Scikit-learn: Machine learning in python. The
Journal of Machine Learning Research 12, 2825–2830.

Pennington, J., R. Socher, and C. Manning (2014). Glove: Global vectors for word representation.
In Proceedings of the 2014 conference on empirical methods in natural language processing
(EMNLP), pp. 1532–1543.

Pivovarov, R., Y. J. Coppleson, S. L. Gorman, D. K. Vawdrey, and N. Elhadad (2016). Can
patient record summarization support quality metric abstraction? In AMIA Annual Symposium
Proceedings, Volume 2016, pp. 1020. American Medical Informatics Association.

Porteous, I., D. Newman, A. Ihler, A. Asuncion, P. Smyth, and M. Welling (2008). Fast Collapsed
Gibbs Sampling for Latent Dirichlet Allocation. In Proceedings of the 14th ACM SIGKDD
International Conference on Knowledge Discovery and Data Mining, KDD ’08, New York, NY,
USA, pp. 569–577. ACM.

Punyakanok, V., D. Roth, and W.-t. Yih (2004). Mapping dependencies trees: An application to
question answering. In Proceedings of AI & Math, pp. 1–10.

Rabiner, L. (1989, February). A tutorial on hidden Markov models and selected applications in
speech recognition. Proceedings of the IEEE 77(2), 257–286.

Raiffa, H. and R. Schlaifer (1961). Applied statistical decision theory. Harvard University Press.

Rao, J., H. He, and J. Lin (2016). Noise-contrastive estimation for answer selection with deep
neural networks. In Proceedings of the 25th ACM International on Conference on Information
and Knowledge Management, pp. 1913–1916. ACM.

Rashid, M. A., M. T. Hoque, and A. Sattar. Association rules mining based clinical observations.

Ratner, R., J. Eden, D. Wolman, S. Greenfield, and H. Sox (2009). Initial national priorities for
comparative effectiveness research. National Academies Press.

Recht, B., C. Re, S. Wright, and F. Niu (2011). Hogwild: A lock-free approach to parallelizing
stochastic gradient descent. In Advances in Neural Information Processing Systems, NIPS, pp.
693–701.

Roberts, J. M., T. A. Parlikar, T. Heldt, and G. C. Verghese (2006). Bayesian networks for
cardiovascular monitoring. EMBS 2006.

Roberts, K. and S. Harabagiu (2011). A flexible framework for deriving assertions from electronic
medical records. JAMIA 18(5), 568–573.

Roberts, K., M. Simpson, D. Demner-Fushman, E. Voorhees, and W. Hersh (2016). State-of-
the-art in biomedical literature retrieval for clinical cases: a survey of the trec 2014 cds track.
Information Retrieval Journal 19(1-2), 113–148.

Roberts, K., M. S. Simpson, E. Voorhees, and W. R. Hersh (2015). Overview of the trec 2015
clinical decision support track. In The Twenty-Fourth Text REtrieval Conference Proceedings
(TREC 2015).

Robertson, S. E., S. Walker, M. M. Beaulieu, M. Gatford, and A. Payne (1996). Okapi at TREC-
4. In Proceedings of the Fourth Text REtrieval Conference (TREC), pp. 73–97. NIST Special
Publication.

Robertson, S. E., S. Walker, S. Jones, M. M. Hancock-Beaulieu, M. Gatford, et al. (1995). Okapi
at trec-3. Proceedings of the Third Text REtrieval Conference (TREC 1995) 109.

Ross, J. S., G. K. Mulvey, E. M. Hines, S. E. Nissen, and H. M. Krumholz (2009). Trial publication
after registration in clinicaltrials. gov: a cross-sectional analysis. PLoS medicine 6(9), e1000144.

Rush, A. M., S. Chopra, and J. Weston (2015, September). A neural attention model for sentence
summarization. EMNLP, 379–389.

Safran, C., M. Bloomrosen, W. E. Hammond, S. Labkoff, S. Markel-Fox, P. C. Tang, and D. E.
Detmer (2007). Toward a national framework for the secondary use of health data: an american
medical informatics association white paper. Journal of the American Medical Informatics
Association 14(1), 1–9.

Sahoo, S. S., S. D. Lhatoo, D. K. Gupta, L. Cui, M. Zhao, C. Jayapandian, A. Bozorgi, and G.-Q.
Zhang (2014). Epilepsy and seizure ontology: Towards an epilepsy informatics infrastructure for
clinical research and patient care. Journal of the American Medical Informatics Association 21(1),
82–89.

Salton, G. (1971). The SMART retrieval system—experiments in automatic document processing.
Prentice-Hall, Inc.

Sarkar, M. and T.-Y. Leong (2001). Fuzzy k-means clustering with missing values. In Proceedings
of the AMIA Symposium, pp. 588. American Medical Informatics Association.

Savova, G. K., J. J. Masanz, P. V. Ogren, J. Zheng, S. Sohn, K. C. Kipper-Schuler, and C. G.
Chute (2010, September). Mayo clinical Text Analysis and Knowledge Extraction System
(cTAKES): architecture, component evaluation and applications. Journal of the American
Medical Informatics Association 17(5), 507–513.

Scheuermann, R. H., W. Ceusters, and B. Smith (2009, March). Toward an Ontological Treatment
of Disease and Diagnosis. Summit on Translational Bioinformatics 2009, 116–120.

Schlangen, D., A. Lascarides, and A. Copestake (2003). Resolving underspecification using
discourse information. Perspectives on Dialogue in the New Millennium 114, 287.

Schuyler, P., W. Hole, M. Tuttle, and D. Sherertz (1993). The umls metathesaurus: Representing
different views of biomedical concepts. Bulletin of the Medical Library Association 81(2), 217.

Seo, M., A. Kembhavi, A. Farhadi, and H. Hajishirzi (2016). Bidirectional attention flow for
machine comprehension. arXiv preprint arXiv:1611.01603.

Shivade, C., P. Raghavan, E. Fosler-Lussier, P. J. Embi, N. Elhadad, S. B. Johnson, and A. M. Lai
(2014). A review of approaches to identifying patient phenotype cohorts using electronic health
records. Journal of the American Medical Informatics Association 21(2), 221–230.

Simpson, M. S., E. Voorhees, and W. Hersh (2014). Overview of the trec 2014 clinical deci-
sion support track. In Text Retrieval Conference, TREC. National Institute of Standards and
Technology.

Slocum, J. (1985). A survey of machine translation: its history, current status, and future prospects.
Computational linguistics 11(1), 1–17.

Smith, P. C., R. Araya-Guerra, C. Bublitz, B. Parnes, L. M. Dickinson, R. Van Vorst, J. M. Westfall,
and W. D. Pace (2005). Missing clinical information during primary care visits. JAMA 293(5),
565–571.

Smith, S. (2005). Eeg in the diagnosis, classification, and management of patients with epilepsy.
Journal of Neurology, Neurosurgery & Psychiatry 76(suppl 2), ii2–ii7.

Sondhi, P., J. Sun, H. Tong, and C. Zhai (2012). Sympgraph: a framework for mining clinical notes
through symptom relation graphs. In SIGKDD, KDD ’12, New York, NY, USA, pp. 1167–1175.
ACM.

Song, Y., Y. He, Q. Hu, and L. He (2015). Ecnu at 2015 cds track: Two re-ranking methods in
medical information retrieval. In Proceedings of the 2015 Text Retrieval Conference.

Srivastava, R. K., K. Greff, and J. Schmidhuber (2015a). Highway networks. Deep Learning
Workshop at the International Conference on Machine Learning (ICML).

Srivastava, R. K., K. Greff, and J. Schmidhuber (2015b). Training very deep networks. In Advances
in neural information processing systems (NIPS), pp. 2377–2385.

Stearns, M., C. Price, K. Spackman, and A. Wang (2001). SNOMED clinical terms: overview of
the development process and project status. In Proceedings of the AMIA Symposium, pp. 662.
American Medical Informatics Association.

Stone, P. J., R. F. Bales, J. Z. Namenwirth, and D. M. Ogilvie (1962). The general inquirer: A
computer system for content analysis and retrieval based on the sentence as a unit of information.
Behavioral Science 7(4), 484–498.

Stone, P. J., D. C. Dunphy, and M. S. Smith (1966). The General Inquirer: A Computer Approach
to Content Analysis. MIT press.

Stubbs, A., C. Kotfila, H. Xu, and O. Uzuner (2015). Identifying risk factors for heart disease over
time: Overview of 2014 i2b2/uthealth shared task track 2. Journal of biomedical informatics.

Sukhbaatar, S., J. Weston, R. Fergus, et al. (2015). End-to-end memory networks. In Advances in
neural information processing systems, pp. 2440–2448.

Swartout, W. R. (1985). Explaining and justifying expert consulting programs. In Computer-
assisted medical decision making, pp. 254–271. Springer.

Syeda-Mahmood, T., F. Wang, D. Beymer, A. Amir, M. Richmond, and S. Hashmi (2007). Aalim:
Multimodal mining for cardiac decision support. In Computers in Cardiology, 2007, pp. 209–212.
IEEE.

Tricoci, P., J. M. Allen, J. M. Kramer, R. M. Califf, and S. C. Smith (2009). Scientific evidence
underlying the acc/aha clinical practice guidelines. JAMA 301(8), 831–841.

Tsoumakas, G. and I. Katakis (2007). Multi-label classification: An overview. International
Journal of Data Warehousing and Mining (IJDWM) 3(3), 1–13.

Tsuruoka, Y., Y. Tateishi, J.-D. Kim, T. Ohta, J. McNaught, S. Ananiadou, and J. Tsujii (2005).
Developing a robust part-of-speech tagger for biomedical text. In Panhellenic Conference on
Informatics, pp. 382–392. Springer.

Uzuner, Ö., B. R. South, S. Shen, and S. L. DuVall (2011). 2010 i2b2/va challenge on con-
cepts, assertions, and relations in clinical text. Journal of the American Medical Informatics
Association 18(5), 552–556.

Uzuner, Ö. and A. Stubbs (2015). Practical applications for natural language processing in clinical
research: The 2014 i2b2/uthealth shared tasks. Journal of Biomedical Informatics 58, S1–S5.

Varile, G. B. and A. Zampolli (1997). Survey of the state of the art in human language technology,
Volume 13. Cambridge University Press.

Varmus, H., D. Lipman, and P. Brown (1999). Pubmed central: An nih-operated site for electronic
distribution of life sciences research reports. 24, 1999.

Vickers, A. J. (2011). Prediction models in cancer care. CA: a cancer journal for clinicians 61(5),
315–326.

Vontobel, P. O. (2013). Counting in graph covers: A combinatorial characterization of the bethe
entropy function. IEEE Transactions on Information Theory 59(9), 6018–6048.

Voorhees, E. and W. Hersh (2012). Overview of the trec 2012 medical records track. In TREC
2012, Gaithersburg, MD. National Institute of Standards and Technology. Unpublished. Draft
available at http://trec.nist.gov/.

Voorhees, E. and R. Tong (2011). Overview of the trec 2011 medical records track. In TREC 2011,
Gaithersburg, MD. National Institute of Standards and Technology.

Voorhees, E. M. (1994). Query expansion using lexical-semantic relations. In SIGIR, pp. 61–69.
Springer.

Voorhees, E. M. et al. (1999). The trec-8 question answering track report. In Text Retrieval
Conference, TREC, Volume 99, pp. 77–82.

Voorhees, E. M. and D. Harman (1997). Overview of the sixth text retrieval conference (trec-6).
In TREC, pp. 1–24.

Wainwright, M. J. (2006). Estimating the wrong graphical model: Benefits in the computation-
limited setting. The Journal of Machine Learning Research 7, 1829–1859.

Wang, M., N. A. Smith, and T. Mitamura (2007). What is the jeopardy model? a quasi-synchronous
grammar for qa. In EMNLP-CoNLL, Volume 7, pp. 22–32.

Wang, X., F. Wang, J. Hu, and R. Sorrentino (2014). Exploring joint disease risk prediction. In
AMIA Annual Symposium proceedings, Volume 2014, pp. 1180–1187.

Wang, Z., H. Mi, W. Hamza, and R. Florian (2016). Multi-perspective context matching for machine
comprehension. arXiv preprint arXiv:1612.04211.

Wasson, J. H., H. C. Sox, et al. (1985). Clinical prediction rules. applications and methodological
standards. NEJM.

Weiner, M. (2011). Evidence generation using data-centric, prospective, outcomes research method-
ologies. San Francisco, CA, Presentation at AMIA Clinical Research Informatics Summit.

Wingate, D., N. D. Goodman, D. M. Roy, and J. B. Tenenbaum (2009). The infinite latent events
model. In Proceedings of the Twenty-Fifth Conference on Uncertainty in Artificial Intelligence,
pp. 607–614. AUAI Press.

Woods, W. A. (1973). Progress in natural language understanding: an application to lunar geology.
In Proceedings of the June 4-8, 1973, national computer conference and exposition, pp. 441–450.
ACM.

Woods, W. A., R. M. Kaplan, and B. Nash-Webber (1972). The Lunar Sciences: Natural Language
Information System: Final Report. Bolt Beranek and Newman.

Xu, K., S. Reddy, Y. Feng, S. Huang, and D. Zhao (2016). Question answering on freebase via
relation extraction and textual evidence. ACL.

Yao, X. and B. Van Durme (2014). Information extraction over structured data: Question answering
with freebase. In ACL, pp. 956–966. Citeseer.

Yedidia, J. S., W. T. Freeman, and Y. Weiss (2005). Constructing free-energy approximations and
generalized belief propagation algorithms. IEEE Transactions on Information Theory 51(7), 2282–2312.

Yih, W.-t., M. Richardson, C. Meek, M.-W. Chang, and J. Suh (2016). The value of semantic
parse labeling for knowledge base question answering. In Proceedings of ACL.

Yilmaz, E. and J. A. Aslam (2006). Estimating average precision with incomplete and imper-
fect judgments. In Proceedings of the 15th ACM international conference on Information and
knowledge management, pp. 102–111. ACM.

Yilmaz, E., E. Kanoulas, and J. A. Aslam (2008). A simple and efficient sampling method for
estimating ap and ndcg. In Proceedings of the 31st annual international ACM SIGIR conference
on Research and development in information retrieval, pp. 603–610. ACM.

You, R., Y. Zhou, S. Peng, S. Zhu, and R. China (2015). Fdumedsearch at trec 2015 clinical
decision support track. In Text Retrieval Conference, TREC.

Zaragoza, H., N. Craswell, M. J. Taylor, S. Saria, and S. E. Robertson (2004). Microsoft Cambridge
at TREC 13: Web and Hard Tracks. In TREC, Volume 4, pp. 1–1.

Zhai, C. and J. Lafferty (2001). A study of smoothing methods for language models applied to ad
hoc information retrieval. In Proceedings of the 24th annual international ACM SIGIR conference
on Research and development in information retrieval, pp. 334–342. ACM.

Zhai, C. and J. Lafferty (2004). A study of smoothing methods for language models applied to
information retrieval. ACM Transactions on Information Systems (TOIS) 22(2), 179–214.

Zhang, Y., M. Brady, and S. Smith (2001, January). Segmentation of brain MR images through a
hidden Markov random field model and the expectation-maximization algorithm. IEEE Trans-
actions on Medical Imaging 20(1), 45–57.

BIOGRAPHICAL SKETCH

Travis Goodwin was born and raised in Dallas, Texas. He taught himself programming while
attending Highland Park Middle School by adding custom levels to video games. While attending
Highland Park High School, Travis was formally introduced to the field of Computer Science.
Although Rhetorical Writing and English Literature were Travis’s favorite classes, Travis chose to
focus on Computer Science because he “never wanted to write another paper as long as he lived”.
He graduated high school in the Spring of 2007.

Travis was admitted into the Erik Jonsson School of Engineering and Computer Science at The
University of Texas at Dallas (UTD) in the Fall of 2007. In Spring of 2011 – the Senior year of
his undergraduate degree – Travis enrolled in the graduate-level “Information Retrieval” course
taught by Dr. Sanda Harabagiu. That summer, he worked in Dr. Harabagiu’s lab along with her
students Bryan Rink and Kirk Roberts, gaining not only valuable experience, but a strong passion
for research. Although Travis graduated magna cum laude with a Bachelor’s Degree in Computer
Science, he was inspired to remain at UTD and pursue his Master’s Degree and then his PhD.

Travis was formally admitted into the PhD program in the Spring of 2012 and was awarded the
Excellence in Education Doctoral Fellowship by the Department of Computer Science. Travis
completed his Master of Science at UTD in Fall of 2013. Throughout his tenure at UTD, Travis
published four journal papers and twenty-one refereed conference papers. His work on Medical
Question Answering for Clinical Decision Support was awarded Best Student Paper at the twenty-
fifth ACM International Conference on Information and Knowledge Management (CIKM 2016),
while his work on Inferring Clinical Correlations from EEG Reports with Deep Neural Learning
was awarded the Homer R. Warner Award at the Annual Symposium of the American Medical
Informatics Association (AMIA 2017).

CURRICULUM VITAE

TRAVIS R. GOODWIN
Erik Jonsson School of Engineering & Computer Science
Department of Computer Science
The University of Texas at Dallas
800 West Campbell Road
Mail Station EC 31, Richardson, Texas 75080-3021
January 1, 2017

Educational History
• Ph.D., Computer Science, 5/2018, University of Texas at Dallas, Richardson, TX
Dissertation: Medical Question Answering and Patient Cohort Retrieval
Advisor: Sanda M. Harabagiu
• M.S., Computer Science, 8/2013, University of Texas at Dallas, Richardson, TX
• B.S., Computer Science, 5/2011, University of Texas at Dallas, Richardson, TX

Employment History
• Research Assistant 2015–present
Department of Computer Science
University of Texas at Dallas
Richardson, Texas, 75083-0688
• Teaching Assistant 2014–2015
Department of Computer Science
University of Texas at Dallas
Richardson, Texas, 75083-0688
• Research Fellow 2012–2015
Department of Computer Science
University of Texas at Dallas
Richardson, Texas, 75083-0688

Professional Recognitions and Honors


Best Papers
• Clinical Research Informatics Distinguished Paper Award
Ramon Maldonado, Travis R. Goodwin, and Sanda Harabagiu
“Memory-Augmented Active Deep Learning for Identifying Relations Between Medical
Concepts in Electroencephalography Reports”
Proceedings of the American Medical Informatics Association Informatics Summit (AMIA-
TBI 2018)
• Homer Warner Award
Travis R. Goodwin and Sanda M. Harabagiu
“Inferring Clinical Correlations from EEG Reports with Deep Neural Learning”
Proceedings of the American Medical Informatics Association Annual Symposium (AMIA
2017).

• Nominated for Distinguished Paper Award
Ramon Maldonado, Travis R. Goodwin and Sanda M. Harabagiu
“Active Deep Learning-Based Annotation of Electroencephalography Reports for Cohort
Identification”
Proceedings of the American Medical Informatics Association Joint Summit on Translation
Science (AMIA-TBI 2017).

• Best Student Paper Award
Travis R. Goodwin and Sanda M. Harabagiu
“Medical Question Answering for Clinical Decision Support”
The 25th ACM International Conference on Information and Knowledge Management (CIKM-
2016).

Fellowships

• Excellence in Education Doctoral Fellowship (2012–2015)

Professional Memberships
• AMIA (American Medical Informatics Association) (2016–present)

• ACM (Association for Computing Machinery) (2015–present)

• ACM-SIGIR (ACM SIGIR Special Interest Group on Information Retrieval) (2016–present)

Achievements in Original Investigation


Journal Articles (Refereed)

J1 Travis R. Goodwin and Sanda M. Harabagiu (2018) “Learning Relevance Models for Patient
Cohort Retrieval”, Accepted for publication in the Open Access Journal of the American
Medical Informatics Association

J2 Travis R. Goodwin and Sanda M. Harabagiu (2017) “Knowledge Representations and Infer-
ence Techniques for Medical Question Answering”, ACM Transactions on Intelligent Systems
and Technology, Volume 9 Issue 2, October 2017 (26 pages)

J3 Travis R. Goodwin, Ramon M. Maldonado and Sanda M. Harabagiu (2017) “Automatic
Recognition of Symptom Severity from Psychiatric Evaluation Reports”, Journal of Biomed-
ical Informatics, 75 S71-S84
J4 Travis R. Goodwin and Sanda M. Harabagiu (2013) “Graphical Induction of Qualified
Medical Knowledge”, International Journal of Semantic Computing, Vol 7(4):377-406

Conference Proceedings (Refereed)

C1 Travis R. Goodwin, Michael A. Skinner, Sanda M. Harabagiu, “Automatically Linking Reg-
istered Clinical Trials to their Published Results with Deep Highway Networks”, Proceedings
of the American Medical Informatics Association Informatics Summit (AMIA-TBI 2018), San
Francisco, CA, USA, March 2018.

C2 Ramon Maldonado, Travis R. Goodwin, and Sanda Harabagiu, “Memory-Augmented Active
Deep Learning for Identifying Relations Between Medical Concepts in Electroencephalog-
raphy Reports”, Proceedings of the American Medical Informatics Association Informatics
Summit (AMIA-TBI 2018), San Francisco, CA, USA, March 2018 (Clinical Research Infor-
matics Distinguished Paper Award)

C3 Travis R. Goodwin and Sanda M. Harabagiu, “Inferring Clinical Correlations from EEG
Reports with Deep Neural Learning”, in Proceedings of the American Medical Informatics
Association Annual Symposium (AMIA), pp 770-779, Washington, DC, USA, November
2017 (Homer Warner Award)

C4 Ramon Maldonado, Travis R. Goodwin, Michael A. Skinner and Sanda M. Harabagiu,
“Deep Learning Meets Biomedical Ontologies: Knowledge Embeddings for Epilepsy”, in
Proceedings of the American Medical Informatics Association Annual Symposium (AMIA),
pp 1226-1235, Washington, DC, USA, November 2017

C5 Travis R. Goodwin, Michael E. Bowen, and Sanda Harabagiu, “A Data-driven Method for
the Early Identification of Diabetes and Prediabetes”, in Proceedings of the American
Medical Informatics Association Annual Symposium (AMIA), pp 2015, Washington, DC,
USA, November 2017

C6 Stuart J. Taylor, Travis R. Goodwin and Sanda M. Harabagiu, “An Evaluation of Syntactic
Dependency Parsers on Clinical Data”, in Proceedings of the American Medical Informatics
Association Annual Symposium (AMIA), pp 2205, Washington, DC, USA, November 2017

C7 Ramon Maldonado, Travis R. Goodwin and Sanda M. Harabagiu, “Active Deep Learning-
Based Annotation of Electroencephalography Reports for Cohort Identification”, in Proceed-
ings of the American Medical Informatics Association Joint Summits on Clinical Research
Informatics (AMIA-CRI), pp 229-238, San Francisco, CA, USA, March 2017

C8 Travis R. Goodwin and Sanda M. Harabagiu, “Deep Learning from EEG Reports for In-
ferring Underspecified Information”, in Proceedings of the American Medical Informatics
Association Joint Summits on Clinical Research Informatics (AMIA-CRI), pp 112-121, San
Francisco, CA, USA, March 2017
C9 Travis R. Goodwin and Sanda M. Harabagiu, “Multi-Modal Patient Cohort Identification
from EEG Report and Signal Data”, in Proceedings of the American Medical Informatics
Association Annual Symposium (AMIA), pp 1794-1803, Chicago, IL, USA, November 2016
C10 Travis R. Goodwin, Ramon Maldonado and Sanda M. Harabagiu, “Identifying Symptom
Severity Levels by Combining Learning-to-Rank and Linear Regression”, in Proceedings
of the American Medical Informatics Association Annual (AMIA) Workshop of the Na-
tional Institute of Mental Health (NIMH) Centers of Excellence in Genomic Science (CEGS)
Neuropsychiatric Genome-Scale and RDoC Individualized Domains (N-GRID) Challenge,
Chicago, IL, USA, November 2016
C11 Travis R. Goodwin and Sanda M. Harabagiu, “Medical Question Answering for Clinical
Decision Support”, in Proceedings of the Twenty-fifth ACM International Conference on
Information and Knowledge Management (CIKM 2016), pp 297-306, Indianapolis, Indiana,
USA, October 2016. (Best Student Paper Award)
C12 Travis R. Goodwin and Sanda M. Harabagiu, “Interaction of Risk Factors Inferred from Elec-
tronic Medical Records”, in Proceedings of the American Medical Informatics Association
Summit on Translational Bioinformatics (AMIA-TBI 2016), pp 78-87, San Francisco, CA,
USA, March 2016
C13 Travis R. Goodwin and Sanda M. Harabagiu, “Embedding Open-domain Common-sense
Knowledge from Text”, in Proceedings of the 10th International Conference on Language
Resources and Evaluation (LREC-2016), May 2016
C14 Travis R. Goodwin and Sanda M. Harabagiu, “A Predictive Chronological Model of Mul-
tiple Clinical Observations”, in Proceedings of the International Conference on Healthcare
Informatics 2015 (ICHI 2015), pp 253-262, October 2015
C15 Ramon Maldonado, Travis R. Goodwin, Sanda M. Harabagiu and Michael A. Skinner,
“The Role of Semantic and Discourse Information in Learning the Structure of Surgical
Procedures”, in Proceedings of the International Conference on Healthcare Informatics
2015 (ICHI 2015), pp 223-232, October 2015
C16 Travis R. Goodwin and Sanda M. Harabagiu, “A Probabilistic Reasoning Method for Predict-
ing the Progression of Clinical Findings from Electronic Medical Records”, in Proceedings
of the American Medical Informatics Association Summit on Translational Bioinformatics
(AMIA-TBI 2015), pp 61-65, San Francisco, CA, USA, March 2015
C17 Travis Goodwin and Sanda M. Harabagiu, “Clinical Data-Driven Probabilistic Graph Pro-
cessing”, in the Proceedings of the 9th International Conference on Language Resources and
Evaluation (LREC-2014), pp 101-108, May 2014, Reykjavik, Iceland
C18 Travis Goodwin and Sanda M. Harabagiu, “Automatic Generation of a Qualified Medical
Knowledge Graph and Its Usage for Retrieving Patient Cohorts from Electronic Medical
Records”, in the Proceedings of the IEEE International Conference on Semantic Computing
(ICSC-2013), pp 363–370, September 2013, Irvine CA.
C19 Travis Goodwin and Sanda M. Harabagiu, “The Impact of Belief Values on the Identification
of Patient Cohorts”, in the Proceedings of the 4th International Conference of the CLEF
Initiative (CLEF-2013), pp 155–166, September 2013, Valencia, Spain.

C20 Travis Goodwin, Bryan Rink, Kirk Roberts, and Sanda M. Harabagiu, “UTDHLT: CO-
PACETIC System for Choosing Plausible Alternatives”. In Proceedings of the 6th Interna-
tional Workshop on Semantic Evaluation (SemEval), Montreal, Canada, June 2012

C21 Kirk Roberts, Travis Goodwin and Sanda M. Harabagiu, "Annotating Spatial Containment
Relations Between Events", Proceedings of the Eighth International Conference on Language
Resources and Evaluation (LREC-2012), Istanbul, Turkey, May 2012
Presentations to Professional Meetings (Refereed)
P1 Travis R. Goodwin, Michael A. Skinner, Sanda M. Harabagiu, “Automatically Linking Regis-
tered Clinical Trials to their Published Results with Deep Highway Networks”, the American
Medical Informatics Association Informatics Summit (AMIA-TBI 2018), San Francisco, CA,
USA, March 2018

P2 Travis R. Goodwin and Sanda M. Harabagiu, “Inferring Clinical Correlations from EEG
Reports with Deep Neural Learning”, the American Medical Informatics Association Annual
Symposium (AMIA), Washington, DC, USA, November 2017

P3 Travis R. Goodwin and Sanda M. Harabagiu, “Deep Learning from EEG Reports for In-
ferring Underspecified Information”, in Proceedings of the American Medical Informatics
Association Joint Summits on Clinical Research Informatics (AMIA-CRI), pp 112-121, San
Francisco, CA, USA, March 2017

P4 Travis R. Goodwin and Sanda M. Harabagiu, “Inferring Clinical Correlations from EEG
Reports with Deep Neural Learning”, in Proceedings of the American Medical Informatics
Association Annual Symposium (AMIA), Washington, DC, USA, November 2017

P5 Travis R. Goodwin and Sanda M. Harabagiu, “Multi-Modal Patient Cohort Identification
from EEG Report and Signal Data”, the American Medical Informatics Association Annual
Symposium (AMIA), Chicago, IL, USA, November 2016

P6 Travis R. Goodwin, Ramon Maldonado and Sanda M. Harabagiu, “Identifying Symptom
Severity Levels by Combining Learning-to-Rank and Linear Regression”, in the Ameri-
can Medical Informatics Association Annual (AMIA) Workshop of the National Institute of
Mental Health (NIMH) Centers of Excellence in Genomic Science (CEGS) Neuropsychiatric
Genome-Scale and RDoC Individualized Domains (N-GRID) Challenge, Chicago, IL, USA,
November 2016

P7 Travis R. Goodwin and Sanda M. Harabagiu, “Medical Question Answering for Clinical
Decision Support”, the 25th ACM International Conference on Information and Knowledge
Management (CIKM 2016), Indianapolis, Indiana, USA, October 2016.
P8 Travis R. Goodwin and Sanda M. Harabagiu, “Embedding Open-domain Common-sense
Knowledge from Text”, the 10th International Conference on Language Resources and
Evaluation (LREC-2016) , May 2016

P9 Travis R. Goodwin and Sanda M. Harabagiu, “Interaction of Risk Factors Inferred from
Electronic Medical Records”, the American Medical Informatics Association Summit on
Translational Bioinformatics (AMIA-TBI 2015), San Francisco, CA, USA, March 2016

P10 Travis Goodwin and Sanda M. Harabagiu, “A Predictive Chronological Model of Multi-
ple Clinical Observations”, the IEEE International Conference on Healthcare Informatics
2015 (ICHI 2015) , October 2015

P11 Travis Goodwin and Sanda M. Harabagiu, “A Probabilistic Reasoning Method for Predicting
the Progression of Clinical Findings from Electronic Medical Records”, the American Medi-
cal Informatics Association Summit on Translational Bioinformatics (AMIA-TBI 2015), San
Francisco, CA, USA, March 2015
