WatsonPaths: Scenario-based Question Answering and Inference over Unstructured Information

* This work was done while at the IBM Thomas J. Watson Research Center
scenarios, as in the above examples, passage matching by itself is often insufficient to locate the answer. This is because scenario-based question answering requires integrating and reasoning over information from multiple sources. Furthermore, we must often apply general knowledge to a specific case, as in a medical scenario about a patient.

In this paper, we present a new approach that builds on Watson's strengths and is in line with the human reasoning process we observed. We break down the input scenario into individual pieces of information, ask relevant subquestions to conclude new information, and combine these results into an assertion graph. We then perform probabilistic inference over the graph to conclude the answer to the overall question. This process is repeated to extend the graph until a stopping condition is met. Because we use Watson to answer the subquestions, and because we attempt to construct paths of inference to a final answer, we call our system WatsonPaths™.

In the WatsonPaths graph, the evidence is drawn from a variety of sources including general knowledge encyclopedias, domain-specific books, structured knowledge bases, and semi-structured knowledge bases. We were motivated by the desire to design a solution that could harness Watson, and we observed that each edge in this graph could correspond to a question asked of Watson.

An added dimension to WatsonPaths is the ability to interact with the user. The original Watson system that won Jeopardy! was largely non-interactive. For many applications, it is important to engage the user in the problem solving process. WatsonPaths has the ability to elicit feedback from the user as it works through a scenario.

Figure 1: Simple Diagnosis Graph for a Patient with Erythropoietin Deficiency. (Nodes such as "Patient has anemia," "Patient has normocytic anemia," and "Patient is at risk for Erythropoietin deficiency" lead to the conclusion "Most likely cause of low hemoglobin conc. is Erythropoietin deficiency," with supporting evidence passages such as "Erythropoietin is produced in the kidneys," "Normocytic anemia is a type of anemia with normal red blood cells," and "Erythropoietin deficiency is a cause of normocytic anemia.")

2 WatsonPaths Medical Use Case

Although WatsonPaths enables general-purpose scenario-based question answering, we decided to start by focusing our attention on the medical domain. We focused on the problem of patient scenario analysis, where the goal is typically a diagnosis or a treatment recommendation.

To explore this kind of problem solving, we obtained a set of medical test preparation questions. These are multiple choice medical questions based on an unstructured or semi-structured natural language description of a patient. Although WatsonPaths is not restricted to multiple choice questions, we saw multiple choice questions as a good starting point for development. Many of these questions involve diagnosis, either as the entire question, as in the previous medical example, or as an intermediate step, as in the following example:

    A 63-year-old patient is sent to the neurologist with a clinical picture of resting tremor that began 2 years ago. At first it was only on the left hand, but now it compromises the whole arm. At physical exam, the patient has an unexpressive face and difficulty in walking, and a continuous movement of the tip of the first digit over the tip of the second digit of the left hand is seen at rest. What part of his nervous system is most likely affected?

For this question, it is useful to diagnose that the patient has Parkinson's disease before determining which part of his nervous system is most likely affected. These multi-step inferences are a natural fit for the graphs that WatsonPaths constructs. In this example, the diagnosis is the missing link on the way to the final answer.

3 Scenario-based Question Answering

In scenario-based question answering, the system receives a scenario description that ends with a punchline question. For instance, the punchline question in the Parkinson's example is "What part of his nervous system is most likely affected?" Instead of treating the entire scenario as one monolithic question, as would Watson, WatsonPaths explores multiple facts in the scenario in parallel and reasons with the results of its exploration as a whole to arrive at the most likely conclusion regarding the punchline question. The architecture of WatsonPaths is shown in Figure 2.
Figure 2: Scenario-based Question Answering Architecture. (The pipeline takes an input scenario through Scenario Analysis, Node Prioritization, Relation (Edge) Generation (which may ask questions to Watson), and confidence estimation in nodes (the Belief Engine), followed by Hypothesis Identification, repeating until completion, which may be defined in different ways; Hypothesis Confidence Refinement (a learned model) then produces final confidences in the hypotheses.)

3.1 Scenario Analysis

The first step in the pipeline is scenario analysis, where we identify factors in the input scenario that may be of importance. In the medical domain, the factors may include demographics ("32-year-old woman"), pre-existing conditions ("type 1 diabetes mellitus"), signs and symptoms ("progressive renal failure"), and test results ("hemoglobin concentration is 9 g/dL," "normochromic cells," "normocytic cells"). The extracted factors become nodes in a graph structure called the assertion graph. The assertion graph structure is defined in Section 4, while more details of the scenario analysis process are given in Section 5.

3.2 Node Prioritization

The next step is node prioritization, where we decide which nodes in the graph are most important for solving the problem. In a small scenario like this example, we may be able to explore everything, but in general this will not be the case. Factors that affect the priority of a node may include the system's confidence in the node assertion or the system's estimation of how fruitful it would be to expand a node. For example, normal test results and demographic information are generally less useful for starting a diagnosis than symptoms and abnormal test results.

3.3 Relation Generation

The relation generation step, which is described in more detail in Section 6, builds the assertion graph. We do this primarily by asking Watson questions about the factors. In medicine we want to know the causes of the findings and abnormal test results that are consistent with the patient's demographic information and normal test results. Given the scenario in the Introduction, we could ask, "What does type 1 diabetes mellitus cause?" We use a medical ontology to guide the process of formulating subquestions to ask Watson. Relevant factors may also be combined to form a single, more targeted question. Because in this step we want to emphasize recall, we take several of Watson's highly-ranked answers. The exact number of answers taken, or the confidence threshold, are parameters that must be tuned. Given a set of answers, we add them to the graph as nodes, with edges from nodes that were used in questions to nodes that were in answers. The edge is labeled with the relation used to formulate the question (like causes or indicates), and the strength of the edge is initially set to Watson's confidence in the answer. Although Watson is the primary way we add edges to the graph, WatsonPaths allows for any number of relation generator components to post edges to the graph.

3.4 Belief Computation

Once the assertion graph has been expanded in this way, we recompute the confidences of nodes in the graph based on new information. We do this using probabilistic inference systems that are described in Section 7. The inference systems take a holistic view of the assertion graph and try to reconcile the results of multiple paths of exploration.

3.5 Hypothesis Identification

As Figure 2 shows, this process can go through multiple iterations, during which the nodes that were the answers to the previous round of questions can be used to ask the next round of questions, producing more nodes and edges in the graph. After each iteration we may do hypothesis identification, where some nodes in the graph are identified as potential final answers to the punchline question (for example, the most likely diagnoses of a patient's problem). In some situations hypotheses may be provided up front: a physician may have a list of competing diagnoses and want to explore the evidence for each. But in general the system needs to identify these. Hypothesis nodes may be treated differently in later iterations. For instance, we may attempt to do backward chaining from the hypotheses, asking Watson what things, if they were true of the patient, would support or refute a hypothesis. The process may terminate after a fixed number of iterations or based on some other criterion like confidence in the hypotheses.

While hypothesis identification is part of WatsonPaths, it is not described in detail in this paper. In the system that generates the results we present in Section 10, no hypothesis identification is necessary because the multiple choice answers are provided. That system always does one iteration of expansion, both forward from the identified factors and backward from the hypotheses, before stopping.

3.6 Hypothesis Confidence Refinement

As described so far, WatsonPaths' confidence in each hypothesis depends on the strengths of the edges leading to it, and since our primary relation (edge) generator is Watson, the hypothesis confidence depends heavily on the confidence of Watson's answers. Having good answer confidence depends on having a representative set of question/answer pairs with which to train Watson. The following question arises: What can we do if we do not have a representative set of question/answer pairs, but we do have training examples for entire scenarios (e.g., correct diagnoses associated with patient scenarios)? To leverage the available scenario-level ground truth, we have built machine learning techniques to learn a refinement of Watson's confidence estimation that produces better results when applied to the entire scenario. This learning process is discussed in Section 8.

Figure 3: Visualization of an Assertion Graph. By convention, input factors are placed at the top and hypotheses at the bottom, with levels of inference factors in between. (A node represents a statement; types of statements are input factors, inferred factors, and hypotheses or answers. Border strength visually represents belief that the factor is true in context. An edge represents a relation between the connected statements; agents make assertions about the truth of these relations with confidences. Edge width represents that confidence, and gray level represents the amount of belief flow. The example links the scenario and its punchline question through the input factor "patient exhibits resting tremor" and the inferred factor "patient has Parkinson's Disease," via indicates relations, to the hypothesis "patient's Substantia Nigra is affected.")
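To make the loop of Sections 3.1 through 3.5 concrete, the following is a toy, self-contained sketch; all function names and the canned "Watson" answers are invented for illustration and are not taken from the WatsonPaths implementation.

```python
# Toy sketch of the WatsonPaths loop in Figure 2; every name here is invented.
FAKE_WATSON = {  # subquestion -> ranked (answer, confidence) pairs
    "What does resting tremor indicate?": [("Parkinson's disease", 0.8),
                                           ("Huntington's disease", 0.3)],
}

def ask_watson(question, threshold=0.2):
    """Stand-in for the Watson QA service: return answers above a threshold."""
    return [(a, c) for a, c in FAKE_WATSON.get(question, []) if c >= threshold]

def answer_scenario(factors, iterations=1):
    # Assertion graph as adjacency lists: node -> [(answer_node, edge_strength)]
    graph = {f: [] for f in factors}
    for _ in range(iterations):
        for node in list(graph):                    # node prioritization (toy: all)
            for answer, conf in ask_watson(f"What does {node} indicate?"):
                graph.setdefault(answer, [])        # relation generation
                graph[node].append((answer, conf))  # edge strength = QA confidence
    # Belief computation (toy): a node's belief is its strongest incoming edge
    belief = {}
    for edges in graph.values():
        for child, strength in edges:
            belief[child] = max(belief.get(child, 0.0), strength)
    return max(belief, key=belief.get) if belief else None

print(answer_scenario(["resting tremor"]))  # Parkinson's disease
```

The real system replaces each toy step with the components described in the following sections: a trained scenario analyzer, a prioritizer, Watson itself, and a probabilistic belief engine.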
3.7 Collaborating with the User

WatsonPaths can run in a completely automated way, as the Watson question answering system did when playing Jeopardy! (This is the case for the results presented in Section 10.) But there are also many interesting possibilities for user interaction at each step in the process. In this way, WatsonPaths exemplifies cognitive computing. Our vision for cognitive computing is that the user and the computer work together to explore a scenario and reach conclusions faster and more accurately than either could do alone. We discuss the collaborative learning aspects of WatsonPaths in Section 9.

4 Assertion Graphs

The core data structure used by WatsonPaths is the assertion graph. Figure 3 explains this data structure, along with the visualization that we commonly use for it. Assertion graphs are defined as follows.

A statement is something that can be true or false (though its state may not be known). Often we deal with unstructured statements, which are natural language expressions like "A 63-year-old patient is sent to the neurologist with a clinical picture of resting tremor that began 2 years ago." WatsonPaths also allows for statements that are structured expressions, namely, a predicate and arguments. Not all natural language expressions can have a truth value. For instance, the string "patient" cannot be true or false; thus it does not fit into the semantics of an assertion graph. WatsonPaths is charitable in interpreting strings as if they had a truth value. For instance, the default semantics of the string "low hemoglobin" is the same as "patient has low hemoglobin."

A relation is a named association between statements. Technically, relations are themselves statements, and have a truth value. Each relation has a predicate; for instance in medicine we may say that "Parkinson's causes resting tremor" or "Parkinson's matches Parkinsonism." Typically we are concerned with relations that may provide evidence for the truth of one statement given another. Although some relations may have special meanings in the probabilistic inference systems, a common semantics for a relation is indicative in the following way: "A indicates B" means that the truth of A provides an independent reason to believe that B is true. Section 7 provides more detail on the inference systems.

An assertion is a claim that some agent makes about the truth of a statement (including a relation). The assertion records the name of the agent and a confidence value. Assertions may also record provenance information that explains how the agent came to its conclusion. For the Watson question answering agent, this includes natural language passages that provide evidence for the answer. When the system is collaborating with a user, it is crucial to be able to display evidence to the user.

In the assertion graph, each node represents exactly one statement, and each edge represents exactly one relation. Nodes and edges may have multiple assertions attached to them, one for each agent that has asserted that node or edge to be true.

We often visualize assertion graphs by using a node's border width to represent the confidence of the node, an edge's width to represent the confidence of the edge, and an edge's gray level as the amount of belief flow along that edge. Belief flow is described later, but essentially it is how much the value of the head influences the value of the tail. This depends mostly on the confidences of the assertions on the edge.

5 Scenario Analysis

The goal of scenario analysis is to identify information in the natural language narrative of the problem scenario that is potentially relevant to solving the problem. When human experts read the problem narrative, they are trained to extract concepts that match a set of semantic types relevant for solving the problem. In the medical domain, doctors and nurses identify semantic types like chief complaints, past medical history, demographics, family and social history, physical examination findings, labs, and current medications (Bowen, 2006). Experts also generalize from specific observations in a particular problem instance to more general terms used in the domain corpus. An important aspect of this information extraction is to identify the semantic qualifiers associated with the clinical observations (Chang et al., 1998). These qualifiers could be temporal (e.g., "pain started two days ago"), spatial ("pain in the epigastric region"), or other associations ("pain after eating fatty foods"). Implicit in this task is the human's ability to extract concepts and their associated qualifiers from the natural language narrative. For example, the above qualifiers might have to be extracted from the sentence "The patient reports pain, which started two days ago, in the epigastric region especially after eating fatty foods."

The computer system needs to perform a similar analysis of the narrative. We use the term factor to denote the potentially relevant observations along with their associated semantic qualifiers. Reliably identifying and typing these factors, however, is a difficult task, because medical terms are far more complex than the kind of named entities typically studied in natural language processing. Our scenario analytics pipeline attempts to address this problem with the following major processing steps:

1. The analysis starts with syntactic parsing of the natural language. This creates a dependency tree of syntactically linked terms in a sentence and helps to associate terms that are distant from each other in the sentence.

2. The terms are mapped to a dictionary to identify concepts and their semantic types. For the medical domain, our dictionary is derived from the UMLS Metathesaurus (National Library of Medicine, 2009), Wikipedia redirects, and medical abbreviation resources. The concepts identified by the dictionary are then typed using the UMLS Semantic Network, which consists of a taxonomy of biological and clinical semantic types like Anatomy, SignOrSymptom, DiseaseOrSyndrome, and TherapeuticOrPreventativeProcedure. In addition to mapping the sequence of tokens in a sentence to the dictionary, the dependency parse is also used to map syntactically linked terms. For example, "... stiffness and swelling in the arm and leg" can be mapped to the four separate concepts contained in that phrase.

3. The syntactic and semantic information identified above are used by a set of predefined rules to identify important relations. Negation is commonly used in clinical narratives and needs to be accurately identified. Rules based on parse features identify the negation trigger term and its scope in a sentence. Factors found within the negated scope can then be associated with a negated qualifier. Another example of rule-based annotation is lab value analysis. This associates a quantitative measurement to the substance measured and then looks up reference lab value ranges to make a clinical assessment. For example, "hemoglobin concentration is 9 g/dL" is processed by rules to extract the value, unit, and substance and then assessed to be "low hemoglobin" by looking up a reference. Next, the clinical assessment is mapped by the dictionary to the corresponding clinical concept.

At this point, we should have all the information to identify factors and their semantic qualifiers. We have to contend, however, with language ambiguities, errors in parsing, a noisy and non-comprehensive dictionary, and a limited set of rules. If we were to rely solely on a rule-based system, then the resulting factor identification would suffer from a compounding of errors in these components. To address this issue, we employ machine learning methods to learn clinical factors and their semantic qualifiers in the problem narrative. We obtained the ground truth by asking medical students to annotate clinical factor spans and their semantic types. They also annotated semantic qualifier spans and linked them to factors as attributive relations.
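As a rough illustration of the lab value analysis rule described in step 3, the sketch below extracts the value, unit, and substance from a measurement phrase and assesses it against a reference range. The regular expression and the hard-coded range are invented for this example; they are not the resources WatsonPaths actually uses.

```python
import re

# Illustrative only: a single reference range (g/dL), roughly a textbook
# adult range, standing in for a real reference lookup.
REFERENCE_RANGES = {"hemoglobin": (12.0, 17.5)}

def assess_lab_value(text):
    """Extract value, unit, and substance, then make a clinical assessment."""
    m = re.search(r"(\w+) concentration is ([\d.]+)\s*g/dL", text)
    if not m or m.group(1).lower() not in REFERENCE_RANGES:
        return None
    substance, value = m.group(1).lower(), float(m.group(2))
    low, high = REFERENCE_RANGES[substance]
    if value < low:
        return f"low {substance}"
    if value > high:
        return f"high {substance}"
    return f"normal {substance}"

print(assess_lab_value("hemoglobin concentration is 9 g/dL"))  # low hemoglobin
```

In the real pipeline, the resulting assessment string would then be mapped by the dictionary to the corresponding clinical concept.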
The machine learning system comprises two sequential steps:

1. A conditional random field (CRF) model (Lafferty et al., 2001) learns the spans of text that should be marked as one of the following factor types: finding, disease, test, treatment, demographics, negation, or a semantic qualifier. Features used for training the CRF model are lexical (lemmas, morphological information, part-of-speech tags), semantic (UMLS semantic types and groups, demographic and lab value annotations), and parse-based (features associated with dependency links from a given token). A token window size of 5 (2 tokens before and after) is used to associate features for a given token. A BIO tagging scheme is used by the CRF to identify entities in terms of their token spans and types.

2. A maximum entropy model then learns the relations between the entities identified by the CRF model. For each pair of entities in a sentence, this model uses lexical features (within and between entities), entity type, and other semantic features associated with both entities, and parse features in the dependency path linking them. The relations learned by this model are negation and attributeOf relations linking negation triggers and semantic qualifiers (respectively) to factors.

The combined entity and relation identification models have a precision of 71% and recall of 65% on a blind evaluation set of patient scenarios found in medical test preparation questions. We are currently exploring joint inference models and identification of relations that span multiple sentences using coreference resolution.

6 Relation Generation

The scenario analysis component described in the previous section extracts pertinent factors related to the patient from the scenario description. At this stage, the assertion graph consists of the full scenario, individual scenario sentences, and the extracted factors. An indicates relation is posted from a source node (e.g., a scenario sentence node) to a target node whose assertion was derived from the assertion in the source node (e.g., a factor extracted from that sentence). In addition, a set of hypotheses, if given, are posted as goal nodes in the assertion graph.

The task of the relation generation component is to (1) expand the graph by inferring new facts from known facts in the graph and (2) identify relationships between nodes in the graph (like matches and contraindicates) to help with reasoning and confidence estimation. We begin by discussing how we infer new facts for graph expansion.

6.1 Expanding the Graph with Watson

In medical problem solving, experts reason with chief complaints, findings, medical history, demographic information, and so on, to identify the underlying causes for the patient's problems. Depending on the situation, they may then proceed to propose a test whose results will allow them to distinguish between multiple possible problem causes, or identify the best treatment for the identified cause, and so on.

Motivated by the medical problem solving paradigm, WatsonPaths first attempts to make a diagnosis based on factors extracted from the scenario. The graph is expanded to include new assertions about the patient by asking questions of a version of the Watson question answering system adapted for the medical domain (Ferrucci et al., 2013). WatsonPaths takes a two-pronged approach to medical problem solving by expanding the graph forward from the scenario in an attempt to make a diagnosis, and then linking high confidence diagnoses with the hypotheses. The latter step is typically done by identifying an important relation expressed in the punchline question (e.g., "What is the most appropriate treatment for this patient" or "What body part is most likely affected?"). This approach is a logical extension of the open-domain work of Prager et al. (2004), where in order to build a profile of an entity, questions were asked of properties of the entity and constraints between the answers were enforced to establish internal consistency.

The graph expansion process of WatsonPaths begins with automatically formulating questions related to high confidence assertions, which in our graphs represent statements WatsonPaths believes to be true to a certain degree of confidence about the patient. These statements may be factors, as extracted and typed by the algorithm described in Section 5, or combinations of those factors.

To determine what kinds of questions to ask, WatsonPaths can use a domain model that tells us what relations form paths between the semantic type of a high confidence node and the semantic type of a hypothesis like a diagnosis or treatment. For the medical domain, we created a model that we called the Emerald, which is shown in Figure 4. (Notice the resemblance to an emerald.) The Emerald is a small model of entity types and relations that are crucial for diagnosis and for formulating next steps.

We select from the Emerald all relations that link the semantic type of a high-confidence source node to a semantic type of interest. The relations and the high-confidence nodes then form the basis of instantiating the target nodes, thereby expanding the assertion graph. To instantiate the target nodes, we issue WatsonPaths subquestions to Watson. All answers returned by Watson that score above a pre-determined threshold are posted as target nodes in the inference graph.
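In the spirit of the Emerald, a domain model for subquestion formulation can be sketched as a small table from a (source type, relation) pair to a target type and a question template. The entries and function below are invented examples, not the actual Emerald model:

```python
# Hypothetical mini domain model: (source type, relation) -> (target type, template)
DOMAIN_MODEL = {
    ("finding", "findingOf"): ("disease", "What disease causes {}?"),
    ("disease", "affects"):   ("body part", "What body part does {} affect?"),
}

def formulate_subquestion(node_text, node_type, relation):
    """Turn a typed high-confidence node into a subquestion for Watson."""
    target_type, template = DOMAIN_MODEL[(node_type, relation)]
    return template.format(node_text), target_type

question, target = formulate_subquestion("resting tremor", "finding", "findingOf")
print(question)  # What disease causes resting tremor?
```

Each answer returned for such a subquestion would become a target node of the stated target type, connected by an edge labeled with the relation.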
A relation edge is posted from the source node to each new target node, where the confidence of the relation is Watson's confidence in the answer in the target node.

In addition to asking questions from scenario factors, WatsonPaths may also expand backwards from hypotheses. The premise for this approach is to explore how a hypothesis fits in with the rest of the inference graph. If one hypothesis is found to have a strong relationship with an existing node in the assertion graph, then the probabilistic inference mechanisms described in Section 7 allow belief to flow from known factors to that hypothesis, thus increasing the system's confidence in that hypothesis.

Figure 4: The Emerald

Figure 5 illustrates the WatsonPaths graph expansion process. The top two rows of nodes and the edges between them show a subset of the WatsonPaths assertion graph after scenario analysis, with the second row of nodes representing some clinical factors extracted from the scenario sentences.

The graph expansion process identifies the most confident assertions in the graph, which include the four clinical factor nodes extracted from the scenario. These four nodes are all typed as findings, so they are aggregated into a single finding node for the purpose of graph expansion. For a finding node, the Emerald proposes a single findingOf relation that links it to a disease. This results in the formulation of the subquestion "What disease causes resting tremor that began 2 years ago, compromises the whole arm, unexpressive face, and difficulty in walking?" whose answers include Parkinson disease, Huntington's disease, cerebellar disease, and so on. These answer nodes are added to the graph and some of them are shown in the third row of nodes in Figure 5.

In the reverse direction, WatsonPaths explores relationships between hypotheses to nodes in the existing graph based on the punchline question in the scenario, which in this case is "What part of his nervous system is most likely affected?" Assuming each hypothesis to be true, the system formulates subquestions to link it to the assertion graph. Consider "Substantia nigra." WatsonPaths can ask "In what disease is substantia nigra most likely affected?" A subset of the answers to this question, including Parkinson's disease and Diffuse Lewy body disease, are shown in the fourth row of nodes in Figure 5.

Figure 5: WatsonPaths Graph Expansion Process. (The top row holds scenario sentences; the second row holds extracted factors such as "resting tremor that began 2 years ago," "compromises the whole arm," "unexpressive face," and "difficulty in walking"; the third row holds candidate diseases such as Parkinson disease, Cerebellar diseases, Huntington's disease, Progressive supranuclear palsy, Parkinson's disease, and Diffuse Lewy body disease; the fourth row holds body parts such as Substantia Nigra, Lenticular Nuclei, Caudate Nucleus, Cerebellum, and Pons.)

6.2 Matching Graph Nodes

When a new node is added to the WatsonPaths assertion graph, we compare the assertion in the new node to those in existing nodes to ensure that equivalence relations between nodes are properly identified. This is done by comparing the statements in those assertions: for unstructured statements, whether the statements are lexically equivalent, and for structured statements, whether the predicates and their arguments are the same. A more complex operation is to identify when nodes contain assertions that may be equivalent to the new assertion.

We employ an aggregate of term matchers (Murdock et al., 2012a) to match pairs of assertions. Each term matcher posts a confidence value on the degree of match between two assertions based on its own resource for determining equivalence. For example, a WordNet-based term matcher considers terms in the same synset to be equivalent, and a Wikipedia-redirect-based term matcher considers terms with a redirect link between them in Wikipedia to be a match. The dotted line between "Parkinson disease" and "Parkinson's disease" in Figure 5 is posted by the UMLS-based term matcher, which considers variants for the same concept to be equivalent.

7 Confidence and Belief

Once the assertion graph is constructed, and some questions and answers are posted, there remains the problem
of confidence estimation. We develop multiple models of inference to address this step.

7.1 Belief Engine

One approach to the problem of inferring the correct hypothesis from the assertion graph is probabilistic inference over a graphical model (Pearl, 1988). We refer to the component that does this as the belief engine.

Although the primary goal of the belief engine is to infer confidences in hypotheses, it also has two secondary goals. One is to infer belief in unknown nodes that are not hypotheses. These intermediate nodes may be important intermediate steps toward an answer; by assigning high confidences to them in the main loop, we know to assign them high priority for subquestion asking. Another secondary goal is to support the user interface (see Section 9). Among inference algorithms that perform well in terms of accuracy and other metrics, we try to make choices that will make the flow of belief intuitive for users. This facilitates the gathering of better opportunistic annotations, which improves future performance.

To execute the belief engine, we first make a working copy of the assertion graph that we call the inference graph. A separate graph is used so that we can make changes without losing information that might be useful in later steps of inference. For instance, we might choose to merge nodes or reorient edges. Once the inference graph has been built, we run a probabilistic inference engine over the graph to generate new confidences. Each node represents an assertion, so it can be in one of two states: true or false ("on" or "off"). Thus a graph with k nodes can be in $2^k$ possible states. The inference graph specifies the likelihoods of each of these states. The belief engine uses these likelihoods to calculate the marginal probability, for each node, of it being in the true state. This marginal probability is treated as a confidence. Finally, we read confidences and other data from the inference graph back into the assertion graph.

There are some challenges in applying probabilistic inference to an assertion graph. Most tools in the inference literature were designed to solve a different problem, which we will call the classical inference problem. In this problem, we are given a training set and a test set that can be seen as samples from a common joint distribution. The task is to construct a model that captures the training set (for instance, by maximizing the likelihood of the training set), and then apply the model to predict the test set.

In WatsonPaths, we face a different set of problems. The challenge is not to construct a model from training data, but to use a very noisy, already constructed model to do inference. Training data in the classical sense is absent or very sparse; all we have are correct answers to some scenario-level questions. An advantage is that a graph structure is given. A disadvantage is that the graph is noisy. Furthermore, it is not known that the confidences on the edges necessarily correspond to the optimal edge strengths. (In the next section, we address the problem of learning edge strengths.) Thus we have the problem of selecting a semantics: a way to convert the assertion graph into a graph over which we can do optimal probabilistic inference to meet our goals.

After much experimentation, the primary semantics used by the belief engine is the indicative semantics: If there is a directed relation from node A to node B with strength x, then A provides an independent reason to believe that B is true with probability x. Some edges are classified as contraindicative; for these edges, A provides an independent reason to believe that B is false with probability x. The independence means that multiple parents R can easily be combined using a noisy-OR:

$$1 - \prod_{r \in R} (1 - r) = \bigoplus_{r \in R} r$$

The graph, so interpreted, forms a noisy-logical Bayesian network (Yuille and Lu, 2007). The strength of each edge can be interpreted as an indicative power, a concept related to causal power (Cheng, 1997), with the difference that we are semantically agnostic as to the true direction of the causal relation. Formally, the probability of a node being "on" (true) is given by

$$P(x \mid R_x, Q_x) = \left[ \bigoplus_{r \in R_x} (s_r p_r) \right] \left[ 1 - \bigoplus_{q \in Q_x} (s_q p_q) \right]$$

where $P(x)$ is the probability of node $x$ being "on," $R_x$ is the set of indicative parents of $x$, and $Q_x$ is the set of contraindicative parents. The parent's state is represented by $p_r$: 1 if the parent is "on," and 0 otherwise. The value $s_r$ represents the strength of the edge from the parent to $x$. In other words, the probability that a node $x$ is "on" is the noisy-OR of its active indicative parent edge strengths combined via a noisy-AND-NOT with the noisy-OR of its active contraindicative parent edge strengths.
unknown values in the test set. Arguably the greatest its active contraindicative parent edge strengths.
problem in the classical inference task is that the struc- For instance, if the node resting tremor indicates
ture of the graphical model is underdetermined; a large Parkinson disease with strength 0.8, and the node diffi-
space of possible structures needs to be explored. Once culty in walking indicates Parkinson disease with power
a structure is found, adjusting the strengths is relatively 0.4, then the probability of Parkinson disease will be
easier, because we know that samples from the training (1 (1 0.8)(1 0.4)) = 0.88. If so, then the edge
set are sampled from a consistent joint distribution. with strength 0.9 to Parkinsons disease will fire with
probability 0.88 × 0.9 = 0.792. In this way, probabilities can often multiply down simple chains. Inference must be more sophisticated to handle the graphs we see in practice, but the intuition is the same.

An example that adds sophistication to the inference is an "exactly one" constraint that can be optionally added to multiple-choice questions. This constraint assigns a higher likelihood to assignments in which exactly one multiple choice answer is true. Because of these kinds of constraints, and because of the fact that the graphs contain directed and undirected cycles, we cannot simply calculate the probabilities in a feed-forward manner. To perform inference we use Metropolis-Hastings sampling over a factor graph representation of the inference graph. This has the advantage of being a very general approach (the inference engine can easily be adapted to a new semantics) and also allows an arbitrary level of precision given enough processing time.

Users and annotators report that they find the indicative semantics intuitive, and it performs at least as well as other semantics in experiments. One of the first semantics we tried was undirected pairwise Markov random fields. These performed poorly in practice. We hypothesize that this is because important information is contained in the direction of the edges that Watson returns: Asking about A and getting B as an answer is different from asking about B and getting A as an answer. An undirected model loses this information.

The indicative semantics is a default, basic semantics. […relation generator) and we express the confidence in each hypothesis as a closed-form, parameterized expression over the feature values. We can then optimize the parameters on a training set of scenarios and correct diagnoses (see Section 8).

To illustrate the idea we describe in detail one such model, the Noisy-OR Model, which is based on the same intuition as the indicative semantics just described.

We first convert the assertion graph to a directed acyclic graph (DAG). The assertion graph is not, in general, free of cycles. Additionally, the assertion graph contains matching relations, which are undirected. To form a DAG, the nodes in the assertion graph are first clustered by these matching relations, and then cycles are broken by applying heuristics to re-orient edges to point from factors to hypotheses.

The confidence in factors extracted by Scenario Analysis is 1.0. For all other nodes the confidence is defined recursively in terms of the confidences of the parents and the confidence of the edges produced by the QA system. Let the set of parents in the DAG for a node n be given by a(n). The feature vector the QA system gives for one node, m, indicating another, n, is given by \phi(m, n). Then the confidence for a non-factor node is given below. The learned weight vector for the QA features is \vec{q}.

    P(n) = \bigoplus_{a_i \in a(n)} \left( \vec{q} \cdot \phi(a_i, n) \right) P(a_i)
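The noisy-OR combination described above is compact enough to sketch directly. The following is a minimal illustration of the indicative/contraindicative combination, not WatsonPaths code; the function names and the toy parent lists are our own:

```python
from typing import List, Tuple

def noisy_or(strengths: List[float]) -> float:
    """Combine independent reasons to believe via noisy-OR: 1 - prod(1 - s)."""
    p = 1.0
    for s in strengths:
        p *= (1.0 - s)
    return 1.0 - p

def node_probability(indicative: List[Tuple[float, int]],
                     contraindicative: List[Tuple[float, int]]) -> float:
    """P(x): noisy-OR of active indicative parent strengths (s * p),
    combined via noisy-AND-NOT with the noisy-OR of active
    contraindicative parent strengths."""
    on = noisy_or([s * p for s, p in indicative])
    off = noisy_or([s * p for s, p in contraindicative])
    return on * (1.0 - off)

# Worked example from the text: two indicative parents of "Parkinson disease"
# with strengths 0.8 and 0.4, both on; no contraindicative parents.
p_parkinson = node_probability([(0.8, 1), (0.4, 1)], [])
print(round(p_parkinson, 2))  # 0.88
```

Chaining works the same way: multiplying 0.88 by a downstream edge strength of 0.9 reproduces the 0.792 figure from the example above.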
…correct/incorrect binary classification task. In contrast, many probabilistic inference methods use confidence as something like strength of indication or relevance.

For all these reasons, DD data is poorly suited to training a complete model for judging edge-strength for subquestion edges in WatsonPaths. But we have found that DD data is useful as subquestion training data¹ in the hybrid learning approach described in Section 8; we use 1039 DD questions for consolidating question answering features and then use the smaller, consolidated set of features as inputs to the inference models that are trained on the 1000 medical test preparation questions.

10.2 Experimental Setup

For comparison purposes, we used our Watson question answering system adapted for the medical domain (Ferrucci et al., 2013) as a baseline system. This system takes the entire scenario as input and evaluates each multiple choice answer based on its likelihood of being the correct answer to the punchline question. This one-shot approach to answering medical scenario questions contrasts with the WatsonPaths approach of decomposing the scenario, asking questions of atomic factors, and performing probabilistic inference over the resulting graphical model.

We tuned various parameters in the WatsonPaths system on the development set to balance speed and performance. The system performs one iteration each of forward and backward relation generation. The minimum confidence threshold for expanding a node is 0.25, and the maximum number of nodes expanded per iteration is 40. In the relation generation component, the Watson medical question answering system returns all answers with a confidence of above 0.01.

We evaluate system performance both on the full test set as well as on the diagnosis subset only. The reason for evaluating the diagnosis subset separately is because, in the vast majority of these questions, either the punchline question seeks the diagnosis or depends on a correct diagnosis along the way. We use the full 1000 questions in the training set to learn the models for both the baseline system and the WatsonPaths system. As noted earlier, Doctor's Dilemma training data is used to consolidate question answering features in the WatsonPaths system. We did not use Doctor's Dilemma training data for any purpose in the baseline system.

¹ We are also investigating the use of actual subquestions generated by WatsonPaths as training data. Building a comprehensive answer key for such questions is very time consuming, and an incomplete answer key can be less effective. Although this approach has not yet succeeded, it may still succeed if we invest much more in building a bigger, better answer key for actual WatsonPaths subquestions.
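To make the tuned expansion policy concrete, here is a hypothetical sketch of selecting which nodes to expand in an iteration. The `Node` class, the function name, and the constant names are illustrative assumptions, not WatsonPaths APIs; only the two numeric values (0.25 threshold, 40-node cap) come from the text above.

```python
from dataclasses import dataclass
from typing import List

# Values tuned on the development set, per the text; names are ours.
MIN_EXPANSION_CONFIDENCE = 0.25
MAX_NODES_PER_ITERATION = 40

@dataclass
class Node:
    text: str
    confidence: float

def select_nodes_to_expand(nodes: List[Node]) -> List[Node]:
    """Keep nodes whose confidence clears the threshold, highest first,
    capped at the per-iteration maximum."""
    eligible = [n for n in nodes if n.confidence >= MIN_EXPANSION_CONFIDENCE]
    eligible.sort(key=lambda n: n.confidence, reverse=True)
    return eligible[:MAX_NODES_PER_ITERATION]
```

This kind of thresholded, capped selection trades completeness for speed: low-confidence assertions are never turned into subquestions, and each iteration's question-answering load is bounded.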
                           Full     Diagnosis
  Accuracy     Baseline      42.0%    53.8%
               WatsonPaths   48.0%    64.1%
  Confidence   Baseline      59.8%    75.3%
  Wtd. Score   WatsonPaths   67.5%    81.8%

Table 1: WatsonPaths Performance Results

10.3 Results and Discussion

Table 1 shows the results of our evaluation on a set of 500 blind questions, of which a subset of 156 questions were identified as diagnosis questions by annotators.

We report results using two metrics. Accuracy simply measures the percentage of questions for which a system ranks the correct answer in top position. Confidence Weighted Score is a metric that takes into account both the accuracy of the system and its confidence in producing the top answer (Voorhees, 2003). We sort all <question, top answer> pairs in an evaluation set in decreasing order of the system's confidence in the top answer and compute the confidence weighted score as

    CWS = \frac{1}{n} \sum_{i=1}^{n} \frac{\text{number correct in first } i \text{ ranks}}{i}

where n is the number of questions in the evaluation set. This metric rewards systems for more accurately assigning high confidences to correct answers, an important consideration for real-world question answering and medical diagnosis systems.

Our results show statistically significant improvements at p<0.05 (results in bold in Table 1) on the full blind set of 500 questions for both metrics. For the diagnosis subset, the accuracy improvement is statistically significant but the confidence weighted score improvement is not, even with a 6+% score increase. This is likely due to the small diagnosis subset, which contains only 156 questions.

11 Related Work

Clinical decision support systems (CDSS) have had a long history of development starting from the early days of artificial intelligence. These systems use a variety of knowledge representations, reasoning processes, system architectures, scopes of medical domain, and types of decision (Musen et al., 2014). Although several studies have reported on the success of CDSS implementations in improving clinical outcomes (Kawamoto et al., 2005; Roshanov et al., 2013), widespread adoption and routine use is still lacking (Osheroff et al., 2007).

The pioneering Leeds abdominal pain system (De Dombal et al., 1972) used structured knowledge in the form of conditional probabilities for diseases and their symptoms. Its success at using Bayesian reasoning was comparable to experienced clinicians at the Leeds hospital where it was developed. But it did not adapt successfully to other hospitals or regions, indicating the brittleness of some systems when they are separated from their original developers. A recent systematic review of 162 CDSS implementations shows that success at clinical trials is significantly associated with systems that were evaluated by their own developers (Roshanov et al., 2013). MYCIN (Shortliffe, 1976) was another early system which used structured representation in the form of production rules. Its scope was limited to the treatment of infectious diseases and, as with other systems with structured knowledge bases, required expert humans to develop and maintain these production rules. This manual process can prove to be infeasible in many medical specialties where active research produces new diagnosis and treatment guidelines and phases out older ones. Many CDSS implementations mitigate this limitation by focusing their manual decision logic development effort on clinical guidelines for specific diseases or treatments, e.g., hypertension management (Goldstein et al., 2001). But such systems lack the ability to handle patient comorbidities and concurrent treatment plans (Sittig et al., 2008). Another notable system that used structured knowledge was Internist-1. The knowledge base contained disease-to-finding mappings represented as conditional probabilities (of disease given finding and of finding given disease) mapped to a 1-5 scale. Despite initial success as a diagnostic tool, its design as an expert consultant was not considered to meet the information needs of most physicians. Eventually, its underlying knowledge base helped its evolution into an electronic reference that can provide physicians with customized information (Miller et al., 1986). A similar system, DXplain (Barnett et al., 1987), continues to be commercially successful and extensively used. Rather than focus on a definitive diagnosis, it provides the physician with a list of differential diagnoses along with descriptive information and bibliographic references.

Other systems in commercial use have adopted the unstructured medical text reference approach directly, using search technology to provide decision support. Isabel provides diagnostic support using natural language processing of medical textbooks and journals. Other commercial systems like UpToDate and ClinicalKey forgo the diagnostic support and provide a search capability to their medical textbooks and other unstructured references. Although search over unstructured content makes it easier to incorporate new knowledge, it shifts the reasoning load from the system back to the physician.

In comparison to the above systems, WatsonPaths uses a hybrid approach. It uses question-answering technology over unstructured medical content to obtain answers to specific subquestions generated by WatsonPaths. For
this task, it builds on the search functionality by extracting answer entities from the search results and seeking supporting evidence for them in order to estimate answer confidences. These answers are then treated as inferences by WatsonPaths over which it can perform probabilistic reasoning without requiring a probabilistic knowledge base.

Another major area of difference between CDSS implementations is the extent of their integration into the health information system and workflow used by the physicians. Studies have shown that CDSS are most effective when they are integrated within the workflow (Kawamoto et al., 2005; Roshanov et al., 2013). Many of the guideline-based CDSS implementations are integrated with the health information system and workflow, having access to the data being entered and providing timely decision support in the form of alerts. But this integration is limited to the structured data contained in a patient's electronic medical record. When a CDSS requires information like findings, assessments, or plans in clinical notes written by a healthcare provider, existing systems are unable to extract them. As a result, search-based CDSS remain a separate consultative tool. The scenario analysis capability of WatsonPaths provides the means to analyze these unstructured clinical notes and serves as a means for integration into the health information system.

A major point of differentiation between the CDSS implementations described above and the design of WatsonPaths is its ability to serve as a collaborative problem solving tool as described in Section 9. When teamed with a student, the role of WatsonPaths approaches that of intelligent tutoring systems (Woolf, 2009). Key differences exist, however, in the representation of domain knowledge and student knowledge. Most tutoring systems have a structured representation of the domain knowledge, carrying with it the same knowledge update and maintenance issues faced by CDSS implementations. WatsonPaths lacks a student model (or in general a model of the collaborator), which is a key capability of intelligent tutoring systems. As a result, it cannot guide or customize the tutoring according to student needs, relying instead on an instructor's choice of the problem scenario to be used.

12 Conclusions and Further Work

WatsonPaths is a system for scenario-based question answering that has a graphical model at its core. It includes a collaborative decision support tool that allows users to understand and contribute to the reasoning process. We have developed WatsonPaths on a set of multiple choice questions from the medical domain. On this test set, WatsonPaths shows a significant improvement over Watson. Although the test preparation question set has been important for the early development of the system, we have designed WatsonPaths to function well beyond it. In future work, we plan to extend WatsonPaths in several ways.

The present set of questions are all multiple choice questions. This means that hypotheses have already been identified, and it is also known that exactly one of the hypotheses is the correct answer. Although they have made the early development of scenario-based question answering more straightforward, the overall WatsonPaths architecture does not rely on these constraints. For instance, we can easily remove the confidence re-estimation phase for the closed-form inference systems and the "exactly one" constraint from the belief engine. Also, it will be straightforward to add a simple hypothesis identification step to the main loop. One way to do this is to find nodes whose type corresponds to the type being asked about in the punchline question. We already find such correspondences in the base Watson system (Chu-Carroll et al., 2012). In the collaborative application, we are exploring ways of having the user help identify hypotheses.

We also plan to extend WatsonPaths beyond the medical domain. For medical applications, it might have been easier to design Watson with certain medical aspects hardcoded into the flow of execution. Instead we designed the overall flow as well as each component to be general across domains. Note that the Emerald could be replaced by a structure from a different domain, and the basic semantics we have explored (matching, indicative, and causal) have no requirement that the graph structure come from medicine. Even the causal aspect of the belief engine could apply to any domain that involves diagnostic inference (e.g., automotive repair). Most importantly, the way that subquestions are answered is completely general. By asking the right subquestions and using the right corpus, we can apply WatsonPaths to any scenario-based question answering problem. We hope to develop a toolbox of expansion strategies, relation generators, and inference mechanisms that can be reused as we apply WatsonPaths to new domains.

The most important area for further work is on the collaborative user application. In the early development of the system, it was necessary to focus on automatic performance (as presented in Section 10) to create a viable scenario-based question answering system. As this performance improves, we are focusing more on how WatsonPaths can interact better with users. We plan to develop and more rigorously evaluate how WatsonPaths learns from users and how users learn from WatsonPaths.

In a fully automatic system, the user receives an answer using little or no time or cognitive effort. In a collaborative system, the user spends some time and effort, and potentially gets a better answer. We suspect that, in
many applications of scenario-based question answering, this will be an attractive tradeoff for the user, because of the complexity of the scenario and the importance of the answer. Our objective is to minimize the time and effort required of users and maximize the benefit they receive. The combination of the user and WatsonPaths should be able to handle more difficult problems more quickly than either could alone.

David Ferrucci, Eric Brown, Jennifer Chu-Carroll, James Fan, David Gondek, Aditya A. Kalyanpur, Adam Lally, J. William Murdock, Eric Nyberg, John Prager, Nico Schlaefer, and Chris Welty. 2010. Building Watson: An overview of the DeepQA project. AI Magazine, 31:59-79.

David Ferrucci, Anthony Levas, Sugato Bagchi, David Gondek, and Erik T. Mueller. 2013. Watson: Beyond Jeopardy! Artificial Intelligence, 199-200:93-105.