Learning Natural Language Inference With LSTM
m_k = [a_k ; h_k^t]    (7)

i_k^m = σ(W^{mi} m_k + V^{mi} h_{k−1}^m + b^{mi}),
f_k^m = σ(W^{mf} m_k + V^{mf} h_{k−1}^m + b^{mf}),
o_k^m = σ(W^{mo} m_k + V^{mo} h_{k−1}^m + b^{mo}),
c_k^m = f_k^m ⊙ c_{k−1}^m + i_k^m ⊙ tanh(W^{mc} m_k + V^{mc} h_{k−1}^m + b^{mc}),
h_k^m = o_k^m ⊙ tanh(c_k^m).    (8)

Figure 1: The top figure depicts the model by Rocktäschel et al. (2016) and the bottom figure depicts our model. Here H^s represents all the hidden states h_j^s. Note that in the top model each …
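Eqs. (7)–(8) are a standard LSTM update applied to the concatenation m_k of the attention vector and the hypothesis hidden state. The following is a minimal NumPy sketch of one such step; the dimension d, weight shapes, and random initialization are illustrative assumptions, not the paper's actual configuration.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def mlstm_step(a_k, h_t_k, h_prev, c_prev, params):
    """One match-LSTM step: concatenate the attention vector a_k with the
    hypothesis hidden state h_t_k (Eq. 7), then apply LSTM gating (Eq. 8)."""
    m_k = np.concatenate([a_k, h_t_k])  # m_k = [a_k ; h_k^t]
    W, V, b = params["W"], params["V"], params["b"]
    i = sigmoid(W["i"] @ m_k + V["i"] @ h_prev + b["i"])  # input gate
    f = sigmoid(W["f"] @ m_k + V["f"] @ h_prev + b["f"])  # forget gate
    o = sigmoid(W["o"] @ m_k + V["o"] @ h_prev + b["o"])  # output gate
    c = f * c_prev + i * np.tanh(W["c"] @ m_k + V["c"] @ h_prev + b["c"])
    h = o * np.tanh(c)
    return h, c

# Illustrative shapes: d = 4 hidden units, so m_k has length 2 * d.
d = 4
rng = np.random.default_rng(0)
params = {
    "W": {g: rng.normal(scale=0.1, size=(d, 2 * d)) for g in "ifoc"},
    "V": {g: rng.normal(scale=0.1, size=(d, d)) for g in "ifoc"},
    "b": {g: np.zeros(d) for g in "ifoc"},
}
h, c = mlstm_step(rng.normal(size=d), rng.normal(size=d),
                  np.zeros(d), np.zeros(d), params)
```

Because the output gate and tanh both bound the result, each component of h stays in (−1, 1).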
ID                     sentence                                                                label
Premise                A dog jumping for a Frisbee in the snow.
Hypothesis  Example 1  An animal is outside in the cold weather, playing with a plastic toy.  entailment
            Example 2  A cat washes his face and whiskers with his front paw.                 contradiction
            Example 3  A pet is enjoying a game of fetch with his owner.                      neutral

Table 3: Three examples of sentence pairs with different relationship labels. The second hypothesis is a contradiction because it mentions a completely different event. The third hypothesis is neutral to the premise because the phrase “with his owner” cannot be inferred from the premise.
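The three labels in Table 3 make NLI a three-way classification task, so per-pair errors can be tallied in a 3×3 confusion matrix. Below is a minimal sketch of that tally; the gold and predicted labels are toy values, not the model's actual outputs.

```python
import numpy as np

LABELS = ["entailment", "neutral", "contradiction"]

def confusion_matrix(gold, pred):
    """Rows index the gold label, columns the predicted label."""
    idx = {lab: i for i, lab in enumerate(LABELS)}
    mat = np.zeros((3, 3), dtype=int)
    for g, p in zip(gold, pred):
        mat[idx[g], idx[p]] += 1
    return mat

# Toy example (hypothetical predictions, not the paper's results):
gold = ["entailment", "neutral", "neutral", "contradiction"]
pred = ["entailment", "entailment", "neutral", "contradiction"]
mat = confusion_matrix(gold, pred)
```

Off-diagonal mass shows which label pairs are confused, which is how the neutral-versus-entailment and neutral-versus-contradiction confusions discussed below can be read off.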
this data set. (2) If we compare our mLSTM model with our implementation of the word-by-word attention model by Rocktäschel et al. (2016) under the same setting with d = 150, we can see that our performance on the test data (85.7%) is higher than that of their model (82.6%). We also tested statistical significance and found the improvement to be statistically significant at the 0.001 level. (3) The performance of mLSTM with bi-LSTM sentence modeling compared with the model with standard LSTM sentence modeling when d is set to 150 shows that using bi-LSTM to process the original sentences helps (86.0% vs. 85.7% on the test data), but the difference is small and the complexity of bi-LSTM is much higher than that of LSTM. Therefore when we increased d to 300 we did not experiment with bi-LSTM sentence modeling. (4) Interestingly, when we experimented with the mLSTM model using the pre-trained word embeddings instead of LSTM-generated hidden states as initial representations of the premise and the hypothesis, we were able to achieve an accuracy of 85.3% on the test data, which is still better than previously reported state of the art. This suggests that the mLSTM architecture coupled with the attention model works well, regardless of whether or not we use LSTM to process the original sentences.

Because the NLI task is a three-way classification problem, to better understand the errors, we also show the confusion matrix of the results obtained by our mLSTM model with d = 300 in Table 2. We can see that there is more confusion between neutral and entailment and between neutral and contradiction than between entailment and contradiction. This shows that neutral is relatively hard to capture.

3.3 Further Analyses

To obtain a better understanding of how our proposed model actually performs the matching between a premise and a hypothesis, we further conduct the following analyses. First, we look at the learned word-by-word alignment weights αkj to check whether the soft alignment makes sense. This is the same as what was done by Rocktäschel et al.
(2016). We then look at the values of the various gate vectors of the mLSTM. By looking at these values, we aim to check (1) whether the model is able to differentiate between more important and less important word-level matching results, and (2) whether the model forgets certain matching results and remembers certain other ones.

To conduct the analyses, we choose three examples and display the various learned parameter values. These three sentence pairs share the same premise but have different hypotheses and different relationship labels. They are given in Table 3. The values of the alignment weights and the gate vectors are plotted in Figure 2. Besides using the three examples, we will also give some overall statistics of the parameter values to confirm our observations with the three examples.

Figure 2: The alignment weights and the gate vectors of the three examples.

Word Alignment

First, let us look at the top-most plots of Figure 2. These plots show the alignment weights αkj between the hypothesis and the premise, where a darker color corresponds to a larger value of αkj.
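The row normalization that the alignment plots rely on (for each hypothesis position k, the weights αkj sum to 1 over premise positions j) is a softmax. A small sketch with made-up alignment scores:

```python
import numpy as np

def attention_weights(scores):
    """Given unnormalized alignment scores e[k, j] (hypothesis position k vs.
    premise position j), apply a softmax over j so each row sums to 1."""
    e = scores - scores.max(axis=1, keepdims=True)  # subtract max for numerical stability
    exp = np.exp(e)
    return exp / exp.sum(axis=1, keepdims=True)

# Toy scores: hypothesis word 0 aligns mostly with premise word 0,
# hypothesis word 1 mostly with premise word 2.
scores = np.array([[2.0, 0.1, 0.1],
                   [0.1, 0.1, 3.0]])
alpha = attention_weights(scores)
```

Each row of `alpha` sums to 1, which is why a dark cell in a row of the plots directly indicates where that hypothesis word attends.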
Recall that αkj is the degree to which the kth word in the hypothesis is aligned with the jth word in the premise. Also recall that the weights αkj are configured such that for the same k all the αkj add up to 1. This means the weights in the same row in these plots add up to 1. From the three plots we can see that the alignment weights generally make sense. For example, in Example 1, “animal” is strongly aligned with “dog” and “toy” aligned with “Frisbee.” The phrase “cold weather” is aligned with “snow.” In Example 3, we also see that “pet” is strongly aligned with “dog” and “game” aligned with “Frisbee.”

In Example 2, “cat” is strongly aligned with “dog” and “washes” is aligned with “jumping.” It may appear that these matching results are wrong. However, “dog” is likely the best match for “cat” among all the words in the premise, and as we will show later, this match between “cat” and “dog” is actually a strong indication of a contradiction between the two sentences. The same explanation applies to the match between “washes” and “jumping.”

We also observe that some words are aligned with the NULL token we inserted. For example, the word “is” in the hypothesis in Example 1 does not correspond to any word in the premise and is therefore aligned with NULL. The words “face” and “whiskers” in Example 2 and “owner” in Example 3 are also aligned with NULL. Intuitively, if some important content words in the hypothesis are aligned with NULL, it is more likely that the relationship label is either contradiction or neutral.

Values of Gate Vectors

Next, let us look at the values of the learned gate vectors of our mLSTM for the three examples. We show these values under the setting where d is set to 150. Each row of these plots corresponds to one of the 150 dimensions. Again, a darker color indicates a higher value.

An input gate controls whether the input at the current position should be used in deriving the final hidden state of the current position. From the three plots of the input gates, we can observe that generally for stop words such as prepositions and articles the input gates have lower values, suggesting that the matching of these words is less important. On the other hand, content words such as nouns and verbs tend to have higher values of the input gates, which also makes sense because these words are generally more important for determining the final relationship label.

To further verify the observation above, we compute the average input gate values for stop words and the other content words. We find that the former has an average value of 0.287 with a standard deviation of 0.084 while the latter has an average value of 0.347 with a standard deviation of 0.116. This shows that indeed generally stop words have lower input gate values. Interestingly, we also find that some stop words may have higher input gate values if they are critical for the classification task. For example, the negation word “not” has an average input gate value of 0.444 with a standard deviation of 0.104. Overall, the values of the input gates confirm that the mLSTM helps differentiate the more important word-level matching results from the less important ones.

Next, let us look at the forget gates. Recall that a forget gate controls the importance of the previous cell state in deriving the final hidden state of the current position. Higher values of a forget gate indicate that we need to remember the previous cell state and pass it on whereas lower values indicate that we should probably forget the previous cell. From the three plots of the forget gates, we can see that overall the colors are the lightest for Example 1, which is an entailment. This suggests that when the hypothesis is an entailment of the premise, the mLSTM tends to forget the previous matching results. On the other hand, for Example 2 and Example 3, which are contradiction and neutral, we see generally darker colors. In particular, in Example 2, we can see that the colors are consistently dark starting from the word “his” in the hypothesis until the end. We believe the explanation is that after the mLSTM processes the first three words of the hypothesis, “A cat washes,” it sees that the matching between “cat” and “dog” and between “washes” and “jumping” is a strong indication of a contradiction, and therefore these matching results need to be remembered until the end of the mLSTM for the final prediction.

We have also checked the forget gates of the other sentence pairs in the test data by computing the average forget gate values and the standard deviations for entailment, neutral and contradiction, respectively.
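Both gate-value comparisons described above (input gates grouped by word class, forget gates grouped by relationship label) reduce to pooling gate activations over a group and reporting mean ± standard deviation. A toy sketch of that pooling; the gate values and word lists below are invented for illustration, not the paper's data:

```python
import numpy as np

# Hypothetical per-word input gate vectors (one value per hidden dimension);
# these numbers are made up for illustration.
gate_values = {
    "a":       [0.20, 0.25, 0.30],
    "is":      [0.22, 0.28, 0.26],
    "dog":     [0.40, 0.35, 0.45],
    "jumping": [0.38, 0.42, 0.36],
}
STOP_WORDS = {"a", "is"}

def pooled_stats(words):
    """Pool all gate dimensions of the given words; return (mean, std)."""
    vals = np.concatenate([np.asarray(gate_values[w]) for w in words])
    return vals.mean(), vals.std()

stop_mean, stop_std = pooled_stats([w for w in gate_values if w in STOP_WORDS])
content_mean, content_std = pooled_stats([w for w in gate_values if w not in STOP_WORDS])
# With these toy numbers the stop-word mean is lower, mirroring the reported
# 0.287 vs. 0.347 contrast; the same pooling per gold label gives the
# per-label forget-gate statistics.
```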
We find that the values are 0.446±0.123, 0.507±0.148 and 0.536±0.170, respectively. For contradiction and neutral, the forget gates start to have higher values from certain positions of the hypotheses.

Based on the observations above, we hypothesize that the way the mLSTM works is as follows. It remembers important mismatches, which are useful for predicting the contradiction or the neutral relationship, and forgets good matching results. At the end of the mLSTM, if no important mismatch is remembered, the final classifier will likely predict entailment by default. Otherwise, depending on the kind of mismatch remembered, the classifier will predict either contradiction or neutral.

For the output gates, we are not able to draw any important conclusion except that the output gates seem to be positively correlated with the input gates but they tend to be darker than the input gates.

4 Related Work

There has been much work on natural language inference. Shallow methods rely mostly on lexical similarities but are shown to be robust. For example, Bowman et al. (2015) experimented with a lexicalized classifier-based method, which only uses lexical information and achieves an accuracy of 78.2% on the SNLI corpus. More advanced methods use syntactic structures of the sentences to help match them. For example, Mehdad et al. (2009) applied syntactic-semantic tree kernels for recognizing textual entailment. Because inference is essentially a logic problem, methods based on formal logic (Clark and Harrison, 2009) or natural logic (MacCartney, 2009) have also been proposed. A comprehensive review of existing work can be found in the book by Dagan et al. (2013).

The work most relevant to ours is the recently proposed neural attention model-based method by Rocktäschel et al. (2016), which we have detailed in previous sections. Neural attention models have recently been applied to some natural language processing tasks including machine translation (Bahdanau et al., 2015), abstractive summarization (Rush et al., 2015) and question answering (Hermann et al., 2015). Rocktäschel et al. (2016) showed that the neural attention model could help derive a better representation of the premise to be used to match the hypothesis, whereas in our work we also use it to derive representations of the premise that are used to sequentially match the words in the hypothesis.

The SNLI corpus is new and so far it has only been used in a few studies. Besides the work by Bowman et al. (2015) themselves and by Rocktäschel et al. (2016), there are two other studies which used the SNLI corpus. Vendrov et al. (2015) applied a Skip-Thought model proposed by Kiros et al. (2015) to the NLI task and reported an accuracy of 81.5% on the test data. Mou et al. (2015) used tree-based CNN encoders to obtain sentence embeddings and achieved an accuracy of 82.1%.

5 Conclusions and Future Work

In this paper, we proposed a special LSTM architecture for the task of natural language inference. Based on a recent work by Rocktäschel et al. (2016), we first used neural attention models to derive attention-weighted vector representations of the premise. We then designed a match-LSTM that processes the hypothesis word by word while trying to match the hypothesis with the premise. The last hidden state of this mLSTM can be used for predicting the relationship between the premise and the hypothesis. Experiments on the SNLI corpus showed that the mLSTM model outperformed the state-of-the-art performance reported so far on this data set. Moreover, closer analyses on the gate vectors revealed that our mLSTM indeed remembers and passes on important matching results, which are typically mismatches that indicate a contradiction or a neutral relationship between the premise and the hypothesis.

With the large number of parameters to learn, an inevitable limitation of our model is that a large training data set is needed to learn good model parameters. Indeed some preliminary experiments applying our mLSTM to the SICK corpus (Marelli et al., 2014), a smaller textual entailment benchmark data set, did not give very good results. We believe that this is because our model learns everything from scratch except using the pre-trained word embeddings. A future direction would be to incorporate other resources such as the paraphrase database (Ganitkevitch et al., 2013) into the learning process.
References

[Bahdanau et al.2015] Dzmitry Bahdanau, KyungHyun Cho, and Yoshua Bengio. 2015. Neural machine translation by jointly learning to align and translate. In Proceedings of the International Conference on Learning Representations.

[Bowman et al.2015] Samuel R. Bowman, Gabor Angeli, Christopher Potts, and Christopher D. Manning. 2015. A large annotated corpus for learning natural language inference. In Proceedings of the 2015 Conference on Empirical Methods in Natural Language Processing.

[Clark and Harrison2009] Peter Clark and Phil Harrison. 2009. An inference-based approach to recognizing entailment. In Proceedings of the Text Analysis Conference.

[Dagan et al.2005] Ido Dagan, Oren Glickman, and Bernardo Magnini. 2005. The PASCAL Recognising Textual Entailment Challenge. In Proceedings of the PASCAL Challenges Workshop on Recognizing Textual Entailment.

[Dagan et al.2013] Ido Dagan, Dan Roth, Mark Sammons, and Fabio Massimo Zanzotto. 2013. Recognizing Textual Entailment: Models and Applications. Synthesis Lectures on Human Language Technologies. Morgan & Claypool Publishers.

[Ganitkevitch et al.2013] Juri Ganitkevitch, Benjamin Van Durme, and Chris Callison-Burch. 2013. PPDB: The paraphrase database. In Proceedings of the 2013 Conference of the North American Chapter of the Association for Computational Linguistics.

[Glickman et al.2005] Oren Glickman, Ido Dagan, and Moshe Koppel. 2005. Web based probabilistic textual entailment. In Proceedings of the PASCAL Challenges Workshop on Recognizing Textual Entailment.

[Graves2012] Alex Graves. 2012. Supervised Sequence Labelling with Recurrent Neural Networks, volume 385. Springer.

[Hermann et al.2015] Karl Moritz Hermann, Tomas Kocisky, Edward Grefenstette, Lasse Espeholt, Will Kay, Mustafa Suleyman, and Phil Blunsom. 2015. Teaching machines to read and comprehend. In Advances in Neural Information Processing Systems.

[Hochreiter and Schmidhuber1997] Sepp Hochreiter and Jürgen Schmidhuber. 1997. Long short-term memory. Neural Computation, 9(8):1735–1780.

[Kingma and Ba2014] Diederik Kingma and Jimmy Ba. 2014. Adam: A method for stochastic optimization. In Proceedings of the International Conference on Learning Representations.

[Kiros et al.2015] Ryan Kiros, Yukun Zhu, Ruslan R. Salakhutdinov, Richard Zemel, Raquel Urtasun, Antonio Torralba, and Sanja Fidler. 2015. Skip-thought vectors. In Advances in Neural Information Processing Systems.

[MacCartney et al.2008] Bill MacCartney, Michel Galley, and Christopher D. Manning. 2008. A phrase-based alignment model for natural language inference. In Proceedings of the Conference on Empirical Methods in Natural Language Processing.

[MacCartney2009] Bill MacCartney. 2009. Natural Language Inference. Ph.D. thesis, Stanford University.

[Marelli et al.2014] Marco Marelli, Stefano Menini, Marco Baroni, Luisa Bentivogli, Raffaella Bernardi, and Roberto Zamparelli. 2014. A SICK cure for the evaluation of compositional distributional semantic models. In Proceedings of the Ninth International Conference on Language Resources and Evaluation.

[Mehdad et al.2009] Yashar Mehdad, Alessandro Moschitti, and Fabio Massimo Zanzotto. 2009. SemKer: Syntactic/semantic kernels for recognizing textual entailment. In Proceedings of the Text Analysis Conference.

[Mou et al.2015] Lili Mou, Rui Men, Ge Li, Yan Xu, Lu Zhang, Rui Yan, and Zhi Jin. 2015. Recognizing entailment and contradiction by tree-based convolution. arXiv preprint arXiv:1512.08422.

[Pennington et al.2014] Jeffrey Pennington, Richard Socher, and Christopher D. Manning. 2014. GloVe: Global vectors for word representation. In Proceedings of the Conference on Empirical Methods in Natural Language Processing.

[Rocktäschel et al.2016] Tim Rocktäschel, Edward Grefenstette, Karl Moritz Hermann, Tomáš Kočiský, and Phil Blunsom. 2016. Reasoning about entailment with neural attention. In Proceedings of the International Conference on Learning Representations.

[Rush et al.2015] Alexander M. Rush, Sumit Chopra, and Jason Weston. 2015. A neural attention model for abstractive sentence summarization. In Proceedings of the Conference on Empirical Methods in Natural Language Processing.

[Vendrov et al.2015] Ivan Vendrov, Ryan Kiros, Sanja Fidler, and Raquel Urtasun. 2015. Order-embeddings of images and language. arXiv preprint arXiv:1511.06361.