
Learning Natural Language Inference with LSTM

Shuohang Wang and Jing Jiang
School of Information Systems, Singapore Management University
[email protected], [email protected]

arXiv:1512.08849v2 [cs.CL] 10 Nov 2016

Abstract

Natural language inference (NLI) is a fundamentally important task in natural language processing that has many applications. The recently released Stanford Natural Language Inference (SNLI) corpus has made it possible to develop and evaluate learning-centered methods such as deep neural networks for natural language inference (NLI). In this paper, we propose a special long short-term memory (LSTM) architecture for NLI. Our model builds on top of a recently proposed neural attention model for NLI but is based on a significantly different idea. Instead of deriving sentence embeddings for the premise and the hypothesis to be used for classification, our solution uses a match-LSTM to perform word-by-word matching of the hypothesis with the premise. This LSTM is able to place more emphasis on important word-level matching results. In particular, we observe that this LSTM remembers important mismatches that are critical for predicting the contradiction or the neutral relationship label. On the SNLI corpus, our model achieves an accuracy of 86.1%, outperforming the state of the art.

1 Introduction

Natural language inference (NLI) is the problem of determining whether from a premise sentence P one can infer another hypothesis sentence H (MacCartney, 2009). NLI is a fundamentally important problem that has applications in many tasks including question answering, semantic search and automatic text summarization. There has been much interest in NLI in the past decade, especially surrounding the PASCAL Recognizing Textual Entailment (RTE) Challenge (Dagan et al., 2005). Existing solutions to NLI range from shallow approaches based on lexical similarities (Glickman et al., 2005) to advanced methods that consider syntax (Mehdad et al., 2009), perform explicit sentence alignment (MacCartney et al., 2008) or use formal logic (Clark and Harrison, 2009).

Recently, Bowman et al. (2015) released the Stanford Natural Language Inference (SNLI) corpus for the purpose of encouraging more learning-centered approaches to NLI. This corpus contains around 570K sentence pairs with three labels: entailment, contradiction and neutral. The size of the corpus makes it now feasible to train deep neural network models, which typically require a large amount of training data. Bowman et al. (2015) tested a straightforward architecture of deep neural networks for NLI. In their architecture, the premise and the hypothesis are each represented by a sentence embedding vector. The two vectors are then fed into a multi-layer neural network to train a classifier. Bowman et al. (2015) achieved an accuracy of 77.6% when long short-term memory (LSTM) networks were used to obtain the sentence embeddings.

A more recent work by Rocktäschel et al. (2016) improved the performance by applying a neural attention model. While their basic architecture is still based on sentence embeddings for the premise and the hypothesis, a key difference is that the embedding of the premise takes into consideration the alignment between the premise and the hypothesis. This so-called attention-weighted representation of the premise was shown to help push the accuracy to
83.5% on the SNLI corpus.

A limitation of the aforementioned two models is that they reduce both the premise and the hypothesis to a single embedding vector before matching them; i.e., in the end, they use two embedding vectors to perform sentence-level matching. However, not all word or phrase-level matching results are equally important. For example, the matching between stop words in the two sentences is not likely to contribute much to the final prediction. Also, for a hypothesis to contradict a premise, a single word or phrase-level mismatch (e.g., a mismatch of the subjects of the two sentences) may be sufficient and other matching results are less important, but this intuition is hard to capture if we directly match two sentence embeddings.

In this paper, we propose a new LSTM-based architecture for learning natural language inference. Different from previous models, our prediction is not based on whole sentence embeddings of the premise and the hypothesis. Instead, we use an LSTM to perform word-by-word matching of the hypothesis with the premise. Our LSTM sequentially processes the hypothesis, and at each position, it tries to match the current word in the hypothesis with an attention-weighted representation of the premise. Matching results that are critical for the final prediction will be "remembered" by the LSTM while less important matching results will be "forgotten." We refer to this architecture as a match-LSTM, or mLSTM for short.

Experiments show that our mLSTM model achieves an accuracy of 86.1% on the SNLI corpus, outperforming the state of the art. Furthermore, through further analyses of the learned parameters, we show that the mLSTM architecture can indeed pick up the more important word-level matching results that need to be remembered for the final prediction. In particular, we observe that good word-level matching results are generally "forgotten" but important mismatches, which often indicate a contradiction or a neutral relationship, tend to be "remembered."

Our code is available online.¹

¹ https://github.com/shuohangwang/SeqMatchSeq

2 Model

In this section, we first review LSTM. We then review the word-by-word attention model by Rocktäschel et al. (2016), which is their best performing model. Finally we present our mLSTM architecture for natural language inference.

2.1 Background

LSTM: Let us first briefly review LSTM (Hochreiter and Schmidhuber, 1997). LSTM is a special form of recurrent neural networks (RNNs), which process sequence data. LSTM uses a few gate vectors at each position to control the passing of information along the sequence and thus improves the modeling of long-range dependencies. While there are different variations of LSTMs, here we present the one adopted by Rocktäschel et al. (2016). Specifically, let us use X = (x_1, x_2, ..., x_N) to denote an input sequence, where x_k ∈ R^l (1 ≤ k ≤ N). At each position k, there is a set of internal vectors, including an input gate i_k, a forget gate f_k, an output gate o_k and a memory cell c_k. All these vectors are used together to generate a d-dimensional hidden state h_k as follows:

i_k = σ(W^i x_k + V^i h_{k-1} + b^i),
f_k = σ(W^f x_k + V^f h_{k-1} + b^f),
o_k = σ(W^o x_k + V^o h_{k-1} + b^o),
c_k = f_k ⊙ c_{k-1} + i_k ⊙ tanh(W^c x_k + V^c h_{k-1} + b^c),
h_k = o_k ⊙ tanh(c_k),    (1)

where σ is the sigmoid function, ⊙ is the element-wise multiplication of two vectors, and all W^* ∈ R^{d×l}, V^* ∈ R^{d×d} and b^* ∈ R^d are weight matrices and vectors to be learned.
Neural Attention Model: For the natural language inference task, we have two sentences X^s = (x^s_1, x^s_2, ..., x^s_M) and X^t = (x^t_1, x^t_2, ..., x^t_N), where X^s is the premise and X^t is the hypothesis. Here each x is an embedding vector of the corresponding word. The goal is to predict a label y that indicates the relationship between X^s and X^t. In this paper, we assume y is one of entailment, contradiction and neutral.

Rocktäschel et al. (2016) first used two LSTMs to process the premise and the hypothesis, respectively, but initialized the second LSTM (for the hypothesis) with the last cell state of the first LSTM (for the premise). Let us use h^s_j and h^t_k to denote the resulting hidden states corresponding to x^s_j and x^t_k, respectively. The main idea of the word-by-word attention model by Rocktäschel et al. (2016) is to introduce a series of attention-weighted combinations of the hidden states of the premise, where each combination is for a particular word in the hypothesis. Let us use a_k to denote such an attention vector for word x^t_k in the hypothesis. Specifically, a_k is defined as follows:²

a_k = Σ_{j=1}^{M} α_{kj} h^s_j,    (2)

where α_{kj} is an attention weight that encodes the degree to which x^t_k in the hypothesis is aligned with x^s_j in the premise. The attention weight α_{kj} is generated in the following way:

α_{kj} = exp(e_{kj}) / Σ_{j'} exp(e_{kj'}),    (3)

where

e_{kj} = w^e · tanh(W^s h^s_j + W^t h^t_k + W^a h^a_{k-1}).    (4)

Here · is the dot-product between two vectors, the vector w^e ∈ R^d and all matrices W^* ∈ R^{d×d} contain weights to be learned, and h^a_{k-1} is another hidden state which we will explain below.

² We present the word-by-word attention model by Rocktäschel et al. (2016) in a different way but the underlying model is the same. Our h^a_k is their r_t, our H^s (all of h^s_j) is their Y, our h^t_k is their h_t, and our α_k is their α_t. Our presentation is close to the one by Bahdanau et al. (2015), with our attention vectors a corresponding to the context vectors c in their paper.

The attention-weighted premise a_k essentially tries to model the relevant parts in the premise with respect to x^t_k, i.e., the k-th word in the hypothesis. Rocktäschel et al. (2016) further built an RNN model over {a_k}_{k=1}^N by defining the following hidden states:

h^a_k = a_k + tanh(V^a h^a_{k-1}),    (5)

where V^a ∈ R^{d×d} is a weight matrix to be learned. We can see that the last h^a_N aggregates all the previous a_k and can be seen as an attention-weighted representation of the whole premise. Rocktäschel et al. (2016) then used this h^a_N, which represents the whole premise, together with h^t_N, which can be approximately regarded as an aggregated representation of the hypothesis,³ to predict the label y.

³ Strictly speaking, in the model by Rocktäschel et al. (2016), h^t_N encodes both the premise and the hypothesis because the two sentences are chained. But h^t_N places a higher emphasis on the hypothesis given the nature of RNNs.
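The attention computation in Eqns. (2)-(5) can be sketched in NumPy as follows. This is a minimal illustration of the mechanism only; the randomly initialized weights, toy dimensions and helper names are assumptions, not the parameters of the trained model.

```python
import numpy as np

def softmax(z):
    e = np.exp(z - z.max())
    return e / e.sum()

def word_by_word_attention(H_s, H_t, W_s, W_t, W_a, w_e, V_a):
    """Eqns. (2)-(5): attention over premise states H_s (M x d) for each
    hypothesis state in H_t (N x d); returns the attention vectors a_k and
    the RNN states h^a_k built over them."""
    M, d = H_s.shape
    N = H_t.shape[0]
    A = np.zeros((N, d))        # attention vectors a_k, Eqn. (2)
    Ha = np.zeros((N + 1, d))   # h^a_0 ... h^a_N, with h^a_0 = 0
    for k in range(N):
        # e_kj = w^e . tanh(W^s h^s_j + W^t h^t_k + W^a h^a_{k-1})  (Eqn. 4)
        scores = np.array([
            w_e @ np.tanh(W_s @ H_s[j] + W_t @ H_t[k] + W_a @ Ha[k])
            for j in range(M)
        ])
        alpha = softmax(scores)                   # alpha_kj, Eqn. (3)
        A[k] = alpha @ H_s                        # a_k, Eqn. (2)
        Ha[k + 1] = A[k] + np.tanh(V_a @ Ha[k])   # h^a_k, Eqn. (5)
    return A, Ha[1:]

# Toy usage with random premise/hypothesis encodings and weights.
d, M, N = 4, 6, 5
rng = np.random.default_rng(1)
H_s, H_t = rng.standard_normal((M, d)), rng.standard_normal((N, d))
W_s, W_t, W_a, V_a = [0.1 * rng.standard_normal((d, d)) for _ in range(4)]
w_e = rng.standard_normal(d)
A, Ha = word_by_word_attention(H_s, H_t, W_s, W_t, W_a, w_e, V_a)
```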
2.2 Our Model

Although the neural attention model by Rocktäschel et al. (2016) achieved better results than Bowman et al. (2015), we see two limitations. First, the model still uses a single vector representation of the premise, namely h^a_N, to match the entire hypothesis. We speculate that if we instead use each of the attention-weighted representations of the premise for matching, i.e., use a_k at position k to match the hidden state h^t_k of the hypothesis while we go through the hypothesis, we could achieve better matching results. This can be done using an RNN which at each position takes in both a_k and h^t_k as its input and determines how well the overall matching of the two sentences is up to the current position. In the end the RNN will produce a single vector representing the matching of the two entire sentences.

The second limitation is that the model by Rocktäschel et al. (2016) does not explicitly allow us to place more emphasis on the more important matching results between the premise and the hypothesis and down-weight the less critical ones. For example, matching of stop words is presumably less important than matching of content words. Also, some matching results may be particularly critical for making the final prediction and thus should be remembered. For example, consider the premise "A dog jumping for a Frisbee in the snow." and the hypothesis "A cat washes his face and whiskers with his front paw." When we sequentially process the hypothesis, once we see that the subject of the hypothesis, cat, does not match the subject of the premise, dog, it is highly likely that there is a contradiction. So this mismatch should be remembered.

Based on the two observations above, we propose to use an LSTM to sequentially match the two sentences. At each position the LSTM takes in both a_k and h^t_k as its input.
Figure 1 gives an overview of our model in contrast to the model by Rocktäschel et al. (2016).

Figure 1: The top figure depicts the model by Rocktäschel et al. (2016) and the bottom figure depicts our model. Here H^s represents all the hidden states h^s_j. Note that in the top model each h^a_k represents a weighted version of the premise only, while in our model, each h^m_k represents the matching between the premise and the hypothesis up to position k.

Specifically, our model works as follows. First, similar to Rocktäschel et al. (2016), we process the premise and the hypothesis using two LSTMs, but we do not feed the last cell state of the premise to the LSTM of the hypothesis. This is because we do not need the LSTM for the hypothesis to encode any knowledge about the premise but we will match the premise with the hypothesis using the hidden states of the two LSTMs. Again, we use h^s_j and h^t_k to represent these hidden states.

Next, we generate the attention vectors a_k similarly to Eqn (2). However, Eqn (4) will be replaced by the following equation:

e_{kj} = w^e · tanh(W^s h^s_j + W^t h^t_k + W^m h^m_{k-1}).    (6)

The only difference here is that we use a hidden state h^m instead of h^a, and the way we define h^m is very different from the definition of h^a.

Our h^m_k is the hidden state at position k generated from our mLSTM. This LSTM models the matching between the premise and the hypothesis. Important matching results will be "remembered" by the LSTM while non-essential ones will be "forgotten." We use the concatenation of a_k, which is the attention-weighted version of the premise for the k-th word in the hypothesis, and h^t_k, the hidden state for the k-th word itself, as input to the mLSTM.

Specifically, let us define

m_k = [a_k ; h^t_k].    (7)

We then build the mLSTM as follows:

i^m_k = σ(W^{mi} m_k + V^{mi} h^m_{k-1} + b^{mi}),
f^m_k = σ(W^{mf} m_k + V^{mf} h^m_{k-1} + b^{mf}),
o^m_k = σ(W^{mo} m_k + V^{mo} h^m_{k-1} + b^{mo}),
c^m_k = f^m_k ⊙ c^m_{k-1} + i^m_k ⊙ tanh(W^{mc} m_k + V^{mc} h^m_{k-1} + b^{mc}),
h^m_k = o^m_k ⊙ tanh(c^m_k).    (8)

With this mLSTM, finally we use only h^m_N, the last hidden state, to predict the label y.
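Below is a minimal NumPy sketch of the match-LSTM recurrence in Eqns. (6)-(8). At each hypothesis position it recomputes the attention vector a_k conditioned on the previous matching state h^m_{k-1} and feeds the concatenation m_k = [a_k ; h^t_k] into an LSTM cell. The small softmax read-out on h^m_N is an assumption about the classifier (the text above only states that h^m_N is used to predict y), and all parameters here are random placeholders.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def softmax(z):
    e = np.exp(z - z.max())
    return e / e.sum()

def mlstm_match(H_s, H_t, attn, lstm, W_cls):
    """Eqns. (6)-(8): run the match-LSTM over the hypothesis, classify h^m_N."""
    N, d = H_t.shape
    W_s, W_t, W_m, w_e = attn
    W, V, b = lstm                       # W*: d x 2d, V*: d x d, b*: d
    h_m, c_m = np.zeros(d), np.zeros(d)
    for k in range(N):
        # Attention now conditions on the matching state h^m_{k-1} (Eqn. 6).
        scores = np.array([w_e @ np.tanh(W_s @ hs + W_t @ H_t[k] + W_m @ h_m)
                           for hs in H_s])
        a_k = softmax(scores) @ H_s
        m_k = np.concatenate([a_k, H_t[k]])          # Eqn. (7)
        i = sigmoid(W["i"] @ m_k + V["i"] @ h_m + b["i"])
        f = sigmoid(W["f"] @ m_k + V["f"] @ h_m + b["f"])
        o = sigmoid(W["o"] @ m_k + V["o"] @ h_m + b["o"])
        c_m = f * c_m + i * np.tanh(W["c"] @ m_k + V["c"] @ h_m + b["c"])
        h_m = o * np.tanh(c_m)                       # Eqn. (8)
    # Assumed read-out: distribution over {entailment, contradiction, neutral}.
    return softmax(W_cls @ h_m)

# Toy usage with random encodings and parameters.
d, M, N = 4, 6, 5
rng = np.random.default_rng(2)
H_s, H_t = rng.standard_normal((M, d)), rng.standard_normal((N, d))
attn = (*[0.1 * rng.standard_normal((d, d)) for _ in range(3)], rng.standard_normal(d))
lstm = ({g: 0.1 * rng.standard_normal((d, 2 * d)) for g in "ifoc"},
        {g: 0.1 * rng.standard_normal((d, d)) for g in "ifoc"},
        {g: np.zeros(d) for g in "ifoc"})
probs = mlstm_match(H_s, H_t, attn, lstm, W_cls=0.1 * rng.standard_normal((3, d)))
```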
2.3 Implementation Details

Besides the difference of the LSTM architecture, we also introduce a few other changes from the model by Rocktäschel et al. (2016). First, we insert a special word NULL into the premise, and we allow words in the hypothesis to be aligned with this NULL. This is inspired by common practice in machine translation. Specifically, we introduce a vector h^s_0, which is fixed to be a vector of 0s of dimension d. This h^s_0 represents NULL and is used with the other h^s_j to derive the attention vectors {a_k}_{k=1}^N.

Second, we use word embeddings trained from GloVe (Pennington et al., 2014) instead of word2vec vectors. The main reason is that GloVe covers more words in the SNLI corpus than word2vec.⁴

⁴ The SNLI corpus contains 37K unique tokens. Around 12.1K of them cannot be found in word2vec but only around 4.1K of them cannot be found in GloVe.

Third, for words which do not have pre-trained word embeddings, we take the average of the embeddings of all the words (in GloVe) surrounding the unseen word within a window size of 9 (4 on the left and 4 on the right) as an approximation of the embedding of this unseen word. Then we do not update any word embedding when learning our model. Although this is a very crude approximation, it reduces the number of parameters we need to update, and as it turns out, we can still achieve better performance than Rocktäschel et al. (2016).
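The out-of-vocabulary treatment described above can be sketched as follows: for a token with no GloVe vector, average the GloVe vectors of in-vocabulary words within a 9-word window (4 to the left, 4 to the right) across its occurrences. The corpus iteration and the handling of tokens with no in-vocabulary neighbors are illustrative assumptions, not a description of the released code.

```python
import numpy as np

def approximate_oov_embeddings(sentences, glove, dim, window=4):
    """For each token missing from `glove`, average the GloVe vectors of
    in-vocabulary words within `window` positions on each side, across all
    occurrences in `sentences` (lists of tokens)."""
    sums, counts = {}, {}
    for tokens in sentences:
        for i, tok in enumerate(tokens):
            if tok in glove:
                continue
            lo, hi = max(0, i - window), min(len(tokens), i + window + 1)
            context = [glove[t] for t in tokens[lo:hi] if t in glove]
            if context:
                sums[tok] = sums.get(tok, np.zeros(dim)) + np.mean(context, axis=0)
                counts[tok] = counts.get(tok, 0) + 1
    # Tokens with no in-vocabulary context are simply left out here (assumption).
    return {tok: sums[tok] / counts[tok] for tok in sums}

# Toy usage with a 3-dimensional stand-in for the GloVe table.
glove = {"a": np.ones(3), "dog": np.full(3, 2.0), "jumps": np.full(3, 4.0)}
oov = approximate_oov_embeddings([["a", "dogge", "jumps"]], glove, dim=3)
print(oov["dogge"])   # average of the embeddings of "a" and "jumps"
```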
3 Experiments

3.1 Experiment Settings

Data: We use the SNLI corpus to test the effectiveness of our model. The original data set contains 570,152 sentence pairs, each labeled with one of the following relationships: entailment, contradiction, neutral and –, where – indicates a lack of consensus from the human annotators. We discard the sentence pairs labeled with – and keep the remaining ones for our experiments. In the end, we have 549,367 pairs for training, 9,842 pairs for development and 9,824 pairs for testing. This follows the same data partition used by Bowman et al. (2015) in their experiments. We perform three-class classification and use accuracy as our evaluation metric.

Parameters: We use the Adam method (Kingma and Ba, 2014) with hyperparameters β1 set to 0.9 and β2 set to 0.999 for optimization. The initial learning rate is set to be 0.001 with a decay ratio of 0.95 for each iteration. The batch size is set to be 30. We experiment with d = 150 and d = 300, where d is the dimension of all the hidden states.
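For concreteness, a small sketch of the preprocessing and optimization settings listed above is given below. The function and configuration names are illustrative, and the exact decay schedule in the released code may differ from this reading of "a decay ratio of 0.95 for each iteration."

```python
# Map the three relationship labels to class indices and drop pairs labeled "-"
# (no annotator consensus), as described in Section 3.1.
LABELS = {"entailment": 0, "neutral": 1, "contradiction": 2}

def filter_snli(pairs):
    """`pairs` is an iterable of (premise, hypothesis, gold_label) triples."""
    return [(p, h, LABELS[y]) for p, h, y in pairs if y in LABELS]

# Optimization settings from the text: Adam with beta1 = 0.9, beta2 = 0.999,
# initial learning rate 0.001, decay ratio 0.95 per iteration, batch size 30,
# hidden dimension d in {150, 300}.
config = {
    "optimizer": "adam",
    "beta1": 0.9,
    "beta2": 0.999,
    "initial_lr": 0.001,
    "lr_decay": 0.95,
    "batch_size": 30,
    "hidden_size": 150,
}
```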
Methods for comparison: We mainly want to compare our model with the word-by-word attention model by Rocktäschel et al. (2016) because this model achieved the state-of-the-art performance on the SNLI corpus. To ensure fair comparison, besides comparing with the accuracy reported by Rocktäschel et al. (2016), we also re-implemented their model and report the performance of our implementation. We also consider a few variations of our model. Specifically, the following models are implemented and tested in our experiments:

• Word-by-word attention (d = 150): This is our implementation of the word-by-word attention model by Rocktäschel et al. (2016), where we set the dimension of the hidden states to 150. The differences between our implementation and the original implementation by Rocktäschel et al. (2016) are the following: (1) We also add a NULL token to the premise for matching. (2) We do not feed the last cell state of the LSTM for the premise to the LSTM for the hypothesis, to keep it consistent with the implementation of our model. (3) For word representation, we also use the GloVe word embeddings and we do not update the word embeddings. For unseen words, we adopt the same strategy as described in Section 2.3.

• mLSTM (d = 150): This is our mLSTM model with d set to 150.

• mLSTM with bi-LSTM sentence modeling (d = 150): This is the same as the model above except that when we derive the hidden states h^s_j and h^t_k of the two sentences, we use bi-LSTMs (Graves, 2012) instead of LSTMs. We implement this model to see whether bi-LSTMs allow us to better align the sentences.

• mLSTM (d = 300): This is our mLSTM model with d set to 300.

• mLSTM with word embedding (d = 300): This is the same as the model above except that we directly use the word embedding vectors x^s_j and x^t_k instead of the hidden states h^s_j and h^t_k in our model. In this case, each attention vector a_k is a weighted sum of {x^s_j}_{j=1}^M. We experiment with this setting because we hypothesize that the effectiveness of our model is largely related to the mLSTM architecture rather than the use of LSTMs to process the original sentences.

                  ground truth
prediction       N       E       C
N             2628     286     255
E              340    3005     159
C              250      77    2823

Table 2: The confusion matrix of the results by mLSTM with d = 300. N, E and C correspond to neutral, entailment and contradiction, respectively.

3.2 Main Results

Table 1 compares the performance of the various models we tested together with some previously reported results.
Model                                                   d     |θ|W+M   |θ|M    Train   Dev    Test
LSTM [Bowman et al. (2015)]                            100    10M      221K    84.4    -      77.6
Classifier [Bowman et al. (2015)]                       -      -        -      99.7    -      78.2
LSTM shared [Rocktäschel et al. (2016)]                159    3.9M     252K    84.4    83.0   81.4
Word-by-word attention [Rocktäschel et al. (2016)]     100    3.9M     252K    85.3    83.7   83.5
Word-by-word attention (our implementation)            150    340K     340K    85.5    83.3   82.6
mLSTM                                                  150    544K     544K    91.0    86.2   85.7
mLSTM with bi-LSTM sentence modeling                   150    1.4M     1.4M    91.3    86.6   86.0
mLSTM                                                  300    1.9M     1.9M    92.0    86.9   86.1
mLSTM with word embedding                              300    1.3M     1.3M    88.6    85.4   85.3

Table 1: Experiment results in terms of accuracy. d is the dimension of the hidden states. |θ|W+M is the total number of parameters and |θ|M is the number of parameters excluding the word embeddings. Note that the five models in the last section of the table were implemented by us while the other results were taken directly from previous papers. Note also that for these five models we do not update word embeddings, so |θ|W+M is the same as |θ|M. The three columns on the right are the accuracies of the trained models on the training data, the development data and the test data, respectively.

ID                      Sentence                                                                 Label
Premise                 A dog jumping for a Frisbee in the snow.
Hypothesis Example 1    An animal is outside in the cold weather, playing with a plastic toy.   entailment
Hypothesis Example 2    A cat washes his face and whiskers with his front paw.                  contradiction
Hypothesis Example 3    A pet is enjoying a game of fetch with his owner.                       neutral

Table 3: Three examples of sentence pairs with different relationship labels. The second hypothesis is a contradiction because it mentions a completely different event. The third hypothesis is neutral to the premise because the phrase "with his owner" cannot be inferred from the premise.

We have the following observations: (1) First of all, we can see that when we set d to 300, our model achieves an accuracy of 86.1% on the test data, which to the best of our knowledge is the highest on this data set. (2) If we compare our mLSTM model with our implementation of the word-by-word attention model by Rocktäschel et al. (2016) under the same setting with d = 150, we can see that our performance on the test data (85.7%) is higher than that of their model (82.6%). We also tested statistical significance and found the improvement to be statistically significant at the 0.001 level. (3) The performance of mLSTM with bi-LSTM sentence modeling compared with the model with standard LSTM sentence modeling when d is set to 150 shows that using bi-LSTM to process the original sentences helps (86.0% vs. 85.7% on the test data), but the difference is small and the complexity of bi-LSTM is much higher than that of LSTM. Therefore when we increased d to 300 we did not experiment with bi-LSTM sentence modeling. (4) Interestingly, when we experimented with the mLSTM model using the pre-trained word embeddings instead of LSTM-generated hidden states as initial representations of the premise and the hypothesis, we were able to achieve an accuracy of 85.3% on the test data, which is still better than the previously reported state of the art. This suggests that the mLSTM architecture coupled with the attention model works well, regardless of whether or not we use LSTMs to process the original sentences.

Because the NLI task is a three-way classification problem, to better understand the errors, we also show the confusion matrix of the results obtained by our mLSTM model with d = 300 in Table 2. We can see that there is more confusion between neutral and entailment and between neutral and contradiction than between entailment and contradiction. This shows that neutral is relatively hard to capture.

3.3 Further Analyses

To obtain a better understanding of how our proposed model actually performs the matching between a premise and a hypothesis, we further conduct the following analyses. First, we look at the learned word-by-word alignment weights α_{kj} to check whether the soft alignment makes sense. This is the same as what was done by Rocktäschel et al. (2016). We then look at the values of the various gate vectors of the mLSTM. By looking at these values, we aim to check (1) whether the model is able to differentiate between more important and less important word-level matching results, and (2) whether the model forgets certain matching results and remembers certain other ones.

To conduct the analyses, we choose three examples and display the various learned parameter values. These three sentence pairs share the same premise but have different hypotheses and different relationship labels. They are given in Table 3. The values of the alignment weights and the gate vectors are plotted in Figure 2.

Figure 2: The alignment weights and the gate vectors of the three examples.

Besides using the three examples, we will also give some overall statistics of the parameter values to confirm our observations with the three examples.

Word Alignment

First, let us look at the top-most plots of Figure 2. These plots show the alignment weights α_{kj} between the hypothesis and the premise, where a darker color corresponds to a larger value of α_{kj}.
Recall that α_{kj} is the degree to which the k-th word in the hypothesis is aligned with the j-th word in the premise. Also recall that the weights α_{kj} are configured such that for the same k all the α_{kj} add up to 1. This means the weights in the same row in these plots add up to 1. From the three plots we can see that the alignment weights generally make sense. For example, in Example 1, "animal" is strongly aligned with "dog" and "toy" aligned with "Frisbee." The phrase "cold weather" is aligned with "snow." In Example 3, we also see that "pet" is strongly aligned with "dog" and "game" aligned with "Frisbee."

In Example 2, "cat" is strongly aligned with "dog" and "washes" is aligned with "jumping." It may appear that these matching results are wrong. However, "dog" is likely the best match for "cat" among all the words in the premise, and as we will show later, this match between "cat" and "dog" is actually a strong indication of a contradiction between the two sentences. The same explanation applies to the match between "washes" and "jumping."

We also observe that some words are aligned with the NULL token we inserted. For example, the word "is" in the hypothesis in Example 1 does not correspond to any word in the premise and is therefore aligned with NULL. The words "face" and "whiskers" in Example 2 and "owner" in Example 3 are also aligned with NULL. Intuitively, if some important content words in the hypothesis are aligned with NULL, it is more likely that the relationship label is either contradiction or neutral.

Values of Gate Vectors

Next, let us look at the values of the learned gate vectors of our mLSTM for the three examples. We show these values under the setting where d is set to 150. Each row of these plots corresponds to one of the 150 dimensions. Again, a darker color indicates a higher value.

An input gate controls whether the input at the current position should be used in deriving the final hidden state of the current position. From the three plots of the input gates, we can observe that generally for stop words such as prepositions and articles the input gates have lower values, suggesting that the matching of these words is less important. On the other hand, content words such as nouns and verbs tend to have higher values of the input gates, which also makes sense because these words are generally more important for determining the final relationship label.

To further verify the observation above, we compute the average input gate values for stop words and the other content words. We find that the former has an average value of 0.287 with a standard deviation of 0.084 while the latter has an average value of 0.347 with a standard deviation of 0.116. This shows that indeed generally stop words have lower input gate values. Interestingly, we also find that some stop words may have higher input gate values if they are critical for the classification task. For example, the negation word "not" has an average input gate value of 0.444 with a standard deviation of 0.104. Overall, the values of the input gates confirm that the mLSTM helps differentiate the more important word-level matching results from the less important ones.
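Statistics of this kind can be computed from recorded gate activations with a computation like the following. The data layout (one recorded input-gate vector per hypothesis token) and the stop-word list are assumptions for illustration; the paper does not specify the exact stop-word list used.

```python
import numpy as np

# Illustrative stop-word list; the exact list used in our analysis is not specified here.
STOP_WORDS = {"a", "an", "the", "in", "on", "for", "with", "of", "is", "and", "his"}

def summarize(values):
    """Mean and standard deviation of a list of per-token gate values."""
    if not values:
        return float("nan"), float("nan")
    return float(np.mean(values)), float(np.std(values))

def average_gate_values(token_gate_pairs):
    """`token_gate_pairs` holds (token, gate_vector) pairs, where gate_vector is
    the d-dimensional input gate i^m_k recorded when the mLSTM processed that
    hypothesis token. Returns (mean, std) for stop words and for content words."""
    stop, content = [], []
    for token, gate in token_gate_pairs:
        value = float(np.mean(gate))  # average the gate over its d dimensions
        (stop if token.lower() in STOP_WORDS else content).append(value)
    return summarize(stop), summarize(content)

# Toy usage with random "recorded" gate activations for one hypothesis.
rng = np.random.default_rng(3)
tokens = ["a", "cat", "washes", "his", "face"]
pairs = [(t, rng.uniform(0.0, 1.0, size=150)) for t in tokens]
(stop_mu, stop_sd), (content_mu, content_sd) = average_gate_values(pairs)
print(f"stop: {stop_mu:.3f}+/-{stop_sd:.3f}, content: {content_mu:.3f}+/-{content_sd:.3f}")
```

The same computation applies to the forget gates discussed next, grouping the per-token values by relationship label instead of by word class.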
Next, let us look at the forget gates. Recall that a forget gate controls the importance of the previous cell state in deriving the final hidden state of the current position. Higher values of a forget gate indicate that we need to remember the previous cell state and pass it on, whereas lower values indicate that we should probably forget the previous cell. From the three plots of the forget gates, we can see that overall the colors are the lightest for Example 1, which is an entailment. This suggests that when the hypothesis is an entailment of the premise, the mLSTM tends to forget the previous matching results. On the other hand, for Example 2 and Example 3, which are contradiction and neutral, we see generally darker colors. In particular, in Example 2, we can see that the colors are consistently dark starting from the word "his" in the hypothesis until the end. We believe the explanation is that after the mLSTM processes the first three words of the hypothesis, "A cat washes," it sees that the matching between "cat" and "dog" and between "washes" and "jumping" is a strong indication of a contradiction, and therefore these matching results need to be remembered until the end of the mLSTM for the final prediction.

We have also checked the forget gates of the other sentence pairs in the test data by computing the average forget gate values and the standard deviations for entailment, neutral and contradiction, respectively. We find that the values are 0.446±0.123, 0.507±0.148 and 0.536±0.170, respectively. For contradiction and neutral, the forget gates start to have higher values from certain positions of the hypotheses.

Based on the observations above, we hypothesize that the way the mLSTM works is as follows. It remembers important mismatches, which are useful for predicting the contradiction or the neutral relationship, and forgets good matching results. At the end of the mLSTM, if no important mismatch is remembered, the final classifier will likely predict entailment by default. Otherwise, depending on the kind of mismatch remembered, the classifier will predict either contradiction or neutral.

For the output gates, we are not able to draw any important conclusion except that the output gates seem to be positively correlated with the input gates but they tend to be darker than the input gates.

4 Related Work

There has been much work on natural language inference. Shallow methods rely mostly on lexical similarities but are shown to be robust. For example, Bowman et al. (2015) experimented with a lexicalized classifier-based method, which only uses lexical information and achieves an accuracy of 78.2% on the SNLI corpus. More advanced methods use syntactic structures of the sentences to help match them. For example, Mehdad et al. (2009) applied syntactic-semantic tree kernels for recognizing textual entailment. Because inference is essentially a logic problem, methods based on formal logic (Clark and Harrison, 2009) or natural logic (MacCartney, 2009) have also been proposed. A comprehensive review of existing work can be found in the book by Dagan et al. (2013).

The work most relevant to ours is the recently proposed neural attention model-based method by Rocktäschel et al. (2016), which we have detailed in previous sections. Neural attention models have recently been applied to some natural language processing tasks including machine translation (Bahdanau et al., 2015), abstractive summarization (Rush et al., 2015) and question answering (Hermann et al., 2015). Rocktäschel et al. (2016) showed that the neural attention model could help derive a better representation of the premise to be used to match the hypothesis, whereas in our work we also use it to derive representations of the premise that are used to sequentially match the words in the hypothesis.

The SNLI corpus is new and so far it has only been used in a few studies. Besides the work by Bowman et al. (2015) themselves and by Rocktäschel et al. (2016), there are two other studies which used the SNLI corpus. Vendrov et al. (2015) applied a Skip-Thought model proposed by Kiros et al. (2015) to the NLI task and reported an accuracy of 81.5% on the test data. Mou et al. (2015) used tree-based CNN encoders to obtain sentence embeddings and achieved an accuracy of 82.1%.

5 Conclusions and Future Work

In this paper, we proposed a special LSTM architecture for the task of natural language inference. Based on a recent work by Rocktäschel et al. (2016), we first used neural attention models to derive attention-weighted vector representations of the premise. We then designed a match-LSTM that processes the hypothesis word by word while trying to match the hypothesis with the premise. The last hidden state of this mLSTM can be used for predicting the relationship between the premise and the hypothesis. Experiments on the SNLI corpus showed that the mLSTM model outperformed the state-of-the-art performance reported so far on this data set. Moreover, closer analyses of the gate vectors revealed that our mLSTM indeed remembers and passes on important matching results, which are typically mismatches that indicate a contradiction or a neutral relationship between the premise and the hypothesis.

With the large number of parameters to learn, an inevitable limitation of our model is that a large training data set is needed to learn good model parameters. Indeed, some preliminary experiments applying our mLSTM to the SICK corpus (Marelli et al., 2014), a smaller textual entailment benchmark data set, did not give very good results. We believe that this is because our model learns everything from scratch except using the pre-trained word embeddings. A future direction would be to incorporate other resources such as the paraphrase database (Ganitkevitch et al., 2013) into the learning process.
References

[Bahdanau et al.2015] Dzmitry Bahdanau, Kyunghyun Cho, and Yoshua Bengio. 2015. Neural machine translation by jointly learning to align and translate. In Proceedings of the International Conference on Learning Representations.

[Bowman et al.2015] Samuel R Bowman, Gabor Angeli, Christopher Potts, and Christopher D Manning. 2015. A large annotated corpus for learning natural language inference. In Proceedings of the 2015 Conference on Empirical Methods in Natural Language Processing.

[Clark and Harrison2009] Peter Clark and Phil Harrison. 2009. An inference-based approach to recognizing entailment. In Proceedings of the Text Analysis Conference.

[Dagan et al.2005] Ido Dagan, Oren Glickman, and Bernardo Magnini. 2005. The PASCAL Recognising Textual Entailment Challenge. In Proceedings of the PASCAL Challenges Workshop on Recognizing Textual Entailment.

[Dagan et al.2013] Ido Dagan, Dan Roth, Mark Sammons, and Fabio Massimo Zanzotto. 2013. Recognizing Textual Entailment: Models and Applications. Synthesis Lectures on Human Language Technologies. Morgan & Claypool Publishers.

[Ganitkevitch et al.2013] Juri Ganitkevitch, Benjamin Van Durme, and Chris Callison-Burch. 2013. PPDB: The paraphrase database. In Proceedings of the 2013 Conference of the North American Chapter of the Association for Computational Linguistics.

[Glickman et al.2005] Oren Glickman, Ido Dagan, and Moshe Koppel. 2005. Web based probabilistic textual entailment. In Proceedings of the PASCAL Challenges Workshop on Recognizing Textual Entailment.

[Graves2012] Alex Graves. 2012. Supervised sequence labelling with recurrent neural networks, volume 385. Springer.

[Hermann et al.2015] Karl Moritz Hermann, Tomas Kocisky, Edward Grefenstette, Lasse Espeholt, Will Kay, Mustafa Suleyman, and Phil Blunsom. 2015. Teaching machines to read and comprehend. In Advances in Neural Information Processing Systems.

[Hochreiter and Schmidhuber1997] Sepp Hochreiter and Jürgen Schmidhuber. 1997. Long short-term memory. Neural Computation, 9(8):1735–1780.

[Kingma and Ba2014] Diederik Kingma and Jimmy Ba. 2014. Adam: A method for stochastic optimization. In Proceedings of the International Conference on Learning Representations.

[Kiros et al.2015] Ryan Kiros, Yukun Zhu, Ruslan R Salakhutdinov, Richard Zemel, Raquel Urtasun, Antonio Torralba, and Sanja Fidler. 2015. Skip-thought vectors. In Advances in Neural Information Processing Systems.

[MacCartney et al.2008] Bill MacCartney, Michel Galley, and Christopher D Manning. 2008. A phrase-based alignment model for natural language inference. In Proceedings of the Conference on Empirical Methods in Natural Language Processing.

[MacCartney2009] Bill MacCartney. 2009. Natural Language Inference. Ph.D. thesis, Stanford University.

[Marelli et al.2014] Marco Marelli, Stefano Menini, Marco Baroni, Luisa Bentivogli, Raffaella Bernardi, and Roberto Zamparelli. 2014. A SICK cure for the evaluation of compositional distributional semantic models. In Proceedings of the Ninth International Conference on Language Resources and Evaluation.

[Mehdad et al.2009] Yashar Mehdad, Alessandro Moschitti, and Fabio Massimo Zanzotto. 2009. SemKer: Syntactic/semantic kernels for recognizing textual entailment. In Proceedings of the Text Analysis Conference.

[Mou et al.2015] Lili Mou, Men Rui, Ge Li, Yan Xu, Lu Zhang, Rui Yan, and Zhi Jin. 2015. Recognizing entailment and contradiction by tree-based convolution. arXiv preprint arXiv:1512.08422.

[Pennington et al.2014] Jeffrey Pennington, Richard Socher, and Christopher D Manning. 2014. GloVe: Global vectors for word representation. In Proceedings of the Conference on Empirical Methods in Natural Language Processing.

[Rocktäschel et al.2016] Tim Rocktäschel, Edward Grefenstette, Karl Moritz Hermann, Tomáš Kočiský, and Phil Blunsom. 2016. Reasoning about entailment with neural attention. In Proceedings of the International Conference on Learning Representations.

[Rush et al.2015] Alexander M Rush, Sumit Chopra, and Jason Weston. 2015. A neural attention model for abstractive sentence summarization. In Proceedings of the Conference on Empirical Methods in Natural Language Processing.

[Vendrov et al.2015] Ivan Vendrov, Ryan Kiros, Sanja Fidler, and Raquel Urtasun. 2015. Order-embeddings of images and language. arXiv preprint arXiv:1511.06361.