
Multi-Task Deep Neural Networks for Natural Language Understanding

Xiaodong Liu*1, Pengcheng He*2, Weizhu Chen2, Jianfeng Gao1
1 Microsoft Research    2 Microsoft Dynamics 365 AI
{xiaodl,penhe,wzchen,jfgao}@microsoft.com

arXiv:1901.11504v2 [cs.CL] 30 May 2019

Abstract

In this paper, we present a Multi-Task Deep Neural Network (MT-DNN) for learning representations across multiple natural language understanding (NLU) tasks. MT-DNN not only leverages large amounts of cross-task data, but also benefits from a regularization effect that leads to more general representations, helping the model adapt to new tasks and domains. MT-DNN extends the model proposed in Liu et al. (2015) by incorporating a pre-trained bidirectional transformer language model, known as BERT (Devlin et al., 2018). MT-DNN obtains new state-of-the-art results on ten NLU tasks, including SNLI, SciTail, and eight out of nine GLUE tasks, pushing the GLUE benchmark to 82.7% (2.2% absolute improvement)1. We also demonstrate using the SNLI and SciTail datasets that the representations learned by MT-DNN allow domain adaptation with substantially fewer in-domain labels than the pre-trained BERT representations. The code and pre-trained models are publicly available at https://github.com/namisan/mt-dnn.

1 Introduction

Learning vector-space representations of text, e.g., words and sentences, is fundamental to many natural language understanding (NLU) tasks. Two popular approaches are multi-task learning and language model pre-training. In this paper we combine the strengths of both approaches by proposing a new Multi-Task Deep Neural Network (MT-DNN).

Multi-Task Learning (MTL) is inspired by human learning activities in which people often apply the knowledge learned from previous tasks to help learn a new task (Caruana, 1997; Zhang and Yang, 2017). For example, it is easier for a person who knows how to ski to learn skating than for one who does not. Similarly, it is useful for multiple (related) tasks to be learned jointly so that the knowledge learned in one task can benefit other tasks.

Recently, there has been growing interest in applying MTL to representation learning using deep neural networks (DNNs) (Collobert et al., 2011; Liu et al., 2015; Luong et al., 2015; Xu et al., 2018; Guo et al., 2018; Ruder et al., 2019) for two reasons. First, supervised learning of DNNs requires large amounts of task-specific labeled data, which is not always available. MTL provides an effective way of leveraging supervised data from many related tasks. Second, multi-task learning profits from a regularization effect that alleviates overfitting to a specific task, thus making the learned representations universal across tasks.

In contrast to MTL, language model pre-training has been shown to be effective for learning universal language representations by leveraging large amounts of unlabeled data. A recent survey is included in Gao et al. (2018). Some of the most prominent examples are ELMo (Peters et al., 2018), GPT (Radford et al., 2018) and BERT (Devlin et al., 2018). These are neural network language models trained on text data using unsupervised objectives. For example, BERT is based on a multi-layer bidirectional Transformer, and is trained on plain text for masked word prediction and next sentence prediction tasks. To apply a pre-trained model to specific NLU tasks, we often need to fine-tune, for each task, the model with additional task-specific layers using task-specific training data. For example, Devlin et al. (2018) show that BERT can be fine-tuned this way to create state-of-the-art models for a range of NLU tasks, such as question answering and natural language inference.

We argue that MTL and language model pre-training are complementary technologies, and can be combined to improve the learning of text representations to boost the performance of various NLU tasks.

* Equal Contribution.
1 As of February 25, 2019 on the latest GLUE test set.
To this end, we extend the MT-DNN model originally proposed in Liu et al. (2015) by incorporating BERT as its shared text encoding layers. As shown in Figure 1, the lower layers (i.e., text encoding layers) are shared across all tasks, while the top layers are task-specific, combining different types of NLU tasks such as single-sentence classification, pairwise text classification, text similarity, and relevance ranking. Similar to the BERT model, MT-DNN can be adapted to a specific task via fine-tuning. Unlike BERT, MT-DNN uses MTL, in addition to language model pre-training, for learning text representations.

MT-DNN obtains new state-of-the-art results on eight out of nine NLU tasks2 used in the General Language Understanding Evaluation (GLUE) benchmark (Wang et al., 2018), pushing the GLUE benchmark score to 82.7%, amounting to a 2.2% absolute improvement over BERT. We further extend the superiority of MT-DNN to the SNLI (Bowman et al., 2015a) and SciTail (Khot et al., 2018) tasks. The representations learned by MT-DNN allow domain adaptation with substantially fewer in-domain labels than the pre-trained BERT representations. For example, our adapted models achieve an accuracy of 91.6% on SNLI and 95.0% on SciTail, outperforming the previous state-of-the-art performance by 1.5% and 6.7%, respectively. Even with only 0.1% or 1.0% of the original training data, the performance of MT-DNN on both the SNLI and SciTail datasets is better than that of many existing models. All of these results clearly demonstrate MT-DNN's exceptional generalization capability via multi-task learning.

2 The only GLUE task where MT-DNN does not create a new state-of-the-art result is WNLI. But as noted on the GLUE webpage (https://gluebenchmark.com/faq), there are issues in the dataset, and none of the submitted systems has ever outperformed the majority voting baseline, whose accuracy is 65.1.

2 Tasks

The MT-DNN model combines four types of NLU tasks: single-sentence classification, pairwise text classification, text similarity scoring, and relevance ranking. For concreteness, we describe them using the NLU tasks defined in the GLUE benchmark as examples.

Single-Sentence Classification: Given a sentence3, the model labels it using one of the pre-defined class labels. For example, the CoLA task is to predict whether an English sentence is grammatically plausible. The SST-2 task is to determine whether the sentiment of a sentence extracted from movie reviews is positive or negative.

3 In this study, a sentence can be an arbitrary span of contiguous text or word sequence, rather than a linguistically plausible sentence.

Text Similarity: This is a regression task. Given a pair of sentences, the model predicts a real-valued score indicating the semantic similarity of the two sentences. STS-B is the only example of this task in GLUE.

Pairwise Text Classification: Given a pair of sentences, the model determines the relationship of the two sentences based on a set of pre-defined labels. For example, both RTE and MNLI are language inference tasks, where the goal is to predict whether a sentence is an entailment, contradiction, or neutral with respect to the other. QQP and MRPC are paraphrase datasets that consist of sentence pairs. The task is to predict whether the sentences in a pair are semantically equivalent.

Relevance Ranking: Given a query and a list of candidate answers, the model ranks all the candidates in the order of relevance to the query. For example, QNLI is a version of the Stanford Question Answering Dataset (Rajpurkar et al., 2016). The task involves assessing whether a sentence contains the correct answer to a given query. Although QNLI is defined as a binary classification task in GLUE, in this study we formulate it as a pairwise ranking task, where the model is expected to rank the candidate that contains the correct answer higher than the candidate that does not. We will show that this formulation leads to a significant improvement in accuracy over binary classification.

3 The Proposed MT-DNN Model
Figure 1: Architecture of the MT-DNN model for representation learning. The lower layers are shared across
all tasks while the top layers are task-specific. The input X (either a sentence or a pair of sentences) is first
represented as a sequence of embedding vectors, one for each word, in l1 . Then the Transformer encoder captures
the contextual information for each word and generates the shared contextual embedding vectors in l2 . Finally, for
each task, additional task-specific layers generate task-specific representations, followed by operations necessary
for classification, similarity scoring, or relevance ranking.

The architecture of the MT-DNN model is shown in Figure 1. The lower layers are shared across all tasks, while the top layers represent task-specific outputs. The input X, which is a word sequence (either a sentence or a pair of sentences packed together), is first represented as a sequence of embedding vectors, one for each word, in l1. Then the Transformer encoder captures the contextual information for each word via self-attention, and generates a sequence of contextual embeddings in l2. This is the shared semantic representation that is trained by our multi-task objectives. In what follows, we elaborate on the model in detail.

Lexicon Encoder (l1): The input X = {x1, ..., xm} is a sequence of tokens of length m. Following Devlin et al. (2018), the first token x1 is always the [CLS] token. If X is packed by a sentence pair (X1, X2), we separate the two sentences with a special token [SEP]. The lexicon encoder maps X into a sequence of input embedding vectors, one for each token, constructed by summing the corresponding word, segment, and positional embeddings.

Transformer Encoder (l2): We use a multi-layer bidirectional Transformer encoder (Vaswani et al., 2017) to map the input representation vectors (l1) into a sequence of contextual embedding vectors C ∈ R^{d×m}. This is the shared representation across different tasks. Unlike the BERT model (Devlin et al., 2018) that learns the representation via pre-training, MT-DNN learns the representation using multi-task objectives, in addition to pre-training.
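To make the shared layers concrete, the following is a minimal PyTorch sketch of the lexicon and Transformer encoders as described above. It is not the released implementation (which initializes these layers from a pre-trained BERT checkpoint); the class names, default hyperparameters, and the use of batch_first (available in recent PyTorch releases) are our own illustrative choices.

import torch
import torch.nn as nn


class LexiconEncoder(nn.Module):
    """l1: sums word, segment, and positional embeddings for each token."""

    def __init__(self, vocab_size=30522, hidden=768, max_len=512, n_segments=2):
        super().__init__()
        self.word = nn.Embedding(vocab_size, hidden)
        self.segment = nn.Embedding(n_segments, hidden)
        self.position = nn.Embedding(max_len, hidden)

    def forward(self, token_ids, segment_ids):
        # token_ids, segment_ids: (batch, seq_len)
        positions = torch.arange(token_ids.size(1), device=token_ids.device)
        return self.word(token_ids) + self.segment(segment_ids) + self.position(positions)


class SharedTextEncoder(nn.Module):
    """l1 + l2: the task-shared layers of MT-DNN."""

    def __init__(self, hidden=768, n_layers=12, n_heads=12):
        super().__init__()
        self.lexicon = LexiconEncoder(hidden=hidden)
        layer = nn.TransformerEncoderLayer(
            d_model=hidden, nhead=n_heads, dim_feedforward=4 * hidden, batch_first=True
        )
        self.transformer = nn.TransformerEncoder(layer, num_layers=n_layers)

    def forward(self, token_ids, segment_ids, attention_mask):
        l1 = self.lexicon(token_ids, segment_ids)            # (batch, m, d)
        # True entries of src_key_padding_mask mark padding positions to ignore.
        l2 = self.transformer(l1, src_key_padding_mask=~attention_mask.bool())
        return l2                                            # contextual embeddings C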

Below, we will describe the task-specific layers using the NLU tasks in GLUE as examples, although in practice we can incorporate arbitrary natural language tasks such as text generation, where the output layers are implemented as a neural decoder.

Single-Sentence Classification Output: Suppose that x is the contextual embedding (l2) of the token [CLS], which can be viewed as the semantic representation of input sentence X. Take the SST-2 task as an example. The probability that X is labeled as class c (i.e., the sentiment) is predicted by a logistic regression with softmax:

Pr(c|X) = softmax(W_SST^T · x),   (1)

where W_SST is the task-specific parameter matrix.

Text Similarity Output: Take the STS-B task as an example. Suppose that x is the contextual embedding (l2) of [CLS], which can be viewed as the semantic representation of the input sentence pair (X1, X2). We introduce a task-specific parameter vector w_STS to compute the similarity score as:

Sim(X1, X2) = w_STS^T · x,   (2)

where Sim(X1, X2) is a real value in the range (−∞, ∞).
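As a sketch of Equations 1 and 2 (again ours, not the released code), both output modules are single linear layers applied to the [CLS] embedding x; the hidden size and class count below are placeholders.

import torch.nn as nn
import torch.nn.functional as F


class SingleSentenceClassificationHead(nn.Module):
    """Eq. 1: Pr(c|X) = softmax(W_SST^T x) over the [CLS] embedding."""

    def __init__(self, hidden=768, n_classes=2):
        super().__init__()
        self.proj = nn.Linear(hidden, n_classes, bias=False)   # W_SST

    def forward(self, cls_embedding):            # (batch, d)
        return F.softmax(self.proj(cls_embedding), dim=-1)


class TextSimilarityHead(nn.Module):
    """Eq. 2: Sim(X1, X2) = w_STS^T x, an unbounded real-valued score."""

    def __init__(self, hidden=768):
        super().__init__()
        self.proj = nn.Linear(hidden, 1, bias=False)            # w_STS

    def forward(self, cls_embedding):            # (batch, d)
        return self.proj(cls_embedding).squeeze(-1)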
Pairwise Text Classification Output: Take natural language inference (NLI) as an example. The NLI task defined here involves a premise P = (p1, ..., pm) of m words and a hypothesis H = (h1, ..., hn) of n words, and aims to find a logical relationship R between P and H. The design of the output module follows the answer module of the stochastic answer network (SAN) (Liu et al., 2018a), a state-of-the-art neural NLI model. SAN's answer module uses multi-step reasoning. Rather than directly predicting the entailment given the input, it maintains a state and iteratively refines its predictions.

The SAN answer module works as follows. We first construct the working memory of premise P by concatenating the contextual embeddings of the words in P, which are the output of the Transformer encoder, denoted as M^p ∈ R^{d×m}, and similarly the working memory of hypothesis H, denoted as M^h ∈ R^{d×n}. Then, we perform K-step reasoning on the memory to output the relation label, where K is a hyperparameter. At the beginning, the initial state s^0 is the summary of M^h: s^0 = Σ_j α_j M_j^h, where α_j = exp(w_1^T · M_j^h) / Σ_i exp(w_1^T · M_i^h). At time step k in the range of {1, 2, ..., K−1}, the state is defined by s^k = GRU(s^{k−1}, x^k). Here, x^k is computed from the previous state s^{k−1} and memory M^p: x^k = Σ_j β_j M_j^p, where β = softmax(s^{k−1} W_2^T M^p). A one-layer classifier is used to determine the relation at each step k:

Pr^k = softmax(W_3^T [s^k; x^k; |s^k − x^k|; s^k · x^k]).   (3)

At last, we utilize all of the K outputs by averaging the scores:

Pr = avg([Pr^0, Pr^1, ..., Pr^{K−1}]).   (4)

Each Pr is a probability distribution over all the relations R ∈ R. During training, we apply stochastic prediction dropout (Liu et al., 2018b) before the above averaging operation. During decoding, we average all outputs to improve robustness.
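The K-step reasoning of Equations 3 and 4 can be sketched as follows. This is written against the description above rather than the authors' SAN code, and the way stochastic prediction dropout is applied (randomly dropping whole steps before averaging) is our reading of Liu et al. (2018b).

import torch
import torch.nn as nn
import torch.nn.functional as F


class SANAnswerModule(nn.Module):
    """K-step reasoning over premise memory Mp (B, m, d) and hypothesis memory Mh (B, n, d)."""

    def __init__(self, hidden=768, n_labels=3, k_steps=5, step_dropout=0.1):
        super().__init__()
        self.k_steps = k_steps
        self.step_dropout = step_dropout
        self.w1 = nn.Linear(hidden, 1, bias=False)         # attention for the initial state s0
        self.w2 = nn.Linear(hidden, hidden, bias=False)     # bilinear attention over Mp
        self.gru = nn.GRUCell(hidden, hidden)               # s^k = GRU(s^{k-1}, x^k)
        self.classifier = nn.Linear(4 * hidden, n_labels)   # W3 over [s; x; |s - x|; s * x]

    def forward(self, Mp, Mh):
        # Initial state s0: attention-weighted summary of the hypothesis memory.
        alpha = F.softmax(self.w1(Mh).squeeze(-1), dim=-1)                 # (B, n)
        s = torch.bmm(alpha.unsqueeze(1), Mh).squeeze(1)                   # (B, d)
        step_probs = []
        for _ in range(self.k_steps):
            # Attend to the premise memory with the current state to get x^k.
            scores = torch.bmm(self.w2(s).unsqueeze(1), Mp.transpose(1, 2)).squeeze(1)
            beta = F.softmax(scores, dim=-1)                               # (B, m)
            x = torch.bmm(beta.unsqueeze(1), Mp).squeeze(1)                # (B, d)
            features = torch.cat([s, x, torch.abs(s - x), s * x], dim=-1)
            step_probs.append(F.softmax(self.classifier(features), dim=-1))  # Eq. 3
            s = self.gru(x, s)                                             # state update
        probs = torch.stack(step_probs, dim=0)                             # (K, B, n_labels)
        if self.training:
            # Stochastic prediction dropout: drop whole steps before averaging.
            keep = (torch.rand(self.k_steps, 1, 1, device=probs.device) > self.step_dropout).float()
            return (probs * keep).sum(dim=0) / keep.sum(dim=0).clamp(min=1.0)
        return probs.mean(dim=0)                                           # Eq. 4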
Relevance Ranking Output: Take QNLI as an example. Suppose that x is the contextual embedding vector of [CLS], which is the semantic representation of a pair of a question and its candidate answer (Q, A). We compute the relevance score as:

Rel(Q, A) = g(w_QNLI^T · x),   (5)

For a given Q, we rank all of its candidate answers based on their relevance scores computed using Equation 5.
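A matching sketch of Equation 5 is below; the text does not specify g, so the sigmoid is an assumption, as are the module and parameter names. At inference time, the candidates of a query are simply sorted by this score.

import torch.nn as nn


class RelevanceHead(nn.Module):
    """Eq. 5: Rel(Q, A) = g(w_QNLI^T x) on the [CLS] embedding of the packed (Q, A) pair."""

    def __init__(self, hidden=768):
        super().__init__()
        self.proj = nn.Linear(hidden, 1, bias=False)   # w_QNLI
        self.g = nn.Sigmoid()                          # assumed form of g; not specified in the text

    def forward(self, cls_embedding):                  # (batch, d)
        return self.g(self.proj(cls_embedding)).squeeze(-1)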
3.1 The Training Procedure

The training procedure of MT-DNN consists of two stages: pretraining and multi-task learning. The pretraining stage follows that of the BERT model (Devlin et al., 2018). The parameters of the lexicon encoder and the Transformer encoder are learned using two unsupervised prediction tasks: masked language modeling and next sentence prediction.4

4 In this study we use the pre-trained BERT models released by the authors.

In the multi-task learning stage, we use mini-batch based stochastic gradient descent (SGD) to learn the parameters of our model (i.e., the parameters of all shared layers and task-specific layers), as shown in Algorithm 1. In each epoch, a mini-batch b_t is selected (e.g., from among all 9 GLUE tasks), and the model is updated according to the task-specific objective for the task t. This approximately optimizes the sum of all multi-task objectives.

Algorithm 1: Training a MT-DNN model.
  Initialize model parameters Θ randomly.
  Pre-train the shared layers (i.e., the lexicon encoder and the Transformer encoder).
  Set the max number of epochs: epoch_max.
  // Prepare the data for T tasks.
  for t in 1, 2, ..., T do
    Pack the dataset t into mini-batches: D_t.
  end
  for epoch in 1, 2, ..., epoch_max do
    1. Merge all the datasets: D = D_1 ∪ D_2 ∪ ... ∪ D_T
    2. Shuffle D
    for b_t in D do
      // b_t is a mini-batch of task t.
      3. Compute loss: L(Θ)
         L(Θ) = Eq. 6 for classification
         L(Θ) = Eq. 7 for regression
         L(Θ) = Eq. 8 for ranking
      4. Compute gradient: ∇(Θ)
      5. Update model: Θ = Θ − ε∇(Θ)
    end
  end

For the classification tasks (i.e., single-sentence or pairwise text classification), we use the cross-entropy loss as the objective:

−Σ_c 1(X, c) log(Pr(c|X)),   (6)

where 1(X, c) is the binary indicator (0 or 1) of whether class label c is the correct classification for X, and Pr(·) is defined by, e.g., Equation 1 or 4.

For the text similarity tasks, such as STS-B, where each sentence pair is annotated with a real-valued score y, we use the mean squared error as the objective:

(y − Sim(X1, X2))^2,   (7)

where Sim(·) is defined by Equation 2.

The objective for the relevance ranking tasks follows the pairwise learning-to-rank paradigm (Burges et al., 2005; Huang et al., 2013). Take QNLI as an example. Given a query Q, we obtain a list of candidate answers A which contains a positive example A+ that includes the correct answer, and |A| − 1 negative examples. We then minimize the negative log likelihood of the positive example given queries across the training data:

−Σ_{(Q, A+)} log Pr(A+|Q),   (8)

Pr(A+|Q) = exp(γ Rel(Q, A+)) / Σ_{A′∈A} exp(γ Rel(Q, A′)),   (9)

where Rel(·) is defined by Equation 5 and γ is a tuning factor determined on held-out data. In our experiment, we simply set γ to 1.
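Putting Algorithm 1 and Equations 6-9 together, a schematic training loop might look as follows. The shared_encoder, the per-task heads, and the batch fields are the hypothetical components sketched earlier, each ranking batch is assumed to hold all |A| candidates of one query with the positive candidate at index 0, and γ is fixed to 1 as in the paper.

import random

import torch
import torch.nn.functional as F


def task_loss(task_type, head, shared_encoder, batch, gamma=1.0):
    """Step 3 of Algorithm 1: the task-specific objective for one mini-batch."""
    l2 = shared_encoder(batch["token_ids"], batch["segment_ids"], batch["attention_mask"])
    cls = l2[:, 0]                                        # [CLS] contextual embedding, (B, d)
    if task_type == "classification":
        # Eq. 6: cross entropy between the predicted distribution and the gold label.
        probs = head(cls)                                 # (B, n_classes), already softmax-ed
        return F.nll_loss(torch.log(probs + 1e-9), batch["label"])
    if task_type == "regression":
        # Eq. 7: mean squared error against the gold similarity score.
        return F.mse_loss(head(cls), batch["score"])
    # Eqs. 8-9 (ranking): negative log likelihood of the positive candidate under a
    # softmax over all |A| candidates of the query.
    rel = head(cls)                                       # (|A|,) relevance scores
    return -F.log_softmax(gamma * rel, dim=0)[0]          # positive candidate at index 0


def train(datasets, task_types, shared_encoder, heads, optimizer, epochs=5):
    """Algorithm 1: merge the packed mini-batches of all tasks, shuffle, and update on each."""
    for _ in range(epochs):
        batches = [(t, b) for t, data in datasets.items() for b in data]   # 1. merge
        random.shuffle(batches)                                            # 2. shuffle
        for task, batch in batches:
            loss = task_loss(task_types[task], heads[task], shared_encoder, batch)  # 3.
            optimizer.zero_grad()
            loss.backward()                                                # 4. compute gradient
            optimizer.step()                                               # 5. update model
    return shared_encoder, heads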
4 Experiments

We evaluate the proposed MT-DNN on three popular NLU benchmarks: GLUE (Wang et al., 2018), SNLI (Bowman et al., 2015b), and SciTail (Khot et al., 2018). We compare MT-DNN with existing state-of-the-art models including BERT, and demonstrate the effectiveness of MTL with and without model fine-tuning using GLUE, and domain adaptation using both SNLI and SciTail.

4.1 Datasets

This section briefly describes the GLUE, SNLI, and SciTail datasets, as summarized in Table 1.

GLUE The General Language Understanding Evaluation (GLUE) benchmark is a collection of nine NLU tasks as in Table 1, including question answering, sentiment analysis, text similarity and textual entailment; it is considered well-designed for evaluating the generalization and robustness of NLU models.

SNLI The Stanford Natural Language Inference (SNLI) dataset contains 570k human annotated sentence pairs, in which the premises are drawn from the captions of the Flickr30 corpus and hypotheses are manually annotated (Bowman et al., 2015b). This is the most widely used entailment dataset for NLI. The dataset is used only for domain adaptation in this study.

SciTail This is a textual entailment dataset derived from a science question answering (SciQ) dataset (Khot et al., 2018). The task involves assessing whether a given premise entails a given hypothesis. In contrast to the other entailment datasets mentioned previously, the hypotheses in SciTail are created from science questions, while the corresponding answer candidates and premises come from relevant web sentences retrieved from a large corpus. As a result, these sentences are linguistically challenging and the lexical similarity of premise and hypothesis is often high, thus making SciTail particularly difficult. The dataset is used only for domain adaptation in this study.

4.2 Implementation details

Our implementation of MT-DNN is based on the PyTorch implementation of BERT5. We used Adamax (Kingma and Ba, 2014) as our optimizer with a learning rate of 5e-5 and a batch size of 32, following Devlin et al. (2018). The maximum number of epochs was set to 5. A linear learning rate decay schedule with warm-up over 0.1 was used, unless stated otherwise. We also set the dropout rate of all the task-specific layers to 0.1, except 0.3 for MNLI and 0.05 for CoLA. To avoid the exploding gradient problem, we clipped the gradient norm to within 1. All the texts were tokenized using wordpieces and were chopped to spans no longer than 512 tokens.

5 https://github.com/huggingface/pytorch-pretrained-BERT
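The hyperparameters in this paragraph translate roughly into the setup below. The linear warmup-then-decay schedule is implemented with a generic LambdaLR as a stand-in for whichever scheduler the released code uses, and model and num_training_steps are placeholders.

from torch.optim import Adamax
from torch.optim.lr_scheduler import LambdaLR


def build_optimizer(model, num_training_steps, lr=5e-5, warmup=0.1, max_grad_norm=1.0):
    optimizer = Adamax(model.parameters(), lr=lr)
    warmup_steps = int(warmup * num_training_steps)

    def lr_lambda(step):
        # Linear warmup over the first 10% of steps, then linear decay to zero.
        if step < warmup_steps:
            return step / max(1, warmup_steps)
        return max(0.0, (num_training_steps - step) / max(1, num_training_steps - warmup_steps))

    scheduler = LambdaLR(optimizer, lr_lambda)
    return optimizer, scheduler, max_grad_norm


# Inside the training loop, gradients are clipped before each optimizer step:
#   torch.nn.utils.clip_grad_norm_(model.parameters(), max_grad_norm)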
Corpus Task #Train #Dev #Test #Label Metrics
Single-Sentence Classification (GLUE)
CoLA Acceptability 8.5k 1k 1k 2 Matthews corr
SST-2 Sentiment 67k 872 1.8k 2 Accuracy
Pairwise Text Classification (GLUE)
MNLI NLI 393k 20k 20k 3 Accuracy
RTE NLI 2.5k 276 3k 2 Accuracy
WNLI NLI 634 71 146 2 Accuracy
QQP Paraphrase 364k 40k 391k 2 Accuracy/F1
MRPC Paraphrase 3.7k 408 1.7k 2 Accuracy/F1
Text Similarity (GLUE)
STS-B Similarity 7k 1.5k 1.4k 1 Pearson/Spearman corr
Relevance Ranking (GLUE)
QNLI QA/NLI 108k 5.7k 5.7k 2 Accuracy
Pairwise Text Classification
SNLI NLI 549k 9.8k 9.8k 3 Accuracy
SciTail NLI 23.5k 1.3k 2.1k 2 Accuracy

Table 1: Summary of the three benchmarks: GLUE, SNLI and SciTail.

Model CoLA SST-2 MRPC STS-B QQP MNLI-m/mm QNLI RTE WNLI AX Score
(#train) 8.5k 67k 3.7k 7k 364k 393k 108k 2.5k 634
BiLSTM+ELMo+Attn (1) 36.0 90.4 84.9/77.9 75.1/73.3 64.8/84.7 76.4/76.1 - 56.8 65.1 26.5 70.5
Singletask Pretrain Transformer (2) 45.4 91.3 82.3/75.7 82.0/80.0 70.3/88.5 82.1/81.4 - 56.0 53.4 29.8 72.8
GPT on STILTs (3) 47.2 93.1 87.7/83.7 85.3/84.8 70.1/88.1 80.8/80.6 - 69.1 65.1 29.4 76.9
BERTLARGE (4) 60.5 94.9 89.3/85.4 87.6/86.5 72.1/89.3 86.7/85.9 92.7 70.1 65.1 39.6 80.5
MT-DNNno-fine-tune 58.9 94.6 90.1/86.4 89.5/88.8 72.7/89.6 86.5/85.8 93.1 79.1 65.1 39.4 81.7
MT-DNN 62.5 95.6 91.1/88.2 89.5/88.8 72.7/89.6 86.7/86.0 93.1 81.4 65.1 40.3 82.7
Human Performance 66.4 97.8 86.3/80.8 92.7/92.6 59.5/80.4 92.0/92.8 91.2 93.6 95.9 - 87.1

Table 2: GLUE test set results scored using the GLUE evaluation server. The number below each task denotes the number of training examples. The state-of-the-art results are in bold, and the results on par with or surpassing human performance are in bold. MT-DNN uses BERTLARGE to initialize its shared layers. All the results were obtained from https://gluebenchmark.com/leaderboard on February 25, 2019. Model references: (1) Wang et al. (2018); (2) Radford et al. (2018); (3) Phang et al. (2018); (4) Devlin et al. (2018).

Model MNLI-m/mm QQP RTE QNLI (v1/v2) MRPC CoLA SST-2 STS-B
BERTLARGE 86.3/86.2 91.1/88.0 71.1 90.5/92.4 89.5/85.8 61.8 93.5 89.6/89.3
ST-DNN 86.6/86.3 91.3/88.4 72.0 96.1/- 89.7/86.4 - - -
MT-DNN 87.1/86.7 91.9/89.2 83.4 97.4/92.9 91.0/87.5 63.5 94.3 90.7/90.6

Table 3: GLUE dev set results. The best result on each task is in bold. The Single-Task DNN (ST-DNN) uses the same model architecture as MT-DNN, but its shared layers are the pre-trained BERT model without being refined via MTL. We fine-tuned ST-DNN for each GLUE task using task-specific data. There have been two versions of the QNLI dataset; v1 expired on January 30, 2019, and the current version is v2. MT-DNN uses BERTLARGE as its initial shared layers.

4.3 GLUE Main Results

We compare MT-DNN with its variants and a list of state-of-the-art models that have been submitted to the GLUE leaderboard. The results are shown in Tables 2 and 3.

BERTLARGE This is the large BERT model released by the authors, which we used as a baseline. We fine-tuned the model for each GLUE task on task-specific data.

MT-DNN This is the proposed model described in Section 3. We used the pre-trained BERTLARGE to initialize its shared layers, refined the model via MTL on all GLUE tasks, and fine-tuned the model for each GLUE task using task-specific data. The test results in Table 2 show that MT-DNN outperforms all existing systems on all tasks, except WNLI, creating new state-of-the-art results on eight GLUE tasks and pushing the benchmark to 82.7%, which amounts to a 2.2% absolute improvement over BERTLARGE. Since MT-DNN uses BERTLARGE to initialize its shared layers, the gain is mainly attributed to the use of MTL in refining the shared layers. MTL is particularly useful for the tasks with little in-domain training data. As we observe in the table, on the same type of tasks, the improvements over BERT are much more substantial for the tasks with less in-domain training data than for those with more in-domain labels, even though they belong to the same task type, e.g., the two NLI tasks: RTE vs. MNLI, and the two paraphrase tasks: MRPC vs. QQP.

MT-DNNno-fine-tune Since the MTL of MT-DNN uses all GLUE tasks, it is possible to directly apply MT-DNN to each GLUE task without fine-tuning. The results in Table 2 show that MT-DNNno-fine-tune still outperforms BERTLARGE consistently on all tasks but CoLA. Our analysis shows that CoLA is a challenging task with much smaller in-domain data than other tasks, and its task definition and dataset are unique among all GLUE tasks, making it difficult to benefit from the knowledge learned from other tasks. As a result, MTL tends to underfit the CoLA dataset. In such a case, fine-tuning is necessary to boost the performance. As shown in Table 2, the accuracy improves from 58.9% to 62.5% after fine-tuning, even though only a very small amount of in-domain data is available for adaptation. This, together with the fact that the fine-tuned MT-DNN significantly outperforms the fine-tuned BERTLARGE on CoLA (62.5% vs. 60.5%), reveals that the learned MT-DNN representation allows much more effective domain adaptation than the pre-trained BERT representation. We will revisit this topic with more experiments in Section 4.4.

The gain of MT-DNN is also attributed to its flexible modeling framework, which allows us to incorporate the task-specific model structures and training methods that have been developed in the single-task setting, effectively leveraging the existing body of research. Two such examples are the use of the SAN answer module for the pairwise text classification output module and the pairwise ranking loss for the QNLI task, which by design is a binary classification problem in GLUE. To investigate the relative contributions of these modeling design choices, we implement a variant of MT-DNN as described below.

ST-DNN ST-DNN stands for Single-Task DNN. It uses the same model architecture as MT-DNN, but its shared layers are the pre-trained BERT model without being refined via MTL. We then fine-tuned ST-DNN for each GLUE task using task-specific data. Thus, for pairwise text classification tasks, the only difference between the ST-DNNs and the BERT models is the design of the task-specific output module. The results in Table 3 show that on all four tasks (MNLI, QQP, RTE and MRPC) ST-DNN outperforms BERT, justifying the effectiveness of the SAN answer module. We also compare the results of ST-DNN and BERT on QNLI. While ST-DNN is fine-tuned using the pairwise ranking loss, BERT views QNLI as binary classification and is fine-tuned using the cross-entropy loss. That ST-DNN significantly outperforms BERT clearly demonstrates the importance of problem formulation.

4.4 Domain Adaptation Results on SNLI and SciTail

Figure 2: Domain adaptation results on the SNLI and SciTail development datasets using the shared embeddings generated by MT-DNN and BERT, respectively. Both MT-DNN and BERT are fine-tuned based on the pre-trained BERTBASE. The X-axis indicates the amount of domain-specific labeled samples used for adaptation.
Model 0.1% 1% 10% 100%
SNLI Dataset (Dev Accuracy%)
#Training Data 549 5,493 54,936 549,367
BERT 52.5 78.1 86.7 91.0
MT-DNN 82.1 85.2 88.4 91.5
SciTail Dataset (Dev Accuracy%)
#Training Data 23 235 2,359 23,596
BERT 51.2 82.2 90.5 94.3
MT-DNN 81.9 88.3 91.1 95.7

Table 4: Domain adaptation results on SNLI and SciTail, as shown in Figure 2.

One of the most important criteria for building practical systems is fast adaptation to new tasks and domains. This is because it is prohibitively expensive to collect labeled training data for new domains or tasks. Very often, we only have very small training data or even no training data.

To evaluate the models using the above criterion, we perform domain adaptation experiments on two NLI tasks, SNLI and SciTail, using the following procedure:

1. use the MT-DNN model or BERT as the initial model, including both the BASE and LARGE model settings;

2. create a task-specific model for each new task (SNLI or SciTail) by adapting the trained MT-DNN using task-specific training data;

3. evaluate the models using task-specific test data.

We start with the default training/dev/test sets of these tasks, and randomly sample 0.1%, 1%, 10% and 100% of the training data. As a result, we obtain four sets of training data for SciTail, which respectively include 23, 235, 2.3k and 23.5k training samples. Similarly, we obtain four sets of training data for SNLI, which respectively include 549, 5.5k, 54.9k and 549.3k training samples.
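The subsampling protocol above can be reproduced with a few lines; the training set is treated as a plain list of examples, and the five seeds are an illustrative stand-in for the five random draws described in the next paragraph.

import random


def subsample_splits(train_set, fractions=(0.001, 0.01, 0.1, 1.0), seeds=(1, 2, 3, 4, 5)):
    """Draw the 0.1%/1%/10%/100% training subsets used for domain adaptation."""
    splits = {}
    for frac in fractions:
        n = max(1, int(round(frac * len(train_set))))
        # One subset per seed; downstream dev accuracies are averaged over the seeds.
        splits[frac] = [random.Random(seed).sample(train_set, n) for seed in seeds]
    return splits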
We perform random sampling five times and report the mean over all the runs. Results for different amounts of training data from SNLI and SciTail are reported in Figure 2, with more details provided in Table 4. We observe that MT-DNN outperforms the BERT baseline consistently: the fewer training examples used, the larger the improvement MT-DNN demonstrates over BERT. For example, with only 0.1% (549 samples) of the SNLI training data, MT-DNN achieves 82.1% accuracy while BERT's accuracy is 52.5%; with 1% of the training data, the accuracy from MT-DNN is 85.2% and BERT's is 78.1%. We observe similar results on SciTail. The results indicate that the representations learned by MT-DNN are more consistently effective for domain adaptation than BERT's.

In Table 5, we compare our adapted models, using all in-domain training samples, against several strong baselines, including the best results reported in the leaderboards. We see that MT-DNNLARGE generates new state-of-the-art results on both datasets, pushing the benchmarks to 91.6% on SNLI (1.5% absolute improvement) and 95.0% on SciTail (6.7% absolute improvement), respectively. This establishes the new state of the art for both SNLI and SciTail. All of these results demonstrate the exceptional performance of MT-DNN on domain adaptation.

Model Dev Test
SNLI Dataset (Accuracy%)
GPT (Radford et al., 2018) - 89.9
Kim et al. (2018)* - 90.1
BERTBASE 91.0 90.8
MT-DNNBASE 91.5 91.1
BERTLARGE 91.7 91.0
MT-DNNLARGE 92.2 91.6
SciTail Dataset (Accuracy%)
GPT (Radford et al., 2018)* - 88.3
BERTBASE 94.3 92.0
MT-DNNBASE 95.7 94.1
BERTLARGE 95.7 94.4
MT-DNNLARGE 96.3 95.0

Table 5: Results on the SNLI and SciTail datasets. Previous state-of-the-art results are marked by *, obtained from the official SNLI leaderboard (https://nlp.stanford.edu/projects/snli/) and the official SciTail leaderboard maintained by AI2 (https://leaderboard.allenai.org/scitail).
5 Conclusion

In this work we proposed a model called MT-DNN to combine multi-task learning and language model pre-training for language representation learning. MT-DNN obtains new state-of-the-art results on ten NLU tasks across three popular benchmarks: SNLI, SciTail, and GLUE. MT-DNN also demonstrates an exceptional generalization capability in domain adaptation experiments.

There are many future areas to explore to improve MT-DNN, including a deeper understanding of model structure sharing in MTL, a more effective training method that leverages relatedness among multiple tasks, for both fine-tuning and pre-training (Dong et al., 2019), and ways of incorporating the linguistic structure of text in a more explicit and controllable manner. Finally, we would also like to verify whether MT-DNN is resilient against adversarial attacks (Glockner et al., 2018; Talman and Chatzikyriakidis, 2018; Liu et al., 2019).

Acknowledgments

We would like to thank Jade Huang from Microsoft for her generous help on this work.

References

Samuel R. Bowman, Gabor Angeli, Christopher Potts, and Christopher D. Manning. 2015a. A large annotated corpus for learning natural language inference. In Proceedings of the 2015 Conference on Empirical Methods in Natural Language Processing, pages 632–642.

Samuel R. Bowman, Gabor Angeli, Christopher Potts, and Christopher D. Manning. 2015b. A large annotated corpus for learning natural language inference. In Proceedings of the 2015 Conference on Empirical Methods in Natural Language Processing (EMNLP). Association for Computational Linguistics.

Chris Burges, Tal Shaked, Erin Renshaw, Ari Lazier, Matt Deeds, Nicole Hamilton, and Greg Hullender. 2005. Learning to rank using gradient descent. In Proceedings of the 22nd International Conference on Machine Learning, pages 89–96. ACM.

Rich Caruana. 1997. Multitask learning. Machine Learning, 28(1):41–75.

Ronan Collobert, Jason Weston, Léon Bottou, Michael Karlen, Koray Kavukcuoglu, and Pavel Kuksa. 2011. Natural language processing (almost) from scratch. Journal of Machine Learning Research, 12(Aug):2493–2537.

Jacob Devlin, Ming-Wei Chang, Kenton Lee, and Kristina Toutanova. 2018. BERT: Pre-training of deep bidirectional transformers for language understanding. arXiv preprint arXiv:1810.04805.

Li Dong, Nan Yang, Wenhui Wang, Furu Wei, Xiaodong Liu, Yu Wang, Jianfeng Gao, Ming Zhou, and Hsiao-Wuen Hon. 2019. Unified language model pre-training for natural language understanding and generation. arXiv preprint arXiv:1905.03197.

J. Gao, M. Galley, and L. Li. 2018. Neural approaches to conversational AI. CoRR, abs/1809.08267.

Max Glockner, Vered Shwartz, and Yoav Goldberg. 2018. Breaking NLI systems with sentences that require simple lexical inferences. In The 56th Annual Meeting of the Association for Computational Linguistics (ACL), Melbourne, Australia.

Han Guo, Ramakanth Pasunuru, and Mohit Bansal. 2018. Soft layer-specific multi-task summarization with entailment and question generation. In Proceedings of the 56th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 687–697.

Po-Sen Huang, Xiaodong He, Jianfeng Gao, Li Deng, Alex Acero, and Larry Heck. 2013. Learning deep structured semantic models for web search using clickthrough data. In Proceedings of the 22nd ACM International Conference on Information & Knowledge Management, pages 2333–2338. ACM.

Tushar Khot, Ashish Sabharwal, and Peter Clark. 2018. SciTail: A textual entailment dataset from science question answering. In AAAI.

Seonhoon Kim, Jin-Hyuk Hong, Inho Kang, and Nojun Kwak. 2018. Semantic sentence matching with densely-connected recurrent and co-attentive information. arXiv preprint arXiv:1805.11360.

Diederik Kingma and Jimmy Ba. 2014. Adam: A method for stochastic optimization. arXiv preprint arXiv:1412.6980.

Xiaodong Liu, Kevin Duh, and Jianfeng Gao. 2018a. Stochastic answer networks for natural language inference. arXiv preprint arXiv:1804.07888.

Xiaodong Liu, Jianfeng Gao, Xiaodong He, Li Deng, Kevin Duh, and Ye-Yi Wang. 2015. Representation learning using multi-task deep neural networks for semantic classification and information retrieval. In Proceedings of the 2015 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, pages 912–921.

Xiaodong Liu, Pengcheng He, Weizhu Chen, and Jianfeng Gao. 2019. Improving multi-task deep neural networks via knowledge distillation for natural language understanding. arXiv preprint arXiv:1904.09482.

Xiaodong Liu, Yelong Shen, Kevin Duh, and Jianfeng Gao. 2018b. Stochastic answer networks for machine reading comprehension. In Proceedings of the 56th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers). Association for Computational Linguistics.

Minh-Thang Luong, Quoc V. Le, Ilya Sutskever, Oriol Vinyals, and Lukasz Kaiser. 2015. Multi-task sequence to sequence learning. arXiv preprint arXiv:1511.06114.
Matthew E Peters, Mark Neumann, Mohit Iyyer, Matt
Gardner, Christopher Clark, Kenton Lee, and Luke
Zettlemoyer. 2018. Deep contextualized word rep-
resentations. arXiv preprint arXiv:1802.05365.
Jason Phang, Thibault Févry, and Samuel R Bowman.
2018. Sentence encoders on stilts: Supplementary
training on intermediate labeled-data tasks. arXiv
preprint arXiv:1811.01088.
Alec Radford, Karthik Narasimhan, Tim Salimans, and
Ilya Sutskever. 2018. Improving language under-
standing by generative pre-training.
Pranav Rajpurkar, Jian Zhang, Konstantin Lopyrev, and Percy Liang. 2016. SQuAD: 100,000+ questions for machine comprehension of text. In Proceedings of the 2016 Conference on Empirical Methods in Natural Language Processing, pages 2383–2392.
Sebastian Ruder, Joachim Bingel, Isabelle Augenstein, and Anders Søgaard. 2019. Latent multi-task architecture learning.
Aarne Talman and Stergios Chatzikyriakidis. 2018. Testing the generalization power of neural network models across NLI benchmarks. arXiv preprint arXiv:1810.09774.
Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob
Uszkoreit, Llion Jones, Aidan N Gomez, Lukasz
Kaiser, and Illia Polosukhin. 2017. Attention is all
you need. arXiv preprint arXiv:1706.03762.
Alex Wang, Amanpreet Singh, Julian Michael, Felix Hill, Omer Levy, and Samuel R. Bowman. 2018. GLUE: A multi-task benchmark and analysis platform for natural language understanding. arXiv preprint arXiv:1804.07461.
Yichong Xu, Xiaodong Liu, Yelong Shen, Jingjing
Liu, and Jianfeng Gao. 2018. Multi-task learning
for machine reading comprehension. arXiv preprint
arXiv:1809.06963.
Yu Zhang and Qiang Yang. 2017. A survey on multi-
task learning. arXiv preprint arXiv:1707.08114.
