29 - Multi-Task Deep Neural Networks For Natural Language Understanding
erates a sequence of contextual embeddings in $l_2$. This is the shared semantic representation that is trained by our multi-task objectives. In what follows, we elaborate on the model in detail.

Lexicon Encoder ($l_1$): The input $X = \{x_1, ..., x_m\}$ is a sequence of tokens of length $m$. Following Devlin et al. (2018), the first token $x_1$ is always the [CLS] token. If $X$ is packed by a sentence pair $(X_1, X_2)$, we separate the two sentences with a special token [SEP]. The lexicon encoder maps $X$ into a sequence of input embedding vectors, one for each token, constructed by summing the corresponding word, segment, and positional embeddings.

Transformer Encoder ($l_2$): We use a multi-layer bidirectional Transformer encoder (Vaswani et al., 2017) to map the input representation vectors ($l_1$) into a sequence of contextual embedding vectors $C \in \mathbb{R}^{d \times m}$. This is the shared representation across different tasks. Unlike the BERT model (Devlin et al., 2018) that learns the representation via pre-training, MT-DNN learns the representation using multi-task objectives, in addition to pre-training.

Below, we will describe the task-specific layers using the NLU tasks in GLUE as examples, although in practice we can incorporate arbitrary natural language tasks such as text generation where the output layers are implemented as a neural decoder.

Single-Sentence Classification Output: Suppose that $x$ is the contextual embedding ($l_2$) of the token [CLS], which can be viewed as the semantic representation of input sentence $X$. Take the SST-2 task as an example. The probability that $X$ is labeled as class $c$ (i.e., the sentiment) is predicted by a logistic regression with softmax:

$$P_r(c \mid X) = \mathrm{softmax}(W_{SST}^{\top} \cdot x), \tag{1}$$

where $W_{SST}$ is the task-specific parameter matrix.

Text Similarity Output: Take the STS-B task as an example. Suppose that $x$ is the contextual embedding ($l_2$) of [CLS], which can be viewed as the semantic representation of the input sentence pair $(X_1, X_2)$. We introduce a task-specific parameter vector $w_{STS}$ to compute the similarity score as:

$$\mathrm{Sim}(X_1, X_2) = w_{STS}^{\top} \cdot x, \tag{2}$$
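To make the two output layers above concrete, the following is a minimal sketch of Equations 1 and 2 in plain Python. It is an illustration under stated assumptions, not the authors' implementation: the function names (`classify`, `similarity`), the toy dimension d = 3, and all numeric values are hypothetical, and the [CLS] embedding `x` is simply given rather than produced by a Transformer encoder.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def classify(W, x):
    """Equation 1: softmax(W^T x), with W stored as d rows of C columns."""
    num_classes = len(W[0])
    logits = [sum(W[i][c] * x[i] for i in range(len(x)))
              for c in range(num_classes)]
    return softmax(logits)


def similarity(w, x):
    """Equation 2: w^T x, an unbounded real-valued score."""
    return sum(wi * xi for wi, xi in zip(w, x))


# Hypothetical toy values: d = 3 and two sentiment classes.
x = [0.5, -1.0, 2.0]                            # stand-in for the [CLS] embedding
W_sst = [[0.1, -0.2], [0.4, 0.3], [-0.5, 0.6]]  # task-specific matrix, d x C
w_sts = [1.0, 0.5, -0.25]                       # task-specific vector, length d

probs = classify(W_sst, x)     # a distribution over the two classes
score = similarity(w_sts, x)   # a single real number
```

Both heads read the same shared vector `x`; only `W_sst` and `w_sts` are task-specific, which is the point of sharing the lower layers across tasks.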
In Equation 2, $\mathrm{Sim}(X_1, X_2)$ is a real value in the range $(-\infty, \infty)$.

Pairwise Text Classification Output: Take natural language inference (NLI) as an example. The NLI task defined here involves a premise $P = (p_1, ..., p_m)$ of $m$ words and a hypothesis $H = (h_1, ..., h_n)$ of $n$ words, and aims to find a logical relationship $R$ between $P$ and $H$. The design of the output module follows the answer module of the stochastic answer network (SAN) (Liu et al., 2018a), a state-of-the-art neural NLI model. SAN's answer module uses multi-step reasoning. Rather than directly predicting the entailment given the input, it maintains a state and iteratively refines its predictions.

The SAN answer module works as follows. We first construct the working memory of premise $P$ by concatenating the contextual embeddings of the words in $P$, which are the output of the Transformer encoder, denoted as $M^p \in \mathbb{R}^{d \times m}$, and similarly the working memory of hypothesis $H$, denoted as $M^h \in \mathbb{R}^{d \times n}$. Then, we perform $K$-step reasoning on the memory to output the relation label, where $K$ is a hyperparameter. At the beginning, the initial state $s^0$ is the summary of $M^h$:

$$s^0 = \sum_j \alpha_j M^h_j, \quad \text{where} \quad \alpha_j = \frac{\exp(w_1^{\top} \cdot M^h_j)}{\sum_i \exp(w_1^{\top} \cdot M^h_i)}.$$

At time step $k$ in the range of $\{1, 2, ..., K-1\}$, the state is defined by $s^k = \mathrm{GRU}(s^{k-1}, x^k)$. Here, $x^k$ is computed from the previous state $s^{k-1}$ and memory $M^p$: $x^k = \sum_j \beta_j M^p_j$ and $\beta_j = \mathrm{softmax}(s^{k-1} W_2^{\top} M^p)$. A one-layer classifier is used to determine the relation at each step $k$:

$$P_r^k = \mathrm{softmax}(W_3^{\top} [s^k; x^k; |s^k - x^k|; s^k \cdot x^k]). \tag{3}$$

Finally, we utilize all of the $K$ outputs by averaging the scores:

$$P_r = \mathrm{avg}([P_r^0, P_r^1, ..., P_r^{K-1}]). \tag{4}$$

[...] answer $(Q, A)$. We compute the relevance score as:

$$\mathrm{Rel}(Q, A) = g(w_{QNLI}^{\top} \cdot x), \tag{5}$$

For a given $Q$, we rank all of its candidate answers based on their relevance scores computed using Equation 5.

3.1 The Training Procedure

The training procedure of MT-DNN consists of two stages: pretraining and multi-task learning. The pretraining stage follows that of the BERT model (Devlin et al., 2018). The parameters of the lexicon encoder and Transformer encoder are learned using two unsupervised prediction tasks: masked language modeling and next sentence prediction.

In the multi-task learning stage, we use mini-batch based stochastic gradient descent (SGD) to learn the parameters of our model (i.e., the parameters of all shared layers and task-specific layers) as shown in Algorithm 1. In each epoch, a mini-batch $b_t$ is selected (e.g., among all 9 GLUE tasks), and the model is updated according to the task-specific objective for the task $t$. This approximately optimizes the sum of all multi-task objectives.

For the classification tasks (i.e., single-sentence or pairwise text classification), we use the cross-entropy loss as the objective:

$$-\sum_c \mathbb{1}(X, c) \log(P_r(c \mid X)), \tag{6}$$

where $\mathbb{1}(X, c)$ is the binary indicator (0 or 1) of whether class label $c$ is the correct classification for $X$, and $P_r(\cdot)$ is defined by, e.g., Equation 1 or 4.

For the text similarity tasks, such as STS-B, where each sentence pair is annotated with a real-valued score $y$, we use the mean squared error as the objective.
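The multi-task stage described above can be sketched as follows. This is a rough illustration rather than the authors' Algorithm 1, which is not reproduced in this text: the merge-and-shuffle scheduling, the function names, the toy task sizes, and the batch size are all assumptions made for the example, and the model update itself is left as a comment.

```python
import math
import random


def cross_entropy(probs, gold):
    """Equation 6 for one example: only the gold class has indicator 1."""
    return -math.log(probs[gold])


def squared_error(y, sim):
    """Squared error between a gold score y and a predicted similarity."""
    return (y - sim) ** 2


def make_batches(examples, batch_size):
    """Split one task's examples into consecutive mini-batches."""
    return [examples[i:i + batch_size]
            for i in range(0, len(examples), batch_size)]


def multitask_schedule(task_data, batch_size, seed=0):
    """Assumed scheduling: batch each task, merge all batches, shuffle.

    Every returned (task, batch) pair holds data from a single task, so
    each update step applies exactly one task-specific objective.
    """
    merged = [(task, batch)
              for task, examples in task_data.items()
              for batch in make_batches(examples, batch_size)]
    random.Random(seed).shuffle(merged)
    return merged


# Hypothetical toy datasets; integers stand in for training examples.
task_data = {
    "SST-2": list(range(10)),  # classification, cross-entropy
    "STS-B": list(range(4)),   # regression, squared error
    "RTE": list(range(6)),     # classification, cross-entropy
}
schedule = multitask_schedule(task_data, batch_size=4)

updates_per_task = {}
for task, batch in schedule:
    # A real loop would run the shared encoder, apply the head for this
    # task, compute its loss, and take one SGD step here.
    updates_per_task[task] = updates_per_task.get(task, 0) + 1
```

Because one epoch visits every mini-batch of every task once, larger tasks contribute more updates, which is one way to read the statement that the procedure approximately optimizes the sum of the task objectives.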
Model                               | CoLA | SST-2 | MRPC      | STS-B     | QQP       | MNLI-m/mm | QNLI | RTE  | WNLI | AX   | Score
Training examples                   | 8.5k | 67k   | 3.7k      | 7k        | 364k      | 393k      | 108k | 2.5k | 634  |      |
BiLSTM+ELMo+Attn ^1                 | 36.0 | 90.4  | 84.9/77.9 | 75.1/73.3 | 64.8/84.7 | 76.4/76.1 | -    | 56.8 | 65.1 | 26.5 | 70.5
Singletask Pretrain Transformer ^2  | 45.4 | 91.3  | 82.3/75.7 | 82.0/80.0 | 70.3/88.5 | 82.1/81.4 | -    | 56.0 | 53.4 | 29.8 | 72.8
GPT on STILTs ^3                    | 47.2 | 93.1  | 87.7/83.7 | 85.3/84.8 | 70.1/88.1 | 80.8/80.6 | -    | 69.1 | 65.1 | 29.4 | 76.9
BERT_LARGE ^4                       | 60.5 | 94.9  | 89.3/85.4 | 87.6/86.5 | 72.1/89.3 | 86.7/85.9 | 92.7 | 70.1 | 65.1 | 39.6 | 80.5
MT-DNN_no-fine-tune                 | 58.9 | 94.6  | 90.1/86.4 | 89.5/88.8 | 72.7/89.6 | 86.5/85.8 | 93.1 | 79.1 | 65.1 | 39.4 | 81.7
MT-DNN                              | 62.5 | 95.6  | 91.1/88.2 | 89.5/88.8 | 72.7/89.6 | 86.7/86.0 | 93.1 | 81.4 | 65.1 | 40.3 | 82.7
Human Performance                   | 66.4 | 97.8  | 86.3/80.8 | 92.7/92.6 | 59.5/80.4 | 92.0/92.8 | 91.2 | 93.6 | 95.9 | -    | 87.1

Table 2: GLUE test set results scored using the GLUE evaluation server. The number below each task denotes the number of training examples. The state-of-the-art results are in bold, and the results on par with or surpassing human performance are in bold. MT-DNN uses BERT_LARGE to initialize its shared layers. All the results are obtained from https://gluebenchmark.com/leaderboard on February 25, 2019. Model references: 1: (Wang et al., 2018); 2: (Radford et al., 2018); 3: (Phang et al., 2018); 4: (Devlin et al., 2018).
Model      | MNLI-m/mm | QQP       | RTE  | QNLI (v1/v2) | MRPC      | CoLA | SST-2 | STS-B
BERT_LARGE | 86.3/86.2 | 91.1/88.0 | 71.1 | 90.5/92.4    | 89.5/85.8 | 61.8 | 93.5  | 89.6/89.3
ST-DNN     | 86.6/86.3 | 91.3/88.4 | 72.0 | 96.1/-       | 89.7/86.4 | -    | -     | -
MT-DNN     | 87.1/86.7 | 91.9/89.2 | 83.4 | 97.4/92.9    | 91.0/87.5 | 63.5 | 94.3  | 90.7/90.6

Table 3: GLUE dev set results. The best result on each task is in bold. The Single-Task DNN (ST-DNN) uses the same model architecture as MT-DNN, but its shared layers are the pre-trained BERT model without being refined via MTL. We fine-tuned ST-DNN for each GLUE task using task-specific data. There have been two versions of the QNLI dataset. V1 expired on January 30, 2019. The current version is v2. MT-DNN uses BERT_LARGE to initialize its shared layers.
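The test-set numbers in Table 2 can be used to check two claims made in this section: the 2.2-point overall gain over BERT_LARGE, and the larger gains on tasks with less training data (RTE vs. MNLI, MRPC vs. QQP). The short sketch below does only that arithmetic. The scores and training-set sizes are copied from Table 2, taking the first number of each reported pair; the variable names are illustrative.

```python
# Scores copied from Table 2 (first number of each reported pair).
bert_large = {"RTE": 70.1, "MNLI-m": 86.7, "MRPC": 89.3, "QQP": 72.1}
mt_dnn = {"RTE": 81.4, "MNLI-m": 86.7, "MRPC": 91.1, "QQP": 72.7}

# Training-set sizes copied from the second row of Table 2.
train_size = {"RTE": 2_500, "MNLI-m": 393_000, "MRPC": 3_700, "QQP": 364_000}

# Absolute gain of MT-DNN over BERT_LARGE, rounded to one decimal place.
gains = {task: round(mt_dnn[task] - bert_large[task], 1) for task in mt_dnn}

# Overall GLUE score: 82.7 for MT-DNN vs. 80.5 for BERT_LARGE.
overall_gain = round(82.7 - 80.5, 1)

# List tasks from the smallest training set to the largest.
for task in sorted(gains, key=train_size.get):
    print(f"{task:7s} train={train_size[task]:7d} gain={gains[task]:+.1f}")
```

Within each task type the smaller dataset shows the larger gain: RTE gains 11.3 points while MNLI-m is unchanged, and MRPC gains 1.8 points against 0.6 for QQP.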
[...] to the GLUE leaderboard. The results are shown in Tables 2 and 3.

BERT_LARGE: This is the large BERT model released by the authors, which we used as a baseline. We fine-tuned the model for each GLUE task on task-specific data.

MT-DNN: This is the proposed model described in Section 3. We used the pre-trained BERT_LARGE to initialize its shared layers, refined the model via MTL on all GLUE tasks, and fine-tuned the model for each GLUE task using task-specific data. The test results in Table 2 show that MT-DNN outperforms all existing systems on all tasks, except WNLI, creating new state-of-the-art results on eight GLUE tasks and pushing the benchmark to 82.7%, which amounts to a 2.2% absolute improvement over BERT_LARGE. Since MT-DNN uses BERT_LARGE to initialize its shared layers, the gain is mainly attributed to the use of MTL in refining the shared layers. MTL is particularly useful for the tasks with little in-domain training data. As we observe in the table, the improvements over BERT are much more substantial for the tasks with less in-domain training data than for those with more in-domain labels, even though they belong to the same task type, e.g., the two NLI tasks: RTE vs. MNLI, and the two paraphrase tasks: MRPC vs. QQP.

MT-DNN_no-fine-tune: Since the MTL of MT-DNN uses all GLUE tasks, it is possible to directly apply MT-DNN to each GLUE task without fine-tuning. The results in Table 2 show that MT-DNN_no-fine-tune still outperforms BERT_LARGE consistently among all tasks but CoLA. Our analysis shows that CoLA is a challenging task with much smaller in-domain data than other tasks, and its task definition and dataset are unique among all GLUE tasks, making it difficult to benefit from the knowledge learned from other tasks. As a result, MTL tends to underfit the CoLA dataset. In such a case, fine-tuning is necessary to boost the performance. As shown in Table 2, the accuracy improves from 58.9% to 62.5% after fine-tuning, even though only a very small amount of in-domain data is available for adaptation. This, together with the fact that the fine-tuned MT-DNN significantly outperforms the fine-tuned BERT_LARGE on CoLA (62.5% vs. 60.5%), reveals that the learned MT-DNN representation allows much more effective domain adaptation than the pre-trained BERT representation. We will revisit this topic with more experiments in Section 4.4.

The gain of MT-DNN is also attributed to its flexible modeling framework, which allows us to incorporate the task-specific model structures and training methods that have been developed in the single-task setting, effectively leveraging the existing body of research. Two such examples are the use of the SAN answer module for the pairwise text classification output module and the pairwise ranking loss for the QNLI task, which by design is a binary classification problem in GLUE. To investigate the relative contributions of these modeling design choices, we implement a variant of MT-DNN as described below.

ST-DNN: ST-DNN stands for Single-Task DNN. It uses the same model architecture as MT-DNN, but its shared layers are the pre-trained BERT model without being refined via MTL. We then fine-tuned ST-DNN for each GLUE task using task-specific data. Thus, for pairwise text classification tasks, the only difference between their ST-DNNs and BERT models is the design of the task-specific output module. The results in Table 3 show that on all four tasks (MNLI, QQP, RTE and MRPC) ST-DNN outperforms BERT, justifying the effectiveness of the SAN answer module. We also compare the results of ST-DNN and BERT on QNLI. While ST-DNN is fine-tuned using the pairwise ranking loss, BERT views QNLI as binary classification and is fine-tuned using the cross-entropy loss. That ST-DNN significantly outperforms BERT clearly demonstrates the importance of problem formulation.

4.4 Domain Adaptation Results on SNLI and SciTail

Figure 2: Domain adaptation results on SNLI and SciTail development datasets using the shared embeddings generated by MT-DNN and BERT, respectively. Both MT-DNN and BERT are fine-tuned based on the pre-trained BERT_BASE. The X-axis indicates the amount of domain-specific labeled samples used for adaptation.
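The x-axis of Figure 2 corresponds to random subsets of the in-domain training data (0.1%, 1%, 10% and 100%, each sampled five times, as described in this section). The following is a minimal sketch of that sampling protocol, not the authors' code: `subset_size`, `sample_subset` and `mean_over_runs` are hypothetical helpers, `evaluate` is a placeholder for "adapt the model on this subset and return dev accuracy", and truncating to an integer is an assumption that happens to reproduce the training-set counts listed in Table 4.

```python
import random
from statistics import mean

FRACTIONS = [0.001, 0.01, 0.1, 1.0]  # 0.1%, 1%, 10%, 100%


def subset_size(n_train, fraction):
    """Number of examples kept; truncation is an assumption."""
    return int(n_train * fraction)


def sample_subset(examples, fraction, seed):
    """Draw one random subset without replacement."""
    rng = random.Random(seed)
    return rng.sample(examples, subset_size(len(examples), fraction))


def mean_over_runs(examples, fraction, evaluate, runs=5):
    """Average a score over independent random subsets (five by default)."""
    return mean(evaluate(sample_subset(examples, fraction, seed))
                for seed in range(runs))


# Full training-set sizes as listed in Table 4.
snli_sizes = [subset_size(549_367, f) for f in FRACTIONS]
scitail_sizes = [subset_size(23_596, f) for f in FRACTIONS]

# Toy usage: `len` stands in for a real train-and-evaluate function.
toy_score = mean_over_runs(list(range(1000)), 0.1, evaluate=len)
```

Averaging over several random subsets matters most at the 0.1% setting, where a few dozen or a few hundred examples can vary considerably from one draw to the next.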
One of the most important criteria for building practical systems is fast adaptation to new tasks and domains. This is because it is prohibitively expensive to collect labeled training data for new domains or tasks. Very often, we only have very small training data or even no training data.

To evaluate the models using the above criterion, we perform domain adaptation experiments on two NLI tasks, SNLI and SciTail, using the following procedure:

1. use the MT-DNN model or the BERT model as the initial model, including both BASE and LARGE model settings;

2. create for each new task (SNLI or SciTail) a task-specific model, by adapting the trained MT-DNN using task-specific training data;

3. evaluate the models using task-specific test data.

We start with the default training/dev/test sets of these tasks, but we randomly sample 0.1%, 1%, 10% and 100% of their training data. As a result, we obtain four sets of training data for SciTail, which respectively include 23, 235, 2.3k and 23.5k training samples. Similarly, we obtain four sets of training data for SNLI, which respectively include 549, 5.5k, 54.9k and 549.3k training samples.

We perform random sampling five times and report the mean among all the runs. Results on different amounts of training data from SNLI and SciTail are reported in Figure 2. We observe that MT-DNN outperforms the BERT baseline consistently, with more details provided in Table 4. The fewer training examples used, the larger the improvement MT-DNN demonstrates over BERT. For example, with only 0.1% (549 samples) of the SNLI training data, MT-DNN achieves 82.1% in accuracy while BERT's accuracy is 52.5%; with 1% of the training data, the accuracy of MT-DNN is 85.2% and that of BERT is 78.1%. We observe similar results on SciTail. The results indicate that the representations learned by MT-DNN are more consistently effective for domain adaptation than those learned by BERT.

Model          | 0.1% | 1%    | 10%    | 100%
SNLI Dataset (Dev Accuracy%)
#Training Data | 549  | 5,493 | 54,936 | 549,367
BERT           | 52.5 | 78.1  | 86.7   | 91.0
MT-DNN         | 82.1 | 85.2  | 88.4   | 91.5
SciTail Dataset (Dev Accuracy%)
#Training Data | 23   | 235   | 2,359  | 23,596
BERT           | 51.2 | 82.2  | 90.5   | 94.3
MT-DNN         | 81.9 | 88.3  | 91.1   | 95.7

Table 4: Domain adaptation results on SNLI and SciTail, as shown in Figure 2.

In Table 5, we compare our adapted models, using all in-domain training samples, against several strong baselines including the best results reported in the leaderboards. We see that MT-DNN_LARGE generates new state-of-the-art results on both datasets, pushing the benchmarks to 91.6% on SNLI (1.5% absolute improvement) and 95.0% on SciTail (6.7% absolute improvement), respectively. All of these demonstrate the exceptional performance of MT-DNN on domain adaptation.

Model                       | Dev  | Test
SNLI Dataset (Accuracy%)
GPT (Radford et al., 2018)  | -    | 89.9
Kim et al. (2018)*          | -    | 90.1
BERT_BASE                   | 91.0 | 90.8
MT-DNN_BASE                 | 91.5 | 91.1
BERT_LARGE                  | 91.7 | 91.0
MT-DNN_LARGE                | 92.2 | 91.6
SciTail Dataset (Accuracy%)
GPT (Radford et al., 2018)* | -    | 88.3
BERT_BASE                   | 94.3 | 92.0
MT-DNN_BASE                 | 95.7 | 94.1
BERT_LARGE                  | 95.7 | 94.4
MT-DNN_LARGE                | 96.3 | 95.0

Table 5: Results on the SNLI and SciTail datasets. Previous state-of-the-art results are marked by *, obtained from the official SNLI leaderboard (https://nlp.stanford.edu/projects/snli/) and the official SciTail leaderboard maintained by AI2 (https://leaderboard.allenai.org/scitail).

5 Conclusion

In this work we proposed a model called MT-DNN to combine multi-task learning and language model pre-training for language representation learning. MT-DNN obtains new state-of-the-art results on ten NLU tasks across three popular benchmarks: SNLI, SciTail, and GLUE. MT-DNN also demonstrates an exceptional generalization capability in domain adaptation experiments.
There are many future areas to explore to improve MT-DNN, including a deeper understanding of model structure sharing in MTL, a more effective training method that leverages relatedness among multiple tasks, for both fine-tuning and pre-training (Dong et al., 2019), and ways of incorporating the linguistic structure of text in a more explicit and controllable manner. Finally, we would also like to verify whether MT-DNN is resilient against adversarial attacks (Glockner et al., 2018; Talman and Chatzikyriakidis, 2018; Liu et al., 2019).

Acknowledgments

We would like to thank Jade Huang from Microsoft for her generous help on this work.

References

Samuel R Bowman, Gabor Angeli, Christopher Potts, and Christopher D Manning. 2015a. A large annotated corpus for learning natural language inference. In Proceedings of the 2015 Conference on Empirical Methods in Natural Language Processing, pages 632–642.

Samuel R. Bowman, Gabor Angeli, Christopher Potts, and Christopher D. Manning. 2015b. A large annotated corpus for learning natural language inference. In Proceedings of the 2015 Conference on Empirical Methods in Natural Language Processing (EMNLP). Association for Computational Linguistics.

Chris Burges, Tal Shaked, Erin Renshaw, Ari Lazier, Matt Deeds, Nicole Hamilton, and Greg Hullender. 2005. Learning to rank using gradient descent. In Proceedings of the 22nd international conference on Machine learning, pages 89–96. ACM.

Rich Caruana. 1997. Multitask learning. Machine learning, 28(1):41–75.

Ronan Collobert, Jason Weston, Léon Bottou, Michael Karlen, Koray Kavukcuoglu, and Pavel Kuksa. 2011. Natural language processing (almost) from scratch. Journal of Machine Learning Research, 12(Aug):2493–2537.

Jacob Devlin, Ming-Wei Chang, Kenton Lee, and Kristina Toutanova. 2018. BERT: Pre-training of deep bidirectional transformers for language understanding. arXiv preprint arXiv:1810.04805.

Li Dong, Nan Yang, Wenhui Wang, Furu Wei, Xiaodong Liu, Yu Wang, Jianfeng Gao, Ming Zhou, and Hsiao-Wuen Hon. 2019. Unified language model pre-training for natural language understanding and generation. arXiv preprint arXiv:1905.03197.

J. Gao, M. Galley, and L. Li. 2018. Neural approaches to conversational AI. CoRR, abs/1809.08267.

Max Glockner, Vered Shwartz, and Yoav Goldberg. 2018. Breaking NLI systems with sentences that require simple lexical inferences. In The 56th Annual Meeting of the Association for Computational Linguistics (ACL), Melbourne, Australia.

Han Guo, Ramakanth Pasunuru, and Mohit Bansal. 2018. Soft layer-specific multi-task summarization with entailment and question generation. In Proceedings of the 56th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 687–697.

Po-Sen Huang, Xiaodong He, Jianfeng Gao, Li Deng, Alex Acero, and Larry Heck. 2013. Learning deep structured semantic models for web search using clickthrough data. In Proceedings of the 22nd ACM international conference on Conference on information & knowledge management, pages 2333–2338. ACM.

Tushar Khot, Ashish Sabharwal, and Peter Clark. 2018. SciTail: A textual entailment dataset from science question answering. In AAAI.

Seonhoon Kim, Jin-Hyuk Hong, Inho Kang, and Nojun Kwak. 2018. Semantic sentence matching with densely-connected recurrent and co-attentive information. arXiv preprint arXiv:1805.11360.

Diederik Kingma and Jimmy Ba. 2014. Adam: A method for stochastic optimization. arXiv preprint arXiv:1412.6980.

Xiaodong Liu, Kevin Duh, and Jianfeng Gao. 2018a. Stochastic answer networks for natural language inference. arXiv preprint arXiv:1804.07888.

Xiaodong Liu, Jianfeng Gao, Xiaodong He, Li Deng, Kevin Duh, and Ye-Yi Wang. 2015. Representation learning using multi-task deep neural networks for semantic classification and information retrieval. In Proceedings of the 2015 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, pages 912–921.

Xiaodong Liu, Pengcheng He, Weizhu Chen, and Jianfeng Gao. 2019. Improving multi-task deep neural networks via knowledge distillation for natural language understanding. arXiv preprint arXiv:1904.09482.

Xiaodong Liu, Yelong Shen, Kevin Duh, and Jianfeng Gao. 2018b. Stochastic answer networks for machine reading comprehension. In Proceedings of the 56th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers). Association for Computational Linguistics.

Minh-Thang Luong, Quoc V Le, Ilya Sutskever, Oriol Vinyals, and Lukasz Kaiser. 2015. Multi-task sequence to sequence learning. arXiv preprint arXiv:1511.06114.

Matthew E Peters, Mark Neumann, Mohit Iyyer, Matt Gardner, Christopher Clark, Kenton Lee, and Luke Zettlemoyer. 2018. Deep contextualized word representations. arXiv preprint arXiv:1802.05365.

Jason Phang, Thibault Févry, and Samuel R Bowman. 2018. Sentence encoders on STILTs: Supplementary training on intermediate labeled-data tasks. arXiv preprint arXiv:1811.01088.

Alec Radford, Karthik Narasimhan, Tim Salimans, and Ilya Sutskever. 2018. Improving language understanding by generative pre-training.

Pranav Rajpurkar, Jian Zhang, Konstantin Lopyrev, and Percy Liang. 2016. SQuAD: 100,000+ questions for machine comprehension of text. pages 2383–2392.

Sebastian Ruder, Joachim Bingel, Isabelle Augenstein, and Anders Søgaard. 2019. Latent multi-task architecture learning.

Aarne Talman and Stergios Chatzikyriakidis. 2018. Testing the generalization power of neural network models across NLI benchmarks. arXiv preprint arXiv:1810.09774.

Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N Gomez, Lukasz Kaiser, and Illia Polosukhin. 2017. Attention is all you need. arXiv preprint arXiv:1706.03762.

Alex Wang, Amanpreet Singh, Julian Michael, Felix Hill, Omer Levy, and Samuel R Bowman. 2018. GLUE: A multi-task benchmark and analysis platform for natural language understanding. arXiv preprint arXiv:1804.07461.

Yichong Xu, Xiaodong Liu, Yelong Shen, Jingjing Liu, and Jianfeng Gao. 2018. Multi-task learning for machine reading comprehension. arXiv preprint arXiv:1809.06963.

Yu Zhang and Qiang Yang. 2017. A survey on multi-task learning. arXiv preprint arXiv:1707.08114.