BERT For Joint Intent Classification and Slot Filling
Table 2: NLU performance on Snips and ATIS datasets. The metrics are intent classification accuracy, slot filling
F1, and sentence-level semantic frame accuracy (%). The results for the first group of models are cited from Goo
et al. (2018).
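The three metrics named in the caption can be illustrated with a small sketch. It assumes BIO-style slot tags and span-level exact-match F1, which the caption itself does not specify; stray `I-` tags are simply ignored, which is a simplification. All function names and the toy data are hypothetical and chosen only for illustration.

```python
def extract_spans(tags):
    """Turn a BIO tag sequence into a set of (label, start, end) spans."""
    spans, start, label = set(), None, None
    for i, tag in enumerate(list(tags) + ["O"]):  # sentinel closes the last span
        continues = tag.startswith("I-") and tag[2:] == label
        if continues:
            continue
        if label is not None:
            spans.add((label, start, i))
        if tag.startswith("B-"):
            start, label = i, tag[2:]
        else:  # "O" or a stray I- tag (ignored in this sketch)
            start, label = None, None
    return spans


def intent_accuracy(gold_intents, pred_intents):
    """Fraction of utterances whose intent is predicted correctly."""
    correct = sum(g == p for g, p in zip(gold_intents, pred_intents))
    return correct / len(gold_intents)


def slot_f1(gold_slots, pred_slots):
    """Span-level F1 over the whole corpus (assumed definition)."""
    tp = n_gold = n_pred = 0
    for gold, pred in zip(gold_slots, pred_slots):
        g, p = extract_spans(gold), extract_spans(pred)
        tp += len(g & p)
        n_gold += len(g)
        n_pred += len(p)
    if tp == 0:
        return 0.0
    precision, recall = tp / n_pred, tp / n_gold
    return 2 * precision * recall / (precision + recall)


def sentence_frame_accuracy(gold_intents, pred_intents, gold_slots, pred_slots):
    """Fraction of utterances with the intent and every slot tag correct."""
    rows = zip(gold_intents, pred_intents, gold_slots, pred_slots)
    correct = sum(gi == pi and list(gs) == list(ps) for gi, pi, gs, ps in rows)
    return correct / len(gold_intents)


# Toy data (made up): two utterances, the second with one missed slot.
gold_intents = ["play_music", "book_restaurant"]
pred_intents = ["play_music", "book_restaurant"]
gold_slots = [["O", "B-artist", "I-artist"], ["O", "B-city"]]
pred_slots = [["O", "B-artist", "I-artist"], ["O", "O"]]

print(intent_accuracy(gold_intents, pred_intents))
print(slot_f1(gold_slots, pred_slots))
print(sentence_frame_accuracy(gold_intents, pred_intents, gold_slots, pred_slots))
```

In the toy data both intents are right but one slot span is missed, so the sentence-level metric is lower than intent accuracy: a single slot error makes the whole frame count as wrong.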
[...] softmax classifier.

$$y^s_n = \mathrm{softmax}(W^s h_n + b^s), \quad n \in 1 \ldots N \tag{2}$$

where $h_n$ is the hidden state corresponding to the first sub-token of word $x_n$.

To jointly model intent classification and slot filling, the objective is formulated as:

$$p(y^i, y^s \mid x) = p(y^i \mid x) \prod_{n=1}^{N} p(y^s_n \mid x) \tag{3}$$

The learning objective is to maximize the conditional probability $p(y^i, y^s \mid x)$. The model is fine-tuned end-to-end by minimizing the cross-entropy loss.

3.3 Conditional Random Field

Slot label predictions depend on the predictions for surrounding words. It has been shown that structured prediction models, such as conditional random fields (CRF), can improve slot filling performance. Zhou and Xu (2015) improve semantic role labeling by adding a CRF layer to a BiLSTM encoder. Here we investigate the efficacy of adding a CRF for modeling slot label dependencies on top of the joint BERT model.

4 Experiments and Analysis

We evaluate the proposed model on two public benchmark datasets, ATIS and Snips.

[...] for the training set. We also use Snips (Coucke et al., 2018), which is collected from the Snips personal voice assistant. The training, development and test sets contain 13,084, 700 and 700 utterances, respectively. There are 72 slot labels and 7 intent types for the training set.

4.2 Training Details

We use the English uncased BERT-Base model¹, which has 12 layers, a hidden size of 768, and 12 attention heads. BERT is pre-trained on BooksCorpus (800M words) (Zhu et al., 2015) and English Wikipedia (2,500M words). For fine-tuning, all hyper-parameters are tuned on the development set. The maximum sequence length is 50. The batch size is 128. Adam (Kingma and Ba, 2014) is used for optimization with an initial learning rate of 5e-5. The dropout probability is 0.1. The maximum number of epochs is selected from [1, 5, 10, 20, 30, 40].

Table 3: Ablation analysis on Snips (intent accuracy and slot F1, %).

Model        Epochs   Intent   Slot
Joint BERT   30       98.6     97.0
No joint     30       98.0     95.8
Joint BERT   40       98.3     96.4
Joint BERT   20       99.0     96.0
Joint BERT   10       98.6     96.5
Joint BERT   5        98.0     95.1
Joint BERT   1        98.0     93.3

4.3 Results
The first group of models are the baselines, consisting of the state-of-the-art joint intent classification and slot filling models: a sequence-based joint model using BiLSTM (Hakkani-Tür et al., 2016), an attention-based model (Liu and Lane, 2016), and a slot-gated model (Goo et al., 2018).

The second group of models includes the proposed joint BERT models. As can be seen from Table 2, joint BERT models significantly outperform the baseline models on both datasets. On Snips, joint BERT achieves intent classification accuracy of 98.6% (from 97.0%), slot filling F1 of 97.0% (from 88.8%), and sentence-level semantic frame accuracy of 92.8% (from 75.5%). On ATIS, joint BERT achieves intent classification accuracy of 97.5% (from 94.1%), slot filling F1 of 96.1% (from 95.2%), and sentence-level semantic frame accuracy of 88.2% (from 82.6%). Joint BERT+CRF replaces the softmax classifier with a CRF and performs comparably to joint BERT, probably because the self-attention mechanism in the Transformer may already model the label structures sufficiently.

Compared to ATIS, Snips includes multiple domains and has a larger vocabulary. For the more complex Snips dataset, joint BERT achieves a large gain in sentence-level semantic frame accuracy, from 75.5% to 92.8% (22.9% relative). This demonstrates the strong generalization capability of the joint BERT model, considering that it is pre-trained on large-scale text from mismatched domains and genres (books and Wikipedia). On ATIS, joint BERT also achieves a significant improvement in sentence-level semantic frame accuracy, from 82.6% to 88.2% (6.8% relative).

4.4 Ablation Analysis and Case Study

We conduct ablation analysis on Snips, as shown in Table 3. Without joint learning, the accuracy of intent classification drops to 98.0% (from 98.6%), and the slot filling F1 drops to 95.8% (from 97.0%). We also compare the joint BERT model with different numbers of fine-tuning epochs. The joint BERT model fine-tuned for only 1 epoch already outperforms the first group of models in Table 2.

We further select a case from Snips, as in Table 4, showing how joint BERT outperforms the slot-gated model (Goo et al., 2018) by exploiting the language representation power of BERT to improve the generalization capability. In this case, “mother joan of the angels” is wrongly predicted by the slot-gated model as an object name, and the intent is also wrong. However, joint BERT correctly predicts the slot labels and intent because “mother joan of the angels” is a movie entry in Wikipedia. The BERT model was pre-trained partly on Wikipedia and possibly learned this information for this rare phrase.

5 Conclusion

We propose a joint intent classification and slot filling model based on BERT, aiming to address the poor generalization capability of traditional NLU models. Experimental results show that our proposed joint BERT model outperforms BERT models that model intent classification and slot filling separately, demonstrating the efficacy of exploiting the relationship between the two tasks. Our proposed joint BERT model achieves significant improvement in intent classification accuracy, slot filling F1, and sentence-level semantic frame accuracy on the ATIS and Snips datasets over previous state-of-the-art models. Future work includes evaluating the proposed approach on other large-scale and more complex NLU datasets, and exploring the efficacy of combining external knowledge with BERT.
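The joint objective in Eq. 3 can be made concrete with a minimal sketch in plain Python. Minimizing the summed cross-entropy of the intent and slot predictions is the same as maximizing the product in Eq. 3, so the function below returns its negative log. The names `log_softmax`, `joint_nll` and `first_subtoken_idx`, as well as the toy logits, are illustrative assumptions and not the authors' code; in a real system the logits would come from BERT's hidden states.

```python
import math


def log_softmax(logits):
    """Numerically stable log-softmax over a list of floats."""
    peak = max(logits)
    log_norm = peak + math.log(sum(math.exp(v - peak) for v in logits))
    return [v - log_norm for v in logits]


def joint_nll(intent_logits, slot_logits, intent_label, slot_labels,
              first_subtoken_idx):
    """Negative log of Eq. 3: -log p(y^i|x) - sum_n log p(y^s_n|x).

    slot_logits holds one logit vector per sub-token. Only the positions in
    first_subtoken_idx (one per word) contribute to the slot term, mirroring
    the use of the first sub-token's hidden state in Eq. 2.
    """
    loss = -log_softmax(intent_logits)[intent_label]
    for word_pos, token_pos in enumerate(first_subtoken_idx):
        loss -= log_softmax(slot_logits[token_pos])[slot_labels[word_pos]]
    return loss


# Toy input (made-up numbers): 3 intents, 4 slot labels,
# 2 words split into 3 sub-tokens.
intent_logits = [0.2, 2.5, -1.0]
slot_logits = [
    [3.0, 0.1, 0.0, -0.5],  # word 1, its only sub-token
    [0.0, 2.0, 0.3, 0.1],   # word 2, first sub-token (scored)
    [0.5, 0.5, 0.5, 0.5],   # word 2, second sub-token (ignored)
]
loss = joint_nll(intent_logits, slot_logits, intent_label=1,
                 slot_labels=[0, 1], first_subtoken_idx=[0, 1])
print(f"joint negative log-likelihood: {loss:.4f}")
```

Because the two terms share one scalar loss, gradients from both tasks would flow into the same encoder during fine-tuning, which is the coupling the ablation in Table 3 removes in the "No joint" row.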
References

Dzmitry Bahdanau, Kyunghyun Cho, and Yoshua Bengio. 2014. Neural machine translation by jointly learning to align and translate. CoRR, abs/1409.0473.

Alice Coucke, Alaa Saade, Adrien Ball, Théodore Bluche, Alexandre Caulier, David Leroy, Clément Doumouro, Thibault Gisselbrecht, Francesco Caltagirone, Thibaut Lavril, Maël Primet, and Joseph Dureau. 2018. Snips voice platform: an embedded spoken language understanding system for private-by-design voice interfaces. CoRR, abs/1805.10190.

Jacob Devlin, Ming-Wei Chang, Kenton Lee, and Kristina Toutanova. 2018. BERT: pre-training of deep bidirectional transformers for language understanding. CoRR, abs/1810.04805.

Chih-Wen Goo, Guang Gao, Yun-Kai Hsu, Chih-Li Huo, Tsung-Chieh Chen, Keng-Wei Hsu, and Yun-Nung Chen. 2018. Slot-gated modeling for joint slot filling and intent prediction. In NAACL-HLT, New Orleans, Louisiana, USA, June 1-6, 2018, Volume 2 (Short Papers), pages 753–757.

Daniel Guo, Gökhan Tür, Wen-tau Yih, and Geoffrey Zweig. 2014. Joint semantic utterance classification and slot filling with recursive neural networks. In 2014 IEEE Spoken Language Technology Workshop, SLT 2014, South Lake Tahoe, NV, USA, December 7-10, 2014, pages 554–559.

Dilek Hakkani-Tür, Gökhan Tür, Asli Çelikyilmaz, Yun-Nung Chen, Jianfeng Gao, Li Deng, and Ye-Yi Wang. 2016. Multi-domain joint semantic frame parsing using bi-directional RNN-LSTM. In Interspeech 2016, San Francisco, CA, USA, September 8-12, 2016, pages 715–719.

Yoon Kim. 2014. Convolutional neural networks for sentence classification. In EMNLP 2014, October 25-29, 2014, Doha, Qatar, A meeting of SIGDAT, a Special Interest Group of the ACL, pages 1746–1751. ACL.

Diederik P. Kingma and Jimmy Ba. 2014. Adam: A method for stochastic optimization. CoRR, abs/1412.6980.

Gakuto Kurata, Bing Xiang, Bowen Zhou, and Mo Yu. 2016. Leveraging sentence-level information with encoder LSTM for natural language understanding. CoRR, abs/1601.01530.

Bing Liu and Ian Lane. 2016. Attention-based recurrent neural network models for joint intent detection and slot filling. In Interspeech 2016, San Francisco, CA, USA, September 8-12, 2016, pages 685–689.

Pengfei Liu, Xipeng Qiu, and Xuanjing Huang. 2017. Adversarial multi-task learning for text classification. In ACL 2017, Vancouver, Canada, July 30 - August 4, Volume 1: Long Papers, pages 1–10. Association for Computational Linguistics.

Baolin Peng, Kaisheng Yao, Li Jing, and Kam-Fai Wong. 2015. Recurrent neural networks with external memory for spoken language understanding. In NLPCC 2015, Nanchang, China, October 9-13, 2015, Proceedings, pages 25–35.

Matthew E. Peters, Mark Neumann, Mohit Iyyer, Matt Gardner, Christopher Clark, Kenton Lee, and Luke Zettlemoyer. 2018. Deep contextualized word representations. In NAACL-HLT 2018, New Orleans, Louisiana, USA, June 1-6, 2018, Volume 1 (Long Papers), pages 2227–2237.

Alec Radford, Karthik Narasimhan, Tim Salimans, and Ilya Sutskever. 2018. Improving language understanding with unsupervised learning. In Technical report, OpenAI.

Suman V. Ravuri and Andreas Stolcke. 2015. Recurrent neural network and LSTM models for lexical utterance classification. In INTERSPEECH 2015, Dresden, Germany, September 6-10, 2015, pages 135–139. ISCA.

Gökhan Tür, Dilek Hakkani-Tür, and Larry P. Heck. 2010. What is left to be understood in ATIS? In 2010 IEEE Spoken Language Technology Workshop, SLT 2010, Berkeley, California, USA, December 12-15, 2010, pages 19–24.

Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N. Gomez, Lukasz Kaiser, and Illia Polosukhin. 2017. Attention is all you need. In NIPS 2017, 4-9 December 2017, Long Beach, CA, USA, pages 6000–6010.

Ngoc Thang Vu. 2016. Sequential convolutional neural networks for slot filling in spoken language understanding. In Interspeech 2016, 17th Annual Conference of the International Speech Communication Association, San Francisco, CA, USA, September 8-12, 2016, pages 3250–3254. ISCA.

Yonghui Wu, Mike Schuster, Zhifeng Chen, Quoc V. Le, Mohammad Norouzi, Wolfgang Macherey, Maxim Krikun, Yuan Cao, Qin Gao, Klaus Macherey, Jeff Klingner, Apurva Shah, Melvin Johnson, Xiaobing Liu, Lukasz Kaiser, Stephan Gouws, Yoshikiyo Kato, Taku Kudo, Hideto Kazawa, Keith Stevens, George Kurian, Nishant Patil, Wei Wang, Cliff Young, Jason Smith, Jason Riesa, Alex Rudnick, Oriol Vinyals, Greg Corrado, Macduff Hughes, and Jeffrey Dean. 2016. Google’s neural machine translation system: Bridging the gap between human and machine translation. CoRR, abs/1609.08144.

Puyang Xu and Ruhi Sarikaya. 2013. Convolutional neural network based triangular CRF for joint intent detection and slot filling. In 2013 IEEE Workshop on Automatic Speech Recognition and Understanding, Olomouc, Czech Republic, December 8-12, 2013, pages 78–83.

Zichao Yang, Diyi Yang, Chris Dyer, Xiaodong He, Alexander J. Smola, and Eduard H. Hovy. 2016. Hierarchical attention networks for document classification. In NAACL HLT 2016, San Diego California, USA, June 12-17, 2016, pages 1480–1489. The Association for Computational Linguistics.

Kaisheng Yao, Baolin Peng, Yu Zhang, Dong Yu, Geoffrey Zweig, and Yangyang Shi. 2014. Spoken language understanding using long short-term memory neural networks. In 2014 IEEE Spoken Language Technology Workshop, SLT 2014, South Lake Tahoe, NV, USA, December 7-10, 2014, pages 189–194.

Xiang Zhang, Junbo Jake Zhao, and Yann LeCun. 2015. Character-level convolutional networks for text classification. In NIPS 2015, December 7-12, 2015, Montreal, Quebec, Canada, pages 649–657.

Lin Zhao and Zhe Feng. 2018. Improving slot filling in spoken language understanding with joint pointer and attention. In ACL 2018, Melbourne, Australia, July 15-20, 2018, Volume 2: Short Papers, pages 426–431.

Zhiwei Zhao and Youzheng Wu. 2016. Attention-based convolutional neural networks for sentence classification. In Interspeech 2016, San Francisco, CA, USA, September 8-12, 2016, pages 705–709. ISCA.