
BERT for Joint Intent Classification and Slot Filling

Qian Chen∗, Zhu Zhuo, Wen Wang


Speech Lab, DAMO Academy, Alibaba Group
{tanqing.cq, zhuozhu.zz, w.wang}@alibaba-inc.com

∗ Ongoing work.

Abstract

Intent classification and slot filling are two essential tasks for natural language understanding. They often suffer from small-scale human-labeled training data, resulting in poor generalization capability, especially for rare words. Recently a new language representation model, BERT (Bidirectional Encoder Representations from Transformers), facilitates pre-training deep bidirectional representations on large-scale unlabeled corpora, and has created state-of-the-art models for a wide variety of natural language processing tasks after simple fine-tuning. However, there has not been much effort on exploring BERT for natural language understanding. In this work, we propose a joint intent classification and slot filling model based on BERT. Experimental results demonstrate that our proposed model achieves significant improvement on intent classification accuracy, slot filling F1, and sentence-level semantic frame accuracy on several public benchmark datasets, compared to attention-based recurrent neural network models and slot-gated models.

Query           Find me a movie by Steven Spielberg
Semantic frame  Intent: find movie
                Slot: genre = movie, directed by = Steven Spielberg

Table 1: An example from user query to semantic frame.

1 Introduction

In recent years, a variety of smart speakers have been deployed and achieved great success, such as Google Home, Amazon Echo, and Tmall Genie, which facilitate goal-oriented dialogues and help users accomplish their tasks through voice interactions. Natural language understanding (NLU) is critical to the performance of goal-oriented spoken dialogue systems. NLU typically includes the intent classification and slot filling tasks, aiming to form a semantic parse for user utterances. Intent classification focuses on predicting the intent of the query, while slot filling extracts semantic concepts. Table 1 shows an example of intent classification and slot filling for the user query "Find me a movie by Steven Spielberg".

Intent classification is a classification problem that predicts the intent label y^i, and slot filling is a sequence labeling task that tags the input word sequence x = (x_1, x_2, ..., x_T) with the slot label sequence y^s = (y^s_1, y^s_2, ..., y^s_T). Recurrent neural network (RNN) based approaches, particularly gated recurrent unit (GRU) and long short-term memory (LSTM) models, have achieved state-of-the-art performance for intent classification and slot filling. Recently, several joint learning methods for intent classification and slot filling were proposed to exploit and model the dependencies between the two tasks and improve the performance over independent models (Guo et al., 2014; Hakkani-Tür et al., 2016; Liu and Lane, 2016; Goo et al., 2018). Prior work has shown that the attention mechanism (Bahdanau et al., 2014) helps RNNs deal with long-range dependencies. Hence, attention-based joint learning methods were proposed and achieved state-of-the-art performance for joint intent classification and slot filling (Liu and Lane, 2016; Goo et al., 2018).
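For concreteness, the semantic frame in Table 1 can be written as an intent label plus a per-word BIO slot sequence. The Python sketch below is purely illustrative; the label names are hypothetical and do not reproduce the exact ATIS or Snips label inventories.

# Illustrative only: label names are hypothetical, not the exact dataset inventories.
words  = ["Find", "me", "a", "movie", "by", "Steven", "Spielberg"]
intent = "find_movie"
slots  = ["O", "O", "O", "B-genre", "O", "B-directed_by", "I-directed_by"]

assert len(words) == len(slots)  # slot filling assigns one tag per input word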
Lack of human-labeled data for NLU and other natural language processing (NLP) tasks results in poor generalization capability. To address the data sparsity challenge, a variety of techniques were proposed for training general-purpose language representation models using an enormous amount of unannotated text, such as ELMo (Peters et al., 2018) and the Generative Pre-trained Transformer (GPT) (Radford et al., 2018). Pre-trained models can be fine-tuned on NLP tasks and have achieved significant improvement over training on task-specific annotated data. More recently, a pre-training technique, Bidirectional Encoder Representations from Transformers (BERT) (Devlin et al., 2018), was proposed and has created state-of-the-art models for a wide variety of NLP tasks, including question answering (SQuAD v1.1), natural language inference, and others.

However, there has not been much effort in exploring BERT for NLU. The technical contributions in this work are two-fold: 1) we explore the BERT pre-trained model to address the poor generalization capability of NLU; 2) we propose a joint intent classification and slot filling model based on BERT and demonstrate that the proposed model achieves significant improvement on intent classification accuracy, slot filling F1, and sentence-level semantic frame accuracy on several public benchmark datasets, compared to attention-based RNN models and slot-gated models.

Figure 1: A high-level view of the proposed model. The input query is "play the song little robin redbreast". (The figure shows the WordPiece tokens [CLS] play ... red ##bre ##ast [SEP] fed through stacked Transformer (Trm) layers; the [CLS] state predicts the intent, e.g. PlayMusic, and the remaining token states predict slot labels, e.g. O ... I-track.)

2 Related work

Deep learning models have been extensively explored in NLU. According to whether intent classification and slot filling are modeled separately or jointly, we categorize NLU models into independent modeling approaches and joint modeling approaches.

Approaches for intent classification include CNN (Kim, 2014; Zhang et al., 2015), LSTM (Ravuri and Stolcke, 2015), attention-based CNN (Zhao and Wu, 2016), hierarchical attention networks (Yang et al., 2016), adversarial multi-task learning (Liu et al., 2017), and others. Approaches for slot filling include CNN (Vu, 2016), deep LSTM (Yao et al., 2014), RNN-EM (Peng et al., 2015), encoder-labeler deep LSTM (Kurata et al., 2016), and joint pointer and attention (Zhao and Feng, 2018), among others.

Joint modeling approaches include CNN-CRF (Xu and Sarikaya, 2013), RecNN (Guo et al., 2014), joint RNN-LSTM (Hakkani-Tür et al., 2016), attention-based BiRNN (Liu and Lane, 2016), and the slot-gated attention-based model (Goo et al., 2018).

3 Proposed Approach

We first briefly describe the BERT model (Devlin et al., 2018) and then introduce the proposed joint model based on BERT. Figure 1 illustrates a high-level view of the proposed model.

3.1 BERT

The model architecture of BERT is a multi-layer bidirectional Transformer encoder based on the original Transformer model (Vaswani et al., 2017). The input representation is a concatenation of WordPiece embeddings (Wu et al., 2016), positional embeddings, and segment embeddings. Specifically, for single-sentence classification and tagging tasks, the segment embedding carries no discriminative information. A special classification embedding ([CLS]) is inserted as the first token and a special token ([SEP]) is added as the final token. Given an input token sequence x = (x_1, ..., x_T), the output of BERT is H = (h_1, ..., h_T).

The BERT model is pre-trained with two strategies on large-scale unlabeled text, i.e., masked language modeling and next sentence prediction. The pre-trained BERT model provides a powerful context-dependent sentence representation and can be used for various target tasks, i.e., intent classification and slot filling, through the fine-tuning procedure, similar to how it is used for other NLP tasks.
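The paper does not prescribe any particular toolkit for this input-preparation step. As a minimal sketch, assuming the HuggingFace transformers package and the same uncased WordPiece vocabulary as the Google BERT release, the input of Figure 1 could be prepared as follows.

# Sketch of WordPiece input preparation; "transformers" is an assumption, the
# paper itself builds on the Google BERT release.
from transformers import BertTokenizer

tokenizer = BertTokenizer.from_pretrained("bert-base-uncased")
tokens = ["[CLS]"] + tokenizer.tokenize("play the song little robin redbreast") + ["[SEP]"]
# With this vocabulary, "redbreast" splits into sub-tokens such as red ##bre ##ast,
# matching the input row of Figure 1.
input_ids = tokenizer.convert_tokens_to_ids(tokens)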
3.2 Joint Intent Classification and Slot Filling

BERT can be easily extended to a joint intent classification and slot filling model. Based on the hidden state of the first special token ([CLS]), denoted h_1, the intent is predicted as:

y^i = softmax(W^i h_1 + b^i) ,    (1)

For slot filling, we feed the final hidden states of the other tokens h_2, ..., h_T into a softmax layer to classify over the slot filling labels. To make this procedure compatible with WordPiece tokenization, we feed each tokenized input word into a WordPiece tokenizer and use the hidden state corresponding to the first sub-token as input to the softmax classifier:

y^s_n = softmax(W^s h_n + b^s) ,  n ∈ 1 ... N ,    (2)

where h_n is the hidden state corresponding to the first sub-token of word x_n.
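One way to realize this first-sub-token convention is to record, for each word, its slot label only at the position of its first WordPiece and to mark the remaining sub-token positions so that they are excluded from the slot loss. A minimal sketch, assuming a generic tokenize function and an "X" padding label of our own choosing (not details specified in the paper):

# Sketch of first-sub-token alignment; `tokenize` and the "X" padding label
# are assumptions, not details from the paper.
def align_to_first_subtokens(words, word_labels, tokenize):
    tokens, token_labels = [], []
    for word, label in zip(words, word_labels):
        pieces = tokenize(word) or [word]
        tokens.extend(pieces)
        # the first sub-token keeps the word-level slot label ...
        token_labels.append(label)
        # ... and the remaining sub-tokens are excluded from the slot loss
        token_labels.extend(["X"] * (len(pieces) - 1))
    return tokens, token_labels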
To jointly model intent classification and slot filling, the objective is formulated as:

p(y^i, y^s | x) = p(y^i | x) \prod_{n=1}^{N} p(y^s_n | x) ,    (3)

The learning objective is to maximize the conditional probability p(y^i, y^s | x). The model is fine-tuned end-to-end via minimizing the cross-entropy loss.
3.3 Conditional Random Field

Slot label predictions are dependent on predictions for surrounding words. It has been shown that structured prediction models, such as conditional random fields (CRF), can improve slot filling performance. Zhou and Xu (2015) improve semantic role labeling by adding a CRF layer on top of a BiLSTM encoder. Here we investigate the efficacy of adding a CRF layer on top of the joint BERT model for modeling slot label dependencies.
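A CRF layer replaces the independent per-token softmax of Eq. (2) with a sequence-level score that also models transitions between slot labels. The sketch below uses the third-party pytorch-crf package as one possible realization; the paper does not state which CRF implementation was used.

# Sketch of a CRF layer over the slot logits; the "pytorch-crf" package is an
# assumption, not a dependency named in the paper.
import torch.nn as nn
from torchcrf import CRF

class SlotCRF(nn.Module):
    def __init__(self, num_slots):
        super().__init__()
        self.crf = CRF(num_slots, batch_first=True)

    def loss(self, slot_logits, slot_labels, mask):
        # negative log-likelihood of the gold slot sequence under the CRF
        return -self.crf(slot_logits, slot_labels, mask=mask, reduction="mean")

    def decode(self, slot_logits, mask):
        # Viterbi decoding returns the best slot label sequence per utterance
        return self.crf.decode(slot_logits, mask=mask)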
4 Experiments and Analysis

We evaluate the proposed model on two public benchmark datasets, ATIS and Snips.

4.1 Data

The ATIS dataset (Tür et al., 2010) is widely used in NLU research; it includes audio recordings of people making flight reservations. We use the same data division as Goo et al. (2018) for both datasets. The training, development and test sets contain 4,478, 500 and 893 utterances, respectively. There are 120 slot labels and 21 intent types in the training set. We also use Snips (Coucke et al., 2018), which is collected from the Snips personal voice assistant. The training, development and test sets contain 13,084, 700 and 700 utterances, respectively. There are 72 slot labels and 7 intent types in the training set.

4.2 Training Details

We use the English uncased BERT-Base model (https://github.com/google-research/bert), which has 12 layers, a hidden size of 768, and 12 self-attention heads. BERT is pre-trained on BooksCorpus (800M words) (Zhu et al., 2015) and English Wikipedia (2,500M words). For fine-tuning, all hyper-parameters are tuned on the development set. The maximum sequence length is 50. The batch size is 128. Adam (Kingma and Ba, 2014) is used for optimization with an initial learning rate of 5e-5. The dropout probability is 0.1. The maximum number of epochs is selected from [1, 5, 10, 20, 30, 40].
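For reference, the reported recipe can be collected into a small configuration; the optimizer call below uses plain Adam from PyTorch as a stand-in, since the paper does not detail the exact optimizer schedule (the Google BERT release uses Adam with warmup and weight decay). JointBert refers to the sketch at the end of Section 3.2.

# Hyper-parameters as reported in Section 4.2; the wiring below is a sketch.
import torch

config = {
    "pretrained_model": "uncased BERT-Base (12 layers, hidden size 768, 12 heads)",
    "max_seq_length": 50,
    "batch_size": 128,
    "learning_rate": 5e-5,
    "dropout": 0.1,
    "epoch_grid": [1, 5, 10, 20, 30, 40],    # selected on the development set
}

model = JointBert(num_intents=7, num_slots=72)    # Snips sizes; ATIS uses 21 / 120
optimizer = torch.optim.Adam(model.parameters(), lr=config["learning_rate"])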
4.3 Results

Table 2 shows the model performance as slot filling F1, intent classification accuracy, and sentence-level semantic frame accuracy on the Snips and ATIS datasets.

                                              Snips                  ATIS
Models                                  Intent  Slot  Sent     Intent  Slot  Sent
RNN-LSTM (Hakkani-Tür et al., 2016)      96.9   87.3  73.2      92.6   94.3  80.7
Atten.-BiRNN (Liu and Lane, 2016)        96.7   87.8  74.1      91.1   94.2  78.9
Slot-Gated (Goo et al., 2018)            97.0   88.8  75.5      94.1   95.2  82.6
Joint BERT                               98.6   97.0  92.8      97.5   96.1  88.2
Joint BERT + CRF                         98.4   96.7  92.6      97.9   96.0  88.6

Table 2: NLU performance on the Snips and ATIS datasets. The metrics are intent classification accuracy (Intent), slot filling F1 (Slot), and sentence-level semantic frame accuracy (Sent), all in %. The results for the first group of models are cited from Goo et al. (2018).

The first group of models consists of the baselines, i.e., the state-of-the-art joint intent classification and slot filling models: the sequence-based joint model using BiLSTM (Hakkani-Tür et al., 2016), the attention-based model (Liu and Lane, 2016), and the slot-gated model (Goo et al., 2018). The second group comprises the proposed joint BERT models. As can be seen from Table 2, the joint BERT models significantly outperform the baseline models on both datasets. On Snips, joint BERT achieves intent classification accuracy of 98.6% (from 97.0%), slot filling F1 of 97.0% (from 88.8%), and sentence-level semantic frame accuracy of 92.8% (from 75.5%). On ATIS, joint BERT achieves intent classification accuracy of 97.5% (from 94.1%), slot filling F1 of 96.1% (from 95.2%), and sentence-level semantic frame accuracy of 88.2% (from 82.6%). Joint BERT+CRF replaces the softmax classifier with a CRF; it performs comparably to joint BERT, probably because the self-attention mechanism in the Transformer may have already sufficiently modeled the label structures.

Compared to ATIS, Snips includes multiple domains and has a larger vocabulary. For the more complex Snips dataset, joint BERT achieves a large gain in sentence-level semantic frame accuracy, from 75.5% to 92.8% (22.9% relative). This demonstrates the strong generalization capability of the joint BERT model, considering that it is pre-trained on large-scale text from mismatched domains and genres (books and Wikipedia). On ATIS, joint BERT also achieves significant improvement in sentence-level semantic frame accuracy, from 82.6% to 88.2% (6.8% relative).
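For completeness, the three metrics can be computed as follows. Slot F1 here uses the seqeval package's CoNLL-style chunk F1, a common choice for ATIS and Snips, although the paper does not name its scorer; sentence-level semantic frame accuracy counts an utterance as correct only if the intent and all slot labels are correct.

# Sketch of the evaluation metrics; seqeval for slot F1 is an assumption.
from seqeval.metrics import f1_score

def intent_accuracy(pred_intents, gold_intents):
    return sum(p == g for p, g in zip(pred_intents, gold_intents)) / len(gold_intents)

def slot_f1(pred_slots, gold_slots):
    # pred_slots / gold_slots: lists of BIO label sequences, one per utterance
    return f1_score(gold_slots, pred_slots)

def semantic_frame_accuracy(pred_intents, pred_slots, gold_intents, gold_slots):
    correct = sum(pi == gi and ps == gs for pi, ps, gi, gs
                  in zip(pred_intents, pred_slots, gold_intents, gold_slots))
    return correct / len(gold_intents)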
4.4 Ablation Analysis and Case Study

We conduct an ablation analysis on Snips, as shown in Table 3. Without joint learning, the intent classification accuracy drops to 98.0% (from 98.6%), and the slot filling F1 drops to 95.8% (from 97.0%). We also compare the joint BERT model fine-tuned for different numbers of epochs. The joint BERT model fine-tuned with only 1 epoch already outperforms the first group of models in Table 2.

Model        Epochs   Intent   Slot
Joint BERT       30     98.6   97.0
No joint         30     98.0   95.8
Joint BERT       40     98.3   96.4
Joint BERT       20     99.0   96.0
Joint BERT       10     98.6   96.5
Joint BERT        5     98.0   95.1
Joint BERT        1     98.0   93.3

Table 3: Ablation analysis for the Snips dataset (intent classification accuracy and slot filling F1, %).

We further select a case from Snips, shown in Table 4, which illustrates how joint BERT outperforms the slot-gated model (Goo et al., 2018) by exploiting the language representation power of BERT to improve the generalization capability. In this case, "mother joan of the angels" is wrongly predicted by the slot-gated model as an object name, and the intent is also wrong. In contrast, joint BERT correctly predicts the slot labels and the intent, because "mother joan of the angels" is a movie entry in Wikipedia. The BERT model was pre-trained partly on Wikipedia and possibly learned this information for this rare phrase.

Query:  need to see mother joan of the angels in one second

Gold (predicted correctly by joint BERT):
Intent: SearchScreeningEvent
Slots:  O O O B-movie-name I-movie-name I-movie-name I-movie-name I-movie-name B-timeRange I-timeRange I-timeRange

Predicted by the slot-gated model (Goo et al., 2018):
Intent: BookRestaurant
Slots:  O O O B-object-name I-object-name I-object-name I-object-name I-object-name B-timeRange I-timeRange I-timeRange

Table 4: A case from the Snips dataset.

5 Conclusion

We propose a joint intent classification and slot filling model based on BERT, aiming at addressing the poor generalization capability of traditional NLU models. Experimental results show that our proposed joint BERT model outperforms BERT models that model intent classification and slot filling separately, demonstrating the efficacy of exploiting the relationship between the two tasks. Our proposed joint BERT model achieves significant improvement on intent classification accuracy, slot filling F1, and sentence-level semantic frame accuracy on the ATIS and Snips datasets over previous state-of-the-art models. Future work includes evaluations of the proposed approach on other large-scale and more complex NLU datasets, and exploring the efficacy of combining external knowledge with BERT.
References

Dzmitry Bahdanau, Kyunghyun Cho, and Yoshua Bengio. 2014. Neural machine translation by jointly learning to align and translate. CoRR, abs/1409.0473.

Alice Coucke, Alaa Saade, Adrien Ball, Théodore Bluche, Alexandre Caulier, David Leroy, Clément Doumouro, Thibault Gisselbrecht, Francesco Caltagirone, Thibaut Lavril, Maël Primet, and Joseph Dureau. 2018. Snips voice platform: an embedded spoken language understanding system for private-by-design voice interfaces. CoRR, abs/1805.10190.

Jacob Devlin, Ming-Wei Chang, Kenton Lee, and Kristina Toutanova. 2018. BERT: pre-training of deep bidirectional transformers for language understanding. CoRR, abs/1810.04805.

Chih-Wen Goo, Guang Gao, Yun-Kai Hsu, Chih-Li Huo, Tsung-Chieh Chen, Keng-Wei Hsu, and Yun-Nung Chen. 2018. Slot-gated modeling for joint slot filling and intent prediction. In NAACL-HLT 2018, New Orleans, Louisiana, USA, June 1-6, 2018, Volume 2 (Short Papers), pages 753–757.

Daniel Guo, Gökhan Tür, Wen-tau Yih, and Geoffrey Zweig. 2014. Joint semantic utterance classification and slot filling with recursive neural networks. In 2014 IEEE Spoken Language Technology Workshop, SLT 2014, South Lake Tahoe, NV, USA, December 7-10, 2014, pages 554–559.

Dilek Hakkani-Tür, Gökhan Tür, Asli Çelikyilmaz, Yun-Nung Chen, Jianfeng Gao, Li Deng, and Ye-Yi Wang. 2016. Multi-domain joint semantic frame parsing using bi-directional RNN-LSTM. In Interspeech 2016, San Francisco, CA, USA, September 8-12, 2016, pages 715–719.

Yoon Kim. 2014. Convolutional neural networks for sentence classification. In EMNLP 2014, October 25-29, 2014, Doha, Qatar, A meeting of SIGDAT, a Special Interest Group of the ACL, pages 1746–1751. ACL.

Diederik P. Kingma and Jimmy Ba. 2014. Adam: A method for stochastic optimization. CoRR, abs/1412.6980.

Gakuto Kurata, Bing Xiang, Bowen Zhou, and Mo Yu. 2016. Leveraging sentence-level information with encoder LSTM for natural language understanding. CoRR, abs/1601.01530.

Bing Liu and Ian Lane. 2016. Attention-based recurrent neural network models for joint intent detection and slot filling. In Interspeech 2016, San Francisco, CA, USA, September 8-12, 2016, pages 685–689.

Pengfei Liu, Xipeng Qiu, and Xuanjing Huang. 2017. Adversarial multi-task learning for text classification. In ACL 2017, Vancouver, Canada, July 30 - August 4, Volume 1: Long Papers, pages 1–10. Association for Computational Linguistics.

Baolin Peng, Kaisheng Yao, Li Jing, and Kam-Fai Wong. 2015. Recurrent neural networks with external memory for spoken language understanding. In NLPCC 2015, Nanchang, China, October 9-13, 2015, Proceedings, pages 25–35.

Matthew E. Peters, Mark Neumann, Mohit Iyyer, Matt Gardner, Christopher Clark, Kenton Lee, and Luke Zettlemoyer. 2018. Deep contextualized word representations. In NAACL-HLT 2018, New Orleans, Louisiana, USA, June 1-6, 2018, Volume 1 (Long Papers), pages 2227–2237.

Alec Radford, Karthik Narasimhan, Tim Salimans, and Ilya Sutskever. 2018. Improving language understanding with unsupervised learning. Technical report, OpenAI.

Suman V. Ravuri and Andreas Stolcke. 2015. Recurrent neural network and LSTM models for lexical utterance classification. In INTERSPEECH 2015, Dresden, Germany, September 6-10, 2015, pages 135–139. ISCA.

Gökhan Tür, Dilek Hakkani-Tür, and Larry P. Heck. 2010. What is left to be understood in ATIS? In 2010 IEEE Spoken Language Technology Workshop, SLT 2010, Berkeley, California, USA, December 12-15, 2010, pages 19–24.

Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N. Gomez, Lukasz Kaiser, and Illia Polosukhin. 2017. Attention is all you need. In NIPS 2017, 4-9 December 2017, Long Beach, CA, USA, pages 6000–6010.

Ngoc Thang Vu. 2016. Sequential convolutional neural networks for slot filling in spoken language understanding. In Interspeech 2016, 17th Annual Conference of the International Speech Communication Association, San Francisco, CA, USA, September 8-12, 2016, pages 3250–3254. ISCA.

Yonghui Wu, Mike Schuster, Zhifeng Chen, Quoc V. Le, Mohammad Norouzi, Wolfgang Macherey, Maxim Krikun, Yuan Cao, Qin Gao, Klaus Macherey, Jeff Klingner, Apurva Shah, Melvin Johnson, Xiaobing Liu, Lukasz Kaiser, Stephan Gouws, Yoshikiyo Kato, Taku Kudo, Hideto Kazawa, Keith Stevens, George Kurian, Nishant Patil, Wei Wang, Cliff Young, Jason Smith, Jason Riesa, Alex Rudnick, Oriol Vinyals, Greg Corrado, Macduff Hughes, and Jeffrey Dean. 2016. Google's neural machine translation system: Bridging the gap between human and machine translation. CoRR, abs/1609.08144.

Puyang Xu and Ruhi Sarikaya. 2013. Convolutional neural network based triangular CRF for joint intent detection and slot filling. In 2013 IEEE Workshop on Automatic Speech Recognition and Understanding, Olomouc, Czech Republic, December 8-12, 2013, pages 78–83.

Zichao Yang, Diyi Yang, Chris Dyer, Xiaodong He, Alexander J. Smola, and Eduard H. Hovy. 2016. Hierarchical attention networks for document classification. In NAACL HLT 2016, San Diego, California, USA, June 12-17, 2016, pages 1480–1489. The Association for Computational Linguistics.

Kaisheng Yao, Baolin Peng, Yu Zhang, Dong Yu, Geoffrey Zweig, and Yangyang Shi. 2014. Spoken language understanding using long short-term memory neural networks. In 2014 IEEE Spoken Language Technology Workshop, SLT 2014, South Lake Tahoe, NV, USA, December 7-10, 2014, pages 189–194.

Xiang Zhang, Junbo Jake Zhao, and Yann LeCun. 2015. Character-level convolutional networks for text classification. In NIPS 2015, December 7-12, 2015, Montreal, Quebec, Canada, pages 649–657.

Lin Zhao and Zhe Feng. 2018. Improving slot filling in spoken language understanding with joint pointer and attention. In ACL 2018, Melbourne, Australia, July 15-20, 2018, Volume 2: Short Papers, pages 426–431.

Zhiwei Zhao and Youzheng Wu. 2016. Attention-based convolutional neural networks for sentence classification. In Interspeech 2016, San Francisco, CA, USA, September 8-12, 2016, pages 705–709. ISCA.

Jie Zhou and Wei Xu. 2015. End-to-end learning of semantic role labeling using recurrent neural networks. In ACL 2015, July 26-31, 2015, Beijing, China, Volume 1: Long Papers, pages 1127–1137.

Yukun Zhu, Ryan Kiros, Richard S. Zemel, Ruslan Salakhutdinov, Raquel Urtasun, Antonio Torralba, and Sanja Fidler. 2015. Aligning books and movies: Towards story-like visual explanations by watching movies and reading books. In 2015 IEEE International Conference on Computer Vision, ICCV 2015, Santiago, Chile, December 7-13, 2015, pages 19–27.
