
Fine-tune BERT for Extractive Summarization

Yang Liu
Institute for Language, Cognition and Computation
School of Informatics, University of Edinburgh
10 Crichton Street, Edinburgh EH8 9AB
[email protected]

Please see https://arxiv.org/abs/1908.08345 for the full and most current version of this paper.

Abstract

BERT (Devlin et al., 2018), a pre-trained Transformer (Vaswani et al., 2017) model, has achieved ground-breaking performance on multiple NLP tasks. In this paper, we describe BERTSUM, a simple variant of BERT, for extractive summarization. Our system is the state of the art on the CNN/Dailymail dataset, outperforming the previous best-performing system by 1.65 on ROUGE-L. The code to reproduce our results is available at https://github.com/nlpyang/BertSum.

1 Introduction

Single-document summarization is the task of automatically generating a shorter version of a document while retaining its most important information. The task has received much attention in the natural language processing community due to its potential for various information access applications. Examples include tools which digest textual content (e.g., news, social media, reviews), answer questions, or provide recommendations.

The task is often divided into two paradigms, abstractive summarization and extractive summarization. In abstractive summarization, target summaries contain words or phrases that were not in the original text and usually require various text rewriting operations to generate, while extractive approaches form summaries by copying and concatenating the most important spans (usually sentences) in a document. In this paper, we focus on extractive summarization.

Although many neural models have been proposed for extractive summarization recently (Cheng and Lapata, 2016; Nallapati et al., 2017; Narayan et al., 2018; Dong et al., 2018; Zhang et al., 2018; Zhou et al., 2018), the improvement on automatic metrics like ROUGE has reached a bottleneck due to the complexity of the task. In this paper, we argue that BERT (Devlin et al., 2018), with its pre-training on a huge dataset and its powerful architecture for learning complex features, can further boost the performance of extractive summarization. We focus on designing different variants of using BERT for the extractive summarization task and showing their results on the CNN/Dailymail and NYT datasets. We found that a flat architecture with inter-sentence Transformer layers performs best, achieving state-of-the-art results on this task.

2 Methodology

Let d denote a document containing several sentences [sent_1, sent_2, ..., sent_m], where sent_i is the i-th sentence in the document. Extractive summarization can be defined as the task of assigning a label y_i ∈ {0, 1} to each sent_i, indicating whether the sentence should be included in the summary. It is assumed that summary sentences represent the most important content of the document.

2.1 Extractive Summarization with BERT

To use BERT for extractive summarization, we require it to output a representation for each sentence. However, since BERT is trained as a masked-language model, its output vectors are grounded to tokens instead of sentences. Meanwhile, although BERT has segmentation embeddings for indicating different sentences, it only has two labels (sentence A or sentence B), rather than the multiple sentences that arise in extractive summarization. Therefore, we modify the input sequence and embeddings of BERT to make it possible to extract summaries.
[Figure 1 shows the input document segmented as "[CLS] sent one [SEP] [CLS] 2nd sent [SEP] [CLS] sent again [SEP]", with rows for token embeddings, interval segment embeddings (E_A/E_B), and position embeddings, feeding into BERT and then the summarization layers that produce Y_1, Y_2, Y_3.]

Figure 1: The overview architecture of the BERTSUM model.

Encoding Multiple Sentences  As illustrated in Figure 1, we insert a [CLS] token before each sentence and a [SEP] token after each sentence. In vanilla BERT, the [CLS] token is used as a symbol to aggregate features from one sentence or a pair of sentences. We modify the model by using multiple [CLS] symbols, so that each [CLS] aggregates the features of the sentence that follows it.

Interval Segment Embeddings  We use interval segment embeddings to distinguish multiple sentences within a document. For sent_i we assign a segment embedding E_A or E_B depending on whether i is odd or even. For example, for [sent_1, sent_2, sent_3, sent_4, sent_5] we assign [E_A, E_B, E_A, E_B, E_A]. The vector T_i, i.e., the vector of the i-th [CLS] symbol from the top BERT layer, is used as the representation for sent_i.
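The following minimal sketch (not the authors' released code) illustrates how a tokenized document could be converted into these inputs; the function name build_bertsum_inputs is hypothetical.

```python
# Build BERTSUM-style inputs: [CLS] before and [SEP] after each sentence,
# with interval segment ids alternating between 0 (E_A) and 1 (E_B).
from typing import List, Tuple

def build_bertsum_inputs(sentences: List[List[str]]) -> Tuple[List[str], List[int], List[int]]:
    """sentences: each sentence given as a list of WordPiece tokens."""
    tokens, segment_ids, cls_positions = [], [], []
    for i, sent in enumerate(sentences):
        cls_positions.append(len(tokens))                # where this sentence's [CLS] sits
        sent_tokens = ["[CLS]"] + sent + ["[SEP]"]
        tokens.extend(sent_tokens)
        segment_ids.extend([i % 2] * len(sent_tokens))   # interval segments: odd/even sentences
    return tokens, segment_ids, cls_positions

tokens, segments, cls_pos = build_bertsum_inputs(
    [["sent", "one"], ["2nd", "sent"], ["sent", "again"]])
# tokens   -> [CLS] sent one [SEP] [CLS] 2nd sent [SEP] [CLS] sent again [SEP]
# segments -> 0 0 0 0 1 1 1 1 0 0 0 0
# cls_pos  -> [0, 4, 8]
```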
2.2 Fine-tuning with Summarization Layers

After obtaining the sentence vectors from BERT, we build several summarization-specific layers stacked on top of the BERT outputs to capture document-level features for extracting summaries. For each sentence sent_i, we calculate a final predicted score Ŷ_i. The loss of the whole model is the binary cross-entropy of Ŷ_i against the gold label Y_i. These summarization layers are jointly fine-tuned with BERT.

Simple Classifier  As in the original BERT paper, the Simple Classifier only adds a linear layer on the BERT outputs and uses a sigmoid function to get the predicted score:

Ŷ_i = σ(W_o T_i + b_o)    (1)

where σ is the sigmoid function.
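A minimal PyTorch sketch of this classifier head and its loss, assuming the [CLS] vectors have already been gathered from BERT; the class name and the hidden size of 768 are illustrative, not taken from the paper's code.

```python
import torch
import torch.nn as nn

class SimpleClassifier(nn.Module):
    """Sigmoid classifier over the [CLS] vectors T_i, as in Equation (1)."""
    def __init__(self, hidden_size: int = 768):
        super().__init__()
        self.linear = nn.Linear(hidden_size, 1)

    def forward(self, cls_vectors: torch.Tensor) -> torch.Tensor:
        # cls_vectors: [batch, num_sentences, hidden_size], gathered at the [CLS] positions
        return torch.sigmoid(self.linear(cls_vectors)).squeeze(-1)

# Training uses binary cross-entropy against the 0/1 oracle labels:
scores = SimpleClassifier()(torch.randn(2, 5, 768))                     # [2, 5]
loss = nn.BCELoss()(scores, torch.randint(0, 2, (2, 5)).float())
```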
Inter-sentence Transformer  Instead of a simple sigmoid classifier, the Inter-sentence Transformer applies more Transformer layers only on the sentence representations, extracting document-level features focused on the summarization task from the BERT outputs:

h̃^l = LN(h^(l-1) + MHAtt(h^(l-1)))    (2)

h^l = LN(h̃^l + FFN(h̃^l))    (3)

where h^0 = PosEmb(T) and T are the sentence vectors output by BERT, PosEmb is the function that adds positional embeddings (indicating the position of each sentence) to T; LN is the layer normalization operation (Ba et al., 2016); MHAtt is the multi-head attention operation (Vaswani et al., 2017); and the superscript l indicates the depth of the stacked layer.

The final output layer is still a sigmoid classifier:

Ŷ_i = σ(W_o h_i^L + b_o)    (4)

where h_i^L is the vector for sent_i from the top layer (the L-th layer) of the Transformer. In experiments, we implemented Transformers with L = 1, 2, 3 and found that the Transformer with 2 layers performs best.
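A sketch of Equations (2)-(4) using PyTorch's built-in encoder layers as a stand-in for the paper's implementation; the learned positional embedding, head count, and feed-forward size are assumptions, and batch_first requires a reasonably recent PyTorch.

```python
import torch
import torch.nn as nn

class InterSentenceTransformer(nn.Module):
    """Transformer layers over sentence vectors T, plus a sigmoid output layer."""
    def __init__(self, hidden_size: int = 768, num_layers: int = 2,
                 num_heads: int = 8, ffn_size: int = 2048, max_sents: int = 512):
        super().__init__()
        self.pos_emb = nn.Embedding(max_sents, hidden_size)   # PosEmb (could also be sinusoidal)
        layer = nn.TransformerEncoderLayer(hidden_size, num_heads,
                                           dim_feedforward=ffn_size, batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, num_layers)
        self.out = nn.Linear(hidden_size, 1)

    def forward(self, sent_vectors: torch.Tensor) -> torch.Tensor:
        # sent_vectors: [batch, num_sentences, hidden_size] ([CLS] vectors from BERT)
        positions = torch.arange(sent_vectors.size(1), device=sent_vectors.device)
        h = sent_vectors + self.pos_emb(positions)            # h^0 = PosEmb(T)
        h = self.encoder(h)                                   # Equations (2)-(3), stacked L times
        return torch.sigmoid(self.out(h)).squeeze(-1)         # Equation (4)
```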
Recurrent Neural Network  Although the Transformer model has achieved great results on several tasks, there is evidence that Recurrent Neural Networks still have their advantages, especially when combined with techniques from the Transformer (Chen et al., 2018). Therefore, we apply an LSTM layer over the BERT outputs to learn summarization-specific features.

To stabilize the training, per-gate layer normalization (Ba et al., 2016) is applied within each LSTM cell. At time step i, the input to the LSTM layer is the BERT output T_i, and the output is calculated as:

(F_i; I_i; O_i; G_i) = LN_h(W_h h_(i-1)) + LN_x(W_x T_i)    (5)

C_i = σ(F_i) ⊙ C_(i-1) + σ(I_i) ⊙ tanh(G_i)    (6)

h_i = σ(O_i) ⊙ tanh(LN_c(C_i))    (7)

where F_i, I_i, O_i are the forget, input, and output gates; G_i is the hidden vector and C_i is the memory vector; h_i is the output vector; LN_h, LN_x, LN_c are three different layer normalization operations; bias terms are not shown.

The final output layer is also a sigmoid classifier:

Ŷ_i = σ(W_o h_i + b_o)    (8)
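A minimal sketch of an LSTM cell with the layer normalization of Equations (5)-(7); it is not the authors' implementation, and applying a single LayerNorm over the concatenated gate pre-activations is an assumption about how LN_h and LN_x are realized.

```python
import torch
import torch.nn as nn

class LayerNormLSTMCell(nn.Module):
    """LSTM cell with layer normalization, roughly following Equations (5)-(7)."""
    def __init__(self, input_size: int = 768, hidden_size: int = 768):
        super().__init__()
        self.w_h = nn.Linear(hidden_size, 4 * hidden_size, bias=False)
        self.w_x = nn.Linear(input_size, 4 * hidden_size, bias=False)
        self.ln_h = nn.LayerNorm(4 * hidden_size)
        self.ln_x = nn.LayerNorm(4 * hidden_size)
        self.ln_c = nn.LayerNorm(hidden_size)

    def forward(self, t_i, state):
        h_prev, c_prev = state
        gates = self.ln_h(self.w_h(h_prev)) + self.ln_x(self.w_x(t_i))    # Eq. (5)
        f, i, o, g = gates.chunk(4, dim=-1)
        c = torch.sigmoid(f) * c_prev + torch.sigmoid(i) * torch.tanh(g)  # Eq. (6)
        h = torch.sigmoid(o) * torch.tanh(self.ln_c(c))                   # Eq. (7)
        return h, (h, c)
```

The cell is unrolled over the [CLS] vectors T_1, ..., T_m of a document, and each output h_i is fed to the sigmoid classifier of Equation (8).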
3 Experiments

In this section we present our implementation, describe the summarization datasets and our evaluation protocol, and analyze our results.

3.1 Implementation Details

We use PyTorch, OpenNMT (Klein et al., 2017) and the 'bert-base-uncased'* version of BERT to implement the model. BERT and the summarization layers are jointly fine-tuned. Adam with β1 = 0.9, β2 = 0.999 is used for fine-tuning. The learning rate schedule follows Vaswani et al. (2017), with warm-up on the first 10,000 steps:

lr = 2e-3 · min(step^(-0.5), step · warmup^(-1.5))

All models are trained for 50,000 steps on 3 GPUs (GTX 1080 Ti) with gradient accumulation every two steps, which makes the batch size approximately equal to 36. Model checkpoints are saved and evaluated on the validation set every 1,000 steps. We select the top-3 checkpoints based on their evaluation losses on the validation set, and report the averaged results on the test set.

When predicting summaries for a new document, we first use the model to obtain a score for each sentence. We then rank the sentences by score from highest to lowest, and select the top-3 sentences as the summary.

* https://github.com/huggingface/pytorch-pretrained-BERT
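A small sketch of this warm-up schedule as a plain Python function, assuming warmup = 10,000 as stated above.

```python
def learning_rate(step: int, warmup: int = 10_000, base: float = 2e-3) -> float:
    """lr = 2e-3 * min(step^-0.5, step * warmup^-1.5), as in Section 3.1."""
    step = max(step, 1)  # guard against step = 0
    return base * min(step ** -0.5, step * warmup ** -1.5)

# The rate grows linearly during warm-up and then decays with 1/sqrt(step):
# learning_rate(1_000) ≈ 2e-06, learning_rate(10_000) ≈ 2e-05, learning_rate(40_000) ≈ 1e-05
```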
Trigram Blocking  During prediction, trigram blocking is used to reduce redundancy. Given the selected summary S and a candidate sentence c, we skip c if there exists a trigram overlap between c and S. This is similar to Maximal Marginal Relevance (MMR) (Carbonell and Goldstein, 1998) but much simpler.
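A minimal sketch (not the released BertSum code) that combines the ranking step above with trigram blocking; sentences given as token lists and per-sentence scores are assumed as inputs.

```python
from typing import List

def trigrams(words: List[str]) -> set:
    return {tuple(words[i:i + 3]) for i in range(len(words) - 2)}

def select_summary(sentences: List[List[str]], scores: List[float], max_sents: int = 3) -> List[int]:
    """Rank sentences by score, skip candidates that share a trigram with the summary, keep top 3."""
    selected, seen = [], set()
    for idx in sorted(range(len(sentences)), key=lambda i: scores[i], reverse=True):
        tri = trigrams(sentences[idx])
        if tri & seen:                 # skip c if a trigram overlaps the current summary S
            continue
        selected.append(idx)
        seen |= tri
        if len(selected) == max_sents:
            break
    return sorted(selected)            # restore document order for readability
```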
3.2 Summarization Datasets

We evaluated on two benchmark datasets, namely the CNN/DailyMail news highlights dataset (Hermann et al., 2015) and the New York Times Annotated Corpus (NYT; Sandhaus 2008). The CNN/DailyMail dataset contains news articles and associated highlights, i.e., a few bullet points giving a brief overview of the article. We used the standard splits of Hermann et al. (2015) for training, validation, and testing (90,266/1,220/1,093 CNN documents and 196,961/12,148/10,397 DailyMail documents). We did not anonymize entities. We first split sentences with CoreNLP and pre-process the dataset following the methods in See et al. (2017).

The NYT dataset contains 110,540 articles with abstractive summaries. Following Durrett et al. (2016), we split these into 100,834 training and 9,706 test examples, based on the date of publication (the test set contains all articles published on January 1, 2007 or later). We took 4,000 examples from the training set as the validation set. We also followed their filtering procedure: documents with summaries shorter than 50 words were removed from the raw dataset. The filtered test set (NYT50) includes 3,452 test examples. We first split sentences with CoreNLP and pre-process the dataset following the methods in Durrett et al. (2016).

Both datasets contain abstractive gold summaries, which are not readily suited to training extractive summarization models. A greedy algorithm was used to generate an oracle summary for each document.
Model                  ROUGE-1  ROUGE-2  ROUGE-L
PGN*                   39.53    17.28    37.98
DCA*                   41.69    19.47    37.92
LEAD                   40.42    17.62    36.67
ORACLE                 52.59    31.24    48.87
REFRESH*               41.0     18.8     37.7
NEUSUM*                41.59    19.01    37.98
Transformer            40.90    18.02    37.17
BERTSUM+Classifier     43.23    20.22    39.60
BERTSUM+Transformer    43.25    20.24    39.63
BERTSUM+LSTM           43.22    20.17    39.59

Table 1: Test set results on the CNN/DailyMail dataset using ROUGE F1. Results marked with * are taken from the corresponding papers.

The algorithm greedily selects the sentences that maximize the ROUGE scores against the gold summary as the oracle sentences. We assigned label 1 to sentences selected in the oracle summary and 0 otherwise.
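A sketch of this greedy oracle construction; the rouge scorer is a hypothetical helper (any ROUGE implementation could be plugged in), and the stopping criterion shown here is one reasonable choice rather than the paper's exact procedure.

```python
from typing import Callable, List

def greedy_oracle_labels(doc_sents: List[str], gold_summary: str,
                         rouge: Callable[[str, str], float],
                         max_sents: int = 3) -> List[int]:
    """Greedily add the sentence that most improves ROUGE against the gold summary."""
    selected: List[int] = []
    best_score = 0.0
    while len(selected) < max_sents:
        gains = [(rouge(" ".join(doc_sents[j] for j in sorted(selected + [i])), gold_summary), i)
                 for i in range(len(doc_sents)) if i not in selected]
        if not gains:
            break
        score, i = max(gains)
        if score <= best_score:   # stop when no remaining sentence improves the oracle
            break
        best_score, selected = score, selected + [i]
    return [1 if i in selected else 0 for i in range(len(doc_sents))]
```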
4 Experimental Results

The experimental results on the CNN/Dailymail dataset are shown in Table 1. For comparison, we implement a non-pretrained Transformer baseline which uses the same architecture as BERT, but with fewer parameters. It is randomly initialized and only trained on the summarization task. The Transformer baseline has 6 layers, a hidden size of 512 and a feed-forward filter size of 2048. The model is trained with the same settings, following Vaswani et al. (2017). We also compare our model with several previously proposed systems.

• LEAD is an extractive baseline which uses the first 3 sentences of the document as a summary.

• REFRESH (Narayan et al., 2018) is an extractive summarization system trained by globally optimizing the ROUGE metric with reinforcement learning.

• NEUSUM (Zhou et al., 2018) is the state-of-the-art extractive system that jointly scores and selects sentences.

• PGN (See et al., 2017) is the Pointer Generator Network, an abstractive summarization system based on an encoder-decoder architecture.

• DCA (Celikyilmaz et al., 2018) is the Deep Communicating Agents model, a state-of-the-art abstractive summarization system with multiple agents to represent the document as well as a hierarchical attention mechanism over the agents for decoding.

As illustrated in the table, all BERT-based models outperformed previous state-of-the-art models by a large margin. BERTSUM with Transformer achieved the best performance on all three metrics. BERTSUM with LSTM does not show an obvious difference in summarization performance compared to the Classifier model.

Ablation studies are conducted to show the contribution of different components of BERTSUM. The results are shown in Table 2. Interval segments increase the performance of the base model. Trigram blocking greatly improves the summarization results. This is consistent with previous conclusions that a sequential extractive decoder is helpful for generating more informative summaries. However, here we use trigram blocking as a simple but robust alternative.

Model                  R-1    R-2    R-L
BERTSUM+Classifier     43.23  20.22  39.60
 -interval segments    43.21  20.17  39.57
 -trigram blocking     42.57  19.96  39.04

Table 2: Results of ablation studies of BERTSUM on the CNN/Dailymail test set using ROUGE F1 (R-1 and R-2 are shorthands for unigram and bigram overlap, R-L is the longest common subsequence).

The experimental results on the NYT dataset are shown in Table 3. Different from CNN/Dailymail, we use the limited-length recall evaluation, following Durrett et al. (2016).
lowing Durrett et al. (2016). We truncate the pre- Jianpeng Cheng and Mirella Lapata. 2016. Neural
dicted summaries to the lengths of the gold sum- summarization by extracting sentences and words.
In Proceedings of the ACL Conference.
maries and evaluate summarization quality with
ROUGE Recall. Compared baselines are (1) First- Jacob Devlin, Ming-Wei Chang, Kenton Lee, and
k words, which is a simple baseline by extract- Kristina Toutanova. 2018. Bert: Pre-training of deep
ing first k words of the input article; (2) Full bidirectional transformers for language understand-
is the best-performed extractive model in Dur- ing.
rett et al. (2016); (3) Deep Reinforced (Paulus Yue Dong, Yikang Shen, Eric Crawford, Herke van
et al., 2018) is an abstractive model, using rein- Hoof, and Jackie Chi Kit Cheung. 2018. Banditsum:
force learning and encoder-decoder structure. The Extractive summarization as a contextual bandit. In
B ERTSUM+Classifier can achieve the state-of-the- Proceedings of the EMNLP Conference.
art results on this dataset. Greg Durrett, Taylor Berg-Kirkpatrick, and Dan Klein.
2016. Learning-based single-document summariza-
Model R-1 R-2 R-L tion with compression and anaphoricity constraints.
First-k words 39.58 20.11 35.78 In Proceedings of the ACL Conference.
Full∗ 42.2 24.9 -
Karl Moritz Hermann, Tomas Kocisky, Edward
Deep Reinforced∗ 42.94 26.02 -
Grefenstette, Lasse Espeholt, Will Kay, Mustafa Su-
B ERTSUM+Classifier 46.66 26.35 42.62 leyman, and Phil Blunsom. 2015. Teaching ma-
chines to read and comprehend. In Advances in Neu-
Table 3: Test set results on the NYT50 dataset using ral Information Processing Systems, pages 1693–
ROUGE Recall. The predicted summary are truncated 1701.
to the length of the gold-standard summary. Results
with ∗ mark are taken from the corresponding papers. Guillaume Klein, Yoon Kim, Yuntian Deng, Jean
Senellart, and Alexander M Rush. 2017. Opennmt:
Open-source toolkit for neural machine translation.
In arXiv preprint arXiv:1701.02810.
5 Conclusion
Ramesh Nallapati, Feifei Zhai, and Bowen Zhou. 2017.
In this paper, we explored how to use BERT for Summarunner: A recurrent neural network based se-
extractive summarization. We proposed the B ERT- quence model for extractive summarization of docu-
ments. In Proceedings of the AAAI Conference.
SUM model and tried several summarization layers
can be applied with BERT. We did experiments Shashi Narayan, Shay B Cohen, and Mirella Lapata.
on two large-scale datasets and found the B ERT- 2018. Ranking sentences for extractive summariza-
SUM with inter-sentence Transformer layers can tion with reinforcement learning. In Proceedings of
achieve the best performance. the NAACL Conference.

Romain Paulus, Caiming Xiong, and Richard Socher.


2018. A deep reinforced model for abstractive sum-
References marization. In Proceedings of the ICLR Conference.
Jimmy Lei Ba, Jamie Ryan Kiros, and Geoffrey E Hin- Evan Sandhaus. 2008. The New York Times Annotated
ton. 2016. Layer normalization. arXiv preprint Corpus. Linguistic Data Consortium, Philadelphia,
arXiv:1607.06450. 6(12).
Jaime G Carbonell and Jade Goldstein. 1998. The use Abigail See, Peter J. Liu, and Christopher D. Manning.
of mmr and diversity-based reranking for reodering 2017. Get to the point: Summarization with pointer-
documents and producing summaries. generator networks. In Proceedings of the ACL Con-
ference.
Asli Celikyilmaz, Antoine Bosselut, Xiaodong He, and
Yejin Choi. 2018. Deep communicating agents for Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob
abstractive summarization. In Proceedings of the Uszkoreit, Llion Jones, Aidan N Gomez, Łukasz
NAACL Conference. Kaiser, and Illia Polosukhin. 2017. Attention is all
you need. In Advances in Neural Information Pro-
Mia Xu Chen, Orhan Firat, Ankur Bapna, Melvin cessing Systems, pages 5998–6008.
Johnson, Wolfgang Macherey, George Foster, Llion
Jones, Niki Parmar, Mike Schuster, Zhifeng Chen, Xingxing Zhang, Mirella Lapata, Furu Wei, and Ming
et al. 2018. The best of both worlds: Combining Zhou. 2018. Neural latent extractive document sum-
recent advances in neural machine translation. In marization. In Proceedings of the EMNLP Confer-
Proceedings of the ACL Conference. ence.
Qingyu Zhou, Nan Yang, Furu Wei, Shaohan Huang,
Ming Zhou, and Tiejun Zhao. 2018. Neural docu-
ment summarization by jointly learning to score and
select sentences. In Proceedings of the ACL Confer-
ence.
