
A Primer in BERTology: What we know about how BERT works

Anna Rogers, Olga Kovaleva, Anna Rumshisky


Department of Computer Science, University of Massachusetts Lowell
Lowell, MA 01854
{arogers, okovalev, arum}@cs.uml.edu

Abstract

Transformer-based models are now widely used in NLP, but we still do not understand a lot about their inner workings. This paper describes what is known to date about the famous BERT model (Devlin et al., 2019), synthesizing over 40 analysis studies. We also provide an overview of the proposed modifications to the model and its training regime. We then outline the directions for further research.

1 Introduction

Since their introduction in 2017, Transformers (Vaswani et al., 2017) took NLP by storm, offering enhanced parallelization and better modeling of long-range dependencies. The best known Transformer-based model is BERT (Devlin et al., 2019), which obtained state-of-the-art results in numerous benchmarks and was integrated in Google search¹, improving an estimated 10% of queries.

¹ https://blog.google/products/search/search-language-understanding-bert

While it is clear that BERT and other Transformer-based models work remarkably well, it is less clear why, which limits further hypothesis-driven improvement of the architecture. Unlike CNNs, the Transformers have little cognitive motivation, and the size of these models limits our ability to experiment with pre-training and perform ablation studies. This explains the large number of studies over the past year that attempted to understand the reasons behind BERT's performance.

This paper provides an overview of what has been learned to date, highlighting the questions which are still unresolved. We focus on the studies investigating the types of knowledge learned by BERT, where this knowledge is represented, how it is learned, and the methods proposed to improve it.

2 Overview of BERT architecture

Fundamentally, BERT is a stack of Transformer encoder layers (Vaswani et al., 2017) which consist of multiple "heads", i.e., fully-connected neural networks augmented with a self-attention mechanism. For every input token in a sequence, each head computes key, value and query vectors, which are used to create a weighted representation. The outputs of all heads in the same layer are combined and run through a fully-connected layer. Each layer is wrapped with a skip connection, and layer normalization is applied after it.

The conventional workflow for BERT consists of two stages: pre-training and fine-tuning. Pre-training uses two semi-supervised tasks: masked language modeling (MLM, prediction of randomly masked input tokens) and next sentence prediction (NSP, predicting if two input sentences are adjacent to each other). In fine-tuning for downstream applications, one or more fully-connected layers are typically added on top of the final encoder layer.

Figure 1: BERT fine-tuning (Devlin et al., 2019).
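To make the layer description above concrete, here is a minimal sketch of one such encoder layer in plain PyTorch. This is an illustration for the reader, not BERT's reference implementation; the hidden size of 768 and 12 heads correspond to BERT-base, and dropout is omitted for brevity.

```python
import torch
import torch.nn as nn

class EncoderLayer(nn.Module):
    def __init__(self, hidden=768, heads=12, ffn=3072):
        super().__init__()
        # the heads compute their query/key/value projections inside this module
        self.attn = nn.MultiheadAttention(hidden, heads, batch_first=True)
        self.norm1 = nn.LayerNorm(hidden)
        # position-wise fully-connected sublayer applied to the combined head outputs
        self.ffn = nn.Sequential(nn.Linear(hidden, ffn), nn.GELU(), nn.Linear(ffn, hidden))
        self.norm2 = nn.LayerNorm(hidden)

    def forward(self, x):                    # x: (batch, seq_len, hidden)
        attn_out, _ = self.attn(x, x, x)     # self-attention: queries = keys = values = x
        x = self.norm1(x + attn_out)         # skip connection, then layer normalization
        x = self.norm2(x + self.ffn(x))      # same wrapping for the feed-forward part
        return x

layer = EncoderLayer()
print(layer(torch.randn(2, 16, 768)).shape)  # torch.Size([2, 16, 768])
```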
The input representations are computed as follows: BERT first tokenizes the given word into wordpieces (Wu et al., 2016b), and then combines three embedding layers (token, position, and segment) to obtain a fixed-length vector. Special token [CLS] is used for classification predictions, and [SEP] separates input segments. The original BERT comes in two versions: base and large, varying in the number of layers, their hidden size, and the number of attention heads.
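The following small illustration of this input pipeline assumes the HuggingFace transformers library (which is not part of the paper itself); the exact wordpieces and ids depend on the tokenizer version.

```python
from transformers import BertTokenizer

tok = BertTokenizer.from_pretrained("bert-base-uncased")

# wordpiece tokenization: rare words are split into pieces, continuations are prefixed with "##"
print(tok.tokenize("BERTology synthesizes analyses"))

# encoding a sentence pair adds [CLS] / [SEP] and segment ids (token_type_ids)
enc = tok("Cats like to chase mice.", "Dogs prefer sticks.")
print(tok.convert_ids_to_tokens(enc["input_ids"]))
print(enc["token_type_ids"])   # 0 for the first segment, 1 for the second
```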
3 BERT embeddings

Unlike the conventional static embeddings (Mikolov et al., 2013a; Pennington et al., 2014), BERT's representations are contextualized, i.e., every input token is represented by a vector dependent on the particular context of occurrence. In the current studies of BERT's representation space, the term 'embedding' refers to the output vector of a given (typically final) Transformer layer.

Wiedemann et al. (2019) find that BERT's contextualized embeddings form distinct and clear clusters corresponding to word senses, which confirms that the basic distributional hypothesis holds for these representations. However, Mickus et al. (2019) note that representations of the same word vary depending on the position of the sentence in which it occurs, likely due to the NSP objective.

Ethayarajh (2019) measures how similar the embeddings for identical words are in every layer and finds that later BERT layers produce more context-specific representations. They also find that BERT embeddings occupy a narrow cone in the vector space, and this effect increases from lower to higher layers. That is, two random words will on average have a much higher cosine similarity than expected if embeddings were directionally uniform (isotropic).
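A minimal sketch of how such analyses obtain contextual embeddings is given below, assuming the HuggingFace transformers library: the same word receives different vectors in different contexts, and their cosine similarity can be compared layer by layer. This is an illustration only, not the protocol of any specific study.

```python
import torch
from transformers import BertTokenizer, BertModel

tok = BertTokenizer.from_pretrained("bert-base-uncased")
model = BertModel.from_pretrained("bert-base-uncased", output_hidden_states=True)
model.eval()

def word_vectors(sentence, word):
    inputs = tok(sentence, return_tensors="pt")
    idx = tok.convert_ids_to_tokens(inputs["input_ids"][0].tolist()).index(word)
    with torch.no_grad():
        hidden = model(**inputs).hidden_states   # tuple: embedding layer + 12 encoder layers
    return [h[0, idx] for h in hidden]

a = word_vectors("he sat by the river bank", "bank")
b = word_vectors("she robbed the bank downtown", "bank")
for layer, (va, vb) in enumerate(zip(a, b)):
    print(layer, torch.cosine_similarity(va, vb, dim=0).item())
```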
4 What knowledge does BERT have?

A number of studies have looked at the types of knowledge encoded in BERT's weights. The popular approaches include fill-in-the-gap probes of BERT's MLM, analysis of self-attention weights, and probing classifiers using different BERT representations as inputs.
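As a schematic example of the third approach, a probing classifier is a small model trained on frozen BERT features to predict a linguistic label. The sketch below (assuming HuggingFace transformers and scikit-learn) uses placeholder names sentences, target_positions and labels for a real annotated dataset; it is not the setup of any particular study.

```python
import torch
from sklearn.linear_model import LogisticRegression
from transformers import BertTokenizer, BertModel

tok = BertTokenizer.from_pretrained("bert-base-uncased")
bert = BertModel.from_pretrained("bert-base-uncased", output_hidden_states=True).eval()

def features(sentence, position, layer):
    """Frozen representation of the token at `position`, taken from `layer`."""
    with torch.no_grad():
        out = bert(**tok(sentence, return_tensors="pt"))
    return out.hidden_states[layer][0, position].numpy()

def probe_layer(sentences, target_positions, labels, layer):
    X = [features(s, p, layer) for s, p in zip(sentences, target_positions)]
    clf = LogisticRegression(max_iter=1000).fit(X, labels)
    return clf.score(X, labels)   # in practice, evaluated on held-out data
```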
4.1 Syntactic knowledge

Lin et al. (2019) showed that BERT representations are hierarchical rather than linear, i.e. there is something akin to syntactic tree structure in addition to the word order information. Tenney et al. (2019b) and Liu et al. (2019a) also showed that BERT embeddings encode information about parts of speech, syntactic chunks and roles. However, BERT's knowledge of syntax is partial, since probing classifiers could not recover the labels of distant parent nodes in the syntactic tree (Liu et al., 2019a).

As far as how syntactic information is represented, it seems that syntactic structure is not directly encoded in self-attention weights, but they can be transformed to reflect it. Htut et al. (2019) were unable to extract full parse trees from BERT heads even with the gold annotations for the root. Jawahar et al. (2019) include a brief illustration of a dependency tree extracted directly from self-attention weights, but provide no quantitative evaluation. However, Hewitt and Manning (2019) were able to learn transformation matrices that would successfully recover much of the Stanford Dependencies formalism for PennTreebank data (see Figure 2). Jawahar et al. (2019) try to approximate BERT representations with Tensor Product Decomposition Networks (McCoy et al., 2019a), concluding that the dependency trees are the best match among 5 decomposition schemes (although the reported MSE differences are very small).

Figure 2: Parse trees recovered from BERT representations by Hewitt et al. (2019)
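A simplified sketch of this kind of learned transformation is given below. It follows the general form of Hewitt and Manning's structural probe: a linear map B is trained so that squared L2 distances between transformed token vectors approximate distances between words in the gold parse tree. Data loading and the gold tree-distance computation are left out as placeholders.

```python
import torch
import torch.nn as nn

class StructuralProbe(nn.Module):
    def __init__(self, hidden=768, rank=128):
        super().__init__()
        self.B = nn.Linear(hidden, rank, bias=False)   # the learned transformation matrix

    def forward(self, h):                  # h: (seq_len, hidden) for one sentence
        t = self.B(h)                      # (seq_len, rank)
        diff = t.unsqueeze(0) - t.unsqueeze(1)
        return (diff ** 2).sum(-1)         # predicted squared distances, (seq_len, seq_len)

probe = StructuralProbe()
opt = torch.optim.Adam(probe.parameters(), lr=1e-3)

def probe_loss(hidden_states, tree_distances):      # tree_distances: gold parse distances
    pred = probe(hidden_states)
    return (pred - tree_distances).abs().mean()      # L1 between predicted and gold

# one training step, given hidden states h and gold distances d for a sentence:
# loss = probe_loss(h, d); loss.backward(); opt.step(); opt.zero_grad()
```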
Regarding syntactic competence of BERT's MLM, Goldberg (2019) showed that BERT takes subject-predicate agreement into account when performing the cloze task. This was the case even for sentences with distractor clauses between the subject and the verb, and for meaningless sentences. A study of negative polarity items (NPIs) by Warstadt et al. (2019) showed that BERT is better able to detect the presence of NPIs (e.g. "ever") and the words that allow their use (e.g. "whether") than scope violations.

The above evidence of syntactic knowledge is belied by the fact that BERT does not "understand" negation and is insensitive to malformed input. In particular, its predictions were not altered even with shuffled word order, truncated sentences, and removed subjects and objects (Ettinger, 2019). This is in line with the recent findings on adversarial attacks, with models disturbed by nonsensical inputs (Wallace et al., 2019a), and suggests that BERT's encoding of syntactic structure does not indicate that it actually relies on that knowledge.

4.2 Semantic knowledge

To date, more studies were devoted to BERT's knowledge of syntactic rather than semantic phenomena. However, we do have evidence from an MLM probing study that BERT has some knowledge of semantic roles (Ettinger, 2019). BERT even prefers incorrect fillers for semantic roles that are semantically related to the correct ones over fillers that are unrelated (e.g. "to tip a chef" should be better than "to tip a robin", but worse than "to tip a waiter").

Tenney et al. (2019b) showed that BERT encodes information about entity types, relations, semantic roles, and proto-roles, since this information can be detected with probing classifiers.

BERT struggles with representations of numbers. Addition and number decoding tasks showed that BERT does not form good representations for floating point numbers and fails to generalize away from the training data (Wallace et al., 2019b). A part of the problem is BERT's wordpiece tokenization, since numbers of similar values can be divided up into substantially different word chunks.

4.3 World knowledge

The MLM component of BERT is easy to adapt for knowledge induction by filling in the blanks (e.g. "Cats like to chase [ ]"). There is at least one probing study of world knowledge in BERT (Ettinger, 2019), but the bulk of evidence comes from numerous practitioners using BERT to extract such knowledge.
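A concrete example of such a fill-in-the-blank probe, assuming the HuggingFace transformers library and its fill-mask pipeline, is shown below; the template is the one mentioned above, and the returned completions will vary by model version.

```python
from transformers import pipeline

unmasker = pipeline("fill-mask", model="bert-base-uncased")
for pred in unmasker("Cats like to chase [MASK]."):
    print(round(pred["score"], 3), pred["token_str"])   # top MLM completions with scores
```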
Figure 3: BERT's world knowledge (Petroni et al., 2019)

Petroni et al. (2019) showed that, for some relation types, vanilla BERT is competitive with methods relying on knowledge bases (Figure 3). Davison et al. (2019) suggest that it generalizes better to unseen data. However, to retrieve BERT's knowledge we need good template sentences, and there is work on their automatic extraction and augmentation (Bouraoui et al., 2019; Jiang et al.).

However, BERT cannot reason based on its world knowledge. Forbes et al. (2019) show that BERT can "guess" the affordances and properties of many objects, but does not have the information about their interactions (e.g. it "knows" that people can walk into houses, and that houses are big, but it cannot infer that houses are bigger than people). Zhou et al. (2020) and Richardson and Sabharwal (2019) also show that the performance drops with the number of necessary inference steps. At the same time, Poerner et al. (2019) show that some of BERT's success in factoid knowledge retrieval comes from learning stereotypical character combinations, e.g. it would predict that a person with an Italian-sounding name is Italian, even when it is factually incorrect.

5 Localizing linguistic knowledge

5.1 Self-attention heads

Attention is widely considered to be useful for understanding Transformer models, and several studies proposed classifications of attention head types:

• attending to the word itself, to previous/next words and to the end of the sentence (Raganato and Tiedemann, 2018);

• attending to previous/next tokens, [CLS], [SEP], punctuation, and "attending broadly" over the sequence (Clark et al., 2019);

• the 5 attention types shown in Figure 4 (Kovaleva et al., 2019).
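A minimal sketch of how such head-typology studies obtain their raw material (per-layer, per-head self-attention maps) is shown below, assuming the HuggingFace transformers library; it is an illustration, not the code of any cited study.

```python
import torch
from transformers import BertTokenizer, BertModel

tok = BertTokenizer.from_pretrained("bert-base-uncased")
model = BertModel.from_pretrained("bert-base-uncased", output_attentions=True).eval()

inputs = tok("Cats like to chase mice.", return_tensors="pt")
with torch.no_grad():
    attentions = model(**inputs).attentions   # tuple of 12 layer tensors

# attentions[layer] has shape (batch, num_heads, seq_len, seq_len);
# e.g. the map of head 3 in layer 5, rows = "from" tokens, columns = "to" tokens:
print(attentions[5][0, 3])
```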
Figure 4: Attention patterns in BERT (Kovaleva et al., 2019): vertical, diagonal, vertical + diagonal, block, and heterogeneous.

According to Clark et al. (2019), "attention weight has a clear meaning: how much a particular word will be weighted when computing the next representation for the current word". However, Kovaleva et al. (2019) showed that most self-attention heads do not directly encode any non-trivial linguistic information, since less than half of them had the "heterogeneous" pattern². Much of the model encoded the vertical pattern (attention to [CLS], [SEP], and punctuation tokens), consistent with the observations by Clark et al. (2019). This apparent redundancy must be related to the overparametrization issue (see section 7).

² The experiments were conducted with BERT fine-tuned on GLUE tasks (Wang et al., 2018).

Attention to [CLS] is easy to interpret as attention to an aggregated sentence-level representation, but BERT also attends a lot to [SEP] and punctuation. Clark et al. (2019) hypothesize that periods and commas are simply almost as frequent as [CLS] and [SEP], and the model learns to rely on them. They also suggest that the function of [SEP] might be one of "no-op", a signal to ignore the head if its pattern is not applicable to the current case. [SEP] gets increased attention starting in layer 5, but its importance for prediction drops. If this hypothesis is correct, attention probing studies that excluded the [SEP] and [CLS] tokens (as e.g. Lin et al. (2019) and Htut et al. (2019)) should perhaps be revisited.

Proceeding to the analysis of the "heterogeneous" self-attention pattern, a number of studies looked for specific BERT heads with linguistically interpretable functions.

Some BERT heads seem to specialize in certain types of syntactic relations. Htut et al. (2019) and Clark et al. (2019) report that there are BERT heads that attended significantly more than a random baseline to words in certain syntactic positions. The datasets and methods used in these studies differ, but they both find that there are heads that attend to words in obj role more than the positional baseline. The evidence for nsubj, advmod, and amod has some variation between these two studies. The overall conclusion is also supported by Voita et al. (2019)'s data for the base Transformer in the machine translation context. Hoover et al. (2019) hypothesize that even complex dependencies like dobj are encoded by a combination of heads rather than a single head, but this work is limited to qualitative analysis.

Both Clark et al. (2019) and Htut et al. (2019) conclude that no single head has the complete syntactic tree information, in line with evidence of partial knowledge of syntax (see subsection 4.1).

Lin et al. (2019) present evidence that attention weights are weak indicators of subject-verb agreement and reflexive anaphora. Instead of serving as strong pointers between tokens that should be related, BERT's self-attention weights were close to a uniform attention baseline, but there was some sensitivity to different types of distractors coherent with psycholinguistic data.

Clark et al. (2019) identify a BERT head that can be directly used as a classifier to perform coreference resolution on par with a rule-based system.

Kovaleva et al. (2019) showed that even when attention heads specialize in tracking semantic relations, they do not necessarily contribute to BERT's performance on relevant tasks. Kovaleva et al. (2019) identified two heads of base BERT in which self-attention maps were closely aligned with annotations of core frame semantic relations (Baker et al., 1998). Although such relations should have been instrumental to tasks such as inference, a head ablation study showed that these heads were not essential for BERT's success on GLUE tasks.

5.2 BERT layers

The first layer of BERT receives as input representations that are a combination of token, segment, and positional embeddings. It stands to reason that the lower layers have the most linear word order information. Lin et al. (2019) report a decrease in the knowledge of linear word order around layer 4 in BERT-base. This is accompanied by increased knowledge of hierarchical sentence structure, as detected by the probing tasks of predicting the index of a token, the main auxiliary verb and the sentence subject.
There is a wide consensus among studies with different tasks, datasets and methodologies that syntactic information is the most prominent in the middle BERT³ layers. Hewitt and Manning (2019) had the most success reconstructing syntactic tree depth from the middle BERT layers (6-9 for base-BERT, 14-19 for BERT-large). Goldberg (2019) reports the best subject-verb agreement around layers 8-9, and the performance on syntactic probing tasks used by Jawahar et al. (2019) also seemed to peak around the middle of the model.

³ These BERT results are also compatible with findings by Vig and Belinkov (2019), who report the highest attention to tokens in dependency relations in the middle layers of GPT-2.

The prominence of syntactic information in the middle BERT layers must be related to Liu et al. (2019a)'s observation that the middle layers of Transformers are overall the best-performing and the most transferable across tasks (see Figure 5).

Figure 5: BERT layer transferability (columns correspond to probing tasks) (Liu et al., 2019a).

There is conflicting evidence about syntactic chunks. Tenney et al. (2019a) conclude that "the basic syntactic information appears earlier in the network while high-level semantic features appears at the higher layers", drawing parallels between this order and the order of components in a typical NLP pipeline, from POS-tagging to dependency parsing to semantic role labeling. Jawahar et al. (2019) also report that the lower layers were more useful for chunking, while middle layers were more useful for parsing. At the same time, the probing experiments by Liu et al. (2019a) find the opposite: both POS-tagging and chunking were also performed best at the middle layers, in both BERT-base and BERT-large.

The final layers of BERT are the most task-specific. In pre-training, this means specificity to the MLM task, which would explain why the middle layers are more transferable (Liu et al., 2019a). In fine-tuning, it explains why the final layers change the most (Kovaleva et al., 2019). At the same time, Hao et al. (2019) report that if the weights of lower layers of the fine-tuned BERT are restored to their original values, it does not dramatically hurt the model performance.

Tenney et al. (2019a) suggest that while most of syntactic information can be localized in a few layers, semantics is spread across the entire model, which would explain why certain non-trivial examples get solved incorrectly at first but correctly at higher layers. This is rather to be expected: semantics permeates all language, and linguists debate whether meaningless structures can exist at all (Goldberg, 2006, p.166-182). But this raises the question of what stacking more Transformer layers actually achieves in BERT in terms of the spread of semantic knowledge, and whether that is beneficial. The authors' comparison between base and large BERTs shows that the overall pattern of cumulative score gains is the same, only more spread out in the large BERT.

The above view is disputed by Jawahar et al. (2019), who place "surface features in lower layers, syntactic features in middle layers and semantic features in higher layers". However, the conclusion with regards to the semantic features seems surprising, given that only one SentEval semantic task in this study actually topped at the last layer, and three others peaked around the middle and then considerably degraded by the final layers.

6 Training BERT

This section reviews the proposals to optimize the training and architecture of the original BERT.

6.1 Pre-training BERT

The original BERT is a bidirectional Transformer pre-trained on two tasks: next sentence prediction (NSP) and masked language model (MLM). Multiple studies have come up with alternative training objectives to improve on BERT.
• Removing NSP does not hurt or slightly improves task performance (Liu et al., 2019b; Joshi et al., 2020; Clinchant et al., 2019), especially in the cross-lingual setting (Wang et al., 2019b). Wang et al. (2019a) replace NSP with the task of predicting both the next and the previous sentences. Lan et al. (2020) replace the negative NSP examples by the swapped sentences from positive examples, rather than sentences from different documents.

• Dynamic masking (Liu et al., 2019b) improves on BERT's MLM by using diverse masks for training examples within an epoch (see the masking sketch after this list);

• Beyond-sentence MLM. Lample and Conneau (2019) replace sentence pairs with arbitrary text streams, and subsample frequent outputs similarly to Mikolov et al. (2013b).

• Permutation language modeling. Yang et al. (2019) replace MLM with training on different permutations of word order in the input sequence, maximizing the probability of the original word order. See also the n-gram word order reconstruction task (Wang et al., 2019a).

• Span boundary objective aims to predict a masked span (rather than single words) using only the representations of the tokens at the span's boundary (Joshi et al., 2020);

• Phrase masking and named entity masking (Zhang et al., 2019) aim to improve representation of structured knowledge by masking entities rather than individual words;

• Continual learning is sequential pre-training on a large number of tasks⁴, each with their own loss, which are then combined to continually update the model (Sun et al., 2019b).

• Conditional MLM by Wu et al. (2019b) replaces the segmentation embeddings with "label embeddings", which also include the label for a given sentence from an annotated task dataset (e.g. sentiment analysis).

• Clinchant et al. (2019) propose replacing the MASK token with the [UNK] token, as this could help the model to learn a certain representation for unknowns that could be exploited by a neural machine translation model.

⁴ New token-level tasks in ERNIE include predicting whether a token is capitalized and whether it occurs in other segments of the same document. Segment-level tasks include sentence reordering, sentence distance prediction, and supervised discourse relation classification.
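The toy sketch below makes the MLM masking objectives above concrete: 15% of the tokens are selected for prediction, and because the selection is re-sampled every time a batch is built, each epoch sees different masks ("dynamic masking"). The proportions follow Devlin et al. (2019); this is an illustration, not the RoBERTa implementation.

```python
import random

def mask_tokens(tokens, mask_token="[MASK]", mask_prob=0.15):
    masked, labels = list(tokens), [None] * len(tokens)
    for i, tok in enumerate(tokens):
        if random.random() < mask_prob:
            labels[i] = tok                       # the model is trained to predict this token
            r = random.random()
            if r < 0.8:
                masked[i] = mask_token            # 80%: replace with [MASK]
            elif r < 0.9:
                masked[i] = random.choice(tokens) # 10%: a random token (from the full vocabulary in the real setup)
            # remaining 10%: keep the original token
    return masked, labels

# re-sampling on every call yields different masks for the same example:
print(mask_tokens("cats like to chase mice".split()))
print(mask_tokens("cats like to chase mice".split()))
```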
Another obvious source of improvement is pre-training data. Liu et al. (2019c) explore the benefits of increasing the corpus volume and longer training. The data also does not have to be unstructured text: although BERT is actively used as a source of world knowledge (subsection 4.3), there are ongoing efforts to incorporate structured knowledge resources (Peters et al., 2019a).

Another way to integrate external knowledge is to use entity embeddings as input, as in E-BERT (Poerner et al., 2019) and ERNIE (Zhang et al., 2019). Alternatively, SemBERT (Zhang et al., 2020) integrates semantic role information with BERT representations.

Figure 6: Pre-trained weights help BERT find wider optima in fine-tuning on MRPC (right) than training from scratch (left) (Hao et al., 2019)

Pre-training is the most expensive part of training BERT, and it would be informative to know how much benefit it provides. Hao et al. (2019) conclude that pre-trained weights help the fine-tuned BERT find wider and flatter areas with smaller generalization error, which makes the model more robust to overfitting (see Figure 6). However, on some tasks a randomly initialized and fine-tuned BERT obtains competitive or higher results than the pre-trained BERT with the task classifier and frozen weights (Kovaleva et al., 2019).

6.2 Model architecture choices

To date, the most systematic study of BERT architecture was performed by Wang et al. (2019b). They experimented with the number of layers, heads, and model parameters, varying one option and freezing the others. They concluded that the number of heads was not as significant as the number of layers, which is consistent with the findings of Voita et al. (2019) and Michel et al. (2019), discussed in section 7, and also the observation by Liu et al. (2019a) that middle layers were the most transferable. Larger hidden representation size was consistently better, but the gains varied by setting.
Liu et al. (2019c) show that large-batch training (8k examples) improves both the language model perplexity and downstream task performance. They also publish their recommendations for other model parameters. You et al. (2019) report that with a batch size of 32k, BERT's training time can be significantly reduced with no degradation in performance. Zhou et al. (2019) observe that the embedding values of the trained [CLS] token are not centered around zero, and their normalization stabilizes the training, leading to a slight performance gain on text classification tasks.

Gong et al. (2019) note that, since self-attention patterns in higher layers resemble the ones in lower layers, the model training can be done in a recursive manner, where the shallower version is trained first and then the trained parameters are copied to deeper layers. Such a "warm start" can lead to a 25% faster training speed while reaching similar accuracy to the original BERT on GLUE tasks.

6.3 Fine-tuning BERT

The pre-training + fine-tuning workflow is a crucial part of BERT. The former is supposed to provide task-independent linguistic knowledge, and the fine-tuning process would presumably teach the model to rely on the representations that are more useful for the task at hand.

Kovaleva et al. (2019) did not find that to be the case for BERT fine-tuned on GLUE tasks⁵: during fine-tuning for 3 epochs, most changes occurred in the last two layers of the models, but those changes caused self-attention to focus on [SEP] rather than on linguistically interpretable patterns. It is understandable why fine-tuning would increase the attention to [CLS], but not [SEP]. If Clark et al. (2019) are correct that [SEP] serves as a "no-op" indicator, fine-tuning basically tells BERT what to ignore.

⁵ See also experiments with multilingual BERT by Singh et al. (2019), where fine-tuning affected the top and the middle layers of the model.

Several studies explored the possibilities of improving the fine-tuning of BERT:

• Taking more layers into account. Yang and Zhao (2019) learn a complementary representation of the information in the deeper layers that is combined with the output layer. Su and Cheng (2019) propose using a weighted representation of all layers instead of the final layer output.

• Two-stage fine-tuning introduces an intermediate supervised training stage between pre-training and fine-tuning (Phang et al., 2019; Garg et al., 2020).

• Adversarial token perturbations improve robustness of the model (Zhu et al., 2019).

With larger and larger models even fine-tuning becomes expensive, but Houlsby et al. (2019) show that it can be successfully approximated by inserting adapter modules. They adapt BERT to 26 classification tasks, achieving competitive performance at a fraction of the computational cost. Artetxe et al. (2019) also find adapters helpful in reusing monolingual BERT weights for cross-lingual transfer.
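A minimal sketch of the adapter idea is given below: a small bottleneck layer with a residual connection is inserted into each Transformer layer, and only the adapters (plus layer norms and the task head) are trained while the pre-trained weights stay frozen. This illustrates the concept described by Houlsby et al. (2019), not their code.

```python
import torch
import torch.nn as nn

class Adapter(nn.Module):
    def __init__(self, hidden=768, bottleneck=64):
        super().__init__()
        self.down = nn.Linear(hidden, bottleneck)   # project down to a small bottleneck
        self.up = nn.Linear(bottleneck, hidden)     # project back up to the hidden size
        self.act = nn.GELU()

    def forward(self, x):
        return x + self.up(self.act(self.down(x)))  # residual keeps the original signal intact

# Rough parameter count: two 768x64 projections per adapter,
# a small fraction of the ~110M parameters of BERT-base.
adapter = Adapter()
print(sum(p.numel() for p in adapter.parameters()))  # roughly 99k
```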
An alternative to fine-tuning is extracting features from frozen representations, but fine-tuning works better for BERT (Peters et al., 2019b).

Initialization can have a dramatic effect on the training process (Petrov, 2010). However, variation across initializations is not often reported, although the performance improvements claimed in many NLP modeling papers may be within the range of that variation (Crane, 2018). Dodge et al. (2020) report significant variation for BERT fine-tuned on GLUE tasks, where both weight initialization and training data order contribute to the variation. They also propose an early-stopping technique to avoid full fine-tuning for the less-promising seeds.

7 How big should BERT be?

7.1 Overparametrization

Transformer-based models keep increasing in size: e.g. T5 (Wu et al., 2016a) is over 30 times larger than the base BERT. This raises concerns about the computational complexity of self-attention (Wu et al., 2019a), environmental issues (Strubell et al., 2019; Schwartz et al., 2019), as well as reproducibility and access to research resources in academia vs. industry.

Human language is incredibly complex, and would perhaps take many more parameters to describe fully, but the current models do not make good use of the parameters they already have. Voita et al. (2019) showed that all but a few Transformer heads could be pruned without significant losses in performance. For BERT, Clark et al. (2019) observe that most heads in the same layer show similar self-attention patterns (perhaps related to the fact that the output of all self-attention heads in a layer is passed through the same MLP), which explains why Michel et al. (2019) were able to reduce most layers to a single head.
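The basic setup behind such head-ablation studies can be illustrated as follows, assuming the HuggingFace transformers library: a binary head mask switches off selected attention heads at inference time, and the effect on the downstream metric is then measured. The heads disabled here are chosen arbitrarily, purely for illustration.

```python
import torch
from transformers import BertTokenizer, BertModel

tok = BertTokenizer.from_pretrained("bert-base-uncased")
model = BertModel.from_pretrained("bert-base-uncased").eval()

head_mask = torch.ones(12, 12)   # (num_layers, num_heads); 1 = keep, 0 = disable
head_mask[5, 3] = 0              # e.g. switch off head 3 in layer 5
head_mask[5, 7] = 0

inputs = tok("Cats like to chase mice.", return_tensors="pt")
with torch.no_grad():
    full = model(**inputs).last_hidden_state
    ablated = model(**inputs, head_mask=head_mask).last_hidden_state
print((full - ablated).abs().max())   # how much the representations change
```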
Type         | Compression study                  | Compression | Performance | Speedup | Model     | Evaluation
Distillation | DistilBERT (Sanh et al., 2019)     | ×2.5        | 90%         | ×1.6    | BERT6     | All GLUE tasks
Distillation | BERT6-PKD (Sun et al., 2019a)      | ×1.6        | 97%         | ×1.9    | BERT6     | No WNLI, CoLA and STS-B
Distillation | BERT3-PKD (Sun et al., 2019a)      | ×2.4        | 92%         | ×3.7    | BERT3     | No WNLI, CoLA and STS-B
Distillation | (Aguilar et al., 2019)             | ×2          | 94%         | -       | BERT6     | CoLA, MRPC, QQP, RTE
Distillation | BERT-48 (Zhao et al., 2019)        | ×62         | 87%         | ×77     | BERT12*†  | MNLI, MRPC, SST-2
Distillation | BERT-192 (Zhao et al., 2019)       | ×5.7        | 94%         | ×22     | BERT12*†  | MNLI, MRPC, SST-2
Distillation | TinyBERT (Jiao et al., 2019)       | ×7.5        | 96%         | ×9.4    | BERT4*†   | All GLUE tasks
Distillation | MobileBERT (Sun et al.)            | ×4.3        | 100%        | ×4      | BERT24†   | No WNLI
Distillation | PD (Turc et al., 2019)             | ×1.6        | 98%         | ×2.53   | BERT6†    | No WNLI, CoLA and STS-B
Distillation | MiniBERT (Tsai et al., 2019)       | ×6§         | 98%         | ×27§    | mBERT3†   | CoNLL-2018 POS and morphology
Distillation | BiLSTM soft (Tang et al., 2019)    | ×110        | 91%         | ×434‡   | BiLSTM1   | MNLI, QQP, SST-2
Quantization | Q-BERT (Shen et al., 2019)         | ×13         | 99%         | -       | BERT12    | MNLI, SST-2
Quantization | Q8BERT (Zafrir et al., 2019)       | ×4          | 99%         | -       | BERT12    | All GLUE tasks
Other        | ALBERT-base (Lan et al., 2019)     | ×9          | 97%         | ×5.6    | BERT12**  | MNLI, SST-2
Other        | ALBERT-xxlarge (Lan et al., 2019)  | ×0.47       | 107%        | ×0.3    | BERT12**  | MNLI, SST-2
Other        | BERT-of-Theseus (Xu et al., 2020)  | ×1.6        | 98%         | -       | BERT6     | No WNLI

Table 1: Comparison of BERT compression studies. Compression, performance retention, and inference time speedup figures are given with respect to BERTbase, unless indicated otherwise. Performance retention is measured as a ratio of average scores achieved by a given model and by BERTbase. The subscript in the model description reflects the number of layers used. * Smaller vocabulary used. † The dimensionality of the hidden layers is reduced. ** The dimensionality of the embedding layer is reduced. ‡ Compared to BERTlarge. § Compared to mBERT.

Depending on the task, some BERT heads/layers are not only useless, but also harmful to the downstream task performance. Positive effects from disabling heads were reported for machine translation (Michel et al., 2019), and for GLUE tasks, both heads and layers could be disabled (Kovaleva et al., 2019). Additionally, Tenney et al. (2019a) examine the cumulative gains of their structural probing classifier, observing that in 5 out of 8 probing tasks some layers cause a drop in scores (typically in the final layers).

Many experiments comparing BERT-base and BERT-large saw the larger model perform better (Liu et al., 2019a), but that is not always the case. In particular, the opposite was observed for subject-verb agreement (Goldberg, 2019) and sentence subject detection (Lin et al., 2019).

Given the complexity of language, and the amounts of pre-training data, it is not clear why BERT ends up with redundant heads and layers. Clark et al. (2019) suggest that one of the possible reasons is the use of attention dropout, which causes some attention weights to be zeroed out during training.

7.2 BERT compression

Given the above evidence of overparametrization, it does not come as a surprise that BERT can be efficiently compressed with minimal accuracy loss. Such efforts to date are summarized in Table 1. The two main approaches are knowledge distillation and quantization.

The studies in the knowledge distillation framework (Hinton et al., 2015) use a smaller student network that is trained to mimic the behavior of a larger teacher network (BERT-large or BERT-base). This is achieved through experiments with loss functions (Sanh et al., 2019; Jiao et al., 2019), mimicking the activation patterns of individual portions of the teacher network (Sun et al., 2019a), and knowledge transfer at different stages, either at pre-training (Turc et al., 2019; Jiao et al., 2019; Sun et al.) or at the fine-tuning stage (Jiao et al., 2019).
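A minimal sketch of the core distillation loss (Hinton et al., 2015) behind these student-teacher setups is shown below: the student is trained on a mixture of the hard-label cross-entropy and the KL divergence to the teacher's temperature-softened output distribution. Individual papers add further terms (e.g. hidden-state or attention matching), which are omitted here.

```python
import torch
import torch.nn.functional as F

def distillation_loss(student_logits, teacher_logits, labels, T=2.0, alpha=0.5):
    # soft targets: match the teacher's softened distribution
    soft = F.kl_div(
        F.log_softmax(student_logits / T, dim=-1),
        F.softmax(teacher_logits / T, dim=-1),
        reduction="batchmean",
    ) * (T * T)                              # standard temperature scaling of the gradient
    # hard targets: ordinary supervised loss on the gold labels
    hard = F.cross_entropy(student_logits, labels)
    return alpha * soft + (1 - alpha) * hard

s = torch.randn(4, 3)                        # student logits: 4 examples, 3 classes
t = torch.randn(4, 3)                        # teacher logits
y = torch.tensor([0, 2, 1, 0])
print(distillation_loss(s, t, y))
```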
The quantization approach aims to decrease BERT's memory footprint through lowering the precision of its weights (Shen et al., 2019; Zafrir et al., 2019). Note that this strategy often requires compatible hardware.
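As a small illustration of the general mechanism (not the quantization-aware schemes of the cited papers), PyTorch's post-training dynamic quantization converts the Linear layers of a BERT-like model to int8, assuming the HuggingFace transformers library:

```python
import torch
from transformers import BertModel

model = BertModel.from_pretrained("bert-base-uncased")
quantized = torch.quantization.quantize_dynamic(
    model, {torch.nn.Linear}, dtype=torch.qint8
)

# The Linear sublayers (e.g. the query projection in the first encoder layer)
# are now dynamically quantized int8 modules instead of fp32 Linear layers:
print(type(model.encoder.layer[0].attention.self.query))
print(type(quantized.encoder.layer[0].attention.self.query))
```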
Other techniques include decomposing BERT's embedding matrix into smaller matrices (Lan et al., 2019) and progressive module replacing (Xu et al., 2020).

8 Multilingual BERT

Multilingual BERT (mBERT⁶) is a version of BERT that was trained on Wikipedia in 104 languages (110K wordpiece vocabulary). Languages with a lot of data were subsampled, and some were super-sampled using exponential smoothing.

⁶ https://github.com/google-research/bert/blob/master/multilingual.md
mBERT performs surprisingly well in zero-shot transfer on many tasks (Wu and Dredze, 2019; Pires et al., 2019), although not in language generation (Rönnqvist et al., 2019). The model seems to naturally learn high-quality cross-lingual word alignments (Libovický et al., 2019), with caveats for open-class parts of speech (Cao et al., 2019). Adding more languages does not seem to harm the quality of representations (Artetxe et al., 2019).

mBERT generalizes across some scripts (Pires et al., 2019), and can retrieve parallel sentences, although Libovický et al. (2019) note that this task could be solvable by simple lexical matches. Pires et al. (2019) conclude that mBERT representation space shows some systematicity in between-language mappings, which makes it possible in some cases to "translate" between languages by shifting the representations by the average parallel sentence offset for a given language pair.

Figure 7: Language centroids of the mean-pooled mBERT representations (Libovický et al., 2019)
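A small sketch of the kind of representation behind Figure 7 is given below, assuming the HuggingFace transformers library: mean-pooled mBERT vectors for a sentence and its translation, which can then be compared directly or averaged into per-language centroids. This is an illustration only, not the procedure of the cited studies.

```python
import torch
from transformers import BertTokenizer, BertModel

tok = BertTokenizer.from_pretrained("bert-base-multilingual-cased")
model = BertModel.from_pretrained("bert-base-multilingual-cased").eval()

def mean_pooled(sentence):
    inputs = tok(sentence, return_tensors="pt")
    with torch.no_grad():
        hidden = model(**inputs).last_hidden_state   # (1, seq_len, 768)
    return hidden.mean(dim=1).squeeze(0)             # average over all tokens

en = mean_pooled("Cats like to chase mice.")
de = mean_pooled("Katzen jagen gerne Mäuse.")
print(torch.cosine_similarity(en, de, dim=0).item()) # parallel sentences tend to end up close
```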
mBERT is simply trained on a multilingual corpus, with no language IDs, but it encodes language identities (Wu and Dredze, 2019; Libovický et al., 2019), and adding the IDs in pre-training was not beneficial (Wang et al., 2019b). It is also aware of at least some typological language features (Libovický et al., 2019; Singh et al., 2019), and transfer between structurally similar languages works better (Wang et al., 2019b; Pires et al., 2019).

Singh et al. (2019) argue that if typological features structure its representation space, it could not be considered as an interlingua. However, Artetxe et al. (2019) show that cross-lingual transfer can be achieved by only retraining the input embeddings while keeping monolingual BERT weights, which suggests that even monolingual models learn generalizable linguistic abstractions.

At least some of the syntactic properties of English BERT hold for mBERT: its MLM is aware of 4 types of agreement in 26 languages (Bacon and Regier, 2019), and the main auxiliary of the sentence can be detected in German and Nordic languages (Rönnqvist et al., 2019).

Pires et al. (2019) and Wu and Dredze (2019) hypothesize that shared word-pieces help mBERT, based on experiments where the task performance correlated with the amount of shared vocabulary between languages. However, Wang et al. (2019b) dispute this account, showing that bilingual BERT models are not hampered by the lack of shared vocabulary. Artetxe et al. (2019) also show that cross-lingual transfer is possible by swapping the model vocabulary, without any shared word-pieces.

To date, the following proposals were made for improving mBERT:

• fine-tuning on multilingual datasets is improved by freezing the bottom layers (Wu and Dredze, 2019);

• improving word alignment in fine-tuning (Cao et al., 2019);

• translation language modeling (Lample and Conneau, 2019) is an alternative pre-training objective where words are masked in parallel sentence pairs (the model can attend to one or both sentences to solve the prediction task);

• Huang et al. (2019) combine 5 pre-training tasks (monolingual and cross-lingual MLM, translation language modeling, cross-lingual word recovery and paraphrase classification).

A fruitful research direction is using monolingual BERT directly in a cross-lingual setting. Clinchant et al. (2019) experiment with initializing the encoder part of the neural MT model with monolingual BERT weights. Artetxe et al. (2019) and Tran (2019) independently showed that mBERT does not have to be pre-trained on multiple languages: it is possible to freeze the Transformer weights and retrain only the input embeddings.

9 Discussion

9.1 Limitations

As shown in section 4, multiple probing studies report that BERT possesses a surprising amount of syntactic, semantic, and world knowledge.
However, as Tenney et al. (2019a) aptly stated, "the fact that a linguistic pattern is not observed by our probing classifier does not guarantee that it is not there, and the observation of a pattern does not tell us how it is used". There is also the issue of the tradeoff between the complexity of the probe and the tested hypothesis (Liu et al., 2019a). A more complex probe might be able to recover more information, but it becomes less clear whether we are still talking about the original model.

Furthermore, different probing methods may reveal complementary or even contradictory information, in which case a single test (as done in most studies) would not be sufficient (Warstadt et al., 2019). Certain methods might also favor a certain model, e.g., RoBERTa is trailing BERT with one tree extraction method, but leading with another (Htut et al., 2019).

Head and layer ablation studies (Michel et al., 2019; Kovaleva et al., 2019) inherently assume that certain knowledge is contained in heads/layers, but there is evidence of more diffuse representations spread across the full network: the gradual increase in accuracy on difficult semantic parsing tasks (Tenney et al., 2019a), and the absence of heads that would perform parsing "in general" (Clark et al., 2019; Htut et al., 2019). Ablations are also problematic if the same information is duplicated elsewhere in the network. To mitigate that, Michel et al. (2019) prune heads in the order set by a proxy importance score, and Voita et al. (2019) fine-tune the pre-trained Transformer with a regularized objective that has a head-disabling effect.

Many papers are accompanied by attention visualizations, with a growing number of visualization tools (Vig, 2019; Hoover et al., 2019). However, there is ongoing debate on the merits of attention as a tool for interpreting deep learning models (Jain and Wallace, 2019; Serrano and Smith, 2019; Wiegreffe and Pinter, 2019; Brunner et al., 2020). Also, visualization is typically limited to qualitative analysis (Belinkov and Glass, 2019), and should not be interpreted as definitive evidence.

9.2 Directions for further research

BERTology has clearly come a long way, but it is fair to say we still have more questions than answers about how BERT works. In this section, we list what we believe to be the most promising directions for further research, together with the starting points that we already have.

Benchmarks that require verbal reasoning. While BERT enabled breakthroughs on many NLP benchmarks, a growing list of analysis papers is showing that its verbal reasoning abilities are not as impressive as they seem. In particular, it was shown to rely on shallow heuristics in both natural language inference (McCoy et al., 2019b; Zellers et al., 2019) and reading comprehension (Si et al., 2019; Rogers et al., 2020; Sugawara et al., 2020). As with any optimization method, if there is a shortcut in the task, we have no reason to expect that BERT will not learn it. To overcome this, the NLP community needs to incentivize dataset development on par with modeling work, which at present is often perceived as more prestigious.

Developing methods to "teach" reasoning. While the community has had success extracting knowledge from large pre-trained models, they often fail if any reasoning needs to be performed on top of the facts they possess (see subsection 4.3). For instance, Richardson et al. (2019) propose a method to "teach" BERT quantification, conditionals, comparatives, and boolean coordination.

Learning what happens at inference time. Most of the BERT analysis papers focused on different probes of the model, but we know much less about what knowledge actually gets used. At the moment, we know that the knowledge represented in BERT does not necessarily get used in downstream tasks (Kovaleva et al., 2019). As starting points for work in this direction, we also have other head ablation studies (Voita et al., 2019; Michel et al., 2019) and studies of how BERT behaves in the reading comprehension task (van Aken et al., 2019; Arkhangelskaia and Dutta, 2019).

10 Conclusion

In a little over a year, BERT has become a ubiquitous baseline in NLP engineering experiments and inspired numerous studies analyzing the model and proposing various improvements. The stream of papers seems to be accelerating rather than slowing down, and we hope that this survey will help the community to focus on the biggest unresolved questions.

References

Gustavo Aguilar, Yuan Ling, Yu Zhang, Benjamin Yao, Xing Fan, and Edward Guo. 2019. Knowledge distil-
lation from internal representations. arXiv preprint Joe Davison, Joshua Feldman, and Alexander Rush.
arXiv:1910.03723. 2019. Commonsense Knowledge Mining from Pre-
trained Models. In Proceedings of the 2019 Con-
Betty van Aken, Benjamin Winter, Alexander Löser, ference on Empirical Methods in Natural Language
and Felix A Gers. 2019. How does BERT answer Processing and the 9th International Joint Confer-
questions? a layer-wise analysis of transformer rep- ence on Natural Language Processing (EMNLP-
resentations. In Proceedings of the 28th ACM Inter- IJCNLP), pages 1173–1178, Hong Kong, China. As-
national Conference on Information and Knowledge sociation for Computational Linguistics.
Management, pages 1823–1832.
Jacob Devlin, Ming-Wei Chang, Kenton Lee, and
Ekaterina Arkhangelskaia and Sourav Dutta. 2019. Kristina Toutanova. 2019. BERT: Pre-training of
Whatcha lookin’at? DeepLIFTing BERT’s at- deep bidirectional transformers for language under-
tention in question answering. arXiv preprint standing. In Proceedings of the 2019 Conference of
arXiv:1910.06431. the North American Chapter of the Association for
Computational Linguistics: Human Language Tech-
Mikel Artetxe, Sebastian Ruder, and Dani Yogatama.
nologies, Volume 1 (Long and Short Papers), pages
2019. On the Cross-lingual Transferability of Mono-
4171–4186.
lingual Representations. arXiv:1910.11856 [cs].

Geoff Bacon and Terry Regier. 2019. Does BERT Jesse Dodge, Gabriel Ilharco, Roy Schwartz, Ali
agree? evaluating knowledge of structure depen- Farhadi, Hannaneh Hajishirzi, and Noah Smith.
dence through agreement relations. arXiv preprint 2020. Fine-Tuning Pretrained Language Models:
arXiv:1908.09892. Weight Initializations, Data Orders, and Early Stop-
ping. arXiv:2002.06305 [cs].
Collin F. Baker, Charles J. Fillmore, and John B. Lowe.
1998. The Berkeley Framenet project. In Proceed- Kawin Ethayarajh. 2019. How contextual are contextu-
ings of the 17th International Conference on Compu- alized word representations? Comparing the geome-
tational Linguistics, volume 1, pages 86–90. Associ- try of BERT, ELMo, and GPT-2 embeddings. arXiv
ation for Computational Linguistics. preprint arXiv:1909.00512.

Yonatan Belinkov and James Glass. 2019. Analysis Allyson Ettinger. 2019. What BERT is not: Lessons
Methods in Neural Language Processing: A Survey. from a new suite of psycholinguistic diagnostics for
Transactions of the Association for Computational language models. arXiv:1907.13528 [cs].
Linguistics, 7:49–72.
Maxwell Forbes, Ari Holtzman, and Yejin Choi. 2019.
Zied Bouraoui, Jose Camacho-Collados, and Steven Do Neural Language Representations Learn Physi-
Schockaert. 2019. Inducing Relational Knowledge cal Commonsense? In Proceedings of the 41st An-
from BERT. arXiv:1911.12753 [cs]. nual Conference of the Cognitive Science Society
(CogSci 2019), page 7.
Gino Brunner, Yang Liu, Damian Pascual, Oliver
Richter, Massimiliano Ciaramita, and Roger Watten- Siddhant Garg, Thuy Vu, and Alessandro Moschitti.
hofer. 2020. On Identifiability in Transformers. In 2020. TANDA: Transfer and Adapt Pre-Trained
International Conference on Learning Representa- Transformer Models for Answer Sentence Selection.
tions. In AAAI.
Steven Cao, Nikita Kitaev, and Dan Klein. 2019. Mul-
Adele Goldberg. 2006. Constructions at Work: The
tilingual Alignment of Contextual Word Representa-
Nature of Generalization in Language. Oxford Uni-
tions. In International Conference on Learning Rep-
versity Press, USA.
resentations.

Kevin Clark, Urvashi Khandelwal, Omer Levy, and Yoav Goldberg. 2019. Assessing bert’s syntactic abili-
Christopher D Manning. 2019. What does BERT ties. arXiv preprint arXiv:1901.05287.
look at? An analysis of BERT’s attention. arXiv
preprint arXiv:1906.04341. Linyuan Gong, Di He, Zhuohan Li, Tao Qin, Li-
wei Wang, and Tieyan Liu. 2019. Efficient train-
Stephane Clinchant, Kweon Woo Jung, and Vassilina ing of BERT by progressively stacking. In Inter-
Nikoulina. 2019. On the use of BERT for Neu- national Conference on Machine Learning, pages
ral Machine Translation. In Proceedings of the 3rd 2337–2346.
Workshop on Neural Generation and Translation,
pages 108–117, Hong Kong. Association for Com- Yaru Hao, Li Dong, Furu Wei, and Ke Xu. 2019. Visu-
putational Linguistics. alizing and understanding the effectiveness of bert.
In Proceedings of the 2019 Conference on Empirical
Matt Crane. 2018. Questionable Answers in Question Methods in Natural Language Processing and the
Answering Research: Reproducibility and Variabil- 9th International Joint Conference on Natural Lan-
ity of Published Results. Transactions of the Associ- guage Processing (EMNLP-IJCNLP), pages 4134–
ation for Computational Linguistics, 6:241–252. 4143.
John Hewitt and Christopher D Manning. 2019. A Mandar Joshi, Danqi Chen, Yinhan Liu, Daniel S.
structural probe for finding syntax in word represen- Weld, Luke Zettlemoyer, and Omer Levy. 2020.
tations. In Proceedings of the 2019 Conference of SpanBERT: Improving Pre-training by Representing
the North American Chapter of the Association for and Predicting Spans.
Computational Linguistics: Human Language Tech-
nologies, Volume 1 (Long and Short Papers), pages Olga Kovaleva, Alexey Romanov, Anna Rogers, and
4129–4138. Anna Rumshisky. 2019. Revealing the Dark Secrets
of BERT. In Proceedings of the 2019 Conference on
Geoffrey Hinton, Oriol Vinyals, and Jeff Dean. 2015. Empirical Methods in Natural Language Processing
Distilling the knowledge in a neural network. arXiv and the 9th International Joint Conference on Natu-
preprint arXiv:1503.02531. ral Language Processing (EMNLP-IJCNLP), pages
4356–4365, Hong Kong, China. Association for
Benjamin Hoover, Hendrik Strobelt, and Sebastian Computational Linguistics.
Gehrmann. 2019. exBERT: A Visual Analysis Tool Guillaume Lample and Alexis Conneau. 2019.
to Explore Learned Representations in Transformers Cross-lingual Language Model Pretraining.
Models. arXiv:1910.05276 [cs]. arXiv:1901.07291 [cs].
Neil Houlsby, Andrei Giurgiu, Stanislaw Jastrzebski, Zhenzhong Lan, Mingda Chen, Sebastian Goodman,
Bruna Morrone, Quentin de Laroussilhe, Andrea Kevin Gimpel, Piyush Sharma, and Radu Soricut.
Gesmundo, Mona Attariyan, and Sylvain Gelly. 2019. Albert: A lite BERT for self-supervised
2019. Parameter-Efficient Transfer Learning for learning of language representations. arXiv preprint
NLP. arXiv:1902.00751 [cs, stat]. arXiv:1909.11942.

Phu Mon Htut, Jason Phang, Shikha Bordia, and Zhenzhong Lan, Mingda Chen, Sebastian Goodman,
Samuel R Bowman. 2019. Do attention heads in Kevin Gimpel, Piyush Sharma, and Radu Soricut.
BERT track syntactic dependencies? arXiv preprint 2020. ALBERT: A Lite BERT for Self-supervised
arXiv:1911.12246. Learning of Language Representations.
Jindřich Libovický, Rudolf Rosa, and Alexander Fraser.
Haoyang Huang, Yaobo Liang, Nan Duan, Ming Gong, 2019. How Language-Neutral is Multilingual
Linjun Shou, Daxin Jiang, and Ming Zhou. 2019. BERT? arXiv:1911.03310 [cs].
Unicoder: A Universal Language Encoder by Pre-
training with Multiple Cross-lingual Tasks. In Pro- Yongjie Lin, Yi Chern Tan, and Robert Frank. 2019.
ceedings of the 2019 Conference on Empirical Meth- Open sesame: Getting inside bert’s linguistic knowl-
ods in Natural Language Processing and the 9th In- edge. arXiv preprint arXiv:1906.01698.
ternational Joint Conference on Natural Language
Processing (EMNLP-IJCNLP), pages 2485–2494, Nelson F Liu, Matt Gardner, Yonatan Belinkov,
Hong Kong, China. Association for Computational Matthew Peters, and Noah A Smith. 2019a. Lin-
Linguistics. guistic knowledge and transferability of contextual
representations. arXiv preprint arXiv:1903.08855.
Sarthak Jain and Byron C. Wallace. 2019. Attention is Yinhan Liu, Myle Ott, Naman Goyal, Jingfei Du, Man-
not Explanation. In Proceedings of the 2019 Con- dar Joshi, Danqi Chen, Omer Levy, Mike Lewis,
ference of the North American Chapter of the Asso- Luke Zettlemoyer, and Veselin Stoyanov. 2019b.
ciation for Computational Linguistics: Human Lan- RoBERTa: A Robustly Optimized BERT Pretrain-
guage Technologies, Volume 1 (Long and Short Pa- ing Approach. arXiv:1907.11692 [cs].
pers), pages 3543–3556.
Yinhan Liu, Myle Ott, Naman Goyal, Jingfei Du, Man-
Ganesh Jawahar, Benoı̂t Sagot, Djamé Seddah, Samuel dar Joshi, Danqi Chen, Omer Levy, Mike Lewis,
Unicomb, Gerardo Iñiguez, Márton Karsai, Yannick Luke Zettlemoyer, and Veselin Stoyanov. 2019c.
Léo, Márton Karsai, Carlos Sarraute, Éric Fleury, Roberta: A robustly optimized BERT pretraining ap-
et al. 2019. What does BERT learn about the struc- proach. arXiv preprint arXiv:1907.11692.
ture of language? In 57th Annual Meeting of the As-
sociation for Computational Linguistics (ACL), Flo- R. Thomas McCoy, Tal Linzen, Ewan Dunbar, and
rence, Italy. Paul Smolensky. 2019a. RNNs implicitly imple-
ment tensor-product representations. In Interna-
Zhengbao Jiang, Frank F Xu, Jun Araki, and Graham tional Conference on Learning Representations.
Neubig. How Can We Know What Language Mod- R Thomas McCoy, Ellie Pavlick, and Tal Linzen.
els Know? 2019b. Right for the wrong reasons: Diagnosing
syntactic heuristics in natural language inference.
Xiaoqi Jiao, Yichun Yin, Lifeng Shang, Xin Jiang, arXiv preprint arXiv:1902.01007.
Xiao Chen, Linlin Li, Fang Wang, and Qun
Liu. 2019. TinyBERT: Distilling BERT for Paul Michel, Omer Levy, and Graham Neubig. 2019.
natural language understanding. arXiv preprint Are sixteen heads really better than one? arXiv
arXiv:1909.10351. preprint arXiv:1905.10650.
Timothee Mickus, Denis Paperno, Mathieu Constant, Nina Poerner, Ulli Waltinger, and Hinrich Schütze.
and Kees van Deemeter. 2019. What do you mean, 2019. Bert is not a knowledge base (yet): Fac-
bert? assessing BERT as a distributional semantics tual knowledge vs. name-based reasoning in unsu-
model. arXiv preprint arXiv:1911.05758. pervised qa. arXiv preprint arXiv:1911.03681.
Tomas Mikolov, Ilya Sutskever, Kai Chen, Greg S Cor- Alessandro Raganato and Jörg Tiedemann. 2018.
rado, and Jeff Dean. 2013a. Distributed representa- An Analysis of Encoder Representations in
tions of words and phrases and their compositional- Transformer-Based Machine Translation. In
ity. In Advances in neural information processing Proceedings of the 2018 EMNLP Workshop
systems, pages 3111–3119. BlackboxNLP: Analyzing and Interpreting Neural
Networks for NLP, pages 287–297, Brussels, Bel-
Tomas Mikolov, Ilya Sutskever, Kai Chen, Greg S. Cor- gium. Association for Computational Linguistics.
rado, and Jeff Dean. 2013b. Distributed representa-
tions of words and phrases and their compositional- Kyle Richardson, Hai Hu, Lawrence S Moss, and
ity. In Advances in Neural Information Processing Ashish Sabharwal. 2019. Probing natural lan-
Systems 26 (NIPS 2013), pages 3111–3119. guage inference models through semantic fragments.
arXiv preprint arXiv:1909.07521.
Jeffrey Pennington, Richard Socher, and Christopher D
Manning. 2014. Glove: Global vectors for word rep- Kyle Richardson and Ashish Sabharwal. 2019. What
resentation. In Proceedings of the 2014 conference Does My QA Model Know? Devising Controlled
on empirical methods in natural language process- Probes using Expert Knowledge. arXiv:1912.13337
ing (EMNLP), pages 1532–1543. [cs].
Matthew E. Peters, Mark Neumann, Robert Logan,
Anna Rogers, Olga Kovaleva, Matthew Downey, and
Roy Schwartz, Vidur Joshi, Sameer Singh, and
Anna Rumshisky. 2020. Getting Closer to AI Com-
Noah A. Smith. 2019a. Knowledge Enhanced Con-
plete Question Answering: A Set of Prerequisite
textual Word Representations. In Proceedings of
Real Tasks. In AAAI, page 11.
the 2019 Conference on Empirical Methods in Natu-
ral Language Processing and the 9th International Samuel Rönnqvist, Jenna Kanerva, Tapio Salakoski,
Joint Conference on Natural Language Process- and Filip Ginter. 2019. Is Multilingual BERT Flu-
ing (EMNLP-IJCNLP), pages 43–54, Hong Kong, ent in Language Generation? In Proceedings of the
China. Association for Computational Linguistics. First NLPL Workshop on Deep Learning for Natural
Matthew E. Peters, Sebastian Ruder, and Noah A. Language Processing, pages 29–36, Turku, Finland.
Smith. 2019b. To Tune or Not to Tune? Adapt- Linköping University Electronic Press.
ing Pretrained Representations to Diverse Tasks. In
Proceedings of the 4th Workshop on Representation Victor Sanh, Lysandre Debut, Julien Chaumond, and
Learning for NLP (RepL4NLP-2019), pages 7–14, Thomas Wolf. 2019. Distilbert, a distilled version
Florence, Italy. Association for Computational Lin- of BERT: smaller, faster, cheaper and lighter. arXiv
guistics. preprint arXiv:1910.01108.

Fabio Petroni, Tim Rocktäschel, Sebastian Riedel, Roy Schwartz, Jesse Dodge, Noah A. Smith, and Oren
Patrick Lewis, Anton Bakhtin, Yuxiang Wu, and Etzioni. 2019. Green AI. arXiv:1907.10597 [cs,
Alexander Miller. 2019. Language Models as stat].
Knowledge Bases? In Proceedings of the 2019 Con-
ference on Empirical Methods in Natural Language Sofia Serrano and Noah A. Smith. 2019. Is Attention
Processing and the 9th International Joint Confer- Interpretable? arXiv:1906.03731 [cs].
ence on Natural Language Processing (EMNLP-
Sheng Shen, Zhen Dong, Jiayu Ye, Linjian Ma, Zhewei
IJCNLP), pages 2463–2473, Hong Kong, China. As-
Yao, Amir Gholami, Michael W Mahoney, and Kurt
sociation for Computational Linguistics.
Keutzer. 2019. Q-BERT: Hessian based ultra low
Slav Petrov. 2010. Products of Random Latent Vari- precision quantization of BERT. arXiv preprint
able Grammars. In Human Language Technologies: arXiv:1909.05840.
The 2010 Annual Conference of the North American
Chapter of the Association for Computational Lin- Chenglei Si, Shuohang Wang, Min-Yen Kan, and Jing
guistics, pages 19–27, Los Angeles, California. As- Jiang. 2019. What does BERT learn from multiple-
sociation for Computational Linguistics. choice reading comprehension datasets? arXiv
preprint arXiv:1910.12391.
Jason Phang, Thibault Févry, and Samuel R. Bowman.
2019. Sentence Encoders on STILTs: Supplemen- Jasdeep Singh, Bryan McCann, Richard Socher, and
tary Training on Intermediate Labeled-data Tasks. Caiming Xiong. 2019. BERT is Not an Interlin-
arXiv:1811.01088 [cs]. gua and the Bias of Tokenization. In Proceedings
of the 2nd Workshop on Deep Learning Approaches
Telmo Pires, Eva Schlinger, and Dan Garrette. 2019. for Low-Resource NLP (DeepLo 2019), pages 47–
How multilingual is multilingual bert? arXiv 55, Hong Kong, China. Association for Computa-
preprint arXiv:1906.01502. tional Linguistics.
Emma Strubell, Ananya Ganesh, and Andrew McCal- Jesse Vig and Yonatan Belinkov. 2019. Analyzing the
lum. 2019. Energy and Policy Considerations for Structure of Attention in a Transformer Language
Deep Learning in NLP. In ACL 2019. Model. In Proceedings of the 2019 ACL Workshop
BlackboxNLP: Analyzing and Interpreting Neural
Ta-Chun Su and Hsiang-Chih Cheng. 2019. Sesame- Networks for NLP, pages 63–76, Florence, Italy. As-
BERT: Attention for Anywhere. arXiv:1910.03176 sociation for Computational Linguistics.
[cs].
Elena Voita, David Talbot, Fedor Moiseev, Rico Sen-
Saku Sugawara, Pontus Stenetorp, Kentaro Inui, and nrich, and Ivan Titov. 2019. Analyzing multi-
Akiko Aizawa. 2020. Assessing the Benchmark- head self-attention: Specialized heads do the heavy
ing Capacity of Machine Reading Comprehension lifting, the rest can be pruned. arXiv preprint
Datasets. In AAAI. arXiv:1905.09418.
Siqi Sun, Yu Cheng, Zhe Gan, and Jingjing Liu. 2019a.
Eric Wallace, Shi Feng, Nikhil Kandpal, Matt Gard-
Patient knowledge distillation for BERT model com-
ner, and Sameer Singh. 2019a. Universal Adversar-
pression. arXiv preprint arXiv:1908.09355.
ial Triggers for Attacking and Analyzing NLP. In
Yu Sun, Shuohuan Wang, Yukun Li, Shikun Feng, Hao Proceedings of the 2019 Conference on Empirical
Tian, Hua Wu, and Haifeng Wang. 2019b. ERNIE Methods in Natural Language Processing and the
2.0: A Continual Pre-training Framework for Lan- 9th International Joint Conference on Natural Lan-
guage Understanding. arXiv:1907.12412 [cs]. guage Processing (EMNLP-IJCNLP), pages 2153–
2162, Hong Kong, China. Association for Computa-
Zhiqing Sun, Hongkun Yu, Xiaodan Song, Renjie Liu, tional Linguistics.
Yiming Yang, and Denny Zhou. Mobilebert: Task-
agnostic compression of bert for resource limited de- Eric Wallace, Yizhong Wang, Sujian Li, Sameer Singh,
vices. and Matt Gardner. 2019b. Do nlp models know
numbers? probing numeracy in embeddings. arXiv
Raphael Tang, Yao Lu, Linqing Liu, Lili Mou, Olga preprint arXiv:1909.07940.
Vechtomova, and Jimmy Lin. 2019. Distilling task-
specific knowledge from BERT into simple neural Alex Wang, Amapreet Singh, Julian Michael, Felix
networks. arXiv preprint arXiv:1903.12136. Hill, Omer Levy, and Samuel R. Bowman. 2018.
GLUE: A Multi-Task Benchmark and Analysis Plat-
Ian Tenney, Dipanjan Das, and Ellie Pavlick. 2019a. form for Natural Language Understanding. In
Bert rediscovers the classical nlp pipeline. arXiv Proceedings of the 2018 EMNLP Workshop Black-
preprint arXiv:1905.05950. boxNLP: Analyzing and Interpreting Neural Net-
works for NLP, pages 353–355, Brussels, Belgium.
Ian Tenney, Patrick Xia, Berlin Chen, Alex Wang, Association for Computational Linguistics.
Adam Poliak, R. Thomas McCoy, Najoung Kim,
Benjamin Van Durme, Samuel R. Bowman, Dipan- Wei Wang, Bin Bi, Ming Yan, Chen Wu, Zuyi Bao, Li-
jan Das, and Ellie Pavlick. 2019b. What do you wei Peng, and Luo Si. 2019a. StructBERT: Incor-
learn from context? Probing for sentence structure porating Language Structures into Pre-training for
in contextualized word representations. In Interna- Deep Language Understanding. arXiv:1908.04577
tional Conference on Learning Representations. [cs].
Ke Tran. 2019. From English to Foreign Languages: Zihan Wang, Stephen Mayhew, Dan Roth, et al. 2019b.
Transferring Pre-trained Language Models. Cross-lingual ability of multilingual BERT: An em-
Henry Tsai, Jason Riesa, Melvin Johnson, Naveen Ari- pirical study. arXiv preprint arXiv:1912.07840.
vazhagan, Xin Li, and Amelia Archer. 2019. Small
Alex Warstadt, Yu Cao, Ioana Grosu, Wei Peng, Ha-
and practical bert models for sequence labeling.
gen Blix, Yining Nie, Anna Alsop, Shikha Bor-
arXiv preprint arXiv:1909.00100.
dia, Haokun Liu, Alicia Parrish, et al. 2019. In-
Iulia Turc, Ming-Wei Chang, Kenton Lee, and Kristina vestigating BERT’s knowledge of language: Five
Toutanova. 2019. Well-read students learn better: analysis methods with NPIs. arXiv preprint
The impact of student initialization on knowledge arXiv:1909.02597.
distillation. arXiv preprint arXiv:1908.08962.
Gregor Wiedemann, Steffen Remus, Avi Chawla, and
Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Chris Biemann. 2019. Does BERT make any
Uszkoreit, Llion Jones, Aidan N Gomez, Łukasz sense? interpretable word sense disambiguation
Kaiser, and Illia Polosukhin. 2017. Attention is all with contextualized embeddings. arXiv preprint
you need. In Advances in neural information pro- arXiv:1909.10430.
cessing systems, pages 5998–6008.
Sarah Wiegreffe and Yuval Pinter. 2019. Attention
Jesse Vig. 2019. Visualizing Attention in is not not Explanation. In Proceedings of the
Transformer-Based Language Representation 2019 Conference on Empirical Methods in Natu-
Models. arXiv:1904.02679 [cs, stat]. ral Language Processing and the 9th International
Joint Conference on Natural Language Process- Rowan Zellers, Ari Holtzman, Yonatan Bisk, Ali
ing (EMNLP-IJCNLP), pages 11–20, Hong Kong, Farhadi, and Yejin Choi. 2019. Hellaswag: Can a
China. Association for Computational Linguistics. machine really finish your sentence? arXiv preprint
arXiv:1905.07830.
Felix Wu, Angela Fan, Alexei Baevski, Yann Dauphin,
and Michael Auli. 2019a. Pay Less Attention with Zhengyan Zhang, Xu Han, Zhiyuan Liu, Xin Jiang,
Lightweight and Dynamic Convolutions. In Interna- Maosong Sun, and Qun Liu. 2019. ERNIE: En-
tional Conference on Learning Representations. hanced Language Representation with Informative
Entities. In Proceedings of the 57th Annual Meet-
Shijie Wu and Mark Dredze. 2019. Beto, bentz, be- ing of the Association for Computational Linguis-
cas: The surprising cross-lingual effectiveness of tics, pages 1441–1451, Florence, Italy. Association
bert. arXiv preprint arXiv:1904.09077. for Computational Linguistics.

Xing Wu, Shangwen Lv, Liangjun Zang, Jizhong Han, Zhuosheng Zhang, Yuwei Wu, Hai Zhao, Zuchao Li,
and Songlin Hu. 2019b. Conditional BERT Contex- Shuailiang Zhang, Xi Zhou, and Xiang Zhou. 2020.
tual Augmentation. In ICCS 2019: Computational Semantics-aware BERT for Language Understand-
Science – ICCS 2019, pages 84–95. ing. In AAAI 2020.
Sanqiang Zhao, Raghav Gupta, Yang Song, and Denny
Yonghui Wu, Mike Schuster, Zhifeng Chen, Quoc V.
Zhou. 2019. Extreme language model compres-
Le, Mohammad Norouzi, Wolfgang Macherey,
sion with optimal subwords and shared projections.
Maxim Krikun, Yuan Cao, Qin Gao, Klaus
arXiv preprint arXiv:1909.11687.
Macherey, Jeff Klingner, Apurva Shah, Melvin John-
son, Xiaobing Liu, Łukasz Kaiser, Stephan Gouws, Wenxuan Zhou, Junyi Du, and Xiang Ren. 2019. Im-
Yoshikiyo Kato, Taku Kudo, Hideto Kazawa, Keith proving BERT fine-tuning with embedding normal-
Stevens, George Kurian, Nishant Patil, Wei Wang, ization. arXiv preprint arXiv:1911.03918.
Cliff Young, Jason Smith, Jason Riesa, Alex Rud-
nick, Oriol Vinyals, Greg Corrado, Macduff Hughes, Xuhui Zhou, Yue Zhang, Leyang Cui, and Dandan
and Jeffrey Dean. 2016a. Google’s neural machine Huang. 2020. Evaluating Commonsense in Pre-
translation system: Bridging the gap between human trained Language Models. In AAAI 2020.
and machine translation. abs/1609.08144.
Chen Zhu, Yu Cheng, Zhe Gan, Siqi Sun, Tom Gold-
Yonghui Wu, Mike Schuster, Zhifeng Chen, Quoc V stein, and Jingjing Liu. 2019. FreeLB: Enhanced
Le, Mohammad Norouzi, Wolfgang Macherey, Adversarial Training for Language Understanding.
Maxim Krikun, Yuan Cao, Qin Gao, Klaus arXiv:1909.11764 [cs].
Macherey, et al. 2016b. Google’s neural ma-
chine translation system: Bridging the gap between
human and machine translation. arXiv preprint
arXiv:1609.08144.

Canwen Xu, Wangchunshu Zhou, Tao Ge, Furu Wei,


and Ming Zhou. 2020. Bert-of-theseus: Compress-
ing bert by progressive module replacing. arXiv
preprint arXiv:2002.02925.

Junjie Yang and Hai Zhao. 2019. Deepening


Hidden Representations from Pre-trained Lan-
guage Models for Natural Language Understanding.
arXiv:1911.01940 [cs].

Zhilin Yang, Zihang Dai, Yiming Yang, Jaime Car-


bonell, Ruslan Salakhutdinov, and Quoc V. Le. 2019.
XLNet: Generalized Autoregressive Pretraining for
Language Understanding. arXiv:1906.08237 [cs].

Yang You, Jing Li, Sashank Reddi, Jonathan


Hseu, Sanjiv Kumar, Srinadh Bhojanapalli, Xi-
aodan Song, James Demmel, and Cho-Jui Hsieh.
2019. Large batch optimization for deep learn-
ing: Training BERT in 76 minutes. arXiv preprint
arXiv:1904.00962, 1(5).

Ofir Zafrir, Guy Boudoukh, Peter Izsak, and Moshe


Wasserblat. 2019. Q8BERT: Quantized 8bit BERT.
arXiv preprint arXiv:1910.06188.
