
Globally Normalized Transition-Based Neural Networks

Daniel Andor, Chris Alberti, David Weiss, Aliaksei Severyn,


Alessandro Presta, Kuzman Ganchev, Slav Petrov and Michael Collins∗
Google Inc
New York, NY
{andor,chrisalberti,djweiss,severyn,apresta,kuzman,slav,mjcollins}@google.com

∗ On leave from Columbia University.

Abstract

We introduce a globally normalized transition-based neural network model that achieves state-of-the-art part-of-speech tagging, dependency parsing and sentence compression results. Our model is a simple feed-forward neural network that operates on a task-specific transition system, yet achieves comparable or better accuracies than recurrent models. We discuss the importance of global as opposed to local normalization: a key insight is that the label bias problem implies that globally normalized models can be strictly more expressive than locally normalized models.

1 Introduction

Neural network approaches have taken the field of natural language processing (NLP) by storm. In particular, variants of long short-term memory (LSTM) networks (Hochreiter and Schmidhuber, 1997) have produced impressive results on some of the classic NLP tasks such as part-of-speech tagging (Ling et al., 2015), syntactic parsing (Vinyals et al., 2015) and semantic role labeling (Zhou and Xu, 2015). One might speculate that it is the recurrent nature of these models that enables these results.

In this work we demonstrate that simple feed-forward networks without any recurrence can achieve comparable or better accuracies than LSTMs, as long as they are globally normalized. Our model, described in detail in Section 2, uses a transition system (Nivre, 2006) and feature embeddings as introduced by Chen and Manning (2014). We do not use any recurrence, but perform beam search for maintaining multiple hypotheses and introduce global normalization with a conditional random field (CRF) objective (Bottou et al., 1997; Le Cun et al., 1998; Lafferty et al., 2001; Collobert et al., 2011) to overcome the label bias problem that locally normalized models suffer from. Since we use beam inference, we approximate the partition function by summing over the elements in the beam, and use early updates (Collins and Roark, 2004; Zhou et al., 2015). We compute gradients based on this approximate global normalization and perform full backpropagation training of all neural network parameters based on the CRF loss.

In Section 3 we revisit the label bias problem and the implication that globally normalized models are strictly more expressive than locally normalized models. Lookahead features can partially mitigate this discrepancy, but cannot fully compensate for it, a point to which we return later. To empirically demonstrate the effectiveness of global normalization, we evaluate our model on part-of-speech tagging, syntactic dependency parsing and sentence compression (Section 4). Our model achieves state-of-the-art accuracy on all of these tasks, matching or outperforming LSTMs while being significantly faster. In particular, for dependency parsing on the Wall Street Journal we achieve the best-ever published unlabeled attachment score of 94.61%.

As discussed in more detail in Section 5, we also outperform previous structured training approaches used for neural network transition-based parsing. Our ablation experiments show that we outperform Weiss et al. (2015) and Alberti et al. (2015) because we do global backpropagation training of all model parameters, while they fix the neural network parameters when training the global part of their model.
We also outperform Zhou et al. (2015) despite using a smaller beam. To shed additional light on the label bias problem in practice, we provide a sentence compression example where the local model completely fails. We then demonstrate that a globally normalized parsing model without any lookahead features is almost as accurate as our best model, while a locally normalized model loses more than 10% absolute in accuracy because it cannot effectively incorporate evidence as it becomes available.

Finally, we provide an open-source implementation of our method, called SyntaxNet,1 which we have integrated into the popular TensorFlow2 framework. We also provide a pre-trained, state-of-the-art English dependency parser called “Parsey McParseface,” which we tuned for a balance of speed, simplicity, and accuracy.

1 http://github.com/tensorflow/models/tree/master/syntaxnet
2 http://www.tensorflow.org

2 Model

At its core, our model is an incremental transition-based parser (Nivre, 2006). To apply it to different tasks we only need to adjust the transition system and the input features.

2.1 Transition System

Given an input x, most often a sentence, we define:

• A set of states S(x).
• A special start state s† ∈ S(x).
• A set of allowed decisions A(s, x) for all s ∈ S(x).
• A transition function t(s, d, x) returning a new state s′ for any decision d ∈ A(s, x).

We will use a function ρ(s, d, x; θ) to compute the score of decision d in state s for input x. The vector θ contains the model parameters and we assume that ρ(s, d, x; θ) is differentiable with respect to θ.

In this section, for brevity, we will drop the dependence on x in the functions given above, simply writing S, A(s), t(s, d), and ρ(s, d; θ).

Throughout this work we will use transition systems in which all complete structures for the same input x have the same number of decisions n(x) (or n for brevity). In dependency parsing, for example, this is true for both the arc-standard and arc-eager transition systems (Nivre, 2006), where for a sentence x of length m, the number of decisions for any complete parse is n(x) = 2 × m.3 A complete structure is then a sequence of decision/state pairs (s1, d1) . . . (sn, dn) such that s1 = s†, di ∈ A(si) for i = 1 . . . n, and si+1 = t(si, di). We use the notation d1:j to refer to a decision sequence d1 . . . dj.

3 Note that this is not true for the swap transition system defined in Nivre (2009).

We assume that there is a one-to-one mapping between decision sequences d1:j−1 and states sj: that is, we essentially assume that a state encodes the entire history of decisions. Thus, each state can be reached by a unique decision sequence from s†.4 We will use decision sequences d1:j−1 and states interchangeably: in a slight abuse of notation, we define ρ(d1:j−1, d; θ) to be equal to ρ(s, d; θ) where s is the state reached by the decision sequence d1:j−1.

4 It is straightforward to extend the approach to make use of dynamic programming in the case where the same state can be reached by multiple decision sequences.

The scoring function ρ(s, d; θ) can be defined in a number of ways. In this work, following Chen and Manning (2014), Weiss et al. (2015), and Zhou et al. (2015), we define it via a feed-forward neural network as

\rho(s, d; \theta) = \phi(s; \theta^{(l)}) \cdot \theta^{(d)}.

Here θ(l) are the parameters of the neural network, excluding the parameters at the final layer. θ(d) are the final layer parameters for decision d. φ(s; θ(l)) is the representation for state s computed by the neural network under parameters θ(l). Note that the score is linear in the parameters θ(d). We next describe how softmax-style normalization can be performed at the local or global level.
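To make these abstractions concrete, the sketch below shows one possible rendering of the transition-system interface and the feed-forward scorer in Python/NumPy. The class and function names (TransitionSystem, state_representation, score) are our own illustrative choices, and the single ReLU hidden layer is an assumption; this is not the SyntaxNet implementation.

```python
import numpy as np

class TransitionSystem:
    """Minimal interface for the abstractions of Section 2.1 (illustrative only)."""

    def start_state(self, x):          # the special start state s_dagger
        raise NotImplementedError

    def allowed(self, state, x):       # the allowed decisions A(s, x)
        raise NotImplementedError

    def apply(self, state, d, x):      # the transition function t(s, d, x)
        raise NotImplementedError

def state_representation(features, theta_l):
    """phi(s; theta_l): a one-hidden-layer feed-forward network, standing in for
    the embedding and hidden layers used in the paper (an assumption)."""
    W1, b1 = theta_l
    return np.maximum(0.0, W1 @ features + b1)   # ReLU hidden layer

def score(phi, theta_d):
    """rho(s, d; theta) = phi(s; theta_l) . theta_d, linear in the final-layer
    parameters theta_d for decision d."""
    return float(np.dot(phi, theta_d))
```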
2.2 Global vs. Local Normalization

In the Chen and Manning (2014) style of greedy neural network parsing, the conditional probability distribution over decisions dj given context d1:j−1 is defined as

p(d_j \mid d_{1:j-1}; \theta) = \frac{\exp \rho(d_{1:j-1}, d_j; \theta)}{Z_L(d_{1:j-1}; \theta)},    (1)

where

Z_L(d_{1:j-1}; \theta) = \sum_{d' \in A(d_{1:j-1})} \exp \rho(d_{1:j-1}, d'; \theta).
Each ZL(d1:j−1; θ) is a local normalization term. The probability of a sequence of decisions d1:n is

p_L(d_{1:n}) = \prod_{j=1}^{n} p(d_j \mid d_{1:j-1}; \theta) = \frac{\exp \sum_{j=1}^{n} \rho(d_{1:j-1}, d_j; \theta)}{\prod_{j=1}^{n} Z_L(d_{1:j-1}; \theta)}.    (2)

Beam search can be used to attempt to find the maximum of Eq. (2) with respect to d1:n. The additive scores used in beam search are the log-softmax of each decision, ln p(dj|d1:j−1; θ), not the raw scores ρ(d1:j−1, dj; θ).

In contrast, a Conditional Random Field (CRF) defines a distribution pG(d1:n) as follows:

p_G(d_{1:n}) = \frac{\exp \sum_{j=1}^{n} \rho(d_{1:j-1}, d_j; \theta)}{Z_G(\theta)},    (3)

where

Z_G(\theta) = \sum_{d'_{1:n} \in D_n} \exp \sum_{j=1}^{n} \rho(d'_{1:j-1}, d'_j; \theta)

and Dn is the set of all valid sequences of decisions of length n. ZG(θ) is a global normalization term. The inference problem is now to find

\arg\max_{d_{1:n} \in D_n} p_G(d_{1:n}) = \arg\max_{d_{1:n} \in D_n} \sum_{j=1}^{n} \rho(d_{1:j-1}, d_j; \theta).

Beam search can again be used to approximately find the argmax.
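As a minimal illustration of the two normalization schemes, the sketch below computes pL from Eq. (2) as a product of per-step softmaxes and pG from Eq. (3) by brute-force enumeration of Dn. It assumes a hypothetical score_fn(prefix, d) interface and, for simplicity, that the same decision set is allowed at every step; the brute-force partition function is only feasible for toy problems, which is exactly why the paper approximates it with a beam.

```python
import itertools
import numpy as np

def p_local(score_fn, decisions, seq):
    """p_L(d_1:n) from Eq. (2): a product of per-step softmaxes, each normalized
    over the decisions allowed after the current prefix."""
    p = 1.0
    for j, d in enumerate(seq):
        prefix = tuple(seq[:j])
        z = sum(np.exp(score_fn(prefix, d2)) for d2 in decisions)   # Z_L(d_1:j-1)
        p *= np.exp(score_fn(prefix, d)) / z
    return p

def p_global(score_fn, decisions, n, seq):
    """p_G(d_1:n) from Eq. (3): a single softmax over all complete sequences D_n."""
    def total_score(s):
        return sum(score_fn(tuple(s[:j]), s[j]) for j in range(len(s)))
    z_g = sum(np.exp(total_score(s)) for s in itertools.product(decisions, repeat=n))
    return np.exp(total_score(seq)) / z_g
```

With the same score_fn, the two functions differ only in where the normalization happens, which is the distinction the label bias argument of Section 3 exploits.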
2.3 Training

Training data consists of inputs x paired with gold decision sequences d∗1:n. We use stochastic gradient descent on the negative log-likelihood of the data under the model. Under a locally normalized model, the negative log-likelihood is

L_{local}(d^*_{1:n}; \theta) = -\ln p_L(d^*_{1:n}; \theta) = -\sum_{j=1}^{n} \rho(d^*_{1:j-1}, d^*_j; \theta) + \sum_{j=1}^{n} \ln Z_L(d^*_{1:j-1}; \theta),    (4)

whereas under a globally normalized model it is

L_{global}(d^*_{1:n}; \theta) = -\ln p_G(d^*_{1:n}; \theta) = -\sum_{j=1}^{n} \rho(d^*_{1:j-1}, d^*_j; \theta) + \ln Z_G(\theta).    (5)

A significant practical advantage of the locally normalized cost Eq. (4) is that the local partition function ZL and its derivative can usually be computed efficiently. In contrast, the ZG term in Eq. (5) contains a sum over d′1:n ∈ Dn that is in many cases intractable.

To make learning tractable with the globally normalized model, we use beam search and early updates (Collins and Roark, 2004; Zhou et al., 2015). As the training sequence is being decoded, we keep track of the location of the gold path in the beam. If the gold path falls out of the beam at step j, a stochastic gradient step is taken on the following objective:

L_{global-beam}(d^*_{1:j}; \theta) = -\sum_{i=1}^{j} \rho(d^*_{1:i-1}, d^*_i; \theta) + \ln \sum_{d'_{1:j} \in B_j} \exp \sum_{i=1}^{j} \rho(d'_{1:i-1}, d'_i; \theta).    (6)

Here the set Bj contains all paths in the beam at step j, together with the gold path prefix d∗1:j. It is straightforward to derive gradients of the loss in Eq. (6) and to back-propagate gradients to all levels of a neural network defining the score ρ(s, d; θ). If the gold path remains in the beam throughout decoding, a gradient step is performed using Bn, the beam at the end of decoding.
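The following sketch shows how Eq. (6) and early updates fit together when decoding one training sequence. It assumes the illustrative score_fn(prefix, d) and allowed(prefix) interfaces from above and returns only the loss value, leaving gradients to an autodiff framework; it is a sketch of the training signal, not the SyntaxNet code.

```python
import numpy as np

def global_beam_loss(score_fn, gold, allowed, beam_size):
    """Beam search with early updates, computing the objective of Eq. (6)."""
    beam = [((), 0.0)]                 # (decision prefix, cumulative score)
    gold_score = 0.0
    for j, gold_d in enumerate(gold):
        expanded = [(p + (d,), s + score_fn(p, d))
                    for p, s in beam for d in allowed(p)]
        expanded.sort(key=lambda item: item[1], reverse=True)
        beam = expanded[:beam_size]
        gold_score += score_fn(tuple(gold[:j]), gold_d)
        gold_prefix = tuple(gold[:j + 1])
        if gold_prefix not in {p for p, _ in beam}:
            # Early update: the gold path fell out of the beam, so B_j is the
            # current beam plus the gold prefix.
            scores = [s for _, s in beam] + [gold_score]
            return -gold_score + np.logaddexp.reduce(scores)
    # The gold path survived decoding: take the step on B_n, the final beam.
    return -gold_score + np.logaddexp.reduce([s for _, s in beam])
```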
3 The Label Bias Problem

Intuitively, we would like the model to be able to revise an earlier decision made during search, when later evidence becomes available that rules out the earlier decision as incorrect. At first glance, it might appear that a locally normalized model used in conjunction with beam search or exact search is able to revise earlier decisions. However, the label bias problem (see Bottou (1991), Collins (1999) pages 222–226, Lafferty et al. (2001), Bottou and LeCun (2005), Smith and Johnson (2007)) means that locally normalized models often have a very weak ability to revise earlier decisions.

This section gives a formal perspective on the label bias problem, through a proof that globally normalized models are strictly more expressive than locally normalized models. The theorem was originally proved5 by Smith and Johnson (2007). The example underlying the proof gives a clear illustration of the label bias problem.6

5 More precisely, Smith and Johnson (2007) prove the theorem for models with potential functions of the form ρ(di−1, di, xi); the generalization to potential functions of the form ρ(d1:i−1, di, x1:i) is straightforward.

6 Smith and Johnson (2007) cite Michael Collins as the source of the example underlying the proof. Note that the theorem refers to conditional models of the form p(d1:n|x1:n) with global or local normalization. Equivalence (or non-equivalence) results for joint models of the form p(d1:n, x1:n) are quite different: for example, results from Chi (1999) and Abney et al. (1999) imply that weighted context-free grammars (a globally normalized joint model) and probabilistic context-free grammars (a locally normalized joint model) are equally expressive.

Global Models can be Strictly More Expressive than Local Models  Consider a tagging problem where the task is to map an input sequence x1:n to a decision sequence d1:n. First, consider a locally normalized model where we restrict the scoring function to access only the first i input symbols x1:i when scoring decision di. We will return to this restriction soon. The scoring function ρ can be an otherwise arbitrary function of the tuple ⟨d1:i−1, di, x1:i⟩:

p_L(d_{1:n} \mid x_{1:n}) = \prod_{i=1}^{n} p_L(d_i \mid d_{1:i-1}, x_{1:i}) = \frac{\exp \sum_{i=1}^{n} \rho(d_{1:i-1}, d_i, x_{1:i})}{\prod_{i=1}^{n} Z_L(d_{1:i-1}, x_{1:i})}.

Second, consider a globally normalized model

p_G(d_{1:n} \mid x_{1:n}) = \frac{\exp \sum_{i=1}^{n} \rho(d_{1:i-1}, d_i, x_{1:i})}{Z_G(x_{1:n})}.

This model again makes use of a scoring function ρ(d1:i−1, di, x1:i) restricted to the first i input symbols when scoring decision di.

Define PL to be the set of all possible distributions pL(d1:n|x1:n) under the local model obtained as the scores ρ vary. Similarly, define PG to be the set of all possible distributions pG(d1:n|x1:n) under the global model. Here a “distribution” is a function from a pair (x1:n, d1:n) to a probability p(d1:n|x1:n). Our main result is the following:

Theorem 3.1 See also Smith and Johnson (2007). PL is a strict subset of PG, that is PL ⊊ PG.

To prove this we will first prove that PL ⊆ PG. This step is straightforward. We then show that PG ⊈ PL; that is, there are distributions in PG that are not in PL. The proof that PG ⊈ PL gives a clear illustration of the label bias problem.

Proof that PL ⊆ PG: We need to show that for any locally normalized distribution pL, we can construct a globally normalized model pG such that pG = pL. Consider a locally normalized model with scores ρ(d1:i−1, di, x1:i). Define a global model pG with scores

\rho'(d_{1:i-1}, d_i, x_{1:i}) = \log p_L(d_i \mid d_{1:i-1}, x_{1:i}).

Then it is easily verified that

p_G(d_{1:n} \mid x_{1:n}) = p_L(d_{1:n} \mid x_{1:n})

for all x1:n, d1:n. □

In proving PG ⊈ PL we will use a simple problem where every example seen in training or test data is one of the following two tagged sentences:

x1 x2 x3 = a b c,   d1 d2 d3 = A B C
x1 x2 x3 = a b e,   d1 d2 d3 = A D E    (7)

Note that the input x2 = b is ambiguous: it can take tags B or D. This ambiguity is resolved when the next input symbol, c or e, is observed.

Now consider a globally normalized model, where the scores ρ(d1:i−1, di, x1:i) are defined as follows. Define T as the set {(A, B), (B, C), (A, D), (D, E)} of bigram tag transitions seen in the data. Similarly, define E as the set {(a, A), (b, B), (c, C), (b, D), (e, E)} of (word, tag) pairs seen in the data. We define

\rho(d_{1:i-1}, d_i, x_{1:i}) = \alpha \times [[(d_{i-1}, d_i) \in T]] + \alpha \times [[(x_i, d_i) \in E]]    (8)

where α is the single scalar parameter of the model, and [[π]] = 1 if π is true, 0 otherwise.

Proof that PG ⊈ PL: We will construct a globally normalized model pG such that there is no locally normalized model such that pL = pG.

Under the definition in Eq. (8), it is straightforward to show that

\lim_{\alpha \to \infty} p_G(A\,B\,C \mid a\,b\,c) = \lim_{\alpha \to \infty} p_G(A\,D\,E \mid a\,b\,e) = 1.

In contrast, under any definition for ρ(d1:i−1, di, x1:i), we must have

p_L(A\,B\,C \mid a\,b\,c) + p_L(A\,D\,E \mid a\,b\,e) \le 1.    (9)

This follows because pL(A B C|a b c) = pL(A|a) × pL(B|A, a b) × pL(C|A B, a b c) and pL(A D E|a b e) = pL(A|a) × pL(D|A, a b) × pL(E|A D, a b e). The inequality pL(B|A, a b) + pL(D|A, a b) ≤ 1 then immediately implies Eq. (9).
It follows that for sufficiently large values of α, we have pG(A B C|a b c) + pG(A D E|a b e) > 1, and given Eq. (9) it is impossible to define a locally normalized model with pL(A B C|a b c) = pG(A B C|a b c) and pL(A D E|a b e) = pG(A D E|a b e). □
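The contradiction can be checked numerically. The sketch below enumerates all 125 tag sequences for the two sentences in Eq. (7) under the scores of Eq. (8) and prints pG(A B C|a b c) + pG(A D E|a b e) for growing α; the sum approaches 2, whereas Eq. (9) caps any locally normalized model at 1. We assume the transition term is simply absent at position i = 1, which Eq. (8) leaves implicit.

```python
import itertools
import numpy as np

TAGS = ["A", "B", "C", "D", "E"]
T = {("A", "B"), ("B", "C"), ("A", "D"), ("D", "E")}              # tag bigrams seen in the data
E = {("a", "A"), ("b", "B"), ("c", "C"), ("b", "D"), ("e", "E")}  # (word, tag) pairs seen in the data

def rho(prev_tag, tag, word, alpha):
    # Eq. (8); the transition indicator is dropped at i = 1 (an assumption).
    score = alpha * ((word, tag) in E)
    if prev_tag is not None:
        score += alpha * ((prev_tag, tag) in T)
    return score

def p_global(words, tags, alpha):
    def total(seq):
        return sum(rho(seq[i - 1] if i > 0 else None, seq[i], words[i], alpha)
                   for i in range(len(words)))
    z = sum(np.exp(total(seq)) for seq in itertools.product(TAGS, repeat=len(words)))
    return np.exp(total(tuple(tags))) / z

for alpha in (1.0, 5.0, 10.0):
    total_mass = p_global("abc", "ABC", alpha) + p_global("abe", "ADE", alpha)
    print(f"alpha={alpha}: pG(ABC|abc) + pG(ADE|abe) = {total_mass:.4f}")
```

Already at moderate α the printed sum exceeds 1, so no locally normalized model with the same restricted scores can match this pG.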
Under the restriction that scores ρ(d1:i−1, di, x1:i) depend only on the first i input symbols, the globally normalized model is still able to model the data in Eq. (7), while the locally normalized model fails (see Eq. 9). The ambiguity at input symbol b is naturally resolved when the next symbol (c or e) is observed, but the locally normalized model is not able to revise its prediction.

It is easy to fix the locally normalized model for the example in Eq. (7) by allowing scores ρ(d1:i−1, di, x1:i+1) that take into account the input symbol xi+1. More generally we can have a model of the form ρ(d1:i−1, di, x1:i+k) where the integer k specifies the amount of lookahead in the model. Such lookahead is common in practice, but insufficient in general. For every amount of lookahead k, we can construct examples that cannot be modeled with a locally normalized model by duplicating the middle input b in (7) k + 1 times.

Only a local model with scores ρ(d1:i−1, di, x1:n) that considers the entire input can capture any distribution p(d1:n|x1:n): in this case the decomposition

p_L(d_{1:n} \mid x_{1:n}) = \prod_{i=1}^{n} p_L(d_i \mid d_{1:i-1}, x_{1:n})

makes no independence assumptions.

However, increasing the amount of context used as input comes at a cost, requiring more powerful learning algorithms, and potentially more training data. For a detailed analysis of the trade-offs between structural features in CRFs and more powerful local classifiers without structural constraints, see Liang et al. (2008); in these experiments local classifiers are unable to reach the performance of CRFs on problems such as parsing and named entity recognition where structural constraints are important. Note that there is nothing to preclude an approach that makes use of both global normalization and more powerful scoring functions ρ(d1:i−1, di, x1:n), obtaining the best of both worlds. The experiments that follow make use of both.

4 Experiments

To demonstrate the flexibility and modeling power of our approach, we provide experimental results on a diverse set of structured prediction tasks. We apply our approach to POS tagging, syntactic dependency parsing, and sentence compression.

While directly optimizing the global model defined by Eq. (5) works well, we found that training the model in two steps achieves the same precision much faster: we first pretrain the network using the local objective given in Eq. (4), and then perform additional training steps using the global objective given in Eq. (6). We pretrain all layers except the softmax layer in this way. We purposefully abstain from complicated hand engineering of input features, which might improve performance further (Durrett and Klein, 2015).

We use the training recipe from Weiss et al. (2015) for each training stage of our model. Specifically, we use averaged stochastic gradient descent with momentum, and we tune the learning rate, learning rate schedule, momentum, and early stopping time using a separate held-out corpus for each task. We tune again with a different set of hyperparameters for training with the global objective.
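The two-stage schedule can be written down compactly as below. The optimizer is reduced to plain momentum SGD, and local_grad/global_grad are assumed callbacks returning gradient dictionaries for Eq. (4) and Eq. (6); the averaged SGD variant, the learning-rate schedules and the per-task tuning used in the paper are not reproduced here.

```python
import numpy as np

def sgd_momentum_step(params, grads, velocity, lr=0.05, mu=0.9):
    """One plain momentum step over a dict of NumPy parameter arrays."""
    for name in params:
        velocity[name] = mu * velocity[name] - lr * grads[name]
        params[name] += velocity[name]

def train_two_stage(params, local_grad, global_grad, data, pretrain_steps, global_steps):
    """Sketch of the Section 4 schedule: pretrain with the local objective, then
    continue with the global beam objective under freshly tuned settings."""
    velocity = {k: np.zeros_like(v) for k, v in params.items()}
    for example, _ in zip(data, range(pretrain_steps)):
        sgd_momentum_step(params, local_grad(params, example), velocity)
    velocity = {k: np.zeros_like(v) for k, v in params.items()}   # restart for the global stage
    for example, _ in zip(data, range(global_steps)):
        sgd_momentum_step(params, global_grad(params, example), velocity)
```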
4.1 Part of Speech Tagging

Part of speech (POS) tagging is a classic NLP task, where modeling the structure of the output is important for achieving state-of-the-art performance.

Data & Evaluation. We conducted experiments on a number of different datasets: (1) the English Wall Street Journal (WSJ) part of the Penn Treebank (Marcus et al., 1993) with standard POS tagging splits; (2) the English “Treebank Union” multi-domain corpus containing data from the OntoNotes corpus version 5 (Hovy et al., 2006), the English Web Treebank (Petrov and McDonald, 2012), and the updated and corrected Question Treebank (Judge et al., 2006) with identical setup to Weiss et al. (2015); and (3) the CoNLL ’09 multi-lingual shared task (Hajič et al., 2009).

Model Configuration. Inspired by the integrated POS tagging and parsing transition system of Bohnet and Nivre (2012), we employ a simple transition system that uses only a SHIFT action and predicts the POS tag of the current word on the buffer as it gets shifted to the stack. We extract the following features on a window of ±3 tokens centered at the current focus token: word, cluster, and character n-grams up to length 3. We also extract the tags predicted for the previous 4 tokens. The network in these experiments has a single hidden layer with 256 units on WSJ and Treebank Union and 64 on CoNLL ’09.
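Concretely, the tagging system fits the interface sketched in Section 2.1: the only action type is a SHIFT that also commits to a tag for the focus token. The class below is an illustrative rendering of that idea, not the released implementation.

```python
class TaggingTransitionSystem:
    """Tag-while-shifting system of Section 4.1 (sketch): a sentence of m tokens
    is processed in exactly m decisions, one (SHIFT, tag) decision per token."""

    def __init__(self, tagset):
        self.tagset = list(tagset)

    def start_state(self, x):
        return ()                                     # tags committed so far

    def allowed(self, state, x):
        return [("SHIFT", tag) for tag in self.tagset]

    def apply(self, state, decision, x):
        _, tag = decision
        return state + (tag,)                         # shift the focus, record the tag

    def is_final(self, state, x):
        return len(state) == len(x)
```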
Method                  En      En-Union               CoNLL ’09                                            Avg
                        WSJ     News    Web     QTB    Ca      Ch      Cz      En      Ge      Ja      Sp
Linear CRF              97.17   97.60   94.58   96.04  98.81   94.45   98.90   97.50   97.14   97.90   98.79  97.17
Ling et al. (2015)      97.78   97.44   94.03   96.18  98.77   94.38   99.00   97.60   97.84   97.06   98.71  97.16
Our Local (B=1)         97.44   97.66   94.46   96.59  98.91   94.56   98.96   97.36   97.35   98.02   98.88  97.29
Our Local (B=8)         97.45   97.69   94.46   96.64  98.88   94.56   98.96   97.40   97.35   98.02   98.89  97.30
Our Global (B=8)        97.44   97.77   94.80   96.86  99.03   94.72   99.02   97.65   97.52   98.37   98.97  97.47
Parsey McParseface      -       97.52   94.24   96.45  -       -       -       -       -       -       -      -

Table 1: Final POS tagging test set results on English WSJ and Treebank Union as well as CoNLL ’09. We also show the performance of our pre-trained open source model, “Parsey McParseface.”

Results. In Table 1 we compare our model to a linear CRF and to the compositional character-to-word LSTM model of Ling et al. (2015). The CRF is a first-order linear model with exact inference and the same emission features as our model. It additionally has transition features of the word, cluster and character n-gram up to length 3 on both endpoints of the transition. The results for Ling et al. (2015) were solicited from the authors.

Our local model already compares favorably against these methods on average. Using beam search with a locally normalized model does not help, but with global normalization it leads to a 7% reduction in relative error, empirically demonstrating the effect of label bias. The set of character n-gram features is very important, increasing average accuracy on the CoNLL ’09 datasets by about 0.5% absolute. This shows that character-level modeling can also be done with a simple feed-forward network without recurrence.

4.2 Dependency Parsing

In dependency parsing the goal is to produce a directed tree representing the syntactic structure of the input sentence.

Data & Evaluation. We use the same corpora as in our POS tagging experiments, except that we use the standard parsing splits of the WSJ. To avoid over-fitting to the development set (Sec. 22), we use Sec. 24 for tuning the hyperparameters of our models. We convert the English constituency trees to Stanford style dependencies (De Marneffe et al., 2006) using version 3.3.0 of the converter. For English, we use predicted POS tags (the same POS tags are used for all models) and exclude punctuation from the evaluation, as is standard. For the CoNLL ’09 datasets we follow standard practice and include all punctuation in the evaluation. We follow Alberti et al. (2015) and use our own predicted POS tags so that we can include a k-best tag feature (see below), but use the supplied predicted morphological features. We report unlabeled and labeled attachment scores (UAS/LAS).

Model Configuration. Our model configuration is basically the same as the one originally proposed by Chen and Manning (2014) and then refined by Weiss et al. (2015). In particular, we use the arc-standard transition system and extract the same set of features as prior work: words, part-of-speech tags, and dependency arcs and labels in the surrounding context of the state, as well as k-best tags as proposed by Alberti et al. (2015). We use two hidden layers of 1,024 dimensions each.

Results. Tables 2 and 3 show our final parsing results and a comparison to the best systems from the literature. We obtain the best ever published results on almost all datasets, including the WSJ. Our main results use the same pre-trained word embeddings as Weiss et al. (2015) and Alberti et al. (2015), but no tri-training. When we artificially restrict ourselves to not use pre-trained word embeddings, we observe only a modest drop of ∼0.5% UAS; for example, training only on the WSJ yields 94.08% UAS and 92.15% LAS for our global model with a beam of size 32.

Even though we do not use tri-training, our model compares favorably to the 94.26% UAS and 92.41% LAS reported by Weiss et al. (2015) with tri-training. As we show in Sec. 5, these gains can be attributed to the full backpropagation training that differentiates our approach from that of Weiss et al. (2015) and Alberti et al. (2015). Our results also significantly outperform the LSTM-based approaches of Dyer et al. (2015) and Ballesteros et al. (2015).
                               WSJ             Union-News      Union-Web       Union-QTB
Method                         UAS     LAS     UAS     LAS     UAS     LAS     UAS     LAS
Martins et al. (2013)⋆         92.89   90.55   93.10   91.13   88.23   85.04   94.21   91.54
Zhang and McDonald (2014)⋆     93.22   91.02   93.32   91.48   88.65   85.59   93.37   90.69
Weiss et al. (2015)            93.99   92.05   93.91   92.25   89.29   86.44   94.17   92.06
Alberti et al. (2015)          94.23   92.36   94.10   92.55   89.55   86.85   94.74   93.04
Our Local (B=1)                92.95   91.02   93.11   91.46   88.42   85.58   92.49   90.38
Our Local (B=32)               93.59   91.70   93.65   92.03   88.96   86.17   93.22   91.17
Our Global (B=32)              94.61   92.79   94.44   92.93   90.17   87.54   95.40   93.64
Parsey McParseface (B=8)       -       -       94.15   92.51   89.08   86.29   94.77   93.17

Table 2: Final English dependency parsing test set results. We note that training our system using only the WSJ corpus (i.e. no pre-trained embeddings or other external resources) yields 94.08% UAS and 92.15% LAS for our global model with beam 32.

                               Catalan         Chinese         Czech           English         German          Japanese        Spanish
Method                         UAS     LAS     UAS     LAS     UAS     LAS     UAS     LAS     UAS     LAS     UAS     LAS     UAS     LAS
Best Shared Task Result        -       87.86   -       79.17   -       80.38   -       89.88   -       87.48   -       92.57   -       87.64
Ballesteros et al. (2015)      90.22   86.42   80.64   76.52   79.87   73.62   90.56   88.01   88.83   86.10   93.47   92.55   90.38   86.59
Zhang and McDonald (2014)      91.41   87.91   82.87   78.57   86.62   80.59   92.69   90.01   89.88   87.38   92.82   91.87   90.82   87.34
Lei et al. (2014)              91.33   87.22   81.67   76.71   88.76   81.77   92.75   90.00   90.81   87.81   94.04   91.84   91.16   87.38
Bohnet and Nivre (2012)        92.44   89.60   82.52   78.51   88.82   83.73   92.87   90.60   91.37   89.38   93.67   92.63   92.24   89.60
Alberti et al. (2015)          92.31   89.17   83.57   79.90   88.45   83.57   92.70   90.56   90.58   88.20   93.99   93.10   92.26   89.33
Our Local (B=1)                91.24   88.21   81.29   77.29   85.78   80.63   91.44   89.29   89.12   86.95   93.71   92.85   91.01   88.14
Our Local (B=16)               91.91   88.93   82.22   78.26   86.25   81.28   92.16   90.05   89.53   87.40   93.61   92.74   91.64   88.88
Our Global (B=16)              92.67   89.83   84.72   80.85   88.94   84.56   93.22   91.23   90.91   89.15   93.65   92.84   92.62   89.95

Table 3: Final CoNLL ’09 dependency parsing test set results.

4.3 Sentence Compression

Our final structured prediction task is extractive sentence compression.

Data & Evaluation. We follow Filippova et al. (2015), where a large news collection is used to heuristically generate compression instances. Our final corpus contains about 2.3M compression instances: we use 2M examples for training, 130k for development and 160k for the final test. We report per-token F1 score and per-sentence accuracy (A), i.e. the percentage of instances that fully match the golden compressions. Following Filippova et al. (2015) we also run a human evaluation on 200 sentences where we ask the raters to score compressions for readability (read) and informativeness (info) on a scale from 0 to 5.

Model Configuration. The transition system for sentence compression is similar to POS tagging: we scan sentences from left to right and label each token as keep or drop. We extract features from words, POS tags, and dependency labels from a window of tokens centered on the input, as well as features from the history of predictions. We use a single hidden layer of size 400.
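Under the Section 2.1 interface, the compression system is as small as the tagging one; the sketch below is an illustrative rendering of that interface, not the released code, and only makes the keep/drop decision space explicit.

```python
class CompressionTransitionSystem:
    """Sketch of the Section 4.3 system: scan the sentence left to right and label
    each token KEEP or DROP, so a sentence of m tokens needs exactly m decisions."""

    KEEP, DROP = "KEEP", "DROP"

    def start_state(self, x):
        return ()

    def allowed(self, state, x):
        return [self.KEEP, self.DROP]

    def apply(self, state, decision, x):
        return state + (decision,)

    def compression(self, state, x):
        # Deterministically read off the compressed sentence from a final state.
        return [token for token, d in zip(x, state) if d == self.KEEP]
```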
                            Generated corpus     Human eval
Method                      A        F1          read    info
Filippova et al. (2015)     35.36    82.83       4.66    4.03
Automatic                   -        -           4.31    3.77
Our Local (B=1)             30.51    78.72       4.58    4.03
Our Local (B=8)             31.19    75.69       -       -
Our Global (B=8)            35.16    81.41       4.67    4.07

Table 4: Sentence compression results on News data. Automatic refers to application of the same automatic extraction rules used to generate the News training corpus.

Results. Table 4 shows our sentence compression results. Our globally normalized model again significantly outperforms the local model. Beam search with a locally normalized model suffers from severe label bias issues that we discuss on a concrete example in Section 5. We also compare to the sentence compression system from Filippova et al. (2015), a 3-layer stacked LSTM which uses dependency label information. The LSTM and our global model perform on par on both the automatic evaluation as well as the human ratings, but our model is roughly 100× faster.
All compressions kept approximately 42% of the tokens on average and all the models are significantly better than the automatic extractions (p < 0.05).

5 Discussion

We derived a proof for the label bias problem and the advantages of global models. We then empirically verified this theoretical superiority by demonstrating state-of-the-art performance on three different tasks. In this section we situate and compare our model to previous work and provide two examples of the label bias problem in practice.

5.1 Related Neural CRF Work

Neural network models have been combined with conditional random fields and globally normalized models before. Bottou et al. (1997) and Le Cun et al. (1998) describe global training of neural network models for structured prediction problems. Peng et al. (2009) add a non-linear neural network layer to a linear-chain CRF and Do and Artires (2010) apply a similar approach to more general Markov network structures. Yao et al. (2014) and Zheng et al. (2015) introduce recurrence into the model and Huang et al. (2015) finally combine CRFs and LSTMs. These neural CRF models are limited to sequence labeling tasks where exact inference is possible, while our model works well when exact inference is intractable.

5.2 Related Transition-Based Parsing Work

For early work on neural networks for transition-based parsing, see Henderson (2003; 2004). Our work is closest to the work of Weiss et al. (2015), Zhou et al. (2015) and Watanabe and Sumita (2015); in these approaches global normalization is added to the local model of Chen and Manning (2014). Empirically, Weiss et al. (2015) achieves the best performance, even though their model keeps the parameters of the locally normalized neural network fixed and only trains a perceptron that uses the activations as features. Their model is therefore limited in its ability to revise the predictions of the locally normalized model. In Table 5 we show that full backpropagation training all the way to the word embeddings is very important and significantly contributes to the performance of our model. We also compared training under the CRF objective with a Perceptron-like hinge loss between the gold and best elements of the beam. When we limited the backpropagation depth to training only the top layer θ(d), we found negligible differences in accuracy: 93.20% and 93.28% for the CRF objective and hinge loss respectively. However, when training with full backpropagation the CRF accuracy is 0.2% higher and training converged more than 4× faster.

Method                            UAS      LAS
Local (B=1)                       92.85    90.59
Local (B=16)                      93.32    91.09
Global (B=16) {θ(d)}              93.45    91.21
Global (B=16) {W2, θ(d)}          94.01    91.77
Global (B=16) {W1, W2, θ(d)}      94.09    91.81
Global (B=16) (full)              94.38    92.17

Table 5: WSJ dev set scores for successively deeper levels of backpropagation. The full parameter set corresponds to backpropagation all the way to the embeddings. Wi: hidden layer i weights.
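The ablation in Table 5 amounts to choosing which parameter groups receive updates during the global training stage. A minimal way to express that, assuming parameters and gradients are kept in plain dictionaries with illustrative names, is the helper below; a real implementation would also skip computing gradients for the frozen groups.

```python
def apply_gradients(params, grads, trainable, lr=0.05):
    """Update only the parameter groups named in `trainable`.
    For example, trainable={"theta_d"} mimics the first Global row of Table 5,
    {"W2", "theta_d"} the second, {"W1", "W2", "theta_d"} the third, and the
    full set (including the embeddings) the last row."""
    for name, grad in grads.items():
        if name in trainable:
            params[name] -= lr * grad
```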
Zhou et al. (2015) perform full backpropagation training like us, but even with a much larger beam, their performance is significantly lower than ours. We also apply our model to two additional tasks, while they experiment only with dependency parsing. Finally, Watanabe and Sumita (2015) introduce recurrent components and additional techniques like max-violation updates for a corresponding constituency parsing model. In contrast, our model does not require any recurrence or specialized training.

5.3 Label Bias in Practice

We observed several instances of severe label bias in the sentence compression task. Although using beam search with the local model outperforms greedy inference on average, beam search leads the local model to occasionally produce empty compressions (Table 6). It is important to note that these are not search errors: the empty compression has higher probability under pL than the prediction from greedy inference. However, the more expressive globally normalized model does not suffer from this limitation, and correctly gives the empty compression almost zero probability.
Method         Predicted compression                                                                                         pL      pG
Local (B=1)    In Pakistan, former leader Pervez Musharraf has appeared in court for the first time, on treason charges.     0.13    0.05
Local (B=8)    In Pakistan, former leader Pervez Musharraf has appeared in court for the first time, on treason charges.     0.16    <10−4
Global (B=8)   In Pakistan, former leader Pervez Musharraf has appeared in court for the first time, on treason charges.     0.06    0.07

Table 6: Example sentence compressions where the label bias of the locally normalized model leads to a breakdown during beam search. The probability of each compression under the local (pL) and global (pG) models shows that only the global model can properly represent zero probability for the empty compression.

We also present some empirical evidence that the label bias problem is severe in parsing. We trained models where the scoring functions in parsing at position i in the sentence are limited to considering only tokens x1:i; hence unlike the full parsing model, there is no ability to look ahead in the sentence when making a decision.7 The result for a greedy model under this constraint is 76.96% UAS; for a locally normalized model with beam search it is 81.35%; and for a globally normalized model it is 93.60%. Thus the globally normalized model gets very close to the performance of a model with full lookahead, while the locally normalized model with a beam gives dramatically lower performance. In our final experiments with full lookahead, the globally normalized model achieves 94.01% accuracy, compared to 93.07% accuracy for a local model with beam search. Thus adding lookahead allows the local model to close the gap in performance to the global model; however, there is still a significant difference in accuracy, which may in large part be due to the label bias problem.

7 This setting may be important in some applications, where for example parse structures for sentence prefixes are required, or where the input is received one word at a time and online processing is beneficial.

A number of authors have considered modified training procedures for greedy models, or for locally normalized models. Daumé III et al. (2009) introduce Searn, an algorithm that allows a classifier making greedy decisions to become more robust to errors made in previous decisions. Goldberg and Nivre (2013) describe improvements to a greedy parsing approach that makes use of methods from imitation learning (Ross et al., 2011) to augment the training set. Note that these methods are focused on greedy models: they are unlikely to solve the label bias problem when used in conjunction with beam search, given that the problem is one of expressivity of the underlying model. More recent work (Yazdani and Henderson, 2015; Vaswani and Sagae, 2016) has augmented locally normalized models with correctness probabilities or error states, effectively adding a step after every decision where the probability of correctness of the resulting structure is evaluated. This gives considerable gains over a locally normalized model, although performance is lower than our full globally normalized approach.

6 Conclusions

We presented a simple and yet powerful model architecture that produces state-of-the-art results for POS tagging, dependency parsing and sentence compression. Our model combines the flexibility of transition-based algorithms and the modeling power of neural networks. Our results demonstrate that feed-forward networks without recurrence can outperform recurrent models such as LSTMs when they are trained with global normalization. We further support our empirical findings with a proof showing that global normalization helps the model overcome the label bias problem from which locally normalized models suffer.

Acknowledgements

We would like to thank Ling Wang for training his C2W part-of-speech tagger on our setup, and Emily Pitler, Ryan McDonald, Greg Coppola and Fernando Pereira for tremendously helpful discussions. Finally, we are grateful to all members of the Google Parsing Team.
References

[Abney et al.1999] Steven Abney, David McAllester, and Fernando Pereira. 1999. Relating probabilistic grammars and automata. Proceedings of the 37th Annual Meeting of the Association for Computational Linguistics, pages 131–160.

[Alberti et al.2015] Chris Alberti, David Weiss, Greg Coppola, and Slav Petrov. 2015. Improved transition-based parsing and tagging with neural networks. In Proceedings of the 2015 Conference on Empirical Methods in Natural Language Processing, pages 1354–1359.

[Ballesteros et al.2015] Miguel Ballesteros, Chris Dyer, and Noah A. Smith. 2015. Improved transition-based parsing by modeling characters instead of words with LSTMs. In Proceedings of the 2015 Conference on Empirical Methods in Natural Language Processing, pages 349–359.
[Bohnet and Nivre2012] Bernd Bohnet and Joakim Nivre. 2012. A transition-based system for joint part-of-speech tagging and labeled non-projective dependency parsing. In Proceedings of the 2012 Joint Conference on Empirical Methods in Natural Language Processing and Computational Natural Language Learning, pages 1455–1465.

[Bottou and LeCun2005] Léon Bottou and Yann LeCun. 2005. Graph transformer networks for image recognition. Bulletin of the International Statistical Institute (ISI).

[Bottou et al.1997] Léon Bottou, Yann Le Cun, and Yoshua Bengio. 1997. Global training of document processing systems using graph transformer networks. In Proceedings of Computer Vision and Pattern Recognition (CVPR), pages 489–493.

[Bottou1991] Léon Bottou. 1991. Une approche théorique de l’apprentissage connexionniste: Applications à la reconnaissance de la parole. Ph.D. thesis, Doctoral dissertation, Universite de Paris XI.

[Chen and Manning2014] Danqi Chen and Christopher D. Manning. 2014. A fast and accurate dependency parser using neural networks. In Proceedings of the 2014 Conference on Empirical Methods in Natural Language Processing, pages 740–750.

[Chi1999] Zhiyi Chi. 1999. Statistical properties of probabilistic context-free grammars. Computational Linguistics, pages 131–160.

[Collins and Roark2004] Michael Collins and Brian Roark. 2004. Incremental parsing with the perceptron algorithm. In Proceedings of the 42nd Meeting of the Association for Computational Linguistics (ACL’04), pages 111–118.

[Collins1999] Michael Collins. 1999. Head-Driven Statistical Models for Natural Language Parsing. Ph.D. thesis, University of Pennsylvania.

[Collobert et al.2011] Ronan Collobert, Jason Weston, Léon Bottou, Michael Karlen, Koray Kavukcuoglu, and Pavel Kuksa. 2011. Natural language processing (almost) from scratch. The Journal of Machine Learning Research, 12:2493–2537.

[Daumé III et al.2009] Hal Daumé III, John Langford, and Daniel Marcu. 2009. Search-based structured prediction. Machine Learning Journal (MLJ), 75(3):297–325.

[De Marneffe et al.2006] Marie-Catherine De Marneffe, Bill MacCartney, and Christopher D. Manning. 2006. Generating typed dependency parses from phrase structure parses. In Proceedings of Fifth International Conference on Language Resources and Evaluation, pages 449–454.

[Do and Artires2010] Trinh Minh Tri Do and Thierry Artires. 2010. Neural conditional random fields. In International Conference on Artificial Intelligence and Statistics, volume 9, pages 177–184.

[Durrett and Klein2015] Greg Durrett and Dan Klein. 2015. Neural CRF parsing. In Proceedings of the 53rd Annual Meeting of the Association for Computational Linguistics and the 7th International Joint Conference on Natural Language Processing, pages 302–312.

[Dyer et al.2015] Chris Dyer, Miguel Ballesteros, Wang Ling, Austin Matthews, and Noah A. Smith. 2015. Transition-based dependency parsing with stack long short-term memory. In Proceedings of the 53rd Annual Meeting of the Association for Computational Linguistics, pages 334–343.

[Filippova et al.2015] Katja Filippova, Enrique Alfonseca, Carlos A. Colmenares, Łukasz Kaiser, and Oriol Vinyals. 2015. Sentence compression by deletion with LSTMs. In Proceedings of the 2015 Conference on Empirical Methods in Natural Language Processing, pages 360–368.

[Goldberg and Nivre2013] Yoav Goldberg and Joakim Nivre. 2013. Training deterministic parsers with non-deterministic oracles. Transactions of the Association for Computational Linguistics, 1:403–414.

[Hajič et al.2009] Jan Hajič, Massimiliano Ciaramita, Richard Johansson, Daisuke Kawahara, Maria Antònia Martí, Lluís Màrquez, Adam Meyers, Joakim Nivre, Sebastian Padó, Jan Štěpánek, Pavel Straňák, Mihai Surdeanu, Nianwen Xue, and Yi Zhang. 2009. The CoNLL-2009 shared task: Syntactic and semantic dependencies in multiple languages. In Proceedings of the Thirteenth Conference on Computational Natural Language Learning: Shared Task, pages 1–18.

[Henderson2003] James Henderson. 2003. Inducing history representations for broad coverage statistical parsing. In Proceedings of the 2003 Human Language Technology Conference of the North American Chapter of the Association for Computational Linguistics, pages 24–31.

[Henderson2004] James Henderson. 2004. Discriminative training of a neural network statistical parser. In Proceedings of the 42nd Meeting of the Association for Computational Linguistics (ACL’04), pages 95–102.

[Hochreiter and Schmidhuber1997] Sepp Hochreiter and Jürgen Schmidhuber. 1997. Long short-term memory. Neural computation, 9(8):1735–1780.

[Hovy et al.2006] Eduard Hovy, Mitchell Marcus, Martha Palmer, Lance Ramshaw, and Ralph Weischedel. 2006. OntoNotes: The 90% solution. In Proceedings of the Human Language Technology Conference of the NAACL, Short Papers, pages 57–60.
[Huang et al.2015] Zhiheng Huang, Wei Xu, and Kai Yu. 2015. Bidirectional LSTM-CRF models for sequence tagging. arXiv preprint arXiv:1508.01991.

[Judge et al.2006] John Judge, Aoife Cahill, and Josef van Genabith. 2006. QuestionBank: Creating a corpus of parse-annotated questions. In Proceedings of the 21st International Conference on Computational Linguistics and 44th Annual Meeting of the Association for Computational Linguistics, pages 497–504.

[Lafferty et al.2001] John Lafferty, Andrew McCallum, and Fernando Pereira. 2001. Conditional random fields: Probabilistic models for segmenting and labeling sequence data. In Proceedings of the Eighteenth International Conference on Machine Learning, pages 282–289.

[Le Cun et al.1998] Yann Le Cun, Léon Bottou, Yoshua Bengio, and Patrick Haffner. 1998. Gradient based learning applied to document recognition. Proceedings of IEEE, 86(11):2278–2324.

[Lei et al.2014] Tao Lei, Yu Xin, Yuan Zhang, Regina Barzilay, and Tommi Jaakkola. 2014. Low-rank tensors for scoring dependency structures. In Proceedings of the 52nd Annual Meeting of the Association for Computational Linguistics, pages 1381–1391.

[Liang et al.2008] Percy Liang, Hal Daumé III, and Dan Klein. 2008. Structure compilation: Trading structure for features. In Proceedings of the 25th International Conference on Machine Learning, pages 592–599.

[Ling et al.2015] Wang Ling, Chris Dyer, Alan W Black, Isabel Trancoso, Ramon Fermandez, Silvio Amir, Luis Marujo, and Tiago Luis. 2015. Finding function in form: Compositional character models for open vocabulary word representation. In Proceedings of the 2015 Conference on Empirical Methods in Natural Language Processing, pages 1520–1530.

[Marcus et al.1993] Mitchell P. Marcus, Beatrice Santorini, and Mary Ann Marcinkiewicz. 1993. Building a large annotated corpus of English: The Penn Treebank. Computational Linguistics, 19(2):313–330.

[Martins et al.2013] Andre Martins, Miguel Almeida, and Noah A. Smith. 2013. Turning on the turbo: Fast third-order non-projective turbo parsers. In Proceedings of the 51st Annual Meeting of the Association for Computational Linguistics, pages 617–622.

[Nivre2006] Joakim Nivre. 2006. Inductive Dependency Parsing. Springer-Verlag New York, Inc.

[Nivre2009] Joakim Nivre. 2009. Non-projective dependency parsing in expected linear time. In Proceedings of the Joint Conference of the 47th Annual Meeting of the ACL and the 4th International Joint Conference on Natural Language Processing of the AFNLP, pages 351–359.

[Peng et al.2009] Jian Peng, Liefeng Bo, and Jinbo Xu. 2009. Conditional neural fields. In Advances in Neural Information Processing Systems 22, pages 1419–1427.

[Petrov and McDonald2012] Slav Petrov and Ryan McDonald. 2012. Overview of the 2012 shared task on parsing the web. Notes of the First Workshop on Syntactic Analysis of Non-Canonical Language (SANCL).

[Ross et al.2011] Stéphane Ross, Geoffrey J. Gordon, and J. Andrew Bagnell. 2011. No-regret reductions for imitation learning and structured prediction. AISTATS.

[Smith and Johnson2007] Noah Smith and Mark Johnson. 2007. Weighted and probabilistic context-free grammars are equally expressive. Computational Linguistics, pages 477–491.

[Vaswani and Sagae2016] Ashish Vaswani and Kenji Sagae. 2016. Efficient structured inference for transition-based parsing with neural networks and error states. Transactions of the Association for Computational Linguistics, 4:183–196.

[Vinyals et al.2015] Oriol Vinyals, Łukasz Kaiser, Terry Koo, Slav Petrov, Ilya Sutskever, and Geoffrey Hinton. 2015. Grammar as a foreign language. In Advances in Neural Information Processing Systems 28, pages 2755–2763.

[Watanabe and Sumita2015] Taro Watanabe and Eiichiro Sumita. 2015. Transition-based neural constituent parsing. In Proceedings of the 53rd Annual Meeting of the Association for Computational Linguistics and the 7th International Joint Conference on Natural Language Processing, pages 1169–1179.

[Weiss et al.2015] David Weiss, Chris Alberti, Michael Collins, and Slav Petrov. 2015. Structured training for neural network transition-based parsing. In Proceedings of the 53rd Annual Meeting of the Association for Computational Linguistics, pages 323–333.

[Yao et al.2014] Kaisheng Yao, Baolin Peng, Geoffrey Zweig, Dong Yu, Xiaolong Li, and Feng Gao. 2014. Recurrent conditional random field for language understanding. In IEEE International Conference on Acoustics, Speech, and Signal Processing (ICASSP ’14).

[Yazdani and Henderson2015] Majid Yazdani and James Henderson. 2015. Incremental recurrent neural network dependency parser with search-based discriminative training. In Proceedings of the Nineteenth Conference on Computational Natural Language Learning, pages 142–152.

[Zhang and McDonald2014] Hao Zhang and Ryan McDonald. 2014. Enforcing structural diversity in cube-pruned dependency parsing. In Proceedings of the 52nd Annual Meeting of the Association for Computational Linguistics, pages 656–661.
[Zheng et al.2015] Shuai Zheng, Sadeep Jayasumana, Bernardino Romera-Paredes, Vibhav Vineet, Zhizhong Su, Dalong Du, Chang Huang, and Philip H. S. Torr. 2015. Conditional random fields as recurrent neural networks. In The IEEE International Conference on Computer Vision (ICCV), pages 1529–1537.

[Zhou and Xu2015] Jie Zhou and Wei Xu. 2015. End-to-end learning of semantic role labeling using recurrent neural networks. In Proceedings of the 53rd Annual Meeting of the Association for Computational Linguistics and the 7th International Joint Conference on Natural Language Processing, pages 1127–1137.

[Zhou et al.2015] Hao Zhou, Yue Zhang, and Jiajun Chen. 2015. A neural probabilistic structured-prediction model for transition-based dependency parsing. In Proceedings of the 53rd Annual Meeting of the Association for Computational Linguistics, pages 1213–1222.
