Globally Normalized Transition-Based Neural Networks
∗ On leave from Columbia University.

[...] transition-based neural network model that achieves state-of-the-art part-of-speech tagging, dependency parsing and sentence compression results. Our model is a simple feed-forward neural network that operates on a task-specific transition system, yet achieves accuracies comparable to or better than those of recurrent models. We discuss the importance of global as opposed to local normalization: a key insight is that the label bias problem implies that globally normalized models can be strictly more expressive than locally normalized models.

1 Introduction

Neural network approaches have taken the field of natural language processing (NLP) by storm. In particular, variants of long short-term memory (LSTM) networks (Hochreiter and Schmidhuber, 1997) have produced impressive results on some of the classic NLP tasks such as part-of-speech tagging (Ling et al., 2015), syntactic parsing (Vinyals et al., 2015) and semantic role labeling (Zhou and Xu, 2015). One might speculate that it is the recurrent nature of these models that enables these results.

In this work we demonstrate that simple feed-forward networks without any recurrence can achieve accuracies comparable to or better than those of LSTMs, as long as they are globally normalized. Our model, described in detail in Section 2, uses a transition system (Nivre, 2006) and feature embeddings as introduced by [...] normalization with a conditional random field (CRF) objective (Bottou et al., 1997; Le Cun et al., 1998; Lafferty et al., 2001; Collobert et al., 2011) to overcome the label bias problem that locally normalized models suffer from. Since we use beam inference, we approximate the partition function by summing over the elements in the beam, and use early updates (Collins and Roark, 2004; Zhou et al., 2015). We compute gradients based on this approximate global normalization and perform full backpropagation training of all neural network parameters based on the CRF loss.

In Section 3 we revisit the label bias problem and the implication that globally normalized models are strictly more expressive than locally normalized models. Lookahead features can partially mitigate this discrepancy, but cannot fully compensate for it, a point to which we return later. To empirically demonstrate the effectiveness of global normalization, we evaluate our model on part-of-speech tagging, syntactic dependency parsing and sentence compression (Section 4). Our model achieves state-of-the-art accuracy on all of these tasks, matching or outperforming LSTMs while being significantly faster. In particular, for dependency parsing on the Wall Street Journal we achieve the best-ever published unlabeled attachment score of 94.61%.

As discussed in more detail in Section 5, we also outperform previous structured training approaches used for neural network transition-based parsing. Our ablation experiments show that we outperform Weiss et al. (2015) and Alberti et al. (2015) because we do global backpropagation training of all model parameters, while they fix the neural network parameters when training the global part of their model. We
also outperform Zhou et al. (2015) despite using a smaller beam. To shed additional light on the label bias problem in practice, we provide a sentence compression example where the local model completely fails. We then demonstrate that a globally normalized parsing model without any lookahead features is almost as accurate as our best model, while a locally normalized model loses more than 10% absolute in accuracy because it cannot effectively incorporate evidence as it becomes available.

Finally, we provide an open-source implementation of our method, called SyntaxNet,¹ which we have integrated into the popular TensorFlow² framework. We also provide a pre-trained, state-of-the-art English dependency parser called “Parsey McParseface,” which we tuned for a balance of speed, simplicity, and accuracy.

¹ http://github.com/tensorflow/models/tree/master/syntaxnet
² http://www.tensorflow.org

2 Model

At its core, our model is an incremental transition-based parser (Nivre, 2006). To apply it to different tasks we only need to adjust the transition system and the input features.

2.1 Transition System

Given an input $x$, most often a sentence, we define:

• A set of states $S(x)$.
• A special start state $s^\dagger \in S(x)$.
• A set of allowed decisions $A(s, x)$ for all $s \in S(x)$.
• A transition function $t(s, d, x)$ returning a new state $s'$ for any decision $d \in A(s, x)$.

We will use a function $\rho(s, d, x; \theta)$ to compute the score of decision $d$ in state $s$ for input $x$. The vector $\theta$ contains the model parameters and we assume that $\rho(s, d, x; \theta)$ is differentiable with respect to $\theta$.

In this section, for brevity, we will drop the dependence on $x$ in the functions given above, simply writing $S$, $A(s)$, $t(s, d)$, and $\rho(s, d; \theta)$.

Throughout this work we will use transition systems in which all complete structures for the same input $x$ have the same number of decisions $n(x)$ (or $n$ for brevity). In dependency parsing for example, this is true for both the arc-standard and arc-eager transition systems (Nivre, 2006), where for a sentence $x$ of length $m$, the number of decisions for any complete parse is $n(x) = 2 \times m$.³ A complete structure is then a sequence of decision/state pairs $(s_1, d_1) \ldots (s_n, d_n)$ such that $s_1 = s^\dagger$, $d_i \in A(s_i)$ for $i = 1 \ldots n$, and $s_{i+1} = t(s_i, d_i)$. We use the notation $d_{1:j}$ to refer to a decision sequence $d_1 \ldots d_j$.

We assume that there is a one-to-one mapping between decision sequences $d_{1:j-1}$ and states $s_j$: that is, we essentially assume that a state encodes the entire history of decisions. Thus, each state can be reached by a unique decision sequence from $s^\dagger$.⁴ We will use decision sequences $d_{1:j-1}$ and states interchangeably: in a slight abuse of notation, we define $\rho(d_{1:j-1}, d; \theta)$ to be equal to $\rho(s, d; \theta)$ where $s$ is the state reached by the decision sequence $d_{1:j-1}$.

The scoring function $\rho(s, d; \theta)$ can be defined in a number of ways. In this work, following Chen and Manning (2014), Weiss et al. (2015), and Zhou et al. (2015), we define it via a feed-forward neural network as

\[
\rho(s, d; \theta) = \phi(s; \theta^{(l)}) \cdot \theta^{(d)}.
\]

Here $\theta^{(l)}$ are the parameters of the neural network, excluding the parameters at the final layer. $\theta^{(d)}$ are the final layer parameters for decision $d$. $\phi(s; \theta^{(l)})$ is the representation for state $s$ computed by the neural network under parameters $\theta^{(l)}$. Note that the score is linear in the parameters $\theta^{(d)}$. We next describe how softmax-style normalization can be performed at the local or global level.

³ Note that this is not true for the swap transition system defined in Nivre (2009).
⁴ It is straightforward to extend the approach to make use of dynamic programming in the case where the same state can be reached by multiple decision sequences.

2.2 Global vs. Local Normalization

In the Chen and Manning (2014) style of greedy neural network parsing, the conditional probability distribution over decisions $d_j$ given context $d_{1:j-1}$ is defined as

\[
p(d_j \mid d_{1:j-1}; \theta) = \frac{\exp \rho(d_{1:j-1}, d_j; \theta)}{Z_L(d_{1:j-1}; \theta)}, \tag{1}
\]

where

\[
Z_L(d_{1:j-1}; \theta) = \sum_{d' \in A(d_{1:j-1})} \exp \rho(d_{1:j-1}, d'; \theta).
\]
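The scoring function and the local normalization of Eq. (1) can be illustrated with a small NumPy sketch. This is not the SyntaxNet implementation: the single ReLU hidden layer, the parameter shapes, and the names `score_decisions` and `local_log_probs` are assumptions chosen only to make the formulas concrete.

```python
import numpy as np

def score_decisions(state_features, w_hidden, b_hidden, theta_d):
    """Raw scores rho(s, d; theta) = phi(s; theta_l) . theta_d for all d.

    The one-layer ReLU network standing in for phi is an assumption.
    """
    phi = np.maximum(0.0, w_hidden @ state_features + b_hidden)
    return theta_d @ phi  # one raw score per decision

def local_log_probs(scores, allowed):
    """Eq. (1): log-softmax restricted to the allowed decisions A(s).

    Assumes at least one decision is allowed in the state.
    """
    masked = np.where(allowed, scores, -np.inf)
    peak = masked.max()
    log_z_local = peak + np.log(np.exp(masked - peak).sum())  # ln Z_L
    return masked - log_z_local

# Toy state with 5 input features, 4 hidden units, 3 decisions.
rng = np.random.default_rng(0)
features = rng.normal(size=5)
w_hidden = rng.normal(size=(4, 5))
b_hidden = np.zeros(4)
theta_d = rng.normal(size=(3, 4))

scores = score_decisions(features, w_hidden, b_hidden, theta_d)
allowed = np.array([True, False, True])  # decision 1 is not in A(s)
log_probs = local_log_probs(scores, allowed)
```

Disallowed decisions receive log-probability negative infinity, so the probabilities of the allowed decisions sum to one in every state. This per-state renormalization is exactly what the global model avoids.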
Each $Z_L(d_{1:j-1}; \theta)$ is a local normalization term. The probability of a sequence of decisions $d_{1:n}$ is

\[
p_L(d_{1:n}) = \prod_{j=1}^{n} p(d_j \mid d_{1:j-1}; \theta)
= \frac{\exp \sum_{j=1}^{n} \rho(d_{1:j-1}, d_j; \theta)}{\prod_{j=1}^{n} Z_L(d_{1:j-1}; \theta)}. \tag{2}
\]

Beam search can be used to attempt to find the maximum of Eq. (2) with respect to $d_{1:n}$. The additive scores used in beam search are the log-softmax of each decision, $\ln p(d_j \mid d_{1:j-1}; \theta)$, not the raw scores $\rho(d_{1:j-1}, d_j; \theta)$.

In contrast, a Conditional Random Field (CRF) defines a distribution $p_G(d_{1:n})$ as follows:

\[
p_G(d_{1:n}) = \frac{\exp \sum_{j=1}^{n} \rho(d_{1:j-1}, d_j; \theta)}{Z_G(\theta)}, \tag{3}
\]

[...]

A significant practical advantage of the locally normalized cost Eq. (4) is that the local partition function $Z_L$ and its derivative can usually be computed efficiently. In contrast, the $Z_G$ term in Eq. (5) contains a sum over $d'_{1:n} \in D_n$ that is in many cases intractable.

To make learning tractable with the globally normalized model, we use beam search and early updates (Collins and Roark, 2004; Zhou et al., 2015). As the training sequence is being decoded, we keep track of the location of the gold path in the beam. If the gold path falls out of the beam at step $j$, a stochastic gradient step is taken on the following objective:

\[
L_{\text{global-beam}}(d^*_{1:j}; \theta) = -\sum_{i=1}^{j} \rho(d^*_{1:i-1}, d^*_i; \theta) + \ln \sum_{d'_{1:j} \in B_j} \exp \sum_{i=1}^{j} \rho(d'_{1:i-1}, d'_i; \theta). \tag{6}
\]
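The beam objective of Eq. (6) can be sketched numerically as follows. This is a minimal illustration, not the authors' training code: the raw scores are made-up numbers, the helper names are hypothetical, and the sketch assumes that the set $B_j$ passed in already contains the gold prefix alongside the paths in the beam.

```python
import numpy as np

def logsumexp(values):
    """Numerically stable ln(sum(exp(values)))."""
    values = np.asarray(values, dtype=float)
    peak = values.max()
    return peak + np.log(np.exp(values - peak).sum())

def global_beam_loss(gold_scores, beam_scores):
    """Eq. (6) for a single early update at step j.

    gold_scores: raw scores rho(d*_{1:i-1}, d*_i) for i = 1..j.
    beam_scores: one list of per-step raw scores for each path in B_j.
    """
    gold_total = float(np.sum(gold_scores))
    path_totals = [float(np.sum(path)) for path in beam_scores]
    # The beam sum stands in for the intractable global partition function.
    return -gold_total + logsumexp(path_totals)

# Toy example with made-up scores and j = 2.
gold = [1.0, 2.0]
beam = [gold, [0.5, 0.5]]  # assumption: gold prefix is included in B_j
loss = global_beam_loss(gold, beam)
loss_only_gold = global_beam_loss(gold, [gold])
```

Unlike the locally normalized case, the raw scores are summed along each path before any normalization, so a path can recover from a low-scoring early decision. In a real system the gradient of this loss would be backpropagated through every $\rho$ term; the sketch only evaluates the loss value.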
Table 1: Final POS tagging test set results on English WSJ and Treebank Union as well as CoNLL’09. We also show the
performance of our pre-trained open source model, “Parsey McParseface.”
network in these experiments has a single hidden layer with 256 units on WSJ and Treebank Union and 64 on CoNLL’09.

Results. In Table 1 we compare our model to a linear CRF and to the compositional character-to-word LSTM model of Ling et al. (2015). The CRF is a first-order linear model with exact inference and the same emission features as our model. It additionally has transition features of the word, cluster and character n-gram up to length 3 on both endpoints of the transition. The results for Ling et al. (2015) were solicited from the authors.

Our local model already compares favorably against these methods on average. Using beam search with a locally normalized model does not help, but with global normalization it leads to a 7% reduction in relative error, empirically demonstrating the effect of label bias. The character n-gram feature set is very important, increasing average accuracy on the CoNLL’09 datasets by about 0.5% absolute. This shows that character-level modeling can also be done with a simple feed-forward network without recurrence.

4.2 Dependency Parsing

In dependency parsing the goal is to produce a directed tree representing the syntactic structure of the input sentence.

Data & Evaluation. We use the same corpora as in our POS tagging experiments, except that we use the standard parsing splits of the WSJ. To avoid over-fitting to the development set (Sec. 22), we use Sec. 24 for tuning the hyperparameters of our models. We convert the English constituency trees to Stanford style dependencies (De Marneffe et al., 2006) using version 3.3.0 of the converter. For English, we use predicted POS tags (the same POS tags are used for all models) and exclude punctuation from the evaluation, as is standard. For the CoNLL ’09 datasets we follow standard practice and include all punctuation in the evaluation. We follow Alberti et al. (2015) and use our own predicted POS tags so that we can include a k-best tag feature (see below) but use the supplied predicted morphological features. We report unlabeled and labeled attachment scores (UAS/LAS).

Model Configuration. Our model configuration is basically the same as the one originally proposed by Chen and Manning (2014) and then refined by Weiss et al. (2015). In particular, we use the arc-standard transition system and extract the same set of features as prior work: words, part-of-speech tags, and dependency arcs and labels in the surrounding context of the state, as well as k-best tags as proposed by Alberti et al. (2015). We use two hidden layers of 1,024 dimensions each.

Results. Tables 2 and 3 show our final parsing results and a comparison to the best systems from the literature. We obtain the best ever published results on almost all datasets, including the WSJ. Our main results use the same pre-trained word embeddings as Weiss et al. (2015) and Alberti et al. (2015), but no tri-training. When we artificially restrict ourselves to not use pre-trained word embeddings, we observe only a modest drop of ∼0.5% UAS; for example, training only on the WSJ yields 94.08% UAS and 92.15% LAS for our global model with a beam of size 32. Even though we do not use tri-training, our model compares favorably to the 94.26% UAS and 92.41% LAS reported by Weiss et al. (2015) with tri-training. As we show in Sec. 5, these gains can be attributed to the full backpropagation training that differentiates our approach from that of Weiss et al. (2015) and Alberti et al. (2015). Our results also significantly outperform the LSTM-based approaches of Dyer et al. (2015) and
                                  WSJ           Union-News     Union-Web      Union-QTB
Method                          UAS    LAS     UAS    LAS     UAS    LAS     UAS    LAS
Martins et al. (2013)⋆         92.89  90.55   93.10  91.13   88.23  85.04   94.21  91.54
Zhang and McDonald (2014)⋆     93.22  91.02   93.32  91.48   88.65  85.59   93.37  90.69
Weiss et al. (2015)            93.99  92.05   93.91  92.25   89.29  86.44   94.17  92.06
Alberti et al. (2015)          94.23  92.36   94.10  92.55   89.55  86.85   94.74  93.04
Our Local (B=1)                92.95  91.02   93.11  91.46   88.42  85.58   92.49  90.38
Our Local (B=32)               93.59  91.70   93.65  92.03   88.96  86.17   93.22  91.17
Our Global (B=32)              94.61  92.79   94.44  92.93   90.17  87.54   95.40  93.64
Parsey McParseface (B=8)         -      -     94.15  92.51   89.08  86.29   94.77  93.17

Table 2: Final English dependency parsing test set results. We note that training our system using only the WSJ corpus (i.e. no pre-trained embeddings or other external resources) yields 94.08% UAS and 92.15% LAS for our global model with beam 32.
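For readers unfamiliar with the UAS and LAS columns, the following sketch shows one common way to compute the two scores for a single sentence. It is an illustration under stated assumptions, not the evaluation script used for the tables: the function name `attachment_scores`, the `(head, label)` token encoding, and the punctuation mask are all hypothetical.

```python
def attachment_scores(gold, predicted, is_punct=None):
    """Return (UAS, LAS) in percent for one sentence.

    gold, predicted: lists of (head_index, label) pairs, one per token.
    is_punct: optional list of booleans; True tokens are excluded,
    mirroring an evaluation setting that ignores punctuation.
    """
    if len(gold) != len(predicted):
        raise ValueError("gold and predicted must have the same length")
    if is_punct is None:
        is_punct = [False] * len(gold)
    total = correct_heads = correct_labeled = 0
    for (g_head, g_label), (p_head, p_label), punct in zip(gold, predicted, is_punct):
        if punct:
            continue
        total += 1
        if g_head == p_head:
            correct_heads += 1          # counts toward UAS
            if g_label == p_label:
                correct_labeled += 1    # head and label both right: LAS
    if total == 0:
        raise ValueError("no tokens left to score")
    return 100.0 * correct_heads / total, 100.0 * correct_labeled / total

# Toy four-token sentence; the last token is punctuation.
gold = [(2, "nsubj"), (0, "root"), (2, "dobj"), (2, "punct")]
pred = [(2, "nsubj"), (0, "root"), (2, "iobj"), (3, "punct")]
punct = [False, False, False, True]

uas_no_punct, las_no_punct = attachment_scores(gold, pred, punct)
uas_all, las_all = attachment_scores(gold, pred)
```

In the toy sentence, excluding the punctuation token raises both scores because the only head error falls on that token, which shows why the punctuation convention has to be fixed before numbers are compared across papers.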
Table 6: Example sentence compressions where the label bias of the locally normalized model leads to a breakdown during
beam search. The probability of each compression under the local (pL ) and global (pG ) models shows that only the global
model can properly represent zero probability for the empty compression.
considering only tokens $x_{1:i}$; hence unlike the full parsing model, there is no ability to look ahead in the sentence when making a decision.⁷ The result for a greedy model under this constraint is 76.96% UAS; for a locally normalized model with beam search it is 81.35%; and for a globally normalized model it is 93.60%. Thus the globally normalized model gets very close to the performance of a model with full lookahead, while the locally normalized model with a beam gives dramatically lower performance. In our final experiments with full lookahead, the globally normalized model achieves 94.01% accuracy, compared to 93.07% accuracy for a local model with beam search. Thus adding lookahead allows the local model to narrow the gap in performance to the global model; however, there is still a significant difference in accuracy, which may in large part be due to the label bias problem.

A number of authors have considered modified training procedures for greedy models, or for locally normalized models. Daumé III et al. (2009) introduce Searn, an algorithm that allows a classifier making greedy decisions to become more robust to errors made in previous decisions. Goldberg and Nivre (2013) describe improvements to a greedy parsing approach that makes use of methods from imitation learning (Ross et al., 2011) to augment the training set. Note that these methods are focused on greedy models: they are unlikely to solve the label bias problem when used in conjunction with beam search, given that the problem is one of expressivity of the underlying model. More recent work (Yazdani and Henderson, 2015; Vaswani and Sagae, 2016) has augmented locally normalized models with correctness probabilities or error states, effectively adding a step after every decision where the probability of correctness of the resulting structure is evaluated. This gives considerable gains over a locally normalized model, although performance is lower than our full globally normalized approach.

⁷ This setting may be important in some applications, where for example parse structures for sentence prefixes are required, or where the input is received one word at a time and online processing is beneficial.

6 Conclusions

We presented a simple and yet powerful model architecture that produces state-of-the-art results for POS tagging, dependency parsing and sentence compression. Our model combines the flexibility of transition-based algorithms and the modeling power of neural networks. Our results demonstrate that feed-forward networks without recurrence can outperform recurrent models such as LSTMs when they are trained with global normalization. We further support our empirical findings with a proof showing that global normalization helps the model overcome the label bias problem from which locally normalized models suffer.

Acknowledgements

We would like to thank Ling Wang for training his C2W part-of-speech tagger on our setup, and Emily Pitler, Ryan McDonald, Greg Coppola and Fernando Pereira for tremendously helpful discussions. Finally, we are grateful to all members of the Google Parsing Team.

References

[Abney et al.1999] Steven Abney, David McAllester, and Fernando Pereira. 1999. Relating probabilistic grammars and automata. Proceedings of the 37th Annual Meeting of the Association for Computational Linguistics, pages 131–160.

[Alberti et al.2015] Chris Alberti, David Weiss, Greg Coppola, and Slav Petrov. 2015. Improved transition-based parsing and tagging with neural networks. In Proceedings of the 2015 Conference on Empirical Methods in Natural Language Processing, pages 1354–1359.

[Ballesteros et al.2015] Miguel Ballesteros, Chris Dyer, and Noah A. Smith. 2015. Improved transition-based parsing by modeling characters instead of words with LSTMs. In Proceedings of the 2015
Conference on Empirical Methods in Natural Language Processing, pages 349–359.

[Bohnet and Nivre2012] Bernd Bohnet and Joakim Nivre. 2012. A transition-based system for joint part-of-speech tagging and labeled non-projective dependency parsing. In Proceedings of the 2012 Joint Conference on Empirical Methods in Natural Language Processing and Computational Natural Language Learning, pages 1455–1465.

[Bottou and LeCun2005] Léon Bottou and Yann LeCun. 2005. Graph transformer networks for image recognition. Bulletin of the International Statistical Institute (ISI).

[Bottou et al.1997] Léon Bottou, Yann Le Cun, and Yoshua Bengio. 1997. Global training of document processing systems using graph transformer networks. In Proceedings of Computer Vision and Pattern Recognition (CVPR), pages 489–493.

[Bottou1991] Léon Bottou. 1991. Une approche théorique de l'apprentissage connexionniste: Applications à la reconnaissance de la parole. Ph.D. thesis, Doctoral dissertation, Université de Paris XI.

[Chen and Manning2014] Danqi Chen and Christopher D. Manning. 2014. A fast and accurate dependency parser using neural networks. In Proceedings of the 2014 Conference on Empirical Methods in Natural Language Processing, pages 740–750.

[Chi1999] Zhiyi Chi. 1999. Statistical properties of probabilistic context-free grammars. Computational Linguistics, pages 131–160.

[Collins and Roark2004] Michael Collins and Brian Roark. 2004. Incremental parsing with the perceptron algorithm. In Proceedings of the 42nd Meeting of the Association for Computational Linguistics (ACL’04), pages 111–118.

[Collins1999] Michael Collins. 1999. Head-Driven Statistical Models for Natural Language Parsing. Ph.D. thesis, University of Pennsylvania.

[Collobert et al.2011] Ronan Collobert, Jason Weston, Léon Bottou, Michael Karlen, Koray Kavukcuoglu, and Pavel Kuksa. 2011. Natural language processing (almost) from scratch. The Journal of Machine Learning Research, 12:2493–2537.

[Daumé III et al.2009] Hal Daumé III, John Langford, and Daniel Marcu. 2009. Search-based structured prediction. Machine Learning Journal (MLJ), 75(3):297–325.

[De Marneffe et al.2006] Marie-Catherine De Marneffe, Bill MacCartney, and Christopher D. Manning. 2006. Generating typed dependency parses from phrase structure parses. In Proceedings of Fifth International Conference on Language Resources and Evaluation, pages 449–454.

[Do and Artières2010] Trinh Minh Tri Do and Thierry Artières. 2010. Neural conditional random fields. In International Conference on Artificial Intelligence and Statistics, volume 9, pages 177–184.

[Durrett and Klein2015] Greg Durrett and Dan Klein. 2015. Neural CRF parsing. In Proceedings of the 53rd Annual Meeting of the Association for Computational Linguistics and the 7th International Joint Conference on Natural Language Processing, pages 302–312.

[Dyer et al.2015] Chris Dyer, Miguel Ballesteros, Wang Ling, Austin Matthews, and Noah A. Smith. 2015. Transition-based dependency parsing with stack long short-term memory. In Proceedings of the 53rd Annual Meeting of the Association for Computational Linguistics, pages 334–343.

[Filippova et al.2015] Katja Filippova, Enrique Alfonseca, Carlos A. Colmenares, Łukasz Kaiser, and Oriol Vinyals. 2015. Sentence compression by deletion with LSTMs. In Proceedings of the 2015 Conference on Empirical Methods in Natural Language Processing, pages 360–368.

[Goldberg and Nivre2013] Yoav Goldberg and Joakim Nivre. 2013. Training deterministic parsers with non-deterministic oracles. Transactions of the Association for Computational Linguistics, 1:403–414.

[Hajič et al.2009] Jan Hajič, Massimiliano Ciaramita, Richard Johansson, Daisuke Kawahara, Maria Antònia Martí, Lluís Màrquez, Adam Meyers, Joakim Nivre, Sebastian Padó, Jan Štěpánek, Pavel Straňák, Mihai Surdeanu, Nianwen Xue, and Yi Zhang. 2009. The CoNLL-2009 shared task: Syntactic and semantic dependencies in multiple languages. In Proceedings of the Thirteenth Conference on Computational Natural Language Learning: Shared Task, pages 1–18.

[Henderson2003] James Henderson. 2003. Inducing history representations for broad coverage statistical parsing. In Proceedings of the 2003 Human Language Technology Conference of the North American Chapter of the Association for Computational Linguistics, pages 24–31.

[Henderson2004] James Henderson. 2004. Discriminative training of a neural network statistical parser. In Proceedings of the 42nd Meeting of the Association for Computational Linguistics (ACL’04), pages 95–102.

[Hochreiter and Schmidhuber1997] Sepp Hochreiter and Jürgen Schmidhuber. 1997. Long short-term memory. Neural computation, 9(8):1735–1780.

[Hovy et al.2006] Eduard Hovy, Mitchell Marcus, Martha Palmer, Lance Ramshaw, and Ralph Weischedel. 2006. OntoNotes: The 90% solution. In Proceedings of the Human Language Technology Conference of the NAACL, Short Papers, pages 57–60.
[Huang et al.2015] Zhiheng Huang, Wei Xu, and Kai Yu. 2015. Bidirectional LSTM-CRF models for sequence tagging. arXiv preprint arXiv:1508.01991.

[Judge et al.2006] John Judge, Aoife Cahill, and Josef van Genabith. 2006. Questionbank: Creating a corpus of parse-annotated questions. In Proceedings of the 21st International Conference on Computational Linguistics and 44th Annual Meeting of the Association for Computational Linguistics, pages 497–504.

[Lafferty et al.2001] John Lafferty, Andrew McCallum, and Fernando Pereira. 2001. Conditional random fields: Probabilistic models for segmenting and labeling sequence data. In Proceedings of the Eighteenth International Conference on Machine Learning, pages 282–289.

[Le Cun et al.1998] Yann Le Cun, Léon Bottou, Yoshua Bengio, and Patrick Haffner. 1998. Gradient based learning applied to document recognition. Proceedings of IEEE, 86(11):2278–2324.

[Lei et al.2014] Tao Lei, Yu Xin, Yuan Zhang, Regina Barzilay, and Tommi Jaakkola. 2014. Low-rank tensors for scoring dependency structures. In Proceedings of the 52nd Annual Meeting of the Association for Computational Linguistics, pages 1381–1391.

[Liang et al.2008] Percy Liang, Hal Daumé, III, and Dan Klein. 2008. Structure compilation: Trading structure for features. In Proceedings of the 25th International Conference on Machine Learning, pages 592–599.

[Ling et al.2015] Wang Ling, Chris Dyer, Alan W Black, Isabel Trancoso, Ramon Fermandez, Silvio Amir, Luis Marujo, and Tiago Luis. 2015. Finding function in form: Compositional character models for open vocabulary word representation. In Proceedings of the 2015 Conference on Empirical Methods in Natural Language Processing, pages 1520–1530.

[Marcus et al.1993] Mitchell P. Marcus, Beatrice Santorini, and Mary Ann Marcinkiewicz. 1993. Building a large annotated corpus of English: The Penn Treebank. Computational Linguistics, 19(2):313–330.

[Martins et al.2013] Andre Martins, Miguel Almeida, and Noah A. Smith. 2013. Turning on the turbo: Fast third-order non-projective turbo parsers. In Proceedings of the 51st Annual Meeting of the Association for Computational Linguistics, pages 617–622.

[Nivre2006] Joakim Nivre. 2006. Inductive Dependency Parsing. Springer-Verlag New York, Inc.

[Nivre2009] Joakim Nivre. 2009. Non-projective dependency parsing in expected linear time. In Proceedings of the Joint Conference of the 47th Annual Meeting of the ACL and the 4th International Joint Conference on Natural Language Processing of the AFNLP, pages 351–359.

[Peng et al.2009] Jian Peng, Liefeng Bo, and Jinbo Xu. 2009. Conditional neural fields. In Advances in Neural Information Processing Systems 22, pages 1419–1427.

[Petrov and McDonald2012] Slav Petrov and Ryan McDonald. 2012. Overview of the 2012 shared task on parsing the web. Notes of the First Workshop on Syntactic Analysis of Non-Canonical Language (SANCL).

[Ross et al.2011] Stéphane Ross, Geoffrey J. Gordon, and J. Andrew Bagnell. 2011. No-regret reductions for imitation learning and structured prediction. AISTATS.

[Smith and Johnson2007] Noah Smith and Mark Johnson. 2007. Weighted and probabilistic context-free grammars are equally expressive. Computational Linguistics, pages 477–491.

[Vaswani and Sagae2016] Ashish Vaswani and Kenji Sagae. 2016. Efficient structured inference for transition-based parsing with neural networks and error states. Transactions of the Association for Computational Linguistics, 4:183–196.

[Vinyals et al.2015] Oriol Vinyals, Łukasz Kaiser, Terry Koo, Slav Petrov, Ilya Sutskever, and Geoffrey Hinton. 2015. Grammar as a foreign language. In Advances in Neural Information Processing Systems 28, pages 2755–2763.

[Watanabe and Sumita2015] Taro Watanabe and Eiichiro Sumita. 2015. Transition-based neural constituent parsing. In Proceedings of the 53rd Annual Meeting of the Association for Computational Linguistics and the 7th International Joint Conference on Natural Language Processing, pages 1169–1179.

[Weiss et al.2015] David Weiss, Chris Alberti, Michael Collins, and Slav Petrov. 2015. Structured training for neural network transition-based parsing. In Proceedings of the 53rd Annual Meeting of the Association for Computational Linguistics, pages 323–333.

[Yao et al.2014] Kaisheng Yao, Baolin Peng, Geoffrey Zweig, Dong Yu, Xiaolong Li, and Feng Gao. 2014. Recurrent conditional random field for language understanding. In IEEE International Conference on Acoustics, Speech, and Signal Processing (ICASSP ’14).

[Yazdani and Henderson2015] Majid Yazdani and James Henderson. 2015. Incremental recurrent neural network dependency parser with search-based discriminative training. In Proceedings of the Nineteenth Conference on Computational Natural Language Learning, pages 142–152.

[Zhang and McDonald2014] Hao Zhang and Ryan McDonald. 2014. Enforcing structural diversity in cube-pruned dependency parsing. In Proceedings of the 52nd Annual Meeting of the Association for Computational Linguistics, pages 656–661.
[Zheng et al.2015] Shuai Zheng, Sadeep Jayasumana, Bernardino Romera-Paredes, Vibhav Vineet, Zhizhong Su, Dalong Du, Chang Huang, and Philip H. S. Torr. 2015. Conditional random fields as recurrent neural networks. In The IEEE International Conference on Computer Vision (ICCV), pages 1529–1537.

[Zhou and Xu2015] Jie Zhou and Wei Xu. 2015. End-to-end learning of semantic role labeling using recurrent neural networks. In Proceedings of the 53rd Annual Meeting of the Association for Computational Linguistics and the 7th International Joint Conference on Natural Language Processing, pages 1127–1137.

[Zhou et al.2015] Hao Zhou, Yue Zhang, and Jiajun Chen. 2015. A neural probabilistic structured-prediction model for transition-based dependency parsing. In Proceedings of the 53rd Annual Meeting of the Association for Computational Linguistics, pages 1213–1222.