Tensor2Tensor For Neural Machine Translation
Ashish Vaswani1 , Samy Bengio1 , Eugene Brevdo1 , Francois Chollet1 , Aidan N. Gomez1 ,
Stephan Gouws1 , Llion Jones1 , Łukasz Kaiser1, 3 , Nal Kalchbrenner2 , Niki Parmar1 ,
Ryan Sepassi1, 4 , Noam Shazeer1 , and Jakob Uszkoreit1
arXiv:1803.07416v1 [cs.LG] 16 Mar 2018
1 Google Brain
2 DeepMind
3 Corresponding author: [email protected]
4 Corresponding author: [email protected]
Abstract
Tensor2Tensor is a library for deep learning models that is well-suited for neural machine trans-
lation and includes the reference implementation of the state-of-the-art Transformer model.
2 Self-Attention
Instead of convolutions, one can use stacked self-attention layers. This approach was introduced in the Transformer model (Vaswani et al., 2017) and has significantly improved the state of the art in machine translation and language modeling while also improving the speed of training. Research continues into applying the model in more domains and exploring the space of self-attention mechanisms. It is clear that self-attention is a powerful tool in general-purpose sequence modeling.
Figure 1: The Transformer model architecture.
While RNNs represent sequence history in their hidden state, the Transformer has no such
fixed-size bottleneck. Instead, each timestep has full, direct access to the history through the
dot-product attention mechanism. This both enables the model to learn more distant temporal
relationships and speeds up training, because there is no need to wait for a hidden state to
propagate across time. This comes at a cost in memory usage, as the attention mechanism
scales with t², where t is the length of the sequence. Future work may reduce this scaling
factor.
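The mechanism can be sketched in a few lines. The following is a minimal pure-Python illustration of single-head, unbatched scaled dot-product attention, not the Tensor2Tensor implementation; note that each query scores against every key, which is where the t² cost discussed above comes from.

```python
import math

def softmax(xs):
    # Numerically stable softmax over a list of floats.
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    s = sum(exps)
    return [e / s for e in exps]

def dot_product_attention(queries, keys, values):
    """Scaled dot-product attention over a single sequence.

    queries, keys, values: lists of t vectors (lists of floats) of
    dimension d. Every query attends to every key, so the full score
    matrix is t x t -- the quadratic memory cost discussed above.
    """
    d = len(keys[0])
    out = []
    for q in queries:
        # One row of the t x t score matrix: q . k / sqrt(d) for each key.
        scores = [sum(qi * ki for qi, ki in zip(q, k)) / math.sqrt(d)
                  for k in keys]
        weights = softmax(scores)
        # Output for this query: attention-weighted sum of value vectors.
        out.append([sum(w * v[j] for w, v in zip(weights, values))
                    for j in range(len(values[0]))])
    return out
```

When all keys are identical the weights are uniform, so each output is simply the mean of the value vectors, which makes the mechanism easy to sanity-check by hand.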
The Transformer model is illustrated in Figure 1. It uses stacked self-attention and point-
wise, fully connected layers for both the encoder and decoder, shown in the left and right halves
of Figure 1 respectively.
Encoder: The encoder is composed of a stack of identical layers. Each layer has two
sub-layers. The first is a multi-head self-attention mechanism, and the second is a simple,
positionwise fully connected feed-forward network.
Decoder: The decoder is also composed of a stack of identical layers. In addition to the
two sub-layers in each encoder layer, the decoder inserts a third sub-layer, which performs
multi-head attention over the output of the encoder stack.
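The encoder structure described above can be sketched schematically as follows. The sub-layers are passed in as callables, and residual connections and layer normalization (visible in Figure 1) are omitted for brevity; this is an illustrative skeleton, not the Tensor2Tensor implementation.

```python
def encoder_layer(x, self_attention, feed_forward):
    """One encoder layer as described above: a self-attention sub-layer
    followed by a position-wise feed-forward sub-layer.

    x: list of t position vectors. Residuals and layer norm are omitted.
    """
    x = self_attention(x)                # every position attends to all positions
    return [feed_forward(v) for v in x]  # applied identically at each position

def encoder(x, layers):
    # The encoder is a stack of identical layers.
    for layer_attn, layer_ffn in layers:
        x = encoder_layer(x, layer_attn, layer_ffn)
    return x
```

The decoder has the same shape, with a third sub-layer attending over the encoder output inserted between the two shown here.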
Table 1: Maximum path lengths, per-layer complexity and minimum number of sequential
operations for different layer types. n is the sequence length, d is the representation dimension,
k is the kernel size of convolutions and r the size of the neighborhood in restricted self-attention.
Layer Type                    Complexity per Layer   Sequential Operations   Maximum Path Length
Self-Attention                O(n² · d)              O(1)                    O(1)
Recurrent                     O(n · d²)              O(n)                    O(n)
Convolutional                 O(k · n · d²)          O(1)                    O(log_k(n))
Self-Attention (restricted)   O(r · n · d)           O(1)                    O(n/r)
More details about multi-head attention and overall architecture can be found in Vaswani
et al. (2017).
These hyperparameters were chosen after experimentation on the development set. We set the
maximum output length during inference to input length + 50, but terminate early when
possible (Wu et al., 2016).
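This length policy can be sketched as a simple greedy decode loop; `next_token` and the `EOS` id below are illustrative stand-ins for a real model and vocabulary, not Tensor2Tensor code.

```python
EOS = 1  # assumed end-of-sequence token id (illustrative)

def greedy_decode(next_token, input_length):
    """Greedy decoding with the length policy described above: cap the
    output at input_length + 50, but stop early once the model emits an
    end-of-sequence token.

    next_token: callable mapping the tokens decoded so far to the next
    token id (stands in for a trained model).
    """
    max_len = input_length + 50
    output = []
    while len(output) < max_len:
        tok = next_token(output)
        if tok == EOS:
            break  # terminate early when possible
        output.append(tok)
    return output
```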
3 Tensor2Tensor
Tensor2Tensor (T2T) is a library of deep learning models and datasets designed to make deep
learning research faster and more accessible. T2T uses TensorFlow (Abadi et al., 2016) through-
out and there is a strong focus on performance as well as usability. Through its use of Tensor-
Flow and various T2T-specific abstractions, researchers can train models on CPU, GPU (single
or multiple), and TPU, locally and in the cloud, usually with no or minimal device-specific code
or configuration.
Development began with a focus on neural machine translation, and so Tensor2Tensor includes
many of the most successful NMT models and standard datasets. It has since added support for
other task types across multiple modalities (text, images, video, audio). Both the number of
models and the number of datasets have grown significantly.
Usage is standardized across models and problems, which makes it easy to try a new model
on multiple problems or multiple models on a single problem. See Example Usage (Appendix
B) for some of the usability benefits of standardized commands and the unification of
datasets, models, and training, evaluation, and decoding procedures.
Development is done in the open on GitHub (http://github.com/tensorflow/tensor2tensor)
with many contributors inside and outside Google.
4 System Overview
There are five key components that specify a training run in Tensor2Tensor:
3. Hyperparameters: parameters that control the instantiation of the model and training pro-
cedure (for example, the number of hidden layers or the optimizer’s learning rate). These
are specified in code and named so they can be easily shared and reproduced.
4. Model: the model ties together the preceding components to instantiate the parameter-
ized transformation from inputs to targets, compute the loss and evaluation metrics, and
construct the optimization procedure.
5. Estimator and Experiment: these TensorFlow classes handle instantiating the runtime,
running the training loop, and executing basic support services such as model checkpointing,
logging, and alternation between training and evaluation.
These abstractions let users focus only on the component they are interested in experimenting
with. Users who wish to try models on a new problem usually only have to define that problem.
Users who wish to create or modify models only have to create a model or edit hyperparameters.
The other components remain untouched, out of the way, and available for use, all of which
reduces mental load and lets users iterate on their ideas at scale more quickly.
Appendix A contains an outline of the code and appendix B contains example usage.
• The Image Transformer (Parmar et al., 2018) extends the Transformer model to images. It
relies heavily on many of the attention building blocks in Tensor2Tensor and adds many
of its own.
• The Adafactor optimizer (pending publication), which significantly reduces memory re-
quirements for second-moment estimates, was developed within Tensor2Tensor and tried
on various models and problems.
2. End-to-end regression tests that run on a regular basis for important model-problem pairs
and verify that certain quality metrics are achieved.
3. Setting random seeds on multiple levels (Python, numpy, and TensorFlow) to mitigate
the effects of randomness (though this is effectively impossible to achieve in full in a
multithreaded, distributed, floating-point system).
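Seeding "on multiple levels" can be sketched as follows. Only the Python-level generator is seeded in this runnable sketch; the numpy and TensorFlow calls are noted in comments (API names as they were in the TF 1.x era this paper describes).

```python
import random

def set_seeds(seed):
    """Seed the random number generators at every level, as described above.

    In a real Tensor2Tensor run the numpy and TensorFlow generators would
    be seeded as well, e.g.:
        numpy.random.seed(seed)
        tensorflow.set_random_seed(seed)  # TF 1.x-era API
    Even then, multithreaded, distributed floating-point execution can
    still introduce nondeterminism.
    """
    random.seed(seed)

set_seeds(42)
a = [random.random() for _ in range(3)]
set_seeds(42)
b = [random.random() for _ in range(3)]
# With identical seeds, the Python-level draws are reproducible: a == b
```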
References
Abadi, M., Barham, P., Chen, J., Chen, Z., Davis, A., Dean, J., Devin, M., Ghemawat, S., Irving, G., Isard,
M., Kudlur, M., Levenberg, J., Monga, R., Moore, S., Murray, D. G., Steiner, B., Tucker, P., Vasudevan,
V., Warden, P., Wicke, M., Yu, Y., and Zheng, X. (2016). TensorFlow: A system for large-scale machine
learning. In 12th USENIX Symposium on Operating Systems Design and Implementation (OSDI 16),
pages 265–283.
Bahdanau, D., Cho, K., and Bengio, Y. (2014). Neural machine translation by jointly learning to align and
translate. CoRR, abs/1409.0473.
Cho, K., van Merrienboer, B., Gulcehre, C., Bougares, F., Schwenk, H., and Bengio, Y. (2014). Learn-
ing phrase representations using RNN encoder-decoder for statistical machine translation. CoRR,
abs/1406.1078.
Chollet, F. (2016). Xception: Deep learning with depthwise separable convolutions. arXiv preprint
arXiv:1610.02357.
Gehring, J., Auli, M., Grangier, D., Yarats, D., and Dauphin, Y. N. (2017). Convolutional sequence to
sequence learning. CoRR, abs/1705.03122.
Gomez, A. N., Ren, M., Urtasun, R., and Grosse, R. B. (2017). The reversible residual network: Back-
propagation without storing activations. CoRR, abs/1707.04585.
Hochreiter, S. and Schmidhuber, J. (1997). Long short-term memory. Neural computation, 9(8):1735–
1780.
Kaiser, Ł. and Bengio, S. (2016). Can active memory replace attention? In Advances in Neural Information
Processing Systems, pages 3781–3789.
Kaiser, L., Gomez, A. N., and Chollet, F. (2017). Depthwise separable convolutions for neural machine
translation. CoRR, abs/1706.03059.
Kalchbrenner, N. and Blunsom, P. (2013). Recurrent continuous translation models. In Proceedings
EMNLP 2013, pages 1700–1709.
Kalchbrenner, N., Espeholt, L., Simonyan, K., van den Oord, A., Graves, A., and Kavukcuoglu, K. (2016).
Neural machine translation in linear time. CoRR, abs/1610.10099.
Meng, F., Lu, Z., Wang, M., Li, H., Jiang, W., and Liu, Q. (2015). Encoding source language with
convolutional neural network for machine translation. In ACL, pages 20–30.
Parmar, N., Vaswani, A., Uszkoreit, J., Kaiser, Ł., Shazeer, N., and Ku, A. (2018). Image Transformer.
ArXiv e-prints.
Sennrich, R., Haddow, B., and Birch, A. (2015). Neural machine translation of rare words with subword
units. arXiv preprint arXiv:1508.07909.
Shazeer, N., Mirhoseini, A., Maziarz, K., Davis, A., Le, Q., Hinton, G., and Dean, J. (2017). Outrageously
large neural networks: The sparsely-gated mixture-of-experts layer. CoRR, abs/1701.06538.
Sutskever, I., Vinyals, O., and Le, Q. V. (2014). Sequence to sequence learning with neural networks. In
Advances in Neural Information Processing Systems, pages 3104–3112.
van den Oord, A., Dieleman, S., Zen, H., Simonyan, K., Vinyals, O., Graves, A., Kalchbrenner, N., Senior,
A., and Kavukcuoglu, K. (2016). WaveNet: A generative model for raw audio. CoRR, abs/1609.03499.
Vaswani, A., Shazeer, N., Parmar, N., Uszkoreit, J., Jones, L., Gomez, A. N., Kaiser, L., and Polosukhin,
I. (2017). Attention is all you need. CoRR.
Wu, Y., Schuster, M., Chen, Z., Le, Q. V., Norouzi, M., Macherey, W., Krikun, M., Cao, Y., Gao, Q.,
Macherey, K., et al. (2016). Google’s neural machine translation system: Bridging the gap between
human and machine translation. CoRR, abs/1609.08144.
Zhou, J., Cao, Y., Wang, X., Li, P., and Xu, W. (2016). Deep recurrent models with fast-forward connec-
tions for neural machine translation. CoRR, abs/1606.04199.
– Create and include the Parallelism object in the RunConfig, which enables data-parallel
duplication of the model across multiple devices (for example, for multi-GPU synchronous
training).
• Create Experiment, including training and evaluation hooks, which control support services like
logging and checkpointing.
– T2TModel.estimator_model_fn
∗ model(features)
· model.bottom: This uses feature type information from the Problem to transform
the input features into a form consumable by the model body (for example, embedding
integer token ids into a dense float space).
· model.body: The core of the model.
· model.top: transforms the output of the model body into the target space using
information from the Problem.
· model.loss
∗ When training: model.optimize
∗ When evaluating: create evaluation metrics
– Problem.input_fn: produce an input pipeline for a given mode. Uses TensorFlow’s
tf.data.Dataset API.
∗ Problem.dataset which creates a stream of individual examples
∗ Pad and batch the examples into a form ready for efficient processing
– estimator.train
∗ train_op = model_fn(input_fn(mode=TRAIN))
∗ Run the train_op for the specified number of training steps
– estimator.evaluate
∗ metrics = model_fn(input_fn(mode=EVAL))
∗ Accumulate the metrics across the specified number of evaluation steps
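The call structure outlined above (bottom → body → top → loss) can be sketched generically as follows. The method names mirror the outline, but every transformation here is a placeholder, not the real Tensor2Tensor code.

```python
class SketchModel:
    """Schematic of the T2TModel call structure outlined above.

    bottom maps raw features into the model's dense space, body is the
    core computation, top maps the body output back into the target
    space, and loss compares the top output with the targets. All
    concrete transformations are illustrative stand-ins.
    """

    def bottom(self, features):
        # e.g. embed integer token ids into a dense float space
        return [float(f) for f in features["inputs"]]

    def body(self, x):
        # core of the model (stand-in: identity)
        return x

    def top(self, body_output):
        # project the body output into the target space (stand-in: identity)
        return body_output

    def loss(self, logits, targets):
        # placeholder per-position squared error
        return sum((l - t) ** 2 for l, t in zip(logits, targets))

    def __call__(self, features):
        logits = self.top(self.body(self.bottom(features)))
        return logits, self.loss(logits, features["targets"])
```

In training mode the returned loss would feed model.optimize; in evaluation mode the logits would feed the metric computations, matching the two branches in the outline.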
B Example Usage
Tensor2Tensor usage is standardized across problems and models. Below you’ll find a set of commands
that generates a dataset, trains and evaluates a model, and produces decodes from that trained model.
Experiments can typically be reproduced with the (problem, model, hyperparameter set) triple.
The following commands train the attention-based Transformer model on WMT data, translating
from English to German:
pip install tensor2tensor
PROBLEM=translate_ende_wmt32k
MODEL=transformer
HPARAMS=transformer_base
# Generate data
t2t-datagen \
--problem=$PROBLEM \
--data_dir=$DATA_DIR \
--tmp_dir=$TMP_DIR