Neural Machine Translation (seq2seq) Tutorial

Authors: Thang Luong, Eugene Brevdo, Rui Zhao (Google Research Blogpost, Github)
This version of the tutorial requires TensorFlow Nightly. For using the stable TensorFlow versions, please consider other branches such as tf-1.4.

If you make use of this codebase for your research, please cite this.
- Introduction
- Basic
  - Embedding
  - Encoder
  - Decoder
  - Loss
- Intermediate
- Tips & Tricks
  - Bidirectional RNNs
  - Beam search
  - Hyperparameters
  - Multi-GPU training
- Benchmarks
  - IWSLT English-Vietnamese
  - WMT German-English
  - Standard HParams
- Other resources
- Acknowledgment
- References
- BibTex
Introduction
Sequence-to-sequence (seq2seq) models have enjoyed great success in a variety of tasks such as machine translation, speech recognition, and text summarization. This tutorial gives readers a full understanding of seq2seq models and shows how to build a competitive seq2seq model from scratch. We focus on Neural Machine Translation (NMT), which was the very first testbed for seq2seq models. The included code is lightweight, high-quality, production-ready, and incorporated with the latest research ideas. We achieve this goal by:

1. Using the recent decoder / attention wrapper API and the TensorFlow 1.2 data iterator
2. Incorporating our strong expertise in building recurrent and seq2seq models
3. Providing tips and tricks for building the very best NMT models and replicating Google's NMT (GNMT) system

We believe it is important to provide benchmarks that people can easily replicate. As a result, we provide full experimental results on publicly available datasets: a small-scale English-Vietnamese corpus of TED talks from the IWSLT Evaluation Campaign, and a large-scale German-English corpus from the WMT Evaluation Campaign.

We first build up some basic knowledge about seq2seq models for NMT, explaining how to build and train a vanilla NMT model. The second part goes into the details of building a competitive NMT model with attention mechanism. We then discuss tips and tricks to build the best possible NMT models (both in speed and translation quality), such as bidirectional RNNs, beam search, and scaling up to multiple GPUs using GNMT attention.
Basic
Back in the old days, traditional phrase-based translation systems performed their task by breaking up source sentences into multiple chunks and then translating them phrase-by-phrase. This led to disfluency in the translation outputs and was not quite like how we, humans, translate. We read the entire source sentence, understand its meaning, and then produce a translation. Neural Machine Translation (NMT) mimics that!
Figure 1. Encoder-decoder architecture – example of a general approach for NMT. An encoder converts a source sentence into a "meaning" vector, which is passed through a decoder to produce a translation.

But how do we compute the "meaning" vector and the translation? A natural choice for sequential data is the recurrent neural network (RNN), used by most NMT models. Usually an RNN is used for both the encoder and decoder. The RNN models, however, differ in terms of: (a) directionality – unidirectional or bidirectional; (b) depth – single- or multi-layer; and (c) type – often either a vanilla RNN, a Long Short-term Memory (LSTM), or a gated recurrent unit (GRU). Interested readers can find more information about RNNs and LSTM on this blog post.

Training – How to build our first NMT system
Let's first dive into the heart of building an NMT model with concrete code snippets. We defer data preparation and the full code to later. This part refers to the file model.py.
At the bottom layer, the encoder and decoder RNNs receive as input the following: first, the source sentence, then a boundary marker "<s>" which indicates the transition from the encoding to the decoding mode, and the target sentence. For training, we will feed the system with the following tensors, which are in time-major format and contain word indices:

- encoder_inputs [max_encoder_time, batch_size]: source input words.
- decoder_inputs [max_decoder_time, batch_size]: target input words.
- decoder_outputs [max_decoder_time, batch_size]: target output words; these are decoder_inputs shifted to the left by one time step with an end-of-sentence tag appended on the right.
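To make the shapes concrete, here is a minimal sketch (our own illustration, not code from model.py) of how these three time-major tensors could be declared as placeholders:

```python
import tensorflow as tf

# Sketch (not from model.py): the three time-major integer tensors above.
# Shapes follow the [max_time, batch_size] convention used in this tutorial.
encoder_inputs = tf.placeholder(tf.int32, shape=(None, None),
                                name="encoder_inputs")    # source word indices
decoder_inputs = tf.placeholder(tf.int32, shape=(None, None),
                                name="decoder_inputs")    # target input indices
decoder_outputs = tf.placeholder(tf.int32, shape=(None, None),
                                 name="decoder_outputs")  # target output indices
```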
Embedding
Given the categorical nature of words, the model must first look up the source and target embeddings to retrieve the corresponding word representations. For this embedding layer to work, a vocabulary is first chosen for each language. Usually, a vocabulary size V is selected, and only the most frequent V words are treated as unique. All other words are converted to an "unknown" token and all get the same embedding. The embedding weights, one set per language, are usually learned during training.
```python
# Embedding
embedding_encoder = variable_scope.get_variable(
    "embedding_encoder", [src_vocab_size, embedding_size], ...)
# Look up embedding:
#   encoder_inputs: [max_time, batch_size]
#   encoder_emb_inp: [max_time, batch_size, embedding_size]
encoder_emb_inp = embedding_ops.embedding_lookup(
    embedding_encoder, encoder_inputs)
```
Encoder

Once retrieved, the word embeddings are then fed as input into the main network, which consists of two multi-layer RNNs – an encoder for the source language and a decoder for the target language. These two RNNs, in principle, can share the same weights; however, in practice, we often use two different sets of RNN parameters (such models do a better job when fitting large training datasets). The encoder RNN uses zero vectors as its starting states and is built as follows:
```python
# Build RNN cell
encoder_cell = tf.nn.rnn_cell.BasicLSTMCell(num_units)

# Run Dynamic RNN
#   encoder_outputs: [max_time, batch_size, num_units]
#   encoder_state: [batch_size, num_units]
encoder_outputs, encoder_state = tf.nn.dynamic_rnn(
    encoder_cell, encoder_emb_inp,
    sequence_length=source_sequence_length, time_major=True)
```
Decoder

The decoder also needs to have access to the source information, and one simple way to achieve that is to initialize it with the last hidden state of the encoder, encoder_state:
```python
# Build RNN cell
decoder_cell = tf.nn.rnn_cell.BasicLSTMCell(num_units)

# Helper
helper = tf.contrib.seq2seq.TrainingHelper(
    decoder_emb_inp, decoder_lengths, time_major=True)

# Decoder
decoder = tf.contrib.seq2seq.BasicDecoder(
    decoder_cell, helper, encoder_state,
    output_layer=projection_layer)

# Dynamic decoding
outputs, _ = tf.contrib.seq2seq.dynamic_decode(decoder, ...)
logits = outputs.rnn_output
```
Here, the core part of this code is the BasicDecoder object, decoder, which receives decoder_cell (similar to encoder_cell), a helper, and the previous encoder_state as inputs. By separating out decoders and helpers, we can reuse different codebases; for example, TrainingHelper can be substituted with GreedyEmbeddingHelper to do greedy decoding. See more in helper.py.

Lastly, we haven't mentioned projection_layer, which is a dense matrix that turns the top hidden states into logit vectors of dimension V:

```python
projection_layer = layers_core.Dense(
    tgt_vocab_size, use_bias=False)
```
Loss
Given the logits above, we are now ready to compute our training loss:

```python
crossent = tf.nn.sparse_softmax_cross_entropy_with_logits(
    labels=decoder_outputs, logits=logits)
train_loss = (tf.reduce_sum(crossent * target_weights) /
    batch_size)
```

Here, target_weights is a zero-one matrix of the same size as decoder_outputs. It masks padding positions outside of the target sequence lengths with values 0.
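The training logs also report perplexity values as we train. As a rough illustration (our own sketch, not the tutorial's code), perplexity is the exponential of the average per-word cross-entropy:

```python
import numpy as np

# Sketch: perplexity from accumulated cross-entropy. The two totals below
# are assumed to be summed over an evaluation set.
total_crossent = 5321.0       # assumed total cross-entropy, in nats
total_predict_count = 1000.0  # assumed number of predicted target words
perplexity = np.exp(total_crossent / total_predict_count)
print(perplexity)  # ~204.6
```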
Important note: It's worth pointing out that we divide the loss by batch_size, so our hyperparameters are "invariant" to batch size. Some people divide the loss by (batch_size * num_time_steps), which plays down the errors made on short sentences. More subtly, our hyperparameters (applied to the former way) can't be used for the latter way. For example, if both approaches use SGD with a learning rate of 1.0, the latter approach effectively uses a much smaller learning rate of 1 / num_time_steps.
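To see the effect numerically, here is a toy comparison (our own sketch with made-up numbers) of the two normalization schemes:

```python
# Toy sketch: the same summed cross-entropy normalized two ways.
total_crossent = 480.0  # assumed cross-entropy summed over the whole batch
batch_size = 128
num_time_steps = 25     # assumed (average) target length

loss_per_sentence = total_crossent / batch_size                  # our way
loss_per_word = total_crossent / (batch_size * num_time_steps)   # the other way

# Under SGD with learning rate 1.0, gradients of loss_per_word are
# num_time_steps times smaller, i.e., an effective learning rate of
# 1.0 / num_time_steps.
assert abs(loss_per_sentence / loss_per_word - num_time_steps) < 1e-9
```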
We have now defined the forward pass of our NMT model. Computing the backpropagation pass is just a matter of a few lines of code:

```python
# Calculate and clip gradients
params = tf.trainable_variables()
gradients = tf.gradients(train_loss, params)
clipped_gradients, _ = tf.clip_by_global_norm(
    gradients, max_gradient_norm)
```

One of the important steps in training RNNs is gradient clipping. Here, we clip by the global norm. The max value, max_gradient_norm, is often set to a value like 5 or 1. The last step is selecting the optimizer. The Adam optimizer is a common choice. We also select a learning rate; the value of learning_rate is usually in the range 0.0001 to 0.001 and can be set to decrease as training progresses.

```python
# Optimization
optimizer = tf.train.AdamOptimizer(learning_rate)
update_step = optimizer.apply_gradients(
    zip(clipped_gradients, params))
```
Hands-on – Let's train an NMT model

Let's train our very first NMT model, translating from Vietnamese to English! Run the following command to download the data for training NMT models:

```
nmt/scripts/download_iwslt15.sh /tmp/nmt_data
```

Run the following command to start the training:
```
mkdir /tmp/nmt_model
python -m nmt.nmt \
    --src=vi --tgt=en \
    --vocab_prefix=/tmp/nmt_data/vocab \
    --train_prefix=/tmp/nmt_data/train \
    --dev_prefix=/tmp/nmt_data/tst2012 \
    --test_prefix=/tmp/nmt_data/tst2013 \
    --out_dir=/tmp/nmt_model \
    --num_train_steps=12000 \
    --steps_per_stats=100 \
    --num_layers=2 \
    --num_units=128 \
    --dropout=0.2 \
    --metrics=bleu
```
During training, we should see logs like the following:

```
# Start epoch 0, lr 1, Tue Apr 25 23:17:41 2017
  sample train data:
    src: ... trích ra từ học thuyết của Karl Marx .
    ref: That , of course , was the <unk> distilled from the theories of Karl Marx . </s>
  epoch 0 step 100 lr 1 ... bleu 0.00
  epoch 0 step 200 lr 1 ... bleu 0.00
  ...
```
We can also start Tensorboard to view the summary of the model during training:

```
tensorboard --port 22222 --logdir /tmp/nmt_model/
```

Training the reverse direction, from English to Vietnamese, can be done simply by changing:

```
--src=en --tgt=vi
```
Inference – How to generate translations

While you're training your NMT models (and once you have trained models), you can obtain translations given previously unseen source sentences. This process is called inference. There is a clear distinction between training and inference: at inference time, we only have access to the source sentence. There are many ways to decode; here, we discuss the greedy decoding strategy:

1. We encode the source sentence in the same way as during training to obtain an encoder_state, which is used to initialize the decoder.
2. The decoding (translation) process starts as soon as the decoder receives the starting symbol "<s>" (referred to as tgt_sos_id in our code).
3. For each timestep on the decoder side, we treat the RNN's output as a set of logits and choose the most likely word, the id associated with the maximum logit value, as the emitted word. For example, the word "moi" has the highest translation probability in the first decoding step in Figure 3. We then feed this word as input to the next timestep.
4. The process continues until the end-of-sentence marker "</s>" is produced as an output symbol (referred to as tgt_eos_id in our code).
Figure 3. Greedy decoding – example of how a trained NMT model produces a translation for a source sentence "Je suis étudiant" using greedy search.
```python
# Helper
helper = tf.contrib.seq2seq.GreedyEmbeddingHelper(
    embedding_decoder,
    tf.fill([batch_size], tgt_sos_id), tgt_eos_id)

# Decoder
decoder = tf.contrib.seq2seq.BasicDecoder(
    decoder_cell, helper, encoder_state,
    output_layer=projection_layer)

# Dynamic decoding
outputs, _ = tf.contrib.seq2seq.dynamic_decode(
    decoder, maximum_iterations=maximum_iterations)
translations = outputs.sample_id
```
Here, we use GreedyEmbeddingHelper instead of TrainingHelper. Since we do not know the target sequence lengths in advance, we use maximum_iterations to limit the translation lengths. One heuristic is to decode up to two times the source sentence lengths:

```python
maximum_iterations = tf.round(tf.reduce_max(source_sequence_length) * 2)
```
Having trained a model, we can now create an inference file (for example, by copying and pasting some sentences from /tmp/nmt_data/tst2013.vi into /tmp/my_infer_file.vi) and translate its sentences:

```
python -m nmt.nmt \
    --out_dir=/tmp/nmt_model \
    --inference_input_file=/tmp/my_infer_file.vi \
    --inference_output_file=/tmp/nmt_model/output_infer
```

Note that the above command can also be run while the model is still being trained, as long as a training checkpoint exists.
Intermediate
Having gone through the most basic seq2seq model, let's get more advanced! To build state-of-the-art neural machine translation systems, we will need more "secret sauce": the attention mechanism, which was first introduced by Bahdanau et al., 2015 and later refined by Luong et al., 2015 and others. The key idea of the attention mechanism is to establish direct short-cut connections between the target and the source by paying "attention" to relevant source content as we translate.

Figure 4. Attention visualization – example of the alignments between source and target sentences. Image is taken from (Bahdanau et al., 2015).
Remember that in the vanilla seq2seq model, we pass the last source state from the encoder to the decoder when starting the decoding process. This works well for short and medium-length sentences; for long sentences, however, the single fixed-size hidden state becomes an information bottleneck. Instead of discarding all of the hidden states computed in the source RNN, the attention mechanism provides an approach that allows the decoder to peek at them (treating them as a dynamic memory of the source information), thereby improving the translation of longer sentences.

Figure 5. Attention mechanism – example of an attention-based NMT system as described in (Luong et al., 2015). We highlight in detail the first step of the attention computation. For clarity, we don't show the embedding and projection layers of Figure (2).

As hinted in Figure 5, the attention computation happens at every decoder time step. It consists of the following stages:
1. The current target hidden state is compared with all source states to derive attention weights (can be visualized as in Figure 4).
2. Based on the attention weights, we compute a context vector as the weighted average of the source states.
3. We combine the context vector with the current target hidden state to yield the final attention vector.
4. The attention vector is fed as an input to the next time step (input feeding).

The first three steps can be summarized by the equations below:

$$\alpha_{ts} = \frac{\exp(\text{score}(h_t, \bar{h}_s))}{\sum_{s'=1}^{S} \exp(\text{score}(h_t, \bar{h}_{s'}))} \quad \text{[Attention weights]} \quad (1)$$

$$c_t = \sum_s \alpha_{ts} \bar{h}_s \quad \text{[Context vector]} \quad (2)$$

$$a_t = f(c_t, h_t) = \tanh(W_c [c_t; h_t]) \quad \text{[Attention vector]} \quad (3)$$
Here, the function score is used to compare the target hidden state $$h_t$$ with each of the source hidden states $$\bar{h}_s$$, and the result is normalized to produce attention weights (a distribution over source positions). There are various choices of the scoring function; popular choices include the multiplicative and additive forms given in Eq. (4):

$$\text{score}(h_t, \bar{h}_s) = \begin{cases} h_t^\top W \bar{h}_s & \text{[Luong's multiplicative style]} \\ v_a^\top \tanh(W_1 h_t + W_2 \bar{h}_s) & \text{[Bahdanau's additive style]} \end{cases} \quad (4)$$

Once computed, the attention vector $$a_t$$ is used to derive the softmax logit and loss. This is similar to the target hidden state at the top layer of a vanilla seq2seq model. The function f can also take other forms. Various implementations of attention mechanisms can be found in attention_wrapper.py.
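To connect the equations to code, here is a from-scratch sketch of Eqs. (1)-(3) using Luong's multiplicative score for a single decoder step. This is our own illustration, not the attention_wrapper.py implementation; h_t, h_s, W, and W_c are assumed inputs and trainable weights:

```python
import tensorflow as tf

def luong_attention_step(h_t, h_s, W, W_c):
  """One attention step. h_t: [batch, units]; h_s: [batch, max_time, units];
  W: [units, units]; W_c: [2 * units, units]."""
  # score(h_t, h_s) = h_t^T W h_s, computed for every source position.
  Wh = tf.tensordot(h_s, W, axes=[[2], [1]])                    # [batch, max_time, units]
  score = tf.reduce_sum(tf.expand_dims(h_t, 1) * Wh, axis=2)    # [batch, max_time]
  alpha = tf.nn.softmax(score)                                  # Eq. (1): attention weights
  c_t = tf.reduce_sum(tf.expand_dims(alpha, 2) * h_s, axis=1)   # Eq. (2): context vector
  a_t = tf.tanh(tf.matmul(tf.concat([c_t, h_t], axis=1), W_c))  # Eq. (3): attention vector
  return a_t, alpha
```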
What matters in the attention mechanism? As hinted in the above equations, there are many different attention variants. These variants depend on the form of the scoring function and the attention function, and on whether the previous state $$h_{t-1}$$ is used instead of $$h_t$$ in the scoring function, as originally suggested in (Bahdanau et al., 2015). Empirically, we found that only certain choices matter. First, the basic form of attention, i.e., direct connections between target and source, needs to be present. Second, it's important to feed the attention vector to the next timestep to inform the network about past attention decisions, as demonstrated in (Luong et al., 2015). Lastly, choices of the scoring function can often result in different performance.

For the vanilla attention mechanism, we happen to use the set of source hidden states (or their transformed versions) as the attention "memory". Here is how to define a Luong-style attention mechanism (Luong et al., 2015):
```python
# attention_states: [batch_size, max_time, num_units]
attention_states = tf.transpose(encoder_outputs, [1, 0, 2])

# Create an attention mechanism
attention_mechanism = tf.contrib.seq2seq.LuongAttention(
    num_units, attention_states,
    memory_sequence_length=source_sequence_length)
```

In the previous Encoder section, encoder_outputs is the set of all source hidden states at the top layer and has the shape of [max_time, batch_size, num_units]. Since the attention mechanism expects its memory to be batch major, we transpose attention_states; we also pass source_sequence_length so that the attention weights are properly normalized (over non-padding positions only). Having created an attention mechanism, we use AttentionWrapper to wrap the decoding cell:

```python
decoder_cell = tf.contrib.seq2seq.AttentionWrapper(
    decoder_cell, attention_mechanism,
    attention_layer_size=num_units)
```
The rest of the code is almost the same as in the Decoder section!
Hands-on – Building an attention-based NMT model

To enable attention, we need to use one of luong, scaled_luong, bahdanau, or normed_bahdanau as the value of the attention flag during training. The flag specifies which attention mechanism we are going to use. In addition, we need to create a new directory for the attention model, so we don't reuse the previously trained basic NMT model. Run the following command to start the training:
```
mkdir /tmp/nmt_attention_model
python -m nmt.nmt \
    --attention=scaled_luong \
    --src=vi --tgt=en \
    --vocab_prefix=/tmp/nmt_data/vocab \
    --train_prefix=/tmp/nmt_data/train \
    --dev_prefix=/tmp/nmt_data/tst2012 \
    --test_prefix=/tmp/nmt_data/tst2013 \
    --out_dir=/tmp/nmt_attention_model \
    --num_train_steps=12000 \
    --steps_per_stats=100 \
    --num_layers=2 \
    --num_units=128 \
    --dropout=0.2 \
    --metrics=bleu
```
After training, we can use the same inference command with the new
out_dir for inference:
```
python -m nmt.nmt \
    --out_dir=/tmp/nmt_attention_model \
    --inference_input_file=/tmp/my_infer_file.vi \
    --inference_output_file=/tmp/nmt_attention_model/output_infer
```
Tips & Tricks

Building Training, Eval, and Inference Graphs

When building a machine learning model in TensorFlow, it's often best to build three separate graphs: a training graph that batches and buckets input data and includes the forward, backprop, and optimizer ops; an eval graph that reuses the forward ops for evaluation; and an inference graph that reads input data from placeholders or external inputs (data can be fed directly via feed_dict, or from a C++ data source or serving binary). The inference graph is usually very different from the other two, so it makes sense to build it separately. The three graphs run in three sessions that share the same variables through checkpoints: the training session periodically saves checkpoints, and the eval session and the infer session restore parameters from them. The training loop itself is then a sequence of simple session.run calls.

Before: two models in a single graph, sharing a single Session:
```python
with tf.variable_scope('root'):
  train_inputs = tf.placeholder()
  train_op, loss = BuildTrainModel(train_inputs)
  initializer = tf.global_variables_initializer()

with tf.variable_scope('root', reuse=True):
  eval_inputs = tf.placeholder()
  eval_loss = BuildEvalModel(eval_inputs)

with tf.variable_scope('root', reuse=True):
  infer_inputs = tf.placeholder()
  inference_output = BuildInferenceModel(infer_inputs)

sess = tf.Session()

sess.run(initializer)

for i in itertools.count():
  train_input_data = ...
  sess.run([loss, train_op], feed_dict={train_inputs: train_input_data})

  if i % EVAL_STEPS == 0:
    while data_to_eval:
      eval_input_data = ...
      sess.run([eval_loss], feed_dict={eval_inputs: eval_input_data})

  if i % INFER_STEPS == 0:
    sess.run(inference_output, feed_dict={infer_inputs: infer_input_data})
```
After: three models in three graphs, with three Sessions sharing the same variables:

```python
train_graph = tf.Graph()
eval_graph = tf.Graph()
infer_graph = tf.Graph()

with train_graph.as_default():
  train_iterator = ...
  train_model = BuildTrainModel(train_iterator)
  initializer = tf.global_variables_initializer()

with eval_graph.as_default():
  eval_iterator = ...
  eval_model = BuildEvalModel(eval_iterator)

with infer_graph.as_default():
  infer_iterator, infer_inputs = ...
  infer_model = BuildInferenceModel(infer_iterator)

checkpoints_path = "/tmp/model/checkpoints"

train_sess = tf.Session(graph=train_graph)
eval_sess = tf.Session(graph=eval_graph)
infer_sess = tf.Session(graph=infer_graph)

train_sess.run(initializer)
train_sess.run(train_iterator.initializer)

for i in itertools.count():

  train_model.train(train_sess)

  if i % EVAL_STEPS == 0:
    checkpoint_path = train_model.saver.save(train_sess, checkpoints_path, global_step=i)
    eval_model.saver.restore(eval_sess, checkpoint_path)
    eval_sess.run(eval_iterator.initializer)
    while data_to_eval:
      eval_model.eval(eval_sess)

  if i % INFER_STEPS == 0:
    checkpoint_path = train_model.saver.save(train_sess, checkpoints_path, global_step=i)
    infer_model.saver.restore(infer_sess, checkpoint_path)
    infer_sess.run(infer_iterator.initializer, feed_dict={infer_inputs: infer_input_data})
    while data_to_infer:
      infer_model.infer(infer_sess)
```
We cover the new input data pipeline (as introduced in TensorFlow 1.2) in the next section.

Data Input Pipeline

Prior to TensorFlow 1.2, users had two options for feeding data to the TensorFlow training and eval pipelines: feed data directly via feed_dict at each session.run call, or use the queueing mechanisms in tf.train (with helpers from higher-level frameworks effectively building on queues as a third option). The first approach is easier for users who aren't familiar with TensorFlow or need to do exotic input modification that is only possible in Python. The second and third approaches are more standard but a little less flexible; they also require starting multiple python threads (queue runners). Nevertheless, queues are significantly more efficient than using feed_dict and are the standard for both single-machine and distributed training.

Starting in TensorFlow 1.2, there is a new system for reading data into TensorFlow models: dataset iterators, as found in the tf.data module. Data iterators are flexible, easy to reason about and to manipulate, and provide efficiency and multithreading by leveraging the TensorFlow C++ runtime. A dataset can be created from a batched data Tensor, a filename, or a Tensor containing multiple filenames. Some examples:
```python
# Training dataset consists of multiple files.
train_dataset = tf.data.TextLineDataset(train_files)

# Evaluation dataset uses a single file, but we may
# point to a different file for each evaluation round.
eval_file = tf.placeholder(tf.string, shape=())
eval_dataset = tf.data.TextLineDataset(eval_file)

# For inference, feed input data to the dataset directly via feed_dict.
infer_batch = tf.placeholder(tf.string, shape=(num_infer_examples,))
infer_dataset = tf.data.Dataset.from_tensor_slices(infer_batch)
```
All datasets can be treated similarly via input processing. This includes reading and cleaning the data, bucketing (in the case of training and eval), filtering, and batching. For example, after splitting each line into a vector of word strings and pairing it with its length, we can perform a vocabulary lookup on each sentence. Given a lookup table object table, this map converts the first tuple elements from a vector of strings to a vector of integers:

```python
dataset = dataset.map(
    lambda words, size: (table.lookup(words), size))
```
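For completeness, here is a hedged sketch of how such a lookup table could be built; the vocabulary file path and UNK_ID below are assumptions for illustration:

```python
import tensorflow as tf

# Sketch: build a word -> id table from a vocabulary file with one token
# per line; out-of-vocabulary words map to UNK_ID.
UNK_ID = 0
table = tf.contrib.lookup.index_table_from_file(
    "/tmp/nmt_data/vocab.vi", default_value=UNK_ID)

# The table must be initialized once per session before use:
# session.run(tf.tables_initializer())
```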
Joining two datasets is also easy. If two files contain line-by-line translations of each other and each one is read into its own dataset, then a new dataset containing the tuples of the zipped lines can be created via:

```python
source_target_dataset = tf.data.Dataset.zip((source_dataset, target_dataset))
```
Batching of variable-length sentences is straightforward. The following transformation batches batch_size elements from source_target_dataset, and respectively pads the source and target vectors to the length of the longest source and target vector in each batch:

```python
batched_dataset = source_target_dataset.padded_batch(
    batch_size,
    padded_shapes=((tf.TensorShape([None]),   # source vectors of unknown size
                    tf.TensorShape([])),      # size(source)
                   (tf.TensorShape([None]),   # target vectors of unknown size
                    tf.TensorShape([]))),     # size(target)
    padding_values=((src_eos_id,  # source vectors padded on the right with src_eos_id
                     0),          # size(source) -- unused
                    (tgt_eos_id,  # target vectors padded on the right with tgt_eos_id
                     0)))         # size(target) -- unused
```
Values emitted from this dataset will be nested tuples whose tensors have a leftmost dimension of size batch_size: the batched and padded source and target sentence matrices, together with their corresponding size vectors.

Reading data from a Dataset requires three lines of code: create the iterator, get its values, and initialize it:

```python
batched_iterator = batched_dataset.make_initializable_iterator()

((source, source_lengths), (target, target_lengths)) = batched_iterator.get_next()

# At initialization time.
session.run(batched_iterator.initializer, feed_dict={...})
```

Once the iterator is initialized, every session.run call that accesses the source or target tensors will request the next minibatch from the underlying dataset.
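Putting the pieces above together, here is a condensed sketch of a source-side-only pipeline; the file name, batch size, and the table object are assumptions for illustration:

```python
import tensorflow as tf

# Condensed sketch of the dataset pipeline described above.
src_dataset = tf.data.TextLineDataset("/tmp/nmt_data/train.vi")
# Split each line into a vector of words.
src_dataset = src_dataset.map(lambda line: tf.string_split([line]).values)
# Convert words to ids and pair each sentence with its length.
src_dataset = src_dataset.map(
    lambda words: (tf.cast(table.lookup(words), tf.int32), tf.size(words)))
# Batch, padding sentences to the longest one in each batch.
src_dataset = src_dataset.padded_batch(
    32, padded_shapes=(tf.TensorShape([None]), tf.TensorShape([])))

iterator = src_dataset.make_initializable_iterator()
source, source_lengths = iterator.get_next()
```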
Bidirectional RNNs

Bidirectionality on the encoder side generally gives better performance (with some degradation in speed as more layers are used). Here, we give a simplified example of how to build an encoder with a single bidirectional layer:

```python
# Construct forward and backward cells
forward_cell = tf.nn.rnn_cell.BasicLSTMCell(num_units)
backward_cell = tf.nn.rnn_cell.BasicLSTMCell(num_units)

bi_outputs, encoder_state = tf.nn.bidirectional_dynamic_rnn(
    forward_cell, backward_cell, encoder_emb_inp,
    sequence_length=source_sequence_length, time_major=True)
encoder_outputs = tf.concat(bi_outputs, -1)
```
Beam search

While greedy decoding can give us quite reasonable translation quality, a beam search decoder can further boost performance. The idea of beam search is to better explore the search space of all possible translations by keeping around a small set of top candidates as we translate. The size of the beam is called the beam width; a minimal beam width of, say, size 10 is generally sufficient. For more information, we refer readers to Section 7.2.3 of Neubig (2017). Here's an example of how beam search can be done:
```python
# Replicate encoder info beam_width times
decoder_initial_state = tf.contrib.seq2seq.tile_batch(
    encoder_state, multiplier=hparams.beam_width)

# Define a beam-search decoder
decoder = tf.contrib.seq2seq.BeamSearchDecoder(
    cell=decoder_cell,
    embedding=embedding_decoder,
    start_tokens=start_tokens,
    end_token=end_token,
    initial_state=decoder_initial_state,
    beam_width=beam_width,
    output_layer=projection_layer,
    length_penalty_weight=0.0)

# Dynamic decoding
outputs, _ = tf.contrib.seq2seq.dynamic_decode(decoder, ...)
```
Note that the same dynamic_decode() API call is used, similar to the Decoder section. Once decoded, we can access the translations as follows:

```python
translations = outputs.predicted_ids
```
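With time_major=True, predicted_ids has shape [time, batch_size, beam_width], and the beams come sorted best-first. Here is a small sketch (our addition) for extracting the top hypothesis:

```python
# Sketch: rearrange to [batch_size, beam_width, time], then take beam 0,
# which is the highest-scoring hypothesis.
translations = tf.transpose(translations, perm=[1, 2, 0])
best_translations = translations[:, 0, :]  # [batch_size, time]
```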
Hyperparameters

There are several hyperparameters that can lead to additional performance gains. Here, we list some based on our own experience (disclaimer: others might not agree with everything we write!). Optimizer: while Adam can lead to reasonable results for "unfamiliar" architectures, SGD with a learning-rate schedule will generally lead to better performance if you can train with SGD. Attention: Bahdanau-style attention often requires bidirectionality on the encoder side to work well, whereas Luong-style attention tends to work well for different settings. For this tutorial code, we recommend using the two improved variants of Luong- and Bahdanau-style attention: scaled_luong and normed_bahdanau.
Multi-GPU training

Training an NMT model may take several days. Placing different RNN layers on different GPUs can improve the training speed. Here's an example of creating multi-layer RNN cells spread across multiple GPUs:

```python
cells = []
for i in range(num_layers):
  cells.append(tf.contrib.rnn.DeviceWrapper(
      tf.contrib.rnn.LSTMCell(num_units),
      "/gpu:%d" % (i % num_gpus)))  # place layer i on GPU (i % num_gpus)
cell = tf.contrib.rnn.MultiRNNCell(cells)
```
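In addition, to spread the backward pass over the same devices as the forward ops, the gradient computation should be colocated with the corresponding forward ops; a minimal sketch:

```python
# Sketch: colocate each gradient op with its forward op so backprop also
# runs on the GPUs the layers were placed on.
params = tf.trainable_variables()
gradients = tf.gradients(train_loss, params,
                         colocate_gradients_with_ops=True)
```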
You may notice that the speed improvement of the attention-based NMT model is very small compared to the vanilla NMT model. This is because we use the top (final) layer's output to query attention at each time step. That means each decoding step must wait until its previous step has completely finished; hence, we can't parallelize the decoding process by simply placing RNN layers on multiple GPUs. The GNMT attention architecture instead uses the bottom (first) layer's output to query attention. Therefore, each decoding step can start as soon as its previous step's first layer and attention computation have finished. Here's an example of creating a decoder cell with GNMT-style attention:
```python
cells = []
for i in range(num_layers):
  cells.append(tf.contrib.rnn.DeviceWrapper(
      tf.contrib.rnn.LSTMCell(num_units),
      "/gpu:%d" % (i % num_gpus)))
attention_cell = cells.pop(0)
attention_cell = tf.contrib.seq2seq.AttentionWrapper(
    attention_cell,
    attention_mechanism,
    attention_layer_size=None,  # don't add an additional dense layer
    output_attention=False)
cell = GNMTAttentionMultiCell(attention_cell, cells)
```
Benchmarks
IWSLT English-Vietnamese
Results. We measure translation quality in terms of BLEU scores (Papineni et al., 2002). Training speed: (0.17s step-time, 32.2K wps) on TitanX. Here, step-time means the time taken to run one mini-batch (of size 128). For wps, we count words on both the source and target sides.
WMT German-English
train=train.tok.clean.bpe.32000.(de|en), dev=newstest2013.tok.bpe.32000.(de|en)

Training details. Our training hyperparameters are similar to the English-Vietnamese experiments, except for the following. The data is split into subword units using BPE (32K operations). We train 4-layer LSTMs of 1024 units with a bidirectional encoder; the embedding dimension is 1024.
Results. The first 2 rows are the averaged results of 2 models (model 1, model 2). Results in the third row are with GNMT attention (model); trained with 4 GPUs.

| Systems | newstest2015 |
| --- | --- |
| NMT (greedy) | 27.6 |
| NMT (beam=10) | 28.9 |
| NMT + GNMT attention (beam=10) | 29.9 |
| WMT SOTA | 29.3 |
These results show that our code builds strong baseline systems for
NMT.
Training speed: (0.7s step-time, 8.7K wps) on Nvidia TitanX; see the table below for Nvidia K40m speeds (step-time, wps) across GPU counts:

| Systems | 1 gpu | 4 gpus | 8 gpus |
| --- | --- | --- | --- |
| NMT (4 layers) | 2.2s, 3.4K | 1.9s, 3.9K | - |
| NMT (8 layers) | 3.5s, 2.0K | - | 2.9s, 2.4K |
| NMT + GNMT attention (8 layers) | 4.2s, 1.7K | - | 1.9s, 3.8K |

These results show that without GNMT attention, the gains from using multiple gpus are minimal. With GNMT attention, we obtain 50%-100% speed-ups from multiple gpus.
The first 2 rows are our models with GNMT attention: model 1 (4 layers), model 2 (8 layers).

| Systems | newstest2015 |
| --- | --- |
| Ours – NMT + GNMT attention (4 layers) | 26.5 |
| Ours – NMT + GNMT attention (8 layers) | 27.6 |
| WMT SOTA | 24.9 |
| OpenNMT (Klein et al., 2017) | - |
| tf-seq2seq (Britz et al., 2017) | 25.2 |
| GNMT (Wu et al., 2016) | - |
The above results show our models are very competitive among models of similar architectures. [Note that OpenNMT uses smaller models, and the current best result (as of this writing), 28.4, was obtained by the Transformer network (Vaswani et al., 2017), which has a significantly different architecture.]
Standard HParams

We have provided a set of standard hparams for training the NMT architectures used in the Benchmark. We will use the WMT16 German-English data; you can download it with the following command:

```
nmt/scripts/wmt16_en_de.sh /tmp/wmt16
```

Here is an example command for loading the pre-trained GNMT WMT German-English checkpoint for inference:
```
python -m nmt.nmt \
    --src=de --tgt=en \
    --ckpt=/path/to/checkpoint/translate.ckpt \
    --hparams_path=nmt/standard_hparams/wmt16_gnmt_4_layer.json \
    --out_dir=/tmp/deen_gnmt \
    --vocab_prefix=/tmp/wmt16/vocab.bpe.32000 \
    --inference_input_file=/tmp/wmt16/newstest2014.tok.bpe.32000.de \
    --inference_output_file=/tmp/deen_gnmt/output_infer \
    --inference_ref_file=/tmp/wmt16/newstest2014.tok.bpe.32000.en
```
Here is an example command for training the GNMT WMT
German-English model.
```
python -m nmt.nmt \
    --src=de --tgt=en \
    --hparams_path=nmt/standard_hparams/wmt16_gnmt_4_layer.json \
    --out_dir=/tmp/deen_gnmt \
    --vocab_prefix=/tmp/wmt16/vocab.bpe.32000 \
    --train_prefix=/tmp/wmt16/train.tok.clean.bpe.32000 \
    --dev_prefix=/tmp/wmt16/newstest2013.tok.bpe.32000 \
    --test_prefix=/tmp/wmt16/newstest2015.tok.bpe.32000
```
Other resources

For deeper reading on Neural Machine Translation and sequence-to-sequence models, we recommend the materials by Luong, Cho, Manning (2016); Luong (2016); and Neubig (2017). There is a wide variety of tools for building seq2seq models; for example: Stanford NMT [Matlab], tf-seq2seq [TensorFlow], Nematus [Theano], and OpenNMT [Torch].

Acknowledgment
We would like to thank Denny Britz, Anna Goldie, Derek Murray, and Cinjon Resnick for their work bringing new features to TensorFlow and the seq2seq library; everyone who provided details on the GNMT systems; as well as the Google Brain team for their support and feedback!
References

- Bahdanau, D., Cho, K., & Bengio, Y. (2015). Neural machine translation by jointly learning to align and translate. ICLR.
- Britz, D., Goldie, A., Luong, M.-T., & Le, Q. (2017). Massive exploration of neural machine translation architectures. arXiv:1703.03906.
- Klein, G., Kim, Y., Deng, Y., Senellart, J., & Rush, A. M. (2017). OpenNMT: Open-source toolkit for neural machine translation. arXiv:1701.02810.
- Luong, M.-T., Pham, H., & Manning, C. D. (2015). Effective approaches to attention-based neural machine translation. EMNLP.
- Neubig, G. (2017). Neural machine translation and sequence-to-sequence models: A tutorial. arXiv:1703.01619.
- Papineni, K., Roukos, S., Ward, T., & Zhu, W.-J. (2002). BLEU: a method for automatic evaluation of machine translation. ACL.
- Vaswani, A., et al. (2017). Attention is all you need. NIPS.
- Wu, Y., et al. (2016). Google's neural machine translation system: Bridging the gap between human and machine translation. arXiv:1609.08144.
BibTex
```
@article{luong17,
  author  = {Minh{-}Thang Luong and Eugene Brevdo and Rui Zhao},
  title   = {Neural Machine Translation (seq2seq) Tutorial},
  journal = {https://github.com/tensorflow/nmt},
  year    = {2017},
}
```