Sequence-to-sequence NMT with attention
The major problem with the basic sequence-to-sequence model was its reliance on a fixed-length, static representation of the input sequence for all decoding steps. This has two implications. Firstly, translation quality was, unsurprisingly, shown to decrease as the source sentence length increased (Bahdanau et al., 2015): compressing a longer sequence into a fixed-length vector leads to a greater loss of information. Secondly, using the same context vector for every decoding step is suboptimal, because the vector must represent the entire sequence even though, for a given decoding step, certain source words are more useful than others.
A solution was found in an alignment technique, inspired by a similar but more restricted method used for handwriting generation (Graves, 2013). Word alignment has always played a central role in MT, so the motivation behind the technique is not surprising: at a given point in the translation process, certain input words are more important than others for selecting the translation of the next word. The technique, referred to as an attention mechanism, was first successfully applied to sequence-to-sequence translation models by Bahdanau et al. (2015), resulting in the first state-of-the-art NMT models to outperform phrase-based systems. The attention mechanism is designed to assign a weight to each of the annotation vectors h(i) produced by the RNN encoder. The weights are then used to calculate a weighted average of the annotation vectors, producing a context vector c(j) that represents the input sequence and is specific to the decoding time step j. The attention mechanism is a simple neural network which, for each decoding step j and for each position i of the input sequence, calculates an energy score e(ij) based on the previous decoder state z(j-1) and the annotation vector h(i). The energy scores are normalised over the input positions to give the weights.
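The computation described above can be sketched in a few lines of plain Python. This is a minimal illustration, not the implementation used in the thesis: the additive scoring form (a tanh layer followed by a dot product with a vector v) follows the general shape of the model of Bahdanau et al. (2015), and the parameter names W, U and v as well as all numeric values are assumptions chosen by hand rather than learned.

```python
import math


def dot(u, v):
    """Dot product of two equal-length vectors."""
    return sum(a * b for a, b in zip(u, v))


def matvec(matrix, vector):
    """Multiply a matrix (list of rows) by a vector."""
    return [dot(row, vector) for row in matrix]


def softmax(scores):
    """Normalise scores into a probability distribution."""
    highest = max(scores)  # subtracted for numerical stability
    exps = [math.exp(s - highest) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def energy(z_prev, h_i, W, U, v):
    """Energy score e(ij) from z(j-1) and h(i), assuming an additive form."""
    hidden = [math.tanh(a + b)
              for a, b in zip(matvec(W, z_prev), matvec(U, h_i))]
    return dot(v, hidden)


def attend(z_prev, annotations, W, U, v):
    """Return the attention weights and the context vector c(j)."""
    energies = [energy(z_prev, h, W, U, v) for h in annotations]
    weights = softmax(energies)
    dim = len(annotations[0])
    # Context vector: weighted average of the annotation vectors.
    context = [sum(w * h[k] for w, h in zip(weights, annotations))
               for k in range(dim)]
    return weights, context


# Toy, hand-picked values (hypothetical; a real model learns W, U and v).
annotations = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]  # h(1), h(2), h(3)
z_prev = [0.5, -0.5]                                 # z(j-1)
W = [[1.0, 0.0], [0.0, 1.0]]
U = [[1.0, 0.0], [0.0, 1.0]]
v = [1.0, 1.0]

weights, context = attend(z_prev, annotations, W, U, v)
print("weights:", [round(w, 3) for w in weights])
print("context:", [round(c, 3) for c in context])
```

Because the weights sum to one, the context vector always lies within the range spanned by the annotation vectors, and it is recomputed at every decoding step from the current decoder state.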
The entire schema for the sequence-to-sequence model with attention is shown in [...]. Attention has the advantage of calculating more pertinent representations of the input sequence, specific to each decoding step, and remedies the performance drop previously seen for longer input sequences. It is a soft alignment technique, which predicts a probability distribution over the input sequence at each decoding step. A by-product of the strategy is that a soft alignment is automatically learnt between each decoded word and the sequence of input words. When the weights are visualised in a matrix such as Figure 3.9, the alignment can even correspond to our intuitions about word alignment in translation, with higher weights for source words that are more likely to be the translation of, or are useful for the translation of, the target word. The weights are therefore sometimes used as proxies for word alignment probabilities. In reality, these attention weights do not always correspond to what would be expected from an alignment model; Koehn and Knowles (2017) show that attention weights can sometimes be concentrated on neighbouring words. Note that this is not the case in Figure 3.9.
Recent advances in NMT
New techniques and architectures are continually being developed for NMT, and although we shall not use all of them in our own contributions in this thesis, it is worth mentioning their existence. We choose to mention two recent developments: character-level NMT and the attention-based transformer model.
Character-level NMT
One of the major problems with both SMT and NMT approaches has been the translation of words that do not appear in the training data. So-called out-of-vocabulary words are typically either left untranslated or copied over as they appear in the source sentence. This solution is reasonable for certain named entities, which can be translated using the same word. However, it is not ideal for words that should be translated using a word specific to the target language and that were simply absent from the training data. Increasing the amount of training data is one solution, although it will never solve the problem entirely. The Zipfian distribution of words in a language makes it practically impossible for a (finite) MT vocabulary to cover all the words that you may wish to translate. Moreover, increasing the vocabulary size linearly increases the complexity of training MT models and of decoding.
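The effect of a Zipfian distribution on vocabulary coverage can be illustrated with a small worked calculation. The sketch below assumes an idealised distribution in which the frequency of a word type is exactly proportional to 1/rank; real corpora only approximate this, and the figure of one million word types is a hypothetical value chosen for illustration.

```python
def harmonic(n):
    """Sum of 1/r for r = 1..n."""
    return sum(1.0 / r for r in range(1, n + 1))


def zipf_coverage(vocab_size, num_types):
    """Share of running tokens covered by the vocab_size most frequent types.

    Assumes frequency proportional to 1/rank over num_types word types.
    """
    return harmonic(vocab_size) / harmonic(num_types)


num_types = 1_000_000  # hypothetical number of distinct word types
for vocab_size in (10_000, 50_000, 100_000, 500_000):
    coverage = zipf_coverage(vocab_size, num_types)
    print(f"vocabulary {vocab_size:>7}: {coverage:.1%} of tokens covered")
```

Under these assumptions coverage grows only logarithmically with vocabulary size, so a noticeable share of tokens remains out of vocabulary until the vocabulary approaches the full inventory of word types.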
The problem is especially apparent when translating into morphologically rich languages: because the type/token ratio is higher (the different morphological variants of the same lemma are encoded as separate items), a larger vocabulary is needed. In the previous section, we introduced the notion of subword units, which are the result of segmenting words into smaller units prior to translation in a bid to increase the generalisation capacity of translation models (Sennrich et al., 2016d). The technique widens vocabulary coverage, because shorter sequences are more likely to be represented in the data, resulting in improved translation performance.
Character-level NMT takes this principle further by supposing that sentences can be represented as sequences of characters rather than sequences of words. Various approaches have shown that it is possible to learn to translate at this level. Luong and Manning (2016) adopt a hybrid strategy, using character-based translation for rare words only. They find that this strategy outperforms a pure word-based strategy and is capable of producing well-formed words in a morphologically rich setting. Other authors have gone further by proposing purely character-based strategies for NMT. Costa-Jussà and Fonollosa (2016) and Lee et al. (2017) both rely on convolutional neural network encoders to encode sequences of character embeddings. Whereas Costa-Jussà and Fonollosa (2016) still preserve word boundaries and predict on a word-by-word basis, Lee et al. (2017) adopt a fully character-based approach, in which no preliminary segmentation is performed. The systems achieve results comparable to those trained on words and subword units. The results are encouraging, and suggest that a greater generalisation capacity can be achieved through these models.
However, challenges remain, notably concerning sentence-level grammaticality, which appears to suffer somewhat in character-based models compared to those relying on larger translation units such as words or subwords (Sennrich, 2017).
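The trade-off between word-level and character-level representations can be seen on a toy example. The following sketch is purely illustrative: the three training sentences, the test sentence and the "_" word-boundary symbol are all assumptions, and the character segmentation shown is the simplest possible one rather than the preprocessing of any of the cited systems.

```python
def word_tokens(sentence):
    """Segment a sentence into whitespace-separated words."""
    return sentence.split()


def char_tokens(sentence, boundary="_"):
    """Segment a sentence into characters, marking spaces with a symbol."""
    return [boundary if ch == " " else ch for ch in sentence]


def oov_rate(tokens, vocab):
    """Proportion of tokens that are absent from the vocabulary."""
    if not tokens:
        return 0.0
    return sum(t not in vocab for t in tokens) / len(tokens)


# Hypothetical toy data.
train = ["the cat sleeps", "the dogs sleep", "a dog walks"]
test = "the cats walk"

word_vocab = {t for s in train for t in word_tokens(s)}
char_vocab = {t for s in train for t in char_tokens(s)}

word_oov = oov_rate(word_tokens(test), word_vocab)
char_oov = oov_rate(char_tokens(test), char_vocab)

print(f"word level: {len(word_tokens(test))} tokens, OOV rate {word_oov:.2f}")
print(f"char level: {len(char_tokens(test))} tokens, OOV rate {char_oov:.2f}")
```

The unseen forms "cats" and "walk" are out of vocabulary at the word level but fully covered at the character level, at the cost of a much longer sequence for the model to process.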
Evaluating Machine Translation
Progress in MT relies on having a reliable way of measuring the quality of a translation. Given two different MT systems, particularly where one system presents a novel aspect over a baseline system, it is important to have a way of measuring which one produces “better” translations. The standard way of testing this is to translate a test set of sentences with each system and to compare the two sets of outputs. If human reference translations are also provided, evaluation can be based on a comparison between the MT outputs and the human translations; if not, it must be based on the source sentences alone. The comparison can be performed either manually (by human experts) or automatically, and both types of evaluation are complex to perform.
Evaluation remains the Achilles' heel of MT and a difficult task to perform automatically. To date, the most reliable technique for judging translation quality remains manual evaluation by human evaluators, the inevitable downside of which is that it is time-consuming and costly. Humans also tend to be subjective: each evaluator will have different attitudes to different types of error, and it therefore becomes difficult to compare evaluation scores across the literature without redoing human evaluation for each new MT model trained.
The development of automatic metrics has been instrumental in the development of MT architectures, as they enable MT outputs to be regularly compared at little cost and provide a deterministic and therefore reproducible way of evaluating translations. However, developing automatic metrics that mimic human quality evaluation is extremely complex due to the subtleties of natural language, and current metrics are far from being able to match the fidelity of human judgments, as we shall see below.
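To make the idea of reference-based automatic comparison concrete, here is a minimal sketch of a clipped n-gram precision between a system output and a single human reference. It is a simplified illustration only, not one of the standard metrics in full, and the example sentences are invented for the purpose.

```python
from collections import Counter


def ngrams(tokens, n):
    """Return the list of n-grams (as tuples) in a token sequence."""
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]


def clipped_precision(hypothesis, reference, n=1):
    """Share of hypothesis n-grams also found in the reference.

    Each n-gram is credited at most as often as it occurs in the reference.
    """
    hyp_counts = Counter(ngrams(hypothesis.split(), n))
    ref_counts = Counter(ngrams(reference.split(), n))
    total = sum(hyp_counts.values())
    if total == 0:
        return 0.0
    matched = sum(min(count, ref_counts[gram])
                  for gram, count in hyp_counts.items())
    return matched / total


# Invented example sentences.
reference = "she is a doctor"
wrong_pronoun = "he is a doctor"
paraphrase = "the woman is a physician"

print(clipped_precision(wrong_pronoun, reference))  # 3 of 4 unigrams match
print(clipped_precision(paraphrase, reference))     # 2 of 5 unigrams match
```

The score is deterministic and cheap to compute, but in this example an acceptable paraphrase scores lower than an output with the wrong pronoun, which illustrates how surface overlap can diverge from human judgments.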
Current automatic metrics for the global evaluation of translation quality are also inadequate in terms of evaluating contextual phenomena such as those we will review in this thesis, because they are not designed to focus on particular aspects of translation quality. This will be the focus of Section 4.1.
Table of contents:
1 Introduction and overview
1.1 Motivation for Contextual Machine Translation
1.2 Structure and detailed summary of this thesis
1.3 Publications related to this thesis
I State of the Art: Contextual Machine Translation
2 The Role of Context
2.1 Ambiguity and the problem of translation
2.1.1 Source language ambiguity
2.1.2 Cross-lingual meaning transfer ambiguity
2.1.3 Target language ambiguity
2.1.4 Human versus machine translation
2.2 The importance of context in MT
2.2.1 What is context?
2.2.2 Nature and use of context
2.3 Conclusion
3 Sentence-level Machine Translation
3.1 Statistical Machine Translation (SMT)
3.1.1 Word alignments
3.1.2 Phrase-based translation models
3.1.3 Domain adaptation
3.1.4 Successes and Limitations of SMT
3.2 Neural Machine Translation (NMT)
3.2.1 Neural networks for NLP
3.2.2 Sequence-to-sequence NMT
3.2.3 Sequence-to-sequence NMT with attention
3.2.4 Recent advances in NMT
3.2.5 Successes and limitations
3.3 Evaluating Machine Translation
3.3.1 Issues in human evaluation of MT quality
3.3.2 Standard automatic evaluation metrics
3.3.3 Discussion
4 Contextual Machine Translation
4.1 Evaluating contextual MT
4.1.1 Problems associated with automatic evaluation of context
4.1.2 MT metrics augmented with discourse information
4.1.3 Conclusion
4.2 Modelling context for MT
4.2.1 Modelling context for SMT
4.2.2 Modelling context for NMT
4.3 Translation using structured linguistic context
4.3.1 Anaphoric pronouns
4.3.2 Lexical choice
4.3.3 Discourse connectives
4.3.4 Whole document decoding
4.4 Translation using unstructured linguistic context
4.5 Translation using extra-linguistic context
4.6 Conclusion on evaluating contextual MT
II Using contextual information for Machine Translation: strategies and evaluation
5 Adapting translation to extra-linguistic context via pre-processing
5.1 Integrating speaker gender via domain adaptation
5.1.1 Annotating the The Big Bang Theory reproducible corpus
5.1.2 SMT models: baselines and adaptations
5.1.3 Manual analysis and discussion
5.1.4 Conclusion on data partitioning
5.2 Conclusion
6 Improving cohesion-based translation using post-processing
6.1 Preserving style in MT: generating English tag questions
6.1.1 Tag questions (TQs) and the difficulty for MT
6.1.2 Improving TQ generation in MT into English: our post-edition approach
6.1.3 Results, analysis and discussion
6.1.4 Conclusion to our tag-question experiments
6.2 Anaphoric pronoun translation with linguistically motivated features
6.2.1 Classification system: description and motivation
6.2.2 Results, analysis and discussion
6.2.3 Conclusion to pronoun translation via post-edition
6.3 General conclusion on post-edition approaches
7 Context-aware translation models
7.1 Translating discourse phenomena with unstructured linguistic context
7.1.1 Hand-crafted test sets for contextual MT evaluation
7.1.2 Modifying the NMT architecture
7.1.3 Evaluation results and analysis
7.1.4 Conclusion and perspectives
7.2 Contextual NMT with extra-linguistic context
7.2.1 Creation of extra-linguistically annotated data
7.2.2 Contextual strategies
7.2.3 Experiments
7.2.4 BLEU score results
7.2.5 Targeted evaluation of speaker gender
7.2.6 Conclusion and perspectives
7.3 Conclusion
8 DiaBLa: A corpus for the evaluation of contextual MT
8.1 Dialogue and human judgment collection protocol
8.1.1 Participants
8.1.2 Scenarios
8.1.3 Evaluation
8.1.4 MT systems and setup
8.2 Description of the corpus
8.2.1 Overview of translation successes and failures
8.2.2 Comparison with existing corpora
8.3 Evaluating contextual MT with the DiaBLa corpus
8.3.1 Overall MT quality
8.3.2 Focus on a discourse-level phenomenon
8.4 Perspectives
8.4.1 Language analysis of MT-assisted interaction
8.4.2 MT evaluation
Conclusion and Perspectives
9 Conclusion and Perspectives
9.1 Conclusion
9.1.1 Trends in contextual MT and the impact on our work
9.1.2 Review of our aims and contributions
9.2 Perspectives
9.2.1 Evaluation of MT
9.2.2 Interpretability of contextual NMT strategies
9.2.3 Contextual MT for low resource language pairs
9.2.4 Contextual MT to Multimodal MT
9.2.5 Conclusion: To the future and beyond the sentence
Appendices
A Context-aware translation models
A.1 Translating discourse phenomena with unstructured linguistic context
A.1.1 Training and decoding parameters
A.1.2 Visualisation of hierarchical attention weights
A.2 Contextual NMT with extra-linguistic context
A.2.1 Experimental setup
B DiaBLa: A corpus for the evaluation of contextual MT
B.1 Role-play scenarios
B.2 Dialogue collection: Final evaluation form
Bibliography