GEMBA-MQM: Detecting Translation Quality Error Spans with GPT-4

Tom Kocmi and Christian Federmann


Microsoft, One Microsoft Way, Redmond, WA-98052, USA
{tomkocmi,chrife}@microsoft.com

Abstract

This paper introduces GEMBA-MQM, a GPT-based evaluation metric designed to detect translation quality errors, specifically for the quality estimation setting without the need for human reference translations. Based on the power of large language models (LLM), GEMBA-MQM employs a fixed three-shot prompting technique, querying the GPT-4 model to mark error quality spans. Compared to previous works, our method has language-agnostic prompts, thus avoiding the need for manual prompt preparation for new languages. While preliminary results indicate that GEMBA-MQM achieves state-of-the-art accuracy for system ranking, we advise caution when using it in academic works to demonstrate improvements over other methods due to its dependence on the proprietary, black-box GPT model.

Metric                    Acc.        Meta
GEMBA-MQM                 96.5% (1)   0.802 (3)
XCOMET-Ensemble           95.2% (1)   0.825 (1)
docWMT22CometDA           93.7% (2)   0.768 (9)
docWMT22CometKiwiDA       93.7% (2)   0.767 (9)
XCOMET-QE-Ensemble        93.5% (2)   0.808 (2)
COMET                     93.5% (2)   0.779 (6)
MetricX-23                93.4% (3)   0.808 (2)
CometKiwi                 93.2% (3)   0.782 (5)
Calibri-COMET22           93.1% (3)   0.767 (10)
BLEURT-20                 93.0% (4)   0.776 (7)
MaTESe                    92.8% (4)   0.782 (5)
mre-score-labse-regular   92.7% (4)   0.743 (13)
mbr-bleurtxv1p-qe         92.5% (4)   0.788 (4)
KG-BERTScore              92.5% (5)   0.774 (7)
MetricX-23-QE             92.0% (5)   0.800 (3)
BERTscore                 90.2% (7)   0.742 (13)
MS-COMET-QE-22            90.1% (8)   0.744 (12)
embed_llama               87.3% (10)  0.701 (16)
f200spBLEU                86.8% (11)  0.704 (15)
BLEU                      85.9% (12)  0.696 (16)
chrF                      85.2% (12)  0.694 (17)

Table 1: Preliminary results of the WMT 2023 Metrics shared task. The first column shows the system-level accuracy, and the second column is the Metrics 2023 meta evaluation. Metrics with gray background need human references. The table does not contain the worst-performing, non-standard metrics due to space reasons.

1 Introduction
GEMBA-MQM builds on the recent finding that large language models (LLMs) can be prompted to assess the quality of machine translation (Kocmi and Federmann, 2023a). We release the scoring script.1

The earlier work Kocmi and Federmann (2023a) (GEMBA-DA) adopted a straightforward methodology of assessing single score values for each segment without specifying the scale in detail. Employing a zero-shot approach, their technique showed an unparalleled accuracy in assessment, surpassing all other non-LLM metrics on the WMT22 metrics test set (Freitag et al., 2022).

Next, Lu et al. (2023) (EAPrompt) investigated prompting LLMs to assess individual error classes from a multidimensional quality metrics (MQM) framework (Freitag et al., 2021), where each error can be classified into various error classes (such as accuracy, fluency, style, terminology, etc.), subclasses (accuracy > mistranslation), and is marked with its severity (critical, major, minor). Segment scores are computed by aggregating errors, each weighted by its respective severity coefficient (25, 5, 1). While their approach employed a few-shot prompting with a chain-of-thought strategy (Wei et al., 2022), our GEMBA-MQM approach differs in two aspects: 1) We streamline the process using only single-step prompting, and 2) our prompts are universally applicable across languages, avoiding the need for manual prompt preparation for each language pair.

Another notable effort by Fernandes et al. (2023) paralleled the EAPrompt approach, also marking MQM error spans. In contrast, their approach used a PaLM-2 model, pooling MQM annotations to sample a few shot examples for the prompt.

1 https://github.com/MicrosoftTranslator/GEMBA/
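To make the MQM scoring scheme described above concrete, the following is a minimal sketch (our own illustration, not the released scoring script) of how a segment-level penalty can be aggregated from annotated errors using the severity coefficients (25, 5, 1) mentioned above; the list-of-tuples input layout is an assumption made only for this sketch.

# Illustrative MQM-style segment scoring: each annotated error is weighted by
# its severity coefficient (critical=25, major=5, minor=1) as described above.
SEVERITY_WEIGHTS = {"critical": 25, "major": 5, "minor": 1}

def mqm_segment_penalty(errors):
    """Sum severity weights over (severity, error_class) tuples for one segment."""
    return sum(SEVERITY_WEIGHTS[severity] for severity, _error_class in errors)

# Example: one major mistranslation and two minor fluency issues -> penalty 7.
example = [("major", "accuracy/mistranslation"),
           ("minor", "fluency/grammar"),
           ("minor", "fluency/punctuation")]
print(mqm_segment_penalty(example))  # 7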
(System) You are an annotator for the quality of machine translation. Your task is to identify
errors and assess the quality of the translation.

(user) {source_language} source:\n
```{source_segment}```\n
{target_language} translation:\n
```{target_segment}```\n
\n
Based on the source segment and machine translation surrounded with triple backticks, identify
error types in the translation and classify them. The categories of errors are: accuracy
(addition, mistranslation, omission, untranslated text), fluency (character encoding, grammar,
inconsistency, punctuation, register, spelling),
locale convention (currency, date, name, telephone, or time format),
style (awkward), terminology (inappropriate for context, inconsistent use), non-translation,
other, or no-error.\n
Each error is classified as one of three categories: critical, major, and minor.
Critical errors inhibit comprehension of the text. Major errors disrupt the flow, but what
the text is trying to say is still understandable. Minor errors are technically errors,
but do not disrupt the flow or hinder comprehension.

(assistant) {observed error classes}

Figure 1: The general prompt for GEMBA-MQM omits the gray part (the locale convention error category), which performed subpar on internal data (we include it in GEMBA-locale-MQM). The "(user)" and "(assistant)" sections are repeated for each few-shot example.
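For orientation, the following is a minimal sketch of how the Figure 1 template can be assembled into a chat request. It is not the released GEMBA scoring script (see footnote 1 for that); the OpenAI Python client usage, the model name "gpt-4", and the few_shots data structure are assumptions made for this illustration.

from openai import OpenAI

SYSTEM_MESSAGE = (
    "You are an annotator for the quality of machine translation. "
    "Your task is to identify errors and assess the quality of the translation.")

ERROR_TAXONOMY = (
    "Based on the source segment and machine translation surrounded with triple backticks, "
    "identify error types in the translation and classify them. The categories of errors are: "
    "accuracy (addition, mistranslation, omission, untranslated text), fluency (character "
    "encoding, grammar, inconsistency, punctuation, register, spelling), style (awkward), "
    "terminology (inappropriate for context, inconsistent use), non-translation, other, "
    "or no-error.\n"
    "Each error is classified as one of three categories: critical, major, and minor. "
    "Critical errors inhibit comprehension of the text. Major errors disrupt the flow, but what "
    "the text is trying to say is still understandable. Minor errors are technically errors, "
    "but do not disrupt the flow or hinder comprehension.")

def user_turn(source_language, source_segment, target_language, target_segment):
    # Mirrors the "(user)" block of Figure 1, without the locale convention category.
    return (f"{source_language} source:\n```{source_segment}```\n"
            f"{target_language} translation:\n```{target_segment}```\n\n"
            f"{ERROR_TAXONOMY}")

def gemba_mqm_messages(few_shots, source_language, source_segment,
                       target_language, target_segment):
    # The same three predetermined examples (Appendix A) are repeated as
    # user/assistant turns for every evaluated language pair.
    messages = [{"role": "system", "content": SYSTEM_MESSAGE}]
    for shot in few_shots:
        messages.append({"role": "user", "content": user_turn(
            shot["source_language"], shot["source_segment"],
            shot["target_language"], shot["target_segment"])})
        messages.append({"role": "assistant", "content": shot["answer"]})
    messages.append({"role": "user", "content": user_turn(
        source_language, source_segment, target_language, target_segment)})
    return messages

# Usage sketch (assumes an OPENAI_API_KEY in the environment):
# client = OpenAI()
# response = client.chat.completions.create(
#     model="gpt-4", temperature=0,
#     messages=gemba_mqm_messages(few_shots, "English", src, "German", hyp))
# print(response.choices[0].message.content)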

Their fine-tuning experiments did not improve system-level performance for the top-tier models.

2 Description

Our technique adopts few-shot learning with the GPT-4 model (OpenAI, 2023), prompting the model to mark quality error spans using the MQM framework. The underlying prompt template is modeled on guidelines for human annotators and shown in Figure 1.

In contrast to other methods, we use three predetermined examples (see Appendix A), allowing the method to be used with any language pair, avoiding the need to create language pair specific MQM few-shot examples. This was the original limitation that prevented Fernandes et al. (2023) from evaluating AutoMQM beyond two language pairs. Our decision was not driven by a desire to enhance performance (domain- and language-specific prompts typically boost it; Moslem et al., 2023), but rather to ensure our method can be evaluated across any language pairs.

3 Experiments

To measure the performance of the GEMBA-MQM metric, we follow the methodology and use test data provided by the WMT22 Metrics shared task (Freitag et al., 2022), which hosts an annual evaluation of automatic metrics, benchmarking them against human gold labels.

We compare our method against the best-performing reference-based metrics of WMT22: MetricX_XXL (non-public metric), COMET-22 (Rei et al., 2022), UNITE (Wan et al., 2022b), BLEURT-20 (Pu et al., 2021), and COMET-20 (Rei et al., 2020). In addition, we also compare against "classic" string-based metrics BLEU (Papineni et al., 2002) and ChrF (Popović, 2015). Lastly, we compare against reference-less metrics of WMT22: CometKIWI (Rei et al., 2022), Unite-src (Wan et al., 2022a), Comet-QE (Rei et al., 2021), and MS-COMET-QE-22 (Kocmi et al., 2022b).

We contrast our work with other LLM-based evaluation methods such as GEMBA-DA (Kocmi and Federmann, 2023b) and EAPrompt (Lu et al., 2023), conducting experiments using two GPT models: GPT-3.5-Turbo and the more powerful GPT-4 (OpenAI, 2023).

3.1 Test set

The main evaluation of our work has been done on the MQM22 (Freitag et al., 2022) and internal Microsoft data. Furthermore, a few days before the camera-ready deadline, organizers of Metrics 2023 (Freitag et al., 2023) released results on the blind test set, showing performance on unseen data.

The MQM22 test set contains human judgments for three translation directions: English into German, English into Russian, and Chinese into English. The test set contains a total of 54 machine translation system outputs or human translations. It contains a total of 106k segments.
Translation systems are mainly from participants of the WMT22 General MT shared task (Kocmi et al., 2022a). The source segments and human reference translations for each language pair contain around 2,000 sentences from four different text domains: news, social, conversational, and e-commerce. The gold standard for scoring translation quality is based on human MQM ratings, annotated by professionals who mark individual errors in each translation, as described in Freitag et al. (2021).

The MQM23 test set is the blind set for this year's WMT Metrics shared task prepared in the same way as MQM22, but with unseen data for all participants, making it the most reliable evaluation as neither participants nor LLM could overfit to those data. The main difference from last year's iteration is the replacement of English into Russian with Hebrew into English. Also, some domains have been updated; see Kocmi et al. (2023).

Additionally, we evaluated GEMBA-MQM on a large internal test set, an extended version of the data set described by Kocmi et al. (2021). This test set contains human scores collected with source-based Direct Assessment (DA, Graham et al., 2013) and its variant DA+SQM (Kocmi et al., 2022a). This test set contains 15 high-resource languages paired with English. Specifically, these are: Arabic, Czech, Dutch, French, German, Hindi, Italian, Japanese, Korean, Polish, Portuguese, Russian, Simplified Chinese, Spanish, and Turkish.

3.2 Evaluation methods

The main use case of automatic metrics is system ranking, either when comparing a baseline to a new model, when claiming state-of-the-art results, when comparing different model architectures in ablation studies, or when deciding if to deploy a new model to production. Therefore, we focus on a method that specifically measures this target: system-level pairwise accuracy (Kocmi et al., 2021).

The pairwise accuracy is defined as the number of system pairs ranked correctly by the metric with respect to the human ranking divided by the total number of system pair comparisons. Formally:

Accuracy = |sign(metric∆) == sign(human∆)| / |all system pairs|
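As an illustration of the formula above, here is a small self-contained sketch (our own, with assumed score dictionaries; not the official WMT evaluation script) that computes system-level pairwise accuracy:

from itertools import combinations

def sign(x):
    # Returns -1, 0, or 1, mirroring sign() in the formula above.
    return (x > 0) - (x < 0)

def pairwise_accuracy(metric_scores, human_scores):
    # Fraction of system pairs where the metric delta agrees in sign with the human delta.
    systems = sorted(metric_scores)
    pairs = list(combinations(systems, 2))
    agreements = sum(sign(metric_scores[a] - metric_scores[b]) ==
                     sign(human_scores[a] - human_scores[b])
                     for a, b in pairs)
    return agreements / len(pairs)

# Example with three systems; the metric misorders one pair -> accuracy 2/3.
metric = {"sysA": 0.81, "sysB": 0.78, "sysC": 0.80}
human = {"sysA": 0.75, "sysB": 0.70, "sysC": 0.77}
print(round(pairwise_accuracy(metric, human), 3))  # 0.667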
We reproduced all scores reported in the WMT22 Metrics shared task findings paper using the official WMT22 script.2 Reported scores match Table 11 of the WMT22 metrics findings paper (Freitag et al., 2022).

2 https://github.com/google-research/mt-metrics-eval

Furthermore, organizers of the Metrics shared task 2023 defined a new meta-evaluation metric based on four different scenarios, each contributing to the final score with a weight of 0.25 (see the sketch after this list):

– system-level pairwise accuracy;
– system-level Pearson correlation;
– segment-level Accuracy-t (Deutsch et al., 2023); and
– segment-level Pearson correlation.
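A minimal sketch of the equal-weight aggregation just described (the component values below are made-up placeholders, not results from the shared task):

def meta_score(sys_pairwise_acc, sys_pearson, seg_accuracy_t, seg_pearson):
    # Each of the four meta-evaluation scenarios contributes with a weight of 0.25.
    return 0.25 * (sys_pairwise_acc + sys_pearson + seg_accuracy_t + seg_pearson)

print(meta_score(0.93, 0.90, 0.55, 0.45))  # 0.7075 for these placeholder inputs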
The motivation is to measure metrics in the most general usage scenarios (for example, for segment-level filtering) and not just for system ranking. However, we question the decision behind the use of Pearson correlation, especially on the system level. As Mathur et al. (2020) showed, Pearson used for metric evaluation is sensitive when applied to small sample sizes (in MQM23, the sample size is as little as 12 systems); it is heavily affected by outliers (Osborne and Overbay, 2004; Ma et al., 2019), which need to be removed before running the evaluation; and it measures linear correlation with the gold MQM data, which are not necessarily linear to start with (especially the discrete segment-level scores, with error weights of 0.1, 1, 5, 25).

Although it is desirable to have an automatic metric that correlates highly with human annotation behaviour and which is useful for segment-level evaluation, more research is needed regarding the proper way of testing these properties.

4 Results

In this section, we discuss the results observed on three different test sets: 1) MQM test data from WMT, 2) internal test data from Microsoft, and 3) a subset of the internal test data to measure the impact of the MQM locale convention.

4.1 Results on MQM Test Data from WMT

The results of the blind set MQM23 in Table 1 show that GEMBA-MQM outperforms all other techniques on the three languages evaluated in the system ranking scenario. Furthermore, when evaluated in the meta-evaluation scenario it achieves the third cluster rank.

Metric                       Acc.
EAPrompt-Turbo               90.9%
GEMBA-DA-GPT4                89.8%
GEMBA-locale-MQM-Turbo       89.8%
EAPrompt-Turbo               89.4%
GEMBA-MQM-GPT4               89.4%
GEMBA-DA-GPT4                87.6%
GEMBA-DA-Turbo               86.9%
GEMBA-MQM-Turbo              86.5%
GEMBA-DA-Turbo               86.5%
MetricX_XXL                  85.0%
BLEURT-20                    84.7%
COMET-22                     83.9%
COMET-20                     83.6%
UniTE                        82.8%
COMETKiwi                    78.8%
COMET-QE                     78.1%
BERTScore                    77.4%
UniTE-src                    75.9%
MS-COMET-QE-22               75.5%
chrF                         73.4%
BLEU                         70.8%

Table 2: The system-level pairwise accuracy results for the WMT22 metrics task test set. Gray metrics need reference translations which are not the focus of the current evaluation.

Metric                       15 langs   Cs + De
# of system pairs (N)        4,468      734
COMETKiwi                    79.9       81.3
GEMBA-locale-MQM-Turbo       78.6       81.3
GEMBA-MQM-Turbo              78.4       83.0
COMET-QE                     77.8       79.8
COMET-22                     76.5       79.2
COMET-20                     76.3       79.6
BLEURT-20                    75.8       79.7
chrF                         68.1       70.6
BLEU                         66.8       68.9

Table 3: System-level pairwise accuracy results for our internal test set. The first column is for all 15 languages, and the second is Czech and German only. All languages are paired with English.

Source      Vstupné do památky činí 16,50 Eur.
Hypothesis  Admission to the monument is 16.50 Euros.
GPT annot.  locale convention/currency: "euros"

Table 4: An example of a wrong error class "locale convention" as marked by GEMBA-locale-MQM. The translation is correct, however, we assume that the GPT model might not have liked the use of Euros in a Czech text because Euros are not used in the Czech Republic.

In addition to the official results, we also test on MQM22 test data and show results in Table 2. The main conclusion is that all GEMBA-MQM variants outperform traditional metrics (such as COMET or MetricX_XXL). When focusing on the quality estimation task, we can see that the GEMBA-locale-MQM-Turbo method slightly outperforms EAPrompt, which is the closest similar technique. However, we can see that our final technique GEMBA-MQM is performing significantly worse than the GEMBA-locale-MQM metric, while the only difference is the removal of the locale convention error class. We believe this to be caused by the test set. We discuss our decision to remove the locale convention error class in Section 4.3.

4.2 Results on Internal Test Data

Table 3 shows that GEMBA-MQM-Turbo outperforms almost all other metrics, losing only to COMETKIWI-22. This shows some limitations of GPT-based evaluation on blind test sets. Due to access limitations, we do not have results for GPT-4, which we assume should outperform the GPT-3.5-Turbo model. We leave this experiment for future work.

4.3 Removal of Locale Convention

When investigating the performance of GEMBA-locale-MQM on a subset of internal data (Czech and German), we observed a critical error in this prompt regarding the "locale convention" error class. GPT assigned this class for errors not related to translations. It flagged Czech sentences as a locale convention error when the currency Euro was mentioned, even when the translation was fine; see the example in Table 4. We assume that it was using this error class to mark parts not standard for a given language, but more investigation would be needed to draw any deeper conclusions.

The evaluation on internal test data in Table 3 showed gains of 1.7% accuracy. However, when evaluating over 15 languages, we observed a small degradation of 0.2%. For MQM22 in Table 2, the degradation is even bigger.

When we look at the distribution of the error classes over the fifteen highest-resource languages in Table 5, we observe that 32% of all errors for GEMBA-locale-MQM are marked as locale convention, suggesting a misuse of GPT for this error class. Therefore, instead of explaining this class in the prompt, we removed it. This resulted in about half of the original locale errors being reassigned to other error classes, while the other half was not marked.

In conclusion, we decided to remove this class as it is not aligned with what we expected to measure and how GPT appears to be using the classes. Thus, we force GPT to classify those errors using other error categories. Given the different behaviour for internal and external test data, this deserves more investigation in future work.
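To illustrate the kind of post-processing behind the error distribution in Table 5, below is a small sketch (our own, assuming an output format matching Figure 1 and Appendix A) that parses the model's MQM-style answer into (severity, class) pairs and tallies top-level error classes; the exact output can vary between model runs, so this parser is illustrative only.

from collections import Counter

def parse_mqm_answer(answer):
    # Walks lines under the "Critical:", "Major:", "Minor:" headers and extracts
    # entries like 'accuracy/mistranslation - "span"'; "no-error" lines are skipped.
    errors, severity = [], None
    for line in answer.splitlines():
        line = line.strip()
        if line.rstrip(":").lower() in ("critical", "major", "minor"):
            severity = line.rstrip(":").lower()
        elif line and line.lower() != "no-error" and severity is not None:
            error_class = line.split(" - ")[0].strip()
            errors.append((severity, error_class))
    return errors

answer = """Critical:
no-error
Major:
accuracy/mistranslation - "involvement"
Minor:
fluency/grammar - "wäre"
"""
errors = parse_mqm_answer(answer)
print(Counter(cls.split("/")[0] for _sev, cls in errors))  # Counter({'accuracy': 1, 'fluency': 1})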
Error class     GEMBA-locale-MQM     GEMBA-MQM
accuracy        960,838 (39%)        1,072,515 (51%)
locale con.     808,702 (32%)        (0%)
fluency         674,228 (27%)        699,037 (33%)
style           23,943 (1%)          41,188 (2%)
terminology     17,379 (1%)          290,490 (14%)
Other errors    4,126 (0%)           10,615 (1%)
Total           2,489,216            2,113,845

Table 5: Distribution of errors for both types of prompts over all segments of the internal test set for the Turbo model.

5 Caution with "Black Box" LLMs

Although GEMBA-MQM is the state-of-the-art technique for system ranking, we would like to discuss in this section the inherent limitations of using "black box" LLMs (such as GPT-4) when conducting academic research.

Firstly, we would like to point out that GPT-4 is a proprietary model, which leads to several problems. One of them is that we do not know which training data it was trained on; therefore, any published test data should be considered as part of their training data (and is, therefore, possibly tainted). Secondly, we cannot guarantee that the model will be available in the future, or that it won't be updated in the future, meaning any results from such a model are relevant only for the specific sampling time. As Chen et al. (2023) showed, the model's performance fluctuated and decreased over the span of 2023.

As this impacts all proprietary LLMs, we advocate for increased research using publicly available models, like LLama 2 (Touvron et al., 2023). This approach ensures future findings can be compared both to "black box" LLMs while also allowing comparison to "open" models.3

3 Although LLama 2 is not fully open, its binary files have been released. Thus, when using it as a scorer, we are using the exact same model.

6 Conclusion

In this paper, we have introduced and evaluated the GEMBA-MQM metric, a GPT-based metric for translation quality error marking. This technique takes advantage of the GPT-4 model with a fixed three-shot prompting strategy. Preliminary results show that GEMBA-MQM achieves a new state of the art when used as a metric for system ranking, outperforming established metrics such as COMET and BLEURT-20.

We would like to acknowledge the inherent limitations tied to using a proprietary model like GPT. Our recommendation to the academic community is to be cautious with employing GEMBA-MQM on top of GPT models. For future research, we want to explore how our approach performs with other, more open LLMs such as LLama 2 (Touvron et al., 2023). Confirming superior behaviour on publicly distributed models (at least their binaries) could open the path for broader usage of the technique in the academic environment.

Limitations

While our findings and techniques with GEMBA-MQM bring promising advancements in translation quality error marking, it is essential to highlight the limitations encountered in this study.

– Reliance on Proprietary GPT Models: GEMBA-MQM depends on the GPT-4 model, which remains proprietary in nature. We do not know what data the model was trained on or whether the same model is still deployed, and therefore whether results remain comparable. As Chen et al. (2023) showed, the model's performance fluctuated throughout 2023.

– High-Resource Languages Only: As WMT evaluations primarily focus on high-resource languages, we cannot conclude if the method will perform well on low-resource languages.

Acknowledgements

We are grateful to our anonymous reviewers for their insightful comments and patience that have helped improve the paper. We would like to thank our colleagues on the Microsoft Translator research team for their valuable feedback.

References

Lingjiao Chen, Matei Zaharia, and James Zou. 2023. How is ChatGPT's behavior changing over time? arXiv preprint arXiv:2307.09009.

Daniel Deutsch, George Foster, and Markus Freitag. 2023. Ties matter: Modifying Kendall's tau for modern metric meta-evaluation. arXiv preprint arXiv:2305.14324.
Patrick Fernandes, Daniel Deutsch, Mara Finkelstein, Parker Riley, André F. T. Martins, Graham Neubig, Ankush Garg, Jonathan H. Clark, Markus Freitag, and Orhan Firat. 2023. The devil is in the errors: Leveraging large language models for fine-grained machine translation evaluation. arXiv preprint arXiv:2308.07286.

Markus Freitag, George Foster, David Grangier, Viresh Ratnakar, Qijun Tan, and Wolfgang Macherey. 2021. Experts, errors, and context: A large-scale study of human evaluation for machine translation. Transactions of the Association for Computational Linguistics, 9:1460–1474.

Markus Freitag, Nitika Mathur, Chi-kiu Lo, Eleftherios Avramidis, Ricardo Rei, Brian Thompson, Tom Kocmi, Frédéric Blain, Daniel Deutsch, Craig Stewart, Chrysoula Zerva, Sheila Castilho, Alon Lavie, and George Foster. 2023. Results of WMT23 metrics shared task. In Proceedings of the Eighth Conference on Machine Translation (WMT), Singapore, Singapore (Hybrid). Association for Computational Linguistics.

Markus Freitag, Ricardo Rei, Nitika Mathur, Chi-kiu Lo, Craig Stewart, Eleftherios Avramidis, Tom Kocmi, George Foster, Alon Lavie, and André F. T. Martins. 2022. Results of WMT22 metrics shared task: Stop using BLEU – neural metrics are better and more robust. In Proceedings of the Seventh Conference on Machine Translation (WMT), pages 46–68, Abu Dhabi, United Arab Emirates (Hybrid). Association for Computational Linguistics.

Yvette Graham, Timothy Baldwin, Alistair Moffat, and Justin Zobel. 2013. Continuous measurement scales in human evaluation of machine translation. In Proceedings of the 7th Linguistic Annotation Workshop and Interoperability with Discourse, pages 33–41, Sofia, Bulgaria. Association for Computational Linguistics.

Tom Kocmi, Eleftherios Avramidis, Rachel Bawden, Ondřej Bojar, Anton Dvorkovich, Christian Federmann, Mark Fishel, Markus Freitag, Thamme Gowda, Roman Grundkiewicz, Barry Haddow, Philipp Koehn, Benjamin Marie, Christof Monz, Makoto Morishita, Kenton Murray, Masaaki Nagata, Toshiaki Nakazawa, Martin Popel, and Maja Popović. 2023. Findings of the 2023 conference on machine translation (WMT23). In Proceedings of the Eighth Conference on Machine Translation (WMT), Singapore, Singapore (Hybrid). Association for Computational Linguistics.

Tom Kocmi, Rachel Bawden, Ondřej Bojar, Anton Dvorkovich, Christian Federmann, Mark Fishel, Thamme Gowda, Yvette Graham, Roman Grundkiewicz, Barry Haddow, Rebecca Knowles, Philipp Koehn, Christof Monz, Makoto Morishita, Masaaki Nagata, Toshiaki Nakazawa, Michal Novák, Martin Popel, and Maja Popović. 2022a. Findings of the 2022 conference on machine translation (WMT22). In Proceedings of the Seventh Conference on Machine Translation (WMT), pages 1–45, Abu Dhabi, United Arab Emirates (Hybrid). Association for Computational Linguistics.

Tom Kocmi and Christian Federmann. 2023a. Large language models are state-of-the-art evaluators of translation quality. arXiv preprint arXiv:2302.14520.

Tom Kocmi and Christian Federmann. 2023b. Large language models are state-of-the-art evaluators of translation quality. In Proceedings of the 24th Annual Conference of the European Association for Machine Translation, pages 193–203, Tampere, Finland. European Association for Machine Translation.

Tom Kocmi, Christian Federmann, Roman Grundkiewicz, Marcin Junczys-Dowmunt, Hitokazu Matsushita, and Arul Menezes. 2021. To ship or not to ship: An extensive evaluation of automatic metrics for machine translation. In Proceedings of the Sixth Conference on Machine Translation, pages 478–494, Online. Association for Computational Linguistics.

Tom Kocmi, Hitokazu Matsushita, and Christian Federmann. 2022b. MS-COMET: More and better human judgements improve metric performance. In Proceedings of the Seventh Conference on Machine Translation (WMT), pages 541–548, Abu Dhabi, United Arab Emirates (Hybrid). Association for Computational Linguistics.

Qingyu Lu, Baopu Qiu, Liang Ding, Kanjian Zhang, Tom Kocmi, and Dacheng Tao. 2023. Error analysis prompting enables human-like translation evaluation in large language models: A case study on ChatGPT.

Qingsong Ma, Johnny Wei, Ondřej Bojar, and Yvette Graham. 2019. Results of the WMT19 metrics shared task: Segment-level and strong MT systems pose big challenges. In Proceedings of the Fourth Conference on Machine Translation (Volume 2: Shared Task Papers, Day 1), pages 62–90, Florence, Italy. Association for Computational Linguistics.

Nitika Mathur, Timothy Baldwin, and Trevor Cohn. 2020. Tangled up in BLEU: Reevaluating the evaluation of automatic machine translation evaluation metrics. In Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics, pages 4984–4997, Online. Association for Computational Linguistics.

Yasmin Moslem, Rejwanul Haque, John D. Kelleher, and Andy Way. 2023. Adaptive machine translation with large language models. In Proceedings of the 24th Annual Conference of the European Association for Machine Translation, pages 227–237, Tampere, Finland. European Association for Machine Translation.

OpenAI. 2023. GPT-4 technical report.

Jason W. Osborne and Amy Overbay. 2004. The power of outliers (and why researchers should always check for them). Practical Assessment, Research, and Evaluation, 9(1):6.
Kishore Papineni, Salim Roukos, Todd Ward, and Wei-Jing Zhu. 2002. Bleu: a method for automatic evaluation of machine translation. In Proceedings of the 40th Annual Meeting of the Association for Computational Linguistics, pages 311–318, Philadelphia, Pennsylvania, USA. Association for Computational Linguistics.

Maja Popović. 2015. chrF: character n-gram F-score for automatic MT evaluation. In Proceedings of the Tenth Workshop on Statistical Machine Translation, pages 392–395, Lisbon, Portugal. Association for Computational Linguistics.

Amy Pu, Hyung Won Chung, Ankur Parikh, Sebastian Gehrmann, and Thibault Sellam. 2021. Learning compact metrics for MT. In Proceedings of the 2021 Conference on Empirical Methods in Natural Language Processing, pages 751–762, Online and Punta Cana, Dominican Republic. Association for Computational Linguistics.

Ricardo Rei, José G. C. de Souza, Duarte Alves, Chrysoula Zerva, Ana C Farinha, Taisiya Glushkova, Alon Lavie, Luisa Coheur, and André F. T. Martins. 2022. COMET-22: Unbabel-IST 2022 submission for the metrics shared task. In Proceedings of the Seventh Conference on Machine Translation (WMT), pages 578–585, Abu Dhabi, United Arab Emirates (Hybrid). Association for Computational Linguistics.

Ricardo Rei, Ana C Farinha, Chrysoula Zerva, Daan van Stigt, Craig Stewart, Pedro Ramos, Taisiya Glushkova, André F. T. Martins, and Alon Lavie. 2021. Are references really needed? unbabel-IST 2021 submission for the metrics shared task. In Proceedings of the Sixth Conference on Machine Translation, pages 1030–1040, Online. Association for Computational Linguistics.

Ricardo Rei, Craig Stewart, Ana C Farinha, and Alon Lavie. 2020. COMET: A neural framework for MT evaluation. In Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing (EMNLP), pages 2685–2702, Online. Association for Computational Linguistics.

Hugo Touvron, Louis Martin, Kevin Stone, Peter Albert, Amjad Almahairi, Yasmine Babaei, Nikolay Bashlykov, Soumya Batra, Prajjwal Bhargava, Shruti Bhosale, Dan Bikel, Lukas Blecher, Cristian Canton Ferrer, Moya Chen, Guillem Cucurull, David Esiobu, Jude Fernandes, Jeremy Fu, Wenyin Fu, Brian Fuller, Cynthia Gao, Vedanuj Goswami, Naman Goyal, Anthony Hartshorn, Saghar Hosseini, Rui Hou, Hakan Inan, Marcin Kardas, Viktor Kerkez, Madian Khabsa, Isabel Kloumann, Artem Korenev, Punit Singh Koura, Marie-Anne Lachaux, Thibaut Lavril, Jenya Lee, Diana Liskovich, Yinghai Lu, Yuning Mao, Xavier Martinet, Todor Mihaylov, Pushkar Mishra, Igor Molybog, Yixin Nie, Andrew Poulton, Jeremy Reizenstein, Rashi Rungta, Kalyan Saladi, Alan Schelten, Ruan Silva, Eric Michael Smith, Ranjan Subramanian, Xiaoqing Ellen Tan, Binh Tang, Ross Taylor, Adina Williams, Jian Xiang Kuan, Puxin Xu, Zheng Yan, Iliyan Zarov, Yuchen Zhang, Angela Fan, Melanie Kambadur, Sharan Narang, Aurelien Rodriguez, Robert Stojnic, Sergey Edunov, and Thomas Scialom. 2023. Llama 2: Open foundation and fine-tuned chat models.

Yu Wan, Keqin Bao, Dayiheng Liu, Baosong Yang, Derek F. Wong, Lidia S. Chao, Wenqiang Lei, and Jun Xie. 2022a. Alibaba-translate China's submission for WMT2022 metrics shared task. In Proceedings of the Seventh Conference on Machine Translation (WMT), pages 586–592, Abu Dhabi, United Arab Emirates (Hybrid). Association for Computational Linguistics.

Yu Wan, Dayiheng Liu, Baosong Yang, Haibo Zhang, Boxing Chen, Derek Wong, and Lidia Chao. 2022b. UniTE: Unified translation evaluation. In Proceedings of the 60th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 8117–8127, Dublin, Ireland. Association for Computational Linguistics.

Jason Wei, Xuezhi Wang, Dale Schuurmans, Maarten Bosma, Fei Xia, Ed Chi, Quoc V Le, Denny Zhou, et al. 2022. Chain-of-thought prompting elicits reasoning in large language models. Advances in Neural Information Processing Systems, 35:24824–24837.
A Three Examples Used for Few-shot Prompting

English source: I do apologise about this, we must gain permission from the account holder to discuss
an order with another person, I apologise if this was done previously, however, I would not be able
to discuss this with yourself without the account holders permission.
German translation: Ich entschuldige mich dafür, wir müssen die Erlaubnis einholen, um eine Bestellung
mit einer anderen Person zu besprechen. Ich entschuldige mich, falls dies zuvor geschehen wäre, aber
ohne die Erlaubnis des Kontoinhabers wäre ich nicht in der Lage, dies mit dir involvement.
MQM annotations:
Critical:
no-error
Major:
accuracy/mistranslation - "involvement"
accuracy/omission - "the account holder"
Minor:
fluency/grammar - "wäre"
fluency/register - "dir"

English source: Talks have resumed in Vienna to try to revive the nuclear pact, with both sides
trying to gauge the prospects of success after the latest exchanges in the stop-start negotiations.
Czech translation: Ve Vídni se ve Vídni obnovily rozhovory o oživení jaderného paktu, přičemže obě
partaje se snaží posoudit vyhlídky na úspěch po posledních výměnách v jednáních.
MQM annotations:
Critical:
no-error
Major:
accuracy/addition - "ve Vídni"
accuracy/omission - "the stop-start"
Minor:
terminology/inappropriate for context - "partaje"

Chinese source: 大众点评乌鲁木齐家居商场频道为您提供高铁居然之家地址,电话,营业时间等最新商户信息,找装修公司,就上大众点评
English translation: Urumqi Home Furnishing Store Channel provides you with the latest business
information such as the address, telephone number, business hours, etc., of high-speed rail, and
find a decoration company, and go to the reviews.
MQM annotations:
Critical:
accuracy/addition - "of high-speed rail"
Major:
accuracy/mistranslation - "go to the reviews"
Minor:
style/awkward - "etc.,"

Figure 2: Three examples used for all languages.
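For completeness, here is a sketch of how one of the three examples above could be stored for the prompt-construction sketch shown after Figure 1; the field names and the few_shots list are our own assumptions, and segments shortened with "..." stand in for the full texts above.

# One entry of an assumed few_shots list matching the first example in Figure 2;
# the other two examples would be encoded the same way.
few_shots = [
    {
        "source_language": "English",
        "source_segment": "I do apologise about this, ...",
        "target_language": "German",
        "target_segment": "Ich entschuldige mich dafür, ...",
        "answer": ("MQM annotations:\n"
                   "Critical:\nno-error\n"
                   "Major:\n"
                   "accuracy/mistranslation - \"involvement\"\n"
                   "accuracy/omission - \"the account holder\"\n"
                   "Minor:\n"
                   "fluency/grammar - \"wäre\"\n"
                   "fluency/register - \"dir\""),
    },
]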

