Task 2: Explainable QE

In this subtask, we propose to address translation error identification as rationale extraction. Instead of training a dedicated word-level model, the goal is to infer translation errors as an explanation for sentence-level quality scores. In particular, for each pair of source and target sentences, participating teams are asked to provide (1) a sentence-level score estimating the translation quality and (2) a list of continuous token-level scores where the tokens with the highest scores are expected to correspond to translation errors considered relevant by human annotators.
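
To make the expected output concrete, here is a small illustrative example for a single source–target pair. The values and field names are placeholders for illustration only, not the official submission schema (see the Submission section below).

```python
# Illustrative example of the two outputs expected per source-target pair.
# Field names and values are placeholders, not the official file format.

example_prediction = {
    "sentence_score": 0.37,                           # estimated translation quality of the whole sentence
    "token_scores": [0.02, 0.91, 0.15, 0.78, 0.05],   # one continuous score per target token
}

# Tokens with the highest scores (here, positions 1 and 3) are the ones the
# system flags as likely translation errors.
```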

For this year's explainable QE subtask, we will cover the same language pairs as in Task 1:

For each language pair, participants can use the sentence-level scores, available here, to train their QE systems. (Please see the resources listed in the “Additional training resources” section for more details.) As this subtask aims to promote research on the explainability of QE systems, we encourage participants to use or develop explanation methods that can identify the contribution of each token in the input. Participants are not allowed to supervise their models with any token-level or word-level labels or signals (whether from natural or synthetic data) in order to directly predict word-level errors.
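
As one possible starting point, the sketch below shows how token-level scores could be derived post hoc from a sentence-level regressor via input-times-gradient attribution, without any word-level supervision. It assumes PyTorch, and the `ToySentenceQE` model is a toy stand-in, not part of the task resources; participants would plug in their own sentence-level QE system.

```python
# Minimal sketch: post-hoc token attribution (input x gradient) on top of a
# sentence-level QE regressor. The model below is a toy stand-in.

import torch
import torch.nn as nn

class ToySentenceQE(nn.Module):
    """Toy sentence-level regressor: token embeddings -> mean pool -> scalar score."""
    def __init__(self, vocab_size=1000, dim=32):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, dim)
        self.head = nn.Linear(dim, 1)

    def forward(self, embedded):
        # Operates on pre-embedded tokens so that attributions can be taken
        # with respect to the embedding vectors.
        return self.head(embedded.mean(dim=1)).squeeze(-1)

model = ToySentenceQE()
token_ids = torch.tensor([[4, 17, 256, 3, 99]])    # one (toy) tokenized target sentence

embedded = model.embed(token_ids)                  # shape: (1, seq_len, dim)
embedded.retain_grad()                             # keep gradients for this non-leaf tensor
sentence_score = model(embedded)[0]                # sentence-level quality prediction
sentence_score.backward()                          # gradients of the score w.r.t. token embeddings

# Input-times-gradient, aggregated over the embedding dimension, as token-level scores:
token_scores = (embedded * embedded.grad).sum(dim=-1).squeeze(0)
print(sentence_score.item(), token_scores.tolist())
```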

Upcoming

Apart from the language pairs mentioned above, we will also provide test sets for some surprise language pairs. Check back soon!

Submission

For each language pair, a submission is a single zip file containing three files.

Evaluation

The aim of the evaluation is to assess the quality of the explanations, not of the sentence-level predictions. To evaluate the submitted approaches, we will therefore measure how well the token-level scores provided by participants agree with human word-level error annotations. The primary metric is Recall at Top-K; AUC and Average Precision will be used as secondary metrics. Although metrics for sentence-level predictions (e.g., Pearson’s correlation) may be shown on the leaderboard, they will not be used for ranking participants or determining the winner of this explainability task.

The official evaluation script for this task can be found here.
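
For illustration only, and not as a substitute for the official script, the sketch below computes the three token-level metrics for a single sentence. It assumes binary gold error labels per token and takes K to be the number of gold error tokens, one common convention for Recall at Top-K.

```python
# Illustrative computation of Recall at Top-K, AUC, and Average Precision for
# one sentence, assuming binary gold error labels. Not the official script.

import numpy as np
from sklearn.metrics import roc_auc_score, average_precision_score

def recall_at_top_k(gold_labels, token_scores):
    """Fraction of gold error tokens recovered among the K highest-scoring tokens,
    where K is the number of gold error tokens in the sentence."""
    gold = np.asarray(gold_labels)
    scores = np.asarray(token_scores)
    k = int(gold.sum())
    if k == 0:
        return None                      # undefined when the sentence has no errors
    top_k = np.argsort(-scores)[:k]      # indices of the K highest-scoring tokens
    return gold[top_k].sum() / k

gold = [0, 1, 0, 1, 0]                   # 1 = token marked as an error by annotators
pred = [0.02, 0.91, 0.15, 0.78, 0.05]    # continuous token-level scores from a system

print(recall_at_top_k(gold, pred))            # Recall at Top-K -> 1.0
print(roc_auc_score(gold, pred))              # AUC
print(average_precision_score(gold, pred))    # Average Precision
```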

Baselines

Following the Eval4NLP shared task, we will use the methods below as baselines.

Quality Estimation Systems

Post-Hoc Explainability Tools

References

As this subtask is similar to the explainable QE shared task organized by the Eval4NLP workshop last year, we recommend checking their findings paper. Apart from that, please find the list of related papers below.