As in previous years, the quality prediction task comprises a sentence-level subtask, where the goal is to predict a quality score for each source-target sentence pair, and a word-level subtask, where the goal is to predict translation errors by assigning OK/BAD tags to each word (token) of the target translation.
Both subtasks include annotations derived in two different ways, depending on the language pair: direct assessments (DA), as in older editions, and multi-dimensional quality metrics (MQM), aligned with the Metrics shared task. Detailed information on the language pairs, the annotation specifics, and the available training and development resources in each category is provided below. We also note that the sentence- and word-level subtasks use the same source-target sentences for each language pair.
Important
Participants will be able to submit predictions for any of the subtasks/language pairs of their choice (or all of them).
This year we will evaluate submitted models not only on correlations with human scores, but also on their ability to properly distinguish translations with critical errors (e.g. hallucinations) and assign low quality scores to them. To that end, the provided test sets will include a set of source-target segments with critical errors for which no additional training data will be provided. We thus aim to investigate whether submitted models are robust to cases such as significant deviations in meaning, hallucinations, etc.
Note
The evaluation for critical errors will be separate from the main evaluation of quality prediction performance and will not be included in the leaderboard.
We introduce two new zero-shot test sets, aiming to further motivate zero-/few-shot approaches.
Below we present the language pairs for the 2023 shared task along with the available training resources.
❗
Check back soon for updated information.
Language Pair | Sentence-level annotation | Word-level annotation | Train data | Dev data | Test data |
---|---|---|---|---|---|
English-German (En-De) | MQM | MQM | previously released | previously released | TBA: Aug 1st |
Chinese-English (Zh-En) | MQM | MQM | previously released | previously released | TBA: Aug 1st |
Hebrew-English (He-En) | MQM | MQM | - | - | TBA: Aug 1st |
English-Marathi (En-Mr) | DA | Post-edits | previously released | previously released | TBA: Aug 1st |
English-Hindi (En-Hi) | DA | Post-edits* | new train set released | new dev set released | TBA: Aug 1st |
English-Tamil (En-Ta) | DA | Post-edits* | new train set released | new dev set released | TBA: Aug 1st |
English-Telugu (En-Te) | DA | - | new train set released | new dev set released | TBA: Aug 1st |
English-Gujarati (En-Gu) | DA | - | new train set released | new dev set released | TBA: Aug 1st |
English-Farsi (En-Fa) | - | Post-edits | - | - | TBA: Aug 1st |
(*) Post-edits (word-level quality estimation) available as a test set only.
Note
The MQM annotation convention that we follow is similar to last year’s QE task (see the dev files in the data table). Specifically, based on the provided MQM annotations we compute the MQM error by summing penalties for each error category:
- +1 point for minor errors
- +5 points for major errors
- +10 points for critical errors
To align with the DA annotations we subtract the summed penalties from 100 (the perfect score) and then divide by the sentence length (computed as the number of words). Finally, we standardize the scores for each language pair/annotator.
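A minimal sketch of this computation in Python (the function names and the per-segment error representation are our own, not the official scripts):

```python
import numpy as np

# Penalty weights from the convention described above.
PENALTY = {"minor": 1, "major": 5, "critical": 10}

def mqm_segment_score(error_severities, num_words):
    """Sum the per-error penalties, subtract from 100, and normalise by sentence length."""
    total_penalty = sum(PENALTY[sev] for sev in error_severities)
    return (100 - total_penalty) / num_words

def standardize(scores):
    """z-standardise the scores of one language pair / annotator."""
    scores = np.asarray(scores, dtype=float)
    return (scores - scores.mean()) / scores.std()

# Example: one minor and one major error in a 12-word translation -> (100 - 6) / 12
raw_score = mqm_segment_score(["minor", "major"], num_words=12)
```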
### Baselines
We will use the following baselines:
We will use Spearman correlation as the primary metric, and also compute Kendall and Pearson correlations as secondary metrics.
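For illustration, these correlations can be computed with SciPy as in the sketch below (this is not the official evaluation script):

```python
from scipy.stats import spearmanr, kendalltau, pearsonr

def sentence_level_metrics(predictions, gold):
    """Spearman (primary), Kendall and Pearson (secondary) correlations."""
    return {
        "spearman": spearmanr(predictions, gold)[0],
        "kendall": kendalltau(predictions, gold)[0],
        "pearson": pearsonr(predictions, gold)[0],
    }
```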
### Critical error detection
As a complementary evaluation, we will evaluate systems with respect to the detection of critical errors. We specifically want to assess the ability of submitted models to score translations with critical errors, such as hallucinations, lower than the other translations. To evaluate this, we will compute the AUROC and recall@n for each submission, with respect to the translations flagged as critical errors.
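A sketch of how such an evaluation could be computed; the exact recall@n definition used by the organisers is not specified here, so this is an assumption:

```python
import numpy as np
from sklearn.metrics import roc_auc_score

def critical_error_metrics(scores, is_critical, n):
    """AUROC and recall@n for flagging critical-error translations.

    scores      -- predicted quality scores (higher = better quality)
    is_critical -- 1 if the segment is flagged as a critical error, else 0
    n           -- cut-off for recall@n
    """
    scores = np.asarray(scores, dtype=float)
    is_critical = np.asarray(is_critical)
    # Critical errors should receive the lowest scores, so negate the scores
    # to make "low quality" the positive direction for AUROC.
    auroc = roc_auc_score(is_critical, -scores)
    # recall@n: share of critical-error segments among the n lowest-scored ones.
    worst_n = np.argsort(scores)[:n]
    recall_at_n = is_critical[worst_n].sum() / is_critical.sum()
    return auroc, recall_at_n
```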
We will use MCC (Matthews correlation coefficient) as the primary metric and F1-score as the secondary metric.
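A small sketch of these word-level metrics with scikit-learn (the OK/BAD-to-0/1 mapping and the F1 averaging are our assumptions):

```python
from sklearn.metrics import matthews_corrcoef, f1_score

def word_level_metrics(pred_tags, gold_tags):
    """MCC (primary) and F1 (secondary) over OK/BAD tags mapped to 0/1."""
    to_int = lambda tags: [1 if tag == "BAD" else 0 for tag in tags]
    y_pred, y_true = to_int(pred_tags), to_int(gold_tags)
    # f1_score here is the F1 of the BAD class; the official averaging may differ.
    return matthews_corrcoef(y_true, y_pred), f1_score(y_true, y_pred)
```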
The competition will take place on CodaLab, with one competition instance for the sentence-level and one for the word-level subtasks.
❗
This year we will also require participants to fill in a form describing their model and data choices for each submission.
For each submission you wish to make (under “Participate > Submit” on CodaLab), please upload a single zip file containing the predictions and the system metadata.
For the metadata, we expect a ‘metadata.txt’ file with exactly two non-empty lines, containing the team name and a short system description, respectively. The first line of metadata.txt must contain your team name; you can use your CodaLab username as your team name. The second line of metadata.txt must contain a short description (2-3 sentences) of the system you used to generate the results. This description will not be shown to other participants. Note that submissions without a description will be considered invalid. It is fine to use the same description for multiple submissions/phases if you use the same model (e.g. a multilingual or multitasking model).
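A hypothetical example of writing such a metadata file (the team name and description below are placeholders, not real submission content):

```python
# Write the two required lines of metadata.txt: team name, then a short description.
with open("metadata.txt", "w", encoding="utf-8") as f:
    f.write("my_codalab_username\n")
    f.write("Multilingual sentence-level regressor fine-tuned on the released "
            "DA/MQM training data; single model, no ensembling.\n")
```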
For the predictions we describe the exact format expected separately for each subtask:
For the predictions we expect a single TSV file for each submitted QE system output (submitted online in the respective CodaLab competition), named ‘predictions.txt’.
The file should be formatted with the first two lines indicating model size, the third line indicating the number of ensembled models, and the remaining lines containing the predicted scores, one line per sentence, as follows:
Line 1: <DISK FOOTPRINT (in bytes, without compression)>
Line 2: <NUMBER OF PARAMETERS>
Line 3: <NUMBER OF ENSEMBLED MODELS> (set to 1 if there is no ensemble)
Lines 4 to n, one line per test sample: <LANGUAGE PAIR> <METHOD NAME> <SEGMENT NUMBER> <SEGMENT SCORE>
Where:
Each field should be delimited by a single tab (<\t>) character.
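A minimal sketch of writing a sentence-level predictions.txt in this format; the method name, footprint/parameter values, scores, and 0-based segment numbering are all placeholders:

```python
import csv

# Hypothetical predictions: (language pair, predicted score) per test segment.
predictions = [("en-de", 0.73), ("en-de", -0.12)]

with open("predictions.txt", "w", newline="", encoding="utf-8") as f:
    writer = csv.writer(f, delimiter="\t")
    writer.writerow(["1150000000"])  # line 1: disk footprint in bytes (placeholder)
    writer.writerow(["580000000"])   # line 2: number of parameters (placeholder)
    writer.writerow(["1"])           # line 3: number of ensembled models
    for segment_number, (lang_pair, score) in enumerate(predictions):
        # <LANGUAGE PAIR> <METHOD NAME> <SEGMENT NUMBER> <SEGMENT SCORE>
        writer.writerow([lang_pair, "my_qe_system", segment_number, score])
```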
For the predictions we expect a single TSV file for each submitted QE system output (submitted online in the respective CodaLab competition), named ‘predictions.txt’.
You can submit different systems for any of the MQM or post-edited language pairs independently. The output of your system should be the predicted word-level tags, formatted in the following way:
Line 1: <DISK FOOTPRINT (in bytes, without compression)>
Line 2: <NUMBER OF PARAMETERS>
Line 3: <NUMBER OF ENSEMBLED MODELS> (set to 1 if there is no ensemble)
Lines 4 to n, one line per target token (word) in the test samples: <LANGUAGE PAIR> <METHOD NAME> <TYPE> <SEGMENT NUMBER> <WORD INDEX> <WORD> <BINARY SCORE>
Where:
Each field should be delimited by a single tab (<\t>) character.
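A corresponding sketch for the word-level file; the method name, the “mt” value for the <TYPE> field, and all numeric values are assumptions for illustration only:

```python
import csv

# Hypothetical word-level output: OK/BAD tag per target token of each segment.
segments = [("en-de", 0, ["Das", "Haus", "ist"], ["OK", "BAD", "OK"])]

with open("predictions.txt", "w", newline="", encoding="utf-8") as f:
    writer = csv.writer(f, delimiter="\t")
    writer.writerow(["1150000000"])  # line 1: disk footprint in bytes (placeholder)
    writer.writerow(["580000000"])   # line 2: number of parameters (placeholder)
    writer.writerow(["1"])           # line 3: number of ensembled models
    for lang_pair, segment_number, tokens, tags in segments:
        for word_index, (word, tag) in enumerate(zip(tokens, tags)):
            # <LANGUAGE PAIR> <METHOD NAME> <TYPE> <SEGMENT NUMBER> <WORD INDEX> <WORD> <BINARY SCORE>
            # "mt" as the <TYPE> value is an assumption (target-side tags).
            writer.writerow([lang_pair, "my_qe_system", "mt",
                             segment_number, word_index, word, tag])
```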