This is a word-level subtask where the goal is to predict translation error spans rather than binary OK/BAD tags. For this task we will use the error spans obtained from the MQM annotations. Participants will be asked to predict both the error span (start and end indices) and the error severity (major or minor) for each segment.
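Concretely, a system's output for one segment can be thought of as a list of (start, end, severity) records. Below is a minimal Python sketch of that structure; the class and field names are illustrative, not part of any official interface, and we assume 0-based character indices with inclusive end indices, as in the output example further down.

```python
from dataclasses import dataclass, field
from typing import List

@dataclass
class ErrorSpan:
    start: int     # 0-based character index where the span begins
    end: int       # character index where the span ends (inclusive)
    severity: str  # "major" or "minor"

@dataclass
class SegmentPrediction:
    target: str                                           # the MT output being annotated
    spans: List[ErrorSpan] = field(default_factory=list)  # empty list = no errors

# Example: flag "sample translation" (chars 10-27) as major
# and "flagged spans" (chars 38-50) as minor.
pred = SegmentPrediction(
    target="This is a sample translation with two flagged spans.",
    spans=[ErrorSpan(10, 27, "major"), ErrorSpan(38, 50, "minor")],
)
```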
Important
Participants will be able to submit predictions for any of the language pairs of their choice (or all of them).
Below we present the language pairs for the 2023 shared task along with the available training resources:
| Language Pair | Sentence-level annotation | Word-level annotation | Train data | Dev data | Test data |
|---|---|---|---|---|---|
| English-German (En-De) | MQM | MQM | MQM 2020-2022 | MQM 2020-2022 | TBA: Aug 1st |
| Chinese-English (Zh-En) | MQM | MQM | MQM 2020-2022 | MQM 2020-2022 | TBA: Aug 1st |
| Hebrew-English (He-En) | MQM | MQM | - | - | TBA: Aug 1st |
The primary evaluation metric will be F1-score; we also plan to report Precision and Recall. As a baseline we will use CometKiwi, trained on the MQM 2020-2021 data to predict OK/BAD-major/BAD-minor tags.
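As a rough reference, the sketch below computes a character-level F1 from gold and predicted spans. This is our own illustration, not the official scoring script, which may differ (for instance in how severity mismatches are counted); it assumes 0-based indices with inclusive ends, as in the output example below.

```python
def char_labels(spans, length):
    """Expand (start, end, severity) spans into one label per character.

    Assumes 0-based, inclusive character indices; unlabelled
    characters are "ok".
    """
    labels = ["ok"] * length
    for start, end, severity in spans:
        for i in range(start, end + 1):
            labels[i] = severity  # "major" or "minor"
    return labels

def span_f1(gold_spans, pred_spans, length):
    """Micro F1 over characters: a prediction counts as correct only
    if both the position and the severity match the gold label."""
    gold = char_labels(gold_spans, length)
    pred = char_labels(pred_spans, length)
    tp = sum(g == p != "ok" for g, p in zip(gold, pred))
    fp = sum(p != "ok" and g != p for g, p in zip(gold, pred))
    fn = sum(g != "ok" and g != p for g, p in zip(gold, pred))
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    return 2 * precision * recall / (precision + recall) if precision + recall else 0.0

# Example: the predicted major span only partially overlaps the gold one.
print(span_f1([(10, 27, "major")], [(10, 20, "major")], 53))  # ~0.76
```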
Submission format
Line 1: <DISK FOOTPRINT (in bytes, without compression)>
Line 2: <NUMBER OF PARAMETERS>
Line 3: <NUMBER OF ENSEMBLED MODELS> (set to 1 if there is no ensemble)
Lines 4 to n+3, where n is the number of test samples: <LANGUAGE PAIR> <METHOD NAME> <SEGMENT NUMBER> <TARGET SENTENCE> <ERROR START INDICES> <ERROR END INDICES> <ERROR TYPES>
Where:
Each field should be delimited by a single tab character, shown as <\t> in the example below.
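Putting the pieces together, a minimal writer for this format might look like the sketch below. The function name and the tuple layout of `rows` are our own choices, not part of the official tooling.

```python
def write_submission(path, footprint_bytes, n_params, n_ensembles, rows):
    """Write a submission file in the format described above.

    `rows` is an iterable of (lang_pair, method, seg_id, target,
    starts, ends, types) tuples, where `starts`, `ends`, and `types`
    are parallel lists; use ([-1], [-1], ["no-error"]) for a segment
    with no predicted errors.
    """
    with open(path, "w", encoding="utf-8") as f:
        # Three header lines: disk footprint, parameter count, ensemble size.
        f.write(f"{footprint_bytes}\n{n_params}\n{n_ensembles}\n")
        for lang_pair, method, seg_id, target, starts, ends, types in rows:
            f.write("\t".join([
                lang_pair, method, str(seg_id), target,
                " ".join(map(str, starts)),
                " ".join(map(str, ends)),
                " ".join(types),
            ]) + "\n")
```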
Output example
2409244995
2280000000
3
he-en <\t> example-ensemble <\t> 0 <\t> This is a sample translation without errors. <\t> -1 <\t> -1 <\t> no-error
he-en <\t> example-ensemble <\t> 1 <\t> This is a sample translation with a span that is considered major error and another span that is considered minor error. <\t> 49 97 <\t> 70 118 <\t> major minor
…
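Note that in the second example line the indices are 0-based character offsets into the target sentence with inclusive ends: 49-70 covers "considered major error" and 97-118 covers "considered minor error". Before submitting, a light sanity check along the following lines may help catch formatting slips; this is a sketch under the same assumptions, and the organizers' own validation may be stricter.

```python
def check_submission(path):
    """Light format checks: three integer header lines, then one
    tab-separated prediction per line with consistent index lists."""
    with open(path, encoding="utf-8") as f:
        lines = f.read().splitlines()
    assert len(lines) > 3, "expected 3 header lines plus predictions"
    assert all(line.isdigit() for line in lines[:3]), "header lines must be integers"
    for row in lines[3:]:
        fields = row.split("\t")
        assert len(fields) == 7, f"expected 7 tab-separated fields: {row!r}"
        _pair, _method, seg_id, target, starts, ends, types = fields
        assert seg_id.isdigit(), "segment number must be an integer"
        starts = [int(x) for x in starts.split()]
        ends = [int(x) for x in ends.split()]
        assert len(starts) == len(ends) == len(types.split())
        for s, e in zip(starts, ends):
            # -1/-1 marks an error-free segment; otherwise spans must
            # fall inside the target sentence (inclusive end index).
            assert (s, e) == (-1, -1) or 0 <= s <= e < len(target)
```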