For this lab, we're going to examine some output from current statistical machine translation systems. We will evaluate these systems manually, both to get a feeling for where the state of the art is right now and to see how well our judgements correlate with automatic evaluation measures.
Download the two Excel sheets below. Each sheet contains 10 test sentences with 1) the original sentence (in either Hindi or French), 2) a human reference translation, and 3) multiple system translations. Score each system translation on a scale of 1-5 (5 being best) based on:
In addition, as you read through the sentences, also jot down any observations that you have about the systems (e.g. System X seems to consistently have grammatical issues).
When you're all done, calculate the average fluency, adequacy, and overall scores for the different systems on each language pair, then go to this Google spreadsheet and enter your results in one of the columns. We'll look at the aggregate results as a class at the end.
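If you'd rather not average the scores by hand, the calculation can be sketched in a few lines of Python. The system names and ratings below are made-up placeholders, not real lab data, and the "overall" score is assumed here to be the mean of fluency and adequacy; use whatever definition your spreadsheet expects.

```python
def average_score(ratings):
    """Return the mean of a list of 1-5 ratings, rounded to 2 decimals."""
    return round(sum(ratings) / len(ratings), 2)

# Hypothetical per-sentence ratings for two systems over 10 test sentences.
ratings = {
    "System A": {"fluency": [4, 3, 5, 4, 4, 3, 4, 5, 3, 4],
                 "adequacy": [3, 3, 4, 4, 3, 2, 4, 4, 3, 3]},
    "System B": {"fluency": [2, 3, 2, 3, 3, 2, 3, 2, 3, 3],
                 "adequacy": [3, 2, 3, 3, 2, 3, 3, 2, 3, 3]},
}

for system, dims in ratings.items():
    fluency = average_score(dims["fluency"])
    adequacy = average_score(dims["adequacy"])
    overall = round((fluency + adequacy) / 2, 2)  # assumed definition of "overall"
    print(f"{system}: fluency={fluency} adequacy={adequacy} overall={overall}")
```

These per-system averages are what you would then enter into the shared spreadsheet.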
The data was obtained from the Evaluation Matrix.