diff --git a/metrics/meteor/README.md b/metrics/meteor/README.md
index 0b234f70..4d4f21c7 100644
--- a/metrics/meteor/README.md
+++ b/metrics/meteor/README.md
@@ -116,6 +116,9 @@ While the correlation between METEOR and human judgments was measured for Chines
 
 Furthermore, while the alignment and matching done in METEOR is based on unigrams, using multiple word entities (e.g. bigrams) could contribute to improving its accuracy -- this has been proposed in [more recent publications](https://www.cs.cmu.edu/~alavie/METEOR/pdf/meteor-naacl-2010.pdf) on the subject.
 
+Scores differ by up to **±10 points** across v1.0↔v1.5 and flag combinations (`-l`, `-norm`, `-vOut`).
+Pin the Java package and document your flags. This uses the NLTK implementation (METEOR v1.0).
+[Lübbers, 2024](https://github.com/cluebbers/Reproducibility-METEOR-NLP)
 
 ## Citation