- Evaluation datasets: objective and subjective.
- Evaluation scripts: run `src/main.py` for evaluation (use `python src/main.py -h` to check the required and available command-line arguments). Model settings are in `src/models.py`, and the paths there must be customized. Few-shot examples for objective evaluation can be found here. When the evaluation script runs, model responses are stored under `responses/` (in the same format as the corresponding evaluation dataset) and evaluation results are stored under `results/` (in CSV format). A printout is provided after the evaluations are complete. It does not matter if the evaluation script terminates unexpectedly: simply rerun it, and results that have already been obtained will not be recomputed. See here for detailed evaluation methods. See the ml3m documentation to better understand the evaluation scripts. (Illustrative sketches of the path settings, the resumption behavior, and result inspection follow this list.)
- Evaluation results: the evaluations reported in our technical reports. All statistics for the objective evaluation are released; for the subjective evaluation, only a partial example is released.
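
The structure of the settings in `src/models.py` is repository-specific, so the following is only a hypothetical sketch of the kind of path customization involved; the `MODEL_SETTINGS` name and its keys are invented for illustration:

```python
# Hypothetical illustration only: the real structure of src/models.py
# differs; open the file to see the settings it actually defines.
MODEL_SETTINGS = {
    "my-model": {
        "model_path": "/path/to/model/weights",  # customize for your machine
        "tokenizer_path": "/path/to/tokenizer",  # customize for your machine
    },
}
```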
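
A minimal sketch of how such resumable evaluation can work, assuming responses are appended to a JSONL file keyed by a `question_id` field (both assumptions; the actual logic lives in the ml3m-based scripts):

```python
import json
import os

RESPONSES_PATH = "responses/objective.jsonl"  # hypothetical file name

def load_completed_ids(path):
    """Collect the IDs of items that already have stored responses."""
    if not os.path.exists(path):
        return set()
    with open(path, encoding="utf-8") as f:
        return {json.loads(line)["question_id"] for line in f if line.strip()}

def evaluate(dataset, query_model):
    """Query the model only for items that have no stored response yet."""
    done = load_completed_ids(RESPONSES_PATH)
    with open(RESPONSES_PATH, "a", encoding="utf-8") as f:
        for item in dataset:
            if item["question_id"] in done:
                continue  # already answered in a previous run; skip
            record = {
                "question_id": item["question_id"],
                "response": query_model(item["question"]),
            }
            f.write(json.dumps(record, ensure_ascii=False) + "\n")
            f.flush()  # flush so a crash loses at most the current item
```

Because each response is written out as it is produced, rerunning the script after an unexpected termination simply picks up where it left off.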
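
Since the results are plain CSV, they can be inspected with standard tooling; for example (the file names under `results/` are whatever the script produced):

```python
import csv
import glob

# Count the evaluated items recorded in each CSV file under results/.
for path in sorted(glob.glob("results/*.csv")):
    with open(path, newline="", encoding="utf-8") as f:
        rows = list(csv.DictReader(f))
    print(f"{path}: {len(rows)} rows")
```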