- Evaluation datasets: objective and subjective.
- Evaluation scripts: run `src/main.py` for evaluation (use `python src/main.py -h` to check the required and available command-line arguments). Model settings are in `src/models.py`, and the paths there must be customized. Few-shot examples for objective evaluation can be found here. When the evaluation script runs, model responses are stored under `responses/` (in the same format as the corresponding evaluation dataset) and evaluation results are stored under `results/` (in CSV format). A printout is provided after the evaluations are complete. It does not matter if the evaluation script terminates unexpectedly: simply rerun it, and results that have already been obtained will not be recomputed. See here for detailed evaluation methods. See the ml3m documentation to better understand the evaluation scripts. (Illustrative sketches of the path settings, the resumption behavior, and result inspection follow this list.)
- Evaluation results: the evaluations reported in our technical reports. All statistics for the objective evaluation are released; for the subjective evaluation, only a partial example is released.
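
The structure of the settings in `src/models.py` is repository-specific, so the following is only a hypothetical sketch of the kind of path customization involved; the `MODEL_SETTINGS` name and its keys are invented for illustration:

```python
# Hypothetical illustration only: the real structure of src/models.py
# differs; open the file to see the settings it actually defines.
MODEL_SETTINGS = {
    "my-model": {
        "model_path": "/path/to/model/weights",  # customize for your machine
        "tokenizer_path": "/path/to/tokenizer",  # customize for your machine
    },
}
```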
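
A minimal sketch of how such resumable evaluation can work, assuming responses are appended to a JSONL file keyed by a `question_id` field (both assumptions; the actual logic lives in the ml3m-based scripts):

```python
import json
import os

RESPONSES_PATH = "responses/objective.jsonl"  # hypothetical file name

def load_completed_ids(path):
    """Collect the IDs of items that already have stored responses."""
    if not os.path.exists(path):
        return set()
    with open(path, encoding="utf-8") as f:
        return {json.loads(line)["question_id"] for line in f if line.strip()}

def evaluate(dataset, query_model):
    """Query the model only for items that have no stored response yet."""
    done = load_completed_ids(RESPONSES_PATH)
    with open(RESPONSES_PATH, "a", encoding="utf-8") as f:
        for item in dataset:
            if item["question_id"] in done:
                continue  # already answered in a previous run; skip
            record = {
                "question_id": item["question_id"],
                "response": query_model(item["question"]),
            }
            f.write(json.dumps(record, ensure_ascii=False) + "\n")
            f.flush()  # flush so a crash loses at most the current item
```

Because each response is written out as it is produced, rerunning the script after an unexpected termination simply picks up where it left off.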
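
Since the results are plain CSV, they can be inspected with standard tooling; for example (the file names under `results/` are whatever the script produced):

```python
import csv
import glob

# Count the evaluated items recorded in each CSV file under results/.
for path in sorted(glob.glob("results/*.csv")):
    with open(path, newline="", encoding="utf-8") as f:
        rows = list(csv.DictReader(f))
    print(f"{path}: {len(rows)} rows")
```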