Releases: modelscope/evalscope
v0.10.0 release
What's Changed
Feat: Add EvalScope dashboard by @Yunnglin in #277
- Includes single-model evaluation results and multi-model comparison; see 📖 Visualizing Evaluation Results for details.
Others
- Add `model-id` to the arguments by @Yunnglin in #274
- Add the `ifeval` benchmark and unify the report format by @Yunnglin in #275
- Add the `iquiz` benchmark and use the first metric by default when displaying results for multi-metric datasets by @Yunnglin in #288
- Support specifying a system prompt by @Yunnglin in #283
- Fix a bug with multi-metric datasets by @Yunnglin in #282
- Fix mmlu reading of local data by @Yunnglin in #273
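The "first metric by default" behavior for multi-metric datasets can be pictured with a small sketch. The report layout and the `primary_metric` helper are hypothetical illustrations, not EvalScope's actual data structures, and the scores are made up.

```python
from typing import Mapping, Tuple


def primary_metric(metrics: Mapping[str, float]) -> Tuple[str, float]:
    """Return the first (name, score) pair of a multi-metric result.

    Hypothetical helper: relies on dicts preserving insertion order
    (guaranteed since Python 3.7), so "first" means "first reported".
    """
    if not metrics:
        raise ValueError("no metrics reported")
    name = next(iter(metrics))
    return name, metrics[name]


# Made-up scores, for illustration only.
report = {"accuracy": 0.82, "f1": 0.79}
name, score = primary_metric(report)
print(f"{name}: {score}")
```

Picking the first entry keeps the summary view to one number per dataset while the full report still carries every metric.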
Full Changelog: v0.9.0...v0.10.0
v0.9.0 release
What's Changed
- Support for specifying model service API URL for evaluation: Evaluation can be performed on both local and remote model services.
- Support for custom schema for mixed data evaluation: Combine different datasets for a more comprehensive assessment of model capabilities with less data.
- Add benchmark contribution guidelines: Users can add their own benchmarks to make the tool more powerful and beneficial for more people.
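As a rough illustration of mixed data evaluation, the sketch below splits a fixed sample budget across datasets by weight. The `mix_datasets` function, the weight dictionary, and the dataset names are assumptions made for this example; EvalScope's real schema format is described in its documentation and may look quite different.

```python
import random
from typing import Dict, List, Tuple


def mix_datasets(
    datasets: Dict[str, List[str]],
    weights: Dict[str, float],
    total: int,
    seed: int = 0,
) -> List[Tuple[str, str]]:
    """Draw about `total` samples, splitting the budget by normalized weight."""
    rng = random.Random(seed)
    weight_sum = sum(weights[name] for name in datasets)
    mixed: List[Tuple[str, str]] = []
    for name, samples in datasets.items():
        quota = round(total * weights[name] / weight_sum)
        # Never ask for more samples than the dataset holds.
        quota = min(quota, len(samples))
        mixed.extend((name, sample) for sample in rng.sample(samples, quota))
    rng.shuffle(mixed)
    return mixed


# Placeholder datasets; real benchmarks would supply actual prompts.
datasets = {
    "math": [f"math-{i}" for i in range(100)],
    "code": [f"code-{i}" for i in range(100)],
}
mixed = mix_datasets(datasets, weights={"math": 3, "code": 1}, total=20)
print(len(mixed))
```

A fixed seed keeps the mixture reproducible, which matters when comparing several models on the same subset.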
Full Changelog: v0.8.2...v0.9.0
v0.8.2 release
What's Changed
- add user group by @Yunnglin in #251
- fix perf seed by @Yunnglin in #254
- add spawn env by @Yunnglin in #256
- Fix: sglang API response does not contain 'object' field. by @tghfly in #260
- fix parse response by @Yunnglin in #262
- fix predict by @Yunnglin in #264
- compat ragas 0.2.9 and remove chinese prompt cache by @Yunnglin in #265
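The sglang entry above concerns responses that lack an `object` field. One defensive way to handle that kind of variation is sketched below, assuming an OpenAI-style response dictionary. The `extract_text` helper is hypothetical and is not the change made in #260.

```python
from typing import Any, Dict, Optional


def extract_text(response: Dict[str, Any]) -> Optional[str]:
    """Pull generated text out of an OpenAI-style response dict.

    Hypothetical helper: infers the payload type from the shape of the
    first choice instead of requiring a top-level 'object' field.
    """
    choices = response.get("choices") or []
    if not choices:
        return None
    first = choices[0]
    # Chat-style payloads nest the text under "message" or "delta".
    for key in ("message", "delta"):
        part = first.get(key)
        if isinstance(part, dict):
            return part.get("content")
    # Completion-style payloads expose the text directly.
    return first.get("text")


print(extract_text({"choices": [{"message": {"content": "hello"}}]}))
```

Keying off the structure of `choices` means the parser behaves the same whether or not a server includes optional metadata fields.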
New Contributors
Full Changelog: v0.8.1...v0.8.2
v0.8.1 release
What's Changed
- Unify `opencompass` and `vlmeval` output dirs by @Yunnglin in #242
- Perf: add more metrics by @Yunnglin in #245
- Perf: add `trust remote` parameter by @Yunnglin in #246
- Compat ms-swift<3.0 by @Yunnglin in #249
- Fix humaneval for native eval by @Yunnglin in #248
Full Changelog: v0.8.0...v0.8.1
v0.8.0 release
v0.7.2 release
v0.7.1 release
v0.7.0 release
Release Notes
- Refactor the `perf` module to be more robust and easier to use. #178
- Add speed benchmarking in the `perf` module. #178
- Add the multi-modal benchmark `flickr8k` in the `perf` module for speed benchmarking. #211
Bug Fixes
- Add timeout for download punkt.zip #206
- Fix parallel for speed benchmarking in the `perf` module. #215
Documentation Updates
v0.6.1 release
Release Notes
- Add CMMLU benchmark #198
- Add publish workflow #186
- Adapt RAGAS v0.2.5 and update readme #205
- Adapt MTEB v1.19 #196
Bug Fixes
- Pin the datasets version to fix compatibility issues: datasets>=3.0.0, <=3.0.1 #184
- Set pyarrow version to <=17.0.0 to avoid installation issue on OSX. #187
- Add timeout for download punkt.zip #206
Documentation Updates
Release v0.6.0
Release Notes
- Support multi-modal RAG evaluation #149
- Add CLIP_Benchmark
- Add end-to-end multi-modal RAG evaluation in Ragas
- Add compatibility with Ragas v0.2.3 #165 #171
- Support truncating input for CLIP models #163 #164
- Support saving knowledge graphs when generating datasets in Ragas #175
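To show what truncating input for a CLIP-style text encoder can involve, here is a minimal sketch. The `truncate_tokens` helper and the token IDs are hypothetical; the default of 77 matches the text context length of the original OpenAI CLIP models, and other checkpoints may use a different limit. This is not the implementation from #163 or #164.

```python
from typing import List, Optional, Sequence


def truncate_tokens(
    token_ids: Sequence[int],
    max_length: int = 77,
    eos_id: Optional[int] = None,
) -> List[int]:
    """Clip a token sequence to `max_length`, keeping a final end-of-text token.

    Hypothetical helper: if `eos_id` is given and truncation happens, the last
    kept position is overwritten so the sequence still ends with end-of-text.
    """
    if len(token_ids) <= max_length:
        return list(token_ids)
    truncated = list(token_ids[:max_length])
    if eos_id is not None:
        truncated[-1] = eos_id
    return truncated


# Placeholder IDs standing in for a long caption.
long_caption = list(range(100))
clipped = truncate_tokens(long_caption, max_length=77, eos_id=999)
print(len(clipped), clipped[-1])
```

Without some form of truncation, captions longer than the encoder's context window typically cause an error rather than a silently shortened input.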
Bug Fixes
- Fix issue of abnormal metrics during CMTEB evaluation #157
- Fix issue of GenerationConfig being None #173
- Update datasets version constraints #184
- Add publish workflow #186
Documentation Updates