[Docs] Added documentation for benchmarks #3340
Merged
Commits (6):

- a0906d8: Added documentation for benchmarks (Tanuj-Taneja1)
- 3215f42: Merge branch 'master' into docs/benchmark (fengju0213)
- 9e6682d: Merge branch 'master' into docs/benchmark (fengju0213)
- 6273f9e: Merge branch 'master' into docs/benchmark (fengju0213)
- 0a08881: enhance: Added documentation for benchmarks PR3340 (#3357) (Wendong-Fan)
- 978e5f0: Merge branch 'master' into docs/benchmark (Wendong-Fan)
Filter by extension
Conversations
Failed to load comments.
Loading
Jump to
Jump to file
Failed to load files.
Loading
Diff view
Diff view
There are no files selected for viewing
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters

---
title: "Benchmarks"
description: "Learn about CAMEL's Benchmark module."
---

## Overview

The **Benchmark** module in CAMEL provides a framework for evaluating AI agents and language models across various tasks and domains. It includes implementations of multiple benchmarks and provides an interface for running evaluations, measuring performance, and generating detailed reports.

The module supports benchmarks for:

- **API calling and tool use** (APIBank, APIBench, Nexus)
- **General AI assistance** (GAIA)
- **Browser-based comprehension** (BrowseComp)
- **Retrieval-Augmented Generation** (RAGBench)

## Architecture

### Base Class: `BaseBenchmark`

All benchmarks inherit from the `BaseBenchmark` abstract class, which provides a common interface for downloading data, loading datasets, running evaluations, and accessing results.

#### BaseBenchmark Methods

| Method       | Description                              | Parameters |
| ------------ | ---------------------------------------- | ---------- |
| `__init__()` | Initialize the benchmark                 | `name`: Benchmark name<br>`data_dir`: Data directory path<br>`save_to`: Results save path<br>`processes`: Number of parallel processes |
| `download()` | Download benchmark data                  | None |
| `load()`     | Load benchmark data                      | `force_download`: Force re-download |
| `run()`      | Run the benchmark evaluation             | `agent`: ChatAgent to evaluate<br>`on`: Data split ("train", "valid", "test")<br>`randomize`: Shuffle data<br>`subset`: Limit number of examples |
| `train`      | Get training data (property)             | None |
| `valid`      | Get validation data (property)           | None |
| `test`       | Get test data (property)                 | None |
| `results`    | Get evaluation results (property)        | None |
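
Reconstructed from the table above (not copied from the CAMEL source), the sketch below shows roughly what this shared interface looks like; the exact signatures, defaults, and attribute names in `camel.benchmarks.BaseBenchmark` may differ.

```python
# Illustrative sketch of the BaseBenchmark interface, reconstructed from the
# method table above; it is NOT the actual CAMEL source and details may differ.
from abc import ABC, abstractmethod
from typing import Any, Dict, List, Optional


class BaseBenchmarkSketch(ABC):
    def __init__(self, name: str, data_dir: str, save_to: str, processes: int = 1):
        self.name = name
        self.data_dir = data_dir
        self.save_to = save_to
        self.processes = processes
        self._data: Dict[str, List[Any]] = {}      # populated by load()
        self._results: List[Dict[str, Any]] = []   # populated by run()

    @abstractmethod
    def download(self) -> "BaseBenchmarkSketch":
        """Fetch the raw benchmark data into data_dir."""

    @abstractmethod
    def load(self, force_download: bool = False) -> "BaseBenchmarkSketch":
        """Parse the downloaded data into self._data."""

    @abstractmethod
    def run(self, agent, on: str, randomize: bool = False,
            subset: Optional[int] = None) -> Dict[str, Any]:
        """Evaluate the agent and populate self._results."""

    @property
    def train(self) -> List[Any]:
        return self._data["train"]

    @property
    def valid(self) -> List[Any]:
        return self._data["valid"]

    @property
    def test(self) -> List[Any]:
        return self._data["test"]

    @property
    def results(self) -> List[Dict[str, Any]]:
        return self._results
```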

## Available Benchmarks

### 1. GAIA Benchmark

**GAIA (General AI Assistants)** is a benchmark for evaluating general-purpose AI assistants on real-world tasks requiring multiple steps, tool use, and reasoning.
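
As a concrete instance of the usage pattern described later on this page, here is a minimal sketch of running GAIA. It assumes the class is exposed as `GAIABenchmark`, keeps the agent configuration deliberately simple, and uses a `level` keyword (GAIA groups tasks by difficulty level) that is an assumption; check the class docstring for the exact parameters.

```python
# Minimal sketch: evaluating a ChatAgent on GAIA.
# Assumes `GAIABenchmark` is importable from camel.benchmarks; the `level`
# keyword below is an assumption and may differ in the actual implementation.
from camel.agents import ChatAgent
from camel.benchmarks import GAIABenchmark

benchmark = GAIABenchmark(
    data_dir="./gaia_data",         # where the dataset is cached
    save_to="./gaia_results.json",  # where per-example results are written
)
benchmark.load(force_download=False)

agent = ChatAgent("You are a general-purpose assistant.")

result = benchmark.run(
    agent=agent,
    on="valid",    # data split to evaluate on
    level="all",   # assumed: run every difficulty level
    subset=10,     # only evaluate the first 10 examples
)
print(result)
```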

### 2. APIBank Benchmark

**APIBank** evaluates the ability of LLMs to make correct API calls and generate appropriate responses in multi-turn conversations.

### 3. APIBench Benchmark

**APIBench (Gorilla)** tests the ability to generate correct API calls for various machine learning frameworks (HuggingFace, TensorFlow Hub, Torch Hub).

### 4. Nexus Benchmark

**Nexus** evaluates function calling capabilities across multiple domains, including security APIs, location services, and climate data.

#### Available Tasks

| Task                         | Description                   |
| ---------------------------- | ----------------------------- |
| `"NVDLibrary"`               | CVE and CPE API calls         |
| `"VirusTotal"`               | Malware and security analysis |
| `"OTX"`                      | Open Threat Exchange API      |
| `"PlacesAPI"`                | Location and mapping services |
| `"ClimateAPI"`               | Weather and climate data      |
| `"VirusTotal-ParallelCalls"` | Multiple parallel API calls   |
| `"VirusTotal-NestedCalls"`   | Nested API calls              |
| `"NVDLibrary-NestedCalls"`   | Nested CVE/CPE calls          |
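
A minimal sketch of evaluating one Nexus task follows. It assumes the class is exposed as `NexusBenchmark` and that the task is selected with a `task` keyword on `run()`; both names are assumptions drawn from the table above, so verify them against the class signature.

```python
# Minimal sketch: running a single Nexus function-calling task.
# Assumes `NexusBenchmark` is the exported class and that run() accepts a
# `task` keyword naming one of the tasks listed above; verify before use.
from camel.agents import ChatAgent
from camel.benchmarks import NexusBenchmark

benchmark = NexusBenchmark(
    data_dir="./nexus_data",
    save_to="./nexus_results.json",
)
benchmark.load()

agent = ChatAgent("You translate user requests into function calls.")

result = benchmark.run(
    agent=agent,
    task="NVDLibrary",  # assumed keyword: one task from the table above
    randomize=False,
    subset=20,          # cap the number of evaluated examples
)
print(result)
```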

### 5. BrowseComp Benchmark

**BrowseComp** evaluates browser-based comprehension by testing agents on questions that require understanding web content.

### 6. RAGBench Benchmark

**RAGBench** evaluates Retrieval-Augmented Generation systems using context relevancy and faithfulness metrics.

#### Available Subsets

| Subset       | Description                                        |
| ------------ | -------------------------------------------------- |
| `"hotpotqa"` | Multi-hop question answering                       |
| `"covidqa"`  | COVID-19 related questions                         |
| `"finqa"`    | Financial question answering                       |
| `"cuad"`     | Contract understanding                             |
| `"msmarco"`  | Microsoft Machine Reading Comprehension            |
| `"pubmedqa"` | Biomedical questions                               |
| `"expertqa"` | Expert-level questions                             |
| `"techqa"`   | Technical questions                                |
| Others       | `"emanual"`, `"delucionqa"`, `"hagrid"`, `"tatqa"` |
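
The sketch below shows how one of these subsets might be selected. The class name `RAGBenchBenchmark` and the `subset` constructor keyword are assumptions based on the naming used above, not confirmed API; it is also an assumption that the subset is chosen at construction time rather than in `run()`, so check `camel.benchmarks` for the exact export and signature.

```python
# Minimal sketch: evaluating an agent on one RAGBench subset.
# The class name `RAGBenchBenchmark` and the `subset` keyword are assumptions;
# confirm the actual export and signatures in camel.benchmarks.
from camel.agents import ChatAgent
from camel.benchmarks import RAGBenchBenchmark

benchmark = RAGBenchBenchmark(
    data_dir="./ragbench_data",
    save_to="./ragbench_results.json",
    subset="hotpotqa",  # assumed keyword: one subset from the table above
)
benchmark.load()

agent = ChatAgent("Answer strictly from the provided context.")

result = benchmark.run(agent=agent)
print(result)  # e.g. context relevancy / faithfulness scores
```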

## Common Usage Pattern

All benchmarks follow a similar pattern:

```python
from camel.benchmarks import <BenchmarkName>
from camel.agents import ChatAgent

# 1. Initialize
benchmark = <BenchmarkName>(
    data_dir="./data",
    save_to="./results.json",
    processes=4
)

# 2. Load data
benchmark.load(force_download=False)

# 3. Create agent
agent = ChatAgent(...)

# 4. Run evaluation
results = benchmark.run(
    agent=agent,
    # benchmark-specific parameters
    randomize=False,
    subset=None  # or number of examples
)

# 5. Access results
print(results)            # Summary metrics
print(benchmark.results)  # Detailed per-example results
```

## Implementing Custom Benchmarks

To create a custom benchmark, inherit from `BaseBenchmark` and implement the following (a minimal skeleton is sketched after the list):

1. `download()`: Download benchmark data
2. `load()`: Load data into the `self._data` dictionary
3. `run()`: Execute the benchmark and populate `self._results`
4. Optional: Override the `train`, `valid`, and `test` properties
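
Below is a minimal skeleton that follows these steps. It is a hedged sketch rather than code copied from CAMEL: the exact abstract signatures, the `data_dir` attribute, the structure of `self._data`, and the example names used here (`MyBenchmark`, the JSON-lines layout) are illustrative, so align them with `BaseBenchmark` in your installed version.

```python
# Hedged sketch of a custom benchmark; method signatures follow the
# BaseBenchmark table above and may differ from the installed class.
import json
import random
from pathlib import Path
from typing import Any, Dict, List, Optional

from camel.agents import ChatAgent
from camel.benchmarks import BaseBenchmark


class MyBenchmark(BaseBenchmark):
    def __init__(self, data_dir: str, save_to: str, processes: int = 1):
        super().__init__("my_benchmark", data_dir, save_to, processes)

    def download(self) -> "MyBenchmark":
        # Fetch or copy raw files into self.data_dir (skipped in this sketch).
        return self

    def load(self, force_download: bool = False) -> "MyBenchmark":
        if force_download:
            self.download()
        # Illustrative layout: one JSON-lines file per split under data_dir.
        for split in ("train", "valid", "test"):
            path = Path(self.data_dir) / f"{split}.jsonl"
            with open(path, encoding="utf-8") as f:
                self._data[split] = [json.loads(line) for line in f]
        return self

    def run(
        self,
        agent: ChatAgent,
        on: str = "test",
        randomize: bool = False,
        subset: Optional[int] = None,
    ) -> Dict[str, Any]:
        examples: List[Dict[str, Any]] = list(self._data[on])
        if randomize:
            random.shuffle(examples)
        if subset is not None:
            examples = examples[:subset]

        self._results = []
        for ex in examples:
            agent.reset()
            response = agent.step(ex["question"])
            answer = response.msgs[0].content
            self._results.append(
                {
                    "question": ex["question"],
                    "expected": ex["answer"],
                    "model_answer": answer,
                    "correct": ex["answer"].strip().lower() in answer.lower(),
                }
            )

        accuracy = sum(r["correct"] for r in self._results) / max(len(self._results), 1)
        return {"total": len(self._results), "accuracy": accuracy}
```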

## References

- **GAIA**: https://huggingface.co/datasets/gaia-benchmark/GAIA
- **APIBank**: https://github.com/AlibabaResearch/DAMO-ConvAI/tree/main/api-bank
- **APIBench (Gorilla)**: https://huggingface.co/datasets/gorilla-llm/APIBench
- **Nexus**: https://huggingface.co/collections/Nexusflow/nexusraven-v2
- **BrowseComp**: https://openai.com/index/browsecomp/
- **RAGBench**: https://arxiv.org/abs/2407.11005

## Other Resources

- Explore the [Agents](./agents.md) module for creating custom agents

Seems we're missing the reference for the BrowseComp Benchmark cc @fengju0213