- When deploying language models with tool-calling capabilities in production environments, it's essential to ensure their effectiveness and reliability. This evaluation process goes beyond traditional testing and focuses on two key aspects:
+ Tool evaluations ensure AI models use your tools correctly in production. Unlike traditional testing, evaluations measure two key aspects:
+
+ 1. **Tool selection**: Does the model choose the right tools for the task?
+ 2. **Parameter accuracy**: Does the model provide correct arguments?
- 1. **Tool Utilization**: Assessing how efficiently the language model uses the available tools.
- 2. **Intent Understanding**: Evaluating the language model's ability to comprehend user intents and select the appropriate tools to fulfill those intents.
+ Arcade's evaluation framework helps you validate tool-calling capabilities before deployment, ensuring reliability in real-world applications. You can evaluate tools from MCP servers, Arcade Gateways, or custom implementations.
- Arcade's Evaluation Framework provides a comprehensive approach to assess and validate the tool-calling capabilities of language models, ensuring they meet the high standards required for real-world applications.
-## Why Evaluate Tool Calling by Task?
-
-Language models augmented with tool-use capabilities can perform complex tasks by invoking external tools or APIs. However, without proper evaluation, these models might:
-
-- **Misinterpret user intents**, leading to incorrect tool selection.
-- **Provide incorrect arguments** to tools, causing failures or undesired outcomes.
-- **Fail to execute the necessary sequence of tool calls**, especially in tasks requiring multiple steps.
-
-Evaluating tool calling by task ensures that the language model can handle specific scenarios reliably, providing confidence in its performance in production settings.
-
-## Evaluation Scoring
-
-Scoring in the evaluation framework is based on comparing the model's actual tool calls with the expected ones for each evaluation case. The total score for a case depends on:
-
-1. **Tool Selection**: Whether the model selected the correct tools for the task.
-2. **Tool Call Arguments**: The correctness of the arguments provided to the tools, evaluated by critics.
-3. **Evaluation Rubric**: Each aspect of the evaluation is weighted according to the rubric, affecting its impact on the final score.
-
-The evaluation result includes:
-
-- **Score**: A normalized value between 0.0 and 1.0.
-- **Result**:
- - _Passed_: Score is above the fail threshold.
- - _Failed_: Score is below the fail threshold.
- - _Warned_: Score is between the warning and fail thresholds.
+## What can go wrong?
-## Critics: Types and Usage
+Without proper evaluation, AI models might:
-Critics are essential for evaluating the correctness of tool call arguments. Different types of critics serve various evaluation needs:
+- **Misinterpret user intents**, selecting the wrong tools
+- **Provide incorrect arguments**, causing failures or unexpected behavior
+- **Skip necessary tool calls**, missing steps in multi-step tasks
+- **Make incorrect assumptions** about parameter defaults or formats
-### BinaryCritic
+## How evaluation works
-`BinaryCritic`s check for exact matches between expected and actual values after casting.
+Evaluations compare the model's actual tool calls with expected tool calls for each test case.
-- **Use Case**: When exact values are required (e.g., specific numeric parameters).
-- **Example**: Ensuring the model provides the exact user ID in a function call.
+### Scoring components
-### NumericCritic
+1. **Tool selection**: Did the model choose the correct tool?
+2. **Parameter evaluation**: Are the arguments correct? (evaluated by critics)
+3. **Weighted scoring**: Each aspect has a weight that affects the final score
-`NumericCritic` evaluates numeric values within a specified range, allowing for acceptable deviations.
+### Evaluation results
-- **Use Case**: When values can be approximate but should be within a certain threshold.
-- **Example**: Accepting approximate results in mathematical computations due to floating-point precision.
+Each test case receives:
-### SimilarityCritic
+- **Score**: A value between 0.0 and 1.0, computed from the weighted tool-selection and critic scores and normalized by the total weight (weights can be any positive value)
+- **Status**:
+  - **Passed**: Score meets or exceeds the fail threshold (default: 0.8)
+  - **Failed**: Score falls below the fail threshold
+  - **Warned**: Score meets the fail threshold but falls below the warn threshold (default: 0.9)
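+
+To make the scoring concrete, here is a small worked example in plain Python (no framework needed). It assumes the proportional normalization described above — each score is multiplied by its weight, summed, and divided by the total weight — along with the default thresholds, and the status mapping mirrors the example output shown below; the framework's internal formula may differ in its details.
+
+```python
+# Hypothetical numbers: tool selection matched (weight 1.0), one exact-match
+# critic passed (weight 0.5), and one free-text critic failed (weight 0.5).
+weighted_scores = [
+    (1.0, 1.0),  # (weight, score) for tool selection
+    (0.5, 1.0),  # critic on a structured argument
+    (0.5, 0.0),  # critic on a free-text argument
+]
+
+total_weight = sum(w for w, _ in weighted_scores)
+score = sum(w * s for w, s in weighted_scores) / total_weight  # 1.5 / 2.0 = 0.75
+
+fail_threshold, warn_threshold = 0.8, 0.9  # the defaults noted above
+if score < fail_threshold:
+    status = "FAILED"
+elif score < warn_threshold:
+    status = "WARNED"
+else:
+    status = "PASSED"
+
+print(f"{status} -- Score: {score:.2f}")  # FAILED -- Score: 0.75
+```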
-`SimilarityCritic` measures the similarity between expected and actual string values using metrics like cosine similarity.
+Example output:
-- **Use Case**: When the exact wording isn't critical, but the content should be similar.
-- **Example**: Evaluating if the message content in a communication tool is similar to the expected message.
-
-### DatetimeCritic
-
-`DatetimeCritic` evaluates the closeness of datetime values within a specified tolerance.
-
-- **Use Case**: When datetime values should be within a certain range of the expected time.
-- **Example**: Verifying if a scheduled event time is close enough to the intended time.
-
-### Choosing the Right Critic
-
-- **Exact Matches Needed**: Use **BinaryCritic** for strict equality.
-- **Numeric Ranges**: Use **NumericCritic** when a tolerance is acceptable.
-- **Textual Similarity**: Use **SimilarityCritic** for comparing messages or descriptions.
-- **Datetime Tolerance**: Use **DatetimeCritic** when a tolerance is acceptable for datetime comparisons.
-
-Critics are defined with fields such as `critic_field`, `weight`, and parameters specific to their types (e.g., `similarity_threshold` for `SimilarityCritic`).
+```
+PASSED Get weather for city -- Score: 1.00
+WARNED Send message with typo -- Score: 0.85
+FAILED Wrong tool selected -- Score: 0.50
+```
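+
+For a feel of how these pieces fit together, here is a compact sketch of one evaluation case in Python. The import path, the `ExpectedToolCall` name, and the exact `EvalSuite.add_case` signature are assumptions made for illustration, and `catalog` and `send_dm_to_user` are placeholders for your own tool catalog and tool — follow the guides linked below for the canonical setup.
+
+```python
+# Illustrative sketch only: names and import paths may differ in your Arcade SDK version.
+from arcade_evals import (  # assumed package name
+    BinaryCritic,
+    EvalRubric,
+    EvalSuite,
+    ExpectedToolCall,
+    SimilarityCritic,
+)
+
+rubric = EvalRubric(fail_threshold=0.8, warn_threshold=0.9)
+
+suite = EvalSuite(
+    name="Slack messaging evaluation",
+    system_message="You are a helpful assistant with access to Slack tools.",
+    catalog=catalog,  # placeholder: a ToolCatalog with your Slack tools registered
+    rubric=rubric,
+)
+
+suite.add_case(
+    name="Send a direct message",
+    user_message="Send a direct message to johndoe saying 'Hello, can we meet at 3 PM?'",
+    expected_tool_calls=[
+        ExpectedToolCall(
+            func=send_dm_to_user,  # placeholder: the tool under test
+            args={"user_name": "johndoe", "message": "Hello, can we meet at 3 PM?"},
+        ),
+    ],
+    critics=[
+        # Exact match required for the recipient; wording of the message may vary slightly.
+        BinaryCritic(critic_field="user_name", weight=0.5),
+        SimilarityCritic(critic_field="message", weight=0.5, similarity_threshold=0.8),
+    ],
+)
+```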
-## Rubrics and Setting Thresholds
+## Next steps
-An **EvalRubric** defines the evaluation criteria and thresholds for determining pass/fail outcomes. Key components include:
+- [Create an evaluation suite](/guides/create-tools/evaluate-tools/create-evaluation-suite) to start testing your tools
+- [Run evaluations](/guides/create-tools/evaluate-tools/run-evaluations) with multiple providers
+- Explore [capture mode](/guides/create-tools/evaluate-tools/capture-mode) to bootstrap test expectations
+- Compare tool sources with [comparative evaluations](/guides/create-tools/evaluate-tools/comparative-evaluations)
-- **Fail Threshold**: The minimum score required to pass the evaluation.
-- **Warn Threshold**: The score threshold for issuing a warning.
-- **Weights**: Assigns importance to different aspects of the evaluation (e.g., tool selection, argument correctness).
+## Advanced features
-### Setting Up a Rubric
+Once you're comfortable with basic evaluations, explore these advanced capabilities:
-- **Define Fail and Warn Thresholds**: Choose values between 0.0 and 1.0 to represent acceptable performance levels.
-- **Assign Weights**: Allocate weights to tool selection and critics to reflect their importance in the overall evaluation.
-- **Configure Failure Conditions**: Set flags like `fail_on_tool_selection` to enforce strict criteria.
+### Capture mode
-### Example Rubric Configuration:
+Record tool calls without scoring to see which tools, and which arguments, models actually call. Useful for bootstrapping test expectations and debugging. [Learn more →](/guides/create-tools/evaluate-tools/capture-mode)
-A rubric that requires a score of at least 0.85 to pass and issues a warning if the score is between 0.85 and 0.95:
+### Comparative evaluations
-- Fail Threshold: 0.85
-- Warn Threshold: 0.95
-- Fail on Tool Selection: True
-- Tool Selection Weight: 1.0
+Test the same cases against different tool sources (tracks) with isolated registries. Compare how models perform with different tool implementations. [Learn more →](/guides/create-tools/evaluate-tools/comparative-evaluations)
-```python
-rubric = EvalRubric(
- fail_threshold=0.85,
- warn_threshold=0.95,
- fail_on_tool_selection=True,
- tool_selection_weight=1.0,
-)
-```
+### Output formats
-## Building an Evaluation Suite
-
-An **EvalSuite** orchestrates the running of multiple evaluation cases. Here's how to build one:
-
-1. **Initialize EvalSuite**: Provide a name, system message, tool catalog, and rubric.
-2. **Add Evaluation Cases**: Use `add_case` or `extend_case` to include various scenarios.
-3. **Specify Expected Tool Calls**: Define the tools and arguments expected for each case.
-4. **Assign Critics**: Attach critics relevant to each case to evaluate specific arguments.
-5. **Run the Suite**: Execute the suite using the Arcade CLI to collect results.
-
-### Example: Math Tools Evaluation Suite
-
-An evaluation suite for math tools might include cases such as:
-
-- **Adding Two Large Numbers**:
- - **User Message**: "Add 12345 and 987654321"
- - **Expected Tool Call**: `add(a=12345, b=987654321)`
- - **Critics**:
- - `BinaryCritic` for arguments `a` and `b`
-- **Calculating Square Roots**:
- - **User Message**: "What is the square root of 3224990521?"
- - **Expected Tool Call**: `sqrt(a=3224990521)`
- - **Critics**:
- - `BinaryCritic` for argument `a`
-
-### Example: Slack Messaging Tools Evaluation Suite
-
-An evaluation suite for Slack messaging tools might include cases such as:
-
-- **Sending a Direct Message**:
- - **User Message**: "Send a direct message to johndoe saying 'Hello, can we meet at 3 PM?'"
- - **Expected Tool Call**: `send_dm_to_user(user_name='johndoe', message='Hello, can we meet at 3 PM?')`
- - **Critics**:
- - `BinaryCritic` for `user_name`
- - `SimilarityCritic` for `message`
-- **Posting a Message to a Channel**:
- - **User Message**: "Post 'The new feature is now live!' in the #announcements channel"
- - **Expected Tool Call**: `send_message_to_channel(channel_name='announcements', message='The new feature is now live!')`
- - **Critics**:
- - `BinaryCritic` for `channel_name`
- - `SimilarityCritic` for `message`
+Save results in multiple formats (txt, md, html, json) for reporting and analysis. Give the output file an extension to choose a single format, or omit the extension to write all formats at once. [Learn more →](/guides/create-tools/evaluate-tools/run-evaluations#output-formats)
diff --git a/app/en/references/cli-cheat-sheet/page.mdx b/app/en/references/cli-cheat-sheet/page.mdx
index db1283113..351f12857 100644
--- a/app/en/references/cli-cheat-sheet/page.mdx
+++ b/app/en/references/cli-cheat-sheet/page.mdx
@@ -333,25 +333,93 @@ import '../../../cheat-sheet-print.css'