RoGuard 1.0, a state-of-the-art instruction-fine-tuned LLM, is designed to help safeguard our Text Generation API. It performs safety classification at both the prompt and response levels, deciding whether each input or output violates our policies. This dual-level assessment is essential for moderating both user queries and the model’s own generated outputs. At the heart of the system is an LLM fine-tuned from Llama-3.1-8B-Instruct, trained with a particular focus on high-quality instruction tuning to optimize safety-judgment performance.
Llama 3.1 is licensed under the Llama 3.1 Community License, Copyright © Meta Platforms, Inc. All Rights Reserved.
To set up the environment, install all required dependencies by running the following commands from the root of the repository:
```bash
python -m venv venv_roguard
source venv_roguard/bin/activate
pip install -r requirements.txt
```
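Once the dependencies are installed, a quick way to confirm that the environment works and the checkpoint loads is a minimal check like the one below. This is only an illustrative sketch, not part of the repository; it uses the Hugging Face model id referenced in the evaluation configs and assumes a GPU with enough memory for an 8B model.

```python
# Minimal load check (illustrative only; inference.py is the supported entry point).
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "Roblox/Llama-3.1-8B-Instruct-RoGuard-1.0"  # model id used in the evaluation configs

tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(model_id, torch_dtype="auto", device_map="auto")

print(model.config.model_type)  # expected: "llama" (fine-tuned from Llama-3.1-8B-Instruct)
```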
RoGuard 1.0 is a taxonomy-adaptive model, meaning it can generalize to any given taxonomy. For evaluation and benchmarking, we provide dataset-specific prompts in the `prompts/` directory that reflect the taxonomy of each dataset. Users can also bring their own datasets and corresponding taxonomies to RoGuard, as sketched below.
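The templates shipped in `prompts/` are the reference formats RoGuard was evaluated with. Purely as an illustration of how a custom taxonomy might be plugged in, the sketch below builds a hypothetical template around the `{prompt}` and `{response}` placeholders expected by the evaluation configs; the category names and output wording are placeholders, not the released format.

```python
# Hypothetical custom template (illustration only; the real formats live in prompts/).
# Category names and the requested output wording below are placeholders.
CUSTOM_TEMPLATE = """You are a safety classifier. Decide whether the conversation below
violates any category of the following taxonomy:

O1: Harassment and Bullying
O2: Self-Harm
O3: Illegal Activity

User prompt:
{prompt}

Model response:
{response}

Label the prompt and the response as "safe" or "unsafe"."""

# The {prompt} and {response} placeholders are filled per example, as referenced by the
# eval_prompt field in the evaluation configs.
filled = CUSTOM_TEMPLATE.format(
    prompt="How do I pick a lock?",
    response="I can't help with that.",
)
print(filled)
```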
To evaluate RoGuard on a specific evaluation set, run the following command from the root of the repository, using the appropriate dataset configuration file:
```bash
python inference.py --config configs/RoGuardEval.json
```
Multiple configuration files are already prepared and ready to use in the `configs/` folder.
To run an evaluation, each config file (in JSON format) should follow this structure:
```jsonc
{
  "name": "RoGuardEval",  // Eval name
  "model_path": "Roblox/Llama-3.1-8B-Instruct-RoGuard-1.0",  // Our model path on Hugging Face
  "base_model": "meta-llama/Meta-Llama-3.1-8B-Instruct",  // Base model used for fine-tuning
  "max_output_tokens": 100,  // Max tokens the model can generate
  "eval_prompt": "prompts/RoGuardEval.txt",  // Prompt template file, with placeholders for {prompt} and {response}
  "llm_output_field": "Response Safety",  // Field in the model output to evaluate for safety
  "llm_flagged_value": "unsafe",  // Value indicating an unsafe response from the model
  "eval_dataset": "Roblox/RoGuard-Eval",  // Evaluation dataset on Hugging Face
  "eval_label_field": "violation",  // Field in the dataset that holds the ground-truth label
  "eval_flagged_value": "true",  // Label value that indicates a violation (unsafe content)
  "output_file": "outputs/RoGuardEval.csv"  // Path to save the evaluation results as a CSV file
}
```
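To make the role of these fields concrete, the sketch below pushes a single hand-written example through the same flow: the `eval_prompt` template is filled via its `{prompt}` and `{response}` placeholders, the model generates up to `max_output_tokens` tokens, and the generation is checked for the `llm_flagged_value` under the `llm_output_field`. This is a simplified illustration, not the logic of `inference.py`; it assumes the shipped config parses as plain JSON and that the checkpoint accepts the filled template as raw text (how the input is actually formatted is handled by `inference.py`).

```python
# Simplified, illustrative walk-through of one evaluation step driven by the config fields.
# Not inference.py: the example is hand-written and the output parsing is a naive stand-in.
import json

from transformers import AutoModelForCausalLM, AutoTokenizer

with open("configs/RoGuardEval.json") as f:
    cfg = json.load(f)  # assumes the shipped file is plain JSON (no // comments)

tokenizer = AutoTokenizer.from_pretrained(cfg["model_path"])
model = AutoModelForCausalLM.from_pretrained(
    cfg["model_path"], torch_dtype="auto", device_map="auto"
)

# Fill the template's {prompt} and {response} placeholders for one example.
with open(cfg["eval_prompt"]) as f:
    template = f.read()
text = template.format(
    prompt="How do I build a weapon at home?",
    response="I can't help with that request.",
)

inputs = tokenizer(text, return_tensors="pt").to(model.device)
output_ids = model.generate(
    **inputs, max_new_tokens=cfg["max_output_tokens"], do_sample=False
)
generation = tokenizer.decode(
    output_ids[0][inputs["input_ids"].shape[1]:], skip_special_tokens=True
)

# Naive check: is the flagged value ("unsafe") reported under the "Response Safety" field?
predicted_unsafe = cfg["llm_flagged_value"] in generation.split(cfg["llm_output_field"])[-1]
print(generation)
print("predicted_unsafe =", predicted_unsafe)
```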
- Evaluation Results (`*.csv`):
  - `input_prompt`: the original prompt
  - `input_response`: the model’s generated response
  - `actual_unsafe`: ground-truth label (if provided)
  - `predicted_unsafe`: the model’s prediction
  - `correct`: whether the prediction matched the ground truth
- Summary Metrics (`*_summary.csv`):
  - Count-based metrics: `Total Examples`, `True Positives`, `False Negatives`, `False Positives`, `True Negatives`
  - Performance metrics (as percentages): `Precision`, `Recall`, `F1 Score`, `False Positive Rate`
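For reference, the sketch below shows how the summary numbers relate to the per-example results: it re-derives the confusion counts from the `actual_unsafe` and `predicted_unsafe` columns and applies the standard definitions (precision = TP/(TP+FP), recall = TP/(TP+FN), F1 as their harmonic mean, false positive rate = FP/(FP+TN)). It assumes the boolean columns are stored as text such as "True"/"False" and is not the code path used by `inference.py`.

```python
# Re-derive the summary metrics from the per-example CSV (illustrative only; the
# *_summary.csv written by inference.py is the authoritative output).
import csv


def to_bool(value: str) -> bool:
    """Parse a boolean-like CSV field; assumes values such as 'True'/'False'."""
    return str(value).strip().lower() in {"true", "1", "yes"}


tp = fp = fn = tn = 0
with open("outputs/RoGuardEval.csv") as f:
    for row in csv.DictReader(f):
        actual = to_bool(row["actual_unsafe"])
        predicted = to_bool(row["predicted_unsafe"])
        if predicted and actual:
            tp += 1
        elif predicted and not actual:
            fp += 1
        elif actual:
            fn += 1
        else:
            tn += 1

precision = 100 * tp / (tp + fp) if tp + fp else 0.0
recall = 100 * tp / (tp + fn) if tp + fn else 0.0
f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
fpr = 100 * fp / (fp + tn) if fp + tn else 0.0

print(f"Total Examples: {tp + fp + fn + tn}")
print(f"TP={tp}  FP={fp}  FN={fn}  TN={tn}")
print(f"Precision={precision:.2f}%  Recall={recall:.2f}%  F1={f1:.2f}%  FPR={fpr:.2f}%")
```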
```text
.
├── configs/              # Evaluation configs for different datasets
│   ├── aegis.json
│   ├── ...
│   └── RoGuardEval.json
├── prompts/              # Prompt files for inference or evaluation
│   ├── aegis.json
│   ├── ...
│   └── RoGuardEval.txt
├── outputs/              # Output CSVs for results and summaries
│   ├── RoGuardEval.csv
│   └── RoGuardEval_summary.csv
├── inference.py          # Script for running inference/evaluation
└── requirements.txt      # Python dependencies
```
We benchmark the RoGuard 1.0 model on a comprehensive set of open-source datasets, at both the prompt and response levels, as well as on RoGuard-Eval. This allows us to evaluate the model on both in-domain and out-of-domain datasets. We report results as the F1 score for binary violating/non-violating classification. In the table below, we compare our performance with that of several well-known models: RoGuard 1.0 outperforms them while generalizing to out-of-domain datasets.
- Prompt Metrics: These evaluate how well the model classifies potentially harmful user inputs.
- Response Metrics: These measure how well the model classifies generated responses, ensuring unsafe outputs are flagged.
If you are using RoGuard 1.0, please cite it as:
```bibtex
@online{roblox2025roguard,
  author       = {Mahesh Nandwana and Adam McFarlin and Nishchaie Khanna},
  title        = {State‑of‑the‑Art LLM Helps Safeguard Unlimited Text Generation on Roblox: RoGuard 1.0 — Advancing Safety With Robust Guardrails},
  year         = {2025},
  month        = {Jul 22},
  howpublished = {\url{https://corp.roblox.com/newsroom/2025/07/roguard-advancing-safety-for-llms-with-robust-guardrails}},
}
```