RoGuard 1.0, a state-of-the-art instruction-fine-tuned LLM, is designed to help safeguard our Text Generation API. It performs safety classification at both the prompt and response levels, deciding whether each input or output violates our policies. This dual-level assessment is essential for moderating both user queries and the model’s own generated outputs. At the heart of the system is an LLM fine-tuned from Llama-3.1-8B-Instruct, trained with a particular focus on high-quality instruction tuning to optimize safety-judgment performance.
Llama 3.1 is licensed under the Llama 3.1 Community License, Copyright © Meta Platforms, Inc. All Rights Reserved.
To set up the environment, install all required dependencies by running the following commands from the root of the repository:
```bash
python -m venv venv_roguard
source venv_roguard/bin/activate
pip install -r requirements.txt
```
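Once the dependencies are installed, a quick way to confirm that the environment works and the checkpoint loads is a minimal check like the one below. This is only an illustrative sketch, not part of the repository; it uses the Hugging Face model id referenced in the evaluation configs and assumes a GPU with enough memory for an 8B model.

```python
# Minimal load check (illustrative only; inference.py is the supported entry point).
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "Roblox/Llama-3.1-8B-Instruct-RoGuard-1.0"  # model id used in the evaluation configs

tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(model_id, torch_dtype="auto", device_map="auto")

print(model.config.model_type)  # expected: "llama" (fine-tuned from Llama-3.1-8B-Instruct)
```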
RoGuard 1.0 is a taxonomy-adaptive model, meaning it can generalize to any given taxonomy. For evaluation and benchmarking, we provide dataset-specific prompts in the `prompts/` directory that reflect the taxonomy of each dataset. Users can also bring their own datasets and corresponding taxonomies to RoGuard, as sketched below.
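The templates shipped in `prompts/` are the reference formats RoGuard was evaluated with. Purely as an illustration of how a custom taxonomy might be plugged in, the sketch below builds a hypothetical template around the `{prompt}` and `{response}` placeholders expected by the evaluation configs; the category names and output wording are placeholders, not the released format.

```python
# Hypothetical custom template (illustration only; the real formats live in prompts/).
# Category names and the requested output wording below are placeholders.
CUSTOM_TEMPLATE = """You are a safety classifier. Decide whether the conversation below
violates any category of the following taxonomy:

O1: Harassment and Bullying
O2: Self-Harm
O3: Illegal Activity

User prompt:
{prompt}

Model response:
{response}

Label the prompt and the response as "safe" or "unsafe"."""

# The {prompt} and {response} placeholders are filled per example, as referenced by the
# eval_prompt field in the evaluation configs.
filled = CUSTOM_TEMPLATE.format(
    prompt="How do I pick a lock?",
    response="I can't help with that.",
)
print(filled)
```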
To evaluate RoGuard on a specific evaluation set, run the following command from the root of the repository, using the appropriate dataset configuration file:
```bash
python inference.py --config configs/RoGuardEval.json
```
Multiple configuration files are already prepared and ready to use in the `configs/` folder.
To run an evaluation, each config file (in JSON format) should follow this structure:
```jsonc
{
  "name": "RoGuardEval",  // Eval name
  "model_path": "Roblox/Llama-3.1-8B-Instruct-RoGuard-1.0",  // Our model path on Hugging Face
  "base_model": "meta-llama/Meta-Llama-3.1-8B-Instruct",  // Base model used for fine-tuning
  "max_output_tokens": 100,  // Max tokens the model can generate
  "eval_prompt": "prompts/RoGuardEval.txt",  // Prompt template file, with placeholders for {prompt} and {response}
  "llm_output_field": "Response Safety",  // Field in the model output to evaluate for safety
  "llm_flagged_value": "unsafe",  // Value indicating an unsafe response from the model
  "eval_dataset": "Roblox/RoGuard-Eval",  // Evaluation dataset on Hugging Face
  "eval_label_field": "violation",  // Field in the dataset that holds the ground-truth label
  "eval_flagged_value": "true",  // Label value that indicates a violation (unsafe content)
  "output_file": "outputs/RoGuardEval.csv"  // Path to save the evaluation results as a CSV file
}
```
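To make the role of these fields concrete, the sketch below pushes a single hand-written example through the same flow: the `eval_prompt` template is filled via its `{prompt}` and `{response}` placeholders, the model generates up to `max_output_tokens` tokens, and the generation is checked for the `llm_flagged_value` under the `llm_output_field`. This is a simplified illustration, not the logic of `inference.py`; it assumes the shipped config parses as plain JSON and that the checkpoint accepts the filled template as raw text (how the input is actually formatted is handled by `inference.py`).

```python
# Simplified, illustrative walk-through of one evaluation step driven by the config fields.
# Not inference.py: the example is hand-written and the output parsing is a naive stand-in.
import json

from transformers import AutoModelForCausalLM, AutoTokenizer

with open("configs/RoGuardEval.json") as f:
    cfg = json.load(f)  # assumes the shipped file is plain JSON (no // comments)

tokenizer = AutoTokenizer.from_pretrained(cfg["model_path"])
model = AutoModelForCausalLM.from_pretrained(
    cfg["model_path"], torch_dtype="auto", device_map="auto"
)

# Fill the template's {prompt} and {response} placeholders for one example.
with open(cfg["eval_prompt"]) as f:
    template = f.read()
text = template.format(
    prompt="How do I build a weapon at home?",
    response="I can't help with that request.",
)

inputs = tokenizer(text, return_tensors="pt").to(model.device)
output_ids = model.generate(
    **inputs, max_new_tokens=cfg["max_output_tokens"], do_sample=False
)
generation = tokenizer.decode(
    output_ids[0][inputs["input_ids"].shape[1]:], skip_special_tokens=True
)

# Naive check: is the flagged value ("unsafe") reported under the "Response Safety" field?
predicted_unsafe = cfg["llm_flagged_value"] in generation.split(cfg["llm_output_field"])[-1]
print(generation)
print("predicted_unsafe =", predicted_unsafe)
```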
- Evaluation Results (`*.csv`):
  - `input_prompt`: the original prompt
  - `input_response`: the model’s generated response
  - `actual_unsafe`: ground-truth label (if provided)
  - `predicted_unsafe`: the model’s prediction
  - `correct`: whether the prediction matched the ground truth
- Summary Metrics (`*_summary.csv`):
  - Count-based metrics: `Total Examples`, `True Positives`, `False Negatives`, `False Positives`, `True Negatives`
  - Performance metrics (as percentages): `Precision`, `Recall`, `F1 Score`, `False Positive Rate`
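For reference, the sketch below shows how the summary numbers relate to the per-example results: it re-derives the confusion counts from the `actual_unsafe` and `predicted_unsafe` columns and applies the standard definitions (precision = TP/(TP+FP), recall = TP/(TP+FN), F1 as their harmonic mean, false positive rate = FP/(FP+TN)). It assumes the boolean columns are stored as text such as "True"/"False" and is not the code path used by `inference.py`.

```python
# Re-derive the summary metrics from the per-example CSV (illustrative only; the
# *_summary.csv written by inference.py is the authoritative output).
import csv


def to_bool(value: str) -> bool:
    """Parse a boolean-like CSV field; assumes values such as 'True'/'False'."""
    return str(value).strip().lower() in {"true", "1", "yes"}


tp = fp = fn = tn = 0
with open("outputs/RoGuardEval.csv") as f:
    for row in csv.DictReader(f):
        actual = to_bool(row["actual_unsafe"])
        predicted = to_bool(row["predicted_unsafe"])
        if predicted and actual:
            tp += 1
        elif predicted and not actual:
            fp += 1
        elif actual:
            fn += 1
        else:
            tn += 1

precision = 100 * tp / (tp + fp) if tp + fp else 0.0
recall = 100 * tp / (tp + fn) if tp + fn else 0.0
f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
fpr = 100 * fp / (fp + tn) if fp + tn else 0.0

print(f"Total Examples: {tp + fp + fn + tn}")
print(f"TP={tp}  FP={fp}  FN={fn}  TN={tn}")
print(f"Precision={precision:.2f}%  Recall={recall:.2f}%  F1={f1:.2f}%  FPR={fpr:.2f}%")
```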
```text
.
├── configs/              # Evaluation configs for different datasets
│   ├── aegis.json
│   ├── ...
│   └── RoGuardEval.json
├── prompts/              # Prompt files for inference or evaluation
│   ├── aegis.json
│   ├── ...
│   └── RoGuardEval.txt
├── outputs/              # Output CSVs for results and summaries
│   ├── RoGuardEval.csv
│   └── RoGuardEval_summary.csv
├── inference.py          # Script for running inference/evaluation
└── requirements.txt      # Python dependencies
```
We benchmark the RoGuard 1.0 model on a comprehensive set of open-source datasets, at both the prompt and response levels, as well as on RoGuard-Eval. This allows us to evaluate the model on both in-domain and out-of-domain datasets. We report results as the F1 score for binary violating/non-violating classification. In the table below, we compare our performance with that of several well-known models: RoGuard 1.0 outperforms them while generalizing to out-of-domain datasets.
- Prompt Metrics: These evaluate how well the model classifies potentially harmful user inputs.
- Response Metrics: These measure how well the model classifies generated responses, ensuring unsafe outputs are flagged.
If you are using RoGuard 1.0, please cite it as:
```bibtex
@online{roblox2025roguard,
  author       = {Mahesh Nandwana and Adam McFarlin and Nishchaie Khanna},
  title        = {State‑of‑the‑Art LLM Helps Safeguard Unlimited Text Generation on Roblox: RoGuard 1.0 — Advancing Safety With Robust Guardrails},
  year         = {2025},
  month        = {Jul 22},
  howpublished = {\url{https://corp.roblox.com/newsroom/2025/07/roguard-advancing-safety-for-llms-with-robust-guardrails}},
}
```