RoGuard 1.0: Advancing Safety for LLMs with Robust Guardrails

Hugging Face Model License
Hugging Face Data License

RoGuard 1.0, a state-of-the-art instruction fine-tuned LLM, is designed to help safeguard our Text Generation API. It performs safety classification at both the prompt and the response level, deciding whether each input or output violates our policies. This dual-level assessment is essential for moderating both user queries and the model’s own generated outputs. At the heart of the system is an LLM fine-tuned from Llama-3.1-8B-Instruct, trained with a particular focus on high-quality instruction tuning to optimize safety-judgment performance.

Llama 3.1 is licensed under the Llama 3.1 Community License, Copyright © Meta Platforms, Inc. All Rights Reserved.

📦 Installation

To set up the environment, install all required dependencies by running the following commands from the root of the repository:

python -m venv venv_roguard
source venv_roguard/bin/activate 
pip install -r requirements.txt

🧠 Inference

RoGuard 1.0 is a taxonomy-adaptive model, meaning it can generalize to any given taxonomy. For evaluation and benchmarking, we provide dataset-specific prompts in the prompts/ directory that reflect the taxonomy of each dataset. You can also use your own datasets and corresponding taxonomies with RoGuard.

To evaluate RoGuard on a specific evaluation set, run the following command from the root of the repository, using the appropriate dataset configuration file:

python inference.py --config configs/RoGuardEval.json
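
For quick, ad-hoc checks outside the evaluation script, the model can also be queried directly with the Hugging Face transformers library. The sketch below is only a minimal illustration, not the official inference path: it assumes the prompt template uses the {prompt} and {response} placeholders described in the configuration section below, assumes the fine-tuned judge expects the base model's chat template, and uses a made-up example conversation.

# Minimal sketch (not the official inference path): query RoGuard 1.0 via transformers.
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "Roblox/Llama-3.1-8B-Instruct-RoGuard-1.0"
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(model_id, device_map="auto")

# Fill the evaluation prompt template with the conversation to judge.
# .replace() is used instead of str.format() in case the template contains other braces.
template = open("prompts/RoGuardEval.txt").read()
judge_input = (template
               .replace("{prompt}", "How do I pick a lock?")              # illustrative user query
               .replace("{response}", "Sorry, I can't help with that."))  # illustrative model reply

# Assumption: the judge is prompted through the base model's chat template.
messages = [{"role": "user", "content": judge_input}]
input_ids = tokenizer.apply_chat_template(
    messages, add_generation_prompt=True, return_tensors="pt"
).to(model.device)

output_ids = model.generate(input_ids, max_new_tokens=100)  # matches max_output_tokens in the configs
verdict = tokenizer.decode(output_ids[0][input_ids.shape[-1]:], skip_special_tokens=True)
print(verdict)  # contains fields such as "Response Safety" with values like "unsafe"

For reproducible numbers, prefer running inference.py with a config file as shown above.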

⚙️ Configuration

Several ready-to-use configuration files are provided in the configs/ folder.

To run an evaluation, each config file (a JSON file) should follow this structure; the // comments below annotate each field:

{
  "name": "RoGuardEval",                                        // Eval name

  "model_path": "Roblox/Llama-3.1-8B-Instruct-RoGuard-1.0",     // Our model path in huggingface
  "base_model": "meta-llama/Meta-Llama-3.1-8B-Instruct",        // Base model used for fine-tuning
  "max_output_tokens": 100,                                     // Max tokens the model can generate

  "eval_prompt": "prompts/RoGuardEval.txt",                     // Prompt template file, with placeholders for {prompt} and {response}
  "llm_output_field": "Response Safety",                        // Field in model output to evaluate for safety
  "llm_flagged_value": "unsafe",                                // Value indicating an unsafe response from the model

  "eval_dataset": "Roblox/RoGuard-Eval",                        // Evaluation dataset on Hugging Face
  "eval_label_field": "violation",                              // Field in the dataset that holds the ground truth label
  "eval_flagged_value": "true",                                 // Label value that indicates a violation (unsafe content)

  "output_file": "outputs/RoGuardEval.csv"                      // Path to save the evaluation results as a CSV file
}
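
To make the role of each field concrete, here is a rough sketch of how an evaluation pass could consume such a config; it is not the actual implementation of inference.py. The dataset split and the prompt/response column names are assumptions, and run_model / extract_field are hypothetical helpers standing in for the model call and for parsing the named field out of the model's output.

# Rough sketch of a config-driven evaluation pass (not the actual inference.py).
import csv
import json

from datasets import load_dataset

with open("configs/RoGuardEval.json") as f:
    cfg = json.load(f)

template = open(cfg["eval_prompt"]).read()
dataset = load_dataset(cfg["eval_dataset"], split="test")  # split name is an assumption

rows = []
for example in dataset:
    # Column names "prompt" and "response" are illustrative; match your dataset's schema.
    judge_input = (template
                   .replace("{prompt}", example["prompt"])
                   .replace("{response}", example.get("response", "")))

    # run_model / extract_field are hypothetical helpers (model call + output parsing).
    output_text = run_model(judge_input, max_new_tokens=cfg["max_output_tokens"])
    predicted_unsafe = extract_field(output_text, cfg["llm_output_field"]) == cfg["llm_flagged_value"]
    actual_unsafe = str(example[cfg["eval_label_field"]]).lower() == cfg["eval_flagged_value"]

    rows.append({
        "input_prompt": example["prompt"],
        "input_response": example.get("response", ""),
        "actual_unsafe": actual_unsafe,
        "predicted_unsafe": predicted_unsafe,
        "correct": predicted_unsafe == actual_unsafe,
    })

with open(cfg["output_file"], "w", newline="") as f:
    writer = csv.DictWriter(f, fieldnames=list(rows[0].keys()))
    writer.writeheader()
    writer.writerows(rows)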

📄 Output Files

  • Evaluation Results (*.csv):

    • input_prompt: the original prompt
    • input_response: the model’s generated response
    • actual_unsafe: ground-truth label (if provided)
    • predicted_unsafe: model’s prediction
    • correct: whether the prediction matched the ground truth
  • Summary Metrics (*_summary.csv):

    • Count-based metrics:
      Total Examples, True Positives, False Negatives, False Positives, True Negatives
    • Performance metrics (as percentages):
      Precision, Recall, F1 Score, False Positive Rate
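
For reference, the performance metrics above follow the standard binary-classification formulas, treating "unsafe" as the positive class. A small sketch (the exact computation in inference.py may differ):

# Reference formulas for the summary metrics, with "unsafe" as the positive class.
def summary_metrics(tp: int, fn: int, fp: int, tn: int) -> dict:
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    f1 = 2 * precision * recall / (precision + recall) if (precision + recall) else 0.0
    fpr = fp / (fp + tn) if (fp + tn) else 0.0
    return {
        "Total Examples": tp + fn + fp + tn,
        "True Positives": tp, "False Negatives": fn,
        "False Positives": fp, "True Negatives": tn,
        # Performance metrics are reported as percentages, matching the summary CSV.
        "Precision": 100 * precision, "Recall": 100 * recall,
        "F1 Score": 100 * f1, "False Positive Rate": 100 * fpr,
    }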

📁 Directory Structure

.
├── configs/                # Evaluation configs for different datasets
│   ├── aegis.json
│   ├── ...
│   └── RoGuardEval.json
├── prompts/                # Prompt files for inference or evaluation
│   ├── aegis.txt
│   ├── ...
│   └── RoGuardEval.txt
├── outputs/                # Output CSVs for results and summaries
│   ├── RoGuardEval.csv
│   └── RoGuardEval_summary.csv
├── inference.py            # Script for running inference/evaluation
└── requirements.txt        # Python dependencies
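
To evaluate on one of the other bundled datasets, point the script at the corresponding config, for example:

python inference.py --config configs/aegis.json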

📊 Model Benchmark Results

We benchmark the RoGuard 1.0 model on a comprehensive set of open-source datasets, covering both prompts and responses, as well as on RoGuard-Eval. This allows us to evaluate the model on both in-domain and out-of-domain datasets. We report results as F1 scores for binary violating/non-violating classification. In the table below, we compare our performance with that of several well-known models. RoGuard 1.0 outperforms the other models while generalizing to out-of-domain datasets.

  • Prompt Metrics: These evaluate how well the model classifies potentially harmful user inputs.
  • Response Metrics: These measure how well the model handles or generates responses, ensuring its outputs are safe and aligned.

[Benchmark results figure: F1 scores for RoGuard 1.0 and baseline models on prompt and response classification]

📚 Citation

If you are using RoGuard 1.0, please cite it as:

@online{roblox2025roguard,
  author       = {Mahesh Nandwana and Adam McFarlin and Nishchaie Khanna},
  title        = {State-of-the-Art LLM Helps Safeguard Unlimited Text Generation on Roblox: RoGuard 1.0 — Advancing Safety With Robust Guardrails},
  date         = {2025-07-22},
  howpublished = {\url{https://corp.roblox.com/newsroom/2025/07/roguard-advancing-safety-for-llms-with-robust-guardrails}},
}
