This repository provides an integrated ecosystem for developing and testing pre-execution safety guardrails:
| Component | Description | Purpose |
|---|---|---|
| 🔧 AuraGen | Configurable synthetic data generator with risk injection | Generate training data for guardrail research |
| 📊 Pre-Ex-Bench | Reference dataset for quick experimentation | Evaluate pre-execution safety models |
| 🛡️ Safiron | Guardian model for pre-execution safety | Detect and explain risks in agent planning |
Synthetic data engine with configurable risk injection
AuraGen generates harmless trajectories from scenarios and then injects controlled risks. These synthetic records are used to train and evaluate safety models.
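To make the two-stage shape of the pipeline concrete, here is an illustrative sketch; the function and field names are hypothetical, not AuraGen's actual API:

```python
# Illustrative sketch only: names here are hypothetical, not AuraGen's API.
# Stage 1 generates a harmless trajectory; Stage 2 injects a controlled risk
# and labels the record so it can supervise a guardian model.

def generate_harmless_trajectory(scenario: str) -> dict:
    # Stage 1 (hypothetical): expand a scenario into a planned trajectory.
    return {"scenario": scenario, "plan": ["step 1 ...", "step 2 ..."], "risky": False}

def inject_risk(record: dict, risk_category: str) -> dict:
    # Stage 2 (hypothetical): perturb the plan to introduce a specific,
    # labeled risk while keeping the rest of the trajectory intact.
    risky = dict(record)
    risky["plan"] = record["plan"] + [f"step introducing a {risk_category} risk ..."]
    risky.update(risky=True, risk_category=risk_category)
    return risky

harmless = generate_harmless_trajectory("book a flight for the user")
risky = inject_risk(harmless, "financial loss")
```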
- 🐍 Python 3.9+
- 🔑 API Key (OpenAI, Anthropic, etc.) depending on generation mode
```bash
python -m venv .venv
.venv\Scripts\activate        # Windows
# source .venv/bin/activate   # macOS/Linux

pip install -r requirements.txt
python config/configure_api_keys.py
python generate_and_inject.py
```

📖 Documentation available at: https://roaring-capybara-053cbe.netlify.app/
Lightweight benchmark for pre-execution safety
Pre-Ex-Bench provides a small set of examples for evaluating models on detection, classification, explanation, and generalization across different planners.
- Location: `Pre-Ex-Bench/dataset.json`
- Format: JSON list of entries
```python
import json
from pathlib import Path

data = json.loads(Path('Pre-Ex-Bench/dataset.json').read_text(encoding='utf-8'))
print(f"Loaded {len(data)} items")
print(data[0])
```

Guardian model for pre-execution safety in agentic systems
Safiron is trained on synthetic data from AuraGen and evaluated on Pre-Ex-Bench. It analyzes planned agent actions before execution, detects whether they are risky, classifies the risk type, and generates concise explanations. This enables safer deployment of LLM-based agents by intercepting unsafe plans at the planning stage.
| Feature | Description |
|---|---|
| 🔍 Risk Detection | Distinguishes harmless vs. risky plans |
| 📊 Risk Classification | Assigns risk to categories (privacy, financial loss, etc.) |
| 📝 Concise Explanations | Produces short, interpretable explanations |
- 🤗 Model Release: Safiron/Safiron
- 🎓 Training: Supervised fine-tuning + reinforcement learning (GRPO) on AuraGen data
- 📊 Evaluation: Benchmarked with Pre-Ex-Bench
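For reference, the GRPO stage mentioned above scores each sampled response relative to the group sampled for the same prompt. A minimal sketch of that group-relative advantage computation (illustrative only, not the project's training code):

```python
import statistics

def grpo_advantages(rewards: list[float]) -> list[float]:
    # GRPO normalizes each response's reward against the group sampled
    # for the same prompt: advantage = (r - mean) / std.
    mean = statistics.mean(rewards)
    std = statistics.pstdev(rewards) or 1.0  # guard against zero variance
    return [(r - mean) / std for r in rewards]

# Rewards for, e.g., four sampled risk assessments of the same plan
print(grpo_advantages([1.0, 0.0, 0.5, 1.0]))
```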
```python
from vllm import LLM
from vllm.sampling_params import SamplingParams

llm = LLM(model="Safiron/Safiron")
sampling = SamplingParams(max_tokens=512, temperature=0.1, top_p=0.9)

instruction = "You are a guardian model for agentic systems. Assess potential risks..."
input_text = "{agent_action: '...', env_info: '...'}"
messages = [{
    "role": "user",
    "content": f"{instruction}\n\nInput:\n{input_text}"
}]
outputs = llm.chat([messages], sampling_params=sampling)
print(outputs[0].outputs[0].text.strip())
```
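In a deployment, the assessment above would typically gate execution. Continuing from the example, here is a minimal sketch of that interception pattern; the verdict parsing is hypothetical, since the exact output format depends on the prompt template:

```python
def allow_execution(assessment: str) -> bool:
    # Hypothetical parsing: assumes the guardian's response contains a verdict
    # word such as "harmless" or "risky". Adapt to the actual output format.
    text = assessment.lower()
    return "harmless" in text and "risky" not in text

assessment = outputs[0].outputs[0].text.strip()
if allow_execution(assessment):
    print("Plan approved; proceeding to execution.")
else:
    print(f"Plan blocked before execution:\n{assessment}")
```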
Evaluate risk explanations only when the model has correctly detected harmful samples.

```bash
pip install openai scikit-learn tqdm
export OPENAI_API_KEY="your_api_key_here"

python evaluation/eval.py \
  --test-file path/to/results.json \
  --model gpt-4o-mini \
  --out-file evaluation/eval_results.json
```

| Argument | Description |
|---|---|
| `--test-file` | Path to model prediction file (JSON/JSONL). |
| `--model` | LLM model used to judge explanations. |
| `--workers` | Number of parallel threads for LLM calls. |
| `--out-file` | Output file with annotated evaluation results. |
Each entry in the prediction file should contain:

```json
{
  "expected_output": "Expected response text",
  "predicted_output": "Model predicted response"
}
```

The script reports:

- Classification accuracy
- Harmful detection accuracy
- Risk category accuracy
- Explanation correctness
- Confusion matrix
Annotated evaluation results are saved to the path given by `--out-file`.
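The gating rule above (explanations judged only on correctly detected harmful samples) amounts to a small filter over the results. A sketch with hypothetical field names, since the annotated output schema isn't shown here:

```python
import json

# Hypothetical field names ("label", "predicted_label") for illustration;
# the real annotated file may use different keys.
with open("evaluation/eval_results.json", encoding="utf-8") as f:
    results = json.load(f)

judge_pool = [
    r for r in results
    if r.get("label") == "harmful" and r.get("predicted_label") == "harmful"
]
print(f"{len(judge_pool)} / {len(results)} samples qualify for explanation judging")
```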
Safiron and related resources are released under the Creative Commons Attribution-NonCommercial 4.0 (CC BY-NC 4.0) License.
- 🎓 For research and educational purposes
- 🚫 Commercial use prohibited
🛡️ Building Safer Agentic Systems via Synthetic Data 🛡️