Skip to content

HowieHwong/Agentic-Guardian

Repository files navigation

🛡️ Agentic-Guardian

Python License Paper Documentation Hugging Face

Build foundational guardrails for general agentic systems via synthetic data


📋 Table of Contents

🌟 Overview

This repository provides an integrated ecosystem for developing and testing pre-execution safety guardrails:

Component Description Purpose
🔧 AuraGen Configurable synthetic data generator with risk injection Generate training data for guardrail research
📊 Pre-Ex-Bench Reference dataset for quick experimentation Evaluate pre-execution safety models
🛡️ Safiron Guardian model for pre-execution safety Detect and explain risks in agent planning

🔧 AuraGen

Synthetic data engine with configurable risk injection

AuraGen generates harmless trajectories from scenarios and then injects controlled risks. These synthetic records are used to train and evaluate safety models.

🚀 Quick Setup

📋 Prerequisites

  • 🐍 Python 3.9+
  • 🔑 API Key (OpenAI, Anthropic, etc.) depending on generation mode

📦 Installation

python -m venv .venv
.venv\Scripts\activate     # Windows
# source .venv/bin/activate  # macOS/Linux
pip install -r requirements.txt

⚙️ Configure API Keys

python config/configure_api_keys.py

🎯 Generate Data

python generate_and_inject.py

📚 Documentation

📖 Available at: https://roaring-capybara-053cbe.netlify.app/

📊 Pre-Ex-Bench

Lightweight benchmark for pre-execution safety

Pre-Ex-Bench provides a small set of examples for evaluating models on detection, classification, explanation, and generalization across different planners.

📁 Dataset

  • Location: Pre-Ex-Bench/dataset.json
  • Format: JSON list of entries

💻 Usage Example

import json
from pathlib import Path

data = json.loads(Path('Pre-Ex-Bench/dataset.json').read_text(encoding='utf-8'))
print(f"Loaded {len(data)} items")
print(data[0])

🛡️ Safiron

Guardian model for pre-execution safety in agentic systems

Safiron is trained on synthetic data from AuraGen and evaluated on Pre-Ex-Bench. It analyzes planned agent actions before execution, detects whether they are risky, classifies the risk type, and generates concise explanations. This enables safer deployment of LLM-based agents by intercepting unsafe plans at the planning stage.

🎯 Core Capabilities

Feature Description
🔍 Risk Detection Distinguishes harmless vs. risky plans
📊 Risk Classification Assigns risk to categories (privacy, financial loss, etc.)
📝 Concise Explanations Produces short, interpretable explanations

🏆 Model Information

  • 🤗 Model Release: Safiron/Safiron
  • 🎓 Training: Supervised fine-tuning + reinforcement learning (GRPO) on AuraGen data
  • 📊 Evaluation: Benchmarked with Pre-Ex-Bench

🚀 Minimal Usage

from vllm import LLM
from vllm.sampling_params import SamplingParams

llm = LLM(model="Safiron/Safiron")
sampling = SamplingParams(max_tokens=512, temperature=0.1, top_p=0.9)

instruction = "You are a guardian model for agentic systems. Assess potential risks..."
input_text = "{agent_action: '...', env_info: '...'}"

messages = [{
    "role": "user", 
    "content": f"{instruction}\n\nInput:\n{input_text}"
}]

outputs = llm.chat([messages], sampling_params=sampling)
print(outputs[0].outputs[0].text.strip())

🔧 Evaluation

Overview

Evaluate risk explanations only when the model has correctly detected harmful samples.

Installation

pip install openai scikit-learn tqdm
export OPENAI_API_KEY="your_api_key_here"

Usage

python evaluation/eval.py --test-file path/to/results.json --model gpt-4o-mini --out-file evaluation/eval_results.json

Arguments

Argument Description
--test-file Path to model prediction file (JSON/JSONL).
--model LLM model for explanation judging.
--workers Number of parallel threads for LLM calls.
--out-file Output file with annotated evaluation.

Input Format

{
  "expected_output": "Expected response text",
  "predicted_output": "Model predicted response"
}

Output

  • Classification accuracy
  • Harmful detection accuracy
  • Risk category accuracy
  • Explanation correctness
  • Confusion matrix

Annotated evaluation results are saved to --out-file.

📄 License

Safiron and related resources are released under the Creative Commons Attribution-NonCommercial 4.0 (CC BY-NC 4.0) License.

  • 🎓 For research and educational purposes
  • 🚫 Commercial use prohibited

🛡️ Building Safer Agentic Systems via Synthetic Data 🛡️

About

Building a Foundational Guardrail for General Agentic Systems via Synthetic Data

Resources

Stars

Watchers

Forks

Releases

No releases published

Packages

No packages published

Languages