This page introduces CAMEL’s data generation modules for creating high-quality training data with explicit reasoning, diverse instructions, and advanced automated refinement.

  • Chain of Thought (CoT): Generates explicit reasoning paths
  • Self-Instruct: Produces instruction-following data from human-written seed tasks and machine-generated instructions
  • Source2Synth: Synthesizes multi-hop QA from source text or code
  • Self-Improving CoT: Iteratively improves reasoning through agent self-critique

Chain of Thought (CoT) Data Generation

Chain of Thought (CoT) data generation creates step-by-step reasoning paths for problem solving, leveraging dual agents and advanced search/verification logic.

Quick Start: CoT Data Generation

Set up chain-of-thought data generation with a generator agent, a verifier agent, and a set of golden answers:

from camel.agents import ChatAgent
from camel.datagen import CoTDataGenerator

# Initialize agents
generator_agent = ChatAgent("System message for generator")
verifier_agent = ChatAgent("System message for verifier")

# Define golden answers
golden_answers = {
    "question1": "answer1",
    "question2": "answer2"
}

# Create generator
cot_generator = CoTDataGenerator(
    generator_agent=generator_agent,
    verifier_agent=verifier_agent,
    golden_answers=golden_answers,
    search_limit=100
)

# Generate solution
solution = cot_generator.solve("question1")

Data Import/Export for CoT

Easily import question-answer pairs or export generated solutions for further use:

# Import QA pairs from JSON
cot_generator.import_qa_from_json("qa_pairs.json")

# Export solutions
cot_generator.export_solutions("solutions.json")
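The exact file schema expected by import_qa_from_json is not shown here; a reasonable assumption is a flat question-to-answer mapping mirroring the golden_answers dict above. A minimal sketch of creating such a file:

```python
import json

# Hypothetical QA-pair file for import_qa_from_json: a flat
# question -> answer mapping is assumed; check your installed
# CAMEL version for the exact schema it expects.
qa_pairs = {
    "question1": "answer1",
    "question2": "answer2",
}

# Write the file that import_qa_from_json would consume
with open("qa_pairs.json", "w") as f:
    json.dump(qa_pairs, f, indent=2)
```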

Self-Instruct: Instruction Generation

Self-Instruct is a pipeline for generating high-quality, diverse instructions by combining human-written seed tasks and machine-generated prompts, all filtered for quality and diversity.

Quick Start: Self-Instruct Generation

Quickly set up an instruction generation pipeline with both human and machine prompts:

from camel.agents import ChatAgent
from camel.datagen.self_instruct import SelfInstructPipeline

# Initialize agent
agent = ChatAgent()

# Create pipeline with default settings
pipeline = SelfInstructPipeline(
    agent=agent,
    seed='seed_tasks.jsonl',  # Path to human-written seed tasks
    num_machine_instructions=5,
    data_output_path='./data_output.json',
    human_to_machine_ratio=(6, 2)
)

# Generate instructions
pipeline.generate()
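The seed argument above points to a JSONL file of human-written tasks, one JSON object per line. A minimal sketch of creating one (the "instruction" field name is an assumption; check your CAMEL version for the exact schema):

```python
import json

# Hypothetical seed file: each JSONL line is assumed to carry an
# "instruction" field with a human-written task.
seed_tasks = [
    {"instruction": "Summarize the following paragraph in one sentence."},
    {"instruction": "Translate the given English text to French."},
]

# One JSON object per line, as required by the JSONL format
with open("seed_tasks.jsonl", "w") as f:
    for task in seed_tasks:
        f.write(json.dumps(task) + "\n")
```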

Custom Filtering Example

Use custom filters to refine and deduplicate instructions as needed:

from camel.agents import ChatAgent
from camel.datagen.self_instruct import SelfInstructPipeline
from camel.datagen.self_instruct.filter import InstructionFilter

# Initialize agent
agent = ChatAgent()

# Configure filters
filter_config = {
    "length": {},
    "keyword": {},
    "punctuation": {},
    "non_english": {},
    "rouge_similarity": {
        "threshold": 0.7,
        "metric": "rouge-l"
    }
}

pipeline = SelfInstructPipeline(
    agent=agent,
    seed='seed_tasks.jsonl',
    instruction_filter=InstructionFilter(filter_config),
    num_machine_instructions=5
)
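To illustrate the idea behind the rouge_similarity filter, here is a standalone sketch (not CAMEL's implementation): a candidate instruction is dropped when its ROUGE-L F1 overlap with any already-kept instruction meets the threshold.

```python
def lcs_len(a, b):
    # Longest common subsequence length via dynamic programming
    dp = [[0] * (len(b) + 1) for _ in range(len(a) + 1)]
    for i, x in enumerate(a):
        for j, y in enumerate(b):
            dp[i + 1][j + 1] = (
                dp[i][j] + 1 if x == y else max(dp[i][j + 1], dp[i + 1][j])
            )
    return dp[len(a)][len(b)]

def rouge_l_f1(cand, ref):
    # ROUGE-L F1 over whitespace tokens, as a rough similarity score
    c, r = cand.lower().split(), ref.lower().split()
    lcs = lcs_len(c, r)
    if lcs == 0:
        return 0.0
    p, rec = lcs / len(c), lcs / len(r)
    return 2 * p * rec / (p + rec)

def dedupe(instructions, threshold=0.7):
    # Keep an instruction only if it is sufficiently different
    # from everything kept so far
    kept = []
    for inst in instructions:
        if all(rouge_l_f1(inst, k) < threshold for k in kept):
            kept.append(inst)
    return kept
```

For example, dedupe(["Write a poem about the sea.", "Write a poem about the sea and sky.", "Explain quantum computing."]) drops the near-duplicate second instruction and keeps the other two.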

Source2Synth: Multi-hop Question-Answer Generation

Source2Synth generates complex multi-hop QA pairs from source text (or code) via an orchestrated pipeline of AI-driven and rule-based steps, with curation and complexity control.

Quick Start: Source2Synth Pipeline

Rapidly generate a multi-hop QA dataset from your own text or source files:

from camel.datagen.source2synth import (
    UserDataProcessor,
    ProcessorConfig
)

# Create configuration
config = ProcessorConfig(
    seed=42,
    min_length=50,
    max_length=1000,
    complexity_threshold=0.5,
    dataset_size=10,
    use_ai_model=True,
)

# Initialize processor
processor = UserDataProcessor(config)

# Process a single text
result = processor.process_text(
    "Your source text here",
    source="example_source"
)

# Process multiple texts
texts = ["Text 1", "Text 2", "Text 3"]
sources = ["source1", "source2", "source3"]
batch_results = processor.process_batch(texts, sources)
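A multi-hop QA pair conceptually chains two or more facts from the source into a single question. The sketch below shows the general shape of such a record; the field names are assumptions for illustration, not CAMEL's exact output schema.

```python
# Illustrative shape of a multi-hop QA pair (hypothetical field
# names; inspect your actual processor output for the real schema)
multi_hop_pair = {
    "question": "Which city is the birthplace of the author of 'Work X'?",
    "reasoning_steps": [
        "'Work X' was written by Author Y.",
        "Author Y was born in City Z.",
    ],
    "answer": "City Z",
    "source": "example_source",
}
```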

Self-Improving CoT Data Generation

This pipeline implements self-taught reasoning: an iterative process in which an AI agent refines its own reasoning traces through self-evaluation, feedback, and reward models for continual improvement.

Quick Start: Self-Improving CoT Pipeline

Launch a self-improving reasoning workflow with just a few lines:

from camel.agents import ChatAgent
from camel.datagen import SelfImprovingCoTPipeline

# Initialize agents
reason_agent = ChatAgent(
    "Answer my question and give your final answer within \\boxed{}."
)

evaluate_agent = ChatAgent(
    "You are a highly critical teacher who evaluates the student's answers "
    "with a meticulous and demanding approach."
)

# Prepare your problems
problems = [
    {"problem": "Your problem text here"},
    # Add more problems...
]

# Create and run the pipeline
pipeline = SelfImprovingCoTPipeline(
    reason_agent=reason_agent,
    evaluate_agent=evaluate_agent,
    problems=problems,
    max_iterations=3,
    output_path="star_output.json"
)

results = pipeline.generate()
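Because the reason_agent is prompted to put its final answer inside \boxed{}, it can be convenient to pull that answer out of a generated trace. A small helper sketch (hypothetical, not a CAMEL API):

```python
import re

# Extract the final \boxed{...} answer from a reasoning trace,
# matching the instruction given to reason_agent above.
def extract_boxed(trace):
    matches = re.findall(r"\\boxed\{([^{}]*)\}", trace)
    return matches[-1] if matches else None

trace = r"Step 1: 2 + 2 = 4. Final answer: \boxed{4}"
print(extract_boxed(trace))  # prints 4
```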

Advanced: External Reward Model Integration

Evaluate and guide reasoning traces with an external reward model, such as Nemotron:

from camel.models.reward import NemotronRewardModel
from camel.types import ModelType

# Initialize reward model
reward_model = NemotronRewardModel(
    model_type=ModelType.NVIDIA_NEMOTRON_340B_REWARD,
    url="https://integrate.api.nvidia.com/v1",
    api_key="your_api_key"
)

# Create pipeline with reward model
pipeline = SelfImprovingCoTPipeline(
    reason_agent=reason_agent,
    evaluate_agent=evaluate_agent,
    problems=problems,
    reward_model=reward_model,
    score_threshold={
        "correctness": 0.8,
        "clarity": 0.7,
        "completeness": 0.7
    }
)