Synthetic Data Kit

Tool for generating high-quality synthetic datasets to fine-tune LLMs.

Generate Reasoning Traces, QA Pairs, save them to a fine-tuning format with a simple CLI.

Check out our guide on using the tool to unlock task-specific reasoning in the Llama-3 family.

What does Synthetic Data Kit offer?

Fine-tuning large language models is easy: there are many mature tools you can use to fine-tune the Llama model family with various post-training techniques.

Why target data preparation?

Multiple tools support standardized formats. However, most of the time your dataset is not structured as "user"/"assistant" threads or in a format that plays well with fine-tuning packages.

This toolkit simplifies the journey of:

  • Using a local LLM (via vLLM) to generate examples
  • A modular 4-command flow
  • Converting your existing files to fine-tuning friendly formats
  • Creating synthetic datasets
  • Supporting various post-training fine-tuning formats

How does Synthetic Data Kit do it?

The tool is designed to follow a simple CLI structure with 4 commands:

  • ingest: convert various file formats to plain text
  • create: generate your fine-tuning data (QA pairs, QA pairs with CoT, or a summary format)
  • curate: use Llama as a judge to keep only high-quality examples
  • save-as: save the result in the format your fine-tuning workflow requires

You can override any parameter either via CLI flags or by supplying a custom YAML config.

Installation

From PyPI

# Create a new environment
conda create -n synthetic-data python=3.10
conda activate synthetic-data

pip install synthetic-data-kit

(Alternatively) From Source

git clone https://github.com/meta-llama/synthetic-data-kit.git
cd synthetic-data-kit
pip install -e .

To get an overview of commands type:

synthetic-data-kit --help

1. Tool Setup

  • The tool expects input files to be placed in named folders (created below).
  • You also need a vLLM server running the LLM that you will utilize for generating your dataset.
# Create directory structure
mkdir -p data/{pdf,html,youtube,docx,ppt,txt,output,generated,cleaned,final}

# Start vLLM server
# Note you will need to grab your HF Authentication from: https://huggingface.co/settings/tokens
vllm serve meta-llama/Llama-3.3-70B-Instruct --port 8000
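
Once running, vLLM exposes an OpenAI-compatible API at the api_base used in the config. A minimal stdlib-only sketch of a connectivity check (the endpoint names follow the OpenAI convention; `check_server` and `models_url` are hypothetical helpers, not part of the toolkit):

```python
import json
import urllib.error
import urllib.request

def models_url(api_base: str) -> str:
    """Build the OpenAI-compatible /models endpoint from the api_base."""
    return api_base.rstrip("/") + "/models"

def check_server(api_base: str = "http://localhost:8000/v1") -> bool:
    """Return True if a server answers at api_base with a model list."""
    try:
        with urllib.request.urlopen(models_url(api_base), timeout=2) as resp:
            return bool(json.load(resp).get("data"))
    except (urllib.error.URLError, OSError):
        return False
```

The built-in `synthetic-data-kit system-check` command (shown below) does this for you; the sketch is only to illustrate what "server is running" means here.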

2. Usage

The flow follows 4 simple steps: ingest, create, curate, save-as. Place your file in the respective folder, then:

# Check if vLLM server is running
synthetic-data-kit system-check

# Parse a document to text
synthetic-data-kit ingest docs/report.pdf
# This will save file to data/output/report.txt

# Generate QA pairs (default)
synthetic-data-kit create data/output/report.txt --type qa

OR 

# Generate Chain of Thought (CoT) reasoning examples
synthetic-data-kit create data/output/report.txt --type cot

# Both of these will save file to data/generated/report_qa_pairs.json

# Filter content based on quality
synthetic-data-kit curate data/generated/report_qa_pairs.json

# Convert to alpaca fine-tuning format and save as HF arrow file
synthetic-data-kit save-as data/cleaned/report_cleaned.json --format alpaca --storage hf
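
The intermediate JSON files can also be post-processed directly. A hypothetical sketch of mapping curated QA pairs into Alpaca-style records (the exact on-disk schema is an assumption; inspect your generated files before relying on it):

```python
import json

def to_alpaca(qa_pairs):
    """Map {question, answer} dicts to Alpaca-style records."""
    return [
        {"instruction": p["question"], "input": "", "output": p["answer"]}
        for p in qa_pairs
    ]

# Hypothetical contents of a curated file:
pairs = [{"question": "What does the ingest command do?",
          "answer": "It parses a source document to plain text."}]
records = to_alpaca(pairs)
```

This is the same shape the built-in `--format alpaca` option targets.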

Configuration

The toolkit uses a YAML configuration file (default: configs/config.yaml).

Note: this can be overridden via CLI arguments or by passing a custom YAML file.

# Example configuration
vllm:
  api_base: "http://localhost:8000/v1"
  model: "meta-llama/Llama-3.3-70B-Instruct"

generation:
  temperature: 0.7
  chunk_size: 4000
  num_pairs: 25

curate:
  threshold: 7.0
  batch_size: 8

Customizing Configuration

Create an overriding configuration file and use it with the -c flag:

synthetic-data-kit -c my_config.yaml ingest docs/paper.pdf
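
For example, a my_config.yaml that changes only the fields you care about (assuming, as is conventional for such overrides, that unspecified fields fall back to the defaults):

```yaml
# my_config.yaml — override only what differs from the defaults
vllm:
  model: "meta-llama/Llama-3.1-8B-Instruct"

generation:
  num_pairs: 50

curate:
  threshold: 8.0
```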

Examples

Processing a PDF Document

# Ingest PDF
synthetic-data-kit ingest research_paper.pdf

# Generate QA pairs
synthetic-data-kit create data/output/research_paper.txt -n 30 --threshold 8.0

# Curate data
synthetic-data-kit curate data/generated/research_paper_qa_pairs.json -t 8.5

# Save in OpenAI fine-tuning format (JSON)
synthetic-data-kit save-as data/cleaned/research_paper_cleaned.json -f ft

# Save in OpenAI fine-tuning format (HF dataset)
synthetic-data-kit save-as data/cleaned/research_paper_cleaned.json -f ft --storage hf

Processing a YouTube Video

# Extract transcript
synthetic-data-kit ingest "https://www.youtube.com/watch?v=dQw4w9WgXcQ"

# Generate QA pairs with specific model
synthetic-data-kit create data/output/youtube_dQw4w9WgXcQ.txt

Processing Multiple Files

# Bash script to process multiple files
for file in data/pdf/*.pdf; do
  filename=$(basename "$file" .pdf)
  
  synthetic-data-kit ingest "$file"
  synthetic-data-kit create "data/output/${filename}.txt" -n 20
  synthetic-data-kit curate "data/generated/${filename}_qa_pairs.json" -t 7.5
  synthetic-data-kit save-as "data/cleaned/${filename}_cleaned.json" -f chatml
done
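
The intermediate paths in the loop are derived from the input filename; a quick sketch of that derivation on its own:

```shell
# basename strips the directory and the given suffix
file="data/pdf/research_paper.pdf"
filename=$(basename "$file" .pdf)

echo "data/output/${filename}.txt"              # parsed text
echo "data/generated/${filename}_qa_pairs.json" # generated pairs
echo "data/cleaned/${filename}_cleaned.json"    # curated pairs
```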

Advanced Usage

Custom Prompt Templates

Edit the prompts section in your configuration file to customize generation behavior:

prompts:
  qa_generation: |
    You are creating question-answer pairs for fine-tuning a legal assistant.
    Focus on technical legal concepts, precedents, and statutory interpretation.
    
    Below is a chunk of text about: {summary}...
    
    Create {num_pairs} high-quality question-answer pairs based ONLY on this text.
    
    Return ONLY valid JSON formatted as:
    [
      {
        "question": "Detailed legal question?",
        "answer": "Precise legal answer."
      },
      ...
    ]
    
    Text:
    ---
    {text}
    ---
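
Note that the template mixes placeholders like {num_pairs} with literal JSON braces, so a naive str.format call would choke on the JSON example. A sketch of filling such placeholders with plain string replacement (how the toolkit actually renders prompts is an assumption; `render_prompt` is a hypothetical helper):

```python
def render_prompt(template: str, **values) -> str:
    """Replace {name} placeholders without touching literal JSON braces."""
    for key, value in values.items():
        template = template.replace("{" + key + "}", str(value))
    return template

template = 'Create {num_pairs} pairs as JSON: [{"question": "..."}]\nText: {text}'
prompt = render_prompt(template, num_pairs=25, text="Some chunk.")
```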

Mental Model:

graph LR
    SDK --> SystemCheck[system-check]
    SDK[synthetic-data-kit] --> Ingest[ingest]
    SDK --> Create[create]
    SDK --> Curate[curate]
    SDK --> SaveAs[save-as]
    
    Ingest --> PDFFile[PDF File]
    Ingest --> HTMLFile[HTML File]
    Ingest --> YouTubeURL[YouTube URL]

    
    Create --> CoT[CoT]
    Create --> QA[QA Pairs]
    Create --> Summary[Summary]
    
    Curate --> Filter[Filter by Quality]
    
    SaveAs --> JSONL[JSONL Format]
    SaveAs --> Alpaca[Alpaca Format]
    SaveAs --> FT[Fine-Tuning Format]
    SaveAs --> ChatML[ChatML Format]

Troubleshooting FAQs:

vLLM Server Issues

  • Ensure vLLM is installed: pip install vllm
  • Start server with: vllm serve <model_name> --port 8000
  • Check connection: synthetic-data-kit system-check

Memory Issues

If you encounter CUDA out of memory errors:

  • Use a smaller model
  • Reduce batch size in config
  • Start vLLM with --gpu-memory-utilization 0.85

JSON Parsing Issues

If you encounter issues with the curate command:

  • Use the -v flag to enable verbose output
  • Set smaller batch sizes in your config.yaml
  • Ensure the LLM model supports proper JSON output
  • Install json5 for enhanced JSON parsing: pip install json5
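
A common failure mode is the LLM emitting trailing commas, which strict JSON rejects but json5 accepts. A small stdlib-only sketch of that fallback idea (the toolkit's actual recovery logic is an assumption; the regex is a rough heuristic and can misfire on strings containing ",}" themselves):

```python
import json
import re

def parse_loose_json(text: str):
    """Try strict JSON first, then retry after stripping trailing commas."""
    try:
        return json.loads(text)
    except json.JSONDecodeError:
        # Remove commas that directly precede a closing bracket/brace
        repaired = re.sub(r",\s*([\]}])", r"\1", text)
        return json.loads(repaired)

pairs = parse_loose_json('[{"question": "Q?", "answer": "A.",}]')
```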

Parser Errors

  • Ensure required dependencies are installed for specific parsers:
    • PDF: pip install pdfminer.six
    • HTML: pip install beautifulsoup4
    • YouTube: pip install pytubefix youtube-transcript-api
    • DOCX: pip install python-docx
    • PPTX: pip install python-pptx

License

Read more about the License

Contributing

Contributions are welcome! Read our contributing guide
