LLM Dataset Builder

A Rust application that automatically generates high-quality question-answer pairs from documentation, making it well suited for training Large Language Models (LLMs). The application uses Ollama with the Qwen v2.5 14B model to process various data sources and create targeted questions based on content length and complexity.

Features

Smart Question Generation

  • Automatically calculates the optimal number of questions based on content length
  • Base target: 1 question per 10 words of content
  • Adds 25% extra questions (minimum 2) to ensure quality coverage
  • Example:
    100 words → 10 base questions + 3 extra = 13 questions
    20 words → 2 base questions + 2 extra = 4 questions
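
A minimal sketch of this calculation in Rust (the function and variable names are illustrative, not the project's actual API; the 80% minimum comes from the Processing Logic section below):

fn question_targets(word_count: usize) -> (usize, usize, usize) {
    // Base goal: 1 question per 10 words, rounded up.
    let base = (word_count + 9) / 10;
    // Extra questions: 25% of the base goal, rounded up, minimum 2.
    let extra = ((base as f64 * 0.25).ceil() as usize).max(2);
    // Minimum acceptable: 80% of the base goal.
    let minimum = (base as f64 * 0.8).ceil() as usize;
    (base, base + extra, minimum)
}

fn main() {
    // Matches the examples above: 100 words -> 13 questions, 20 words -> 4.
    assert_eq!(question_targets(100), (10, 13, 8));
    assert_eq!(question_targets(20), (2, 4, 2));
}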
    

Recursive Content Processing

If the initial question generation doesn't meet the target:

  1. First attempts to process the entire section
  2. If insufficient questions, splits content by headings
  3. If still insufficient, splits content by paragraphs
  4. Each subsection gets a proportional number of questions based on its word count (see the allocation sketch under Processing Logic below)

Intelligent File Handling

  • Outputs in JSONL format (one JSON object per line)
  • Checks for existing question files before processing
  • Converts older JSON files to JSONL format automatically
  • Skips processing if sufficient questions already exist
  • Maintains quality by ensuring minimum question thresholds
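
The JSON-to-JSONL conversion step could look roughly like this sketch (assuming serde_json as a dependency; the file names and error handling are illustrative):

use std::io::Write;

fn json_to_jsonl(json_path: &str, jsonl_path: &str) -> Result<(), Box<dyn std::error::Error>> {
    // An older-style file holds a single JSON array of QA objects.
    let text = std::fs::read_to_string(json_path)?;
    let items: Vec<serde_json::Value> = serde_json::from_str(&text)?;
    let mut out = std::fs::File::create(jsonl_path)?;
    for item in &items {
        // JSONL: one compact JSON object per line.
        writeln!(out, "{}", serde_json::to_string(item)?)?;
    }
    Ok(())
}

fn main() -> Result<(), Box<dyn std::error::Error>> {
    json_to_jsonl("output/docs_qa.json", "output/docs_qa.jsonl")
}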

Multiple Data Source Support

  • Local files
  • URLs (web pages)
  • GitHub repositories
  • GitHub release notes
  • Handles both Markdown and plain text content
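
A hypothetical sketch of how these source types might be told apart (the enum and matching rules below are illustrative, not the actual code):

enum DataSource {
    LocalFile(std::path::PathBuf),
    Url(String),
    GitHubRepo(String),
    GitHubReleases(String),
}

fn classify(input: &str) -> DataSource {
    if input.starts_with("https://github.com/") && input.ends_with("/releases") {
        DataSource::GitHubReleases(input.to_string())
    } else if input.starts_with("https://github.com/") {
        DataSource::GitHubRepo(input.to_string())
    } else if input.starts_with("http://") || input.starts_with("https://") {
        DataSource::Url(input.to_string())
    } else {
        DataSource::LocalFile(input.into())
    }
}

fn main() {
    match classify("https://github.com/user/repo/releases") {
        DataSource::GitHubReleases(url) => println!("releases source: {url}"),
        _ => println!("other source"),
    }
}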

Installation

Option 1: Download Pre-built Binary (Recommended)

  1. Go to the Releases page
  2. Download the latest binary for your platform:
    • llm_dataset_builder-macos for macOS
    • llm_dataset_builder-linux for Linux
  3. Make the binary executable:
    chmod +x llm_dataset_builder-*

Option 2: Build from Source

If you want to build from source:

  1. Ensure you have Rust installed
  2. Clone this repository
  3. Build the project:
    cargo build --release

Option 3: Development Setup

For development, we provide a comprehensive Makefile to manage the project:

  1. Prerequisites:

    • Install uv for Python package management:
      curl -LsSf https://astral.sh/uv/install.sh | sh
  2. Set up the development environment:

    make setup

    This will:

    • Create a Python virtual environment using uv
    • Install pre-commit hooks
    • Install required Rust components
    • Configure git hooks
  3. Common development commands:

    make build          # Build debug version
    make release        # Build release version
    make test           # Run all tests
    make lint           # Run formatting and clippy checks
    make check          # Run all checks (format, lint, test)
    make run            # Run the application
    make doc            # Generate documentation
  4. Additional commands:

    make help           # Show all available commands
    make fmt-fix        # Fix code formatting
    make test-coverage  # Generate test coverage report
    make dist           # Create release artifacts

The pre-commit hooks will run these checks before each commit:

  • Trailing whitespace removal
  • End of file fixing
  • YAML validation
  • Large file checks
  • Rust formatting
  • Cargo check
  • Clippy lints

Configuration

The application can be configured using environment variables or command line arguments. Command line arguments take precedence over environment variables.

Environment Variables

Copy the .env.example file to .env and customize the values:

cp .env.example .env

Available environment variables:

  • OLLAMA_ENDPOINT: Ollama API endpoint (default: "http://localhost:11434")
  • OLLAMA_MODEL: Ollama model to use (default: "m/qwen2514bmax")
  • OUTPUT_DIR: Output directory for collected data (default: "output")

Command Line Arguments

Command line arguments override environment variables:

cargo run -- -e http://localhost:11434 -m m/qwen2514bmax -d output

Options:

  • -e, --ollama-endpoint: Ollama API endpoint
  • -m, --model: Ollama model to use
  • -d, --output-dir: Output directory for collected data
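
One common way to implement this precedence in Rust is clap's derive API with its env feature, which resolves each value as CLI flag first, then environment variable, then default. The following is a sketch mirroring the documented options, not necessarily the project's actual implementation:

use clap::Parser; // clap = { version = "4", features = ["derive", "env"] }

#[derive(Parser)]
struct Args {
    /// Ollama API endpoint
    #[arg(short = 'e', long = "ollama-endpoint", env = "OLLAMA_ENDPOINT", default_value = "http://localhost:11434")]
    ollama_endpoint: String,

    /// Ollama model to use
    #[arg(short = 'm', long = "model", env = "OLLAMA_MODEL", default_value = "m/qwen2514bmax")]
    model: String,

    /// Output directory for collected data
    #[arg(short = 'd', long = "output-dir", env = "OUTPUT_DIR", default_value = "output")]
    output_dir: String,
}

fn main() {
    let args = Args::parse();
    println!("{} {} {}", args.ollama_endpoint, args.model, args.output_dir);
}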

Usage

Prerequisites

  • Ollama installed and running locally
  • Qwen v2.5 14B model installed:
    ollama pull m/qwen2514bmax

Running

  1. Start your Ollama server
  2. Run the application:
    ./llm_dataset_builder-macos  # or ./llm_dataset_builder-linux
    Or if built from source:
    cargo run
  3. Enter data sources when prompted:
    Enter a data source (press Enter to finish):
    - URL (e.g., https://example.com/file.txt)
    - Local path (e.g., /path/to/file)
    - GitHub URL (e.g., https://github.com/user/repo/tree/branch/path)
    - GitHub releases URL (e.g., https://github.com/user/repo/releases)
    

Output Format

Questions are saved in JSONL format:

{"question":"What is the main purpose of this application?","answer":"The application automatically generates question-answer pairs from documentation for training LLMs."}
{"question":"How does it calculate the base number of questions?","answer":"It generates one question for every 10 words of content, rounded up."}

Processing Logic

  1. Content Analysis

    • Counts total words in content
    • Calculates base questions (words / 10, rounded up)
    • Adds 25% extra questions (minimum 2)
    • Sets the minimum acceptable count at 80% of the base goal
  2. Question Generation

    Section (100 words):
    Base goal: 10 questions
    Extra questions: max(ceil(10 * 0.25), 2) = 3
    Generation target: 13 questions
    Minimum acceptable: 8 questions
    
  3. Recursive Processing

    If initial generation falls short:
    1. Try whole section first
    2. If not enough questions:
       Split into heading sections
       Each section target = total_target * (section_words / total_words)
    3. If still not enough:
       Split into paragraphs
       Each paragraph target = total_target * (paragraph_words / total_words)
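
The proportional allocation can be sketched as follows. The exact rounding scheme is an assumption: flooring each share and then handing out the remainder by largest fractional part keeps the per-section targets summing to the total target, and happens to reproduce the numbers in the Example Output below.

fn allocate_targets(total_target: usize, section_words: &[usize]) -> Vec<usize> {
    let total_words: usize = section_words.iter().sum();
    // Ideal fractional share per section: total_target * (words / total_words).
    let shares: Vec<f64> = section_words
        .iter()
        .map(|&w| total_target as f64 * w as f64 / total_words as f64)
        .collect();
    let mut targets: Vec<usize> = shares.iter().map(|s| s.floor() as usize).collect();
    // Hand out whatever flooring dropped, largest fractional part first.
    let mut leftover = total_target - targets.iter().sum::<usize>();
    let mut order: Vec<usize> = (0..shares.len()).collect();
    order.sort_by(|&a, &b| {
        let (fa, fb) = (shares[a] - shares[a].floor(), shares[b] - shares[b].floor());
        fb.partial_cmp(&fa).unwrap()
    });
    for &i in &order {
        if leftover == 0 {
            break;
        }
        targets[i] += 1;
        leftover -= 1;
    }
    targets
}

fn main() {
    // 1000-word file, target 125 questions, sections of 400/500/100 words.
    assert_eq!(allocate_targets(125, &[400, 500, 100]), vec![50, 63, 12]);
}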
    

Example Output

For a documentation file with 1000 words:

Processing file: docs.md
Total words: 1000
Base goal: 100 questions
Generation target: 125 questions (+25 extra)
Minimum acceptable: 80 questions

Processing section 1/3 (400 words, target 50 questions)
Got 45 questions from full section
Splitting section by headings...
- Heading 1 (250 words): 31 questions
- Heading 2 (150 words): 19 questions
Total: 50 questions

Processing section 2/3 (500 words, target 63 questions)
Got 63 questions from full section

Processing section 3/3 (100 words, target 12 questions)
Got 10 questions from full section
Splitting section by paragraphs...
- Paragraph 1: 7 questions
- Paragraph 2: 5 questions
Total: 12 questions

Final result: 125 questions generated
Saved to: output/docs_qa.jsonl

An example dataset generated with this tool: https://huggingface.co/datasets/technovangelist/OllamaDocs

Contributing

Contributions are welcome! Please feel free to submit a Pull Request. All PRs are automatically tested with:

  • Unit tests
  • Integration tests
  • Clippy lints
  • Code formatting checks

License

This project is licensed under the MIT License - see the LICENSE file for details.
