A Rust application that automatically generates high-quality question-answer pairs from documentation, making it well suited for building training data for Large Language Models (LLMs). The application uses Ollama with the Qwen 2.5 14B model to process various data sources and create targeted questions based on content length and complexity.
- Automatically calculates the optimal number of questions based on content length
- Base target: 1 question per 10 words of content
- Adds 25% extra questions (minimum 2) to ensure quality coverage
- Example:
  - 100 words → 10 base questions + 3 extra = 13 questions
  - 20 words → 2 base questions + 2 extra = 4 questions
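The rule above reduces to a small calculation; here is a minimal Rust sketch (the function name and exact rounding are illustrative, not the project's actual API):

```rust
// Minimal sketch of the question-count rule: 1 question per 10 words,
// plus 25% extra (at least 2). Names and rounding are illustrative.
fn question_target(word_count: usize) -> (usize, usize, usize) {
    let base = (word_count as f64 / 10.0).ceil() as usize;     // 1 question per 10 words
    let extra = ((base as f64 * 0.25).ceil() as usize).max(2); // +25%, minimum 2
    (base, extra, base + extra)
}

fn main() {
    assert_eq!(question_target(100), (10, 3, 13)); // 100 words → 13 questions
    assert_eq!(question_target(20), (2, 2, 4));    // 20 words  → 4 questions
}
```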
If the initial question generation doesn't meet the target:
- First attempts to process the entire section
- If insufficient questions, splits content by headings
- If still insufficient, splits content by paragraphs
- Each subsection gets a proportional number of questions based on its word count
- Outputs in JSONL format (one JSON object per line)
- Checks for existing question files before processing
- Converts older JSON files to JSONL format automatically
- Skips processing if sufficient questions already exist
- Maintains quality by ensuring minimum question thresholds
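Producing the JSONL output itself is straightforward with serde; here is a minimal sketch assuming serde and serde_json are available (the `QaPair` struct and `write_jsonl` helper are illustrative names, not the project's actual API):

```rust
// Minimal sketch of JSONL output: one JSON object per line.
use serde::Serialize;
use std::fs::File;
use std::io::{BufWriter, Write};

#[derive(Serialize)]
struct QaPair {
    question: String,
    answer: String,
}

fn write_jsonl(path: &str, pairs: &[QaPair]) -> std::io::Result<()> {
    let mut out = BufWriter::new(File::create(path)?);
    for pair in pairs {
        // Serializing each pair to a single line is what makes the file JSONL.
        let line = serde_json::to_string(pair).expect("serialization should not fail");
        writeln!(out, "{line}")?;
    }
    Ok(())
}
```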
- Local files
- URLs (web pages)
- GitHub repositories
- GitHub release notes
- Handles both Markdown and plain text content
- Go to the Releases page
- Download the latest binary for your platform:
  - llm_dataset_builder-macos for macOS
  - llm_dataset_builder-linux for Linux
- Make the binary executable:
chmod +x llm_dataset_builder-*
If you want to build from source:
- Ensure you have Rust installed
- Clone this repository
- Build the project:
cargo build --release
For development, we provide a comprehensive Makefile to manage the project:
- Prerequisites:
- Install uv for Python package management:
curl -LsSf https://astral.sh/uv/install.sh | sh
- Set up the development environment:
make setup
This will:
- Create a Python virtual environment using uv
- Install pre-commit hooks
- Install required Rust components
- Configure git hooks
- Common development commands:
make build          # Build debug version
make release        # Build release version
make test           # Run all tests
make lint           # Run formatting and clippy checks
make check          # Run all checks (format, lint, test)
make run            # Run the application
make doc            # Generate documentation
- Additional commands:
make help           # Show all available commands
make fmt-fix        # Fix code formatting
make test-coverage  # Generate test coverage report
make dist           # Create release artifacts
The pre-commit hooks will run these checks before each commit:
- Trailing whitespace removal
- End of file fixing
- YAML validation
- Large file checks
- Rust formatting
- Cargo check
- Clippy lints
The application can be configured using environment variables or command line arguments. Command line arguments take precedence over environment variables.
Copy the .env.example file to .env and customize the values:
cp .env.example .env
Available environment variables:
- OLLAMA_ENDPOINT: Ollama API endpoint (default: "http://localhost:11434")
- OLLAMA_MODEL: Ollama model to use (default: "m/qwen2514bmax")
- OUTPUT_DIR: Output directory for collected data (default: "output")
Command line arguments override environment variables:
cargo run -- -e http://localhost:11434 -m m/qwen2514bmax -d output
Options:
- -e, --ollama-endpoint: Ollama API endpoint
- -m, --model: Ollama model to use
- -d, --output-dir: Output directory for collected data
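For reference, options like these could be declared with clap's derive API roughly as follows. This is only a sketch, assuming clap 4 with the `derive` feature; the struct and field names are illustrative rather than the project's actual code:

```rust
// Illustrative sketch of the documented flags using clap's derive API.
use clap::Parser;

#[derive(Parser, Debug)]
struct Args {
    /// Ollama API endpoint
    #[arg(short = 'e', long = "ollama-endpoint", default_value = "http://localhost:11434")]
    ollama_endpoint: String,

    /// Ollama model to use
    #[arg(short = 'm', long = "model", default_value = "m/qwen2514bmax")]
    model: String,

    /// Output directory for collected data
    #[arg(short = 'd', long = "output-dir", default_value = "output")]
    output_dir: String,
}

fn main() {
    let args = Args::parse();
    println!("{args:?}");
}
```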
- Ollama installed and running locally
- Qwen v2.5 14B model installed:
ollama pull m/qwen2514bmax
- Start your Ollama server
- Run the application:
  ./llm_dataset_builder-macos # or ./llm_dataset_builder-linux
  Or, if built from source:
  cargo run
- Enter data sources when prompted:
Enter a data source (press Enter to finish):
- URL (e.g., https://example.com/file.txt)
- Local path (e.g., /path/to/file)
- GitHub URL (e.g., https://github.com/user/repo/tree/branch/path)
- GitHub releases URL (e.g., https://github.com/user/repo/releases)
Questions are saved in JSONL format:
{"question":"What is the main purpose of this application?","answer":"The application automatically generates question-answer pairs from documentation for training LLMs."}
{"question":"How does it calculate the base number of questions?","answer":"It generates one question for every 10 words of content, rounded up."}
- Content Analysis:
- Counts total words in content
- Calculates base questions (words/10)
- Adds 25% extra questions (min 2)
- Sets minimum acceptable at 80% of base goal
- Question Generation:
Section (100 words):
  Base goal: 10 questions
  Extra questions: max(ceil(10 * 0.25), 2) = 3
  Generation target: 13 questions
  Minimum acceptable: 8 questions
- Recursive Processing: if the initial generation falls short, the tool proceeds as follows (a sketch of the proportional split follows the steps below):
1. Try the whole section first
2. If not enough questions: split into heading sections
   Each section target = total_target * (section_words / total_words)
3. If still not enough: split into paragraphs
   Each paragraph target = total_target * (paragraph_words / total_words)
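The proportional targets in steps 2 and 3 reduce to a single formula. A minimal Rust sketch (the function name is illustrative, and the exact rounding used for fractional targets such as 62.5 is an assumption):

```rust
// Illustrative sketch of the proportional split from steps 2 and 3;
// the rounding of fractional targets is an assumption, not the project's code.
fn subsection_target(total_target: usize, subsection_words: usize, total_words: usize) -> usize {
    if total_words == 0 {
        return 0;
    }
    (total_target as f64 * subsection_words as f64 / total_words as f64).round() as usize
}

fn main() {
    // For a 1000-word file with a 125-question target (as in the example run below),
    // a 400-word section gets 125 * 400 / 1000 = 50 questions.
    assert_eq!(subsection_target(125, 400, 1000), 50);
}
```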
For a documentation file with 1000 words:
Processing file: docs.md
Total words: 1000
Base goal: 100 questions
Generation target: 125 questions (+25 extra)
Minimum acceptable: 80 questions
Processing section 1/3 (400 words, target 50 questions)
Got 45 questions from full section
Splitting section by headings...
- Heading 1 (250 words): 31 questions
- Heading 2 (150 words): 19 questions
Total: 50 questions
Processing section 2/3 (500 words, target 63 questions)
Got 63 questions from full section
Processing section 3/3 (100 words, target 12 questions)
Got 10 questions from full section
Splitting section by paragraphs...
- Paragraph 1: 7 questions
- Paragraph 2: 5 questions
Total: 12 questions
Final result: 125 questions generated
Saved to: output/docs_qa.jsonl
Example dataset generated here: https://huggingface.co/datasets/technovangelist/OllamaDocs
Contributions are welcome! Please feel free to submit a Pull Request. All PRs are automatically tested with:
- Unit tests
- Integration tests
- Clippy lints
- Code formatting checks
This project is licensed under the MIT License - see the LICENSE file for details.