This project is a Python-based tool for generating fine-tuning datasets using the DeepSeek API. It creates single-turn Q&A pairs or multi-turn conversations in various styles, based on customizable topics and categories. It also includes a dataset validator to ensure quality before fine-tuning.
- Generate realistic Q&A examples with customizable styles
- Support for single or multi-turn conversations
- Style options: `helpful`, `corporate`, `casual`, `technical`, `creative`, `educational`
- Validate `.jsonl` datasets before fine-tuning
- Topic categories covering technology, business, education, lifestyle, and more
- Rate limiting support and environment configuration via `.env`
- Python 3.7+
- Dependencies: `pip install -r requirements.txt`
- Clone the repository:

      git clone https://github.com/luisriverag/deepseek-api_dataset-generator.git
      cd deepseek-api_dataset-generator

- Create a `.env` file:

      python3 generator.py --create-env

- Edit `.env` and add your DeepSeek API key:

      DEEPSEEK_API_KEY=your_deepseek_api_key_here
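To illustrate how a script can pick up the key configured above, here is a minimal sketch that reads `DEEPSEEK_API_KEY` from the process environment and falls back to a simple `KEY=VALUE` parse of `.env`. The helper names `load_env_file` and `get_api_key` are hypothetical, and this is not necessarily how `generator.py` loads its configuration.

```python
import os
from pathlib import Path


def load_env_file(path=".env"):
    """Parse simple KEY=VALUE lines from a .env file into a dict (hypothetical helper)."""
    values = {}
    env_path = Path(path)
    if not env_path.exists():
        return values
    for raw_line in env_path.read_text(encoding="utf-8").splitlines():
        line = raw_line.strip()
        # Skip blank lines, comments, and lines without an assignment.
        if not line or line.startswith("#") or "=" not in line:
            continue
        key, _, value = line.partition("=")
        # Drop optional surrounding quotes around the value.
        values[key.strip()] = value.strip().strip('"').strip("'")
    return values


def get_api_key(path=".env"):
    """Return the key from the environment, falling back to the .env file."""
    key = os.environ.get("DEEPSEEK_API_KEY") or load_env_file(path).get("DEEPSEEK_API_KEY")
    # Treat the template placeholder as "not configured".
    if not key or key == "your_deepseek_api_key_here":
        raise RuntimeError("DEEPSEEK_API_KEY is missing or still set to the placeholder")
    return key
```

Rejecting the placeholder value makes a forgotten edit of `.env` fail early with a clear message rather than as an authentication error from the API.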
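Generate a dataset with the default settings:

    python3 generator.py

Generate a dataset with custom options:

    python3 generator.py \
      --output my_dataset.jsonl \
      --count 100 \
      --style technical \
      --categories technology education \
      --conversation-turns 2

Validate an existing dataset without generating new examples:

    python3 generator.py --validate-only --output my_dataset.jsonl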
Available topic categories:
- `technology`
- `business`
- `education`
- `lifestyle`
- `creative`
- `science`
- `all` (default)
Each line in the output `.jsonl` file follows the format:

    {"messages": [{"role": "user", "content": "What is quantum computing?"}, {"role": "assistant", "content": "Quantum computing uses principles of quantum mechanics to perform computations..."}]}
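The record format above can be produced with the standard `json` module. The sketch below builds one record from question/answer pairs and serializes it to a single line. The helpers `make_record` and `to_jsonl_line` are hypothetical, the second exchange is illustrative only, and storing a multi-turn conversation as alternating `user`/`assistant` messages in one `messages` list is an assumption about the generator's output.

```python
import json


def make_record(turns):
    """Build one dataset record from (user_text, assistant_text) pairs."""
    messages = []
    for user_text, assistant_text in turns:
        messages.append({"role": "user", "content": user_text})
        messages.append({"role": "assistant", "content": assistant_text})
    return {"messages": messages}


def to_jsonl_line(record):
    """Serialize a record to one line; json.dumps escapes any embedded newlines."""
    return json.dumps(record, ensure_ascii=False)


# Illustrative two-turn conversation (assumed layout, not actual generator output).
record = make_record([
    ("What is quantum computing?",
     "Quantum computing uses principles of quantum mechanics to perform computations..."),
    ("How is a qubit different from a bit?",
     "A qubit can be in a superposition of 0 and 1, while a bit is always exactly one of them."),
])
line = to_jsonl_line(record)
print(line)
```

Because each record must occupy exactly one line, the file can be read back by calling `json.loads` on every non-empty line.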
The validator checks and reports:
- Proper JSON structure
- Valid roles: `user`, `assistant`, `system`
- Token estimation
- Warnings for long or empty content
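The checks listed above can be sketched for a single line of a `.jsonl` file as follows. This is an illustration, not the project's validator: `validate_line` and `estimate_tokens` are hypothetical names, and both the four-characters-per-token estimate and the `MAX_CHARS` threshold are assumptions.

```python
import json

VALID_ROLES = {"user", "assistant", "system"}
MAX_CHARS = 8000  # assumed threshold for a "long content" warning


def estimate_tokens(text):
    """Rough estimate of about four characters per token (an assumption)."""
    return max(1, len(text) // 4) if text else 0


def validate_line(line, line_no):
    """Return (errors, warnings, token_estimate) for one JSONL line."""
    errors, warnings, tokens = [], [], 0
    try:
        record = json.loads(line)
    except json.JSONDecodeError as exc:
        return [f"line {line_no}: invalid JSON ({exc.msg})"], warnings, tokens
    messages = record.get("messages") if isinstance(record, dict) else None
    if not isinstance(messages, list) or not messages:
        return [f"line {line_no}: 'messages' must be a non-empty list"], warnings, tokens
    for i, msg in enumerate(messages):
        if not isinstance(msg, dict):
            errors.append(f"line {line_no}, message {i}: must be an object")
            continue
        role, content = msg.get("role"), msg.get("content")
        if not isinstance(role, str) or role not in VALID_ROLES:
            errors.append(f"line {line_no}, message {i}: invalid role {role!r}")
        if not isinstance(content, str):
            errors.append(f"line {line_no}, message {i}: content must be a string")
            continue
        if not content.strip():
            warnings.append(f"line {line_no}, message {i}: empty content")
        elif len(content) > MAX_CHARS:
            warnings.append(f"line {line_no}, message {i}: long content ({len(content)} chars)")
        tokens += estimate_tokens(content)
    return errors, warnings, tokens
```

Separating errors from warnings lets a caller reject structurally broken records while only flagging long or empty content for manual review.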
- Review your dataset.
- Use it for fine-tuning a model that supports DeepSeek-style training.
- Monitor performance and adjust generation parameters as needed.
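As one possible aid for the review step, the sketch below summarizes a dataset by counting records, messages per role, and user turns per conversation. The function name `summarize` is hypothetical, and it assumes every non-empty line is a valid record in the `messages` format.

```python
import json
from collections import Counter


def summarize(lines):
    """Count records, roles, and user turns in an iterable of JSONL lines."""
    roles = Counter()
    user_turns = Counter()
    records = 0
    for line in lines:
        line = line.strip()
        if not line:
            continue  # tolerate blank lines between records
        messages = json.loads(line)["messages"]
        records += 1
        roles.update(m["role"] for m in messages)
        # Key: number of user messages in the record; value: how many records have it.
        user_turns[sum(1 for m in messages if m["role"] == "user")] += 1
    return {"records": records, "roles": dict(roles), "user_turns": dict(user_turns)}
```

Passing an open file handle, for example `summarize(open("my_dataset.jsonl", encoding="utf-8"))`, gives a quick view of whether the number of records and turns matches what was requested at generation time.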
MIT License. See LICENSE for details.