DeepSeek API Dataset Generator

This project is a Python-based tool for generating fine-tuning datasets using the DeepSeek API. It creates single-turn Q&A pairs or multi-turn conversations in various styles, based on customizable topics and categories. It also includes a dataset validator to ensure quality before fine-tuning.

🚀 Features

Generate realistic Q&A examples with customizable styles
Support for single or multi-turn conversations
Style options: helpful, corporate, casual, technical, creative, educational
Validate .jsonl datasets before fine-tuning
Topic categories covering technology, business, education, lifestyle, and more
Rate limiting support and environment configuration via .env
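The README does not describe how rate limiting is implemented. As a rough illustration only, one common approach is to enforce a minimum delay between consecutive API calls. The class name `MinIntervalLimiter` and its parameters are hypothetical and are not taken from generator.py.

```python
import time


class MinIntervalLimiter:
    """Enforce a minimum delay between calls (hypothetical helper, not from generator.py)."""

    def __init__(self, min_interval, clock=time.monotonic, sleep=time.sleep):
        self.min_interval = min_interval  # seconds required between two calls
        self._clock = clock               # injectable so the limiter can be tested
        self._sleep = sleep
        self._last_call = None

    def wait(self):
        """Block until the next call is allowed; return the seconds slept."""
        waited = 0.0
        if self._last_call is not None:
            remaining = self.min_interval - (self._clock() - self._last_call)
            if remaining > 0:
                self._sleep(remaining)
                waited = remaining
        self._last_call = self._clock()
        return waited
```

Calling `limiter.wait()` immediately before each API request keeps requests at least `min_interval` seconds apart.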

🧰 Requirements

Python 3.7+
DeepSeek API Key
Dependencies:

bash

pip install -r requirements.txt

🔐 Setup

Clone the repository:

bash

git clone https://github.com/luisriverag/deepseek-api_dataset-generator.git
cd deepseek-api_dataset-generator
Create a .env file:

bash

python3 generator.py --create-env
Edit .env and add your DeepSeek API key:

env

DEEPSEEK_API_KEY=your_deepseek_api_key_here
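How generator.py reads this file is not shown here. The following is a minimal standard-library sketch of loading the key, assuming the file contains simple `KEY=value` lines. The helper names `load_env_file` and `get_api_key` are hypothetical.

```python
import os
from pathlib import Path


def load_env_file(path=".env"):
    """Parse simple KEY=value lines into a dict (hypothetical helper)."""
    values = {}
    env_path = Path(path)
    if not env_path.exists():
        return values
    for raw_line in env_path.read_text(encoding="utf-8").splitlines():
        line = raw_line.strip()
        # Skip blank lines, comments, and lines without an '=' sign.
        if not line or line.startswith("#") or "=" not in line:
            continue
        key, _, value = line.partition("=")
        values[key.strip()] = value.strip().strip('"').strip("'")
    return values


def get_api_key(path=".env"):
    """Prefer the process environment, then fall back to the .env file."""
    key = os.environ.get("DEEPSEEK_API_KEY") or load_env_file(path).get("DEEPSEEK_API_KEY")
    # Reject the placeholder value so a forgotten edit fails early.
    if not key or key == "your_deepseek_api_key_here":
        raise RuntimeError("DEEPSEEK_API_KEY is missing or still set to the placeholder")
    return key
```

Failing early on the placeholder value avoids sending requests with an unusable key.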

🛠️ Usage

Generate 50 examples (default):

python3 generator.py

Customize output:

python3 generator.py \
  --output my_dataset.jsonl \
  --count 100 \
  --style technical \
  --categories technology education \
  --conversation-turns 2

Validate an existing dataset:

python3 generator.py --validate-only --output my_dataset.jsonl
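For readers who want to see how flags like these map onto a parser, here is a sketch using `argparse`. The flag names, style list, and category list come from this README, and the default count of 50 is stated above. Every other default (output path, style, number of turns) is an assumption, and generator.py may define its options differently.

```python
import argparse

STYLES = ["helpful", "corporate", "casual", "technical", "creative", "educational"]
CATEGORIES = ["technology", "business", "education", "lifestyle", "creative", "science", "all"]


def build_parser():
    """Build a parser mirroring the documented flags (illustrative, not the real one)."""
    parser = argparse.ArgumentParser(description="Generate or validate a fine-tuning dataset")
    parser.add_argument("--output", default="dataset.jsonl")           # assumed default path
    parser.add_argument("--count", type=int, default=50)               # default stated in the README
    parser.add_argument("--style", choices=STYLES, default="helpful")  # assumed default style
    parser.add_argument("--categories", nargs="+", choices=CATEGORIES, default=["all"])
    parser.add_argument("--conversation-turns", type=int, default=1)   # assumed default
    parser.add_argument("--validate-only", action="store_true")
    parser.add_argument("--create-env", action="store_true")
    return parser
```

With this sketch, `build_parser().parse_args(["--count", "100", "--categories", "technology", "education"])` yields `count=100` and a two-item `categories` list.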

📚 Categories

Available topic categories:

technology
business
education
lifestyle
creative
science
all (default)

📄 Output Format

Each line in the output .jsonl file follows the format:

json

{"messages": [{"role": "user", "content": "What is quantum computing?"}, {"role": "assistant", "content": "Quantum computing uses principles of quantum mechanics to perform computations..."}]}
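To make the format concrete, the sketch below builds records of this shape and writes and reads them one JSON object per line. The function names are hypothetical and the code is not taken from generator.py.

```python
import json


def make_record(question, answer, system_prompt=None):
    """Build one single-turn record in the documented "messages" format."""
    messages = []
    if system_prompt:
        messages.append({"role": "system", "content": system_prompt})
    messages.append({"role": "user", "content": question})
    messages.append({"role": "assistant", "content": answer})
    return {"messages": messages}


def write_jsonl(records, path):
    """Write each record as one JSON object per line."""
    with open(path, "w", encoding="utf-8") as handle:
        for record in records:
            handle.write(json.dumps(record, ensure_ascii=False) + "\n")


def read_jsonl(path):
    """Read records back, skipping blank lines."""
    with open(path, encoding="utf-8") as handle:
        return [json.loads(line) for line in handle if line.strip()]
```

Because `json.dumps` escapes newlines inside strings, each record stays on a single line even when the content spans several paragraphs.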

✅ Validation

The validator checks and reports:

Proper JSON structure
Valid roles: user, assistant, system
Token estimation
Warnings for long/empty content
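The checks listed above could be sketched roughly as follows. This is not the project's validator: the four-characters-per-token estimate and the `MAX_CHARS` threshold are assumptions chosen for illustration, and the function names are hypothetical.

```python
import json

VALID_ROLES = {"user", "assistant", "system"}
MAX_CHARS = 8000  # assumed threshold for a "long content" warning


def estimate_tokens(text):
    """Rough heuristic: about four characters per token (an assumption)."""
    return max(1, len(text) // 4) if text else 0


def validate_line(line):
    """Return (issues, estimated_tokens) for one line of a .jsonl dataset."""
    try:
        record = json.loads(line)
    except json.JSONDecodeError as exc:
        return [f"invalid JSON: {exc.msg}"], 0
    messages = record.get("messages") if isinstance(record, dict) else None
    if not isinstance(messages, list) or not messages:
        return ["missing or empty 'messages' list"], 0
    issues, tokens = [], 0
    for index, message in enumerate(messages):
        if not isinstance(message, dict):
            issues.append(f"message {index}: not an object")
            continue
        role, content = message.get("role"), message.get("content")
        if not isinstance(role, str) or role not in VALID_ROLES:
            issues.append(f"message {index}: invalid role {role!r}")
        if not isinstance(content, str) or not content.strip():
            issues.append(f"message {index}: empty content")
            continue
        if len(content) > MAX_CHARS:
            issues.append(f"message {index}: long content ({len(content)} chars)")
        tokens += estimate_tokens(content)
    return issues, tokens
```

Running `validate_line` over each line of a file and summing the token estimates gives a quick size check before fine-tuning.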

🤖 Next Steps

Review your dataset.
Use it for fine-tuning a model that supports DeepSeek-style training.
Monitor performance and adjust generation parameters as needed.

📄 License

MIT License. See LICENSE for details.
