DeepSeek API Dataset Generator

This project is a Python-based tool for generating fine-tuning datasets using the DeepSeek API. It creates single-turn Q&A pairs or multi-turn conversations in various styles, based on customizable topics and categories. It also includes a dataset validator to ensure quality before fine-tuning.


🚀 Features

  • Generate realistic Q&A examples with customizable styles

  • Support for single or multi-turn conversations

  • Style options: helpful, corporate, casual, technical, creative, educational

  • Validate .jsonl datasets before fine-tuning

  • Topic categories covering technology, business, education, lifestyle, and more

  • Rate limiting support and environment configuration via .env


🧰 Requirements

  • Python 3.7+

  • DeepSeek API Key

  • Dependencies:

    bash

    pip install -r requirements.txt


🔐 Setup

  1. Clone the repository:

    bash

    git clone https://github.com/luisriverag/deepseek-api_dataset-generator.git
    cd deepseek-api_dataset-generator

  2. Create a .env file:

    bash

    python3 generator.py --create-env

  3. Edit .env and add your DeepSeek API key:

    env

    DEEPSEEK_API_KEY=your_deepseek_api_key_here
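
For orientation, here is a minimal sketch of how a script could pick up the key from the `.env` file using only the standard library. The helper names `load_env` and `get_api_key` are hypothetical, and the project itself may use a different mechanism (for example the python-dotenv package). Only the variable name `DEEPSEEK_API_KEY` comes from the setup steps above.

```python
import os
from pathlib import Path


def load_env(path=".env"):
    """Parse simple KEY=VALUE lines from a .env file into a dict.

    Hypothetical helper for illustration; not taken from generator.py.
    """
    values = {}
    env_file = Path(path)
    if not env_file.exists():
        return values
    for raw in env_file.read_text(encoding="utf-8").splitlines():
        line = raw.strip()
        # Skip blank lines, comments, and lines without an assignment.
        if not line or line.startswith("#") or "=" not in line:
            continue
        key, _, value = line.partition("=")
        # Drop surrounding whitespace and optional quotes around the value.
        values[key.strip()] = value.strip().strip('"').strip("'")
    return values


def get_api_key(path=".env"):
    """Prefer a real environment variable, then fall back to the .env file."""
    key = os.environ.get("DEEPSEEK_API_KEY") or load_env(path).get("DEEPSEEK_API_KEY")
    if not key:
        raise RuntimeError("DEEPSEEK_API_KEY is not set")
    return key
```

Checking the process environment first means a key exported in the shell overrides the file, which is convenient in CI where no `.env` file exists.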


🛠️ Usage

Generate 50 examples (default):

python3 generator.py

Customize output:

python3 generator.py \
  --output my_dataset.jsonl \
  --count 100 \
  --style technical \
  --categories technology education \
  --conversation-turns 2

Validate an existing dataset:

python3 generator.py --validate-only --output my_dataset.jsonl
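
To make the flag semantics above concrete, the following is a sketch of an `argparse` parser that accepts the same options. It is not the project's actual parser: `build_parser` is a hypothetical name, the default of 1 for `--conversation-turns` is an assumption, and `--output` and `--style` are left without defaults because this README does not state them. The default count of 50 and the default category `all` are taken from this README.

```python
import argparse

STYLES = ["helpful", "corporate", "casual", "technical", "creative", "educational"]
CATEGORIES = ["technology", "business", "education", "lifestyle",
              "creative", "science", "all"]


def build_parser():
    """Hypothetical parser mirroring the flags shown in this README."""
    parser = argparse.ArgumentParser(prog="generator.py")
    parser.add_argument("--output", help="path of the .jsonl file to write or validate")
    parser.add_argument("--count", type=int, default=50, help="number of examples")
    parser.add_argument("--style", choices=STYLES, help="response style")
    # nargs="+" lets several categories follow a single --categories flag.
    parser.add_argument("--categories", nargs="+", choices=CATEGORIES, default=["all"])
    # Default of 1 (single-turn) is an assumption, not documented here.
    parser.add_argument("--conversation-turns", type=int, default=1)
    parser.add_argument("--validate-only", action="store_true")
    parser.add_argument("--create-env", action="store_true")
    return parser


if __name__ == "__main__":
    print(build_parser().parse_args())
```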


📚 Categories

Available topic categories:

  • technology

  • business

  • education

  • lifestyle

  • creative

  • science

  • all (default)


📄 Output Format

Each line in the output .jsonl file is a single JSON object of the following form (shown here across several lines for readability; in the file it occupies one line):

json

{
  "messages": [
    {"role": "user", "content": "What is quantum computing?"},
    {"role": "assistant", "content": "Quantum computing uses principles of quantum mechanics to perform computations..."}
  ]
}
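
As an illustration of this format, the sketch below serializes one single-turn example to a JSONL line and reads records back from a file. The function names `to_jsonl_line` and `read_jsonl` are hypothetical and are not part of the project; only the `messages` / `role` / `content` layout comes from the example above.

```python
import json


def to_jsonl_line(question, answer):
    """Serialize one single-turn example as a single JSONL line."""
    record = {
        "messages": [
            {"role": "user", "content": question},
            {"role": "assistant", "content": answer},
        ]
    }
    # json.dumps emits no newlines by default, so the record stays on one line.
    # ensure_ascii=False keeps non-ASCII text readable in the file.
    return json.dumps(record, ensure_ascii=False)


def read_jsonl(path):
    """Yield one parsed record per non-empty line of a .jsonl file."""
    with open(path, encoding="utf-8") as handle:
        for line in handle:
            if line.strip():
                yield json.loads(line)
```

A multi-turn conversation would use the same structure with more alternating `user` and `assistant` entries in the `messages` list.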


✅ Validation

The validator checks and reports:

  • Proper JSON structure

  • Valid roles: user, assistant, system

  • Estimated token counts

  • Warnings for long or empty content
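
The checks listed above can be sketched roughly as follows. This is an illustrative stand-in, not the project's validator: the names `validate_line` and `estimate_tokens`, the four-characters-per-token heuristic, and the `MAX_TOKENS` threshold are all assumptions made for the example.

```python
import json

VALID_ROLES = {"user", "assistant", "system"}
MAX_TOKENS = 4096  # assumed warning threshold, not taken from the project


def estimate_tokens(text):
    """Very rough estimate: about four characters per token (an assumption)."""
    return max(1, len(text) // 4) if text else 0


def validate_line(line, line_no):
    """Return (errors, warnings, token_estimate) for one JSONL line."""
    errors, warnings, tokens = [], [], 0
    try:
        record = json.loads(line)
    except json.JSONDecodeError as exc:
        return [f"line {line_no}: invalid JSON ({exc.msg})"], warnings, tokens
    messages = record.get("messages") if isinstance(record, dict) else None
    if not isinstance(messages, list) or not messages:
        return [f"line {line_no}: missing or empty 'messages' list"], warnings, tokens
    for i, msg in enumerate(messages):
        if not isinstance(msg, dict):
            errors.append(f"line {line_no}: message {i} is not an object")
            continue
        if msg.get("role") not in VALID_ROLES:
            errors.append(f"line {line_no}: message {i} has invalid role {msg.get('role')!r}")
        content = msg.get("content")
        if not isinstance(content, str) or not content.strip():
            warnings.append(f"line {line_no}: message {i} has empty content")
            continue
        tokens += estimate_tokens(content)
    if tokens > MAX_TOKENS:
        warnings.append(f"line {line_no}: about {tokens} tokens, above {MAX_TOKENS}")
    return errors, warnings, tokens
```

Separating errors (structural problems that make a record unusable) from warnings (records that parse but look suspicious) lets a caller decide whether to reject the file or simply review the flagged lines.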


🤖 Next Steps

  1. Review your dataset.

  2. Use it to fine-tune a model whose training pipeline accepts chat-style JSONL records (a messages list of role/content pairs).

  3. Monitor performance and adjust generation parameters as needed.


📄 License

MIT License. See LICENSE for details.
