ioscbasotcstw/Snail

Snail

Generate a Synthetic Dataset Using the Gemini Model and the Google Search Tool

This repository provides a class, CoTDatasetGenerator, that searches for tasks, expressions, and similar items that can be enumerated, then turns them into a synthetic chain-of-thought (CoT) dataset. It combines a Gemini model with the Google Search tool, and the resulting datasets are intended for reasoning and problem-solving applications.

Overview

The main objective of this class is to help you:

  1. Search for specific tasks and expressions using Google search.
  2. Extract enumerated listings from the search results.
  3. Generate a comprehensive synthetic dataset.
  4. Create your dataset and push it to Hugging Face.

Getting Started

0. Installing

Clone the repository and install its dependencies with the commands below. Google Colab is recommended because it is easier to set up and more flexible; the leading ! and the /content path are Colab notebook conventions. Run huggingface-cli login as well, because pushing the dataset to Hugging Face (HF) requires you to be logged in.

!git clone https://github.com/ioscbasotcstw/Snail.git
!pip install -r '/content/Snail/requirements.txt'
!huggingface-cli login

1. Initialize the Class

Create the generator by passing the required parameters. One key parameter is role, which sets the persona used in the system instructions (such as system_instruction_google_search) so that the search results are more relevant.

from Snail.snail.cot_dsgen import CoTDatasetGenerator

google_api = "YOUR_GOOGLE_API_KEY"  # placeholder: replace with your own key

snail = CoTDatasetGenerator(
    google_api_key=google_api,
    model_id='gemini-2.0-flash-thinking-exp-01-21',
    user_query="List 20 math problems from easiest to hardest and number them",
    role="mathematician",
)
result = snail.searching()
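
The example above passes the key through a google_api variable. As a minimal sketch of one way to supply it without hardcoding it in the notebook, the helper below reads it from an environment variable. Both load_api_key and the variable name GOOGLE_API_KEY are hypothetical names chosen for illustration; they are not part of Snail.

```python
import os


def load_api_key(env_var: str = "GOOGLE_API_KEY") -> str:
    """Return the API key stored in `env_var`, or raise if it is missing.

    Hypothetical helper: the variable name is an assumption, not a Snail convention.
    """
    key = os.environ.get(env_var, "").strip()
    if not key:
        raise RuntimeError(f"Set the {env_var} environment variable first.")
    return key
```

With the variable set, google_api = load_api_key() can replace a hardcoded key before constructing CoTDatasetGenerator.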

2. Extract Enumerations

After obtaining the results, extract the enumerated listings. Ensure that your user_query specifies the desired number of items to extract.

instruction = snail.extract_listings(result)
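
To make the idea of extracting an enumeration concrete, here is a rough sketch of how numbered items could be pulled out of free text with a regular expression. It is an illustration only: parse_enumerated is a hypothetical function, and this is not necessarily how extract_listings works internally.

```python
import re

# Matches lines such as "1. text" or "2) text" and captures the item text.
ITEM_RE = re.compile(r"^\s*(\d+)[.)]\s+(.*\S)\s*$")


def parse_enumerated(text: str) -> list[str]:
    """Collect the text of every numbered line, in order of appearance."""
    items = []
    for line in text.splitlines():
        match = ITEM_RE.match(line)
        if match:
            items.append(match.group(2))
    return items


sample = """Here are some problems:
1. What is 2 + 2?
2) Solve x + 3 = 7 for x.
Not a list item.
3. Differentiate x^2."""

print(parse_enumerated(sample))
# ['What is 2 + 2?', 'Solve x + 3 = 7 for x.', 'Differentiate x^2.']
```

Lines without a leading number are skipped, which is why a query that asks for a numbered list tends to be easier to process.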

3. Customize Your CoT System Instruction

Adjust the system_instruction_cot parameter to fit your needs. Your instruction should describe the problem-solving process step by step. The thought process should be enclosed in <thought></thought> tags and the final answer in <answer></answer> tags.

snail.system_instruction_cot = f"""You are an expert {snail.role} skilled at explaining mathematical problems step by step, using a Chain of Thought (CoT) framework. Your response must include:
- A thought process inside <thought></thought> tags, where you analyze the problem.
- A final response inside <answer></answer> tags, solving the problem.
Ensure your reasoning is clear and concise.
"""

4. Generate CoT Results

Use the configured instructions to generate responses in the CoT format defined in the previous step.

output = snail.get_result(instruction, 2)

5. Create and Push Your Dataset

Finally, convert the outputs into a JSON dataset in Alpaca format and push it to Hugging Face.

ds = snail.create_ds(instruction, output)
snail.transform_alpaca_format(ds)
snail.push_to_hf(json_path="path to file", repo_id="HF username/Repo name")
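
For orientation, the conventional Alpaca layout stores each example as a record with instruction, input, and output keys. The sketch below builds such records with the standard library only. Whether transform_alpaca_format writes exactly these fields is an assumption, and to_alpaca is a hypothetical helper.

```python
import json


def to_alpaca(instructions, outputs):
    """Pair each instruction with its output in Alpaca-style records."""
    if len(instructions) != len(outputs):
        raise ValueError("instructions and outputs must have the same length")
    return [
        {"instruction": inst, "input": "", "output": out}
        for inst, out in zip(instructions, outputs)
    ]


records = to_alpaca(
    ["What is 2 + 2?"],
    ["<thought>Add two and two.</thought><answer>4</answer>"],
)
print(json.dumps(records, indent=2))
```

The input field is left empty here because each task is fully described by its instruction.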

About

Snail is a synthetic dataset generator.
