Snail

Generate Synthetic Dataset Using Gemini Model & Google Search Tool

This repository provides a class that searches for tasks, expressions, and similar items that can be enumerated, and uses them to generate a synthetic dataset. It combines the Gemini model with the Google Search tool so you can build datasets for advanced reasoning and problem-solving applications.

Overview

The main objective of this class is to help you:

Search for specific tasks and expressions using Google search.
Extract enumerated listings from the search results.
Generate a comprehensive synthetic dataset.
Create and push your dataset to Hugging Face in a streamlined manner.

Getting Started

0. Installing

Install and set up the tool with the following commands. I highly recommend using Google Colab, since it is easier to set up and more flexible. Don't forget to run huggingface-cli login: you must be logged in to Hugging Face to push the dataset.

!git clone https://github.com/ioscbasotcstw/Snail.git
!pip install -r '/content/Snail/requirements.txt'
!huggingface-cli login

1. Initialize the Class

Set up the class by providing all the necessary parameters. One key parameter is role, which tailors the context for more relevant system instructions via system_instruction_google_search.

from Snail.snail.cot_dsgen import CoTDatasetGenerator

# google_api must already hold your Google API key.
snail = CoTDatasetGenerator(google_api_key=google_api, model_id='gemini-2.0-flash-thinking-exp-01-21', user_query="List 20 math problems from easiest to hardest and number them", role="mathematician")
result = snail.searching()
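
The snippet above uses a google_api variable that this README does not define. One possible way to supply it is sketched below, assuming the key is stored in an environment variable. The helper load_api_key and the variable name GOOGLE_API_KEY are hypothetical choices for illustration, not something the library requires.

```python
import os


def load_api_key(env_var: str = "GOOGLE_API_KEY") -> str:
    """Return the API key stored in env_var, or raise if it is missing.

    The default variable name is an assumption made for this sketch.
    """
    key = os.environ.get(env_var, "").strip()
    if not key:
        raise RuntimeError(f"Set the {env_var} environment variable first.")
    return key
```

With this in place, google_api = load_api_key() can be run before creating the CoTDatasetGenerator, which keeps the key out of the notebook itself.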

2. Extract Enumerations

After obtaining the results, extract the enumerated listings. Ensure that your user_query specifies the desired number of items to extract.

instruction = snail.extract_listings(result)
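
To illustrate what extracting enumerated listings means, here is a minimal sketch that pulls numbered items out of plain text with a regular expression. It is not the implementation of extract_listings, which may work quite differently; the function name extract_numbered_items and the accepted numbering styles are assumptions made for this example.

```python
import re

# Matches lines such as "1. text", "2) text", or "3: text".
_ITEM = re.compile(r"^\s*(\d+)[.):]\s+(.*\S)\s*$")


def extract_numbered_items(text: str) -> list[str]:
    """Collect the text of every numbered line, in order of appearance."""
    items = []
    for line in text.splitlines():
        match = _ITEM.match(line)
        if match:
            items.append(match.group(2))
    return items
```

Lines that do not start with a number followed by a separator and a space are skipped, so surrounding commentary in the search result is ignored.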

3. Customize Your CoT System Instruction

Tweak the system_instruction_cot parameter to fit your needs. Your instruction should describe the problem-solving process step by step. The thought process should be enclosed in <thought></thought> tags and the final answer in <answer></answer> tags.

snail.system_instruction_cot = f"""You are a {snail.role} expert skilled at explaining mathematical problems step by step, using a Chain of Thought (CoT) framework. Your response must include:
- A thought process inside <thought></thought> tags, where you analyze the problem.
- A final response inside <answer></answer> tags, solving the problem.
Ensure your reasoning is clear and concise.
"""

4. Generate CoT Results

Use the configured instructions to generate responses in the desired CoT format.

output = snail.get_result(instruction, 2)

5. Create and Push Your Dataset

Finally, transform your outputs into a JSON-formatted dataset and push it to Hugging Face.

ds = snail.create_ds(instruction, output)
snail.transform_alpaca_format(ds)
snail.push_to_hf(json_path="path to file", repo_id="HF username/Repo name")
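
For reference, Alpaca-style datasets conventionally store each example as a record with instruction, input, and output fields. The sketch below builds such records from paired lists. Whether transform_alpaca_format writes exactly these fields is an assumption, and to_alpaca_records is a hypothetical helper used only for illustration.

```python
import json


def to_alpaca_records(instructions, outputs):
    """Pair each instruction with its output in Alpaca-style records."""
    if len(instructions) != len(outputs):
        raise ValueError("instructions and outputs must have the same length")
    return [
        {"instruction": instruction, "input": "", "output": output}
        for instruction, output in zip(instructions, outputs)
    ]


records = to_alpaca_records(
    ["What is 2 + 2?"],
    ["<thought>Add the two numbers.</thought><answer>4</answer>"],
)
print(json.dumps(records, indent=2))
```

The input field is left empty here because each problem is fully stated in the instruction.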
