This repository provides a class that searches for tasks, expressions, and similar quantifiable items, and uses them to generate a synthetic dataset. It relies on the Gemini model and the Google search tool, so you can build datasets for advanced reasoning and problem-solving applications.
The main objective of this class is to help you:
- Search for specific tasks and expressions using Google search.
- Extract enumerated listings from the search results.
- Generate a comprehensive synthetic dataset.
- Create and push your dataset to Hugging Face in a streamlined manner.
Install and set up the tool with the following commands. I highly recommend using Google Colab because it's much easier and more flexible.
Don't forget to run huggingface-cli login, because you need to be logged in to Hugging Face (HF) when you push the dataset.
!git clone https://github.com/ioscbasotcstw/Snail.git
!pip install -r '/content/Snail/requirements.txt'
!huggingface-cli login
Set up the class by providing all the necessary parameters. One key parameter is role, which tailors the context for more relevant system instructions via system_instruction_google_search.
from Snail.snail.cot_dsgen import CoTDatasetGenerator
snail = CoTDatasetGenerator(google_api_key=google_api, model_id='gemini-2.0-flash-thinking-exp-01-21', user_query="List 20 math problems from easiest to hardest and number them", role="mathematician")
result = snail.searching()
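The snippet above passes a google_api variable that is not defined anywhere in it. One way to supply it, sketched here under the assumption that the key is stored in an environment variable, is a small helper. The helper name load_api_key and the variable name GOOGLE_API_KEY are illustrative choices, not part of this repository.

```python
import os


def load_api_key(env_var: str = "GOOGLE_API_KEY") -> str:
    """Return the API key stored in env_var (hypothetical helper, not part of Snail)."""
    key = os.environ.get(env_var, "").strip()
    if not key:
        # Fail early with a clear message instead of sending an empty key to the API.
        raise RuntimeError(f"Set the {env_var} environment variable first.")
    return key
```

With the variable set, google_api = load_api_key() can be placed before the CoTDatasetGenerator call, which keeps the key out of the notebook itself.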
After obtaining the results, extract the enumerated listings. Ensure that your user_query specifies the desired number of items to extract.
instruction = snail.extract_listings(result)
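To give a feel for what extracting enumerated listings means, here is a minimal sketch that pulls numbered items out of plain text with a regular expression. It is only an illustration: the function extract_numbered_items is hypothetical, and the way extract_listings works internally may be quite different.

```python
import re

# Matches lines such as "1. text" or "2) text"; other lines are ignored.
_ITEM = re.compile(r"^\s*\d+[.)]\s+(.*\S)\s*$")


def extract_numbered_items(text: str) -> list[str]:
    """Collect the text of every numbered line (illustrative, not the Snail implementation)."""
    items = []
    for line in text.splitlines():
        match = _ITEM.match(line)
        if match:
            items.append(match.group(1))
    return items
```

A pattern like this also shows why the query should ask for a numbered list: items without a leading number would not be picked up.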
Tweak the system_instruction_cot parameter to better fit your needs. Your instruction should describe the problem-solving process step by step. The thought process should be enclosed in <thought></thought> tags and the final answer in <answer></answer> tags.
snail.system_instruction_cot = f"""You are a {snail.role} expert skilled at explaining math problems step by step, using a Chain of Thought (CoT) framework. Your response must include:
- A thought process inside <thought></thought> tags, where you analyze the problem.
- A final response inside <answer></answer> tags, solving the problem.
Ensure your reasoning is clear and concise.
"""
Use the configured instructions to generate responses in the desired CoT format.
output = snail.get_result(instruction, 2)
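If you want to check that a generated response follows the tag format, a small parser can help. The sketch below assumes a response is a plain string containing <thought></thought> and <answer></answer> tags. The function split_cot is hypothetical, and the actual type returned by get_result may differ.

```python
import re


def split_cot(response: str) -> dict:
    """Split a CoT response into its thought and answer parts (illustrative helper)."""

    def _tag(name: str):
        # DOTALL lets the tag content span several lines.
        match = re.search(rf"<{name}>(.*?)</{name}>", response, flags=re.DOTALL)
        return match.group(1).strip() if match else None

    return {"thought": _tag("thought"), "answer": _tag("answer")}
```

A value of None for either key signals that the model ignored the format, which is a useful filter before building the dataset.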
Finally, transform your outputs into a JSON-formatted dataset and push it to Hugging Face.
ds = snail.create_ds(instruction, output)
snail.transform_alpaca_format(ds)
snail.push_to_hf(json_path="path to file", repo_id="HF username/Repo name")
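For reference, the Alpaca format stores each example as a record with instruction, input, and output fields. The sketch below shows that shape with plain Python. The helpers to_alpaca_records and save_records are hypothetical, and whether transform_alpaca_format writes exactly this layout is an assumption.

```python
import json


def to_alpaca_records(instructions, outputs):
    """Pair each instruction with its output as an Alpaca-style record (illustrative)."""
    if len(instructions) != len(outputs):
        raise ValueError("instructions and outputs must have the same length")
    # The input field is left empty because the tasks carry no extra context.
    return [
        {"instruction": ins, "input": "", "output": out}
        for ins, out in zip(instructions, outputs)
    ]


def save_records(records, path):
    """Write the records to a JSON file."""
    with open(path, "w", encoding="utf-8") as f:
        json.dump(records, f, ensure_ascii=False, indent=2)
```

Inspecting the JSON file before pushing it makes it easier to spot empty or malformed outputs.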