Skip to content

newturing/GStar-Assignment-2-GRPO

Folders and files

NameName
Last commit message
Last commit date

Latest commit

 

History

8 Commits
 
 
 
 
 
 
 
 
 
 

Repository files navigation

GRPO for Countdown Math Problems

This project implements the Group Relative Policy Optimization (GRPO) algorithm to fine-tune a language model on the "Countdown" math task. The goal is to train an LLM to generate correct mathematical equations that reach a target number using a given set of integers.

Setup

pip install --upgrade uv
uv venv
source .venv/bin/activate
uv pip install vllm==0.7.2 triton==3.1.0 datasets transformers==4.51.3 tensorboard torch gpustat datasets python-dotenv
uv pip install flash-attn==2.7.4.post1 --no-build-isolation

Assignment Structure

Instruction: gstar_assignment2.pdf

Main File: starter.py:

Problem 1

Part 1: GRPO Pipeline Implementation

  • Task 1: Implement reward helper functions (_extract_answer, _validate_numbers, _evaluate_equation).
  • Task 2: Implement the main reward_fn.
  • Task 3: Implement compute_group_normalized_advantages.
  • Task 5: Implement masked mean for GRPO and masked_mean_drgrpo for DR.GRPO.

Part 2: Main Experiments and Report (please refer to section 1.6 of gstar_assignment2.pdf for the Main Report and Experiments

Problem 2: (Optional) Open-ended Investigation. Please refer to Section 2 of gstar_assignment2.pdf

About

GRPO for Countdown Math Problems

Resources

Stars

Watchers

Forks

Releases

No releases published

Packages

No packages published

Languages