relign is a fully open-source RL library built specifically for the research and development of reasoning engines. It currently supports state-of-the-art reinforcement learning algorithms such as PPO and GRPO, alongside useful abstractions for Chain-of-Thought (CoT) and MCTS inference strategies, all of which can be evaluated on popular reasoning benchmarks.
Note: relign is alpha software—it may be buggy.
- Installation
- Example
- Bounties
- What's Next
- Training Runs
- Contributing (Ranked by Urgency)
- Acknowledgements
- Create and activate a conda environment:

  ```bash
  conda create -n relign python=3.10 -y
  conda activate relign
  ```

- Install RELIGN in editable mode:

  ```bash
  pip install -e .
  ```
In the `examples` folder, we provide code to fine-tune a 1B SFT model via PPO on the GSM8K math benchmark using a Chain-of-Thought inference strategy. This example demonstrates the different abstraction layers that relign offers. A blog post detailing exactly what's happening here, why it is important, and where we see this going will follow soon.
The example runs on two A6000 GPUs (96GB VRAM total).
```bash
deepspeed --num_gpus=2 examples/ppo_gsm.py
```
If you have a custom DeepSpeed config file (e.g., ds_config.json), you can also specify it:
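For example (assuming `ppo_gsm.py` exposes DeepSpeed's standard `--deepspeed_config` argument via `deepspeed.add_config_arguments`; the exact flag depends on how the script parses its arguments):

```bash
deepspeed --num_gpus=2 examples/ppo_gsm.py --deepspeed_config ds_config.json
```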
It is not just the models that get rewarded for their work; more importantly, our contributors do too. Complete a bounty and we will send you RELIGN:
| Description | Reward in RELIGN |
| --- | --- |
| ✅ Completed: GRPO – Implement DeepSeek's GRPO in RELIGN and train it with standard CoT inference on GSM8K math | 250k |
| Create a complex medical reasoning task specification + verifier based on HuatuoGPT-o1 | 250k |
| Implement a multi-step learning inference strategy | 250k |
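For reference, the core idea behind GRPO is a group-relative advantage: sample a group of completions per prompt, then standardize each completion's reward against its group's statistics, which removes the need for a learned critic. A minimal sketch of that computation (illustrative only, not RELIGN's actual implementation):

```python
import torch

def grpo_advantages(rewards: torch.Tensor, eps: float = 1e-6) -> torch.Tensor:
    """Group-relative advantages as described in DeepSeek's GRPO.

    rewards: (num_prompts, group_size) tensor with one scalar reward per
    sampled completion. Each advantage is the completion's reward
    standardized against the other completions for the same prompt.
    """
    mean = rewards.mean(dim=-1, keepdim=True)
    std = rewards.std(dim=-1, keepdim=True)
    return (rewards - mean) / (std + eps)

# Example: 2 prompts, 4 sampled completions each (1.0 = verifier passed)
rewards = torch.tensor([[1.0, 0.0, 0.0, 1.0],
                        [0.0, 0.0, 0.0, 1.0]])
print(grpo_advantages(rewards))
```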
If you'd like to propose a new challenge or feature and set your own reward, go to: www.relign.ai/bounties.
Bounty Instructions (standard GitHub workflow):
- Fork the repository.
- Make your changes in a new branch.
- Submit a Pull Request referencing the bounty issue.
- We will review your PR and, if merged, send the funds to your wallet.
- Docs Page & Project Layout
  Comprehensive documentation of the library's features and classes.
- More Memory-Efficient Algorithm Runners
  Some runs require a lot of VRAM. We aim to set up smaller-scale experiments so that developers can run and train models on single-GPU machines.
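Until then, one standard way to shrink the memory footprint is DeepSpeed ZeRO with CPU offload. A minimal sketch using standard DeepSpeed config keys (the values are placeholders, not RELIGN defaults):

```python
# Standard DeepSpeed options for fitting training on a single GPU:
# ZeRO stage 3 partitions optimizer state, gradients, and parameters
# across workers, and CPU offload trades throughput for VRAM.
ds_config = {
    "train_micro_batch_size_per_gpu": 1,   # placeholder; tune for your GPU
    "gradient_accumulation_steps": 16,     # keeps the effective batch size up
    "bf16": {"enabled": True},
    "zero_optimization": {
        "stage": 3,
        "offload_optimizer": {"device": "cpu"},
        "offload_param": {"device": "cpu"},
    },
}
```

The same keys, written out as `ds_config.json`, can be passed to the launcher as shown in the example above.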
We recently benchmarked GRPO in RELIGN. You can view the detailed training report here.
- Episode Generators / Tasks
- We encourage everyone to add new (novel) tasks and environments to the library on which we can test post-training methods; a minimal sketch of what a task involves follows this list. Some inspiration below:
  - Coding
  - MLE-bench
  - Trading
  - General/Scientific Q&A
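To make the shape of such a contribution concrete, here is a minimal sketch of a task paired with a programmatic verifier, the two pieces an episode generator needs to score rollouts. The class and function names are illustrative, not RELIGN's actual interfaces:

```python
import re
from dataclasses import dataclass

@dataclass
class MathTask:
    """A single task instance: a prompt plus a verifiable gold answer."""
    prompt: str
    answer: str

def verify(task: MathTask, completion: str) -> float:
    """Return a scalar reward by checking the model's final answer.

    Assumes the completion ends with 'The answer is <value>', a common
    CoT convention; real verifiers should be more robust.
    """
    match = re.search(r"answer is\s*(-?[\d.,]+)", completion, re.IGNORECASE)
    if match is None:
        return 0.0
    predicted = match.group(1).rstrip(".").replace(",", "")
    return 1.0 if predicted == task.answer else 0.0

# Usage: score a rollout against a GSM8K-style task
task = MathTask(prompt="Natalia sold 48 clips... How many in total?", answer="72")
print(verify(task, "She sold 48 + 24 = 72. The answer is 72."))  # 1.0
```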
RELIGN builds upon and is inspired by the following works:
- Guidance
- DeepSeek-Math
- Framework structure inspired by Stable Baselines 3
- Special acknowledgement to VinePPO for the MCTS + CoT approach and DeepSpeed policy abstractions.
Thank you to all contributors for your open-source efforts!