
Self-Play on Zero-Sum Games Incentivizes Reasoning via Multi-Agent Multi-Turn Reinforcement Learning

🤗 Hugging Face Paper | 🤗 Hugging Face Collection

Updates

  • 30/06/2025: 🎉 We release our paper, models, game-play dataset, and self-play codebase.

Introduction

Recent advances in reinforcement learning have shown that language models can develop sophisticated reasoning through training on tasks with verifiable rewards, but these approaches depend on expert-curated problem-answer pairs and domain-specific reward engineering.

We introduce SPIRAL, a self-play framework where models learn by playing multi-turn, zero-sum games against continuously improving versions of themselves, eliminating the need for human supervision. Through zero-sum self-play, SPIRAL generates an infinite curriculum of progressively challenging problems as models must constantly adapt to stronger opponents.

Applying SPIRAL to Qwen3 base models in two-player zero-sum text games, we observe that the agents develop advanced reasoning strategies to win the competitive games. Furthermore, the trained models show substantial gains on a range of math and general reasoning benchmarks. These results suggest that self-play on zero-sum games can naturally induce transferable reasoning capabilities, highlighting a promising direction for autonomous reasoning development.

Architecture

SPIRAL employs an actor-learner architecture for scalable self-play training. Parallel actors sample trajectories from a diverse set of games using vectorized environments. A single policy $\pi_t$ plays both roles, generating zero-sum, sparse-reward game trajectories. The centralized learner processes these trajectories using Role-conditioned Advantage Estimation (RAE) to compute separate advantages, $A_0(s,a)$ and $A_1(s,a)$, for each role. These are then used for on-policy reinforcement learning updates.
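
As a rough illustration of the RAE idea, the sketch below keeps a separate running baseline per role and subtracts it from that role's episode return to form the advantage. The class name, the EMA update, and the variable names are illustrative assumptions for this sketch, not the exact implementation in this repository.

# Minimal sketch of Role-conditioned Advantage Estimation (RAE).
# Assumption: each role keeps its own running (EMA) baseline of episode
# returns, and a trajectory's advantage is its return minus that baseline.

class RoleConditionedAdvantage:
    def __init__(self, num_roles: int = 2, ema: float = 0.95):
        self.baselines = [0.0] * num_roles  # one running baseline per role
        self.ema = ema

    def compute(self, role: int, episode_return: float) -> float:
        # Update the role-specific baseline with an exponential moving average.
        self.baselines[role] = (
            self.ema * self.baselines[role] + (1.0 - self.ema) * episode_return
        )
        # Advantage of this (sparse, zero-sum) return relative to the role's baseline.
        return episode_return - self.baselines[role]

# Usage: a single policy plays both roles; in a zero-sum game the two roles
# receive opposite returns, and each role gets its own advantage estimate.
rae = RoleConditionedAdvantage()
advantage_p0 = rae.compute(role=0, episode_return=+1.0)
advantage_p1 = rae.compute(role=1, episode_return=-1.0)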

Usage

Installation

# clone codebase
git clone [email protected]:spiral-rl/spiral.git && cd spiral

# prepare environment
conda create -y -n spiral python=3.10
conda activate spiral

# install dependencies
pip install vllm==0.8.4 && pip install oat-llm==0.2.1
pip install -e .

Training

bash run.sh

This training script runs SPIRAL on the Kuhn Poker environment for 400 policy iteration steps. It has been tested on an 8×H100 GPU setup. During training, we evaluate model performance online using three metrics (a small illustrative sketch of the win-rate computation follows the list):

  1. Win rate against a fixed opponent on the training game;

  2. Win rate against a fixed opponent on an out-of-domain game;

  3. Accuracy on math reasoning benchmarks.
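
As a rough illustration of metrics (1) and (2), the snippet below counts the fraction of games the trained policy wins against a fixed opponent. The play_one_game callable is a hypothetical stand-in for a single TextArena episode, not the repository's evaluation code.

# Illustrative only: win rate of the trained policy against a fixed opponent.
# play_one_game is assumed to return +1 on a win, -1 on a loss, 0 on a draw.

def win_rate(play_one_game, num_games: int = 100) -> float:
    wins = sum(1 for _ in range(num_games) if play_one_game() > 0)
    return wins / num_games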

Example result curves are shown below.

Evaluation

In addition to the online evaluation, we provide offline evaluation across a broader range of benchmarks to assess the model's out-of-domain (OOD) game performance and general reasoning capabilities.

Game evaluation

# we rely on OpenRouter to play against Gemini models
export OPENROUTER_API_KEY=""

# Add your models to the batch_run.sh
bash evals/game/batch_run.sh

Benchmark evaluation

cd evals/benchmarks
# Add your models to the batch_run.sh
bash batch_run.sh

Citation

If you find our work useful for your research, please consider citing:

@article{liu2025spiral,
  title={SPIRAL: Self-Play on Zero-Sum Games Incentivizes Reasoning via Multi-Agent Multi-Turn Reinforcement Learning},
  author={Liu, Bo and Guertler, Leon and Yu, Simon and Liu, Zichen and Qi, Penghui and Balcells, Daniel and Liu, Mickel and Tan, Cheston and Shi, Weiyan and Lin, Min and Lee, Wee Sun and Jaques, Natasha},
  journal={arXiv preprint arXiv:2506.24119},
  year={2025},
  url={https://arxiv.org/abs/2506.24119}
}

Acknowledgement

  • This work is supported by PlasticLabs and Sea AI Lab, which provided the computing resources.
  • The language games are sampled from TextArena, a collection of competitive text-based games for language model evaluation and reinforcement learning.
  • The multi-agent, multi-turn RL training is implemented with 🌾 Oat, a modular and research-friendly LLM RL framework.
  • We explored PEFT experiments using UnstableBaselines, a lightweight, LoRA-first library for fast prototyping of self-play algorithms on TextArena.
  • The base models are from Qwen3.
