Learning to Optimize Multi-Objective Alignment Through Dynamic Reward Weighting

Dynamic multi-objective reward weighting methods compatible with various online reinforcement learning algorithms, datasets, and model families.

Paper on arXiv: https://arxiv.org/abs/2509.11452


📖 Folder Structure

steps/       // callable scripts for data preprocessing
verl/        // source code of models, algorithms, data structures, metrics, etc. 
examples/    // bash scripts to run jobs
data/        // pre-processed data used in experiments

⚙️ Environment

We use the provided Dockerfile to build the environment. For more setup instructions, see the verl environment setup guide.

We use Weights & Biases (wandb) to log experiments, so please log in before launching any runs.
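
For reference, a minimal setup sketch (the image tag, mount point, and container options below are placeholders, not names defined by this repo):

# Build the environment image from the repository's Dockerfile
# (the tag "dynamic-reward-weighting" is a placeholder).
docker build -t dynamic-reward-weighting .

# Start an interactive container with GPU access
# (assumes the NVIDIA Container Toolkit is installed).
docker run --gpus all -it -v "$(pwd)":/workspace dynamic-reward-weighting

# Inside the container, authenticate with Weights & Biases
# before launching any training runs.
wandb login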

🚀 Experiment

We provide all the bash scripts used in our experiments in the examples/ directory.

Hypervolume-Guided Weight Adaptation

For example, to train Qwen3-8B with GRPO:

bash examples/grpo_trainer/run_qwen3-8b_multiobjective_vanilla.sh

To run the experiments in batch:

bash examples/grpo_trainer/run_batch_vanilla.sh

Gradient-Based Weight Optimization

For example, to train Qwen3-8B with GRPO:

bash examples/grpo_trainer/run_qwen3-8b_multiobjective_optimization.sh

To run the experiments in batch:

bash examples/grpo_trainer/run_batch_optimization.sh

Preliminary Experiments

We provide scripts to replicate the preliminary findings reported in Appendix A.2 of the paper:

bash examples/preliminary_experiment/model_merge.sh
bash examples/preliminary_experiment/run_main_generation_dual.sh

Note that the checkpoints saved by the FSDP and Megatron backends must first be merged into HuggingFace-format models; this is what model_merge.sh does, so run it before run_main_generation_dual.sh.
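
For illustration, a merge step typically looks like the sketch below; the exact entry point and flags vary across verl versions, so treat model_merge.sh as the authoritative invocation (all paths here are placeholders):

# Merge sharded FSDP actor checkpoints into a HuggingFace-format model.
# Paths are placeholders; see examples/preliminary_experiment/model_merge.sh
# for the exact command used in this repo.
python -m verl.model_merger merge \
    --backend fsdp \
    --local_dir checkpoints/my_run/global_step_100/actor \
    --target_dir checkpoints/my_run/huggingface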

Result Analysis

We also provide code in analysis_fns.py for analyzing and visualizing results, with usage examples in analysis.ipynb.
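
To explore the results interactively, you can launch the notebook from the repository root (assuming Jupyter is installed in your environment):

jupyter notebook analysis.ipynb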

Important Files

Key files that we modified or added on top of verl:

verl/trainer/ppo/ray_trainer_vanilla.py
verl/trainer/ppo/ray_trainer_optimization.py
verl/trainer/main_generation_dual.py

verl/utils/reward_score/dynamic_math/*

verl/workers/reward_manager/multi_objective.py
verl/workers/reward_manager/multi_objective_optimization.py
verl/workers/fsdp_workers.py

📚 Citation

If you use our code, please cite the following paper:

@misc{lu2025learningoptimizemultiobjectivealignment,
      title={Learning to Optimize Multi-Objective Alignment Through Dynamic Reward Weighting}, 
      author={Yining Lu and Zilong Wang and Shiyang Li and Xin Liu and Changlong Yu and Qingyu Yin and Zhan Shi and Zixuan Zhang and Meng Jiang},
      year={2025},
      eprint={2509.11452},
      archivePrefix={arXiv},
      primaryClass={cs.LG},
      url={https://arxiv.org/abs/2509.11452}, 
}
