Support for pairwise judges in online training #1194

swarnaHub · 2025-06-02T21:57:50Z

What does this PR do? Please describe:
Adds support for online training with any pairwise LLM-as-a-Judge that generates real-valued scores for responses.

Fixes #{issue number}

Does your PR introduce any breaking changes? If yes, please list them:
List of all backwards-incompatible changes.

Check list:

Was the content of this PR discussed and approved via a GitHub issue? (no need for typos or documentation improvements)
Did you read the contributor guideline?
Did you make sure that your PR does only one thing instead of bundling different changes together?
Did you make sure to update the documentation with your changes? (if necessary)
Did you write any new necessary tests?
Did you verify new and existing tests pass locally with your changes?
Did you update the CHANGELOG? (no need for typos, documentation, or minor internal changes)

src/fairseq2/recipes/lm/_online_finetune/_grpo.py

src/fairseq2/recipes/lm/_online_finetune/_remote_model.py

swarnaHub · 2025-07-03T18:25:35Z

src/fairseq2/recipes/lm/__init__.py

+from fairseq2.recipes.lm._online_finetune._rewards import (
+    GenerativePairwiseVerifierHandler as GenerativePairwiseVerifierHandler,
+)
+


These are generic generative-judge reward classes. Internally they call a judge-specific extractor, which users have to define and then select in the reward config as follows:

    reward:
      name: "generative_pairwise_verifier"
      config:
        prompt_key: prompt_raw
        tokenizer: /datasets/pretrained-llms/Llama-3.1-8B-Instruct
        judgment_extractor: "j1_pairwise_score_extractor"

For scalar RMs, the "judgment_extractor" will be empty or ignored.
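
To illustrate what a user-defined extractor could look like, here is a minimal sketch. It is not the fairseq2 implementation: the class name, the registry dictionary, the `<score_A>`/`<score_B>` tag format, and the 0 to 10 score range are all assumptions made for this example, and the real J1 output format may differ.

```python
import re
from typing import Dict, Optional, Tuple, Type

# ASSUMPTION: the judge ends its generation with tags such as
# "<score_A>7.5</score_A> <score_B>3</score_B>". This format is hypothetical.
_SCORE_RE = re.compile(
    r"<score_A>\s*(-?\d+(?:\.\d+)?)\s*</score_A>.*?"
    r"<score_B>\s*(-?\d+(?:\.\d+)?)\s*</score_B>",
    re.DOTALL,
)


class HypotheticalPairwiseScoreExtractor:
    """Parses two real-valued scores out of a pairwise judge generation."""

    def __init__(self, min_score: float = 0.0, max_score: float = 10.0) -> None:
        # ASSUMPTION: valid scores lie in [min_score, max_score].
        self.min_score = min_score
        self.max_score = max_score

    def __call__(self, generation: str) -> Optional[Tuple[float, float]]:
        matches = _SCORE_RE.findall(generation)
        if not matches:
            # The judge did not follow the expected format.
            return None

        # Use the last score pair in case the reasoning quotes earlier ones.
        score_a, score_b = (float(s) for s in matches[-1])

        # Reject out-of-range scores instead of silently clipping them.
        for score in (score_a, score_b):
            if not self.min_score <= score <= self.max_score:
                return None

        return score_a, score_b


# Hypothetical name-to-class registry, mirroring how a string such as
# "judgment_extractor" in the reward config could select an extractor.
EXTRACTORS: Dict[str, Type[HypotheticalPairwiseScoreExtractor]] = {
    "hypothetical_pairwise_score_extractor": HypotheticalPairwiseScoreExtractor,
}

extractor = EXTRACTORS["hypothetical_pairwise_score_extractor"]()
scores = extractor("Response A is more complete. <score_A>7.5</score_A> <score_B>3</score_B>")
```

Returning `None` for malformed or out-of-range judgments leaves it to the caller to decide how to handle them, for example by logging and assigning a default reward.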

@jacklanchantin FYI

Pairwise J1 prompt

2779600

facebook-github-bot added the CLA Signed label Jun 2, 2025

swarna added 6 commits June 8, 2025 00:51

Adding Pairwise-J1 support

81dba36

Minor changes

30bf472

Merge branch 'online_training' into swarna/pairwise_judge

99c064c

Simplifying

b4b228a

Add logging back in

9a9aac0

Add generation prompt.

63d5c51

swarnaHub marked this pull request as ready for review June 17, 2025 05:30

swarnaHub requested a review from cbalioglu as a code owner June 17, 2025 05:30

swarnaHub requested review from uralik and jacklanchantin June 17, 2025 05:30

swarna added 2 commits June 17, 2025 05:42

removing debug statements

ad1fd47

More logging for scores out of range

d70ca73

uralik reviewed Jun 21, 2025

src/fairseq2/recipes/lm/_online_finetune/_grpo.py (outdated)

uralik reviewed Jun 21, 2025

src/fairseq2/recipes/lm/_online_finetune/_grpo.py (outdated)

uralik reviewed Jun 21, 2025

src/fairseq2/recipes/lm/_online_finetune/_remote_model.py (outdated)

swarna added 2 commits June 22, 2025 05:29

Adding reward name as a reward class attribute

490d54e

Cleaning up generative judges with extractor classes

4bd8774

swarnaHub commented Jul 3, 2025

swarna and others added 3 commits July 3, 2025 21:52

Typing

b3eb993

Merge branch 'ot_merge' into swarna/pairwise_judge

1675d0b

logger label

7fd7c7f

uralik requested review from zyaoj and artemru as code owners July 7, 2025 18:50

uralik changed the base branch from online_training to ot_merge July 7, 2025 18:50

uralik merged commit a7ffaa5 into ot_merge Jul 7, 2025
8 of 12 checks passed
