Support for pairwise judges in online training #1194

Merged
merged 14 commits into from
Jul 7, 2025

@swarnaHub swarnaHub commented Jun 2, 2025

What does this PR do? Please describe:
Adds support for online training with any pairwise LLM-as-a-Judge that generates real-valued scores for responses.

Fixes #{issue number}

Does your PR introduce any breaking changes? If yes, please list them:
List of all backwards-incompatible changes.

Check list:

  • Was the content of this PR discussed and approved via a GitHub issue? (no need for typos or documentation improvements)
  • Did you read the contributor guideline?
  • Did you make sure that your PR does only one thing instead of bundling different changes together?
  • Did you make sure to update the documentation with your changes? (if necessary)
  • Did you write any new necessary tests?
  • Did you verify new and existing tests pass locally with your changes?
  • Did you update the CHANGELOG? (no need for typos, documentation, or minor internal changes)

@facebook-github-bot facebook-github-bot added the CLA Signed This label is managed by the Facebook bot. Authors need to sign the CLA before a PR can be reviewed. label Jun 2, 2025
@swarnaHub swarnaHub marked this pull request as ready for review June 17, 2025 05:30
@swarnaHub swarnaHub requested a review from cbalioglu as a code owner June 17, 2025 05:30
from fairseq2.recipes.lm._online_finetune._rewards import (
    GenerativePairwiseVerifierHandler as GenerativePairwiseVerifierHandler,
)

Author

These are generic generative-judge reward classes. Internally, they call a specific judgment extractor, which users have to define and then select in the reward config as follows:

reward:
    name: "generative_pairwise_verifier"
    config:
        prompt_key: prompt_raw
        tokenizer: /datasets/pretrained-llms/Llama-3.1-8B-Instruct
        judgment_extractor: "j1_pairwise_score_extractor"

For scalar RMs, "judgment_extractor" is left empty or is ignored.
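
To illustrate what a user-defined extractor might do, here is a minimal sketch of parsing real-valued scores for two responses out of a pairwise judge's generated text. Everything in it is an assumption made for illustration: the `<score_A>`/`<score_B>` tag format, the function names `extract_pairwise_scores` and `aggregate`, and the averaging step are hypothetical and are not taken from this PR or from the fairseq2 API.

```python
import re
from typing import List, Optional, Tuple

# Assumed judge output format (hypothetical, not from this PR):
#   "... <score_A>7.5</score_A> ... <score_B>3</score_B>"
_NUMBER = r"\s*(-?\d+(?:\.\d+)?)\s*"
_SCORE_A = re.compile(r"<score_A>" + _NUMBER + r"</score_A>")
_SCORE_B = re.compile(r"<score_B>" + _NUMBER + r"</score_B>")


def extract_pairwise_scores(judgment: str) -> Optional[Tuple[float, float]]:
    """Return (score_a, score_b), or None if either score is missing."""
    match_a = _SCORE_A.search(judgment)
    match_b = _SCORE_B.search(judgment)
    if match_a is None or match_b is None:
        return None
    return float(match_a.group(1)), float(match_b.group(1))


def aggregate(judgments: List[str]) -> Optional[Tuple[float, float]]:
    """Average the scores over all parseable judgments for one response pair."""
    parsed = [s for s in map(extract_pairwise_scores, judgments) if s is not None]
    if not parsed:
        return None
    n = len(parsed)
    return sum(a for a, _ in parsed) / n, sum(b for _, b in parsed) / n
```

Returning None for unparseable judgments, rather than a default score, leaves it to the caller to decide how malformed judge outputs should affect the reward.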

@uralik uralik requested review from zyaoj and artemru as code owners July 7, 2025 18:50
@uralik uralik changed the base branch from online_training to ot_merge July 7, 2025 18:50
@uralik uralik merged commit a7ffaa5 into ot_merge Jul 7, 2025
8 of 12 checks passed
3 participants