Conversation

jamesbraza (Collaborator)
This PR:

  • Adds two LightEval `SampleLevelMetric`s, built from `accuracy_reward` and `format_reward`
  • Adds LightEval `LightevalTaskConfig`s to support all the different evaluation modes:
    • All tasks at once vs. per-category
    • Soft vs. non-soft
    • Reasoning vs. non-reasoning
    • Train vs. test

To run a gpt-4o baseline:

```shell
ETHER0_REMOTES_API_BASE_URL="http://127.0.0.1:8000" ETHER0_REMOTES_API_TOKEN=abc123 python -m lighteval endpoint litellm "model_name=gpt-4o" "community|ether0:loose:functional-group|0|0,community|ether0:loose:molecule-completion|0|0,community|ether0:loose:molecule-formula|0|0,community|ether0:loose:molecule-name|0|0,community|ether0:loose:oracle-solubility|0|0,community|ether0:loose:property-cat-eve|0|0,community|ether0:loose:property-cat-safety|0|0,community|ether0:loose:property-cat-smell|0|0,community|ether0:loose:property-regression-adme|0|0,community|ether0:loose:property-regression-ld50|0|0,community|ether0:loose:property-regression-pka|0|0,community|ether0:loose:reaction-prediction|0|0,community|ether0:loose:retro-synthesis|0|0,community|ether0:loose:simple-formula|0|0" --custom-tasks src/ether0/lighteval_tasks.py --save-details
```
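The long task argument follows LightEval's `suite|task|num_fewshot|truncate` spec. A small sketch that generates it, with the category names copied from the command above (the `make_task_spec` helper is hypothetical, not part of this PR):

```python
# ether0 task categories, as listed in the baseline command above.
CATEGORIES = [
    "functional-group", "molecule-completion", "molecule-formula",
    "molecule-name", "oracle-solubility", "property-cat-eve",
    "property-cat-safety", "property-cat-smell",
    "property-regression-adme", "property-regression-ld50",
    "property-regression-pka", "reaction-prediction",
    "retro-synthesis", "simple-formula",
]


def make_task_spec(mode: str = "loose") -> str:
    """Join all ether0 categories into LightEval's comma-separated task spec."""
    return ",".join(f"community|ether0:{mode}:{cat}|0|0" for cat in CATEGORIES)


print(make_task_spec())
```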

We unavoidably hit a few warnings from LightEval (huggingface/lighteval#800, huggingface/lighteval#801), but eventually a Markdown results table is printed:

| Task | Version | Metric | Value | | Stderr |
|---|---|---|---|---|---|
| all | | ether0_accuracy | 0.1714 | ± | 0.0540 |
| community:ether0:_average:0 | | ether0_accuracy | 0.1714 | ± | 0.0540 |
| community:ether0:loose:functional-group:0 | 0 | ether0_accuracy | 0.0000 | ± | 0.0000 |
| community:ether0:loose:molecule-completion:0 | 0 | ether0_accuracy | 0.1200 | ± | 0.0663 |
| community:ether0:loose:molecule-formula:0 | 0 | ether0_accuracy | 0.0000 | ± | 0.0000 |
| community:ether0:loose:molecule-name:0 | 0 | ether0_accuracy | 0.0400 | ± | 0.0400 |
| community:ether0:loose:oracle-solubility:0 | 0 | ether0_accuracy | 0.1600 | ± | 0.0748 |
| community:ether0:loose:property-cat-eve:0 | 0 | ether0_accuracy | 0.3600 | ± | 0.0980 |
| community:ether0:loose:property-cat-safety:0 | 0 | ether0_accuracy | 0.4400 | ± | 0.1013 |
| community:ether0:loose:property-cat-smell:0 | 0 | ether0_accuracy | 0.2000 | ± | 0.0816 |
| community:ether0:loose:property-regression-adme:0 | 0 | ether0_accuracy | 0.3600 | ± | 0.0980 |
| community:ether0:loose:property-regression-ld50:0 | 0 | ether0_accuracy | 0.4000 | ± | 0.1000 |
| community:ether0:loose:property-regression-pka:0 | 0 | ether0_accuracy | 0.3200 | ± | 0.0952 |
| community:ether0:loose:reaction-prediction:0 | 0 | ether0_accuracy | 0.0000 | ± | 0.0000 |
| community:ether0:loose:retro-synthesis:0 | 0 | ether0_accuracy | 0.0000 | ± | 0.0000 |
| community:ether0:loose:simple-formula:0 | 0 | ether0_accuracy | 0.0000 | ± | 0.0000 |

@jamesbraza jamesbraza self-assigned this Jun 10, 2025
@Copilot Copilot AI review requested due to automatic review settings June 10, 2025 02:30
@jamesbraza jamesbraza added the enhancement New feature or request label Jun 10, 2025
@Copilot Copilot AI left a comment

Pull Request Overview

This PR integrates LightEval into the ether0 benchmark by adding new SampleLevelMetric evaluations and task configurations while also updating dependency pins.

  • Introduces new LightEval tasks and metrics in ether0/lighteval_tasks.py.
  • Refactors problem type filtering functions in ether0/models.py.
  • Adds integration tests for custom tasks in tests/test_lighteval_tasks.py and updates dependency configurations in the pyproject.toml files.

Reviewed Changes

Copilot reviewed 6 out of 6 changed files in this pull request and generated 1 comment.

| File | Description |
|---|---|
| tests/test_lighteval_tasks.py | Added integration tests for custom LightEval tasks. |
| src/ether0/models.py | Refactored problem type filtering for clearer separation of concerns. |
| src/ether0/lighteval_tasks.py | Introduced new tasks and metric evaluation functions for LightEval integration. |
| pyproject.toml | Updated dependency list to include LightEval extras. |
| packages/remotes/pyproject.toml | Adjusted tensorboard pin version to ensure compatibility. |
Comments suppressed due to low confidence (1)

packages/remotes/pyproject.toml:45

  • Verify that downgrading the tensorboard pin to >=2.18 does not impact the features relied on in the remotes package; update or add tests if necessary.
"tensorboard>=2.18",  # Indirect dependency we pin to keep recent

"ether0",
"ether0.remotes[serve]",
"tensorboard>=2.19", # Indirect dependency we pin to keep recent
"tensorboard>=2.18", # Indirect dependency we pin to keep recent
@jamesbraza (Collaborator, Author) Jun 10, 2025
I loosened the pin here to allow package resolution, since lighteval==0.10.0 requires numpy v1: huggingface/lighteval#416
