Conversation

jamesbraza (Collaborator)
This PR:

  • Adds two LightEval `SampleLevelMetric`s, built from `accuracy_reward` and `format_reward`
  • Adds LightEval `LightevalTaskConfig`s to support all the different evaluation modes:
    • All tasks at once vs. per-category
    • Soft vs. non-soft
    • Reasoning vs. non-reasoning
    • Train vs. test

To run a gpt-4o baseline:

```shell
ETHER0_REMOTES_API_BASE_URL="http://127.0.0.1:8000" ETHER0_REMOTES_API_TOKEN=abc123 python -m lighteval endpoint litellm "model_name=gpt-4o" "community|ether0:loose:functional-group|0|0,community|ether0:loose:molecule-completion|0|0,community|ether0:loose:molecule-formula|0|0,community|ether0:loose:molecule-name|0|0,community|ether0:loose:oracle-solubility|0|0,community|ether0:loose:property-cat-eve|0|0,community|ether0:loose:property-cat-safety|0|0,community|ether0:loose:property-cat-smell|0|0,community|ether0:loose:property-regression-adme|0|0,community|ether0:loose:property-regression-ld50|0|0,community|ether0:loose:property-regression-pka|0|0,community|ether0:loose:reaction-prediction|0|0,community|ether0:loose:retro-synthesis|0|0,community|ether0:loose:simple-formula|0|0" --custom-tasks src/ether0/lighteval_tasks.py --save-details
```
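The long task argument follows LightEval's `suite|task|num_fewshot|truncate` spec. A small sketch that generates it, with the category names copied from the command above (the `make_task_spec` helper is hypothetical, not part of this PR):

```python
# ether0 task categories, as listed in the baseline command above.
CATEGORIES = [
    "functional-group", "molecule-completion", "molecule-formula",
    "molecule-name", "oracle-solubility", "property-cat-eve",
    "property-cat-safety", "property-cat-smell",
    "property-regression-adme", "property-regression-ld50",
    "property-regression-pka", "reaction-prediction",
    "retro-synthesis", "simple-formula",
]


def make_task_spec(mode: str = "loose") -> str:
    """Join all ether0 categories into LightEval's comma-separated task spec."""
    return ",".join(f"community|ether0:{mode}:{cat}|0|0" for cat in CATEGORIES)


print(make_task_spec())
```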

We unavoidably hit a few warnings from LightEval (huggingface/lighteval#800, huggingface/lighteval#801), but eventually a Markdown results table is printed:

| Task | Version | Metric | Value | | Stderr |
|---|---|---|---|---|---|
| all | | ether0_accuracy | 0.1714 | ± | 0.0540 |
| community:ether0:_average:0 | | ether0_accuracy | 0.1714 | ± | 0.0540 |
| community:ether0:loose:functional-group:0 | 0 | ether0_accuracy | 0.0000 | ± | 0.0000 |
| community:ether0:loose:molecule-completion:0 | 0 | ether0_accuracy | 0.1200 | ± | 0.0663 |
| community:ether0:loose:molecule-formula:0 | 0 | ether0_accuracy | 0.0000 | ± | 0.0000 |
| community:ether0:loose:molecule-name:0 | 0 | ether0_accuracy | 0.0400 | ± | 0.0400 |
| community:ether0:loose:oracle-solubility:0 | 0 | ether0_accuracy | 0.1600 | ± | 0.0748 |
| community:ether0:loose:property-cat-eve:0 | 0 | ether0_accuracy | 0.3600 | ± | 0.0980 |
| community:ether0:loose:property-cat-safety:0 | 0 | ether0_accuracy | 0.4400 | ± | 0.1013 |
| community:ether0:loose:property-cat-smell:0 | 0 | ether0_accuracy | 0.2000 | ± | 0.0816 |
| community:ether0:loose:property-regression-adme:0 | 0 | ether0_accuracy | 0.3600 | ± | 0.0980 |
| community:ether0:loose:property-regression-ld50:0 | 0 | ether0_accuracy | 0.4000 | ± | 0.1000 |
| community:ether0:loose:property-regression-pka:0 | 0 | ether0_accuracy | 0.3200 | ± | 0.0952 |
| community:ether0:loose:reaction-prediction:0 | 0 | ether0_accuracy | 0.0000 | ± | 0.0000 |
| community:ether0:loose:retro-synthesis:0 | 0 | ether0_accuracy | 0.0000 | ± | 0.0000 |
| community:ether0:loose:simple-formula:0 | 0 | ether0_accuracy | 0.0000 | ± | 0.0000 |

@jamesbraza jamesbraza self-assigned this Jun 10, 2025
@Copilot Copilot AI review requested due to automatic review settings June 10, 2025 02:30
@jamesbraza jamesbraza added the enhancement New feature or request label Jun 10, 2025
@Copilot Copilot AI left a comment

Pull Request Overview

This PR integrates LightEval into the ether0 benchmark by adding new SampleLevelMetric evaluations and task configurations while also updating dependency pins.

  • Introduces new LightEval tasks and metrics in ether0/lighteval_tasks.py.
  • Refactors problem type filtering functions in ether0/models.py.
  • Adds integration tests for custom tasks in tests/test_lighteval_tasks.py and updates dependency configurations in the pyproject.toml files.

Reviewed Changes

Copilot reviewed 6 out of 6 changed files in this pull request and generated 1 comment.

| File | Description |
|---|---|
| tests/test_lighteval_tasks.py | Added integration tests for custom LightEval tasks. |
| src/ether0/models.py | Refactored problem type filtering for clearer separation of concerns. |
| src/ether0/lighteval_tasks.py | Introduced new tasks and metric evaluation functions for LightEval integration. |
| pyproject.toml | Updated dependency list to include LightEval extras. |
| packages/remotes/pyproject.toml | Adjusted tensorboard pin version to ensure compatibility. |
Comments suppressed due to low confidence (1)

packages/remotes/pyproject.toml:45

  • Verify that downgrading the tensorboard pin to >=2.18 does not impact the features relied on in the remotes package; update or add tests if necessary.
"tensorboard>=2.18",  # Indirect dependency we pin to keep recent

"ether0",
"ether0.remotes[serve]",
"tensorboard>=2.19", # Indirect dependency we pin to keep recent
"tensorboard>=2.18", # Indirect dependency we pin to keep recent
@jamesbraza (Collaborator, Author) Jun 10, 2025
I loosened the pin here to allow package resolution, since lighteval==0.10.0 requires numpy v1: huggingface/lighteval#416
