
Feature Request: Enhancements for Test Orchestration, Non-Determinism, and Extensibility #2112

Open
@dalssoft

Description


First, let me say thank you for pydantic-evals. It has been the foundation of our LLM evaluation efforts.

To meet the demands of a production environment, I have built a comprehensive evaluation framework on top of pydantic-evals. This framework addresses challenges around test suite orchestration, handling non-determinism, and custom integrations. The features suggested below are not just ideas: I have a working implementation of most of them and am very willing to contribute pull requests to pydantic-evals if these suggestions align with the project's vision.


Suggested Features

1. A Standardized Evaluation Architecture (BaseEvaluation)

To enable advanced features like auto-discovery, we need a consistent way to define an evaluation. I've had great success using an abstract base class for this.

The Suggestion:
Introduce a class like BaseEvaluation that standardizes the structure of an evaluation. This provides a clear pattern for users, reduces boilerplate, and serves as the foundation for a test runner.

Example of how it might look:

from collections.abc import Callable

from pydantic_evals import BaseEvaluation, Case

class MyLLMEvaluation(BaseEvaluation):
    """A standard way to define a suite of tests."""

    def load_cases(self) -> list[Case]:
        # Logic to load cases from a file or database
        return [...]

    def get_task(self) -> Callable:
        # Returns the task function to be evaluated
        return my_llm_function

This simple structure makes evaluations self-contained and discoverable.
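
For concreteness, here is a minimal sketch of how the base class itself might be defined. This is just my proposal for discussion, not existing pydantic-evals API (only Case exists today):

from abc import ABC, abstractmethod
from collections.abc import Callable
from typing import Any

from pydantic_evals import Case


class BaseEvaluation(ABC):
    """Proposed base class: an evaluation is a set of cases plus the task they exercise."""

    @abstractmethod
    def load_cases(self) -> list[Case]:
        """Return the cases that make up this evaluation."""

    @abstractmethod
    def get_task(self) -> Callable[..., Any]:
        """Return the task function to be evaluated against the cases."""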


2. Test Suite Runner with Auto-Discovery

Building on the BaseEvaluation concept, a test runner could automatically find and execute all evaluations in a project.

The Suggestion:
A CLI command that scans a directory for classes inheriting from BaseEvaluation and runs them, producing a consolidated report. This mirrors the developer experience of tools like pytest.

Example of how it might look:

# Discover and run all evaluations in the 'evals/' directory
$ pydantic-evals run ./evals/

This would help in CI/CD pipelines, as it enables a single command to execute the entire system's evaluation suite.
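
For illustration, the discovery step itself can be small. A rough sketch using only the standard library, assuming the proposed BaseEvaluation from suggestion 1 (the function name is a placeholder):

import importlib.util
import inspect
from pathlib import Path

from pydantic_evals import BaseEvaluation  # proposed in suggestion 1, not existing API


def discover_evaluations(directory: str) -> list[type[BaseEvaluation]]:
    """Find concrete BaseEvaluation subclasses defined in *.py files under `directory`."""
    found: list[type[BaseEvaluation]] = []
    for path in Path(directory).rglob('*.py'):
        spec = importlib.util.spec_from_file_location(path.stem, path)
        if spec is None or spec.loader is None:
            continue
        module = importlib.util.module_from_spec(spec)
        spec.loader.exec_module(module)
        for _, obj in inspect.getmembers(module, inspect.isclass):
            if issubclass(obj, BaseEvaluation) and obj is not BaseEvaluation and not inspect.isabstract(obj):
                found.append(obj)
    return found

The CLI would then instantiate each discovered class, call load_cases() and get_task(), and run them through the normal evaluate() path into a consolidated report.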


3. Built-in Support for Non-Determinism

The framework should make it trivial to test the stability of both the agent and the evaluators.

A) Multiple Executions per Task

To check if an LLM task is stable, we must run it multiple times on the same input. My custom implementation includes a wrapper for this, but building it into evaluate() would be much cleaner.

Suggested API:

# The `runs_per_case` argument tells the evaluator to execute
# the task 3 times for each case. The evaluator would receive
# a list of 3 outputs for scoring.
results = await dataset.evaluate(
    task=my_llm_task,
    runs_per_case=3
)
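
Until something like this is built in, the behaviour can be approximated by wrapping the task. A rough sketch of the workaround I use today (the wrapper is my own helper, and it assumes evaluators are written to score a list of outputs):

import asyncio
from collections.abc import Awaitable, Callable
from typing import Any


def repeat_task(task: Callable[[Any], Awaitable[Any]], n: int) -> Callable[[Any], Awaitable[list[Any]]]:
    """Wrap an async task so each case's inputs are executed `n` times."""

    async def wrapper(inputs: Any) -> list[Any]:
        # Run the same inputs n times concurrently; evaluators then score the list of outputs.
        return await asyncio.gather(*(task(inputs) for _ in range(n)))

    return wrapper


# results = await dataset.evaluate(task=repeat_task(my_llm_task, 3))

Having runs_per_case in evaluate() would remove this wrapper and let the report show per-run results natively.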

B) Multiple Executions per Evaluator

This helps when using non-deterministic judges (like an LLMJudge). We need to ensure our measurements are stable.

Suggested API:

# The `runs_per_evaluator` argument would run the LLMJudge
# 5 times on the *same output* and could aggregate the scores
# (e.g., average) or report on their variance.
results = await dataset.evaluate(
    task=my_llm_task,
    evaluators=[LLMJudge(...)],
    runs_per_evaluator=5
)
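
The aggregation itself could stay simple. A sketch of the kind of summary I would expect from repeated judge runs on the same output (hypothetical helper, not existing API):

import statistics


def aggregate_scores(scores: list[float]) -> dict[str, float]:
    """Summarize repeated runs of a non-deterministic evaluator on one output."""
    return {
        'mean': statistics.fmean(scores),
        'stdev': statistics.stdev(scores) if len(scores) > 1 else 0.0,
        'min': min(scores),
        'max': max(scores),
    }


# aggregate_scores([0.8, 0.9, 0.7, 0.9, 0.8])
# -> mean 0.82, stdev ~0.084, min 0.7, max 0.9

A high standard deviation here is itself a useful signal that the judge prompt or rubric needs tightening.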

4. Callback System for Extensibility

To integrate evaluations with other tools (e.g., logging platforms, monitoring dashboards), a callback or plugin system could help.

The Suggestion:
Allow users to pass a list of callback handlers to the evaluate method. These handlers would have methods that are triggered at different points in the evaluation lifecycle.

Example of how it might look:

# A simple callback handler for custom logging
class WAndBLogger:
    def on_case_end(self, result: CaseResult):
        # Log metrics for the completed case to Weights & Biases
        log_to_wandb({"score": result.score, ...})

    def on_dataset_end(self, report: EvaluationReport):
        # Log the final summary report
        log_summary_to_wandb(report.to_dict())


# The callback instance is passed to the evaluate call
results = await dataset.evaluate(
    task=my_task,
    callbacks=[WAndBLogger()]
)

This would unlock a range of custom integrations and plugins without cluttering the core library.
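
To make the contract concrete, the handlers could be described with a structural Protocol, so the core library only dispatches to whatever hooks a handler actually implements. A sketch for discussion (all names below are placeholders, not existing pydantic-evals API):

from typing import Any, Protocol


class EvaluationCallback(Protocol):
    """Hypothetical shape of a callback handler."""

    def on_case_end(self, result: Any) -> None: ...

    def on_dataset_end(self, report: Any) -> None: ...


def dispatch(callbacks: list[Any], hook: str, payload: Any) -> None:
    # Invoke `hook` on every handler that defines it, so handlers can opt into a subset of events.
    for cb in callbacks:
        method = getattr(cb, hook, None)
        if callable(method):
            method(payload)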


I strongly believe these features would significantly enhance pydantic-evals for teams operating at scale. I'm ready to help draft PRs and collaborate on the design and implementation if these ideas are accepted or at least headed in the right direction. Thank you for your time and for this fantastic library!

