Description
First, let me say thank you for pydantic-evals. It has been the foundation of our LLM evaluation efforts.
To meet the demands of a production environment, I have built a comprehensive evaluation framework on top of pydantic-evals. This framework addresses challenges around test-suite orchestration, handling non-determinism, and custom integrations. The features suggested below are not just ideas: I have a working implementation of most of them and am very willing to contribute pull requests to pydantic-evals if these suggestions are aligned with the project's vision.
Suggested Features
1. A Standardized Evaluation Architecture (BaseEvaluation)
To enable advanced features like auto-discovery, we need a consistent way to define an evaluation. I've had great success using an abstract base class for this.
The Suggestion:
Introduce a class like BaseEvaluation that standardizes the structure of an evaluation. This provides a clear pattern for users, reduces boilerplate, and serves as the foundation for a test runner.
Example of how it might look:
```python
from collections.abc import Callable

from pydantic_evals import BaseEvaluation, Case


class MyLLMEvaluation(BaseEvaluation):
    """A standard way to define a suite of tests."""

    def load_cases(self) -> list[Case]:
        # Logic to load cases from a file or database
        return [...]

    def get_task(self) -> Callable:
        # Returns the task function to be evaluated (defined elsewhere)
        return my_llm_function
```
This simple structure makes evaluations self-contained and discoverable.
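For context, such a class would compose cleanly with the existing Dataset API. Here is a minimal sketch, assuming BaseEvaluation is added as proposed (Dataset and Case are the current pydantic-evals types; run_evaluation is just an illustrative helper):

```python
from pydantic_evals import Dataset


async def run_evaluation(evaluation: MyLLMEvaluation):
    # Build a Dataset from the cases the evaluation declares...
    dataset = Dataset(cases=evaluation.load_cases())
    # ...and run the task it exposes, returning the usual report.
    return await dataset.evaluate(evaluation.get_task())
```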
2. Test Suite Runner with Auto-Discovery
Building on the BaseEvaluation concept, a test runner could automatically find and execute all evaluations in a project.
The Suggestion:
A CLI command that scans a directory for classes inheriting from BaseEvaluation and runs them, producing a consolidated report. This mirrors the developer experience of tools like pytest.
Example of how it might look:
```bash
# Discover and run all evaluations in the 'evals/' directory
$ pydantic-evals run ./evals/
```
This would help in CI/CD, as it lets a single command execute the entire system's evaluation suite.
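To make the discovery step concrete, here is a rough sketch of how the runner could locate evaluations. It assumes only the proposed BaseEvaluation class and the standard library, and mirrors what my internal runner does:

```python
import importlib.util
import inspect
import pathlib

from pydantic_evals import BaseEvaluation  # the proposed base class


def discover_evaluations(root: str) -> list[type]:
    """Collect every BaseEvaluation subclass defined in .py files under `root`."""
    found: list[type] = []
    for path in pathlib.Path(root).rglob("*.py"):
        spec = importlib.util.spec_from_file_location(path.stem, path)
        if spec is None or spec.loader is None:
            continue
        module = importlib.util.module_from_spec(spec)
        spec.loader.exec_module(module)
        for _name, obj in inspect.getmembers(module, inspect.isclass):
            # Skip the base class itself; keep anything that inherits from it.
            if issubclass(obj, BaseEvaluation) and obj is not BaseEvaluation:
                found.append(obj)
    return found
```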
3. Built-in Support for Non-Determinism
The framework should make it trivial to test the stability of both the agent and the evaluators.
A) Multiple Executions per Task
To check if an LLM task is stable, we must run it multiple times on the same input. My custom implementation includes a wrapper for this, but building it into evaluate() would be much cleaner.
Suggested API:
```python
# The `runs_per_case` argument tells the evaluator to execute
# the task 3 times for each case. The evaluator would receive
# a list of 3 outputs for scoring.
results = await dataset.evaluate(
    task=my_llm_task,
    runs_per_case=3,
)
```
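In the meantime, the same behaviour can be approximated by wrapping the task so each case yields a list of outputs. A minimal sketch of that wrapper (the helper name is mine, not a pydantic-evals API):

```python
import asyncio
from collections.abc import Awaitable, Callable
from typing import TypeVar

InputT = TypeVar("InputT")
OutputT = TypeVar("OutputT")


def repeat_task(
    task: Callable[[InputT], Awaitable[OutputT]],
    runs: int,
) -> Callable[[InputT], Awaitable[list[OutputT]]]:
    """Wrap an async task so every case is executed `runs` times."""

    async def wrapped(inputs: InputT) -> list[OutputT]:
        # Execute the underlying task concurrently and hand evaluators
        # the full list of outputs instead of a single answer.
        return list(await asyncio.gather(*(task(inputs) for _ in range(runs))))

    return wrapped


# Usage (evaluators then score a list of outputs per case):
# results = await dataset.evaluate(repeat_task(my_llm_task, runs=3))
```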
B) Multiple Executions per Evaluator
This helps when using non-deterministic judges (like an LLMJudge). We need to ensure our measurements are stable.
Suggested API:
```python
# The `runs_per_evaluator` argument would run the LLMJudge
# 5 times on the *same output* and could aggregate the scores
# (e.g., average) or report on their variance.
results = await dataset.evaluate(
    task=my_llm_task,
    evaluators=[LLMJudge(...)],
    runs_per_evaluator=5,
)
```
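The aggregation itself is straightforward once the repeated judge scores are collected. Here is a small sketch of the statistics such an option could surface (plain Python, no pydantic-evals API assumed):

```python
import statistics


def aggregate_judge_scores(scores: list[float]) -> dict[str, float]:
    """Summarise repeated judge scores so their (in)stability is visible."""
    return {
        "mean": statistics.fmean(scores),
        "variance": statistics.pvariance(scores),  # 0.0 when all runs agree
        "min": min(scores),
        "max": max(scores),
    }


# e.g. aggregate_judge_scores([0.8, 0.7, 0.9, 0.8, 0.75])
```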
4. Callback System for Extensibility
To integrate evaluations with other tools (e.g., logging platforms, monitoring dashboards), a callback or plugin system could help.
The Suggestion:
Allow users to pass a list of callback handlers to the evaluate method. These handlers would have methods that are triggered at different points in the evaluation lifecycle.
Example of how it might look:
```python
# A simple callback handler for custom logging
class WAndBLogger:
    def on_case_end(self, result: CaseResult):
        # Log metrics for the completed case to Weights & Biases
        log_to_wandb({"score": result.score, ...})

    def on_dataset_end(self, report: EvaluationReport):
        # Log the final summary report
        log_summary_to_wandb(report.to_dict())


# The callback instance is passed to the evaluate call
results = await dataset.evaluate(
    task=my_task,
    callbacks=[WAndBLogger()],
)
```
This would unlock a range of custom integrations and plugins without cluttering the core library.
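One possible shape for the handler contract is a small base class with no-op hooks, so an integration only overrides the events it cares about. The hook names and argument types below are illustrative, not existing pydantic-evals types:

```python
from typing import Any


class EvaluationCallback:
    """No-op lifecycle hooks; integrations override only what they need."""

    def on_case_start(self, case: Any) -> None:
        pass

    def on_case_end(self, result: Any) -> None:
        pass

    def on_dataset_end(self, report: Any) -> None:
        pass
```

Keeping the hooks optional and no-op by default would leave the core evaluate() loop unchanged for users who pass no callbacks.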
I strongly believe these features would significantly enhance pydantic-evals for teams operating at scale. I'm ready to help draft PRs and collaborate on the design and implementation if these ideas are accepted or at least on the right path. Thank you for your time and for this fantastic library!