Description
First, let me say thank you for pydantic-evals. It has been the foundation of our LLM evaluation efforts.
To meet the demands of a production environment, I have built a comprehensive evaluation framework on top of pydantic-evals. This framework addresses challenges around test-suite orchestration, handling non-determinism, and custom integrations. The features suggested below are not just ideas: I have a working implementation of most of them and am very willing to contribute pull requests to pydantic-evals if these suggestions are aligned with the project's vision.
Suggested Features
1. A Standardized Evaluation Architecture (BaseEvaluation)
To enable advanced features like auto-discovery, we need a consistent way to define an evaluation. I've had great success using an abstract base class for this.
The Suggestion:
Introduce a class like BaseEvaluation that standardizes the structure of an evaluation. This provides a clear pattern for users, reduces boilerplate, and serves as the foundation for a test runner.
Example of how it might look:
```python
from collections.abc import Callable

from pydantic_evals import BaseEvaluation, Case


class MyLLMEvaluation(BaseEvaluation):
    """A standard way to define a suite of tests."""

    def load_cases(self) -> list[Case]:
        # Logic to load cases from a file or database
        return [...]

    def get_task(self) -> Callable:
        # Returns the task function to be evaluated (defined elsewhere)
        return my_llm_function
```
This simple structure makes evaluations self-contained and discoverable.
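For context, such a class would compose cleanly with the existing Dataset API. Here is a minimal sketch, assuming BaseEvaluation is added as proposed (Dataset and Case are the current pydantic-evals types; run_evaluation is just an illustrative helper):

```python
from pydantic_evals import Dataset


async def run_evaluation(evaluation: MyLLMEvaluation):
    # Build a Dataset from the cases the evaluation declares...
    dataset = Dataset(cases=evaluation.load_cases())
    # ...and run the task it exposes, returning the usual report.
    return await dataset.evaluate(evaluation.get_task())
```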
2. Test Suite Runner with Auto-Discovery
Building on the BaseEvaluation concept, a test runner could automatically find and execute all evaluations in a project.
The Suggestion:
A CLI command that scans a directory for classes inheriting from BaseEvaluation and runs them, producing a consolidated report. This mirrors the developer experience of tools like pytest.
Example of how it might look:
```bash
# Discover and run all evaluations in the 'evals/' directory
$ pydantic-evals run ./evals/
```
This would help in CI/CD, as it lets a single command execute the entire system's evaluation suite.
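To make the discovery step concrete, here is a rough sketch of how the runner could locate evaluations. It assumes only the proposed BaseEvaluation class and the standard library, and mirrors what my internal runner does:

```python
import importlib.util
import inspect
import pathlib

from pydantic_evals import BaseEvaluation  # the proposed base class


def discover_evaluations(root: str) -> list[type]:
    """Collect every BaseEvaluation subclass defined in .py files under `root`."""
    found: list[type] = []
    for path in pathlib.Path(root).rglob("*.py"):
        spec = importlib.util.spec_from_file_location(path.stem, path)
        if spec is None or spec.loader is None:
            continue
        module = importlib.util.module_from_spec(spec)
        spec.loader.exec_module(module)
        for _name, obj in inspect.getmembers(module, inspect.isclass):
            # Skip the base class itself; keep anything that inherits from it.
            if issubclass(obj, BaseEvaluation) and obj is not BaseEvaluation:
                found.append(obj)
    return found
```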
3. Built-in Support for Non-Determinism
The framework should make it trivial to test the stability of both the agent and the evaluators.
A) Multiple Executions per Task
To check if an LLM task is stable, we must run it multiple times on the same input. My custom implementation includes a wrapper for this, but building it into evaluate() would be much cleaner.
Suggested API:
```python
# The `runs_per_case` argument tells the evaluator to execute
# the task 3 times for each case. The evaluator would receive
# a list of 3 outputs for scoring.
results = await dataset.evaluate(
    task=my_llm_task,
    runs_per_case=3,
)
```
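In the meantime, the same behaviour can be approximated by wrapping the task so each case yields a list of outputs. A minimal sketch of that wrapper (the helper name is mine, not a pydantic-evals API):

```python
import asyncio
from collections.abc import Awaitable, Callable
from typing import TypeVar

InputT = TypeVar("InputT")
OutputT = TypeVar("OutputT")


def repeat_task(
    task: Callable[[InputT], Awaitable[OutputT]],
    runs: int,
) -> Callable[[InputT], Awaitable[list[OutputT]]]:
    """Wrap an async task so every case is executed `runs` times."""

    async def wrapped(inputs: InputT) -> list[OutputT]:
        # Execute the underlying task concurrently and hand evaluators
        # the full list of outputs instead of a single answer.
        return list(await asyncio.gather(*(task(inputs) for _ in range(runs))))

    return wrapped


# Usage (evaluators then score a list of outputs per case):
# results = await dataset.evaluate(repeat_task(my_llm_task, runs=3))
```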
B) Multiple Executions per Evaluator
This helps when using non-deterministic judges (like an LLMJudge). We need to ensure our measurements are stable.
Suggested API:
```python
# The `runs_per_evaluator` argument would run the LLMJudge
# 5 times on the *same output* and could aggregate the scores
# (e.g., average) or report on their variance.
results = await dataset.evaluate(
    task=my_llm_task,
    evaluators=[LLMJudge(...)],
    runs_per_evaluator=5,
)
```
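The aggregation itself is straightforward once the repeated judge scores are collected. Here is a small sketch of the statistics such an option could surface (plain Python, no pydantic-evals API assumed):

```python
import statistics


def aggregate_judge_scores(scores: list[float]) -> dict[str, float]:
    """Summarise repeated judge scores so their (in)stability is visible."""
    return {
        "mean": statistics.fmean(scores),
        "variance": statistics.pvariance(scores),  # 0.0 when all runs agree
        "min": min(scores),
        "max": max(scores),
    }


# e.g. aggregate_judge_scores([0.8, 0.7, 0.9, 0.8, 0.75])
```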
4. Callback System for Extensibility
To integrate evaluations with other tools (e.g., logging platforms, monitoring dashboards), a callback or plugin system could help.
The Suggestion:
Allow users to pass a list of callback handlers to the evaluate method. These handlers would have methods that are triggered at different points in the evaluation lifecycle.
Example of how it might look:
```python
# A simple callback handler for custom logging
class WAndBLogger:
    def on_case_end(self, result: CaseResult):
        # Log metrics for the completed case to Weights & Biases
        log_to_wandb({"score": result.score, ...})

    def on_dataset_end(self, report: EvaluationReport):
        # Log the final summary report
        log_summary_to_wandb(report.to_dict())


# The callback instance is passed to the evaluate call
results = await dataset.evaluate(
    task=my_task,
    callbacks=[WAndBLogger()],
)
```
This would unlock a range of custom integrations and plugins without cluttering the core library.
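One possible shape for the handler contract is a small base class with no-op hooks, so an integration only overrides the events it cares about. The hook names and argument types below are illustrative, not existing pydantic-evals types:

```python
from typing import Any


class EvaluationCallback:
    """No-op lifecycle hooks; integrations override only what they need."""

    def on_case_start(self, case: Any) -> None:
        pass

    def on_case_end(self, result: Any) -> None:
        pass

    def on_dataset_end(self, report: Any) -> None:
        pass
```

Keeping the hooks optional and no-op by default would leave the core evaluate() loop unchanged for users who pass no callbacks.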
I strongly believe these features would significantly enhance pydantic-evals for teams operating at scale. I'm ready to help draft PRs and collaborate on the design and implementation if these ideas are accepted or at least on the right path. Thank you for your time and for this fantastic library!