
Is it possible to trace agents and tool calls / expected tool calls in Pydantic Evals? #1650


Closed
dinhngoc267 opened this issue May 6, 2025 · 3 comments
Labels: evals, question (Further information is requested)

@dinhngoc267

dinhngoc267 commented May 6, 2025

Question

Hi,

In my case, an agent can handle a request without calling a tool (the answer is usually not as good as when it uses a tool, but I also can't force the agent to use tools, since depending on the request some of them don't need any). So I would like to evaluate the workflow, i.e. which agents are involved in the multi-agent system and which tools are called (and in what order), to understand how well the system is designed.

Currently, I don't see Pydantic Evals tracking this information. Is there any way to do this? And is it recommended?

Additional Context

No response

@dinhngoc267 dinhngoc267 added the question Further information is requested label May 6, 2025
@Kludex Kludex added the evals label May 6, 2025
@dinhngoc267
Author

Hi @Kludex, I think this is a little bit complicated; there might be multiple flows that can achieve an expected result.

So I think I would need the message history traced in the EvaluatorContext, which I could then use in a custom evaluator.
With the current implementation, I think I will put the message history in the agent's output alongside its actual answer.
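
Roughly what I have in mind (just a sketch; AnswerWithHistory is a made-up wrapper and the model name is a placeholder):

from dataclasses import dataclass

from pydantic_ai import Agent
from pydantic_ai.messages import ModelMessage


@dataclass
class AnswerWithHistory:
    answer: str
    messages: list[ModelMessage]


agent = Agent("openai:gpt-4o", instructions="Help the user with their query.")


async def task(query: str) -> AnswerWithHistory:
    result = await agent.run(query)
    # Return the full message history next to the actual answer, so a custom
    # evaluator can later check which tools were called and in what order.
    return AnswerWithHistory(answer=result.output, messages=result.all_messages())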

@HamzaFarhan
Contributor

from dataclasses import dataclass
from functools import partial
from typing import Any

from pydantic import BaseModel
from pydantic_ai import Agent
from pydantic_ai.messages import ToolCallPart
from pydantic_ai.models import KnownModelName
from pydantic_evals import Dataset


@dataclass
class Query[ResultT]:
    query: str
    output_type: type[ResultT]


class ToolCall(BaseModel):
    tool_name: str
    params: dict[str, Any]


@dataclass
class QueryResult[ResultT]:
    result: ResultT
    tool_calls: list[ToolCall]


async def task(inputs: Query, model: KnownModelName, agent_name: str) -> QueryResult:
    agent = Agent(model=model, name=agent_name, instructions="Help the user with their query.")
    tool_calls: list[ToolCall] = []
    async with agent.iter(user_prompt=inputs.query, output_type=inputs.output_type) as agent_run:
        async for node in agent_run:
            if agent.is_call_tools_node(node):
                # Record each tool call the model made, skipping the synthetic
                # "final_result" tool that is only used to return structured output.
                for part in node.model_response.parts:
                    if isinstance(part, ToolCallPart) and part.tool_name != "final_result":
                        tool_calls.append(ToolCall(tool_name=part.tool_name, params=part.args_as_dict()))

    res = agent_run.result.output if agent_run.result is not None else None
    return QueryResult(result=res, tool_calls=tool_calls)


async def evaluate(dataset: Dataset[Query, QueryResult], model: KnownModelName, agent_name: str):
    report = await dataset.evaluate(task=partial(task, model=model, agent_name=agent_name))
    report.print(
        include_input=True,
        include_output=True,
        include_expected_output=True,
        include_durations=True,
        include_total_duration=True,
        include_averages=True,
    )
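
Building the Dataset for this task could look roughly like this (a sketch; the case name, query, and expected tool call are made-up, and expected_output here just records which tools we expect to be called, in order):

from pydantic_evals import Case, Dataset

dataset = Dataset[Query, QueryResult](
    cases=[
        Case(
            name="weather_query",  # made-up example case
            inputs=Query(query="What is the weather in Paris?", output_type=str),
            # expected_output records the tools we expect to be called, in order.
            expected_output=QueryResult(
                result="It is sunny in Paris.",
                tool_calls=[ToolCall(tool_name="get_weather", params={"city": "Paris"})],
            ),
        ),
    ],
)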

So now in your Evaluator(s), you can evaluate both QueryResult.result and QueryResult.tool_calls however you want.
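
For example, a custom evaluator over the tool calls could look like this (a sketch; ToolCallOrder is a made-up name, and it assumes expected_output carries the expected tool calls as in the dataset sketch above):

from dataclasses import dataclass

from pydantic_evals.evaluators import Evaluator, EvaluatorContext


@dataclass
class ToolCallOrder(Evaluator[Query, QueryResult]):
    """Score 1.0 when the tools were called in exactly the expected order."""

    def evaluate(self, ctx: EvaluatorContext[Query, QueryResult]) -> float:
        if ctx.expected_output is None:
            return 0.0
        called = [tc.tool_name for tc in ctx.output.tool_calls]
        expected = [tc.tool_name for tc in ctx.expected_output.tool_calls]
        return 1.0 if called == expected else 0.0

You can attach it with dataset.add_evaluator(ToolCallOrder()) (or pass evaluators=[ToolCallOrder()] when constructing the Dataset) before calling dataset.evaluate(...).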

@dinhngoc267
Author

@HamzaFarhan Thank youuuuuuuu

@Kludex Kludex closed this as completed May 8, 2025