TaskAdherence V2 prompt updates #41616
Conversation
Pull Request Overview
This PR modernizes the TaskAdherence evaluator by overhauling the prompt to expect a structured JSON output and adding helper routines to normalize conversation inputs for more reliable LLM calls.
- Switched the prompt's response format to `json_object` and restructured the system prompt with a clear JSON schema and examples.
- Updated the evaluator implementation (`_task_adherence.py`) to apply the new formatting helpers and handle dictionary outputs from the LLM, including exposing full `additional_details`.
- Extended `utils.py` with three new functions (`reformat_conversation_history`, `reformat_agent_response`, and `reformat_tool_definitions`) to prepare inputs for the prompt.
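The helper names above come from the PR, but their bodies are not shown here. A minimal sketch of what a conversation-history reformatter like `reformat_conversation_history` might do (the real implementation in `utils.py` may differ):

```python
def reformat_conversation_history(query):
    """Flatten a list of chat messages into a labeled transcript string.

    Hypothetical sketch: the actual helper in utils.py may format roles,
    tool calls, and fallbacks differently.
    """
    if not isinstance(query, list):
        return str(query)
    lines = []
    for message in query:
        role = str(message.get("role", "unknown")).upper()
        content = message.get("content", "")
        lines.append(f"{role}: {content}")
    return "\n".join(lines)


history = [
    {"role": "system", "content": "Always use tools for factual queries."},
    {"role": "user", "content": "What's the weather in Tokyo?"},
]
print(reformat_conversation_history(history))
# SYSTEM: Always use tools for factual queries.
# USER: What's the weather in Tokyo?
```

Producing a single labeled string like this keeps the prompt template simple: the LLM sees one transcript block rather than a raw Python list of message dicts.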
Reviewed Changes
Copilot reviewed 4 out of 4 changed files in this pull request and generated 2 comments.
| File | Description |
|---|---|
| sdk/.../task_adherence.prompty | Changed output type to JSON object and rewrote the system prompt to define keys, steps, and scoring examples |
| sdk/.../_task_adherence/_task_adherence.py | Removed old parsing helper, imported new reformatters, mutated `eval_input`, unified output handling |
| sdk/.../_common/utils.py | Added conversation/response/tool-definition reformatters with fallback logic; imported `ErrorMessage` |
Comments suppressed due to low confidence (1)
sdk/evaluation/azure-ai-evaluation/azure/ai/evaluation/_task_adherence/_task_adherence.py:145
- [nitpick] Mutating `eval_input` in place may obscure the original data flow. Consider assigning the reformatted value to a new local variable to preserve the raw input for debugging.
eval_input['query'] = reformat_conversation_history(eval_input["query"])
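A non-mutating alternative to the flagged line could bind the formatted history into a fresh dict instead of overwriting the key. This is an illustrative sketch, not the PR's code; the stub helper stands in for the real one in `utils.py`:

```python
def reformat_conversation_history(query):
    # Stand-in for the real helper: join messages into one transcript string.
    return "\n".join(f"{m['role']}: {m['content']}" for m in query)


eval_input = {"query": [{"role": "user", "content": "hi"}]}

# Build a copy with the formatted history instead of mutating eval_input,
# so the raw message list stays available for debugging.
prompt_input = {**eval_input, "query": reformat_conversation_history(eval_input["query"])}

assert eval_input["query"] == [{"role": "user", "content": "hi"}]  # raw input preserved
```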
    try:
        conversation_history = _get_conversation_history(query)
        return _pretty_format_conversation_history(conversation_history)
    except:
Avoid using a bare `except`, which can hide unexpected errors; catch specific exceptions (e.g., `ValueError`, `KeyError`) and log the exception to aid debugging.
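The suggestion above could look like the following. This is a hedged sketch (the helper name and exception choices are illustrative, not the PR's final code):

```python
import logging

logger = logging.getLogger(__name__)


def format_history_safely(query):
    # Catch only the failures malformed message inputs are likely to raise,
    # and log them, rather than a bare `except` that hides everything.
    try:
        return "\n".join(f"{m['role']}: {m['content']}" for m in query)
    except (TypeError, KeyError) as exc:
        logger.warning("Could not format conversation history: %s", exc)
        return str(query)
```

Narrow exception clauses keep genuine bugs (e.g., a `NameError` in the formatter) visible instead of silently falling back.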
    # Higher inter-model variance (0.345 vs 0.607)
    # Lower percentage of mode in Likert scale (73.4% vs 75.4%)
    # Lower pairwise agreement between LLMs (85% vs 90% at the pass/fail level with threshold of 3)
    return query
The fallback returns the original message list rather than a string, which may break downstream prompts that expect a formatted string. Consider serializing or stringifying the original query for consistency.
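One way to address this comment is to serialize the fallback value so the function always returns a string. A sketch under that assumption (the function name is hypothetical):

```python
import json


def stringify_query(query):
    # Downstream prompts expect a formatted string, so serialize the
    # original message list instead of returning it unchanged.
    if isinstance(query, str):
        return query
    try:
        return json.dumps(query, default=str)
    except (TypeError, ValueError):
        return str(query)
```

`json.dumps` keeps the fallback readable and round-trippable, while `default=str` covers non-serializable values such as datetimes.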
…into ghyadav/task_adherence_v2

Conflicts:
- sdk/evaluation/azure-ai-evaluation/azure/ai/evaluation/_common/utils.py
- sdk/evaluation/azure-ai-evaluation/tests/unittests/test_utils.py
    CONVERSATION_HISTORY: |
      SYSTEM MESSAGE: Always use tools for factual queries. Never provide personal opinions.
      User: What's the weather in Tokyo?
    AGENT_RESPONSE: |
How are we defining multi-turn vs. single-turn few-shot examples? In the multi-turn case, do we still call this AGENT_RESPONSE?