Merge main into live #46241

Merged: 4 commits merged into live from main on May 17, 2025
46 changes: 23 additions & 23 deletions docs/ai/conceptual/evaluation-libraries.md
@@ -4,9 +4,9 @@ description: Learn about the Microsoft.Extensions.AI.Evaluation libraries, which
ms.topic: concept-article
ms.date: 05/13/2025
---
# The Microsoft.Extensions.AI.Evaluation libraries (Preview)
# The Microsoft.Extensions.AI.Evaluation libraries

The Microsoft.Extensions.AI.Evaluation libraries (currently in preview) simplify the process of evaluating the quality and accuracy of responses generated by AI models in .NET intelligent apps. Various metrics measure aspects like relevance, truthfulness, coherence, and completeness of the responses. Evaluations are crucial in testing, because they help ensure that the AI model performs as expected and provides reliable and accurate results.
The Microsoft.Extensions.AI.Evaluation libraries simplify the process of evaluating the quality and accuracy of responses generated by AI models in .NET intelligent apps. Various metrics measure aspects like relevance, truthfulness, coherence, and completeness of the responses. Evaluations are crucial in testing, because they help ensure that the AI model performs as expected and provides reliable and accurate results.

The evaluation libraries, which are built on top of the [Microsoft.Extensions.AI abstractions](../microsoft-extensions-ai.md), are composed of the following NuGet packages:

@@ -31,34 +31,34 @@ You can also customize to add your own evaluations by implementing the <xref:Mic

Quality evaluators measure response quality. They use an LLM to perform the evaluation.

| Metric | Description | Evaluator type |
|----------------|--------------------------------------------------------|----------------|
| `Relevance` | Evaluates how relevant a response is to a query | <xref:Microsoft.Extensions.AI.Evaluation.Quality.RelevanceEvaluator> |
| `Completeness` | Evaluates how comprehensive and accurate a response is | <xref:Microsoft.Extensions.AI.Evaluation.Quality.CompletenessEvaluator> |
| `Retrieval` | Evaluates performance in retrieving information for additional context | <xref:Microsoft.Extensions.AI.Evaluation.Quality.RetrievalEvaluator> |
| `Fluency` | Evaluates grammatical accuracy, vocabulary range, sentence complexity, and overall readability| <xref:Microsoft.Extensions.AI.Evaluation.Quality.FluencyEvaluator> |
| `Coherence` | Evaluates the logical and orderly presentation of ideas | <xref:Microsoft.Extensions.AI.Evaluation.Quality.CoherenceEvaluator> |
| `Equivalence` | Evaluates the similarity between the generated text and its ground truth with respect to a query | <xref:Microsoft.Extensions.AI.Evaluation.Quality.EquivalenceEvaluator> |
| `Groundedness` | Evaluates how well a generated response aligns with the given context | <xref:Microsoft.Extensions.AI.Evaluation.Quality.GroundednessEvaluator> |
| `Relevance (RTC)`, `Truth (RTC)`, and `Completeness (RTC)` | Evaluates how relevant, truthful, and complete a response is | <xref:Microsoft.Extensions.AI.Evaluation.Quality.RelevanceTruthAndCompletenessEvaluator>† |
| Evaluator type | Metric | Description |
|----------------------------------------------------------------------|-------------|-------------|
| <xref:Microsoft.Extensions.AI.Evaluation.Quality.RelevanceEvaluator> | `Relevance` | Evaluates how relevant a response is to a query |
| <xref:Microsoft.Extensions.AI.Evaluation.Quality.CompletenessEvaluator> | `Completeness` | Evaluates how comprehensive and accurate a response is |
| <xref:Microsoft.Extensions.AI.Evaluation.Quality.RetrievalEvaluator> | `Retrieval` | Evaluates performance in retrieving information for additional context |
| <xref:Microsoft.Extensions.AI.Evaluation.Quality.FluencyEvaluator> | `Fluency` | Evaluates grammatical accuracy, vocabulary range, sentence complexity, and overall readability|
| <xref:Microsoft.Extensions.AI.Evaluation.Quality.CoherenceEvaluator> | `Coherence` | Evaluates the logical and orderly presentation of ideas |
| <xref:Microsoft.Extensions.AI.Evaluation.Quality.EquivalenceEvaluator> | `Equivalence` | Evaluates the similarity between the generated text and its ground truth with respect to a query |
| <xref:Microsoft.Extensions.AI.Evaluation.Quality.GroundednessEvaluator> | `Groundedness` | Evaluates how well a generated response aligns with the given context |
| <xref:Microsoft.Extensions.AI.Evaluation.Quality.RelevanceTruthAndCompletenessEvaluator>† | `Relevance (RTC)`, `Truth (RTC)`, and `Completeness (RTC)` | Evaluates how relevant, truthful, and complete a response is |

† This evaluator is marked [experimental](../../fundamentals/syslib-diagnostics/experimental-overview.md).
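
To make the invocation pattern concrete, here's a minimal sketch of running one of these quality evaluators. It assumes an existing `IChatClient` named `chatClient` for the grading LLM; the `ChatConfiguration` constructor, the `EvaluateAsync` overload, and the `CoherenceMetricName` constant reflect the library's public surface as of recent versions and should be verified against the package reference.

```csharp
using Microsoft.Extensions.AI;
using Microsoft.Extensions.AI.Evaluation;
using Microsoft.Extensions.AI.Evaluation.Quality;

// `chatClient` is an existing IChatClient for the LLM that performs the grading.
ChatConfiguration chatConfiguration = new(chatClient);

var question = new ChatMessage(ChatRole.User, "What is induction cooking?");
ChatResponse answer = await chatClient.GetResponseAsync([question]);

// LLM-based quality evaluators are constructed directly and run per response.
IEvaluator evaluator = new CoherenceEvaluator();
EvaluationResult result = await evaluator.EvaluateAsync(question, answer, chatConfiguration);

// Quality scores are reported as numeric metrics, typically on a 1-5 scale.
NumericMetric coherence = result.Get<NumericMetric>(CoherenceEvaluator.CoherenceMetricName);
Console.WriteLine($"Coherence: {coherence.Value}");
```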

### Safety evaluators

Safety evaluators check for the presence of harmful, inappropriate, or unsafe content in a response. They rely on the Azure AI Foundry Evaluation service, which uses a model that's fine-tuned to perform evaluations.

| Metric | Description | Evaluator type |
|--------------------|-----------------------------------------------------------------------|------------------------------|
| `Groundedness Pro` | Uses a fine-tuned model hosted behind the Azure AI Foundry Evaluation service to evaluate how well a generated response aligns with the given context | <xref:Microsoft.Extensions.AI.Evaluation.Safety.GroundednessProEvaluator> |
| `Protected Material` | Evaluates response for the presence of protected material | <xref:Microsoft.Extensions.AI.Evaluation.Safety.ProtectedMaterialEvaluator> |
| `Ungrounded Attributes` | Evaluates a response for the presence of content that indicates ungrounded inference of human attributes | <xref:Microsoft.Extensions.AI.Evaluation.Safety.UngroundedAttributesEvaluator> |
| `Hate And Unfairness` | Evaluates a response for the presence of content that's hateful or unfair | <xref:Microsoft.Extensions.AI.Evaluation.Safety.HateAndUnfairnessEvaluator>† |
| `Self Harm` | Evaluates a response for the presence of content that indicates self harm | <xref:Microsoft.Extensions.AI.Evaluation.Safety.SelfHarmEvaluator>† |
| `Violence` | Evaluates a response for the presence of violent content | <xref:Microsoft.Extensions.AI.Evaluation.Safety.ViolenceEvaluator>† |
| `Sexual` | Evaluates a response for the presence of sexual content | <xref:Microsoft.Extensions.AI.Evaluation.Safety.SexualEvaluator>† |
| `Code Vulnerability` | Evaluates a response for the presence of vulnerable code | <xref:Microsoft.Extensions.AI.Evaluation.Safety.CodeVulnerabilityEvaluator> |
| `Indirect Attack` | Evaluates a response for the presence of indirect attacks, such as manipulated content, intrusion, and information gathering | <xref:Microsoft.Extensions.AI.Evaluation.Safety.IndirectAttackEvaluator> |
| Evaluator type | Metric | Description |
|---------------------------------------------------------------------------|--------------------|-------------|
| <xref:Microsoft.Extensions.AI.Evaluation.Safety.GroundednessProEvaluator> | `Groundedness Pro` | Uses a fine-tuned model hosted behind the Azure AI Foundry Evaluation service to evaluate how well a generated response aligns with the given context |
| <xref:Microsoft.Extensions.AI.Evaluation.Safety.ProtectedMaterialEvaluator> | `Protected Material` | Evaluates a response for the presence of protected material |
| <xref:Microsoft.Extensions.AI.Evaluation.Safety.UngroundedAttributesEvaluator> | `Ungrounded Attributes` | Evaluates a response for the presence of content that indicates ungrounded inference of human attributes |
| <xref:Microsoft.Extensions.AI.Evaluation.Safety.HateAndUnfairnessEvaluator>† | `Hate And Unfairness` | Evaluates a response for the presence of content that's hateful or unfair |
| <xref:Microsoft.Extensions.AI.Evaluation.Safety.SelfHarmEvaluator>† | `Self Harm` | Evaluates a response for the presence of content that indicates self harm |
| <xref:Microsoft.Extensions.AI.Evaluation.Safety.ViolenceEvaluator>† | `Violence` | Evaluates a response for the presence of violent content |
| <xref:Microsoft.Extensions.AI.Evaluation.Safety.SexualEvaluator>† | `Sexual` | Evaluates a response for the presence of sexual content |
| <xref:Microsoft.Extensions.AI.Evaluation.Safety.CodeVulnerabilityEvaluator> | `Code Vulnerability` | Evaluates a response for the presence of vulnerable code |
| <xref:Microsoft.Extensions.AI.Evaluation.Safety.IndirectAttackEvaluator> | `Indirect Attack` | Evaluates a response for the presence of indirect attacks, such as manipulated content, intrusion, and information gathering |

† In addition, the <xref:Microsoft.Extensions.AI.Evaluation.Safety.ContentHarmEvaluator> provides single-shot evaluation for the four metrics supported by `HateAndUnfairnessEvaluator`, `SelfHarmEvaluator`, `ViolenceEvaluator`, and `SexualEvaluator`.
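
As a rough sketch of how these evaluators are wired to the Azure AI Foundry Evaluation service, the following shows `ContentHarmEvaluator` in use. The `ContentSafetyServiceConfiguration` constructor arguments, the `ToChatConfiguration` helper, and all placeholder values are assumptions about the package's surface and should be checked against the reference documentation.

```csharp
using Azure.Identity;
using Microsoft.Extensions.AI;
using Microsoft.Extensions.AI.Evaluation;
using Microsoft.Extensions.AI.Evaluation.Safety;

// Point the safety evaluators at an Azure AI Foundry project (placeholder values).
var serviceConfiguration = new ContentSafetyServiceConfiguration(
    new DefaultAzureCredential(),
    subscriptionId: "<subscription-id>",
    resourceGroupName: "<resource-group>",
    projectName: "<ai-foundry-project>");

ChatConfiguration chatConfiguration = serviceConfiguration.ToChatConfiguration();

var question = new ChatMessage(ChatRole.User, "How do I replace a light switch?");
var answer = new ChatResponse(new ChatMessage(ChatRole.Assistant, "Turn off power at the breaker first, then..."));

// ContentHarmEvaluator covers hate/unfairness, self-harm, violence, and sexual content in one call.
IEvaluator evaluator = new ContentHarmEvaluator();
EvaluationResult result = await evaluator.EvaluateAsync(question, answer, chatConfiguration);
```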

3 changes: 0 additions & 3 deletions docs/ai/quickstarts/build-chat-app.md
@@ -14,9 +14,6 @@ zone_pivot_groups: openai-library

In this quickstart, you learn how to create a conversational .NET console chat app using an OpenAI or Azure OpenAI model. The app uses the <xref:Microsoft.Extensions.AI> library so you can write code using AI abstractions rather than a specific SDK. AI abstractions enable you to change the underlying AI model with minimal code changes.
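
To make the conversational flow concrete, here's a minimal sketch of the chat loop, assuming `client` is an `IChatClient` already connected to your OpenAI or Azure OpenAI deployment (member names such as `GetStreamingResponseAsync` reflect recent Microsoft.Extensions.AI versions and may differ in older ones).

```csharp
using Microsoft.Extensions.AI;

List<ChatMessage> history = [];

while (true)
{
    Console.Write("You: ");
    string? input = Console.ReadLine();
    if (string.IsNullOrWhiteSpace(input))
    {
        break;
    }

    history.Add(new ChatMessage(ChatRole.User, input));

    // Stream the assistant's reply to the console as it arrives.
    string reply = "";
    await foreach (ChatResponseUpdate update in client.GetStreamingResponseAsync(history))
    {
        Console.Write(update.Text);
        reply += update.Text;
    }

    Console.WriteLine();
    history.Add(new ChatMessage(ChatRole.Assistant, reply));
}
```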

> [!NOTE]
> The [`Microsoft.Extensions.AI`](https://www.nuget.org/packages/Microsoft.Extensions.AI/) library is currently in Preview.

:::zone target="docs" pivot="openai"

[!INCLUDE [openai-prereqs](includes/prerequisites-openai.md)]
20 changes: 9 additions & 11 deletions docs/ai/quickstarts/evaluate-ai-response.md
@@ -1,19 +1,17 @@
---
title: Quickstart - Evaluate a model's response
title: Quickstart - Evaluate the quality of a model's response
description: Learn how to create an MSTest app to evaluate the AI chat response of a language model.
ms.date: 03/18/2025
ms.topic: quickstart
ms.custom: devx-track-dotnet, devx-track-dotnet-ai
---

# Evaluate a model's response
# Evaluate the quality of a model's response

In this quickstart, you create an MSTest app to evaluate the chat response of an OpenAI model. The test app uses the [Microsoft.Extensions.AI.Evaluation](https://www.nuget.org/packages/Microsoft.Extensions.AI.Evaluation) libraries.
In this quickstart, you create an MSTest app to evaluate the quality of a chat response from an OpenAI model. The test app uses the [Microsoft.Extensions.AI.Evaluation](https://www.nuget.org/packages/Microsoft.Extensions.AI.Evaluation) libraries.

> [!NOTE]
>
> - The `Microsoft.Extensions.AI.Evaluation` library is currently in Preview.
> - This quickstart demonstrates the simplest usage of the evaluation API. Notably, it doesn't demonstrate use of the [response caching](../conceptual/evaluation-libraries.md#cached-responses) and [reporting](../conceptual/evaluation-libraries.md#reporting) functionality, which are important if you're authoring unit tests that run as part of an "offline" evaluation pipeline. The scenario shown in this quickstart is suitable in use cases such as "online" evaluation of AI responses within production code and logging scores to telemetry, where caching and reporting aren't relevant. For a tutorial that demonstrates the caching and reporting functionality, see [Tutorial: Evaluate a model's response with response caching and reporting](../tutorials/evaluate-with-reporting.md)
> This quickstart demonstrates the simplest usage of the evaluation API. Notably, it doesn't demonstrate use of the [response caching](../conceptual/evaluation-libraries.md#cached-responses) and [reporting](../conceptual/evaluation-libraries.md#reporting) functionality, which are important if you're authoring unit tests that run as part of an "offline" evaluation pipeline. The scenario shown in this quickstart is suitable in use cases such as "online" evaluation of AI responses within production code and logging scores to telemetry, where caching and reporting aren't relevant. For a tutorial that demonstrates the caching and reporting functionality, see [Tutorial: Evaluate a model's response with response caching and reporting](../tutorials/evaluate-with-reporting.md)

## Prerequisites

@@ -39,9 +37,9 @@ Complete the following steps to create an MSTest project that connects to the `g
```dotnetcli
dotnet add package Azure.AI.OpenAI
dotnet add package Azure.Identity
dotnet add package Microsoft.Extensions.AI.Abstractions --prerelease
dotnet add package Microsoft.Extensions.AI.Evaluation --prerelease
dotnet add package Microsoft.Extensions.AI.Evaluation.Quality --prerelease
dotnet add package Microsoft.Extensions.AI.Abstractions
dotnet add package Microsoft.Extensions.AI.Evaluation
dotnet add package Microsoft.Extensions.AI.Evaluation.Quality
dotnet add package Microsoft.Extensions.AI.OpenAI --prerelease
dotnet add package Microsoft.Extensions.Configuration
dotnet add package Microsoft.Extensions.Configuration.UserSecrets
@@ -51,9 +49,9 @@

```bash
dotnet user-secrets init
dotnet user-secrets set AZURE_OPENAI_ENDPOINT <your-azure-openai-endpoint>
dotnet user-secrets set AZURE_OPENAI_ENDPOINT <your-Azure-OpenAI-endpoint>
dotnet user-secrets set AZURE_OPENAI_GPT_NAME gpt-4o
dotnet user-secrets set AZURE_TENANT_ID <your-tenant-id>
dotnet user-secrets set AZURE_TENANT_ID <your-tenant-ID>
```

(Depending on your environment, the tenant ID might not be needed. In that case, remove it from the code that instantiates the <xref:Azure.Identity.DefaultAzureCredential>.)
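
For illustration, those secrets might be read back and turned into a credential as follows; the `EvaluationTests` class name is hypothetical, and whether you pass a tenant ID depends on your sign-in environment.

```csharp
using Azure.Identity;
using Microsoft.Extensions.Configuration;

IConfiguration config = new ConfigurationBuilder()
    .AddUserSecrets<EvaluationTests>() // EvaluationTests: hypothetical test class in this project
    .Build();

string endpoint = config["AZURE_OPENAI_ENDPOINT"]!;
string deployment = config["AZURE_OPENAI_GPT_NAME"]!;
string? tenantId = config["AZURE_TENANT_ID"];

// Omit the tenant ID when your environment doesn't require it.
DefaultAzureCredential credential = tenantId is null
    ? new DefaultAzureCredential()
    : new DefaultAzureCredential(new DefaultAzureCredentialOptions { TenantId = tenantId });
```
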
3 changes: 0 additions & 3 deletions docs/ai/quickstarts/prompt-model.md
@@ -14,9 +14,6 @@ zone_pivot_groups: openai-library

In this quickstart, you learn how to create a .NET console chat app to connect to and prompt an OpenAI or Azure OpenAI model. The app uses the <xref:Microsoft.Extensions.AI> library so you can write code using AI abstractions rather than a specific SDK. AI abstractions enable you to change the underlying AI model with minimal code changes.
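
As a rough sketch of what the finished app boils down to (Azure OpenAI pivot shown; `endpoint` and `deploymentName` are assumed to hold your own values, and `AsIChatClient`/`GetResponseAsync` reflect recent Microsoft.Extensions.AI.OpenAI versions):

```csharp
using Azure.AI.OpenAI;
using Azure.Identity;
using Microsoft.Extensions.AI;

// Connect to an Azure OpenAI deployment through the IChatClient abstraction.
IChatClient client =
    new AzureOpenAIClient(new Uri(endpoint), new DefaultAzureCredential())
        .GetChatClient(deploymentName)
        .AsIChatClient();

// Swapping AI providers later only changes how `client` is constructed.
ChatResponse response = await client.GetResponseAsync(
    "What is .NET? Answer in two sentences.");

Console.WriteLine(response.Text);
```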

> [!NOTE]
> The <xref:Microsoft.Extensions.AI> library is currently in Preview.

:::zone target="docs" pivot="openai"

[!INCLUDE [openai-prereqs](includes/prerequisites-openai.md)]
5 changes: 1 addition & 4 deletions docs/ai/quickstarts/structured-output.md
@@ -10,9 +10,6 @@ ms.custom: devx-track-dotnet, devx-track-dotnet-ai

In this quickstart, you create a chat app that requests a response with *structured output*. A structured output response is a chat response that's of a type you specify instead of just plain text. The chat app you create in this quickstart analyzes sentiment of various product reviews, categorizing each review according to the values of a custom enumeration.
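
For example, a structured-output request for this sentiment scenario might look like the following sketch, assuming `client` is an existing `IChatClient` and the generic `GetResponseAsync<T>` overload available in recent Microsoft.Extensions.AI versions:

```csharp
using Microsoft.Extensions.AI;

string review = "The laptop is fast and the battery lasts all day. Highly recommended!";

// Ask for a response typed as ReviewSentiment instead of free-form text.
ChatResponse<ReviewSentiment> response =
    await client.GetResponseAsync<ReviewSentiment>(
        $"Classify the sentiment of the following product review: {review}");

Console.WriteLine($"Sentiment: {response.Result}");

// The custom enumeration the model's answer is mapped onto.
enum ReviewSentiment
{
    Positive,
    Negative,
    Neutral
}
```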

> [!NOTE]
> The <xref:Microsoft.Extensions.AI> library, which is used in this quickstart, is currently in Preview.

## Prerequisites

- [.NET 8 or a later version](https://dotnet.microsoft.com/download)
@@ -37,7 +34,7 @@ Complete the following steps to create a console app that connects to the `gpt-4
```dotnetcli
dotnet add package Azure.AI.OpenAI
dotnet add package Azure.Identity
dotnet add package Microsoft.Extensions.AI --prerelease
dotnet add package Microsoft.Extensions.AI
dotnet add package Microsoft.Extensions.AI.OpenAI --prerelease
dotnet add package Microsoft.Extensions.Configuration
dotnet add package Microsoft.Extensions.Configuration.UserSecrets
7 changes: 2 additions & 5 deletions docs/ai/quickstarts/use-function-calling.md
@@ -14,9 +14,6 @@ zone_pivot_groups: openai-library

In this quickstart, you create a .NET console AI chat app to connect to an AI model with local function calling enabled. The app uses the <xref:Microsoft.Extensions.AI> library so you can write code using AI abstractions rather than a specific SDK. AI abstractions enable you to change the underlying AI model with minimal code changes.
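
The core of the local function-calling setup looks roughly like this sketch; `innerClient` is assumed to be an `IChatClient` for your model, and `GetWeather` is a made-up demo function.

```csharp
using System.ComponentModel;
using Microsoft.Extensions.AI;

// Wrap the underlying client so tool calls from the model are invoked automatically.
IChatClient client = new ChatClientBuilder(innerClient)
    .UseFunctionInvocation()
    .Build();

ChatOptions options = new()
{
    Tools = [AIFunctionFactory.Create(GetWeather)]
};

ChatResponse response = await client.GetResponseAsync(
    "Should I bring an umbrella in Paris today?", options);

Console.WriteLine(response.Text);

// A local .NET method the model is allowed to call.
[Description("Gets the current weather for a given city.")]
static string GetWeather(string city) => $"It's 22 °C and sunny in {city}.";
```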

> [!NOTE]
> The [`Microsoft.Extensions.AI`](https://www.nuget.org/packages/Microsoft.Extensions.AI/) library is currently in Preview.

:::zone target="docs" pivot="openai"

[!INCLUDE [openai-prereqs](includes/prerequisites-openai.md)]
@@ -54,7 +51,7 @@ Complete the following steps to create a .NET console app to connect to an AI mo
```bash
dotnet add package Azure.Identity
dotnet add package Azure.AI.OpenAI
dotnet add package Microsoft.Extensions.AI --prerelease
dotnet add package Microsoft.Extensions.AI
dotnet add package Microsoft.Extensions.AI.OpenAI --prerelease
dotnet add package Microsoft.Extensions.Configuration
dotnet add package Microsoft.Extensions.Configuration.UserSecrets
@@ -65,7 +62,7 @@ Complete the following steps to create a .NET console app to connect to an AI mo
:::zone target="docs" pivot="openai"

```bash
dotnet add package Microsoft.Extensions.AI --prerelease
dotnet add package Microsoft.Extensions.AI
dotnet add package Microsoft.Extensions.AI.OpenAI --prerelease
dotnet add package Microsoft.Extensions.Configuration
dotnet add package Microsoft.Extensions.Configuration.UserSecrets
6 changes: 4 additions & 2 deletions docs/ai/toc.yml
@@ -81,9 +81,11 @@ items:
items:
- name: The Microsoft.Extensions.AI.Evaluation libraries
href: conceptual/evaluation-libraries.md
- name: "Quickstart: Evaluate a model's response"
- name: "Quickstart: Evaluate the quality of a response"
href: quickstarts/evaluate-ai-response.md
- name: "Tutorial: Evaluate a response with response caching and reporting"
- name: "Tutorial: Evaluate the safety of a response"
href: tutorials/evaluate-safety.md
- name: "Tutorial: Evaluate a response with caching and reporting"
href: tutorials/evaluate-with-reporting.md
- name: Resources
items: