
Comprehensive Guide to LLM & RAG System Evaluation Metrics
Dipanjan (DJ)

Standard RAG Evaluation Metrics

Two major components of a RAG system need evaluation (a sketch of the inputs these metrics consume follows this list):

Retriever: measures retrieval performance from the Vector DB for input queries
- Contextual Precision: retrieved context relevant to the input query should be ranked higher than irrelevant context
- Contextual Recall: the retrieved context should cover the information in the expected ground-truth response
- Contextual Relevancy: a higher proportion of the statements in the retrieved context should be relevant to the input query

Generator: measures the quality of responses generated by the LLM for input queries and retrieved context
- Answer Relevancy: a higher proportion of the statements in the generated response should be relevant to the input query (judged by an LLM or via semantic similarity)
- Faithfulness: a higher proportion of the claims in the generated response should be truthful with respect to the retrieved context
- Hallucination Check: the number of statements in the generated response that contradict the ground-truth context should be minimal
- Custom LLM as a Judge: you can create your own judging metrics based on custom evaluation criteria as needed
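To make those inputs concrete, here is a minimal sketch of the four artifacts the metrics above operate on, written as a plain Python dataclass; the class and field names are illustrative only and are not taken from any particular library.

```python
from dataclasses import dataclass
from typing import List

@dataclass
class RAGEvalCase:
    """Illustrative container for the artifacts a RAG evaluation needs."""
    query: str                    # input query sent to the RAG system
    retrieved_context: List[str]  # context chunks (nodes) returned by the retriever
    response: str                 # answer generated by the LLM
    ground_truth: str = ""        # expected response; needed for recall and hallucination checks
```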
Retriever Metrics

Context Precision: Measures whether retrieved context document chunks (nodes) that are relevant to the given input query are ranked higher than irrelevant ones

Context Recall: Measures the extent to which the retrieved context document chunks (nodes) align with the expected response (ground-truth reference)

Context Relevancy: Measures the relevancy of the information in the retrieved context document chunks (nodes) to the given input query



Contextual Precision

Measures whether retrieved context document chunks (nodes) that are relevant to the given input query are ranked higher than irrelevant ones

A higher Contextual Precision score represents a better retrieval system that can correctly rank relevant nodes higher



Contextual Precision: Example and Metric Computation (figure)
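For reference, one common formulation of this metric (the one used by DeepEval, for example) is a weighted cumulative precision over the ranked retrieved nodes:

\[
\text{Contextual Precision} = \frac{1}{|R|} \sum_{k=1}^{n} \left( \frac{\text{number of relevant nodes in the top } k}{k} \cdot r_k \right)
\]

where \(n\) is the number of retrieved nodes, \(r_k = 1\) if the node at rank \(k\) is relevant to the input query (0 otherwise), and \(|R|\) is the total number of relevant nodes retrieved. The score is 1 when every relevant node is ranked above every irrelevant one.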


Contextual Recall

Measures the extent to which the retrieved context document chunks (nodes) align with the expected response (ground-truth reference)

A higher Contextual Recall score represents a better retrieval system that can capture all relevant context information from your Vector DB



Contextual Recall: Example and Metric Computation (figure)
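For reference, a common formulation (e.g., DeepEval's) breaks the expected response into statements and checks how many can be attributed to the retrieved context:

\[
\text{Contextual Recall} = \frac{\text{number of statements in the expected response attributable to the retrieved context}}{\text{total number of statements in the expected response}}
\]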


Contextual Relevancy

Measures the relevancy of the information in the retrieved context document chunks (nodes) to the given input query

A higher Contextual Relevancy score represents a better retrieval system that can retrieve more semantically relevant nodes for queries



Contextual Relevancy: Example and Metric Computation (figure)
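For reference, a common formulation (e.g., DeepEval's) breaks the retrieved context into statements and checks how many are relevant to the input query:

\[
\text{Contextual Relevancy} = \frac{\text{number of statements in the retrieved context relevant to the input query}}{\text{total number of statements in the retrieved context}}
\]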


Generator Metrics

Answer Relevancy (LLM Based): Measures the relevancy of the information in the generated response to the provided input query, using an LLM as a Judge

Answer Relevancy (Similarity Based): Measures the relevancy of the information in the generated response to the provided input query, using the semantic similarity between the input query and queries an LLM generates back from the response

Faithfulness: Measures whether the information in the generated response factually aligns with the contents of the retrieved context document chunks (nodes)

Hallucination Check: Measures the proportion of contradictory statements by comparing the generated response to the expected context document chunks (ground-truth reference)

Custom LLM-as-a-Judge: G-Eval is a framework that uses LLMs with chain-of-thought (CoT) reasoning to evaluate LLM responses and retrieved contexts based on ANY custom criteria



Answer Relevancy - LLM Based

Measures the relevancy of the information in the generated response to the provided input query, using an LLM as a Judge

A higher Answer Relevancy score shows the LLM generator is able to produce higher-quality, more relevant responses for queries



Answer Relevancy - LLM Based: Example and Metric Computation (figure)
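For reference, a common LLM-judged formulation (e.g., DeepEval's) breaks the response into statements and checks how many are relevant to the input query:

\[
\text{Answer Relevancy} = \frac{\text{number of statements in the response relevant to the input query}}{\text{total number of statements in the response}}
\]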


Answer Relevancy - Similarity Based

Measures the relevancy of the information in the generated response to the provided input query, using the semantic similarity between the input query and queries an LLM generates back from the response

A higher Answer Relevancy score shows the LLM generator is able to produce higher-quality, more relevant responses for queries



Answer Relevancy - Similarity Based: Example and Metric Computation (figure)
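For reference, a common similarity-based formulation (popularised by Ragas) has an LLM generate N questions back from the response and averages their embedding similarity with the original query:

\[
\text{Answer Relevancy} = \frac{1}{N} \sum_{i=1}^{N} \cos\left(E_{q_i}, E_{q}\right)
\]

where \(E_{q}\) is the embedding of the original query and \(E_{q_i}\) is the embedding of the \(i\)-th generated question.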


Faithfulness

Measures whether the information in the generated response factually aligns with the contents of the retrieved context document chunks (nodes)

Higher Faithfulness means the generated response is more grounded in the retrieved context, reducing contradictions



Faithfulness: Example and Metric Computation (figure)
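For reference, a common formulation (e.g., DeepEval's) extracts claims from the response and checks how many are supported by the retrieved context:

\[
\text{Faithfulness} = \frac{\text{number of claims in the response supported by the retrieved context}}{\text{total number of claims in the response}}
\]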


Hallucination Check

Measures the proportion of contradictory statements by comparing the generated response to the expected context document chunks (ground-truth reference)

The lower the hallucination score, the lower the proportion of contradictory statements, making the response more grounded and relevant



Hallucination Check: Example and Metric Computation (figure)
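For reference, a common formulation (e.g., DeepEval's) checks each ground-truth context chunk for contradictions with the response:

\[
\text{Hallucination} = \frac{\text{number of context chunks contradicted by the response}}{\text{total number of context chunks}}
\]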


Custom LLM as a Judge

G-Eval is a framework that uses LLMs with chain-of-thought (CoT) reasoning to evaluate LLM responses and retrieved contexts based on ANY custom criteria

You can decide the judging criteria and give detailed evaluation steps using prompt instructions, as in the sketch below

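A minimal sketch of a custom judge using DeepEval's GEval metric; the criteria text, threshold, and example strings are illustrative, and the exact API may differ across DeepEval versions, so check the current docs before relying on it.

```python
from deepeval.metrics import GEval
from deepeval.test_case import LLMTestCase, LLMTestCaseParams

# Define the judging criteria in natural language; G-Eval turns this into
# chain-of-thought evaluation steps and scores the test case with an LLM judge.
correctness = GEval(
    name="Correctness",
    criteria="Judge whether the actual output is factually consistent with the expected output.",
    evaluation_params=[
        LLMTestCaseParams.ACTUAL_OUTPUT,
        LLMTestCaseParams.EXPECTED_OUTPUT,
    ],
    threshold=0.7,  # illustrative pass/fail cutoff
)

test_case = LLMTestCase(
    input="What causes seasons on Earth?",
    actual_output="Seasons are caused by the tilt of the Earth's axis.",
    expected_output="The tilt of Earth's rotational axis relative to its orbital plane causes the seasons.",
)

correctness.measure(test_case)
print(correctness.score, correctness.reason)
```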




RAG Evaluation Example with DeepEval

You can leverage libraries like DeepEval and Ragas to make this easier, or even create your own custom eval metrics; a minimal sketch with DeepEval follows.

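A minimal end-to-end sketch using DeepEval's built-in RAG metrics; the query, response, and context strings are placeholders, the metrics use DeepEval's default judge model (which typically requires an LLM API key), and the exact API may vary by version.

```python
from deepeval import evaluate
from deepeval.test_case import LLMTestCase
from deepeval.metrics import (
    ContextualPrecisionMetric,
    ContextualRecallMetric,
    ContextualRelevancyMetric,
    AnswerRelevancyMetric,
    FaithfulnessMetric,
)

# One evaluation case: input query, generated answer, ground-truth answer,
# and the chunks the retriever returned for this query.
test_case = LLMTestCase(
    input="What is the capital of France?",
    actual_output="The capital of France is Paris.",
    expected_output="Paris is the capital of France.",
    retrieval_context=["Paris is the capital and most populous city of France."],
)

metrics = [
    ContextualPrecisionMetric(),   # retriever: relevant chunks ranked higher
    ContextualRecallMetric(),      # retriever: coverage of the expected answer
    ContextualRelevancyMetric(),   # retriever: relevance of retrieved statements
    AnswerRelevancyMetric(),       # generator: response relevance to the query
    FaithfulnessMetric(),          # generator: grounding in retrieved context
]

evaluate(test_cases=[test_case], metrics=metrics)
```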
