
Comprehensive Guide to LLM & RAG System Evaluation Metrics
Dipanjan (DJ)

Standard RAG Evaluation Metrics

Two major components of a RAG system need evaluation (a sketch of the inputs these metrics consume follows this list):

Retriever: measures retrieval performance from the Vector DB for input queries
- Contextual Precision: retrieved context relevant to the input query should be ranked higher than irrelevant context
- Contextual Recall: the retrieved context should cover the information in the expected ground-truth response
- Contextual Relevancy: a higher proportion of the statements in the retrieved context should be relevant to the input query

Generator: measures the quality of responses generated by the LLM for input queries and retrieved context
- Answer Relevancy: a higher proportion of the statements in the generated response should be relevant to the input query (judged by an LLM or via semantic similarity)
- Faithfulness: a higher proportion of the claims in the generated response should be truthful with respect to the retrieved context
- Hallucination Check: the number of statements in the generated response that contradict the ground-truth context should be minimal
- Custom LLM as a Judge: you can create your own judging metrics based on custom evaluation criteria as needed
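To make those inputs concrete, here is a minimal sketch of the four artifacts the metrics above operate on, written as a plain Python dataclass; the class and field names are illustrative only and are not taken from any particular library.

```python
from dataclasses import dataclass
from typing import List

@dataclass
class RAGEvalCase:
    """Illustrative container for the artifacts a RAG evaluation needs."""
    query: str                    # input query sent to the RAG system
    retrieved_context: List[str]  # context chunks (nodes) returned by the retriever
    response: str                 # answer generated by the LLM
    ground_truth: str = ""        # expected response; needed for recall and hallucination checks
```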
Retriever Metrics

Context Precision: Measures whether retrieved context document chunks (nodes) that are relevant to the given input query are ranked higher than irrelevant ones

Context Recall: Measures the extent to which the retrieved context document chunks (nodes) align with the expected response (ground-truth reference)

Context Relevancy: Measures the relevancy of the information in the retrieved context document chunks (nodes) to the given input query



Contextual Precision

Measures whether retrieved context document chunks (nodes) that are relevant to the given input query are ranked higher than irrelevant ones

A higher Contextual Precision score represents a better retrieval system that can correctly rank relevant nodes higher



Contextual Precision: Example and Metric Computation (figure)
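For reference, one common formulation of this metric (the one used by DeepEval, for example) is a weighted cumulative precision over the ranked retrieved nodes:

\[
\text{Contextual Precision} = \frac{1}{|R|} \sum_{k=1}^{n} \left( \frac{\text{number of relevant nodes in the top } k}{k} \cdot r_k \right)
\]

where \(n\) is the number of retrieved nodes, \(r_k = 1\) if the node at rank \(k\) is relevant to the input query (0 otherwise), and \(|R|\) is the total number of relevant nodes retrieved. The score is 1 when every relevant node is ranked above every irrelevant one.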


Contextual Recall

Measures the extent to which the retrieved context document chunks (nodes) align with the expected response (ground-truth reference)

A higher Contextual Recall score represents a better retrieval system that can capture all relevant context information from your Vector DB



Contextual Recall: Example and Metric Computation (figure)
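For reference, a common formulation (e.g., DeepEval's) breaks the expected response into statements and checks how many can be attributed to the retrieved context:

\[
\text{Contextual Recall} = \frac{\text{number of statements in the expected response attributable to the retrieved context}}{\text{total number of statements in the expected response}}
\]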


Contextual Relevancy

Measures the relevancy of the information in the retrieved context document chunks (nodes) to the given input query

A higher Contextual Relevancy score represents a better retrieval system that can retrieve more semantically relevant nodes for queries



Contextual Relevancy: Example and Metric Computation (figure)
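For reference, a common formulation (e.g., DeepEval's) breaks the retrieved context into statements and checks how many are relevant to the input query:

\[
\text{Contextual Relevancy} = \frac{\text{number of statements in the retrieved context relevant to the input query}}{\text{total number of statements in the retrieved context}}
\]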


Generator Metrics

Answer Relevancy (LLM Based): Measures the relevancy of the information in the generated response to the provided input query, using an LLM as a Judge

Answer Relevancy (Similarity Based): Measures the relevancy of the information in the generated response to the provided input query, using the semantic similarity between the input query and queries an LLM generates back from the response

Faithfulness: Measures whether the information in the generated response factually aligns with the contents of the retrieved context document chunks (nodes)

Hallucination Check: Measures the proportion of contradictory statements by comparing the generated response to the expected context document chunks (ground-truth reference)

Custom LLM-as-a-Judge: G-Eval is a framework that uses LLMs with chain-of-thought (CoT) reasoning to evaluate LLM responses and retrieved contexts based on ANY custom criteria



Answer Relevancy - LLM Based

Measures the relevancy of the information in the generated response to the provided input query, using an LLM as a Judge

A higher Answer Relevancy score shows the LLM generator is able to produce higher-quality, more relevant responses for queries



Answer Relevancy - LLM Based: Example and Metric Computation (figure)
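For reference, a common LLM-judged formulation (e.g., DeepEval's) breaks the response into statements and checks how many are relevant to the input query:

\[
\text{Answer Relevancy} = \frac{\text{number of statements in the response relevant to the input query}}{\text{total number of statements in the response}}
\]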


Answer Relevancy - Similarity Based

Measures the relevancy of the information in the generated response to the provided input query, using the semantic similarity between the input query and queries an LLM generates back from the response

A higher Answer Relevancy score shows the LLM generator is able to produce higher-quality, more relevant responses for queries



Answer Relevancy - Similarity Based: Example and Metric Computation (figure)
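For reference, a common similarity-based formulation (popularised by Ragas) has an LLM generate N questions back from the response and averages their embedding similarity with the original query:

\[
\text{Answer Relevancy} = \frac{1}{N} \sum_{i=1}^{N} \cos\left(E_{q_i}, E_{q}\right)
\]

where \(E_{q}\) is the embedding of the original query and \(E_{q_i}\) is the embedding of the \(i\)-th generated question.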


Faithfulness

Measures whether the information in the generated response factually aligns with the contents of the retrieved context document chunks (nodes)

Higher Faithfulness means the generated response is more grounded in the retrieved context, reducing contradictions



Faithfulness: Example and Metric Computation (figure)
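For reference, a common formulation (e.g., DeepEval's) extracts claims from the response and checks how many are supported by the retrieved context:

\[
\text{Faithfulness} = \frac{\text{number of claims in the response supported by the retrieved context}}{\text{total number of claims in the response}}
\]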


Hallucination Check

Measures the proportion of contradictory statements by comparing the generated response to the expected context document chunks (ground-truth reference)

The lower the hallucination score, the lower the proportion of contradictory statements, making the response more grounded and relevant



Hallucination Check: Example and Metric Computation (figure)
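For reference, a common formulation (e.g., DeepEval's) checks each ground-truth context chunk for contradictions with the response:

\[
\text{Hallucination} = \frac{\text{number of context chunks contradicted by the response}}{\text{total number of context chunks}}
\]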


Custom LLM as a Judge

G-Eval is a framework that uses LLMs with chain-of-thought (CoT) reasoning to evaluate LLM responses and retrieved contexts based on ANY custom criteria

You can decide the judging criteria and give detailed evaluation steps using prompt instructions, as in the sketch below

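A minimal sketch of a custom judge using DeepEval's GEval metric; the criteria text, threshold, and example strings are illustrative, and the exact API may differ across DeepEval versions, so check the current docs before relying on it.

```python
from deepeval.metrics import GEval
from deepeval.test_case import LLMTestCase, LLMTestCaseParams

# Define the judging criteria in natural language; G-Eval turns this into
# chain-of-thought evaluation steps and scores the test case with an LLM judge.
correctness = GEval(
    name="Correctness",
    criteria="Judge whether the actual output is factually consistent with the expected output.",
    evaluation_params=[
        LLMTestCaseParams.ACTUAL_OUTPUT,
        LLMTestCaseParams.EXPECTED_OUTPUT,
    ],
    threshold=0.7,  # illustrative pass/fail cutoff
)

test_case = LLMTestCase(
    input="What causes seasons on Earth?",
    actual_output="Seasons are caused by the tilt of the Earth's axis.",
    expected_output="The tilt of Earth's rotational axis relative to its orbital plane causes the seasons.",
)

correctness.measure(test_case)
print(correctness.score, correctness.reason)
```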




RAG Evaluation Example with DeepEval

You can leverage libraries like DeepEval and Ragas to make this easier, or even create your own custom eval metrics; a minimal sketch with DeepEval follows.

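A minimal end-to-end sketch using DeepEval's built-in RAG metrics; the query, response, and context strings are placeholders, the metrics use DeepEval's default judge model (which typically requires an LLM API key), and the exact API may vary by version.

```python
from deepeval import evaluate
from deepeval.test_case import LLMTestCase
from deepeval.metrics import (
    ContextualPrecisionMetric,
    ContextualRecallMetric,
    ContextualRelevancyMetric,
    AnswerRelevancyMetric,
    FaithfulnessMetric,
)

# One evaluation case: input query, generated answer, ground-truth answer,
# and the chunks the retriever returned for this query.
test_case = LLMTestCase(
    input="What is the capital of France?",
    actual_output="The capital of France is Paris.",
    expected_output="Paris is the capital of France.",
    retrieval_context=["Paris is the capital and most populous city of France."],
)

metrics = [
    ContextualPrecisionMetric(),   # retriever: relevant chunks ranked higher
    ContextualRecallMetric(),      # retriever: coverage of the expected answer
    ContextualRelevancyMetric(),   # retriever: relevance of retrieved statements
    AnswerRelevancyMetric(),       # generator: response relevance to the query
    FaithfulnessMetric(),          # generator: grounding in retrieved context
]

evaluate(test_cases=[test_case], metrics=metrics)
```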
