Understanding RAG Systems: The Future of AI Interactions

Data Science

Date : 10/23/2024


Explore the intricacies of RAG evaluation, its challenges, and techniques. Learn how RAG systems enhance AI interactions with improved context and reduced bias.

Dipanjan Karanjai
Associate Manager, Data Science

Step into our article where we unravel the intricacies and challenges surrounding RAG evaluation, shedding light on this groundbreaking technology.

RAG systems: the future of AI-driven interactions

Retrieval-augmented generation, or RAG, significantly advances natural language processing. By integrating retrieval mechanisms into generative models like GPT, RAG enables AI systems to access and incorporate external knowledge sources during text generation. RAG systems have gained popularity due to the following pros:

  1. Improved Contextual Understanding: RAG models can leverage a vast array of external knowledge, structured or unstructured, such as databases, tabular data, articles, or books, to generate more contextually relevant and accurate responses. This lets AI systems understand and respond to a wider range of queries and prompts.
  2. Enhanced Content Creation: RAG enables AI systems to generate content that is not only coherent but also backed by factual information from external sources. This can be particularly valuable in applications like content generation, where accuracy and relevance are crucial.
  3. Better Question Answering: RAG models excel at question-answering tasks by retrieving relevant information from external sources and incorporating it into their responses. This capability can significantly improve the performance of AI systems in domains such as customer support, education, and information retrieval.
  4. Reduced Bias and Misinformation: By leveraging diverse external knowledge sources, RAG models can mitigate bias and misinformation in AI-generated content. They can cross-reference information from multiple sources, giving users a more balanced and accurate perspective.

RAG evaluation challenges

Evaluating and assessing the performance of RAG systems is a challenge. RAG evaluation presents several difficulties due to its unique characteristics and requirements. Some of these challenges are as follows:

  1. Subjectivity of Evaluation: Assessing the quality of RAG-generated content often involves subjective criteria such as relevance, coherence, and factual accuracy. Different evaluators may have varying interpretations, making it challenging to establish consistent evaluation standards.
  2. Diverse Use Cases: RAG applications span many use cases, from question answering to content generation. Each use case may require different evaluation methodologies and metrics, making it challenging to develop a one-size-fits-all evaluation framework.
  3. Complexity of Language Understanding: RAG models must understand and generate human-like language, which is inherently complex and nuanced. Evaluating the accuracy and appropriateness of generated responses requires sophisticated linguistic analysis and human judgment.
  4. Integration of External Knowledge: RAG models leverage external knowledge sources, which adds a layer of complexity to evaluation. Ensuring that the model effectively incorporates and utilizes external knowledge while maintaining coherence and relevance poses a challenge.
  5. Dynamic Nature of Knowledge: External knowledge sources are dynamic and constantly evolving. Evaluating RAG models' ability to adapt to new information and updates in external knowledge presents challenges in designing realistic evaluation scenarios.
  6. Scalability and Efficiency: Evaluating RAG models at scale requires significant computational resources and human annotator time. Scaling up evaluation processes to handle large datasets and diverse use cases while maintaining efficiency is a practical challenge.
  7. Ethical Considerations: Evaluating RAG models' ethical implications, such as bias, fairness, and privacy, adds another layer of complexity. Ensuring that evaluation methodologies account for these ethical considerations requires careful design and consideration.

Introduction to RAG evaluation and RAGAS library:

A RAG system typically consists of two essential parts: the retriever and the generator. Thus, the system needs to be evaluated in both of these areas.

  • Retriever evaluation: The retriever evaluation primarily focuses on the retrieved context and its relevance to the question asked. The RAGAS library provides several metrics for this, shown below. They evaluate retrieval along the following aspects:

Context precision: How relevant is the retrieved context to the question asked?

Context recall: How well does the retriever fetch all the relevant context required to answer the question?

Context relevancy: How relevant are the sentences in the retrieved context to the question asked?

Context entities recall: A measure of the fraction of entities in the ground truth that are also recalled in the retrieved context.

Let us delve deeper into these metrics to get more clarity:

The figure below shows the four components necessary for RAG retrieval evaluation:

Question: Query asked by the end user.

Context: Documents retrieved based on the query, with the highest relevance scores.

LLM answer: Answer generated by the text-generation model.

Ground truth: Correct answer as per the SME (subject-matter expert).
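To make this concrete, a single evaluation record might look like the following (field names follow the RAGAS convention used in the code snippets later in this article; the content itself is illustrative):

# One illustrative evaluation record (content is hypothetical).
sample_record = {
    'question': 'What is dynamic load analysis and how to perform it?',  # query from the end user
    'contexts': [
        'Dynamic load analysis evaluates structural response to time-varying loads ...',
        'It is typically performed using finite element software ...',
    ],                                                                     # retrieved chunks
    'answer': 'Dynamic load analysis studies how a structure responds to loads that change over time ...',  # LLM answer
    'ground_truths': ['Dynamic load analysis assesses structural behaviour under time-dependent loading ...'],  # SME answer
}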

1.1 Context precision: This metric depends on the top-K value chosen, i.e., the number of retrieved chunks considered as context for the downstream LLM. Context precision does not require the ground truth; it only needs the question and the retrieved contexts.

Notice the red dots in the above figure.

Therefore, context precision @K is given by:

Context Precision@K = ( Σ_{k=1}^{K} Precision@k × v_k ) / (total number of relevant chunks in the top K results)

where Precision@k = (number of relevant chunks in the top k) / k, and v_k is the relevance indicator, taking the value 1 if the chunk at rank k is relevant to the question and 0 otherwise.
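To make the formula concrete, here is a small hand computation, assuming K = 4 and a hypothetical relevance vector:

# Hypothetical relevance judgements for the top K = 4 retrieved chunks:
# 1 = relevant to the question, 0 = irrelevant.
v = [1, 0, 1, 1]

# Precision@k = (relevant chunks among the top k) / k
precision_at_k = [sum(v[: k + 1]) / (k + 1) for k in range(len(v))]  # [1.0, 0.5, 0.667, 0.75]

# Context Precision@K = sum(Precision@k * v_k) / (number of relevant chunks in top K)
context_precision = sum(p * vk for p, vk in zip(precision_at_k, v)) / sum(v)
print(round(context_precision, 3))  # ~0.806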

Using RAGAS library:

from datasets import Dataset
from ragas.metrics import context_precision
from ragas import evaluate

# contexts: list of retrieved chunks for the question;
# sme_expected_answer: the SME-provided ground-truth answer (both defined elsewhere).
system_ans = ["sample generated answer"]
data_samples = {
    'question': ['What is dynamic load analysis and how to perform it?'],
    'answer': system_ans,
    'contexts' : [contexts],
    'ground_truths': [sme_expected_answer]
}
dataset = Dataset.from_dict(data_samples)
score = evaluate(dataset, metrics=[context_precision])
score.to_pandas()

Context precision values lie in [0, 1]. A value close to 1 means the retrieved context is highly relevant to the question.

1.2 Context recall: Unlike context precision, context recall considers both the ground truth and the retrieved context. Notice the green dots in the above figure. Context recall is given by:

Context Recall = (number of ground-truth sentences that can be attributed to the retrieved context) / (total number of sentences in the ground truth)

Simply by changing the metrics argument to [context_recall] in the ragas.evaluate method, we can use the same code shown for context precision to calculate context recall using the RAGAS library. Context recall values range from 0 to 1, with 1 being the highest measure, meaning that all the sentences in the ground truth are covered by the retrieved context.

1.3 Context relevancy: Measures the degree to which the sentences needed to answer the question are present within the retrieved context. It is given by:

Context Relevancy = (number of sentences in the retrieved context that are relevant to the question) / (total number of sentences in the retrieved context)

The values range from 0 to 1, with 1 being the maximum score, where all the sentences in the retrieved context are relevant to the question asked.

Sample code for context relevancy calculation:

from datasets import Dataset
from ragas.metrics import ContextRelevancy

context_relevancy = ContextRelevancy()

# Reusing system_ans, contexts and sme_expected_answer from the earlier snippet.
data_samples = {
    'question': ['What is dynamic load analysis and how to perform it?'],
    'answer': system_ans,
    'contexts' : [contexts],
    'ground_truths': [sme_expected_answer]
}
dataset = Dataset.from_dict(data_samples)
results = context_relevancy.score(dataset)

1.4 Context entity recall: This measures the fraction of entities present in the ground truth that also appear in the retrieved context. Notice the green dots in the above figure.
Context entities recall is given by:

Context Entities Recall = |CE ∩ GE| / |GE|

GE refers to the set of ground-truth entities, and CE refers to the set of context entities. Entities are simply the key elements in the sentences. For example, in this passage- "The Initial Scantling Evaluation (ISE) is the first step in assessing the structural integrity of a vessel. It involves determining the minimum satisfactory scantlings (thicknesses) of each structural component of a vessel, considering factors such as hull girder strength requirements, local scantling requirements for loading conditions, fatigue strength of connections, and scantling requirements for main supporting members. This process typically uses specific software and considers environmental severity factors for the intended site and transit routes. The ISE is part of a two-phase design criteria for hull structures, with the second phase being the Total Strength Assessment (TSA). It's important to note that a design that complies with the ISE minimum scantlings does not necessarily mean it will also be satisfactory in the TSA."- terms such as "Initial Scantling Evaluation (ISE)", "Total Strength Assessment (TSA)", "hull girder strength", and "scantlings" are the kind of entities that would be extracted.

Simply by changing the metrics argument to [context_entity_recall] in the ragas.evaluate method, we can use the same code as shown in context precision to calculate context entities recall using the RAGAS library.
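In fact, all the retriever-side metrics discussed so far can be computed in a single evaluate call. The sketch below reuses the data_samples dictionary from above; exact metric names can vary slightly between RAGAS versions.

from datasets import Dataset
from ragas import evaluate
from ragas.metrics import context_precision, context_recall, context_entity_recall

# Reusing the data_samples dictionary defined earlier in the article.
dataset = Dataset.from_dict(data_samples)

# All three retriever-side metrics can be computed in a single call.
score = evaluate(dataset, metrics=[context_precision, context_recall, context_entity_recall])
score.to_pandas()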

It is to be noted that the calculation of the above metrics relies on LLM calls under the hood, which extract the entities, check context relevancy, and so on. The RAGAS library runs LangChain-based chains of LLM calls for the evaluation, and one can bring their own LLM(s) from any provider (Ollama, Google, Azure, TogetherAI, Anthropic, Hugging Face, etc.), as sketched below.
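Here is a minimal sketch of plugging a locally served Ollama model into a RAGAS evaluation. It assumes a recent RAGAS version in which ragas.llms exposes LangchainLLMWrapper, ragas.embeddings exposes LangchainEmbeddingsWrapper, and evaluate accepts llm and embeddings arguments; exact names and signatures may differ between library versions.

from datasets import Dataset
from ragas import evaluate
from ragas.metrics import context_precision
from ragas.llms import LangchainLLMWrapper
from ragas.embeddings import LangchainEmbeddingsWrapper
from langchain_community.chat_models import ChatOllama
from langchain_community.embeddings import OllamaEmbeddings

# Wrap local Ollama models so RAGAS uses them for its internal LLM/embedding calls.
eval_llm = LangchainLLMWrapper(ChatOllama(model="phi3:mini"))
eval_embeddings = LangchainEmbeddingsWrapper(OllamaEmbeddings(model="nomic-embed-text"))

dataset = Dataset.from_dict(data_samples)  # same dictionary as in the earlier snippets
score = evaluate(
    dataset,
    metrics=[context_precision],
    llm=eval_llm,                # evaluator LLM
    embeddings=eval_embeddings,  # evaluator embeddings
)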

2. Generator evaluation:  

Now that we have already evaluated the retriever's performance, let's examine how we can evaluate the performance of the texts/answers generated from the generator part of a RAG system.

Traditionally, multiple metrics are used for evaluating the performance of a generation model:

2.1 BLEU score: Like the Jaccard index, this score calculates the shared n-grams between the ground truth and the generated answer. 
BLEU score is given by: 

BLEU = BP × exp( Σ_{n=1}^{N} w_n log p_n )

where p_n is the n-gram precision (the fraction of n-grams in the generated text that also appear in the ground truth), w_n are the n-gram weights (usually uniform, w_n = 1/N), and BP is the brevity penalty, which penalizes generations shorter than the expected text:

BP = 1 if c > r, otherwise exp(1 − r/c)

where c is the length of the generated text and r is the length of the expected/ground-truth text.
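A quick way to compute BLEU in practice is NLTK's sentence_bleu (the sentences below are illustrative; smoothing avoids a zero score when a higher-order n-gram has no overlap):

from nltk.translate.bleu_score import sentence_bleu, SmoothingFunction

reference = "dynamic load analysis evaluates structural response to time varying loads".split()
candidate = "dynamic load analysis studies structural response to loads that vary in time".split()

# Uniform weights over 1- to 4-gram precisions.
score = sentence_bleu(
    [reference],
    candidate,
    weights=(0.25, 0.25, 0.25, 0.25),
    smoothing_function=SmoothingFunction().method1,
)
print(round(score, 3))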

2.2 ROUGE score: There are various flavors of ROUGE scores (ROUGE-N, ROUGE-L, etc.). They primarily measure recall, unlike BLEU, which measures precision.
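For reference, ROUGE can be computed with the rouge-score package (assuming it is installed; the sentences below are illustrative):

from rouge_score import rouge_scorer

scorer = rouge_scorer.RougeScorer(["rouge1", "rougeL"], use_stemmer=True)
scores = scorer.score(
    "dynamic load analysis evaluates structural response to time varying loads",    # reference
    "dynamic load analysis studies structural response to loads that vary in time", # generated
)
print(scores["rouge1"].recall, scores["rougeL"].fmeasure)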

These traditional methods fail to capture hallucinations and 'away-from-context' answers generated by LLMs.

2.3 Faithfulness: Measures the factual consistency of the generated response against the retrieved context. Faithfulness is calculated using the formula given below:

Faithfulness = (number of claims in the generated answer that can be inferred from the retrieved context) / (total number of claims in the generated answer)

The claim extraction and verification are again done by LLM calls when using the RAGAS library.

Sample code snippet:

from datasets import Dataset
from ragas.metrics import faithfulness
from ragas import evaluate

data_samples = {
    'question': list_of_questions,
    'answer': list_of_answers,
    'contexts' : [[list_of_contexts_for_q1], [list_of_contexts_for_q2], ...]  # one list of retrieved contexts per question
}
dataset = Dataset.from_dict(data_samples)
score = evaluate(dataset, metrics=[faithfulness])
score.to_pandas()

2.4 Answer relevance: This is measured by computing the mean cosine similarity between the original question and several artificial questions that are created (reverse engineered) from the generated response. This metric uses the question, the contexts, and the generated answer for its calculation. A high answer relevance means that little unwanted or redundant content is present in the answer. Answer relevancy is given by:

Answer Relevancy = (1/N) × Σ_{i=1}^{N} cos_sim(E_{g_i}, E_o)

where E_o is the embedding of the original question, E_{g_i} is the embedding of the i-th generated (reverse-engineered) question, and N is the number of generated questions.
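Conceptually, the computation looks roughly like the toy sketch below. Here embed() is a hypothetical embedding function and the artificial questions are illustrative; in RAGAS they are generated by an LLM from the answer.

import numpy as np

def cosine(a, b):
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

# embed() is a hypothetical sentence-embedding function.
original_q_vec = embed("What is dynamic load analysis and how to perform it?")

# Questions reverse-engineered from the generated answer (illustrative examples).
generated_questions = [
    "What does dynamic load analysis evaluate?",
    "How is a dynamic load analysis carried out?",
    "Which loads are considered in dynamic load analysis?",
]

# Mean cosine similarity between the original question and each generated question.
answer_relevancy = np.mean([cosine(embed(q), original_q_vec) for q in generated_questions])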

Sample code:

from datasets import Dataset 
from ragas.metrics import answer_relevancy
from ragas import evaluate

data_samples = {
    'question': list_of_questions,
    'answer': list_of_answers,
    'contexts' : [[list_of_contexts_for_q1], [list_of_contexts_for_q2], ...],
}
dataset = Dataset.from_dict(data_samples)
score = evaluate(dataset,metrics=[answer_relevancy])
score.to_pandas()

Other embedding quality evaluation strategies:

Embedding rank evaluation: Embedding quality can be evaluated by finding the ranks at which the expected documents are retrieved. Summing these ranks over a number of questions, a lower sum indicates better embedding quality. This can be used for a comparative study of embeddings, as sketched below.
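A minimal sketch of this idea, assuming a hypothetical retrieve(question, embedding_name, k) helper that returns section ids ranked by similarity:

# retrieve(question, embedding_name, k) is a hypothetical helper that returns the
# ids of the top-k sections, ranked by similarity, for a given embedding model.
def sum_of_expected_ranks(questions, expected_sections, embedding_name, k=10):
    total = 0
    for question, expected_id in zip(questions, expected_sections):
        ranked_ids = retrieve(question, embedding_name, k)
        # Rank is 1-based; if the expected section is not retrieved at all,
        # penalize with k + 1.
        rank = ranked_ids.index(expected_id) + 1 if expected_id in ranked_ids else k + 1
        total += rank
    return total

# Lower total rank => the embedding surfaces the expected sections earlier.
# scores = {emb: sum_of_expected_ranks(questions, expected, emb) for emb in ["emb-1", "emb-2", "emb-3"]}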

Example: From the table below, we can see that emb-3 appears to be better than the other two embeddings because it has an overall lower sum of retrieval ranks for the expected sections.

Qn no. | Emb 1 - expected section rank | Emb 2 - expected section rank | Emb 3 - expected section rank
1      | 2                             | 5                             | 1
2      | 7                             | 4                             | 3
3      | 1                             | 5                             | 5
4      | 6                             | 2                             | 2
5      | 1                             | 5                             | 2
Sum    | 17                            | 21                            | 13

Embedding spread evaluation: The embeddings are expected to capture as much information as possible, including the distinctions between the various categories of contexts available. For example, in a RAG system that answers questions on legal documents, the embeddings related to murder cases should be significantly different from those related to civil cases. Hence, for a variety of questions from different domains, the contexts retrieved should vary, and the cosine similarities between the question (with its prompt) and the available sections should show a higher degree of spread. The standard deviation, variance, IQR, etc. of these cosine similarities across multiple questions can therefore be used as a measure, as sketched below.
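A rough sketch of this measure, assuming a hypothetical embed_question() function and a precomputed section_embeddings matrix:

import numpy as np

# embed_question() and section_embeddings are hypothetical: the former embeds the
# prompt+question, the latter is a matrix of pre-computed section embeddings.
def similarity_spread(question, section_embeddings, embed_question):
    q = embed_question(question)
    sims = section_embeddings @ q / (
        np.linalg.norm(section_embeddings, axis=1) * np.linalg.norm(q)
    )
    return float(np.std(sims))  # could also use variance or IQR

# Summing the spread over many questions gives the totals shown in the table below.
# total_spread = sum(similarity_spread(q, section_embeddings, embed_question) for q in questions)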

Example: Embedding 2 has the maximum sum of cosine-similarity standard deviations and hence appears to capture more information than the other embeddings.

Qn no. | Emb 1 - cosine similarity std. dev. | Emb 2 - cosine similarity std. dev. | Emb 3 - cosine similarity std. dev.
1      | 12                                  | 15                                  | 19
2      | 38                                  | 54                                  | 31
3      | 3                                   | 15                                  | 15
4      | 13                                  | 21                                  | 12
5      | 8                                   | 5                                   | 12
Sum    | 74                                  | 110                                 | 89

Though we have discussed several commonly used evaluation metrics, there are others this blog does not cover, such as METEOR and perplexity. Practitioners and researchers continue to explore new evaluation metrics, and how RAGs are evaluated may vary across use cases. The RAGAS library has become very popular within the span of a year. The GenAI space holds enormous potential, RAG has firmly secured its place in it, and it should not be surprising to see new metrics evolve or emerge over time.

DeepEval:

Apart from RAGAS, other libraries are being developed that are useful for creating custom evaluation frameworks. The DeepEval library has a G-Eval metric wherein the evaluator/user can provide the evaluation steps. G-Eval uses a chain-of-thought technique to generate evaluation steps, which are then used to score the test case. In short, G-Eval comprises two phases: evaluation-step generation and score generation. When evaluation_steps are provided by the user, G-Eval skips the first phase and proceeds directly to scoring.

The DeepEval library offers a variety of LLM evaluation metrics, including the RAGAS metrics from the RAGAS library. DeepEval uses a pytest-based framework to produce an evaluation report.

In the report, the score provides a value for each specified metric, the threshold shows the pass/fail threshold for that score, and the reason gives the justification for the score. The overall success rate and status are populated based on the number of evaluations and their average scores. DeepEval provides other metrics, like toxicity and bias, and also supports LLMs from any provider. To do this, a custom class inheriting from DeepEvalBaseLLM can be created, as shown in the sample DeepEval code snippet for G-Eval calculation below:

import json

from deepeval import assert_test
from deepeval.metrics import GEval
from deepeval.models import DeepEvalBaseLLM
from deepeval.test_case import LLMTestCase, LLMTestCaseParams
from langchain_community.chat_models import ChatOllama

##### Custom class to wrap ollama model into a DeepEvalBaseLLM model
### Evaluate takes DeepEvalBaseLLM model or an Azure model ###
class CustomEvalModel(DeepEvalBaseLLM):
    def __init__(self, model):
        self.model = model

    def load_model(self):
        return self.model

    def generate(self, prompt: str) -> str:
        chat_model = self.load_model()
        return chat_model.invoke(prompt).content

    async def a_generate(self, prompt: str) -> str:
        chat_model = self.load_model()
        res = await chat_model.ainvoke(prompt)
        return res.content

    def get_model_name(self):
        return "Custom Ollama Model"


def test_model_outputs():
    # Given
    ############### ollama 
    short_transcript = "A short transcript snippet which talks about semi-conductor supply chain fragmentation."
    llm = ChatOllama(model="phi3:mini")
    custom_deep_eval_model = CustomEvalModel(model=llm)
    
    system_prompt = (
        "Summarize this transcript in a JSON output with keys 'title' and 'summary'"
    )
    user_prompt = f"{system_prompt}: {short_transcript}"
    actual_output = llm.invoke(f"""<role>: system, <content>:{system_prompt};\n <role>:user, <content>: {user_prompt}
       """)
    output = actual_output.content
    # Parse the JSON string returned by the model (json.loads is safer than eval).
    output = json.loads(output)
    print(output)
    # When
    test_case = LLMTestCase(
        input=short_transcript,
        actual_output=output["summary"],
        expected_output="fragmented nature of the supply chain and existing monopolies",
        context=[short_transcript],
    )
    
    # Metric: Insights
    insights_metric = GEval(
        name="insights",
        model=custom_deep_eval_model,
        threshold=0.5,
        evaluation_steps=[
            "Determine how entertaining the summary of the transcript is"
        ],
        evaluation_params=[LLMTestCaseParams.INPUT, LLMTestCaseParams.ACTUAL_OUTPUT],
    )

    assert_test(test_case, [insights_metric])

### Run the pytest using this command- "deepeval test run test_deepeval_g_eval.py" #############

OpenAI Evals:

Like DeepEval, another module, OpenAI Evals, also allows custom evaluations. OpenAI Evals contains two types of evaluations: basic evals and model-graded evals.

Basic evals: 

Basic evals are provided as part of OpenAI Evals; no models are involved in computing their scores. They require only the generated answer and the correct reference answer(s). Some basic eval metrics are: Match, Includes, FuzzyMatch, and JsonMatch.

For a model completion ‘a’ and a reference list of correct answers ‘B’, the following evals implement:

  1. basic/match.py:Match: any([a.startswith(b) for b in B])
  2. basic/includes.py:Includes: any([(b in a) for b in B])
  3. basic/fuzzy_match.py:FuzzyMatch: any([(a in b or b in a) for b in B])
  4. basic/json_match.py:JsonMatch: Checks whether the keys and values of two JSON objects match.
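The logic behind these basic checks is simple enough to sketch directly in Python (here a is the model completion and B is the list of reference answers, mirroring the expressions above):

import json

def match(a: str, B: list[str]) -> bool:
    # Completion must start with one of the reference answers.
    return any(a.startswith(b) for b in B)

def includes(a: str, B: list[str]) -> bool:
    # A reference answer must appear somewhere inside the completion.
    return any(b in a for b in B)

def fuzzy_match(a: str, B: list[str]) -> bool:
    # Either string contained in the other counts as a match.
    return any(a in b or b in a for b in B)

def json_match(a: str, b: str) -> bool:
    # Parse both strings and compare keys and values.
    return json.loads(a) == json.loads(b)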

Model-graded evals:

OpenAI Evals provides support for building your own model-graded evals. Model-graded evals are needed where the desired model response can vary widely, such as answering an open-ended question. Model-graded evals require LLMs for evaluation, and the evaluation model and the model under review can be different. Model-graded evals are designed so that the ground truth can be parsed and compared with the generated answer, e.g., in multiple-choice format or with a simple yes/no. OpenAI Evals supports custom evaluation prompts, just like DeepEval. The required arguments for running model-graded evals are:

prompt: The evaluation prompt should receive the model's response to the initial prompt, possibly along with some additional data, and direct the model to provide a parsable evaluation. Curly-brace-denoted portions ({key}) are filled in using the additional args (see below) or the data in input_outputs.

input_outputs: A mapping specifying which inputs to use to generate which completions. There will only be a single input-completion pair for many evals, though there can be more, e.g. when comparing two completions against each other.

choice_strings: The choices that we expect the model completion to contain given the evaluation prompt. For example, "ABCDE" or ["Yes", "No", "Unsure"]. Any other choices the model returns are parsed into "__invalid__".

choice_scores (optional): A mapping of each choice to its score, which is logged as a metric. For example, if a response of "Yes" (resp. "No") indicates that the model's original completion was good (resp. bad), we may assign this choice a score of 1 (resp. 0).

eval_type (optional): How we expect the model to format its response to the evaluation prompt. Currently, the supported options are cot_classify (expects that the parsable portion of the response, i.e., the portion containing the choice, will be at the end of the completion), classify_cot (expects that the model response will include the choice first), and classify (expects that the model response will contain only the choice).

Below are some examples of model-graded evals. Please note that the evaluation dataset must contain the input question, the completion answer (LLM-generated answer), and the ideal answer (ground truth answer).

Sample model-graded evals: the model-graded eval specs must be written in a YAML file, as shown below, and saved in the evals/registry/modelgraded directory:

E.g. 1

fact:
  prompt: |-
    You are comparing a submitted answer to an expert answer on a given question. Here is the data:
    [BEGIN DATA]
    ************
    [Question]: {input}
    ************
    [Expert]: {ideal}
    ************
    [Submission]: {completion}
    ************
    [END DATA]

    Compare the factual content of the submitted answer with the expert answer. Ignore any differences in style, grammar, or punctuation.
    The submitted answer may either be a subset or superset of the expert answer, or it may conflict with it. Determine which case applies. Answer the question by selecting one of the following options:
    (A) The submitted answer is a subset of the expert answer and fully consistent with it.
    (B) The submitted answer is a superset of the expert answer and is fully consistent with it.
    (C) The submitted answer contains all the same details as the expert answer.
    (D) There is a disagreement between the submitted answer and the expert answer.
    (E) The answers differ, but these differences don't matter from the perspective of factuality.
  choice_strings: ABCDE
  input_outputs:
    input: completion

E.g. 2

closedqa:
  prompt: |-
    You are assessing a submitted answer on a given task based on a criterion. Here is the data:
    [BEGIN DATA]
    ***
    [Task]: {input}
    ***
    [Submission]: {completion}
    ***
    [Criterion]: {criteria}
    ***
    [END DATA]
    Does the submission meet the criterion? First, write out in a step by step manner your reasoning about the criterion to be sure that your conclusion is correct. Avoid simply stating the correct answers at the outset. Then print only the single character "Y" or "N" (without quotes or punctuation) on its own line corresponding to the correct answer. At the end, repeat just the letter again by itself on a new line.

    Reasoning:
  eval_type: cot_classify
  choice_scores:
    "Y": 1.0
    "N": 0.0
  choice_strings: 'YN'
  input_outputs:
    input: "completion"

Sample dataset in JSONL file for evaluation:

{"input": "write a 1-2 funny lines about apple", "completion": "Why did the apple go to the doctor? Because it had a bad core-ache!", "choice": "Yes"}
{"input": "write a 1-2 boring lines about apple", "completion": "Apples are a type of fruit that grow on trees and are often consumed as a healthy snack.", "choice": "No"}
{"input": "write a 1-2 funny lines about pineapple", "completion": "Why did the pineapple stop in the middle of the road? Because it ran out of juice!", "choice": "Yes"}

To run the evals, simply use this command: oaieval gpt-3.5-turbo my_test_dataset

Conclusion

There are various kinds of metrics for RAG evaluation. Some are statistical in nature, based on keyword or n-gram matching, while others require LLMs (or other models) that score answers against evaluation criteria, either custom-defined by the user (DeepEval's G-Eval, OpenAI Evals, etc.) or part of an existing evaluation framework (RAGAS). Evaluation frameworks are being developed, updated, and improved continuously and have become essential for benchmarking RAG system performance, allowing engineers to build excellent RAG systems.

 
