RAG Pipeline Evaluation Using RAGAS
Last Updated: April 24, 2025
Ragas is an open-source, model-based evaluation framework for Retrieval-Augmented Generation (RAG) pipelines and LLM applications. It supports metrics such as correctness, tone, hallucination (faithfulness), fluency, and more.
For more information about evaluators, supported metrics, and usage, check out the Ragas documentation.
This notebook shows how to use the Ragas-Haystack integration to evaluate a RAG pipeline against various metrics.
Notebook by Anushree Bannadabhavi, Siddharth Sahu, Julian Risch
Prerequisites:
- Ragas uses an OpenAI model to compute some of its metrics, so we need an OpenAI API key.
import os
from getpass import getpass
os.environ["OPENAI_API_KEY"] = getpass("Enter OpenAI API key:")
Install dependencies
!pip install ragas-haystack
Importing Required Libraries
from haystack import Document, Pipeline
from haystack.document_stores.in_memory import InMemoryDocumentStore
from haystack.components.embedders import OpenAITextEmbedder, OpenAIDocumentEmbedder
from haystack.components.retrievers.in_memory import InMemoryEmbeddingRetriever
from haystack.components.builders import ChatPromptBuilder
from haystack.dataclasses import ChatMessage
from haystack.components.generators import OpenAIGenerator
from haystack.components.generators.chat import OpenAIChatGenerator
from haystack.components.builders import AnswerBuilder
from haystack_integrations.components.evaluators.ragas import RagasEvaluator
from ragas.llms import HaystackLLMWrapper
from ragas.metrics import AnswerRelevancy, ContextPrecision, Faithfulness
Creating a Sample Dataset
In this section we create a sample dataset containing information about AI companies and their language models. This dataset serves as the context for retrieving relevant data during pipeline execution.
dataset = [
"OpenAI is one of the most recognized names in the large language model space, known for its GPT series of models. These models excel at generating human-like text and performing tasks like creative writing, answering questions, and summarizing content. GPT-4, their latest release, has set benchmarks in understanding context and delivering detailed responses.",
"Anthropic is well-known for its Claude series of language models, designed with a strong focus on safety and ethical AI behavior. Claude is particularly praised for its ability to follow complex instructions and generate text that aligns closely with user intent.",
"DeepMind, a division of Google, is recognized for its cutting-edge Gemini models, which are integrated into various Google products like Bard and Workspace tools. These models are renowned for their conversational abilities and their capacity to handle complex, multi-turn dialogues.",
"Meta AI is best known for its LLaMA (Large Language Model Meta AI) series, which has been made open-source for researchers and developers. LLaMA models are praised for their ability to support innovation and experimentation due to their accessibility and strong performance.",
"Meta AI with it's LLaMA models aims to democratize AI development by making high-quality models available for free, fostering collaboration across industries. Their open-source approach has been a game-changer for researchers without access to expensive resources.",
"Microsoft’s Azure AI platform is famous for integrating OpenAI’s GPT models, enabling businesses to use these advanced models in a scalable and secure cloud environment. Azure AI powers applications like Copilot in Office 365, helping users draft emails, generate summaries, and more.",
"Amazon’s Bedrock platform is recognized for providing access to various language models, including its own models and third-party ones like Anthropic’s Claude and AI21’s Jurassic. Bedrock is especially valued for its flexibility, allowing users to choose models based on their specific needs.",
"Cohere is well-known for its language models tailored for business use, excelling in tasks like search, summarization, and customer support. Their models are recognized for being efficient, cost-effective, and easy to integrate into workflows.",
"AI21 Labs is famous for its Jurassic series of language models, which are highly versatile and capable of handling tasks like content creation and code generation. The Jurassic models stand out for their natural language understanding and ability to generate detailed and coherent responses.",
"In the rapidly advancing field of artificial intelligence, several companies have made significant contributions with their large language models. Notable players include OpenAI, known for its GPT Series (including GPT-4); Anthropic, which offers the Claude Series; Google DeepMind with its Gemini Models; Meta AI, recognized for its LLaMA Series; Microsoft Azure AI, which integrates OpenAI’s GPT Models; Amazon AWS (Bedrock), providing access to various models including Claude (Anthropic) and Jurassic (AI21 Labs); Cohere, which offers its own models tailored for business use; and AI21 Labs, known for its Jurassic Series. These companies are shaping the landscape of AI by providing powerful models with diverse capabilities.",
]
Initializing RAG Pipeline Components
This section sets up the essential components required to build a Retrieval-Augmented Generation (RAG) pipeline. These components include a Document Store for managing and storing documents, an Embedder for generating embeddings to enable similarity-based retrieval, and a Retriever for fetching relevant documents. Additionally, a Prompt Template is designed to structure the pipeline’s input, while a Chat Generator handles response generation.
# Sets up an in-memory store to hold documents
document_store = InMemoryDocumentStore()
docs = [Document(content=doc) for doc in dataset]
# Embeds the documents using OpenAI's embedding models to enable similarity search.
document_embedder = OpenAIDocumentEmbedder(model="text-embedding-3-small")
text_embedder = OpenAITextEmbedder(model="text-embedding-3-small")
docs_with_embeddings = document_embedder.run(docs)
document_store.write_documents(docs_with_embeddings["documents"])
# Configures a retriever to fetch relevant documents based on embeddings
retriever = InMemoryEmbeddingRetriever(document_store, top_k=2)
# Defines a template for prompting the LLM with a user query and the retrieved documents
template = [
ChatMessage.from_user(
"""
Given the following information, answer the question.
Context:
{% for document in documents %}
{{ document.content }}
{% endfor %}
Question: {{question}}
Answer:
"""
)
]
# Sets up an LLM-based generator to create responses
prompt_builder = ChatPromptBuilder(template=template)
chat_generator = OpenAIChatGenerator(model="gpt-4o-mini")
Configuring RagasEvaluator Component
Pass all the Ragas metrics you want to use for evaluation, ensuring that all the necessary information to calculate each selected metric is provided.
For example:
- AnswerRelevancy: requires both the query and the response. It does not consider factuality, but assigns a lower score when the response is incomplete or contains redundant details.
- ContextPrecision: requires the query, the retrieved documents, and the reference. It evaluates to what extent the retrieved documents contain only information that is relevant to answering the query.
- Faithfulness: requires the query, retrieved documents, and the response. The response is regarded as faithful if all the claims that are made in the response can be inferred from the retrieved documents.
Make sure to include all relevant data for each metric to ensure accurate evaluation.
llm = OpenAIGenerator(model="gpt-4o-mini")
evaluator_llm = HaystackLLMWrapper(llm)
ragas_evaluator = RagasEvaluator(
ragas_metrics=[AnswerRelevancy(), ContextPrecision(), Faithfulness()],
evaluator_llm=evaluator_llm,
)
Building and Connecting the RAG Pipeline
Here we add and connect the initialized components to form a RAG Haystack pipeline.
# Creating the Pipeline
rag_pipeline = Pipeline()
# Adding the components
rag_pipeline.add_component("text_embedder", text_embedder)
rag_pipeline.add_component("retriever", retriever)
rag_pipeline.add_component("prompt_builder", prompt_builder)
rag_pipeline.add_component("llm", chat_generator)
rag_pipeline.add_component("answer_builder", AnswerBuilder())
rag_pipeline.add_component("ragas_evaluator", ragas_evaluator)
# Connecting the components
rag_pipeline.connect("text_embedder.embedding", "retriever.query_embedding")
rag_pipeline.connect("retriever", "prompt_builder")
rag_pipeline.connect("prompt_builder.prompt", "llm.messages")
rag_pipeline.connect("llm.replies", "answer_builder.replies")
rag_pipeline.connect("retriever", "answer_builder.documents")
rag_pipeline.connect("retriever", "ragas_evaluator.documents")
rag_pipeline.connect("llm.replies", "ragas_evaluator.response")
question = "What makes Meta AI’s LLaMA models stand out?"
reference = "Meta AI’s LLaMA models stand out for being open-source, supporting innovation and experimentation due to their accessibility and strong performance."
result = rag_pipeline.run(
{
"text_embedder": {"text": question},
"prompt_builder": {"question": question},
"answer_builder": {"query": question},
"ragas_evaluator": {"query": question, "reference": reference},
# Each metric expects a specific set of inputs. Refer to each
# Ragas metric's documentation for more details.
}
)
print(result['answer_builder']['answers'][0].data, '\n')
print(result['ragas_evaluator']['result'])
Meta AI’s LLaMA models stand out for several reasons:
1. **Open-Source Accessibility**: The LLaMA models are open-source, allowing researchers and developers to use, modify, and experiment with them freely, which promotes innovation.
2. **Strong Performance**: LLaMA models are recognized for their high-performance capabilities, enabling effective application across various tasks and industries.
3. **Democratization of AI Development**: By providing these advanced models for free, Meta AI aims to democratize access to AI technology, allowing individuals and smaller organizations to engage in AI development without the barrier of high costs.
4. **Fostering Collaboration**: The open-source nature encourages collaboration among researchers and industries, creating a community-driven approach to AI advancement.
These factors together position LLaMA models as significant contributors to the AI landscape, especially for those lacking access to expensive proprietary models.
{'answer_relevancy': 0.9889, 'context_precision': 1.0000, 'faithfulness': 0.6667}
Standalone Evaluation of the RAG Pipeline
This section explores an alternative approach to evaluating a RAG pipeline without using the RagasEvaluator component: we extract the pipeline's outputs manually and organize them into a dataset for evaluation.
You can use any existing Haystack pipeline for this purpose. For demonstration, we will create a simple RAG pipeline similar to the one described earlier, but without the RagasEvaluator component.
Setting Up a Basic RAG Pipeline
We construct a simple RAG pipeline similar to the approach above but without the RagasEvaluator component.
# Initialize components for RAG pipeline
document_store = InMemoryDocumentStore()
docs = [Document(content=doc) for doc in dataset]
document_embedder = OpenAIDocumentEmbedder(model="text-embedding-3-small")
text_embedder = OpenAITextEmbedder(model="text-embedding-3-small")
docs_with_embeddings = document_embedder.run(docs)
document_store.write_documents(docs_with_embeddings["documents"])
retriever = InMemoryEmbeddingRetriever(document_store, top_k=2)
template = [
ChatMessage.from_user(
"""
Given the following information, answer the question.
Context:
{% for document in documents %}
{{ document.content }}
{% endfor %}
Question: {{question}}
Answer:
"""
)
]
prompt_builder = ChatPromptBuilder(template=template)
chat_generator = OpenAIChatGenerator(model="gpt-4o-mini")
# Creating the Pipeline
rag_pipeline = Pipeline()
# Adding the components
rag_pipeline.add_component("text_embedder", text_embedder)
rag_pipeline.add_component("retriever", retriever)
rag_pipeline.add_component("prompt_builder", prompt_builder)
rag_pipeline.add_component("llm", chat_generator)
rag_pipeline.add_component("answer_builder", AnswerBuilder())
# Connecting the components
rag_pipeline.connect("text_embedder.embedding", "retriever.query_embedding")
rag_pipeline.connect("retriever", "prompt_builder")
rag_pipeline.connect("prompt_builder.prompt", "llm.messages")
rag_pipeline.connect("llm.replies", "answer_builder.replies")
rag_pipeline.connect("retriever", "answer_builder.documents")
rag_pipeline.connect("llm.replies", "answer_builder.replies")
rag_pipeline.connect("retriever", "answer_builder.documents")
Extracting Outputs for Evaluation
After building the pipeline, we use it to generate the necessary outputs, such as retrieved documents and responses. These outputs are then structured into a dataset for evaluation.
questions = [
"Who are the major players in the large language model space?",
"What is Microsoft’s Azure AI platform known for?",
"What kind of models does Cohere provide?",
]
references = [
"The major players include OpenAI (GPT Series), Anthropic (Claude Series), Google DeepMind (Gemini Models), Meta AI (LLaMA Series), Microsoft Azure AI (integrating GPT Models), Amazon AWS (Bedrock with Claude and Jurassic), Cohere (business-focused models), and AI21 Labs (Jurassic Series).",
"Microsoft’s Azure AI platform is known for integrating OpenAI’s GPT models, enabling businesses to use these models in a scalable and secure cloud environment.",
"Cohere provides language models tailored for business use, excelling in tasks like search, summarization, and customer support.",
]
evals_list = []
for que_idx in range(len(questions)):
single_turn = {}
single_turn['user_input'] = questions[que_idx]
single_turn['reference'] = references[que_idx]
# Running the pipeline
response = rag_pipeline.run(
{
"text_embedder": {"text": questions[que_idx]},
"prompt_builder": {"question": questions[que_idx]},
"answer_builder": {"query": questions[que_idx]},
}
)
# the response of the pipeline
single_turn['response'] = response["answer_builder"]["answers"][0].data
haystack_documents = response["answer_builder"]["answers"][0].documents
# extracting the context from the Haystack documents
# retrieved during the answer generation process
single_turn['retrieved_contexts'] = [doc.content for doc in haystack_documents]
evals_list.append(single_turn)
When constructing the evals_list, it is important to align the keys in the single_turn dictionary with the attributes defined in the Ragas SingleTurnSample class. This ensures compatibility with the Ragas evaluation framework. Use the retrieved documents and pipeline outputs to populate these fields accurately, as demonstrated in the code snippet above; the sketch below makes the mapping explicit.
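Here is a minimal sketch of constructing a SingleTurnSample directly, assuming the ragas.dataset_schema module; the field values are illustrative placeholders rather than real pipeline outputs:
from ragas.dataset_schema import SingleTurnSample
# Illustrative placeholder values; in the loop above these come from the pipeline run
sample = SingleTurnSample(
    user_input="What kind of models does Cohere provide?",  # the query
    retrieved_contexts=["Cohere is well-known for its language models ..."],  # retriever output
    response="Cohere provides language models tailored for business use ...",  # generated answer
    reference="Cohere provides language models tailored for business use.",  # ground-truth answer
)
Each key used in the single_turn dictionary (user_input, reference, response, retrieved_contexts) corresponds one-to-one to these attributes.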
Evaluating the Pipeline Using the Ragas EvaluationDataset
The extracted dataset is converted into a Ragas EvaluationDataset so that Ragas can process it. We then initialize an LLM evaluator using the HaystackLLMWrapper. Finally, we call Ragas’s evaluate() function with our evaluation dataset, three metrics, and the LLM evaluator.
from ragas import evaluate
from ragas.dataset_schema import EvaluationDataset
evaluation_dataset = EvaluationDataset.from_list(evals_list)
llm = OpenAIGenerator(model="gpt-4o-mini")
evaluator_llm = HaystackLLMWrapper(llm)
result = evaluate(
dataset=evaluation_dataset,
metrics=[AnswerRelevancy(), ContextPrecision(), Faithfulness()],
llm=evaluator_llm,
)
print(result)
result.to_pandas()
{'answer_relevancy': 0.9701, 'context_precision': 1.0000, 'faithfulness': 1.0000}
| | user_input | retrieved_contexts | response | reference | answer_relevancy | context_precision | faithfulness |
|---|---|---|---|---|---|---|---|
| 0 | Who are the major players in the large languag... | [In the rapidly advancing field of artificial ... | The major players in the large language model ... | The major players include OpenAI (GPT Series),... | 1.000000 | 1.0 | 1.0 |
| 1 | What is Microsoft’s Azure AI platform known for? | [Microsoft’s Azure AI platform is famous for i... | Microsoft’s Azure AI platform is known for int... | Microsoft’s Azure AI platform is known for int... | 1.000000 | 1.0 | 1.0 |
| 2 | What kind of models does Cohere provide? | [Cohere is well-known for its language models ... | Cohere provides language models tailored for b... | Cohere provides language models tailored for b... | 0.910337 | 1.0 | 1.0 |
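Beyond the aggregate scores, the per-sample rows in this dataframe can be used to flag weak answers for manual inspection. A minimal sketch (the 0.9 threshold is an arbitrary value chosen for illustration):
df = result.to_pandas()
# Keep only samples whose faithfulness score falls below the (arbitrary) threshold
weak_samples = df[df["faithfulness"] < 0.9]
print(weak_samples[["user_input", "faithfulness"]])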