Integration: NVIDIA
Use NVIDIA models with Haystack.
Table of Contents
- Overview
- Prerequisites
- Installation
- Components
- Usage
- Self-host with NVIDIA NIM
- Use NVIDIA components in Haystack pipelines
- License
Overview
The nvidia-haystack package contains Haystack integrations for chat models, embeddings, and reranking powered by
NVIDIA AI Foundation Models and hosted on the
NVIDIA API Catalog.
NVIDIA AI Foundation Models are community- and NVIDIA-built models optimized to deliver the best performance on NVIDIA-accelerated infrastructure. You can use the API to query live endpoints on the NVIDIA API Catalog and get quick results from a DGX-hosted cloud compute environment, or you can download and run the models with NVIDIA NIM, which is included with the NVIDIA AI Enterprise license. Running models on-premises gives your enterprise ownership of your customizations and full control of your IP and AI application.
NIM microservices are packaged as container images on a per model or model family basis and are distributed as NGC container images through the NVIDIA NGC Catalog.
Prerequisites
To get access to the NVIDIA API Catalog, do the following:
- Create a free account on the NVIDIA API Catalog and log in.
- Click your profile icon, and then click API Keys.
- Click Generate API Key, and then click Generate Key.
- Copy and save your key.
Set the key as an environment variable:
export NVIDIA_API_KEY="nvapi-..."
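The components read the key from this environment variable by default. To wire it in explicitly instead, here is a minimal sketch using Haystack's Secret wrapper (shown with NvidiaTextEmbedder; the other components in this package accept the same api_key parameter):

```python
from haystack.utils import Secret
from haystack_integrations.components.embedders.nvidia import NvidiaTextEmbedder

# Resolve the key from the environment at construction time
embedder = NvidiaTextEmbedder(
    model="nvidia/llama-3.2-nv-embedqa-1b-v2",
    api_key=Secret.from_env_var("NVIDIA_API_KEY"),
)
```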
Installation
pip install nvidia-haystack
Components
This integration introduces the following components:
- NvidiaTextEmbedder: A component for embedding text using NVIDIA embedding models. For models that differentiate between query and document inputs, this component embeds the input query.
- NvidiaDocumentEmbedder: A component for embedding documents using NVIDIA embedding models.
- NvidiaGenerator: A component for generating text using NVIDIA generative models.
- NvidiaChatGenerator: A component for chat completion using NVIDIA-hosted models. Takes a list of ChatMessage objects and returns ChatMessage replies.
- NvidiaRanker: A component for ranking documents using NVIDIA reranking models.
Usage
NvidiaTextEmbedder
from haystack_integrations.components.embedders.nvidia import NvidiaTextEmbedder
text_to_embed = "I love pizza!"
text_embedder = NvidiaTextEmbedder(model="nvidia/llama-3.2-nv-embedqa-1b-v2")
text_embedder.warm_up()
print(text_embedder.run(text_to_embed))
# {'embedding': [-0.02264290489256382, -0.03457780182361603, ...], 'meta': {...}}
NvidiaDocumentEmbedder
from haystack.dataclasses import Document
from haystack_integrations.components.embedders.nvidia import NvidiaDocumentEmbedder
documents = [
Document(content="Pizza is made with dough and cheese"),
Document(content="Cake is made with flour and sugar"),
Document(content="Omelet is made with eggs"),
]
document_embedder = NvidiaDocumentEmbedder(model="nvidia/llama-3.2-nv-embedqa-1b-v2")
document_embedder.warm_up()
document_embedder.run(documents=documents)
# {'documents': [Document(id=..., content: 'Pizza is made with dough and cheese', embedding: vector of size 2048), ...], 'meta': {'usage': {'prompt_tokens': 36, 'total_tokens': 36}}}
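As the output above shows, each returned Document carries its vector in the embedding attribute, so you can inspect the result directly:

```python
result = document_embedder.run(documents=documents)
# Each Document now exposes its vector via .embedding (size 2048 for this model)
print(len(result["documents"][0].embedding))
```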
NvidiaGenerator
from haystack_integrations.components.generators.nvidia import NvidiaGenerator
generator = NvidiaGenerator(
model="meta/llama-3.1-70b-instruct",
model_arguments={
"temperature": 0.2,
"top_p": 0.7,
"max_tokens": 1024,
},
)
generator.warm_up()
result = generator.run(prompt="When was the Golden Gate Bridge built?")
print(result["replies"])
print(result["meta"])
# ['The Golden Gate Bridge was built between 1933 and 1937...']
# [{'role': 'assistant', 'finish_reason': 'stop'}]
NvidiaChatGenerator
from haystack.dataclasses import ChatMessage
from haystack.utils import Secret
from haystack_integrations.components.generators.nvidia import NvidiaChatGenerator
generator = NvidiaChatGenerator(
model="meta/llama-3.1-8b-instruct",
api_key=Secret.from_env_var("NVIDIA_API_KEY"),
)
messages = [ChatMessage.from_user("What's Natural Language Processing? Be brief.")]
result = generator.run(messages)
print(result["replies"])
print(result["meta"])
NvidiaRanker
from haystack import Document
from haystack.utils import Secret
from haystack_integrations.components.rankers.nvidia import NvidiaRanker
ranker = NvidiaRanker(
api_key=Secret.from_env_var("NVIDIA_API_KEY"),
)
ranker.warm_up()
query = "What is the capital of Germany?"
documents = [
Document(content="Berlin is the capital of Germany."),
Document(content="The capital of Germany is Berlin."),
Document(content="Germany's capital is Berlin."),
]
result = ranker.run(query, documents, top_k=1)
print(result["documents"][0].content)
# The capital of Germany is Berlin.
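The example above relies on the ranker's default model. To pin a specific reranking model, pass it explicitly; the model name below is one example from the API Catalog:

```python
ranker = NvidiaRanker(
    model="nvidia/llama-3.2-nv-rerankqa-1b-v2",  # example reranking model from the API Catalog
    api_key=Secret.from_env_var("NVIDIA_API_KEY"),
)
ranker.warm_up()
```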
Self-host with NVIDIA NIM
When you are ready to deploy your AI application, you can self-host models with NVIDIA NIM. For more information, refer to NVIDIA NIM Microservices.
The following code connects to locally hosted NIM microservices:
from haystack_integrations.components.generators.nvidia import NvidiaChatGenerator
# Connect to a chat NIM running at localhost:8000
generator = NvidiaChatGenerator(
base_url="http://localhost:8000/v1",
model="meta/llama-3.1-8b-instruct",
)
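The same approach works for the other components against their respective NIM containers. As a sketch, the text embedder can point at a local embedding NIM (localhost:8080 is a hypothetical endpoint here, and api_url is assumed to be the embedder's endpoint parameter):

```python
from haystack_integrations.components.embedders.nvidia import NvidiaTextEmbedder

# Connect to an embedding NIM running locally (hypothetical endpoint)
embedder = NvidiaTextEmbedder(
    api_url="http://localhost:8080/v1",
    model="nvidia/llama-3.2-nv-embedqa-1b-v2",
)
embedder.warm_up()
```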
Use NVIDIA components in Haystack pipelines
Indexing pipeline
from haystack import Pipeline
from haystack.dataclasses import Document
from haystack.components.writers import DocumentWriter
from haystack.document_stores.in_memory import InMemoryDocumentStore
from haystack_integrations.components.embedders.nvidia import NvidiaDocumentEmbedder
documents = [
Document(content="Tilde lives in San Francisco"),
Document(content="Tuana lives in Amsterdam"),
Document(content="Bilge lives in Istanbul"),
]
document_store = InMemoryDocumentStore()
document_embedder = NvidiaDocumentEmbedder(model="nvidia/llama-3.2-nv-embedqa-1b-v2")
writer = DocumentWriter(document_store=document_store)
indexing_pipeline = Pipeline()
indexing_pipeline.add_component(instance=document_embedder, name="document_embedder")
indexing_pipeline.add_component(instance=writer, name="writer")
indexing_pipeline.connect("document_embedder.documents", "writer.documents")
indexing_pipeline.run(data={"document_embedder": {"documents": documents}})
# Calling filter_documents with an empty filter returns all documents in the store
print(document_store.filter_documents({}))
RAG query pipeline
from haystack import Pipeline
from haystack.document_stores.in_memory import InMemoryDocumentStore
from haystack.components.retrievers.in_memory import InMemoryEmbeddingRetriever
from haystack.components.builders import PromptBuilder
from haystack_integrations.components.generators.nvidia import NvidiaGenerator
from haystack_integrations.components.embedders.nvidia import NvidiaTextEmbedder
prompt = """Answer the query, based on the content in the documents.
If you can't answer based on the given documents, say so.
Documents:
{% for doc in documents %}
{{doc.content}}
{% endfor %}
Query: {{query}}
"""
text_embedder = NvidiaTextEmbedder(model="nvidia/llama-3.2-nv-embedqa-1b-v2")
retriever = InMemoryEmbeddingRetriever(document_store=document_store)
prompt_builder = PromptBuilder(template=prompt)
generator = NvidiaGenerator(model="meta/llama-3.1-70b-instruct")
generator.warm_up()
rag_pipeline = Pipeline()
rag_pipeline.add_component(instance=text_embedder, name="text_embedder")
rag_pipeline.add_component(instance=retriever, name="retriever")
rag_pipeline.add_component(instance=prompt_builder, name="prompt_builder")
rag_pipeline.add_component(instance=generator, name="generator")
rag_pipeline.connect("text_embedder.embedding", "retriever.query_embedding")
rag_pipeline.connect("retriever.documents", "prompt_builder.documents")
rag_pipeline.connect("prompt_builder", "generator")
question = "Who lives in San Francisco?"
result = rag_pipeline.run(
data={
"text_embedder": {"text": question},
"prompt_builder": {"query": question},
}
)
print(result)
# {'text_embedder': {'meta': {'usage': {'prompt_tokens': 10, 'total_tokens': 10}}}, 'generator': {'replies': ['Tilde'], 'meta': [{'role': 'assistant', 'finish_reason': 'stop'}], 'usage': {'completion_tokens': 3, 'prompt_tokens': 101, 'total_tokens': 104}}}
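The pipeline output nests each component's results under the component's name, so the generated answer lives under the generator key:

```python
# Extract just the generated answer from the nested pipeline output
print(result["generator"]["replies"][0])
# Tilde
```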
License
nvidia-haystack is distributed under the terms of the
Apache-2.0 license.
