# Sparse Embedding Retrieval with Qdrant and FastEmbed


In this notebook, we will see how to use Sparse Embedding Retrieval techniques (such as SPLADE) in Haystack.

We will use the Qdrant Document Store and FastEmbed Sparse Embedders.

## Why SPLADE?

- Sparse Keyword-Based Retrieval (based on the BM25 algorithm or similar ones) is simple, fast, and requires few resources, but it relies on lexical matching and struggles to capture semantic meaning.
- Dense Embedding-Based Retrieval takes semantics into account but requires considerable computational resources, usually does not work well on novel domains, and does not consider precise wording.

While good results can be achieved by combining the two approaches ([tutorial](https://haystack.deepset.ai/tutorials/33_hybrid_retrieval)), SPLADE (Sparse Lexical and Expansion Model for Information Retrieval) introduces a new method that encapsulates the positive aspects of both techniques.
In particular, SPLADE uses Language Models like BERT to weigh the relevance of different terms in the query and perform automatic term expansions, reducing the vocabulary mismatch problem (queries and relevant documents often lack term overlap).
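To illustrate how term expansion reduces vocabulary mismatch, here is a minimal toy sketch. The term weights are invented for illustration (they are not produced by a real SPLADE model), and `sparse_dot` is a hypothetical helper, not part of Haystack or FastEmbed.

```python
# Toy illustration only: all weights below are invented, not real model output.

def sparse_dot(query, doc):
    """Dot product between two sparse vectors stored as {term: weight} dicts."""
    return sum(weight * doc[term] for term, weight in query.items() if term in doc)

# A document that talks about capybaras but never uses the word "live"
doc_terms = {"capybara": 1.8, "habitat": 1.2, "savanna": 0.9, "river": 0.7}

# Purely lexical query: only the literal query terms
lexical_query = {"capybara": 1.0, "live": 1.0}

# Expanded query: the literal terms are re-weighted and related terms are added
expanded_query = {"capybara": 1.5, "live": 0.8, "habitat": 0.9, "location": 0.6}

lexical_score = sparse_dot(lexical_query, doc_terms)
expanded_score = sparse_dot(expanded_query, doc_terms)

# The expanded query also matches on "habitat", so it scores higher
print(f"lexical: {lexical_score:.2f}, expanded: {expanded_score:.2f}")
```

In this toy case, the expanded query matches the document on "habitat" even though the original question never mentions that word.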

Main features:
- Better than dense embedding Retrievers on precise keyword matching
- Better than BM25 on semantic matching
- Slower than BM25
- Still experimental compared to both BM25 and dense embeddings: few models; supported by few Document Stores

**Resources**
- [SPLADE for Sparse Vector Search Explained - great guide by Pinecone](https://www.pinecone.io/learn/splade/)
- [SPLADE GitHub repository, with links to all related papers](https://github.com/naver/splade)

## Install dependencies

In [None]:
!pip install -U fastembed-haystack qdrant-haystack wikipedia transformers

## Sparse Embedding Retrieval

### Indexing

#### Create a Qdrant Document Store

In [2]:
from haystack_integrations.document_stores.qdrant import QdrantDocumentStore

document_store = QdrantDocumentStore(
    ":memory:",
    recreate_index=True,
    return_embedding=True,
    use_sparse_embeddings=True  # must be True, otherwise the collection schema cannot store sparse vectors
)

#### Download Wikipedia pages and create raw documents

We download a few Wikipedia pages about animals and create Haystack documents from them.

In [3]:
import wikipedia
from haystack.dataclasses import Document

nice_animals = ["Capybara", "Dolphin"]

raw_docs = []
for title in nice_animals:
    page = wikipedia.page(title=title, auto_suggest=False)
    doc = Document(content=page.content, meta={"title": page.title, "url": page.url})
    raw_docs.append(doc)

#### Initialize a `FastembedSparseDocumentEmbedder`

The `FastembedSparseDocumentEmbedder` enriches a list of documents with their sparse embeddings.

We are using `prithvida/Splade_PP_en_v1`, a good sparse embedding model with a permissive license.

We also want to embed the title of the document, because it contains relevant information.

For more customization options, refer to the [docs](https://docs.haystack.deepset.ai/docs/fastembedsparsedocumentembedder).
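To make the effect of embedding the title more concrete, here is a minimal sketch of how a metadata field could be combined with the document content before the text is embedded. The helper `build_text_to_embed` and its newline separator are assumptions made for illustration; this is not the embedder's actual code.

```python
# Hypothetical helper: illustrates the idea of prepending metadata to the content.
def build_text_to_embed(content, meta, meta_fields_to_embed, separator="\n"):
    """Join the selected metadata values and the content into one string."""
    parts = [
        str(meta[field])
        for field in meta_fields_to_embed
        if meta.get(field) is not None  # skip fields that are missing or empty
    ]
    parts.append(content)
    return separator.join(parts)


text = build_text_to_embed(
    content="The capybara is a giant rodent.",
    meta={"title": "Capybara", "url": "https://en.wikipedia.org/wiki/Capybara"},
    meta_fields_to_embed=["title"],
)
print(text)
```

With this approach, a chunk that never mentions the animal by name still carries the word "Capybara" in the text that gets embedded.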

In [4]:
from haystack_integrations.components.embedders.fastembed import FastembedSparseDocumentEmbedder

sparse_doc_embedder = FastembedSparseDocumentEmbedder(model="prithvida/Splade_PP_en_v1",
                                                      meta_fields_to_embed=["title"])
sparse_doc_embedder.warm_up()

# let's try the embedder
print(sparse_doc_embedder.run(documents=[Document(content="An example document")]))

Calculating sparse embeddings: 100%|██████████| 1/1 [00:00<00:00, 12.05it/s]

{'documents': [Document(id=cd69a8e89f3c179f243c483a337c5ecb178c58373a253e461a64545b669de12d, content: 'An example document', sparse_embedding: vector with 19 non-zero elements)]}





#### Indexing pipeline

In [5]:
from haystack.components.preprocessors import DocumentCleaner, DocumentSplitter
from haystack.components.writers import DocumentWriter
from haystack.document_stores.types import DuplicatePolicy
from haystack import Pipeline

In [6]:
indexing = Pipeline()
indexing.add_component("cleaner", DocumentCleaner())
indexing.add_component("splitter", DocumentSplitter(split_by='sentence', split_length=4))
indexing.add_component("sparse_doc_embedder", sparse_doc_embedder)
indexing.add_component("writer", DocumentWriter(document_store=document_store, policy=DuplicatePolicy.OVERWRITE))

indexing.connect("cleaner", "splitter")
indexing.connect("splitter", "sparse_doc_embedder")
indexing.connect("sparse_doc_embedder", "writer")

<haystack.core.pipeline.pipeline.Pipeline object at 0x7f21068632e0>
🚅 Components
  - cleaner: DocumentCleaner
  - splitter: DocumentSplitter
  - sparse_doc_embedder: FastembedSparseDocumentEmbedder
  - writer: DocumentWriter
🛤️ Connections
  - cleaner.documents -> splitter.documents (List[Document])
  - splitter.documents -> sparse_doc_embedder.documents (List[Document])
  - sparse_doc_embedder.documents -> writer.documents (List[Document])

#### Let's index our documents!
⚠️ If you are running this notebook on Google Colab, note that Colab provides only 2 CPU cores, so sparse embedding generation may be slower than on a standard machine.

In [7]:
indexing.run({"documents":raw_docs})

Calculating sparse embeddings: 100%|██████████| 152/152 [02:29<00:00,  1.02it/s]
200it [00:00, 2418.48it/s]


{'writer': {'documents_written': 152}}

In [8]:
document_store.count_documents()

152

### Retrieval

#### Retrieval pipeline

Now, we create a simple retrieval Pipeline:
- `FastembedSparseTextEmbedder`: transforms the query into a sparse embedding
- `QdrantSparseEmbeddingRetriever`: looks for relevant documents, based on the similarity of the sparse embeddings

In [9]:
from haystack import Pipeline
from haystack_integrations.components.retrievers.qdrant import QdrantSparseEmbeddingRetriever
from haystack_integrations.components.embedders.fastembed import FastembedSparseTextEmbedder

sparse_text_embedder = FastembedSparseTextEmbedder(model="prithvida/Splade_PP_en_v1")

query_pipeline = Pipeline()
query_pipeline.add_component("sparse_text_embedder", sparse_text_embedder)
query_pipeline.add_component("sparse_retriever", QdrantSparseEmbeddingRetriever(document_store=document_store))

query_pipeline.connect("sparse_text_embedder.sparse_embedding", "sparse_retriever.query_sparse_embedding")

<haystack.core.pipeline.pipeline.Pipeline object at 0x7f21067cf3d0>
🚅 Components
  - sparse_text_embedder: FastembedSparseTextEmbedder
  - sparse_retriever: QdrantSparseEmbeddingRetriever
🛤️ Connections
  - sparse_text_embedder.sparse_embedding -> sparse_retriever.query_sparse_embedding (SparseEmbedding)

#### Try the retrieval pipeline

In [10]:
question = "Where do capybaras live?"

results = query_pipeline.run({"sparse_text_embedder": {"text": question}})

Calculating sparse embeddings: 100%|██████████| 1/1 [00:00<00:00,  9.02it/s]


In [11]:
import rich

for d in results['sparse_retriever']['documents']:
  rich.print(f"\nid: {d.id}\n{d.content}\nscore: {d.score}\n---")

## Understanding SPLADE vectors

(Inspiration: [FastEmbed SPLADE notebook](https://qdrant.github.io/fastembed/examples/SPLADE_with_FastEmbed))

We have seen that our model encodes text into a sparse vector (= a vector with many zeros).
An efficient representation of sparse vectors is to save the indices and values of nonzero elements.
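As a minimal sketch of this representation, the snippet below converts a small dense vector into index and value lists and computes a dot product on them. The vectors are made up and tiny; a real SPLADE vector has one position per vocabulary token. The helpers `to_sparse` and `sparse_dot` are hypothetical names used only here.

```python
def to_sparse(dense):
    """Keep only the positions and values of the non-zero entries."""
    indices = [i for i, value in enumerate(dense) if value != 0.0]
    values = [dense[i] for i in indices]
    return indices, values


def sparse_dot(indices_a, values_a, indices_b, values_b):
    """Dot product that only visits the non-zero positions."""
    lookup = dict(zip(indices_b, values_b))
    return sum(value * lookup.get(index, 0.0) for index, value in zip(indices_a, values_a))


# Made-up vectors over a toy vocabulary of 8 positions
dense_query = [0.0, 0.0, 1.5, 0.0, 0.0, 0.5, 0.0, 0.0]
dense_doc = [0.0, 2.0, 1.0, 0.0, 0.0, 0.0, 0.0, 3.0]

q_indices, q_values = to_sparse(dense_query)
d_indices, d_values = to_sparse(dense_doc)

print(q_indices, q_values)
print(d_indices, d_values)
print(sparse_dot(q_indices, q_values, d_indices, d_values))
```

Only position 2 is non-zero in both vectors, so it is the only one that contributes to the similarity score.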

Let's try to understand what information resides in these vectors...

In [12]:
question = "Where do capybaras live?"
sparse_embedding = sparse_text_embedder.run(text=question)["sparse_embedding"]
rich.print(sparse_embedding.to_dict())

Calculating sparse embeddings: 100%|██████████| 1/1 [00:00<00:00, 10.06it/s]


In [13]:
from transformers import AutoTokenizer

# we need the tokenizer vocabulary
tokenizer = AutoTokenizer.from_pretrained("Qdrant/Splade_PP_en_v1") # ONNX export of the original model

In [14]:
def get_tokens_and_weights(sparse_embedding, tokenizer):
    """Map each non-zero index of a sparse embedding to its vocabulary token and weight."""
    token_weight_dict = {}
    for index, weight in zip(sparse_embedding.indices, sparse_embedding.values):
        # each index is a token id in the tokenizer vocabulary
        token = tokenizer.decode([index])
        token_weight_dict[token] = weight

    # Sort the dictionary by weight, from the highest to the lowest
    token_weight_dict = dict(sorted(token_weight_dict.items(), key=lambda item: item[1], reverse=True))
    return token_weight_dict


rich.print(get_tokens_and_weights(sparse_embedding, tokenizer))

Very nice! 🦫

- tokens are ordered by relevance
- the query is expanded with relevant tokens/terms: "location", "habitat"...

## Hybrid Retrieval

Ideally, techniques like SPLADE are intended to replace other approaches (BM25 and Dense Embedding Retrieval) and their combinations.

However, sometimes it may make sense to combine, for example, Dense Embedding Retrieval and Sparse Embedding Retrieval. You can find some positive examples in the appendix of this paper ([An Analysis of Fusion Functions for Hybrid Retrieval](https://arxiv.org/abs/2210.11934)).
Make sure this works for your use case and conduct an evaluation.

---

Below we show how to create such an application in Haystack.

In the example, we use the Qdrant Hybrid Retriever: it compares the dense and sparse embeddings of the query with those of the documents and retrieves the most relevant documents, merging the scores with Reciprocal Rank Fusion.

If you want to customize the behavior more, see Hybrid Retrieval Pipelines ([tutorial](https://haystack.deepset.ai/tutorials/33_hybrid_retrieval)).
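For intuition about Reciprocal Rank Fusion, here is a minimal sketch of the general technique: each document receives `1 / (k + rank)` from every ranking it appears in, and the contributions are summed. The function name, the document ids, and the constant `k=60` are assumptions for illustration; the constant and details used inside Qdrant may differ.

```python
from collections import defaultdict


def reciprocal_rank_fusion(rankings, k=60):
    """Fuse several ranked lists of document ids into one scored ranking.

    k is a smoothing constant (an assumed value here, not necessarily Qdrant's).
    """
    scores = defaultdict(float)
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] += 1.0 / (k + rank)
    # Highest fused score first
    return sorted(scores.items(), key=lambda item: item[1], reverse=True)


# Hypothetical results from a dense and a sparse retriever
dense_ranking = ["doc_a", "doc_b", "doc_c"]
sparse_ranking = ["doc_b", "doc_d", "doc_a"]

fused = reciprocal_rank_fusion([dense_ranking, sparse_ranking])
for doc_id, score in fused:
    print(f"{doc_id}: {score:.5f}")
```

Documents that rank well in both lists (here `doc_b` and `doc_a`) end up ahead of documents found by only one retriever, without needing to normalize the raw dense and sparse scores.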



In [15]:
from haystack_integrations.document_stores.qdrant import QdrantDocumentStore
from haystack.components.preprocessors import DocumentCleaner, DocumentSplitter
from haystack.components.writers import DocumentWriter
from haystack_integrations.components.embedders.fastembed import FastembedSparseDocumentEmbedder, FastembedDocumentEmbedder
from haystack.document_stores.types import DuplicatePolicy
from haystack import Pipeline

In [16]:
document_store = QdrantDocumentStore(
    ":memory:",
    recreate_index=True,
    return_embedding=True,
    use_sparse_embeddings=True,
    embedding_dim=384  # size of the dense vectors produced by the dense embedding model used below
)

In [17]:
hybrid_indexing = Pipeline()
hybrid_indexing.add_component("cleaner", DocumentCleaner())
hybrid_indexing.add_component("splitter", DocumentSplitter(split_by='sentence', split_length=4))
hybrid_indexing.add_component("sparse_doc_embedder", FastembedSparseDocumentEmbedder(model="prithvida/Splade_PP_en_v1", meta_fields_to_embed=["title"]))
hybrid_indexing.add_component("dense_doc_embedder", FastembedDocumentEmbedder(model="BAAI/bge-small-en-v1.5", meta_fields_to_embed=["title"]))
hybrid_indexing.add_component("writer", DocumentWriter(document_store=document_store, policy=DuplicatePolicy.OVERWRITE))

hybrid_indexing.connect("cleaner", "splitter")
hybrid_indexing.connect("splitter", "sparse_doc_embedder")
hybrid_indexing.connect("sparse_doc_embedder", "dense_doc_embedder")
hybrid_indexing.connect("dense_doc_embedder", "writer")

<haystack.core.pipeline.pipeline.Pipeline object at 0x7f1fe8292170>
🚅 Components
  - cleaner: DocumentCleaner
  - splitter: DocumentSplitter
  - sparse_doc_embedder: FastembedSparseDocumentEmbedder
  - dense_doc_embedder: FastembedDocumentEmbedder
  - writer: DocumentWriter
🛤️ Connections
  - cleaner.documents -> splitter.documents (List[Document])
  - splitter.documents -> sparse_doc_embedder.documents (List[Document])
  - sparse_doc_embedder.documents -> dense_doc_embedder.documents (List[Document])
  - dense_doc_embedder.documents -> writer.documents (List[Document])

In [18]:
hybrid_indexing.run({"documents":raw_docs})

Calculating sparse embeddings: 100%|██████████| 152/152 [02:14<00:00,  1.13it/s]
Calculating embeddings: 100%|██████████| 152/152 [00:41<00:00,  3.68it/s]
200it [00:00, 655.45it/s]


{'writer': {'documents_written': 152}}

In [19]:
document_store.filter_documents()[0]

Document(id=5e2d65ac05a8a238b359773c3d855e026aca6e617df8a011964b401d8b242a1e, content: ' Overall, they tend to be dwarfed by other Cetartiodactyls. Several species have female-biased sexua...', meta: {'title': 'Dolphin', 'url': 'https://en.wikipedia.org/wiki/Dolphin', 'source_id': '6584a10fad50d363f203669ff6efc19e7ae2a5a28ca9351f5cceb5ba88f8e847'}, embedding: vector of size 384, sparse_embedding: vector with 129 non-zero elements)

In [20]:
from haystack_integrations.components.retrievers.qdrant import QdrantHybridRetriever
from haystack_integrations.components.embedders.fastembed import FastembedSparseTextEmbedder, FastembedTextEmbedder


hybrid_query = Pipeline()
hybrid_query.add_component("sparse_text_embedder", FastembedSparseTextEmbedder(model="prithvida/Splade_PP_en_v1"))
hybrid_query.add_component("dense_text_embedder", FastembedTextEmbedder(model="BAAI/bge-small-en-v1.5", prefix="Represent this sentence for searching relevant passages: "))
hybrid_query.add_component("retriever", QdrantHybridRetriever(document_store=document_store))

hybrid_query.connect("sparse_text_embedder.sparse_embedding", "retriever.query_sparse_embedding")
hybrid_query.connect("dense_text_embedder.embedding", "retriever.query_embedding")

<haystack.core.pipeline.pipeline.Pipeline object at 0x7f1fe8293190>
🚅 Components
  - sparse_text_embedder: FastembedSparseTextEmbedder
  - dense_text_embedder: FastembedTextEmbedder
  - retriever: QdrantHybridRetriever
🛤️ Connections
  - sparse_text_embedder.sparse_embedding -> retriever.query_sparse_embedding (SparseEmbedding)
  - dense_text_embedder.embedding -> retriever.query_embedding (List[float])

In [21]:
question = "Where do capybaras live?"

results = hybrid_query.run(
    {"dense_text_embedder": {"text": question},
     "sparse_text_embedder": {"text": question}}
)

Calculating sparse embeddings: 100%|██████████| 1/1 [00:00<00:00,  9.95it/s]
Calculating embeddings: 100%|██████████| 1/1 [00:00<00:00, 12.05it/s]


In [22]:
import rich

for d in results['retriever']['documents']:
  rich.print(f"\nid: {d.id}\n{d.content}\nscore: {d.score}\n---")

## 📚 Docs on Sparse Embedding support in Haystack
- [Retrievers](https://docs.haystack.deepset.ai/docs/retrievers)
- [Qdrant Sparse Embedding Retriever](https://docs.haystack.deepset.ai/docs/qdrantsparseembeddingretriever)
- [Qdrant Hybrid Retriever](https://docs.haystack.deepset.ai/docs/qdranthybridretriever)
- [FastEmbed Sparse Text Embedder](https://docs.haystack.deepset.ai/docs/fastembedsparsetextembedder)
- [FastEmbed Sparse Document Embedder](https://docs.haystack.deepset.ai/docs/fastembedsparsedocumentembedder)

(*Notebook by [Stefano Fiorucci](https://github.com/anakin87)*)