Maintained by deepset

Integration: PaddleOCR

Use PaddleOCR’s text-recognition and document-parsing capabilities with Haystack

Authors
deepset

Overview

PaddleOCR converts documents and images into structured, AI-friendly data (such as JSON and Markdown) with high accuracy. It supports AI applications for users ranging from indie developers and startups to large enterprises.

This integration allows you to use PaddleOCR’s text-recognition and document-parsing capabilities with Haystack.

Components

Initialization

Every component of the PaddleOCR integration requires an access token from PaddlePaddle AI Studio. By default, authentication uses the AISTUDIO_ACCESS_TOKEN environment variable. You can also provide an access_token when initializing each component. The AI Studio access token can be obtained from this page.
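
The lookup order described above can be sketched in plain Python. This is an illustrative helper only, not the integration's actual code: resolve_access_token is a hypothetical name, and the sketch assumes that an explicitly passed token takes precedence over the AISTUDIO_ACCESS_TOKEN environment variable.

```python
import os
from typing import Optional

# Environment variable name taken from the text above.
ENV_VAR = "AISTUDIO_ACCESS_TOKEN"


def resolve_access_token(access_token: Optional[str] = None) -> str:
    """Hypothetical helper: prefer an explicit token, else fall back to the env var."""
    if access_token:
        return access_token
    token = os.environ.get(ENV_VAR)
    if not token:
        # Fail early with a clear message instead of sending an unauthenticated request.
        raise ValueError(f"Set {ENV_VAR} or pass access_token explicitly.")
    return token
```

A check like this can be useful at application start-up, so that a missing token is reported before any document is sent for conversion.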

Installation

pip install paddleocr-haystack

Usage

How to use the PaddleOCRVLDocumentConverter

To start, visit the official PaddleOCR website, click the API button in the upper-left corner, choose the example code for Large Model document parsing (PaddleOCR-VL), and copy the API_URL value.

Basic usage with a local file:

from pathlib import Path
from haystack.utils import Secret
from haystack_integrations.components.converters.paddleocr import PaddleOCRVLDocumentConverter

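# api_url is the API_URL copied from the PaddleOCR website;
# the access token is read from the AISTUDIO_ACCESS_TOKEN environment variable.
converter = PaddleOCRVLDocumentConverter(
    api_url="<your-api-url>",
    access_token=Secret.from_env_var("AISTUDIO_ACCESS_TOKEN"),
)

# Convert a local PDF; the converted documents are returned under the "documents" key.
result = converter.run(sources=[Path("my_document.pdf")])
documents = result["documents"]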

Here’s an example of an indexing pipeline that runs OCR on PDFs and writes the resulting documents to a Document Store:

from haystack import Pipeline
from haystack.document_stores.in_memory import InMemoryDocumentStore
from haystack.components.preprocessors import DocumentCleaner, DocumentSplitter
from haystack.components.writers import DocumentWriter
from haystack.utils import Secret
from haystack_integrations.components.converters.paddleocr import PaddleOCRVLDocumentConverter

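# In-memory store that will hold the indexed documents.
document_store = InMemoryDocumentStore()

pipeline = Pipeline()
# OCR step: converts the source files into Haystack documents.
pipeline.add_component(
    "converter",
    PaddleOCRVLDocumentConverter(
        api_url="<your-api-url>",
        access_token=Secret.from_env_var("AISTUDIO_ACCESS_TOKEN"),
    ),
)
# Clean the extracted text, split it into one chunk per page, then write the chunks to the store.
pipeline.add_component("cleaner", DocumentCleaner())
pipeline.add_component("splitter", DocumentSplitter(split_by="page", split_length=1))
pipeline.add_component("writer", DocumentWriter(document_store=document_store))

# Wire the components in order: converter -> cleaner -> splitter -> writer.
pipeline.connect("converter", "cleaner")
pipeline.connect("cleaner", "splitter")
pipeline.connect("splitter", "writer")

# Run the pipeline on the files to index.
file_paths = ["invoice.pdf", "receipt.pdf", "contract.pdf"]
pipeline.run({"converter": {"sources": file_paths}})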
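
The list of sources in the pipeline example above is hard-coded. As a minimal sketch, the list could instead be built from a folder of PDFs using only the standard library. The helper name collect_pdf_paths and the folder layout are assumptions for illustration and are not part of the integration.

```python
from pathlib import Path
from typing import List


def collect_pdf_paths(folder: str) -> List[Path]:
    """Hypothetical helper: return the PDF files directly inside `folder`, sorted by path."""
    return sorted(
        p
        for p in Path(folder).iterdir()
        # Skip sub-directories and match the extension case-insensitively.
        if p.is_file() and p.suffix.lower() == ".pdf"
    )
```

The returned list of Path objects could then be passed as the sources value when running the pipeline, in the same way the basic usage example passes a Path to the converter.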

License

paddleocr-haystack is distributed under the terms of the Apache-2.0 license.