Maintained by deepset

Integration: Azure Document Intelligence

Use Azure Document Intelligence with Haystack

Authors
deepset

Overview

AzureDocumentIntelligenceConverter provides an integration of Azure Document Intelligence (formerly Form Recognizer) with Haystack by deepset.

This component uses Azure's Document Intelligence service to convert various file formats into Haystack Documents with markdown content. It supports advanced document analysis including layout detection, table extraction, and structured content recognition.

Supported file formats: PDF, JPEG, PNG, BMP, TIFF, DOCX, XLSX, PPTX, HTML.
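When a folder mixes supported and unsupported files, it can help to filter the paths before handing them to the converter. The following is a minimal sketch using only the standard library. The helper name `split_by_support` is hypothetical, and the `.jpg`, `.tif` suffixes are assumed aliases for the JPEG and TIFF formats listed above.

```python
from pathlib import Path

# Suffixes derived from the supported formats listed above.
# ".jpg" and ".tif" are assumed aliases, not taken from the integration docs.
SUPPORTED_SUFFIXES = {
    ".pdf", ".jpeg", ".jpg", ".png", ".bmp",
    ".tiff", ".tif", ".docx", ".xlsx", ".pptx", ".html",
}


def split_by_support(paths):
    """Split paths into (supported, skipped) based on the file suffix."""
    supported, skipped = [], []
    for path in paths:
        # Compare case-insensitively so "scan.PNG" is accepted.
        if Path(path).suffix.lower() in SUPPORTED_SUFFIXES:
            supported.append(path)
        else:
            skipped.append(path)
    return supported, skipped


supported, skipped = split_by_support(["invoice.pdf", "scan.PNG", "notes.txt"])
print(supported)  # ['invoice.pdf', 'scan.PNG']
print(skipped)    # ['notes.txt']
```

The `supported` list can then be passed as `sources` to the converter.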

Key features:

  • Markdown output with preserved structure (headings, tables, lists)
  • Inline table integration (tables rendered as markdown tables)
  • Improved layout analysis and reading order
  • Support for section headings
  • Multiple model options for different use cases
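Because tables arrive inline in the markdown content, downstream code may want to pull them back out as rows and cells. The sketch below is one possible way to do that with plain string handling. It assumes pipe-style markdown tables and does not handle escaped pipes or HTML tables. The function name `extract_markdown_tables` and the sample text are illustrative only, not output from the service.

```python
def extract_markdown_tables(markdown):
    """Return each pipe table in `markdown` as a list of rows of cell strings."""
    tables, current = [], []
    for line in markdown.splitlines():
        stripped = line.strip()
        if len(stripped) > 1 and stripped.startswith("|") and stripped.endswith("|"):
            cells = [cell.strip() for cell in stripped[1:-1].split("|")]
            # Skip the header separator row, e.g. "| --- | :---: |".
            if all(cell and set(cell) <= set("-:") for cell in cells):
                continue
            current.append(cells)
        elif current:
            # A non-table line ends the table collected so far.
            tables.append(current)
            current = []
    if current:
        tables.append(current)
    return tables


# Hypothetical markdown, standing in for documents[0].content.
sample = """# Invoice

| Item | Qty |
| --- | --- |
| Widget | 2 |
| Gadget | 5 |

Total due on receipt.
"""

for row in extract_markdown_tables(sample)[0]:
    print(row)
```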

Installation

Install the Azure Document Intelligence integration:

pip install "azure-doc-intelligence-haystack"

Usage

To use the AzureDocumentIntelligenceConverter, you need an active Azure subscription with a deployed Document Intelligence or Cognitive Services resource. The converter authenticates with a service endpoint and an API key; the example reads them from the environment variables AZURE_DI_ENDPOINT and AZURE_DI_API_KEY.

import os
from haystack_integrations.components.converters.azure_doc_intelligence import (
    AzureDocumentIntelligenceConverter,
)
from haystack.utils import Secret

converter = AzureDocumentIntelligenceConverter(
    endpoint=os.environ["AZURE_DI_ENDPOINT"],
    api_key=Secret.from_env_var("AZURE_DI_API_KEY"),
)

results = converter.run(sources=["invoice.pdf", "contract.docx"])
documents = results["documents"]

# Documents contain markdown with inline tables
print(documents[0].content)
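The example above reads both credentials from the environment, so a missing variable surfaces as an error at construction time. A small pre-flight check, sketched below with the standard library only, can report what is missing up front. The helper `missing_settings` is a hypothetical name and not part of the integration.

```python
import os
from typing import Mapping

# The two environment variables used in the example above.
REQUIRED_SETTINGS = ("AZURE_DI_ENDPOINT", "AZURE_DI_API_KEY")


def missing_settings(env: Mapping[str, str] = os.environ) -> list:
    """Return the names of required settings that are unset or blank in `env`."""
    return [name for name in REQUIRED_SETTINGS if not env.get(name, "").strip()]


missing = missing_settings()
if missing:
    print("Set these environment variables first: " + ", ".join(missing))
```

Passing a plain dict instead of `os.environ` makes the check easy to exercise in tests.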

Model Options

The converter supports different Azure Document Intelligence models depending on your needs:

  • prebuilt-document (default): General document analysis with markdown output
  • prebuilt-read: Fast OCR for text extraction
  • prebuilt-layout: Enhanced layout analysis with better table and structure detection
  • Custom models: Use your own trained models by providing the model ID

# Use a specific model
converter = AzureDocumentIntelligenceConverter(
    endpoint=os.environ["AZURE_DI_ENDPOINT"],
    api_key=Secret.from_env_var("AZURE_DI_API_KEY"),
    model_id="prebuilt-layout",  # Enhanced layout analysis
)
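If the model choice depends on configuration, one option is to map a short use-case label to the model IDs listed above and pass the result as `model_id`. This is only a sketch: the labels (`"general"`, `"ocr"`, `"layout"`) and the helper `resolve_model_id` are made up for illustration, and the custom model ID in the usage lines is a placeholder.

```python
# Model IDs taken from the list above; the dictionary keys are hypothetical labels.
MODEL_BY_NEED = {
    "general": "prebuilt-document",
    "ocr": "prebuilt-read",
    "layout": "prebuilt-layout",
}


def resolve_model_id(need="general", custom_model_id=None):
    """Pick a model ID: a custom model wins, otherwise look up the label."""
    if custom_model_id:
        return custom_model_id
    if need not in MODEL_BY_NEED:
        raise ValueError(
            f"Unknown need {need!r}; expected one of {sorted(MODEL_BY_NEED)}"
        )
    return MODEL_BY_NEED[need]


print(resolve_model_id("layout"))  # prebuilt-layout
print(resolve_model_id(custom_model_id="my-trained-model"))  # my-trained-model
```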

Metadata

The converter automatically adds metadata to each Document:

  • model_id: The Azure model used for analysis
  • page_count: Number of pages in the document
  • file_path: The source file path (filename only by default, or full path if store_full_path=True)

You can also provide custom metadata:

results = converter.run(
    sources=["document.pdf"],
    meta={"category": "legal", "priority": "high"}
)
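To make the metadata description concrete, the sketch below builds the dictionary you might expect on a Document from the fields described above. It is an illustration, not the converter's implementation: `expected_meta` is a hypothetical helper, the page count is supplied by hand, and letting the automatic fields win on a key collision is an arbitrary choice made for this example.

```python
from pathlib import Path


def expected_meta(source, model_id, page_count, store_full_path=False, extra=None):
    """Illustrate the documented metadata fields for one source file."""
    meta = dict(extra or {})  # custom metadata passed via `meta`
    meta.update(
        {
            "model_id": model_id,
            "page_count": page_count,
            # Filename only by default, full path if store_full_path=True.
            "file_path": str(source) if store_full_path else Path(source).name,
        }
    )
    return meta


print(
    expected_meta(
        "reports/2024/document.pdf",
        "prebuilt-layout",
        3,
        extra={"category": "legal", "priority": "high"},
    )
)
```

A helper like this can be handy in tests that assert on `document.meta` without calling the Azure service.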

For more details on Azure Document Intelligence capabilities and setup, refer to the Azure documentation.