Integration: Azure Document Intelligence
Use Azure Document Intelligence with Haystack
Overview
AzureDocumentIntelligenceConverter integrates Azure Document Intelligence (formerly Form Recognizer) with Haystack by deepset.
This component uses Azure’s Document Intelligence service to convert various file formats into Haystack Documents with markdown content. It supports advanced document analysis including layout detection, table extraction, and structured content recognition.
Supported file formats: PDF, JPEG, PNG, BMP, TIFF, DOCX, XLSX, PPTX, HTML.
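As a rough sketch, input paths can be screened against the formats listed above before they are sent to the service. The extension set and the `split_supported` helper are illustrative assumptions, not part of the integration, and the converter may apply its own checks.

```python
from pathlib import Path

# Assumed mapping from the listed formats to common file extensions
# (hypothetical; not taken from the integration's source).
SUPPORTED_EXTENSIONS = {
    ".pdf", ".jpg", ".jpeg", ".png", ".bmp", ".tif", ".tiff",
    ".docx", ".xlsx", ".pptx", ".html",
}


def split_supported(paths):
    """Return (supported, unsupported) lists, preserving input order."""
    supported, unsupported = [], []
    for path in paths:
        # Compare extensions case-insensitively so "SCAN.PDF" is accepted.
        if Path(path).suffix.lower() in SUPPORTED_EXTENSIONS:
            supported.append(path)
        else:
            unsupported.append(path)
    return supported, unsupported


ok, skipped = split_supported(["invoice.pdf", "notes.txt", "slides.PPTX"])
print("convert:", ok)
print("skip:", skipped)
```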
Key features:
- Markdown output with preserved structure (headings, tables, lists)
- Inline table integration (tables rendered as markdown tables)
- Improved layout analysis and reading order
- Support for section headings
- Multiple model options for different use cases
Installation
Install the Azure Document Intelligence integration:
pip install "azure-doc-intelligence-haystack"
Usage
To use the AzureDocumentIntelligenceConverter, you need an active Azure subscription with a deployed Document Intelligence or Cognitive Services resource. For authentication, provide the resource's service endpoint and an API key; the example below reads them from the AZURE_DI_ENDPOINT and AZURE_DI_API_KEY environment variables.
import os
from haystack_integrations.components.converters.azure_doc_intelligence import (
    AzureDocumentIntelligenceConverter,
)
from haystack.utils import Secret
converter = AzureDocumentIntelligenceConverter(
    endpoint=os.environ["AZURE_DI_ENDPOINT"],
    api_key=Secret.from_env_var("AZURE_DI_API_KEY"),
)
results = converter.run(sources=["invoice.pdf", "contract.docx"])
documents = results["documents"]
# Documents contain markdown with inline tables
print(documents[0].content)
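Because the example above reads both settings from the environment, a missing variable only surfaces as an error when the converter is constructed or run. A minimal pre-flight check might look like the following; `missing_env_vars` is a hypothetical helper and uses only the standard library.

```python
import os

# The two variable names used in the usage example above.
REQUIRED_VARS = ("AZURE_DI_ENDPOINT", "AZURE_DI_API_KEY")


def missing_env_vars(env=os.environ, required=REQUIRED_VARS):
    """Return the names of required variables that are unset or empty."""
    return [name for name in required if not env.get(name)]


missing = missing_env_vars()
if missing:
    print("Set these variables before creating the converter: " + ", ".join(missing))
```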
Model Options
The converter supports different Azure Document Intelligence models depending on your needs:
- prebuilt-document (default): General document analysis with markdown output
- prebuilt-read: Fast OCR for text extraction
- prebuilt-layout: Enhanced layout analysis with better table and structure detection
- Custom models: Use your own trained models by providing the model ID
# Use a specific model
converter = AzureDocumentIntelligenceConverter(
    endpoint=os.environ["AZURE_DI_ENDPOINT"],
    api_key=Secret.from_env_var("AZURE_DI_API_KEY"),
    model_id="prebuilt-layout",  # Enhanced layout analysis
)
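One way to keep the model choice out of call sites is a small lookup that maps a task to one of the model IDs listed above. The task labels and the `choose_model_id` helper are hypothetical names introduced for illustration; only the model IDs themselves come from this page.

```python
# Hypothetical task labels mapped to the prebuilt model IDs listed above.
MODEL_FOR_TASK = {
    "general": "prebuilt-document",
    "ocr": "prebuilt-read",
    "layout": "prebuilt-layout",
}


def choose_model_id(task="general", custom_model_id=None):
    """Return a model ID for the task; a custom model ID takes precedence."""
    if custom_model_id:
        return custom_model_id
    if task not in MODEL_FOR_TASK:
        raise ValueError(f"Unknown task {task!r}; expected one of {sorted(MODEL_FOR_TASK)}")
    return MODEL_FOR_TASK[task]


# The result would be passed as model_id=... when constructing the converter.
print(choose_model_id("layout"))
```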
Metadata
The converter automatically adds metadata to each Document:
- model_id: The Azure model used for analysis
- page_count: Number of pages in the document
- file_path: The source file path (filename only by default, or full path if store_full_path=True)
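The effect of store_full_path on the file_path entry can be pictured with a short standard-library sketch. This mimics the behavior described above and is not the integration's own code; `stored_file_path` is a hypothetical name.

```python
import os


def stored_file_path(path, store_full_path=False):
    """Return the value that would be kept as file_path under the rule above."""
    # Keep the path as given when the full path is requested,
    # otherwise keep only the final component (the filename).
    return path if store_full_path else os.path.basename(path)


print(stored_file_path("reports/2024/invoice.pdf"))
print(stored_file_path("reports/2024/invoice.pdf", store_full_path=True))
```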
You can also provide custom metadata:
results = converter.run(
    sources=["document.pdf"],
    meta={"category": "legal", "priority": "high"},
)
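Haystack converters commonly accept either a single metadata dictionary, applied to every source, or a list with one dictionary per source. Assuming this converter follows that convention (not confirmed on this page), the expansion can be sketched in plain Python; `expand_meta` is a hypothetical stand-in, not a function of the integration.

```python
def expand_meta(meta, n_sources):
    """Return one metadata dict per source.

    Assumed convention: None gives empty dicts, a single dict is copied
    for every source, and a list must have exactly one dict per source.
    """
    if meta is None:
        return [{} for _ in range(n_sources)]
    if isinstance(meta, dict):
        return [dict(meta) for _ in range(n_sources)]
    if len(meta) != n_sources:
        raise ValueError("meta list must have one entry per source")
    return [dict(entry) for entry in meta]


sources = ["invoice.pdf", "contract.docx"]
per_source = expand_meta({"category": "legal"}, len(sources))
for source, entry in zip(sources, per_source):
    print(source, entry)
```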
For more details on Azure Document Intelligence capabilities and setup, refer to the Azure documentation.
