Memory Usage Spike When Analyzing 5MB PDF with Document Intelligence SDK #38972

kokazaki03 · 2024-12-23T04:38:08Z

Package Name:
azure-ai-documentintelligence
Package Version:
1.0.0b4
Operating System:
macOS 14.7.1 (23H222)
Python Version:
Python 3.11.3

Describe the bug

I encountered an issue with the Document Intelligence SDK when analyzing a PDF file of approximately 5MB. Upon receiving the result using result: AnalyzeResult = poller.result(), the memory usage spiked to around 4GB. This increase in memory usage seems abnormal. The actual response, when output to a txt file, was only about 360MB in size.

Document Intelligence API Version: "2024-07-31-preview"
model_id: prebuilt-layout

To Reproduce

1. Use the Document Intelligence SDK to analyze a PDF file of approximately 5MB.
2. Retrieve the result using result: AnalyzeResult = poller.result().
3. Observe the memory usage during the process.

Expected behavior

The memory usage should not increase significantly and should be proportional to the size of the analyzed PDF file.

Actual behavior

The memory usage spiked to around 4GB, which seems disproportionate to the size of the PDF file and the resulting output.

Screenshots

import os
from dotenv import load_dotenv
from memory_profiler import profile
from local_blob_file_loader import LocalBlobFileLoader
from azure.core.credentials import AzureKeyCredential
from azure.ai.documentintelligence import DocumentIntelligenceClient
from azure.ai.documentintelligence.models import (
    AnalyzeDocumentRequest,
    ContentFormat,
    AnalyzeResult,
)


class DocumentIntelligenceAnalyzer:
    def __init__(self):
        self.di_endpoint = os.getenv("AZURE_DOCUMENT_INTELLIGENCE_ENDPOINT")
        self.di_key = os.getenv("AZURE_DOCUMENT_INTELLIGENCE_KEY")
        self.api_version = "2024-07-31-preview"
        self.document_intelligence_client = DocumentIntelligenceClient(
            endpoint=self.di_endpoint,
            credential=AzureKeyCredential(self.di_key),
            api_version=self.api_version,
        )
        self.blob_loader = LocalBlobFileLoader()

    @profile
    def extract_text_from_document_by_url(
        self, document_file_path: str, api_model_name: str = "prebuilt-layout"
    ) -> AnalyzeResult:
        request = AnalyzeDocumentRequest(url_source=document_file_path)
        poller = self.document_intelligence_client.begin_analyze_document(
            model_id=api_model_name,
            analyze_request=request,
            output_content_format=ContentFormat.MARKDOWN,
        )
        result: AnalyzeResult = poller.result()
        return result

    @profile
    def extract_text_from_document_binary(
        self, document_file_path: str
    ) -> AnalyzeResult:
        bytes_data = self.blob_loader.load_as_bytes(document_file_path)
        request = AnalyzeDocumentRequest(bytes_source=bytes_data)
        poller = self.document_intelligence_client.begin_analyze_document(
            model_id="prebuilt-layout",
            analyze_request=request,
            output_content_format=ContentFormat.MARKDOWN,
        )
        result: AnalyzeResult = poller.result()
        return result


@profile
def process():
    file_path = "data/some_file.pdf"
    di_analyzer = DocumentIntelligenceAnalyzer()
    result = di_analyzer.extract_text_from_document_binary(file_path)

    # save as text file
    new_extension = ".txt"
    file_name, _ = os.path.splitext(file_path)
    output_file = f"{file_name}{new_extension}"
    result_text = repr(result)
    create_text_file(file_path=output_file, text=result_text)


@profile
def create_text_file(file_path: str, text: str):
    with open(file_path, "w") as f:
        f.write(text)


if __name__ == "__main__":
    load_dotenv()
    process()

Line #    Mem usage    Increment  Occurrences   Line Contents
=============================================================
    38     38.7 MiB     38.7 MiB           1       @profile
    39                                             def extract_text_from_document_binary(
    40                                                 self, document_file_path: str
    41                                             ) -> AnalyzeResult:
    42     43.8 MiB      5.1 MiB           1           bytes_data = self.blob_loader.load_as_bytes(document_file_path)
    43     57.4 MiB     13.6 MiB           1           request = AnalyzeDocumentRequest(bytes_source=bytes_data)
    44     87.0 MiB     29.7 MiB           2           poller = self.document_intelligence_client.begin_analyze_document(
    45     57.4 MiB      0.0 MiB           1               model_id="prebuilt-layout",
    46     57.4 MiB      0.0 MiB           1               analyze_request=request,
    47     57.4 MiB      0.0 MiB           1               output_content_format=ContentFormat.MARKDOWN,
    48                                                 )
    49   5191.2 MiB   5104.2 MiB           1           result: AnalyzeResult = poller.result()
    50   5191.3 MiB      0.0 MiB           1           return result


Filename: main.py

Line #    Mem usage    Increment  Occurrences   Line Contents
=============================================================
    21   4341.2 MiB   4341.2 MiB           1   @profile
    22                                         def create_text_file(file_path: str, text: str):
    23   4687.2 MiB      0.1 MiB           2       with open(file_path, "w") as f:
    24   4687.2 MiB    345.9 MiB           1           f.write(text)


Filename: main.py

Line #    Mem usage    Increment  Occurrences   Line Contents
=============================================================
     7     29.9 MiB     29.9 MiB           1   @profile
     8                                         def process():
     9     29.9 MiB      0.0 MiB           1       file_path = "data/j01101k_20240830_2.pdf"
    10     38.7 MiB      8.8 MiB           1       di_analyzer = DocumentIntelligenceAnalyzer()
    11   3550.3 MiB   3511.6 MiB           1       result = di_analyzer.extract_text_from_document_binary(file_path)
    12                                         
    13                                             # save as text file
    14   3550.3 MiB      0.0 MiB           1       new_extension = ".txt"
    15   3550.3 MiB      0.0 MiB           1       file_name, _ = os.path.splitext(file_path)
    16   3550.3 MiB      0.0 MiB           1       output_file = f"{file_name}{new_extension}"
    17   4341.2 MiB    790.9 MiB           1       result_text = repr(result)
    18   4687.2 MiB    346.0 MiB           1       create_text_file(file_path=output_file, text=result_text)

Additional context
The PDF file contains approximately 1600 pages, with each page containing information equivalent to two standard pages. Any insights or suggestions to mitigate this issue would be greatly appreciated.
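
One mitigation that might be worth trying, given the roughly 1600 pages, is to analyze the document in page batches so that only one batch's result is held in memory at a time. The sketch below only builds the page-range strings. The helper name page_ranges and the batch size of 500 are made up for illustration, and passing each string as the pages keyword of begin_analyze_document is an assumption that has not been tested against this SDK version.

```python
from typing import Iterator


def page_ranges(total_pages: int, batch_size: int) -> Iterator[str]:
    """Yield 1-based, inclusive page ranges such as "1-500" (hypothetical helper)."""
    if total_pages < 1 or batch_size < 1:
        raise ValueError("total_pages and batch_size must be positive")
    for start in range(1, total_pages + 1, batch_size):
        end = min(start + batch_size - 1, total_pages)
        # A batch holding a single page is written as "7" rather than "7-7".
        yield str(start) if start == end else f"{start}-{end}"


# Assumed usage (not executed here): pass each range as pages=... to
# begin_analyze_document, keep only what is needed from the result
# (for example result.content), then let the result be garbage collected.
batches = list(page_ranges(total_pages=1600, batch_size=500))
print(batches)
```

Whether this lowers the peak depends on how much of each batch result is kept afterwards. Holding on to every full AnalyzeResult would bring the total back to roughly the same size.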

shinxi · 2024-12-24T10:31:51Z

Looks similar to #37750. It would be beneficial to have an option that directly returns the AnalyzeResult as Python dictionaries instead of converting them into Azure-specific classes, which would reduce memory overhead.

xiangyan99 · 2024-12-30T17:51:00Z

Thanks for the feedback, we’ll investigate asap.

kokazaki03 · 2025-03-11T06:38:33Z

Hi, is there any update on this issue?

catalinaperalta · 2025-03-27T00:21:04Z

The size of the document and the amount of data analyzed and returned per page are probably the cause of the high memory usage. The current library design implements models as dictionaries under the hood, which is why they can also be accessed like regular dictionaries.
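
To illustrate what "a dictionary under the hood" can look like, here is a minimal sketch of a model that supports both attribute and key access over one dict. DictBackedModel is a hypothetical stand-in written for this comment, not the SDK's actual class, and the field names are only examples.

```python
from collections.abc import MutableMapping
from typing import Any, Dict, Iterator


class DictBackedModel(MutableMapping):
    """Hypothetical illustration only; this is not the SDK's model class."""

    def __init__(self, data: Dict[str, Any]) -> None:
        # The parsed JSON dict is kept as the single source of truth.
        self._data = data

    # Mapping protocol: model["key"] reads and writes the underlying dict.
    def __getitem__(self, key: str) -> Any:
        return self._data[key]

    def __setitem__(self, key: str, value: Any) -> None:
        self._data[key] = value

    def __delitem__(self, key: str) -> None:
        del self._data[key]

    def __iter__(self) -> Iterator[str]:
        return iter(self._data)

    def __len__(self) -> int:
        return len(self._data)

    def __getattr__(self, name: str) -> Any:
        # Called only when normal attribute lookup fails, so model.key
        # falls back to the same dict that model["key"] uses.
        if name.startswith("_"):
            raise AttributeError(name)
        try:
            return self._data[name]
        except KeyError:
            raise AttributeError(name) from None


page = DictBackedModel({"pageNumber": 1, "words": [{"content": "Hello"}]})
print(page.pageNumber, page["pageNumber"], len(page))
```

In a design like this the wrapper itself is thin, so most of the memory would sit in the nested dicts, lists, strings, and numbers produced by parsing the JSON.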

github-actions · 2025-03-27T00:24:20Z

Thanks for the feedback! We are routing this to the appropriate team for follow-up. cc @bojunehsu @vkurpad.

catalinaperalta · 2025-03-27T00:35:58Z

Please see the suggestions for a workaround from @bojunehsu in the comment here: #37750 (comment)

bojunehsu · 2025-04-01T22:31:21Z

@kokazaki03 Your observations are expected. Loading JSON into Python dictionaries is known to use ~10x the amount of memory as the original JSON file. Using streaming approaches can mitigate this issue.
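
For anyone who wants to check the overhead on their own data, the following sketch measures how much memory json.loads allocates relative to the size of the JSON text, using only the standard library. The payload is synthetic and only loosely shaped like layout output (that structure is an assumption), so the ratio it prints is not the ratio for a real AnalyzeResult.

```python
import json
import tracemalloc

# Synthetic payload (assumed shape): 50 pages with 200 words each.
payload = {
    "pages": [
        {
            "pageNumber": p,
            "words": [
                {
                    "content": f"w{i}",
                    "polygon": [float(i + k) for k in range(8)],
                    "confidence": 0.99,
                    "span": {"offset": i, "length": 2},
                }
                for i in range(200)
            ],
        }
        for p in range(1, 51)
    ]
}
# Compact separators approximate what would be sent over the wire.
json_text = json.dumps(payload, separators=(",", ":"))
del payload

# Trace only the allocations made while parsing.
tracemalloc.start()
parsed = json.loads(json_text)
current_bytes, peak_bytes = tracemalloc.get_traced_memory()
tracemalloc.stop()

json_bytes = len(json_text.encode("utf-8"))
ratio = current_bytes / json_bytes
print(f"JSON text: {json_bytes} bytes")
print(f"Parsed objects: {current_bytes} bytes (peak {peak_bytes})")
print(f"Ratio: {ratio:.1f}x")
```

The exact ratio depends on the Python version and on the payload shape. Many small numbers and short strings tend to push it up, because every float, list, and dict carries per-object overhead.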

xiangyan99 added the Client and Document Intelligence labels and removed the needs-triage label Dec 30, 2024

xiangyan99 assigned YalinLi0312 Dec 30, 2024

github-actions bot added the needs-team-attention label Dec 30, 2024

l0lawrence assigned catalinaperalta and unassigned YalinLi0312 Feb 18, 2025

catalinaperalta assigned bojunehsu and unassigned catalinaperalta Mar 27, 2025

catalinaperalta added the Service Attention label Mar 27, 2025
