Memory Usage Spike When Analyzing 5MB PDF with Document Intelligence SDK #38972

Open
kokazaki03 opened this issue Dec 23, 2024 · 2 comments
Labels
Client This issue points to a problem in the data-plane of the library. customer-reported Issues that are reported by GitHub users external to the Azure organization. Document Intelligence needs-team-attention Workflow: This issue needs attention from Azure service team or SDK team question The issue doesn't require a change to the product in order to be resolved. Most issues start as that

Comments

@kokazaki03

  • Package Name:
    azure-ai-documentintelligence

  • Package Version:
    1.0.0b4

  • Operating System:
    macOS 14.7.1 (23H222)

  • Python Version:
    Python 3.11.3

Describe the bug

I ran into a problem with the Document Intelligence SDK while analyzing a PDF file of approximately 5 MB. When the result is retrieved with result: AnalyzeResult = poller.result(), memory usage spikes to around 4 GB (the profiler output below shows increments between roughly 3.4 GiB and 5.0 GiB for that call). This seems abnormal: the response itself, written to a text file via repr(result), was only about 360 MB.

  • Document Intelligence API Version: "2024-07-31-preview"
  • model_id: prebuilt-layout

To Reproduce

  1. Use the Document Intelligence SDK to analyze a PDF file of approximately 5MB.
  2. Retrieve the result using result: AnalyzeResult = poller.result().
  3. Observe the memory usage during the process.
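
For step 3, memory can also be observed without memory_profiler. The following is a minimal sketch using the standard-library tracemalloc module; the helper name measure_peak_mib is hypothetical, and the workload is a stand-in for the poller.result() call. Note that tracemalloc only counts allocations made through Python's allocators, so its figures can be lower than the process-level numbers memory_profiler reports.

```python
import tracemalloc
from typing import Any, Callable, Tuple


def measure_peak_mib(fn: Callable[[], Any]) -> Tuple[Any, float]:
    """Run fn() and return (result, peak traced Python allocations in MiB)."""
    tracemalloc.start()
    try:
        result = fn()
        # get_traced_memory() returns (current, peak) in bytes since start().
        _, peak_bytes = tracemalloc.get_traced_memory()
    finally:
        tracemalloc.stop()
    return result, peak_bytes / (1024 * 1024)


# Stand-in workload. In the repro this would be something like
# `lambda: poller.result()` (assumption: `poller` is already created).
data, peak = measure_peak_mib(lambda: [bytes(1024) for _ in range(10_000)])
print(f"items: {len(data)}, peak traced memory: {peak:.1f} MiB")
```

Wrapping only the poller.result() call this way would show how much of the spike comes from Python-level objects created during deserialization, as opposed to memory held outside the interpreter.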

Expected behavior

Memory usage should not increase significantly; it should stay roughly proportional to the size of the analyzed PDF file.

Actual behavior

Memory usage spiked to around 4 GB, which is disproportionate to both the size of the PDF file (about 5 MB) and the size of the resulting output (about 360 MB).

Screenshots

import os
from dotenv import load_dotenv
from memory_profiler import profile
from local_blob_file_loader import LocalBlobFileLoader
from azure.core.credentials import AzureKeyCredential
from azure.ai.documentintelligence import DocumentIntelligenceClient
from azure.ai.documentintelligence.models import (
    AnalyzeDocumentRequest,
    ContentFormat,
    AnalyzeResult,
)


class DocumentIntelligenceAnalyzer:
    def __init__(self):
        self.di_endpoint = os.getenv("AZURE_DOCUMENT_INTELLIGENCE_ENDPOINT")
        self.di_key = os.getenv("AZURE_DOCUMENT_INTELLIGENCE_KEY")
        self.api_version = "2024-07-31-preview"
        self.document_intelligence_client = DocumentIntelligenceClient(
            endpoint=self.di_endpoint,
            credential=AzureKeyCredential(self.di_key),
            api_version=self.api_version,
        )
        self.blob_loader = LocalBlobFileLoader()

    @profile
    def extract_text_from_document_by_url(
        self, document_file_path: str, api_model_name: str = "prebuilt-layout"
    ) -> AnalyzeResult:
        request = AnalyzeDocumentRequest(url_source=document_file_path)
        poller = self.document_intelligence_client.begin_analyze_document(
            model_id=api_model_name,
            analyze_request=request,
            output_content_format=ContentFormat.MARKDOWN,
        )
        result: AnalyzeResult = poller.result()
        return result

    @profile
    def extract_text_from_document_binary(
        self, document_file_path: str
    ) -> AnalyzeResult:
        bytes_data = self.blob_loader.load_as_bytes(document_file_path)
        request = AnalyzeDocumentRequest(bytes_source=bytes_data)
        poller = self.document_intelligence_client.begin_analyze_document(
            model_id="prebuilt-layout",
            analyze_request=request,
            output_content_format=ContentFormat.MARKDOWN,
        )
        result: AnalyzeResult = poller.result()
        return result


@profile
def process():
    file_path = "data/some_file.pdf"
    di_analyzer = DocumentIntelligenceAnalyzer()
    result = di_analyzer.extract_text_from_document_binary(file_path)

    # save as text file
    new_extension = ".txt"
    file_name, _ = os.path.splitext(file_path)
    output_file = f"{file_name}{new_extension}"
    result_text = repr(result)
    create_text_file(file_path=output_file, text=result_text)


@profile
def create_text_file(file_path: str, text: str):
    with open(file_path, "w") as f:
        f.write(text)


if __name__ == "__main__":
    load_dotenv()
    process()

Line #    Mem usage    Increment  Occurrences   Line Contents
=============================================================
    38     38.7 MiB     38.7 MiB           1       @profile
    39                                             def extract_text_from_document_binary(
    40                                                 self, document_file_path: str
    41                                             ) -> AnalyzeResult:
    42     43.8 MiB      5.1 MiB           1           bytes_data = self.blob_loader.load_as_bytes(document_file_path)
    43     57.4 MiB     13.6 MiB           1           request = AnalyzeDocumentRequest(bytes_source=bytes_data)
    44     87.0 MiB     29.7 MiB           2           poller = self.document_intelligence_client.begin_analyze_document(
    45     57.4 MiB      0.0 MiB           1               model_id="prebuilt-layout",
    46     57.4 MiB      0.0 MiB           1               analyze_request=request,
    47     57.4 MiB      0.0 MiB           1               output_content_format=ContentFormat.MARKDOWN,
    48                                                 )
    49   5191.2 MiB   5104.2 MiB           1           result: AnalyzeResult = poller.result()
    50   5191.3 MiB      0.0 MiB           1           return result


Filename: main.py

Line #    Mem usage    Increment  Occurrences   Line Contents
=============================================================
    21   4341.2 MiB   4341.2 MiB           1   @profile
    22                                         def create_text_file(file_path: str, text: str):
    23   4687.2 MiB      0.1 MiB           2       with open(file_path, "w") as f:
    24   4687.2 MiB    345.9 MiB           1           f.write(text)


Filename: main.py

Line #    Mem usage    Increment  Occurrences   Line Contents
=============================================================
     7     29.9 MiB     29.9 MiB           1   @profile
     8                                         def process():
     9     29.9 MiB      0.0 MiB           1       file_path = "data/j01101k_20240830_2.pdf"
    10     38.7 MiB      8.8 MiB           1       di_analyzer = DocumentIntelligenceAnalyzer()
    11   3550.3 MiB   3511.6 MiB           1       result = di_analyzer.extract_text_from_document_binary(file_path)
    12                                         
    13                                             # save as text file
    14   3550.3 MiB      0.0 MiB           1       new_extension = ".txt"
    15   3550.3 MiB      0.0 MiB           1       file_name, _ = os.path.splitext(file_path)
    16   3550.3 MiB      0.0 MiB           1       output_file = f"{file_name}{new_extension}"
    17   4341.2 MiB    790.9 MiB           1       result_text = repr(result)
    18   4687.2 MiB    346.0 MiB           1       create_text_file(file_path=output_file, text=result_text)
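
To put the profiler figures in proportion, here is a small arithmetic sketch. All numbers are copied from the memory_profiler output above; the variable names and the times helper are only illustrative. It suggests that the in-memory AnalyzeResult takes roughly 10 to 15 times the size of its own text representation.

```python
# Increments (MiB) copied from the memory_profiler output above.
result_in_method = 5104.2   # poller.result() inside extract_text_from_document_binary
result_in_process = 3511.6  # the same call as seen from process()
write_increment = 345.9     # f.write(text), roughly the size of the text output
pdf_increment = 5.1         # load_as_bytes(), roughly the size of the PDF


def times(numerator: float, denominator: float) -> float:
    """Return how many times larger numerator is than denominator."""
    return round(numerator / denominator, 1)


ratios = {
    "result (process view) vs text output": times(result_in_process, write_increment),
    "result (method view) vs text output": times(result_in_method, write_increment),
    "result (process view) vs PDF": times(result_in_process, pdf_increment),
    "result (method view) vs PDF": times(result_in_method, pdf_increment),
}

for label, value in ratios.items():
    print(f"{label}: about {value}x")
```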

Additional context
The PDF file contains approximately 1600 pages, with each page containing information equivalent to two standard pages. Any insights or suggestions to mitigate this issue would be greatly appreciated.
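
One possible mitigation, offered only as a sketch and not as a confirmed fix: analyze the document in page batches so that no single AnalyzeResult covers all 1600 pages. The helper below is hypothetical and only builds page-range strings. It assumes that begin_analyze_document accepts a pages keyword taking ranges such as "1-200" in the SDK version in use, which should be checked against the installed package.

```python
from typing import Iterator


def page_ranges(total_pages: int, batch_size: int) -> Iterator[str]:
    """Yield 1-based inclusive page ranges such as "1-200", "201-400"."""
    if total_pages < 1 or batch_size < 1:
        raise ValueError("total_pages and batch_size must be positive")
    for start in range(1, total_pages + 1, batch_size):
        end = min(start + batch_size - 1, total_pages)
        # A single-page batch is written as "7" rather than "7-7".
        yield f"{start}-{end}" if end > start else str(start)


# Example: 1600 pages in batches of 500.
for pages in page_ranges(1600, 500):
    # Assumption: each string would be passed as `pages=pages` to
    # begin_analyze_document, and each result processed and released
    # before requesting the next batch.
    print(pages)
```

Each request would still upload the whole PDF, so this trades extra requests for a lower peak memory footprint per result.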

@shinxi

shinxi commented Dec 24, 2024

Looks similar to #37750. It would be beneficial to have an option that directly returns the AnalyzeResult as Python dictionaries instead of converting them into Azure-specific classes, which would reduce memory overhead.
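
To illustrate the kind of overhead described here, the sketch below compares keeping only parsed JSON dictionaries against additionally wrapping every element in a model object. The Word class and the payload are invented for the illustration and are not the SDK's actual model classes or response shape; the point is only that per-element wrapper objects add memory on top of the raw dictionaries.

```python
import json
import tracemalloc


class Word:
    """Hypothetical model class; NOT the SDK's implementation."""

    def __init__(self, data: dict):
        self.content = data["content"]
        self.polygon = list(data["polygon"])  # copies the coordinate list
        self.confidence = data["confidence"]


def traced_mib(build):
    """Return (obj, MiB still allocated after build() while obj is alive)."""
    tracemalloc.start()
    obj = build()
    current, _ = tracemalloc.get_traced_memory()
    tracemalloc.stop()
    return obj, current / (1024 * 1024)


# Synthetic payload standing in for a large service response.
payload = json.dumps(
    [{"content": f"w{i}", "polygon": [0.0] * 8, "confidence": 0.99}
     for i in range(20_000)]
)


def build_raw():
    return json.loads(payload)


def build_raw_and_models():
    raw = json.loads(payload)
    return raw, [Word(item) for item in raw]


raw, raw_mib = traced_mib(build_raw)
both, both_mib = traced_mib(build_raw_and_models)
print(f"raw dicts only: {raw_mib:.1f} MiB")
print(f"raw dicts plus model objects: {both_mib:.1f} MiB")
```

How closely this mirrors the SDK depends on how its models hold the underlying data, so the numbers here are only indicative.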

@xiangyan99
Member

Thanks for the feedback, we’ll investigate asap.
