[Document Intelligence] Corrupted images #31908

HarshaNalluru · 2024-11-23T02:19:29Z

Background

By default, the NodeFetchClient in @azure/core-rest-pipeline decodes the response Buffer as "utf8".
Decoding a Buffer as "utf8" has been found to corrupt the output when the body is binary, for example PNG files (responses with an image/png content type).

Buffer.concat(buffer).toString("utf8")

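This happens because binary data such as images contains byte sequences that are not valid UTF-8. Decoding replaces those bytes with the replacement character, so the original bytes cannot be recovered, which leads to corrupted files and unexpected behavior in applications that rely on this data.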
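A minimal sketch of the corruption, using only Node's built-in Buffer API. The variable names are illustrative and not taken from the PR; the eight bytes are the standard PNG file signature.

```typescript
import { Buffer } from "node:buffer";

// The eight-byte signature that starts every PNG file.
const pngSignature = Buffer.from([0x89, 0x50, 0x4e, 0x47, 0x0d, 0x0a, 0x1a, 0x0a]);

// 0x89 is not valid UTF-8 on its own, so decoding replaces it with U+FFFD.
// Re-encoding yields three bytes (EF BF BD) instead of the original one.
const viaUtf8 = Buffer.from(pngSignature.toString("utf8"), "utf8");

// "binary" (an alias for "latin1") maps every byte to exactly one character,
// so the string can be converted back to the same bytes.
const viaBinary = Buffer.from(pngSignature.toString("binary"), "binary");

const utf8RoundTrips = viaUtf8.equals(pngSignature);
const binaryRoundTrips = viaBinary.equals(pngSignature);
```

Here utf8RoundTrips is false and binaryRoundTrips is true, which is why a string decoded as "utf8" can no longer be written back out as a valid PNG.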

What changed?

To address this issue, I've implemented the following changes:

Content-Type Check: Introduced a check on the Content-Type header of the response. If the content type is "image/png", the encoding is now set to "binary" instead of "utf8", so that PNG image data is decoded without corruption.
Encoding Parameter: The streamToText function now accepts an additional parameter, encoding, which can be either "utf8" or "binary". This lets the caller choose the encoding based on the content type of the response.
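The two changes above can be sketched roughly as follows. This is not the code from the PR: pickEncoding and TextEncoding are hypothetical names, and the body of streamToText is reconstructed from the diff hunk and the description, so the real implementation may differ.

```typescript
import { Buffer } from "node:buffer";
import { Readable } from "node:stream";

// Hypothetical type alias for the two encodings named in the description.
type TextEncoding = "utf8" | "binary";

// Hypothetical helper: choose the encoding from the Content-Type header value.
function pickEncoding(contentType: string | undefined): TextEncoding {
  const mediaType = (contentType ?? "").toLowerCase();
  return mediaType.startsWith("image/png") ? "binary" : "utf8";
}

// streamToText with the extra encoding parameter; "utf8" stays the default.
function streamToText(
  stream: NodeJS.ReadableStream,
  encoding: TextEncoding = "utf8",
): Promise<string> {
  return new Promise<string>((resolve, reject) => {
    const chunks: Buffer[] = [];
    stream.on("data", (chunk: Buffer | string) => {
      chunks.push(Buffer.isBuffer(chunk) ? chunk : Buffer.from(chunk));
    });
    stream.on("end", () => resolve(Buffer.concat(chunks).toString(encoding)));
    stream.on("error", reject);
  });
}

// Usage: a PNG-like payload survives the string round trip when "binary" is used.
const payload = Buffer.from([0x89, 0x50, 0x4e, 0x47]);
const roundTrip: Promise<boolean> = streamToText(
  Readable.from([payload]),
  pickEncoding("image/png"),
).then((text) => Buffer.from(text, "binary").equals(payload));
```

A consumer that receives the "binary" string has to convert it back with Buffer.from(text, "binary") before writing the image to disk.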

TO DO

While the changes in this PR address the immediate issue with PNG files, there are further improvements needed to ensure robust handling of various content types:

Robust Content-Type Check: The content-type check implemented in this PR needs to be more comprehensive. We need to categorize the content types for which the responses are known to break with "utf8" encoding and apply the appropriate encoding for each category. This will help ensure backward compatibility and prevent regressions in the future.
Introduce bodyAsBuffer Property: As an alternative approach, we could introduce a new bodyAsBuffer property. By doing so, we can bypass the need for encoding altogether for certain types of responses, allowing the data to be handled as a raw Buffer. This approach could simplify the handling of binary data and prevent encoding-related issues.
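One possible shape for the two follow-ups, as a sketch under stated assumptions. SketchResponse, isBinaryContentType, and buildResponse are hypothetical names, bodyAsBuffer is the proposed property rather than an existing one, and the list of binary media types is a guess that would need to be validated against real services.

```typescript
import { Buffer } from "node:buffer";

// Hypothetical response shape; bodyAsBuffer is the property proposed above.
interface SketchResponse {
  contentType: string;
  bodyAsText?: string;
  bodyAsBuffer?: Buffer;
}

// Assumed categorization of media types that should not be decoded as text.
function isBinaryContentType(contentType: string): boolean {
  const mediaType = contentType.toLowerCase();
  return (
    mediaType.startsWith("image/") ||
    mediaType.startsWith("audio/") ||
    mediaType.startsWith("video/") ||
    mediaType.startsWith("application/octet-stream")
  );
}

// Binary responses keep the raw bytes; everything else is decoded as "utf8" as before.
function buildResponse(contentType: string, raw: Buffer): SketchResponse {
  return isBinaryContentType(contentType)
    ? { contentType, bodyAsBuffer: raw }
    : { contentType, bodyAsText: raw.toString("utf8") };
}
```

Keeping the raw Buffer avoids choosing an encoding at all for binary responses, while text responses keep the existing "utf8" behavior for backward compatibility.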

HarshaNalluru · 2024-11-23T02:25:49Z

sdk/core/core-rest-pipeline/src/nodeHttpClient.ts

@@ -349,7 +356,8 @@ function streamToText(stream: NodeJS.ReadableStream): Promise<string> {
      }
    });
    stream.on("end", () => {
-      resolve(Buffer.concat(buffer).toString("utf8"));


utf8 encoding corrupts the image data

azure-sdk · 2024-11-23T02:29:38Z

API change check

API changes are not detected in this pull request.

mpodwysocki · 2024-11-23T03:47:13Z

sdk/documentintelligence/ai-document-intelligence-rest/vitest.config.ts

    outputFile: {
      junit: "test-results.browser.xml",
    },
    fakeTimers: {
      toFake: ["setTimeout", "Date"],
    },
    watch: false,
-    include: ["test/**/*.spec.ts"],
+    include: ["test/**/figures.spec.ts"],


Do we only want to keep it at one test file?

No, apologies about this. I was testing my changes in core- with this example.
I'll revert this back once the changes in core are finalized :)

HarshaNalluru added 6 commits November 21, 2024 22:28

checkpoint

b98c8e9

checkpoint 2

e9fca6c

update the test

c1d1c75

revert logs

3c11c0d

outputs

b934b02

convertStream to text/buffer

55ab26d

github-actions bot added the Azure.Core label Nov 23, 2024

overload and minor test updates

039637e

HarshaNalluru commented Nov 23, 2024


HarshaNalluru added 2 commits November 22, 2024 18:27

Merge branch 'main' of https://github.com/azure/azure-sdk-for-js into…

fe45383

… harshan/d-i-png-issue

revert core changes

2390c9c

retian response.bodyAsText and use binary as encoding

db21c03

mpodwysocki reviewed Nov 23, 2024
