Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[Document Intelligence] Corrupted images #31908

Draft
wants to merge 10 commits into
base: main
Choose a base branch
from

Conversation

HarshaNalluru
Copy link
Member

@HarshaNalluru HarshaNalluru commented Nov 23, 2024

Background

We encode the response Buffer to "utf8" by default in the NodeFetchClient at @azure/core-rest-pipeline.
"utf8" encoding of Buffer has been found to corrupt the output in certain cases, such as when dealing with PNG files (image/png content-type response).

Buffer.concat(buffer).toString("utf8")

This issue arises because "utf8" encoding is not suitable for binary data like images, which leads to corrupted files and unexpected behavior in applications that rely on this data.

What changed?

To address this issue, I've implemented the following changes

  • Content-Type Check: Introduced a check for the content type of the response header. If the content type is "image/png", the encoding is now set to "binary" instead of "utf8". This ensures that binary data, like PNG images, is handled correctly without corruption.

  • Encoding Parameter: The streamToText function has been updated to accept an additional parameter, encoding, which can be either "utf8" or "binary". This allows for more flexible handling of different types of response data, ensuring that the appropriate encoding is used based on the content type.

TO DO

While the changes in this PR address the immediate issue with PNG files, there are further improvements needed to ensure robust handling of various content types:

  • Robust Content-Type Check: The content-type check implemented in this PR needs to be more comprehensive. We need to categorize the content types for which the responses are known to break with "utf8" encoding and apply the appropriate encoding for each category. This will help ensure backward compatibility and prevent regressions in the future.
  • Introduce bodyAsBuffer Property: As an alternative approach, we could introduce a new bodyAsBuffer property. By doing so, we can bypass the need for encoding altogether for certain types of responses, allowing the data to be handled as a raw Buffer. This approach could simplify the handling of binary data and prevent encoding-related issues.

@@ -349,7 +356,8 @@ function streamToText(stream: NodeJS.ReadableStream): Promise<string> {
}
});
stream.on("end", () => {
resolve(Buffer.concat(buffer).toString("utf8"));
Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

utf8 encoding corrupts the image data

@azure-sdk
Copy link
Collaborator

azure-sdk commented Nov 23, 2024

API change check

API changes are not detected in this pull request.

outputFile: {
junit: "test-results.browser.xml",
},
fakeTimers: {
toFake: ["setTimeout", "Date"],
},
watch: false,
include: ["test/**/*.spec.ts"],
include: ["test/**/figures.spec.ts"],
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Do we only want to keep it at one test file?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

No, apologies about this. I was testing my changes in core- with this example.
I'll revert this back once the changes in core are finalized :)

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants