Commit 18c73ca

fix: Do not hardcode file extension on temp files (#435)
This is a minor fix to improve our logging. When we buffer a file-like input to disk in `process_data_with_model`, we always use the name `document.pdf`. This confused me when I found this in our logs:

```
2025-06-30 17:02:01,906 unstructured_inference INFO Reading image file: /var/folders/5k/frv076q97yl0ywybmzydhbsr0000gn/T/tmpc0uq7zde/document.pdf
...
2025-06-30 17:02:01,951 unstructured_api ERROR cannot identify image file '/private/var/folders/5k/frv076q97yl0ywybmzydhbsr0000gn/T/tmpc0uq7zde/document.pdf'
```

The buffered file can be either a PDF or an image, so let's just drop the extension to save ourselves some confusion.

Also added a comment so we don't forget why it's using a temp dir, not a temp file.
1 parent 3abe07a commit 18c73ca

3 files changed: 12 additions & 5 deletions

CHANGELOG.md

Lines changed: 4 additions & 0 deletions
@@ -1,3 +1,7 @@
+## 1.0.7
+
+* Fix a hardcoded file extension causing confusion in the logs
+
 ## 1.0.6
 
 * Add slicing through indexing for vectorized elements

unstructured_inference/__version__.py

Lines changed: 1 addition & 1 deletion
@@ -1 +1 @@
-__version__ = "1.0.6" # pragma: no cover
+__version__ = "1.0.7" # pragma: no cover

unstructured_inference/inference/layout.py

Lines changed: 7 additions & 4 deletions
@@ -337,12 +337,15 @@ def process_data_with_model(
     password: Optional[str] = None,
     **kwargs: Any,
 ) -> DocumentLayout:
-    """Process PDF as file-like object `data` into a `DocumentLayout`.
+    """Process PDF or image as file-like object `data` into a `DocumentLayout`.
 
     Uses the model identified by `model_name`.
     """
+    # Note: We use a temp dir, not a temp file,
+    # because the latter fails on Windows
+    # https://github.com/Unstructured-IO/unstructured-inference/pull/376
     with tempfile.TemporaryDirectory() as tmp_dir_path:
-        file_path = os.path.join(tmp_dir_path, "document.pdf")
+        file_path = os.path.join(tmp_dir_path, "document")
         with open(file_path, "wb") as f:
             f.write(data.read())
             f.flush()
@@ -365,8 +368,8 @@ def process_file_with_model(
     password: Optional[str] = None,
     **kwargs: Any,
 ) -> DocumentLayout:
-    """Processes pdf file with name filename into a DocumentLayout by using a model identified by
-    model_name."""
+    """Processes pdf or image file with name filename into a DocumentLayout by using
+    a model identified by model_name."""
 
     model = get_model(model_name, **kwargs)
     if isinstance(model, UnstructuredObjectDetectionModel):
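
The buffering pattern this commit touches can be tried in isolation. The following is a minimal sketch, not the library's code: `with_buffered_copy` and `sniff_kind` are hypothetical names introduced here for illustration, and the byte-signature check is only an assumed stand-in for whatever downstream reader consumes the path. It shows a file-like input being written into a temporary directory under the extension-neutral name `document`, so the path in any log line makes no claim about the file type.

```python
import io
import os
import tempfile
from typing import BinaryIO, Callable, TypeVar

T = TypeVar("T")


def with_buffered_copy(data: BinaryIO, handler: Callable[[str], T]) -> T:
    """Write `data` to disk under a neutral name and pass the path to `handler`.

    Hypothetical helper for illustration; not part of unstructured_inference.
    """
    # A temporary directory is used instead of a named temporary file, mirroring
    # the comment added in this commit. The file inside it is fully closed
    # before the handler reopens it by path.
    with tempfile.TemporaryDirectory() as tmp_dir_path:
        # No extension: the payload may be a PDF or an image.
        file_path = os.path.join(tmp_dir_path, "document")
        with open(file_path, "wb") as f:
            f.write(data.read())
        return handler(file_path)


def sniff_kind(path: str) -> str:
    """Hypothetical handler: classify a file by its leading bytes, not its name."""
    with open(path, "rb") as f:
        header = f.read(8)
    if header.startswith(b"%PDF-"):
        return "pdf"
    if header.startswith(b"\x89PNG\r\n\x1a\n"):
        return "png"
    return "unknown"


if __name__ == "__main__":
    print(with_buffered_copy(io.BytesIO(b"%PDF-1.7\n"), sniff_kind))
```

Because the handler here inspects content rather than the filename, dropping the `.pdf` suffix does not change its result. Whether that holds for a given reader depends on how that reader identifies its input.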
