Identification of images in docx #273

jkindahood · 2024-11-07T13:58:35Z

Hello everyone,
I'm working with DOCX documents in docling.
My first test went poorly.
I have a DOCX test file that contains a single image.
The standard procedure doesn't even recognise the image in the file.
This is the standard procedure I used:

from docling.document_converter import DocumentConverter

source = "data/example_small.docx"
converter = DocumentConverter()
result = converter.convert(source)
for item in result.document:
    print(item)

The example file is attached below.
Is something wrong with my code, or is this a bug?

Greetings
example_small.docx
word_sample.docx
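
Before deciding whether the converter is at fault, it can help to confirm that the file really carries an embedded image. A .docx file is a ZIP archive, and embedded pictures are normally stored under word/media/. The following is a minimal sketch using only the standard library; the helper name list_docx_media is made up for illustration and is not part of docling.

```python
import zipfile


def list_docx_media(path):
    """Return the names of the media parts stored inside a .docx archive.

    `path` may be a filesystem path or a binary file-like object.
    Hypothetical helper for illustration; not a docling API.
    """
    with zipfile.ZipFile(path) as archive:
        return sorted(
            name
            for name in archive.namelist()
            if name.startswith("word/media/")
        )


# Usage (path taken from the post above):
# print(list_docx_media("data/example_small.docx"))
```

If the returned list is non-empty but the converted document shows no picture item, the image is present in the file and is being dropped during conversion.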

PeterStaar-IBM · 2024-11-08T05:33:02Z

@jkindahood Thanks for the feedback! Let us look into it and come back to you.

I don't see any handling of pictures in the msword-backend. To add pictures (which we obviously need to do), we need to update the handle_elements method.

jkindahood · 2024-11-08T13:45:55Z

Can you describe what I need to do?
Maybe I can implement that.

PeterStaar-IBM · 2024-11-08T13:48:40Z

Yes, absolutely. If you follow the link, you'll see that we have no add_picture method yet (as there is in the HTML version: https://github.com/DS4SD/docling/blob/main/docling/backend/html_backend.py#L429)
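
For anyone picking this up, here is a rough sketch of what picture detection in a Word file involves at the XML level. It is not the docling backend code and makes no claim about how handle_elements is structured. It assumes the usual OOXML layout: a picture appears as an a:blip element inside a paragraph, and its r:embed attribute is a relationship id that word/_rels/document.xml.rels maps to a media part. The function name find_embedded_images is hypothetical, and the sketch only looks at top-level body paragraphs (not tables, headers, or linked images).

```python
import zipfile
import xml.etree.ElementTree as ET

# Standard OOXML namespaces.
NS = {
    "w": "http://schemas.openxmlformats.org/wordprocessingml/2006/main",
    "a": "http://schemas.openxmlformats.org/drawingml/2006/main",
    "r": "http://schemas.openxmlformats.org/officeDocument/2006/relationships",
}
REL_NS = "http://schemas.openxmlformats.org/package/2006/relationships"


def find_embedded_images(path):
    """Return (paragraph_index, media_target) for each embedded picture.

    Hypothetical helper for illustration; only direct child paragraphs
    of the document body are inspected.
    """
    with zipfile.ZipFile(path) as archive:
        document = ET.fromstring(archive.read("word/document.xml"))
        rels = ET.fromstring(archive.read("word/_rels/document.xml.rels"))

    # Map relationship ids (e.g. "rId4") to targets (e.g. "media/image1.png").
    targets = {
        rel.get("Id"): rel.get("Target")
        for rel in rels.findall(f"{{{REL_NS}}}Relationship")
    }

    embed_attr = f"{{{NS['r']}}}embed"
    found = []
    for index, paragraph in enumerate(document.iterfind(".//w:body/w:p", NS)):
        for blip in paragraph.iterfind(".//a:blip", NS):
            rel_id = blip.get(embed_attr)
            if rel_id in targets:
                found.append((index, targets[rel_id]))
    return found
```

The paragraph index is kept so that a caller can later place the picture at the right point in the reading order.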

jkindahood · 2024-11-12T08:23:29Z

Thank you for your answer, @PeterStaar-IBM.
I think this enhancement is strongly connected to this pull request:
#259
because I'm interested in the description of a picture, embedded into the text at the right position.
As a first step, would you say we should read the images in the Word backend the same way the PDF backend does?
And as a second step, add the option to describe the image and include the description in the returned text?
Greetings
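
To make the second step concrete, here is a small sketch of the idea of placing a picture description inline at the picture's position. Everything in it is hypothetical: TextBlock, PictureBlock, render_with_descriptions, and the describe callback are illustrative names, not docling types, and the description itself would come from whatever captioning option ends up being added.

```python
from dataclasses import dataclass
from typing import Callable, Iterable, Union


@dataclass
class TextBlock:
    text: str


@dataclass
class PictureBlock:
    name: str  # e.g. the media part the picture was read from


Block = Union[TextBlock, PictureBlock]


def render_with_descriptions(
    blocks: Iterable[Block],
    describe: Callable[[PictureBlock], str],
) -> str:
    """Join blocks in reading order, replacing pictures by a description."""
    lines = []
    for block in blocks:
        if isinstance(block, PictureBlock):
            # The description lands exactly where the picture sat in the text.
            lines.append(f"[Image: {describe(block)}]")
        else:
            lines.append(block.text)
    return "\n".join(lines)


# Usage with a placeholder describer:
# blocks = [TextBlock("Intro"), PictureBlock("media/image1.png")]
# print(render_with_descriptions(blocks, lambda p: f"picture {p.name}"))
```

Keeping the describer as a callback means step one (reading the images) can land first, with the description option plugged in later.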

maxmnemonic · 2024-11-13T16:20:39Z

@jkindahood, small update: I'm working on a PR to resolve image identification in DOCX: #330

jkindahood · 2024-11-14T08:17:50Z

@maxmnemonic All fine. I'm trying to understand how your codebase works and still want to contribute.
Tell me if I can do something to help you.

jkindahood added the question Further information is requested label Nov 7, 2024

PeterStaar-IBM assigned maxmnemonic Nov 8, 2024

PeterStaar-IBM added enhancement New feature or request and removed question Further information is requested labels Nov 8, 2024

maxmnemonic mentioned this issue Nov 13, 2024

fix: Fixing images in the input Word files #330

Merged

4 tasks

maxmnemonic closed this as completed in #330 Nov 14, 2024
