Support loading of legacy microsoft file formats (e.g. `.doc`, `.ppt`, `.xls`) #8796

sjrl · 2025-02-03T09:16:33Z

Is your feature request related to a problem? Please describe.
The converter components in Haystack (e.g. DOCXToDocument, XLSXToDocument) don't support the legacy Microsoft Office file types (e.g. .doc, .xls, .ppt), because the underlying libraries we use only support the modern Microsoft Office formats.

After some online research, I was unable to find other Python libraries with permissive licenses that support converting these older formats.

Describe the solution you'd like
Instead, a common recommendation for handling legacy files is to convert them to the modern formats (e.g. .doc to .docx) using the LibreOffice command-line tool (more info here). For example:

soffice --headless --convert-to docx test.doc

So I think a new converter component that converts the legacy format into the modern one would be great! It could potentially return the output as a ByteStream, which might let us avoid writing temporary files. It would probably make sense to make this behavior controllable via input parameters.
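To make the idea concrete, here is a minimal sketch of the conversion step such a component could wrap. It is not existing Haystack code: `build_soffice_command` and `convert_legacy_file` are hypothetical names, and the sketch assumes `soffice` is on the PATH and that `target_format` is a plain extension such as `docx` or `pdf`. LibreOffice itself still writes the converted file to disk, so the sketch uses a temporary directory and hands back only the bytes.

```python
import shutil
import subprocess
import tempfile
from pathlib import Path
from typing import List


def build_soffice_command(
    source: Path, target_format: str, outdir: Path, executable: str = "soffice"
) -> List[str]:
    """Build the LibreOffice CLI call for a headless conversion."""
    return [
        executable,
        "--headless",
        "--convert-to", target_format,
        "--outdir", str(outdir),
        str(source),
    ]


def convert_legacy_file(
    source: Path, target_format: str = "docx", timeout: float = 120.0
) -> bytes:
    """Convert `source` with LibreOffice and return the converted file's bytes.

    Hypothetical helper for illustration; assumes `target_format` is a plain
    file extension, so the output is named `<stem>.<target_format>`.
    """
    if shutil.which("soffice") is None:
        raise RuntimeError("soffice not found on PATH; install LibreOffice first")

    # LibreOffice writes the result to disk, so use a throwaway directory
    # and read the bytes back before it is cleaned up.
    with tempfile.TemporaryDirectory() as tmp:
        outdir = Path(tmp)
        cmd = build_soffice_command(source, target_format, outdir)
        subprocess.run(cmd, check=True, capture_output=True, timeout=timeout)

        converted = outdir / f"{source.stem}.{target_format}"
        # Do not rely on the exit code alone; check that the file exists.
        if not converted.exists():
            raise RuntimeError(f"LibreOffice did not produce {converted.name}")
        return converted.read_bytes()
```

Inside a component, the returned bytes could be wrapped as `ByteStream(data=...)` and passed to DOCXToDocument or another downstream converter. An input parameter could then decide whether to return a ByteStream or keep the converted file on disk.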

As a side effect, this component would also allow converting Microsoft file types (and others) into formats such as PDF, which may help in scenarios such as running OCR or getting more reliable page detection for .docx files.

Describe alternatives you've considered

It appears that Tika could also cover some of these cases (see the parser docs here). However, it's not fully clear to me whether that's true, and I think it would be nice to let users leverage our other converters without needing to deploy Tika.
Technically, Unstructured IO also supports these legacy formats (see docs here). I say technically because their strategy is also to use LibreOffice to convert the legacy formats to the modern ones and then leverage their other converters. So I believe it would be better for us to support the LibreOffice conversion natively as well.

sjrl added the type:feature label on Feb 3, 2025
