Support loading of legacy microsoft file formats (e.g. .doc, .ppt, .xls) #8796

Open
sjrl opened this issue Feb 3, 2025 · 0 comments
Labels
type:feature New feature or request

sjrl commented Feb 3, 2025

Is your feature request related to a problem? Please describe.
The converter components in Haystack (e.g. DOCXToDocument, XLSXToDocument) don't support the legacy Microsoft Office file types (e.g. .doc, .xls, .ppt), because the underlying libraries we use only support the modern Microsoft Office formats.

After some online research, I was unable to find other Python libraries with permissive licenses that support converting these older formats.

Describe the solution you'd like
A common recommendation for handling legacy files is to convert them to the modern formats (e.g. .doc to .docx) using the LibreOffice command-line tool (more info here). For example:

soffice --headless --convert-to docx test.doc

So I think a new converter component that converts a legacy format into its modern equivalent would be great! We could potentially return the output as a ByteStream, which might let us avoid writing temporary files. It would probably make sense to make this behavior controllable via input parameters.
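To make the idea more concrete, here is a minimal sketch of the conversion step such a component could wrap. It is not an existing Haystack API: the function names `build_soffice_command` and `convert_legacy_file` and the `timeout` parameter are hypothetical, and it assumes `soffice` is installed and on the PATH. Since LibreOffice writes its output to disk, the sketch still uses a temporary directory internally and only hands back bytes, which a component could then wrap in a ByteStream.

```python
import shutil
import subprocess
import tempfile
from pathlib import Path


def build_soffice_command(source: Path, target_format: str, outdir: Path) -> list[str]:
    """Build the LibreOffice CLI call from the issue, plus an explicit --outdir."""
    return [
        "soffice",
        "--headless",
        "--convert-to",
        target_format,
        "--outdir",
        str(outdir),
        str(source),
    ]


def convert_legacy_file(source: Path, target_format: str = "docx", timeout: float = 120.0) -> bytes:
    """Convert `source` (e.g. a .doc file) and return the converted file's bytes.

    Hypothetical helper: assumes LibreOffice is installed locally.
    """
    if shutil.which("soffice") is None:
        raise RuntimeError("soffice not found on PATH; LibreOffice must be installed")

    # LibreOffice writes the result to disk, so use a directory that is
    # cleaned up automatically once the bytes have been read.
    with tempfile.TemporaryDirectory() as tmp:
        outdir = Path(tmp)
        subprocess.run(
            build_soffice_command(source, target_format, outdir),
            check=True,
            capture_output=True,
            timeout=timeout,
        )
        # A format may carry a filter name ("pdf:writer_pdf_Export");
        # the output extension is the part before the colon.
        extension = target_format.split(":")[0]
        converted = outdir / f"{source.stem}.{extension}"
        if not converted.exists():
            raise RuntimeError(f"LibreOffice did not produce {converted.name}")
        return converted.read_bytes()
```

Calling `convert_legacy_file(Path("test.doc"))` would mirror the command above; passing `target_format="pdf"` would request a PDF instead.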

As a side effect, this component would also allow converting Microsoft file types (and others) into formats such as PDF, which may be helpful in scenarios such as running OCR or getting more reliable page detection for .docx files.

Describe alternatives you've considered

  • It appears that Tika could also cover some of these cases. See the parser docs here. However, it's not 100% clear to me whether that's true, and I think it would be nice to let users leverage our other converters without needing to deploy Tika.
  • Technically, Unstructured IO also supports these legacy formats. See the docs here. However, I say "technically" because their strategy is also to use LibreOffice to convert the legacy formats to the modern ones and then leverage their other converters. So I believe it would be better for us to support the LibreOffice conversion natively as well.
@sjrl sjrl added the type:feature New feature or request label Feb 3, 2025