Support loading of legacy Microsoft file formats (e.g. .doc, .ppt, .xls)
#8796
Labels
type:feature
New feature or request
Is your feature request related to a problem? Please describe.
The converter components in Haystack (e.g. DOCXToDocument, XLSXToDocument, etc.) don't support the legacy Microsoft Office file types (e.g. .doc, .xls, .ppt). This is because the underlying libraries we use in Haystack only support the modern Microsoft Office file types. After some online research, I was unable to find other Python libraries with permissive licenses that support converting these older formats.
Describe the solution you'd like
Instead, a common recommendation for handling legacy files is to convert them to the modern formats (e.g. .doc to .docx) using the command line tool from libreoffice (more info here). So I think creating a new converter component that converts the legacy format into the modern one would be great! We could also potentially pass the output along as a ByteStream so we could maybe avoid writing temporary files. Perhaps it would make sense to make this behavior controllable via input parameters.
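To make the idea concrete, here is a rough sketch of how such a conversion could be driven from Python by shelling out to LibreOffice in headless mode. The names build_convert_command, convert_legacy_file and LEGACY_TO_MODERN are hypothetical and only for illustration; they are not existing Haystack APIs. It also assumes that a soffice or libreoffice executable is available on the PATH and that the output file is named after the input file's stem.

```python
import shutil
import subprocess
from pathlib import Path
from typing import List, Optional

# Hypothetical mapping used for illustration; not part of Haystack.
LEGACY_TO_MODERN = {".doc": "docx", ".xls": "xlsx", ".ppt": "pptx"}


def build_convert_command(
    source: Path, target_format: str, output_dir: Path, executable: str = "soffice"
) -> List[str]:
    """Build the headless LibreOffice command line for one conversion."""
    return [
        executable,
        "--headless",
        "--convert-to",
        target_format,
        "--outdir",
        str(output_dir),
        str(source),
    ]


def convert_legacy_file(
    source: Path, output_dir: Path, target_format: Optional[str] = None
) -> Path:
    """Convert `source` and return the path of the converted file.

    If `target_format` is not given, the modern counterpart of the legacy
    extension is used (e.g. .doc -> docx). Passing "pdf" converts to PDF.
    """
    source = Path(source)
    output_dir = Path(output_dir)
    fmt = target_format or LEGACY_TO_MODERN[source.suffix.lower()]

    # Assumption: LibreOffice is installed and exposed under one of these names.
    executable = shutil.which("soffice") or shutil.which("libreoffice")
    if executable is None:
        raise RuntimeError("LibreOffice executable not found on PATH")

    output_dir.mkdir(parents=True, exist_ok=True)
    command = build_convert_command(source, fmt, output_dir, executable)
    subprocess.run(command, check=True, capture_output=True)

    # Assumption: the output keeps the input stem; a filter suffix such as
    # "pdf:writer_pdf_Export" only contributes the part before the colon.
    extension = fmt.split(":", 1)[0]
    result = output_dir / f"{source.stem}.{extension}"
    if not result.exists():
        raise RuntimeError(f"Conversion did not produce {result}")
    return result
```

With something like this, convert_legacy_file(Path("report.doc"), Path("converted")) would be expected to produce converted/report.docx. The bytes of that file could then be read and wrapped in a ByteStream for the existing converters. This sketch still writes the converted file to disk, so avoiding temporary files entirely would need a different approach.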
As a side effect, this component would also allow converting Microsoft file types (and others) into formats such as PDF, which may be helpful in scenarios such as running OCR or getting more reliable page detection for .docx files.
Describe alternatives you've considered
libreoffice to convert the legacy formats to the modern ones and then leverage their other converters. So I believe it would be better for us to also natively support the libreoffice conversion.