Index files other than text/document files #21

Open
pacproduct opened this issue Nov 13, 2018 · 2 comments

Comments


pacproduct commented Nov 13, 2018

Hi.

I'm far from grasping the full complexity of ES and Nextcloud's fulltextsearch suite, but I thought that the Ingest Attachment Processor Plugin we add to Elasticsearch is meant to index virtually any known file type, thanks to Apache Tika, which can parse hundreds of file formats.

Despite that, it seems that files_fulltextsearch only sends file content to ES when the file matches one of the following types: Text, Office, PDF.

Indeed, I've installed and configured files_fulltextsearch on a local Nextcloud instance for testing purposes, and I can't search within the content of ZIP files, image files, etc., even though Tika knows these file types.

Isn't it possible to just send all file contents to ES so it indexes as many file types as it can?

Thx.

@solracsf solracsf changed the title Help request - Wide range of file types indexing? Wide range of file types indexing (zip...) ? Sep 15, 2021
@solracsf solracsf changed the title Wide range of file types indexing (zip...) ? Wide range of file types indexing (zip...) other than text/dicument files Sep 15, 2021
@solracsf solracsf pinned this issue Sep 15, 2021
@solracsf solracsf changed the title Wide range of file types indexing (zip...) other than text/dicument files Index files other than text/document files Sep 15, 2021
@tucker-m

Does anyone know where the logic is that determines what files are indexed? I just took a quick look and couldn't find it. I noticed that markdown files aren't indexed, and I figured that one would be an easy fix (just treat it like a .txt), but I couldn't find the place where it reads file extensions.

@masahirominami
Contributor

I am not sure if this is the right approach, but here is what I found.

  • lib/Service/FilesService.php
    /**
     * @param string $mimeType
     * @param string $extension
     * @param string $parsed    passed by reference; receives the detected type
     *
     * @throws KnownFileMimeTypeException once a known type has been identified
     */
    private function parseMimeTypeText(string $mimeType, string $extension, string &$parsed) {

        // Any text/* MIME type is indexed as text.
        if (substr($mimeType, 0, 5) === 'text/') {
            $parsed = self::MIMETYPE_TEXT;
            throw new KnownFileMimeTypeException();
        }

        // 20220219 Parse XML files as TEXT files
        if (substr($mimeType, 0, 15) === 'application/xml') {
            $parsed = self::MIMETYPE_TEXT;
            throw new KnownFileMimeTypeException();
        }

        // 20220219 Parse .drawio file
        if ($extension === 'drawio') {
            $parsed = self::MIMETYPE_TEXT;
            throw new KnownFileMimeTypeException();
        }

        // (excerpt ends here)
    }

This way, application/xml and .drawio files are included for indexing.
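
The same check can be tried outside the app. The following is only a minimal standalone sketch, not code from files_fulltextsearch: the function name shouldIndexAsText and the extension allow-list (including 'md' and 'markdown', as asked about earlier in this thread) are assumptions made for illustration.

```php
<?php
declare(strict_types=1);

/**
 * Hypothetical helper (not part of files_fulltextsearch): decide whether a
 * file should be handed to the indexer as plain text.
 */
function shouldIndexAsText(string $mimeType, string $extension): bool
{
    // Same prefix checks as in the excerpt above.
    if (substr($mimeType, 0, 5) === 'text/') {
        return true;
    }
    if (substr($mimeType, 0, 15) === 'application/xml') {
        return true;
    }

    // Assumed allow-list; 'md' and 'markdown' are illustrative additions.
    $textExtensions = ['drawio', 'md', 'markdown'];

    return in_array(strtolower($extension), $textExtensions, true);
}

// Usage: a Markdown file reported with a generic MIME type.
var_dump(shouldIndexAsText('application/octet-stream', 'md'));
```

Returning a boolean instead of throwing keeps the sketch easy to test; the real method signals a match through KnownFileMimeTypeException, as shown in the excerpt.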

.drawio files need an extra extraction step, because their content is deflated XML.

Anyway, I have managed to get .xml and .drawio files indexed.
If anyone is interested, I can push my branch.
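
For anyone curious what the extra .drawio step could look like, here is a rough sketch, not the code from my branch. It assumes the payload of a diagram element in a compressed .drawio file is Base64 over raw-deflate over URL-encoded XML; that layout and the function name decodeDrawioDiagram are assumptions for illustration. It needs PHP's zlib extension.

```php
<?php
declare(strict_types=1);

/**
 * Hypothetical helper: decode the text payload of one <diagram> element.
 * Assumed layout: Base64( raw-deflate( URL-encoded XML ) ).
 * Returns null when the payload does not decode under that assumption.
 */
function decodeDrawioDiagram(string $payload): ?string
{
    // Strict mode rejects characters outside the Base64 alphabet.
    $compressed = base64_decode(trim($payload), true);
    if ($compressed === false) {
        return null;
    }

    // gzinflate() handles raw deflate data (no zlib or gzip header).
    $urlEncoded = @gzinflate($compressed);
    if ($urlEncoded === false) {
        return null;
    }

    return rawurldecode($urlEncoded);
}

// Usage: round trip with a payload built the same way.
$xml = '<mxGraphModel><root><mxCell id="0"/></root></mxGraphModel>';
$payload = base64_encode(gzdeflate(rawurlencode($xml)));
var_dump(decodeDrawioDiagram($payload) === $xml);
```

The decoded XML could then be passed on as text, in the same way as the application/xml case above.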

My blog article on the issue is here

4 participants