Attach data to a task: better MIME type detection #8346

Open
2 tasks done
deltheil opened this issue Aug 26, 2024 · 2 comments
Labels
enhancement New feature or request

deltheil commented Aug 26, 2024

Actions before raising this issue

  • I searched the existing issues and did not find anything similar.
  • I read/searched the docs

Is your feature request related to a problem? Please describe.

Context

I am uploading image files via https://app.cvat.ai/api/docs/#tag/tasks/operation/tasks_create_data (using the client_files parameter).

In my case, the image files are stored on disk in a content-addressable manner, mimicking how git stores and names objects. E.g. a JPEG file would typically be stored as /var/misc/images/1f/ec4f5cee029f96c1e9eddd09821a51c0a9f80a, with no file extension.

Problem

The problem lies in the CVAT engine's MIME type detection, which relies on file extensions:

  • def count_files(file_mapping, counter):
        for rel_path, full_path in file_mapping.items():
            mime = get_mime(full_path)
            if mime in counter:
                counter[mime].append(rel_path)
            elif rel_path.endswith('.jsonl'):
                continue
            else:
                slogger.glob.warn("Skip '{}' file (its mime type doesn't "
                    "correspond to supported MIME file type)".format(full_path))

    counter = { media_type: [] for media_type in MEDIA_TYPES.keys() }

    count_files(
        file_mapping={ f:f for f in data['remote_files'] or data['client_files']},
        counter=counter,
    )
  • def _is_image(path):
        mime = mimetypes.guess_type(path)
        # Exclude vector graphic images because Pillow cannot work with them
        return mime[0] is not None and mime[0].startswith('image') and \
            not mime[0].startswith('image/svg')

E.g. _is_image builds upon https://docs.python.org/3/library/mimetypes.html#mimetypes.guess_type, which guesses the type from the file name only:

def _is_image(path):
    mime = mimetypes.guess_type(path)
    # Exclude vector graphic images because Pillow cannot work with them
    return mime[0] is not None and mime[0].startswith('image') and \
        not mime[0].startswith('image/svg')
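
To make the failure mode concrete, here is a small standalone check. It is only a sketch: `is_image_by_extension` is a local copy of the logic quoted above written for illustration, not CVAT's own function, and the path is the example path from this issue (the file does not need to exist, since `mimetypes.guess_type` only looks at the name).

```python
import mimetypes


def is_image_by_extension(path):
    """Local re-implementation of the extension-based check, for illustration only."""
    mime, _encoding = mimetypes.guess_type(path)
    # Exclude SVG, mirroring the quoted helper.
    return (
        mime is not None
        and mime.startswith("image")
        and not mime.startswith("image/svg")
    )


# Content-addressable name with no extension, as in the example above.
hashed_path = "/var/misc/images/1f/ec4f5cee029f96c1e9eddd09821a51c0a9f80a"

guess_without_ext = mimetypes.guess_type(hashed_path)
print(guess_without_ext)                             # (None, None)
print(is_image_by_extension(hashed_path))            # False
print(is_image_by_extension(hashed_path + ".jpg"))   # True
```

The file content never enters the picture: the same bytes are accepted or rejected depending purely on the name.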

tl;dr

In my case, since none of the files has an extension, all the uploaded image files get skipped.

Describe the solution you'd like

I think it would be great if MIME type detection could be expanded to support magic detection (based on file headers), e.g. using https://github.com/ahupp/python-magic or anything equivalent. In other words, detection should not be limited to file extensions (.jpg, etc.).

NB: I am talking about images, but the same could of course be done for other media types.
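
As a rough illustration of the idea (not a proposal for CVAT's actual implementation), the sketch below keeps the cheap extension lookup as the first step and only reads the first few bytes of a file when the name gives no answer. It uses a tiny hand-written signature table instead of python-magic so that it runs with the standard library alone; the names `sniff_mime` and `guess_mime` are hypothetical, and the table is deliberately incomplete.

```python
import mimetypes
import os
import tempfile

# A few well-known file signatures ("magic numbers"); intentionally not exhaustive.
_SIGNATURES = (
    (b"\xff\xd8\xff", "image/jpeg"),
    (b"\x89PNG\r\n\x1a\n", "image/png"),
    (b"GIF87a", "image/gif"),
    (b"GIF89a", "image/gif"),
)


def sniff_mime(header):
    """Return a MIME type if the leading bytes match a known signature, else None."""
    for signature, mime in _SIGNATURES:
        if header.startswith(signature):
            return mime
    return None


def guess_mime(path):
    """Try the extension first; fall back to reading the file header."""
    mime, _encoding = mimetypes.guess_type(path)
    if mime is not None:
        return mime
    with open(path, "rb") as fh:
        return sniff_mime(fh.read(16))


# Demo: an extensionless file that starts with a JPEG signature.
demo_dir = tempfile.mkdtemp()
demo_path = os.path.join(demo_dir, "ec4f5cee029f96c1e9eddd09821a51c0a9f80a")
with open(demo_path, "wb") as fh:
    fh.write(b"\xff\xd8\xff\xe0" + b"\x00" * 12)

detected = guess_mime(demo_path)
print(detected)
```

Because the header is only read when the extension lookup fails, files that already carry a recognised extension would not pay any extra I/O cost in this sketch.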

Describe alternatives you've considered

As a workaround, I am forced to rename the files (add an extension) at upload time.
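
For reference, the workaround can be done without touching the files on disk, since HTTP clients generally let the filename sent in a multipart upload differ from the local path. The helper below is a hypothetical sketch that only computes such an upload name; it assumes the git-like two-level layout from the example path and that the caller already knows the right extension.

```python
from pathlib import PurePosixPath


def upload_name(path, extension=".jpg"):
    """Build a flat file name with an extension from a git-style object path.

    Hypothetical helper: joins the two-character directory and the remainder
    of the hash, then appends the extension supplied by the caller.
    """
    p = PurePosixPath(path)
    return f"{p.parent.name}{p.name}{extension}"


name = upload_name("/var/misc/images/1f/ec4f5cee029f96c1e9eddd09821a51c0a9f80a")
print(name)
```

The resulting name would then be passed as the filename of the corresponding client_files entry, while the bytes are still read from the original path.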

Additional context

No response

deltheil added the enhancement label Aug 26, 2024

bsekachev (Member) commented Aug 26, 2024

Hello,

python-magic is significantly slower. We used it in the past, but it was decided to work with extensions.

Additionally, it will not work with cloud storages as CVAT needs to download file content -> much much slower.

deltheil (Author) commented

> python-magic is significantly slower. We used it in the past, but it was decided to work with extensions.

Right, that's a drawback.

> Additionally, it will not work with cloud storages as CVAT needs to download file content -> much much slower.

True (perhaps the Content-Type (HTTP header) and/or HEAD requests could be leveraged here - not sure how it's being handled right now).
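
A minimal sketch of that Content-Type idea, using only the standard library. Both function names are hypothetical, nothing here reflects how CVAT currently handles remote or cloud files, and it assumes the server answers HEAD requests with a meaningful Content-Type, which is not guaranteed.

```python
import urllib.request


def is_image_content_type(value):
    """Decide from a Content-Type header value whether the resource is a raster image."""
    if not value:
        return False
    # Drop parameters such as "; charset=..." and normalise case.
    media_type = value.split(";", 1)[0].strip().lower()
    return media_type.startswith("image/") and not media_type.startswith("image/svg")


def head_content_type(url, timeout=10.0):
    """Issue a HEAD request and return the Content-Type header, or None if absent."""
    request = urllib.request.Request(url, method="HEAD")
    with urllib.request.urlopen(request, timeout=timeout) as response:
        return response.headers.get("Content-Type")


print(is_image_content_type("image/jpeg"))
print(is_image_content_type("application/octet-stream"))
```

A HEAD request transfers headers only, so the file body would not need to be downloaded. Storage backends that report a generic type such as application/octet-stream would still need another fallback.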

For context: when using the FiftyOne built-in CVAT integration, this even turns into a bug as _get_job_ids polls forever (and no job is ever returned).
