Reduce processing overhead for large amount of imagery data #524

Open
ptpt opened this issue Aug 9, 2022 · 11 comments

ptpt (Member) commented Aug 9, 2022

Although processing images (reading geotags) is fast, it is still slow for large amounts of data. We should avoid reprocessing images that have already been processed or uploaded. See https://forum.mapillary.com/t/mapillary-tools-0-8-no-status-kept/6197/16

ptpt changed the title from "Improve processing overhead for large amount of imagery data" to "Reduce processing overhead for large amount of imagery data" on Aug 9, 2022
richlv (Contributor) commented Sep 19, 2022

Heya, any luck with this one?
I ended up reprocessing and EXIF-testing a lot of data today, and would have benefited a lot from the suggested functionality :)

ptpt (Member, Author) commented Sep 20, 2022

@richlv maybe this isn't an issue if you process your images once, and then keep uploading.

Instead of repeatedly processing and uploading:

mapillary_tools process_and_upload path/to/images

use this workflow:

# process once (no need to process again unless you have new images added)
mapillary_tools process path/to/images

# then run upload (can be run multiple times due to interruptions)
mapillary_tools upload path/to/images

richlv (Contributor) commented Jan 19, 2023

maybe this isn't an issue if you process your images once, and then keep uploading

Thank you for the workaround - that might work, but it adds extra complexity to the workflow.
Capturing and uploading the images already takes some effort; it would be great if the tools made this easier :)

Same as with the upload status tracking, could the processing status be tracked?

ptpt (Member, Author) commented Jan 20, 2023

There is a tradeoff here. Status tracking could be added for the process step in process_and_upload, but it adds complexity: we can't write status files in the processed folders because they could be read-only, and users get annoyed by these temporary files. We would then need a centralized database to track when each folder was last changed (mtime), and that is hard to do right: https://apenwarr.ca/log/20181113
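
Purely to illustrate the bookkeeping involved (nothing below exists in mapillary_tools; the state location and key scheme are made up), an external wrapper that skips process based on a recorded mtime could look roughly like this:

#!/bin/sh
# Hypothetical sketch only: remember each image directory's mtime in a state
# directory and skip processing when the directory appears unchanged.
set -e

STATE_DIR="${HOME}/.mapillary_process_state"   # made-up location
mkdir -p "$STATE_DIR"

image_dir="$1"

# one state file per directory, keyed by a checksum of its path (cksum is POSIX)
key=$(printf '%s' "$image_dir" | cksum | cut -d' ' -f1)
state_file="$STATE_DIR/$key"

# GNU stat; on macOS/BSD use: stat -f %m
current_mtime=$(stat -c %Y "$image_dir")

if [ -f "$state_file" ] && [ "$(cat "$state_file")" = "$current_mtime" ]; then
    echo "unchanged since last run, skipping process: $image_dir"
else
    mapillary_tools process "$image_dir"
    printf '%s' "$current_mtime" > "$state_file"
fi

Note that a directory's mtime only changes when entries are added or removed, not when an existing file's contents change, which is exactly the kind of corner case the article above describes.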

Thank you for the workaround - that might work, but it adds extra complexity to the workflow.

Simply replace process_and_upload with upload. process_and_upload is an alias for running process and then upload. See: https://github.com/mapillary/mapillary_tools#process_and_upload

ptpt closed this as completed on Jan 20, 2023
richlv (Contributor) commented Jan 23, 2023

Thank you so much for looking into this.
Can we please reopen this issue? Details on the reasoning follow.

As far as I understand, mapillary_tools already uses TMPDIR to store the archive files for upload.
Thus status could be stored in this same directory, or a new environment variable like MLY_STATUS_DIR could be used.

I'm not quite sure why mtime would need to be tracked - just recording that a particular file has been processed seems sufficient. If I recall correctly, that is how the previous processing status tracking worked.
If mtime tracking is desirable for cases where the file path and name are unchanged but the contents change, that seems like an edge case that could simply be documented. Such an implementation would likely be sufficient for most, if not all, users and situations.
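
As a rough sketch of the idea (MLY_STATUS_DIR is only the proposed variable here, not something mapillary_tools reads today; the wrapper relies on the existing --desc_path option to keep the description file out of the image directory):

#!/bin/sh
# Rough sketch only: keep processing status (the description file itself)
# outside the possibly read-only image directories, in a dedicated status
# directory. MLY_STATUS_DIR is a proposed variable, not an existing one.
set -e

: "${MLY_STATUS_DIR:=${TMPDIR:-/tmp}/mapillary_status}"
mkdir -p "$MLY_STATUS_DIR"

image_dir="$1"

# one description file per directory, keyed by a checksum of its path
key=$(printf '%s' "$image_dir" | cksum | cut -d' ' -f1)
desc_path="$MLY_STATUS_DIR/${key}_description.json"

if [ -f "$desc_path" ]; then
    echo "already processed: $image_dir"
else
    # (a real implementation would also need to handle a run that gets
    # interrupted mid-process and leaves a partial description file)
    mapillary_tools process "$image_dir" --desc_path="$desc_path"
fi
mapillary_tools upload "$image_dir" --desc_path="$desc_path"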

Replacing process_and_upload with upload wouldn't work - that basically requires me to track processing status manually anyway, or to script that logic myself. And it would be so much better for such an improvement to be available to all users :)

To clarify the need, here is a frequent use case, especially when travelling:

  • Collect many images with an action camera, up to 100GB per day.
  • Move images off the camera storage, split into directories. Up to 30-40 directories per day.
  • Overnight, try to upload as many of the images as possible.

To do so, I have a simple script that loops over these directories and runs process_and_upload.
Directory status varies a lot: some directories are processed, some are uploaded, and due to varying internet access quality uploads are not always sequential (some might fail in the middle) or are only partial.
Usually only a portion of the directories gets uploaded on any given day, and new directories are added daily.

Thus the only thing the script really needs to do is loop over these directories and process & upload.
But without processing status tracking in mapillary_tools, it's either a lot of re-processing, or I need to implement status tracking in my script. That is very tricky to do per image, and doing it per directory could record an incorrect status if processing is interrupted. Any such implementation outside of mapillary_tools would also be fragile, as tool changes could break it and cause data loss.

ptpt (Member, Author) commented Jan 23, 2023

Thanks for the context. Would you mind sharing the relevant part of your script?

Here is a simple idea that avoids reprocessing by checking for the existence of each image directory's mapillary_image_description.json:

#!/bin/sh
set -e

# loop through all image directories
for image_dir in "$@"; do
    if [ -d "$image_dir" ]; then
        desc_path="${image_dir}/mapillary_image_description.json"
        # check if the desc file exists
        if [ -f "$desc_path" ]; then
            echo "already processed: $image_dir"
        else
            mapillary_tools process "$image_dir" --desc_path="$desc_path"
        fi
        mapillary_tools upload "$image_dir" --desc_path="$desc_path"
    fi
done

Usage:

sh loop.sh ~/daily_captures/*

ptpt reopened this on Jan 23, 2023
richlv (Contributor) commented Jan 23, 2023

The script, if we exclude the part that does some GPX trickery for edge cases, is trivial.
$1 is the path (passed from a for loop in the command); $gt is usually empty.

export TMPDIR='/somewhere/tmp'
...
echo "y" | mapillary_tools process_and_upload --user_name richlv --interpolate_directions "$1" $gt --skip_process_errors

Thank you for the script idea, greatly appreciated.

If checking for "mapillary_image_description.json" is a good approach, what would be the reason not to do so in mapillary_tools right away and skip processing if the file is found?
That seems simpler than users scripting their own hacks, and a status message when the file is found would make the behaviour very obvious.

ptpt (Member, Author) commented Jan 23, 2023

If checking for "mapillary_image_description.json" is a good approach, what would be the reason not to do so in mapillary_tools right away and skip processing if the file is found?

Think of mapillary_tools process image_dir as being like exiftool image_dir: its job is to extract metadata from the images in the folder and write it out. Having process check its own output file and silently skip might be confusing to users.

I will keep the issue open until there is a good solution. For now, I would suggest scripting it:

image_dir="$1"
desc_path="${image_dir}/mapillary_image_description.json"
# check if the desc file exists
if ! [ -f "$desc_path" ]; then
    mapillary_tools process "$image_dir" $gt --desc_path="$desc_path" --interpolate_directions --skip_process_errors
fi
mapillary_tools upload "$image_dir" --desc_path="$desc_path" --user_name richlv

to replace

echo "y" | mapillary_tools process_and_upload --user_name richlv --interpolate_directions "$1" $gt --skip_process_errors

richlv (Contributor) commented May 16, 2023

Have we observed specific scenarios where it would be confusing?
I'm wondering whether the current overhead isn't causing more confusion :)
A long time ago I added status checks to mapillary_tools and opened a pull request, but that was before a full rewrite or two, so I won't be able to produce anything merge-worthy now.

ptpt (Member, Author) commented May 24, 2023

@richlv could you try out https://github.com/mapillary/mapillary_tools/releases/tag/v0.10.2? Image processing performance is improved significantly in this version (700 images per second on my M1 MacBook Pro).

richlv (Contributor) commented May 24, 2023

Thanks, that indeed is faster - greatly appreciated :)
It would still be great to automatically avoid re-processing. Just this week I restarted uploads several times with 10+ directories, 2 GB of images per directory.

Is the potential for confusion easy to avoid, if it is still a concern?
