[BUG] Interrupted uploads do not resume #569
🤦♂️ I have just spent over 4 hours uploading a sequence of 163 images worth 618,237,384 bytes. mapillary_tools is still not done because it constantly resets to either 0 or an arbitrary offset. This is something that should have been completed in about 5 minutes. And, before you ask: no, my connection is stable. You see, this is exactly what I mean by “Mapillary being dysfunctional”; every upload tool from Mapillary is either completely useless and unusable, or a horror to use, and nobody seems to care. It is almost offensive when you look at the number of hoops I have to jump through just to contribute a meager amount of images voluntarily. Only because I am a geek do I have the patience and skill to deal with it. But how do you expect any non‑technical people to be willing to contribute via this mess? |
Thanks for reporting and sorry for the inconvenience. This looks suspicious indeed. It seems the server returned offset=0 regardless of whether it already had a data chunk uploaded there. Could you check whether it works without a proxy? I can't reproduce it locally:
|
If either of you needs a reference point (debug logs) to work with: I have been using the tools to upload BlackVue videos 24/7 for 2 weeks now, something like 870 GB and 180,000 mp4's so far, and that seems to be working okay. I have not been watching the logs closely, but have only seen maybe 20 of the ConnectionErrors similar to the above. No doubt uploading mp4's rather than jpg's has a lesser chance of failure. Drop me a note if you would like any analysis or a dump. |
More logs…
Sure, this may be proxy related. However, I highly doubt it. To me, the server looks like the most probable culprit because the offset is computed by the server. Unfortunately, I cannot bypass the proxy. Anyhow, proxy or not, it should work; I cannot imagine a scenario in which the server would respond with a wrong offset because of a proxy, or in which the client would modify the offset. It is the server that keeps track of the upload session state. Btw, I have never had any issues with other online services accepting large (interrupted) uploads. I am not sure how you could reproduce this, but it seems like these recurring messages are usually causing trouble:
|
@gitne I suspect it's related to the HTTP status 412 error; curious to see the payload. I'm trying to add a check on retries: if the offset fetched from the server does not move as expected, exit with the full HTTP response printed out. I will make a new alpha release and let you know soon. |
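For illustration, a minimal sketch of such a stall check (hypothetical helper name and response shape, not the actual mapillary_tools code): remember the last offset the server reported and abort with the full HTTP response if it fails to advance.

import requests

def assert_offset_advanced(prev_offset: int, resp: requests.Response) -> int:
    """Raise with the full HTTP response if the server-reported offset did not move."""
    offset = resp.json()["offset"]  # assumed response shape for this sketch
    if offset <= prev_offset:
        raise RuntimeError(
            f"upload offset stalled at {offset} (was {prev_offset}); "
            f"server responded {resp.status_code}: {resp.text}"
        )
    return offset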
Great, thank you! 😄 Just FYI: I have observed this behavior ever since I started using version 0.8.0. |
Just a quick check @gitne: were you uploading this sequence from multiple processes/machines simultaneously? That might cause the offset to be inconsistent AFAICS. |
No, single instance on one machine only. |
The fix is released here https://github.com/mapillary/mapillary_tools/releases/tag/v0.9.5a1 BTW it would be great if you could set up a local env here https://github.com/mapillary/mapillary_tools#development so we can test on branches without releasing binaries (faster iteration). |
This is what I got on my first attempt with v0.9.5a1:
Next, I am simply going to retry and see how far it will go.
I will see what I can do. My best bet is to build a Flatpak because it provides a defined set of dependencies. This is going to help get reproducible results. 🤞 |
After retry I got this:
The offset is apparently still incorrect on retry. The last offset was 184,550,620 and on retry it was 18,152,520. 😕 That’s an order of magnitude lower! 😲 |
It seems that you were uploading zipfiles. What looks really strange, though, is the embedded md5sum in the zip filename. Unfortunately, I can't reproduce this issue either:
# generate zipfiles in mapillary_public_uploads
python3 -m mapillary_tools.commands --verbose process_and_upload ~/ImageTestData/myimages/ --dry_run
# upload these zipfiles
python3 -m mapillary_tools.commands --verbose upload_zip ./mapillary_public_uploads
# Here is one of the logs:
# as you can see the md5sum in the filename and the log match
2022-10-14 10:30:25,780 - DEBUG - Sending upload_end via IPC: {'total_sequence_count': 3, 'sequence_idx': 2, 'file_type': 'zip', 'import_path': 'mapillary_public_uploads/mly_tools_6bfe61d03070c89f5d6fccdced76f0c7.zip', 'sequence_image_count': 2, 'entity_size': 6746255, 'md5sum': '6bfe61d03070c89f5d6fccdced76f0c7', 'upload_start_time': 1665768624.169763, 'upload_total_time': 1.478506088256836, 'offset': 6746255, 'retries': 0, 'upload_last_restart_time': 1665768624.301506, 'upload_first_offset': 0, 'chunk_size': 0, 'upload_end_time': 1665768625.7800121}
@gitne would you mind sharing one of these zipfiles with us for debugging via … |
A bit of context on how the md5sum is calculated for zipfiles (all images are zipped before being uploaded). Assume we are uploading N images as a sequence:
Once we have the … |
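The exact steps were lost in the thread rendering above, so the following is only a hypothetical sketch of how a checksum for a sequence of N images could be derived (hashing each image's bytes in a deterministic order); the real mapillary_tools logic may differ.

import hashlib
from pathlib import Path
from typing import Iterable

def sequence_md5sum(image_paths: Iterable[Path]) -> str:
    """Hypothetical: feed every image's bytes, in sorted order, into a single MD5."""
    md5 = hashlib.md5()
    for path in sorted(image_paths):
        md5.update(path.read_bytes())
    return md5.hexdigest()  # e.g. the digest seen in mly_tools_<md5sum>.zip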
I think I may have solved the mystery behind those resetting offsets. And it definitely IS server related/caused. It turns out that our ISP uses dynamic public IP addresses. Whenever the ISP shifts to a new public IP address, the upload server counts this as a new upload session. Naturally, our proxy sits on the local network and is thus also affected by the public IP address change. So, apparently, the upload server’s upload session is tied to a specific client (public) IP address. Well, let’s just say that this is a bit sub‑optimal. Creating and assigning an upload session token that is independent of any IP address per upload session would probably fix this issue and other similar scenarios where the upload client has no control over its public IP address. Migrating to an upload session token would not pose a security threat either: not only is this a widely adopted and commonly used method for authenticating an upload session, it is also very secure because the token is (only) a session secret with a very limited lifetime. The only thing the server has to be wary of is creating too many tokens per IP address and per unit of time. Fortunately, because of the intrinsic nature of a lifetime-limited token, the upload server does not need to store or keep track of issued tokens. Thus, you can implement load balancing and flooding protection in one go. I have tested this hypothesis by rerouting network traffic over a static public IP address node, and I did not have a single interruption over GBs of data. So, after all, I guess my gut feeling was right. |
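A minimal sketch of the kind of stateless, lifetime-limited upload session token proposed here (illustrative only, not how Mapillary's servers actually work): the server signs the session key and an expiry timestamp with a secret, so it can verify the token later without storing it, and the token is not bound to any client IP address.

import hashlib
import hmac
import time

SERVER_SECRET = b"example-secret"  # hypothetical server-side secret

def issue_upload_token(session_key: str, lifetime_s: int = 3600) -> str:
    """Stateless token of the form '<session_key>.<expiry>.<signature>'."""
    expiry = int(time.time()) + lifetime_s
    payload = f"{session_key}.{expiry}".encode()
    sig = hmac.new(SERVER_SECRET, payload, hashlib.sha256).hexdigest()
    return f"{session_key}.{expiry}.{sig}"

def verify_upload_token(token: str) -> bool:
    """Valid if the signature matches and the token has not expired; no server-side state needed."""
    session_key, expiry, sig = token.rsplit(".", 2)
    payload = f"{session_key}.{expiry}".encode()
    expected = hmac.new(SERVER_SECRET, payload, hashlib.sha256).hexdigest()
    return hmac.compare_digest(sig, expected) and int(expiry) > time.time()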
Very interesting observation. Thanks for sharing.
The upload session is decided by the authorized user and the session_key in the URL. I don't think an IP change changes upload sessions. If you upload a single file from IP A, then interrupt it, and then resume from IP B, the single file will be uploaded into the same bucket, instead of two buckets. Changing IP during uploading will interrupt and close the underlying TCP connection, which results in the data chunk being only partially uploaded. In this case the HTTP client should raise a network error, and retry with the new offset fetched from the server. However, what we observed from the logs is that the HTTP request didn't get interrupted and the server even returned success (2xx); then the client continued uploading a few more chunks based on the calculated (expected) offset, until the HTTP 412 error was returned. Could you run the test cli with/without a static public IP address, and see what happens (especially the HTTP responses)?
# Generate a ~2GB file (4,024,000 blocks of 512 bytes)
dd if=/dev/zero of=TEST_BIG_FILE count=4024000
# Upload it
# NOTE: The test cli does upload `TEST_BIG_FILE` to the server, but it won't be shown on your profile, and it will be deleted after a few days.
python3 -m tests.cli.upload_api_v4 --verbose TEST_BIG_FILE TEST_SESSION --user_name=YOUR_ACCOUNT |
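For reference, a rough sketch of the resume flow described above (hypothetical endpoint and header names, not the real upload_api_v4 client): on every start or restart, the client asks the server for the current offset, seeks to it, and streams the remaining chunks from there.

import requests

CHUNK_SIZE = 2 * 1024 * 1024  # assumed 2 MB chunks for this sketch

def resume_upload(path: str, session_url: str, headers: dict) -> None:
    """Fetch the server-side offset, seek to it, and upload the remaining data."""
    resp = requests.get(session_url, headers=headers)  # hypothetical "fetch offset" call
    resp.raise_for_status()
    offset = resp.json()["offset"]  # the server is the source of truth for the offset
    with open(path, "rb") as fp:
        fp.seek(offset)
        while chunk := fp.read(CHUNK_SIZE):
            put = requests.post(
                session_url,
                headers={**headers, "Offset": str(offset)},  # hypothetical header
                data=chunk,
            )
            put.raise_for_status()  # an HTTP 412 here means client and server disagree on the offset
            offset += len(chunk)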
Sorry for the long delay. Unfortunately, I was 🤧 incapacitated for the last week (no COVID). Anyway, I have run the test you suggested, as usual over our proxy, no rerouting:
Well, and then it crashed. I am not sure whether this is intended. Anyway, I hope this output helps you figure out what you wanted to know. |
So, I have run the test with rerouting. Btw, whenever you see a traceback in the log, it means that I manually resumed uploading, just so that it would finish.
I interrupted here because I noticed that the offset had been reset to 0 again. This log is actually from my second attempt.
I hope this gives you a clue. 😉 |
Thanks @gitne! The logs you shared above didn't reproduce these HTTP 412 errors. However, I've reproduced them locally by switching a VPN on/off during uploading. I think we should retry on these 412 errors instead of exiting. The reason why it took so long for you is because:
I guess we should revert #571 and let you configure the retry waiting time. |
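A sketch of what "retry on 412 instead of exiting" with a configurable wait could look like (send_chunk and fetch_offset are hypothetical callables wrapping the HTTP calls): on a 412, re-sync the offset with the server and try again after a pause rather than aborting.

import time
import requests

def upload_chunk_with_retry(send_chunk, fetch_offset,
                            retry_wait_s: float = 5.0, max_retries: int = 10) -> int:
    """Retry a chunk upload on HTTP 412, re-fetching the server offset between attempts."""
    for _attempt in range(max_retries):
        try:
            return send_chunk()  # returns the new offset on success
        except requests.HTTPError as exc:
            if exc.response is not None and exc.response.status_code == 412:
                time.sleep(retry_wait_s)  # configurable waiting time
                fetch_offset()  # re-sync with the server before retrying
                continue
            raise
    raise RuntimeError(f"giving up after {max_retries} retries on HTTP 412")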
I don’t know; just adding some waiting time will not solve the core issue, which is that the offset is reset to 0 on a new IP address. All I know is that other upload services/servers do not exhibit this issue whenever I upload via a web browser. This issue did not exist with Mapillary when the web uploader was still available either. So, in my understanding, there is something wrong with the server not accepting a live upload token from a different IP. Could it be a Cloudflare, CloudFront, or similar configuration issue? My best guess is that there is some sort of network security software in front of the upload server that sends error codes on client-closed connections. And when the client tries to reestablish an existing TLS session with the same (public) session key from a different IP address, the network security software says “Nah! Multiple IP addresses cannot share a TLS (public) session key” (which is not completely devoid of security logic, but imho this matter should be left to TLS to deal with). Let me see if I can perhaps reproduce an HTTP 412 error code with that upload test script. |
@gitne the offset reset is likely happening when … Note these logs:
Do you get a different …? While fixing the issue with the team, the best we can do is:
Note on 1: it does not solve the underlying problem, which is uploading the same data multiple times, but it makes uploading faster. |
Right, I had not thought of that; then I guess the upload token is not shared either.
If the above is true then this is indeed very reasonable. There is no need to pause.
Could be, I have not paid attention to this. I guess mainly because I could not figure out what that cryptic … |
Indeed, the offset resets on … |
v0.9.5 still resets to 0 on …
|
Yeah, the client can't do much about this issue; it has to be fixed on the server side. |
So basically this issue persists. This time, I have a different but similar setup: I upload via a cell phone network. It is sloooow 🐌 but it basically works. The issue for me now is that the zip file to upload is huge (gigabytes in size). During the day, I need my phone for daily use and cannot use it for uploading, but I can use it to upload during night hours. Whenever I tether my desktop computer 🖥️ (where I do image and metadata processing before upload) to start or continue uploading, I usually get a new public IP address.
As you can see, the upload session resets. The uploading 🖥️ does not reboot. A 16 MB chunk size works on a slow connection too, but it is way too much to re-upload after any network error. Furthermore, as you can see, TLS can also run into errors. Next time, I am going to switch from the default GnuTLS implementation to OpenSSL. Maybe this will make uploads work more reliably. |
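To put a number on that: with a 16 MB chunk, every dropped connection can throw away up to a whole chunk of progress, which on a slow uplink is minutes of uploading. A quick back-of-the-envelope calculation (the uplink speed is an assumption, not a measurement):

chunk_size_bytes = 16 * 1024 * 1024  # 16 MB chunk
uplink_mbit_s = 1.0  # assumed slow cellular uplink
seconds_lost = chunk_size_bytes * 8 / (uplink_mbit_s * 1_000_000)
print(f"up to {seconds_lost / 60:.1f} minutes re-uploaded per connection drop")  # ~2.2 minutes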
Sometimes, the phone can switch between different cell network operators (because of weather conditions etc), which can also lead to public IP address changes. |
For now, I reroute my upload traffic through a public static IP address, but this is only an ugly workaround. Voluntary contributors who upload stuff for free should not need to go to these lengths to upload images. You have to fix this! |
|
@nickplesha Can you fix this on the server side too? You do not have to read everything above; you can start reading from #569 (comment). |
I'll discuss with the team and get back. |
There is an alternative to modifying network settings on the server (if that is not an option for you): like before, contributors could upload a sequence in batches, and the backend would join them together into a sequence by virtue of the SequenceUUID field, e.g.
{
  "SequenceUUID": "01234567-89AB-CDEF-0123-456789ABCDEF"
}
or, in a more compact encoding of the same identifier:
{
  "SequenceUUID": "ASNFZ4mrze8BI0VniavN7w"
} |
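The two JSON snippets encode the same identifier; the second is just the raw 16 UUID bytes, base64url-encoded without padding. A quick check in Python:

import base64
import uuid

u = uuid.UUID("01234567-89AB-CDEF-0123-456789ABCDEF")
compact = base64.urlsafe_b64encode(u.bytes).rstrip(b"=").decode()
print(compact)  # ASNFZ4mrze8BI0VniavN7w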
So, I read through the issue posts again in order to better understand what is going on. It comes down to this: when a client wants to resume an interrupted upload from the same IP address, it gets the same bucket as before and things work as expected. But when the client wants to resume from a different IP address, it gets another bucket (or there is a great chance of that happening). And because upload session state is not shared between buckets, in the latter case it counts as a new upload session. 💡 |
Did you change anything on the server side? Yesterday, I tried to upload a large zip file via a static IP address, and it went fine until it reset to 0 after about 24 hours. So, what I meant by an upload session lifetime of about 24 hours was that the lifetime timer should reset whenever a packet or chunk (whatever) is received. I am not sure whether you really need a total lifetime timer. Though I am aware that long-running uploads may pose some security threat, this can easily be mitigated by limiting the number of concurrent uploads per remote IP address. Note that limiting the number of concurrent upload sessions per IP address to one may block other clients behind a NAT on the same public IP address from uploading. |
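A sketch of the suggested sliding expiry (illustrative server-side logic, not Mapillary's actual implementation): the expiry clock restarts whenever a chunk arrives, so only idle sessions die, and a per-IP cap on concurrent sessions bounds abuse without a total lifetime limit.

import time

IDLE_TIMEOUT_S = 24 * 3600  # assumed: expire only after 24h of inactivity, not 24h total
MAX_SESSIONS_PER_IP = 4  # assumed flood-protection limit (more than 1, so NAT users are not blocked)

sessions: dict = {}  # session_key -> {"last_seen": float, "ip": str}

def is_expired(session: dict) -> bool:
    return time.time() - session["last_seen"] > IDLE_TIMEOUT_S

def on_chunk_received(session_key: str, client_ip: str) -> None:
    """Reset the idle timer on every received chunk instead of enforcing a total lifetime."""
    active = sum(1 for s in sessions.values() if s["ip"] == client_ip and not is_expired(s))
    if session_key not in sessions and active >= MAX_SESSIONS_PER_IP:
        raise PermissionError("too many concurrent upload sessions for this IP")
    sessions[session_key] = {"last_seen": time.time(), "ip": client_ip}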
Basic information
- mapillary_tools version: 0.9.5
- Operating system: Linux, probably any
- any
Steps to reproduce behavior
Interrupting an upload with Ctrl+C and then resuming (while the upload session is still alive on the server) with exactly the same command should have the same effect. On resume, the log reports 'offset': 0.
It is unclear whether this is a client or server bug. My gut feeling tells me that it is a server bug, but I may be wrong. The server may change the chunk size during upload depending on load. The client should adapt to this. The server should also always respond with the correct offset, independent of the current chunk size.
Expected behavior
Per-sequence uploads should resume at the offset of the first incomplete chunk.
Actual behavior
Note that the log shows 'offset': 0, while it should have been 'offset': 15171468, because this is the offset of the first incomplete chunk.
All of the above means that it is currently extremely difficult to upload large sequences on low-bandwidth connections, because as soon as the load on the server changes (which is very likely on long uploads), the chunk size changes and the client has to restart uploading from offset 0. This is a huge waste of resources!