Multiple download issues with the latest pyEGA3-5.0.1 (now 5.0.2) which makes it almost impossible to download data successfully #192
Comments
These are the exact errors I am getting too. I am also using the newest version; with -c 5/10/20 my download speed could only reach around 200mb/s. My data is around 4gb and it still hasn't finished after 4 days. It did manage to complete the download once, but it raised an md5 file mismatch at the end and restarted. I will try to download other files from the same dataset; I hope I can get some of the rest. |
pyEGA3 v5.0.2 was released a couple of days ago, so it might be worth upgrading. However, I have also been unable to download almost any files from some datasets for about a month, and upgrading to v5.0.2 did not change that. |
@kauralasoo Thanks for reminding me to upgrade. I got my first retry attempt (ChunkedEncodingError) only 15 minutes after switching to v5.0.2. I will wait some time to see its performance; while I was typing this I got a second retry attempt (HTTPError). Good luck to me. |
I have similar errors, with my download reaching a max of 2.3GB and multiple crashes. That is not even 10% of the entire dataset I am trying to download. |
Today I was able to complete the download process after many attempts, but it failed because of an md5 mismatch. I wrote to the helpdesk and I am done trying until I get some updates. |
Hi everyone, To clarify, the newly released version We acknowledge the download service is unstable at the moment and we are working on it. Thanks so much for your patience. Best regards |
@aaclan-ebi I totally understand that v5.0.2 only solves the problem you mentioned, and thanks for your reply. Since slice files bigger or smaller than the chunk size are now removed, does that mean the only remaining cause of an md5 mismatch is that some slice files are missing, and that we could fix it manually by checking the file names (which contain the start and end of each slice) and downloading only the missing parts instead of the whole file again? Or is there anything else that causes an md5 mismatch that I could deal with manually? Will the complete slices be removed as soon as the md5 mismatch occurs, or will it ask me whether I would like to keep the slices? I'm downloading a 43Tb dataset with each file ranging from 200Gb to 400Gb, and I can only successfully download 25-30Gb per day. Given what you said, I guess the download speed and connection stability are not going to improve for quite a long time, so I hope at least I'm not spending all this time only to get an md5 mismatch that I can do nothing about. Thank you. |
If a |
You don't have to do anything manually.
The checksum calculation happens once all complete slices have been downloaded and merged. Yes, if the computed checksum doesn't match the original file's checksum, it will delete all slices right away and restart the download of the whole file, as there is no way to know which part of the file is corrupted.
The pyega3 client will restart an incomplete slice (
I'm afraid I don't have a timeline, but we're actively working on this now. Please reach out to the EGA helpdesk; they might be able to suggest alternatives for accessing the data. There have been issues with how the service is handling the requests, and it seems there has been increased load for the past month after we first released this new version of the download service last December. We've already made some changes, and I believe there should have been slight improvements over the past ~12 hours; we are continuing the work. Regarding the download speed, we'll verify this and figure out where the bottleneck is. Thanks for bringing this to our attention and for the details you gave. We are also looking into scaling the service to handle all connections/requests. Thanks again for your patience! Best regards |
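To make the merge-then-checksum behaviour described above easier to picture, here is a minimal sketch of it in Python. This is not the actual pyega3 code: the `*.slice` glob, the lexicographic sort and the `expected_md5` argument are placeholders for illustration only.

```python
import glob
import hashlib
import os

def merge_and_verify(slice_dir, output_path, expected_md5):
    """Concatenate downloaded slices, then md5 the merged file.

    Sketch only: slice naming/ordering is assumed, and on a mismatch
    there is no way to tell which slice is bad, so everything is
    discarded and the whole file has to be downloaded again.
    """
    # Assumes lexicographic order matches the byte order of the slices.
    slices = sorted(glob.glob(os.path.join(slice_dir, "*.slice")))
    md5 = hashlib.md5()
    with open(output_path, "wb") as out:
        for part in slices:
            with open(part, "rb") as src:
                data = src.read()
            md5.update(data)
            out.write(data)

    if md5.hexdigest() != expected_md5:
        os.remove(output_path)
        for part in slices:
            os.remove(part)
        return False  # caller restarts the download of the whole file
    return True
```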
I'm having similar issues. The download speeds are terribly low and I cannot download the files that belong to the dataset of interest. There seem to be many chunk files created but not a single file was downloaded properly. Below are some of the errors I'm getting. I'm using pyEGA3 client version 5.0.2
and:
|
@aaclan-ebi Thank you for your reply, and I really appreciate your efforts. The speed today has improved a little, although it's still not enough for me. I also got a reply from the helpdesk 5 hours ago to the email I sent on Feb. 5th, and they said they would offer me alternatives. Thank you again for your help. |
Thank you, @felis-silvestris98! Regarding the slow download speed, you mentioned you only get a max speed of 1 MB/s. May I know how many connections you are setting in the pyega3 client when you get this download speed, and how big the file you're currently downloading is? You may not gain any performance benefit by setting too many connections, as the bandwidth of your internet connection will always be a constraint. In line with this, could you try downloading a test file and provide me the pyega3 logs? The command would be:
Many thanks |
@aaclan-ebi I'm using 30 connections as you recommend (if that's too many, maybe you should not recommend "trying with 30 connections initially and adjusting from there to get maximum throughput"), and the file is 219Gb. It finished downloading 10 hours ago but unfortunately got an md5 mismatch error, which means I've wasted another 7 days.
It took 7 days to download this file: for the first 6 days it downloaded 120Gb, while on the last day it got 99Gb, so the speed did improve as you said. Here attached is the test log file with
|
I am sorry about the corrupted 219GB file :( Thanks for doing the test download and providing me these details, this is very helpful. The test download uses 1 connection, and I agree 10-520KB/s is unreasonably slow given the speed of your internet connection. This slowness in the download can cause the Slice errors on the client, because the service times out after a threshold time. We are investigating this. Best regards |
@felis-silvestris98 could you also give me the file accession (i.e. EGAF*) of the 219 GB file? We'll also investigate this. It's probably because of the instability of the download service at that time but we'll verify if something's odd. |
@aaclan-ebi It's EGAF00004693820, and it was downloaded on a CentOS server. I also tried to download EGAF00004693816 (265Gb) on a Windows device at the same time, and it also got an md5 mismatch. The bandwidth of the two devices is the same. |
@aaclan-ebi I have been trying to download more BAM files from the PCAWG consortium listed on ICGC. This does not work with pyega3==5.0.2, but it did last month with pyega3==4.0.5, which has now also stopped working. Has EGA changed something for this to have stopped working, or could there be an issue at this end? The error is: .local/lib/python3.8/site-packages/urllib3/connection.py", line 186, in _new_conn |
I have a similar error in #195, where the md5 sum is always incorrect (with 5.0.2). |
I am also observing these issues (including slice and md5 errors), preventing any download from finishing (even after 1000+ retries).
|
I encountered the same issue. |
I also can't get this to work. I'm connecting from a JANET backbone link about 3 miles from the EBI and I'm getting 1MB/s transfer speeds and timeout errors every few percent through the download. The same download has been running for ages without a single file being retrieved successfully. I've tried varying the chunk size and number of connections to no avail. I just get a ton of:
|
Same issue for me. It is just impossible to download any large file. It either gives me a "Connection reset by peer" or a "too many 503 error responses" error. On the rare occasions that all slices are downloaded (at a very low download speed), I get the inconsistent MD5 error. |
I am also receiving similar errors while downloading two EGA datasets using pyega3 version 5.0.2. I see this issue has been open for 3 months now. Is there any planned fix on pyega3? Thanks for helping out!
|
Unfortunately I am seeing similar issues downloading two separate datasets. I'm currently maxing out at 1Mb/second download speed with any number of connections (ranging from 30 down to 1). Is there an alternative way to download data aside from pyega3 at this point?
|
We have this error too. Can you update the package to support bulk downloads? I think this delay is substantially affecting the entire process... |
We are having the same issue. We've now tried three different machines with different versions of Linux, and three different internet connections and storage systems. My observations (by editing some lines of code in pyEGA):
I checked the first two options by not removing the intermediary files, and merging the data myself.
So it's the transfers that are broken:
So it's either something in the http stream function or something on the server side at EGA. |
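For anyone who wants to repeat this kind of check, below is a minimal sketch of verifying the intermediary slice files before merging them yourself. The -from-&lt;start&gt;-len-&lt;length&gt;.slice naming pattern is an assumption (the thread only says the slice names encode their start and end), so adjust the regex to whatever your client actually writes to disk.

```python
import glob
import os
import re

# Assumed slice-name pattern; the exact format used by the client
# may differ, so adjust the regex as needed.
SLICE_RE = re.compile(r"-from-(\d+)-len-(\d+)\.slice$")

def check_slices(slice_dir, expected_total):
    """Compare each slice's on-disk size with the length in its name,
    and the sum of all slice lengths with the expected file size."""
    total = 0
    for path in sorted(glob.glob(os.path.join(slice_dir, "*.slice"))):
        match = SLICE_RE.search(path)
        if not match:
            print(f"skipping unrecognised name: {path}")
            continue
        length = int(match.group(2))
        on_disk = os.path.getsize(path)
        if on_disk != length:
            print(f"SIZE MISMATCH {path}: name says {length}, disk has {on_disk}")
        total += length
    print(f"slices sum to {total} bytes, expected {expected_total}")
```

As described above, the slices can pass a check like this and the concatenated file can still turn out corrupted, which is why the suspicion falls on the transfer itself rather than on the merging step.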
@harmjanwestra we have seen the same behaviour after performing very similar tests. Manually re-assembling chunks that individually have the correct byte size and that sum to the correct total according to the manifest still frequently produces a corrupted file with an invalid compression format. The error is uniformly distributed: sometimes one can read nearly the whole bam before encountering an error, sometimes it occurs early. Some additional observations: we sorted a set of accessioned files by size and started downloading in ascending order. The logs indicate that the probability of an eventually successful download (md5 match) is inversely correlated with the size. Even small files fail the md5 check fairly frequently but mostly succeed on retry; once the files get over a few GB, however, they almost always have errors, and the chance of success on retry diminishes as they get bigger. The last file I tried (45G) failed in this manner 38 times in a row over two days, and I was never able to download it. Most of the bams in our target dataset of 700 files are over 50G, and some are over 200G, so I don't believe it will be possible to use the current client for this dataset. Looking at the client code, I see this behaviour is consistent with the fact that there appears to be no data validation at the chunk level, only after the whole file is reconstructed, meaning that if random transmission errors hit each chunk with some probability, the chance that every chunk of a large file arrives intact drops off quickly with file size (a simple binomial argument). In any case, chunk-level validation would be good practice given the large size of many bams and the non-zero bit error rate of network transmission. |
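To put rough numbers on that binomial argument: the 5% per-slice corruption rate below is the figure reported later in this thread, and the file sizes are arbitrary examples, so treat the output as an illustration rather than a measurement.

```python
# A file only passes its md5 check when every slice arrives intact,
# so the per-attempt success probability falls off exponentially
# with the number of slices.
SLICE_SIZE = 100 * 1024 ** 2   # default 100 MB slice
P_CORRUPT = 0.05               # assumed per-slice corruption probability

for size_gb in (1, 5, 45, 200):
    n_slices = -(-size_gb * 1024 ** 3 // SLICE_SIZE)  # ceiling division
    p_ok = (1 - P_CORRUPT) ** n_slices
    print(f"{size_gb:>4} GB -> {n_slices:>5} slices, "
          f"P(clean download per attempt) ~ {p_ok:.3g}")
```

Numbers like these match the observation that files over a few GB essentially never pass the md5 check in a single attempt.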
The same issue here. Downloads never finish, and the intermediate check of chunks raises errors. The command I used:
pyega3 -c 20 -cf /Users/marek/Desktop/CREDANTIALS_FILE fetch EGAF00006130794 --output-dir /Users/marek/scATAC/W11_forebrain
[2023-06-28 11:33:47 +0200]
[2023-06-28 11:33:47 +0200] pyEGA3 - EGA python client version 5.0.2 (https://github.com/EGA-archive/ega-download-client)
[2023-06-28 11:33:47 +0200] Parts of this software are derived from pyEGA (https://github.com/blachlylab/pyega) by James Blachly
[2023-06-28 11:33:47 +0200] Python version : 3.11.4
[2023-06-28 11:33:47 +0200] OS version : Darwin Darwin Kernel Version 21.6.0: Mon Aug 22 20:19:52 PDT 2022; root:xnu-8020.140.49~2/RELEASE_ARM64_T6000
[2023-06-28 11:33:47 +0200] MacOS version : 12.6
[2023-06-28 11:33:47 +0200] Server URL: https://ega.ebi.ac.uk:8443/v2
[2023-06-28 11:33:47 +0200] Session-Id: 3368285841
[2023-06-28 11:33:48 +0200]
[2023-06-28 11:33:48 +0200] Authentication success for user 'ma****@****.se'
[2023-06-28 11:33:54 +0200] File Id: 'EGAF00006130794'(13282545173 bytes).
[2023-06-28 11:33:54 +0200] Total space : 926.35 GiB
[2023-06-28 11:33:54 +0200] Used space : 460.96 GiB
[2023-06-28 11:33:54 +0200] Free space : 465.39 GiB
[2023-06-28 11:33:54 +0200] Download starting [using 20 connection(s), file size 13282545157 and chunk length 104857600]...
47%|██████████████████████████████████████████████████████████████████████████████████████████████▋ | 6.29G/13.3G [02:35<02:52, 40.5MB/s]
[2023-06-28 11:36:29 +0200] ("Connection broken: InvalidChunkLength(got length b'', 0 bytes read)", InvalidChunkLength(got length b'', 0 bytes read))
Traceback (most recent call last):
File "/Users/marek/miniconda3/envs/EGA/lib/python3.11/site-packages/urllib3/response.py", line 710, in _error_catcher
yield
File "/Users/marek/miniconda3/envs/EGA/lib/python3.11/site-packages/urllib3/response.py", line 1077, in read_chunked
self._update_chunk_length()
File "/Users/marek/miniconda3/envs/EGA/lib/python3.11/site-packages/urllib3/response.py", line 1012, in _update_chunk_length
raise InvalidChunkLength(self, line) from None
urllib3.exceptions.InvalidChunkLength: InvalidChunkLength(got length b'', 0 bytes read)
The above exception was the direct cause of the following exception:
Traceback (most recent call last):
File "/Users/marek/miniconda3/envs/EGA/lib/python3.11/site-packages/requests/models.py", line 816, in generate
yield from self.raw.stream(chunk_size, decode_content=True)
File "/Users/marek/miniconda3/envs/EGA/lib/python3.11/site-packages/urllib3/response.py", line 937, in stream
yield from self.read_chunked(amt, decode_content=decode_content)
File "/Users/marek/miniconda3/envs/EGA/lib/python3.11/site-packages/urllib3/response.py", line 1065, in read_chunked
with self._error_catcher():
File "/Users/marek/miniconda3/envs/EGA/lib/python3.11/contextlib.py", line 155, in __exit__
self.gen.throw(typ, value, traceback)
File "/Users/marek/miniconda3/envs/EGA/lib/python3.11/site-packages/urllib3/response.py", line 727, in _error_catcher
raise ProtocolError(f"Connection broken: {e!r}", e) from e
urllib3.exceptions.ProtocolError: ("Connection broken: InvalidChunkLength(got length b'', 0 bytes read)", InvalidChunkLength(got length b'', 0 bytes read))
|
What worked for me to bypass this issue is to increase the size of the slice/chunk to be downloaded from the default 100MB to 1GB. I was able to download files of up to 13 GB this way. The parameter is -ms (in bytes). Perhaps tweaking this parameter could help for even larger files as well.
pyega3 -c 20 -ms 1073741824 -cf /PATH/TO/CREDANTIALS_FILE fetch EGAF**** --output-dir /PATH/TO/OUTPUT |
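A rough way to see why a larger slice might help: for the ~13.3 GB file in the log above, raising -ms cuts the number of range requests roughly tenfold, so there are far fewer opportunities to hit the connection resets and InvalidChunkLength errors shown earlier. Whether that reliably avoids corrupted slices is only an observation from this thread; the arithmetic below is just the slice count.

```python
# Slice count for the 13,282,545,157-byte file from the log above,
# at the default 100 MB slice size and at 1 GiB (-ms 1073741824).
FILE_SIZE = 13_282_545_157

for slice_size in (104_857_600, 1_073_741_824):
    n_slices = -(-FILE_SIZE // slice_size)  # ceiling division
    print(f"-ms {slice_size:>10} -> {n_slices:>3} slices")
```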
@mardzix I've tried this using a chunk size larger than the file size, but to no avail. Edit/update: nope, still doesn't work properly. |
I'm trying to download dataset EGAD00001006237 and can't get a single sample. I constantly get errors like this:
|
what @mardzix suggested worked for me. With |
This is working like a charm! I had given up on downloading the data. |
Hi everyone, I encountered a recurring md5sum error and modified the client to work around it.
The logic
I found that ~5% of the 100MB chunks are corrupted during the download process. This is not an optimal solution, but it gets the job done. While this modification does double the download time, it effectively eliminates md5sum errors. I've used this approach to successfully download large files (50-100Gb). See the modified data_file.py in my fork.
How to use it
The elegant way: clone the fork and install from source.
The quick and dirty way: download only the modified data_file.py and replace it in your existing installation:
# go to pyega3 installation directory
$ cd `python3 -c 'import pyega3; print(pyega3.__file__[:-11])'`
$ cd ./libs/
# replace data_file.py with modified version
$ wget https://github.com/nloyfer/ega-download-client/blob/master/pyega3/libs/data_file.py -O data_file.py |
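Judging only from the description above (doubled download time, md5 errors gone), the modification appears to validate each slice as it is downloaded, for instance by fetching the same byte range twice and accepting it only when both copies hash identically. The sketch below illustrates that idea; it is not the code in the linked fork, and download_slice is a hypothetical callable standing in for whatever the client uses to fetch one range.

```python
import hashlib

def fetch_verified(download_slice, start, length, max_attempts=5):
    """Fetch a byte range twice and accept it only when both copies match.

    `download_slice(start, length)` is a hypothetical callable returning
    the slice's bytes; it is not a real pyega3 function.
    """
    for _ in range(max_attempts):
        first = download_slice(start, length)
        second = download_slice(start, length)
        if hashlib.md5(first).digest() == hashlib.md5(second).digest():
            return first
    raise RuntimeError(f"slice at offset {start} failed verification "
                       f"{max_attempts} times")
```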
Very nice! I think this is the best possible workaround with the current API. I was able to download a 250G file successfully when previously the limit was well under 100G. For the record, there were 45 occurrences of Now, if EGA could update the |
Hi guys, I am new to bioinformatics. I ran the code on our cluster as recommended by @nloyfer. However, when I tried to download, I got this error. Any suggestions as to what's going on?
|
@jaflo94 I had the same error; wget was not actually downloading the raw Python file. Try:
@nloyfer Thank you for sharing your solution regarding the md5 errors. I can confirm we have been having many challenges with the download API due to connection issues. The dev team worked on optimising the API, and on Friday last week we redeployed the download service. According to our logs users can download files now; however, md5 errors can still occur. I have scheduled a pyega3 client update release for mid-October and will be discussing your solution with the dev team. |
Got the same problems. If I download multiple files (~100G each) at the same time, all files end up with mismatched md5 values, fail, and start downloading again... The success rate increases when I only download one file at a time, but even then I sometimes get a mismatched md5 value. Hope the new release can resolve this issue. |
I have a similar problem. |
I am also having the same issue, with too many 503 error responses.
I suggest trying EGA's new Live Distribution service: https://ega-archive.org/access/download/files/live-outbox/
Any news here? I am trying to re-download a data set with the same script, but now I always get |
Same issue. Any updates? |
Hi, their live outbox works really well. I can no longer use PYEGA3 as the ports are blocked for us somehow, but this is really good: |
Mar. 12th 2023
Speed has improved so that I can now finish a 301G file within 2 days, only to get an md5 mismatch error; this has happened 3 times in the past week.
pyEGA3 works fine when downloading files smaller than 100Mb, but not huge ones.
(I've been given an alternative download option by the helpdesk, and I'm only still trying pyEGA3 because it would be more convenient than that alternative if the md5 mismatch problem were solved.)
Log file for v5.0.2 on Feb. 26th 2023
Most of the issues are the same as v5.0.1. The speed isn't improving. And it seems that the Error "Retrying (Retry(total=19, connect=False, read=9, redirect=None, status=10)) after connection broken by 'ProtocolError('Connection aborted.', ConnectionResetError(104, 'Connection reset by peer'))'" occurs even more frequently than it did with v5.0.1.
Here is a fresh log file with v5.0.2.
pyega3_5.0.2_output.log
1. Low speed
I downloaded data with this command at the beginning:
pyega3 -cf credential_file.json fetch EGAF0000*******
It was too slow, so I added -c 5 or -c 30. Still, the download speed rarely reaches 1Mb/s, and when it doesn't time out (as it often does), the speed is around 100kb/s, which is totally unacceptable considering the data size of 43T: it would take 521.8 days to download the dataset even at 1Mb/s. But I guess 1Mb/s is already pretty good performance with pyEGA3-5.0.1. Also, the connection count never reaches 30 as set, usually only 2 to 4, sometimes 10, and I didn't see much difference between -c 5 and -c 30.
The speed of downloading from a remote server with wget or rsync is around 7Mb/s for me, so I think it's not a problem on my side but an inherent defect of pyEGA3.
2. Unstable connection
Besides the frustrating speed, I notice that the connection always gets "reset by peer" before a chunk is successfully downloaded, and then the chunk simply disappears instead of continuing to grow. This means nearly half of what I considered successfully downloaded was not downloaded at all, so the actual speed is even lower than what the log file shows. So I added -ms 20971520 or -ms 10485760 to reduce the chunk size from the default 100Mb to 20Mb or 10Mb. It's not completely useless, but I still get hundreds of errors reporting "Retrying (Retry(total=19, connect=False, read=9, redirect=None, status=10)) after connection broken by 'ProtocolError('Connection aborted.', ConnectionResetError(104, 'Connection reset by peer'))'", with almost nothing downloaded in hours. Then a Slice Error or HTTPSConnectionPool(host='ega.ebi.ac.uk', port=8443): Read timed out occurs, so all I can do is set -M 1000 to keep it retrying while I'm sleeping.
I'm really trying hard to optimize my experience with pyEGA3-5.0.1, but I have to say that, after trying for nearly a month without a single file downloaded, if EGA doesn't provide some other way to download, I will never be able to get the dataset I need.
Thank you for helping.
Used versions
To Reproduce
I've tried a lot of settings:
Log File
Here attached is part of the log file I've got.
pyega3_output.log