Multiple download issues with the latest pyEGA3-5.0.1 (now 5.0.2) which makes it almost impossible to download data successfully #192

Open
felis-silvestris98 opened this issue Feb 23, 2023 · 46 comments

@felis-silvestris98

felis-silvestris98 commented Feb 23, 2023

Mar. 12th 2023
Speed has improved: I can now finish a 301G file within 2 days, only to get an md5 mismatch error, and this has happened 3 times in the past week.
pyEGA3 works well for files smaller than 100Mb, but not for huge ones.
(I've been offered an alternative download option by the helpdesk; I'm still trying pyEGA3 because it would be more convenient than the alternative if the md5 mismatch problem were solved.)


Log file for v5.0.2 on Feb. 26th 2023
Most of the issues are the same as v5.0.1. The speed isn't improving. And it seems that the Error "Retrying (Retry(total=19, connect=False, read=9, redirect=None, status=10)) after connection broken by 'ProtocolError('Connection aborted.', ConnectionResetError(104, 'Connection reset by peer'))'" occurs even more frequently than it did with v5.0.1.
Here is a fresh log file with v5.0.2.
pyega3_5.0.2_output.log


1. Low speed
At the beginning I downloaded data with: pyega3 -cf credential_file.json fetch EGAF0000*******
It was too slow, so I added -c 5 or -c 30.

Still, the download speed rarely reaches 1Mb/s, and when it does not time out (as it often does) the speed is around 100kb/s. That is totally unacceptable for a data size of 43T: it would take 521.8 days to download the dataset even at 1Mb/s, and I guess 1Mb/s is already pretty good performance for pyEGA3-5.0.1. The number of connections also never reaches the 30 I set; it is usually only 2 to 4, sometimes 10. I didn't see much difference between -c 5 and -c 30.
Downloading from a remote server with wget or rsync runs at around 7Mb/s for me, so I think the problem is not on my side but a defect in pyEGA3.
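
As a rough check of the 521.8-day estimate, here is a minimal sketch of the arithmetic. It assumes "43T" means 43 TiB and "1Mb/s" means 1 MiB per second (binary units, bytes rather than bits); the function name is only illustrative.

```python
# Assumption: "43T" is 43 TiB and "1Mb/s" is 1 MiB per second.
TIB = 1024 ** 4
MIB = 1024 ** 2
KIB = 1024
SECONDS_PER_DAY = 24 * 60 * 60


def days_to_download(size_bytes: float, bytes_per_second: float) -> float:
    """Return the transfer time in days at a constant rate."""
    return size_bytes / bytes_per_second / SECONDS_PER_DAY


dataset = 43 * TIB
print(f"{days_to_download(dataset, 1 * MIB):.2f} days at 1 MiB/s")
print(f"{days_to_download(dataset, 100 * KIB):.0f} days at 100 KiB/s")
```

Under these assumptions the first figure comes out at about 521.86 days, in line with the estimate quoted above; at 100 KiB/s the time is roughly ten times longer.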

2. Unstable connection
Besides the frustrating speed, I notice that the connection is always getting "reset by peer" before a chunk is successfully downloaded, and then the chunk simply disappears instead of continuing to grow. This means nearly half of what I considered successfully downloaded was not downloaded at all, so the real speed is even lower than the log file shows. So I added -ms 20971520 or -ms 10485760 to reduce the chunk size from the default 100Mb to 20Mb or 10Mb. This is not completely useless, but I still get hundreds of errors reporting "Retrying (Retry(total=19, connect=False, read=9, redirect=None, status=10)) after connection broken by 'ProtocolError('Connection aborted.', ConnectionResetError(104, 'Connection reset by peer'))'", with almost nothing downloaded in hours. Then a Slice Error or "HTTPSConnectionPool(host='ega.ebi.ac.uk', port=8443): Read timed out" occurs, so all I can do is set -M 1000 to keep it retrying while I'm sleeping.
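
The -ms values above are byte counts. The following minimal sketch shows how they relate to sizes in MiB and how the slice size changes the number of slices per file. The helper names and the 1 GiB example file are hypothetical and not part of pyEGA3.

```python
MIB = 1024 ** 2


def mib_to_bytes(mib: int) -> int:
    """Convert a size in MiB to the byte count passed to -ms."""
    return mib * MIB


def slice_count(file_size: int, slice_size: int) -> int:
    """Number of slices needed, rounding up for a partial last slice."""
    return -(-file_size // slice_size)


example_file = 1024 ** 3  # hypothetical 1 GiB file
for mib in (100, 20, 10):
    size = mib_to_bytes(mib)
    print(f"{mib} MiB = {size} bytes, {slice_count(example_file, size)} slices")
```

Smaller slices mean less data is lost when a single connection is reset, at the cost of many more requests per file.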

I'm really trying hard to optimize my experience with pyEGA3-5.0.1, but I have to say that, after trying for nearly a month without a single file downloaded, if EGA doesn't provide some other way to download, I will never be able to get the dataset I need.

Thank you for helping.

Used versions

  • Operating System version: Windows 10.0.14393 and CentOS 7.9.2009
  • Python version: 3.9.13
  • PyEGA3 version: 5.0.1 (conda installed)

To Reproduce
I've tried a lot of settings:

pyega3 -cf credential_file.json fetch EGAF0000*******
pyega3 -cf credential_file.json -c 5 fetch EGAF0000*******
pyega3 -cf credential_file.json -c 30 fetch EGAF0000*******
pyega3 -cf credential_file.json -c 30 -ms 20971520 fetch EGAF0000*******
pyega3 -cf credential_file.json -c 30 -ms 10485760 fetch EGAF0000*******
pyega3 -cf credential_file.json -c 30 -ms 10485760 fetch EGAF0000******* -M 1000
pyega3 -cf credential_file.json -c 30 -ms 10485760 fetch EGAF0000******* -M 1000 -W 30

Log File
Here attached is part of the log file I've got.
pyega3_output.log

@felis-silvestris98 felis-silvestris98 added the bug Something isn't working label Feb 23, 2023
@emretaylanduman

These are exactly the errors I am getting too. I am also using the newest version with -c 5/10/20, and my download speed only reaches around 200mb/s. My data size is around 4gb and the download still has not finished successfully after 4 days. It did complete once, but it raised an md5 mismatch at the end and restarted.

I will try to download other files from the same dataset; I hope I can get at least some of the rest.

@felis-silvestris98
Author

felis-silvestris98 commented Feb 25, 2023

Nothing was downloaded from 14:00 to 16:47, while 2 tmp files grew to the chunk size of 20480kb.
One of the tmp files just disappeared.
The other also disappeared.
At the same time, the log file was showing a speed of around 200kb/s while the process stalled at 35%.

Obviously pyEGA3 is not suited to downloading big datasets. If 20Mb is already a challenge, how can it get me a single file of 265Gb?

@kauralasoo

pyEGA3 v5.0.2 was released a couple of days ago, so it might be worth upgrading. However, I have also been unable to download almost any files from some datasets for about a month, and upgrading to v5.0.2 did not change that.

@felis-silvestris98
Author

felis-silvestris98 commented Feb 26, 2023

@kauralasoo Thanks for reminding me to upgrade. I got my first retry attempt, for a ChunkedEncodingError, only 15 minutes after switching to v5.0.2. I will wait for some time to see its performance; while I was typing this I got the second retry attempt, for an HTTPError. Good luck to me.

@felis-silvestris98 felis-silvestris98 changed the title from "Multiple download issues with the latest pyEGA3-5.0.1 which makes it almost impossible to download data successfully" to "Multiple download issues with the latest pyEGA3-5.0.1 (now 5.0.2) which makes it almost impossible to download data successfully" on Feb 26, 2023
@suhartobanerjee

suhartobanerjee commented Feb 27, 2023

I have similar errors, with my download reaching at most 2.3GB and multiple crashes. That is not even 10% of the entire dataset I am trying to download.

@emretaylanduman

Today I was able to complete the download process after many attempts, but it failed because of an md5 mismatch. I wrote to the help desk and I am done trying until I get some updates.

@aaclan-ebi
Collaborator

Hi everyone,

To clarify, the newly released version 5.0.2 only solves the problem where a slice file grows bigger than the configured chunk size, which eventually causes a SliceError exception. It was reported in #187. @felis-silvestris98 It also removes the .slice.tmp files if the download is interrupted. We chose that behaviour to reduce the md5 mismatch issue, which can happen at the end.

We acknowledge the download service is unstable at the moment and we are working on it. Thanks so much for your patience.

Best regards
Alegria

@felis-silvestris98
Author

felis-silvestris98 commented Feb 28, 2023

@aaclan-ebi I totally understand that v5.0.2 only solves the problem you mentioned, and thanks for your reply. Since slice files bigger or smaller than the chunk size are now removed, does that mean the only remaining cause of md5 mismatch is that some of the slice files are missing, and that we can solve this problem manually by checking the file names which contain the start and the end of the files and then download only the missing part instead of the total file again? Or is there anything else that causes md5 mismatch that I can deal with manually? Will the complete slices be removed as soon as the md5 mismatch occurs, or will it ask me whether I would like to keep the slices?

I'm downloading a dataset of 43Tb with each file ranging from 200Gb to 400Gb, and can only successfully download 25-30Gb every day. Considering what you said, I guess the download speed and connection stability are not going to improve for quite a long time, so I hope at least I'm not wasting all this time only to get an md5 mismatch which I can do nothing about.
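
To make the idea of checking file names concrete, here is a minimal sketch. It assumes slice files are named `<file id>-from-<offset>-len-<length>.slice`, the pattern that appears in the tracebacks posted in this thread, and that the last slice is named with its actual remaining length; both are assumptions, and `missing_ranges` is a hypothetical helper, not a pyEGA3 function. It only detects absent slices; it cannot tell whether a slice that is present is corrupted.

```python
import re

# Assumed naming pattern, inferred from log output in this thread.
SLICE_RE = re.compile(r"-from-(\d+)-len-(\d+)\.slice$")


def missing_ranges(names, file_size, slice_size):
    """Return (offset, length) pairs that have no complete .slice file."""
    have = set()
    for name in names:
        match = SLICE_RE.search(name)
        if match:  # .slice.tmp files do not match, so they count as missing
            have.add((int(match.group(1)), int(match.group(2))))
    expected = [
        (start, min(slice_size, file_size - start))
        for start in range(0, file_size, slice_size)
    ]
    return [r for r in expected if r not in have]


# Hypothetical 250-byte file split into 100-byte slices.
names = [
    "EGAF0-from-0-len-100.slice",
    "EGAF0-from-100-len-100.slice.tmp",
    "EGAF0-from-200-len-50.slice",
]
print(missing_ranges(names, file_size=250, slice_size=100))  # [(100, 100)]
```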

Thank you.

@felis-silvestris98
Author

If a filename.slice.tmp becomes a filename.slice and the filename.slice is of the chunk size, it is a complete slice without any possibility of causing md5 mismatch, for both v5.0.1 and v5.0.2, right?

@aaclan-ebi
Collaborator

aaclan-ebi commented Feb 28, 2023

Hi @felis-silvestris98

> we can solve this problem manually by checking the file names which contain the start and the end of the files and then download only the missing part instead of the total file again. Or is there anything else to cause an md5 mismatch that I can deal with manually?

You don't have to do anything manually.

> Will the complete slices be removed as soon as the md5 mismatch occurs, or will it ask me if I would like to hold the slices?

The checksum calculation happens when all complete slices are downloaded and merged. Yes, if the computed checksum doesn't match the original file's checksum it will delete all slices right away and restart the downloading of the whole file as there is no way to know which part of the file is corrupted.
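
The merge-then-verify step described above can be sketched as follows. This is a simplified illustration, not the pyega3 implementation: the function name is hypothetical, and it hashes while merging rather than in a separate pass.

```python
import hashlib
import os
import tempfile


def merge_and_verify(slice_paths, output_path, expected_md5):
    """Concatenate slices in order, then compare the merged file's MD5."""
    md5 = hashlib.md5()
    with open(output_path, "wb") as out:
        for path in slice_paths:
            with open(path, "rb") as part:
                for block in iter(lambda: part.read(1024 * 1024), b""):
                    out.write(block)
                    md5.update(block)
    if md5.hexdigest() != expected_md5:
        # On mismatch nothing says which slice is bad, so discard everything.
        os.remove(output_path)
        for path in slice_paths:
            os.remove(path)
        return False
    return True


with tempfile.TemporaryDirectory() as tmp:
    parts = []
    for i, data in enumerate([b"abc", b"def"]):
        path = os.path.join(tmp, f"part{i}.slice")
        with open(path, "wb") as handle:
            handle.write(data)
        parts.append(path)
    expected = hashlib.md5(b"abcdef").hexdigest()
    print(merge_and_verify(parts, os.path.join(tmp, "out.bin"), expected))
```

A single MD5 over the whole file can only say that something differs, which is why a mismatch leads to a full restart in this sketch.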

> If a filename.slice.tmp becomes a filename.slice and the filename.slice is of the chunk size, it is a complete slice without any possibility of causing an md5 mismatch, for both v5.0.1 and v5.0.2, right?

The pyega3 client will restart an incomplete slice download (filename.slice.tmp files) if the download of that slice is interrupted. A complete slice (filename.slice files) means that the slice file was downloaded without interruption, so the chance of an MD5 mismatch when all these slices are merged should be much smaller. The MD5 mismatch can still happen, I'm afraid; it's not 100% avoidable, in the same way it can happen to any file download. But we're now actively looking into this to avoid the download interruptions which can corrupt the file.

> I'm downloading a dataset of 43Tb with each file ranging from 200Gb to 400Gb and can only successfully download 25-30Gb every day. Considering what you said I guess the download speed or connection stability is not gonna improve in quite a long time, thus I hope at least I'm not wasting all the time only to get an md5 mismatch which I can do nothing about.

I'm afraid I don't have a timeline, but we're actively working on this now. Please reach out to the EGA helpdesk and they might be able to suggest alternatives for accessing the data.

There have been issues with how the service is handling the requests. And it seems there's been an increased load for the past month after we first released this new version of the download service last December. We've already made some changes and I believe there should have been slight improvements over the past ~12 hours and we are continuing the work. Regarding the download speed, we'll verify this and figure out where the bottleneck is. Thanks for bringing this to our attention and for the details you gave. We are also looking into scaling the service to handle all connections/requests. Thanks again for your patience!

Best regards
Alegria

@jakalssj3

jakalssj3 commented Feb 28, 2023

I'm having similar issues. The download speeds are terribly low and I cannot download the files that belong to the dataset of interest. There seem to be many chunk files created but not a single file was downloaded properly. Below are some of the errors I'm getting. I'm using pyEGA3 client version 5.0.2

[2023-02-28 16:09:47 +0100] retry attempt 5
[2023-02-28 16:09:47 +0100] Download starting [using 1 connection(s), file size 12425345230 and chunk length 104857600]...
 96%|███████████████████████████████████████████████████████████  | 11.9G/12.4G [07:17<02:34, 3.14MB/s]
[2023-02-28 16:17:05 +0100] Slice error: received=91180924, requested=104857600, file='./EGA/EGAF00004175901/.tmp_download/EGAF00004175901-from-11744051200-len-104857600.slice.tmp'
Traceback (most recent call last):
  File "/home/my_user/.local/lib/python3.8/site-packages/pyega3/libs/data_file.py", line 323, in download_file_retry
    self.download_file(output_file, num_connections, max_slice_size)
  File "/home/my_user/.local/lib/python3.8/site-packages/pyega3/libs/data_file.py", line 159, in download_file
    for part_file_name in executor.map(self.download_file_slice_, params):
  File "/usr/lib/python3.8/concurrent/futures/_base.py", line 611, in result_iterator
    yield fs.pop().result()
  File "/usr/lib/python3.8/concurrent/futures/_base.py", line 439, in result
    return self.__get_result()
  File "/usr/lib/python3.8/concurrent/futures/_base.py", line 388, in __get_result
    raise self._exception
  File "/usr/lib/python3.8/concurrent/futures/thread.py", line 57, in run
    result = self.fn(*self.args, **self.kwargs)
  File "/home/my_user/.local/lib/python3.8/site-packages/pyega3/libs/data_file.py", line 189, in download_file_slice_
    return self.download_file_slice(*args)
  File "/home/my_user/.local/lib/python3.8/site-packages/pyega3/libs/data_file.py", line 231, in download_file_slice
    raise Exception(f"Slice error: received={total_received}, requested={length}, file='{file_name}'")
Exception: Slice error: received=91180924, requested=104857600, file='/home/my_user/Pulpit/EGA/EGAF00004175901/.tmp_download/EGAF00004175901-from-11744051200-len-104857600.slice.tmp'

and:

ConnectionResetError: [Errno 104] Connection reset by peer
During handling of the above exception, another exception occurred:
Traceback (most recent call last):
  File "/home/my_user/.local/lib/python3.8/site-packages/requests/models.py", line 758, in generate
    for chunk in self.raw.stream(chunk_size, decode_content=True):
  File "/home/my_user/.local/lib/python3.8/site-packages/urllib3/response.py", line 576, in stream
    data = self.read(amt=amt, decode_content=decode_content)
  File "/home/my_user/.local/lib/python3.8/site-packages/urllib3/response.py", line 541, in read
    raise IncompleteRead(self._fp_bytes_read, self.length_remaining)
  File "/usr/lib/python3.8/contextlib.py", line 131, in __exit__
    self.gen.throw(type, value, traceback)
  File "/home/my_user/.local/lib/python3.8/site-packages/urllib3/response.py", line 455, in _error_catcher
    raise ProtocolError("Connection broken: %r" % e, e)
urllib3.exceptions.ProtocolError: ("Connection broken: ConnectionResetError(104, 'Connection reset by peer')", ConnectionResetError(104, 'Connection reset by peer'))'

@felis-silvestris98
Author

@aaclan-ebi Thank you for your reply and I really appreciate your efforts. The speed today does improve a little although it’s still not enough for me.

I also got a reply from the helpdesk 5 hours ago to the email I sent on Feb. 5th, and they said they would offer alternatives for me.

Thank you again for your help.

@aaclan-ebi
Collaborator

Thank you, @felis-silvestris98 !

Regarding the slow download speed, you mentioned you only get a max speed of 1 MB/s. May I know how many connections you are setting in the pyega3 client when you get this download speed, and how big the file you're currently downloading is? You may not gain any performance benefit if you set too many connections, as the bandwidth of your internet connection will always be a constraint.

In line with this, may I ask you to try downloading a test file and provide me with the pyega3 logs? The command would be: pyega3 -t -d fetch EGAF00005001623. If you could also let me know your internet connection speed (through https://www.speedtest.net/ or Google speedtest etc.), that would be helpful for us to understand and investigate the issue.

Many thanks
Alegria

@felis-silvestris98
Author

felis-silvestris98 commented Mar 1, 2023

@aaclan-ebi I'm using 30 connections as you recommend (if that's too many, maybe you should not recommend "trying with 30 connections initially and adjusting from there to get maximum throughput") and the file is 219Gb. It finished downloading 10 hours ago but unfortunately got an md5 mismatch error, which means I've wasted another 7 days.

Traceback (most recent call last):
  File "/public/home/xieruoqi/anaconda3/lib/python3.9/site-packages/pyega3/libs/data_file.py", line 323, in download_file_retry
    self.download_file(output_file, num_connections, max_slice_size)
  File "/public/home/xieruoqi/anaconda3/lib/python3.9/site-packages/pyega3/libs/data_file.py", line 186, in download_file
    raise Exception(f"Download process expected md5 value '{check_sum}' but got '{received_file_md5}'")
Exception: Download process expected md5 value '8730f8644e4633875961ce591c3f8974' but got 'ec6ade5d4a7e12bc7538018f82aebe1f'

It took 7 days to download this file and for the first 6 days it downloaded 120Gb while for the last day it got 99Gb, so the speed did improve as you said.

Attached is the test log file from pyega3 -t -d fetch EGAF00005001623 (the file contains what was shown on my screen, with additional information that the auto-generated log file lacks). The speed was around 10-520KB/s as shown in the log file, while the speedtest results are below. An md5 mismatch also occurred during the test.

[root@gpunode1 ~]# speedtest
Retrieving speedtest.net configuration...
Testing from **************...
Retrieving speedtest.net server list...
Selecting best server based on ping...
Hosted by ************** [1050.59 km]: 32.737 ms
Testing download speed................................................................................
Download: 37.21 Mbit/s
Testing upload speed......................................................................................................
Upload: 16.10 Mbit/s
[root@gpunode1 ~]# speedtest
Retrieving speedtest.net configuration...
Testing from **************...
Retrieving speedtest.net server list...
Selecting best server based on ping...
Hosted by ************** [1050.59 km]: 30.233 ms
Testing download speed................................................................................
Download: 36.94 Mbit/s
Testing upload speed......................................................................................................
Upload: 15.78 Mbit/s
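
For comparison, the speedtest figure is in megabits per second while the pyega3 log reports kilobytes per second. A minimal sketch of the conversion, assuming decimal units (1 Mbit = 1,000,000 bits, 1 KB = 1,000 bytes); the function name is only illustrative:

```python
def mbit_to_kbyte_per_s(mbit_per_s: float) -> float:
    """Convert megabits per second to kilobytes per second (decimal units)."""
    return mbit_per_s * 1_000_000 / 8 / 1_000


link = mbit_to_kbyte_per_s(37.21)  # first speedtest download result
observed_max = 520                 # best speed seen in the pyega3 test log, KB/s
print(f"link capacity ~{link:.0f} KB/s, best observed {observed_max} KB/s "
      f"({observed_max / link:.0%} of capacity)")
```

Under these assumptions the link can carry roughly 4651 KB/s, so even the best observed pyega3 speed is only about a ninth of what the connection allows.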

@aaclan-ebi
Collaborator

aaclan-ebi commented Mar 1, 2023

Hi @felis-silvestris98

I am sorry about the corrupted 219GB file :(

Thanks for doing the test download and providing these details; this is very helpful. The test download uses 1 connection, and I agree 10-520KB/s is unreasonably slow given the speed of your internet connection. This slowness can cause the Slice errors on the client, because the service times out after a threshold time. We are investigating this.

Best regards
Alegria

@aaclan-ebi
Collaborator

aaclan-ebi commented Mar 1, 2023

@felis-silvestris98 could you also give me the file accession (i.e. EGAF*) of the 219 GB file? We'll also investigate this. It's probably because of the instability of the download service at that time but we'll verify if something's odd.

@felis-silvestris98
Author

@aaclan-ebi It's EGAF00004693820 and it was downloaded on a CentOS server. I also tried to download EGAF00004693816 (265Gb) on a Windows device at the same time, and it also got an md5 mismatch. The bandwidth of the two devices is the same.

@Mousiekin

@aaclan-ebi I have been trying to download more BAM files from the PCAWG consortium listed on ICGC. This does not work with pyega3==5.0.2 but did last month with pyega3==4.0.5, which has now also stopped working. Has EGA changed something to cause this, or is the issue at this end? The error is:
.local/lib/python3.8/site-packages/urllib3/connection.py", line 186, in _new_conn
raise NewConnectionError(
urllib3.exceptions.NewConnectionError: <urllib3.connection.HTTPSConnection object at 0x7fedaec213d0>: Failed to establish a new connection: [Errno 111] Connection refused
With the latest version, pyega3 downloads but the md5 sums never match. Thanks,
Marian

@b-lac

b-lac commented Mar 27, 2023

I have a similar error in #195, where the md5 sum is always incorrect (with 5.0.2).

@NikdAK

NikdAK commented Mar 28, 2023

I am also observing these issues (including slice and md5 errors), preventing any download to finish (even after 1000+ retries).

[2023-03-28 08:54:36 +0200] retry attempt 1187
[2023-03-28 08:54:36 +0200] Download starting [using 20 connection(s), file size 121929894722 and chunk length 104857600]...
100%|████████████████████████████████████████████████████████████▉| 122G/122G [00:20<00:00, 31.0GB/s]
[2023-03-28 08:55:47 +0200] Slice error: received=0, requested=104857600, file='/downloads/ega_download/EGAF00002251690/.tmp_download/EGAF00002251690-from-116496793600-len-104857600.slice.tmp'
Traceback (most recent call last):
  File "/dss/dsshome1/lxc0F/ga52niw2/.local/lib/python3.8/site-packages/pyega3/libs/data_file.py", line 323, in download_file_retry
    self.download_file(output_file, num_connections, max_slice_size)
  File "/dss/dsshome1/lxc0F/ga52niw2/.local/lib/python3.8/site-packages/pyega3/libs/data_file.py", line 159, in download_file
    for part_file_name in executor.map(self.download_file_slice_, params):
  File "/dss/dsshome1/lrz/sys/spack/release/22.2.1/views/python/._3.8.11-base/x5cahiaomwvt774jvzj7usksxucem5iz/lib/python3.8/concurrent/futures/_base.py", line 619, in result_iterator
    yield fs.pop().result()
  File "/dss/dsshome1/lrz/sys/spack/release/22.2.1/views/python/._3.8.11-base/x5cahiaomwvt774jvzj7usksxucem5iz/lib/python3.8/concurrent/futures/_base.py", line 444, in result
    return self.__get_result()
  File "/dss/dsshome1/lrz/sys/spack/release/22.2.1/views/python/._3.8.11-base/x5cahiaomwvt774jvzj7usksxucem5iz/lib/python3.8/concurrent/futures/_base.py", line 389, in __get_result
    raise self._exception
  File "/dss/dsshome1/lrz/sys/spack/release/22.2.1/views/python/._3.8.11-base/x5cahiaomwvt774jvzj7usksxucem5iz/lib/python3.8/concurrent/futures/thread.py", line 57, in run
    result = self.fn(*self.args, **self.kwargs)
  File "/dss/dsshome1/lxc0F/ga52niw2/.local/lib/python3.8/site-packages/pyega3/libs/data_file.py", line 189, in download_file_slice_
    return self.download_file_slice(*args)
  File "/dss/dsshome1/lxc0F/ga52niw2/.local/lib/python3.8/site-packages/pyega3/libs/data_file.py", line 231, in download_file_slice
    raise Exception(f"Slice error: received={total_received}, requested={length}, file='{file_name}'")
Exception: Slice error: received=0, requested=104857600, file='/downloads/ega_download/EGAF00002251690/.tmp_download/EGAF00002251690-from-116496793600-len-104857600.slice.tmp'
[2023-03-28 08:56:47 +0200] retry attempt 1188
[2023-03-28 08:56:47 +0200] Download starting [using 20 connection(s), file size 121929894722 and chunk length 104857600]...
100%|█████████████████████████████████████████████████████████████| 122G/122G [00:37<00:00, 3.29GB/s]
[2023-03-28 08:57:25 +0200] Combining file chunks (this operation can take a long time depending on the file size)
100%|██████████████████████████████████████████████████████████████| 122G/122G [03:44<00:00, 543MB/s]
[2023-03-28 09:01:09 +0200] Calculating md5 (this operation can take a long time depending on the file size)
100%|██████████████████████████████████████████████████████████████| 122G/122G [04:33<00:00, 446MB/s]
[2023-03-28 09:05:43 +0200] Verifying file checksum
[2023-03-28 09:05:43 +0200] Download process expected md5 value 'b0a4b1eb4f434a79d6895bff3b7dcc4c' but got 'b19478827403dc23833b19c5fb7ab06d'
Traceback (most recent call last):
  File "/dss/dsshome1/lxc0F/ga52niw2/.local/lib/python3.8/site-packages/pyega3/libs/data_file.py", line 323, in download_file_retry
    self.download_file(output_file, num_connections, max_slice_size)
  File "/dss/dsshome1/lxc0F/ga52niw2/.local/lib/python3.8/site-packages/pyega3/libs/data_file.py", line 186, in download_file
    raise Exception(f"Download process expected md5 value '{check_sum}' but got '{received_file_md5}'")
Exception: Download process expected md5 value 'b0a4b1eb4f434a79d6895bff3b7dcc4c' but got 'b19478827403dc23833b19c5fb7ab06d'

@Honchkrow

I encountered the same problem.
I need to download 500 files totalling more than 15T, and the md5 check has failed more than 20 times for a single file.

@s-andrews

I also can't get this to work. I'm connecting from a JANET backbone link about 3 miles from the EBI and I'm getting 1MB/s transfer speeds and timeout errors every few percent through the download. The same download has been running for ages without a single file being retrieved successfully. I've tried varying the chunk size and number of connections to no avail. I just get a ton of:

requests.exceptions.ConnectionError: HTTPSConnectionPool(host='ega.ebi.ac.uk', port=8443): Read timed out.
[2023-04-25 10:30:20 +0100] File Id: 'EGAF00002139579'(3188379085 bytes).
[2023-04-25 10:30:20 +0100] Total space : 448811.38 GiB
[2023-04-25 10:30:20 +0100] Used space : 340829.57 GiB
[2023-04-25 10:30:20 +0100] Free space : 92260.21 GiB
[2023-04-25 10:30:20 +0100] Download starting [using 1 connection(s), file size 3188379069 and chunk length 104857600]...
  0%|                                                                                                                                                                                 | 0.00/3.19G [00:00<?, ?B/s]
[2023-04-25 10:32:48 +0100] Slice error: received=0, requested=104857600, file='/EGAF00002139579/.tmp_download/EGAF00002139579-from-0-len-104857600.slice.tmp'
Traceback (most recent call last):
  File "/bi/apps/python/3.9.7/lib/python3.9/site-packages/pyega3/libs/data_file.py", line 323, in download_file_retry
    self.download_file(output_file, num_connections, max_slice_size)
  File "/bi/apps/python/3.9.7/lib/python3.9/site-packages/pyega3/libs/data_file.py", line 159, in download_file
    for part_file_name in executor.map(self.download_file_slice_, params):
  File "/bi/apps/python/3.9.7/lib/python3.9/concurrent/futures/_base.py", line 608, in result_iterator
    yield fs.pop().result()
  File "/bi/apps/python/3.9.7/lib/python3.9/concurrent/futures/_base.py", line 445, in result
    return self.__get_result()
  File "/bi/apps/python/3.9.7/lib/python3.9/concurrent/futures/_base.py", line 390, in __get_result
    raise self._exception
  File "/bi/apps/python/3.9.7/lib/python3.9/concurrent/futures/thread.py", line 52, in run
    result = self.fn(*self.args, **self.kwargs)
  File "/bi/apps/python/3.9.7/lib/python3.9/site-packages/pyega3/libs/data_file.py", line 189, in download_file_slice_
    return self.download_file_slice(*args)
  File "/bi/apps/python/3.9.7/lib/python3.9/site-packages/pyega3/libs/data_file.py", line 231, in download_file_slice
    raise Exception(f"Slice error: received={total_received}, requested={length}, file='{file_name}'")

@plbaldoni

Same issue for me. It is just impossible to download any large file. I get either a "Connection reset by peer" or a "too many 503 error responses" error. On the rare occasions that all slices are downloaded (at very low download speed), I get the inconsistent MD5 error.

@hailiangmei

I am also receiving similar errors while downloading two EGA datasets using pyega3 version 5.0.2. I see this issue has been open for 3 months now. Is there any planned fix on pyega3? Thanks for helping out!

[2023-05-24 08:17:07 +0200] Slice error: received=35158868, requested=104857600, file='<workdir>/floor/EGAD00001004857/EGAF00002432022/.tmp_download/EGAF00002432022-from-3670016000-len-104857600.slice.tmp'
Traceback (most recent call last):
  File "<workdir>/.miniconda3/envs/pyega3/lib/python3.11/site-packages/pyega3/libs/data_file.py", line 323, in download_file_retry
    self.download_file(output_file, num_connections, max_slice_size)
  File "<workdir>/.miniconda3/envs/pyega3/lib/python3.11/site-packages/pyega3/libs/data_file.py", line 159, in download_file
    for part_file_name in executor.map(self.download_file_slice_, params):
  File "<workdir>/.miniconda3/envs/pyega3/lib/python3.11/concurrent/futures/_base.py", line 619, in result_iterator
    yield _result_or_cancel(fs.pop())
          ^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "<workdir>/.miniconda3/envs/pyega3/lib/python3.11/concurrent/futures/_base.py", line 317, in _result_or_cancel
    return fut.result(timeout)
           ^^^^^^^^^^^^^^^^^^^
  File "<workdir>/.miniconda3/envs/pyega3/lib/python3.11/concurrent/futures/_base.py", line 456, in result
    return self.__get_result()
           ^^^^^^^^^^^^^^^^^^^
  File "<workdir>/.miniconda3/envs/pyega3/lib/python3.11/concurrent/futures/_base.py", line 401, in __get_result
    raise self._exception
  File "<workdir>/.miniconda3/envs/pyega3/lib/python3.11/concurrent/futures/thread.py", line 58, in run
    result = self.fn(*self.args, **self.kwargs)
             ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "<workdir>/.miniconda3/envs/pyega3/lib/python3.11/site-packages/pyega3/libs/data_file.py", line 189, in download_file_slice_
    return self.download_file_slice(*args)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "<workdir>/.miniconda3/envs/pyega3/lib/python3.11/site-packages/pyega3/libs/data_file.py", line 231, in download_file_slice
    raise Exception(f"Slice error: received={total_received}, requested={length}, file='{file_name}'")
Exception: Slice error: received=35158868, requested=104857600, file='<workdir>/floor/EGAD00001004857/EGAF00002432022/.tmp_download/EGAF00002432022-from-3670016000-len-104857600.slice.tmp'
[2023-05-24 08:18:07 +0200] retry attempt 1
[2023-05-24 08:18:07 +0200] Download starting [using 1 connection(s), file size 5078948449 and chunk length 104857600]...
100%|██████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 5.08G/5.08G [02:50<00:00, 29.8MB/s]
[2023-05-24 08:20:57 +0200] Combining file chunks (this operation can take a long time depending on the file size)
100%|███████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 5.08G/5.08G [00:40<00:00, 126MB/s]
[2023-05-24 08:21:37 +0200] Calculating md5 (this operation can take a long time depending on the file size)
100%|███████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 5.08G/5.08G [00:23<00:00, 219MB/s]
[2023-05-24 08:22:00 +0200] Verifying file checksum
[2023-05-24 08:22:01 +0200] Download process expected md5 value '31be9960a8707ff71609a6cbcbaf4858' but got '9c01d5f1b217c53dc497d7fa313776ba'
Traceback (most recent call last):
  File "<workdir>/.miniconda3/envs/pyega3/lib/python3.11/site-packages/pyega3/libs/data_file.py", line 323, in download_file_retry
    self.download_file(output_file, num_connections, max_slice_size)
  File "<workdir>/.miniconda3/envs/pyega3/lib/python3.11/site-packages/pyega3/libs/data_file.py", line 186, in download_file
    raise Exception(f"Download process expected md5 value '{check_sum}' but got '{received_file_md5}'")
Exception: Download process expected md5 value '31be9960a8707ff71609a6cbcbaf4858' but got '9c01d5f1b217c53dc497d7fa313776ba'
[2023-05-24 08:23:01 +0200] retry attempt 2

@raagagrawal

Unfortunately, I am seeing similar issues downloading two separate datasets. Download speed currently maxes out at 1Mb/second regardless of the number of connections (anywhere from 1 to 30). Is there an alternative way to download data aside from pyega3 at this point?

[2023-05-26 21:05:46 -0700] Slice error: received=0, requested=104857600, file='/hot/data/unregistered/Quigley-Gebo-PRAD-SVMW/Quigley-Agrawal-PRAD-SVMW/EGAF00005776683/.tmp_download/EGAF00005776683-from-7130316800-len-104857600.slice.tmp'
Traceback (most recent call last):
  File "/hot/user/ragrawal/software/micromamba/envs/sra/lib/python3.11/site-packages/pyega3/libs/data_file.py", line 323, in download_file_retry
    self.download_file(output_file, num_connections, max_slice_size)
  File "/hot/user/ragrawal/software/micromamba/envs/sra/lib/python3.11/site-packages/pyega3/libs/data_file.py", line 159, in download_file
    for part_file_name in executor.map(self.download_file_slice_, params):
  File "/hot/user/ragrawal/software/micromamba/envs/sra/lib/python3.11/concurrent/futures/_base.py", line 619, in result_iterator
    yield _result_or_cancel(fs.pop())
          ^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/hot/user/ragrawal/software/micromamba/envs/sra/lib/python3.11/concurrent/futures/_base.py", line 317, in _result_or_cancel
    return fut.result(timeout)
           ^^^^^^^^^^^^^^^^^^^
  File "/hot/user/ragrawal/software/micromamba/envs/sra/lib/python3.11/concurrent/futures/_base.py", line 456, in result
    return self.__get_result()
           ^^^^^^^^^^^^^^^^^^^
  File "/hot/user/ragrawal/software/micromamba/envs/sra/lib/python3.11/concurrent/futures/_base.py", line 401, in __get_result
    raise self._exception
  File "/hot/user/ragrawal/software/micromamba/envs/sra/lib/python3.11/concurrent/futures/thread.py", line 58, in run
    result = self.fn(*self.args, **self.kwargs)
             ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/hot/user/ragrawal/software/micromamba/envs/sra/lib/python3.11/site-packages/pyega3/libs/data_file.py", line 189, in download_file_slice_
    return self.download_file_slice(*args)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/hot/user/ragrawal/software/micromamba/envs/sra/lib/python3.11/site-packages/pyega3/libs/data_file.py", line 231, in download_file_slice
    raise Exception(f"Slice error: received={total_received}, requested={length}, file='{file_name}'")
Exception: Slice error: received=0, requested=104857600, file='/hot/data/unregistered/Quigley-Gebo-PRAD-SVMW/Quigley-Agrawal-PRAD-SVMW/EGAF00005776683/.tmp_download/EGAF00005776683-from-7130316800-len-104857600.slice.tmp'

@joonan30

joonan30 commented Jun 4, 2023

We have this error too. Could you update the package to support bulk download? I think this delay is substantially affecting the entire process...

@harmjanwestra

harmjanwestra commented Jun 7, 2023

We are having the same issue. We've now tried three different machines with different versions of Linux, and three different internet connections and storage systems.

My observations (by editing some lines of code in pyEGA):

  • if the download for a .fa.gz file completes without error, but the md5sum is wrong, the merged .fa.gz file can be read to some extent by zcat, has the correct filesize, but has CRC errors when using gzip -t. This suggests that the final file has been assembled incorrectly from the slices, or that somewhere some (part of) slices have been transferred incorrectly.

I checked the first two options by not removing the intermediary files, and merging the data myself.

  • the code that calculates the md5sum is working correctly - this is not breaking the Gzip file
  • the code that merges the slices is also working correctly - this is also not breaking the Gzip file

So it's the transfers that are broken:

  • using small chunks (1Mb) results in chunking exceptions from the urllib3 library, potentially caused by the multithreading (see https://stackoverflow.com/questions/44509423/python-requests-chunkedencodingerrore-requests-iter-lines)
  • chunking errors don't seem to happen when using larger chunks (e.g. 100Mb), but the data is still corrupt
  • disabling multithreading by replacing the lines at around line 156 in libs/data_file.py with a simple for loop also results in chunking errors (urllib3.exceptions.ProtocolError: ("Connection broken: InvalidChunkLength(got length b'', 0 bytes read)", InvalidChunkLength(got length b'', 0 bytes read))), so the multithreading doesn't seem to be the culprit either. The resulting merged file is also corrupted.
  • using a massive chunk (10G) for a file of 6Gb to force a single slice (using the original code) crashes when ~5Gb has been downloaded, perhaps because of a timeout on the server side

So it's either something in the HTTP stream function or something on the server side of EGA.
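For anyone who wants to repeat the gzip -t check from the first observation without leaving Python, here is a minimal sketch. It only uses the standard library; the function name gzip_is_intact is made up for this example and is not part of pyEGA3.

```python
import gzip
import zlib


def gzip_is_intact(path, block_size=1024 * 1024):
    """Return True if the gzip file at `path` decompresses fully and passes
    its CRC/length check, roughly what `gzip -t` verifies.

    Hypothetical helper for illustration, not part of pyEGA3.
    """
    try:
        with gzip.open(path, "rb") as handle:
            # Read to the end in blocks; the CRC is only checked once the
            # end of the compressed stream is reached.
            while handle.read(block_size):
                pass
        return True
    except (OSError, EOFError, zlib.error):
        # gzip.BadGzipFile (CRC or header problems) is a subclass of OSError,
        # EOFError signals a truncated stream, and zlib.error signals
        # corrupted compressed data.
        return False
```

A file can have the expected size and still fail this check, which matches the behaviour described above.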

@delocalizer

@harmjanwestra we have seen the same behaviour after performing very similar tests. Manually re-assembling individual chunks that individually have the correct byte size and sum to the correct total according to the manifest still frequently produces a corrupted file with invalid compression format. The error is uniformly distributed — sometimes one can read nearly the whole bam before encountering an error, sometimes it occurs early. Some additional observations:

We sorted a set of accessioned files by size and started downloading in ascending order. The logs indicate that the probability of an eventually successful download (md5 match) is inversely correlated with the size. Even small files fail the md5 check fairly frequently but mostly succeed on retry; once the files get over a few GB however they almost always have errors, and the chance of success on retry diminishes as they get bigger. The last file I tried (45G) failed in this manner 38 times in a row over two days and I was never able to download it. Most of the bams in our target dataset of 700 files are over 50G, and some are over 200G, so I don't believe it will be possible to use the current client for this dataset.

Looking at the client code, I see this behaviour is consistent with the fact that there appears to be no data validation at the chunk level, only after the whole file is reconstructed. If random transmission errors are modelled as independent events with probability p per byte, the chance of error-free transmission of a file of n bytes is (1 - p)^n, which diminishes rapidly as n grows. For example, if p=1e-10, the probability of no errors in 1G is 0.9, for 10G it's 0.37, and for 50G it's 0.007. That looks suggestively like what we see in practice for how the probability of a successful download varies with file size.

In any case, chunk-level validation would be good practice given the large size of many bams, and the non-zero bit error rate of network transmission.
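The figures quoted above can be reproduced with a few lines of Python. This is only a sketch of the simple per-byte error model; the value p=1e-10 is the illustrative figure from the comment, not a measured error rate, and 1G is taken as 1e9 bytes.

```python
import math


def p_clean(n_bytes, p_err):
    """Probability that n_bytes arrive with no errors, i.e. (1 - p_err) ** n_bytes.

    Computed via log1p for numerical stability with very small p_err.
    """
    return math.exp(n_bytes * math.log1p(-p_err))


P_ERR = 1e-10  # illustrative per-byte error probability from the comment above

for size_gb in (1, 10, 50):
    print(f"{size_gb:>3}G: {p_clean(size_gb * 1e9, P_ERR):.3f}")
# prints 0.905, 0.368 and 0.007 for 1G, 10G and 50G respectively
```

Under the same model, a single 100 MB chunk is far more likely to arrive intact than a whole multi-gigabyte file, so validating and retrying per chunk would only re-transfer the few chunks that failed.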

@mardzix

mardzix commented Jun 28, 2023

The same issue here. Downloads never finish, and the intermediate check of chunks raises an InvalidChunkLength(got length b'', 0 bytes read) error.

pyega3 -c 20 -cf /Users/marek/Desktop/CREDANTIALS_FILE fetch EGAF00006130794 --output-dir /Users/marek/scATAC/W11_forebrain 
[2023-06-28 11:33:47 +0200] 
[2023-06-28 11:33:47 +0200] pyEGA3 - EGA python client version 5.0.2 (https://github.com/EGA-archive/ega-download-client)
[2023-06-28 11:33:47 +0200] Parts of this software are derived from pyEGA (https://github.com/blachlylab/pyega) by James Blachly
[2023-06-28 11:33:47 +0200] Python version : 3.11.4
[2023-06-28 11:33:47 +0200] OS version : Darwin Darwin Kernel Version 21.6.0: Mon Aug 22 20:19:52 PDT 2022; root:xnu-8020.140.49~2/RELEASE_ARM64_T6000
[2023-06-28 11:33:47 +0200] MacOS version : 12.6
[2023-06-28 11:33:47 +0200] Server URL: https://ega.ebi.ac.uk:8443/v2
[2023-06-28 11:33:47 +0200] Session-Id: 3368285841
[2023-06-28 11:33:48 +0200] 
[2023-06-28 11:33:48 +0200] Authentication success for user 'ma****@****.se'
[2023-06-28 11:33:54 +0200] File Id: 'EGAF00006130794'(13282545173 bytes).
[2023-06-28 11:33:54 +0200] Total space : 926.35 GiB
[2023-06-28 11:33:54 +0200] Used space : 460.96 GiB
[2023-06-28 11:33:54 +0200] Free space : 465.39 GiB
[2023-06-28 11:33:54 +0200] Download starting [using 20 connection(s), file size 13282545157 and chunk length 104857600]...
 47%|██████████████████████████████████████████████████████████████████████████████████████████████▋                                                                                                         | 6.29G/13.3G [02:35<02:52, 40.5MB/s]
[2023-06-28 11:36:29 +0200] ("Connection broken: InvalidChunkLength(got length b'', 0 bytes read)", InvalidChunkLength(got length b'', 0 bytes read))
Traceback (most recent call last):
  File "/Users/marek/miniconda3/envs/EGA/lib/python3.11/site-packages/urllib3/response.py", line 710, in _error_catcher
    yield
  File "/Users/marek/miniconda3/envs/EGA/lib/python3.11/site-packages/urllib3/response.py", line 1077, in read_chunked
    self._update_chunk_length()
  File "/Users/marek/miniconda3/envs/EGA/lib/python3.11/site-packages/urllib3/response.py", line 1012, in _update_chunk_length
    raise InvalidChunkLength(self, line) from None
urllib3.exceptions.InvalidChunkLength: InvalidChunkLength(got length b'', 0 bytes read)

The above exception was the direct cause of the following exception:

Traceback (most recent call last):
  File "/Users/marek/miniconda3/envs/EGA/lib/python3.11/site-packages/requests/models.py", line 816, in generate
    yield from self.raw.stream(chunk_size, decode_content=True)
  File "/Users/marek/miniconda3/envs/EGA/lib/python3.11/site-packages/urllib3/response.py", line 937, in stream
    yield from self.read_chunked(amt, decode_content=decode_content)
  File "/Users/marek/miniconda3/envs/EGA/lib/python3.11/site-packages/urllib3/response.py", line 1065, in read_chunked
    with self._error_catcher():
  File "/Users/marek/miniconda3/envs/EGA/lib/python3.11/contextlib.py", line 155, in __exit__
    self.gen.throw(typ, value, traceback)
  File "/Users/marek/miniconda3/envs/EGA/lib/python3.11/site-packages/urllib3/response.py", line 727, in _error_catcher
    raise ProtocolError(f"Connection broken: {e!r}", e) from e
urllib3.exceptions.ProtocolError: ("Connection broken: InvalidChunkLength(got length b'', 0 bytes read)", InvalidChunkLength(got length b'', 0 bytes read))

@mardzix

mardzix commented Jun 28, 2023

What worked for me to bypass this issue was increasing the size of the slice/chunk to be downloaded from the default 100 MiB (104857600 bytes) to 1 GiB. I was able to download files of up to 13 GB.

The parameter is -ms, and 1024 * 1024 * 1024 = 1073741824 is the new chunk size.

Perhaps tweaking this parameter could help for even larger files as well.

pyega3 -c 20 -ms 1073741824 -cf /PATH/TO/CREDENTIALS_FILE fetch EGAF**** --output-dir /PATH/TO/OUTPUT
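To get a feel for what changing -ms does, the sketch below counts the slices for the 13282545157-byte file from the log above. It assumes slices are consecutive (start, length) byte ranges, which is inferred from the "-from-<offset>-len-<length>" temporary file names in the logs and not taken from the pyEGA3 source. The helper name slice_ranges is hypothetical.

```python
def slice_ranges(file_size, slice_size):
    """Yield (start, length) byte ranges covering a file of file_size bytes.

    Assumed slicing scheme for illustration; not copied from pyEGA3.
    """
    for start in range(0, file_size, slice_size):
        yield start, min(slice_size, file_size - start)


FILE_SIZE = 13282545157           # from the log line "file size 13282545157"
DEFAULT_SLICE = 100 * 1024 ** 2   # 104857600, the default chunk length in the logs
LARGE_SLICE = 1024 ** 3           # 1073741824, the value passed to -ms

default_slices = list(slice_ranges(FILE_SIZE, DEFAULT_SLICE))
large_slices = list(slice_ranges(FILE_SIZE, LARGE_SLICE))

print(len(default_slices), len(large_slices))  # 127 13
```

Fewer slices means fewer separate requests per file, which may be why the larger value helps for some users. Note that with only 13 slices, -c 20 cannot actually use all 20 connections for this file.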

@harmjanwestra

harmjanwestra commented Jun 28, 2023

@mardzix I've tried this using a chunk size larger than the file size, but to no avail.
Nevertheless, I will try this again.

Edit: nope, still doesn't work properly.

@togop

togop commented Jul 12, 2023

I'm trying to download a dataset EGAD00001006237 and can't get a single sample. Constantly getting such errors:

[2023-07-12 07:02:18 +0200] Download process expected md5 value '7c61f7005f060a79a807e4f6650f5feb' but got 'b1324c96f13be2b252a59775a71f3760'
Traceback (most recent call last):
  File ".../conda/envs/pyega/lib/python3.10/site-packages/pyega3/libs/data_file.py", line 323, in download_file_retry

Afterward, the file is deleted and the download is restarted a couple of times, but it always ends with the same error and a different received md5 value, so no data is fetched in the end.
My pyega3 version is 5.0.2, and I started the download with 4 connections (-c 4). I hope this issue can be resolved soon. Is there any workaround or alternative?

@zztin

zztin commented Jul 12, 2023

What @mardzix suggested worked for me. With -c 40 -ms 1073741824, I downloaded ~10 50G files (within 3 automatic retries).

@anilkanthi

@mardzix's suggestion above (increasing the slice size with -ms 1073741824) is working like a charm! I had given up on downloading the data.
Thanks a lot @mardzix!

@nloyfer

nloyfer commented Aug 24, 2023

Hi everyone,

I encountered a recurring md5sum error (Download process expected md5 value '***' but got '***') as well. Couldn't download even a single (large) file.
The proposed solutions (tweaking pyega3's parameters) didn't work for me, so I wrote a workaround to address the issue.

The logic

I found that ~5% of the 100MB chunks are corrupted during the download process.
So I modified a specific file, pyega3/libs/data_file.py (original file from v5.0.2), to ensure that the pyega3 module validates the correctness of each downloaded chunk.
Essentially, I made pyega3 download each chunk twice and verify that the two copies are identical. If they match, the likelihood of errors is greatly reduced. On the other hand, if they differ, the module deletes both copies and retries during the next attempt. So increasing --max-retries is advised.

This is not an optimal solution, but it gets the job done. While this modification does double the download time, it effectively eliminates md5sum errors. I've used this approach to successfully download large files (50-100Gb).
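The idea can be sketched roughly as follows. This is not the actual patch: fetch is a hypothetical callable that returns the bytes for one byte range (in pyega3 that would be an HTTP range request written to a slice file), the name fetch_verified is made up, and the chunk is kept in memory here for brevity.

```python
import hashlib


def fetch_verified(fetch, start, length, max_attempts=5):
    """Download one slice twice and accept it only if both copies agree.

    `fetch(start, length) -> bytes` is a hypothetical stand-in for the real
    slice download; this is an illustration of the approach, not the patch.
    """
    for _ in range(max_attempts):
        first = fetch(start, length)
        second = fetch(start, length)
        same = hashlib.sha256(first).digest() == hashlib.sha256(second).digest()
        if len(first) == length and same:
            return first
        # Copies differ or are short: discard both and try again.
    raise IOError(
        f"slice from={start} len={length} still inconsistent "
        f"after {max_attempts} attempts"
    )
```

Two corrupted copies matching by chance is very unlikely if the corruption is random, which is why agreement between the copies is a reasonable stand-in for a server-side checksum.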

See the modified data_file.py here. Or the diff here.

How to use it

The elegant way:

clone this fork and install from source.

The quick and dirty way:

download only this specific file (data_file.py) and replace the one in your system.
Example:

# go to pyega3 installation directory
$ cd `python3 -c 'import pyega3; print(pyega3.__file__[:-11])'`
$ cd ./libs/
# replace data_file.py with modified version
$ wget https://github.com/nloyfer/ega-download-client/blob/master/pyega3/libs/data_file.py -O data_file.py

@delocalizer

delocalizer commented Aug 26, 2023

@nloyfer Very nice! I think this is the best possible workaround with the current API. I was able to download a 250G file successfully when previously the limit was well under 100G. For the record, the logs contained 45 occurrences of "Slice error: two attempts", and I used the default chunk size, so that's 45 out of 2500, or about 2%, of the chunks having errors.

Now, if EGA could update the files/ byte range request response to include a server-side checksum at the end of the data stream...

@jaflo94

jaflo94 commented Aug 28, 2023

Hi guys, I am new to bioinformatics. I ran the code on our cluster as recommended by @nloyfer. However, when I tried to download, I got this error. Any suggestions as to what's going on?

[...]"title":"ega-download-client/pyega3/libs/data_file.py at master · nloyfer/ega-download-client"}
^^^^^
NameError: name 'false' is not defined. Did you mean: 'False'?

@rbentham

@jaflo94 I had the same error: wget was not actually downloading the raw Python file. Try:
wget https://raw.githubusercontent.com/nloyfer/ega-download-client/master/pyega3/libs/data_file.py -O data_file.py

@CsabaHalmagyi
Contributor

@nloyfer Thank you for sharing your solution regarding the md5 errors. I can confirm we have been having many challenges with the download API due to connection issues. The dev team worked on optimising the API, and on Friday last week we redeployed the download service. According to our logs, users can download files now; however, md5 errors can still occur. I have scheduled a pyega client update release for mid-October and will be discussing your solution with the dev team.

@CsabaHalmagyi CsabaHalmagyi self-assigned this Sep 25, 2023
@chilampoon

I got the same problems. If I download multiple files (~100G each) at the same time, all files get different md5 values, fail, and are downloaded again... The success rate increases when I download only one file at a time, but it still sometimes gets a different md5 value. I hope the new release can resolve this issue.

@CIBRChenLab

I have a similar problem.
I need to download a dataset with multiple bam files (about 30G each, total 1.1T) and used pyega3 -cf *** -d -c 10 -ms 1073741824 fetch EGA* --output-dir ./ --max-retries -1 --retry-wait 120. pyega3 didn't even download a single cell in two days. I thought maybe it was a network issue, so I set up multiple cloud instances in the US/Japan/UK, and the problem persists. All I get is 'too many 503 responses'. Is the EGA server under attack, or are so many people having the same problem that everyone's constant retrying is effectively "attacking" the server?

@Jungal10

Jungal10 commented Jun 25, 2024

I am also seeing the same 'too many 503 responses' error.
In my previous attempt a couple of weeks ago, every file ended up with an md5 mismatch. I am getting a bit desperate here.

@jamiehall007

I suggest trying EGA's new Live Distribution service: https://ega-archive.org/access/download/files/live-outbox/

@gernophil

Any news here? I am trying to re-download a data set with the same script, but now I always get 'ConnectionResetError(104, 'Connection reset by peer')' and ResponseError('too many 500 error responses').

@yh154

yh154 commented Dec 23, 2024

Same issue. Any updates?

@Mousiekin

Hi, their live outbox works really well. I can no longer use PYEGA3 as the ports are blocked for us somehow, but this is really good:
https://ega-archive.org/access/download/files/live-outbox/
