This repository has been archived by the owner on Oct 31, 2023. It is now read-only.
The dev branch fails at the _mine_shard stage: it times out and raises the following exception even with parallelism=1:
python3 -m cc_net --config reproduce --dump 2019-09 --task_parallelism 1
Will run cc_net.mine.main with the following config: Config(config_name='reproduce', dump='2019-09', output_dir=PosixPath('data'), mined_dir='reproduce', execution='local', num_shards=1600, num_segments_per_shard=-1, metadata='https://dl.fbaipublicfiles.com/cc_net/1.0.0', min_len=300, hash_in_mem=50, lang_whitelist=[], lang_blacklist=[], lang_threshold=0.5, lm_dir=PosixPath('data/lm_sp'), cutoff=PosixPath('/home/zbr/awork/cc_net/cc_net/data/cutoff.csv'), lm_languages=None, mine_num_processes=16, target_size='4G', cleanup_after_regroup=True, task_parallelism=1, pipeline=['fetch_metadata', 'split_by_lang'], experiments=[], cache_dir=None)
Submitting 1600 jobs for _mine_shard, with parallelism=1
Waiting on 1 running jobs. Job ids: 17305
Failed job 17305 (1 / 1600): Job (task=0) failed during processing with trace:
----------------------
multiprocessing.pool.RemoteTraceback:
"""
Traceback (most recent call last):
File "/usr/lib/python3.8/multiprocessing/pool.py", line 125, in worker
result = (True, func(*args, **kwds))
File "/usr/lib/python3.8/multiprocessing/pool.py", line 48, in mapstar
return list(map(*args))
File "/home/zbr/awork/cc_net/cc_net/jsonql.py", line 476, in _global_transformer
return _GLOBAL_TRANSFORMER(document)
File "/home/zbr/awork/cc_net/cc_net/jsonql.py", line 246, in __call__
y = self.do(x)
File "/home/zbr/awork/cc_net/cc_net/minify.py", line 164, in do
self.fetch_metadata(doc["cc_segment"])
File "/home/zbr/awork/cc_net/cc_net/minify.py", line 146, in fetch_metadata
for m in jsonql.read_jsons(meta_file):
File "/home/zbr/awork/cc_net/cc_net/jsonql.py", line 485, in read_jsons
lines = open_read(file)
File "/home/zbr/awork/cc_net/cc_net/jsonql.py", line 942, in open_read
return open_remote_file(filename)
File "/home/zbr/awork/cc_net/cc_net/jsonql.py", line 1117, in open_remote_file
raw_bytes = request_get_content(url)
File "/home/zbr/awork/cc_net/cc_net/jsonql.py", line 1094, in request_get_content
raise e
File "/home/zbr/awork/cc_net/cc_net/jsonql.py", line 1088, in request_get_content
r.raise_for_status()
File "/usr/local/lib/python3.8/dist-packages/requests/models.py", line 941, in raise_for_status
raise HTTPError(http_error_msg, response=self)
requests.exceptions.HTTPError: 429 Client Error: Too Many Requests for url: https://dl.fbaipublicfiles.com/cc_net/1.0.0/2019-09/CC-MAIN-20190215183319-20190215205319-00026.json.gz
Meanwhile, the same file can be downloaded in parallel using wget or other tools, so this does not look like a network issue.
Last commit id: 9a0d5c2
The end of the log shows:
2020-10-12 20:39 INFO 17351:root - Downloaded https://dl.fbaipublicfiles.com/cc_net/1.0.0/2019-09/CC-MAIN-20190215183319-20190215205319-00026.json.gz [200] took 9s (420.6kB/s)
2020-10-12 20:39 INFO 17351:JsonReader - Processed 30_532 documents in 0.0026h (3278.9 doc/s).
2020-10-12 20:39 INFO 17351:MetadataFetcher - Loaded 30532 metadatas from https://dl.fbaipublicfiles.com/cc_net/1.0.0/2019-09/CC-MAIN-20190215183319-20190215205319-00026.json.gz
submitit WARNING (2020-10-13 01:26:39,791) - Caught signal 10 on gpurnd14: this job is timed-out.
2020-10-13 01:26 WARNING 17307:submitit - Caught signal 10 on gpurnd14: this job is timed-out.
2020-10-13 01:26 INFO 17307:submitit - Job not requeued because: timed-out and not checkpointable.
2020-10-13 01:26 INFO 17307:MetadataFetcher - Processed 0 documents in 5.0h ( 0.0 doc/s).
2020-10-13 01:26 INFO 17307:MetadataFetcher - Read 0, stocking 0 doc in 0.3g.
submitit ERROR (2020-10-13 01:26:39,797) - Submitted job triggered an exception
2020-10-13 01:26 ERROR 17307:submitit - Submitted job triggered an exception
multiprocessing.pool.RemoteTraceback:
"""
Traceback (most recent call last):
File "/usr/lib/python3.8/multiprocessing/pool.py", line 125, in worker
result = (True, func(*args, **kwds))
File "/usr/lib/python3.8/multiprocessing/pool.py", line 48, in mapstar
return list(map(*args))
File "/home/zbr/awork/cc_net/cc_net/jsonql.py", line 476, in _global_transformer
return _GLOBAL_TRANSFORMER(document)
File "/home/zbr/awork/cc_net/cc_net/jsonql.py", line 246, in __call__
y = self.do(x)
File "/home/zbr/awork/cc_net/cc_net/minify.py", line 164, in do
self.fetch_metadata(doc["cc_segment"])
File "/home/zbr/awork/cc_net/cc_net/minify.py", line 146, in fetch_metadata
for m in jsonql.read_jsons(meta_file):
File "/home/zbr/awork/cc_net/cc_net/jsonql.py", line 485, in read_jsons
lines = open_read(file)
File "/home/zbr/awork/cc_net/cc_net/jsonql.py", line 942, in open_read
return open_remote_file(filename)
File "/home/zbr/awork/cc_net/cc_net/jsonql.py", line 1117, in open_remote_file
raw_bytes = request_get_content(url)
File "/home/zbr/awork/cc_net/cc_net/jsonql.py", line 1094, in request_get_content
raise e
File "/home/zbr/awork/cc_net/cc_net/jsonql.py", line 1088, in request_get_content
r.raise_for_status()
File "/usr/local/lib/python3.8/dist-packages/requests/models.py", line 941, in raise_for_status
raise HTTPError(http_error_msg, response=self)
requests.exceptions.HTTPError: 429 Client Error: Too Many Requests for url: https://dl.fbaipublicfiles.com/cc_net/1.0.0/2019-09/CC-MAIN-20190215183319-20190215205319-00026.json.gz
How can this be debugged, and what are the next steps to build a dataset?
Public downloads from https://dl.fbaipublicfiles.com/ are rate-limited by IP.
If you hit your quota it can take some time before you are unblocked.
Try re-running the next day with fewer workers and see if it happens again (I believe it shouldn't).
Please reopen if the issue persists, but I think you've taken the right steps to debug this.
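A user-side workaround for rate limiting is to retry with exponential backoff when the server returns 429, instead of letting the shard fail. Here is a minimal sketch; the helper name `retry_on_429` and its wiring are assumptions, not part of cc_net's actual API:

```python
import time


def retry_on_429(fn, retries=5, base_delay=1.0, sleep=time.sleep):
    """Call fn(); on HTTP 429, wait with exponential backoff and retry.

    `fn` is expected to raise an exception carrying a `response` attribute
    with a `status_code`, as requests.exceptions.HTTPError does.
    """
    for attempt in range(retries):
        try:
            return fn()
        except Exception as exc:
            status = getattr(getattr(exc, "response", None), "status_code", None)
            if status != 429 or attempt == retries - 1:
                raise  # not a rate-limit error, or out of retries
            sleep(base_delay * (2 ** attempt))  # 1s, 2s, 4s, ...
```

Something like this could wrap the call that currently re-raises in `request_get_content`. When the server sends a `Retry-After` header, honoring it would be a better delay hint than a fixed schedule.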
Restarting the job doesn't seem to help at this stage: the logs show a 5-hour gap between the metadata fetcher's last activity and the job timing out. Restarting also ends in exactly the same error (the file can differ, but the timeout is always there), both with a single parallel task and with 100 tasks. Nothing appears to be cached at this stage either: the previous stage took 2-3 restarts to complete, whereas this one fails 100% of the time after a few hours with the process stuck. So this looks like a bug rather than a rate-limiting issue (especially with just one job).
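Since every restart re-fetches each metadata segment from scratch, caching downloaded segments on local disk would at least make restarts cheap and cut the number of requests hitting the rate limiter. A sketch of such a cache follows; the function and its integration point are hypothetical, not an existing cc_net option:

```python
import urllib.request
from pathlib import Path


def fetch_cached(url, cache_dir, download=urllib.request.urlretrieve):
    """Download `url` into `cache_dir` once; later calls reuse the local copy.

    Writes to a .tmp file first and renames on success, so an interrupted
    download is never mistaken for a complete one.
    """
    cache_dir = Path(cache_dir)
    cache_dir.mkdir(parents=True, exist_ok=True)
    local = cache_dir / url.rsplit("/", 1)[-1]
    if not local.exists():
        tmp = local.with_name(local.name + ".tmp")
        download(url, str(tmp))  # network call; only when the file is missing
        tmp.rename(local)
    return local
```

Reading metadata through a helper like this would let a restarted _mine_shard job skip every segment it already fetched, so only genuinely new files count against the per-IP quota.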