This repository has been archived by the owner on Oct 31, 2023. It is now read-only.
The dev branch fails at the _mine_shard stage: it times out and raises the following exception even with parallelism=1:
python3 -m cc_net --config reproduce --dump 2019-09 --task_parallelism 1
Will run cc_net.mine.main with the following config: Config(config_name='reproduce', dump='2019-09', output_dir=PosixPath('data'), mined_dir='reproduce', execution='local', num_shards=1600, num_segments_per_shard=-1, metadata='https://dl.fbaipublicfiles.com/cc_net/1.0.0', min_len=300, hash_in_mem=50, lang_whitelist=[], lang_blacklist=[], lang_threshold=0.5, lm_dir=PosixPath('data/lm_sp'), cutoff=PosixPath('/home/zbr/awork/cc_net/cc_net/data/cutoff.csv'), lm_languages=None, mine_num_processes=16, target_size='4G', cleanup_after_regroup=True, task_parallelism=1, pipeline=['fetch_metadata', 'split_by_lang'], experiments=[], cache_dir=None)
Submitting 1600 jobs for _mine_shard, with parallelism=1
Waiting on 1 running jobs. Job ids: 17305
Failed job 17305 (1 / 1600): Job (task=0) failed during processing with trace:
----------------------
multiprocessing.pool.RemoteTraceback:
"""
Traceback (most recent call last):
File "/usr/lib/python3.8/multiprocessing/pool.py", line 125, in worker
result = (True, func(*args, **kwds))
File "/usr/lib/python3.8/multiprocessing/pool.py", line 48, in mapstar
return list(map(*args))
File "/home/zbr/awork/cc_net/cc_net/jsonql.py", line 476, in _global_transformer
return _GLOBAL_TRANSFORMER(document)
File "/home/zbr/awork/cc_net/cc_net/jsonql.py", line 246, in __call__
y = self.do(x)
File "/home/zbr/awork/cc_net/cc_net/minify.py", line 164, in do
self.fetch_metadata(doc["cc_segment"])
File "/home/zbr/awork/cc_net/cc_net/minify.py", line 146, in fetch_metadata
for m in jsonql.read_jsons(meta_file):
File "/home/zbr/awork/cc_net/cc_net/jsonql.py", line 485, in read_jsons
lines = open_read(file)
File "/home/zbr/awork/cc_net/cc_net/jsonql.py", line 942, in open_read
return open_remote_file(filename)
File "/home/zbr/awork/cc_net/cc_net/jsonql.py", line 1117, in open_remote_file
raw_bytes = request_get_content(url)
File "/home/zbr/awork/cc_net/cc_net/jsonql.py", line 1094, in request_get_content
raise e
File "/home/zbr/awork/cc_net/cc_net/jsonql.py", line 1088, in request_get_content
r.raise_for_status()
File "/usr/local/lib/python3.8/dist-packages/requests/models.py", line 941, in raise_for_status
raise HTTPError(http_error_msg, response=self)
requests.exceptions.HTTPError: 429 Client Error: Too Many Requests for url: https://dl.fbaipublicfiles.com/cc_net/1.0.0/2019-09/CC-MAIN-20190215183319-20190215205319-00026.json.gz
Meanwhile, the same file can be downloaded in parallel using wget or other tools, so this does not look like a network issue.
Last commit id: 9a0d5c2
The end of the log shows:
2020-10-12 20:39 INFO 17351:root - Downloaded https://dl.fbaipublicfiles.com/cc_net/1.0.0/2019-09/CC-MAIN-20190215183319-20190215205319-00026.json.gz [200] took 9s (420.6kB/s)
2020-10-12 20:39 INFO 17351:JsonReader - Processed 30_532 documents in 0.0026h (3278.9 doc/s).
2020-10-12 20:39 INFO 17351:MetadataFetcher - Loaded 30532 metadatas from https://dl.fbaipublicfiles.com/cc_net/1.0.0/2019-09/CC-MAIN-20190215183319-20190215205319-00026.json.gz
submitit WARNING (2020-10-13 01:26:39,791) - Caught signal 10 on gpurnd14: this job is timed-out.
2020-10-13 01:26 WARNING 17307:submitit - Caught signal 10 on gpurnd14: this job is timed-out.
2020-10-13 01:26 INFO 17307:submitit - Job not requeued because: timed-out and not checkpointable.
2020-10-13 01:26 INFO 17307:MetadataFetcher - Processed 0 documents in 5.0h ( 0.0 doc/s).
2020-10-13 01:26 INFO 17307:MetadataFetcher - Read 0, stocking 0 doc in 0.3g.
submitit ERROR (2020-10-13 01:26:39,797) - Submitted job triggered an exception
2020-10-13 01:26 ERROR 17307:submitit - Submitted job triggered an exception
multiprocessing.pool.RemoteTraceback:
"""
Traceback (most recent call last):
File "/usr/lib/python3.8/multiprocessing/pool.py", line 125, in worker
result = (True, func(*args, **kwds))
File "/usr/lib/python3.8/multiprocessing/pool.py", line 48, in mapstar
return list(map(*args))
File "/home/zbr/awork/cc_net/cc_net/jsonql.py", line 476, in _global_transformer
return _GLOBAL_TRANSFORMER(document)
File "/home/zbr/awork/cc_net/cc_net/jsonql.py", line 246, in __call__
y = self.do(x)
File "/home/zbr/awork/cc_net/cc_net/minify.py", line 164, in do
self.fetch_metadata(doc["cc_segment"])
File "/home/zbr/awork/cc_net/cc_net/minify.py", line 146, in fetch_metadata
for m in jsonql.read_jsons(meta_file):
File "/home/zbr/awork/cc_net/cc_net/jsonql.py", line 485, in read_jsons
lines = open_read(file)
File "/home/zbr/awork/cc_net/cc_net/jsonql.py", line 942, in open_read
return open_remote_file(filename)
File "/home/zbr/awork/cc_net/cc_net/jsonql.py", line 1117, in open_remote_file
raw_bytes = request_get_content(url)
File "/home/zbr/awork/cc_net/cc_net/jsonql.py", line 1094, in request_get_content
raise e
File "/home/zbr/awork/cc_net/cc_net/jsonql.py", line 1088, in request_get_content
r.raise_for_status()
File "/usr/local/lib/python3.8/dist-packages/requests/models.py", line 941, in raise_for_status
raise HTTPError(http_error_msg, response=self)
requests.exceptions.HTTPError: 429 Client Error: Too Many Requests for url: https://dl.fbaipublicfiles.com/cc_net/1.0.0/2019-09/CC-MAIN-20190215183319-20190215205319-00026.json.gz
How can this be debugged, and what are the next steps to build a dataset?
Public downloads from https://dl.fbaipublicfiles.com/ are rate-limited by IP.
If you hit your quota it can take some time before you are unblocked.
Try re-running the next day with fewer workers and see if it happens again (I believe it shouldn't).
Please reopen if the issue persists, but I think you've taken the right steps to debug this.
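A user-side workaround for rate limiting is to retry with exponential backoff when the server returns 429, instead of letting the shard fail. Here is a minimal sketch; the helper name `retry_on_429` and its wiring are assumptions, not part of cc_net's actual API:

```python
import time


def retry_on_429(fn, retries=5, base_delay=1.0, sleep=time.sleep):
    """Call fn(); on HTTP 429, wait with exponential backoff and retry.

    `fn` is expected to raise an exception carrying a `response` attribute
    with a `status_code`, as requests.exceptions.HTTPError does.
    """
    for attempt in range(retries):
        try:
            return fn()
        except Exception as exc:
            status = getattr(getattr(exc, "response", None), "status_code", None)
            if status != 429 or attempt == retries - 1:
                raise  # not a rate-limit error, or out of retries
            sleep(base_delay * (2 ** attempt))  # 1s, 2s, 4s, ...
```

Something like this could wrap the call that currently re-raises in `request_get_content`. When the server sends a `Retry-After` header, honoring it would be a better delay hint than a fixed schedule.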
Restarting the job doesn't seem to help at this stage: the logs show a 5-hour gap between the metadata fetcher's last activity and the job timing out. Restarting also ends in exactly the same error (the file can differ, but the timeout is always there), both with a single parallel task and with 100 tasks. Nothing appears to be cached at this stage either: the previous stage took 2-3 restarts to complete, whereas this one fails 100% of the time after a few hours with the process stuck. So this looks like a bug rather than a rate-limiting issue (especially with just one job).
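Since every restart re-fetches each metadata segment from scratch, caching downloaded segments on local disk would at least make restarts cheap and cut the number of requests hitting the rate limiter. A sketch of such a cache follows; the function and its integration point are hypothetical, not an existing cc_net option:

```python
import urllib.request
from pathlib import Path


def fetch_cached(url, cache_dir, download=urllib.request.urlretrieve):
    """Download `url` into `cache_dir` once; later calls reuse the local copy.

    Writes to a .tmp file first and renames on success, so an interrupted
    download is never mistaken for a complete one.
    """
    cache_dir = Path(cache_dir)
    cache_dir.mkdir(parents=True, exist_ok=True)
    local = cache_dir / url.rsplit("/", 1)[-1]
    if not local.exists():
        tmp = local.with_name(local.name + ".tmp")
        download(url, str(tmp))  # network call; only when the file is missing
        tmp.rename(local)
    return local
```

Reading metadata through a helper like this would let a restarted _mine_shard job skip every segment it already fetched, so only genuinely new files count against the per-IP quota.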