Is your feature request related to a problem? Please describe.
I have previously run the README command to download the wiki data:
/workspace/electra/data/create_datasets_from_start.sh wiki_only
Downloading the bzip2 file took a long time, and so did decompressing it into the roughly 90 GB XML file:
-rw-r--r-- 1 nobody nogroup 94,992,294,413 Jun 28 16:29 wikicorpus_en.xml
I wanted to rerun the script, to re-create the datasets.
The script correctly spots that the bz2 file already exists, and doesn't attempt to re-download it:
Downloading: https://dumps.wikimedia.org/enwiki/latest/enwiki-latest-pages-articles.xml.bz2
** Download file already exists, skipping download
However, it does not seem to detect that the file has already been unzipped, and tries to unzip it again:
Unzipping: wikicorpus_en.xml.bz2
bzip2: Can't create output file /workspace/electra/data/download/wikicorpus_en/wikicorpus_en.xml: File exists.
Traceback (most recent call last):
File "/workspace/electra/data/dataPrep.py", line 312, in <module>
main(args)
File "/workspace/electra/data/dataPrep.py", line 59, in main
downloader.download()
File "/workspace/electra/data/Downloader.py", line 33, in download
self.download_wikicorpus('en')
File "/workspace/electra/data/Downloader.py", line 71, in download_wikicorpus
downloader.download()
File "/workspace/electra/data/WikiDownloader.py", line 54, in download
subprocess.run('bzip2 -dk ' + self.save_path + '/' + filename, shell=True, check=True)
File "/usr/lib/python3.6/subprocess.py", line 438, in run
output=stdout, stderr=stderr)
subprocess.CalledProcessError: Command 'bzip2 -dk /workspace/electra/data/download/wikicorpus_en/wikicorpus_en.xml.bz2' returned non-zero exit status 1.
Describe the solution you'd like
The end result is fine, in that the script continues, but it would be good to check whether the unzipped file already exists before running bzip2, so as to avoid the unnecessary error message and Python traceback.
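A minimal sketch of such a check, assuming the unzipped file sits next to the archive under the same name minus the `.bz2` suffix (as the paths in the traceback suggest). The function name `decompress_if_needed` and the skip message are hypothetical and not taken from `WikiDownloader.py`; only the `bzip2 -dk` flags come from the traceback above.

```python
import os
import subprocess


def decompress_if_needed(archive_path):
    """Decompress a .bz2 archive unless its output file already exists.

    Returns True if bzip2 was run, False if the step was skipped.
    Hypothetical helper: the name and messages are illustrative only.
    """
    if not archive_path.endswith(".bz2"):
        raise ValueError("expected a .bz2 archive: " + archive_path)

    # bzip2 -d writes its output to the archive name without the .bz2 suffix.
    output_path = archive_path[: -len(".bz2")]

    if os.path.exists(output_path):
        print("** Unzipped file already exists, skipping unzip")
        return False

    print("Unzipping:", os.path.basename(archive_path))
    # Same flags as the existing command: -d decompress, -k keep the archive.
    # Passing a list avoids shell=True and quoting problems in the path.
    subprocess.run(["bzip2", "-dk", archive_path], check=True)
    return True
```

One caveat with this approach: an output file left behind by an interrupted unzip would be treated as complete, so a real fix might also want to compare sizes or unzip to a temporary name and rename on success.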
Describe alternatives you've considered
It could be documented in the README.
The script should definitely not remove any unzip'd file, as it takes a substantial amount of time to unzip.
Additional context
None
Related to ELECTRA/TF2