[ELECTRA/TF2] Creation Of Datasets Should Check for Existence Of unzip'd File To Avoid Error Messages #1320

Open
psharpe99 opened this issue Jun 30, 2023 · 0 comments
Labels
enhancement New feature or request

Related to ELECTRA/TF2

Is your feature request related to a problem? Please describe.
I have previously run the README command to download the wiki data:
/workspace/electra/data/create_datasets_from_start.sh wiki_only
It spent a long time downloading the bzip2 file, and then a long time unzipping it into a file of roughly 95 GB:
-rw-r--r-- 1 nobody nogroup 94,992,294,413 Jun 28 16:29 wikicorpus_en.xml

I wanted to rerun the script, to re-create the datasets.
The script correctly spots that the bz2 file already exists, and doesn't attempt to re-download it:
Downloading: https://dumps.wikimedia.org/enwiki/latest/enwiki-latest-pages-articles.xml.bz2
** Download file already exists, skipping download
However, it does not seem to spot that the file has previously been unzip'd, and tries to re-unzip it:
Unzipping: wikicorpus_en.xml.bz2
bzip2: Can't create output file /workspace/electra/data/download/wikicorpus_en/wikicorpus_en.xml: File exists.
Traceback (most recent call last):
  File "/workspace/electra/data/dataPrep.py", line 312, in <module>
    main(args)
  File "/workspace/electra/data/dataPrep.py", line 59, in main
    downloader.download()
  File "/workspace/electra/data/Downloader.py", line 33, in download
    self.download_wikicorpus('en')
  File "/workspace/electra/data/Downloader.py", line 71, in download_wikicorpus
    downloader.download()
  File "/workspace/electra/data/WikiDownloader.py", line 54, in download
    subprocess.run('bzip2 -dk ' + self.save_path + '/' + filename, shell=True, check=True)
  File "/usr/lib/python3.6/subprocess.py", line 438, in run
    output=stdout, stderr=stderr)
subprocess.CalledProcessError: Command 'bzip2 -dk /workspace/electra/data/download/wikicorpus_en/wikicorpus_en.xml.bz2' returned non-zero exit status 1.

Describe the solution you'd like
The end result is fine, in that the script continues, but it would be good to check whether the unzip'd file already exists before running bzip2, so as to avoid the unnecessary error message and Python traceback output.
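A minimal sketch of such a check, assuming the unzip step keeps the `bzip2 -dk` call shown in the traceback. The function name `decompress_if_needed` and its `run` parameter are hypothetical and not part of the repository; the skip message is only modelled on the existing download message.

```python
import os
import subprocess


def decompress_if_needed(bz2_path, run=subprocess.run):
    """Run `bzip2 -dk` on bz2_path unless the unzip'd output already exists.

    Hypothetical helper: returns True if bzip2 was invoked, False if skipped.
    `run` defaults to subprocess.run and is injectable for testing.
    """
    if not bz2_path.endswith(".bz2"):
        raise ValueError("expected a .bz2 file, got: " + bz2_path)

    # bzip2 -d writes its output next to the input, minus the .bz2 suffix.
    output_path = bz2_path[: -len(".bz2")]

    if os.path.isfile(output_path):
        # Mirror the wording of the existing "Download file already exists" message.
        print("** Unzipped file already exists, skipping unzip")
        return False

    print("Unzipping:", os.path.basename(bz2_path))
    # Argument list instead of shell=True; -k keeps the .bz2 file in place.
    run(["bzip2", "-dk", bz2_path], check=True)
    return True
```

One caveat with this approach: an existence check cannot tell a complete file from a partial one left behind by an interrupted unzip, so a size or checksum comparison might be worth considering as well.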

Describe alternatives you've considered
It could be documented in the README.
The script should definitely not remove any unzip'd file, as it takes a substantial amount of time to unzip.

Additional context
None

psharpe99 added the enhancement label on Jun 30, 2023