[ELECTRA/TF2] Creation Of Datasets Should Check for Existence Of unzip'd File To Avoid Error Messages

Related to **ELECTRA/TF2** 

**Is your feature request related to a problem? Please describe.**
I previously ran the README command to download the wiki data:
       /workspace/electra/data/create_datasets_from_start.sh wiki_only
It spent a long time downloading the bzip2 file, and then a long time decompressing it into a file of roughly 95 GB:
      -rw-r--r-- 1 nobody  nogroup 94,992,294,413 Jun 28 16:29 wikicorpus_en.xml

I wanted to rerun the script, to re-create the datasets.
The script correctly spots that the bz2 file already exists, and doesn't attempt to re-download it:
      Downloading: https://dumps.wikimedia.org/enwiki/latest/enwiki-latest-pages-articles.xml.bz2
      ** Download file already exists, skipping download
However, it does not notice that the file has already been unzip'd, and tries to unzip it again:
      Unzipping: wikicorpus_en.xml.bz2
      bzip2: Can't create output file /workspace/electra/data/download/wikicorpus_en/wikicorpus_en.xml: File exists.
      Traceback (most recent call last):
        File "/workspace/electra/data/dataPrep.py", line 312, in <module>
          main(args)
        File "/workspace/electra/data/dataPrep.py", line 59, in main
          downloader.download()
        File "/workspace/electra/data/Downloader.py", line 33, in download
          self.download_wikicorpus('en')
        File "/workspace/electra/data/Downloader.py", line 71, in download_wikicorpus
          downloader.download()
        File "/workspace/electra/data/WikiDownloader.py", line 54, in download
          subprocess.run('bzip2 -dk ' + self.save_path + '/' + filename, shell=True, check=True)
        File "/usr/lib/python3.6/subprocess.py", line 438, in run
          output=stdout, stderr=stderr)
      subprocess.CalledProcessError: Command 'bzip2 -dk /workspace/electra/data/download/wikicorpus_en/wikicorpus_en.xml.bz2' returned non-zero exit status 1.

**Describe the solution you'd like**
The end result is fine, in that the script continues, but it would be better to check whether the unzip'd file already exists before running bzip2, so as to avoid the unnecessary error message and Python traceback.
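
As a rough illustration of the requested check, here is a minimal sketch. The helper name `decompress_if_needed` and the skip message are hypothetical and not taken from `WikiDownloader.py`; the sketch only assumes that `bzip2 -dk` writes its output next to the archive with the `.bz2` suffix removed, which matches the paths in the log above.

```python
import os
import subprocess


def decompress_if_needed(archive_path):
    """Decompress a .bz2 archive unless its output file is already present.

    Hypothetical helper for illustration only.
    Returns True if bzip2 was run, False if decompression was skipped.
    """
    if not archive_path.endswith(".bz2"):
        raise ValueError("expected a .bz2 archive: " + archive_path)

    # bzip2 -d writes the output next to the archive, minus the ".bz2" suffix.
    output_path = archive_path[: -len(".bz2")]

    if os.path.exists(output_path):
        # Mirror the wording of the existing download-skip message.
        print("** Unzipped file already exists, skipping unzip")
        return False

    print("Unzipping:", os.path.basename(archive_path))
    # -d decompresses, -k keeps the archive; the list form avoids shell quoting.
    subprocess.run(["bzip2", "-dk", archive_path], check=True)
    return True
```

One caveat with this approach: an existence check cannot tell a complete file from a partial one left behind by an interrupted run, so a real fix might also want to compare sizes or decompress to a temporary name first.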

**Describe alternatives you've considered**
The current behavior could instead be documented in the README.
Either way, the script should definitely not remove an existing unzip'd file, as it takes a substantial amount of time to unzip.

**Additional context**
None

[ELECTRA/TF2] Creation Of Datasets Should Check for Existence Of unzip'd File To Avoid Error Messages #1320