walk method in GCSFSWrapper returns empty string as one of filenames #558

Open
alekswithakayy opened this issue Jun 3, 2020 · 2 comments · May be fixed by #561

Comments

@alekswithakayy

To recreate:

import gcsfs
from petastorm.gcsfs_helpers.gcsfs_wrapper import GCSFSWrapper
path = "gs://your/bucket/path"
fs = GCSFSWrapper(gcsfs.GCSFileSystem())
_, directories, files = next(fs.walk(path))
print(files)
# prints ['', 'file1', 'file2']
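
Until a fix lands, one possible stopgap is to drop empty names from whatever `walk` yields. The sketch below is only an illustration: `walk_without_empty_names` is a hypothetical helper (not part of petastorm or gcsfs), and `fake_walk` is a stand-in that mimics the output reported above so the example runs without a bucket.

```python
from typing import Iterable, Iterator, List, Tuple

# Shape of one os.walk-style entry: (root, directory names, file names).
WalkEntry = Tuple[str, List[str], List[str]]


def walk_without_empty_names(walk_iter: Iterable[WalkEntry]) -> Iterator[WalkEntry]:
    """Yield (root, dirs, files) triples with empty-string names removed.

    Hypothetical helper for illustration; it wraps any walk-like iterable.
    """
    for root, dirs, files in walk_iter:
        yield root, [d for d in dirs if d], [f for f in files if f]


# Stand-in for fs.walk(path); reproduces the listing reported in this issue.
fake_walk = [("gs://your/bucket/path", [], ["", "file1", "file2"])]

_, directories, files = next(walk_without_empty_names(fake_walk))
print(files)  # prints ['file1', 'file2']
```

With a real filesystem, the same wrapper would be applied as `walk_without_empty_names(fs.walk(path))`. This only hides the symptom on the caller's side; the wrapper's `walk` would still need the actual fix.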

This becomes a problem in petastorm.utils.add_to_dataset_metadata, where we have the following line:

arrow_metadata = compat_get_metadata(dataset.pieces[0], dataset.fs.open)

The empty string ends up as pieces[0], and pyarrow ultimately throws the following error because it is not a valid filename:

Traceback (most recent call last):
  File "build_petastorm_dataset.py", line 103, in <module>
    run(args)
  File "build_petastorm_dataset.py", line 79, in run
    .parquet(args.output_url)
  File "/opt/conda/default/lib/python3.6/contextlib.py", line 88, in __exit__
    next(self.gen)
  File "/opt/conda/default/lib/python3.6/site-packages/petastorm/etl/dataset_metadata.py", line 113, in materialize_dataset
    _generate_unischema_metadata(dataset, schema)
  File "/opt/conda/default/lib/python3.6/site-packages/petastorm/etl/dataset_metadata.py", line 206, in _generate_unischema_metadata
    utils.add_to_dataset_metadata(dataset, UNISCHEMA_KEY, serialized_schema)
  File "/opt/conda/default/lib/python3.6/site-packages/petastorm/utils.py", line 115, in add_to_dataset_metadata
    arrow_metadata = compat_get_metadata(dataset.pieces[0], dataset.fs.open)
  File "/opt/conda/default/lib/python3.6/site-packages/petastorm/compat.py", line 31, in compat_get_metadata
    arrow_metadata = piece.get_metadata()
  File "/opt/conda/default/lib/python3.6/site-packages/pyarrow/parquet.py", line 676, in get_metadata
    f = self.open()
  File "/opt/conda/default/lib/python3.6/site-packages/pyarrow/parquet.py", line 683, in open
    reader = self.open_file_func(self.path)
  File "/opt/conda/default/lib/python3.6/site-packages/pyarrow/parquet.py", line 1054, in _open_dataset_file
    buffer_size=dataset.buffer_size
  File "/opt/conda/default/lib/python3.6/site-packages/pyarrow/parquet.py", line 210, in __init__
    read_dictionary=read_dictionary, metadata=metadata)
  File "pyarrow/_parquet.pyx", line 1023, in pyarrow._parquet.ParquetReader.open
  File "pyarrow/error.pxi", line 85, in pyarrow.lib.check_status
pyarrow.lib.ArrowInvalid: Parquet file size is 0 bytes
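
A defensive alternative to reading `pieces[0]` directly would be to pick the first piece whose path actually names a file. The sketch below is an assumption-laden illustration, not petastorm's code: `Piece` is a stand-in exposing only the `.path` attribute seen in the traceback, `first_piece_with_file_name` is a hypothetical helper, and the guess that the bad piece's path is empty or ends in `/` has not been verified against a real bucket.

```python
from collections import namedtuple

# Stand-in for a pyarrow dataset piece; only the `.path` attribute is modeled.
Piece = namedtuple("Piece", ["path"])


def first_piece_with_file_name(pieces):
    """Return the first piece whose path names a file.

    Hypothetical helper: skips pieces whose path is empty or ends with '/',
    which is what an empty file name joined to a directory is assumed to give.
    """
    for piece in pieces:
        if piece.path and not piece.path.endswith("/"):
            return piece
    raise ValueError("no dataset piece with a non-empty file name")


# Example listing in which the first piece comes from the empty file name.
pieces = [Piece("your/bucket/path/"), Piece("your/bucket/path/file1")]
print(first_piece_with_file_name(pieces).path)  # prints your/bucket/path/file1
```

Filtering at the `walk` level is probably the cleaner fix, since the empty entry would otherwise still be counted as a dataset piece elsewhere.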

@megaserg @selitvin

@megaserg
Contributor

megaserg commented Jun 4, 2020

Yes, I recently realized that the version I merged was full of bugs. I've fixed it; let me upstream the patch.

@alekswithakayy
Author

@megaserg any updates on this? Willing to help if needed...

@megaserg megaserg linked a pull request Jun 14, 2020 that will close this issue