
Dynamic shape of labels #774

Open
ohindialign opened this issue Sep 7, 2022 · 3 comments


@ohindialign

Hey,

I'm using petastorm for object detection; each image might have a different number of objects in it.
When I use make_reader and specify a shape of (-1, 5) for the labels inside TransformSpec, everything works fine,
but when I use make_batch_reader I get an error about the shape. (I tried (None, 5) too and still got an error.)
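
For reference, here is roughly what the working make_reader setup looks like (the dataset path and the `labels` field name are placeholders for my real ones):

```python
import numpy as np
from petastorm import make_reader, TransformSpec

def _to_matrix(row):
    # Reshape the flat label values into an (N, 5) float matrix; N varies per image.
    row['labels'] = np.asarray(row['labels'], dtype=np.float32).reshape(-1, 5)
    return row

# (-1, 5) declares a variable-length first dimension for the edited field.
transform = TransformSpec(_to_matrix,
                          edit_fields=[('labels', np.float32, (-1, 5), False)])

with make_reader('file:///path/to/dataset', transform_spec=transform) as reader:
    for row in reader:
        print(row.labels.shape)  # (num_objects, 5), varies per row
```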

Is there a way to specify a dynamic size for a field?
And why is there a difference between make_reader and make_batch_reader?

Besides this, I'm getting a lot of FutureWarnings about the pyarrow version (working inside a Databricks environment).
Do you know how I can avoid all these warnings?

I hope you'll be able to help me.
If any information is missing, let me know.
petastorm version: 0.11.4

FutureWarning examples:

```
/databricks/python/lib/python3.9/site-packages/petastorm/py_dict_reader_worker.py:180: FutureWarning: 'ParquetDataset.partitions' attribute is deprecated as of pyarrow 5.0.0 and will be removed in a future version. Specify 'use_legacy_dataset=False' while constructing the ParquetDataset, and then use the '.partitioning' attribute instead.
  parquet_file = ParquetFile(self._dataset.fs.open(piece.path))

/databricks/python/lib/python3.9/site-packages/petastorm/fs_utils.py:88: FutureWarning: pyarrow.localfs is deprecated as of 2.0.0, please use pyarrow.fs.LocalFileSystem instead.
```
Thanks a lot!

@selitvin
Collaborator

selitvin commented Sep 7, 2022

make_reader and make_batch_reader are quite different. Please read more here: https://github.com/uber/petastorm/#non-petastorm-parquet-stores and post again if you need further clarification.

The FutureWarnings are a known issue. We hope to address it soon.
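
In the meantime, a possible workaround is to filter out these specific warnings (a sketch, not an official fix: it only hides the messages, and filters installed in the driver process may not reach warnings raised in separate worker processes):

```python
import warnings

# Hide the known pyarrow deprecation FutureWarnings emitted from petastorm modules.
# 'module' is a regex matched against the start of the emitting module's name.
warnings.filterwarnings('ignore', category=FutureWarning, module=r'petastorm\.')
```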

@ohindialign
Author

I went over the documentation; my main question is why a dynamic field size is possible with make_reader but not with make_batch_reader.
Is there any efficiency difference between them? Is make_batch_reader faster when my row groups are already saved at batch size?

@selitvin
Collaborator

selitvin commented Sep 8, 2022

make_batch_reader reads a row group and returns it with minimal processing, i.e. it is more oriented toward batch data reading.
make_reader returns the data from a row group row by row.

If your data has non-uniform sizes, as you describe, and you use make_batch_reader, you must use a TransformSpec to make all fields uniform (so that all rows in a batch can be collated). So it looks to me like you are headed in the right direction by trying to define a TransformSpec that does this with make_batch_reader; a padding sketch is below. Can you perhaps share a code snippet (preferably a runnable one) that demonstrates the problem?
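
For illustration, a padding TransformSpec might look roughly like this (an untested sketch: MAX_OBJECTS and the flat float storage of the labels column are assumptions, and with make_batch_reader the transform function receives a pandas DataFrame rather than a single row):

```python
import numpy as np
from petastorm import make_batch_reader, TransformSpec

MAX_OBJECTS = 50  # assumed upper bound on objects per image

def _pad_labels(pdf):
    # Pad every row's labels (assumed stored as a flat list of 5*N floats)
    # with zeros to a fixed (MAX_OBJECTS, 5) shape so the batch can be collated.
    def pad(flat):
        m = np.asarray(flat, dtype=np.float32).reshape(-1, 5)
        return np.pad(m, ((0, MAX_OBJECTS - m.shape[0]), (0, 0)))
    pdf['labels'] = pdf['labels'].map(pad)
    return pdf

transform = TransformSpec(_pad_labels,
                          edit_fields=[('labels', np.float32, (MAX_OBJECTS, 5), False)])

with make_batch_reader('file:///path/to/parquet', transform_spec=transform) as reader:
    for batch in reader:
        pass  # each row's labels field now has a uniform (MAX_OBJECTS, 5) shape
```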
