
Dynamic shape of labels #774

Open
ohindialign opened this issue Sep 7, 2022 · 3 comments


@ohindialign

Hey,

I'm using petastorm for object detection; each image might have a different number of objects in it.
When I use make_reader and specify a shape of (-1, 5) for the labels inside TransformSpec, everything works fine,
but when I use make_batch_reader I get an error about the shape. (I tried (None, 5) too and still got an error.)
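
For reference, here is roughly what the working make_reader setup looks like (the dataset path and the `labels` field name are placeholders for my real ones):

```python
import numpy as np
from petastorm import make_reader, TransformSpec

def _to_matrix(row):
    # Reshape the flat label values into an (N, 5) float matrix; N varies per image.
    row['labels'] = np.asarray(row['labels'], dtype=np.float32).reshape(-1, 5)
    return row

# (-1, 5) declares a variable-length first dimension for the edited field.
transform = TransformSpec(_to_matrix,
                          edit_fields=[('labels', np.float32, (-1, 5), False)])

with make_reader('file:///path/to/dataset', transform_spec=transform) as reader:
    for row in reader:
        print(row.labels.shape)  # (num_objects, 5), varies per row
```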

Is there a way to specify a dynamic size for a field?
And why is there a difference between make_reader and make_batch_reader?

Besides this, I'm getting a lot of FutureWarnings about the pyarrow version (working inside a Databricks environment).
Do you know how I can avoid all these warnings?

I hope you'll be able to help me.
If any information is missing, let me know.
petastorm version: 0.11.4

FutureWarning examples:

```
/databricks/python/lib/python3.9/site-packages/petastorm/py_dict_reader_worker.py:180: FutureWarning: 'ParquetDataset.partitions' attribute is deprecated as of pyarrow 5.0.0 and will be removed in a future version. Specify 'use_legacy_dataset=False' while constructing the ParquetDataset, and then use the '.partitioning' attribute instead.
  parquet_file = ParquetFile(self._dataset.fs.open(piece.path))

/databricks/python/lib/python3.9/site-packages/petastorm/fs_utils.py:88: FutureWarning: pyarrow.localfs is deprecated as of 2.0.0, please use pyarrow.fs.LocalFileSystem instead.
```
Thanks a lot!

@selitvin
Collaborator

selitvin commented Sep 7, 2022

make_reader and make_batch_reader are quite different. Please read more here: https://github.com/uber/petastorm/#non-petastorm-parquet-stores and post again if you need further clarification.

The FutureWarnings are a known issue. We hope to address it soon.
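
In the meantime, a possible workaround is to filter out these specific warnings (a sketch, not an official fix: it only hides the messages, and filters installed in the driver process may not reach warnings raised in separate worker processes):

```python
import warnings

# Hide the known pyarrow deprecation FutureWarnings emitted from petastorm modules.
# 'module' is a regex matched against the start of the emitting module's name.
warnings.filterwarnings('ignore', category=FutureWarning, module=r'petastorm\.')
```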

@ohindialign
Author

I went over the documentation; my main question is why a dynamic field size is possible with make_reader but not with make_batch_reader.
Is there any efficiency difference between them? Is make_batch_reader faster when my row groups are already saved at batch size?

@selitvin
Collaborator

selitvin commented Sep 8, 2022

make_batch_reader reads a row group and returns it with minimal processing, i.e. it is more oriented toward batch data reading.
make_reader returns the data from a row group row by row.

If your data has non-uniform sizes, as you describe, and you use make_batch_reader, you must use a TransformSpec to make all fields uniform (so that all rows in a batch can be collated). So it looks to me like you are headed in the right direction by trying to define a TransformSpec that does this with make_batch_reader; a padding sketch is below. Can you perhaps share a code snippet (preferably a runnable one) that demonstrates the problem?
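
For illustration, a padding TransformSpec might look roughly like this (an untested sketch: MAX_OBJECTS and the flat float storage of the labels column are assumptions, and with make_batch_reader the transform function receives a pandas DataFrame rather than a single row):

```python
import numpy as np
from petastorm import make_batch_reader, TransformSpec

MAX_OBJECTS = 50  # assumed upper bound on objects per image

def _pad_labels(pdf):
    # Pad every row's labels (assumed stored as a flat list of 5*N floats)
    # with zeros to a fixed (MAX_OBJECTS, 5) shape so the batch can be collated.
    def pad(flat):
        m = np.asarray(flat, dtype=np.float32).reshape(-1, 5)
        return np.pad(m, ((0, MAX_OBJECTS - m.shape[0]), (0, 0)))
    pdf['labels'] = pdf['labels'].map(pad)
    return pdf

transform = TransformSpec(_pad_labels,
                          edit_fields=[('labels', np.float32, (MAX_OBJECTS, 5), False)])

with make_batch_reader('file:///path/to/parquet', transform_spec=transform) as reader:
    for batch in reader:
        pass  # each row's labels field now has a uniform (MAX_OBJECTS, 5) shape
```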
