Conversation
Codecov Report

❌ Patch coverage is

Additional details and impacted files

```diff
@@           Coverage Diff            @@
##             main    #1246   +/-   ##
=======================================
+ Coverage   96.66%   96.67%   +0.01%
=======================================
  Files          46       48       +2
  Lines        2877     2949      +72
=======================================
+ Hits         2781     2851      +70
- Misses         96       98       +2
```
hombit
left a comment
I have a few overall design questions
- Maybe we should implement an iterator and iterable separately, so the iterable object may be reused.
- I'm not sure there is a good use case for non-`None` `iter_limit > 1`: we could get row duplicates with it even in a single batch, which is not the case for `1` and `None`. I believe that for most ML approaches `iter_limit = 1` and `iter_limit = None` are enough, and if people would like to do multiple epochs, they would reuse an iterable object n times.
- I understand that it may be out of scope and looks more like a Dask feature request, but I think we should also support anything a user can get from a Catalog object, including a Dask series they can get with `['column']` or `.map_partitions`.
Thanks for taking a look!
resulting in an endless loop after simply switching the kwarg value, but the fact that it changes the behavior probably makes it not worth it; users will just need to be aware of what they're doing when switching that bool.
(2). I like (1). This is how it could look:

```python
# interface
stream = catalog.partition_stream()
for epoch in range(1000):
    print(f'Training epoch {epoch}')
    for df in stream:
        train_batch(df)

# implementation
class Catalog:
    def partition_stream(self, ...): return CatalogPartitionStream(...)

class CatalogPartitionStream:
    # Defines random seed, etc., so each epoch's training has different shuffling
    # Doesn't define partitions_left, etc.
    def __init__(self, ...): ...
    def __iter__(self): return CatalogPartitionIterator(...)
    # Doesn't have __next__

class CatalogPartitionIterator:
    # Defines iteration state, e.g. partitions_left
    def __init__(self, ...): ...
    def __iter__(self): return self
    def __next__(self): ...
```
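For reference, here is a runnable toy version of that iterable/iterator split, with plain Python values standing in for catalog partitions. The class names follow the sketch above, but the bodies are assumptions for illustration only, not the actual implementation:

```python
import random


class CatalogPartitionStream:
    """Reusable iterable: holds the seed so each epoch reshuffles differently."""

    def __init__(self, partitions, seed=0):
        self.partitions = list(partitions)
        self.rng = random.Random(seed)

    def __iter__(self):
        # Each call creates a fresh iterator with its own iteration state,
        # seeded from the stream's rng so epochs get different shufflings.
        return CatalogPartitionIterator(self.partitions, self.rng.getrandbits(64))


class CatalogPartitionIterator:
    """One-shot iterator: owns the per-epoch iteration state (partitions_left)."""

    def __init__(self, partitions, seed):
        self.partitions_left = list(partitions)
        random.Random(seed).shuffle(self.partitions_left)

    def __iter__(self):
        return self

    def __next__(self):
        if not self.partitions_left:
            raise StopIteration
        return self.partitions_left.pop()


stream = CatalogPartitionStream(range(4), seed=42)
epoch1 = list(stream)  # exhausts one iterator
epoch2 = list(stream)  # a second epoch works, because the stream is reusable
assert sorted(epoch1) == sorted(epoch2) == [0, 1, 2, 3]
```

The key property is that exhausting one iterator does not consume the stream, so the `for epoch in range(1000)` loop above keeps working.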
…it infinite looping into its own streamer
@hombit made some big changes based on our conversations!
hombit
left a comment
I do like this implementation!
I think that the Stream object should not be changed by an Iterator object. I propose to "split" rng when creating an iterator and pass a new one inside it, and then pass it back to get_next_partitions. So basically, the stream's rng is used to initialize new iterators, and the iterator's rng is used for all the shuffling.
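Assuming a NumPy-based rng, one way to "split" is `SeedSequence.spawn`: the stream keeps the parent sequence and hands each new iterator a spawned child, so iterator shuffling never advances the stream's own state. A minimal sketch (the seed values are arbitrary):

```python
import numpy as np

# The stream would hold the parent SeedSequence; each __iter__ spawns a child.
parent = np.random.SeedSequence(12345)
child_a, child_b = parent.spawn(2)

# Each iterator builds its own Generator from its child and uses it for all shuffling.
draws_a = np.random.default_rng(child_a).integers(0, 1000, size=5).tolist()
draws_b = np.random.default_rng(child_b).integers(0, 1000, size=5).tolist()
assert draws_a != draws_b  # independent streams per iterator

# Reproducible: the same parent seed spawns the same children, in order.
child_a2, _ = np.random.SeedSequence(12345).spawn(2)
assert np.random.default_rng(child_a2).integers(0, 1000, size=5).tolist() == draws_a
```

This gives both properties the comment asks for: iterators are independent of each other, and the stream's rng is touched only when an iterator is created.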
@hombit Now the rng should be split
@dougbrn |
Closes #1042. Generally pretty open on a lot of the design decisions, naming, etc here. I didn't attempt the pytorch integration mentioned in the issue, happy to discuss more about that!