
Enable batch fetching in parallel #748

Draft · jarandaf wants to merge 4 commits into master

Conversation

@jarandaf commented Mar 23, 2022

This is a WIP. Happy to hear your thoughts on this, @selitvin.

@CLAassistant commented Mar 23, 2022

CLA assistant check
Thank you for your submission! We really appreciate it. Like many open source projects, we ask that you all sign our Contributor License Agreement before we can accept your contribution.
1 out of 2 committers have signed the CLA.

✅ jarandaf
❌ Jordi Aranda


Jordi Aranda seems not to be a GitHub user. You need a GitHub account to be able to sign the CLA. If you already have a GitHub account, please add the email address used for this commit to your account.
You have signed the CLA already but the status is still pending? Let us recheck it.

@selitvin (Collaborator) left a comment

Looks like we are headed in the right direction! I left some questions.

-    for batch in self._iter_impl():
+    iterator = self._iter_impl()
+    if self._max_prefetch > 1:
+        iterator = BackgroundIterator(iterator, prefetch=self.max_prefetch)
@selitvin (Collaborator)

Will the BackgroundIterator be properly destroyed and the thread joined when we exit the function (either on a nominal exit or with an exception)?

@jarandaf (Author)

I need to check this, thank you for the heads-up.

@selitvin (Collaborator)

I know it could be tricky, but it would be good to also cover these aspects in a unit test or two.
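
For reference, a rough sketch of the kind of test being suggested here. The BackgroundIterator details (its location in petastorm/pytorch.py, the queue_size argument, the stop event, and the thread starting in __init__) are assumptions inferred from the snippets in this PR, not confirmed API:

    from petastorm.pytorch import BackgroundIterator  # assumed location: the file this PR modifies

    def test_background_iterator_shuts_down_when_abandoned():
        # Abandon the iterator early, the way a failing training loop would,
        # then check that the producer thread actually goes away.
        background = BackgroundIterator(iter(range(100000)), queue_size=2)  # assumed constructor
        it = iter(background)
        next(it)                      # consume one item, then walk away mid-stream
        background.stop.set()         # or background.stop() if the event gets encapsulated
        background.join(timeout=5)    # the thread must notice the event and exit
        assert not background.is_alive()

With the current blocking put this test would time out with the thread still alive, which is exactly the failure mode discussed further down in this review.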

"""Prefetch iterator results. A thread iterates the original iterator and
populates a queue. Iterating over this background iterator just consumes the underlying
queue until no other result is available."""
def __init__(self, iterator, prefetch=1000):
@selitvin (Collaborator)

Is it really prefetching? Setting the queue size does not guarantee that we will prefetch the data before the user starts consuming it, does it? Perhaps we should call the argument 'queue_size'?
Frankly, I am not sure how prefetching helps steady-state throughput. Wouldn't it just eliminate some hiccups when training starts, at the expense of training starting a bit later? Isn't steady-state throughput the only important characteristic here?

@jarandaf (Author)

You are right, 'prefetching' is probably not the right word. As discussed in #740, the main motivation of this PR is to enable batch building in parallel with model training (otherwise the model always has to wait for a batch to become available, which can take some time, especially if the dataset has a large number of columns). I have observed a ~3x speedup in data throughput with this change.

@selitvin (Collaborator)

Sure - I see how this can speed up the training. This is a good change.

In my understanding we are not really doing prefetching here (depending on the timing, the consumer might try to fetch the first batch before the thread has populated it, i.e. nothing was prefetched).

If you are OK with just changing the name from prefetch to queue size, everything will fall into place.
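
For illustration, this is roughly how the knob would look from the user's side after the rename. make_batch_reader and the petastorm.pytorch DataLoader are existing APIs, but the queue_size parameter (and its name) is this PR's proposed addition and is only an assumption here:

    from petastorm import make_batch_reader
    from petastorm.pytorch import DataLoader

    with make_batch_reader('file:///tmp/my_dataset') as reader:   # dataset path is illustrative
        # queue_size bounds how many already-built batches the background thread may keep
        # queued; it does not guarantee any batch exists before the first training step.
        with DataLoader(reader, batch_size=128, queue_size=10) as loader:
            for batch in loader:
                pass  # feed the batch to the training step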

"""Prefetch iterator results. A thread iterates the original iterator and
populates a queue. Iterating over this background iterator just consumes the underlying
queue until no other result is available."""
def __init__(self, iterator, prefetch=1000):
@selitvin (Collaborator)

Having a default value of 1000 batches for the queue size may be a bit too much, given that a batch is a row group, and row groups of a couple of hundred MBs are common.

@jarandaf (Author)

The loader's _iter_impl yields batches, not row groups, right? That is what is enqueued. It is true that, depending on the queue size, more or fewer row groups will be processed, but I expect this to be controlled via the queue size and the batch size.

@selitvin (Collaborator)

You are absolutely right.

@codecov bot commented Mar 24, 2022

Codecov Report

Merging #748 (aae2993) into master (26e03c7) will decrease coverage by 0.27%.
The diff coverage is 35.71%.

❗ Current head aae2993 differs from pull request most recent head 4c43b96. Consider uploading reports for the commit 4c43b96 to get more accurate results

@@            Coverage Diff             @@
##           master     #748      +/-   ##
==========================================
- Coverage   86.27%   85.99%   -0.28%     
==========================================
  Files          85       85              
  Lines        5084     5111      +27     
  Branches      787      791       +4     
==========================================
+ Hits         4386     4395       +9     
- Misses        559      575      +16     
- Partials      139      141       +2     
Impacted Files Coverage Δ
petastorm/pytorch.py 86.93% <35.71%> (-6.64%) ⬇️

Legend: Δ = absolute <relative> (impact), ø = not affected, ? = missing data

@jarandaf changed the title from "Enable batch fetching in advance" to "Enable batch fetching in parallel" on Mar 28, 2022
@jarandaf (Author) commented Apr 6, 2022

@selitvin could you please have a look? Thank you!

    def run(self):
        while not self.stop.isSet():
            for item in self.iterator:
                self.queue.put(item)
@selitvin (Collaborator)

I don't think we can use blocking puts and gets and end up with a solution that is robust to deadlocks. Let's see if this works:

  • The producer ends up filling the queue and waits on a blocking put.
  • The consumer fails and calls iterator.stop.set(); however, since the .put inside the iterator thread is blocking, the event is never checked and the thread is never shut down.

Another scenario:

  • The queue is empty, hence the consumer waits on a blocking .get.
  • However, the producer raises an exception. The thread dies and the consumer is stuck forever on the .get.

I think a robust implementation of a BackgroundIterator could get pretty tricky. All these edge cases need to be carefully tested, as these kinds of failures would be hard to catch in production.
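
To make the concern concrete, here is one possible shape for such loops (an illustrative sketch under assumed names, not this PR's code): puts and gets use timeouts so the stop event and the peer's liveness get re-checked, and producer failures travel through the queue as a sentinel instead of silently killing the thread:

    import queue

    _END = object()  # sentinel enqueued when the producer finishes or fails

    def _producer_loop(iterator, out_queue, stop_event):
        """Runs in the background thread: fill out_queue until exhausted or stopped."""
        error = None
        try:
            for item in iterator:
                while not stop_event.is_set():
                    try:
                        out_queue.put((item, None), timeout=0.1)
                        break              # enqueued; move on to the next item
                    except queue.Full:
                        continue           # queue full: loop back and re-check stop_event
                if stop_event.is_set():
                    return
        except Exception as e:             # forward the failure instead of dying silently
            error = e
        while not stop_event.is_set():
            try:
                out_queue.put((_END, error), timeout=0.1)
                return
            except queue.Full:
                continue

    def _consume(in_queue, producer_thread):
        """Runs in the caller's thread: yield items until the sentinel (or a dead producer)."""
        while True:
            try:
                item, error = in_queue.get(timeout=0.1)
            except queue.Empty:
                if not producer_thread.is_alive() and in_queue.empty():
                    return                 # producer is gone and nothing is left: do not hang
                continue
            if item is _END:
                if error is not None:
                    raise error            # re-raise the producer's exception in the consumer
                return
            yield item

Whatever the final mechanism (timeouts, sentinels, or a daemon thread that tolerates an abandoned queue), the property the unit tests should pin down is the one described above: neither side may block forever once the other has failed.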

                yield batch
        except Exception as e:
            self._error = e
            logger.error('Iteration on Petastorm DataLoader raise error: %s', repr(e))
            raise
        finally:
            self._in_iter = False
            if isinstance(iterator, BackgroundIterator):
                iterator.stop.set()
@selitvin (Collaborator)

Let's make stop a private member (i.e. _stop) and add an API to the BackgroundIterator that performs stop (encapsulation principle).

    def __init__(self, iterator, queue_size=1000):
        threading.Thread.__init__(self)
        self.name = "background_iterator"
        self.queue = Queue(queue_size)
@selitvin (Collaborator)

Let's mark all data members that are not intended to be exposed to BackgroundIterator users as private (_ prefix).
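
A minimal sketch of what these two encapsulation comments add up to (illustrative only, not the PR's final code). The event is named _stop_event rather than _stop here because threading.Thread already defines an internal _stop() method in CPython, which an Event attribute would shadow:

    import threading
    from queue import Queue

    class BackgroundIterator(threading.Thread):
        def __init__(self, iterator, queue_size=1000):
            super().__init__(name="background_iterator", daemon=True)
            self._iterator = iterator                 # private: not part of the public API
            self._queue = Queue(queue_size)
            self._stop_event = threading.Event()

        def stop(self):
            """Public API: ask the producer thread to shut down; run() polls this event."""
            self._stop_event.set()

The DataLoader's finally block shown above would then call iterator.stop() instead of reaching into iterator.stop.set().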

@chongxiaoc (Collaborator)

Just noticed this nice work. Thanks, @jarandaf!
Is it doable to support parallel shuffling as well?
I think shuffling is usually the bottleneck, and petastorm uses a single thread for it.
