fix: Count full dataset in DatasetShard.__len__ #3414
base: master
Conversation
# Sum over all batches when using a DatasetIterator
count = sum(map(lambda b: b.count(), self.epoch_iter.iter_batches()))
Couple of quick questions:
- When would we hit the TypeError? Is it in the case that self.epoch_iter is not a DatasetIterator object?
- Does this end up doing a pass over the entire data and call count on each batch, then sum that total?
- Yes, that's for backwards compatibility. I'm leaving it in for now just in case, but I think 2.3+ should follow the DatasetIterator path.
- Yes.
`iter_batches` returns an iterator over the batches of one epoch. It batches by block if `batch_size` isn't provided, but we pass `batch_size` to the actual batcher in a separate `iter_batches` call.
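As a minimal sketch of the counting approach, with a stand-in batch class in place of the real Ray Data batches (the names below are illustrative, not Ray's API):

```python
class FakeBatch:
    """Stand-in for a Ray Data batch; only the count() method is modeled."""

    def __init__(self, num_rows):
        self.num_rows = num_rows

    def count(self):
        # Number of rows in this batch.
        return self.num_rows


def shard_len(batches):
    # Sum over all batches, mirroring the one-liner in the diff:
    # sum(map(lambda b: b.count(), self.epoch_iter.iter_batches()))
    return sum(b.count() for b in batches)


print(shard_len([FakeBatch(256), FakeBatch(256), FakeBatch(128)]))  # 640
```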
- Makes sense!
- How expensive is that pass over the data?
- The cost is relative to dataset size and object store memory, but running time for a number of different dataset sizes seems to be roughly the same as the other methods of getting dataset size from Ray Data.
# Sum over all batches when using a DatasetIterator
count = sum(map(lambda b: b.count(), self.epoch_iter.iter_batches()))
At this point, what is the type of epoch_iter?
If we get to that point, we're using a `DatasetIterator` with no underlying pipeline. AFAIK this should not happen, but if it does, this catch-all should prevent a crash. We typically see `PipelinedDataIterator`, which has the `_base_dataset_pipeline` attribute. That lets us call `count` directly on a pipeline object, similar to how counting worked with Ray < 2.3.
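A sketch of that dispatch, with fake classes standing in for the Ray objects (the attribute and method names follow the discussion above; everything else is assumed):

```python
class FakePipeline:
    """Stand-in for a DatasetPipeline: iterating yields an object with count()."""

    def __init__(self, total_rows):
        self.total_rows = total_rows

    def __next__(self):
        return self

    def count(self):
        return self.total_rows


class FakeBatch:
    def __init__(self, num_rows):
        self.num_rows = num_rows

    def count(self):
        return self.num_rows


class FakePipelinedIterator:
    """Has _base_dataset_pipeline, like PipelinedDataIterator."""

    def __init__(self, total_rows):
        self._base_dataset_pipeline = FakePipeline(total_rows)


class FakePlainIterator:
    """A DatasetIterator with no underlying pipeline."""

    def __init__(self, batch_sizes):
        self.batch_sizes = batch_sizes

    def iter_batches(self):
        return (FakeBatch(n) for n in self.batch_sizes)


def shard_count(epoch_iter):
    base = getattr(epoch_iter, "_base_dataset_pipeline", None)
    if base is not None:
        # Typical case: count directly on the pipeline, as with Ray < 2.3.
        return next(base).count()
    # Catch-all: no underlying pipeline, so sum the batch counts
    # instead of crashing.
    return sum(b.count() for b in epoch_iter.iter_batches())


print(shard_count(FakePipelinedIterator(640)))     # 640
print(shard_count(FakePlainIterator([256, 256])))  # 512
```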
# expand_paths returns two lists, so get the first element of each
read_path = read_path[0]
file_size = file_size[0]
try:
Rather than do a try/except, can you just make this conditional on the Ray version? We have a lot of examples of this in the code already:
https://github.com/ludwig-ai/ludwig/blob/master/ludwig/backend/ray.py#L365
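The version-gating pattern being suggested looks roughly like this (the version string below is a stand-in; the real code imports ray and compares `ray.__version__`):

```python
from packaging import version

# Stand-in for ray.__version__; in the real code this comes from `import ray`.
ray_version = "2.3.1"

# version.parse compares release segments numerically, so "2.10.0"
# correctly sorts after "2.3.0" (plain string comparison would not).
if version.parse(ray_version) >= version.parse("2.3.0"):
    use_dataset_iterator = True   # Ray >= 2.3 code path
else:
    use_dataset_iterator = False  # legacy code path

print(use_dataset_iterator)  # True
```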
Sure, I'll convert it.
}
import ray

if version.parse(ray.__version__) >= version.parse("2.3.0"):
You can set this to a variable at the top; the parsing is somewhat costly, so it's better to avoid doing it more than once.
import ray

if version.parse(ray.__version__) >= version.parse("2.3.0"):
Is `2.3.0` the right min version? It was working before with 2.3, right? So maybe we just want to check against 2.4?
Will do. The new mlflow integration also works with 2.3, but we can introduce it with the 2.4 bump.
count = next(self.epoch_iter._base_dataset_pipeline).count()
else:
    count = next(self.epoch_iter).count()
except TypeError:
This is a bit hard to follow. Can we make this conditional on the Ray version instead of catching a TypeError?
I can break this out into more fine-grained conditions.
pipeline = next(self.dataset_epoch_iterator)
try:
    pipeline = next(self.dataset_epoch_iterator)
except TypeError:
Similar to above, it isn't clear why or when the TypeError happens. I would prefer checking the Ray version, or at least leaving a comment explaining what can cause the error.
pipeline = pipeline.map_batches(augment_batch, batch_size=batch_size, batch_format="pandas")

for batch in pipeline.iter_batches(prefetch_blocks=0, batch_size=batch_size, batch_format="pandas"):
    if _ray_230:
        batch = augment_batch(batch)
Oof, not good. This does augmentation in the worker process. We definitely don't want that, as it could slow down training by sucking up CPU cycles. Why does the map_batches call above no longer work? Can we fix it?
So `DatasetIterator` objects don't have a `map_batches` method, and calling `map_batches` on the underlying `_base_dataset_pipeline` in the iterator leads to downstream problems reusing the pipeline that aren't resolved by calling `repeat`. I'll play around with this and see if we can get it into a pipeline.
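To illustrate the placement issue without Ray, here is the same transform applied in the two positions under discussion (`augment_batch` below is a hypothetical stand-in; the real function is not shown in the PR):

```python
import pandas as pd


def augment_batch(batch: pd.DataFrame) -> pd.DataFrame:
    # Hypothetical augmentation: double a numeric column.
    out = batch.copy()
    out["x"] = out["x"] * 2
    return out


batches = [pd.DataFrame({"x": [1, 2]}), pd.DataFrame({"x": [3]})]

# (1) map_batches-style: the transform runs as a pipeline stage, so a
#     distributed runtime can execute it off the training worker.
pipelined = [augment_batch(b) for b in batches]

# (2) iter_batches-style: the training worker applies the transform while
#     consuming batches, spending its own CPU cycles on augmentation.
in_worker = [augment_batch(b) for b in iter(batches)]

print(pipelined[0]["x"].tolist())  # [2, 4]
```

The results are identical; the objection in the thread is about *where* the work runs, not *what* it computes.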
# Explicitly raise a RuntimeError if an error is encountered during a Ray trial.
# NOTE: Cascading the exception with "raise _ from e" still results in hanging.
raise RuntimeError(f"Encountered Ray Tune error: {e}")
raise RuntimeError(f"Encountered Ray Tune error: {traceback.format_exc()}")
You can change this to:
raise RuntimeError(...) from e
That way you get the whole traceback without needing to turn it into a string.
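The chaining mechanics look like this in isolation (note the diff's NOTE says `raise ... from e` still resulted in hanging in the Tune context, so this only demonstrates how the traceback is preserved):

```python
def run_trial():
    # Stand-in for a failing Ray Tune trial.
    raise ValueError("boom")


chained_cause = None
try:
    run_trial()
except ValueError as e:
    try:
        # `from e` attaches the original exception as __cause__, so Python
        # prints both tracebacks automatically; no manual
        # traceback.format_exc() is needed in the message.
        raise RuntimeError("Encountered Ray Tune error") from e
    except RuntimeError as err:
        chained_cause = err.__cause__

print(type(chained_cause).__name__)  # ValueError
```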
Running the new test with the previous code, `DatasetShard.__len__` returned 256 rather than the full dataset size defined in the test. Summing the sizes of the batches returned by `DatasetIterator.iter_batches` returns the full dataset size.