
Add CatalogStream Interface #1246

Merged
dougbrn merged 24 commits into main from data_iterator on Feb 11, 2026

Conversation

dougbrn (Contributor) commented Feb 5, 2026

Closes #1042. I'm generally pretty open on a lot of the design decisions, naming, etc. here. I didn't attempt the PyTorch integration mentioned in the issue; happy to discuss that further!

codecov bot commented Feb 5, 2026

Codecov Report

❌ Patch coverage is 97.22222% with 2 lines in your changes missing coverage. Please review.
✅ Project coverage is 96.67%. Comparing base (b1f2780) to head (5954018).
⚠️ Report is 25 commits behind head on main.

| Files with missing lines | Patch % | Lines |
|---|---|---|
| src/lsdb/streams/catalog_streams.py | 97.18% | 2 Missing ⚠️ |
Additional details and impacted files
@@            Coverage Diff             @@
##             main    #1246      +/-   ##
==========================================
+ Coverage   96.66%   96.67%   +0.01%     
==========================================
  Files          46       48       +2     
  Lines        2877     2949      +72     
==========================================
+ Hits         2781     2851      +70     
- Misses         96       98       +2     


github-actions bot commented Feb 5, 2026

| Before [b1f2780] | After [c09cf7a] | Ratio | Benchmark (Parameter) |
|---|---|---|---|
| 178±1ms | 188±5ms | 1.05 | benchmarks.time_open_many_columns_list |
| 51.2±1ms | 53.6±0.6ms | 1.05 | benchmarks.time_polygon_search |
| 107±6ms | 110±2ms | 1.03 | benchmarks.time_kdtree_crossmatch |
| 7.00±0.05s | 7.16±0.09s | 1.02 | benchmarks.time_create_large_catalog |
| 396±8ms | 403±7ms | 1.02 | benchmarks.time_open_many_columns_default |
| 30.9±0.7ms | 31.1±0.5ms | 1.01 | benchmarks.time_box_filter_on_partition |
| 1.06±0.02s | 1.06±0.01s | 1.00 | benchmarks.time_create_midsize_catalog |
| 8.47±0.01s | 8.44±0.03s | 1.00 | benchmarks.time_lazy_crossmatch_many_columns_all_suffixes |
| 8.52±0.01s | 8.55±0.05s | 1.00 | benchmarks.time_lazy_crossmatch_many_columns_overlapping_suffixes |
| 3.90±0.03s | 3.90±0.03s | 1.00 | benchmarks.time_open_many_columns_all |


dougbrn marked this pull request as ready for review February 5, 2026 22:13
dougbrn requested a review from hombit February 5, 2026 22:45
hombit (Contributor) left a comment

I have a few overall design questions

  1. Maybe we should implement an iterator and an iterable separately, so the iterable object can be reused (see the sketch after this list).
  2. I'm not sure there is a good use case for non-None `iter_limit > 1`: we could get row duplicates with it even in a single batch, which is not the case for 1 and None. I believe that for most ML approaches `iter_limit = 1` and `iter_limit = None` are enough, and if people would like to do multiple epochs, they would reuse an iterable object n times.
  3. I understand that it may be out of scope and looks more like a Dask feature request, but I think we should also support anything a user can get from a Catalog object, including a Dask Series they can get with `['column']` or `.map_partitions`.
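A plain-Python illustration of point 1 (not LSDB's actual classes): a bare iterator is exhausted after one pass, while an iterable that builds a fresh iterator in `__iter__` can be looped once per epoch.

```python
class OneShot:
    """Wraps a single iterator; a second loop over it yields nothing."""

    def __init__(self, data):
        self._it = iter(data)

    def __iter__(self):
        return self._it


class Reusable:
    """Builds a fresh iterator on every __iter__, so it can be re-looped."""

    def __init__(self, data):
        self._data = data

    def __iter__(self):
        return iter(self._data)


one_shot = OneShot([1, 2, 3])
print(list(one_shot), list(one_shot))  # [1, 2, 3] [] -- exhausted after one pass
reusable = Reusable([1, 2, 3])
print(list(reusable), list(reusable))  # [1, 2, 3] [1, 2, 3]
```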

dougbrn (Contributor, Author) commented Feb 6, 2026

Thanks for taking a look!

  1. I agree with this. I'm not 100% sure where the dividing line between the two will be or how the reuse will look API-wise, but I can make an attempt at it.
  2. Fair point. I had originally introduced this as a way to allow longer iteration without stepping on the API rake where this single-iteration pattern:

```python
cat_iter = lsdb.CatalogIterator(cat, loop=False)

for chunk in cat_iter:
    ...  # do something
```

turns into an endless loop after simply switching the kwarg value:

```python
cat_iter = lsdb.CatalogIterator(cat, loop=True)

for chunk in cat_iter:
    ...  # do something
```

but that change in behavior probably makes it not worth it; users will just need to be aware of what they're doing when switching that bool.
  3. I'm not sold on this. I understand that there will be situations where users end up with a series as a result of catalog work, but it feels like blurring the lines between what a user can do in LSDB with a series vs. with a catalog. For example, even if something like `catalog["nested"].crossmatch` might feel right, we don't allow crossmatch to flex for a `dd.Series` (I know it's not a perfect comparison). From my perspective, it's also trivial in column selection and in `map_partitions` to return a dataframe (catalog) in place of a series; see the sketch below.
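A minimal sketch of that last point, wrapping a Series-producing transform back into a one-column dataframe so the result of `map_partitions` stays a catalog; the loader call, path, and `mag` column are illustrative placeholders, not part of this PR:

```python
import lsdb

catalog = lsdb.open_catalog("path/to/catalog")  # placeholder path/loader

# Returning a DataFrame (here via .to_frame) instead of a bare Series
# keeps the result "catalogy" rather than producing a Dask Series:
result = catalog.map_partitions(
    lambda df: (df["mag"] ** 2).to_frame(name="mag_squared")
)
```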

hombit (Contributor) commented Feb 6, 2026

@dougbrn

(2). I like loop!
(3). Ok, let's keep it catalogy for now!

(1). This is how it could look:

```python
# interface
stream = catalog.partition_stream()
for epoch in range(1000):
    print(f'Training epoch {epoch}')
    for df in stream:
        train_batch(df)

# implementation

class Catalog:
    def partition_stream(self, ...): return CatalogPartitionStream(...)

class CatalogPartitionStream:
    # Defines random seed, etc., so each epoch's training has different shuffling
    # Doesn't define partitions_left, etc.
    def __init__(self, ...): ...
    def __iter__(self): return CatalogPartitionIterator(...)
    # Doesn't have __next__

class CatalogPartitionIterator:
    # Defines iteration state, e.g. partitions_left, etc.
    def __init__(self, ...): ...
    def __iter__(self): return self
    def __next__(self): ...
```

hombit (Contributor) commented Feb 6, 2026

The kwarg could also be `infinite` instead of `loop`. Or we can just drop that functionality and focus on one-time catalog scanning.

dougbrn changed the title from Add CatalogIterator Interface to Add CatalogStream Interface Feb 6, 2026
dougbrn requested a review from hombit February 6, 2026 23:42
dougbrn (Contributor, Author) commented Feb 6, 2026

@hombit I made some big changes based on our conversations!

hombit (Contributor) left a comment

I do like this implementation!

I think the Stream object should not be changed by an Iterator object. I propose "splitting" the rng when creating an iterator: pass a new one inside it, and then pass it back to `get_next_partitions`. So basically, the stream's rng is used to initialize new iterators, and the iterator's rng is used for all the shuffling.

dougbrn requested a review from hombit February 10, 2026 22:05
dougbrn (Contributor, Author) commented Feb 10, 2026

@hombit Now the rng should be split

hombit (Contributor) commented Feb 10, 2026

@dougbrn `rng.spawn(1)[0]`
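A minimal sketch of how `rng.spawn(1)[0]` slots into the stream/iterator split above, using numpy's `Generator.spawn` (numpy >= 1.25); class names echo the earlier sketch, and yielding partition indices is a stand-in for loading real partitions:

```python
import numpy as np


class CatalogPartitionIterator:
    """Holds per-iteration state; all shuffling uses its own rng."""

    def __init__(self, rng, n_partitions):
        self.rng = rng
        # Shuffle the partition order once, drawing only from the iterator's rng.
        self.order = list(rng.permutation(n_partitions))

    def __iter__(self):
        return self

    def __next__(self):
        if not self.order:
            raise StopIteration
        return self.order.pop(0)  # stand-in for loading a real partition


class CatalogPartitionStream:
    """Reusable; each iter() spawns an independent child rng for a fresh shuffle."""

    def __init__(self, n_partitions, seed=None):
        self.n_partitions = n_partitions
        self.rng = np.random.default_rng(seed)

    def __iter__(self):
        # spawn(1)[0] yields an independent child generator, so shuffling
        # inside the iterator never consumes draws from the stream's rng.
        return CatalogPartitionIterator(self.rng.spawn(1)[0], self.n_partitions)


stream = CatalogPartitionStream(n_partitions=4, seed=0)
print(list(stream))  # one epoch's shuffled partition order
print(list(stream))  # a second epoch, independently shuffled
```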

hombit (Contributor) left a comment

Looks good, thank you!

dougbrn merged commit 2372c11 into main Feb 11, 2026
12 checks passed
dougbrn deleted the data_iterator branch February 11, 2026 17:03