refactor(datasets): use the shared scverse-misc dataset registry + downloader #1213

Draft
timtreis wants to merge 5 commits into scverse:main from timtreis:feat/datasets-via-scverse-misc

@timtreis

Demonstrates squidpy consuming the shared dataset infrastructure proposed in
scverse/scverse-misc#40, replacing squidpy's internal pooch-based registry/downloader.

⚠️ Draft — depends on scverse-misc#40 merging and a scverse-misc release that ships the
datasets extra. The dependency is currently scverse-misc[datasets] (unpinned); it'll
need a version floor once released.

What moves upstream vs. what stays

| Stays in squidpy | Now provided by scverse-misc[datasets] |
|---|---|
| datasets.yaml (unchanged) | DatasetRegistry / DatasetEntry / FileEntry |
| datasets.yaml -> registry mapping (folds shape/library_id into metadata) | pooch download + SHA-256 verify + URL fallback + archive extract |
| Domain loaders: image -> ImageContainer, visium_10x -> read.visium, spatialdata -> read_zarr, anndata (shape-warning override) | The Fetcher + pluggable register_loader registry |
| Public API: sq.datasets.* (unchanged) | |
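
The "datasets.yaml -> registry mapping" row can be pictured roughly as follows. This is a minimal sketch, not squidpy's `_registry.py`: the `Entry` dataclass, the `fold_entry` helper, the layout of the raw record, and the file name `imc.h5ad` are all assumptions made for illustration. Only the idea of moving `shape` and `library_id` into a generic `metadata` mapping, and the `imc` shape `(4668, 34)`, come from this PR.

```python
from dataclasses import dataclass, field
from typing import Any

# Keys that are squidpy-specific and get folded into the generic metadata
# mapping (key names from the PR description; everything else is illustrative).
SQUIDPY_KEYS = ("shape", "library_id")


@dataclass(frozen=True)
class Entry:
    """Hypothetical stand-in for a generic dataset entry."""

    name: str
    type: str
    files: tuple[dict[str, Any], ...]
    metadata: dict[str, Any] = field(default_factory=dict)


def fold_entry(name: str, raw: dict[str, Any]) -> Entry:
    """Build a generic entry, moving squidpy-only keys into ``metadata``."""
    metadata = {key: raw[key] for key in SQUIDPY_KEYS if key in raw}
    if "shape" in metadata:
        # A YAML parser yields a list; a tuple compares cleanly with ``.shape``.
        metadata["shape"] = tuple(metadata["shape"])
    return Entry(
        name=name,
        type=raw["type"],
        files=tuple(raw.get("files", ())),
        metadata=metadata,
    )


# Hypothetical record, shaped like one parsed YAML entry.
raw_imc = {
    "type": "anndata",
    "shape": [4668, 34],
    "files": [{"name": "imc.h5ad", "sha256": "<placeholder>"}],
}
imc = fold_entry("imc", raw_imc)
```

Keeping the domain-specific keys in one free-form mapping means the shared entry type does not need to know about fields that only one consumer uses.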

Net effect

  • ~750 lines deleted from squidpy (the duplicated registry/downloader).
  • DatasetType enum -> free-form type strings dispatched via register_loader.
  • Direct pooch dependency dropped (now transitive via scverse-misc[datasets]).
  • Public API unchanged: sq.datasets.cells(), visium_hne_sdata(), visium(), and
    the anndata/image loaders all keep the same signatures and behavior.

Validation

  • tests/datasets/{test_registry,test_downloader,test_dataset}.py rewritten to the new
    structure — 34 passed, 5 internet deselected locally.
  • End-to-end live downloads through the new system verified:
    • sq.datasets.cells() -> SpatialData with all elements (spatialdata loader)
    • sq.datasets.imc() -> AnnData (4668, 34) (anndata loader)

Notes for reviewers

  • Cache subdir is now the dataset type (visium_10x, image) rather than the old
    visium and images. This is internal-only, but the CI prefetch script (.scripts/ci/download_data.py)
    and any hard-coded cache paths should be updated in a follow-up.
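
For the follow-up on hard-coded cache paths, a small helper along these lines could translate old locations to the new layout. It is only a sketch: the function name `migrate_cache_path` and the sample paths are hypothetical, and it assumes the note above refers to two old directory names, `visium` and `images`, with all other subdirectories unchanged.

```python
from pathlib import PurePosixPath

# Old cache subdirectory -> new one (the dataset type), as read from the
# reviewer note. Other subdirectories are assumed to keep their name.
RENAMED_SUBDIRS = {"visium": "visium_10x", "images": "image"}


def migrate_cache_path(path: PurePosixPath, cache_root: PurePosixPath) -> PurePosixPath:
    """Rewrite a path under ``cache_root`` to the type-based layout."""
    # relative_to raises ValueError if ``path`` is not under ``cache_root``.
    relative = path.relative_to(cache_root)
    if not relative.parts:
        return path  # the cache root itself; nothing to rename
    first, *rest = relative.parts
    return cache_root.joinpath(RENAMED_SUBDIRS.get(first, first), *rest)
```

Only the first path component below the cache root is rewritten, so a sample or file that happens to be named `visium` deeper in the tree is left alone.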

🤖 Generated with Claude Code

timtreis and others added 5 commits June 15, 2026 13:08
Replace squidpy's internal pooch-based registry/downloader with the shared
scverse_misc.datasets system (scverse-misc[datasets]):

- _registry.py: build a scverse_misc DatasetRegistry from datasets.yaml,
  folding squidpy-specific shape/library_id into the generic metadata mapping.
  Drops squidpy's duplicated FileEntry/DatasetEntry/DatasetRegistry/DatasetType.
- _downloader.py: register squidpy's domain loaders (image -> ImageContainer,
  visium_10x -> read.visium, spatialdata -> read_zarr) via register_loader and
  override the built-in anndata loader for the shape warning. The pooch
  download/verify/extract machinery now lives in scverse-misc.
- _datasets.py: public API unchanged; type dispatch uses plain strings.
- pyproject: drop direct pooch dep (now via scverse-misc[datasets]).

Net ~750 lines deleted. Public API (sq.datasets.*) is unchanged.

Depends on scverse/scverse-misc#40.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

scverse-misc now ships a generic spatialdata loader, so squidpy no longer needs
its own; it registers only its domain loaders (image, visium_10x) plus the
anndata shape-warning override.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

Drop the redundant 'visium' prefix in visium() so downloads land in
<datasetdir>/visium_10x/<sample>/ like every other type (was doubly nested
under visium/visium_10x). Update the hires-image path assertion accordingly.

Verified: all @internet datasets tests pass (8 passed).

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

registry.{anndata,image,spatialdata}_datasets were squidpy's old registry
properties, removed in the migration. Use dataset_names(type) instead. Visium
samples now cache to <datasetdir>/visium_10x/<sample>/ via the public API.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

scverse-misc dropped its DatasetRegistry/Fetcher/FetchContext classes for a typed
data model + functions. Adapt:
- _registry: parse_registry() -> (base_url, dict[str, DatasetEntry]); get_registry()
  returns the dict, get_base_url() the base. shape/library_id/doc_header now live in
  entry.metadata.
- _downloader: loaders are (entry, target, download, **kwargs); DatasetDownloader wraps
  fetch(); visium uses pooch.Untar instead of a manual tarfile loop.
- _datasets/tests updated to the dict + metadata shape.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
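
The loader shape described in the last commit, `(entry, target, download, **kwargs)`, combined with the anndata shape warning, might look something like the sketch below. Several things here are assumptions rather than facts from the PR: `download` is treated as a callable that takes the entry and target and returns a local path (the commit message does not say what it is), the `reader` keyword is injected so the sketch runs without anndata installed, and `Entry` and `FakeAnnData` are hypothetical stand-ins.

```python
import warnings
from collections.abc import Callable
from dataclasses import dataclass, field
from pathlib import Path
from typing import Any


@dataclass
class Entry:
    """Hypothetical stand-in for the upstream DatasetEntry."""

    name: str
    metadata: dict[str, Any] = field(default_factory=dict)


@dataclass
class FakeAnnData:
    """Tiny stand-in object exposing only ``shape``."""

    shape: tuple[int, int]


def load_anndata(
    entry: Entry,
    target: Path,
    download: Callable[[Entry, Path], Path],  # assumed signature
    *,
    reader: Callable[..., Any],  # e.g. anndata.read_h5ad in real use
    **kwargs: Any,
) -> Any:
    """Fetch the file, read it, and warn if the shape differs from metadata."""
    path = download(entry, target)
    adata = reader(path, **kwargs)
    expected = entry.metadata.get("shape")
    if expected is not None and tuple(adata.shape) != tuple(expected):
        warnings.warn(
            f"Expected `{entry.name}` to have shape {tuple(expected)}, "
            f"found {tuple(adata.shape)}.",
            UserWarning,
            stacklevel=2,
        )
    return adata
```

Reading the expected shape from `entry.metadata` keeps the check optional: entries without a recorded shape load silently.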