Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

(feat): allow join=outer for concat_on_disk #1504

Merged
merged 13 commits into from
Jul 4, 2024

Conversation

ilan-gold
Copy link
Contributor

Copy link

codecov bot commented May 22, 2024

Codecov Report

All modified and coverable lines are covered by tests ✅

Project coverage is 84.45%. Comparing base (2c9e400) to head (ff68f99).
Report is 73 commits behind head on main.

Additional details and impacted files
@@            Coverage Diff             @@
##             main    #1504      +/-   ##
==========================================
+ Coverage   84.44%   84.45%   +0.01%     
==========================================
  Files          36       36              
  Lines        5747     5745       -2     
==========================================
- Hits         4853     4852       -1     
+ Misses        894      893       -1     
Files with missing lines Coverage Δ
src/anndata/experimental/merge.py 91.79% <ø> (+0.42%) ⬆️

@ilan-gold ilan-gold marked this pull request as ready for review May 22, 2024 14:19
Comment on lines -539 to -542
if join != "inner":
raise NotImplementedError("only inner join is currently supported")
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I am trying to become more familiar with the code, but it's shocking this was all that was needed. That being said, the tests do pass and I looked at the results so 🤷

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I mean, looking at the code path, it makes some sense. concat_on_disk constructs sensible-looking reindexers for the alt_dim axis and then passes those on so there shouldn't be a need for the join argument anywhere else despite the option to use it

@zimea
Copy link

zimea commented May 22, 2024

In a first run, the object didn't contain the "var" dimension. Could you check whether you can replicate it?

@ilan-gold
Copy link
Contributor Author

@zimea #1505 is what you want but there are clearly some problems there. So if you need you var you can do something like this for each anndata:

my_adatas = []
for path in my_paths:
    adata = ad.read_h5ad(path)
    del adata.X
    my_adatas += [adata]
merged_in_memory = concat(my_adatas, merge="...")
merged_on_disk.var = merged_in_memory.var

which should be manageable unless your obsm is also very large, in which case I would del that as well. This isn't pretty but until I also fix #1505, it should work. Also you will need to choose a merge strategy based on what you want: https://anndata.readthedocs.io/en/stable/generated/anndata.concat.html

@ilan-gold
Copy link
Contributor Author

Ok @ivirshup I think I have a slightly better sense of what you mean when you talk about concat_on_disk + dask. It seems like concat_on_disk won't disappear but will be vastly simplified by using dask for which we need #1469.

For example, I have been looking into adding merge support and the main stumbling block there appears to be that doing anything with SparseDataset classes is difficult i.e., reindexing + writing, which is why

def _gen_slice_to_append(
datasets: Sequence[BaseCompressedSparseDataset],
reindexers,
max_loaded_elems: int,
axis=0,
fill_value=None,
):
for ds, ri in zip(datasets, reindexers):
n_slices = ds.shape[axis] * ds.shape[1 - axis] // max_loaded_elems
if n_slices < 2:
yield (csr_matrix, csc_matrix)[axis](
ri(to_memory(ds), axis=1 - axis, fill_value=fill_value)
)
else:
slice_size = max_loaded_elems // ds.shape[1 - axis]
if slice_size == 0:
slice_size = 1
rem_slices = ds.shape[axis]
idx = 0
while rem_slices > 0:
ds_part = None
if axis == 0:
ds_part = ds[idx : idx + slice_size, :]
elif axis == 1:
ds_part = ds[:, idx : idx + slice_size]
yield (csr_matrix, csc_matrix)[axis](
ri(ds_part, axis=1 - axis, fill_value=fill_value)
)
rem_slices -= slice_size
idx += slice_size
exists. So I think to move forward with the various merge types we should merge #1469 and then integrate read_dask_as_elem here. Sound good?

@ilan-gold ilan-gold self-assigned this May 29, 2024
@flying-sheep flying-sheep modified the milestones: 0.10.8, 0.10.9 Jun 18, 2024
Copy link
Member

@selmanozleyen selmanozleyen left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks good. Happy to see this was somewhat smooth.

@ilan-gold ilan-gold requested a review from flying-sheep June 28, 2024 14:21
Copy link
Member

@flying-sheep flying-sheep left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks good. It tests correctness, and if performance turns out to be a problem, we can do that later

@ilan-gold ilan-gold merged commit 686786d into main Jul 4, 2024
14 of 15 checks passed
@ilan-gold ilan-gold deleted the ig/concat_on_disk_outer_join branch July 4, 2024 08:54
Copy link

lumberbot-app bot commented Jul 4, 2024

Owee, I'm MrMeeseeks, Look at me.

There seem to be a conflict, please backport manually. Here are approximate instructions:

  1. Checkout backport branch and update it.
git checkout 0.10.x
git pull
  1. Cherry pick the first parent branch of the this PR on top of the older branch:
git cherry-pick -x -m1 686786db35cafd8cbd7ebcc0128d5239fb06bce7
  1. You will likely have some merge/cherry-pick conflict here, fix them and commit:
git commit -am 'Backport PR #1504: (feat): allow `join=outer` for `concat_on_disk`'
  1. Push to a named branch:
git push YOURFORK 0.10.x:auto-backport-of-pr-1504-on-0.10.x
  1. Create a PR against branch 0.10.x, I would have named this PR:

"Backport PR #1504 on branch 0.10.x ((feat): allow join=outer for concat_on_disk)"

And apply the correct labels and milestones.

Congratulations — you did some good work! Hopefully your backport PR will be tested by the continuous integration and merged soon!

Remember to remove the Still Needs Manual Backport label once the PR gets merged.

If these instructions are inaccurate, feel free to suggest an improvement.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

concat_on_disk outer join
4 participants