improve move_to_x #661

xinyuejohn · 2024-03-04T08:17:57Z

PR Checklist

This comment contains a description of changes (with reason)
Referenced issue is linked
If you've fixed a bug or added code that should be tested, add tests!
Documentation in docs is updated

Description of changes
Add three optional arguments to move_to_x: copy_uns, copy_obsm, and copy_varm. All three default to True.

Technical details
Tests have already been added.

Additional context


eroell · 2024-03-05T13:13:29Z

A few points on this:

I wonder if it is a good idea to store computed results (e.g. adata.obsm["X_pca"]) together with altered data that would actually yield different results if recomputed. Is there a striking example that justifies doing so?
If .obsm is copied, then by the same logic .obsp should be copied as well, I think?
Copying .varm should be omitted in any case, in my opinion: it contains per-feature computed results. move_to_x actually adds a new feature, so in addition to the concern about changed data above, the dimensionality of the data no longer matches. The example below shows how quickly this fails :)

import ehrapy as ep
adata = ep.dt.diabetes_130(columns_obs_only=["race", "gender", "age"])
adata_prep = adata[
    :,
    (adata.var.index == "time_in_hospital_days")
    | (adata.var.index == "num_lab_procedures")
    | (adata.var.index == "num_procedures")
    | (adata.var.index == "num_medications")
    | (adata.var.index == "number_diagnoses"),
]
ep.pp.pca(adata_prep, n_comps=2)
adata_prep.varm["PCs"]
adata_prep_2 = ep.ad.move_to_x(adata_prep, "gender", copy_varm=True)

ValueError: Value passed for key 'PCs' is of incorrect shape. Values of varm must match dimensions ('var',) of parent. Value had shape (5,) while it should have had (6,).

Considering that move_to_x effectively creates a new dataset, it might be justified and/or necessary to (re)compute things?
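
The shape mismatch in the traceback above can be reproduced without ehrapy. The following is a minimal NumPy sketch under stated assumptions, not the library's implementation: `check_varm` is a hypothetical helper that mimics the rule that every .varm entry must have one row per variable, and the arrays are placeholder data.

```python
import numpy as np


def check_varm(n_vars: int, varm: dict[str, np.ndarray]) -> list[str]:
    """Return the keys of varm whose first dimension does not match n_vars."""
    return [key for key, value in varm.items() if value.shape[0] != n_vars]


# Five numeric variables and PCA loadings with two components (placeholder values).
X = np.random.default_rng(0).normal(size=(10, 5))
varm = {"PCs": np.zeros((5, 2))}
assert check_varm(X.shape[1], varm) == []

# Moving one column (e.g. "gender") into X adds a sixth variable,
# so the old loadings no longer line up with the variables.
gender = np.zeros((10, 1))
X_new = np.hstack([X, gender])
mismatched = check_varm(X_new.shape[1], varm)
print(mismatched)
```

Any per-variable result computed before the move would need to be recomputed or dropped, since there is no row for the newly added variable.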

eroell · 2024-03-05T13:28:45Z

tests/anndata/test_anndata_ext.py

+
+
+def test_move_to_x_copy_varm(adata_move_obs_mix):
+    move_to_obs(adata_move_obs_mix, ["name"], copy_obs=True)


Doing copy_obs=True here means that "name" is kept as a variable, so move_to_x doesn't actually move it back; that is why the failure from the example in my comment stayed hidden :)

Zethson · 2024-03-05T13:51:12Z

@xinyuejohn could you also outline for us again what exactly you need this feature for, please? I do totally see @eroell's points

xinyuejohn · 2024-03-07T10:06:44Z

Thanks for your reply and I totally got your points!

Here's my use case:

adata.obs = pd.merge(adata.obs, df_statistics, how="left", left_index=True, right_index=True)
adata = ep.ad.move_to_x(adata, list(df_statistics.columns))
adata.obsm = obsm

In .obsm, there are some awkward arrays which are still valid as long as the number of observations stays the same.

eroell · 2024-03-08T10:13:31Z

Thanks for the example - so to my understanding you add some (maybe externally computed) statistics to the .obs field, and then move these to the .X field.

Quick clarification: what is in the awkward arrays (I guess not a PCA embedding as in my comment above) in the .obsm field that would still be valid?

xinyuejohn · 2024-03-11T09:12:06Z

I think this diagram will help you understand what I want to do. Please kindly have a look at it.

In adata, each row represents a patient's visit.
In .obs, there are all the episode-level features. Each has a single value without any time information, like height/age/gender.
In .obsm, all time-series features are stored as awkward arrays. Each has multiple records with both time information and values. For example, adata.obsm['heart rate'][0] = awkward_array(time: [0, 1, 2, 4, 5, 7], value: [99, 80, 78, 91, 70, 90])

In this setting, I can easily move data from .obs to .X using the move_to_x we are talking about, and I can also move .obsm to .X using some aggregations. And if I change .X, .obsm will still be valid, as it stores data unrelated to .X.
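
The layout described above can be sketched in plain Python. This is only an illustration under assumptions: plain dicts of lists stand in for the awkward arrays, nested lists stand in for .X, and `aggregate` is a hypothetical helper rather than an ehrapy function.

```python
from statistics import mean

# One entry per visit (row); each holds irregularly sampled heart-rate records.
heart_rate = [
    {"time": [0, 1, 2, 4, 5, 7], "value": [99, 80, 78, 91, 70, 90]},
    {"time": [0, 3], "value": [65, 72]},
]


def aggregate(series: list[dict]) -> list[list[float]]:
    """Reduce each visit's time series to [mean, max], one row per visit."""
    return [[mean(s["value"]), max(s["value"])] for s in series]


# Existing per-visit feature matrix (stand-in for .X), here a single "age" column.
X = [[54.0], [61.0]]
features = aggregate(heart_rate)
X_new = [row + extra for row, extra in zip(X, features)]

# The raw series still has one entry per visit, so it stays aligned with X_new.
assert len(heart_rate) == len(X_new)
print(X_new)
```

The raw series is never derived from X, which is why adding columns to X leaves it valid as long as the rows are unchanged.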

eroell · 2024-03-18T13:11:07Z

Thanks for the diagram, I think I get the point here - it is data in the .obsm in your case, not a computed result.
The .obsm field in the scRNA-seq setting is often used for computed results, e.g. most things that produce an embedding of some sort. In ehrapy, also many tools compute an embedding and store that in the .obsm field.

It probably makes sense to not mix things computed from .X with raw data together in .obsm?
Although expert users like you @xinyuejohn ofc can use this structure if helpful ;)

We have yet to include time series support in a meaningful way - a data field like your use of .obsm is one of the candidates for sure.

Whatever this might look like, copying all the -data- as you suggest here should indeed happen.

I think I would refrain from copying .obsm when doing move_to_x, as in many cases this field will hold results computed from .X

What do you think @xinyuejohn ?

Zethson · 2024-03-18T14:07:03Z

I also think that it's weird if it happens automatically, but maybe we can just parameterize this and document the use case well?

eroell · 2024-03-18T14:17:51Z

That would be .obsm, .uns, .obsp you'd consider taking in?

Zethson · 2024-03-19T19:23:39Z

Think so ya? Or just obsm and uns?!?

Zethson · 2024-04-18T14:44:45Z

So what do we do with this PR?

eroell · 2024-04-19T11:27:33Z

I'd be in favor of not having these values copied now, but storing this data later more consistently.

But I'm also fine having options for copying .obsm, .uns, .obsp (while setting the default to False)
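
The opt-in behaviour discussed here could look roughly like the sketch below. Everything in it is hypothetical: `Dataset` is a toy stand-in for an annotated data object (not AnnData), and `move_to_x_sketch` with its `copy_obsm` and `copy_uns` flags only illustrates "off by default, on when requested"; it is not the ehrapy signature.

```python
from dataclasses import dataclass, field


@dataclass
class Dataset:
    """Toy stand-in for an annotated data object (not AnnData itself)."""

    X: list[list[float]]
    obs: dict[str, list[float]]
    obsm: dict[str, list] = field(default_factory=dict)
    uns: dict[str, object] = field(default_factory=dict)


def move_to_x_sketch(
    data: Dataset, column: str, *, copy_obsm: bool = False, copy_uns: bool = False
) -> Dataset:
    """Append one obs column to X; carry obsm/uns over only when asked."""
    obs = {k: v for k, v in data.obs.items() if k != column}
    X = [row + [val] for row, val in zip(data.X, data.obs[column])]
    return Dataset(
        X=X,
        obs=obs,
        obsm=dict(data.obsm) if copy_obsm else {},
        uns=dict(data.uns) if copy_uns else {},
    )


data = Dataset(
    X=[[1.0], [2.0]],
    obs={"age": [54.0, 61.0]},
    obsm={"heart_rate": [[99, 80], [65, 72]]},
)
default = move_to_x_sketch(data, "age")
opted_in = move_to_x_sketch(data, "age", copy_obsm=True)
print(default.obsm, opted_in.obsm)
```

With the defaults set to False, results computed from the old X are not silently carried over, while users who store raw data in these fields can still keep it explicitly.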

Zethson · 2024-05-14T15:55:55Z

Okay, then let's make these options if it makes it easier for ehrdata. That can also depend on an evaluation by @eroell

eroell · 2024-08-30T13:58:13Z

I'd close this PR for now, as I think the underlying outline of data handling patterns will not be the way to go:

The .obsm field is likely not a suitable place to store time-series variables, as it lacks the accompanying .var field that many operations need.
Moving .obsm to .X fails when .obsm holds 3D arrays (.X must be 2D), which is what .obsm would often hold if time series were stored there.
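
The dimensionality problem can be illustrated with NumPy alone. This is a sketch under assumptions, with made-up shapes and placeholder arrays; it does not use ehrapy or AnnData.

```python
import numpy as np

n_obs, n_vars, n_time = 4, 3, 6
X = np.zeros((n_obs, 2))                    # 2D matrix, stand-in for .X
series = np.ones((n_obs, n_vars, n_time))   # 3D time-series block

# Direct concatenation fails because the number of dimensions differs.
try:
    np.hstack([X, series])
    stacked_directly = True
except ValueError:
    stacked_directly = False

# Reducing over the time axis first yields a 2D block that can be appended.
reduced = series.mean(axis=2)               # shape (n_obs, n_vars)
X_new = np.hstack([X, reduced])
print(stacked_directly, X_new.shape)
```

Any move from a 3D block into a 2D matrix therefore needs an explicit aggregation step, and the choice of aggregation is a modelling decision rather than a copy.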

xinyuejohn and others added 2 commits March 4, 2024 09:15

improve move_to_x

69f73c1

[pre-commit.ci] auto fixes from pre-commit.com hooks

dc556f5

for more information, see https://pre-commit.ci

xinyuejohn self-assigned this Mar 4, 2024

xinyuejohn linked an issue Mar 4, 2024 that may be closed by this pull request

current move_to_x() doesn't copy original .uns/.obsm/.varm ... #654

Closed

update tests

fae3867

xinyuejohn requested a review from Zethson March 4, 2024 14:01

eroell reviewed Mar 5, 2024

eroell closed this Sep 4, 2024
