Investigate whether current KNN imputation default (FKNNI) works for all sparseness and AnnData objects #803
Comments
Does this relate to the same reports as #734?
Yes! Sorry, this is a duplicate now :(
I found some interesting things after a batch of tests yesterday:
```python
import os

import numpy as np
import pandas as pd
import anndata as ad
import ehrapy as ep

root = "1"
if not os.path.exists(root):
    os.makedirs(root)

rng = np.random.default_rng()
failed = False
iteration = 0
while not failed:
    iteration += 1
    n_obs = rng.integers(10, 100000)
    n_vars = rng.integers(1, 100)
    adata = ad.AnnData(pd.DataFrame(
        np.random.uniform(0.0, 100.0, size=(n_obs, n_vars)),
        columns=[f'Var{i + 1}' for i in range(n_vars)]))
    proba_missing = rng.uniform()
    missing_count = 0
    for i in range(adata.shape[0]):
        for j in range(adata.shape[1]):
            if rng.uniform() < proba_missing:
                adata.X[i, j] = np.nan
                missing_count += 1
    print(f"Iteration {iteration}: n_obs={n_obs}, n_vars={n_vars}, size={n_obs * n_vars}, missing={missing_count}, "
          f"missing ratio={missing_count / (n_obs * n_vars):0.3f}")
    imputed_adata = ep.pp.knn_impute(adata, copy=True)
    failed = np.isnan(imputed_adata.X).any()

print("NaN found!!! Saving AnnDatas!")
adata.write_h5ad(f"{root}/before imputation.h5ad")
pd.DataFrame(adata.X, index=adata.obs_names, columns=adata.var_names).to_csv(f"{root}/before imputation.csv")
imputed_adata.write_h5ad(f"{root}/after imputation.h5ad")
pd.DataFrame(imputed_adata.X, index=adata.obs_names, columns=adata.var_names).to_csv(f"{root}/after imputation.csv")
```
I'm sharing these early observations before digging deeper because they may ring a bell for you guys (and save me from useless work 😁).
Ohhh, this is as interesting as it is disturbing, haha. I don't have any further ideas yet, but you're doing amazing work, thank you!
Actually, further tests show the problem has nothing to do with columns. The imputation silently fails at some point and leaves the rest untouched, and this "rest" may contain full columns. I'll dig into that asap.
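A quick way to see what got left untouched (a minimal sketch, assuming the `imputed_adata` produced by the script above):

```python
import numpy as np

# Which entries did the imputation leave untouched?
still_nan = np.isnan(imputed_adata.X)
print(f"{still_nan.sum()} values still missing after imputation")
print("columns still entirely NaN:", np.where(still_nan.all(axis=0))[0])
```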
Huh, nice! Great reproducible example... How the …
I don't think writing and loading have anything to do with our problem: calling …
Digging further:

```python
if self.strategy == "mean":
    imputed_values = np.nanmean(neighbor_values, axis=0)
```

After

```python
distances, neighbor_indices = self.index_.search(sample_data.reshape(1, -1), self.n_neighbors)
```

With a failing dataset, the …

```python
return _swigfaiss_avx2.IndexFlat_search(self, n, x, k, distances, labels, params)
```
Same questions as before: Do you guys have an idea at this stage? Do the input parameters look suspicious? If not, I will have to inspect the native code. Do you know where I can find it? I already have a few suggestions:
Great digging, thank you!
facebookresearch/faiss#3830 Could it be that we are searching for more neighbors than there are samples in the population? Can we return saner values here using FAISS? Edit: ChatGPT seems to suggest that this is indeed the case.
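For illustration, a minimal sketch (assuming `faiss-cpu` is installed) of what FAISS returns when asked for more neighbors than the index holds: the missing slots come back with label `-1`:

```python
import faiss
import numpy as np

d = 4
index = faiss.IndexFlatL2(d)
index.add(np.random.rand(3, d).astype("float32"))  # only 3 vectors in the index

query = np.random.rand(1, d).astype("float32")
distances, labels = index.search(query, 5)  # ask for 5 neighbors
print(labels)  # e.g. [[1 0 2 -1 -1]]: slots with no real neighbor are -1
```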
Uhmm not to me...
We're using FAISS here (https://github.com/facebookresearch/faiss), which, uhm, should not have such big issues. It might be a configuration or number-of-neighbors issue. Maybe they say something in their wiki? https://github.com/facebookresearch/faiss/wiki

The fknni code is here: https://github.com/Zethson/fknni/blob/main/src/fknni/faiss/faiss.py

ChatGPT suggests (I cleaned it a bit):
2. Decrease
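Building on the neighbor-count hypothesis, a hypothetical guard (the function name and signature are mine, not fknni's API) would clamp `k` to the index size and fail loudly when the index is empty:

```python
import faiss
import numpy as np

def safe_knn_search(index: faiss.Index, query: np.ndarray, n_neighbors: int):
    """Never ask FAISS for more neighbors than the index holds."""
    k = min(n_neighbors, index.ntotal)
    if k == 0:
        raise ValueError("Index is empty: no complete rows were available to fit on.")
    return index.search(query.reshape(1, -1).astype("float32"), k)
```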
I found an entirely different case which can lead to NaNs. Everything takes place in the same code as before. This time, FAISS correctly returned the nearest-neighbour indices, but check what rows got extracted from the original dataset based on these indices (2 values were missing in this specific row): you got it, one column is filled with NaN. And we now have a NaN in the imputed result.
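To make this failure mode concrete, a minimal sketch of what `np.nanmean` (the `strategy == "mean"` path shown earlier) yields when the extracted neighbor rows contain an all-NaN column:

```python
import numpy as np

# Neighbor rows extracted for one sample; the second column is entirely NaN
neighbor_values = np.array([
    [1.0, np.nan],
    [2.0, np.nan],
    [3.0, np.nan],
])

# nanmean over an all-NaN column is NaN (plus a RuntimeWarning),
# so the "imputed" value for that column is still missing
print(np.nanmean(neighbor_values, axis=0))  # [ 2. nan]
```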
I think I figured out what is going on for the first case I described. In the fknni code linked above:

```python
mask = ~np.isnan(X).any(axis=1)
X_non_missing = X[mask]
index.train(X_non_missing)
index.add(X_non_missing)
```

Here, when every row contains at least one NaN, the mask filters out all rows, and we train the index with no data. Consequently, the subsequent search has no neighbors to return.
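A minimal sketch reproducing that (assuming `faiss-cpu`): when every row contains a NaN, the mask leaves nothing to index, and the search can only return `-1` labels:

```python
import faiss
import numpy as np

X = np.array([[1.0, np.nan], [np.nan, 2.0]], dtype="float32")
mask = ~np.isnan(X).any(axis=1)  # every row has a NaN -> all False
X_non_missing = X[mask]          # shape (0, 2): nothing left to index

index = faiss.IndexFlatL2(X.shape[1])
index.train(X_non_missing)  # no-op for a flat index
index.add(X_non_missing)    # index.ntotal == 0

distances, labels = index.search(np.zeros((1, 2), dtype="float32"), 3)
print(labels)  # [[-1 -1 -1]]: there are no neighbors at all
```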
Ahhh, this makes a lot of sense. Hmm, how would you go about solving this problem?
Description of feature
We ran into cases in the past where users suddenly reported that it didn't impute everything, and we need to understand this.
- Run the KNN imputation with `backend=fknni` (default) on all of them -> check whether any of them break
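A minimal sketch of such a sweep, assuming the same `ep.pp.knn_impute` call as in the reproduction script above:

```python
import numpy as np
import anndata as ad
import ehrapy as ep

rng = np.random.default_rng(0)
for missing_ratio in (0.0, 0.1, 0.5, 0.9, 0.99):
    # Build an AnnData with roughly the given fraction of missing values
    X = rng.uniform(0.0, 100.0, size=(500, 20))
    X[rng.uniform(size=X.shape) < missing_ratio] = np.nan
    adata = ad.AnnData(X)

    imputed = ep.pp.knn_impute(adata, copy=True)
    n_left = np.isnan(imputed.X).sum()
    print(f"missing_ratio={missing_ratio}: {n_left} NaNs left after imputation")
```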