input data requirements? #34

Open
mschart opened this issue Nov 26, 2024 · 0 comments

mschart commented Nov 26, 2024

Greetings,
Thanks for this great sorting tool!

I have data of about 60k cells x 2500 features (61473 x 2496), for which Rastermap works well, but when I restrict it to a subset of 1824 features I get an error saying the input contains NaNs, although there are none. Could a too-short feature dimension cause this?

With the full data it works well:

In [44]: data_full.shape
Out[44]: (61473, 2496)

In [45]: np.issubdtype(data_full.dtype, np.floating)
Out[45]: True

In [46]: np.isnan(data_full).any()
Out[46]: False

In [49]:         model = Rastermap(n_PCs=200, n_clusters=100, locality=0.75,
    ...:                           time_lag_window=5,
    ...:                           bin_size=1).fit(data_full
    ...:                   )
2024-11-27 14:03:10,304 [INFO] normalizing data across axis=1
2024-11-27 14:03:11,925 [INFO] projecting out mean along axis=0
2024-11-27 14:03:13,379 [INFO] data normalized, 3.08sec
2024-11-27 14:03:13,380 [INFO] sorting activity: 61473 valid samples by 2496 timepoints
2024-11-27 14:03:35,074 [INFO] n_PCs = 200 computed, 24.77sec
2024-11-27 14:03:57,359 [INFO] 100 clusters computed, time 47.06sec
2024-11-27 14:04:18,419 [INFO] clusters sorted, time 68.12sec
2024-11-27 14:04:20,295 [INFO] clusters upsampled, time 69.99sec
2024-11-27 14:04:28,501 [INFO] rastermap complete, time 78.20sec

But when I take a subset of the columns, it fails:

In [50]: data = data_full[:,keep_indices]

In [51]: data.shape
Out[51]: (61473, 1824)

In [52]: np.issubdtype(data.dtype, np.floating)
Out[52]: True

In [53]: np.isnan(data).any()
Out[53]: False

In [54]:         model = Rastermap(n_PCs=200, n_clusters=100, locality=0.75,
    ...:                           time_lag_window=5,
    ...:                           bin_size=1).fit(data)
2024-11-27 14:06:27,847 [INFO] normalizing data across axis=1
2024-11-27 14:06:28,978 [INFO] projecting out mean along axis=0
2024-11-27 14:06:30,070 [INFO] data normalized, 2.22sec
2024-11-27 14:06:30,070 [INFO] sorting activity: 61471 valid samples by 1824 timepoints
---------------------------------------------------------------------------
ValueError                                Traceback (most recent call last)
Cell In[54], line 3
      1 model = Rastermap(n_PCs=200, n_clusters=100, locality=0.75, 
      2                   time_lag_window=5, 
----> 3                   bin_size=1).fit(data)

File ~/miniforge3/envs/iblenv/lib/python3.10/site-packages/rastermap/rastermap.py:327, in Rastermap.fit(self, data, Usv, Vsv, U_nodes, itrain, compute_X_embedding, BBt)
    325 if Usv is None:
    326     tic = time.time()
--> 327     Usv_valid = SVD(X[igood][:, itrain] if itrain is not None else X[igood], 
    328                    n_components=self.n_PCs)            
    329     Usv = np.nan * np.zeros((len(igood), Usv_valid.shape[1]), "float32")
    330     Usv[igood] = Usv_valid

File ~/miniforge3/envs/iblenv/lib/python3.10/site-packages/rastermap/svd.py:34, in SVD(X, n_components, return_USV, transpose)
     30 nmin = min(nmin, n_components)
     32 Xt = X.T if transpose else X
     33 U = TruncatedSVD(n_components=nmin, 
---> 34                  random_state=0).fit_transform(Xt)
     36 if transpose:
     37     sv = (U**2).sum(axis=0)**0.5

File ~/miniforge3/envs/iblenv/lib/python3.10/site-packages/sklearn/utils/_set_output.py:316, in _wrap_method_output.<locals>.wrapped(self, X, *args, **kwargs)
    314 @wraps(f)
    315 def wrapped(self, X, *args, **kwargs):
--> 316     data_to_wrap = f(self, X, *args, **kwargs)
    317     if isinstance(data_to_wrap, tuple):
    318         # only wrap the first output for cross decomposition
    319         return_tuple = (
    320             _wrap_data_with_container(method, data_to_wrap[0], X, self),
    321             *data_to_wrap[1:],
    322         )

File ~/miniforge3/envs/iblenv/lib/python3.10/site-packages/sklearn/base.py:1473, in _fit_context.<locals>.decorator.<locals>.wrapper(estimator, *args, **kwargs)
   1466     estimator._validate_params()
   1468 with config_context(
   1469     skip_parameter_validation=(
   1470         prefer_skip_nested_validation or global_skip_validation
   1471     )
   1472 ):
-> 1473     return fit_method(estimator, *args, **kwargs)

File ~/miniforge3/envs/iblenv/lib/python3.10/site-packages/sklearn/decomposition/_truncated_svd.py:228, in TruncatedSVD.fit_transform(self, X, y)
    211 @_fit_context(prefer_skip_nested_validation=True)
    212 def fit_transform(self, X, y=None):
    213     """Fit model to X and perform dimensionality reduction on X.
    214 
    215     Parameters
   (...)
    226         Reduced version of X. This will always be a dense array.
    227     """
--> 228     X = self._validate_data(X, accept_sparse=["csr", "csc"], ensure_min_features=2)
    229     random_state = check_random_state(self.random_state)
    231     if self.algorithm == "arpack":

File ~/miniforge3/envs/iblenv/lib/python3.10/site-packages/sklearn/base.py:633, in BaseEstimator._validate_data(self, X, y, reset, validate_separately, cast_to_ndarray, **check_params)
    631         out = X, y
    632 elif not no_val_X and no_val_y:
--> 633     out = check_array(X, input_name="X", **check_params)
    634 elif no_val_X and not no_val_y:
    635     out = _check_y(y, **check_params)

File ~/miniforge3/envs/iblenv/lib/python3.10/site-packages/sklearn/utils/validation.py:1064, in check_array(array, accept_sparse, accept_large_sparse, dtype, order, copy, force_writeable, force_all_finite, ensure_2d, allow_nd, ensure_min_samples, ensure_min_features, estimator, input_name)
   1058     raise ValueError(
   1059         "Found array with dim %d. %s expected <= 2."
   1060         % (array.ndim, estimator_name)
   1061     )
   1063 if force_all_finite:
-> 1064     _assert_all_finite(
   1065         array,
   1066         input_name=input_name,
   1067         estimator_name=estimator_name,
   1068         allow_nan=force_all_finite == "allow-nan",
   1069     )
   1071 if copy:
   1072     if _is_numpy_namespace(xp):
   1073         # only make a copy if `array` and `array_orig` may share memory`

File ~/miniforge3/envs/iblenv/lib/python3.10/site-packages/sklearn/utils/validation.py:123, in _assert_all_finite(X, allow_nan, msg_dtype, estimator_name, input_name)
    120 if first_pass_isfinite:
    121     return
--> 123 _assert_all_finite_element_wise(
    124     X,
    125     xp=xp,
    126     allow_nan=allow_nan,
    127     msg_dtype=msg_dtype,
    128     estimator_name=estimator_name,
    129     input_name=input_name,
    130 )

File ~/miniforge3/envs/iblenv/lib/python3.10/site-packages/sklearn/utils/validation.py:172, in _assert_all_finite_element_wise(X, xp, allow_nan, msg_dtype, estimator_name, input_name)
    155 if estimator_name and input_name == "X" and has_nan_error:
    156     # Improve the error message on how to handle missing values in
    157     # scikit-learn.
    158     msg_err += (
    159         f"\n{estimator_name} does not accept missing values"
    160         " encoded as NaN natively. For supervised learning, you might want"
   (...)
    170         "#estimators-that-handle-nan-values"
    171     )
--> 172 raise ValueError(msg_err)

ValueError: Input X contains NaN.
TruncatedSVD does not accept missing values encoded as NaN natively. For supervised learning, you might want to consider sklearn.ensemble.HistGradientBoostingClassifier and Regressor which accept missing values encoded as NaNs natively. Alternatively, it is possible to preprocess the data, for instance by using an imputer transformer in a pipeline or drop samples with missing values. See https://scikit-learn.org/stable/modules/impute.html You can find a list of all estimators that handle NaN values at the following page: https://scikit-learn.org/stable/modules/impute.html#estimators-that-handle-nan-values

It seems to be specific to this data: for a random matrix of the same size it works, both at full size and restricted. Hmm. Happy to hear your thoughts about this! And thanks again for the great tool!
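One thing that might be worth checking: the log reports 61473 valid samples for the full data but 61471 for the subset, so two rows appear to be treated differently once columns are dropped. The sketch below is only a diagnostic idea, not a confirmed explanation. It assumes the normalization along axis=1 behaves like a plain per-row z-score (I have not verified Rastermap's internals), and `find_degenerate_rows` and `zscore_rows` are hypothetical helpers written for this example. It shows on toy data how a row that varies over all columns can become constant on a subset, and how z-scoring such a row yields NaN even though the input has none.

```python
import numpy as np


def find_degenerate_rows(data, tol=0.0):
    """Return indices of rows that contain non-finite values or have std <= tol."""
    data = np.asarray(data)
    nonfinite = ~np.isfinite(data).all(axis=1)
    with np.errstate(invalid="ignore"):
        row_std = data.std(axis=1)
    # `~(row_std > tol)` is also True when the std itself is NaN.
    constant = ~(row_std > tol)
    return np.flatnonzero(nonfinite | constant)


def zscore_rows(data):
    """Plain per-row z-score (an assumed stand-in for normalization along axis=1)."""
    mean = data.mean(axis=1, keepdims=True)
    std = data.std(axis=1, keepdims=True)
    with np.errstate(divide="ignore", invalid="ignore"):
        return (data - mean) / std


# Toy data: rows 2 and 4 vary over all 8 columns but are constant on the first 4.
rng = np.random.default_rng(0)
toy = rng.standard_normal((6, 8)).astype("float32")
toy[2, :4] = 1.0
toy[4, :4] = 0.0

full_bad = find_degenerate_rows(toy)       # no degenerate rows in the full matrix
subset = toy[:, :4]                        # analogous to data_full[:, keep_indices]
subset_bad = find_degenerate_rows(subset)  # rows 2 and 4 are now constant

z = zscore_rows(subset)
print("NaN in subset input:", np.isnan(subset).any())
print("NaN after z-scoring:", np.isnan(z).any())
print("degenerate rows in subset:", subset_bad)

# Possible workaround to try: drop the degenerate rows before fitting.
clean = np.delete(subset, subset_bad, axis=0)
```

On the real data, calling `find_degenerate_rows(data)` on the subset and comparing against `find_degenerate_rows(data_full)` would show whether any cells are constant (or nearly so, with a small `tol`) on the kept columns. If so, fitting on the remaining rows would be one way to test whether those cells are what triggers the error.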

mschart changed the title from "minimal input data dimensions" to "input data requirements?" on Dec 2, 2024