I have data of roughly 60k cells x 2500 features (61473 x 2496), for which rastermap works well. But when I restrict it to a subset of 1824 features, I get an error saying the input contains NaNs, although np.isnan finds none in the data I pass in. It looks as if reducing the feature dimension triggers this.
With the full data it works well:
In [44]: data_full.shape
Out[44]: (61473, 2496)
In [45]: np.issubdtype(data_full.dtype, np.floating)
Out[45]: True
In [46]: np.isnan(data_full).any()
Out[46]: False
In [49]: model = Rastermap(n_PCs=200, n_clusters=100, locality=0.75,
...: time_lag_window=5,
...: bin_size=1).fit(data_full
...: )
2024-11-27 14:03:10,304 [INFO] normalizing data across axis=1
2024-11-27 14:03:11,925 [INFO] projecting out mean along axis=0
2024-11-27 14:03:13,379 [INFO] data normalized, 3.08sec
2024-11-27 14:03:13,380 [INFO] sorting activity: 61473 valid samples by 2496 timepoints
2024-11-27 14:03:35,074 [INFO] n_PCs = 200 computed, 24.77sec
2024-11-27 14:03:57,359 [INFO] 100 clusters computed, time 47.06sec
2024-11-27 14:04:18,419 [INFO] clusters sorted, time 68.12sec
2024-11-27 14:04:20,295 [INFO] clusters upsampled, time 69.99sec
2024-11-27 14:04:28,501 [INFO] rastermap complete, time 78.20sec
But when taking a subset of the features, it fails:
In [50]: data = data_full[:,keep_indices]
In [51]: data.shape
Out[51]: (61473, 1824)
In [52]: np.issubdtype(data.dtype, np.floating)
Out[52]: True
In [53]: np.isnan(data).any()
Out[53]: False
In [54]: model = Rastermap(n_PCs=200, n_clusters=100, locality=0.75,
...: time_lag_window=5,
...: bin_size=1).fit(data)
2024-11-27 14:06:27,847 [INFO] normalizing data across axis=1
2024-11-27 14:06:28,978 [INFO] projecting out mean along axis=0
2024-11-27 14:06:30,070 [INFO] data normalized, 2.22sec
2024-11-27 14:06:30,070 [INFO] sorting activity: 61471 valid samples by 1824 timepoints
---------------------------------------------------------------------------
ValueError Traceback (most recent call last)
Cell In[54], line 3
1 model = Rastermap(n_PCs=200, n_clusters=100, locality=0.75,
2 time_lag_window=5,
----> 3 bin_size=1).fit(data)
File ~/miniforge3/envs/iblenv/lib/python3.10/site-packages/rastermap/rastermap.py:327, in Rastermap.fit(self, data, Usv, Vsv, U_nodes, itrain, compute_X_embedding, BBt)
325 if Usv is None:
326 tic = time.time()
--> 327 Usv_valid = SVD(X[igood][:, itrain] if itrain is not None else X[igood],
328 n_components=self.n_PCs)
329 Usv = np.nan * np.zeros((len(igood), Usv_valid.shape[1]), "float32")
330 Usv[igood] = Usv_valid
File ~/miniforge3/envs/iblenv/lib/python3.10/site-packages/rastermap/svd.py:34, in SVD(X, n_components, return_USV, transpose)
30 nmin = min(nmin, n_components)
32 Xt = X.T if transpose else X
33 U = TruncatedSVD(n_components=nmin,
---> 34 random_state=0).fit_transform(Xt)
36 if transpose:
37 sv = (U**2).sum(axis=0)**0.5
File ~/miniforge3/envs/iblenv/lib/python3.10/site-packages/sklearn/utils/_set_output.py:316, in _wrap_method_output.<locals>.wrapped(self, X, *args, **kwargs)
314 @wraps(f)
315 def wrapped(self, X, *args, **kwargs):
--> 316 data_to_wrap = f(self, X, *args, **kwargs)
317 if isinstance(data_to_wrap, tuple):
318 # only wrap the first output for cross decomposition
319 return_tuple = (
320 _wrap_data_with_container(method, data_to_wrap[0], X, self),
321 *data_to_wrap[1:],
322 )
File ~/miniforge3/envs/iblenv/lib/python3.10/site-packages/sklearn/base.py:1473, in _fit_context.<locals>.decorator.<locals>.wrapper(estimator, *args, **kwargs)
1466 estimator._validate_params()
1468 with config_context(
1469 skip_parameter_validation=(
1470 prefer_skip_nested_validation or global_skip_validation
1471 )
1472 ):
-> 1473 return fit_method(estimator, *args, **kwargs)
File ~/miniforge3/envs/iblenv/lib/python3.10/site-packages/sklearn/decomposition/_truncated_svd.py:228, in TruncatedSVD.fit_transform(self, X, y)
211 @_fit_context(prefer_skip_nested_validation=True)
212 def fit_transform(self, X, y=None):
213 """Fit model to X and perform dimensionality reduction on X.
214
215 Parameters
(...)
226 Reduced version of X. This will always be a dense array.
227 """
--> 228 X = self._validate_data(X, accept_sparse=["csr", "csc"], ensure_min_features=2)
229 random_state = check_random_state(self.random_state)
231 if self.algorithm == "arpack":
File ~/miniforge3/envs/iblenv/lib/python3.10/site-packages/sklearn/base.py:633, in BaseEstimator._validate_data(self, X, y, reset, validate_separately, cast_to_ndarray, **check_params)
631 out = X, y
632 elif not no_val_X and no_val_y:
--> 633 out = check_array(X, input_name="X", **check_params)
634 elif no_val_X and not no_val_y:
635 out = _check_y(y, **check_params)
File ~/miniforge3/envs/iblenv/lib/python3.10/site-packages/sklearn/utils/validation.py:1064, in check_array(array, accept_sparse, accept_large_sparse, dtype, order, copy, force_writeable, force_all_finite, ensure_2d, allow_nd, ensure_min_samples, ensure_min_features, estimator, input_name)
1058 raise ValueError(
1059 "Found array with dim %d. %s expected <= 2."
1060 % (array.ndim, estimator_name)
1061 )
1063 if force_all_finite:
-> 1064 _assert_all_finite(
1065 array,
1066 input_name=input_name,
1067 estimator_name=estimator_name,
1068 allow_nan=force_all_finite == "allow-nan",
1069 )
1071 if copy:
1072 if _is_numpy_namespace(xp):
1073 # only make a copy if `array` and `array_orig` may share memory`
File ~/miniforge3/envs/iblenv/lib/python3.10/site-packages/sklearn/utils/validation.py:123, in _assert_all_finite(X, allow_nan, msg_dtype, estimator_name, input_name)
120 if first_pass_isfinite:
121 return
--> 123 _assert_all_finite_element_wise(
124 X,
125 xp=xp,
126 allow_nan=allow_nan,
127 msg_dtype=msg_dtype,
128 estimator_name=estimator_name,
129 input_name=input_name,
130 )
File ~/miniforge3/envs/iblenv/lib/python3.10/site-packages/sklearn/utils/validation.py:172, in _assert_all_finite_element_wise(X, xp, allow_nan, msg_dtype, estimator_name, input_name)
155 if estimator_name and input_name == "X" and has_nan_error:
156 # Improve the error message on how to handle missing values in
157 # scikit-learn.
158 msg_err += (
159 f"\n{estimator_name} does not accept missing values"
160 " encoded as NaN natively. For supervised learning, you might want"
(...)
170 "#estimators-that-handle-nan-values"
171 )
--> 172 raise ValueError(msg_err)
ValueError: Input X contains NaN.
TruncatedSVD does not accept missing values encoded as NaN natively. For supervised learning, you might want to consider sklearn.ensemble.HistGradientBoostingClassifier and Regressor which accept missing values encoded as NaNs natively. Alternatively, it is possible to preprocess the data, for instance by using an imputer transformer in a pipeline or drop samples with missing values. See https://scikit-learn.org/stable/modules/impute.html You can find a list of all estimators that handle NaN values at the following page: https://scikit-learn.org/stable/modules/impute.html#estimators-that-handle-nan-values
The problem seems specific to this data: for a random matrix of the same size it works, both with all features and with the restricted set. Mhmmm. Happy to hear your thoughts about this! And thanks again for the great tool!
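One thing that may be worth checking (a guess, not a confirmed diagnosis): the failing run logs 61471 valid samples while the full data logs 61473, so two cells appear to be treated differently once columns are dropped. Separately, np.isnan does not flag inf values, and a row that is constant over the kept features turns into NaN under a plain z-score (0 / 0). The sketch below checks for both conditions on a toy array. The names `diagnose`, `zscore_rows`, `toy_full` and `toy_keep` are hypothetical, and whether rastermap's own normalization behaves like `zscore_rows` is an assumption.

```python
import numpy as np


def diagnose(data):
    """Count non-finite entries and list rows/columns with no variation."""
    data = np.asarray(data)
    return {
        # np.isnan does not flag +/-inf; np.isfinite catches both
        "n_nonfinite": int((~np.isfinite(data)).sum()),
        # max == min is an exact test for a constant row or column
        "constant_rows": np.flatnonzero(data.max(axis=1) == data.min(axis=1)),
        "constant_cols": np.flatnonzero(data.max(axis=0) == data.min(axis=0)),
    }


def zscore_rows(data):
    """Plain row-wise z-score; constant rows come out as NaN (0 / 0)."""
    centered = data - data.mean(axis=1, keepdims=True)
    with np.errstate(invalid="ignore", divide="ignore"):
        return centered / data.std(axis=1, keepdims=True)


# Toy stand-in for data_full / keep_indices (hypothetical values)
rng = np.random.default_rng(0)
toy_full = rng.normal(size=(6, 10))
toy_keep = np.array([0, 2, 4, 6])
toy_full[3, toy_keep] = 1.5  # row 3 is flat only on the kept columns
toy_sub = toy_full[:, toy_keep]

full_report = diagnose(toy_full)  # no constant rows in the full toy array
sub_report = diagnose(toy_sub)    # row 3 shows up as constant after subsetting
z_sub = zscore_rows(toy_sub)      # row 3 becomes NaN although the input has none
```

On the real arrays, comparing diagnose(data_full) with diagnose(data) would show whether the column subset leaves some cells with no variation or with non-finite values.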
mschart changed the title from "minimal input data dimensions" to "input data requirements?" on Dec 2, 2024