Describe the bug
Sometimes element-wise operations return a dense result, and sometimes they error. It seems to depend on the broadcasting.
To Reproduce
```python
import sparse
import numpy as np

s = sparse.random((20, 10), .5)
dense_mtx = np.random.rand(20, 10)
dense_vec = np.random.rand(10)

print(type(s + dense_mtx))
# <class 'numpy.ndarray'>

s + dense_vec  # raises ValueError (traceback below)
```
```
---------------------------------------------------------------------------
ValueError                                Traceback (most recent call last)
<ipython-input-1-8f24377c4243> in <module>
      9 # np.ndarray
     10
---> 11 s + dense_vec

/usr/local/lib/python3.8/site-packages/numpy/lib/mixins.py in func(self, other)
     19         if _disables_array_ufunc(other):
     20             return NotImplemented
---> 21         return ufunc(self, other)
     22     func.__name__ = '__{}__'.format(name)
     23     return func

~/github/sparse/sparse/_sparse_array.py in __array_ufunc__(self, ufunc, method, *inputs, **kwargs)
    303
    304         if method == "__call__":
--> 305             result = elemwise(ufunc, *inputs, **kwargs)
    306         elif method == "reduce":
    307             result = SparseArray._reduce(ufunc, *inputs, **kwargs)

~/github/sparse/sparse/_umath.py in elemwise(func, *args, **kwargs)
     47     """
     48
---> 49     return _Elemwise(func, *args, **kwargs).get_result()
     50
     51

~/github/sparse/sparse/_umath.py in __init__(self, func, *args, **kwargs)
    460
    461         self._check_broadcast()
--> 462         self._get_fill_value()
    463
    464     def get_result(self):

~/github/sparse/sparse/_umath.py in _get_fill_value(self)
    554         equivalent_fv = equivalent(fill_value, fill_value_array).all()
    555         if not equivalent_fv and self.shape != self.ndarray_shape:
--> 556             raise ValueError(
    557                 "Performing a mixed sparse-dense operation that would result in a dense array. "
    558                 "Please make sure that func(sparse_fill_values, ndarrays) is a constant array."

ValueError: Performing a mixed sparse-dense operation that would result in a dense array. Please make sure that func(sparse_fill_values, ndarrays) is a constant array.
```
Expected behavior
Consistency. I'd like a dense output for this operation (following #10 (comment)), but would be happy with the consistency of:
- Always error
- Always return dense
- Always return sparse
System
- OS and version: macOS 10.15
- `sparse` version (`sparse.__version__`): 0.12.0+8.g75125cd
- NumPy version (`np.__version__`): 1.20.2
- Numba version (`numba.__version__`): 0.53.1
hameerabbasi commented on Apr 21, 2021

The reasoning behind this one was that `np.add.outer(sparse_1d_array, dense_1d_array)` should fail because it'd be effectively dense, and it'd blow up memory due to the densification of the sparse array, whereas `np.multiply.outer(sparse_1d_array, dense_1d_array)` is completely fine (and will be sparse).

One option is to allow "broadcasting" fill values.
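A minimal sketch of that fill-value reasoning, using the broadcasting case from the reproducer at the top of the issue rather than `outer`:

```python
import numpy as np
import sparse

s = sparse.random((20, 10), density=0.5)  # fill value is 0
v = np.random.rand(10)

# Multiplication: func(fill_value, v) == 0 * v == 0 is constant, so a
# single fill value still describes the result and it stays sparse.
print(type(s * v))  # a sparse COO array

# Addition: func(fill_value, v) == 0 + v is not constant, so no single
# fill value exists; with broadcasting this raises (traceback above).
try:
    s + v
except ValueError as e:
    print(e)
```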
ivirshup commented on Apr 22, 2021

But wouldn't this still blow up memory if both these arrays are dense?

I'm not sure I agree that it's this library's job to error here, and I'm not aware of a sparse matrix library that refuses this kind of operation (examples being `scipy.sparse`, R's `Matrix`, and Julia's `SparseArrays`). I think a warning here is very reasonable, but it would be nice if these acted like arrays. I would note that R and Julia return sparse arrays from these operations, while `scipy` returns a dense array.

Personally, I'm not completely convinced allowing custom fill values is worth the complexity. It definitely has some cool applications, but to me the main goal is having working sparse arrays with an array interface. I think this solution makes a lot of sense for a lazy interface, but is just a lot of complexity for an eager one.
hameerabbasi commented on Apr 22, 2021

I'll have to rephrase this one: `np.add.outer(sparse_1d_array, dense_1d_array)` uses memory that's beyond `O(sparse_memory * dense_memory)`, whereas `np.multiply.outer(sparse_1d_array, dense_1d_array)` fits in that range. Specifically, if we can't fit an algorithm that's better than a dense algorithm into our paradigm, we error; that's our definition of "memory blowup". In this particular case, one can simply do `np.add.outer(sparse_1d_array.todense(), dense_1d_array)`.
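For reference, the suggested workaround spelled out as a runnable sketch (the shapes here are illustrative):

```python
import numpy as np
import sparse

s = sparse.random((1000,), density=0.01)
d = np.random.rand(1000)

# Densifying explicitly makes the O(len(s) * len(d)) memory cost of the
# outer sum visible at the call site instead of hiding it in the ufunc.
result = np.add.outer(s.todense(), d)
print(result.shape)  # (1000, 1000)
```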
hameerabbasi commented on Apr 22, 2021
This is also in line with the decision made over at #218.
ivirshup commented on Apr 22, 2021

This isn't strictly true if the output array is dense, which would follow the current behaviour of `sparse_nd_array + dense_nd_array -> dense_nd_array`. But also, isn't the same general principle of large memory growth similar if both those arrays were sparse?

As a counter example here: `np.multiply.outer(sparse.random(100, .1, fill_value=1), np.arange(100))`. This seems like a case that would easily slip through testing, but show up in downstream use (a concern brought up here: #218 (comment)).

To my reading, the short discussion there is more focused on the behavior of `np.asarray` than the results of broadcasting. I think the results of operations on mixed sparse and dense arrays could use more discussion. I'd be interested in knowing why libraries in other languages tend to return sparse array outputs from these, even though the data can be dense.

That said, it would be really nice if there were a universal way to convert a numpy-array-like to a numpy array, which I thought was the intent of `np.asarray`.
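A sketch of that counter example, assuming `outer` runs through the same fill-value check shown in the traceback above:

```python
import numpy as np
import sparse

# With fill_value=1, even multiplication produces a non-constant fill:
# func(1, np.arange(100)) == np.arange(100). Under the current rule this
# raises the same ValueError as addition, even though multiplication is
# normally the "safe" sparse operation.
s = sparse.random((100,), density=0.1, fill_value=1)
try:
    np.multiply.outer(s, np.arange(100))
except ValueError as e:
    print(e)
```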
hameerabbasi commented on Apr 22, 2021
Currently we do not accept dense outputs.

This is true, I hadn't considered that.

I was assuming a fill-value of 0 in this case, and this case is, indeed, handled more generally such that there aren't blowups. I do agree it could be tested better.

Not quite, there's a difference between "this is already something like an array, please convert it to one", "this may cause a performance slowdown in some cases" (like CuPy, which disallows `__array__`), and the worst case: "this may blow up memory so badly I may have to restart my machine or face OOM errors" (which is the case here).

ivirshup commented on Apr 22, 2021
This doesn't seem to be the case to me. The example at the top of the issue (adding a sparse array and a dense array resulting in a dense array) works on current master. `Elemwise` has code for special handling and returning of dense outputs, which is what allows that operation to occur:

sparse/sparse/_umath.py, line 459 in 0d50c9d
sparse/sparse/_umath.py, lines 555 to 561 in 0d50c9d

I think this is a different point. My point here was that it would be nice if my library only worked on `np.ndarray`s, but I could accept any array-like by simply calling some conversion function, say `arr.convert_to(np.ndarray)`, on the passed input. It seems like the preferred way to do this is changing to `np.asarray(arr, like=np.ones(()))`? Though I'm not sure why passing a type as `like` (`like=np.ndarray`) isn't the API. Also not sure how array-likes are supposed to register conversion methods, but this may be a larger issue of me not fully understanding numpy's various dispatch mechanisms.

Your NEP-37 would help solve a number of these problems.
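For context, a brief illustration of what NEP 35's `like=` does: it dispatches array creation on the *type of a reference array* (via `__array_function__`), not on a type object, which is the asymmetry questioned above.

```python
import numpy as np

# With a plain ndarray as the reference, creation takes the default
# NumPy path; a library array here would instead redirect creation to
# that library's __array_function__ implementation.
ref = np.ones(())
x = np.asarray([1, 2, 3], like=ref)
print(type(x))  # <class 'numpy.ndarray'>
```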
hameerabbasi commented on Apr 22, 2021

What I meant by this is that we do not support `ufunc(*input_arrays, out=dense_array)`, but yes, we do return output arrays being dense.

While I agree, it's relevant. But in light of the point that, by my definition, memory blowup could also happen on `SparseArray`s, I'm willing to go for conditional sparse/dense outputs. Making it consistent is out of the question, though, as in some cases sparse outputs may be desirable.
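A quick check of that distinction as a sketch (the exact exception type raised for the `out=` case is an assumption here):

```python
import numpy as np
import sparse

s = sparse.random((20, 10), density=0.5)
d = np.random.rand(20, 10)

# Returning a dense output works; sparse chooses the result type:
print(type(np.add(s, d)))  # <class 'numpy.ndarray'>

# Writing into a caller-provided dense output is not supported:
out = np.empty((20, 10))
try:
    np.add(s, d, out=out)
except (TypeError, ValueError) as e:
    print(type(e).__name__)
```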
ivirshup commented on Apr 24, 2021

Just to be sure we're on the same page here, does something like this fit what you mean?

I think this might take a significant refactor of the `elemwise` code. I would think that going for a more signature-based dispatch (e.g. something multiple-dispatch-y) would be very helpful for defining methods here.

I would also like to investigate why most other libraries seem to have gone for the consistency of just having these operations always return sparse arrays.
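A minimal sketch of the kind of signature-based dispatch being suggested; the routing function and its behavior are invented for illustration (`sparse.elemwise`, `sparse.SparseArray`, and `.todense()` are real):

```python
import numpy as np
import sparse

def dispatch_elemwise(func, a, b):
    """Hypothetical: route on the types of the inputs."""
    a_sparse = isinstance(a, sparse.SparseArray)
    b_sparse = isinstance(b, sparse.SparseArray)
    if a_sparse and b_sparse:
        # sparse (op) sparse: stay sparse.
        return sparse.elemwise(func, a, b)
    if a_sparse or b_sparse:
        # Mixed sparse/dense: densify and return a plain ndarray.
        a = a.todense() if a_sparse else a
        b = b.todense() if b_sparse else b
    return func(a, b)

s = sparse.random((20, 10), density=0.5)
print(type(dispatch_elemwise(np.add, s, np.random.rand(10))))  # ndarray
```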
hameerabbasi commented on Apr 24, 2021

For a fill-value of `0` and inputs lacking NaNs, yes. Surprisingly, it doesn't require a refactor at all.

ivirshup commented on Apr 24, 2021
I think this may be too inefficient a way to find out that `a = sparse.random((100000, 100000)); a * np.ones(a.shape)` returns sparse:

sparse/sparse/_umath.py, lines 531 to 543 in e036c25

I think it would make more sense to just say that "if the operation is multiplication, we only need to look at the intersection of the sparse indices". That is, I think it makes sense to have this logic written as code as opposed to determined dynamically.
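A sketch of encoding that rule statically instead of probing `func(fill_value, dense_array)` at runtime; the set and function below are invented names, not part of `sparse`:

```python
import numpy as np

# Binary ufuncs where func(0, x) == 0 for any finite x, so a zero fill
# value survives and only the stored coordinates need to be computed.
ZERO_PRESERVING_BINARY = {np.multiply, np.logical_and}

def output_stays_sparse(ufunc, fill_value):
    # Static lookup: no trial evaluation of the ufunc over dense data.
    return ufunc in ZERO_PRESERVING_BINARY and fill_value == 0

print(output_stays_sparse(np.multiply, 0))  # True
print(output_stays_sparse(np.add, 0))       # False
```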
hameerabbasi commented on Apr 24, 2021
While I agree, that would be a big refactor indeed. What I would prefer instead is that we defer that to the rewrite that's taking place in collaboration with the TACO team.
ivirshup commented on Apr 24, 2021

I hadn't realized this was something that was being worked on. Is there somewhere it can be tracked?

This definitely seems like the ideal, to just have all operations here generated by TACO. I recall reading issues about this, but hadn't seen much movement on implementation. If this is something that might be a reality soon, then I would probably want to re-evaluate implementation priorities for `CSR` and `CSC`.

hameerabbasi commented on Apr 24, 2021
Mostly in the TACO repository. They're adding an array API instead of just Einsum-type operations.
Advanced indexing might have to be done on top. The work is expected to complete in June, after which we'll need to wrap it into something a bit more Pythonic.