
Inconsistent elemwise auto-densify rules #460

ivirshup (Contributor)

Description

Describe the bug

Sometimes element-wise operations return a dense result, and sometimes they raise an error. The behavior seems to depend on broadcasting.

To Reproduce

import sparse
import numpy as np

s = sparse.random((20, 10), .5)
dense_mtx = np.random.rand(20, 10)
dense_vec = np.random.rand(10) 

print(type(s + dense_mtx))
# <class 'numpy.ndarray'>

s + dense_vec
---------------------------------------------------------------------------
ValueError                                Traceback (most recent call last)
<ipython-input-1-8f24377c4243> in <module>
      9 # np.ndarray
     10 
---> 11 s + dense_vec

/usr/local/lib/python3.8/site-packages/numpy/lib/mixins.py in func(self, other)
     19         if _disables_array_ufunc(other):
     20             return NotImplemented
---> 21         return ufunc(self, other)
     22     func.__name__ = '__{}__'.format(name)
     23     return func

~/github/sparse/sparse/_sparse_array.py in __array_ufunc__(self, ufunc, method, *inputs, **kwargs)
    303 
    304         if method == "__call__":
--> 305             result = elemwise(ufunc, *inputs, **kwargs)
    306         elif method == "reduce":
    307             result = SparseArray._reduce(ufunc, *inputs, **kwargs)

~/github/sparse/sparse/_umath.py in elemwise(func, *args, **kwargs)
     47     """
     48 
---> 49     return _Elemwise(func, *args, **kwargs).get_result()
     50 
     51 

~/github/sparse/sparse/_umath.py in __init__(self, func, *args, **kwargs)
    460 
    461         self._check_broadcast()
--> 462         self._get_fill_value()
    463 
    464     def get_result(self):

~/github/sparse/sparse/_umath.py in _get_fill_value(self)
    554         equivalent_fv = equivalent(fill_value, fill_value_array).all()
    555         if not equivalent_fv and self.shape != self.ndarray_shape:
--> 556             raise ValueError(
    557                 "Performing a mixed sparse-dense operation that would result in a dense array. "
    558                 "Please make sure that func(sparse_fill_values, ndarrays) is a constant array."

ValueError: Performing a mixed sparse-dense operation that would result in a dense array. Please make sure that func(sparse_fill_values, ndarrays) is a constant array.

Expected behavior

Consistency. I'd like a dense output for this operation (following #10 (comment)), but would be happy with the consistency of:

  • Always error
  • Always return dense
  • Always return sparse

System

  • OS and version: macOS 10.15
  • sparse version (sparse.__version__) 0.12.0+8.g75125cd
  • NumPy version (np.__version__) 1.20.2
  • Numba version (numba.__version__) 0.53.1

Activity

The "bug" label (indicates an unexpected problem or unintended behavior) was added on Apr 21, 2021.
hameerabbasi (Collaborator) commented on Apr 21, 2021

The reasoning behind this one was that doing np.add.outer(sparse_1d_array, dense_1d_array) should fail because the result would be effectively dense, and it'd blow up memory due to the densification of the sparse array, whereas np.multiply.outer(sparse_1d_array, dense_1d_array) is completely fine (and will be sparse).

One option is to allow "broadcasting" fill-values.
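A minimal illustration of this asymmetry (a sketch, assuming the default fill value of 0):

import numpy as np
import sparse

s = sparse.random((1000,), density=0.01)  # fill value is 0
d = np.random.rand(1000)

# 0 * d[j] == 0 for every j, so the output keeps a constant fill value
# of 0 and can stay sparse:
out = np.multiply.outer(s, d)

# 0 + d[j] varies with j, so no single fill value describes the result;
# it would be effectively dense, and this raises the ValueError shown above:
np.add.outer(s, d)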

ivirshup (Contributor, Author) commented on Apr 22, 2021

The reasoning behind this one was that doing np.add.outer(sparse_1d_array, dense_1d_array)

But wouldn't this still blow up memory if both of these arrays were dense?

I'm not sure I agree that it's this library's job to error here, and I'm not aware of any sparse matrix library that refuses this kind of operation (examples being scipy.sparse, R's Matrix, and Julia's SparseArrays). I think a warning here is very reasonable, but it would be nice if these acted like arrays. I would note that R and Julia return sparse arrays from these operations, while scipy returns a dense array.

One option is to allow "broadcasting" fill-values.

Personally, I'm not completely convinced that allowing custom fill values is worth the complexity. It definitely has some cool applications, but to me the main goal is having working sparse arrays with an array interface.

I think this solution makes a lot of sense for a lazy interface, but is just a lot of complexity for an eager one.

hameerabbasi (Collaborator) commented on Apr 22, 2021

I'll have to rephrase this one: np.add.outer(sparse_1d_array, dense_1d_array) uses memory that's beyond O(sparse_memory * dense_memory), whereas np.multiply.outer(sparse_1d_array, dense_1d_array) fits in that range. Specifically, if we can't fit an algorithm that's better than a dense algorithm into our paradigm, we error; that's our definition of "memory blowup". In this particular case, one can simply do np.add.outer(sparse_1d_array.todense(), dense_1d_array).
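To make that bound concrete, some rough storage arithmetic (illustrative numbers, assuming a fill value of 0):

n = 1_000_000  # logical length of the sparse 1-d operand
nnz = 100      # stored nonzeros in the sparse operand
m = 1_000      # length of the dense 1-d operand

# multiply.outer: a stored value can only appear where the sparse operand
# is nonzero, so at most nnz * m = 100_000 entries; within the bound.
multiply_outer_stored = nnz * m

# add.outer: nearly all n * m = 1_000_000_000 entries are nonzero, so the
# result is effectively dense and exceeds the bound.
add_outer_entries = n * m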

hameerabbasi (Collaborator) commented on Apr 22, 2021

This is also in line with the decision made over at #218.

ivirshup (Contributor, Author) commented on Apr 22, 2021

I'll have to rephrase this one: np.add.outer(sparse_1d_array, dense_1d_array) uses memory that's beyond O(sparse_memory * dense_memory)

This isn't strictly true if the output array is dense, which would follow the current behaviour of sparse_nd_array + dense_nd_array -> dense_nd_array.

But also, doesn't the same general principle of large memory growth apply if both of those arrays were sparse?

np.multiply.outer(sparse_1d_array, dense_1d_array)

As a counter example here: np.multiply.outer(sparse.random(100, .1, fill_value=1), np.arange(100)). This seems like a case that would easily slip through testing, but show up in downstream use (a concern brought up here: #218 (comment)).

This is also in line with the decision made over at #218.

To my reading, the short discussion there is more focussed on the behavior of np.asarray than on the results of broadcasting. I think the results of operations on mixed sparse and dense arrays could use more discussion. I'd be interested to know why libraries in other languages tend to return sparse array outputs from these operations, even though the data can be dense.

That said, it would be really nice if there were a universal way to convert a numpy-array-like to a numpy array, which I thought was the intent of np.asarray.

hameerabbasi (Collaborator) commented on Apr 22, 2021

This isn't strictly true if the output array is dense, which would follow the current behaviour of sparse_nd_array + dense_nd_array -> dense_nd_array.

Currently we do not accept dense outputs.

But also, doesn't the same general principle of large memory growth apply if both of those arrays were sparse?

This is true, I hadn't considered that.

As a counter example here: np.multiply.outer(sparse.random(100, .1, fill_value=1), np.arange(100)). This seems like a case that would easily slip through testing, but show up in downstream use (a concern brought up here: #218 (comment)).

I was assuming a fill-value of 0 in this case, and this case is, indeed, handled more generally such that there aren't blowups. I do agree it could be tested better.

That said, it would be really nice if there were a universal way to convert a numpy-array-like to a numpy array, which I thought was the intent of np.asarray.

Not quite, there's a difference between "this is already something like an array, please convert it to one" and "this may cause a performance slowdown in some cases" (like CuPy, which disallows __array__), and the worst case: "this may blow up memory so badly I may have to restart my machine or face OOM errors" (which is the case here).

ivirshup (Contributor, Author) commented on Apr 22, 2021

Currently we do not accept dense outputs.

This doesn't seem to be the case to me. The example at the top of the issue (adding a sparse array and a dense array, resulting in a dense array) works on current master. _Elemwise has code for special handling and returning of dense outputs, which is what allows that operation to occur:

self._dense_result = False

sparse/sparse/_umath.py, lines 555 to 561 in 0d50c9d:

if not equivalent_fv and self.shape != self.ndarray_shape:
    raise ValueError(
        "Performing a mixed sparse-dense operation that would result in a dense array. "
        "Please make sure that func(sparse_fill_values, ndarrays) is a constant array."
    )
elif not equivalent_fv:
    self._dense_result = True

Not quite, there's a difference between "this is already something like an array, please convert it to one" and "this may cause a performance slowdown in some cases"

I think this is a different point. My point here was that it would be nice if my library only worked on np.ndarrays, but I could accept any array-like by simply calling some conversion function, say arr.convert_to(np.ndarray), on the passed input. It seems like the preferred way to do this is changing to np.asarray(arr, like=np.ones(()))? Though I'm not sure why passing a type as like (like=np.ndarray) isn't the API. I'm also not sure how array-likes are supposed to register conversion methods, but this may be a larger issue of me not fully understanding numpy's various dispatch mechanisms.
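For context, a sketch of how conversion currently plays out with pydata/sparse (assuming auto-densification is left at its default, disabled, setting):

import numpy as np
import sparse

s = sparse.random((10, 10), density=0.1)

# Implicit conversion is blocked: __array__ raises a RuntimeError unless
# auto-densification is enabled via the SPARSE_AUTO_DENSIFY environment
# variable.
# np.asarray(s)  # RuntimeError

# Explicit conversion always works:
arr = s.todense()  # np.ndarray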

Your NEP-37 would help solve a number of these problems.

hameerabbasi (Collaborator) commented on Apr 22, 2021

Currently we do not accept dense outputs.

What I meant by this is that we do not support ufunc(*input_arrays, out=dense_array); but yes, we do return dense output arrays.

I think this is a different point.

While I agree, it's relevant. But in light of the point that, by my definition, memory blowup could also happen with SparseArrays, I'm willing to go for conditional sparse/dense outputs, but making it consistent is out of the question, as in some cases sparse outputs may be desirable.

ivirshup (Contributor, Author) commented on Apr 24, 2021

I'm willing to go for conditional sparse/dense outputs, but making it consistent is out of the question, as in some cases sparse outputs may be desirable.

Just to be sure we're on the same page here, does something like this fit what you mean?

import numpy as np
import sparse

M, N = 20, 10  # example dimensions

sparse_mtx = sparse.random((M, N))
dense_mtx = np.random.rand(M, N)

sparse_col = sparse_mtx[:, [0]]
sparse_row = sparse_mtx[0, :]

dense_col = dense_mtx[:, [0]]
dense_row = dense_mtx[0, :]

np.add(sparse_mtx, dense_mtx)       # -> np.ndarray
np.multiply(sparse_mtx, dense_mtx)  # -> SparseArray

np.add(dense_mtx, sparse_col)       # -> np.ndarray

np.add(sparse_mtx, dense_col)       # -> np.ndarray
np.add(sparse_mtx, sparse_col)      # -> SparseArray

np.multiply(sparse_mtx, dense_col)  # -> SparseArray
np.multiply(dense_mtx, sparse_col)  # -> ???

I think this might take a significant refactor of the elemwise code. I would think that going for a more signature-based dispatch (e.g., something multiple-dispatch-y) would be very helpful for defining methods here.
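For illustration, a minimal sketch of what signature-based dispatch of the output type could look like (hypothetical code, not sparse's actual implementation; assumes a fill value of 0):

import numpy as np
import sparse

# Hypothetical rule tables: ufuncs where func(0, x) == 0 for all x preserve
# sparsity; ufuncs where func(0, x) varies with x densify mixed operands.
ANNIHILATING = {np.multiply}
DENSIFYING = {np.add, np.subtract}

def result_type(ufunc, *args):
    """Pick the output type from the (ufunc, operand types) signature."""
    if all(isinstance(a, sparse.SparseArray) for a in args):
        return sparse.SparseArray
    if ufunc in ANNIHILATING:
        return sparse.SparseArray
    if ufunc in DENSIFYING:
        return np.ndarray
    raise NotImplementedError(f"no dispatch rule for {ufunc}")

# result_type(np.add, sparse_mtx, dense_col)      -> numpy.ndarray
# result_type(np.multiply, sparse_mtx, dense_col) -> sparse.SparseArray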

I would also like to investigate why most other libraries seem to have gone for the consistency of just having these operations always return sparse arrays.

hameerabbasi (Collaborator) commented on Apr 24, 2021

Just to be sure we're on the same page here, does something like this fit what you mean?

For a fill-value of 0 and inputs lacking NaNs, yes. Surprisingly, it doesn't require a refactor at all.

ivirshup (Contributor, Author) commented on Apr 24, 2021

Surprisingly, it doesn't require a refactor at all.

I think this may be too inefficient a way to find out that a = sparse.random((100000, 100000)); a * np.ones(a.shape) returns sparse:

sparse/sparse/_umath.py, lines 531 to 543 in e036c25:

zero_args = tuple(
    arg.fill_value[...] if isinstance(arg, COO) else arg for arg in self.args
)
# Some elemwise functions require a dtype argument, some abhorr it.
try:
    fill_value_array = self.func(
        *np.broadcast_arrays(*zero_args), dtype=self.dtype, **self.kwargs
    )
except TypeError:
    fill_value_array = self.func(
        *np.broadcast_arrays(*zero_args), **self.kwargs
    )

I think it would make more sense to just say "if the operation is multiplication, we only need to look at the intersection of the sparse indices". That is, I think it makes sense to have this logic written as code, as opposed to determined dynamically.
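To spell out the inefficiency: the quoted code probes the output fill value by broadcasting the input fill values through the ufunc, which materializes arrays at the dense operands' full broadcast shape even when the answer is "stays sparse". A simplified sketch of that probe (not the library's exact code):

import numpy as np

def probe_fill_value(func, fill_value, dense_operand):
    # Broadcast the sparse operand's fill value against the dense operand
    # and apply the ufunc, much like the zero_args logic quoted above.
    fill_value_array = func(
        np.broadcast_to(fill_value, dense_operand.shape), dense_operand
    )
    # The output can stay sparse only if this probe array is constant.
    is_constant = bool((fill_value_array == fill_value_array.flat[0]).all())
    return fill_value_array.flat[0], is_constant

probe_fill_value(np.multiply, 0.0, np.arange(5.0))  # (0.0, True)  -> sparse
probe_fill_value(np.add, 0.0, np.arange(5.0))       # (0.0, False) -> dense/error

# For a (100000, 100000) dense operand, the probe itself allocates an array
# of that full shape just to learn that multiplication preserves sparsity.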

hameerabbasi (Collaborator) commented on Apr 24, 2021

I think it would make more sense to just say "if the operation is multiplication, we only need to look at the intersection of the sparse indices". That is, I think it makes sense to have this logic written as code, as opposed to determined dynamically.

While I agree, that would be a big refactor indeed. What I would prefer instead is that we defer that to the rewrite that's taking place in collaboration with the TACO team.

ivirshup (Contributor, Author) commented on Apr 24, 2021

What I would prefer instead is that we defer that to the rewrite that's taking place in collaboration with the TACO team

I hadn't realized this was something that was being worked on. Is there somewhere it can be tracked?

This definitely seems like the ideal: to just have all operations here generated by TACO. I recall reading issues about this, but hadn't seen much movement on implementation. If this is something that might become a reality soon, then I would probably want to re-evaluate implementation priorities for CSR and CSC.

hameerabbasi (Collaborator) commented on Apr 24, 2021

I hadn't realized this was something that was being worked on. Is there somewhere it can be tracked?

Mostly in the TACO repository. They're adding an array API instead of just Einsum-type operations.

Advanced indexing might have to be done on top. The work is expected to complete in June, after which we'll need to wrap it into something a bit more Pythonic.

8 remaining items not shown.