
Inconsistent elemwise auto-densify rules #460

ivirshup (Contributor)

Description

Describe the bug

Sometimes element-wise operations return a dense result, and sometimes they raise an error. The behavior seems to depend on broadcasting.

To Reproduce

import sparse
import numpy as np

s = sparse.random((20, 10), .5)
dense_mtx = np.random.rand(20, 10)
dense_vec = np.random.rand(10) 

print(type(s + dense_mtx))
# <class 'numpy.ndarray'>

s + dense_vec
---------------------------------------------------------------------------
ValueError                                Traceback (most recent call last)
<ipython-input-1-8f24377c4243> in <module>
      9 # np.ndarray
     10 
---> 11 s + dense_vec

/usr/local/lib/python3.8/site-packages/numpy/lib/mixins.py in func(self, other)
     19         if _disables_array_ufunc(other):
     20             return NotImplemented
---> 21         return ufunc(self, other)
     22     func.__name__ = '__{}__'.format(name)
     23     return func

~/github/sparse/sparse/_sparse_array.py in __array_ufunc__(self, ufunc, method, *inputs, **kwargs)
    303 
    304         if method == "__call__":
--> 305             result = elemwise(ufunc, *inputs, **kwargs)
    306         elif method == "reduce":
    307             result = SparseArray._reduce(ufunc, *inputs, **kwargs)

~/github/sparse/sparse/_umath.py in elemwise(func, *args, **kwargs)
     47     """
     48 
---> 49     return _Elemwise(func, *args, **kwargs).get_result()
     50 
     51 

~/github/sparse/sparse/_umath.py in __init__(self, func, *args, **kwargs)
    460 
    461         self._check_broadcast()
--> 462         self._get_fill_value()
    463 
    464     def get_result(self):

~/github/sparse/sparse/_umath.py in _get_fill_value(self)
    554         equivalent_fv = equivalent(fill_value, fill_value_array).all()
    555         if not equivalent_fv and self.shape != self.ndarray_shape:
--> 556             raise ValueError(
    557                 "Performing a mixed sparse-dense operation that would result in a dense array. "
    558                 "Please make sure that func(sparse_fill_values, ndarrays) is a constant array."

ValueError: Performing a mixed sparse-dense operation that would result in a dense array. Please make sure that func(sparse_fill_values, ndarrays) is a constant array.

Expected behavior

Consistency. I'd like a dense output for this operation (following #10 (comment)), but would be happy with the consistency of:

  • Always error
  • Always return dense
  • Always return sparse

System

  • OS and version: macOS 10.15
  • sparse version (sparse.__version__) 0.12.0+8.g75125cd
  • NumPy version (np.__version__) 1.20.2
  • Numba version (numba.__version__) 0.53.1

Activity

The "bug" label (indicates an unexpected problem or unintended behavior) was added on Apr 21, 2021.
hameerabbasi (Collaborator) commented on Apr 21, 2021

The reasoning behind this one was that doing np.add.outer(sparse_1d_array, dense_1d_array) should fail because the result would be effectively dense, and it'd blow up memory due to the densification of the sparse array, whereas np.multiply.outer(sparse_1d_array, dense_1d_array) is completely fine (and will be sparse).

One option is to allow "broadcasting" fill-values.
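A minimal illustration of this asymmetry (a sketch, assuming the default fill value of 0):

import numpy as np
import sparse

s = sparse.random((1000,), density=0.01)  # fill value is 0
d = np.random.rand(1000)

# 0 * d[j] == 0 for every j, so the output keeps a constant fill value
# of 0 and can stay sparse:
out = np.multiply.outer(s, d)

# 0 + d[j] varies with j, so no single fill value describes the result;
# it would be effectively dense, and this raises the ValueError shown above:
np.add.outer(s, d)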

ivirshup (Contributor, Author) commented on Apr 22, 2021

The reasoning behind this one was that doing np.add.outer(sparse_1d_array, dense_1d_array)

But wouldn't this still blow up memory if both of these arrays were dense?

I'm not sure I agree that it's this library's job to error here, and I'm not aware of any sparse matrix library that refuses this kind of operation (examples being scipy.sparse, R's Matrix, and Julia's SparseArrays). I think a warning here is very reasonable, but it would be nice if these acted like arrays. I would note that R and Julia return sparse arrays from these operations, while scipy returns a dense array.

One option is to allow "broadcasting" fill-values.

Personally, I'm not completely convinced that allowing custom fill values is worth the complexity. It definitely has some cool applications, but to me the main goal is having working sparse arrays with an array interface.

I think this solution makes a lot of sense for a lazy interface, but is just a lot of complexity for an eager one.

hameerabbasi (Collaborator) commented on Apr 22, 2021

I'll have to rephrase this one: np.add.outer(sparse_1d_array, dense_1d_array) uses memory that's beyond O(sparse_memory * dense_memory), whereas np.multiply.outer(sparse_1d_array, dense_1d_array) fits in that range. Specifically, if we can't fit an algorithm that's better than a dense algorithm into our paradigm, we error; that's our definition of "memory blowup". In this particular case, one can simply do np.add.outer(sparse_1d_array.todense(), dense_1d_array).
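To make that bound concrete, some rough storage arithmetic (illustrative numbers, assuming a fill value of 0):

n = 1_000_000  # logical length of the sparse 1-d operand
nnz = 100      # stored nonzeros in the sparse operand
m = 1_000      # length of the dense 1-d operand

# multiply.outer: a stored value can only appear where the sparse operand
# is nonzero, so at most nnz * m = 100_000 entries; within the bound.
multiply_outer_stored = nnz * m

# add.outer: nearly all n * m = 1_000_000_000 entries are nonzero, so the
# result is effectively dense and exceeds the bound.
add_outer_entries = n * m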

hameerabbasi (Collaborator) commented on Apr 22, 2021

This is also in line with the decision made over at #218.

ivirshup (Contributor, Author) commented on Apr 22, 2021

I'll have to rephrase this one: np.add.outer(sparse_1d_array, dense_1d_array) uses memory that's beyond O(sparse_memory * dense_memory)

This isn't strictly true if the output array is dense, which would follow the current behaviour of sparse_nd_array + dense_nd_array -> dense_nd_array.

But also, doesn't the same general principle of large memory growth apply if both of those arrays were sparse?

np.multiply.outer(sparse_1d_array, dense_1d_array)

As a counter example here: np.multiply.outer(sparse.random(100, .1, fill_value=1), np.arange(100)). This seems like a case that would easily slip through testing, but show up in downstream use (a concern brought up here: #218 (comment)).

This is also in line with the decision made over at #218.

To my reading, the short discussion there is more focussed on the behavior of np.asarray than on the results of broadcasting. I think the results of operations on mixed sparse and dense arrays could use more discussion. I'd be interested to know why libraries in other languages tend to return sparse array outputs from these operations, even though the data can be dense.

That said, it would be really nice if there were a universal way to convert a numpy-array-like to a numpy array, which I thought was the intent of np.asarray.

hameerabbasi (Collaborator) commented on Apr 22, 2021

This isn't strictly true if the output array is dense, which would follow the current behaviour of sparse_nd_array + dense_nd_array -> dense_nd_array.

Currently we do not accept dense outputs.

But also, doesn't the same general principle of large memory growth apply if both of those arrays were sparse?

This is true, I hadn't considered that.

As a counter example here: np.multiply.outer(sparse.random(100, .1, fill_value=1), np.arange(100)). This seems like a case that would easily slip through testing, but show up in downstream use (a concern brought up here: #218 (comment)).

I was assuming a fill-value of 0 in this case, and this case is, indeed, handled more generally such that there aren't blowups. I do agree it could be tested better.

That said, it would be really nice if there were a universal way to convert a numpy-array-like to a numpy array, which I thought was the intent of np.asarray.

Not quite, there's a difference between "this is already something like an array, please convert it to one" and "this may cause a performance slowdown in some cases" (like CuPy, which disallows __array__), and the worst case: "this may blow up memory so badly I may have to restart my machine or face OOM errors" (which is the case here).

ivirshup (Contributor, Author) commented on Apr 22, 2021

Currently we do not accept dense outputs.

This doesn't seem to be the case to me. The example at the top of the issue (adding a sparse array and a dense array, resulting in a dense array) works on current master. _Elemwise has code for special handling and returning of dense outputs, which is what allows that operation to occur:

self._dense_result = False

sparse/sparse/_umath.py, lines 555 to 561 in 0d50c9d:

if not equivalent_fv and self.shape != self.ndarray_shape:
    raise ValueError(
        "Performing a mixed sparse-dense operation that would result in a dense array. "
        "Please make sure that func(sparse_fill_values, ndarrays) is a constant array."
    )
elif not equivalent_fv:
    self._dense_result = True

Not quite, there's a difference between "this is already something like an array, please convert it to one" and "this may cause a performance slowdown in some cases"

I think this is a different point. My point here was that it would be nice if my library only worked on np.ndarrays, but I could accept any array-like by simply calling some conversion function, say arr.convert_to(np.ndarray), on the passed input. It seems like the preferred way to do this is changing to np.asarray(arr, like=np.ones(()))? Though I'm not sure why passing a type as like (like=np.ndarray) isn't the API. I'm also not sure how array-likes are supposed to register conversion methods, but this may be a larger issue of me not fully understanding numpy's various dispatch mechanisms.
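For context, a sketch of how conversion currently plays out with pydata/sparse (assuming auto-densification is left at its default, disabled, setting):

import numpy as np
import sparse

s = sparse.random((10, 10), density=0.1)

# Implicit conversion is blocked: __array__ raises a RuntimeError unless
# auto-densification is enabled via the SPARSE_AUTO_DENSIFY environment
# variable.
# np.asarray(s)  # RuntimeError

# Explicit conversion always works:
arr = s.todense()  # np.ndarray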

Your NEP-37 would help solve a number of these problems.

hameerabbasi (Collaborator) commented on Apr 22, 2021

Currently we do not accept dense outputs.

What I meant by this is that we do not support ufunc(*input_arrays, out=dense_array); but yes, we do return dense output arrays.

I think this is a different point.

While I agree, it's relevant. But in light of the point that, by my definition, memory blowup could also happen with SparseArrays, I'm willing to go for conditional sparse/dense outputs, but making it consistent is out of the question, as in some cases sparse outputs may be desirable.

ivirshup (Contributor, Author) commented on Apr 24, 2021

I'm willing to go for conditional sparse/dense outputs, but making it consistent is out of the question, as in some cases sparse outputs may be desirable.

Just to be sure we're on the same page here, does something like this fit what you mean?

import numpy as np
import sparse

M, N = 20, 10  # example dimensions

sparse_mtx = sparse.random((M, N))
dense_mtx = np.random.rand(M, N)

sparse_col = sparse_mtx[:, [0]]
sparse_row = sparse_mtx[0, :]

dense_col = dense_mtx[:, [0]]
dense_row = dense_mtx[0, :]

np.add(sparse_mtx, dense_mtx)       # -> np.ndarray
np.multiply(sparse_mtx, dense_mtx)  # -> SparseArray

np.add(dense_mtx, sparse_col)       # -> np.ndarray

np.add(sparse_mtx, dense_col)       # -> np.ndarray
np.add(sparse_mtx, sparse_col)      # -> SparseArray

np.multiply(sparse_mtx, dense_col)  # -> SparseArray
np.multiply(dense_mtx, sparse_col)  # -> ???

I think this might take a significant refactor of the elemwise code. I would think that going for a more signature-based dispatch (e.g., something multiple-dispatch-y) would be very helpful for defining methods here.
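For illustration, a minimal sketch of what signature-based dispatch of the output type could look like (hypothetical code, not sparse's actual implementation; assumes a fill value of 0):

import numpy as np
import sparse

# Hypothetical rule tables: ufuncs where func(0, x) == 0 for all x preserve
# sparsity; ufuncs where func(0, x) varies with x densify mixed operands.
ANNIHILATING = {np.multiply}
DENSIFYING = {np.add, np.subtract}

def result_type(ufunc, *args):
    """Pick the output type from the (ufunc, operand types) signature."""
    if all(isinstance(a, sparse.SparseArray) for a in args):
        return sparse.SparseArray
    if ufunc in ANNIHILATING:
        return sparse.SparseArray
    if ufunc in DENSIFYING:
        return np.ndarray
    raise NotImplementedError(f"no dispatch rule for {ufunc}")

# result_type(np.add, sparse_mtx, dense_col)      -> numpy.ndarray
# result_type(np.multiply, sparse_mtx, dense_col) -> sparse.SparseArray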

I would also like to investigate why most other libraries seem to have gone for the consistency of just having these operations always return sparse arrays.

hameerabbasi (Collaborator) commented on Apr 24, 2021

Just to be sure we're on the same page here, does something like this fit what you mean?

For a fill-value of 0 and inputs lacking NaNs, yes. Surprisingly, it doesn't require a refactor at all.

ivirshup (Contributor, Author) commented on Apr 24, 2021

Surprisingly, it doesn't require a refactor at all.

I think this may be too inefficient a way to find out that a = sparse.random((100000, 100000)); a * np.ones(a.shape) returns sparse:

sparse/sparse/_umath.py, lines 531 to 543 in e036c25:

zero_args = tuple(
    arg.fill_value[...] if isinstance(arg, COO) else arg for arg in self.args
)
# Some elemwise functions require a dtype argument, some abhorr it.
try:
    fill_value_array = self.func(
        *np.broadcast_arrays(*zero_args), dtype=self.dtype, **self.kwargs
    )
except TypeError:
    fill_value_array = self.func(
        *np.broadcast_arrays(*zero_args), **self.kwargs
    )

I think it would make more sense to just say "if the operation is multiplication, we only need to look at the intersection of the sparse indices". That is, I think it makes sense to have this logic written as code, as opposed to determined dynamically.
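To spell out the inefficiency: the quoted code probes the output fill value by broadcasting the input fill values through the ufunc, which materializes arrays at the dense operands' full broadcast shape even when the answer is "stays sparse". A simplified sketch of that probe (not the library's exact code):

import numpy as np

def probe_fill_value(func, fill_value, dense_operand):
    # Broadcast the sparse operand's fill value against the dense operand
    # and apply the ufunc, much like the zero_args logic quoted above.
    fill_value_array = func(
        np.broadcast_to(fill_value, dense_operand.shape), dense_operand
    )
    # The output can stay sparse only if this probe array is constant.
    is_constant = bool((fill_value_array == fill_value_array.flat[0]).all())
    return fill_value_array.flat[0], is_constant

probe_fill_value(np.multiply, 0.0, np.arange(5.0))  # (0.0, True)  -> sparse
probe_fill_value(np.add, 0.0, np.arange(5.0))       # (0.0, False) -> dense/error

# For a (100000, 100000) dense operand, the probe itself allocates an array
# of that full shape just to learn that multiplication preserves sparsity.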

hameerabbasi (Collaborator) commented on Apr 24, 2021

I think it would make more sense to just say "if the operation is multiplication, we only need to look at the intersection of the sparse indices". That is, I think it makes sense to have this logic written as code, as opposed to determined dynamically.

While I agree, that would be a big refactor indeed. What I would prefer instead is that we defer that to the rewrite that's taking place in collaboration with the TACO team.

ivirshup (Contributor, Author) commented on Apr 24, 2021

What I would prefer instead is that we defer that to the rewrite that's taking place in collaboration with the TACO team

I hadn't realized this was something that was being worked on. Is there somewhere it can be tracked?

This definitely seems like the ideal: to just have all operations here generated by TACO. I recall reading issues about this, but hadn't seen much movement on implementation. If this is something that might become a reality soon, then I would probably want to re-evaluate implementation priorities for CSR and CSC.

hameerabbasi (Collaborator) commented on Apr 24, 2021

I hadn't realized this was something that was being worked on. Is there somewhere it can be tracked?

Mostly in the TACO repository. They're adding an array API instead of just Einsum-type operations.

Advanced indexing might have to be done on top. The work is expected to complete in June, after which we'll need to wrap it into something a bit more Pythonic.

8 remaining items not shown.