Our forward/backward prop implementation requires that we store every layer's activations and error signals. However, fusing entry-wise operations together would let us avoid storing those intermediate values, increasing our effective memory capacity. Since these operations are typically memory-bound, fusion would also improve performance: the data would be accessed once instead of at every forward/backward prop step. Steps to implement this functionality:
1. The fused entry-wise layer can execute a series of operations, possibly implemented as lambda functions.
2. The fused entry-wise layer can perform entry-wise automatic differentiation of its sequence of operations (a minimal sketch of these first two items follows this list).
3. The model can parse its layer graph and construct an appropriate fused entry-wise layer.
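
For illustration, here is a minimal host-side sketch of what the first two items could look like (the class, struct, and function names are hypothetical and do not reflect LBANN's actual layer interface): each op supplies its forward function and derivative, forward prop applies the whole sequence in one pass, and backward prop recomputes per-entry intermediates while accumulating the chain rule, so only the layer input needs to be stored.

```cpp
// Minimal sketch of a fused entry-wise layer (hypothetical API, not LBANN's
// actual layer interface). Each op provides f(x) and f'(x); the layer composes
// them so that no intermediate activations are stored.
#include <cmath>
#include <functional>
#include <iostream>
#include <vector>

struct EntrywiseOp {
  std::function<double(double)> f;   // forward:    y = f(x)
  std::function<double(double)> df;  // derivative: f'(x)
};

class FusedEntrywiseLayer {
public:
  void add_op(EntrywiseOp op) { m_ops.push_back(std::move(op)); }

  // Forward prop: apply the whole sequence in one pass over the data.
  // Only the layer input is kept for backward prop.
  std::vector<double> forward(const std::vector<double>& input) {
    m_input = input;
    std::vector<double> out(input.size());
    for (std::size_t i = 0; i < input.size(); ++i) {
      double x = input[i];
      for (const auto& op : m_ops) { x = op.f(x); }
      out[i] = x;
    }
    return out;
  }

  // Backward prop: recompute intermediates per entry and accumulate the
  // chain rule, d(out)/d(in) = prod_k f_k'(x_k), entry by entry.
  std::vector<double> backward(const std::vector<double>& grad_output) {
    std::vector<double> grad_input(grad_output.size());
    for (std::size_t i = 0; i < grad_output.size(); ++i) {
      double x = m_input[i];
      double deriv = 1.0;
      for (const auto& op : m_ops) {
        deriv *= op.df(x);
        x = op.f(x);
      }
      grad_input[i] = grad_output[i] * deriv;
    }
    return grad_input;
  }

private:
  std::vector<EntrywiseOp> m_ops;
  std::vector<double> m_input;  // the only stored tensor
};

int main() {
  // Example: fuse y = exp(2*x), i.e. scale-by-2 followed by exp.
  FusedEntrywiseLayer layer;
  layer.add_op({[](double x) { return 2.0 * x; },    [](double)   { return 2.0; }});
  layer.add_op({[](double x) { return std::exp(x); }, [](double x) { return std::exp(x); }});
  auto y  = layer.forward({0.0, 1.0});
  auto dx = layer.backward({1.0, 1.0});
  std::cout << y[1] << " " << dx[1] << "\n";  // exp(2) and 2*exp(2)
}
```

The trade-off is recomputation during backward prop in exchange for not storing intermediate activations; since these ops are memory-bound, recompute should generally be cheaper than the extra memory traffic.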
This functionality will become especially important if #193 is implemented, since custom objective functions will often require a sequence of entry-wise operations prior to a reduction.
I'm not sure if this is currently possible, since CUDA kernels don't support polymorphism. Attempts to mimic polymorphism with device function pointers haven't had any success.
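
One direction that sidesteps runtime polymorphism entirely is to compose the entry-wise operations as functors at compile time and pass the composed functor into a templated kernel, so nvcc can inline the whole sequence. The obvious limitation is that the operation sequence must then be known at compile time, which is in tension with building the fused layer from the model's layer graph at runtime. A rough, untested sketch (all names hypothetical):

```cuda
// Sketch of compile-time composition of entry-wise ops (hypothetical names,
// not LBANN code). No virtual dispatch or device function pointers are needed;
// the composed functor is resolved and inlined at compile time.
#include <cmath>
#include <cstdio>

struct Scale2 {
  __device__ double operator()(double x) const { return 2.0 * x; }
};
struct Exp {
  __device__ double operator()(double x) const { return exp(x); }
};

// Compose two entry-wise ops into one.
template <typename Op1, typename Op2>
struct Compose {
  Op1 op1; Op2 op2;
  __device__ double operator()(double x) const { return op2(op1(x)); }
};

template <typename Op>
__global__ void fused_entrywise_kernel(const double* in, double* out, int n, Op op) {
  int i = blockIdx.x * blockDim.x + threadIdx.x;
  if (i < n) { out[i] = op(in[i]); }  // single read, single write per entry
}

int main() {
  const int n = 4;
  double h_in[n] = {0.0, 1.0, 2.0, 3.0};
  double *d_in, *d_out;
  cudaMalloc(&d_in, n * sizeof(double));
  cudaMalloc(&d_out, n * sizeof(double));
  cudaMemcpy(d_in, h_in, n * sizeof(double), cudaMemcpyHostToDevice);

  Compose<Scale2, Exp> fused_op{};  // y = exp(2*x), fused into one kernel
  fused_entrywise_kernel<<<1, 64>>>(d_in, d_out, n, fused_op);

  double h_out[n];
  cudaMemcpy(h_out, d_out, n * sizeof(double), cudaMemcpyDeviceToHost);
  for (int i = 0; i < n; ++i) { printf("%f\n", h_out[i]); }
  cudaFree(d_in);
  cudaFree(d_out);
  return 0;
}
```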
I think it may be worth looking at what other frameworks do, since fusing operations is a common optimization. TensorFlow uses a combination of manually fused operations and its XLA compiler, which can fuse both ahead of time and just-in-time. PyTorch has tensor comprehensions. Caffe2 (which is merging into PyTorch) also does kernel fusion for deployment. MXNet does fusion as well. Chainer doesn't do it yet, but appears to be moving in that direction.