I want to run a model on CUDA with actual int8 instructions instead of fake-quantized float32 operations, and enjoy the efficiency gains. However, inference is slower when I set weights=qint8, activations=qint8 than with weights=qint8 alone.
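For reference, this is roughly how the two configurations are applied. This is a minimal sketch assuming the optimum-quanto `quantize` / `Calibration` / `freeze` API; `model_a`, `model_b`, and `calibration_samples` are hypothetical placeholders:

```python
import torch
from optimum.quanto import quantize, freeze, qint8, Calibration

# Option A: weights-only quantization (float activations x int8 weights).
quantize(model_a, weights=qint8)
freeze(model_a)

# Option B: weights + activations (int8 x int8 matmuls), which needs a
# calibration pass to record activation ranges before freezing.
quantize(model_b, weights=qint8, activations=qint8)
with Calibration():
    for batch in calibration_samples:  # hypothetical calibration data
        model_b(batch.to("cuda"))
freeze(model_b)
```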
@shencuifeng This is because the cost of quantizing activations on the fly is not compensated by the faster int8 matmul, especially since the float x int8 matmul used in weights-only quantization may already benefit from an accelerated kernel, depending on the float type.
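A simple way to see this trade-off is to time both variants on the same input. A minimal sketch, reusing the hypothetical `model_a` (weights only) and `model_b` (weights + activations) from above, plus a hypothetical `example_input` tensor on CUDA:

```python
import time
import torch

def measure_latency(model, example_input, n_iters=50):
    # Warm up, then average the latency of n_iters forward passes on CUDA.
    with torch.inference_mode():
        for _ in range(5):
            model(example_input)
        torch.cuda.synchronize()
        start = time.perf_counter()
        for _ in range(n_iters):
            model(example_input)
        torch.cuda.synchronize()
    return (time.perf_counter() - start) / n_iters

# Dynamic activation quantization happens inside each forward pass, so its
# cost shows up here and can outweigh the int8 x int8 matmul speedup.
print("weights only          :", measure_latency(model_a, example_input))
print("weights + activations :", measure_latency(model_b, example_input))
```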