Hi, thanks for developing and open-sourcing such a cornerstone quantization method for LLMs.

I have a question about the scaled activation function. According to the paper, the method scales the weights based on observed activation statistics, but it looks like this code also applies activation scaling inside every activation function. Is this step necessary, and where can I find a specific explanation for why it exists? A sketch of how I currently read this part is below.

Also, the scaled activation stores the given scales as a learnable parameter. How much is this likely to affect the result?
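For reference, here is a minimal sketch of how I understand the scaled-activation wrapper; the module and parameter names are my own guess and not necessarily what the repo actually uses:

```python
import torch
import torch.nn as nn

class ScaledActivation(nn.Module):
    """Hypothetical wrapper around an existing activation module.

    My understanding: the activation's output is divided by per-channel
    scales so that the following linear layer's weights can be multiplied
    by the same scales, leaving the network's output mathematically
    unchanged while making the weights easier to quantize.
    """

    def __init__(self, act_module: nn.Module, scales: torch.Tensor):
        super().__init__()
        self.act = act_module
        # Stored as a Parameter here (assumption) -- this is exactly the
        # "learnable parameter from the given scales" I am asking about.
        self.scales = nn.Parameter(scales)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # Divide the activation output by the per-channel scales.
        return self.act(x) / self.scales.view(1, 1, -1).to(x.device)
```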