Vulkan Optimizations and Fixes #8959
Conversation
…ng and more
* fixed default sampling queue to include p_step
* changed sampling queue display to better reflect the actual logic
* added VK-specific settings `use_mmap_vk`, `flash_attn_vk`, `no_kv_offload_vk`
* added new presets for testing
I missed a validation issue in #8943, but the fix is now in this branch. I think this should be ready for a review and then merge.
```glsl
            sz * FLOAT_TYPE((data_a[ib0 + i].scales[v_im + 4] & 0x0f) | ((data_a[ib0 + i].scales[v_im] & 0xc0) >> 2)) + sw * FLOAT_TYPE((data_a[ib0 + i].scales[v_im + 5] & 0x0f) | ((data_a[ib0 + i].scales[v_im + 1] & 0xc0) >> 2))) - dmin * smin);
const uint tmp_idx = 16 * ix + tid;
tmp[tmp_idx] = fma(dall, (fma(sx, FLOAT_TYPE(data_a[ib0 + i].scales[v_im] & 0x3f), fma(sy, FLOAT_TYPE(data_a[ib0 + i].scales[v_im + 1] & 0x3f),
    fma(sz, FLOAT_TYPE((data_a[ib0 + i].scales[v_im + 4] & 0x0f) | ((data_a[ib0 + i].scales[v_im] & 0xc0) >> 2)), fma(sw, FLOAT_TYPE((data_a[ib0 + i].scales[v_im + 5] & 0x0f) | ((data_a[ib0 + i].scales[v_im + 1] & 0xc0) >> 2))))))), fma(-dmin, smin, tmp[tmp_idx]));
```
If you consider only the FMA changes, is there a measurable performance gain?
It's very hard to tell. The GLSL compiler should be emitting FMA instructions anyway; this change just makes that certain instead of leaving it to the optimizer. Hopefully it means a few more FMA calls in the SPIR-V, which could be checked.
But the SPIR-V code then gets compiled again into a device-specific, driver-internal representation, where further optimization takes place. Since there are many combinations of devices, I can't really be sure whether this helped anywhere, but I am at least confident it doesn't cause slowdowns. I haven't seen a significant performance difference on my devices.
@ggerganov @slaren Can one of you review the non-Vulkan parts of this PR and approve if that's fine?
Make sure to fix the CI before merging
Vulkan Optimizations and Fixes (ggerganov#8959)
I have implemented a number of Vulkan optimizations and fixes:
I will keep this on draft while I check a few more things, but feel free to test and benchmark. Don't expect a huge difference.