cuBLAS can run now but max token size is greatly reduced #231
Replies: 4 comments
-
There are many parameters to set - what batch size are you using? Is f16 enabled?
-
@mudler I'm only setting mirostat = 2, temp = 0.3, ngl = 43, t = 1, ctx = 1920, n = 1920. These are the prompt parameters I use with llama.cpp, and they work there. How do I disable F16Mem?
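For reference, this is roughly how those flags would map onto the go-llama side (a minimal sketch based on the functional-options API shown in the go-skynet/go-llama.cpp README; option names can differ between versions, and the model path is hypothetical). F16 memory is an option passed to `llama.New`, so leaving it out should disable it:

```go
package main

import (
	"fmt"

	llama "github.com/go-skynet/go-llama.cpp"
)

func main() {
	// Model options roughly matching the flags above:
	// ctx = 1920 -> llama.SetContext, ngl = 43 -> llama.SetGPULayers.
	// F16 memory is opt-in here: leave out llama.EnableF16Memory to disable it.
	l, err := llama.New(
		"./models/7B/ggml-model-q4_0.bin", // hypothetical model path
		llama.SetContext(1920),
		llama.SetGPULayers(43),
		// llama.EnableF16Memory, // omit this option to run without f16 memory
	)
	if err != nil {
		panic(err)
	}
	defer l.Free()

	// Prediction options: n = 1920 -> llama.SetTokens, t = 1 -> llama.SetThreads.
	// Mirostat/temperature setters are left out; check PredictOptions in your
	// version of the package for the exact names.
	out, err := l.Predict("Hello", llama.SetTokens(1920), llama.SetThreads(1))
	if err != nil {
		panic(err)
	}
	fmt.Println(out)
}
```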
-
@mudler any help getting go-llama to run with the llama.cpp settings mentioned above?
-
Bringing this discussion up to the latest question.
-
It can run, but I can't generate the same number of tokens as I can without Go. Why?
With an RTX 4060, I can do 1920 max tokens using pure llama.cpp with 100% CUDA offload.
With go-llama, I can only go up to a ctx size of around 650 before hitting OOM.
@mudler do you know why? How do I fix this?
Same settings as llama.cpp, just in Go...
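One thing worth ruling out is whether the two runs are really like-for-like on the options that reserve VRAM at load time. A rough sketch of those knobs in go-llama follows (again assuming the go-skynet/go-llama.cpp option names; `llama.SetNBatch` is a guess at a batch-size setter and may not exist in every version; the model path is hypothetical). If I remember right, llama.cpp's `main` defaults to an f16 KV cache and a batch size of 512, so matching those on the Go side is the closest comparison:

```go
package main

import llama "github.com/go-skynet/go-llama.cpp"

func main() {
	// Load-time options are what reserve VRAM up front: the KV cache is
	// allocated for the full context, and an f16 cache needs half the memory
	// of an f32 one.
	l, err := llama.New(
		"./models/7B/ggml-model-q4_0.bin", // hypothetical model path
		llama.SetContext(1920),
		llama.SetGPULayers(43), // same as -ngl 43 / 100% offload
		llama.EnableF16Memory,  // f16 KV cache, matching llama.cpp's default
		// llama.SetNBatch(512), // hypothetical setter to match llama.cpp's default batch
	)
	if err != nil {
		panic(err)
	}
	defer l.Free()
}
```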