Add CUDA hardware acceleration for textgen #70

Open
wants to merge 1 commit into master
Conversation

7omb (Contributor) commented Nov 25, 2023

Add cuBLAS hardware acceleration to llama-cpp-python. This allows layers of GGUF models such as Llama-2-13B-chat-GGUF to be offloaded to the GPU with the n-gpu-layers setting:

llm_load_tensors: using CUDA for GPU acceleration
llm_load_tensors: mem required  = 5363.06 MB (+ 3200.00 MB per state)
llm_load_tensors: offloading 16 repeating layers to GPU
llm_load_tensors: offloaded 16/43 layers to GPU
llm_load_tensors: VRAM used: 3439 MB
...................................................................................................
llama_new_context_with_model: kv self size  = 3200.00 MB
llama_new_context_with_model: compute buffer total size =  351.47 MB
llama_new_context_with_model: VRAM scratch buffer: 350.00 MB
AVX = 1 | AVX2 = 1 | AVX512 = 0 | AVX512_VBMI = 0 | AVX512_VNNI = 0 | FMA = 1 | NEON = 0 | ARM_FMA = 0 | F16C = 1 | FP16_VA = 0 | WASM_SIMD = 0 | BLAS = 1 | SSE3 = 1 | SSSE3 = 1 | VSX = 0 |
2023-11-25 19:13:45 INFO:Loaded the model in 1.31 seconds.
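
For reference, a minimal sketch of what this setting maps to when using the llama-cpp-python API directly. The model path and layer count here are illustrative, and it assumes the package was built with cuBLAS support (e.g. CMAKE_ARGS="-DLLAMA_CUBLAS=on" pip install llama-cpp-python):

from llama_cpp import Llama

# Hypothetical local path to a quantized GGUF file from Llama-2-13B-chat-GGUF.
llm = Llama(
    model_path="./llama-2-13b-chat.Q4_K_M.gguf",
    n_gpu_layers=16,  # offload 16 of the 43 layers to the GPU, matching the log above
)

out = llm("Q: What does n_gpu_layers control? A:", max_tokens=64)
print(out["choices"][0]["text"])

Setting n_gpu_layers higher offloads more layers (and uses more VRAM); -1 offloads all of them.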
