Add llama.cpp backend #2723

mfuntowicz · 2024-11-04T22:25:32Z

This PR is an initial implementation of llama.cpp as a potential backend for TGI.

It mostly targets CPU inference with single-stream or multi-stream scheduling, potentially spawning multiple instances of the same model, each over a non-overlapping subset of the CPU cores.

The current implementation only allows a single request to run on a worker at a time; this constraint will be removed later on.
The current implementation also duplicates the weights for each worker; this constraint can potentially be removed later on.
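To make the "non-overlapping subset of the CPU cores" idea concrete, here is a minimal sketch of how cores could be split between worker instances. It is not the code from this PR: the function name `partition_cores`, its parameters, and the convention that `0` cores per instance means "one instance using every core" are assumptions made for illustration only.

```rust
/// Split core ids `0..num_cores` into non-overlapping, equally sized groups,
/// one group per worker instance. Cores that do not fill a whole group are
/// left unassigned.
///
/// Hypothetical helper, not the PR's actual allocation logic.
fn partition_cores(num_cores: usize, cores_per_instance: usize) -> Vec<Vec<usize>> {
    // Assumed convention: 0 means "a single instance spanning all cores".
    let per_instance = if cores_per_instance == 0 {
        num_cores
    } else {
        cores_per_instance
    };

    // Nothing to allocate (also avoids a zero chunk size, which would panic).
    if per_instance == 0 {
        return Vec::new();
    }

    let all_cores: Vec<usize> = (0..num_cores).collect();
    all_cores
        .chunks_exact(per_instance) // only full groups, so no instance is undersized
        .map(|group| group.to_vec())
        .collect()
}
```

With 8 cores and 4 cores per instance, this yields two workers, one on cores 0 to 3 and one on cores 4 to 7. Under the constraints described above, each such worker would hold its own copy of the weights and serve one request at a time.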

Hugoch

Thanks @mfuntowicz! I can't get a correct generation using the Docker image, so I think something is wrong with the tokenizer. I'll dig deeper later today!

Dockerfile.llamacpp

backends/llamacpp/src/backend.rs

mfuntowicz added 30 commits November 14, 2024 08:42

feat(llamacpp): initial commit

aa1fcba

# Conflicts: # Cargo.lock

feat(llamacpp): correctly handle CMAKE_BUILD_TYPE for spdlog macros

7d1f8a2

feat(llamacpp): initial end2end build

52d57dc

misc(cmake): add parameter to build specific cuda arch

e4432d3

misc(cmake): wut

fa89d1e

feat(llamacpp): enable cuda

05ad684

feat(backend): correctly load llama.cpp model from llama api and not gpt2

0911076

feat(backend): tell cmake to build llama-common and link to it

098c669

feat(backend): add some initial decoding steps

45d5a6a

feat(backend): use llama_token as TokenId type

92bb113

feat(backend): minor refactor

d4b5be1

feat(backend): expose frequency and repetition penalties

37faeb3

chore(backend): minor formatting

f9c2486

feat(backend): wip Rust binding

355d8a5

feat(backend): build and link through build.rs

e4d803c

misc(build): handle different lib destination folder lib/lib64

f0859c2

misc(build): refactor build type detection in cmake

179309b

feat(llamacpp): expose number of threads for the backend when constructing the model

a316c53

feat(llamacpp): wip explosion

0c1dd0e

misc(offline): link correctly

dbc5b7a

misc(offline): expose more parameters for generate

6115904

feat(backend): entirely rewrite backend

b98c635

misc(offline): update offline tester

6a5f6b0

feat(backend): full rework of the backend internal to safer c++

d52b4c4

misc(offline): match rework

3af2c68

feat(backend): add mapping for ignore_eos_token stopping criteria

f39edc7

feat(backend): add logit parameter in the callback fn

d4aee42

feat(backend): bind incoming request to the server

612f2f9

feat(backend): avoid dropping the boxed stream at the end of the callback

b50dcdd

feat(backend): somewhat generates the final infer response

3e82f14

mfuntowicz added 3 commits November 22, 2024 13:32

feat(backend): rely on multi consumer queue to scheduler workers

5a85661

misc(docker): add numa lib as dependency

30ae996

misc(backend): allow rebinding numa core affinity

2d9465d

mfuntowicz marked this pull request as ready for review November 22, 2024 13:47

mfuntowicz added 3 commits November 22, 2024 14:48

misc(license): update LICENSE

4ee2ee5

misc(doc): c++ documentation

b9c04b9

misc(doc): rust documentation

862a519

mfuntowicz requested review from co42, Hugoch and OlivierDehaene November 22, 2024 14:37

chore: remove unrelated change to trtllm

9025a26

mfuntowicz requested a review from angt November 22, 2024 15:02

Hugoch reviewed Nov 27, 2024

mfuntowicz and others added 17 commits November 28, 2024 09:53

Update Dockerfile.llamacpp as per review

bbe95ca

Co-authored-by: Hugo Larcher <[email protected]>

Update Dockerfile.llamacpp as per review

d918e6a

Co-authored-by: Hugo Larcher <[email protected]>

feat(backend): remove core overriding in the Rust backend

274cfce

feat(backend): use the new batch api from llama

8e89793

feat(backend): fix when num_cores_per_instance is equals to zero with the size of the generated core allocation

298367c

feat(backend): add some test to the backend for core allocation

929a2fc

feat(backend): add guard in case top_k = 0

df72c56

feat(backend): add missing temperature parameter

9d659f1

misc(offline): update model creation as std::shared_ptr

6c5a75b

feat(backend): update llama.cpp to 4215

b1ebc8f

feat(backend): create llama_context_params with default factory

dc6435e

feat(backend): use new batch API to generate tokens

b10eaab

feat: Fix Cmakelist to allow building on Darwin platform (#2785)

59b0ef3

* feat: Fix Cmakelist to allow building on Darwin platform * fix: Fix tokenizer in llama.cpp Dockerfile

feat(backend): correctly link to all libraries

f5c4cee

feat(backend): add mimalloc memory allocator to the container

db41776

feat(backend): better map exception throw on C++ side

c9f6c3a

feat(backend): use c++ defined types for llama.cpp

e0dda9b
