Commit v2.0.0 (#1736)
OlivierDehaene authored Apr 12, 2024
1 parent 275caa0 commit c38a7d7
Showing 14 changed files with 124 additions and 133 deletions.
213 changes: 95 additions & 118 deletions Cargo.lock

Large diffs are not rendered by default.

2 changes: 1 addition & 1 deletion Cargo.toml
@@ -9,7 +9,7 @@ members = [
resolver = "2"

[workspace.package]
version = "1.4.5"
version = "2.0.0"
edition = "2021"
authors = ["Olivier Dehaene"]
homepage = "https://github.com/huggingface/text-generation-inference"
6 changes: 3 additions & 3 deletions README.md
@@ -76,7 +76,7 @@ For a detailed starting guide, please see the [Quick Tour](https://huggingface.c
model=HuggingFaceH4/zephyr-7b-beta
volume=$PWD/data # share a volume with the Docker container to avoid downloading weights every run

docker run --gpus all --shm-size 1g -p 8080:80 -v $volume:/data ghcr.io/huggingface/text-generation-inference:1.4 --model-id $model
docker run --gpus all --shm-size 1g -p 8080:80 -v $volume:/data ghcr.io/huggingface/text-generation-inference:2.0 --model-id $model
```

And then you can make requests like
@@ -90,7 +90,7 @@ curl 127.0.0.1:8080/generate_stream \

**Note:** To use NVIDIA GPUs, you need to install the [NVIDIA Container Toolkit](https://docs.nvidia.com/datacenter/cloud-native/container-toolkit/install-guide.html). We also recommend using NVIDIA drivers with CUDA version 12.2 or higher. To run the Docker container on a machine with no GPUs or CUDA support, it is enough to remove the `--gpus all` flag and add `--disable-custom-kernels`; please note that CPU is not the intended platform for this project, so performance might be subpar.

**Note:** TGI supports AMD Instinct MI210 and MI250 GPUs. Details can be found in the [Supported Hardware documentation](https://huggingface.co/docs/text-generation-inference/supported_models#supported-hardware). To use AMD GPUs, please use `docker run --device /dev/kfd --device /dev/dri --shm-size 1g -p 8080:80 -v $volume:/data ghcr.io/huggingface/text-generation-inference:1.4-rocm --model-id $model` instead of the command above.
**Note:** TGI supports AMD Instinct MI210 and MI250 GPUs. Details can be found in the [Supported Hardware documentation](https://huggingface.co/docs/text-generation-inference/supported_models#supported-hardware). To use AMD GPUs, please use `docker run --device /dev/kfd --device /dev/dri --shm-size 1g -p 8080:80 -v $volume:/data ghcr.io/huggingface/text-generation-inference:2.0-rocm --model-id $model` instead of the command above.

To see all options to serve your models (in the [code](https://github.com/huggingface/text-generation-inference/blob/main/launcher/src/main.rs) or in the cli):
```
@@ -120,7 +120,7 @@ model=meta-llama/Llama-2-7b-chat-hf
volume=$PWD/data # share a volume with the Docker container to avoid downloading weights every run
token=<your cli READ token>

docker run --gpus all --shm-size 1g -e HUGGING_FACE_HUB_TOKEN=$token -p 8080:80 -v $volume:/data ghcr.io/huggingface/text-generation-inference:1.4 --model-id $model
docker run --gpus all --shm-size 1g -e HUGGING_FACE_HUB_TOKEN=$token -p 8080:80 -v $volume:/data ghcr.io/huggingface/text-generation-inference:2.0 --model-id $model
```
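
For illustration, a minimal Python sketch (hypothetical client code, not part of this diff) of querying a server started with one of the `docker run` commands above via the non-streaming `/generate` endpoint; the README's own example uses `curl` against `/generate_stream`:

```python
# Hypothetical client sketch: query a TGI server listening on 127.0.0.1:8080
# (as launched by the docker run commands above) via the /generate endpoint.
import requests

response = requests.post(
    "http://127.0.0.1:8080/generate",
    json={
        "inputs": "What is Deep Learning?",
        "parameters": {"max_new_tokens": 20},
    },
    timeout=60,
)
response.raise_for_status()
# The non-streaming endpoint returns the completion under "generated_text".
print(response.json()["generated_text"])
```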

### A note on Shared Memory (shm)
2 changes: 1 addition & 1 deletion docs/openapi.json
@@ -10,7 +10,7 @@
"name": "Apache 2.0",
"url": "https://www.apache.org/licenses/LICENSE-2.0"
},
"version": "1.4.5"
"version": "2.0.0"
},
"paths": {
"/": {
(file name not rendered)
@@ -17,7 +17,7 @@
"id": "",
"model": "TinyLlama/TinyLlama-1.1B-Chat-v1.0",
"object": "text_completion",
"system_fingerprint": "1.4.5-native",
"system_fingerprint": "2.0.0-native",
"usage": {
"completion_tokens": 100,
"prompt_tokens": 60,
(file name not rendered)
@@ -31,7 +31,7 @@
"id": "",
"model": "TinyLlama/TinyLlama-1.1B-Chat-v1.0",
"object": "text_completion",
"system_fingerprint": "1.4.5-native",
"system_fingerprint": "2.0.0-native",
"usage": {
"completion_tokens": 29,
"prompt_tokens": 316,
(file name not rendered)
@@ -31,7 +31,7 @@
"id": "",
"model": "TinyLlama/TinyLlama-1.1B-Chat-v1.0",
"object": "text_completion",
"system_fingerprint": "1.4.5-native",
"system_fingerprint": "2.0.0-native",
"usage": {
"completion_tokens": 29,
"prompt_tokens": 316,
(file name not rendered)
@@ -30,7 +30,7 @@
"id": "",
"model": "TinyLlama/TinyLlama-1.1B-Chat-v1.0",
"object": "text_completion",
"system_fingerprint": "1.4.5-native",
"system_fingerprint": "2.0.0-native",
"usage": {
"completion_tokens": 21,
"prompt_tokens": 187,
(file name not rendered)
@@ -23,5 +23,5 @@
"id": "",
"model": "TinyLlama/TinyLlama-1.1B-Chat-v1.0",
"object": "text_completion",
"system_fingerprint": "1.4.5-native"
"system_fingerprint": "2.0.0-native"
}
2 changes: 1 addition & 1 deletion integration-tests/pyproject.toml
@@ -1,6 +1,6 @@
[tool.poetry]
name = "text-generation-integration-tests"
version = "1.4.5"
version = "2.0.0"
description = "Text Generation Inference integration tests"
authors = ["Nicolas Patry <[email protected]>"]

2 changes: 1 addition & 1 deletion server/pyproject.toml
@@ -1,6 +1,6 @@
[tool.poetry]
name = "text-generation-server"
version = "1.4.5"
version = "2.0.0"
description = "Text Generation Inference Python gRPC Server"
authors = ["Olivier Dehaene <[email protected]>"]

4 changes: 4 additions & 0 deletions server/text_generation_server/interceptor.py
@@ -23,6 +23,10 @@ async def intercept(
method_name = method_name.split("/")[-1]
logger.exception(f"Method {method_name} encountered an error.")

# Runtime Error cannot be recovered from
if isinstance(err, RuntimeError):
exit(1)

if torch.cuda.is_available():
torch.cuda.empty_cache()

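
For context, a simplified sketch of the failure policy this hunk introduces (illustrative code, not the exact TGI interceptor): exceptions from a request handler are logged, a RuntimeError is treated as unrecoverable and terminates the process so the launcher can restart the shard, and anything else just frees cached CUDA memory before being reported.

```python
# Illustrative sketch of the interceptor behaviour added in this commit
# (simplified stand-in, not the actual ExceptionInterceptor class).
import sys

import torch
from loguru import logger


async def handle_with_exception_policy(method, request, context):
    try:
        return await method(request, context)
    except Exception as err:
        logger.exception("RPC handler encountered an error.")

        # A RuntimeError (e.g. cache-block exhaustion, CUDA failures) cannot
        # be recovered from: exit so the launcher can restart the shard.
        if isinstance(err, RuntimeError):
            sys.exit(1)

        # Other errors: release cached GPU memory and re-raise for reporting.
        if torch.cuda.is_available():
            torch.cuda.empty_cache()
        raise
```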
7 changes: 4 additions & 3 deletions server/text_generation_server/models/cache_manager.py
@@ -55,9 +55,10 @@ def allocate(
):
# Get free blocks indices by finding values in mask that are not set to 0
free_block_indices = self.free_block_mask.nonzero()
assert (
len(free_block_indices) >= blocks
), f"Out of available cache blocks: asked {blocks}, only {len(free_block_indices)} free blocks"
if blocks > len(free_block_indices):
raise RuntimeError(
f"Out of available cache blocks: asked {blocks}, only {len(free_block_indices)} free blocks"
)

# Slice by the number of required blocks
block_indices = free_block_indices[:blocks]
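
For context, a minimal stand-in (not the actual CacheManager class) showing the allocation path after this change: free KV-cache blocks are tracked in a 0/1 mask, and asking for more blocks than remain now raises a RuntimeError, which the interceptor change above turns into a clean process exit instead of a failed assert.

```python
# Stand-in sketch of block allocation over a free-block mask, mirroring the
# behaviour after this hunk: running out of blocks raises RuntimeError.
import torch


class TinyBlockAllocator:
    def __init__(self, num_blocks: int):
        # 1 = free, 0 = allocated
        self.free_block_mask = torch.ones(num_blocks, dtype=torch.int32)

    def allocate(self, blocks: int) -> torch.Tensor:
        # Free block indices are the positions still set to 1.
        free_block_indices = self.free_block_mask.nonzero()
        if blocks > len(free_block_indices):
            raise RuntimeError(
                f"Out of available cache blocks: asked {blocks}, "
                f"only {len(free_block_indices)} free blocks"
            )

        # Take the first `blocks` free indices and mark them as allocated.
        block_indices = free_block_indices[:blocks].flatten()
        self.free_block_mask[block_indices] = 0
        return block_indices
```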
9 changes: 9 additions & 0 deletions server/text_generation_server/utils/layers.py
@@ -503,6 +503,10 @@ def forward(
self, input: torch.Tensor
) -> Tuple[torch.Tensor, Optional[torch.Tensor]]:
logits = self.lm_head(input)
# If we have too many tokens, we skip speculative logits
if input.shape[0] > 128:
return logits, None

speculative_logits = self.medusa(input)
return logits, speculative_logits

@@ -549,6 +553,11 @@ def __init__(self, config, prefix, weights):
self.lm_head = TensorParallelHead.load(config, prefix, weights)

def forward(self, x):
# If we have too many tokens, we skip speculative logits
if x.shape[0] > 128:
logits = self.lm_head(x)
return logits, None

size = x.shape[-1]
block_size = (size + self.world_size - 1) // self.world_size
start = self.rank * block_size
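
For context, a simplified stand-in for the speculative (Medusa) head gating added in both hunks (hypothetical module, not the actual TGI classes): above a 128-token threshold only the regular LM-head logits are computed and the speculative pass is skipped.

```python
# Simplified stand-in for the speculative-head gating added in this hunk:
# when the flattened batch holds more than 128 tokens, skip the extra
# speculative forward pass and return only the LM-head logits.
from typing import Optional, Tuple

import torch
from torch import nn


class SpeculativeHeadSketch(nn.Module):
    def __init__(self, hidden_size: int, vocab_size: int, n_medusa_heads: int = 4):
        super().__init__()
        self.lm_head = nn.Linear(hidden_size, vocab_size, bias=False)
        # One projection per speculative head (the real Medusa heads use
        # residual MLP blocks; plain linears keep this sketch small).
        self.medusa = nn.ModuleList(
            [nn.Linear(hidden_size, vocab_size, bias=False) for _ in range(n_medusa_heads)]
        )

    def forward(self, input: torch.Tensor) -> Tuple[torch.Tensor, Optional[torch.Tensor]]:
        logits = self.lm_head(input)
        # If we have too many tokens, we skip speculative logits.
        if input.shape[0] > 128:
            return logits, None

        # Shape: (num_tokens, n_medusa_heads, vocab_size)
        speculative_logits = torch.stack([head(input) for head in self.medusa], dim=1)
        return logits, speculative_logits
```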
