GPTQ Fused MoE class #8

Closed · wants to merge 82 commits

Commits (82)
db1f07e
GPTQ Fused MoE class
ElizaWszola Sep 3, 2024
6753789
Add GPTQMarlinMoEMethod to gptq_marlin.py
ElizaWszola Sep 3, 2024
7df4014
Use FusedMoE layer for all loads
ElizaWszola Sep 4, 2024
c3dc249
Merge branch 'marlin-moe-8-bit' into gptq_fused_moe
ElizaWszola Sep 4, 2024
2fa03e5
Make sure that GPTQ runs through mixtral.py
ElizaWszola Sep 4, 2024
8a504d9
enforce float16A/scales for marlin moe
ElizaWszola Sep 4, 2024
689ea0a
Merge branch 'marlin-moe-8-bit' into gptq_fused_moe
ElizaWszola Sep 4, 2024
ec47561
cleanup
ElizaWszola Sep 4, 2024
2ad2e56
[MISC] Consolidate FP8 kv-cache tests (#8131)
comaniac Sep 4, 2024
d1dec64
[CI/Build][ROCm] Enabling LoRA tests on ROCm (#7369)
alexeykondrat Sep 4, 2024
561d6f8
[CI] Change test input in Gemma LoRA test (#8163)
WoosukKwon Sep 4, 2024
e02ce49
[Feature] OpenAI-Compatible Tools API + Streaming for Hermes & Mistra…
K-Mistele Sep 4, 2024
77d9e51
[MISC] Replace input token throughput with total token throughput (#8…
comaniac Sep 4, 2024
008cf88
[Neuron] Adding support for adding/ overriding neuron configuration a…
hbikki Sep 4, 2024
32e7db2
Bump version to v0.6.0 (#8166)
simon-mo Sep 4, 2024
e01c2be
[Doc] [Misc] Create CODE_OF_CONDUCT.md (#8161)
mmcelaney Sep 4, 2024
1afc931
[bugfix] >1.43 constraint for openai (#8169)
SolitaryThinker Sep 5, 2024
4624d98
[Misc] Clean up RoPE forward_native (#8076)
WoosukKwon Sep 5, 2024
ba262c4
[ci] Mark LoRA test as soft-fail (#8160)
khluu Sep 5, 2024
e39ebf5
[Core/Bugfix] Add query dtype as per FlashInfer API requirements. (#8…
elfiegg Sep 5, 2024
288a938
[Doc] Indicate more information about supported modalities (#8181)
DarkLight1337 Sep 5, 2024
8685ba1
Inclusion of InternVLChatModel In PP_SUPPORTED_MODELS(Pipeline Parall…
Manikandan-Thangaraj-ZS0321 Sep 5, 2024
9da25a8
[MODEL] Qwen Multimodal Support (Qwen-VL / Qwen-VL-Chat) (#8029)
alex-jw-brooks Sep 5, 2024
2ee4528
Move verify_marlin_supported to GPTQMarlinLinearMethod (#8165)
mgoin Sep 5, 2024
2febcf2
[Documentation][Spec Decode] Add documentation about lossless guarant…
sroy745 Sep 5, 2024
9f97b3b
update/fix weight loading to support tp
dsikka Sep 5, 2024
db3bf7c
[Core] Support load and unload LoRA in api server (#6566)
Jeffwan Sep 6, 2024
baa5467
[BugFix] Fix Granite model configuration (#8216)
njhill Sep 6, 2024
b841ac4
remove 8-bit stuff for now
ElizaWszola Sep 6, 2024
a245032
Merge branch 'gptq_fused_moe' of https://github.com/neuralmagic/vllm …
ElizaWszola Sep 6, 2024
9d8a80c
fix; update large model testing cases
dsikka Sep 6, 2024
e5cab71
[Frontend] Add --logprobs argument to `benchmark_serving.py` (#8191)
afeldman-nm Sep 6, 2024
315e22f
add hack to support unfused mixtral pathway for int8
dsikka Sep 6, 2024
de80783
[Misc] Use ray[adag] dependency instead of cuda (#7938)
ruisearch42 Sep 6, 2024
565cc43
fix install for tpu test
dsikka Sep 6, 2024
1447c97
[CI/Build] Increasing timeout for multiproc worker tests (#8203)
alexeykondrat Sep 6, 2024
9db52ea
[Kernel] [Triton] Memory optimization for awq_gemm and awq_dequantize…
rasmith Sep 6, 2024
23f3222
[Misc] Remove `SqueezeLLM` (#8220)
dsikka Sep 6, 2024
29f49cd
[Model] Allow loading from original Mistral format (#8168)
patrickvonplaten Sep 6, 2024
12dd715
[misc] [doc] [frontend] LLM torch profiler support (#7943)
SolitaryThinker Sep 7, 2024
41e95c5
[Bugfix] Fix Hermes tool call chat template bug (#8256)
K-Mistele Sep 7, 2024
2f707fc
[Model] Multi-input support for LLaVA (#8238)
DarkLight1337 Sep 7, 2024
795b662
Enable Random Prefix Caching in Serving Profiling Tool (benchmark_ser…
wschin Sep 7, 2024
ce2702a
[tpu][misc] fix typo (#8260)
youkaichao Sep 7, 2024
9f68e00
[Bugfix] Fix broken OpenAI tensorizer test (#8258)
DarkLight1337 Sep 7, 2024
e807125
[Model][VLM] Support multi-images inputs for InternVL2 models (#8201)
Isotr0py Sep 7, 2024
8886423
Move float16 typecast hack to gptq marlin moe method
ElizaWszola Sep 7, 2024
ab27497
Move output type conversion to gptq method as well
ElizaWszola Sep 7, 2024
36bf815
[Model][VLM] Decouple weight loading logic for `Paligemma` (#8269)
Isotr0py Sep 7, 2024
b962ee1
ppc64le: Dockerfile fixed, and a script for buildkite (#8026)
sumitd2 Sep 7, 2024
cfe712b
[CI/Build] Use python 3.12 in cuda image (#8133)
joerunde Sep 7, 2024
4ef41b8
[Bugfix] Fix async postprocessor in case of preemption (#8267)
alexm-redhat Sep 8, 2024
847e860
Enable 8-bit weights in Fused Marlin MoE
ElizaWszola Aug 30, 2024
430a9cb
fix rocm
ElizaWszola Aug 30, 2024
48047aa
bad paste
ElizaWszola Aug 30, 2024
bfc4fae
add test case; fix imports for tests
dsikka Aug 30, 2024
c5a2f62
fix to adapt custom_routin_function
dsikka Aug 30, 2024
2b308c4
Use select_experts to compute top_k tensors in fused moe
ElizaWszola Sep 2, 2024
71256d4
bring back fused_moe_marlin -> fused_marlin_moe
ElizaWszola Sep 3, 2024
7aa844c
GPTQ Fused MoE class
ElizaWszola Sep 3, 2024
0f7bec3
Add GPTQMarlinMoEMethod to gptq_marlin.py
ElizaWszola Sep 3, 2024
cb0001e
Use FusedMoE layer for all loads
ElizaWszola Sep 4, 2024
33090a3
Make sure that GPTQ runs through mixtral.py
ElizaWszola Sep 4, 2024
d479837
enforce float16A/scales for marlin moe
ElizaWszola Sep 4, 2024
8baaec6
remove large model
dsikka Sep 4, 2024
8fbc181
Cleanup, comments
ElizaWszola Sep 4, 2024
839915f
cleanup
ElizaWszola Sep 4, 2024
a5bc626
remove 8-bit stuff for now
ElizaWszola Sep 6, 2024
c573fa1
update/fix weight loading to support tp
dsikka Sep 5, 2024
a991d82
fix; update large model testing cases
dsikka Sep 6, 2024
d57804d
add hack to support unfused mixtral pathway for int8
dsikka Sep 6, 2024
96fa486
fix install for tpu test
dsikka Sep 6, 2024
1faab90
Move float16 typecast hack to gptq marlin moe method
ElizaWszola Sep 7, 2024
970e06a
Move output type conversion to gptq method as well
ElizaWszola Sep 7, 2024
fd0a4f2
typo fix; fix comment
dsikka Sep 9, 2024
3ac9273
Merge branch 'gptq_fused_moe' of https://github.com/neuralmagic/vllm …
ElizaWszola Sep 9, 2024
d51a2f4
Clarify comment, change how we process bias
ElizaWszola Sep 9, 2024
08287ef
[Bugfix] Streamed tool calls now more strictly follow OpenAI's format…
K-Mistele Sep 9, 2024
58fcc85
[Frontend] Add progress reporting to run_batch.py (#8060)
alugowski Sep 9, 2024
f9b4a2d
[Bugfix] Correct adapter usage for cohere and jamba (#8292)
vladislavkruglikov Sep 9, 2024
c7cb5c3
[Misc] GPTQ Activation Ordering (#8135)
kylesayrs Sep 9, 2024
12f05c5
Merge branch 'main' into gptq_fused_moe
dsikka Sep 9, 2024
Files changed
47 changes: 41 additions & 6 deletions .buildkite/run-amd-test.sh
File mode changed: 100644 → 100755
@@ -1,5 +1,5 @@
# This script runs tests inside the corresponding ROCm docker container.
set -ex
set -o pipefail

# Print ROCm version
echo "--- Confirming Clean Initial State"
@@ -70,16 +70,51 @@ HF_CACHE="$(realpath ~)/huggingface"
mkdir -p ${HF_CACHE}
HF_MOUNT="/root/.cache/huggingface"

docker run \
commands=$@
PARALLEL_JOB_COUNT=8
# check if the command contains the shard flag; if so, run all shards in parallel because the host has 8 GPUs.
if [[ $commands == *"--shard-id="* ]]; then
for GPU in $(seq 0 $(($PARALLEL_JOB_COUNT-1))); do
#replace shard arguments
commands=${@//"--shard-id= "/"--shard-id=${GPU} "}
commands=${commands//"--num-shards= "/"--num-shards=${PARALLEL_JOB_COUNT} "}
docker run \
--device /dev/kfd --device /dev/dri \
--network host \
--shm-size=16gb \
--rm \
-e HIP_VISIBLE_DEVICES=0 \
-e HIP_VISIBLE_DEVICES=${GPU} \
-e HF_TOKEN \
-v ${HF_CACHE}:${HF_MOUNT} \
-e HF_HOME=${HF_MOUNT} \
--name ${container_name} \
--name ${container_name}_${GPU} \
${image_name} \
/bin/bash -c "${@}"

/bin/bash -c "${commands}" \
|& while read -r line; do echo ">>Shard $GPU: $line"; done &
PIDS+=($!)
done
#wait for all processes to finish and collect exit codes
for pid in ${PIDS[@]}; do
wait ${pid}
STATUS+=($?)
done
for st in ${STATUS[@]}; do
if [[ ${st} -ne 0 ]]; then
echo "One of the processes failed with $st"
exit ${st}
fi
done
else
docker run \
--device /dev/kfd --device /dev/dri \
--network host \
--shm-size=16gb \
--rm \
-e HIP_VISIBLE_DEVICES=0 \
-e HF_TOKEN \
-v ${HF_CACHE}:${HF_MOUNT} \
-e HF_HOME=${HF_MOUNT} \
--name ${container_name} \
${image_name} \
/bin/bash -c "${commands}"
fi
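
The sharded branch above derives one command per GPU from the original command string. A minimal standalone sketch of that substitution pattern, assuming a hypothetical template string and not taken from the PR itself:

#!/bin/bash
# Illustrative sketch only: derive per-shard commands from a template string
# using bash pattern substitution, mirroring the replacement logic above.
TEMPLATE='pytest -v -s lora --shard-id= --num-shards= '
NUM_SHARDS=8
for GPU in $(seq 0 $((NUM_SHARDS - 1))); do
    # Fill the empty placeholders with per-shard values.
    cmd=${TEMPLATE//"--shard-id= "/"--shard-id=${GPU} "}
    cmd=${cmd//"--num-shards= "/"--num-shards=${NUM_SHARDS} "}
    echo "shard ${GPU}: ${cmd}"
done

In the script itself, each substituted command then runs in its own container with HIP_VISIBLE_DEVICES pinned to that shard's GPU, and the exit codes are collected with wait.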
32 changes: 32 additions & 0 deletions .buildkite/run-cpu-test-ppc64le.sh
@@ -0,0 +1,32 @@
# This script builds the CPU docker image and runs the offline inference inside the container.
# It serves as a sanity check for compilation and basic model usage.
set -ex

# Try building the docker image
docker build -t cpu-test -f Dockerfile.ppc64le .

# Setup cleanup
remove_docker_container() { docker rm -f cpu-test || true; }
trap remove_docker_container EXIT
remove_docker_container

# Run the image, setting --shm-size=4g for tensor parallel.
#docker run -itd --entrypoint /bin/bash -v ~/.cache/huggingface:/root/.cache/huggingface --privileged=true --network host -e HF_TOKEN --env VLLM_CPU_KVCACHE_SPACE=4 --shm-size=4g --name cpu-test cpu-test
docker run -itd --entrypoint /bin/bash -v ~/.cache/huggingface:/root/.cache/huggingface --privileged=true --network host -e HF_TOKEN --name cpu-test cpu-test

# Run basic model test
docker exec cpu-test bash -c "
pip install pytest matplotlib einops transformers_stream_generator
pytest -v -s tests/models -m \"not vlm\" --ignore=tests/models/test_embedding.py --ignore=tests/models/test_oot_registration.py --ignore=tests/models/test_registry.py --ignore=tests/models/test_jamba.py --ignore=tests/models/test_danube3_4b.py" # Mamba and Danube3-4B on CPU is not supported

# online inference
docker exec cpu-test bash -c "
python3 -m vllm.entrypoints.openai.api_server --model facebook/opt-125m &
timeout 600 bash -c 'until curl localhost:8000/v1/models; do sleep 1; done' || exit 1
python3 benchmarks/benchmark_serving.py \
--backend vllm \
--dataset-name random \
--model facebook/opt-125m \
--num-prompts 20 \
--endpoint /v1/completions \
--tokenizer facebook/opt-125m"
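
Once the server in the block above is listening, the endpoint can also be exercised with a single request. This is only an illustrative sketch against vLLM's OpenAI-compatible completions route, not part of the PR:

# Illustrative sketch only: one-off completion request against the server
# started by the script above.
curl -s http://localhost:8000/v1/completions \
    -H "Content-Type: application/json" \
    -d '{"model": "facebook/opt-125m", "prompt": "Hello, my name is", "max_tokens": 16}'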
7 changes: 6 additions & 1 deletion .buildkite/run-cpu-test.sh
@@ -23,7 +23,12 @@ docker exec cpu-test-avx2 bash -c "python3 examples/offline_inference.py"
# Run basic model test
docker exec cpu-test bash -c "
pip install pytest matplotlib einops transformers_stream_generator
pytest -v -s tests/models -m \"not vlm\" --ignore=tests/models/test_embedding.py --ignore=tests/models/test_oot_registration.py --ignore=tests/models/test_registry.py --ignore=tests/models/test_jamba.py --ignore=tests/models/test_danube3_4b.py" # Mamba and Danube3-4B on CPU is not supported
pytest -v -s tests/models -m \"not vlm\" --ignore=tests/models/test_embedding.py \
--ignore=tests/models/test_oot_registration.py \
--ignore=tests/models/test_registry.py \
--ignore=tests/models/test_fp8.py \
--ignore=tests/models/test_jamba.py \
--ignore=tests/models/test_danube3_4b.py" # Mamba and Danube3-4B on CPU is not supported

# online inference
docker exec cpu-test bash -c "
28 changes: 25 additions & 3 deletions .buildkite/test-pipeline.yaml
@@ -92,6 +92,7 @@ steps:
- pytest -v -s entrypoints/openai
- pytest -v -s entrypoints/test_chat_utils.py


- label: Distributed Tests (4 GPUs) # 10min
working_dir: "/vllm-workspace/tests"
num_gpus: 4
@@ -157,6 +158,7 @@ steps:
- python3 offline_inference_with_prefix.py
- python3 llm_engine_example.py
- python3 offline_inference_vision_language.py
- python3 offline_inference_vision_language_multi_image.py
- python3 tensorize_vllm_model.py --model facebook/opt-125m serialize --serialized-directory /tmp/ --suffix v1 && python3 tensorize_vllm_model.py --model facebook/opt-125m deserialize --path-to-tensors /tmp/vllm/facebook/opt-125m/v1/model.tensors
- python3 offline_inference_encoder_decoder.py

@@ -218,9 +220,9 @@ steps:
- pytest -v -s spec_decode

- label: LoRA Test %N # 30min each
mirror_hardwares: [amd]
source_file_dependencies:
- vllm/lora
- csrc/punica
- tests/lora
command: pytest -v -s lora --shard-id=$$BUILDKITE_PARALLEL_JOB --num-shards=$$BUILDKITE_PARALLEL_JOB_COUNT --ignore=lora/test_long_context.py
parallelism: 4
@@ -271,6 +273,15 @@ steps:
- export VLLM_WORKER_MULTIPROC_METHOD=spawn
- bash ./run-tests.sh -c configs/models-small.txt -t 1

- label: OpenAI-Compatible Tool Use # 20 min
fast_check: false
mirror_hardwares: [ amd ]
source_file_dependencies:
- vllm/
- tests/tool_use
commands:
- pytest -v -s tool_use

##### 1 GPU test #####
##### multi gpus test #####

@@ -358,9 +369,9 @@ steps:
- label: LoRA Long Context (Distributed) # 11min
# This test runs llama 13B, so it is required to run on 4 GPUs.
num_gpus: 4
soft_fail: true
source_file_dependencies:
- vllm/lora
- csrc/punica
- tests/lora/test_long_context
commands:
# FIXIT: find out which code initialize cuda before running the test
@@ -375,7 +386,18 @@ steps:
- vllm/
- tests/weight_loading
commands:
- bash weight_loading/run_model_weight_loading_test.sh
- bash weight_loading/run_model_weight_loading_test.sh -c weight_loading/models.txt

- label: Weight Loading Multiple GPU Test - Large Models # optional
working_dir: "/vllm-workspace/tests"
num_gpus: 2
gpu: a100
optional: true
source_file_dependencies:
- vllm/
- tests/weight_loading
commands:
- bash weight_loading/run_model_weight_loading_test.sh -c weight_loading/models-large.txt


##### multi gpus test #####
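The LoRA step in the pipeline above shards the suite across Buildkite parallel jobs via the $$BUILDKITE_PARALLEL_JOB variables. As a rough local equivalent, a sketch with concrete shard values substituted in (illustrative only, not part of the PR) would be:

# Illustrative sketch only: run shard 0 of 4 of the LoRA suite locally.
cd tests
pytest -v -s lora --shard-id=0 --num-shards=4 --ignore=lora/test_long_context.py
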
1 change: 0 additions & 1 deletion CMakeLists.txt
@@ -181,7 +181,6 @@ set(VLLM_EXT_SRC
"csrc/pos_encoding_kernels.cu"
"csrc/activation_kernels.cu"
"csrc/layernorm_kernels.cu"
"csrc/quantization/squeezellm/quant_cuda_kernel.cu"
"csrc/quantization/gptq/q_gemm.cu"
"csrc/quantization/compressed_tensors/int8_quant_kernels.cu"
"csrc/quantization/fp8/common.cu"
128 changes: 128 additions & 0 deletions CODE_OF_CONDUCT.md
@@ -0,0 +1,128 @@

# vLLM Code of Conduct

## Our Pledge

We as members, contributors, and leaders pledge to make participation in our
community a harassment-free experience for everyone, regardless of age, body
size, visible or invisible disability, ethnicity, sex characteristics, gender
identity and expression, level of experience, education, socioeconomic status,
nationality, personal appearance, race, caste, color, religion, or sexual
identity and orientation.

We pledge to act and interact in ways that contribute to an open, welcoming,
diverse, inclusive, and healthy community.

## Our Standards

Examples of behavior that contributes to a positive environment for our
community include:

* Demonstrating empathy and kindness toward other people
* Being respectful of differing opinions, viewpoints, and experiences
* Giving and gracefully accepting constructive feedback
* Accepting responsibility and apologizing to those affected by our mistakes,
and learning from the experience
* Focusing on what is best not just for us as individuals, but for the overall
community

Examples of unacceptable behavior include:

* The use of sexualized language or imagery, and sexual attention or advances of
any kind
* Trolling, insulting or derogatory comments, and personal or political attacks
* Public or private harassment
* Publishing others' private information, such as a physical or email address,
without their explicit permission
* Other conduct which could reasonably be considered inappropriate in a
professional setting

## Enforcement Responsibilities

Community leaders are responsible for clarifying and enforcing our standards of
acceptable behavior and will take appropriate and fair corrective action in
response to any behavior that they deem inappropriate, threatening, offensive,
or harmful.

Community leaders have the right and responsibility to remove, edit, or reject
comments, commits, code, wiki edits, issues, and other contributions that are
not aligned to this Code of Conduct, and will communicate reasons for moderation
decisions when appropriate.

## Scope

This Code of Conduct applies within all community spaces, and also applies when
an individual is officially representing the community in public spaces.
Examples of representing our community include using an official email address,
posting via an official social media account, or acting as an appointed
representative at an online or offline/IRL event.

## Enforcement

Instances of abusive, harassing, or otherwise unacceptable behavior may be
reported to the community leaders responsible for enforcement in the #code-of-conduct
channel in the [vLLM Discord](https://discord.com/invite/jz7wjKhh6g).
All complaints will be reviewed and investigated promptly and fairly.

All community leaders are obligated to respect the privacy and security of the
reporter of any incident.

## Enforcement Guidelines

Community leaders will follow these Community Impact Guidelines in determining
the consequences for any action they deem in violation of this Code of Conduct:

### 1. Correction

**Community Impact**: Use of inappropriate language or other behavior deemed
unprofessional or unwelcome in the community.

**Consequence**: A private, written warning from community leaders, providing
clarity around the nature of the violation and an explanation of why the
behavior was inappropriate. A public apology may be requested.

### 2. Warning

**Community Impact**: A violation through a single incident or series of
actions.

**Consequence**: A warning with consequences for continued behavior. No
interaction with the people involved, including unsolicited interaction with
those enforcing the Code of Conduct, for a specified period of time. This
includes avoiding interactions in community spaces as well as external channels
like social media. Violating these terms may lead to a temporary or permanent
ban.

### 3. Temporary Ban

**Community Impact**: A serious violation of community standards, including
sustained inappropriate behavior.

**Consequence**: A temporary ban from any sort of interaction or public
communication with the community for a specified period of time. No public or
private interaction with the people involved, including unsolicited interaction
with those enforcing the Code of Conduct, is allowed during this period.
Violating these terms may lead to a permanent ban.

### 4. Permanent Ban

**Community Impact**: Demonstrating a pattern of violation of community
standards, including sustained inappropriate behavior, harassment of an
individual, or aggression toward or disparagement of classes of individuals.

**Consequence**: A permanent ban from any sort of public interaction within the
community.

## Attribution

This Code of Conduct is adapted from the [Contributor Covenant](https://www.contributor-covenant.org/),
version 2.1, available at
[v2.1](https://www.contributor-covenant.org/version/2/1/code_of_conduct.html).

Community Impact Guidelines were inspired by
[Mozilla's code of conduct enforcement ladder](https://github.com/mozilla/inclusion).

For answers to common questions about this code of conduct, see the
[Contributor Covenant FAQ](https://www.contributor-covenant.org/faq). Translations are available at
[Contributor Covenant translations](https://www.contributor-covenant.org/translations).

10 changes: 6 additions & 4 deletions Dockerfile
@@ -10,7 +10,7 @@ ARG CUDA_VERSION=12.4.1
# prepare basic build environment
FROM nvidia/cuda:${CUDA_VERSION}-devel-ubuntu20.04 AS base
ARG CUDA_VERSION=12.4.1
ARG PYTHON_VERSION=3.10
ARG PYTHON_VERSION=3.12
ENV DEBIAN_FRONTEND=noninteractive

# Install Python and other dependencies
@@ -37,7 +37,6 @@ WORKDIR /workspace

# install build and runtime dependencies
COPY requirements-common.txt requirements-common.txt
COPY requirements-adag.txt requirements-adag.txt
COPY requirements-cuda.txt requirements-cuda.txt
RUN --mount=type=cache,target=/root/.cache/pip \
python3 -m pip install -r requirements-cuda.txt
@@ -66,7 +65,6 @@ COPY setup.py setup.py
COPY cmake cmake
COPY CMakeLists.txt CMakeLists.txt
COPY requirements-common.txt requirements-common.txt
COPY requirements-adag.txt requirements-adag.txt
COPY requirements-cuda.txt requirements-cuda.txt
COPY pyproject.toml pyproject.toml
COPY vllm vllm
@@ -135,7 +133,7 @@ RUN --mount=type=cache,target=/root/.cache/pip \
# image with vLLM installed
FROM nvidia/cuda:${CUDA_VERSION}-base-ubuntu20.04 AS vllm-base
ARG CUDA_VERSION=12.4.1
ARG PYTHON_VERSION=3.10
ARG PYTHON_VERSION=3.12
WORKDIR /vllm-workspace
ENV DEBIAN_FRONTEND=noninteractive

@@ -181,6 +179,10 @@ FROM vllm-base AS test
ADD . /vllm-workspace/

# install development dependencies (for testing)
# A newer setuptools is required for installing some test dependencies from source that do not publish python 3.12 wheels
# This installation must complete before the test dependencies are collected and installed.
RUN --mount=type=cache,target=/root/.cache/pip \
python3 -m pip install "setuptools>=74.1.1"
RUN --mount=type=cache,target=/root/.cache/pip \
python3 -m pip install -r requirements-dev.txt

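The Python bump above is driven by the PYTHON_VERSION build argument, so the test stage can be built with it set explicitly. A hedged usage sketch (the image tag is hypothetical, not from the PR):

# Illustrative sketch only: build the test stage with the bumped Python version.
DOCKER_BUILDKIT=1 docker build \
    --build-arg CUDA_VERSION=12.4.1 \
    --build-arg PYTHON_VERSION=3.12 \
    --target test \
    -t vllm-test:py312 .
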
16 changes: 11 additions & 5 deletions Dockerfile.ppc64le
@@ -2,21 +2,27 @@ FROM mambaorg/micromamba
ARG MAMBA_DOCKERFILE_ACTIVATE=1
USER root

RUN apt-get update -y && apt-get install -y git wget vim numactl gcc-12 g++-12 protobuf-compiler libprotobuf-dev && update-alternatives --install /usr/bin/gcc gcc /usr/bin/gcc-12 10 --slave /usr/bin/g++ g++ /usr/bin/g++-12
ENV PATH="/usr/local/cargo/bin:$PATH:/opt/conda/bin/"

RUN apt-get update -y && apt-get install -y git wget vim libnuma-dev libsndfile-dev libprotobuf-dev build-essential

# Some packages in requirements-cpu are installed here
# IBM provides optimized packages for ppc64le processors in the open-ce project for mamba
# Currently these may not be available for venv or pip directly
RUN micromamba install -y -n base -c https://ftp.osuosl.org/pub/open-ce/1.11.0-p10/ -c defaults python=3.10 pytorch-cpu=2.1.2 torchvision-cpu=0.16.2 && micromamba clean --all --yes
RUN micromamba install -y -n base -c https://ftp.osuosl.org/pub/open-ce/1.11.0-p10/ -c defaults python=3.10 torchvision-cpu=0.16.2 rust && micromamba clean --all --yes

COPY ./ /workspace/vllm

WORKDIR /workspace/vllm

# These packages will be in rocketce eventually
RUN pip install -v -r requirements-cpu.txt --prefer-binary --extra-index-url https://repo.fury.io/mgiessing
RUN pip install -v cmake torch==2.3.1 uvloop==0.20.0 -r requirements-cpu.txt --prefer-binary --extra-index-url https://repo.fury.io/mgiessing

RUN VLLM_TARGET_DEVICE=cpu python3 setup.py install

WORKDIR /vllm-workspace
ENTRYPOINT ["/opt/conda/bin/python3", "-m", "vllm.entrypoints.openai.api_server"]
WORKDIR /workspace/

RUN ln -s /workspace/vllm/tests && ln -s /workspace/vllm/examples && ln -s /workspace/vllm/benchmarks

ENTRYPOINT ["python3", "-m", "vllm.entrypoints.openai.api_server"]
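
Because the ppc64le image now uses the OpenAI API server as its entrypoint, a minimal build-and-serve sketch (illustrative only; the tag and model are examples, reusing the same cache mount as the Buildkite script) is:

# Illustrative sketch only: build the ppc64le image and serve a small model.
docker build -t vllm-ppc64le -f Dockerfile.ppc64le .
docker run --rm --network host \
    -v ~/.cache/huggingface:/root/.cache/huggingface \
    vllm-ppc64le --model facebook/opt-125m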
