WIP: add OCI container #241
Conversation
mcowger left a comment
I converted the buildah script to a standard Dockerfile (below). After adding libcurl, it does start and download a model, but loading the model fails. The same model runs fine with llama.cpp in Docker on this machine.
Most interesting is the stage at which it begins the tensor load:
DEBUG: LLAMA SERVER GPU: load_tensors: loading model tensors, this can take a while... (mmap = true)
DEBUG: LLAMA SERVER GPU: llama_model_load: error loading model: make_cpu_buft_list: no CPU backend found
I've never seen this error from llama.cpp before: make_cpu_buft_list: no CPU backend found
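For anyone else debugging this: the lines below are a minimal check I'd try inside the container, not a confirmed fix. Since ggml loads its backends as shared objects at runtime (the log shows the RPC and Vulkan backends loading, but no CPU backend), my first guess is that libggml-cpu isn't found or can't be dlopen'd. Paths are taken from the log; the --list-devices flag is an assumption about this llama.cpp build.
# Is a CPU backend shared object shipped next to llama-server at all?
ls /opt/venv/bin/vulkan/llama_server/build/bin/libggml-cpu*.so
# Any unresolved shared-library dependencies that would make dlopen fail?
ldd /opt/venv/bin/vulkan/llama_server/build/bin/libggml-cpu*.so | grep 'not found'
# Ask llama.cpp which devices it can see (assumes --list-devices exists in this build)
LD_LIBRARY_PATH=/opt/venv/bin/vulkan/llama_server/build/bin \
  /opt/venv/bin/vulkan/llama_server/build/bin/llama-server --list-devices
Full log: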
INFO: Loading llm: Qwen3-0.6B-GGUF
INFO: Using backend: vulkan
INFO: Downloading llama.cpp server from https://github.com/ggml-org/llama.cpp/releases/download/b6097/llama-b6097-bin-ubuntu-vulkan-x64.zip
DEBUG: Starting new HTTPS connection (1): github.com:443
DEBUG: https://github.com:443 "GET /ggml-org/llama.cpp/releases/download/b6097/llama-b6097-bin-ubuntu-vulkan-x64.zip HTTP/1.1" 302 0
DEBUG: Starting new HTTPS connection (1): release-assets.githubusercontent.com:443
DEBUG: https://release-assets.githubusercontent.com:443 "GET /github-production-release-asset/612354784/7049d9fd-7769-4f66-953d-584aaadce81c?sp=r&sv=2018-11-09&sr=b&spr=https&se=2025-08-25T07%3A22%3A02Z&rscd=attachment%3B+filename%3Dllama-b6097-bin-ubuntu-vulkan-x64.zip&rsct=application%2Foctet-stream&skoid=96c2d410-5711-43a1-aedd-ab1947aa7ab0&sktid=398a6654-997b-47e9-b12b-9515b896b4de&skt=2025-08-25T06%3A21%3A11Z&ske=2025-08-25T07%3A22%3A02Z&sks=b&skv=2018-11-09&sig=sA9x1qN2%2FPVDlHgu9K8D28FdMo%2FIGlBGiSXvV6szdgM%3D&jwt=eyJ0eXAiOiJKV1QiLCJhbGciOiJIUzI1NiJ9.eyJpc3MiOiJnaXRodWIuY29tIiwiYXVkIjoicmVsZWFzZS1hc3NldHMuZ2l0aHVidXNlcmNvbnRlbnQuY29tIiwia2V5Ijoia2V5MSIsImV4cCI6MTc1NjEwMzgwMCwibmJmIjoxNzU2MTAzNTAwLCJwYXRoIjoicmVsZWFzZWFzc2V0cHJvZHVjdGlvbi5ibG9iLmNvcmUud2luZG93cy5uZXQifQ.kg9u9BjrIuiqD-P4vre6zn3kqOu_D3uKR6bGn9xlcxk&response-content-disposition=attachment%3B%20filename%3Dllama-b6097-bin-ubuntu-vulkan-x64.zip&response-content-type=application%2Foctet-stream HTTP/1.1" 200 22417582
INFO: Extracting llama-b6097-bin-ubuntu-vulkan-x64.zip to /opt/venv/bin/vulkan/llama_server
INFO: Set executable permissions for /opt/venv/bin/vulkan/llama_server/build/bin/llama-server
INFO: Set executable permissions for /opt/venv/bin/vulkan/llama_server/build/bin/llama-cli
DEBUG: https://huggingface.co:443 "GET /api/models/unsloth/Qwen3-0.6B-GGUF/tree/main?recursive=True&expand=False HTTP/1.1" 200 8793
DEBUG: https://huggingface.co:443 "GET /api/models/unsloth/Qwen3-0.6B-GGUF/revision/main HTTP/1.1" 200 7869
Fetching 1 files: 100%|██████████| 1/1 [00:00<00:00, 1764.54it/s]
DEBUG: GGUF file paths: {'variant': '/srv/lemonade/hub/models--unsloth--Qwen3-0.6B-GGUF/snapshots/50968a4468ef4233ed78cd7c3de230dd1d61a56b/Qwen3-0.6B-Q4_0.gguf'}
DEBUG: Set LD_LIBRARY_PATH to /opt/venv/bin/vulkan/llama_server/build/bin
DEBUG: Starting new HTTP connection (1): localhost:49871
DEBUG: Not able to connect to llama-server yet, will retry
DEBUG: LLAMA SERVER GPU: load_backend: loaded RPC backend from /srv/lemonade/bin/vulkan/llama_server/build/bin/libggml-rpc.so
DEBUG: LLAMA SERVER GPU: ggml_vulkan: Found 1 Vulkan devices:
INFO: GPU acceleration active: 1 device(s) detected by llama-server
DEBUG: LLAMA SERVER GPU: ggml_vulkan: 0 = AMD Radeon Graphics (RADV PHOENIX) (radv) | uma: 1 | fp16: 1 | bf16: 0 | warp size: 64 | shared memory: 65536 | int dot: 1 | matrix cores: KHR_coopmat
DEBUG: LLAMA SERVER GPU: load_backend: loaded Vulkan backend from /srv/lemonade/bin/vulkan/llama_server/build/bin/libggml-vulkan.so
DEBUG: LLAMA SERVER GPU: build: 6097 (9515c613) with cc (Ubuntu 11.4.0-1ubuntu1~22.04) 11.4.0 for x86_64-linux-gnu
DEBUG: LLAMA SERVER GPU: system info: n_threads = 8, n_threads_batch = 8, total_threads = 16
DEBUG: LLAMA SERVER GPU:
DEBUG: LLAMA SERVER GPU: system_info: n_threads = 8 (n_threads_batch = 8) / 16 |
DEBUG: LLAMA SERVER GPU:
DEBUG: LLAMA SERVER GPU: main: binding port with default address family
DEBUG: LLAMA SERVER GPU: main: HTTP server is listening, hostname: 127.0.0.1, port: 49871, http threads: 15
DEBUG: LLAMA SERVER GPU: main: loading model
DEBUG: LLAMA SERVER GPU: srv load_model: loading model '/srv/lemonade/hub/models--unsloth--Qwen3-0.6B-GGUF/snapshots/50968a4468ef4233ed78cd7c3de230dd1d61a56b/Qwen3-0.6B-Q4_0.gguf'
DEBUG: LLAMA SERVER GPU: llama_model_load_from_file_impl: using device Vulkan0 (AMD Radeon Graphics (RADV PHOENIX)) - 50682 MiB free
DEBUG: LLAMA SERVER GPU: llama_model_loader: loaded meta data with 32 key-value pairs and 310 tensors from /srv/lemonade/hub/models--unsloth--Qwen3-0.6B-GGUF/snapshots/50968a4468ef4233ed78cd7c3de230dd1d61a56b/Qwen3-0.6B-Q4_0.gguf (version GGUF V3 (latest))
DEBUG: LLAMA SERVER GPU: llama_model_loader: Dumping metadata keys/values. Note: KV overrides do not apply in this output.
DEBUG: LLAMA SERVER GPU: llama_model_loader: - kv 0: general.architecture str = qwen3
DEBUG: LLAMA SERVER GPU: llama_model_loader: - kv 1: general.type str = model
DEBUG: LLAMA SERVER GPU: llama_model_loader: - kv 2: general.name str = Qwen3-0.6B
DEBUG: LLAMA SERVER GPU: llama_model_loader: - kv 3: general.basename str = Qwen3-0.6B
DEBUG: LLAMA SERVER GPU: llama_model_loader: - kv 4: general.quantized_by str = Unsloth
DEBUG: LLAMA SERVER GPU: llama_model_loader: - kv 5: general.size_label str = 0.6B
DEBUG: LLAMA SERVER GPU: llama_model_loader: - kv 6: general.repo_url str = https://huggingface.co/unsloth
DEBUG: LLAMA SERVER GPU: llama_model_loader: - kv 7: qwen3.block_count u32 = 28
DEBUG: LLAMA SERVER GPU: llama_model_loader: - kv 8: qwen3.context_length u32 = 40960
DEBUG: LLAMA SERVER GPU: llama_model_loader: - kv 9: qwen3.embedding_length u32 = 1024
DEBUG: LLAMA SERVER GPU: llama_model_loader: - kv 10: qwen3.feed_forward_length u32 = 3072
DEBUG: LLAMA SERVER GPU: llama_model_loader: - kv 11: qwen3.attention.head_count u32 = 16
DEBUG: LLAMA SERVER GPU: llama_model_loader: - kv 12: qwen3.attention.head_count_kv u32 = 8
DEBUG: LLAMA SERVER GPU: llama_model_loader: - kv 13: qwen3.rope.freq_base f32 = 1000000.000000
DEBUG: LLAMA SERVER GPU: llama_model_loader: - kv 14: qwen3.attention.layer_norm_rms_epsilon f32 = 0.000001
DEBUG: LLAMA SERVER GPU: llama_model_loader: - kv 15: qwen3.attention.key_length u32 = 128
DEBUG: LLAMA SERVER GPU: llama_model_loader: - kv 16: qwen3.attention.value_length u32 = 128
DEBUG: LLAMA SERVER GPU: llama_model_loader: - kv 17: tokenizer.ggml.model str = gpt2
DEBUG: LLAMA SERVER GPU: llama_model_loader: - kv 18: tokenizer.ggml.pre str = qwen2
DEBUG: LLAMA SERVER GPU: llama_model_loader: - kv 19: tokenizer.ggml.tokens arr[str,151936] = ["!", "\"", "#", "$", "%", "&", "'", ...
DEBUG: LLAMA SERVER GPU: llama_model_loader: - kv 20: tokenizer.ggml.token_type arr[i32,151936] = [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, ...
DEBUG: LLAMA SERVER GPU: llama_model_loader: - kv 21: tokenizer.ggml.merges arr[str,151387] = ["Ġ Ġ", "ĠĠ ĠĠ", "i n", "Ġ t",...
DEBUG: LLAMA SERVER GPU: llama_model_loader: - kv 22: tokenizer.ggml.eos_token_id u32 = 151645
DEBUG: LLAMA SERVER GPU: llama_model_loader: - kv 23: tokenizer.ggml.padding_token_id u32 = 151654
DEBUG: LLAMA SERVER GPU: llama_model_loader: - kv 24: tokenizer.ggml.add_bos_token bool = false
DEBUG: LLAMA SERVER GPU: llama_model_loader: - kv 25: tokenizer.chat_template str = {%- if tools %}\n {{- '<|im_start|>...
DEBUG: LLAMA SERVER GPU: llama_model_loader: - kv 26: general.quantization_version u32 = 2
DEBUG: LLAMA SERVER GPU: llama_model_loader: - kv 27: general.file_type u32 = 2
DEBUG: LLAMA SERVER GPU: llama_model_loader: - kv 28: quantize.imatrix.file str = Qwen3-0.6B-GGUF/imatrix_unsloth.dat
DEBUG: LLAMA SERVER GPU: llama_model_loader: - kv 29: quantize.imatrix.dataset str = unsloth_calibration_Qwen3-0.6B.txt
DEBUG: LLAMA SERVER GPU: llama_model_loader: - kv 30: quantize.imatrix.entries_count u32 = 196
DEBUG: LLAMA SERVER GPU: llama_model_loader: - kv 31: quantize.imatrix.chunks_count u32 = 688
DEBUG: LLAMA SERVER GPU: llama_model_loader: - type f32: 113 tensors
DEBUG: LLAMA SERVER GPU: llama_model_loader: - type q4_0: 193 tensors
DEBUG: LLAMA SERVER GPU: llama_model_loader: - type q4_1: 3 tensors
DEBUG: LLAMA SERVER GPU: llama_model_loader: - type q6_K: 1 tensors
DEBUG: LLAMA SERVER GPU: print_info: file format = GGUF V3 (latest)
DEBUG: LLAMA SERVER GPU: print_info: file type = Q4_0
DEBUG: LLAMA SERVER GPU: print_info: file size = 358.78 MiB (5.05 BPW)
DEBUG: LLAMA SERVER GPU: load: printing all EOG tokens:
DEBUG: LLAMA SERVER GPU: load: - 151643 ('<|endoftext|>')
DEBUG: LLAMA SERVER GPU: load: - 151645 ('<|im_end|>')
DEBUG: LLAMA SERVER GPU: load: - 151662 ('<|fim_pad|>')
DEBUG: LLAMA SERVER GPU: load: - 151663 ('<|repo_name|>')
DEBUG: LLAMA SERVER GPU: load: - 151664 ('<|file_sep|>')
DEBUG: LLAMA SERVER GPU: load: special tokens cache size = 26
DEBUG: LLAMA SERVER GPU: load: token to piece cache size = 0.9311 MB
DEBUG: LLAMA SERVER GPU: print_info: arch = qwen3
DEBUG: LLAMA SERVER GPU: print_info: vocab_only = 0
DEBUG: LLAMA SERVER GPU: print_info: n_ctx_train = 40960
DEBUG: LLAMA SERVER GPU: print_info: n_embd = 1024
DEBUG: LLAMA SERVER GPU: print_info: n_layer = 28
DEBUG: LLAMA SERVER GPU: print_info: n_head = 16
DEBUG: LLAMA SERVER GPU: print_info: n_head_kv = 8
DEBUG: LLAMA SERVER GPU: print_info: n_rot = 128
DEBUG: LLAMA SERVER GPU: print_info: n_swa = 0
DEBUG: LLAMA SERVER GPU: print_info: is_swa_any = 0
DEBUG: LLAMA SERVER GPU: print_info: n_embd_head_k = 128
DEBUG: LLAMA SERVER GPU: print_info: n_embd_head_v = 128
DEBUG: LLAMA SERVER GPU: print_info: n_gqa = 2
DEBUG: LLAMA SERVER GPU: print_info: n_embd_k_gqa = 1024
DEBUG: LLAMA SERVER GPU: print_info: n_embd_v_gqa = 1024
DEBUG: LLAMA SERVER GPU: print_info: f_norm_eps = 0.0e+00
DEBUG: LLAMA SERVER GPU: print_info: f_norm_rms_eps = 1.0e-06
DEBUG: LLAMA SERVER GPU: print_info: f_clamp_kqv = 0.0e+00
DEBUG: LLAMA SERVER GPU: print_info: f_max_alibi_bias = 0.0e+00
DEBUG: LLAMA SERVER GPU: print_info: f_logit_scale = 0.0e+00
DEBUG: LLAMA SERVER GPU: print_info: f_attn_scale = 0.0e+00
DEBUG: LLAMA SERVER GPU: print_info: n_ff = 3072
DEBUG: LLAMA SERVER GPU: print_info: n_expert = 0
DEBUG: LLAMA SERVER GPU: print_info: n_expert_used = 0
DEBUG: LLAMA SERVER GPU: print_info: causal attn = 1
DEBUG: LLAMA SERVER GPU: print_info: pooling type = -1
DEBUG: LLAMA SERVER GPU: print_info: rope type = 2
DEBUG: LLAMA SERVER GPU: print_info: rope scaling = linear
DEBUG: LLAMA SERVER GPU: print_info: freq_base_train = 1000000.0
DEBUG: LLAMA SERVER GPU: print_info: freq_scale_train = 1
DEBUG: LLAMA SERVER GPU: print_info: n_ctx_orig_yarn = 40960
DEBUG: LLAMA SERVER GPU: print_info: rope_finetuned = unknown
DEBUG: LLAMA SERVER GPU: print_info: model type = 0.6B
DEBUG: LLAMA SERVER GPU: print_info: model params = 596.05 M
DEBUG: LLAMA SERVER GPU: print_info: general.name = Qwen3-0.6B
DEBUG: LLAMA SERVER GPU: print_info: vocab type = BPE
DEBUG: LLAMA SERVER GPU: print_info: n_vocab = 151936
DEBUG: LLAMA SERVER GPU: print_info: n_merges = 151387
DEBUG: LLAMA SERVER GPU: print_info: BOS token = 11 ','
DEBUG: LLAMA SERVER GPU: print_info: EOS token = 151645 '<|im_end|>'
DEBUG: LLAMA SERVER GPU: print_info: EOT token = 151645 '<|im_end|>'
DEBUG: LLAMA SERVER GPU: print_info: PAD token = 151654 '<|vision_pad|>'
DEBUG: LLAMA SERVER GPU: print_info: LF token = 198 'Ċ'
DEBUG: LLAMA SERVER GPU: print_info: FIM PRE token = 151659 '<|fim_prefix|>'
DEBUG: LLAMA SERVER GPU: print_info: FIM SUF token = 151661 '<|fim_suffix|>'
DEBUG: LLAMA SERVER GPU: print_info: FIM MID token = 151660 '<|fim_middle|>'
DEBUG: LLAMA SERVER GPU: print_info: FIM PAD token = 151662 '<|fim_pad|>'
DEBUG: LLAMA SERVER GPU: print_info: FIM REP token = 151663 '<|repo_name|>'
DEBUG: LLAMA SERVER GPU: print_info: FIM SEP token = 151664 '<|file_sep|>'
DEBUG: LLAMA SERVER GPU: print_info: EOG token = 151643 '<|endoftext|>'
DEBUG: LLAMA SERVER GPU: print_info: EOG token = 151645 '<|im_end|>'
DEBUG: LLAMA SERVER GPU: print_info: EOG token = 151662 '<|fim_pad|>'
DEBUG: LLAMA SERVER GPU: print_info: EOG token = 151663 '<|repo_name|>'
DEBUG: LLAMA SERVER GPU: print_info: EOG token = 151664 '<|file_sep|>'
DEBUG: LLAMA SERVER GPU: print_info: max token length = 256
DEBUG: LLAMA SERVER GPU: load_tensors: loading model tensors, this can take a while... (mmap = true)
DEBUG: LLAMA SERVER GPU: llama_model_load: error loading model: make_cpu_buft_list: no CPU backend found
DEBUG: LLAMA SERVER GPU: llama_model_load_from_file_impl: failed to load model
DEBUG: LLAMA SERVER GPU: common_init_from_params: failed to load model '/srv/lemonade/hub/models--unsloth--Qwen3-0.6B-GGUF/snapshots/50968a4468ef4233ed78cd7c3de230dd1d61a56b/Qwen3-0.6B-Q4_0.gguf'
DEBUG: LLAMA SERVER GPU: srv load_model: failed to load model, '/srv/lemonade/hub/models--unsloth--Qwen3-0.6B-GGUF/snapshots/50968a4468ef4233ed78cd7c3de230dd1d61a56b/Qwen3-0.6B-Q4_0.gguf'
DEBUG: LLAMA SERVER GPU: srv operator(): operator(): cleaning up before exit...
DEBUG: LLAMA SERVER GPU: main: exiting due to model loading error
FROM ubuntu:rolling
# Add the entrypoint script to the container
COPY container/entrypoint.sh /
# Set timezone, install dependencies, clone, build, and clean up
RUN ln -sf /usr/share/zoneinfo/Europe/London /etc/localtime && \
apt-get update && \
apt-get install -y --no-install-recommends \
git \
curl \
libcurl4-openssl-dev \
pciutils \
mesa-vulkan-drivers \
python3 \
python3-pip \
python3-venv \
python3-invoke \
python3-yaml \
python3-typeguard \
python3-packaging \
python3-numpy \
python3-fasteners \
python3-git \
python3-watchfiles \
python3-websockets \
python3-cpuinfo \
python3-pytz \
python3-zstandard \
python3-fastapi \
python3-uvicorn \
python3-jinja2 \
python3-tabulate \
python3-sentencepiece \
python3-dotenv \
python3-filelock \
python3-fsspec \
python3-requests \
python3-tqdm \
python3-distro \
python3-httpx \
python3-regex \
python3-protobuf \
python3-certifi \
python3-httpcore \
python3-h11 \
python3-charset-normalizer \
python3-urllib3 && \
python3 -m venv --system-site-packages /opt/venv && \
. /opt/venv/bin/activate && \
mkdir -p /root/lemonade && \
git clone https://github.com/lemonade-sdk/lemonade.git /root/lemonade/src && \
cd /root/lemonade/src && \
git checkout v8.1.3 && \
pip3 install . && \
apt-get autoremove --purge -y git python3-pip python3-venv && \
apt-get clean && \
rm -rf /var/lib/apt/lists/* /root/* && \
chmod 755 /entrypoint.sh && \
mkdir /srv/lemonade
# Set container port, volume, entrypoint, and command
EXPOSE 8000
VOLUME /srv/lemonade
ENTRYPOINT ["/entrypoint.sh"]
CMD ["lemonade-server-dev"]
Signed-off-by: FuchtelJockel <alexander.reimelt@posteo.de>
(force-pushed from b3b8c96 to 4ef163b)
@mcowger, please attach files instead of pasting long text (logs).
I just tried to build the Docker image and fire up the container. The build worked and I can access the WebGUI, but I can't load a model; it keeps saying "Error loading models". However, the Docker log doesn't show anything, and I'm not sure what other log I could check.
I'm not sure it's that problem. I see the "Error loading models" message from the very beginning, before I even do anything: I just open the WebGUI and the error is already there. Lemonade isn't trying to download some standard models by default, is it?
We now have an official Docker image: https://github.com/lemonade-sdk/lemonade/pkgs/container/lemonade-server, so I am closing container-related issues and PRs. Please shoot me a message if I am closing anything erroneously.
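(That package page maps to the GitHub Container Registry, so pulling it should look like the line below; the :latest tag is my assumption:)
docker pull ghcr.io/lemonade-sdk/lemonade-server:latest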
Can't test/add ROCm because I have no compatible GPU.