
Conversation

@kpouget (Collaborator) commented Jul 4, 2025

Summary by Sourcery

Introduce a new API remoting layer for llama.cpp that forwards graph computations and buffer operations to a remote execution context over VirtGPU (virgl renderer). Add remoting frontend and backend libraries with encoder/decoder dispatch logic, integrate VirtGPU shared memory for data exchange, and wire it into the build system under new GGML_REMOTING_FRONTEND/BACKEND flags.

New Features:

  • Add VirtGPU-based API remoting frontend and backend libraries to forward ggml graph compute and buffer operations to a remote process via virgl renderer.
  • Introduce new CMake/Makefile options (GGML_REMOTING_FRONTEND and GGML_REMOTING_BACKEND) and build rules to conditionally compile the remoting modules.
  • Implement remote call infrastructure with custom encoder/decoder, shared memory management, and command dispatch over DRM ioctls.
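
As a rough illustration of that encoder/decoder idea, here is a toy C++ sketch (the cs_buffer type and the cs_encode/cs_decode helpers are hypothetical, not the PR's venus_cs API): fixed-width values are appended to a byte stream on encode and read back in the same order on decode; the real implementation streams into VirtGPU shared memory rather than a local array.

#include <cassert>
#include <cstddef>
#include <cstdint>
#include <cstring>

// Toy command stream in the spirit of the venus_cs helpers (vn_encode_* /
// vn_decode_*); names and layout here are illustrative only.
struct cs_buffer {
    uint8_t data[256];
    size_t  offset = 0;
};

template <typename T>
void cs_encode(cs_buffer &cs, const T &v) {
    assert(cs.offset + sizeof(T) <= sizeof(cs.data)); // no overflow
    std::memcpy(cs.data + cs.offset, &v, sizeof(T));  // fixed-width copy into the stream
    cs.offset += sizeof(T);
}

template <typename T>
void cs_decode(cs_buffer &cs, T *v) {
    assert(cs.offset + sizeof(T) <= sizeof(cs.data));
    std::memcpy(v, cs.data + cs.offset, sizeof(T));   // read back in the same order
    cs.offset += sizeof(T);
}

int main() {
    cs_buffer cs;
    cs_encode(cs, uint32_t{7});     // e.g. a command type
    cs_encode(cs, size_t{0x1000});  // e.g. a buffer size
    cs.offset = 0;                  // rewind; a reply would use a second stream
    uint32_t cmd; size_t sz;
    cs_decode(cs, &cmd);
    cs_decode(cs, &sz);
    return (cmd == 7 && sz == 0x1000) ? 0 : 1;
}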

Enhancements:

  • Comment out most LLAMA_LOG_DEBUG/INFO calls to reduce internal logging noise during model loading and tensor operations.
  • Expose Metal device context remotely via ggml_backend_metal_get_device_context to propagate device capabilities across the remoting boundary.
  • Integrate timer instrumentation around Metal compute_graph and llama token generation to collect and report performance metrics triggered at exit or via USR1 signal.
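
A minimal sketch of that instrumentation pattern (the symbol names below are illustrative, not the PR's actual code): time the measured region with CLOCK_MONOTONIC and report the accumulated totals from an atexit handler or when SIGUSR1 arrives.

#include <signal.h>
#include <stdio.h>
#include <stdlib.h>
#include <time.h>

// Accumulated wall-clock time; the PR keeps similar counters around the
// Metal compute path and the token generation loop.
static long long timer_start, timer_total, timer_count;

static long long now_ns(void) {
    struct timespec ts;
    clock_gettime(CLOCK_MONOTONIC, &ts);
    return (long long)ts.tv_sec * 1000000000LL + ts.tv_nsec;
}

static void start_timer(void) { timer_start = now_ns(); }
static void stop_timer(void)  { timer_total += now_ns() - timer_start; timer_count++; }

static void report_timers(void) {
    if (timer_count) {
        fprintf(stderr, "timed %lld calls, %.2f ms total\n",
                timer_count, timer_total / 1e6);
    }
}

// fprintf in a signal handler is not strictly async-signal-safe, but is
// common practice for ad-hoc profiling hooks like this one.
static void on_usr1(int sig) { (void)sig; report_timers(); }

int main(void) {
    atexit(report_timers);
    signal(SIGUSR1, on_usr1);
    start_timer();
    // ... the measured work (graph compute / token generation) goes here ...
    stop_timer();
    return 0;
}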

Build:

  • Update Makefile and CMakeLists to register and compile remoting frontend and backend modules and link against DRM for VirtGPU support.

Summary by CodeRabbit

  • New Features

    • Introduced remoting backend and frontend support, enabling remote execution and resource sharing across devices.
    • Added extensive shell scripts for preparing, building, and running remoting and Vulkan environments, including containerized compilation and benchmarking.
    • Implemented advanced key-value cache management for improved model performance and memory handling.
    • Added timing instrumentation for performance measurement in Metal backend and token generation.
  • Enhancements

    • Added new CMake and Makefile options for remoting backends.
    • Extended backend device and buffer interfaces to support remote and Metal-specific capabilities.
  • Bug Fixes

    • Improved error handling and logging in backend and buffer operations.
  • Refactor

    • Modularized backend and buffer logic, separating device, buffer, and Metal-specific code for clarity and maintainability.
  • Chores

    • Removed numerous GitHub Actions workflows, issue templates, and CI/CD configuration files to streamline repository maintenance.
    • Disabled or silenced non-essential logging and informational output for a quieter runtime experience.
  • Documentation

    • Added comprehensive header files for DRM, VirtGPU, and backend interfaces to clarify API usage and system integration.

sourcery-ai bot commented Jul 4, 2025

Reviewer's Guide

This PR introduces a new API remoting layer into llama.cpp by splitting the ggml backend into a “remoting frontend” (runs in the client process) and a “remoting backend” (runs in a separate host process). It uses virtio-gpu execbuffers and shared memory to marshal ggml tensors and graphs across the ABI boundary. The build system is extended to conditionally compile two new libraries (ggml-remotingfrontend and ggml-remotingbackend), and dozens of new source files implement the command dispatch, serialization of ggml_tensor/ggml_cgraph, remote buffer and device wrappers, and the virtgpu-based transport. Timing hooks were also added to measure Metal kernels and run.cpp token loops.

Class diagram for Remoting Frontend/Backend and Device Wrappers

classDiagram
    class GGMLRemotingFrontend {
        +send_tensor()
        +send_graph()
        +receive_result()
        +marshal_data()
    }
    class GGMLRemotingBackend {
        +receive_tensor()
        +receive_graph()
        +execute_graph()
        +return_result()
    }
    class VirtioGPUTransport {
        +send_buffer()
        +receive_buffer()
        +setup_shared_memory()
    }
    class GGMLDeviceWrapper {
        +wrap_device()
        +unwrap_device()
        +marshal_device_info()
    }
    GGMLRemotingFrontend ..> VirtioGPUTransport : uses
    GGMLRemotingBackend ..> VirtioGPUTransport : uses
    GGMLRemotingFrontend --> GGMLDeviceWrapper : wraps
    GGMLRemotingBackend --> GGMLDeviceWrapper : wraps

File-Level Changes

Added profiling timers to Metal backend and CLI runner
  • Inserted start/stop timers in ggml-metal encode/compute paths
  • Added atexit and signal handlers to print accumulated timings
  • Instrumented run.cpp generate loop with timing calls
Files: ggml/src/ggml-metal/ggml-metal.m, tools/run/run.cpp, ggml-remotingfrontend/shared/apir_backend.h
Extended build files to conditionally include remoting support
  • Added GGML_REMOTING_FRONTEND/BACKEND flags to Makefile and CMakeLists
  • Appended remoting source objects in build targets when enabled
  • Exposed public headers for remoting frontend
Files: Makefile, ggml/CMakeLists.txt, ggml/src/CMakeLists.txt
Added ggml-remotingfrontend library
  • Implements ggml_backend_device and buffer interfaces over virtgpu RPC
  • Defines remote marshalling of ggml_tensor and ggml_cgraph
  • Wraps Metal device context calls and buffer operations
  • Registers frontend via ggml_backend_reg
Files: ggml/src/ggml-remotingfrontend, ggml/include/ggml-remoting-frontend.h
Added ggml-remotingbackend library
  • Loads host ggml backend as a shared library
  • Dispatches RPC commands in virtgpu execbuffer handler
  • Deserializes ggml_cgraph and tensors into host context
  • Calls underlying backend’s graph_compute and metal device queries
Files: ggml/src/ggml-remotingbackend, ggml/src/ggml-remotingbackend/shared, ggml/src/ggml-remotingbackend/backend-dispatched.cpp
Implemented a custom virtgpu-based transport
  • Created a virtgpu helper to open the DRM device and shared memory blobs
  • Used venus command streams to encode/decode requests
  • Polled for responses via eventfd in shared memory
  • Managed blob allocation and mapping logic
Files: ggml/src/ggml-remotingfrontend/virtgpu.cpp, ggml/src/ggml-remotingfrontend/virtgpu-shm.cpp, ggml/src/ggml-remotingfrontend/virtgpu-utils.cpp


coderabbitai bot commented Jul 4, 2025

Important: Review skipped

Auto reviews are disabled on base/target branches other than the default branch.

Please check the settings in the CodeRabbit UI or the .coderabbit.yaml file in this repository. To trigger a single review, invoke the @coderabbit review command.

You can disable this status message by setting the reviews.review_status to false in the CodeRabbit configuration file.

Walkthrough

This update introduces a comprehensive remoting backend and frontend for the GGML framework, including new CMake options, build scripts, and extensive C++ implementations for remote device, buffer, and graph management. Numerous serialization, deserialization, and communication utilities are added, along with shell scripts for build and run automation. Many GitHub Actions workflow and template files are removed, and logging is silenced across several model and vocab info functions.

Changes

  • .github/ISSUE_TEMPLATE/*.yml, .github/labeler.yml, .github/pull_request_template.md, .github/workflows/*, .github/actions/*: Deleted all GitHub Actions workflows, custom actions, issue templates, labeler config, and the PR template.
  • CMakePresets.json, Makefile, ggml/CMakeLists.txt, ggml/src/CMakeLists.txt: Added new remoting backend/frontend build options and presets; updated build rules and header installation.
  • build.backend.sh, prepare.backend.sh, prepare.remoting.sh, prepare.sh, prepare.vulkan.sh, podman_compile.sh, run.remoting.sh, run.sh, run.vulkan.sh: Added new shell scripts for configuring, building, and running remoting and Vulkan backends/frontends.
  • ggml/include/ggml-metal.h: Added device context query API for the Metal backend.
  • ggml/include/ggml-remoting-frontend.h: New header: remoting frontend backend registration API and macro.
  • ggml/src/ggml-backend-reg.cpp: Integrated conditional registration and loading of the remoting frontend backend.
  • ggml/src/ggml-metal/ggml-metal.m: Added timing instrumentation and a public API for device context querying.
  • ggml/src/ggml-remotingbackend/ (multiple files): New backend: CMake config, device/buffer/graph dispatchers, serialization, logging, error codes, and utility headers.
  • ggml/src/ggml-remotingfrontend/ (multiple files, including include/): New frontend: device, buffer, host buffer, Metal support, remoting backend, serialization, virtgpu interface, shared memory, and utility code.
  • src/llama-context.cpp, src/llama-kv-cache-unified.cpp, src/llama-model-loader.cpp, src/llama-model.cpp, src/llama-vocab.cpp: Disabled or commented out logging and info printouts in constructors and info methods.
  • src/llama-kv-cache.cpp: Added unified key-value cache class for LLaMA, including serialization, defragmentation, and view support.
  • tools/run/run.cpp: Added timing instrumentation to the token decoding loop and reporting at program exit.

Sequence Diagram(s)

sequenceDiagram
    participant User
    participant ShellScript
    participant CMake
    participant Backend
    participant Frontend
    participant VirtGPU
    participant RemoteGPU

    User->>ShellScript: Run build or inference script
    ShellScript->>CMake: Configure build (remoting/vulkan/backend)
    CMake->>Backend: Build remoting backend/frontend
    User->>Frontend: Start inference (e.g., llama-run)
    Frontend->>VirtGPU: Prepare/encode computation graph
    VirtGPU->>RemoteGPU: Send serialized graph via shared memory
    RemoteGPU->>VirtGPU: Compute and return status/result
    VirtGPU->>Frontend: Decode and process result
    Frontend->>User: Output inference result and timing

Poem

🐇
A flurry of scripts, a backend anew,
Remoting and buffers, Metal shines through.
Serialization dances, graphs leap and bound,
VirtGPU whispers, in bytes they are found.
Logging grows silent, but timing is bright—
The rabbit builds onward, from morning to night!
🛠️🌈


@kpouget (Collaborator, Author) commented Jul 4, 2025

@coderabbitai review

coderabbitai bot commented Jul 4, 2025

✅ Actions performed

Review triggered.

Note: CodeRabbit is an incremental review system and does not re-review already reviewed commits. This command is applicable only when automatic reviews are paused.

coderabbitai bot left a comment

Actionable comments posted: 91

🔭 Outside diff range comments (5)
run.sh (1)

1-2: Add a proper shebang (and fail-early flags) so the script can be executed directly

Without a shebang the kernel cannot pick an interpreter, so ./run.sh will exit with “Exec format error”.
Add a portable shebang and fail-fast flags:

+#!/usr/bin/env bash
+set -euo pipefail
+
 ./build/bin/llama-run --ngl 999 --verbose ~/models/llama3.2 "say nothing"
prepare.sh (1)

1-2: Missing shebang prevents direct execution

Same issue as run.sh: the script is not runnable with ./prepare.sh.
Recommend:

+#!/usr/bin/env bash
+set -euo pipefail
+
 cmake -S . -B ./build -DGGML_VULKAN=ON -DGGML_NATIVE=OFF -DGGML_METAL=OFF #-DCMAKE_BUILD_TYPE=Debug #-DGGML_VULKAN_DEBUG=1
prepare.vulkan.sh (1)

1-7: Add shebang & safety flags for consistency with other helper scripts

+#!/usr/bin/env bash
+set -euo pipefail
+
 cmake -S . \
       -B ../build.vulkan \
       -DGGML_VULKAN=ON \
       -DGGML_NATIVE=OFF \
       -DGGML_METAL=OFF \
       -DCMAKE_BUILD_TYPE=Debug
src/llama-kv-cache-unified.cpp (1)

74-86: dev_name is now unused — will trigger -Wunused-variable and break -Werror builds
With the debug log line commented-out, dev_name is never read. Many toolchains compile ggml/llama.cpp with -Wall -Werror; the unused variable will therefore fail the build.

         const char * dev_name = "CPU";
@@
-        //LLAMA_LOG_DEBUG("%s: layer %3d: dev = %s\n", __func__, il, dev_name);
+#if defined(LLAMA_LOG_DEBUG) && !defined(NDEBUG)
+        LLAMA_LOG_DEBUG("%s: layer %3d: dev = %s\n", __func__, il, dev_name);
+#else
+        (void)dev_name;               // silence unused-var in release builds
+#endif

This keeps the diagnostic available for debug builds while guaranteeing a clean release compile.

podman_compile.sh (1)

9-40: Validate podman availability before execution.

The script doesn't check if podman is installed and available.

Add validation at the beginning of the script:

if ! command -v podman &> /dev/null; then
    echo "Error: podman is not installed or not in PATH"
    exit 1
fi
🧹 Nitpick comments (43)
run.vulkan.sh (2)

1-1: Tighten the shebang

#! /bin/bash has an extra space and hard-codes /bin/bash. Prefer env lookup:

-#! /bin/bash
+#!/usr/bin/env bash

2-8: Fail fast and quote prefix variable

Consider aborting on any error and treating unset vars as fatal:

+set -euo pipefail
+
 if [[ ${1:-} == "strace" ]]; then
     prefix="strace"
 elif [[ ${1:-} == "gdb" ]]; then
     prefix="gdb --args"
 else
     prefix=""
 fi
ggml/CMakeLists.txt (1)

276-276: Header installed unconditionally, regardless of option state
ggml-remoting-frontend.h is exported even when GGML_REMOTING_FRONTEND=OFF.
Downstream consumers will include a header whose implementing symbols are absent, leading to link-time errors.

Guard the public header list behind the same option or always build the thin stub implementation so symbol availability matches install surface.

src/llama-model-loader.cpp (3)

679-697: Redundant work after disabling KV-dump logging

The for (int i = 0; i < n_kv; i++) loop still does string formatting (gguf_kv_to_str, replace_all, etc.) even though every LLAMA_LOG_INFO call inside it is commented out. For large models this can be a noticeable cost at load time while providing zero benefit.

-        for (int i = 0; i < n_kv; i++) {
-            ...
-            //LLAMA_LOG_INFO(...);
-        }
+        /* KV dump disabled */
+        if (trace > 1) {        // or a new `LLAMA_VERBOSE_META` flag
+            for (int i = 0; i < n_kv; i++) {
+                ...
+                LLAMA_LOG_INFO(...);
+            }
+        }

Guarding (or #ifdef-ing) the whole loop avoids useless allocations and string ops when logging is off.


700-706: Same issue for type-count loop

The loop that walks n_type now only computes and discards its body. Wrap it behind the same flag or remove it entirely to avoid wasted cycles.


1156-1164: Early return leaves dead, compiled-in code

print_info() now immediately returns, leaving ~8 lines of unreachable logging code. Most compilers will warn; it also confuses readers.

Options:

  1. Delete the unreachable block.
  2. #if 0 it.
  3. Add a runtime flag (if (!verbose) return;).

Example:

 void llama_model_loader::print_info() const {
-    return;
+#ifdef LLAMA_QUIET_LOAD   // hypothetical flag: skip the info dump when quiet loading is requested
+    return;
+#endif

Pick whichever aligns with the new logging policy, but avoid shipping dead code.

src/llama-model.cpp (1)

1563-1568: Consider using conditional logging instead of commenting out debug statements.

While reducing noise during model loading aligns with the PR objectives, completely commenting out debug logs can hinder troubleshooting. The device assignment and SWA status information could be valuable for debugging tensor loading issues.

Consider implementing a more flexible logging approach:

  • Use environment variables or build-time flags to control debug logging
  • Implement log level filtering at runtime
  • Or remove the commented code entirely if it's no longer needed
-            //LLAMA_LOG_DEBUG("load_tensors: layer %3d assigned to device %s, is_swa = %d\n", il, ggml_backend_dev_name(cpu_dev), is_swa);
+            LLAMA_LOG_TRACE("load_tensors: layer %3d assigned to device %s, is_swa = %d\n", il, ggml_backend_dev_name(cpu_dev), is_swa);
-        //LLAMA_LOG_DEBUG("load_tensors: layer %3d assigned to device %s, is_swa = %d\n", il, ggml_backend_dev_name(dev), is_swa);
+        LLAMA_LOG_TRACE("load_tensors: layer %3d assigned to device %s, is_swa = %d\n", il, ggml_backend_dev_name(dev), is_swa);
prepare.backend.sh (2)

7-7: Avoid hard-coded macOS-specific Vulkan paths

/opt/homebrew/... is Apple Silicon-specific and will fail on Intel macOS, Linux and CI runners.
Prefer passing these via environment variables (e.g. VULKAN_SDK) or CMake cache options so callers can override.


10-11: Consider exposing the Debug build flag through an option instead of leaving it commented

Leaving a commented flag invites drift. Offer a boolean CLI switch (e.g. --debug) or document usage in README to toggle -DCMAKE_BUILD_TYPE=Debug.

ggml/include/ggml-metal.h (1)

64-67: Good API addition for device capability querying.

The function signature is well-designed and follows existing conventions. The boolean output parameters clearly indicate the device capabilities being queried.

Consider adding documentation comments to explain what each capability flag represents:

+// Query Metal device capabilities
+// has_simdgroup_mm: whether device supports SIMD group matrix multiplication
+// has_simdgroup_reduction: whether device supports SIMD group reduction operations
+// use_bfloat: whether device should use bfloat16 precision
 GGML_BACKEND_API void ggml_backend_metal_get_device_context(ggml_backend_dev_t dev,
                                                             bool *has_simdgroup_mm,
                                                             bool *has_simdgroup_reduction,
                                                             bool *use_bfloat);
ggml/src/ggml-remotingfrontend/CMakeLists.txt (1)

28-28: Add error handling for missing drm library dependency.

Consider adding a check to ensure the drm library is available before linking:

# dnf install -y libdrm-devel
+find_library(DRM_LIBRARY drm)
+if(NOT DRM_LIBRARY)
+    message(FATAL_ERROR "libdrm not found. Install libdrm-devel package.")
+endif()
target_link_libraries(ggml-remotingfrontend PUBLIC drm)
ggml/src/ggml-remotingfrontend/virtgpu-forward-metal.cpp (1)

3-19: Consider more meaningful return value and input validation.

The function always returns true regardless of the remote call outcome. Consider returning the actual success status and adding input validation:

 bool
 apir_metal_get_device_context(struct virtgpu *gpu, struct ggml_backend_metal_device_context *metal_dev_ctx) {
+  if (!gpu || !metal_dev_ctx) {
+    return false;
+  }
+
   struct vn_cs_encoder *encoder;
   struct vn_cs_decoder *decoder;

   REMOTE_CALL_PREPARE(gpu, encoder, APIR_COMMAND_TYPE_METAL_GET_DEVICE_CONTEXT);

   REMOTE_CALL(gpu, encoder, decoder);

   vn_decode_bool_t(decoder, &metal_dev_ctx->has_simdgroup_mm);
   vn_decode_bool_t(decoder, &metal_dev_ctx->has_simdgroup_reduction);
   vn_decode_bool_t(decoder, &metal_dev_ctx->use_bfloat);

   REMOTE_CALL_FINISH(gpu, encoder, decoder);

   return true;
 }

Note: The macros appear to handle errors via FATAL, so the current approach may be intentional.

ggml/src/ggml-remotingfrontend/virtgpu-forward-impl.h (2)

7-8: Remove unused CACHED macro and fix comment typo.

The CACHED macro appears to be unused, and there's a typo in the comment:

-#define CACHED
-//  printf("INFO: ### found response in the cache %s\n", __func__)o

Consider removing unused code to keep the header clean.


15-16: Consider defensive check for encoder variable name.

The macro uses encoder in the conditional check but takes encoder_name as parameter. While this works with current usage, consider making it more robust:

-    if (!encoder) {							\
+    if (!encoder_name) {						\

This makes the macro more self-contained and less dependent on variable naming conventions.

ggml/src/ggml-remotingfrontend/ggml-metal-remoting.h (1)

7-16: Add documentation for public API.

Consider adding documentation comments to explain the purpose and usage of the struct and functions.

+/**
+ * Metal device context containing device capabilities
+ */
 struct ggml_backend_metal_device_context {
   bool has_simdgroup_mm;
   bool has_simdgroup_reduction;
   bool use_bfloat;
 };


+/**
+ * Get the Metal device context for a given backend device
+ * @param dev The backend device
+ * @return Pointer to the Metal device context, or NULL if not available
+ */
 const struct ggml_backend_metal_device_context *get_metal_dev_context(const ggml_backend_dev_t dev);

+/**
+ * Check if a Metal device context supports a specific tensor operation
+ * @param ctx_dev The Metal device context
+ * @param op The tensor operation to check
+ * @return true if the operation is supported, false otherwise
+ */
 bool ggml_metal_supports_op(const struct ggml_backend_metal_device_context * ctx_dev, const struct ggml_tensor * op);
ggml/src/ggml-remotingbackend/backend-dispatched.cpp (2)

11-17: Consider encapsulating global state.

Global variables should be encapsulated to prevent unintended access and improve maintainability.

Consider wrapping these globals in a namespace or singleton class:

-ggml_backend_reg_t reg = NULL;
-ggml_backend_dev_t dev = NULL;
-ggml_backend_t bck = NULL;
-
-long long timer_start = 0;
-long long timer_total = 0;
-long long timer_count = 0;
+namespace backend_state {
+    ggml_backend_reg_t reg = NULL;
+    ggml_backend_dev_t dev = NULL;
+    ggml_backend_t bck = NULL;
+    
+    long long timer_start = 0;
+    long long timer_total = 0;
+    long long timer_count = 0;
+}

44-44: Use INFO level for informational messages.

Free memory information is not a warning condition.

-  WARNING("%s: free memory: %ld MB\n", __func__, (size_t) free/1024/1024);
+  INFO("%s: free memory: %ld MB\n", __func__, (size_t) free/1024/1024);
ggml/src/ggml-remotingfrontend/virtgpu-shm.cpp (1)

82-82: Define magic number as a constant.

The magic number 16384 should be defined as a named constant for clarity.

Add at the top of the file:

+// Minimum alignment for shared memory allocations
+#define VIRTGPU_SHMEM_MIN_ALIGN 16384
+
 static uint32_t
 virtgpu_ioctl_resource_create_blob(struct virtgpu *gpu,

Then use it:

-   size = align64(size, 16384);
+   size = align64(size, VIRTGPU_SHMEM_MIN_ALIGN);
ggml/src/ggml-remotingfrontend/virtgpu-forward-backend.cpp (2)

3-7: Remove unused function or fix its implementation.

The function current_time_ms is not used anywhere and has issues:

  1. Returns nanoseconds despite the _ms suffix
  2. Comment suggests using CLOCK_MONOTONIC but uses CLOCK_REALTIME

Either remove the function entirely or fix it if it will be used:

-static long long current_time_ms() {
-  struct timespec ts;
-  clock_gettime(CLOCK_REALTIME, &ts);  // Use CLOCK_MONOTONIC for elapsed time
-  return (long long)ts.tv_sec * 1000000000LL + ts.tv_nsec;
-}

42-44: Consider using a more appropriate default status.

GGML_STATUS_ABORTED might not be the best default if decoding fails. Consider using a status that indicates a communication/decoding failure.

-  ggml_status status = GGML_STATUS_ABORTED;
+  ggml_status status = GGML_STATUS_FAILED; // or another appropriate error status
build.backend.sh (1)

30-36: Use more idiomatic error handling.

Instead of checking $?, use the command directly in the if statement.

-cmake --build ../build.remoting-backend$FLAVOR --parallel 8 --target $TARGETS "$@"
-
-if [[ $? == 0 ]]; then
+if cmake --build ../build.remoting-backend$FLAVOR --parallel 8 --target $TARGETS "$@"; then
     touch READY_backend
 else
     touch FAILED_backend
 fi
run.remoting.sh (1)

42-42: Fix grammar in the prompt.

The prompt has a grammatical error.

-    PROMPT="tell what's Apple metal API"
+    PROMPT="tell me what Apple Metal API is"
ggml/src/ggml-metal/ggml-metal.m (1)

11-12: Consider using a compile-time flag to control debug logging.

Instead of unconditionally disabling debug logging, consider using a compile-time flag to allow users to enable debug logs when needed for troubleshooting.

+#ifdef GGML_METAL_DISABLE_DEBUG_LOGS
 #undef GGML_LOG_DEBUG
 #define GGML_LOG_DEBUG(...)
+#endif
tools/run/run.cpp (1)

991-996: Consider making timer functions static instead of inline

These inline functions are only used within this translation unit. Making them static instead of inline would be more appropriate and let the compiler decide on inlining.

-static inline void start_timer(void) {
+static void start_timer(void) {
   struct timespec ts;
-  clock_gettime(CLOCK_REALTIME, &ts);  // Use CLOCK_MONOTONIC for elapsed time
+  clock_gettime(CLOCK_MONOTONIC, &ts);
   timer_start = (long long)ts.tv_sec * 1000000000LL + ts.tv_nsec;
 }

-static inline void stop_timer(void) {
+static void stop_timer(void) {
   struct timespec ts;
-  clock_gettime(CLOCK_REALTIME, &ts);  // Use CLOCK_MONOTONIC for elapsed time
+  clock_gettime(CLOCK_MONOTONIC, &ts);

Also applies to: 997-1005

ggml/src/ggml-remotingbackend/backend-dispatched-backend.cpp (1)

13-57: Consider returning meaningful error codes

The function always returns 0 regardless of success or failure paths. Consider using the return value to indicate different error conditions.
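
A minimal sketch of what distinct return codes could look like (the enum and function names here are hypothetical, not the PR's API):

#include <cstdint>

// Hypothetical status codes for the dispatch path; names are illustrative.
enum backend_dispatch_status : uint32_t {
    DISPATCH_OK             = 0,
    DISPATCH_DECODE_FAILED  = 1,
    DISPATCH_COMPUTE_FAILED = 2,
};

// Sketch: propagate each failure path instead of returning 0 unconditionally.
static uint32_t dispatch_graph_compute(bool decoded_ok, bool compute_ok) {
    if (!decoded_ok) {
        return DISPATCH_DECODE_FAILED;   // deserialization error
    }
    if (!compute_ok) {
        return DISPATCH_COMPUTE_FAILED;  // backend graph_compute error
    }
    return DISPATCH_OK;
}

int main() {
    return (int) dispatch_graph_compute(true, true); // 0 on success
}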

ggml/src/ggml-remotingfrontend/ggml-backend-buffer.cpp (2)

44-54: Remove commented debug code

Please remove the commented-out debug code to keep the codebase clean.

   struct virtgpu *gpu = BUFFER_TO_GPU(buffer);
-#if 0
-  INFO("%s: data=%p, offset=%lu, size=%lu\n", __func__, data, offset, size);
-#endif
-#if 0
-  void **addr = (void **)(uintptr_t)data;
-  for (int i = 0; i <= 10; i++) {
-    INFO("%s: %p | %llx", __func__, addr, *addr);
-    addr++;
-  }
-  INFO("\n");
-#endif
   struct ggml_backend_remoting_buffer_context *context = BUFFER_TO_GGML_CONTEXT(buffer);

15-23: Add null check for context->base assignment

The function assigns to context->base without checking if the assignment was successful.

   if (context->base) {
     return context->base;
   }
 
   context->base = apir_buffer_get_base(BUFFER_TO_GPU(buffer),
 				       BUFFER_TO_APIR_CONTEXT(buffer));
+  if (!context->base) {
+    ERROR("Failed to get buffer base address");
+  }
 
   return context->base;
ggml/src/ggml-remotingfrontend/ggml-backend-reg.cpp (1)

74-81: Device description buffer is not properly used.

The desc buffer is created but only contains static text. It should be populated with device-specific information.

Consider populating the description with device details:

-  char desc[256] = "API Remoting device";
+  char desc[256];
+  snprintf(desc, sizeof(desc), "API Remoting device %d", i);
ggml/src/ggml-remotingfrontend/virtgpu-utils.cpp (3)

19-20: Define HAVE___BUILTIN_CLZLL for compiler optimization.

The macro HAVE___BUILTIN_CLZLL is not defined anywhere, causing the function to always use the slower fallback implementation instead of the built-in __builtin_clzll.

Consider using compiler detection:

-#if defined(HAVE___BUILTIN_CLZLL)
+#if defined(__GNUC__) || defined(__clang__)
    return ((sizeof(uint64_t) * 8 - 1) - __builtin_clzll(n | 1));

65-79: Consider iterative cleanup to avoid stack overflow.

The recursive cleanup in _util_sparse_array_node_finish could cause stack overflow for deeply nested sparse arrays.

Consider using an iterative approach with a stack/queue for large data structures to avoid potential stack overflow issues.
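
A minimal sketch of the iterative pattern, assuming a simplified node type (the real sparse-array nodes differ):

#include <vector>

// Hypothetical node type standing in for the sparse array's internal nodes;
// this only illustrates the iterative cleanup pattern.
struct node {
    std::vector<node*> children;
};

// Replacement for recursive cleanup: an explicit work list bounds memory by
// the number of live nodes instead of the call-stack depth.
static void free_tree(node *root) {
    std::vector<node*> work;
    if (root) work.push_back(root);
    while (!work.empty()) {
        node *n = work.back();
        work.pop_back();
        for (node *c : n->children) {
            if (c) work.push_back(c);
        }
        delete n; // children are already queued, so freeing n now is safe
    }
}

int main() {
    node *root = new node;
    root->children.push_back(new node);
    free_tree(root);
    return 0;
}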


188-195: Fix typo and consider removing debug code.

The function contains a typo ("happilly" should be "happily") and uses a global variable hack to avoid compiler warnings.

-  INFO("thks bye, stopping early and happilly :)");
+  INFO("thks bye, stopping early and happily :)");

Consider removing these debug functions or wrapping them in debug-only compilation flags.

ggml/src/ggml-remotingfrontend/virtgpu.cpp (1)

81-82: Make shared memory sizes configurable.

The shared memory sizes are hardcoded, which may not be suitable for all use cases.

Consider making these sizes configurable through environment variables or initialization parameters:

-  gpu->reply_shmem = virtgpu_shmem_create(gpu, 0x4000);
-  gpu->data_shmem = virtgpu_shmem_create(gpu, 0x1830000); // 24MiB
+  size_t reply_size = getenv("VIRTGPU_REPLY_SIZE") ? 
+                      atol(getenv("VIRTGPU_REPLY_SIZE")) : 0x4000;
+  size_t data_size = getenv("VIRTGPU_DATA_SIZE") ? 
+                     atol(getenv("VIRTGPU_DATA_SIZE")) : 0x1830000;
+  gpu->reply_shmem = virtgpu_shmem_create(gpu, reply_size);
+  gpu->data_shmem = virtgpu_shmem_create(gpu, data_size);
ggml/src/ggml-remotingfrontend/virtgpu.h (1)

1-1: Use traditional include guards instead of #pragma once.

While #pragma once is widely supported, traditional include guards are more portable and compatible with all compilers.

-#pragma once
+#ifndef GGML_VIRTGPU_H
+#define GGML_VIRTGPU_H

And add at the end of the file:

#endif // GGML_VIRTGPU_H
ggml/src/ggml-remotingfrontend/virtgpu-forward-buffer-type.cpp (1)

44-44: Consider making INFO logging configurable.

Multiple functions contain INFO logging that may be too verbose for production use. Consider using a debug flag or log level to control these messages.

-  INFO("%s: Forward BUFT ALIGNMENT --> %zu ", __func__, alignment);
+  // INFO("%s: Forward BUFT ALIGNMENT --> %zu ", __func__, alignment);

Also applies to: 65-65, 86-86, 99-99

ggml/src/ggml-remotingfrontend/virtgpu-utils.h (1)

48-69: Consider moving inline function implementations to a source file.

The WARNING and FATAL inline functions in the header may cause code bloat if included in many translation units. Consider moving the implementations to a source file.
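
A minimal sketch of the split, with a hypothetical virtgpu_warning function standing in for the header's WARNING/FATAL helpers (shown as one file for brevity; in practice the declaration lives in the header and the definition in the .cpp):

// virtgpu-utils.h (sketch): declare only, so every translation unit shares
// one out-of-line definition instead of inlining its own copy.
void virtgpu_warning(const char *fmt, ...);

// virtgpu-utils.cpp (sketch): the single definition.
#include <cstdarg>
#include <cstdio>

void virtgpu_warning(const char *fmt, ...) {
    va_list args;
    va_start(args, fmt);
    std::fprintf(stderr, "WARNING: ");
    std::vfprintf(stderr, fmt, args);
    std::fprintf(stderr, "\n");
    va_end(args);
}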

ggml/src/ggml-remotingfrontend/virtgpu-forward-buffer.cpp (1)

62-62: Remove unnecessary return statement.

The explicit return; at the end of a void function is unnecessary.

-  return;
ggml/src/ggml-remotingfrontend/virtgpu-forward-device.cpp (1)

54-74: Consider caching device description for consistency.

The device description function doesn't cache its result, unlike apir_device_get_name and apir_device_get_type. This inconsistency could lead to unnecessary repeated remote calls.

 const char *
 apir_device_get_description(struct virtgpu *gpu) {
+  static char *string = nullptr;
+  if (string) {
+    CACHED;
+    return string;
+  }
   struct vn_cs_encoder *encoder;
   struct vn_cs_decoder *decoder;

   REMOTE_CALL_PREPARE(gpu, encoder, APIR_COMMAND_TYPE_DEVICE_GET_DESCRIPTION);

   REMOTE_CALL(gpu, encoder, decoder);

   const size_t string_size = vn_decode_array_size_unchecked(decoder);
-  char *string = (char *) vn_cs_decoder_alloc_array(decoder, sizeof(char), string_size);
+  string = (char *) vn_cs_decoder_alloc_array(decoder, sizeof(char), string_size);
   if (!string) {
     FATAL("%s: Could not allocate the device description buffer", __func__);
   }
   vn_decode_char_array(decoder, string, string_size);

   INFO("%s: Forward DEV DESCR --> %s", __func__, string);

   REMOTE_CALL_FINISH(gpu, encoder, decoder);

   return string;
 }
ggml/src/ggml-remotingbackend/backend-dispatched-buffer.cpp (2)

34-36: Simplify tensor pointer casting.

The double cast through uintptr_t is unnecessary and the comment about removing const is confusing since the result is assigned to a non-const pointer.

   ggml_tensor *tensor;
-  // safe to remove the const qualifier here
-  tensor = (ggml_tensor *) (uintptr_t) vn_decode_ggml_tensor(dec);
+  tensor = const_cast<ggml_tensor *>(vn_decode_ggml_tensor(dec));

98-98: Fix inconsistent indentation.

   void *shmem_data = ctx->iface.get_shmem_ptr(ctx->virgl_ctx, shmem_res_id);
-    if (!shmem_data) {
+  if (!shmem_data) {
     FATAL("Couldn't get the shmem addr from virgl :/");
   }
ggml/src/ggml-remotingbackend/shared/apir_backend.h (1)

124-124: Fix error message formatting.

The error message has a typo at the end.

-  return "Unknown APIR_BACKEND_INITIALIZE error:/";
+  return "Unknown APIR_BACKEND_INITIALIZE error";
ggml/src/ggml-remotingbackend/shared/venus_cs_ggml-rpc.h (1)

6-22: Consider reordering struct fields for natural alignment

The struct uses explicit padding, but you could minimize padding requirements by grouping fields by size. Consider ordering fields from largest to smallest alignment requirements (uint64_t fields first, then uint32_t/int32_t, then chars).

 struct rpc_tensor {
+  // 8-byte aligned fields
   uint64_t id;
-  uint32_t type;
   uint64_t buffer;
-  uint32_t ne[GGML_MAX_DIMS];
-  uint32_t nb[GGML_MAX_DIMS];
-  uint32_t op;
-  int32_t  op_params[GGML_MAX_OP_PARAMS / sizeof(int32_t)];
-  int32_t  flags;
   uint64_t src[GGML_MAX_SRC];
   uint64_t view_src;
   uint64_t view_offs;
   uint64_t data;
+  
+  // 4-byte aligned fields
+  uint32_t type;
+  uint32_t ne[GGML_MAX_DIMS];
+  uint32_t nb[GGML_MAX_DIMS];
+  uint32_t op;
+  int32_t  op_params[GGML_MAX_OP_PARAMS / sizeof(int32_t)];
+  int32_t  flags;
+  
+  // 1-byte aligned fields
   char name[GGML_MAX_NAME];
-
-  char padding[4];
 };
ggml/src/ggml-remotingfrontend/ggml-remoting.h (1)

40-82: Consider using a proper logging framework

The debug/logging macros use printf directly. Consider using a logging framework or at least routing through a centralized logging function for better control over log levels and destinations.

Example approach:

// Define log levels
enum LogLevel { LOG_DEBUG, LOG_INFO, LOG_WARN, LOG_ERROR };

// Centralized logging function
void ggml_remoting_log(LogLevel level, const char* fmt, ...);

// Update macros to use centralized logging
#define NOT_IMPLEMENTED \
  ggml_remoting_log(LOG_WARN, "reached unimplemented function %s", __func__)
src/llama-kv-cache.cpp (1)

89-92: Consider using conditional logging instead of commented code.

The debug logging is commented out. Consider using a debug flag or log level to control this output.

-	/*
-        LLAMA_LOG_DEBUG("%s: layer %3d: n_embd_k_gqa = %d, n_embd_v_gqa = %d, dev = %s\n", __func__,
-                i, n_embd_k_gqa, n_embd_v_gqa, dev_name);
-	*/
+        if (LLAMA_LOG_LEVEL >= LLAMA_LOG_LEVEL_DEBUG) {
+            LLAMA_LOG_DEBUG("%s: layer %3d: n_embd_k_gqa = %d, n_embd_v_gqa = %d, dev = %s\n", __func__,
+                    i, n_embd_k_gqa, n_embd_v_gqa, dev_name);
+        }
📜 Review details

Configuration used: CodeRabbit UI
Review profile: CHILL
Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between d67341d and 7748ae1.

📒 Files selected for processing (98)
  • .github/ISSUE_TEMPLATE/010-bug-compilation.yml (0 hunks)
  • .github/ISSUE_TEMPLATE/011-bug-results.yml (0 hunks)
  • .github/ISSUE_TEMPLATE/019-bug-misc.yml (0 hunks)
  • .github/ISSUE_TEMPLATE/020-enhancement.yml (0 hunks)
  • .github/ISSUE_TEMPLATE/030-research.yml (0 hunks)
  • .github/ISSUE_TEMPLATE/040-refactor.yml (0 hunks)
  • .github/ISSUE_TEMPLATE/config.yml (0 hunks)
  • .github/actions/get-tag-name/action.yml (0 hunks)
  • .github/actions/windows-setup-cuda/action.yml (0 hunks)
  • .github/actions/windows-setup-curl/action.yml (0 hunks)
  • .github/labeler.yml (0 hunks)
  • .github/pull_request_template.md (0 hunks)
  • .github/workflows/bench.yml.disabled (0 hunks)
  • .github/workflows/build-linux-cross.yml (0 hunks)
  • .github/workflows/build.yml (0 hunks)
  • .github/workflows/close-issue.yml (0 hunks)
  • .github/workflows/docker.yml (0 hunks)
  • .github/workflows/editorconfig.yml (0 hunks)
  • .github/workflows/gguf-publish.yml (0 hunks)
  • .github/workflows/labeler.yml (0 hunks)
  • .github/workflows/python-check-requirements.yml (0 hunks)
  • .github/workflows/python-lint.yml (0 hunks)
  • .github/workflows/python-type-check.yml (0 hunks)
  • .github/workflows/release.yml (0 hunks)
  • .github/workflows/server.yml (0 hunks)
  • .github/workflows/winget.yml (0 hunks)
  • CMakePresets.json (1 hunks)
  • Makefile (2 hunks)
  • build-xcframework.sh (0 hunks)
  • build.backend.sh (1 hunks)
  • ggml/CMakeLists.txt (2 hunks)
  • ggml/include/ggml-metal.h (1 hunks)
  • ggml/include/ggml-remoting-frontend.h (1 hunks)
  • ggml/src/CMakeLists.txt (1 hunks)
  • ggml/src/ggml-backend-reg.cpp (3 hunks)
  • ggml/src/ggml-metal/ggml-metal.m (4 hunks)
  • ggml/src/ggml-remotingbackend/CMakeLists.txt (1 hunks)
  • ggml/src/ggml-remotingbackend/backend-convert.h (1 hunks)
  • ggml/src/ggml-remotingbackend/backend-dispatched-backend.cpp (1 hunks)
  • ggml/src/ggml-remotingbackend/backend-dispatched-buffer-type.cpp (1 hunks)
  • ggml/src/ggml-remotingbackend/backend-dispatched-buffer.cpp (1 hunks)
  • ggml/src/ggml-remotingbackend/backend-dispatched-device.cpp (1 hunks)
  • ggml/src/ggml-remotingbackend/backend-dispatched-metal.cpp (1 hunks)
  • ggml/src/ggml-remotingbackend/backend-dispatched.cpp (1 hunks)
  • ggml/src/ggml-remotingbackend/backend-dispatched.h (1 hunks)
  • ggml/src/ggml-remotingbackend/backend-internal.h (1 hunks)
  • ggml/src/ggml-remotingbackend/backend-utils.h (1 hunks)
  • ggml/src/ggml-remotingbackend/backend.cpp (1 hunks)
  • ggml/src/ggml-remotingbackend/shared/api_remoting.h (1 hunks)
  • ggml/src/ggml-remotingbackend/shared/apir_backend.h (1 hunks)
  • ggml/src/ggml-remotingbackend/shared/venus_cs.h (1 hunks)
  • ggml/src/ggml-remotingbackend/shared/venus_cs_ggml-rpc.cpp (1 hunks)
  • ggml/src/ggml-remotingbackend/shared/venus_cs_ggml-rpc.h (1 hunks)
  • ggml/src/ggml-remotingbackend/shared/venus_cs_ggml.h (1 hunks)
  • ggml/src/ggml-remotingbackend/venus_cs_ggml-rpc-back.cpp (1 hunks)
  • ggml/src/ggml-remotingfrontend/CMakeLists.txt (1 hunks)
  • ggml/src/ggml-remotingfrontend/ggml-backend-buffer-type.cpp (1 hunks)
  • ggml/src/ggml-remotingfrontend/ggml-backend-buffer.cpp (1 hunks)
  • ggml/src/ggml-remotingfrontend/ggml-backend-device.cpp (1 hunks)
  • ggml/src/ggml-remotingfrontend/ggml-backend-host-buffer-type.cpp (1 hunks)
  • ggml/src/ggml-remotingfrontend/ggml-backend-reg.cpp (1 hunks)
  • ggml/src/ggml-remotingfrontend/ggml-backend.cpp (1 hunks)
  • ggml/src/ggml-remotingfrontend/ggml-metal-remoting.cpp (1 hunks)
  • ggml/src/ggml-remotingfrontend/ggml-metal-remoting.h (1 hunks)
  • ggml/src/ggml-remotingfrontend/ggml-remoting-frontend.cpp (1 hunks)
  • ggml/src/ggml-remotingfrontend/ggml-remoting.h (1 hunks)
  • ggml/src/ggml-remotingfrontend/include/drm-uapi/drm.h (1 hunks)
  • ggml/src/ggml-remotingfrontend/include/drm-uapi/virtgpu_drm.h (1 hunks)
  • ggml/src/ggml-remotingfrontend/include/venus_hw.h (1 hunks)
  • ggml/src/ggml-remotingfrontend/venus_cs_ggml-rpc-front.cpp (1 hunks)
  • ggml/src/ggml-remotingfrontend/virtgpu-forward-backend.cpp (1 hunks)
  • ggml/src/ggml-remotingfrontend/virtgpu-forward-buffer-type.cpp (1 hunks)
  • ggml/src/ggml-remotingfrontend/virtgpu-forward-buffer.cpp (1 hunks)
  • ggml/src/ggml-remotingfrontend/virtgpu-forward-device.cpp (1 hunks)
  • ggml/src/ggml-remotingfrontend/virtgpu-forward-impl.h (1 hunks)
  • ggml/src/ggml-remotingfrontend/virtgpu-forward-metal.cpp (1 hunks)
  • ggml/src/ggml-remotingfrontend/virtgpu-forward.h (1 hunks)
  • ggml/src/ggml-remotingfrontend/virtgpu-shm.cpp (1 hunks)
  • ggml/src/ggml-remotingfrontend/virtgpu-shm.h (1 hunks)
  • ggml/src/ggml-remotingfrontend/virtgpu-utils.cpp (1 hunks)
  • ggml/src/ggml-remotingfrontend/virtgpu-utils.h (1 hunks)
  • ggml/src/ggml-remotingfrontend/virtgpu.cpp (1 hunks)
  • ggml/src/ggml-remotingfrontend/virtgpu.h (1 hunks)
  • podman_compile.sh (1 hunks)
  • prepare.backend.sh (1 hunks)
  • prepare.remoting.sh (1 hunks)
  • prepare.sh (1 hunks)
  • prepare.vulkan.sh (1 hunks)
  • run.remoting.sh (1 hunks)
  • run.sh (1 hunks)
  • run.vulkan.sh (1 hunks)
  • src/llama-context.cpp (2 hunks)
  • src/llama-kv-cache-unified.cpp (1 hunks)
  • src/llama-kv-cache.cpp (1 hunks)
  • src/llama-model-loader.cpp (4 hunks)
  • src/llama-model.cpp (2 hunks)
  • src/llama-vocab.cpp (3 hunks)
  • tools/run/run.cpp (3 hunks)
💤 Files with no reviewable changes (27)
  • .github/ISSUE_TEMPLATE/config.yml
  • .github/pull_request_template.md
  • .github/ISSUE_TEMPLATE/011-bug-results.yml
  • .github/workflows/editorconfig.yml
  • .github/ISSUE_TEMPLATE/010-bug-compilation.yml
  • .github/ISSUE_TEMPLATE/019-bug-misc.yml
  • .github/workflows/python-lint.yml
  • .github/workflows/labeler.yml
  • .github/ISSUE_TEMPLATE/040-refactor.yml
  • .github/workflows/python-check-requirements.yml
  • .github/ISSUE_TEMPLATE/020-enhancement.yml
  • .github/workflows/winget.yml
  • .github/workflows/python-type-check.yml
  • .github/actions/get-tag-name/action.yml
  • .github/workflows/gguf-publish.yml
  • .github/workflows/close-issue.yml
  • .github/actions/windows-setup-curl/action.yml
  • .github/labeler.yml
  • .github/ISSUE_TEMPLATE/030-research.yml
  • .github/workflows/build-linux-cross.yml
  • .github/actions/windows-setup-cuda/action.yml
  • .github/workflows/bench.yml.disabled
  • .github/workflows/docker.yml
  • .github/workflows/server.yml
  • .github/workflows/build.yml
  • build-xcframework.sh
  • .github/workflows/release.yml
🧰 Additional context used
🧬 Code Graph Analysis (16)
ggml/include/ggml-remoting-frontend.h (1)
ggml/src/ggml-remotingfrontend/ggml-backend-reg.cpp (2)
  • ggml_backend_remoting_frontend_reg (126-153)
  • ggml_backend_remoting_frontend_reg (126-126)
ggml/src/ggml-backend-reg.cpp (2)
ggml/src/ggml-remotingfrontend/ggml-backend-reg.cpp (2)
  • ggml_backend_remoting_frontend_reg (126-153)
  • ggml_backend_remoting_frontend_reg (126-126)
ggml/include/ggml-remoting-frontend.h (1)
  • ggml_backend_remoting_frontend_reg (12-12)
src/llama-model.cpp (3)
src/llama-hparams.h (8)
  • il (171-171)
  • il (173-173)
  • il (175-175)
  • il (177-177)
  • il (180-180)
  • il (183-183)
  • il (193-193)
  • il (195-195)
src/llama-model.h (2)
  • il (396-396)
  • il (399-399)
ggml/src/ggml-backend-impl.h (15)
  • dev (139-139)
  • dev (142-142)
  • dev (145-145)
  • dev (148-148)
  • dev (151-151)
  • dev (154-154)
  • dev (157-157)
  • dev (160-160)
  • dev (163-163)
  • dev (166-166)
  • dev (169-169)
  • dev (173-173)
  • dev (176-176)
  • dev (177-177)
  • dev (178-178)
ggml/src/ggml-remotingbackend/backend-dispatched-metal.cpp (3)
ggml/src/ggml-remotingbackend/backend-internal.h (1)
  • ggml_backend_metal_get_device_context_fct (31-34)
ggml/src/ggml-remotingbackend/backend-dispatched.h (1)
  • enc (16-16)
ggml/src/ggml-remotingbackend/shared/apir_backend.h (1)
  • ctx (67-67)
ggml/src/ggml-remotingfrontend/virtgpu-forward-backend.cpp (3)
ggml/src/ggml-remotingbackend/shared/venus_cs_ggml.h (8)
  • vn_serialize_ggml_cgraph (140-145)
  • vn_serialize_ggml_cgraph (141-141)
  • vn_encode_virtgpu_shmem_res_id (128-131)
  • vn_encode_virtgpu_shmem_res_id (129-129)
  • vn_encode_cgraph_data (147-152)
  • vn_encode_cgraph_data (148-148)
  • vn_decode_ggml_status (121-124)
  • vn_decode_ggml_status (122-122)
ggml/src/ggml-remotingfrontend/virtgpu-utils.h (4)
  • WARNING (48-57)
  • WARNING (49-49)
  • FATAL (59-69)
  • FATAL (60-60)
ggml/src/ggml-remotingbackend/shared/venus_cs.h (4)
  • vn_encode_size_t (387-392)
  • vn_encode_size_t (388-388)
  • vn_cs_new_encoder (37-46)
  • vn_cs_new_encoder (38-38)
ggml/src/ggml-remotingbackend/backend-dispatched-buffer-type.cpp (4)
ggml/src/ggml-remotingbackend/backend-dispatched.h (1)
  • enc (16-16)
ggml/src/ggml-remotingbackend/shared/venus_cs_ggml.h (4)
  • vn_decode_ggml_buffer_type (75-82)
  • vn_decode_ggml_buffer_type (76-76)
  • vn_encode_ggml_buffer (98-102)
  • vn_encode_ggml_buffer (99-99)
ggml/src/ggml-remotingbackend/shared/venus_cs.h (10)
  • vn_encode_array_size (274-278)
  • vn_encode_array_size (275-275)
  • vn_encode_char_array (459-464)
  • vn_encode_char_array (460-460)
  • vn_encode_size_t (387-392)
  • vn_encode_size_t (388-388)
  • vn_encode_bool_t (502-506)
  • vn_encode_bool_t (503-503)
  • vn_decode_size_t (394-400)
  • vn_decode_size_t (395-395)
ggml/src/ggml-remotingbackend/shared/venus_cs_ggml-rpc.cpp (2)
  • track_backend_buffer (12-15)
  • track_backend_buffer (13-13)
ggml/src/ggml-remotingbackend/backend-dispatched-backend.cpp (5)
ggml/src/ggml-remotingbackend/backend-dispatched.h (1)
  • enc (16-16)
ggml/src/ggml-remotingbackend/shared/apir_backend.h (5)
  • ctx (67-67)
  • start_timer (87-91)
  • start_timer (87-87)
  • stop_timer (93-100)
  • stop_timer (93-93)
ggml/src/ggml-remotingbackend/shared/venus_cs_ggml.h (6)
  • vn_decode_virtgpu_shmem_res_id (133-136)
  • vn_decode_virtgpu_shmem_res_id (134-134)
  • vn_decode_ggml_cgraph (154-167)
  • vn_decode_ggml_cgraph (155-155)
  • vn_encode_ggml_status (116-119)
  • vn_encode_ggml_status (117-117)
ggml/src/ggml-remotingfrontend/virtgpu-utils.h (2)
  • FATAL (59-69)
  • FATAL (60-60)
ggml/src/ggml-remotingbackend/shared/venus_cs.h (4)
  • vn_decode_size_t (394-400)
  • vn_decode_size_t (395-395)
  • vn_cs_new_decoder (27-35)
  • vn_cs_new_decoder (28-28)
ggml/src/ggml-remotingfrontend/ggml-metal-remoting.cpp (2)
ggml/src/ggml-remotingfrontend/virtgpu-forward-metal.cpp (2)
  • apir_metal_get_device_context (3-19)
  • apir_metal_get_device_context (4-4)
ggml/src/ggml-remotingfrontend/virtgpu-forward.h (1)
  • apir_metal_get_device_context (50-50)
ggml/src/ggml-remotingfrontend/ggml-backend-host-buffer-type.cpp (4)
ggml/src/ggml-remotingfrontend/virtgpu-utils.h (8)
  • WARNING (48-57)
  • WARNING (49-49)
  • FATAL (59-69)
  • FATAL (60-60)
  • INFO (35-42)
  • INFO (36-36)
  • INFO (44-45)
  • INFO (45-45)
ggml/src/ggml-remotingfrontend/virtgpu-shm.cpp (2)
  • virtgpu_shmem_destroy (71-77)
  • virtgpu_shmem_destroy (72-73)
ggml/src/ggml-remotingfrontend/virtgpu-forward.h (1)
  • apir_device_buffer_from_ptr (22-24)
ggml/src/ggml-remotingfrontend/virtgpu-forward-device.cpp (2)
  • apir_device_buffer_from_ptr (195-227)
  • apir_device_buffer_from_ptr (196-198)
tools/run/run.cpp (2)
include/llama.h (1)
  • llama_batch_get_one (913-915)
src/llama-batch.cpp (2)
  • llama_batch_get_one (576-588)
  • llama_batch_get_one (576-578)
ggml/src/ggml-remotingfrontend/venus_cs_ggml-rpc-front.cpp (2)
ggml/src/ggml-remotingfrontend/virtgpu-utils.h (2)
  • FATAL (59-69)
  • FATAL (60-60)
ggml/src/ggml-remotingbackend/shared/venus_cs_ggml-rpc.cpp (4)
  • add_tensor (80-94)
  • add_tensor (81-81)
  • serialize_tensor (17-46)
  • serialize_tensor (18-18)
ggml/src/ggml-remotingfrontend/virtgpu-utils.cpp (1)
ggml/src/ggml-remotingfrontend/virtgpu-utils.h (4)
  • INFO (35-42)
  • INFO (36-36)
  • INFO (44-45)
  • INFO (45-45)
ggml/src/ggml-remotingfrontend/virtgpu-forward-device.cpp (2)
ggml/src/ggml-remotingbackend/shared/venus_cs.h (20)
  • vn_decode_int32_t (235-239)
  • vn_decode_int32_t (236-236)
  • vn_decode_array_size_unchecked (293-299)
  • vn_decode_array_size_unchecked (294-294)
  • vn_cs_decoder_alloc_array (493-498)
  • vn_cs_decoder_alloc_array (494-494)
  • vn_decode_char_array (466-476)
  • vn_decode_char_array (467-467)
  • vn_decode_uint32_t (348-352)
  • vn_decode_uint32_t (349-349)
  • vn_decode_size_t (394-400)
  • vn_decode_size_t (395-395)
  • vn_decode_bool_t (508-512)
  • vn_decode_bool_t (509-509)
  • vn_decode_apir_buffer_type_host_handle_t (522-526)
  • vn_decode_apir_buffer_type_host_handle_t (523-523)
  • vn_encode_size_t (387-392)
  • vn_encode_size_t (388-388)
  • vn_decode_apir_buffer_host_handle_t (536-540)
  • vn_decode_apir_buffer_host_handle_t (537-537)
ggml/src/ggml-remotingbackend/shared/venus_cs_ggml.h (4)
  • vn_encode_ggml_tensor_inline (174-210)
  • vn_encode_ggml_tensor_inline (175-175)
  • vn_decode_apir_buffer_type_host_handle (84-91)
  • vn_decode_apir_buffer_type_host_handle (85-85)
ggml/src/ggml-remotingfrontend/virtgpu-forward-buffer-type.cpp (2)
ggml/src/ggml-remotingbackend/shared/venus_cs_ggml.h (2)
  • vn_encode_ggml_buffer_type (69-73)
  • vn_encode_ggml_buffer_type (70-70)
ggml/src/ggml-remotingbackend/shared/venus_cs.h (14)
  • vn_decode_array_size_unchecked (293-299)
  • vn_decode_array_size_unchecked (294-294)
  • vn_cs_decoder_alloc_array (493-498)
  • vn_cs_decoder_alloc_array (494-494)
  • vn_decode_char_array (466-476)
  • vn_decode_char_array (467-467)
  • vn_decode_size_t (394-400)
  • vn_decode_size_t (395-395)
  • vn_decode_bool_t (508-512)
  • vn_decode_bool_t (509-509)
  • vn_encode_size_t (387-392)
  • vn_encode_size_t (388-388)
  • vn_decode_apir_buffer_host_handle_t (536-540)
  • vn_decode_apir_buffer_host_handle_t (537-537)
ggml/src/ggml-remotingfrontend/virtgpu-forward-buffer.cpp (4)
ggml/src/ggml-remotingbackend/shared/venus_cs.h (8)
  • vn_encode_apir_buffer_host_handle_t (530-534)
  • vn_encode_apir_buffer_host_handle_t (531-531)
  • vn_decode_uintptr_t (550-554)
  • vn_decode_uintptr_t (551-551)
  • vn_encode_size_t (387-392)
  • vn_encode_size_t (388-388)
  • vn_encode_uint8_t (150-154)
  • vn_encode_uint8_t (151-151)
ggml/src/ggml-remotingbackend/shared/venus_cs_ggml.h (4)
  • vn_encode_ggml_tensor (39-44)
  • vn_encode_ggml_tensor (40-40)
  • vn_encode_virtgpu_shmem_res_id (128-131)
  • vn_encode_virtgpu_shmem_res_id (129-129)
ggml/src/ggml-remotingfrontend/virtgpu-shm.h (2)
  • virtgpu_shmem_create (32-32)
  • virtgpu_shmem_destroy (33-33)
ggml/src/ggml-remotingfrontend/virtgpu-shm.cpp (4)
  • virtgpu_shmem_create (79-111)
  • virtgpu_shmem_create (80-80)
  • virtgpu_shmem_destroy (71-77)
  • virtgpu_shmem_destroy (72-73)
ggml/src/ggml-remotingbackend/shared/venus_cs_ggml-rpc.cpp (3)
ggml/src/ggml-remotingfrontend/venus_cs_ggml-rpc-front.cpp (4)
  • add_tensor (50-64)
  • add_tensor (51-51)
  • serialize_tensor (12-48)
  • serialize_tensor (13-13)
ggml/src/ggml-remotingbackend/shared/venus_cs_ggml-rpc.h (4)
  • add_tensor (36-36)
  • serialize_tensor (26-26)
  • deserialize_tensor (38-38)
  • create_node (40-43)
ggml/src/ggml-remotingbackend/venus_cs_ggml-rpc-back.cpp (4)
  • deserialize_tensor (33-68)
  • deserialize_tensor (34-34)
  • create_node (70-93)
  • create_node (71-74)
🪛 Shellcheck (0.10.0)
prepare.sh

[error] 1-1: Tips depend on target shell and yours is unknown. Add a shebang or a 'shell' directive.

(SC2148)

run.sh

[error] 1-1: Tips depend on target shell and yours is unknown. Add a shebang or a 'shell' directive.

(SC2148)

prepare.remoting.sh

[error] 1-1: Tips depend on target shell and yours is unknown. Add a shebang or a 'shell' directive.

(SC2148)

prepare.vulkan.sh

[error] 1-1: Tips depend on target shell and yours is unknown. Add a shebang or a 'shell' directive.

(SC2148)

prepare.backend.sh

[error] 1-1: Tips depend on target shell and yours is unknown. Add a shebang or a 'shell' directive.

(SC2148)

build.backend.sh

[error] 1-1: Tips depend on target shell and yours is unknown. Add a shebang or a 'shell' directive.

(SC2148)


[warning] 13-13: Declare and assign separately to avoid masking return values.

(SC2155)

🔇 Additional comments (44)
run.vulkan.sh (1)

21-24: Prefix not quoted – word-splitting risk

$prefix ../build.vulkan/bin/llama-run … will misbehave if $prefix contains spaces (gdb case).
Wrap in eval or an array, e.g.:

if [[ -n "${prefix}" ]]; then
  eval "${prefix} ../build.vulkan/bin/llama-run --verbose ~/models/llama3.2 \"say nothing\""
else
  ../build.vulkan/bin/llama-run --verbose ~/models/llama3.2 "say nothing"
fi
ggml/src/CMakeLists.txt (1)

357-358: Verify CMake option names match the new back-ends

ggml_add_backend(RemotingFrontend) expects an option GGML_REMOTINGFRONTEND, and likewise for RemotingBackend.
Ensure these cache variables are declared (e.g. via option(GGML_REMOTINGFRONTEND …) in a parent CMake file) or the backend will silently be skipped.

ggml/CMakeLists.txt (1)

183-184: Options are declared but never consumed
GGML_REMOTING_FRONTEND / GGML_REMOTING_BACKEND are introduced here but nothing in the root CMake lists or src/ conditionally adds sources, include-dirs, or libs based on them.
Result: toggling the option has zero effect and silently misleads packagers.

Consider:

  1. Wrapping the add_subdirectory(src/ggml-remotingfrontend …) / backend equivalents with these flags.
  2. Failing fast if the option is ON but required dependencies (libdrm, virtgpu headers, etc.) are missing.
src/llama-context.cpp (1)

107-117: LGTM! Logging reduction aligns with PR objectives.

The commented logging block reduces noise during model initialization as intended by the PR. The informational output about context parameters (n_ctx, n_batch, etc.) can be verbose during startup, and this change improves the user experience by reducing clutter.

CMakePresets.json (1)

33-34: LGTM! New remoting presets follow established conventions.

The new remoting_frontend and remoting_backend presets are well-structured and consistent with existing hidden presets in the file. They properly enable the corresponding CMake cache variables (GGML_REMOTING_FRONTEND and GGML_REMOTING_BACKEND) for the new VirtGPU-based remoting functionality mentioned in the PR.

src/llama-vocab.cpp (1)

2001-2002: LGTM - Debug logging appropriately suppressed.

The commented debug log aligns with the PR's objective to reduce noise during model loading. This debug information about control tokens not marked as EOG is non-critical for normal operation.

ggml/src/ggml-remotingfrontend/ggml-remoting-frontend.cpp (1)

1-27: No changes needed for the WIP remoting frontend skeleton

This file is intentionally a placeholder in this “WIP” PR. It defines only the necessary includes, an extern declaration, and a mutex-bearing struct so that you can layer in the real remoting logic later.

• Before merging, please:
– Add stub functions or TODO comments in ggml-remoting-frontend.cpp to indicate where the core remoting API will be implemented.
– Ensure that you flesh out the extern’d ggml_backend_remoting_get_device_count() and any other callbacks referenced in ggml-remoting.h.

ggml/src/ggml-remotingbackend/CMakeLists.txt (2)

1-3: CMake version and policy settings are appropriate.

The minimum version 3.19 and CMP0114 policy are reasonable for a modern CMake configuration supporting the backend library functionality.


6-19: All referenced source and header files are present
Verified that every .cpp and .h file listed in ggml/src/ggml-remotingbackend/CMakeLists.txt (including those under shared/) exists in the repository. No missing files detected.

ggml/include/ggml-remoting-frontend.h (1)

1-16: Well-designed header interface.

The header follows GGML conventions with proper C++ compatibility, clean API design, and appropriate naming. The single registration function approach is consistent with other backend implementations.

ggml/src/ggml-backend-reg.cpp (3)

48-50: LGTM! Backend registration follows established patterns.

The conditional inclusion and registration of the remoting frontend backend is consistent with how other backends (CUDA, Metal, Vulkan, etc.) are integrated into the system.


184-186: LGTM! Backend registration in constructor follows the established pattern.

The registration call follows the same pattern as other backends and is properly guarded by the conditional compilation flag.


591-591: LGTM! Dynamic loading integration is correctly placed.

The addition of "remoting_frontend" to the dynamic loading function is in the appropriate location and follows the naming convention of other backends.

ggml/src/ggml-remotingfrontend/virtgpu-forward-impl.h (1)

11-34: Well-designed macros with proper error handling.

The remote call macros follow good practices:

  • Use do-while(0) pattern for safe expansion
  • Consistent error checking with descriptive messages
  • Clear separation of remote call lifecycle phases

The implementation provides a clean abstraction for remote procedure calls.
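
For readers unfamiliar with the idiom, here is a minimal, self-contained sketch of a do-while(0) call-checking macro in the same spirit (the macro and helper names are hypothetical, not the actual virtgpu-forward-impl.h definitions):

#include <cstdio>

// The do { ... } while (0) wrapper turns the multi-statement expansion into a
// single statement, so the macro nests safely inside un-braced if/else;
// #expr stringifies the call for a descriptive error message.
#define REMOTE_CALL_CHECK(expr)                                   \
  do {                                                            \
    int rc_ = (expr);                                             \
    if (rc_ != 0) {                                               \
      std::fprintf(stderr, "remote call failed: %s (rc=%d)\n",    \
                   #expr, rc_);                                   \
      return rc_;                                                 \
    }                                                             \
  } while (0)

static int fake_remote_call(int rc) { return rc; }  // stand-in for a real call

int run_calls(bool first) {
  if (first)
    REMOTE_CALL_CHECK(fake_remote_call(0));  // expands as one statement
  else
    REMOTE_CALL_CHECK(fake_remote_call(1));
  return 0;
}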

Makefile (2)

719-727: LGTM! Makefile changes follow established patterns.

The conditional compilation support for the remoting frontend and backend follows the same pattern used for other optional components like CUDA, Metal, and Vulkan. The preprocessor flags and object file additions are correctly implemented.


768-772: LGTM! Compilation rules are properly defined.

The explicit compilation rules for the remoting frontend and backend object files are correctly specified and use the standard CXXFLAGS, consistent with other C++ compilation targets in the Makefile.

podman_compile.sh (2)

22-24: Verify podman machine environment before using SSH.

The script assumes it's running in a podman machine environment on macOS without verification. This could fail on Linux or when podman machine isn't running.

Consider adding a check for the podman machine environment:

 POD_NAME=mac_ai_compiling
-podman machine ssh podman rm $POD_NAME --force
+if podman machine list --format "{{.Name}}" 2>/dev/null | grep -q .; then
+    podman machine ssh podman rm $POD_NAME --force 2>/dev/null || true
+else
+    podman rm $POD_NAME --force 2>/dev/null || true
+fi

1-1: Add shebang to specify the shell interpreter.

The script is missing a shebang line, which can cause portability issues.

Apply this diff to add the shebang:

+#!/bin/bash
 #! /bin/bash

Likely an incorrect or invalid review comment — the script already starts with #! /bin/bash (whitespace after #! is accepted by the kernel), so the suggestion would only duplicate the existing shebang.

ggml/src/ggml-remotingfrontend/virtgpu-shm.h (1)

24-24: list_head definition is present

The struct list_head is defined in virtgpu-utils.h, which is already included by virtgpu-shm.h. No additional includes are required.

ggml/src/ggml-remotingbackend/backend-dispatched-backend.cpp (1)

36-41: Remove incorrect undefined variable warning

The identifiers dev and bck are declared as extern globals in backend-internal.h and brought into scope by your includes. Compilation won’t fail due to undefined symbols here.

• In ggml/src/ggml-remotingbackend/backend-internal.h:

extern ggml_backend_dev_t dev;
extern ggml_backend_t     bck;

• These headers are included at the top of backend-dispatched-backend.cpp, so dev and bck are defined.

You can ignore the “undefined variables” suggestion—no change is needed.

Likely an incorrect or invalid review comment.

ggml/src/ggml-remotingbackend/backend-dispatched-buffer-type.cpp (1)

1-81: Well-structured buffer type dispatch implementation

The functions follow a consistent pattern and properly handle encoding/decoding. The null check for the allocated buffer at line 76-78 is particularly good practice.

ggml/src/ggml-remotingfrontend/ggml-metal-remoting.cpp (1)

20-198: Well-structured Metal operation support checking

The ggml_metal_supports_op function provides comprehensive coverage of GGML operations with appropriate device capability checks. The TODO comments clearly indicate areas for future optimization.

ggml/src/ggml-remotingfrontend/ggml-backend.cpp (3)

56-70: Incorrect assumption: function defined & default new throws

The function ggml_backend_remoting_frontend_reg() is declared in ggml/include/ggml-remoting-frontend.h and defined in ggml/src/ggml-remotingfrontend/ggml-backend-reg.cpp. Also, the default C++ new operator throws std::bad_alloc on failure rather than returning nullptr, so a null check after new ggml_backend{…} will never fire. If you need non-throwing allocation, use new (std::nothrow) and check for nullptr, or catch std::bad_alloc at the call site.

Likely an incorrect or invalid review comment.
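
For reference, a minimal sketch of the non-throwing allocation pattern mentioned above (backend_stub is a placeholder type, not the actual ggml_backend struct):

#include <cstdio>
#include <new>  // std::nothrow

struct backend_stub { int id; };  // placeholder for the real struct

backend_stub * make_backend() {
  // plain new throws std::bad_alloc on failure, so a null check after it
  // never fires; the nothrow form returns nullptr instead
  backend_stub * b = new (std::nothrow) backend_stub{42};
  if (b == nullptr) {
    std::fprintf(stderr, "allocation failed\n");
  }
  return b;
}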


11-15: Add nullptr check before delete operation.

The function directly deletes the backend pointer without checking if it's null, which could lead to undefined behavior.

Apply this fix:

 static void ggml_backend_remoting_free(ggml_backend_t backend) {
   IMPLEMENTED;

+  if (backend == nullptr) {
+    return;
+  }
   delete backend;
 }

Likely an incorrect or invalid review comment — delete on a null pointer is a well-defined no-op in C++, so the added check would be redundant.


19-31: All symbols are defined and included correctly.

  • DEV_TO_GPU is defined in ggml-remoting.h.
  • start_timer/stop_timer come from apir_backend.h (pulled in via virtgpu-forward.h → virtgpu.h).
  • apir_backend_graph_compute is declared in virtgpu-forward.h and implemented in virtgpu-forward-backend.cpp.

No missing includes or dependencies—no changes needed.

Likely an incorrect or invalid review comment.

ggml/src/ggml-remotingfrontend/include/venus_hw.h (1)

1-75: Well-structured header file with clear documentation.

The Venus capability structure is properly defined with comprehensive comments explaining each field's purpose.

ggml/src/ggml-remotingbackend/backend.cpp (1)

22-44: Incorrect: dev and timer variables are defined via extern in included headers.

The deinit function includes

  • backend-internal.h (which declares extern ggml_backend_dev_t dev;)
  • shared/apir_backend.h (which declares extern struct timer_data graph_compute_timer;, get_tensor_timer;, and set_tensor_timer;)

Therefore, dev and the timer objects are properly declared and there’s no need to redeclare or pass them as parameters.

Likely an incorrect or invalid review comment.

ggml/src/ggml-remotingbackend/backend-internal.h (1)

22-29: Ignore missing forward declaration of virgl_apir_context
In C++, the use of struct virgl_apir_context in the function parameter list acts as an implicit forward declaration (per the elaborated‐type‐specifier rules), and any translation unit that needs the full definition already includes shared/apir_backend.h. No changes are required here.

Likely an incorrect or invalid review comment.

ggml/src/ggml-remotingfrontend/ggml-backend-reg.cpp (1)

88-90: Function get_metal_dev_context is correctly defined
The symbol is declared in ggml/src/ggml-remotingfrontend/ggml-metal-remoting.h (line 14) and implemented in ggml/src/ggml-remotingfrontend/ggml-metal-remoting.cpp. The original comment can be dismissed.

Likely an incorrect or invalid review comment.

ggml/src/ggml-remotingfrontend/ggml-backend-host-buffer-type.cpp (1)

37-37: No changes needed: correct usage of virtgpu_shmem_destroy
The call passes the inner struct virtgpu_shmem* via shmem->shmem, which matches the signature. Other callers (in virtgpu-forward-backend.cpp and virtgpu-forward-buffer.cpp) use the same pattern.

ggml/src/ggml-remotingfrontend/virtgpu-forward-device.cpp (1)

10-12: Remove duplicate UNUSED macro.

The ctx parameter is marked as UNUSED twice.

-  UNUSED(ctx);
   UNUSED(ctx);

Likely an incorrect or invalid review comment.

ggml/src/ggml-remotingbackend/backend-dispatched-device.cpp (1)

14-15: Variables reg and dev are properly defined as globals
Our search confirms:

  • ggml_backend_reg_t reg and ggml_backend_dev_t dev are defined in backend-dispatched.cpp
  • Both are declared extern in backend-internal.h and thus visible in backend-dispatched-device.cpp

No changes needed.

ggml/src/ggml-remotingfrontend/ggml-backend-device.cpp (1)

163-185: No memory leak—buffer context is freed on destruction

The ggml_backend_remoting_buffer_from_ptr_interface sets .free_buffer = ggml_backend_remoting_buffer_free_buffer, and that implementation calls free(context) and nulls out buffer->context. The allocated context is properly deallocated when the buffer is destroyed.

ggml/src/ggml-remotingbackend/backend-dispatched.h (1)

90-121: Make dispatch table const for safety

The dispatch table should be const to prevent accidental modification at runtime.

-static const backend_dispatch_t apir_backend_dispatch_table[APIR_BACKEND_DISPATCH_TABLE_COUNT] = {
+static const backend_dispatch_t const apir_backend_dispatch_table[APIR_BACKEND_DISPATCH_TABLE_COUNT] = {

Likely an incorrect or invalid review comment — the table is already declared const, and the suggested diff adds a duplicate const that compilers reject.

ggml/src/ggml-remotingfrontend/include/drm-uapi/drm.h (1)

36-1409: LGTM!

The DRM UAPI definitions and structures are correctly implemented following the standard kernel interface.

ggml/src/ggml-remotingbackend/shared/venus_cs.h (1)

9-555: Consider the serialization framework design and implementation.

The serialization implementation is well-structured with proper bounds checking and error handling. The use of FATAL for unrecoverable errors is appropriate for a low-level serialization library.

ggml/src/ggml-remotingfrontend/virtgpu-forward.h (1)

1-51: LGTM!

The virtual GPU interface declarations are well-organized and follow consistent naming conventions. The function signatures appropriately handle device management, buffer operations, and backend functionality.

src/llama-kv-cache.cpp (5)

126-450: Cache operations are well-implemented.

The sequence management functions correctly handle both recurrent and transformer-like models with appropriate validation and state updates.


504-731: Slot finding algorithm is robust.

The implementation correctly handles different model types with proper validation and debug checks. The recurrent model handling with contiguous slot requirements is well-implemented.


770-905: Defragmentation implementation is sound.

The algorithm correctly identifies and fills holes in the cache with proper move tracking and bounds checking. The assertion at line 839 helps ensure algorithm correctness.


907-1281: Serialization implementation is comprehensive.

The state persistence functions correctly handle metadata and tensor data with proper validation during restoration. The transposed value tensor handling adds complexity but is implemented correctly.


1287-1382: KV cache view implementation provides good introspection.

The view functions correctly create snapshots of the cache state with proper memory management and type checking.

ggml/src/ggml-remotingbackend/shared/venus_cs_ggml.h (1)

1-237: GGML serialization utilities are well-designed.

The serialization approach with intermediate rpc_tensor representation and proper validation of tensor constraints provides a robust foundation for the remoting backend.
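
To show the intermediate-representation idea in isolation, a minimal sketch (the field names and layout are hypothetical, not the actual rpc_tensor):

#include <cstdint>
#include <cstring>

// A flat POD mirror of the tensor metadata the remote side needs: plain
// scalars only, so it can be copied byte-for-byte across the guest/host
// boundary without pointer fixups.
struct rpc_tensor_example {
    uint64_t id;        // stable handle instead of a raw pointer
    uint32_t type;      // ggml type enum, sent as a plain integer
    uint64_t ne[4];     // dimensions
    uint64_t nb[4];     // strides
    uint64_t data_off;  // offset into a shared buffer, not an address
};

// Bounds-checked serialization into a caller-provided buffer; returns the
// number of bytes written, or 0 if the buffer is too small.
size_t serialize_example(const rpc_tensor_example & t, void * buf, size_t cap) {
    if (cap < sizeof(t)) return 0;
    std::memcpy(buf, &t, sizeof(t));
    return sizeof(t);
}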

ggml/src/ggml-remotingfrontend/include/drm-uapi/virtgpu_drm.h (1)

1-276: VirtGPU DRM header is up-to-date

The local copy of virtgpu_drm.h matches the latest Linux v6.16-rc1 UAPI header (June 12, 2025): all newly added IOCTLs—RESOURCE_INFO, TRANSFER_FROM_HOST, TRANSFER_TO_HOST, WAIT, GET_CAPS, RESOURCE_CREATE_BLOB, and CONTEXT_INIT—are present. No further updates are needed.

Comment on lines 12 to 26
ICD_DIR=/Users/kevinpouget/.local/share/vulkan/icd.d

USE_WORK_MESA=1
if [[ "$USE_WORK_MESA" == 1 ]]; then
export VK_ICD_FILENAMES=$ICD_DIR/virtio_icd.aarch64.json
else
export VK_ICD_FILENAMES=$ICD_DIR/virtio_icd.good.aarch64.json
fi

🛠️ Refactor suggestion

Hard-coded absolute user path reduces portability

ICD_DIR=/Users/kevinpouget/... will break on every other machine.
Expose this as an env var with a sensible default:

ICD_DIR=${ICD_DIR:-"$HOME/.local/share/vulkan/icd.d"}

run.vulkan.sh Outdated
prefix=""
fi

rm -f /usr/lib64/libvulkan_virtio.so

⚠️ Potential issue

rm -f /usr/lib64/libvulkan_virtio.so is dangerous

Blindly deleting a system library requires sudo and can break other Vulkan apps.
Either guard it with a prompt, check privileges, or document why it is safe.


Comment on lines +1 to +8
cmake -S . -B ../build.remoting-frontend \
-DGGML_REMOTINGFRONTEND=ON \
-DGGML_CPU_ARM_ARCH=native \
-DGGML_NATIVE=OFF \
-DGGML_OPENMP=OFF \
-DLLAMA_CURL=OFF \
-DCMAKE_BUILD_TYPE=Debug \
"$@"

🛠️ Refactor suggestion

⚠️ Potential issue

Add shebang & fix CMake option typo / portability

  1. ShellCheck warns because the script lacks a shebang.
  2. CMake option appears to be GGML_REMOTING_FRONTEND (with underscore) elsewhere; current GGML_REMOTINGFRONTEND will be ignored.
  3. Hard-coding -DGGML_CPU_ARM_ARCH=native breaks x86 hosts.
+#!/usr/bin/env bash
+set -euo pipefail
+
 cmake -S . -B ../build.remoting-frontend \
-      -DGGML_REMOTINGFRONTEND=ON \
+      -DGGML_REMOTING_FRONTEND=ON \
+      ${CMAKE_ARM_OPT:--DGGML_CPU_ARM_ARCH=native} \
       -DGGML_NATIVE=OFF \
       -DGGML_OPENMP=OFF \
       -DLLAMA_CURL=OFF \
       -DCMAKE_BUILD_TYPE=Debug \
       "$@"

Consider auto-detecting host arch (uname -m) to decide whether to set the ARM flag.


}

void llama_model::print_info() const {
return;

🛠️ Refactor suggestion

Inconsistent approach to suppressing output - consider a more flexible solution.

Adding an early return completely disables the print_info method, which may break existing functionality that depends on this output. This approach is also inconsistent with the commented-out debug logs in the previous hunk.

Consider implementing a more flexible approach:

  • Add a parameter to control whether to print info
  • Use a global flag or environment variable to control output
  • Implement log level filtering instead of completely disabling the method
-void llama_model::print_info() const {
-    return;
+void llama_model::print_info(bool verbose = true) const {
+    if (!verbose) return;

Or with environment variable control:

-void llama_model::print_info() const {
-    return;
+void llama_model::print_info() const {
+    if (std::getenv("LLAMA_QUIET")) return;


@@ -0,0 +1,11 @@
cmake -S . -B ../build.remoting-backend \

🛠️ Refactor suggestion

Add a proper shebang and fail-fast guards

The script is missing a shebang, which breaks editor/tooling detection and can cause an “Exec format error” if executed directly.
Adding set -euo pipefail (or at least set -e) prevents silent build misconfigurations.

+#!/usr/bin/env bash
+set -euo pipefail
+
 cmake -S . -B ../build.remoting-backend \

Comment on lines +5163 to +6021
double ms = timer_total/1000000;
double itl = ms/timer_count;
double speed = 1/itl * 1000;

printf("METAL compute_graph: [%9.0f] ms for %lld invokations | ITL %.2f ms | throughput = %.2f t/s\n",ms, timer_count, itl, speed);

timer_start = 0;
timer_total = 1; // to avoid re-registering
timer_count = 0;
}

⚠️ Potential issue

Fix potential division by zero and improve timer state management.

The function has a division by zero risk and uses a hacky approach to prevent re-registration.

 static void show_timer(void) {
+  if (timer_count == 0) {
+    printf("METAL compute_graph: No invocations recorded\n");
+    return;
+  }
+
   double ms = timer_total/1000000;
   double itl = ms/timer_count;
   double speed = 1/itl * 1000;
 
   printf("METAL compute_graph: [%9.0f] ms for %lld invokations | ITL %.2f ms | throughput = %.2f t/s\n",ms, timer_count, itl, speed);
 
-  timer_start = 0;
-  timer_total = 1; // to avoid re-registering
-  timer_count = 0;
+  // Reset timer state
+  timer_start = 0;
+  timer_total = 0;
+  timer_count = 0;
 }

Comment on lines +6073 to +6950
GGML_BACKEND_API void
ggml_backend_metal_get_device_context(ggml_backend_dev_t dev,
bool *has_simdgroup_mm,
bool *has_simdgroup_reduction,
bool *use_bfloat) {
struct ggml_backend_metal_device_context *dev_ctx = dev->context ;

*use_bfloat = dev_ctx->use_bfloat;
*has_simdgroup_reduction = dev_ctx->has_simdgroup_reduction;
*has_simdgroup_mm = dev_ctx->has_simdgroup_mm;
}

⚠️ Potential issue

Add null pointer validation for safety.

The function should validate input pointers before dereferencing them to prevent potential crashes.

 GGML_BACKEND_API void
 ggml_backend_metal_get_device_context(ggml_backend_dev_t dev,
 				      bool *has_simdgroup_mm,
 				      bool *has_simdgroup_reduction,
 				      bool *use_bfloat) {
+  GGML_ASSERT(dev != NULL);
+  GGML_ASSERT(has_simdgroup_mm != NULL);
+  GGML_ASSERT(has_simdgroup_reduction != NULL);
+  GGML_ASSERT(use_bfloat != NULL);
+
   struct ggml_backend_metal_device_context *dev_ctx = dev->context ;
 
   *use_bfloat = dev_ctx->use_bfloat;
   *has_simdgroup_reduction = dev_ctx->has_simdgroup_reduction;
   *has_simdgroup_mm = dev_ctx->has_simdgroup_mm;
 }

Comment on lines +5154 to +6009
struct timespec ts;
clock_gettime(CLOCK_REALTIME, &ts); // Use CLOCK_MONOTONIC for elapsed time
long long timer_end = (long long)ts.tv_sec * 1000000000LL + ts.tv_nsec;

timer_total += (timer_end - timer_start);
timer_count += 1;
}

⚠️ Potential issue

Fix CLOCK_MONOTONIC usage in stop_timer.

 static inline void stop_timer(void) {
   struct timespec ts;
-  clock_gettime(CLOCK_REALTIME, &ts);  // Use CLOCK_MONOTONIC for elapsed time
+  clock_gettime(CLOCK_MONOTONIC, &ts);
   long long timer_end = (long long)ts.tv_sec * 1000000000LL + ts.tv_nsec;
 
   timer_total += (timer_end - timer_start);
   timer_count += 1;
 }

Comment on lines +991 to +1011
static inline void start_timer(void) {
struct timespec ts;
clock_gettime(CLOCK_REALTIME, &ts); // Use CLOCK_MONOTONIC for elapsed time
timer_start = (long long)ts.tv_sec * 1000000000LL + ts.tv_nsec;
}

⚠️ Potential issue

Use CLOCK_MONOTONIC instead of CLOCK_REALTIME for elapsed time measurement

The comments indicate these functions should use CLOCK_MONOTONIC for elapsed time, but the code uses CLOCK_REALTIME. CLOCK_REALTIME can be adjusted by system time changes, making it unsuitable for measuring elapsed time.

Apply this diff to fix the clock type:

-  clock_gettime(CLOCK_REALTIME, &ts);  // Use CLOCK_MONOTONIC for elapsed time
+  clock_gettime(CLOCK_MONOTONIC, &ts);

Also applies to: 998-1001


Comment on lines +987 to +1006
static long long timer_start = 0;
static long long timer_total = 0;
static long long timer_count = 0;


💡 Verification agent

🧩 Analysis chain

Add missing include for time.h

The code uses clock_gettime and struct timespec but doesn't include the required header.

Add this include at the top of the file:

 #include <signal.h>
+#include <time.h>
 
 #include <climits>

Let me verify if time.h is already included elsewhere:

Also applies to: 991-1005, 1006-1013


🏁 Script executed:

#!/bin/bash
# Check if time.h is included in the file or its includes
rg -A 5 -B 5 "include.*time\.h" tools/run/run.cpp

# Check system headers that might include time.h
rg -A 5 -B 5 "include.*<sys/time\.h>" tools/run/run.cpp

Length of output: 109


I want to confirm that clock_gettime and struct timespec are actually used and whether <ctime> is already included. Let’s check for these in tools/run/run.cpp:


🏁 Script executed:

#!/bin/bash
# Check for existing ctime include
rg -n "#include.*<ctime>" tools/run/run.cpp

# Check usage of clock_gettime
rg -n "clock_gettime" tools/run/run.cpp

# Check usage of struct timespec
rg -n "timespec" tools/run/run.cpp

Length of output: 338


Add missing <time.h> include for clock_gettime and timespec

To fix the build errors, insert the header in tools/run/run.cpp right after the existing <signal.h> include:

 #include <signal.h>
+#include <time.h>
 #include <climits>

This ensures clock_gettime and struct timespec are properly declared.


@coderabbitai coderabbitai bot left a comment

Review continued from previous batch...

Comment on lines +24 to +26
if (!shmem_data) {
FATAL("Couldn't get the shmem addr from virgl :/");
}

🛠️ Refactor suggestion

Consider graceful error handling instead of FATAL

Using FATAL will abort the entire program. Consider returning an error status that can be handled by the caller.

   const void *shmem_data = ctx->iface.get_shmem_ptr(ctx->virgl_ctx, shmem_res_id);
   if (!shmem_data) {
-    FATAL("Couldn't get the shmem addr from virgl :/");
+    ERROR("Couldn't get the shmem addr from virgl");
+    ggml_status status = GGML_STATUS_ABORTED;
+    vn_encode_ggml_status(enc, &status);
+    stop_timer(&graph_compute_timer);
+    return 1; // or appropriate error code
   }


Comment on lines +26 to +36
static void ggml_backend_remoting_buffer_memset_tensor(ggml_backend_buffer_t buffer, ggml_tensor * tensor, uint8_t value, size_t offset, size_t size) {
NOT_IMPLEMENTED;

STOP_HERE;

UNUSED(buffer);
UNUSED(tensor);
UNUSED(value);
UNUSED(offset);
UNUSED(size);
}

⚠️ Potential issue

Unimplemented functions will crash the program

Functions marked with NOT_IMPLEMENTED followed by STOP_HERE will halt execution. This is problematic for production code.

Consider returning an error instead:

 static void ggml_backend_remoting_buffer_memset_tensor(ggml_backend_buffer_t buffer, ggml_tensor * tensor, uint8_t value, size_t offset, size_t size) {
   NOT_IMPLEMENTED;
-
-  STOP_HERE;
-
-  UNUSED(buffer);
-  UNUSED(tensor);
-  UNUSED(value);
-  UNUSED(offset);
-  UNUSED(size);
+  // Log error and handle gracefully
+  ERROR("memset_tensor not implemented for remoting buffer");
+  // Could potentially fall back to get_tensor/set_tensor implementation
 }

Would you like me to provide a complete implementation or open an issue to track this?

Also applies to: 109-119


Comment on lines +112 to +119
STOP_HERE;

return true;

UNUSED(buffer);
UNUSED(src);
UNUSED(dst);
}

⚠️ Potential issue

Remove unreachable code after STOP_HERE

The return statement and UNUSED macros after STOP_HERE are unreachable.

 static bool ggml_backend_remoting_buffer_cpy_tensor(ggml_backend_buffer_t buffer, const ggml_tensor * src, ggml_tensor * dst) {
   NOT_IMPLEMENTED;
 
   STOP_HERE;
-
-  return true;
-
-  UNUSED(buffer);
-  UNUSED(src);
-  UNUSED(dst);
 }

Comment on lines 4 to 18
const struct ggml_backend_metal_device_context *get_metal_dev_context(const ggml_backend_dev_t dev) {
static struct ggml_backend_metal_device_context metal_dev_ctx;
static bool has_metal_dev_ctx = false;

if (has_metal_dev_ctx) {
return &metal_dev_ctx;
}

struct virtgpu *gpu = DEV_TO_GPU(dev);

apir_metal_get_device_context(gpu, &metal_dev_ctx);

return &metal_dev_ctx;
}

⚠️ Potential issue

Critical bug: Metal device context is never cached

The static variable has_metal_dev_ctx is never set to true, causing the remote call to be made every time instead of using the cached value.

 const struct ggml_backend_metal_device_context *get_metal_dev_context(const ggml_backend_dev_t dev) {
   static struct ggml_backend_metal_device_context metal_dev_ctx;
   static bool has_metal_dev_ctx = false;
 
   if (has_metal_dev_ctx) {
     return &metal_dev_ctx;
   }
 
   struct virtgpu *gpu = DEV_TO_GPU(dev);
 
   apir_metal_get_device_context(gpu, &metal_dev_ctx);
+  has_metal_dev_ctx = true;
 
   return &metal_dev_ctx;
 }

Comment on lines 125 to 137
if (cmd_type > APIR_BACKEND_DISPATCH_TABLE_COUNT) {
ERROR("Received an invalid dispatch index (%d > %d)\n",
cmd_type, APIR_BACKEND_DISPATCH_TABLE_COUNT);
return APIR_BACKEND_FORWARD_INDEX_INVALID;
}

⚠️ Potential issue

Fix the bounds check on cmd_type.

The comparison is off by one: an index equal to APIR_BACKEND_DISPATCH_TABLE_COUNT passes the > test yet is one past the end of the table. If cmd_type is a signed type, negative values should be rejected as well.

Apply this fix:

-  if (cmd_type > APIR_BACKEND_DISPATCH_TABLE_COUNT) {
+  if (cmd_type >= APIR_BACKEND_DISPATCH_TABLE_COUNT) {
     ERROR("Received an invalid dispatch index (%d > %d)\n",
           cmd_type, APIR_BACKEND_DISPATCH_TABLE_COUNT);
     return APIR_BACKEND_FORWARD_INDEX_INVALID;
   }


Comment on lines 88 to 95
struct timespec ts;
clock_gettime(CLOCK_REALTIME, &ts); // Use CLOCK_MONOTONIC for elapsed time
timer->start = (long long)ts.tv_sec * 1000000000LL + ts.tv_nsec;
}

⚠️ Potential issue

Use CLOCK_MONOTONIC for elapsed time measurement.

The comments indicate CLOCK_MONOTONIC should be used, but the code uses CLOCK_REALTIME. CLOCK_MONOTONIC is preferred for measuring elapsed time as it's not affected by system time adjustments.

 static inline void start_timer(struct timer_data *timer) {
   struct timespec ts;
-  clock_gettime(CLOCK_REALTIME, &ts);  // Use CLOCK_MONOTONIC for elapsed time
+  clock_gettime(CLOCK_MONOTONIC, &ts);
   timer->start = (long long)ts.tv_sec * 1000000000LL + ts.tv_nsec;
 }

 static inline void stop_timer(struct timer_data *timer) {
   struct timespec ts;
-  clock_gettime(CLOCK_REALTIME, &ts);  // Use CLOCK_MONOTONIC for elapsed time
+  clock_gettime(CLOCK_MONOTONIC, &ts);
   long long timer_end = (long long)ts.tv_sec * 1000000000LL + ts.tv_nsec;

Also applies to: 94-96
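
In C++ call sites, std::chrono::steady_clock gives the same monotonic guarantee with less ceremony; a minimal sketch:

#include <chrono>
#include <cstdio>

int main() {
    using clock = std::chrono::steady_clock;  // monotonic by specification
    const auto t0 = clock::now();
    // ... workload under measurement ...
    const auto t1 = clock::now();
    const long long ns =
        std::chrono::duration_cast<std::chrono::nanoseconds>(t1 - t0).count();
    std::printf("elapsed: %lld ns\n", ns);
    return 0;
}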


if (code == code_name) return #code_name; \
} while (0) \

APIR_BACKEND_INITIALIZE_ERROR(APIR_BACKEND_INITIALIZE_SUCCESSS);

⚠️ Potential issue

Update macro reference to use corrected constant name.

This needs to be updated to match the corrected macro name.

-  APIR_BACKEND_INITIALIZE_ERROR(APIR_BACKEND_INITIALIZE_SUCCESSS);
+  APIR_BACKEND_INITIALIZE_ERROR(APIR_BACKEND_INITIALIZE_SUCCESS);

@@ -0,0 +1,127 @@
#pragma once

#define APIR_BACKEND_INITIALIZE_SUCCESSS 0

⚠️ Potential issue

Fix typo in macro name.

The macro name has an extra 'S' at the end.

-#define APIR_BACKEND_INITIALIZE_SUCCESSS 0
+#define APIR_BACKEND_INITIALIZE_SUCCESS 0

Comment on lines +45 to +65
static bool
ggml_backend_remoting_device_supports_op(ggml_backend_dev_t dev, const ggml_tensor * op) {
#if USE_ALWAYS_TRUE_SUPPORTS_OP == 1
/* ggml-rpc cheats it like this */
/* with the current implementation of serialize_tensor, the src/view aren't properly passed */
UNUSED(dev);
UNUSED(op);

return true;
#elif USE_METAL_GUEST_SUPPORTS_OP == 1
UNUSED(dev);

struct ggml_backend_remoting_device_context *device_ctx = GET_DEVICE_CONTEXT();

return ggml_metal_supports_op(device_ctx->metal_dev_ctx, op);
#else
struct virtgpu *gpu = DEV_TO_GPU(dev);

return apir_device_supports_op(gpu, op);
#endif
}

⚠️ Potential issue

Address incomplete tensor serialization

The comment indicates that serialize_tensor doesn't properly pass src/view information, which could lead to incorrect operation support queries. This is currently worked around by compile-time flags, but the underlying issue should be fixed.

Would you like me to open an issue to track fixing the tensor serialization to properly include src/view information?


Comment on lines +129 to +144
ggml_backend_buffer_type_t
ggml_backend_remoting_device_get_buffer_type(ggml_backend_dev_t dev) {
IMPLEMENTED_ONCE;

struct virtgpu *gpu = DEV_TO_GPU(dev);

apir_buffer_type_host_handle_t ctx = apir_device_get_buffer_type(gpu);

static struct ggml_backend_buffer_type buft {
/* .iface = */ ggml_backend_remoting_buffer_type_interface,
/* .device = */ dev,
/* .context = */ (void *) ctx,
};

return &buft;
}

💡 Verification agent

🧩 Analysis chain

Review static buffer type usage for thread safety

Multiple functions use static buffer type structs. This could be problematic if:

  1. Multiple devices need different buffer types
  2. The context needs to be device-specific

If these buffer types are meant to be singletons, consider documenting this design decision. Otherwise, allocate them dynamically per device.

Also applies to: 146-161, 187-198


🏁 Script executed:

#!/bin/bash
# Description: Check if multiple devices are supported and how buffer types are used

# Search for device enumeration or multiple device support
rg -A 5 'device.*count|get.*device.*count|multiple.*device'

# Check how buffer types are used across the codebase
ast-grep --pattern 'ggml_backend_buffer_type'

Length of output: 89479


Separate buffer types per remoting device

The local static buft in ggml-remotingfrontend/ggml-backend-device.cpp only initializes once (on first call), binding the buffer type to the first dev/ctx. All subsequent devices will erroneously share that same buft. To support multiple remoting devices, you must create one buffer-type struct per device:

• File: ggml/src/ggml-remotingfrontend/ggml-backend-device.cpp, function ggml_backend_remoting_device_get_buffer_type (lines 129–144)
Replace:

  static struct ggml_backend_buffer_type buft {
      .iface   = ggml_backend_remoting_buffer_type_interface,
      .device  = dev,
      .context = (void*)ctx,
  };
  return &buft;
With either:

  • A static array sized by apir_device_get_count(gpu), indexed by dev.
  • Dynamically allocating a new ggml_backend_buffer_type per dev (e.g. new ggml_backend_buffer_type{…}).

Ensure each remoting device has its own ggml_backend_buffer_type instance so that .device and .context reflect the correct device.
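
A minimal sketch of the dynamic variant, assuming the struct ggml_backend_buffer_type layout from ggml-backend-impl.h (the cache and helper are illustrative, not existing code):

#include <mutex>
#include <unordered_map>

#include "ggml-backend-impl.h"

// One lazily created buffer-type object per device, so .device and .context
// always describe the caller's device rather than the first one seen.
static ggml_backend_buffer_type_t
remoting_buffer_type_for_dev(ggml_backend_dev_t dev, void * ctx) {
    static std::unordered_map<ggml_backend_dev_t, ggml_backend_buffer_type> cache;
    static std::mutex mtx;

    std::lock_guard<std::mutex> lock(mtx);
    auto it = cache.find(dev);
    if (it == cache.end()) {
        it = cache.emplace(dev, ggml_backend_buffer_type {
            /* .iface   = */ ggml_backend_remoting_buffer_type_interface,
            /* .device  = */ dev,
            /* .context = */ ctx,
        }).first;
    }
    return &it->second;  // node-based container: the address stays stable
}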


@coderabbitai coderabbitai bot left a comment

Review continued from previous batch...

Comment on lines +21 to +34
static enum ggml_backend_dev_type
ggml_backend_remoting_device_get_type(ggml_backend_dev_t dev) {
IMPLEMENTED_ONCE;
struct virtgpu *gpu = DEV_TO_GPU(dev);

static enum ggml_backend_dev_type type;
static bool has_type = false;
if (!has_type) {
has_type = true;
type = (enum ggml_backend_dev_type) apir_device_get_type(gpu);
}

return type;
}

⚠️ Potential issue

Fix race condition in device type caching

The static variables type and has_type are accessed without synchronization, creating a race condition if multiple threads call this function simultaneously.

Consider using std::once_flag with std::call_once for thread-safe initialization:

 static enum ggml_backend_dev_type
 ggml_backend_remoting_device_get_type(ggml_backend_dev_t dev) {
   IMPLEMENTED_ONCE;
   struct virtgpu *gpu = DEV_TO_GPU(dev);
 
   static enum ggml_backend_dev_type type;
-  static bool has_type = false;
-  if (!has_type) {
-    has_type = true;
+  static std::once_flag type_flag;
+  std::call_once(type_flag, [&gpu, &type]() {
     type = (enum ggml_backend_dev_type) apir_device_get_type(gpu);
-  }
+  });
 
   return type;
 }

Don't forget to include <mutex> at the top of the file.
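
An equivalent, shorter fix relies on C++11 “magic statics”: initialization of a function-local static is guaranteed to run exactly once, even under concurrent calls. A sketch, with the same caveat as the original code that the device type is assumed identical across devices:

static enum ggml_backend_dev_type
ggml_backend_remoting_device_get_type(ggml_backend_dev_t dev) {
  IMPLEMENTED_ONCE;
  struct virtgpu *gpu = DEV_TO_GPU(dev);

  // C++11 guarantees thread-safe one-time initialization of local statics,
  // so the remote query runs exactly once, with no explicit flag or mutex.
  static const enum ggml_backend_dev_type type =
      (enum ggml_backend_dev_type) apir_device_get_type(gpu);

  return type;
}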


Comment on lines +52 to +88
static inline const char *backend_dispatch_command_name(ApirBackendCommandType type)
{
switch (type) {
/* device */
case APIR_COMMAND_TYPE_DEVICE_GET_COUNT: return "backend_get_device_count";
case APIR_COMMAND_TYPE_DEVICE_GET_NAME: return "backend_get_device_name";
case APIR_COMMAND_TYPE_DEVICE_GET_DESCRIPTION: return "backend_get_device_description";
case APIR_COMMAND_TYPE_DEVICE_GET_TYPE: return "backend_device_get_type";
case APIR_COMMAND_TYPE_DEVICE_GET_MEMORY: return "backend_get_device_memory";
case APIR_COMMAND_TYPE_DEVICE_SUPPORTS_OP: return "backend_device_supports_op";
case APIR_COMMAND_TYPE_DEVICE_GET_BUFFER_TYPE: return "backend_get_buffer_type";
case APIR_COMMAND_TYPE_DEVICE_GET_PROPS: return "backend_get_props";
case APIR_COMMAND_TYPE_DEVICE_BUFFER_FROM_PTR: return "backend_buffer_from_ptr";

/* buffer-type */
case APIR_COMMAND_TYPE_BUFFER_TYPE_GET_NAME: return "backend_buffer_type_get_name";
case APIR_COMMAND_TYPE_BUFFER_TYPE_GET_ALIGNMENT: return "backend_buffer_type_get_alignment";
case APIR_COMMAND_TYPE_BUFFER_TYPE_GET_MAX_SIZE: return "backend_buffer_type_get_max_size";
case APIR_COMMAND_TYPE_BUFFER_TYPE_IS_HOST: return "backend_buffer_type_is_host";
case APIR_COMMAND_TYPE_BUFFER_TYPE_ALLOC_BUFFER: return "backend_buffer_type_alloc_buffer";

/* buffer */
case APIR_COMMAND_TYPE_BUFFER_GET_BASE: return "backend_buffer_get_base";
case APIR_COMMAND_TYPE_BUFFER_SET_TENSOR: return "backend_buffer_set_tensor";
case APIR_COMMAND_TYPE_BUFFER_GET_TENSOR: return "backend_buffer_get_tensor";
case APIR_COMMAND_TYPE_BUFFER_CLEAR: return "backend_buffer_clear";
case APIR_COMMAND_TYPE_BUFFER_FREE_BUFFER: return "backend_buffer_free_buffer";

/* backend */
case APIR_COMMAND_TYPE_BACKEND_GRAPH_COMPUTE: return "backend_graph_compute";

/* metal */
case APIR_COMMAND_TYPE_METAL_GET_DEVICE_CONTEXT: return "metal_get_device_context";

default: return "unknown";
}
}

⚠️ Potential issue

Fix function name mismatches in command name mapping

Several command names don't match the actual function names, which could cause confusion during debugging.

 static inline const char *backend_dispatch_command_name(ApirBackendCommandType type)
 {
   switch (type) {
   /* device */
-  case APIR_COMMAND_TYPE_DEVICE_GET_COUNT: return "backend_get_device_count";
-  case APIR_COMMAND_TYPE_DEVICE_GET_NAME: return "backend_get_device_name";
-  case APIR_COMMAND_TYPE_DEVICE_GET_DESCRIPTION: return "backend_get_device_description";
+  case APIR_COMMAND_TYPE_DEVICE_GET_COUNT: return "backend_reg_get_device_count";
+  case APIR_COMMAND_TYPE_DEVICE_GET_NAME: return "backend_device_get_name";
+  case APIR_COMMAND_TYPE_DEVICE_GET_DESCRIPTION: return "backend_device_get_description";
   case APIR_COMMAND_TYPE_DEVICE_GET_TYPE: return "backend_device_get_type";
-  case APIR_COMMAND_TYPE_DEVICE_GET_MEMORY: return "backend_get_device_memory";
+  case APIR_COMMAND_TYPE_DEVICE_GET_MEMORY: return "backend_device_get_memory";
   case APIR_COMMAND_TYPE_DEVICE_SUPPORTS_OP: return "backend_device_supports_op";
-  case APIR_COMMAND_TYPE_DEVICE_GET_BUFFER_TYPE: return "backend_get_buffer_type";
-  case APIR_COMMAND_TYPE_DEVICE_GET_PROPS: return "backend_get_props";
-  case APIR_COMMAND_TYPE_DEVICE_BUFFER_FROM_PTR: return "backend_buffer_from_ptr";
+  case APIR_COMMAND_TYPE_DEVICE_GET_BUFFER_TYPE: return "backend_device_get_buffer_type";
+  case APIR_COMMAND_TYPE_DEVICE_GET_PROPS: return "backend_device_get_props";
+  case APIR_COMMAND_TYPE_DEVICE_BUFFER_FROM_PTR: return "backend_device_buffer_from_ptr";
📝 Committable suggestion

‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.

Suggested change

Original:

static inline const char *backend_dispatch_command_name(ApirBackendCommandType type)
{
  switch (type) {
  /* device */
  case APIR_COMMAND_TYPE_DEVICE_GET_COUNT: return "backend_get_device_count";
  case APIR_COMMAND_TYPE_DEVICE_GET_NAME: return "backend_get_device_name";
  case APIR_COMMAND_TYPE_DEVICE_GET_DESCRIPTION: return "backend_get_device_description";
  case APIR_COMMAND_TYPE_DEVICE_GET_TYPE: return "backend_device_get_type";
  case APIR_COMMAND_TYPE_DEVICE_GET_MEMORY: return "backend_get_device_memory";
  case APIR_COMMAND_TYPE_DEVICE_SUPPORTS_OP: return "backend_device_supports_op";
  case APIR_COMMAND_TYPE_DEVICE_GET_BUFFER_TYPE: return "backend_get_buffer_type";
  case APIR_COMMAND_TYPE_DEVICE_GET_PROPS: return "backend_get_props";
  case APIR_COMMAND_TYPE_DEVICE_BUFFER_FROM_PTR: return "backend_buffer_from_ptr";
  /* buffer-type */
  case APIR_COMMAND_TYPE_BUFFER_TYPE_GET_NAME: return "backend_buffer_type_get_name";
  case APIR_COMMAND_TYPE_BUFFER_TYPE_GET_ALIGNMENT: return "backend_buffer_type_get_alignment";
  case APIR_COMMAND_TYPE_BUFFER_TYPE_GET_MAX_SIZE: return "backend_buffer_type_get_max_size";
  case APIR_COMMAND_TYPE_BUFFER_TYPE_IS_HOST: return "backend_buffer_type_is_host";
  case APIR_COMMAND_TYPE_BUFFER_TYPE_ALLOC_BUFFER: return "backend_buffer_type_alloc_buffer";
  /* buffer */
  case APIR_COMMAND_TYPE_BUFFER_GET_BASE: return "backend_buffer_get_base";
  case APIR_COMMAND_TYPE_BUFFER_SET_TENSOR: return "backend_buffer_set_tensor";
  case APIR_COMMAND_TYPE_BUFFER_GET_TENSOR: return "backend_buffer_get_tensor";
  case APIR_COMMAND_TYPE_BUFFER_CLEAR: return "backend_buffer_clear";
  case APIR_COMMAND_TYPE_BUFFER_FREE_BUFFER: return "backend_buffer_free_buffer";
  /* backend */
  case APIR_COMMAND_TYPE_BACKEND_GRAPH_COMPUTE: return "backend_graph_compute";
  /* metal */
  case APIR_COMMAND_TYPE_METAL_GET_DEVICE_CONTEXT: return "metal_get_device_context";
  default: return "unknown";
  }
}

Suggested:

static inline const char *backend_dispatch_command_name(ApirBackendCommandType type)
{
  switch (type) {
  /* device */
  case APIR_COMMAND_TYPE_DEVICE_GET_COUNT: return "backend_reg_get_device_count";
  case APIR_COMMAND_TYPE_DEVICE_GET_NAME: return "backend_device_get_name";
  case APIR_COMMAND_TYPE_DEVICE_GET_DESCRIPTION: return "backend_device_get_description";
  case APIR_COMMAND_TYPE_DEVICE_GET_TYPE: return "backend_device_get_type";
  case APIR_COMMAND_TYPE_DEVICE_GET_MEMORY: return "backend_device_get_memory";
  case APIR_COMMAND_TYPE_DEVICE_SUPPORTS_OP: return "backend_device_supports_op";
  case APIR_COMMAND_TYPE_DEVICE_GET_BUFFER_TYPE: return "backend_device_get_buffer_type";
  case APIR_COMMAND_TYPE_DEVICE_GET_PROPS: return "backend_device_get_props";
  case APIR_COMMAND_TYPE_DEVICE_BUFFER_FROM_PTR: return "backend_device_buffer_from_ptr";
  /* buffer-type */
  case APIR_COMMAND_TYPE_BUFFER_TYPE_GET_NAME: return "backend_buffer_type_get_name";
  case APIR_COMMAND_TYPE_BUFFER_TYPE_GET_ALIGNMENT: return "backend_buffer_type_get_alignment";
  case APIR_COMMAND_TYPE_BUFFER_TYPE_GET_MAX_SIZE: return "backend_buffer_type_get_max_size";
  case APIR_COMMAND_TYPE_BUFFER_TYPE_IS_HOST: return "backend_buffer_type_is_host";
  case APIR_COMMAND_TYPE_BUFFER_TYPE_ALLOC_BUFFER: return "backend_buffer_type_alloc_buffer";
  /* buffer */
  case APIR_COMMAND_TYPE_BUFFER_GET_BASE: return "backend_buffer_get_base";
  case APIR_COMMAND_TYPE_BUFFER_SET_TENSOR: return "backend_buffer_set_tensor";
  case APIR_COMMAND_TYPE_BUFFER_GET_TENSOR: return "backend_buffer_get_tensor";
  case APIR_COMMAND_TYPE_BUFFER_CLEAR: return "backend_buffer_clear";
  case APIR_COMMAND_TYPE_BUFFER_FREE_BUFFER: return "backend_buffer_free_buffer";
  /* backend */
  case APIR_COMMAND_TYPE_BACKEND_GRAPH_COMPUTE: return "backend_graph_compute";
  /* metal */
  case APIR_COMMAND_TYPE_METAL_GET_DEVICE_CONTEXT: return "metal_get_device_context";
  default: return "unknown";
  }
}
🤖 Prompt for AI Agents
In ggml/src/ggml-remotingbackend/backend-dispatched.h between lines 52 and 88,
some returned command name strings do not match their actual function names,
causing potential confusion. Review each case in the switch statement and update
the returned string to exactly match the corresponding function name for that
command type, ensuring consistency and clarity in debugging output.
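
A common way to prevent this kind of drift is to generate both the case labels and the debug strings from a single X-macro table, so a renamed handler and its label can never diverge. A minimal sketch, assuming the ApirBackendCommandType enum from backend-dispatched.h; the APIR_COMMAND_NAME_TABLE macro and the entries shown are illustrative, not part of the PR:

// Sketch only: one table drives every case label and its debug name.
#define APIR_COMMAND_NAME_TABLE(X) \
  X(APIR_COMMAND_TYPE_DEVICE_GET_COUNT, "backend_reg_get_device_count") \
  X(APIR_COMMAND_TYPE_DEVICE_GET_NAME, "backend_device_get_name") \
  X(APIR_COMMAND_TYPE_BUFFER_GET_BASE, "backend_buffer_get_base") \
  X(APIR_COMMAND_TYPE_BACKEND_GRAPH_COMPUTE, "backend_graph_compute")
  /* ...one X(...) entry per command... */

static inline const char *backend_dispatch_command_name(ApirBackendCommandType type)
{
  switch (type) {
#define APIR_NAME_CASE(cmd, name) case cmd: return name;
  APIR_COMMAND_NAME_TABLE(APIR_NAME_CASE)
#undef APIR_NAME_CASE
  default: return "unknown";
  }
}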

Comment on lines 116 to 120
ggml_backend_dev_t ggml_backend_remoting_get_device(size_t device);
ggml_backend_buffer_type_t ggml_backend_remoting_host_buffer_type();
ggml_backend_t ggml_backend_remoting_device_init(ggml_backend_dev_t dev, const char * params);
ggml_backend_buffer_type_t ggml_backend_remoting_device_get_buffer_type(ggml_backend_dev_t dev);
ggml_backend_t ggml_backend_remoting_device_init(ggml_backend_dev_t dev, const char * params);

⚠️ Potential issue

Remove duplicate function declaration

The function ggml_backend_remoting_device_init is declared twice.

 ggml_backend_dev_t ggml_backend_remoting_get_device(size_t device);
 ggml_backend_buffer_type_t ggml_backend_remoting_host_buffer_type();
 ggml_backend_t ggml_backend_remoting_device_init(ggml_backend_dev_t dev, const char * params);
 ggml_backend_buffer_type_t ggml_backend_remoting_device_get_buffer_type(ggml_backend_dev_t dev);
-ggml_backend_t ggml_backend_remoting_device_init(ggml_backend_dev_t dev, const char * params);
📝 Committable suggestion


Suggested change

Original:

ggml_backend_dev_t ggml_backend_remoting_get_device(size_t device);
ggml_backend_buffer_type_t ggml_backend_remoting_host_buffer_type();
ggml_backend_t ggml_backend_remoting_device_init(ggml_backend_dev_t dev, const char * params);
ggml_backend_buffer_type_t ggml_backend_remoting_device_get_buffer_type(ggml_backend_dev_t dev);
ggml_backend_t ggml_backend_remoting_device_init(ggml_backend_dev_t dev, const char * params);

Suggested:

ggml_backend_dev_t ggml_backend_remoting_get_device(size_t device);
ggml_backend_buffer_type_t ggml_backend_remoting_host_buffer_type();
ggml_backend_t ggml_backend_remoting_device_init(ggml_backend_dev_t dev, const char * params);
ggml_backend_buffer_type_t ggml_backend_remoting_device_get_buffer_type(ggml_backend_dev_t dev);
🤖 Prompt for AI Agents
In ggml/src/ggml-remotingfrontend/ggml-remoting.h around lines 116 to 120, the
function ggml_backend_remoting_device_init is declared twice. Remove the
duplicate declaration so that the function is declared only once to avoid
redundancy and potential confusion.

Comment on lines +56 to +61
#define STOP_HERE \
thks_bye()

#define BREAKPOINT \
breakpoint()


⚠️ Potential issue

Fix undefined function references in debug macros

The macros STOP_HERE and BREAKPOINT reference undefined functions that will cause link errors.

Either:

  1. Define these functions or include the appropriate header
  2. Replace with standard debugging approaches:
 #define STOP_HERE \
-  thks_bye()
+  __builtin_trap()  // or assert(false)

 #define BREAKPOINT \
-  breakpoint()
+  __builtin_debugtrap()  // or raise(SIGTRAP) on POSIX
📝 Committable suggestion


Suggested change

Original:

#define STOP_HERE \
  thks_bye()

#define BREAKPOINT \
  breakpoint()

Suggested:

#define STOP_HERE \
  __builtin_trap()  // or assert(false)

#define BREAKPOINT \
  __builtin_debugtrap()  // or raise(SIGTRAP) on POSIX
🤖 Prompt for AI Agents
In ggml/src/ggml-remotingfrontend/ggml-remoting.h around lines 56 to 61, the
macros STOP_HERE and BREAKPOINT reference undefined functions thks_bye() and
breakpoint(), causing link errors. To fix this, either define these functions or
include the headers where they are defined. Alternatively, replace these macros
with standard debugging methods such as calling abort() for STOP_HERE and using
compiler-specific breakpoint intrinsics or inline assembly for BREAKPOINT.
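
As a concrete illustration, a minimal sketch of portable definitions along those lines; the clang check and the SIGTRAP fallback are assumptions about the target toolchains, not what the PR ships:

#include <csignal>  // raise, SIGTRAP
#include <cstdlib>  // abort

// Sketch: halt unconditionally (abort() produces a core dump where enabled).
#define STOP_HERE \
  abort()

// Sketch: drop into an attached debugger without tearing the process down.
#if defined(__clang__)
#define BREAKPOINT __builtin_debugtrap()  // clang intrinsic
#else
#define BREAKPOINT raise(SIGTRAP)         // POSIX fallback for gcc and others
#endif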

Comment on lines +1 to +35
/*
* Header for the Direct Rendering Manager
*
* Author: Rickard E. (Rik) Faith <[email protected]>
*
* Acknowledgments:
* Dec 1999, Richard Henderson <[email protected]>, move to generic cmpxchg.
*/

/*
* Copyright 1999 Precision Insight, Inc., Cedar Park, Texas.
* Copyright 2000 VA Linux Systems, Inc., Sunnyvale, California.
* All rights reserved.
*
* Permission is hereby granted, free of charge, to any person obtaining a
* copy of this software and associated documentation files (the "Software"),
* to deal in the Software without restriction, including without limitation
* the rights to use, copy, modify, merge, publish, distribute, sublicense,
* and/or sell copies of the Software, and to permit persons to whom the
* Software is furnished to do so, subject to the following conditions:
*
* The above copyright notice and this permission notice (including the next
* paragraph) shall be included in all copies or substantial portions of the
* Software.
*
* THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR
* IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY,
* FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL
* VA LINUX SYSTEMS AND/OR ITS SUPPLIERS BE LIABLE FOR ANY CLAIM, DAMAGES OR
* OTHER LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE,
* ARISING FROM, OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR
* OTHER DEALINGS IN THE SOFTWARE.
*/

#ifndef _DRM_H_

💡 Verification agent

🧩 Analysis chain

Verify the source and version of this DRM UAPI header.

This appears to be a copy of the Linux kernel's DRM UAPI header. Consider documenting the source kernel version for future maintenance and updates.


🏁 Script executed:

#!/bin/bash
# Description: Check if this is a verbatim copy from the Linux kernel

# Search for similar files in the repository
fd -e h drm.h

# Check if there are any modifications from the standard kernel header
# by searching for common DRM defines
rg -A 5 "DRM_NAME.*drm.*Name in kernel"

Length of output: 820


Document upstream Linux kernel version for DRM UAPI header

The file ggml/src/ggml-remotingfrontend/include/drm-uapi/drm.h appears to be a direct copy of the Linux kernel’s DRM UAPI header (e.g. it defines DRM_NAME, DRM_MIN_ORDER, etc.). To simplify future syncs and maintenance, please add a top‐of‐file comment indicating the exact kernel version (or commit/patch level) this was taken from, along with a link or reference to the corresponding Linux kernel source location.

Suggested update:

  • File: ggml/src/ggml-remotingfrontend/include/drm-uapi/drm.h
  • At the very top, insert something like:
    /*
     * Upstream source: Linux kernel v5.15-rc2 (commit abcdef1234567890)
     * https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/tree/include/uapi/drm/drm.h
     */
  • Verify that any local changes are documented or reconsidered in future merges.
🤖 Prompt for AI Agents
In ggml/src/ggml-remotingfrontend/include/drm-uapi/drm.h at the very top before
any existing comments, add a new comment block specifying the exact Linux kernel
version or commit hash from which this DRM UAPI header was copied, along with a
URL to the corresponding upstream source in the Linux kernel repository. This
will help future maintainers track the origin and simplify syncing updates.
Also, review and document any local modifications compared to the upstream
version.

Comment on lines +6 to +8
// needs UNUSED to be defined
// needs FATAL to be defined


🛠️ Refactor suggestion

Document the required macro dependencies.

The file requires UNUSED and FATAL macros to be defined but doesn't include a header that provides them. Consider either including the required header or documenting where these should come from.

 
 // needs UNUSED to be defined
 // needs FATAL to be defined
+// Example: These macros should be defined in a common header like "common.h" or "utils.h"
+// #define UNUSED(x) ((void)(x))
+// #define FATAL(...) do { fprintf(stderr, __VA_ARGS__); abort(); } while(0)
 
🤖 Prompt for AI Agents
In ggml/src/ggml-remotingbackend/shared/venus_cs.h around lines 6 to 8, the file
uses the macros UNUSED and FATAL without defining them or including a header
that does. Fix this by either including the appropriate header file that defines
these macros or adding comments to document that these macros must be defined
before including this file.
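
For illustration, a minimal sketch of the two macros such a prerequisite header could define; the exact definitions used elsewhere in the PR are an assumption:

#include <cstdio>   // fprintf
#include <cstdlib>  // abort

// Sketch: silence unused-parameter warnings without compiler attributes.
#define UNUSED(x) ((void)(x))

// Sketch: print a message and terminate; ##__VA_ARGS__ (a GNU extension
// accepted by gcc and clang) lets FATAL("msg") work with no extra arguments.
#define FATAL(fmt, ...) \
  do { \
    fprintf(stderr, "FATAL: " fmt "\n", ##__VA_ARGS__); \
    abort(); \
  } while (0)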

Comment on lines +514 to +526
/* apir_buffer_type_host_handle_t */

static inline void
vn_encode_apir_buffer_type_host_handle_t(struct vn_cs_encoder *enc, const apir_buffer_type_host_handle_t *val)
{
vn_encode(enc, sizeof(apir_buffer_type_host_handle_t), val, sizeof(apir_buffer_type_host_handle_t));
}

static inline void
vn_decode_apir_buffer_type_host_handle_t(struct vn_cs_decoder *dec, apir_buffer_type_host_handle_t *val)
{
vn_decode(dec, sizeof(apir_buffer_type_host_handle_t), val, sizeof(apir_buffer_type_host_handle_t));
}

⚠️ Potential issue

Missing type definitions for apir_buffer types.

The functions use the types apir_buffer_type_host_handle_t and apir_buffer_host_handle_t, which are not defined in this header.

Include the header that defines these types or add forward declarations:

+// Forward declarations or include the header that defines these
+// #include "apir_backend.h"
+typedef struct apir_buffer_type_host_handle_t apir_buffer_type_host_handle_t;
+typedef struct apir_buffer_host_handle_t apir_buffer_host_handle_t;
+
 /* apir_buffer_type_host_handle_t */
🤖 Prompt for AI Agents
In ggml/src/ggml-remotingbackend/shared/venus_cs.h around lines 514 to 526, the
types apir_buffer_type_host_handle_t and apir_buffer_host_handle_t used in the
encode and decode functions are not defined in this header. To fix this, include
the appropriate header file where these types are defined or add forward
declarations for these types at the top of this header file to ensure they are
recognized during compilation.
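
Note that the struct forward declarations suggested above would not compile here: vn_encode and vn_decode take sizeof(...) of the handle type, which requires a complete type. If the handles are opaque integer values, as the byte-wise copy implies, plain integer aliases are what is needed; a sketch assuming uintptr_t-sized handles (the authoritative definitions live in the PR's shared apir headers):

#include <cstdint>

// Sketch: complete integer types make sizeof(...) in the encode/decode
// helpers well-defined; an incomplete struct typedef would not.
typedef uintptr_t apir_buffer_type_host_handle_t;  // host-side pointer value
typedef uintptr_t apir_buffer_host_handle_t;       // host-side pointer value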


Comment on line 100

static inline void
vn_encode_ggml_buffer(struct vn_cs_encoder *enc, const ggml_backend_buffer_t buffer) {
apir_buffer_host_handle_t handle = BUFFER_TO_HOST_HANDLE(buffer);

⚠️ Potential issue

Missing definition for BUFFER_TO_HOST_HANDLE macro.

The macro BUFFER_TO_HOST_HANDLE is used but not defined in this header.

Add the macro definition or include the header that defines it:

+// Include the header that defines BUFFER_TO_HOST_HANDLE or define it here
+// #define BUFFER_TO_HOST_HANDLE(buffer) ggml_buffer_to_apir_handle(buffer)
+
 static inline void
 vn_encode_ggml_buffer(struct vn_cs_encoder *enc, const ggml_backend_buffer_t buffer) {
   apir_buffer_host_handle_t handle = BUFFER_TO_HOST_HANDLE(buffer);
🤖 Prompt for AI Agents
In ggml/src/ggml-remotingbackend/shared/venus_cs_ggml.h at line 100, the macro
BUFFER_TO_HOST_HANDLE is used but not defined or included. To fix this, either
add the definition of the BUFFER_TO_HOST_HANDLE macro in this header or include
the appropriate header file where it is defined, ensuring the macro is available
before its usage.
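
For context, one plausible shape for that macro, assuming the frontend keeps the host-side handle in the buffer's context field; the context struct and field name here are hypothetical:

// Sketch: the remoting frontend stores the host handle it received from the
// backend in the ggml buffer's context pointer; the macro fetches it back.
struct ggml_backend_remoting_buffer_context {
  apir_buffer_host_handle_t host_handle;  // hypothetical field name
};

#define BUFFER_TO_HOST_HANDLE(buffer) \
  (((struct ggml_backend_remoting_buffer_context *)(buffer)->context)->host_handle)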


Comment on line 217

// it safe to remove the `const` qualifier here, we *do* want to
// modify the shared memory data to fix the `src` pointers.
ggml_tensor *tensor = (ggml_tensor *)(uintptr_t) vn_cs_decoder_use_inplace(dec, sizeof(ggml_tensor));

🛠️ Refactor suggestion

Document the safety of the const cast.

The const qualifier is removed to modify shared memory data. While this appears intentional for fixing src pointers, it should be documented.

 
-  // it safe to remove the `const` qualifier here, we *do* want to
+  // it is safe to remove the `const` qualifier here, we *do* want to
   // modify the shared memory data to fix the `src` pointers.
+  // This is required because the deserialized tensor pointers need to be
+  // updated to point to the correct locations in the shared memory.
   ggml_tensor *tensor = (ggml_tensor *)(uintptr_t) vn_cs_decoder_use_inplace(dec, sizeof(ggml_tensor));

Committable suggestion skipped: line range outside the PR's diff.

🤖 Prompt for AI Agents
In ggml/src/ggml-remotingbackend/shared/venus_cs_ggml.h at line 217, the code
removes the const qualifier by casting to a non-const pointer to modify shared
memory data. Add a comment explaining that this const cast is intentional and
safe because the modification is required to fix src pointers in shared memory,
clarifying the rationale to future readers and maintainers.
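
To make the pattern concrete, a sketch of the zero-copy decode this comment is about: the tensor bytes stay in the shared-memory stream, and only the embedded src pointers, which are meaningless in the receiver's address space, get rewritten. vn_cs_decoder_use_inplace matches the PR; lookup_decoded_tensor is a hypothetical helper standing in for however the PR maps sender-side pointers to decoded tensors:

// Sketch: decode a ggml_tensor in place inside the shared-memory window.
static ggml_tensor *decode_tensor_inplace(struct vn_cs_decoder *dec) {
  // Dropping const is deliberate: the shared page is writable, and the
  // serialized src pointers must be patched to receiver-side addresses.
  ggml_tensor *tensor =
    (ggml_tensor *)(uintptr_t) vn_cs_decoder_use_inplace(dec, sizeof(ggml_tensor));

  for (int i = 0; i < GGML_MAX_SRC; i++) {
    if (tensor->src[i] == NULL) {
      continue;
    }
    tensor->src[i] = lookup_decoded_tensor(tensor->src[i]);  // hypothetical
  }
  return tensor;
}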

Comment on line 27

#ifndef VIRTGPU_DRM_H
#define VIRTGPU_DRM_H

#include "drm.h"

💡 Verification agent

🧩 Analysis chain

Verify the include path for drm.h

The include uses quotes, which suggests a local header. Since this file lives in drm-uapi/, the include should either be relative or use the full path from the include directory.


🏁 Script executed:

#!/bin/bash
# Description: Check if drm.h exists in the expected location and how it's included elsewhere

# Find drm.h in the repository
fd -t f "drm.h$"

# Check how other files include drm headers
rg -A 2 '#include.*drm\.h'

Length of output: 760


Fix the include path in virtgpu_drm.h

The header currently uses a local include ("drm.h"), but to match the project’s include-directory layout and how other headers reference drm-uapi files, it should use the full relative path.

• File: ggml/src/ggml-remotingfrontend/include/drm-uapi/virtgpu_drm.h
Change:

-#include "drm.h"
+#include "drm-uapi/drm.h"
📝 Committable suggestion


Suggested change

Original:

#include "drm.h"

Suggested:

#include "drm-uapi/drm.h"
🤖 Prompt for AI Agents
In ggml/src/ggml-remotingfrontend/include/drm-uapi/virtgpu_drm.h at line 27,
change the include directive from a local include "drm.h" to use the full
relative path that matches the project's include-directory layout and other
drm-uapi headers. Replace #include "drm.h" with the appropriate relative path
include, such as #include "drm-uapi/drm.h".


openshift-ci bot commented Jul 9, 2025

[APPROVALNOTIFIER] This PR is NOT APPROVED

This pull-request has been approved by:

The full list of commands accepted by this bot can be found here.

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

kpouget pushed a commit to kpouget/crc-llama.cpp that referenced this pull request Aug 27, 2025
* oai moe

* compat with new checkpoint

* add attn sink impl

* add rope scaling yarn

* logits match with latest transformers code

* wip chat template

* rm trailing space

* use ggml_scale_bias

* rm redundant is_swa_all

* convert interleaved gate_up

* graph : fix activation function to match reference (crc-org#7)

* vocab : handle o200k_harmony special tokens

* ggml : add attention sinks support (crc-org#1)

* llama : add attn sinks

* ggml : add attn sinks

* cuda : add attn sinks

* vulkan : add support for sinks in softmax

remove unnecessary return

* ggml : add fused swiglu_oai op (crc-org#11)

* ggml : add fused swiglu_oai op

* Update ggml/src/ggml-cpu/ops.cpp

Co-authored-by: Georgi Gerganov <[email protected]>

* update CUDA impl

* cont : metal impl

* add vulkan impl

* test-backend-ops : more test cases, clean up

* llama : remove unfused impl

* remove extra lines

---------

Co-authored-by: Georgi Gerganov <[email protected]>

---------

Co-authored-by: slaren <[email protected]>

* repack mxfp4 upon conversion

* clean up a bit

* enable thinking

* add quick hack to render only some special tokens

* fix bf16 conversion

* remove vocab hack

* webui ok

* support chat parsing for gpt-oss

* fix webui

* direct mapping mxfp4, FINALLY

* force using mxfp4

* properly use lazy tensor

* ggml : add mxfp4

ggml : use e8m0 conversion instead of powf

Co-authored-by: Diego Devesa <[email protected]>

change kvalues_mxfp4 table to match e2m1 (crc-org#6)

metal : remove quantization for now (not used)

cuda : fix disabled CUDA graphs due to ffn moe bias

vulkan : add support for mxfp4

cont : add cm2 dequant

* ggml : add ggml_add_id (crc-org#13)

* ggml : add ggml_add_id

* add cuda impl

* llama : add weight support check for add_id

* perf opt

* add vulkan impl

* rename cuda files

* add metal impl

* allow in-place ggml_add_id

* llama : keep biases on CPU with --cpu-moe

* llama : fix compile error

ggml-ci

* cuda : add fallback for __nv_cvt_e8m0_to_bf16raw

ggml-ci

* cleanup

ggml-ci

* sycl : fix supports_op for MXFP4

ggml-ci

* fix Unknown reasoning format

* ggml-cpu : fix AVX build

ggml-ci

* fix hip build

ggml-ci

* cuda : add mxfp4 dequantization support for cuBLAS

ggml-ci

* ggml-cpu : fix mxfp4 fallback definitions for some architectures

ggml-ci

* cuda : fix version required for __nv_cvt_e8m0_to_bf16raw

---------

Co-authored-by: Xuan Son Nguyen <[email protected]>
Co-authored-by: slaren <[email protected]>