Clarification on the OP_CPY operation src0->src1 #1314
Hi, I was reviewing the generated graph for Llama 4, and there appears to be an issue with the implementation of OP_CPY. This line suggests that it is:

```c
case GGML_OP_CPY: {
    // cpy overwrites value of src1 by src0 and returns view(src1)
    // the overwriting is mathematically equivalent to:
    // tensor = src0 * 1 + src1 * 0
    if (src0_needs_grads) {
        // dsrc0 = dtensor * 1
        ggml_add_or_set(ctx, cgraph, isrc0, ggml_reshape(ctx, grad, src0));
    }
    if (src1_needs_grads) {
        // dsrc1 = dtensor * 0 -> noop
    }
}
```

When looking at the resulting graph, there is a dependency that does not seem to be realized:

*[figure: visualization of the generated graph around the cpy node]*

The red line is an implicit dependency. The blue node has no output dependencies. I am curious why this was designed this way, rather than having src0 -> dst. I can imagine that this works better (more efficiently), because it results in a view rather than an extra copy. But when doing dependency analysis, this gets in the way, since the dependency is never made explicit. This is currently not an issue, as the ordering of tensor IDs preserves the implicit dependency (i.e., in the figure, the copy is ID 14, while the consumer is ID 20); therefore, during evaluation, 14 and 20 are never executed out of order or in parallel. Any insights here would be appreciated.
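For concreteness, here is a minimal sketch of the structure I mean, built with the public ggml API (`buf`, `src0`, and `sketch` are hypothetical names, not the actual Llama 4 graph code):

```c
#include "ggml.h"

// Minimal sketch of the structure in question. ggml_cpy(ctx, a, b) records
// a and b as sources and returns view(b), so a separately created view of
// the same buffer has no graph edge to the cpy result, even though it reads
// the data the cpy wrote.
void sketch(struct ggml_context * ctx, struct ggml_cgraph * gf,
            struct ggml_tensor * buf,    // shared buffer, e.g. a KV cache
            struct ggml_tensor * src0) { // data written into the buffer
    // "blue" node: copy src0 into a section of buf; the result is view(buf)
    struct ggml_tensor * cpy =
        ggml_cpy(ctx, src0, ggml_view_1d(ctx, buf, src0->ne[0], 0));

    // consumer: a fresh view of buf whose only source is buf itself; the
    // cpy -> consumer edge (the red line) is never recorded in the graph
    struct ggml_tensor * consumer = ggml_view_1d(ctx, buf, buf->ne[0], 0);

    // both end up in the graph, ordered only by insertion (node ID) order
    ggml_build_forward_expand(gf, cpy);
    ggml_build_forward_expand(gf, consumer);
}
```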
Replies: 1 comment 1 reply
The code that you are looking at is for the backward pass. It's not used during inference.
The update of the KV cache in all inference graphs is indeed a bit tricky. We basically have a large KV buffer and for each batch we update a small section of it - this is the cpy into a view (i.e. the left red arrow in your picture).
Later in the same graph, we need to use a larger portion of the KV buffer. So we make another view and use that - this is the right arrow in the picture.
Indeed, there is no explicit dependency stated here. We solve this by applying a `ggml_build_forward_expand()` after the update of the KV cache to guarantee that all operations up to that point would be performed before the ones that follow.
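For illustration, here is a minimal sketch of that pattern using the public ggml API (names like `kv`, `k_cur`, `n_past`, and `n_cur` are placeholders, not the actual llama.cpp code):

```c
#include "ggml.h"

// Sketch of the KV cache update pattern described above (illustrative names).
void build_kv_update(struct ggml_context * ctx, struct ggml_cgraph * gf,
                     struct ggml_tensor * kv,     // large persistent KV buffer
                     struct ggml_tensor * k_cur,  // keys computed this batch
                     int64_t n_past, int64_t n_cur) {
    const size_t es = ggml_element_size(kv);

    // small section of the KV buffer written this batch (the left red arrow)
    struct ggml_tensor * kv_dst = ggml_view_1d(ctx, kv, n_cur, n_past*es);

    // expand the cpy into the graph immediately, so every node built after
    // this call is guaranteed to execute after the KV update
    ggml_build_forward_expand(gf, ggml_cpy(ctx, k_cur, kv_dst));

    // larger view used later in the graph (the right arrow); it has no
    // explicit edge to the cpy above; the expand call is what orders them
    struct ggml_tensor * kv_all = ggml_view_1d(ctx, kv, n_past + n_cur, 0);
    ggml_build_forward_expand(gf, kv_all);
}
```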