Clarification on the OP_CPY operation src0->src1 #1314
Hi, I was reviewing the generated graph for Llama 4, and there appears to be an issue with the implementation of OP_CPY. This line suggests that it is:

```c
case GGML_OP_CPY: {
    // cpy overwrites value of src1 by src0 and returns view(src1)
    // the overwriting is mathematically equivalent to:
    // tensor = src0 * 1 + src1 * 0
    if (src0_needs_grads) {
        // dsrc0 = dtensor * 1
        ggml_add_or_set(ctx, cgraph, isrc0, ggml_reshape(ctx, grad, src0));
    }
    if (src1_needs_grads) {
        // dsrc1 = dtensor * 0 -> noop
    }
}
```

When looking at the resulting graph, there is a dependency that does not seem to be realized:

*[figure: visualization of the generated graph around the cpy node]*

The red line is an implicit dependency. The blue node has no output dependencies. I am curious why this was designed this way, rather than having src0 -> dst. I can imagine that this works better (more efficiently), because it results in a view rather than an extra copy. But when doing dependency analysis, this gets in the way, since the dependency is never made explicit. This is currently not an issue, as the ordering of tensor IDs preserves the implicit dependency (i.e., in the figure, the copy is ID 14, while the consumer is ID 20); therefore, during evaluation, 14 and 20 are never executed out of order or in parallel. Any insights here would be appreciated.
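For concreteness, here is a minimal sketch of the structure I mean, built with the public ggml API (`buf`, `src0`, and `sketch` are hypothetical names, not the actual Llama 4 graph code):

```c
#include "ggml.h"

// Minimal sketch of the structure in question. ggml_cpy(ctx, a, b) records
// a and b as sources and returns view(b), so a separately created view of
// the same buffer has no graph edge to the cpy result, even though it reads
// the data the cpy wrote.
void sketch(struct ggml_context * ctx, struct ggml_cgraph * gf,
            struct ggml_tensor * buf,    // shared buffer, e.g. a KV cache
            struct ggml_tensor * src0) { // data written into the buffer
    // "blue" node: copy src0 into a section of buf; the result is view(buf)
    struct ggml_tensor * cpy =
        ggml_cpy(ctx, src0, ggml_view_1d(ctx, buf, src0->ne[0], 0));

    // consumer: a fresh view of buf whose only source is buf itself; the
    // cpy -> consumer edge (the red line) is never recorded in the graph
    struct ggml_tensor * consumer = ggml_view_1d(ctx, buf, buf->ne[0], 0);

    // both end up in the graph, ordered only by insertion (node ID) order
    ggml_build_forward_expand(gf, cpy);
    ggml_build_forward_expand(gf, consumer);
}
```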
Replies: 1 comment 1 reply
The code that you are looking at is for the backward pass. It's not used during inference.
The update of the KV cache in all inference graphs is indeed a bit tricky. We basically have a large KV buffer and for each batch we update a small section of it - this is the cpy into a view (i.e. the left red arrow in your picture).
Later in the same graph, we need to use a larger portion of the KV buffer. So we make another view and use that - this is the right arrow in the picture.
Indeed, there is no explicit dependency stated here. We solve this by applying a `ggml_build_forward_expand()` after the update of the KV cache to guarantee that all operations up to that point would be performed before the ones that follow.
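For illustration, here is a minimal sketch of that pattern using the public ggml API (names like `kv`, `k_cur`, `n_past`, and `n_cur` are placeholders, not the actual llama.cpp code):

```c
#include "ggml.h"

// Sketch of the KV cache update pattern described above (illustrative names).
void build_kv_update(struct ggml_context * ctx, struct ggml_cgraph * gf,
                     struct ggml_tensor * kv,     // large persistent KV buffer
                     struct ggml_tensor * k_cur,  // keys computed this batch
                     int64_t n_past, int64_t n_cur) {
    const size_t es = ggml_element_size(kv);

    // small section of the KV buffer written this batch (the left red arrow)
    struct ggml_tensor * kv_dst = ggml_view_1d(ctx, kv, n_cur, n_past*es);

    // expand the cpy into the graph immediately, so every node built after
    // this call is guaranteed to execute after the KV update
    ggml_build_forward_expand(gf, ggml_cpy(ctx, k_cur, kv_dst));

    // larger view used later in the graph (the right arrow); it has no
    // explicit edge to the cpy above; the expand call is what orders them
    struct ggml_tensor * kv_all = ggml_view_1d(ctx, kv, n_past + n_cur, 0);
    ggml_build_forward_expand(gf, kv_all);
}
```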