
Fix reorder ops in cpu implementation with cl_mem #28286

Open · wants to merge 3 commits into master
Conversation


@WeldonWangwang WeldonWangwang commented Jan 7, 2025

Details:

  • Add a buffer_ptr() API to the gpu_buffer struct
  • Mark the output lock in the reorder CPU implementation as read_write, so that a Result node (of reorder type) can be written through cl_mem

When the HETERO:GPU.0,GPU.1 pipeline-parallel mode splits the llama_v2 model, the __module.model.layers.0.self_attn/prim::ListConstruct_3 node is marked as part of a shape_of subgraph and its result is passed to the next device, so this node and its Result node can only get CPU implementations.

At the inference stage, when preparing outputs for Result_44438, the B580 dGPU uses only cl_mem in

tensor_type = TensorType::BT_BUF_INTERNAL;

so the reorder CPU impl cannot write its result into cl_mem.

So we have two solutions:

  1. This PR: mark the reorder output_lock in the CPU impl as read_write so it can write to cl_mem
  2. PR #28476: skip mark_node for Result nodes of reorder type

Tickets:

@WeldonWangwang WeldonWangwang requested review from a team as code owners January 7, 2025 03:58
@github-actions github-actions bot added the category: GPU OpenVINO GPU plugin label Jan 7, 2025
src/plugins/intel_gpu/src/runtime/ocl/ocl_memory.hpp (outdated, resolved)
@@ -56,7 +56,7 @@ struct reorder_impl : public typed_primitive_impl<reorder> {
     auto output_mem_ptr = instance.output_memory_ptr();

     cldnn::mem_lock<uint8_t, mem_lock_type::read> input_lock(input_mem_ptr, stream);
-    cldnn::mem_lock<uint8_t, mem_lock_type::read> output_lock(output_mem_ptr, stream);
+    cldnn::mem_lock<uint8_t, mem_lock_type::read_write> output_lock(output_mem_ptr, stream);
Contributor:
The output memory update is the issue this PR solves; could you describe the detailed reason?

Contributor:

Who allocates the cl_mem output for this node, and why is the CPU impl chosen?
By the way, this line will break if the output is usm_device; it is unsafe to change it to read_write directly here.

Contributor (author):

Sure, River, I will add some test cases to explain this.

Contributor:

> usm_device

If it is usm_device, mem_lock itself performs the copy, right? But we need logic to write the data back to usm_device memory when needed.

Contributor:

@WeldonWangwang, could you please update access modifiers for the other CPU impls as well?

Contributor (author):

@sshlyapn Sure, will add it.

@@ -39,6 +39,9 @@ struct gpu_buffer : public lockable_gpu_mem, public memory {
         assert(0 == _lock_count);
         return _buffer;
     }
+    void* buffer_ptr() const override {
+        return _buffer.get();
+    }
Contributor:

I believe it's not needed?

Contributor (author):

Hi Sergey, it is not strictly needed; this API is only used in a debug message:

GPU_DEBUG_TRACE_DETAIL << internal_name << " with index " << output_idx << " prepare output: " << output_memory->buffer_ptr() << std::endl;

Without this override the pointer prints as 0, for example:

GPU_Debug: sync_infer_request.cpp:989:prepare_output: result:Result_51154 with index 9 prepare output: 0

so I added it.
