cuda : implement bf16 cpy ops and enable bf16 cont #14763

CISC · 2025-07-18T20:55:22Z

Implemented missing BF16 CPY ops and enabled CONT op for BF16.

Tests before

  CONT(type=bf16,ne=[2,1,1,1]): not supported [CUDA0] 
  CONT(type=bf16,ne=[2,1,3,5]): not supported [CUDA0] 
  CONT(type=bf16,ne=[2,3,5,7]): not supported [CUDA0] 
[...]
  CPY(type_src=bf16,type_dst=bf16,ne=[1,2,3,4],permute_src=[0,0,0,0],permute_dst=[0,0,0,0]): OK
  CPY(type_src=bf16,type_dst=bf16,ne=[1,2,3,4],permute_src=[0,2,1,3],permute_dst=[0,0,0,0]): not supported [CUDA0] 
  CPY(type_src=bf16,type_dst=bf16,ne=[1,2,3,4],permute_src=[0,3,1,2],permute_dst=[0,2,1,3]): not supported [CUDA0] 
  CPY(type_src=bf16,type_dst=bf16,ne=[2,2,3,4],permute_src=[0,0,0,0],permute_dst=[0,0,0,0]): OK
  CPY(type_src=bf16,type_dst=bf16,ne=[2,2,3,4],permute_src=[0,2,1,3],permute_dst=[0,0,0,0]): not supported [CUDA0] 
  CPY(type_src=bf16,type_dst=bf16,ne=[2,2,3,4],permute_src=[0,3,1,2],permute_dst=[0,2,1,3]): not supported [CUDA0] 
  CPY(type_src=bf16,type_dst=bf16,ne=[3,2,3,4],permute_src=[0,0,0,0],permute_dst=[0,0,0,0]): OK
  CPY(type_src=bf16,type_dst=bf16,ne=[3,2,3,4],permute_src=[0,2,1,3],permute_dst=[0,0,0,0]): not supported [CUDA0] 
  CPY(type_src=bf16,type_dst=bf16,ne=[3,2,3,4],permute_src=[0,3,1,2],permute_dst=[0,2,1,3]): not supported [CUDA0] 
[...]
  CPY(type_src=f16,type_dst=bf16,ne=[256,4,4,4],permute_src=[0,0,0,0],permute_dst=[0,0,0,0]): not supported [CUDA0] 
  CPY(type_src=f16,type_dst=bf16,ne=[256,2,3,4],permute_src=[0,2,1,3],permute_dst=[0,0,0,0]): not supported [CUDA0] 
[...]
  CPY(type_src=bf16,type_dst=f32,ne=[256,4,4,4],permute_src=[0,0,0,0],permute_dst=[0,0,0,0]): not supported [CUDA0] 
  CPY(type_src=bf16,type_dst=f32,ne=[256,2,3,4],permute_src=[0,2,1,3],permute_dst=[0,0,0,0]): not supported [CUDA0] 
  CPY(type_src=bf16,type_dst=f16,ne=[256,4,4,4],permute_src=[0,0,0,0],permute_dst=[0,0,0,0]): not supported [CUDA0] 
  CPY(type_src=bf16,type_dst=f16,ne=[256,2,3,4],permute_src=[0,2,1,3],permute_dst=[0,0,0,0]): not supported [CUDA0] 
  CPY(type_src=bf16,type_dst=bf16,ne=[256,4,4,4],permute_src=[0,0,0,0],permute_dst=[0,0,0,0]): OK
  CPY(type_src=bf16,type_dst=bf16,ne=[256,2,3,4],permute_src=[0,2,1,3],permute_dst=[0,0,0,0]): not supported [CUDA0] 
[...]
  CPY(type_src=bf16,type_dst=f32,ne=[256,4,4,4],permute_src=[0,0,0,0],permute_dst=[0,0,0,0]): not supported [CUDA0] 
  CPY(type_src=bf16,type_dst=f32,ne=[256,2,3,4],permute_src=[0,2,1,3],permute_dst=[0,0,0,0]): not supported [CUDA0]

Tests after

  CONT(type=bf16,ne=[2,1,1,1]): OK
  CONT(type=bf16,ne=[2,1,3,5]): OK
  CONT(type=bf16,ne=[2,3,5,7]): OK
[...]
  CPY(type_src=bf16,type_dst=bf16,ne=[1,2,3,4],permute_src=[0,0,0,0],permute_dst=[0,0,0,0]): OK
  CPY(type_src=bf16,type_dst=bf16,ne=[1,2,3,4],permute_src=[0,2,1,3],permute_dst=[0,0,0,0]): OK
  CPY(type_src=bf16,type_dst=bf16,ne=[1,2,3,4],permute_src=[0,3,1,2],permute_dst=[0,2,1,3]): OK
  CPY(type_src=bf16,type_dst=bf16,ne=[2,2,3,4],permute_src=[0,0,0,0],permute_dst=[0,0,0,0]): OK
  CPY(type_src=bf16,type_dst=bf16,ne=[2,2,3,4],permute_src=[0,2,1,3],permute_dst=[0,0,0,0]): OK
  CPY(type_src=bf16,type_dst=bf16,ne=[2,2,3,4],permute_src=[0,3,1,2],permute_dst=[0,2,1,3]): OK
  CPY(type_src=bf16,type_dst=bf16,ne=[3,2,3,4],permute_src=[0,0,0,0],permute_dst=[0,0,0,0]): OK
  CPY(type_src=bf16,type_dst=bf16,ne=[3,2,3,4],permute_src=[0,2,1,3],permute_dst=[0,0,0,0]): OK
  CPY(type_src=bf16,type_dst=bf16,ne=[3,2,3,4],permute_src=[0,3,1,2],permute_dst=[0,2,1,3]): OK
[...]
  CPY(type_src=f16,type_dst=bf16,ne=[256,4,4,4],permute_src=[0,0,0,0],permute_dst=[0,0,0,0]): OK
  CPY(type_src=f16,type_dst=bf16,ne=[256,2,3,4],permute_src=[0,2,1,3],permute_dst=[0,0,0,0]): OK
[...]
  CPY(type_src=bf16,type_dst=f32,ne=[256,4,4,4],permute_src=[0,0,0,0],permute_dst=[0,0,0,0]): OK
  CPY(type_src=bf16,type_dst=f32,ne=[256,2,3,4],permute_src=[0,2,1,3],permute_dst=[0,0,0,0]): OK
  CPY(type_src=bf16,type_dst=f16,ne=[256,4,4,4],permute_src=[0,0,0,0],permute_dst=[0,0,0,0]): OK
  CPY(type_src=bf16,type_dst=f16,ne=[256,2,3,4],permute_src=[0,2,1,3],permute_dst=[0,0,0,0]): OK
  CPY(type_src=bf16,type_dst=bf16,ne=[256,4,4,4],permute_src=[0,0,0,0],permute_dst=[0,0,0,0]): OK
  CPY(type_src=bf16,type_dst=bf16,ne=[256,2,3,4],permute_src=[0,2,1,3],permute_dst=[0,0,0,0]): OK
[...]
  CPY(type_src=bf16,type_dst=f32,ne=[256,4,4,4],permute_src=[0,0,0,0],permute_dst=[0,0,0,0]): OK
  CPY(type_src=bf16,type_dst=f32,ne=[256,2,3,4],permute_src=[0,2,1,3],permute_dst=[0,0,0,0]): OK

Also fixed a cut'n'paste error for F16->F16 in ggml_cuda_cpy_fn and deduplicated all copy functions.

JohannesGaessler

Generally speaking I am not a fan of how the float conversions are being done currently. I think the code could be deduplicated significantly by unconditionally casting half, nv_bfloat16, and float to float and then simply using that float value to set the destination. I would appreciate it if you were to do this in this PR, otherwise I'll keep it as one of the tasks to hand out when people ask me for a good first issue to work on.

ggml/src/ggml-cuda/cpy-utils.cuh

ggml/src/ggml-cuda/cpy.cu

ggml/src/ggml-cuda/ggml-cuda.cu

implement bf16 cpy ops and enable bf16 cont

4162ffe

CISC requested a review from JohannesGaessler July 18, 2025 20:55

github-actions bot added Nvidia GPU Issues specific to Nvidia GPUs ggml changes relating to the ggml tensor library for machine learning labels Jul 18, 2025

JohannesGaessler reviewed Jul 21, 2025

View reviewed changes

ggml/src/ggml-cuda/cpy-utils.cuh Show resolved Hide resolved

deduplicate copy functions

1860cf9

CISC requested a review from JohannesGaessler July 21, 2025 14:49

JohannesGaessler reviewed Jul 21, 2025

View reviewed changes

ggml/src/ggml-cuda/cpy-utils.cuh Outdated Show resolved Hide resolved

ggml/src/ggml-cuda/cpy-utils.cuh Outdated Show resolved Hide resolved

ggml/src/ggml-cuda/cpy.cu Outdated Show resolved Hide resolved

further deduplication

9cbb916

CISC requested a review from JohannesGaessler July 21, 2025 16:02

JohannesGaessler reviewed Jul 21, 2025

View reviewed changes

ggml/src/ggml-cuda/ggml-cuda.cu Outdated Show resolved Hide resolved

CISC added 2 commits July 21, 2025 23:03

deduplicate checks

c039424

ws--

c058460

CISC requested a review from JohannesGaessler July 21, 2025 21:07

rename helper function

45148e6

JohannesGaessler approved these changes Jul 22, 2025

View reviewed changes

CISC merged commit e28c0b8 into master Jul 22, 2025
47 checks passed

CISC deleted the cisc/cuda-bf16-cpy-cont branch July 22, 2025 10:33

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Uh oh!

cuda : implement bf16 cpy ops and enable bf16 cont #14763

cuda : implement bf16 cpy ops and enable bf16 cont #14763

CISC commented Jul 18, 2025 •

edited

Loading

Uh oh!

JohannesGaessler left a comment

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

cuda : implement bf16 cpy ops and enable bf16 cont #14763

cuda : implement bf16 cpy ops and enable bf16 cont #14763

Conversation

CISC commented Jul 18, 2025 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

JohannesGaessler left a comment

Choose a reason for hiding this comment

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

CISC commented Jul 18, 2025 •

edited

Loading