Commit df8b6bd
support fp16 shgemm under openblas (pytorch#169042)
# Purpose
This PR adds support for FP16 shgemm under OpenBLAS. We ran tests with vLLM on the platform described below; with this patch, vLLM shows faster inference speed under FP16.
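As a quick, hedged sanity check (not part of this PR), the sketch below compares an FP16 CPU matmul against an FP32 reference; the shapes are illustrative only, and the matmul dispatches to the configured BLAS backend when one is available:

```python
import torch

# Hypothetical sanity check (not from the PR): compare an FP16 CPU matmul
# against an FP32 reference. Shapes are illustrative.
a = torch.randn(256, 512)
b = torch.randn(512, 128)

ref = a @ b                          # FP32 reference
out = (a.half() @ b.half()).float()  # FP16 GEMM path

print("max abs error vs FP32:", (out - ref).abs().max().item())
```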
**Platform info:**
```
Architecture:           riscv64
Byte Order:              Little Endian
CPU(s):                   64
On-line CPU(s) list:      0-63
Vendor ID:                0x5b7
BIOS Vendor ID:           SOPHGO
Model name:               -
BIOS Model name:          SG2044 Not Set CPU @ 2.6GHz
BIOS CPU family:          513
CPU family:               0x80000000090c0d00
Model:                    0x2047000
Thread(s) per core:       1
Core(s) per socket:       64
Socket(s):                1
Frequency boost:          disabled
CPU(s) scaling MHz:       100%
CPU max MHz:              2600.0000
CPU min MHz:              1000.0000
Caches (sum of all):
  L1d:                    4 MiB (64 instances)
  L1i:                    4 MiB (64 instances)
  L2:                     32 MiB (16 instances)
  L3:                     64 MiB (1 instance)
Vulnerabilities:
  Gather data sampling:   Not affected
  Itlb multihit:          Not affected
  L1tf:                   Not affected
  Mds:                    Not affected
  Meltdown:               Not affected
  Mmio stale data:        Not affected
  Reg file data sampling: Not affected
  Retbleed:               Not affected
ISA: rv64imafdcv_zicbom_zicboz_zicntr_zicond_zicsr_zifencei_zihintntl_zihintpause_zihpm_zawrs_zfa_zfh_zfhmin_zca_zcb_zcd_zba_zbb_zbc_zbs_zve32f_zve32x_zve64d_zve64f_zve64x_zvfh_zvfhmin_sscofpmf_sstc_svinval_svnapot_svpbmt
```
**Branches**
- openblas: develop
- torch: develop
- vllm: main
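To confirm which BLAS backend a local torch build picked up, one option is to print the build configuration; the exact strings in the output depend on the build:

```python
import torch

# Shows the build configuration of the installed torch; the BLAS backend
# (e.g. OpenBLAS) appears in this output when torch was built against it.
print(torch.__config__.show())
```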
# Test Plan
Two configurations were benchmarked:
- **Base**: without this PR
- **PyTorch using OpenBLAS FP16 GEMM**: with this PR
**Base**
```bash
export VLLM_CPU_OMP_THREADS_BIND=0-63
export VLLM_CPU_KVCACHE_SPACE=60
vllm bench latency \
    --model /home/models/Qwen2.5-7B-Instruct \
    --tensor-parallel-size 1 \
    --dtype float16 \
    --input-len 16 \
    --output-len 16 \
    --enforce-eager \
    --max-model-len 8192 \
    --max-num-batched-tokens 8192 \
    --batch-size 1 \
    --n 1 \
    --num-iters-warmup 5 \
    --num-iters 8 \
    --seed 42 \
    --output-json ./latency_results_fp16_latency_base.json
```
**PyTorch using OpenBLAS FP16 GEMM**
```bash
export VLLM_CPU_OMP_THREADS_BIND=0-63
export VLLM_CPU_KVCACHE_SPACE=60
vllm bench latency \
    --model /home/models/Qwen2.5-7B-Instruct \
    --tensor-parallel-size 1 \
    --dtype float16 \
    --input-len 16 \
    --output-len 16 \
    --enforce-eager \
    --max-model-len 8192 \
    --max-num-batched-tokens 8192 \
    --batch-size 1 \
    --n 1 \
    --num-iters-warmup 5 \
    --num-iters 8 \
    --seed 42 \
    --output-json ./latency_results_fp16_latency_with_openblas_support.json
```
# Result
**Base**
```json
{
  "avg_latency": 62.53946338250171,
  "latencies": [
    58.46783778001554,
    58.230652199999895,
    58.335780619992875,
    59.77051957999356,
    58.587668860011036,
    59.31567866000114,
    58.460076240007766,
    89.14749311999185
  ],
  "percentiles": {
    "10": 58.30424209399498,
    "25": 58.42900233500404,
    "50": 58.52775332001329,
    "75": 59.429388889999245,
    "90": 68.58361164199304,
    "99": 87.09110497219196
  }
}
```
**PyTorch using OpenBLAS FP16 GEMM**
```json
{
  "avg_latency": 32.42863222499727,
  "latencies": [
    30.742418120033108,
    33.67000828002347,
    29.747197599965148,
    32.11275753995869,
    34.566938299976755,
    30.849812360014766,
    34.46360486000776,
    33.27632073999848
  ],
  "percentiles": {
    "10": 30.44385196401272,
    "25": 30.82296380001935,
    "50": 32.69453913997859,
    "75": 33.86840742501954,
    "90": 34.49460489199846,
    "99": 34.55970495917892
  }
}
```
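From the two reports above, average latency drops from about 62.5 s to about 32.4 s, roughly a 1.9x speedup. A minimal sketch for computing this from the result files written by the benchmark commands above:

```python
import json

# File names match the --output-json paths used in the benchmark commands above.
with open("latency_results_fp16_latency_base.json") as f:
    base = json.load(f)
with open("latency_results_fp16_latency_with_openblas_support.json") as f:
    patched = json.load(f)

speedup = base["avg_latency"] / patched["avg_latency"]
print(f"base avg latency:    {base['avg_latency']:.2f} s")
print(f"patched avg latency: {patched['avg_latency']:.2f} s")
print(f"speedup:             {speedup:.2f}x")
```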
Pull Request resolved: pytorch#169042
Approved by: https://github.com/aditew01, https://github.com/albanD
File tree: 3 files changed, +63 −0 (aten/src/ATen/native, cmake/Modules)