Test bench #127

Open
wants to merge 116 commits into
base: master
Changes from 1 commit
93a5cda
add all_reduce_group test
Panlichen Jul 8, 2022
725b0d7
simple group allreduce
Panlichen Jul 13, 2022
06c97da
nccl group bigger
Panlichen Jul 14, 2022
560f9eb
log for ncclGroupStart/End
Panlichen Jul 16, 2022
abe8f5c
add non group all_reduce in simple
Panlichen Jul 16, 2022
6990dc3
half ofccl_all_reduce
Panlichen Jul 16, 2022
44c40d8
Merge branch 'master' of github.com:Panlichen/nccl-tests
Panlichen Jul 16, 2022
b8a749a
Simple not really necessary, yet no harm to keep it
Panlichen Jul 16, 2022
02a914e
ofccl test file
Panlichen Jul 18, 2022
8eba16f
run startColl exactly as we want
Panlichen Jul 18, 2022
d6a4d47
ofccl_all_reduce.cu
Panlichen Jul 18, 2022
5e3c98d
add log
Panlichen Jul 19, 2022
d3de021
add -M option: use separate ncclComm for different coll op, even with …
Panlichen Jul 19, 2022
76b1cd7
remove log
Panlichen Jul 19, 2022
a466914
use prepare and done in nccl-tests
Panlichen Jul 28, 2022
9b362b5
check no reused ncclComm in ofcclCommList
Panlichen Aug 9, 2022
818c8e3
invoke ofcclDestroy
Panlichen Aug 9, 2022
67e70b9
use ofcclRunAllReduce
Panlichen Aug 23, 2022
97f58bc
use callback
Panlichen Aug 26, 2022
8a3d5f8
use func-ptr for callback, instead of std::function and lambda
Panlichen Aug 26, 2022
a3a1aea
stuck
Panlichen Aug 28, 2022
7ff3ea5
completeColl in warmup result in stuck
Panlichen Aug 28, 2022
8c5df9e
Merge pull request #1 from Panlichen/aggregate_waiting_cqes
Panlichen Aug 28, 2022
ee76beb
+lock
Panlichen Aug 28, 2022
3f0a8fe
tidy log
Panlichen Aug 29, 2022
0bd6d6a
nccl-tests run exactly once
Panlichen Sep 5, 2022
9a35e7f
ad-hoc check
Panlichen Sep 8, 2022
22b869f
weird check
Panlichen Sep 9, 2022
732f8bd
activate -n, can run multi-iters
Panlichen Sep 9, 2022
85d5cbd
+ warmup
Panlichen Sep 26, 2022
090185c
fix completeColl in warmup
Panlichen Sep 26, 2022
17197fa
try context
Panlichen Sep 30, 2022
5f399fd
bugfix: seenCqe[miter] = 0; in warmup
Panlichen Oct 2, 2022
5cd2cb8
polish callback
Panlichen Oct 6, 2022
0148628
check OK
Panlichen Oct 9, 2022
b6027be
check ok
Panlichen Oct 9, 2022
24290c6
restore semi-original NCCL's BenchTime
Panlichen Oct 9, 2022
d609039
run check smoothly
Panlichen Oct 12, 2022
4cd2091
finalize check
Panlichen Oct 13, 2022
b8dc018
adapt to volunteer quit
Panlichen Oct 13, 2022
4ebe9e5
Merge branch 'master' of github.com:Panlichen/nccl-tests
Panlichen Oct 13, 2022
bd10523
adapt to volunteer quit
Panlichen Oct 13, 2022
74d4f0d
keep the report log
Panlichen Oct 14, 2022
ed7f645
try pure inplace
Panlichen Oct 17, 2022
eed57ca
log format
Panlichen Oct 17, 2022
0ef76cc
manual buffer size done
Panlichen Oct 18, 2022
d6cad8e
adjust log
Panlichen Oct 19, 2022
cec88ef
+ nccl_manual_size
Panlichen Oct 20, 2022
34f1b12
nccl manual size seems ok
Panlichen Oct 20, 2022
c84dd89
fix manual size bug
Panlichen Oct 20, 2022
93668dc
non-homogeneous nccl manual size
Panlichen Oct 20, 2022
71b40c7
+ cudadev in cbArgs for ofccl manual size
Panlichen Oct 22, 2022
f2b285d
161 manual size from resnet
Panlichen Oct 24, 2022
a32587b
accurate damie
Panlichen Oct 24, 2022
5d07bca
.
Panlichen Oct 24, 2022
4f06775
aggressive no sync
Panlichen Oct 31, 2022
40fbb70
a new permutation from oneflow
Panlichen Nov 7, 2022
e98b271
log
Panlichen Nov 12, 2022
3598de4
suit 8 cards
Panlichen Nov 14, 2022
bcf3b87
use prepareDone
Panlichen Nov 18, 2022
0ccfcc9
nccl ms different order
Panlichen Nov 20, 2022
2b19a59
usleep
Panlichen Nov 26, 2022
b3b6323
+ ofccl_test.sh
Panlichen Nov 26, 2022
e2bfe2e
scripts
Panlichen Nov 29, 2022
34cd275
script
Panlichen Nov 30, 2022
4e81620
scripts
Panlichen Dec 1, 2022
0c3718e
scripts
Panlichen Dec 5, 2022
dba5947
scripts
Panlichen Dec 7, 2022
10fefc6
scripts
Panlichen Dec 8, 2022
7a79d98
little ms
Panlichen Dec 9, 2022
7b37cea
+ nccl_test.sh
Panlichen Dec 19, 2022
5bb88a1
fix bug in nccl_tests.sh
Panlichen Dec 19, 2022
d9f1a55
+ run multi test scripts
Panlichen Dec 21, 2022
f505977
+order
Panlichen Dec 21, 2022
1c9a007
28 is occupied
Panlichen Dec 22, 2022
54ff526
fix bug in nccl-tests/src_manual_size/ofccl_all_reduce_ms.cu
Panlichen Dec 22, 2022
57875d6
first complete version of auto_test
Panlichen Dec 23, 2022
23a25ea
fix file naming bug
Panlichen Dec 23, 2022
9daa76c
deltaSec appears to be measured too high
Panlichen Dec 23, 2022
1f58d48
meaningless NEW_TIMER
Panlichen Dec 23, 2022
fc09438
check frequency
Panlichen Dec 23, 2022
266d3c8
update xls name and ndev
Panlichen Dec 23, 2022
b034d10
Merge branch 'master' into auto_test
Panlichen Dec 23, 2022
f0cf272
Merge pull request #2 from Panlichen/auto_test
Panlichen Dec 23, 2022
4702178
add log
Panlichen Dec 23, 2022
5f7b4bf
update env
Panlichen Dec 23, 2022
5ced1a0
+nccl ofccl run.sh
Panlichen Dec 24, 2022
875d1d5
report rank 0 avg time
Panlichen Dec 25, 2022
386ee92
scripts
Panlichen Dec 25, 2022
13567f2
nccl show each kernel time
Panlichen Dec 25, 2022
12969b5
can handle averages
Dec 27, 2022
29d300b
time sheet: column R - column O
Dec 27, 2022
2b4b937
average nccl kern
Dec 27, 2022
93b9ddc
output ori
Dec 28, 2022
4321b54
compile QE_oricpp; add a column with the actual byte count; add Ex-Ox; modify average
Dec 28, 2022
d3f652d
+ in order ms
Panlichen Dec 29, 2022
5135aa3
output totalCnt
Dec 29, 2022
ffcbfba
script
Panlichen Dec 29, 2022
8dab62a
Merge branch 'master' into auto_test2
Panlichen Dec 29, 2022
c0454ff
Merge pull request #3 from Panlichen/auto_test2
Panlichen Dec 29, 2022
57ee21f
scripts
Panlichen Dec 30, 2022
9ee7202
scripts
Panlichen Jan 1, 2023
665de43
scripts
Panlichen Jan 6, 2023
b5a42cc
scripts
Panlichen Jan 6, 2023
1970187
scripts
Panlichen Jan 8, 2023
ceb3a5a
scripts
Panlichen Jan 10, 2023
ac30fd4
datatype in cmd; MY_NUM_DEV as cmd line param
Panlichen Jan 11, 2023
82c2a8e
scripts
Panlichen Jan 13, 2023
c760c82
+ occl AllGather
Panlichen Jan 14, 2023
abd0c14
5555 remove NCCL_MIN_NCHANNELS limit T^T TAT T_T T-T
Panlichen Jan 14, 2023
bbb04c6
+ ofccl ReduceScatter
Panlichen Jan 15, 2023
1f099a1
+ occl reduce
Panlichen Jan 15, 2023
6c075fd
+ofccl_broadcast; fix DEBUG_NT
Panlichen Jan 15, 2023
37b2463
before simplification
Jan 19, 2023
d81c257
test five operations
Jan 20, 2023
3bfb750
remove xlrd
Jan 27, 2023
scripts
Panlichen committed Nov 29, 2022
commit e2bfe2e80b6293ee7f50cfa0910e4e0069db91db
43 changes: 29 additions & 14 deletions ofccl_test.sh
@@ -7,19 +7,14 @@ export NCCL_ALGO=Ring
# export NCCL_MAX_NCHANNELS=1
# export NCCL_MIN_NCHANNELS=1
# export NCCL_NTHREADS=64
-export MY_NUM_DEV=2
-# export CUDA_VISIBLE_DEVICES=0,1,4,5
-export SHOW_ALL_PREPARED_COLL=0
-export NITER=4
-export NBYTES=8K
-export WARMITER=2
-export MITER=4

+export CHECK=0

export TRAVERSE_TIMES=10
-export TOLERANT_FAIL_CHECK_SQ_CNT=500
+export TOLERANT_FAIL_CHECK_SQ_CNT=5000
export CNT_BEFORE_QUIT=5
export TOLERANT_UNPROGRESSED_CNT=50000
-export BASE_CTX_SWITCH_THRESHOLD=100
+export BASE_CTX_SWITCH_THRESHOLD=80

echo TRAVERSE_TIMES=$TRAVERSE_TIMES
echo TOLERANT_FAIL_CHECK_SQ_CNT=$TOLERANT_FAIL_CHECK_SQ_CNT
@@ -28,18 +23,38 @@ echo TOLERANT_UNPROGRESSED_CNT=$TOLERANT_UNPROGRESSED_CNT
echo BASE_CTX_SWITCH_THRESHOLD=$BASE_CTX_SWITCH_THRESHOLD

if [ -z $BINARY ];then
BINARY="NORMAL"
BINARY="DEBUG"
BINARY="MS"
BINARY="PERF"
fi

-if [ "$BINARY" == "NORMAL" ];then
+if [ "$BINARY" == "DEBUG" ];then
target="./build/ofccl_all_reduce_perf"
+export MY_NUM_DEV=8
+# export CUDA_VISIBLE_DEVICES=0,1,4,5
+export SHOW_ALL_PREPARED_COLL=1
+export NITER=4
+export NBYTES=8K
+export WARMITER=2
+export MITER=4
+elif [ "$BINARY" == "PERF" ];then
+target="./build/ofccl_all_reduce_perf"
+export MY_NUM_DEV=2
+export CUDA_VISIBLE_DEVICES=0,1,4,5
+export SHOW_ALL_PREPARED_COLL=0
+export NITER=4
+export NBYTES=8K
+export WARMITER=2
+export MITER=4
elif [ "$BINARY" == "MS" ];then
target="./build/ofccl_all_reduce_ms_perf"
export NITER=200
export MY_NUM_DEV=8
# export CUDA_VISIBLE_DEVICES=0,1,4,5
export NITER=200
export SHOW_ALL_PREPARED_COLL=1
export WARMITER=0
export NBYTES=8K
export MITER=4
fi


@@ -48,13 +63,13 @@ if [ -z $RUN_TYPE ];then
fi

if [ "$RUN_TYPE" == "PURE" ];then
-cmd="$target -b $NBYTES -e $NBYTES -f 2 -t $MY_NUM_DEV -g 1 -n $NITER -w $WARMITER -c 0 -M $MITER"
+cmd="$target -b $NBYTES -e $NBYTES -f 2 -t $MY_NUM_DEV -g 1 -n $NITER -w $WARMITER -c $CHECK -M $MITER"
elif [ "$RUN_TYPE" == "GDB" ];then
cmd="cuda-gdb $target"
elif [ "$RUN_TYPE" == "NSYS" ];then
cmd="nsys profile -f true --trace=cuda,cudnn,cublas,osrt,nvtx -o /home/panlichen/work2/ofccl/log/nsys/$NSYS_FILE $target -b 64M -e 64M -f 2 -t $MY_NUM_DEV -g 1 -n 1 -w 0 -c 0"
fi

echo cmd=$cmd
-$cmd
+$cmd #> /home/panlichen/work2/ofccl/log/ofccl.log
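
For reference, a minimal sketch of how the script appears to map `BINARY` to a command line in the `PURE` run type after this commit. `build_cmd` is a hypothetical helper that is not part of the PR. The per-mode values are copied from the diff above, the default of `PERF` assumes the last `BINARY=` assignment in the fallback block wins, and the sketch only prints the command rather than launching the benchmark.

```shell
#!/usr/bin/env bash
# Hypothetical helper: resolve the PURE-mode command for a given BINARY mode.
# Values mirror the DEBUG / PERF / MS branches in ofccl_test.sh; nothing is executed.
build_cmd() {
  local binary="${1:-PERF}"   # assumed default: last BINARY= assignment in the script
  local check="${2:-0}"       # mirrors `export CHECK=0`
  local target ndev niter warm nbytes=8K miter=4

  case "$binary" in
    DEBUG) target=./build/ofccl_all_reduce_perf;    ndev=8; niter=4;   warm=2 ;;
    PERF)  target=./build/ofccl_all_reduce_perf;    ndev=2; niter=4;   warm=2 ;;
    MS)    target=./build/ofccl_all_reduce_ms_perf; ndev=8; niter=200; warm=0 ;;
    *)     echo "unknown BINARY: $binary" >&2; return 1 ;;
  esac

  # Same flag layout as the PURE branch of the script.
  echo "$target -b $nbytes -e $nbytes -f 2 -t $ndev -g 1 -n $niter -w $warm -c $check -M $miter"
}

build_cmd MS
```

Because the script only assigns `BINARY` when it is empty, the mode can be chosen from the environment, for example `BINARY=MS ./ofccl_test.sh`.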