First Step accuracy? #1

Open
qidiso opened this issue May 4, 2018 · 31 comments

Comments

@qidiso

qidiso commented May 4, 2018

I don't have a P40 GPU, but I have two 1080 cards. You can set the batch size to 512, but I can only set it to 128.
In your README you train for 40 thousand steps in the first step, so how many steps should I train in my first step? And what accuracy do you get on LFW or AgeDB-30 after the first step?

@moli232777144
Owner

lr_steps can be set to 80000. Acc: 95.6; I just uploaded the data. It will still take about three days to verify the project.
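
One rough way to translate a step count between batch sizes is to keep the total number of training samples constant. The sketch below only does that arithmetic. The helper name `equivalent_steps` is made up for illustration, and two things are assumptions rather than facts from the repository: that the README's 40000 steps were run at a total batch size of 512, and that the total batch size is the per-GPU batch size times the number of GPUs.

```python
def equivalent_steps(ref_steps, ref_batch, per_gpu_batch, num_gpus):
    """Steps needed to see as many samples as `ref_steps` at `ref_batch`.

    Assumption: total batch = per-GPU batch * number of GPUs.
    """
    total_batch = per_gpu_batch * num_gpus
    samples = ref_steps * ref_batch
    # Integer ceiling, so the smaller-batch run never sees fewer samples.
    return (samples + total_batch - 1) // total_batch


# Two GPUs with 128 samples each, compared against 40000 steps at batch 512.
print(equivalent_steps(40000, 512, 128, 2))
```

For two cards at 128 samples each this gives 80000, which happens to match the lr_steps value suggested above. The thread does not say whether that value was derived this way.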

@qidiso
Author

qidiso commented May 5, 2018

Thanks @moli232777144, I'll try it.

@moli232777144
Owner

Unfortunately, bad results. How did your experiment go?

@qidiso
Author

qidiso commented May 7, 2018

samples/sec acc=0.366797
INFO:root:Epoch[33] Batch [11280] Speed: 861.70 samples/sec acc=0.361328
INFO:root:Epoch[33] Batch [11300] Speed: 864.48 samples/sec acc=0.368750
INFO:root:Epoch[33] Batch [11320] Speed: 871.56 samples/sec acc=0.372070
INFO:root:Epoch[33] Batch [11340] Speed: 866.92 samples/sec acc=0.361914
INFO:root:Epoch[33] Batch [11360] Speed: 877.25 samples/sec acc=0.362500
INFO:root:Epoch[33] Batch [11380] Speed: 865.76 samples/sec acc=0.372266
INFO:root:Epoch[33] Batch [11400] Speed: 878.27 samples/sec acc=0.365625
INFO:root:Epoch[33] Batch [11420] Speed: 867.70 samples/sec acc=0.353906
INFO:root:Epoch[33] Batch [11440] Speed: 870.65 samples/sec acc=0.367578
INFO:root:Epoch[33] Batch [11460] Speed: 866.44 samples/sec acc=0.369531
INFO:root:Epoch[33] Batch [11480] Speed: 875.99 samples/sec acc=0.352344
INFO:root:Epoch[33] Batch [11500] Speed: 873.52 samples/sec acc=0.371875
INFO:root:Epoch[33] Batch [11520] Speed: 867.85 samples/sec acc=0.352539
INFO:root:Epoch[33] Batch [11540] Speed: 865.43 samples/sec acc=0.351758
lr-batch-epoch: 1e-05 11553 33
testing verification..
(12000, 128)
infer time 11.886539
[lfw][502000]XNorm: 11.133125
[lfw][502000]Accuracy-Flip: 0.98900+-0.00484
testing verification..
(14000, 128)
infer time 14.814039
[cfp_fp][502000]XNorm: 9.110006
[cfp_fp][502000]Accuracy-Flip: 0.84514+-0.01910
testing verification..
(12000, 128)
infer time 12.223364
[agedb_30][502000]XNorm: 10.877486
[agedb_30][502000]Accuracy-Flip: 0.93533+-0.01299
saving 251
INFO:root

@qidiso
Author

qidiso commented May 7, 2018

[lfw][530000]Accuracy-Flip: 0.99000+-0.00459
testing verification..
(14000, 128)
infer time 14.187055
[cfp_fp][530000]XNorm: 9.111494
[cfp_fp][530000]Accuracy-Flip: 0.84843+-0.01903
testing verification..
(12000, 128)
infer time 12.007815
[agedb_30][530000]XNorm: 10.877945
[agedb_30][530000]Accuracy-Flip: 0.93417+-0.01218
saving 265
INFO:root:Saved checkpoint to "../models/MobileFaceNet/model-y1-arcface-0265.params"
[530000]Accuracy-Highest: 0.93683

@qidiso
Author

qidiso commented May 7, 2018

I feel my result is worse. I used this command:
CUDA_VISIBLE_DEVICES='0,1' python -u train_softmax.py --network y1 --ckpt 2 --loss-type 4 --lr-steps 160000,240000,280000,320000 --emb-size 128 --per-batch-size 128 --data-dir ../data/faces_ms1m_112x112 --pretrained ../models/MobileFaceNet/model-y1-softmax,20 --prefix ../models/MobileFaceNet/model-y1-arcface

@muzi2045

muzi2045 commented May 7, 2018

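There is a problem with the training parameters for the first step. The accuracy I get when training on my own machine does not reach the required level.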

@qidiso
Author

qidiso commented May 7, 2018

@moli232777144 Me too! I'm now training again with this command:
CUDA_VISIBLE_DEVICES='0,1' python -u train_softmax.py --network y1 --loss-type 4 --margin-m 0.5 --data-dir ../data/faces_ms1m_112x112 --pretrained ../models/MobileFaceNet/model-y1-softmax,28 --prefix ../models/MobileFaceNet/model-y1-arcface --emb-size 128 --per-batch-size 150
maybe auto dropout is better

@moli232777144
Owner

Uploaded! Weight decay should be set to 0.00004.

@qidiso
Author

qidiso commented May 7, 2018

Can you share your first-step softmax model with me?

@qidiso
Author

qidiso commented May 8, 2018

Any progress?

@moli232777144
Owner

moli232777144 commented May 8, 2018

[2018-05-08 15:30:36] INFO:root:Epoch[9] Batch [1060] Speed: 588.76 samples/sec acc=0.279980
[2018-05-08 15:30:54] INFO:root:Epoch[9] Batch [1080] Speed: 588.33 samples/sec acc=0.281934
[2018-05-08 15:31:11] INFO:root:Epoch[9] Batch [1100] Speed: 588.85 samples/sec acc=0.276074
[2018-05-08 15:31:28] lr-batch-epoch: 0.1 1120 9
[2018-05-08 15:31:28] testing verification..
[2018-05-08 15:31:28] INFO:root:Epoch[9] Batch [1120] Speed: 590.26 samples/sec acc=0.280859
[2018-05-08 15:31:41] (12000, 128)
[2018-05-08 15:31:41] infer time 12.936783
[2018-05-08 15:31:45] [lfw][68000]XNorm: 11.173922
[2018-05-08 15:31:45] [lfw][68000]Accuracy-Flip: 0.99283+-0.00472
[2018-05-08 15:31:45] testing verification..
[2018-05-08 15:32:01] (14000, 128)
[2018-05-08 15:32:01] infer time 15.572022
[2018-05-08 15:32:05] [cfp_fp][68000]XNorm: 9.046101
[2018-05-08 15:32:05] [cfp_fp][68000]Accuracy-Flip: 0.86486+-0.01647
[2018-05-08 15:32:05] testing verification..
[2018-05-08 15:32:18] (12000, 128)
[2018-05-08 15:32:18] infer time 12.247801
[2018-05-08 15:32:22] [agedb_30][68000]XNorm: 11.032911
[2018-05-08 15:32:22] [agedb_30][68000]Accuracy-Flip: 0.94050+-0.01049
[2018-05-08 15:32:22] saving 34
[2018-05-08 15:32:22] [68000]Accuracy-Highest: 0.94167
[2018-05-08 15:32:22] INFO:root:Saved checkpoint to "/data/output/model-y1-arcface-0034.params"

@qidiso
Author

qidiso commented May 8, 2018

I found that I get 99.37% on LFW at 40000 steps, but after training to 70000 steps I only get 99.1% on LFW. Maybe we should set lr = 0.01 at 40000 steps.

@moli232777144
Owner

You can try it. I still need a day to run this experiment.

@qidiso
Author

qidiso commented May 8, 2018

@moli232777144 I'll try it. If I get good results, I'll post the log.

@qidiso
Author

qidiso commented May 8, 2018

Not a good result. I get 99.45 on LFW and 94.50 on AgeDB, and it won't go any higher.

@moli232777144
Owner

Updated. Maybe we should increase the number of iterations until the accuracy is stable.

@muzi2045

muzi2045 commented May 10, 2018

In the second step I got 99.2 on LFW and 95.1 on AgeDB; maybe I need to continue training.
But the accuracy on LFW seems to be stuck at 99.2.

@qidiso
Author

qidiso commented May 10, 2018

In the first stage, I get this:
lr-batch-epoch: 0.001 3087 16
testing verification..
(12000, 128)
infer time 12.316967
[lfw][206000]XNorm: 11.074202
[lfw][206000]Accuracy-Flip: 0.99383+-0.00289
testing verification..
(14000, 128)
infer time 14.801521
[cfp_fp][206000]XNorm: 9.228846
[cfp_fp][206000]Accuracy-Flip: 0.88029+-0.01851
testing verification..
(12000, 128)
infer time 12.482475
[agedb_30][206000]XNorm: 11.014230
[agedb_30][206000]Accuracy-Flip: 0.95417+-0.00892
saving 103
INFO:root:Saved checkpoint to "../models/MobileFaceNet/model-y1-arcface-0103.params"
[206000]Accuracy-Highest: 0.95417

or this:

lr-batch-epoch: 1e-05 5723 18
testing verification..
(12000, 128)
infer time 12.226353
[lfw][234000]XNorm: 11.085642
[lfw][234000]Accuracy-Flip: 0.99450+-0.00259
testing verification..
(14000, 128)
infer time 14.634496
[cfp_fp][234000]XNorm: 9.239763
[cfp_fp][234000]Accuracy-Flip: 0.87871+-0.01877
testing verification..
(12000, 128)
infer time 12.104834
[agedb_30][234000]XNorm: 11.024038
[agedb_30][234000]Accuracy-Flip: 0.95100+-0.00723
saving 117
INFO:root:Saved checkpoint to "../models/MobileFaceNet/model-y1-arcface-0117.params"
[234000]Accuracy-Highest: 0.95417

or this:
lr-batch-epoch: 1e-05 1677 21
testing verification..
(12000, 128)
infer time 11.989666
[lfw][268000]XNorm: 11.078677
[lfw][268000]Accuracy-Flip: 0.99417+-0.00271
testing verification..
(14000, 128)
infer time 13.049772
[cfp_fp][268000]XNorm: 9.235129
[cfp_fp][268000]Accuracy-Flip: 0.87629+-0.01867
testing verification..
(12000, 128)
infer time 12.188487
[agedb_30][268000]XNorm: 11.015260
[agedb_30][268000]Accuracy-Flip: 0.95267+-0.00989
saving 134
INFO:root:Saved checkpoint to "../models/MobileFaceNet/model-y1-arcface-0134.params"
[268000]Accuracy-Highest: 0.95417
I'm now training the last stage, but I don't know which checkpoint to start from. For now I'm trying the 0117 .params file.
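
One possible way to pick a checkpoint is to parse the `Accuracy-Flip` lines from the log and rank the steps by the benchmark you care about. The following is a minimal sketch, not part of the repository: the function names are made up, and the step-to-checkpoint mapping (`step // 2000`) is only inferred from the logs in this thread, where step 206000 is saved as 103, 234000 as 117, and 268000 as 134.

```python
import re
from collections import defaultdict

# Matches lines such as "[lfw][206000]Accuracy-Flip: 0.99383+-0.00289".
ACC_RE = re.compile(r"\[(\w+)\]\[(\d+)\]Accuracy-Flip:\s*([\d.]+)\+-([\d.]+)")


def parse_verification(log_text):
    """Return {step: {dataset: accuracy}} from verification log lines."""
    results = defaultdict(dict)
    for name, step, acc, _std in ACC_RE.findall(log_text):
        results[int(step)][name] = float(acc)
    return dict(results)


def best_step(results, dataset="agedb_30"):
    """Step with the highest accuracy on the chosen dataset."""
    candidates = {s: r[dataset] for s, r in results.items() if dataset in r}
    return max(candidates, key=candidates.get)


# Accuracy lines copied from the three logs above.
LOG = """
[lfw][206000]Accuracy-Flip: 0.99383+-0.00289
[cfp_fp][206000]Accuracy-Flip: 0.88029+-0.01851
[agedb_30][206000]Accuracy-Flip: 0.95417+-0.00892
[lfw][234000]Accuracy-Flip: 0.99450+-0.00259
[cfp_fp][234000]Accuracy-Flip: 0.87871+-0.01877
[agedb_30][234000]Accuracy-Flip: 0.95100+-0.00723
[lfw][268000]Accuracy-Flip: 0.99417+-0.00271
[cfp_fp][268000]Accuracy-Flip: 0.87629+-0.01867
[agedb_30][268000]Accuracy-Flip: 0.95267+-0.00989
"""

results = parse_verification(LOG)
for dataset in ("lfw", "agedb_30"):
    step = best_step(results, dataset)
    # Checkpoint index inferred from the logs: one checkpoint per 2000 steps.
    print(dataset, step, step // 2000)
```

On these three logs the ranking differs by benchmark: step 234000 (checkpoint 117) is best on LFW, while step 206000 (checkpoint 103) is best on AgeDB-30 and matches the reported Accuracy-Highest of 0.95417. The differences are well within the reported standard deviations, so this is a tie-breaker rather than a strong signal.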

@qidiso
Author

qidiso commented May 10, 2018

@moli232777144 Have you gotten good results?

@moli232777144
Owner

With lr 0.1 for 40000 steps, then lr 0.01 for another 20000 steps, I get AgeDB 95.59 and LFW 99.51. I will continue to extend the steps.
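
The schedule described here is a step decay: the learning rate is multiplied by a fixed factor each time training passes a boundary. The sketch below is an illustration under assumptions, not the logic of `train_softmax.py`: the function name `lr_at` is made up, the factor of 0.1 is inferred from the logs in this thread (four boundaries take lr from 0.1 to 1e-05), and the exact step at which the script applies each drop is not verified.

```python
def lr_at(step, base_lr=0.1, lr_steps=(40000,), factor=0.1):
    """Learning rate at `step` for a step-decay schedule.

    Assumption: the rate is multiplied by `factor` once for every
    boundary in `lr_steps` that has been reached.
    """
    drops = sum(1 for boundary in lr_steps if step >= boundary)
    return base_lr * factor ** drops


# lr 0.1 for the first 40000 steps, then 0.01 afterwards.
for step in (0, 39999, 40000, 60000):
    print(step, lr_at(step))

# The four boundaries from the earlier command end at roughly 1e-05.
print(lr_at(502000, lr_steps=(160000, 240000, 280000, 320000)))
```

Passing the boundaries as a tuple mirrors the comma-separated `--lr-steps` argument used in the commands above.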

@muzi2045

Thanks, I'll try it next time.

@qidiso
Author

qidiso commented May 13, 2018

I trained again and again, and now I get a better result:
lr-batch-epoch: 0.001 7999 0
testing verification..
(12000, 128)
infer time 12.323731
[lfw][8000]XNorm: 11.118196
[lfw][8000]Accuracy-Flip: 0.99583+-0.00375
testing verification..
(14000, 128)
infer time 14.580451
[cfp_fp][8000]XNorm: 9.335661
[cfp_fp][8000]Accuracy-Flip: 0.88786+-0.01615
testing verification..
(12000, 128)
infer time 12.362448
[agedb_30][8000]XNorm: 11.044563
[agedb_30][8000]Accuracy-Flip: 0.96083+-0.00827
saving 4
INFO:root:Saved checkpoint to "../models/MobileFaceNet/model-y1-arcface-0004.params"
[8000]Accuracy-Highest: 0.96133

@moli232777144
Owner

Good job! Did you modify the parameters? What was the fine-tuning process?

@muzi2045

Did you train on a single card?
I trained on a single Titan X, but in the end I only got lfw: 99.47%, agedb_30: 99.53% at step 2.
Maybe I need to change the batch size from 256 to 512, but unfortunately there is not enough CUDA memory.

@qidiso
Author

qidiso commented May 14, 2018

@moli232777144 Fine-tuning. But I trained step 3 (s=128) first, and then trained step 2 (s=64).

@xxllp

xxllp commented May 22, 2018

How much GPU memory does everyone have? I can't even fit a batch size of 128.

@yc-huang

@xxllp 8 GB of GPU memory should support a batch size of 180. Also, putting the data on an SSD noticeably speeds up training; a single 1070 can reach 450 samples/s.

@zhangxiaopang88

Hello, what accuracy is the acc value printed during training? @moli232777144

@moli232777144
Owner

It is the classification accuracy on the training data itself. @zhangxiaopang88
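
To make the distinction concrete: a training-time classification accuracy is the fraction of samples in a batch whose highest-scoring class equals the identity label. It is a different quantity from the `Accuracy-Flip` numbers, which the logs report on the lfw, cfp_fp, and agedb_30 benchmarks. The snippet below is a generic illustration with made-up scores, not code from `train_softmax.py`.

```python
def batch_accuracy(logits, labels):
    """Fraction of samples whose highest-scoring class equals the label."""
    correct = 0
    for scores, label in zip(logits, labels):
        # Index of the largest score is the predicted class.
        predicted = max(range(len(scores)), key=scores.__getitem__)
        correct += int(predicted == label)
    return correct / len(labels)


# Hypothetical scores for a batch of 4 samples over 3 identities.
logits = [[0.1, 2.0, 0.3], [1.5, 0.2, 0.1], [0.0, 0.1, 0.9], [0.4, 0.3, 0.2]]
labels = [1, 0, 2, 1]
print(batch_accuracy(logits, labels))  # 3 of the 4 predictions are correct
```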

@zhangxiaopang88

zhangxiaopang88 commented Jan 21, 2019

Oh, thank you. I trained on the asia-celebrity dataset, and the accuracy stays around 0.44. Do you have any training tips or suggestions? @moli232777144
