BERT Pre-Training loss does not reduce when upgrading to TF 2.4 #45

Open
piyushghai opened this issue Dec 24, 2020 · 4 comments

piyushghai commented Dec 24, 2020

When training BERT with TF 2.3, the loss decreased and MLM_acc was non-zero.
After upgrading to TF 2.4 and running the same script, the loss does not decrease and MLM_acc stays at 0.0.

Note: The hyperparameters were unchanged between the TF 2.3 and TF 2.4 runs.

Here are the logs of a 2-node run:

[1,0]<stdout>:2020-12-23 22:57:09,931 __main__    : INFO     Train step 10 -- Loss: 11.170, MLM: 10.479, SOP: 0.691, MLM_acc: 0.000, SOP_acc: 0.523 -- It/s: 0.04
[1,0]<stdout>:2020-12-23 22:57:15,716 __main__    : INFO     Train step 20 -- Loss: 11.175, MLM: 10.481, SOP: 0.693, MLM_acc: 0.000, SOP_acc: 0.551 -- It/s: 1.73
[1,0]<stdout>:2020-12-23 22:57:21,492 __main__    : INFO     Train step 30 -- Loss: 11.183, MLM: 10.484, SOP: 0.699, MLM_acc: 0.000, SOP_acc: 0.518 -- It/s: 1.73
[1,0]<stdout>:2020-12-23 22:57:27,349 __main__    : INFO     Train step 40 -- Loss: 11.166, MLM: 10.473, SOP: 0.692, MLM_acc: 0.000, SOP_acc: 0.532 -- It/s: 1.71
[1,0]<stdout>:2020-12-23 22:57:33,201 __main__    : INFO     Train step 50 -- Loss: 11.170, MLM: 10.478, SOP: 0.692, MLM_acc: 0.000, SOP_acc: 0.539 -- It/s: 1.71
[1,0]<stdout>:2020-12-23 22:57:39,176 __main__    : INFO     Train step 60 -- Loss: 11.165, MLM: 10.475, SOP: 0.690, MLM_acc: 0.000, SOP_acc: 0.532 -- It/s: 1.67
[1,0]<stdout>:2020-12-23 22:57:45,032 __main__    : INFO     Train step 70 -- Loss: 11.182, MLM: 10.481, SOP: 0.700, MLM_acc: 0.000, SOP_acc: 0.500 -- It/s: 1.71
[1,0]<stdout>:2020-12-23 22:57:50,938 __main__    : INFO     Train step 80 -- Loss: 11.155, MLM: 10.472, SOP: 0.683, MLM_acc: 0.000, SOP_acc: 0.566 -- It/s: 1.69
[1,0]<stdout>:2020-12-23 22:57:56,830 __main__    : INFO     Train step 90 -- Loss: 11.172, MLM: 10.477, SOP: 0.696, MLM_acc: 0.000, SOP_acc: 0.512 -- It/s: 1.70
[1,0]<stdout>:2020-12-23 22:58:02,750 __main__    : INFO     Train step 100 -- Loss: 11.175, MLM: 10.480, SOP: 0.695, MLM_acc: 0.000, SOP_acc: 0.504 -- It/s: 1.69
[1,0]<stdout>:2020-12-23 22:58:08,727 __main__    : INFO     Train step 110 -- Loss: 11.175, MLM: 10.482, SOP: 0.692, MLM_acc: 0.000, SOP_acc: 0.525 -- It/s: 1.67
[1,0]<stdout>:2020-12-23 22:58:14,774 __main__    : INFO     Train step 120 -- Loss: 11.171, MLM: 10.478, SOP: 0.693, MLM_acc: 0.000, SOP_acc: 0.519 -- It/s: 1.65
[1,0]<stdout>:2020-12-23 22:58:20,760 __main__    : INFO     Train step 130 -- Loss: 11.171, MLM: 10.480, SOP: 0.690, MLM_acc: 0.000, SOP_acc: 0.542 -- It/s: 1.67
[1,0]<stdout>:2020-12-23 22:58:26,685 __main__    : INFO     Train step 140 -- Loss: 11.168, MLM: 10.475, SOP: 0.693, MLM_acc: 0.000, SOP_acc: 0.515 -- It/s: 1.69
[1,0]<stdout>:2020-12-23 22:58:32,692 __main__    : INFO     Train step 150 -- Loss: 11.165, MLM: 10.472, SOP: 0.692, MLM_acc: 0.000, SOP_acc: 0.525 -- It/s: 1.66
[1,0]<stdout>:2020-12-23 22:58:38,637 __main__    : INFO     Train step 160 -- Loss: 11.171, MLM: 10.476, SOP: 0.695, MLM_acc: 0.000, SOP_acc: 0.525 -- It/s: 1.68
[1,0]<stdout>:2020-12-23 22:58:44,603 __main__    : INFO     Train step 170 -- Loss: 11.175, MLM: 10.485, SOP: 0.690, MLM_acc: 0.000, SOP_acc: 0.548 -- It/s: 1.68
[1,0]<stdout>:2020-12-23 22:58:50,504 __main__    : INFO     Train step 180 -- Loss: 11.173, MLM: 10.486, SOP: 0.687, MLM_acc: 0.000, SOP_acc: 0.548 -- It/s: 1.69
[1,0]<stdout>:2020-12-23 22:58:56,448 __main__    : INFO     Train step 190 -- Loss: 11.174, MLM: 10.478, SOP: 0.695, MLM_acc: 0.000, SOP_acc: 0.529 -- It/s: 1.68
[1,0]<stdout>:2020-12-23 22:59:02,442 __main__    : INFO     Train step 200 -- Loss: 11.171, MLM: 10.479, SOP: 0.692, MLM_acc: 0.000, SOP_acc: 0.527 -- It/s: 1.67
[1,0]<stdout>:2020-12-23 22:59:08,371 __main__    : INFO     Train step 210 -- Loss: 11.176, MLM: 10.478, SOP: 0.698, MLM_acc: 0.000, SOP_acc: 0.490 -- It/s: 1.69
[1,0]<stdout>:2020-12-23 22:59:14,360 __main__    : INFO     Train step 220 -- Loss: 11.172, MLM: 10.479, SOP: 0.692, MLM_acc: 0.000, SOP_acc: 0.530 -- It/s: 1.67
[1,0]<stdout>:2020-12-23 22:59:20,316 __main__    : INFO     Train step 230 -- Loss: 11.174, MLM: 10.476, SOP: 0.698, MLM_acc: 0.000, SOP_acc: 0.515 -- It/s: 1.68
[1,0]<stdout>:2020-12-23 22:59:26,258 __main__    : INFO     Train step 240 -- Loss: 11.173, MLM: 10.480, SOP: 0.692, MLM_acc: 0.000, SOP_acc: 0.521 -- It/s: 1.68
[1,0]<stdout>:2020-12-23 22:59:32,202 __main__    : INFO     Train step 250 -- Loss: 11.166, MLM: 10.477, SOP: 0.688, MLM_acc: 0.000, SOP_acc: 0.547 -- It/s: 1.68
[1,0]<stdout>:2020-12-23 22:59:38,197 __main__    : INFO     Train step 260 -- Loss: 11.175, MLM: 10.480, SOP: 0.695, MLM_acc: 0.000, SOP_acc: 0.534 -- It/s: 1.67
[1,0]<stdout>:2020-12-23 22:59:44,114 __main__    : INFO     Train step 270 -- Loss: 11.166, MLM: 10.470, SOP: 0.695, MLM_acc: 0.000, SOP_acc: 0.530 -- It/s: 1.69
[1,0]<stdout>:2020-12-23 22:59:50,143 __main__    : INFO     Train step 280 -- Loss: 11.170, MLM: 10.472, SOP: 0.698, MLM_acc: 0.000, SOP_acc: 0.503 -- It/s: 1.66
[1,0]<stdout>:2020-12-23 22:59:56,261 __main__    : INFO     Train step 290 -- Loss: 11.170, MLM: 10.474, SOP: 0.696, MLM_acc: 0.000, SOP_acc: 0.504 -- It/s: 1.63
[1,0]<stdout>:2020-12-23 23:00:02,180 __main__    : INFO     Train step 300 -- Loss: 11.171, MLM: 10.483, SOP: 0.688, MLM_acc: 0.000, SOP_acc: 0.545 -- It/s: 1.69
[1,0]<stdout>:2020-12-23 23:00:08,090 __main__    : INFO     Train step 310 -- Loss: 11.172, MLM: 10.478, SOP: 0.695, MLM_acc: 0.000, SOP_acc: 0.526 -- It/s: 1.69
[1,0]<stdout>:2020-12-23 23:00:13,961 __main__    : INFO     Train step 320 -- Loss: 11.173, MLM: 10.479, SOP: 0.694, MLM_acc: 0.000, SOP_acc: 0.516 -- It/s: 1.70
[1,0]<stdout>:2020-12-23 23:00:19,884 __main__    : INFO     Train step 330 -- Loss: 11.172, MLM: 10.479, SOP: 0.692, MLM_acc: 0.000, SOP_acc: 0.529 -- It/s: 1.69
[1,0]<stdout>:2020-12-23 23:00:25,825 __main__    : INFO     Train step 340 -- Loss: 11.172, MLM: 10.477, SOP: 0.695, MLM_acc: 0.000, SOP_acc: 0.513 -- It/s: 1.68
[1,0]<stdout>:2020-12-23 23:00:31,732 __main__    : INFO     Train step 350 -- Loss: 11.171, MLM: 10.479, SOP: 0.692, MLM_acc: 0.000, SOP_acc: 0.513 -- It/s: 1.69
[1,0]<stdout>:2020-12-23 23:00:37,669 __main__    : INFO     Train step 360 -- Loss: 11.165, MLM: 10.478, SOP: 0.688, MLM_acc: 0.000, SOP_acc: 0.531 -- It/s: 1.68
[1,0]<stdout>:2020-12-23 23:00:43,639 __main__    : INFO     Train step 370 -- Loss: 11.173, MLM: 10.486, SOP: 0.687, MLM_acc: 0.000, SOP_acc: 0.532 -- It/s: 1.68
[1,0]<stdout>:2020-12-23 23:00:49,558 __main__    : INFO     Train step 380 -- Loss: 11.168, MLM: 10.476, SOP: 0.692, MLM_acc: 0.000, SOP_acc: 0.529 -- It/s: 1.69
[1,0]<stdout>:2020-12-23 23:00:55,507 __main__    : INFO     Train step 390 -- Loss: 11.176, MLM: 10.480, SOP: 0.696, MLM_acc: 0.000, SOP_acc: 0.512 -- It/s: 1.68
[1,0]<stdout>:2020-12-23 23:01:01,426 __main__    : INFO     Train step 400 -- Loss: 11.158, MLM: 10.467, SOP: 0.690, MLM_acc: 0.000, SOP_acc: 0.516 -- It/s: 1.69
[1,0]<stdout>:2020-12-23 23:01:07,394 __main__    : INFO     Train step 410 -- Loss: 11.177, MLM: 10.482, SOP: 0.695, MLM_acc: 0.000, SOP_acc: 0.516 -- It/s: 1.68
[1,0]<stdout>:2020-12-23 23:01:13,370 __main__    : INFO     Train step 420 -- Loss: 11.173, MLM: 10.478, SOP: 0.695, MLM_acc: 0.000, SOP_acc: 0.501 -- It/s: 1.67
[1,0]<stdout>:2020-12-23 23:01:19,411 __main__    : INFO     Train step 430 -- Loss: 11.173, MLM: 10.484, SOP: 0.689, MLM_acc: 0.000, SOP_acc: 0.535 -- It/s: 1.66
[1,0]<stdout>:2020-12-23 23:01:25,373 __main__    : INFO     Train step 440 -- Loss: 11.178, MLM: 10.476, SOP: 0.702, MLM_acc: 0.000, SOP_acc: 0.495 -- It/s: 1.68
[1,0]<stdout>:2020-12-23 23:01:31,289 __main__    : INFO     Train step 450 -- Loss: 11.175, MLM: 10.483, SOP: 0.692, MLM_acc: 0.000, SOP_acc: 0.540 -- It/s: 1.69
[1,0]<stdout>:2020-12-23 23:01:37,115 __main__    : INFO     Train step 460 -- Loss: 11.170, MLM: 10.476, SOP: 0.693, MLM_acc: 0.000, SOP_acc: 0.525 -- It/s: 1.72
[1,0]<stdout>:2020-12-23 23:01:43,234 __main__    : INFO     Train step 470 -- Loss: 11.183, MLM: 10.489, SOP: 0.694, MLM_acc: 0.000, SOP_acc: 0.516 -- It/s: 1.63
[1,0]<stdout>:2020-12-23 23:01:49,182 __main__    : INFO     Train step 480 -- Loss: 11.172, MLM: 10.476, SOP: 0.695, MLM_acc: 0.000, SOP_acc: 0.495 -- It/s: 1.68
[1,0]<stdout>:2020-12-23 23:01:55,056 __main__    : INFO     Train step 490 -- Loss: 11.178, MLM: 10.485, SOP: 0.693, MLM_acc: 0.000, SOP_acc: 0.545 -- It/s: 1.70
[1,0]<stdout>:2020-12-23 23:02:01,032 __main__    : INFO     Train step 500 -- Loss: 11.171, MLM: 10.477, SOP: 0.694, MLM_acc: 0.000, SOP_acc: 0.528 -- It/s: 1.67
[1,0]<stdout>:2020-12-23 23:02:06,961 __main__    : INFO     Train step 510 -- Loss: 11.174, MLM: 10.478, SOP: 0.696, MLM_acc: 0.000, SOP_acc: 0.502 -- It/s: 1.69
[1,0]<stdout>:2020-12-23 23:02:12,874 __main__    : INFO     Train step 520 -- Loss: 11.171, MLM: 10.481, SOP: 0.691, MLM_acc: 0.000, SOP_acc: 0.539 -- It/s: 1.69
[1,0]<stdout>:2020-12-23 23:02:18,798 __main__    : INFO     Train step 530 -- Loss: 11.184, MLM: 10.489, SOP: 0.695, MLM_acc: 0.000, SOP_acc: 0.517 -- It/s: 1.69
[1,0]<stdout>:2020-12-23 23:02:24,742 __main__    : INFO     Train step 540 -- Loss: 11.181, MLM: 10.485, SOP: 0.696, MLM_acc: 0.000, SOP_acc: 0.510 -- It/s: 1.68
[1,0]<stdout>:2020-12-23 23:02:30,681 __main__    : INFO     Train step 550 -- Loss: 11.177, MLM: 10.489, SOP: 0.688, MLM_acc: 0.000, SOP_acc: 0.541 -- It/s: 1.68
[1,0]<stdout>:2020-12-23 23:02:36,630 __main__    : INFO     Train step 560 -- Loss: 11.170, MLM: 10.476, SOP: 0.693, MLM_acc: 0.000, SOP_acc: 0.532 -- It/s: 1.68
[1,0]<stdout>:2020-12-23 23:02:42,600 __main__    : INFO     Train step 570 -- Loss: 11.179, MLM: 10.484, SOP: 0.694, MLM_acc: 0.000, SOP_acc: 0.518 -- It/s: 1.68
[1,0]<stdout>:2020-12-23 23:02:48,506 __main__    : INFO     Train step 580 -- Loss: 11.176, MLM: 10.482, SOP: 0.694, MLM_acc: 0.000, SOP_acc: 0.520 -- It/s: 1.69
[1,0]<stdout>:2020-12-23 23:02:54,511 __main__    : INFO     Train step 590 -- Loss: 11.159, MLM: 10.468, SOP: 0.690, MLM_acc: 0.000, SOP_acc: 0.523 -- It/s: 1.67
[1,0]<stdout>:2020-12-23 23:03:00,396 __main__    : INFO     Train step 600 -- Loss: 11.183, MLM: 10.485, SOP: 0.698, MLM_acc: 0.000, SOP_acc: 0.511 -- It/s: 1.70
[1,0]<stdout>:2020-12-23 23:03:06,252 __main__    : INFO     Train step 610 -- Loss: 11.182, MLM: 10.482, SOP: 0.700, MLM_acc: 0.000, SOP_acc: 0.492 -- It/s: 1.71
[1,0]<stdout>:2020-12-23 23:03:12,405 __main__    : INFO     Train step 620 -- Loss: 11.174, MLM: 10.478, SOP: 0.695, MLM_acc: 0.000, SOP_acc: 0.506 -- It/s: 1.63
[1,0]<stdout>:2020-12-23 23:03:18,373 __main__    : INFO     Train step 630 -- Loss: 11.169, MLM: 10.474, SOP: 0.695, MLM_acc: 0.000, SOP_acc: 0.514 -- It/s: 1.68
[1,0]<stdout>:2020-12-23 23:03:24,541 __main__    : INFO     Train step 640 -- Loss: 11.177, MLM: 10.487, SOP: 0.689, MLM_acc: 0.000, SOP_acc: 0.527 -- It/s: 1.62
[1,0]<stdout>:2020-12-23 23:03:30,506 __main__    : INFO     Train step 650 -- Loss: 11.176, MLM: 10.478, SOP: 0.698, MLM_acc: 0.000, SOP_acc: 0.514 -- It/s: 1.68
[1,0]<stdout>:2020-12-23 23:03:36,466 __main__    : INFO     Train step 660 -- Loss: 11.171, MLM: 10.479, SOP: 0.692, MLM_acc: 0.000, SOP_acc: 0.521 -- It/s: 1.68
[1,0]<stdout>:2020-12-23 23:03:42,454 __main__    : INFO     Train step 670 -- Loss: 11.169, MLM: 10.479, SOP: 0.691, MLM_acc: 0.000, SOP_acc: 0.551 -- It/s: 1.67
[1,0]<stdout>:2020-12-23 23:03:48,455 __main__    : INFO     Train step 680 -- Loss: 11.175, MLM: 10.479, SOP: 0.696, MLM_acc: 0.000, SOP_acc: 0.515 -- It/s: 1.67
[1,0]<stdout>:2020-12-23 23:03:54,337 __main__    : INFO     Train step 690 -- Loss: 11.178, MLM: 10.482, SOP: 0.695, MLM_acc: 0.000, SOP_acc: 0.518 -- It/s: 1.70
[1,0]<stdout>:2020-12-23 23:04:00,255 __main__    : INFO     Train step 700 -- Loss: 11.166, MLM: 10.474, SOP: 0.692, MLM_acc: 0.000, SOP_acc: 0.518 -- It/s: 1.69
[1,0]<stdout>:2020-12-23 23:04:06,327 __main__    : INFO     Train step 710 -- Loss: 11.172, MLM: 10.480, SOP: 0.691, MLM_acc: 0.000, SOP_acc: 0.537 -- It/s: 1.65
[1,0]<stdout>:2020-12-23 23:04:12,316 __main__    : INFO     Train step 720 -- Loss: 11.163, MLM: 10.468, SOP: 0.695, MLM_acc: 0.000, SOP_acc: 0.492 -- It/s: 1.67
[1,0]<stdout>:2020-12-23 23:04:18,179 __main__    : INFO     Train step 730 -- Loss: 11.172, MLM: 10.478, SOP: 0.694, MLM_acc: 0.000, SOP_acc: 0.541 -- It/s: 1.71
[1,0]<stdout>:2020-12-23 23:04:24,065 __main__    : INFO     Train step 740 -- Loss: 11.176, MLM: 10.481, SOP: 0.694, MLM_acc: 0.000, SOP_acc: 0.512 -- It/s: 1.70
[1,0]<stdout>:2020-12-23 23:04:29,921 __main__    : INFO     Train step 750 -- Loss: 11.178, MLM: 10.480, SOP: 0.697, MLM_acc: 0.000, SOP_acc: 0.499 -- It/s: 1.71
[1,0]<stdout>:2020-12-23 23:04:35,780 __main__    : INFO     Train step 760 -- Loss: 11.168, MLM: 10.477, SOP: 0.691, MLM_acc: 0.000, SOP_acc: 0.541 -- It/s: 1.71
[1,0]<stdout>:2020-12-23 23:04:41,681 __main__    : INFO     Train step 770 -- Loss: 11.169, MLM: 10.480, SOP: 0.690, MLM_acc: 0.000, SOP_acc: 0.534 -- It/s: 1.69
[1,0]<stdout>:2020-12-23 23:04:47,571 __main__    : INFO     Train step 780 -- Loss: 11.170, MLM: 10.477, SOP: 0.692, MLM_acc: 0.000, SOP_acc: 0.519 -- It/s: 1.70
[1,0]<stdout>:2020-12-23 23:04:53,457 __main__    : INFO     Train step 790 -- Loss: 11.171, MLM: 10.473, SOP: 0.697, MLM_acc: 0.000, SOP_acc: 0.510 -- It/s: 1.70
[1,0]<stdout>:2020-12-23 23:04:59,328 __main__    : INFO     Train step 800 -- Loss: 11.165, MLM: 10.474, SOP: 0.690, MLM_acc: 0.000, SOP_acc: 0.551 -- It/s: 1.70
[1,0]<stdout>:2020-12-23 23:05:05,221 __main__    : INFO     Train step 810 -- Loss: 11.163, MLM: 10.474, SOP: 0.689, MLM_acc: 0.000, SOP_acc: 0.535 -- It/s: 1.70
[1,0]<stdout>:2020-12-23 23:05:11,130 __main__    : INFO     Train step 820 -- Loss: 11.174, MLM: 10.480, SOP: 0.694, MLM_acc: 0.000, SOP_acc: 0.524 -- It/s: 1.69
[1,0]<stdout>:2020-12-23 23:05:16,990 __main__    : INFO     Train step 830 -- Loss: 11.177, MLM: 10.481, SOP: 0.695, MLM_acc: 0.000, SOP_acc: 0.511 -- It/s: 1.71
[1,0]<stdout>:2020-12-23 23:05:22,870 __main__    : INFO     Train step 840 -- Loss: 11.173, MLM: 10.484, SOP: 0.690, MLM_acc: 0.000, SOP_acc: 0.524 -- It/s: 1.70
[1,0]<stdout>:2020-12-23 23:05:28,805 __main__    : INFO     Train step 850 -- Loss: 11.182, MLM: 10.486, SOP: 0.696, MLM_acc: 0.000, SOP_acc: 0.517 -- It/s: 1.69
[1,0]<stdout>:2020-12-23 23:05:34,714 __main__    : INFO     Train step 860 -- Loss: 11.167, MLM: 10.474, SOP: 0.693, MLM_acc: 0.000, SOP_acc: 0.526 -- It/s: 1.69
[1,0]<stdout>:2020-12-23 23:05:40,662 __main__    : INFO     Train step 870 -- Loss: 11.176, MLM: 10.483, SOP: 0.692, MLM_acc: 0.000, SOP_acc: 0.531 -- It/s: 1.68
[1,0]<stdout>:2020-12-23 23:05:46,613 __main__    : INFO     Train step 880 -- Loss: 11.177, MLM: 10.480, SOP: 0.697, MLM_acc: 0.000, SOP_acc: 0.510 -- It/s: 1.68
[1,0]<stdout>:2020-12-23 23:05:52,549 __main__    : INFO     Train step 890 -- Loss: 11.163, MLM: 10.472, SOP: 0.690, MLM_acc: 0.000, SOP_acc: 0.531 -- It/s: 1.68
[1,0]<stdout>:2020-12-23 23:05:58,431 __main__    : INFO     Train step 900 -- Loss: 11.160, MLM: 10.469, SOP: 0.691, MLM_acc: 0.000, SOP_acc: 0.524 -- It/s: 1.70
[1,0]<stdout>:2020-12-23 23:06:04,313 __main__    : INFO     Train step 910 -- Loss: 11.175, MLM: 10.486, SOP: 0.689, MLM_acc: 0.000, SOP_acc: 0.539 -- It/s: 1.70
[1,0]<stdout>:2020-12-23 23:06:10,194 __main__    : INFO     Train step 920 -- Loss: 11.172, MLM: 10.481, SOP: 0.691, MLM_acc: 0.000, SOP_acc: 0.520 -- It/s: 1.70
[1,0]<stdout>:2020-12-23 23:06:16,125 __main__    : INFO     Train step 930 -- Loss: 11.168, MLM: 10.476, SOP: 0.692, MLM_acc: 0.000, SOP_acc: 0.538 -- It/s: 1.69
[1,0]<stdout>:2020-12-23 23:06:22,022 __main__    : INFO     Train step 940 -- Loss: 11.175, MLM: 10.478, SOP: 0.697, MLM_acc: 0.000, SOP_acc: 0.500 -- It/s: 1.70
[1,0]<stdout>:2020-12-23 23:06:27,930 __main__    : INFO     Train step 950 -- Loss: 11.170, MLM: 10.481, SOP: 0.689, MLM_acc: 0.000, SOP_acc: 0.534 -- It/s: 1.69
[1,0]<stdout>:2020-12-23 23:06:33,872 __main__    : INFO     Train step 960 -- Loss: 11.171, MLM: 10.480, SOP: 0.691, MLM_acc: 0.000, SOP_acc: 0.525 -- It/s: 1.68
[1,0]<stdout>:2020-12-23 23:06:39,835 __main__    : INFO     Train step 970 -- Loss: 11.174, MLM: 10.480, SOP: 0.694, MLM_acc: 0.000, SOP_acc: 0.512 -- It/s: 1.68
[1,0]<stdout>:2020-12-23 23:06:45,746 __main__    : INFO     Train step 980 -- Loss: 11.177, MLM: 10.480, SOP: 0.696, MLM_acc: 0.000, SOP_acc: 0.517 -- It/s: 1.69
[1,0]<stdout>:2020-12-23 23:06:51,632 __main__    : INFO     Train step 990 -- Loss: 11.173, MLM: 10.481, SOP: 0.693, MLM_acc: 0.000, SOP_acc: 0.529 -- It/s: 1.70
[1,0]<stdout>:2020-12-23 23:06:57,599 __main__    : INFO     Train step 1000 -- Loss: 11.171, MLM: 10.480, SOP: 0.691, MLM_acc: 0.000, SOP_acc: 0.543 -- It/s: 1.68
[1,0]<stdout>:2020-12-23 23:07:03,524 __main__    : INFO     Train step 1010 -- Loss: 11.172, MLM: 10.477, SOP: 0.695, MLM_acc: 0.000, SOP_acc: 0.518 -- It/s: 1.69
[1,0]<stdout>:2020-12-23 23:07:09,347 __main__    : INFO     Train step 1020 -- Loss: 11.171, MLM: 10.478, SOP: 0.693, MLM_acc: 0.000, SOP_acc: 0.520 -- It/s: 1.72
[1,0]<stdout>:2020-12-23 23:07:15,191 __main__    : INFO     Train step 1030 -- Loss: 11.164, MLM: 10.477, SOP: 0.687, MLM_acc: 0.000, SOP_acc: 0.545 -- It/s: 1.71
[1,0]<stdout>:2020-12-23 23:07:21,094 __main__    : INFO     Train step 1040 -- Loss: 11.174, MLM: 10.479, SOP: 0.695, MLM_acc: 0.000, SOP_acc: 0.521 -- It/s: 1.69
[1,0]<stdout>:2020-12-23 23:07:26,961 __main__    : INFO     Train step 1050 -- Loss: 11.174, MLM: 10.478, SOP: 0.696, MLM_acc: 0.000, SOP_acc: 0.509 -- It/s: 1.70
[1,0]<stdout>:2020-12-23 23:07:32,810 __main__    : INFO     Train step 1060 -- Loss: 11.176, MLM: 10.483, SOP: 0.694, MLM_acc: 0.000, SOP_acc: 0.533 -- It/s: 1.71
[1,0]<stdout>:2020-12-23 23:07:38,801 __main__    : INFO     Train step 1070 -- Loss: 11.171, MLM: 10.475, SOP: 0.695, MLM_acc: 0.000, SOP_acc: 0.517 -- It/s: 1.67
[1,0]<stdout>:2020-12-23 23:07:44,775 __main__    : INFO     Train step 1080 -- Loss: 11.166, MLM: 10.477, SOP: 0.689, MLM_acc: 0.000, SOP_acc: 0.553 -- It/s: 1.67
[1,0]<stdout>:2020-12-23 23:07:50,677 __main__    : INFO     Train step 1090 -- Loss: 11.170, MLM: 10.479, SOP: 0.691, MLM_acc: 0.000, SOP_acc: 0.546 -- It/s: 1.69
[1,0]<stdout>:2020-12-23 23:07:56,539 __main__    : INFO     Train step 1100 -- Loss: 11.174, MLM: 10.481, SOP: 0.693, MLM_acc: 0.000, SOP_acc: 0.524 -- It/s: 1.71
[1,0]<stdout>:2020-12-23 23:08:02,402 __main__    : INFO     Train step 1110 -- Loss: 11.177, MLM: 10.481, SOP: 0.696, MLM_acc: 0.000, SOP_acc: 0.498 -- It/s: 1.71
[1,0]<stdout>:2020-12-23 23:08:08,251 __main__    : INFO     Train step 1120 -- Loss: 11.165, MLM: 10.474, SOP: 0.691, MLM_acc: 0.000, SOP_acc: 0.516 -- It/s: 1.71
[1,0]<stdout>:2020-12-23 23:08:14,125 __main__    : INFO     Train step 1130 -- Loss: 11.178, MLM: 10.482, SOP: 0.696, MLM_acc: 0.000, SOP_acc: 0.507 -- It/s: 1.70
[1,0]<stdout>:2020-12-23 23:08:19,998 __main__    : INFO     Train step 1140 -- Loss: 11.164, MLM: 10.476, SOP: 0.689, MLM_acc: 0.000, SOP_acc: 0.554 -- It/s: 1.70
[1,0]<stdout>:2020-12-23 23:08:25,866 __main__    : INFO     Train step 1150 -- Loss: 11.171, MLM: 10.480, SOP: 0.691, MLM_acc: 0.000, SOP_acc: 0.541 -- It/s: 1.70
[1,0]<stdout>:2020-12-23 23:08:31,735 __main__    : INFO     Train step 1160 -- Loss: 11.177, MLM: 10.484, SOP: 0.693, MLM_acc: 0.000, SOP_acc: 0.524 -- It/s: 1.70
[1,0]<stdout>:2020-12-23 23:08:37,570 __main__    : INFO     Train step 1170 -- Loss: 11.166, MLM: 10.478, SOP: 0.688, MLM_acc: 0.000, SOP_acc: 0.536 -- It/s: 1.71
[1,0]<stdout>:2020-12-23 23:08:43,434 __main__    : INFO     Train step 1180 -- Loss: 11.175, MLM: 10.485, SOP: 0.690, MLM_acc: 0.000, SOP_acc: 0.514 -- It/s: 1.71
[1,0]<stdout>:2020-12-23 23:08:49,277 __main__    : INFO     Train step 1190 -- Loss: 11.179, MLM: 10.487, SOP: 0.692, MLM_acc: 0.000, SOP_acc: 0.517 -- It/s: 1.71
[1,0]<stdout>:2020-12-23 23:08:55,143 __main__    : INFO     Train step 1200 -- Loss: 11.174, MLM: 10.476, SOP: 0.698, MLM_acc: 0.000, SOP_acc: 0.500 -- It/s: 1.70
[1,0]<stdout>:2020-12-23 23:09:01,213 __main__    : INFO     Train step 1210 -- Loss: 11.163, MLM: 10.477, SOP: 0.686, MLM_acc: 0.000, SOP_acc: 0.539 -- It/s: 1.65
[1,0]<stdout>:2020-12-23 23:09:07,170 __main__    : INFO     Train step 1220 -- Loss: 11.172, MLM: 10.482, SOP: 0.690, MLM_acc: 0.000, SOP_acc: 0.528 -- It/s: 1.68
[1,0]<stdout>:2020-12-23 23:09:13,011 __main__    : INFO     Train step 1230 -- Loss: 11.186, MLM: 10.487, SOP: 0.699, MLM_acc: 0.000, SOP_acc: 0.507 -- It/s: 1.71
[1,0]<stdout>:2020-12-23 23:09:18,922 __main__    : INFO     Train step 1240 -- Loss: 11.169, MLM: 10.475, SOP: 0.694, MLM_acc: 0.000, SOP_acc: 0.522 -- It/s: 1.69
[1,0]<stdout>:2020-12-23 23:09:24,792 __main__    : INFO     Train step 1250 -- Loss: 11.167, MLM: 10.481, SOP: 0.686, MLM_acc: 0.000, SOP_acc: 0.546 -- It/s: 1.70
[1,0]<stdout>:2020-12-23 23:09:30,651 __main__    : INFO     Train step 1260 -- Loss: 11.181, MLM: 10.485, SOP: 0.697, MLM_acc: 0.000, SOP_acc: 0.509 -- It/s: 1.71
[1,0]<stdout>:2020-12-23 23:09:36,594 __main__    : INFO     Train step 1270 -- Loss: 11.170, MLM: 10.480, SOP: 0.689, MLM_acc: 0.000, SOP_acc: 0.526 -- It/s: 1.68
[1,0]<stdout>:2020-12-23 23:09:42,490 __main__    : INFO     Train step 1280 -- Loss: 11.171, MLM: 10.481, SOP: 0.690, MLM_acc: 0.000, SOP_acc: 0.519 -- It/s: 1.70
[1,0]<stdout>:2020-12-23 23:09:48,352 __main__    : INFO     Train step 1290 -- Loss: 11.177, MLM: 10.479, SOP: 0.698, MLM_acc: 0.000, SOP_acc: 0.506 -- It/s: 1.71
[1,0]<stdout>:2020-12-23 23:09:54,296 __main__    : INFO     Train step 1300 -- Loss: 11.169, MLM: 10.480, SOP: 0.689, MLM_acc: 0.000, SOP_acc: 0.536 -- It/s: 1.68
[1,0]<stdout>:2020-12-23 23:10:00,242 __main__    : INFO     Train step 1310 -- Loss: 11.178, MLM: 10.482, SOP: 0.696, MLM_acc: 0.000, SOP_acc: 0.490 -- It/s: 1.68
[1,0]<stdout>:2020-12-23 23:10:06,177 __main__    : INFO     Train step 1320 -- Loss: 11.164, MLM: 10.472, SOP: 0.692, MLM_acc: 0.000, SOP_acc: 0.500 -- It/s: 1.69
[1,0]<stdout>:2020-12-23 23:10:12,117 __main__    : INFO     Train step 1330 -- Loss: 11.172, MLM: 10.478, SOP: 0.694, MLM_acc: 0.000, SOP_acc: 0.533 -- It/s: 1.68
[1,0]<stdout>:2020-12-23 23:10:18,032 __main__    : INFO     Train step 1340 -- Loss: 11.169, MLM: 10.478, SOP: 0.691, MLM_acc: 0.000, SOP_acc: 0.529 -- It/s: 1.69
[1,0]<stdout>:2020-12-23 23:10:23,889 __main__    : INFO     Train step 1350 -- Loss: 11.176, MLM: 10.479, SOP: 0.697, MLM_acc: 0.000, SOP_acc: 0.510 -- It/s: 1.71
[1,0]<stdout>:2020-12-23 23:10:29,759 __main__    : INFO     Train step 1360 -- Loss: 11.165, MLM: 10.468, SOP: 0.698, MLM_acc: 0.000, SOP_acc: 0.505 -- It/s: 1.70
[1,0]<stdout>:2020-12-23 23:10:35,608 __main__    : INFO     Train step 1370 -- Loss: 11.176, MLM: 10.487, SOP: 0.690, MLM_acc: 0.000, SOP_acc: 0.523 -- It/s: 1.71
[1,0]<stdout>:2020-12-23 23:10:41,468 __main__    : INFO     Train step 1380 -- Loss: 11.180, MLM: 10.485, SOP: 0.695, MLM_acc: 0.000, SOP_acc: 0.526 -- It/s: 1.71
[1,0]<stdout>:2020-12-23 23:10:47,361 __main__    : INFO     Train step 1390 -- Loss: 11.177, MLM: 10.474, SOP: 0.703, MLM_acc: 0.000, SOP_acc: 0.502 -- It/s: 1.70
[1,0]<stdout>:2020-12-23 23:10:53,279 __main__    : INFO     Train step 1400 -- Loss: 11.167, MLM: 10.475, SOP: 0.693, MLM_acc: 0.000, SOP_acc: 0.529 -- It/s: 1.69
[1,0]<stdout>:2020-12-23 23:10:59,189 __main__    : INFO     Train step 1410 -- Loss: 11.179, MLM: 10.481, SOP: 0.698, MLM_acc: 0.000, SOP_acc: 0.500 -- It/s: 1.69
[1,0]<stdout>:2020-12-23 23:11:05,079 __main__    : INFO     Train step 1420 -- Loss: 11.185, MLM: 10.483, SOP: 0.701, MLM_acc: 0.000, SOP_acc: 0.491 -- It/s: 1.70
[1,0]<stdout>:2020-12-23 23:11:11,194 __main__    : INFO     Train step 1430 -- Loss: 11.174, MLM: 10.480, SOP: 0.694, MLM_acc: 0.000, SOP_acc: 0.501 -- It/s: 1.64
[1,0]<stdout>:2020-12-23 23:11:17,116 __main__    : INFO     Train step 1440 -- Loss: 11.170, MLM: 10.476, SOP: 0.695, MLM_acc: 0.000, SOP_acc: 0.518 -- It/s: 1.69
[1,0]<stdout>:2020-12-23 23:11:22,995 __main__    : INFO     Train step 1450 -- Loss: 11.160, MLM: 10.472, SOP: 0.687, MLM_acc: 0.000, SOP_acc: 0.554 -- It/s: 1.70
[1,0]<stdout>:2020-12-23 23:11:28,845 __main__    : INFO     Train step 1460 -- Loss: 11.176, MLM: 10.479, SOP: 0.696, MLM_acc: 0.000, SOP_acc: 0.512 -- It/s: 1.71
[1,0]<stdout>:2020-12-23 23:11:34,799 __main__    : INFO     Train step 1470 -- Loss: 11.175, MLM: 10.479, SOP: 0.696, MLM_acc: 0.000, SOP_acc: 0.518 -- It/s: 1.68
[1,0]<stdout>:2020-12-23 23:11:40,628 __main__    : INFO     Train step 1480 -- Loss: 11.175, MLM: 10.480, SOP: 0.695, MLM_acc: 0.000, SOP_acc: 0.510 -- It/s: 1.72
[1,0]<stdout>:2020-12-23 23:11:46,548 __main__    : INFO     Train step 1490 -- Loss: 11.165, MLM: 10.479, SOP: 0.686, MLM_acc: 0.000, SOP_acc: 0.547 -- It/s: 1.69
[1,0]<stdout>:2020-12-23 23:11:52,491 __main__    : INFO     Train step 1500 -- Loss: 11.160, MLM: 10.472, SOP: 0.689, MLM_acc: 0.000, SOP_acc: 0.541 -- It/s: 1.68
[1,0]<stdout>:2020-12-23 23:11:58,395 __main__    : INFO     Train step 1510 -- Loss: 11.171, MLM: 10.476, SOP: 0.695, MLM_acc: 0.000, SOP_acc: 0.520 -- It/s: 1.69
[1,0]<stdout>:2020-12-23 23:12:04,242 __main__    : INFO     Train step 1520 -- Loss: 11.173, MLM: 10.480, SOP: 0.693, MLM_acc: 0.000, SOP_acc: 0.519 -- It/s: 1.71
[1,0]<stdout>:2020-12-23 23:12:10,153 __main__    : INFO     Train step 1530 -- Loss: 11.174, MLM: 10.483, SOP: 0.691, MLM_acc: 0.000, SOP_acc: 0.536 -- It/s: 1.69
[1,0]<stdout>:2020-12-23 23:12:16,036 __main__    : INFO     Train step 1540 -- Loss: 11.176, MLM: 10.482, SOP: 0.694, MLM_acc: 0.000, SOP_acc: 0.528 -- It/s: 1.70
[1,0]<stdout>:2020-12-23 23:12:21,889 __main__    : INFO     Train step 1550 -- Loss: 11.169, MLM: 10.477, SOP: 0.693, MLM_acc: 0.000, SOP_acc: 0.535 -- It/s: 1.71
[1,0]<stdout>:2020-12-23 23:12:27,721 __main__    : INFO     Train step 1560 -- Loss: 11.172, MLM: 10.480, SOP: 0.693, MLM_acc: 0.000, SOP_acc: 0.509 -- It/s: 1.71
[1,0]<stdout>:2020-12-23 23:12:33,562 __main__    : INFO     Train step 1570 -- Loss: 11.166, MLM: 10.469, SOP: 0.697, MLM_acc: 0.000, SOP_acc: 0.512 -- It/s: 1.71
[1,0]<stdout>:2020-12-23 23:12:39,399 __main__    : INFO     Train step 1580 -- Loss: 11.177, MLM: 10.480, SOP: 0.697, MLM_acc: 0.000, SOP_acc: 0.510 -- It/s: 1.71
[1,0]<stdout>:2020-12-23 23:12:45,210 __main__    : INFO     Train step 1590 -- Loss: 11.175, MLM: 10.483, SOP: 0.692, MLM_acc: 0.000, SOP_acc: 0.528 -- It/s: 1.72
[1,0]<stdout>:2020-12-23 23:12:51,021 __main__    : INFO     Train step 1600 -- Loss: 11.170, MLM: 10.476, SOP: 0.694, MLM_acc: 0.000, SOP_acc: 0.530 -- It/s: 1.72
[1,0]<stdout>:2020-12-23 23:12:56,942 __main__    : INFO     Train step 1610 -- Loss: 11.173, MLM: 10.478, SOP: 0.695, MLM_acc: 0.000, SOP_acc: 0.499 -- It/s: 1.69
[1,0]<stdout>:2020-12-23 23:13:02,829 __main__    : INFO     Train step 1620 -- Loss: 11.179, MLM: 10.479, SOP: 0.700, MLM_acc: 0.000, SOP_acc: 0.513 -- It/s: 1.70
[1,0]<stdout>:2020-12-23 23:13:08,808 __main__    : INFO     Train step 1630 -- Loss: 11.167, MLM: 10.474, SOP: 0.693, MLM_acc: 0.000, SOP_acc: 0.511 -- It/s: 1.67
[1,0]<stdout>:2020-12-23 23:13:14,744 __main__    : INFO     Train step 1640 -- Loss: 11.164, MLM: 10.475, SOP: 0.688, MLM_acc: 0.000, SOP_acc: 0.524 -- It/s: 1.68
[1,0]<stdout>:2020-12-23 23:13:20,644 __main__    : INFO     Train step 1650 -- Loss: 11.174, MLM: 10.479, SOP: 0.695, MLM_acc: 0.000, SOP_acc: 0.519 -- It/s: 1.70
[1,0]<stdout>:2020-12-23 23:13:26,562 __main__    : INFO     Train step 1660 -- Loss: 11.166, MLM: 10.472, SOP: 0.694, MLM_acc: 0.000, SOP_acc: 0.521 -- It/s: 1.69
[1,0]<stdout>:2020-12-23 23:13:32,444 __main__    : INFO     Train step 1670 -- Loss: 11.179, MLM: 10.480, SOP: 0.699, MLM_acc: 0.000, SOP_acc: 0.510 -- It/s: 1.70
[1,0]<stdout>:2020-12-23 23:13:38,346 __main__    : INFO     Train step 1680 -- Loss: 11.171, MLM: 10.478, SOP: 0.693, MLM_acc: 0.000, SOP_acc: 0.515 -- It/s: 1.69
[1,0]<stdout>:2020-12-23 23:13:44,240 __main__    : INFO     Train step 1690 -- Loss: 11.173, MLM: 10.478, SOP: 0.694, MLM_acc: 0.000, SOP_acc: 0.527 -- It/s: 1.70
[1,0]<stdout>:2020-12-23 23:13:50,151 __main__    : INFO     Train step 1700 -- Loss: 11.176, MLM: 10.475, SOP: 0.701, MLM_acc: 0.000, SOP_acc: 0.506 -- It/s: 1.69
[1,0]<stdout>:2020-12-23 23:13:56,110 __main__    : INFO     Train step 1710 -- Loss: 11.171, MLM: 10.484, SOP: 0.687, MLM_acc: 0.000, SOP_acc: 0.542 -- It/s: 1.68
[1,0]<stdout>:2020-12-23 23:14:02,000 __main__    : INFO     Train step 1720 -- Loss: 11.174, MLM: 10.474, SOP: 0.700, MLM_acc: 0.000, SOP_acc: 0.495 -- It/s: 1.70
[1,0]<stdout>:2020-12-23 23:14:07,881 __main__    : INFO     Train step 1730 -- Loss: 11.167, MLM: 10.475, SOP: 0.692, MLM_acc: 0.000, SOP_acc: 0.531 -- It/s: 1.70
[1,0]<stdout>:2020-12-23 23:14:13,857 __main__    : INFO     Train step 1740 -- Loss: 11.184, MLM: 10.486, SOP: 0.698, MLM_acc: 0.000, SOP_acc: 0.494 -- It/s: 1.67
[1,0]<stdout>:2020-12-23 23:14:19,810 __main__    : INFO     Train step 1750 -- Loss: 11.171, MLM: 10.481, SOP: 0.691, MLM_acc: 0.000, SOP_acc: 0.521 -- It/s: 1.68
[1,0]<stdout>:2020-12-23 23:14:25,780 __main__    : INFO     Train step 1760 -- Loss: 11.174, MLM: 10.475, SOP: 0.699, MLM_acc: 0.000, SOP_acc: 0.488 -- It/s: 1.68
[1,0]<stdout>:2020-12-23 23:14:31,648 __main__    : INFO     Train step 1770 -- Loss: 11.175, MLM: 10.482, SOP: 0.693, MLM_acc: 0.000, SOP_acc: 0.510 -- It/s: 1.70
[1,0]<stdout>:2020-12-23 23:14:37,492 __main__    : INFO     Train step 1780 -- Loss: 11.176, MLM: 10.480, SOP: 0.695, MLM_acc: 0.000, SOP_acc: 0.521 -- It/s: 1.71
[1,0]<stdout>:2020-12-23 23:14:43,351 __main__    : INFO     Train step 1790 -- Loss: 11.171, MLM: 10.481, SOP: 0.691, MLM_acc: 0.000, SOP_acc: 0.529 -- It/s: 1.71
[1,0]<stdout>:2020-12-23 23:14:49,199 __main__    : INFO     Train step 1800 -- Loss: 11.164, MLM: 10.475, SOP: 0.689, MLM_acc: 0.000, SOP_acc: 0.555 -- It/s: 1.71
[1,0]<stdout>:2020-12-23 23:14:55,059 __main__    : INFO     Train step 1810 -- Loss: 11.184, MLM: 10.483, SOP: 0.701, MLM_acc: 0.000, SOP_acc: 0.505 -- It/s: 1.71
[1,0]<stdout>:2020-12-23 23:15:00,943 __main__    : INFO     Train step 1820 -- Loss: 11.174, MLM: 10.477, SOP: 0.697, MLM_acc: 0.000, SOP_acc: 0.514 -- It/s: 1.70
[1,0]<stdout>:2020-12-23 23:15:06,876 __main__    : INFO     Train step 1830 -- Loss: 11.167, MLM: 10.476, SOP: 0.691, MLM_acc: 0.000, SOP_acc: 0.539 -- It/s: 1.69
[1,0]<stdout>:2020-12-23 23:15:12,777 __main__    : INFO     Train step 1840 -- Loss: 11.164, MLM: 10.477, SOP: 0.688, MLM_acc: 0.000, SOP_acc: 0.553 -- It/s: 1.69
[1,0]<stdout>:2020-12-23 23:15:18,669 __main__    : INFO     Train step 1850 -- Loss: 11.174, MLM: 10.479, SOP: 0.694, MLM_acc: 0.000, SOP_acc: 0.516 -- It/s: 1.70
[1,0]<stdout>:2020-12-23 23:15:24,637 __main__    : INFO     Train step 1860 -- Loss: 11.169, MLM: 10.476, SOP: 0.694, MLM_acc: 0.000, SOP_acc: 0.514 -- It/s: 1.68
[1,0]<stdout>:2020-12-23 23:15:30,546 __main__    : INFO     Train step 1870 -- Loss: 11.173, MLM: 10.477, SOP: 0.696, MLM_acc: 0.000, SOP_acc: 0.524 -- It/s: 1.69
[1,0]<stdout>:2020-12-23 23:15:36,417 __main__    : INFO     Train step 1880 -- Loss: 11.174, MLM: 10.481, SOP: 0.693, MLM_acc: 0.000, SOP_acc: 0.521 -- It/s: 1.70
[1,0]<stdout>:2020-12-23 23:15:42,269 __main__    : INFO     Train step 1890 -- Loss: 11.168, MLM: 10.477, SOP: 0.691, MLM_acc: 0.000, SOP_acc: 0.511 -- It/s: 1.71
[1,0]<stdout>:2020-12-23 23:15:48,218 __main__    : INFO     Train step 1900 -- Loss: 11.170, MLM: 10.474, SOP: 0.696, MLM_acc: 0.000, SOP_acc: 0.496 -- It/s: 1.68
[1,0]<stdout>:2020-12-23 23:15:54,135 __main__    : INFO     Train step 1910 -- Loss: 11.172, MLM: 10.475, SOP: 0.697, MLM_acc: 0.000, SOP_acc: 0.493 -- It/s: 1.69
[1,0]<stdout>:2020-12-23 23:16:00,246 __main__    : INFO     Train step 1920 -- Loss: 11.174, MLM: 10.478, SOP: 0.697, MLM_acc: 0.000, SOP_acc: 0.512 -- It/s: 1.64
[1,0]<stdout>:2020-12-23 23:16:06,149 __main__    : INFO     Train step 1930 -- Loss: 11.171, MLM: 10.480, SOP: 0.691, MLM_acc: 0.000, SOP_acc: 0.538 -- It/s: 1.69
[1,0]<stdout>:2020-12-23 23:16:12,097 __main__    : INFO     Train step 1940 -- Loss: 11.174, MLM: 10.485, SOP: 0.689, MLM_acc: 0.000, SOP_acc: 0.543 -- It/s: 1.68
[1,0]<stdout>:2020-12-23 23:16:17,982 __main__    : INFO     Train step 1950 -- Loss: 11.168, MLM: 10.476, SOP: 0.692, MLM_acc: 0.000, SOP_acc: 0.514 -- It/s: 1.70
[1,0]<stdout>:2020-12-23 23:16:23,863 __main__    : INFO     Train step 1960 -- Loss: 11.166, MLM: 10.472, SOP: 0.694, MLM_acc: 0.000, SOP_acc: 0.532 -- It/s: 1.70
[1,0]<stdout>:2020-12-23 23:16:29,741 __main__    : INFO     Train step 1970 -- Loss: 11.168, MLM: 10.474, SOP: 0.694, MLM_acc: 0.000, SOP_acc: 0.514 -- It/s: 1.70
[1,0]<stdout>:2020-12-23 23:16:35,632 __main__    : INFO     Train step 1980 -- Loss: 11.178, MLM: 10.481, SOP: 0.696, MLM_acc: 0.000, SOP_acc: 0.513 -- It/s: 1.70
[1,0]<stdout>:2020-12-23 23:16:41,497 __main__    : INFO     Train step 1990 -- Loss: 11.169, MLM: 10.479, SOP: 0.691, MLM_acc: 0.000, SOP_acc: 0.519 -- It/s: 1.71
[1,0]<stdout>:2020-12-23 23:16:47,411 __main__    : INFO     Final step 2000: Loss: 11.180, MLM: 10.484, SOP: 0.696, MLM_acc: 0.000, SOP_acc: 0.519 -- Average seq_per_sec: 1451.38 -- Total Time: 1411.072882219
[1,0]<stdout>:2020-12-23 23:16:47,412 __main__    : INFO     Saving model at /opt/ml/input/data/train/checkpoints/bert/Tensorflow2-VNFEJZ-step2000.ckpt, optimizer at /opt/ml/input/data/train/checkpoints/bert/Tensorflow2-VNFEJZ-step2000-optimizer.npy
[1,0]<stdout>:2020-12-23 23:17:23,935 __main__    : INFO     Validation step 2000 -- Loss: 11.157, MLM: 10.470, SOP: 0.687, MLM_acc: 0.000, SOP_acc: 0.564
[1,0]<stdout>:2020-12-23 23:17:24,069 __main__    : INFO     Finished pretraining, job name Tensorflow2-VNFEJZ
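
One way to read the log above: the SOP loss sits at about ln 2 ≈ 0.693 and SOP_acc at about 0.5, which is what a binary classifier produces when it guesses at chance, while the total loss moves by only a few hundredths over 2000 steps. The following is a minimal sketch for checking this automatically, assuming the log line format shown above. The helper names `parse_train_steps` and `looks_stuck` and the tolerance values are hypothetical and are not part of the training script.

```python
import math
import re

# Matches the "Train step" lines in the format shown in the log above.
LINE_RE = re.compile(
    r"Train step (?P<step>\d+) -- Loss: (?P<loss>[\d.]+), "
    r"MLM: (?P<mlm>[\d.]+), SOP: (?P<sop>[\d.]+), "
    r"MLM_acc: (?P<mlm_acc>[\d.]+), SOP_acc: (?P<sop_acc>[\d.]+)"
)


def parse_train_steps(log_text):
    """Return one dict per 'Train step' line found in log_text."""
    rows = []
    for match in LINE_RE.finditer(log_text):
        row = {key: float(value) for key, value in match.groupdict().items()}
        row["step"] = int(row["step"])
        rows.append(row)
    return rows


def looks_stuck(rows, loss_tolerance=0.05, sop_tolerance=0.02):
    """Hypothetical heuristic for a run that is not learning.

    True when the total loss barely moves, MLM_acc is always zero, and
    the SOP loss stays near ln(2), the chance level for a binary task.
    """
    losses = [row["loss"] for row in rows]
    flat_loss = max(losses) - min(losses) < loss_tolerance
    zero_mlm_acc = all(row["mlm_acc"] == 0.0 for row in rows)
    sop_at_chance = all(
        abs(row["sop"] - math.log(2)) < sop_tolerance for row in rows
    )
    return flat_loss and zero_mlm_acc and sop_at_chance


# Two lines copied from the log above, used as sample input.
SAMPLE = """\
[1,0]<stdout>:2020-12-23 22:57:09,931 __main__    : INFO     Train step 10 -- Loss: 11.170, MLM: 10.479, SOP: 0.691, MLM_acc: 0.000, SOP_acc: 0.523 -- It/s: 0.04
[1,0]<stdout>:2020-12-23 22:57:15,716 __main__    : INFO     Train step 20 -- Loss: 11.175, MLM: 10.481, SOP: 0.693, MLM_acc: 0.000, SOP_acc: 0.551 -- It/s: 1.73
"""

if __name__ == "__main__":
    rows = parse_train_steps(SAMPLE)
    print(len(rows), looks_stuck(rows))
```

Running the same check on a TF 2.3 log would show whether the two runs already differ in the first few hundred steps.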
jarednielsen added a commit that referenced this issue Jan 14, 2021
…outputs for huggingface >= 3.0.0 (#47)

Change optimizer.loss_scale() to optimizer.loss_scale for TF 2.4. Fixes #44.

I ran pretraining on ALBERT for a few thousand steps with XLA enabled on TF 2.4 and observed the loss drop from 11 to 7.5 in 3k steps with global batch size 32. I am not able to replicate #45. It is possible that fixing the named model outputs has solved this issue.
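
The `optimizer.loss_scale()` to `optimizer.loss_scale` change described in the commit message above can be handled in a version-tolerant way. The following is only a sketch: `read_loss_scale` is a hypothetical helper, and the two stand-in classes are illustrations that do not reproduce any TensorFlow class. It assumes that the value exposed in TF 2.4 is not itself callable.

```python
def read_loss_scale(optimizer):
    """Return the loss scale from either access style.

    Calls `optimizer.loss_scale` when it is callable (the older
    `optimizer.loss_scale()` style), otherwise returns the attribute
    as-is (the `optimizer.loss_scale` style used for TF 2.4).
    """
    scale = optimizer.loss_scale
    return scale() if callable(scale) else scale


class CallableStyleOptimizer:
    """Stand-in (hypothetical) whose loss_scale must be called."""

    def loss_scale(self):
        return 32768.0


class AttributeStyleOptimizer:
    """Stand-in (hypothetical) whose loss_scale is a plain property."""

    @property
    def loss_scale(self):
        return 32768.0


if __name__ == "__main__":
    for opt in (CallableStyleOptimizer(), AttributeStyleOptimizer()):
        print(type(opt).__name__, read_loss_scale(opt))
```

A helper like this lets one script run under both TensorFlow versions, which makes a side-by-side comparison easier when bisecting a regression such as this one.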
@jarednielsen
Contributor

I'm not able to replicate these results - I saw the loss drop to 7.5 within 2k steps. Can you try again with the latest version of the codebase? I've updated the Dockerfile to be TF 2.4 with the latest version of transformers.

@piyushghai
Author

@rondogency Can you share the env file from the run where we saw this issue?

@piyushghai
Author

@jarednielsen Did you try to reproduce this before or after the transformers library upgrade?

@jarednielsen
Contributor

After the transformers upgrade to v4.2.0.
