The code is as follows:
for epoch in range(10):
    acc_num = 0
    for i, (inputs, labels) in enumerate(train_loader):
        # forward
        inputs = inputs.to(device)
        labels = labels.to(device)
        outputs = model(inputs)
        loss = criterion(outputs[0], labels)
        # backward
        optimizer.zero_grad()
        loss.backward()
        optimizer.step()
        # log
        if args.local_rank == 0 and i % 5 == 0:
            tb_writer.add_scalar('loss', loss.item(), i)
        acc_num += (outputs[0].argmax(1) == labels).sum()
    if args.local_rank == 0:
        tb_writer.add_scalar('acc', acc_num / len(train_dataset), epoch)
        print(f"acc:{acc_num/len(train_dataset)}")
With 1 GPU: acc 89%
With 3 GPUs: acc 29%
With 4 GPUs: acc 17%
I think the likely cause is this: DDP is data parallelism, which means there are multiple replicas of the model that start from the same initial parameters. In each forward/backward pass, every replica only sees the training data in its own partition; after the gradients are computed, the processes communicate to keep the replicas in sync. In your code the accuracy is computed directly on the current batch, before the backward pass, i.e. before any synchronization, and the training data each replica sees is different: with 1 GPU the model sees the full dataset, while with 2 GPUs each replica only sees half of the data, so the computed accuracy differs. I suggest evaluating the metric on a test set after the forward/backward pass completes. A sketch of the data partitioning follows below.
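For reference, here is a minimal sketch of where the per-rank partition typically comes from, assuming the train_loader in the snippet above was built with torch.utils.data.DistributedSampler and that the process group has already been initialized (neither is shown in the original code, so these names and settings are illustrative only):

import torch.distributed as dist
from torch.utils.data import DataLoader, DistributedSampler

# Hypothetical loader setup; the original issue does not show this part.
# DistributedSampler splits the dataset into world_size disjoint shards,
# so each DDP process only iterates over its own shard.
sampler = DistributedSampler(train_dataset)
train_loader = DataLoader(train_dataset, batch_size=64, sampler=sampler)

# Number of samples this rank sees per epoch:
# roughly len(train_dataset) / dist.get_world_size(), not the full dataset.
samples_per_rank = len(sampler)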
That's exactly it; I tried it out. Each GPU only sees a portion of all the data, and my calculation method was wrong: I should divide by the number of samples each GPU actually sees, not by the total number of samples in the dataset. Thanks to the author for the explanation.
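A sketch of the corrected metric under the same assumptions as above (names such as acc_num, train_loader, tb_writer, and args come from the original snippet): either divide by the number of samples the local rank actually saw, or all-reduce the per-rank correct counts if a single global training accuracy is wanted.

import torch.distributed as dist

# Per-rank accuracy: divide by the size of this rank's shard,
# i.e. len(train_loader.sampler), not len(train_dataset).
local_acc = acc_num / len(train_loader.sampler)

# Optional global accuracy: sum the correct counts across all ranks first.
# acc_num is a CUDA tensor here, accumulated from (... == labels).sum().
total_correct = acc_num.clone().detach()
dist.all_reduce(total_correct, op=dist.ReduceOp.SUM)
global_acc = total_correct.item() / len(train_dataset)

if args.local_rank == 0:
    tb_writer.add_scalar('acc', global_acc, epoch)
    print(f"acc:{global_acc}")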