Why does the accuracy shrink proportionally as I add GPUs? #5

Open
dahuaxiya opened this issue Jul 18, 2023 · 2 comments

Comments

@dahuaxiya

My code is as follows:

for epoch in range(10):
    acc_num=0
    for i, (inputs, labels) in enumerate(train_loader):
        # forward
        inputs = inputs.to(device)
        labels = labels.to(device)
        outputs = model(inputs)
        loss = criterion(outputs[0], labels)
        # backward
        optimizer.zero_grad()
        loss.backward()
        optimizer.step()
        # log
        if args.local_rank == 0 and i % 5 == 0:
            tb_writer.add_scalar('loss', loss.item(), i)
        acc_num += (outputs[0].argmax(1)==labels).sum()
    if args.local_rank == 0:
        tb_writer.add_scalar('acc', acc_num/len(train_dataset),epoch)
        print(f"acc:{acc_num/len(train_dataset)}")

With 1 GPU: acc 89%
With 3 GPUs: acc 29%
With 4 GPUs: acc 17%

@jia-zhuang
Owner

My understanding of the likely cause:
DDP is data parallelism, which means there are multiple copies of the model, all starting from the same initial parameters. In each forward/backward pass, every copy only sees the training data in its own partition; after the gradients are computed, the processes communicate and synchronize. You are computing accuracy directly on the current batch, and before the backward step, so the parameters have not been synchronized yet and each model sees different training data: with 1 GPU the model sees the full dataset, with 2 GPUs each model only sees half of it, so the computed accuracy naturally differs.
I suggest computing the metric on the test set after the forward/backward passes are done.
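For reference, another common way to report a dataset-level number under DDP is to all-reduce the per-rank correct count across processes before dividing by the full dataset size. The sketch below is a minimal illustration, assuming `torch.distributed` has already been initialized and that `acc_num` and `train_dataset` come from the snippet above; the helper name `global_accuracy` is hypothetical.

```python
import torch
import torch.distributed as dist

def global_accuracy(acc_num: torch.Tensor, dataset_len: int) -> float:
    # acc_num holds the number of correct predictions seen by *this* rank only.
    # Summing it across all ranks gives the count over the whole dataset,
    # which can then be divided by len(train_dataset).
    total_correct = acc_num.detach().clone().float()
    dist.all_reduce(total_correct, op=dist.ReduceOp.SUM)
    return (total_correct / dataset_len).item()

# Usage at the end of each epoch, called on every rank:
# acc = global_accuracy(acc_num, len(train_dataset))
```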

@dahuaxiya
Author

That's it. I tried it: each GPU only sees a portion of the data, so my calculation was wrong. I should divide by the number of samples each GPU actually sees, not by the total number of samples in the dataset.
Thanks for the explanation!
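As a concrete illustration of that fix, here is a minimal sketch assuming the DataLoader was built with a `DistributedSampler` (so `len(train_loader.sampler)` gives the number of samples this rank processes per epoch); all other names come from the snippet above.

```python
# At the end of the epoch loop, after iterating over train_loader on this rank:
# with a DistributedSampler, len(train_loader.sampler) is the per-rank sample
# count, whereas len(train_dataset) is the size of the full dataset.
samples_seen_by_this_rank = len(train_loader.sampler)
local_acc = acc_num.item() / samples_seen_by_this_rank
if args.local_rank == 0:
    tb_writer.add_scalar('acc', local_acc, epoch)
    print(f"acc: {local_acc}")
```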
