Numerical instability in Google Colab - Part 4 of Makemore #13

Open
sachag678 opened this issue Oct 13, 2022 · 8 comments · May be fixed by #67

Comments

@sachag678

I ran into an interesting issue in makemore part 4 (backprop ninja), where dhpreact was not exactly matching hpreact.grad.

However, this only happens in the Colab notebook; when I put the same code into a local Jupyter notebook it matches exactly.

Not sure why this would be the case, but it's an odd curiosity.
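For context, part 4 of the lecture checks each manually derived gradient against PyTorch's autograd with a small comparison helper; a minimal sketch of that kind of check (paraphrased, not copied verbatim from the notebook):

import torch

def cmp(name, dt, t):
    # exact: every element of the manual gradient matches autograd bit-for-bit
    exact = torch.all(dt == t.grad).item()
    # approximate: matches up to floating-point tolerance
    approx = torch.allclose(dt, t.grad)
    # largest absolute deviation between manual and autograd gradients
    maxdiff = (dt - t.grad).abs().max().item()
    print(f'{name:15s} | exact: {exact} | approximate: {approx} | maxdiff: {maxdiff}')

# e.g. cmp('hpreact', dhpreact, hpreact) -- the call that shows the mismatch here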

@karpathy
Owner

oh oh

@sachag678
Author

I'm guessing it has something to do with the Python versions?

@JonathanSum

JonathanSum commented Oct 14, 2022

Yes, I have the issue with Colab, but I don't have it with the local VS Code Jupyter notebook.
The local Jupyter notebook Python version is 3.7.13.
The tested Colab notebook version is 3.7.14 (default, Sep 8 2022, 00:06:44) [GCC 7.5.0].

If the diff is that small, maybe it is fine to accept it with some tolerance?
Colab tested notebook: https://colab.research.google.com/drive/1HmZ8bgtAfvyMaZyu3Sr1Bgxsj35jitTs?usp=sharing

Maybe the issue is the PyTorch version?
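One way to "accept" a tiny difference like this is to compare with a tolerance instead of exact equality; a minimal sketch (the rtol/atol values below are torch.allclose's defaults, shown only for illustration):

import torch

# treat the manual gradient as correct if it matches autograd within float32 noise
ok = torch.allclose(dhpreact, hpreact.grad, rtol=1e-5, atol=1e-8)
print(ok, (dhpreact - hpreact.grad).abs().max().item())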

@JonathanSum

JonathanSum commented Oct 14, 2022

I used t.grad.sum() and dt.sum() to compare the sums between Colab and the local notebook.
colab.txt
local.txt

I posted it on the PyTorch forum and got no answer: https://discuss.pytorch.org/t/numerical-instability-in-google-colab/163610
I am planning to post it on the Colab GitHub issues.
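A minimal sketch of how such a dump could be produced for diffing between environments (params and grads are hypothetical names for a dict of tensors with populated .grad and a dict of the manually derived gradients, keyed the same way):

import torch

def dump_sums(params, grads, path):
    # one line per tensor so the Colab and local files can be diffed directly
    with open(path, 'w') as f:
        for name, t in params.items():
            f.write(f'{name}: t.grad.sum()={t.grad.sum().item():.10f} '
                    f'dt.sum()={grads[name].sum().item():.10f}\n')

# dump_sums(params, grads, 'colab.txt')  # run once in Colab, once locally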

@mriganktiwari

(Quoting @JonathanSum's comment above.)

I am getting exactly the same maxdiff for hpreact, and my notebook is running on a local machine with Python 3.9.13 and torch.__version__ == '1.12.1'.

@evgenyfadeev

evgenyfadeev commented Apr 29, 2023

I've got a strange observation (using the Colab version):

dlogit_maxes = -dnorm_logits.sum(dim=1, keepdim=True) gives me exact equality.
dlogit_maxes = -dnorm_logits.sum(dim=1) gives approximate equality with a maxdiff of ~1e-8.

In this example, when the shapes of the gradients are not equal but the comparison is made after broadcasting (I guess), there is a residual difference; otherwise the values match exactly. This might have to do with the accuracy limitations of floating-point operations: the values here are float32, and 1e-8 is close to the precision limit for float32 operations.
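A minimal sketch of the shape difference behind the two variants (the (32, 27) shape is only illustrative, standing in for makemore's (batch, vocab) logits):

import torch

dnorm_logits = torch.randn(32, 27)

# keepdim=True keeps shape (32, 1), the same shape as logit_maxes.grad
dlogit_maxes_keep = -dnorm_logits.sum(dim=1, keepdim=True)

# without keepdim the result has shape (32,), so any elementwise comparison
# against the (32, 1) autograd gradient involves broadcasting
dlogit_maxes_flat = -dnorm_logits.sum(dim=1)

print(dlogit_maxes_keep.shape, dlogit_maxes_flat.shape)  # torch.Size([32, 1]) torch.Size([32])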

I've made a PR for the cmp function to output a comparison of shapes, which could be useful: #36

Another thing is that maybe the order of the arithmetic operations matters. Apparently, addition and multiplication of floats are not associative: https://pytorch.org/docs/stable/notes/numerical_accuracy.html

The docs also say that results may be inconsistent across devices and software commits.
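A tiny illustration of that non-associativity in float32 (the magnitudes are chosen only to make the effect obvious):

import torch

a = torch.tensor(1e8, dtype=torch.float32)
b = torch.tensor(-1e8, dtype=torch.float32)
c = torch.tensor(1.0, dtype=torch.float32)

print((a + b) + c)  # tensor(1.)
print(a + (b + c))  # tensor(0.) -- the 1.0 is absorbed when added to -1e8 first

So two mathematically identical gradient formulas can differ in the last bits simply because the kernels sum terms in a different order.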

@vdyma

vdyma commented Apr 25, 2024

I had the same mismatch between gradients when running locally, because I was storing tensors and doing the computations on the GPU. Once I switched to the CPU, I still had differences in the later computations because of the ordering of operations. I managed to get exact gradient matches by running on the CPU and reordering my computations to be the same as in the lecture.

@conscell

I encountered the same issue on a Linux machine with CPU. Setting the following environment variable resolved the problem:

ATEN_CPU_CAPABILITY=default

To fix the issue in the notebook, add these lines at the very beginning of the notebook, before importing PyTorch:

import os
os.environ['ATEN_CPU_CAPABILITY'] = 'default'

However, this solution does not address issues with Nvidia GPUs, which remain affected.
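A quick way to sanity-check that the setting was picked up before PyTorch was imported (the CPU capability line in torch.__config__.show() appears on recent builds, though its exact wording can vary by version):

import os
os.environ['ATEN_CPU_CAPABILITY'] = 'default'  # must be set before torch is imported

import torch
print(os.environ.get('ATEN_CPU_CAPABILITY'))  # 'default'
print(torch.__config__.show())  # build/dispatch info; look for the CPU capability line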

conscell added a commit to conscell/nn-zero-to-hero that referenced this issue Dec 27, 2024
Fixes karpathy#13, karpathy#45, where the `dhpreact` was not exactly matching `hpreact.grad`.
@conscell conscell linked a pull request Dec 27, 2024 that will close this issue

7 participants