WARNING: CPU: X PID: XXXX at include/linux/rwsem.h:80 follow_pte+0xf8/0x120 on resume #719

Open
1 of 2 tasks
birdie-github opened this issue Oct 22, 2024 · 11 comments
Labels
bug Something isn't working

Comments

@birdie-github

birdie-github commented Oct 22, 2024

NVIDIA Open GPU Kernel Modules Version

565.57.01

Please confirm this issue does not happen with the proprietary driver (of the same version). This issue tracker is only for bugs specific to the open kernel driver.

  • I confirm that this does not happen with the proprietary driver package.

Operating System and Version

Fedora 40

Kernel Release

Linux zen 6.11.4-zen3 #1 SMP PREEMPT_DYNAMIC Tue Oct 22 11:16:40 UTC 2024 x86_64 GNU/Linux

Please confirm you are running a stable release kernel (e.g. not a -rc). We do not accept bug reports for unreleased kernels.

  • I am running on a stable kernel release.

Hardware: GPU

NVIDIA GeForce RTX 4070 SUPER

Describe the bug

The issue with resuming is not fixed in beta driver 565.57.01 :-(

There are even MORE dmesg errors than in the previous stable driver.

I'm using XFCE without compositing and these NVIDIA kernel module options:

options nvidia NVreg_EnableS0ixPowerManagement=1
options nvidia-drm modeset=1 fbdev=1
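
To confirm that options like these are actually in effect on the running driver, the loaded values can be read back. The following is a minimal sketch, assuming the driver exposes its parameters as `Name: value` lines in /proc/driver/nvidia/params (the path and line format are assumptions based on typical installs, and the function names are hypothetical):

```python
def parse_nvidia_params(text):
    """Parse 'Name: value' lines into a dict of strings.

    Assumes one parameter per line; lines without a colon are skipped.
    """
    params = {}
    for line in text.splitlines():
        name, sep, value = line.partition(":")
        if sep:
            # Drop surrounding whitespace and optional double quotes.
            params[name.strip()] = value.strip().strip('"')
    return params


def read_nvidia_params(path="/proc/driver/nvidia/params"):
    """Read and parse the parameter file (path is an assumption)."""
    with open(path, encoding="utf-8") as fh:
        return parse_nvidia_params(fh.read())
```

Calling `read_nvidia_params().get("EnableS0ixPowerManagement")` before and after editing the modprobe configuration would show whether the change took effect after a reboot or module reload.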

To Reproduce

Suspend/resume.

Bug Incidence

Always

nvidia-bug-report.log.gz

kernel-trace.txt

@aritger
Collaborator

aritger commented Oct 22, 2024

Can you please attach a full nvidia-bug-report.log.gz?

@birdie-github
Author

birdie-github commented Oct 22, 2024

Here it is:

nvidia-bug-report.log.gz

This first appeared in kernel 6.10.

Kernels 6.9 and earlier don't exhibit this issue.

@aritger
Collaborator

aritger commented Oct 22, 2024

Thanks for the log. I've filed NVIDIA internal bug 4922186 for this.

Knowing this is specific to Linux kernel 6.10 and later helps; thanks for that isolation.
Any other isolation you can do will help a lot. E.g.,

  • Does this reproduce if forcing use of the closed kernel modules?
  • Does this reproduce without NVreg_EnableS0ixPowerManagement=1?
  • Does this reproduce without fbdev=1?
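
The two module-option questions above form a small test matrix. As a rough sketch (the helper name is hypothetical, and the option lines simply mirror the ones quoted in the report), the four combinations to try could be generated like this:

```python
from itertools import product


def isolation_configs():
    """Return modprobe.d snippets for every S0ix/fbdev combination."""
    configs = []
    for s0ix, fbdev in product((0, 1), repeat=2):
        lines = [
            f"options nvidia NVreg_EnableS0ixPowerManagement={s0ix}",
            f"options nvidia-drm modeset=1 fbdev={fbdev}",
        ]
        configs.append("\n".join(lines))
    return configs
```

Each snippet would be tested with one suspend/resume cycle. Switching between the open and closed kernel modules is an install-time choice rather than a module option, so it is not covered here.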

@birdie-github
Author

* Does this reproduce if forcing use of the closed kernel modules?

Yes.

* Does this reproduce without NVreg_EnableS0ixPowerManagement=1?

I'll try, and report later.

* Does this reproduce without fbdev=1?

Yes.

@birdie-github
Author

birdie-github commented Oct 22, 2024

Removing NVreg_EnableS0ixPowerManagement=1 fixes the issue.

No more dmesg spam (more than 60KB of messages).

I started using the option because you or @aaronp24 told me it was necessary to properly restore the system state.

Probably I had some issues with either Firefox or Chrome misbehaving on resume. It was a long time ago.

I have 64GB of RAM, most of it completely free, and I'm not using swap or hibernation.

@aritger
Collaborator

aritger commented Oct 22, 2024

Thank you for those experiments. That will help us focus our debugging.

@tekstryder

tekstryder commented Oct 22, 2024

I'd been mistakenly following #662 this whole time, when it's this issue that has been affecting every kernel since the 6.10 release, as mentioned. I should have paid more attention to those stack traces! D'oh!

Does this reproduce if forcing use of the closed kernel modules?

Yes

Does this reproduce without NVreg_EnableS0ixPowerManagement=1?

Yes. I've never had NVreg_EnableS0ixPowerManagement set.

Does this reproduce without fbdev=1?

Yes, and I'm unable to boot with fbdev=1 anyhow. See related Arch bug.

Using the following kernel parameters:
nvidia_drm.modeset=1 nvidia_drm.fbdev=0

  • Arch Linux | Kernel 6.10.14
  • NVIDIA driver 560.35.03
  • GeForce GTX 1050 Ti

The attached stack trace repeats about 20 times per suspend: nvidia-sleep-stacktrace.txt
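
The per-suspend repetition count can be measured rather than estimated. Here is a minimal sketch that tallies kernel WARNING headers in saved dmesg text, assuming each header follows the shape shown in this issue's title (`WARNING: CPU: N PID: N at <file:line> <symbol+offset>`); the function name is hypothetical:

```python
import re
from collections import Counter

# Matches e.g. "WARNING: CPU: 3 PID: 1234 at include/linux/rwsem.h:80 follow_pte+0xf8/0x120"
WARNING_RE = re.compile(r"WARNING: CPU: \d+ PID: \d+ at (\S+) (\S+)")


def count_warnings(dmesg_text):
    """Count WARNING headers, keyed by (source location, function name)."""
    counts = Counter()
    for match in WARNING_RE.finditer(dmesg_text):
        location, symbol = match.groups()
        # Strip the "+0xf8/0x120" offset so repeats group together.
        counts[(location, symbol.split("+")[0])] += 1
    return counts
```

Running it on the dmesg output captured after a single suspend/resume cycle gives one count per distinct warning site, which makes before/after comparisons between driver or kernel versions easier.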

/usr/lib/modprobe.d/nvidia-sleep.conf:

options nvidia NVreg_PreserveVideoMemoryAllocations=1
options nvidia NVreg_TemporaryFilePath=/var/tmp

nvidia-bug-report.log

@birdie-github
Author

@tekstryder

Please attach your nvidia-bug-report as well.

@tekstryder

Please attach your nvidia-bug-report as well.

Edited my post and attached it.

I also just built and booted kernel 6.11.5, and suspended one time to reproduce this issue, so the dmesg is fresh.

nvidia-bug-report.log

@ZulluBalti

This also happens to me with the proprietary drivers on Linux 6.10.10.

@birdie-github
Author

@aritger

There's also #705 which looks similar.

I wonder if #662, #705, and this one are all somehow related, because as far as I can see, most people affected by them started having issues with kernel 6.10 and later.


4 participants