gpuHandleSanityCheckRegReadError_GM107: Possible bad register read #688

Open
1 of 2 tasks
taochenlove opened this issue Aug 2, 2024 · 5 comments
Labels
bug: Something isn't working
NV-Triaged: An NVBug has been created for dev to investigate

Comments

@taochenlove

NVIDIA Open GPU Kernel Modules Version

560.28.03

Please confirm this issue does not happen with the proprietary driver (of the same version). This issue tracker is only for bugs specific to the open kernel driver.

  • I confirm that this does not happen with the proprietary driver package.

Operating System and Version

Ubuntu 22.04 LTS

Kernel Release

5.15.0-25-generic

Please confirm you are running a stable release kernel (e.g. not a -rc). We do not accept bug reports for unreleased kernels.

  • I am running on a stable kernel release.

Hardware: GPU

NVIDIA A100-PCIE-40GB

Describe the bug

When running nvidia-smi, the following errors are printed:
(base) root@D11DJ-3410-01:~/chenct# nvidia-smi -L
[12919.827336] NVRM: gpuHandleSanityCheckRegReadError_GM107: Possible bad register read: addr: 0x88158, regvalue: 0xbadf5040, error code: NV_PPRIV_SYS_PRI_ERROR_CODE_FECS_PRI_CLIENT_ERR
[12919.827348] NVRM: gpuHandleSanityCheckRegReadError_GM107: Possible bad register read: addr: 0x88174, regvalue: 0xbadf5040, error code: NV_PPRIV_SYS_PRI_ERROR_CODE_FECS_PRI_CLIENT_ERR
[12919.827489] NVRM: gpuHandleSanityCheckRegReadError_GM107: Possible bad register read: addr: 0x889d4, regvalue: 0xbadf5040, error code: NV_PPRIV_SYS_PRI_ERROR_CODE_FECS_PRI_CLIENT_ERR
[12919.827624] NVRM: gpuHandleSanityCheckRegReadError_GM107: Possible bad register read: addr: 0x88e2c, regvalue: 0xbadf5040, error code: NV_PPRIV_SYS_PRI_ERROR_CODE_FECS_PRI_CLIENT_ERR
[12919.827628] NVRM: gpuHandleSanityCheckRegReadError_GM107: Possible bad register read: addr: 0x88e30, regvalue: 0xbadf5040, error code: NV_PPRIV_SYS_PRI_ERROR_CODE_FECS_PRI_CLIENT_ERR
[12919.827632] NVRM: gpuHandleSanityCheckRegReadError_GM107: Possible bad register read: addr: 0x88e34, regvalue: 0xbadf5040, error code: NV_PPRIV_SYS_PRI_ERROR_CODE_FECS_PRI_CLIENT_ERR
[12919.827636] NVRM: gpuHandleSanityCheckRegReadError_GM107: Possible bad register read: addr: 0x88e38, regvalue: 0xbadf5040, error code: NV_PPRIV_SYS_PRI_ERROR_CODE_FECS_PRI_CLIENT_ERR
[12919.827639] NVRM: gpuHandleSanityCheckRegReadError_GM107: Possible bad register read: addr: 0x88e3c, regvalue: 0xbadf5040, error code: NV_PPRIV_SYS_PRI_ERROR_CODE_FECS_PRI_CLIENT_ERR
[12919.827642] NVRM: gpuHandleSanityCheckRegReadError_GM107: Possible bad register read: addr: 0x88e40, regvalue: 0xbadf5040, error code: NV_PPRIV_SYS_PRI_ERROR_CODE_FECS_PRI_CLIENT_ERR
[12919.827646] NVRM: gpuHandleSanityCheckRegReadError_GM107: Possible bad register read: addr: 0x88e44, regvalue: 0xbadf5040, error code: NV_PPRIV_SYS_PRI_ERROR_CODE_FECS_PRI_CLIENT_ERR
[12919.827649] NVRM: gpuHandleSanityCheckRegReadError_GM107: Possible bad register read: addr: 0x88e48, regvalue: 0xbadf5040, error code: NV_PPRIV_SYS_PRI_ERROR_CODE_FECS_PRI_CLIENT_ERR
[12919.827652] NVRM: gpuHandleSanityCheckRegReadError_GM107: Possible bad register read: addr: 0x88e4c, regvalue: 0xbadf5040, error code: NV_PPRIV_SYS_PRI_ERROR_CODE_FECS_PRI_CLIENT_ERR
[12919.827656] NVRM: gpuHandleSanityCheckRegReadError_GM107: Possible bad register read: addr: 0x88e50, regvalue: 0xbadf5040, error code: NV_PPRIV_SYS_PRI_ERROR_CODE_FECS_PRI_CLIENT_ERR
[12919.827659] NVRM: gpuHandleSanityCheckRegReadError_GM107: Possible bad register read: addr: 0x88e54, regvalue: 0xbadf5040, error code: NV_PPRIV_SYS_PRI_ERROR_CODE_FECS_PRI_CLIENT_ERR
[12919.827662] NVRM: gpuHandleSanityCheckRegReadError_GM107: Possible bad register read: addr: 0x88e58, regvalue: 0xbadf5040, error code: NV_PPRIV_SYS_PRI_ERROR_CODE_FECS_PRI_CLIENT_ERR
[12919.827665] NVRM: gpuHandleSanityCheckRegReadError_GM107: Possible bad register read: addr: 0x88e5c, regvalue: 0xbadf5040, error code: NV_PPRIV_SYS_PRI_ERROR_CODE_FECS_PRI_CLIENT_ERR
[12919.827668] NVRM: gpuHandleSanityCheckRegReadError_GM107: Possible bad register read: addr: 0x88e60, regvalue: 0xbadf5040, error code: NV_PPRIV_SYS_PRI_ERROR_CODE_FECS_PRI_CLIENT_ERR
[12919.827671] NVRM: gpuHandleSanityCheckRegReadError_GM107: Possible bad register read: addr: 0x88e64, regvalue: 0xbadf5040, error code: NV_PPRIV_SYS_PRI_ERROR_CODE_FECS_PRI_CLIENT_ERR
[12919.827674] NVRM: gpuHandleSanityCheckRegReadError_GM107: Possible bad register read: addr: 0x88e68, regvalue: 0xbadf5040, error code: NV_PPRIV_SYS_PRI_ERROR_CODE_FECS_PRI_CLIENT_ERR
[12919.827677] NVRM: gpuHandleSanityCheckRegReadError_GM107: Possible bad register read: addr: 0x88e6c, regvalue: 0xbadf5040, error code: NV_PPRIV_SYS_PRI_ERROR_CODE_FECS_PRI_CLIENT_ERR
[12919.827680] NVRM: gpuHandleSanityCheckRegReadError_GM107: Possible bad register read: addr: 0x88e70, regvalue: 0xbadf5040, error code: NV_PPRIV_SYS_PRI_ERROR_CODE_FECS_PRI_CLIENT_ERR
[12919.827682] NVRM: gpuHandleSanityCheckRegReadError_GM107: Possible bad register read: addr: 0x88e74, regvalue: 0xbadf5040, error code: NV_PPRIV_SYS_PRI_ERROR_CODE_FECS_PRI_CLIENT_ERR
[12919.827685] NVRM: gpuHandleSanityCheckRegReadError_GM107: Possible bad register read: addr: 0x88e78, regvalue: 0xbadf5040, error code: NV_PPRIV_SYS_PRI_ERROR_CODE_FECS_PRI_CLIENT_ERR
[12919.827689] NVRM: gpuHandleSanityCheckRegReadError_GM107: Possible bad register read: addr: 0x88e7c, regvalue: 0xbadf5040, error code: NV_PPRIV_SYS_PRI_ERROR_CODE_FECS_PRI_CLIENT_ERR
[12919.827691] NVRM: gpuHandleSanityCheckRegReadError_GM107: Possible bad register read: addr: 0x88e80, regvalue: 0xbadf5040, error code: NV_PPRIV_SYS_PRI_ERROR_CODE_FECS_PRI_CLIENT_ERR
[12919.827694] NVRM: gpuHandleSanityCheckRegReadError_GM107: Possible bad register read: addr: 0x88e84, regvalue: 0xbadf5040, error code: NV_PPRIV_SYS_PRI_ERROR_CODE_FECS_PRI_CLIENT_ERR
[12919.827697] NVRM: gpuHandleSanityCheckRegReadError_GM107: Possible bad register read: addr: 0x88e88, regvalue: 0xbadf5040, error code: NV_PPRIV_SYS_PRI_ERROR_CODE_FECS_PRI_CLIENT_ERR
[12919.827699] NVRM: gpuHandleSanityCheckRegReadError_GM107: Possible bad register read: addr: 0x88e8c, regvalue: 0xbadf5040, error code: NV_PPRIV_SYS_PRI_ERROR_CODE_FECS_PRI_CLIENT_ERR
[12919.827703] NVRM: gpuHandleSanityCheckRegReadError_GM107: Possible bad register read: addr: 0x88e90, regvalue: 0xbadf5040, error code: NV_PPRIV_SYS_PRI_ERROR_CODE_FECS_PRI_CLIENT_ERR
[12919.827705] NVRM: gpuHandleSanityCheckRegReadError_GM107: Possible bad register read: addr: 0x88e94, regvalue: 0xbadf5040, error code: NV_PPRIV_SYS_PRI_ERROR_CODE_FECS_PRI_CLIENT_ERR
[12919.827708] NVRM: gpuHandleSanityCheckRegReadError_GM107: Possible bad register read: addr: 0x88e98, regvalue: 0xbadf5040, error code: NV_PPRIV_SYS_PRI_ERROR_CODE_FECS_PRI_CLIENT_ERR
[12919.827711] NVRM: gpuHandleSanityCheckRegReadError_GM107: Possible bad register read: addr: 0x88e9c, regvalue: 0xbadf5040, error code: NV_PPRIV_SYS_PRI_ERROR_CODE_FECS_PRI_CLIENT_ERR
[12919.827714] NVRM: gpuHandleSanityCheckRegReadError_GM107: Possible bad register read: addr: 0x88ea0, regvalue: 0xbadf5040, error code: NV_PPRIV_SYS_PRI_ERROR_CODE_FECS_PRI_CLIENT_ERR
[12919.827717] NVRM: gpuHandleSanityCheckRegReadError_GM107: Possible bad register read: addr: 0x88ea4, regvalue: 0xbadf5040, error code: NV_PPRIV_SYS_PRI_ERROR_CODE_FECS_PRI_CLIENT_ERR
[12919.827720] NVRM: gpuHandleSanityCheckRegReadError_GM107: Possible bad register read: addr: 0x88ea8, regvalue: 0xbadf5040, error code: NV_PPRIV_SYS_PRI_ERROR_CODE_FECS_PRI_CLIENT_ERR
[12919.827722] NVRM: gpuHandleSanityCheckRegReadError_GM107: Possible bad register read: addr: 0x88eac, regvalue: 0xbadf5040, error code: NV_PPRIV_SYS_PRI_ERROR_CODE_FECS_PRI_CLIENT_ERR
[12919.827726] NVRM: gpuHandleSanityCheckRegReadError_GM107: Possible bad register read: addr: 0x88eb0, regvalue: 0xbadf5040, error code: NV_PPRIV_SYS_PRI_ERROR_CODE_FECS_PRI_CLIENT_ERR
[12919.827729] NVRM: gpuHandleSanityCheckRegReadError_GM107: Possible bad register read: addr: 0x88eb4, regvalue: 0xbadf5040, error code: NV_PPRIV_SYS_PRI_ERROR_CODE_FECS_PRI_CLIENT_ERR
[12919.827731] NVRM: gpuHandleSanityCheckRegReadError_GM107: Possible bad register read: addr: 0x88eb8, regvalue: 0xbadf5040, error code: NV_PPRIV_SYS_PRI_ERROR_CODE_FECS_PRI_CLIENT_ERR
[12919.827734] NVRM: gpuHandleSanityCheckRegReadError_GM107: Possible bad register read: addr: 0x88ebc, regvalue: 0xbadf5040, error code: NV_PPRIV_SYS_PRI_ERROR_CODE_FECS_PRI_CLIENT_ERR
[12919.827737] NVRM: gpuHandleSanityCheckRegReadError_GM107: Possible bad register read: addr: 0x88ec0, regvalue: 0xbadf5040, error code: NV_PPRIV_SYS_PRI_ERROR_CODE_FECS_PRI_CLIENT_ERR
[12919.827740] NVRM: gpuHandleSanityCheckRegReadError_GM107: Possible bad register read: addr: 0x88ec4, regvalue: 0xbadf5040, error code: NV_PPRIV_SYS_PRI_ERROR_CODE_FECS_PRI_CLIENT_ERR
[12919.827743] NVRM: gpuHandleSanityCheckRegReadError_GM107: Possible bad register read: addr: 0x88ec8, regvalue: 0xbadf5040, error code: NV_PPRIV_SYS_PRI_ERROR_CODE_FECS_PRI_CLIENT_ERR
[12919.827746] NVRM: gpuHandleSanityCheckRegReadError_GM107: Possible bad register read: addr: 0x88ecc, regvalue: 0xbadf5040, error code: NV_PPRIV_SYS_PRI_ERROR_CODE_FECS_PRI_CLIENT_ERR
[12919.827749] NVRM: gpuHandleSanityCheckRegReadError_GM107: Possible bad register read: addr: 0x88ed0, regvalue: 0xbadf5040, error code: NV_PPRIV_SYS_PRI_ERROR_CODE_FECS_PRI_CLIENT_ERR
[12919.827752] NVRM: gpuHandleSanityCheckRegReadError_GM107: Possible bad register read: addr: 0x88ed4, regvalue: 0xbadf5040, error code: NV_PPRIV_SYS_PRI_ERROR_CODE_FECS_PRI_CLIENT_ERR
[12919.827755] NVRM: gpuHandleSanityCheckRegReadError_GM107: Possible bad register read: addr: 0x88ed8, regvalue: 0xbadf5040, error code: NV_PPRIV_SYS_PRI_ERROR_CODE_FECS_PRI_CLIENT_ERR
[12919.827758] NVRM: gpuHandleSanityCheckRegReadError_GM107: Possible bad register read: addr: 0x88edc, regvalue: 0xbadf5040, error code: NV_PPRIV_SYS_PRI_ERROR_CODE_FECS_PRI_CLIENT_ERR
[12919.827761] NVRM: gpuHandleSanityCheckRegReadError_GM107: Possible bad register read: addr: 0x88fe4, regvalue: 0xbadf5040, error code: NV_PPRIV_SYS_PRI_ERROR_CODE_FECS_PRI_CLIENT_ERR
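
A log this long is easier to compare across reports once it is condensed. The following is a minimal sketch, not part of the driver or any NVIDIA tool, that parses lines in the format quoted above and collapses the affected register addresses into contiguous runs. The helper names (`parse_bad_reads`, `contiguous_ranges`) and the 4-byte stride are assumptions made for illustration; the sample lines are copied from the log above.

```python
import re

# Matches the NVRM "Possible bad register read" lines quoted in this issue.
LINE_RE = re.compile(
    r"NVRM: (?P<func>gpuHandleSanityCheckRegReadError_\w+): "
    r"Possible bad register read: addr: (?P<addr>0x[0-9a-fA-F]+),\s*"
    r"regvalue: (?P<value>0x[0-9a-fA-F]+),\s*"
    r"error code: (?P<code>\w+)"
)


def parse_bad_reads(log_text):
    """Return a list of (function, address, value, error_code) tuples."""
    return [
        (m["func"], int(m["addr"], 16), int(m["value"], 16), m["code"])
        for m in LINE_RE.finditer(log_text)
    ]


def contiguous_ranges(addresses, stride=4):
    """Collapse addresses into (first, last) runs spaced `stride` bytes apart.

    The 4-byte stride is an assumption based on the spacing seen in the log.
    """
    runs = []
    for addr in sorted(set(addresses)):
        if runs and addr - runs[-1][1] == stride:
            runs[-1][1] = addr  # extend the current run
        else:
            runs.append([addr, addr])  # start a new run
    return [(first, last) for first, last in runs]


# Three lines copied from the log in this issue.
sample = """\
[12919.827336] NVRM: gpuHandleSanityCheckRegReadError_GM107: Possible bad register read: addr: 0x88158, regvalue: 0xbadf5040, error code: NV_PPRIV_SYS_PRI_ERROR_CODE_FECS_PRI_CLIENT_ERR
[12919.827624] NVRM: gpuHandleSanityCheckRegReadError_GM107: Possible bad register read: addr: 0x88e2c, regvalue: 0xbadf5040, error code: NV_PPRIV_SYS_PRI_ERROR_CODE_FECS_PRI_CLIENT_ERR
[12919.827628] NVRM: gpuHandleSanityCheckRegReadError_GM107: Possible bad register read: addr: 0x88e30, regvalue: 0xbadf5040, error code: NV_PPRIV_SYS_PRI_ERROR_CODE_FECS_PRI_CLIENT_ERR
"""

reads = parse_bad_reads(sample)
for first, last in contiguous_ranges(addr for _, addr, _, _ in reads):
    print(f"{first:#x}-{last:#x}")
```

Feeding the full output of `dmesg` to `parse_bad_reads` in place of `sample` should give a short list of address ranges that can be pasted into a report instead of the raw lines.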

To Reproduce

Install the driver with "./cuda_12.6.0_560.28.03_linux.run -m=kernel-open". The errors appear after the installation completes.

Bug Incidence

Always

nvidia-bug-report.log.gz

none

More Info

No response

taochenlove added the bug label on Aug 2, 2024
gauravjuvekar added the NV-Triaged label on Aug 7, 2024
@gauravjuvekar
Member

Tracked internally as Bug 4290269

@drastx

drastx commented Aug 22, 2024

I am seeing the same error code NV_PPRIV_SYS_PRI_ERROR_CODE_FECS_PRI_CLIENT_ERR on a GH200. Could these be related?
Setup: Ubuntu 24.04 with NVIDIA's ghvirt 6.5.3-based kernel, driver 550.

[ 5.868764] NVRM: gpuHandleSanityCheckRegReadError_GH100: Possible bad register read: addr: 0x920bc, regvalue: 0xbadf5040, error code: NV_PPRIV_SYS_PRI_ERROR_CODE_FECS_PRI_CLIENT_ERR
[ 5.870053] NVRM: gpuHandleSanityCheckRegReadError_GH100: Possible bad register read: addr: 0x920c0, regvalue: 0xbadf5040, error code: NV_PPRIV_SYS_PRI_ERROR_CODE_FECS_PRI_CLIENT_ERR
[ 5.871339] NVRM: gpuHandleSanityCheckRegReadError_GH100: Possible bad register read: addr: 0x920c4, regvalue: 0xbadf5040, error code: NV_PPRIV_SYS_PRI_ERROR_CODE_FECS_PRI_CLIENT_ERR
[ 5.872554] NVRM: gpuHandleSanityCheckRegReadError_GH100: Possible bad register read: addr: 0x920c8, regvalue: 0xbadf5040, error code: NV_PPRIV_SYS_PRI_ERROR_CODE_FECS_PRI_CLIENT_ERR
[ 5.873768] NVRM: gpuHandleSanityCheckRegReadError_GH100: Possible bad register read: addr: 0x920cc, regvalue: 0xbadf5040, error code: NV_PPRIV_SYS_PRI_ERROR_CODE_FECS_PRI_CLIENT_ERR
[ 5.875016] NVRM: gpuHandleSanityCheckRegReadError_GH100: Possible bad register read: addr: 0x920d0, regvalue: 0xbadf5040, error code: NV_PPRIV_SYS_PRI_ERROR_CODE_FECS_PRI_CLIENT_ERR
[ 5.876228] NVRM: gpuHandleSanityCheckRegReadError_GH100: Possible bad register read: addr: 0x920e4, regvalue: 0xbadf5040, error code: NV_PPRIV_SYS_PRI_ERROR_CODE_FECS_PRI_CLIENT_ERR
[ 5.877451] NVRM: gpuHandleSanityCheckRegReadError_GH100: Possible bad register read: addr: 0x920e8, regvalue: 0xbadf5040, error code: NV_PPRIV_SYS_PRI_ERROR_CODE_FECS_PRI_CLIENT_ERR
[ 5.878672] NVRM: gpuHandleSanityCheckRegReadError_GH100: Possible bad register read: addr: 0x920ec, regvalue: 0xbadf5040, error code: NV_PPRIV_SYS_PRI_ERROR_CODE_FECS_PRI_CLIENT_ERR
[ 5.879958] NVRM: gpuHandleSanityCheckRegReadError_GH100: Possible bad register read: addr: 0x920f0, regvalue: 0xbadf5040, error code: NV_PPRIV_SYS_PRI_ERROR_CODE_FECS_PRI_CLIENT_ERR
[ 5.881140] NVRM: gpuHandleSanityCheckRegReadError_GH100: Possible bad register read: addr: 0x920f4, regvalue: 0xbadf5040, error code: NV_PPRIV_SYS_PRI_ERROR_CODE_FECS_PRI_CLIENT_ERR
[ 5.882292] NVRM: gpuHandleSanityCheckRegReadError_GH100: Possible bad register read: addr: 0x920f8, regvalue: 0xbadf5040, error code: NV_PPRIV_SYS_PRI_ERROR_CODE_FECS_PRI_CLIENT_ERR
[ 5.883533] NVRM: gpuHandleSanityCheckRegReadError_GH100: Possible bad register read: addr: 0x920fc, regvalue: 0xbadf5040, error code: NV_PPRIV_SYS_PRI_ERROR_CODE_FECS_PRI_CLIENT_ERR

@gauravjuvekar
Member

Yes, this is the same bug. It affects release 550 and later.
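
For anyone triaging several machines, the "release 550 and later" statement can be turned into a quick check. This is only a sketch under that single assumption: the thread gives no upper bound or fixed release, so the function reports nothing more than whether a version string has a major number of 550 or higher. The function names are hypothetical.

```python
def parse_version(version):
    """Split a dotted driver version such as '560.28.03' into a tuple of ints."""
    return tuple(int(part) for part in version.strip().split("."))


def is_in_affected_range(version, first_affected_major=550):
    """Return True if the major release is 550 or later.

    The threshold comes from the maintainer's comment in this thread;
    no fixed release is mentioned, so there is no upper bound here.
    """
    return parse_version(version)[0] >= first_affected_major


# Versions mentioned in this thread.
for v in ("560.28.03", "560.35.03"):
    print(v, is_in_affected_range(v))
```

The installed version can be read with `nvidia-smi --query-gpu=driver_version --format=csv,noheader` and passed to `is_in_affected_range`.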

@apoorvemohan

apoorvemohan commented Sep 24, 2024

We are seeing the following error on an AMD-based A100 40GB system with NVIDIA driver 550 and CUDA 12.4 (Ubuntu 22.04 LTS).

[   37.333506] NVRM: gpuHandleSanityCheckRegReadError_GM107: Possible bad register read: addr: 0x88158,  regvalue: 0xbadf5040,  error code: NV_PPRIV_SYS_PRI_ERROR_CODE_FECS_PRI_CLIENT_ERR

cc: @mengmeiye

@levipereira

Same bug
Ubuntu 22.04 LTS

using NVIDIA-Linux-x86_64-560.35.03.run

GPU RTX 4090

Architecture: x86_64
CPU op-mode(s): 32-bit, 64-bit
Address sizes: 43 bits physical, 48 bits virtual
Byte Order: Little Endian
CPU(s): 16
On-line CPU(s) list: 0-15
Vendor ID: AuthenticAMD
Model name: AMD Ryzen 7 3700X 8-Core Processor
