nvswitch_get_link_handlers: initialize ->read_discovery_token method by default #647

Open
dmonakhov wants to merge 1 commit into base: 545.23

Otherwise, this can result in a NULL pointer dereference if the hardware is not healthy.
Example:
[ 66.499757] BUG: kernel NULL pointer dereference, address: 0000000000000000
[ 66.502446] #PF: supervisor instruction fetch in kernel mode
[ 66.504631] #PF: error_code(0x0010) - not-present page
[ 66.506625] PGD 800000011acc2067 P4D 800000011acc2067 PUD 10f48e067 PMD 0
[ 66.509260] Oops: 0010 [#1] SMP PTI
[ 66.510647] CPU: 12 PID: 15050 Comm: nv-fabricmanage Tainted: G OE 5.15.0-1058-aws #64~20.04.1-Ubuntu
[ 66.514613] Hardware name: Amazon EC2 p4d.24xlarge/, BIOS 1.0 10/16/2017
[ 66.517203] RIP: 0010:0x0
[ 66.518271] Code: Unable to access opcode bytes at RIP 0xffffffffffffffd6.
[ 66.520897] RSP: 0018:ffffac89a361bd20 EFLAGS: 00010246
[ 66.522914] RAX: 0000000000000000 RBX: ffff91ec630fef10 RCX: 000000000000001a
[ 66.525641] RDX: 0000000000000002 RSI: ffffac89a361bd28 RDI: ffff91ec4efb7000
[ 66.528366] RBP: ffffac89a361bd48 R08: 0000000000000000 R09: ffff915fbb4b1540
[ 66.531091] R10: 0000000000000000 R11: 0000000000000000 R12: ffff91ec4efb7000
[ 66.533821] R13: ffff91ec4efb6300 R14: 0000000000000000 R15: ffff91ec4efb7000
[ 66.536558] FS: 00007f43f1cf3700(0000) GS:ffff91e87d700000(0000) knlGS:0000000000000000
[ 66.539650] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[ 66.541885] CR2: ffffffffffffffd6 CR3: 00000004277fe005 CR4: 00000000007706e0
[ 66.544610] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
[ 66.548862] DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
[ 66.553084] PKRU: 55555554
[ 66.555718] Call Trace:
[ 66.558238] <TASK>
[ 66.560629] ? show_regs.cold+0x1a/0x1f
[ 66.563644] ? __die_body+0x20/0x70
[ 66.566550] ? __die+0x2b/0x37
[ 66.569302] ? page_fault_oops+0x136/0x2c0
[ 66.572432] ? do_user_addr_fault+0x303/0x660
[ 66.575667] ? hrtimer_cancel+0x15/0x30
[ 66.578703] ? futex_wait+0x1e9/0x260
[ 66.581669] ? exc_page_fault+0x77/0x170
[ 66.584737] ? asm_exc_page_fault+0x27/0x30
[ 66.587902] nvlink_core_read_link_discovery_token+0x73/0x80 [nvidia]
[ 66.592041] nvlink_core_correlate_conn_by_token+0xb8/0xf0 [nvidia]
[ 66.596073] nvlink_lib_ioctl_ctrl+0x10e9/0x2680 [nvidia]
[ 66.599776] ? __x64_sys_futex+0x7b/0x1c0
[ 66.602873] ? __cond_resched+0x19/0x40
[ 66.605894] ? __kmalloc+0x1ab/0x4b0
[ 66.608815] nvlink_fops_unlocked_ioctl+0xd6/0x1c0 [nvidia]
[ 66.612543] __x64_sys_ioctl+0x95/0xd0
[ 66.615549] do_syscall_64+0x5c/0xc0
[ 66.618476] ? do_syscall_64+0x69/0xc0
[ 66.621483] entry_SYSCALL_64_after_hwframe+0x62/0xcc
[ 66.624949] RIP: 0033:0x7f43f56115cb
[ 66.627869] Code: 0f 1e fa 48 8b 05 c5 78 0d 00 64 c7 00 26 00 00 00 48 c7 c0 ff ff ff ff c3 66 0f 1f 44 00 00 f3 0f 1e fa b8 10 00 00 00 0f 05 <48> 3d 01 f0 ff ff 73 01 c3 48 8b 0d 95 78 0d 00 f7 d8 64 89 01 48
[ 66.639398] RSP: 002b:00007f43f1cf2b28 EFLAGS: 00000246 ORIG_RAX: 0000000000000010
[ 66.645327] RAX: ffffffffffffffda RBX: 0000000000000000 RCX: 00007f43f56115cb
[ 66.649545] RDX: 00007f43f1cf2bbc RSI: 0000000080044e08 RDI: 0000000000000030
[ 66.653786] RBP: 0000000080044e08 R08: 00007f43dc000f30 R09: 00000000008db3f0
[ 66.658029] R10: 00007f43dc0008d0 R11: 0000000000000246 R12: 0000000001a253e0
[ 66.662271] R13: 00007f43f1cf2bbc R14: 0000000000000018 R15: 0000000001a2b598
[ 66.666508] </TASK>
[ 66.668919] Modules linked in: nvidia_uvm(OE) xt_conntrack xt_MASQUERADE nf_conntrack_netlink nfnetlink xfrm_user xfrm_algo iptable_nat nf_nat nf_conntrack nf_defrag_ipv6 nf_defrag_ipv4 libcrc32c xt_addrtype iptable_filter bpfilter br_netfilter br
drv(OE) overlay nls_iso8859_1 dm_multipath scsi_dh_rdac scsi_dh_emc scsi_dh_alua sch_fq_codel msr nvidia(OE) ib_iser binfmt_misc ppdev rdma_cm iw_cm ib_cm libiscsi crct10dif_pclmul crc32_pclmul ghash_clmulni_intel sha256_ssse3 sha1_ssse3 scsi_transpo
crypto_simd psmouse input_leds cryptd serio_raw ena parport_pc parport efa(OE) ib_uverbs ib_core drm sunrpc efi_pstore ip_tables x_tables autofs4
[ 66.704520] CR2: 0000000000000000
[ 66.707368] ---[ end trace 4652d4c22dfad5b0 ]---

This issue affects at least the 535 and 545 branches.
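
For readers unfamiliar with the failure mode: `RIP: 0010:0x0` together with "supervisor instruction fetch" in the trace is what a call through a NULL function pointer looks like. The following is a minimal user-space sketch of the idea behind the fix, not the driver code itself. All type and function names here (`link_handlers_t`, `get_link_handlers`, `read_token_unsupported`, `ERR_NOT_SUPPORTED`, and so on) are hypothetical stand-ins, and the `hw_healthy` flag is an assumption used only to model the path where the real handler never gets installed.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical stand-ins for illustration; not the NVIDIA driver's types. */
#define ERR_NOT_SUPPORTED (-1)

typedef struct link link_t;

typedef struct link_handlers {
    /* Callback invoked by the core layer to read a discovery token. */
    int (*read_discovery_token)(link_t *link, unsigned long long *token);
} link_handlers_t;

struct link {
    const link_handlers_t *handlers;
};

/* Safe default: reports "not supported" instead of leaving the slot NULL. */
int read_token_unsupported(link_t *link, unsigned long long *token)
{
    (void)link;
    *token = 0;
    return ERR_NOT_SUPPORTED;
}

/* Models the real hardware-backed handler. */
int read_token_hw(link_t *link, unsigned long long *token)
{
    (void)link;
    *token = 0x1234ULL;
    return 0;
}

/*
 * Fill the handler table. The default is installed unconditionally first,
 * so the slot is never NULL even when the hardware path is skipped.
 */
void get_link_handlers(link_handlers_t *h, int hw_healthy)
{
    h->read_discovery_token = read_token_unsupported;
    if (hw_healthy)
        h->read_discovery_token = read_token_hw;
}

/* Core-layer dispatch: calls through the table without a NULL check. */
int core_read_discovery_token(link_t *link, unsigned long long *token)
{
    assert(link != NULL && token != NULL);
    return link->handlers->read_discovery_token(link, token);
}
```

With the default in place, a caller on the unhealthy path gets `ERR_NOT_SUPPORTED` back from `core_read_discovery_token` instead of jumping to address 0. An alternative would be a NULL check at the call site, but a default handler keeps every caller of the table safe without touching each one.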

CLAassistant commented Jun 6, 2024

CLA assistant check
All committers have signed the CLA.
