[Bug] - NodeJS npm, pnpm, and yarn install hang due to backported Kernel bug #856

Closed
sgleske-ias opened this issue Dec 3, 2024 · 1 comment

sgleske-ias commented Dec 3, 2024

Workarounds for developers

  • Upgrade to the latest NodeJS.
  • Or, if you are stuck on your current NodeJS version, set the environment variable UV_USE_IO_URING=0 (for example, export UV_USE_IO_URING=0) before running node processes.
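
In CI jobs it can be convenient to scope the second workaround to a single command instead of exporting the variable for the whole shell. A minimal sketch, assuming a POSIX shell; the helper name run_without_io_uring is hypothetical, and UV_USE_IO_URING=0 is the variable named above:

```shell
# Hypothetical helper: run one command with libuv's io_uring support disabled.
# Only the child process sees the variable; the calling shell is unchanged.
run_without_io_uring() {
    env UV_USE_IO_URING=0 "$@"
}

# Example usage (commented out so the snippet has no side effects):
# run_without_io_uring yarn install --immutable
# run_without_io_uring pnpm i --frozen-lockfile
```

Wrapping the command keeps the setting out of unrelated build steps, which makes it easier to remove once the host kernel is fixed.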

Bug description

Note: The host is an Amazon Linux 2023 (AL23) EC2 instance running the AL23 kernel. The node processes hang inside an AL23 Docker container, but containers use the host kernel, so users will see this bug both inside and outside Docker whenever the host is running AL23.
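
Because a container reports the host's kernel, uname -r gives the same answer inside and outside Docker. A rough sketch for checking whether a job is running on an AL23 kernel; the helper name is_al2023_kernel is hypothetical, and the match on "amzn2023" is an assumption based on the source RPM name kernel-6.1.115-126.197.amzn2023.src.rpm quoted later in this report:

```shell
# Hypothetical helper: succeed if the given kernel release string (or the
# running kernel's release, if no argument is given) looks like an AL2023 kernel.
# Assumption: AL2023 kernel release strings contain "amzn2023".
is_al2023_kernel() {
    case "${1:-$(uname -r)}" in
        *amzn2023*) return 0 ;;
        *) return 1 ;;
    esac
}

# Example usage:
# if is_al2023_kernel; then echo "host kernel is AL2023"; fi
```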

yarn install --immutable hangs indefinitely in Jenkins environments. The node processes get stuck in process state Dl, which according to the ps man page means:

  • D uninterruptible sleep (usually IO)
  • l is multi-threaded (using CLONE_THREAD, like NPTL pthreads do)

Further debugging with:

echo w > /proc/sysrq-trigger
dmesg -T

returns logs like

[Tue Dec  3 21:27:04 2024] task:node            state:D stack:0     pid:23041 ppid:21790  flags:0x00004002
[Tue Dec  3 21:27:04 2024] Call Trace:
[Tue Dec  3 21:27:04 2024]  <TASK>
[Tue Dec  3 21:27:04 2024]  __schedule+0x1ad/0x530
[Tue Dec  3 21:27:04 2024]  schedule+0x5a/0xd0
[Tue Dec  3 21:27:04 2024]  schedule_preempt_disabled+0x11/0x20
[Tue Dec  3 21:27:04 2024]  __mutex_lock.constprop.0+0x372/0x6c0
[Tue Dec  3 21:27:04 2024]  io_uring_del_tctx_node+0x5a/0xbf
[Tue Dec  3 21:27:04 2024]  io_uring_clean_tctx+0x55/0xa7
[Tue Dec  3 21:27:04 2024]  io_uring_cancel_generic+0x223/0x267
[Tue Dec  3 21:27:04 2024]  ? membarrier_register_private_expedited+0x90/0x90
[Tue Dec  3 21:27:04 2024]  do_exit+0x139/0x4a0
[Tue Dec  3 21:27:04 2024]  do_group_exit+0x2d/0x80
[Tue Dec  3 21:27:04 2024]  get_signal+0x857/0x880
[Tue Dec  3 21:27:04 2024]  ? ep_eventpoll_poll+0x10/0x10
[Tue Dec  3 21:27:04 2024]  arch_do_signal_or_restart+0x25/0x100
[Tue Dec  3 21:27:04 2024]  ? do_epoll_wait+0x9d/0xe0
[Tue Dec  3 21:27:04 2024]  ? do_compat_epoll_pwait.part.0+0xc/0x70
[Tue Dec  3 21:27:04 2024]  ? __x64_sys_epoll_pwait+0x7f/0x130
[Tue Dec  3 21:27:04 2024]  exit_to_user_mode_loop+0x9c/0x130
[Tue Dec  3 21:27:04 2024]  exit_to_user_mode_prepare+0xb6/0xc0
[Tue Dec  3 21:27:04 2024]  syscall_exit_to_user_mode+0x12/0x40
[Tue Dec  3 21:27:04 2024]  do_syscall_64+0x42/0x80
[Tue Dec  3 21:27:04 2024]  entry_SYSCALL_64_after_hwframe+0x6e/0xd8
[Tue Dec  3 21:27:04 2024] RIP: 0033:0x7faf21ebf728
[Tue Dec  3 21:27:04 2024] RSP: 002b:00007ffcb1420160 EFLAGS: 00000293 ORIG_RAX: 0000000000000119
[Tue Dec  3 21:27:04 2024] RAX: fffffffffffffffc RBX: 0000000000000004 RCX: 00007faf21ebf728
[Tue Dec  3 21:27:04 2024] RDX: 0000000000000400 RSI: 00007ffcb1420ec0 RDI: 0000000000000011
[Tue Dec  3 21:27:04 2024] RBP: 00007ffcb1423ef0 R08: 0000000000000000 R09: 0000000000000008
[Tue Dec  3 21:27:04 2024] R10: 00000000ffffffff R11: 0000000000000293 R12: 0000000000000000
[Tue Dec  3 21:27:04 2024] R13: 0000000005499800 R14: 000000000a7506d0 R15: 00000000ffffffff
[Tue Dec  3 21:27:04 2024]  </TASK>

And ps output shows multiple processes in this state:

# ps -eo ppid,pid,user,stat,pcpu,comm,wchan:32 | grep " D"
  21790   23041 REDACTEDUSER Dl    0.4 node            io_uring_del_tctx_node

# ps aux | grep node
root        1037  0.0  0.0      0     0 ?        I<   20:34   0:00 [xfs-inodegc/nvm]
REDACTEDUSER   25887  0.0  0.3 11783472 55876 ?      Dl   20:58   0:00 node REDACTED
REDACTEDUSER   25888  0.3  2.1 55727120 352756 ?     Dl   20:58   0:15 /home/REDACTED/.nvm/versions/node/v20.9.0/bin/node REDACTED/node_modules/nx/bin/run-executor.js
REDACTEDUSER   25905  0.3  2.1 55716208 343600 ?     Dl   20:58   0:16 /home/REDACTED/.nvm/versions/node/v20.9.0/bin/node REDACTED/node_modules/nx/bin/run-executor.js
REDACTEDUSER   38172  0.8  6.8 6191248 1100084 ?     Sl   21:34   0:24 node /home/REDACTED/.nvm/versions/node/v20.9.0/bin/pnpm i --frozen-lockfile
root       45559  0.0  0.0 222316  2176 pts/1    S+   22:21   0:00 grep --color=auto node
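
The first ps command above can be turned into a small filter that prints only the PIDs of tasks blocked in an io_uring wait channel. This is a sketch, not a definitive detector: the helper name find_io_uring_hangs is hypothetical, and it assumes the column order of ps -eo ppid,pid,user,stat,pcpu,comm,wchan:32 with no spaces in the user name.

```shell
# Hypothetical helper: read `ps -eo ppid,pid,user,stat,pcpu,comm,wchan:32`
# output on stdin and print the PID (column 2) of every task whose STAT
# (column 4) starts with "D" and whose wait channel (last column) starts
# with "io_uring".
find_io_uring_hangs() {
    awk '$4 ~ /^D/ && $NF ~ /^io_uring/ { print $2 }'
}

# Example usage on a live host:
# ps -eo ppid,pid,user,stat,pcpu,comm,wchan:32 | find_io_uring_hangs
```

Using the last column for the wait channel keeps the filter working even when the command name contains spaces.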

To reproduce

Both yarn install --immutable and pnpm i --frozen-lockfile reproduce the same bug. The hangs started roughly a week before this report.

Upstream bug report with known issue

This issue was reported to NodeJS and confirmed to be a Linux Kernel bug.

nodejs/node#55587 (comment)

Buggy patch: gregkh/linux@f4ce3b5

Upstream backport fix

Fix patch: gregkh/linux@8d09a88

I confirmed that the Amazon Linux 2023 kernel source does not have this patch.

# grep -F 'iopoll syncs against uring_lock, not completion_lock' linux-6.1.115/linux-6.1.115/io_uring/io_uring.c
		/* iopoll syncs against uring_lock, not completion_lock *

Because the above line still exists in the source, the backported fix patch has not been applied.

Confirming Amazon Linux 2023 kernel has the bug

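# Install kernel-devel and download the kernel source RPM
dnf install kernel-devel
dnf download --source kernel
# Unpack the source RPM into a scratch directory
mkdir extracted-kernel
cd extracted-kernel/
rpm2cpio ../kernel-6.1.115-126.197.amzn2023.src.rpm | cpio -idmv
# Extract the kernel tarball and locate io_uring.c
mkdir linux-6.1.115
tar -xC linux-6.1.115 -f linux-6.1.115.tar
find linux-6.1.115 -type f -name 'io_uring.c'
# Check whether io_uring.c contains this comment string
grep -F 'For silly syzbot cases that deliberately overflow by huge' linux-6.1.115/linux-6.1.115/io_uring/io_uring.c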
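
The grep check from the previous section can be wrapped so it is easy to rerun against a newer kernel source tree. A minimal sketch: the helper name check_io_uring_source and its output labels are hypothetical, and the interpretation follows this report (if the "iopoll syncs ..." comment is still present, the fix has not been applied). An absent marker only means that comment is gone; it is not proof by itself that the fix is present.

```shell
# Hypothetical helper: look for the comment string used in this report as the
# marker for "fix not applied" in a given io_uring.c.
check_io_uring_source() {
    src=$1
    if [ ! -f "$src" ]; then
        echo "missing"
        return 2
    fi
    if grep -qF 'iopoll syncs against uring_lock, not completion_lock' "$src"; then
        echo "fix-not-applied"
        return 1
    fi
    echo "marker-absent"
    return 0
}

# Example usage, with the path from the extraction steps above:
# check_io_uring_source linux-6.1.115/linux-6.1.115/io_uring/io_uring.c
```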

sgleske-ias changed the title from "[Bug] - NodeJS npm, pnpm, and yarn install hang due to backported Kernel bug." to "[Bug] - NodeJS npm, pnpm, and yarn install hang due to backported Kernel bug" on Dec 3, 2024

Closing this as a duplicate; AL23 will roll out the fix on Dec 9th: #840 (comment)
