Description
Workarounds for developers
- Upgrade to the latest NodeJS.
- Or, if you're stuck on your current NodeJS version, set the environment variable
export UV_USE_IO_URING=0
before running node processes.
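If exporting the variable for a whole session is undesirable, it can be scoped to individual commands. The following is a minimal sketch assuming a POSIX shell; `run_without_io_uring` is a hypothetical helper name, not something provided by NodeJS or libuv.

```shell
# Hypothetical helper: run a single command with UV_USE_IO_URING=0 set,
# without changing the environment of the calling shell.
run_without_io_uring() {
    env UV_USE_IO_URING=0 "$@"
}

# Example: the child process sees the variable.
run_without_io_uring sh -c 'echo "UV_USE_IO_URING=$UV_USE_IO_URING"'
```

In a CI job this would wrap the install step, for example `run_without_io_uring yarn install --immutable`.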
Bug description
Note: The host is an Amazon Linux 2023 (AL23) EC2 instance running its own kernel. The node processes hang inside an AL23 docker container, but containers share the host kernel, so users will see this bug inside or outside of docker whenever the host is running AL23.
yarn install --immutable
hangs indefinitely within Jenkins environments. node
processes get stuck in process state Dl
which, according to the ps man page, means:
D uninterruptible sleep (usually IO)
l is multi-threaded (using CLONE_THREAD, like NPTL pthreads do)
Dumping the blocked tasks to the kernel log with
echo w > /proc/sysrq-trigger
dmesg -T
returns logs like
[Tue Dec 3 21:27:04 2024] task:node state:D stack:0 pid:23041 ppid:21790 flags:0x00004002
[Tue Dec 3 21:27:04 2024] Call Trace:
[Tue Dec 3 21:27:04 2024] <TASK>
[Tue Dec 3 21:27:04 2024] __schedule+0x1ad/0x530
[Tue Dec 3 21:27:04 2024] schedule+0x5a/0xd0
[Tue Dec 3 21:27:04 2024] schedule_preempt_disabled+0x11/0x20
[Tue Dec 3 21:27:04 2024] __mutex_lock.constprop.0+0x372/0x6c0
[Tue Dec 3 21:27:04 2024] io_uring_del_tctx_node+0x5a/0xbf
[Tue Dec 3 21:27:04 2024] io_uring_clean_tctx+0x55/0xa7
[Tue Dec 3 21:27:04 2024] io_uring_cancel_generic+0x223/0x267
[Tue Dec 3 21:27:04 2024] ? membarrier_register_private_expedited+0x90/0x90
[Tue Dec 3 21:27:04 2024] do_exit+0x139/0x4a0
[Tue Dec 3 21:27:04 2024] do_group_exit+0x2d/0x80
[Tue Dec 3 21:27:04 2024] get_signal+0x857/0x880
[Tue Dec 3 21:27:04 2024] ? ep_eventpoll_poll+0x10/0x10
[Tue Dec 3 21:27:04 2024] arch_do_signal_or_restart+0x25/0x100
[Tue Dec 3 21:27:04 2024] ? do_epoll_wait+0x9d/0xe0
[Tue Dec 3 21:27:04 2024] ? do_compat_epoll_pwait.part.0+0xc/0x70
[Tue Dec 3 21:27:04 2024] ? __x64_sys_epoll_pwait+0x7f/0x130
[Tue Dec 3 21:27:04 2024] exit_to_user_mode_loop+0x9c/0x130
[Tue Dec 3 21:27:04 2024] exit_to_user_mode_prepare+0xb6/0xc0
[Tue Dec 3 21:27:04 2024] syscall_exit_to_user_mode+0x12/0x40
[Tue Dec 3 21:27:04 2024] do_syscall_64+0x42/0x80
[Tue Dec 3 21:27:04 2024] entry_SYSCALL_64_after_hwframe+0x6e/0xd8
[Tue Dec 3 21:27:04 2024] RIP: 0033:0x7faf21ebf728
[Tue Dec 3 21:27:04 2024] RSP: 002b:00007ffcb1420160 EFLAGS: 00000293 ORIG_RAX: 0000000000000119
[Tue Dec 3 21:27:04 2024] RAX: fffffffffffffffc RBX: 0000000000000004 RCX: 00007faf21ebf728
[Tue Dec 3 21:27:04 2024] RDX: 0000000000000400 RSI: 00007ffcb1420ec0 RDI: 0000000000000011
[Tue Dec 3 21:27:04 2024] RBP: 00007ffcb1423ef0 R08: 0000000000000000 R09: 0000000000000008
[Tue Dec 3 21:27:04 2024] R10: 00000000ffffffff R11: 0000000000000293 R12: 0000000000000000
[Tue Dec 3 21:27:04 2024] R13: 0000000005499800 R14: 000000000a7506d0 R15: 00000000ffffffff
[Tue Dec 3 21:27:04 2024] </TASK>
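On a busy host the sysrq dump can be long. A minimal sketch for narrowing it down, assuming the `dmesg -T` output is piped in on stdin; `blocked_io_uring_frames` is a hypothetical helper name:

```shell
# Hypothetical helper: from `dmesg -T` output on stdin, keep the header line
# of each task in state D and every stack frame that mentions io_uring.
blocked_io_uring_frames() {
    grep -E 'task:[^[:space:]]+[[:space:]]+state:D|io_uring_'
}

# Example using three lines from the trace above, so the sketch runs
# without needing a hung process.
printf '%s\n' \
    '[Tue Dec 3 21:27:04 2024] task:node state:D stack:0 pid:23041 ppid:21790 flags:0x00004002' \
    '[Tue Dec 3 21:27:04 2024] __schedule+0x1ad/0x530' \
    '[Tue Dec 3 21:27:04 2024] io_uring_del_tctx_node+0x5a/0xbf' |
    blocked_io_uring_frames
```

Against a live host this would be `dmesg -T | blocked_io_uring_frames`, run as root after the sysrq trigger.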
And ps output shows multiple processes in this state:
# ps -eo ppid,pid,user,stat,pcpu,comm,wchan:32 | grep " D"
21790 23041 REDACTEDUSER Dl 0.4 node io_uring_del_tctx_node
# ps aux | grep node
root 1037 0.0 0.0 0 0 ? I< 20:34 0:00 [xfs-inodegc/nvm]
REDACTEDUSER 25887 0.0 0.3 11783472 55876 ? Dl 20:58 0:00 node REDACTED
REDACTEDUSER 25888 0.3 2.1 55727120 352756 ? Dl 20:58 0:15 /home/REDACTED/.nvm/versions/node/v20.9.0/bin/node REDACTED/node_modules/nx/bin/run-executor.js
REDACTEDUSER 25905 0.3 2.1 55716208 343600 ? Dl 20:58 0:16 /home/REDACTED/.nvm/versions/node/v20.9.0/bin/node REDACTED/node_modules/nx/bin/run-executor.js
REDACTEDUSER 38172 0.8 6.8 6191248 1100084 ? Sl 21:34 0:24 node /home/REDACTED/.nvm/versions/node/v20.9.0/bin/pnpm i --frozen-lockfile
root 45559 0.0 0.0 222316 2176 pts/1 S+ 22:21 0:00 grep --color=auto node
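The `grep " D"` filter above matches any column containing " D", so it can pick up unrelated rows. A minimal sketch that matches on the STAT column only, assuming STAT is the second column of the input; `filter_d_state` is a hypothetical helper name:

```shell
# Hypothetical helper: read `ps -eo pid,stat,comm,wchan:32` output on stdin
# and print only the rows whose STAT column (field 2) starts with "D".
filter_d_state() {
    awk '$2 ~ /^D/'
}

# Example with captured rows, so the sketch runs without a hung process.
printf '%s\n' \
    '  PID STAT COMMAND WCHAN' \
    '23041 Dl   node    io_uring_del_tctx_node' \
    '38172 Sl   node    -' |
    filter_d_state
```

On a live host the equivalent would be `ps -eo pid,stat,comm,wchan:32 | filter_d_state`.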
To reproduce
Both yarn install --immutable
and pnpm i --frozen-lockfile
reproduce the same hang. The hangs began roughly last week.
Upstream bug report with known issue
This issue was reported to NodeJS and confirmed to be a Linux kernel bug.
Buggy patch: gregkh/linux@f4ce3b5
Upstream backport fix
Fix patch: gregkh/linux@8d09a88
I confirmed that the Amazon Linux 2023 kernel source code does not include this fix.
# grep -F 'iopoll syncs against uring_lock, not completion_lock' linux-6.1.115/linux-6.1.115/io_uring/io_uring.c
/* iopoll syncs against uring_lock, not completion_lock */
Because the above line is still present in the source, the backport fix has not been applied.
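The same reasoning can be wrapped in a small check that takes the path to an extracted io_uring.c. This is a sketch only: `check_backport` is a hypothetical helper name, and it relies on this report's assumption that the presence of the marker comment means the fix is missing.

```shell
# Hypothetical helper: report whether the marker comment from this report is
# still present in the given io_uring.c. Presence is taken to mean the
# backport fix has not been applied (an assumption from this report).
check_backport() {
    backport_marker='iopoll syncs against uring_lock, not completion_lock'
    if [ ! -r "$1" ]; then
        echo "cannot read: $1" >&2
        return 2
    fi
    if grep -qF "$backport_marker" "$1"; then
        echo "marker present: backport fix not applied"
    else
        echo "marker absent: backport fix appears to be applied"
    fi
}

# Example against a temporary stand-in file containing the marker line.
tmp=$(mktemp)
printf '%s\n' '/* iopoll syncs against uring_lock, not completion_lock */' > "$tmp"
check_backport "$tmp"
rm -f "$tmp"
```

With the kernel source extracted as in this report, the call would be `check_backport linux-6.1.115/linux-6.1.115/io_uring/io_uring.c`.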
Confirming Amazon Linux 2023 kernel has the bug
dnf install kernel-devel
dnf download --source kernel
mkdir extracted-kernel
cd extracted-kernel/
rpm2cpio ../kernel-6.1.115-126.197.amzn2023.src.rpm | cpio -idmv
mkdir linux-6.1.115
tar -xC linux-6.1.115 -f linux-6.1.115.tar
find linux-6.1.115 -type f -name 'io_uring.c'
grep -F 'For silly syzbot cases that deliberately overflow by huge' linux-6.1.115/linux-6.1.115/io_uring/io_uring.c