Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[BUG] Resource leak in SMP mode when running signal handler #14448

Open
1 task done
pussuw opened this issue Oct 22, 2024 · 2 comments
Open
1 task done

[BUG] Resource leak in SMP mode when running signal handler #14448

pussuw opened this issue Oct 22, 2024 · 2 comments
Labels
Arch: all Issues that apply to all architectures Area: Posix Posix issues OS: Linux Issues related to Linux (building system, etc) Type: Bug Something isn't working

Comments

@pussuw
Copy link
Contributor

pussuw commented Oct 22, 2024

Description / Steps to reproduce the issue

There is a serious issue with the current asynchronous signal delivery system; it will forcibly make another CPU resume
code at places where this must not happen. Take the following example where CPU0 takes a semaphore and CPU1 sends
a signal to it:

CPU0                                                CPU1
nxsem_wait() // Take semaphore                                     
enter_critical_section()                 
... in atomic section ...                           
up_switch_context(this_task(), rtcb)
---> next process , atomic section over             enter_critical_section()
                                                    nxsig_queue_action()
                                                    nxsched_smp_call_single()
                                                      // Setup interrupt on CPU0 to run sig_handler
                                                      nxsched_smp_call_single(stcb->cpu, sig_handler, &arg, true);
                                                    <--- SMP_CALL interrupt pends on CPU0
                                                    leave_critical_section()
---> SMP_CALL interrupt fires on CPU0
               |
               v
nxsched_smp_call_handler()
  // Run sig_handler on CPU0
  sig_handler()
  up_schedule_sigaction()
  // up_schedule_sigaction makes task on CPU1 return to riscv_sigdeliver
  tcb->xcp.regs[REG_EPC] = (uintptr_t)riscv_sigdeliver;
              |
              v
riscv_smp_call_handler()
  // riscv_smp_call_handler restores (new) context, EPC=riscv_sigdeliver
  tcb = current_task(cpu);
  riscv_savecontext(tcb);
  nxsched_process_delivered(cpu);
  tcb = current_task(cpu);
  riscv_restorecontext(tcb);
               |
               v
riscv_smp_call_handler() interrupt returns
               |
               v
riscv_sigdeliver()
               |
               v
signal_handler()
  // Signal handler runs in userspace
          ***CRASH***

If the process on CPU0 crashes in the signal handler, the semaphore taken on CPU0 does not get freed,
causing a resource leak.

The leak is not an issue for user resources but is catastrophic for kernel resources!

On which OS does this issue occur?

[OS: Linux]

What is the version of your OS?

Irrelevant

NuttX Version

master

Issue Architecture

[Arch: all]

Issue Area

[Area: Posix]

Verification

  • I have verified before submitting the report.
@pussuw pussuw added the Type: Bug Something isn't working label Oct 22, 2024
@github-actions github-actions bot added Arch: all Issues that apply to all architectures Area: Posix Posix issues OS: Linux Issues related to Linux (building system, etc) labels Oct 22, 2024
@pussuw
Copy link
Contributor Author

pussuw commented Oct 22, 2024

I noticed the issue with syslog(), though in this case it causes a deadlock. The problem is still the same and the leaked
resource is g_lowputs_lock.

static ssize_t syslog_default_write(FAR syslog_channel_t *channel,
                                    FAR const char *buffer, size_t buflen)
{
#  ifdef CONFIG_ARCH_LOWPUTC
  nxmutex_lock(&g_lowputs_lock);

  up_nputs(buffer, buflen);

  nxmutex_unlock(&g_lowputs_lock);
#  endif

  UNUSED(channel);
  return buflen;
}
#endif

All you need is for CPU0 to call syslog_default_write taking the g_lowputs_lock semaphore and CPU1 sending a signal to CPU0. When CPU0 starts running the signal dispatch logic it will deadlock if there is another syslog() call there. One example below:

void riscv_sigdeliver(void)
{
struct tcb_s *rtcb = this_task();
uintreg_t *regs = rtcb->xcp.saved_regs;
#ifdef CONFIG_SMP
/* In the SMP case, we must terminate the critical section while the signal
* handler executes, but we also need to restore the irqcount when the
* we resume the main thread of the task.
*/
int16_t saved_irqcount;
#endif
board_autoled_on(LED_SIGNAL);
sinfo("rtcb=%p sigdeliver=%p sigpendactionq.head=%p\n",
rtcb, rtcb->sigdeliver, rtcb->sigpendactionq.head);

This is how it looks on the terminal

[CPU1] nxsig_tcbdispatch: TCB=0x80410ad0 pid=58 signo=1 code=0 value=0 masked=NO, svcall=YES
[CPU0] [CPU1] nxsig_tcbdispatch: TCB=0x80410ad0 pid=58 signo=2 code=0 value=0 masked=NO, svcall=YES
[CPU0] up_schedule_sigaction: tcb=0x80410ad0, rtcb=0x80410ad0 current_regs=0x80413720
[CPU0] 

@pussuw
Copy link
Contributor Author

pussuw commented Oct 22, 2024

Steps to reproduce:
Enable:
CONFIG_DEBUG_FEATURES=y
CONFIG_DEBUG_SCHED_INFO=y
tools/configure.sh rv-virt:ksmp64

Start qemu and run ostest.
It will get stuck in one of the signal handler tests almost every time.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Arch: all Issues that apply to all architectures Area: Posix Posix issues OS: Linux Issues related to Linux (building system, etc) Type: Bug Something isn't working
Projects
None yet
Development

No branches or pull requests

1 participant