[BUG] Resource leak in SMP mode when running signal handler #14448

pussuw · 2024-10-22T11:22:06Z

Description / Steps to reproduce the issue

There is a serious issue with the current asynchronous signal delivery system; it will forcibly make another CPU resume
code at places where this must not happen. Take the following example where CPU0 takes a semaphore and CPU1 sends
a signal to it:

CPU0                                                CPU1
nxsem_wait() // Take semaphore                                     
enter_critical_section()                 
... in atomic section ...                           
up_switch_context(this_task(), rtcb)
---> next process , atomic section over             enter_critical_section()
                                                    nxsig_queue_action()
                                                    nxsched_smp_call_single()
                                                      // Setup interrupt on CPU0 to run sig_handler
                                                      nxsched_smp_call_single(stcb->cpu, sig_handler, &arg, true);
                                                    <--- SMP_CALL interrupt pends on CPU0
                                                    leave_critical_section()
---> SMP_CALL interrupt fires on CPU0
               |
               v
nxsched_smp_call_handler()
  // Run sig_handler on CPU0
  sig_handler()
  up_schedule_sigaction()
  // up_schedule_sigaction makes task on CPU1 return to riscv_sigdeliver
  tcb->xcp.regs[REG_EPC] = (uintptr_t)riscv_sigdeliver;
              |
              v
riscv_smp_call_handler()
  // riscv_smp_call_handler restores (new) context, EPC=riscv_sigdeliver
  tcb = current_task(cpu);
  riscv_savecontext(tcb);
  nxsched_process_delivered(cpu);
  tcb = current_task(cpu);
  riscv_restorecontext(tcb);
               |
               v
riscv_smp_call_handler() interrupt returns
               |
               v
riscv_sigdeliver()
               |
               v
signal_handler()
  // Signal handler runs in userspace
          ***CRASH***

If the process on CPU0 crashes in the signal handler, the semaphore taken on CPU0 does not get freed,
causing a resource leak.

The leak is not an issue for user resources but is catastrophic for kernel resources!

On which OS does this issue occur?

[OS: Linux]

What is the version of your OS?

Irrelevant

NuttX Version

master

Issue Architecture

[Arch: all]

Issue Area

[Area: Posix]

Verification

I have verified before submitting the report.

The text was updated successfully, but these errors were encountered:

pussuw · 2024-10-22T11:30:26Z

I noticed the issue with syslog(), though in this case it causes a deadlock. The problem is still the same and the leaked
resource is g_lowputs_lock.

static ssize_t syslog_default_write(FAR syslog_channel_t *channel,
                                    FAR const char *buffer, size_t buflen)
{
#  ifdef CONFIG_ARCH_LOWPUTC
  nxmutex_lock(&g_lowputs_lock);

  up_nputs(buffer, buflen);

  nxmutex_unlock(&g_lowputs_lock);
#  endif

  UNUSED(channel);
  return buflen;
}
#endif

All you need is for CPU0 to call syslog_default_write taking the g_lowputs_lock semaphore and CPU1 sending a signal to CPU0. When CPU0 starts running the signal dispatch logic it will deadlock if there is another syslog() call there. One example below:

nuttx/arch/risc-v/src/common/riscv_sigdeliver.c

Lines 55 to 72 in b4a6d45

    
           void riscv_sigdeliver(void) 
        
           { 
        
             struct tcb_s *rtcb = this_task(); 
        
             uintreg_t *regs = rtcb->xcp.saved_regs; 
        
           #ifdef CONFIG_SMP 
        
             /* In the SMP case, we must terminate the critical section while the signal 
        
              * handler executes, but we also need to restore the irqcount when the 
        
              * we resume the main thread of the task. 
        
              */ 
        
             int16_t saved_irqcount; 
        
           #endif 
        
             board_autoled_on(LED_SIGNAL); 
        
             sinfo("rtcb=%p sigdeliver=%p sigpendactionq.head=%p\n", 
        
                   rtcb, rtcb->sigdeliver, rtcb->sigpendactionq.head);

This is how it looks on the terminal

[CPU1] nxsig_tcbdispatch: TCB=0x80410ad0 pid=58 signo=1 code=0 value=0 masked=NO, svcall=YES
[CPU0] [CPU1] nxsig_tcbdispatch: TCB=0x80410ad0 pid=58 signo=2 code=0 value=0 masked=NO, svcall=YES
[CPU0] up_schedule_sigaction: tcb=0x80410ad0, rtcb=0x80410ad0 current_regs=0x80413720
[CPU0]

pussuw · 2024-10-22T12:32:54Z

Steps to reproduce:
Enable:
CONFIG_DEBUG_FEATURES=y
CONFIG_DEBUG_SCHED_INFO=y
tools/configure.sh rv-virt:ksmp64

Start qemu and run ostest.
It will get stuck in one of the signal handler tests almost every time.

pussuw added the Type: Bug Something isn't working label Oct 22, 2024

github-actions bot added Arch: all Issues that apply to all architectures Area: Posix Posix issues OS: Linux Issues related to Linux (building system, etc) labels Oct 22, 2024

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

[BUG] Resource leak in SMP mode when running signal handler #14448

[BUG] Resource leak in SMP mode when running signal handler #14448

pussuw commented Oct 22, 2024

pussuw commented Oct 22, 2024 •

edited

Loading

pussuw commented Oct 22, 2024

[BUG] Resource leak in SMP mode when running signal handler #14448

[BUG] Resource leak in SMP mode when running signal handler #14448

Comments

pussuw commented Oct 22, 2024

Description / Steps to reproduce the issue

On which OS does this issue occur?

What is the version of your OS?

NuttX Version

Issue Architecture

Issue Area

Verification

pussuw commented Oct 22, 2024 • edited Loading

pussuw commented Oct 22, 2024

pussuw commented Oct 22, 2024 •

edited

Loading