[FEATURE] System-Wide VFS Latency Tracing and Instrumentation via Kernel Notes (sched/note) #18687

@Sumit6307

Is your feature request related to a problem? Please describe.

Yes. Building on the recent addition of cumulative VFS performance profiling (#18607), there is a significant limitation in our current observability: cumulative metrics (averages and total counts) inevitably mask transient performance "jitter" and worst-case execution time (WCET) spikes.

In mission-critical RTOS environments, a single filesystem write taking 500ms is a failure, even if the average over 1,000 calls is a healthy 1ms. Currently, NuttX lacks a low-overhead, system-wide way to capture these specific temporal events and correlate them with the CPU scheduler or interrupt state without using heavyweight external hardware tracers.
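The arithmetic behind that example can be checked with a few lines of C. This is only an illustrative sketch: the helper names (`mean_latency_us`, `worst_latency_us`, `fill_example`) are hypothetical and not part of NuttX, and the sample values mirror the 500 ms / 1,000 calls scenario above.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Mean latency in microseconds: the kind of figure a cumulative
 * profiler can report from a running total and a call count.
 */

static uint32_t mean_latency_us(const uint32_t *samples, size_t n)
{
  uint64_t total = 0;
  size_t i;

  assert(n > 0);
  for (i = 0; i < n; i++)
    {
      total += samples[i];
    }

  return (uint32_t)(total / n);
}

/* Worst-case latency in microseconds: only recoverable when every
 * call is recorded individually.
 */

static uint32_t worst_latency_us(const uint32_t *samples, size_t n)
{
  uint32_t worst = 0;
  size_t i;

  for (i = 0; i < n; i++)
    {
      if (samples[i] > worst)
        {
          worst = samples[i];
        }
    }

  return worst;
}

/* Hypothetical workload: every call takes 500 us except one 500 ms
 * outlier placed at spike_index.
 */

static void fill_example(uint32_t *samples, size_t n, size_t spike_index)
{
  size_t i;

  for (i = 0; i < n; i++)
    {
      samples[i] = (i == spike_index) ? 500000u : 500u;
    }
}
```

With 1,000 samples filled this way, the mean comes out at 999 us, just under 1 ms, while the worst case is 500,000 us. The mean alone gives no hint that one call missed its deadline, or which one.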

Describe the solution you'd like

I would like to implement a high-resolution VFS tracing mechanism integrated into the existing sched/note (Kernel Note) framework. This solution will allow developers to record the precise start and end of VFS operations in a compact binary format that is compatible with professional tracing tools like SystemView, ftrace/trace-cmd, and TraceCompass.

The solution should include:

  1. Binary Tracepoints: Implementation of new note types (e.g., NOTE_VFS_WRITE_START, NOTE_VFS_WRITE_STOP) in include/nuttx/sched_note.h.
  2. Upper-Half Instrumentation: Adding trace hooks in the VFS layer (fs/vfs/) that emit these notes with minimal overhead.
  3. Contextual Data: Packing essential metadata into the trace notes, such as the file descriptor (FD), the requested byte count, and the actual return value.
  4. Configurability: A new Kconfig option CONFIG_FS_PROFILER_TRACE to ensure this logic is only compiled in when high-resolution debugging is required, maintaining zero impact on production builds otherwise.
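The four points above could fit together roughly as in the following standalone sketch. It is a mock-up under stated assumptions, not the proposed implementation: the record layout (`struct vfs_note_s`), the ring buffer, the fake clock, and the helpers `vfs_trace_emit`, `vfs_traced_write`, and `demo_slow_write` are all hypothetical, and the numeric values of `NOTE_VFS_WRITE_START` / `NOTE_VFS_WRITE_STOP` are placeholders. A real patch would route the records through the existing sched/note infrastructure instead of a private buffer.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Normally selected through Kconfig; defined here so the sketch compiles. */

#define CONFIG_FS_PROFILER_TRACE 1

/* Placeholder values; the real note types would be added to
 * include/nuttx/sched_note.h.
 */

enum vfs_note_type_e
{
  NOTE_VFS_WRITE_START = 1,
  NOTE_VFS_WRITE_STOP  = 2
};

/* Hypothetical compact record: timestamp, requested size, return value,
 * file descriptor, and note type.
 */

struct vfs_note_s
{
  uint64_t timestamp_ns;
  uint32_t nbytes;
  int32_t  result;
  int16_t  fd;
  uint8_t  type;
};

#define VFS_TRACE_NNOTES 64
static_assert((VFS_TRACE_NNOTES & (VFS_TRACE_NNOTES - 1)) == 0,
              "ring size must be a power of two");

static struct vfs_note_s g_vfs_notes[VFS_TRACE_NNOTES];
static uint32_t g_vfs_note_head;
static uint64_t g_fake_clock_ns;  /* Stand-in for a high-resolution timer */

#ifdef CONFIG_FS_PROFILER_TRACE
/* Store one record in the ring. Not interrupt- or SMP-safe; a real hook
 * would rely on the note framework for that.
 */

static void vfs_trace_emit(uint8_t type, int fd, size_t nbytes, long result)
{
  struct vfs_note_s *note =
    &g_vfs_notes[g_vfs_note_head++ & (VFS_TRACE_NNOTES - 1)];

  note->timestamp_ns = g_fake_clock_ns;
  note->nbytes       = (uint32_t)nbytes;
  note->result       = (int32_t)result;
  note->fd           = (int16_t)fd;
  note->type         = type;
}
#else
/* Compiles to nothing when tracing is disabled. */

#  define vfs_trace_emit(type, fd, nbytes, result) ((void)0)
#endif

typedef long (*vfs_write_fn)(int fd, const void *buf, size_t nbytes);

/* Upper-half style hook: bracket the real write with START/STOP notes. */

static long vfs_traced_write(vfs_write_fn do_write, int fd,
                             const void *buf, size_t nbytes)
{
  long ret;

  vfs_trace_emit(NOTE_VFS_WRITE_START, fd, nbytes, 0);
  ret = do_write(fd, buf, nbytes);
  vfs_trace_emit(NOTE_VFS_WRITE_STOP, fd, nbytes, ret);
  return ret;
}

/* Demo backend: pretends the write stalled for 500 ms, then succeeds. */

static long demo_slow_write(int fd, const void *buf, size_t nbytes)
{
  (void)fd;
  (void)buf;
  g_fake_clock_ns += 500000000ull;
  return (long)nbytes;
}
```

Calling `vfs_traced_write(demo_slow_write, 3, "abc", 3)` leaves two records in `g_vfs_notes`; subtracting the START timestamp from the STOP timestamp gives the latency of that specific call, together with the FD and return value it was issued with. Passing the backend as a function pointer is only a convenience for the mock; in the VFS layer the hooks would sit directly around the existing write path.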

Describe alternatives you've considered

  1. Existing Procfs Profiler: Useful for general health monitoring and CI regression testing, but because it keeps only cumulative totals and averages, it cannot show when a latency spike occurred or which specific call caused it.
  2. External Hardware Tracing (JTAG/SWO): Highly accurate but requires expensive hardware and specialized setups. A software-based sched_note approach is much more portable and accessible across different boards.
  3. Standard Syslog/Debug prints: These are too heavyweight and significantly alter the real-time behavior (observer effect) of the filesystem, making timing-specific bugs difficult to reproduce.

Verification

  • I have verified before submitting the report.
