kstat: allow multi-level module names #17142

robn · 2025-03-13T03:21:16Z

[Sponsors: Klara, Inc., Syneto]

Motivation and Context

In #16200 I added support for multi-level kstat nodes so that I could more easily add per-vdev queue stats. I intend to come back to that PR, but I have separately wanted this support in another project, so I've lifted it out.

In general, this gives us an easy way to create a richer tree of stats, by building on the simple one-level support we have for zfs/<pool>. With this, we could avoid the mess we have for objset-0xXX dataset stats.

Description

Both Linux and FreeBSD implementations of kstat_create() have support for creating a single sub-level stat base, by splitting on a single / char and creating an additional intermediate node. This PR extends that support by turning that into a loop, such that each / creates a new intermediate node.

"Intermediate node" here means an extra SYSCTL_ADD_NODE() on FreeBSD, or an extra proc_mkdir() on Linux.

There's extra teardown code needed on Linux to remove the intermediate directories when the last concrete kstat object is removed. FreeBSD doesn't have this problem (I think), at least because the intermediate nodes don't seem to "exist" except as a hook point for other kstats to hang off. I can and will do more here if asked, but this is the same situation as the previous code has, so its kinda half each way.

There's one extra commit here. Previously, if you tried to create a second kstat with the same module & name as an existing one, it would be ignored. This changes it so the old one is removed and the new one added. This also came from #16200 because during import, we create a vdev_t for each drive (which create kstat nodes) and then as we transition to a trusted config, we create another vdev_t before destroying the old one, which creates another set of kstats to replace the first ones.

That behaviour is of course specific to that PR, but it does raise the question of what should happen if we try to create a kstat a second time. "Keep the first" was mostly an accident of the implementation, and never happened in normal operation anyway. The other options are to take the second, or reject it until explicitly removed. This commit is making a choice, but I don't want to hold this PR up on it if there's strong opinions. I'm happy to drop it for now, or move it to another PR, or discuss it when I get back to #16200. I think it's pretty benign as-is, and obviously something we can change down the track if its not working for us.

How Has This Been Tested?

There's no actual stats changes here, so this should be an effective no-op change (only once round the loop is the same as an if (...). That's not much fun though, so I've made a demo commit robn/zfs@e18264f that moves the existing zfs/<poolname> stats to zfs/pool/<poolname> and zfs/<poolname>/objset-<id> to zfs/pool/<poolname>/dataset/<id>.

On Linux, that gives us:

    # find /proc/spl/kstat/zfs/pool
    /proc/spl/kstat/zfs/pool
    /proc/spl/kstat/zfs/pool/tank
    /proc/spl/kstat/zfs/pool/tank/guid
    /proc/spl/kstat/zfs/pool/tank/state
    /proc/spl/kstat/zfs/pool/tank/dataset
    /proc/spl/kstat/zfs/pool/tank/dataset/0x36
    /proc/spl/kstat/zfs/pool/tank/iostats
    /proc/spl/kstat/zfs/pool/tank/dmu_tx_assign
    /proc/spl/kstat/zfs/pool/tank/ddt_stats_edonr
    /proc/spl/kstat/zfs/pool/tank/ddt_stats_skein
    /proc/spl/kstat/zfs/pool/tank/ddt_stats_blake3
    /proc/spl/kstat/zfs/pool/tank/ddt_stats_sha256
    /proc/spl/kstat/zfs/pool/tank/ddt_stats_sha512

    # cat /proc/spl/kstat/zfs/pool/tank/state
    ONLINE

    # head /proc/spl/kstat/zfs/pool/tank/dataset/0x36
    151 1 0x01 27 7600 23955484523 216194931914
    name                            type data
    dataset_name                    7    tank
    writes                          4    0
    nwritten                        4    0
    reads                           4    0
    nread                           4    0
    nunlinks                        4    0
    nunlinked                       4    0
    zil_commit_count                4    0
    ...

and on FreeBSD:

    # sysctl -a kstat.zfs.pool | head
    kstat.zfs.pool.tank.dataset.dataset.0x36.zil_itx_metaslab_slog_alloc: 0
    kstat.zfs.pool.tank.dataset.dataset.0x36.zil_itx_metaslab_slog_write: 0
    kstat.zfs.pool.tank.dataset.dataset.0x36.zil_itx_metaslab_slog_bytes: 0
    kstat.zfs.pool.tank.dataset.dataset.0x36.zil_itx_metaslab_slog_count: 0
    kstat.zfs.pool.tank.dataset.dataset.0x36.zil_itx_metaslab_normal_alloc: 0
    kstat.zfs.pool.tank.dataset.dataset.0x36.zil_itx_metaslab_normal_write: 0
    kstat.zfs.pool.tank.dataset.dataset.0x36.zil_itx_metaslab_normal_bytes: 0
    kstat.zfs.pool.tank.dataset.dataset.0x36.zil_itx_metaslab_normal_count: 0
    kstat.zfs.pool.tank.dataset.dataset.0x36.zil_itx_needcopy_bytes: 0
    kstat.zfs.pool.tank.dataset.dataset.0x36.zil_itx_needcopy_count: 0

    # sysctl kstat.zfs.pool.tank.misc.state
    kstat.zfs.pool.tank.misc.state: ONLINE

    # sysctl kstat.zfs.pool.tank.dataset.dataset.0x36.dataset_name
    kstat.zfs.pool.tank.dataset.dataset.0x36.dataset_name: tank

Note the extra misc and dataset nodes are part of the existing implementation; I'm not here to remake the world just yet.

Meanwhile, my own ZTS run is progressing nicely. I'm not expecting any issues, though we never do so who knows 😅.

Types of changes

Bug fix (non-breaking change which fixes an issue)
New feature (non-breaking change which adds functionality)
Performance enhancement (non-breaking change which improves efficiency)
Code cleanup (non-breaking change which makes code smaller or more readable)
Breaking change (fix or feature that would cause existing functionality to change)
Library ABI change (libzfs, libzfs_core, libnvpair, libuutil and libzfsbootenv)
Documentation (a change to man pages or other documentation)

Checklist:

My code follows the OpenZFS code style requirements.
I have updated the documentation accordingly.
I have read the contributing document.
I have added tests to cover my changes.
I have run the ZFS Test Suite with this change applied.
All commit messages are properly formatted and contain Signed-off-by.

module/os/freebsd/spl/spl_kstat.c

amotin · 2025-03-13T16:23:04Z

FreeBSD doesn't have this problem (I think), at least because the intermediate nodes don't seem to "exist" except as a hook point for other kstats to hang off.

On FreeBSD sysctl_ctx_free(&ksp->ks_sysctl_ctx) called by kstat_delete() should automatically delete everything belonging to the context.

This changes it so the old one is removed and the new one added.

I haven't looked deep, but I wonder what happen when the original one is finally deleted?

amotin · 2025-03-13T18:39:45Z

  [   27.590437] VERIFY0(module->ksm_nchildren) failed (0 == 32)
  [   27.592523] PANIC at spl-kstat.c:390:kstat_delete_module()
  [   27.594384] Showing stack for process 2033
  [   27.596193] CPU: 0 PID: 2033 Comm: zpool Tainted: P           OE     -------- -  - 4.18.0-553.44.1.el8_10.x86_64 #1
  [   27.601298] Hardware name: QEMU Standard PC (Q35 + ICH9, 2009), BIOS 1.16.3-debian-1.16.3-2 04/01/2014
  [   27.606614] Call Trace:
  [   27.607831]  dump_stack+0x41/0x60
  [   27.609722]  spl_panic+0xd0/0xe8 [spl]
  [   27.613067]  ? srso_alias_return_thunk+0x5/0xfcdfd
  [   27.614619]  ? schedule+0x55/0xf0
  [   27.615794]  ? srso_alias_return_thunk+0x5/0xfcdfd
  [   27.617902]  ? schedule_timeout+0x281/0x320
  [   27.619323]  ? srso_alias_return_thunk+0x5/0xfcdfd
  [   27.621352]  ? __smp_call_single_queue+0xa1/0x140
  [   27.623413]  ? srso_alias_return_thunk+0x5/0xfcdfd
  [   27.625354]  ? srso_alias_return_thunk+0x5/0xfcdfd
  [   27.627116]  ? srso_alias_return_thunk+0x5/0xfcdfd
  [   27.629220]  ? srso_alias_return_thunk+0x5/0xfcdfd
  [   27.631339]  ? srso_alias_return_thunk+0x5/0xfcdfd
  [   27.633392]  ? kmem_cache_free+0x2d6/0x300
  [   27.635322]  kstat_delete_module+0x15d/0x170 [spl]
  [   27.637302]  kstat_proc_entry_delete+0xcd/0x120 [spl]
  [   27.639402]  __kstat_delete+0x15/0x80 [spl]
  [   27.641214]  spa_guid_destroy+0x1a/0x50 [zfs]
  [   27.644019]  spa_remove+0x1c4/0x700 [zfs]
  [   27.646762]  spa_create+0x54a/0xd80 [zfs]
  [   27.649077]  ? get_nvlist+0x8e/0x120 [zfs]
  [   27.651135]  ? srso_alias_return_thunk+0x5/0xfcdfd
  [   27.653298]  ? kfree+0xd3/0x250
  [   27.654648]  zfs_ioc_pool_create+0x121/0x300 [zfs]

robn · 2025-03-17T10:27:25Z

[ 27.590437] VERIFY0(module->ksm_nchildren) failed (0 == 32)
[ 27.592523] PANIC at spl-kstat.c:390:kstat_delete_module()

I'm pretty sure this was failing to initialise ksm_nchildren. Last push uses kmem_zalloc instead to zero the whole thing. Also some extra asserts to make sure the module lock is held when messing with the tree (not the problem, was just my first guess, but why not leave them in there now).

If this round of tests pass I'll consider that bit good.

robn · 2025-03-17T10:32:57Z

On FreeBSD sysctl_ctx_free(&ksp->ks_sysctl_ctx) called by kstat_delete() should automatically delete everything belonging to the context.

Great, thanks for confirming that!

I haven't looked deep, but I wonder what happen when the original one is finally deleted?

We remove the old name from sysctl (kstat_install_named()) or proc (kstat_proc_entry_install()), but the kstat_t still exists and does whatever it does. When destroyed, it's already gone from sysctl/proc, so we just skip that bit.

amotin · 2025-03-17T14:44:27Z

The test failures now seem to be caused by #17145 . I'd appreciate some reviews.

tonyhutter

I gave this a test with @robn's demo commit and it worked as advertised.

tonyhutter · 2025-03-19T21:56:30Z

@robn this looks ready to merge - can you squash and rebase it?

Module names are mapped directly to directory names in procfs, but nothing is done to create the intermediate directories, or remove them. This makes it impossible to sensibly present kstats about sub-objects. This commit loops through '/'-separated names in the full module name, creates a separate module for each, and hooks them up with a parent pointer and child counter, and then unrolls this on the other side when deleting a module. Sponsored-by: Klara, Inc. Sponsored-by: Syneto Signed-off-by: Rob Norris <[email protected]>

This extends the existing special-case for zfs/poolname to split and create any number of intermediate sysctl names, so that multi-level module names are possible. Sponsored-by: Klara, Inc. Sponsored-by: Syneto Signed-off-by: Rob Norris <[email protected]>

tonyhutter · 2025-03-20T23:26:31Z

Merged as:
45e9b54 freebsd/kstat: allow multi-level module names
d28d2e3 linux/kstat: allow multi-level module names

robn · 2025-03-20T23:39:43Z

Rad, thanks!

amotin reviewed Mar 13, 2025

View reviewed changes

module/os/freebsd/spl/spl_kstat.c Outdated Show resolved Hide resolved

amotin added the Status: Code Review Needed Ready for review and testing label Mar 13, 2025

robn force-pushed the kstat-multi-level branch 3 times, most recently from ac9f08f to 3461c5d Compare March 17, 2025 06:26

tonyhutter approved these changes Mar 18, 2025

View reviewed changes

amotin approved these changes Mar 19, 2025

View reviewed changes

robn force-pushed the kstat-multi-level branch from 3461c5d to 1e2b5d9 Compare March 20, 2025 00:01

robn added 2 commits March 20, 2025 12:31

robn force-pushed the kstat-multi-level branch from 1e2b5d9 to 911c8dc Compare March 20, 2025 01:31

tonyhutter closed this Mar 20, 2025

robn mentioned this pull request Mar 21, 2025

vdev queue stats #16200

Open

13 tasks

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Uh oh!

kstat: allow multi-level module names #17142

kstat: allow multi-level module names #17142

Uh oh!

robn commented Mar 13, 2025

Uh oh!

Uh oh!

amotin commented Mar 13, 2025

Uh oh!

amotin commented Mar 13, 2025

Uh oh!

robn commented Mar 17, 2025

Uh oh!

robn commented Mar 17, 2025 •

edited

Loading

Uh oh!

amotin commented Mar 17, 2025 •

edited

Loading

Uh oh!

tonyhutter left a comment

Uh oh!

tonyhutter commented Mar 19, 2025

Uh oh!

tonyhutter commented Mar 20, 2025

Uh oh!

robn commented Mar 20, 2025

Uh oh!

Uh oh!

kstat: allow multi-level module names #17142

kstat: allow multi-level module names #17142

Uh oh!

Conversation

robn commented Mar 13, 2025

Motivation and Context

Description

How Has This Been Tested?

Types of changes

Checklist:

Uh oh!

Uh oh!

amotin commented Mar 13, 2025

Uh oh!

amotin commented Mar 13, 2025

Uh oh!

robn commented Mar 17, 2025

Uh oh!

robn commented Mar 17, 2025 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

amotin commented Mar 17, 2025 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

tonyhutter left a comment

Choose a reason for hiding this comment

Uh oh!

tonyhutter commented Mar 19, 2025

Uh oh!

tonyhutter commented Mar 20, 2025

Uh oh!

robn commented Mar 20, 2025

Uh oh!

Uh oh!

robn commented Mar 17, 2025 •

edited

Loading

amotin commented Mar 17, 2025 •

edited

Loading