Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

kstat: allow multi-level module names #17142

Closed
wants to merge 2 commits into from

Conversation

robn
Copy link
Member

@robn robn commented Mar 13, 2025

[Sponsors: Klara, Inc., Syneto]

Motivation and Context

In #16200 I added support for multi-level kstat nodes so that I could more easily add per-vdev queue stats. I intend to come back to that PR, but I have separately wanted this support in another project, so I've lifted it out.

In general, this gives us an easy way to create a richer tree of stats, by building on the simple one-level support we have for zfs/<pool>. With this, we could avoid the mess we have for objset-0xXX dataset stats.

Description

Both Linux and FreeBSD implementations of kstat_create() have support for creating a single sub-level stat base, by splitting on a single / char and creating an additional intermediate node. This PR extends that support by turning that into a loop, such that each / creates a new intermediate node.

"Intermediate node" here means an extra SYSCTL_ADD_NODE() on FreeBSD, or an extra proc_mkdir() on Linux.

There's extra teardown code needed on Linux to remove the intermediate directories when the last concrete kstat object is removed. FreeBSD doesn't have this problem (I think), at least because the intermediate nodes don't seem to "exist" except as a hook point for other kstats to hang off. I can and will do more here if asked, but this is the same situation as the previous code has, so its kinda half each way.

There's one extra commit here. Previously, if you tried to create a second kstat with the same module & name as an existing one, it would be ignored. This changes it so the old one is removed and the new one added. This also came from #16200 because during import, we create a vdev_t for each drive (which create kstat nodes) and then as we transition to a trusted config, we create another vdev_t before destroying the old one, which creates another set of kstats to replace the first ones.

That behaviour is of course specific to that PR, but it does raise the question of what should happen if we try to create a kstat a second time. "Keep the first" was mostly an accident of the implementation, and never happened in normal operation anyway. The other options are to take the second, or reject it until explicitly removed. This commit is making a choice, but I don't want to hold this PR up on it if there's strong opinions. I'm happy to drop it for now, or move it to another PR, or discuss it when I get back to #16200. I think it's pretty benign as-is, and obviously something we can change down the track if its not working for us.

How Has This Been Tested?

There's no actual stats changes here, so this should be an effective no-op change (only once round the loop is the same as an if (...). That's not much fun though, so I've made a demo commit robn/zfs@e18264f that moves the existing zfs/<poolname> stats to zfs/pool/<poolname> and zfs/<poolname>/objset-<id> to zfs/pool/<poolname>/dataset/<id>.

On Linux, that gives us:

    # find /proc/spl/kstat/zfs/pool
    /proc/spl/kstat/zfs/pool
    /proc/spl/kstat/zfs/pool/tank
    /proc/spl/kstat/zfs/pool/tank/guid
    /proc/spl/kstat/zfs/pool/tank/state
    /proc/spl/kstat/zfs/pool/tank/dataset
    /proc/spl/kstat/zfs/pool/tank/dataset/0x36
    /proc/spl/kstat/zfs/pool/tank/iostats
    /proc/spl/kstat/zfs/pool/tank/dmu_tx_assign
    /proc/spl/kstat/zfs/pool/tank/ddt_stats_edonr
    /proc/spl/kstat/zfs/pool/tank/ddt_stats_skein
    /proc/spl/kstat/zfs/pool/tank/ddt_stats_blake3
    /proc/spl/kstat/zfs/pool/tank/ddt_stats_sha256
    /proc/spl/kstat/zfs/pool/tank/ddt_stats_sha512

    # cat /proc/spl/kstat/zfs/pool/tank/state
    ONLINE

    # head /proc/spl/kstat/zfs/pool/tank/dataset/0x36
    151 1 0x01 27 7600 23955484523 216194931914
    name                            type data
    dataset_name                    7    tank
    writes                          4    0
    nwritten                        4    0
    reads                           4    0
    nread                           4    0
    nunlinks                        4    0
    nunlinked                       4    0
    zil_commit_count                4    0
    ...

and on FreeBSD:

    # sysctl -a kstat.zfs.pool | head
    kstat.zfs.pool.tank.dataset.dataset.0x36.zil_itx_metaslab_slog_alloc: 0
    kstat.zfs.pool.tank.dataset.dataset.0x36.zil_itx_metaslab_slog_write: 0
    kstat.zfs.pool.tank.dataset.dataset.0x36.zil_itx_metaslab_slog_bytes: 0
    kstat.zfs.pool.tank.dataset.dataset.0x36.zil_itx_metaslab_slog_count: 0
    kstat.zfs.pool.tank.dataset.dataset.0x36.zil_itx_metaslab_normal_alloc: 0
    kstat.zfs.pool.tank.dataset.dataset.0x36.zil_itx_metaslab_normal_write: 0
    kstat.zfs.pool.tank.dataset.dataset.0x36.zil_itx_metaslab_normal_bytes: 0
    kstat.zfs.pool.tank.dataset.dataset.0x36.zil_itx_metaslab_normal_count: 0
    kstat.zfs.pool.tank.dataset.dataset.0x36.zil_itx_needcopy_bytes: 0
    kstat.zfs.pool.tank.dataset.dataset.0x36.zil_itx_needcopy_count: 0

    # sysctl kstat.zfs.pool.tank.misc.state
    kstat.zfs.pool.tank.misc.state: ONLINE

    # sysctl kstat.zfs.pool.tank.dataset.dataset.0x36.dataset_name
    kstat.zfs.pool.tank.dataset.dataset.0x36.dataset_name: tank

Note the extra misc and dataset nodes are part of the existing implementation; I'm not here to remake the world just yet.

Meanwhile, my own ZTS run is progressing nicely. I'm not expecting any issues, though we never do so who knows 😅.

Types of changes

  • Bug fix (non-breaking change which fixes an issue)
  • New feature (non-breaking change which adds functionality)
  • Performance enhancement (non-breaking change which improves efficiency)
  • Code cleanup (non-breaking change which makes code smaller or more readable)
  • Breaking change (fix or feature that would cause existing functionality to change)
  • Library ABI change (libzfs, libzfs_core, libnvpair, libuutil and libzfsbootenv)
  • Documentation (a change to man pages or other documentation)

Checklist:

@amotin
Copy link
Member

amotin commented Mar 13, 2025

FreeBSD doesn't have this problem (I think), at least because the intermediate nodes don't seem to "exist" except as a hook point for other kstats to hang off.

On FreeBSD sysctl_ctx_free(&ksp->ks_sysctl_ctx) called by kstat_delete() should automatically delete everything belonging to the context.

This changes it so the old one is removed and the new one added.

I haven't looked deep, but I wonder what happen when the original one is finally deleted?

@amotin amotin added the Status: Code Review Needed Ready for review and testing label Mar 13, 2025
@amotin
Copy link
Member

amotin commented Mar 13, 2025

  [   27.590437] VERIFY0(module->ksm_nchildren) failed (0 == 32)
  [   27.592523] PANIC at spl-kstat.c:390:kstat_delete_module()
  [   27.594384] Showing stack for process 2033
  [   27.596193] CPU: 0 PID: 2033 Comm: zpool Tainted: P           OE     -------- -  - 4.18.0-553.44.1.el8_10.x86_64 #1
  [   27.601298] Hardware name: QEMU Standard PC (Q35 + ICH9, 2009), BIOS 1.16.3-debian-1.16.3-2 04/01/2014
  [   27.606614] Call Trace:
  [   27.607831]  dump_stack+0x41/0x60
  [   27.609722]  spl_panic+0xd0/0xe8 [spl]
  [   27.613067]  ? srso_alias_return_thunk+0x5/0xfcdfd
  [   27.614619]  ? schedule+0x55/0xf0
  [   27.615794]  ? srso_alias_return_thunk+0x5/0xfcdfd
  [   27.617902]  ? schedule_timeout+0x281/0x320
  [   27.619323]  ? srso_alias_return_thunk+0x5/0xfcdfd
  [   27.621352]  ? __smp_call_single_queue+0xa1/0x140
  [   27.623413]  ? srso_alias_return_thunk+0x5/0xfcdfd
  [   27.625354]  ? srso_alias_return_thunk+0x5/0xfcdfd
  [   27.627116]  ? srso_alias_return_thunk+0x5/0xfcdfd
  [   27.629220]  ? srso_alias_return_thunk+0x5/0xfcdfd
  [   27.631339]  ? srso_alias_return_thunk+0x5/0xfcdfd
  [   27.633392]  ? kmem_cache_free+0x2d6/0x300
  [   27.635322]  kstat_delete_module+0x15d/0x170 [spl]
  [   27.637302]  kstat_proc_entry_delete+0xcd/0x120 [spl]
  [   27.639402]  __kstat_delete+0x15/0x80 [spl]
  [   27.641214]  spa_guid_destroy+0x1a/0x50 [zfs]
  [   27.644019]  spa_remove+0x1c4/0x700 [zfs]
  [   27.646762]  spa_create+0x54a/0xd80 [zfs]
  [   27.649077]  ? get_nvlist+0x8e/0x120 [zfs]
  [   27.651135]  ? srso_alias_return_thunk+0x5/0xfcdfd
  [   27.653298]  ? kfree+0xd3/0x250
  [   27.654648]  zfs_ioc_pool_create+0x121/0x300 [zfs]

@robn robn force-pushed the kstat-multi-level branch 3 times, most recently from ac9f08f to 3461c5d Compare March 17, 2025 06:26
@robn
Copy link
Member Author

robn commented Mar 17, 2025

[ 27.590437] VERIFY0(module->ksm_nchildren) failed (0 == 32)
[ 27.592523] PANIC at spl-kstat.c:390:kstat_delete_module()

I'm pretty sure this was failing to initialise ksm_nchildren. Last push uses kmem_zalloc instead to zero the whole thing. Also some extra asserts to make sure the module lock is held when messing with the tree (not the problem, was just my first guess, but why not leave them in there now).

If this round of tests pass I'll consider that bit good.

@robn
Copy link
Member Author

robn commented Mar 17, 2025

On FreeBSD sysctl_ctx_free(&ksp->ks_sysctl_ctx) called by kstat_delete() should automatically delete everything belonging to the context.

Great, thanks for confirming that!

I haven't looked deep, but I wonder what happen when the original one is finally deleted?

We remove the old name from sysctl (kstat_install_named()) or proc (kstat_proc_entry_install()), but the kstat_t still exists and does whatever it does. When destroyed, it's already gone from sysctl/proc, so we just skip that bit.

@amotin
Copy link
Member

amotin commented Mar 17, 2025

The test failures now seem to be caused by #17145 . I'd appreciate some reviews.

Copy link
Contributor

@tonyhutter tonyhutter left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I gave this a test with @robn's demo commit and it worked as advertised.

@tonyhutter
Copy link
Contributor

@robn this looks ready to merge - can you squash and rebase it?

@robn robn force-pushed the kstat-multi-level branch from 3461c5d to 1e2b5d9 Compare March 20, 2025 00:01
robn added 2 commits March 20, 2025 12:31
Module names are mapped directly to directory names in procfs, but
nothing is done to create the intermediate directories, or remove them.
This makes it impossible to sensibly present kstats about sub-objects.

This commit loops through '/'-separated names in the full module name,
creates a separate module for each, and hooks them up with a parent
pointer and child counter, and then unrolls this on the other side when
deleting a module.

Sponsored-by: Klara, Inc.
Sponsored-by: Syneto
Signed-off-by: Rob Norris <[email protected]>
This extends the existing special-case for zfs/poolname to split and
create any number of intermediate sysctl names, so that multi-level
module names are possible.

Sponsored-by: Klara, Inc.
Sponsored-by: Syneto
Signed-off-by: Rob Norris <[email protected]>
@robn robn force-pushed the kstat-multi-level branch from 1e2b5d9 to 911c8dc Compare March 20, 2025 01:31
@tonyhutter
Copy link
Contributor

Merged as:
45e9b54 freebsd/kstat: allow multi-level module names
d28d2e3 linux/kstat: allow multi-level module names

@tonyhutter tonyhutter closed this Mar 20, 2025
@robn
Copy link
Member Author

robn commented Mar 20, 2025

Rad, thanks!

@robn robn mentioned this pull request Mar 21, 2025
13 tasks
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Status: Code Review Needed Ready for review and testing
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants