Proposal: collect mdraid metrics from sysfs instead of parsing /proc/mdstat #1085

dswarbrick · 2018-09-25T10:21:20Z

Proposal

I just discovered the that node_md_blocks_synced metric, which is currently parsed from /proc/mdstat is not very useful in trying to determine resync / check progress percentage of an array. To be honest, I'm not really sure what it can be used for, because the way in which it is parsed has little bearing to the other metrics that are parsed from /proc/mdstat.

Take the following example of an array rebuilding:

md300 : active raid5 sdh[8] sdc[0] sdj[7] sdi[6] sdg[4] sdf[3] sde[2] sdd[1]
      3281959800 blocks super 1.0 level 5, 8k chunk, algorithm 2 [8/7] [UUUUU_UU]
      [=====>...............]  recovery = 26.4% (124074752/468851400) finish=207.9min speed=27635K/sec
      bitmap: 3/4 pages [12KB], 65536KB chunk

Parsing this, node_exporter provides me with two nearly useful metrics:
node_md_blocks = 3281959800
node_md_blocks_synced = 124074752

One might be forgiven for thinking that one can simply divide node_md_blocks_synced by node_md_blocks and multiply by 100 to get a resync completion percentage. The problem is however that node_md_blocks is the whole array size expressed in 1 KB blocks (this is a ~3.1 TB array), whereas node_md_blocks_synced is the number of synced 1 KB blocks of the drive which is being rebuilt., i.e. one seventh of the of entire array (raid5, 8 members).

Dividing node_md_blocks_synced by node_md_blocks is going to yield a number that is seven times too small, and the only way you could possibly know that you need to multiply that number by seven would be if you knew a) the raid level, and b) the number of raid members.

The mdraid information contained in sysfs is a lot more detailed, and a lot easier to parse. The information that one would need to accurately calculate current rebuild progress can be read from /sys/block/md*/md/sync_completed, e.g.:

$ cat /sys/block/md300/md/sync_completed 
248115968 / 937702800

These numbers are in sectors, and the md admin guide literally says that you can divide these numbers to get a "fraction of the process that is complete."

Use case. Why is this important?
The mdraid information in sysfs is substantially more detailed and more machine-readable than /proc/mdstat. I don't know whether there are any guarantees about the stability of the format of /proc/mdstat, and I've never seen any indication that it is intended to be machine-readable; on the contrary, it is formatted to be human-friendly.

The only information from /proc/mdstat which I don't think can be found in sysfs is the very first line, e.g. Personalities : [raid0] [raid1] [raid6] [raid5] [raid4], but node_exporter does not expose that information anyway. Everything else can be found in /sys/block/md*/md/*.

I am prepared to provide a PR which would completely refactor the mdadm collector (with a healthy dose of unit tests), if the Prometheus developers are interested. Ultimately I suspect this should find its way into the prometheus/procfs package, alongside the other sysfs parsers.

The text was updated successfully, but these errors were encountered:

SuperQ · 2018-09-25T12:10:42Z

I think we've talked about this in the past. I'm very much in favor of using /sys parsing over /proc/mdstat. The only problem is has been lack of time to implement it.

We also would like to implement the parsing as a separate library, rather than put the code directly in the node_exporter like it is now.

We have the Prometheus procfs library which is our catch-all for /proc and /sys parsing. Would you be willing to do the parsing there?

dswarbrick · 2018-09-25T12:16:27Z

@SuperQ I forked prometheus/procfs recently, and had an initial poke around. My WIP is in https://github.com/dswarbrick/procfs/blob/mdraid-sysfs/sysfs/mdraid.go. The code is in its early stages - I intend to split up the function(s) a bit more.

frittentheke · 2020-06-11T14:09:07Z

@dswarbrick I just noticed that the sync / check phase is not really covered by the node_exporter currently (which makes it hard to i.e. supress disk utilization alerts when a check is running) and I was about to start implementing this.

Now I ran into this issue and your move to use sysfs as data source. Are you still pursuing this and would open a PR at some point?

dswarbrick · 2020-06-11T14:51:23Z

@frittentheke Sadly I let this slip off my radar for a long time. There was also some discussion over the "right" way to implement reading of a large number of sysfs files. I've seen some overly naive approaches fall flat on their face, especially when exact sysfs entries present can vary.

I am not currently working on this, but still interested in the topic, and happy to collaborate with you.

frittentheke · 2020-08-14T05:19:31Z

@dswarbrick I opened PR prometheus/procfs#321 and a corresponding one for the node exporter with #1810

jooola · 2021-03-07T09:35:39Z

@dswarbrick May you post a link to the discussion you had about the right way to read a large number of files in sysfs ?

dswarbrick · 2021-03-07T11:00:50Z

@jooola I don't have the exact link off the top of my head, but it mostly revolved around a couple of fragile approaches. One involved basically reading every file in a particular sysfs directory, and using the relevant values read from the files. This was suboptimal because a few files here and there are not world-readable, which of course resulted in errors. There are also some files that will return dangerously large amounts of data (think /proc/kcore) if not specifically limited in the read call. It's also not unheard of for buggy kernel modules to react badly when certain sysfs entries are read, which is another reason why the code should be as selective and specific as possible about what it reads - no more, no less.

There are some collectors which read a predefined list of files, e.g. for _, f := range [...]string{"foo", "bar"}, which is ok if the filenames are always present. The mdraid subsystem however is a bit more variable, and certain sysfs entries are only present with certain raid levels, so the code has to be more adaptive.

On a side note, possibly of interest to procfs / node_exporter in general, there was a new readfile syscall being discussed for Linux, which would open, read and close a file in a single syscall - i.e., something that would be quite useful for the procfs package - https://lwn.net/Articles/825324/

However, it seems like there is now more interest in using io_uring to read a large number of files with as few syscalls as possible.

jooola · 2021-03-07T12:14:40Z

I also stumbled on those emails, really interesting.

I also read about the idea to cache the file handler and re seek any necessary content for parsing. It uses more memory, but is way faster than opening/closing the desired file. But it might be dangerous, I don't know enough on this topic to say more.

SuperQ · 2021-03-07T14:10:34Z

It would be so much easier if the kernel just had a proper metrics interface. ☹️

jooola · 2021-03-07T15:12:14Z

@dswarbrick Do you mind if I pick up your current code and try to work on it ? What is missing ? Should it have feature equality with the current /proc/mdstat parser ?

dswarbrick · 2021-03-09T10:00:59Z

@jooola The kernel md admin guide describes how many of the md-related sysfs (and /proc/mdstat, IIRC) support select/poll, to efficiently watch for changes. I don't know how useful this is in the case of node_exporter, due to its "pull" nature, rather than "push" - i.e., it needs to read the current values exactly when scraped by Prometheus.

I have no objection to you picking up my code and continuing with it. Sadly I just haven't had the time to dedicate to this, with everything else on my radar most of the time. If it's going to be a drop-in replacement for the current /proc/mdstat parsing, it definitely needs to provide feature parity and expose the same interface as currently. So far, the only feature of /proc/mdstat I have found which doesn't have an equivalent in sysfs is the first line, which lists the raid personalities supported by the kernel, e.g.:

Personalities : [linear] [raid0] [raid1] [raid6] [raid5] [raid4] [raid10]

AFAIK this is unused by the procfs package.

dswarbrick · 2023-04-24T16:51:51Z

Since this appears to have stagnated, I've resumed what I started several years ago. I hope to have a PR for procfs in a couple of days.

dtap001 · 2024-02-14T12:58:11Z

any update?

frittentheke · 2024-02-19T19:40:52Z

@dtap001 what is missing here exactly? The PR prometheus/procfs#509 was merged which gets the stats off of sysfs now.

@dswarbrick you wanted to also fetch metrics via sysfs, sync rate and progress, right?

dswarbrick · 2024-06-11T19:45:20Z

Sorry I haven't had the bandwidth to work on this for a long time, but it looks like somebody else has picked up the torch in #3031

dswarbrick mentioned this issue Sep 25, 2018

Proposal: collect mdraid metrics from sysfs instead of parsing /proc/mdstat prometheus/prometheus#4651

Closed

corny mentioned this issue Sep 25, 2018

mdadm: Add test case for syncing RAID 6 #1080

Closed

discordianfish added the accepted label Oct 4, 2018

dswarbrick mentioned this issue Oct 7, 2020

MDStat parsing enrichment prometheus/procfs#329

Closed

dswarbrick mentioned this issue Apr 24, 2023

Implement mdraid sysfs parsing prometheus/procfs#509

Merged

dswarbrick mentioned this issue May 14, 2024

added md disks in down state #3007

Open

dswarbrick mentioned this issue Oct 9, 2024

feat: expose MD raid component devices prometheus/procfs#674

Open

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Proposal: collect mdraid metrics from sysfs instead of parsing /proc/mdstat #1085

Proposal: collect mdraid metrics from sysfs instead of parsing /proc/mdstat #1085

dswarbrick commented Sep 25, 2018 •

edited

Loading

SuperQ commented Sep 25, 2018

dswarbrick commented Sep 25, 2018 •

edited

Loading

frittentheke commented Jun 11, 2020

dswarbrick commented Jun 11, 2020

frittentheke commented Aug 14, 2020 •

edited

Loading

jooola commented Mar 7, 2021

dswarbrick commented Mar 7, 2021 •

edited

Loading

jooola commented Mar 7, 2021

SuperQ commented Mar 7, 2021

jooola commented Mar 7, 2021

dswarbrick commented Mar 9, 2021

dswarbrick commented Apr 24, 2023

dtap001 commented Feb 14, 2024

frittentheke commented Feb 19, 2024

dswarbrick commented Jun 11, 2024

Proposal: collect mdraid metrics from sysfs instead of parsing /proc/mdstat #1085

Proposal: collect mdraid metrics from sysfs instead of parsing /proc/mdstat #1085

Comments

dswarbrick commented Sep 25, 2018 • edited Loading

Proposal

SuperQ commented Sep 25, 2018

dswarbrick commented Sep 25, 2018 • edited Loading

frittentheke commented Jun 11, 2020

dswarbrick commented Jun 11, 2020

frittentheke commented Aug 14, 2020 • edited Loading

jooola commented Mar 7, 2021

dswarbrick commented Mar 7, 2021 • edited Loading

jooola commented Mar 7, 2021

SuperQ commented Mar 7, 2021

jooola commented Mar 7, 2021

dswarbrick commented Mar 9, 2021

dswarbrick commented Apr 24, 2023

dtap001 commented Feb 14, 2024

frittentheke commented Feb 19, 2024

dswarbrick commented Jun 11, 2024

dswarbrick commented Sep 25, 2018 •

edited

Loading

dswarbrick commented Sep 25, 2018 •

edited

Loading

frittentheke commented Aug 14, 2020 •

edited

Loading

dswarbrick commented Mar 7, 2021 •

edited

Loading