v0.5.4 - zfs
Eric Wilhelm committed Jan 17, 2014
2 parents f446f7c + c6d0060 commit 1f6650c
Showing 5 changed files with 252 additions and 1 deletion.
4 changes: 4 additions & 0 deletions CHANGELOG.md
@@ -1,3 +1,7 @@
# 0.5.4

* collectors/zfs - zpool health & space collector

# 0.5.3

* collectors/fma - solaris faults collector
34 changes: 34 additions & 0 deletions collectors/zfs/README.md
@@ -0,0 +1,34 @@
# Description

Reports on zfs pool size and health.

# Output

The metrics `used`, `free`, `state`, and `faults` will always be
present. If there are faults, the `errors` hash may contain additional
metrics. The error metrics `read`, `write`, and `cksum` are as reported
for the zfs pool (see `zpool status`).

Note that `faults` may be non-zero while the pool is still fully
functional (and `state=10`) if there are minor errors on a device,
possibly including data loss. The device-specific errors are reported
under `_info`. Unrecoverable (pool) errors will appear as `read`,
`write`, or `cksum` at the upper level. The value of `faults` is a count
of all vdevs which contain errors (including the pool itself), so a
flaky (redundant) disk in a mirror would appear as `faults=1`, but with
2/2 corrupted: faults = 2 (disks) + 1 (mirror) + 1 (pool) = 4.
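
As a rough illustration of the counting rule (a sketch, not the collector's own code), the snippet below counts every row of a `zpool status` config table that is not `ONLINE` or that has a non-zero counter. The names `count_faults` and `CONFIG` are hypothetical, and the table reproduces the corrupted-mirror case.

```ruby
# Sketch only: CONFIG is a hand-copied `zpool status` config table in which
# both disks of mirror-0 are corrupted.
CONFIG = <<~TABLE
  NAME        STATE   READ WRITE CKSUM
  monkey      ONLINE     0 30.0K     0
  mirror-0    ONLINE     0 30.0K     0
  /tmp/p1     ONLINE    23 59.3K     0
  /tmp/p2     ONLINE    47 62.3K     0
  mirror-1    ONLINE     0     0     0
  /tmp/p3     ONLINE     0     0     0
  /tmp/p4     ONLINE     0     0     0
TABLE

# Hypothetical helper: counts vdev rows (pool included) that show a problem.
def count_faults(table)
  table.lines.drop(1).count do |line|
    _name, state, *counters = line.split
    state != 'ONLINE' || counters.first(3).any? { |c| c != '0' }
  end
end

puts count_faults(CONFIG) # prints 4: 2 disks + 1 mirror + 1 pool
```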

```yaml
zpool:
  monkey:
    used: 39.2 # GB
    free: 17.483 # GB
    state: 10 # 10=ONLINE, 5=DEGRADED, 0=FAULTED/UNAVAIL
    faults: 4
    errors:
      read: 8
      _info:
        c1t1d0: { state: "ONLINE", errors: {read: 2}}
        c2t5d0: { state: "ONLINE", errors: {read: 6}}
        files: ["<0x0>", "a.txt"]
```
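
A consumer of these metrics might flag pools that need attention. The following is a minimal sketch, assuming the metrics arrive as JSON keyed by pool name; `pools_needing_attention` and `SAMPLE` are hypothetical, and the numbers for `tank` are made up for illustration.

```ruby
require 'json'

# Hypothetical sample shaped like the example output (pool names on top).
SAMPLE = <<~EOS
  {
    "monkey": {"used": 39.2, "free": 17.483, "state": 10, "faults": 4,
               "errors": {"read": 8}},
    "tank":   {"used": 1.5, "free": 2.5, "state": 10, "faults": 0}
  }
EOS

# Returns the names of pools that are not ONLINE (state < 10) or report faults.
def pools_needing_attention(metrics)
  metrics.select { |_pool, m| m['state'] < 10 || m['faults'] > 0 }.keys
end

puts pools_needing_attention(JSON.parse(SAMPLE)).join(', ') # prints monkey
```

Checking `faults` as well as `state` matters because a pool can stay `ONLINE` while its devices accumulate errors.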
131 changes: 131 additions & 0 deletions collectors/zfs/fakebin/zpool
@@ -0,0 +1,131 @@
#!/bin/bash

if [ "x$KNOB" == "xOK" ] ; then
cat << END_OF_OUTPUT
pool: monkey
state: ONLINE
scan: none requested
config:
NAME STATE READ WRITE CKSUM
monkey ONLINE 0 0 0
mirror-0 ONLINE 0 0 0
/tmp/p1 ONLINE 0 0 0
/tmp/p2 ONLINE 0 0 0
mirror-1 ONLINE 0 0 0
/tmp/p3 ONLINE 0 0 0
/tmp/p4 ONLINE 0 0 0
errors: No known data errors
pool: tank
state: ONLINE
scan: none requested
config:
NAME STATE READ WRITE CKSUM
tank ONLINE 0 0 0
/dev/sdg ONLINE 0 0 0
errors: No known data errors
END_OF_OUTPUT

elif [ "x$KNOB" == "xERRORS" ] ; then
cat << END_OF_OUTPUT
pool: monkey
state: ONLINE
status: One or more devices has experienced an unrecoverable error. An
attempt was made to correct the error. Applications are unaffected.
action: Determine if the device needs to be replaced, and clear the errors
using 'zpool clear' or replace the device with 'zpool replace'.
see: http://zfsonlinux.org/msg/ZFS-8000-9P
scan: scrub repaired 2K in 0h0m with 0 errors on Sat Jan 11 11:47:26 2014
config:
NAME STATE READ WRITE CKSUM
monkey ONLINE 0 0 0
mirror-0 ONLINE 0 0 0
/tmp/p1 ONLINE 0 0 0
/tmp/p2 ONLINE 9 80.6K 0
mirror-1 ONLINE 0 0 0
/tmp/p3 ONLINE 0 0 0
/tmp/p4 ONLINE 0 0 0
errors: No known data errors
END_OF_OUTPUT

elif [ "x$KNOB" == "xLOSSY" ] ; then
cat << END_OF_OUTPUT
pool: tank
state: UNAVAIL
status: One or more devices are faulted in response to IO failures.
config:
NAME STATE READ WRITE CKSUM
tank UNAVAIL 0 0 0 insufficient replicas
c1t0d0 ONLINE 0 0 0
c1t1d0 UNAVAIL 4 1 0 cannot open
errors: Permanent errors have been detected in the following files:
/tank/data/aaa
/tank/data/bbb
/tank/data/ccc
END_OF_OUTPUT

elif [ "x$KNOB" == "xDEGRADED" ] ; then
cat << END_OF_OUTPUT
pool: monkey
state: DEGRADED
status: One or more devices could not be opened. Sufficient replicas exist for
the pool to continue functioning in a degraded state.
action: Attach the missing device and online it using 'zpool online'.
see: http://zfsonlinux.org/msg/ZFS-8000-2Q
scan: scrub repaired 0 in 0h0m with 0 errors on Sat Jan 11 16:45:29 2014
config:
NAME STATE READ WRITE CKSUM
monkey DEGRADED 0 0 0
mirror-0 DEGRADED 0 0 0
/tmp/p1a UNAVAIL 0 0 0 cannot open
/tmp/p2 ONLINE 0 0 0
mirror-1 ONLINE 0 0 0
/tmp/p3 ONLINE 0 0 0
/tmp/p4 ONLINE 0 0 0
errors: No known data errors
END_OF_OUTPUT

elif [ "x$KNOB" == "xCORRUPT" ] ; then
cat << END_OF_OUTPUT
pool: monkey
state: ONLINE
status: One or more devices has experienced an error resulting in data
corruption. Applications may be affected.
action: Restore the file in question if possible. Otherwise restore the
entire pool from backup.
see: http://zfsonlinux.org/msg/ZFS-8000-8A
scan: resilvered 0 in 0h0m with 0 errors on Sat Jan 11 11:52:56 2014
config:
NAME STATE READ WRITE CKSUM
monkey ONLINE 0 30.0K 0
mirror-0 ONLINE 0 30.0K 0
/tmp/p1 ONLINE 23 59.3K 0
/tmp/p2 ONLINE 47 62.3K 0
mirror-1 ONLINE 0 0 0
/tmp/p3 ONLINE 0 0 0
/tmp/p4 ONLINE 0 0 0
errors: Permanent errors have been detected in the following files:
monkey:<0x0>
/monkey/
END_OF_OUTPUT


else
echo "oh no" >&2
exit 1
fi
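
The fixture output can be split into per-pool sections in a few lines of Ruby. The following is a minimal sketch, assuming that real `zpool status` output right-aligns the `pool:` and `state:` labels with leading spaces; `pool_states` and `STATUS` are hypothetical names, and the text is an abbreviated copy of the `KNOB=OK` case.

```ruby
# Abbreviated KNOB=OK output; the leading spaces are an assumption about
# how `zpool status` aligns its labels.
STATUS = <<-TEXT
  pool: monkey
 state: ONLINE
errors: No known data errors

  pool: tank
 state: ONLINE
errors: No known data errors
TEXT

# Hypothetical helper: maps each pool name to its reported state.
def pool_states(text)
  text.scan(/^\s*pool:\s+(\S+)\n\s*state:\s+(\S+)/).to_h
end

pool_states(STATUS).each { |pool, state| puts "#{pool}: #{state}" }
```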
82 changes: 82 additions & 0 deletions collectors/zfs/zfs
@@ -0,0 +1,82 @@
#!/usr/bin/env ruby
# Copyright (C) 2014 Cisco, Inc.

require 'json'
class Array; def to_h; Hash[*self.flatten]; end; end

opt = ARGV[0] ? JSON::parse(ARGV[0], {symbolize_names: true}) : {}

zpool_cmd = [opt[:zpool_cmd] || 'zpool'].flatten
zfs_cmd   = [opt[:zfs_cmd]   || 'zfs'].flatten

########################################################################
states = { ONLINE: 10, DEGRADED: 5, FAULTED: 0, UNAVAIL: 0 }
# error counters use decimal suffixes (e.g. 80.6K)
convert_10 = ->(){
  unit = { 'G' => 1E9, 'M' => 1E6, 'K' => 1E3 }
  ->(x) {
    x =~ /([\d.]+)([GMK])/ ?
      ($1.to_f * unit[$2]).round # round, so float error cannot drop a count
      : x.to_i
  }
}[]
# sizes from `zfs list` use binary suffixes; report them in GB
convert_GB = ->(){
  exp = 0
  # B => 1024**0, K => 1024**1, ... (parentheses needed: ** binds before -)
  unit = %w(B K M G T P E Z).map {|p| exp += 1; [p, 1024**(exp-1)]}.to_h
  ->(x) {
    (x.sub(/([BKMGTPEZ])/, '').to_f *
      (unit[$1] or raise "unknown unit #{$1}") /
      unit['G']).round(6)
  }
}[]
########################################################################

status = IO.popen(zpool_cmd + ['status', '-v']) {|fh| fh.readline(nil)}.
  split(/\s+pool:\s+(\S+)\n/).drop(1).to_h
raise "zpool error - #{$?.exitstatus}" unless $?.success?

metrics = status.keys.map {|pool|
  info = status[pool].split(/^\s*(\w+):\s+/).drop(1).to_h
  info.each_value {|v| v.chomp!}
  m = {} # inner metrics
  m[:state] = states[info['state'].to_sym] ||
    begin; warn "unknown state #{info['state']}"; -1; end

  info['config'] =~ /NAME\s+STATE\s+READ\s+WRITE/ or
    raise "not expecting format of config: #{info['config']}"
  errors = {}
  info['config'].split(/\n/).drop(1).each {|line|
    (dev, state, r, w, c) = line.split(/\s+/).drop(1)
    (r,w,c) = [r,w,c].map {|x| convert_10[x]}
    if [r,w,c].find {|x| x > 0} || state != 'ONLINE'
      e = errors[dev] = {state: state, read: r, write: w, cksum: c}
      [:read, :write, :cksum].each {|k| e.delete(k) if e[k] == 0}
    end
  }
  m[:faults] = errors.keys.count
  if errors.keys.count > 0
    e = m[:errors] = {_info: errors}
    if pe = errors.delete(pool)
      pe.delete(:state) # string / redundant
      e.merge!(pe)
    end
  end

  if info['errors'] =~ /files:\n\n\s*(\S+.*)/m
    # files may be listed even when no vdev row shows errors
    ((m[:errors] ||= {})[:_info] ||= {})[:_files] = $1.chomp.split(/\s+/)
  end

  [pool, m]
}.to_h


IO.popen(zfs_cmd + ['list', '-H', '-o', 'name,used,avail'] +
    metrics.keys) {|fh|
  fh.readlines.each {|line|
    (pool, used, free) = line.split(/\s+/)
    metrics[pool][:used] = convert_GB[used]
    metrics[pool][:free] = convert_GB[free]
  }
}
raise "zfs error - #{$?.exitstatus}" unless $?.success?

puts JSON::generate(metrics)
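
The two unit conversions in the script can be illustrated in isolation. This is a sketch rather than the collector's code: it assumes error counters use decimal suffixes and `zfs list` sizes use binary ones, as the script's unit tables suggest, and the names `error_count`, `size_in_gb`, `DECIMAL`, and `BINARY` are hypothetical.

```ruby
# Decimal suffixes, as seen in the READ/WRITE/CKSUM columns (e.g. "80.6K").
DECIMAL = { 'K' => 1e3, 'M' => 1e6, 'G' => 1e9 }.freeze
# Binary suffixes, as printed by `zfs list`: B is 1024**0, K is 1024**1, ...
BINARY = %w(B K M G T P E Z).each_with_index.map { |u, i| [u, 1024**i] }.to_h.freeze

# Hypothetical helper: "80.6K" becomes 80600, a bare number passes through.
def error_count(text)
  m = text.match(/\A([\d.]+)([KMG])\z/)
  m ? (m[1].to_f * DECIMAL[m[2]]).round : text.to_i
end

# Hypothetical helper: converts a size such as "512M" to gigabytes.
def size_in_gb(text)
  m = text.match(/\A([\d.]+)([BKMGTPEZ])\z/)
  raise ArgumentError, "unknown size #{text.inspect}" unless m
  (m[1].to_f * BINARY[m[2]] / BINARY['G']).round(6)
end

puts error_count('80.6K') # prints 80600
puts size_in_gb('512M')   # prints 0.5
```

Rounding instead of truncating keeps a value such as `80.6K` from landing one below the intended count when the floating-point product falls just short of a whole number.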
2 changes: 1 addition & 1 deletion lib/panoptimon/version.rb
@@ -1,5 +1,5 @@
# Copyright (C) 2012-2014 Cisco, Inc.

module Panoptimon
-  VERSION = "0.5.3"
+  VERSION = "0.5.4"
end
