CheckMate is a service monitoring tool written in Go that provides real-time health checks and metrics for infrastructure. It supports multiple protocols, customizable rules, and Prometheus integration.
DISCLAIMER: This is a personal project under heavy development. It is not feature-complete, secure, or tested, and it is not meant to be used in a production environment.
- Multi-protocol support (TCP, HTTP, SMTP, DNS)
- Hierarchical configuration (Sites → Groups → Hosts → Checks)
- Group monitoring
- Configurable check intervals per service
- Prometheus metrics integration
- Rule-based monitoring with custom conditions
- Flexible notification system with rule-specific routing
- Service availability status
- Response time measurements
- Prometheus-compatible metrics endpoint
- Downtime tracking
- Latency histograms and gauges
- YAML-based configuration
- Modular architecture for easy extension using interfaces
- Site-based infrastructure organization
- Tag inheritance (site tags are inherited by groups and hosts)
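The "modular architecture using interfaces" item can be pictured with a small sketch. The `Checker` interface, the `TCPChecker` type, and the `registry` map below are hypothetical names chosen for illustration; they are not taken from the CheckMate source, and the real interfaces may look different.

```go
package main

import (
	"context"
	"fmt"
	"net"
	"strings"
	"time"
)

// Checker is a hypothetical interface that each protocol would implement.
type Checker interface {
	// Check probes address ("host:port") and returns the elapsed time.
	Check(ctx context.Context, address string) (time.Duration, error)
}

// TCPChecker treats a service as up when a TCP connection can be opened.
type TCPChecker struct{}

func (TCPChecker) Check(ctx context.Context, address string) (time.Duration, error) {
	start := time.Now()
	var d net.Dialer
	conn, err := d.DialContext(ctx, "tcp", address)
	if err != nil {
		return time.Since(start), err
	}
	conn.Close()
	return time.Since(start), nil
}

// registry maps protocol names to implementations. Supporting another
// protocol means registering one more Checker here.
var registry = map[string]Checker{
	"TCP": TCPChecker{},
}

// NewChecker looks up a checker by protocol name, ignoring case.
func NewChecker(protocol string) (Checker, error) {
	c, ok := registry[strings.ToUpper(protocol)]
	if !ok {
		return nil, fmt.Errorf("unsupported protocol %q", protocol)
	}
	return c, nil
}

func main() {
	c, err := NewChecker("tcp")
	if err != nil {
		fmt.Println("error:", err)
		return
	}
	ctx, cancel := context.WithTimeout(context.Background(), 2*time.Second)
	defer cancel()
	elapsed, err := c.Check(ctx, "127.0.0.1:22")
	fmt.Printf("elapsed=%v err=%v\n", elapsed, err)
}
```

Keeping the protocol lookup behind one function means the scheduling and rule code never needs to know which protocols exist.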
- Clone the repository:
git clone https://github.com/whiskeyjimbo/CheckMate.git
cd CheckMate
- Build using Make:
make build
- Run using Make:
make run config.yaml ## see configuration below for more details
CheckMate is configured using a YAML file (default: ./config.yaml). Example configuration:
sites:
  - name: "mars-prod"
    tags: ["region-mars", "prod"]
    groups:
      - name: "api-service.dev.com"
        tags: ["prod"]
        hosts:
          - host: "127.0.0.1"
            tags: ["primary"]
          - host: "localhost2"
            tags: ["secondary"]
        checks:
          - port: "443"
            protocol: HTTPS
            interval: 10s
            tags: ["api"]
          - port: "22"
            protocol: TCP
            interval: 10s
            tags: ["ssh"]
rules:
  - name: high_latency_warning
    condition: "responseTime > 5ms"
    tags: ["prod"]
    notifications: ["log"]
  - name: critical_downtime_prod
    condition: "downtime > 15s"
    tags: ["prod"]
    notifications: ["log"]
notifications:
  - type: "log"
Site options:
- name: Unique identifier for the site
- tags: List of tags inherited by all groups in the site
- groups: List of service groups in this site
  - name: The group identifier
  - tags: Additional tags specific to this group (combined with site tags)
  - hosts: List of hosts in this group
    - host: The hostname or IP to monitor
    - tags: Additional tags specific to this host
  - checks: List of service checks applied to all hosts
    - port: Port number to check
    - protocol: One of: TCP, HTTP, SMTP, DNS
    - interval: Check frequency (e.g., "30s", "1m")
    - tags: Additional tags specific to this check
    - ruleMode: Override group's rule mode for this specific check
  - ruleMode: How rules are evaluated for this group (optional)
    - all: Only fire rules when all hosts are down (default)
    - any: Fire rules when any host in the group is down

Rule options:
- name: Unique rule identifier
- condition: Expression to evaluate (uses responseTime and downtime variables)
- tags: List of tags to match against groups
- notifications: List of notification types to use when rule triggers
  - If omitted, all configured notifiers will be used

Notification options:
- type: Type of notification ("log", with more coming soon)
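The tag fields described above combine from the outermost level inward. The sketch below illustrates that inheritance with two hypothetical helpers, `mergeTags` and `matchesAny`. Treating a rule as matching when any one of its tags is present is an assumption, as is including check tags in the merged set; the project may apply different semantics.

```go
package main

import "fmt"

// mergeTags concatenates tag lists from the outermost level to the innermost,
// dropping duplicates while keeping first-seen order.
func mergeTags(levels ...[]string) []string {
	seen := make(map[string]bool)
	var out []string
	for _, tags := range levels {
		for _, t := range tags {
			if !seen[t] {
				seen[t] = true
				out = append(out, t)
			}
		}
	}
	return out
}

// matchesAny reports whether at least one rule tag appears in the effective
// tag set. (Assumption: "any" rather than "all" matching.)
func matchesAny(ruleTags, effective []string) bool {
	have := make(map[string]bool, len(effective))
	for _, t := range effective {
		have[t] = true
	}
	for _, t := range ruleTags {
		if have[t] {
			return true
		}
	}
	return false
}

func main() {
	site := []string{"region-mars", "prod"}
	group := []string{"prod"}
	host := []string{"primary"}
	check := []string{"api"}

	effective := mergeTags(site, group, host, check)
	fmt.Println(effective)                               // [region-mars prod primary api]
	fmt.Println(matchesAny([]string{"prod"}, effective)) // true
}
```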
Groups support high-availability monitoring with configurable rule modes at both the group and check levels.
With ruleMode all (default):
- Group is considered "up" if any host is responding
- Rules only trigger when all hosts are down
- Ideal for redundant services where one available host is sufficient
With ruleMode any:
- Group monitoring tracks all hosts individually
- Rules trigger when any host goes down
- Suitable for services where each host's availability is critical
- Can be set at check level to override group settings
In both modes:
- Response times are averaged across all successful checks in the group (this may change later to use host-level metrics)
- Metrics are tracked at both host and group levels
- Prometheus histograms are used for latency tracking
- Notifications include specific failing hosts
Example Prometheus queries for HA monitoring:
# Count of available hosts in each group
count(checkmate_check_success{group="api-service"} == 1) by (site, group)
# Groups with all hosts down (no successful check in the group)
sum(checkmate_check_success) by (site, group) == 0
# Average response time across all hosts in a group
avg(checkmate_check_latency_milliseconds) by (site, group)
CheckMate exposes Prometheus metrics at :9100/metrics, including:
- checkmate_check_success: Service availability (1 = up, 0 = down)
- checkmate_check_latency_milliseconds: Response time in milliseconds
- checkmate_check_latency_milliseconds_histogram: Response time distribution in milliseconds
- checkmate_hosts_up: Number of hosts up in a group (per port/protocol)
- checkmate_hosts_total: Total number of hosts in a group (per port/protocol)
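For readers unfamiliar with what a scrape of the metrics endpoint returns, the sketch below formats one gauge sample in the Prometheus text exposition format. The `formatSample` helper and the exact label set are assumptions made for illustration. A real exporter would normally rely on the official Prometheus Go client rather than hand-written formatting.

```go
package main

import (
	"fmt"
	"sort"
	"strings"
)

// labelEscaper applies the escaping the text format requires inside label
// values: backslash, double quote, and newline.
var labelEscaper = strings.NewReplacer(`\`, `\\`, `"`, `\"`, "\n", `\n`)

// formatSample renders one sample line, with labels sorted by name so the
// output is deterministic.
func formatSample(name string, labels map[string]string, value float64) string {
	keys := make([]string, 0, len(labels))
	for k := range labels {
		keys = append(keys, k)
	}
	sort.Strings(keys)

	pairs := make([]string, 0, len(keys))
	for _, k := range keys {
		pairs = append(pairs, fmt.Sprintf(`%s="%s"`, k, labelEscaper.Replace(labels[k])))
	}
	return fmt.Sprintf("%s{%s} %g", name, strings.Join(pairs, ","), value)
}

func main() {
	// The label names here are assumed, based on the queries in this README.
	labels := map[string]string{
		"site":     "mars-prod",
		"group":    "api-service.dev.com",
		"host":     "127.0.0.1",
		"port":     "443",
		"protocol": "HTTPS",
	}
	fmt.Println("# TYPE checkmate_check_success gauge")
	fmt.Println(formatSample("checkmate_check_success", labels, 1))
}
```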
Note: These metrics are designed for Grafana's Node Graph visualization and are currently in flux.
- checkmate_node_info: Node information for graph visualization
  - Labels: id, type (site/group/host), name, tags, port, protocol
  - Values: 1 for active nodes, 0 for inactive
- checkmate_edge_info: Edge information with latency
  - Labels: source, target, type, metric, port, protocol
  - Values: latency in milliseconds
Example Prometheus queries:
# Filter checks by site
checkmate_check_success{site="mars-lab"}
# Average response time for production APIs
avg(checkmate_check_latency_milliseconds{tags=~".*prod.*", tags=~".*api.*"})
# 95th percentile latency by site
histogram_quantile(0.95, sum(rate(checkmate_check_latency_milliseconds_histogram_bucket[5m])) by (le, site))
# Host availability ratio per group
sum(checkmate_hosts_up) by (id) / sum(checkmate_hosts_total) by (id)
# Graph Visualization (Beta)
# Node status
checkmate_node_info{type="host", port="443", protocol="HTTPS"}
# Edge latencies
avg(checkmate_edge_info{type="contains", metric="latency"}) by (source, target, port, protocol)
To visualize your infrastructure in Grafana's Node Graph:
- Create a new Node Graph panel
- Configure the Node Query: checkmate_node_info
- Configure the Edge Query: checkmate_edge_info{metric="latency"}
- Set transformations:
  - Nodes: Use 'id' for node ID, 'type' for node class
  - Edges: Use 'source' and 'target' for connections
Note: Graph visualization features are in flux and the query/configuration interface may change
CheckMate provides Kubernetes-compatible health check endpoints:
- /health/live: Liveness probe
  - Returns 200 OK when the service is running
- /health/ready: Readiness probe
  - Returns 200 OK when the service is ready to receive traffic
  - Returns 503 Service Unavailable during initialization
All health check endpoints are served on port 9100 alongside the metrics endpoint.
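A minimal sketch of how liveness and readiness endpoints with this behavior could be wired up using only the standard library. The handler names and the `ready` flag are hypothetical, and the sketch exercises the handlers in-process with `httptest` so that it runs without binding a port.

```go
package main

import (
	"fmt"
	"net/http"
	"net/http/httptest"
	"sync/atomic"
)

// ready flips to true once initialization has finished.
var ready atomic.Bool

// liveHandler answers 200 whenever the process is able to serve requests.
func liveHandler(w http.ResponseWriter, _ *http.Request) {
	w.WriteHeader(http.StatusOK)
}

// readyHandler answers 503 until initialization has completed.
func readyHandler(w http.ResponseWriter, _ *http.Request) {
	if !ready.Load() {
		w.WriteHeader(http.StatusServiceUnavailable)
		return
	}
	w.WriteHeader(http.StatusOK)
}

func newHealthMux() *http.ServeMux {
	mux := http.NewServeMux()
	mux.HandleFunc("/health/live", liveHandler)
	mux.HandleFunc("/health/ready", readyHandler)
	return mux
}

// probe sends an in-process GET request to the mux and returns the status code.
func probe(mux *http.ServeMux, path string) int {
	rec := httptest.NewRecorder()
	mux.ServeHTTP(rec, httptest.NewRequest(http.MethodGet, path, nil))
	return rec.Code
}

func main() {
	mux := newHealthMux()
	fmt.Println(probe(mux, "/health/ready")) // 503 while initializing
	ready.Store(true)
	fmt.Println(probe(mux, "/health/ready")) // 200 once ready
	// A real server would instead call http.ListenAndServe(":9100", mux).
}
```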
CheckMate uses structured logging with the following fields:
- Basic check information:
  - site: Site name
  - group: Group name
  - host: Target hostname
  - port: Service port
  - protocol: Check protocol
  - success: Check result (true/false)
  - responseTime_us: Response time in microseconds
  - tags: Array of host tags
- Rule evaluation:
  - rule: Rule name
  - ruleTags: Tags assigned to the rule
  - hostTags: Tags assigned to the host
  - condition: Rule condition
  - downtime: Current downtime duration
  - responseTime: Last check response time
Contributions are welcome! Please feel free to submit a Pull Request.
This project is licensed under the GNU General Public License v3.0 - see the LICENSE file for details.
Planned:
- Additional protocol support (HTTPS, TLS verification)
- Notification system integration (Slack, Email, etc.)
- Configurable notification thresholds
- Database support for historical data
- Docker container
- Web UI for monitoring (MAYBE)

Completed:
- Kubernetes readiness/liveness probe support
- Multiple host monitoring
- Multi-protocol per host
- Service tagging system
- Site-based infrastructure organization
- High availability group monitoring