A proactive monitoring solution that automatically analyzes your Elasticsearch logs, detects patterns, and delivers concise email reports about your platform's health.
If you already have an ELK stack set up and running, Platform Problem Monitoring delivers this ↓ into your mailbox every hour:
Platform Problem Monitoring Core helps platform engineers and system administrators by:
- Detecting problems automatically — Identifies errors, exceptions, and warnings in your logs without manual searching
- Recognizing patterns — Normalizes similar log messages to reveal systemic issues (see the example below)
- Tracking changes over time — Compares current issues with previous runs to show what's new, increasing, or decreasing
- Delivering digestible reports — Sends clear, well-formatted email reports with Kibana links to examples
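For example, log messages that differ only in dynamic values are collapsed into a single pattern and counted together (an illustration; the exact placeholder format the tool emits may differ):

```
Connection to 10.0.3.17:5432 timed out after 3000 ms
Connection to 10.0.7.92:5432 timed out after 3012 ms

→ both count toward one pattern: Connection to [IP]:[NUM] timed out after [NUM] ms
```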
This tool is ideal if:
- You already have an ELK (Elasticsearch, Logstash, Kibana) stack collecting logs
- You want automated, periodic health assessments of your platform
- You prefer receiving digestible summaries rather than real-time alerts
- You need to understand patterns and trends in your platform's problems
To run the tool, you need:

- Python 3.10+ installed on the host system
- Network access to:
  - Your Elasticsearch server
  - An AWS S3 bucket (for state storage between runs)
  - An SMTP server (for sending reports)
- Credentials for all of these services
1. Clone the repository:

   ```bash
   git clone https://github.com/dx-tooling/platform-problem-monitoring-core.git
   cd platform-problem-monitoring-core
   ```

2. Set up a virtual environment:

   ```bash
   python3 -m venv venv
   source venv/bin/activate  # On Windows: venv\Scripts\activate
   ```

3. Install the package:

   ```bash
   pip3 install -e .
   ```

4. Create a configuration file:

   ```bash
   mkdir -p /etc/ppmc
   cp etc/main.conf.dist /etc/ppmc/main.conf
   ```

5. Edit the configuration:

   ```bash
   REMOTE_STATE_S3_BUCKET_NAME="your-s3-bucket"
   REMOTE_STATE_S3_FOLDER_NAME="platform-monitoring"
   ELASTICSEARCH_SERVER_BASE_URL="https://your-elasticsearch-server:9200"
   ELASTICSEARCH_LUCENE_QUERY_FILE_PATH="path/to/lucene_query.json"
   KIBANA_DISCOVER_BASE_URL="https://your-kibana-server:5601"
   KIBANA_DOCUMENT_DEEPLINK_URL_STRUCTURE="https://your-kibana-server:5601/app/discover#/doc/logstash-*/{{index}}?id={{id}}"
   SMTP_SERVER_HOSTNAME="smtp.example.com"
   SMTP_SERVER_PORT="587"
   SMTP_SERVER_USERNAME="your-smtp-username"
   SMTP_SERVER_PASSWORD="your-smtp-password"
   SMTP_SENDER_ADDRESS="[email protected]"
   SMTP_RECEIVER_ADDRESS="[email protected]"
   ```

6. Set up the Elasticsearch query:

   ```bash
   cp etc/lucene_query.json.dist /etc/ppmc/lucene_query.json
   ```

   This sample query looks for error messages while filtering out noise:

   ```json
   {
     "query": {
       "bool": {
         "should": [
           { "match": { "message": "error" } },
           { "match": { "message": "failure" } },
           { "match": { "message": "critical" } },
           { "match": { "message": "alert" } },
           { "match": { "message": "exception" } }
         ],
         "must_not": [
           { "match": { "message": "User Deprecated" } },
           { "match": { "message": "logstash" } },
           { "term": { "syslog_program": "dd.collector" } }
         ],
         "minimum_should_match": 1
       }
     }
   }
   ```

7. Run the tool:

   ```bash
   ./bin/ppmc /etc/ppmc/main.conf
   ```
When executed, the tool:
1. Prepares the environment by creating a temporary work directory
2. Downloads previous state from S3 (for comparison)
3. Queries Elasticsearch for new problem-related log messages since the last run
4. Extracts relevant fields from the returned documents
5. Normalizes messages by replacing dynamic parts like UUIDs, timestamps, and specific values with placeholders
6. Compares current patterns with the previous run to identify new, increased, and decreased issues
7. Generates an email report with detailed information about all identified issues
8. Sends the report via your configured SMTP server
9. Stores the current state in S3 for the next run
10. Cleans up temporary files
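Step 5, the normalization, is what turns thousands of individual messages into a handful of patterns. A minimal sketch of the idea in Python, assuming simple regex replacement (the placeholders and rules the actual tool applies may differ):

```python
import re

# Ordered (pattern, placeholder) pairs; order matters, since UUIDs must be
# replaced before the generic number rule fires.
RULES = [
    (re.compile(r"[0-9a-f]{8}-[0-9a-f]{4}-[0-9a-f]{4}-[0-9a-f]{4}-[0-9a-f]{12}", re.I), "[UUID]"),
    (re.compile(r"\d{4}-\d{2}-\d{2}[T ]\d{2}:\d{2}:\d{2}(?:\.\d+)?(?:Z|[+-]\d{2}:?\d{2})?"), "[TIMESTAMP]"),
    (re.compile(r"\b\d+\b"), "[NUM]"),
]

def normalize(message: str) -> str:
    """Collapse the dynamic parts of a log message into placeholders."""
    for pattern, placeholder in RULES:
        message = pattern.sub(placeholder, message)
    return message

# Two distinct messages map to the same pattern and are counted together:
a = normalize("Payment 3f8a2c1e-07f4-4b6e-9d2a-5c1b8e4f6a01 failed for order 10293")
b = normalize("Payment 91bc0d7a-4e55-41c2-8f3d-2a6c9b0e7d44 failed for order 55017")
assert a == b == "Payment [UUID] failed for order [NUM]"
```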
For a Kubernetes deployment, you might want to focus on pod-related errors, requiring that a pod name is present (a `match` query does not support wildcards, so an `exists` clause expresses this) and that the message contains an error term:

```json
{
  "query": {
    "bool": {
      "must": [
        { "exists": { "field": "kubernetes.pod.name" } }
      ],
      "should": [
        { "match_phrase": { "message": "error" } },
        { "match_phrase": { "message": "exception" } }
      ],
      "must_not": [
        { "match_phrase": { "message": "liveness probe failed" } }
      ],
      "minimum_should_match": 1
    }
  }
}
```
For web services, you might focus on HTTP errors and performance issues (adjust the field names `http.response.status_code` and `response_time_ms` to match your own index mappings):

```json
{
  "query": {
    "bool": {
      "should": [
        { "range": { "http.response.status_code": { "gte": 500 } } },
        { "range": { "response_time_ms": { "gte": 1000 } } },
        { "match_phrase": { "message": "timed out" } }
      ],
      "minimum_should_match": 1
    }
  }
}
```
To run the tool periodically, set up a cron job:
```bash
# Run every 6 hours
0 */6 * * * cd /path/to/platform-problem-monitoring-core && ./bin/ppmc ./etc/main.conf >> /var/log/platform-monitoring.log 2>&1
```
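If a run can take longer than the interval, for example during an incident that floods the logs, you may want to prevent overlapping executions. A sketch using `flock` from util-linux, assuming it is available on the host:

```bash
# Run every 6 hours, but skip a run if the previous one is still going
0 */6 * * * flock -n /tmp/ppmc.lock -c 'cd /path/to/platform-problem-monitoring-core && ./bin/ppmc ./etc/main.conf' >> /var/log/platform-monitoring.log 2>&1
```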
The tool uses boto3's default credential resolution. You can:

- Set the environment variables `AWS_ACCESS_KEY_ID` and `AWS_SECRET_ACCESS_KEY`
- Use a shared credentials file (`~/.aws/credentials`)
- Use IAM roles if running on EC2 instances
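To verify that credentials resolve and the bucket is reachable before the first scheduled run, a small sanity check along these lines can help (a sketch; `check_s3_access` is not part of the tool, and the bucket name is whatever you configured in `main.conf`):

```python
import boto3
from botocore.exceptions import ClientError, NoCredentialsError

def check_s3_access(bucket_name: str) -> None:
    """Fail fast if boto3 cannot resolve credentials or reach the bucket."""
    s3 = boto3.client("s3")  # uses boto3's default credential resolution chain
    try:
        s3.head_bucket(Bucket=bucket_name)
        print(f"OK: credentials resolved and bucket '{bucket_name}' is reachable")
    except NoCredentialsError:
        print("No AWS credentials found in environment, config file, or IAM role")
    except ClientError as e:
        print(f"Bucket check failed: {e.response['Error']['Code']}")

check_s3_access("your-s3-bucket")
```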
If the tool cannot reach one of its backing services:

- Check Elasticsearch connectivity:

  ```bash
  curl -X GET https://your-elasticsearch-server:9200/_cat/indices
  ```

- Verify S3 bucket permissions
- Test SMTP settings, for example with a quick smtplib one-liner:

  ```bash
  python3 -c "import smtplib; s = smtplib.SMTP('smtp.example.com', 587); s.starttls(); print(s.noop()); s.quit()"
  ```
If reports come back empty even though problems exist:

- Check that your query matches actual log patterns
- Test your Elasticsearch query directly in Kibana
- Check the date range: are you missing events due to time zone issues?
- Adjust the Lucene query to be more inclusive
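You can also run the configured query file against Elasticsearch directly. Since the file already contains the top-level `query` object, it can be posted to the `_search` endpoint as-is (a sketch, assuming your indices match the `logstash-*` pattern used in the Kibana deeplink above):

```bash
curl -s -X POST "https://your-elasticsearch-server:9200/logstash-*/_search?size=3" \
  -H "Content-Type: application/json" \
  -d @/etc/ppmc/lucene_query.json
```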
For large log volumes:
- Shorten the time between runs so that each run processes a smaller batch of logs
- Optimize your Elasticsearch query with more specific filters
- Ensure the host running the tool has sufficient memory
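As an example of a more specific filter, a `filter` clause narrows the candidate set before the `should` clauses are scored (a sketch; the `environment` field is hypothetical and depends on your log schema):

```json
{
  "query": {
    "bool": {
      "filter": [
        { "term": { "environment": "production" } }
      ],
      "should": [
        { "match": { "message": "error" } }
      ],
      "minimum_should_match": 1
    }
  }
}
```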
If you encounter problems or have questions, please:
- Check the detailed logs in your temporary work directory
- Open an issue in our repository with your configuration (with sensitive data removed)
- Include error messages and steps to reproduce the issue
This project is available under the MIT License — Copyright (c) 2025 Manuel Kießling.