A proactive monitoring solution that automatically analyzes your Elasticsearch logs, detects patterns, and delivers concise email reports about your platform's health.
If you already have an ELK stack set up and running, Platform Problem Monitoring delivers this ↓ into your mailbox every hour:
Platform Problem Monitoring Core helps platform engineers and system administrators by:
- Detecting problems automatically — Identifies errors, exceptions, and warnings in your logs without manual searching
- Recognizing patterns — Normalizes similar log messages to reveal systemic issues (see the example below)
- Tracking changes over time — Compares current issues with previous runs to show what's new, increasing, or decreasing
- Delivering digestible reports — Sends clear, well-formatted email reports with Kibana links to examples
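For example, log messages that differ only in dynamic values are collapsed into a single pattern and counted together (an illustration; the exact placeholder format the tool emits may differ):

```
Connection to 10.0.3.17:5432 timed out after 3000 ms
Connection to 10.0.7.92:5432 timed out after 3012 ms

→ both count toward one pattern: Connection to [IP]:[NUM] timed out after [NUM] ms
```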
This tool is ideal if:
- You already have an ELK (Elasticsearch, Logstash, Kibana) stack collecting logs
- You want automated, periodic health assessments of your platform
- You prefer receiving digestible summaries rather than real-time alerts
- You need to understand patterns and trends in your platform's problems
To run the tool, you need:

- Python 3.10+ installed on the host system
- Network access to:
  - Your Elasticsearch server
  - An AWS S3 bucket (for state storage between runs)
  - An SMTP server (for sending reports)
- Credentials for all of these services
1. Clone the repository:

   ```bash
   git clone https://github.com/dx-tooling/platform-problem-monitoring-core.git
   cd platform-problem-monitoring-core
   ```

2. Set up a virtual environment:

   ```bash
   python3 -m venv venv
   source venv/bin/activate  # On Windows: venv\Scripts\activate
   ```

3. Install the package:

   ```bash
   pip3 install -e .
   ```

4. Create a configuration file:

   ```bash
   mkdir -p /etc/ppmc
   cp etc/main.conf.dist /etc/ppmc/main.conf
   ```

5. Edit the configuration:

   ```bash
   REMOTE_STATE_S3_BUCKET_NAME="your-s3-bucket"
   REMOTE_STATE_S3_FOLDER_NAME="platform-monitoring"
   ELASTICSEARCH_SERVER_BASE_URL="https://your-elasticsearch-server:9200"
   ELASTICSEARCH_LUCENE_QUERY_FILE_PATH="path/to/lucene_query.json"
   KIBANA_DISCOVER_BASE_URL="https://your-kibana-server:5601"
   KIBANA_DOCUMENT_DEEPLINK_URL_STRUCTURE="https://your-kibana-server:5601/app/discover#/doc/logstash-*/{{index}}?id={{id}}"
   SMTP_SERVER_HOSTNAME="smtp.example.com"
   SMTP_SERVER_PORT="587"
   SMTP_SERVER_USERNAME="your-smtp-username"
   SMTP_SERVER_PASSWORD="your-smtp-password"
   SMTP_SENDER_ADDRESS="[email protected]"
   SMTP_RECEIVER_ADDRESS="[email protected]"
   ```

6. Set up the Elasticsearch query:

   ```bash
   cp etc/lucene_query.json.dist /etc/ppmc/lucene_query.json
   ```

   This sample query looks for error messages while filtering out noise:

   ```json
   {
     "query": {
       "bool": {
         "should": [
           { "match": { "message": "error" } },
           { "match": { "message": "failure" } },
           { "match": { "message": "critical" } },
           { "match": { "message": "alert" } },
           { "match": { "message": "exception" } }
         ],
         "must_not": [
           { "match": { "message": "User Deprecated" } },
           { "match": { "message": "logstash" } },
           { "term": { "syslog_program": "dd.collector" } }
         ],
         "minimum_should_match": 1
       }
     }
   }
   ```

7. Run the tool:

   ```bash
   ./bin/ppmc /etc/ppmc/main.conf
   ```
When executed, the tool:
1. Prepares the environment by creating a temporary work directory
2. Downloads previous state from S3 (for comparison)
3. Queries Elasticsearch for new problem-related log messages since the last run
4. Extracts relevant fields from the returned documents
5. Normalizes messages by replacing dynamic parts like UUIDs, timestamps, and specific values with placeholders
6. Compares current patterns with the previous run to identify new, increased, and decreased issues
7. Generates an email report with detailed information about all identified issues
8. Sends the report via your configured SMTP server
9. Stores the current state in S3 for the next run
10. Cleans up temporary files
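Step 5, the normalization, is what turns thousands of individual messages into a handful of patterns. A minimal sketch of the idea in Python, assuming simple regex replacement (the placeholders and rules the actual tool applies may differ):

```python
import re

# Ordered (pattern, placeholder) pairs; order matters, since UUIDs must be
# replaced before the generic number rule fires.
RULES = [
    (re.compile(r"[0-9a-f]{8}-[0-9a-f]{4}-[0-9a-f]{4}-[0-9a-f]{4}-[0-9a-f]{12}", re.I), "[UUID]"),
    (re.compile(r"\d{4}-\d{2}-\d{2}[T ]\d{2}:\d{2}:\d{2}(?:\.\d+)?(?:Z|[+-]\d{2}:?\d{2})?"), "[TIMESTAMP]"),
    (re.compile(r"\b\d+\b"), "[NUM]"),
]

def normalize(message: str) -> str:
    """Collapse the dynamic parts of a log message into placeholders."""
    for pattern, placeholder in RULES:
        message = pattern.sub(placeholder, message)
    return message

# Two distinct messages map to the same pattern and are counted together:
a = normalize("Payment 3f8a2c1e-07f4-4b6e-9d2a-5c1b8e4f6a01 failed for order 10293")
b = normalize("Payment 91bc0d7a-4e55-41c2-8f3d-2a6c9b0e7d44 failed for order 55017")
assert a == b == "Payment [UUID] failed for order [NUM]"
```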
For a Kubernetes deployment, you might want to focus on pod-related errors, requiring that a pod name is present (a `match` query does not support wildcards, so an `exists` clause expresses this) and that the message contains an error term:

```json
{
  "query": {
    "bool": {
      "must": [
        { "exists": { "field": "kubernetes.pod.name" } }
      ],
      "should": [
        { "match_phrase": { "message": "error" } },
        { "match_phrase": { "message": "exception" } }
      ],
      "must_not": [
        { "match_phrase": { "message": "liveness probe failed" } }
      ],
      "minimum_should_match": 1
    }
  }
}
```
For web services, you might focus on HTTP errors and performance issues (adjust the field names `http.response.status_code` and `response_time_ms` to match your own index mappings):

```json
{
  "query": {
    "bool": {
      "should": [
        { "range": { "http.response.status_code": { "gte": 500 } } },
        { "range": { "response_time_ms": { "gte": 1000 } } },
        { "match_phrase": { "message": "timed out" } }
      ],
      "minimum_should_match": 1
    }
  }
}
```
To run the tool periodically, set up a cron job:
```bash
# Run every 6 hours
0 */6 * * * cd /path/to/platform-problem-monitoring-core && ./bin/ppmc ./etc/main.conf >> /var/log/platform-monitoring.log 2>&1
```
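If a run can take longer than the interval, for example during an incident that floods the logs, you may want to prevent overlapping executions. A sketch using `flock` from util-linux, assuming it is available on the host:

```bash
# Run every 6 hours, but skip a run if the previous one is still going
0 */6 * * * flock -n /tmp/ppmc.lock -c 'cd /path/to/platform-problem-monitoring-core && ./bin/ppmc ./etc/main.conf' >> /var/log/platform-monitoring.log 2>&1
```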
The tool uses boto3's default credential resolution. You can:

- Set the environment variables `AWS_ACCESS_KEY_ID` and `AWS_SECRET_ACCESS_KEY`
- Use a shared credentials file (`~/.aws/credentials`)
- Use IAM roles if running on EC2 instances
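To verify that credentials resolve and the bucket is reachable before the first scheduled run, a small sanity check along these lines can help (a sketch; `check_s3_access` is not part of the tool, and the bucket name is whatever you configured in `main.conf`):

```python
import boto3
from botocore.exceptions import ClientError, NoCredentialsError

def check_s3_access(bucket_name: str) -> None:
    """Fail fast if boto3 cannot resolve credentials or reach the bucket."""
    s3 = boto3.client("s3")  # uses boto3's default credential resolution chain
    try:
        s3.head_bucket(Bucket=bucket_name)
        print(f"OK: credentials resolved and bucket '{bucket_name}' is reachable")
    except NoCredentialsError:
        print("No AWS credentials found in environment, config file, or IAM role")
    except ClientError as e:
        print(f"Bucket check failed: {e.response['Error']['Code']}")

check_s3_access("your-s3-bucket")
```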
If the tool cannot reach one of its backing services:

- Check Elasticsearch connectivity:

  ```bash
  curl -X GET https://your-elasticsearch-server:9200/_cat/indices
  ```

- Verify S3 bucket permissions
- Test SMTP settings, for example with a quick smtplib one-liner:

  ```bash
  python3 -c "import smtplib; s = smtplib.SMTP('smtp.example.com', 587); s.starttls(); print(s.noop()); s.quit()"
  ```
If reports come back empty even though problems exist:

- Check that your query matches actual log patterns
- Test your Elasticsearch query directly in Kibana
- Check the date range: are you missing events due to time zone issues?
- Adjust the Lucene query to be more inclusive
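You can also run the configured query file against Elasticsearch directly. Since the file already contains the top-level `query` object, it can be posted to the `_search` endpoint as-is (a sketch, assuming your indices match the `logstash-*` pattern used in the Kibana deeplink above):

```bash
curl -s -X POST "https://your-elasticsearch-server:9200/logstash-*/_search?size=3" \
  -H "Content-Type: application/json" \
  -d @/etc/ppmc/lucene_query.json
```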
For large log volumes:
- Shorten the time between runs so that each run processes a smaller batch of logs
- Optimize your Elasticsearch query with more specific filters
- Ensure the host running the tool has sufficient memory
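As an example of a more specific filter, a `filter` clause narrows the candidate set before the `should` clauses are scored (a sketch; the `environment` field is hypothetical and depends on your log schema):

```json
{
  "query": {
    "bool": {
      "filter": [
        { "term": { "environment": "production" } }
      ],
      "should": [
        { "match": { "message": "error" } }
      ],
      "minimum_should_match": 1
    }
  }
}
```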
If you encounter problems or have questions, please:
- Check the detailed logs in your temporary work directory
- Open an issue in our repository with your configuration (with sensitive data removed)
- Include error messages and steps to reproduce the issue
This project is available under the MIT License — Copyright (c) 2025 Manuel Kießling.