256 integrate health check and monitoring for the server e5 #305

Open
wants to merge 35 commits into
base: main

Conversation

Ngha-Boris
Collaborator

@Ngha-Boris Ngha-Boris commented Feb 3, 2025

Screenshot from 2025-01-30 17-22-16

Collaborator

@Blindspot22 Blindspot22 left a comment

Check this @Ngha-Boris

Cargo.toml: outdated, resolved
alertmanager.yml: outdated, resolved
docker-compose.yml: resolved (2 threads)
prometheus.yml: outdated, resolved (2 threads)
@ndefokou ndefokou requested a review from Ogenbertrand February 6, 2025 09:36
alertmanager.yml: outdated, resolved
src/main.rs: outdated, resolved (4 threads)
Comment on lines 22 to 59
```
#
### Metrics Endpoint
- **Route:** ```/metrics```
- **Description:** Exposes Prometheus-compatible metrics, which are scraped by Prometheus for monitoring.
- **Response:** Raw Prometheus metrics data (e.g., CPU, memory usage).
#
### Prometheus Metrics

Prometheus collects the following metrics:
- **CPU Usage:** Tracks CPU time spent in different modes (idle, system, user). Alert if usage exceeds 80% for 2 minutes.
- **Memory Usage:** Monitors available memory. Alert if usage exceeds 85% for 2 minutes.
- **Disk Usage:** Monitors disk space. Alert if disk space is below 20%.
- **API Response Time:** Measures API response times. Alert if the 95th percentile response time exceeds 1 second.
- **HTTP Errors:** Tracks failed HTTP requests. Alert if failure rate exceeds 5% over 5 minutes.
#
### Alerting

Prometheus triggers alerts based on defined conditions:

**Important Alerts**
- **InstanceDown:** Triggered if the server is unreachable for 30 seconds.
- **HighCPUUsage:** Triggered if CPU usage is over 80% for 2 minutes.
- **HighMemoryUsage:** Triggered if memory usage exceeds 85% for 2 minutes.
- **SlowAPIResponse:** Triggered if API response times are too slow (95th percentile > 1s).
- **DiskSpaceLow:** Triggered if disk space is below 20%.
Alerts are routed to Alertmanager, which sends notifications through any configured receiver (e.g. Slack, Discord), as set up in ```alertmanager.yml```.
#
### Docker Setup
The monitoring stack is managed using Docker Compose. It includes:
- **Prometheus:** Collects metrics.
- **Grafana:** Visualizes metrics.
- **Alertmanager:** Sends notifications based on alerts.
- **Node Exporter:** Exposes system metrics (CPU, memory, disk usage).
### Build and run
```bash
docker compose up --build
```
Collaborator

Can you double check the markdown here? When it is all red, it generally means the syntax is not correct. Maybe the hash signs with remaining lines empty are not necessary...
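
On the content of the excerpt itself: thresholds such as "over 80% for 2 minutes" rely on the condition holding continuously before an alert fires. The sketch below illustrates that behaviour in plain Rust. It is not code from this PR and not how Prometheus is implemented; `ThresholdAlert`, `AlertState`, and `observe` are hypothetical names, and the inactive/pending/firing states are only loosely modelled on Prometheus alerting rules.

```rust
use std::time::Duration;

/// Alert lifecycle, loosely modelled on Prometheus' inactive/pending/firing states.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum AlertState {
    Inactive,
    Pending,
    Firing,
}

/// Hypothetical evaluator for a rule of the form "value > threshold for hold_for".
pub struct ThresholdAlert {
    threshold: f64,
    hold_for: Duration,
    /// How long the condition has been continuously true; `None` when it is false.
    breached_for: Option<Duration>,
}

impl ThresholdAlert {
    pub fn new(threshold: f64, hold_for: Duration) -> Self {
        Self { threshold, hold_for, breached_for: None }
    }

    /// Feeds one sample, taken `elapsed` after the previous sample.
    pub fn observe(&mut self, value: f64, elapsed: Duration) -> AlertState {
        if value > self.threshold {
            // The first breaching sample starts the clock at zero;
            // later breaching samples extend it by the time elapsed.
            let held = match self.breached_for {
                Some(previous) => previous + elapsed,
                None => Duration::ZERO,
            };
            self.breached_for = Some(held);
            if held >= self.hold_for {
                AlertState::Firing
            } else {
                AlertState::Pending
            }
        } else {
            // Any sample at or below the threshold resets the clock.
            self.breached_for = None;
            AlertState::Inactive
        }
    }
}
```

With `ThresholdAlert::new(80.0, Duration::from_secs(120))` and one sample per minute, the alert stays pending for the first two breaching samples and fires on the third. A single sample at or below the threshold returns it to inactive, which is why a short CPU spike would not trigger `HighCPUUsage`.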

Comment on lines +1 to +2
// src/main.rs
use axum::routing::get;
Collaborator

What's the purpose of adding this comment?

Collaborator Author

That was an error; I removed it this morning and will push soon.

Comment on lines +8 to +9
mod health;
mod metrics;
Collaborator

I'd prefer you simply import these modules rather than creating two hierarchies:

use didcomm_mediator::{app, health, metrics};

Collaborator Author

Okay Sir
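
The concern about two hierarchies can be sketched in a single file. This is only a stand-in, assuming the library crate already exposes `health`: the nested module `library` plays the role of `didcomm_mediator`, and `Status` is a hypothetical type that does not appear in the PR. Declaring `mod health;` again in `main.rs` compiles the same source a second time, so its types are distinct from the library's.

```rust
use std::any::TypeId;

// Stand-in for the library crate (`didcomm_mediator` in the PR).
mod library {
    pub mod health {
        // Hypothetical type, for illustration only.
        #[derive(Debug, PartialEq)]
        pub struct Status(pub &'static str);
    }
}

// Stand-in for a second `mod health;` declared in the binary:
// identical source text, but a separate module tree.
mod health {
    #[derive(Debug, PartialEq)]
    pub struct Status(pub &'static str);
}

/// Returns true if the duplicated `Status` is the same type as the library's.
pub fn same_type() -> bool {
    TypeId::of::<library::health::Status>() == TypeId::of::<health::Status>()
}

/// Importing from the library keeps a single hierarchy and a single type.
pub fn imported_is_library_type() -> bool {
    use library::health::Status;
    TypeId::of::<Status>() == TypeId::of::<library::health::Status>()
}
```

Here `same_type()` returns `false` while `imported_is_library_type()` returns `true`. With duplicated modules, a value built from the binary's copy could not be passed to library functions that expect the library's type.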

Comment on lines 17 to 21
- **Response:**
```json
{
"Status": "OK",
}
Collaborator

I found that what trips up GitHub's rendering is the level of indentation at this point, and possibly the trailing comma too.

Suggested change
-- **Response:**
-```json
-{
-"Status": "OK",
-}
+- **Response:**
+```json
+{
+"Status": "OK"
+}

Collaborator Author

I just fixed the issue, but I am still working on updating the document.
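
The trailing comma discussed in this thread is easy to avoid when the body is assembled programmatically. The following is a minimal std-only sketch, not the PR's handler: `json_object` and `health_body` are hypothetical helper names, and escaping is deliberately limited to quotes and backslashes.

```rust
/// Escapes the two characters that would break a JSON string literal.
/// Control characters are not handled; this is a sketch, not a full encoder.
fn escape(s: &str) -> String {
    s.replace('\\', "\\\\").replace('"', "\\\"")
}

/// Renders string key/value pairs as a JSON object.
/// `join` places separators only between members, so no trailing comma can appear.
pub fn json_object(pairs: &[(&str, &str)]) -> String {
    let members: Vec<String> = pairs
        .iter()
        .map(|(key, value)| format!("\"{}\": \"{}\"", escape(key), escape(value)))
        .collect();
    format!("{{{}}}", members.join(", "))
}

/// Body matching the suggested change to the README's health response.
pub fn health_body() -> String {
    json_object(&[("Status", "OK")])
}
```

In a real service a JSON library would normally do this serialization; the point here is only that a valid object has separators between members and none after the last one.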

Development

Successfully merging this pull request may close these issues.

Integrate health checks and monitoring for the server E5.
4 participants