A fully automated network monitoring system that detects network failures (packet loss, high latency, link failures), alerts you via Slack & Email, and performs auto-remediation using Python, Netmiko, and Shell scripts. It also provides a REST API and Prometheus-Grafana integration for real-time monitoring and dashboards.
- Pings and monitors multiple network nodes automatically
- Detects:
  - Packet loss
  - Latency issues
  - Link down or host unreachable
- Alerts via:
  - Slack API
  - Email (SMTP)
- Auto-remediation:
  - Restarting network services
  - Sending commands via Netmiko
- Real-time metrics:
  - Prometheus metrics exporter
  - Grafana dashboards
- REST API for live status of devices
```
Auto-Network-Monitoring/
├── monitor.py              # Main monitoring script
├── api.py                  # Flask API to check device status
├── slack_alerts.py         # Slack webhook alert handler
├── email_alerts.py         # Email alerting script
├── remediation.sh          # Shell script to restart network services
├── requirements.txt        # Python dependencies
├── prometheus.yml          # Prometheus config
├── .gitignore              # Git ignored files
├── README.md               # Project documentation
└── exporters/
    └── metrics_exporter.py # Prometheus metrics exporter
```
This script is the core monitoring engine. It performs the following actions (a simplified sketch of the loop follows this list):
- Loops over a predefined list of network devices every 60 seconds.
- Pings each device to check its reachability.
- If a device is detected as down:
- Sends out alerts via configured channels (Slack, Email).
- Logs the failure event for record-keeping.
- Attempts to perform automated remediation actions using Netmiko (for network devices) or a fallback shell script.
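A minimal sketch of what this loop might look like, assuming Linux-style `ping` flags; the device list here is illustrative, and the real script calls into the alert and remediation modules where the comment indicates:

```python
import subprocess
import time

# Illustrative device list; the real one is defined in monitor.py
devices = [
    {"host": "192.168.1.1", "username": "admin", "password": "admin", "device_type": "cisco_ios"},
]

def is_reachable(host: str) -> bool:
    """Ping the host once (2-second timeout) and return True if it replies."""
    result = subprocess.run(
        ["ping", "-c", "1", "-W", "2", host],
        stdout=subprocess.DEVNULL,
        stderr=subprocess.DEVNULL,
    )
    return result.returncode == 0

while True:
    for device in devices:
        if not is_reachable(device["host"]):
            # In the real script, this is where the Slack/Email alert helpers
            # and the Netmiko/shell remediation are invoked.
            print(f"ALERT: {device['host']} is down or has high latency.")
    time.sleep(60)  # Re-check all devices every 60 seconds
```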
This file sets up a Flask web server to provide an API endpoint for checking the status of any monitored device.
Example request: to check the status of the device with IP `192.168.1.1`, send a GET request to `/status/192.168.1.1`:

```
GET /status/192.168.1.1
```

Example response:

```json
{
  "ip": "192.168.1.1",
  "status": "up"
}
```
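A minimal sketch of such an endpoint; the route matches the one documented in the API table below, while the ping-based check is only a stand-in for however `api.py` actually determines status:

```python
import subprocess

from flask import Flask, jsonify

app = Flask(__name__)

def is_reachable(host: str) -> bool:
    """Single ping with a short timeout (Linux-style flags assumed)."""
    result = subprocess.run(
        ["ping", "-c", "1", "-W", "2", host],
        stdout=subprocess.DEVNULL,
        stderr=subprocess.DEVNULL,
    )
    return result.returncode == 0

@app.route("/status/<ip>")
def status(ip):
    return jsonify({"ip": ip, "status": "up" if is_reachable(ip) else "down"})

if __name__ == "__main__":
    app.run(host="0.0.0.0", port=5000)
```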
This script is responsible for sending alert notifications to a specified Slack channel. It uses a Slack webhook URL to post messages.
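A hedged sketch of how such a webhook post typically looks with the `requests` library; the webhook URL is a placeholder, and the function name is illustrative rather than the exact one used in `slack_alerts.py`:

```python
import requests

SLACK_WEBHOOK_URL = "https://hooks.slack.com/services/XXX/YYY/ZZZ"  # placeholder

def send_slack_alert(message: str) -> None:
    """Post a plain-text alert message to the configured Slack channel."""
    response = requests.post(SLACK_WEBHOOK_URL, json={"text": message}, timeout=5)
    response.raise_for_status()

if __name__ == "__main__":
    send_slack_alert("ALERT: 192.168.1.2 is down or has high latency.")
```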
This script handles sending alert notifications via email. It uses a Gmail SMTP server. You will need to customize this file with your Gmail credentials (email address and an app password for security).
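A minimal sketch of Gmail SMTP alerting with the standard library; the addresses and app password are placeholders, and the exact message format in `email_alerts.py` may differ:

```python
import smtplib
from email.message import EmailMessage

SENDER = "your_address@gmail.com"    # placeholder
APP_PASSWORD = "your-app-password"   # placeholder: a Gmail app password, not your account password
RECIPIENT = "oncall@example.com"     # placeholder

def send_email_alert(device_ip: str) -> None:
    """Send a 'Network Alert' email describing the downed device."""
    msg = EmailMessage()
    msg["Subject"] = "Network Alert"
    msg["From"] = SENDER
    msg["To"] = RECIPIENT
    msg.set_content(f"Device {device_ip} is down or has high latency.")

    with smtplib.SMTP_SSL("smtp.gmail.com", 465) as server:
        server.login(SENDER, APP_PASSWORD)
        server.send_message(msg)
```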
This is a shell script designed as a fallback remediation method. It attempts to restart network services such as NetworkManager using `systemctl`, and is executed if Netmiko-based remediation is not applicable or fails.
This script exposes custom metrics in a format that Prometheus can scrape. It runs a small HTTP server on port `8000` and tracks the reachability status of each device: `0` indicates the device is OK (up), and `1` indicates the device is down.
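A minimal sketch using the `prometheus_client` library; the metric name mirrors the PromQL example later in this README, but the exact name and update interval in `metrics_exporter.py` may differ:

```python
import subprocess
import time

from prometheus_client import Gauge, start_http_server

# 0 = device up, 1 = device down (the convention described above)
DEVICE_DOWN = Gauge("network_packet_loss", "1 if the device is down, 0 if up", ["device_ip"])

DEVICE_IPS = ["192.168.1.1", "10.0.0.5"]  # illustrative list

def is_reachable(host: str) -> bool:
    """Single ping with a short timeout (Linux-style flags assumed)."""
    result = subprocess.run(
        ["ping", "-c", "1", "-W", "2", host],
        stdout=subprocess.DEVNULL,
        stderr=subprocess.DEVNULL,
    )
    return result.returncode == 0

if __name__ == "__main__":
    start_http_server(8000)  # Prometheus scrapes http://localhost:8000/metrics
    while True:
        for ip in DEVICE_IPS:
            DEVICE_DOWN.labels(device_ip=ip).set(0 if is_reachable(ip) else 1)
        time.sleep(30)
```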
Install dependencies: open your terminal and run the following command to install all necessary Python libraries listed in the `requirements.txt` file.

```bash
pip install -r requirements.txt
```
Configure your devices in `monitor.py`: edit the `monitor.py` file to define the list of devices you want to monitor. Each device is a dictionary with its `host` (IP address or hostname), `username`, `password`, and `device_type` (e.g., `cisco_ios`, `juniper_junos`).

```python
devices = [
    {"host": "192.168.1.1", "username": "admin", "password": "admin", "device_type": "cisco_ios"},
    {"host": "10.0.0.5", "username": "user", "password": "password123", "device_type": "linux"},
    # Add more devices here
]
```
Set up your credentials:

- `email_alerts.py`: open this file and enter your Gmail address and an app password. (Using an app password is more secure than using your main Google account password.)
- `slack_alerts.py`: open this file and replace the placeholder Slack webhook URL with your actual webhook URL.
Run the monitoring script: start the main monitoring process by executing:

```bash
python monitor.py
```

Run the API server (optional): if you want to use the API to check device statuses, run the Flask server in a separate terminal:

```bash
python api.py
```
This setup allows you to visualize the network monitoring data.
Prometheus Setup:
Install Prometheus: Download and install Prometheus from the official website if you haven't already.
Configure Prometheus: replace the contents of your `prometheus.yml` configuration file with the following. This tells Prometheus to scrape metrics from the `metrics_exporter.py` script.

```yaml
scrape_configs:
  - job_name: 'network_monitor'
    static_configs:
      - targets: ['localhost:8000']  # Assumes metrics_exporter.py is running on the same machine
```

Start Prometheus: navigate to your Prometheus directory in the terminal and start it with the new configuration:

```bash
./prometheus --config.file=prometheus.yml
```
- Install Grafana if you haven't already.
- Open Grafana in your web browser: http://localhost:3000
- Add Prometheus as a data source, pointing it to your Prometheus server: http://localhost:9090
- In Grafana, create a new dashboard.
- Add a panel and use a PromQL query to display the status. Example query to track packet loss (where `1` means down) for a specific device: `network_packet_loss{device_ip="192.168.1.1"}`
The system attempts to automatically fix issues when a device goes down.
For devices that support SSH access and are compatible with Netmiko (e.g., Cisco IOS, Juniper Junos), `monitor.py` can send commands such as `reload` or other custom-defined commands to attempt recovery.

If Netmiko is not suitable or fails, the `remediation.sh` script is executed:

```bash
bash remediation.sh
```
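A hedged sketch of how this two-step remediation might look, using the device dictionaries shown earlier; the recovery command here is illustrative, and the real commands sent by `monitor.py` may differ:

```python
import subprocess

from netmiko import ConnectHandler

def remediate(device: dict) -> None:
    """Try Netmiko-based recovery first, then fall back to the shell script."""
    try:
        connection = ConnectHandler(
            device_type=device["device_type"],
            host=device["host"],
            username=device["username"],
            password=device["password"],
        )
        # Illustrative check/recovery step; real commands are device-specific
        # (e.g., bouncing an interface or reloading the device).
        connection.send_command("show interfaces")
        connection.disconnect()
    except Exception:
        # Fallback: restart local network services via the shell script.
        subprocess.run(["bash", "remediation.sh"], check=False)
```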
You will receive a message in your configured Slack channel similar to:

```
ALERT: 192.168.1.2 is down or has high latency.
```

Emails will have a subject line like `Network Alert`, and the email body will contain more details about the downed device.
The `api.py` script provides a simple REST API to query device status.

| Route | Method | Description |
|---|---|---|
| `/status/<ip>` | GET | Returns the current status (`up` or `down`) of the device. |
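For example, you could query the endpoint from Python, assuming the Flask server is running locally on its default port 5000:

```python
import requests

# Quick status check against the local API server (the port is an assumption)
response = requests.get("http://localhost:5000/status/192.168.1.1", timeout=5)
print(response.json())  # e.g. {"ip": "192.168.1.1", "status": "up"}
```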
Potential enhancements for this project include:
- SNMP Trap Integration: Receive and process SNMP traps from network devices for more proactive monitoring.
- Web Dashboard: Develop a dedicated web dashboard (beyond Grafana) for a live view of device statuses and logs.
- Store Logs to Database: Implement logging to a persistent database (e.g., PostgreSQL, MySQL, InfluxDB) for better querying and historical analysis.
- Kubernetes & Docker Compose Deployment: Create configurations for easier deployment and scaling using containerization technologies.
Contributions are welcome!
- For minor changes, feel free to submit a pull request.
- For major changes or new features, please open an issue first to discuss the proposed changes.
This project is licensed under the MIT License. See the `LICENSE` file for more details.
Shibam Nath
GitHub: shibam120302