GitHub - samuelarogbonlo/skip-monitoring

OracleOps

Monitoring Solution for Slinky Oracle Service

This document outlines the monitoring solution designed for Slinky, Skip's premium oracle product used by high-performance app chains. The solution includes a metrics dashboard, alerts configuration, and additional features to ensure the reliability and performance of the oracle service.

Overview

The monitoring solution is built on Prometheus for metrics collection, Grafana for visualization, and AlertManager for alert management. It offers dynamic service discovery, advanced alert correlation, anomaly detection, and automated remediation capabilities to identify and respond to issues proactively.

Metrics

Metric Name	Metric Type	Metric Description
oracle_aggregate_price	Gauge	The price of each asset pair after running aggregation (medianization). Contains `pair` and `decimals` labels.
oracle_api_response_status_per_provider	Counter	The number of each status as reported per provider. Contains `provider`, `id`, and `status` labels.
oracle_api_response_time_per_provider	Histogram	The response time of a API calls made as reported per provider. Contains `provider` label.
oracle_provider_last_updated_id	Gauge	The last updated time for each ID (currency pair). Contains `provider`, `id`, and `type` labels.
oracle_provider_price	Gauge	The price that each separate provider reports (prior to aggregation). Contains `provider`, `type`, `pair`, and `decimals` labels.
oracle_provider_status_responses	Counter	The stats (success or failure) of each attempt at retrieving a price by a given provider. Contains `provider`, `status`, `code`, and `type` labels.
oracle_provider_status_responses_per_id	Counter	The number of each status as reported per ID (currency pair). Contains `provider`, `id`, `status`, `code`, and `type` labels.
oracle_ticks_total	Counter	The constantly incrementing number of "ticks" the oracle has successfully executed. The tick time period is configurable, but the count should continue increasing as the oracle runs.

Exposed services

Prometheus: http://localhost:9090
- Accessible in Grafana through http://prometheus:9090
Grafana: http://localhost:3000 (the credentials are admin / admin)
Oracle sidecar metrics: http://localhost:8002/metrics
Oracle sidecar API: http://localhost:8080/api

Running the infrastructure

Prerequisites

Docker and Docker Compose
Prometheus, Grafana, and AlertManager
Access to the Slinky instance and related services

Installation Steps

Create a personal server on any cloud platform of your choice or use your local machine.
Access the created server or use your local machine terminal to clone the Repository by running the following command:

git clone https://yourrepository.com/monitoring-solution.git
cd monitoring-solution

Configure Services: Navigate to each service's configuration directory (prometheus, grafana, alertmanager) and review the configuration files. Update the configurations as needed to match your environment.
Navigate to alertmanager.yml file in the alertmanager directory then update the channel with your slack channel name and api_url with your slack webhook for that channel. Check more information on setting up a webhook URL here
Launch the Stack: This will start all the services and monitoring setup as well.

docker-compose up -d

Verify Installation: Ensure all services are running correctly. You can access Grafana at http://localhost:3000 and Prometheus at http://localhost:9090.
For access, you can use the default username and password as admin, but you have a choice of changing the password after the first password usage.

Note

Furthermore, you can inspect the logs of any service in the stack by running:

docker-compose logs -f <service-name>

Dashboard and Visualizations

On the dashboard, there are four major metric sections and they are listed thus:

Provider API Metrics: This includes the "total" number of provider responses by status per hour and by ID per hour. To interact with both panels, you can make changes to provider, Provider API Status, and id variables. In summary, the two panels are listed below:
- Provider Responses By Status Per Hour: This provides introspection into how often providers are successfully updating their data.
- Provider Responses By ID Per Hour: This provides introspection into how often each price feed is being updated successfully.
Base Provider Metrics: This row has two major panels as well and to modify and compare data, you need to make changes to the provider, Base Provider Status and id variables. The panels are listed thus:
- Average Number of Responses Per Provider And Status Per Hour
- Average Number of Responses Per ID Per Hour
Prices & Charts: This part of the dashboard has six panels and they have different functions for the user. To interact with these panels, you can make changes to the pair, type, provider, They include the following:
- Oracle Aggregate Price Chart: This shows the oracle aggregate price chart over time
- Oracle Provider Price Chart: This displays the Oracle provider price chart over a certain timeframe.
- Oracle Aggregate Price: This displays the Oracle aggregate price over time.
- Oracle Provider Price: This displays the oracle's provider price per time.
Miscellaneous: This row has two panels. To modify the panels, you'll make changes to the variables; id, provider, type. They include:
- Oracle Provider Last Updated Time For Each Currency Pair in Seconds: Time taken for the Oracle provider API to update currency pair data.
- Rate of Oracle Ticks: Displays rate of oracle ticks per hour.

Generally, stakeholders would find the dashboard very helpful as it highlights different price variances and peculiarities per time. We went forward to set up "Rate of Oracle Ticks" to monitor the spikes in the infrastructure so we can rightly be alerted when things get out of hand. Furthermore, alerting is very crucial to having visibility status of the entire stack and we have built alerting rules that could still be expanded as the stack expands - this also promotes solid service discovery. The rules are listed thus:

Oracle Service Anomalies
High Error Rates Critical
Significant Response Time Increases
Significant Response Time Increases(Critical)
Data Freshness Issues
Price Data Anomalies
Service Unavailability
Spike In Query Volume

Note

Be aware that all the descriptions for the alerts are added in the rules file here. And any other necessary alerts can be added as we move forward.

In engaging with the dashboard and monitoring setup, you should be aware that all you have to do is change the different variables to fit your preference of display and everything works as an out-of-the-box solution.

Other alert-receiving platforms like Discord and email can be set as well.

Author

Samuel Arogbonlo - GitHub

Name		Name	Last commit message	Last commit date
Latest commit History 26 Commits
contrib		contrib
README.md		README.md
docker-compose.yml		docker-compose.yml

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Repository files navigation

OracleOps

Monitoring Solution for Slinky Oracle Service

Overview

Metrics

Exposed services

Running the infrastructure

Prerequisites

Installation Steps

Dashboard and Visualizations

Author

About

Releases

Packages

samuelarogbonlo/skip-monitoring

Folders and files

Latest commit

History

Repository files navigation

OracleOps

Monitoring Solution for Slinky Oracle Service

Overview

Metrics

Exposed services

Running the infrastructure

Prerequisites

Installation Steps

Dashboard and Visualizations

Author

About

Resources

Stars

Watchers

Forks

Releases

Packages 0

Packages