This document outlines the monitoring solution designed for Slinky, Skip's premium oracle product used by high-performance app chains. The solution includes a metrics dashboard, alerts configuration, and additional features to ensure the reliability and performance of the oracle service.
The monitoring solution is built on Prometheus for metrics collection, Grafana for visualization, and AlertManager for alert management. It offers dynamic service discovery, advanced alert correlation, anomaly detection, and automated remediation capabilities to identify and respond to issues proactively.
Metric Name | Metric Type | Metric Description |
---|---|---|
oracle_aggregate_price | Gauge | The price of each asset pair after aggregation (medianization). Contains `pair` and `decimals` labels. |
oracle_api_response_status_per_provider | Counter | The number of responses with each status, as reported per provider. Contains `provider`, `id`, and `status` labels. |
oracle_api_response_time_per_provider | Histogram | The response time of API calls, as reported per provider. Contains a `provider` label. |
oracle_provider_last_updated_id | Gauge | The last updated time for each ID (currency pair). Contains `provider`, `id`, and `type` labels. |
oracle_provider_price | Gauge | The price that each individual provider reports (prior to aggregation). Contains `provider`, `type`, `pair`, and `decimals` labels. |
oracle_provider_status_responses | Counter | The status (success or failure) of each attempt by a given provider to retrieve a price. Contains `provider`, `status`, `code`, and `type` labels. |
oracle_provider_status_responses_per_id | Counter | The number of responses with each status, as reported per ID (currency pair). Contains `provider`, `id`, `status`, `code`, and `type` labels. |
oracle_ticks_total | Counter | The monotonically increasing count of "ticks" the oracle has successfully executed. The tick interval is configurable, but the count should keep increasing for as long as the oracle runs. |
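As a rough illustration of how these metrics can be read programmatically, the sketch below parses Prometheus text-format output of the kind a metrics endpoint returns. Only the metric and label names come from the table above; the sample values, the `example` provider name, and the `parse_metrics` helper are hypothetical.

```python
import re

# One sample line of the Prometheus text exposition format:
#   metric_name{label="value",...} value [timestamp]
_SAMPLE = re.compile(
    r'^(?P<name>[a-zA-Z_:][a-zA-Z0-9_:]*)'
    r'(?:\{(?P<labels>.*)\})?'
    r'\s+(?P<value>\S+)'
    r'(?:\s+-?\d+)?$'
)
_LABEL = re.compile(r'([a-zA-Z_][a-zA-Z0-9_]*)="((?:[^"\\]|\\.)*)"')


def parse_metrics(text):
    """Return (name, labels, value) tuples.

    Comment lines (# HELP, # TYPE) and unparseable lines are skipped.
    Escaped characters inside label values are left as-is.
    """
    samples = []
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue
        match = _SAMPLE.match(line)
        if match is None:
            continue
        try:
            value = float(match.group("value"))
        except ValueError:
            continue
        labels = dict(_LABEL.findall(match.group("labels") or ""))
        samples.append((match.group("name"), labels, value))
    return samples


# Hypothetical sample text; the values and provider name are made up.
SAMPLE = '''
# TYPE oracle_ticks_total counter
oracle_ticks_total 1234
# TYPE oracle_provider_price gauge
oracle_provider_price{decimals="8",pair="BTC/USD",provider="example",type="api"} 6.5e+12
'''

for name, labels, value in parse_metrics(SAMPLE):
    print(name, labels, value)
```

The parser uses only the standard library, so it can be pointed at saved output from the sidecar metrics endpoint without extra dependencies.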
- Prometheus: http://localhost:9090
  - Accessible from within Grafana at http://prometheus:9090
- Grafana: http://localhost:3000 (the credentials are admin / admin)
- Oracle sidecar metrics: http://localhost:8002/metrics
- Oracle sidecar API: http://localhost:8080/api
- Docker and Docker Compose
- Prometheus, Grafana, and AlertManager
- Access to the Slinky instance and related services
- Create a server on any cloud platform of your choice, or use your local machine.
- From a terminal on that server or your local machine, clone the repository by running the following commands:
git clone https://yourrepository.com/monitoring-solution.git
cd monitoring-solution
- Configure Services: Navigate to each service's configuration directory (prometheus, grafana, alertmanager) and review the configuration files. Update the configurations as needed to match your environment.
- Open the `alertmanager.yml` file in the alertmanager directory, then set `channel` to your Slack channel name and `api_url` to the Slack webhook URL for that channel. More information on setting up a webhook URL is available here.
- Launch the Stack: This starts all the services, including the monitoring setup.
docker-compose up -d
- Verify Installation: Ensure all services are running correctly. You can access Grafana at http://localhost:3000 and Prometheus at http://localhost:9090.
- To log in to Grafana, use the default username and password, both `admin`. You can change the password after the first login.
Note
- You can inspect the logs of any service in the stack by running:
docker-compose logs -f <service-name>
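To complement the Verify Installation step, here is a minimal sketch of a scripted check, assuming the default local ports listed earlier. The `/-/healthy` path is Prometheus's health endpoint and `/api/health` is Grafana's; the `is_up` and `check_stack` helpers and the `oracle-sidecar` key are hypothetical names introduced for this example.

```python
import http.client
import urllib.request

# Assumed default local endpoints; adjust to match your environment.
HEALTH_ENDPOINTS = {
    "prometheus": "http://localhost:9090/-/healthy",
    "grafana": "http://localhost:3000/api/health",
    "oracle-sidecar": "http://localhost:8002/metrics",
}


def is_up(url, timeout=3.0):
    """Return True if the URL answers with a 2xx status, False on any failure."""
    try:
        with urllib.request.urlopen(url, timeout=timeout) as resp:
            return 200 <= resp.status < 300
    except (OSError, ValueError, http.client.HTTPException):
        # OSError covers URLError/HTTPError, refused connections, and timeouts;
        # ValueError covers malformed URLs.
        return False


def check_stack(endpoints=HEALTH_ENDPOINTS, probe=is_up):
    """Probe every endpoint and return a {service name: reachable} mapping."""
    return {name: probe(url) for name, url in endpoints.items()}
```

Calling `check_stack()` after `docker-compose up -d` returns a dictionary mapping each service name to `True` or `False`; a `False` entry is a cue to inspect that service's logs.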
The dashboard has four main metric sections:
- Provider API Metrics: This row shows the total number of provider responses by status per hour and by ID per hour. To interact with both panels, change the `provider`, `Provider API Status`, and `id` variables. The two panels are:
  - Provider Responses By Status Per Hour: Shows how often providers are successfully updating their data.
  - Provider Responses By ID Per Hour: Shows how often each price feed is being updated successfully.
- Base Provider Metrics: This row also has two panels. To modify and compare data, change the `provider`, `Base Provider Status`, and `id` variables. The panels are:
  - Average Number of Responses Per Provider And Status Per Hour
  - Average Number of Responses Per ID Per Hour
- Prices & Charts: This part of the dashboard has six panels, each serving a different purpose. To interact with these panels, change the `pair`, `type`, and `provider` variables. The panels include the following:
  - Oracle Aggregate Price Chart: Shows the oracle aggregate price charted over time.
  - Oracle Provider Price Chart: Shows the oracle provider price charted over a selected timeframe.
  - Oracle Aggregate Price: Displays the oracle aggregate price over time.
  - Oracle Provider Price: Displays the oracle provider price over time.
- Miscellaneous: This row has two panels. To modify the panels, change the `id`, `provider`, and `type` variables. The panels are:
  - Oracle Provider Last Updated Time For Each Currency Pair in Seconds: Time taken for the oracle provider API to update currency pair data.
  - Rate of Oracle Ticks: Displays the rate of oracle ticks per hour.
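The panels above are driven by PromQL queries parameterized by the dashboard variables. The sketch below shows one plausible way such queries could be assembled; the exact expressions used in the shipped dashboard are not reproduced here, so the query shapes, the 1h and 5m windows, and the helper names are assumptions. Only the metric and label names come from the metrics table.

```python
def selector(metric, **labels):
    """Build a PromQL selector such as metric{provider=~"a|b"}.

    Regex matchers (=~) are used so a multi-value dashboard variable
    (for example "a|b") or ".*" for "all" can be passed straight through.
    Values are inserted verbatim; quotes inside values are not escaped.
    """
    if not labels:
        return metric
    matchers = ",".join(
        f'{key}=~"{value}"' for key, value in sorted(labels.items())
    )
    return f"{metric}{{{matchers}}}"


def responses_by_status_per_hour(provider=".*", status=".*"):
    """Assumed query shape for a 'Provider Responses By Status Per Hour' panel."""
    inner = selector(
        "oracle_api_response_status_per_provider",
        provider=provider,
        status=status,
    )
    return f"sum by (provider, status) (increase({inner}[1h]))"


def ticks_per_hour():
    """Assumed query shape for a 'Rate of Oracle Ticks' panel."""
    # rate() yields a per-second value, so scale by 3600 for a per-hour rate.
    return "rate(oracle_ticks_total[5m]) * 3600"


print(responses_by_status_per_hour())
print(ticks_per_hour())
```

Sorting the label matchers keeps the generated query text stable, which makes it easier to compare against what a panel actually runs.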
Stakeholders should find the dashboard helpful, as it highlights price variances and irregularities over time. We set up the "Rate of Oracle Ticks" panel to monitor spikes in the infrastructure so that we are alerted when things get out of hand. Alerting is crucial for visibility into the status of the entire stack, and the alerting rules we have built can be expanded as the stack grows, which also supports service discovery. The rules are:
- Oracle Service Anomalies
- High Error Rates (Critical)
- Significant Response Time Increases
- Significant Response Time Increases (Critical)
- Data Freshness Issues
- Price Data Anomalies
- Service Unavailability
- Spike In Query Volume
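To make the idea of severity-tiered alerts concrete, here is a small sketch of how an error-rate condition could be classified. It only illustrates the logic; the actual conditions live in the Prometheus rules file, and the 5% and 20% thresholds and the function names below are hypothetical.

```python
WARNING_RATIO = 0.05   # hypothetical threshold, not taken from the rules file
CRITICAL_RATIO = 0.20  # hypothetical threshold, not taken from the rules file


def error_ratio(successes, failures):
    """Fraction of failed responses in a window; 0.0 when there was no traffic."""
    total = successes + failures
    return failures / total if total else 0.0


def classify(successes, failures):
    """Map success/failure counts from one window to a severity label."""
    ratio = error_ratio(successes, failures)
    if ratio >= CRITICAL_RATIO:
        return "critical"
    if ratio >= WARNING_RATIO:
        return "warning"
    return "ok"
```

In practice the counts would come from the increase of a counter such as oracle_provider_status_responses over the evaluation window, split by its status label.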
Note
- Descriptions for all the alerts are in the rules file, and any other necessary alerts can be added as we move forward.
- When using the dashboard and monitoring setup, you only need to change the variables to suit your display preferences; everything else works out of the box.
- Other alert receivers, such as Discord and email, can be configured as well.
- Samuel Arogbonlo - GitHub