This repository has been archived by the owner on Oct 11, 2020. It is now read-only.

Automated zoom in and zoom out for root cause analysis (demo with BGP)

Khelil Sator edited this page Mar 12, 2019 · 14 revisions

Monitoring rule/playbook

The monitoring playbook bgp-monitoring-with-automatic-zoom uses the device rule check-bgp-state-with-automatic-zoom.

The device rule check-bgp-state-with-automatic-zoom collects BGP details, stores the data in the database, and monitors the state of the BGP sessions. This rule doesn't run advanced tests (no cross-device correlation for root cause analysis, just BGP session state monitoring).
If a BGP session moves to a non-established state, the device rule check-bgp-state-with-automatic-zoom uses the Python script bgp_zoom_in.py to automatically instantiate a BGP troubleshooting playbook.
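
The trigger logic described above can be sketched in a few lines of Python. This is only an illustration of the decision "zoom in when a session is not established". It is not the content of bgp_zoom_in.py, which is not reproduced on this page, and it does not call any Healthbot API. The function names and the shape of the returned dictionary are assumptions made for this example.

```python
# Hypothetical sketch: these names are illustrative and are NOT Healthbot's API
# or the actual content of bgp_zoom_in.py.
ESTABLISHED = "Established"


def sessions_needing_zoom(peer_states):
    """Return the peers whose BGP session is not in the established state."""
    return sorted(
        peer for peer, state in peer_states.items() if state != ESTABLISHED
    )


def plan_zoom_in(device, peer_states, playbook="bgp-zoom"):
    """Describe the troubleshooting playbook instance to create.

    Returns None when every session is established, so nothing is instantiated.
    """
    down = sessions_needing_zoom(peer_states)
    if not down:
        return None
    return {"playbook": playbook, "triggered_by": device, "down_peers": down}
```

With all sessions established, plan_zoom_in returns None and nothing happens. As soon as one peer reports another state, it returns a description of the bgp-zoom instance that a real script would then ask Healthbot to create.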

Troubleshooting rules/playbook

These rules do not collect data from devices. They process the data already stored in the database and correlate it across devices. These rules help to understand the root cause of BGP issues:

The troubleshooting playbook bgp-zoom uses the network rules troubleshooting-as and troubleshooting-peer-type.
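
To make the idea of a cross-device correlation concrete, here is a minimal sketch of an AS check written in plain Python rather than as a Healthbot network rule. It compares the peer AS that each device has configured for a neighbor with the local AS of that neighbor. The data layout (a dictionary with local_as and neighbors keys) and the function name are assumptions made for this example; the actual troubleshooting-as rule may be implemented differently.

```python
# Hypothetical sketch of a cross-device AS correlation.
# The data model below is an assumption, not the Healthbot database schema.
def find_as_mismatches(devices):
    """Return (device, peer, configured_peer_as, actual_local_as) tuples
    for every neighbor whose configured peer AS differs from the local AS
    of the peer device."""
    mismatches = []
    for name, info in sorted(devices.items()):
        for peer, configured_as in sorted(info["neighbors"].items()):
            peer_info = devices.get(peer)
            if peer_info is None:
                # No data stored for this peer, so nothing to correlate.
                continue
            if configured_as != peer_info["local_as"]:
                mismatches.append(
                    (name, peer, configured_as, peer_info["local_as"])
                )
    return mismatches


# Illustrative data only: the local AS of vMX1 (101) is an assumed value.
example = {
    "vMX1": {"local_as": 101, "neighbors": {"vMX4": 200}},
    "vMX4": {"local_as": 104, "neighbors": {"vMX1": 101}},
}
```

Calling find_as_mismatches(example) reports that vMX1 expects AS 200 from vMX4 while vMX4 actually uses AS 104. A single device cannot detect this on its own, which is why the check needs data from both ends of the session.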

Workflow overview

Automated zoom in demo

Instantiate the BGP monitoring playbook bgp-monitoring-with-automatic-zoom.

All BGP sessions are established. The Healthbot GUI shows that all devices are in a good state. Also, no network group is configured.

BGP_monitoring_dashboard.png

BGP_monitoring_vmx1.png

Let's break a BGP session. Let's connect to vMX1 and apply a bad configuration change in order to break the BGP session between vMX1 and vMX4:

jcluser@vMX-addr-0# show | compare
[edit protocols bgp group underlay neighbor 192.168.1.1]
-     peer-as 104;
+     peer-as 200;

[edit]
jcluser@vMX-addr-0# commit and-quit
commit complete
Exiting configuration mode

jcluser@vMX-addr-0> show bgp summary
Groups: 1 Peers: 4 Down peers: 1
Table          Tot Paths  Act Paths Suppressed    History Damp State    Pending
inet.0
                      33         12          0          0          0          0
Peer                     AS      InPkt     OutPkt    OutQ   Flaps Last Up/Dwn State|#Active/Received/Accepted/Damped...
192.168.1.1             200          0          0       0       0           8 Active
192.168.1.3             105        423        426       0       0     3:08:56 4/11/11/0            0/0/0/0
192.168.1.5             106        423        428       0       0     3:08:53 4/11/11/0            0/0/0/0
192.168.1.7             107        423        428       0       0     3:08:51 4/11/11/0            0/0/0/0

jcluser@vMX-addr-0>
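
For readers who want to spot the broken session programmatically, the following sketch scans the text of a show bgp summary output and lists the peers that are not established. It is a simplified text parser written for this page, not the way Healthbot collects BGP data. It assumes IPv4 peer addresses and relies on the state names that Junos prints for sessions that are not up (Idle, Connect, Active, OpenSent, OpenConfirm).

```python
import re

# State names shown in the last column when a session is not established.
NON_ESTABLISHED = {"Idle", "Connect", "Active", "OpenSent", "OpenConfirm"}

# Peer lines start with an IPv4 address (assumption: no IPv6 peers).
PEER_LINE = re.compile(r"^\d{1,3}(\.\d{1,3}){3}\s")


def down_peers(summary_text):
    """Return (peer_address, peer_as, state) for each non-established peer."""
    down = []
    for line in summary_text.splitlines():
        line = line.strip()
        if not PEER_LINE.match(line):
            continue
        tokens = line.split()
        # Columns 0-5 are Peer, AS, InPkt, OutPkt, OutQ, Flaps.
        # The state follows the Last Up/Dwn value, so search from column 6 on.
        state = next((t for t in tokens[6:] if t in NON_ESTABLISHED), None)
        if state is not None:
            down.append((tokens[0], int(tokens[1]), state))
    return down


# Two peer lines taken from the output above.
SAMPLE = """\
192.168.1.1             200          0          0       0       0           8 Active
192.168.1.3             105        423        426       0       0     3:08:56 4/11/11/0            0/0/0/0
"""
```

Applied to SAMPLE, down_peers returns a single entry for 192.168.1.1 with AS 200 in the Active state, which matches the "Down peers: 1" counter in the output above.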

The monitoring rule shows the issue: BGP_monitoring_vmx1_active.png

The monitoring rule uses a UDA (user-defined action) to automatically instantiate the troubleshooting playbook.
Healthbot shows the root cause of the issue (an AS configuration mismatch between the local-as of vMX4 and the peer-as configured for it on vMX1).
bgp_zoom.png
