docs: improvements to incidents and providers (#3067)
talboren authored Feb 1, 2025
1 parent a44b927 commit 4fd6937
Showing 16 changed files with 444 additions and 13 deletions.
2 changes: 2 additions & 0 deletions docs/deployment/configuration.mdx
Expand Up @@ -135,6 +135,8 @@ OpenAPI configuration is used for integrating with OpenAI services. These settin
| Env var | Purpose | Required | Default Value | Valid options |
|:-------------------:|:-------:|:----------:|:-------------:|:-------------:|
| **OPENAI_API_KEY** | API key for OpenAI services | No | None | Valid OpenAI API key |
| **OPEN_AI_ORGANIZATION_ID** | Organization ID for OpenAI services | No | None | Valid OpenAI organization ID |
| **OPENAI_BASE_URL** | Base URL for OpenAI API (useful for LiteLLM proxy) | No | None | Valid URL (e.g., "http://localhost:4000") |
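
For illustration, these variables can be exported before starting Keep. The snippet below is a sketch with placeholder values — substitute your own key, organization ID, and base URL:

```bash
# Illustrative values only — replace with your own credentials.
export OPENAI_API_KEY="sk-..."                  # your OpenAI (or proxy) API key
export OPEN_AI_ORGANIZATION_ID="org-..."        # optional organization ID
export OPENAI_BASE_URL="http://localhost:4000"  # e.g. a local LiteLLM proxy
```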


### Posthog
Expand Down
65 changes: 65 additions & 0 deletions docs/deployment/local-llm/keep-with-litellm.mdx
@@ -0,0 +1,65 @@
---
title: "Running Keep with LiteLLM"
---

<Info>
This guide is for users who want to run Keep with locally hosted LLM models.
If you encounter any issues, please talk to us at our [Slack community](https://slack.keephq.dev).
</Info>

## Overview

This guide walks you through setting up Keep with LiteLLM, a versatile proxy that supports over 100 LLM providers behind an OpenAI-compatible API. Because LiteLLM adheres to OpenAI standards, Keep can use it as a drop-in replacement for OpenAI, giving you access to a wide range of LLM providers with minimal configuration.

### Motivation

Running LiteLLM alongside Keep lets organizations use local models in on-premises and air-gapped environments, so you can leverage Keep's AIOps capabilities while sensitive data never leaves your infrastructure. Because LiteLLM sits between Keep and the model as a proxy, you can work with a wide range of LLM providers without compromising data security. This approach is ideal for organizations that prioritize data privacy or must comply with strict regulatory requirements.

## Prerequisites

### Running LiteLLM locally

1. Ensure you have Python and pip installed on your system.
2. Install LiteLLM by running the following command:

```bash
pip install litellm
```

3. Start LiteLLM with your desired model. For example, to serve the `bigcode/starcoder` model from Hugging Face:

```bash
litellm --model huggingface/bigcode/starcoder
```

This will start the proxy server on `http://0.0.0.0:4000`.
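
Once the proxy is running, you can sanity-check it with an OpenAI-style request. The snippet below is a quick sketch; the `model` value should match whatever model you started LiteLLM with:

```bash
# Send a test chat completion through the LiteLLM proxy (OpenAI-compatible endpoint).
curl http://localhost:4000/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "huggingface/bigcode/starcoder",
    "messages": [{"role": "user", "content": "Hello from Keep!"}]
  }'
```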

### Running LiteLLM with Docker

To run LiteLLM using Docker, you can use the following command:

```bash
docker run -p 4000:4000 litellm/litellm --model huggingface/bigcode/starcoder
```

This command will start the LiteLLM proxy in a Docker container, exposing it on port 4000.

## Configuration

| Env var | Purpose | Required | Default Value | Valid options |
| :-------------------------: | :-----------------------------------------: | :------: | :-----------: | :---------------------------------------: |
| **OPEN_AI_ORGANIZATION_ID** | Organization ID for OpenAI/LiteLLM services | Yes | None | Valid organization ID string |
| **OPEN_AI_API_KEY** | API key for OpenAI/LiteLLM services | Yes | None | Valid API key string |
| **OPENAI_BASE_URL** | Base URL for the LiteLLM proxy | Yes | None | Valid URL (e.g., "http://localhost:4000") |

<Note>
These environment variables should be set on both Keep **frontend** and
**backend**.
</Note>
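
As a sketch, using the variable names from the table above, the configuration could look like this (placeholder values — a LiteLLM proxy without authentication typically accepts any non-empty API key):

```bash
# Set these for both the Keep frontend and backend processes (placeholder values).
export OPEN_AI_ORGANIZATION_ID="my-org"          # any identifier accepted by your proxy
export OPEN_AI_API_KEY="sk-local-proxy"          # placeholder if the proxy doesn't enforce auth
export OPENAI_BASE_URL="http://localhost:4000"   # points Keep at the LiteLLM proxy
```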

## Additional Resources

- [LiteLLM Documentation](https://docs.litellm.ai/)

By following these steps, you can leverage the power of multiple LLM providers with Keep, using LiteLLM as a flexible and powerful proxy.
Binary file added docs/images/correlation-topology.png
Binary file added docs/images/provider-methods-assistant.png
Binary file added docs/images/provider-methods-menu.png
Binary file added docs/images/provider-methods-modal.png
15 changes: 13 additions & 2 deletions docs/mint.json
Expand Up @@ -66,12 +66,18 @@
{
"group": "AIOps",
"pages": [
"overview/correlation",
{
"group": "Correlation",
"pages": [
"overview/correlation-rules",
"overview/correlation-topology"
]
},
"overview/deduplication",
"overview/enrichment/extraction",
"overview/enrichment/mapping",
"overview/servicetopology",
"overview/maintenance-windows",
"overview/servicetopology",
"overview/workflow-automation"
]
},
Expand Down Expand Up @@ -122,6 +128,7 @@
"pages": [
"providers/overview",
"providers/adding-a-new-provider",
"providers/provider-methods",
{
"group": "Supported Providers",
"pages": [
Expand Down Expand Up @@ -278,6 +285,10 @@
"deployment/ecs"
]
},
{
"group": "Local LLM",
"pages": ["deployment/local-llm/keep-with-litellm"]
},
"deployment/stress-testing"
]
},
Expand Down
@@ -1,5 +1,5 @@
---
title: "Correlation"
title: "Correlation Rules"
---

The Keep Correlation Engine is a versatile tool for correlating and consolidating alerts into incidents or incident-candidates.
Expand Down
104 changes: 104 additions & 0 deletions docs/overview/correlation-topology.mdx
@@ -0,0 +1,104 @@
---
title: "Topology Correlation"
---

The Topology Processor is a core component of Keep that helps correlate alerts based on your infrastructure's topology, creating meaningful incidents that reflect the relationships between your services and applications.
It automatically analyzes incoming alerts and their relationship to your infrastructure topology, creating incidents when multiple related services or components of an application are affected.

Read more about [Service Topology](/overview/servicetopology).

<Frame width="100" height="200">
<img height="10" src="/images/correlation-topology.png" />
</Frame>

<Tip>
The Topology Processor is disabled by default. To enable it, set the
environment variable `KEEP_TOPOLOGY_PROCESSOR=true`.
</Tip>

## How It Works

1. **Service Discovery**: The processor maintains a map of your infrastructure's topology, including:

- Services and their relationships
- Applications and their constituent services
- Dependencies between different components

2. **Alert Processing**: Every few seconds, the processor:

- Analyzes recent alerts
- Maps alerts to services in your topology
- Creates or updates incidents based on application-level impact

3. **Incident Creation**: When multiple services within an application have active alerts:
- Creates a new application-level incident
- Groups related alerts under this incident
- Provides context about the affected application and its services

## Configuration

### Environment Variables

| Variable | Description | Default |
| ------------------------------------------ | --------------------------------------------------- | ------- |
| `KEEP_TOPOLOGY_PROCESSOR` | Enable/disable the topology processor | `false` |
| `KEEP_TOPOLOGY_PROCESSOR_INTERVAL` | Interval for processing alerts (in seconds) | `10` |
| `KEEP_TOPOLOGY_PROCESSOR_LOOK_BACK_WINDOW` | Look back window for alert correlation (in minutes) | `15` |
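
For example, a minimal setup that enables the processor with its default timings might look like this (illustrative values mirroring the defaults above):

```bash
# Enable the Topology Processor and set its timing (values mirror the defaults above).
export KEEP_TOPOLOGY_PROCESSOR=true                  # processor is disabled by default
export KEEP_TOPOLOGY_PROCESSOR_INTERVAL=10           # process alerts every 10 seconds
export KEEP_TOPOLOGY_PROCESSOR_LOOK_BACK_WINDOW=15   # correlate alerts from the last 15 minutes
```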

## Incident Management

### Creation

When the processor detects alerts affecting multiple services within an application:

- Creates a new incident with type "topology"
- Names it "Application incident: {application_name}"
- Automatically confirms the incident
- Links all related alerts to the incident

### Resolution

Incidents can be configured to resolve automatically when:

- All related alerts are resolved
- Specific resolution criteria are met

## Best Practices

1. **Service Mapping**

- Ensure services in alerts match your topology definitions
- Maintain up-to-date topology information

2. **Application Definition**

- Group related services into logical applications
- Define clear service boundaries

3. **Alert Configuration**
- Include service information in your alerts
- Use consistent service naming across monitoring tools

## Example

If you have an application "payment-service" consisting of multiple microservices:

```json
{
"application": "payment-service",
"services": ["payment-api", "payment-processor", "payment-database"]
}
```

When alerts come in for both `payment-api` and `payment-database`, the Topology Processor will:

1. Recognize these services belong to the same application
2. Create a single incident for "payment-service"
3. Group both alerts under this incident
4. Provide application-level context in the incident description

## Limitations

- Currently supports only application-based incident creation
- One active incident per application at a time
- Requires service information in alerts for correlation
50 changes: 48 additions & 2 deletions docs/overview/servicetopology.mdx
Expand Up @@ -4,38 +4,84 @@ title: "Service Topology"

The Service Topology feature in Keep provides a visual representation of your service dependencies, allowing you to quickly understand the relationships between various components in your system. By mapping services and their interactions, you can gain insights into how issues in one service may impact others, enabling faster root-cause analysis and more effective incident resolution.


<Frame width="100" height="200">
<img height="10" src="/images/servicetopology.png" />
</Frame>


## Key Concepts

- **Nodes**: Represent individual services, applications, or infrastructure components.
- **Edges**: Show the dependencies and interactions between nodes.

## Supported Providers

<CardGroup cols={3}>
<Card
title="Datadog"
href="/providers/documentation/datadog-provider"
icon={
<img src="https://img.logo.dev/datadoghq.com?token=pk_dfXfZBoKQMGDTIgqu7LvYg" />
}
></Card>
<Card
title="Pagerduty"
href="/providers/documentation/pagerduty-provider"
icon={
<img src="https://img.logo.dev/pagerduty.com?token=pk_dfXfZBoKQMGDTIgqu7LvYg" />
}
></Card>
<Card
title="ArgoCD"
href="/providers/documentation/argocd-provider"
icon={
<img src="https://img.logo.dev/argoproj.github.io?token=pk_dfXfZBoKQMGDTIgqu7LvYg" />
}
></Card>
<Card
title="Cilium"
href="/providers/documentation/cilium-provider"
icon={
<img src="https://img.logo.dev/cilium.io?token=pk_dfXfZBoKQMGDTIgqu7LvYg" />
}
></Card>
<Card
title="Service Now"
href="/providers/documentation/service-now-provider"
icon={
<img src="https://img.logo.dev/servicenow.com?token=pk_dfXfZBoKQMGDTIgqu7LvYg" />
}
></Card>
</CardGroup>

## Features

### Visualizing Dependencies

The service topology graph helps you:

- Identify critical dependencies between services.
- Understand how failures in one service propagate through the system.
- Highlight single points of failure or bottlenecks.

### Real-Time Health Indicators

Nodes and edges are enriched with health indicators derived from alerts and metrics. This allows you to:

- Quickly spot issues in your architecture.
- Prioritize incident resolution based on affected dependencies.

### Filter and Focus

Use filters to focus on specific parts of the topology, such as:

- A particular environment (e.g., production, staging).
- A service group (e.g., all database-related services).
- Alerts of a specific severity or type.

### Incident Integration

Service topology integrates seamlessly with Keep’s incident management features. When an incident is triggered, you can:

- View the affected nodes and their dependencies directly on the topology graph.
- Analyze how alerts related to the incident are propagating through the system.
- Use this information to guide remediation efforts.
21 changes: 21 additions & 0 deletions docs/providers/documentation/datadog-provider.mdx
Expand Up @@ -10,10 +10,31 @@ description: "Datadog provider allows you to query Datadog metrics and logs for
- `time_range`: dict = None: The time range for the query (e.g., `{'from': 'now-15m', 'to': 'now'}`)
- `source`: str = None: The source type (metrics, traces, logs).

Example:
```python
result = provider.query(
    query="avg:system.cpu.user{*}",
    time_range={"from": "now-1h", "to": "now"}
)
```

## Outputs

_No information yet, feel free to contribute it using the "Edit this page" link at the bottom of the page_

### Additional Methods

| Method | Description | Required Scopes | Type |
|--------|-------------|----------------|------|
| `mute_monitor` | Mute a monitor | `monitors_write` | action |
| `unmute_monitor` | Unmute a monitor | `monitors_write` | action |
| `get_monitor_events` | Get all events related to this monitor | `events_read` | view |
| `get_trace` | Get trace by id | `apm_read` | view |
| `create_incident` | Create an incident | `incidents_write` | action |
| `resolve_incident` | Resolve an active incident | `incidents_write` | action |
| `add_incident_timeline_note` | Add a note to an incident timeline | `incidents_write` | action |

## Authentication Parameters

The `api_key` and `app_key` are required for connecting to the Datadog provider. You can obtain them as described in the "Connecting with the Provider" section.
Expand Down
6 changes: 3 additions & 3 deletions docs/providers/overview.mdx
Expand Up @@ -39,7 +39,7 @@ By leveraging Keep Providers, users are able to deeply integrate Keep with the t
<Card
title="ArgoCD"
href="/providers/documentation/argocd-provider"
icon={ <img src="https://img.logo.dev/argocd.com?token=pk_dfXfZBoKQMGDTIgqu7LvYg" /> }
icon={ <img src="https://img.logo.dev/argoproj.github.io?token=pk_dfXfZBoKQMGDTIgqu7LvYg" /> }
></Card>

<Card
Expand Down Expand Up @@ -93,7 +93,7 @@ By leveraging Keep Providers, users are able to deeply integrate Keep with the t
<Card
title="Cilium"
href="/providers/documentation/cilium-provider"
icon={ <img src="https://img.logo.dev/cilium.com?token=pk_dfXfZBoKQMGDTIgqu7LvYg" /> }
icon={ <img src="https://img.logo.dev/cilium.io?token=pk_dfXfZBoKQMGDTIgqu7LvYg" /> }
></Card>

<Card
Expand Down Expand Up @@ -496,7 +496,7 @@ By leveraging Keep Providers, users are able to deeply integrate Keep with the t
<Card
title="Service Now"
href="/providers/documentation/service-now-provider"
icon={ <img src="https://img.logo.dev/service.com?token=pk_dfXfZBoKQMGDTIgqu7LvYg" /> }
icon={ <img src="https://img.logo.dev/servicenow.com?token=pk_dfXfZBoKQMGDTIgqu7LvYg" /> }
></Card>

<Card
Expand Down