docs: improvements to incidents and providers (#3067)
talboren authored Feb 1, 2025
1 parent a44b927 commit 4fd6937
Showing 16 changed files with 444 additions and 13 deletions.
2 changes: 2 additions & 0 deletions docs/deployment/configuration.mdx
Expand Up @@ -135,6 +135,8 @@ OpenAPI configuration is used for integrating with OpenAI services. These settin
| Env var | Purpose | Required | Default Value | Valid options |
|:-------------------:|:-------:|:----------:|:-------------:|:-------------:|
| **OPENAI_API_KEY** | API key for OpenAI services | No | None | Valid OpenAI API key |
| **OPEN_AI_ORGANIZATION_ID** | Organization ID for OpenAI services | No | None | Valid OpenAI organization ID |
| **OPENAI_BASE_URL** | Base URL for OpenAI API (useful for LiteLLM proxy) | No | None | Valid URL (e.g., "http://localhost:4000") |
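
For illustration, these variables can be exported before starting Keep. The snippet below is a sketch with placeholder values — substitute your own key, organization ID, and base URL:

```bash
# Illustrative values only — replace with your own credentials.
export OPENAI_API_KEY="sk-..."                  # your OpenAI (or proxy) API key
export OPEN_AI_ORGANIZATION_ID="org-..."        # optional organization ID
export OPENAI_BASE_URL="http://localhost:4000"  # e.g. a local LiteLLM proxy
```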


### Posthog
Expand Down
65 changes: 65 additions & 0 deletions docs/deployment/local-llm/keep-with-litellm.mdx
@@ -0,0 +1,65 @@
---
title: "Running Keep with LiteLLM"
---

<Info>
This guide is for users who want to run Keep with locally hosted LLM models.
If you encounter any issues, please talk to us at our [Slack community](https://slack.keephq.dev).
</Info>

## Overview

This guide walks you through setting up Keep with LiteLLM, a versatile proxy that supports over 100 LLM providers behind an OpenAI-compatible API. Because LiteLLM adheres to OpenAI standards, Keep can use it as a drop-in replacement for OpenAI, giving you access to a wide range of LLM providers with minimal configuration.

### Motivation

Running LiteLLM alongside Keep lets organizations use local models in on-premises and air-gapped environments, so you can leverage Keep's AIOps capabilities while sensitive data never leaves your infrastructure. Because LiteLLM sits between Keep and the model as a proxy, you can work with a wide range of LLM providers without compromising data security. This approach is ideal for organizations that prioritize data privacy or must comply with strict regulatory requirements.

## Prerequisites

### Running LiteLLM locally

1. Ensure you have Python and pip installed on your system.
2. Install LiteLLM by running the following command:

```bash
pip install litellm
```

3. Start LiteLLM with your desired model. For example, to serve the `bigcode/starcoder` model from Hugging Face:

```bash
litellm --model huggingface/bigcode/starcoder
```

This will start the proxy server on `http://0.0.0.0:4000`.
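
Once the proxy is running, you can sanity-check it with an OpenAI-style request. The snippet below is a quick sketch; the `model` value should match whatever model you started LiteLLM with:

```bash
# Send a test chat completion through the LiteLLM proxy (OpenAI-compatible endpoint).
curl http://localhost:4000/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "huggingface/bigcode/starcoder",
    "messages": [{"role": "user", "content": "Hello from Keep!"}]
  }'
```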

### Running LiteLLM with Docker

To run LiteLLM using Docker, you can use the following command:

```bash
docker run -p 4000:4000 litellm/litellm --model huggingface/bigcode/starcoder
```

This command will start the LiteLLM proxy in a Docker container, exposing it on port 4000.

## Configuration

| Env var | Purpose | Required | Default Value | Valid options |
| :-------------------------: | :-----------------------------------------: | :------: | :-----------: | :---------------------------------------: |
| **OPEN_AI_ORGANIZATION_ID** | Organization ID for OpenAI/LiteLLM services | Yes | None | Valid organization ID string |
| **OPEN_AI_API_KEY** | API key for OpenAI/LiteLLM services | Yes | None | Valid API key string |
| **OPENAI_BASE_URL** | Base URL for the LiteLLM proxy | Yes | None | Valid URL (e.g., "http://localhost:4000") |

<Note>
These environment variables should be set on both Keep **frontend** and
**backend**.
</Note>
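
As a sketch, using the variable names from the table above, the configuration could look like this (placeholder values — a LiteLLM proxy without authentication typically accepts any non-empty API key):

```bash
# Set these for both the Keep frontend and backend processes (placeholder values).
export OPEN_AI_ORGANIZATION_ID="my-org"          # any identifier accepted by your proxy
export OPEN_AI_API_KEY="sk-local-proxy"          # placeholder if the proxy doesn't enforce auth
export OPENAI_BASE_URL="http://localhost:4000"   # points Keep at the LiteLLM proxy
```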

## Additional Resources

- [LiteLLM Documentation](https://docs.litellm.ai/)

By following these steps, you can leverage the power of multiple LLM providers with Keep, using LiteLLM as a flexible and powerful proxy.
Binary file added docs/images/correlation-topology.png
Binary file added docs/images/provider-methods-assistant.png
Binary file added docs/images/provider-methods-menu.png
Binary file added docs/images/provider-methods-modal.png
15 changes: 13 additions & 2 deletions docs/mint.json
Expand Up @@ -66,12 +66,18 @@
{
"group": "AIOps",
"pages": [
"overview/correlation",
{
"group": "Correlation",
"pages": [
"overview/correlation-rules",
"overview/correlation-topology"
]
},
"overview/deduplication",
"overview/enrichment/extraction",
"overview/enrichment/mapping",
"overview/servicetopology",
"overview/maintenance-windows",
"overview/servicetopology",
"overview/workflow-automation"
]
},
Expand Down Expand Up @@ -122,6 +128,7 @@
"pages": [
"providers/overview",
"providers/adding-a-new-provider",
"providers/provider-methods",
{
"group": "Supported Providers",
"pages": [
Expand Down Expand Up @@ -278,6 +285,10 @@
"deployment/ecs"
]
},
{
"group": "Local LLM",
"pages": ["deployment/local-llm/keep-with-litellm"]
},
"deployment/stress-testing"
]
},
Expand Down
@@ -1,5 +1,5 @@
---
title: "Correlation"
title: "Correlation Rules"
---

The Keep Correlation Engine is a versatile tool for correlating and consolidating alerts into incidents or incident-candidates.
Expand Down
104 changes: 104 additions & 0 deletions docs/overview/correlation-topology.mdx
@@ -0,0 +1,104 @@
---
title: "Topology Correlation"
---

The Topology Processor is a core component of Keep that helps correlate alerts based on your infrastructure's topology, creating meaningful incidents that reflect the relationships between your services and applications.
It automatically analyzes incoming alerts and their relationship to your infrastructure topology, creating incidents when multiple related services or components of an application are affected.

Read more about [Service Topology](/overview/servicetopology).

<Frame width="100" height="200">
<img height="10" src="/images/correlation-topology.png" />
</Frame>

<Tip>
The Topology Processor is disabled by default. To enable it, set the
environment variable `KEEP_TOPOLOGY_PROCESSOR=true`.
</Tip>

## How It Works

1. **Service Discovery**: The processor maintains a map of your infrastructure's topology, including:

- Services and their relationships
- Applications and their constituent services
- Dependencies between different components

2. **Alert Processing**: Every few seconds, the processor:

- Analyzes recent alerts
- Maps alerts to services in your topology
- Creates or updates incidents based on application-level impact

3. **Incident Creation**: When multiple services within an application have active alerts:
- Creates a new application-level incident
- Groups related alerts under this incident
- Provides context about the affected application and its services

## Configuration

### Environment Variables

| Variable | Description | Default |
| ------------------------------------------ | --------------------------------------------------- | ------- |
| `KEEP_TOPOLOGY_PROCESSOR` | Enable/disable the topology processor | `false` |
| `KEEP_TOPOLOGY_PROCESSOR_INTERVAL` | Interval for processing alerts (in seconds) | `10` |
| `KEEP_TOPOLOGY_PROCESSOR_LOOK_BACK_WINDOW` | Look back window for alert correlation (in minutes) | `15` |
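
For example, a minimal setup that enables the processor with its default timings might look like this (illustrative values mirroring the defaults above):

```bash
# Enable the Topology Processor and set its timing (values mirror the defaults above).
export KEEP_TOPOLOGY_PROCESSOR=true                  # processor is disabled by default
export KEEP_TOPOLOGY_PROCESSOR_INTERVAL=10           # process alerts every 10 seconds
export KEEP_TOPOLOGY_PROCESSOR_LOOK_BACK_WINDOW=15   # correlate alerts from the last 15 minutes
```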

## Incident Management

### Creation

When the processor detects alerts affecting multiple services within an application:

- Creates a new incident with type "topology"
- Names it "Application incident: {application_name}"
- Automatically confirms the incident
- Links all related alerts to the incident

### Resolution

Incidents can be configured to resolve automatically when:

- All related alerts are resolved
- Specific resolution criteria are met

## Best Practices

1. **Service Mapping**

- Ensure services in alerts match your topology definitions
- Maintain up-to-date topology information

2. **Application Definition**

- Group related services into logical applications
- Define clear service boundaries

3. **Alert Configuration**
- Include service information in your alerts
- Use consistent service naming across monitoring tools

## Example

If you have an application "payment-service" consisting of multiple microservices:

```json
{
"application": "payment-service",
"services": ["payment-api", "payment-processor", "payment-database"]
}
```

When alerts come in for both `payment-api` and `payment-database`, the Topology Processor will:

1. Recognize these services belong to the same application
2. Create a single incident for "payment-service"
3. Group both alerts under this incident
4. Provide application-level context in the incident description

## Limitations

- Currently supports only application-based incident creation
- One active incident per application at a time
- Requires service information in alerts for correlation
50 changes: 48 additions & 2 deletions docs/overview/servicetopology.mdx
Expand Up @@ -4,38 +4,84 @@ title: "Service Topology"

The Service Topology feature in Keep provides a visual representation of your service dependencies, allowing you to quickly understand the relationships between various components in your system. By mapping services and their interactions, you can gain insights into how issues in one service may impact others, enabling faster root-cause analysis and more effective incident resolution.


<Frame width="100" height="200">
<img height="10" src="/images/servicetopology.png" />
</Frame>


## Key Concepts

- **Nodes**: Represent individual services, applications, or infrastructure components.
- **Edges**: Show the dependencies and interactions between nodes.

## Supported Providers

<CardGroup cols={3}>
<Card
title="Datadog"
href="/providers/documentation/datadog-provider"
icon={
<img src="https://img.logo.dev/datadoghq.com?token=pk_dfXfZBoKQMGDTIgqu7LvYg" />
}
></Card>
<Card
title="Pagerduty"
href="/providers/documentation/pagerduty-provider"
icon={
<img src="https://img.logo.dev/pagerduty.com?token=pk_dfXfZBoKQMGDTIgqu7LvYg" />
}
></Card>
<Card
title="ArgoCD"
href="/providers/documentation/argocd-provider"
icon={
<img src="https://img.logo.dev/argoproj.github.io?token=pk_dfXfZBoKQMGDTIgqu7LvYg" />
}
></Card>
<Card
title="Cilium"
href="/providers/documentation/cilium-provider"
icon={
<img src="https://img.logo.dev/cilium.io?token=pk_dfXfZBoKQMGDTIgqu7LvYg" />
}
></Card>
<Card
title="Service Now"
href="/providers/documentation/service-now-provider"
icon={
<img src="https://img.logo.dev/servicenow.com?token=pk_dfXfZBoKQMGDTIgqu7LvYg" />
}
></Card>
</CardGroup>

## Features

### Visualizing Dependencies

The service topology graph helps you:

- Identify critical dependencies between services.
- Understand how failures in one service propagate through the system.
- Highlight single points of failure or bottlenecks.

### Real-Time Health Indicators

Nodes and edges are enriched with health indicators derived from alerts and metrics. This allows you to:

- Quickly spot issues in your architecture.
- Prioritize incident resolution based on affected dependencies.

### Filter and Focus

Use filters to focus on specific parts of the topology, such as:

- A particular environment (e.g., production, staging).
- A service group (e.g., all database-related services).
- Alerts of a specific severity or type.

### Incident Integration

Service topology integrates seamlessly with Keep’s incident management features. When an incident is triggered, you can:

- View the affected nodes and their dependencies directly on the topology graph.
- Analyze how alerts related to the incident are propagating through the system.
- Use this information to guide remediation efforts.
21 changes: 21 additions & 0 deletions docs/providers/documentation/datadog-provider.mdx
Expand Up @@ -10,10 +10,31 @@ description: "Datadog provider allows you to query Datadog metrics and logs for
- `time_range`: dict = None: The time range for the query (e.g., `{'from': 'now-15m', 'to': 'now'}`)
- `source`: str = None: The source type (metrics, traces, logs).

Example:
```python
result = provider.query(
    query="avg:system.cpu.user{*}",
    time_range={"from": "now-1h", "to": "now"}
)
```

## Outputs

_No information yet, feel free to contribute it using the "Edit this page" link at the bottom of the page_

### Additional Methods

| Method | Description | Required Scopes | Type |
|--------|-------------|----------------|------|
| `mute_monitor` | Mute a monitor | `monitors_write` | action |
| `unmute_monitor` | Unmute a monitor | `monitors_write` | action |
| `get_monitor_events` | Get all events related to this monitor | `events_read` | view |
| `get_trace` | Get trace by id | `apm_read` | view |
| `create_incident` | Create an incident | `incidents_write` | action |
| `resolve_incident` | Resolve an active incident | `incidents_write` | action |
| `add_incident_timeline_note` | Add a note to an incident timeline | `incidents_write` | action |

## Authentication Parameters

The `api_key` and `app_key` are required for connecting to the Datadog provider. You can obtain them as described in the "Connecting with the Provider" section.
Expand Down
6 changes: 3 additions & 3 deletions docs/providers/overview.mdx
Expand Up @@ -39,7 +39,7 @@ By leveraging Keep Providers, users are able to deeply integrate Keep with the t
<Card
title="ArgoCD"
href="/providers/documentation/argocd-provider"
icon={ <img src="https://img.logo.dev/argocd.com?token=pk_dfXfZBoKQMGDTIgqu7LvYg" /> }
icon={ <img src="https://img.logo.dev/argoproj.github.io?token=pk_dfXfZBoKQMGDTIgqu7LvYg" /> }
></Card>

<Card
Expand Down Expand Up @@ -93,7 +93,7 @@ By leveraging Keep Providers, users are able to deeply integrate Keep with the t
<Card
title="Cilium"
href="/providers/documentation/cilium-provider"
icon={ <img src="https://img.logo.dev/cilium.com?token=pk_dfXfZBoKQMGDTIgqu7LvYg" /> }
icon={ <img src="https://img.logo.dev/cilium.io?token=pk_dfXfZBoKQMGDTIgqu7LvYg" /> }
></Card>

<Card
Expand Down Expand Up @@ -496,7 +496,7 @@ By leveraging Keep Providers, users are able to deeply integrate Keep with the t
<Card
title="Service Now"
href="/providers/documentation/service-now-provider"
icon={ <img src="https://img.logo.dev/service.com?token=pk_dfXfZBoKQMGDTIgqu7LvYg" /> }
icon={ <img src="https://img.logo.dev/servicenow.com?token=pk_dfXfZBoKQMGDTIgqu7LvYg" /> }
></Card>

<Card
Expand Down