
Document metrics that help detect when WAL apply lag is increasing #191


Open. Wants to merge 6 commits into base: main.
16 changes: 10 additions & 6 deletions documentation/operations/logging-metrics.md
@@ -6,7 +6,7 @@ description: Configure and understand QuestDB logging and metrics, including log
import { ConfigTable } from "@theme/ConfigTable"
import httpMinimalConfig from "./_http-minimal.config.json"

This page outlines logging in QuestDB. It covers how to configure logs via `log.conf` and expose metrics via Prometheus.

- [Logging](/docs/operations/logging-metrics/#logging)
- [Metrics](/docs/operations/logging-metrics/#metrics)
@@ -48,10 +48,10 @@ QuestDB provides the following types of log information:
For more information, see the
[QuestDB source code](https://github.com/questdb/questdb/blob/master/core/src/main/java/io/questdb/log/LogLevel.java).


### Example log messages

Advisory:

```
2023-02-24T14:59:45.076113Z A server-main Config:
2023-02-24T14:59:45.076130Z A server-main - http.enabled : true
@@ -60,23 +60,27 @@
```

Critical:

```
2022-08-08T11:15:13.040767Z C i.q.c.p.WriterPool could not open [table=`sys.text_import_log`, thread=1, ex=could not open read-write [file=/opt/homebrew/var/questdb/db/sys.text_import_log/_todo_], errno=13]
```

Error:

```
2023-02-24T14:59:45.059012Z I i.q.c.t.t.InputFormatConfiguration loading input format config [resource=/text_loader.json]
2023-03-20T08:38:17.076744Z E i.q.c.l.u.AbstractLineProtoUdpReceiver could not set receive buffer size [fd=140, size=8388608, errno=55]
```

Info:

```
2020-04-15T16:42:32.879970Z I i.q.c.TableReader new transaction [txn=2, transientRowCount=1, fixedRowCount=1, maxTimestamp=1585755801000000, attempts=0]
2020-04-15T16:42:32.880051Z I i.q.g.FunctionParser call to_timestamp('2020-05-01:15:43:21','yyyy-MM-dd:HH:mm:ss') -> to_timestamp(Ss)
```

Debug:

```
2023-03-31T11:47:05.723715Z D i.q.g.FunctionParser call cast(investmentMill,INT) -> cast(Li)
2023-03-31T11:47:05.723729Z D i.q.g.FunctionParser call rnd_symbol(4,4,4,2) -> rnd_symbol(iiii)
@@ -206,10 +210,10 @@ The following configuration options can be set in your `server.conf`:

On systems with
[8 Cores and less](/docs/operations/capacity-planning/#cpu-cores), contention
for threads might increase the latency of health check service responses. If you
use a load balancer and it thinks the QuestDB service is dead, with nothing
apparent in the QuestDB logs, you may need to configure a dedicated thread pool
for the health check service. To do so, increase `http.min.worker.count` to `1`.

:::

83 changes: 83 additions & 0 deletions documentation/operations/monitoring-alerting.md
@@ -0,0 +1,83 @@
---
title: Monitoring and alerting
description: Shows you how to monitor your database for potential issues and how to raise alerts
---

## Basic health check

QuestDB comes with an out-of-the-box health check HTTP endpoint:

```shell title="GET health status of local instance"
curl -v http://127.0.0.1:9003
```

Getting an OK response means the QuestDB process is up and running. This method
provides no further information.
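
For scripted checks, you can let curl treat HTTP error responses as failures. A
minimal sketch, assuming the default health check port `9003`:

```shell title="Scripted health check"
# Exit with a non-zero status if QuestDB does not answer with a 2xx
# response within two seconds, e.g. for use in cron jobs or container
# health checks.
curl --fail --silent --max-time 2 http://127.0.0.1:9003 > /dev/null \
  || echo "QuestDB health check failed"
```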

If you allocate 8 vCPUs/cores or fewer to QuestDB, the HTTP server thread may
not be able to get enough CPU time to respond in a timely manner. Your load
balancer may then flag the instance as dead. In such a case, create an isolated
thread pool just for the health check service (the `min` HTTP server) by setting
this configuration option:

```text
http.min.worker.count=1
```

## Alert on critical errors

QuestDB includes a log writer that sends any message logged at critical level to
Prometheus Alertmanager over a TCP/IP socket. To configure this writer, add it
to the `writers` config alongside other log writers. This is the basic setup:

```ini title="log.conf"
writers=stdout,alert
w.alert.class=io.questdb.log.LogAlertSocketWriter
w.alert.level=CRITICAL
```

For more details, see the
[Logging and metrics page](/docs/operations/logging-metrics/#prometheus-alertmanager).
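
On the Alertmanager side, you also need a route and a receiver so the incoming
alerts are delivered somewhere useful. A minimal sketch, where the receiver name
and webhook URL are placeholders you would replace with your own integration:

```yaml title="alertmanager.yml"
route:
  receiver: questdb-critical # send all alerts to a single receiver

receivers:
  - name: questdb-critical
    webhook_configs:
      # Placeholder endpoint - replace with your own webhook, Slack,
      # PagerDuty, or similar integration.
      - url: http://alerts.example.internal/questdb-critical
```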

## Detect suspended tables

QuestDB exposes a Prometheus gauge called `questdb_suspended_tables`. You can
set up an alert that fires whenever this gauge shows a value above zero.
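
For example, a Prometheus alerting rule along these lines would fire once any
table stays suspended for a few minutes. The rule name, `for` duration, and
labels are illustrative and should be adapted to your setup:

```yaml title="questdb-alerts.yml"
groups:
  - name: questdb-tables
    rules:
      - alert: QuestDBSuspendedTables
        # Fires when at least one table has been suspended for 5 minutes.
        expr: questdb_suspended_tables > 0
        for: 5m
        labels:
          severity: critical
        annotations:
          summary: "QuestDB reports one or more suspended tables"
```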

## Detect slow ingestion

QuestDB ingests data in two stages: first it records everything to the
Write-Ahead Log (WAL). This step is optimized for throughput and usually isn't
the bottleneck. The next step is inserting the data into the table, and this can
take longer if the data is out of order or touches multiple time partitions.
You can monitor the overall performance of this WAL apply process. QuestDB
exposes two Prometheus counters for it:

1. `questdb_wal_apply_seq_txn_total`: sum of all committed transaction sequence numbers
2. `questdb_wal_apply_writer_txn_total`: sum of all transaction sequence numbers applied to tables

Both of these counters grow continuously as data is ingested. When they are
equal, all WAL data has been applied to the tables. While data is being actively
ingested, the second counter will lag behind the first one. A steady difference
between them is a sign of a healthy rate of WAL application, with the database
keeping up with demand. However, if the difference continuously rises, it
indicates that either a table has become suspended and WAL data can't be applied
to it, or QuestDB is not able to keep up with the ingestion rate. All of the
data is still safely stored, but a growing portion of it is not yet visible to
queries.

You can create an alert that detects a steadily increasing difference between
these two counters. It won't tell you which table is experiencing issues, but it
is a low-impact way to detect that there is a problem that needs further
diagnosis.
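
A sketch of such an alert as a Prometheus rule is shown below. The windows and
`for` duration are illustrative; tune them to your ingestion pattern so short
bursts don't trigger false alarms:

```yaml title="questdb-alerts.yml"
groups:
  - name: questdb-wal
    rules:
      - alert: QuestDBWalApplyLagGrowing
        # The lag grows when the sequencer counter increases faster than
        # the writer counter over the evaluation window.
        expr: >
          increase(questdb_wal_apply_seq_txn_total[10m])
          > increase(questdb_wal_apply_writer_txn_total[10m])
        for: 15m
        labels:
          severity: warning
        annotations:
          summary: "WAL apply is falling behind ingestion"
```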

## Detect slow queries

QuestDB maintains a table called `_query_trace`, which records each executed
query and the time it took. You can query this table to find slow queries.
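
For example, a query along these lines lists the slowest queries traced in the
last hour. The column names here are illustrative; check the query tracing page
for the exact schema:

```sql
-- Slowest traced queries in the last hour (column names are illustrative).
SELECT ts, query_text, execution_micros
FROM _query_trace
WHERE ts > dateadd('h', -1, now())
ORDER BY execution_micros DESC
LIMIT 20;
```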

Read more about query tracing on the
[Concepts page](/docs/concept/query-tracing/).

## Detect potential causes of performance issues

... mention interesting Prometheus metrics ...
1 change: 1 addition & 0 deletions documentation/sidebars.js
@@ -468,6 +468,7 @@ module.exports = {
]
},
"operations/logging-metrics",
"operations/monitoring-alerting",
"operations/data-retention",
"operations/design-for-performance",
"operations/updating-data",