
Document metrics that help detect when WAL apply lag is increasing #191


Open. Wants to merge 6 commits into base: main.
16 changes: 10 additions & 6 deletions documentation/operations/logging-metrics.md
@@ -6,7 +6,7 @@ description: Configure and understand QuestDB logging and metrics, including log
import { ConfigTable } from "@theme/ConfigTable"
import httpMinimalConfig from "./_http-minimal.config.json"

This page outlines logging in QuestDB. It covers how to configure logs via `log.conf` and expose metrics via Prometheus.

- [Logging](/docs/operations/logging-metrics/#logging)
- [Metrics](/docs/operations/logging-metrics/#metrics)
@@ -48,10 +48,10 @@ QuestDB provides the following types of log information:
For more information, see the
[QuestDB source code](https://github.com/questdb/questdb/blob/master/core/src/main/java/io/questdb/log/LogLevel.java).


### Example log messages

Advisory:

```
2023-02-24T14:59:45.076113Z A server-main Config:
2023-02-24T14:59:45.076130Z A server-main - http.enabled : true
@@ -60,23 +60,27 @@
```

Critical:

```
2022-08-08T11:15:13.040767Z C i.q.c.p.WriterPool could not open [table=`sys.text_import_log`, thread=1, ex=could not open read-write [file=/opt/homebrew/var/questdb/db/sys.text_import_log/_todo_], errno=13]
```

Error:

```
2023-02-24T14:59:45.059012Z I i.q.c.t.t.InputFormatConfiguration loading input format config [resource=/text_loader.json]
2023-03-20T08:38:17.076744Z E i.q.c.l.u.AbstractLineProtoUdpReceiver could not set receive buffer size [fd=140, size=8388608, errno=55]
```

Info:

```
2020-04-15T16:42:32.879970Z I i.q.c.TableReader new transaction [txn=2, transientRowCount=1, fixedRowCount=1, maxTimestamp=1585755801000000, attempts=0]
2020-04-15T16:42:32.880051Z I i.q.g.FunctionParser call to_timestamp('2020-05-01:15:43:21','yyyy-MM-dd:HH:mm:ss') -> to_timestamp(Ss)
```

Debug:

```
2023-03-31T11:47:05.723715Z D i.q.g.FunctionParser call cast(investmentMill,INT) -> cast(Li)
2023-03-31T11:47:05.723729Z D i.q.g.FunctionParser call rnd_symbol(4,4,4,2) -> rnd_symbol(iiii)
@@ -206,10 +210,10 @@ The following configuration options can be set in your `server.conf`:

On systems with
[8 Cores and less](/docs/operations/capacity-planning/#cpu-cores), contention
for threads might increase the latency of health check service responses. If you
use a load balancer and it thinks the QuestDB service is dead, with nothing
apparent in the QuestDB logs, you may need to configure a dedicated thread pool
for the health check service. To do so, increase `http.min.worker.count` to `1`.

:::

83 changes: 83 additions & 0 deletions documentation/operations/monitoring-alerting.md
@@ -0,0 +1,83 @@
---
title: Monitoring and alerting
description: Shows you how to monitor your database for potential issues and how to raise alerts
---

## Basic health check

QuestDB comes with an out-of-the-box health check HTTP endpoint:

```shell title="GET health status of local instance"
curl -v http://127.0.0.1:9003
```

Getting an OK response means the QuestDB process is up and running. This method
provides no further information.
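
For scripted checks, you can let curl treat HTTP error responses as failures. A
minimal sketch, assuming the default health check port `9003`:

```shell title="Scripted health check"
# Exit with a non-zero status if QuestDB does not answer with a 2xx
# response within two seconds, e.g. for use in cron jobs or container
# health checks.
curl --fail --silent --max-time 2 http://127.0.0.1:9003 > /dev/null \
  || echo "QuestDB health check failed"
```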

If you allocate 8 vCPUs/cores or fewer to QuestDB, the HTTP server thread may
not be able to get enough CPU time to respond in a timely manner. Your load
balancer may then flag the instance as dead. In such a case, create an isolated
thread pool just for the health check service (the `min` HTTP server) by setting
this configuration option:

```text
http.min.worker.count=1
```

## Alert on critical errors

QuestDB includes a log writer that sends any message logged at critical level to
Prometheus Alertmanager over a TCP/IP socket. To configure this writer, add it
to the `writers` config alongside other log writers. This is the basic setup:

```ini title="log.conf"
writers=stdout,alert
w.alert.class=io.questdb.log.LogAlertSocketWriter
w.alert.level=CRITICAL
```

For more details, see the
[Logging and metrics page](/docs/operations/logging-metrics/#prometheus-alertmanager).
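
On the Alertmanager side, you also need a route and a receiver so the incoming
alerts are delivered somewhere useful. A minimal sketch, where the receiver name
and webhook URL are placeholders you would replace with your own integration:

```yaml title="alertmanager.yml"
route:
  receiver: questdb-critical # send all alerts to a single receiver

receivers:
  - name: questdb-critical
    webhook_configs:
      # Placeholder endpoint - replace with your own webhook, Slack,
      # PagerDuty, or similar integration.
      - url: http://alerts.example.internal/questdb-critical
```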

## Detect suspended tables

QuestDB exposes a Prometheus gauge called `questdb_suspended_tables`. You can
set up an alert that fires whenever this gauge shows a value above zero.
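
For example, a Prometheus alerting rule along these lines would fire once any
table stays suspended for a few minutes. The rule name, `for` duration, and
labels are illustrative and should be adapted to your setup:

```yaml title="questdb-alerts.yml"
groups:
  - name: questdb-tables
    rules:
      - alert: QuestDBSuspendedTables
        # Fires when at least one table has been suspended for 5 minutes.
        expr: questdb_suspended_tables > 0
        for: 5m
        labels:
          severity: critical
        annotations:
          summary: "QuestDB reports one or more suspended tables"
```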

## Detect slow ingestion

QuestDB ingests data in two stages: first it records everything to the
Write-Ahead Log (WAL). This step is optimized for throughput and usually isn't
the bottleneck. The next step is inserting the data into the table, and this can
take longer if the data is out of order or touches multiple time partitions.
You can monitor the overall performance of this WAL apply process. QuestDB
exposes two Prometheus counters for it:

1. `questdb_wal_apply_seq_txn_total`: sum of all committed transaction sequence numbers
2. `questdb_wal_apply_writer_txn_total`: sum of all transaction sequence numbers applied to tables

Both of these counters grow continuously as data is ingested. When they are
equal, all WAL data has been applied to the tables. While data is being actively
ingested, the second counter will lag behind the first one. A steady difference
between them is a sign of a healthy rate of WAL application, with the database
keeping up with demand. However, if the difference continuously rises, it
indicates that either a table has become suspended and WAL data can't be applied
to it, or QuestDB is not able to keep up with the ingestion rate. All of the
data is still safely stored, but a growing portion of it is not yet visible to
queries.

You can create an alert that detects a steadily increasing difference between
these two counters. It won't tell you which table is experiencing issues, but it
is a low-impact way to detect that there is a problem that needs further
diagnosis.
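
A sketch of such an alert as a Prometheus rule is shown below. The windows and
`for` duration are illustrative; tune them to your ingestion pattern so short
bursts don't trigger false alarms:

```yaml title="questdb-alerts.yml"
groups:
  - name: questdb-wal
    rules:
      - alert: QuestDBWalApplyLagGrowing
        # The lag grows when the sequencer counter increases faster than
        # the writer counter over the evaluation window.
        expr: >
          increase(questdb_wal_apply_seq_txn_total[10m])
          > increase(questdb_wal_apply_writer_txn_total[10m])
        for: 15m
        labels:
          severity: warning
        annotations:
          summary: "WAL apply is falling behind ingestion"
```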

## Detect slow queries

QuestDB maintains a table called `_query_trace`, which records each executed
query and the time it took. You can query this table to find slow queries.
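
For example, a query along these lines lists the slowest queries traced in the
last hour. The column names here are illustrative; check the query tracing page
for the exact schema:

```sql
-- Slowest traced queries in the last hour (column names are illustrative).
SELECT ts, query_text, execution_micros
FROM _query_trace
WHERE ts > dateadd('h', -1, now())
ORDER BY execution_micros DESC
LIMIT 20;
```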

Read more about query tracing on the
[Concepts page](/docs/concept/query-tracing/).

## Detect potential causes of performance issues

... mention interesting Prometheus metrics ...
1 change: 1 addition & 0 deletions documentation/sidebars.js
@@ -468,6 +468,7 @@ module.exports = {
]
},
"operations/logging-metrics",
"operations/monitoring-alerting",
"operations/data-retention",
"operations/design-for-performance",
"operations/updating-data",