This is more a potentially useful finding than a bug report.
My Promscale installation recently reached the point where the metrics data retention policies kicked in and the maintenance jobs started to delete chunks. At the same time, the maintenance jobs began taking much longer to complete and struggled to keep up with the expiring chunks. Their performance would often seemingly grind to a halt, and they consumed large amounts of memory. This led to knock-on effects such as high latencies, failing backup processes, and postgres going into recovery mode.
I discovered that the maintenance jobs' performance would start out pretty good, handling one chunk every ~3-4 seconds. After a while, the time spent per chunk increased steadily to several minutes per chunk. At the same time, top showed the memory use of the postgres processes corresponding to the maintenance job PIDs growing steadily, into the GBs.
Killing and restarting the maintenance jobs seemed to help – they would start out fresh again, with high performance and throughput. After about 5 minutes, their performance would start to noticeably degrade.
I found this answer on the DBA Stack Exchange, which provided a hypothesis for what could be happening – the per-connection cache growing as the maintenance jobs touched more objects.
I tested out this hypothesis by writing this custom metric data retention job, executed as a Kubernetes CronJob. The job has a connection pool and a worker pool of the same size, and each database connection is recycled every 3 minutes. (It also tries to back off if performance drops, e.g. while backup processes are running.)
(The compression part of the maintenance job is still scheduled via TimescaleDB's jobs feature, with the retention part commented out.)
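For illustration, here is a minimal sketch in Go of the shape such a custom job could take, assuming pgx v5, where `MaxConnLifetime` implements the 3-minute connection recycling. The catalog query and the per-metric retention call are hypothetical placeholders (substitute the actual Promscale catalog functions), and the linked job handles backoff and scheduling more carefully than this sketch does.

```go
package main

import (
	"context"
	"log"
	"os"
	"sync"
	"time"

	"github.com/jackc/pgx/v5/pgxpool"
)

const workers = 4 // connection pool and worker pool share this size

func main() {
	ctx := context.Background()

	cfg, err := pgxpool.ParseConfig(os.Getenv("DATABASE_URL"))
	if err != nil {
		log.Fatal(err)
	}
	cfg.MaxConns = workers
	// Recycle every connection after 3 minutes so the per-connection
	// caches cannot keep growing as the job touches more objects.
	cfg.MaxConnLifetime = 3 * time.Minute

	pool, err := pgxpool.NewWithConfig(ctx, cfg)
	if err != nil {
		log.Fatal(err)
	}
	defer pool.Close()

	// List the metrics to process. _prom_catalog.metric is assumed to be
	// the Promscale catalog table with one row per metric.
	rows, err := pool.Query(ctx, "SELECT metric_name FROM _prom_catalog.metric")
	if err != nil {
		log.Fatal(err)
	}
	var metrics []string
	for rows.Next() {
		var m string
		if err := rows.Scan(&m); err != nil {
			log.Fatal(err)
		}
		metrics = append(metrics, m)
	}
	rows.Close()
	if err := rows.Err(); err != nil {
		log.Fatal(err)
	}

	work := make(chan string)
	var wg sync.WaitGroup
	for i := 0; i < workers; i++ {
		wg.Add(1)
		go func() {
			defer wg.Done()
			for m := range work {
				start := time.Now()
				// Hypothetical per-metric retention call; substitute the
				// actual Promscale function that drops expired chunks.
				if _, err := pool.Exec(ctx,
					"CALL _prom_catalog.drop_metric_chunks($1)", m); err != nil {
					log.Printf("retention failed for %q: %v", m, err)
					continue
				}
				// Crude backoff: if dropping chunks was slow (e.g. a backup
				// is running), pause before taking the next metric.
				if d := time.Since(start); d > 30*time.Second {
					time.Sleep(d)
				}
			}
		}()
	}
	for _, m := range metrics {
		work <- m
	}
	close(work)
	wg.Wait()
}
```

Because each pooled connection is closed and reopened after at most 3 minutes, a backend's relation and plan caches are discarded before they can balloon, which matches the behavior described above where freshly restarted jobs performed well for roughly the first few minutes.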
This workaround/custom job does indeed seem to sustain consistently high performance, matching what the maintenance jobs achieved at the start of each run. This has solved my problems with a high and growing number of expired metric chunks, and it makes the installation more performant. The maintenance jobs would previously run for many hours; the custom one often completes within a few minutes.
I'm unsure whether this would apply more generally and could speed up metrics retention jobs for others, or whether my installation is somehow misconfigured, causing it to need this workaround.
Before: [screenshot]
After: [screenshot]
(The time range on the after screenshot is shorter to avoid some unrelated problems.)
Edit: promscale 0.16.0, timescaledb 2.8.1 and promscale_extension 0.7.0