This is more a potentially useful finding than a bug report.
My Promscale installation recently reached the point where the metrics data retention policies kicked in and the maintenance jobs started to delete chunks. At the same time, the maintenance jobs began taking much longer to complete and struggled to keep up with the expiring chunks. Their performance would often seemingly grind to a halt, and they consumed large amounts of memory. This led to knock-on effects such as high latencies, failing backup processes, and postgres going into recovery mode.
I discovered that the maintenance jobs' performance would start out pretty good, handling one chunk every ~3-4 seconds. After a while, the time spent per chunk increased steadily to several minutes per chunk. At the same time, top showed the memory use of the postgres processes corresponding to the maintenance job PIDs growing steadily, into the GBs.
Killing and restarting the maintenance jobs seemed to help – they would start out fresh again, with high performance and throughput. After about 5 minutes, their performance would start to noticeably degrade.
I found this answer on the DBA Stack Exchange, which provided a hypothesis for what could be happening – the per-connection cache growing as the maintenance jobs touched more objects.
I tested out this hypothesis by writing this custom metric data retention job, executed as a Kubernetes CronJob. The job has a connection pool and a worker pool of the same size, and each database connection is recycled every 3 minutes. (It also tries to back off if performance drops, e.g. while backup processes are running.)
(The compression part of the maintenance job is still scheduled via TimescaleDB's jobs feature, with the retention part commented out.)
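For illustration, here is a minimal sketch in Go of the shape such a custom job could take, assuming pgx v5, where `MaxConnLifetime` implements the 3-minute connection recycling. The catalog query and the per-metric retention call are hypothetical placeholders (substitute the actual Promscale catalog functions), and the linked job handles backoff and scheduling more carefully than this sketch does.

```go
package main

import (
	"context"
	"log"
	"os"
	"sync"
	"time"

	"github.com/jackc/pgx/v5/pgxpool"
)

const workers = 4 // connection pool and worker pool share this size

func main() {
	ctx := context.Background()

	cfg, err := pgxpool.ParseConfig(os.Getenv("DATABASE_URL"))
	if err != nil {
		log.Fatal(err)
	}
	cfg.MaxConns = workers
	// Recycle every connection after 3 minutes so the per-connection
	// caches cannot keep growing as the job touches more objects.
	cfg.MaxConnLifetime = 3 * time.Minute

	pool, err := pgxpool.NewWithConfig(ctx, cfg)
	if err != nil {
		log.Fatal(err)
	}
	defer pool.Close()

	// List the metrics to process. _prom_catalog.metric is assumed to be
	// the Promscale catalog table with one row per metric.
	rows, err := pool.Query(ctx, "SELECT metric_name FROM _prom_catalog.metric")
	if err != nil {
		log.Fatal(err)
	}
	var metrics []string
	for rows.Next() {
		var m string
		if err := rows.Scan(&m); err != nil {
			log.Fatal(err)
		}
		metrics = append(metrics, m)
	}
	rows.Close()
	if err := rows.Err(); err != nil {
		log.Fatal(err)
	}

	work := make(chan string)
	var wg sync.WaitGroup
	for i := 0; i < workers; i++ {
		wg.Add(1)
		go func() {
			defer wg.Done()
			for m := range work {
				start := time.Now()
				// Hypothetical per-metric retention call; substitute the
				// actual Promscale function that drops expired chunks.
				if _, err := pool.Exec(ctx,
					"CALL _prom_catalog.drop_metric_chunks($1)", m); err != nil {
					log.Printf("retention failed for %q: %v", m, err)
					continue
				}
				// Crude backoff: if dropping chunks was slow (e.g. a backup
				// is running), pause before taking the next metric.
				if d := time.Since(start); d > 30*time.Second {
					time.Sleep(d)
				}
			}
		}()
	}
	for _, m := range metrics {
		work <- m
	}
	close(work)
	wg.Wait()
}
```

Because each pooled connection is closed and reopened after at most 3 minutes, a backend's relation and plan caches are discarded before they can balloon, which matches the behavior described above where freshly restarted jobs performed well for roughly the first few minutes.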
This workaround/custom job does indeed seem to sustain consistently high performance, matching what the maintenance jobs achieved at the start of each run. This has solved my problems with a high and growing number of expired metric chunks, and it makes the installation more performant. The maintenance jobs would previously run for many hours; the custom one often completes within a few minutes.
I'm unsure whether this would apply more generally and could speed up metrics retention jobs for others, or whether my installation is somehow misconfigured, causing it to need this workaround.
Before: [screenshot]
After: [screenshot]
(The time range on the after screenshot is shorter to avoid some unrelated problems.)
Edit: promscale 0.16.0, timescaledb 2.8.1 and promscale_extension 0.7.0