Maintenance jobs unable to compress all chunks #1741

mikberg · 2022-11-04T11:24:24Z

Describe the bug

My Promscale instance is nearly constantly alerting PromscaleMaintenanceJobNotKeepingup, which seems to be because promscale_sql_database_chunks_metrics_uncompressed_count never quite drops to the alert threshold of 10. Instead, it seems to vary between 600 and ~1100, depending on maintenance job settings.

I have tried running call prom_api.execute_maintenance(); manually, repeatedly (in a loop) and tried an aggressive schedule for maintenance jobs, running 4x every 5 minutes. They still seem to hit a "floor" around 600 uncompressed chunks.

Unfortunately, I haven't been able to run the full debugging query from the runbook, as the database goes into recovery mode whenever I try.

To Reproduce

Not sure.

Expected behavior

promscale_sql_database_chunks_metrics_uncompressed_count hitting values < 10 after maintenance jobs are done.

Screenshots

Configuration (as applicable)

Promscale Connector:

startup.dataset.config: |
  metrics:
    compress_data: true  # default
    default_retention_period: 90d  # default
    default_chunk_interval: 2h  # default is 8h; reduced in effort to mitigate PromscaleMaintenanceJobRunningTooLong
  traces:
    default_retention_period: 30d  # default

TimescaleDB:

shared_buffers: 1280MB
effective_cache_size: 3840MB
maintenance_work_mem: 640MB
work_mem: 8738kB
timescaledb.max_background_workers: 8
max_worker_processes: 13
max_parallel_workers_per_gather: 1
max_parallel_workers: 2
wal_buffers: 16MB
min_wal_size: 2GB
max_wal_size: 4GB
checkpoint_timeout: 900
bgwriter_delay: 10ms
bgwriter_lru_maxpages: 100000
default_statistics_target: 500
random_page_cost: 1.1
checkpoint_completion_target: 0.9
max_connections: 75
max_locks_per_transaction: 64
autovacuum_max_workers: 10
autovacuum_naptime: 10
effective_io_concurrency: 256
timescaledb.last_tuned: '2022-10-28T08:48:02Z'
timescaledb.last_tuned_version: '0.14.1'

Version

Distribution/OS:
Promscale: 0.16.0, 0.7.0 (extension)
TimescaleDB: 2.8.1

Additional context

PostgreSQL running via Crunchy postgres-operator, database is allocated 8 GB memory, on average using about 5-6 GB.
Average ingest at around 2000 samples/sec per Grafana dashboard.

ramonguiu · 2022-11-05T05:40:24Z

The number of uncompressed chunks depends on the number of unique metric names. Each metric name uses a hypertable, and at any point in time there shouldn't be more than 2 uncompressed chunks per hypertable: the current one, where incoming data is being written, and the previous one, which is kept open for one hour after the current chunk was created to accept late-arriving data.

How many unique metric names do you have?

mikberg · 2022-11-07T09:50:34Z

> The number of uncompressed chunks depends on the number of unique metric names. Each metric name uses a hypertable and at any point in time there shouldn't be more than 2 chunks uncompressed per hypertable (the current one where current data is being written, the previous one which is kept open for one hour after the current chunk was created for data arriving late).
>
> How many unique metric names do you have?

tsdb=# select count(*) from information_schema.tables where table_schema='prom_metric';
 count
-------
  2617

(or 2015 label values for __name__ in Prometheus, might be some left-overs).

Thanks, this was very informative. Do I understand correctly if I take from this that I shouldn't really expect the uncompressed chunks count to fall much below 2*(number_of_unique_metric_names)? In that case, the default alert value of 10 sounds very low?
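
As a quick check of the arithmetic in this question, the sketch below computes the ceiling implied by the explanation above. It assumes one hypertable per metric name and at most two uncompressed chunks per hypertable; the function name is hypothetical and not part of Promscale.

```python
def max_expected_uncompressed_chunks(hypertable_count: int,
                                     open_chunks_per_hypertable: int = 2) -> int:
    """Upper bound on uncompressed chunks, assuming each hypertable keeps
    at most the current chunk and the previous one uncompressed."""
    return hypertable_count * open_chunks_per_hypertable


# Counts reported in this thread: 2617 tables in prom_metric,
# 2015 __name__ label values in Prometheus.
print(max_expected_uncompressed_chunks(2617))
print(max_expected_uncompressed_chunks(2015))
```

Either count puts the ceiling in the thousands, so a fixed threshold of 10 sits far below it, and the observed 600 to ~1100 uncompressed chunks fall within the bound.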

ramonguiu · 2022-11-13T22:09:22Z

> Thanks, this was very informative. Do I understand correctly if I take from this that I shouldn't really expect the uncompressed chunks count to fall much below 2*(number_of_unique_metric_names)? In that case, the default alert value of 10 sounds very low?

Yes, that's correct. Let me check with the team why the alert is defined like that.

Harkishen-Singh · 2022-11-14T09:25:32Z

I agree, this should be changed to

(
    min_over_time(promscale_sql_database_chunks_metrics_uncompressed_count[1h]) > 2 * promscale_sql_database_metric_count
)

Also, pinging @sumerman in case he knows the reason behind > 10.

sumerman · 2022-11-14T09:31:59Z

> I agree, this should be changed to
>
> (
>     min_over_time(promscale_sql_database_chunks_metrics_uncompressed_count[1h]) > 2 * promscale_sql_database_metric_count
> )
>
> Also, I think we should change min_over_time to avg_over_time. Reason? min_over_time in this case seems too strict, since at any given point in last 1h, if the uncompressed chunks are more than expected, it will alert. Averaging this over 30m should be fine.

> Also, pinging @sumerman in case he knows the reason behind > 10.

Thank you. As I have answered elsewhere, my intention in defining this metric was for it to go down to 0; 10 was a safety margin.
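
To make the difference between the two aggregations discussed above concrete, here is a rough model in plain Python, not PromQL. The function name, the metric count of 100, and the sample window are all hypothetical; the only thing taken from the thread is the proposed threshold of 2 * metric count.

```python
from statistics import mean


def alert_fires(samples, metric_count, aggregate=min):
    """Return True when the aggregated window exceeds 2 * metric_count.

    `samples` stands in for the uncompressed-chunk counts scraped over the
    range window; `aggregate` plays the role of min_over_time (min) or
    avg_over_time (mean).
    """
    threshold = 2 * metric_count
    return aggregate(samples) > threshold


# Hypothetical window: three samples below the threshold of 200, one spike above it.
window = [150, 150, 150, 400]

print(alert_fires(window, metric_count=100, aggregate=min))
print(alert_fires(window, metric_count=100, aggregate=mean))
```

In this model the min-based condition holds only when every sample in the window is above the threshold, while the mean-based one can hold after a single large spike, so it may be worth checking which behaviour is intended before switching aggregations.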

ramonguiu · 2022-12-13T22:36:10Z

@sumerman did we fix this?

sumerman · 2022-12-15T15:47:59Z

I expect #1794 to fix this when it lands

VineethReddy02 added the Bug Something isn't working label Nov 22, 2022

sumerman added a commit that referenced this issue Dec 14, 2022

721fca6: Fixing the query behind chunks_uncompressed metric by making it rely on a function used by the maintenance jobs. It should also fix for #1741

sumerman mentioned this issue Dec 14, 2022

A fix for chunks_uncomopressed metric #1794 (Merged)

sumerman added a commit that referenced this issue Dec 21, 2022

e069098: Fixing the query behind chunks_uncompressed metric by making it rely on a function used by the maintenance jobs. It should also fix for #1741

alejandrodnm pushed a commit that referenced this issue Dec 23, 2022

d0caac2: Fixing the query behind chunks_uncompressed metric by making it rely on a function used by the maintenance jobs. It should also fix for #1741
