Monitor and alert xsnap worker memory #10842

mhofman · 2025-01-14T04:19:51Z

What is the Problem Being Solved?

#10841 reminded us that xsnap vats will fail if they attempt to allocate more than 2GB of memory (as seen by xsnap's metering). We need to make sure we get alerted if a vat on any network we monitor, such as mainnet, gets anywhere close to this limit.

Description of the Design

The slog currently reports the uncompressed snapshot size (uncompressedSize) in heap-snapshot-save events, but that doesn't tell us the peak memory usage since the snapshot is taken after GC. It is nonetheless already a good indicator and should be monitored.
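As a rough illustration of monitoring the snapshot size, the sketch below scans slog lines and keeps the largest uncompressedSize per vat. It assumes the slog is newline-delimited JSON whose entries carry a `type` field and a `vatID` field; only the heap-snapshot-save event name and the uncompressedSize field come from the description above. The function name is hypothetical.

```python
import json
from typing import Dict, Iterable


def max_snapshot_size_by_vat(slog_lines: Iterable[str]) -> Dict[str, int]:
    """Return the largest uncompressedSize seen per vat in heap-snapshot-save events."""
    largest: Dict[str, int] = {}
    for line in slog_lines:
        line = line.strip()
        if not line:
            continue
        try:
            entry = json.loads(line)
        except json.JSONDecodeError:
            continue  # skip truncated or non-JSON lines
        if not isinstance(entry, dict) or entry.get("type") != "heap-snapshot-save":
            continue
        size = entry.get("uncompressedSize")
        if not isinstance(size, int):
            continue
        # Assumption: the event identifies its vat in a "vatID" field.
        vat_id = str(entry.get("vatID", "unknown"))
        largest[vat_id] = max(largest.get(vat_id, 0), size)
    return largest
```

Since the snapshot is taken after GC, the values this returns are a lower bound on what the worker actually used, not a peak.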

Delivery results also contain an allocate value, which seems to be the current memory allocation including free slots and chunks (which the snapshot seems to exclude). As such, the value seems to always be higher than the snapshot size and may be the correct value to monitor, but it is not currently observed as a metric.
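A minimal sketch of turning allocate into a per-vat metric follows. The entry layout is an assumption, not something confirmed above: it supposes delivery results appear in the slog as `deliver-result` entries whose `dr` field is a list with the meter usage object (holding `allocate`) in third position, and that entries carry a `vatID`. The function names are hypothetical; adjust the extraction to the actual slog shape.

```python
import json
from typing import Dict, Iterable, Optional


def allocate_from_entry(entry: dict) -> Optional[int]:
    """Extract the allocate value from one slog entry, or None if absent."""
    # Assumption: delivery results are logged with type "deliver-result".
    if entry.get("type") != "deliver-result":
        return None
    # Assumption: dr is a list whose third element is the meter usage object.
    dr = entry.get("dr")
    if not isinstance(dr, list) or len(dr) < 3 or not isinstance(dr[2], dict):
        return None
    allocate = dr[2].get("allocate")
    return allocate if isinstance(allocate, int) else None


def peak_allocate_by_vat(slog_lines: Iterable[str]) -> Dict[str, int]:
    """Return the highest allocate value reported per vat."""
    peaks: Dict[str, int] = {}
    for line in slog_lines:
        try:
            entry = json.loads(line)
        except json.JSONDecodeError:
            continue  # skip blank, truncated, or non-JSON lines
        if not isinstance(entry, dict):
            continue
        allocate = allocate_from_entry(entry)
        if allocate is None:
            continue
        vat_id = str(entry.get("vatID", "unknown"))
        peaks[vat_id] = max(peaks.get(vat_id, 0), allocate)
    return peaks
```

Tracking the maximum rather than the latest value keeps a transient spike visible even if a later delivery reports a lower number.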

A proxy measurement would be the RSS of the worker process, but I have seen this vary while a snapshot is being taken.

Regardless of how we monitor this, we should configure an alert when reaching 1GB. It would be good to also have an alert when reaching 500 MB, since that's the threshold at which state sync stops working.
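The two thresholds could be expressed as a simple classification, sketched below. The severity names and the function are hypothetical, and reading "500 MB" and "1GB" as binary units (MiB/GiB) is an assumption; switch the constants to decimal units if that is what is intended.

```python
from typing import Optional

MIB = 1024 * 1024

# Assumption: thresholds are interpreted in binary units.
WARN_BYTES = 500 * MIB       # state sync reportedly stops working around here
CRITICAL_BYTES = 1024 * MIB  # halfway to the 2GB xsnap allocation limit


def alert_level(memory_bytes: int) -> Optional[str]:
    """Map a vat memory measurement to an alert severity, or None if below both thresholds."""
    if memory_bytes >= CRITICAL_BYTES:
        return "critical"
    if memory_bytes >= WARN_BYTES:
        return "warning"
    return None
```

The same function applies whichever measurement ends up being monitored (snapshot size, allocate, or RSS), as long as it is expressed in bytes.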

Security Considerations

None

Scaling Considerations

Monitoring this should not introduce undue processing overhead.

Test Plan

TBD

Upgrade Considerations

We need to be able to start monitoring this without chain software changes.

mhofman added enhancement New feature or request telemetry labels Jan 14, 2025
