What is the Problem Being Solved?
#10841 reminded us that xsnap vats will fail if they attempt to allocate more than 2GB of memory (as seen by xsnap's metering). We need to make sure we get alerted if any network we monitor, such as mainnet, has a vat getting anywhere close to this.
Description of the Design
The slog currently reports the uncompressed snapshot size (`uncompressedSize`) in `heap-snapshot-save` events, but that doesn't tell us the peak memory usage, since the snapshot is taken after GC. It is nonetheless already a good indicator and should be monitored.
Delivery results also contain an `allocate` value, which seems to be the current memory allocation, including free slots and chunks (which the snapshot seems to exclude). As such, the value seems to always be higher than the snapshot size and may be the correct value to monitor, but it is not currently observed as a metric.
A proxy measurement would be the RSS of the worker process, but I have seen this vary while a snapshot is being taken.
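To illustrate the two slog-based measurements, here is a minimal sketch that scans slog lines and keeps the highest value seen per vat. The field layout is an assumption rather than a documented schema: it presumes each slog line is a JSON object with `type` and `vatID`, that `heap-snapshot-save` entries carry `uncompressedSize` at the top level, and that `deliver-result` entries carry the meter usage (including `allocate`) as the third element of a `dr` array. The helper names `sampleFromSlogLine` and `peakByVat` are hypothetical.

```typescript
// Sketch only: the slog field layout used here is an assumption.
type MemorySample = {
  vatID: string;
  source: "snapshot" | "delivery";
  bytes: number;
};

// Extract a memory sample from one slog line, or null if the line is not relevant.
function sampleFromSlogLine(line: string): MemorySample | null {
  let entry: any;
  try {
    entry = JSON.parse(line);
  } catch {
    return null; // ignore malformed lines
  }
  if (entry === null || typeof entry !== "object") return null;
  if (typeof entry.vatID !== "string") return null;
  const vatID: string = entry.vatID;

  if (
    entry.type === "heap-snapshot-save" &&
    typeof entry.uncompressedSize === "number"
  ) {
    return { vatID, source: "snapshot", bytes: entry.uncompressedSize };
  }

  // Assumed shape: dr = [status, problem, meterUsage], with meterUsage.allocate in bytes.
  if (entry.type === "deliver-result" && Array.isArray(entry.dr)) {
    const allocate = entry.dr[2]?.allocate;
    if (typeof allocate === "number") {
      return { vatID, source: "delivery", bytes: allocate };
    }
  }
  return null;
}

// Track the highest value seen per vat and per measurement source.
function peakByVat(lines: string[]): Map<string, number> {
  const peaks = new Map<string, number>();
  for (const line of lines) {
    const sample = sampleFromSlogLine(line);
    if (!sample) continue;
    const key = `${sample.vatID}:${sample.source}`;
    peaks.set(key, Math.max(peaks.get(key) ?? 0, sample.bytes));
  }
  return peaks;
}
```

Because this only reads existing slog output, a collector along these lines could run outside the chain software. Keeping the snapshot and delivery series separate makes it possible to compare them before settling on which one to alert on.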
Regardless of how we monitor this, we should configure an alert at 1 GB. An alert at 500 MB would also be useful, since that is the threshold at which state sync stops working.
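The thresholds above could be encoded roughly as follows. This is a sketch under stated assumptions: the issue does not say whether MB and GB are decimal or binary units, so binary units are assumed here, and the names `classifyVatMemory` and `headroomFraction` are hypothetical.

```typescript
// Assumption: binary units (1 MB = 2^20 bytes); the issue does not specify.
const MiB = 1024 * 1024;
const STATE_SYNC_WARN_BYTES = 500 * MiB; // state sync reportedly stops working here
const ALERT_BYTES = 1024 * MiB; // proposed alert level (1 GB)
const HARD_LIMIT_BYTES = 2048 * MiB; // xsnap allocation limit (2 GB)

type Severity = "ok" | "warning" | "critical";

// Map a per-vat memory measurement (in bytes) to an alert severity.
function classifyVatMemory(bytes: number): Severity {
  if (bytes >= ALERT_BYTES) return "critical";
  if (bytes >= STATE_SYNC_WARN_BYTES) return "warning";
  return "ok";
}

// Fraction of the hard limit still available, clamped to zero once exceeded.
function headroomFraction(bytes: number): number {
  return Math.max(0, 1 - bytes / HARD_LIMIT_BYTES);
}
```

For example, a vat measured at 600 MiB would be classified as `"warning"` under these assumptions, and one at 1 GiB as `"critical"` while still having half of the hard limit as headroom.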
Security Considerations
None
Scaling Considerations
Monitoring this should not introduce undue processing overhead.
Test Plan
TBD
Upgrade Considerations
We should be able to start monitoring this without chain software changes.