Only keep track of recently used stacks in memory. #2591

vicsn · 2025-01-10T16:33:37Z

Motivation

RSS growth is correlated with deployments. By lazily loading deployments and keeping only recently used stacks in memory, unbounded memory growth should be eliminated.

The cache will take up at most MAX_PROGRAM_DEPTH x MAX_IMPORTS x 100 kB x 10 ~= 4 GB of data.
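As a quick sanity check of that bound, here is the arithmetic spelled out. The description does not state the constant values; this sketch assumes MAX_PROGRAM_DEPTH and MAX_IMPORTS are both 64, which is what makes the stated ~4 GB come out. The factor of 10 is taken from the description as-is, and `cache_upper_bound_bytes` is a hypothetical helper name.

```rust
/// Assumed value (not stated in the PR description).
const MAX_PROGRAM_DEPTH: u64 = 64;
/// Assumed value (not stated in the PR description).
const MAX_IMPORTS: u64 = 64;
/// Rough per-stack size from the description: 100 kB.
const APPROX_STACK_BYTES: u64 = 100_000;
/// The extra factor of 10 from the description, kept as-is.
const FACTOR: u64 = 10;

/// Upper bound on the cache size in bytes under the assumptions above.
fn cache_upper_bound_bytes() -> u64 {
    MAX_PROGRAM_DEPTH * MAX_IMPORTS * APPROX_STACK_BYTES * FACTOR
}
```

Under these assumptions the bound is 4,096,000,000 bytes, i.e. roughly 4 GB.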

Note that programs, represented as Stacks, have imports, so the cache is essentially a DAG with multiple roots. To avoid memory leaks, we only evict root stacks from the cache. The call graph in the diagram below shows where the cache is (temporarily) locked and updated. The "long" loop has some nasty edge cases; it would be better to pass all imports directly into load_deployment and Stack::new, but that would be a much bigger refactor.
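To illustrate the "only evict root stacks" rule, here is a minimal std-only sketch. It is not the code in this PR: `RootEvictingCache` and its methods are hypothetical names, stacks are reduced to their import lists, and recency is tracked with a plain `VecDeque` instead of an LRU crate.

```rust
use std::collections::{HashMap, VecDeque};

/// Hypothetical cache that never evicts a stack still imported by a cached stack.
struct RootEvictingCache {
    capacity: usize,
    /// Program ID -> IDs of the programs it imports.
    imports: HashMap<String, Vec<String>>,
    /// Front = least recently used, back = most recently used.
    recency: VecDeque<String>,
}

impl RootEvictingCache {
    fn new(capacity: usize) -> Self {
        Self { capacity, imports: HashMap::new(), recency: VecDeque::new() }
    }

    fn contains(&self, id: &str) -> bool {
        self.imports.contains_key(id)
    }

    /// A stack is a root when no cached stack imports it.
    fn is_root(&self, id: &str) -> bool {
        !self.imports.values().any(|deps| deps.iter().any(|dep| dep == id))
    }

    /// Inserts (or refreshes) a stack, then evicts least recently used roots while over capacity.
    fn insert(&mut self, id: &str, imports: &[&str]) {
        self.recency.retain(|cached| cached != id);
        self.recency.push_back(id.to_string());
        self.imports.insert(id.to_string(), imports.iter().map(|dep| dep.to_string()).collect());

        while self.imports.len() > self.capacity {
            // Oldest root that is not the stack we just inserted.
            let victim = self
                .recency
                .iter()
                .find(|cached| cached.as_str() != id && self.is_root(cached.as_str()))
                .cloned();
            match victim {
                Some(victim) => {
                    self.imports.remove(&victim);
                    self.recency.retain(|cached| cached != &victim);
                }
                // Nothing evictable: stay over capacity rather than break the DAG.
                None => break,
            }
        }
    }
}
```

In this sketch an imported stack only becomes evictable once every stack importing it has been evicted, so a cached root never ends up with a missing import.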

Test Plan

Unit tests pass. Requires "[Refactor] simplify and unify the storage setup for tests" (#2590) to fix the test_real_example_cache_evict test.
Local network passes, using tx-cannon to deploy and fetch multiple sets of maximally nested deployments.
Deployed network passes, using tx-cannon to deploy and fetch multiple sets of maximally nested deployments.
Consider fixing how test storage works, which currently leaks state across tests.

Related PRs

#2519
#2553
#2578

This should - as far as we know - eliminate unbounded memory growth. To avoid memory leaks, we only evict "root" stacks from the cache.

Co-authored-by: ljedrz <[email protected]> Signed-off-by: vicsn <[email protected]>

ljedrz · 2025-01-21T10:26:05Z

synthesizer/process/src/lib.rs

+        let credits_id = self
+            .credits
+            .as_ref()
+            .map_or(ProgramID::<N>::from_str("credits.aleo").unwrap(), |stack| *stack.program_id());


Suggested change

-            .map_or(ProgramID::<N>::from_str("credits.aleo").unwrap(), |stack| *stack.program_id());
+            .map_or_else(|| ProgramID::<N>::from_str("credits.aleo").unwrap(), |stack| *stack.program_id());

(to avoid eager evaluation)
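For context, a small std-only sketch of the difference: `map_or` evaluates its default argument before the call, even when the `Option` is `Some`, while `map_or_else` only runs the closure when the `Option` is `None`. `expensive_default` and `count_default_calls` are hypothetical names; the former stands in for the `ProgramID::from_str(...).unwrap()` call.

```rust
use std::cell::Cell;

/// Hypothetical stand-in for an expensive default, e.g. parsing a program ID.
fn expensive_default(calls: &Cell<u32>) -> String {
    calls.set(calls.get() + 1);
    "credits.aleo".to_string()
}

/// Returns how often the default was computed by (`map_or`, `map_or_else`).
fn count_default_calls(cached: Option<&str>) -> (u32, u32) {
    // `map_or` takes a value, so the default is computed unconditionally.
    let eager = Cell::new(0);
    let _ = cached.map_or(expensive_default(&eager), |id| id.to_string());

    // `map_or_else` takes a closure, which only runs for `None`.
    let lazy = Cell::new(0);
    let _ = cached.map_or_else(|| expensive_default(&lazy), |id| id.to_string());

    (eager.get(), lazy.get())
}
```

With `Some(..)` the eager variant still computes the default once, while the lazy variant never does.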

ljedrz · 2025-01-21T10:30:47Z

synthesizer/process/src/lib.rs

+        let programs_to_add = programs_to_add
+            .into_iter()
+            .chain(std::iter::once((*stack.program_id(), stack))) // add the root stack.
+            .unique_by(|(id, _)|*id) // don't add duplicates.


Suggested change

-            .unique_by(|(id, _)|*id) // don't add duplicates.
+            .unique_by(|(id, _)| *id) // don't add duplicates.

tiny nit, rustfmt might have missed it

ljedrz · 2025-01-21T10:36:37Z

synthesizer/process/src/lib.rs

+        let mut process = Self {
+            universal_srs: Arc::new(UniversalSRS::load()?),
+            credits: None,
+            stacks: Arc::new(Mutex::new(LruCache::new(NonZeroUsize::new(N::MAX_STACKS).unwrap()))),
+            store: None,
+        };


Suggested change

-        let mut process = Self {
-            universal_srs: Arc::new(UniversalSRS::load()?),
-            credits: None,
-            stacks: Arc::new(Mutex::new(LruCache::new(NonZeroUsize::new(N::MAX_STACKS).unwrap()))),
-            store: None,
-        };
+        let mut process = Self::load_no_storage()?;

ljedrz · 2025-01-21T10:37:29Z

synthesizer/process/src/lib.rs

+        let mut process = Self {
+            universal_srs: Arc::new(UniversalSRS::load()?),
+            credits: None,
+            stacks: Arc::new(Mutex::new(LruCache::new(NonZeroUsize::new(N::MAX_STACKS).unwrap()))),
+            store: Some(store),
+        };


Suggested change

-        let mut process = Self {
-            universal_srs: Arc::new(UniversalSRS::load()?),
-            credits: None,
-            stacks: Arc::new(Mutex::new(LruCache::new(NonZeroUsize::new(N::MAX_STACKS).unwrap()))),
-            store: Some(store),
-        };
+        let mut process = Self::load_no_storage()?;
+        process.store = Some(store);

ljedrz

2nd review pass complete, left a few comments; also, it seems like one of the new tests is currently failing.

kpandl · 2025-02-13T14:22:42Z

Tested it on a 5-validator network, with 4 instances doing nested deployments, 4 instances doing nested executions, and a program probe instance running every 500 ms (~100 program deployments in total).
Validators reliably halted after ~70 deployments.

For testing, added the locktick features to snarkVM (commit c12cbf66d334159108d75f85d910d073420869b8) and snarkOS (commit 246996bb2cdbc0dc75d4029b21aad1d32db6af2d).
This revealed that locks are held for a long time (>2 s) for deeply nested programs.

Example logs:

2025-02-13T09:54:21.240609Z TRACE snarkos: [locktick] checking for active lock guards
2025-02-13T09:54:21.240971Z TRACE snarkos: /Users/kp/.cargo/git/checkouts/snarkvm-438da7bfff6ff07c/c12cbf6/ledger/src/advance.rs@93:33 (Write): 304; 1 active; avg d: 2.09742095s; avg w: 58ns
2025-02-13T09:54:21.240999Z TRACE snarkos: /Users/kp/.cargo/git/checkouts/snarkvm-438da7bfff6ff07c/c12cbf6/synthesizer/src/vm/finalize.rs@591:27 (Write): 305; 1 active; avg d: 2.065852041s; avg w: 79ns
2025-02-13T09:54:21.241006Z TRACE snarkos: /Users/kp/.cargo/git/checkouts/snarkvm-438da7bfff6ff07c/c12cbf6/synthesizer/src/vm/finalize.rs@554:28 (Lock): 305; 1 active; avg d: 2.06651786s; avg w: 99ns
2025-02-13T09:54:21.241015Z TRACE snarkos: /Users/kp/.cargo/git/checkouts/snarkvm-438da7bfff6ff07c/c12cbf6/synthesizer/src/vm/mod.rs@321:27 (Lock): 305; 1 active; avg d: 2.097369753s; avg w: 97ns
2025-02-13T09:54:21.241165Z TRACE snarkos: /Users/kp/dev/stress-observability/test_suites/single-region-tests/playbooks/snarkos-shallow/246996bb2cdbc0dc75d4029b21aad1d32db6af2d/node/bft/src/bft.rs@459:38 (Lock): 2590; 1 active; avg d: 6.866325ms; avg w: 201ns

Some fix ideas:

use try_lock
find a minimal reproduction case
print when we call contains_program_in_cache and get_stack so we understand where the "pressure" is
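As a rough illustration of the try_lock idea, here is a sketch using `std::sync::Mutex`. The Mutex type used in this PR may differ, and `try_read_len` is a hypothetical helper, so treat this as the shape of the approach rather than a drop-in fix.

```rust
use std::sync::{Mutex, TryLockError};

/// Hypothetical helper: read from the cache without blocking if the lock is busy.
fn try_read_len(cache: &Mutex<Vec<String>>) -> Option<usize> {
    match cache.try_lock() {
        // Lock acquired immediately; the guard is dropped at the end of this arm.
        Ok(guard) => Some(guard.len()),
        // Someone else holds the lock: report "busy" instead of waiting.
        Err(TryLockError::WouldBlock) => None,
        // A previous holder panicked; the data is still readable.
        Err(TryLockError::Poisoned(poisoned)) => Some(poisoned.into_inner().len()),
    }
}
```

A caller that gets `None` can fall back to loading from storage or retry later, instead of queueing behind a lock that is held for seconds.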

vicsn changed the base branch from staging to mainnet January 10, 2025 16:34

vicsn changed the base branch from mainnet to staging January 10, 2025 16:34

Only keep track of recently used stacks in memory.

44fc174


vicsn force-pushed the stack_cache_reloaded branch from 6c990e8 to 44fc174 Compare January 10, 2025 17:39

vicsn added 3 commits January 10, 2025 19:51

Revert MAX_TRANSMISSIONS_PER_BATCH location

9486378

Fix circleci resource

9e56e07

Lower MAX_STACKS when testing

4a93690

vicsn mentioned this pull request Jan 11, 2025

Only keep track of recently used stacks in memory + unified storage #2592

Draft

4 tasks

vicsn added 4 commits January 13, 2025 14:04

Resolve bugs

135e279

Fix tests: don't superfluously cache credits.aleo

e76606d

Update circleci config

678fa94

Update MAX_STACKS to fix test_long_import_chain

ce57e74

ljedrz reviewed Jan 13, 2025

console/network/src/lib.rs (outdated)

ljedrz reviewed Jan 13, 2025

ledger/narwhal/batch-header/src/lib.rs (outdated)