Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Only keep track of recently used stacks in memory. #2591

Open
wants to merge 33 commits into
base: staging
Choose a base branch
from

Conversation

vicsn
Copy link
Collaborator

@vicsn vicsn commented Jan 10, 2025

Motivation

RSS growth is correlated with deployments. By lazy loading deployments, hopefully unbounded memory growth is eliminated.

The cache will take up at most MAX_PROGRAM_DEPTH x MAX_IMPORTS x 100kb x 10 ~= 4GB of data.

Note that programs, represented as Stacks, have imports. This means the cache is essentially a DAG with multiple roots. To avoid memory leaks, we only evict root stacks from the cache. In the following diagram, you can see in the call graph below where the cache is (temporarily) locked and updated. The "long" loop has some nasty edge cases, it would be better to pass all imports directly into load_deployment and Stack::new, but that would be a much bigger refactor.

Screenshot 2025-01-10 at 16 30 57

Test Plan

  • Unit tests pass. Requires [Refactor] simplify and unify the storage setup for tests #2590 to fix the test_real_example_cache_evict test.
  • Local network passes, using tx-cannon to deploy and fetch multiple sets of maximally nested deployments.
  • Deployed network passes, using tx-cannon to deploy and fetch multiple sets of maximally nested deployments.
  • Consider fixing how test storage works, which currently leaks states across tests.

Related PRs

#2519
#2553
#2578

@vicsn vicsn changed the base branch from staging to mainnet January 10, 2025 16:34
@vicsn vicsn changed the base branch from mainnet to staging January 10, 2025 16:34
This should - as far as we know - eliminate unbounded memory growth.
To avoid memory leaks, we only evict "root" stacks from the cache.
@vicsn vicsn force-pushed the stack_cache_reloaded branch from 6c990e8 to 44fc174 Compare January 10, 2025 17:39
console/network/src/lib.rs Outdated Show resolved Hide resolved
synthesizer/process/src/lib.rs Outdated Show resolved Hide resolved
let credits_id = self
.credits
.as_ref()
.map_or(ProgramID::<N>::from_str("credits.aleo").unwrap(), |stack| *stack.program_id());
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
.map_or(ProgramID::<N>::from_str("credits.aleo").unwrap(), |stack| *stack.program_id());
.map_or_else(|| ProgramID::<N>::from_str("credits.aleo").unwrap(), |stack| *stack.program_id());

(to avoid eager evaluation)

let programs_to_add = programs_to_add
.into_iter()
.chain(std::iter::once((*stack.program_id(), stack))) // add the root stack.
.unique_by(|(id, _)|*id) // don't add duplicates.
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
.unique_by(|(id, _)|*id) // don't add duplicates.
.unique_by(|(id, _)| *id) // don't add duplicates.

tiny nit, rustfmt might have missed it

Comment on lines +314 to +319
let mut process = Self {
universal_srs: Arc::new(UniversalSRS::load()?),
credits: None,
stacks: Arc::new(Mutex::new(LruCache::new(NonZeroUsize::new(N::MAX_STACKS).unwrap()))),
store: None,
};
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
let mut process = Self {
universal_srs: Arc::new(UniversalSRS::load()?),
credits: None,
stacks: Arc::new(Mutex::new(LruCache::new(NonZeroUsize::new(N::MAX_STACKS).unwrap()))),
store: None,
};
let mut process = Self::load_no_storage()?;

Comment on lines +256 to +261
let mut process = Self {
universal_srs: Arc::new(UniversalSRS::load()?),
credits: None,
stacks: Arc::new(Mutex::new(LruCache::new(NonZeroUsize::new(N::MAX_STACKS).unwrap()))),
store: Some(store),
};
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
let mut process = Self {
universal_srs: Arc::new(UniversalSRS::load()?),
credits: None,
stacks: Arc::new(Mutex::new(LruCache::new(NonZeroUsize::new(N::MAX_STACKS).unwrap()))),
store: Some(store),
};
let mut process = Self::load_no_storage()?;
process.store = Some(store);

Copy link
Collaborator

@ljedrz ljedrz left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

2nd review pass complete, left a few comments; also, it seems like one of the new tests is currently failing.

@kpandl
Copy link
Collaborator

kpandl commented Feb 13, 2025

Tested it on a 5 validator network, with 4 instance doing nested deployments, 4 instances doing nested executions, and a program probe instance running every 500 ms (total ~100 program deployments).
Validators reliably halted after ~70 deployments.

For testing, added the locktick features to snarkVM (commit c12cbf66d334159108d75f85d910d073420869b8 here) and snarkOS (commit 246996bb2cdbc0dc75d4029b21aad1d32db6af2d here).
This revealed that locks are held for a long time (>2s) for deep nested programs.

Example logs:

2025-02-13T09:54:21.240609Z TRACE snarkos: [locktick] checking for active lock guards
2025-02-13T09:54:21.240971Z TRACE snarkos: /Users/kp/.cargo/git/checkouts/snarkvm-438da7bfff6ff07c/c12cbf6/ledger/src/advance.rs@93:33 (Write): 304; 1 active; avg d: 2.09742095s; avg w: 58ns
2025-02-13T09:54:21.240999Z TRACE snarkos: /Users/kp/.cargo/git/checkouts/snarkvm-438da7bfff6ff07c/c12cbf6/synthesizer/src/vm/finalize.rs@591:27 (Write): 305; 1 active; avg d: 2.065852041s; avg w: 79ns
2025-02-13T09:54:21.241006Z TRACE snarkos: /Users/kp/.cargo/git/checkouts/snarkvm-438da7bfff6ff07c/c12cbf6/synthesizer/src/vm/finalize.rs@554:28 (Lock): 305; 1 active; avg d: 2.06651786s; avg w: 99ns
2025-02-13T09:54:21.241015Z TRACE snarkos: /Users/kp/.cargo/git/checkouts/snarkvm-438da7bfff6ff07c/c12cbf6/synthesizer/src/vm/mod.rs@321:27 (Lock): 305; 1 active; avg d: 2.097369753s; avg w: 97ns
2025-02-13T09:54:21.241165Z TRACE snarkos: /Users/kp/dev/stress-observability/test_suites/single-region-tests/playbooks/snarkos-shallow/246996bb2cdbc0dc75d4029b21aad1d32db6af2d/node/bft/src/bft.rs@459:38 (Lock): 2590; 1 active; avg d: 6.866325ms; avg w: 201ns

Some fix ideas:

  • use try_lock
  • find a minimal reproduction case
  • print when we call contains_program_in_cache and get_stack so we understand where the "pressure" is

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants