Fix the deadlock during statements gossiping #9868
Conversation
```rust
let keys: Vec<_> = {
	let index = self.index.read();
	index.entries.keys().cloned().collect()
};

let mut result = Vec::with_capacity(keys.len());
for h in keys {
	let encoded = self.db.get(col::STATEMENTS, &h).map_err(|e| Error::Db(e.to_string()))?;
```
Note that by the time we read from the DB, the statement might already have been removed.
So this operation doesn't return a view of the statement store at one point in time. Instead it returns most statements: one statement could be removed and another added in the meantime.
That can still be acceptable, and more efficient than before.
But consider, for instance, a user who always overrides a single statement in one channel. While the store always holds one statement for that user, this function might sometimes return none for them. Again, that can be acceptable.
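To make the trade-off described above concrete, here is a minimal, self-contained sketch of the snapshot-then-read pattern. All types (`Store`, `Index`, the `Db` trait) are stand-ins for illustration, not the statement store's real API:

```rust
use std::collections::HashMap;
use parking_lot::RwLock;

type Hash = [u8; 32];

// Stand-in index and database, assumed for illustration only.
struct Index {
	entries: HashMap<Hash, ()>,
}

trait Db {
	fn get(&self, key: &Hash) -> Result<Option<Vec<u8>>, String>;
}

struct Store<D: Db> {
	index: RwLock<Index>,
	db: D,
}

impl<D: Db> Store<D> {
	/// Returns the encoded statements currently reachable through the index.
	fn statements(&self) -> Result<Vec<(Hash, Vec<u8>)>, String> {
		// Snapshot the keys under a short-lived read lock, then drop the lock
		// before touching the DB.
		let keys: Vec<Hash> = {
			let index = self.index.read();
			index.entries.keys().cloned().collect()
		};

		let mut result = Vec::with_capacity(keys.len());
		for h in keys {
			match self.db.get(&h)? {
				Some(encoded) => result.push((h, encoded)),
				// Removed between the snapshot and this read: skip it rather
				// than treat it as an error; the result is not a point-in-time view.
				None => continue,
			}
		}
		Ok(result)
	}
}
```

The key point is that the read lock is held only while copying the keys, and a `None` from the DB is treated as "removed in the meantime" rather than as an error.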
> But it can be good and more efficient than before as well.

It could be good, but benchmarking revealed a regression. I suppose that without the lock, concurrent access to the DB slows things down.
```rust
Block::Hash: From<BlockHash>,
Client: ProvideRuntimeApi<Block>
	+ HeaderBackend<Block>
	+ sc_client_api::ExecutorProvider<Block>
```
Not related to the subject of this PR, but the `ExecutorProvider` bound should be removed: we don't need this trait, and it only complicates the test setup.
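For illustration only, a hedged sketch of what the trimmed bounds could look like once `ExecutorProvider` is dropped (the function name and exact bound list are assumptions, not the actual code):

```rust
// Hypothetical sketch: same client bounds, minus `sc_client_api::ExecutorProvider`,
// since the statement store never calls into the executor.
fn build_statement_store<Block, BlockHash, Client>(client: std::sync::Arc<Client>)
where
	Block: sp_runtime::traits::Block,
	Block::Hash: From<BlockHash>,
	Client: sp_api::ProvideRuntimeApi<Block>
		+ sp_blockchain::HeaderBackend<Block>
		+ Send
		+ Sync
		+ 'static,
{
	// Construction elided; only the bounds matter here.
	let _ = client;
}
```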
```rust
}

/// Perform periodic store maintenance
pub fn maintain(&self) {
```
Not related to the current deadlock, but it's better to remove the unnecessary reads. It also makes the log more precise, since the index can change between maintenance and logging. Keeping the write lock during the DB commit is not necessary.
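A minimal sketch of the lock scoping described in this comment, with all types stood in for illustration (the real index, DB commit and logging differ): the index is updated under the write lock, the counts to log are captured at that point, and the lock is released before the DB commit.

```rust
use std::collections::HashMap;
use parking_lot::RwLock;

type Hash = [u8; 32];

// Stand-in index: tracks entries plus the hashes that have expired.
struct Index {
	expired: Vec<Hash>,
	entries: HashMap<Hash, ()>,
}

impl Index {
	// Removes expired entries and returns them, so the caller can delete them
	// from the DB and log precise counts without re-reading the index.
	fn purge_expired(&mut self) -> Vec<Hash> {
		for h in &self.expired {
			self.entries.remove(h);
		}
		std::mem::take(&mut self.expired)
	}
}

trait Db {
	fn delete(&self, keys: &[Hash]) -> Result<(), String>;
}

struct Store<D: Db> {
	index: RwLock<Index>,
	db: D,
}

impl<D: Db> Store<D> {
	/// Periodic maintenance: update the index under the write lock, capture the
	/// counts to log, then release the lock before the DB commit.
	fn maintain(&self) {
		let (purged, remaining) = {
			let mut index = self.index.write();
			let purged = index.purge_expired();
			(purged, index.entries.len())
		}; // write lock released here, before the DB commit

		if let Err(e) = self.db.delete(&purged) {
			eprintln!("statement-store maintenance commit failed: {e}");
		}
		// Counts were captured at purge time, so the log stays accurate even if
		// the index changes between maintenance and logging.
		println!("Statement store maintenance: purged {}, remaining {}", purged.len(), remaining);
	}
}
```

Capturing the counts from the purge result (rather than re-reading the index afterwards) is what keeps the log accurate while still letting the lock go before the commit.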
/cmd prdoc --audience node_dev --bump patch |
All GitHub workflows were cancelled due to the failure of one of the required jobs.
Backport #9868 into `stable2509` from AndreiEres. See the [documentation](https://github.com/paritytech/polkadot-sdk/blob/master/docs/BACKPORT.md) on how to use this bot.

Co-authored-by: Andrei Eres <[email protected]>
Co-authored-by: cmd[bot] <41898282+github-actions[bot]@users.noreply.github.com>

Backport #9868 into `stable2506` from AndreiEres. See the [documentation](https://github.com/paritytech/polkadot-sdk/blob/master/docs/BACKPORT.md) on how to use this bot.

Co-authored-by: Andrei Eres <[email protected]>
Co-authored-by: cmd[bot] <41898282+github-actions[bot]@users.noreply.github.com>
# Description

Adds benchmarks to measure the performance of the statement-store:

- Message Exchange Scenario: interaction with one or many nodes.
- Memory Stress Test Scenario.

## Results

**Key improvements made to improve the performance:**

- [Fixed a deadlock](#9868)
- [Increased statements limits](#9894)
- [Improved gossiping](#9912)

**Hardware**

All benchmarks were run on a MacBook Pro M2.

### 1. Message Exchange Scenario

**Test Configuration**

- Total participants: 49_998
- Group size: 6 participants per group
- Total groups: 8_333
- Statement payload size: 512 KB
- Propagation delay: 2 seconds (empirically determined)
- Parachain network tested with: 2, 3, 6, 12 nodes

**Network Topologies**

We tested two distribution patterns:

1. **To one RPC:** All participants send statements to a single RPC node
2. **To all RPC:** Participants distribute statements across all nodes (slower due to gossiping overhead)

**Participant Flow**

- Sends a statement with their key for an exchange session (1 sent)
- Waits 2 seconds for statement propagation
- Receives session keys from other members in the group (5 received)
- Sends statements containing a 512KB message to each member in the group (5 sent)
- Waits 2 seconds for statement propagation
- Receives messages from other members (5 received)
- Total: 6 sent, 10 received.

(A rough code sketch of this flow is given at the end of this description.)

**Results**

| Collators | Avg time | Max time | Memory |
| -------------- | -------- | -------- | ------ |
| **To one RPC** | | | |
| 2 | 35s | 35s | 2.1GB |
| 6 | 48s | 50s | 1.7GB |
| **To all RPC** | | | |
| 3 | 41s | 51s | 1.9GB |
| 6 | 61s | 71s | 1.4GB |
| 12 | 94s | 119s | 1.9GB |

**Observations**

- Sending to one RPC node is faster but creates a bottleneck
- Distributing across all nodes takes longer due to gossiping overhead
- More collators increase gossiping load, resulting in slower completion times
- Memory usage per node remains around 2GB.

### 2. Memory Stress Test Scenario

**Test Configuration**

We prepared one more scenario to check how much memory nodes use with a full store after we increased the limits. To maximize the memory used by the index, we submitted statements with all-unique topics; other fields (e.g., proofs) were not used.

- Total tasks: 100,000 concurrent
- Statement size: 1 KB per statement
- Network size: 6 collators

**Test Flow**

1. Spawn 100,000 concurrent tasks
2. Each task sends statements with 1KB payload to one node until the store is full
3. Statements are gossiped across the network to the other 5 collators
4. Test completes when all collator stores are full

**Results**

During the tests, each node used up to 4.5GB of memory.
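As referenced above, here is a rough sketch of the message-exchange participant flow. Every name (`RpcClient`, `submit_statement`, `fetch_statements`, `topic_for`) is a hypothetical stand-in for the benchmark's actual RPC client, not the statement-store API:

```rust
use std::time::Duration;

// Hypothetical RPC client, assumed for illustration only.
struct RpcClient;
impl RpcClient {
	async fn submit_statement(&self, _topic: [u8; 32], _payload: Vec<u8>) { /* ... */ }
	async fn fetch_statements(&self, _topic: [u8; 32]) -> Vec<Vec<u8>> { vec![] }
}

const PROPAGATION_DELAY: Duration = Duration::from_secs(2);
const MESSAGE_SIZE: usize = 512 * 1024; // 512 KB payload per message

// One participant in a group of six: 6 statements sent, 10 received in total.
async fn participant(rpc: &RpcClient, session_topic: [u8; 32], my_key: Vec<u8>) {
	// 1. Announce our key for the exchange session (1 sent).
	rpc.submit_statement(session_topic, my_key.clone()).await;

	// 2. Wait for gossip to propagate, then collect the other members' keys (5 received).
	tokio::time::sleep(PROPAGATION_DELAY).await;
	let peer_keys = rpc.fetch_statements(session_topic).await;

	// 3. Send a 512 KB message to each peer in the group (5 sent).
	for peer_topic in peer_keys.iter().map(|k| topic_for(k)) {
		rpc.submit_statement(peer_topic, vec![0u8; MESSAGE_SIZE]).await;
	}

	// 4. Wait again, then read the messages addressed to us (5 received).
	tokio::time::sleep(PROPAGATION_DELAY).await;
	let _messages = rpc.fetch_statements(topic_for(&my_key)).await;
}

// Hypothetical helper mapping a key to the topic its messages are posted under.
fn topic_for(_key: &[u8]) -> [u8; 32] {
	[0u8; 32]
}
```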
Description
During statement store benchmarking we observed deadlock-like behavior, which we traced to statement propagation: statements were gossiped every second, locking the index, which possibly caused the deadlock. After the fix, the observed behavior no longer occurs.
Even though it is possible to let the DB and the index go out of sync for read operations and release the locks earlier, which should be harmless, it leads to regressions. I suspect this is because of concurrent access across many `db.get()` calls. Checked with the benchmarks in #9884.
Integration
This PR should not affect downstream projects.