Conversation

@AndreiEres
Contributor

@AndreiEres AndreiEres commented Sep 29, 2025

Description

During statement store benchmarking we observed deadlock-like behavior, which we traced to statement propagation. Statements were propagated every second while the index was locked, which likely caused the deadlock. After the fix, the observed behavior no longer occurs.

Even though it is possible to let the DB and the index go out of sync for read operations and release the locks earlier, which should be harmless, doing so leads to regressions. I suspect this is because of concurrent access across many `db.get()` calls. Checked with the benchmarks in #9884
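For illustration, here is a minimal sketch of the lock-scoping pattern the fix relies on: snapshot the index keys under a short-lived read lock, drop the lock, and only then read from the DB. The `Index`, `Db`, and `Store` types below are simplified stand-ins, not the actual statement-store types.

```rust
use std::collections::HashMap;
use std::sync::Arc;
use parking_lot::RwLock;

type Hash = [u8; 32];

// Simplified stand-ins for the real in-memory index and the on-disk DB.
struct Index {
    entries: HashMap<Hash, ()>,
}

struct Db;
impl Db {
    fn get(&self, _key: &Hash) -> Option<Vec<u8>> {
        // The real store reads the encoded statement from disk here.
        None
    }
}

struct Store {
    index: Arc<RwLock<Index>>,
    db: Db,
}

impl Store {
    fn statements(&self) -> Vec<Vec<u8>> {
        // Snapshot the keys under a short-lived read lock...
        let keys: Vec<Hash> = {
            let index = self.index.read();
            index.entries.keys().cloned().collect()
        }; // ...the read guard is dropped here, before any DB access.

        // Read from the DB without holding the index lock, so propagation
        // and maintenance are never blocked while we do slow I/O.
        keys.iter().filter_map(|h| self.db.get(h)).collect()
    }
}
```

The important part is that the read guard is confined to the inner block, so writers never wait on DB reads.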

Integration

This PR should not affect downstream projects.

Comment on lines 765 to 772
let keys: Vec<_> = {
let index = self.index.read();
index.entries.keys().cloned().collect()
};

let mut result = Vec::with_capacity(keys.len());
for h in keys {
let encoded = self.db.get(col::STATEMENTS, &h).map_err(|e| Error::Db(e.to_string()))?;
Contributor

Note that by the time we read from the DB, the statement might have been removed. So this operation doesn't return a view of the statement store at a single point in time; instead it returns most statements. It could be that a statement was removed and another one added in the meantime.

But it can also be good, and more efficient than before.

For instance, if a user always overrides one statement in one channel, then even though the store always has one statement for that user, this function might sometimes return none. But again, that can be acceptable.
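As an illustration of that race (a hedged sketch; `collect_statements` and `db_get` are hypothetical names, not the store's API): a key captured in the snapshot may already be gone when the DB is queried, so the missing entry is simply skipped rather than treated as an error.

```rust
// Hypothetical sketch of the race described above: keys are snapshotted
// first and the DB is queried afterwards, so entries can disappear in
// between.
fn collect_statements(
    snapshot_keys: &[[u8; 32]],
    db_get: impl Fn(&[u8; 32]) -> Option<Vec<u8>>,
) -> Vec<Vec<u8>> {
    snapshot_keys
        .iter()
        // `None` means the statement was removed after the snapshot was
        // taken; skipping it is what makes the result "most statements"
        // rather than a point-in-time view.
        .filter_map(|key| db_get(key))
        .collect()
}
```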

Contributor Author

But it can also be good, and more efficient than before.

It could be good, but benchmarking revealed a regression. I suppose that without the lock, concurrent access to the DB slows things down.

@AndreiEres AndreiEres force-pushed the AndreiEres/fix-statement-store-deadlock branch 2 times, most recently from c2e4e5d to faa4bf0 Compare September 30, 2025 15:35
Block::Hash: From<BlockHash>,
Client: ProvideRuntimeApi<Block>
+ HeaderBackend<Block>
+ sc_client_api::ExecutorProvider<Block>
Contributor Author

Not related to the subject of this PR, but this bound should be removed since we don't need the trait; it only complicates the test setup.

}

/// Perform periodic store maintenance
pub fn maintain(&self) {
Contributor Author

Not related to the current deadlock, but it is better to remove the unnecessary reads. It also makes the log more precise, since the index can change between maintenance and logging. Keeping the write lock during the DB commit is not necessary.
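A hedged sketch of the shape this gives `maintain()` (the `Index`, `Db`, and `purge_expired` names below are stand-ins, not the real code): do the in-memory work and capture the data to log while the write lock is held, then commit to the DB and log after the lock has been released.

```rust
use std::sync::Arc;
use parking_lot::RwLock;

// Simplified stand-ins for the real index and database.
struct Index {
    total_size: usize,
}

impl Index {
    // Returns the keys to delete from the DB plus the stats worth logging.
    fn purge_expired(&mut self) -> (Vec<Vec<u8>>, usize) {
        (Vec::new(), self.total_size)
    }
}

struct Db;
impl Db {
    fn commit(&self, _deleted_keys: Vec<Vec<u8>>) {}
}

struct Store {
    index: Arc<RwLock<Index>>,
    db: Db,
}

impl Store {
    /// Perform periodic store maintenance.
    fn maintain(&self) {
        // The write guard is a temporary, so it is dropped at the end of
        // this statement, right after the in-memory purge.
        let (deleted_keys, total_size) = self.index.write().purge_expired();

        // Commit and log without the lock, so readers and the propagation
        // task are not blocked on the DB write.
        self.db.commit(deleted_keys);
        println!("Statement store maintenance done, total size: {total_size}");
    }
}
```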

@AndreiEres
Contributor Author

/cmd prdoc --audience node_dev --bump patch

@AndreiEres AndreiEres added the T0-node This PR/Issue is related to the topic “node”. label Sep 30, 2025
@AndreiEres AndreiEres requested a review from gui1117 September 30, 2025 16:23
@paritytech-workflow-stopper

All GitHub workflows were cancelled due to the failure of one of the required jobs.
Failed workflow url: https://github.com/paritytech/polkadot-sdk/actions/runs/18137050908
Failed job name: test-linux-stable

@AndreiEres AndreiEres force-pushed the AndreiEres/fix-statement-store-deadlock branch from 26918d9 to 4bfda69 Compare September 30, 2025 16:55
@AndreiEres AndreiEres added this pull request to the merge queue Sep 30, 2025
Merged via the queue into master with commit ed4eebb Sep 30, 2025
244 of 246 checks passed
@AndreiEres AndreiEres deleted the AndreiEres/fix-statement-store-deadlock branch September 30, 2025 22:59
@gui1117 gui1117 added A4-backport-stable2506 Pull request must be backported to the stable2506 release branch A4-backport-unstable2507 Pull request must be backported to the unstable2507 release branch A4-backport-stable2509 Pull request must be backported to the stable2509 release branch labels Sep 30, 2025
paritytech-release-backport-bot bot pushed a commit that referenced this pull request Sep 30, 2025
@paritytech-release-backport-bot

Successfully created backport PR for stable2506:

paritytech-release-backport-bot bot pushed a commit that referenced this pull request Sep 30, 2025
@paritytech-release-backport-bot

Successfully created backport PR for unstable2507:

@paritytech-release-backport-bot

Successfully created backport PR for stable2509:

paritytech-release-backport-bot bot pushed a commit that referenced this pull request Sep 30, 2025
EgorPopelyaev pushed a commit that referenced this pull request Oct 1, 2025
Backport #9868 into `stable2509` from AndreiEres.

See the
[documentation](https://github.com/paritytech/polkadot-sdk/blob/master/docs/BACKPORT.md)
on how to use this bot.


Co-authored-by: Andrei Eres <[email protected]>
Co-authored-by: cmd[bot] <41898282+github-actions[bot]@users.noreply.github.com>
EgorPopelyaev pushed a commit that referenced this pull request Oct 3, 2025
Backport #9868 into `stable2506` from AndreiEres.

See the
[documentation](https://github.com/paritytech/polkadot-sdk/blob/master/docs/BACKPORT.md)
on how to use this bot.


Co-authored-by: Andrei Eres <[email protected]>
Co-authored-by: cmd[bot] <41898282+github-actions[bot]@users.noreply.github.com>
bee344 pushed a commit that referenced this pull request Oct 7, 2025
alvicsam pushed a commit that referenced this pull request Oct 17, 2025
github-merge-queue bot pushed a commit that referenced this pull request Nov 7, 2025
# Description

Adds benchmarks to measure the performance of the statement-store:
- Message Exchange Scenario: interaction with one or many nodes.
- Memory Stress Test Scenario.

## Results

**Key changes made to improve performance:**
- [Fixed a
deadlock](#9868)
- [Increased statements
limits](#9894)
- [Improved
gossiping](#9912)

**Hardware**
All benchmarks were run on a MacBook Pro M2

### 1. Message Exchange Scenario
**Test Configuration**
- Total participants: 49_998
- Group size: 6 participants per group
- Total groups: 8_333
- Statement payload size: 512 KB
- Propagation delay: 2 seconds (empirically determined)
- Parachain network tested with: 2, 3, 6, 12 nodes

**Network Topologies**
We tested two distribution patterns:
1. **To one RPC:** All participants send statements to a single RPC node
2. **To all RPC:** Participants distribute statements across all nodes
(slower due to gossiping overhead)

**Participant Flow** (a hedged code sketch follows this list)
- Sends a statement with their key for an exchange session (1 sent)
- Waits 2 seconds for statement propagation
- Receives session keys from other members in the group (5 received)
- Sends statements containing a 512KB message to each member in the
group (5 sent)
- Waits 2 seconds for statement propagation
- Receives messages from other members (5 received)
- Total: 6 sent, 10 received.
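Roughly, each participant's loop looks like the sketch below. All names here (`Client`, `submit`, `fetch_group_statements`) are hypothetical stand-ins for the benchmark harness, not a real API.

```rust
use std::time::Duration;

const PROPAGATION_DELAY: Duration = Duration::from_secs(2);
const GROUP_SIZE: usize = 6;
const PAYLOAD_SIZE: usize = 512 * 1024; // 512 KB

// Hypothetical stand-in for the benchmark's RPC client.
struct Client;
impl Client {
    fn submit(&self, _statement: Vec<u8>) {}
    fn fetch_group_statements(&self, _group: u64) -> Vec<Vec<u8>> {
        vec![Vec::new(); GROUP_SIZE - 1]
    }
}

fn run_participant(client: &Client, group: u64, session_key: Vec<u8>) {
    // 1 sent: announce our session key for the exchange session.
    client.submit(session_key);
    std::thread::sleep(PROPAGATION_DELAY);

    // 5 received: session keys from the other members of the group.
    let peers = client.fetch_group_statements(group);

    // 5 sent: a 512 KB message to each other member.
    for _peer in &peers {
        client.submit(vec![0u8; PAYLOAD_SIZE]);
    }
    std::thread::sleep(PROPAGATION_DELAY);

    // 5 received: the messages addressed to us. Total: 6 sent, 10 received.
    let _messages = client.fetch_group_statements(group);
}
```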


**Results**

| Collators      | Avg time | Max time | Memory |
| -------------- | -------- | -------- | ------ |
| **To one RPC** |          |          |        |
| 2              | 35s      | 35s      | 2.1GB  |
| 6              | 48s      | 50s      | 1.7GB  |
| **To all RPC** |          |          |        |
| 3              | 41s      | 51s      | 1.9GB  |
| 6              | 61s      | 71s      | 1.4GB  |
| 12             | 94s      | 119s     | 1.9GB  |

**Observations**
- Sending to one RPC node is faster but creates a bottleneck
- Distributing across all nodes takes longer due to gossiping overhead
- More collators increase gossiping load, resulting in slower completion
times
- Memory usage per node remains around 2GB.


### 2. Memory Stress Test Scenario

**Test Configuration**
We prepared one more scenario to check how much memory nodes use with a
full store after the limits were increased. To maximize the memory
consumed by the index, we submitted statements that all had unique
topics; other fields (e.g., proofs) were not used.
- Total tasks: 100,000 concurrent
- Statement size: 1 KB per statement
- Network size: 6 collators

**Test Flow** (a hedged sketch follows the list)
1. Spawn 100,000 concurrent tasks
2. Each task sends statements with 1KB payload to one node until store
is full
3. Statements are gossiped across the network to the other 5 collators
4. Test completes when all collator stores are full
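A minimal sketch of the stress loop, assuming a tokio runtime; `submit_statement` and `store_is_full` are hypothetical stubs standing in for the benchmark harness, not a real API.

```rust
// Each of the 100,000 tasks keeps submitting 1 KB statements until the
// target node reports a full store.
async fn stress_test() {
    const TASKS: usize = 100_000;
    const PAYLOAD: usize = 1024; // 1 KB per statement

    let mut handles = Vec::with_capacity(TASKS);
    for _ in 0..TASKS {
        handles.push(tokio::spawn(async move {
            while !store_is_full().await {
                submit_statement(vec![0u8; PAYLOAD]).await;
            }
        }));
    }
    for handle in handles {
        let _ = handle.await;
    }
}

// Hypothetical stubs so the sketch is self-contained.
async fn store_is_full() -> bool {
    true
}
async fn submit_statement(_payload: Vec<u8>) {}
```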

**Results**
During the tests, each node used up to 4.5GB of memory.