Fix the deadlock during statements gossiping #9868
Conversation
```rust
let keys: Vec<_> = {
	let index = self.index.read();
	index.entries.keys().cloned().collect()
};

let mut result = Vec::with_capacity(keys.len());
for h in keys {
	let encoded = self.db.get(col::STATEMENTS, &h).map_err(|e| Error::Db(e.to_string()))?;
```
Note that by the time we read from the DB, the statement might already have been removed.
So this operation doesn't return a view of the statement store at one point in time. Instead it returns most statements: one statement could be removed and another added in the meantime.
That can still be acceptable, and more efficient than before.
But consider, for instance, a user who always overrides a single statement in one channel. While the store always holds one statement for that user, this function might sometimes return none for them. Again, that can be acceptable.
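To make the trade-off described above concrete, here is a minimal, self-contained sketch of the snapshot-then-read pattern. All types (`Store`, `Index`, the `Db` trait) are stand-ins for illustration, not the statement store's real API:

```rust
use std::collections::HashMap;
use parking_lot::RwLock;

type Hash = [u8; 32];

// Stand-in index and database, assumed for illustration only.
struct Index {
	entries: HashMap<Hash, ()>,
}

trait Db {
	fn get(&self, key: &Hash) -> Result<Option<Vec<u8>>, String>;
}

struct Store<D: Db> {
	index: RwLock<Index>,
	db: D,
}

impl<D: Db> Store<D> {
	/// Returns the encoded statements currently reachable through the index.
	fn statements(&self) -> Result<Vec<(Hash, Vec<u8>)>, String> {
		// Snapshot the keys under a short-lived read lock, then drop the lock
		// before touching the DB.
		let keys: Vec<Hash> = {
			let index = self.index.read();
			index.entries.keys().cloned().collect()
		};

		let mut result = Vec::with_capacity(keys.len());
		for h in keys {
			match self.db.get(&h)? {
				Some(encoded) => result.push((h, encoded)),
				// Removed between the snapshot and this read: skip it rather
				// than treat it as an error; the result is not a point-in-time view.
				None => continue,
			}
		}
		Ok(result)
	}
}
```

The key point is that the read lock is held only while copying the keys, and a `None` from the DB is treated as "removed in the meantime" rather than as an error.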
> But it can be good and more efficient than before as well.

It could be good, but benchmarking revealed a regression. I suppose that without the lock, concurrent access to the DB slows things down.
```rust
Block::Hash: From<BlockHash>,
Client: ProvideRuntimeApi<Block>
	+ HeaderBackend<Block>
	+ sc_client_api::ExecutorProvider<Block>
```
Not related to the subject of this PR, but the `ExecutorProvider` bound should be removed: we don't need this trait, and it only complicates the test setup.
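For illustration only, a hedged sketch of what the trimmed bounds could look like once `ExecutorProvider` is dropped (the function name and exact bound list are assumptions, not the actual code):

```rust
// Hypothetical sketch: same client bounds, minus `sc_client_api::ExecutorProvider`,
// since the statement store never calls into the executor.
fn build_statement_store<Block, BlockHash, Client>(client: std::sync::Arc<Client>)
where
	Block: sp_runtime::traits::Block,
	Block::Hash: From<BlockHash>,
	Client: sp_api::ProvideRuntimeApi<Block>
		+ sp_blockchain::HeaderBackend<Block>
		+ Send
		+ Sync
		+ 'static,
{
	// Construction elided; only the bounds matter here.
	let _ = client;
}
```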
```rust
}

/// Perform periodic store maintenance
pub fn maintain(&self) {
```
Not related to the current deadlock, but it's better to remove the unnecessary reads. It also makes the log more precise, since the index can change between maintenance and logging. Keeping the write lock during the DB commit is not necessary.
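A minimal sketch of the lock scoping described in this comment, with all types stood in for illustration (the real index, DB commit and logging differ): the index is updated under the write lock, the counts to log are captured at that point, and the lock is released before the DB commit.

```rust
use std::collections::HashMap;
use parking_lot::RwLock;

type Hash = [u8; 32];

// Stand-in index: tracks entries plus the hashes that have expired.
struct Index {
	expired: Vec<Hash>,
	entries: HashMap<Hash, ()>,
}

impl Index {
	// Removes expired entries and returns them, so the caller can delete them
	// from the DB and log precise counts without re-reading the index.
	fn purge_expired(&mut self) -> Vec<Hash> {
		for h in &self.expired {
			self.entries.remove(h);
		}
		std::mem::take(&mut self.expired)
	}
}

trait Db {
	fn delete(&self, keys: &[Hash]) -> Result<(), String>;
}

struct Store<D: Db> {
	index: RwLock<Index>,
	db: D,
}

impl<D: Db> Store<D> {
	/// Periodic maintenance: update the index under the write lock, capture the
	/// counts to log, then release the lock before the DB commit.
	fn maintain(&self) {
		let (purged, remaining) = {
			let mut index = self.index.write();
			let purged = index.purge_expired();
			(purged, index.entries.len())
		}; // write lock released here, before the DB commit

		if let Err(e) = self.db.delete(&purged) {
			eprintln!("statement-store maintenance commit failed: {e}");
		}
		// Counts were captured at purge time, so the log stays accurate even if
		// the index changes between maintenance and logging.
		println!("Statement store maintenance: purged {}, remaining {}", purged.len(), remaining);
	}
}
```

Capturing the counts from the purge result (rather than re-reading the index afterwards) is what keeps the log accurate while still letting the lock go before the commit.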
/cmd prdoc --audience node_dev --bump patch |
All GitHub workflows were cancelled due to the failure of one of the required jobs.
Backport #9868 into `stable2509` from AndreiEres. See the [documentation](https://github.com/paritytech/polkadot-sdk/blob/master/docs/BACKPORT.md) on how to use this bot.

Co-authored-by: Andrei Eres <[email protected]>
Co-authored-by: cmd[bot] <41898282+github-actions[bot]@users.noreply.github.com>

Backport #9868 into `stable2506` from AndreiEres. See the [documentation](https://github.com/paritytech/polkadot-sdk/blob/master/docs/BACKPORT.md) on how to use this bot.

Co-authored-by: Andrei Eres <[email protected]>
Co-authored-by: cmd[bot] <41898282+github-actions[bot]@users.noreply.github.com>
# Description

Adds benchmarks to measure the performance of the statement-store:

- Message Exchange Scenario: interaction with one or many nodes.
- Memory Stress Test Scenario.

## Results

**Key improvements made to improve the performance:**

- [Fixed a deadlock](#9868)
- [Increased statements limits](#9894)
- [Improved gossiping](#9912)

**Hardware**

All benchmarks were run on a MacBook Pro M2.

### 1. Message Exchange Scenario

**Test Configuration**

- Total participants: 49_998
- Group size: 6 participants per group
- Total groups: 8_333
- Statement payload size: 512 KB
- Propagation delay: 2 seconds (empirically determined)
- Parachain network tested with: 2, 3, 6, 12 nodes

**Network Topologies**

We tested two distribution patterns:

1. **To one RPC:** All participants send statements to a single RPC node
2. **To all RPC:** Participants distribute statements across all nodes (slower due to gossiping overhead)

**Participant Flow**

- Sends a statement with their key for an exchange session (1 sent)
- Waits 2 seconds for statement propagation
- Receives session keys from other members in the group (5 received)
- Sends statements containing a 512KB message to each member in the group (5 sent)
- Waits 2 seconds for statement propagation
- Receives messages from other members (5 received)
- Total: 6 sent, 10 received.

(A rough code sketch of this flow is given at the end of this description.)

**Results**

| Collators | Avg time | Max time | Memory |
| -------------- | -------- | -------- | ------ |
| **To one RPC** | | | |
| 2 | 35s | 35s | 2.1GB |
| 6 | 48s | 50s | 1.7GB |
| **To all RPC** | | | |
| 3 | 41s | 51s | 1.9GB |
| 6 | 61s | 71s | 1.4GB |
| 12 | 94s | 119s | 1.9GB |

**Observations**

- Sending to one RPC node is faster but creates a bottleneck
- Distributing across all nodes takes longer due to gossiping overhead
- More collators increase gossiping load, resulting in slower completion times
- Memory usage per node remains around 2GB.

### 2. Memory Stress Test Scenario

**Test Configuration**

We prepared one more scenario to check how much memory nodes use with a full store after we increased the limits. To maximize the memory used by the index, we submitted statements with all-unique topics; other fields (e.g., proofs) were not used.

- Total tasks: 100,000 concurrent
- Statement size: 1 KB per statement
- Network size: 6 collators

**Test Flow**

1. Spawn 100,000 concurrent tasks
2. Each task sends statements with 1KB payload to one node until the store is full
3. Statements are gossiped across the network to the other 5 collators
4. Test completes when all collator stores are full

**Results**

During the tests, each node used up to 4.5GB of memory.
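As referenced above, here is a rough sketch of the message-exchange participant flow. Every name (`RpcClient`, `submit_statement`, `fetch_statements`, `topic_for`) is a hypothetical stand-in for the benchmark's actual RPC client, not the statement-store API:

```rust
use std::time::Duration;

// Hypothetical RPC client, assumed for illustration only.
struct RpcClient;
impl RpcClient {
	async fn submit_statement(&self, _topic: [u8; 32], _payload: Vec<u8>) { /* ... */ }
	async fn fetch_statements(&self, _topic: [u8; 32]) -> Vec<Vec<u8>> { vec![] }
}

const PROPAGATION_DELAY: Duration = Duration::from_secs(2);
const MESSAGE_SIZE: usize = 512 * 1024; // 512 KB payload per message

// One participant in a group of six: 6 statements sent, 10 received in total.
async fn participant(rpc: &RpcClient, session_topic: [u8; 32], my_key: Vec<u8>) {
	// 1. Announce our key for the exchange session (1 sent).
	rpc.submit_statement(session_topic, my_key.clone()).await;

	// 2. Wait for gossip to propagate, then collect the other members' keys (5 received).
	tokio::time::sleep(PROPAGATION_DELAY).await;
	let peer_keys = rpc.fetch_statements(session_topic).await;

	// 3. Send a 512 KB message to each peer in the group (5 sent).
	for peer_topic in peer_keys.iter().map(|k| topic_for(k)) {
		rpc.submit_statement(peer_topic, vec![0u8; MESSAGE_SIZE]).await;
	}

	// 4. Wait again, then read the messages addressed to us (5 received).
	tokio::time::sleep(PROPAGATION_DELAY).await;
	let _messages = rpc.fetch_statements(topic_for(&my_key)).await;
}

// Hypothetical helper mapping a key to the topic its messages are posted under.
fn topic_for(_key: &[u8]) -> [u8; 32] {
	[0u8; 32]
}
```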
Description
During statement store benchmarking we observed deadlock-like behavior, which we traced to statement propagation: statements were gossiped every second, locking the index, which possibly caused the deadlock. After the fix, the observed behavior no longer occurs.
Even though it is possible to let the DB and the index go out of sync for read operations and release the locks earlier, which should be harmless, it leads to regressions. I suspect this is because of concurrent access across many `db.get()` calls. Checked with the benchmarks in #9884.
Integration
This PR should not affect downstream projects.