Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

(lib/runtime): fail to execute block 17088245 allocator out of space #3774

Closed
1 task
EclesioMeloJunior opened this issue Feb 26, 2024 · 6 comments · Fixed by #3800
Closed
1 task

(lib/runtime): fail to execute block 17088245 allocator out of space #3774

EclesioMeloJunior opened this issue Feb 26, 2024 · 6 comments · Fixed by #3800
Assignees
Labels
P-high this should be addressed ASAP. S-sync-westend related to particular network syncing.

Comments

@EclesioMeloJunior
Copy link
Member

EclesioMeloJunior commented Feb 26, 2024

2024-02-24T03:55:37Z INFO     imported block 0x2aad31837b6cdf8ce5964cca86b3a39b67ff872aa5eb6c954217872f6683d63d and stored state trie with root 0xe40e9900f63d1539014ce5cc4fa0e4932ae4ed1154c31cd77714d0fdf2d4e8b3	pkg=core
RUNTIME HASH: 0xd1d708433e5d7bcb9fc30a3de67952c1611fa58cd1980146ab07aa438116d352
2024-02-24T03:55:37Z ERROR    block data processing for block with hash 0x0feb9c8bb9018bbcf20b467e15636de02397de00c66233c5b4a06f5be3a2f558 failed: processing block data with header and body: handling block: failed to execute block 17088245: running runtime function: bumping: allocator out of space: current pages 65535 greater than max wasm pages 65535 (recovered by wazero)
wasm stack trace:
	env.ext_allocator_malloc_version_1(i32) i32
	._ZN5sp_io9allocator26extern_host_function_impls6malloc17h175ebffadfcb2d44E(i32) i32
	._ZN5alloc7raw_vec11finish_grow17h97bec91bef32a5ebE.llvm.1609301459894010017(i32,i32,i32,i32)
	._ZN5alloc7raw_vec19RawVec$LT$T$C$A$GT$16reserve_for_push17h1ed9c357b734a946E(i32,i32)
	._ZN18parity_scale_codec5codec19decode_vec_with_len17h15e2f12c18cd243fE(i32,i32,i32)
	._ZN13frame_support7storage8unhashed3get17hb6cef7316d8fee28E(i32,i32,i32)
	._ZN353_$LT$pallet_staking..ExposureOf$LT$T$GT$$u20$as$u20$sp_runtime..traits..Convert$LT$$LT$T$u20$as$u20$frame_system..pallet..Config$GT$..AccountId$C$core..option..Option$LT$pallet_staking..Exposure$LT$$LT$T$u20$as$u20$frame_system..pallet..Config$GT$..AccountId$C$$LT$T$u20$as$u20$pallet_staking..pallet..pallet..Config$GT$..CurrencyBalance$GT$$GT$$GT$$GT$7convert17h469fc883609fa8e1E(i32,i32)
	._ZN5alloc3vec16in_place_collect108_$LT$impl$u20$alloc..vec..spec_from_iter..SpecFromIter$LT$T$C$I$GT$$u20$for$u20$alloc..vec..Vec$LT$T$GT$$GT$9from_iter17h34924d7a99a1d609E(i32,i32)
	._ZN14pallet_session10historical7onchain39store_session_validator_set_to_offchain17h0e99a24533159ab9E(i32)
	._ZN893_$LT$$LP$TupleElement0$C$TupleElement1$C$TupleElement2$C$TupleElement3$C$TupleElement4$C$TupleElement5$C$TupleElement6$C$TupleElement7$C$TupleElement8$C$TupleElement9$C$TupleElement10$C$TupleElement11$C$TupleElement12$C$TupleElement13$C$TupleElement14$C$TupleElement15$C$TupleElement16$C$TupleElement17$C$TupleElement18$C$TupleElement19$C$TupleElement20$C$TupleElement21$C$TupleElement22$C$TupleElement23$C$TupleElement24$C$TupleElement25$C$TupleElement26$C$TupleElement27$C$TupleElement28$C$TupleElement29$C$TupleElement30$C$TupleElement31$C$TupleElement32$C$TupleElement33$C$TupleElement34$C$TupleElement35$C$TupleElement36$C$TupleElement37$C$TupleElement38$C$TupleElement39$C$TupleElement40$C$TupleElement41$C$TupleElement42$C$TupleElement43$C$TupleElement44$C$TupleElement45$C$TupleElement46$C$TupleElement47$RP$$u20$as$u20$frame_support..traits..hooks..OnInitialize$LT$BlockNumber$GT$$GT$13on_initialize17h252b62108cdb4ff8E(i32,i32)
	._ZN15frame_executive104Executive$LT$System$C$Block$C$Context$C$UnsignedValidator$C$AllPalletsWithSystem$C$COnRuntimeUpgrade$GT$16initialize_block17h25c01251da9ce46aE(i32)
	._ZN15frame_executive104Executive$LT$System$C$Block$C$Context$C$UnsignedValidator$C$AllPalletsWithSystem$C$COnRuntimeUpgrade$GT$13execute_block17h04bd9202b85e1554E(i32)
	.Core_execute_block(i32,i32) i64	pkg=sync

edit
After a long debug session with @jimjbrettj™ we found that the allocation problem was related with long running instances and that is not trivial to clean up the runtime memory after a block execution (manually clean up), so here is a suggestion of what can be done to solve and a reference to substrate (https://github.com/paritytech/polkadot-sdk/blob/0c6c837f689a287583508506e342ba07687e8d26/substrate/client/executor/wasmtime/src/runtime.rs#L170C39-L170C55)

  • Instantiate a runtime for every block execution
@EclesioMeloJunior EclesioMeloJunior added P-high this should be addressed ASAP. S-sync-westend related to particular network syncing. labels Feb 26, 2024
@EclesioMeloJunior EclesioMeloJunior self-assigned this Feb 26, 2024
@EclesioMeloJunior EclesioMeloJunior added the A-blocked issue or PR is blocked until something else changes. label Mar 11, 2024
@EclesioMeloJunior
Copy link
Member Author

Related to the allocator problem I will need to wait for EC2 node to run towards block 17M to validate the changes to the allocator, I cannot reproduce it locally since when I start the node the memory is completely empty when executing the block, differently from the ec2 node where the allocator/memory is being used every imported block. Given that the allocator exports its metrics through Prometheus we can compare the usage now and before (where Gossamer had less one page)

@EclesioMeloJunior
Copy link
Member Author

EclesioMeloJunior commented Mar 18, 2024

After investigating substrate FreeingBumpHeapAllocator I notice that Gossamer is not cleaning up the allocator after block execution, instead Gossamer is refreshing the allocator only on every runtime upgrade

Here is the call_impl method which is called when the host needs to call a runtime func

Also the PR #3800 introduces the fix

@EclesioMeloJunior
Copy link
Member Author

After investigations with @jimjbrettj™ we found that the allocator still facing the same problem, just reinstantiating the allocator is not enough to solve the allocator out of space problem and we are facing wasm out of bounds access problems as shown in the image belown.

image

@EclesioMeloJunior
Copy link
Member Author

Same problem happening on westend staging at block #9412261

image

@EclesioMeloJunior EclesioMeloJunior removed the A-blocked issue or PR is blocked until something else changes. label Apr 8, 2024
@jimjbrettj jimjbrettj assigned jimjbrettj and unassigned jimjbrettj Apr 8, 2024
@EclesioMeloJunior
Copy link
Member Author

While syncing to #17m we faced a super slowdown after a runtime upgrade (0.9.25) which I am investigating at #3950

@EclesioMeloJunior
Copy link
Member Author

The fix was confirmed in our staging nodes given that the latest block finalized was #17272320
image

fyi @jimjbrettj @P1sar @timwu20

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
P-high this should be addressed ASAP. S-sync-westend related to particular network syncing.
Projects
None yet
2 participants