[ReshardingV3] - forknet testing and follow ups #12552

Open · 11 of 20 tasks · Tracked by #11881
wacban opened this issue Dec 3, 2024 · 5 comments
@wacban (Contributor) commented Dec 3, 2024

Description

Run forknet (a minimal command skeleton is sketched after this checklist):

  • without any traffic
    • Fix errors in get_postponed_receipt_count_for_shard: shard_layout.shard_ids().any(|i| i == shard_id)
    • TrieQueueIndices assertion failing ref
  • with traffic after resharding
  • with single shard tracking
    • Fix verify_path failing ref
    • Fix index out of bounds ref
  • with traffic before and after resharding
    • Fix state dumper stall ref
  • with heavy traffic to trigger congestion
  • with shard shuffling
  • with RPC & archival nodes (no memtries)
  • with node restarts
    • after resharding
    • during resharding
      • Fix memtrie loading error and flat storage resharding not resuming ref
  • with forks
  • with missing chunks & blocks
  • with decentralised state sync
  • with multiple reshardings
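
For reference, here is a minimal skeleton of such a forknet run, distilled from the concrete setups in the comments below (the angle-bracket placeholders stand for per-run values; exact parameters such as epoch length vary between runs, and every command is taken from those setups):

alias mirror="python3 tests/mocknet/mirror.py --chain-id mainnet --start-height 128293844 --unique-id <unique-id>"
# point the neard runner at the binary under test
mirror init-neard-runner --neard-binary-url <neard-binary-url>
# fork mainnet state and create the test chain; genesis at protocol 73 so that the
# upgrade to 74 triggers the resharding
mirror new-test \
  --epoch-length 5500 \
  --genesis-protocol-version 73 \
  --num-validators 7 \
  --num-seats 7 \
  --stateless-setup \
  --new-chain-id <unique-id> \
  --gcs-state-sync \
  --yes
mirror start-nodes
# only for the "with traffic" scenarios
mirror start-traffic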
@Longarithm (Member) commented Dec 4, 2024

My current setup

alias mirror="python3 tests/mocknet/mirror.py --chain-id mainnet --start-height 128293844 --unique-id eshardnet"
NODE_BINARY_URL=https://storage.googleapis.com/logunov/neard-1203
mirror init-neard-runner --neard-binary-url $NODE_BINARY_URL
mirror new-test \
  --epoch-length 5500 \
  --genesis-protocol-version 73 \
  --num-validators 7 \
  --num-seats 7 \
  --stateless-setup \
  --new-chain-id eshardnet \
  --gcs-state-sync \
  --yes
# raise log verbosity (both rust_log and opentelemetry) on validator and traffic hosts
RUST_LOG="client=debug,chain=debug,mirror=debug,actix_web=warn,mio=warn,tokio_util=warn,actix_server=warn,actix_http=warn,resharding=debug,fork-network=info,metrics=trace,doomslug=trace,sync=debug,catchup=debug,info"
mirror --host-type nodes run-cmd --cmd "jq '.opentelemetry = \"${RUST_LOG}\" | .rust_log = \"${RUST_LOG}\"' /home/ubuntu/.near/log_config.json > tmp && mv tmp /home/ubuntu/.near/log_config.json"
mirror --host-type traffic run-cmd --cmd "jq '.opentelemetry = \"${RUST_LOG}\" | .rust_log = \"${RUST_LOG}\"' /home/ubuntu/.near/target/log_config.json > tmp && mv tmp /home/ubuntu/.near/target/log_config.json"
# keep only 5 epochs before garbage collection
mirror update-config --set 'gc_num_epochs_to_keep=5'
# produce only ~30% of chunks to exercise missing-chunk handling
mirror update-config --set 'p_produce_chunk=0.3'
# 10 ms delay between flat storage resharding batches
mirror update-config --set 'resharding_config.batch_delay={"secs":0,"nanos":10000000}'
mirror start-nodes

Then wait for the protocol upgrade (version 74) before starting traffic:

while true; do
  result=$(curl --silent http://34.13.138.46:3030/metrics | grep 'near_current_protocol_version')
  echo "$result"
  if [[ $result == *"74"* ]]; then
    sleep 10
    mirror --host-filter '.*([0-9A-Fa-f]{4}|traffic)$' start-traffic
    break
  fi
  sleep 1
done

@Longarithm (Member) commented Dec 6, 2024

Current status:
Resharding works with single shard tracking, but nodes crash in the next epochs.

Follow-ups:

Latest setup
alias mirror="python3 tests/mocknet/mirror.py --chain-id mainnet --start-height 128293844 --unique-id hshardnet"
### SEPARATE COMMAND ###
NODE_BINARY_URL=https://storage.googleapis.com/logunov/neard-1206
mirror init-neard-runner --neard-binary-url $NODE_BINARY_URL
mirror new-test \
  --epoch-length 4500 \
  --genesis-protocol-version 73 \
  --num-validators 7 \
  --num-seats 7 \
  --stateless-setup \
  --new-chain-id hshardnet \
  --gcs-state-sync \
  --yes
RUST_LOG="client=debug,chain=debug,mirror=debug,actix_web=warn,mio=warn,tokio_util=warn,actix_server=warn,actix_http=warn,resharding=debug,fork-network=info,metrics=trace,doomslug=trace,indexer=info,info"
mirror --host-type nodes run-cmd --cmd "jq '.opentelemetry = \"${RUST_LOG}\" | .rust_log = \"${RUST_LOG}\"' /home/ubuntu/.near/log_config.json > tmp && mv tmp /home/ubuntu/.near/log_config.json"
mirror --host-type traffic run-cmd --cmd "jq '.opentelemetry = \"${RUST_LOG}\" | .rust_log = \"${RUST_LOG}\"' /home/ubuntu/.near/target/log_config.json > tmp && mv tmp /home/ubuntu/.near/target/log_config.json"
# enable shard shuffling for chunk producers in the protocol 73 epoch config
mirror --host-type nodes run-cmd --cmd 'for f in /home/ubuntu/.near/epoch_configs/73.json; do jq ".validator_selection_config.shuffle_shard_assignment_for_chunk_producers = true" "$f" > tmp && mv tmp "$f"; done'
mirror --host-type traffic run-cmd --cmd 'for f in /home/ubuntu/.near/target/epoch_configs/73.json; do jq ".validator_selection_config.shuffle_shard_assignment_for_chunk_producers = true" "$f" > tmp && mv tmp "$f"; done'
mirror update-config --set 'p_produce_chunk=0.3'
mirror update-config --set 'resharding_config.batch_delay={"secs":0,"nanos":10000000}'
mirror start-nodes
mirror start-traffic
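
A quick sanity check (a sketch, not part of the original commands) to confirm the shuffle flag actually landed, reusing only commands and paths already used above:

# read the flag back from every validator host; each should print "true"
mirror --host-type nodes run-cmd --cmd \
  'jq ".validator_selection_config.shuffle_shard_assignment_for_chunk_producers" /home/ubuntu/.near/epoch_configs/73.json'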

@Longarithm (Member) commented:

Forknet survived 10 epochs with

  • CurrentEpochStateSync pre-enabled, protocol upgraded to SimpleNightshadeV4, BandwidthScheduler not enabled at all
  • Only 30% of chunks are produced to test missing chunks behaviour aggressively
  • Default transaction rate, 30 tx/s
  • Shard shuffling, single shard tracking

https://near.zulipchat.com/#narrow/channel/407288-core.2Fresharding/topic/forknet/near/489974124

@Trisfald (Contributor) commented Jan 14, 2025

Tested node restart after resharding: Partial Success

Objective: Verify that nodes can restart and rejoin the network after resharding.
Tested code
Setup: same as before, 5% chunk misses
Dashboard

Outcome:

  • Nodes restarted and rejoined the network just fine
  • Issues with state dumper

Random observations:

  • Processing the resharding block takes ~2s
  • Increased CPU usage on several nodes during resharding, for around 10 minutes (coincides with the flat storage resharding duration)
    • For example, from 120% average to 300% average

Issues:

  • State dumper got stuck in block sync after resharding finished (see the sketch after this list). I had to kill the process; after a brute-force restart the problem went away.
    2025-01-14T13:18:15.507680Z ERROR obtain_state_part{part_id=293 shard_id=7 prev_hash=BAGwSCHdB7PNawJds5EnENacVcvFJ8RQorNwePeyxSXB state_root=GBBv6VhHPgCjppzG9Jv4mMY5ufXaPE9vZYHtgi1ZgEQY num_parts=829}:obtain_state_part{part_id=293 shard_id=7 prev_hash=BAGwSCHdB7PNawJds5EnENacVcvFJ8RQorNwePeyxSXB num_parts=829}: runtime: Can't get trie nodes for state part err=MissingTrieValue(TrieMemoryPartialStorage, PWsuYn6CCgRY17Rb9jaA8KKLBL1tZQvi52Q8PkYPvM2) part_id.idx=293 part_id.total=829 prev_hash=BAGwSCHdB7PNawJds5EnENacVcvFJ8RQorNwePeyxSXB state_root=GBBv6VhHPgCjppzG9Jv4mMY5ufXaPE9vZYHtgi1ZgEQY shard_id=7
    
  • At the end of the test I couldn't stop the state dumper normally. It hangs with
    INFO neard: Waiting for RocksDB to gracefully shutdown
    INFO db: Waiting for remaining RocksDB instances to close num_instances=2
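
A sketch (not part of the original report) of one way to observe the block sync stall mentioned above: poll the dumper's /status endpoint and check whether sync_info.latest_block_height keeps advancing (the IP is reused from the earlier metrics example; substitute the dumper host's address).

# print the node's head height every 10 seconds; a stalled dumper stops advancing
watch -n 10 "curl --silent http://34.13.138.46:3030/status | jq '.sync_info.latest_block_height'"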
    

@Trisfald (Contributor) commented Jan 14, 2025

Tested node restart during resharding: Failure

Tested code
Setup: same as before, 5% chunk misses
Dashboard

Outcome:

  • Flat storage resharding didn't resume at restart (I think; I'm not sure because of the errors below; a way to check is sketched at the end of this comment)
  • Memtrie loading error

Issues:

  • As expected, memtries can't be loaded immediately:
    thread 'main' panicked at chain/client/src/client_actor.rs:171:6:
    called `Result::unwrap()` on an `Err` value: Chain(StorageError(MemTrieLoadingError("Cannot load memtries when flat storage is not ready for shard s6.v3, actual status: Empty")))
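
A sketch (not part of the original test) of how one might check, after restarting the node, whether flat storage resharding picked up again: grep the node's Prometheus metrics for anything resharding related (endpoint and IP reused from the earlier setup; no specific metric names are assumed, the grep just surfaces whatever the binary exposes).

# list resharding-related metrics on the restarted node
curl --silent http://34.13.138.46:3030/metrics | grep -i 'resharding'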
    
