[ReshardingV3] - forknet testing and follow ups #12552

Open · 11 of 20 tasks · Tracked by #11881
wacban opened this issue Dec 3, 2024 · 5 comments
@wacban (Contributor) commented Dec 3, 2024

Description

Run forknet (a minimal command skeleton is sketched after this checklist):

  • without any traffic
    • Fix errors in get_postponed_receipt_count_for_shard: shard_layout.shard_ids().any(|i| i == shard_id)
    • TrieQueueIndices assertion failing ref
  • with traffic after resharding
  • with single shard tracking
    • Fix verify_path failing ref
    • Fix index out of bounds ref
  • with traffic before and after resharding
    • Fix state dumper stall ref
  • with heavy traffic to trigger congestion
  • with shard shuffling
  • with RPC & archival nodes (no memtries)
  • with node restarts
    • after resharding
    • during resharding
      • Fix memtrie loading error and flat storage resharding not resuming ref
  • with forks
  • with missing chunks & blocks
  • with decentralised state sync
  • with multiple reshardings
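
For reference, here is a minimal skeleton of such a forknet run, distilled from the concrete setups in the comments below (the angle-bracket placeholders stand for per-run values; exact parameters such as epoch length vary between runs, and every command is taken from those setups):

alias mirror="python3 tests/mocknet/mirror.py --chain-id mainnet --start-height 128293844 --unique-id <unique-id>"
# point the neard runner at the binary under test
mirror init-neard-runner --neard-binary-url <neard-binary-url>
# fork mainnet state and create the test chain; genesis at protocol 73 so that the
# upgrade to 74 triggers the resharding
mirror new-test \
  --epoch-length 5500 \
  --genesis-protocol-version 73 \
  --num-validators 7 \
  --num-seats 7 \
  --stateless-setup \
  --new-chain-id <unique-id> \
  --gcs-state-sync \
  --yes
mirror start-nodes
# only for the "with traffic" scenarios
mirror start-traffic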
@Longarithm (Member) commented Dec 4, 2024

My current setup

alias mirror="python3 tests/mocknet/mirror.py --chain-id mainnet --start-height 128293844 --unique-id eshardnet"
NODE_BINARY_URL=https://storage.googleapis.com/logunov/neard-1203
mirror init-neard-runner --neard-binary-url $NODE_BINARY_URL
mirror new-test \
  --epoch-length 5500 \
  --genesis-protocol-version 73 \
  --num-validators 7 \
  --num-seats 7 \
  --stateless-setup \
  --new-chain-id eshardnet \
  --gcs-state-sync \
  --yes
# raise log verbosity (both rust_log and opentelemetry) on validator and traffic hosts
RUST_LOG="client=debug,chain=debug,mirror=debug,actix_web=warn,mio=warn,tokio_util=warn,actix_server=warn,actix_http=warn,resharding=debug,fork-network=info,metrics=trace,doomslug=trace,sync=debug,catchup=debug,info"
mirror --host-type nodes run-cmd --cmd "jq '.opentelemetry = \"${RUST_LOG}\" | .rust_log = \"${RUST_LOG}\"' /home/ubuntu/.near/log_config.json > tmp && mv tmp /home/ubuntu/.near/log_config.json"
mirror --host-type traffic run-cmd --cmd "jq '.opentelemetry = \"${RUST_LOG}\" | .rust_log = \"${RUST_LOG}\"' /home/ubuntu/.near/target/log_config.json > tmp && mv tmp /home/ubuntu/.near/target/log_config.json"
# keep only 5 epochs before garbage collection
mirror update-config --set 'gc_num_epochs_to_keep=5'
# produce only ~30% of chunks to exercise missing-chunk handling
mirror update-config --set 'p_produce_chunk=0.3'
# 10 ms delay between flat storage resharding batches
mirror update-config --set 'resharding_config.batch_delay={"secs":0,"nanos":10000000}'
mirror start-nodes

Then wait for the protocol upgrade (version 74) before starting traffic:

while true; do
  result=$(curl --silent http://34.13.138.46:3030/metrics | grep 'near_current_protocol_version')
  echo "$result"
  if [[ $result == *"74"* ]]; then
    sleep 10
    mirror --host-filter '.*([0-9A-Fa-f]{4}|traffic)$' start-traffic
    break
  fi
  sleep 1
done

@Longarithm (Member) commented Dec 6, 2024

Current status:
Resharding works with single shard tracking, but nodes crash in the next epochs.

Follow-ups:

Latest setup
alias mirror="python3 tests/mocknet/mirror.py --chain-id mainnet --start-height 128293844 --unique-id hshardnet"
### SEPARATE COMMAND ###
NODE_BINARY_URL=https://storage.googleapis.com/logunov/neard-1206
mirror init-neard-runner --neard-binary-url $NODE_BINARY_URL
mirror new-test \
  --epoch-length 4500 \
  --genesis-protocol-version 73 \
  --num-validators 7 \
  --num-seats 7 \
  --stateless-setup \
  --new-chain-id hshardnet \
  --gcs-state-sync \
  --yes
RUST_LOG="client=debug,chain=debug,mirror=debug,actix_web=warn,mio=warn,tokio_util=warn,actix_server=warn,actix_http=warn,resharding=debug,fork-network=info,metrics=trace,doomslug=trace,indexer=info,info"
mirror --host-type nodes run-cmd --cmd "jq '.opentelemetry = \"${RUST_LOG}\" | .rust_log = \"${RUST_LOG}\"' /home/ubuntu/.near/log_config.json > tmp && mv tmp /home/ubuntu/.near/log_config.json"
mirror --host-type traffic run-cmd --cmd "jq '.opentelemetry = \"${RUST_LOG}\" | .rust_log = \"${RUST_LOG}\"' /home/ubuntu/.near/target/log_config.json > tmp && mv tmp /home/ubuntu/.near/target/log_config.json"
# enable shard shuffling for chunk producers in the protocol 73 epoch config
mirror --host-type nodes run-cmd --cmd 'for f in /home/ubuntu/.near/epoch_configs/73.json; do jq ".validator_selection_config.shuffle_shard_assignment_for_chunk_producers = true" "$f" > tmp && mv tmp "$f"; done'
mirror --host-type traffic run-cmd --cmd 'for f in /home/ubuntu/.near/target/epoch_configs/73.json; do jq ".validator_selection_config.shuffle_shard_assignment_for_chunk_producers = true" "$f" > tmp && mv tmp "$f"; done'
mirror update-config --set 'p_produce_chunk=0.3'
mirror update-config --set 'resharding_config.batch_delay={"secs":0,"nanos":10000000}'
mirror start-nodes
mirror start-traffic
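
A quick sanity check (a sketch, not part of the original commands) to confirm the shuffle flag actually landed, reusing only commands and paths already used above:

# read the flag back from every validator host; each should print "true"
mirror --host-type nodes run-cmd --cmd \
  'jq ".validator_selection_config.shuffle_shard_assignment_for_chunk_producers" /home/ubuntu/.near/epoch_configs/73.json'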

@Longarithm (Member) commented:

Forknet survived 10 epochs with

  • CurrentEpochStateSync pre-enabled, protocol upgraded to SimpleNightshadeV4, BandwidthScheduler not enabled at all
  • Only 30% of chunks are produced to test missing chunks behaviour aggressively
  • Default transaction rate, 30 tx/s
  • Shard shuffling, single shard tracking

https://near.zulipchat.com/#narrow/channel/407288-core.2Fresharding/topic/forknet/near/489974124

@Trisfald (Contributor) commented Jan 14, 2025

Tested node restart after resharding: Partial Success

Objective: Verify that nodes can restart and rejoin the network after resharding.
Tested code
Setup: same as before, 5% chunk misses
Dashboard

Outcome:

  • Nodes restarted and rejoined the network just fine
  • Issues with state dumper

Random observations:

  • Processing the resharding block takes ~2s
  • Increased CPU usage on several nodes during resharding, for around 10 minutes (coincides with the flat storage resharding duration)
    • For example, from 120% average to 300% average

Issues:

  • State dumper got stuck in block sync after resharding finished (see the sketch after this list). I had to kill the process; after a brute-force restart the problem went away.
    2025-01-14T13:18:15.507680Z ERROR obtain_state_part{part_id=293 shard_id=7 prev_hash=BAGwSCHdB7PNawJds5EnENacVcvFJ8RQorNwePeyxSXB state_root=GBBv6VhHPgCjppzG9Jv4mMY5ufXaPE9vZYHtgi1ZgEQY num_parts=829}:obtain_state_part{part_id=293 shard_id=7 prev_hash=BAGwSCHdB7PNawJds5EnENacVcvFJ8RQorNwePeyxSXB num_parts=829}: runtime: Can't get trie nodes for state part err=MissingTrieValue(TrieMemoryPartialStorage, PWsuYn6CCgRY17Rb9jaA8KKLBL1tZQvi52Q8PkYPvM2) part_id.idx=293 part_id.total=829 prev_hash=BAGwSCHdB7PNawJds5EnENacVcvFJ8RQorNwePeyxSXB state_root=GBBv6VhHPgCjppzG9Jv4mMY5ufXaPE9vZYHtgi1ZgEQY shard_id=7
    
  • At the end of the test I couldn't stop the state dumper normally. It hangs with
    INFO neard: Waiting for RocksDB to gracefully shutdown
    INFO db: Waiting for remaining RocksDB instances to close num_instances=2
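
A sketch (not part of the original report) of one way to observe the block sync stall mentioned above: poll the dumper's /status endpoint and check whether sync_info.latest_block_height keeps advancing (the IP is reused from the earlier metrics example; substitute the dumper host's address).

# print the node's head height every 10 seconds; a stalled dumper stops advancing
watch -n 10 "curl --silent http://34.13.138.46:3030/status | jq '.sync_info.latest_block_height'"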
    

@Trisfald (Contributor) commented Jan 14, 2025

Tested node restart during resharding: Failure

Tested code
Setup: same as before, 5% chunk misses
Dashboard

Outcome:

  • Flat storage resharding didn't resume at restart (I think; I'm not sure because of the errors below; a way to check is sketched at the end of this comment)
  • Memtrie loading error

Issues:

  • As expected, memtries can't be loaded immediately:
    thread 'main' panicked at chain/client/src/client_actor.rs:171:6:
    called `Result::unwrap()` on an `Err` value: Chain(StorageError(MemTrieLoadingError("Cannot load memtries when flat storage is not ready for shard s6.v3, actual status: Empty")))
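
A sketch (not part of the original test) of how one might check, after restarting the node, whether flat storage resharding picked up again: grep the node's Prometheus metrics for anything resharding related (endpoint and IP reused from the earlier setup; no specific metric names are assumed, the grep just surfaces whatever the binary exposes).

# list resharding-related metrics on the restarted node
curl --silent http://34.13.138.46:3030/metrics | grep -i 'resharding'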
    
