
implement validator custody with column refill mechanism #7127


Open

wants to merge 8 commits into base: column-syncer

Conversation

@agnxsh (Contributor) commented May 5, 2025

No description provided.

for i in startIndex..<SLOTS_PER_EPOCH:
  let blck = vcus.dag.getForkedBlock(blocks[int(i)]).valueOr: continue
  withBlck(blck):
    when typeof(forkyBlck).kind < ConsensusFork.Fulu: continue
Contributor:

Should this ever happen? It's much cheaper to detect this before the block is ever fetched from the database, based on the slot number and the fork schedule.
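
A minimal sketch of that pre-check, assuming blocks[i] is a BlockId-like value exposing .slot and that a consensusForkAtEpoch helper is available on the runtime config; placement and names are illustrative, not the PR's code:

for i in startIndex..<SLOTS_PER_EPOCH:
  let bid = blocks[int(i)]
  # Skip pre-Fulu slots using only the fork schedule, before any DB read.
  if vcus.dag.cfg.consensusForkAtEpoch(bid.slot.epoch) < ConsensusFork.Fulu:
    continue
  let blck = vcus.dag.getForkedBlock(bid).valueOr: continue
  withBlck(blck):
    discard # refill logic for Fulu-and-later blocks goes here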

@agnxsh (Contributor, Author) commented May 14, 2025:

detect this before the block is ever fetched from the database

The only way to make this cheaper would be to check whether the earliest data column in the database is already against a Fulu block; only then does it make sense to skip this check. Otherwise, every time we pull a finalized block from the DB, it's safe to check the fork. Also, consider a situation just a couple of epochs after the Fulu transition: for a wanted block range I might get EEEEEEFFFFFFF, and I want to refill only against the trailing Fulu (F) blocks while continuing past the first few pre-Fulu (E) blocks. (The Es and Fs should ideally be far more numerous; this is just an example.)

for cindex in request_item.indices:
  let lookup = DataColumnIdentifier(block_root: request_item.block_root,
                                    index: cindex)
  if lookup notin vcus.global_refill_list and cindex in remoteCustodyColumns:
Contributor:

How often is this cindex in remoteCustodyColumns check expected to occur? Each one is an O(n) lookup, inside a for cindex in request_item.indices loop, due to

func resolve_columns_from_custody_groups*(node_id: NodeId,
                                          custody_group_count: CustodyIndex):
    seq[ColumnIndex] =
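
A hedged sketch of the cheaper shape, reusing the names from the snippet above: build the set once per peer, so each membership check inside the request loop is O(1) instead of a linear scan of the seq.

import std/sets

let remoteColumnSet = remoteCustodyColumns.toHashSet()
for request_item in vcus.requested_columns:
  var colIds: seq[ColumnIndex]
  for cindex in request_item.indices:
    let lookup = DataColumnIdentifier(block_root: request_item.block_root,
                                      index: cindex)
    # O(1) set lookup instead of scanning the seq on every index
    if lookup notin vcus.global_refill_list and cindex in remoteColumnSet:
      colIds.add cindex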


Contributor:

3dc502b

How does this commit affect checkIntersectingCustody()?

github-actions bot commented May 5, 2025

Unit Test Results

    15 files ±0    2 630 suites ±0    1h 15m 40s ⏱️ (+1m 6s)
 6 372 tests ±0:  5 870 passed ✔️ ±0,  502 skipped 💤 ±0,  0 failed ±0
44 281 runs  ±0: 43 603 passed ✔️ ±0,  678 skipped 💤 ±0,  0 failed ±0

Results for commit 3dc502b. ± Comparison against base commit 6d759d3.

♻️ This comment has been updated with latest results.

tersec (Contributor) commented May 5, 2025:

https://github.com/status-im/nimbus-eth2/actions/runs/14842398481/job/41668253800?pr=7127

2025-05-05T18:21:02.9784725Z Hint: mm: refc; threads: on; opt: speed; options: -d:release
2025-05-05T18:21:02.9787538Z 378176 lines; 215.223s; 10.377GiB peakmem; proj: /github-runner/github-runner-node-02/workspace/nimbus-eth2/nimbus-eth2/tests/all_tests.nim; out: /github-runner/github-runner-node-02/workspace/nimbus-eth2/nimbus-eth2/nimcache/release/all_tests/all_tests.json [SuccessX]
2025-05-05T18:24:28.9640792Z /github-runner/github-runner-node-02/workspace/nimbus-eth2/nimbus-eth2/vendor/nimbus-build-system/vendor/Nim/lib/system/iterators.nim: In function ‘_ZN9confutils8loadImplE8typeDescIN4conf14BeaconNodeConfEE3seqI6stringE6string6string4bool4bool3refIN9anonymous16SecondarySourcesEE4procIN4conf14BeaconNodeConfE3refIN9anonymous16SecondarySourcesEEE6string.constprop’:
2025-05-05T18:24:28.9649194Z /github-runner/github-runner-node-02/workspace/nimbus-eth2/nimbus-eth2/vendor/nim-confutils/confutils.nim:910:15: note: variable tracking size limit exceeded with ‘-fvar-tracking-assignments’, retrying without
2025-05-05T18:24:28.9786592Z   910 | proc loadImpl[C, SecondarySources](
2025-05-05T18:24:28.9787035Z       |               ^
2025-05-05T18:25:06.9981057Z /github-runner/github-runner-node-02/workspace/nimbus-eth2/nimbus-eth2/vendor/nimbus-build-system/vendor/Nim/lib/system/iterators.nim: In function ‘_ZN9confutils8loadImplE8typeDescIN4conf14BeaconNodeConfEE3seqI6stringE6string6string4bool4bool3refIN14initBeaconNode16SecondarySourcesEE4procIN4conf14BeaconNodeConfE3refIN14initBeaconNode16SecondarySourcesEEE6string.constprop’:
2025-05-05T18:25:06.9985700Z /github-runner/github-runner-node-02/workspace/nimbus-eth2/nimbus-eth2/vendor/nim-confutils/confutils.nim:910:15: note: variable tracking size limit exceeded with ‘-fvar-tracking-assignments’, retrying without
2025-05-05T18:25:06.9986981Z   910 | proc loadImpl[C, SecondarySources](
2025-05-05T18:25:06.9987472Z       |               ^
2025-05-05T18:25:26.2738160Z /github-runner/github-runner-node-02/workspace/nimbus-eth2/nimbus-eth2/vendor/nimbus-build-system/vendor/Nim/lib/system/iterators_1.nim: In function ‘_ZN17validator_custody25detectNewValidatorCustodyE3refIN17validator_custody16ValidatorCustodyEE’:
2025-05-05T18:25:26.2740847Z /github-runner/github-runner-node-02/workspace/nimbus-eth2/nimbus-eth2/beacon_chain/sync/validator_custody.nim:67:15: error: stack usage is 11312192 bytes [-Werror=stack-usage=]
2025-05-05T18:25:26.2742260Z    67 | proc detectNewValidatorCustody(vcus: ValidatorCustodyRef): seq[ColumnIndex] =
2025-05-05T18:25:26.2742997Z       |               ^
2025-05-05T18:26:16.7631562Z /github-runner/github-runner-node-02/workspace/nimbus-eth2/nimbus-eth2/vendor/nim-unittest2/unittest2.nim: In function ‘_ZN22test_gossip_transition18runTestX60gensym3_E6string6string’:
2025-05-05T18:26:16.7736235Z /github-runner/github-runner-node-02/workspace/nimbus-eth2/nimbus-eth2/vendor/nim-unittest2/unittest2.nim:1101:15: note: variable tracking size limit exceeded with ‘-fvar-tracking-assignments’, retrying without
2025-05-05T18:26:16.7737632Z  1101 |   proc runTest(suiteName, testName: string): TestStatus {.raises: [], gensym.} =
2025-05-05T18:26:16.7738172Z       |               ^
2025-05-05T18:26:18.2169452Z /github-runner/github-runner-node-02/workspace/nimbus-eth2/nimbus-eth2/vendor/nim-unittest2/unittest2.nim:1101:15: note: variable tracking size limit exceeded
2025-05-05T18:26:19.0940375Z lto1: some warnings being treated as errors
2025-05-05T18:26:19.1458084Z make[2]: *** [/tmp/ccScWUVj.mk:305: /tmp/ccDZizXd.ltrans101.ltrans.o] Error 1
2025-05-05T18:26:19.1459044Z make[2]: *** Waiting for unfinished jobs....
2025-05-05T18:26:45.1057573Z /github-runner/github-runner-node-02/workspace/nimbus-eth2/nimbus-eth2/vendor/nim-unittest2/unittest2.nim: In function ‘_ZN19test_toblindedblock21runTestX60gensym1069_E6string6string’:
2025-05-05T18:26:45.1134244Z /github-runner/github-runner-node-02/workspace/nimbus-eth2/nimbus-eth2/vendor/nim-unittest2/unittest2.nim:1101:15: note: variable tracking size limit exceeded with ‘-fvar-tracking-assignments’, retrying without
2025-05-05T18:26:45.1135690Z  1101 |   proc runTest(suiteName, testName: string): TestStatus {.raises: [], gensym.} =
2025-05-05T18:26:45.1136273Z       |               ^
2025-05-05T18:27:20.9297450Z /github-runner/github-runner-node-02/workspace/nimbus-eth2/nimbus-eth2/vendor/nim-unittest2/unittest2.nim: In function ‘_ZN19test_toblindedblock21runTestX60gensym2038_E6string6string’:
2025-05-05T18:27:20.9300358Z /github-runner/github-runner-node-02/workspace/nimbus-eth2/nimbus-eth2/vendor/nim-unittest2/unittest2.nim:1101:15: note: variable tracking size limit exceeded with ‘-fvar-tracking-assignments’, retrying without
2025-05-05T18:27:22.1837786Z lto-wrapper: fatal error: make returned 2 exit status
2025-05-05T18:27:22.1838783Z compilation terminated.
2025-05-05T18:27:22.3302505Z /usr/bin/ld: error: lto-wrapper failed
2025-05-05T18:27:22.3436028Z collect2: error: ld returned 1 exit status
2025-05-05T18:27:22.3468012Z make[1]: *** [nimcache/release/all_tests/all_tests.makefile:3760: build] Error 1
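
As a hedged aside on the stack-usage error above: -Werror=stack-usage failures like this usually point to a large value-type local kept in the stack frame, and a common workaround is to move it to the heap. The types and names below are invented for illustration only, not the actual culprit in detectNewValidatorCustody.

# Illustrative pattern only: keep the frame small by allocating the large
# value on the heap via a ref instead of declaring it as a plain local.
type BigScratch = array[16_384, array[64, byte]]  # ~1 MiB if kept by value

proc usesHeapScratch(): byte =
  # before: var scratch: BigScratch      # whole array lives in the stack frame
  var scratch = new BigScratch           # ref BigScratch, allocated on the heap
  scratch[][0][0] = 42
  scratch[][0][0]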

      vcus.network.nodeId,
      max(vcus.dag.cfg.SAMPLES_PER_SLOT.uint64,
          vcustody))
  newer_column_set = newer_columns.toHashSet()
Contributor:

I'm not sure why

func resolve_columns_from_custody_groups*(node_id: NodeId,
                                          custody_group_count: CustodyIndex):
    seq[ColumnIndex] =
  let
    custody_groups = node_id.get_custody_groups(custody_group_count)
  var flattened =
    newSeqOfCap[ColumnIndex](COLUMNS_PER_GROUP * custody_groups.len)
  for group in custody_groups:
    for index in compute_columns_for_custody_group(group):
      flattened.add index
  flattened

doesn't just construct the set in place with incl (or similar) and return a HashSet[ColumnIndex] instead of this seq[ColumnIndex], which, as far as I can tell, every single user of resolve_columns_from_custody_groups() converts to a HashSet before using or storing it:

custody_columns_set =
  node.network.nodeId.resolve_column_sets_from_custody_groups(
    max(SAMPLES_PER_SLOT.uint64,
        localCustodyGroups))

calls only the wrapper function, to get its HashSet[ColumnIndex].

RequestManager* = object
  network*: Eth2Node
  supernode*: bool
  custody_columns_set: HashSet[ColumnIndex]

stores a HashSet[ColumnIndex], not a seq[ColumnIndex], which it gets from the nimbus_beacon_node code.

remoteCustodyColumns =
  remoteNodeId.resolve_column_sets_from_custody_groups(
    max(SAMPLES_PER_SLOT.uint64,
        remoteCustodyGroupCount))
for local_column in rman.custody_columns_set:
  if local_column notin remoteCustodyColumns:
    return false

uses the resolve_column_sets_from_custody_groups() wrapper again.

And this code here doesn't use resolve_column_sets_from_custody_groups() but does the same thing. So it seems like in general resolve_columns_from_custody_groups() should natively return a HashSet and avoid the redundant copying, allocating, conversions, et cetera.

At that point, one wouldn't need a separate

func resolve_column_sets_from_custody_groups*(node_id: NodeId,
                                              custody_group_count: CustodyIndex):
    HashSet[ColumnIndex] =
  node_id.resolve_columns_from_custody_groups(custody_group_count).toHashSet()

at all, because all those HashSet[ColumnIndex] users would just be able to call resolve_columns_from_custody_groups() directly.

Contributor:

custody_columns =
  router[].network.cfg.resolve_columns_from_custody_groups(
    router[].network.node_id,
    max(SAMPLES_PER_SLOT.uint64,
        metadata))
var final_columns: seq[DataColumnSidecar]
for dc in dataColumns:
  if dc.index in custody_columns:
    final_columns.add dc

doesn't use the HashSet but probably should, and it wouldn't particularly be more expensive if it were constructed natively rather than converted from the seq.
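
A minimal sketch of that, reusing the names from the snippet above and assuming custody_columns stays a seq for now:

import std/sets

let custody_column_set = custody_columns.toHashSet()
var final_columns: seq[DataColumnSidecar]
for dc in dataColumns:
  if dc.index in custody_column_set:  # O(1) membership check instead of a seq scan
    final_columns.add dc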

Contributor (Author):

You are looking at the wrong branch. I think there are significant use cases of resolve_columns_from_custody_groups, and the HashSet conversion only applies to most syncers/refillers/backfillers, where lookups are more frequent.

Contributor:

you are looking at the wrong branch

Yes, this is true -- those were mostly from the stable branch, so e.g., resolve_column_sets_from_custody_groups isn't present in either column-syncer or vcus, so that part isn't relevant/accurate.

That said:

i think there are significant use cases of resolve_columns_from_custody_groups, and hashset conversion is only applicable to most syncers/refillers/backfillers where lookup is more frequent

Looking again specifically at the current state of the column-syncer and vcus branches:

In column-syncer:

Doesn't use a HashSet, but probably should; interface-wise it only cares that membership testing with "in" works:

let
  metadata = router[].network.metadata.custody_group_count
  custody_columns =
    router[].network.cfg.resolve_columns_from_custody_groups(
      router[].network.node_id,
      max(SAMPLES_PER_SLOT.uint64,
          metadata))
var final_columns: seq[DataColumnSidecar]
for dc in dataColumns:
  if dc.index in custody_columns:
    final_columns.add dc

Immediately converts to HashSet via toHashSet:

let
  remoteNodeId = fetchNodeIdFromPeerId(peer)
  remoteCustodyColumns =
    rman.cfg.resolve_columns_from_custody_groups(
      remoteNodeId,
      max(rman.cfg.SAMPLES_PER_SLOT.uint64,
          remoteCustodyGroupCount))
  remoteSet = remoteCustodyColumns.toHashSet()
for local_column in rman.custody_columns_set:
  if local_column notin remoteSet:
    return false

Immediately converts to HashSet via toHashSet for RequestManager initialization purposes, but also uses dataColumnQuarantine[].custody_columns with the seq:

dataColumnQuarantine[].custody_columns =
  dag.cfg.resolve_columns_from_custody_groups(
    node.network.nodeId,
    max(dag.cfg.SAMPLES_PER_SLOT.uint64,
        localCustodyGroups))
let
  custody_columns_set =
    dataColumnQuarantine[].custody_columns.toHashSet()

https://github.com/status-im/nimbus-eth2/blob/083ed100bab70e07331f0c3e9692b1b5d3eed412/beacon_chain/consensus_object_pools/data_column_quarantine.nim from column-syncer itself uses the custody_columns field in a few ways:

for col_idx in quarantine.custody_columns:


for idx in quarantine.custody_columns:

if collectedColumns.len >= (quarantine.custody_columns.len div 2):

if collectedColumns.len == quarantine.custody_columns.len:

The field is marked for export and probably shouldn't be: it's mainly there so it can be initialized from nimbus_beacon_node, and nothing outside that module reads it later.

Anyway, it does two things: it iterates (in a way where order doesn't appear to matter, so HashSet's arbitrary ordering is fine) and it checks the length, both of which are fine with a HashSet. And this HashSet is constructed anyway, a line later in nimbus_beacon_node.

That's all the usage in column-syncer. On to vcus (this PR):

message_router's usage is identical:

let
  metadata = router[].network.metadata.custody_group_count
  custody_columns =
    router[].network.cfg.resolve_columns_from_custody_groups(
      router[].network.node_id,
      max(SAMPLES_PER_SLOT.uint64,
          metadata))
var final_columns: seq[DataColumnSidecar]
for dc in dataColumns:
  if dc.index in custody_columns:
    final_columns.add dc

request_manager's usage is identical:

let
  remoteNodeId = fetchNodeIdFromPeerId(peer)
  remoteCustodyColumns =
    rman.cfg.resolve_columns_from_custody_groups(
      remoteNodeId,
      max(rman.cfg.SAMPLES_PER_SLOT.uint64,
          remoteCustodyGroupCount))
  remoteSet = remoteCustodyColumns.toHashSet()
for local_column in rman.custody_columns_set:
  if local_column notin remoteSet:
    return false

nimbus_beacon_node's usage is identical:

dataColumnQuarantine[].custody_columns =
  dag.cfg.resolve_columns_from_custody_groups(
    node.network.nodeId,
    max(dag.cfg.SAMPLES_PER_SLOT.uint64,
        localCustodyGroups))
let
  custody_columns_set =
    dataColumnQuarantine[].custody_columns.toHashSet()

The data column quarantine usage is the same (https://github.com/status-im/nimbus-eth2/blob/083ed100bab70e07331f0c3e9692b1b5d3eed412/beacon_chain/consensus_object_pools/data_column_quarantine.nim): iteration where the exact order shouldn't matter, plus length checking, both easy and reasonably efficient with the HashSet that's constructed anyway.

But vcus adds validator_custody usage.

Immediately converts via toHashSet:

let
  newer_columns =
    vcus.dag.cfg.resolve_columns_from_custody_groups(
      vcus.network.nodeId,
      max(vcus.dag.cfg.SAMPLES_PER_SLOT.uint64,
          vcustody))
  newer_column_set = newer_columns.toHashSet()

Doesn't use a HashSet, but probably should:

let
  remoteNodeId = fetchNodeIdFromPeerId(peer)
  remoteCustodyColumns =
    vcus.dag.cfg.resolve_columns_from_custody_groups(
      remoteNodeId,
      max(vcus.dag.cfg.SAMPLES_PER_SLOT.uint64,
          remoteCustodyGroupCount))
for request_item in vcus.requested_columns:
  var colIds: seq[ColumnIndex]
  for cindex in request_item.indices:
    let lookup = DataColumnIdentifier(block_root: request_item.block_root,
                                      index: cindex)
    if lookup notin vcus.global_refill_list and cindex in remoteCustodyColumns:
      colIds.add cindex

Contributor:

Deliberately putting this in a separate comment:

func resolve_columns_from_custody_groups*(cfg: RuntimeConfig, node_id: NodeId,
                                          custody_group_count: CustodyIndex):
    seq[ColumnIndex] =
  let
    custody_groups = node_id.get_custody_groups(custody_group_count)
  var flattened =
    newSeqOfCap[ColumnIndex](COLUMNS_PER_GROUP * custody_groups.len)
  for group in custody_groups:
    for index in compute_columns_for_custody_group(cfg, group):
      flattened.add index
  flattened

could just as easily construct the HashSet directly, which most of those use cases either already use or should use, and never construct the seq at all. Then there's no repeated converting to it, no issues with some places using the seq (and incurring O(n) search times) and some places not, et cetera.

The data column quarantine itself doesn't really benefit, but it's not hurt either that I can tell.
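
A minimal sketch of that variant, keeping the signature above but building the set in place (assuming the same get_custody_groups and compute_columns_for_custody_group helpers):

import std/sets

func resolve_columns_from_custody_groups*(cfg: RuntimeConfig, node_id: NodeId,
                                          custody_group_count: CustodyIndex):
    HashSet[ColumnIndex] =
  let custody_groups = node_id.get_custody_groups(custody_group_count)
  var columns = initHashSet[ColumnIndex]()
  for group in custody_groups:
    for index in compute_columns_for_custody_group(cfg, group):
      columns.incl index  # construct the set in place, no intermediate seq
  columns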

Contributor:

https://nim-lang.org/docs/packedsets.html provides another alternative, I'm not that familiar with its performance tradeoffs.
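
For reference, a rough sketch of what packedsets usage looks like; uint16 here is a stand-in for ColumnIndex, the index values are placeholders, and the performance tradeoffs are untested:

import std/packedsets

var columnSet = initPackedSet[uint16]()
for idx in [0'u16, 1, 64, 127]:  # placeholder column indices
  columnSet.incl idx              # bit-packed storage for small dense values
doAssert 64'u16 in columnSet and 2'u16 notin columnSet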

Contributor (Author):

This can be tackled in the column-syncer branch with a subsequent rebase.

Contributor:

ok
