Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[FEAT] - checksum when deserializing the ledger snapshot file #892

Closed
nfrisby opened this issue Jan 18, 2024 · 2 comments · Fixed by #1319
Closed

[FEAT] - checksum when deserializing the ledger snapshot file #892

nfrisby opened this issue Jan 18, 2024 · 2 comments · Fixed by #1319
Assignees
Labels
enhancement New feature or request good first issue Good for newcomers

Comments

@nfrisby
Copy link
Contributor

nfrisby commented Jan 18, 2024

Internal/External
Internal

Describe the feature you'd like
The deserialization of the ledger snapshot file should fail if what was read from disk differs from what was written to disk (eg disk corruption flipped a bit).

Describe alternatives you've considered
This requested feature is not strictly necessary: some interaction with the rest of the network will (eventually) fail if the node has the wrong ledger state. But it could be arbitrarily later, which would be especially confusing. Also, this would limit one of the ubiquitous "possible corruption" explanations for failures (such as VRFKeyBadProof).

Limitation: the checksum would not ensure that the ledger state was serialized to disk correctly---only that the bitstream written to disk was the same read from disk.

@nfrisby nfrisby added enhancement New feature or request good first issue Good for newcomers labels Jan 18, 2024
@nfrisby
Copy link
Contributor Author

nfrisby commented Jan 18, 2024

@jorisdral had some suggestions about how some utilities from fs-api might help here?

@jorisdral
Copy link
Contributor

We could use the file read/write functions from System.FS.CRC. They are used by the other storage components already, e.g.,

Now that we're talking about checksums, though it is orthogonal to this issue, the System.FS.CRC module is consensus specific and should probably be moved to consensus instead.

@jorisdral jorisdral moved this to 🔖 Ready in Consensus Team Backlog Jan 24, 2024
@geo2a geo2a self-assigned this Nov 19, 2024
github-merge-queue bot pushed a commit that referenced this issue Dec 10, 2024
…1319)

Fixes #892

Integration into `cardano-node`:
IntersectMBO/cardano-node#6047. This uses the
branch
[geo2a/issue-892-checksum-snaphot-file-release-ouroboros-consensus-0.21.0.0-backport](https://github.com/IntersectMBO/ouroboros-consensus/tree/geo2a/issue-892-checksum-snaphot-file-release-ouroboros-consensus-0.21.0.0-backport)
which is the backport of this PR onto the most resent release of the
`ouroboros-consensus` package.

In this PR, we change the reading and writing disk snapshots of ledger
state. When a snapshot is taken and written to disk, an additional file
with the `.checksum` extension is written alongside it. The checksum
file contains a string that represent the CRC32 checksum of the
snapshot.

The checksum is calculated incrementally, alongside writing the snapshot
to disk. When a snapshot is read from dist, the checksum is again
calculated and compared to the tracked one. If the checksum is
different, `readSnaphot` returns the `ReadSnapshotDataCorruption` error,
indicating data corruption.

The checksum is calculated incrementally, alongside reading a writing
the data. On write, we use the
[`hPutAllCRC`](https://input-output-hk.github.io/fs-sim/fs-api/src/System.FS.CRC.html#hPutAllCRC)
function from `fs-sim`, and on read we modify the
[readIncremental](https://github.com/IntersectMBO/ouroboros-consensus/blob/892-checksum-snaphot-file/ouroboros-consensus/src/ouroboros-consensus/Ouroboros/Consensus/Util/CBOR.hs#L191)
function to compute the checksum as data is read.

To enable seamless integration into `cardano-node`, we make the check
optional. When initialising the ledger state from a snapshot in
`initLedgerDB`, we issue a warning in case the checksum file is missing
for a snapshot, but do not fail as in case of invalid snapshots.

The `db-analyser` tool ignores the checksum files by default when
reading the snapshots. We add `--disk-snapshot-checksum` flag to enabled
the check. When writing a snapshot to disk, e.g. as a result of the
`--store-ledger` analysis, `db-analyser` will always write
calculate the checksum and write it into the snapshot's `.checksum`
file.

**Tests**

There state machine test in `Test.Ouroboros.Storage.LedgerDB.OnDisk` is
relevant to this feature, and has caught a number of silly mistakes in
the process of its implementation, for example forgetting to delete a
checksum file when the snapshot is deleted.

The model in the test does not track checksums, and I do not think it
can (or should) be augmented to do that. Howerver, the `Snap` and
`Restore` events are now parameterised by the checksum flag, and the
values for the flag are randomised when generating these events. This
leads to testing the following properties:
- this feature is backwards-compatible, i.e. the `Restore` events will
always lead to restoring from a snapshot, even if `Snap` events do not
write checksum files (i.e. their flag is `NoDoDiskSnapshotChecksum` ~=
`False`).
- If the interpretation of the `DoDiskSnapshotChecksum` flag changes in
the code base and becomes strict, i.e. hard fail if the checksum file is
missing, this test will discover that.

**Effects on Performance**:

Running `db-analyser` to read a ledger snapshot and store the snapshot
of the state at the next slot shows a difference of 2 seconds on my
machine. See a comment below for the logs.

To precisely evaluate the effects, we need a micro-benchmark of the
reading and writing of snapshots with and without the checksum
calculation.
@github-project-automation github-project-automation bot moved this from 🏗 In progress to ✅ Done in Consensus Team Backlog Dec 10, 2024
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
enhancement New feature or request good first issue Good for newcomers
Projects
Status: ✅ Done
Development

Successfully merging a pull request may close this issue.

3 participants