-
Notifications
You must be signed in to change notification settings - Fork 25
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
[FEAT] - checksum when deserializing the ledger snapshot file #892
Comments
@jorisdral had some suggestions about how some utilities from |
We could use the file read/write functions from Line 500 in 706c4b1
Now that we're talking about checksums, though it is orthogonal to this issue, the |
…1319) Fixes #892 Integration into `cardano-node`: IntersectMBO/cardano-node#6047. This uses the branch [geo2a/issue-892-checksum-snaphot-file-release-ouroboros-consensus-0.21.0.0-backport](https://github.com/IntersectMBO/ouroboros-consensus/tree/geo2a/issue-892-checksum-snaphot-file-release-ouroboros-consensus-0.21.0.0-backport) which is the backport of this PR onto the most resent release of the `ouroboros-consensus` package. In this PR, we change the reading and writing disk snapshots of ledger state. When a snapshot is taken and written to disk, an additional file with the `.checksum` extension is written alongside it. The checksum file contains a string that represent the CRC32 checksum of the snapshot. The checksum is calculated incrementally, alongside writing the snapshot to disk. When a snapshot is read from dist, the checksum is again calculated and compared to the tracked one. If the checksum is different, `readSnaphot` returns the `ReadSnapshotDataCorruption` error, indicating data corruption. The checksum is calculated incrementally, alongside reading a writing the data. On write, we use the [`hPutAllCRC`](https://input-output-hk.github.io/fs-sim/fs-api/src/System.FS.CRC.html#hPutAllCRC) function from `fs-sim`, and on read we modify the [readIncremental](https://github.com/IntersectMBO/ouroboros-consensus/blob/892-checksum-snaphot-file/ouroboros-consensus/src/ouroboros-consensus/Ouroboros/Consensus/Util/CBOR.hs#L191) function to compute the checksum as data is read. To enable seamless integration into `cardano-node`, we make the check optional. When initialising the ledger state from a snapshot in `initLedgerDB`, we issue a warning in case the checksum file is missing for a snapshot, but do not fail as in case of invalid snapshots. The `db-analyser` tool ignores the checksum files by default when reading the snapshots. We add `--disk-snapshot-checksum` flag to enabled the check. When writing a snapshot to disk, e.g. as a result of the `--store-ledger` analysis, `db-analyser` will always write calculate the checksum and write it into the snapshot's `.checksum` file. **Tests** There state machine test in `Test.Ouroboros.Storage.LedgerDB.OnDisk` is relevant to this feature, and has caught a number of silly mistakes in the process of its implementation, for example forgetting to delete a checksum file when the snapshot is deleted. The model in the test does not track checksums, and I do not think it can (or should) be augmented to do that. Howerver, the `Snap` and `Restore` events are now parameterised by the checksum flag, and the values for the flag are randomised when generating these events. This leads to testing the following properties: - this feature is backwards-compatible, i.e. the `Restore` events will always lead to restoring from a snapshot, even if `Snap` events do not write checksum files (i.e. their flag is `NoDoDiskSnapshotChecksum` ~= `False`). - If the interpretation of the `DoDiskSnapshotChecksum` flag changes in the code base and becomes strict, i.e. hard fail if the checksum file is missing, this test will discover that. **Effects on Performance**: Running `db-analyser` to read a ledger snapshot and store the snapshot of the state at the next slot shows a difference of 2 seconds on my machine. See a comment below for the logs. To precisely evaluate the effects, we need a micro-benchmark of the reading and writing of snapshots with and without the checksum calculation.
Internal/External
Internal
Describe the feature you'd like
The deserialization of the ledger snapshot file should fail if what was read from disk differs from what was written to disk (eg disk corruption flipped a bit).
Describe alternatives you've considered
This requested feature is not strictly necessary: some interaction with the rest of the network will (eventually) fail if the node has the wrong ledger state. But it could be arbitrarily later, which would be especially confusing. Also, this would limit one of the ubiquitous "possible corruption" explanations for failures (such as
VRFKeyBadProof
).Limitation: the checksum would not ensure that the ledger state was serialized to disk correctly---only that the bitstream written to disk was the same read from disk.
The text was updated successfully, but these errors were encountered: