@alessandrod - We chatted about this previously and I think we have both thought about this more since; I tried to capture everything in this issue. Lemme know what you think

Problem
For starters, this page provides a nice overview of where data goes in rocksdb (including the graphic in section 3). To quickly summarize the relevant context for this issue:
When data is written to rocksdb, it is written to memtables (in memory), and that data is eventually flushed to SSTs (on disk)
In the event of a crash, the memtables are obviously not persisted
To prevent that data from being permanently lost, rocks provides a feature called the Write Ahead Log (WAL). The WAL is a second place where data is written and facilitates recovery of data that lived in a memtable but not an SST
As memtables are flushed, the WAL files that have data pertaining to those memtables can be gradually cleaned up
We use the WriteBatch API which gives us atomic writes across columns. So, if a batch is applied at all, every column update within it is applied together.
The WAL allows us to assume that data will be available after the process stops (both regular and irregular stops). However, this comes at the cost of a non-trivial amount of extra I/O. Within the SSTs, and with our settings, most keys will make it no further than level 2. Assuming any given key-value pair lived in each level (L0, L1, L2), that would be 3 writes plus a 4th write to the WAL. So, on paper, removing the WAL write should give us a 25% drop (4 to 3) in Blockstore write I/O. Empirical data will be king, so I won't try to make a more accurate estimate; the key point is that this is a non-trivial amount of I/O.
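For reference, here is roughly what that write path looks like at the rust-rocksdb level. This is only a sketch — the column family names and the helper function are illustrative, and Agave wraps rocksdb behind its own Blockstore types — but it shows both the atomic WriteBatch across columns and the per-write WAL toggle that this issue is about:

```rust
use rocksdb::{WriteBatch, WriteOptions, DB};

// Illustrative only: column family names and this helper are placeholders,
// not Agave's actual Blockstore code. The point is that a WriteBatch commits
// atomically across column families, and the WAL write is a per-write option.
fn write_shred_batch(db: &DB, slot: u64, index: u64, payload: &[u8], use_wal: bool) {
    let data_cf = db.cf_handle("data_shred").expect("column family exists");
    let meta_cf = db.cf_handle("meta").expect("column family exists");

    let mut key = Vec::with_capacity(16);
    key.extend_from_slice(&slot.to_be_bytes());
    key.extend_from_slice(&index.to_be_bytes());

    // Everything in the batch lands atomically, across column families.
    let mut batch = WriteBatch::default();
    batch.put_cf(data_cf, &key, payload);
    batch.put_cf(meta_cf, &slot.to_be_bytes(), b"updated SlotMeta bytes");

    // The WAL write is the extra I/O in question; it can be skipped per write.
    let mut opts = WriteOptions::default();
    opts.disable_wal(!use_wal);
    db.write_opt(batch, &opts).expect("rocksdb write");
}
```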
It would be great if we could ensure a similar amount of safety without incurring the extra I/O all of the time. The plan below outlines a path to disabling the WAL (at least most of the time).
Proposed Solution
I think it is helpful to break the problem into two subcategories: graceful shutdown and non-graceful shutdown. The graceful case is one where the operator intentionally stops the process, whereas the non-graceful case might be something like the OOM killer stopping the process.
Both scenarios will be discussed below, but both rely on some fundamental changes:
Introduce a new field that is persisted to rocksdb, something like CLEAN_STOP that is read at startup
Upon process restart, the node will check if the flag was set and clear it if it was
In the event of a graceful shutdown, this flag will have been set
In the event of a non-graceful shutdown, the flag will NOT have been set
Depending on whether the flag had been set, the node can now respond accordingly
Adjust insert_shreds_lock to be a Mutex<bool> instead of Mutex<()>
This mutex is obtained during shred insertion and the background blockstore cleaning
The new boolean field will represent whether shred insertion should write to the WAL or not; the value will be initialized to false (both changes are sketched below)
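A minimal sketch of these two changes, with hypothetical names throughout — the CLEAN_STOP storage location and the marker accessors are placeholders, not existing Blockstore APIs:

```rust
use std::sync::Mutex;

/// Placeholder key under which the clean-stop marker would be persisted;
/// the actual column/key choice is an open implementation detail.
const CLEAN_STOP_KEY: &[u8] = b"CLEAN_STOP";

/// Hypothetical, trimmed-down stand-in for the real Blockstore.
struct Blockstore {
    // Was Mutex<()>: held during shred insertion and background cleaning.
    // The bool now also says whether shred insertion should write to the WAL.
    // Initialized to false (WAL disabled) for normal steady-state operation.
    insert_shreds_lock: Mutex<bool>,
}

impl Blockstore {
    fn new() -> Self {
        Self { insert_shreds_lock: Mutex::new(false) }
    }

    /// Startup check: returns true if the previous run shut down cleanly.
    /// Clears the marker so a later crash is detected correctly.
    fn check_and_clear_clean_stop(&self) -> bool {
        let was_clean = self.read_marker(CLEAN_STOP_KEY);
        if was_clean {
            self.delete_marker(CLEAN_STOP_KEY);
        }
        was_clean
    }

    // Placeholder persistence helpers; the real versions would hit rocksdb.
    fn read_marker(&self, _key: &[u8]) -> bool { false }
    fn delete_marker(&self, _key: &[u8]) {}
}
```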
Graceful Shutdown
The graceful shutdown scenario is the friendlier one in which a node is intentionally stopped. One such example is when agave-validator exit is run. In this situation, execute this sequence serially (a rough code sketch of the sequence is included below):
1. Obtain insert_shreds_lock to prevent any further shred insertion
2. Flush all memtables from all columns
3. Write the CLEAN_STOP flag to blockstore (with settings to persist it immediately)
4. Toggle the value in insert_shreds_lock from false to true (which will enable WAL writes)
5. Release insert_shreds_lock
6. Shred insertion in the other thread may now resume
7. Process eventually stops and is eventually restarted
8. Startup sees CLEAN_STOP flag is set; the flag is cleared and the startup routine can continue knowing that the Blockstore state is consistent
In 6., we may not actually get much more shred insertion in our current state; the exit flag getting set means that WindowService will stop. But, we might actually wish to trigger the memtable flush earlier if the flush would hold up process exit for a non-trivial amount of time. By triggering the flush earlier and continuing to run shred insertion (with WAL enabled), we'll continue to ingest shreds while also not starving other critical services that could lengthen startup time (like snapshot creation).
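Put together, the graceful-shutdown sequence might look roughly like this when written directly against the rust-rocksdb API. The column family names, the CLEAN_STOP key, and the function itself are placeholders; the real change would live behind Agave's Blockstore wrapper:

```rust
use std::sync::Mutex;

use rocksdb::{WriteBatch, WriteOptions, DB};

/// Hypothetical graceful-shutdown hook (sketch only).
fn prepare_clean_stop(db: &DB, insert_shreds_lock: &Mutex<bool>) {
    // 1. Block further shred insertion while a consistent point is established.
    let mut wal_enabled = insert_shreds_lock.lock().unwrap();

    // 2. Flush memtables so everything inserted so far lives in SSTs
    //    (column family names are illustrative; the real code would iterate
    //    every Blockstore column).
    for cf_name in ["data_shred", "code_shred", "meta"] {
        if let Some(cf) = db.cf_handle(cf_name) {
            db.flush_cf(cf).expect("flush memtable");
        }
    }

    // 3. Persist the CLEAN_STOP marker immediately (WAL enabled + synced).
    let mut opts = WriteOptions::default();
    opts.set_sync(true);
    let mut batch = WriteBatch::default();
    batch.put(b"CLEAN_STOP", b"1");
    db.write_opt(batch, &opts).expect("persist CLEAN_STOP");

    // 4. Any shreds inserted between now and process exit must hit the WAL,
    //    since they land after the flushed/consistent point.
    *wal_enabled = true;

    // 5. Guard drops here, releasing insert_shreds_lock; insertion may resume.
}
```

Note that step 3 uses a synchronous, WAL-enabled write so the marker itself cannot be lost to a crash immediately after the flush.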
Non-Graceful Shutdown
The non-graceful shutdown scenario isn't so nice, as the shutdown was unexpected. In this scenario, we need to be much more cautious about what assumptions we make. We can detect the non-graceful shutdown since the CLEAN_STOP flag will not be set. Some ideas on how to handle this case:
Blanket wipe of rocksdb; a brand new (and empty) rocksdb is implicitly consistent. This approach is simple, but has some drawbacks:
Wiping local state is not good for RPC nodes; recent reads that would typically have been served from local disk will instead hit BigTable
Extra repair traffic on the cluster
Wiping local state could make forensics difficult and/or impossible if the wiped state might have contributed to the failure
Run a consistency scanner/fixer at startup. That is, iterate over slots and check that columns are consistent (i.e., the shreds that SlotMeta says are present are actually present; see the sketch after this list). This approach avoids some of the negatives of the above approach, but has its own drawbacks:
This has the potential for serious overhead, since it would need to re-read the entire DB
We might be able to be smarter about which slots we scan (older only?)
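For the scanner idea, a very rough sketch of the shape of the check, using stand-in types and hypothetical accessors rather than the real SlotMeta/Blockstore APIs:

```rust
/// Minimal stand-in for the real SlotMeta; only the fields relevant to the
/// check are shown, and their semantics are simplified.
struct SlotMeta {
    slot: u64,
    /// Number of consecutive data shreds (from index 0) claimed to be present.
    consumed: u64,
}

/// Hypothetical accessors the scanner would need from the Blockstore.
trait ShredStore {
    fn slot_metas(&self) -> Vec<SlotMeta>;
    fn has_data_shred(&self, slot: u64, index: u64) -> bool;
}

/// Scan every slot and report any slot whose SlotMeta claims shreds that
/// cannot actually be read back. A real version would likely limit the scan
/// (e.g. only slots that could plausibly have unflushed data).
fn find_inconsistent_slots(store: &impl ShredStore) -> Vec<u64> {
    let mut bad_slots = Vec::new();
    for meta in store.slot_metas() {
        let missing = (0..meta.consumed).any(|idx| !store.has_data_shred(meta.slot, idx));
        if missing {
            bad_slots.push(meta.slot);
        }
    }
    bad_slots
}
```

Whether such a pass is acceptable depends on how long the iteration takes on a real ledger, which is exactly the overhead concern above.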