[DRAFT] Blockstore: Reduce WAL Usage #4132

steviez opened this issue Dec 16, 2024 · 2 comments
steviez commented Dec 16, 2024

Problem

For starters, this page provides a nice overview of where data goes in rocksdb (including the graphic in section 3). To quickly summarize the relevant context for this issue:

  • When data is written to rocksdb, it first lands in memtables (in memory), and that data is eventually flushed to SSTs (on disk)
  • In the event of a crash, the memtables are obviously not persisted
  • To prevent that data from being permanently lost, rocks provides a feature called the Write Ahead Log (WAL). The WAL is a second place where data is written and facilitates recovery of data that lived in a memtable but not yet in an SST
  • As memtables are flushed, the WAL files that have data pertaining to those memtables can be gradually cleaned up
  • We use the WriteBatch API, which gives us atomic writes across columns. So, if the process stops mid-write, a batch is either fully applied or not applied at all (see the sketch below)
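
As a concrete illustration of the pieces above, here is a minimal sketch using the rust-rocksdb crate directly; the column names, path, and keys are made up for the example and are not the Blockstore's real ones:

```rust
// Sketch: an atomic multi-column WriteBatch, plus the per-write option that
// skips the WAL (rust-rocksdb crate; names here are illustrative only).
use rocksdb::{Options, WriteBatch, WriteOptions, DB};

fn main() -> Result<(), rocksdb::Error> {
    let mut opts = Options::default();
    opts.create_if_missing(true);
    opts.create_missing_column_families(true);
    // "meta" and "data_shred" stand in for the Blockstore's real columns.
    let db = DB::open_cf(&opts, "/tmp/wal-sketch", ["meta", "data_shred"])?;

    let meta_cf = db.cf_handle("meta").unwrap();
    let shred_cf = db.cf_handle("data_shred").unwrap();

    // All puts in the batch land atomically: either every column sees its
    // update or none of them do.
    let mut batch = WriteBatch::default();
    batch.put_cf(meta_cf, b"slot_100_meta", b"...");
    batch.put_cf(shred_cf, b"slot_100_index_0", b"...");

    // With disable_wal(true), the batch only reaches the memtable; a crash
    // before the next flush loses it. With the default (WAL enabled), the
    // batch is also appended to the log and can be recovered after a crash.
    let mut write_opts = WriteOptions::default();
    write_opts.disable_wal(true);
    db.write_opt(batch, &write_opts)?;
    Ok(())
}
```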

The WAL allows us to assume that data will be available after the process stops (both graceful and irregular stops). However, this comes at the cost of a non-trivial amount of extra I/O. In the SSTs, and with our settings, most keys will make it no further than level 2. Assuming any given key-value pair lived in each level (L0, L1, L2), that would be 3 writes + a 4th write to the WAL. So, on paper, removing the WAL write should give us a 25% drop (4 to 3) in Blockstore write I/O. Empirical data will be king, so I won't try to make a more accurate estimate; the key point is that this is a non-trivial amount of I/O.

It would be great if we could ensure a similar amount of safety without incurring the extra I/O (all of the time). So, the following outlines a plan for getting to the point where the WAL is disabled (at least most of the time).

Proposed Solution

I think it is helpful to break the problem into two subcategories: graceful shutdown and non-graceful shutdown. The graceful shutdown case is one where the operator intentionally stops the process, whereas the non-graceful case might be something like the OOM killer stopping the process.

Both scenarios are discussed below, but first, some fundamental changes:

  1. Introduce a new field that is persisted to rocksdb, something like CLEAN_STOP, that is read at startup
    • Upon process restart, the node will check if the flag was set and clear it if it was
      • In the event of a graceful shutdown, this flag will have been set
      • In the event of a non-graceful shutdown, the flag will NOT have been set
    • Depending on whether the flag had been set, the node can now respond accordingly
  2. Adjust insert_shreds_lock to be a Mutex<bool> instead of Mutex<()>
    • This mutex is obtained during shred insertion and the background blockstore cleaning
    • The new boolean field will represent whether shred insertion should write to the WAL; the value will be initialized as false (see the sketch after this list)
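
A rough sketch of change 2 (the names and structure are illustrative, not the actual Blockstore code):

```rust
// Sketch: insert_shreds_lock carries a bool that tells shred insertion
// whether to write to the WAL; it starts out false and is only flipped to
// true as part of a graceful shutdown.
use std::sync::Mutex;

struct Blockstore {
    // Previously a Mutex<()> held during shred insertion and background
    // cleaning; the payload is now the "write to the WAL" flag.
    insert_shreds_lock: Mutex<bool>,
}

impl Blockstore {
    fn insert_shreds(&self /* , shreds: Vec<Shred>, ... */) {
        let wal_enabled = self.insert_shreds_lock.lock().unwrap();
        // Hypothetical helper: build the WriteBatch for these shreds and
        // submit it with WriteOptions::disable_wal(!*wal_enabled).
        self.write_shred_batch(*wal_enabled);
    }

    fn write_shred_batch(&self, _use_wal: bool) { /* ... */ }
}

fn main() {
    let blockstore = Blockstore {
        insert_shreds_lock: Mutex::new(false),
    };
    blockstore.insert_shreds();
}
```
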
Graceful Shutdown

The graceful shutdown scenario is the friendlier one, in which a node is intentionally stopped. One such example is when agave-validator exit is run. In this situation, execute the following sequence serially (a sketch follows the list):

  1. Obtain insert_shreds_lock to prevent any further shred insertion
  2. Flush all memtables from all columns
  3. Write the CLEAN_STOP flag to blockstore (with settings to persist it immediately)
  4. Toggle the value in insert_shreds_lock from false to true (which will enable WAL writes)
  5. Release insert_shreds_lock
  6. Shred insertion in the other thread may now resume
  7. Process eventually stops and is later restarted
  8. Startup sees CLEAN_STOP flag is set; the flag is cleared and the startup routine can continue knowing that the Blockstore state is consistent
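
A sketch of steps 1-5, building on the names above and the rust-rocksdb API; the CLEAN_STOP key, the column-name list, and the function itself are assumptions for illustration, not the actual agave shutdown path:

```rust
use rocksdb::{WriteOptions, DB};
use std::sync::Mutex;

fn prepare_clean_stop(
    db: &DB,
    insert_shreds_lock: &Mutex<bool>,
    column_names: &[&str],
) -> Result<(), rocksdb::Error> {
    // 1. Take the lock so no further shred insertion can occur underneath us.
    let mut wal_enabled = insert_shreds_lock.lock().unwrap();

    // 2. Flush every column's memtable so nothing lives only in memory.
    for name in column_names {
        if let Some(cf) = db.cf_handle(name) {
            db.flush_cf(cf)?;
        }
    }

    // 3. Persist the CLEAN_STOP marker immediately (synchronous write).
    let mut write_opts = WriteOptions::default();
    write_opts.set_sync(true);
    db.put_opt(b"CLEAN_STOP", b"1", &write_opts)?;

    // 4. From here until process exit, shred insertion writes to the WAL,
    //    so anything inserted after the flush is still recoverable.
    *wal_enabled = true;

    // 5. The guard drops here, releasing the lock and letting insertion resume.
    Ok(())
}
```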

In step 6, we may not actually get much more shred insertion in our current state; the exit flag getting set means that WindowService will stop. But we might actually wish to trigger the memtable flush earlier if the flush would hold up process exit for a non-trivial amount of time. By triggering the flush earlier and continuing to run shred insertion (with the WAL enabled), we'll continue to ingest shreds while not starving other critical services, such as snapshot creation, whose delay could lengthen startup time.

Non-Graceful Shutdown

The non-graceful shutdown scenario isn't so nice, as the shutdown was unexpected. In this scenario, we need to be much more cautious about what assumptions we make. We can detect the non-graceful shutdown since the CLEAN_STOP flag will not be set. Some ideas on how to handle this case (a sketch of the startup check follows the list):

  1. Blanket wipe of rocksdb; a brand new (and empty) rocksdb is implicitly consistent. This approach is simple, but has some drawbacks:

    • Wiping local state is not good for RPC nodes; recent reads that would typically have been served from disk will instead hit BigTable
    • Extra repair traffic on the cluster
    • Wiping local state could make any forensics difficult and/or impossible if the wiped state might have contributed to the failure
  2. Run a consistency scanner/fixer at startup. That is, iterate over slots and check that columns are consistent (ie the shreds that SlotMeta says are present are actually present). This approach avoids some of the negatives of the above approach, but has its own drawbacks:

    • This has the potential for some serious overhead, since it may need to re-read the entire DB
    • We might be able to be smarter about which slots we scan (older only?)
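
For completeness, here is a sketch of the startup-side check that ties the two scenarios together (again with illustrative names; the recovery branch is just a placeholder for whichever of the options above is chosen):

```rust
use rocksdb::DB;

// Returns true if the previous run set CLEAN_STOP, i.e. it flushed all
// memtables before stopping and the on-disk state is consistent.
fn check_previous_shutdown(db: &DB) -> Result<bool, rocksdb::Error> {
    let was_clean = db.get(b"CLEAN_STOP")?.is_some();
    if was_clean {
        // Clear the flag so a future non-graceful stop is detected.
        db.delete(b"CLEAN_STOP")?;
    } else {
        // Non-graceful shutdown: wipe rocksdb, or scan slots and verify that
        // the shreds SlotMeta claims are present actually exist, repairing
        // or purging anything inconsistent.
    }
    Ok(was_clean)
}
```
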
steviez changed the title from "Blockstore: Reduce WAL Usage" to "[DRAFT] Blockstore: Reduce WAL Usage" on Dec 16, 2024

steviez commented Dec 17, 2024

As a reference, #3838 is a PR that simply disables the WAL.

steviez commented Dec 17, 2024

@alessandrod - We chatted about this previously and I think we have both thought about this more since; I tried to capture everything in this issue. Lemme know what you think
