@alessandrod - We chatted about this previously and I think we have both thought about this more since; I tried to capture everything in this issue. Lemme know what you think

Problem
For starters, this page provides a nice overview of where data goes in rocksdb (including the graphic in section 3). To quickly summarize the relevant context for this issue:
When data is written to rocksdb, it is written to memtables (in memory), and that data is eventually flushed to SSTs (on disk)
In the event of a crash, the memtables are obviously not persisted
To prevent that data from being permanently lost, rocks provides a feature called the Write Ahead Log (WAL). The WAL is a second place where data is written and facilitates recovery of data that lived in a memtable but not an SST
As memtables are flushed, the WAL files that have data pertaining to those memtables can be gradually cleaned up
We use the WriteBatch API which gives us atomic writes across columns. So, if a batch is applied at all, every column update within it is applied together.
The WAL allows us to assume that data will be available after the process stops (both regular and irregular stops). However, this comes at the cost of a non-trivial amount of extra I/O. Within the SSTs, and with our settings, most keys will make it no further than level 2. Assuming any given key-value pair lived in each level (L0, L1, L2), that would be 3 writes plus a 4th write to the WAL. So, on paper, removing the WAL write should give us a 25% drop (4 to 3) in Blockstore write I/O. Empirical data will be king, so I won't try to make a more accurate estimate; the key point is that this is a non-trivial amount of I/O.
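For reference, here is roughly what that write path looks like at the rust-rocksdb level. This is only a sketch — the column family names and the helper function are illustrative, and Agave wraps rocksdb behind its own Blockstore types — but it shows both the atomic WriteBatch across columns and the per-write WAL toggle that this issue is about:

```rust
use rocksdb::{WriteBatch, WriteOptions, DB};

// Illustrative only: column family names and this helper are placeholders,
// not Agave's actual Blockstore code. The point is that a WriteBatch commits
// atomically across column families, and the WAL write is a per-write option.
fn write_shred_batch(db: &DB, slot: u64, index: u64, payload: &[u8], use_wal: bool) {
    let data_cf = db.cf_handle("data_shred").expect("column family exists");
    let meta_cf = db.cf_handle("meta").expect("column family exists");

    let mut key = Vec::with_capacity(16);
    key.extend_from_slice(&slot.to_be_bytes());
    key.extend_from_slice(&index.to_be_bytes());

    // Everything in the batch lands atomically, across column families.
    let mut batch = WriteBatch::default();
    batch.put_cf(data_cf, &key, payload);
    batch.put_cf(meta_cf, &slot.to_be_bytes(), b"updated SlotMeta bytes");

    // The WAL write is the extra I/O in question; it can be skipped per write.
    let mut opts = WriteOptions::default();
    opts.disable_wal(!use_wal);
    db.write_opt(batch, &opts).expect("rocksdb write");
}
```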
It would be great if we could ensure a similar amount of safety without incurring the extra I/O all of the time. The plan below outlines a path to disabling the WAL (at least most of the time).
Proposed Solution
I think it is helpful to break the problem into two subcategories: graceful shutdown and non-graceful shutdown. The graceful case is one where the operator intentionally stops the process, whereas the non-graceful case might be something like the OOM killer stopping the process.
Both scenarios will be discussed below, but both rely on some fundamental changes:
Introduce a new field that is persisted to rocksdb, something like CLEAN_STOP that is read at startup
Upon process restart, the node will check if the flag was set and clear it if it was
In the event of a graceful shutdown, this flag will have been set
In the event of a non-graceful shutdown, the flag will NOT have been set
Depending on whether the flag had been set, the node can now respond accordingly
Adjust insert_shreds_lock to be a Mutex<bool> instead of Mutex<()>
This mutex is obtained during shred insertion and the background blockstore cleaning
The new boolean field will represent whether shred insertion should write to the WAL or not; the value will be initialized to false (both changes are sketched below)
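A minimal sketch of these two changes, with hypothetical names throughout — the CLEAN_STOP storage location and the marker accessors are placeholders, not existing Blockstore APIs:

```rust
use std::sync::Mutex;

/// Placeholder key under which the clean-stop marker would be persisted;
/// the actual column/key choice is an open implementation detail.
const CLEAN_STOP_KEY: &[u8] = b"CLEAN_STOP";

/// Hypothetical, trimmed-down stand-in for the real Blockstore.
struct Blockstore {
    // Was Mutex<()>: held during shred insertion and background cleaning.
    // The bool now also says whether shred insertion should write to the WAL.
    // Initialized to false (WAL disabled) for normal steady-state operation.
    insert_shreds_lock: Mutex<bool>,
}

impl Blockstore {
    fn new() -> Self {
        Self { insert_shreds_lock: Mutex::new(false) }
    }

    /// Startup check: returns true if the previous run shut down cleanly.
    /// Clears the marker so a later crash is detected correctly.
    fn check_and_clear_clean_stop(&self) -> bool {
        let was_clean = self.read_marker(CLEAN_STOP_KEY);
        if was_clean {
            self.delete_marker(CLEAN_STOP_KEY);
        }
        was_clean
    }

    // Placeholder persistence helpers; the real versions would hit rocksdb.
    fn read_marker(&self, _key: &[u8]) -> bool { false }
    fn delete_marker(&self, _key: &[u8]) {}
}
```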
Graceful Shutdown
The graceful shutdown scenario is the friendlier one in which a node is intentionally stopped. One such example is when agave-validator exit is run. In this situation, execute this sequence serially (a rough code sketch of the sequence is included below):
1. Obtain insert_shreds_lock to prevent any further shred insertion
2. Flush all memtables from all columns
3. Write the CLEAN_STOP flag to blockstore (with settings to persist it immediately)
4. Toggle the value in insert_shreds_lock from false to true (which will enable WAL writes)
5. Release insert_shreds_lock
6. Shred insertion in the other thread may now resume
7. Process eventually stops and is eventually restarted
8. Startup sees CLEAN_STOP flag is set; the flag is cleared and the startup routine can continue knowing that the Blockstore state is consistent
In 6., we may not actually get much more shred insertion in our current state; the exit flag getting set means that WindowService will stop. But, we might actually wish to trigger the memtable flush earlier if the flush would hold up process exit for a non-trivial amount of time. By triggering the flush earlier and continuing to run shred insertion (with WAL enabled), we'll continue to ingest shreds while also not starving other critical services that could lengthen startup time (like snapshot creation).
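Put together, the graceful-shutdown sequence might look roughly like this when written directly against the rust-rocksdb API. The column family names, the CLEAN_STOP key, and the function itself are placeholders; the real change would live behind Agave's Blockstore wrapper:

```rust
use std::sync::Mutex;

use rocksdb::{WriteBatch, WriteOptions, DB};

/// Hypothetical graceful-shutdown hook (sketch only).
fn prepare_clean_stop(db: &DB, insert_shreds_lock: &Mutex<bool>) {
    // 1. Block further shred insertion while a consistent point is established.
    let mut wal_enabled = insert_shreds_lock.lock().unwrap();

    // 2. Flush memtables so everything inserted so far lives in SSTs
    //    (column family names are illustrative; the real code would iterate
    //    every Blockstore column).
    for cf_name in ["data_shred", "code_shred", "meta"] {
        if let Some(cf) = db.cf_handle(cf_name) {
            db.flush_cf(cf).expect("flush memtable");
        }
    }

    // 3. Persist the CLEAN_STOP marker immediately (WAL enabled + synced).
    let mut opts = WriteOptions::default();
    opts.set_sync(true);
    let mut batch = WriteBatch::default();
    batch.put(b"CLEAN_STOP", b"1");
    db.write_opt(batch, &opts).expect("persist CLEAN_STOP");

    // 4. Any shreds inserted between now and process exit must hit the WAL,
    //    since they land after the flushed/consistent point.
    *wal_enabled = true;

    // 5. Guard drops here, releasing insert_shreds_lock; insertion may resume.
}
```

Note that step 3 uses a synchronous, WAL-enabled write so the marker itself cannot be lost to a crash immediately after the flush.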
Non-Graceful Shutdown
The non-graceful shutdown scenario isn't so nice, as the shutdown was unexpected. In this scenario, we need to be much more cautious about what assumptions we make. We can detect the non-graceful shutdown since the CLEAN_STOP flag will not be set. Some ideas on how to handle this case:
Blanket wipe of rocksdb; a brand new (and empty) rocksdb is implicitly consistent. This approach is simple, but has some drawbacks:
Wiping local state is not good for RPC nodes; recent reads that would typically have been served from local disk will instead hit BigTable
Extra repair traffic on the cluster
Wiping local state could make forensics difficult and/or impossible if the wiped state might have contributed to the failure
Run a consistency scanner/fixer at startup. That is, iterate over slots and check that columns are consistent (i.e., the shreds that SlotMeta says are present are actually present; see the sketch after this list). This approach avoids some of the negatives of the above approach, but has its own drawbacks:
This has the potential for serious overhead, since it would need to re-read the entire DB
We might be able to be smarter about which slots we scan (older only?)
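For the scanner idea, a very rough sketch of the shape of the check, using stand-in types and hypothetical accessors rather than the real SlotMeta/Blockstore APIs:

```rust
/// Minimal stand-in for the real SlotMeta; only the fields relevant to the
/// check are shown, and their semantics are simplified.
struct SlotMeta {
    slot: u64,
    /// Number of consecutive data shreds (from index 0) claimed to be present.
    consumed: u64,
}

/// Hypothetical accessors the scanner would need from the Blockstore.
trait ShredStore {
    fn slot_metas(&self) -> Vec<SlotMeta>;
    fn has_data_shred(&self, slot: u64, index: u64) -> bool;
}

/// Scan every slot and report any slot whose SlotMeta claims shreds that
/// cannot actually be read back. A real version would likely limit the scan
/// (e.g. only slots that could plausibly have unflushed data).
fn find_inconsistent_slots(store: &impl ShredStore) -> Vec<u64> {
    let mut bad_slots = Vec::new();
    for meta in store.slot_metas() {
        let missing = (0..meta.consumed).any(|idx| !store.has_data_shred(meta.slot, idx));
        if missing {
            bad_slots.push(meta.slot);
        }
    }
    bad_slots
}
```

Whether such a pass is acceptable depends on how long the iteration takes on a real ledger, which is exactly the overhead concern above.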