Optimize indexer package #219
So I'm thinking an approach something like the following might work. The indexers' […]. There should be a […].
If we're currently less than […], then the index process also has multiple workers that read from a buffered channel.
Then finally, on start it will need to figure out which blocks, if any, have not been indexed and index them. I believe currently it just checks if […]. Also in this architecture we'd need to consider how […]. Maybe the goroutine reading jobs off the queue checks if the job is a disconnect; if so, it waits until all connects finish and then runs the disconnects serially.
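For illustration, here's a rough Go sketch of that queue-and-workers idea, with a hypothetical indexJob type and Indexer interface standing in for whatever the real indexer types end up looking like: connect jobs are consumed in parallel by a bounded set of workers, while a disconnect job first waits for every in-flight connect and is then applied on its own.

```go
package indexqueue

import "sync"

// Block stands in for the real block type (e.g. btcutil.Block); illustrative only.
type Block interface{}

// indexJob is a hypothetical unit of work handed to the index workers.
type indexJob struct {
	block      Block
	disconnect bool // true when the block is being detached during a reorg
}

// Indexer is a stand-in for the methods an index would expose here.
type Indexer interface {
	ConnectBlock(block Block) error
	DisconnectBlock(block Block) error
}

// runIndexWorkers drains jobs from a buffered channel. Connects run in
// parallel, limited to numWorkers at a time; a disconnect waits for all
// outstanding connects and then runs serially, preserving reorg ordering.
func runIndexWorkers(jobs <-chan indexJob, idx Indexer, numWorkers int) error {
	var (
		wg       sync.WaitGroup
		errOnce  sync.Once
		firstErr error
	)
	setErr := func(err error) {
		if err != nil {
			errOnce.Do(func() { firstErr = err })
		}
	}

	sem := make(chan struct{}, numWorkers) // bounds concurrent connects
	for job := range jobs {
		if job.disconnect {
			wg.Wait() // let every in-flight connect finish first
			setErr(idx.DisconnectBlock(job.block))
			continue
		}

		wg.Add(1)
		sem <- struct{}{}
		go func(j indexJob) {
			defer wg.Done()
			defer func() { <-sem }()
			setErr(idx.ConnectBlock(j.block))
		}(job)
	}
	wg.Wait()
	return firstErr
}
```

On startup, the same channel could simply be pre-filled with connect jobs for whatever blocks the catch-up scan finds missing.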
Maybe adding the indexed blocks' hashes to a Bloom filter can help make this check fast while using very little memory.
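As a rough sketch of that suggestion (the filter type and sizing below are hypothetical, nothing from the codebase), a Bloom filter keyed by block hash can answer "definitely not indexed yet" with a few bit probes, and only a reported "probably indexed" would need the slower definitive check against the index itself:

```go
package indexbloom

import "encoding/binary"

// blockFilter is a simple Bloom filter sized at construction time.
// m is the number of bits, k the number of probes per hash.
type blockFilter struct {
	bits []uint64
	m    uint64
	k    int
}

func newBlockFilter(mBits uint64, k int) *blockFilter {
	return &blockFilter{
		bits: make([]uint64, (mBits+63)/64),
		m:    mBits,
		k:    k,
	}
}

// probes derives k bit positions from a 32-byte block hash via double
// hashing over two of its words. Block hashes are already uniformly
// distributed, so no extra hash function is needed.
func (f *blockFilter) probes(hash [32]byte) []uint64 {
	h1 := binary.LittleEndian.Uint64(hash[0:8])
	h2 := binary.LittleEndian.Uint64(hash[8:16])
	h2 |= 1 // keep the stride odd so probes don't collapse
	out := make([]uint64, f.k)
	for i := 0; i < f.k; i++ {
		out[i] = (h1 + uint64(i)*h2) % f.m
	}
	return out
}

// Add marks a block hash as indexed.
func (f *blockFilter) Add(hash [32]byte) {
	for _, p := range f.probes(hash) {
		f.bits[p/64] |= 1 << (p % 64)
	}
}

// MaybeIndexed reports false only if the hash was definitely never added.
func (f *blockFilter) MaybeIndexed(hash [32]byte) bool {
	for _, p := range f.probes(hash) {
		if f.bits[p/64]&(1<<(p%64)) == 0 {
			return false
		}
	}
	return true
}
```

At today's chain height (under a million blocks), a filter with roughly 10 bits per block is on the order of 1 MB.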
The indexer package is extremely inefficient and is responsible for most of the time spent syncing the chain.
Benchmarks show that around 90-95% of the time syncing the chain is due to two processes: loading the utxos from disk to validate them, and running the indexer.
Of those two, the indexer takes about twice as long as loading the utxos. And the utxo cost can potentially be reduced dramatically by setting --utxocachemaxsize large enough to hold the entire utxo set in memory. This would leave about 90% of the sync time exclusively due to the indexer (which you obviously wouldn't pay if you turn off all indexes).
Refactoring this will take some clever engineering. Right now blocks are passed into the indexer as they are processed (and in the same db transaction). This guarantees that the indexer tracks the tip of the chain and is always caught up with the best chain.
One possible optimization is to give the indexer access to the block index and use it in separate goroutines to load blocks and index them. However, the challenge is that this might be even slower if it needs to load blocks from disk rather than having them passed in from memory. But the separate-goroutine approach may allow each indexer to be run in parallel rather than serially.
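To make the parallel part concrete, here's a minimal sketch (the Indexer interface and helper below are illustrative, not the package's actual API) where every enabled index processes the same in-memory block concurrently rather than one after another:

```go
package parallelindex

import "sync"

// Block stands in for the real block type; illustrative only.
type Block interface{}

// Indexer is a stand-in for the per-index interface.
type Indexer interface {
	ConnectBlock(block Block) error
}

// connectBlockParallel fans the same block out to every index at once and
// waits for all of them, returning the first error encountered.
func connectBlockParallel(indexes []Indexer, block Block) error {
	var (
		wg       sync.WaitGroup
		mu       sync.Mutex
		firstErr error
	)
	for _, idx := range indexes {
		wg.Add(1)
		go func(ix Indexer) {
			defer wg.Done()
			if err := ix.ConnectBlock(block); err != nil {
				mu.Lock()
				if firstErr == nil {
					firstErr = err
				}
				mu.Unlock()
			}
		}(idx)
	}
	wg.Wait()
	return firstErr
}
```

The catch, as noted above, is that the indexes currently share the block's db transaction, so running them concurrently would likely mean per-index transactions (or a backend that tolerates concurrent writers), and that's where the clever engineering comes in.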
If you're a strong Go dev who would like to tackle this optimization, it would be greatly appreciated.