Proposal: Avoiding CRDB BulkImport Timeouts #1544
The current implementation of BulkImport in CRDB runs all inserts in a single transaction. When that transaction is held open for too long, users start seeing TransactionRetryWithProtoRefreshError failures.
To work around this limitation, users loading large amounts of data need to split the load across multiple BulkImport calls, each small enough to commit before the transaction is held open for too long.
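A minimal sketch of that workaround in Go follows. It is not the real client API: `importStream`, `importInCalls`, and `fakeStream` are hypothetical names, and the sketch assumes that everything sent on one stream commits as a single transaction when the stream is closed. The relationships are modelled as plain strings to keep the example self-contained.

```go
package main

import (
	"errors"
	"fmt"
)

// importStream is a hypothetical stand-in for one BulkImport call.
// Assumption: everything sent on one stream commits in one transaction
// when CloseAndRecv returns.
type importStream interface {
	Send(batch []string) error
	CloseAndRecv() (loaded int, err error)
}

// importInCalls splits rels into batches of batchSize and opens a fresh
// stream every batchesPerCall batches, so that no single transaction grows
// without bound. It returns how many relationships were committed, which
// is also the offset to resume from after a failure.
func importInCalls(open func() (importStream, error), rels []string, batchSize, batchesPerCall int) (int, error) {
	if batchSize <= 0 || batchesPerCall <= 0 {
		return 0, errors.New("batchSize and batchesPerCall must be positive")
	}
	perCall := batchSize * batchesPerCall
	committed := 0
	for start := 0; start < len(rels); start += perCall {
		end := start + perCall
		if end > len(rels) {
			end = len(rels)
		}
		stream, err := open()
		if err != nil {
			return committed, err
		}
		// Send this call's share of the data one batch at a time.
		for i := start; i < end; i += batchSize {
			j := i + batchSize
			if j > end {
				j = end
			}
			if err := stream.Send(rels[i:j]); err != nil {
				return committed, err
			}
		}
		// Closing the stream is where the transaction is assumed to commit.
		n, err := stream.CloseAndRecv()
		if err != nil {
			return committed, err
		}
		committed += n
	}
	return committed, nil
}

// fakeStream only counts what it receives so the sketch can run without a server.
type fakeStream struct{ rels int }

func (s *fakeStream) Send(batch []string) error  { s.rels += len(batch); return nil }
func (s *fakeStream) CloseAndRecv() (int, error) { return s.rels, nil }

func main() {
	calls := 0
	open := func() (importStream, error) { calls++; return &fakeStream{}, nil }
	n, err := importInCalls(open, make([]string, 2500), 100, 10)
	fmt.Println(n, calls, err)
}
```

The right values for `batchSize` and `batchesPerCall` depend on how quickly the datastore can absorb each batch; the numbers used in `main` are placeholders, not recommendations.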
Per-Batch Transactions
A simple solution would be to write every batch in a separate transaction. This is simpler to reason about (users are unlikely to ever hit transaction timeouts), but is strictly less flexible than what we have today.
For example, the current API already supports a transaction per batch: include only one batch per call.
That said, the practical limit on the number of batches per transaction in CRDB is quite low (roughly what can fit into 5s), so for the CRDB driver at least, defaulting to a transaction per batch is probably sane.
Drawbacks
This will force all BulkImports with more than one batch into multiple transactions, even those that could have comfortably fit into a single transaction. If the BulkImport is interrupted, it may be harder to know where to resume from.
This may not be much of an issue in practice; you could simply replay the entire import and ignore errors. But we might want to discuss the implications for resumability further and come up with some alternatives before moving ahead with this proposal.