Proposal: Avoiding CRDB BulkImport Timeouts #1544

ecordell · 2023-09-22T21:09:54Z

The current implementation of BulkImport in CRDB runs all inserts in a single transaction. When that transaction is held open for too long, users will start seeing TransactionRetryWithProtoRefreshErrors:

When a transaction A is forced to refresh (i.e., change its timestamp) due to hitting the maximum closed timestamp interval (closed timestamps enable Follower Reads and Change Data Capture (CDC)). This can happen when transaction A is a long-running transaction, and there is a write by another transaction to data that A has already read.

To work around this limitation, users loading large amounts of data need to call BulkImport in batches:

client := v1.NewExperimentalServiceClient(conn)

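const batchesPerAPICall = 20 // determined experimentally

// Each outer iteration is one API call, i.e. one transaction.
// Note: the integer division skips any remainder batches when numBatches
// is not a multiple of batchesPerAPICall.
for i := 0; i < numBatches/batchesPerAPICall; i++ {
	writer, err := client.BulkImportRelationships(ctx)
	if err != nil {
		return err
	}
	for batchNum := i * batchesPerAPICall; batchNum < (i+1)*batchesPerAPICall; batchNum++ {
		batch := make([]*v1.Relationship, 0, batchSize)

		// rel is a placeholder for the relationships belonging to this batch.
		for j := 0; j < batchSize; j++ {
			batch = append(batch, rel)
		}

		err := writer.Send(&v1.BulkImportRelationshipsRequest{
			Relationships: batch,
		})
		if err != nil {
			return err
		}
	}

	// Closing the stream ends this call's transaction.
	if _, err := writer.CloseAndRecv(); err != nil {
		return err
	}
}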
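When numBatches is not an exact multiple of batchesPerAPICall, the last API call has to carry fewer batches. Here is a minimal sketch of computing which batches go into which call, including that trailing partial group. The `batchRange` type and the `callRanges` helper are hypothetical names used only for illustration; they are not part of authzed-go.

```go
package main

import "fmt"

// batchRange is a hypothetical helper type: batches [Start, End) are sent in
// one BulkImportRelationships call, and therefore in one transaction.
type batchRange struct{ Start, End int }

// callRanges splits numBatches into consecutive groups of at most perCall
// batches. The final group is shorter when perCall does not divide numBatches.
func callRanges(numBatches, perCall int) ([]batchRange, error) {
	if perCall <= 0 {
		return nil, fmt.Errorf("perCall must be positive, got %d", perCall)
	}
	var ranges []batchRange
	for start := 0; start < numBatches; start += perCall {
		end := start + perCall
		if end > numBatches {
			end = numBatches
		}
		ranges = append(ranges, batchRange{Start: start, End: end})
	}
	return ranges, nil
}

func main() {
	ranges, err := callRanges(45, 20)
	if err != nil {
		panic(err)
	}
	for _, r := range ranges {
		fmt.Printf("one API call for batches [%d, %d)\n", r.Start, r.End)
	}
}
```

Each returned range would replace the `i * batchesPerAPICall` / `(i+1) * batchesPerAPICall` bounds of the inner loop.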

Per-Batch Transactions

A simple solution would be to have every batch write in a separate transaction. This is simpler to reason about (users would be unlikely to ever hit transaction timeouts), but is strictly less flexible than what we have today.

For example, the current API already supports a transaction per batch if you include only one batch per call:

for batchNum := 0; batchNum < numBatches; batchNum++ {
	// One call per batch, so each batch gets its own transaction.
	writer, err := client.BulkImportRelationships(ctx)
	if err != nil {
		return err
	}
	batch := make([]*v1.Relationship, 0, batchSize)

	// rel is a placeholder for the relationships belonging to this batch.
	for i := 0; i < batchSize; i++ {
		batch = append(batch, rel)
	}

	err = writer.Send(&v1.BulkImportRelationshipsRequest{
		Relationships: batch,
	})
	if err != nil {
		return err
	}
	if _, err := writer.CloseAndRecv(); err != nil {
		return err
	}
}

That said, the practical limit on the number of batches per transaction in CRDB is quite low (roughly what can fit into 5s), so for the CRDB driver at least, defaulting to a transaction per batch is probably sane.
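A middle ground between a fixed batch count and one transaction per batch would be to cap each call by elapsed time. The following is only a sketch under assumptions: `sendWithinBudget` and `newFakeClock` are hypothetical helpers, relationships are represented as plain strings, and the clock and send function are injected so the example runs without a SpiceDB connection. The budget a real client would use is an open question; the 5s figure above is only a rough upper bound.

```go
package main

import (
	"fmt"
	"time"
)

// sendWithinBudget sends batches in order until the time budget is used up and
// returns how many were sent. The caller would then close the stream (ending
// that transaction) and open a new one for the remaining batches.
func sendWithinBudget(batches [][]string, budget time.Duration, now func() time.Time, send func([]string) error) (int, error) {
	start := now()
	for i, batch := range batches {
		// Always send at least one batch so the caller makes progress.
		if i > 0 && now().Sub(start) >= budget {
			return i, nil
		}
		if err := send(batch); err != nil {
			return i, fmt.Errorf("batch %d: %w", i, err)
		}
	}
	return len(batches), nil
}

// newFakeClock returns a clock and a send function for demonstration only:
// every send advances the clock by step and never fails.
func newFakeClock(step time.Duration) (func() time.Time, func([]string) error) {
	clock := time.Unix(0, 0)
	now := func() time.Time { return clock }
	send := func([]string) error {
		clock = clock.Add(step)
		return nil
	}
	return now, send
}

func main() {
	batches := [][]string{{"a"}, {"b"}, {"c"}, {"d"}}
	now, send := newFakeClock(2 * time.Second)

	for len(batches) > 0 {
		n, err := sendWithinBudget(batches, 3*time.Second, now, send)
		if err != nil {
			panic(err)
		}
		fmt.Printf("sent %d batches in one transaction\n", n)
		batches = batches[n:]
	}
}
```

Because the budget is only checked between batches, a single slow batch can still overrun it, so the budget would need to sit well below the actual limit.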

Drawbacks

This will force all BulkImports with more than one batch into multiple transactions, even those that could have comfortably fit into a single transaction. If the BulkImport is interrupted, it may be harder to know where to resume from.

This may not be much of an issue in practice; you could simply replay the entire import and ignore errors. But we might want to discuss the implications for resumability some more, and come up with some alternatives, before moving ahead with this proposal.
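One possible direction for resumability, sketched under assumptions: if every batch is its own transaction, the client only needs to remember the index of the first batch that was not committed. The `commitBatch` callback below stands in for a single-batch BulkImportRelationships call (Send followed by CloseAndRecv); it, `importFrom`, and `errInterrupted` are hypothetical names, and relationships are represented as plain strings to keep the example self-contained.

```go
package main

import (
	"errors"
	"fmt"
)

// errInterrupted simulates a failure in the middle of an import.
var errInterrupted = errors.New("simulated interruption")

// commitBatch stands in for one API call that writes a single batch in its own
// transaction. It is a hypothetical callback, not an authzed-go API.
type commitBatch func(batch []string) error

// importFrom commits batches[start:] one at a time. It returns the index of
// the first batch that was not committed, so an interrupted import can be
// resumed by calling importFrom again with that index.
func importFrom(batches [][]string, start int, commit commitBatch) (int, error) {
	for i := start; i < len(batches); i++ {
		if err := commit(batches[i]); err != nil {
			return i, fmt.Errorf("batch %d: %w", i, err)
		}
	}
	return len(batches), nil
}

func main() {
	batches := [][]string{{"a", "b"}, {"c", "d"}, {"e"}}
	failOnce := true
	commit := func(batch []string) error {
		if failOnce && batch[0] == "c" {
			failOnce = false
			return errInterrupted
		}
		return nil
	}

	next, err := importFrom(batches, 0, commit)
	fmt.Println("stopped at batch", next, "error:", err)

	next, err = importFrom(batches, next, commit)
	fmt.Println("finished at batch", next, "error:", err)
}
```

This does not cover the ambiguous case where a batch was committed but the response was lost; replaying that batch would presumably fail on relationships that already exist, which is where the "replay and ignore errors" approach would still be needed.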

josephschorr · 2024-02-21T19:14:03Z

This is handled now by a new RetryableBulkImport call in the Go client library: authzed/authzed-go#165

jzelinskie added the area/perf (Affects performance or scalability), area/api v1 (Affects the v1 API), area/datastore (Affects the storage system), and kind/proposal (Something fundamentally needs to change) labels on Sep 23, 2023

josephschorr mentioned this issue Dec 19, 2023

refactor backup restore to handle serialization errors and conflicts authzed/zed#316

Merged

josephschorr closed this as completed Feb 21, 2024
