Proposal: Avoiding CRDB BulkImport Timeouts #1544

Closed
ecordell opened this issue Sep 22, 2023 · 1 comment

ecordell commented Sep 22, 2023

The current implementation of BulkImport in CRDB runs all inserts in a single transaction. When that transaction is held open for too long, users will start seeing TransactionRetryWithProtoRefreshErrors:

When a transaction A is forced to refresh (i.e., change its timestamp) due to hitting the maximum closed timestamp interval (closed timestamps enable Follower Reads and Change Data Capture (CDC)). This can happen when transaction A is a long-running transaction, and there is a write by another transaction to data that A has already read.

To work around this limitation, users loading large amounts of data need to call BulkImport in batches:

client := v1.NewExperimentalServiceClient(conn)

const batchesPerAPICall = 20 // determined experimentally

for i := 0; i < numBatches/batchesPerAPICall; i++ {
	writer, err := client.BulkImportRelationships(ctx)
	if err != nil {
		return err
	}
	for batchNum := i * batchesPerAPICall; batchNum < (i+1)*batchesPerAPICall; batchNum++ {
		batch := make([]*v1.Relationship, 0, batchSize)

		// rel is a placeholder for the relationship being imported
		for j := 0; j < batchSize; j++ {
			batch = append(batch, rel)
		}

		err := writer.Send(&v1.BulkImportRelationshipsRequest{
			Relationships: batch,
		})
		if err != nil {
			return err
		}
	}

	if _, err := writer.CloseAndRecv(); err != nil {
		return err
	}
}

Per-Batch Transactions

A simple solution would be to have every batch written in a separate transaction. This is simpler to reason about (users would be unlikely to ever hit transaction timeouts), but is strictly less flexible than what we have today.

For example, the current API already supports a transaction per batch by sending only one batch per call:

for batchNum := 0; batchNum < numBatches; batchNum++ {
	writer, err := client.BulkImportRelationships(ctx)
	if err != nil {
		return err
	}

	batch := make([]*v1.Relationship, 0, batchSize)
	for i := 0; i < batchSize; i++ {
		batch = append(batch, rel)
	}

	err = writer.Send(&v1.BulkImportRelationshipsRequest{
		Relationships: batch,
	})
	if err != nil {
		return err
	}

	if _, err := writer.CloseAndRecv(); err != nil {
		return err
	}
}

That said, the practical limit on the number of batches per transaction in CRDB is quite low (roughly whatever can fit into 5s), so for the CRDB driver at least, defaulting to a transaction per batch is probably sane.
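
For illustration, the server-side handler under this proposal could commit each received batch in its own datastore transaction, roughly along these lines. This is a minimal sketch: the datastore and relationshipWriter interfaces are placeholders invented for the example (not SpiceDB's actual datastore API), and it assumes the generated gRPC stream types plus the usual context/io imports:

// Sketch of a per-batch-transaction server handler; not the actual implementation.
type datastore interface {
	// ReadWriteTx runs fn inside a single transaction and commits it on success.
	ReadWriteTx(ctx context.Context, fn func(tx relationshipWriter) error) error
}

type relationshipWriter interface {
	WriteRelationships(rels []*v1.Relationship) error
}

func bulkImportPerBatch(stream v1.ExperimentalService_BulkImportRelationshipsServer, ds datastore) error {
	var loaded uint64
	for {
		req, err := stream.Recv()
		if err == io.EOF {
			// Client finished sending; report how many relationships were written.
			return stream.SendAndClose(&v1.BulkImportRelationshipsResponse{NumLoaded: loaded})
		}
		if err != nil {
			return err
		}

		// Commit each incoming batch in its own transaction so that no single
		// transaction stays open long enough to hit CRDB's closed timestamp interval.
		err = ds.ReadWriteTx(stream.Context(), func(tx relationshipWriter) error {
			return tx.WriteRelationships(req.Relationships)
		})
		if err != nil {
			return err
		}
		loaded += uint64(len(req.Relationships))
	}
}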

Drawbacks

This will force all BulkImports with more than one batch into multiple transactions, even those that could have comfortably fit into a single transaction. If the BulkImport is interrupted, it may be harder to know where to resume from.

This may not be much of an issue in practice; you could simply replay the entire import and ignore errors. But we might want to discuss the implications for resumability some more and come up with some alternatives before moving ahead with this proposal.
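
As a rough sketch of that replay approach, a caller could re-run the import and skip batches that fail because their relationships already exist. This assumes duplicates surface as gRPC AlreadyExists errors (an assumption for the example, not documented behavior), and replayImport is a hypothetical helper:

// Hypothetical helper: replay a batched import after an interruption, treating
// AlreadyExists as "this batch was committed by the earlier run".
// Uses google.golang.org/grpc/codes and google.golang.org/grpc/status.
func replayImport(ctx context.Context, client v1.ExperimentalServiceClient, batches [][]*v1.Relationship) error {
	for _, batch := range batches {
		writer, err := client.BulkImportRelationships(ctx)
		if err != nil {
			return err
		}

		// Any Send error is also surfaced by CloseAndRecv below.
		_ = writer.Send(&v1.BulkImportRelationshipsRequest{Relationships: batch})

		if _, err := writer.CloseAndRecv(); err != nil {
			if status.Code(err) == codes.AlreadyExists {
				continue // already imported by the interrupted run
			}
			return err
		}
	}
	return nil
}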

@jzelinskie jzelinskie added area/perf Affects performance or scalability area/api v1 Affects the v1 API area/datastore Affects the storage system kind/proposal Something fundamentally needs to change labels Sep 23, 2023
@josephschorr
Member

This is handled now by a new RetryableBulkImport call in the Go client library: authzed/authzed-go#165
