Exactly Once Support


This page outlines the design for exactly-once support in Gobblin.

Currently the flow of publishing data in Gobblin is:

  1. The DataWriter writes data to a staging folder.
  2. The DataWriter moves the files from the staging folder to the task output folder.
  3. The Publisher moves the files from the task output folder to the job output folder.
  4. The checkpoints (watermarks) are persisted to the state store.
  5. The staging folder and the task output folder are deleted.
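
As a rough sketch, this flow amounts to the sequence below; the Step interface and step names are placeholders for illustration, not the actual Gobblin API. Note that nothing ties step 3 to step 4.

```java
import java.io.IOException;

/**
 * Sketch of the current, non-atomic publish flow. The Step interface and the
 * step names are placeholders for illustration, not the actual Gobblin API.
 */
public class CurrentPublishFlow {

  /** A single side-effecting step of the flow (move files, persist state, ...). */
  interface Step {
    void run() throws IOException;
  }

  static void publish(Step writeToStaging, Step moveToTaskOutput, Step moveToJobOutput,
      Step persistWatermarks, Step cleanUp) throws IOException {
    writeToStaging.run();      // step 1: DataWriter writes to the staging folder
    moveToTaskOutput.run();    // step 2: DataWriter moves staging -> task output
    moveToJobOutput.run();     // step 3: Publisher moves task output -> job output
    // A crash between step 3 and step 4 publishes data without advancing the
    // checkpoints, so the next run re-extracts and re-publishes those records.
    persistWatermarks.run();   // step 4: persist watermarks to the state store
    cleanUp.run();             // step 5: delete the staging and task-output folders
  }
}
```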

This flow does not guarantee exactly-once delivery; rather, it guarantees at-least-once delivery. If a failure occurs in step 4, or between steps 3 and 4, data may be published without the corresponding checkpoints, and the next run will re-extract and re-publish those records.

To guarantee exactly-once delivery, steps 3 and 4 must be performed atomically.

Achieving Exactly-Once Delivery with CommitStepStore

The idea is similar to write-ahead logging. Before performing the atomic steps (i.e., steps 3 and 4), first write all of these steps (referred to as CommitSteps) into a CommitStepStore. This way, if a failure happens during the atomic steps, the next run can complete the remaining steps before ingesting more data for the dataset.
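
A minimal sketch of this write-ahead pattern is shown below; the CommitStep and CommitStepStore interfaces here are assumptions used for illustration, not the actual implementation.

```java
import java.io.IOException;
import java.util.List;

/**
 * Minimal sketch of the write-ahead pattern for the atomic steps. The
 * CommitStep and CommitStepStore interfaces are assumptions for illustration.
 */
public class DatasetCommit {

  interface CommitStep {
    void execute() throws IOException;
  }

  interface CommitStepStore {
    void put(String datasetUrn, List<? extends CommitStep> steps) throws IOException;
    void remove(String datasetUrn) throws IOException;
  }

  static void commit(String datasetUrn, List<? extends CommitStep> steps, CommitStepStore store)
      throws IOException {
    // Write-ahead: persist every CommitStep for the dataset before executing any of them.
    store.put(datasetUrn, steps);

    // Execute the steps. A failure here leaves the logged steps in the store,
    // so the next run can finish them before ingesting more data for this dataset.
    for (CommitStep step : steps) {
      step.execute();
    }

    // Discard the logged steps only after all of them have succeeded.
    store.remove(datasetUrn);
  }
}
```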

Example: Suppose we have a Kafka-HDFS ingestion job, where each Kafka topic is a dataset. Suppose a task generates three output files for topic 'MyTopic':

task-output/MyTopic/2015-12-09/1.avro
task-output/MyTopic/2015-12-09/2.avro
task-output/MyTopic/2015-12-10/1.avro

which should be published to

job-output/MyTopic/2015-12-09/1.avro
job-output/MyTopic/2015-12-09/2.avro
job-output/MyTopic/2015-12-10/1.avro

Also suppose this topic has two partitions, whose checkpoints (i.e., the actual high watermarks) are offset=100 and offset=200, respectively.

In this case, there will be 5 CommitSteps for this dataset:

  1. FsRenameCommitStep: rename task-output/MyTopic/2015-12-09/1.avro to job-output/MyTopic/2015-12-09/1.avro
  2. FsRenameCommitStep: rename task-output/MyTopic/2015-12-09/2.avro to job-output/MyTopic/2015-12-09/2.avro
  3. FsRenameCommitStep: rename task-output/MyTopic/2015-12-10/1.avro to job-output/MyTopic/2015-12-10/1.avro
  4. HighWatermarkCommitStep: set the high watermark for partition MyTopic:0 = 100
  5. HighWatermarkCommitStep: set the high watermark for partition MyTopic:1 = 200
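
For illustration, a hypothetical FsRenameCommitStep might look like the sketch below. Only the class name comes from the design above; the constructor and logic are assumptions, and it uses the verify()/execute() interface described in the next paragraph. Its verify() simply checks whether the file already exists at the destination, which is what makes re-execution after a failure safe.

```java
import java.io.IOException;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

// The CommitStep interface (verify()/execute()) is described in more detail below.
interface CommitStep {
  boolean verify() throws IOException;
  void execute() throws IOException;
}

/**
 * Hypothetical sketch of an FsRenameCommitStep. Only the class name appears in
 * the design; the constructor and logic here are assumptions for illustration.
 */
class FsRenameCommitStep implements CommitStep {

  private final FileSystem fs;
  private final Path src;   // e.g. task-output/MyTopic/2015-12-09/1.avro
  private final Path dst;   // e.g. job-output/MyTopic/2015-12-09/1.avro

  FsRenameCommitStep(FileSystem fs, Path src, Path dst) {
    this.fs = fs;
    this.src = src;
    this.dst = dst;
  }

  // The step has taken effect iff the file is already at the destination,
  // which makes re-executing the step after a failure safe.
  @Override
  public boolean verify() throws IOException {
    return this.fs.exists(this.dst);
  }

  @Override
  public void execute() throws IOException {
    if (!this.fs.rename(this.src, this.dst)) {
      throw new IOException("Failed to rename " + this.src + " to " + this.dst);
    }
  }
}
```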

If all of these CommitSteps are successful, we can proceed with deleting the task-output folder and deleting the above CommitSteps from the CommitStepStore. If any of these steps fails, the steps are not deleted. When the next run starts, for each dataset it checks whether there are CommitSteps for that dataset in the CommitStepStore. If there are, the previous run may not have successfully executed some of them, so the new run verifies whether each step has been done and re-does any step that has not. If the re-do fails a certain number of times, the dataset is skipped. The CommitStep interface will therefore have two methods: verify() and execute().
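
A sketch of this recovery pass at the start of the next run is shown below; the store API and the retry limit are assumptions for illustration.

```java
import java.io.IOException;
import java.util.List;

/**
 * Sketch of the recovery pass at the start of the next run. The interfaces
 * below are assumptions matching the design; the retry limit is illustrative.
 */
public class PendingCommitResolver {

  interface CommitStep {
    boolean verify() throws IOException;  // has this step already taken effect?
    void execute() throws IOException;    // (re-)perform the step
  }

  interface CommitStepStore {
    List<CommitStep> get(String datasetUrn) throws IOException;
    void remove(String datasetUrn) throws IOException;
  }

  private static final int MAX_ATTEMPTS = 3;  // "a certain number of times"

  /**
   * Returns true if it is safe to ingest more data for the dataset, i.e. either
   * no CommitSteps were pending or all pending steps have been completed.
   */
  static boolean resolve(String datasetUrn, CommitStepStore store) throws IOException {
    List<CommitStep> pending = store.get(datasetUrn);
    if (pending == null || pending.isEmpty()) {
      return true;  // the previous run committed this dataset (or never started to)
    }
    for (CommitStep step : pending) {
      if (!redo(step)) {
        return false;  // could not complete the step: skip this dataset in this run
      }
    }
    store.remove(datasetUrn);  // all steps done; clear the log
    return true;
  }

  private static boolean redo(CommitStep step) throws IOException {
    for (int attempt = 0; attempt < MAX_ATTEMPTS; attempt++) {
      if (step.verify()) {
        return true;  // the previous run (or an earlier attempt) already did it
      }
      try {
        step.execute();
      } catch (IOException e) {
        // swallow and retry, up to MAX_ATTEMPTS times
      }
    }
    return step.verify();
  }
}
```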

Scalability

The above approach potentially affects scalability for two reasons:

  1. The driver needs to write all CommitSteps to the CommitStepStore for each dataset, once it determines that all tasks for the dataset have finished. This may cause scalability issues if there are too many CommitSteps, too many datasets, or too many tasks.
  2. Upon the start of the next run, the driver needs to verify all CommitSteps and redo the CommitSteps that the previous run failed to do. This may also cause scalability issues if there are too many CommitSteps.

Both issues can be resolved by moving the majority of the work to an MR job, rather than doing it in the driver.

For #1, in addition to the MR job pulling the data, we can launch a separate MR job in which each container is responsible for writing CommitSteps for a subset of the datasets. Each container keeps polling the TaskStateStore to determine whether all tasks for each of its datasets have finished; once a dataset's tasks have all finished, the container writes that dataset's CommitSteps to the CommitStepStore.
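
A sketch of what one such container could do is shown below; all interfaces and method names are assumptions for illustration.

```java
import java.io.IOException;
import java.util.Collection;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

/**
 * Sketch of one container of the separate MR job: it polls the TaskStateStore
 * for its subset of datasets and, once all tasks of a dataset have finished,
 * writes that dataset's CommitSteps to the CommitStepStore. All interfaces
 * and method names here are assumptions for illustration.
 */
public class CommitStepWriterContainer {

  interface CommitStep { void execute() throws IOException; }

  interface TaskStateStore {
    boolean allTasksFinished(String datasetUrn) throws IOException;
  }

  interface CommitStepFactory {
    List<CommitStep> stepsFor(String datasetUrn) throws IOException;
  }

  interface CommitStepStore {
    void put(String datasetUrn, List<CommitStep> steps) throws IOException;
  }

  static void run(Collection<String> assignedDatasets, TaskStateStore taskStates,
      CommitStepFactory stepFactory, CommitStepStore commitStepStore, long pollIntervalMs)
      throws IOException, InterruptedException {
    Set<String> remaining = new HashSet<>(assignedDatasets);
    while (!remaining.isEmpty()) {
      for (String dataset : new HashSet<>(remaining)) {
        if (taskStates.allTasksFinished(dataset)) {
          // All tasks for this dataset are done: log its CommitSteps.
          commitStepStore.put(dataset, stepFactory.stepsFor(dataset));
          remaining.remove(dataset);
        }
      }
      if (!remaining.isEmpty()) {
        Thread.sleep(pollIntervalMs);  // keep polling for the rest
      }
    }
  }
}
```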

#2 can also be easily parallelized by making each container responsible for verifying the CommitSteps of a subset of the datasets.
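
For example, datasets could be assigned to containers by hashing the dataset name (an assumed scheme); each container then verifies and re-does the pending CommitSteps of its share, reusing the PendingCommitResolver sketched above (same package assumed).

```java
import java.io.IOException;
import java.util.List;

/**
 * Sketch of parallel verification: each container claims the datasets whose
 * hash maps to its index and resolves their pending CommitSteps. The
 * hash-based assignment is an assumed scheme, and PendingCommitResolver is
 * the recovery sketch above.
 */
public class CommitStepVerifierContainer {

  static void run(int containerIndex, int numContainers, List<String> allDatasets,
      PendingCommitResolver.CommitStepStore store) throws IOException {
    for (String dataset : allDatasets) {
      // Simple, deterministic assignment of datasets to containers.
      if (Math.floorMod(dataset.hashCode(), numContainers) == containerIndex) {
        if (!PendingCommitResolver.resolve(dataset, store)) {
          // Pending steps could not be completed; skip this dataset in this run.
          System.err.println("Skipping dataset with unresolved CommitSteps: " + dataset);
        }
      }
    }
  }
}
```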
