Allow sources to be configured as 'always have data' #184

Open
yruslan opened this issue Apr 4, 2023 · 0 comments
Labels
enhancement New feature or request Pramen-Scala

Comments

Collaborator

yruslan commented Apr 4, 2023

Background

Currently, ingestion jobs always check the record count first and proceed with the ingestion only if the count is non-zero. This works well for event-based JDBC sources. For file-based sources, however, it sometimes requires parsing the data twice: once to get the record count and once to load the data.

Feature

Extend the interface of Source to allow instances where data is always assumed to be available, like snapshot sources.

When a source is marked as always having data available, the record count should be derived at write time instead of being queried up front.
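As a rough illustration of deriving the count on write, the sketch below counts records while they are being written, so no separate counting pass over the source is needed. It uses a plain Scala `Iterator` and a `java.io.Writer` rather than Pramen's or Spark's actual write path; `CountOnWrite` and `writeAndCount` are hypothetical names introduced here, not part of the project.

```scala
import java.io.{BufferedWriter, StringWriter, Writer}

// Hypothetical helper for illustration only; not part of Pramen.
object CountOnWrite {
  /** Writes each record as one line and returns how many records were written. */
  def writeAndCount(records: Iterator[String], out: Writer): Long = {
    val writer = new BufferedWriter(out)
    var count = 0L
    records.foreach { record =>
      writer.write(record)
      writer.newLine()
      count += 1 // the count is a by-product of the single pass over the data
    }
    writer.flush()
    count
  }
}
```

An empty source simply yields a count of zero after the write, which the job could then report in the same way it reports "no data" today.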

Things to consider

  • If the data does not go through the metastore but is passed directly from a source to a sink, this may require caching the data or saving it to a temporary directory.

Proposed Solution

trait Source extends ExternalChannel {
  /**
    * If true, getRecordCount() won't be used to determine if the data is available.
    * This avoids a double read of the data when the source is file-based and always points to a particular file,
    * or when it is a snapshot-based source.
    */
  def isDataAlwaysAvailable: Boolean = false
}

The default implementation returns false, so the interface stays source-compatible with existing sources.
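A minimal sketch of how the flag might be consumed, under stated assumptions: the traits below are simplified stand-ins for the real Pramen interfaces (the no-argument `getRecordCount()` signature is an assumption), and `EventTableSource`, `SnapshotFileSource` and `IngestionCheck.shouldIngest` are hypothetical names used only for illustration.

```scala
// Simplified stand-ins for illustration; the real Pramen traits have more members.
trait ExternalChannel

trait Source extends ExternalChannel {
  /** Assumed simplified signature: the number of records currently available. */
  def getRecordCount(): Long

  /** The flag proposed in this issue; off by default. */
  def isDataAlwaysAvailable: Boolean = false
}

// An existing source compiles unchanged and keeps the count-first behaviour.
class EventTableSource(count: Long) extends Source {
  override def getRecordCount(): Long = count
}

// A snapshot-style source opts in by overriding the flag.
class SnapshotFileSource extends Source {
  var countCalls = 0 // tracks how often the (potentially expensive) count is requested
  override def getRecordCount(): Long = { countCalls += 1; 42L }
  override def isDataAlwaysAvailable: Boolean = true
}

object IngestionCheck {
  /** Proceeds without counting when the flag is set; otherwise requires a non-zero count. */
  def shouldIngest(source: Source): Boolean =
    source.isDataAlwaysAvailable || source.getRecordCount() > 0
}
```

Because `||` short-circuits, `getRecordCount()` is never called for a source that overrides the flag to true, which is the double read the proposal aims to avoid.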

@yruslan yruslan added enhancement New feature or request Pramen-Scala labels Apr 4, 2023
yruslan added a commit that referenced this issue Apr 6, 2023
yruslan added a commit that referenced this issue Apr 6, 2023