Background
Currently, ingestion jobs always check the record count first and proceed with the ingestion only if the record count is non-zero. This works well for event-based JDBC sources. For file-based sources, however, it sometimes requires parsing the data twice: once to get the record count, and a second time to load the data.
Feature
Extend the interface of Source to allow instances where data is always assumed to be available, like snapshot sources.
When a source declares that data is always available, the record count should be derived on write instead of being queried up front.
Things to consider
If the data does not go through the metastore but is passed directly from a source to a sink, deriving the record count on write may require caching the data or saving it in a temporary directory.
Proposed Solution
trait Source extends ExternalChannel {
  /**
    * If true, getRecordCount() won't be used to determine if the data is available.
    * This avoids a double data read when the source is file-based and always points to a particular file,
    * or when it is a snapshot-based source.
    */
  def isDataAlwaysAvailable: Boolean = false
}
The default implementation returns false, so the interface stays source-compatible with existing sources.
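A minimal sketch of how an ingestion job might use the flag, assuming simplified signatures. Only `Source`, `ExternalChannel`, `getRecordCount()` and `isDataAlwaysAvailable` come from the proposal; `getData()`, `Sink`, `write()`, `CountingFileSource`, `InMemorySink` and `IngestionSketch` are hypothetical names used for illustration, and the real interface likely takes more parameters.

```scala
// Hypothetical stand-ins for the real interfaces (simplified signatures).
trait ExternalChannel

trait Source extends ExternalChannel {
  def getRecordCount(): Long
  def getData(): Seq[String]
  // Proposed flag; defaults to false so existing sources compile unchanged.
  def isDataAlwaysAvailable: Boolean = false
}

trait Sink extends ExternalChannel {
  /** Writes the records and returns how many were written. */
  def write(records: Seq[String]): Long
}

/** Hypothetical file-like source that counts how many times it is parsed. */
class CountingFileSource(records: Seq[String], alwaysAvailable: Boolean) extends Source {
  var reads: Int = 0

  // Every count or load has to parse the "file" again.
  private def parse(): Seq[String] = {
    reads += 1
    records
  }

  override def getRecordCount(): Long = parse().size.toLong
  override def getData(): Seq[String] = parse()
  override def isDataAlwaysAvailable: Boolean = alwaysAvailable
}

/** Hypothetical sink that keeps the written records in memory. */
class InMemorySink extends Sink {
  val written = scala.collection.mutable.ArrayBuffer.empty[String]

  override def write(records: Seq[String]): Long = {
    written ++= records
    records.size.toLong
  }
}

object IngestionSketch {
  /** Returns the number of ingested records. */
  def ingest(source: Source, sink: Sink): Long = {
    if (source.isDataAlwaysAvailable) {
      // Skip the up-front count; the record count is derived from the write.
      sink.write(source.getData())
    } else if (source.getRecordCount() > 0) {
      // Existing behavior: count first, then load.
      sink.write(source.getData())
    } else {
      0L
    }
  }
}
```

With the flag set, the source in this sketch is parsed once per ingestion; with the default, it is parsed twice (once for the count, once for the load).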