gradle clean build
spark-shell \
--jars=./lib/build/libs/dataguidebook-1.0-SNAPSHOT.jar
The Data Source V1 API is the original way of defining a custom data source. It is still available in Spark 3.2.1.
import org.apache.spark.sql.SaveMode
val df = spark.read.format("com.dataguidebook.spark.datasource.v1").load("")
df.printSchema()
df.count()
df.write.mode(SaveMode.Append).format("com.dataguidebook.spark.datasource.v1").save("newpath")
// The insertInto command only works when you have a table defined USING your
// custom data source.
//spark.sql("CREATE TABLE myTable(column01 int, column02 int, column03 int ) USING com.dataguidebook.spark.datasource.v1 LOCATION custom/insertinto")
df.write.mode(SaveMode.Append).format("com.dataguidebook.spark.datasource.v1").insertInto("myTable")
The JavaDoc for Spark SQL Sources provides all the classes you can use in your custom data source to define its behaviors. You need to provide two classes:
- DefaultSource which implements RelationProvider
- YourCustomClass which extends BaseRelation and implements at least TableScan
Used for Reading or Writing
- RelationProvider: Used in the DefaultSource and defines how you initialize a custom data source without a user-defined schema (a combined sketch follows this list).
  - Defines `createRelation`, which is invoked when you do `spark.read` and takes in the options provided.
- SchemaRelationProvider: Used in the DefaultSource and defines how you initialize a custom data source with a user-defined schema.
  - Defines `createRelation`, which is invoked when you do `spark.read` and requires that you provide a schema.
- DataSourceRegister: Data sources should implement this trait so that they can register an alias to their data source.
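For illustration, a minimal `DefaultSource` combining these three traits might look like the sketch below (the alias name is made up, and `CustomDataRelation` is a hypothetical relation class sketched in the next section):

```scala
import org.apache.spark.sql.SQLContext
import org.apache.spark.sql.sources.{BaseRelation, DataSourceRegister, RelationProvider, SchemaRelationProvider}
import org.apache.spark.sql.types.StructType

class DefaultSource extends RelationProvider with SchemaRelationProvider with DataSourceRegister {

  // Alias so users can write .format("customsource") instead of the full package name.
  // The class must also be listed in META-INF/services/org.apache.spark.sql.sources.DataSourceRegister.
  override def shortName(): String = "customsource"

  // spark.read without a user-provided schema: the relation must infer or hard-code its schema.
  override def createRelation(sqlContext: SQLContext,
                              parameters: Map[String, String]): BaseRelation =
    new CustomDataRelation(sqlContext, parameters, userSchema = None)

  // spark.read.schema(...): the user-provided schema is passed through.
  override def createRelation(sqlContext: SQLContext,
                              parameters: Map[String, String],
                              schema: StructType): BaseRelation =
    new CustomDataRelation(sqlContext, parameters, userSchema = Some(schema))
}
```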
Reading Data
- TableScan: A BaseRelation that can produce all of its tuples as an RDD of Row objects.
- PrunedScan: A BaseRelation that can eliminate unneeded columns before producing an RDD containing all of its tuples as Row objects.
- PrunedFilteredScan: A BaseRelation that can eliminate unneeded columns and filter using selected predicates before producing an RDD containing all matching tuples as Row objects.
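The hypothetical `CustomDataRelation` referenced above might implement `TableScan` and `PrunedScan` roughly as follows; the hard-coded rows stand in for whatever your source actually reads, and `PrunedFilteredScan` simply adds an `Array[Filter]` argument to `buildScan`:

```scala
import org.apache.spark.rdd.RDD
import org.apache.spark.sql.{Row, SQLContext}
import org.apache.spark.sql.sources.{BaseRelation, PrunedScan, TableScan}
import org.apache.spark.sql.types._

class CustomDataRelation(val sqlContext: SQLContext,
                         parameters: Map[String, String],
                         userSchema: Option[StructType])
  extends BaseRelation with TableScan with PrunedScan {

  // Fall back to a fixed schema when the user did not supply one.
  override def schema: StructType = userSchema.getOrElse(
    StructType(Seq(StructField("id", IntegerType), StructField("name", StringType))))

  // TableScan: produce every column of every row.
  override def buildScan(): RDD[Row] =
    sqlContext.sparkContext.parallelize(Seq(Row(1, "a"), Row(2, "b")))

  // PrunedScan: produce only the requested columns, in the requested order.
  override def buildScan(requiredColumns: Array[String]): RDD[Row] = {
    val indices = requiredColumns.map(schema.fieldIndex)
    buildScan().map(row => Row.fromSeq(indices.map(i => row.get(i)).toSeq))
  }
}
```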
Writing Data
- CreatableRelationProvider: Used on the `DefaultSource` class to define the data-writing behavior (see the sketch after this list).
  - Requires you to implement `createRelation` with an additional `SaveMode` parameter.
  - You define all the business logic to overwrite, append, etc. a DataFrame to your custom data source.
  - Use `df.foreachPartition` to execute your business logic on each partition when writing.
- InsertableRelation: Used on the custom data source's class (e.g. `CustomDataRelation`) that inherits from BaseRelation, to insert into a Hive-metastore-backed data source.
  - Requires you to implement `insert`, which takes in a DataFrame; you apply the business logic to store it inside your Hive metastore (e.g. write it to your proprietary format) or send the DataFrame to a different datastore.
  - This only works if you have defined a custom table inside your Hive metastore with the `USING` keyword specifying your custom data source:
    spark.sql("CREATE TABLE myTable(column01 int, column02 int, column03 int) USING com.dataguidebook.spark.datasource.v1 LOCATION 'custom/insertinto'")
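A sketch of the write side under the same assumptions; in a real source these traits would be mixed into the same `DefaultSource` and relation classes shown earlier, and the per-partition logic here is only a placeholder:

```scala
import org.apache.spark.sql.{DataFrame, Row, SQLContext, SaveMode}
import org.apache.spark.sql.sources.{BaseRelation, CreatableRelationProvider, InsertableRelation}
import org.apache.spark.sql.types.StructType

class DefaultSource extends CreatableRelationProvider {

  // df.write.format(...).mode(...).save(path) lands here; you interpret the SaveMode yourself.
  override def createRelation(sqlContext: SQLContext,
                              mode: SaveMode,
                              parameters: Map[String, String],
                              data: DataFrame): BaseRelation = {
    val writePartition: Iterator[Row] => Unit = rows => {
      // Placeholder: open one connection/file per partition and write the rows
      // according to the requested mode (Append, Overwrite, ErrorIfExists, Ignore).
      rows.foreach(_ => ())
    }
    data.foreachPartition(writePartition)
    new WritableRelation(sqlContext, data.schema)
  }
}

// Mixing InsertableRelation into the relation class makes INSERT INTO / insertInto work
// against a metastore table created with `USING <your data source>`.
class WritableRelation(val sqlContext: SQLContext, val schema: StructType)
  extends BaseRelation with InsertableRelation {

  override def insert(data: DataFrame, overwrite: Boolean): Unit = {
    // Placeholder: persist `data` in your proprietary format;
    // overwrite == true means replace existing contents instead of appending.
  }
}
```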
In Spark 2.3, the Data Sources V2 API (JavaDoc) was released in beta (Spark JIRA) but was not marked as stable until 2.4. So we'll only cover the Spark 2.4.7 Data Sources V2 API (JavaDoc).
Used for Reading or Writing
- DataSourceV2 is a "marker interface" which essentially tags the class but doesn't define any behavior.
- InternalRow (org.apache.spark.sql.catalyst) is the binary row format used internally by Apache Spark.
- TODO: How do you make an InternalRow? (see the sketch below)
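As a partial answer to that TODO: one minimal way to build an `InternalRow` by hand is `InternalRow.fromSeq`, keeping in mind that Catalyst's internal representations differ from the external ones:

```scala
import org.apache.spark.sql.catalyst.InternalRow
import org.apache.spark.unsafe.types.UTF8String

// Values must be in schema order and use Catalyst's internal types,
// e.g. UTF8String for StringType and Int (days since epoch) for DateType.
val row: InternalRow = InternalRow.fromSeq(Seq(42, UTF8String.fromString("some value")))
```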
Reading Data
- ReadSupport: Requires you to implement the `createReader` method to return a `DataSourceReader` object.
- DataSourceReader: Requires you to implement `readSchema` (when no schema is provided) and `planInputPartitions`, which returns the set of partitions being used. Each partition creates its own reader to handle that partition of the data.
- InputPartition: A serializable description of one partition of the data; requires you to implement `createPartitionReader`, which is called on the executor to produce an InputPartitionReader.
- InputPartitionReader: Iterates over one partition; requires you to implement `next`, `get` (returning an InternalRow), and `close`. (A minimal end-to-end sketch follows this list.)
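Putting the four pieces together, a minimal read-only Data Source V2 (Spark 2.4.x) source might look like the following; the class names and the hard-coded two-partition data are illustrative:

```scala
import java.util.{ArrayList => JArrayList, List => JList}

import org.apache.spark.sql.catalyst.InternalRow
import org.apache.spark.sql.sources.v2.{DataSourceOptions, DataSourceV2, ReadSupport}
import org.apache.spark.sql.sources.v2.reader.{DataSourceReader, InputPartition, InputPartitionReader}
import org.apache.spark.sql.types._
import org.apache.spark.unsafe.types.UTF8String

class DefaultSource extends DataSourceV2 with ReadSupport {
  override def createReader(options: DataSourceOptions): DataSourceReader = new ExampleReader
}

class ExampleReader extends DataSourceReader {
  override def readSchema(): StructType =
    StructType(Seq(StructField("id", IntegerType), StructField("name", StringType)))

  // One InputPartition per split; each is serialized to an executor,
  // which then asks it for an InputPartitionReader.
  override def planInputPartitions(): JList[InputPartition[InternalRow]] = {
    val partitions = new JArrayList[InputPartition[InternalRow]]()
    partitions.add(new ExamplePartition(0 until 5))
    partitions.add(new ExamplePartition(5 until 10))
    partitions
  }
}

class ExamplePartition(ids: Range) extends InputPartition[InternalRow] {
  override def createPartitionReader(): InputPartitionReader[InternalRow] =
    new ExamplePartitionReader(ids.iterator)
}

class ExamplePartitionReader(ids: Iterator[Int]) extends InputPartitionReader[InternalRow] {
  private var current = 0
  override def next(): Boolean = {
    val hasMore = ids.hasNext
    if (hasMore) current = ids.next()
    hasMore
  }
  override def get(): InternalRow =
    InternalRow.fromSeq(Seq(current, UTF8String.fromString(s"row-$current")))
  override def close(): Unit = ()
}
```

Loading then works the same way as with V1: `spark.read.format("<package containing DefaultSource>").load()`.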
Writing Data
In Spark 3, the Data Sources V2 API was revised AGAIN and should really be called the V3 API.
Used for Reading or Writing
Reading Data
Writing Data
These blogs, videos, and repos have been extremely helpful in improving my understanding of the history of the Data Source API in Apache Spark.
- blog.madhukaraphatak.com Data Sources V2 (Spark 3.0)
- DatasourceV2Relation Spark Catalyst Doc
- CatalogPlugin
- Table
- TableCatalog
- DataSourceV2 Examples:
- Spark 2.4.7 JavaDoc for Data Sources V2 API
- Spark 2.3.0 JavaDoc for Data Sources V2 API
- (2018 Spark Summit) Data Source V2 (Spark 2.3) starts at 9:19
- Includes a review of Data Sources V1
- blog.madhukaraphatak.com Data Sources V2 (Spark 2.3)
- shzhangji.com Data Source V2 (Spark 2.3)
- Spark Latest JavaDoc for Spark SQL Sources
- 2019 Spark Summit: Jacek Laskowski Live Coding Session (Spark 2.4):
- Spark in Action Book Ch 9 Data Source V1
- (2016 Spark Summit) Data Sources V1
- InsertableRelation Spark Internals reference
- Data Source V1
- Data Source V2 (Spark 3)