Custom Apache Spark data sources using the Python Data Source API (Spark 4.0+). Learn by example and build your own data sources.
pip install pyspark-data-sources
# Install with specific extras
pip install pyspark-data-sources[faker]        # For FakeDataSource
pip install pyspark-data-sources[all]          # All optional dependencies

- Apache Spark 4.0+ or Databricks Runtime 15.4 LTS+
- Python 3.9-3.12
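To confirm the environment meets these requirements, a quick sanity check (assuming pyspark is already installed in the active interpreter) is:

import sys
import pyspark

print(pyspark.__version__)      # should report 4.0.0 or newer
print(sys.version_info[:2])     # should fall in the 3.9-3.12 range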
from pyspark.sql import SparkSession
from pyspark_datasources import FakeDataSource
# Create Spark session
spark = SparkSession.builder.appName("datasource-demo").getOrCreate()
# Register the data source
spark.dataSource.register(FakeDataSource)
# Read batch data
df = spark.read.format("fake").option("numRows", 5).load()
df.show()
# +--------------+----------+-------+------------+
# |          name|      date|zipcode|       state|
# +--------------+----------+-------+------------+
# |  Pam Mitchell|1988-10-20|  23788|   Tennessee|
# |Melissa Turner|1996-06-14|  30851|      Nevada|
# |  Brian Ramsey|2021-08-21|  55277|  Washington|
# |  Caitlin Reed|1983-06-22|  89813|Pennsylvania|
# | Douglas James|2007-01-18|  46226|     Alabama|
# +--------------+----------+-------+------------+
# Stream data
stream = spark.readStream.format("fake").load()
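# Write each microbatch to the console sink; call query.stop() when finished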
query = stream.writeStream.format("console").start()

| Data Source | Type | Description | Install |
|---|---|---|---|
| fake | Batch/Stream | Generate synthetic test data using Faker | [faker] | 
| github | Batch | Read GitHub pull requests | Built-in | 
| googlesheets | Batch | Read public Google Sheets | Built-in | 
| huggingface | Batch | Load Hugging Face datasets | [huggingface] | 
| stock | Batch | Fetch stock market data (Alpha Vantage) | Built-in | 
| opensky | Batch/Stream | Live flight tracking data | Built-in | 
| kaggle | Batch | Load Kaggle datasets | [kaggle] | 
| arrow | Batch | Read Apache Arrow files | [arrow] | 
| lance | Batch Write | Write Lance vector format | [lance] | 
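Every source in the table follows the same register-then-read pattern as the fake source: import its class from pyspark_datasources, register it, and read with the matching format name. As a hedged sketch for the github source (the load-path convention shown here is an assumption; check the source's docstring for the exact options it expects):

from pyspark_datasources import GithubDataSource

spark.dataSource.register(GithubDataSource)

# Assumed usage: pass the "owner/repo" whose pull requests should be read
df = spark.read.format("github").load("apache/spark")
df.show()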
📚 See detailed examples for all data sources →
from pyspark_datasources import FakeDataSource
spark.dataSource.register(FakeDataSource)
# Generate synthetic data with custom schema
df = spark.read.format("fake") \
    .schema("name string, email string, company string") \
    .option("numRows", 5) \
    .load()
df.show(truncate=False)
# +------------------+-------------------------+-----------------+
# |name              |email                    |company          |
# +------------------+-------------------------+-----------------+
# |Christine Sampson |[email protected]|Hernandez-Nguyen |
# |Yolanda Brown     |[email protected]  |Miller-Hernandez |
# +------------------+-------------------------+-----------------+

Here's a minimal example to get started building your own data source:
from pyspark.sql.datasource import DataSource, DataSourceReader
from pyspark.sql.types import StructType, StructField, StringType, IntegerType
class MyCustomDataSource(DataSource):
    @classmethod
    def name(cls):
        # Format name used with spark.read.format("mycustom")
        return "mycustom"
    def schema(self):
        return StructType([
            StructField("id", IntegerType()),
            StructField("name", StringType())
        ])
    def reader(self, schema):
        return MyCustomReader(self.options, schema)
class MyCustomReader(DataSourceReader):
    def __init__(self, options, schema):
        self.options = options
        self.schema = schema
    def read(self, partition):
        # Your data reading logic here
        for i in range(10):
            yield (i, f"name_{i}")
# Register and use
spark.dataSource.register(MyCustomDataSource)
df = spark.read.format("mycustom").load()📖 Complete guide with advanced patterns →
- 📚 Data Sources Guide - Detailed examples for each data source
- 🔧 Building Data Sources - Complete tutorial with advanced patterns
- 📖 API Reference - Full API specification and method signatures
- 💻 Development Guide - Contributing and development setup
We welcome contributions! See our Development Guide for details.