Are you willing to submit PR?
What is the problem?
The Spark Declarative Pipelines programming guide does not explain how datasets are stored and refreshed internally. Key information missing includes:
- Default table format: SDP creates tables using Spark's default format (`parquet`, via `spark.sql.sources.default`), but this is not documented. Users don't know what format their tables will be in or how to change it.
- Materialized view refresh behavior: Materialized views perform a full recomputation (TRUNCATE + append) on every pipeline run. This is fundamentally different from database-native materialized views (e.g., PostgreSQL) that support incremental refresh. Users need to understand this to plan for performance and cost.
- Streaming table checkpoint requirement: Streaming tables require a checkpoint directory on a Hadoop-compatible file system, but the relationship between the `storage` field in `spark-pipeline.yml` and checkpoint behavior is not explained.
- Full refresh semantics: The `--full-refresh` / `--full-refresh-all` CLI options are documented, but their actual effect on each dataset type is not described.
How to reproduce
Read the current programming guide and try to answer: "What format are my tables stored in?" or "What happens to my materialized view data on each pipeline run?"
What is the expected behavior?
The programming guide should include a section explaining how datasets are stored and refreshed, covering table format configuration, materialized view refresh mechanics, streaming table checkpoint requirements, and full refresh behavior.
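Such a section could anchor the checkpoint explanation to the pipeline spec itself. The sketch below is hypothetical: only the `storage` field is taken from this issue, and the remaining keys are illustrative assumptions, not documented behavior:

```yaml
# Hypothetical spark-pipeline.yml sketch.
name: sales_pipeline
# Root directory for pipeline state; streaming-table checkpoints are
# presumed to live under this path, which must be on a
# Hadoop-compatible file system.
storage: hdfs:///pipelines/sales
definitions:
  - glob:
      include: transformations/**
```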