
Conversation

chi-yang-db

Submission of Lab Project ZeroBus - File Mode
Proposal
CUJ
This is a DAB config that deploys resources to the customer's workspace and invokes script jobs to set up the file push endpoints. No new API or SQL syntax is introduced.

@chi-yang-db requested a review from a team as a code owner October 1, 2025 18:55
@chi-yang-db requested a review from fjakobs October 1, 2025 18:55
@alexott requested a review from Copilot October 4, 2025 12:20

Copilot AI left a comment


Pull Request Overview

This PR introduces a "ZeroBus - File Mode Prototype DAB" that provides a lightweight, no-code file ingestion workflow for Databricks Unity Catalog tables using Auto Loader. The implementation allows users to configure tables via JSON, deploy resources through Databricks Asset Bundles (DAB), and drop files into volume paths for automatic ingestion.

  • Implements a complete file push workflow with configuration management, table validation, and auto-loader integration
  • Provides debugging capabilities through dev mode deployment and interactive notebook for table configuration refinement
  • Includes comprehensive documentation and examples for both quick start and troubleshooting scenarios
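To make the JSON-driven flow above concrete, here is a hypothetical sketch of a single table config entry and the shape a configs/tables.json file might take. Apart from "schema_hints" and the CSV/JSON formats, which appear in the reviewed code, the field names ("table", "format_options") are illustrative assumptions, not the bundle's actual schema.

```python
import json

# Hypothetical tables.json entry -- field names other than "schema_hints"
# and "format" are assumptions for illustration only.
example_config = {
    "table": "sales_events",                      # assumed: target Unity Catalog table name
    "format": "csv",                              # CSV and JSON are the formats handled by formatmanager.py
    "schema_hints": ["id INT", "amount DOUBLE"],  # hints merged into cloudFiles.schemaHints
    "format_options": {"header": "true"},         # assumed: extra Auto Loader reader options
}

# Written out, this is roughly what configs/tables.json could contain:
print(json.dumps([example_config], indent=2))
```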

Reviewed Changes

Copilot reviewed 15 out of 15 changed files in this pull request and generated 4 comments.

File Description
filepush/dab/src/utils/tablemanager.py Core table management utilities for config validation, volume path handling, and DataFrame creation with Auto Loader
filepush/dab/src/utils/initialization.py Initialization script for setting up workspace resources, volume structure, and environment configuration
filepush/dab/src/utils/formatmanager.py Format-specific Auto Loader configuration management for CSV and JSON with validation and option merging
filepush/dab/src/utils/envmanager.py Environment configuration management and catalog storage validation utilities
filepush/dab/src/ingestion.py DLT pipeline implementation for streaming data ingestion with dynamic table creation
filepush/dab/src/debug_table_config.py Interactive debugging notebook for testing and refining table configurations
filepush/dab/src/configs/tables.json Example table configuration file defining ingestion parameters
filepush/dab/resources/*.yml DAB resource definitions for volume, schema, pipeline, and job configurations
filepush/dab/databricks.yml Main DAB configuration with deployment targets and variables
filepush/README.md Comprehensive documentation with quick start guide and debugging instructions
filepush/.gitignore Git ignore configuration for the project
CODEOWNERS Adds code ownership assignment for the filepush directory


ws.schemas.update(full_name=f"{catalog_name}.{schema_name}", properties={
    "filepush.volume_path_root": volume_path_root,
    "filepush.volume_path_data": volume_path_data,
    "filepush.volume_path_data": volume_path_archive

Copilot AI Oct 4, 2025


Incorrect property key: the archive path should be stored under 'filepush.volume_path_archive', not 'filepush.volume_path_data'. As written, the duplicate key means the archive path silently overwrites the data path and the archive property is never set.

Suggested change
-     "filepush.volume_path_data": volume_path_archive
+     "filepush.volume_path_archive": volume_path_archive


def get_configs() -> list:
    json_path = os.path.join(os.path.dirname(os.path.dirname(__file__)), "configs", "tables.json")
    if not os.path.exists(json_path):
        raise RuntimeError(f"Missing table configs file: {json_path}. Please following README.md to create one, deploy and run configuration_job.")

Copilot AI Oct 4, 2025


Typo in the error message: 'Please following README.md' should read 'Please follow README.md'.

Suggested change
-         raise RuntimeError(f"Missing table configs file: {json_path}. Please following README.md to create one, deploy and run configuration_job.")
+         raise RuntimeError(f"Missing table configs file: {json_path}. Please follow README.md to create one, deploy and run configuration_job.")
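
A sketch of how the full helper might look with the message fixed and the JSON actually loaded. Only the path construction and the existence check appear in the diff; the json.load and the list check are assumptions.

```python
import json
import os

def get_configs() -> list:
    """Load the per-table ingestion configs from configs/tables.json (sketch)."""
    json_path = os.path.join(os.path.dirname(os.path.dirname(__file__)), "configs", "tables.json")
    if not os.path.exists(json_path):
        raise RuntimeError(
            f"Missing table configs file: {json_path}. "
            "Please follow README.md to create one, deploy and run configuration_job."
        )
    with open(json_path) as f:
        configs = json.load(f)
    if not isinstance(configs, list):
        raise RuntimeError(f"Expected a JSON list of table configs in {json_path}.")
    return configs
```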


# schema hints
schema_hints = table_config.get("schema_hints")
if schema_hints:
    reader = reader.option("cloudFiles.schemaHints", ", ".join({schema_hints} | fmt_mgr.default_schema))

Copilot AI Oct 4, 2025


Wrapping schema_hints in a set literal only handles the case where it is a single string; if the JSON config supplies a list of hints, {schema_hints} raises TypeError (unhashable type: 'list'). Normalize schema_hints to a set before taking the union with fmt_mgr.default_schema.

Suggested change
-     reader = reader.option("cloudFiles.schemaHints", ", ".join({schema_hints} | fmt_mgr.default_schema))
+     if isinstance(schema_hints, str):
+         schema_hints_set = {schema_hints}
+     else:
+         schema_hints_set = set(schema_hints)
+     reader = reader.option("cloudFiles.schemaHints", ", ".join(schema_hints_set | fmt_mgr.default_schema))
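
A quick standalone check of why the normalization matters; the default_schema value here is a made-up stand-in for whatever fmt_mgr.default_schema actually returns.

```python
default_schema = {"_rescued_data STRING"}   # hypothetical stand-in for fmt_mgr.default_schema
hints = ["id INT", "name STRING"]           # schema_hints supplied as a JSON list

# {hints} would raise TypeError (unhashable type: 'list'), so normalize first:
hints_set = {hints} if isinstance(hints, str) else set(hints)
print(", ".join(hints_set | default_schema))
# prints the hints plus the default column, e.g. "id INT, name STRING, _rescued_data STRING"
# (set union, so column order in the output is not guaranteed)
```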



_supported_formats: dict[str, AutoLoaderFormat] = {f.name: f for f in (CSV(), JSON())}

def get_format_manager(fmt: str) -> dict[str, str]:

Copilot AI Oct 4, 2025


Return type annotation is incorrect. The function returns an AutoLoaderFormat instance, not a dict[str, str].

Suggested change
- def get_format_manager(fmt: str) -> dict[str, str]:
+ def get_format_manager(fmt: str) -> AutoLoaderFormat:
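
With the annotation fixed, the lookup might read roughly as follows; the key normalization and the error for unsupported formats are assumptions beyond what the diff shows, and _supported_formats / AutoLoaderFormat are the names from the excerpt above.

```python
def get_format_manager(fmt: str) -> AutoLoaderFormat:
    """Return the AutoLoaderFormat registered for `fmt` (e.g. "csv" or "json")."""
    try:
        return _supported_formats[fmt.lower()]  # assumes the f.name keys are lowercase
    except KeyError:
        raise ValueError(
            f"Unsupported format '{fmt}'. Supported formats: {sorted(_supported_formats)}"
        ) from None
```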

