A DuckDB SQL transformation component for the Keboola platform with block-based orchestration.
Features:
- Consecutive Blocks: Blocks execute in order, ensuring logical separation of processing phases
- Parallel Scripts: Scripts within each block run in parallel when dependencies allow
- Automatic DAG: Component creates its own dependency graph based on SQL analysis
- SQLGlot Integration: Advanced SQL parsing and dependency detection
- Performance Optimization: Parallel execution with configurable thread limits
- System Resource Detection: Automatic detection of CPU and memory limits for optimal DuckDB settings
- Local File Support: Support for CSV and Parquet files from local storage
- Data Type Inference: Optional automatic data type detection for CSV files
- SQL Validation: Startup and on-demand SQL syntax validation
- Visualization Actions: Execution plan and data lineage visualization
| Feature | Description |
|---|---|
| Block-Based Orchestration | Consecutive blocks with parallel script execution |
| Automatic DAG Creation | SQL dependency analysis and execution planning |
| SQLGlot Integration | Advanced SQL parsing and syntax validation |
| Parallel Processing | Configurable thread limits for performance |
| Memory Management | Configurable memory limits for DuckDB |
| Syntax Checking | Startup and on-demand SQL validation |
| System Resource Detection | Automatic CPU and memory detection for optimal settings |
| Local File Support | Support for CSV and Parquet files from local storage |
| Data Type Inference | Optional automatic data type detection for CSV files |
| Execution Visualization | Visualize execution plan and data lineage |
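The Automatic DAG and SQLGlot features boil down to statement-level dependency extraction: which tables a script creates versus which it reads. A minimal sketch of the idea, assuming sqlglot's parser (illustrative only, not the component's actual code):

```python
import sqlglot
from sqlglot import exp

def dependencies(sql: str) -> tuple[set[str], set[str]]:
    """Split table references in one statement into (created, read)."""
    tree = sqlglot.parse_one(sql, read="duckdb")
    created: set[str] = set()
    create = tree.find(exp.Create)
    if create:
        target = create.find(exp.Table)  # the object being created
        if target:
            created.add(target.name)
    read = {t.name for t in tree.find_all(exp.Table)} - created
    return created, read

# 'clean_table' depends on 'input_table', so any script creating
# 'input_table' must be scheduled first.
print(dependencies("CREATE VIEW clean_table AS SELECT * FROM input_table"))
```

Edges like this one, collected across all scripts, form the execution DAG.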
The component uses a block-based configuration structure:
```json
{
  "parameters": {
    "blocks": [
      {
        "name": "Data Preparation",
        "codes": [
          {
            "name": "Clean Data",
            "script": [
              "CREATE VIEW clean_table AS SELECT * FROM input_table WHERE valid = true;"
            ]
          }
        ]
      }
    ],
    "threads": 4,
    "max_memory_mb": 2048,
    "dtypes_infer": false,
    "debug": false,
    "syntax_check_on_startup": false
  }
}
```
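The execution model implied by this structure: blocks run one after another, while the codes inside a block may run concurrently. A rough sketch, assuming a shared DuckDB connection with one cursor per thread (a hypothetical helper, not the component's actual scheduler, which also consults the dependency graph):

```python
from concurrent.futures import ThreadPoolExecutor

import duckdb

def run_script(con: duckdb.DuckDBPyConnection, statements: list[str]) -> None:
    cur = con.cursor()  # a per-thread cursor over the shared database
    for stmt in statements:
        cur.execute(stmt)

def run_blocks(blocks: list[dict], threads: int = 4) -> None:
    con = duckdb.connect()
    for block in blocks:  # blocks execute consecutively, in listed order
        with ThreadPoolExecutor(max_workers=threads) as pool:
            futures = [pool.submit(run_script, con, code["script"])
                       for code in block["codes"]]
            for f in futures:
                f.result()  # surface any script failure before the next block
```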
Parameters:
- `blocks`: Array of processing blocks (executed consecutively)
- `threads`: Number of parallel threads for query execution (None for auto-detection; see the sketch below)
- `max_memory_mb`: Memory limit for DuckDB in MB (None for auto-detection)
- `dtypes_infer`: Enable automatic data type inference for CSV files (default: false)
- `debug`: Enable debug logging (default: false)
- `syntax_check_on_startup`: Validate SQL syntax before execution (default: false)
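When `threads` or `max_memory_mb` is left unset, values are derived from the host system. A sketch of what such auto-detection can look like (psutil is an assumption here, one way of reading total RAM, not necessarily the library the component uses):

```python
import os

import duckdb
import psutil  # assumption: any means of reading total RAM works here

def apply_limits(con: duckdb.DuckDBPyConnection,
                 threads: int | None = None,
                 max_memory_mb: int | None = None) -> None:
    threads = threads or os.cpu_count() or 1
    max_memory_mb = max_memory_mb or psutil.virtual_memory().total // 2**20
    con.execute(f"SET threads TO {threads}")
    con.execute(f"SET memory_limit = '{max_memory_mb}MB'")

apply_limits(duckdb.connect())  # both limits auto-detected
```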
Input Sources:
- Local Files: CSV and Parquet files from local storage
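DuckDB can query both formats in place, and `dtypes_infer` maps naturally onto its CSV reader options. An illustrative snippet (file paths are placeholders; `all_varchar=true` approximates `dtypes_infer: false` by loading every CSV column as text):

```python
import duckdb

con = duckdb.connect()

# Parquet files carry their own schema, so types come for free.
con.execute("CREATE VIEW orders AS SELECT * FROM read_parquet('in/tables/orders.parquet')")

# CSV with type inference enabled (dtypes_infer: true)...
con.execute("CREATE VIEW users AS SELECT * FROM read_csv('in/tables/users.csv')")

# ...or with every column read as VARCHAR (dtypes_infer: false).
con.execute(
    "CREATE VIEW users_raw AS SELECT * FROM read_csv('in/tables/users.csv', all_varchar=true)"
)
```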
Sync Actions:
- `syntax_check`: Validate SQL syntax without execution (sketched below)
- `lineage_visualization`: Generate data lineage visualization
- `execution_plan_visualization`: Visualize the execution plan
- `expected_input_tables`: Show expected input tables
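The `syntax_check` action is essentially parse-without-execute. A minimal sketch using sqlglot's DuckDB dialect (illustrative; the component's actual validation may report more detail):

```python
import sqlglot
from sqlglot.errors import ParseError

def syntax_check(scripts: list[str]) -> list[str]:
    """Return one message per script that fails to parse; nothing is executed."""
    errors = []
    for i, sql in enumerate(scripts):
        try:
            sqlglot.parse(sql, read="duckdb")
        except ParseError as exc:
            errors.append(f"script {i}: {exc}")
    return errors

print(syntax_check(["SELECT 1", "SELECT * FROM ("]))  # second script is reported
```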
The component exports tables to CSV files with manifests into `out/tables` and file manifests into `out/files`.
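Each exported table is a CSV plus a `.manifest` JSON side file, per the Keboola common interface. A simplified sketch (the `destination` bucket is a made-up example):

```python
import csv
import json
from pathlib import Path

def export_table(rows: list[dict], name: str, out_dir: str = "out/tables") -> None:
    out = Path(out_dir)
    out.mkdir(parents=True, exist_ok=True)
    with (out / f"{name}.csv").open("w", newline="") as f:
        writer = csv.DictWriter(f, fieldnames=list(rows[0]))
        writer.writeheader()
        writer.writerows(rows)
    # Minimal manifest; real manifests can carry columns, metadata, etc.
    manifest = {"destination": f"out.c-main.{name}"}  # hypothetical bucket name
    (out / f"{name}.csv.manifest").write_text(json.dumps(manifest))

export_table([{"id": 1, "valid": True}], "clean_table")
```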
To customize the local data folder path, replace the `CUSTOM_FOLDER` placeholder with your desired path in the `docker-compose.yml` file:

```yaml
volumes:
  - ./:/code
  - ./CUSTOM_FOLDER:/data
```
Clone this repository, initialize the workspace, and run the component using the following commands:
```shell
git clone git@github.com:keboola/component-duckdb-transformation.git keboola.duckdb_transformation
cd keboola.duckdb_transformation
docker-compose build
docker-compose run --rm dev
```
Run the test suite and perform lint checks using this command:
```shell
docker-compose run --rm test
```
For details about deployment and integration with Keboola, refer to the deployment section of the developer documentation.