
Commit e8aadaa

hiroyukinakazato-db, asnare, sundarshankar89, and gueniai authored
Add llm-transpile command with Switch runner and integration tests (#2078)
## Changes

This PR adds the `llm-transpile` command for LLM-powered SQL conversion using the Switch transpiler.

### What does this PR do?

Adds an `llm-transpile` CLI command that runs Switch transpiler jobs with parameter-passing support.

### Relevant implementation details

**CLI Integration**:
- Add the `llm-transpile` command to the Lakebridge CLI
- Validate the input source (workspace paths and local files)
- Pass parameters through to Switch job runs

**Switch Runner Implementation**:
- `SwitchConfig`: manages Switch resources and job-ID retrieval from InstallState
- `SwitchRunner`: orchestrates Switch job execution with parameters

**Testing**:
- Unit tests for the Switch runner with parameter verification
- Integration tests for the Switch installation lifecycle

**Development Environment**:
- Add `.env` to `.gitignore` for local development credentials

### Caveats/things to watch out for when reviewing:

- **Parameter design**: follows the `transpile` and `recon` command patterns
- **Catalog/schema usage**: uses the values configured during Switch installation (following the `recon` pattern)
- **Output parameter naming**: uses `--output-ws-folder` (not `--output-folder`) to make explicit that the output destination is a workspace folder
- **Dependencies**: requires PR #2066 (Switch installation) to be merged first

```console
~ ❯ databricks labs lakebridge llm-transpile --input-source $HOME/IdeaProjects/switch/examples/workflow/airflow/input --output-ws-folder /Workspace/Users/<>/transpiled --source-dialect airflow
17:23:54 INFO [d.labs.lakebridge] Please read and accept the following comments before proceeding:
This Feature leverages a large language model (LLM) to analyse and convert your provided content, code and data. You consent to your content being transmitted to, processed by, and returned from the LLM hosted by Databricks foundational models or other external models you may configure during the runtime. The outputs of the LLM are generated automatically without human review, and may contain inaccuracies or errors. You are responsible for reviewing and validating all outputs before relying on them for any critical or production use. By running this feature you accept these conditions.
Enter catalog name (default: lakebridge): lakebridge
17:24:11 INFO [d.l.l.deployment.configurator] Found existing catalog `lakebridge`
Enter schema name (default: switch):
17:24:15 INFO [d.l.l.deployment.configurator] Found existing schema `switch` in catalog `lakebridge`
Enter volume name (default: switch_volume):
17:24:18 INFO [d.l.l.deployment.configurator] Found existing volume `switch_volume` in catalog `lakebridge` and schema `switch`
Select a Foundation Model serving endpoint:
[0] [Recommended] databricks-claude-sonnet-4-5
[1] databricks-bge-large-en
[2] databricks-claude-3-7-sonnet
[3] databricks-claude-opus-4
[4] databricks-claude-opus-4-1
[5] databricks-claude-sonnet-4
[6] databricks-gemini-2-5-flash
[7] databricks-gemini-2-5-pro
[8] databricks-gemma-3-12b
[9] databricks-gpt-5
[10] databricks-gpt-5-mini
[11] databricks-gpt-5-nano
[12] databricks-gpt-oss-120b
[13] databricks-gpt-oss-20b
[14] databricks-gte-large-en
[15] databricks-llama-4-maverick
[16] databricks-meta-llama-3-1-405b-instruct
[17] databricks-meta-llama-3-1-8b-instruct
[18] databricks-meta-llama-3-3-70b-instruct
[19] databricks-qwen3-next-80b-a3b-instruct
[20] databricks-shutterstock-imageai
Enter a number between 0 and 20: 0
17:24:32 INFO [d.l.l.transpiler.switch_runner] Uploading /Users/<>/IdeaProjects/switch/examples/workflow/airflow/input to /Volumes/lakebridge/switch/switch_volume/input_20251105115432_n2iz...
17:24:33 INFO [d.l.l.transpiler.switch_runner] Upload complete: /Volumes/lakebridge/switch/switch_volume/input_20251105115432_n2iz
17:24:33 INFO [d.l.l.transpiler.switch_runner] Triggering Switch job with job_id: job_id
17:24:34 INFO [d.l.l.transpiler.switch_runner] Switch LLM transpilation job started: https://<workspacename>/jobs/job_id/runs/run_id
[
  {
    "job_id": job_id,
    "run_id": run_id,
    "run_url": "https://<workspacename>/jobs/job_id/runs/run_id"
  }
]
```

### Linked issues

Resolves #2047

### Functionality

- [ ] added relevant user documentation
- [x] added new CLI command: `databricks labs lakebridge llm-transpile`
- [ ] modified existing command

### Tests

- [x] manually tested
- [x] added unit tests
- [x] added integration tests

---------

Co-authored-by: Andrew Snare <[email protected]>
Co-authored-by: sundarshankar89 <[email protected]>
Co-authored-by: SundarShankar89 <[email protected]>
Co-authored-by: Guenia Izquierdo <[email protected]>
Co-authored-by: Andrew Snare <[email protected]>
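Reviewer note: the following is a minimal sketch of the flow the command wires together, using the `SwitchRunner` API added in this PR. The workspace paths, catalog/schema/volume names, model name, and `job_id` are illustrative placeholders; in the real command the job ID comes from InstallState and the other values from CLI flags or interactive prompts.

```python
from pathlib import Path

from databricks.sdk import WorkspaceClient

from databricks.labs.lakebridge.transpiler.switch_runner import SwitchRunner

ws = WorkspaceClient()
runner = SwitchRunner(ws)

# Stage local sources into a unique timestamped path inside the UC Volume.
volume_input_path = runner.upload_to_volume(
    local_path=Path("examples/workflow/airflow/input"),
    catalog="lakebridge",
    schema="switch",
    volume="switch_volume",
)

# Trigger the Switch job; 123 stands in for the job ID the CLI resolves
# from InstallState after `install-transpile --include-llm-transpiler true`.
run_info = runner.run(
    volume_input_path=volume_input_path,
    output_ws_folder="/Workspace/Users/me@example.com/transpiled",
    source_tech="airflow",
    catalog="lakebridge",
    schema="switch",
    foundation_model="databricks-claude-sonnet-4-5",
    job_id=123,
)
print(run_info)  # [{"job_id": 123, "run_id": ..., "run_url": ...}]
```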
1 parent 09bd34e commit e8aadaa

File tree: 8 files changed, +561 −1 lines changed

.gitignore

Lines changed: 1 addition & 0 deletions

```diff
@@ -22,3 +22,4 @@ remorph_transpile/
 /linter/src/main/antlr4/library/gen/
 .databricks-login.json
 .mypy_cache
+.env
```

labs.yml

Lines changed: 20 additions & 0 deletions

```diff
@@ -46,6 +46,26 @@ commands:
     {{range .}}{{.total_files_processed}}\t{{.total_queries_processed}}\t{{.analysis_error_count}}\t{{.parsing_error_count}}\t{{.validation_error_count}}\t{{.generation_error_count}}\t{{.error_log_file}}
     {{end}}
+  - name: llm-transpile
+    description: Transpile SQL/ETL sources to Databricks using LLM-based conversion (EXPERIMENTAL)
+    flags:
+      - name: accept-terms
+        description: Whether to accept the terms for using LLM-based transpilation (`true|false`).
+      - name: input-source
+        description: Local `path` of the sources to be converted
+      - name: output-ws-folder
+        description: Output `path` where converted code will be written in the workspace. (Must start with '/Workspace/'.)
+      - name: source-dialect
+        description: The source dialect to use when performing conversion
+      - name: catalog-name
+        description: Databricks Catalog `name` to use. (Must already exist and have permissions.)
+      - name: schema-name
+        description: Databricks Schema `name` to use. (Must already exist and have permissions.)
+      - name: volume
+        description: Databricks UC Volume `name` for staging sources to convert. (Must already exist and have permissions.)
+      - name: foundation-model
+        description: The Foundation Model to use for conversion. (Must be available via the Databricks Model Serving Endpoint.)
   - name: reconcile
     description: Reconcile source and target data residing on Databricks
```
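For reviewers tracing the plumbing: each hyphenated flag above surfaces as the matching underscore keyword parameter on the `llm_transpile` entry point added in `cli.py` below. A hypothetical direct invocation with illustrative values, assuming the decorated function remains callable as a plain function:

```python
from databricks.sdk import WorkspaceClient

from databricks.labs.lakebridge.cli import llm_transpile

# Illustrative values only; each keyword mirrors a labs.yml flag.
llm_transpile(
    w=WorkspaceClient(),
    accept_terms=True,                                              # --accept-terms
    input_source="./sql_sources",                                   # --input-source
    output_ws_folder="/Workspace/Users/me@example.com/transpiled",  # --output-ws-folder
    source_dialect="airflow",                                       # --source-dialect
    catalog_name="lakebridge",                                      # --catalog-name
    schema_name="switch",                                           # --schema-name
    volume="switch_volume",                                         # --volume
    foundation_model="databricks-claude-sonnet-4-5",                # --foundation-model
)
```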

src/databricks/labs/lakebridge/cli.py

Lines changed: 133 additions & 0 deletions

```diff
@@ -36,9 +36,11 @@
 from databricks.labs.lakebridge.transpiler.lsp.lsp_engine import LSPEngine
 from databricks.labs.lakebridge.transpiler.repository import TranspilerRepository
 from databricks.labs.lakebridge.transpiler.sqlglot.sqlglot_engine import SqlglotEngine
+from databricks.labs.lakebridge.transpiler.switch_runner import SwitchRunner
 from databricks.labs.lakebridge.transpiler.transpile_engine import TranspileEngine

 from databricks.labs.lakebridge.transpiler.transpile_status import ErrorSeverity
+from databricks.labs.switch.lsp import get_switch_dialects


 # Subclass to allow controlled access to protected methods.
@@ -827,6 +829,137 @@ def analyze(
     logger.debug(f"User: {ctx.current_user}")


+def _validate_llm_transpile_args(
+    input_source: str | None,
+    output_ws_folder: str | None,
+    source_dialect: str | None,
+    prompts: Prompts,
+) -> tuple[str, str, str]:
+
+    _switch_dialects = get_switch_dialects()
+
+    # Prompt for any arguments that were not supplied on the command line
+    if not input_source:
+        input_source = prompts.question("Enter input SQL path")
+    if not output_ws_folder:
+        output_ws_folder = prompts.question("Enter output workspace folder (must start with /Workspace/)")
+    if not source_dialect:
+        source_dialect = prompts.choice("Select the source dialect", sorted(_switch_dialects))
+
+    # Validate that the input_source path exists (local path)
+    if not Path(input_source).exists():
+        raise_validation_exception(f"Invalid path for '--input-source': Path '{input_source}' does not exist.")
+
+    # Validate that output_ws_folder is a workspace path
+    if not str(output_ws_folder).startswith("/Workspace/"):
+        raise_validation_exception(
+            f"Invalid value for '--output-ws-folder': workspace output path must start with /Workspace/. Got: {output_ws_folder!r}"
+        )
+
+    if source_dialect not in _switch_dialects:
+        raise_validation_exception(
+            f"Invalid value for '--source-dialect': {source_dialect!r} must be one of: {', '.join(sorted(_switch_dialects))}"
+        )
+
+    return input_source, output_ws_folder, source_dialect
+
+
+@lakebridge.command
+def llm_transpile(
+    *,
+    w: WorkspaceClient,
+    accept_terms: bool = False,
+    input_source: str | None = None,
+    output_ws_folder: str | None = None,
+    source_dialect: str | None = None,
+    catalog_name: str | None = None,
+    schema_name: str | None = None,
+    volume: str | None = None,
+    foundation_model: str | None = None,
+    ctx: ApplicationContext | None = None,
+) -> None:
+    """Transpile source code to Databricks using the LLM transpiler (Switch)."""
+    if ctx is None:
+        ctx = ApplicationContext(w)
+    del w
+    ctx.add_user_agent_extra("cmd", "llm-transpile")
+    user = ctx.current_user
+    logger.debug(f"User: {user}")
+
+    if not accept_terms:
+        logger.warning(
+            """Please read and accept these terms before proceeding:
+            This feature leverages a Large Language Model (LLM) to analyse and convert
+            your provided content, code and data. You consent to your content being
+            transmitted to, processed by, and returned from the foundation models hosted
+            by Databricks or external foundation models you have configured in your
+            workspace. The outputs of the LLM are generated automatically without human
+            review, and may contain inaccuracies or errors. You are responsible for
+            reviewing and validating all outputs before relying on them for any critical
+            or production use.
+
+            By using this feature you accept these terms; re-run with '--accept-terms=true'.
+            """
+        )
+        raise SystemExit("LLM transpiler terms not accepted, exiting.")
+
+    prompts = ctx.prompts
+    resource_configurator = ctx.resource_configurator
+
+    # Prompt for (and validate) any arguments missing from the CLI invocation
+    input_source, output_ws_folder, source_dialect = _validate_llm_transpile_args(
+        input_source,
+        output_ws_folder,
+        source_dialect,
+        prompts,
+    )
+
+    if catalog_name is None:
+        catalog_name = resource_configurator.prompt_for_catalog_setup(default_catalog_name="lakebridge")
+
+    if schema_name is None:
+        schema_name = resource_configurator.prompt_for_schema_setup(catalog=catalog_name, default_schema_name="switch")
+
+    if volume is None:
+        volume = resource_configurator.prompt_for_volume_setup(
+            catalog=catalog_name, schema=schema_name, default_volume_name="switch_volume"
+        )
+
+    resource_configurator.has_necessary_access(catalog_name, schema_name, volume)
+
+    if foundation_model is None:
+        foundation_model = resource_configurator.prompt_for_foundation_model_choice()
+
+    job_list = ctx.install_state.jobs
+    if "Switch" not in job_list:
+        logger.debug(f"Missing Switch from installed state jobs: {job_list!r}")
+        raise RuntimeError(
+            "Switch Job not found. "
+            "Please run 'databricks labs lakebridge install-transpile --include-llm-transpiler true' first."
+        )
+    job_id = int(job_list["Switch"])
+    logger.debug(f"Switch job ID found: {job_id}")
+
+    ctx.add_user_agent_extra("transpiler_source_dialect", source_dialect)
+    job_runner = SwitchRunner(ctx.workspace_client)
+    volume_input_path = job_runner.upload_to_volume(
+        local_path=Path(input_source),
+        catalog=catalog_name,
+        schema=schema_name,
+        volume=volume,
+    )
+
+    job_runner.run(
+        volume_input_path=volume_input_path,
+        output_ws_folder=output_ws_folder,
+        source_tech=source_dialect,
+        catalog=catalog_name,
+        schema=schema_name,
+        foundation_model=foundation_model,
+        job_id=job_id,
+    )
+
+
 @lakebridge.command()
 def create_profiler_dashboard(
     *,
```
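A sketch of how the validation helper might be unit-tested with blueprint's `MockPrompts`; the test name, the stubbed dialect list, and the monkeypatch are illustrative, not the PR's actual test code:

```python
from databricks.labs.blueprint.tui import MockPrompts

from databricks.labs.lakebridge import cli


def test_validate_llm_transpile_args_prompts_for_missing_values(tmp_path, monkeypatch) -> None:
    # Stub the dialect lookup so this sketch has no hard dependency on the
    # databricks-labs-switch package being importable.
    monkeypatch.setattr(cli, "get_switch_dialects", lambda: ["airflow", "synapse"])
    prompts = MockPrompts(
        {
            r"Enter input SQL path": str(tmp_path),
            r"Enter output workspace folder": "/Workspace/Users/me@example.com/out",
            r"Select the source dialect": "0",  # picks the first sorted dialect
        }
    )

    result = cli._validate_llm_transpile_args(None, None, None, prompts)

    assert result == (str(tmp_path), "/Workspace/Users/me@example.com/out", "airflow")
```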

src/databricks/labs/lakebridge/config.py

Lines changed: 1 addition & 0 deletions

```diff
@@ -166,6 +166,7 @@ class TranspileConfig:
     error_file_path: str | None = None
     sdk_config: dict[str, str] | None = None
     skip_validation: bool = False
+    include_llm: bool = False
     catalog_name: str = "remorph"
     schema_name: str = "transpiler"
     transpiler_options: JsonValue = None
```
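The new flag defaults to off. Assuming `TranspileConfig` is a standard dataclass (as the field syntax suggests), an existing instance can opt in without touching its other fields; a minimal sketch:

```python
from dataclasses import replace

from databricks.labs.lakebridge.config import TranspileConfig


def enable_llm(config: TranspileConfig) -> TranspileConfig:
    # Return a copy of the config with only the new flag flipped on;
    # `config` stands in for an instance loaded elsewhere (its required
    # fields are not shown in this diff).
    return replace(config, include_llm=True)
```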
src/databricks/labs/lakebridge/transpiler/switch_runner.py

Lines changed: 139 additions & 0 deletions

```diff
@@ -0,0 +1,139 @@
+import io
+import logging
+import os
+import random
+import string
+from datetime import datetime, timezone
+from pathlib import Path
+
+from databricks.labs.blueprint.installation import RootJsonValue
+from databricks.sdk import WorkspaceClient
+
+logger = logging.getLogger(__name__)
+
+
+class SwitchRunner:
+    """Runner for Switch LLM transpilation jobs."""
+
+    def __init__(
+        self,
+        ws: WorkspaceClient,
+    ):
+        self._ws = ws
+
+    def run(
+        self,
+        volume_input_path: str,
+        output_ws_folder: str,
+        source_tech: str,
+        catalog: str,
+        schema: str,
+        foundation_model: str,
+        job_id: int,
+    ) -> RootJsonValue:
+        """Trigger the Switch job."""
+        job_params = self._build_job_parameters(
+            input_dir=volume_input_path,
+            output_dir=output_ws_folder,
+            source_tech=source_tech,
+            catalog=catalog,
+            schema=schema,
+            foundation_model=foundation_model,
+        )
+        logger.info(f"Triggering Switch job with job_id: {job_id}")
+
+        return self._run_job(job_id, job_params)
+
+    def upload_to_volume(
+        self,
+        local_path: Path,
+        catalog: str,
+        schema: str,
+        volume: str,
+    ) -> str:
+        """Upload local files to a UC Volume under a unique timestamped path."""
+        now = datetime.now(timezone.utc)
+        time_part = now.strftime("%Y%m%d%H%M%S")
+        random_part = ''.join(random.choices(string.ascii_lowercase + string.digits, k=4))
+        volume_base_path = f"/Volumes/{catalog}/{schema}/{volume}"
+        volume_input_path = f"{volume_base_path}/input-{time_part}-{random_part}"
+
+        logger.info(f"Uploading {local_path} to {volume_input_path}...")
+
+        # File upload
+        if local_path.is_file():
+            if local_path.name.startswith('.'):
+                logger.debug(f"Skipping hidden file: {local_path}")
+                return volume_input_path
+            volume_file_path = f"{volume_input_path}/{local_path.name}"
+            with open(local_path, 'rb') as f:
+                content = f.read()
+            self._ws.files.upload(file_path=volume_file_path, contents=io.BytesIO(content), overwrite=True)
+            logger.debug(f"Uploaded: {local_path} -> {volume_file_path}")
+
+        # Directory upload
+        else:
+            for root, dirs, files in os.walk(local_path):
+                # Remove hidden directories
+                dirs[:] = [d for d in dirs if not d.startswith('.')]
+                # Skip hidden files
+                files = [f for f in files if not f.startswith('.')]
+                for file in files:
+                    local_file = Path(root) / file
+                    relative_path = local_file.relative_to(local_path)
+                    volume_file_path = f"{volume_input_path}/{relative_path}"
+
+                    with open(local_file, 'rb') as f:
+                        content = f.read()
+
+                    self._ws.files.upload(file_path=volume_file_path, contents=io.BytesIO(content), overwrite=True)
+                    logger.debug(f"Uploaded: {local_file} -> {volume_file_path}")
+
+        logger.info(f"Upload complete: {volume_input_path}")
+        return volume_input_path
+
+    def _build_job_parameters(
+        self,
+        input_dir: str,
+        output_dir: str,
+        source_tech: str,
+        catalog: str,
+        schema: str,
+        foundation_model: str,
+        switch_options: dict | None = None,
+    ) -> dict[str, str]:
+        """Build the Switch job parameters."""
+        if switch_options is None:
+            switch_options = {}
+        return {
+            "input_dir": input_dir,
+            "output_dir": output_dir,
+            "source_tech": source_tech,
+            "catalog": catalog,
+            "schema": schema,
+            "foundation_model": foundation_model,
+            **switch_options,
+        }
+
+    def _run_job(
+        self,
+        job_id: int,
+        job_params: dict[str, str],
+    ) -> RootJsonValue:
+        """Trigger the Switch job and return run information."""
+        job_run = self._ws.jobs.run_now(job_id, job_parameters=job_params)
+
+        if not job_run.run_id:
+            raise SystemExit(f"Job {job_id} execution failed.")
+
+        job_run_url = f"{self._ws.config.host}/jobs/{job_id}/runs/{job_run.run_id}"
+        logger.info(f"Switch LLM transpilation job started: {job_run_url}")
+
+        return [
+            {
+                "job_id": job_id,
+                "run_id": job_run.run_id,
+                "run_url": job_run_url,
+            }
+        ]
```
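The PR description mentions unit tests that verify the parameters passed to the Switch job; a minimal sketch of that idea with a mocked `WorkspaceClient` (the test name and values are illustrative, not the repository's actual tests):

```python
from unittest.mock import MagicMock, create_autospec

from databricks.sdk import WorkspaceClient

from databricks.labs.lakebridge.transpiler.switch_runner import SwitchRunner


def test_run_passes_job_parameters() -> None:
    ws = create_autospec(WorkspaceClient)
    ws.config.host = "https://example.cloud.databricks.com"
    ws.jobs.run_now.return_value = MagicMock(run_id=456)

    result = SwitchRunner(ws).run(
        volume_input_path="/Volumes/lakebridge/switch/switch_volume/input-20250101000000-abcd",
        output_ws_folder="/Workspace/Users/me@example.com/transpiled",
        source_tech="airflow",
        catalog="lakebridge",
        schema="switch",
        foundation_model="databricks-claude-sonnet-4-5",
        job_id=123,
    )

    # run() should forward every value verbatim as a Switch job parameter.
    ws.jobs.run_now.assert_called_once_with(
        123,
        job_parameters={
            "input_dir": "/Volumes/lakebridge/switch/switch_volume/input-20250101000000-abcd",
            "output_dir": "/Workspace/Users/me@example.com/transpiled",
            "source_tech": "airflow",
            "catalog": "lakebridge",
            "schema": "switch",
            "foundation_model": "databricks-claude-sonnet-4-5",
        },
    )
    assert result == [
        {
            "job_id": 123,
            "run_id": 456,
            "run_url": "https://example.cloud.databricks.com/jobs/123/runs/456",
        }
    ]
```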

tests/conftest.py

Lines changed: 15 additions & 0 deletions

```diff
@@ -312,6 +312,21 @@ def morpheus_artifact() -> Path:
     return artifact


+@pytest.fixture
+def switch_artifact() -> Path:
+    """Get the Switch wheel for testing."""
+    artifact = (
+        Path(__file__).parent
+        / "resources"
+        / "transpiler_configs"
+        / "switch"
+        / "wheel"
+        / "databricks_switch_plugin-0.1.2-py3-none-any.whl"
+    )
+    assert artifact.exists(), f"Switch artifact not found: {artifact}"
+    return artifact
+
+
 class FakeDataSource(DataSource):

     def __init__(self, start_delimiter: str, end_delimiter: str):
```
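Tests can request the new fixture by name to obtain the bundled wheel; a hypothetical example (not one of the PR's actual tests):

```python
from pathlib import Path


def test_switch_wheel_is_bundled(switch_artifact: Path) -> None:
    # The fixture itself asserts the wheel exists; sanity-check the artifact.
    assert switch_artifact.suffix == ".whl"
    assert switch_artifact.stat().st_size > 0
```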
