
Commit e8aadaa

hiroyukinakazato-db, asnare, sundarshankar89, and gueniai authored
Add llm-transpile command with Switch runner and integration tests (#2078)
## Changes

This PR adds the `llm-transpile` command for LLM-powered SQL conversion using the Switch transpiler.

### What does this PR do?

Adds an `llm-transpile` CLI command that runs Switch transpiler jobs with parameter-passing support.

### Relevant implementation details

**CLI Integration**:
- Add the `llm-transpile` command to the Lakebridge CLI
- Validate the input source (workspace paths and local files)
- Pass parameters through to Switch job runs

**Switch Runner Implementation**:
- `SwitchConfig`: manages Switch resources and job-ID retrieval from InstallState
- `SwitchRunner`: orchestrates Switch job execution with parameters

**Testing**:
- Unit tests for the Switch runner with parameter verification
- Integration tests for the Switch installation lifecycle

**Development Environment**:
- Add `.env` to `.gitignore` for local development credentials

### Caveats/things to watch out for when reviewing:

- **Parameter design**: follows the `transpile` and `recon` command patterns
- **Catalog/schema usage**: uses the values configured during Switch installation (following the `recon` pattern)
- **Output parameter naming**: uses `--output-ws-folder` (not `--output-folder`) to make explicit that the output destination is a workspace folder
- **Dependencies**: requires PR #2066 (Switch installation) to be merged first

```console
~ ❯ databricks labs lakebridge llm-transpile --input-source $HOME/IdeaProjects/switch/examples/workflow/airflow/input --output-ws-folder /Workspace/Users/<>/transpiled --source-dialect airflow
17:23:54 INFO [d.labs.lakebridge] Please read and accept the following comments before proceeding:
This Feature leverages a large language model (LLM) to analyse and convert your provided content, code and data. You consent to your content being transmitted to, processed by, and returned from the LLM hosted by Databricks foundational models or other external models you may configure during the runtime. The outputs of the LLM are generated automatically without human review, and may contain inaccuracies or errors. You are responsible for reviewing and validating all outputs before relying on them for any critical or production use. By running this feature you accept these conditions.
Enter catalog name (default: lakebridge): lakebridge
17:24:11 INFO [d.l.l.deployment.configurator] Found existing catalog `lakebridge`
Enter schema name (default: switch):
17:24:15 INFO [d.l.l.deployment.configurator] Found existing schema `switch` in catalog `lakebridge`
Enter volume name (default: switch_volume):
17:24:18 INFO [d.l.l.deployment.configurator] Found existing volume `switch_volume` in catalog `lakebridge` and schema `switch`
Select a Foundation Model serving endpoint:
[0] [Recommended] databricks-claude-sonnet-4-5
[1] databricks-bge-large-en
[2] databricks-claude-3-7-sonnet
[3] databricks-claude-opus-4
[4] databricks-claude-opus-4-1
[5] databricks-claude-sonnet-4
[6] databricks-gemini-2-5-flash
[7] databricks-gemini-2-5-pro
[8] databricks-gemma-3-12b
[9] databricks-gpt-5
[10] databricks-gpt-5-mini
[11] databricks-gpt-5-nano
[12] databricks-gpt-oss-120b
[13] databricks-gpt-oss-20b
[14] databricks-gte-large-en
[15] databricks-llama-4-maverick
[16] databricks-meta-llama-3-1-405b-instruct
[17] databricks-meta-llama-3-1-8b-instruct
[18] databricks-meta-llama-3-3-70b-instruct
[19] databricks-qwen3-next-80b-a3b-instruct
[20] databricks-shutterstock-imageai
Enter a number between 0 and 20: 0
17:24:32 INFO [d.l.l.transpiler.switch_runner] Uploading /Users/<>/IdeaProjects/switch/examples/workflow/airflow/input to /Volumes/lakebridge/switch/switch_volume/input_20251105115432_n2iz...
17:24:33 INFO [d.l.l.transpiler.switch_runner] Upload complete: /Volumes/lakebridge/switch/switch_volume/input_20251105115432_n2iz
17:24:33 INFO [d.l.l.transpiler.switch_runner] Triggering Switch job with job_id: job_id
17:24:34 INFO [d.l.l.transpiler.switch_runner] Switch LLM transpilation job started: https://<workspacename>/jobs/job_id/runs/run_id
[
  {
    "job_id": job_id,
    "run_id": run_id,
    "run_url": "https://<workspacename>/jobs/job_id/runs/run_id"
  }
]
```

### Linked issues

Resolves #2047

### Functionality

- [ ] added relevant user documentation
- [x] added new CLI command: `databricks labs lakebridge llm-transpile`
- [ ] modified existing command

### Tests

- [x] manually tested
- [x] added unit tests
- [x] added integration tests

---------

Co-authored-by: Andrew Snare <[email protected]>
Co-authored-by: sundarshankar89 <[email protected]>
Co-authored-by: SundarShankar89 <[email protected]>
Co-authored-by: Guenia Izquierdo <[email protected]>
Co-authored-by: Andrew Snare <[email protected]>
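Reviewer note: the following is a minimal sketch of the flow the command wires together, using the `SwitchRunner` API added in this PR. The workspace paths, catalog/schema/volume names, model name, and `job_id` are illustrative placeholders; in the real command the job ID comes from InstallState and the other values from CLI flags or interactive prompts.

```python
from pathlib import Path

from databricks.sdk import WorkspaceClient

from databricks.labs.lakebridge.transpiler.switch_runner import SwitchRunner

ws = WorkspaceClient()
runner = SwitchRunner(ws)

# Stage local sources into a unique timestamped path inside the UC Volume.
volume_input_path = runner.upload_to_volume(
    local_path=Path("examples/workflow/airflow/input"),
    catalog="lakebridge",
    schema="switch",
    volume="switch_volume",
)

# Trigger the Switch job; 123 stands in for the job ID the CLI resolves
# from InstallState after `install-transpile --include-llm-transpiler true`.
run_info = runner.run(
    volume_input_path=volume_input_path,
    output_ws_folder="/Workspace/Users/me@example.com/transpiled",
    source_tech="airflow",
    catalog="lakebridge",
    schema="switch",
    foundation_model="databricks-claude-sonnet-4-5",
    job_id=123,
)
print(run_info)  # [{"job_id": 123, "run_id": ..., "run_url": ...}]
```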
1 parent 09bd34e commit e8aadaa

File tree: 8 files changed, +561 −1 lines changed

.gitignore

Lines changed: 1 addition & 0 deletions

```diff
@@ -22,3 +22,4 @@ remorph_transpile/
 /linter/src/main/antlr4/library/gen/
 .databricks-login.json
 .mypy_cache
+.env
```

labs.yml

Lines changed: 20 additions & 0 deletions

```diff
@@ -46,6 +46,26 @@ commands:
     {{range .}}{{.total_files_processed}}\t{{.total_queries_processed}}\t{{.analysis_error_count}}\t{{.parsing_error_count}}\t{{.validation_error_count}}\t{{.generation_error_count}}\t{{.error_log_file}}
     {{end}}
+  - name: llm-transpile
+    description: Transpile SQL/ETL sources to Databricks using LLM-based conversion (EXPERIMENTAL)
+    flags:
+      - name: accept-terms
+        description: Whether to accept the terms for using LLM-based transpilation (`true|false`).
+      - name: input-source
+        description: Local `path` of the sources to be converted
+      - name: output-ws-folder
+        description: Output `path` where converted code will be written in the workspace. (Must start with '/Workspace/'.)
+      - name: source-dialect
+        description: The source dialect to use when performing conversion
+      - name: catalog-name
+        description: Databricks Catalog `name` to use. (Must already exist and have permissions.)
+      - name: schema-name
+        description: Databricks Schema `name` to use. (Must already exist and have permissions.)
+      - name: volume
+        description: Databricks UC Volume `name` for staging sources to convert. (Must already exist and have permissions.)
+      - name: foundation-model
+        description: The Foundation Model to use for conversion. (Must be available via the Databricks Model Serving Endpoint.)
   - name: reconcile
     description: Reconcile source and target data residing on Databricks
```
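For reviewers tracing the plumbing: each hyphenated flag above surfaces as the matching underscore keyword parameter on the `llm_transpile` entry point added in `cli.py` below. A hypothetical direct invocation with illustrative values, assuming the decorated function remains callable as a plain function:

```python
from databricks.sdk import WorkspaceClient

from databricks.labs.lakebridge.cli import llm_transpile

# Illustrative values only; each keyword mirrors a labs.yml flag.
llm_transpile(
    w=WorkspaceClient(),
    accept_terms=True,                                              # --accept-terms
    input_source="./sql_sources",                                   # --input-source
    output_ws_folder="/Workspace/Users/me@example.com/transpiled",  # --output-ws-folder
    source_dialect="airflow",                                       # --source-dialect
    catalog_name="lakebridge",                                      # --catalog-name
    schema_name="switch",                                           # --schema-name
    volume="switch_volume",                                         # --volume
    foundation_model="databricks-claude-sonnet-4-5",                # --foundation-model
)
```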

src/databricks/labs/lakebridge/cli.py

Lines changed: 133 additions & 0 deletions

```diff
@@ -36,9 +36,11 @@
 from databricks.labs.lakebridge.transpiler.lsp.lsp_engine import LSPEngine
 from databricks.labs.lakebridge.transpiler.repository import TranspilerRepository
 from databricks.labs.lakebridge.transpiler.sqlglot.sqlglot_engine import SqlglotEngine
+from databricks.labs.lakebridge.transpiler.switch_runner import SwitchRunner
 from databricks.labs.lakebridge.transpiler.transpile_engine import TranspileEngine

 from databricks.labs.lakebridge.transpiler.transpile_status import ErrorSeverity
+from databricks.labs.switch.lsp import get_switch_dialects


 # Subclass to allow controlled access to protected methods.
@@ -827,6 +829,137 @@ def analyze(
     logger.debug(f"User: {ctx.current_user}")


+def _validate_llm_transpile_args(
+    input_source: str | None,
+    output_ws_folder: str | None,
+    source_dialect: str | None,
+    prompts: Prompts,
+) -> tuple[str, str, str]:
+
+    _switch_dialects = get_switch_dialects()
+
+    # Prompt for any arguments that were not supplied on the command line
+    if not input_source:
+        input_source = prompts.question("Enter input SQL path")
+    if not output_ws_folder:
+        output_ws_folder = prompts.question("Enter output workspace folder (must start with /Workspace/)")
+    if not source_dialect:
+        source_dialect = prompts.choice("Select the source dialect", sorted(_switch_dialects))
+
+    # Validate that the input_source path exists (local path)
+    if not Path(input_source).exists():
+        raise_validation_exception(f"Invalid path for '--input-source': Path '{input_source}' does not exist.")
+
+    # Validate that output_ws_folder is a workspace path
+    if not str(output_ws_folder).startswith("/Workspace/"):
+        raise_validation_exception(
+            f"Invalid value for '--output-ws-folder': workspace output path must start with /Workspace/. Got: {output_ws_folder!r}"
+        )
+
+    if source_dialect not in _switch_dialects:
+        raise_validation_exception(
+            f"Invalid value for '--source-dialect': {source_dialect!r} must be one of: {', '.join(sorted(_switch_dialects))}"
+        )
+
+    return input_source, output_ws_folder, source_dialect
+
+
+@lakebridge.command
+def llm_transpile(
+    *,
+    w: WorkspaceClient,
+    accept_terms: bool = False,
+    input_source: str | None = None,
+    output_ws_folder: str | None = None,
+    source_dialect: str | None = None,
+    catalog_name: str | None = None,
+    schema_name: str | None = None,
+    volume: str | None = None,
+    foundation_model: str | None = None,
+    ctx: ApplicationContext | None = None,
+) -> None:
+    """Transpile source code to Databricks using the LLM transpiler (Switch)."""
+    if ctx is None:
+        ctx = ApplicationContext(w)
+    del w
+    ctx.add_user_agent_extra("cmd", "llm-transpile")
+    user = ctx.current_user
+    logger.debug(f"User: {user}")
+
+    if not accept_terms:
+        logger.warning(
+            """Please read and accept these terms before proceeding:
+            This feature leverages a Large Language Model (LLM) to analyse and convert
+            your provided content, code and data. You consent to your content being
+            transmitted to, processed by, and returned from the foundation models hosted
+            by Databricks or external foundation models you have configured in your
+            workspace. The outputs of the LLM are generated automatically without human
+            review, and may contain inaccuracies or errors. You are responsible for
+            reviewing and validating all outputs before relying on them for any critical
+            or production use.
+
+            By using this feature you accept these terms; re-run with '--accept-terms=true'.
+            """
+        )
+        raise SystemExit("LLM transpiler terms not accepted, exiting.")
+
+    prompts = ctx.prompts
+    resource_configurator = ctx.resource_configurator
+
+    # Prompt for (and validate) any arguments missing from the CLI invocation
+    input_source, output_ws_folder, source_dialect = _validate_llm_transpile_args(
+        input_source,
+        output_ws_folder,
+        source_dialect,
+        prompts,
+    )
+
+    if catalog_name is None:
+        catalog_name = resource_configurator.prompt_for_catalog_setup(default_catalog_name="lakebridge")
+
+    if schema_name is None:
+        schema_name = resource_configurator.prompt_for_schema_setup(catalog=catalog_name, default_schema_name="switch")
+
+    if volume is None:
+        volume = resource_configurator.prompt_for_volume_setup(
+            catalog=catalog_name, schema=schema_name, default_volume_name="switch_volume"
+        )
+
+    resource_configurator.has_necessary_access(catalog_name, schema_name, volume)
+
+    if foundation_model is None:
+        foundation_model = resource_configurator.prompt_for_foundation_model_choice()
+
+    job_list = ctx.install_state.jobs
+    if "Switch" not in job_list:
+        logger.debug(f"Missing Switch from installed state jobs: {job_list!r}")
+        raise RuntimeError(
+            "Switch Job not found. "
+            "Please run 'databricks labs lakebridge install-transpile --include-llm-transpiler true' first."
+        )
+    job_id = int(job_list["Switch"])
+    logger.debug(f"Switch job ID found: {job_id}")
+
+    ctx.add_user_agent_extra("transpiler_source_dialect", source_dialect)
+    job_runner = SwitchRunner(ctx.workspace_client)
+    volume_input_path = job_runner.upload_to_volume(
+        local_path=Path(input_source),
+        catalog=catalog_name,
+        schema=schema_name,
+        volume=volume,
+    )
+
+    job_runner.run(
+        volume_input_path=volume_input_path,
+        output_ws_folder=output_ws_folder,
+        source_tech=source_dialect,
+        catalog=catalog_name,
+        schema=schema_name,
+        foundation_model=foundation_model,
+        job_id=job_id,
+    )
+
+
 @lakebridge.command()
 def create_profiler_dashboard(
     *,
```
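A sketch of how the validation helper might be unit-tested with blueprint's `MockPrompts`; the test name, the stubbed dialect list, and the monkeypatch are illustrative, not the PR's actual test code:

```python
from databricks.labs.blueprint.tui import MockPrompts

from databricks.labs.lakebridge import cli


def test_validate_llm_transpile_args_prompts_for_missing_values(tmp_path, monkeypatch) -> None:
    # Stub the dialect lookup so this sketch has no hard dependency on the
    # databricks-labs-switch package being importable.
    monkeypatch.setattr(cli, "get_switch_dialects", lambda: ["airflow", "synapse"])
    prompts = MockPrompts(
        {
            r"Enter input SQL path": str(tmp_path),
            r"Enter output workspace folder": "/Workspace/Users/me@example.com/out",
            r"Select the source dialect": "0",  # picks the first sorted dialect
        }
    )

    result = cli._validate_llm_transpile_args(None, None, None, prompts)

    assert result == (str(tmp_path), "/Workspace/Users/me@example.com/out", "airflow")
```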

src/databricks/labs/lakebridge/config.py

Lines changed: 1 addition & 0 deletions

```diff
@@ -166,6 +166,7 @@ class TranspileConfig:
     error_file_path: str | None = None
     sdk_config: dict[str, str] | None = None
     skip_validation: bool = False
+    include_llm: bool = False
     catalog_name: str = "remorph"
     schema_name: str = "transpiler"
     transpiler_options: JsonValue = None
```
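The new flag defaults to off. Assuming `TranspileConfig` is a standard dataclass (as the field syntax suggests), an existing instance can opt in without touching its other fields; a minimal sketch:

```python
from dataclasses import replace

from databricks.labs.lakebridge.config import TranspileConfig


def enable_llm(config: TranspileConfig) -> TranspileConfig:
    # Return a copy of the config with only the new flag flipped on;
    # `config` stands in for an instance loaded elsewhere (its required
    # fields are not shown in this diff).
    return replace(config, include_llm=True)
```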
src/databricks/labs/lakebridge/transpiler/switch_runner.py

Lines changed: 139 additions & 0 deletions

```diff
@@ -0,0 +1,139 @@
+import io
+import logging
+import os
+import random
+import string
+from datetime import datetime, timezone
+from pathlib import Path
+
+from databricks.labs.blueprint.installation import RootJsonValue
+from databricks.sdk import WorkspaceClient
+
+logger = logging.getLogger(__name__)
+
+
+class SwitchRunner:
+    """Runner for Switch LLM transpilation jobs."""
+
+    def __init__(
+        self,
+        ws: WorkspaceClient,
+    ):
+        self._ws = ws
+
+    def run(
+        self,
+        volume_input_path: str,
+        output_ws_folder: str,
+        source_tech: str,
+        catalog: str,
+        schema: str,
+        foundation_model: str,
+        job_id: int,
+    ) -> RootJsonValue:
+        """Trigger the Switch job."""
+        job_params = self._build_job_parameters(
+            input_dir=volume_input_path,
+            output_dir=output_ws_folder,
+            source_tech=source_tech,
+            catalog=catalog,
+            schema=schema,
+            foundation_model=foundation_model,
+        )
+        logger.info(f"Triggering Switch job with job_id: {job_id}")
+
+        return self._run_job(job_id, job_params)
+
+    def upload_to_volume(
+        self,
+        local_path: Path,
+        catalog: str,
+        schema: str,
+        volume: str,
+    ) -> str:
+        """Upload local files to a UC Volume under a unique timestamped path."""
+        now = datetime.now(timezone.utc)
+        time_part = now.strftime("%Y%m%d%H%M%S")
+        random_part = ''.join(random.choices(string.ascii_lowercase + string.digits, k=4))
+        volume_base_path = f"/Volumes/{catalog}/{schema}/{volume}"
+        volume_input_path = f"{volume_base_path}/input-{time_part}-{random_part}"
+
+        logger.info(f"Uploading {local_path} to {volume_input_path}...")
+
+        # File upload
+        if local_path.is_file():
+            if local_path.name.startswith('.'):
+                logger.debug(f"Skipping hidden file: {local_path}")
+                return volume_input_path
+            volume_file_path = f"{volume_input_path}/{local_path.name}"
+            with open(local_path, 'rb') as f:
+                content = f.read()
+            self._ws.files.upload(file_path=volume_file_path, contents=io.BytesIO(content), overwrite=True)
+            logger.debug(f"Uploaded: {local_path} -> {volume_file_path}")
+
+        # Directory upload
+        else:
+            for root, dirs, files in os.walk(local_path):
+                # Remove hidden directories
+                dirs[:] = [d for d in dirs if not d.startswith('.')]
+                # Skip hidden files
+                files = [f for f in files if not f.startswith('.')]
+                for file in files:
+                    local_file = Path(root) / file
+                    relative_path = local_file.relative_to(local_path)
+                    volume_file_path = f"{volume_input_path}/{relative_path}"
+
+                    with open(local_file, 'rb') as f:
+                        content = f.read()
+
+                    self._ws.files.upload(file_path=volume_file_path, contents=io.BytesIO(content), overwrite=True)
+                    logger.debug(f"Uploaded: {local_file} -> {volume_file_path}")
+
+        logger.info(f"Upload complete: {volume_input_path}")
+        return volume_input_path
+
+    def _build_job_parameters(
+        self,
+        input_dir: str,
+        output_dir: str,
+        source_tech: str,
+        catalog: str,
+        schema: str,
+        foundation_model: str,
+        switch_options: dict | None = None,
+    ) -> dict[str, str]:
+        """Build the Switch job parameters."""
+        if switch_options is None:
+            switch_options = {}
+        return {
+            "input_dir": input_dir,
+            "output_dir": output_dir,
+            "source_tech": source_tech,
+            "catalog": catalog,
+            "schema": schema,
+            "foundation_model": foundation_model,
+            **switch_options,
+        }
+
+    def _run_job(
+        self,
+        job_id: int,
+        job_params: dict[str, str],
+    ) -> RootJsonValue:
+        """Trigger the Switch job and return run information."""
+        job_run = self._ws.jobs.run_now(job_id, job_parameters=job_params)
+
+        if not job_run.run_id:
+            raise SystemExit(f"Job {job_id} execution failed.")
+
+        job_run_url = f"{self._ws.config.host}/jobs/{job_id}/runs/{job_run.run_id}"
+        logger.info(f"Switch LLM transpilation job started: {job_run_url}")
+
+        return [
+            {
+                "job_id": job_id,
+                "run_id": job_run.run_id,
+                "run_url": job_run_url,
+            }
+        ]
```
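The PR description mentions unit tests that verify the parameters passed to the Switch job; a minimal sketch of that idea with a mocked `WorkspaceClient` (the test name and values are illustrative, not the repository's actual tests):

```python
from unittest.mock import MagicMock, create_autospec

from databricks.sdk import WorkspaceClient

from databricks.labs.lakebridge.transpiler.switch_runner import SwitchRunner


def test_run_passes_job_parameters() -> None:
    ws = create_autospec(WorkspaceClient)
    ws.config.host = "https://example.cloud.databricks.com"
    ws.jobs.run_now.return_value = MagicMock(run_id=456)

    result = SwitchRunner(ws).run(
        volume_input_path="/Volumes/lakebridge/switch/switch_volume/input-20250101000000-abcd",
        output_ws_folder="/Workspace/Users/me@example.com/transpiled",
        source_tech="airflow",
        catalog="lakebridge",
        schema="switch",
        foundation_model="databricks-claude-sonnet-4-5",
        job_id=123,
    )

    # run() should forward every value verbatim as a Switch job parameter.
    ws.jobs.run_now.assert_called_once_with(
        123,
        job_parameters={
            "input_dir": "/Volumes/lakebridge/switch/switch_volume/input-20250101000000-abcd",
            "output_dir": "/Workspace/Users/me@example.com/transpiled",
            "source_tech": "airflow",
            "catalog": "lakebridge",
            "schema": "switch",
            "foundation_model": "databricks-claude-sonnet-4-5",
        },
    )
    assert result == [
        {
            "job_id": 123,
            "run_id": 456,
            "run_url": "https://example.cloud.databricks.com/jobs/123/runs/456",
        }
    ]
```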

tests/conftest.py

Lines changed: 15 additions & 0 deletions

```diff
@@ -312,6 +312,21 @@ def morpheus_artifact() -> Path:
     return artifact


+@pytest.fixture
+def switch_artifact() -> Path:
+    """Get the Switch wheel for testing."""
+    artifact = (
+        Path(__file__).parent
+        / "resources"
+        / "transpiler_configs"
+        / "switch"
+        / "wheel"
+        / "databricks_switch_plugin-0.1.2-py3-none-any.whl"
+    )
+    assert artifact.exists(), f"Switch artifact not found: {artifact}"
+    return artifact
+
+
 class FakeDataSource(DataSource):

     def __init__(self, start_delimiter: str, end_delimiter: str):
```
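Tests can request the new fixture by name to obtain the bundled wheel; a hypothetical example (not one of the PR's actual tests):

```python
from pathlib import Path


def test_switch_wheel_is_bundled(switch_artifact: Path) -> None:
    # The fixture itself asserts the wheel exists; sanity-check the artifact.
    assert switch_artifact.suffix == ".whl"
    assert switch_artifact.stat().st_size > 0
```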
