Skip to content

[SEA-NodeJS] Kernel backend: connection/statement options, TLS & configurable sync/async execution#416

Open
msrathore-db wants to merge 8 commits into
mainfrom
msrathore/sea-async-tls-options-main
Open

[SEA-NodeJS] Kernel backend: connection/statement options, TLS & configurable sync/async execution#416
msrathore-db wants to merge 8 commits into
mainfrom
msrathore/sea-async-tls-options-main

Conversation

@msrathore-db
Copy link
Copy Markdown
Contributor

@msrathore-db msrathore-db commented Jun 4, 2026

Base main. Recreation of #413 (which was mistakenly based on the #412 branch and merged there, never reaching main), now extended with a configurable sync/async execution path.

Depends on kernel PR databricks/databricks-sql-kernel#122 (exposes the sync StatementCanceller over napi). KERNEL_REV is pinned to that branch; this can land on main once #122 merges.

What's here

Connection / statement options + async (the original #413): TLS (checkServerCertificate secure-by-default + customCaCert), maxConnections, rowLimit, statementConf, queryTags, bound params (incl. TIMESTAMP_NTZ; TIMESTAMP_LTZTIMESTAMP), and the async submit/poll path. All earlier #413 review fixes carried over (server-driven Cancelled/Closed/Unknown → OperationStateError; queryTimeout enforced client-side; Int64 queryTimeout coercion; etc.).

Configurable sync/async execution (new), toggled by the existing runAsync option — no new public field:

  • Default = sync (runAsync false/undefined) → kernel Connection.executeStatementCancellable (blocking direct-results). Faster, and now cancellable. Matches the python connector (cursor.execute()stmt.execute()).
  • runAsync: truesubmitStatement (submit + poll), unchanged.
  • Public API is identical across both modes and to Thrift (same IOperation, fetchAll/fetchChunk/getSchema/getOperationStatus/cancel/close/hasMoreRows, result shape, TTableSchema, error classes). Only lifecycle timing differs — exactly Thrift's runAsync distinction.

Sync mid-compute cancellation: cancel() on a sync op fires the detached kernel StatementCanceller (via #122's napi surface) so a still-running blocking execute is cancelled server-side; cancel-induced rejection maps to OperationStateError(Canceled). Mid-download cancel already works via the CloudFetch token.

op.id now surfaces the server statement_id on the sync path once the execute round-trip publishes it (was a client UUID), so cancelled/closed sync ops are traceable in server/kernel logs — parity with the async path.

Verification

  • Unit suite 240 SEA / 1159 full passing; tsc/eslint/prettier clean.
  • e2e parity gate (sync mode, live): SEA-vs-Thrift byte-identical rows + INTERVAL decode/format/null/multi-row — 8 passing. (Row-count / interval / DDL correctness on the sync path.)
  • Live A/B (5-rep, alternating): sync ≤ async; both ~Thrift small, sync faster large; concurrent tied.
  • Cancel, both modes (live, server-confirmed CANCELED): mid-compute (~300ms after cancel) and mid-download (~10× faster than full download).

Known caveats (flagged for reviewers)

  1. First-window cancel (~10s): a cancel() before the initial execute POST returns is a server-side no-op (the kernel publishes the statement_id only when that POST returns); the client still unblocks with OperationStateError(Canceled). Kernel-core property shared with pyo3; tracked as a kernel follow-up.
  2. Sync-cancel error code: the kernel surfaces a sync-execute cancel as InvalidArgument (the async path returns Cancelled); node normalises it to OperationStateError(Canceled) via the client-side cancel flag. Kernel follow-up.

This pull request and its description were written by Isaac.

Wire the SEA connection-level and per-statement option surfaces onto the
merged-kernel napi binding (thin forwarding — the kernel owns the behaviour):

Connection options (SeaAuth.buildSeaConnectionOptions):
- `maxConnections` → kernel pool size, validated as a positive integer within
  the napi u32 range.
- TLS: `checkServerCertificate` (secure-by-default — omit to keep the kernel's
  verify-on default; `false` opts into insecure) and `customCaCert` (PEM string
  or Buffer; strings are PEM-sanity-checked and normalised to a Buffer before
  the FFI boundary), via the new `buildSeaTlsOptions`.
- `intervalsAsString: true` is always set so SEA interval/duration columns
  render as strings — a byte-compatible drop-in for the Thrift backend.
  `complexTypesAsJson` is intentionally left at the kernel default (native
  Arrow), which already decodes identically to Thrift via the shared converter.

Statement options (SeaSessionBackend.executeStatement, via buildExecuteOptions):
- `queryTimeout` → `queryTimeoutSecs`; `rowLimit` → `rowLimit` (SEA-only cap).
- `queryTags` serialised JS-side (reusing Thrift's `serializeQueryTags`) into
  the reserved `query_tags` conf key, merged with any explicit `statementConf` —
  the napi `queryTags` field can't carry null-valued tags, and the kernel
  rejects setting both. `queryTags` / `queryTimeout` are no longer rejected.
- Still rejected (genuinely unsupported on SEA): `useCloudFetch`,
  `useLZ4Compression`, `stagingAllowedLocalPath`.

`rowLimit` / `statementConf` added to the public `ExecuteStatementOptions`;
SEA-only knobs (`maxConnections` / `checkServerCertificate` / `customCaCert`)
added to the internal `InternalConnectionOptions`.

Validated against a live warehouse: secure-by-default connect, maxConnections,
checkServerCertificate, rowLimit (caps rows), queryTimeout, queryTags,
statementConf, and non-PEM customCaCert rejection.

Co-authored-by: Isaac
Signed-off-by: Madhavendra Rathore <madhavendra.rathore@databricks.com>
Switch the SEA query path from the blocking `executeStatement` to the kernel's
async `submitStatement`, matching the Thrift backend's always-async
(`runAsync: true`) model. `submitStatement` returns immediately with a pending
`AsyncStatement` (kernel `wait_timeout=0s`) while the query runs server-side.

SeaOperationBackend becomes dual-mode (exactly one of):
- `asyncStatement` (query path): `waitUntilReady()` polls `status()` to a
  terminal state on a 100ms cadence (matching Thrift), firing the progress
  callback each tick. Polling `status()` rather than blocking on `awaitResult()`
  keeps `cancel()` responsive — a blocking awaitResult would hold the kernel
  statement mutex for the whole query and queue cancel behind it. On Succeeded
  it materialises the result handle (first fetch is free); on Failed it drives
  `awaitResult()` to surface the kernel's typed SQL-error envelope; on a
  server-side Cancelled/Closed/Unknown it throws a clear error. `status()`
  reports the real Pending/Running/Succeeded state.
- `statement` (metadata path): the kernel `list*`/`get*` statement is already
  terminal, so `waitUntilReady()` stays the one-shot completion tick.

The fetch pipeline is shared: `awaitResult()`'s `AsyncResultHandle` and the
metadata `Statement` expose the same `fetchNextBatch()` / `schema()` surface, so
`SeaResultsProvider` → `ArrowResultConverter` → `ResultSlicer` consume either
interchangeably via a single memoised fetch handle. cancel()/close() route
through a `lifecycleHandle` abstraction over whichever handle backs the op.

Re-exports the kernel `AsyncStatement` / `AsyncResultHandle` types from
`SeaNativeLoader`.

Validated against a live warehouse: async fetchAll correctness, multi-row drain
(5000 rows), long-running aggregate (count over 20M), kernel SQL-error
surfacing, and cancellation mid-execution. PR1's params/metadata/getInfo all
still pass through the new async path.

Co-authored-by: Isaac
Signed-off-by: Madhavendra Rathore <madhavendra.rathore@databricks.com>
… code-style)

The CI "Check code style" step runs `prettier . --check` (whole repo);
these files were committed without prettier formatting. Formatting-only.

Co-authored-by: Isaac
Signed-off-by: Madhavendra Rathore <madhavendra.rathore@databricks.com>
Code-review #413 (81/100). Validated each against the code + a live warehouse:

- F1 (HIGH): the async poll loop threw plain HiveDriverError for server-driven
  Cancelled/Closed/Unknown. The DBSQLOperation facade only mirrors its
  cancelled/closed flags when `err instanceof OperationStateError` (and
  OperationStateError extends HiveDriverError, not the reverse), so a
  server-side cancel/close/admin-kill left the facade desynced. Now throws
  OperationStateError(Canceled/Closed/Unknown) — matching the Thrift backend.
  The Failed branch still surfaces the kernel SQL-error envelope via awaitResult.

- F2 (MED): the server-Cancelled test asserted only instanceOf(HiveDriverError),
  which passes for both the correct and incorrect type — it couldn't catch F1.
  Now asserts instanceOf(OperationStateError) + errorCode, plus a new Closed test.

- F3 (MED): queryTimeout was forwarded to submitStatement but the kernel ignores
  queryTimeoutSecs on submit (always wait_timeout=0s), so the documented public
  option was a silent no-op, and the poll loop had no client-side deadline (a
  stalled Running statement polled forever). Now enforced client-side: the poll
  loop tracks a deadline, best-effort cancels the statement on expiry, and
  throws OperationStateError(Timeout) — matching Thrift's server TIMEDOUT
  outcome. Stopped forwarding the ignored queryTimeoutSecs to the napi options.
  Validated live: a 2s timeout interrupts a slow cross-join with TIMEOUT.

- F4 (LOW): customCaCert PEM string check now requires the END marker too (a
  truncated/headerless cert no longer passes), consistent with the Buffer path.

- F5 (LOW): SeaAuth reads the SEA-only fields (checkServerCertificate /
  customCaCert / maxConnections) through `InternalConnectionOptions` instead of
  ad-hoc inline casts, so a typo'd key fails to compile.

- F6 (LOW): corrected the poll-loop comment — the prior text justified polling
  by an incorrect "blocking awaitResult holds the mutex and queues cancel"
  claim; cancel() is documented lock-free. The real rationale (real-time
  status to the progress callback + cancel observed between ticks) is now stated.

Co-authored-by: Isaac
Signed-off-by: Madhavendra Rathore <madhavendra.rathore@databricks.com>
Consolidates the last net-new bit of the superseded #408: two SEA-path
DBSQLParameterType variants for binding timezone-explicit timestamps. The type
name flows through the existing param codec (toSparkParameter → sqlType), which
the kernel accepts — validated live (SELECT ? with TIMESTAMP_NTZ/LTZ returns the
bound values). On the Thrift backend they degrade to a plain TIMESTAMP bind.

Co-authored-by: Isaac
Signed-off-by: Madhavendra Rathore <madhavendra.rathore@databricks.com>
…ryTimeout coercion

Code-review #413 (gopalldb). Two P1s:

- TIMESTAMP_LTZ was sent verbatim on the wire, but Spark has no distinct
  TIMESTAMP_LTZ type (TIMESTAMP already carries LTZ semantics) — so a Thrift
  caller got an opaque server bind error, and the enum comment falsely claimed
  NTZ/LTZ "degrade to a plain TIMESTAMP bind" (there was no such logic).
  `toSparkParameter` now maps TIMESTAMP_LTZ → `TIMESTAMP` (valid on both Thrift
  and kernel); TIMESTAMP_NTZ stays native (a real Spark type). Comment corrected.
  Added DBSQLParameter tests for both wire types (the Thrift behaviour the
  review flagged as untested) and updated the kernel positional-params test.

- queryTimeout (`number | bigint | Int64`) was coerced with `Number(...)`, which
  yields NaN for an Int64 (node-int64 has no valueOf) → the client-side deadline
  was silently disabled for Int64 inputs. Now uses
  `numberToInt64(...).toNumber()`, matching the Thrift backend. Added a
  regression test that an `Int64(1)` queryTimeout actually fires the deadline
  (OperationStateError(Timeout)) rather than polling forever.

(P1 "queryTimeout silently dropped on submit" and the unbounded-poll note were
already resolved earlier by the client-side deadline fix; doc comment updated to
match. P2 polarity/Date-NTZ items noted for the public-surface follow-up.)

Validated live: NTZ binds natively and LTZ binds as TIMESTAMP on the kernel path.

Co-authored-by: Isaac
Signed-off-by: Madhavendra Rathore <madhavendra.rathore@databricks.com>
@gopalldb
Copy link
Copy Markdown
Collaborator

gopalldb commented Jun 4, 2026

🟡 P1

  1. seaCancel resets isCancelled to false on kernel-cancel failure — loses user's cancel intent (SeaOperationLifecycle.ts:134-141). Brief true window can be observed by the poll loop and throw
    OperationStateError(Canceled), then the rollback hides the intent from any other consumer. Worst case: kernel-side statement keeps running while the user believes it was canceled. Fix: don't reset
    isCancelled on cancel-RPC failure.
  2. Cancel during in-flight status() poll is unbounded (SeaOperationBackend.ts:372-419). Cancel arrives → flag flipped → but observation deferred until current status() resolves. No AbortSignal threaded to
    napi. Hung warehouse = unbounded latency to observe cancel. Fix: race in-flight status() against a lifecycle promise on cancel. Add a test where cancel fires while status() is mid-flight (existing test only
    covers cancel-before-poll-starts).
  3. PEM marker check is order-insensitive substring match (SeaAuth.ts:201-209). "...-----END...-----BEGIN..." (e.g. proxy intercept page with both literals) passes. Fix: ordered regex /-----BEGIN
    CERTIFICATE-----[\s\S]+?-----END CERTIFICATE-----/.
  4. customCaCert honored regardless of checkServerCertificate=false — silent insecure combo (InternalConnectionOptions.ts:40, SeaAuth.ts:184-223). User providing both gets verification fully off but with a
    wasted custom CA install — the secure-looking customCaCert masks the disabled verification. Fix: error or warn when both set; today's doc explicitly endorses the ambiguous combo.
  5. No close() cleanup on poll-loop terminal error — kernel statement leaks (SeaOperationBackend.ts:394-415). Failed/Cancelled/Closed/Unknown/Timeout all throw without closing the kernel-side statement.
    fetchChunk does best-effort close on error; the poll loop doesn't. Fix: best-effort seaClose on all terminal-error branches.
  6. rowLimit/statementConf added to cross-backend ExecuteStatementOptions but silently dropped by Thrift (IDBSQLSession.ts:30-42). Exactly the anti-pattern the PR's own code comments warn against ("silent
    drop is the worst failure mode"). User sets rowLimit: 100 on Thrift, gets full result set, no warning. Fix: warn at the Thrift session when these are passed, or move to an SEA-specific options type.

🟢 P2

  • OperationStateError messages lack statement_id and state name (matches Thrift behavior, just an enhancement).
  • "Secure-by-default" should be owned by the JS layer, not inherited from the kernel default. Add explicit checkServerCertificate: true when undefined for defense in depth.
  • Missing test: cancel-mid-status() (the actual race the P1 Setup E2E tests with GitHub Actions #2 fix needs).
  • Missing test: server-Cancelled + client-Timeout race on same tick.
  • Missing test: PEM with reversed marker order, BEGIN-only, END-only.
  • hasResultSet() hardcoded true even for DDL/DML — leave for now; revisit once napi exposes the kernel's has_result_set bit.

Make the SEA execution path user-configurable between sync and async,
toggled by the EXISTING `runAsync` option — no new public field, exactly
mirroring the Thrift backend's `runAsync` distinction. Default is SYNC
(`runAsync: false`): faster, and with the kernel sync canceller fully
cancellable mid-compute.

SeaSessionBackend.executeStatement now branches on `options.runAsync`:
  - false/undefined (DEFAULT) -> Connection.executeStatementCancellable:
    the kernel blocks on execute() (poll-to-terminal server-side), driven
    lazily in the operation backend's result(). `queryTimeoutSecs` IS
    forwarded (the kernel execute() honours it).
  - true -> Connection.submitStatement (submit + poll), unchanged.
    `queryTimeoutSecs` stays client-side (kernel ignores it on submit).

SeaOperationBackend gains a third dual-mode handle kind,
`cancellableExecution`, alongside `asyncStatement` and `statement`:
  - waitUntilReadyCancellable drives result() to the terminal Statement
    (memoised as the fetch handle + close target);
  - the lifecycle handle is a composite: cancel() routes to the detached
    canceller (lock-free, interrupts a running result() mid-COMPUTE and
    is a no-op once terminal); close() routes to the resolved statement;
  - a cancel-induced result() rejection maps to OperationStateError(
    Canceled) so the DBSQLOperation facade mirrors its cancelled flag,
    matching the Thrift path.

Public API, result shape, schema (TTableSchema), and error classes are
identical across both modes and to Thrift — the only observable
difference is lifecycle timing (when executeStatement resolves).

Built against databricks-sql-kernel napi
639e19ef97decc1c5aa2365c0b3a229c1ccd5b58 (executeStatementCancellable /
CancellableExecution); KERNEL_REV bumped to match. Refreshed the
committed binding surface (native/sea/index.{js,d.ts}).

Co-authored-by: Isaac
Signed-off-by: Madhavendra Rathore <madhavendra.rathore@databricks.com>
On the sync (cancellable) execute path the operation id was a client UUID,
because the server statement_id isn't known at construction — the kernel
publishes it mid-`result()` once the initial execute round-trip returns. That
left a cancelled/closed sync op untraceable against server/kernel logs (the
async path already had the id from `submit`).

`id` now prefers the server statement_id once known (captured from the resolved
`Statement`, then the live canceller slot), falling back to the construction-time
UUID until then. Updated the fake to model the real null-until-resolved
`statementId` and assert op.id flips from UUID → server id after the execute
completes. Validated live: SELECT 1 op.id is a UUID before fetch and the real
`01f1…` statement id after.

Co-authored-by: Isaac
Signed-off-by: Madhavendra Rathore <madhavendra.rathore@databricks.com>
@msrathore-db msrathore-db deployed to azure-prod June 4, 2026 14:15 — with GitHub Actions Active
@msrathore-db msrathore-db changed the title [SEA-NodeJS] SEA async execute, TLS & connection/statement options [SEA-NodeJS] Kernel backend: connection/statement options, TLS & configurable sync/async execution Jun 4, 2026
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants