[feature](lance) Add Rust-based Lance format reader for AI-native dat… #62182

Open
tomz-alt wants to merge 1 commit into apache:master from tomz-alt:lance-support
@tomz-alt

@tomz-alt tomz-alt commented Apr 7, 2026

Summary

Add native Lance format support to Doris via Rust FFI integration, enabling SQL queries over AI-native Lance datasets from local disk and S3.

Lance is a columnar format designed for vector search, multimodal data (images, embeddings), and fast random access -- widely used in AI/ML pipelines.

Quick Examples

Read from S3:

SELECT * FROM s3(
    "uri" = "s3://bucket/embeddings.lance/data/fragment.lance",
    "format" = "lance",
    "s3.access_key" = "...",
    "s3.secret_key" = "...",
    "s3.region" = "us-east-1",
    "s3.endpoint" = "https://s3.us-east-1.amazonaws.com"
) ORDER BY id LIMIT 10;

Read from local disk (for testing):

-- Get backend_id from: SHOW BACKENDS;
SELECT * FROM local(
    "file_path" = "data/my_dataset.lance/data/fragment.lance",
    "backend_id" = "<backend_id from SHOW BACKENDS>",
    "format" = "lance"
) ORDER BY id LIMIT 10;

Aggregation across multi-fragment dataset:

SELECT count(*), min(id), max(id) FROM s3(
    "uri" = "s3://bucket/large.lance/data/fragment.lance",
    "format" = "lance",
    "s3.access_key" = "...", "s3.secret_key" = "...",
    "s3.region" = "us-east-1",
    "s3.endpoint" = "https://s3.us-east-1.amazonaws.com"
);

Architecture

  • Data exchange: Arrow C Data Interface (zero-copy between Rust and C++)
  • Async containment: block_on() with single-threaded tokio runtime (zero extra OS threads)
  • Build gating: BUILD_RUST_READERS=OFF by default, zero impact on existing builds
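
The async-containment bullet above can be illustrated without tokio. The PR uses a single-threaded tokio runtime; the sketch below is a standard-library stand-in (not the PR's code) that shows the same idea: the calling thread polls the future itself and parks while the future is pending, so no extra OS threads are created. The names `block_on`, `ThreadWaker`, and `YieldOnce` are illustrative assumptions.

```rust
use std::future::Future;
use std::pin::{pin, Pin};
use std::sync::Arc;
use std::task::{Context, Poll, Wake, Waker};
use std::thread::{self, Thread};

/// Waker that unparks the thread blocked in `block_on`.
struct ThreadWaker(Thread);

impl Wake for ThreadWaker {
    fn wake(self: Arc<Self>) {
        self.0.unpark();
    }
}

/// Drive `fut` to completion on the calling thread (std-only stand-in,
/// NOT tokio's `Runtime::block_on`).
pub fn block_on<F: Future>(fut: F) -> F::Output {
    let mut fut = pin!(fut);
    let waker = Waker::from(Arc::new(ThreadWaker(thread::current())));
    let mut cx = Context::from_waker(&waker);
    loop {
        match fut.as_mut().poll(&mut cx) {
            Poll::Ready(value) => return value,
            // Sleep until the waker unparks this thread, then poll again.
            Poll::Pending => thread::park(),
        }
    }
}

/// Illustrative future that returns Pending once before completing.
pub struct YieldOnce(pub bool);

impl Future for YieldOnce {
    type Output = ();
    fn poll(mut self: Pin<&mut Self>, cx: &mut Context<'_>) -> Poll<()> {
        if self.0 {
            Poll::Ready(())
        } else {
            self.0 = true;
            cx.waker().wake_by_ref(); // request another poll
            Poll::Pending
        }
    }
}
```

A synchronous FFI entry point could then call something like `block_on(async { YieldOnce(false).await; 7 })` and hand the result back to C++ without exposing any async machinery across the boundary.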

What Works (Verified on Live Cluster)

| Feature | Status |
| --- | --- |
| SELECT * / column projection | Tested |
| WHERE filter / LIMIT / COUNT(*) | Tested |
| SUM() / AVG() aggregation | Tested |
| Multi-fragment datasets (3 fragments, 15 rows) | Tested |
| S3 access with AWS credentials | Tested |
| Schema inference (fetch_table_schema) | Tested |
| Time travel version (config wired) | Config ready |
| Vector ANN search / FTS / filter pushdown | Config ready |

Known Limitations

  • TVF path only: No CREATE CATALOG support yet. Queries must go through the local() or s3() TVF
  • Directory-based format workaround: A Lance dataset is a directory, not a single file. The TVF path (file_path for local(), uri for s3()) must point to a single .lance data file inside the dataset; the reader strips the path back to the dataset root and reads all fragments. If the path is a glob that matches multiple .lance files (a multi-fragment dataset), each scan range reopens the full dataset and returns duplicate rows. Workaround: make sure the path matches exactly one data file per dataset
  • No Doris data cache integration: Lance reads bypass BlockFileCache. S3 reads are not cached on local SSD
  • No filter/vector pushdown from FE: The Rust config supports filter, vector_search, full_text_search but the FE planner does not populate them yet
  • BUILD_RUST_READERS=OFF by default: Building the reader requires explicit opt-in and a Rust toolchain
  • Binary size: The Rust static library (.a) is ~430 MB and adds ~50-80 MB to the final doris_be binary after LTO
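
The path handling described in the directory-based workaround might look roughly like this. It is a sketch under assumptions, not the reader's actual logic: `dataset_root` and `distinct_roots` are hypothetical helpers, and the layout `<dataset>.lance/data/<file>.lance` is taken from the example paths in this description.

```rust
use std::collections::BTreeSet;

/// Hypothetical helper: map `<root>.lance/data/<file>.lance` to `<root>.lance`.
/// Returns None when the path does not follow that layout.
pub fn dataset_root(path: &str) -> Option<&str> {
    let (parent, file) = path.rsplit_once('/')?;
    if !file.ends_with(".lance") {
        return None;
    }
    let (root, dir) = parent.rsplit_once('/')?;
    if dir == "data" && root.ends_with(".lance") {
        Some(root)
    } else {
        None
    }
}

/// Hypothetical helper: collapse every matched data file to its dataset
/// root, so each dataset is opened once however many files a glob matched.
pub fn distinct_roots<'a>(matched: &[&'a str]) -> BTreeSet<&'a str> {
    matched
        .iter()
        .copied()
        .filter_map(|p| dataset_root(p))
        .collect()
}
```

If two matched files such as `d.lance/data/a.lance` and `d.lance/data/b.lance` resolve to the same root, opening the dataset once per scan range reads every fragment twice, which is the duplicate-row behavior described above.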

How to Build

# Rust tests only:
cd be/src/rust/doris-native && cargo test

# Full BE with Lance:
BUILD_RUST_READERS=ON ./build.sh --be

# Regression test:
./run-regression-test.sh --run -s test_lance_tvf

Changes

Thrift (2 files): FORMAT_LANCE = 19, TLanceFileDesc, enable_rust_lance_reader

FE (4 files): LanceFileFormatProperties, FileFormatProperties factory, FileFormatConstants, SessionVariable

BE - Rust (be/src/rust/doris-native/, 6 files): error.rs, lance_reader.rs (LanceReaderConfig with S3/version/vector/FTS support), ffi.rs (extern C functions), lib.rs

BE - C++ (5 files): lance_ffi.h, lance_rust_reader.h/cpp (GenericReader with Arrow import), file_scanner.cpp (FORMAT_LANCE dispatch), internal_service.cpp (fetch_table_schema)

Build (3 files): rust.cmake (Corrosion v0.5), CMakeLists.txt (BUILD_RUST_READERS option), format/CMakeLists.txt

Tests (4 files): 24 Rust tests, C++ GTest (7), C++ standalone (8), Groovy regression (9)
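
To make the Rust/C++ boundary listed above more concrete, here is a minimal sketch of exporting an int64 column through the Arrow C Data Interface. The `ArrowArray` layout follows the published specification; everything else (`Int64Holder`, `export_int64`, `int64_values`) is a hypothetical illustration written by hand, whereas a real reader would more likely rely on an Arrow library's FFI support. The PR's actual ffi.rs is not shown here.

```rust
use std::ffi::c_void;
use std::{ptr, slice};

/// Mirror of `struct ArrowArray` from the Arrow C Data Interface specification.
#[repr(C)]
pub struct ArrowArray {
    pub length: i64,
    pub null_count: i64,
    pub offset: i64,
    pub n_buffers: i64,
    pub n_children: i64,
    pub buffers: *mut *const c_void,
    pub children: *mut *mut ArrowArray,
    pub dictionary: *mut ArrowArray,
    pub release: Option<unsafe extern "C" fn(*mut ArrowArray)>,
    pub private_data: *mut c_void,
}

/// Owns the memory that the exported struct points into.
struct Int64Holder {
    values: Vec<i64>,
    buffers: [*const c_void; 2], // [validity bitmap (absent), data]
}

/// Release callback: frees the holder and marks the struct as released.
unsafe extern "C" fn release_int64(array: *mut ArrowArray) {
    unsafe {
        drop(Box::from_raw((*array).private_data as *mut Int64Holder));
        (*array).release = None;
    }
}

/// Producer side: expose a Vec<i64> without copying its elements.
pub fn export_int64(values: Vec<i64>) -> ArrowArray {
    let length = values.len() as i64;
    let mut holder = Box::new(Int64Holder { values, buffers: [ptr::null(); 2] });
    holder.buffers[1] = holder.values.as_ptr() as *const c_void;
    let raw = Box::into_raw(holder);
    ArrowArray {
        length,
        null_count: 0,
        offset: 0,
        n_buffers: 2,
        n_children: 0,
        buffers: unsafe { ptr::addr_of_mut!((*raw).buffers) as *mut *const c_void },
        children: ptr::null_mut(),
        dictionary: ptr::null_mut(),
        release: Some(release_int64),
        private_data: raw as *mut c_void,
    }
}

/// Consumer side: view the data buffer in place.
/// Safety: `array` must be a live int64 array without nulls, produced as above.
pub unsafe fn int64_values(array: &ArrowArray) -> &[i64] {
    unsafe {
        let data = *array.buffers.add(1) as *const i64;
        slice::from_raw_parts(data.add(array.offset as usize), array.length as usize)
    }
}
```

The consumer reads the producer's buffer directly and calls `release` exactly once when finished; the callback sets `release` to null so that a second release attempt can be detected.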

Future Work

  • Lance Catalog: CREATE CATALOG for dataset discovery (eliminates TVF path limitations and backend_id requirement)
  • Fragment-level scan ranges: FE lists fragments and creates one scan range per fragment with fragment ID, avoiding duplicate reads
  • FE filter/vector pushdown: Pass WHERE predicates and ANN queries to lance-rs scanner
  • Doris BlockFileCache integration: Route lance I/O through Doris cache for S3 data caching
  • Lance Session cache: Shared IndexCache for vector/FTS index reuse across queries
  • Time travel SQL: FOR VERSION AS OF N via LanceMvccSnapshot
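
The fragment-level scan range item above could be planned roughly as follows. This is a language-neutral sketch written in Rust (the FE planner itself is Java), and `FragmentScanRange` and `plan_scan_ranges` are hypothetical names that are not part of this PR.

```rust
/// Hypothetical scan range: one dataset root plus one fragment ID.
#[derive(Debug, Clone, PartialEq, Eq)]
pub struct FragmentScanRange {
    pub dataset_root: String,
    pub fragment_id: u64,
}

/// Emit exactly one scan range per distinct fragment ID, so no fragment
/// is read by more than one range.
pub fn plan_scan_ranges(dataset_root: &str, fragment_ids: &[u64]) -> Vec<FragmentScanRange> {
    let mut ids = fragment_ids.to_vec();
    ids.sort_unstable();
    ids.dedup(); // guard against a fragment being listed twice
    ids.into_iter()
        .map(|fragment_id| FragmentScanRange {
            dataset_root: dataset_root.to_string(),
            fragment_id,
        })
        .collect()
}
```

Each range carries the fragment it is responsible for, so a reader that receives a range would open the dataset and scan only that fragment instead of all of them.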

@Thearas
Contributor

Thearas commented Apr 7, 2026

Thank you for your contribution to Apache Doris.
Don't know what should be done next? See How to process your PR.

Please clearly describe your PR:

  1. What problem was fixed (it's best to include specific error reporting information). How it was fixed.
  2. Which behaviors were modified. What was the previous behavior, what is it now, why was it modified, and what possible impacts might there be.
  3. What features were added. Why was this function added?
  4. Which code was refactored and why was this part of the code refactored?
  5. Which functions were optimized and what is the difference before and after the optimization?

@morningman morningman self-assigned this Apr 7, 2026
@tomz-alt tomz-alt force-pushed the lance-support branch 5 times, most recently from 9825d0e to 6f655a4, April 7, 2026 18:15
@morningman
Contributor

run buildall

@hello-stephen
Contributor

FE UT Coverage Report

Increment line coverage 18.18% (2/11) 🎉
Increment coverage report
Complete coverage report

@tomz-alt
Author

tomz-alt commented Apr 7, 2026

We need to install the Rust toolchain (cargo) so that the Rust FFI code compiles.

@hello-stephen
Contributor

BE UT Coverage Report

Increment line coverage 🎉

Increment coverage report
Complete coverage report

Category Coverage
Function Coverage 52.98% (20105/37947)
Line Coverage 36.53% (188940/517204)
Region Coverage 32.78% (146566/447093)
Branch Coverage 33.93% (64209/189233)

@hello-stephen
Contributor

BE Regression && UT Coverage Report

Increment line coverage 100% (0/0) 🎉

Increment coverage report
Complete coverage report

Category Coverage
Function Coverage 73.69% (27386/37162)
Line Coverage 57.31% (295496/515622)
Region Coverage 54.51% (245979/451226)
Branch Coverage 56.18% (106642/189815)

@hello-stephen
Contributor

run buildall

@hello-stephen
Contributor

The TeamCity pipeline uses regression-test/pipeline/common/custom_env.sh when compiling; please set BUILD_RUST_READERS=on in that file for testing.

@hello-stephen
Contributor

BE UT Coverage Report

Increment line coverage 🎉

Increment coverage report
Complete coverage report

Category Coverage
Function Coverage 53.00% (20112/37947)
Line Coverage 36.56% (189072/517225)
Region Coverage 32.81% (146716/447111)
Branch Coverage 33.94% (64233/189251)

@hello-stephen
Contributor

BE Regression && UT Coverage Report

Increment line coverage 89.87% (71/79) 🎉

Increment coverage report
Complete coverage report

Category Coverage
Function Coverage 73.72% (27397/37162)
Line Coverage 57.36% (295782/515643)
Region Coverage 54.61% (246412/451244)
Branch Coverage 56.29% (106852/189833)

@hello-stephen
Contributor

run buildall

@hello-stephen
Contributor

FE UT Coverage Report

Increment line coverage 18.18% (2/11) 🎉
Increment coverage report
Complete coverage report

@tomz-alt tomz-alt force-pushed the lance-support branch 3 times, most recently from a02a47a to 9dcf75e, April 13, 2026 06:19
@morningman
Contributor

run buildall

…a access

Introduce Lance columnar format support in Doris via Rust FFI integration.
Lance is an AI-native data format optimized for vector search, multimodal
data, and fast random access.

Architecture:
- Rust static library (doris-ffi) compiled via Corrosion CMake integration
- Arrow C Data Interface for zero-copy data exchange between Rust and C++
- LanceRustReader inherits GenericReader, same pattern as PaimonCppReader
- JSON config for extensible reader options (S3 creds, version, indexes)

Features:
- Local and S3 dataset access via local() and s3() TVF
- Schema inference via fetch_table_schema RPC
- Column projection, WHERE filter, LIMIT, aggregation
- Multi-fragment dataset reads (all fragments via single scan range)
- S3 credential passthrough (AWS_ACCESS_KEY → object_store)
- Time travel version support (TLanceFileDesc.version)
- Vector ANN search, FTS, scalar filter pushdown config (wired in Rust)
- BUILD_RUST_READERS=OFF by default (zero impact on existing builds)

Thrift: FORMAT_LANCE=19, TLanceFileDesc, enable_rust_lance_reader
FE: LanceFileFormatProperties, SessionVariable, FileFormatConstants
BE: LanceRustReader, lance_ffi.h, CMake Corrosion integration

Tests:
- 24 Rust unit tests (error handling, reader lifecycle, FFI bridge)
- 8 C++ standalone tests (Arrow import, schema, multi-fragment)
- 9 Groovy regression tests (SELECT, projection, COUNT, WHERE, LIMIT)
- Verified on live Doris cluster with local and S3 datasets
@tomz-alt
Author

run buildall

@hello-stephen
Contributor

Cloud UT Coverage Report

Increment line coverage 🎉

Increment coverage report
Complete coverage report

Category Coverage
Function Coverage 78.48% (1798/2291)
Line Coverage 64.17% (32304/50345)
Region Coverage 65.13% (16260/24967)
Branch Coverage 55.63% (8689/15620)

@hello-stephen
Contributor

FE UT Coverage Report

Increment line coverage 18.18% (2/11) 🎉
Increment coverage report
Complete coverage report

4 participants