[feature](lance) Add Rust-based Lance format reader for AI-native dat… #62182
tomz-alt wants to merge 1 commit into apache:master
We need to install the cargo toolchain so that the Rust FFI code compiles.
The TeamCity pipeline uses regression-test/pipeline/common/custom_env.sh when compiling; please set BUILD_RUST_READERS=on in that file for testing.
…a access

Introduce Lance columnar format support in Doris via Rust FFI integration. Lance is an AI-native data format optimized for vector search, multimodal data, and fast random access.

Architecture:
- Rust static library (doris-ffi) compiled via Corrosion CMake integration
- Arrow C Data Interface for zero-copy data exchange between Rust and C++
- LanceRustReader inherits GenericReader, same pattern as PaimonCppReader
- JSON config for extensible reader options (S3 creds, version, indexes)

Features:
- Local and S3 dataset access via local() and s3() TVF
- Schema inference via fetch_table_schema RPC
- Column projection, WHERE filter, LIMIT, aggregation
- Multi-fragment dataset reads (all fragments via single scan range)
- S3 credential passthrough (AWS_ACCESS_KEY → object_store)
- Time travel version support (TLanceFileDesc.version)
- Vector ANN search, FTS, scalar filter pushdown config (wired in Rust)
- BUILD_RUST_READERS=OFF by default (zero impact on existing builds)

Thrift: FORMAT_LANCE=19, TLanceFileDesc, enable_rust_lance_reader
FE: LanceFileFormatProperties, SessionVariable, FileFormatConstants
BE: LanceRustReader, lance_ffi.h, CMake Corrosion integration

Tests:
- 24 Rust unit tests (error handling, reader lifecycle, FFI bridge)
- 8 C++ standalone tests (Arrow import, schema, multi-fragment)
- 9 Groovy regression tests (SELECT, projection, COUNT, WHERE, LIMIT)
- Verified on live Doris cluster with local and S3 datasets
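To make the "Arrow C Data Interface for zero-copy data exchange" item concrete, here is a minimal std-only sketch of the producer side for a single non-nullable Int32 column. The `ArrowArray` layout follows the Arrow C Data Interface specification; `Int32Holder`, `release_int32`, and `export_int32` are hypothetical names invented for illustration and are not taken from this PR, which would more plausibly rely on an Arrow library's FFI helpers than on a hand-written struct.

```rust
use std::ffi::c_void;
use std::ptr;

/// Field-for-field mirror of `struct ArrowArray` from the Arrow C Data
/// Interface specification.
#[repr(C)]
pub struct ArrowArray {
    pub length: i64,
    pub null_count: i64,
    pub offset: i64,
    pub n_buffers: i64,
    pub n_children: i64,
    pub buffers: *mut *const c_void,
    pub children: *mut *mut ArrowArray,
    pub dictionary: *mut ArrowArray,
    pub release: Option<unsafe extern "C" fn(*mut ArrowArray)>,
    pub private_data: *mut c_void,
}

/// Hypothetical producer-side state that keeps the exported memory alive
/// until the consumer calls `release`.
struct Int32Holder {
    #[allow(dead_code)] // only held so the data buffer stays allocated
    values: Vec<i32>,
    /// [validity bitmap, data]; validity may be null when null_count == 0.
    buffers: [*const c_void; 2],
}

/// Release callback: frees the producer-side state and marks the array as
/// released by clearing `release`, as the specification requires.
unsafe extern "C" fn release_int32(array: *mut ArrowArray) {
    if array.is_null() {
        return;
    }
    unsafe {
        let holder = (*array).private_data as *mut Int32Holder;
        if !holder.is_null() {
            drop(Box::from_raw(holder));
        }
        (*array).private_data = ptr::null_mut();
        (*array).release = None;
    }
}

/// Exports `values` without copying: the consumer reads the Vec's own heap
/// buffer through `buffers[1]`.
pub fn export_int32(values: Vec<i32>) -> ArrowArray {
    let length = values.len() as i64;
    // The Vec's heap pointer stays valid when the Vec is moved into the holder.
    let data_ptr = values.as_ptr() as *const c_void;
    let holder = Box::into_raw(Box::new(Int32Holder {
        values,
        buffers: [ptr::null(), data_ptr],
    }));
    let buffers = unsafe { ptr::addr_of_mut!((*holder).buffers) as *mut *const c_void };
    ArrowArray {
        length,
        null_count: 0,
        offset: 0,
        n_buffers: 2,
        n_children: 0,
        buffers,
        children: ptr::null_mut(),
        dictionary: ptr::null_mut(),
        release: Some(release_int32),
        private_data: holder as *mut c_void,
    }
}
```

A consumer (the C++ side in this PR's design) would read `buffers[1]` as `length` contiguous 32-bit integers and then call `release` exactly once; a full exchange also needs the matching `ArrowSchema` struct, which is left out of this sketch.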
Summary
Add native Lance format support to Doris via Rust FFI integration, enabling SQL queries over AI-native Lance datasets from local disk and S3.
Lance is a columnar format designed for vector search, multimodal data (images, embeddings), and fast random access -- widely used in AI/ML pipelines.
Quick Examples
Read from S3:
Read from local disk (for testing):
Aggregation across multi-fragment dataset:
Architecture
What Works (Verified on Live Cluster)
Known Limitations
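The limitation recorded in this section comes from every matched data file resolving to the same dataset root, so each match triggers a full read. The sketch below shows that mapping and one way a caller might collapse matches to unique roots. It assumes the common Lance directory layout `<dataset root>/data/<file>.lance` and plain filesystem paths (S3 URIs would need URL-aware handling); `dataset_root` and `unique_roots` are hypothetical helpers, not functions from this PR.

```rust
use std::collections::BTreeSet;
use std::path::{Path, PathBuf};

/// Maps a data-file path to its dataset root, assuming the layout
/// `<root>/data/<file>.lance`. Returns None for paths that do not fit.
pub fn dataset_root(data_file: &Path) -> Option<PathBuf> {
    if data_file.extension()?.to_str()? != "lance" {
        return None;
    }
    let data_dir = data_file.parent()?;
    if data_dir.file_name()?.to_str()? != "data" {
        return None;
    }
    data_dir.parent().map(Path::to_path_buf)
}

/// Collapses every matched data file to the set of distinct dataset roots,
/// so that each dataset would be opened once rather than once per file.
pub fn unique_roots<'a, I>(matched: I) -> Vec<PathBuf>
where
    I: IntoIterator<Item = &'a str>,
{
    let roots: BTreeSet<PathBuf> = matched
        .into_iter()
        .filter_map(|p| dataset_root(Path::new(p)))
        .collect();
    roots.into_iter().collect()
}
```

With two fragments of one dataset matched by a glob, `unique_roots` yields a single root, which is the situation where one scan range per matched file would otherwise return each row twice.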
.lance data file inside the dataset; the reader auto-strips the path back to the dataset root and reads all fragments. If the TVF glob matches multiple .lance files (multi-fragment dataset), each scan range reopens the full dataset, causing duplicate rows. Workaround: ensure the file_path glob matches exactly one data file per dataset.
How to Build
Changes
Thrift (2 files): FORMAT_LANCE = 19, TLanceFileDesc, enable_rust_lance_reader
FE (4 files): LanceFileFormatProperties, FileFormatProperties factory, FileFormatConstants, SessionVariable
BE - Rust (be/src/rust/doris-native/, 6 files): error.rs, lance_reader.rs (LanceReaderConfig with S3/version/vector/FTS support), ffi.rs (extern C functions), lib.rs
BE - C++ (5 files): lance_ffi.h, lance_rust_reader.h/cpp (GenericReader with Arrow import), file_scanner.cpp (FORMAT_LANCE dispatch), internal_service.cpp (fetch_table_schema)
Build (3 files): rust.cmake (Corrosion v0.5), CMakeLists.txt (BUILD_RUST_READERS option), format/CMakeLists.txt
Tests (4 files): 24 Rust tests, C++ GTest (7), C++ standalone (8), Groovy regression (9)
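The Rust entries in the list above mention error.rs and extern C functions in ffi.rs. As a rough illustration of that kind of boundary, the sketch below shows the usual opaque-handle pattern: the caller receives a pointer it cannot inspect, errors come back as integer codes, and a close function returns ownership to Rust. All names here (`SketchReader`, `sketch_reader_open`, the `SKETCH_*` codes) are hypothetical; the PR's actual signatures live in lance_ffi.h and are not reproduced here.

```rust
use std::ffi::{c_char, c_void, CStr};

// Hypothetical status codes; the PR's real error mapping is in error.rs.
pub const SKETCH_OK: i32 = 0;
pub const SKETCH_ERR_NULL_ARG: i32 = 1;
pub const SKETCH_ERR_BAD_UTF8: i32 = 2;

/// Hypothetical reader state hidden behind an opaque pointer.
pub struct SketchReader {
    uri: String,
}

/// Opens a reader and writes its opaque handle to `out`.
/// A real export would also carry a no-mangle attribute so C++ can link it.
pub unsafe extern "C" fn sketch_reader_open(uri: *const c_char, out: *mut *mut c_void) -> i32 {
    if uri.is_null() || out.is_null() {
        return SKETCH_ERR_NULL_ARG;
    }
    let uri = match unsafe { CStr::from_ptr(uri) }.to_str() {
        Ok(s) => s.to_owned(),
        Err(_) => return SKETCH_ERR_BAD_UTF8,
    };
    let handle = Box::into_raw(Box::new(SketchReader { uri }));
    unsafe { *out = handle as *mut c_void };
    SKETCH_OK
}

/// Example accessor: length in bytes of the URI the reader was opened with.
pub unsafe extern "C" fn sketch_reader_uri_len(handle: *const c_void) -> usize {
    if handle.is_null() {
        return 0;
    }
    let reader = unsafe { &*(handle as *const SketchReader) };
    reader.uri.len()
}

/// Closes a reader. Passing null is a no-op; passing the same handle twice
/// is undefined behavior, as with any owning pointer.
pub unsafe extern "C" fn sketch_reader_close(handle: *mut c_void) {
    if !handle.is_null() {
        drop(unsafe { Box::from_raw(handle as *mut SketchReader) });
    }
}
```

Returning codes instead of unwinding matters at this boundary because a Rust panic must not propagate into C++ frames; a production version would typically also catch panics and expose a way to fetch the last error message.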
Future Work