[Feature] Add Spark TsFile connector support for the TsFile table model #843

Description

@JackieTien97

Motivation

Apache TsFile already has ecosystem integration with Spark, and the tree-model Spark TsFile connector can be used as a reference:
https://github.com/apache/iotdb-extras/tree/master/connectors/spark-tsfile

TsFile now also supports the table model, including TableSchema, ColumnCategory.TAG/FIELD, table-model read APIs, and table-model write APIs such as TsFileWriter#registerTableSchema and TsFileWriter#writeTable.

We need a Spark connector for table-model TsFile so users can read and write table-model TsFiles directly through Spark SQL/DataFrame APIs.

Goal

Develop a Spark SQL/DataFrame connector for the TsFile table model. The connector should reuse the existing TsFile Java read/write APIs wherever possible rather than duplicating TsFile parsing or writing logic.

Expected Scope

The initial implementation should support:

  • Reading table-model TsFiles (single files or directories) into Spark DataFrames.
  • Inferring or loading table schemas from TsFile metadata, including:
    • table name
    • time column
    • TAG columns
    • FIELD columns
    • TsFile data types and corresponding Spark SQL types
  • Preserving table-model semantics:
    • TAG columns identify devices
    • FIELD columns represent measurements
    • null values and sparse field values are handled correctly
  • Reading multiple TsFiles with compatible schemas.
  • Column pruning where possible.
  • Predicate pushdown where possible, especially:
    • time-range filters
    • tag filters
  • Writing Spark DataFrames into table-model TsFiles, with options such as:
    • table name
    • tag columns
    • field columns
    • encoding/compression defaults if needed
  • Providing user-facing examples for Spark SQL/DataFrame read and write workflows.
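
The schema-related items above could be prototyped roughly as follows. This is a minimal, dependency-free sketch and not the connector's design: `TsColumn`, `SparkColumn`, `toSparkColumn`, and `toDdl` are hypothetical names, type names are plain strings modeled on TsFile's data types and Spark SQL's type names, and the mapping itself is an assumption that still needs to be agreed on. In particular, whether the time column should surface as `BIGINT` or `TIMESTAMP` is left open here.

```scala
// Hypothetical sketch: none of these names come from TsFile or Spark.
object TsFileSchemaMapping {
  sealed trait ColumnCategory
  case object Time extends ColumnCategory
  case object Tag extends ColumnCategory
  case object Field extends ColumnCategory

  // A column as the connector might describe it after reading TsFile metadata.
  final case class TsColumn(name: String, tsDataType: String, category: ColumnCategory)

  // The same column as it might appear in the Spark schema.
  final case class SparkColumn(name: String, sqlType: String, nullable: Boolean)

  // Assumed mapping from TsFile type names to Spark SQL type names.
  private val typeMapping: Map[String, String] = Map(
    "BOOLEAN"   -> "BOOLEAN",
    "INT32"     -> "INT",
    "INT64"     -> "BIGINT",
    "FLOAT"     -> "FLOAT",
    "DOUBLE"    -> "DOUBLE",
    "TEXT"      -> "STRING",
    "STRING"    -> "STRING",
    "BLOB"      -> "BINARY",
    "DATE"      -> "DATE",
    "TIMESTAMP" -> "TIMESTAMP"
  )

  def toSparkColumn(col: TsColumn): Either[String, SparkColumn] =
    typeMapping.get(col.tsDataType) match {
      case Some(sqlType) =>
        // Assumption: only the time column is non-nullable; tags and sparse fields may be null.
        Right(SparkColumn(col.name, sqlType, nullable = col.category != Time))
      case None =>
        Left(s"Unsupported TsFile data type '${col.tsDataType}' for column '${col.name}'")
    }

  // Builds a DDL-style schema string, or returns the first conversion error.
  def toDdl(cols: Seq[TsColumn]): Either[String, String] = {
    val converted = cols.map(toSparkColumn)
    converted.collectFirst { case Left(err) => err } match {
      case Some(err) => Left(err)
      case None =>
        Right(converted.collect { case Right(c) =>
          if (c.nullable) s"${c.name} ${c.sqlType}" else s"${c.name} ${c.sqlType} NOT NULL"
        }.mkString(", "))
    }
  }
}
```

Returning `Either` keeps unsupported types visible as explicit errors instead of silently dropping columns, which matters when multiple TsFiles with nominally compatible schemas are read together.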

Proposed User Experience

Example read API:

```scala
val df = spark.read
  .format("tsfile")
  .option("model", "table")
  .option("table", "weather")
  .load("/path/to/tsfile-dir")

df.select("time", "city", "device", "temperature")
  .where("city = 'beijing'")
  .show()
```
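
A filter such as `city = 'beijing'` is a tag filter, so it is a candidate for the predicate pushdown listed in the expected scope. One way the connector might decide which predicates to hand to the TsFile reader is sketched below. The `Filter` types are simplified stand-ins that mirror the shape of Spark's data source filters (a column name plus a value), and `split` is a hypothetical helper; the rule that only time-range and tag-equality predicates are pushed is an assumption taken from the scope list, not an implemented behavior.

```scala
// Hypothetical sketch: a stand-in filter model, not Spark's or TsFile's API.
object PushdownSketch {
  sealed trait Filter
  final case class EqualTo(column: String, value: Any) extends Filter
  final case class GreaterThanOrEqual(column: String, value: Any) extends Filter
  final case class LessThan(column: String, value: Any) extends Filter

  // Splits filters into (pushed, residual).
  // Pushed: equality on the time column or a TAG column, and range bounds on the time column.
  // Residual: everything else, which Spark would still evaluate after the scan.
  def split(
      filters: Seq[Filter],
      timeColumn: String,
      tagColumns: Set[String]
  ): (Seq[Filter], Seq[Filter]) =
    filters.partition {
      case EqualTo(c, _)            => c == timeColumn || tagColumns.contains(c)
      case GreaterThanOrEqual(c, _) => c == timeColumn
      case LessThan(c, _)           => c == timeColumn
    }
}
```

For the `weather` example, with `city` and `device` as tag columns, the `city` predicate would be pushed down while a predicate on `temperature` (a FIELD column) would stay with Spark.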
