
docs: Updating spark documentation #5766

Draft
ShimonSte wants to merge 1 commit into main from
docs/spark-connector-query-settings

Conversation

@ShimonSte
Contributor

New Section Issues Addressed
Supported ClickHouse server versions #432 — adds a minimum version table (21.6+ for partition-id, 25.3+ for VariantType, etc.)
Push-down operations #476 — lists all 5 push-down interfaces (column pruning, filters, limit, aggregations, runtime filters) with a table, notes on unsupported
expressions, and how to enable runtime filtering
Connector parallelism #435 — explains read parallelism (1 task per CH partition, 3 modes for Distributed), write parallelism (repartition/sort flow), and how to
control task count
Working with Distributed tables #402 — explicitly documents that inserts go to local tables, all shard hostnames must be accessible, and how to fall back to
coordinator-only mode
Passing query settings and Java client options #430, #431 — documents the option.<key> /
option.custom_http_params mechanism for both Catalog and TableProvider APIs, with a common settings table
Performance tuning #434 — read and write tuning tables plus a recommended bulk-load config snippet
Troubleshooting #436, #365,
#374 — covers Broken Pipe, WHERE 1=0 schema inference overhead, shard hostname resolution, too many tasks, partition expression errors, and stale schema

@ShimonSte ShimonSte requested review from a team as code owners March 17, 2026 09:39
@vercel

vercel bot commented Mar 17, 2026

The latest updates on your projects. Learn more about Vercel for GitHub.

| Project | Deployment | Actions | Updated (UTC) |
| clickhouse-docs | Ready | Preview, Comment | Mar 17, 2026 9:45am |

4 Skipped Deployments

| Project | Deployment | Updated (UTC) |
| clickhouse-docs-jp | Ignored | Mar 17, 2026 9:45am |
| clickhouse-docs-ko | Ignored | Mar 17, 2026 9:45am |
| clickhouse-docs-ru | Ignored | Mar 17, 2026 9:45am |
| clickhouse-docs-zh | Ignored | Mar 17, 2026 9:45am |


Contributor

@BentsiLeviav left a comment


Besides the comments I left, addressing these issues would make our docs even better:

  • Let's fix the UInt64 mapping. Right now it maps to LongType; after your change in ClickHouse/spark-clickhouse-connector#477 it should be DecimalType(20, 0)
  • For the option allowUnsupportedSharding, the docs say since 0.10.0 while the code says since 0.9.0
  • There is an SBT syntax issue (line 141) - the artifact name is unquoted
  • We wrongly use : instead of - in the jar path. --jars /path/clickhouse-spark-runtime-{{ spark_binary_version }}_{{ scala_binary_version }}:{{ stable_version }}.jar
    should be changed to --jars /path/clickhouse-spark-runtime-{{ spark_binary_version }}_{{ scala_binary_version }}-{{ stable_version }}.jar
  • We have a typo in the Gradle snippet in line 131 (repositries)
  • In the compatibility matrix - let's update that main uses JDBC 0.9.5 and not 0.9.4
  • In version 0.9.0, we added the option to provide settings on read using spark.clickhouse.read.settings - let's document it in the configuration table.
  • spark.clickhouse.read.jsonAs is missing from the configuration table as well

| `spark.clickhouse.read.splitByPartitionId` (partition-id-based filtering) | 21.6+ |
| `VariantType` / `JSON` type support | 25.3+ |

For production deployments, we recommend running ClickHouse 23.x or later. The connector is tested against the latest stable ClickHouse releases.
Contributor


I think we should recommend using the latest, and not a specific version - WDYT?

Contributor


Can you add a link to the GitHub workflow ClickHouse version list? That way, users would be able to see exactly which versions we test against.


When using ClickHouse `Distributed` tables with the connector, there is an important networking and architecture consideration:

**The connector bypasses the Distributed engine and connects directly to each shard node.**
Contributor


The current phrasing sounds a bit like it's "working around" the Distributed engine. I'll suggest writing something like:

Suggested change
**The connector bypasses the Distributed engine and connects directly to each shard node.**
**Following ClickHouse best practices for high-throughput ingestion, the connector writes data directly to the underlying shard nodes rather than using the Distributed engine.**

Feel free to suggest other versions if you find them fit.

This means:

1. **All shard hostnames must be reachable** from Spark executors. The connector reads cluster topology from `system.clusters` on the coordinator node, then opens direct connections to each shard. If the shard hostnames returned by `system.clusters` are internal cluster names not resolvable from outside, reads and writes will fail.
2. **Inserts go to local tables**: When writing, data is inserted directly into the local `MergeTree` table on each shard (not through the `Distributed` table). This is more efficient but requires that the local table exists on every shard and that Spark can connect to each shard directly.
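When direct shard access isn't possible, the coordinator-only fallback mentioned above can be shown as a config sketch (the read-side option is documented later in this thread; the write-side option name is my assumption and should be checked against the connector docs):

```python
# Coordinator-only mode: route reads (and, assumed, writes) through the
# Distributed table on the coordinator instead of dialing each shard directly.
coordinator_only_confs = {
    # Read-side option (appears in the read-parallelism table in this PR).
    "spark.clickhouse.read.distributed.convertLocal": "false",
    # Assumed write-side counterpart; verify the exact name before relying on it.
    "spark.clickhouse.write.distributed.convertLocal": "false",
}
# for key, value in coordinator_only_confs.items():
#     spark.conf.set(key, value)   # needs an active SparkSession
```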
Contributor


## Passing query settings and Java client options {#passing-query-settings}

The connector uses ClickHouse's HTTP interface via the [ClickHouse Java client](https://github.com/ClickHouse/clickhouse-java). You can pass arbitrary ClickHouse session settings and HTTP client options using the `option.<key>` prefix.
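For example, with the Catalog API the prefix could be used like this (a sketch: the catalog class name and host are placeholders I've assumed; the `option.<key>` entries mirror the common-settings table in this section):

```python
# Sketch: ClickHouse session settings and Java client options are passed with
# the option.<key> prefix. Host and catalog class name are placeholders.
catalog_confs = {
    "spark.sql.catalog.clickhouse": "com.clickhouse.spark.ClickHouseCatalog",  # assumed class name
    "spark.sql.catalog.clickhouse.host": "clickhouse.example.com",             # placeholder host
    "spark.sql.catalog.clickhouse.option.ssl": "true",
    "spark.sql.catalog.clickhouse.option.custom_http_params": "max_execution_time=300",
}
# for key, value in catalog_confs.items():
#     spark.conf.set(key, value)   # needs an active SparkSession
```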

Contributor


Suggested change
:::note
All available Java client options are listed in [ClientConfigProperties.java](https://github.com/ClickHouse/clickhouse-java/blob/main/client-v2/src/main/java/com/clickhouse/client/api/ClientConfigProperties.java) and [this documentation page](https://clickhouse.com/docs/integrations/language-clients/java/client#configuration).
All available server session settings are listed in [this documentation page](https://clickhouse.com/docs/operations/settings/settings).
:::

| `max_execution_time=300` | `option.custom_http_params` | Extend query timeout (seconds) for large reads |
| `session_timeout=60` | `option.custom_http_params` | Extend HTTP session timeout |
| `ssl` | `option.ssl` | Enable TLS (`true`/`false`) |
| `client_name` | `option.client_name` | Tag requests with a client name visible in `system.query_log` |
Contributor


We don't append multiple client_name values in the connector, and the Java client doesn't do it automatically when multiple client_name values are set (see https://github.com/ClickHouse/spark-clickhouse-connector/blob/a1d8b7b32cae27e2fe38133c10ce10613b824013/clickhouse-core/src/main/scala/com/clickhouse/spark/client/NodeClient.scala#L97-L104), so I'm not sure this will work. Did you verify it?

### Via TableProvider API {#query-settings-tableprovider-api}

```python
df.read \
```
Contributor


Did you mean spark.read?


### Common settings {#query-settings-common}

| Setting | Example value | Purpose |
Contributor


The table header is not aligned with the data. The first two columns should be swapped



**Cause**: When aggregation push-down is triggered, the connector sends a probe query to ClickHouse to determine the output schema of the pushed-down aggregation. This is by design and only occurs when aggregation push-down is active.

**Fix**: If this causes unacceptable load, avoid triggering aggregation push-down by handling the aggregation in Spark instead:
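The snippet that originally followed here is cut off in this view; reconstructed from the surrounding description, the workaround presumably looked like this (table and column names are placeholders):

```python
# Do the aggregation in Spark so the connector never pushes it down and never
# issues the WHERE 1=0 probe query. Note this reads the full table into Spark.
# df = spark.table("clickhouse.db.events")   # hypothetical catalog/table name
# counts = df.groupBy("user_id").count()     # aggregation runs in Spark

# The same intent on a plain-Python stand-in, just to make the shape concrete:
from collections import Counter
rows = [{"user_id": 1}, {"user_id": 1}, {"user_id": 2}]   # stand-in data
counts = Counter(r["user_id"] for r in rows)              # groupBy + count
```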
Contributor


This "fix" reads the entire table into Spark before aggregating, which in Spark's use case (working with big data), would potentially be far worse than the probe query it is trying to avoid.

For most cases, the WHERE 1=0 probe query is just a schema-determination round-trip (near-zero cost, and if it is not, it might suggest other problems), and the SELECT * workaround trades a free probe query for a full table scan.

Do we have other alternatives to avoid it? I see the question in ClickHouse/spark-clickhouse-connector#374 was for a situation where a schema is being provided. Is there a feature flag disabling this once a schema is provided? If not, I would encourage us to keep the issue open, and once we develop such functionality, add it to the docs.


### Read parallelism {#read-parallelism}

The connector creates one Spark input partition per ClickHouse physical partition (data part group). The number of Spark read tasks equals the number of distinct partition values in the ClickHouse table.
Contributor


Suggested change
The connector creates one Spark input partition per ClickHouse physical partition (data part group). The number of Spark read tasks equals the number of distinct partition values in the ClickHouse table.
The connector creates one Spark input partition per ClickHouse physical partition (data part group). The number of Spark read tasks equals the number of distinct partition values in the ClickHouse table. Please visit [Table Partitions](https://clickhouse.com/docs/partitions) for more information on partitioning.

| `Distributed` table with `spark.clickhouse.read.distributed.convertLocal=true` (default) | (number of shards) × (partitions per shard) tasks, each targeting a specific shard node directly |
| `Distributed` table with `spark.clickhouse.read.distributed.convertLocal=false` | 1 task, reading through the `Distributed` coordinator node |

For a table with no `PARTITION BY` clause, the connector creates one Spark partition per ClickHouse data part group, which may result in many small tasks. Use `PARTITION BY` in your ClickHouse table to control parallelism granularity.
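To make the point above concrete, a DDL sketch (table, column, and partition expression are placeholders; with this layout the connector would create roughly one read task per distinct month present in the table):

```python
# Placeholder DDL showing how PARTITION BY bounds read-task granularity:
# one Spark input partition per distinct toYYYYMM(ts) value.
ddl = """
CREATE TABLE db.events (
    ts   DateTime,
    body String
) ENGINE = MergeTree()
PARTITION BY toYYYYMM(ts)
ORDER BY ts
"""
# spark.sql(ddl)   # requires the ClickHouse catalog to be configured
```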
Contributor


which may result in many small tasks

Would love to hear your thoughts to see if my understanding here is right:

  1. planInputPartitions returns Array[InputPartition] and Spark creates one task per element.
  2. createReader(partition) is called once per element from that array. It takes a single ClickHouseInputPartition and creates one reader for it.

So, to summarize, the mapping is 1-to-1: one InputPartition object maps to one Spark task (which maps to one reader).

If that is correct, there is no fan-out.

For the no-PARTITION BY case: queryPartitionSpec returns Array(NoPartitionSpec) (one element) → inputPartitions has length 1 → planInputPartitions returns an array of 1 → Spark schedules exactly 1 task.

@ShimonSte ShimonSte marked this pull request as draft March 23, 2026 11:03
@ShimonSte ShimonSte added the Don't Merge Don't merge yet label Mar 23, 2026
