**BentsiLeviav** left a comment:
Besides the comments I left, addressing these issues would make our docs even better:
- Let's fix the `UInt64` mapping. Right now, it maps to `LongType`; after your change here (ClickHouse/spark-clickhouse-connector#477), it is `DecimalType(20, 0)`.
- For the option `allowUnsupportedSharding`, the docs say since `0.10.0`, but the code says since `0.9.0`.
- There is an SBT syntax issue (line 141): the artifact name is unquoted.
- We wrongly use `:` in the jar path instead of `-`. `--jars /path/clickhouse-spark-runtime-{{ spark_binary_version }}_{{ scala_binary_version }}:{{ stable_version }}.jar` should be changed to `--jars /path/clickhouse-spark-runtime-{{ spark_binary_version }}_{{ scala_binary_version }}-{{ stable_version }}.jar`.
- We have a typo in the Gradle snippet in line 131 (`repositries`).
- In the compatibility matrix, let's update that `main` uses JDBC `0.9.5` and not `0.9.4`.
- In version `0.9.0`, we added the option to provide settings on read using `spark.clickhouse.read.settings`; let's document it in the configuration table. `spark.clickhouse.read.jsonAs` is missing from the configuration table as well.
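For context on the first bullet, a quick sketch of why `UInt64` cannot safely map to `LongType` (plain Python, no Spark required):

```python
# Spark's LongType is a signed 64-bit integer; ClickHouse UInt64 is unsigned.
UINT64_MAX = 2**64 - 1   # 18446744073709551615
LONG_MAX = 2**63 - 1     # largest value LongType can hold

print(UINT64_MAX > LONG_MAX)   # True: the upper half of the UInt64 range overflows
print(len(str(UINT64_MAX)))    # 20 digits, hence DecimalType(20, 0)
```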
> | `spark.clickhouse.read.splitByPartitionId` (partition-id-based filtering) | 21.6+ |
> | `VariantType` / `JSON` type support | 25.3+ |
>
> For production deployments, we recommend running ClickHouse 23.x or later. The connector is tested against the latest stable ClickHouse releases.
I think we should recommend using the latest, and not a specific version - WDYT?
Can you add a link to the GitHub workflow's ClickHouse version list? That way, users would be able to see exactly which versions we test with.
> When using ClickHouse `Distributed` tables with the connector, there is an important networking and architecture consideration:
>
> **The connector bypasses the Distributed engine and connects directly to each shard node.**
The current phrasing sounds a bit like it's "working around" something. I'll suggest writing something like:
```suggestion
**Following ClickHouse best practices for high-throughput ingestion, the connector writes data directly to the underlying shard nodes rather than using the Distributed engine.**
```
Feel free to suggest other versions if you find a better fit.
> This means:
>
> 1. **All shard hostnames must be reachable** from Spark executors. The connector reads cluster topology from `system.clusters` on the coordinator node, then opens direct connections to each shard. If the shard hostnames returned by `system.clusters` are internal cluster names not resolvable from outside, reads and writes will fail.
> 2. **Inserts go to local tables**: When writing, data is inserted directly into the local `MergeTree` table on each shard (not through the `Distributed` table). This is more efficient but requires that the local table exists on every shard and that Spark can connect to each shard directly.
It would be nice to include a link to this doc page: https://clickhouse.com/docs/engines/table-engines/special/distributed#distributed-writing-data
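To make point 1 above concrete, here is a minimal sketch of checking whether shard hostnames are resolvable from the Spark side (the cluster name and query text in the comment are illustrative placeholders, not the connector's actual code):

```python
import socket

# The connector discovers shard hostnames via system.clusters, e.g. with:
#   SELECT shard_num, host_name FROM system.clusters WHERE cluster = 'my_cluster'
# ('my_cluster' is a placeholder cluster name.)
def resolvable(host: str) -> bool:
    """Return True if the host resolves from this machine (a Spark executor)."""
    try:
        socket.gethostbyname(host)
        return True
    except OSError:
        return False

print(resolvable("localhost"))  # True on a typical setup
```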
> ## Passing query settings and Java client options {#passing-query-settings}
>
> The connector uses ClickHouse's HTTP interface via the [ClickHouse Java client](https://github.com/ClickHouse/clickhouse-java). You can pass arbitrary ClickHouse session settings and HTTP client options using the `option.<key>` prefix.
```suggestion
:::note
All available Java client options are listed in [ClientConfigProperties.java](https://github.com/ClickHouse/clickhouse-java/blob/main/client-v2/src/main/java/com/clickhouse/client/api/ClientConfigProperties.java) and [this documentation page](https://clickhouse.com/docs/integrations/language-clients/java/client#configuration).
All available server session settings are listed in [this documentation page](https://clickhouse.com/docs/operations/settings/settings).
:::
```
> | `max_execution_time=300` | `option.custom_http_params` | Extend query timeout (seconds) for large reads |
> | `session_timeout=60` | `option.custom_http_params` | Extend HTTP session timeout |
> | `ssl` | `option.ssl` | Enable TLS (`true`/`false`) |
> | `client_name` | `option.client_name` | Tag requests with a client name visible in `system.query_log` |
We don't append multiple `client_name` values in the connector, and the Java client doesn't do it automatically when `client_name` is set in multiple places (as in https://github.com/ClickHouse/spark-clickhouse-connector/blob/a1d8b7b32cae27e2fe38133c10ce10613b824013/clickhouse-core/src/main/scala/com/clickhouse/spark/client/NodeClient.scala#L97-L104), so I'm not sure that will work. Did you verify it?
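For reference, the `option.<key>` mechanism quoted above boils down to a prefix-stripping step, roughly like this sketch (the dict keys come from the table above; the stripping logic and the client name are illustrative, not the connector's actual code):

```python
# Options as they would be passed to the Spark reader/writer.
spark_options = {
    "option.ssl": "true",
    "option.client_name": "spark-etl",  # illustrative client name
    "option.custom_http_params": "max_execution_time=300,session_timeout=60",
}

# The connector forwards everything under the "option." prefix to the
# underlying Java client / HTTP layer with the prefix removed.
client_settings = {
    key[len("option."):]: value
    for key, value in spark_options.items()
    if key.startswith("option.")
}
print(sorted(client_settings))  # ['client_name', 'custom_http_params', 'ssl']
```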
> ### Via TableProvider API {#query-settings-tableprovider-api}
>
> ```python
> df.read \
> ```
Did you mean `spark.read`?
> ### Common settings {#query-settings-common}
>
> | Setting | Example value | Purpose |
> **Cause**: When aggregation push-down is triggered, the connector sends a probe query to ClickHouse to determine the output schema of the pushed-down aggregation. This is by design and only occurs when aggregation push-down is active.
>
> **Fix**: If this causes unacceptable load, avoid triggering aggregation push-down by handling the aggregation in Spark instead:
This "fix" reads the entire table into Spark before aggregating, which in Spark's use case (working with big data) could be far worse than the probe query it is trying to avoid.
For most cases, the `WHERE 1=0` probe query is just a schema-determination round trip (near-zero cost, and if it is not, that might suggest other problems), and the `SELECT *` workaround trades a free probe query for a full table scan.
Do we have other alternatives to avoid it? I see the question in ClickHouse/spark-clickhouse-connector#374 was for a situation where a schema is being provided. Is there a feature flag disabling this once a schema is provided? If not, I would encourage us to keep the issue open and, once we develop such functionality, add it to the docs.
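For readers following along, the probe pattern under discussion can be sketched like this (the table and aggregation are made up; only the `WHERE 1=0` wrapping is the point):

```python
# A zero-row wrapper returns only result-set metadata (column names/types),
# which is why the probe is normally a near-free round trip.
pushed_down = "SELECT region, count(*) AS cnt FROM orders GROUP BY region"
probe = f"SELECT * FROM ({pushed_down}) WHERE 1=0"
print(probe)
```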
> ### Read parallelism {#read-parallelism}
>
> The connector creates one Spark input partition per ClickHouse physical partition (data part group). The number of Spark read tasks equals the number of distinct partition values in the ClickHouse table.
```suggestion
The connector creates one Spark input partition per ClickHouse physical partition (data part group). The number of Spark read tasks equals the number of distinct partition values in the ClickHouse table. Please visit [Table Partitions](https://clickhouse.com/docs/partitions) for more information on partitioning.
```
> | `Distributed` table with `spark.clickhouse.read.distributed.convertLocal=true` (default) | (number of shards) × (partitions per shard) tasks, each targeting a specific shard node directly |
> | `Distributed` table with `spark.clickhouse.read.distributed.convertLocal=false` | 1 task, reading through the `Distributed` coordinator node |
>
> For a table with no `PARTITION BY` clause, the connector creates one Spark partition per ClickHouse data part group, which may result in many small tasks. Use `PARTITION BY` in your ClickHouse table to control parallelism granularity.
> which may result in many small

Would love to hear your thoughts to see if my understanding here is right:

- `planInputPartitions` returns `Array[InputPartition]`, and Spark creates one task per element.
- `createReader(partition)` is called once per element from that array. It takes a single `ClickHouseInputPartition` and creates one reader for it.

So, to summarize: the mapping is 1 to 1. One `InputPartition` object maps to one Spark task (which maps to one reader).
If that is correct, there is no fan-out.
For the no-`PARTITION BY` case: `queryPartitionSpec` returns `Array(NoPartitionSpec)` (one element) → `inputPartitions` has length 1 → `planInputPartitions` returns an array of 1 → Spark schedules exactly 1 task.
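My understanding above, as a toy model (the function names mirror the connector's API, but the code is purely illustrative):

```python
# One InputPartition per element of planInputPartitions(); Spark then creates
# one task, and one reader, per InputPartition: a strict 1:1:1 mapping.
def plan_input_partitions(partition_specs):
    return list(partition_specs)          # one InputPartition per spec

def schedule(input_partitions):
    return [f"reader({p})" for p in input_partitions]  # one reader per task

# No PARTITION BY: queryPartitionSpec yields a single NoPartitionSpec.
tasks = schedule(plan_input_partitions(["NoPartitionSpec"]))
print(len(tasks))  # exactly 1 task
```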

- `option.<key>` / `option.custom_http_params` mechanism for both Catalog and TableProvider APIs, with a common settings table
- `WHERE 1=0` schema inference overhead, shard hostname resolution, too many tasks, partition expression errors, and stale schema