
chore: create migration for coordinator schema #28139

Open
wants to merge 18 commits into base: master

Conversation

@Daesgar Daesgar (Contributor) commented on Jan 31, 2025:

Problem

The coordinator schema is manually created.

Changes

This adds the migration to the repository so that, from this point onward, we can evolve the schema through code rather than manually.

I'll also compare the schema created by this migration with the actual, existing schemas and fix any differences where applicable.

Finally, I will add the entry to the migrations table manually, since this schema already exists and we don't want to recreate it.
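
For reference, a hedged sketch of what recording an already-applied ClickHouse migration by hand might look like. It assumes the default bookkeeping table used by infi.clickhouse_orm (infi_clickhouse_orm_migrations) and an illustrative package name; the real table, columns, and names should be verified in the target environment first.

from posthog.clickhouse.client import sync_execute  # assumed import path

# Hypothetical sketch: mark migration 0096 as applied so it is not re-run
# against the existing coordinator schema. Table and package names are
# assumptions, not taken from this PR.
sync_execute(
    """
    INSERT INTO infi_clickhouse_orm_migrations (package_name, module_name, applied)
    VALUES ('posthog.clickhouse.migrations', '0096_coordinator_schemas', today())
    """
)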

👉 Stay up-to-date with PostHog coding conventions for a smoother review.

Does this work well for both Cloud and self-hosted?

Yes

How did you test this code?

Running migrations locally.

HostInfo(ConnectionInfo(host_address, port), shard_num, replica_num, host_cluster_type, host_cluster_role)

HostInfo(
    ConnectionInfo(host_address, port),
    shard_num if host_cluster_role != "coordinator" else None,

@Daesgar (author) commented:

I don't think we want to treat the coordinators as separate shards, since they are merely "compute" nodes. This mainly matters for the map_host_per_shard function, so that it doesn't take the coordinators into account.
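
A minimal sketch of the behaviour this enables, assuming map_host_per_shard groups hosts by shard number (the actual PostHog implementation may differ):

from collections import defaultdict

def map_host_per_shard(host_infos):
    """Group hosts by shard_num, skipping coordinators (whose shard_num is None)."""
    hosts_per_shard = defaultdict(list)
    for host_info in host_infos:
        if host_info.shard_num is None:
            # Coordinators are "compute"-only nodes and hold no shard data.
            continue
        hosts_per_shard[host_info.shard_num].append(host_info)
    return hosts_per_shard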

@Daesgar Daesgar marked this pull request as ready for review January 31, 2025 16:56
@Daesgar Daesgar requested a review from a team as a code owner January 31, 2025 16:56

@greptile-apps greptile-apps bot left a comment

PR Summary

This PR adds a migration for the coordinator schema in ClickHouse, moving away from manual schema creation. Here are the key changes:

  • Added new migration file 0096_coordinator_schemas.py to create distributed tables and dictionaries specifically for coordinator nodes
  • Introduced ON_CLUSTER_CLAUSE lambda function across SQL files to centralize cluster clause generation
  • Modified table creation SQL functions to accept optional on_cluster parameter to control whether cluster clauses are included
  • Updated SQL generation functions to use NodeRole.COORDINATOR when creating coordinator-specific schemas
  • Added documentation in README.md explaining when to run migrations on worker nodes vs all nodes

The changes standardize schema management through code rather than manual intervention while maintaining backward compatibility. The PR follows best practices by avoiding ON CLUSTER clauses in migrations since they will be run through run_sql_with_exceptions.
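
As a rough, hypothetical illustration of the on_cluster pattern described above (EXAMPLE_TABLE_SQL and the table itself are made up; only the ON_CLUSTER_CLAUSE shape is taken from this PR):

from django.conf import settings

ON_CLUSTER_CLAUSE = lambda: f"ON CLUSTER '{settings.CLICKHOUSE_CLUSTER}'"

def EXAMPLE_TABLE_SQL(on_cluster=True):
    # Worker-node migrations keep the cluster clause; coordinator migrations
    # pass on_cluster=False because run_sql_with_exceptions already targets
    # the right nodes.
    return f"""
    CREATE TABLE IF NOT EXISTS example_table {ON_CLUSTER_CLAUSE() if on_cluster else ''}
    (
        team_id Int64,
        created_at DateTime
    )
    ENGINE = MergeTree()
    ORDER BY (team_id, created_at)
    """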

23 file(s) reviewed, 7 comment(s)

run_sql_with_exceptions(CREATE_COHORTPEOPLE_TABLE_SQL(on_cluster=False), node_role=NodeRole.COORDINATOR),
run_sql_with_exceptions(DISTRIBUTED_EVENTS_RECENT_TABLE_SQL(on_cluster=False), node_role=NodeRole.COORDINATOR),
run_sql_with_exceptions(DISTRIBUTED_EVENTS_TABLE_SQL(on_cluster=False), node_role=NodeRole.COORDINATOR),
run_sql_with_exceptions(EVENTS_RECENT_TABLE_SQL(on_cluster=False), node_role=NodeRole.COORDINATOR),

logic: EVENTS_RECENT_TABLE_SQL is not a distributed table and should not be included in the coordinator schema


### When to run a migration for all nodes

- Basically when the migration does not include any of the above listed in the previous section.

style: This line could be more explicit, e.g. 'When the migration is not for any of the table types listed in the worker node section above', rather than the current, less precise wording

Comment on lines +51 to +53
The ON CLUSTER clause is used to specify the cluster to run the DDL statement on. By default, the `posthog` cluster is used. That cluster only includes the worker nodes.

Ideally, **do not use the ON CLUSTER clause**, since the DDL statement will be run on all nodes anyway through the `run_sql_with_exceptions` function, and, by default, the ON CLUSTER clause makes the DDL statement run on nodes specified for the default cluster, and that does not include the coordinator.

style: The explanation of why not to use ON CLUSTER could be clearer. The current wording implies two separate reasons, but they are actually connected: run_sql_with_exceptions handles cluster distribution and includes the coordinator, while ON CLUSTER only targets the worker nodes
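
A small, hedged illustration of that guidance, reusing names from the migration snippet earlier in this PR (the import paths are assumptions):

from posthog.clickhouse.client.migration_tools import run_sql_with_exceptions  # assumed path
from posthog.clickhouse.cluster import NodeRole  # assumed path
from posthog.models.event.sql import DISTRIBUTED_EVENTS_TABLE_SQL  # assumed path

# No ON CLUSTER in the generated DDL: run_sql_with_exceptions decides which
# nodes run the statement via node_role, and that can include the coordinator,
# whereas ON CLUSTER would only reach the default (worker-only) cluster.
operations = [
    run_sql_with_exceptions(DISTRIBUTED_EVENTS_TABLE_SQL(on_cluster=False), node_role=NodeRole.COORDINATOR),
]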

@@ -18,23 +18,23 @@
)
from posthog.kafka_client.topics import KAFKA_EVENTS_JSON

ON_CLUSTER_CLAUSE = lambda: f"ON CLUSTER '{settings.CLICKHOUSE_CLUSTER}'"

style: Consider making this a regular function instead of a lambda since it's used in many places and doesn't need to be anonymous
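
For illustration, the suggested refactor could look roughly like this (a hypothetical sketch, not code from the PR); folding the on_cluster flag into the helper would also remove the repeated conditional expressions at the call sites:

from django.conf import settings

def ON_CLUSTER_CLAUSE(on_cluster: bool = True) -> str:
    """Return the ON CLUSTER clause for the default cluster, or an empty string."""
    return f"ON CLUSTER '{settings.CLICKHOUSE_CLUSTER}'" if on_cluster else ""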

Comment on lines 11 to 12
DROP_GROUPS_TABLE_SQL = f"DROP TABLE {GROUPS_TABLE} ON CLUSTER '{CLICKHOUSE_CLUSTER}'"
TRUNCATE_GROUPS_TABLE_SQL = f"TRUNCATE TABLE {GROUPS_TABLE} ON CLUSTER '{CLICKHOUSE_CLUSTER}'"

style: These SQL statements still use hardcoded ON CLUSTER clauses while the rest of the file uses the new ON_CLUSTER_CLAUSE function. Should update for consistency.

Suggested change
- DROP_GROUPS_TABLE_SQL = f"DROP TABLE {GROUPS_TABLE} ON CLUSTER '{CLICKHOUSE_CLUSTER}'"
- TRUNCATE_GROUPS_TABLE_SQL = f"TRUNCATE TABLE {GROUPS_TABLE} ON CLUSTER '{CLICKHOUSE_CLUSTER}'"
+ DROP_GROUPS_TABLE_SQL = f"DROP TABLE {GROUPS_TABLE} {ON_CLUSTER_CLAUSE()}"
+ TRUNCATE_GROUPS_TABLE_SQL = f"TRUNCATE TABLE {GROUPS_TABLE} {ON_CLUSTER_CLAUSE()}"

table_name="ingestion_warnings",
cluster=settings.CLICKHOUSE_CLUSTER,
on_cluster_clause=ON_CLUSTER_CLAUSE() if on_cluster else "",
engine=Distributed(data_table="sharded_ingestion_warnings", sharding_key="rand()"),

logic: Using rand() as a sharding key can lead to uneven data distribution across shards. Consider using a more deterministic key based on team_id or another relevant field.
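
A hedged sketch of the kind of alternative the comment suggests; the Distributed engine call is copied from the snippet above, the import path and the choice of key are assumptions, and whether team-based sharding is actually desirable for ingestion_warnings is left to the reviewers:

from posthog.clickhouse.table_engines import Distributed  # assumed import path

# Deterministic sharding: all ingestion warnings for a team land on the same
# shard. Distribution evenness then depends on how event volume is spread
# across teams, rather than on rand().
engine = Distributed(
    data_table="sharded_ingestion_warnings",
    sharding_key="sipHash64(team_id)",
)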

@@ -85,7 +82,10 @@
-- the newest known mapping for it in the table. Query side we will need to
-- ensure that we are always querying the latest version of the mapping.
ORDER BY (team_id, old_person_id)
"""
""".format(
engine=ReplacingMergeTree("person_overrides", replication_scheme=ReplicationScheme.REPLICATED, ver="version")

style: Consider adding a TTL clause to automatically clean up old version records after a certain period
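
For illustration only, a TTL could be added roughly as below. This sketch assumes a DateTime column named created_at and a six-month window, neither of which is established by this PR; note also that a row-level TTL expires every row older than the window, not only superseded versions, so whether it fits person_overrides is a judgment call.

# Hypothetical DDL, not the actual person_overrides schema change.
PERSON_OVERRIDES_TTL_SQL = """
    ALTER TABLE person_overrides
    MODIFY TTL created_at + INTERVAL 6 MONTH
"""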
