This document introduces the design of the global connection ID, and the global `KILL <connID>` based on it.

Currently, connection IDs are local to TiDB instances: a `KILL x` must be directed to the correct instance and cannot safely be load balanced across the cluster, as discussed here.
To support "Global Kill", we need:

- Global connection IDs that are unique among all TiDB instances.
- Redirection of `KILL x` to the target TiDB instance on which connection `x` is running.
- Support for both 32-bit and 64-bit `connID`s. The 32-bit `connID` is used on small clusters (fewer than 2048 TiDB instances) to be fully compatible with all clients, including legacy 32-bit ones, while the 64-bit `connID` is used for big clusters. Bit 0 of `connID` is a markup bit that distinguishes these two kinds.
```
 31        21 20               1   0
+------------+------------------+------+
|  serverID  |   local connID   |markup|
|   (11b)    |      (20b)       |  =0  |
+------------+------------------+------+

 63 62                 41 40                                     1   0
+--+---------------------+----------------------------------------+------+
|  |      serverID       |              local connID              |markup|
|=0|       (22b)         |                  (40b)                 |  =1  |
+--+---------------------+----------------------------------------+------+
```
The key factor is the `serverID` (see the `serverID` section for details), which depends on the number of TiDB instances in the cluster:

- Choose 32 bits when the number of TiDB instances is less than 2048. Otherwise choose 64 bits.
- When 32 bits is chosen, upgrade to 64 bits when: 1) acquiring a `serverID` fails (because all are occupied) more than 3 times in a row, which happens when the cluster is growing rapidly; or 2) all `local connID`s in the 32-bit `connID` space are in use (see the `local connID` section for details).
- When 64 bits is chosen, downgrade to 32 bits gradually as the cluster scales down from big to small, since TiDB instances keep their `serverID` until the next restart or loss of connection to PD. (The retry-and-upgrade policy is sketched below.)
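As a rough illustration, the retry-and-upgrade policy might look like the following Go sketch, where `tryAcquire32`/`tryAcquire64` are hypothetical stand-ins for the PD-backed acquisition described in the `serverID` section:

```go
// acquireServerID sketches the width-selection and upgrade policy.
// tryAcquire32/tryAcquire64 are assumed helpers, not TiDB's actual API.
func acquireServerID(numInstances int, tryAcquire32, tryAcquire64 func() (uint64, bool)) (serverID uint64, use64 bool) {
	const smallClusterMax = 2048 // threshold from the design above
	if numInstances < smallClusterMax {
		// Small cluster: prefer 32-bit connID for full client compatibility.
		for i := 0; i < 3; i++ {
			if id, ok := tryAcquire32(); ok {
				return id, false
			}
		}
		// Repeated consecutive failures: the cluster is likely growing
		// rapidly, so fall through and upgrade to 64 bits.
	}
	for {
		if id, ok := tryAcquire64(); ok {
			return id, true
		}
		// All 64-bit serverIDs occupied: keep retrying; meanwhile the
		// instance blocks new connections (see the serverID section).
	}
}
```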
Bit 63 is always ZERO, keeping `connID` within the non-negative int64 range, to be friendlier to existing code and to languages that lack a primitive `uint64` type.

- `markup == 0` indicates that the `connID` is effectively just 32 bits long, and the high 32 bits should be all zeros. Compatible with legacy 32-bit clients.
- `markup == 1` indicates that the `connID` is 64 bits long. Incompatible with legacy 32-bit clients.
- `markup == 1` while the high 32 bits are all zeros indicates that a 32-bit truncation happened. See the `Compatibility` section.
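As a concrete illustration of the two layouts and the markup bit, here is a minimal Go sketch of encoding, decoding, and truncation detection. The function names are illustrative, not TiDB's actual API:

```go
// Encode32 builds a 32-bit connID: |serverID(11b)|local connID(20b)|markup=0|.
func Encode32(serverID, localConnID uint64) uint64 {
	return serverID<<21 | localConnID<<1 // markup bit 0 stays 0
}

// Encode64 builds a 64-bit connID: |0|serverID(22b)|local connID(40b)|markup=1|.
// Bit 63 is never set, so the result fits in a non-negative int64.
func Encode64(serverID, localConnID uint64) uint64 {
	return serverID<<41 | localConnID<<1 | 1
}

// Decode splits a connID back into serverID and local connID,
// using the markup bit to tell the two layouts apart.
func Decode(connID uint64) (serverID, localConnID uint64, is64 bool) {
	if connID&1 == 1 { // markup == 1: 64-bit layout
		return (connID >> 41) & (1<<22 - 1), (connID >> 1) & (1<<40 - 1), true
	}
	// markup == 0: 32-bit layout; the high 32 bits should be all zeros.
	return (connID >> 21) & (1<<11 - 1), (connID >> 1) & (1<<20 - 1), false
}

// Truncated reports whether a connID with markup == 1 has lost its
// high 32 bits, e.g. truncated by a legacy 32-bit client.
func Truncated(connID uint64) bool {
	return connID&1 == 1 && connID>>32 == 0
}
```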
- `serverID` is selected RANDOMLY from the `serverIDs pool` (see next) by each TiDB instance on startup, and its uniqueness is guaranteed by PD (etcd). `serverID` should be greater than or equal to 1, to ensure that the high 32 bits of a 64-bit `connID` are always non-zero, making truncation detectable.
- The `serverIDs pool` is:
  - All UNUSED `serverID`s within [1, 2047], acquired from `CLUSTER_INFO`, when the 32-bit `connID` is chosen.
  - All `serverID`s within [2048, 2^22-1] when the 64-bit `connID` is chosen.
- On failure (e.g. failing to connect to PD, or all `serverID`s being unavailable when the 64-bit `connID` is chosen), we block any new connection.
- `serverID` is kept by PD with a lease (defaulting to 12 hours, long enough to avoid brutally killing any long-running SQL). If TiDB is disconnected from PD for longer than half of the lease (6 hours by default), all connections are killed and no new connection is accepted, to avoid running with a stale/incorrect `serverID`. When the connection to PD is restored, a new `serverID` is acquired before accepting new connections.
- On a single TiDB instance without PD, a `serverID` of `1` is assigned.
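A minimal sketch of how such a lease-guarded, collision-free acquisition could be done against etcd via the official `clientv3` API; the key prefix, value, and helper name are assumptions, not TiDB's actual ones:

```go
import (
	"context"
	"fmt"
	"math/rand"

	clientv3 "go.etcd.io/etcd/client/v3"
)

// tryAcquireServerID attempts to claim one random serverID in [lo, hi]
// by creating a key under an etcd lease. The transaction succeeds only
// if the key does not already exist, which is what guarantees
// cluster-wide uniqueness.
func tryAcquireServerID(ctx context.Context, cli *clientv3.Client, lo, hi uint64) (uint64, clientv3.LeaseID, error) {
	const leaseTTL = 12 * 60 * 60 // 12-hour lease, per the design above

	lease, err := cli.Grant(ctx, leaseTTL)
	if err != nil {
		return 0, 0, err
	}
	id := lo + rand.Uint64()%(hi-lo+1) // RANDOM pick within the pool

	key := fmt.Sprintf("/tidb/server_id/%d", id) // assumed key layout
	resp, err := cli.Txn(ctx).
		If(clientv3.Compare(clientv3.CreateRevision(key), "=", 0)). // key must not exist yet
		Then(clientv3.OpPut(key, "owned", clientv3.WithLease(lease.ID))).
		Commit()
	if err != nil {
		return 0, 0, err
	}
	if !resp.Succeeded {
		return 0, 0, fmt.Errorf("serverID %d already occupied", id)
	}
	return id, lease.ID, nil
}
```

In practice the lease must also be kept alive (e.g. via `cli.KeepAlive`), and the instance must drain its connections if keep-alive fails for more than half of the TTL, as described above.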
`local connID` is allocated by each TiDB instance on establishing a connection:

- For the 32-bit `connID`, `local connID` can overflow and/or be used up, especially on a busy system and/or with long-running SQL. So we use a lock-free queue to maintain the available `local connID`s: dequeue on client connect, and enqueue on disconnect. When `local connID` is exhausted, upgrade to 64 bits.
- For the 64-bit `connID`, allocate `local connID` by auto-increment. Flip to zero on integer overflow, and check via `Server.clients` whether the `local connID` already exists, for correctness at trivial cost, as a conflict is very unlikely (it would take more than 3 years to use up 2^40 `local connID`s on an instance running at 10,000 TPS). Finally, return a "Too many connections" error if exhausted.
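A condensed Go sketch of the two allocation strategies. A buffered channel stands in for the lock-free queue, and a `sync.Map` stands in for `Server.clients`, so all names here are illustrative:

```go
import (
	"sync"
	"sync/atomic"
)

// pool32 hands out 32-bit-era local connIDs from a queue of free IDs.
type pool32 struct {
	free chan uint64
}

func newPool32() *pool32 {
	p := &pool32{free: make(chan uint64, 1<<20)}
	for id := uint64(0); id < 1<<20; id++ {
		p.free <- id // pre-fill the 2^20 ID space
	}
	return p
}

// alloc dequeues a free local connID on connect; ok == false means the
// pool is exhausted and the instance should upgrade to 64-bit connID.
func (p *pool32) alloc() (id uint64, ok bool) {
	select {
	case id = <-p.free:
		return id, true
	default:
		return 0, false
	}
}

// release enqueues the ID back on disconnect.
func (p *pool32) release(id uint64) { p.free <- id }

// pool64 allocates by auto-increment within the 40-bit space.
type pool64 struct {
	next  atomic.Uint64
	inUse sync.Map // local connID -> connection, standing in for Server.clients
}

func (p *pool64) alloc() (uint64, bool) {
	const mask = 1<<40 - 1
	for i := 0; i < 8; i++ { // bounded retries; caller reports "Too many connections"
		id := p.next.Add(1) & mask // flips to zero on overflow of the 40-bit space
		if _, loaded := p.inUse.LoadOrStore(id, struct{}{}); !loaded {
			return id, true
		}
	}
	return 0, false
}
```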
On processing a `KILL x` command, first extract the `serverID` from `x`. If the `serverID` points to a remote TiDB instance, get its address from `CLUSTER_INFO` and redirect the command to it through the "Coprocessor API" provided by the remote TiDB, along with the original user authentication.
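The routing decision might look like this sketch, with hypothetical plumbing: in TiDB, `killLocal` would go through the session manager, `lookupAddr` would consult `CLUSTER_INFO`, and `forwardKill` would call the remote instance's Coprocessor API with the original user's authentication:

```go
// killRouter bundles the assumed dependencies for routing a KILL.
type killRouter struct {
	myServerID  uint64
	killLocal   func(connID uint64) error
	lookupAddr  func(serverID uint64) (string, error)
	forwardKill func(addr string, connID uint64) error
}

func (r *killRouter) handleKill(connID uint64) error {
	serverID, _, _ := Decode(connID) // see the encode/decode sketch above
	if serverID == r.myServerID {
		return r.killLocal(connID) // the connection lives on this instance
	}
	addr, err := r.lookupAddr(serverID) // resolve the owning instance
	if err != nil {
		return err
	}
	return r.forwardKill(addr, connID) // redirect KILL to the remote TiDB
}
```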
| | 32 bits | 64 bits |
|---|---|---|
| ServerID pool size | 2^11 | 2^22 - 2^11 |
| ServerID allocation | Random unused serverID acquired from PD within the pool. Retry if unavailable. Upgrade to 64 bits after failing more than 3 times | Random serverID within the pool. Retry if unavailable |
| Local connID pool size | 2^20 | 2^40 |
| Local connID allocation | A queue maintains and allocates the available local connIDs. Upgrade to 64 bits when exhausted | Auto-increment within the pool. Flip to zero on overflow. Return "Too many connections" if exhausted |
- The 32-bit `connID` is compatible with well-known clients.
- The 64-bit `connID` is incompatible with legacy 32-bit clients. (According to some quick tests so far, MySQL client v8.0.19 supports `KILL` with a 64-bit `connID`, while `CTRL-C` does not, as it truncates the `connID` to 32 bits.) A warning is set prompting that truncation happened, but the user cannot see it, because `CTRL-C` is sent over a new connection that exits in an instant.
- The `KILL TIDB` syntax and the `compatible-kill-query` configuration item are deprecated.
Set `small_cluster_size_threshold` and `local_connid_pool_size` to small numbers (e.g. 4) by variable hacking, to easily switch between 32-bit and 64-bit `connID`.
- A TiDB without PD, killed by Ctrl+C, and killed by KILL.
- One TiDB with PD, killed by Ctrl+C, and killed by KILL.
- Multiple TiDB nodes, killed {local,remote} by {Ctrl+C,KILL}.
- Upgrade caused by the cluster scaling up from small to big.
- Upgrade caused by `local connID` being used up.
- Multiple TiDB nodes with 64-bit `connID`, killed {local,remote} by {Ctrl+C,KILL}.
- Downgrade caused by the cluster scaling down from big to small.
- Existing connections are killed after the connection to PD has been lost for a long time.
- No new connections are accepted after the connection to PD has been lost for a long time.
- New connections are accepted after the connection to PD has been lost for a long time and then recovered.
- Connections can be killed after the connection to PD has been lost for a long time and then recovered.