This document introduces the design of the global connection ID, and the global `KILL <connID>` based on it.

Currently, connection IDs are local to TiDB instances: a `KILL x` must be directed to the correct instance and cannot safely be load balanced across the cluster, as discussed here.
To support "Global Kill", we need:

- Global connection IDs that are unique among all TiDB instances.
- Redirection of `KILL x` to the target TiDB instance on which connection `x` is running.
- Support for both 32-bit and 64-bit `connID`s. The 32-bit `connID` is used on small clusters (fewer than 2048 TiDB instances) to be fully compatible with all clients, including legacy 32-bit ones, while the 64-bit `connID` is used for big clusters. Bit 0 of `connID` is a markup bit that distinguishes these two kinds.
```
 31        21 20               1   0
+------------+------------------+------+
|  serverID  |   local connID   |markup|
|   (11b)    |      (20b)       |  =0  |
+------------+------------------+------+

 63 62                 41 40                                     1   0
+--+---------------------+----------------------------------------+------+
|  |      serverID       |              local connID              |markup|
|=0|       (22b)         |                  (40b)                 |  =1  |
+--+---------------------+----------------------------------------+------+
```
The key factor is the `serverID` (see the `serverID` section for details), which depends on the number of TiDB instances in the cluster:

- Choose 32 bits when the number of TiDB instances is less than 2048. Otherwise choose 64 bits.
- When 32 bits is chosen, upgrade to 64 bits when: 1) acquiring a `serverID` fails (because all are occupied) more than 3 times in a row, which happens when the cluster is growing rapidly; or 2) all `local connID`s in the 32-bit `connID` space are in use (see the `local connID` section for details).
- When 64 bits is chosen, downgrade to 32 bits gradually as the cluster scales down from big to small, since TiDB instances keep their `serverID` until the next restart or loss of connection to PD. (The retry-and-upgrade policy is sketched below.)
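As a rough illustration, the retry-and-upgrade policy might look like the following Go sketch, where `tryAcquire32`/`tryAcquire64` are hypothetical stand-ins for the PD-backed acquisition described in the `serverID` section:

```go
// acquireServerID sketches the width-selection and upgrade policy.
// tryAcquire32/tryAcquire64 are assumed helpers, not TiDB's actual API.
func acquireServerID(numInstances int, tryAcquire32, tryAcquire64 func() (uint64, bool)) (serverID uint64, use64 bool) {
	const smallClusterMax = 2048 // threshold from the design above
	if numInstances < smallClusterMax {
		// Small cluster: prefer 32-bit connID for full client compatibility.
		for i := 0; i < 3; i++ {
			if id, ok := tryAcquire32(); ok {
				return id, false
			}
		}
		// Repeated consecutive failures: the cluster is likely growing
		// rapidly, so fall through and upgrade to 64 bits.
	}
	for {
		if id, ok := tryAcquire64(); ok {
			return id, true
		}
		// All 64-bit serverIDs occupied: keep retrying; meanwhile the
		// instance blocks new connections (see the serverID section).
	}
}
```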
Bit 63 is always ZERO, keeping `connID` within the non-negative int64 range, to be friendlier to existing code and to languages that lack a primitive `uint64` type.

- `markup == 0` indicates that the `connID` is effectively just 32 bits long, and the high 32 bits should be all zeros. Compatible with legacy 32-bit clients.
- `markup == 1` indicates that the `connID` is 64 bits long. Incompatible with legacy 32-bit clients.
- `markup == 1` while the high 32 bits are all zeros indicates that a 32-bit truncation happened. See the `Compatibility` section.
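As a concrete illustration of the two layouts and the markup bit, here is a minimal Go sketch of encoding, decoding, and truncation detection. The function names are illustrative, not TiDB's actual API:

```go
// Encode32 builds a 32-bit connID: |serverID(11b)|local connID(20b)|markup=0|.
func Encode32(serverID, localConnID uint64) uint64 {
	return serverID<<21 | localConnID<<1 // markup bit 0 stays 0
}

// Encode64 builds a 64-bit connID: |0|serverID(22b)|local connID(40b)|markup=1|.
// Bit 63 is never set, so the result fits in a non-negative int64.
func Encode64(serverID, localConnID uint64) uint64 {
	return serverID<<41 | localConnID<<1 | 1
}

// Decode splits a connID back into serverID and local connID,
// using the markup bit to tell the two layouts apart.
func Decode(connID uint64) (serverID, localConnID uint64, is64 bool) {
	if connID&1 == 1 { // markup == 1: 64-bit layout
		return (connID >> 41) & (1<<22 - 1), (connID >> 1) & (1<<40 - 1), true
	}
	// markup == 0: 32-bit layout; the high 32 bits should be all zeros.
	return (connID >> 21) & (1<<11 - 1), (connID >> 1) & (1<<20 - 1), false
}

// Truncated reports whether a connID with markup == 1 has lost its
// high 32 bits, e.g. truncated by a legacy 32-bit client.
func Truncated(connID uint64) bool {
	return connID&1 == 1 && connID>>32 == 0
}
```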
- `serverID` is selected RANDOMLY from the `serverIDs pool` (see next) by each TiDB instance on startup, and its uniqueness is guaranteed by PD (etcd). `serverID` should be greater than or equal to 1, to ensure that the high 32 bits of a 64-bit `connID` are always non-zero, making truncation detectable.
- The `serverIDs pool` is:
  - All UNUSED `serverID`s within [1, 2047], acquired from `CLUSTER_INFO`, when the 32-bit `connID` is chosen.
  - All `serverID`s within [2048, 2^22-1] when the 64-bit `connID` is chosen.
- On failure (e.g. failing to connect to PD, or all `serverID`s being unavailable when the 64-bit `connID` is chosen), we block any new connection.
- `serverID` is kept by PD with a lease (defaulting to 12 hours, long enough to avoid brutally killing any long-running SQL). If TiDB is disconnected from PD for longer than half of the lease (6 hours by default), all connections are killed and no new connection is accepted, to avoid running with a stale/incorrect `serverID`. When the connection to PD is restored, a new `serverID` is acquired before accepting new connections.
- On a single TiDB instance without PD, a `serverID` of `1` is assigned.
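A minimal sketch of how such a lease-guarded, collision-free acquisition could be done against etcd via the official `clientv3` API; the key prefix, value, and helper name are assumptions, not TiDB's actual ones:

```go
import (
	"context"
	"fmt"
	"math/rand"

	clientv3 "go.etcd.io/etcd/client/v3"
)

// tryAcquireServerID attempts to claim one random serverID in [lo, hi]
// by creating a key under an etcd lease. The transaction succeeds only
// if the key does not already exist, which is what guarantees
// cluster-wide uniqueness.
func tryAcquireServerID(ctx context.Context, cli *clientv3.Client, lo, hi uint64) (uint64, clientv3.LeaseID, error) {
	const leaseTTL = 12 * 60 * 60 // 12-hour lease, per the design above

	lease, err := cli.Grant(ctx, leaseTTL)
	if err != nil {
		return 0, 0, err
	}
	id := lo + rand.Uint64()%(hi-lo+1) // RANDOM pick within the pool

	key := fmt.Sprintf("/tidb/server_id/%d", id) // assumed key layout
	resp, err := cli.Txn(ctx).
		If(clientv3.Compare(clientv3.CreateRevision(key), "=", 0)). // key must not exist yet
		Then(clientv3.OpPut(key, "owned", clientv3.WithLease(lease.ID))).
		Commit()
	if err != nil {
		return 0, 0, err
	}
	if !resp.Succeeded {
		return 0, 0, fmt.Errorf("serverID %d already occupied", id)
	}
	return id, lease.ID, nil
}
```

In practice the lease must also be kept alive (e.g. via `cli.KeepAlive`), and the instance must drain its connections if keep-alive fails for more than half of the TTL, as described above.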
`local connID` is allocated by each TiDB instance on establishing a connection:

- For the 32-bit `connID`, `local connID` can overflow and/or be used up, especially on a busy system and/or with long-running SQL. So we use a lock-free queue to maintain the available `local connID`s: dequeue on client connect, and enqueue on disconnect. When `local connID` is exhausted, upgrade to 64 bits.
- For the 64-bit `connID`, allocate `local connID` by auto-increment. Flip to zero on integer overflow, and check via `Server.clients` whether the `local connID` already exists, for correctness at trivial cost, as a conflict is very unlikely (it would take more than 3 years to use up 2^40 `local connID`s on an instance running at 10,000 TPS). Finally, return a "Too many connections" error if exhausted.
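A condensed Go sketch of the two allocation strategies. A buffered channel stands in for the lock-free queue, and a `sync.Map` stands in for `Server.clients`, so all names here are illustrative:

```go
import (
	"sync"
	"sync/atomic"
)

// pool32 hands out 32-bit-era local connIDs from a queue of free IDs.
type pool32 struct {
	free chan uint64
}

func newPool32() *pool32 {
	p := &pool32{free: make(chan uint64, 1<<20)}
	for id := uint64(0); id < 1<<20; id++ {
		p.free <- id // pre-fill the 2^20 ID space
	}
	return p
}

// alloc dequeues a free local connID on connect; ok == false means the
// pool is exhausted and the instance should upgrade to 64-bit connID.
func (p *pool32) alloc() (id uint64, ok bool) {
	select {
	case id = <-p.free:
		return id, true
	default:
		return 0, false
	}
}

// release enqueues the ID back on disconnect.
func (p *pool32) release(id uint64) { p.free <- id }

// pool64 allocates by auto-increment within the 40-bit space.
type pool64 struct {
	next  atomic.Uint64
	inUse sync.Map // local connID -> connection, standing in for Server.clients
}

func (p *pool64) alloc() (uint64, bool) {
	const mask = 1<<40 - 1
	for i := 0; i < 8; i++ { // bounded retries; caller reports "Too many connections"
		id := p.next.Add(1) & mask // flips to zero on overflow of the 40-bit space
		if _, loaded := p.inUse.LoadOrStore(id, struct{}{}); !loaded {
			return id, true
		}
	}
	return 0, false
}
```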
On processing a `KILL x` command, first extract the `serverID` from `x`. If the `serverID` points to a remote TiDB instance, get its address from `CLUSTER_INFO` and redirect the command to it through the "Coprocessor API" provided by the remote TiDB, along with the original user authentication.
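The routing decision might look like this sketch, with hypothetical plumbing: in TiDB, `killLocal` would go through the session manager, `lookupAddr` would consult `CLUSTER_INFO`, and `forwardKill` would call the remote instance's Coprocessor API with the original user's authentication:

```go
// killRouter bundles the assumed dependencies for routing a KILL.
type killRouter struct {
	myServerID  uint64
	killLocal   func(connID uint64) error
	lookupAddr  func(serverID uint64) (string, error)
	forwardKill func(addr string, connID uint64) error
}

func (r *killRouter) handleKill(connID uint64) error {
	serverID, _, _ := Decode(connID) // see the encode/decode sketch above
	if serverID == r.myServerID {
		return r.killLocal(connID) // the connection lives on this instance
	}
	addr, err := r.lookupAddr(serverID) // resolve the owning instance
	if err != nil {
		return err
	}
	return r.forwardKill(addr, connID) // redirect KILL to the remote TiDB
}
```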
| | 32 bits | 64 bits |
|---|---|---|
| ServerID pool size | 2^11 | 2^22 - 2^11 |
| ServerID allocation | Random unused serverID acquired from PD within the pool. Retry if unavailable. Upgrade to 64 bits after failing more than 3 times | Random serverID within the pool. Retry if unavailable |
| Local connID pool size | 2^20 | 2^40 |
| Local connID allocation | A queue maintains and allocates the available local connIDs. Upgrade to 64 bits when exhausted | Auto-increment within the pool. Flip to zero on overflow. Return "Too many connections" if exhausted |
- The 32-bit `connID` is compatible with well-known clients.
- The 64-bit `connID` is incompatible with legacy 32-bit clients. (According to some quick tests so far, MySQL client v8.0.19 supports `KILL` with a 64-bit `connID`, while `CTRL-C` does not, as it truncates the `connID` to 32 bits.) A warning is set prompting that truncation happened, but the user cannot see it, because `CTRL-C` is sent over a new connection that exits in an instant.
- The `KILL TIDB` syntax and the `compatible-kill-query` configuration item are deprecated.
Set `small_cluster_size_threshold` and `local_connid_pool_size` to small numbers (e.g. 4) by variable hacking, to easily switch between 32-bit and 64-bit `connID`.
- A TiDB without PD, killed by Ctrl+C, and killed by KILL.
- One TiDB with PD, killed by Ctrl+C, and killed by KILL.
- Multiple TiDB nodes, killed {local,remote} by {Ctrl+C,KILL}.
- Upgrade caused by the cluster scaling up from small to big.
- Upgrade caused by `local connID` being used up.
- Multiple TiDB nodes with 64-bit `connID`, killed {local,remote} by {Ctrl+C,KILL}.
- Downgrade caused by the cluster scaling down from big to small.
- Existing connections are killed after the connection to PD has been lost for a long time.
- No new connections are accepted after the connection to PD has been lost for a long time.
- New connections are accepted after the connection to PD has been lost for a long time and then recovered.
- Connections can be killed after the connection to PD has been lost for a long time and then recovered.