Bug Report: Throttler doesn't ignore connection errors #18022

Open
mhamza15 opened this issue Mar 24, 2025 · 0 comments · May be fixed by #18073

Overview of the Issue

When a tablet is shut down, its state in the topology is cleared out and ends up looking like this:

{
  "alias": {
    "cell": "...",
    "uid": 171496207
  },
  "hostname": "",
  "port_map": {},
  "keyspace": "...",
  "shard": "0",
  "key_range": null,
  "type": "REPLICA",
  "db_name_override": "...",
  "tags": {},
  "mysql_hostname": "",
  "mysql_port": 3306,
  "primary_term_start_time": null,
  "default_conn_collation": 45
}
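A record in this state can be recognized by its empty `hostname` and `port_map`. Below is a minimal Python sketch of such a check. It is not Vitess code: the function name `looks_shut_down` is hypothetical, and the sample record is abridged from the one above, with elided values left as "...".

```python
import json

# Abridged copy of the record shown above; elided values are kept as "...".
RECORD = json.loads("""
{
  "alias": {"cell": "...", "uid": 171496207},
  "hostname": "",
  "port_map": {},
  "type": "REPLICA",
  "mysql_hostname": "",
  "mysql_port": 3306
}
""")


def looks_shut_down(tablet: dict) -> bool:
    """Return True when the record has neither a hostname nor any ports.

    Hypothetical helper: it only mirrors the symptom described in this
    issue and is not how Vitess itself classifies tablets.
    """
    return not tablet.get("hostname") and not tablet.get("port_map")


print(looks_shut_down(RECORD))  # True
```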

As the hostname and port_map are empty, the throttler will try to connect to this tablet, fail, and report the shard as unhealthy:

"mysql/shard": {
  "LastHealthyAt": "2025-03-20T12:24:14.045597766-07:00",
  "SecondsSinceLastHealthy": 5083
}

...which, in the case of VReplication, would stop VReplication entirely. Ideally, the throttler should ignore connection errors. See this Slack thread for more context.
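To illustrate what "ignore connection errors" could mean for shard-level health, here is a minimal Python sketch. It is neither the Vitess throttler nor the change proposed in #18073. `ProbeResult`, `shard_is_healthy`, the threshold value, and the rule "a probe that could not connect is skipped rather than counted as a failure" are all assumptions made for illustration.

```python
from dataclasses import dataclass
from typing import Iterable, Optional


@dataclass
class ProbeResult:
    """Outcome of probing one tablet (hypothetical type)."""
    alias: str
    value: Optional[float] = None   # metric value if the probe succeeded
    connection_error: bool = False  # True if the tablet was unreachable


def shard_is_healthy(results: Iterable[ProbeResult], threshold: float) -> bool:
    """Judge shard health from reachable tablets only.

    Assumption: unreachable tablets are skipped instead of failing the
    whole shard. A probe that connected but returned no metric, or a
    metric above the threshold, still marks the shard unhealthy.
    """
    for r in results:
        if r.connection_error:
            continue  # ignore tablets we could not connect to
        if r.value is None or r.value > threshold:
            return False
    return True


results = [
    ProbeResult("zone1-0000000200", connection_error=True),  # shut-down tablet
    ProbeResult("zone1-0000000201", value=0.2),
]
print(shard_is_healthy(results, threshold=5.0))  # True
```

Note that under this rule a shard whose tablets are all unreachable is also reported as healthy. Whether that is desirable is a separate design decision that the sketch does not settle.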

Reproduction Steps

make build

cd examples/local

./101_initial_cluster.sh; ./201_customer_tablets.sh; ./202_move_tables.sh

vtctldclient UpdateThrottlerConfig --enable --throttle-app="all" --throttle-app-ratio 0 --throttle-app-duration 4h customer

primaryuid=$(vtctldclient GetTablets --keyspace customer --tablet-type primary --shard "0" | awk '{print $1}' | cut -d- -f2 | bc)
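The last command extracts the numeric uid from the primary's alias, which is the first column of the `GetTablets` output (for example `zone1-0000000201` in the listing further down in this report). The `awk | cut | bc` pipeline takes the part after the dash and strips the leading zeros. A rough Python equivalent is sketched below. The helper names are hypothetical, and the 10-digit zero padding is inferred from the aliases shown in this issue.

```python
from typing import Tuple


def split_alias(alias: str) -> Tuple[str, int]:
    """Split 'cell-uid' into (cell, uid), dropping leading zeros from uid."""
    cell, _, uid = alias.rpartition("-")
    return cell, int(uid)


def format_alias(cell: str, uid: int) -> str:
    """Rebuild an alias, zero-padding the uid to 10 digits.

    The padding width is an assumption based on the aliases printed
    in this report.
    """
    return f"{cell}-{uid:010d}"


cell, uid = split_alias("zone1-0000000201")
print(uid)                      # 201
print(format_alias(cell, uid))  # zone1-0000000201
```

Padding the uid this way avoids relying on the hard-coded `zone1-0000000${primaryuid}` prefix, which only yields a valid alias when the uid has exactly three digits.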

Observe that the shard is healthy:

$ vtctldclient GetThrottlerStatus zone1-0000000${primaryuid} | jq .status.metrics_health

{
  "last_healthy_at": {
    "seconds": "1742838360",
    "nanoseconds": 529167126
  },
  "seconds_since_last_healthy": "0"
}

Then kill one of the tablets and list them:

$ vtctldclient --server localhost:15999 GetTablets
zone1-0000000100 commerce 0 replica localhost:15100 localhost:17100 [] <null>
zone1-0000000101 commerce 0 primary localhost:15101 localhost:17101 [] 2025-03-24T17:02:42Z
zone1-0000000102 commerce 0 rdonly localhost:15102 localhost:17102 [] <null>
zone1-0000000200 customer 0 replica :0 :17200 [] <null>
zone1-0000000201 customer 0 primary localhost:15201 localhost:17201 [] 2025-03-24T17:05:27Z
zone1-0000000202 customer 0 rdonly localhost:15202 localhost:17202 [] <null>
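Rows with a cleared address can be picked out mechanically. The Python sketch below parses the listing above and returns the aliases whose address column has no hostname. The function name is hypothetical, and the assumption that the fifth whitespace-separated column is the tablet's `host:port` address is inferred from this output.

```python
LISTING = """\
zone1-0000000100 commerce 0 replica localhost:15100 localhost:17100 [] <null>
zone1-0000000101 commerce 0 primary localhost:15101 localhost:17101 [] 2025-03-24T17:02:42Z
zone1-0000000102 commerce 0 rdonly localhost:15102 localhost:17102 [] <null>
zone1-0000000200 customer 0 replica :0 :17200 [] <null>
zone1-0000000201 customer 0 primary localhost:15201 localhost:17201 [] 2025-03-24T17:05:27Z
zone1-0000000202 customer 0 rdonly localhost:15202 localhost:17202 [] <null>
"""


def tablets_without_hostname(listing: str) -> list:
    """Return aliases whose address column (assumed to be column 5) lacks a host."""
    aliases = []
    for line in listing.splitlines():
        fields = line.split()
        if len(fields) < 5:
            continue  # skip blank or malformed lines
        host, _, _port = fields[4].rpartition(":")
        if not host:
            aliases.append(fields[0])
    return aliases


print(tablets_without_hostname(LISTING))  # ['zone1-0000000200']
```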

Note tablet 200 with the empty hostname and port. Now check the throttler again and observe that it is unhealthy:

$ vtctldclient --server localhost:15999 GetThrottlerStatus zone1-0000000${primaryuid} | jq .status.metrics_health.shard
{
  "last_healthy_at": {
    "seconds": "1742838425",
    "nanoseconds": 778510514
  },
  "seconds_since_last_healthy": "162"
}
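For scripting around this output, note that the 64-bit fields (`seconds`, `seconds_since_last_healthy`) appear as JSON strings while `nanoseconds` is a number. The Python sketch below converts the object above into a UTC timestamp and an integer age. The function names and the staleness threshold are hypothetical and not part of any Vitess tooling.

```python
from datetime import datetime, timedelta, timezone

# The shard health object printed above.
HEALTH = {
    "last_healthy_at": {"seconds": "1742838425", "nanoseconds": 778510514},
    "seconds_since_last_healthy": "162",
}


def parse_health(health: dict):
    """Return (last_healthy_at as UTC datetime, seconds since last healthy)."""
    ts = health["last_healthy_at"]
    last = datetime.fromtimestamp(int(ts["seconds"]), tz=timezone.utc)
    # datetime only has microsecond resolution, so truncate the nanoseconds.
    last += timedelta(microseconds=int(ts["nanoseconds"]) // 1000)
    return last, int(health["seconds_since_last_healthy"])


def is_stale(health: dict, max_seconds: int) -> bool:
    """True when the shard has been unhealthy longer than max_seconds (assumed limit)."""
    _, since = parse_health(health)
    return since > max_seconds


last, since = parse_health(HEALTH)
print(last.isoformat(), since)
print(is_stale(HEALTH, max_seconds=60))  # True
```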

Binary Version

main

Operating System and Environment details

All

Log Fragments

mhamza15 added the Needs Triage and Type: Bug labels on Mar 24, 2025
frouioui added the Component: Throttler label and removed the Needs Triage label on Mar 24, 2025