
fix: Improve cluster connection pool logic when disconnecting #5

Merged

Conversation

@martinslota (Contributor) commented Jun 10, 2024

This PR is a port of redis/ioredis#1864, and what follows below is a verbatim copy of that PR's description. A reproduction of the bug described here that uses valkey-server and iovalkey instead of redis-server and ioredis, respectively, can be found in this repository.

Motivation and Background

This is an attempt to fix errors occurring when a connect() call is made shortly after a disconnect(), which is something that the Bull library does when pausing a queue.

Here's a relatively minimal way to reproduce an error:

import IORedis from "ioredis";

const cluster = new IORedis.Cluster([{ host: "localhost", port: 6380 }]);

await cluster.set("foo", "bar");

const endPromise = new Promise((resolve) => cluster.once("end", resolve));
await cluster.quit();
cluster.disconnect();
await endPromise;

cluster.connect();
console.log(await cluster.get("foo"));
cluster.disconnect();

Running that script in a loop using

#!/bin/bash

set -euo pipefail

while true
do
    DEBUG=ioredis:cluster node cluster-error.mjs
done

against the main branch of ioredis quickly results in this output:

/Code/ioredis/built/cluster/index.js:124
                    reject(new redis_errors_1.RedisError("Connection is aborted"));
                           ^

RedisError: Connection is aborted
    at /Code/ioredis/built/cluster/index.js:124:28

Node.js v20.11.0

My debugging led me to believe that the existing node cleanup logic in the ConnectionPool class leads to race conditions: upon disconnect(), the this.connectionPool.reset() call removes nodes from the pool without cleaning up their event listeners, which may then issue more than one drain event. Depending on timing, one of those extra drain events may fire after connect() and change the status to close, interfering with the connection attempt and leading to the error above.

Changes

  • Keep track of node listeners in the ConnectionPool class and remove them from the nodes whenever they are removed from the pool (a rough sketch of this idea follows the list below).
  • Emit the -node / drain events regardless of whether nodes disconnected on their own or were removed through a reset() call.
  • Within reset(), add nodes before removing old ones to avoid unwanted drain events.
  • Fix one of the listeners by using an arrow function to make this point to the connection pool instance.
  • Try to fix the script for running cluster tests and attempt to enable them on CI. If this doesn't work out or isn't useful, I'm happy to revert the changes.
  • Add a test around this issue. The error thrown in the test on main is seemingly different from the error shown above but it still seems related to the disconnection logic and still gets fixed by the changes in this PR.
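
To make the first two bullets concrete, here is a minimal, hypothetical sketch of the listener-tracking pattern. The class and method names are invented for illustration and this is not the actual ConnectionPool code; it only shows the shape of the idea: remember which listener was attached to each node so it can be detached the moment the node leaves the pool, and emit -node / drain from a single place.

import { EventEmitter } from "node:events";

// Illustrative sketch only: names are made up and the real ConnectionPool differs.
class PoolSketch extends EventEmitter {
  constructor() {
    super();
    this.nodes = new Map(); // node key -> node client (EventEmitter-like)
    this.endListeners = new Map(); // node key -> the "end" listener we attached
  }

  addNode(key, node) {
    const onEnd = () => this.removeNode(key);
    this.nodes.set(key, node);
    this.endListeners.set(key, onEnd);
    node.once("end", onEnd);
  }

  removeNode(key) {
    const node = this.nodes.get(key);
    if (!node) return;
    // Detach the tracked listener so a stale client can no longer trigger
    // extra drain events after it has left the pool.
    node.removeListener("end", this.endListeners.get(key));
    this.endListeners.delete(key);
    this.nodes.delete(key);
    this.emit("-node", node);
    if (this.nodes.size === 0) this.emit("drain");
  }

  reset(newNodes) {
    // Add replacement nodes first, then drop stale ones, so the pool never looks
    // empty halfway through and drain only fires when it is genuinely empty.
    for (const [key, node] of newNodes) {
      if (!this.nodes.has(key)) this.addNode(key, node);
    }
    for (const key of [...this.nodes.keys()]) {
      if (!newNodes.has(key)) this.removeNode(key);
    }
  }
}

With this shape, a reset() to an empty map detaches every listener and emits exactly one drain, instead of leaving orphaned end handlers behind that could fire after a later connect().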

@martinslota force-pushed the clean-up-node-listeners-upon-disconnect branch from 3d6cc5a to 158c64c on June 10, 2024 12:21

@mcollina (Collaborator) left a comment

lgtm with a green CI

@martinslota force-pushed the clean-up-node-listeners-upon-disconnect branch from cb3509e to d3f83c5 on June 10, 2024 16:57
@martinslota (Contributor, Author) commented Jun 10, 2024

Now that I saw the results from CI, I also figured out how to run the unit tests locally. 😅

The tests caught a bug that I then fixed in 788361c and 0deaeba. 😬

Furthermore, some of the existing functional tests assumed that responding with 'OK' from the mock server would work fine as a response to CLUSTER SLOTS, i.e. that the Cluster client would reach a ready state despite the response being invalid. With the changes made in this PR, those tests were failing because part of connectionPool.reset([]) is now the drain event, which leads to the close state instead, after which the client attempts to reconnect over and over again, never reaching the ready state.

In my view, this change in behaviour is desirable since it prevents node clients that are no longer tracked in the connection pool from emitting additional events that could mess with the state of the cluster client (which could easily have moved on, e.g. to attempting a reconnect). So for the time being, I adjusted the tests by adding valid slot tables to the mock server in d5d85e9.
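
For context, a valid CLUSTER SLOTS reply is an array of slot ranges, each listing the nodes serving that range. The handler below is only an illustration in the rough style of a mock server (the exact helper API in the test suite is an assumption, and the host/port/node id values are placeholders):

// Hypothetical mock-server handler: answer CLUSTER SLOTS with a minimal but valid
// slot table instead of a blanket 'OK', so the Cluster client can become ready.
function handleCommand(argv) {
  if (String(argv[0]).toLowerCase() === "cluster" && String(argv[1]).toLowerCase() === "slots") {
    // One range covering all 16384 slots: [startSlot, endSlot, [host, port, nodeId]]
    return [[0, 16383, ["127.0.0.1", 30001, "node-1"]]];
  }
  return "OK";
}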

Finally, one of the tests assumed that when a node disappears from the cluster, the node removal (the -node event) would only be emitted after the refreshSlotsCache() method had finished executing. Once again, this assumption no longer holds with the changes in this PR, because the event is emitted already during the execution of refreshSlotsCache(), and again as part of connectionPool.reset([]). For similar reasons as above, I adjusted the test in d3f83c5 so that it starts listening for the node removal event before it calls refreshSlotsCache().
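
Illustratively (this is not the literal test code, and it reuses the client setup from the reproduction above), the adjusted pattern subscribes before the call that can trigger the event:

import IORedis from "ioredis";

const cluster = new IORedis.Cluster([{ host: "localhost", port: 6380 }]);

// Subscribe first: the -node event can now fire while refreshSlotsCache() is still
// running, so waiting until afterwards can miss it. The promise only resolves once
// a node actually disappears from the cluster topology.
const nodeRemoved = new Promise((resolve) => cluster.once("-node", resolve));
cluster.refreshSlotsCache();
await nodeRemoved;

cluster.disconnect();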

Does this approach sound kinda reasonable?

@martinslota requested a review from mcollina June 10, 2024 19:07
@mcollina (Collaborator) left a comment

Yes.

Still LGTM

@martinslota (Contributor, Author) commented:

As far as I can tell, this should be ready to get merged. I cannot merge it myself.

@mcollina merged commit 2733aee into valkey-io:main Jun 13, 2024
8 checks passed
@mcollina (Collaborator) commented:

Will ship a release asap.

@martinslota deleted the clean-up-node-listeners-upon-disconnect branch June 13, 2024 15:51