Redis Cluster Node Enabled - Failed to read from master when replica is being replaced #3231

anushar04 · 2025-03-26T00:07:11Z

Bug Report

We use the Lettuce client to connect to AWS ElastiCache (Redis) with cluster mode enabled.
We have 5 shards with 3 nodes each (one of the 3 is the master). A replica node in shard 1 had degraded performance, so AWS triggered a replacement for it, which took 7 minutes. During this window we were not able to read from the primary, even though the master node was not impacted.

Current Behavior

Reads and writes against the master node fail while one of the replicas in the shard is being replaced.

We received 2 different types of errors during this window

Command timed out after [x] seconds
CLUSTERDOWN Hash slot not served

// your stack trace here;

Java Application

Input Code

// your code here;

Expected no disruption to reads/writes against the master node.

Environment

Lettuce version(s): 5.1.5.RELEASE
Redis version: 5.0.9 engine version

Possible Solution

Additional context

07:51 AM PST - Primary redis-0001-003 became unhealthy; we had some issues reading from it (this is expected from Lettuce)
07:55 AM PST - Master node redis-0001-003 continued to provide a degraded experience
07:56 AM PST - AWS performed a failover of the master node; redis-0001-002 became the new master (no impact during this time)
07:56 AM PST to 08:31 AM PST - redis-0001-003 was not available in the shard; however, the other 2 nodes in the shard were active
08:31 AM PST - AWS triggered a replacement for redis-0001-003 (replica) since it was still in a degraded state. During this window, the application was not able to read from or write to the master node
08:38 AM PST - Complete application recovery; redis-0001-002 continued to be primary, and we were able to read/write from the client

Also, during this failure window (8:31 to 8:38) we see log entries from ConnectionWatchdog trying to reconnect to redis-0001-003.

We need to understand why reads from the master node failed while the replica was being replaced.

tishun · 2025-03-27T15:10:37Z

Hey @anushar04,

The way I read this is that the driver was using the redis-0001-003 node even after it was replaced with redis-0001-002?
How is the driver configured? Do you have some topology update mechanism configured?

During such a failover the driver has no way to know that the - otherwise healthy - node was experiencing issues. Depending on how topology is updated and how the driver is configured it might continue trying to connect to the same node.

There are a lot of details missing, so I can't help much.
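
For readers wondering what a "topology update mechanism" looks like in Lettuce, here is a minimal configuration sketch. It is not the reporter's actual setup (that was never posted), and nothing in this thread establishes that it would have prevented the errors. The class name `ClusterClientFactory`, the `host`/`port` parameters, and the 30-second interval are assumptions for illustration only.

```java
import io.lettuce.core.RedisURI;
import io.lettuce.core.cluster.ClusterClientOptions;
import io.lettuce.core.cluster.ClusterTopologyRefreshOptions;
import io.lettuce.core.cluster.RedisClusterClient;

import java.time.Duration;

// Hypothetical helper class; not part of Lettuce.
public final class ClusterClientFactory {

    private ClusterClientFactory() {
    }

    // host and port are placeholders for the cluster's configuration endpoint.
    public static RedisClusterClient create(String host, int port) {
        ClusterTopologyRefreshOptions refresh = ClusterTopologyRefreshOptions.builder()
                // Re-read the cluster topology on a fixed schedule (30 s is an assumed value).
                .enablePeriodicRefresh(Duration.ofSeconds(30))
                // Also refresh when the client sees signals such as MOVED/ASK redirects
                // or repeated failed reconnect attempts.
                .enableAllAdaptiveRefreshTriggers()
                .build();

        ClusterClientOptions options = ClusterClientOptions.builder()
                .topologyRefreshOptions(refresh)
                .build();

        RedisClusterClient client = RedisClusterClient.create(RedisURI.create(host, port));
        client.setOptions(options);
        return client;
    }
}
```

Without periodic or adaptive refresh enabled, the client keeps the topology view it obtained at startup, which is one way it could continue trying to connect to a node that has since been demoted or replaced.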

github-actions · 2025-04-27T00:19:51Z

If you would like us to look at this issue, please provide the requested information. If the information is not provided within the next 2 weeks this issue will be closed.

tishun added the status: waiting-for-triage label Mar 27, 2025

tishun added the status: waiting-for-feedback label and removed the status: waiting-for-triage label Mar 27, 2025

github-actions bot added the status: feedback-reminder label Apr 27, 2025

github-actions bot closed this as not planned May 11, 2025
