
Stale Connections Suddenly Increase when there is a Spike on Application #3359

Open

akshayk-ktk opened this issue Apr 28, 2025 · 8 comments

@akshayk-ktk

We are noticing Redis commands taking more than 2 seconds when there is a traffic spike on the application.

I am printing the pool stats in the application, and I can see that many connections become stale as soon as the spike hits.
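For context, the pool stats come from go-redis' PoolStats() method; below is a minimal sketch of this kind of sampling, assuming a *redis.ClusterClient named rdb (the function name, 10-second interval, and log format are illustrative rather than the exact application code):

import (
	"log"
	"time"

	"github.com/redis/go-redis/v9"
)

// logPoolStats periodically logs connection pool statistics so that
// stale-connection growth can be correlated with traffic spikes.
func logPoolStats(rdb *redis.ClusterClient) {
	ticker := time.NewTicker(10 * time.Second)
	defer ticker.Stop()
	for range ticker.C {
		s := rdb.PoolStats()
		log.Printf("pool: hits=%d misses=%d timeouts=%d total=%d idle=%d stale=%d",
			s.Hits, s.Misses, s.Timeouts, s.TotalConns, s.IdleConns, s.StaleConns)
	}
}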

What can be done to mitigate this?

Expected Behavior

Redis commands should complete within the expected time.

Current Behavior

Redis commands take more than 2 seconds to complete when there is a spike.

Redis Client Configuration

poolSize: 5000
connMaxIdleTime: -1s
dialTimeout: 10s
poolTimeout: 10s
readTimeout: 10s
minIdleConns: 3500
maxIdleConns: 3800
connMaxLifetime: -1s
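
For reference, this configuration corresponds roughly to the following go-redis v9 ClusterOptions; the field names are the library's, the values are the ones listed above, and the endpoint address is a placeholder:

import (
	"time"

	"github.com/redis/go-redis/v9"
)

// newClusterClient mirrors the configuration listed above. A value of -1s
// disables idle-time and lifetime based connection reaping in go-redis v9.
func newClusterClient() *redis.ClusterClient {
	return redis.NewClusterClient(&redis.ClusterOptions{
		Addrs:           []string{"clustercfg.example.cache.amazonaws.com:6379"}, // placeholder endpoint
		PoolSize:        5000,
		MinIdleConns:    3500,
		MaxIdleConns:    3800,
		ConnMaxIdleTime: -1 * time.Second,
		ConnMaxLifetime: -1 * time.Second,
		DialTimeout:     10 * time.Second,
		PoolTimeout:     10 * time.Second,
		ReadTimeout:     10 * time.Second,
	})
}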

Redis Server

We are using ElastiCache Redis version 7+ in cluster mode, currently with a single shard.

The application runs on EC2 across 4 instances.

Below are the screenshots

Image
Image

@ndyakov
Member

ndyakov commented Apr 29, 2025

@akshayk-ktk based on your configuration, I think the reason for the increased number of stale connections is likely an error either during getting/initializing the connection or when putting it back in the pool. Would you be able to check if there is anything reported in the logs? If not, would you be able to check what error value is passed as reason here:

func (p *ConnPool) Remove(_ context.Context, cn *Conn, reason error) {

@akshayk-ktk
Author

@ndyakov We are using zerolog for logging and use the custom struct below to set it as the internal logger. We are not seeing any internal error logs there.

// RedisLogger adapts zerolog to the logging interface expected by redis.SetLogger.
type RedisLogger struct {
	logger *zerolog.Logger
}

// Printf forwards go-redis' internal log messages to zerolog.
func (rl *RedisLogger) Printf(_ context.Context, format string, v ...interface{}) {
	rl.logger.Info().Bool("go-redis", true).Msgf(format, v...)
}

func init() {
	redis.SetLogger(&RedisLogger{
		logger: logger.GlobalLogger,
	})
}

Let me try adding logs in the Remove method to check what the error values are.
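
A minimal sketch of what such temporary instrumentation could look like, assuming a local patch inside go-redis' internal/pool package; the log call is the only addition, the rest of Remove stays unchanged, and the context parameter is renamed from _ so it can be passed to the logger:

// Local debug patch (not upstream code): log why each connection is being
// removed before the normal removal logic runs.
func (p *ConnPool) Remove(ctx context.Context, cn *Conn, reason error) {
	internal.Logger.Printf(ctx, "pool: removing connection, reason: %v", reason)
	// ... original Remove body unchanged ...
}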

@akshayk-ktk
Author

akshayk-ktk commented Apr 30, 2025

Hey @ndyakov, after adding logs I found some errors. They look like I/O timeout errors.

Does this mean the EC2 network interface is not able to handle the load?

Our EC2 machines are AWS c6g.xlarge.

ElastiCache Redis - cache.m6g.large

Image

@ndyakov
Copy link
Member

ndyakov commented Apr 30, 2025

@akshayk-ktk I cannot comment on the ElastiCache setup or the AWS setup. You can try playing around with multiple shards to see if the database's horizontal scaling improves things. For now, this doesn't look like a client issue, but I will let you decide whether you want to try anything further or whether we should close this issue.

@akshaykhairmode

Hey @ndyakov, we are upgrading the Redis cluster node type and the EC2 instance types one at a time so we can monitor whether we still get I/O errors and an increase in stale connections.

I would prefer to keep the issue open with an under-observation tag if possible.

I will report back any observations here.

@akshayk-ktk
Author

akshayk-ktk commented May 12, 2025

Hey @ndyakov, we have upgraded the application machines to instances with higher network capacity and also doubled the shard size.

Connections are still becoming stale, and there is no error in the Remove method. The I/O error I saw earlier is also no longer present.

The behaviour is the same: when there is a spike, the stale connections increase.

Code,

Image

Logs,

Image

Before Spike
Image

After Spike
Image

What troubleshooting steps should I take next?

@ndyakov
Member

ndyakov commented May 12, 2025

@akshayk-ktk how is the latency now? I assume the connections are simply no longer needed and are removed as the spike drops. I will review the exact flow later today or tomorrow.

@akshayk-ktk
Author

Hey @ndyakov, I still see latency, with a maximum of 6.5 seconds.

The commands that took the longest are in a pipeline with one ZREM (fewer than 5 elements in the sorted set) and one ZINCRBY (around 8-9k elements in the sorted set).

Image
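
For reference, a sketch of that kind of pipeline, assuming go-redis v9 with hypothetical key and member names:

import (
	"context"
	"log"

	"github.com/redis/go-redis/v9"
)

// runPipeline mirrors the slow pattern described above: one ZREM and one
// ZINCRBY issued in a single pipeline. Key and member names are hypothetical.
func runPipeline(ctx context.Context, rdb *redis.ClusterClient) {
	_, err := rdb.Pipelined(ctx, func(pipe redis.Pipeliner) error {
		pipe.ZRem(ctx, "example:sorted:set", "member-a")
		pipe.ZIncrBy(ctx, "example:sorted:set", 1, "member-b")
		return nil
	})
	if err != nil {
		log.Printf("pipeline failed: %v", err)
	}
}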
