
Stale Connections Suddenly Increase when there is a Spike on Application #3359

Open

akshayk-ktk opened this issue Apr 28, 2025 · 8 comments

@akshayk-ktk

We are noticing Redis commands taking more than 2 seconds when there is a traffic spike on the application.

I am printing the pool stats in the application, and I can see that many connections become stale as soon as the spike hits.
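For context, the pool stats come from go-redis' PoolStats() method; below is a minimal sketch of this kind of sampling, assuming a *redis.ClusterClient named rdb (the function name, 10-second interval, and log format are illustrative rather than the exact application code):

import (
	"log"
	"time"

	"github.com/redis/go-redis/v9"
)

// logPoolStats periodically logs connection pool statistics so that
// stale-connection growth can be correlated with traffic spikes.
func logPoolStats(rdb *redis.ClusterClient) {
	ticker := time.NewTicker(10 * time.Second)
	defer ticker.Stop()
	for range ticker.C {
		s := rdb.PoolStats()
		log.Printf("pool: hits=%d misses=%d timeouts=%d total=%d idle=%d stale=%d",
			s.Hits, s.Misses, s.Timeouts, s.TotalConns, s.IdleConns, s.StaleConns)
	}
}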

What can be done to mitigate this?

Expected Behavior

Redis commands should complete within the expected time.

Current Behavior

Redis commands take more than 2 seconds to complete when there is a spike.

Redis Client Configuration

poolSize: 5000
connMaxIdleTime: -1s
dialTimeout: 10s
poolTimeout: 10s
readTimeout: 10s
minIdleConns: 3500
maxIdleConns: 3800
connMaxLifetime: -1s
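
For reference, this configuration corresponds roughly to the following go-redis v9 ClusterOptions; the field names are the library's, the values are the ones listed above, and the endpoint address is a placeholder:

import (
	"time"

	"github.com/redis/go-redis/v9"
)

// newClusterClient mirrors the configuration listed above. A value of -1s
// disables idle-time and lifetime based connection reaping in go-redis v9.
func newClusterClient() *redis.ClusterClient {
	return redis.NewClusterClient(&redis.ClusterOptions{
		Addrs:           []string{"clustercfg.example.cache.amazonaws.com:6379"}, // placeholder endpoint
		PoolSize:        5000,
		MinIdleConns:    3500,
		MaxIdleConns:    3800,
		ConnMaxIdleTime: -1 * time.Second,
		ConnMaxLifetime: -1 * time.Second,
		DialTimeout:     10 * time.Second,
		PoolTimeout:     10 * time.Second,
		ReadTimeout:     10 * time.Second,
	})
}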

Redis Server

We are using ElastiCache Redis version 7+ in cluster mode, currently with a single shard.

The application runs on EC2 across 4 instances.

Below are the screenshots

Image
Image

@ndyakov
Member

ndyakov commented Apr 29, 2025

@akshayk-ktk based on your configuration, I think the reason for the increased number of stale connections is likely an error either during getting/initializing the connection or when putting it back in the pool. Would you be able to check if there is anything reported in the logs? If not, would you be able to check what error value is passed as reason here:

func (p *ConnPool) Remove(_ context.Context, cn *Conn, reason error) {

@akshayk-ktk
Author

@ndyakov We are using zerolog for logging and use the custom struct below to set it as the internal logger. We are not seeing any internal error logs there.

// RedisLogger adapts zerolog to the logging interface expected by redis.SetLogger.
type RedisLogger struct {
	logger *zerolog.Logger
}

// Printf forwards go-redis' internal log messages to zerolog.
func (rl *RedisLogger) Printf(_ context.Context, format string, v ...interface{}) {
	rl.logger.Info().Bool("go-redis", true).Msgf(format, v...)
}

func init() {
	redis.SetLogger(&RedisLogger{
		logger: logger.GlobalLogger,
	})
}

Let me try adding logs in the Remove method to check what the error values are.
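
A minimal sketch of what such temporary instrumentation could look like, assuming a local patch inside go-redis' internal/pool package; the log call is the only addition, the rest of Remove stays unchanged, and the context parameter is renamed from _ so it can be passed to the logger:

// Local debug patch (not upstream code): log why each connection is being
// removed before the normal removal logic runs.
func (p *ConnPool) Remove(ctx context.Context, cn *Conn, reason error) {
	internal.Logger.Printf(ctx, "pool: removing connection, reason: %v", reason)
	// ... original Remove body unchanged ...
}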

@akshayk-ktk
Author

akshayk-ktk commented Apr 30, 2025

Hey @ndyakov, after adding logs I found some errors. They look like I/O timeout errors.

Does this mean the EC2 network interface is not able to handle the load?

Our EC2 machines are AWS c6g.xlarge.

ElastiCache Redis - cache.m6g.large

Image

@ndyakov
Copy link
Member

ndyakov commented Apr 30, 2025

@akshayk-ktk I cannot comment on the ElastiCache setup or the AWS setup. You can try playing around with multiple shards to see if the database's horizontal scaling improves things. For now, this doesn't look like a client issue, but I will let you decide whether you want to try anything further or whether we should close this issue.

@akshaykhairmode

Hey @ndyakov, we are upgrading the Redis cluster node type and the EC2 instance types one at a time so we can monitor whether we still get I/O errors and an increase in stale connections.

I would prefer to keep the issue open with an under-observation tag if possible.

I will report back any observations here.

@akshayk-ktk
Author

akshayk-ktk commented May 12, 2025

Hey @ndyakov, we have upgraded the application machines to instances with higher network capacity and also doubled the shard size.

Connections are still becoming stale, and there is no error in the Remove method. The I/O error I saw earlier is also no longer present.

The behaviour is the same: when there is a spike, the stale connections increase.

Code,

Image

Logs,

Image

Before Spike
Image

After Spike
Image

What troubleshooting steps should I take next?

@ndyakov
Member

ndyakov commented May 12, 2025

@akshayk-ktk how is the latency now? I assume the connections are simply no longer needed and are removed as the spike drops. I will review the exact flow later today or tomorrow.

@akshayk-ktk
Author

Hey @ndyakov, I still see latency, with a maximum of 6.5 seconds.

The commands that took the longest are in a pipeline with one ZREM (fewer than 5 elements in the sorted set) and one ZINCRBY (around 8-9k elements in the sorted set).

Image
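
For reference, a sketch of that kind of pipeline, assuming go-redis v9 with hypothetical key and member names:

import (
	"context"
	"log"

	"github.com/redis/go-redis/v9"
)

// runPipeline mirrors the slow pattern described above: one ZREM and one
// ZINCRBY issued in a single pipeline. Key and member names are hypothetical.
func runPipeline(ctx context.Context, rdb *redis.ClusterClient) {
	_, err := rdb.Pipelined(ctx, func(pipe redis.Pipeliner) error {
		pipe.ZRem(ctx, "example:sorted:set", "member-a")
		pipe.ZIncrBy(ctx, "example:sorted:set", 1, "member-b")
		return nil
	})
	if err != nil {
		log.Printf("pipeline failed: %v", err)
	}
}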
