DougJones - 17 days ago
C# Question

StackExchange.Redis timeout on only 1 server

When a new box starts up (or, presumably, gets its app pool recycled), we see a timeout error on every Redis request. What's interesting is that only about 1 box in 30 is affected: roughly 30 boxes will boot up and work fine (the actual call is a Redis lock call) for every 1 box that boots up in this faulty state. The stack trace below shows ~9k items in the queue. The ConnectionMultiplexer is initialized lazily, per the Microsoft Azure recommendation (though we're not on Azure), and here's the call:

var db = m_dbFactory.GetDatabase();
bool gotLock = db.LockTake(key, value, m_redisLockConfig.RedisLockMaxAgeTimeSpan);


and we're using Ninject to get a singleton of that dbFactory injected in:

kernel.Bind<IRedisDatabaseFactory>().To<RedisDatabaseFactory>().InSingletonScope();
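
For context, here's a minimal sketch of how that binding is typically consumed; the RedisModule and LockService names, and the IRedisConfig binding, are my assumptions for illustration, not part of the original setup:

using Ninject.Modules;

// Hypothetical module grouping the Redis bindings (names assumed for illustration).
public class RedisModule : NinjectModule
{
    public override void Load()
    {
        Bind<IRedisConfig>().To<RedisConfig>().InSingletonScope();
        Bind<IRedisDatabaseFactory>().To<RedisDatabaseFactory>().InSingletonScope();
    }
}

// Any consumer gets the same singleton factory (and therefore the same
// lazily created ConnectionMultiplexer) via constructor injection.
public class LockService
{
    private readonly IRedisDatabaseFactory m_dbFactory;

    public LockService(IRedisDatabaseFactory dbFactory)
    {
        m_dbFactory = dbFactory;
    }
}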


We've had to redeploy the code (recycling the app pool) to fix the issue, or kill the one bad box behind the load balancer. Has anyone come across this problem before? I can see we have 9k items in the queue that haven't been written to the outbound network, per this Azure troubleshooting article: https://azure.microsoft.com/en-us/blog/investigating-timeout-exceptions-in-stackexchange-redis-for-azure-redis-cache/

If the connection had never been opened, though, my Redis DB factory would have thrown an exception explicitly, and I'm not seeing that in our logs. Here's the whole class, to show the ConnectionMultiplexer initialization:

public class RedisDatabaseFactory : IRedisDatabaseFactory
{
    private readonly Lazy<IConnectionMultiplexer> m_lazyConnectionMultiplexer;

    public RedisDatabaseFactory(IRedisConfig redisConfig)
    {
        var endPoint = new DnsEndPoint(redisConfig.Host, redisConfig.Port);

        var configOptions = new ConfigurationOptions
        {
            EndPoints = { endPoint },
            Password = redisConfig.Password,
            ConnectTimeout = 5000,
            AbortOnConnectFail = false
        };

        m_lazyConnectionMultiplexer = new Lazy<IConnectionMultiplexer>(() =>
            ConnectionMultiplexer.Connect(configOptions));
    }

    private IConnectionMultiplexer Connection
    {
        get { return m_lazyConnectionMultiplexer.Value; }
    }

    /// <summary>
    /// Gets a connected redis database
    /// </summary>
    /// <exception cref="Exception"></exception>
    /// <returns>Connected redis database</returns>
    public IDatabase GetDatabase()
    {
        if (!Connection.IsConnected)
        {
            throw new Exception("Redis connection failure");
        }
        return Connection.GetDatabase();
    }
}
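
For reference, the LockTake call from the top of the question is normally paired with a LockRelease using the same key and token; here's a minimal sketch of that take/work/release pattern (the token choice and the DoProtectedWork helper are illustrative, not from our code):

var db = m_dbFactory.GetDatabase();
var token = (RedisValue)Environment.MachineName; // any value identifying this lock holder

if (db.LockTake(key, token, m_redisLockConfig.RedisLockMaxAgeTimeSpan))
{
    try
    {
        DoProtectedWork();
    }
    finally
    {
        // Only the holder whose token matches the stored value can release the lock.
        db.LockRelease(key, token);
    }
}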


Here's the stack trace:

System.TimeoutException: Timeout performing SET mykey, inst: 0, mgr: ExecuteSelect, err: never, queue: 9058, qu: 9058, qs: 0, qc: 0, wr: 0, wq: 0, in: 0, ar: 0, IOCP: (Busy=0,Free=1000,Min=1,Max=1000), WORKER: (Busy=1,Free=32766,Min=1,Max=32767), clientName: myclient
at StackExchange.Redis.ConnectionMultiplexer.ExecuteSyncImpl[T](Message message, ResultProcessor1 processor, ServerEndPoint server)
at StackExchange.Redis.RedisBase.ExecuteSync[T](Message message, ResultProcessor1 processor, ServerEndPoint server)
at StackExchange.Redis.RedisDatabase.StringSet(RedisKey key, RedisValue value, Nullable1 expiry, When when, CommandFlags flags)
at StackExchange.Redis.RedisDatabase.LockTake(RedisKey key, RedisValue value, TimeSpan expiry, CommandFlags flags)


I changed the name of my key, client name, and removed backticks.

Answer

This is really late, but we did eventually make a change that solved the problem. We upgraded to the latest StackExchange.Redis in case the issue was fixed by Marc Gravell and team, but we also made the following change:

m_lazyConnectionMultiplexer = new Lazy<IConnectionMultiplexer>(() => ConnectionMultiplexer.Connect(configOptions), LazyThreadSafetyMode.PublicationOnly);

so that, should the connection multiplexer initialize into a bad state, another one would be created on the next access. After making those two changes, we never saw the issue again. I believe the trigger was not actually the app pool recycle but our process of regularly tearing down the boxes and rebuilding them from an Amazon Machine Image; when they were rebuilt, occasionally one came up in a bad state. I wish I had pinpointed the exact fix, but that's what worked for us.
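
To illustrate why PublicationOnly helps here (my understanding of the documented Lazy<T> behavior, not something from the original post): unlike the default ExecutionAndPublication mode, PublicationOnly does not cache an exception thrown by the value factory, so a failed initialization can simply be retried on the next access. A small self-contained sketch of that difference:

using System;
using System.Threading;

class LazyRetryDemo
{
    static int s_attempts;

    static void Main()
    {
        // Stand-in for ConnectionMultiplexer.Connect: fails on the first attempt,
        // succeeds afterwards, simulating a box that comes up in a bad state.
        var lazy = new Lazy<string>(() =>
        {
            if (Interlocked.Increment(ref s_attempts) == 1)
                throw new InvalidOperationException("simulated bad startup");
            return "connected";
        }, LazyThreadSafetyMode.PublicationOnly);

        for (var i = 0; i < 2; i++)
        {
            try
            {
                // First access rethrows the factory exception; because the exception
                // is not cached in PublicationOnly mode, the second access re-runs
                // the factory and succeeds.
                Console.WriteLine(lazy.Value);
            }
            catch (Exception ex)
            {
                Console.WriteLine("attempt " + (i + 1) + " failed: " + ex.Message);
            }
        }
    }
}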
