Alex Zevenbergen - 11 days ago
C# Question

System.Fabric.FabricNotPrimaryException when saving state from a timer

I'm writing a stateful service that is hosted in Service Fabric. The service's job is to consume messages from an external queue, transform them, and place them onto our own messaging system. Throughput can be up to 6k messages/sec according to the supplier's docs.

I've configured the service with multiple partitions to spread the message load, and each partition has min 2/max 3 replicas. To recover from a failure, I can subscribe to the supplier's queue and pass in a timestamp from which I wish to receive messages. To support this, I'm storing the timestamp of the last message processed in service state. Due to the volume of messages, I decided to do this 'save' on a timer (and accept potential duplicate messages downstream).

This is the code that is called by the timer:

    private async void _timer_Elapsed(object sender, ElapsedEventArgs e)
    {
        var saveRetryPolicy = Policy
            .Handle<Exception>()
            .WaitAndRetryAsync(5, retryAttempt =>
                TimeSpan.FromSeconds(Math.Pow(2, retryAttempt))
            );

        await saveRetryPolicy.ExecuteAsync(async () =>
        {
            using (var tx = _stateManager.CreateTransaction())
            {
                var state = await _stateManager.TryGetAsync<IReliableDictionary<string, long>>(TimestampStateName);

                if (state.HasValue)
                {
                    await state.Value.AddOrUpdateAsync(tx, TimestampStateName, _lastTXTimestamp,
                        (s, l) => _lastTXTimestamp);

                    await tx.CommitAsync();
                }
                else
                {
                    var s =
                        await _stateManager.GetOrAddAsync<IReliableDictionary<string, long>>(tx, TimestampStateName);

                    await tx.CommitAsync();
                    _timer_Elapsed(this, null);
                }
            }
        });
    }


Every time an attempt is made to persist this, I get a 'System.Fabric.FabricNotPrimaryException' error, on each partition.

I have included a retry policy (courtesy of Polly) because a comment on a similar issue recommended doing so. This had no effect, other than prolonging the time before the error was reported.

Am I misunderstanding something fundamental about how SF should be used? This seems like a simple use case to me.

Answer

Answer from comments:

Make sure you don't start the timer on all replicas, but only on the primary replica.
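
One way to follow this advice is to drive the periodic save from RunAsync, which Service Fabric calls only on the primary replica and cancels when that replica is demoted or closed. The following is a minimal sketch, not the asker's actual service: it assumes the service derives from StatefulService, and the class name QueueConsumerService, the dictionary name "timestamp", and the 5-second interval are all hypothetical. It uses the base class's StateManager property in place of the question's _stateManager field.

```csharp
using System;
using System.Fabric;
using System.Threading;
using System.Threading.Tasks;
using Microsoft.ServiceFabric.Data.Collections;
using Microsoft.ServiceFabric.Services.Runtime;

// Hypothetical service class; only the members relevant to the periodic save are shown.
internal sealed class QueueConsumerService : StatefulService
{
    private const string TimestampStateName = "timestamp";                   // assumed name
    private static readonly TimeSpan SaveInterval = TimeSpan.FromSeconds(5); // assumed interval

    // Updated by the message-processing code (not shown).
    private long _lastTXTimestamp;

    public QueueConsumerService(StatefulServiceContext context)
        : base(context)
    {
    }

    // RunAsync is invoked only on the primary replica, so this loop
    // never runs on secondaries, which cannot write to reliable state.
    protected override async Task RunAsync(CancellationToken cancellationToken)
    {
        // Create the dictionary once, up front, instead of inside every save.
        var state = await StateManager
            .GetOrAddAsync<IReliableDictionary<string, long>>(TimestampStateName);

        while (!cancellationToken.IsCancellationRequested)
        {
            // Throws OperationCanceledException when the replica is demoted
            // or closed, which ends RunAsync.
            await Task.Delay(SaveInterval, cancellationToken);

            long timestamp = Interlocked.Read(ref _lastTXTimestamp);

            try
            {
                using (var tx = StateManager.CreateTransaction())
                {
                    await state.SetAsync(tx, TimestampStateName, timestamp);
                    await tx.CommitAsync();
                }
            }
            catch (FabricNotPrimaryException)
            {
                // The replica lost primary status mid-save; retrying cannot
                // succeed here, so stop and let the new primary take over.
                return;
            }
        }
    }
}
```

If a System.Timers.Timer is preferred, the same idea applies: start it from RunAsync and stop it once the cancellation token is signalled, rather than starting it in the constructor or another code path that runs on every replica.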
