ken2k - 1 month ago
C# Question

Should I put my events inside a queue after getting them from Azure Event Hub?

I'm currently developing an application hosted on Azure that uses Azure Event Hub. Basically I'm sending messages (or should I say, events) to the Event Hub from a Web API, and I have two listeners:


  • a Stream Analytics task for real-time analysis

  • a standard worker role that computes some stuff based on the received events and then stores them into an Azure SQL Database (this is a lambda architecture).



I'm currently using the EventProcessorHost library to retrieve my events from the Event Hub inside my worker role.

I'm trying to find some best practices about how to use the Event Hub (it is a bit harder to use Event Hubs than Service Bus queues, i.e. streaming vs message consuming), and I found some people saying I shouldn't do a lot of processing after retrieving EventData events from my Event Hub.

Specifically:

  Keep in mind you want to keep whatever it is you're doing relatively
  fast - i.e. don't try to do many processes from here - that's what
  consumer groups are for.

The author of this article added a queue between the Event Hub and the worker role (it's not clear from the comments if it's really required or not).


So the question is: should I do all my processing stuff directly after the Event Hub (i.e. inside the ProcessEventsAsync method of my IEventProcessor implementation), or should I use a queue between the Event Hub and the processing stuff?

Any recommendation about how to properly consume events from an Event Hub would be appreciated; the documentation is currently a bit... missing.

Answer

This falls into the category of question whose answer will be much more obvious once the source for EventProcessorHost is made available, which I've been told is going to happen.

The short answer is that you don't need to use a queue; however, I would keep the time it takes ProcessEventsAsync to return a Task relatively short.

While this advice sounds a lot like that of the first article, the key distinction is that it is the time to returning a Task, not the time to Task completion. My assumption has been that ProcessEventsAsync is called on a thread that the EventProcessorHost also uses for other purposes. In that case you need to return quickly so that the other work can continue; this work might be calling ProcessEventsAsync for another partition (but we won't know for certain without debugging, which I haven't found necessary, or reading the code once it's available).
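To make the "time to return a Task vs. time to Task completion" distinction concrete, here is a minimal, self-contained sketch. HandOffProcessor is a hypothetical name, and string stands in for EventData so the example runs without the Azure SDK; the shape mirrors what a ProcessEventsAsync implementation would do:

```csharp
using System;
using System.Collections.Generic;
using System.Threading;
using System.Threading.Tasks;

static class HandOffProcessor
{
    // Returns a Task almost immediately; the Task itself completes only
    // after a worker thread has finished processing the batch.
    public static Task ProcessEventsAsync(IEnumerable<string> events)
    {
        var tcs = new TaskCompletionSource<bool>();
        ThreadPool.QueueUserWorkItem(_ =>
        {
            foreach (var e in events)
            {
                // Per-event work goes here; it can be slow without
                // blocking the caller, which already has its Task back.
            }
            tcs.SetResult(true); // signal the host that this batch is done
        });
        return tcs.Task; // handed back before processing finishes
    }
}
```

The caller gets its Task back right away; awaiting (or waiting on) that Task is what observes the actual completion of the work.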

I do my processing on a separate thread per partition by passing along the entire IEnumerable from ProcessEventsAsync. This is in contrast to taking all the items out of the IEnumerable and putting them into a Queue for the processing thread to consume. The other thread completes the Task returned by ProcessEventsAsync when it has finished processing the messages. (I actually give my processing thread a single IEnumerable that hides the details of ProcessEventsAsync by chaining the batches together and completing each batch's Task, when needed, on the call to MoveNext.)
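A sketch of that per-partition hand-off, under stated assumptions: PartitionWorker and its members are illustrative names (not the author's actual implementation), string again stands in for EventData, and the pairing of each batch with a TaskCompletionSource is what lets the worker thread complete the Task that ProcessEventsAsync returned:

```csharp
using System;
using System.Collections.Concurrent;
using System.Collections.Generic;
using System.Threading;
using System.Threading.Tasks;

// One worker per partition: the whole IEnumerable is handed over as-is,
// rather than copying each item into a shared queue.
sealed class PartitionWorker
{
    private sealed class Batch
    {
        public IEnumerable<string> Events;
        public TaskCompletionSource<bool> Done;
    }

    private readonly BlockingCollection<Batch> _pending =
        new BlockingCollection<Batch>();

    public PartitionWorker()
    {
        var thread = new Thread(Run) { IsBackground = true };
        thread.Start();
    }

    // Called from ProcessEventsAsync: hand over the batch and return the
    // Task that the worker thread will complete when it is done.
    public Task Enqueue(IEnumerable<string> events)
    {
        var batch = new Batch
        {
            Events = events,
            Done = new TaskCompletionSource<bool>()
        };
        _pending.Add(batch);
        return batch.Done.Task;
    }

    private void Run()
    {
        foreach (var batch in _pending.GetConsumingEnumerable())
        {
            foreach (var e in batch.Events)
            {
                // Slow per-event processing happens here, off the
                // EventProcessorHost's thread.
            }
            batch.Done.SetResult(true); // completes the host's Task
        }
    }
}
```

ProcessEventsAsync then collapses to roughly `return worker.Enqueue(messages);`, which returns immediately while processing continues on the partition's own thread.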

So in short: in ProcessEventsAsync, hand the work off to another thread, either one you already have around and know how to communicate with, or a new Task kicked off with the TPL.

Putting all the messages into a Queue inside of ProcessEventsAsync isn't bad; it's just not the most efficient way to pass the chunk of events to another thread.

If you decide to put the events into a queue (or have a queue downstream in your processing code) and complete the Task for the batch, you should make sure you limit the number of items outstanding in your code/queue, to avoid running out of memory if the Event Hub gives you items faster than your code can process them during a traffic spike.
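One simple way to enforce such a limit is a bounded BlockingCollection: once the cap is reached, a blocking Add waits and a non-blocking TryAdd fails, so the backlog (and memory use) cannot grow without bound. This is a sketch, not the only option; the class and numbers are illustrative:

```csharp
using System;
using System.Collections.Concurrent;

static class BoundedQueueDemo
{
    // Returns how many batches could be queued before the bound kicked in.
    public static int FillToCapacity(int capacity, int attempts)
    {
        // At most `capacity` batches may be outstanding; with no consumer
        // draining the collection, further non-blocking adds fail.
        var pending = new BlockingCollection<string[]>(boundedCapacity: capacity);
        int accepted = 0;
        for (int i = 0; i < attempts; i++)
        {
            if (pending.TryAdd(new[] { "event" }))
                accepted++;
        }
        return accepted;
    }
}
```

In a real pipeline you would use the blocking `Add` (or `TryAdd` with a timeout) on the producer side and `GetConsumingEnumerable` on the consumer side, so a traffic spike naturally slows the receive loop instead of exhausting memory.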

Note for Java Event Hub users (2016-10-27): since this came to my attention, there's this description of how onEvents is called. While onEvents being slow won't be tragic, since it runs on a thread per partition, its speed appears to affect how quickly the next batch is received. So, depending on how much you care about latency, being fast here could be relatively important for your scenario.