view raw
Shang Wang Shang Wang - 6 months ago 50
Python Question

How to calculate running mean traffic for last minute

I have a python server that accepts time series data. Now I need to calculate the average traffic for the last minute, output like 90 samples/minute. I'm currently using a python list to hold all time stamps and use a pretty awful way(in my opinion) to calculate this. The code roughly looks like this:

class TrafficCalculator(object):
timestamps = []

def run():
while True:
# this gets one record of traffic
data = self.accept_data()
# get record's timestamp
timestamp = data.timestamp
# add to list
# get the time one minute ago
minute_ago = timestamp - datetime.timedelta(minutes=1)
# find out the first index of the timestamp in the past that's within 1 minute
for i, t in enumerate(self.timestamp):
if t > minute_ago:
# see how many records are within last minute
result = len(self.timestamp[i:])
# throw away the earlier data
self.timestamp = self.timestamp[i:]

As you can see, I have to do this for every record, if my traffic gets intense, the performance is miserable.

Is there a better data structure or algorithm I can use to make this more performant? Even further, how do I write a test to verify my algorithm? Thanks!


Use Queue to hold <traffic, timestamp> pair. Here timestamp is the time it has been pushed on Queue(arrives from server). Track the sum of the traffics of Queue. When a new traffic arrives and the difference between its timestamp and Queue's front element's timestamp more than 1 minute, pop front from Queue. And subtract the poped traffic value from sum. Push the new traffic into queue and add to sum.

This way, your queue is working as a window frame to hold the 1 minute traffic all the time. And you are tracking the sum and you know the Queue size, so you can calculate the average.

The space complexity is O(maximum traffic can be arrived within 1 minute). Time complexity is O(1) for getting average at any time.

This is a very conventional algorithm for Query on any running stream of data in constant time complexity.

Note: Unfortunately I don't know Python. Otherwise I would put the implementation.