
Computing Standard Deviation in a stream

Using Python, assume I'm running through a known quantity of items `I`, and have the ability to time how long it takes to process each one (`t`), as well as keep a running total of time spent processing (`T`) and the number of items processed so far (`c`). I'm currently calculating the average on the fly as `A = T / c`, but this can be skewed by, say, a single item taking an extraordinarily long time to process (a few seconds compared to a few milliseconds).

I would like to show a running standard deviation. How can I do this without keeping a record of each `t`?

Answer

I use Welford's Method, which gives more accurate results. John D. Cook has written an overview of the algorithm; here's a paragraph from it that summarizes why it is a preferred approach:

This better way of computing variance goes back to a 1962 paper by B. P. Welford and is presented in Donald Knuth’s Art of Computer Programming, Vol 2, page 232, 3rd edition. Although this solution has been known for decades, not enough people know about it. Most people are probably unaware that computing sample variance can be difficult until the first time they compute a standard deviation and get an exception for taking the square root of a negative number.
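The answer's original code isn't reproduced here, but a minimal sketch of Welford's recurrence in Python might look like the following. The class and method names are my own choice, not from the original answer:

```python
import math

class RunningStats:
    """Welford's online algorithm for mean and variance.

    Keeps only the count, the running mean, and M2 (the running sum of
    squared deviations from the mean), so the standard deviation can be
    reported at any point without storing the individual samples.
    """

    def __init__(self):
        self.n = 0        # number of samples seen so far
        self.mean = 0.0   # running mean
        self.m2 = 0.0     # running sum of squared deviations from the mean

    def push(self, x):
        self.n += 1
        delta = x - self.mean          # deviation from the old mean
        self.mean += delta / self.n    # update the mean incrementally
        # The second factor uses the *updated* mean; this is what keeps
        # the recurrence numerically stable.
        self.m2 += delta * (x - self.mean)

    def variance(self):
        # Sample variance; undefined for fewer than two samples.
        return self.m2 / (self.n - 1) if self.n > 1 else 0.0

    def std_dev(self):
        return math.sqrt(self.variance())
```

Calling `push()` once per item means only three numbers are carried along, so applied to the question's setup you would push each `t` as it is measured and read the standard deviation whenever you want to display it:

```python
stats = RunningStats()
for t in (0.004, 0.006, 0.005, 2.3):   # per-item processing times in seconds
    stats.push(t)
print(stats.mean, stats.std_dev())
```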