Friday, September 23, 2011

Tracking a running standard deviation

It is fairly common to want to track statistics on your software as it runs. Sometimes it makes sense to scrape logs and aggregate a massive set of samples. Other times you really want "live" numbers; eg running statistics. Most commonly, arithmetic mean and standard deviation. Tracking arithmetic mean alone is relatively easy but it can be very misleading. I frequently see claims of "good performance" based on mean alone which completely break down when it turns out that if we consider standard deviation a significant percentage of our population is actually experiencing something much, much worse than the mean.

So, how do we track a standard deviation from a series of samples? We could keep all samples in memory but that sounds bad; ideally we'd just like a few variables used to keep an running count of the stat. Luckily Knuth solved this problem for us, and even more luckily a statistician has written sample code in C# for us at http://www.johndcook.com/standard_deviation.html.

In Java this winds up looking remarkably similar:
//class
public class RunningStat {
	private int m_n;
	private double m_oldM;
	private double m_newM;
	private double m_oldS;
	private double m_newS;
	
	public RunningStat() {
		m_n = 0;
	}
	
	public void clear() { m_n = 0; }
	
	public void addSample(double sample) {
        m_n++;

        // See Knuth TAOCP vol 2, 3rd edition, page 232
        if (m_n == 1)
        {
            m_oldM = m_newM = sample;
            m_oldS = 0.0;
        }
        else
        {
            m_newM = m_oldM + (sample - m_oldM)/m_n;
            m_newS = m_oldS + (sample - m_oldM)*(sample - m_newM);

            // set up for next iteration
            m_oldM = m_newM; 
            m_oldS = m_newS;
        }
	}
	
	public int getNumSamples() { return m_n; }
	public double getMean() { return (m_n > 0) ? m_newM : 0.0; }
	public double getVariance() { return ( (m_n > 1) ? m_newS/(m_n - 1) : 0.0 ); }
	public double getStdDev() { return Math.sqrt(getVariance()); }	
}

//usage
RunningStat timeStat = new RunningStat();		
...
long time = System.nanoTime();			
//do the thing we are tracking...						
time = (System.nanoTime() - time) / (1000*1000); //ms
			
timeStat.addSample(time); 

This does have some limitations for a typical Java web application, in particular it isn't threadsafe. We can simply slap synchronized on it, using a private lock to avoid unwanted publicity of our synchronization primatives:
public class RunningStat {
	private int m_n;
	private double m_oldM;
	private double m_newM;
	private double m_oldS;
	private double m_newS;
	
	private Object m_lock = new Object();
	
	public RunningStat() {
		m_n = 0;
	}
	
	public void clear() {
		synchronized(m_lock) {
			m_n = 0;
		}
	}
	
	public void addSample(double sample) {
		synchronized(m_lock) {
	        m_n++;
	
	        // See Knuth TAOCP vol 2, 3rd edition, page 232
	        if (m_n == 1)
	        {
	            m_oldM = m_newM = sample;
	            m_oldS = 0.0;
	        }
	        else
	        {
	            m_newM = m_oldM + (sample - m_oldM)/m_n;
	            m_newS = m_oldS + (sample - m_oldM)*(sample - m_newM);
	
	            // set up for next iteration
	            m_oldM = m_newM; 
	            m_oldS = m_newS;
	        }
		}
	}
	
	public int getNumSamples() { synchronized(m_lock) { return m_n; } }
	public double getMean() { synchronized(m_lock) { return (m_n > 0) ? m_newM : 0.0; } }
	public double getVariance() { synchronized(m_lock) { return ( (m_n > 1) ? m_newS/(m_n - 1) : 0.0 ); } }
	public double getStdDev() { synchronized(m_lock) { return Math.sqrt(getVariance()); } }	
}
This leaves open the question of how we actually distribute references to the stat. For example, it could be that anytime a given handler executes we want to update a stat, and we want to grab that stat to use in a periodic stat logger, and we want to grab it for our view statistics page. All these locations need to get a reference to the same instance of RunningStat. If we use something like Spring we can trivially inject the same reference. If not we may wish to manage a stats registry ourselves. A simple implementation using ConcurrentHashMap might look something like this:
import java.util.Collection;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.ConcurrentMap;

public class StatisticsRegistry {
	private static final StatisticsRegistry instance = new StatisticsRegistry();				
	
	private final ConcurrentMap statById = new ConcurrentHashMap(); 
	private final Object writeLock = new Object();
	
	public static StatisticsRegistry getInstance() { return instance; }
	
	public RunningStat getNamedStatistic(String name) {
		/**
		 * Usually the stat will exist; avoid extra sync ops on write lock when possible
		 */
		RunningStat stat = statById.get(name);
		
		if (null == stat) {
			synchronized(writeLock) {
				//someone else could have just inserted it; if so putIfAbsent will return the value they put in
				RunningStat existing = statById.putIfAbsent(name, new RunningStat());
				if (existing != null) {
					stat = existing;
				}
			}
		}
		
		return stat;
	}
		
	public Collection getAllStatistics() {
		return statById.values();
	}
}
Clients can request a stat by name and clients like a stat logger or viewer can grab either all stats or specific ones as needed. As the same instance is always returned it is safe for clients to write code to retain RunningStat references.

7 comments:

htop said...

thanks for sharing this information
azure training in chennai
best devops training in chennai
best hadoop training in chennai
best hadoop training in omr
hadoop training in sholinganallur

i Digital Academy said...

Nice post!! Thanks for sharing...
SEO Services Bangalore

wwwnortoncomsetup.com said...

I'm so glad and enjoyed your BLOG, It is very informative on the subject or topic, and Thanks For Sharing this post. I have something to share here norton.com/setup

Danny Ethan said...
This comment has been removed by the author.
ben said...

payroll software singapore

Gouse said...

Really informative post! I appreciate how you broke down the concept of tracking a running standard deviation and explained its significance in real-time data analysis. The step-by-step approach, along with the mathematical explanation, makes it easy to follow. This is especially useful for applications that need continuous data monitoring without recalculating everything from scratch. It would be great to see some examples of how this technique is applied in large-scale systems or machine learning models. Thanks for the clear and practical insights!"
Digital Marketing Course In Ameerpet

Gouse said...

his is a highly informative post! I really appreciate how you clarified the concept of tracking a running standard deviation and highlighted its importance in real-time data analysis. The step-by-step breakdown and the mathematical explanations make it very easy to understand. This approach is particularly beneficial for applications requiring continuous data monitoring without the need to recalculate everything from the beginning. I’d love to see some examples of how this technique is utilized in large-scale systems or machine learning models. Thank you for the clear and practical insights!
Digital Marketing Course In Hyderabad

Post a Comment