Friday, September 23, 2011

Tracking a running standard deviation

It is fairly common to want to track statistics on your software as it runs. Sometimes it makes sense to scrape logs and aggregate a massive set of samples. Other times you really want "live" numbers; eg running statistics. Most commonly, arithmetic mean and standard deviation. Tracking arithmetic mean alone is relatively easy but it can be very misleading. I frequently see claims of "good performance" based on mean alone which completely break down when it turns out that if we consider standard deviation a significant percentage of our population is actually experiencing something much, much worse than the mean.

So, how do we track a standard deviation from a series of samples? We could keep all samples in memory but that sounds bad; ideally we'd just like a few variables used to keep an running count of the stat. Luckily Knuth solved this problem for us, and even more luckily a statistician has written sample code in C# for us at http://www.johndcook.com/standard_deviation.html.

In Java this winds up looking remarkably similar:
//class
public class RunningStat {
	private int m_n;
	private double m_oldM;
	private double m_newM;
	private double m_oldS;
	private double m_newS;
	
	public RunningStat() {
		m_n = 0;
	}
	
	public void clear() { m_n = 0; }
	
	public void addSample(double sample) {
        m_n++;

        // See Knuth TAOCP vol 2, 3rd edition, page 232
        if (m_n == 1)
        {
            m_oldM = m_newM = sample;
            m_oldS = 0.0;
        }
        else
        {
            m_newM = m_oldM + (sample - m_oldM)/m_n;
            m_newS = m_oldS + (sample - m_oldM)*(sample - m_newM);

            // set up for next iteration
            m_oldM = m_newM; 
            m_oldS = m_newS;
        }
	}
	
	public int getNumSamples() { return m_n; }
	public double getMean() { return (m_n > 0) ? m_newM : 0.0; }
	public double getVariance() { return ( (m_n > 1) ? m_newS/(m_n - 1) : 0.0 ); }
	public double getStdDev() { return Math.sqrt(getVariance()); }	
}

//usage
RunningStat timeStat = new RunningStat();		
...
long time = System.nanoTime();			
//do the thing we are tracking...						
time = (System.nanoTime() - time) / (1000*1000); //ms
			
timeStat.addSample(time); 

This does have some limitations for a typical Java web application, in particular it isn't threadsafe. We can simply slap synchronized on it, using a private lock to avoid unwanted publicity of our synchronization primatives:
public class RunningStat {
	private int m_n;
	private double m_oldM;
	private double m_newM;
	private double m_oldS;
	private double m_newS;
	
	private Object m_lock = new Object();
	
	public RunningStat() {
		m_n = 0;
	}
	
	public void clear() {
		synchronized(m_lock) {
			m_n = 0;
		}
	}
	
	public void addSample(double sample) {
		synchronized(m_lock) {
	        m_n++;
	
	        // See Knuth TAOCP vol 2, 3rd edition, page 232
	        if (m_n == 1)
	        {
	            m_oldM = m_newM = sample;
	            m_oldS = 0.0;
	        }
	        else
	        {
	            m_newM = m_oldM + (sample - m_oldM)/m_n;
	            m_newS = m_oldS + (sample - m_oldM)*(sample - m_newM);
	
	            // set up for next iteration
	            m_oldM = m_newM; 
	            m_oldS = m_newS;
	        }
		}
	}
	
	public int getNumSamples() { synchronized(m_lock) { return m_n; } }
	public double getMean() { synchronized(m_lock) { return (m_n > 0) ? m_newM : 0.0; } }
	public double getVariance() { synchronized(m_lock) { return ( (m_n > 1) ? m_newS/(m_n - 1) : 0.0 ); } }
	public double getStdDev() { synchronized(m_lock) { return Math.sqrt(getVariance()); } }	
}
This leaves open the question of how we actually distribute references to the stat. For example, it could be that anytime a given handler executes we want to update a stat, and we want to grab that stat to use in a periodic stat logger, and we want to grab it for our view statistics page. All these locations need to get a reference to the same instance of RunningStat. If we use something like Spring we can trivially inject the same reference. If not we may wish to manage a stats registry ourselves. A simple implementation using ConcurrentHashMap might look something like this:
import java.util.Collection;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.ConcurrentMap;

public class StatisticsRegistry {
	private static final StatisticsRegistry instance = new StatisticsRegistry();				
	
	private final ConcurrentMap statById = new ConcurrentHashMap(); 
	private final Object writeLock = new Object();
	
	public static StatisticsRegistry getInstance() { return instance; }
	
	public RunningStat getNamedStatistic(String name) {
		/**
		 * Usually the stat will exist; avoid extra sync ops on write lock when possible
		 */
		RunningStat stat = statById.get(name);
		
		if (null == stat) {
			synchronized(writeLock) {
				//someone else could have just inserted it; if so putIfAbsent will return the value they put in
				RunningStat existing = statById.putIfAbsent(name, new RunningStat());
				if (existing != null) {
					stat = existing;
				}
			}
		}
		
		return stat;
	}
		
	public Collection getAllStatistics() {
		return statById.values();
	}
}
Clients can request a stat by name and clients like a stat logger or viewer can grab either all stats or specific ones as needed. As the same instance is always returned it is safe for clients to write code to retain RunningStat references.