NETWORKED INSIGHTS BLOG

Stats, Baseball, and Software Development

“If you can’t measure it, you can’t change it.” This is one of those fundamentals that is so often forgotten at many levels of an organization. Every effort an organization makes to change something has an implicit measurement behind it: a picture in someone’s mind of some quality that differs from what someone else believes it should be.

There is power in having concrete numbers, even if they do not mean much in isolation, if only for the sake of comparison. One of the best examples of a measure without absolute meaning is baseball’s OPS. A player’s OPS is the sum of their On-Base Percentage (OBP, a fairly self-explanatory number) and their Slugging Percentage (SLG). SLG is the total number of bases the batter reaches on hits divided by the batter’s total number of at-bats. The concept that SLG captures is the hitter’s overall effectiveness and power [1].
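For the curious, the arithmetic can be sketched in a few lines. This is only an illustration using the commonly published formulas for OBP and SLG; the class and field names are made up for the example and do not come from any real stats library.

```java
// Hypothetical value class for one player's batting totals (names are illustrative).
final class BattingLine {
    final int atBats, hits, doubles, triples, homeRuns, walks, hitByPitch, sacrificeFlies;

    BattingLine(int atBats, int hits, int doubles, int triples, int homeRuns,
                int walks, int hitByPitch, int sacrificeFlies) {
        this.atBats = atBats;
        this.hits = hits;
        this.doubles = doubles;
        this.triples = triples;
        this.homeRuns = homeRuns;
        this.walks = walks;
        this.hitByPitch = hitByPitch;
        this.sacrificeFlies = sacrificeFlies;
    }

    // Total bases: a single counts 1, a double 2, a triple 3, a home run 4.
    int totalBases() {
        int singles = hits - doubles - triples - homeRuns;
        return singles + 2 * doubles + 3 * triples + 4 * homeRuns;
    }

    // SLG = total bases / at-bats (NaN when there are no at-bats).
    double slugging() {
        return (double) totalBases() / atBats;
    }

    // OBP = (H + BB + HBP) / (AB + BB + HBP + SF).
    double onBase() {
        return (double) (hits + walks + hitByPitch)
                / (atBats + walks + hitByPitch + sacrificeFlies);
    }

    // OPS is simply the sum of the two.
    double ops() {
        return onBase() + slugging();
    }
}
```

Note that the two terms have different denominators, which is part of why the sum has no tidy absolute meaning and is mostly useful for comparison.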

If you close your eyes and get near to that crevasse where consciousness and lucid dreaming meet, you may start to understand what on earth SLG or OBP meant in the minds of their creators. They are, as numbers without a context, meaningless. That said, if you give them either the context of Other Players or the context of Chance We’ll Win This Game, their value is unmatched in today’s baseball. An OPS near 1 is fabulous! What does 1 mean? Well… uh… it means we win a lot, and that’s good enough (unless you are my Mets, who have great stats yet constantly defy the odds). Sigh, I digress…

Software, like baseball players, is magnificent, oftentimes mysterious, and occasionally of questionable worth for the sum you paid for it. And to evaluate it, as with baseball players (whose behavior also often seems peculiar and irrational), you must boil your application down to some simple numbers you can digest.

For instance, if you were to compute a health metric to digest your system’s state, what might you use? Some insights gained from the recent software sprint I was involved in point to more than your typical CPU usage, memory usage, and network usage. While these numbers are valuable, they do not give you a good idea of how your system is behaving compared to how it should be behaving. That is to say, if your application uses 90% of the CPU most of the time, then that is normal and acceptable. It is not until you establish the normal water mark that you can start reasoning about what high and low look like.

To begin our application’s instrumentation, we first built a RESTful web page to pump our statistics out via XML. We also wired up simple serialization of all the named object values available from the system’s JMX server. This gives us, essentially for free, the max and current values for memory, garbage collection statistics, thread statistics, and a host of other diagnostic bits of information.
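To give a flavor of what the “essentially for free” part looks like, here is a minimal sketch, not our actual code. It reads a handful of values from the JVM’s standard platform MXBeans (a simpler stand-in for walking every named object on the JMX server) and dumps them as XML. The class name, the stat keys, and the XML layout are all assumptions made up for this example.

```java
import java.lang.management.GarbageCollectorMXBean;
import java.lang.management.ManagementFactory;
import java.lang.management.MemoryUsage;
import java.lang.management.ThreadMXBean;
import java.util.LinkedHashMap;
import java.util.Map;

// Hypothetical helper: collects a few JVM statistics and renders them as XML.
final class JmxSnapshot {

    // Gather memory, thread, and garbage collection numbers from the platform MXBeans.
    static Map<String, Long> collect() {
        Map<String, Long> stats = new LinkedHashMap<>();

        MemoryUsage heap = ManagementFactory.getMemoryMXBean().getHeapMemoryUsage();
        stats.put("heap.used", heap.getUsed());
        stats.put("heap.max", heap.getMax()); // -1 when the JVM reports no maximum

        ThreadMXBean threads = ManagementFactory.getThreadMXBean();
        stats.put("threads.current", (long) threads.getThreadCount());
        stats.put("threads.peak", (long) threads.getPeakThreadCount());

        for (GarbageCollectorMXBean gc : ManagementFactory.getGarbageCollectorMXBeans()) {
            stats.put("gc." + gc.getName() + ".count", gc.getCollectionCount());
            stats.put("gc." + gc.getName() + ".timeMillis", gc.getCollectionTime());
        }
        return stats;
    }

    // Render the stats as a flat XML document (names are not XML-escaped in this sketch).
    static String toXml(Map<String, Long> stats) {
        StringBuilder xml = new StringBuilder("<stats>\n");
        for (Map.Entry<String, Long> e : stats.entrySet()) {
            xml.append("  <stat name=\"").append(e.getKey())
               .append("\" value=\"").append(e.getValue()).append("\"/>\n");
        }
        return xml.append("</stats>").toString();
    }
}
```

Serving the string returned by `JmxSnapshot.toXml(JmxSnapshot.collect())` from a web endpoint would be one way to get the kind of page described above.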

In retrospect, we should have wired our application metrics into JMX as well, and perhaps that will be a future task. For now they reside in a series of iBatis calls to the database, which store simple double-precision values along with their weighted moving averages and their min, max, and current values. Each statistic also carries some timestamp information: how long the interval is between calls to that statistic and how often the statistic has been updated.

Of particular interest is that the weighted moving average is computed from only two data points: the current average and the new value. Also stored with the statistic is the weight given to the new value (which should be >= 0 and <= 1). The larger this weight is, the faster the average will move toward the new value. For daily job runs this weight should be high. For frequent calls to the database, it should be low. This causes the average to move quickly for the infrequent calls and slowly for the very frequent calls.
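A statistic like this can be sketched as follows. This is an in-memory illustration only, assuming the usual exponentially weighted update (new average = weight × new value + (1 − weight) × old average); the class name and the choice to seed the average with the first value are my assumptions here, and the database storage via iBatis is left out entirely.

```java
// Hypothetical in-memory statistic; persistence is deliberately omitted.
final class MovingStat {
    private final double weight; // weight given to each new value, in [0, 1]
    private double average;
    private double current;
    private double min = Double.POSITIVE_INFINITY;
    private double max = Double.NEGATIVE_INFINITY;
    private long updates;

    MovingStat(double weight) {
        if (!(weight >= 0.0 && weight <= 1.0)) {
            throw new IllegalArgumentException("weight must be between 0 and 1");
        }
        this.weight = weight;
    }

    // Fold one new observation into the statistic.
    void update(double value) {
        // Assumption: the first observation seeds the average directly.
        average = (updates == 0) ? value : weight * value + (1.0 - weight) * average;
        current = value;
        min = Math.min(min, value);
        max = Math.max(max, value);
        updates++;
    }

    double average() { return average; }
    double current() { return current; }
    double min() { return min; }
    double max() { return max; }
    long updates() { return updates; }
}
```

With a weight of 0.5, feeding in 10 and then 20 moves the average to 15. With a weight of 0.1, the same two values only move it to 11, which is the slow drift you want for a statistic that is updated many times a minute.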

One added bonus: errors and exceptions, regardless of language, should be treated as data to be handled, and preferably captured and stored. The statistics engine does provide an exception handling routine to store an exception, though this system needs refinement as we learn more about how we will be using this information in the future.

The goal of all this is a dashboard with fields not unlike the following:

  • exception count for the last day
  • current memory divided by peak memory
  • current threads divided by peak threads
  • batch job timeout count + batch job completion count = total job count
  • user calls per minute

These are all instrumentable values and, at a glance, they can tell you some profound things about the general activity of your system. They tell you something about throughput and something about how close you are to your hardware limitations.
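None of the dashboard fields listed above needs more than grade-school arithmetic. A rough sketch, with hypothetical method names and the assumption that raw counters and an elapsed time in milliseconds are already on hand:

```java
// Hypothetical helpers for turning raw counters into dashboard fields.
final class Dashboard {

    // Current value as a fraction of its peak (0 when no peak has been recorded).
    static double ratio(long current, long peak) {
        return peak <= 0 ? 0.0 : (double) current / peak;
    }

    // Every batch job either timed out or completed.
    static long totalJobs(long timeouts, long completions) {
        return timeouts + completions;
    }

    // User calls per minute over the measured interval.
    static double callsPerMinute(long calls, long elapsedMillis) {
        return elapsedMillis <= 0 ? 0.0 : calls * 60_000.0 / elapsedMillis;
    }
}
```

For example, 512 MB in use against a 1024 MB peak gives a memory ratio of 0.5, and 120 calls over two minutes gives 60 calls per minute.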

What is even more profound is that once the application holds this information, the application can “learn” and take action when the situation changes. It can throttle user queries to let the database catch up. It can send warning emails. It can stop running jobs. It can signal other clients of the DB or other resources to “back off use” for 10 minutes. These are all very simple reactions to resource strains, but impossible to do intelligently without some very basic measurements.
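As a sketch of how simple such a reaction can be, the snippet below compares a current reading against its baseline (for instance, the moving average of that statistic) and picks an action. The class name, the action names, and the threshold factors are all invented for illustration; real thresholds would have to come from watching your own system.

```java
// Hypothetical policy: decide how to react when a reading drifts above its baseline.
final class LoadPolicy {

    enum Action { NONE, WARN, THROTTLE }

    // warnFactor and throttleFactor are multiples of the baseline, e.g. 1.5 and 3.0.
    static Action decide(double current, double baseline,
                         double warnFactor, double throttleFactor) {
        if (baseline <= 0.0) {
            return Action.NONE; // no baseline yet, so nothing to compare against
        }
        double ratio = current / baseline;
        if (ratio >= throttleFactor) {
            return Action.THROTTLE; // e.g. slow user queries, pause batch jobs
        }
        if (ratio >= warnFactor) {
            return Action.WARN; // e.g. send a warning email
        }
        return Action.NONE;
    }
}
```

The point is less the code than the dependency: without a stored baseline, there is nothing sensible to pass as the second argument.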
