BitWorking

Detecting Benchmark Regression

Subtitle if this were an academic paper, which it’s not: A k-means clustering derived point statistic highly correlated with regressions in commit-series data with applications to automatic anomaly detection in large sets of benchmark results.

TL;DR: To detect regressions in benchmark data over a series of commits, use k-means clustering on mean- and variance-normalized commit-series. For each cluster, find the best fitting step function to the cluster’s centroid. The metric |step/fit| is highly correlated with regressions, where step is the height of the step function and fit is the mean square error of the fit of the step function to the centroid.

Below is a description of how we detect performance regressions for the Skia graphics library. I’m writing this up because, after much searching, I haven’t found anyone describing the method we came up with for detecting performance regressions, and maybe this writeup will be useful to other people.

Problem Statement

Skia is an open source cross platform 2D graphics library. In Skia, as in many other software projects, we have a large number of performance tests, aka benchmarks, and we run those benchmarks every time we change the code. Just having a large number of benchmarks isn’t a problem, but being cross platform means running those tests across many different platforms: Linux, Mac, Windows, Android, ChromeOS, on different GPUs, etc., which leads to a combinatorial explosion in benchmark results. For every commit to Skia today the testing infrastructure generates approximately 40,000 benchmark measurements. That number changes frequently as tests, platforms, and configurations are added and removed, and it has been over 70,000 per commit in the past several months.

Definitions

To make the following discussion easier let’s define some terms.

Trace
A Trace is a single benchmark and configuration tracked over a series of commits. Note that this isn’t exactly a time series, since the measurements aren’t taken at equidistant times but are spaced by commits to the codebase. Also note that for each benchmark there may be multiple traces, for example, one for Windows 8, one for Linux, and one for Android.

Fig 1 - Trace

Regression
A “performance regression” is a significant change in either direction of a metric. Now a metric that drops may actually be a good performance increase, but could also be an indication of a test that is broken, or has stopped working entirely. So regardless of the benchmark, we are looking for step-like changes in either direction.

The issue with tens of thousands of traces is that you just can’t look at the raw numbers, or even plot all the data, and figure out when you’ve had a regression. At first we tried setting performance ranges for each trace, i.e. an upper and lower bound for each trace. If a later commit caused the benchmark results to move outside those bounds then that would trigger an error. There are many drawbacks to monitoring benchmarks by manually placing bounds on each trace:

  1. The most important drawback is that in such a system a single test result can trigger an alert. You know the old phrase, “the plural of anecdote isn't data”: a single benchmark measurement is virtually meaningless, as any number of anomalies could actually be responsible for that benchmark result changing. For example, a machine could overheat, forcing a move to frequency scaling, or other processes on the machine may starve the test of CPU cycles. You can work to mitigate these eventualities, but they never completely go away. SPC systems such as the Western Electric rules might be applicable in this case, but we’ve never tested them.
  2. Constant manual editing of trace bounds is time consuming and error prone.
  3. Constantly adding manual trace bounds for new benchmarks is time consuming. Add one test and you may have to add many trace bounds, one for each member of that combinatorial explosion.
  4. Forgetting to add new ranges for new benchmarks is another source of error.

Even if you automate the placing of trace bounds, you still have the issue of transient behavior that looks like a regression, and you also have to take pains that the automatic calculation of trace bounds doesn’t mask a true regression.

Fig 2 - Is this a regression or an anomaly?

So we needed a better system than setting trace bounds. The next section explains the system we implemented and have successfully run for several months now.

Before we go further let’s define a few more terms.

Normalized Traces
Normalization is the process of modifying each Trace so that it has a mean of zero and a standard deviation of 1.0. Note that if the standard deviation of a trace is too small, then blowing it up to a standard deviation of 1.0 would introduce nothing but noise, so there’s a lower limit on the standard deviation of a trace, and below that we don’t normalize the standard deviation of the trace. The idea is to extract just the shape of the trace, so that all the normalized traces are comparable using a sum of squares distance metric. The smaller the sum of squares error between two normalized traces, the more similar their shapes.
k-means clustering
I’m not going to explain k-means clustering in one paragraph; you should go look it up on Wikipedia or any of the fine explanations available on the web. The important point is that we are using k-means clustering to group normalized traces together based on their shape. The idea is that many traces will move together in the same direction from commit to commit. For example, if I speed up rectangle drawing then all tests that use rectangles should get faster, if not by the same proportion.
Centroid
The centroid is the point at the center of a cluster, in this case the mean of the normalized traces in the cluster, which acts as a prototype shape for the members of the cluster.
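
To make the shape-matching idea concrete, here is a minimal sketch in Python of normalizing traces and clustering them by shape. The standard deviation floor, the use of scikit-learn, and the number of clusters are illustrative assumptions, not a description of the production code.

import numpy as np
from sklearn.cluster import KMeans

def normalize(trace, min_std_dev=1e-5):
    """Shift a trace to zero mean and, if it varies enough, scale it to a
    standard deviation of 1.0. min_std_dev is an illustrative floor: below
    it only the mean is removed, so near-flat traces aren't blown up into
    pure noise."""
    trace = np.asarray(trace, dtype=float)
    centered = trace - trace.mean()
    std = centered.std()
    return centered / std if std > min_std_dev else centered

def cluster_traces(traces, n_clusters=50):
    """Group normalized traces by shape with k-means. traces is a list of
    equal-length measurement series, one per benchmark/configuration. The
    returned model's cluster_centers_ are the centroids, each a prototype
    shape for its cluster. n_clusters=50 is an assumption, not Skia's
    actual setting."""
    X = np.vstack([normalize(t) for t in traces])
    return KMeans(n_clusters=n_clusters, n_init=10).fit(X)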
Regression Factor

For each cluster of normalized traces we find the best fitting step function to the centroid. From that best fitting step function we calculate Fit and Step, where Fit is the sum of squares error between the step function and the centroid, and Step is the height of the step function.

From there we calculate the Regression Factor:

R = Step / Fit

A smaller Fit value gives you a larger R, which means that the more a centroid looks like a step function, the larger R gets. Similarly, the larger Step gets, the larger R gets; Step is a measure of how big a change the centroid represents.
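
Here is a rough sketch, under the same illustrative assumptions as the clustering sketch above, of fitting a step function to a centroid and computing the Regression Factor. Trying a step at every index is a brute-force simplification; the real system may place or search for the step differently.

import numpy as np

def step_fit(centroid):
    """Return (step, fit, regression_factor) for the best-fitting step
    function, i.e. a function that is flat at one level before some index
    and flat at another level after it. For each candidate index the best
    levels are the means of the two halves; Fit is the sum of squares
    error of that fit and Step is the difference between the levels."""
    centroid = np.asarray(centroid, dtype=float)
    best_step, best_fit = 0.0, float("inf")
    for i in range(1, len(centroid)):
        left, right = centroid[:i], centroid[i:]
        fit = np.sum((left - left.mean()) ** 2) + np.sum((right - right.mean()) ** 2)
        if fit < best_fit:
            best_step, best_fit = right.mean() - left.mean(), fit
    # Guard against dividing by zero when the centroid is a perfect step.
    return best_step, best_fit, best_step / max(best_fit, 1e-12)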

Putting it all together.

So finally, with all the preliminaries set up, we can get to the core of the system.

Here’s a view of the system at work, finding regressions in performance. Note that out of 40,000 benchmarks the first cluster contains 1336 traces and has a Regression Factor of -4.08e+3.

Fig 3 - The clustering system at work, showing clusters found in the benchmark data.

Continuous Analysis

The above system works for finding regressions once. But what happens if you want to check for regressions continuously as new commits land and new benchmark results arrive? One last definition:

Interesting
A cluster is considered interesting if its Regression Factor is over 150. This is only a rule of thumb based on observing the system and may be relevant only to the Skia benchmarks; a different cutoff may be appropriate for other datasets. The important point is that as |R| grows, so does the likelihood that the cluster represents a regression.
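
As a sketch, picking out the Interesting clusters is then just a filter on |R|; the regression_factor attribute on the cluster objects here is hypothetical.

INTERESTING_THRESHOLD = 150  # rule of thumb; may only suit the Skia data

def interesting_clusters(clusters):
    """Keep only the clusters whose Regression Factor magnitude exceeds
    the threshold. Each cluster is assumed to carry the regression factor
    computed from its centroid by step_fit above."""
    return [c for c in clusters
            if abs(c.regression_factor) > INTERESTING_THRESHOLD]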

To continuously monitor for Interesting clusters, start by running the above analysis once and finding interesting clusters. If there are any, triage them as either really needing attention, such as a CL that needs to be rolled back, or ignorable, say in the case where a true performance increase was seen. Then, on a periodic basis, run the analysis again when new data arrives. What should be done with the new set of interesting clusters produced from the analysis? The existing interesting clusters have already been triaged, and those same clusters may appear in the output of the analysis, along with new interesting clusters. The process of raising only new interesting clusters for triaging, while folding existing clusters together with similar clusters that appear in the analysis results, is called cluster coalescing.

Cluster coalescing currently works by looking at all the new interesting clusters; if one has the same traces as the 20 best traces in an existing cluster, then the two are considered the same cluster. Note that ‘best’ means the 20 traces that are closest to the centroid of a cluster. This is an area of active work and we are still experimenting regularly with new cluster coalescing schemes.
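
A minimal sketch of that matching rule, assuming each cluster exposes its trace ids ordered by distance to the centroid; the data structures and the exact set comparison are illustrative, not the production logic.

def coalesces_with(new_cluster_trace_ids, existing_best_trace_ids):
    """Treat a new interesting cluster as a re-appearance of an existing,
    already triaged cluster when the existing cluster's best traces (the
    20 closest to its centroid) all show up among the new cluster's
    traces."""
    return set(existing_best_trace_ids) <= set(new_cluster_trace_ids)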

Wrap Up

I hope that was useful. Please shoot me any questions on Twitter @bitworking. The code for the software that does the above analysis, and much more, is open sourced here.

2014-11-30

Observations on hg and git

Having recently moved to using Git from Mercurial here are my observations:

Git just works

No matter what I try to do, there's a short and simple git command that does it. Need to copy a single file from one branch to my current branch, roll back the last two commits and place their changes into the index, or push or pull from a local branch to a remote and differently named branch? There are ways to do all of those things. More importantly, Git does them natively; I don't have to turn on plugins to get a particular piece of functionality.

Turning on plugins is a hurdle

The fact that what I consider to be core functionality is hidden away in plugins and needs to be turned on manually is an issue. For example, look at this section of the docs for the Google API Python Client:

https://code.google.com/p/google-api-python-client/wiki/BecomingAContributor#Submitting_Your_Approved_Code

A big thing that trips up contributors is that "--rebase" is in a plugin (and I keep forgetting to update the docs to explain that).

Git is fast

So Git is fast, not just "ooh that was fast", but fast as in, "there must have been a problem because there's no way it could have worked that fast". That's a feature.

Branching

Git branches are much smoother and better integrated than MQ. Maybe this is just because I got stuck on MQ and never learned another way to use hg, but the branch and merge workflow is a lot better than MQ.

SSH

In Git ssh: URIs just work for me. Maybe I just got lucky, or was previously unlucky, but I never seemed to be able to pull or push to a remote repository via ssh with hg, and it just worked as advertised with Git.

Helpful

Git is helpful. Git is filled with helpful messages, many of the form "it looks like you are trying to do blah, here's the exact command line for that", or "you seem to be in 'weird state foo', here are a couple of different command lines you might use to rectify the situation". Obviously those are paraphrases, but the general idea of providing long, helpful messages with actual commands in them is done well throughout Git.

Caveats

I'm not writing this to cast aspersions on the Mercurial developers, and I've already passed this information along to developers that work on Mercurial. I am hoping that if you're building command line tools that you can incorporate some of the items here, such as helpful error messages, speed, and robust out-of-the-box capabilities.

2014-07-21

No more JS frameworks

Stop writing Javascript frameworks.

Translations: Japanese

JavaScript frameworks seem like death and taxes: inevitable and unavoidable. I'm sure that if I could be a fly on the wall every time someone started a new web project, the very first question they'd ask is, which JS framework are we using? That's how ingrained the role of JS frameworks is in the industry today. But that's not the way it needs to be, and actually, it needs to stop.

Let's back up and see how we got here.

Angular and Backbone and Ember, oh my.

For a long time the web platform, the technology stack most succinctly described as HTML+CSS+JS, was, for lack of a better term, a disaster. Who can forget the IE box model, or the layer tag? I'm sure I just started several of you twitching with flashbacks to the bad old days of web development with just those words.

For a long time there was a whole lot of inconsistency between browsers and we, as an industry, had to write frameworks to paper over them. The problem is that there was disagreement among browsers even on the fundamental issues, like how events propagate, or what tags to support, so every framework not only papered over the holes, but designed its own model of how the browser should work. Actually its own models, plural, because you got to invent a model for how events propagate, a model for how to interact with the DOM, etc. A lot of inventing went on. So frameworks were written, each one a snowflake, a thousand flowers bloomed and gave us the likes of jQuery and Dojo and MochiKit and Ext JS and AngularJS and Backbone and Ember and React. For the past ten years we’ve been churning out a steady parade of JS frameworks.

But something else has happened over the past ten years; browsers got better. Their support for standards improved, and now there are evergreen browsers: automatically updating browsers, each version more capable and standards compliant than the last. With newer standards like:

I think it's time to rethink the model of JS frameworks. There's no need to invent yet another way to do something, just use HTML+CSS+JS.

So why are we still writing JS frameworks? I think a large part of it is inertia, it's habit. But is that so bad? It's not like frameworks are actively harmful, right? Well, let's first start off by defining what I mean by web framework. There's actually a gradient of code that starts with a simple snippet of code, such as a Gist, and that moves to larger and larger collections of code, moving up to libraries, and finally frameworks:

gist -> library -> framework

Frameworks aren't just big libraries, they have their own models for how to interact with events, with the DOM, etc. So why avoid frameworks?

Abstractions

Well, one of the problems of frameworks is usually one of their selling points, that they abstract away the platform so you can concentrate on building your own software. The problem is that now you have two systems to learn, HTML+CSS+JS, and the framework. Sure, if the framework was a perfect abstraction of the web as a platform you would never have to go beyond the framework, but guess what, abstractions leak. So you need to know HTML+CSS+JS because at some point your program won't work the way you expect it to, and you’ll have to dig down through all the layers in the framework to figure out what's wrong, all the way down to HTML+CSS+JS.

Mapping the iceberg.

A framework is like an iceberg, that 10% floating above the water doesn't look dangerous, it's the hidden 90% that will eventually get you. Actually it's even more apt than that, learning a framework is like mapping an iceberg, in order to use the framework you need to learn the whole thing, apply the effort of mapping out the entire thing, and in the long run the process is pointless because the iceberg is going to melt anyway.

Widgets

Another selling point of frameworks is that you can get access to a library of widgets. But really, you shouldn't need to adopt a framework to get access to widgets, they should all be orthogonal and independent. A good example of this today is CodeMirror, a syntax highlighting code editor built in JavaScript. You can use it anywhere, no framework needed.

There is also the lost effort of building widgets for a framework. Remember all those MochiKit widgets you wrote? Yeah, how much good are they doing you now that you've migrated to Ember, or Angular?

Data Binding

Honestly I've never needed it, but if you do, it should come in the form of a library and not a framework.

The longer term problem with frameworks is that they end up being silos, they segment the landscape, widgets built for framework A don't work in framework B. That's lost effort.

So what does a post-framework world look like?

HTML+CSS+JS are my framework.

The fundamental idea is that frameworks aren't needed, use the capabilities already built into HTML+CSS+JS to build your widgets. Break apart the monoliths into orthogonal components that can be mixed in any combination. The final pieces that enable all of this fall under the umbrella of Web Components.

HTML Imports, HTML Templates, Custom Elements, and Shadow DOM are the enabling technologies that should allow us to cut the cord from frameworks, allowing the creation of reusable elements and functionality. For a much better introduction see these articles and libraries:

So, we all create <x-flipbox>'s, declare victory, and go home?

No, not actually; the first things you need for working with Web Components are polyfills for that functionality, such as X-Tag and Polymer. The need for those will decrease over time as browsers flesh out their implementations of those specifications.

A point to be stressed here is that these polyfills aren't frameworks that introduce their own models to developing on the web; they enable the HTML 5 model. But that isn't really the only need: there are still minor gaps in the platform where one browser deviates in a small way from current standards, and that's something we need to polyfill. MDN seems to have much of the needed code, as the documentation frequently contains short per-function polyfills.

So one huge HTML 5 Polyfill library would be good, but even better would be what I call html-5-polyfill-o-matic, a set of tools that allows me to write Web Components via bog standard HTML+JS and then after analyzing my code, either via static analysis or via Object.observe at runtime, it produces a precise subset of the full HTML 5 polyfill for my project.

This sort of functionality will be even more important as I start trying to mix and match web components and libraries from multiple sources, i.e. an <x-foo> from X-Tag and a <core-bar> from Polymer, does that mean I should have to include both of their polyfill libraries? (It turns out the answer is no.) And how exactly should I get these custom elements? Both X-Tag and Brick have custom bundle generators:

If I start creating custom elements do I need to create my own custom bundler too? I don't think that's a scalable idea, I believe we need idioms and tools that handle this much better. This may actually mean changing how we do open source; a 'widget' isn't a project, so our handling of these things needs to change. Sure, still put the code in Git, but do you need the full overhead of a GitHub project? Something lighter weight, closer to a Gist than a current project might be a better fit. How do I minimize/vulcanize all of this code into the right form for use in my project? Something like Asset Graph might be a good start on that.

So what do we need now?

  1. Idioms and guidelines for building reusable components.
  2. Tools that work under those idioms to compile, crush, etc. all that HTML, CSS, and JS.
  3. A scalable HTML 5 polyfill, full or scaled down based on what's really used.

That's what we need to build a future where we don't need to learn the latest model of the newest framework, instead we just work directly with the platform, pulling in custom elements and libraries to fill specific needs, and spend our time building applications, not mapping icebergs.

Q&A

Q: Why do you hate framework authors?

A: I don’t hate them. Some of my best friends are framework authors. I will admit a bit of inspiration from the tongue-in-cheek you have ruined javascript, but again, no derision intended for framework authors.

Q: You can’t do ____ in HTML5, for that you need a framework.

A: First, that's not a question. Second, thanks for pointing that out. Now let's work together to add the capabilities to HTML 5 that allow ____ to be done w/o a framework.

Q: But ___ isn't a framework, it's a library!

A: Yeah, like I said, it’s a gradient from gist to framework, and you might draw the lines slightly differently from me. That's OK, this isn't about the categorization of any one particular piece of software, it's about moving away from frameworks.

Q: I've been doing this for years, with ___ and ___ and ___.

A: Again, that's not a question, but regardless, good for you, you should be in good shape to help everyone else.

Q: So everyone needs to rewrite dropdown menus, tabs, sliders and toggles themselves?

A: Absolutely not, the point is there should be a way to create those elements in a way that doesn't require buying into one particular framework.

Q: Dude, all those HTML Imports are going to kill my site's performance.

A: Yes, if you implemented all this stuff naively it would, which is why I mentioned the need for tools to compile and crush all the HTML, CSS, and JS.

Q: So I'm not supposed to use any libraries?

A: No, that's not what I said. I was very careful to draw a line between libraries and frameworks: a library provides an orthogonal piece of functionality that can be used with other libraries. Libraries are fine; it's the frameworks that demand 100% buy-in that I'd like to see us move away from.

Q: But I like data binding!

A: Lots of people do; I was only expressing a personal preference. I didn't say that you shouldn't use data binding, only that you don't need to adopt an entire framework to get data binding; there are standalone libraries for that.

2014-05-09

IPython and Curves

So I've been impressed with IPython so far. It can be a little fiddly to install, but given the power I'm not going to complain. I'm now working on graphics, and in trying to get up to speed I'm going back and learning, or relearning, some basics. Today it was Bézier curves, and thus this IPython notebook. Note that the content there isn't actually educational; you should follow the links provided to really learn about these constructs. I just wanted an excuse to try out LaTeX and plot some interesting graphs.

2014-01-25

The shiny parabolic broadcast spreader of vomit

Due to suspected food poisoning I checked into the local emergency room last night around 2 AM, trusty 13 gallon plastic garbage bag in hand, because I've been throwing up. Once they get me into a room the nurse offers me a shallow pink plastic pan in exchange for my plastic garbage bag, and I'm thinking to myself, "Really, have you never seen anyone vomit in your entire life?". Why on earth would you offer me a shallow round bottom plastic pan as an alternative to a 13 gallon plastic garbage bag? This is vomit we're trying to contain here. This reminds me of a previous visit to the same ER with my son when he had appendicitis; that time we came in with a kitchen garbage pail and the nurse laughed at us and handed him a small metal kidney dish. My son held it in his hands, looking at it perplexedly for about five seconds before he started vomiting again, turning it from a kidney dish into a shiny metal parabolic broadcast spreader of vomit.

I don't know what to make of this phenomenon, as I thought that working in an ER would expose you to a lot of puking people, and thus you'd know better than to give someone a kidney dish to throw up in. I can only come up with two possible conclusions, the first being that the ER folks just aren't quick learners and haven't picked up on the association:

kidney dish : vomit
  :: broadcast spreader : grass seed
  

The other possibility is that my family is unique; maybe my wife and I are both descended from long lines of projectile vomiters, a long and honorable tradition of high velocity emesis, and the rest of the population is filled with polite, low volume, low velocity vomiters. If so, you people make me sick.

2013-12-13

Snow

Reilly has been learning Javascript, and one of the projects he wanted to do was a snow simulation. I guess, growing up in the south, snow is a rare and wondrous event for him.

2013-12-08

Internet of Things

So while everyone was losing their minds over the addition of the word twerk to the Oxford Dictionaries Online, I'm actually more upset that the Internet of Things was added with the wrong definition, or at least an incomplete definition.

a proposed development of the Internet in which everyday objects have network connectivity, allowing them to send and receive data

The first definition of the Internet of Things I heard was from Bruce Sterling, and it went beyond just having network connectivity; it was about lifecycle management, tracking objects from cradle to grave. Now I understand that the ODO tracks popular usage of a term, and that literally everyone is using the term wrong, so the wrong definition gets added to the dictionary (did you see what I did there?). I'm OK with the simpler definition being added to the ODO, but it unfortunately means that we don't have a word for the richer definition of the Internet of Things, which is an important concept that we shouldn't lose sight of.

2013-09-04

Platonic Programs

One of the important types of open source that doesn't get talked about is what I call Platonic Programs: small programs that exhibit the base level of functionality for some domain, but no more, and have small, clear, and concise code. These projects have outsized impacts on the ecosystem and yet aren't talked about that much.

For example, look at MicroEmacs, which not only has a string of variants that are all named a variation of "MicroEmacs", but also EmACT, Jasspa emacs, NanoEmacs, mg, and vile. My contention is that there's a plethora of child projects of MicroEmacs not only because it was eventually open source, but because the source code was simple and clean enough, and the base set of functionality small enough, that many people would look at the code and think, "that would be my perfect editor if only they added X, Y and Z, and the code is simple enough that I can see exactly how to add all those features."

I ran into the idea of Platonic Programs shortly after I wrote Robaccia, which was an example of the rudiments of a Python web framework. I was shocked by the uptake, including ports to Groovy and inclusion in academic coursework. I'm not saying Robaccia is a quintessential example of a Platonic Program, but after the experience I had with Robaccia, I started to see similar patterns in other projects.

As for the future, I think Fogleman Minecraft has a lot of potential, it takes a wildly popular paradigm and boils it down to less than 1,000 lines of Python. I just scroll through the code and think to myself, I could add X, Y and Z, and it would be so easy...

2013-07-22

Driving and kids these days.

This NYTimes article on The End of Car Culture very much rings true for me; none of our kids look forward to having a car like we did. It felt like we had to push the oldest one to get his learner's permit. Made even odder by the fact that we don't live anywhere near any sort of useful public transportation.

2013-07-03

Screencast recording under Ubuntu

I'm sure this evolves over time, but for today, the best command-line incantation I found for recording screencasts is:

$ avconv -strict experimental \
    -f x11grab -r 25 -s 1024x768 -i :0.0 \
    -pre medium -f alsa -ac 2 -ar 22050 \
    -i pulse cast.webm

2013-07-02