Analyzing Presentation Traffic

(... or how to use Digital Signal Processing on your log files ...)

One of the benefits of writing your own presentation software is that you can do web-friendly things like giving each and every page of your presentation its own URI. That lets lots of cool stuff happen, like allowing search engines to spider the content and allowing people to link directly to a slide in the deck. It also lets me monitor how people look at the slides by just looking at my traffic logs. Now the first easy metric is just how many people have looked at the slides, but I got curious and wanted to see how many people actually made it all the way through to the end of my presentation.

Here is a simple Python program that loads the server logs and counts the number of hits on each page of the presentation, printing the results as a comma-separated list.

import re

# Match requests for numbered slide pages, e.g. "GET /projects/cascon06/42.html".
slide_regex = re.compile(r"GET /projects/cascon06/(\d+)\.html")
hits = [0] * 130

def analyze(filename):
    with open(filename, "r") as log:
        for line in log:
            match = slide_regex.search(line)
            if match:
                index = int(match.group(1))
                if 0 < index < 130:
                    hits[index] += 1

    # Skip hits[0] and hits[1] when printing.
    print(",".join(str(i) for i in hits[2:]))

analyze("20061018.log")

Note that the code doesn't report the first two values in hits. That's because there is no slide 0.html and 1.html has index.html as an alias. I got a copy of the logs around 11AM yesterday and the output of the program looks like this:

   217,206,201,211,200,195,185,187,185,180,178,175,175,179,
   181,178,176,178,177,175,172,176,175,167,163,162,166,166,
   164,163,160,157,154,150,148,144,142,141,139,138,137,143,
   146,141,138,137,134,140,141,140,140,133,133,127,125,137,
   135,121,120,121,122,121,121,120,120,120,123,124,123,120,
   123,125,122,120,120,123,122,122,125,126,125,124,125,123,
   120,121,121,119,124,128,124,121,121,120,125,126,118,118,
   117,111,112,108,100,103,102,99,99,97,97,100,103,106,99,97,
   96,96,97,98,98,96,96,96,94,95,95,97,104,96

If you look at the first and last numbers in the list, 217 and 96, you can see that a little under half (about 44%) of the people who started actually made it all the way to the end. I'm pretty impressed with that.

I expected the data to be perfectly monotonic, always decreasing as you went further and further into the presentation, but that 104 as the second-to-last value points to something else going on. Let's graph the data. Luckily the data is perfectly formatted for the sparkline generator.
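If the sparkline generator isn't handy, a rough text-only stand-in is easy to sketch. The snippet below is not that generator; `sparkline` and `BARS` are names made up for this illustration, and all it does is map each count onto one of eight Unicode block characters. It's shown here on the first fourteen counts from the output above.

```python
# Hypothetical helper, not the sparkline generator mentioned above.
BARS = "▁▂▃▄▅▆▇█"

def sparkline(values):
    """Return one block character per value, scaled between min and max."""
    lo, hi = min(values), max(values)
    span = (hi - lo) or 1  # avoid dividing by zero when all values are equal
    top = len(BARS) - 1
    return "".join(BARS[int((v - lo) / span * top)] for v in values)

# Slides 2 through 15, copied from the first row of the output above.
first_row = [217, 206, 201, 211, 200, 195, 185, 187, 185, 180, 178, 175, 175, 179]
print(sparkline(first_row))
```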

Instead of being monotonic, the data does appear to have some bumps, like the little jump at the second-to-last slide. That slide contains the source code to the presentation, so it may have gotten a second look, i.e. it was more 'interesting' and people came back to it.

Let's find all those bumps and see what other slides count as 'interesting'.

What I'd like to do is process that data so that the bumps become pronounced. One way to do that is to take every set of 3 adjacent points and calculate:

-a[n-1]/2 + a[n] - a[n+1]/2

Now you can look at that as the (negative) acceleration at each point, or you can view that as convolving the sample array with the filter (-1/2, 1, -1/2) and wander off into Digital Signal Processing territory, but either way you look at it the monotonic behavior will tend to zero (a steady decline gives exactly zero), and the bumps will not.
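To see why, here is a small sketch on made-up numbers (the two sample lists and the `bump_weights` name are illustrative, not taken from the logs). A steady decline comes out as all zeros, while a single slide that gets extra hits stands out as a positive spike.

```python
def bump_weights(a):
    """Apply the (-1/2, 1, -1/2) filter to every interior point of a."""
    return [-a[n - 1] / 2 + a[n] - a[n + 1] / 2 for n in range(1, len(a) - 1)]

steady = [100, 90, 80, 70, 60]   # made-up counts falling by 10 per slide
bumpy = [100, 90, 95, 70, 60]    # same, but the third slide got 15 extra hits

print(bump_weights(steady))  # [0.0, 0.0, 0.0]
print(bump_weights(bumpy))   # [-7.5, 15.0, -7.5]
```

Note that the neighbours of a bump go negative, so it's only the largest positive values that are worth looking at.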

So if we update our program to do the above calculation it now looks like:

import re

slide_regex = re.compile(r"GET /projects/cascon06/(\d+)\.html")
hits = [0] * 130

def analyze(filename):
    with open(filename, "r") as log:
        for line in log:
            match = slide_regex.search(line)
            if match:
                index = int(match.group(1))
                if 0 < index < 130:
                    hits[index] += 1

    # Each tuple is (i, hits[i], hits[i+1], hits[i+2]); zip stops at the shortest list.
    prefilt = zip(range(len(hits)), hits, hits[1:], hits[2:])

    # Apply the (-1/2, 1, -1/2) filter; the weight belongs to the middle point, i+1.
    filt = [(-a/2.0 + b - c/2.0, i+1) for (i, a, b, c) in prefilt]

    # Largest weights first.
    top = sorted(filt, reverse=True)

    print("%6s %6s" % ("Page", "Weight"))
    for (weight, index) in top[1:8]:
        print("%6d %6.1f" % (index, weight))

analyze("20061018.log")

You'll note that we skip the first entry in the sorted list, as that always turns out to be slide number two, which is just an artifact of index.html being an alias for 1.html. The output of our program looks like:

  Page Weight
     5   10.5
   128    7.5
    57    7.0
    58    6.0
   113    5.0
    97    4.5
    91    4.0

The 'interesting' pages include the jumbled words from Cambridge, the source code to the presentation, the laws of simplicity, and the assertion that simple means the underlying technologies are close to the surface. If those pages really were the 'interesting' ones, then I'd say my little analysis program, and the presentation in general, were a success.