02 Feb 2020, 14:22

Somewhere between 8 and 11 years without boinkmarks

About 8 years ago, I turned off and carted away the server running boinkmarks (aka autobench), the benchmarking and performance tracker for SBCL. It seems like its last planet.sbcl.org entry was around 2009-04-11.

At the time, it had run benchmarks on every revision1 of SBCL since 2004 - I’d been running it for 8 years, as long as I now have not run it.

Time for a retrospective!

In the 8 years (2004 - 2012) that it did run, I feel that it provided a pretty useful service to the community. In a world where Travis CI was still 6 years away on the day boinkmarks was announced, it was effectively a CI service for this one, very specific CL compiler.

What did it do well?

Autobench did a whole bunch of things for every “commit” – back then, as imported into the git repo from CVS – building the manual, checking that the compiler itself could be built with at least one older version of itself, and running its performance suite.

Autobench then stored those benchmark results in a PostgreSQL database and provided a web frontend that visualized the history of each test.

There was also an RSS feed that computed the “significant” changes between one revision and the previous one - the intent was to make performance regressions easy to spot.
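To illustrate the idea (this is a sketch of the general approach, not autobench’s exact criterion): a change got flagged when the difference between two revisions’ means was large relative to the spread of the measurements.

(defun significant-change-p (old-mean old-stddev new-mean new-stddev
                             &optional (threshold 3))
  ;; Flag a change when the means differ by more than THRESHOLD
  ;; standard deviations of either run.
  (> (abs (- new-mean old-mean))
     (* threshold (max old-stddev new-stddev))))

;; e.g. (significant-change-p 1.20 0.02 1.35 0.03) => T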

Autobench alerted me to build failures via email - usually, that was due to an environmental condition (it ran on my personal “server”, and that was before configuration as code), but when anything pointed to a build failure in the source code, I either tried to fix it myself, or alerted the other committers.

Quite importantly to me, autobench didn’t operate on a “daily” or other timeslotted granularity, where it would build each day’s HEAD revision. It built every commit, and it benchmarked every commit. That made it much easier to figure out which commit broke a build or caused a performance regression.

Over time, SBCL-the-compiler (which compiled SBCL-the-codebase into SBCL-the-compiler) got much faster - both at compiling itself (12-20 minutes when I started, 7 minutes when the machine got wrapped up) and in the code it generated. I don’t know if that is directly attributable to the performance regression suite (see below), but I’d like to think it helped highlight the regressions that the suite tested for!

What was not so great?

Despite all my attempts at making it generic, autobench was extremely specific to the target – SBCL – and to the benchmark suite, CL-BENCH (so much so that I included a copy of that library in autobench’s source code itself). Forget about using it to track the performance of anything that wasn’t a CL compiler… and not just any CL compiler, but one that could be bootstrapped with a simple command-line invocation.

While I eventually got it to build CLISP (another Common Lisp implementation), the automatic builds for it never reached the point where they could run unsupervised, let alone finish running its benchmarks2.

Autobench (and CL-BENCH) had very little idea of how to “do” benchmarks in a way that’s statistically meaningful (you can see some background on that in this delightful paper):

  • The benchmarks themselves mostly tested changes in garbage collector speed, or method call performance.
  • While I did my best to “echo scientific convention” – running them 3 times & working with standard deviations – it took several years until I finally started increasing the iteration counts on each benchmark so that they ran for several seconds each… eliminating a whole ton of noise.
  • While the machine that the benchmarks ran on didn’t have a ton to do, it certainly wasn’t the quietest neighbor to the benchmarking task itself. These days, I would probably again try to run on “bare metal”, but also eliminate things like cron and other shared-tenancy systems.

What has changed since autobench got turned off?

Git and other “meaningful changeset” version control systems are the norm now. It’s hard to imagine from today’s perspective, but many projects that were around for a long time in 2010 were still using (or had just stopped using) CVS. Nowadays, you can bisect history! Or look at the “previous commit” and compare different points in history, along semantic boundaries. Pure luxury!

CI is the norm for most open-source projects now. Thanks to Travis CI and the many other CI systems out there that are free for OSS projects, we’ve normalized the idea that software should (a) compile and (b) run its tests! That’s pretty huge.

The pull request type workflow is now mainstream. Automation runs regression tests on a proposed change and will sometimes even block it from going into the main codebase if a test should fail.

Test coverage seems to be, if not the norm, then at least a stretch goal for projects now. Since codecov is free for OSS projects too, there are several that put a coverage badge on their homepage.

There now exist very high-quality, mainstream libraries for measuring performance! The criterion crate is one extremely good example - it’ll automatically find the number of iterations a benchmark needs to run, perform quite sophisticated statistical analysis, and offer some great ways of comparing the current state to previous baselines.

Thoughts

From Nelson’s incredibly good post on why Sorbet is fast, I take one thing away: Software starts at one speed and it stays fast by avoiding performance regressions. (And sometimes, it even gets faster!)

The default way a developer interacts with software they work on is by making it slower: New features make programs more complex and usually consume more time to execute and render, processing more data takes more time, bugs creep in that don’t affect the correctness but do affect performance. We just don’t notice when we make things worse by making them slower.

Right now, it feels like software’s performance is still treated as an externality to development, like “can build” and “can run tests” were, 15 years ago. There is little in the way of automated systems in place (in OSS, at least!) that track or even alert on performance regressions. Projects mostly rely on users to feel pain & to report if performance gets worse. And when performance gets worse, it’s hard to track down where the slowdown happened.

Even Sorbet’s main method of performance regression testing was to run the test suite on something big and rely on its developers to notice when tests took too long!

Projects do exist that track performance history:

These are the exception rather than the norm, though: it’s not a coincidence that these are all pretty large projects with a decently large volunteer pool. Benchmarking (especially benchmarking something that changes rapidly in a reasonably stable environment) takes work. And analyzing that performance data, too, takes work.

This rOpenSci issue puts it quite succinctly: few people have an answer to the equation “codecov.io is to code coverage as ??? is to benchmarking”. There are very few mainstream tools I know of that will take benchmark data and perform the kind of rudimentary analysis autobench did. Asking around, two systems stand out: Python’s airspeed velocity (which has a pretty cool demo site) and codespeed. I haven’t tried either yet; at least asv seems to cover autobench’s feature set, and the demos look good! But not everyone has the time/patience/DigitalOcean credits to host a Python webapp somewhere… the way you get a concept to the people is by running it “as a service”, free for OSS projects.

What I wish would happen is that somebody3 built & productized that “codecov for performance data” service. Forget about the actually-running-the-benchmarks problem for now4: ingest a reasonable format for benchmark data into a database, put a mostly-OK-looking UI around it, and hook into GitHub’s commit/pull request check workflow. If you built this, I think people would love it.

And, I think best of all, you’d help build software that is faster, whose maintainers are confident that it’s not getting worse.


  1. At the time, SBCL’s canonical source code repository was sourceforge’s CVS server. “Revisions” corresponded to an increment of the version number in /version.lisp-expr, using the same commit message for that file and the rest of the code, mentioning that version number. If you don’t know what that means, count yourself lucky! You escaped CVS and live in the world of meaningful history now! ↩︎

  2. The other problem in getting something with less “commit discipline” than SBCL built was that I had to run a CVS->git gateway. I’m definitely blaming the grey hairs I found when I turned 25 on that gateway. ↩︎

  3. I wish I had the time to build this. But, alas! Fulltime employment! ↩︎

  4. People who care enough about consistent results will need to run their own hardware; others, doing casual analysis for their OSS projects (like me!) can run their benchmarks on a free CI host and learn to deal with variance in the milliseconds, I guess. ↩︎

09 Sep 2012, 00:00

Some more updates

So I’ve been moving stuff off my 6-year-old server to a machine hosted in Germany lately. I hope to bring back Boinkmarks on it some day soon. (Not in the way I brought back the git repos, though - no outsourcing for benchmarks!) (-:

There are a couple state changes in my projects that would not warrant a blog post on their own, but I think as a whole are still something to write home about:

  • I’m retiring the Jofr.li web site - it pretty much got obsoleted at birth by Twitter’s URL shortening thing, and it was just a finger exercise anyway. You can still peek at the source if you want to get a feel for how I think a Lisp Redis-backed Hunchentoot app could be structured.

  • The CXML-RPC library’s server part should now work in the newest Hunchentoot. In the process, I think I found a bug in CXML’s Klacks parser under CCL - it fails to decode &# entities.

  • In not so very lisp-related news, I am learning that keeping your server’s configuration in Puppet is a really great thing. I’d never done this for my own machines up until this point, but it definitely helps to have all the state re-bootstrappable in one repository. Makes it way easier to reason about system configuration (“Now where are these vhosts’ www root directories again? What, they were behind a bind mount? What was I thinking?!” - these moments are severely reduced when you can just look at git log output).

And lastly, Dan Weinreb died. I wish I had had more opportunities to work with him, chat with him and learn from him.

09 Sep 2012, 00:00

Git lives again - somewhere else

I’ve revived the git repos affected by this outage:

  • the cvs->git conversion is now alive again, and the repos there are now kept on github.com.

Turns out there are only two more CVS repos left that I was converting to git: McCLIM and SLIME. So, they’re online again, and I hope you still find them useful.

If you are missing any repos that I forgot to move, please send me a note. I should have extensive backups of everything, so restoring anything that’s missing probably isn’t a problem. (Famous last words, hah!)

24 Aug 2012, 00:00

git.boinkor.net outage

I’m currently moving some of boinkor.net’s services off the creaky old machine that used to host it, over to another machine. This affects git.boinkor.net - it’s not going to be available for the next 2 days. (With a bit of luck, it may be back up a little sooner, though.)

This probably affects you if you follow the SLIME and McCLIM git repos hosted there.

This was caused by a case of really bad planning on my part. Sorry for the inconvenience.

30 Dec 2011, 00:00

Converting the Mustache Test Suite to CL

Matthew Snyder has a great introductory post on his blog where he converts the Mustache spec into a runnable fiveam test suite. Very cool stuff - I hope he posts more (-:

21 Dec 2011, 00:00

IDNA Now Supports Punycode Decoding

My IDNA library now supports decoding IDNA strings via the to-unicode function:

(to-unicode "xn--mller-kva.example.com")
;; =>  "müller.example.com"

That’s in addition to the regular encoding for unicode domain names:

(to-ascii "müller.example.com")
;; => "xn--mller-kva.example.com"

Sadly, I haven’t managed to get the logic for case-sensitive punycode encoding to work yet. But fortunately, IDNA domain name encoding doesn’t require that! Anyone looking for some low-hanging fruit-shaped lisp projects is welcome to add that! (-:

08 Oct 2011, 00:00

Accessing the Stripe API From Lisp

Stripe is a new payment processor on the Web, and they seem to be a lot less insane than PayPal. On a whim, I made a little (almost completely untested, toy) CL library for accessing their HTTP API from Lisp. Maybe you’ll find it useful: cl-stripe.

This was pretty great fun! Thanks to their nice HTTP API, drakma, and alexandria, I have been able to write this with a minimum of horribly hacky code, in just 5 or 6 hours of working on it, on and off, this Saturday afternoon.

If it still looks like fun, I think I may add some clucumber tests to it tomorrow. Stay tuned.

05 Oct 2011, 00:00

A Weird Problem With ASDF on NFS (and a workaround)

Recently, we at Franz have been seeing weird failures when building a certain ASDF system on NFS: We were intermittently getting redefinition warnings when loading a system - not all the time, but more often when we compiled during certain timeslots.

This was a weird one, and I think it would be nice to record what it was, how we figured out what was going on, and how we arrived at our workaround.

Update 2011-10-11: Faré informs me that this problem is fixed (all the way, no need for a workaround) in ASDF 2.017.9!

The Symptom

We have a system and a specialized operation (concat-op, it generates a single loadable .fasl we can ship) that depends on asdf:load-op. In our build process, we first load the system, then generate some data with the loaded system, and then perform the custom operation on the system.
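For illustration, here is roughly how such an operation can be declared so that ASDF knows about the load-op dependency (a minimal sketch in ASDF 2 style, not our actual build code):

(defclass concat-op (asdf:operation) ())

;; Performing CONCAT-OP on a component requires LOAD-OP on that same
;; component, in addition to whatever the default method declares.
(defmethod asdf:component-depends-on ((op concat-op) (c asdf:component))
  (cons (list 'asdf:load-op (asdf:component-name c))
        (call-next-method)))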

When performing that operation with a source tree that was checked out on NFS, the load-op that it depends on sometimes got performed a second time: The lisp loaded all the .fasls again, and for some constructs in some .fasls, signaled a WARNING, which made the build break.

Oddly enough, the failure happened only during certain time slots - we would see the build work between 3pm and 4:30pm, and starting at 4:30 it failed consistently until it was time to go home. Huh.

Aside: How ASDF Operates

Not everyone might be familiar with how ASDF works (if you are, feel free to skip to the next section, or stay and nitpick (-:), so here’s a small primer on what happens when you type (asdf:load-system :some-system):

  1. ASDF runs the generic function traverse with the system and the operation as parameters.

  2. traverse walks the dependencies of the system and the contents of the system itself, and determines which operations are not yet done.

    For a load-op on a CL source file, traverse will try to generate a load-op for the input-file of that load-op (the .fasl file), check if that .fasl file exists, and if it doesn’t, then it will also generate a compile-op for the corresponding .lisp file.

  3. As a result, traverse returns a list of operations that must be performed on each component (or module, or system). For a clean source tree, that list looks something like: ((compile-op . source-1) (load-op . source-1) (compile-op . source-2) (load-op . source-2) …)

  4. operate takes that list and just performs each operation on its component in order.

All this means that ASDF takes a two-step approach: It first determines what needs to be done, then does it. All the smarts in ASDF are in that traverse operation and the underlying mechanisms. The rest is just a dolist.
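Roughly, that boils down to this sketch (simplified: traverse is an internal, unexported function, and the real operate wraps the loop in error handling and retries):

(defun operate-ish (operation-class system-name)  ; hypothetical name, for illustration
  (let* ((op (make-instance operation-class))
         (steps (asdf::traverse op (asdf:find-system system-name))))
    ;; STEPS is the ((operation . component) ...) plan described above.
    (loop for (step-op . component) in steps
          do (asdf:perform step-op component))))

;; e.g. (operate-ish 'asdf:load-op :some-system)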

OK, with that out of the way:

The Hunt

I’d gotten this error before, but that was when I was running on a source tree checked out on an NFS-mounted file system on Windows. I didn’t pay it much mind, because, hey, it’s the NFS client on Windows.

But then this exact same problem started happening to a client using two Linux machines as the client and the server. We had a problem.

At first, we suspected that there was an issue with build order (that result list of traverse). This was a blind alley: The files were loaded in exactly the same order in the failing and working scenarios. No luck.

The next thing was to instrument operation-done-p before performing the operation, and there we saw what happened: operation-done-p reported that load-op had not been performed on a file. But that file had been loaded into this very same image just minutes before! Huh?
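The instrumentation itself was nothing fancy; a plain CL trace is enough to watch its verdict for every component. A sketch of the idea (not our exact setup):

(trace asdf:operation-done-p)
;; every call now logs its arguments (operation and component) and its
;; return value to *trace-output* while the build runs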

operation-done-p is a generic function and has a method that attempts to handle the most common cases of operations on ASDF components: the method specialized on (operation component), which does the following in the branch that applies to load-op:

(defmethod operation-done-p ((o operation) (c component))
  (let ((out-files (output-files o c))
        (in-files (input-files o c))
        (op-time (component-operation-time o c)))
    (flet ((latest-in ()
             (reduce #'max (mapcar #'safe-file-write-date in-files))))
      (cond
        ;; ...[cut some branches]

        ((not out-files)
         ;; an operation without output-files is probably meant
         ;; for its side-effects in the current image,
         ;; assumed to be idem-potent,
         ;; e.g. LOAD-OP or LOAD-SOURCE-OP of some CL-SOURCE-FILE.
         (and op-time (>= op-time (latest-in))))

         ;; ...[some more branches here]
         ))))

This consults a registry of times when an operation was performed on a component: component-operation-time returns a universal-time (a number of seconds), which gets compared to the file-write-date - also a universal-time - of the input file (the .fasl). After some tracing, we determined that for some reason, the .fasl file was one second younger than the time that ASDF thought the load-op had been performed on it. In other words, the compiler had written the file AFTER load had had a chance to read it. ASDF was reading a file from the future.

This was the time when we started scratching our heads.

First, we wrote a little test program to verify we weren’t crazy:

#include <stdio.h>
#include <fcntl.h>
#include <unistd.h>
#include <stdlib.h>
#include <sys/time.h>
#include <sys/stat.h>

int main(int argc, char **argv) {
  int fd;
  struct timeval tv1, tv2;
  struct stat sb1, sb2;

  char buf[1024];

  while (1) {
    gettimeofday(&tv1, NULL);
    printf("%u.%u\n", tv1.tv_sec, tv1.tv_usec);

    fd=open(argv[1], O_WRONLY|O_CREAT|O_TRUNC, 0660);

    write(fd, buf, sizeof(buf));
    write(fd, buf, sizeof(buf));
    write(fd, buf, sizeof(buf));

    gettimeofday(&tv1, NULL);
    stat(argv[1], &sb1);
    close(fd);
    gettimeofday(&tv2, NULL);
    stat(argv[1], &sb2);

    /* never seems to be triggered */
    if (sb2.st_mtime != sb1.st_mtime) {
      printf("mtime changed between last write (%d) and close (%u)\n",
         sb1.st_mtime, sb2.st_mtime);
      exit(1);
    }
    if (sb1.st_mtime > tv1.tv_sec) {
      printf("mtime after last write has a future timestamp (%u > %u)\n",
         sb1.st_mtime, tv1.tv_sec);
      exit(1);
    }
    if (sb2.st_mtime > tv2.tv_sec) {
      printf("mtime after close has a future timestamp (%u > %u)\n",
         sb2.st_mtime, tv2.tv_sec);
      exit(1);
    }
  }

  return 0;
}

When we ran it, after a while, at timestamps very close to the boundary to the next second, we’d get “mtime after close has a future timestamp”. What. The.

We checked that all machines were synchronized with NTP. They were, to the same machine on the local network. What is going on?

Luckily, my colleague Ahmon has a lot of experience with NFS. His expertise and ample use of tcpdump provided the final puzzle piece: NFS protocol version 3 on Linux has a feature called “weak cache consistency”: information can be supplied by the server after most NFS calls (e.g., WRITE), carrying the server’s take on file attributes (such as mtime). So if the time on the server is just a tiny bit ahead of the client’s, the server will report that the file the client just wrote is from the client’s future.

When one apparently time-traveling file appears in the source tree, the traverse method will consider the system to not have been loaded, and will reload the .fasl files starting at the time-traveling file. Anything after that file in the build order could (and did!) potentially mess up the lisp image. In the best case, it would just slow down the build a lot by re-loading a ton of .fasl files. Argh.

Fixing this Mess (aka, the Workaround)

Since ASDF consults a registry of times that a file was loaded, we decided it would be easiest to alter the method that records this timestamp: Instead of the current time, it should record whichever is later: the current time or the timestamp of the file that it loaded.

(defmethod perform :after ((operation load-op) (c component))
  (setf (gethash (type-of operation) (component-operation-times c))
    (reduce #'max (cons (get-universal-time)
                        (mapcar #'safe-file-write-date (input-files operation c))))))

And that’s it - with this method in place, ASDF can now accurately build our system repeatedly, on NFS, even if wtf.c (the little test program above) triggers.

Lessons Learned

That was a pretty fun afternoon spent debugging our build process. As a result, we got a working build, and a few shiny new ideas in our heads:

One, a program should never rely on the system time and some file’s creation time being comparable. This just doesn’t work anymore in a distributed system, especially if you’re using full seconds to represent time.

Two, ASDF is pretty flexible (almost to the point of being too flexible). To diagnose ASDF’s internal state, all we had to do was trace some functions it defines, and we managed to put this workaround in without having to deeply modify any of its sources: All it takes is an additional :after method. Sweet.

And three, the Allegro CL fasl loader is very fast (at least it feels so to me, coming from SBCL): In that tiny window (less than 0.07 seconds of real time) it would load a pretty substantial .fasl file and asdf would register it as loaded. That’s pretty impressive! (-:

30 Sep 2011, 00:00

New Lisp Tips Blogs

As of today, there are two new Lisp tips blogs on the web: Common Lisp tips by Xach and SLIME tips by Stas. Both already have some nice stuff that I didn’t know about, so I hope they keep the tips coming!

27 Sep 2011, 00:00

"Basic Lisp Techniques" for sale on Amazon

I just saw that David Cooper’s Basic Lisp Techniques is now available on Amazon, as a Kindle edition, for $9.95.

Problem is, this book is freely available on the Franz web site, it doesn’t seem to be an authorized conversion, and judging from the free sample, the Kindle edition is a slightly crappy PDF-to-AZW conversion (footnotes didn’t convert properly, the ToC is ugly). I recommend using the free PDF (you can even convert it yourself using Calibre if you want the book on an e-reader; it’s not hard at all to get very readable results).