Sunday, July 29, 2007

Benchmarking

Every now and then the question pops up: "Why is 4tH so slow?" Well, 4tH isn't slow. It's twice as fast as Python and outruns most other C-based Forth compilers, with the obvious exception of gForth. Anton Ertl has done a great job of documenting gForth, and it gives you great insight into how he has been able to achieve this. First of all, gForth is bound to the GCC family of compilers, and its creators use that to their advantage. They examine the code that is generated and how GCC optimizes it. They do this not only for a single architecture, but for several.

4tH is very different. It tries to be as generic as possible, so most compilers are able to compile it properly, even those of twenty years ago. You can check the code some compilers generate on some platforms, but not all of them. And that is the main problem. As "Advanced C: Tips and Techniques" illustrates, the same source does not result in similar behavior on different platforms and with different compilers. Some say that pointer access is always faster than array access. "Advanced C" proves them wrong.
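To illustrate the point (a minimal sketch of my own, not an example from "Advanced C"): these two loops do exactly the same work, one with array indexing and one with pointer arithmetic. Which is faster, if either, depends entirely on the compiler and the platform; many compilers emit identical code for both.

    #include <stddef.h>

    /* Array-style access: the compiler computes a[i] each pass. */
    long sum_indexed(const int *a, size_t n)
    {
        long total = 0;
        size_t i;
        for (i = 0; i < n; i++)
            total += a[i];
        return total;
    }

    /* Pointer-style access: walk a pointer across the array. */
    long sum_pointer(const int *a, size_t n)
    {
        long total = 0;
        const int *p;
        for (p = a; p < a + n; p++)
            total += *p;
        return total;
    }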

The only way you can actually improve the performance of a compiler like that is by inventing a much cleverer algorithm, e.g. by reducing function calls, eliminating checks or caching results. Today, I experimented a bit with a special GCC construct, the computed goto. To my surprise, I gained very little. Wait a minute, I did this several years ago, using the same C compiler. Why doesn't it work now?
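For those who haven't seen it: computed goto is a GCC extension ("labels as values") that lets an interpreter jump straight to the handler of the next token instead of going back through a switch(). A minimal sketch with made-up token names follows; this is not the actual 4tH inner loop.

    #include <stdio.h>

    /* Token values double as indices into the dispatch table. */
    enum { HALT, INC, DEC };

    static int run(const int *code)
    {
        /* GCC extension: && takes the address of a label. */
        static void *dispatch[] = { &&do_halt, &&do_inc, &&do_dec };
        const int *ip = code;
        int acc = 0;

        goto *dispatch[*ip++];          /* jump to the first handler */

    do_inc:  acc++; goto *dispatch[*ip++];
    do_dec:  acc--; goto *dispatch[*ip++];
    do_halt: return acc;
    }

    int main(void)
    {
        int program[] = { INC, INC, DEC, HALT };
        printf("%d\n", run(program));   /* prints 1 */
        return 0;
    }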

Well, I'd forgotten to throw in an extra instruction pointer. When I did that, the results were as expected. I still didn't like the source, though. It was a mess. Then I got an idea: why not just apply the extra instruction pointer to the general source? I did, and to my surprise I achieved almost the same results. The computed goto hadn't done the trick; the extra instruction pointer had. Unfortunately, I can't discard the indexed instruction pointer completely, so after each instruction two instruction pointers have to be incremented instead of one. This works for GCC, but will it work for TurboC, LCC, Pelles C or XL C? It may even slow them down.
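In outline the idea looks like this (a sketch with invented names, not the actual 4tH source): the inner loop fetches through a plain pointer, which most compilers can keep in a register, while the old index is kept alongside it because other parts of the interpreter still need it.

    enum { HALT, INC, DEC };

    static int interpret(const int *code)
    {
        const int *ip = code;   /* extra pointer: cheap fetch      */
        int pc = 0;             /* indexed pointer: still needed   */
        int acc = 0;            /* elsewhere (branches, checks)    */

        for (;;)
        {
            switch (*ip)        /* fetch through the pointer       */
            {
            case INC:  acc++; break;
            case DEC:  acc--; break;
            case HALT: return acc;
            default:   return -1;   /* unknown token */
            }
            ip++; pc++;         /* increment both each instruction */
        }
    }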

Even though it may seem that C compilers produce comparable code, the opposite is true. A switch() statement is transformed into a jump table by GCC and TurboC. XL C produces something comparable to a long "if..elif..elif..endif" construction. In order to accommodate XL C, the most frequently used tokens are placed at the beginning of the switch() statement. That's what makes performance such a difficult issue. What works for one compiler doesn't work for another. That's why more and more products, like the Linux kernel itself, are bound to one particular compiler or even one particular version of a compiler.
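A sketch of what that accommodation looks like, with token names and frequencies invented for the example: if the compiler lowers the switch() to a chain of comparisons, the cases at the top are tested first, so the hottest tokens should lead.

    enum { TOK_LIT, TOK_ADD, TOK_DEBUG };   /* assumed frequency order */

    int handle(int token)
    {
        switch (token)
        {
        case TOK_LIT:   return 1;  /* most frequent: one test     */
        case TOK_ADD:   return 2;  /* common: two tests            */
        case TOK_DEBUG: return 3;  /* rare: three tests, no harm   */
        default:        return 0;
        }
    }

For GCC and TurboC the ordering is irrelevant, since a jump table reaches every case in the same number of steps; it only pays off where the compiler emits a compare chain.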

But there are more qualities related to source code than just raw speed. It has to be easy to read and to maintain. The recent speedup in 4tH complicated matters for developers; the previous source code was much easier to extend and modify. The computed goto would complicate matters even more, and for what? A 1% speedup for a particular compiler family. The first thing I did when I created 4tH was to set down the requirements. Speed was one of them, but only if it didn't violate other design objectives.

IMHO speed is overrated. Benchmarks execute hundreds of millions of the same instructions, taking ages to finish. In real life that is almost never the case, except for some pretty processor-intensive jobs you'd never dream of writing in a scripting language. It is another example of how artificial most benchmarks are, and it explains why even significant improvements almost never translate fully into everyday use. Telling an interpreter it is slow makes just as much sense as telling a bicycle it is slow. Although that is true, millions of Dutchmen find good reasons to use one every day. If you need to go a long way, you simply won't use a bicycle. If an interpreter is too slow for a job, you grab a compiler. It is as simple as that.

Still, many jobs don't require that much raw speed. I designed a lot of batch processing programs in 4tH. Some take about a minute to finish; that could be better. However, since the rest of the job takes another ten minutes, it won't change the scheme of things if I shave half a minute off its execution time, exchanging a perfectly reliable and proven program for a new one that may contain bugs. All that development effort, all the time it takes to test it, and for what? Half a minute of execution time per week.

Not every program is created equal. Some are there to run only once and solve a particular problem. E.g. if I want to permanently convert a mailbox, it won't matter too much whether it runs five minutes or half a minute. Whether a cron job runs ten seconds a day or one second doesn't make much difference either. A program running 0.01 seconds is almost indistinguishable from one running 0.1 seconds. In the ten years I've been using 4tH, I've found I hardly ever need to grab a C compiler anymore. The upside is that I'm able to create programs much faster and much more reliably, especially where parsing and text conversion are required.

People know that. Modern scripting languages like Perl, Ruby, Python and PHP are much slower than 4tH, but benchmarking is rarely performed. Some will say those languages can be sped up with Parrot and its equivalents. The truth is that few ever do. How many people do you know who have converted all their Perl scripts to Parrot? The answer is: for most jobs you do not need to. Fast is fast enough.

Computers are thousands of times faster than 20 years ago. That means a job you could once only perform using a high-speed compiler can now be done by a scripting language. That may explain the recent rise of the scripting languages. I used Z80 assembly in the days of the Sinclair Spectrum for three reasons: first, interpreters were highly underpowered; second, you didn't have much room to play around with; and third, compilers took too much memory to be useful. The best you could do with them was display "Hello world" on the screen in a very speedy way. When the IBM PC came along, I took refuge in C. It was versatile and boosted my productivity. And now we are here, in the age of the scripting languages.

Those who are still focused on raw speed are the nerd equivalent of the guys tinkering with automobiles. Nowhere, except in Germany, can you go over 65 MPH. Still, they keep trying to find out how far current technology can take them. You won't be able to get your groceries any faster, but that isn't the objective of it all. The problem for us developers is that we're trying to please. But a 10% or 50% speedup, no matter the technical feat in itself, is not what these people are looking for. They are going for two or three times faster, something that is very hard to achieve, all the more when there are other requirements you have to satisfy.

Sometimes the solution is very easy: just get a better compiler. The first and biggest speedup I achieved came when I switched from TurboC to DJGPP. After that, 4tH ran 2.5 times faster. But nowadays high-quality compilers are commonplace. Note that even a high-quality compiler like Intel's cannot speed up GCC-compiled programs by more than about 15%, tops.

I still think that most Open Source developers are excellent. They listen very carefully to their users and try to comply. But stay true to your design objectives and carefully balance the needs of your audience. You may be able to satisfy the wishes of the speed demons, but always consider what qualities you trade in. It may be that you have to let go of the users of a certain platform, since there is no compiler left at your disposal that compiles your source. You might have to make your source code harder to read and understand, and lose those who loved to tinker with your program. Your speedup may even cause a slowdown with other compilers.

A drag racer goes very fast, but I prefer an Audi A4 cabriolet to cruise along the French coastline. It is so much more comfortable.
