Samuel Williams Saturday, 09 May 2009

Runtime polymorphism is typically the usage of function pointers to provide an implementation for functionality that is not known at compile time. It is also often used to simplify the structure of a program, by using common interfaces to expose specific functionality.

There are a number of purposes for runtime polymorphism:

The cost of run time polymorphism is great. It can be reduced, but the best way to reduce the cost of runtime polymorphism is to avoid it all together. However this is not always an option. It is often the right tool to use when approaching dynamic functionality, as it can simplify code structure, reduce coupling and improve code localization.

The cost of dynamic dispatch

Dynamic dispatch can cost quite a bit, depending on your perspective.

I've prepared a sample application which can be compiled on Mac OS X using Xcode. The purpose of this tool is to benchmark a very contrived small function, and produce a baseline time.

The function body we are calling is identical in all cases, and looks like this:

// C Dispatch
void basicFunction () {
	g_count += 1;
};

Here is a table of the results. Non-dynamic C function dispatch was used as the baseline. The bigger the percentage, the slower it is.

Method-O0 % -O1 %-O2 %-O3 %-Os %
C Dispatch (Baseline)100%100%100%100%100%
C Pointer Dispatch95.56%112.5%104.6%450%442.1%
Base C++ Dispatch125.7%132%116.9%449.9%590.6%
Derived C++ Dispatch125.6%132.1%116.5%544.2%511.9%
Objective-C Dispatch245.1%272.3%274.2%1235%1213%

From this, we can draw a number of conclusions.

C Dispatch

A standard function in C is a prime candidate for being inlined. In this case, at O3 and Os this occurs, and the benefits are amazing. The function takes almost 1/5th the time to execute in this case.

For larger functions, this won't be the case, but in terms of function dispatch, the key point I want to make is this is the only kind of function call that can be inlined with such a huge performance improvement. It is also the case that once a function is inlined, many other optimizations can be made, such as reducing the amount of data that may be copied during the function call.

Here is the code before it is inlined (-O0 to -O2):

L310:
LM29:
	call basicFunction()
LM30:
	leal -40(%ebp), %eax
	incl (%eax)
L309:
	cmpl $19999, -40(%ebp)
	jbe L310

Here is the code after it is inlined (-O3 and -Os):

	movl _g_count, %eax
	xorl %edx, %edx
LVL119:
L207:
LBB201:
LM27:
	incl %edx
LBE201:
LM28:
	incl %eax
LBB202:
LM29:
	cmpl $1000, %edx
	jne L207
	movl %eax, _g_count

I think the funny thing about this is that the compiler did inline the function, but failed to reduce the loop to a simple addition - e.g. rather than adding 1 several thousand times, why not just add several thousand one time. I've seen GCC do this optimization before, so I'm not sure why it isn't doing it here.

C Pointer Dispatch

This method can not be inlined since the pointer must be dereferenced in order to call the function, thus the function is not defined at compile time.

There is a anomaly for -O0 which I cannot explain. Even thought the statistics are about +/- 10%, this result was consistently about 5% less than the baseline.

Here is the code generated for -O0:

LM33:
	movl $0, -36(%ebp)
	jmp L312
L313:
LM34:
	movl -48(%ebp), %eax
	call *%eax
LM35:
	leal -36(%ebp), %eax
	incl (%eax)
L312:
	cmpl $19999, -36(%ebp)
	jbe L313

The interesting thing about this method of dispatch, is that it it is the most basic kind of dynamic dispatch, and it can't be optimized very much, the improvement between -O0 and -O3 is minimal. It also has a clear relationship to C++ dynamic dispatch in terms of performance cost and the ability for the compiler to provide optimizations.

C++ Virtual Dispatch

This is the most common kind of dynamic dispatch available to programmers who use C++. It is essentially the same as a function pointer, but occasionally it can become more complex depending on the class hierarchy. However, in this case, it is incredibly simple. There was no noticeable difference between a pointer to the base class and a pointer to the derived class, until we use -O3 and -Os. I cannot explain this behavior without looking at the generated asm code, but suffice to say, that little optimization can be done by the compiler.

The performance of the virtual dispatch cannot be improved much, and the compiler is unable to inline the code. I find this at least a little bit fascinating, because in the entire program, there is only one implementation of the virtual function of the same name. In this case, should it not be possible for the compiler to assume that no other implementation exists?

Unfortunately, due to the way GCC compiles code, we cannot be sure until link time whether there are multiple implementations of the same base interface, and at this point, no further optimizations can be done. However, some compilers such as LLVM do provide this facility, and it may be possible to provide the context necessary for this kind of analysis.

The fascinating thing about virtual functions is that once a function is marked as virtual, it will then generally require a dynamic dispatch. Therefore, it is important to only make a function virtual when needed, and to try to keep the number of virtual functions to a minimum if performance is a concern.

Here is the code for virtual dispatch at -O0:

LM38:
	movl $0, -32(%ebp)
	jmp L315
L316:
LM39:
	movl -52(%ebp), %eax
	movl (%eax), %eax
	movl (%eax), %edx
	movl -52(%ebp), %eax
	movl %eax, (%esp)
	call *%edx
LM40:
	leal -32(%ebp), %eax
	incl (%eax)
L315:
	cmpl $19999, -32(%ebp)
	jbe L316

Objective-C Dispatch

Objective-C is fascinating. I won't go over the tiny details here, suffice to say that it requires (as far as I know) some sort of hash table search. Objective-C dispatch is built on top of standard C dispatch and functions, and so is in the worst case 12 times slower than standard C dispatch at -O3. This shouldn't come as much of a surprise since the standard C dispatch at -O3 was inlined.

With Objective-C, it is neither possible nor desirable to inline function calls. The purpose of this is to provide a highly dynamic language which allows for a multitude of possibilities in terms of the API and generally what is possible in terms of dynamic dispatch. In this case, the basic Objective-C dynamic dispatch is about 2.5 times slower than dynamic C++ dispatch, but provides a number of interesting features which are generally not available when using C++ dynamic dispatch:

Generally, Objective-C provides an incredible amount of flexibility at runtime, but you do pay for it performance wise, and with Objective-C, you can't back out of it - i.e. there is no non-virtual Objective-C dispatch available. However, as Objective-C is built on top of C, you can fall back to standard C functions as needed for implementation details.

Here is the code for Objective-C dispatch at -O0:

LM48:
	movl $0, -24(%ebp)
	jmp L321
L322:
LM49:
	movl -60(%ebp), %edx
	movl L_OBJC_SELECTOR_REFERENCES_2, %eax
	movl %eax, 4(%esp)
	movl %edx, (%esp)
	call L_objc_msgSend$stub
LM50:
	leal -24(%ebp), %eax
	incl (%eax)
L321:
	cmpl $19999, -24(%ebp)
	jbe L322

Conclusion

There are many options available to programmers to design robust and efficient programs. It is important to learn where to use dynamic dispatch, especially in performance critical code paths. There is a lot of good advice out there in books and on the internet, and it is worthwhile taking the time to benchmark your program. Although from the benchmarks the performance cost may appear minimal, if you are dealing with thousands of objects, these small costs can really add up.

The topic of dynamic dispatch and inter-procedural optimisation is a big one and ever increasing in scope. One interesting area of research is the Java HotSpot compiler, which from what I've read, can provide optimistic optimisation of dynamic dispatch. Even thought this option is available in terms of improving speed, it is fairly complex in terms of implementation, and is not going to give completely consistent results.

Comments

Interesting that the dynamic dispatch of Obj-C works so badly - in true dynamic languages (most notably Self, which pioneered these techniques) the code can take advantage of a technique called polymorphic inline caching :- effectively inlining the code of commonly used calls, and falling back to dynamic dispatch only when needed. For many cases, the behaviour can be entirely deduced, so the inline code is as efficient as C.

Self was running at about half the speed of C as a result of these techniques. Paradoxically, in this case Obj-C’s static nature prevents optimisations available to the dynamic language. Many of the Self team went on to work on projects like Java HotSpot VM and have influenced other projects such as Google’s V8 JavaScript engine.

I was always uneasy with Obj-C’s mix of dynamic dispatch in a static language - mainly because of the bizarre hybrid syntax. This shows that there are good technical reasons the approach is suboptimal as well.

Leave a comment

Please note, comments must be formatted using Markdown. Links can be enclosed in angle brackets, e.g. <www.codeotaku.com>.