Samuel Williams Saturday, 02 June 2018

Ruby currently uses OS provided abstractions for implementing Fibers. Unfortunately, these often perform poorly and have tricky semantics. We present a native implementation of coroutines and demonstrate improvements to performance.

Existing Implementations

The most common implementation on UNIX systems involves manipulating the state of makecontext() and swapcontext(). These function calls have been deprecated and removed from the POSIX standard, but still exist in most UNIX implementations for backwards compatibility. As well as being poorly implemented and supported, the documented semantics of these functions requires certain system calls which limit their performance.

Another typical undocumented and unsupported approach is to abuse setjmp() and longjmp(). It is possible to manipulate the jmp_buf to change the return address and stack pointer, so it is possible to jump between fibers.

Windows provides native APIs for fibers and these are pretty decent. They work as expected, but depending on the situation, may do more than required at the expense of performance.

Improving Performance

A native implementation of coroutines using assembly can significantly improve performance of the Ruby's Fibers. Even thought some assembly is required, the net semantic complexity is reduced and existing hacks can be removed.

x64 Implementation

The state that is required per coroutine - typically just the stack pointer, but it is possible to augment this with other per-coroutine data:

// The fiber context (stack pointer).
typedef struct
{
	void **stack_pointer;
} coroutine_context;

The initialization function prepares the stack so that when we transfer to it, it will return to the given start address:

inline void coroutine_initialize(
	coroutine_context *context,
	coroutine_start start,
	void *stack_pointer,
	size_t stack_size
) {
	/* Force 16-byte alignment */
	context->stack_pointer = (void**)((uintptr_t)stack_pointer & ~0xF);

	if (!start) {
		assert(!context->stack_pointer);
		/* We are main coroutine for this thread */
		return;
	}

	*--context->stack_pointer = NULL;
	*--context->stack_pointer = (void*)start;

	context->stack_pointer -= COROUTINE_REGISTERS;
	memset(context->stack_pointer, 0, sizeof(void*) * COROUTINE_REGISTERS);
}

The destroy function essentially just nullifies the stack pointer:

inline void coroutine_destroy(coroutine_context * context)
{
	context->stack_pointer = NULL;
}

The transfer function switches between coroutines. It essentially saves the caller state onto its stack, swaps the stack to another coroutine, and then returns.

coroutine_transfer:

	# Save caller state
	pushq %rbp
	pushq %rbx
	pushq %r12
	pushq %r13
	pushq %r14
	pushq %r15
	
	# Save caller stack pointer
	movq %rsp, (%rdi)
	
	# Restore callee stack pointer
	movq (%rsi), %rsp
	
	# Restore callee stack
	popq %r15
	popq %r14
	popq %r13
	popq %r12
	popq %rbx
	popq %rbp
	
	# Put the first argument into the return value
	movq %rdi, %rax
	
	# We pop the return address and jump to it
	ret

Keep in mind that this is the simplest possible implementation and doesn't preserve things like sigmask, FPU state, etc. They essentially remain per-thread with this specific implementation.

Performance

Generally speaking, the overhead of a fiber context switch is not much more than a standard C fast call (pass as many arguments in registers as possible). On x64, the standard (and only) ABI is fast call, and so it's efficent by default.

Further Reading

The code is available here and the Ruby bug report has more details. There is a PR tracking changes.

The goal of these improvements is to improve the performance of async. I've measured a 5% improvement to async-http.

Comments

The most common implementation on UNIX systems involves manipulating the state of makecontext() and swapcontext(). These function calls have been deprecated and removed from the POSIX standard, but still exist in most UNIX implementations for backwards compatibility. As well as being poorly implemented and supported, the documented semantics of these functions requires certain system calls which limit their performance.

That got me poking around as I had plans for them….

So I asked a question on the libc-help mailing list and got this response… https://sourceware.org/ml/libc-help/2018-06/msg00011.html

I didn’t find anything on the Austin groups bug tracker except…. http://pubs.opengroup.org/onlinepubs/9699919799/xrat/V4_xsh_chap03.html#tag_22_03_01_07

getcontext(), makecontext(), swapcontext()

Due to portability issues with these functions, especially with the manipulation of contexts, applications are recommended to be rewritten to use POSIX threads.

Which IMHO is not a very satisfactory answer.

In my experience of playing around with swapcontext(), the behaviour of signals and sigprocmask were certainly a source of pain (and in the presence of the early spectre / meltdown mitigations, a huge slowdown).

Unfortunately I think POSIX has invested so much sunk cost into pthreads they no longer have the will to step away from it. I wish they would.

Leave a comment

Please note, comments must be formatted using Markdown. Links can be enclosed in angle brackets, e.g. <www.codeotaku.com>.