On a 3GHz (3 billion hertz) processor, you expect to be able to context switch billions of times per second? (coder543)

Modern CPUs are incredible, and switching between stackful coroutines has about the same overhead as a function call. Here is the implementation I wrote:

.globl coroutine_transfer
coroutine_transfer:
	# Save caller state
	pushq %rbp
	pushq %rbx
	pushq %r12
	pushq %r13
	pushq %r14
	pushq %r15

	# Save caller stack pointer
	movq %rsp, (%rdi)

	# Restore callee stack pointer
	movq (%rsi), %rsp

	# Restore callee state
	popq %r15
	popq %r14
	popq %r13
	popq %r12
	popq %rbx
	popq %rbp

	# Put the first argument into the return value
	movq %rdi, %rax

	# We pop the return address and jump to it
	ret
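
The restore half of the routine implies what a brand-new coroutine's stack has to look like before the first transfer: six slots for the popped registers, then an address for ret to jump to. A minimal sketch in C, assuming the System V x86-64 ABI; coroutine_prepare_stack and its parameters are hypothetical names for illustration, not part of the code above:

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical helper (not from the original source): builds the initial
 * frame that the restore path of coroutine_transfer expects to find.
 * Returns the value to store as the new coroutine's saved stack pointer.
 * Assumes the caller supplies at least 16 words of stack. */
uintptr_t *coroutine_prepare_stack(uintptr_t *stack, size_t words, uintptr_t entry)
{
	/* Stacks grow down: start at the top, rounded down to 16 bytes. */
	uintptr_t top = (uintptr_t)(stack + words) & ~(uintptr_t)15;
	uintptr_t *sp = (uintptr_t *)top;

	*--sp = 0;     /* placeholder return address; entry must never return */
	*--sp = entry; /* consumed by the final ret */

	/* Zeroed slots for the six callee-saved registers:
	 * sp[0] is popped into r15, sp[5] into rbp. */
	for (int i = 0; i < 6; i++)
		*--sp = 0;

	return sp;
}
```

The entry slot sits on a 16-byte boundary, so after ret the stack pointer is 8 modulo 16, the same state a function sees immediately after a call instruction.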

On my (objectively ancient) Linux desktop, a single core manages on the order of 100 million context switches per second. Across all 8 cores, this approaches 1 billion.
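
As a rough sanity check on those figures, the throughput can be turned into a per-switch cycle budget. The sketch below assumes a 3GHz clock, taken from the question above rather than from the benchmark machine, and the function names are illustrative:

```c
/* Average clock cycles available per context switch on one core. */
double cycles_per_switch(double clock_hz, double switches_per_sec)
{
	return clock_hz / switches_per_sec;
}

/* Upper bound on total throughput if every core scales perfectly. */
double aggregate_switches(double per_core_switches, int cores)
{
	return per_core_switches * cores;
}
```

Under those assumptions, cycles_per_switch(3e9, 1e8) gives 30 cycles per switch, and aggregate_switches(1e8, 8) gives 800 million switches per second, which is consistent with "approaches 1 billion".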

That being said, my original remark that it was possible to context switch billions of times per second was too casual and not backed by evidence. At best it was unclear, and at worst it was off by an order of magnitude. So, I apologise for any confusion and have updated the article.

The source code is available and you can run the benchmark yourself.