Samuel Williams Tuesday, 05 June 2018

Asynchronicity should be a property of how the program is executed, not what it does.

Ruby currently implements mutually exclusive threads and exposes both blocking and non-blocking operations. It also supports Fibers, which can be used to implement cooperatively scheduled event-driven IO. The cognitive burden of dealing with these different APIs is left as an exercise to the programmer, and thus we have a wide range of IO libraries with varying degrees of concurrency. Composability of components built on different underlying IO libraries is generally poor because each library exposes its own API and has its own underlying event loop. We present an approach to concurrency that scales well and avoids the need to change existing programs.

Improving Concurrency and Composability

Fibers are a negative overhead abstraction for concurrency, with each fiber representing a synchronous set of operations, and multiple fibers executing cooperatively in a single thread. This design provides concurrency with none of the overheads of parallel (multi-threaded) programming. Programmers can write their code as if it were sequential, which is easy to reason about, but when an operation would block, other fibers are given a chance to execute. Excellent scalability on Ruby is achieved by running multiple processes, each with its own event loop, and many fibers.
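To make the cooperative model concrete, here is a minimal sketch of two fibers interleaving on a single thread. The `worker` lambda, the `log` array and the round-robin loop are illustrative names only; a real event loop would resume a fiber when its IO is ready rather than in strict rotation.

```ruby
log = []

# Build a fiber that does three steps, yielding after each one.
worker = lambda do |name|
	Fiber.new do
		3.times do |i|
			log << "#{name} #{i}"
			Fiber.yield # hand control back, as a blocking operation would
		end
		:done # the final resume returns this value
	end
end

fibers = [worker.call("ping"), worker.call("pong")]

# A trivial round-robin scheduler: resume every fiber in turn and
# drop the ones that have run to completion.
until fibers.empty?
	fibers.reject! { |fiber| fiber.resume == :done }
end

puts log
```

Each fiber reads as straight-line code, yet the log shows the two interleaving, with no threads or locks involved.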

Basic Operations

Here is an example of a basic asynchronous read() operation. It is possible to inject such wrappers into existing code and they will work concurrently without any further changes:

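class Wrapper
	# ... initialize (assumed to store @io and @selector), write, close, etc
	def read(*args)
		while true
			result = @io.read_nonblock(*args, exception: false)
			case result
			when :wait_readable
				@selector.wait_readable(@io) # yield until readable, then retry
			else
				return result
			end
		end
	end
end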

What does wait_readable() look like? In a simple select()-based implementation:

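class Selector
	# initialize, wait_writable, etc
	def wait_readable(io)
		@readable[io] = Fiber.current
		Fiber.yield # suspend this fiber until run resumes it
		@readable.delete(io)
		return true
	end
	def run
		while @readable.any? or @writable.any?
			readable, writable = IO.select(@readable.keys, @writable.keys, [])
			readable.each do |io|
				@readable[io].resume
			end
			writable.each do |io|
				@writable[io].resume
			end
		end
	end
end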
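To see the wrapper and selector working together, here is a minimal runnable sketch that handles only the readable side. The `initialize` signatures, the pipe and the `messages` array are assumptions made for illustration; they are not taken from any particular library.

```ruby
require 'fiber' # Fiber.current on older Rubies; harmless on newer ones

class Selector
	def initialize
		@readable = {}
	end

	# Park the current fiber until io is reported readable by run.
	def wait_readable(io)
		@readable[io] = Fiber.current
		Fiber.yield
		@readable.delete(io)
		true
	end

	# Resume parked fibers as their IO becomes readable.
	def run
		while @readable.any?
			readable, = IO.select(@readable.keys)
			readable.each { |io| @readable[io].resume }
		end
	end
end

class Wrapper
	def initialize(io, selector)
		@io = io
		@selector = selector
	end

	# Looks synchronous to the caller, but yields instead of blocking.
	def read(*args)
		loop do
			result = @io.read_nonblock(*args, exception: false)
			return result unless result == :wait_readable
			@selector.wait_readable(@io)
		end
	end
end

selector = Selector.new
input, output = IO.pipe
messages = []

# Reader: the pipe is empty, so this fiber parks inside read.
Fiber.new do
	messages << Wrapper.new(input, selector).read(1024)
end.resume

# Writer: the pipe has buffer space, so a plain write completes at once.
Fiber.new do
	output.write("Hello World")
	output.close
end.resume

selector.run
input.close
puts messages
```

The reader fiber never knows it was suspended: from its point of view, read() simply returned the data.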

The problem with this design is that everyone has to agree on a wrapper and selector implementation. We already have a core IO layer in Ruby that practically everyone uses. Alongside it, we have a ton of options for event-driven concurrency, including but not limited to: NIO4R (alive), Async (alive), LightIO (experimental), EventMachine (undead), ruby-io (experimental).

Extending Ruby

The best boundary for event-driven IO loops in Ruby is per-thread (or, taking the GIL into account, per-process). Event-driven IO is naturally cooperative, and scheduling operations across threads makes it needlessly complicated. We can leverage Ruby's existing IO implementation by intercepting calls to io.wait_readable() and io.wait_writable() and redirecting them to Thread.current.selector.

We add an appropriate C API for Thread.current.selector and add a layer of indirection to int rb_io_wait_readable(int f) (and others):

int rb_io_wait_readable(int f)
{
	VALUE selector = rb_current_thread_selector();
	/* Delegate to the per-thread selector, if one is set. */
	if (selector != Qnil) {
		VALUE result = rb_funcall(selector, rb_intern("wait_readable"), 1, INT2NUM(f));
		return RTEST(result);
	}
	/* existing implementation ... */
}
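The same indirection can be sketched in plain Ruby. Thread.current.selector is the proposed API and does not exist in released Ruby, so this sketch stands in a thread variable named `:selector`; `dispatch_wait_readable` and `RecordingSelector` are hypothetical names used only to show which path is taken.

```ruby
# Hypothetical stand-in for the proposed Thread.current.selector.
# thread_variable_get is used because Thread#[] is fiber-local,
# whereas the selector must be shared by every fiber in the thread.
def current_selector
	Thread.current.thread_variable_get(:selector)
end

# Mirrors the C sketch: delegate to the selector if one is set,
# otherwise fall back to blocking the whole thread.
def dispatch_wait_readable(io)
	if selector = current_selector
		!!selector.wait_readable(io)
	else
		IO.select([io]) # stands in for the existing implementation
		true
	end
end

# A stub selector that only records what it was asked to wait for.
class RecordingSelector
	attr_reader :waited

	def initialize
		@waited = []
	end

	def wait_readable(io)
		@waited << io
		true
	end
end

input, output = IO.pipe
output.write("x") # make the pipe readable so the fallback returns at once

# No selector installed: the blocking fallback is used.
fallback = dispatch_wait_readable(input)

# Selector installed on this thread: the call is delegated.
recorder = RecordingSelector.new
Thread.current.thread_variable_set(:selector, recorder)
delegated = dispatch_wait_readable(input)
Thread.current.thread_variable_set(:selector, nil)
```

Code that never sets a selector keeps its current blocking behaviour, which is why existing programs would not need to change.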

Here is an example of how this fits together:

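thread = Thread.new do
	selector = Selector.new
	Thread.current.selector = selector
	i, o = IO.pipe
	i.nonblock = true # this could be default
	o.nonblock = true
	Fiber.new do
		message = i.read
	end.resume
	Fiber.new do
		o.write("Hello World")
		o.close # signals EOF so the reader can finish
	end.resume
	selector.run # could be invoked implicitly
end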

This design has a very minimal surface area and allows reuse of existing event loops (e.g. EventMachine, NIO4R). It's also trivial for other Rubies to implement (e.g. JRuby, Rubinius, TruffleRuby, etc).


While it's hard to make objective comparisons since this is a feature addition rather than a performance improvement, we can at least look at some benchmarks from async-http and async-postgres, which implement the wrapper approach discussed above.

Puma scales up to its configured limits. Falcon scales up until all cores are pegged.

Further Reading

The code is available here and the Ruby bug report has more details. There is a PR tracking changes.

The goal of these improvements is to improve the composability and performance of async. I've implemented the wrapper approach in async-io and it's proven itself to be a good model in several high level libraries, including: async-dns, async-http and async-http-faraday.


It seems kind of an unfair comparison to use a single process in Puma vs. 8 processes in Falcon.

Just from some quick math it seems like Falcon would still win by a large margin even if Puma was running 8 processes, so I think it’s unnecessary to “cheat” in order to look good no?

> It seems kind of an unfair comparison to use a single process in Puma vs. 8 processes in Falcon.

Puma is running with 16 threads. Threads which in practice allow Puma to use all 8 cores to the same extent as Falcon. So I think it’s fair.
