Samuel Williams Sunday, 21 March 2010

Ensuring sites are cache-friendly is an important part of deploying a website. Sites that load quickly and reduce bandwidth costs are great, especially when there are lots of visitors.

I've been experimenting with Rack-Cache, which is an excellent project. It is a full-featured cache for Rack with support for multiple backends.

It is important to keep in mind that not all resources are suited to the same kind of caching. For example, I made two separate caches as part of my application: a file cache and a dynamic content cache. This is because files on disk don't need to be stored in a cache, whereas dynamically generated content does (otherwise you'd have to regenerate it each time).

Caching for files and other static resources should rely on ETags. Each static resource has an ETag, which is typically a hash of the file size and last modified time. This is pretty easy to implement.

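# The core parts of the FileReader class that I use:
require 'digest/sha1'
require 'time'

class FileReader
	def initialize(path)
		@path = path
		digest = Digest::SHA1.hexdigest("#{File.size(@path)}#{mtime_date}")
		# HTTP requires entity tags to be quoted strings.
		@etag = %("#{digest}")
	end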

	attr :path
	attr :etag

	def to_path
		@path
	end

	def mtime_date
		File.mtime(@path).httpdate
	end

	def size
		File.size(@path)
	end

	def each
		File.open(@path, "rb") do |fp|
			while part = fp.read(8192)
				yield part
			end
		end
	end

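	def modified?(env)
		# If-None-Match takes precedence over If-Modified-Since when both are sent.
		if etags = env['HTTP_IF_NONE_MATCH']
			etags = etags.split(/\s*,\s*/)
			return !(etags.include?(etag) || etags.include?('*'))
		end

		if modified_since = env['HTTP_IF_MODIFIED_SINCE']
			# HTTP dates have one-second resolution, so compare whole seconds.
			return false if File.mtime(@path).to_i <= Time.parse(modified_since).to_i
		end

		return true
	end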
end

# Here is basically how we serve the file to the client:
class Static
	# ... snip ...

	def call(env)
		file = FileReader.new(...)
	
		response_headers = {
			"Last-Modified" => file.mtime_date,
			"Content-Type" => @extensions[ext],
			"Cache-Control" => @cache_control,
			"ETag" => file.etag
		}

		if file.modified?(env)
			response_headers["Content-Length"] = file.size.to_s
			return [200, response_headers, file]
		else
			return [304, response_headers, []]
		end
	end
end
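
To see how these conditional checks play out for different requests, here is a minimal, self-contained sketch. The helper name `not_modified?` and the sample ETag and date are hypothetical, chosen only for illustration. It follows the same idea as `modified?` above but takes the ETag and modification time as arguments, so it runs without touching the file system.

```ruby
require 'time'

# Hypothetical helper: returns true when the client's cached copy is still
# valid, given the resource's current ETag and modification time.
def not_modified?(env, etag, mtime)
	if (header = env['HTTP_IF_NONE_MATCH'])
		# When If-None-Match is present, the ETag comparison decides the outcome.
		candidates = header.split(/\s*,\s*/)
		return candidates.include?(etag) || candidates.include?('*')
	end

	if (since = env['HTTP_IF_MODIFIED_SINCE'])
		# HTTP dates have one-second resolution, so compare whole seconds.
		return mtime.to_i <= Time.httpdate(since).to_i
	end

	false
end

etag = '"abc123"' # sample value
mtime = Time.utc(2010, 3, 21, 12, 0, 0) # sample value

# First visit: no validators are sent, so a full 200 response is needed.
p not_modified?({}, etag, mtime) # => false

# Revalidation with a matching ETag: a 304 with an empty body is enough.
p not_modified?({'HTTP_IF_NONE_MATCH' => etag}, etag, mtime) # => true

# Revalidation after the file changed: the old ETag no longer matches.
p not_modified?({'HTTP_IF_NONE_MATCH' => '"old"'}, etag, mtime) # => false

# Revalidation by date only.
p not_modified?({'HTTP_IF_MODIFIED_SINCE' => mtime.httpdate}, etag, mtime) # => true
```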

Dynamically generated content should typically be cached using the last modified time exclusively, and only for a short period (such as 1 hour). This ensures that your site won't be overloaded regenerating content under heavy traffic (such as when you get slashdotted), while the content is still regenerated fairly frequently.

# Using rack-cache is easy - simply install it and add it to your config.ru

require 'rack/cache'

use Rack::Cache, {
	:verbose => true
}

# Then in your content generation, write something like this

response.headers['Cache-Control'] = 'max-age=3600'

# And rack-cache will take care of the rest :)
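
As a rough illustration of the content-generation side, the sketch below is a bare Rack-style application (an object that responds to `call(env)` and returns status, headers, and body) that sets `Last-Modified` and a one-hour `Cache-Control`. The names `DynamicPage` and `render_page` are hypothetical placeholders, not part of the original application, and a cache such as rack-cache is assumed to sit in front of it.

```ruby
require 'time'

# Hypothetical placeholder for whatever expensive work generates the page.
def render_page(now)
	"<html><body>Generated at #{now.httpdate}</body></html>"
end

# A minimal Rack-style application: it responds to call(env) and returns
# [status, headers, body]. No gem is needed just to define it.
DynamicPage = lambda do |env|
	now = Time.now
	body = render_page(now)

	headers = {
		'Content-Type' => 'text/html',
		'Content-Length' => body.bytesize.to_s,
		# Records when this copy was generated.
		'Last-Modified' => now.httpdate,
		# Caches may reuse this response for up to one hour.
		'Cache-Control' => 'max-age=3600'
	}

	[200, headers, [body]]
end
```

In a config.ru, this could be mounted with `run DynamicPage` below the `use Rack::Cache` line, so that repeated requests within the hour are answered from the cache rather than regenerated.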

Also, just because you are caching content doesn't mean your page can't have dynamic elements. AJAX can quite easily provide interactive RSS feeds, swap images, and update content. This means that the majority of your content can be cached while specific parts are generated dynamically on the client. This is something I'm experimenting with.

Debugging Cache Issues

I had problems because Apache was adding a second Cache-Control header to every response. This was caused by a global ExpiresDefault directive, which simply appends another Cache-Control header. This can cause incorrect cache information to spread to caches across the internet. Figuring out all the little problems took me a while, since there are many levels at which information can be cached.
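
One way to spot this kind of duplication is to look at every Cache-Control value a response carries. The sketch below assumes the headers have already been collected as name/value pairs; the helper name `duplicate_cache_control` and the sample values are hypothetical.

```ruby
# Hypothetical helper: given [name, value] pairs, returns every Cache-Control
# value when more than one such header is present, otherwise an empty array.
def duplicate_cache_control(header_pairs)
	values = header_pairs
		.select { |name, _| name.casecmp('Cache-Control').zero? }
		.map { |_, value| value }

	values.length > 1 ? values : []
end

# Sample headers, similar to what an application and a global
# ExpiresDefault directive might produce together.
sample = [
	['Content-Type', 'text/html'],
	['Cache-Control', 'max-age=3600'],
	['Cache-Control', 'max-age=86400']
]

duplicate_cache_control(sample).each do |value|
	puts "Cache-Control: #{value}"
end
```

When fetching a page with Net::HTTP, `response.get_fields('Cache-Control')` returns the separate values as an array, which can be checked in the same way.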

I found two great tools for checking whether your pages are serving the correct headers, and whether your stack responds correctly to things such as If-Modified-Since and If-None-Match:

Both of these sites will point out issues with the content you are serving, and highlight potential problems with resources which won't be cached properly due to missing headers, incorrect headers and/or incorrect behavior.
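
A similar check can be roughed out by hand. The sketch below is only an approximation of what such tools do, and the method names are hypothetical: it requests a URL, then requests it again with the validators from the first response, and reports the second status code.

```ruby
require 'net/http'
require 'uri'

# Builds the validator headers a client would send when revalidating,
# based on the headers of an earlier response (a plain Hash here).
def revalidation_headers(previous_headers)
	headers = {}
	headers['If-None-Match'] = previous_headers['ETag'] if previous_headers['ETag']
	headers['If-Modified-Since'] = previous_headers['Last-Modified'] if previous_headers['Last-Modified']
	headers
end

# Fetches the URL twice: once normally, then again with validators.
# Returns the status code of the second response as a string; "304" means
# the server honoured the conditional request.
def revalidation_status(url)
	uri = URI(url)

	Net::HTTP.start(uri.host, uri.port, use_ssl: uri.scheme == 'https') do |http|
		first = http.get(uri.request_uri)
		validators = revalidation_headers({
			'ETag' => first['ETag'],
			'Last-Modified' => first['Last-Modified']
		})
		http.get(uri.request_uri, validators).code
	end
end
```

Calling `revalidation_status` with the address of a locally running site (for example `http://localhost:9292/`, an assumed address) should return "304" if the stack handles conditional requests, and "200" if it ignores them.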
