Redis is blazingly fast, amazingly versatile, and virtually ubiquitous among Rails apps. Typically it's leveraged for background job processing, pub/sub, request rate limiting, and all manner of other ad-hoc tasks that require persistence and speed. Unfortunately, its adoption as a cache has lagged in the shadow of Memcached, the longstanding in-memory caching alternative. That may be due to lingering views on what Redis's strengths are, but I believe it comes down to a lack of great libraries. That's precisely what led to writing Readthis, an extremely fast caching library for Ruby, backed by Redis.
Before diving into project goals and implementation details, let's look at a chart comparing the performance of multi cache operations across various cache libraries. Multi, or pipelined, read/write operations are particularly valuable for caching with API requests, and an excellent example of where Readthis's performance excels:
The multi benchmark can be found in the Readthis repository.
The only store faster than Readthis is ActiveSupport's in-memory store, which isn't persisted to a database at all. Throughout the rest of this post we'll look at the high-level goals that made this performance possible and examine some of the specific steps taken to achieve it.
High-Level Goals
Writing a new implementation of existing software begins with setting high-level goals. These goals establish how the library will be differentiated from the alternatives and provide some metrics of success. As there was already a Redis-backed cache available in redis-store, and an extremely popular Memcached library in dalli, setting the initial goals was quite straightforward.
- Lightweight - Aside from Redis there is no need for external dependencies. Keep the gem as portable as possible and avoid requiring the ActiveSupport beast while still supporting integration points with Rails apps.
- Speedy - Raw speed with a low impact on memory is the ultimate focus. Start benchmarking and profiling right from the beginning so that the impact of each change can be measured.
- Pooled - Many apps use a single global connection to Redis, which is a cause for contention in multi-threaded systems. Follow Dalli's lead and leverage connection pooling to increase throughput (see the sketch after this list).
- Well Tested - Caching is a critical component in production systems. Each code path needs to be exercised so that changes and optimizations can be made with confidence. This is a case where 100% test coverage is necessary.
- Maintained - Project maintenance isn't a concrete feature, but it is paramount to the trust and adoption of a library. I submitted numerous patches to redis-activesupport, but the pull requests languished for months while compatibility drifted away from releases of Rails.
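To make the pooling goal concrete, here is a minimal sketch of the general approach using the connection_pool gem that Dalli also builds on. The pool size and timeout are illustrative values, not Readthis defaults:

require 'connection_pool'
require 'redis'

# Instead of sharing one global client across threads, check a
# connection out per operation. A thread waits at most `timeout`
# seconds for a free connection before raising.
pool = ConnectionPool.new(size: 5, timeout: 5) do
  Redis.new(url: 'redis://localhost:6379/11')
end

pool.with do |redis|
  redis.set('key', 'value')
  redis.get('key') #=> "value"
end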
Identifying Performance Bottlenecks
Once the initial library structure was in place, a small suite of benchmark scripts was created to measure performance and memory usage. As features were added or enhanced, the scripts were used to identify performance bottlenecks while also ensuring there weren't any performance regressions.
The initial benchmark results can be broken down into three distinct bottlenecks: round trips to Redis, marshalling cached data, and cache entry creation. Though other micro-optimizations presented themselves as well, these three areas provided the most obvious gains.
Mitigating the Redis Round-trip
Redis is extremely fast, but no amount of speed can compensate for wasting time with repeated calls back and forth between an application and the database. Each round trip wastes CPU time and instantiates additional objects that will eventually need to be garbage collected. Redis provides pipelining via the MULTI command for exactly this situation.
Readthis uses MULTI to complete data setting and retrieval with as few transactions as possible. Primarily this benefits "bulk" operations such as read_multi, fetch_multi, or the Readthis-specific write_multi. For fetch operations, where values are written only when they can't be retrieved, reading and writing of all values is always performed with two commands, no matter how many entries are being retrieved.
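To illustrate the technique itself (this is the raw redis-rb pattern, not Readthis's internal code), a group of reads can be queued and flushed in a single MULTI/EXEC round trip:

require 'redis'

redis = Redis.new(url: 'redis://localhost:6379/11')

# Commands issued inside the block are queued client-side and sent
# together, so reading 26 keys costs one round trip instead of 26.
values = redis.multi do |transaction|
  ('a'..'z').each { |key| transaction.get(key) }
end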
The most significant gains in pipelining and round-trip performance came through the use of hiredis. Hiredis is a Redis driver written in C that drastically speeds up parsing bulk replies.
require 'bundler'
Bundler.setup
require 'benchmark/ips'
require 'readthis'
REDIS_URL = 'redis://localhost:6379/11'
native = Readthis::Cache.new(REDIS_URL, driver: :ruby, expires_in: 60)
hiredis = Readthis::Cache.new(REDIS_URL, driver: :hiredis, expires_in: 60)
('a'..'z').each { |key| native.write(key, key * 1024) }
Benchmark.ips do |x|
  x.report('native:read-multi') { native.read_multi(*('a'..'z')) }
  x.report('hiredis:read-multi') { hiredis.read_multi(*('a'..'z')) }

  x.compare!
end
Faster Marshalling
Once you eliminate the time spent retrieving data over the wire, it becomes clear that most of the wall time is spent marshalling data back and forth between strings and native Ruby objects. Even when the value being cached is already a string, it is still marshalled as a Ruby string:
Marshal.dump('ruby') #=> "\x04\bI\"\truby\x06:\x06ET"
For some caching use cases, such as storing JSON payloads, it simply isn't necessary to load stored strings back into Ruby objects. This insight provided an opportunity to make the marshaller pluggable, and even to bypass serialization entirely, yielding a significant performance boost. Some implementations, such as Dalli's, accept a raw option to bypass entry serialization as well, but that option is checked on every read or write and doesn't provide any additional flexibility.
Let's look at the script used to measure marshal performance. It illustrates that configuring the marshaller is as simple as passing an option during construction. Any object that responds to both dump and load may be used.
require 'bundler'
Bundler.setup
require 'benchmark/ips'
require 'json'
require 'oj'
require 'readthis'
require 'readthis/passthrough'
REDIS_URL = 'redis://localhost:6379/11'
OPTIONS = { compressed: false }
readthis_pass = Readthis::Cache.new(REDIS_URL, OPTIONS.merge(marshal: Readthis::Passthrough))
readthis_oj = Readthis::Cache.new(REDIS_URL, OPTIONS.merge(marshal: Oj))
readthis_json = Readthis::Cache.new(REDIS_URL, OPTIONS.merge(marshal: JSON))
readthis_ruby = Readthis::Cache.new(REDIS_URL, OPTIONS.merge(marshal: Marshal))
HASH = ('a'..'z').each_with_object({}) { |key, memo| memo[key] = key }
Benchmark.ips do |x|
  x.report('pass:hash:dump') { readthis_pass.write('pass', HASH) }
  x.report('oj:hash:dump') { readthis_oj.write('oj', HASH) }
  x.report('json:hash:dump') { readthis_json.write('json', HASH) }
  x.report('ruby:hash:dump') { readthis_ruby.write('ruby', HASH) }

  x.compare!
end

Benchmark.ips do |x|
  x.report('pass:hash:load') { readthis_pass.read('pass') }
  x.report('oj:hash:load') { readthis_oj.read('oj') }
  x.report('json:hash:load') { readthis_json.read('json') }
  x.report('ruby:hash:load') { readthis_ruby.read('ruby') }

  x.compare!
end
The results, in prettified chart form:
This benchmark demonstrates the relative difference between serialization modules when working with a small hash. Performance varies for other primitives, such as strings, but the pass-through module is always fastest for load operations. This makes sense, as there aren't any additional allocations being made; the string read back from Redis is returned directly.
When you can get away with it, which is any time you're only working with strings, the pass-through module provides an enormous boost in load performance. Otherwise, if you are only working with basic types (strings, arrays, numbers, booleans, hashes), then there are gains to be made with Oj, particularly during dump operations.
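A pass-through marshaller needs almost no code. This sketch shows the general shape of such a module; Readthis ships its own as Readthis::Passthrough, and the shipped implementation may differ in detail:

require 'readthis'

# Any object responding to dump and load will do; here both methods
# simply hand the value through untouched.
module Passthrough
  def self.dump(value)
    value
  end

  def self.load(value)
    value
  end
end

cache = Readthis::Cache.new('redis://localhost:6379/11', marshal: Passthrough)
cache.write('payload', '{"status":"ok"}')
cache.read('payload') #=> '{"status":"ok"}'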
Entry Storage
All of the caches built on ActiveSupport::Cache rely on the Entry class for wrapping values. The Entry class provides a base for serialization, compression, and expiration tracking. Every time a new value is read from or written to the store, a new entry is initialized for the value.
When working with Redis, not all of the entry class's functionality is necessary. Some stores, such as FileStore or MemoryStore, require per-entry expirations to evict stale cache entries. Redis has built-in support for expiration, so wrapping individual entries can be avoided. By not wrapping each cache entry, Readthis can use pure methods and avoid instantiating additional objects.
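The underlying idea is easy to show with plain redis-rb: rather than tracking expiration inside a wrapper object, the value and its TTL are handed to Redis in a single SETEX command (the key and TTL here are illustrative):

require 'redis'

redis = Redis.new(url: 'redis://localhost:6379/11')

# Store the raw value and its expiration together. Redis evicts the
# key itself, so no wrapper object has to track staleness.
redis.setex('greeting', 60, 'hello')

redis.ttl('greeting') #=> 60 (seconds remaining)
redis.get('greeting') #=> "hello"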
In synthetic benchmarks the performance gains are negligible (and they make for a very boring chart), but there is a direct reduction in the number of objects allocated. That savings adds up across thousands of requests, resulting in fewer GC pauses.
Lessons to be Learned
Ignoring the implementation details that differ between Redis and Memcached, there isn't anything preventing other caches from benefiting from these techniques. Everybody benefits from healthy competition. There is always room for improvement, and I hope to see Readthis pushed further.
Use Redis for your next project and give Readthis a try!