
Efficient Redis Caching Through Hashing

If you spend some time browsing through Redis documentation you'll quickly stumble upon references to "intelligent caching". Within the context of Redis, intelligent caching refers to leveraging the native data types rather than storing all values as strings. There are numerous examples of this out in the wild, some using ordered lists, some using sets, and a notable example using hashes.

Using Hashes

The inspiration for this post came from the memory optimization documentation for Redis itself:

Small hashes are encoded in a very small space, so you should try representing your data using hashes every time it is possible. For instance if you have objects representing users in a web application, instead of using different keys for name, surname, email, password, use a single hash with all the required fields.

Redis Documentation

Typically Redis string storage costs more memory than Memcached, but it doesn't have to be that way. Memory consumption and read/write performance can all be improved by using the optimized storage of the hash data type. Wrapping data in hashes can be semantically cleaner, much more memory efficient, faster to write, and faster to retrieve.
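To make that concrete, here is a minimal sketch assuming a local Redis server and the redis gem; the user key and attribute values are invented for illustration:

require 'redis'

redis = Redis.new

# Discrete string storage: one key per attribute.
redis.set('user-1-name',  'Ada')
redis.set('user-1-email', 'ada@example.com')

# Hash storage: a single key whose fields hold the same attributes.
redis.hmset('user-1', 'name', 'Ada', 'email', 'ada@example.com')
redis.hgetall('user-1') #=> {"name"=>"Ada", "email"=>"ada@example.com"}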

Smaller and faster always sound great in theory; let's see if it proves true in practice.

Proving the Concept

Before delving into specific use cases, we'll set up some benchmarking and measurement scripts to confirm or deny the following hypothesis: bundling values into hashes is more memory efficient and performant than discrete string storage.

The benchmark, written in Ruby only for the sake of simplicity, performs the following steps:

  1. Flush the database
  2. Generate 1,001 structures, each containing 101 fields that share one randomized string value
  3. Measure the speed of writing each value and each structure
require 'redis'
require 'benchmark'
require 'securerandom'

REDIS = Redis.new(url: 'redis://localhost:6379/11')
REDIS.flushdb

# Write each field of a structure as a discrete string key,
# e.g. "string-field-0-42" for field 0 of structure 42.
def write_string(obj, index)
  obj.each do |key, data|
    REDIS.set("string-#{key}-#{index}", data)
  end
end

# Write all fields of a structure into a single hash, e.g. "hash-42".
def write_hash(obj, index)
  fields = obj.flat_map { |field, value| [field, value] }
  REDIS.hmset("hash-#{index}", *fields)
end

# Each structure reuses a single random 200-character value across its 101 fields.
values = (0..1_000).map do
  data = SecureRandom.hex(100)

  (0..100).each_with_object({}) do |i, memo|
    memo["field-#{i}"] = data
  end
end

Benchmark.bm do |x|
  x.report('write_string') do
    REDIS.multi do
      values.each_with_index { |value, index| write_string(value, index) }
    end
  end

  x.report('write_hash') do
    REDIS.multi do
      values.each_with_index { |value, index| write_hash(value, index) }
    end
  end
end

The results for one iteration, in seconds, showed the hash writes finishing faster (lower is better).

With some slight modifications, the same script can be used to measure memory consumption, simply by checking Redis#info(:memory) for each strategy.

# ...speed script from above

REDIS.flushdb
values.each_with_index { |value, index| write_string(value, index) }
puts "String: #{REDIS.info(:memory)['used_memory_human']}"

REDIS.flushdb
REDIS.config(:set, 'hash-max-ziplist-value', '64') # the default; 200-character values overflow it
values.each_with_index { |value, index| write_hash(value, index) }
puts "Hash: #{REDIS.info(:memory)['used_memory_human']}"

REDIS.flushdb
REDIS.config(:set, 'hash-max-ziplist-value', '1024') # large enough to keep the values compactly encoded
values.each_with_index { |value, index| write_hash(value, index) }
puts "Tuned: #{REDIS.info(:memory)['used_memory_human']}"

The results, in megabytes, are consistent across multiple iterations: string storage uses the most memory and tuned hash storage the least (lower is better).

This demonstrates a sizable difference between string storage, hash-based storage, and tuned hash-based storage (more on that later). With this sample data, the memory savings are nearly 30%. Those are pretty huge savings! Note that there is a slight trade-off between tuned hash entry size and insertion time.

Again, with some slight modifications, the write and memory benchmarks can also be used to measure read speed. There isn't any appreciable performance difference for HGETALL between hash entry sizes, so only one data point is included.

# ...speed script from above

string_keys = (0..1_000).flat_map do |n|
  (0..100).map { |i| "string-field-#{i}-#{n}" }
end

hash_keys = (0..1_000).map do |n|
  "hash-#{n}"
end

Benchmark.bm do |x|
  x.report('read_string') do
    REDIS.multi do
      string_keys.map { |key| REDIS.get(key) }
    end
  end

  x.report('read_hash') do
    REDIS.multi do
      hash_keys.map { |key| REDIS.hgetall(key) }
    end
  end
end

The results for one iteration, in seconds, again favor the hash approach (lower is better).

Ignoring the fact that reading over 100,000 items in a single MULTI command is pretty slow, you can see that hash fetching is 26% faster. This is intuitive: fetching one hundredth as many keys (1,001 HGETALL calls instead of 102,101 GETs) should be faster.

As expected, the documentation was right. With a little tuning, the hash-based approach can be smaller and faster. There are some caveats though. Let's explore them with a practical example.

A Practical Example

The original idea, as demonstrated, uses a sharding scheme to bucket models into hashes based on a model's unique key. This has a wonderful sense of symmetry and predictability, but doesn't do anything to ensure that the models are related to each other. The cache only has a finite amount of space and we want old values to be evicted, so it is desirable to keep all fields within a hash related. Additionally, there are performance benefits to naively fetching each hash in its entirety.

For example, imagine an API endpoint that serves up JSON data for blog posts. Each post would include the author, some comments, and tags. Naturally the serialized JSON would be cached in order to boost response times. Typically each post and its associated data would be stored separately, as strings, available at keys such as posts/:id/:timestamp. Instead, with hash-based caching, all of the serialized values are stored inside a single hash referenced by the post's key at the top level.

Cache Hashing

When a request comes in, the post is retrieved from the database and a cache key is generated in the format posts/:id/:children/:timestamp; if the cache is fresh, only a single fetch is necessary. Field invalidation for associated children (authors, comments, etc.), field additions (new comments, new tags), and field removals (deleted comments) are all handled by embedding the number of children and a timestamp within the cache key.
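A minimal sketch of that flow might look like the following, assuming a post object that exposes id, children_count, updated_at, and comments; those accessors, the cache_key format, and fetch_post itself are illustrative rather than a prescribed API:

require 'json'
require 'redis'

REDIS = Redis.new

# The child count and latest timestamp are baked into the key, so any
# change to the post or its children yields a new key and the stale
# hash simply ages out of an LRU-configured cache.
def cache_key(post)
  "posts/#{post.id}/#{post.children_count}/#{post.updated_at.to_i}"
end

def fetch_post(post)
  key = cache_key(post)

  cached = REDIS.hgetall(key)
  return cached unless cached.empty?

  # Cache miss: serialize the post and its children into a single hash.
  fields = { 'post' => post.to_json }
  post.comments.each { |comment| fields["comment-#{comment.id}"] = comment.to_json }

  REDIS.mapped_hmset(key, fields)

  fields
end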

Logistics & Tuning

Previously I mentioned the significant difference in storage that was achieved by "tuning." Through the Redis config file, or with the CONFIG command, the storage semantics of most data structures can be configured. In this case the hash-max-ziplist-* values are the most important.

Hashes, Lists, Sets composed of just integers, and Sorted Sets, when smaller than a given number of elements, and up to a maximum element size, are encoded in a very memory efficient way that uses up to 10 times less memory.

Redis Documentation

As long as hash-max-ziplist-value is larger than the largest value being stored in a hash, storage will be optimized. For caching it is recommended that you use a particularly aggressive value, at least 2048-4096 bytes.
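For example, applied with the same CONFIG command used in the memory script; the 4096-byte value threshold and 512-entry limit are illustrative choices rather than universal recommendations:

# Keep hashes in the compact encoding by raising both thresholds: the
# maximum field value size and the maximum number of fields. A hash
# that exceeds either limit falls back to the standard encoding.
REDIS.config(:set, 'hash-max-ziplist-value', '4096')
REDIS.config(:set, 'hash-max-ziplist-entries', '512')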

Fields within a hash can't have individual expirations; only the hash key itself can expire. Bundling related fields together allows the entire hash to be treated as a single unit, eventually falling out of memory when it is no longer needed. This means it is preferable to run Redis as an LRU cache. We're already doing that though, right?
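A small sketch of key-level expiry; the key, payload, and TTL here are arbitrary:

key = 'posts/1/5/1436900000'

REDIS.hmset(key, 'post', '{"title":"Intelligent Caching"}')
REDIS.expire(key, 60 * 60) # the entire hash expires together after an hour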

Bundling serialized data for associated models is just one way to utilize the native hash type for intelligent caching. Custom caching strategies demand more explicit and purposeful organization of data, but can be far more rewarding than naive key/value storage.