Self-Healing Deployments

Every process that is available when a server boots was brought up by the init system (or systems). Starting and supervising processes is all the init system does, and it is exactly what it is good at. It's certainly better at managing processes than an ad-hoc deployment script is.

Each production server should only have one job, be it running a load balancer, serving up static pages or working as a database. Realistically that isn't always the case (staging servers are a notable exception), but it is an attainable goal. Every component in the stack should rely on the init system to maintain a steady state. Chances are the load balancer, reverse proxy cache, NoSQL server, SQL server or configuration registry is already being managed by an init system. The application should be too.

Service Configurations

This post assumes you are deploying to Ubuntu, though the same principles apply to nearly any other *nix system. The current service management system for Ubuntu is Upstart, though it is being phased out in favor of the controversial, Red Hat-driven systemd. Regardless, Upstart is included in Ubuntu 14.04 LTS, so it will be around for at least another four years.

The Upstart Cookbook is your best friend when crafting Upstart configuration files. Don't be intimidated by the cookbook's massive length. While searching around for specific details you'll learn of other useful features that you didn't even know existed.

The least common denominator for any web application is the server, so that is what we will look at setting up as a service. Below is a configuration file for running the Puma web server as a service. Most of the details are common to any Upstart script, and in fact much of this configuration is straight out of the example from the Puma repository:

description "Puma Server"

setuid deploy
setgid deploy
env HOME=/home/deploy

reload signal USR1
normal exit 0 TERM

respawn
respawn limit 3 30

start on runlevel [2345]
stop on runlevel [06]

script
  cd /var/www/app/current
  exec bin/puma -C config/puma.rb -b 'unix:///var/run/puma.sock?umask=0111'
end script

post-stop exec rm -f /var/run/puma.sock

There are a couple of important changes and additions to the configuration that I'll point out, as they are crucial for service maintainability.

setuid deploy
setgid deploy

First, drop down to a less privileged user for the sake of security. This is a very helpful feature built into more recent versions of Upstart. Your service simply should not need to run as root. Some commands that require sudo are necessary for service control, but they should be enabled within sudoers, as we'll look at later.

reload signal USR1
normal exit 0 TERM

Use an alternate reload signal. The standard signal emitted to the process is HUP, which tells a process to reload its configuration file. Puma, like some of the other web servers, can perform a full code reload and hot restart when sent a particular signal. Here we are hijacking the Upstart reload event to send Puma the USR1 signal, triggering a phased restart. Part of the phased restart process involves sending the TERM signal, which we tell Upstart to treat as a normal exit rather than a failure. Without the normal exit directive Upstart would consider the Puma process down after one reload.
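To make the reload mechanics concrete, here is a small sketch of what a reload amounts to: delivering the configured signal to the tracked PID. The worker below is only a stand-in for Puma, and the function name and the "phased-restart" marker are made up for this illustration.

```shell
# demo_reload_signal: start a stand-in worker, send it USR1, report what it did.
# Nothing here talks to Upstart or Puma; it only shows signal delivery by PID.
demo_reload_signal() {
  log=$(mktemp)

  # The worker installs a USR1 handler, announces readiness, then idles.
  sh -c '
    trap "echo phased-restart >> \"$1\"; exit 0" USR1
    echo ready >> "$1"
    while :; do sleep 1 & wait $!; done
  ' worker "$log" >/dev/null 2>&1 &
  pid=$!

  # Wait until the handler is installed before signalling.
  until grep -q ready "$log"; do sleep 0.1; done

  # This is the step a reload performs for us: signal the tracked PID.
  kill -USR1 "$pid"
  wait "$pid"

  tail -n 1 "$log"
  rm -f "$log"
}
```

Calling demo_reload_signal prints the marker written by the worker's USR1 handler. Had the signal gone to a different PID, the worker would never have noticed.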

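respawn
respawn limit 3 30

Add a respawning directive. The respawn stanza tells Upstart to restart the job if it dies unexpectedly, and respawn limit caps that at 3 restarts within a 30 second window. If the job fails more often than that, the service most likely isn't coming back on its own and Upstart gives up. Even so, it's nice to have a backup.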
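The windowed limit is easier to reason about with a small simulation. This is a rough sketch of a counter that resets once the interval has elapsed, not Upstart's actual implementation; the function name and the output format are invented for the example.

```shell
# simulate_respawns LIMIT INTERVAL T1 T2 ...
# Each T is the time (in seconds) at which the job dies.
# Prints whether the job would be respawned or abandoned at each failure.
simulate_respawns() {
  limit=$1
  interval=$2
  shift 2
  window_start=""
  count=0
  for t in "$@"; do
    if [ -z "$window_start" ] || [ $((t - window_start)) -ge "$interval" ]; then
      # First failure, or the previous window has expired: start a new window.
      window_start=$t
      count=1
    else
      count=$((count + 1))
    fi
    if [ "$count" -gt "$limit" ]; then
      echo "t=$t give-up"
      return 0
    fi
    echo "t=$t respawn"
  done
}
```

With a limit of 3 in 30 seconds, simulate_respawns 3 30 0 5 10 15 respawns three times and gives up on the fourth failure, while failures spaced 20 seconds apart never exhaust the limit because the window keeps resetting.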

start on runlevel [2345]

Automatic start is one of the strongest selling points for using an init system for an application. If the VM is mysteriously rebooted by your hosting provider, which is guaranteed to happen at some point, the application will be brought right back up when the VM boots.

exec bin/puma -C config/puma.rb -b 'unix:///var/run/puma.sock?umask=0111'

The final line of the script block determines which process will be tracked by Upstart. While that may seem obvious, there are some gotchas to be aware of. By default Upstart pipes STDOUT and STDERR to /var/log/upstart/puma.log, which is convenient. If you decide that you'd prefer to log directly to syslog you may be tempted to add a pipe:

exec bin/puma ... | logger -t puma

However, that causes Upstart to track the logger process's PID instead of Puma's, preventing any further control of the Puma process by Upstart. As you would soon discover, attempts to run sudo stop puma would only stop the logger process and leave an orphaned Puma process running in the background. Tracking the proper PID is also crucial for the next stage of managing applications as services: service monitoring.

Controlling Services

By placing the configuration file in the proper location we can use service commands to control the server process. Write the file to /etc/init/puma.conf. All configuration files go into /etc/init/, and the service becomes available under whatever the file is named.

With the configuration in place the server can start up:

sudo service puma start

Even though the process will be run as the deploy user, the service must be controlled with sudo. This can be problematic when using a deployment tool like Capistrano, which doesn't officially support running commands with sudo. In order for all of the necessary job control to be available during deployment you will need to configure the deploy user with proper sudoer permissions. Playing with passwordless sudo can be dangerous, so only add an exemption for controlling the puma process directly. Note that the append has to go through sudo tee, because a plain >> redirect would be performed by your unprivileged shell rather than by sudo:

echo "deploy ALL = (root) NOPASSWD: /sbin/start puma, /sbin/stop puma, /sbin/restart puma, /sbin/reload puma" | sudo tee -a /etc/sudoers

The various service commands (start, stop, restart, and reload) are all aliased into /sbin. This makes the passwordless commands slightly more readable, but is functionally equivalent to the service {name} {action} version.
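With those exemptions in place, a deployment step can pick the right command on its own. The following is a minimal sketch, assuming the job is named puma and that status can be read without sudo; deploy_action and deploy_puma are hypothetical helper names, not part of Upstart or Capistrano.

```shell
# deploy_action STATUS_OUTPUT
# Decide which command to run from the output of `status puma`,
# which looks like "puma start/running, process 1234" or "puma stop/waiting".
deploy_action() {
  case "$1" in
    *start/running*) echo reload ;;  # already up: phased restart via reload
    *)               echo start ;;   # stopped or unknown: bring it up
  esac
}

# deploy_puma: hypothetical hook to call at the end of a deploy.
# Only uses the commands that were whitelisted in sudoers.
deploy_puma() {
  action=$(deploy_action "$(/sbin/status puma 2>/dev/null)")
  sudo "/sbin/$action" puma
}
```

Keeping the decision in a separate function means it can be checked without touching the real service, for example deploy_action 'puma stop/waiting' prints start.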

Now the service is up and the init system will ensure it comes back up if the system crashes, or even if the process itself crashes. But what happens if the process itself misbehaves or starts syphoning too many resources? There are tools for just that situation, of course.

Monitoring Services

Utilities for monitoring a server and the services on that server are essential to maintaining the health of a system. Many systems in the Ruby world have relied on tools like God or Bluepill to monitor and control application state. Those particular tools have a couple of large drawbacks though. Notably they require a Ruby runtime, which reduces portability and sacrifices stability when version management is involved. More importantly, instead of working with an existing init system they duplicate the functionality.

A recently released monitoring tool called Inspeqtor addresses both of the aforementioned issues. It is distributed as a small self-contained binary that itself is managed by an init system. However, it doesn't get into the business of trying to control services directly. Instead, it leverages the init system and very concise configuration files to help the system manage services directly. Installation is simple and works with the existing package manager.

Continuing with the goal of keeping the system up and self-healing, and of letting the init system do our work for us, here is an example configuration file for Puma. It targets the Puma service specifically, and would be placed in /etc/inspeqtor/services.d/puma.inq:

check service puma
  if cpu:total_user > 90% then alert
  if memory:total_rss > 2g then alert, reload

That outlines, in very plain language, how Inspeqtor will monitor the service. It will find the init system that is managing the process and periodically perform some analysis on it. It performs simple status checks, such as whether the service is even up currently, and can alert you if the service goes down. Deeper introspection into resource usage is also possible, as shown in the example above. Experience tells us that a Ruby web server will suffer memory bloat over time and we'll want to track it. When the memory passes a threshold Inspeqtor will take action. In this case it will tell Upstart to reload puma (the same as running service puma reload) and it will send an alert to any of the configured channels such as email or Slack.

Some services, such as Sidekiq workers, may not have such stringent requirements on uptime or may not have any notion of a "phased restart". In that case the config can use restart in place of reload.
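For a feel of what a memory rule like the one above boils down to, here is a hand-rolled approximation in shell. It is purely illustrative and says nothing about how Inspeqtor is implemented; to_kb, rss_kb and over_limit are names invented for this sketch.

```shell
# to_kb SIZE: convert a size such as 512m or 2g into kilobytes.
to_kb() {
  case "$1" in
    *g) echo $(( ${1%g} * 1024 * 1024 )) ;;
    *m) echo $(( ${1%m} * 1024 )) ;;
    *k) echo "${1%k}" ;;
    *)  echo "$1" ;;  # assumption: a bare number is already in kilobytes
  esac
}

# rss_kb PID: resident set size of a process in kilobytes, as reported by ps.
rss_kb() {
  ps -o rss= -p "$1" | tr -d ' '
}

# over_limit RSS_KB LIMIT: succeed when the measured RSS exceeds the limit.
over_limit() {
  [ "$1" -gt "$(to_kb "$2")" ]
}
```

A cron-style check could then be written as over_limit "$(rss_kb "$pid")" 2g && sudo /sbin/reload puma, which is exactly the kind of glue a dedicated monitoring tool saves you from maintaining.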

Keep Deployment Simple

Make the most of the tools that are available to you. Some of them, such as Upstart, can be leveraged to great effect with a tiny bit of configuration and some outside monitoring. Converting a system from a set of custom deployment recipes that manage logs, sockets and pid files to one that manages and maintains itself will make it vastly more stable and predictable.