Zero Downtime Deployment With Unicorn And Capistrano

Every time we fix a bug or add a new feature, we want it to go live fast to have happy users at gigmit. That's why we really love to deploy several times a day. While running a pretty standard Rails stack with Unicorn and handling deployments with Capistrano, we always had a few seconds of downtime while the Unicorn workers were restarted. We did this only at night, because we couldn't always be sure[1] that they would really come back without errors.

So I searched the internet for how others do this with Unicorn and found many different deploy scripts and blog posts. I tried to get my head around all of it, and the scripts we're using now are basically a compilation of everything I found there. And because I don't really like bash scripting, I tried to do as much as possible in Ruby.

Basically there are three files involved in the whole process: the Unicorn config file, which is checked into our app's git repo; an init.d script for handling Unicorn as a service on Linux; and our app's config/deploy.rb, because Capistrano wants you to overwrite its empty restart task to tell it what to do at the end of a successful deployment.

The Unicorn config is at config/unicorn.rb and has the following content:

require 'etc' # used by the Etc.getpwnam / Etc.getgrnam calls in after_fork below

root = "/home/deployer/current"

pid           "#{root}/tmp/pids/unicorn.pid"
stderr_path   "#{root}/log/unicorn_error.log"
stdout_path   "#{root}/log/unicorn.log"

listen "/tmp/unicorn.gigmit.com.socket", :backlog => 2048

preload_app true
working_directory root
worker_processes 3
timeout 30

before_fork do |server, worker|
  # the following is highly recommended for Rails + "preload_app true"
  # as there's no need for the master process to hold a connection
  defined?(ActiveRecord::Base) and ActiveRecord::Base.connection.disconnect!

  ##
  # When sent a USR2, Unicorn will suffix its pidfile with .oldbin and
  # immediately start loading up a new version of itself (loaded with a new
  # version of our app). When this new Unicorn is completely loaded
  # it will begin spawning workers. The first worker spawned will check to
  # see if an .oldbin pidfile exists. If so, this means we've just booted up
  # a new Unicorn and need to tell the old one that it can now die. To do so
  # we send it a QUIT.
  #
  # Using this method we get 0 downtime deploys.

  old_pid = "#{root}/tmp/pids/unicorn.pid.oldbin"
  if File.exist?(old_pid) && server.pid != old_pid
    begin
      Process.kill("QUIT", File.read(old_pid).to_i)
    rescue Errno::ENOENT, Errno::ESRCH
      # someone else did our job for us
    end
  end
end

after_fork do |server, worker|
  ##
  # Unicorn master loads the app then forks off workers - because of the way
  # Unix forking works, we need to make sure we aren't using any of the parent's
  # sockets, e.g. db connection

  defined?(ActiveRecord::Base) and ActiveRecord::Base.establish_connection
  # Redis and Memcached would go here but their connections are established
  # on demand, so the master never opens a socket

  ##
  # Unicorn master is started as root, which is fine, but let's
  # drop the workers to deployer:deployer

  begin
    uid, gid = Process.euid, Process.egid
    user, group = 'deployer', 'deployer'
    target_uid = Etc.getpwnam(user).uid
    target_gid = Etc.getgrnam(group).gid
    worker.tmp.chown(target_uid, target_gid)
    if uid != target_uid || gid != target_gid
      Process.initgroups(user, target_gid)
      Process::GID.change_privilege(target_gid)
      Process::UID.change_privilege(target_uid)
    end
  rescue => e
    if defined?(Rails) && Rails.env.development?
      STDERR.puts "couldn't change user, oh well"
    else
      raise e
    end
  end
end

Besides some other stuff, the important part happens in the before_fork hook. It runs in the new master right before it spawns its first worker, which it only gets to do if the new version of the app booted without errors. At that point it checks whether an old pidfile exists, and if so, it tells the old master process to finish all pending requests and quit. In other words: the old version is only killed once we have a new version of the app up and running, so if anything fails during startup, the old version of the app stays available.
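By the way, you can watch this handover by hand, without any init script. A minimal sketch, assuming the master is already running and using the pidfile path from the config above:

kill -USR2 "$(cat /home/deployer/current/tmp/pids/unicorn.pid)"

# The old master re-executes itself; while both masters are alive you'll
# see unicorn.pid.oldbin next to the new unicorn.pid, until the first new
# worker sends QUIT and the .oldbin file disappears:
watch -n1 ls /home/deployer/current/tmp/pids/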

To handle Unicorn with init.d, we have this script in /etc/init.d/unicorn:

#!/bin/sh
### BEGIN INIT INFO
# Provides:          unicorn
# Required-Start:    $remote_fs $syslog
# Required-Stop:     $remote_fs $syslog
# Default-Start:     2 3 4 5
# Default-Stop:      0 1 6
# Short-Description: Manage unicorn server
# Description:       Start, stop, restart unicorn server for a specific application.
### END INIT INFO
set -e

# Feel free to change any of the following variables for your app:
export RUBY_GC_HEAP_MIN_SLOTS=800000
export RUBY_GC_HEAP_SLOTS_INCREMENT=250000
export RUBY_GC_HEAP_SLOTS_GROWTH_FACTOR=1
export RUBY_GC_MALLOC_LIMIT=50000000

TIMEOUT=${TIMEOUT-60}
APP_ROOT=/home/deployer/current
PID=$APP_ROOT/tmp/pids/unicorn.pid
CMD="cd $APP_ROOT && BUNDLE_GEMFILE=$APP_ROOT/Gemfile bundle exec unicorn -D -c $APP_ROOT/config/unicorn.rb -E production"
AS_USER=deployer
set -u

OLD_PID="$PID.oldbin"

sig () {
  test -s "$PID" && kill -s $1 `cat $PID`
}

run () {
  if [ "$(id -un)" = "$AS_USER" ]; then
    eval $1
  else
    su -c "$1" - $AS_USER
  fi
}

case "$1" in
start)
  sig 0 && echo >&2 "Already running" && exit 0
  run "$CMD"
  ;;
stop)
  sig QUIT && exit 0
  echo >&2 "Not running"
  ;;
force-stop)
  sig TERM && exit 0
  echo >&2 "Not running"
  ;;
restart|reload)
  sig HUP && echo reloaded OK && exit 0
  echo >&2 "Couldn't reload, starting '$CMD' instead"
  run "$CMD"
  ;;
upgrade)
  if sig USR2 && sleep 3
  then
    n=$TIMEOUT
    while test -s $OLD_PID && test $n -ge 0
    do
      printf '.' && sleep 1 && n=$(( $n - 1 ))
    done
    echo

    if test $n -lt 0 && test -s $OLD_PID
    then
      echo >&2 "$OLD_PID still exists after $TIMEOUT seconds"
      exit 1
    fi
    exit 0
  fi
  echo >&2 "Couldn't upgrade, starting '$CMD' instead"
  run "$CMD"
  ;;
reopen-logs)
  sig USR1
  ;;
*)
  echo >&2 "Usage: $0 <start|stop|restart|upgrade|force-stop|reopen-logs>"
  exit 1
  ;;
esac

On a deployment we only call service unicorn upgrade, which sends a USR2 signal to the running Unicorn master. The script then waits up to 60 seconds for the old pidfile to disappear, which, as we learned above, only happens if a new process came up successfully; otherwise it fails and interrupts the deployment.
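In case you're setting this up from scratch: on a Debian-style system, registering the script boils down to something like the following (update-rc.d is Debian/Ubuntu-specific; RHEL-type distributions use chkconfig instead):

sudo chmod +x /etc/init.d/unicorn
sudo update-rc.d unicorn defaults   # register the runlevels from the INIT INFO block
sudo service unicorn start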

To let Capistrano handle all of this automatically, I added the following to our config/deploy.rb:

namespace :deploy do
  desc "restart (upgrade) unicorn server"
  task :restart, roles: :app, except: {no_release: true} do
    run "service unicorn upgrade"
  end
end
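One caveat here: the comment in the Unicorn config above assumes the master is started as root, and the deploy user can't signal a root-owned process. If that's your setup too, a sudoers entry along these lines (hypothetical file name, and the path to service may differ on your distribution) lets the deployer user run exactly this one command without a password:

# Sketch of an /etc/sudoers.d/unicorn entry; always validate with visudo
echo 'deployer ALL=(root) NOPASSWD: /usr/sbin/service unicorn upgrade' \
  | sudo tee /etc/sudoers.d/unicorn
sudo chmod 0440 /etc/sudoers.d/unicorn
sudo visudo -c

The Capistrano task would then call run "sudo service unicorn upgrade" instead.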

And that means: we now have zero downtime deployments with a simple cap production deploy. Woop Woop!
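If you want to see the "zero" for yourself, let a loop like this run against your app while a deploy is in flight; with the setup above, the stream of 200s shouldn't be interrupted (the URL is just an example, substitute your own):

while true; do
  curl -s -o /dev/null -w '%{http_code}\n' https://gigmit.com/
  sleep 1
done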

PS: We had some problems starting new Unicorn workers with a new set of gem versions. We fixed that by explicitly specifying the Gemfile (the BUNDLE_GEMFILE export) in the init.d script.

[1] We have a staging environment, but it's not a 100% exact copy of our production system, because of some payment provider foo. But that's probably worth another blog post.