Zero Downtime Deployment With Unicorn And Capistrano
Every time we fix a bug or add a new feature, we want it to go live fast to keep our users at gigmit happy. That's why we really love to deploy several times a day. We run a pretty standard Rails stack with Unicorn and handle deployments with Capistrano, but every deploy meant a few seconds of downtime while the Unicorn workers were restarted. We only did this at night, because we could never be quite sure[1] that the workers would really come back without errors.
So I searched the internet for how others handle this with Unicorn and found many different deploy scripts and blog posts. I tried to get my head around all of it, and the scripts we're using now are basically a compilation of the pieces I found there. And because I don't really like bash scripting, I tried to do as much as possible in Ruby.
There are three files involved in the whole process: the Unicorn config file, which is checked into our app's git repo; an init.d script that manages Unicorn as a service on Linux; and our app's config/deploy.rb, because Capistrano wants you to override its empty restart task to tell it what to do at the end of a successful deployment.
The Unicorn config lives at config/unicorn.rb and has the following content:
root = "/home/deployer/current"
pid "#{root}/tmp/pids/unicorn.pid"
stderr_path "#{root}/log/unicorn_error.log"
stdout_path "#{root}/log/unicorn.log"
listen "/tmp/unicorn.gigmit.com.socket", :backlog => 2048
preload_app true
working_directory root
worker_processes 3
timeout 30
before_fork do |server, worker|
  # the following is highly recommended for Rails + "preload_app true"
  # as there's no need for the master process to hold a connection
  defined?(ActiveRecord::Base) and ActiveRecord::Base.connection.disconnect!

  ##
  # When sent a USR2, Unicorn will suffix its pidfile with .oldbin and
  # immediately start loading up a new version of itself (loaded with a new
  # version of our app). When this new Unicorn is completely loaded
  # it will begin spawning workers. The first worker spawned will check to
  # see if an .oldbin pidfile exists. If so, this means we've just booted up
  # a new Unicorn and need to tell the old one that it can now die. To do so
  # we send it a QUIT.
  #
  # Using this method we get 0 downtime deploys.
  old_pid = "#{root}/tmp/pids/unicorn.pid.oldbin"
  if File.exists?(old_pid) && server.pid != old_pid
    begin
      Process.kill("QUIT", File.read(old_pid).to_i)
    rescue Errno::ENOENT, Errno::ESRCH
      # someone else did our job for us
    end
  end
end
after_fork do |server, worker|
  ##
  # Unicorn master loads the app then forks off workers - because of the way
  # Unix forking works, we need to make sure we aren't using any of the parent's
  # sockets, e.g. db connection
  defined?(ActiveRecord::Base) and ActiveRecord::Base.establish_connection
  # Redis and Memcached would go here but their connections are established
  # on demand, so the master never opens a socket

  ##
  # Unicorn master is started as root, which is fine, but let's
  # drop the workers to deployer:deployer
  begin
    uid, gid = Process.euid, Process.egid
    user, group = 'deployer', 'deployer'
    target_uid = Etc.getpwnam(user).uid
    target_gid = Etc.getgrnam(group).gid
    worker.tmp.chown(target_uid, target_gid)
    if uid != target_uid || gid != target_gid
      Process.initgroups(user, target_gid)
      Process::GID.change_privilege(target_gid)
      Process::UID.change_privilege(target_uid)
    end
  rescue => e
    if ENV['RAILS_ENV'] == 'development'
      STDERR.puts "couldn't change user, oh well"
    else
      raise e
    end
  end
end
Besides some other stuff, the important part happens in the before_fork block. When a new worker has come up successfully and is about to accept requests (which means the app still boots after a change), it checks whether an old PID file exists. If it does, the new worker tells the old master process to finish all pending requests and quit. In other words, we only kill the old master once a new version of the app is up and running. So if something fails during startup, the old version of the app is still available.
To handle Unicorn with init.d, we have this script in /etc/init.d/unicorn:
#!/bin/sh
### BEGIN INIT INFO
# Provides: unicorn
# Required-Start: $remote_fs $syslog
# Required-Stop: $remote_fs $syslog
# Default-Start: 2 3 4 5
# Default-Stop: 0 1 6
# Short-Description: Manage unicorn server
# Description: Start, stop, restart unicorn server for a specific application.
### END INIT INFO
set -e

# Feel free to change any of the following variables for your app:
export RUBY_GC_HEAP_MIN_SLOTS=800000
export RUBY_GC_HEAP_SLOTS_INCREMENT=250000
export RUBY_GC_HEAP_SLOTS_GROWTH_FACTOR=1
export RUBY_GC_MALLOC_LIMIT=50000000

TIMEOUT=${TIMEOUT-60}
APP_ROOT=/home/deployer/current
PID=$APP_ROOT/tmp/pids/unicorn.pid
CMD="cd $APP_ROOT && BUNDLE_GEMFILE=$APP_ROOT/Gemfile bundle exec unicorn -D -c $APP_ROOT/config/unicorn.rb -E production"
AS_USER=deployer
set -u

OLD_PIN="$PID.oldbin"

sig () {
  test -s "$PID" && kill -s $1 `cat $PID`
}

run () {
  if [ "$(id -un)" = "$AS_USER" ]; then
    eval $1
  else
    su -c "$1" - $AS_USER
  fi
}

case "$1" in
start)
  sig 0 && echo >&2 "Already running" && exit 0
  run "$CMD"
  ;;
stop)
  sig QUIT && exit 0
  echo >&2 "Not running"
  ;;
force-stop)
  sig TERM && exit 0
  echo >&2 "Not running"
  ;;
restart|reload)
  sig HUP && echo reloaded OK && exit 0
  echo >&2 "Couldn't reload, starting '$CMD' instead"
  run "$CMD"
  ;;
upgrade)
  if sig USR2 && sleep 3
  then
    n=$TIMEOUT
    while test -s $OLD_PIN && test $n -ge 0
    do
      printf '.' && sleep 1 && n=$(( $n - 1 ))
    done
    echo

    if test $n -lt 0 && test -s $OLD_PIN
    then
      echo >&2 "$OLD_PIN still exists after $TIMEOUT seconds"
      exit 1
    fi
    exit 0
  fi
  echo >&2 "Couldn't upgrade, starting '$CMD' instead"
  run "$CMD"
  ;;
reopen-logs)
  sig USR1
  ;;
*)
  echo >&2 "Usage: $0 <start|stop|restart|upgrade|force-stop|reopen-logs>"
  exit 1
  ;;
esac
On a deployment we only call service unicorn upgrade, which sends a USR2 signal to the running Unicorn master process. The script then waits up to 60 seconds for the old PID file to disappear, which, as we learned above, only happens once a new master has come up successfully. Otherwise it fails and interrupts the deployment.
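To convince yourself that the upgrade really is hitless, a simple check is to keep requesting the app while the upgrade runs and count failed responses. Below is a minimal Ruby sketch of such a check; the URL is a made-up placeholder and the script is not part of our deploy setup:
#!/usr/bin/env ruby
# Poll the app once per second and report any non-200 response.
# Run it in a second terminal while `service unicorn upgrade` is in progress.
require 'net/http'
require 'uri'

url = URI("http://www.example.com/") # placeholder, use your app's URL
failures = 0

60.times do
  begin
    status = Net::HTTP.get_response(url).code
  rescue StandardError => e
    status = e.class.name
  end
  failures += 1 unless status == "200"
  puts "#{Time.now.strftime('%H:%M:%S')} #{status}"
  sleep 1
end

puts(failures.zero? ? "no failed requests" : "#{failures} failed requests")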
To let Capistrano handle all of this stuff automatically, I added the following to our config/deploy.rb:
namespace :deploy do
  desc "restart (upgrade) unicorn server"
  task :restart, roles: :app, except: {no_release: true} do
    run "service unicorn upgrade"
  end
end
This means we now have zero downtime deployments with a simple cap production deploy. Woop Woop!
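One detail worth double-checking in this setup: both the Unicorn config and the init.d script look for the pid files under /home/deployer/current/tmp/pids, and current is a symlink that gets repointed on every deploy. The old master's pid file therefore has to live in a directory that is shared across releases, otherwise the upgrade action won't find it and will fall back to a cold start. Capistrano 2 shares tmp/pids by default; the following line for config/deploy.rb just spells that assumption out and is a sketch rather than part of our actual config:
# make sure tmp/pids is symlinked into shared/ so the running master's
# pid file is still visible after `current` points to a new release
set :shared_children, %w(public/system log tmp/pids)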
PS: We had some problems starting new Unicorn workers with a new set of gem versions. We fixed that by explicitly specifying the Gemfile in the init.d script.
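Another way to tackle the Gemfile problem, which we didn't end up using, is Unicorn's before_exec hook: it runs right before the old master re-executes itself on USR2, so you can pin BUNDLE_GEMFILE to the current release there. A rough sketch for config/unicorn.rb:
# runs in the old master just before it re-executes the unicorn binary
# on USR2, so the new master loads the Gemfile of the current release
before_exec do |server|
  ENV['BUNDLE_GEMFILE'] = "/home/deployer/current/Gemfile"
end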
[1] We have a staging environment, but it's not a 100% exact copy of our production system because of some payment provider foo. But that's probably worth another blog post.