Zero Downtime Deployments, Zero Docker
The Problem
When I want to make a change to this blog site, the deployment process is simple:
- Generate all of the updated files.
- Swap out the old binary for the new one on the server.
- Restart the server.
This simplicity comes with one major downside: during the final step, while the old process dies and the new one starts, the website is down.
For those few seconds, any user trying to load the site sees a 502 Bad Gateway error or a connection refused message.
Below is the story of how I engineered my system to achieve 100% uptime during deployments, without sacrificing simplicity.
Minimalism
Zero-downtime deployments are usually considered a "solved problem" involving Docker containers, load balancers, and rolling updates in Kubernetes.
But that felt like bringing an aircraft carrier to a fishing trip.
I wanted to keep my site lightweight. My entire server runs on just 11MB of RAM, aligning with my ethos of extreme efficiency. I didn't want to add gigabytes of overhead just to manage deployments. Because of this, I decided to implement my own mechanism for zero-downtime deploys on bare metal.
The Concept: Passing the Torch
In a standard restart, the operating system closes the network port when the old process dies, and the new process has to open it again. That gap is where the downtime happens.
To fix this, we use a technique called socket inheritance: a mechanism that lets a child process spawned by the server inherit its open file descriptors. To use it, our blog server must spawn a child server and hand over its listening socket. The process looks like this:
- Old Process is serving traffic.
- Update signal is received.
- Old Process starts New Process and passes the "listening socket."
- New Process starts accepting connections immediately.
- Old Process finishes its current requests and exits.
From the outside, the port never closes. Thus, the server never goes offline.
The Implementation
To achieve this in Go, I used a library called cloudflare/tableflip. Tableflip handles passing the socket from the parent process to the child for us, and it also provides some nice synchronization mechanisms between the two.
First, in the main function of our server, we instantiate tableflip.
```go
upg, err := tableflip.New(tableflip.Options{
	PIDFile: "blog.pid",
})
if err != nil {
	log.Fatalf("failed to init tableflip: %v", err)
}
defer upg.Stop()

go handleSignals(upg)
```
Notice that I start the function handleSignals in a new goroutine. This function receives signals from the operating system: on SIGHUP it triggers an upgrade of the running code by spawning a child process, and on SIGINT or SIGTERM it terminates gracefully.
```go
func handleSignals(upg *tableflip.Upgrader) {
	sig := make(chan os.Signal, 1)
	signal.Notify(sig, syscall.SIGHUP, syscall.SIGTERM, syscall.SIGINT)
	for s := range sig {
		if s == syscall.SIGTERM || s == syscall.SIGINT {
			log.Println("SIGTERM / SIGINT received. Closing server..")
			upg.Stop()
			return
		}
		log.Println("SIGHUP received: upgrading binary...")
		if err := upg.Upgrade(); err != nil {
			log.Printf("Upgrade failed: %v", err)
		}
	}
}
```
Then, we create the main server and the redirect server, serving traffic through tableflip's listeners.
```go
srv, redirectSrv, _ := NewServer(engineCtx, coloredLog, "./public", domain, serviceStartTime)

g := new(errgroup.Group)

listenAddr := srv.Addr
if listenAddr == "" {
	listenAddr = ":8080"
}

ln, err := upg.Listen("tcp", listenAddr)
if err != nil {
	log.Fatalf("failed to listen on %s: %v", listenAddr, err)
}

g.Go(func() error {
	if domain != "" {
		return srv.ServeTLS(ln, "", "")
	}
	return srv.Serve(ln)
})

if redirectSrv != nil {
	ln80, err := upg.Listen("tcp", ":80")
	if err != nil {
		log.Fatalf("failed to listen on :80: %v", err)
	}

	g.Go(func() error {
		log.Println("Redirect server listening on :80 via tableflip")
		return redirectSrv.Serve(ln80)
	})
}
```
Finally, the child process signals that it is ready. The parent process, which is blocking on <-upg.Exit(), receives the ready signal and exits.
```go
if err := upg.Ready(); err != nil {
	log.Fatalf("tableflip ready failed: %v", err)
}

log.Printf("Servers ready. Waiting for exit signal...")

<-upg.Exit()

log.Printf("Received exit signal. Shutting down servers...")
```
That covers the Go implementation. Now we need to discuss the deployment process.
Changes to Deployment Process
Now our server listens for SIGHUP signals and is capable of spawning a new server, which it will pass its open ports to.
However, there are some logistical things we need to do when deploying changes:
- During our automated deployment, we need to swap the old Go code for the new Go code.
- Then, we need to send a SIGHUP to the running Go process.
Systemd Dilemma
We have a problem, though. We use systemd to run our app as a daemon, and with systemd's default behavior, this upgrade process won't work. Here's why:
- systemd watches the parent server process. By default, if that process exits, systemd will try to restart it (this is actually a good thing, it provides resilience.)
- This directly conflicts with tableflip, though: tableflip spawns a child process and then makes the parent process exit, which triggers systemd to restart the daemon. Now our app is offline for five seconds again. Argh!
Systemd Solution
To get around this behavior, we create a wrapper process which watches the server. Here's the breakdown behind why this works:
- The wrapper is watched by systemd. The wrapper never exits, it only watches the server.
- Because of this, systemd doesn't get triggered when the parent server process exits - no daemon reload we don't want/need. It's watching the wrapper, which never goes down without instruction to do so.
- The wrapper listens for SIGTERM in case the blog needs to actually be shut down for some reason.
Here's what the script looks like. I name it run.sh locally; on the server it lives at /root/run-blog.sh, which is the path the systemd unit and deploy pipeline use:
```bash
#!/bin/bash
PID_FILE="/root/blog.pid"
BINARY="/root/myblog"

# Trap SIGTERM only (to stop the app when systemd stops)
trap 'kill -TERM $(cat $PID_FILE); exit 0' SIGTERM

$BINARY &
CURRENT_PID=$!
echo "Wrapper: Started initial process $CURRENT_PID"

while true; do
    # Wait for the process to exit.
    # tail --pid blocks until the PID dies.
    tail --pid=$CURRENT_PID -f /dev/null

    # Check if it was an upgrade (PID file changed)
    NEW_PID=$(cat $PID_FILE 2>/dev/null)
    if [ "$NEW_PID" != "$CURRENT_PID" ] && [ -n "$NEW_PID" ]; then
        echo "Wrapper: Upgrade detected. Switching watch -> $NEW_PID"
        CURRENT_PID=$NEW_PID
    else
        # It was a real crash
        echo "Wrapper: Process $CURRENT_PID exited. Exiting."
        exit 1
    fi
done
```
One very important detail: we need to modify our .service file to do two things that change the behavior of our daemon:
- When it receives a reload signal, it sends a SIGHUP to our Go app.
- When it receives a "start" signal, it starts the wrapper instead of the Go app directly.
```ini
[Service]
ExecStart=/root/run-blog.sh
ExecReload=/bin/bash -c 'kill -HUP $(cat /root/blog.pid)'
KillMode=control-group
Restart=always
```
The Smart Deploy
One final edge case remained: What if I update the run.sh wrapper script itself? Since run.sh is running in memory, swapping the file on disk won't update the active process. In that specific case, we do need a full restart.
To solve this, my deployment pipeline performs a checksum comparison.
- It compares the SHA256 hash of the local run.sh vs the remote run.sh.
- If they match, it performs a Reload (Zero Downtime).
- If they differ, it performs a Restart (Safety First).
```bash
OLD_HASH=$(sha256sum /root/run-blog.sh | awk '{print $1}')
NEW_HASH=$(sha256sum /root/temp_deploy/run.sh | awk '{print $1}')

if [ "$OLD_HASH" != "$NEW_HASH" ]; then
    DEPLOY_STRATEGY="restart"
else
    DEPLOY_STRATEGY="reload"
fi

systemctl $DEPLOY_STRATEGY myblog
```

(Note that sha256sum prints the hash followed by the filename, so we compare only the first field; otherwise the two different paths would guarantee a mismatch on every deploy.)
The Result
With the wrapper in place and tableflip handling the sockets, the logs now show a perfect handoff. The old process spawns the new one, hands over the sockets, and retires, all while the wrapper script watches silently from the sidelines.
[63127] 2025/12/02 12:25:08 SIGHUP received: upgrading binary...
[64028] 2025/12/02 12:25:08 Process PID: 64028
[64028] 2025/12/02 12:25:08 Service Original Start: 2025-12-02T11:26:14-06:00 (Current Uptime: 58m54.289333s)
[64028] 2025/12/02 12:25:08 configuring local environment on port :8080
[64028] 2025/12/02 12:25:08 error getting S3 client: environment is not prod - not configuring backups
[64028] 2025/12/02 12:25:08 Servers ready. Waiting for exit signal...
[63127] 2025/12/02 12:25:08 Received exit signal. Shutting down servers...
[63127] 2025/12/02 12:25:08 Shutdown finished.