Safe rolling deploys: why yours probably isn't
Swapping containers without downtime sounds simple: pull the new image, kill the old container, start the new one. It works until the first Friday at 5 p.m. These are the six details that separate a real rolling deploy from theater.
Every engineering team running containers in production sooner or later writes a sentence like this in the status channel: "deploy completed, no downtime". The sentence is optimistic. In at least half the cases we've audited (homemade scripts, popular self-hosted panels, official tutorials that became the reference), what actually happened was a 5 to 30 second window where the load balancer returned 502, some uploads were cut off midway, and nobody noticed because monitoring samples once a minute.
Rolling deploy looks like the simplest orchestration problem: you have N containers running an old version, you want N containers running a new version, and you want the app to keep responding during the swap. The conceptual recipe fits in one sentence: replace old containers with new ones, one at a time, keeping traffic directed only at whatever is alive. What makes this hard isn't the strategy; it's the set of six details that all need to be right at the same time. Each one alone looks like an implementation detail. The six together are the difference between a genuinely no-downtime deploy and a deploy that looks downtime-free on Friday morning but hides thirty seconds of errors in the middle of the 5 p.m. peak.
This post maps the six. At the end there's a recipe in spec format, an honest comparison of who implements what, and a set of tests you can run against your current system to find out whether it's broken, before your customers find out first.
Detail 1 — Health check before promoting the new container
The most common error in a homemade rolling deploy is trusting the running state the container runtime reports. You bring up a new container, Docker (or its equivalent) marks it as running within milliseconds, and your script takes that as proof it can kill the old one. So it kills the old one. The new one, internally, is still starting: waiting for a database connection, loading a cache into memory, downloading feature flag configuration, opening a thread pool. During that interval, the load balancer routes traffic to a process that isn't ready to receive it, and users get 502 or 503.
The window is short, usually between 5 and 30 seconds per container, and that's exactly why it fools monitoring. If your error metric is sampled every minute and you swap five containers in sequence, each with 10 seconds of "running but not ready", the spike doesn't always land on a collection window. You're left with the statistical impression that everything went well.
What rolling should do instead is separate two concepts: "the process is running" and "the process is ready to serve traffic". Running is runtime state; ready is an affirmative response from an endpoint the app itself exposes, /healthz, /readyz, or equivalent. The orchestrator issues an HTTP GET against that endpoint on the new container, waits for a 200, waits for that response to be sustained for a period (min_healthy_time, typically 10 seconds), and only then removes the old container from the load balancer.
The min_healthy_time is the detail many people skip. It exists because a single 200 means little: the app may have answered once right before a critical connection drops and start failing in the next second. Waiting for 10 consecutive seconds of healthy responses filters out those false positives without dragging the deploy out absurdly.
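To make the gate concrete, here's a minimal sketch of the promotion check as a shell loop. The endpoint URL, port, and timings are placeholders, not a prescribed interface; adapt them to your setup.

#!/bin/sh
# Sketch: only report success after the new container has answered 200 on its
# readiness endpoint for MIN_HEALTHY consecutive seconds, within DEADLINE.
URL="http://127.0.0.1:8081/healthz"   # placeholder address of the new container
MIN_HEALTHY=10                        # seconds of sustained 200s required
DEADLINE=300                          # give up after this many seconds

elapsed=0
healthy=0
while [ "$elapsed" -lt "$DEADLINE" ]; do
  if curl -fsS -o /dev/null --max-time 2 "$URL"; then
    healthy=$((healthy + 1))
  else
    healthy=0                         # any failure resets the sustained counter
  fi
  [ "$healthy" -ge "$MIN_HEALTHY" ] && exit 0   # safe to drain the old container
  sleep 1
  elapsed=$((elapsed + 1))
done
exit 1                                # never became ready: leave the old container alone

Only after this script exits 0 should the old container leave the load balancer.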
Classic Watchtower, a popular tool for updating containers when new image tags appear, does none of this. It pulls, stops the old container, starts the new one. Coolify and Dokploy implement it partially, depending on the application configuration and the type of health check you enabled. A serious cluster orchestrator treats this as a minimum requirement.
Detail 2 — Connection draining and graceful shutdown
Even when you tell the load balancer to stop sending new connections to the old container, there are still in-flight connections: file uploads, large downloads, long requests waiting on a database response, open websockets, event streams. If you simply send SIGKILL (or let the runtime send it after a timeout that's too short), all of those connections are cut off midway. The user sees a network error at the exact moment of the deploy.
The correct flow has four ordered steps. First, you signal the load balancer that the old container should no longer receive new connections; that's usually a removal from the balancer pool or a node drain. Second, you send SIGTERM to the process. SIGTERM is a catchable signal; the app can handle it and start a graceful shutdown. Third, you wait. The timeout depends on the application profile: 30 to 60 seconds covers the vast majority of web apps, while APIs that accept large file uploads may need 120 or more. Fourth, and only after that timeout, you send SIGKILL to whatever hasn't finished.
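Steps two through four map directly onto plain Docker, as a sketch. Step one depends on your load balancer (an upstream reload for nginx, a target deregistration for a managed LB), so it appears here only as a comment; the container name is a placeholder.

# Assumes the load balancer has already stopped routing new traffic to web_old.
docker stop -t 60 web_old   # sends SIGTERM, waits up to 60 seconds, then SIGKILL
docker rm web_old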
There's a known pitfall in this step: the app itself needs to handle SIGTERM. Node, Rails, Django, the Go HTTP server: all of them have middleware or helpers for this, but none of it is turned on by default in a basic template. If your application doesn't catch SIGTERM, the signal becomes a no-op and the orchestrator will wait out the full timeout before killing with SIGKILL. The result is a slow deploy and, even then, cut connections, because the app only found out it was dying at the moment of the KILL.
Check your app. Specifically: when it receives SIGTERM, does it stop accepting new connections, wait for in-flight connections to end, and only then close? If the answer is "I don't know", your rolling deploy is broken on this detail.
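If you're unsure, a crude check is to time how long the container takes to stop. This sketch assumes a locally available image tag, and it only tells you whether SIGTERM is handled at all, not whether the drain is graceful.

docker run -d --name sigterm-probe my-app:latest   # placeholder image
sleep 5                                            # let it finish booting
time docker stop -t 30 sigterm-probe
docker rm sigterm-probe
# ~30 seconds means the app ignored SIGTERM and was SIGKILLed at the deadline;
# a second or two means the signal is at least being handled.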
Detail 3 — Previous image pre-pulled for fast rollback
Critical bug in production, three minutes after the deploy. You need to revert to the previous version now. Pulling the old image takes 30 to 60 seconds per node, because, of course, you "cleaned up" old images to save disk, or the node's image cache has already rotated them out. Multiply by the number of replicas, add the orchestration time of the swap itself, and your five-minute incident becomes fifteen. Fifteen minutes is the difference between "momentary instability" in the postmortem and "incident reported to the customer".
The fix is trivial and almost nobody implements it: keep the N-1 image pre-pulled on the nodes that ran it. Rollback becomes changing the tag being pointed at and restarting: an operation of about 10 seconds per container, dominated by waiting for the old version's health check to come back to life.
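In plain Docker terms, the habit looks roughly like this; registry, tags, and names are made up for the sketch.

# At deploy time of v42, make sure v41 stays cached on the node:
docker pull registry.example.com/app:v41
# Rollback later is a local operation, no network pull involved:
docker stop -t 60 app_v42
docker run -d --name app_v41 --env-file /etc/app/env registry.example.com/app:v41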
The more sophisticated version is keeping a snapshot of the job's complete state: not just the image, but environment variables, network configuration, associated secrets, allocated resources. Partial rollback (image only) covers most cases, but it doesn't cover a regression introduced in a feature flag or a connection string. A full snapshot is what separates fast rollback from complete rollback.
Detail 4 — Automatic failure detection and auto-revert
A common scenario with homemade scripts: the deploy goes out, the new container enters a crash loop (comes back as running, dies within 5 seconds, comes back, dies again). The system sits there waiting for someone to notice the problem and abort manually. If that happens at four in the morning and the alert goes to a Slack channel nobody is watching, the downtime stretches until someone wakes up.
What rolling should do is define a healthy_deadline: an absolute ceiling within which the new container has to enter, and remain in, a healthy state. Our default is 300 seconds; five minutes covers apps that take a while to initialize (heavy Java apps, apps with cache warm-up) without granting an open-ended margin. If the deadline passes and the container isn't healthy, the orchestrator automatically reverts to the previous version. It alerts the team afterward, without urgency, because the system has already protected itself.
The practical implementation combines two signals: the restart count on the new container (more than 3 restarts in 60 seconds means a crash loop) and the total time elapsed without a sustained positive health check. Either of the two firing before min_healthy_time is reached aborts that replica's deploy and triggers the revert.
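The crash-loop signal can be read straight from the runtime. This sketch assumes the new container was started with --restart=on-failure so Docker tracks restarts, and that the elapsed-time signal is handled by the readiness loop shown in detail 1; names and the rollback script are hypothetical.

restarts=$(docker inspect --format '{{.RestartCount}}' web_new)
if [ "$restarts" -gt 3 ]; then
  echo "crash loop on web_new ($restarts restarts), aborting deploy and reverting"
  ./rollback.sh   # hypothetical revert step; only cheap if detail 3 is in place
fi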
A subtle detail: auto-revert only makes sense if detail 3 (previous image pre-pulled) is implemented. Reverting to an image that needs to be downloaded again during auto-revert nullifies the gain.
Detail 5 — max_parallel: 1 in multi-instance cluster
You have five replicas of the same service. It's tempting to swap all five at once: deploy in parallel, wait for all of them to be healthy, done. That path has three problems. First, during the window when they're all being swapped, all traffic goes through the new version; if it has a bug, 100% of users feel it, with no fallback. Second, the resource-usage peak during the swap is 2× (old and new coexist for a moment), which can blow past node memory and cause cascading OOMs. Third, you throw away the cheap opportunity to detect a regression before it propagates.
Rolling should swap one replica at a time (max_parallel: 1) or, in very large clusters, a small fraction (10-25%). Swap the first, wait for healthy, wait for min_healthy_time, swap the second, and so on. Total service capacity is maintained throughout the deploy window: if five replicas handled the traffic before, four of one version plus one of the other handle it too.
The tradeoff is time. Ten replicas at 30 seconds each of swap plus min_healthy_time is five minutes of total deploy window. It's not fast. In exchange, you get cheap rollback: if the first new replica fails, the orchestrator stops swapping the others. You're left with nine old plus one failed new; discard the failed one and you're back to the previous state with no capacity impact. The cost of a slow deploy buys the safety of never having the entire cluster running a buggy version before anyone notices.
There's one extra knob that helps: stagger, the interval between consecutive swaps. A reasonable default is 30 seconds. That delay lets the metrics and logs of the just-swapped container be collected and evaluated before moving on to the next one: the minimum window for catching a bug that only shows up under real traffic. A sketch of the loop follows.
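As a rough outline, assuming helper scripts for the swap and for the readiness wait (both hypothetical, the wait being the loop from detail 1):

STAGGER=30
for i in 1 2 3 4 5; do
  ./swap_replica.sh "web_$i"                 # start new, drain and stop old
  ./wait_until_healthy.sh "web_$i" || {
    echo "replica web_$i never became healthy, stopping the rollout"
    exit 1                                   # the rest of the fleet is still on the old version
  }
  sleep "$STAGGER"                           # let this replica's metrics and logs settle
done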
Detail 6 — Pre-stop hooks for long-running jobs
This detail specifically affects apps with async workers: Sidekiq, Celery, Resque, RQ, BullMQ, any background job queue. The worker container picked up a job that takes 30 minutes to process (bulk email sending, report generation, payment processing). In the middle of processing, the deploy's SIGTERM arrives. If the worker doesn't handle it properly, the job is lost: it's left in an intermediate state, or goes back to the queue and gets processed twice, or simply disappears.
The correct flow is more elaborate than detail 2's. Before the SIGTERM, the orchestrator needs to run a pre-stop hook that tells the worker to enter drain mode: stop accepting new jobs, but finish the ones already picked up. The orchestrator then waits (with a configurable timeout; 60 to 300 seconds is the normal range) for the local queue to drain. Only then does it send SIGTERM to the process.
The implementation varies. More sophisticated apps expose a /pause or /drain endpoint that puts the worker into graceful mode. Simpler apps use a sentinel file: the pre-stop hook creates a file, and the worker checks for it on each loop iteration and stops picking up new jobs if it exists. In both cases, the key is that the orchestrator has to wait for confirmation that the local queue has drained before sending SIGTERM. A sentinel-file sketch follows.
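The orchestrator side of the sentinel-file variant could look like this; the file paths and the busy-count convention are illustrative, not a standard.

docker exec worker_old touch /tmp/drain            # pre-stop: tell the worker to stop taking jobs
for i in $(seq 1 120); do                          # wait up to 120s for in-flight jobs
  busy=$(docker exec worker_old cat /tmp/busy_jobs 2>/dev/null || echo 0)   # missing counter = drained
  [ "$busy" -eq 0 ] && break
  sleep 1
done
docker stop -t 30 worker_old                       # only now: SIGTERM, then SIGKILL after 30s
# Worker side: skip new jobs whenever /tmp/drain exists, and keep /tmp/busy_jobs
# updated with the number of jobs currently in progress.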
Without this, your async job failure rate during deploys is directly proportional to the average job processing time multiplied by the number of swapped workers. In apps that process payments or send email, that failure rate becomes a serious problem fast.
The complete recipe
The six details compose into a recognizable specification. In spec format:
update:
  max_parallel: 1
  min_healthy_time: 10    # 10s sustained healthy
  healthy_deadline: 300   # 5 min max to be healthy
  auto_revert: true       # if past deadline, revert
  stagger: 30             # 30s between swapped replicas

tasks:
  - name: web
    healthcheck:
      path: /healthz
      interval: 5s
      timeout: 2s
      retries: 3
    lifecycle:
      pre_stop:
        timeout: 60       # 60s for worker to drain
        command: ["/bin/sh", "-c", "kill -TERM 1; sleep 30"]
That configuration, whether in a text file, an orchestrator config, or implicit in deploy code, is the minimum viable safe rolling deploy. Covering the six details is what separates serious orchestration from a queue of docker commands.
Who implements what (the honest version)
The rundown below covers the most common ecosystem in the self-hosted/small-cluster niche. It's not exhaustive (there are excellent tools left out), but it's honest about the ones that show up in typical architecture decisions.
Kubernetes. Implements all six when the manifest is complete: readinessProbe covers detail 1, terminationGracePeriodSeconds plus preStop covers 2 and 6, imagePullPolicy plus the local cache covers 3, progressDeadlineSeconds plus revisionHistoryLimit covers 4, and maxUnavailable plus maxSurge covers 5. The problem isn't what it does; it's the size of the manifest needed to do it, and the number of fields whose defaults aren't what you need.
Docker Swarm. Implements rolling via docker service update. The primitives exist (--update-parallelism, --update-delay, --update-failure-action), but the defaults aren't safe on their own: no health check is required, --update-delay defaults to zero, and auto-revert is opt-in via --update-failure-action=rollback. It needs to be tuned per service; it rarely is.
Nomad. Implements it natively, with sensible defaults. The update block has max_parallel, min_healthy_time, healthy_deadline, auto_revert, and stagger: basically the same fields as the spec above. That nomenclature isn't a coincidence; it's the set of best practices the major orchestrators converged on.
HeroCtl. Implements all six details natively, with configuration practically identical to the spec above. Control-plane coordinator election takes about 7 seconds when the leader node goes down, so even a deploy executed in the middle of a coordinator swap is resilient: the new coordinator resumes the health check cycle where the old one left off. The defaults are max_parallel: 1, min_healthy_time: 10, healthy_deadline: 300, auto_revert: true. If you submit a job without configuring anything, those are the values you get.
Watchtower. No. Watchtower is a useful tool for a specific case: automatic container updates in an environment where you accept short downtime and connection loss as the cost of not having a deploy pipeline. In serious production, it fails on five of the six details. That's not a criticism of the project; it's a criticism of using it in the wrong context.
Coolify and Dokploy. Implement it partially. A health check exists but has to be configured per application. Connection draining depends on the app catching SIGTERM (a shared responsibility). Auto-revert is manual in both. A generic pre-stop hook isn't a first-class primitive. For a single server it's enough; for a cluster it's fragile.
Homemade scripts. The docker pull && docker stop && docker run combination in a shell script managed by cron or triggered by a GitHub webhook. Zero of the six. The honest description of what that is: a deploy with short downtime, not a rolling deploy.
The four patterns beyond rolling
Rolling isn't the only strategy. Depending on requirements and budget, four others make sense in specific contexts.
Blue-green. Two complete parallel environments, each with the entire stack. The deploy consists of bringing up the alternative environment (green) with the new version, validating it in parallel, and switching over via DNS or the load balancer: a single atomic moment when all traffic changes. Safer than rolling because you can validate the new version with synthetic traffic before any real user touches it. It costs 2× in capacity during the deploy window. Recommended for apps where the cost of a bug in production is high and the cost of extra capacity is low.
Canary. Send 5% (or 1%, or any small fraction) of traffic to the new version, monitor key metrics for an observation period (15 minutes to a few hours), and scale up gradually: 5%, 25%, 50%, 100%. It detects a regression before it affects the bulk of users. The prerequisite is reliable, high-sensitivity metrics; without them, canary just delays the deploy without adding safety. It combines well with rolling: rolling is the mechanism, canary is the promotion strategy.
Rainbow. Several versions coexisting simultaneously in production, with traffic routed by customer key or tenant type. A rare use case, usually in B2B with version-per-contract requirements. Almost never the first option.
Recreate. Stop everything, bring up the new version. Explicit, accepted downtime. Acceptable for internal apps with a maintenance window or for development environments. Surprisingly appropriate in specific cases: a deploy involving a database migration that breaks the schema, or an app whose architecture doesn't support two versions coexisting. When recreate is the right choice, it's the right choice; there's no prize for using rolling for everything.
How to detect that your rolling is bad
Four direct tests. The first two are metrics you can look at in the monitoring you already have; the other two are experiments you have to run deliberately.
5xx rate during the deploy window. If your production 5xx rate is statistically different from zero during the deploy window, it's bad. "Statistically different" means: take 30 consecutive deploys, measure the 5xx rate in the minute preceding each deploy and in the minute of the deploy. If the mean of the second is higher, there are real errors and the window is cutting connections.
p99 latency during the deploy. If p99 rises 3× or more during the deploy window, it's bad. A latency spike indicates that requests are being retried internally, or that the load balancer is sending connections to containers that are slow to respond.
Forced crash test. Before a scheduled deploy, force the app in the new container to fail: chmod 000 on the binary, or an environment variable that makes it call process.exit(1) on startup. Does the system automatically revert within the healthy_deadline? If it sits waiting for human intervention, detail 4 is broken.
Friday 5 p.m. deploy with real traffic. The social test. Do a non-trivial deploy at peak hour, on a day nobody on the team is actively watching. If your app's metrics during that window are indistinguishable from any random window, your rolling deploy is safe. If intervention was needed, or the status channel registered anything, it isn't.
FAQ
Is Watchtower safe for production? For a small production setup with explicit tolerance for short downtime and a fallback (fast manual rollback), yes. For production with paying customers and SLA expectations, no. Watchtower was made for a different problem.
Health check on /healthz or /readyz?
The convention that helps most in practice: /healthz indicates "the process is alive" (liveness, used by the orchestrator to decide whether to restart the container) and /readyz indicates "I'm ready to serve traffic" (readiness, used by the orchestrator to decide whether to include the container in the load balancer). For rolling deploy, what matters is readiness; the orchestrator only promotes a container into the traffic pool when readiness returns 200. If you have only one endpoint and it returns 200 immediately after the process starts, your readiness isn't measuring what it should.
How much should min_healthy_time be?
The typical band is 10 to 30 seconds. Shorter than that (3 seconds, 5 seconds) lets false positives through: apps that respond 200 right at startup but start failing when real traffic arrives. Longer than 60 seconds becomes an operational impediment without a proportional gain. If your application has a complex warm-up (in-memory cache, slow third-party connections), the place to handle that is the health check itself, by making it return 200 only after warm-up, not by inflating min_healthy_time.
How do I do pre-stop in a Rails app?
Rails, via Puma (the default server since 5.x), responds to SIGTERM with a graceful shutdown: the server stops accepting new connections and finishes the in-flight ones. For Sidekiq, the right signal is SIGTSTP (quiet the workers, i.e. stop fetching new jobs), followed by SIGTERM once Sidekiq.redis { |c| c.llen("queue:default") } reaches zero. In practice, the pre-stop hook runs a small script that sends SIGTSTP, polls the queue for up to N seconds, and returns; the orchestrator then does the conventional SIGTERM.
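As a shell sketch of that hook, mirroring the flow above (container name, queue name, and the 120-second budget are placeholders):

docker kill --signal=TSTP sidekiq_old      # quiet: stop fetching new jobs
for i in $(seq 1 120); do                  # poll the queue for up to 120s
  [ "$(redis-cli llen queue:default)" -eq 0 ] && break
  sleep 1
done
# return control; the orchestrator proceeds with its usual SIGTERM / timeout / SIGKILL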
Do sticky sessions and rolling deploy go well together? No. Sticky sessions mean your architecture is delegating state to the load balancer, and during a rolling deploy that state is discarded when the replica holding the session is swapped. The result: the user gets logged out, loses a form midway, or sees inconsistent behavior. If you need sticky sessions, that's a symptom: move the state somewhere external (Redis, a database) and rolling deploy becomes trivial.
Database migrations on rolling deploy? The practical rule that avoids 90% of the pain: every migration needs to be compatible with the previous version of the app during the deploy window. Adding a nullable column: fine. Removing a column: do it in two releases (release N stops using the column; release N+1 drops it). Renaming a column: same idea, with the new column kept as a copy until the cutover. This lets old and new replicas coexist, which is exactly what rolling presupposes. A sketch of the two-release removal follows.
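For example, with psql as the client (the table and column names are made up):

# release N ships app code that no longer reads or writes users.legacy_email;
# the schema is untouched, so old and new replicas coexist safely.
# release N+1, only after N is fully rolled out, runs the destructive step:
psql "$DATABASE_URL" -c 'ALTER TABLE users DROP COLUMN legacy_email;'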
Can I test rolling deploy locally?
You can. Bring up three local replicas with docker-compose, put a simulated load balancer (nginx or caddy) in front, fire hey or wrk with sustained traffic, and run the swap script exactly as your pipeline would run it in production. Measure 5xx during the window. It's an imperfect test (the traffic is synthetic, there's only one node, there's no real network between nodes), but it catches the gross bugs in details 1, 2, and 3 before they leak into production.
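A minimal version of that rehearsal, assuming a compose file with an app service behind a proxy on port 8080 and a hypothetical swap script (all names are placeholders):

docker compose up -d --scale app=3
hey -z 2m http://localhost:8080/ > hey_report.txt &   # sustained synthetic load
./rolling_swap.sh v42                                  # your pipeline's swap logic
wait
grep -A 6 'Status code distribution' hey_report.txt    # any 5xx here is a gap in the rolling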
Closing
Safe rolling deploy isn't a button; it's a set of six coordinated behaviors, and most tools that promise "zero downtime" cover three or four of them. The practical difference shows up on Friday at 5 p.m., or in Wednesday morning's incident, or in the postmortem where someone asks why three users reported errors during the last deploy.
The recipe is above. If your current tool covers the six, great. If not, either you take on the cost of covering the missing ones manually, or you switch to something that covers them natively.
HeroCtl covers them natively. The Community plan is free forever, with no server or job limits, and the rolling deploy configuration described here is the default. The Business plan adds SSO, RBAC, audit, and SLA-backed support for teams with formal platform requirements. The Enterprise plan adds source code escrow and a continuity contract.
To get started:
curl -sSL https://get.heroctl.com/install.sh | sh
If you want the other sides of the topic, read Why we built HeroCtl for the product context; in upcoming posts we'll cover Docker deploy in production: from compose to cluster and database backup strategies in a cluster, for 3 a.m.
The intent remains the same: container orchestration, without ceremony — and without theater.