Database backup in a cluster: strategies that survive 3 a.m.

A backup that has never been restored is a placebo. Five strategies with real recovery time (RTO) and honest acceptable data loss (RPO), for each Brazilian SaaS stage.

HeroCtl team · 14 min read

Three in the morning. The alert wakes you because the health check endpoint has been returning 500 for twelve minutes. You open the terminal half-asleep, connect to the production database, and the first query returns ERROR: invalid page in block 4421 of relation base/16384/24576. Physical corruption. Postgres still answers some requests, but half the clients are getting rotten data — only the rows that happen to live in the sane pages come back intact.

You remember the pg_dump cron that runs at 3 a.m. You look at the clock: 3:12. The cron started twelve minutes ago. That is, the backup being written right now to S3 is a snapshot of the corruption. The previous backup is 24 hours old. You have 24 hours of orders, payments, messages and uploads to lose, or a corrupted database to restore.

This is the scenario where every backup decision comes due. Not at the architecture meeting. Not at the sprint retro. At three in the morning, alone in a terminal, deciding between two bad options.

This post walks through the five backup models for a clustered database, with honest numbers for how much each one loses and how long each one takes to come back. Each strategy has a SaaS stage where it is the right choice — and a stage where it is negligence. The difference is the stage your company is at, not the style of whoever configured it.

The phrase that defines everything: a backup that has never been restored is a placebo

Most teams have confident opinions on backup and fragile practice. "There's a pg_dump cron to S3" is the operational equivalent of "there's a fire extinguisher but I've never tested whether it works". Confidence comes from having the file there in the bucket. Fragility comes from never actually pulling that file, restoring it in an isolated environment, validating row counts, checking checksums.

Backup is the kind of system where the version that "works until it stops working" is indistinguishable from the version that "really works" — until the exact moment you need it. The first two years of a SaaS are lived in the Schrödinger state: the backup is alive and dead at the same time, and nobody has opened the box.

This text is the exercise of opening the box before the incident. The five strategies below are organized in increasing complexity and guarantee. For each, the question is: how much data do I lose? how long am I down? how much does it cost? and — the criterion nobody takes seriously until they need it — how much internal competence does this require?

RPO and RTO without buzzwords

Before comparing strategies, two numbers need to be on the table. They have annoying acronyms but the concept is simple.

RPO — Recovery Point Objective. How much data you accept losing. It is the distance between the "now" of the incident and the last consistent point you can restore. Daily backup at 3 a.m. means RPO of up to 24 hours — if corruption happens at 2 a.m. the next day, you lose 23 hours of transactions. Continuous backup means RPO of seconds.

RTO — Recovery Time Objective. How long you accept being offline. It is the distance between the "incident started" and the "serving traffic again". pg_dump restore on a 50 GB database takes between 30 and 60 minutes on a decent machine. Failover to a streaming replica takes 30 seconds.

Both cost money in different ways. Low RPO costs storage and continuous bandwidth (each commit needs to become durable bytes somewhere before being confirmed to the client). Low RTO costs redundant hardware — a hot replica is literally another database running, consuming CPU, RAM and disk in parallel, without serving traffic at normal times.

Defining RPO and RTO before implementing avoids the most common mistake: spending a lot to solve the wrong side. A team that pays US$300 per month for a read replica and still loses 24 hours of data when the disk corrupts spent badly — bought low RTO and ignored RPO. A team that does fortnightly pg_dump to a cross-region encrypted bucket also spent badly — bought extreme durability and accepted RTO of hours that the B2B client doesn't tolerate.

Strategy 1 — Cron + pg_dump to S3

The MVP version. A cron job on the database server, runs pg_dump | gzip | aws s3 cp at three in the morning. Cross-region bucket with lifecycle policy: files older than 30 days migrate to cold storage, older than 90 days are deleted.
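A minimal sketch of that job — the database name, bucket and schedule are placeholders, and it uses pg_dump's custom format instead of gzip so pg_restore can parallelize later:

# /etc/cron.d/pg-backup — daily at 03:00, as the postgres user
# 0 3 * * * postgres /usr/local/bin/pg-backup.sh

#!/usr/bin/env bash
set -euo pipefail

DB=app_production                      # placeholder
BUCKET=s3://acme-backups/postgres      # placeholder

# Custom format is compressed; streaming straight to S3 avoids local disk.
# For dumps much larger than ~50 GB streamed, add --expected-size.
pg_dump --format=custom "$DB" \
  | aws s3 cp --no-progress - "$BUCKET/$DB-$(date +%F).dump"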

Real RPO: 24 hours on the normal path. Can reach 48 if the cron fails one night and nobody notices.

Real RTO: 30 to 60 minutes for databases up to 50 GB. pg_restore of a compressed dump runs close to disk I/O speed — a machine with decent SSD does 1 to 2 GB per minute, counting index creation time.

Works for: personal project, MVP, internal tool, app where 24 hours of data is humanly recoverable. Registration platform where the client redoes what they lost. Cache tool that rehydrates from upstream. B2C SaaS in the first 100 users, where "we lost a day, sorry" is an acceptable answer.

Where it hurts: pg_dump runs inside a single snapshot, so it is transactionally consistent within one database — but only there; nothing guarantees consistency with other databases or external systems. --serializable-deferrable additionally makes the dump consistent with any serializable workload, at the cost of waiting for a safe moment to start on a write-heavy database. The parallel version with -j is faster but requires --format=directory, which breaks the direct pipe to S3 — you end up needing local temporary disk, as in the sketch below.
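A sketch of that parallel variant, with the temporary local disk the paragraph mentions (paths and names are placeholders):

# 4 parallel workers require directory format, which cannot be piped
pg_dump --format=directory -j 4 --file=/backup/tmp/app_production app_production
aws s3 sync --no-progress /backup/tmp/app_production \
  "s3://acme-backups/postgres/app_production-$(date +%F)/"
rm -rf /backup/tmp/app_production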

And restore is slow in a way that surprises. A 200 GB database that seemed "just a bit large" becomes three hours of restore with indexes being rebuilt. Three hours your platform is offline.

Cost: R$5 to R$30 per month of S3-compatible storage for reasonable retention. Human time: zero after setup, ten minutes a month to check the cron log.

Strategy 2 — pg_basebackup + WAL archiving

The first time RPO drops from "a day" to "a few minutes". The idea is to separate two things: the complete snapshot of the database (basebackup) and the change history since the last snapshot (WAL — Write-Ahead Log, the file where Postgres writes each commit before applying it to the data files).

Setup: a weekly pg_basebackup records a complete snapshot of the data directory. In parallel, the Postgres archive_command copies each closed WAL file (16 MB each) to S3 as it is produced. Under medium load, a segment fills and closes within seconds.
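A sketch of the configuration, with the same placeholder bucket as before; a production-grade archive_command usually also refuses to overwrite an existing file and logs its failures:

# postgresql.conf
wal_level = replica
archive_mode = on
archive_command = 'aws s3 cp --no-progress %p s3://acme-backups/wal/%f'
archive_timeout = 300   # force a segment switch at least every 5 minutes

%p expands to the path of the segment to archive and %f to its file name; Postgres keeps retrying the command until it exits 0.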

Real RPO: 1 to 5 minutes. It is the gap between the last WAL segment shipped and the moment the disk died.

Real RTO: 15 to 45 minutes. Restore the basebackup, replay the WALs to the desired point.

Works for: SaaS with 10 to 100 GB of database, first hundred paying clients. Stage where losing 24 hours is catastrophic but the team still doesn't have a dedicated DBA.

Where it hurts: the archive_command needs to be treated as production code, not a weekend script. If it fails and nobody notices, WAL accumulates on the database disk until the disk fills, and when it fills the database stops accepting writes. I have seen a cluster go down at four in the morning because the archive_command called aws s3 cp without --no-progress and the progress output blocked stdout in an environment without a TTY.

WAL volume surprises. A database with average traffic generates 5 to 50 GB of WAL per day, depending on what it writes. Multiplied by 30 days of retention, that becomes terabytes. Cheap storage is still cheap, but the bucket lifecycle policy has to be sharp.

And restore requires sequential replay. You cannot skip WALs. If an intermediate file was lost (archive bug, bucket deleted by mistake, anything), restore stops at that point and everything after it is unrecoverable. Periodic verification is mandatory.

Cost: R$50 to R$200 per month of storage depending on retention. Human time: 2 to 4 hours for a well-done initial setup, plus the first half hour of each incident spent figuring out which WAL is needed.

Strategy 3 — pgBackRest, WAL-E, restic, xtrabackup

When strategy 2 matters too much to be maintained by hand. Dedicated tools that combine basebackup, WAL archiving, retention, compression, encryption and — crucially — automatic verification.

pgBackRest is the most mature name for Postgres. It does everything strategy 2 does, but with parallelization (multiple processes sending WAL simultaneously), checksum validation on each file, point-in-time recovery (PITR) without pain, and — the big operational gain — a single command, pgbackrest restore, that knows how to find the most recent backup, download what's needed, and replay up to the moment you ask for.

WAL-E is older but simpler if you just want to send to S3 and come back.

restic is generic (not Postgres-specific), but has deduplication that is especially useful when you do a full dump multiple times a day — the second dump only sends the blocks that changed.

xtrabackup is the equivalent for MySQL, with the same philosophy: hot backup without lock, incremental, PITR.

Real RPO: minutes, same as strategy 2. The difference is that here the "5 minutes" is more reliable because the tool continuously verifies that the pipeline is working.

Real RTO: 10 to 30 minutes with direct PITR. The command pgbackrest restore --type=time --target='2025-12-11 03:42:00' solves what in strategy 2 requires half an hour of scripts.

Works for: serious SaaS, database from 100 GB to 1 TB, team that takes formal responsibility for monthly restore tests.

Where it hurts: learning the tool. pgBackRest has decent documentation but the vocabulary is specific — stanza, repo, archive-async. Well-done initial configuration takes 2 to 4 hours, plus another day to run the first full + incrementals and be sure everything matches.
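A minimal sketch of what that configuration looks like — the stanza name, paths and bucket are placeholders, and the S3 keys and repo1-cipher-pass are omitted:

# /etc/pgbackrest/pgbackrest.conf
[global]
repo1-type=s3
repo1-s3-bucket=acme-backups
repo1-s3-region=us-east-1
repo1-s3-endpoint=s3.amazonaws.com
repo1-retention-full=2
repo1-cipher-type=aes-256-cbc
process-max=4

[app]
pg1-path=/var/lib/postgresql/16/main

# first run: create the stanza, take a full backup, verify the pipeline
pgbackrest --stanza=app stanza-create
pgbackrest --stanza=app --type=full backup
pgbackrest --stanza=app check

On the Postgres side, archive_command then points at 'pgbackrest --stanza=app archive-push %p' instead of a hand-written script.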

The trick here is that the tool hides the mechanism's complexity; it doesn't remove it. When something goes wrong, you need to understand that underneath it is still basebackup + WAL. The difference is that the common bugs of strategy 2 (archive failing silently, missing WAL) are detected and alerted on by the tool before the incident.

Cost: storage proportional to volume + part-time DBA time. I say "part-time DBA" even if no one with that title exists on the team — it is the time someone on call needs to invest monthly to run the restore test.

Strategy 4 — Streaming replication with automatic failover

The first strategy where RTO drops below one minute. Instead of copying the database after it has finished writing, you keep a replica receiving the WAL stream in real time. When the primary dies, an orchestrator promotes the replica to primary, updates routing, and the service comes back without human intervention.

Patroni is the most common name to orchestrate this in Postgres. It solves leader election between nodes, manages replication slots, fences the dead node to avoid two simultaneous writes, and exposes an endpoint that your load balancer queries to know which node is the current primary.
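An abbreviated sketch of one node's configuration — names, addresses and the etcd endpoint are placeholders:

# patroni.yml (one per node)
scope: app-cluster
name: node1

etcd3:
  hosts: 10.0.0.5:2379

restapi:
  listen: 0.0.0.0:8008
  connect_address: 10.0.0.11:8008

postgresql:
  data_dir: /var/lib/postgresql/16/main
  listen: 0.0.0.0:5432
  connect_address: 10.0.0.11:5432
  authentication:
    replication:
      username: replicator
      password: change-me

The load balancer health-checks GET /primary on port 8008; only the current leader answers 200.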

Real RPO: seconds. Synchronous replication can reach zero (commit on the primary only confirms after the copy reaches the replica) but costs latency on each write — a choice made transaction by transaction in most systems.
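In Postgres that per-transaction choice looks like this — assuming synchronous_standby_names is set on the primary, a non-critical write can opt out of the wait (the table is a placeholder):

BEGIN;
SET LOCAL synchronous_commit = local;  -- this commit won't wait for the replica
UPDATE page_views SET hits = hits + 1 WHERE page_id = 7;
COMMIT;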

Real RTO: 30 seconds to 5 minutes. The 30 seconds is automatic failover without human-in-the-loop. The 5 minutes is the scenario where the orchestrator detects the problem, decides it is not a false positive, promotes the replica, and the clients' DNS cache expires.

Works for: SaaS with its first serious B2B client, contractual SLA of 99.5% or higher. Platform where a 5-minute outage window means a contractual penalty.

Where it hurts: split-brain during a network partition. If the primary becomes isolated but keeps accepting writes, and the replica is promoted by the orchestrator on the other side of the partition, you end up with two divergent truths that need to be reconciled by hand when the network comes back. Serious orchestrators use a third witness node to avoid this, but the configuration is demanding.

Worse: the replica copies corruption. If the primary wrote a rotten page, the replica received the identical WAL and is equally rotten. Streaming replication protects you against primary hardware failure; it doesn't protect against an application bug that writes garbage, against a wrong DROP TABLE, against ransomware.

That's why this strategy never replaces the previous three — it adds. Very low RTO for hardware failure, plus traditional backup for logical failure.

Cost: two or three databases running all the time (one active primary, one or two standbys). Storage and CPU doubled or tripled relative to a single database. Human time: one week of initial setup, plus quarterly battery of failover tests.

Strategy 5 — Backup managed by a third party

The strategy where you buy your way out of the problem. RDS on AWS, Cloud SQL on Google, Neon, Crunchy Bridge, or a Brazilian provider that delivers automatic backup with PITR. You define the retention window, they take care of the rest.

Real RPO: 5 minutes is the typical published number.

Real RTO: 30 seconds to 5 minutes for failover within the same region. Cross-region restore is another animal — it can take an hour, depending on the provider.

Works for: team without internal expertise OR compliance that requires vendor with SOC 2 / ISO 27001 issued. Company that would prefer to pay dearly to not think about the subject.

Where it hurts: cost. A database that would cost R$300/month in hardware becomes R$1,500 to R$3,000/month in equivalent RDS, depending on replica configuration. Provider lock-in — leaving RDS is a complicated migration because the way backup is done is product-specific. And cross-region restore (which is the part the B2B client demands in the contract) is frequently harder than it looks in the sales deck.

The real advantage is zero ops under normal conditions. The real disadvantage is that when you are in an incident, escalating through provider support can take longer than solving it yourself on a well-configured self-hosted setup.

Cost: US$50 to US$500 per month for small and medium databases. Above that, pricing scales with size.

Comparison table

The honest version side by side. Each column is one of the strategies above.

Criterion              | 1: Cron + pg_dump | 2: basebackup + WAL | 3: pgBackRest       | 4: Streaming + failover | 5: Managed
Typical RPO            | 24 h              | 5 min               | 1-5 min             | seconds                 | 5 min
Typical RTO            | 30-60 min         | 15-45 min           | 10-30 min           | 30 s - 5 min            | 30 s - 5 min
Ideal database size    | up to 50 GB       | 10-100 GB           | 100 GB - 1 TB       | any                     | any
Storage cost           | very low          | medium              | medium              | high (replica)          | high
Human time cost        | ~zero             | medium              | medium              | high                    | ~zero
Setup complexity       | trivial           | moderate            | moderate            | high                    | trivial
Restore complexity     | low               | moderate            | low (PITR)          | n/a (failover)          | low
Multi-region           | manual            | manual              | native              | requires config         | depends on provider
Point-in-time recovery | no                | yes, manual         | yes, single command | n/a                     | yes
Brazilian SaaS stage   | MVP               | indie               | early/mid           | startup with SLA        | enterprise / no expertise

Notice that no row has an absolute winner. Strategy 5 wins on "human time" and loses on "storage cost". Strategy 4 wins on RPO and loses on "setup complexity". The choice is always about which column you prioritize, given the company stage.

The five mistakes that kill your colleague's backup

From here on, it is operational folklore. Each of these mistakes was made by a team confident in the backup it had. Each became a postmortem.

Never restored. A backup that has never been restored in an isolated environment is a hypothesis. The fix is a monthly cron that pulls the most recent backup from the bucket, brings it up in an ephemeral database (a temporary machine you turn off afterwards), validates the row counts of critical tables, checks the checksums of some samples, and emails the team confirming success. The backup that last passed this test is the only backup you can be sure works. Everything else is faith.
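A sketch of that monthly job — the bucket, the container image and the validation query are placeholders for whatever "healthy" means in your schema:

set -euo pipefail
LATEST=$(aws s3 ls s3://acme-backups/postgres/ | sort | tail -1 | awk '{print $4}')
aws s3 cp --no-progress "s3://acme-backups/postgres/$LATEST" /tmp/restore-test.dump

docker run -d --name restore-test -e POSTGRES_PASSWORD=test postgres:16
until docker exec restore-test pg_isready -U postgres >/dev/null 2>&1; do sleep 1; done
docker exec -i restore-test pg_restore -U postgres -d postgres < /tmp/restore-test.dump

# validate: a critical table has a sane row count
ROWS=$(docker exec restore-test psql -U postgres -tAc 'SELECT count(*) FROM orders')
[ "$ROWS" -gt 0 ] || { echo 'restore test FAILED'; exit 1; }

docker rm -f restore-test
rm -f /tmp/restore-test.dump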

Backup on the same disk or the same region. The disk dies, the backup goes with it. An entire cloud provider region goes down (it has happened several times in the last five years), the backup goes with it. Cross-region is the minimum. Cross-provider — main backup on one provider, secondary copy on another — is what separates "prepared" from "cheering from the stands".

No long logical retention. Seven days of retention seems comfortable until you discover the corruption started eight days ago and nobody noticed. A subtle application bug that writes invalid data to 25 rows per minute doesn't trigger any alert, but in two weeks it has poisoned half a million records. Short retention is negligence dressed as savings. A reasonable minimum policy: 7 days of hourly backups, 30 days of daily, 12 months of monthly.

No failure alert. The silent cron is the most common killer. It failed thirty days ago because of an S3 credential change, nobody looked at the log, everyone slept peacefully. An alert integrated with the team's notification system (not an email lost in the inbox, but Slack or equivalent that someone will read) is mandatory. And the alert has to be double — alert when it fails, and alert when there has been no success for more than N hours (to detect the case where the cron didn't even run), as in the sketch below.
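A sketch of the second half: the backup job touches a marker file on success, and an independent hourly cron complains when the marker goes stale (the webhook URL is a placeholder):

MARKER=/var/run/pg-backup.last-success

# last line of the backup script, reached only if everything above succeeded
touch "$MARKER"

# separate hourly cron: alert if there has been no success in 26 hours
if [ -z "$(find "$MARKER" -mmin -1560 2>/dev/null)" ]; then
  curl -s -X POST -H 'Content-type: application/json' \
    --data '{"text":"pg backup: no success in the last 26h"}' \
    "$SLACK_WEBHOOK_URL"
fi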

Backup without encryption. A public S3 bucket leak becomes a leak of the entire database. The press incidents of recent years all read the same way: "the bucket was public for two days and held the complete backup". Encryption at rest (server-side encryption on the bucket is the minimum; client-side encryption with a key you manage is adequate), encryption in transit, and — separate from that — bucket access control following the principle of least privilege.
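A sketch of the first two layers on upload — age is one client-side option (gpg works the same way), and the recipient key is a placeholder:

pg_dump --format=custom app_production \
  | age -r age1examplepublickeyplaceholder \
  | aws s3 cp --no-progress --sse AES256 - \
      s3://acme-backups/postgres/app_production-$(date +%F).dump.age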

The Brazilian SaaS maturity trail

The practical question: which strategy for which stage, using Brazilian monthly recurring revenue (MRR) as the reference.

MVP up to R$5k MRR. Strategy 1. pg_dump cron to S3-compatible, 30-day retention, basic failure alert. Costs practically nothing and protects against the main risk at this stage, which is "the server vanished". 24-hour RPO is acceptable because the client at this stage has expectations of a product that's just starting.

Indie from R$5k to R$30k. Strategy 2. Adds WAL archiving, drops RPO from 24 hours to 5 minutes. The complexity jump is compensated by the guarantee jump. A client paying R$500 per month for B2B subscription starts asking SLA questions — you need a better answer than "daily".

Early startup from R$30k to R$200k. Strategy 3. pgBackRest configured with an automatic monthly restore test. Here it is no longer optional — you have a serious client, a serious contract, and the fragility of strategy 2 done by hand is a concrete operational risk. Half a day of setup pays for itself within three months, at the first incident that doesn't become a public postmortem.

Startup with contractual SLA. Strategy 4 added to 3. Streaming replication with automatic failover takes care of hardware; pgBackRest takes care of logical data. The two together solve the two failure modes that serious B2B contracts charge for: momentary unavailability and data loss.

Enterprise or heavy compliance. Strategy 5 OR strategy 4 with detailed audit. Here the decision is less technical and more regulatory. If audit requires a vendor with X certification, you buy managed. If audit accepts self-hosted with documented runbook, you operate strategies 3 and 4 and invest in the audit trail.

How HeroCtl simplifies this

The motivation for the product is exactly the scheme above — these are layers that appear over and over in every SaaS, and each team spends an entire season rebuilding the same plumbing. HeroCtl solves the transport, orchestration and observation of these layers. What is database-specific stays with the team; what is generic is automated.

Concretely, on the Community plan (free), you run Postgres as a regular job in the cluster, with persistence encrypted at rest. The backup cron becomes just another job, with automatic retry, integrated failure alerts (no need to set up Alertmanager separately), and metrics that appear on the dashboard — dump duration, file size, time since last success.

The Business plan adds managed Postgres and MySQL backup. The difference vs. doing it yourself is that integrity verification, the monthly restore test, client-side encryption with a key you manage, and a three-tier retention policy (hourly/daily/monthly) come pre-configured. You define the window and the rest is our problem.

The Enterprise plan adds orchestration of streaming replication between cluster nodes — a job describes the topology (primary here, standby there, who promotes whom), and the cluster takes care of failover when the primary dies. A chaos-test battery runs monthly against your cluster and produces a report.

On all plans, the restore test can be configured as an internal cron job: the cluster brings up an ephemeral database, restores the most recent backup, validates what you defined as "healthy" (a checksum query, a row count, whatever makes sense), reports success or failure, and turns off the ephemeral database. The cost is the test's CPU minutes, not an engineer's workday.

The philosophy is the same as the rest of the product: what is generic to every SaaS is already done; what is specific to your domain stays with you. Backup is generic. The database content is yours.

FAQ

Is pg_dump enough to start? Yes, for MVP up to first paying clients. The practical rule: if losing 24 hours of data costs less than three hours of engineer time, pg_dump is the right choice. When it inverts (losing data costs more than migrating to strategy 2), migrate.

How much does S3-compatible storage cost in Brazil in 2026? The current range is between R$0.15 and R$0.40 per GB-month depending on provider and tier. A backup of a 50 GB database with 30 days of daily dumps plus some old weekly dumps ≈ 800 GB-month compressed ≈ R$120 to R$320/month. Cross-region practically doubles it. A cold tier for the long monthly retention cuts it in half.

How do I test restore without affecting production? Cron job in the cluster that: 1) downloads the most recent backup from the bucket; 2) brings up a temporary Postgres on another node (or ephemeral container) without exposing public port; 3) restores the dump; 4) runs validation queries (per-table count, checksum of samples, critical domain query); 5) compares with expected baselines; 6) reports the result on the alerts channel; 7) destroys the environment. Runs monthly, at minimum. Never touches the production database at any point.

Encrypted backup on S3 — what's the right way? Three layers. Server-side encryption of the bucket (AES256 or KMS) is the base and is free. Client-side encryption with a key managed by you (you encrypt the file before uploading, with a key that lives outside the cloud provider) protects against credential compromise scenarios. Bucket policy with aws:SecureTransport=true and explicit public access blocking. The three together, no exception. One alone is not enough.

Does a read replica count as a backup? No. A read replica protects against primary hardware failure. Doesn't protect against wrong DROP TABLE, against a bug that writes invalid data, against logical corruption that replicates. Every logical corruption that enters the primary enters the replica in seconds. Read replica is high availability, not backup. The two coexist; neither replaces the other.

Is cross-region replication necessary? For serious B2B client, yes. For MVP, no. The practical boundary: if a single cloud provider falling for 4 hours is enough to break a contract, cross-region is minimum. If 4 hours of downtime can still be explained by email, cross-region can wait.

How long does it take to restore 100 GB of Postgres? It depends a lot on disk. Local NVMe SSD: 20 to 40 minutes counting index creation. Remote SSD (cloud provider volume): 40 to 90 minutes. HDD: you don't want to know. Dump compression typically cuts the size in half or better; a 100 GB database becomes a 30 to 50 GB dump. Restore in parallel (pg_restore -j) to speed it up by 30 to 50%, as below.
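The parallel version of that restore, for a custom- or directory-format dump (names are placeholders):

pg_restore -j 4 -d app_production /backup/app_production-2026-01-10.dump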

Closing

Backup is less about technical strategy and more about test discipline. The five strategies above are all decent at some stage. The difference between the team that survives the incident and the team that loses a client is not which one they chose — it is whether they restored a backup recently enough to trust it.

If your answer to "when was the last restore test?" begins with "let me see", you have work for this week. Not the next sprint. This week. The three a.m. scenario isn't scheduled.

HeroCtl runs on any Linux server with Docker. You install in a lab, bring up a Postgres as a job, configure automatic backup, schedule monthly restore test, and in an afternoon you have the entire strategy 3 scheme working. Without setting up five different products, without learning specialized orchestrator vocabulary.

curl -sSL get.heroctl.com/install.sh | sh

For context on when it makes sense to run Postgres in your own cluster vs. buying managed, read Postgres in production: managed vs self-hosted. To understand why the deploy window is the other side of the backup coin (because a bad deploy is the most common way to generate the incident that will require a restore), read Safe rolling deploy: why yours might not be.

A backup that has never been restored is a placebo. The difference between a placebo and medicine shows up exactly once, exactly when you can't choose.

#backup #postgres #disaster-recovery #engineering #rpo-rto