# Backup and restore of cluster state
How to save, schedule, and restore HeroCtl control plane snapshots. Disaster recovery strategy.
A cluster backup is not the same thing as a backup of your data. They are two universes. This guide deals only with the first.
## What cluster state is
HeroCtl keeps a catalog of what exists: declared jobs, stored secrets, ACL rules, recent metric history, configured ingress, known nodes. All of this is the control plane state.
If you lose this state, the cluster forgets what should be running. The containers may stay up, but no one knows how to manage them anymore. It's the kind of situation that ends in a manual rebuild.
Backing up this state is cheap. Not doing it is expensive.
What does NOT go into the snapshot: the data living inside your containers. Postgres databases, volumes, user uploads. Those need their own backup (`pg_dump`, storage snapshots, etc.).
## Manual snapshot

The command is one line:

```shell
heroctl snapshot save /tmp/snap-2026-04-26.tar.gz
```

The output:

```
snapshot saved: 4.2 MB
included: 47 jobs, 23 secrets, 12 acl rules, 4 nodes
duration: 1.3s
```
Internally the file is a compressed tarball with:
- Job definitions
- Secrets (encrypted)
- ACL policies and tokens
- Ingress configuration
- List of known nodes
- Metric history from the last 7 days
It does not include old logs or container images.
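A snapshot you cannot read back is as bad as no snapshot. Before shipping one off-site, it is worth checking that the file is a readable gzipped tarball. A minimal sketch; the `verify_snapshot` helper is ours, not part of the HeroCtl CLI:

```shell
# Illustrative helper, not part of HeroCtl: check that a snapshot file
# is a readable gzipped tarball before trusting it as a backup.
verify_snapshot() {
  snap="$1"
  # gzip integrity check: catches truncated or corrupted uploads
  gzip -t "$snap" 2>/dev/null || return 1
  # the archive must list at least one entry
  [ "$(tar -tzf "$snap" 2>/dev/null | wc -l)" -gt 0 ]
}

# Usage: verify_snapshot /tmp/snap-2026-04-26.tar.gz && echo "snapshot looks sane"
```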
## Restore

Restoring is more delicate. The cluster needs to be stopped:

```shell
# Stop the cluster on all nodes
sudo systemctl stop heroctl-server

# On ONE node, restore the snapshot
heroctl snapshot restore /backups/snap-2026-04-26.tar.gz

# Bring the restored node up
sudo systemctl start heroctl-server

# The remaining nodes sync automatically when they come up
sudo systemctl start heroctl-server  # on the others
```
Warning: restore overwrites the current state. If the snapshot is from yesterday, you lose everything that happened since then. Take a snapshot of the current state first, even if it's broken — it can preserve evidence.
## Automatic backup (Business plan)

Manual snapshots work. Scheduled snapshots work even when no one remembers to run them.

Configuration:

```yaml
# /etc/heroctl/server.yaml
backup:
  enabled: true
  schedule: "0 3 * * *"   # 3am, every day
  retention_days: 30
  storage:
    type: s3
    bucket: heroctl-backups
    region: us-east-2
    prefix: cluster-prod/
    access_key_id: AKIA...
    secret_access_key: <kept secret>
```
The schedule is cron-like. Examples:
| Expression | When |
|---|---|
| `0 3 * * *` | 3am every day |
| `0 */6 * * *` | Every 6 hours |
| `0 3 * * 0` | Sundays at 3am |
| `30 2 1 * *` | Day 1 of each month, 2:30am |
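A malformed schedule is an easy way to end up with no backups at all. A tiny sanity check you can run before editing `server.yaml`; the `valid_cron` helper is ours, not a HeroCtl command, and it only checks the standard five-field shape:

```shell
# Illustrative helper: a standard cron schedule has exactly five
# whitespace-separated fields (minute, hour, day of month, month, day of week).
valid_cron() {
  [ "$(echo "$1" | wc -w)" -eq 5 ]
}

valid_cron "0 3 * * *" && echo "ok"       # five fields: accepted
valid_cron "0 3 * *" || echo "rejected"   # four fields: rejected
```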
## Compatible storage
Any S3-compatible storage works:
- AWS S3
- Cloudflare R2
- Backblaze B2
- Wasabi
- MinIO (self-hosted)
Your choice. R2 and B2 are usually the cheapest for long retention.
## Retention

```yaml
retention_days: 30
```

Snapshots older than 30 days are deleted automatically. You can use different values:

- `7`: small teams, operational only
- `30`: recommended default
- `90`: audit, regulated environments
- `365`: heavy compliance
Each snapshot weighs between 2 MB and a few tens of MB. Even 365 copies take little space.
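The claim is easy to check with back-of-envelope arithmetic, assuming the worst-case size of around 30 MB per snapshot and one snapshot per day:

```shell
# Back-of-envelope: retained snapshot storage in MB.
retention_mb() {
  size_mb=$1; days=$2
  echo $((size_mb * days))
}

retention_mb 30 365   # prints 10950 -> roughly 11 GB for a full year, worst case
retention_mb 5 30     # prints 150  -> a typical month is tiny
```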
## Encryption

Every snapshot is encrypted at rest with AES-256. The key is configured on the server:

```yaml
backup:
  encryption_key_file: /etc/heroctl/backup.key
```

Generate the key once:

```shell
openssl rand -base64 32 > /etc/heroctl/backup.key
chmod 600 /etc/heroctl/backup.key
```
Warning: without this key the snapshot becomes garbage. Keep a copy outside the cluster. In a password vault. On paper inside an envelope. Wherever makes sense — but keep it.
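The point about the key can be demonstrated with plain `openssl`. This is an illustration of AES-256 encryption at rest in general, not HeroCtl's actual on-disk format, which is not documented here:

```shell
work=$(mktemp -d)

# Generate a key the same way as above
openssl rand -base64 32 > "$work/backup.key"

# Encrypt a stand-in "snapshot" with AES-256
echo "cluster state" > "$work/snap"
openssl enc -aes-256-cbc -pbkdf2 -pass "file:$work/backup.key" \
  -in "$work/snap" -out "$work/snap.enc"

# With the key, the content comes back
openssl enc -d -aes-256-cbc -pbkdf2 -pass "file:$work/backup.key" \
  -in "$work/snap.enc"   # prints: cluster state

# Without the key there is no recovery: the warning above is literal.
```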
## Disaster recovery
Three scenarios, three responses.
### Scenario 1: cluster lost coordination

A majority of nodes went down, so the survivors cannot form a quorum. The cluster locks up in read-only mode.
Normal solution: bring the downed nodes back. They re-sync and everything returns.
Emergency solution (last resort): forced bootstrap from the most recent snapshot.
```shell
# On a healthy node
heroctl snapshot restore /backups/snap-latest.tar.gz --force-bootstrap
```
You lose changes made after the last snapshot.
### Scenario 2: cluster storage corrupted
Disk died. Datacenter burned. Operator deleted /var/lib/heroctl.
Same procedure:
1. Provision new nodes.
2. Install HeroCtl.
3. `heroctl snapshot restore` on one of them.
4. Have the others join as peers.
If you had an off-site snapshot, you recover in minutes. If you didn't, you recover in days (redoing everything from scratch).
### Scenario 3: application data lost
A cluster snapshot does not help you here. You need:
- Scheduled `pg_dump` for Postgres
- Volume snapshots for file storage
- Replication for Redis (if persistent)
Each workload is responsible for its own data backup. HeroCtl orchestrates; it does not replace your application backup strategy.
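For the Postgres case, a nightly dump can come from plain cron. A sketch; the user, database name, and paths are placeholders, not HeroCtl conventions:

```shell
# /etc/cron.d/pg-backup -- illustrative placeholder paths and db name
# 2:30am daily: dump, compress, date-stamp (% must be escaped in cron files)
30 2 * * * postgres pg_dump mydb | gzip > /backups/mydb-$(date +\%F).sql.gz
```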
## Restore tests
A backup no one tested is not a backup. It's a file.
HeroCtl Business schedules automatic tests:
```yaml
backup:
  test_restore:
    enabled: true
    schedule: "0 4 1 * *"   # day 1 of each month, 4am
    target: staging-cluster
```
Monthly the staging cluster wipes its state, restores the latest production snapshot, and runs a battery of checks:
- Did all jobs come back?
- Do secrets decrypt correctly?
- Is the ACL consistent?
If something fails, it alerts on the configured channel.
## Best practices
The traditional 3-2-1 backup rule applies here too:
- 3 copies of the snapshot
- 2 different media (local disk + cloud)
- 1 off-site (different provider from the cluster)
Applied:
- Copy 1: generated on the coordinator node.
- Copy 2: replicated to an S3 bucket in the same region.
- Copy 3: replicated to a bucket in another region or another provider.
When the entire datacenter goes down, it's copy 3 that saves you.
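The three copies can be kept honest with a small replication step after each snapshot. A sketch; the `fan_out` helper is ours, and local directories stand in for the destinations (in practice copies 2 and 3 would be S3-compatible buckets, uploaded with whatever client you already use):

```shell
# Illustrative fan-out: copy one snapshot to every destination and fail
# loudly if any copy fails. Destinations here are directories standing in
# for local disk plus two remote buckets.
fan_out() {
  snap="$1"; shift
  for dest in "$@"; do
    cp "$snap" "$dest/" || { echo "copy to $dest failed" >&2; return 1; }
  done
}

# Usage:
# fan_out /tmp/snap-2026-04-26.tar.gz /backups/local /mnt/bucket-a /mnt/bucket-b
```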
## Summary
| Action | Command | Frequency |
|---|---|---|
| Manual snapshot | `heroctl snapshot save` | Before big changes |
| Scheduled snapshot | configured in `server.yaml` | Daily |
| Restore | `heroctl snapshot restore` | In emergencies |
| Restore test | automatic | Monthly |
Next steps:
- Metrics and alerts — know when the backup failed.
- ACL — who can trigger a restore.
- Troubleshooting — when restore doesn't work.