# Backup and restore of cluster state
How to save, schedule, and restore HeroCtl control plane snapshots. Disaster recovery strategy.
A cluster backup is not the same thing as a backup of your data. They are two universes. This guide deals only with the first.
## What cluster state is
HeroCtl keeps a catalog of what exists: declared jobs, stored secrets, ACL rules, recent metric history, configured ingress, known nodes. All of this is the control plane state.
If you lose this state, the cluster forgets what should be running. The containers may stay up, but no one knows how to manage them anymore. It's the kind of situation that ends in a manual rebuild.
Backing up this state is cheap. Not doing it is expensive.
What does NOT go into the snapshot: the data living inside your containers. Postgres databases, volumes, user uploads. Those need their own backup (`pg_dump`, storage snapshots, etc.).
## Manual snapshot

The command is one line:

```shell
heroctl snapshot save /tmp/snap-2026-04-26.tar.gz
```

The output:

```
snapshot saved: 4.2 MB
included: 47 jobs, 23 secrets, 12 acl rules, 4 nodes
duration: 1.3s
```
Internally the file is a compressed tarball with:
- Job definitions
- Secrets (encrypted)
- ACL policies and tokens
- Ingress configuration
- List of known nodes
- Metric history from the last 7 days
It does not include old logs or container images.
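A snapshot you cannot read back is as bad as no snapshot. Before shipping one off-site, it is worth checking that the file is a readable gzipped tarball. A minimal sketch; the `verify_snapshot` helper is ours, not part of the HeroCtl CLI:

```shell
# Illustrative helper, not part of HeroCtl: check that a snapshot file
# is a readable gzipped tarball before trusting it as a backup.
verify_snapshot() {
  snap="$1"
  # gzip integrity check: catches truncated or corrupted uploads
  gzip -t "$snap" 2>/dev/null || return 1
  # the archive must list at least one entry
  [ "$(tar -tzf "$snap" 2>/dev/null | wc -l)" -gt 0 ]
}

# Usage: verify_snapshot /tmp/snap-2026-04-26.tar.gz && echo "snapshot looks sane"
```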
## Restore

Restoring is more delicate. The cluster needs to be stopped:

```shell
# Stop the cluster on all nodes
sudo systemctl stop heroctl-server

# On ONE node, restore the snapshot
heroctl snapshot restore /backups/snap-2026-04-26.tar.gz

# Bring the restored node up
sudo systemctl start heroctl-server

# The remaining nodes sync automatically when they come up
sudo systemctl start heroctl-server  # on the others
```
Warning: restore overwrites the current state. If the snapshot is from yesterday, you lose everything that happened since then. Take a snapshot of the current state first, even if it's broken — it can preserve evidence.
## Automatic backup (Business plan)

Manual snapshots work. Scheduled snapshots work even when no one remembers to run them.

Configuration:

```yaml
# /etc/heroctl/server.yaml
backup:
  enabled: true
  schedule: "0 3 * * *"   # 3am, every day
  retention_days: 30
  storage:
    type: s3
    bucket: heroctl-backups
    region: us-east-2
    prefix: cluster-prod/
    access_key_id: AKIA...
    secret_access_key: <kept secret>
```
The schedule is cron-like. Examples:
| Expression | When |
|---|---|
| `0 3 * * *` | 3am every day |
| `0 */6 * * *` | Every 6 hours |
| `0 3 * * 0` | Sundays at 3am |
| `30 2 1 * *` | Day 1 of each month, 2:30am |
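A malformed schedule is an easy way to end up with no backups at all. A tiny sanity check you can run before editing `server.yaml`; the `valid_cron` helper is ours, not a HeroCtl command, and it only checks the standard five-field shape:

```shell
# Illustrative helper: a standard cron schedule has exactly five
# whitespace-separated fields (minute, hour, day of month, month, day of week).
valid_cron() {
  [ "$(echo "$1" | wc -w)" -eq 5 ]
}

valid_cron "0 3 * * *" && echo "ok"       # five fields: accepted
valid_cron "0 3 * *" || echo "rejected"   # four fields: rejected
```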
## Compatible storage
Any S3-compatible storage works:
- AWS S3
- Cloudflare R2
- Backblaze B2
- Wasabi
- MinIO (self-hosted)
Your choice. R2 and B2 are usually the cheapest for long retention.
## Retention

```yaml
retention_days: 30
```

Snapshots older than 30 days are deleted automatically. You can use different values:

- `7`: small teams, operational only
- `30`: recommended default
- `90`: audit, regulated environments
- `365`: heavy compliance
Each snapshot weighs between 2 MB and a few tens of MB. Even 365 copies take little space.
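The claim is easy to check with back-of-envelope arithmetic, assuming the worst-case size of around 30 MB per snapshot and one snapshot per day:

```shell
# Back-of-envelope: retained snapshot storage in MB.
retention_mb() {
  size_mb=$1; days=$2
  echo $((size_mb * days))
}

retention_mb 30 365   # prints 10950 -> roughly 11 GB for a full year, worst case
retention_mb 5 30     # prints 150  -> a typical month is tiny
```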
## Encryption

Every snapshot is encrypted at rest with AES-256. The key is configured on the server:

```yaml
backup:
  encryption_key_file: /etc/heroctl/backup.key
```

Generate the key once:

```shell
openssl rand -base64 32 > /etc/heroctl/backup.key
chmod 600 /etc/heroctl/backup.key
```
Warning: without this key the snapshot becomes garbage. Keep a copy outside the cluster. In a password vault. On paper inside an envelope. Wherever makes sense — but keep it.
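The point about the key can be demonstrated with plain `openssl`. This is an illustration of AES-256 encryption at rest in general, not HeroCtl's actual on-disk format, which is not documented here:

```shell
work=$(mktemp -d)

# Generate a key the same way as above
openssl rand -base64 32 > "$work/backup.key"

# Encrypt a stand-in "snapshot" with AES-256
echo "cluster state" > "$work/snap"
openssl enc -aes-256-cbc -pbkdf2 -pass "file:$work/backup.key" \
  -in "$work/snap" -out "$work/snap.enc"

# With the key, the content comes back
openssl enc -d -aes-256-cbc -pbkdf2 -pass "file:$work/backup.key" \
  -in "$work/snap.enc"   # prints: cluster state

# Without the key there is no recovery: the warning above is literal.
```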
## Disaster recovery
Three scenarios, three responses.
### Scenario 1: cluster lost coordination

A majority of nodes went down, so the survivors cannot form a quorum. The cluster locks up in read-only mode.
Normal solution: bring the downed nodes back. They re-sync and everything returns.
Emergency solution (last resort): forced bootstrap from the most recent snapshot.
```shell
# On a healthy node
heroctl snapshot restore /backups/snap-latest.tar.gz --force-bootstrap
```
You lose changes made after the last snapshot.
### Scenario 2: cluster storage corrupted
Disk died. Datacenter burned. Operator deleted /var/lib/heroctl.
Same procedure:
1. Provision new nodes.
2. Install HeroCtl.
3. `heroctl snapshot restore` on one of them.
4. Have the others join as peers.
If you had an off-site snapshot, you recover in minutes. If you didn't, you recover in days (redoing everything from scratch).
### Scenario 3: application data lost
A cluster snapshot does not help you here. You need:
- Scheduled `pg_dump` for Postgres
- Volume snapshots for file storage
- Replication for Redis (if persistent)
Each workload is responsible for its own data backup. HeroCtl orchestrates; it does not replace your application backup strategy.
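For the Postgres case, a nightly dump can come from plain cron. A sketch; the user, database name, and paths are placeholders, not HeroCtl conventions:

```shell
# /etc/cron.d/pg-backup -- illustrative placeholder paths and db name
# 2:30am daily: dump, compress, date-stamp (% must be escaped in cron files)
30 2 * * * postgres pg_dump mydb | gzip > /backups/mydb-$(date +\%F).sql.gz
```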
## Restore tests
A backup no one tested is not a backup. It's a file.
HeroCtl Business schedules automatic tests:
```yaml
backup:
  test_restore:
    enabled: true
    schedule: "0 4 1 * *"   # day 1 of each month, 4am
    target: staging-cluster
```
Monthly the staging cluster wipes its state, restores the latest production snapshot, and runs a battery of checks:
- Did all jobs come back?
- Do secrets decrypt correctly?
- Is the ACL consistent?
If something fails, it alerts on the configured channel.
## Best practices
The traditional 3-2-1 backup rule applies here too:
- 3 copies of the snapshot
- 2 different media (local disk + cloud)
- 1 off-site (different provider from the cluster)
Applied:
- Copy 1: generated on the coordinator node.
- Copy 2: replicated to an S3 bucket in the same region.
- Copy 3: replicated to a bucket in another region or another provider.
When the entire datacenter goes down, it's copy 3 that saves you.
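The three copies can be kept honest with a small replication step after each snapshot. A sketch; the `fan_out` helper is ours, and local directories stand in for the destinations (in practice copies 2 and 3 would be S3-compatible buckets, uploaded with whatever client you already use):

```shell
# Illustrative fan-out: copy one snapshot to every destination and fail
# loudly if any copy fails. Destinations here are directories standing in
# for local disk plus two remote buckets.
fan_out() {
  snap="$1"; shift
  for dest in "$@"; do
    cp "$snap" "$dest/" || { echo "copy to $dest failed" >&2; return 1; }
  done
}

# Usage:
# fan_out /tmp/snap-2026-04-26.tar.gz /backups/local /mnt/bucket-a /mnt/bucket-b
```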
## Summary
| Action | Command | Frequency |
|---|---|---|
| Manual snapshot | `heroctl snapshot save` | Before big changes |
| Scheduled snapshot | configured in `server.yaml` | Daily |
| Restore | `heroctl snapshot restore` | In emergencies |
| Restore test | automatic | Monthly |
Next steps:
- Metrics and alerts — know when the backup failed.
- ACL — who can trigger a restore.
- Troubleshooting — when restore doesn't work.