Bring up a 3-node cluster
Form a cluster with 3 servers in under 10 minutes. Tolerates 1-node failure with no downtime.
A single node works, but any reboot brings the application down. To tolerate failures and deploy without a window, you need 3 nodes forming a distributed control plane.
Why 3 nodes, not 2 or 4?
The rule is simple: consensus among servers needs a majority. With 3 nodes, losing 1 still keeps 2 alive — a majority. With 2, losing 1 leaves 1 alive, which is not a majority, so the cluster stops accepting writes to stay safe.
| Nodes | Tolerates failure of | Indicated for |
|---|---|---|
| 1 | 0 | development, lab |
| 3 | 1 | small/medium production |
| 5 | 2 | critical production |
| 4 or 6 | same as 3 or 5 | never; it wastes a node |
Even numbers add cost without adding resilience. Always odd.
Note: workers (pure agent mode) do not count toward this math. You can have 3 servers + 50 workers and the rule remains "1 server can go down".
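The majority math above is worth seeing spelled out. This is a generic sketch of quorum arithmetic as used by Raft-style systems, not a heroctl API; the function names are made up for illustration:

```python
def quorum(servers: int) -> int:
    """Minimum number of servers that must agree for a write to commit."""
    return servers // 2 + 1

def tolerated_failures(servers: int) -> int:
    """How many servers can fail while a majority still survives."""
    return servers - quorum(servers)  # equivalent to (servers - 1) // 2

for n in (1, 2, 3, 4, 5, 6):
    print(f"{n} servers -> quorum {quorum(n)}, tolerates {tolerated_failures(n)}")
```

Running it shows why even counts are waste: 4 servers tolerate exactly as many failures as 3, and 6 as many as 5, while each extra server adds cost and one more participant to every consensus round.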
Provision the 3 servers
HeroCtl does not require identical machines, but the servers should share the same CPU and disk profile to avoid slowdowns caused by whichever node lags behind. A cheap and functional example:
| Provider | Plan | Cost/month | CPU | RAM | Disk |
|---|---|---|---|---|---|
| Hetzner | CPX21 | € 7.99 | 3 vCPU | 4 GB | 80 GB |
| DigitalOcean | s-2vcpu-4gb | US$ 24 | 2 vCPU | 4 GB | 80 GB |
| Vultr | vc2-2c-4gb | US$ 24 | 2 vCPU | 4 GB | 80 GB |
Provision 3 machines in the same datacenter, on a private network if the provider offers one. Latency between nodes should stay below 10 ms.
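You can sanity-check the latency target before forming the cluster by timing a TCP handshake between nodes. A minimal sketch, assuming something is already listening on the target port (the 10 ms target comes from this guide; the helper name is made up):

```python
import socket
import time

def tcp_connect_ms(host, port, timeout=2.0):
    """Time a TCP handshake to host:port and return it in milliseconds."""
    start = time.monotonic()
    with socket.create_connection((host, int(port)), timeout=timeout):
        pass  # connection established; we only wanted the handshake time
    return (time.monotonic() - start) * 1000.0

# Example: from node 1, measure the path to node 2 (needs a listener there):
# print(tcp_connect_ms("10.0.0.2", 8080))
```

A TCP connect costs one round trip, so values comfortably under 10 ms indicate the nodes are close enough for consensus traffic.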
Then follow the installation step on each one. When you finish, you will have 3 servers with the binary ready and nothing else.
Initialize the first node
The first node "opens" the cluster; the others will join it. Run this command on one machine only, and never again.
# On node 1, private IP 10.0.0.1
sudo heroctl cluster init --advertise 10.0.0.1
Expected output:
cluster initialized
node-id: node-1
state: healthy
nodes: 1/1
From here, node 1 already accepts jobs. But without the other two, there is no fault tolerance.
Generate join token
For the other nodes to join, you need a token. It is signed by the cluster and its validity is configurable.
# On node 1
heroctl cluster join-token --ttl 1h
# eyJhbGciOi...truncated...8X7Z
Warning: the token grants entry to the control plane. Treat it like a password: use a short TTL (1h is enough to join 3 nodes) and never paste it into logs or public Slack channels.
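The sample token is JWT-shaped (the `eyJhbGciOi` prefix is base64 for `{"alg":...`). Assuming that format holds, you can inspect a token's header or expiry locally without validating it; this is a generic JWT sketch, not a documented heroctl format, and the demo token below is hand-built, not a real one:

```python
import base64
import json

def jwt_part(token, index):
    """Decode one dot-separated JWT segment (no signature verification!)."""
    part = token.split(".")[index]
    part += "=" * (-len(part) % 4)  # restore the stripped base64 padding
    return json.loads(base64.urlsafe_b64decode(part))

def b64(obj):
    """Encode a dict as an unpadded base64url JSON segment."""
    return base64.urlsafe_b64encode(json.dumps(obj).encode()).decode().rstrip("=")

# Hand-built unsigned demo token, for illustration only:
demo = ".".join([b64({"alg": "HS256"}), b64({"exp": 1700003600}), "sig"])
print(jwt_part(demo, 0))  # {'alg': 'HS256'}
```

Decoding tells you what a token claims, nothing more; the whole string is still the credential, so the warning above applies to the raw token regardless.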
Connect nodes 2 and 3
On each of the other two nodes, run the join command with the generated token and that machine's private IP as the advertise address.
# On node 2, private IP 10.0.0.2
sudo heroctl cluster join \
--token eyJhbGciOi...8X7Z \
--advertise 10.0.0.2 \
--servers 10.0.0.1:8080
# On node 3, private IP 10.0.0.3
sudo heroctl cluster join \
--token eyJhbGciOi...8X7Z \
--advertise 10.0.0.3 \
--servers 10.0.0.1:8080
Each command takes 5 to 15 seconds. The node downloads the current state, syncs, and starts receiving real-time updates.
Check health
Once all 3 are connected, any of them can report the cluster view:
heroctl cluster status
Healthy output:
cluster: 3 nodes
quorum: ok (2/3 required)
leader: node-1 (10.0.0.1)
peers:
- node-1 10.0.0.1 server ready applied=1247
- node-2 10.0.0.2 server ready applied=1247
- node-3 10.0.0.3 server ready applied=1247
last_update: 0.4s ago
What to look at:
- quorum: ok — majority alive, cluster accepts writes.
- applied equal across all 3 nodes — all saw the same changes. A small difference (1–2) between heartbeats is normal.
- last_update below 5s — replication is flowing.
If applied diverges a lot (hundreds), there is a network or disk problem on a node. See Common problems below.
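The applied check can be scripted. A sketch that parses the plain-text peer lines in the format shown above (heroctl may or may not offer structured output; nothing here assumes it does):

```python
def applied_spread(status_text):
    """Max difference between the applied counters in a status dump."""
    values = []
    for line in status_text.splitlines():
        line = line.strip()
        if line.startswith("-") and "applied=" in line:
            values.append(int(line.rsplit("applied=", 1)[1]))
    return max(values) - min(values) if values else 0

sample = """\
peers:
  - node-1 10.0.0.1 server ready applied=1247
  - node-2 10.0.0.2 server ready applied=1246
  - node-3 10.0.0.3 server ready applied=1247
"""
print(applied_spread(sample))  # 1
```

A spread of 1–2 is the normal heartbeat lag; a spread in the hundreds is the divergence signal described above and warrants a look at that node's network and disk.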
Add workers
Workers only run containers. They do not vote, decide, or store state. Add as many as you want without affecting consensus.
# On each worker
sudo heroctl agent \
--token <TOKEN> \
--advertise 10.0.0.10 \
--servers 10.0.0.1:8080,10.0.0.2:8080,10.0.0.3:8080
Confirm with:
heroctl node list
# node-1 server ready 3.2 GB free 2 jobs
# node-2 server ready 3.4 GB free 1 job
# node-3 server ready 3.0 GB free 2 jobs
# node-10 worker ready 7.8 GB free 0 jobs
Note: always pass all 3 server IPs in --servers. If you point only at node 1 and it goes down during the join, the agent gets stuck retrying.
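The reason for listing all 3 servers is plain failover: try each address until one answers. A generic sketch of that loop, not heroctl's actual client code:

```python
import socket

def first_reachable(servers, timeout=1.0):
    """Return the first host:port string that accepts a TCP connection."""
    for addr in servers:
        host, port = addr.rsplit(":", 1)
        try:
            with socket.create_connection((host, int(port)), timeout=timeout):
                return addr
        except OSError:
            continue  # node down or unreachable: try the next one
    return None  # every server failed -> caller should back off and retry

# With only node 1 listed, this returns None the moment node 1 is down;
# with all three listed, any surviving server keeps the agent connected.
```

The same logic is why losing one server is invisible to workers: their next attempt simply lands on a peer.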
Common problems
Node does not connect
Symptom: the cluster join stays at "connecting..." for over 30s.
Typical cause: firewall blocking port 8080 between nodes. Check with nc -zv <node-1-ip> 8080. If it fails, open the port in the provider's external firewall.
Divergent state between nodes
Symptom: applied very different between nodes (>100).
Possible cause: a node went offline and is replaying history. Wait a few minutes. If it does not converge, force resync on the lagging node:
sudo systemctl restart heroctl
heroctl node info <node-id>
Cluster without majority
Symptom: quorum: degraded or write commands hang.
It means 2 of 3 nodes are out. The cluster refuses writes to avoid inconsistency. Recover the nodes before trying to change anything. If one of the servers is permanently lost, replace it with cluster leave + new cluster join.
Two "leaders" appear
This should not happen. If status on one node claims it is the leader while another node claims the same, there is a clock problem. Sync the clocks with chrony or ntpd on all nodes and restart.
Next step: deploy your first app.