Bring up a 3-node cluster

Form a cluster with 3 servers in under 10 minutes. Tolerates 1-node failure with no downtime.

Operations · 8 min read · last reviewed 2026-04-26

A single node works, but any reboot takes the application down. To tolerate failures and deploy without a maintenance window, you need 3 nodes forming a distributed control plane.

Why 3 nodes, not 2 or 4?

The rule is simple: consensus among servers needs a majority. With 3 nodes, losing 1 still keeps 2 alive — a majority. With 2, losing 1 leaves 1 alive, no majority, and the cluster locks for safety.

Nodes  | Tolerates failure of | Indicated for
-------|----------------------|------------------------
1      | 0                    | development, lab
3      | 1                    | small/medium production
5      | 2                    | critical production
4 or 6 | same as 3 or 5       | never; it is waste

Even numbers add cost without adding resilience. Always odd.

Note: workers (pure agent mode) do not count toward this math. You can have 3 servers + 50 workers and the rule remains "1 server can go down".
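The majority arithmetic behind the table is easy to check: a cluster of N servers needs floor(N/2)+1 votes, so it tolerates floor((N-1)/2) failures. A quick sketch:

```shell
# Majority = floor(n/2) + 1; tolerated failures = n - majority
for n in 1 2 3 4 5 6 7; do
  majority=$(( n / 2 + 1 ))
  tolerates=$(( n - majority ))
  echo "$n server(s): majority $majority, tolerates $tolerates failure(s)"
done
```

Note how 4 and 6 land on the same tolerance as 3 and 5, which is the whole argument for odd cluster sizes.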

Provision the 3 servers

HeroCtl does not require identical machines, but servers should share the same CPU and disk profile so the cluster is not slowed by its weakest member. A cheap, functional example:

Provider     | Plan        | Cost/month | CPU    | RAM  | Disk
-------------|-------------|------------|--------|------|------
Hetzner      | CPX21       | € 7.99     | 3 vCPU | 4 GB | 80 GB
DigitalOcean | s-2vcpu-4gb | US$ 24     | 2 vCPU | 4 GB | 80 GB
Vultr        | vc2-2c-4gb  | US$ 24     | 2 vCPU | 4 GB | 80 GB

Provision 3 machines in the same datacenter, on a private network if the provider offers one. Latency between nodes should stay below 10 ms.

Then follow the installation step on each one. When you finish, you will have 3 servers with the binary ready and nothing else.

Initialize the first node

The first node "opens" the cluster; the others will join it. This command runs once, on a single machine, and never again.

# On node 1, private IP 10.0.0.1
sudo heroctl cluster init --advertise 10.0.0.1

Expected output:

cluster initialized
node-id:  node-1
state:    healthy
nodes:    1/1

From here, node 1 already accepts jobs. But without the other two, there is no fault tolerance.

Generate join token

For other nodes to join, you need a token. It is signed by the cluster and has configurable validity.

# On node 1
heroctl cluster join-token --ttl 1h
# eyJhbGciOi...truncated...8X7Z

Warning: the token grants entry to the control plane. Treat as a password. Use short TTL (1h is enough to join 3 nodes) and never paste in logs or public Slack.
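One way to follow that advice is to never let the token touch your terminal at all. A sketch using only standard Unix tooling, with the heroctl call shown as a comment (the file path is an arbitrary choice):

```shell
# Keep the join token out of shell history and logs: write it to a file only
# the current user can read, and reference it by path. umask 077 makes every
# file created from here on mode 0600.
umask 077
token_file=$(mktemp /tmp/join-token.XXXXXX)
# heroctl cluster join-token --ttl 1h > "$token_file"   # real command, on node 1
echo "placeholder-token" > "$token_file"                # stand-in for this demo
stat -c '%a' "$token_file"                              # -> 600
```

On the joining node you would then pass --token "$(cat "$token_file")" and delete the file once the join succeeds.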

Connect nodes 2 and 3

On each of the other two nodes, run the join command. Replace <TOKEN> with the generated token and <IP> with that machine's private IP.

# On node 2, private IP 10.0.0.2
sudo heroctl cluster join \
  --token eyJhbGciOi...8X7Z \
  --advertise 10.0.0.2 \
  --servers 10.0.0.1:8080

# On node 3, private IP 10.0.0.3
sudo heroctl cluster join \
  --token eyJhbGciOi...8X7Z \
  --advertise 10.0.0.3 \
  --servers 10.0.0.1:8080

Each command takes 5 to 15 seconds. The node downloads the current state, syncs, and starts receiving real-time updates.
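If you script the joins, it is safer to wait out that sync window than to assume it finished. A generic retry helper; the heroctl check at the end is a hypothetical example of how you might use it:

```shell
# wait_for CMD [TRIES]: re-run CMD every second until it succeeds or the
# attempts run out. CMD is deliberately unquoted inside so it word-splits.
wait_for() {
  cmd=$1
  tries=${2:-30}
  i=0
  while [ "$i" -lt "$tries" ]; do
    if $cmd >/dev/null 2>&1; then
      return 0
    fi
    i=$((i + 1))
    sleep 1
  done
  return 1
}

# Hypothetical usage right after a join:
# wait_for "heroctl cluster status" 30 || echo "node did not become ready"
```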

Check health

Once all 3 are connected, any one of them can answer with the cluster view:

heroctl cluster status

Healthy output:

cluster:  3 nodes
quorum:   ok (2/3 required)
leader:   node-1 (10.0.0.1)
peers:
  - node-1  10.0.0.1  server  ready  applied=1247
  - node-2  10.0.0.2  server  ready  applied=1247
  - node-3  10.0.0.3  server  ready  applied=1247
last_update: 0.4s ago

What to look at:

  • quorum: ok — majority alive, cluster accepts writes.
  • applied equal across all 3 nodes — everyone saw the same changes. A small difference (1–2) between one heartbeat and the next is normal.
  • last_update below 5s — replication is flowing.

If applied diverges a lot (hundreds), there is a network or disk problem on a node. See Common problems below.
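That check is easy to automate. A sketch that parses the applied= column from status output in the format shown above and flags a spread over 100; in real use you would pipe heroctl cluster status into it:

```shell
# check_applied: read status lines on stdin, compare applied= counters,
# print "divergent" (exit 1) if the spread exceeds 100, else "ok"
check_applied() {
  awk '
    /applied=/ {
      split($NF, a, "="); v = a[2] + 0
      if (!seen) { min = v; max = v; seen = 1 }
      if (v < min) min = v
      if (v > max) max = v
    }
    END {
      if (max - min > 100) { print "divergent"; exit 1 }
      print "ok"
    }'
}

# heroctl cluster status | check_applied   # hypothetical real invocation
```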

Add workers

Workers only run containers. They do not vote, decide, or store state. Add as many as you want without affecting consensus.

# On each worker
sudo heroctl agent \
  --token <TOKEN> \
  --advertise 10.0.0.10 \
  --servers 10.0.0.1:8080,10.0.0.2:8080,10.0.0.3:8080

Confirm with:

heroctl node list
# node-1   server  ready   3.2 GB free   2 jobs
# node-2   server  ready   3.4 GB free   1 job
# node-3   server  ready   3.0 GB free   2 jobs
# node-10  worker  ready   7.8 GB free   0 jobs

Note: always pass all 3 IPs in --servers. If you point the agent only at node 1 and node 1 goes down during the join, the agent sits in a retry loop.
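As the worker fleet grows, it helps to tally roles from that listing. A small sketch over the node list format shown above, where the second column is the role; pipe heroctl node list into it in real use:

```shell
# count_roles: count servers vs workers from node-list lines on stdin
count_roles() {
  awk 'NF >= 2 { count[$2]++ } END { for (r in count) print r, count[r] }'
}

# heroctl node list | count_roles   # hypothetical real invocation
```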

Common problems

Node does not connect

Symptom: the cluster join stays at "connecting..." for over 30s.

Typical cause: firewall blocking port 8080 between the nodes. Check with nc -zv <node-1-ip> 8080. If it fails, open the port in the provider's external firewall.
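If nc is not installed on the node, bash can probe a TCP port by itself through its /dev/tcp pseudo-device. A sketch to run from node 2 or 3 against node 1 (bash-specific, not POSIX sh):

```shell
# probe HOST PORT: print "open" if a TCP connection succeeds, else "closed".
# Uses bash's /dev/tcp; the subshell keeps fd 3 from leaking into the caller.
probe() {
  if (exec 3<>"/dev/tcp/$1/$2") 2>/dev/null; then
    echo "open"
  else
    echo "closed"
  fi
}

# probe 10.0.0.1 8080   # "closed" here means firewall or service down
```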

Divergent state between nodes

Symptom: applied very different between nodes (>100).

Possible cause: a node went offline and is replaying history. Wait a few minutes. If it does not converge, force resync on the lagging node:

sudo systemctl restart heroctl
heroctl node info <node-id>

Cluster without majority

Symptom: quorum: degraded or write commands hang.

It means 2 of the 3 servers are down. The cluster refuses writes to avoid inconsistency. Recover the nodes before trying to change anything. If one of the servers is permanently lost, replace it: run cluster leave for the dead node, then cluster join on its replacement.

Two "leaders" appear

In practice this does not happen. If one node's status reports it as leader and another node claims the same about itself, the clocks have drifted. Sync them with chrony or ntpd on all nodes and restart.

Next step: deploy your first app.

#cluster#high-availability#getting-started