Troubleshooting common problems

The 12 most frequent problems in HeroCtl clusters, with symptom, diagnosis, and step-by-step fix.


This guide covers the twelve problems that most often appear in HeroCtl clusters. Each item lists the symptom, the diagnosis, and the fix. Use it as a quick reference during an incident.

1. Cluster won't start: "cannot bind to port 8080"

Symptom: the service comes up and dies right after. The log says port 8080 is in use.

Diagnosis:

sudo lsof -i :8080
# or
sudo ss -tlnp | grep 8080

The output shows which process is holding the port.

Fix:

If it's a legitimate process (another app), change the HeroCtl port:

# /etc/heroctl/server.yaml
api:
  port: 8090
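
After changing the port, restart the service so the new setting takes effect:

sudo systemctl restart heroctl-server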

If it's a stale process (an old HeroCtl instance that didn't shut down cleanly):

sudo kill -9 <PID>
sudo systemctl start heroctl-server

2. Node cannot join the cluster

Symptom: the join command hangs or returns connection refused / invalid token.

Diagnosis:

Three common suspects:

# 1. Has the token expired?
heroctl cluster join-token list

# 2. Is a firewall blocking the ports?
nc -zv <coordinator-ip> 4646
nc -zv <coordinator-ip> 4647
nc -zv <coordinator-ip> 4648

# 3. Are the clocks out of sync?
timedatectl status

Fix:

  • Token expired: generate another with heroctl cluster join-token create --ttl 1h.
  • Firewall: open ports 4646, 4647, and 4648 between nodes (see the sketch below and the firewall guide).
  • Clock: install and enable NTP.
sudo apt install chrony
sudo systemctl enable --now chrony

A drift greater than 30 seconds between nodes breaks coordination.
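
For the firewall item, a minimal sketch with ufw (assumes Ubuntu with ufw enabled; <node-ip> is a placeholder, and firewalld or cloud security groups need equivalent rules):

sudo ufw allow proto tcp from <node-ip> to any port 4646:4648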

3. Cluster lost coordination

Symptom: API responds with 503 and messages about a missing coordinator. Changes are not accepted.

Diagnosis:

heroctl cluster status

You'll see how many nodes are healthy. If a majority is not responding, the cluster locks itself into read-only mode for safety.

Fix:

The normal solution is to bring the downed nodes back:

ssh <downed-node> sudo systemctl start heroctl-server

If they don't come back (dead disk, lost machine), use forced bootstrap from the latest snapshot:

heroctl snapshot restore /backups/latest.tar.gz --force-bootstrap

Warning: forced bootstrap discards everything that happened after the snapshot. See the backup and restore guide.
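
Whichever path you take, confirm quorum is restored before resuming writes:

heroctl cluster status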

4. Job stuck in "pending"

Symptom: heroctl jobs status meu-job shows pending for minutes. Nothing starts.

Diagnosis:

heroctl jobs explain meu-job

The output details why the scheduler can't place the job. Typical cause: no node has the CPU/RAM available for the requested resources.

Fix:

Two options:

  • Add more nodes to the cluster.
  • Reduce the resources required by the job:
resources:
  cpu_mhz: 500      # was 2000
  memory_mb: 256    # was 1024
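
In either case, resubmit the spec afterwards so the scheduler retries placement (assuming the spec lives in meu-job.json, mirroring the submit command in problem 12):

heroctl jobs submit meu-job.json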

5. Health check failing

Symptom: the job comes up but is marked as unhealthy. Restarts in a loop.

Diagnosis:

heroctl logs <alloc-id> | tail -50

Often the app takes longer to start than healthy_deadline allows.

Fix:

Increase the deadline:

health_check:
  path: /health
  port: 8080
  interval: 10s
  timeout: 3s
  healthy_deadline: 120s    # was 30s

If the app really is slow to start (loads a huge cache, connects to several services), work on boot time. Lazy loading usually solves it.

6. TLS certificate not issued

Symptom: the site responds with a self-signed certificate or a TLS error. Ingress logs mention failure in automatic issuance.

Diagnosis:

# Does DNS point at the correct IP?
dig +short meudominio.com

# Is port 80 reachable from outside?
curl -I http://meudominio.com/.well-known/acme-challenge/test

Automatic certificate issuance needs two things: public DNS pointing at a cluster node and port 80 open to the world.

Fix:

  • Wrong DNS: fix the A record at your provider.
  • Port 80 closed: open it on the server's firewall and the provider's firewall (security group, etc.).
  • Domain with active CDN proxy: turn off the proxy temporarily for issuance; re-enable afterwards.
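
After removing the blocker, a standard openssl check (independent of HeroCtl) confirms whether a valid certificate is now being served:

echo | openssl s_client -connect meudominio.com:443 -servername meudominio.com 2>/dev/null | openssl x509 -noout -issuer -dates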

7. App slow under load

Symptom: latency rises when traffic grows. Users complain.

Diagnosis:

heroctl metrics --job meu-app --since 30m

Look at CPU, memory, and instance count. Also check whether a deploy is in flight — gradual deploys temporarily remove capacity.

Fix:

If capacity is short, scale:

heroctl jobs scale meu-app --count 6   # from 3 to 6

If a deploy is in flight, wait for it to finish before evaluating. If the app has a memory leak or a tight loop, profile the code — there's nothing the orchestrator can do about an app's internal problem.

8. Logs don't show up

Symptom: heroctl logs returns empty even with the app running and producing output.

Diagnosis:

docker inspect <container-id> | grep LogConfig

If you see "Type": "none" or an unsupported driver, that's the problem.

Fix:

Configure the default log driver on the machine:

# /etc/docker/daemon.json
{
  "log-driver": "json-file",
  "log-opts": {
    "max-size": "100m",
    "max-file": "3"
  }
}

Restart the service:

sudo systemctl restart docker
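
The new driver only applies to containers created after the restart, so redeploy the job. To confirm what a running container actually uses:

docker inspect <container-id> --format '{{.HostConfig.LogConfig.Type}}'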

9. Postgres connection timing out

Symptom: the app logs connection timeout or too many clients when connecting to the database.

Diagnosis:

On Postgres:

SELECT count(*) FROM pg_stat_activity;
SHOW max_connections;

If count(*) is close to max_connections, Postgres is out of connection slots.

Fix:

Put pgbouncer between the app and the database:

# pgbouncer job
config:
  max_client_conn: 1000
  default_pool_size: 25

And point the apps at pgbouncer instead of the database directly. You can serve thousands of client connections with a few real database connections.
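
A connection string for that setup might look like this (hostname, credentials, and database name are placeholders; 6432 is pgbouncer's default port):

export DATABASE_URL=postgres://app:secret@pgbouncer.internal:6432/app_db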

10. Cluster appears to have two coordinators

Symptom: strange behaviors — writes on one node don't show up on the other. Inconsistent metrics across panels.

Diagnosis:

heroctl cluster peers

If the peer list varies depending on which node you query, there was a network split and each half kept operating as if it were the legitimate cluster.

Fix:

Identify the minority half (the one with fewer nodes) and restart those nodes:

sudo systemctl restart heroctl-server

They re-sync with the majority half and the inconsistency goes away. Then check whether any data diverged during the interval:

heroctl jobs status --all | grep -i diverge

11. Disk full

Symptom: the node starts misbehaving. Slow API. Agent restarts containers for no apparent reason. df -h shows 100%.

Diagnosis:

sudo du -sh /var/lib/heroctl/* | sort -h
sudo du -sh /var/log/* | sort -h

The usual culprits are old logs and uncleaned snapshots.

Fix:

Configure rotation:

# /etc/heroctl/server.yaml
logs:
  retention_days: 7
  max_size_per_alloc_mb: 500

snapshots:
  retention_count: 10

And an immediate manual cleanup:

sudo journalctl --vacuum-time=3d
heroctl snapshot prune --keep 10

12. Container killed for lack of memory

Symptom: heroctl logs ends with OOMKilled. The container restarts in a loop.

Diagnosis:

heroctl alloc status <id> | grep -A5 "memory"

Compare actual usage with the defined limit.

Fix:

Raise the limit in the job spec:

resources:
  memory_mb: 1024    # was 512

Submit the new version:

heroctl jobs submit meu-app.json

If memory usage grows over time (a leak), raising the limit only delays the problem. Investigate the app.
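
To tell a leak from a legitimately higher baseline, watch usage over a longer window with the metrics command from problem 7; memory that climbs steadily under constant traffic points to a leak:

heroctl metrics --job meu-app --since 24h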

When none of this helps

Gather the following information before opening a ticket:

  • heroctl cluster status (full output)
  • heroctl version on every node
  • The request_id returned by the API error
  • Log excerpt with the timestamp of the incident
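
A quick way to collect the per-node versions (assuming SSH access; the node names are placeholders):

for node in node-1 node-2 node-3; do ssh "$node" heroctl version; done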

Send to suporte@heroctl.com with this information in the message body. The more context, the faster the response.
