Troubleshooting common problems
The 12 most frequent problems in HeroCtl clusters, with symptom, diagnosis, and step-by-step fix.
Each entry lists the symptom, the diagnosis, and the fix. Use it as a quick reference during an incident.
1. Cluster won't start: "cannot bind to port 8080"
Symptom: the service starts and exits immediately. The log reports that port 8080 is already in use.
Diagnosis:
sudo lsof -i :8080
# or
sudo ss -tlnp | grep 8080
The output shows which process is holding the port.
Fix:
If it's a legitimate process (another app), change the HeroCtl port:
# /etc/heroctl/server.yaml
api:
  port: 8090
If it's a stale process (an old HeroCtl instance that didn't shut down cleanly):
sudo kill -9 <PID>
sudo systemctl start heroctl-server
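If you want to confirm the PID really is a leftover HeroCtl before killing it, a minimal check (lsof -t prints only the PID; ps shows what's behind it):
# inspect the process holding 8080 before killing it
PID=$(sudo lsof -t -i :8080)
ps -p "$PID" -o pid,comm,etime,args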
2. Node cannot join the cluster
Symptom: the join command hangs or returns connection refused / invalid token.
Diagnosis:
Three common suspects:
# 1. Has the token expired?
heroctl cluster join-token list
# 2. Is a firewall blocking the ports?
nc -zv <coordinator-ip> 4646
nc -zv <coordinator-ip> 4647
nc -zv <coordinator-ip> 4648
# 3. Are the clocks out of sync?
timedatectl status
Fix:
- Token expired: generate another with heroctl cluster join-token create --ttl 1h.
- Firewall: open ports 4646, 4647, and 4648 between nodes. See firewall.
- Clock: install and enable NTP.
sudo apt install chrony
sudo systemctl enable --now chrony
A drift greater than 30 seconds between nodes breaks coordination.
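To see a node's actual offset once chrony is running, chronyc reports the measured drift (standard chrony; compare the value against the 30-second limit):
# "System time" shows the measured offset from NTP
chronyc tracking | grep "System time"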
3. Cluster lost coordination
Symptom: API responds with 503 and messages about a missing coordinator. Changes are not accepted.
Diagnosis:
heroctl cluster status
You'll see how many nodes are healthy. If a majority doesn't respond, the cluster locks into read-only mode for safety: a 5-node cluster, for example, needs at least 3 healthy nodes to accept writes.
Fix:
The normal solution is to bring the downed nodes back:
ssh <downed-node> sudo systemctl start heroctl-server
If they don't come back (dead disk, lost machine), use forced bootstrap from the latest snapshot:
heroctl snapshot restore /backups/ultimo.tar.gz --force-bootstrap
Warning: forced bootstrap discards everything that happened after the snapshot. See backup and restore.
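Before running it, check how much history you're about to discard; the snapshot's modification time gives a lower bound (plain ls/tar on the path used above):
# when was the snapshot taken, and what does it contain?
ls -lh /backups/ultimo.tar.gz
tar -tzf /backups/ultimo.tar.gz | head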
4. Job stuck in "pending"
Symptom: heroctl jobs status meu-job shows pending for minutes. Nothing starts.
Diagnosis:
heroctl jobs explain meu-job
The output details why the scheduler can't place the job. Typical cause: no node has the CPU/RAM available for the requested resources.
Fix:
Two options:
- Add more nodes to the cluster.
- Reduce the resources required by the job:
resources:
  cpu_mhz: 500    # previously 2000
  memory_mb: 256  # previously 1024
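After editing the spec, resubmit and re-check placement (jobs submit appears in problem 12; the file name here is illustrative):
heroctl jobs submit meu-job.json
heroctl jobs explain meu-job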
5. Health check failing
Symptom: the job comes up but is marked as unhealthy. Restarts in a loop.
Diagnosis:
heroctl logs <alloc-id> | tail -50
A frequent cause: the app takes longer to start than healthy_deadline allows.
Fix:
Increase the deadline:
health_check:
  path: /health
  port: 8080
  interval: 10s
  timeout: 3s
  healthy_deadline: 120s  # was 30s
If the app really is slow to start (loads a huge cache, connects to several services), work on boot time. Lazy loading usually solves it.
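To size the deadline with data instead of guessing, time how long the app takes to answer its own health check (standard curl; assumes you can reach the container's port locally):
# retries every 5s until /health answers; "real" is your startup time
time curl -sf --retry 30 --retry-delay 5 --retry-all-errors http://localhost:8080/health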
6. TLS certificate not issued
Symptom: the site responds with a self-signed certificate or a TLS error. Ingress logs mention failure in automatic issuance.
Diagnosis:
# Does DNS point at the correct IP?
dig +short meudominio.com
# Is port 80 reachable from outside?
curl -I http://meudominio.com/.well-known/acme-challenge/test
Automatic certificate issuance needs two things: public DNS pointing at a cluster node and port 80 open to the world.
Fix:
- Wrong DNS: fix the A record at your provider.
- Port 80 closed: open it on the server's firewall and the provider's firewall (security group, etc.).
- Domain with active CDN proxy: turn off the proxy temporarily for issuance; re-enable afterwards.
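Once fixed, confirm the served certificate from the outside (standard openssl; the domain is the placeholder from the diagnosis):
# print issuer and validity window of the certificate being served
echo | openssl s_client -connect meudominio.com:443 -servername meudominio.com 2>/dev/null \
  | openssl x509 -noout -issuer -dates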
7. App slow under load
Symptom: latency rises when traffic grows. Users complain.
Diagnosis:
heroctl metrics --job meu-app --since 30m
Look at CPU, memory, and instance count. Also check whether a deploy is in flight — gradual deploys temporarily remove capacity.
Fix:
If capacity is short, scale:
heroctl jobs scale meu-app --count 6  # from 3 to 6
If a deploy is in flight, wait for it to finish before evaluating. If the app has a memory leak or a tight loop, profile the code — there's nothing the orchestrator can do about an app's internal problem.
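After scaling, watch the same metrics to confirm latency actually recovers (watch is standard; the metrics command is the one from the diagnosis):
# refresh the job's recent metrics every 30 seconds
watch -n 30 'heroctl metrics --job meu-app --since 5m'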
8. Logs don't show up
Symptom: heroctl logs returns empty even with the app running and producing output.
Diagnosis:
docker inspect <container-id> | grep LogConfig
If you see "Type": "none" or an unsupported driver, that's the problem.
Fix:
Configure the default log driver on the machine:
# /etc/docker/daemon.json
{
  "log-driver": "json-file",
  "log-opts": {
    "max-size": "100m",
    "max-file": "3"
  }
}
Restart the service:
sudo systemctl restart docker
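Verify the new default took effect (docker info is standard). Note that restarting Docker also restarts running containers unless live-restore is enabled:
# should print: json-file
docker info --format '{{.LoggingDriver}}'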
9. Postgres connection timing out
Symptom: the app logs connection timeout or too many clients when connecting to the database.
Diagnosis:
On Postgres:
SELECT count(*) FROM pg_stat_activity;
SHOW max_connections;
If count(*) is close to max_connections, the pool is saturated.
Fix:
Put a pgbouncer between app and database:
# pgbouncer job
config:
  max_client_conn: 1000
  default_pool_size: 25
And point the apps at pgbouncer instead of the database directly. You can serve thousands of client connections with a few real database connections.
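The app-side change is just the connection string; a sketch assuming pgbouncer's default listen port of 6432 and hypothetical hostnames:
# before: the app connects straight to Postgres
# DATABASE_URL=postgres://app:secret@db.internal:5432/app
# after: the app connects to pgbouncer, which pools toward Postgres
DATABASE_URL=postgres://app:secret@pgbouncer.internal:6432/app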
10. Cluster appears to have two coordinators
Symptom: odd behavior: writes on one node don't show up on another, and metrics differ from panel to panel.
Diagnosis:
heroctl cluster peers
If the peer list varies depending on which node you query, the network split and each half believed it was the healthy one.
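To see the disagreement directly, query each node and compare the lists (node names here are hypothetical):
# print the peer list as seen from every node
for n in node1 node2 node3; do
  echo "== $n"; ssh "$n" heroctl cluster peers
done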
Fix:
Identify the minority half (the one with fewer nodes) and restart those nodes:
sudo systemctl restart heroctl-server
They re-sync with the majority half and the inconsistency goes away. Then check whether any data diverged during the interval:
heroctl jobs status --all | grep -i diverge
11. Disk full
Symptom: the node starts misbehaving. Slow API. Agent restarts containers for no apparent reason. df -h shows 100%.
Diagnosis:
sudo du -sh /var/lib/heroctl/* | sort -h
sudo du -sh /var/log/* | sort -h
The usual culprits are old logs and uncleaned snapshots.
Fix:
Configure rotation:
# /etc/heroctl/server.yaml
logs:
  retention_days: 7
  max_size_per_alloc_mb: 500
snapshots:
  retention_count: 10
And an immediate manual cleanup:
sudo journalctl --vacuum-time=3d
heroctl snapshot prune --keep 10
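If the node runs containers via Docker (as in problem 8), unused images are another common sink. Pruning deletes data, so review the reclaimable space first:
# show reclaimable space, then remove unused images
sudo docker system df
sudo docker image prune -a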
12. Container killed for lack of memory
Symptom: heroctl logs ends with OOMKilled. The container restarts in a loop.
Diagnosis:
heroctl alloc status <id> | grep -A5 "memory"
Compare actual usage with the defined limit.
Fix:
Raise the limit in the job spec:
resources:
  memory_mb: 1024  # was 512
Submit the new version:
heroctl jobs submit meu-app.json
If memory usage grows over time (a leak), raising the limit only delays the problem. Investigate the app.
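To tell a leak from simple undersizing, sample usage over time: a plateau means the limit was too low, a steady climb means a leak. A sketch using only the command from the diagnosis:
# log memory usage once a minute and watch the trend
while true; do
  date
  heroctl alloc status <id> | grep -A5 "memory"
  sleep 60
done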
When none of this helps
Gather the following information before opening a ticket:
- heroctl cluster status (full output)
- heroctl version on every node
- The request_id returned by the API error
- Log excerpt with the timestamp of the incident
Send to suporte@heroctl.com with this information in the message body. The more context, the faster the response.
Next steps
- Metrics and alerts — detect problems before users do.
- Backup and restore — preparation for the worst scenarios.
- API reference — when the CLI isn't enough.