Metrics and logs
Collect metrics, logs, and traces without standing up an external observability stack, and know when it is worth integrating with an outside tool instead.
Observability usually requires a software stack running parallel to the cluster: a metric agent on each node, a central time-series database, a log aggregator, a dashboard, an alerter, and a tracer. Six components, each with its own configuration, its own update cycle, and its own bill.
HeroCtl solves this internally. Metrics, logs, alerts, and tracing are already built into the control plane. You only plug in an external tool when the team has a concrete reason for it.
Metrics
The default endpoint
Each server node exposes metrics in Prometheus format at /v1/metrics:
curl -H "X-Heroctl-Token: $TOKEN" https://manage.exemplo.com/v1/metrics
Typical output (truncated):
# HELP heroctl_node_cpu_usage_percent CPU in use per node
# TYPE heroctl_node_cpu_usage_percent gauge
heroctl_node_cpu_usage_percent{node="server-1"} 23.4
heroctl_node_cpu_usage_percent{node="server-2"} 18.1
# HELP heroctl_alloc_memory_bytes Memory used per allocation
# TYPE heroctl_alloc_memory_bytes gauge
heroctl_alloc_memory_bytes{job="api",alloc="abc123"} 285212672
What ships out of the box
| Metric family | Examples |
|---|---|
| Nodes | CPU, RAM, disk, network, load average, uptime |
| Allocations | CPU, RAM, restarts, status, age |
| Jobs | Healthy replicas, pending allocations, active deploys |
| Router | Requests/s, p50/p95/p99 latency, 5xx errors per host |
| Ingress TLS | Certificate validity, renewal failures |
| Internal API | Latency, throughput, error rate |
On a freshly installed cluster, this already feeds a complete dashboard with no extra configuration.
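Two sample queries against these families, using only the metric names shown in the sample output above (the remaining built-in names appear in the endpoint's HELP text):

# nodes running above 80% CPU
heroctl_node_cpu_usage_percent > 80

# total memory per job, summed across allocations
sum by (job) (heroctl_alloc_memory_bytes)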
Custom application metrics
Your application exposes /metrics on any port, in Prometheus format. Declare it in the spec:
job: api-pagamentos
metrics:
  enabled: true
  path: /metrics
  port: 9090
  interval: 15s
The cluster scrapes, aggregates, and serves them on the same /v1/metrics endpoint. The metrics arrive labeled with job, alloc, and node, so querying is direct:
rate(http_requests_total{job="api-pagamentos",status=~"5.."}[5m])
Official Prometheus clients exist for Go, Python, Java, Node.js, Ruby, .NET, Rust, and PHP; in any of them, instrumenting an application takes about fifteen minutes, as the sketch below shows.
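A minimal Python service matching the spec above, using the official prometheus_client package (the metric names here are placeholders):

import time
import random
from prometheus_client import Counter, Histogram, start_http_server

# Placeholder metrics; name yours after what they measure.
REQUESTS = Counter("http_requests_total", "HTTP requests served", ["status"])
LATENCY = Histogram("http_request_duration_seconds", "Request latency")

start_http_server(9090)  # serves /metrics on port 9090, as declared in the spec

while True:
    with LATENCY.time():  # records the elapsed time into the histogram
        time.sleep(random.uniform(0.01, 0.1))  # stand-in for real work
    REQUESTS.labels(status="200").inc()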
Embedded panel
The admin panel (port 8443) ships a ready-made charts section:
- Cluster view: aggregated CPU, RAM, network
- Job view: replicas, restarts, router latency
- Allocation view: log stream, individual metrics
- Host view: detail of each node, allocations on it
For most teams, this panel replaces an externally hosted Grafana. When you need heavily customized dashboards or correlation with external data sources, point a Grafana at the /v1/metrics endpoint as a regular Prometheus datasource, as sketched below.
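A sketch of that datasource in Grafana's standard YAML provisioning format; it assumes the endpoint answers Prometheus query-API calls, as implied above, and it passes the token through Grafana's custom-header fields:

apiVersion: 1
datasources:
  - name: heroctl
    type: prometheus
    access: proxy
    url: https://manage.exemplo.com/v1/metrics
    jsonData:
      httpHeaderName1: X-Heroctl-Token
    secureJsonData:
      httpHeaderValue1: $TOKEN  # expanded from the environment at provisioning time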
Logs
Collection model
Each allocation has stdout and stderr captured by the local agent, compressed, and sent to the cluster's central log writer. There is no separate log agent to install, configure, or update.
Real-time tail
# stream the whole job (all allocations)
heroctl logs -f --job api-pagamentos
# a specific allocation
heroctl logs -f --alloc abc123
# stderr only
heroctl logs -f --job api-pagamentos --stream stderr
Filtering
# between two timestamps
heroctl logs --job api-pagamentos \
  --since "2026-04-25 10:00" \
  --until "2026-04-25 11:00"
# text search
heroctl logs --job api-pagamentos --since 1h | grep "panic"
# structured output for processing with jq
heroctl logs --job api-pagamentos --since 1h --format json
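For instance, ranking allocations by error volume over the last hour; the .level and .alloc field names are assumptions about the JSON schema, so check them against real output first:

heroctl logs --job api-pagamentos --since 1h --format json \
  | jq -r 'select(.level == "error") | .alloc' \
  | sort | uniq -c | sort -rn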
Retention
Default: 30 days per active allocation, 7 days after the allocation terminates. Configurable in the cluster spec:
logs:
  retention:
    active_days: 30
    terminated_days: 7
  storage:
    type: local
    path: /var/lib/heroctl/logs
    max_size_gb: 100
For longer retention, export to external storage (next section).
External export
When you need retention measured in years, or correlation with logs from systems that don't run on the cluster, ready-made export outputs are available:
logs:
  export:
    - type: syslog
      destination: logs.empresa.com.br:514
      protocol: tcp
      tls: true
    - type: loki
      url: https://loki.empresa.com.br
      tenant: heroctl-prod
    - type: cloudwatch
      region: us-east-1
      log_group: /heroctl/prod
      credentials: ${secret.aws_logs}
    - type: elasticsearch
      url: https://elastic.empresa.com.br
      index: heroctl-%Y.%m.%d
      credentials: ${secret.es_creds}
Multiple destinations can run at the same time. The cluster keeps the local copy for the retention period and replicates to the configured destinations.
Alerts
An alert is an expression over metrics that fires a webhook when true for a configured duration:
alerts:
  - name: api-erro-alto
    expr: |
      rate(http_requests_total{job="api-pagamentos",status=~"5.."}[5m])
        / rate(http_requests_total{job="api-pagamentos"}[5m]) > 0.05
    for: 5m
    severity: critical
    annotations:
      summary: "Error rate above 5% on api-pagamentos"
      runbook: https://wiki.empresa.com.br/runbook/api-pagamentos
    notify:
      - type: slack
        webhook: ${secret.slack_oncall}
      - type: pagerduty
        routing_key: ${secret.pagerduty_critical}
  - name: certificado-expirando
    expr: heroctl_ingress_cert_expiry_days < 14
    for: 1h
    severity: warning
    notify:
      - type: discord
        webhook: ${secret.discord_ops}
Channels supported out of the box: Slack, Discord, PagerDuty, Opsgenie, and a generic webhook. For custom integrations (Telegram, email, SMS), the generic webhook covers the rest, as the sketch below shows.
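A minimal sketch of such a receiver in Python, forwarding alerts to Telegram; the payload fields (name, severity, annotations.summary) are assumptions about the webhook schema, so adapt them to the real payload:

import json
import os
import urllib.request
from http.server import BaseHTTPRequestHandler, HTTPServer

BOT_TOKEN = os.environ["TELEGRAM_BOT_TOKEN"]
CHAT_ID = os.environ["TELEGRAM_CHAT_ID"]

class AlertHandler(BaseHTTPRequestHandler):
    def do_POST(self):
        body = self.rfile.read(int(self.headers["Content-Length"]))
        alert = json.loads(body)  # assumed: JSON body with name/severity/annotations
        text = "[{}] {}: {}".format(
            alert.get("severity", "?"),
            alert.get("name", "alert"),
            alert.get("annotations", {}).get("summary", ""),
        )
        # Telegram Bot API: sendMessage with chat_id and text
        req = urllib.request.Request(
            "https://api.telegram.org/bot{}/sendMessage".format(BOT_TOKEN),
            data=json.dumps({"chat_id": CHAT_ID, "text": text}).encode(),
            headers={"Content-Type": "application/json"},
        )
        urllib.request.urlopen(req)
        self.send_response(204)
        self.end_headers()

HTTPServer(("", 9000), AlertHandler).serve_forever()

Point the generic webhook channel at this listener's address and every alert that fires lands in the Telegram chat.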
Warning: Start with a few critical alerts. Twenty noisy alerts become zero alerts — the team learns to ignore them. Five alerts that always indicate a real problem are useful.
Distributed tracing
Tracing is opt-in, enabled in the job spec:
job: api-pagamentos
tracing:
  enabled: true
  protocol: otlp
  sample_rate: 0.1  # 10% of requests
An application instrumented with OpenTelemetry sends spans to the embedded collector. The panel shows traces correlated with logs and metrics from the same allocation.
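A minimal Python setup consistent with the spec above, using the official OpenTelemetry SDK; the collector address (localhost:4317) is an assumption about where the embedded collector listens:

from opentelemetry import trace
from opentelemetry.sdk.trace import TracerProvider
from opentelemetry.sdk.trace.export import BatchSpanProcessor
from opentelemetry.sdk.trace.sampling import TraceIdRatioBased
from opentelemetry.exporter.otlp.proto.grpc.trace_exporter import OTLPSpanExporter

# Sample 10% of traces, mirroring sample_rate: 0.1 in the job spec
provider = TracerProvider(sampler=TraceIdRatioBased(0.1))
provider.add_span_processor(
    BatchSpanProcessor(OTLPSpanExporter(endpoint="localhost:4317", insecure=True))
)
trace.set_tracer_provider(provider)

tracer = trace.get_tracer("api-pagamentos")
with tracer.start_as_current_span("charge-card"):  # placeholder span name
    pass  # the work you want traced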
For advanced visualization (span timelines, trace comparison, tail-based analysis), export to Jaeger, Tempo, or a SaaS such as Honeycomb:
tracing:
  export:
    - type: otlp
      endpoint: tempo.empresa.com.br:4317
      tls: true
Cost comparison
For a typical cluster — 4 nodes, 30 jobs, 100 million requests/month — a commercial SaaS observability stack runs between R$ 1,000 and R$ 2,000 per month. An equivalent self-hosted stack (Prometheus + Loki + Grafana + Alertmanager + Tempo) has low direct cost, but requires half a day of operations per week.
| Item | Internal stack | Commercial SaaS | Self-hosted stack |
|---|---|---|---|
| Direct cost/month | R$ 0 | R$ 1,000–2,000 | R$ 100–300 (infra) |
| Setup time | 0 (already running) | 1 day | 1 to 2 weeks |
| Maintenance | Alongside cluster | Zero | A few hours/week |
| Limits | For teams up to ~50 jobs | Practically unlimited | Whatever your infra holds |
| Dashboard customization | Embedded panel | High | Total |
Practical recommendation: start with the internal stack. When operations grow beyond what it covers — usually past 50 jobs or log retention beyond 6 months — export to self-hosted Loki and Grafana. Commercial SaaS only pays off when team time is more expensive than the bill.
Next steps
- Configure alerts wired to Slack or PagerDuty before the first critical deploy.
- Review RBAC to limit who sees which logs (logs may contain sensitive data).