Metrics and logs
Collect metrics, logs, and traces without standing up an external observability stack, and know when it is worth integrating with an outside tool instead.
Observability usually requires a software stack running parallel to the cluster: a metric agent on each node, a central time-series database, a log aggregator, a dashboard, an alerter, and a tracer. Six components, each with its own configuration, its own update cycle, and its own bill.
HeroCtl solves this internally. Metrics, logs, alerts, and tracing are already built into the control plane. You only plug in an external tool when the team has a concrete reason for it.
Metrics
The default endpoint
Each server node exposes metrics in Prometheus format at /v1/metrics:
curl -H "X-Heroctl-Token: $TOKEN" https://manage.exemplo.com/v1/metrics
Typical output (truncated):
# HELP heroctl_node_cpu_usage_percent CPU in use per node
# TYPE heroctl_node_cpu_usage_percent gauge
heroctl_node_cpu_usage_percent{node="server-1"} 23.4
heroctl_node_cpu_usage_percent{node="server-2"} 18.1
# HELP heroctl_alloc_memory_bytes Memory used per allocation
# TYPE heroctl_alloc_memory_bytes gauge
heroctl_alloc_memory_bytes{job="api",alloc="abc123"} 285212672
What ships out of the box
| Metric family | Examples |
|---|---|
| Nodes | CPU, RAM, disk, network, load average, uptime |
| Allocations | CPU, RAM, restarts, status, age |
| Jobs | Healthy replicas, pending allocations, active deploys |
| Router | Requests/s, p50/p95/p99 latency, 5xx errors per host |
| Ingress TLS | Certificate validity, renewal failures |
| Internal API | Latency, throughput, error rate |
On a freshly installed cluster, this already feeds a complete dashboard with no extra configuration.
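Two sample queries against these families, using only the metric names shown in the sample output above (the remaining built-in names appear in the endpoint's HELP text):

# nodes running above 80% CPU
heroctl_node_cpu_usage_percent > 80

# total memory per job, summed across allocations
sum by (job) (heroctl_alloc_memory_bytes)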
Custom application metrics
Your application exposes /metrics on any port, in Prometheus format. Declare it in the spec:
job: api-pagamentos
metrics:
  enabled: true
  path: /metrics
  port: 9090
  interval: 15s
The cluster scrapes, aggregates, and serves them on the same /v1/metrics endpoint. The metrics arrive labeled with job, alloc, and node, so querying is direct:
rate(http_requests_total{job="api-pagamentos",status=~"5.."}[5m])
Official Prometheus clients exist for Go, Python, Java, Node.js, Ruby, .NET, Rust, and PHP; in any of them, instrumenting an application takes about fifteen minutes, as the sketch below shows.
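A minimal Python service matching the spec above, using the official prometheus_client package (the metric names here are placeholders):

import time
import random
from prometheus_client import Counter, Histogram, start_http_server

# Placeholder metrics; name yours after what they measure.
REQUESTS = Counter("http_requests_total", "HTTP requests served", ["status"])
LATENCY = Histogram("http_request_duration_seconds", "Request latency")

start_http_server(9090)  # serves /metrics on port 9090, as declared in the spec

while True:
    with LATENCY.time():  # records the elapsed time into the histogram
        time.sleep(random.uniform(0.01, 0.1))  # stand-in for real work
    REQUESTS.labels(status="200").inc()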
Embedded panel
The admin panel (port 8443) ships a ready-made charts section:
- Cluster view: aggregated CPU, RAM, network
- Job view: replicas, restarts, router latency
- Allocation view: log stream, individual metrics
- Host view: detail of each node, allocations on it
For most teams, this panel replaces an externally hosted Grafana. When you need heavily customized dashboards or correlation with external data sources, point a Grafana at the /v1/metrics endpoint as a regular Prometheus datasource, as sketched below.
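A sketch of that datasource in Grafana's standard YAML provisioning format; it assumes the endpoint answers Prometheus query-API calls, as implied above, and it passes the token through Grafana's custom-header fields:

apiVersion: 1
datasources:
  - name: heroctl
    type: prometheus
    access: proxy
    url: https://manage.exemplo.com/v1/metrics
    jsonData:
      httpHeaderName1: X-Heroctl-Token
    secureJsonData:
      httpHeaderValue1: $TOKEN  # expanded from the environment at provisioning time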
Logs
Collection model
Each allocation has stdout and stderr captured by the local agent, compressed, and sent to the cluster's central log writer. There is no separate log agent to install, configure, or update.
Real-time tail
# stream the whole job (all allocations)
heroctl logs -f --job api-pagamentos
# a specific allocation
heroctl logs -f --alloc abc123
# stderr only
heroctl logs -f --job api-pagamentos --stream stderr
Filtering
# between two timestamps
heroctl logs --job api-pagamentos \
  --since "2026-04-25 10:00" \
  --until "2026-04-25 11:00"
# text search
heroctl logs --job api-pagamentos --since 1h | grep "panic"
# structured output for processing with jq
heroctl logs --job api-pagamentos --since 1h --format json
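For instance, ranking allocations by error volume over the last hour; the .level and .alloc field names are assumptions about the JSON schema, so check them against real output first:

heroctl logs --job api-pagamentos --since 1h --format json \
  | jq -r 'select(.level == "error") | .alloc' \
  | sort | uniq -c | sort -rn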
Retention
Default: 30 days per active allocation, 7 days after the allocation terminates. Configurable in the cluster spec:
logs:
  retention:
    active_days: 30
    terminated_days: 7
  storage:
    type: local
    path: /var/lib/heroctl/logs
    max_size_gb: 100
For longer retention, export to external storage (next section).
External export
When you need retention measured in years, or correlation with logs from systems that don't run on the cluster, ready-made export outputs are available:
logs:
  export:
    - type: syslog
      destination: logs.empresa.com.br:514
      protocol: tcp
      tls: true
    - type: loki
      url: https://loki.empresa.com.br
      tenant: heroctl-prod
    - type: cloudwatch
      region: us-east-1
      log_group: /heroctl/prod
      credentials: ${secret.aws_logs}
    - type: elasticsearch
      url: https://elastic.empresa.com.br
      index: heroctl-%Y.%m.%d
      credentials: ${secret.es_creds}
Multiple destinations can run at the same time. The cluster keeps the local copy for the retention period and replicates to the configured destinations.
Alerts
An alert is an expression over metrics that fires a webhook when true for a configured duration:
alerts:
  - name: api-erro-alto
    expr: |
      rate(http_requests_total{job="api-pagamentos",status=~"5.."}[5m])
        / rate(http_requests_total{job="api-pagamentos"}[5m]) > 0.05
    for: 5m
    severity: critical
    annotations:
      summary: "Error rate above 5% on api-pagamentos"
      runbook: https://wiki.empresa.com.br/runbook/api-pagamentos
    notify:
      - type: slack
        webhook: ${secret.slack_oncall}
      - type: pagerduty
        routing_key: ${secret.pagerduty_critical}
  - name: certificado-expirando
    expr: heroctl_ingress_cert_expiry_days < 14
    for: 1h
    severity: warning
    notify:
      - type: discord
        webhook: ${secret.discord_ops}
Channels supported out of the box: Slack, Discord, PagerDuty, Opsgenie, and a generic webhook. For custom integrations (Telegram, email, SMS), the generic webhook covers the rest, as the sketch below shows.
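A minimal sketch of such a receiver in Python, forwarding alerts to Telegram; the payload fields (name, severity, annotations.summary) are assumptions about the webhook schema, so adapt them to the real payload:

import json
import os
import urllib.request
from http.server import BaseHTTPRequestHandler, HTTPServer

BOT_TOKEN = os.environ["TELEGRAM_BOT_TOKEN"]
CHAT_ID = os.environ["TELEGRAM_CHAT_ID"]

class AlertHandler(BaseHTTPRequestHandler):
    def do_POST(self):
        body = self.rfile.read(int(self.headers["Content-Length"]))
        alert = json.loads(body)  # assumed: JSON body with name/severity/annotations
        text = "[{}] {}: {}".format(
            alert.get("severity", "?"),
            alert.get("name", "alert"),
            alert.get("annotations", {}).get("summary", ""),
        )
        # Telegram Bot API: sendMessage with chat_id and text
        req = urllib.request.Request(
            "https://api.telegram.org/bot{}/sendMessage".format(BOT_TOKEN),
            data=json.dumps({"chat_id": CHAT_ID, "text": text}).encode(),
            headers={"Content-Type": "application/json"},
        )
        urllib.request.urlopen(req)
        self.send_response(204)
        self.end_headers()

HTTPServer(("", 9000), AlertHandler).serve_forever()

Point the generic webhook channel at this listener's address and every alert that fires lands in the Telegram chat.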
Warning: Start with a few critical alerts. Twenty noisy alerts become zero alerts — the team learns to ignore them. Five alerts that always indicate a real problem are useful.
Distributed tracing
Tracing is opt-in, enabled in the job spec:
job: api-pagamentos
tracing:
  enabled: true
  protocol: otlp
  sample_rate: 0.1  # 10% of requests
An application instrumented with OpenTelemetry sends spans to the embedded collector. The panel shows traces correlated with logs and metrics from the same allocation.
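A minimal Python setup consistent with the spec above, using the official OpenTelemetry SDK; the collector address (localhost:4317) is an assumption about where the embedded collector listens:

from opentelemetry import trace
from opentelemetry.sdk.trace import TracerProvider
from opentelemetry.sdk.trace.export import BatchSpanProcessor
from opentelemetry.sdk.trace.sampling import TraceIdRatioBased
from opentelemetry.exporter.otlp.proto.grpc.trace_exporter import OTLPSpanExporter

# Sample 10% of traces, mirroring sample_rate: 0.1 in the job spec
provider = TracerProvider(sampler=TraceIdRatioBased(0.1))
provider.add_span_processor(
    BatchSpanProcessor(OTLPSpanExporter(endpoint="localhost:4317", insecure=True))
)
trace.set_tracer_provider(provider)

tracer = trace.get_tracer("api-pagamentos")
with tracer.start_as_current_span("charge-card"):  # placeholder span name
    pass  # the work you want traced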
For advanced visualization (span timelines, trace comparison, tail-based analysis), export to Jaeger, Tempo, or a SaaS such as Honeycomb:
tracing:
  export:
    - type: otlp
      endpoint: tempo.empresa.com.br:4317
      tls: true
Cost comparison
For a typical cluster — 4 nodes, 30 jobs, 100 million requests/month — a commercial SaaS observability stack runs between R$ 1,000 and R$ 2,000 per month. An equivalent self-hosted stack (Prometheus + Loki + Grafana + Alertmanager + Tempo) has low direct cost, but requires half a day of operations per week.
| Item | Internal stack | Commercial SaaS | Self-hosted stack |
|---|---|---|---|
| Direct cost/month | R$ 0 | R$ 1,000–2,000 | R$ 100–300 (infra) |
| Setup time | 0 (already running) | 1 day | 1 to 2 weeks |
| Maintenance | Alongside cluster | Zero | A few hours/week |
| Limits | For teams up to ~50 jobs | Practically unlimited | Whatever your infra holds |
| Dashboard customization | Embedded panel | High | Total |
Practical recommendation: start with the internal stack. When operations grow beyond what it covers — usually past 50 jobs or log retention beyond 6 months — export to self-hosted Loki and Grafana. Commercial SaaS only pays off when team time is more expensive than the bill.
Next steps
- Configure alerts wired to Slack or PagerDuty before the first critical deploy.
- Review RBAC to limit who sees which logs (logs may contain sensitive data).