Complete monitoring stack in 2026: Prometheus + Grafana + Loki step by step
Honest tutorial to spin up metrics, logs and dashboards for your cluster — in 4 hours, without Datadog. An open-source stack that fits on a single VPS at R$80/month.
The first time your site crashes at three in the morning, you'll discover something uncomfortable: there's no way to know what happened. There's no CPU graph, no log of the container that died, no alert that warned you beforehand. You'll open a terminal, connect to the servers one by one, run top, df, journalctl, and try to reconstruct a crime scene that has already gone cold.
This post is the shortcut so you don't go through that. In four hours, with R$80 to R$120 per month of hardware, you can assemble the open-source observability stack that replaces Datadog, New Relic and CloudWatch in 95% of cases for a startup. The tools are the same that run inside companies with tens of thousands of servers — and they fit comfortably on a small VPS for a team starting out.
TL;DR
The standard open-source monitoring stack in 2026 — Prometheus + Grafana + Loki + Alertmanager — fits on a single 4 GB RAM VPS and covers metrics, centralized logs, dashboards and alerts. This tutorial shows step-by-step setup for a 4-to-5-server cluster in approximately four hours, using docker-compose or orchestrator job specs.
For a Brazilian startup, that means R$80 to R$120 per month of hardware vs R$1,000 to R$2,000 per month of equivalent observability SaaS. The time cost is honest: four hours of initial setup plus two to four hours per month of ongoing maintenance.
Deliverable result at the end of the tutorial: dashboards for CPU, RAM, disk, network and HTTP metrics; searchable logs with 30-day retention; alerts routed to Slack, Discord or email. Prerequisites: 1 Linux VPS with 4 GB of RAM and 50 GB SSD, Docker installed, and a domain with DNS controlled by you.
The choice between running this stack on a dedicated VPS outside the production cluster or as a job inside the orchestrator itself is an architectural decision — we cover both options in step 8 and in "How to run this inside HeroCtl".
What each component does, in one sentence
Before installing anything, it's worth understanding the role of each piece. The stack has six components; the confusion usually comes from thinking one of them is "the monitoring system". It isn't. Each one does one thing.
- Prometheus is a time-series database (TSDB) that collects metrics via HTTP scrape — it pulls the numbers, nobody pushes them. Retains 15 days by default.
- Grafana is the visualization layer. Connects to Prometheus, to Loki, to Postgres, to almost any structured source, and draws graphs.
- Loki is the log piece. Its query syntax is similar to Prometheus's, it indexes only labels (not log content), and because of that it ends up roughly ten times cheaper to run than ELK.
- Promtail (or Grafana Agent, which is replacing Promtail in 2026) is the collector that reads log files from each server and sends them to Loki.
- node_exporter runs on each monitored node and exposes an HTTP endpoint with CPU, RAM, disk and network in Prometheus format.
- Alertmanager receives the alerts fired by Prometheus's rules and handles routing: Slack, email, PagerDuty, arbitrary webhook.
Whoever sets up their first stack usually confuses Prometheus with "monitoring" and Grafana with "pretty dashboards". The real separation is: Prometheus stores numbers, Loki stores text, Grafana shows both, Alertmanager screams when some number is wrong.
What's the recommended architecture?
For a cluster of 3 to 5 servers running production applications, the topology that has worked in practice is to separate the observability server from the rest. A dedicated node, outside the cluster it monitors, with two objectives: not dying together when the cluster dies, and not competing for CPU/RAM with the real application.
- 1 dedicated "observability" server, 4 GB of RAM, 50 GB SSD. Runs Prometheus, Grafana, Loki, Alertmanager.
- Each monitored server runs only two lightweight processes: node_exporter (system metrics) and Promtail (log shipping).
- Your applications expose a /metrics endpoint in Prometheus format. If you use a popular framework, there's a ready client. If not, it's a library of a few dozen lines.
- Grafana is accessible via subdomain (monitor.yourdomain.com) with automatic TLS and basic authentication in front.
This separation has a cost: you pay for one more VPS. In exchange, when the main cluster falls, you can still look at the graphs to understand what happened. For a startup, this trade-off pays off almost always — the worst monitoring scenario is discovering that the only thing that stopped along with the site was the system that would warn you that the site stopped.
Step 1 — How to provision the observability VPS?
Estimated time: 10 minutes.
Any cheap provider works. The two with the best cost-benefit for the Brazilian case today are Hetzner (CPX21 at 7.99 EUR per month with 3 vCPUs and 4 GB of RAM, datacenter in Germany) and DigitalOcean (Basic Droplet at US$24 per month with the same configuration, datacenters closer to Brazil). For a monitoring workload, scrape latency from a European datacenter isn't a problem: Prometheus pulls every 15 seconds by default, so 200 ms of RTT between Hetzner and your servers doesn't hurt anything.
Provisioning:
- Create the VPS with Ubuntu 24.04 LTS or Debian 12.
- Add your public SSH key on creation. Disable password login.
- Install Docker and the compose plugin: curl -fsSL https://get.docker.com | sh && apt install docker-compose-plugin.
- Configure the firewall: port 22 (SSH) open, port 443 (HTTPS) open, all others closed. Internal ports (3000, 9090, 3100, 9093) stay reachable only via localhost on the VPS itself; the reverse proxy exposes Grafana via 443. A ufw sketch follows this list.
- Point DNS: create an A record monitor.yourdomain.com to the VPS IP.
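A minimal ufw sequence for that policy on the observability VPS (assumes ufw, which ships with Ubuntu; adjust if you use nftables or the provider's firewall):

ufw default deny incoming
ufw default allow outgoing
ufw allow 22/tcp    # SSH
ufw allow 443/tcp   # HTTPS (Grafana behind the reverse proxy)
ufw enable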
Validation: docker --version returns 26.x or higher; dig monitor.yourdomain.com returns the correct IP; ssh root@monitor.yourdomain.com connects without asking for password.
Step 2 — How to bring up the stack via docker-compose?
Estimated time: 45 minutes.
Create the working directory at /opt/observability/ with the following structure:
/opt/observability/
├── docker-compose.yml
├── prometheus/
│ ├── prometheus.yml
│ └── alerts.yml
├── alertmanager/
│ └── alertmanager.yml
├── loki/
│ └── loki-config.yml
└── grafana/
└── provisioning/
└── datasources/
└── datasources.yml
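One way to create that tree in one go (bash brace expansion):

mkdir -p /opt/observability/{prometheus,alertmanager,loki,grafana/provisioning/datasources}
cd /opt/observability
touch docker-compose.yml prometheus/{prometheus,alerts}.yml \
      alertmanager/alertmanager.yml loki/loki-config.yml \
      grafana/provisioning/datasources/datasources.yml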
The abbreviated but functional docker-compose.yml:
services:
prometheus:
image: prom/prometheus:v2.55.0
volumes:
- ./prometheus:/etc/prometheus
- prometheus-data:/prometheus
command:
- '--config.file=/etc/prometheus/prometheus.yml'
- '--storage.tsdb.retention.time=30d'
      - '--web.enable-lifecycle' # allows reload via HTTP POST
ports:
- '127.0.0.1:9090:9090'
restart: unless-stopped
grafana:
image: grafana/grafana:11.3.0
volumes:
- grafana-data:/var/lib/grafana
- ./grafana/provisioning:/etc/grafana/provisioning
environment:
- GF_SECURITY_ADMIN_PASSWORD=${GRAFANA_PASSWORD}
- GF_USERS_ALLOW_SIGN_UP=false
ports:
- '127.0.0.1:3000:3000'
restart: unless-stopped
loki:
image: grafana/loki:3.2.0
volumes:
- ./loki/loki-config.yml:/etc/loki/config.yml
- loki-data:/loki
command: -config.file=/etc/loki/config.yml
ports:
- '127.0.0.1:3100:3100'
restart: unless-stopped
alertmanager:
image: prom/alertmanager:v0.27.0
volumes:
- ./alertmanager:/etc/alertmanager
ports:
- '127.0.0.1:9093:9093'
restart: unless-stopped
volumes:
prometheus-data:
grafana-data:
loki-data:
Three important points in this file. First, all ports are bound to 127.0.0.1 — none of the services is directly accessible from the internet. Second, volumes are named (not bind mounts), so they survive docker-compose down. Third, the Grafana password comes from environment variable: create a .env next to the compose with GRAFANA_PASSWORD=something_long_random and never commit that.
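A quick way to generate that .env without inventing a password by hand (openssl is present on stock Ubuntu/Debian):

umask 077   # the .env is created readable only by the owner
echo "GRAFANA_PASSWORD=$(openssl rand -base64 32)" > /opt/observability/.env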
Bring up the stack:
cd /opt/observability
docker compose up -d
docker compose ps # all should be "Up" / healthy
Quick validation: curl localhost:9090/-/ready returns Prometheus Server is Ready; curl localhost:3100/ready returns ready; curl localhost:3000/api/health returns JSON with "database": "ok".
Step 3 — How to configure Prometheus scrapes?
Estimated time: 30 minutes.
The prometheus/prometheus.yml is where you tell Prometheus which endpoints to scrape. For a 4-server cluster, it looks like this:
global:
scrape_interval: 15s
evaluation_interval: 15s
alerting:
alertmanagers:
- static_configs:
- targets: ['alertmanager:9093']
rule_files:
- 'alerts.yml'
scrape_configs:
- job_name: 'prometheus'
static_configs:
- targets: ['localhost:9090']
- job_name: 'node'
static_configs:
- targets:
- 'server-1.yourdomain.internal:9100'
- 'server-2.yourdomain.internal:9100'
- 'server-3.yourdomain.internal:9100'
- 'worker-1.yourdomain.internal:9100'
labels:
environment: 'production'
- job_name: 'apps'
static_configs:
- targets:
- 'api.yourdomain.internal:8080'
- 'worker.yourdomain.internal:8080'
labels:
environment: 'production'
metrics_path: '/metrics'
For larger clusters or ones whose composition changes frequently, swap static_configs for file_sd_configs pointing to a JSON file you generate automatically. For 4 static servers, the file above is enough.
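For reference, a sketch of that variant, assuming you keep the JSON under ./prometheus/targets/ on the host so Prometheus sees it at /etc/prometheus/targets/ inside the container:

# in prometheus.yml, replacing the static 'node' job
  - job_name: 'node'
    file_sd_configs:
      - files:
          - '/etc/prometheus/targets/nodes.json'
        refresh_interval: 1m

# prometheus/targets/nodes.json, regenerated by whatever provisions your servers
[
  {
    "targets": ["server-1.yourdomain.internal:9100", "server-2.yourdomain.internal:9100"],
    "labels": { "environment": "production" }
  }
]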
Reload: curl -X POST localhost:9090/-/reload. Check at localhost:9090/targets that all jobs are UP. The ones that are DOWN haven't been instrumented yet — that's step 4.
Step 4 — How to install node_exporter on each server?
Estimated time: 15 minutes for 4 servers.
On each monitored server, run node_exporter. There are two ways: direct binary via systemd, or Docker container. In 2026 the consensus is container — easier to update and isolate. On each node:
docker run -d \
--name node-exporter \
--restart unless-stopped \
--net="host" \
--pid="host" \
-v "/:/host:ro,rslave" \
prom/node-exporter:v1.8.2 \
--path.rootfs=/host
The --net=host is necessary for it to see real network interfaces. The bind mount on /host allows reading /proc, /sys and /etc/passwd from the host (read-only) without running the container with root privileges.
Firewall: open port 9100 only to the observability server IP. On Ubuntu with ufw:
ufw allow from <OBSERVABILITY_IP> to any port 9100
Validation: from the observability server, curl http://server-1.yourdomain.internal:9100/metrics should return hundreds of lines starting with # HELP node_cpu_seconds_total....
Step 5 — How to configure Loki + Promtail?
Estimated time: 30 minutes.
Loki is already running in the compose from step 2. What's still missing is the loki-config.yml:
auth_enabled: false
server:
http_listen_port: 3100
common:
path_prefix: /loki
storage:
filesystem:
chunks_directory: /loki/chunks
rules_directory: /loki/rules
replication_factor: 1
ring:
kvstore:
store: inmemory
schema_config:
configs:
- from: 2024-01-01
store: tsdb
object_store: filesystem
schema: v13
index:
prefix: index_
period: 24h
compactor:
  working_directory: /loki/compactor
  retention_enabled: true # without this, the retention_period below is not actually enforced
  delete_request_store: filesystem
limits_config:
  retention_period: 720h # 30 days
  reject_old_samples: true
  reject_old_samples_max_age: 168h
Filesystem storage is enough to start. When you exceed 50 GB of logs per day or want 90+ days of retention, migrate to S3 (or compatible). Don't migrate before that: it complicates operation without real gain.
On each monitored server, install Promtail (or Grafana Agent) also via container:
# /opt/promtail/promtail-config.yml on each server
server:
  http_listen_port: 9080
positions:
  filename: /tmp/positions.yaml # read offsets; bind-mount this path so restarts don't re-read old logs
clients:
- url: http://monitor.yourdomain.com:3100/loki/api/v1/push
scrape_configs:
- job_name: system
static_configs:
- targets: [localhost]
labels:
job: varlogs
host: ${HOSTNAME}
__path__: /var/log/*.log
- job_name: docker
docker_sd_configs:
- host: unix:///var/run/docker.sock
relabel_configs:
- source_labels: ['__meta_docker_container_name']
target_label: 'container'
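To actually run Promtail with this config, a container invocation along these lines works; a sketch, assuming the config sits at /opt/promtail/promtail-config.yml and includes the positions block shown above:

mkdir -p /opt/promtail/positions
docker run -d \
  --name promtail \
  --restart unless-stopped \
  -e HOSTNAME="$(hostname)" \
  -v /opt/promtail/promtail-config.yml:/etc/promtail/config.yml:ro \
  -v /opt/promtail/positions:/tmp \
  -v /var/log:/var/log:ro \
  -v /var/run/docker.sock:/var/run/docker.sock:ro \
  grafana/promtail:3.2.0 \
  -config.file=/etc/promtail/config.yml \
  -config.expand-env=true   # expands ${HOSTNAME} inside the config

The -config.expand-env flag is what lets the host label pick up the real hostname passed in via the environment.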
Important: the endpoint http://monitor.yourdomain.com:3100/loki/api/v1/push needs to be accessible from the servers. If you followed step 2 and bound Loki to 127.0.0.1, you have two options: expose 3100 via reverse proxy with basic authentication, or open an SSH/WireGuard tunnel between servers. The second option is more secure and what we recommend.
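A minimal version of the SSH tunnel, assuming a non-root user deploy exists on the observability VPS:

# on each monitored server: local port 3100 now reaches Loki on the observability VPS
ssh -f -N -L 3100:127.0.0.1:3100 deploy@monitor.yourdomain.com

With the tunnel in place, the clients url in the Promtail config above becomes http://localhost:3100/loki/api/v1/push. Keep the tunnel alive with autossh or a systemd unit; a bare ssh -f won't survive reboots.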
Validation: in Grafana, go to Explore, select the Loki data source, run {job="varlogs"} and see logs appearing in real time.
Step 6 — How to import Grafana dashboards?
Estimated time: 20 minutes.
Access https://monitor.yourdomain.com (after configuring the reverse proxy from step 8; you can skip ahead now if you want). Log in as admin with the password from the .env.
Add the two data sources via automatic provisioning. In grafana/provisioning/datasources/datasources.yml:
apiVersion: 1
datasources:
- name: Prometheus
type: prometheus
access: proxy
url: http://prometheus:9090
isDefault: true
- name: Loki
type: loki
access: proxy
url: http://loki:3100
Restart Grafana with docker compose restart grafana and the sources appear automatically.
Import ready dashboards. In Dashboards → New → Import, paste the dashboard ID:
- 1860 — Node Exporter Full. CPU, RAM, disk, network, filesystem. It's the most used dashboard in the Prometheus community, for good reason.
- 13639 — Logs / App. Basic visualization of Loki logs with filters by job, container, host.
- 15172 — Cluster overview. Consolidated view per server, useful for small cluster.
Customize each one to use environment="production" in the default filter. After two weeks of use, you'll want to create your own dashboards for specific workloads; there's no shortcut there, it's time in the chair.
Step 7 — How to configure basic alerts?
Estimated time: 45 minutes.
Alerts are where 80% of teams stumble: either they configure too few and discover incidents through customers, or they configure dozens and desensitize the team.
Start with six essential alerts. In prometheus/alerts.yml:
groups:
- name: essentials
interval: 30s
rules:
- alert: ServerDown
expr: up{job="node"} == 0
for: 2m
labels:
severity: critical
annotations:
summary: "Servidor {{ $labels.instance }} está fora do ar"
- alert: HighCPU
expr: 100 - (avg by(instance) (rate(node_cpu_seconds_total{mode="idle"}[5m])) * 100) > 80
for: 10m
labels:
severity: warning
- alert: DiskAlmostFull
expr: (node_filesystem_avail_bytes{mountpoint="/"} / node_filesystem_size_bytes{mountpoint="/"}) * 100 < 15
for: 5m
labels:
severity: critical
- alert: HighMemory
expr: (1 - (node_memory_MemAvailable_bytes / node_memory_MemTotal_bytes)) * 100 > 90
for: 10m
labels:
severity: warning
- alert: HighHTTPErrorRate
expr: sum(rate(http_requests_total{status=~"5.."}[5m])) / sum(rate(http_requests_total[5m])) > 0.05
for: 5m
labels:
severity: critical
- alert: HighLatency
expr: histogram_quantile(0.99, sum(rate(http_request_duration_seconds_bucket[5m])) by (le)) > 2
for: 10m
labels:
severity: warning
And the alertmanager/alertmanager.yml pointing to a Slack or Discord webhook:
route:
group_by: ['alertname', 'severity']
group_wait: 30s
group_interval: 5m
repeat_interval: 4h
receiver: 'slack-default'
routes:
- match:
severity: critical
receiver: 'slack-critical'
repeat_interval: 1h
receivers:
- name: 'slack-default'
slack_configs:
- api_url: 'https://hooks.slack.com/services/YOUR/WEBHOOK/HERE'
channel: '#alerts'
send_resolved: true
- name: 'slack-critical'
slack_configs:
- api_url: 'https://hooks.slack.com/services/YOUR/WEBHOOK/HERE'
channel: '#alerts-critical'
send_resolved: true
Two details that save sleep. The for: 10m on CPU prevents short spikes from becoming alerts: the server can sit at 95% for 30 seconds and that's normal. The repeat_interval: 4h on warnings ensures that an alert which stays open for a while doesn't turn into 60 messages; Alertmanager groups notifications and only repeats them every 4 hours.
Reload Prometheus (curl -X POST localhost:9090/-/reload) and test by forcing an alert: stress --cpu 4 --timeout 700s (apt install stress) on one of the servers should trigger HighCPU within about 10 minutes.
Step 8 — How to put reverse proxy and TLS in front?
Estimated time: 20 minutes.
To access Grafana via https://monitor.yourdomain.com with valid certificate, you need something in front of port 3000. Two options:
- Orchestrator's integrated router: if you already have the HeroCtl cluster running, just declare Grafana as a job with ingress: { host: monitor.yourdomain.com, tls: true }. Automatic Let's Encrypt certificate, no additional tool needed.
- Caddy standalone on the observability VPS itself: also issues Let's Encrypt certificates automatically. Minimum Caddyfile:

monitor.yourdomain.com {
    reverse_proxy localhost:3000
    basicauth /login {
        admin <bcrypt_hash>
    }
}
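To generate the <bcrypt_hash> placeholder, Caddy ships a helper:

caddy hash-password   # prompts for the password and prints the bcrypt hash to paste into the Caddyfile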
For defense in depth, keep the Caddy/router basic authentication in front of the Grafana login: two barriers, not one. The second barrier matters because the default Grafana login is admin/admin, and the first thing bots do against an exposed Grafana is try that combination.
Step 9 — How to instrument application metrics?
Estimated time: varies according to number of applications.
System metrics are half the story. The other half is what your application is doing — how many requests per second, what the p99 latency is, how many errors, what the background job queue size is.
Each popular language has an official Prometheus client:
- Node.js: prom-client
- Python: prometheus-client
- Ruby: prometheus-client
- Go: github.com/prometheus/client_golang
The minimum standard is three metrics per HTTP endpoint:
- http_requests_total: counter, with labels method, path, status.
- http_request_duration_seconds: histogram, same label set.
- app_errors_total: counter, with label kind ("validation", "db", "external_api", etc.).
Expose all of that at /metrics, add the endpoint to Prometheus's scrape_configs, and within hours you have dashboards per endpoint, alerts on error rate, and the ability to answer "what was happening at 3:14 yesterday" with a graph instead of a guess.
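For illustration, a minimal sketch with the Python client listed above (the metric names match the alerts from step 7; the instrumented() wrapper and port 8080 are placeholders for wherever your framework's middleware actually hooks in):

# pip install prometheus-client
import time

from prometheus_client import Counter, Histogram, start_http_server

HTTP_REQUESTS = Counter(
    "http_requests_total", "HTTP requests served",
    ["method", "path", "status"],
)
HTTP_LATENCY = Histogram(
    "http_request_duration_seconds", "HTTP request latency in seconds",
    ["method", "path", "status"],
)
APP_ERRORS = Counter("app_errors_total", "Application errors", ["kind"])


def instrumented(method: str, path: str, handler):
    """Wrap a request handler; real middleware would read the actual status code."""
    start = time.monotonic()
    status = "200"
    try:
        return handler()
    except Exception:
        status = "500"
        APP_ERRORS.labels(kind="unhandled").inc()
        raise
    finally:
        HTTP_REQUESTS.labels(method=method, path=path, status=status).inc()
        HTTP_LATENCY.labels(method=method, path=path, status=status).observe(
            time.monotonic() - start
        )


if __name__ == "__main__":
    # exposes the metrics on :8080, which the 'apps' job from step 3 scrapes
    start_http_server(8080)
    while True:
        instrumented("GET", "/health", lambda: "ok")
        time.sleep(5)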
Watch out for cardinality. Each unique combination of labels becomes a separate time series. If you put user_id as a label, with 100k users you create 100k series, and Prometheus will consume 8+ GB of RAM just to index that. Practical rule: labels should take values from small sets (status code: 5 values; method: 5 values; path: dozens). Unique identifiers go in logs, not in metrics.
How to run this inside HeroCtl instead of dedicated VPS?
For clusters already running the orchestrator, it makes sense to consider running the stack as just another job. The trade-off: you save a VPS, but lose isolation (if the cluster dies, monitoring dies along with it).
The topology looks like this:
- 1 single job spec with 4 tasks: prometheus, grafana, loki, alertmanager.
- Replicated volumes in the cluster — data survives node failure.
- Integrated router does automatic TLS via subdomain. No need for additional Caddy.
- Cluster's own metrics are already exposed in Prometheus format on the administrative API, so the scrape is direct.
For critical production, we recommend physical separation (a dedicated VPS outside the cluster). For a personal project, an MVP, or a small team where "everything goes down together" is acceptable, running inside is cheaper and operationally simpler. The entire job spec sits at around 80 lines of manifest.
How much does this stack cost per month in Brazil?
| Item | Monthly cost (BRL) |
|---|---|
| Dedicated observability VPS (4 GB RAM) | R$40 to R$80 |
| Object storage for long log retention (optional) | R$30 |
| Maintenance time (2 to 4h × hour value) | R$200 to R$400 |
| Total operational | R$300 to R$500 |
For comparison, a Datadog or New Relic subscription with equivalent coverage (5 hosts, 30-day log retention, alerts, dashboards) goes for around R$1,500 to R$2,000 per month — without counting the automatic overage that shows up at month-end when someone leaves a verbose log turned on.
The difference isn't small: in a year, the open-source self-hosted stack saves between R$12,000 and R$18,000. For an early-stage startup, that's half a junior engineer.
Table of ports, resources and characteristics per component
| Component | Port | Minimum RAM | Disk | Default retention | Data format |
|---|---|---|---|---|---|
| Prometheus | 9090 | 512 MB | 10 GB | 15 days | binary TSDB |
| Grafana | 3000 | 256 MB | 1 GB | N/A | SQLite or Postgres |
| Loki | 3100 | 512 MB | 30 GB | 30 days (configurable) | compressed chunks |
| Promtail / Agent | 9080 | 128 MB | minimum | N/A | pass-through (no storage) |
| Alertmanager | 9093 | 128 MB | 1 GB | N/A | notification log |
| node_exporter | 9100 | 64 MB | minimum | N/A | scrape endpoint |
These are the viable minimums for a small cluster. In production with 30 servers and real traffic, multiply RAM by 3 and disk by 5.
The four errors that kill a new monitoring stack
Teams setting up observability for the first time stumble almost always on the same four errors. Knowing about them beforehand saves months.
Not monitoring the monitoring. Prometheus stopped scraping on Thursday; nobody noticed. On Wednesday of the following week a server actually went down and the team discovered there was no alert because Prometheus had been dead for 6 days. Solution: configure a simple external cron (even a free Pingdom tier will do) that hits https://monitor.yourdomain.com/api/health every 5 minutes and warns you when Grafana itself goes down.
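A crude but effective version of that external check, assuming it runs on a machine outside the observability VPS and that you already have a Slack webhook (the URL is a placeholder):

#!/usr/bin/env bash
# /usr/local/bin/check-grafana.sh, called from cron on a machine OUTSIDE the observability VPS
# e.g. in /etc/cron.d/grafana-check:  */5 * * * * root /usr/local/bin/check-grafana.sh
curl -fsS -o /dev/null --max-time 10 https://monitor.yourdomain.com/api/health && exit 0
# health check failed: post to the Slack webhook
curl -s -X POST -H 'Content-Type: application/json' \
  -d '{"text":"monitor.yourdomain.com failed its health check"}' \
  https://hooks.slack.com/services/YOUR/WEBHOOK/HERE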
No retention strategy. Disk fills up in three months, Prometheus stops recording, someone deletes everything in despair, loses 90 days of history. Configure --storage.tsdb.retention.time=30d from day one and establish a housekeeping job.
High cardinality in labels. We already covered this in step 9, but it's worth repeating: each user_id, request_id or UUID that becomes a label multiplies Prometheus RAM consumption explosively. Unique identifiers go to Loki, not to Prometheus.
Noisy alerts. The team receives 200 alerts per day. In two weeks, nobody looks anymore. When the site actually crashes, the alert will be in the middle of 199 others. Solution: start with six alerts (those from step 7), audit every two weeks, and exclude everything that fired but didn't require human action. Alert without action is noise.
FAQ
Can I run everything on a 2 GB VPS? Technically yes, for a cluster of up to 3 servers and few applications. In practice you'll hit the RAM ceiling in 2 to 3 months, especially if you import dense Grafana dashboards. Pay 50 reais more and go straight to a 4 GB VPS — the time you save not fighting OOM kills pays for itself.
How much disk for 30 days of logs? Depends entirely on your application's log volume. Rough rule for small startup: cluster of 4 servers with normal web applications generates 1 to 5 GB of logs per day after Loki compression. Thirty days gives between 30 and 150 GB. Start with 50 GB SSD, monitor growth for two weeks, expand if necessary. If you go much beyond that, it's time to go to object storage.
Grafana Cloud vs self-hosted, which to choose? Grafana Cloud free tier is generous (10k series, 50 GB of logs, 14-day retention) and eliminates the work of maintaining the server. For solo project or very small team, makes sense. From the moment you exceed the free tier, prices scale fast — from US$50/month — and you lose control over the data. Self-hosted costs hardware + time, Cloud costs money + lock-in. For a company that intends to grow and has a DevOps dev on the team, self-hosted wins.
Promtail or Grafana Agent? In 2026, Grafana Agent (renamed to Grafana Alloy) is officially replacing Promtail. For new setup, go straight to Alloy. For setup that has been running Promtail for a long time, no urgency to migrate — Promtail will continue working for years.
Where does OpenTelemetry fit in this stack?
OTel is the application instrumentation standard that's consolidating. Instead of using prom-client directly, you use OTel's SDK and it exports to Prometheus, Loki and Tempo simultaneously. The big advantage is portability — if you want to swap Prometheus for something else 3 years from now, your application doesn't change a line. For a startup starting today, we recommend OTel from day one.
How do I backup Prometheus?
Prometheus has snapshots via API: curl -X POST localhost:9090/api/v1/admin/tsdb/snapshot creates a snapshot inside the data directory. Note that this endpoint only answers if Prometheus is started with --web.enable-admin-api, which is not in the compose from step 2; add it to the command list if you want snapshots. Do that once a day via cron, tar.gz the result and send it to object storage. In case of disaster, what you lose is metrics, and metrics, unlike logs, are typically recoverable in hours (start collecting again and the dashboards come back). Lost logs are lost forever, so invest more in Loki backup.
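A sketch of that daily routine, assuming --web.enable-admin-api was added as noted above, jq is installed, and /backup plus the final upload step are placeholders for your own tooling:

#!/usr/bin/env bash
# daily Prometheus snapshot, copied out of the container and packed for upload
set -euo pipefail
SNAP=$(curl -s -X POST localhost:9090/api/v1/admin/tsdb/snapshot | jq -r '.data.name')
CID=$(docker compose -f /opt/observability/docker-compose.yml ps -q prometheus)
docker cp "$CID:/prometheus/snapshots/$SNAP" "/backup/$SNAP"
tar czf "/backup/prometheus-$SNAP.tar.gz" -C /backup "$SNAP" && rm -rf "/backup/$SNAP"
# upload /backup/prometheus-$SNAP.tar.gz to your object storage of choice here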
Is Tempo (distributed traces) worth installing now? No. Traces become useful from the moment you have 5+ services talking to each other and debugging latency involves following a request through several hops. For a monolithic architecture or few services, traces are disproportionate work for the value they deliver. Add them when complexity calls for it.
Does Loki index full-text like ELK? No, and that's a feature, not a bug. Loki indexes only labels (job, host, container, severity) and log content stays compressed without an index. To search text, you filter by labels first and then grep over the resulting chunks. That's what makes Loki ten times cheaper than ELK in RAM and CPU. In exchange, free-text queries across all of history are slower. For 90% of debugging cases, filtering by job + host + time window already narrows things down to dozens of MB, where grep flies.
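In practice that looks like a LogQL query that narrows by labels first and only then filters text, for example, using the labels from the Promtail config in step 5:

{job="varlogs", host="server-1"} |= "timeout"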
Next steps
Brought up the stack, have dashboard, have alert, have searchable log? Good. The next three things worth investing in are, in order:
- Custom dashboards per application — business metrics (subscriptions created/hour, jobs processed, email queue) instead of just infrastructure.
- Runbooks linked in alerts — every rule in alerts.yml should have annotations.runbook_url pointing to a page explaining what to do (see the snippet after this list). When the alert fires at 3 AM, a sleepy brain doesn't think.
- Monthly alert review — 30 minutes once a month auditing what fired in the previous month, deleting what became noise, adjusting thresholds.
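The runbook link is one annotation per rule; a sketch on the DiskAlmostFull alert from step 7, with a hypothetical wiki URL:

# in prometheus/alerts.yml, under the existing rules
      - alert: DiskAlmostFull
        expr: (node_filesystem_avail_bytes{mountpoint="/"} / node_filesystem_size_bytes{mountpoint="/"}) * 100 < 15
        for: 5m
        labels:
          severity: critical
        annotations:
          summary: "Disk on {{ $labels.instance }} below 15% free"
          runbook_url: "https://wiki.yourdomain.com/runbooks/disk-almost-full"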
For those who want to go further and understand why we chose this stack instead of managed SaaS, read Observability without Datadog: the Brazilian startup stack. And to close the operations cycle — because there's no point knowing the database went down if you can't restore it — it's worth reading Database backup in cluster: strategies for 3 AM.
If you want to skip this entire setup and run the stack as a job inside an orchestrator that already takes care of TLS, rolling update deploy and volume replication:
curl -sSL get.heroctl.com/install.sh | sh
Four hours become forty minutes. The rest is the same work of deciding which alerts matter, and on that part no one can do it for you.