[{"data":1,"prerenderedAt":3143},["ShallowReactive",2],{"blog-en-\u002Fen\u002Fblog\u002Fmonitoring-stack-prometheus-grafana-loki":3,"blog-en-surround-\u002Fen\u002Fblog\u002Fmonitoring-stack-prometheus-grafana-loki":3129},{"id":4,"title":5,"author":6,"body":7,"category":3110,"cover":3111,"date":3112,"description":3113,"draft":3114,"extension":3115,"lastReviewed":3111,"meta":3116,"navigation":391,"path":3117,"readingTime":3118,"seo":3119,"sitemap":3120,"stem":3121,"tags":3122,"__hash__":3128},"blog_en\u002Fen\u002Fblog\u002Fmonitoring-stack-prometheus-grafana-loki.md","Complete monitoring stack in 2026: Prometheus + Grafana + Loki step by step","HeroCtl team",{"type":8,"value":9,"toc":3090},"minimark",[10,26,29,34,42,53,56,59,63,66,106,122,126,129,162,165,169,175,178,181,209,224,228,233,240,250,257,661,680,683,726,750,754,759,766,1046,1056,1075,1079,1085,1088,1168,1189,1195,1234,1244,1248,1252,1258,1513,1516,1519,1707,1717,1724,1728,1733,1742,1748,1857,1864,1871,1891,1898,1902,1906,1909,1919,2257,2264,2482,2493,2507,2511,2515,2521,2545,2552,2556,2561,2564,2567,2600,2603,2636,2645,2656,2660,2663,2666,2692,2695,2699,2755,2758,2761,2765,2899,2902,2906,2909,2919,2929,2941,2947,2951,2957,2963,2969,2975,2984,2998,3004,3010,3014,3017,3045,3058,3061,3083,3086],[11,12,13,14,18,19,18,22,25],"p",{},"The first time your site crashes at three in the morning, you'll discover something uncomfortable: there's no way to know what happened. There's no CPU graph, there's no log of the container that died, there's no alert that warned beforehand. You'll open a terminal, connect to the servers one by one, run ",[15,16,17],"code",{},"top",", ",[15,20,21],{},"df",[15,23,24],{},"journalctl",", and try to reconstitute a crime scene that has already cooled down.",[11,27,28],{},"This post is the shortcut so you don't go through that. In four hours, with R$80 to R$120 per month of hardware, you can assemble the open-source observability stack that replaces Datadog, New Relic and CloudWatch in 95% of cases for a startup. The tools are the same that run inside companies with tens of thousands of servers — and they fit comfortably on a small VPS for a team starting out.",[30,31,33],"h2",{"id":32},"tldr","TL;DR",[11,35,36,37,41],{},"The standard open-source monitoring stack in 2026 — ",[38,39,40],"strong",{},"Prometheus + Grafana + Loki + Alertmanager"," — fits on a single 4 GB RAM VPS and covers metrics, centralized logs, dashboards and alerts. This tutorial shows step-by-step setup for a 4-to-5-server cluster in approximately four hours, using docker-compose or orchestrator job specs.",[11,43,44,45,48,49,52],{},"For a Brazilian startup, that means ",[38,46,47],{},"R$80 to R$120 per month of hardware"," vs ",[38,50,51],{},"R$1,000 to R$2,000 per month"," of equivalent observability SaaS. The time cost is honest: four hours of initial setup plus two to four hours per month of ongoing maintenance.",[11,54,55],{},"Deliverable result at the end of the tutorial: dashboards for CPU, RAM, disk, network and HTTP metrics; searchable logs with 30-day retention; alerts routed to Slack, Discord or email. Prerequisites: 1 Linux VPS with 4 GB of RAM and 50 GB SSD, Docker installed, and a domain with DNS controlled by you.",[11,57,58],{},"The choice between running this stack on a dedicated VPS outside the production cluster or as a job inside the orchestrator itself is an architectural decision — we cover both options in step 8 and in \"How to run this inside HeroCtl\".",[30,60,62],{"id":61},"what-each-component-does-in-one-sentence","What each component does, in one sentence",[11,64,65],{},"Before installing anything, it's worth understanding the role of each piece. The stack has six components; the confusion usually comes from thinking some of them is \"the monitoring system\". It's not. Each one does one thing.",[67,68,69,76,82,88,94,100],"ul",{},[70,71,72,75],"li",{},[38,73,74],{},"Prometheus"," is a time-series database (TSDB) that collects metrics via HTTP scrape — it pulls the numbers, nobody pushes them. Retains 15 days by default.",[70,77,78,81],{},[38,79,80],{},"Grafana"," is the visualization layer. Connects to Prometheus, to Loki, to Postgres, to almost any structured source, and draws graphs.",[70,83,84,87],{},[38,85,86],{},"Loki"," is the log piece. Syntax similar to Prometheus, indexes only labels (not log content), and because of that gets about ten times cheaper than ELK to run.",[70,89,90,93],{},[38,91,92],{},"Promtail"," (or Grafana Agent, which is replacing Promtail in 2026) is the collector that reads log files from each server and sends to Loki.",[70,95,96,99],{},[38,97,98],{},"node_exporter"," runs on each monitored node and exposes an HTTP endpoint with CPU, RAM, disk and network in Prometheus format.",[70,101,102,105],{},[38,103,104],{},"Alertmanager"," receives alert rules from Prometheus and handles routing — Slack, email, PagerDuty, arbitrary webhook.",[11,107,108,109,18,112,18,115,18,118,121],{},"Whoever designs the first stack usually confuses Prometheus with \"monitoring\" and Grafana with \"pretty dashboards\". The real separation is: ",[38,110,111],{},"Prometheus stores numbers",[38,113,114],{},"Loki stores text",[38,116,117],{},"Grafana shows both",[38,119,120],{},"Alertmanager screams when some number is wrong",".",[30,123,125],{"id":124},"whats-the-recommended-architecture","What's the recommended architecture?",[11,127,128],{},"For a cluster of 3 to 5 servers running production applications, the topology that has worked in practice is to separate the observability server from the rest. A dedicated node, outside the cluster it monitors, with two objectives: not dying together when the cluster dies, and not competing for CPU\u002FRAM with the real application.",[67,130,131,137,143,153],{},[70,132,133,136],{},[38,134,135],{},"1 dedicated \"observability\" server",", 4 GB of RAM, 50 GB SSD. Runs Prometheus, Grafana, Loki, Alertmanager.",[70,138,139,142],{},[38,140,141],{},"Each monitored server"," runs only two lightweight processes: node_exporter (system metrics) and Promtail (log shipping).",[70,144,145,148,149,152],{},[38,146,147],{},"Your applications"," expose a ",[15,150,151],{},"\u002Fmetrics"," endpoint in Prometheus format. If you use a popular framework, there's a ready client. If not, it's a library of a few dozen lines.",[70,154,155,157,158,161],{},[38,156,80],{}," is accessible via subdomain (",[15,159,160],{},"monitor.yourdomain.com",") with automatic TLS and basic authentication in front.",[11,163,164],{},"This separation has a cost: you pay for one more VPS. In exchange, when the main cluster falls, you can still look at the graphs to understand what happened. For a startup, this trade-off pays off almost always — the worst monitoring scenario is discovering that the only thing that stopped along with the site was the system that would warn you that the site stopped.",[30,166,168],{"id":167},"step-1-how-to-provision-the-observability-vps","Step 1 — How to provision the observability VPS?",[11,170,171,172,121],{},"Estimated time: ",[38,173,174],{},"10 minutes",[11,176,177],{},"Any cheap provider works. The two with best cost-benefit for the Brazilian case today are Hetzner (CPX21 at 7.99 EUR per month with 3 vCPUs and 4 GB of RAM, datacenter in Germany) and DigitalOcean (Basic Droplet at US$24 per month with the same configuration, datacenters closer to Brazil). For monitoring workload, scrape latency in European datacenter doesn't cause a problem — Prometheus pulls every 15 seconds by default, so 200ms RTT between Hetzner and your servers doesn't disrupt.",[11,179,180],{},"Provisioning:",[182,183,184,187,190,196,203],"ol",{},[70,185,186],{},"Create the VPS with Ubuntu 24.04 LTS or Debian 12.",[70,188,189],{},"Add your public SSH key on creation. Disable password login.",[70,191,192,193,121],{},"Install Docker and the compose plugin: ",[15,194,195],{},"curl -fsSL https:\u002F\u002Fget.docker.com | sh && apt install docker-compose-plugin",[70,197,198,199,202],{},"Configure firewall: port 22 (SSH) open, port 443 (HTTPS) open, all others closed. Internal ports (3000, 9090, 3100, 9093) only stay accessible via ",[15,200,201],{},"localhost"," of the VPS itself — the reverse proxy exposes Grafana via 443.",[70,204,205,206,208],{},"Point DNS: create an A record ",[15,207,160],{}," to the VPS IP.",[11,210,211,212,215,216,219,220,223],{},"Validation: ",[15,213,214],{},"docker --version"," returns 26.x or higher; ",[15,217,218],{},"dig monitor.yourdomain.com"," returns the correct IP; ",[15,221,222],{},"ssh root@monitor.yourdomain.com"," connects without asking for password.",[30,225,227],{"id":226},"step-2-how-to-bring-up-the-stack-via-docker-compose","Step 2 — How to bring up the stack via docker-compose?",[11,229,171,230,121],{},[38,231,232],{},"45 minutes",[11,234,235,236,239],{},"Create the working directory at ",[15,237,238],{},"\u002Fopt\u002Fobservability\u002F"," with the following structure:",[241,242,247],"pre",{"className":243,"code":245,"language":246},[244],"language-text","\u002Fopt\u002Fobservability\u002F\n├── docker-compose.yml\n├── prometheus\u002F\n│   ├── prometheus.yml\n│   └── alerts.yml\n├── alertmanager\u002F\n│   └── alertmanager.yml\n├── loki\u002F\n│   └── loki-config.yml\n└── grafana\u002F\n    └── provisioning\u002F\n        └── datasources\u002F\n            └── datasources.yml\n","text",[15,248,245],{"__ignoreMap":249},"",[11,251,252,253,256],{},"The abbreviated but functional ",[15,254,255],{},"docker-compose.yml",":",[241,258,262],{"className":259,"code":260,"language":261,"meta":249,"style":249},"language-yaml shiki shiki-themes github-dark-default","services:\n  prometheus:\n    image: prom\u002Fprometheus:v2.55.0\n    volumes:\n      - .\u002Fprometheus:\u002Fetc\u002Fprometheus\n      - prometheus-data:\u002Fprometheus\n    command:\n      - '--config.file=\u002Fetc\u002Fprometheus\u002Fprometheus.yml'\n      - '--storage.tsdb.retention.time=30d'\n      - '--web.enable-lifecycle'  # permite reload via HTTP POST\n    ports:\n      - '127.0.0.1:9090:9090'\n    restart: unless-stopped\n\n  grafana:\n    image: grafana\u002Fgrafana:11.3.0\n    volumes:\n      - grafana-data:\u002Fvar\u002Flib\u002Fgrafana\n      - .\u002Fgrafana\u002Fprovisioning:\u002Fetc\u002Fgrafana\u002Fprovisioning\n    environment:\n      - GF_SECURITY_ADMIN_PASSWORD=${GRAFANA_PASSWORD}\n      - GF_USERS_ALLOW_SIGN_UP=false\n    ports:\n      - '127.0.0.1:3000:3000'\n    restart: unless-stopped\n\n  loki:\n    image: grafana\u002Floki:3.2.0\n    volumes:\n      - .\u002Floki\u002Floki-config.yml:\u002Fetc\u002Floki\u002Fconfig.yml\n      - loki-data:\u002Floki\n    command: -config.file=\u002Fetc\u002Floki\u002Fconfig.yml\n    ports:\n      - '127.0.0.1:3100:3100'\n    restart: unless-stopped\n\n  alertmanager:\n    image: prom\u002Falertmanager:v0.27.0\n    volumes:\n      - .\u002Falertmanager:\u002Fetc\u002Falertmanager\n    ports:\n      - '127.0.0.1:9093:9093'\n    restart: unless-stopped\n\nvolumes:\n  prometheus-data:\n  grafana-data:\n  loki-data:\n","yaml",[15,263,264,277,285,298,306,315,323,331,339,347,359,367,375,386,393,401,411,418,426,434,442,450,458,465,473,482,487,495,505,512,520,528,538,545,553,562,567,575,585,592,600,607,615,624,629,637,645,653],{"__ignoreMap":249},[265,266,269,273],"span",{"class":267,"line":268},"line",1,[265,270,272],{"class":271},"sPWt5","services",[265,274,276],{"class":275},"sZEs4",":\n",[265,278,280,283],{"class":267,"line":279},2,[265,281,282],{"class":271},"  prometheus",[265,284,276],{"class":275},[265,286,288,291,294],{"class":267,"line":287},3,[265,289,290],{"class":271},"    image",[265,292,293],{"class":275},": ",[265,295,297],{"class":296},"s9uIt","prom\u002Fprometheus:v2.55.0\n",[265,299,301,304],{"class":267,"line":300},4,[265,302,303],{"class":271},"    volumes",[265,305,276],{"class":275},[265,307,309,312],{"class":267,"line":308},5,[265,310,311],{"class":275},"      - ",[265,313,314],{"class":296},".\u002Fprometheus:\u002Fetc\u002Fprometheus\n",[265,316,318,320],{"class":267,"line":317},6,[265,319,311],{"class":275},[265,321,322],{"class":296},"prometheus-data:\u002Fprometheus\n",[265,324,326,329],{"class":267,"line":325},7,[265,327,328],{"class":271},"    command",[265,330,276],{"class":275},[265,332,334,336],{"class":267,"line":333},8,[265,335,311],{"class":275},[265,337,338],{"class":296},"'--config.file=\u002Fetc\u002Fprometheus\u002Fprometheus.yml'\n",[265,340,342,344],{"class":267,"line":341},9,[265,343,311],{"class":275},[265,345,346],{"class":296},"'--storage.tsdb.retention.time=30d'\n",[265,348,350,352,355],{"class":267,"line":349},10,[265,351,311],{"class":275},[265,353,354],{"class":296},"'--web.enable-lifecycle'",[265,356,358],{"class":357},"sH3jZ","  # permite reload via HTTP POST\n",[265,360,362,365],{"class":267,"line":361},11,[265,363,364],{"class":271},"    ports",[265,366,276],{"class":275},[265,368,370,372],{"class":267,"line":369},12,[265,371,311],{"class":275},[265,373,374],{"class":296},"'127.0.0.1:9090:9090'\n",[265,376,378,381,383],{"class":267,"line":377},13,[265,379,380],{"class":271},"    restart",[265,382,293],{"class":275},[265,384,385],{"class":296},"unless-stopped\n",[265,387,389],{"class":267,"line":388},14,[265,390,392],{"emptyLinePlaceholder":391},true,"\n",[265,394,396,399],{"class":267,"line":395},15,[265,397,398],{"class":271},"  grafana",[265,400,276],{"class":275},[265,402,404,406,408],{"class":267,"line":403},16,[265,405,290],{"class":271},[265,407,293],{"class":275},[265,409,410],{"class":296},"grafana\u002Fgrafana:11.3.0\n",[265,412,414,416],{"class":267,"line":413},17,[265,415,303],{"class":271},[265,417,276],{"class":275},[265,419,421,423],{"class":267,"line":420},18,[265,422,311],{"class":275},[265,424,425],{"class":296},"grafana-data:\u002Fvar\u002Flib\u002Fgrafana\n",[265,427,429,431],{"class":267,"line":428},19,[265,430,311],{"class":275},[265,432,433],{"class":296},".\u002Fgrafana\u002Fprovisioning:\u002Fetc\u002Fgrafana\u002Fprovisioning\n",[265,435,437,440],{"class":267,"line":436},20,[265,438,439],{"class":271},"    environment",[265,441,276],{"class":275},[265,443,445,447],{"class":267,"line":444},21,[265,446,311],{"class":275},[265,448,449],{"class":296},"GF_SECURITY_ADMIN_PASSWORD=${GRAFANA_PASSWORD}\n",[265,451,453,455],{"class":267,"line":452},22,[265,454,311],{"class":275},[265,456,457],{"class":296},"GF_USERS_ALLOW_SIGN_UP=false\n",[265,459,461,463],{"class":267,"line":460},23,[265,462,364],{"class":271},[265,464,276],{"class":275},[265,466,468,470],{"class":267,"line":467},24,[265,469,311],{"class":275},[265,471,472],{"class":296},"'127.0.0.1:3000:3000'\n",[265,474,476,478,480],{"class":267,"line":475},25,[265,477,380],{"class":271},[265,479,293],{"class":275},[265,481,385],{"class":296},[265,483,485],{"class":267,"line":484},26,[265,486,392],{"emptyLinePlaceholder":391},[265,488,490,493],{"class":267,"line":489},27,[265,491,492],{"class":271},"  loki",[265,494,276],{"class":275},[265,496,498,500,502],{"class":267,"line":497},28,[265,499,290],{"class":271},[265,501,293],{"class":275},[265,503,504],{"class":296},"grafana\u002Floki:3.2.0\n",[265,506,508,510],{"class":267,"line":507},29,[265,509,303],{"class":271},[265,511,276],{"class":275},[265,513,515,517],{"class":267,"line":514},30,[265,516,311],{"class":275},[265,518,519],{"class":296},".\u002Floki\u002Floki-config.yml:\u002Fetc\u002Floki\u002Fconfig.yml\n",[265,521,523,525],{"class":267,"line":522},31,[265,524,311],{"class":275},[265,526,527],{"class":296},"loki-data:\u002Floki\n",[265,529,531,533,535],{"class":267,"line":530},32,[265,532,328],{"class":271},[265,534,293],{"class":275},[265,536,537],{"class":296},"-config.file=\u002Fetc\u002Floki\u002Fconfig.yml\n",[265,539,541,543],{"class":267,"line":540},33,[265,542,364],{"class":271},[265,544,276],{"class":275},[265,546,548,550],{"class":267,"line":547},34,[265,549,311],{"class":275},[265,551,552],{"class":296},"'127.0.0.1:3100:3100'\n",[265,554,556,558,560],{"class":267,"line":555},35,[265,557,380],{"class":271},[265,559,293],{"class":275},[265,561,385],{"class":296},[265,563,565],{"class":267,"line":564},36,[265,566,392],{"emptyLinePlaceholder":391},[265,568,570,573],{"class":267,"line":569},37,[265,571,572],{"class":271},"  alertmanager",[265,574,276],{"class":275},[265,576,578,580,582],{"class":267,"line":577},38,[265,579,290],{"class":271},[265,581,293],{"class":275},[265,583,584],{"class":296},"prom\u002Falertmanager:v0.27.0\n",[265,586,588,590],{"class":267,"line":587},39,[265,589,303],{"class":271},[265,591,276],{"class":275},[265,593,595,597],{"class":267,"line":594},40,[265,596,311],{"class":275},[265,598,599],{"class":296},".\u002Falertmanager:\u002Fetc\u002Falertmanager\n",[265,601,603,605],{"class":267,"line":602},41,[265,604,364],{"class":271},[265,606,276],{"class":275},[265,608,610,612],{"class":267,"line":609},42,[265,611,311],{"class":275},[265,613,614],{"class":296},"'127.0.0.1:9093:9093'\n",[265,616,618,620,622],{"class":267,"line":617},43,[265,619,380],{"class":271},[265,621,293],{"class":275},[265,623,385],{"class":296},[265,625,627],{"class":267,"line":626},44,[265,628,392],{"emptyLinePlaceholder":391},[265,630,632,635],{"class":267,"line":631},45,[265,633,634],{"class":271},"volumes",[265,636,276],{"class":275},[265,638,640,643],{"class":267,"line":639},46,[265,641,642],{"class":271},"  prometheus-data",[265,644,276],{"class":275},[265,646,648,651],{"class":267,"line":647},47,[265,649,650],{"class":271},"  grafana-data",[265,652,276],{"class":275},[265,654,656,659],{"class":267,"line":655},48,[265,657,658],{"class":271},"  loki-data",[265,660,276],{"class":275},[11,662,663,664,667,668,671,672,675,676,679],{},"Three important points in this file. First, all ports are bound to ",[15,665,666],{},"127.0.0.1"," — none of the services is directly accessible from the internet. Second, volumes are named (not bind mounts), so they survive ",[15,669,670],{},"docker-compose down",". Third, the Grafana password comes from environment variable: create a ",[15,673,674],{},".env"," next to the compose with ",[15,677,678],{},"GRAFANA_PASSWORD=something_long_random"," and never commit that.",[11,681,682],{},"Bring up the stack:",[241,684,688],{"className":685,"code":686,"language":687,"meta":249,"style":249},"language-bash shiki shiki-themes github-dark-default","cd \u002Fopt\u002Fobservability\ndocker compose up -d\ndocker compose ps  # all should be \"Up\" \u002F healthy\n","bash",[15,689,690,699,714],{"__ignoreMap":249},[265,691,692,696],{"class":267,"line":268},[265,693,695],{"class":694},"sFSAA","cd",[265,697,698],{"class":296}," \u002Fopt\u002Fobservability\n",[265,700,701,705,708,711],{"class":267,"line":279},[265,702,704],{"class":703},"sQhOw","docker",[265,706,707],{"class":296}," compose",[265,709,710],{"class":296}," up",[265,712,713],{"class":694}," -d\n",[265,715,716,718,720,723],{"class":267,"line":287},[265,717,704],{"class":703},[265,719,707],{"class":296},[265,721,722],{"class":296}," ps",[265,724,725],{"class":357},"  # all should be \"Up\" \u002F healthy\n",[11,727,728,729,732,733,736,737,732,740,736,743,746,747,121],{},"Quick validation: ",[15,730,731],{},"curl localhost:9090\u002F-\u002Fready"," returns ",[15,734,735],{},"Prometheus Server is Ready","; ",[15,738,739],{},"curl localhost:3100\u002Fready",[15,741,742],{},"ready",[15,744,745],{},"curl localhost:3000\u002Fapi\u002Fhealth"," returns JSON with ",[15,748,749],{},"\"database\": \"ok\"",[30,751,753],{"id":752},"step-3-how-to-configure-prometheus-scrapes","Step 3 — How to configure Prometheus scrapes?",[11,755,171,756,121],{},[38,757,758],{},"30 minutes",[11,760,761,762,765],{},"The ",[15,763,764],{},"prometheus\u002Fprometheus.yml"," is where you tell Prometheus which endpoints to scrape. For a 4-server cluster, it looks like this:",[241,767,769],{"className":259,"code":768,"language":261,"meta":249,"style":249},"global:\n  scrape_interval: 15s\n  evaluation_interval: 15s\n\nalerting:\n  alertmanagers:\n    - static_configs:\n        - targets: ['alertmanager:9093']\n\nrule_files:\n  - 'alerts.yml'\n\nscrape_configs:\n  - job_name: 'prometheus'\n    static_configs:\n      - targets: ['localhost:9090']\n\n  - job_name: 'node'\n    static_configs:\n      - targets:\n          - 'server-1.yourdomain.internal:9100'\n          - 'server-2.yourdomain.internal:9100'\n          - 'server-3.yourdomain.internal:9100'\n          - 'worker-1.yourdomain.internal:9100'\n        labels:\n          environment: 'production'\n\n  - job_name: 'apps'\n    static_configs:\n      - targets:\n          - 'api.yourdomain.internal:8080'\n          - 'worker.yourdomain.internal:8080'\n        labels:\n          environment: 'production'\n    metrics_path: '\u002Fmetrics'\n",[15,770,771,778,788,797,801,808,815,825,842,846,853,861,865,872,884,891,904,908,919,925,933,941,948,955,962,969,979,983,994,1000,1008,1015,1022,1028,1036],{"__ignoreMap":249},[265,772,773,776],{"class":267,"line":268},[265,774,775],{"class":271},"global",[265,777,276],{"class":275},[265,779,780,783,785],{"class":267,"line":279},[265,781,782],{"class":271},"  scrape_interval",[265,784,293],{"class":275},[265,786,787],{"class":296},"15s\n",[265,789,790,793,795],{"class":267,"line":287},[265,791,792],{"class":271},"  evaluation_interval",[265,794,293],{"class":275},[265,796,787],{"class":296},[265,798,799],{"class":267,"line":300},[265,800,392],{"emptyLinePlaceholder":391},[265,802,803,806],{"class":267,"line":308},[265,804,805],{"class":271},"alerting",[265,807,276],{"class":275},[265,809,810,813],{"class":267,"line":317},[265,811,812],{"class":271},"  alertmanagers",[265,814,276],{"class":275},[265,816,817,820,823],{"class":267,"line":325},[265,818,819],{"class":275},"    - ",[265,821,822],{"class":271},"static_configs",[265,824,276],{"class":275},[265,826,827,830,833,836,839],{"class":267,"line":333},[265,828,829],{"class":275},"        - ",[265,831,832],{"class":271},"targets",[265,834,835],{"class":275},": [",[265,837,838],{"class":296},"'alertmanager:9093'",[265,840,841],{"class":275},"]\n",[265,843,844],{"class":267,"line":341},[265,845,392],{"emptyLinePlaceholder":391},[265,847,848,851],{"class":267,"line":349},[265,849,850],{"class":271},"rule_files",[265,852,276],{"class":275},[265,854,855,858],{"class":267,"line":361},[265,856,857],{"class":275},"  - ",[265,859,860],{"class":296},"'alerts.yml'\n",[265,862,863],{"class":267,"line":369},[265,864,392],{"emptyLinePlaceholder":391},[265,866,867,870],{"class":267,"line":377},[265,868,869],{"class":271},"scrape_configs",[265,871,276],{"class":275},[265,873,874,876,879,881],{"class":267,"line":388},[265,875,857],{"class":275},[265,877,878],{"class":271},"job_name",[265,880,293],{"class":275},[265,882,883],{"class":296},"'prometheus'\n",[265,885,886,889],{"class":267,"line":395},[265,887,888],{"class":271},"    static_configs",[265,890,276],{"class":275},[265,892,893,895,897,899,902],{"class":267,"line":403},[265,894,311],{"class":275},[265,896,832],{"class":271},[265,898,835],{"class":275},[265,900,901],{"class":296},"'localhost:9090'",[265,903,841],{"class":275},[265,905,906],{"class":267,"line":413},[265,907,392],{"emptyLinePlaceholder":391},[265,909,910,912,914,916],{"class":267,"line":420},[265,911,857],{"class":275},[265,913,878],{"class":271},[265,915,293],{"class":275},[265,917,918],{"class":296},"'node'\n",[265,920,921,923],{"class":267,"line":428},[265,922,888],{"class":271},[265,924,276],{"class":275},[265,926,927,929,931],{"class":267,"line":436},[265,928,311],{"class":275},[265,930,832],{"class":271},[265,932,276],{"class":275},[265,934,935,938],{"class":267,"line":444},[265,936,937],{"class":275},"          - ",[265,939,940],{"class":296},"'server-1.yourdomain.internal:9100'\n",[265,942,943,945],{"class":267,"line":452},[265,944,937],{"class":275},[265,946,947],{"class":296},"'server-2.yourdomain.internal:9100'\n",[265,949,950,952],{"class":267,"line":460},[265,951,937],{"class":275},[265,953,954],{"class":296},"'server-3.yourdomain.internal:9100'\n",[265,956,957,959],{"class":267,"line":467},[265,958,937],{"class":275},[265,960,961],{"class":296},"'worker-1.yourdomain.internal:9100'\n",[265,963,964,967],{"class":267,"line":475},[265,965,966],{"class":271},"        labels",[265,968,276],{"class":275},[265,970,971,974,976],{"class":267,"line":484},[265,972,973],{"class":271},"          environment",[265,975,293],{"class":275},[265,977,978],{"class":296},"'production'\n",[265,980,981],{"class":267,"line":489},[265,982,392],{"emptyLinePlaceholder":391},[265,984,985,987,989,991],{"class":267,"line":497},[265,986,857],{"class":275},[265,988,878],{"class":271},[265,990,293],{"class":275},[265,992,993],{"class":296},"'apps'\n",[265,995,996,998],{"class":267,"line":507},[265,997,888],{"class":271},[265,999,276],{"class":275},[265,1001,1002,1004,1006],{"class":267,"line":514},[265,1003,311],{"class":275},[265,1005,832],{"class":271},[265,1007,276],{"class":275},[265,1009,1010,1012],{"class":267,"line":522},[265,1011,937],{"class":275},[265,1013,1014],{"class":296},"'api.yourdomain.internal:8080'\n",[265,1016,1017,1019],{"class":267,"line":530},[265,1018,937],{"class":275},[265,1020,1021],{"class":296},"'worker.yourdomain.internal:8080'\n",[265,1023,1024,1026],{"class":267,"line":540},[265,1025,966],{"class":271},[265,1027,276],{"class":275},[265,1029,1030,1032,1034],{"class":267,"line":547},[265,1031,973],{"class":271},[265,1033,293],{"class":275},[265,1035,978],{"class":296},[265,1037,1038,1041,1043],{"class":267,"line":555},[265,1039,1040],{"class":271},"    metrics_path",[265,1042,293],{"class":275},[265,1044,1045],{"class":296},"'\u002Fmetrics'\n",[11,1047,1048,1049,1051,1052,1055],{},"For larger clusters or those that change composition frequently, swap ",[15,1050,822],{}," for ",[15,1053,1054],{},"file_sd_configs"," pointing to a JSON you generate automatically. For 4 static servers, the file above resolves it.",[11,1057,1058,1059,1062,1063,1066,1067,1070,1071,1074],{},"Reload: ",[15,1060,1061],{},"curl -X POST localhost:9090\u002F-\u002Freload",". Check at ",[15,1064,1065],{},"localhost:9090\u002Ftargets"," if all jobs are ",[15,1068,1069],{},"UP",". The ones that are ",[15,1072,1073],{},"DOWN"," haven't been instrumented yet — that's step 4.",[30,1076,1078],{"id":1077},"step-4-how-to-install-node_exporter-on-each-server","Step 4 — How to install node_exporter on each server?",[11,1080,171,1081,1084],{},[38,1082,1083],{},"15 minutes"," for 4 servers.",[11,1086,1087],{},"On each monitored server, run node_exporter. There are two ways: direct binary via systemd, or Docker container. In 2026 the consensus is container — easier to update and isolate. On each node:",[241,1089,1091],{"className":685,"code":1090,"language":687,"meta":249,"style":249},"docker run -d \\\n  --name node-exporter \\\n  --restart unless-stopped \\\n  --net=\"host\" \\\n  --pid=\"host\" \\\n  -v \"\u002F:\u002Fhost:ro,rslave\" \\\n  prom\u002Fnode-exporter:v1.8.2 \\\n  --path.rootfs=\u002Fhost\n",[15,1092,1093,1107,1117,1127,1137,1146,1156,1163],{"__ignoreMap":249},[265,1094,1095,1097,1100,1103],{"class":267,"line":268},[265,1096,704],{"class":703},[265,1098,1099],{"class":296}," run",[265,1101,1102],{"class":694}," -d",[265,1104,1106],{"class":1105},"suJrU"," \\\n",[265,1108,1109,1112,1115],{"class":267,"line":279},[265,1110,1111],{"class":694},"  --name",[265,1113,1114],{"class":296}," node-exporter",[265,1116,1106],{"class":1105},[265,1118,1119,1122,1125],{"class":267,"line":287},[265,1120,1121],{"class":694},"  --restart",[265,1123,1124],{"class":296}," unless-stopped",[265,1126,1106],{"class":1105},[265,1128,1129,1132,1135],{"class":267,"line":300},[265,1130,1131],{"class":694},"  --net=",[265,1133,1134],{"class":296},"\"host\"",[265,1136,1106],{"class":1105},[265,1138,1139,1142,1144],{"class":267,"line":308},[265,1140,1141],{"class":694},"  --pid=",[265,1143,1134],{"class":296},[265,1145,1106],{"class":1105},[265,1147,1148,1151,1154],{"class":267,"line":317},[265,1149,1150],{"class":694},"  -v",[265,1152,1153],{"class":296}," \"\u002F:\u002Fhost:ro,rslave\"",[265,1155,1106],{"class":1105},[265,1157,1158,1161],{"class":267,"line":325},[265,1159,1160],{"class":296},"  prom\u002Fnode-exporter:v1.8.2",[265,1162,1106],{"class":1105},[265,1164,1165],{"class":267,"line":333},[265,1166,1167],{"class":694},"  --path.rootfs=\u002Fhost\n",[11,1169,761,1170,1173,1174,1177,1178,18,1181,1184,1185,1188],{},[15,1171,1172],{},"--net=host"," is necessary for it to see real network interfaces. The bind mount on ",[15,1175,1176],{},"\u002Fhost"," allows reading ",[15,1179,1180],{},"\u002Fproc",[15,1182,1183],{},"\u002Fsys"," and ",[15,1186,1187],{},"\u002Fetc\u002Fpasswd"," from the host (read-only) without running the container with root privileges.",[11,1190,1191,1192,256],{},"Firewall: open port 9100 only to the observability server IP. On Ubuntu with ",[15,1193,1194],{},"ufw",[241,1196,1198],{"className":685,"code":1197,"language":687,"meta":249,"style":249},"ufw allow from \u003COBSERVABILITY_IP> to any port 9100\n",[15,1199,1200],{"__ignoreMap":249},[265,1201,1202,1204,1207,1210,1213,1216,1219,1222,1225,1228,1231],{"class":267,"line":268},[265,1203,1194],{"class":703},[265,1205,1206],{"class":296}," allow",[265,1208,1209],{"class":296}," from",[265,1211,1212],{"class":1105}," \u003C",[265,1214,1215],{"class":296},"OBSERVABILITY_I",[265,1217,1218],{"class":275},"P",[265,1220,1221],{"class":1105},">",[265,1223,1224],{"class":296}," to",[265,1226,1227],{"class":296}," any",[265,1229,1230],{"class":296}," port",[265,1232,1233],{"class":694}," 9100\n",[11,1235,1236,1237,1240,1241,121],{},"Validation: from the observability server, ",[15,1238,1239],{},"curl http:\u002F\u002Fserver-1.yourdomain.internal:9100\u002Fmetrics"," should return hundreds of lines starting with ",[15,1242,1243],{},"# HELP node_cpu_seconds_total...",[30,1245,1247],{"id":1246},"step-5-how-to-configure-loki-promtail","Step 5 — How to configure Loki + Promtail?",[11,1249,171,1250,121],{},[38,1251,758],{},[11,1253,1254,1255,256],{},"Loki is already running in the compose from step 2. Missing the ",[15,1256,1257],{},"loki-config.yml",[241,1259,1261],{"className":259,"code":1260,"language":261,"meta":249,"style":249},"auth_enabled: false\n\nserver:\n  http_listen_port: 3100\n\ncommon:\n  path_prefix: \u002Floki\n  storage:\n    filesystem:\n      chunks_directory: \u002Floki\u002Fchunks\n      rules_directory: \u002Floki\u002Frules\n  replication_factor: 1\n  ring:\n    kvstore:\n      store: inmemory\n\nschema_config:\n  configs:\n    - from: 2024-01-01\n      store: tsdb\n      object_store: filesystem\n      schema: v13\n      index:\n        prefix: index_\n        period: 24h\n\nlimits_config:\n  retention_period: 720h  # 30 dias\n  reject_old_samples: true\n  reject_old_samples_max_age: 168h\n",[15,1262,1263,1273,1277,1284,1294,1298,1305,1315,1322,1329,1339,1349,1359,1366,1373,1383,1387,1394,1401,1413,1422,1432,1442,1449,1459,1469,1473,1480,1493,1503],{"__ignoreMap":249},[265,1264,1265,1268,1270],{"class":267,"line":268},[265,1266,1267],{"class":271},"auth_enabled",[265,1269,293],{"class":275},[265,1271,1272],{"class":694},"false\n",[265,1274,1275],{"class":267,"line":279},[265,1276,392],{"emptyLinePlaceholder":391},[265,1278,1279,1282],{"class":267,"line":287},[265,1280,1281],{"class":271},"server",[265,1283,276],{"class":275},[265,1285,1286,1289,1291],{"class":267,"line":300},[265,1287,1288],{"class":271},"  http_listen_port",[265,1290,293],{"class":275},[265,1292,1293],{"class":694},"3100\n",[265,1295,1296],{"class":267,"line":308},[265,1297,392],{"emptyLinePlaceholder":391},[265,1299,1300,1303],{"class":267,"line":317},[265,1301,1302],{"class":271},"common",[265,1304,276],{"class":275},[265,1306,1307,1310,1312],{"class":267,"line":325},[265,1308,1309],{"class":271},"  path_prefix",[265,1311,293],{"class":275},[265,1313,1314],{"class":296},"\u002Floki\n",[265,1316,1317,1320],{"class":267,"line":333},[265,1318,1319],{"class":271},"  storage",[265,1321,276],{"class":275},[265,1323,1324,1327],{"class":267,"line":341},[265,1325,1326],{"class":271},"    filesystem",[265,1328,276],{"class":275},[265,1330,1331,1334,1336],{"class":267,"line":349},[265,1332,1333],{"class":271},"      chunks_directory",[265,1335,293],{"class":275},[265,1337,1338],{"class":296},"\u002Floki\u002Fchunks\n",[265,1340,1341,1344,1346],{"class":267,"line":361},[265,1342,1343],{"class":271},"      rules_directory",[265,1345,293],{"class":275},[265,1347,1348],{"class":296},"\u002Floki\u002Frules\n",[265,1350,1351,1354,1356],{"class":267,"line":369},[265,1352,1353],{"class":271},"  replication_factor",[265,1355,293],{"class":275},[265,1357,1358],{"class":694},"1\n",[265,1360,1361,1364],{"class":267,"line":377},[265,1362,1363],{"class":271},"  ring",[265,1365,276],{"class":275},[265,1367,1368,1371],{"class":267,"line":388},[265,1369,1370],{"class":271},"    kvstore",[265,1372,276],{"class":275},[265,1374,1375,1378,1380],{"class":267,"line":395},[265,1376,1377],{"class":271},"      store",[265,1379,293],{"class":275},[265,1381,1382],{"class":296},"inmemory\n",[265,1384,1385],{"class":267,"line":403},[265,1386,392],{"emptyLinePlaceholder":391},[265,1388,1389,1392],{"class":267,"line":413},[265,1390,1391],{"class":271},"schema_config",[265,1393,276],{"class":275},[265,1395,1396,1399],{"class":267,"line":420},[265,1397,1398],{"class":271},"  configs",[265,1400,276],{"class":275},[265,1402,1403,1405,1408,1410],{"class":267,"line":428},[265,1404,819],{"class":275},[265,1406,1407],{"class":271},"from",[265,1409,293],{"class":275},[265,1411,1412],{"class":694},"2024-01-01\n",[265,1414,1415,1417,1419],{"class":267,"line":436},[265,1416,1377],{"class":271},[265,1418,293],{"class":275},[265,1420,1421],{"class":296},"tsdb\n",[265,1423,1424,1427,1429],{"class":267,"line":444},[265,1425,1426],{"class":271},"      object_store",[265,1428,293],{"class":275},[265,1430,1431],{"class":296},"filesystem\n",[265,1433,1434,1437,1439],{"class":267,"line":452},[265,1435,1436],{"class":271},"      schema",[265,1438,293],{"class":275},[265,1440,1441],{"class":296},"v13\n",[265,1443,1444,1447],{"class":267,"line":460},[265,1445,1446],{"class":271},"      index",[265,1448,276],{"class":275},[265,1450,1451,1454,1456],{"class":267,"line":467},[265,1452,1453],{"class":271},"        prefix",[265,1455,293],{"class":275},[265,1457,1458],{"class":296},"index_\n",[265,1460,1461,1464,1466],{"class":267,"line":475},[265,1462,1463],{"class":271},"        period",[265,1465,293],{"class":275},[265,1467,1468],{"class":296},"24h\n",[265,1470,1471],{"class":267,"line":484},[265,1472,392],{"emptyLinePlaceholder":391},[265,1474,1475,1478],{"class":267,"line":489},[265,1476,1477],{"class":271},"limits_config",[265,1479,276],{"class":275},[265,1481,1482,1485,1487,1490],{"class":267,"line":497},[265,1483,1484],{"class":271},"  retention_period",[265,1486,293],{"class":275},[265,1488,1489],{"class":296},"720h",[265,1491,1492],{"class":357},"  # 30 dias\n",[265,1494,1495,1498,1500],{"class":267,"line":507},[265,1496,1497],{"class":271},"  reject_old_samples",[265,1499,293],{"class":275},[265,1501,1502],{"class":694},"true\n",[265,1504,1505,1508,1510],{"class":267,"line":514},[265,1506,1507],{"class":271},"  reject_old_samples_max_age",[265,1509,293],{"class":275},[265,1511,1512],{"class":296},"168h\n",[11,1514,1515],{},"Filesystem storage is enough to start. When you exceed 50 GB of logs per day or want 90+ days retention, migrate to S3 (or compatible). Don't migrate before — complicates operation without real gain.",[11,1517,1518],{},"On each monitored server, install Promtail (or Grafana Agent) also via container:",[241,1520,1522],{"className":259,"code":1521,"language":261,"meta":249,"style":249},"# \u002Fopt\u002Fpromtail\u002Fpromtail-config.yml em cada servidor\nserver:\n  http_listen_port: 9080\n\nclients:\n  - url: http:\u002F\u002Fmonitor.yourdomain.com:3100\u002Floki\u002Fapi\u002Fv1\u002Fpush\n\nscrape_configs:\n  - job_name: system\n    static_configs:\n      - targets: [localhost]\n        labels:\n          job: varlogs\n          host: ${HOSTNAME}\n          __path__: \u002Fvar\u002Flog\u002F*.log\n\n  - job_name: docker\n    docker_sd_configs:\n      - host: unix:\u002F\u002F\u002Fvar\u002Frun\u002Fdocker.sock\n    relabel_configs:\n      - source_labels: ['__meta_docker_container_name']\n        target_label: 'container'\n",[15,1523,1524,1529,1535,1544,1548,1555,1567,1571,1577,1588,1594,1606,1612,1622,1632,1642,1646,1657,1664,1676,1683,1697],{"__ignoreMap":249},[265,1525,1526],{"class":267,"line":268},[265,1527,1528],{"class":357},"# \u002Fopt\u002Fpromtail\u002Fpromtail-config.yml em cada servidor\n",[265,1530,1531,1533],{"class":267,"line":279},[265,1532,1281],{"class":271},[265,1534,276],{"class":275},[265,1536,1537,1539,1541],{"class":267,"line":287},[265,1538,1288],{"class":271},[265,1540,293],{"class":275},[265,1542,1543],{"class":694},"9080\n",[265,1545,1546],{"class":267,"line":300},[265,1547,392],{"emptyLinePlaceholder":391},[265,1549,1550,1553],{"class":267,"line":308},[265,1551,1552],{"class":271},"clients",[265,1554,276],{"class":275},[265,1556,1557,1559,1562,1564],{"class":267,"line":317},[265,1558,857],{"class":275},[265,1560,1561],{"class":271},"url",[265,1563,293],{"class":275},[265,1565,1566],{"class":296},"http:\u002F\u002Fmonitor.yourdomain.com:3100\u002Floki\u002Fapi\u002Fv1\u002Fpush\n",[265,1568,1569],{"class":267,"line":325},[265,1570,392],{"emptyLinePlaceholder":391},[265,1572,1573,1575],{"class":267,"line":333},[265,1574,869],{"class":271},[265,1576,276],{"class":275},[265,1578,1579,1581,1583,1585],{"class":267,"line":341},[265,1580,857],{"class":275},[265,1582,878],{"class":271},[265,1584,293],{"class":275},[265,1586,1587],{"class":296},"system\n",[265,1589,1590,1592],{"class":267,"line":349},[265,1591,888],{"class":271},[265,1593,276],{"class":275},[265,1595,1596,1598,1600,1602,1604],{"class":267,"line":361},[265,1597,311],{"class":275},[265,1599,832],{"class":271},[265,1601,835],{"class":275},[265,1603,201],{"class":296},[265,1605,841],{"class":275},[265,1607,1608,1610],{"class":267,"line":369},[265,1609,966],{"class":271},[265,1611,276],{"class":275},[265,1613,1614,1617,1619],{"class":267,"line":377},[265,1615,1616],{"class":271},"          job",[265,1618,293],{"class":275},[265,1620,1621],{"class":296},"varlogs\n",[265,1623,1624,1627,1629],{"class":267,"line":388},[265,1625,1626],{"class":271},"          host",[265,1628,293],{"class":275},[265,1630,1631],{"class":296},"${HOSTNAME}\n",[265,1633,1634,1637,1639],{"class":267,"line":395},[265,1635,1636],{"class":271},"          __path__",[265,1638,293],{"class":275},[265,1640,1641],{"class":296},"\u002Fvar\u002Flog\u002F*.log\n",[265,1643,1644],{"class":267,"line":403},[265,1645,392],{"emptyLinePlaceholder":391},[265,1647,1648,1650,1652,1654],{"class":267,"line":413},[265,1649,857],{"class":275},[265,1651,878],{"class":271},[265,1653,293],{"class":275},[265,1655,1656],{"class":296},"docker\n",[265,1658,1659,1662],{"class":267,"line":420},[265,1660,1661],{"class":271},"    docker_sd_configs",[265,1663,276],{"class":275},[265,1665,1666,1668,1671,1673],{"class":267,"line":428},[265,1667,311],{"class":275},[265,1669,1670],{"class":271},"host",[265,1672,293],{"class":275},[265,1674,1675],{"class":296},"unix:\u002F\u002F\u002Fvar\u002Frun\u002Fdocker.sock\n",[265,1677,1678,1681],{"class":267,"line":436},[265,1679,1680],{"class":271},"    relabel_configs",[265,1682,276],{"class":275},[265,1684,1685,1687,1690,1692,1695],{"class":267,"line":444},[265,1686,311],{"class":275},[265,1688,1689],{"class":271},"source_labels",[265,1691,835],{"class":275},[265,1693,1694],{"class":296},"'__meta_docker_container_name'",[265,1696,841],{"class":275},[265,1698,1699,1702,1704],{"class":267,"line":452},[265,1700,1701],{"class":271},"        target_label",[265,1703,293],{"class":275},[265,1705,1706],{"class":296},"'container'\n",[11,1708,1709,1710,1713,1714,1716],{},"Important: the endpoint ",[15,1711,1712],{},"http:\u002F\u002Fmonitor.yourdomain.com:3100\u002Floki\u002Fapi\u002Fv1\u002Fpush"," needs to be accessible from the servers. If you followed step 2 and bound Loki to ",[15,1715,666],{},", you have two options: expose 3100 via reverse proxy with basic authentication, or open an SSH\u002FWireGuard tunnel between servers. The second option is more secure and what we recommend.",[11,1718,1719,1720,1723],{},"Validation: in Grafana, go to Explore, select the Loki data source, run ",[15,1721,1722],{},"{job=\"varlogs\"}"," and see logs appearing in real time.",[30,1725,1727],{"id":1726},"step-6-how-to-import-grafana-dashboards","Step 6 — How to import Grafana dashboards?",[11,1729,171,1730,121],{},[38,1731,1732],{},"20 minutes",[11,1734,1735,1736,1739,1740,121],{},"Access ",[15,1737,1738],{},"https:\u002F\u002Fmonitor.yourdomain.com"," (after configuring the reverse proxy from step 8 — you can skip ahead now if you want). Login admin with the password from ",[15,1741,674],{},[11,1743,1744,1745,256],{},"Add the two data sources via automatic provisioning. In ",[15,1746,1747],{},"grafana\u002Fprovisioning\u002Fdatasources\u002Fdatasources.yml",[241,1749,1751],{"className":259,"code":1750,"language":261,"meta":249,"style":249},"apiVersion: 1\ndatasources:\n  - name: Prometheus\n    type: prometheus\n    access: proxy\n    url: http:\u002F\u002Fprometheus:9090\n    isDefault: true\n  - name: Loki\n    type: loki\n    access: proxy\n    url: http:\u002F\u002Floki:3100\n",[15,1752,1753,1762,1769,1781,1791,1801,1811,1820,1831,1840,1848],{"__ignoreMap":249},[265,1754,1755,1758,1760],{"class":267,"line":268},[265,1756,1757],{"class":271},"apiVersion",[265,1759,293],{"class":275},[265,1761,1358],{"class":694},[265,1763,1764,1767],{"class":267,"line":279},[265,1765,1766],{"class":271},"datasources",[265,1768,276],{"class":275},[265,1770,1771,1773,1776,1778],{"class":267,"line":287},[265,1772,857],{"class":275},[265,1774,1775],{"class":271},"name",[265,1777,293],{"class":275},[265,1779,1780],{"class":296},"Prometheus\n",[265,1782,1783,1786,1788],{"class":267,"line":300},[265,1784,1785],{"class":271},"    type",[265,1787,293],{"class":275},[265,1789,1790],{"class":296},"prometheus\n",[265,1792,1793,1796,1798],{"class":267,"line":308},[265,1794,1795],{"class":271},"    access",[265,1797,293],{"class":275},[265,1799,1800],{"class":296},"proxy\n",[265,1802,1803,1806,1808],{"class":267,"line":317},[265,1804,1805],{"class":271},"    url",[265,1807,293],{"class":275},[265,1809,1810],{"class":296},"http:\u002F\u002Fprometheus:9090\n",[265,1812,1813,1816,1818],{"class":267,"line":325},[265,1814,1815],{"class":271},"    isDefault",[265,1817,293],{"class":275},[265,1819,1502],{"class":694},[265,1821,1822,1824,1826,1828],{"class":267,"line":333},[265,1823,857],{"class":275},[265,1825,1775],{"class":271},[265,1827,293],{"class":275},[265,1829,1830],{"class":296},"Loki\n",[265,1832,1833,1835,1837],{"class":267,"line":341},[265,1834,1785],{"class":271},[265,1836,293],{"class":275},[265,1838,1839],{"class":296},"loki\n",[265,1841,1842,1844,1846],{"class":267,"line":349},[265,1843,1795],{"class":271},[265,1845,293],{"class":275},[265,1847,1800],{"class":296},[265,1849,1850,1852,1854],{"class":267,"line":361},[265,1851,1805],{"class":271},[265,1853,293],{"class":275},[265,1855,1856],{"class":296},"http:\u002F\u002Floki:3100\n",[11,1858,1859,1860,1863],{},"Restart Grafana with ",[15,1861,1862],{},"docker compose restart grafana"," and the sources appear automatically.",[11,1865,1866,1867,1870],{},"Import ready dashboards. In ",[38,1868,1869],{},"Dashboards → New → Import",", paste the dashboard ID:",[67,1872,1873,1879,1885],{},[70,1874,1875,1878],{},[38,1876,1877],{},"1860"," — Node Exporter Full. CPU, RAM, disk, network, filesystem. It's the most used dashboard in the Prometheus community, with reason.",[70,1880,1881,1884],{},[38,1882,1883],{},"13639"," — Logs \u002F App. Basic visualization of Loki logs with filters by job, container, host.",[70,1886,1887,1890],{},[38,1888,1889],{},"15172"," — Cluster overview. Consolidated view per server, useful for small cluster.",[11,1892,1893,1894,1897],{},"Customize each one to use ",[15,1895,1896],{},"environment=\"production\""," in the default filter. After two weeks using, you'll want to create your own dashboards for specific workloads — there's no shortcut there, it's chair time.",[30,1899,1901],{"id":1900},"step-7-how-to-configure-basic-alerts","Step 7 — How to configure basic alerts?",[11,1903,171,1904,121],{},[38,1905,232],{},[11,1907,1908],{},"Alerts are where 80% of teams stumble: either they put very few and discover incidents through customers, or they put dozens and desensitize the team.",[11,1910,1911,1912,1915,1916,256],{},"Start with ",[38,1913,1914],{},"six essential alerts",". In ",[15,1917,1918],{},"prometheus\u002Falerts.yml",[241,1920,1922],{"className":259,"code":1921,"language":261,"meta":249,"style":249},"groups:\n  - name: essentials\n    interval: 30s\n    rules:\n      - alert: ServerDown\n        expr: up{job=\"node\"} == 0\n        for: 2m\n        labels:\n          severity: critical\n        annotations:\n          summary: \"Servidor {{ $labels.instance }} está fora do ar\"\n\n      - alert: HighCPU\n        expr: 100 - (avg by(instance) (rate(node_cpu_seconds_total{mode=\"idle\"}[5m])) * 100) > 80\n        for: 10m\n        labels:\n          severity: warning\n\n      - alert: DiskAlmostFull\n        expr: (node_filesystem_avail_bytes{mountpoint=\"\u002F\"} \u002F node_filesystem_size_bytes{mountpoint=\"\u002F\"}) * 100 \u003C 15\n        for: 5m\n        labels:\n          severity: critical\n\n      - alert: HighMemory\n        expr: (1 - (node_memory_MemAvailable_bytes \u002F node_memory_MemTotal_bytes)) * 100 > 90\n        for: 10m\n        labels:\n          severity: warning\n\n      - alert: HighHTTPErrorRate\n        expr: sum(rate(http_requests_total{status=~\"5..\"}[5m])) \u002F sum(rate(http_requests_total[5m])) > 0.05\n        for: 5m\n        labels:\n          severity: critical\n\n      - alert: HighLatency\n        expr: histogram_quantile(0.99, sum(rate(http_request_duration_seconds_bucket[5m])) by (le)) > 2\n        for: 10m\n        labels:\n          severity: warning\n",[15,1923,1924,1931,1942,1952,1959,1971,1981,1991,1997,2007,2014,2024,2028,2039,2048,2057,2063,2072,2076,2087,2096,2105,2111,2119,2123,2134,2143,2151,2157,2165,2169,2180,2189,2197,2203,2211,2215,2226,2235,2243,2249],{"__ignoreMap":249},[265,1925,1926,1929],{"class":267,"line":268},[265,1927,1928],{"class":271},"groups",[265,1930,276],{"class":275},[265,1932,1933,1935,1937,1939],{"class":267,"line":279},[265,1934,857],{"class":275},[265,1936,1775],{"class":271},[265,1938,293],{"class":275},[265,1940,1941],{"class":296},"essentials\n",[265,1943,1944,1947,1949],{"class":267,"line":287},[265,1945,1946],{"class":271},"    interval",[265,1948,293],{"class":275},[265,1950,1951],{"class":296},"30s\n",[265,1953,1954,1957],{"class":267,"line":300},[265,1955,1956],{"class":271},"    rules",[265,1958,276],{"class":275},[265,1960,1961,1963,1966,1968],{"class":267,"line":308},[265,1962,311],{"class":275},[265,1964,1965],{"class":271},"alert",[265,1967,293],{"class":275},[265,1969,1970],{"class":296},"ServerDown\n",[265,1972,1973,1976,1978],{"class":267,"line":317},[265,1974,1975],{"class":271},"        expr",[265,1977,293],{"class":275},[265,1979,1980],{"class":296},"up{job=\"node\"} == 0\n",[265,1982,1983,1986,1988],{"class":267,"line":325},[265,1984,1985],{"class":271},"        for",[265,1987,293],{"class":275},[265,1989,1990],{"class":296},"2m\n",[265,1992,1993,1995],{"class":267,"line":333},[265,1994,966],{"class":271},[265,1996,276],{"class":275},[265,1998,1999,2002,2004],{"class":267,"line":341},[265,2000,2001],{"class":271},"          severity",[265,2003,293],{"class":275},[265,2005,2006],{"class":296},"critical\n",[265,2008,2009,2012],{"class":267,"line":349},[265,2010,2011],{"class":271},"        annotations",[265,2013,276],{"class":275},[265,2015,2016,2019,2021],{"class":267,"line":361},[265,2017,2018],{"class":271},"          summary",[265,2020,293],{"class":275},[265,2022,2023],{"class":296},"\"Servidor {{ $labels.instance }} está fora do ar\"\n",[265,2025,2026],{"class":267,"line":369},[265,2027,392],{"emptyLinePlaceholder":391},[265,2029,2030,2032,2034,2036],{"class":267,"line":377},[265,2031,311],{"class":275},[265,2033,1965],{"class":271},[265,2035,293],{"class":275},[265,2037,2038],{"class":296},"HighCPU\n",[265,2040,2041,2043,2045],{"class":267,"line":388},[265,2042,1975],{"class":271},[265,2044,293],{"class":275},[265,2046,2047],{"class":296},"100 - (avg by(instance) (rate(node_cpu_seconds_total{mode=\"idle\"}[5m])) * 100) > 80\n",[265,2049,2050,2052,2054],{"class":267,"line":395},[265,2051,1985],{"class":271},[265,2053,293],{"class":275},[265,2055,2056],{"class":296},"10m\n",[265,2058,2059,2061],{"class":267,"line":403},[265,2060,966],{"class":271},[265,2062,276],{"class":275},[265,2064,2065,2067,2069],{"class":267,"line":413},[265,2066,2001],{"class":271},[265,2068,293],{"class":275},[265,2070,2071],{"class":296},"warning\n",[265,2073,2074],{"class":267,"line":420},[265,2075,392],{"emptyLinePlaceholder":391},[265,2077,2078,2080,2082,2084],{"class":267,"line":428},[265,2079,311],{"class":275},[265,2081,1965],{"class":271},[265,2083,293],{"class":275},[265,2085,2086],{"class":296},"DiskAlmostFull\n",[265,2088,2089,2091,2093],{"class":267,"line":436},[265,2090,1975],{"class":271},[265,2092,293],{"class":275},[265,2094,2095],{"class":296},"(node_filesystem_avail_bytes{mountpoint=\"\u002F\"} \u002F node_filesystem_size_bytes{mountpoint=\"\u002F\"}) * 100 \u003C 15\n",[265,2097,2098,2100,2102],{"class":267,"line":444},[265,2099,1985],{"class":271},[265,2101,293],{"class":275},[265,2103,2104],{"class":296},"5m\n",[265,2106,2107,2109],{"class":267,"line":452},[265,2108,966],{"class":271},[265,2110,276],{"class":275},[265,2112,2113,2115,2117],{"class":267,"line":460},[265,2114,2001],{"class":271},[265,2116,293],{"class":275},[265,2118,2006],{"class":296},[265,2120,2121],{"class":267,"line":467},[265,2122,392],{"emptyLinePlaceholder":391},[265,2124,2125,2127,2129,2131],{"class":267,"line":475},[265,2126,311],{"class":275},[265,2128,1965],{"class":271},[265,2130,293],{"class":275},[265,2132,2133],{"class":296},"HighMemory\n",[265,2135,2136,2138,2140],{"class":267,"line":484},[265,2137,1975],{"class":271},[265,2139,293],{"class":275},[265,2141,2142],{"class":296},"(1 - (node_memory_MemAvailable_bytes \u002F node_memory_MemTotal_bytes)) * 100 > 90\n",[265,2144,2145,2147,2149],{"class":267,"line":489},[265,2146,1985],{"class":271},[265,2148,293],{"class":275},[265,2150,2056],{"class":296},[265,2152,2153,2155],{"class":267,"line":497},[265,2154,966],{"class":271},[265,2156,276],{"class":275},[265,2158,2159,2161,2163],{"class":267,"line":507},[265,2160,2001],{"class":271},[265,2162,293],{"class":275},[265,2164,2071],{"class":296},[265,2166,2167],{"class":267,"line":514},[265,2168,392],{"emptyLinePlaceholder":391},[265,2170,2171,2173,2175,2177],{"class":267,"line":522},[265,2172,311],{"class":275},[265,2174,1965],{"class":271},[265,2176,293],{"class":275},[265,2178,2179],{"class":296},"HighHTTPErrorRate\n",[265,2181,2182,2184,2186],{"class":267,"line":530},[265,2183,1975],{"class":271},[265,2185,293],{"class":275},[265,2187,2188],{"class":296},"sum(rate(http_requests_total{status=~\"5..\"}[5m])) \u002F sum(rate(http_requests_total[5m])) > 0.05\n",[265,2190,2191,2193,2195],{"class":267,"line":540},[265,2192,1985],{"class":271},[265,2194,293],{"class":275},[265,2196,2104],{"class":296},[265,2198,2199,2201],{"class":267,"line":547},[265,2200,966],{"class":271},[265,2202,276],{"class":275},[265,2204,2205,2207,2209],{"class":267,"line":555},[265,2206,2001],{"class":271},[265,2208,293],{"class":275},[265,2210,2006],{"class":296},[265,2212,2213],{"class":267,"line":564},[265,2214,392],{"emptyLinePlaceholder":391},[265,2216,2217,2219,2221,2223],{"class":267,"line":569},[265,2218,311],{"class":275},[265,2220,1965],{"class":271},[265,2222,293],{"class":275},[265,2224,2225],{"class":296},"HighLatency\n",[265,2227,2228,2230,2232],{"class":267,"line":577},[265,2229,1975],{"class":271},[265,2231,293],{"class":275},[265,2233,2234],{"class":296},"histogram_quantile(0.99, sum(rate(http_request_duration_seconds_bucket[5m])) by (le)) > 2\n",[265,2236,2237,2239,2241],{"class":267,"line":587},[265,2238,1985],{"class":271},[265,2240,293],{"class":275},[265,2242,2056],{"class":296},[265,2244,2245,2247],{"class":267,"line":594},[265,2246,966],{"class":271},[265,2248,276],{"class":275},[265,2250,2251,2253,2255],{"class":267,"line":602},[265,2252,2001],{"class":271},[265,2254,293],{"class":275},[265,2256,2071],{"class":296},[11,2258,2259,2260,2263],{},"And the ",[15,2261,2262],{},"alertmanager\u002Falertmanager.yml"," pointing to a Slack or Discord webhook:",[241,2265,2267],{"className":259,"code":2266,"language":261,"meta":249,"style":249},"route:\n  group_by: ['alertname', 'severity']\n  group_wait: 30s\n  group_interval: 5m\n  repeat_interval: 4h\n  receiver: 'slack-default'\n  routes:\n    - match:\n        severity: critical\n      receiver: 'slack-critical'\n      repeat_interval: 1h\n\nreceivers:\n  - name: 'slack-default'\n    slack_configs:\n      - api_url: 'https:\u002F\u002Fhooks.slack.com\u002Fservices\u002FYOUR\u002FWEBHOOK\u002FHERE'\n        channel: '#alerts'\n        send_resolved: true\n\n  - name: 'slack-critical'\n    slack_configs:\n      - api_url: 'https:\u002F\u002Fhooks.slack.com\u002Fservices\u002FYOUR\u002FWEBHOOK\u002FHERE'\n        channel: '#alerts-critical'\n        send_resolved: true\n",[15,2268,2269,2276,2293,2302,2311,2321,2331,2338,2347,2356,2366,2376,2380,2387,2397,2404,2416,2426,2435,2439,2449,2455,2465,2474],{"__ignoreMap":249},[265,2270,2271,2274],{"class":267,"line":268},[265,2272,2273],{"class":271},"route",[265,2275,276],{"class":275},[265,2277,2278,2281,2283,2286,2288,2291],{"class":267,"line":279},[265,2279,2280],{"class":271},"  group_by",[265,2282,835],{"class":275},[265,2284,2285],{"class":296},"'alertname'",[265,2287,18],{"class":275},[265,2289,2290],{"class":296},"'severity'",[265,2292,841],{"class":275},[265,2294,2295,2298,2300],{"class":267,"line":287},[265,2296,2297],{"class":271},"  group_wait",[265,2299,293],{"class":275},[265,2301,1951],{"class":296},[265,2303,2304,2307,2309],{"class":267,"line":300},[265,2305,2306],{"class":271},"  group_interval",[265,2308,293],{"class":275},[265,2310,2104],{"class":296},[265,2312,2313,2316,2318],{"class":267,"line":308},[265,2314,2315],{"class":271},"  repeat_interval",[265,2317,293],{"class":275},[265,2319,2320],{"class":296},"4h\n",[265,2322,2323,2326,2328],{"class":267,"line":317},[265,2324,2325],{"class":271},"  receiver",[265,2327,293],{"class":275},[265,2329,2330],{"class":296},"'slack-default'\n",[265,2332,2333,2336],{"class":267,"line":325},[265,2334,2335],{"class":271},"  routes",[265,2337,276],{"class":275},[265,2339,2340,2342,2345],{"class":267,"line":333},[265,2341,819],{"class":275},[265,2343,2344],{"class":271},"match",[265,2346,276],{"class":275},[265,2348,2349,2352,2354],{"class":267,"line":341},[265,2350,2351],{"class":271},"        severity",[265,2353,293],{"class":275},[265,2355,2006],{"class":296},[265,2357,2358,2361,2363],{"class":267,"line":349},[265,2359,2360],{"class":271},"      receiver",[265,2362,293],{"class":275},[265,2364,2365],{"class":296},"'slack-critical'\n",[265,2367,2368,2371,2373],{"class":267,"line":361},[265,2369,2370],{"class":271},"      repeat_interval",[265,2372,293],{"class":275},[265,2374,2375],{"class":296},"1h\n",[265,2377,2378],{"class":267,"line":369},[265,2379,392],{"emptyLinePlaceholder":391},[265,2381,2382,2385],{"class":267,"line":377},[265,2383,2384],{"class":271},"receivers",[265,2386,276],{"class":275},[265,2388,2389,2391,2393,2395],{"class":267,"line":388},[265,2390,857],{"class":275},[265,2392,1775],{"class":271},[265,2394,293],{"class":275},[265,2396,2330],{"class":296},[265,2398,2399,2402],{"class":267,"line":395},[265,2400,2401],{"class":271},"    slack_configs",[265,2403,276],{"class":275},[265,2405,2406,2408,2411,2413],{"class":267,"line":403},[265,2407,311],{"class":275},[265,2409,2410],{"class":271},"api_url",[265,2412,293],{"class":275},[265,2414,2415],{"class":296},"'https:\u002F\u002Fhooks.slack.com\u002Fservices\u002FYOUR\u002FWEBHOOK\u002FHERE'\n",[265,2417,2418,2421,2423],{"class":267,"line":413},[265,2419,2420],{"class":271},"        channel",[265,2422,293],{"class":275},[265,2424,2425],{"class":296},"'#alerts'\n",[265,2427,2428,2431,2433],{"class":267,"line":420},[265,2429,2430],{"class":271},"        send_resolved",[265,2432,293],{"class":275},[265,2434,1502],{"class":694},[265,2436,2437],{"class":267,"line":428},[265,2438,392],{"emptyLinePlaceholder":391},[265,2440,2441,2443,2445,2447],{"class":267,"line":436},[265,2442,857],{"class":275},[265,2444,1775],{"class":271},[265,2446,293],{"class":275},[265,2448,2365],{"class":296},[265,2450,2451,2453],{"class":267,"line":444},[265,2452,2401],{"class":271},[265,2454,276],{"class":275},[265,2456,2457,2459,2461,2463],{"class":267,"line":452},[265,2458,311],{"class":275},[265,2460,2410],{"class":271},[265,2462,293],{"class":275},[265,2464,2415],{"class":296},[265,2466,2467,2469,2471],{"class":267,"line":460},[265,2468,2420],{"class":271},[265,2470,293],{"class":275},[265,2472,2473],{"class":296},"'#alerts-critical'\n",[265,2475,2476,2478,2480],{"class":267,"line":467},[265,2477,2430],{"class":271},[265,2479,293],{"class":275},[265,2481,1502],{"class":694},[11,2483,2484,2485,2488,2489,2492],{},"Two details that save sleep. The ",[15,2486,2487],{},"for: 10m"," on CPU prevents short spikes from becoming alerts — the server can hit 95% for 30 seconds and that be normal. The ",[15,2490,2491],{},"repeat_interval: 4h"," for warnings ensures that a warning resolved in one hour doesn't become 60 messages — Alertmanager groups.",[11,2494,2495,2496,2498,2499,2502,2503,2506],{},"Reload Prometheus (",[15,2497,1061],{},") and test by forcing an alert: ",[15,2500,2501],{},"stress --cpu 4 --timeout 700s"," on some server should trigger ",[15,2504,2505],{},"HighCPU"," in 10 minutes.",[30,2508,2510],{"id":2509},"step-8-how-to-put-reverse-proxy-and-tls-in-front","Step 8 — How to put reverse proxy and TLS in front?",[11,2512,171,2513,121],{},[38,2514,1732],{},[11,2516,2517,2518,2520],{},"To access Grafana via ",[15,2519,1738],{}," with valid certificate, you need something in front of port 3000. Two options:",[182,2522,2523,2533],{},[70,2524,2525,2528,2529,2532],{},[38,2526,2527],{},"Orchestrator's integrated router"," — if you already have the HeroCtl cluster running, just declare Grafana as a job with ",[15,2530,2531],{},"ingress: { host: monitor.yourdomain.com, tls: true }",". Automatic Let's Encrypt certificate, without additional tool.",[70,2534,2535,2538,2539],{},[38,2536,2537],{},"Caddy standalone"," on the observability VPS itself — also issues Let's Encrypt automatically. Minimum Caddyfile:",[241,2540,2543],{"className":2541,"code":2542,"language":246},[244],"monitor.yourdomain.com {\n  reverse_proxy localhost:3000\n  basicauth \u002Flogin {\n    admin \u003Cbcrypt_hash>\n  }\n}\n",[15,2544,2542],{"__ignoreMap":249},[11,2546,2547,2548,2551],{},"For defense in depth, keep Caddy\u002Frouter basic authentication in front of Grafana login — two barriers, not one. The second is especially important because the default Grafana login is ",[15,2549,2550],{},"admin\u002Fadmin"," and the first thing bots do on an exposed Grafana is try that combination.",[30,2553,2555],{"id":2554},"step-9-how-to-instrument-application-metrics","Step 9 — How to instrument application metrics?",[11,2557,171,2558,121],{},[38,2559,2560],{},"varies according to number of applications",[11,2562,2563],{},"System metrics are half the story. The other half is what your application is doing — how many requests per second, what the p99 latency is, how many errors, what the background job queue size is.",[11,2565,2566],{},"Each popular language has an official Prometheus client:",[67,2568,2569,2577,2585,2592],{},[70,2570,2571,293,2574],{},[38,2572,2573],{},"Node.js",[15,2575,2576],{},"prom-client",[70,2578,2579,293,2582],{},[38,2580,2581],{},"Python",[15,2583,2584],{},"prometheus-client",[70,2586,2587,293,2590],{},[38,2588,2589],{},"Ruby",[15,2591,2584],{},[70,2593,2594,293,2597],{},[38,2595,2596],{},"Go",[15,2598,2599],{},"github.com\u002Fprometheus\u002Fclient_golang",[11,2601,2602],{},"The minimum standard is three metrics per HTTP endpoint:",[67,2604,2605,2620,2626],{},[70,2606,2607,2610,2611,18,2614,18,2617,121],{},[15,2608,2609],{},"http_requests_total"," — counter, with labels ",[15,2612,2613],{},"method",[15,2615,2616],{},"path",[15,2618,2619],{},"status",[70,2621,2622,2625],{},[15,2623,2624],{},"http_request_duration_seconds"," — histogram, same label set.",[70,2627,2628,2631,2632,2635],{},[15,2629,2630],{},"app_errors_total"," — counter, with label ",[15,2633,2634],{},"kind"," (\"validation\", \"db\", \"external_api\", etc).",[11,2637,2638,2639,2641,2642,2644],{},"Expose all of that in ",[15,2640,151],{},". Add the endpoint in Prometheus's ",[15,2643,869],{},". In hours you have dashboards per endpoint, alerts per error rate, and the ability to answer \"what was happening at 3:14 yesterday\" with a graph instead of a guess.",[11,2646,2647,2648,2651,2652,2655],{},"Watch for ",[38,2649,2650],{},"cardinality",". Each unique combination of labels becomes a separate time series. If you put ",[15,2653,2654],{},"user_id"," as label, with 100k users you create 100k series — and Prometheus will consume 8+ GB of RAM just to index that. Practical rule: labels have values in small sets (status code: 5 values; method: 5 values; path: dozens). Unique identifiers go in logs, not in metrics.",[30,2657,2659],{"id":2658},"how-to-run-this-inside-heroctl-instead-of-dedicated-vps","How to run this inside HeroCtl instead of dedicated VPS?",[11,2661,2662],{},"For clusters already running the orchestrator, it makes sense to consider the stack as one more job. Trade-off: you save a VPS, but lose isolation (if the cluster dies, monitoring dies along).",[11,2664,2665],{},"The topology looks like this:",[67,2667,2668,2674,2680,2686],{},[70,2669,2670,2673],{},[38,2671,2672],{},"1 single job spec"," with 4 tasks: prometheus, grafana, loki, alertmanager.",[70,2675,2676,2679],{},[38,2677,2678],{},"Replicated volumes"," in the cluster — data survives node failure.",[70,2681,2682,2685],{},[38,2683,2684],{},"Integrated router"," does automatic TLS via subdomain. No need for additional Caddy.",[70,2687,2688,2691],{},[38,2689,2690],{},"Cluster's own metrics"," are already exposed in Prometheus format on the administrative API, so the scrape is direct.",[11,2693,2694],{},"For critical production, we recommend physical separation (dedicated VPS outside the cluster). For personal project, MVP, or small team where \"everything falls together\" is acceptable, running inside is cheaper and operationally simpler. The entire job spec sits around 80 lines of manifest.",[30,2696,2698],{"id":2697},"how-much-does-this-stack-cost-per-month-in-brazil","How much does this stack cost per month in Brazil?",[2700,2701,2702,2715],"table",{},[2703,2704,2705],"thead",{},[2706,2707,2708,2712],"tr",{},[2709,2710,2711],"th",{},"Item",[2709,2713,2714],{},"Monthly cost (BRL)",[2716,2717,2718,2727,2735,2743],"tbody",{},[2706,2719,2720,2724],{},[2721,2722,2723],"td",{},"Dedicated observability VPS (4 GB RAM)",[2721,2725,2726],{},"R$40 to R$80",[2706,2728,2729,2732],{},[2721,2730,2731],{},"Object storage for long log retention (optional)",[2721,2733,2734],{},"R$30",[2706,2736,2737,2740],{},[2721,2738,2739],{},"Maintenance time (2 to 4h × hour value)",[2721,2741,2742],{},"R$200 to R$400",[2706,2744,2745,2750],{},[2721,2746,2747],{},[38,2748,2749],{},"Total operational",[2721,2751,2752],{},[38,2753,2754],{},"R$300 to R$500",[11,2756,2757],{},"For comparison, a Datadog or New Relic subscription with equivalent coverage (5 hosts, 30-day log retention, alerts, dashboards) goes for around R$1,500 to R$2,000 per month — without counting the automatic overage that appears at month-end when someone forgets a verbose log on.",[11,2759,2760],{},"The difference isn't small: in a year, the open-source self-hosted stack saves between R$12,000 and R$18,000. For an early-stage startup, that's half a junior engineer.",[30,2762,2764],{"id":2763},"table-of-ports-resources-and-characteristics-per-component","Table of ports, resources and characteristics per component",[2700,2766,2767,2789],{},[2703,2768,2769],{},[2706,2770,2771,2774,2777,2780,2783,2786],{},[2709,2772,2773],{},"Component",[2709,2775,2776],{},"Port",[2709,2778,2779],{},"Minimum RAM",[2709,2781,2782],{},"Disk",[2709,2784,2785],{},"Default retention",[2709,2787,2788],{},"Data format",[2716,2790,2791,2810,2829,2847,2866,2882],{},[2706,2792,2793,2795,2798,2801,2804,2807],{},[2721,2794,74],{},[2721,2796,2797],{},"9090",[2721,2799,2800],{},"512 MB",[2721,2802,2803],{},"10 GB",[2721,2805,2806],{},"15 days",[2721,2808,2809],{},"binary TSDB",[2706,2811,2812,2814,2817,2820,2823,2826],{},[2721,2813,80],{},[2721,2815,2816],{},"3000",[2721,2818,2819],{},"256 MB",[2721,2821,2822],{},"1 GB",[2721,2824,2825],{},"N\u002FA",[2721,2827,2828],{},"SQLite or Postgres",[2706,2830,2831,2833,2836,2838,2841,2844],{},[2721,2832,86],{},[2721,2834,2835],{},"3100",[2721,2837,2800],{},[2721,2839,2840],{},"30 GB",[2721,2842,2843],{},"30 days (configurable)",[2721,2845,2846],{},"compressed chunks",[2706,2848,2849,2852,2855,2858,2861,2863],{},[2721,2850,2851],{},"Promtail \u002F Agent",[2721,2853,2854],{},"9080",[2721,2856,2857],{},"128 MB",[2721,2859,2860],{},"minimum",[2721,2862,2825],{},[2721,2864,2865],{},"passes by value",[2706,2867,2868,2870,2873,2875,2877,2879],{},[2721,2869,104],{},[2721,2871,2872],{},"9093",[2721,2874,2857],{},[2721,2876,2822],{},[2721,2878,2825],{},[2721,2880,2881],{},"notification log",[2706,2883,2884,2886,2889,2892,2894,2896],{},[2721,2885,98],{},[2721,2887,2888],{},"9100",[2721,2890,2891],{},"64 MB",[2721,2893,2860],{},[2721,2895,2825],{},[2721,2897,2898],{},"scrape endpoint",[11,2900,2901],{},"These are the viable minimums for small cluster. In production with 30 servers and real traffic, multiply RAM by 3 and disk by 5.",[30,2903,2905],{"id":2904},"the-four-errors-that-kill-a-new-monitoring-stack","The four errors that kill a new monitoring stack",[11,2907,2908],{},"Teams setting up observability for the first time stumble almost always on the same four errors. Knowing about them beforehand saves months.",[11,2910,2911,2914,2915,2918],{},[38,2912,2913],{},"Not monitoring monitoring."," Prometheus stopped scraping on Thursday; nobody saw it. On Wednesday of the following week a server actually went down and they discovered there was no alert because Prometheus was dead for 6 days. Solution: configure a simple external cron (even free Pingdom serves) that hits ",[15,2916,2917],{},"https:\u002F\u002Fmonitor.yourdomain.com\u002Fapi\u002Fhealth"," every 5 minutes and warns you when Grafana itself falls.",[11,2920,2921,2924,2925,2928],{},[38,2922,2923],{},"No retention strategy."," Disk fills up in three months, Prometheus stops recording, someone deletes everything in despair, loses 90 days of history. Configure ",[15,2926,2927],{},"--storage.tsdb.retention.time=30d"," from day one and establish a housekeeping job.",[11,2930,2931,2934,2935,18,2937,2940],{},[38,2932,2933],{},"High cardinality in labels."," We already covered in step 9, but worth repeating: each ",[15,2936,2654],{},[15,2938,2939],{},"request_id"," or UUID that becomes a label is a number that explosively multiplies Prometheus RAM consumption. Unique identifiers go to Loki, not to Prometheus.",[11,2942,2943,2946],{},[38,2944,2945],{},"Noisy alerts."," The team receives 200 alerts per day. In two weeks, nobody looks anymore. When the site actually crashes, the alert will be in the middle of 199 others. Solution: start with six alerts (those from step 7), audit every two weeks, and exclude everything that fired but didn't require human action. Alert without action is noise.",[30,2948,2950],{"id":2949},"faq","FAQ",[11,2952,2953,2956],{},[38,2954,2955],{},"Can I run everything on a 2 GB VPS?","\nTechnically yes, for a cluster of up to 3 servers and few applications. In practice you'll hit the RAM ceiling in 2 to 3 months, especially if you import dense Grafana dashboards. Pay 50 reais more and go straight to 4 GB VPS — the time you save not fighting OOM kills pays for itself.",[11,2958,2959,2962],{},[38,2960,2961],{},"How much disk for 30 days of logs?","\nDepends entirely on your application's log volume. Rough rule for small startup: cluster of 4 servers with normal web applications generates 1 to 5 GB of logs per day after Loki compression. Thirty days gives between 30 and 150 GB. Start with 50 GB SSD, monitor growth for two weeks, expand if necessary. If you go much beyond that, it's time to go to object storage.",[11,2964,2965,2968],{},[38,2966,2967],{},"Grafana Cloud vs self-hosted, which to choose?","\nGrafana Cloud free tier is generous (10k series, 50 GB of logs, 14-day retention) and eliminates the work of maintaining the server. For solo project or very small team, makes sense. From the moment you exceed the free tier, prices scale fast — from US$50\u002Fmonth — and you lose control over the data. Self-hosted costs hardware + time, Cloud costs money + lock-in. For a company that intends to grow and has a DevOps dev on the team, self-hosted wins.",[11,2970,2971,2974],{},[38,2972,2973],{},"Promtail or Grafana Agent?","\nIn 2026, Grafana Agent (renamed to Grafana Alloy) is officially replacing Promtail. For new setup, go straight to Alloy. For setup that has been running Promtail for a long time, no urgency to migrate — Promtail will continue working for years.",[11,2976,2977,2980,2981,2983],{},[38,2978,2979],{},"Where does OpenTelemetry fit in this stack?","\nOTel is the application instrumentation standard that's consolidating. Instead of using ",[15,2982,2576],{}," directly, you use OTel's SDK and it exports to Prometheus, Loki and Tempo simultaneously. The big advantage is portability — if you want to swap Prometheus for something else 3 years from now, your application doesn't change a line. For a startup starting today, we recommend OTel from day one.",[11,2985,2986,2989,2990,2993,2994,2997],{},[38,2987,2988],{},"How do I backup Prometheus?","\nPrometheus has snapshot via API: ",[15,2991,2992],{},"curl -X POST localhost:9090\u002Fapi\u002Fv1\u002Fadmin\u002Ftsdb\u002Fsnapshot"," creates a snapshot in the data directory. Do that once a day via cron, do ",[15,2995,2996],{},"tar.gz"," and send to object storage. In case of disaster, what you lose is metrics — and metrics, unlike logs, are typically recoverable in hours (start collecting again and dashboards return). Lost logs are lost forever, so invest more in Loki backup.",[11,2999,3000,3003],{},[38,3001,3002],{},"Tempo (distributed traces) worth installing now?","\nNo. Traces become useful from the moment you have 5+ services talking to each other and debugging latency involves following a request through several hops. For monolithic architecture or few services, traces give disproportionate work to the value. Add when complexity calls for it.",[11,3005,3006,3009],{},[38,3007,3008],{},"Does Loki index full-text like ELK?","\nNo, and that's the feature, not bug. Loki indexes only labels (job, host, container, severity) and log content stays compressed without index. To search text, you filter by labels first and then grep on the resulting chunks. That's what makes Loki ten times cheaper than ELK in RAM and CPU. In exchange, free-text queries across all history are slower. For 90% of debugging cases, filtering by job + host + time window already reduces to dozens of MB where grep flies.",[30,3011,3013],{"id":3012},"next-steps","Next steps",[11,3015,3016],{},"Brought up the stack, have dashboard, have alert, have searchable log? Good. The next three things worth investing in are, in order:",[182,3018,3019,3025,3039],{},[70,3020,3021,3024],{},[38,3022,3023],{},"Custom dashboards per application"," — business metrics (subscriptions created\u002Fhour, jobs processed, email queue) instead of just infrastructure.",[70,3026,3027,3030,3031,3034,3035,3038],{},[38,3028,3029],{},"Runbooks linked in alerts"," — every rule in ",[15,3032,3033],{},"alerts.yml"," should have ",[15,3036,3037],{},"annotations.runbook_url"," pointing to a page explaining what to do. When the alert fires at 3 AM, sleep doesn't think.",[70,3040,3041,3044],{},[38,3042,3043],{},"Monthly alert review"," — 30 minutes once a month auditing what fired in the previous month, deleting what became noise, adjusting thresholds.",[11,3046,3047,3048,3053,3054,121],{},"For those who want to go further and understand why we chose this stack instead of managed SaaS, read ",[3049,3050,3052],"a",{"href":3051},"\u002Fen\u002Fblog\u002Fobservability-without-datadog-startup-stack","Observability without Datadog: the Brazilian startup stack",". And to close the operations cycle — because there's no point knowing the database fell if you can't restore — it's worth reading ",[3049,3055,3057],{"href":3056},"\u002Fen\u002Fblog\u002Fdatabase-backup-strategies-cluster","Database backup in cluster: strategies for 3 AM",[11,3059,3060],{},"If you want to skip this entire setup and run the stack as a job inside an orchestrator that already takes care of TLS, rolling update deploy and volume replication:",[241,3062,3064],{"className":685,"code":3063,"language":687,"meta":249,"style":249},"curl -sSL get.heroctl.com\u002Finstall.sh | sh\n",[15,3065,3066],{"__ignoreMap":249},[265,3067,3068,3071,3074,3077,3080],{"class":267,"line":268},[265,3069,3070],{"class":703},"curl",[265,3072,3073],{"class":694}," -sSL",[265,3075,3076],{"class":296}," get.heroctl.com\u002Finstall.sh",[265,3078,3079],{"class":1105}," |",[265,3081,3082],{"class":703}," sh\n",[11,3084,3085],{},"Four hours become forty minutes. The rest is the same work of thinking about which alerts matter — and on that part, no one frees you.",[3087,3088,3089],"style",{},"html pre.shiki code .sPWt5, html code.shiki .sPWt5{--shiki-default:#7EE787}html pre.shiki code .sZEs4, html code.shiki .sZEs4{--shiki-default:#E6EDF3}html pre.shiki code .s9uIt, html code.shiki .s9uIt{--shiki-default:#A5D6FF}html pre.shiki code .sH3jZ, html code.shiki .sH3jZ{--shiki-default:#8B949E}html .default .shiki span {color: var(--shiki-default);background: var(--shiki-default-bg);font-style: var(--shiki-default-font-style);font-weight: var(--shiki-default-font-weight);text-decoration: var(--shiki-default-text-decoration);}html .shiki span {color: var(--shiki-default);background: var(--shiki-default-bg);font-style: var(--shiki-default-font-style);font-weight: var(--shiki-default-font-weight);text-decoration: var(--shiki-default-text-decoration);}html pre.shiki code .sFSAA, html code.shiki .sFSAA{--shiki-default:#79C0FF}html pre.shiki code .sQhOw, html code.shiki .sQhOw{--shiki-default:#FFA657}html pre.shiki code .suJrU, html code.shiki .suJrU{--shiki-default:#FF7B72}",{"title":249,"searchDepth":279,"depth":279,"links":3091},[3092,3093,3094,3095,3096,3097,3098,3099,3100,3101,3102,3103,3104,3105,3106,3107,3108,3109],{"id":32,"depth":279,"text":33},{"id":61,"depth":279,"text":62},{"id":124,"depth":279,"text":125},{"id":167,"depth":279,"text":168},{"id":226,"depth":279,"text":227},{"id":752,"depth":279,"text":753},{"id":1077,"depth":279,"text":1078},{"id":1246,"depth":279,"text":1247},{"id":1726,"depth":279,"text":1727},{"id":1900,"depth":279,"text":1901},{"id":2509,"depth":279,"text":2510},{"id":2554,"depth":279,"text":2555},{"id":2658,"depth":279,"text":2659},{"id":2697,"depth":279,"text":2698},{"id":2763,"depth":279,"text":2764},{"id":2904,"depth":279,"text":2905},{"id":2949,"depth":279,"text":2950},{"id":3012,"depth":279,"text":3013},"engineering",null,"2026-05-12","Honest tutorial to spin up metrics, logs and dashboards for your cluster — in 4 hours, without Datadog. Open-source stack that fits in 1 VPS at R$80\u002Fmonth.",false,"md",{},"\u002Fen\u002Fblog\u002Fmonitoring-stack-prometheus-grafana-loki","16 min",{"title":5,"description":3113},{"loc":3117},"en\u002Fblog\u002Fmonitoring-stack-prometheus-grafana-loki",[3123,3124,3125,3126,3127,3110],"prometheus","grafana","loki","monitoring","tutorial","qXuCsrBWk65Tau6l18D0_EwAL61sTr4A97-gZfDIzKs",[3130,3137],{"title":3131,"path":3132,"stem":3133,"description":3134,"date":3135,"category":3136,"children":-1},"Migrating from Kubernetes to a simpler stack: real case of complexity reduction","\u002Fen\u002Fblog\u002Fmigrating-from-kubernetes-to-simpler-stack","en\u002Fblog\u002Fmigrating-from-kubernetes-to-simpler-stack","When a company adopts K8s too early, everyone pays. The reverse path — leaving K8s for simpler orchestration — is viable and more common than it seems. What to validate before, during and after.","2026-03-18","case-study",{"title":3138,"path":3139,"stem":3140,"description":3141,"date":3142,"category":3110,"children":-1},"Multi-tenant SaaS with real isolation: 3 patterns and when each one becomes a nightmare","\u002Fen\u002Fblog\u002Fmulti-tenant-saas-real-isolation","en\u002Fblog\u002Fmulti-tenant-saas-real-isolation","Pool, schema-per-tenant, app-per-tenant. Each pattern has obvious benefits and invisible costs. How to decide before the first serious B2B customer asks 'is my data isolated?'.","2026-04-01",1777362213770]