[{"data":1,"prerenderedAt":12750},["ShallowReactive",2],{"home-recent-posts-en":3},[4,3394,4411,5380,6397,7509,8771,11784],{"id":5,"title":6,"author":7,"body":8,"category":3378,"cover":3379,"date":3380,"description":3381,"draft":3382,"extension":3383,"lastReviewed":3379,"meta":3384,"navigation":411,"path":3385,"readingTime":3386,"seo":3387,"sitemap":3388,"stem":3389,"tags":3390,"__hash__":3393},"blog_en\u002Fen\u002Fblog\u002Fzero-downtime-deploy-without-kubernetes.md","Zero-downtime deploy without Kubernetes: a practical tutorial in 2026","HeroCtl team",{"type":9,"value":10,"toc":3353},"minimark",[11,15,18,23,39,42,56,59,63,66,88,91,95,102,105,108,111,115,118,205,208,211,213,217,220,223,275,286,289,314,317,339,346,350,357,367,372,916,920,1000,1004,1080,1094,1097,1101,1104,1140,1143,1208,1211,1236,1239,1243,1246,1249,1252,1266,1273,1363,1369,1372,1434,1440,1457,1460,1464,1467,2346,2356,2382,2385,2443,2446,2450,2453,2456,2505,2508,2519,2524,2532,2535,2537,2541,2544,2606,2610,2617,2623,2637,2709,2712,2726,2730,2733,2754,2757,2761,2764,2782,2785,2789,2792,2940,2943,2946,2967,2970,2974,3167,3170,3174,3222,3226,3231,3238,3243,3246,3251,3254,3259,3265,3270,3273,3278,3284,3289,3292,3297,3304,3306,3310,3313,3316,3332,3346,3349],[12,13,14],"p",{},"There's a persistent myth that zero-downtime deploy is exclusive to those who ran Kubernetes in production. It isn't. The technique has existed since before the colossus had a name — any team that ran a pair of physical servers behind a load balancer last decade was already doing it, with fifty-line scripts and zero CRDs in their lives. What changed was the marketing around the practice, not the practice itself.",[12,16,17],{},"This post is a step-by-step tutorial to set up zero-downtime deploy from scratch, on two Linux machines, with no heavyweight orchestrator, no magic panel. 
At the end you'll have a bash script that swaps one instance at a time, waits for the new one to be healthy, and rolls to the next — exactly the algorithm large orchestrators implement, just without the boilerplate.",[19,20,22],"h2",{"id":21},"tldr","TL;DR",[12,24,25,26,30,31,34,35,38],{},"Zero-downtime deploy depends on three ingredients, not on a specific tool. First: ",[27,28,29],"strong",{},"two or more application instances running in parallel",", behind a basic proxy. Second: ",[27,32,33],{},"a reliable health check endpoint"," that validates real dependencies (database, cache, queue), not one that just returns 200 instantly. Third: ",[27,36,37],{},"a script or orchestrator that replaces one container at a time",", waiting for the new one to be healthy before moving on to the next.",[12,40,41],{},"This tutorial walks through the full setup on two Linux VPS with Docker, Caddy in front as proxy + load balancer, and a fifty-line bash script that does rolling update with active health check, minimum healthy time, and automatic rollback on failure. Result: deploy with no 5xx visible to the user, in less than a minute, with no maintenance window.",[12,43,44,47,48,51,52,55],{},[27,45,46],{},"Prerequisites:"," two Linux VPS with Docker (Hetzner CPX11 at R$30 each), domain with controllable DNS, app with a decent health check. ",[27,49,50],{},"Setup time:"," two to three hours. ",[27,53,54],{},"Monthly cost:"," R$60 (R$75 if you want a third VPS dedicated to the proxy). At the end we show the \"robust\" version via HeroCtl for those who want to stop scripting.",[57,58],"hr",{},[19,60,62],{"id":61},"the-three-ingredients-without-these-its-not-zero-downtime","The three ingredients (without these, it's not zero-downtime)",[12,64,65],{},"Before any command, it's worth nailing down the theory — because every more elaborate setup you'll see online is a variation on these three pieces.",[67,68,69,76,82],"ol",{},[70,71,72,75],"li",{},[27,73,74],{},"Multiple instances of the app running in parallel."," Minimum two. If you only have one, any restart is an error window. There's no working around it with a configuration trick.",[70,77,78,81],{},[27,79,80],{},"A proxy\u002Fload balancer in front, doing health checks."," The proxy decides which instance to send traffic to. If one goes down (or is deliberately taken out for the deploy), the proxy only sends to the remaining ones.",[70,83,84,87],{},[27,85,86],{},"A script that swaps instances one at a time."," Never all together. Wait for the new one to be healthy before touching the next. If the new one fails, halt the deploy and keep the old ones serving.",[12,89,90],{},"That's it. The rest — Kubernetes, modern panels, lightweight orchestrators — is wrapping around these three points.",[19,92,94],{"id":93},"why-single-server-is-never-zero-downtime-even-if-its-fast","Why single-server is NEVER zero-downtime (even if it's fast)",[12,96,97,98,101],{},"I see this question every week in the community Discord: \"can I do zero-downtime with a single server, if the deploy is fast enough?\". Short answer: ",[27,99,100],{},"no",".",[12,103,104],{},"On a single machine, the deploy cycle is: stop the old container, bring up the new one. Even if everything happens in three seconds, those three seconds exist. In-flight TCP connections are cut. Requests arriving in that interval get connection refused or 502. 
If you have five requests per second, that's fifteen failed requests per deploy.",[12,106,107],{},"There are clever variations — bring the new one up on a different port, switch the local proxy, drop the old one. That shrinks the window but doesn't eliminate it. If the app takes time to close in-flight connections, the cutover still produces errors. If the health check is weak, the proxy points traffic at an app that hasn't finished coming up. There's always a window.",[12,109,110],{},"The only reliable way to eliminate the window is to have at least one instance always available throughout the deploy. That requires two machines. Period.",[19,112,114],{"id":113},"the-minimum-setup-two-vps-a-proxy","The minimum setup (two VPS + a proxy)",[12,116,117],{},"The cheapest topology that delivers real zero-downtime:",[119,120,121,140],"table",{},[122,123,124],"thead",{},[125,126,127,131,134,137],"tr",{},[128,129,130],"th",{},"Component",[128,132,133],{},"Size",[128,135,136],{},"Cost",[128,138,139],{},"Function",[141,142,143,158,170,189],"tbody",{},[125,144,145,149,152,155],{},[146,147,148],"td",{},"VPS A",[146,150,151],{},"2 vCPU \u002F 2 GB RAM",[146,153,154],{},"R$30\u002Fmonth",[146,156,157],{},"App instance 1",[125,159,160,163,165,167],{},[146,161,162],{},"VPS B",[146,164,151],{},[146,166,154],{},[146,168,169],{},"App instance 2",[125,171,172,175,183,186],{},[146,173,174],{},"Proxy",[146,176,177,178,182],{},"running on VPS A ",[179,180,181],"em",{},"or"," third VPS",[146,184,185],{},"R$0 (shared) or R$15\u002Fmonth",[146,187,188],{},"Caddy\u002Fnginx doing the load balancing",[125,190,191,194,199,202],{},[146,192,193],{},"Database",[146,195,196,197,182],{},"managed Postgres ",[179,198,181],{},[146,200,201],{},"varies",[146,203,204],{},"Shared state between A and B",[12,206,207],{},"Sharing the proxy with one of the app VPS saves money but has a trade-off: if the VPS hosting the proxy goes down entirely, the site goes down with it (even with the other VPS running). 
For a small team this is acceptable. When you grow, the proxy migrates to a dedicated VPS or becomes a redundant pair.",[12,209,210],{},"Your domain's DNS A record points to the proxy IP. Apps on A and B connect to the same database — without that shared part, the two instances diverge and the user sees different results depending on which one answered.",[57,212],{},[19,214,216],{"id":215},"step-1-provision-two-vps-15-min","Step 1 — Provision two VPS (15 min)",[12,218,219],{},"I use Hetzner CPX11 (€4.75 ≈ R$30) as a reference. DigitalOcean Droplet at US$6, Vultr Cloud Compute at US$6, or Linode Nanode at US$5 deliver something similar. What matters is modern Linux (Ubuntu 24.04 LTS or Debian 12) with Docker.",[12,221,222],{},"Provision both machines with the same SSH key:",[224,225,230],"pre",{"className":226,"code":227,"language":228,"meta":229,"style":229},"language-bash shiki shiki-themes github-dark-default","# from your laptop\nssh-keygen -t ed25519 -f ~\u002F.ssh\u002Fdeploy_key -C \"deploy@meudominio.com\"\n# add ~\u002F.ssh\u002Fdeploy_key.pub on the provider console before creating the VPS\n","bash","",[231,232,233,242,269],"code",{"__ignoreMap":229},[234,235,238],"span",{"class":236,"line":237},"line",1,[234,239,241],{"class":240},"sH3jZ","# from your laptop\n",[234,243,245,249,253,257,260,263,266],{"class":236,"line":244},2,[234,246,248],{"class":247},"sQhOw","ssh-keygen",[234,250,252],{"class":251},"sFSAA"," -t",[234,254,256],{"class":255},"s9uIt"," ed25519",[234,258,259],{"class":251}," -f",[234,261,262],{"class":255}," ~\u002F.ssh\u002Fdeploy_key",[234,264,265],{"class":251}," -C",[234,267,268],{"class":255}," \"deploy@meudominio.com\"\n",[234,270,272],{"class":236,"line":271},3,[234,273,274],{"class":240},"# add ~\u002F.ssh\u002Fdeploy_key.pub on the provider console before creating the VPS\n",[12,276,277,278,281,282,285],{},"Create each VPS, note the IPs. 
I'll use ",[231,279,280],{},"203.0.113.10"," (VPS A) and ",[231,283,284],{},"203.0.113.20"," (VPS B) as placeholders for the rest of the post.",[12,287,288],{},"Install Docker on each:",[224,290,292],{"className":226,"code":291,"language":228,"meta":229,"style":229},"ssh root@203.0.113.10 \"curl -fsSL https:\u002F\u002Fget.docker.com | sh\"\nssh root@203.0.113.20 \"curl -fsSL https:\u002F\u002Fget.docker.com | sh\"\n",[231,293,294,305],{"__ignoreMap":229},[234,295,296,299,302],{"class":236,"line":237},[234,297,298],{"class":247},"ssh",[234,300,301],{"class":255}," root@203.0.113.10",[234,303,304],{"class":255}," \"curl -fsSL https:\u002F\u002Fget.docker.com | sh\"\n",[234,306,307,309,312],{"class":236,"line":244},[234,308,298],{"class":247},[234,310,311],{"class":255}," root@203.0.113.20",[234,313,304],{"class":255},[12,315,316],{},"Configure firewall to allow only 22 (SSH) and 8080 (internal port where the app will listen). HTTP\u002FHTTPS traffic only arrives at the proxy:",[224,318,320],{"className":226,"code":319,"language":228,"meta":229,"style":229},"ssh root@203.0.113.10 \"ufw allow 22 && ufw allow 8080\u002Ftcp && ufw --force enable\"\nssh root@203.0.113.20 \"ufw allow 22 && ufw allow 8080\u002Ftcp && ufw --force enable\"\n",[231,321,322,331],{"__ignoreMap":229},[234,323,324,326,328],{"class":236,"line":237},[234,325,298],{"class":247},[234,327,301],{"class":255},[234,329,330],{"class":255}," \"ufw allow 22 && ufw allow 8080\u002Ftcp && ufw --force enable\"\n",[234,332,333,335,337],{"class":236,"line":244},[234,334,298],{"class":247},[234,336,311],{"class":255},[234,338,330],{"class":255},[12,340,341,342,345],{},"Validation: ",[231,343,344],{},"docker run --rm hello-world"," on each machine should complete without errors.",[19,347,349],{"id":348},"step-2-app-with-a-decent-health-check-30-min","Step 2 — App with a decent health check (30 min)",[12,351,352,353,356],{},"The ",[231,354,355],{},"\u002Fhealthz"," endpoint is the heart of the scheme. 
If it returns 200 when the app isn't actually ready, the proxy sends traffic to a broken instance and the user sees an error. If it returns 500 when the app is healthy, the proxy takes the good instance out of balancing. In other words: the health check is the source of truth for the entire system.",[12,358,359,360,362,363,366],{},"Golden rule: ",[231,361,355],{}," validates ",[27,364,365],{},"real dependencies the app needs to respond",". Minimum: connection to the database. If you have a cache (Redis), include it. If you have a queue (SQS, RabbitMQ), include it. DON'T return 200 right at boot — wait for assets to compile, cache to warm, connections to open.",[368,369,371],"h3",{"id":370},"nodejs-express","Node.js (Express)",[224,373,377],{"className":374,"code":375,"language":376,"meta":229,"style":229},"language-js shiki shiki-themes github-dark-default","import express from \"express\"\nimport { Pool } from \"pg\"\n\nconst app = express()\nconst pool = new Pool({ connectionString: process.env.DATABASE_URL })\n\nlet ready = false\n\n\u002F\u002F async warm-up — only becomes ready once dependencies validate\n;(async () => {\n  await pool.query(\"SELECT 1\")\n  \u002F\u002F other initialization: cache prime, etc.\n  ready = true\n})()\n\napp.get(\"\u002Fhealthz\", async (_req, res) => {\n  if (!ready) return res.status(503).send(\"warming up\")\n  try {\n    await pool.query(\"SELECT 1\")\n    res.status(200).send(\"ok\")\n  } catch (e) {\n    res.status(503).send(\"db down\")\n  }\n})\n\napp.get(\"\u002F\", (_req, res) => res.send(\"Hello v1\"))\n\nconst server = app.listen(8080, () => console.log(\"listening 8080\"))\n\n\u002F\u002F graceful shutdown — drain connections before dying\nprocess.on(\"SIGTERM\", () => {\n  ready = false  \u002F\u002F health check starts failing immediately\n  setTimeout(() => {\n    server.close(() => process.exit(0))\n  }, 5000)  \u002F\u002F 5s for the proxy to notice and stop sending new traffic\n})\n","js",[231,378,379,395,407,413,432,457,462,477,482,488,506,527,533,544,550,555,592,633,641,657,681,693,715,721,727,732,769,774,813,818,824,844,857,870,896,911],{"__ignoreMap":229},[234,380,381,385,389,392],{"class":236,"line":237},[234,382,384],{"class":383},"suJrU","import",[234,386,388],{"class":387},"sZEs4"," express ",[234,390,391],{"class":383},"from",[234,393,394],{"class":255}," \"express\"\n",[234,396,397,399,402,404],{"class":236,"line":244},[234,398,384],{"class":383},[234,400,401],{"class":387}," { Pool } ",[234,403,391],{"class":383},[234,405,406],{"class":255}," \"pg\"\n",[234,408,409],{"class":236,"line":271},[234,410,412],{"emptyLinePlaceholder":411},true,"\n",[234,414,416,419,422,425,429],{"class":236,"line":415},4,[234,417,418],{"class":383},"const",[234,420,421],{"class":251}," app",[234,423,424],{"class":383}," =",[234,426,428],{"class":427},"sc3cj"," express",[234,430,431],{"class":387},"()\n",[234,433,435,437,440,442,445,448,451,454],{"class":236,"line":434},5,[234,436,418],{"class":383},[234,438,439],{"class":251}," pool",[234,441,424],{"class":383},[234,443,444],{"class":383}," new",[234,446,447],{"class":427}," Pool",[234,449,450],{"class":387},"({ connectionString: process.env.",[234,452,453],{"class":251},"DATABASE_URL",[234,455,456],{"class":387}," })\n",[234,458,460],{"class":236,"line":459},6,[234,461,412],{"emptyLinePlaceholder":411},[234,463,465,468,471,474],{"class":236,"line":464},7,[234,466,467],{"class":383},"let",[234,469,470],{"class":387}," ready ",[234,472,473],{"class":383},"=",[234,475,476],{"class":251}," false\n",[234,478,480],{"class":236,"line":479},8,[234,481,412],{"emptyLinePlaceholder":411},[234,483,485],{"class":236,"line":484},9,[234,486,487],{"class":240},"\u002F\u002F async warm-up — only becomes ready once dependencies validate\n",[234,489,491,494,497,500,503],{"class":236,"line":490},10,[234,492,493],{"class":387},";(",[234,495,496],{"class":383},"async",[234,498,499],{"class":387}," () ",[234,501,502],{"class":383},"=>",[234,504,505],{"class":387}," {\n",[234,507,509,512,515,518,521,524],{"class":236,"line":508},11,[234,510,511],{"class":383},"  await",[234,513,514],{"class":387}," pool.",[234,516,517],{"class":427},"query",[234,519,520],{"class":387},"(",[234,522,523],{"class":255},"\"SELECT 1\"",[234,525,526],{"class":387},")\n",[234,528,530],{"class":236,"line":529},12,[234,531,532],{"class":240},"  \u002F\u002F other initialization: cache prime, etc.\n",[234,534,536,539,541],{"class":236,"line":535},13,[234,537,538],{"class":387},"  ready ",[234,540,473],{"class":383},[234,542,543],{"class":251}," true\n",[234,545,547],{"class":236,"line":546},14,[234,548,549],{"class":387},"})()\n",[234,551,553],{"class":236,"line":552},15,[234,554,412],{"emptyLinePlaceholder":411},[234,556,558,561,564,566,569,572,574,577,580,582,585,588,590],{"class":236,"line":557},16,[234,559,560],{"class":387},"app.",[234,562,563],{"class":427},"get",[234,565,520],{"class":387},[234,567,568],{"class":255},"\"\u002Fhealthz\"",[234,570,571],{"class":387},", ",[234,573,496],{"class":383},[234,575,576],{"class":387}," (",[234,578,579],{"class":247},"_req",[234,581,571],{"class":387},[234,583,584],{"class":247},"res",[234,586,587],{"class":387},") ",[234,589,502],{"class":383},[234,591,505],{"class":387},[234,593,595,598,600,603,606,609,612,615,617,620,623,626,628,631],{"class":236,"line":594},17,[234,596,597],{"class":383},"  if",[234,599,576],{"class":387},[234,601,602],{"class":383},"!",[234,604,605],{"class":387},"ready) ",[234,607,608],{"class":383},"return",[234,610,611],{"class":387}," res.",[234,613,614],{"class":427},"status",[234,616,520],{"class":387},[234,618,619],{"class":251},"503",[234,621,622],{"class":387},").",[234,624,625],{"class":427},"send",[234,627,520],{"class":387},[234,629,630],{"class":255},"\"warming up\"",[234,632,526],{"class":387},[234,634,636,639],{"class":236,"line":635},18,[234,637,638],{"class":383},"  try",[234,640,505],{"class":387},[234,642,644,647,649,651,653,655],{"class":236,"line":643},19,[234,645,646],{"class":383},"    await",[234,648,514],{"class":387},[234,650,517],{"class":427},[234,652,520],{"class":387},[234,654,523],{"class":255},[234,656,526],{"class":387},[234,658,660,663,665,667,670,672,674,676,679],{"class":236,"line":659},20,[234,661,662],{"class":387},"    res.",[234,664,614],{"class":427},[234,666,520],{"class":387},[234,668,669],{"class":251},"200",[234,671,622],{"class":387},[234,673,625],{"class":427},[234,675,520],{"class":387},[234,677,678],{"class":255},"\"ok\"",[234,680,526],{"class":387},[234,682,684,687,690],{"class":236,"line":683},21,[234,685,686],{"class":387},"  } ",[234,688,689],{"class":383},"catch",[234,691,692],{"class":387}," (e) {\n",[234,694,696,698,700,702,704,706,708,710,713],{"class":236,"line":695},22,[234,697,662],{"class":387},[234,699,614],{"class":427},[234,701,520],{"class":387},[234,703,619],{"class":251},[234,705,622],{"class":387},[234,707,625],{"class":427},[234,709,520],{"class":387},[234,711,712],{"class":255},"\"db down\"",[234,714,526],{"class":387},[234,716,718],{"class":236,"line":717},23,[234,719,720],{"class":387},"  }\n",[234,722,724],{"class":236,"line":723},24,[234,725,726],{"class":387},"})\n",[234,728,730],{"class":236,"line":729},25,[234,731,412],{"emptyLinePlaceholder":411},[234,733,735,737,739,741,744,747,749,751,753,755,757,759,761,763,766],{"class":236,"line":734},26,[234,736,560],{"class":387},[234,738,563],{"class":427},[234,740,520],{"class":387},[234,742,743],{"class":255},"\"\u002F\"",[234,745,746],{"class":387},", (",[234,748,579],{"class":247},[234,750,571],{"class":387},[234,752,584],{"class":247},[234,754,587],{"class":387},[234,756,502],{"class":383},[234,758,611],{"class":387},[234,760,625],{"class":427},[234,762,520],{"class":387},[234,764,765],{"class":255},"\"Hello v1\"",[234,767,768],{"class":387},"))\n",[234,770,772],{"class":236,"line":771},27,[234,773,412],{"emptyLinePlaceholder":411},[234,775,777,779,782,784,787,790,792,795,798,800,803,806,808,811],{"class":236,"line":776},28,[234,778,418],{"class":383},[234,780,781],{"class":251}," server",[234,783,424],{"class":383},[234,785,786],{"class":387}," app.",[234,788,789],{"class":427},"listen",[234,791,520],{"class":387},[234,793,794],{"class":251},"8080",[234,796,797],{"class":387},", () ",[234,799,502],{"class":383},[234,801,802],{"class":387}," console.",[234,804,805],{"class":427},"log",[234,807,520],{"class":387},[234,809,810],{"class":255},"\"listening 8080\"",[234,812,768],{"class":387},[234,814,816],{"class":236,"line":815},29,[234,817,412],{"emptyLinePlaceholder":411},[234,819,821],{"class":236,"line":820},30,[234,822,823],{"class":240},"\u002F\u002F graceful shutdown — drain connections before dying\n",[234,825,827,830,833,835,838,840,842],{"class":236,"line":826},31,[234,828,829],{"class":387},"process.",[234,831,832],{"class":427},"on",[234,834,520],{"class":387},[234,836,837],{"class":255},"\"SIGTERM\"",[234,839,797],{"class":387},[234,841,502],{"class":383},[234,843,505],{"class":387},[234,845,847,849,851,854],{"class":236,"line":846},32,[234,848,538],{"class":387},[234,850,473],{"class":383},[234,852,853],{"class":251}," false",[234,855,856],{"class":240},"  \u002F\u002F health check starts failing immediately\n",[234,858,860,863,866,868],{"class":236,"line":859},33,[234,861,862],{"class":427},"  setTimeout",[234,864,865],{"class":387},"(() ",[234,867,502],{"class":383},[234,869,505],{"class":387},[234,871,873,876,879,881,883,886,889,891,894],{"class":236,"line":872},34,[234,874,875],{"class":387},"    server.",[234,877,878],{"class":427},"close",[234,880,865],{"class":387},[234,882,502],{"class":383},[234,884,885],{"class":387}," process.",[234,887,888],{"class":427},"exit",[234,890,520],{"class":387},[234,892,893],{"class":251},"0",[234,895,768],{"class":387},[234,897,899,902,905,908],{"class":236,"line":898},35,[234,900,901],{"class":387},"  }, ",[234,903,904],{"class":251},"5000",[234,906,907],{"class":387},")  ",[234,909,910],{"class":240},"\u002F\u002F 5s for the proxy to notice and stop sending new traffic\n",[234,912,914],{"class":236,"line":913},36,[234,915,726],{"class":387},[368,917,919],{"id":918},"python-django-gunicorn","Python (Django + gunicorn)",[224,921,925],{"className":922,"code":923,"language":924,"meta":229,"style":229},"language-python shiki shiki-themes github-dark-default","# health\u002Fviews.py\nfrom django.db import connection\nfrom django.http import JsonResponse, HttpResponse\nimport redis, os\n\n_r = redis.from_url(os.environ[\"REDIS_URL\"])\n\ndef healthz(request):\n    try:\n        with connection.cursor() as c:\n            c.execute(\"SELECT 1\")\n        _r.ping()\n        return HttpResponse(\"ok\", status=200)\n    except Exception as e:\n        return HttpResponse(f\"unhealthy: {e}\", status=503)\n","python",[231,926,927,932,937,942,947,951,956,960,965,970,975,980,985,990,995],{"__ignoreMap":229},[234,928,929],{"class":236,"line":237},[234,930,931],{},"# health\u002Fviews.py\n",[234,933,934],{"class":236,"line":244},[234,935,936],{},"from django.db import connection\n",[234,938,939],{"class":236,"line":271},[234,940,941],{},"from django.http import JsonResponse, HttpResponse\n",[234,943,944],{"class":236,"line":415},[234,945,946],{},"import redis, os\n",[234,948,949],{"class":236,"line":434},[234,950,412],{"emptyLinePlaceholder":411},[234,952,953],{"class":236,"line":459},[234,954,955],{},"_r = redis.from_url(os.environ[\"REDIS_URL\"])\n",[234,957,958],{"class":236,"line":464},[234,959,412],{"emptyLinePlaceholder":411},[234,961,962],{"class":236,"line":479},[234,963,964],{},"def healthz(request):\n",[234,966,967],{"class":236,"line":484},[234,968,969],{},"    try:\n",[234,971,972],{"class":236,"line":490},[234,973,974],{},"        with connection.cursor() as c:\n",[234,976,977],{"class":236,"line":508},[234,978,979],{},"            c.execute(\"SELECT 1\")\n",[234,981,982],{"class":236,"line":529},[234,983,984],{},"        _r.ping()\n",[234,986,987],{"class":236,"line":535},[234,988,989],{},"        return HttpResponse(\"ok\", status=200)\n",[234,991,992],{"class":236,"line":546},[234,993,994],{},"    except Exception as e:\n",[234,996,997],{"class":236,"line":552},[234,998,999],{},"        return HttpResponse(f\"unhealthy: {e}\", status=503)\n",[368,1001,1003],{"id":1002},"ruby-rails","Ruby (Rails)",[224,1005,1009],{"className":1006,"code":1007,"language":1008,"meta":229,"style":229},"language-ruby shiki shiki-themes github-dark-default","# config\u002Froutes.rb\nget \"\u002Fhealthz\", to: \"health#show\"\n\n# app\u002Fcontrollers\u002Fhealth_controller.rb\nclass HealthController \u003C ApplicationController\n  def show\n    ActiveRecord::Base.connection.execute(\"SELECT 1\")\n    Rails.cache.read(\"__healthcheck__\")\n    head :ok\n  rescue => e\n    Rails.logger.warn(\"healthcheck failed: #{e.message}\")\n    head :service_unavailable\n  end\nend\n","ruby",[231,1010,1011,1016,1021,1025,1030,1035,1040,1045,1050,1055,1060,1065,1070,1075],{"__ignoreMap":229},[234,1012,1013],{"class":236,"line":237},[234,1014,1015],{},"# config\u002Froutes.rb\n",[234,1017,1018],{"class":236,"line":244},[234,1019,1020],{},"get \"\u002Fhealthz\", to: \"health#show\"\n",[234,1022,1023],{"class":236,"line":271},[234,1024,412],{"emptyLinePlaceholder":411},[234,1026,1027],{"class":236,"line":415},[234,1028,1029],{},"# app\u002Fcontrollers\u002Fhealth_controller.rb\n",[234,1031,1032],{"class":236,"line":434},[234,1033,1034],{},"class HealthController \u003C ApplicationController\n",[234,1036,1037],{"class":236,"line":459},[234,1038,1039],{},"  def show\n",[234,1041,1042],{"class":236,"line":464},[234,1043,1044],{},"    ActiveRecord::Base.connection.execute(\"SELECT 1\")\n",[234,1046,1047],{"class":236,"line":479},[234,1048,1049],{},"    Rails.cache.read(\"__healthcheck__\")\n",[234,1051,1052],{"class":236,"line":484},[234,1053,1054],{},"    head :ok\n",[234,1056,1057],{"class":236,"line":490},[234,1058,1059],{},"  rescue => e\n",[234,1061,1062],{"class":236,"line":508},[234,1063,1064],{},"    Rails.logger.warn(\"healthcheck failed: #{e.message}\")\n",[234,1066,1067],{"class":236,"line":529},[234,1068,1069],{},"    head :service_unavailable\n",[234,1071,1072],{"class":236,"line":535},[234,1073,1074],{},"  end\n",[234,1076,1077],{"class":236,"line":546},[234,1078,1079],{},"end\n",[12,1081,1082,1083,1086,1087,1090,1091,1093],{},"The detail that separates an amateur from a professional health check is ",[27,1084,1085],{},"graceful shutdown",": on receiving ",[231,1088,1089],{},"SIGTERM",", the app starts returning 503 on ",[231,1092,355],{}," immediately, but keeps accepting in-flight connections for a few more seconds. The proxy notices the 503, stops sending new traffic, and when the app finally exits there's nobody left waiting for a response.",[12,1095,1096],{},"Without this, the cutover always leaks some errors even with everything else right.",[19,1098,1100],{"id":1099},"step-3-bring-up-two-docker-instances-15-min","Step 3 — Bring up two Docker instances (15 min)",[12,1102,1103],{},"Build your app into a Docker image. 
For the tutorial I'll use a generic image name that you'll replace with your own:",[224,1105,1107],{"className":226,"code":1106,"language":228,"meta":229,"style":229},"# on your laptop, push to a registry (Docker Hub, ECR, GHCR)\ndocker build -t meuusuario\u002Fmyapp:v1 .\ndocker push meuusuario\u002Fmyapp:v1\n",[231,1108,1109,1114,1130],{"__ignoreMap":229},[234,1110,1111],{"class":236,"line":237},[234,1112,1113],{"class":240},"# on your laptop, push to a registry (Docker Hub, ECR, GHCR)\n",[234,1115,1116,1119,1122,1124,1127],{"class":236,"line":244},[234,1117,1118],{"class":247},"docker",[234,1120,1121],{"class":255}," build",[234,1123,252],{"class":251},[234,1125,1126],{"class":255}," meuusuario\u002Fmyapp:v1",[234,1128,1129],{"class":255}," .\n",[234,1131,1132,1134,1137],{"class":236,"line":271},[234,1133,1118],{"class":247},[234,1135,1136],{"class":255}," push",[234,1138,1139],{"class":255}," meuusuario\u002Fmyapp:v1\n",[12,1141,1142],{},"Bring up the instance on VPS A:",[224,1144,1146],{"className":226,"code":1145,"language":228,"meta":229,"style":229},"ssh root@203.0.113.10 \"\n  docker pull meuusuario\u002Fmyapp:v1 &&\n  docker run -d --name app --restart=unless-stopped \\\n    -p 8080:8080 \\\n    -e DATABASE_URL='postgres:\u002F\u002Fuser:pass@db.example.com:5432\u002Fapp' \\\n    --health-cmd='curl -f http:\u002F\u002Flocalhost:8080\u002Fhealthz || exit 1' \\\n    --health-interval=5s --health-timeout=2s --health-retries=3 \\\n    meuusuario\u002Fmyapp:v1\n\"\n",[231,1147,1148,1157,1162,1170,1177,1184,1191,1198,1203],{"__ignoreMap":229},[234,1149,1150,1152,1154],{"class":236,"line":237},[234,1151,298],{"class":247},[234,1153,301],{"class":255},[234,1155,1156],{"class":255}," \"\n",[234,1158,1159],{"class":236,"line":244},[234,1160,1161],{"class":255},"  docker pull meuusuario\u002Fmyapp:v1 &&\n",[234,1163,1164,1167],{"class":236,"line":271},[234,1165,1166],{"class":255},"  docker run -d --name app --restart=unless-stopped ",[234,1168,1169],{"class":383},"\\\n",[234,1171,1172,1175],{"class":236,"line":415},[234,1173,1174],{"class":255},"    -p 8080:8080 ",[234,1176,1169],{"class":383},[234,1178,1179,1182],{"class":236,"line":434},[234,1180,1181],{"class":255},"    -e DATABASE_URL='postgres:\u002F\u002Fuser:pass@db.example.com:5432\u002Fapp' ",[234,1183,1169],{"class":383},[234,1185,1186,1189],{"class":236,"line":459},[234,1187,1188],{"class":255},"    --health-cmd='curl -f http:\u002F\u002Flocalhost:8080\u002Fhealthz || exit 1' ",[234,1190,1169],{"class":383},[234,1192,1193,1196],{"class":236,"line":464},[234,1194,1195],{"class":255},"    --health-interval=5s --health-timeout=2s --health-retries=3 ",[234,1197,1169],{"class":383},[234,1199,1200],{"class":236,"line":479},[234,1201,1202],{"class":255},"    meuusuario\u002Fmyapp:v1\n",[234,1204,1205],{"class":236,"line":484},[234,1206,1207],{"class":255},"\"\n",[12,1209,1210],{},"Repeat for VPS B, swapping the IP. Then validate:",[224,1212,1214],{"className":226,"code":1213,"language":228,"meta":229,"style":229},"curl http:\u002F\u002F203.0.113.10:8080\u002Fhealthz   # should return \"ok\"\ncurl http:\u002F\u002F203.0.113.20:8080\u002Fhealthz   # should return \"ok\"\n",[231,1215,1216,1227],{"__ignoreMap":229},[234,1217,1218,1221,1224],{"class":236,"line":237},[234,1219,1220],{"class":247},"curl",[234,1222,1223],{"class":255}," http:\u002F\u002F203.0.113.10:8080\u002Fhealthz",[234,1225,1226],{"class":240},"   # should return \"ok\"\n",[234,1228,1229,1231,1234],{"class":236,"line":244},[234,1230,1220],{"class":247},[234,1232,1233],{"class":255}," http:\u002F\u002F203.0.113.20:8080\u002Fhealthz",[234,1235,1226],{"class":240},[12,1237,1238],{},"If both return 200, the base is ready.",[19,1240,1242],{"id":1241},"step-4-caddy-as-reverse-proxy-load-balancer-30-min","Step 4 — Caddy as reverse proxy + load balancer (30 min)",[12,1244,1245],{},"Caddy is easier to start with than nginx because of built-in automatic TLS — Let's Encrypt works out of the 
box, no separate certbot-style client to configure. nginx is more flexible and has a larger ecosystem; Caddy is simpler for this case. For the tutorial I'll use Caddy.",[12,1247,1248],{},"I'll run Caddy on VPS A, sharing the machine with one of the app instances. If you prefer a dedicated third VPS, swap the IP where relevant.",[12,1250,1251],{},"First, open ports 80 and 443 on VPS A:",[224,1253,1255],{"className":226,"code":1254,"language":228,"meta":229,"style":229},"ssh root@203.0.113.10 \"ufw allow 80 && ufw allow 443\"\n",[231,1256,1257],{"__ignoreMap":229},[234,1258,1259,1261,1263],{"class":236,"line":237},[234,1260,298],{"class":247},[234,1262,301],{"class":255},[234,1264,1265],{"class":255}," \"ufw allow 80 && ufw allow 443\"\n",[12,1267,1268,1269,1272],{},"Create the ",[231,1270,1271],{},"Caddyfile",":",[224,1274,1278],{"className":1275,"code":1276,"language":1277,"meta":229,"style":229},"language-caddyfile shiki shiki-themes github-dark-default","meudominio.com {\n    reverse_proxy 203.0.113.10:8080 203.0.113.20:8080 {\n        lb_policy round_robin\n        health_uri \u002Fhealthz\n        health_interval 5s\n        health_timeout 2s\n        health_status 200\n\n        fail_duration 30s\n        max_fails 2\n        unhealthy_status 5xx\n\n        transport http {\n            dial_timeout 2s\n        }\n    }\n}\n","caddyfile",[231,1279,1280,1285,1290,1295,1300,1305,1310,1315,1319,1324,1329,1334,1338,1343,1348,1353,1358],{"__ignoreMap":229},[234,1281,1282],{"class":236,"line":237},[234,1283,1284],{},"meudominio.com {\n",[234,1286,1287],{"class":236,"line":244},[234,1288,1289],{},"    reverse_proxy 203.0.113.10:8080 203.0.113.20:8080 {\n",[234,1291,1292],{"class":236,"line":271},[234,1293,1294],{},"        lb_policy round_robin\n",[234,1296,1297],{"class":236,"line":415},[234,1298,1299],{},"        health_uri \u002Fhealthz\n",[234,1301,1302],{"class":236,"line":434},[234,1303,1304],{},"        health_interval 5s\n",[234,1306,1307],{"class":236,"line":459},[234,1308,1309],{},"        health_timeout 2s\n",[234,1311,1312],{"class":236,"line":464},[234,1313,1314],{},"        health_status 200\n",[234,1316,1317],{"class":236,"line":479},[234,1318,412],{"emptyLinePlaceholder":411},[234,1320,1321],{"class":236,"line":484},[234,1322,1323],{},"        fail_duration 30s\n",[234,1325,1326],{"class":236,"line":490},[234,1327,1328],{},"        max_fails 2\n",[234,1330,1331],{"class":236,"line":508},[234,1332,1333],{},"        unhealthy_status 5xx\n",[234,1335,1336],{"class":236,"line":529},[234,1337,412],{"emptyLinePlaceholder":411},[234,1339,1340],{"class":236,"line":535},[234,1341,1342],{},"        transport http {\n",[234,1344,1345],{"class":236,"line":546},[234,1346,1347],{},"            dial_timeout 2s\n",[234,1349,1350],{"class":236,"line":552},[234,1351,1352],{},"        }\n",[234,1354,1355],{"class":236,"line":557},[234,1356,1357],{},"    }\n",[234,1359,1360],{"class":236,"line":594},[234,1361,1362],{},"}\n",[12,1364,1365,1366,1368],{},"Fifteen lines. 
Everything that matters is there: round-robin between the two IPs, active health check every five seconds on ",[231,1367,355],{},", marks as unhealthy after two consecutive failures in 30s, two-second timeout to open a connection.",[12,1370,1371],{},"Bring up Caddy:",[224,1373,1375],{"className":226,"code":1374,"language":228,"meta":229,"style":229},"ssh root@203.0.113.10 \"\n  mkdir -p \u002Fetc\u002Fcaddy &&\n  docker run -d --name caddy --restart=unless-stopped \\\n    --network host \\\n    -v \u002Fetc\u002Fcaddy\u002FCaddyfile:\u002Fetc\u002Fcaddy\u002FCaddyfile \\\n    -v caddy_data:\u002Fdata \\\n    -v caddy_config:\u002Fconfig \\\n    caddy:2-alpine\n\"\n",[231,1376,1377,1385,1390,1397,1404,1411,1418,1425,1430],{"__ignoreMap":229},[234,1378,1379,1381,1383],{"class":236,"line":237},[234,1380,298],{"class":247},[234,1382,301],{"class":255},[234,1384,1156],{"class":255},[234,1386,1387],{"class":236,"line":244},[234,1388,1389],{"class":255},"  mkdir -p \u002Fetc\u002Fcaddy &&\n",[234,1391,1392,1395],{"class":236,"line":271},[234,1393,1394],{"class":255},"  docker run -d --name caddy --restart=unless-stopped ",[234,1396,1169],{"class":383},[234,1398,1399,1402],{"class":236,"line":415},[234,1400,1401],{"class":255},"    --network host ",[234,1403,1169],{"class":383},[234,1405,1406,1409],{"class":236,"line":434},[234,1407,1408],{"class":255},"    -v \u002Fetc\u002Fcaddy\u002FCaddyfile:\u002Fetc\u002Fcaddy\u002FCaddyfile ",[234,1410,1169],{"class":383},[234,1412,1413,1416],{"class":236,"line":459},[234,1414,1415],{"class":255},"    -v caddy_data:\u002Fdata ",[234,1417,1169],{"class":383},[234,1419,1420,1423],{"class":236,"line":464},[234,1421,1422],{"class":255},"    -v caddy_config:\u002Fconfig ",[234,1424,1169],{"class":383},[234,1426,1427],{"class":236,"line":479},[234,1428,1429],{"class":255},"    caddy:2-alpine\n",[234,1431,1432],{"class":236,"line":484},[234,1433,1207],{"class":255},[12,1435,1436,1437,1439],{},"Point your domain's DNS A to 
",[231,1438,280],{},". In a few minutes:",[224,1441,1443],{"className":226,"code":1442,"language":228,"meta":229,"style":229},"curl https:\u002F\u002Fmeudominio.com\u002F\n# should return \"Hello v1\" (alternating between the two instances)\n",[231,1444,1445,1452],{"__ignoreMap":229},[234,1446,1447,1449],{"class":236,"line":237},[234,1448,1220],{"class":247},[234,1450,1451],{"class":255}," https:\u002F\u002Fmeudominio.com\u002F\n",[234,1453,1454],{"class":236,"line":244},[234,1455,1456],{"class":240},"# should return \"Hello v1\" (alternating between the two instances)\n",[12,1458,1459],{},"Caddy issued a Let's Encrypt certificate automatically. This works because the domain resolves to the IP where Caddy is listening on port 80 (HTTP-01 challenge).",[19,1461,1463],{"id":1462},"step-5-bash-deploy-script-60-min","Step 5 — Bash deploy script (60 min)",[12,1465,1466],{},"This is the heart of the tutorial. A script that orchestrates rolling update between the two VPS:",[224,1468,1470],{"className":226,"code":1469,"language":228,"meta":229,"style":229},"#!\u002Fusr\u002Fbin\u002Fenv bash\n# deploy.sh — zero-downtime rolling deploy across two VPS\nset -euo pipefail\n\nIMAGE=\"${1:?Usage: .\u002Fdeploy.sh meuusuario\u002Fmyapp:v2}\"\nHOSTS=(\"203.0.113.10\" \"203.0.113.20\")\nHEALTH_DEADLINE=300   # max seconds waiting for health check\nMIN_HEALTHY_TIME=10   # seconds of sustained health before proceeding\nSSH_OPTS=\"-o StrictHostKeyChecking=no -o ConnectTimeout=5\"\n\ndeploy_host() {\n  local host=$1\n  local image=$2\n  echo \"==> [${host}] pulling ${image}\"\n  ssh ${SSH_OPTS} \"root@${host}\" \"docker pull ${image}\"\n\n  # save the old image in case of rollback\n  local old_image\n  old_image=$(ssh ${SSH_OPTS} \"root@${host}\" \"docker inspect app --format '{{.Config.Image}}' 2>\u002Fdev\u002Fnull || echo none\")\n  echo \"==> [${host}] current version: ${old_image}\"\n\n  echo \"==> [${host}] replacing container\"\n  ssh ${SSH_OPTS} \"root@${host}\" \"\n    docker 
stop app 2>\u002Fdev\u002Fnull || true\n    docker rm app 2>\u002Fdev\u002Fnull || true\n    docker run -d --name app --restart=unless-stopped \\\n      -p 8080:8080 \\\n      -e DATABASE_URL='${DATABASE_URL}' \\\n      --health-cmd='curl -f http:\u002F\u002Flocalhost:8080\u002Fhealthz || exit 1' \\\n      --health-interval=5s --health-timeout=2s --health-retries=3 \\\n      ${image}\n  \"\n\n  echo \"==> [${host}] waiting for health check (max ${HEALTH_DEADLINE}s)\"\n  local start=$(date +%s)\n  local healthy_since=0\n  while true; do\n    local now=$(date +%s)\n    if (( now - start > HEALTH_DEADLINE )); then\n      echo \"!!  [${host}] healthy_deadline exceeded — rolling back to ${old_image}\"\n      ssh ${SSH_OPTS} \"root@${host}\" \"\n        docker stop app && docker rm app &&\n        docker run -d --name app --restart=unless-stopped \\\n          -p 8080:8080 -e DATABASE_URL='${DATABASE_URL}' \\\n          ${old_image}\n      \"\n      return 1\n    fi\n\n    if curl -sf --max-time 2 \"http:\u002F\u002F${host}:8080\u002Fhealthz\" > \u002Fdev\u002Fnull; then\n      if (( healthy_since == 0 )); then\n        healthy_since=${now}\n        echo \"    [${host}] healthy — confirming for ${MIN_HEALTHY_TIME}s\"\n      elif (( now - healthy_since >= MIN_HEALTHY_TIME )); then\n        echo \"==> [${host}] sustained healthy — promoting\"\n        return 0\n      fi\n    else\n      healthy_since=0\n    fi\n    sleep 2\n  done\n}\n\necho \"### Deploy ${IMAGE} on ${#HOSTS[@]} hosts (rolling, max_parallel=1)\"\nfor host in \"${HOSTS[@]}\"; do\n  if ! deploy_host \"${host}\" \"${IMAGE}\"; then\n    echo \"### Deploy aborted on ${host}. 
Previous hosts left as they were.\"\n    exit 1\n  fi\ndone\necho \"### Deploy complete: all hosts on ${IMAGE}\"\n",[231,1471,1472,1477,1482,1493,1497,1550,1567,1580,1593,1603,1607,1615,1628,1640,1660,1683,1687,1692,1699,1724,1740,1744,1755,1769,1774,1779,1786,1793,1805,1812,1819,1828,1833,1837,1853,1872,1884,1899,1918,1942,1960,1976,1982,1990,2002,2012,2018,2027,2033,2038,2073,2093,2104,2123,2144,2156,2165,2171,2177,2187,2192,2201,2207,2212,2217,2245,2273,2300,2314,2322,2328,2334],{"__ignoreMap":229},[234,1473,1474],{"class":236,"line":237},[234,1475,1476],{"class":240},"#!\u002Fusr\u002Fbin\u002Fenv bash\n",[234,1478,1479],{"class":236,"line":244},[234,1480,1481],{"class":240},"# deploy.sh — zero-downtime rolling deploy across two VPS\n",[234,1483,1484,1487,1490],{"class":236,"line":271},[234,1485,1486],{"class":251},"set",[234,1488,1489],{"class":251}," -euo",[234,1491,1492],{"class":255}," pipefail\n",[234,1494,1495],{"class":236,"line":415},[234,1496,412],{"emptyLinePlaceholder":411},[234,1498,1499,1502,1504,1507,1510,1513,1516,1518,1521,1524,1527,1529,1532,1535,1537,1540,1542,1545,1548],{"class":236,"line":434},[234,1500,1501],{"class":387},"IMAGE",[234,1503,473],{"class":383},[234,1505,1506],{"class":255},"\"",[234,1508,1509],{"class":251},"${1",[234,1511,1512],{"class":383},":?",[234,1514,1515],{"class":387},"Usage",[234,1517,1272],{"class":383},[234,1519,1520],{"class":255}," .",[234,1522,1523],{"class":383},"\u002F",[234,1525,1526],{"class":387},"deploy",[234,1528,101],{"class":255},[234,1530,1531],{"class":387},"sh",[234,1533,1534],{"class":387}," 
meuusuario",[234,1536,1523],{"class":383},[234,1538,1539],{"class":387},"myapp",[234,1541,1272],{"class":383},[234,1543,1544],{"class":387},"v2",[234,1546,1547],{"class":251},"}",[234,1549,1207],{"class":255},[234,1551,1552,1555,1557,1559,1562,1565],{"class":236,"line":459},[234,1553,1554],{"class":387},"HOSTS",[234,1556,473],{"class":383},[234,1558,520],{"class":387},[234,1560,1561],{"class":255},"\"203.0.113.10\"",[234,1563,1564],{"class":255}," \"203.0.113.20\"",[234,1566,526],{"class":387},[234,1568,1569,1572,1574,1577],{"class":236,"line":464},[234,1570,1571],{"class":387},"HEALTH_DEADLINE",[234,1573,473],{"class":383},[234,1575,1576],{"class":255},"300",[234,1578,1579],{"class":240},"   # max seconds waiting for health check\n",[234,1581,1582,1585,1587,1590],{"class":236,"line":479},[234,1583,1584],{"class":387},"MIN_HEALTHY_TIME",[234,1586,473],{"class":383},[234,1588,1589],{"class":255},"10",[234,1591,1592],{"class":240},"   # seconds of sustained health before proceeding\n",[234,1594,1595,1598,1600],{"class":236,"line":484},[234,1596,1597],{"class":387},"SSH_OPTS",[234,1599,473],{"class":383},[234,1601,1602],{"class":255},"\"-o StrictHostKeyChecking=no -o ConnectTimeout=5\"\n",[234,1604,1605],{"class":236,"line":490},[234,1606,412],{"emptyLinePlaceholder":411},[234,1608,1609,1612],{"class":236,"line":508},[234,1610,1611],{"class":427},"deploy_host",[234,1613,1614],{"class":387},"() {\n",[234,1616,1617,1620,1623,1625],{"class":236,"line":529},[234,1618,1619],{"class":383},"  local",[234,1621,1622],{"class":387}," host",[234,1624,473],{"class":383},[234,1626,1627],{"class":247},"$1\n",[234,1629,1630,1632,1635,1637],{"class":236,"line":535},[234,1631,1619],{"class":383},[234,1633,1634],{"class":387}," image",[234,1636,473],{"class":383},[234,1638,1639],{"class":247},"$2\n",[234,1641,1642,1645,1648,1651,1654,1657],{"class":236,"line":546},[234,1643,1644],{"class":251},"  echo",[234,1646,1647],{"class":255}," \"==> 
[${",[234,1649,1650],{"class":387},"host",[234,1652,1653],{"class":255},"}] pulling ${",[234,1655,1656],{"class":387},"image",[234,1658,1659],{"class":255},"}\"\n",[234,1661,1662,1665,1668,1671,1673,1676,1679,1681],{"class":236,"line":552},[234,1663,1664],{"class":247},"  ssh",[234,1666,1667],{"class":387}," ${SSH_OPTS} ",[234,1669,1670],{"class":255},"\"root@${",[234,1672,1650],{"class":387},[234,1674,1675],{"class":255},"}\"",[234,1677,1678],{"class":255}," \"docker pull ${",[234,1680,1656],{"class":387},[234,1682,1659],{"class":255},[234,1684,1685],{"class":236,"line":557},[234,1686,412],{"emptyLinePlaceholder":411},[234,1688,1689],{"class":236,"line":594},[234,1690,1691],{"class":240},"  # save the old image in case of rollback\n",[234,1693,1694,1696],{"class":236,"line":635},[234,1695,1619],{"class":383},[234,1697,1698],{"class":387}," old_image\n",[234,1700,1701,1704,1706,1709,1711,1713,1715,1717,1719,1722],{"class":236,"line":643},[234,1702,1703],{"class":387},"  old_image",[234,1705,473],{"class":383},[234,1707,1708],{"class":387},"$(",[234,1710,298],{"class":247},[234,1712,1667],{"class":387},[234,1714,1670],{"class":255},[234,1716,1650],{"class":387},[234,1718,1675],{"class":255},[234,1720,1721],{"class":255}," \"docker inspect app --format '{{.Config.Image}}' 2>\u002Fdev\u002Fnull || echo none\"",[234,1723,526],{"class":387},[234,1725,1726,1728,1730,1732,1735,1738],{"class":236,"line":659},[234,1727,1644],{"class":251},[234,1729,1647],{"class":255},[234,1731,1650],{"class":387},[234,1733,1734],{"class":255},"}] current version: ${",[234,1736,1737],{"class":387},"old_image",[234,1739,1659],{"class":255},[234,1741,1742],{"class":236,"line":683},[234,1743,412],{"emptyLinePlaceholder":411},[234,1745,1746,1748,1750,1752],{"class":236,"line":695},[234,1747,1644],{"class":251},[234,1749,1647],{"class":255},[234,1751,1650],{"class":387},[234,1753,1754],{"class":255},"}] replacing 
container\"\n",[234,1756,1757,1759,1761,1763,1765,1767],{"class":236,"line":717},[234,1758,1664],{"class":247},[234,1760,1667],{"class":387},[234,1762,1670],{"class":255},[234,1764,1650],{"class":387},[234,1766,1675],{"class":255},[234,1768,1156],{"class":255},[234,1770,1771],{"class":236,"line":723},[234,1772,1773],{"class":255},"    docker stop app 2>\u002Fdev\u002Fnull || true\n",[234,1775,1776],{"class":236,"line":729},[234,1777,1778],{"class":255},"    docker rm app 2>\u002Fdev\u002Fnull || true\n",[234,1780,1781,1784],{"class":236,"line":734},[234,1782,1783],{"class":255},"    docker run -d --name app --restart=unless-stopped ",[234,1785,1169],{"class":383},[234,1787,1788,1791],{"class":236,"line":771},[234,1789,1790],{"class":255},"      -p 8080:8080 ",[234,1792,1169],{"class":383},[234,1794,1795,1798,1800,1803],{"class":236,"line":776},[234,1796,1797],{"class":255},"      -e DATABASE_URL='${",[234,1799,453],{"class":387},[234,1801,1802],{"class":255},"}' ",[234,1804,1169],{"class":383},[234,1806,1807,1810],{"class":236,"line":815},[234,1808,1809],{"class":255},"      --health-cmd='curl -f http:\u002F\u002Flocalhost:8080\u002Fhealthz || exit 1' ",[234,1811,1169],{"class":383},[234,1813,1814,1817],{"class":236,"line":820},[234,1815,1816],{"class":255},"      --health-interval=5s --health-timeout=2s --health-retries=3 ",[234,1818,1169],{"class":383},[234,1820,1821,1824,1826],{"class":236,"line":826},[234,1822,1823],{"class":255},"      ${",[234,1825,1656],{"class":387},[234,1827,1362],{"class":255},[234,1829,1830],{"class":236,"line":846},[234,1831,1832],{"class":255},"  \"\n",[234,1834,1835],{"class":236,"line":859},[234,1836,412],{"emptyLinePlaceholder":411},[234,1838,1839,1841,1843,1845,1848,1850],{"class":236,"line":872},[234,1840,1644],{"class":251},[234,1842,1647],{"class":255},[234,1844,1650],{"class":387},[234,1846,1847],{"class":255},"}] waiting for health check (max 
${",[234,1849,1571],{"class":387},[234,1851,1852],{"class":255},"}s)\"\n",[234,1854,1855,1857,1860,1862,1864,1867,1870],{"class":236,"line":898},[234,1856,1619],{"class":383},[234,1858,1859],{"class":387}," start",[234,1861,473],{"class":383},[234,1863,1708],{"class":387},[234,1865,1866],{"class":247},"date",[234,1868,1869],{"class":255}," +%s",[234,1871,526],{"class":387},[234,1873,1874,1876,1879,1881],{"class":236,"line":913},[234,1875,1619],{"class":383},[234,1877,1878],{"class":387}," healthy_since",[234,1880,473],{"class":383},[234,1882,1883],{"class":251},"0\n",[234,1885,1887,1890,1893,1896],{"class":236,"line":1886},37,[234,1888,1889],{"class":383},"  while",[234,1891,1892],{"class":251}," true",[234,1894,1895],{"class":387},"; ",[234,1897,1898],{"class":383},"do\n",[234,1900,1902,1905,1908,1910,1912,1914,1916],{"class":236,"line":1901},38,[234,1903,1904],{"class":383},"    local",[234,1906,1907],{"class":387}," now",[234,1909,473],{"class":383},[234,1911,1708],{"class":387},[234,1913,1866],{"class":247},[234,1915,1869],{"class":255},[234,1917,526],{"class":387},[234,1919,1921,1924,1927,1930,1933,1936,1939],{"class":236,"line":1920},39,[234,1922,1923],{"class":383},"    if",[234,1925,1926],{"class":387}," (( now ",[234,1928,1929],{"class":383},"-",[234,1931,1932],{"class":387}," start ",[234,1934,1935],{"class":383},">",[234,1937,1938],{"class":387}," HEALTH_DEADLINE )); ",[234,1940,1941],{"class":383},"then\n",[234,1943,1945,1948,1951,1953,1956,1958],{"class":236,"line":1944},40,[234,1946,1947],{"class":251},"      echo",[234,1949,1950],{"class":255}," \"!!  
[${",[234,1952,1650],{"class":387},[234,1954,1955],{"class":255},"}] healthy_deadline exceeded — rolling back to ${",[234,1957,1737],{"class":387},[234,1959,1659],{"class":255},[234,1961,1963,1966,1968,1970,1972,1974],{"class":236,"line":1962},41,[234,1964,1965],{"class":247},"      ssh",[234,1967,1667],{"class":387},[234,1969,1670],{"class":255},[234,1971,1650],{"class":387},[234,1973,1675],{"class":255},[234,1975,1156],{"class":255},[234,1977,1979],{"class":236,"line":1978},42,[234,1980,1981],{"class":255},"        docker stop app && docker rm app &&\n",[234,1983,1985,1988],{"class":236,"line":1984},43,[234,1986,1987],{"class":255},"        docker run -d --name app --restart=unless-stopped ",[234,1989,1169],{"class":383},[234,1991,1993,1996,1998,2000],{"class":236,"line":1992},44,[234,1994,1995],{"class":255},"          -p 8080:8080 -e DATABASE_URL='${",[234,1997,453],{"class":387},[234,1999,1802],{"class":255},[234,2001,1169],{"class":383},[234,2003,2005,2008,2010],{"class":236,"line":2004},45,[234,2006,2007],{"class":255},"          ${",[234,2009,1737],{"class":387},[234,2011,1362],{"class":255},[234,2013,2015],{"class":236,"line":2014},46,[234,2016,2017],{"class":255},"      \"\n",[234,2019,2021,2024],{"class":236,"line":2020},47,[234,2022,2023],{"class":383},"      return",[234,2025,2026],{"class":251}," 1\n",[234,2028,2030],{"class":236,"line":2029},48,[234,2031,2032],{"class":383},"    fi\n",[234,2034,2036],{"class":236,"line":2035},49,[234,2037,412],{"emptyLinePlaceholder":411},[234,2039,2041,2043,2046,2049,2052,2055,2058,2060,2063,2066,2069,2071],{"class":236,"line":2040},50,[234,2042,1923],{"class":383},[234,2044,2045],{"class":247}," curl",[234,2047,2048],{"class":251}," -sf",[234,2050,2051],{"class":251}," --max-time",[234,2053,2054],{"class":251}," 2",[234,2056,2057],{"class":255}," \"http:\u002F\u002F${",[234,2059,1650],{"class":387},[234,2061,2062],{"class":255},"}:8080\u002Fhealthz\"",[234,2064,2065],{"class":383}," 
>",[234,2067,2068],{"class":255}," \u002Fdev\u002Fnull",[234,2070,1895],{"class":387},[234,2072,1941],{"class":383},[234,2074,2076,2079,2082,2085,2088,2091],{"class":236,"line":2075},51,[234,2077,2078],{"class":383},"      if",[234,2080,2081],{"class":387}," (( healthy_since ",[234,2083,2084],{"class":383},"==",[234,2086,2087],{"class":251}," 0",[234,2089,2090],{"class":387}," )); ",[234,2092,1941],{"class":383},[234,2094,2096,2099,2101],{"class":236,"line":2095},52,[234,2097,2098],{"class":387},"        healthy_since",[234,2100,473],{"class":383},[234,2102,2103],{"class":387},"${now}\n",[234,2105,2107,2110,2113,2115,2118,2120],{"class":236,"line":2106},53,[234,2108,2109],{"class":251},"        echo",[234,2111,2112],{"class":255}," \"    [${",[234,2114,1650],{"class":387},[234,2116,2117],{"class":255},"}] healthy — confirming for ${",[234,2119,1584],{"class":387},[234,2121,2122],{"class":255},"}s\"\n",[234,2124,2126,2129,2131,2133,2136,2139,2142],{"class":236,"line":2125},54,[234,2127,2128],{"class":383},"      elif",[234,2130,1926],{"class":387},[234,2132,1929],{"class":383},[234,2134,2135],{"class":387}," healthy_since ",[234,2137,2138],{"class":383},">=",[234,2140,2141],{"class":387}," MIN_HEALTHY_TIME )); ",[234,2143,1941],{"class":383},[234,2145,2147,2149,2151,2153],{"class":236,"line":2146},55,[234,2148,2109],{"class":251},[234,2150,1647],{"class":255},[234,2152,1650],{"class":387},[234,2154,2155],{"class":255},"}] sustained healthy — promoting\"\n",[234,2157,2159,2162],{"class":236,"line":2158},56,[234,2160,2161],{"class":383},"        return",[234,2163,2164],{"class":251}," 0\n",[234,2166,2168],{"class":236,"line":2167},57,[234,2169,2170],{"class":383},"      fi\n",[234,2172,2174],{"class":236,"line":2173},58,[234,2175,2176],{"class":383},"    else\n",[234,2178,2180,2183,2185],{"class":236,"line":2179},59,[234,2181,2182],{"class":387},"      
healthy_since",[234,2184,473],{"class":383},[234,2186,1883],{"class":255},[234,2188,2190],{"class":236,"line":2189},60,[234,2191,2032],{"class":383},[234,2193,2195,2198],{"class":236,"line":2194},61,[234,2196,2197],{"class":247},"    sleep",[234,2199,2200],{"class":251}," 2\n",[234,2202,2204],{"class":236,"line":2203},62,[234,2205,2206],{"class":383},"  done\n",[234,2208,2210],{"class":236,"line":2209},63,[234,2211,1362],{"class":387},[234,2213,2215],{"class":236,"line":2214},64,[234,2216,412],{"emptyLinePlaceholder":411},[234,2218,2220,2223,2226,2228,2231,2234,2236,2239,2242],{"class":236,"line":2219},65,[234,2221,2222],{"class":251},"echo",[234,2224,2225],{"class":255}," \"### Deploy ${",[234,2227,1501],{"class":387},[234,2229,2230],{"class":255},"} on ${",[234,2232,2233],{"class":383},"#",[234,2235,1554],{"class":387},[234,2237,2238],{"class":255},"[",[234,2240,2241],{"class":383},"@",[234,2243,2244],{"class":255},"]} hosts (rolling, max_parallel=1)\"\n",[234,2246,2248,2251,2254,2257,2260,2262,2264,2266,2269,2271],{"class":236,"line":2247},66,[234,2249,2250],{"class":383},"for",[234,2252,2253],{"class":387}," host ",[234,2255,2256],{"class":383},"in",[234,2258,2259],{"class":255}," \"${",[234,2261,1554],{"class":387},[234,2263,2238],{"class":255},[234,2265,2241],{"class":383},[234,2267,2268],{"class":255},"]}\"",[234,2270,1895],{"class":387},[234,2272,1898],{"class":383},[234,2274,2276,2278,2281,2284,2286,2288,2290,2292,2294,2296,2298],{"class":236,"line":2275},67,[234,2277,597],{"class":383},[234,2279,2280],{"class":383}," !",[234,2282,2283],{"class":247}," deploy_host",[234,2285,2259],{"class":255},[234,2287,1650],{"class":387},[234,2289,1675],{"class":255},[234,2291,2259],{"class":255},[234,2293,1501],{"class":387},[234,2295,1675],{"class":255},[234,2297,1895],{"class":387},[234,2299,1941],{"class":383},[234,2301,2303,2306,2309,2311],{"class":236,"line":2302},68,[234,2304,2305],{"class":251},"    echo",[234,2307,2308],{"class":255}," \"### Deploy aborted on 
${",[234,2310,1650],{"class":387},[234,2312,2313],{"class":255},"}. Previous hosts left as they were.\"\n",[234,2315,2317,2320],{"class":236,"line":2316},69,[234,2318,2319],{"class":251},"    exit",[234,2321,2026],{"class":251},[234,2323,2325],{"class":236,"line":2324},70,[234,2326,2327],{"class":383},"  fi\n",[234,2329,2331],{"class":236,"line":2330},71,[234,2332,2333],{"class":383},"done\n",[234,2335,2337,2339,2342,2344],{"class":236,"line":2336},72,[234,2338,2222],{"class":251},[234,2340,2341],{"class":255}," \"### Deploy complete: all hosts on ${",[234,2343,1501],{"class":387},[234,2345,1659],{"class":255},[12,2347,2348,2349,571,2352,2355],{},"Save as ",[231,2350,2351],{},"deploy.sh",[231,2353,2354],{},"chmod +x",", and:",[224,2357,2359],{"className":226,"code":2358,"language":228,"meta":229,"style":229},"export DATABASE_URL='postgres:\u002F\u002Fuser:pass@db.example.com:5432\u002Fapp'\n.\u002Fdeploy.sh meuusuario\u002Fmyapp:v2\n",[231,2360,2361,2374],{"__ignoreMap":229},[234,2362,2363,2366,2369,2371],{"class":236,"line":237},[234,2364,2365],{"class":383},"export",[234,2367,2368],{"class":387}," DATABASE_URL",[234,2370,473],{"class":383},[234,2372,2373],{"class":255},"'postgres:\u002F\u002Fuser:pass@db.example.com:5432\u002Fapp'\n",[234,2375,2376,2379],{"class":236,"line":244},[234,2377,2378],{"class":247},".\u002Fdeploy.sh",[234,2380,2381],{"class":255}," meuusuario\u002Fmyapp:v2\n",[12,2383,2384],{},"The algorithm is literally what large orchestrators do internally:",[67,2386,2387,2393,2407,2413,2418,2424,2437],{},[70,2388,2389,2392],{},[27,2390,2391],{},"For each host, sequentially"," (max_parallel = 1)",[70,2394,2395,2398,2399,2402,2403,2406],{},[27,2396,2397],{},"Pull the new image"," before touching the container — that way the downtime between ",[231,2400,2401],{},"docker stop"," and ",[231,2404,2405],{},"docker run"," is minimal",[70,2408,2409,2412],{},[27,2410,2411],{},"Save reference to the old image"," for rollback if something goes 
wrong",[70,2414,2415],{},[27,2416,2417],{},"Replace the container",[70,2419,2420,2423],{},[27,2421,2422],{},"Loop waiting for health check"," with a five-minute deadline",[70,2425,2426,2429,2430,2432,2433,2436],{},[27,2427,2428],{},"Min healthy time of ten seconds",": only advances when ",[231,2431,355],{}," has returned 200 ",[179,2434,2435],{},"continuously"," for ten seconds (if it drops midway, the count restarts)",[70,2438,2439,2442],{},[27,2440,2441],{},"Automatic rollback"," if the deadline is exceeded",[12,2444,2445],{},"The numbers (max_parallel: 1, min_healthy_time: 10s, healthy_deadline: 300s) are exactly the defaults we use in HeroCtl. It's no coincidence — these are the values that survived years of trial and error. Too short a min healthy time mistakes a transient flash of \"healthy\" for real health and breaks; too long makes the deploy slow with no gain. Ten seconds is the point where noise disappears and the deploy still finishes quickly.",[19,2447,2449],{"id":2448},"step-6-validate-with-a-load-test-during-deploy-15-min","Step 6 — Validate with a load test during deploy (15 min)",[12,2451,2452],{},"This is the acid test: run sustained load and deploy at the same time. 
If any 5xx appears, some part of the scheme is broken.",[12,2454,2455],{},"On an external machine (your laptop or another VPS):",[224,2457,2459],{"className":226,"code":2458,"language":228,"meta":229,"style":229},"# install hey\ngo install github.com\u002Frakyll\u002Fhey@latest\n\n# 60s of sustained load, 5 concurrent connections\nhey -z 60s -c 5 https:\u002F\u002Fmeudominio.com\u002F\n",[231,2460,2461,2466,2477,2481,2486],{"__ignoreMap":229},[234,2462,2463],{"class":236,"line":237},[234,2464,2465],{"class":240},"# install hey\n",[234,2467,2468,2471,2474],{"class":236,"line":244},[234,2469,2470],{"class":247},"go",[234,2472,2473],{"class":255}," install",[234,2475,2476],{"class":255}," github.com\u002Frakyll\u002Fhey@latest\n",[234,2478,2479],{"class":236,"line":271},[234,2480,412],{"emptyLinePlaceholder":411},[234,2482,2483],{"class":236,"line":415},[234,2484,2485],{"class":240},"# 60s of sustained load, 5 concurrent connections\n",[234,2487,2488,2491,2494,2497,2500,2503],{"class":236,"line":434},[234,2489,2490],{"class":247},"hey",[234,2492,2493],{"class":251}," -z",[234,2495,2496],{"class":255}," 60s",[234,2498,2499],{"class":251}," -c",[234,2501,2502],{"class":251}," 5",[234,2504,1451],{"class":255},[12,2506,2507],{},"In another window, simultaneously:",[224,2509,2511],{"className":226,"code":2510,"language":228,"meta":229,"style":229},".\u002Fdeploy.sh meuusuario\u002Fmyapp:v2\n",[231,2512,2513],{"__ignoreMap":229},[234,2514,2515,2517],{"class":236,"line":237},[234,2516,2378],{"class":247},[234,2518,2381],{"class":255},[12,2520,2521,2522,1272],{},"At the end of ",[231,2523,2490],{},[224,2525,2530],{"className":2526,"code":2528,"language":2529},[2527],"language-text","Status code distribution:\n  [200] 1847 responses\n","text",[231,2531,2528],{"__ignoreMap":229},[12,2533,2534],{},"Only 200. If a 502 or 503 shows up, one of the three pieces is weak: health check returning 200 too early, missing graceful shutdown, or short min healthy time. 
Investigate and fix.",[57,2536],{},[19,2538,2540],{"id":2539},"the-six-details-that-separate-real-zero-downtime-from-approximation","The six details that separate real zero-downtime from approximation",[12,2542,2543],{},"We covered most of these throughout the tutorial, but worth consolidating — because a single one missing turns the whole scheme into \"mostly zero-downtime\", which is different.",[67,2545,2546,2555,2572,2588,2594,2600],{},[70,2547,2548,2551,2552,2554],{},[27,2549,2550],{},"Connection draining on SIGTERM."," When the container receives the stop signal, the app marks ",[231,2553,355],{}," as failing immediately, but keeps serving in-flight connections for a few seconds. Without it, connections that are open at the moment of the stop get cut.",[70,2556,2557,2560,2561,2564,2565,2568,2569,101],{},[27,2558,2559],{},"Pre-stop hook if you have an async worker."," Queues that process background jobs need an explicit pause before killing the process, or the running job is orphaned. In Sidekiq, it's ",[231,2562,2563],{},":quiet"," before ",[231,2566,2567],{},":term",". In Celery, it's the warm shutdown on ",[231,2570,2571],{},"TERM",[70,2573,2574,2577,2578,2581,2582,2584,2585,2587],{},[27,2575,2576],{},"Health check BEFORE promoting, not \"container running\"."," ",[231,2579,2580],{},"docker ps"," shows \"running\" milliseconds after ",[231,2583,2405],{},". It means nothing. Promote only after ",[231,2586,355],{}," returns 200 continuously.",[70,2589,2590,2593],{},[27,2591,2592],{},"Min healthy time of ten sustained seconds."," Don't trust seeing a single 200 and moving on — apps with irregular warm-up pass for a moment and fail again.",[70,2595,2596,2599],{},[27,2597,2598],{},"Previous version pre-pulled for fast rollback."," If you trusted \"keep the old image in Docker's cache\", at some point it's cleared by garbage collection and rollback gets slow. 
Keep the last three images explicitly.",[70,2601,2602,2605],{},[27,2603,2604],{},"Auto-revert when the healthy deadline is exceeded."," Without it, the deploy gets stuck in a partial state — half the hosts on v2, half on v1, with nobody to decide what to do.",[19,2607,2609],{"id":2608},"database-migrations-zero-downtime-the-part-that-breaks-experienced-peoples-deploys","Database migrations + zero-downtime (the part that breaks experienced people's deploys)",[12,2611,2612,2613,2616],{},"This is the topic I see senior developers get wrong most often. Rolling deploy assumes that ",[27,2614,2615],{},"both versions of the app run simultaneously in production for some period",". If v2 expects a schema incompatible with what v1 understands, one of the two breaks during the transition window.",[12,2618,2619,2620,101],{},"Non-negotiable golden rule: ",[27,2621,2622],{},"migrations are always backward-compatible",[12,2624,2625,2626,2629,2630,2633,2634,2636],{},"Classic case: you want to rename column ",[231,2627,2628],{},"email"," to ",[231,2631,2632],{},"email_address",". Wrong solution: run the rename migration right before the deploy. Result: during the rollout, v1 instances still write to ",[231,2635,2628],{}," (which no longer exists) and break. Right solution, in three deploys:",[119,2638,2639,2652],{},[122,2640,2641],{},[125,2642,2643,2646,2649],{},[128,2644,2645],{},"Deploy",[128,2647,2648],{},"Migration",[128,2650,2651],{},"Code v*",[141,2653,2654,2676,2694],{},[125,2655,2656,2659,2665],{},[146,2657,2658],{},"1",[146,2660,2661,2662,2664],{},"Add ",[231,2663,2632],{}," (nullable). No removal.",[146,2666,2667,2668,2670,2671,2673,2674,101],{},"App writes to ",[231,2669,2628],{}," AND to ",[231,2672,2632],{},"; reads from ",[231,2675,2628],{},[125,2677,2678,2681,2688],{},[146,2679,2680],{},"2",[146,2682,2683,2684,2687],{},"Backfill: ",[231,2685,2686],{},"UPDATE users SET email_address = email WHERE email_address IS NULL",". Then add the 
NOT NULL constraint.",[146,2689,2690,2691,2693],{},"App reads from ",[231,2692,2632],{},"; still writes to both.",[125,2695,2696,2699,2704],{},[146,2697,2698],{},"3",[146,2700,2701,2702,101],{},"Drop ",[231,2703,2628],{},[146,2705,2706,2707,101],{},"App only uses ",[231,2708,2632],{},[12,2710,2711],{},"Three deploys, weeks apart. It's tedious, but it's the way. Direct column drop always breaks. Direct type change always breaks. Adding NOT NULL without a default directly always breaks.",[12,2713,2714,2715,2402,2718,2721,2722,2725],{},"Tools that help: ",[231,2716,2717],{},"pg-osc",[231,2719,2720],{},"pgroll"," (Postgres), ",[231,2723,2724],{},"gh-ost"," (MySQL) — do online schema change without a long lock. For light migrations, the manual three-step way solves it.",[19,2727,2729],{"id":2728},"patterns-beyond-rolling","Patterns beyond rolling",[12,2731,2732],{},"Rolling is the default and most economical pattern. Others worth knowing:",[2734,2735,2736,2742,2748],"ul",{},[70,2737,2738,2741],{},[27,2739,2740],{},"Blue-green."," Two complete parallel environments — \"blue\" running v1, \"green\" provisioned with v2 empty. You bring up v2 entirely on green, validate, switch DNS (or load balancer cutover). Advantage: instant rollback (return DNS to blue). Disadvantage: costs double the resources during the deploy window.",[70,2743,2744,2747],{},[27,2745,2746],{},"Canary."," Send 5% of traffic to v2, observe metrics (errors, latency, conversion rate), decide whether to promote to 100% or abort. Detects subtle bugs that health check doesn't catch — like regression in checkout conversion. Requires a proxy with weighted routing and decent observability.",[70,2749,2750,2753],{},[27,2751,2752],{},"Rainbow \u002F N+1."," Generalization of blue-green with N coexisting versions. Useful when you want long-running A\u002FB tests between entire versions.",[12,2755,2756],{},"For the tutorial, rolling is what makes sense. 
The others are worth it when the traffic size justifies the extra investment.",[19,2758,2760],{"id":2759},"easy-version-coolify-or-dokploy","\"Easy\" version — Coolify or Dokploy",[12,2762,2763],{},"If you don't want to script, two modern panels do rolling deploy automatically:",[2734,2765,2766,2772],{},[70,2767,2768,2771],{},[27,2769,2770],{},"Coolify"," in multi-server mode does rolling with configurable health check. Multi-server was added in more recent versions — before it was single-server only. Worth checking the version.",[70,2773,2774,2777,2778,2781],{},[27,2775,2776],{},"Dokploy"," on top of Docker Swarm does rolling with ",[231,2779,2780],{},"--update-parallelism 1 --update-delay",". Leverages what Swarm already offers.",[12,2783,2784],{},"Trade-off: you swap the fifty-line script (where you understand everything that happens) for a panel (which is faster to set up, but becomes a black box when something goes wrong). For a small team where one person handles operations part-time, the panel wins. For a team where you need to understand exactly what happened at 3 a.m., the script wins.",[19,2786,2788],{"id":2787},"robust-version-heroctl","\"Robust\" version — HeroCtl",[12,2790,2791],{},"For those who want to stop scripting but don't want a black box, HeroCtl combines automatic rolling deploy with a replicated control plane. 
You describe the service in a configuration file and the orchestrator does the rest:",[224,2793,2797],{"className":2794,"code":2795,"language":2796,"meta":229,"style":229},"language-hcl shiki shiki-themes github-dark-default","job \"minhaapp\" {\n  group \"web\" {\n    count = 2\n\n    task \"app\" {\n      driver = \"docker\"\n      config {\n        image = \"meuusuario\u002Fmyapp:v2\"\n        ports = [\"http\"]\n      }\n\n      service {\n        port = \"http\"\n        check {\n          type     = \"http\"\n          path     = \"\u002Fhealthz\"\n          interval = \"5s\"\n          timeout  = \"2s\"\n        }\n      }\n    }\n\n    update {\n      max_parallel      = 1\n      min_healthy_time  = \"10s\"\n      healthy_deadline  = \"5m\"\n      auto_revert       = true\n    }\n  }\n}\n","hcl",[231,2798,2799,2804,2809,2814,2818,2823,2828,2833,2838,2843,2848,2852,2857,2862,2867,2872,2877,2882,2887,2891,2895,2899,2903,2908,2913,2918,2923,2928,2932,2936],{"__ignoreMap":229},[234,2800,2801],{"class":236,"line":237},[234,2802,2803],{},"job \"minhaapp\" {\n",[234,2805,2806],{"class":236,"line":244},[234,2807,2808],{},"  group \"web\" {\n",[234,2810,2811],{"class":236,"line":271},[234,2812,2813],{},"    count = 2\n",[234,2815,2816],{"class":236,"line":415},[234,2817,412],{"emptyLinePlaceholder":411},[234,2819,2820],{"class":236,"line":434},[234,2821,2822],{},"    task \"app\" {\n",[234,2824,2825],{"class":236,"line":459},[234,2826,2827],{},"      driver = \"docker\"\n",[234,2829,2830],{"class":236,"line":464},[234,2831,2832],{},"      config {\n",[234,2834,2835],{"class":236,"line":479},[234,2836,2837],{},"        image = \"meuusuario\u002Fmyapp:v2\"\n",[234,2839,2840],{"class":236,"line":484},[234,2841,2842],{},"        ports = [\"http\"]\n",[234,2844,2845],{"class":236,"line":490},[234,2846,2847],{},"      }\n",[234,2849,2850],{"class":236,"line":508},[234,2851,412],{"emptyLinePlaceholder":411},[234,2853,2854],{"class":236,"line":529},[234,2855,2856],{},"      
service {\n",[234,2858,2859],{"class":236,"line":535},[234,2860,2861],{},"        port = \"http\"\n",[234,2863,2864],{"class":236,"line":546},[234,2865,2866],{},"        check {\n",[234,2868,2869],{"class":236,"line":552},[234,2870,2871],{},"          type     = \"http\"\n",[234,2873,2874],{"class":236,"line":557},[234,2875,2876],{},"          path     = \"\u002Fhealthz\"\n",[234,2878,2879],{"class":236,"line":594},[234,2880,2881],{},"          interval = \"5s\"\n",[234,2883,2884],{"class":236,"line":635},[234,2885,2886],{},"          timeout  = \"2s\"\n",[234,2888,2889],{"class":236,"line":643},[234,2890,1352],{},[234,2892,2893],{"class":236,"line":659},[234,2894,2847],{},[234,2896,2897],{"class":236,"line":683},[234,2898,1357],{},[234,2900,2901],{"class":236,"line":695},[234,2902,412],{"emptyLinePlaceholder":411},[234,2904,2905],{"class":236,"line":717},[234,2906,2907],{},"    update {\n",[234,2909,2910],{"class":236,"line":723},[234,2911,2912],{},"      max_parallel      = 1\n",[234,2914,2915],{"class":236,"line":729},[234,2916,2917],{},"      min_healthy_time  = \"10s\"\n",[234,2919,2920],{"class":236,"line":734},[234,2921,2922],{},"      healthy_deadline  = \"5m\"\n",[234,2924,2925],{"class":236,"line":771},[234,2926,2927],{},"      auto_revert       = true\n",[234,2929,2930],{"class":236,"line":776},[234,2931,1357],{},[234,2933,2934],{"class":236,"line":815},[234,2935,720],{},[234,2937,2938],{"class":236,"line":820},[234,2939,1362],{},[12,2941,2942],{},"The same parameters as the bash script, but declarative. The difference is that the orchestrator coordinates rolling across N servers (not just two), does automatic leader election in around seven seconds if the current node goes down, and keeps the control plane distributed across the first three servers. 
The cluster survives the loss of any single server without human intervention.",[12,2944,2945],{},"Installation:",[224,2947,2949],{"className":226,"code":2948,"language":228,"meta":229,"style":229},"curl -sSL https:\u002F\u002Fget.heroctl.com\u002Finstall.sh | sh\n",[231,2950,2951],{"__ignoreMap":229},[234,2952,2953,2955,2958,2961,2964],{"class":236,"line":237},[234,2954,1220],{"class":247},[234,2956,2957],{"class":251}," -sSL",[234,2959,2960],{"class":255}," https:\u002F\u002Fget.heroctl.com\u002Finstall.sh",[234,2962,2963],{"class":383}," |",[234,2965,2966],{"class":247}," sh\n",[12,2968,2969],{},"The Community plan is permanently free — no server or job limit, with all the orchestration features described in the tutorial. The Business plan adds SSO\u002FSAML, granular RBAC, detailed auditing, and SLA-backed support, for teams that have formal platform requirements. The Enterprise plan adds source-code escrow, a continuity contract, and 24×7 support. Business and Enterprise prices are published on the plans page — no mandatory \"talk to sales\".",[19,2971,2973],{"id":2972},"comparison-five-paths-side-by-side","Comparison: five paths side by side",[119,2975,2976,3001],{},[122,2977,2978],{},[125,2979,2980,2983,2986,2989,2992,2995,2998],{},[128,2981,2982],{},"Criterion",[128,2984,2985],{},"Bash script (2 servers)",[128,2987,2988],{},"Coolify multi-server",[128,2990,2991],{},"Dokploy + Swarm",[128,2993,2994],{},"HeroCtl",[128,2996,2997],{},"Kamal",[128,2999,3000],{},"Kubernetes",[141,3002,3003,3025,3048,3070,3088,3105,3127,3147],{},[125,3004,3005,3008,3011,3014,3017,3020,3022],{},[146,3006,3007],{},"Setup time",[146,3009,3010],{},"2-3h",[146,3012,3013],{},"30 min",[146,3015,3016],{},"1h",[146,3018,3019],{},"5 min",[146,3021,3016],{},[146,3023,3024],{},"4h-4 days",[125,3026,3027,3030,3033,3036,3039,3042,3045],{},[146,3028,3029],{},"Lines of config",[146,3031,3032],{},"~50 
(script)",[146,3034,3035],{},"UI",[146,3037,3038],{},"~20",[146,3040,3041],{},"~50",[146,3043,3044],{},"~40",[146,3046,3047],{},"300+",[125,3049,3050,3053,3056,3059,3062,3065,3067],{},[146,3051,3052],{},"HA of the control plane",[146,3054,3055],{},"N\u002FA",[146,3057,3058],{},"No",[146,3060,3061],{},"Limited",[146,3063,3064],{},"Yes",[146,3066,3055],{},[146,3068,3069],{},"Yes (5+ components)",[125,3071,3072,3075,3078,3080,3082,3084,3086],{},[146,3073,3074],{},"Declarative health check",[146,3076,3077],{},"Manual",[146,3079,3064],{},[146,3081,3064],{},[146,3083,3064],{},[146,3085,3064],{},[146,3087,3064],{},[125,3089,3090,3092,3095,3097,3099,3101,3103],{},[146,3091,2441],{},[146,3093,3094],{},"Manual in script",[146,3096,3064],{},[146,3098,3064],{},[146,3100,3064],{},[146,3102,3064],{},[146,3104,3064],{},[125,3106,3107,3110,3113,3116,3119,3122,3124],{},[146,3108,3109],{},"Target scale",[146,3111,3112],{},"1-3 servers",[146,3114,3115],{},"1-10 servers",[146,3117,3118],{},"1-20 servers",[146,3120,3121],{},"1-500 servers",[146,3123,3115],{},[146,3125,3126],{},"50+ servers",[125,3128,3129,3132,3135,3137,3140,3143,3145],{},[146,3130,3131],{},"Black box?",[146,3133,3134],{},"No (you wrote it)",[146,3136,3064],{},[146,3138,3139],{},"Partial",[146,3141,3142],{},"No (short declarative)",[146,3144,3058],{},[146,3146,3064],{},[125,3148,3149,3152,3155,3157,3160,3162,3164],{},[146,3150,3151],{},"Learning curve",[146,3153,3154],{},"Low",[146,3156,3154],{},[146,3158,3159],{},"Medium",[146,3161,3154],{},[146,3163,3154],{},[146,3165,3166],{},"High",[12,3168,3169],{},"Each column has its niche. Bash script is unbeatable when you want to understand each line. Coolify wins when you just want a panel. HeroCtl wins when you need real HA without setting up an external control plane. 
Kubernetes wins at planetary scale — where the complexity pays off.",[19,3171,3173],{"id":3172},"the-five-most-common-errors","The five most common errors",[67,3175,3176,3188,3194,3204,3216],{},[70,3177,3178,3184,3185,3187],{},[27,3179,3180,3181,3183],{},"Health check on ",[231,3182,1523],{}," returning 200 without validating dependencies."," The app returns 200 before connecting to the database, the proxy promotes, and the user sees a 500 error on the first few requests. Solution: ",[231,3186,355],{}," validates database, cache, queue — anything the app needs to actually respond.",[70,3189,3190,3193],{},[27,3191,3192],{},"Min healthy time of 1 second."," Apps with irregular warm-up may return 200 at one moment and 503 right after (cache populating, class being lazy-loaded). The orchestrator promotes on the first good window, and the next request hits a bad state. Ten sustained seconds eliminate ninety percent of these cases.",[70,3195,3196,3199,3200,3203],{},[27,3197,3198],{},"No max_parallel (or max_parallel = N)."," If you swap all instances together, during the cutover window there's no healthy instance serving. It's single-server downtime in disguise. Always ",[231,3201,3202],{},"max_parallel = 1"," to start.",[70,3205,3206,3209,3210,3212,3213,3215],{},[27,3207,3208],{},"Mix of versions in production without schema compat."," v1 writes to ",[231,3211,2628],{},", v2 reads from ",[231,3214,2632],{},", and during the five-minute rolling window the two coexist — users hitting v2 don't see data v1 just wrote. Backward-compatible migration in three steps solves it.",[70,3217,3218,3221],{},[27,3219,3220],{},"Stale cache on the client (CDN, browser, service worker)."," The backend is already on v2 but the user still has the v1 JS cached, and the old JS calls an API that no longer exists. 
Solution: keep old endpoints for a window; API versioning; strong cache-busting on critical assets.",[19,3223,3225],{"id":3224},"faq","FAQ",[12,3227,3228],{},[27,3229,3230],{},"Can I do zero-downtime with a single server?",[12,3232,3233,3234,3237],{},"No. Every variation that promises this has a measurable error window when you measure with ",[231,3235,3236],{},"hey -c 20",". The only way to have real zero-downtime is to keep at least one instance always healthy throughout the deploy — which requires two machines minimum.",[12,3239,3240],{},[27,3241,3242],{},"Does DNS round-robin work as a load balancer?",[12,3244,3245],{},"It works as a basic load balancer, but not as a health check mechanism. DNS doesn't quickly remove a dead IP from rotation — TTL caching at ISPs and clients keeps the wrong IP in use for minutes or hours. For zero-downtime you need a real proxy (Caddy, nginx, HAProxy) that takes unhealthy instances out of balancing in seconds.",[12,3247,3248],{},[27,3249,3250],{},"Caddy or Traefik — which is better for this setup?",[12,3252,3253],{},"For two servers and a static setup, Caddy is simpler — a fifteen-line Caddyfile solves it. Traefik shines when you have dynamic service discovery (like Docker labels or Consul) and many backends changing all the time. nginx sits in the middle: more flexible, no built-in automatic TLS (needs external certbot). For this tutorial, Caddy.",[12,3255,3256],{},[27,3257,3258],{},"Do WebSocket connections survive during rolling?",[12,3260,3261,3262,3264],{},"Connections that are open on an instance being torn down get cut. The client has to reconnect. A good WebSocket library (Socket.IO, Phoenix Channels) reconnects automatically — the user sees a half-second blink in state. Connection draining helps: the instance marks ",[231,3263,355],{}," as failing, the proxy stops sending new connections, but existing ones continue until the pre-stop timer expires. 
Thirty seconds of drain are usually enough for long-lived connections to wind down naturally.",[12,3266,3267],{},[27,3268,3269],{},"Database migrations — what's the golden rule?",[12,3271,3272],{},"Every migration must be backward-compatible. Never drop a column directly. Never rename one directly. Never change a type directly. Instead, three deploys: add new structure, backfill, remove the old. Slow, yes. But rolling deploy depends on it in order not to break.",[12,3274,3275],{},[27,3276,3277],{},"Automatic rollback — how to implement?",[12,3279,3280,3281,101],{},"Two pieces: a deadline (max time waiting for health check) and a reference to the previous image, pre-pulled. If the deadline passes without becoming healthy, the script reinstalls the previous version. The example in Step 5 does exactly that. In declarative orchestrators, it becomes ",[231,3282,3283],{},"auto_revert = true",[12,3285,3286],{},[27,3287,3288],{},"Do sticky sessions complicate zero-downtime?",[12,3290,3291],{},"Yes. If the app stores session state in process memory, taking down the instance takes down the sessions of users connected to it. Solution: take the session out of memory — Redis, Postgres, or signed JWT. Then any instance serves any user, and a rolling deploy drops no sessions.",[12,3293,3294],{},[27,3295,3296],{},"How long does a complete deploy take?",[12,3298,3299,3300,3303],{},"Two servers, an app that comes up in fifteen seconds: about a minute. Breakdown: image pull (5-15s, depends on network and size), container replacement (1s), warm-up + health check (10-30s), 10s min healthy time, total around 30-50s per host, multiplied by two hosts in sequence = 1-2 min. Four servers take around 2-4 min. With fifty servers, the deploy starts taking ten or fifteen minutes — time to raise ",[231,3301,3302],{},"max_parallel"," to two or three (keeping rigorous health check).",[57,3305],{},[19,3307,3309],{"id":3308},"closing","Closing",[12,3311,3312],{},"Zero-downtime deploy is architecture, not a tool. 
The three ingredients — multiple instances, proxy with health check, controlled rolling — work with bash and Caddy as well as with a large orchestrator. The difference is in how much of the operation you want to write by hand and how much to delegate.",[12,3314,3315],{},"For a small SaaS, three VPS and a fifty-line script solve it indefinitely. When the cluster grows to dozens of servers or the team needs real HA on the control plane, it's worth stepping up to the declarative orchestrator:",[224,3317,3318],{"className":226,"code":2948,"language":228,"meta":229,"style":229},[231,3319,3320],{"__ignoreMap":229},[234,3321,3322,3324,3326,3328,3330],{"class":236,"line":237},[234,3323,1220],{"class":247},[234,3325,2957],{"class":251},[234,3327,2960],{"class":255},[234,3329,2963],{"class":383},[234,3331,2966],{"class":247},[12,3333,3334,3335,3340,3341,3345],{},"More on the rolling algorithm in ",[3336,3337,3339],"a",{"href":3338},"\u002Fen\u002Fblog\u002Fsafe-rolling-deploys-why-yours-might-not-be","Safe rolling deploy: why yours might not be",". 
For those leaving Compose for a multi-server setup, ",[3336,3342,3344],{"href":3343},"\u002Fen\u002Fblog\u002Fdocker-deploy-production-compose-to-cluster","Docker deploy in production: from compose to a cluster"," covers the intermediate path.",[12,3347,3348],{},"Container orchestration, without ceremony.",[3350,3351,3352],"style",{},"html pre.shiki code .sH3jZ, html code.shiki .sH3jZ{--shiki-default:#8B949E}html pre.shiki code .sQhOw, html code.shiki .sQhOw{--shiki-default:#FFA657}html pre.shiki code .sFSAA, html code.shiki .sFSAA{--shiki-default:#79C0FF}html pre.shiki code .s9uIt, html code.shiki .s9uIt{--shiki-default:#A5D6FF}html .default .shiki span {color: var(--shiki-default);background: var(--shiki-default-bg);font-style: var(--shiki-default-font-style);font-weight: var(--shiki-default-font-weight);text-decoration: var(--shiki-default-text-decoration);}html .shiki span {color: var(--shiki-default);background: var(--shiki-default-bg);font-style: var(--shiki-default-font-style);font-weight: var(--shiki-default-font-weight);text-decoration: var(--shiki-default-text-decoration);}html pre.shiki code .suJrU, html code.shiki .suJrU{--shiki-default:#FF7B72}html pre.shiki code .sZEs4, html code.shiki .sZEs4{--shiki-default:#E6EDF3}html pre.shiki code .sc3cj, html code.shiki 
.sc3cj{--shiki-default:#D2A8FF}",{"title":229,"searchDepth":244,"depth":244,"links":3354},[3355,3356,3357,3358,3359,3360,3365,3366,3367,3368,3369,3370,3371,3372,3373,3374,3375,3376,3377],{"id":21,"depth":244,"text":22},{"id":61,"depth":244,"text":62},{"id":93,"depth":244,"text":94},{"id":113,"depth":244,"text":114},{"id":215,"depth":244,"text":216},{"id":348,"depth":244,"text":349,"children":3361},[3362,3363,3364],{"id":370,"depth":271,"text":371},{"id":918,"depth":271,"text":919},{"id":1002,"depth":271,"text":1003},{"id":1099,"depth":244,"text":1100},{"id":1241,"depth":244,"text":1242},{"id":1462,"depth":244,"text":1463},{"id":2448,"depth":244,"text":2449},{"id":2539,"depth":244,"text":2540},{"id":2608,"depth":244,"text":2609},{"id":2728,"depth":244,"text":2729},{"id":2759,"depth":244,"text":2760},{"id":2787,"depth":244,"text":2788},{"id":2972,"depth":244,"text":2973},{"id":3172,"depth":244,"text":3173},{"id":3224,"depth":244,"text":3225},{"id":3308,"depth":244,"text":3309},"engineering",null,"2026-06-09","You don't need Kubernetes for zero-downtime deploys. 
Full tutorial with 2 servers, Caddy\u002FTraefik in front, and rolling update via script or lightweight orchestrator.",false,"md",{},"\u002Fen\u002Fblog\u002Fzero-downtime-deploy-without-kubernetes","15 min",{"title":6,"description":3381},{"loc":3385},"en\u002Fblog\u002Fzero-downtime-deploy-without-kubernetes",[1526,3391,3392,3378],"zero-downtime","tutorial","lwgFsUuWTJnDZ04WNV5qWhFTbnfIZk7sCm3BW82wMaY",{"id":3395,"title":3396,"author":7,"body":3397,"category":3378,"cover":3379,"date":4397,"description":4398,"draft":3382,"extension":3383,"lastReviewed":3379,"meta":4399,"navigation":411,"path":4400,"readingTime":4401,"seo":4402,"sitemap":4403,"stem":4404,"tags":4405,"__hash__":4410},"blog_en\u002Fen\u002Fblog\u002Fself-hosted-api-gateway-when-to-install.md","Self-hosted API gateway: when it's worth installing Kong, Traefik or similar",{"type":9,"value":3398,"toc":4377},[3399,3402,3405,3409,3416,3419,3426,3430,3433,3515,3522,3582,3585,3589,3592,3596,3599,3605,3611,3617,3621,3624,3629,3634,3639,3643,3646,3651,3656,3661,3665,3668,3673,3678,3683,3687,3690,3695,3700,3705,3709,3712,3750,3753,3757,3760,3812,3815,3819,3822,3828,3831,3834,3838,4124,4131,4135,4138,4144,4150,4156,4162,4168,4171,4175,4178,4184,4190,4196,4202,4206,4209,4241,4245,4255,4261,4267,4277,4283,4289,4295,4309,4315,4319,4322,4325,4328,4331,4347,4361,4371,4374],[12,3400,3401],{},"\"API gateway\" is one of the most overloaded jargon categories in back-end architecture. The term became an umbrella for things a simple reverse proxy has done for twenty years — routing, terminating TLS, balancing between instances — mixed with things that genuinely require a dedicated component: per-client API key validation, per-user request limiting, request body transformation, aggregating multiple back-ends into a single response. The confusion sells a lot of product. 
It also pushes a lot of startups to install a critical component they didn't need — paying later in latency, RAM, operational complexity and failure surface.",[12,3403,3404],{},"This post separates what each thing covers, lists the five main players with honest resource consumption numbers, and draws a practical yardstick: when the reverse proxy embedded in the orchestrator is enough, when it's worth bringing up a standalone Traefik, and when you actually need a Kong with authentication plug-ins. The audience is the tech lead looking at the current stack and trying to decide whether the next pain deserves another component on the critical path — or whether the pain is fake.",[19,3406,3408],{"id":3407},"tldr-installing-a-dedicated-gateway-is-an-expensive-decision-keep-the-ruler-short","TL;DR — installing a dedicated gateway is an expensive decision, keep the yardstick short",[12,3410,3411,3412,3415],{},"A simple reverse proxy covers the core of the problem: HTTPS terminated, automatic Let's Encrypt certificates, routing by host and path, balancing between back-ends, health check, compression. For a typical B2B SaaS with web app + a few HTTP microservices, ",[27,3413,3414],{},"that's enough",". No need to install Kong, no need for Tyk, no need for KrakenD.",[12,3417,3418],{},"A dedicated API gateway becomes a defensible investment when three signs appear simultaneously: you publish an API for third parties to consume (not just your own web\u002Fmobile), you need request limits per client key (not per IP), and you want interactive documentation with try-it-here for consumers. In that scenario, Kong, Tyk or Traefik with rich middlewares pay their own cost. 
Outside that scenario, you are adding 100–300 MB of RAM on the critical path, 1 to 3 milliseconds of latency per request, and one more component that can go down in production — in exchange for features nobody will use.",[12,3420,3421,3422,3425],{},"The simplest yardstick we know: ",[27,3423,3424],{},"if your end client is a person opening a browser, a reverse proxy is enough. If your end client is a developer with an API key, consider the gateway."," Everything else is a variation on top of those two lines.",[19,3427,3429],{"id":3428},"what-a-simple-reverse-proxy-already-covers-and-whats-still-missing","What a simple reverse proxy already covers, and what's still missing",[12,3431,3432],{},"Before comparing gateways, it's worth establishing the floor. A decent reverse proxy — Caddy, nginx, or the integrated router of a modern orchestrator — delivers a lot for free. This list represents the state of the art in 2026, not the historical minimum:",[2734,3434,3435,3441,3447,3464,3470,3476,3482,3488,3494,3500],{},[70,3436,3437,3440],{},[27,3438,3439],{},"HTTP\u002FHTTPS termination with HTTP\u002F2 and HTTP\u002F3."," The proxy speaks any modern protocol with the client and speaks clean HTTP\u002F1.1 to the back-end if needed.",[70,3442,3443,3446],{},[27,3444,3445],{},"Automatic Let's Encrypt certificates."," Issuance, renewal at 60 days, error recovery. Today this is a commodity — any serious router does it.",[70,3448,3449,2577,3452,3455,3456,3459,3460,3463],{},[27,3450,3451],{},"Routing by host and path.",[231,3453,3454],{},"api.example.com"," goes to one back-end, ",[231,3457,3458],{},"app.example.com"," goes to another, ",[231,3461,3462],{},"\u002Fv1\u002Fusers"," goes to a third. Rules with prefix, regex and priority.",[70,3465,3466,3469],{},[27,3467,3468],{},"Balancing between instances."," Round-robin, least connections, IP hash. 
Enough to distribute load between replicas of the same service.",[70,3471,3472,3475],{},[27,3473,3474],{},"Active and passive health check."," Removes an unhealthy instance from the pool and re-adds it when it recovers.",[70,3477,3478,3481],{},[27,3479,3480],{},"gzip and brotli compression."," Negotiates with the client, compresses what's worth compressing.",[70,3483,3484,3487],{},[27,3485,3486],{},"Static content cache."," For immutable files, avoids hitting the back-end.",[70,3489,3490,3493],{},[27,3491,3492],{},"Basic per-IP limit."," Thirty requests per second per address, for example. Covers most silly abuse.",[70,3495,3496,3499],{},[27,3497,3498],{},"Timeouts and retries."," Fail fast, retry on an alternative back-end if applicable.",[70,3501,3502,2577,3505,571,3508,571,3511,3514],{},[27,3503,3504],{},"Proxy headers.",[231,3506,3507],{},"X-Forwarded-For",[231,3509,3510],{},"X-Real-IP",[231,3512,3513],{},"X-Forwarded-Proto",". The back-end sees the real client.",[12,3516,3517,3518,3521],{},"That's a lot. For 80% of B2B SaaS web applications, it's all you need on the entry path. What a simple reverse proxy ",[27,3519,3520],{},"does not"," cover is what differentiates a gateway:",[2734,3523,3524,3530,3540,3546,3552,3558,3564,3570,3576],{},[70,3525,3526,3529],{},[27,3527,3528],{},"Per-client API key validation."," Each consumer gets a key, the gateway validates, identifies the client, and uses that identity for limits and auditing.",[70,3531,3532,3535,3536,3539],{},[27,3533,3534],{},"JWT token validation with rotatable keys."," The gateway downloads public keys from the issuer, validates signature and expiry, exposes ",[231,3537,3538],{},"claims"," to the back-end.",[70,3541,3542,3545],{},[27,3543,3544],{},"Request limits per key\u002Fuser\u002Froute."," Client A can make 1,000 calls\u002Fhour; client B, 100. Per route, per day, with sliding window. 
Hard to do in a simple proxy.",[70,3547,3548,3551],{},[27,3549,3550],{},"Request and response transformation."," Add\u002Fremove headers, rewrite JSON body, translate between API versions.",[70,3553,3554,3557],{},[27,3555,3556],{},"Versioning by header or path."," Coexist with v1 clients while v2 gains traction. Deprecation policy.",[70,3559,3560,3563],{},[27,3561,3562],{},"Back-end aggregation."," Composite endpoint that calls three microservices and returns a unified response (back-end-for-frontend pattern).",[70,3565,3566,3569],{},[27,3567,3568],{},"Request schema validation."," Reject at the gateway what doesn't match the OpenAPI contract before touching the back-end.",[70,3571,3572,3575],{},[27,3573,3574],{},"Documentation portal with try-it-here."," Interactive page for developers to explore the API.",[70,3577,3578,3581],{},[27,3579,3580],{},"Granular metrics per API key."," Who called, how much, when, with what latency. Vital if the API is the product.",[12,3583,3584],{},"Each item on this second list is a feature that costs a lot to implement as application code spread across services. If you need most of it, a gateway pays off. If you need almost nothing — which is the common case in product SaaS — the gateway is dead weight.",[19,3586,3588],{"id":3587},"the-five-players-that-matter-in-2026","The five players that matter in 2026",[12,3590,3591],{},"The market has settled. There are five defensible choices for a self-hosted gateway, with reasonably distinct profiles. The RAM and latency numbers below are measured with default configuration and modest workload (a few dozen calls per second); heavy plug-ins or high volume change everything.",[368,3593,3595],{"id":3594},"kong-lua-based-on-top-of-openresty","Kong (Lua-based, on top of OpenResty)",[12,3597,3598],{},"The best-known name in the category. 
Kong started in 2015 and has the largest plug-in catalog in the space — OAuth authentication, JWT validation, transformation, logging to Elasticsearch, integration with external vaults, all pre-built. The open source version covers most cases; the paid one adds a more polished developer portal, fine-grained RBAC, and SLA support.",[12,3600,3601,3604],{},[27,3602,3603],{},"Resources:"," a realistic minimum of 200 MB of RAM per instance, plus the database if you don't use db-less mode. Added latency of 1 to 2 milliseconds per request on a simple call. Heavy plug-ins (schema validation against a large OpenAPI spec, complex JSON transformation) can double that.",[12,3606,3607,3610],{},[27,3608,3609],{},"When it makes sense:"," serious public API with multiple external consumers, need for catalog plug-ins, team willing to learn Lua if customization is needed. Payments company, communication platform, any business where the API is the product sold.",[12,3612,3613,3616],{},[27,3614,3615],{},"Gotcha:"," the mode with PostgreSQL puts the database on the critical path. With the database down, the gateway can't update its configuration. Use db-less mode (declarative configuration via file) whenever possible — eliminates that dependency.",[368,3618,3620],{"id":3619},"traefik-written-in-go-speaking-various-orchestrator-proxies","Traefik (written in Go, speaking various orchestrator proxies)",[12,3622,3623],{},"Known as a Kubernetes ingress controller, but has rich enough middlewares to cover many gateway cases. Per-client request limiting, basic JWT validation, header transformation, complex redirects, forward auth (delegating to an external service). The paid version adds commercial plug-ins and a more robust dashboard.",[12,3625,3626,3628],{},[27,3627,3603],{}," 50 to 100 MB of RAM, added latency of 0.5 to 1 millisecond. 
Automatic back-end discovery via container labels is the strong point — you don't write route configuration, it appears when the service comes up.",[12,3630,3631,3633],{},[27,3632,3609],{}," already using Traefik as the entry router and want to avoid adding one more component; need reasonable middlewares but not Kong's giant catalog; value declarative configuration by label rather than database.",[12,3635,3636,3638],{},[27,3637,3615],{}," some advanced patterns (call aggregation, full OpenAPI validation, interactive documentation portal) don't fit in Traefik. If you need that, the temptation to \"stretch\" Traefik via custom plug-ins leads to complexity that Kong would solve more cleanly.",[368,3640,3642],{"id":3641},"tyk-written-in-go-focus-on-developer-portal","Tyk (written in Go, focus on developer portal)",[12,3644,3645],{},"The open source version delivers far more than most — request limiting per key, key management, developer portal, all in the free plan. The paid version adds multi-tenant dashboard, multi-region replication, and support.",[12,3647,3648,3650],{},[27,3649,3603],{}," 100 MB of RAM, added latency of 1 to 2 milliseconds. Database (Redis) is central to the architecture — request limits and counters live there.",[12,3652,3653,3655],{},[27,3654,3609],{}," API with many external consumers, developer portal is part of the product, you want to pay less than what Kong charges for the equivalent in resources. Small teams publishing API for partners have found a good fit here.",[12,3657,3658,3660],{},[27,3659,3615],{}," fewer ready-made plug-ins than Kong. If your expected integration exists in Kong's list but not in Tyk's, the trade-off changes.",[368,3662,3664],{"id":3663},"krakend-written-in-go-no-database-focus-on-aggregation","KrakenD (written in Go, no database, focus on aggregation)",[12,3666,3667],{},"KrakenD is the small gateway that specializes in aggregation. 
100% file configuration, no external state, designed to compose endpoints — the client makes one call, KrakenD calls three back-ends in parallel and returns a combined response. Great for the back-end-for-frontend pattern.",[12,3669,3670,3672],{},[27,3671,3603],{}," 50 MB of RAM, added latency of 0.5 milliseconds. The lightest in the category. No database, no panel — everything is a static configuration file.",[12,3674,3675,3677],{},[27,3676,3609],{}," you have multiple microservices and want to expose a cleaner API to the mobile\u002Fweb front-end. You don't need dynamic key management or a developer portal. You like immutable configuration: change the file, deploy, done.",[12,3679,3680,3682],{},[27,3681,3615],{}," everything is static. Adding a new key is a deploy. For a small team that's simplification; for an API platform with third parties self-registering, it becomes a bottleneck.",[368,3684,3686],{"id":3685},"envoy-gateway-cncf-on-top-of-envoy-proxy","Envoy Gateway (CNCF, on top of Envoy proxy)",[12,3688,3689],{},"The serious newcomer on the list. Envoy is the very-high-performance proxy used in large service meshes. Envoy Gateway is the project that packages Envoy as an API gateway with declarative configuration. Focus on Kubernetes, high throughput, mesh integration.",[12,3691,3692,3694],{},[27,3693,3603],{}," raw Envoy consumes 50 to 100 MB on the data proxy; the control plane weighs more. Low added latency (\u003C 1 millisecond) on a simple call. But operational complexity is the highest on the list.",[12,3696,3697,3699],{},[27,3698,3609],{}," you already run a service mesh with Envoy (Istio, Consul, Linkerd with compatible proxy) and want configuration consistency between mesh and gateway. You operate at high enough scale that Envoy throughput matters (tens of thousands of requests per second).",[12,3701,3702,3704],{},[27,3703,3615],{}," for a startup with 4 servers and a few dozen requests per second, Envoy Gateway is two or three sizes too big. 
The configuration complexity doesn't pay off.",[19,3706,3708],{"id":3707},"when-is-a-simple-reverse-proxy-enough","When is a simple reverse proxy enough?",[12,3710,3711],{},"This is the question that saves money. The honest answer is: in the vast majority of Brazilian B2B SaaS we see running, it is enough. The criteria for \"enough\":",[2734,3713,3714,3720,3726,3732,3738,3744],{},[70,3715,3716,3719],{},[27,3717,3718],{},"Audience for your API is your own application."," Web, mobile, internal integrations. There are no unknown third parties calling endpoints with keys you issued.",[70,3721,3722,3725],{},[27,3723,3724],{},"Authentication happens in the application, not on the path."," Cookie session, JWT token issued by the back-end itself and validated by application middleware, OAuth via library inside the code. The proxy doesn't need to see the user.",[70,3727,3728,3731],{},[27,3729,3730],{},"Request limit is \"avoid silly abuse\"."," Thirty per second per IP, perhaps. There is no commercial plan that limits Client A to 1,000 calls\u002Fday and Client B to 10,000.",[70,3733,3734,3737],{},[27,3735,3736],{},"You don't need to combine back-ends."," Each front-end call goes to one endpoint, that endpoint calls what it needs internally. No path-level aggregation.",[70,3739,3740,3743],{},[27,3741,3742],{},"API documentation is internal or non-existent."," No developer portal with try-it-here for third parties.",[70,3745,3746,3749],{},[27,3747,3748],{},"Versioning, if it exists, is managed in code."," The back-end routes internally between v1 and v2 when needed. No formal policy at the gateway.",[12,3751,3752],{},"If five of the six items above are true, installing a dedicated gateway is expensive for the real benefit. 
A reverse proxy embedded in the orchestrator, or standalone Caddy\u002Fnginx, covers everything.",[19,3754,3756],{"id":3755},"when-is-a-dedicated-gateway-worth-it","When is a dedicated gateway worth it?",[12,3758,3759],{},"The inversion of the previous list. A gateway pays off when some of these appear:",[2734,3761,3762,3768,3774,3780,3794,3800,3806],{},[70,3763,3764,3767],{},[27,3765,3766],{},"Public API is part of the product."," You charge (or plan to charge) per API usage. Third parties register, get a key, consume.",[70,3769,3770,3773],{},[27,3771,3772],{},"Limit per key\u002Fuser\u002Froute is a business rule."," Free plan has a ceiling, paid plan has a higher ceiling, enterprise plan is negotiated. That limit needs to live somewhere — gateway is the right place.",[70,3775,3776,3779],{},[27,3777,3778],{},"Multiple back-ends need to be combined into one response."," Back-end-for-frontend pattern, microservice aggregation, fan-out and fan-in. High costs in the application, modest costs in the gateway.",[70,3781,3782,3785,3786,3789,3790,3793],{},[27,3783,3784],{},"Formal API versioning."," You support v1 and v2 simultaneously, with announced deprecation date. ",[231,3787,3788],{},"Accept-Version"," header or ",[231,3791,3792],{},"\u002Fv2\u002F"," path. Legacy client can't break.",[70,3795,3796,3799],{},[27,3797,3798],{},"Complex authentication."," Validation of JWT issued by a third party, with public keys downloaded and cached, with automatic rotation. OAuth with multiple providers. Authentication by client certificate (mutual TLS) for inter-company integrations.",[70,3801,3802,3805],{},[27,3803,3804],{},"Developer portal with try-it-here."," Interactive documentation, self-service key management, usage panel for consumers.",[70,3807,3808,3811],{},[27,3809,3810],{},"Per-API-key metrics."," Who calls what, when, latency per consumer. 
Commercial dashboards, usage reports, per-client SLA.",[12,3813,3814],{},"Three or more of these criteria true, gateway is defensible. One or two, you can still solve it in other ways (authentication in the application, limit per app, structured metrics in the log).",[19,3816,3818],{"id":3817},"heroctl-integrated-router-where-it-sits-on-this-ruler","HeroCtl integrated router — where it sits on this ruler",[12,3820,3821],{},"The router embedded in HeroCtl doesn't try to be a gateway. It covers the well-done reverse proxy side: HTTPS terminated, automatic Let's Encrypt with renewal, routing by host and path, balancing between the replicas the orchestrator brought up, health check coordinated with the agent on each node, compression, proxy headers, basic per-IP limit, retry policy on back-end failure.",[12,3823,3824,3825,3827],{},"What the integrated router ",[27,3826,3520],{}," do: per-consumer API key validation, per-key\u002Fuser limit, body transformation, back-end aggregation, OpenAPI schema validation, developer portal. For 80% of cases where the end client is the company's own browser or mobile app, the embedded router covers the entire entry path — you don't install anything else in front.",[12,3829,3830],{},"For the 20% who need a dedicated gateway, the path is direct: install Kong, standalone Traefik, Tyk or KrakenD as another job in the cluster, behind the embedded router. The router terminates TLS at the edge, the gateway does the gateway work, the back-ends sit behind. Without ceremony, without circular dependency.",[12,3832,3833],{},"The HeroCtl control plane occupies between 200 and 400 MB per server — meaning an installed Kong adds practically the same weight as the entire control plane. 
Worth remembering the order of magnitude before \"just install\".",[19,3835,3837],{"id":3836},"comparison-table-12-criteria","Comparison table — 12 criteria",[119,3839,3840,3867],{},[122,3841,3842],{},[125,3843,3844,3846,3849,3852,3855,3858,3861,3864],{},[128,3845,2982],{},[128,3847,3848],{},"Simple reverse proxy (Caddy\u002Fnginx)",[128,3850,3851],{},"HeroCtl router",[128,3853,3854],{},"Standalone Traefik",[128,3856,3857],{},"KrakenD",[128,3859,3860],{},"Tyk OSS",[128,3862,3863],{},"Kong OSS",[128,3865,3866],{},"Envoy Gateway",[141,3868,3869,3895,3919,3940,3959,3978,3997,4017,4036,4057,4078,4099],{},[125,3870,3871,3874,3877,3880,3883,3886,3889,3892],{},[146,3872,3873],{},"Minimum RAM",[146,3875,3876],{},"20–50 MB",[146,3878,3879],{},"embedded in control plane",[146,3881,3882],{},"50–100 MB",[146,3884,3885],{},"~50 MB",[146,3887,3888],{},"~100 MB",[146,3890,3891],{},"~200 MB",[146,3893,3894],{},"~100 MB + control plane",[125,3896,3897,3900,3903,3905,3908,3911,3914,3916],{},[146,3898,3899],{},"Added latency",[146,3901,3902],{},"\u003C 0.5 ms",[146,3904,3902],{},[146,3906,3907],{},"0.5–1 ms",[146,3909,3910],{},"~0.5 ms",[146,3912,3913],{},"1–2 ms",[146,3915,3913],{},[146,3917,3918],{},"\u003C 1 ms (can grow)",[125,3920,3921,3924,3927,3929,3931,3934,3936,3938],{},[146,3922,3923],{},"Automatic certificates",[146,3925,3926],{},"Yes (native Caddy)",[146,3928,3064],{},[146,3930,3064],{},[146,3932,3933],{},"Not direct",[146,3935,3064],{},[146,3937,3064],{},[146,3939,3064],{},[125,3941,3942,3945,3947,3949,3951,3953,3955,3957],{},[146,3943,3944],{},"Host\u002Fpath routing",[146,3946,3064],{},[146,3948,3064],{},[146,3950,3064],{},[146,3952,3064],{},[146,3954,3064],{},[146,3956,3064],{},[146,3958,3064],{},[125,3960,3961,3964,3966,3968,3970,3972,3974,3976],{},[146,3962,3963],{},"Balancing + 
health",[146,3965,3064],{},[146,3967,3064],{},[146,3969,3064],{},[146,3971,3064],{},[146,3973,3064],{},[146,3975,3064],{},[146,3977,3064],{},[125,3979,3980,3983,3985,3987,3989,3991,3993,3995],{},[146,3981,3982],{},"Per-IP limit",[146,3984,3064],{},[146,3986,3064],{},[146,3988,3064],{},[146,3990,3064],{},[146,3992,3064],{},[146,3994,3064],{},[146,3996,3064],{},[125,3998,3999,4002,4004,4006,4009,4011,4013,4015],{},[146,4000,4001],{},"Per key\u002Fuser limit",[146,4003,3058],{},[146,4005,3058],{},[146,4007,4008],{},"Yes (with middleware)",[146,4010,3064],{},[146,4012,3064],{},[146,4014,3064],{},[146,4016,3064],{},[125,4018,4019,4022,4024,4026,4028,4030,4032,4034],{},[146,4020,4021],{},"JWT validation",[146,4023,3058],{},[146,4025,3058],{},[146,4027,3139],{},[146,4029,3064],{},[146,4031,3064],{},[146,4033,3064],{},[146,4035,3064],{},[125,4037,4038,4041,4043,4045,4047,4050,4052,4055],{},[146,4039,4040],{},"Back-end aggregation",[146,4042,3058],{},[146,4044,3058],{},[146,4046,3058],{},[146,4048,4049],{},"Yes (focus)",[146,4051,3139],{},[146,4053,4054],{},"Yes (with plug-in)",[146,4056,3064],{},[125,4058,4059,4062,4064,4066,4068,4071,4073,4076],{},[146,4060,4061],{},"OpenAPI validation",[146,4063,3058],{},[146,4065,3058],{},[146,4067,3058],{},[146,4069,4070],{},"Yes (subscriber)",[146,4072,3064],{},[146,4074,4075],{},"Yes (plug-in)",[146,4077,3064],{},[125,4079,4080,4083,4085,4087,4089,4091,4094,4097],{},[146,4081,4082],{},"Developer portal",[146,4084,3058],{},[146,4086,3058],{},[146,4088,3058],{},[146,4090,3058],{},[146,4092,4093],{},"Yes (included)",[146,4095,4096],{},"Yes (paid in robust OSS)",[146,4098,3058],{},[125,4100,4101,4104,4107,4110,4113,4115,4118,4121],{},[146,4102,4103],{},"Configuration",[146,4105,4106],{},"File",[146,4108,4109],{},"Panel + API",[146,4111,4112],{},"Labels\u002Ffile",[146,4114,4106],{},[146,4116,4117],{},"File + panel",[146,4119,4120],{},"File + panel + database",[146,4122,4123],{},"Custom Resources",[12,4125,4126,4127,4130],{},"The table 
has clear zones. The first three columns solve the entry path with low weight. The last four solve the entry path ",[27,4128,4129],{},"plus"," gateway work, with growing weight and complexity.",[19,4132,4134],{"id":4133},"typical-stack-by-company-stage","Typical stack by company stage",[12,4136,4137],{},"This is the ruler we recommend. Not a strict prescription — it is what we see working in Brazilian SaaS teams.",[12,4139,4140,4143],{},[27,4141,4142],{},"MVP (1 back-end, 1 developer)."," Standalone Caddy on a server, or embedded router if you're already in an orchestrator. Don't install anything. Don't think about gateway. Focus on product.",[12,4145,4146,4149],{},[27,4147,4148],{},"Indie hacker (3 to 5 back-ends, team of 1 to 3)."," Embedded router in the orchestrator, period. The entry path already covers what matters. Authentication in the application, basic per-IP limit on the proxy. Time spent with gateway is time not spent on product features.",[12,4151,4152,4155],{},[27,4153,4154],{},"Early startup (10 to 20 back-ends, first external API consumers)."," Time to evaluate. If the external API is an experiment that may still die, leave authentication in the application and limit by key in a shared library. If the API is part of the product's core promise, install standalone Traefik with authentication and limit middlewares, or Tyk OSS for the included portal. Kong at this stage is usually too heavy.",[12,4157,4158,4161],{},[27,4159,4160],{},"Mid startup (50+ back-ends, public API platform becomes product)."," Kong OSS or paid Tyk. You need plug-ins, robust portal, self-service key management, commercial metrics. Kong's weight now justifies — you're charging for API usage and the gateway is revenue, not cost.",[12,4163,4164,4167],{},[27,4165,4166],{},"Large company (hundreds of services, integrations with serious partners)."," Kong Enterprise or Envoy Gateway, depending on context. Dedicated team looking after the gateway. 
Formal versioning, deprecation, per-client SLA policy.",[12,4169,4170],{},"The natural migration — reverse proxy → Traefik\u002FTyk → Kong — works because each step solves real pain from the previous step. Skipping steps is expensive: installing Kong at the MVP phase is bringing a truck to deliver a pizza.",[19,4172,4174],{"id":4173},"the-4-most-common-expensive-gateway-mistakes","The 4 most common expensive gateway mistakes",[12,4176,4177],{},"The stumbles we see in production:",[12,4179,4180,4183],{},[27,4181,4182],{},"Installing Kong with PostgreSQL on the critical path without needing to."," db-less mode exists and is perfect for most cases. Declarative configuration via file, no external dependency, no extra failure point. Many teams fall into the default configuration with database and only discover it when the database becomes unavailable and the gateway can no longer propagate changes. If you need dynamic key management (consumers self-registering), database pays off. Otherwise, db-less mode.",[12,4185,4186,4189],{},[27,4187,4188],{},"Not monitoring the gateway with the same severity as the back-ends."," A gateway in front becomes an easily-forgotten black box. Latency grows 5 ms, error rate goes from 0.01% to 0.5%, nobody notices until the client complains. Gateway metrics (per-route latency, 4xx\u002F5xx error rate, memory usage, configuration propagation lag) deserve their own dashboard and alerts, not just a shy inclusion in the general panel.",[12,4191,4192,4195],{},[27,4193,4194],{},"Custom plug-in in Lua\u002FJS running in production."," Kong allows custom plug-in in Lua. Tyk in JavaScript. Huge temptation to solve \"just this transformation\" at the gateway. A bug in that plug-in takes the entire gateway down — a bug you created, without load testing, on the critical path of everything. If you need custom transformation, do it in a microservice behind the gateway. 
Custom plug-in only with serious review, load testing, and automatic rollback plan.",[12,4197,4198,4201],{},[27,4199,4200],{},"Outdated gateway version."," Kong, Envoy and Tyk receive CVEs (security vulnerabilities) regularly. The gateway is exposed to the internet — a relevant attack surface. An 18-month-old version is an accumulation of known vulnerabilities. Make it part of the maintenance cycle: updating the gateway is as important as updating the operating system.",[19,4203,4205],{"id":4204},"real-scenarios-where-you-should-not-install-a-gateway","Real scenarios where you should NOT install a gateway",[12,4207,4208],{},"Strong list. If you are in any of these, avoid the dedicated gateway even if the topic comes up at board level:",[2734,4210,4211,4217,4223,4229,4235],{},[70,4212,4213,4216],{},[27,4214,4215],{},"Web SaaS with 5 endpoints, no public external API."," End client is the browser. Session authentication. Reverse proxy solves the entire entry path. Adding a gateway here is architectural vanity.",[70,4218,4219,4222],{},[27,4220,4221],{},"Small team (1 to 3 developers)."," Kong's learning curve costs two to four weeks of total team productivity. On a 3-person team, that's a quarter of features stalled. Unless the gateway solves concrete pain today, postpone.",[70,4224,4225,4228],{},[27,4226,4227],{},"Workload where sub-10ms latency is a hard requirement."," Low-latency trading, real-time auction, multiplayer game. Every millisecond counts. Gateway adds 1 to 3 ms — in a latency-sensitive workload, that's expensive. Put intelligence in the application.",[70,4230,4231,4234],{},[27,4232,4233],{},"Monolithic application without aggregation."," The monolith serves the front-end directly, without composition between services. There's nothing to aggregate. 
Gateway is a solution looking for a problem.",[70,4236,4237,4240],{},[27,4238,4239],{},"Compliance that prefers minimal attack surface."," Each component exposed to the internet is one more item for audit, one more patch to apply, one more log to keep. If the audit values minimalism, justify each component — a gateway not covering concrete pain is a minus.",[19,4242,4244],{"id":4243},"frequent-questions","Frequent questions",[12,4246,4247,4250,4251,4254],{},[27,4248,4249],{},"Is Kong db-less stable in 2026?","\nYes. Declarative mode (",[231,4252,4253],{},"database = off",") is mature, recommended by Kong itself for most cases, and eliminates PostgreSQL as a dependency. You lose dynamic key management via the admin API (you need to deploy new configuration), but gain enormous operational simplicity. For a small team, the trade is almost always worth it.",[12,4256,4257,4260],{},[27,4258,4259],{},"Does Traefik do everything Kong does?","\nNo. Traefik with middlewares covers most common cases — basic authentication, simple per-key limit, header transformation, forward auth. It doesn't cover Kong's plug-in catalog, native OpenAPI validation, robust developer portal, ready-made commercial integrations. If your pain fits in Traefik middleware, stay in Traefik (lighter, simpler). If you need something from Kong's catalog, Kong.",[12,4262,4263,4266],{},[27,4264,4265],{},"Can I have two gateways in series?","\nTechnically yes; in practice it is almost always a symptom of confused organization. Two gateways = two configurations to maintain, two latencies summed, two failure points. The defensible case is: edge router doing TLS and basic routing, dedicated gateway behind doing specific work (key validation, aggregation). That's different from \"two complete gateways in series\" — it's a split of responsibilities.",[12,4268,4269,4272,4273,101],{},[27,4270,4271],{},"Does an API gateway replace a service mesh?","\nNo. Gateway handles north-south traffic (external client → your system). 
Mesh handles east-west traffic (internal service → internal service). Similar functions (authentication, limits, observability) but different scope. For a medium startup, the gateway solves the part that matters; a complete service mesh only becomes a defensible investment at larger scale. We address that boundary in ",[3336,4274,4276],{"href":4275},"\u002Fen\u002Fblog\u002Fservice-mesh-when-its-worth-for-small-saas","service mesh: when it's worth it for small and medium SaaS",[12,4278,4279,4282],{},[27,4280,4281],{},"How much latency does Kong add on a typical call?","\nOn modern hardware, with default configuration and light plug-ins (key validation + per-key limit): 1 to 2 milliseconds per request. Heavy plug-ins (full OpenAPI validation on large payload, complex JSON transformation, synchronous log to external service) can add 3 to 10 ms. Measure before and after — don't trust generic blog post numbers.",[12,4284,4285,4288],{},[27,4286,4287],{},"Self-hosted OAuth provider — Keycloak or Hydra?","\nKeycloak is the standard for those wanting a robust admin panel, federation with LDAP\u002FSAML, complete user management. Heavier (1 GB of RAM minimum, JVM). Hydra is minimalist, focuses only on OAuth\u002FOIDC, no user management (you integrate with your existing user system). For a small team that already has its own user system, Hydra is more appropriate. For a company that wants a single place for identity, Keycloak. Both speak standard protocols, so the gateway doesn't differentiate between them.",[12,4290,4291,4294],{},[27,4292,4293],{},"Schema validation — OpenAPI or JSON Schema?","\nOpenAPI (formerly Swagger) is the standard for describing HTTP API — covers paths, methods, request and response. Includes JSON Schema for describing payloads. Kong, Tyk and standalone validators speak OpenAPI directly. Pure JSON Schema is more portable (not tied to HTTP) but requires more glue. 
Use OpenAPI when the gateway supports it; worth keeping the contract schema alive, not outdated.",[12,4296,4297,4300,4301,4304,4305,4308],{},[27,4298,4299],{},"Can I do limiting in the application instead of the gateway?","\nYou can. Libraries like ",[231,4302,4303],{},"golang.org\u002Fx\u002Ftime\u002Frate"," or Redis with ",[231,4306,4307],{},"INCR"," solve per-user limit at the application level. The question is where the limit is cheaper: at the gateway, before the back-end is touched (saves back-end resources, applies before work begins) or in the application, with business rules closer to the code (easier to reason about, easier to test). For simple limits, the application is enough. For limits per commercial plan with multiple tiers and auditing, the gateway is the right place.",[12,4310,4311,4314],{},[27,4312,4313],{},"Can I use two different gateways on distinct routes?","\nYou can. Some companies run Kong for \"product\" routes (sold public API) and Traefik for \"internal\" routes (admin, ops, cron). The justification is that each gateway solves a different pain, and having only one would force compromise. Worth it when usage profiles actually diverge. Not worth it just for the pleasure of variety — two pieces to maintain.",[19,4316,4318],{"id":4317},"closing-start-with-the-minimum-level-up-when-the-pain-is-real","Closing — start with the minimum, level up when the pain is real",[12,4320,4321],{},"The trap of the \"API gateway\" category is treating the decision as binary — install or not — when it is gradual. A well-done reverse proxy covers 80% of applications. The integrated router in the orchestrator covers the same 80% without a separate component. A dedicated gateway is a defensible investment when three or four signs appear simultaneously: public external API, per-key limit, aggregation, developer portal.",[12,4323,4324],{},"The honest ruler: install the minimum until concrete pain forces the next step. 
Skipping steps costs dearly in latency, RAM, complexity, failure surface. A small team that installs Kong \"because they will need it\" spends three weeks configuring something they don't use, and still has one more component to monitor.",[12,4326,4327],{},"HeroCtl delivers the lowest step embedded — integrated router with automatic TLS, balancing, health check, per-IP limit. When gateway pain appears for real, you bring up Kong, standalone Traefik, Tyk or KrakenD as another job in the cluster. Without painful migration, without ceremony.",[12,4329,4330],{},"To bring up a cluster and test:",[224,4332,4333],{"className":226,"code":2948,"language":228,"meta":229,"style":229},[231,4334,4335],{"__ignoreMap":229},[234,4336,4337,4339,4341,4343,4345],{"class":236,"line":237},[234,4338,1220],{"class":247},[234,4340,2957],{"class":251},[234,4342,2960],{"class":255},[234,4344,2963],{"class":383},[234,4346,2966],{"class":247},[12,4348,4349,4352,4353,4356,4357,4360],{},[27,4350,4351],{},"Community"," is free forever, no server ceiling, no job ceiling, no feature gate. ",[27,4354,4355],{},"Business"," adds SSO\u002FSAML, granular RBAC, detailed auditing and SLA support. ",[27,4358,4359],{},"Enterprise"," adds source code escrow, continuity contract and 24×7 support.",[12,4362,4363,4364,2402,4366,4370],{},"Upcoming posts: ",[3336,4365,4276],{"href":4275},[3336,4367,4369],{"href":4368},"\u002Fen\u002Fblog\u002Fmulti-tenant-saas-real-isolation","multi-tenant SaaS: real isolation between clients",". The three topics together cover most of the platform decisions for a Brazilian startup in the 1 to 500 server range.",[12,4372,4373],{},"Container orchestration, without ceremony. 
Gateway only when the pain asks.",[3350,4375,4376],{},"html pre.shiki code .sQhOw, html code.shiki .sQhOw{--shiki-default:#FFA657}html pre.shiki code .sFSAA, html code.shiki .sFSAA{--shiki-default:#79C0FF}html pre.shiki code .s9uIt, html code.shiki .s9uIt{--shiki-default:#A5D6FF}html pre.shiki code .suJrU, html code.shiki .suJrU{--shiki-default:#FF7B72}html .default .shiki span {color: var(--shiki-default);background: var(--shiki-default-bg);font-style: var(--shiki-default-font-style);font-weight: var(--shiki-default-font-weight);text-decoration: var(--shiki-default-text-decoration);}html .shiki span {color: var(--shiki-default);background: var(--shiki-default-bg);font-style: var(--shiki-default-font-style);font-weight: var(--shiki-default-font-weight);text-decoration: var(--shiki-default-text-decoration);}",{"title":229,"searchDepth":244,"depth":244,"links":4378},[4379,4380,4381,4388,4389,4390,4391,4392,4393,4394,4395,4396],{"id":3407,"depth":244,"text":3408},{"id":3428,"depth":244,"text":3429},{"id":3587,"depth":244,"text":3588,"children":4382},[4383,4384,4385,4386,4387],{"id":3594,"depth":271,"text":3595},{"id":3619,"depth":271,"text":3620},{"id":3641,"depth":271,"text":3642},{"id":3663,"depth":271,"text":3664},{"id":3685,"depth":271,"text":3686},{"id":3707,"depth":244,"text":3708},{"id":3755,"depth":244,"text":3756},{"id":3817,"depth":244,"text":3818},{"id":3836,"depth":244,"text":3837},{"id":4133,"depth":244,"text":4134},{"id":4173,"depth":244,"text":4174},{"id":4204,"depth":244,"text":4205},{"id":4243,"depth":244,"text":4244},{"id":4317,"depth":244,"text":4318},"2026-06-03","An API gateway solves auth, rate limiting, transformations and observability — in exchange for one more critical component. When a simple reverse proxy is enough vs. 
when a dedicated gateway is worth it.",{},"\u002Fen\u002Fblog\u002Fself-hosted-api-gateway-when-to-install","13 min",{"title":3396,"description":4398},{"loc":4400},"en\u002Fblog\u002Fself-hosted-api-gateway-when-to-install",[4406,4407,4408,3378,4409],"api-gateway","kong","traefik","architecture","CjqfmOvCmwhSmEqb8jVK10KBwZYkFYPuH8o9g6JGEs0",{"id":4412,"title":4413,"author":7,"body":4414,"category":3378,"cover":3379,"date":5369,"description":5370,"draft":3382,"extension":3383,"lastReviewed":3379,"meta":5371,"navigation":411,"path":4275,"readingTime":4401,"seo":5372,"sitemap":5373,"stem":5374,"tags":5375,"__hash__":5379},"blog_en\u002Fen\u002Fblog\u002Fservice-mesh-when-its-worth-for-small-saas.md","Is service mesh overkill for a Brazilian startup? When Istio\u002FLinkerd is worth installing",{"type":9,"value":4415,"toc":5351},[4416,4419,4423,4426,4429,4433,4436,4526,4529,4533,4536,4575,4578,4582,4585,4619,4622,4626,4629,4705,4708,4734,4737,4741,4744,4761,4764,4771,4775,4778,4781,4801,4804,4818,4821,4825,4828,5026,5033,5037,5040,5066,5069,5073,5076,5108,5112,5115,5141,5144,5148,5151,5171,5174,5178,5181,5201,5204,5208,5211,5247,5251,5257,5266,5272,5278,5284,5290,5296,5302,5308,5310,5313,5316,5334,5346,5349],[12,4417,4418],{},"The question always arrives in the same format. A tech lead from a Brazilian SaaS with six or eight services running reads three English posts on service mesh, sees the entire American industry using Istio, and opens the terminal to install — along with the doubt: \"isn't this too much for the size of my company?\". It probably is. 
But the honest answer requires separating four problems that service mesh solves, showing the cost in RAM and CPU per server, and describing the exact point where the benefit starts surpassing the overhead.",[19,4420,4422],{"id":4421},"tldr-is-service-mesh-worth-it-for-smallmedium-startup","TL;DR — Is service mesh worth it for small\u002Fmedium startup?",[12,4424,4425],{},"Service mesh (Istio, Linkerd, Cilium Service Mesh, Consul Connect) solves four real problems between the services of a microservices architecture: automatic encryption (calls between pods travel without TLS by default, leaking plaintext traffic), retries and circuit breakers (configurable resilience), granular observability (which service calls which, with what latency), and traffic shaping for canary releases. In exchange, it adds a parallel proxy on each pod (usually Envoy) that consumes between 50 and 100 MB of RAM and adds 5 to 10 ms of latency per internal call.",[12,4427,4428],{},"For a startup with fewer than ten active services and fewer than fifty pods, service mesh is overkill — operational overhead exceeds benefit, and the team spends weeks studying a layer that solves a problem it doesn't yet have. For a company with fifty or more microservices where diagnosing \"which service is delaying the call?\" takes hours, mesh pays off in productivity. The middle ground is clusters with inter-service encryption built into the control plane itself — they cover about 60% of what mesh offers without the parallel sidecar, and serve most Brazilian cases up to the thirty-services range.",[19,4430,4432],{"id":4431},"what-service-mesh-solves-in-one-sentence","What service mesh solves, in one sentence",[12,4434,4435],{},"Before discussing cost, it's necessary to be clear on what's being bought. 
Service mesh is a network layer that inserts itself into each call between services and adds six behaviors:",[2734,4437,4438,4451,4463,4475,4502,4511],{},[70,4439,4440,4443,4444,2629,4447,4450],{},[27,4441,4442],{},"Automatic encryption between pods."," Without mesh, a call from ",[231,4445,4446],{},"orders",[231,4448,4449],{},"users"," inside the cluster travels in plain HTTP. Any agent with access to the node's network sees the content. With mesh, each call is encrypted with automatically issued certificates, no change to application code.",[70,4452,4453,4456,4457,4459,4460,4462],{},[27,4454,4455],{},"Automatic retries on internal calls."," When ",[231,4458,4446],{}," calls ",[231,4461,4449],{}," and the first attempt fails due to a 200 ms network flap, the mesh resends. Without mesh, the application needs to implement that logic on each HTTP client it creates.",[70,4464,4465,4468,4469,4471,4472,4474],{},[27,4466,4467],{},"Configurable circuit breakers."," If ",[231,4470,4449],{}," starts responding with five-second latency, the mesh opens the circuit and makes ",[231,4473,4446],{}," fail fast instead of stacking connections. Without mesh, the team needs to add a library to each service.",[70,4476,4477,4480,4481,571,4484,4487,4488,4490,4491,4493,4494,2402,4496,4499,4500,101],{},[27,4478,4479],{},"Automatic distributed tracing."," The mesh propagates correlation headers (",[231,4482,4483],{},"x-request-id",[231,4485,4486],{},"traceparent",") through the entire call chain. 
The team can see, on a panel, that a request entered the ",[231,4489,4406],{},", passed through ",[231,4492,4446],{},", called ",[231,4495,4449],{},[231,4497,4498],{},"inventory",", and spent most of the time in ",[231,4501,4498],{},[70,4503,4504,4507,4508,4510],{},[27,4505,4506],{},"Fine traffic shaping."," Routing 5% of ",[231,4509,4446],{}," traffic to a new version (canary), mirroring 100% to a test version without affecting the customer (mirror), or alternating between two complete versions (blue-green) — all configured declaratively, no code.",[70,4512,4513,4516,4517,2402,4519,4522,4523,4525],{},[27,4514,4515],{},"Authorization policies between services."," Declaring that only ",[231,4518,4446],{},[231,4520,4521],{},"reports"," can call ",[231,4524,4449],{},", and any other service receives 403. It's the basis of so-called \"zero-trust network\" between pods.",[12,4527,4528],{},"Those six behaviors are real and the value is measurable. The question is whether your cluster today has enough volume and complexity to justify paying for them.",[19,4530,4532],{"id":4531},"whats-not-a-service-mesh-problem","What's NOT a service mesh problem",[12,4534,4535],{},"Before advancing, it's worth eliminating four problems many teams confuse with reason to install mesh — and that modern orchestrator already solves alone:",[2734,4537,4538,4550,4556,4565],{},[70,4539,4540,4543,4544,4546,4547,4549],{},[27,4541,4542],{},"Ingress routing (HTTP ingress)."," Receive external traffic, terminate TLS, route ",[231,4545,3454],{}," to a service and ",[231,4548,3458],{}," to another. That's work for the integrated router of the orchestrator, not for mesh.",[70,4551,4552,4555],{},[27,4553,4554],{},"Simple load balancing."," Distribute requests among three replicas of the same service with round-robin. Orchestrator does this with internal DNS and health checks. 
Mesh only adds when load balancing policy needs to be sophisticated (region weight, complex sticky sessions).",[70,4557,4558,4561,4562,4564],{},[27,4559,4560],{},"Service discovery."," Find where ",[231,4563,4449],{}," is running. Internal cluster DNS solves it. Mesh brings nothing new here.",[70,4566,4567,4570,4571,4574],{},[27,4568,4569],{},"HTTP\u002FHTTPS termination at the edge."," Ingress controller solves it. Mesh handles traffic ",[179,4572,4573],{},"between"," services, not the entry.",[12,4576,4577],{},"Whoever installs mesh expecting it to solve those four is paying twice for the same work.",[19,4579,4581],{"id":4580},"the-four-main-players","The four main players",[12,4583,4584],{},"Four products dominate this category in 2026. The differences matter when the tradeoff is overhead vs features.",[2734,4586,4587,4597,4607,4613],{},[70,4588,4589,4592,4593,4596],{},[27,4590,4591],{},"Istio."," The oldest, most complete, most documented — and heaviest. Uses Envoy as sidecar on each pod. De facto standard at large companies that adopted mesh between 2019 and 2022. The Ambient Mode version (no sidecar, with ",[231,4594,4595],{},"ztunnel"," per node) reduces overhead, but is still stabilizing in production.",[70,4598,4599,4602,4603,4606],{},[27,4600,4601],{},"Linkerd."," Focus on simplicity. Own proxy written in Rust (",[231,4604,4605],{},"linkerd2-proxy","), much lighter than Envoy. Short learning curve — installation fits in a couple of commands. CNCF graduated, but with smaller community than Istio.",[70,4608,4609,4612],{},[27,4610,4611],{},"Cilium Service Mesh."," Takes advantage of eBPF in the kernel to implement much of the mesh without sidecar. Per-pod overhead borders zero. In exchange, cluster setup needs recent kernel and compatible CNI, and some advanced features (like sophisticated L7 authorization) still depend on auxiliary proxy.",[70,4614,4615,4618],{},[27,4616,4617],{},"Consul Connect."," From Hashicorp. 
It integrates with the company's own secrets vault, and works well in mixed environments (VMs + containers). Its Brazilian community is smaller than Istio\u002FLinkerd's.",[12,4620,4621],{},"There are others (Kuma, Open Service Mesh, AWS App Mesh), but concentrating on the quartet above covers 95% of the real decisions a Brazilian tech lead will face.",[19,4623,4625],{"id":4624},"how-much-does-it-cost-in-ram-and-cpu","How much does it cost in RAM and CPU?",[12,4627,4628],{},"The question that decides the discussion.",[119,4630,4631,4647],{},[122,4632,4633],{},[125,4634,4635,4638,4641,4644],{},[128,4636,4637],{},"Mesh",[128,4639,4640],{},"RAM per pod",[128,4642,4643],{},"CPU per pod",[128,4645,4646],{},"Additional latency",[141,4648,4649,4663,4677,4691],{},[125,4650,4651,4654,4657,4660],{},[146,4652,4653],{},"Istio (Envoy sidecar)",[146,4655,4656],{},"+80–120 MB",[146,4658,4659],{},"+10–15%",[146,4661,4662],{},"5–10 ms",[125,4664,4665,4668,4671,4674],{},[146,4666,4667],{},"Linkerd (linkerd2-proxy Rust)",[146,4669,4670],{},"+20–40 MB",[146,4672,4673],{},"+3–6%",[146,4675,4676],{},"1–3 ms",[125,4678,4679,4682,4685,4688],{},[146,4680,4681],{},"Cilium Service Mesh (eBPF)",[146,4683,4684],{},"~0 MB per pod",[146,4686,4687],{},"~2% on the node",[146,4689,4690],{},"\u003C1 ms",[125,4692,4693,4696,4699,4702],{},[146,4694,4695],{},"Consul Connect (Envoy sidecar)",[146,4697,4698],{},"+70–110 MB",[146,4700,4701],{},"+8–12%",[146,4703,4704],{},"4–8 ms",[12,4706,4707],{},"In a cluster with one hundred active pods:",[2734,4709,4710,4716,4722,4728],{},[70,4711,4712,4715],{},[27,4713,4714],{},"Istio"," consumes about 10 GB of RAM in parallel proxies alone, before any application.",[70,4717,4718,4721],{},[27,4719,4720],{},"Linkerd"," consumes about 3 GB.",[70,4723,4724,4727],{},[27,4725,4726],{},"Cilium"," consumes almost nothing per pod, but requires an agent per node (about 200–400 MB each).",[70,4729,4730,4733],{},[27,4731,4732],{},"Consul Connect"," stays close to Istio.",[12,4735,4736],{},"For 
a typical Brazilian startup cluster — four servers with 4 GB of RAM each, totaling 16 GB — Istio alone occupies a third of the cluster's memory before any line of code runs. Linkerd occupies a fifth. Cilium occupies almost nothing per pod, but requires CNI planning.",[19,4738,4740],{"id":4739},"does-my-startup-need-this","Does my startup need this?",[12,4742,4743],{},"Direct answer: probably not. The honest criteria for \"needs\":",[2734,4745,4746,4749,4752,4755,4758],{},[70,4747,4748],{},"Thirty or more active microservices in production.",[70,4750,4751],{},"Inter-service traffic is more than 50% of the cluster's total HTTP volume.",[70,4753,4754],{},"More than one incident per month related to \"which service went down, slowed, or is blowing its timeout\".",[70,4756,4757],{},"Formal compliance demands a zero-trust network between pods (PCI-DSS level 1, certain contracts with Banco Central, health frameworks).",[70,4759,4760],{},"The team has at least one person dedicated to platform work, with time to study and operate the mesh.",[12,4762,4763],{},"If you don't hit at least three of those five criteria, mesh is overkill. The added complexity doesn't return in value — it returns as on-call pages spent trying to understand why the sidecar is restarting.",[12,4765,4766,4767,4770],{},"The most important and least discussed criterion: ",[27,4768,4769],{},"how much of the traffic is internal?",". An application that receives a request at the edge, makes a single database query, and responds spends 95% of its time between the external client and the database — not between services. An application that receives a request at the edge and calls ten internal services to assemble the response spends most of its time on internal traffic. For the first, mesh adds nothing perceptible. For the second, mesh can cut hours of debugging per month.",[19,4772,4774],{"id":4773},"the-cluster-native-substitute","The cluster-native substitute",[12,4776,4777],{},"Here lies the part the American discourse underestimates. 
In 2026, several modern orchestrators — including HeroCtl and some distributions of the orthodox colossus — come with inter-service encryption built into the control plane. No sidecar, no parallel proxy, no additional product to install.",[12,4779,4780],{},"What this covers:",[2734,4782,4783,4789,4795],{},[70,4784,4785,4788],{},[27,4786,4787],{},"Encryption between services."," Each service receives a certificate automatically issued by the cluster. Internal calls are encrypted by default.",[70,4790,4791,4794],{},[27,4792,4793],{},"Service identity."," Each service authenticates by certificate, not by IP or DNS.",[70,4796,4797,4800],{},[27,4798,4799],{},"Basic authorization."," Lists of who can call whom, declared in the service config file.",[12,4802,4803],{},"What this does NOT cover:",[2734,4805,4806,4809,4812,4815],{},[70,4807,4808],{},"Fine traffic shaping (canary with 5% of traffic, mirror).",[70,4810,4811],{},"Completely automatic distributed tracing.",[70,4813,4814],{},"Configurable circuit breakers per call.",[70,4816,4817],{},"Sophisticated retry policies.",[12,4819,4820],{},"For a medium-sized startup that was thinking of installing a mesh just to have \"encryption between services\", cluster-native is enough. It covers the most common audit topic without costing 10 GB of RAM.",[19,4822,4824],{"id":4823},"side-by-side-no-frills","Side by side, no frills",[12,4826,4827],{},"The table compares Istio, Linkerd, Cilium, and the option of not installing a mesh (with cluster-native encryption active) on twelve criteria. 
There's no column without caveat.",[119,4829,4830,4846],{},[122,4831,4832],{},[125,4833,4834,4836,4838,4840,4843],{},[128,4835,2982],{},[128,4837,4714],{},[128,4839,4720],{},[128,4841,4842],{},"Cilium SM",[128,4844,4845],{},"No mesh + cluster-native",[141,4847,4848,4862,4875,4890,4907,4923,4939,4952,4966,4980,4993,5009],{},[125,4849,4850,4853,4855,4857,4860],{},[146,4851,4852],{},"RAM overhead per pod",[146,4854,4656],{},[146,4856,4670],{},[146,4858,4859],{},"~0",[146,4861,4859],{},[125,4863,4864,4867,4869,4871,4873],{},[146,4865,4866],{},"CPU overhead per pod",[146,4868,4659],{},[146,4870,4673],{},[146,4872,4687],{},[146,4874,4859],{},[125,4876,4877,4880,4882,4884,4887],{},[146,4878,4879],{},"Setup complexity",[146,4881,3166],{},[146,4883,3154],{},[146,4885,4886],{},"Medium (kernel)",[146,4888,4889],{},"Minimal",[125,4891,4892,4895,4898,4901,4904],{},[146,4893,4894],{},"Documentation in PT-BR",[146,4896,4897],{},"Good",[146,4899,4900],{},"Reasonable",[146,4902,4903],{},"Little",[146,4905,4906],{},"Embedded in orchestrator",[125,4908,4909,4912,4915,4917,4920],{},[146,4910,4911],{},"Brazilian community",[146,4913,4914],{},"Large",[146,4916,3159],{},[146,4918,4919],{},"Small",[146,4921,4922],{},"Grows with the orchestrator",[125,4924,4925,4928,4931,4934,4937],{},[146,4926,4927],{},"Parallel sidecar",[146,4929,4930],{},"Yes (Envoy)",[146,4932,4933],{},"Yes (Rust)",[146,4935,4936],{},"No (eBPF)",[146,4938,3058],{},[125,4940,4941,4944,4946,4948,4950],{},[146,4942,4943],{},"Automatic encryption between services",[146,4945,3064],{},[146,4947,3064],{},[146,4949,3064],{},[146,4951,3064],{},[125,4953,4954,4957,4959,4961,4963],{},[146,4955,4956],{},"Automatic distributed tracing",[146,4958,3064],{},[146,4960,3064],{},[146,4962,3139],{},[146,4964,4965],{},"No (needs OpenTelemetry)",[125,4967,4968,4971,4973,4975,4977],{},[146,4969,4970],{},"Fine traffic shaping (canary 5%)",[146,4972,3064],{},[146,4974,3064],{},[146,4976,3139],{},[146,4978,4979],{},"Basic (rolling, 
blue-green)",[125,4981,4982,4985,4987,4989,4991],{},[146,4983,4984],{},"Configurable circuit breakers",[146,4986,3064],{},[146,4988,3064],{},[146,4990,3061],{},[146,4992,3058],{},[125,4994,4995,4997,5000,5003,5006],{},[146,4996,3151],{},[146,4998,4999],{},"6–10 weeks",[146,5001,5002],{},"2–4 weeks",[146,5004,5005],{},"4–6 weeks",[146,5007,5008],{},"Days",[125,5010,5011,5014,5017,5020,5023],{},[146,5012,5013],{},"Ideal application range",[146,5015,5016],{},"50+ services",[146,5018,5019],{},"10–50 services",[146,5021,5022],{},"30+ services with new kernel",[146,5024,5025],{},"1–30 services",[12,5027,5028,5029,5032],{},"The row that matters is the last one — ",[27,5030,5031],{},"ideal application range",". Whoever is below the range pays overhead without return. Whoever is above feels the missing features.",[19,5034,5036],{"id":5035},"when-service-mesh-pays-the-price","When service mesh is worth the price",[12,5038,5039],{},"Four scenarios where the investment is justified:",[2734,5041,5042,5048,5054,5060],{},[70,5043,5044,5047],{},[27,5045,5046],{},"Thirty or more active microservices."," Operational complexity without a mesh becomes worse than with one — diagnosing a chain of six internal calls across three different teams is expensive without automatic tracing.",[70,5049,5050,5053],{},[27,5051,5052],{},"Enterprise compliance with zero-trust requirements."," Some audit frameworks require the stack to nominally have a \"zero-trust network\". Mesh formally checks that box.",[70,5055,5056,5059],{},[27,5057,5058],{},"Multi-cluster federation."," Service routing between two or three clusters in different regions, with automatic failover. Mesh facilitates this scenario; cluster-native solves it poorly.",[70,5061,5062,5065],{},[27,5063,5064],{},"Platform team of five or more dedicated people."," You have the capacity to extract value from the mesh — to operate, evolve, and scale its control plane. 
Without that team, mesh becomes a liability.",[12,5067,5068],{},"If you hit two or more of those, start evaluating. Start with Linkerd — it's the option that brings the least pain while giving up the least relative return.",[19,5070,5072],{"id":5071},"when-not-to-install-most-cases","When NOT to install (most cases)",[12,5074,5075],{},"Five scenarios where installing a mesh today costs more than it returns:",[2734,5077,5078,5084,5090,5096,5102],{},[70,5079,5080,5083],{},[27,5081,5082],{},"Monolith with five to ten auxiliary microservices."," Zero gain, large cost. The RAM overhead falls directly on the server bill.",[70,5085,5086,5089],{},[27,5087,5088],{},"Small team, fewer than three people on platform."," Operating a mesh requires dedicated on-call coverage. A small team absorbs that cost at the expense of product features.",[70,5091,5092,5095],{},[27,5093,5094],{},"Cluster with fewer than thirty total pods."," Managing thirty pods is human-scale work and doesn't require automatic tracing. The cost of learning a mesh doesn't pay back.",[70,5097,5098,5101],{},[27,5099,5100],{},"Simple HTTP workload without canary requirements."," If you never needed to release 5% of traffic to a new version because a rolling update always sufficed, mesh is a solution for a problem that doesn't exist.",[70,5103,5104,5107],{},[27,5105,5106],{},"Cluster cost under pressure."," If every gigabyte of RAM is being counted, spending 10 GB on sidecars is a decision that's hard to defend to investors.",[19,5109,5111],{"id":5110},"evolutionary-decision-by-stage","Evolutionary decision, by stage",[12,5113,5114],{},"The right decision changes with the size of the system. Four stages:",[2734,5116,5117,5123,5129,5135],{},[70,5118,5119,5122],{},[27,5120,5121],{},"Stage 1 — 1 to 10 services."," No mesh. If you need encryption between services, do TLS in the code (most languages ship a ready-to-use HTTPS client). The learning curve isn't worth it. 
Focus on delivering product.",[70,5124,5125,5128],{},[27,5126,5127],{},"Stage 2 — 10 to 30 services."," A cluster with encryption built into the control plane (HeroCtl, some colossus presets). It solves encryption + identity + service discovery without a sidecar, covering most of what a mesh offers, without the cost.",[70,5130,5131,5134],{},[27,5132,5133],{},"Stage 3 — 30 to 50 services with platform team."," Evaluate Linkerd first. Short learning curve, low overhead, and it solves tracing and circuit breakers. Istio only if advanced features (sophisticated L7 authorization, real multi-cluster federation) are an immediate requirement.",[70,5136,5137,5140],{},[27,5138,5139],{},"Stage 4 — 50+ services, enterprise compliance."," Istio or Cilium Service Mesh. Compliance will ask for one of the two; the rest are details.",[12,5142,5143],{},"Going from one stage to the next is a deliberate decision, not a gradual one. Add the component when the team can absorb the learning and the cluster can absorb the overhead. Not before.",[19,5145,5147],{"id":5146},"the-lets-install-now-to-be-prepared-trap","The \"let's install now to be prepared\" trap",[12,5149,5150],{},"An argument that appears in every discussion: \"if I'm going to grow to fifty services next year, better install now and learn\". The trap has three faces:",[2734,5152,5153,5159,5165],{},[70,5154,5155,5158],{},[27,5156,5157],{},"Learning mesh costs four to eight weeks per person on the team."," On a team of five, that's twenty to forty person-weeks. Multiplied by R$200\u002Fhour, it's between R$160k and R$320k just in learning. That money buys features or buys runway.",[70,5160,5161,5164],{},[27,5162,5163],{},"Each new component is one more critical failure point."," The mesh control plane (Istio Pilot, Linkerd's controller, Cilium's operator) can fail and take internal connectivity down with it. More components in the quorum, more incident surface. 
Add it only when the gain outweighs that risk.",[70,5166,5167,5170],{},[27,5168,5169],{},"When you need it, installing takes a week, not a month."," Linkerd in particular is installable in a couple of commands. Cilium in a few hours if the cluster runs a recent kernel. Postponing the decision isn't technical debt — it's debt postponed at lower interest.",[12,5172,5173],{},"\"Install early to be prepared\" doesn't work. What works is monitoring the objective criteria of the previous section and installing when two or more become reality.",[19,5175,5177],{"id":5176},"how-heroctl-approaches-the-problem","How HeroCtl approaches the problem",[12,5179,5180],{},"Our position is deliberate: service mesh, in most Brazilian cases, is a decision for stage three or four. To cover stages one and two, HeroCtl ships the following built into the control plane:",[2734,5182,5183,5189,5195],{},[70,5184,5185,5188],{},[27,5186,5187],{},"Automatic encryption between services."," Each submitted service receives its own identity. An internal call between two services is encrypted by default, with no change in application code and no parallel sidecar.",[70,5190,5191,5194],{},[27,5192,5193],{},"Distributed tracing via integrated OpenTelemetry exporter."," The cluster propagates correlation headers and exports to any collector that understands OTLP. Not as rich as a full mesh (which automatically injects tracing into the sidecars), but it covers 80% of real use.",[70,5196,5197,5200],{},[27,5198,5199],{},"Basic embedded traffic shaping."," Rolling update, canary with a fixed percentage of traffic, blue-green. Sufficient for a startup that does ten deploys a day. It doesn't cover mirror or per-header canary weights — for those, you need to install a mesh.",[12,5202,5203],{},"For a Brazilian startup up to the thirty-service range, this covers about 80% of what a complete mesh delivers — without the sidecar, without the four weeks of learning, without the 10 GB of RAM. 
When the system grows beyond that, installing Linkerd on top of HeroCtl is a documented path.",[19,5205,5207],{"id":5206},"the-four-most-expensive-mistakes-installing-service-mesh","The four most expensive mistakes installing service mesh",[12,5209,5210],{},"For a team that has decided to take the step, four traps that cost from two weeks to three months of rework:",[2734,5212,5213,5219,5235,5241],{},[70,5214,5215,5218],{},[27,5216,5217],{},"Installing before you need it."," Unnecessary coverage becomes a liability. A new component in the quorum, RAM cost, learning time — without an equivalent return.",[70,5220,5221,2577,5224,5227,5228,5231,5232,5234],{},[27,5222,5223],{},"Configuring strict encryption on day one without thinking about legacy.",[231,5225,5226],{},"STRICT"," mode breaks any service that hasn't yet been migrated. The correct migration is gradual: ",[231,5229,5230],{},"PERMISSIVE"," mode at the start (accepting both encrypted and unencrypted traffic), and it only becomes ",[231,5233,5226],{}," when all services are inside the mesh.",[70,5236,5237,5240],{},[27,5238,5239],{},"Not sizing the control plane."," Istio Pilot and its peers need enough RAM and CPU to distribute configuration to all the sidecars. In a growing cluster, a control-plane bottleneck is a classic incident for those who didn't plan.",[70,5242,5243,5246],{},[27,5244,5245],{},"Skipping Linkerd for Istio \"because it's more popular\"."," Linkerd solves 80% of cases with 30% of the overhead. Choosing Istio is only justified when a specific feature (sophisticated L7 authorization, integration with an external identity service, multi-cluster federation) is a real requirement, not a résumé preference.",[19,5248,5250],{"id":5249},"frequently-asked-questions","Frequently asked questions",[12,5252,5253,5256],{},[27,5254,5255],{},"Is Linkerd light enough for a small cluster?","\nLighter than Istio by an order of magnitude, but still a parallel sidecar on each pod. 
For a cluster with twenty pods and four 4 GB nodes, Linkerd eats about 600 MB of total RAM — significant but tolerable. For a cluster with ten pods, it's still excessive. Linkerd enters the scene at stage three (10–50 services), not before.",[12,5258,5259,5262,5263,5265],{},[27,5260,5261],{},"Does Istio Ambient Mode (no sidecar) change this decision?","\nIt reduces per-pod overhead (down to one agent per node, ",[231,5264,4595],{},"), but still requires operating the entire Istio control plane. It has been stable in production since 2024, but the Brazilian community is still small — waiting a few more quarters before adopting it in a critical project is prudent.",[12,5267,5268,5271],{},[27,5269,5270],{},"Does Cilium eBPF really have zero overhead?","\nPer pod, yes — there is no parallel sidecar. But the Cilium agent on each node consumes from 200 to 400 MB and adds load on the kernel. For a cluster with a modern Linux kernel and a compatible CNI, it's the most efficient option. For a cluster still running an old kernel or a specific CNI, the setup becomes a project.",[12,5273,5274,5277],{},[27,5275,5276],{},"How do I do encryption between services without a service mesh?","\nThree paths. First, TLS in application code — each service exposes HTTPS, each client trusts an internal CA. It works, but requires distributing certificates manually (or via a secrets vault). Second, the orchestrator control plane issuing certificates automatically — HeroCtl and some colossus distributions do this; it's the cleanest path. Third, a VPN or encrypted overlay network (WireGuard) between nodes — it protects traffic inside the cluster, but not service-to-service identity.",[12,5279,5280,5283],{},[27,5281,5282],{},"Does distributed tracing need mesh?","\nNo. An OpenTelemetry SDK in each service, exporting to a central collector (Tempo, Jaeger, or a managed service), covers 90% of use. A mesh automates the injection without code changes, which is convenient — but it's not a requirement. 
For a startup, starting with OpenTelemetry in code is cheaper.",[12,5285,5286,5289],{},[27,5287,5288],{},"Is service mesh easier in a managed cluster?","\nEasier to install, yes — most providers offer an Istio or Linkerd add-on with one click. Easier to operate, no — you still need to understand the control plane, size it, and debug when a sidecar restarts. Saving install time doesn't compensate for operational unpreparedness.",[12,5291,5292,5295],{},[27,5293,5294],{},"Which mesh is most used in Brazilian startups?","\nGoing by community experience, Istio dominates in companies that adopted between 2020 and 2022 (the CNCF fashion effect). Linkerd has been growing since 2024 among those who migrated or started fresh, especially mid-size fintechs. Cilium appears in specific cases (very large clusters, cost optimization). Consul Connect is very rare in Brazil.",[12,5297,5298,5301],{},[27,5299,5300],{},"Worth it for monolith + 3 microservices?","\nNo. A monolith plus three microservices doesn't have the internal complexity a mesh helps tame. TLS in code solves encryption. Centralized logs solve visibility. The orchestrator's rolling update solves safe deploys. Installing a mesh in that scenario is adding one problem to solve another that doesn't exist.",[12,5303,5304,5307],{},[27,5305,5306],{},"Does HeroCtl completely replace a service mesh?","\nFor stages one and two (up to thirty services), it replaces one in about 80% of real use. For stages three and four (above thirty services, or specific compliance), HeroCtl coexists with Linkerd or Istio running as jobs on top. 
HeroCtl's control-plane inter-service encryption coexists with the mesh — the mesh takes care of traffic between your pods, while HeroCtl takes care of service identity and communication with the control plane.",[19,5309,3309],{"id":3308},[12,5311,5312],{},"The practical rule we recommend for the Brazilian tech lead: install a mesh when two or more of the objective criteria become reality — thirty active services, more than one incident per month related to internal calls, formal compliance asking for zero-trust, a platform team of five people, real multi-cluster federation. Before that, a cluster with encryption built into the control plane solves most of what you'd buy with a mesh, without the 10 GB of RAM and without the eight weeks of learning.",[12,5314,5315],{},"To start exploring this path — an orchestrator with inter-service encryption included, no parallel sidecar, a control plane occupying between 200 and 400 MB per server, and coordinator election in about seven seconds when something goes down — install on any Linux server and open the panel:",[224,5317,5319],{"className":226,"code":5318,"language":228,"meta":229,"style":229},"curl -sSL get.heroctl.com\u002Finstall.sh | sh\n",[231,5320,5321],{"__ignoreMap":229},[234,5322,5323,5325,5327,5330,5332],{"class":236,"line":237},[234,5324,1220],{"class":247},[234,5326,2957],{"class":251},[234,5328,5329],{"class":255}," get.heroctl.com\u002Finstall.sh",[234,5331,2963],{"class":383},[234,5333,2966],{"class":247},[12,5335,5336,5337,5340,5341,5345],{},"To continue along this line, two related posts. In ",[3336,5338,5339],{"href":4368},"Multi-tenant SaaS — real isolation or just namespace?"," we deal with the noisy-neighbor problem — separating customers within the same cluster without breaking the budget. 
In ",[3336,5342,5344],{"href":5343},"\u002Fen\u002Fblog\u002Fk3s-vs-heroctl-when-each-fits","K3s vs HeroCtl — when each makes sense"," we compare the most common alternative when the team has already decided that the orthodox colossus is excessive.",[12,5347,5348],{},"The choice for service mesh is, deep down, a choice of when to absorb complexity. The right question isn't \"do I need Istio?\" — it's \"what's the smallest system that still solves my current problem?\". For a large part of Brazilian startups, the answer is simpler than the American industry suggests.",[3350,5350,4376],{},{"title":229,"searchDepth":244,"depth":244,"links":5352},[5353,5354,5355,5356,5357,5358,5359,5360,5361,5362,5363,5364,5365,5366,5367,5368],{"id":4421,"depth":244,"text":4422},{"id":4431,"depth":244,"text":4432},{"id":4531,"depth":244,"text":4532},{"id":4580,"depth":244,"text":4581},{"id":4624,"depth":244,"text":4625},{"id":4739,"depth":244,"text":4740},{"id":4773,"depth":244,"text":4774},{"id":4823,"depth":244,"text":4824},{"id":5035,"depth":244,"text":5036},{"id":5071,"depth":244,"text":5072},{"id":5110,"depth":244,"text":5111},{"id":5146,"depth":244,"text":5147},{"id":5176,"depth":244,"text":5177},{"id":5206,"depth":244,"text":5207},{"id":5249,"depth":244,"text":5250},{"id":3308,"depth":244,"text":3309},"2026-05-29","Service mesh solves real problems (mTLS, inter-service observability, traffic shaping). But adds 30-50% RAM\u002FCPU overhead and complexity. 
When it's worth it and when it's overkill.",{},{"title":4413,"description":5370},{"loc":4275},"en\u002Fblog\u002Fservice-mesh-when-its-worth-for-small-saas",[5376,5377,5378,3378,4409],"service-mesh","istio","linkerd","VX4xpWtHCom09sHEcs0-6Nv7h5dXcx4BDqp11jXSWGQ",{"id":5381,"title":5382,"author":7,"body":5383,"category":6382,"cover":3379,"date":6383,"description":6384,"draft":3382,"extension":3383,"lastReviewed":3379,"meta":6385,"navigation":411,"path":6386,"readingTime":6387,"seo":6388,"sitemap":6389,"stem":6390,"tags":6391,"__hash__":6396},"blog_en\u002Fen\u002Fblog\u002Fleaving-aws-without-rewriting-the-stack.md","How to leave AWS without rewriting the whole stack: practical 2026 guide",{"type":9,"value":5384,"toc":6344},[5385,5388,5392,5395,5398,5404,5407,5411,5414,5417,5420,5424,5427,5501,5504,5508,5511,5714,5717,5721,5724,5727,5731,5738,5741,5744,5750,5754,5757,5760,5764,5767,5774,5777,5781,5784,5787,5791,5794,5797,5801,5804,5808,5811,5814,5818,5821,5824,5827,5831,5834,5844,5847,5851,5854,5857,5861,5864,5870,5876,5879,5883,5886,5892,5898,5908,5914,5920,5926,5932,5938,5942,5945,5977,5981,5984,5989,6026,6031,6062,6065,6068,6072,6075,6078,6095,6098,6101,6105,6108,6114,6120,6126,6132,6136,6139,6153,6156,6160,6163,6166,6191,6194,6206,6209,6211,6215,6219,6222,6226,6229,6233,6236,6240,6246,6250,6269,6273,6276,6280,6283,6287,6290,6294,6297,6299,6303,6306,6309,6325,6328,6339,6342],[12,5386,5387],{},"Most Brazilian teams thinking about leaving AWS postpone indefinitely because they believe they are facing a project of \"rewriting the entire stack\". They aren't. It is a mapping project, not a rewrite. And the mapping fits in a twelve-row spreadsheet.",[19,5389,5391],{"id":5390},"tldr-what-youll-read-in-three-minutes","TL;DR — what you'll read in three minutes",[12,5393,5394],{},"A typical Brazilian SaaS stack uses about twelve AWS services, and each of them has a portable alternative that costs between three and seven times less. 
EC2 becomes a VPS at any provider (Hetzner, DigitalOcean, Magalu Cloud). RDS becomes Postgres on a dedicated VPS, Neon, or Supabase. ElastiCache becomes self-hosted Valkey. S3 becomes Cloudflare R2 or Backblaze B2 — both with an S3-compatible API, so the code doesn't even change. SQS becomes a Redis-based queue or RabbitMQ. Lambda becomes an endpoint on the traditional app server or Cloudflare Workers. ALB becomes the orchestrator's integrated router. CloudFront becomes free Cloudflare. IAM becomes secret injection in the orchestrator.",[12,5396,5397],{},"A realistic schedule for a startup with five to ten applications: six to eight weeks, eighty to one hundred sixty hours of development. Typical savings: three to seven times on the infra bill, with payback in less than a month of a senior salary.",[12,5399,5400,5403],{},[27,5401,5402],{},"Don't migrate if"," your compliance requires AWS by name, if the team is a one-person operation focused on product, or if the stack uses deep lock-in (DynamoDB with specific features, Aurora Serverless v2, complex cross-account IAM).",[12,5405,5406],{},"━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━",[19,5408,5410],{"id":5409},"why-do-so-many-brazilian-teams-postpone-leaving-aws","Why do so many Brazilian teams postpone leaving AWS?",[12,5412,5413],{},"The honest answer is confusion between two different projects. \"Leaving AWS\" became a mental synonym for \"rewriting the application\". It isn't the same thing.",[12,5415,5416],{},"Rewriting the application is changing core technology — swapping a relational database for NoSQL, a synchronous framework for a reactive one, a monolith for microservices. That does take quarters. Leaving AWS is changing the infra that sustains the application you already have. The domain code stays identical. What changes are database endpoints, credentials, some SDKs, and the way deploys are declared.",[12,5418,5419],{},"The confusion persists because the team looks at the AWS console and sees two hundred services. 
Nobody uses two hundred. The vast majority uses twelve. Map those twelve, find an alternative for each one, and what's left is execution work — not research.",[19,5421,5423],{"id":5422},"the-twelve-aws-services-your-stack-probably-uses","The twelve AWS services your stack probably uses",[12,5425,5426],{},"The starting spreadsheet is this. Anything outside it in your account is probably satellite — a CloudWatch alarm nobody looks at, a forgotten S3 bucket, a dead Lambda function. Focus on the twelve:",[67,5428,5429,5435,5441,5447,5453,5459,5465,5471,5477,5483,5489,5495],{},[70,5430,5431,5434],{},[27,5432,5433],{},"EC2"," — virtual machines running app server and workers",[70,5436,5437,5440],{},[27,5438,5439],{},"RDS"," — managed relational database (Postgres or MySQL)",[70,5442,5443,5446],{},[27,5444,5445],{},"ElastiCache"," — Redis for cache and session",[70,5448,5449,5452],{},[27,5450,5451],{},"S3"," — object storage (uploads, backups, assets)",[70,5454,5455,5458],{},[27,5456,5457],{},"ALB \u002F NLB"," — load balancer in front of the EC2s",[70,5460,5461,5464],{},[27,5462,5463],{},"CloudFront"," — CDN for static assets",[70,5466,5467,5470],{},[27,5468,5469],{},"Route 53"," — authoritative DNS",[70,5472,5473,5476],{},[27,5474,5475],{},"SES"," — transactional email",[70,5478,5479,5482],{},[27,5480,5481],{},"SQS \u002F SNS"," — queue and pub-sub",[70,5484,5485,5488],{},[27,5486,5487],{},"IAM"," — credentials and roles for services to talk to each other",[70,5490,5491,5494],{},[27,5492,5493],{},"CloudWatch"," — metrics and logs",[70,5496,5497,5500],{},[27,5498,5499],{},"Lambda"," — serverless functions",[12,5502,5503],{},"If your account has all twelve, congratulations: you are the median stack. If you have eight or nine, better — less to migrate. 
If you have five very specific services (Aurora Global, DynamoDB with Streams, complex EventBridge), you are on a different path — read the lock-ins section before continuing.",[19,5505,5507],{"id":5506},"service-by-service-mapping-alternative-cost-and-complexity","Service-by-service mapping — alternative, cost and complexity",[12,5509,5510],{},"The table below is the shortcut. Each row has expanded detail afterwards.",[119,5512,5513,5532],{},[122,5514,5515],{},[125,5516,5517,5520,5523,5526,5529],{},[128,5518,5519],{},"AWS service",[128,5521,5522],{},"Portable alternative",[128,5524,5525],{},"Cost before (R$\u002Fmonth)",[128,5527,5528],{},"Cost after (R$\u002Fmonth)",[128,5530,5531],{},"Migration complexity",[141,5533,5534,5550,5566,5582,5597,5612,5626,5641,5656,5672,5685,5699],{},[125,5535,5536,5539,5542,5545,5548],{},[146,5537,5538],{},"EC2 t3.medium",[146,5540,5541],{},"Hetzner CPX21 VPS",[146,5543,5544],{},"150",[146,5546,5547],{},"44",[146,5549,3154],{},[125,5551,5552,5555,5558,5561,5564],{},[146,5553,5554],{},"RDS db.t4g.large",[146,5556,5557],{},"Self-hosted Postgres or Neon",[146,5559,5560],{},"700",[146,5562,5563],{},"50–250",[146,5565,3159],{},[125,5567,5568,5571,5574,5577,5580],{},[146,5569,5570],{},"ElastiCache cache.t4g.micro",[146,5572,5573],{},"Self-hosted Valkey",[146,5575,5576],{},"75",[146,5578,5579],{},"30",[146,5581,3159],{},[125,5583,5584,5587,5590,5593,5595],{},[146,5585,5586],{},"S3 (1TB + egress)",[146,5588,5589],{},"Cloudflare R2",[146,5591,5592],{},"600",[146,5594,5576],{},[146,5596,3154],{},[125,5598,5599,5602,5605,5608,5610],{},[146,5600,5601],{},"ALB",[146,5603,5604],{},"Orchestrator integrated router",[146,5606,5607],{},"110",[146,5609,893],{},[146,5611,3159],{},[125,5613,5614,5616,5619,5622,5624],{},[146,5615,5463],{},[146,5617,5618],{},"Free Cloudflare",[146,5620,5621],{},"400",[146,5623,893],{},[146,5625,3154],{},[125,5627,5628,5630,5633,5636,5638],{},[146,5629,5469],{},[146,5631,5632],{},"Cloudflare 
DNS",[146,5634,5635],{},"25",[146,5637,893],{},[146,5639,5640],{},"Trivial",[125,5642,5643,5645,5648,5651,5654],{},[146,5644,5475],{},[146,5646,5647],{},"Resend or Postmark",[146,5649,5650],{},"50",[146,5652,5653],{},"75–100",[146,5655,5640],{},[125,5657,5658,5660,5663,5666,5669],{},[146,5659,5481],{},[146,5661,5662],{},"Redis Streams or RabbitMQ",[146,5664,5665],{},"80",[146,5667,5668],{},"0 (same VPS)",[146,5670,5671],{},"Medium–high",[125,5673,5674,5676,5679,5681,5683],{},[146,5675,5487],{},[146,5677,5678],{},"Orchestrator secrets",[146,5680,893],{},[146,5682,893],{},[146,5684,3159],{},[125,5686,5687,5689,5692,5695,5697],{},[146,5688,5493],{},[146,5690,5691],{},"Prometheus + Loki",[146,5693,5694],{},"250",[146,5696,5668],{},[146,5698,3159],{},[125,5700,5701,5703,5706,5708,5711],{},[146,5702,5499],{},[146,5704,5705],{},"App server or Cloudflare Workers",[146,5707,669],{},[146,5709,5710],{},"0–60",[146,5712,5713],{},"Variable",[12,5715,5716],{},"FX considered: five reais per dollar. Before-costs assume small-medium SaaS stack with five to ten active applications.",[368,5718,5720],{"id":5719},"ec2-becomes-vps-at-any-provider","EC2 becomes VPS at any provider",[12,5722,5723],{},"The most obvious migration. EC2 t3.medium costs about thirty dollars monthly — one hundred fifty reais. Hetzner CPX21 with the same CPU class and more RAM costs seven euros and ninety-nine — forty-four reais. DigitalOcean sits in the middle. Magalu Cloud is competitive for those prioritizing invoice in real and data on national soil.",[12,5725,5726],{},"The technical path is provisioning the VPS, running your existing Ansible (or a simple bootstrap script), copying the EC2 snapshot or bringing up the image from scratch. For each server, count two to four hours. 
This is not the time-consuming part of the migration.",[368,5728,5730],{"id":5729},"rds-becomes-self-hosted-postgres-or-neonsupabase","RDS becomes self-hosted Postgres or Neon\u002FSupabase",[12,5732,5733,5734,5737],{},"There are three honest paths here. The first is Postgres running on a dedicated VPS, with automated backup via ",[231,5735,5736],{},"pg_dump"," in cron and physical replication to a secondary in another region. Costs the price of the VPS — fifty to one hundred reais monthly — to replace an RDS of seven hundred.",[12,5739,5740],{},"The second is Neon. Serverless Postgres with branching, automatic scaling, generous free plan, paid plans starting at five dollars. Useful for those wanting to abandon AWS without taking on direct database operation.",[12,5742,5743],{},"The third is Supabase, which delivers Postgres with additional APIs (auth, realtime, storage) and a permanent free tier. Makes sense for startups that tolerate Supabase coupling in exchange for simplicity.",[12,5745,5746,5747,5749],{},"The migration itself is ",[231,5748,5736],{}," followed by restore at destination, with a short maintenance window, or logical replication with near-zero-downtime cutover if your Postgres is version 13 or higher. Four to eight hours depending on base size.",[368,5751,5753],{"id":5752},"elasticache-becomes-self-hosted-valkey","ElastiCache becomes self-hosted Valkey",[12,5755,5756],{},"Redis became Valkey after the license change in 2024 — fork maintained by the Linux Foundation. Runs on any VPS in two clicks. Thirty reais monthly replace ElastiCache of seventy-five.",[12,5758,5759],{},"The migration has two stages. First, bring up the Valkey cluster with Sentinel for automatic failover. Second, populate the cache — script that reads from AWS and writes at the destination, or simply let the application populate organically after cutover (cache cold start of a few minutes). 
Three to six hours of work.",[368,5761,5763],{"id":5762},"s3-becomes-cloudflare-r2-or-backblaze-b2","S3 becomes Cloudflare R2 (or Backblaze B2)",[12,5765,5766],{},"This is the most immediate gain. Cloudflare R2 charges zero for egress — the most expensive slice of S3 when you serve assets to users. One and a half cents of dollar per GB stored, against 2.3 cents on standard S3. Backblaze B2 is an almost identical alternative, with even cheaper integration for heavy backup workloads.",[12,5768,5769,5770,5773],{},"Technical migration is trivial: ",[231,5771,5772],{},"rclone copy s3:my-bucket r2:my-bucket"," in parallel. One terabyte transfers in around twelve hours depending on bandwidth. The application code changes exactly one line — the S3 client endpoint. Every AWS SDK library accepts custom endpoint configuration; R2 and B2 implement an S3-compatible protocol.",[12,5775,5776],{},"Typical volume of medium SaaS (fifty GB of user uploads): R$75 monthly on R2 against R$600 on S3 with active egress. The savings pay a week of migration work in the first month.",[368,5778,5780],{"id":5779},"alb-becomes-orchestrator-integrated-router","ALB becomes orchestrator integrated router",[12,5782,5783],{},"If you are using ALB, it is because you have multiple EC2s behind it. The alternative is the router embedded in the chosen orchestrator — HeroCtl, Caddy, or the equivalent in other self-hosted stacks. The orchestrator discovers running containers, opens ports, terminates TLS via automatic Let's Encrypt, distributes traffic.",[12,5785,5786],{},"The migration swaps the AWS target group definition for an ingress definition in the orchestrator manifest. Four to eight hours to understand the right rules. 
One hundred ten reais saved monthly per balancer, and the orchestrator accepts however many hosts you want without additional charge.",[368,5788,5790],{"id":5789},"cloudfront-becomes-free-cloudflare","CloudFront becomes free Cloudflare",[12,5792,5793],{},"This one deserves special mention. CloudFront charges per GB transferred — those who serve video or heavy downloads bleed. Cloudflare offers a global CDN on its free plan, with configurable cache, basic DDoS mitigation and rudimentary WAF. For most SaaS cases, it is more than enough.",[12,5795,5796],{},"The migration is changing the domain's nameservers to Cloudflare and configuring cache rules. Two to four hours. The savings can be massive — four hundred reais monthly for those with average traffic volume, thousands for those with high volume.",[368,5798,5800],{"id":5799},"route-53-becomes-cloudflare-dns","Route 53 becomes Cloudflare DNS",[12,5802,5803],{},"DNS at Cloudflare is free and faster than Route 53 in most public measurements. Migration is exporting the zone file, importing it in Cloudflare, validating records, changing nameservers at the registrar. Thirty minutes. Twenty-five reais monthly that come back to the cash flow.",[368,5805,5807],{"id":5806},"ses-becomes-resend-postmark-or-mailgun","SES becomes Resend, Postmark or Mailgun",[12,5809,5810],{},"AWS is cheap for volume sending, but SES deliverability requires IP warming and reputation configuration that takes time. Resend charges twenty dollars for fifty thousand monthly emails and has superior deliverability out of the box. Postmark charges fifteen for ten thousand. Mailgun covers the case of those sending lots of non-transactional volume.",[12,5812,5813],{},"The migration is changing SMTP credentials in the app — one hour of work.",[368,5815,5817],{"id":5816},"sqs-and-sns-become-redis-streams-or-rabbitmq","SQS and SNS become Redis Streams or RabbitMQ",[12,5819,5820],{},"The most delicate migration. 
SQS is a service that does one thing and does it well; replacing it requires choosing queue technology and refactoring producer and consumer.",[12,5822,5823],{},"The shortest path is Redis Streams, especially if you are already running Valkey for cache. Libraries like Sidekiq (Ruby), BullMQ (Node), RQ (Python) and Asynq (Go) consume Redis natively. RabbitMQ is more robust for complex routing scenarios. NATS is a modern alternative for pub-sub.",[12,5825,5826],{},"For each queue, count one to three days depending on complexity. Simple background job queues are trivial. Queues with fan-out, dead letter queues and custom visibility timeout require more care. Eighty reais monthly saved, and the queue runs on the same VPS as the cache — zero additional in infra.",[368,5828,5830],{"id":5829},"iam-becomes-orchestrator-secrets","IAM becomes orchestrator secrets",[12,5832,5833],{},"Here is the non-obvious migration that catches many AWS-experienced teams off guard. On AWS, the application accesses S3 and RDS without explicit credentials in the code — the EC2 inherits an IAM role and the SDK fetches tokens automatically. Outside AWS, that disappears.",[12,5835,5836,5837,5839,5840,5843],{},"The solution is secret injection by the orchestrator. HeroCtl, k3s and similar accept secrets as first-class resources — you declare ",[231,5838,453],{}," or ",[231,5841,5842],{},"S3_ACCESS_KEY"," in the job manifest and the orchestrator injects as environment variable in the container. For more sophisticated scenarios, self-hosted HashiCorp Vault does automatic rotation.",[12,5845,5846],{},"The migration is refactoring each IAM role into a set of explicit credentials, created at the destination provider (Cloudflare API token, specific Postgres user, etc.), and declared as secrets. Four to eight hours for a medium stack.",[368,5848,5850],{"id":5849},"cloudwatch-becomes-prometheus-loki","CloudWatch becomes Prometheus + Loki",[12,5852,5853],{},"Metrics become Prometheus + Grafana. 
Logs become Loki + Grafana. Everything runs in containers in the same cluster. Two hundred fifty reais monthly of CloudWatch become zero additional.",[12,5855,5856],{},"Initial configuration takes about four hours to be productive: Prometheus with service discovery pointing to the orchestrator agents, Loki receiving via Promtail or directly from the container runtime, Grafana with basic dashboards. There are dedicated posts about this migration on the blog.",[368,5858,5860],{"id":5859},"lambda-the-hardest-part","Lambda — the hardest part",[12,5862,5863],{},"Lambda is the service with the largest variance in migration complexity. It depends entirely on how you are using it.",[12,5865,5866,5869],{},[27,5867,5868],{},"Simple HTTP Lambda"," (API Gateway → Lambda → response) is trivial. Becomes an endpoint on your app server. The function code changes little — framework handler in place of Lambda handler. One to two hours per function.",[12,5871,5872,5875],{},[27,5873,5874],{},"Event-driven Lambda"," (S3 triggers Lambda, SQS triggers Lambda, EventBridge schedules Lambda) is the expensive part. For S3 events, R2 offers events via Cloudflare Workers — you rewrite the Lambda as a Worker and keep the pattern. For SQS, it becomes a consumer on the app server. For scheduled EventBridge, it becomes a cron in the orchestrator.",[12,5877,5878],{},"Worst scenario: complex Lambda with chained EventBridge, Step Functions and dead letter queues. Here it is a redesign. Reserve a week or two and design a simpler event model — usually the system gets better.",[19,5880,5882],{"id":5881},"realistic-six-to-eight-week-schedule","Realistic six-to-eight-week schedule",[12,5884,5885],{},"Order matters. Starting with the database is a temptation and a trap — database is last to migrate, not first.",[12,5887,5888,5891],{},[27,5889,5890],{},"Week 1 — Inventory and decision."," List the twelve services, note current cost, identify integrations between them. Choose an alternative for each. 
One-page document with the mapping table. No code yet.",[12,5893,5894,5897],{},[27,5895,5896],{},"Week 2 — Provisioning destination in parallel."," Bring up the VPS, install the orchestrator (HeroCtl or similar), configure test DNS pointing to a subdomain. Bring up Postgres, Valkey, Cloudflare R2. Everything empty. Smoke test: a \"hello world\" running.",[12,5899,5900,5903,5904,5907],{},[27,5901,5902],{},"Week 3 — Storage migration."," S3 to R2 with ",[231,5905,5906],{},"rclone",". Usually slow (volume) but very low risk. Application still reads from S3, but you validate that R2 is synchronized. By end of week, dual-write — application writes to both.",[12,5909,5910,5913],{},[27,5911,5912],{},"Week 4 — Database migration."," Logical Postgres replica from RDS to destination. Cutover in a short maintenance window — usually minutes, not hours, with logical replication working. Application points to new database. RDS stays as hot standby for a week.",[12,5915,5916,5919],{},[27,5917,5918],{},"Week 5 — Web application migration."," Apps running on EC2 become jobs in the orchestrator. Integrated router plays the ALB role. DNS points to the orchestrator (or to Cloudflare in front of it). Gradual cutover using weighted DNS.",[12,5921,5922,5925],{},[27,5923,5924],{},"Week 6 — Queues and async jobs."," SQS leaves, Redis Streams or RabbitMQ enters. Workers run in the orchestrator. Period of dual-consume to ensure no message is dropped.",[12,5927,5928,5931],{},[27,5929,5930],{},"Week 7 — Lambdas and event-driven workloads."," The most variable week. HTTP Lambdas migrate quickly. Event-driven Lambdas require the redesign discussed above. If you have more than ten complex Lambdas, consider extending to two weeks.",[12,5933,5934,5937],{},[27,5935,5936],{},"Week 8 — Final cutover, intensive monitoring, decommission."," Cloudflare in front replaces CloudFront. Route 53 becomes Cloudflare DNS. CloudWatch goes to Prometheus + Loki. 
Last thing: turn off the old EC2s and close the AWS account — or leave a minimum balance if you still keep some residual service.",[19,5939,5941],{"id":5940},"the-five-lock-ins-that-hurt-most-in-the-migration","The five lock-ins that hurt most in the migration",[12,5943,5944],{},"Honesty matters: not everything migrates easily. Five things require extra work and sometimes change project viability:",[67,5946,5947,5953,5959,5965,5971],{},[70,5948,5949,5952],{},[27,5950,5951],{},"DynamoDB with specific features."," GSI, Streams, scan limits, TTL. There is no direct equivalent. The realistic path is redesign to Postgres with JSONB, or to a self-hosted NoSQL (FoundationDB, ScyllaDB) — re-architecture, not migration.",[70,5954,5955,5958],{},[27,5956,5957],{},"Aurora-only features."," Aurora Serverless v2 with auto-scaling of connections, Aurora Global Database, Aurora I\u002FO optimized. Self-hosted Postgres does almost everything, but doesn't have the instant auto-scaling. For spiky workloads, consider Neon (which offers a similar pattern).",[70,5960,5961,5964],{},[27,5962,5963],{},"Complex cross-service IAM."," Teams using cross-account IAM roles, Service Control Policies and hierarchical account organization have access control embedded in the architecture. Migrating requires reimplementing the hierarchy elsewhere — Vault, Cloudflare Access, or orchestrator secret injection. Count days, not hours.",[70,5966,5967,5970],{},[27,5968,5969],{},"Lambda + complex EventBridge."," Event pipelines with multiple hops, retries, dead letter queues. Doesn't migrate as is. Redesign around queues (RabbitMQ, NATS) and persistent workers. Usually the system gets simpler — but takes time.",[70,5972,5973,5976],{},[27,5974,5975],{},"S3 events triggering Lambda."," Very common pattern, and R2 with Cloudflare Workers covers most cases. 
For workloads that need exactly-once guarantee or strong ordering, switch to a queue pattern — producer writes event to queue when file is confirmed, worker consumes.",[19,5978,5980],{"id":5979},"the-savings-calculation-without-optimism","The savings calculation, without optimism",[12,5982,5983],{},"Typical Brazilian SaaS scenario with five applications:",[12,5985,5986],{},[27,5987,5988],{},"Before on AWS:",[2734,5990,5991,5994,5997,6000,6003,6006,6009,6012,6015,6018,6021],{},[70,5992,5993],{},"Five EC2 t3.medium: R$750",[70,5995,5996],{},"RDS db.t4g.large Multi-AZ: R$1,400",[70,5998,5999],{},"ElastiCache cache.t4g.micro: R$75",[70,6001,6002],{},"S3 with 100GB and average egress: R$300",[70,6004,6005],{},"ALB: R$110",[70,6007,6008],{},"CloudFront with average volume: R$400",[70,6010,6011],{},"Route 53 + SES: R$75",[70,6013,6014],{},"CloudWatch logs\u002Fmetrics: R$250",[70,6016,6017],{},"Lambda with average volume: R$200",[70,6019,6020],{},"NAT Gateway: R$200",[70,6022,6023],{},[27,6024,6025],{},"Total: R$3,760\u002Fmonth = R$45,120\u002Fyear",[12,6027,6028],{},[27,6029,6030],{},"After self-hosted:",[2734,6032,6033,6036,6039,6042,6045,6048,6051,6054,6057],{},[70,6034,6035],{},"Four Hetzner CPX21 VPS with orchestrator: R$176",[70,6037,6038],{},"Self-hosted Postgres (included on the VPS): R$0",[70,6040,6041],{},"Valkey (included): R$0",[70,6043,6044],{},"Cloudflare R2 50GB with unlimited egress: R$75",[70,6046,6047],{},"Cloudflare CDN + DNS: R$0",[70,6049,6050],{},"Resend for email: R$100",[70,6052,6053],{},"Prometheus + Loki (included): R$0",[70,6055,6056],{},"Queue workers (included): R$0",[70,6058,6059],{},[27,6060,6061],{},"Total: R$351\u002Fmonth = R$4,212\u002Fyear",[12,6063,6064],{},"Savings: R$3,409\u002Fmonth, R$40,908\u002Fyear. Roughly one month of senior engineer salary.",[12,6066,6067],{},"The migration consumes eighty to one hundred sixty hours. In senior internal dev hours, between sixteen and thirty-two thousand reais. 
Payback in five to ten months, with perpetual savings afterwards.",[19,6069,6071],{"id":6070},"the-most-non-obvious-migration-secrets-and-credentials","The most non-obvious migration: secrets and credentials",[12,6073,6074],{},"Worth repeating, because it is what most surprises an AWS-experienced team. On AWS you access S3 without credentials in code — the EC2's IAM role resolves it. You access RDS via IAM authentication. You access Parameter Store via IAM. The team loses awareness that this \"magic\" exists.",[12,6076,6077],{},"Outside AWS, every credential is explicit. The application needs:",[2734,6079,6080,6083,6086,6089,6092],{},[70,6081,6082],{},"Access key and secret for R2 (created in Cloudflare panel)",[70,6084,6085],{},"Connection string with user and password for Postgres",[70,6087,6088],{},"Valkey URL with password",[70,6090,6091],{},"API key for Resend",[70,6093,6094],{},"Token for Cloudflare API if you automate DNS",[12,6096,6097],{},"The orchestrator solution is to declare all of that as secrets injected into the container as environment variables. The secret is encrypted at rest in the orchestrator and never appears in logs. For automatic rotation and sophisticated auditing, self-hosted Vault enters the game — but most teams don't need it.",[12,6099,6100],{},"Plan: make a spreadsheet with all the credentials each app needs, create each at the destination provider, declare each as a secret in the orchestrator, inject into the container. Four to eight hours for a medium stack.",[19,6102,6104],{"id":6103},"when-not-to-migrate-honest-profiles","When NOT to migrate (honest profiles)",[12,6106,6107],{},"Four situations where leaving AWS is the wrong decision:",[12,6109,6110,6113],{},[27,6111,6112],{},"Compliance that lists AWS by name."," FedRAMP, ITAR, certain American government contracts and some financial certifications require infra to run on pre-approved components — and most lists include AWS, GCP, Azure, and a few additional providers. 
If your client is an American federal agency, AWS resolves a slice of compliance that would cost months to replicate elsewhere.",[12,6115,6116,6119],{},[27,6117,6118],{},"Single team focused on product."," If you are the only dev and are building the product, eight weeks redirected to migration kill the roadmap. Do it when you have the second dev, or when AWS costs come to represent a significant slice of MRR. Before that, AWS is expensive but manageable.",[12,6121,6122,6125],{},[27,6123,6124],{},"AWS costs below 2% of MRR."," A bill of one thousand reais monthly for a startup billing one hundred thousand. The savings are real but the effort isn't worth the focus. Migrate when the bill exceeds five to ten percent of MRR — at that point the gain covers the opportunity cost.",[12,6127,6128,6131],{},[27,6129,6130],{},"Deep lock-in in DynamoDB or Aurora Serverless v2."," Already addressed above. If half your architecture is DynamoDB with Streams, you don't migrate — you re-architect. That's a different project, with different scope, different decision.",[19,6133,6135],{"id":6134},"hybrid-strategy-alternative-for-those-not-wanting-to-migrate-everything","Hybrid strategy — alternative for those not wanting to migrate everything",[12,6137,6138],{},"Teams with fifty or more applications on AWS rarely migrate in one block. A hybrid strategy works better:",[2734,6140,6141,6144,6147,6150],{},[70,6142,6143],{},"Keep on AWS what is expensive to move (Aurora with specific features, critical Lambda, DynamoDB)",[70,6145,6146],{},"Move what is cheap to move and expensive to maintain (S3 → R2, CloudFront → Cloudflare, non-critical EC2 → VPS)",[70,6148,6149],{},"Establish VPN or private connection between the two endpoints",[70,6151,6152],{},"Partial savings but zero risk of radical migration",[12,6154,6155],{},"Typical result: cut of forty to sixty percent of the AWS bill, without touching critical pieces. 
For a company paying fifty thousand monthly, that is twenty to thirty thousand back — and the rest migrates organically over the following twelve months, as teams rewrite components for other reasons.",[19,6157,6159],{"id":6158},"heroctl-as-destination-what-changes-in-practice","HeroCtl as destination — what changes in practice",[12,6161,6162],{},"HeroCtl is a container orchestrator that runs on any Linux server with Docker. Four VPS running HeroCtl deliver an operational experience close to what you would have with managed ECS — without managed billing, without lock-in.",[12,6164,6165],{},"What it replaces:",[2734,6167,6168,6173,6179,6185],{},[70,6169,6170,6172],{},[27,6171,5601],{}," becomes the HeroCtl integrated router, with automatic Let's Encrypt TLS",[70,6174,6175,6178],{},[27,6176,6177],{},"Partial CloudWatch"," becomes embedded metrics and native centralized logs",[70,6180,6181,6184],{},[27,6182,6183],{},"RDS automated backups"," becomes managed backup on Business Edition",[70,6186,6187,6190],{},[27,6188,6189],{},"IAM roles in apps"," becomes secret injection in the job manifest",[12,6192,6193],{},"What stays the same: Docker running your app exactly as it runs on ECS. Environment variables, healthchecks, rolling deploys, multi-replicas. The application doesn't notice the difference.",[12,6195,6196,6197,6199,6200,6202,6203,6205],{},"There are three plans. ",[27,6198,4351],{}," is permanent free, no server or job limit — runs the entire stack described above including real high availability, router, certificates, metrics and logs. ",[27,6201,4355],{}," adds SSO, granular RBAC, detailed auditing, managed backup and SLA support — useful for those who already have formal platform requirements. ",[27,6204,4359],{}," adds source code escrow, 24×7 support and dedicated development. 
Business and Enterprise pricing is published on the plans page, without mandatory \"talk to sales\".",[12,6207,6208],{},"The public demo cluster runs on four servers and coordinator election happens in around seven seconds when the current node falls — measured number, not estimated.",[12,6210,5406],{},[19,6212,6214],{"id":6213},"questions-we-get-about-leaving-aws","Questions we get about leaving AWS",[368,6216,6218],{"id":6217},"how-long-does-it-really-take-to-migrate-a-medium-stack","How long does it really take to migrate a medium stack?",[12,6220,6221],{},"For a startup with five to ten applications, without deep lock-ins: six to eight weeks with a senior dev devoting half time, or three to four weeks with full dedication. Larger stacks or with complex event-driven Lambdas: three to four months. Stacks with critical DynamoDB or Aurora Serverless v2: turn it into a re-architecture project, six-month timeline or more.",[368,6223,6225],{"id":6224},"does-dynamodb-have-a-good-alternative","Does DynamoDB have a good alternative?",[12,6227,6228],{},"There is no identical substitute. The honest options are: Postgres with JSONB for most cases (resolves eighty percent of DynamoDB uses with excellent performance), self-hosted ScyllaDB or Cassandra for workloads that really need distributed NoSQL, FoundationDB for those needing distributed transactions. None of these is \"change the connection string and done\" — they require changes in the data model.",[368,6230,6232],{"id":6231},"can-i-keep-aws-for-the-database-and-move-compute","Can I keep AWS for the database and move compute?",[12,6234,6235],{},"Yes, and it is the most common hybrid strategy. Aurora or RDS stays on AWS, EC2s become Hetzner or DigitalOcean VPS, S3 becomes R2. You open VPN between the two endpoints and the app continues accessing RDS via private endpoint. 
Savings typically of fifty to seventy percent of the AWS bill.",[368,6237,6239],{"id":6238},"s3-r2-how-much-does-it-cost-to-transfer-1tb","S3 → R2: how much does it cost to transfer 1TB?",[12,6241,6242,6243,6245],{},"R2 charges zero for ingress. AWS charges for S3 egress — approximately nine cents of dollar per GB on the first 10 TB. One terabyte costs about ninety dollars to leave AWS, R$450. Transfer time: twelve to twenty-four hours with parallelized ",[231,6244,5906],{},", depending on bandwidth. After migration, R$75 monthly storing 50GB with unlimited egress, against R$600 for the same on S3 with active traffic.",[368,6247,6249],{"id":6248},"lambda-how-to-migrate-event-driven","Lambda — how to migrate event-driven?",[12,6251,6252,6253,6256,6257,6260,6261,6264,6265,6268],{},"Depends on the trigger. ",[27,6254,6255],{},"S3 triggering Lambda"," becomes R2 with Cloudflare Workers (same pattern, no radical change). ",[27,6258,6259],{},"SQS triggering Lambda"," becomes a persistent worker on the app server, consuming from the queue — usually simpler code than the original Lambda. ",[27,6262,6263],{},"Scheduled EventBridge"," becomes cron in the orchestrator. ",[27,6266,6267],{},"EventBridge with complex rules and chained Step Functions"," requires redesign — design the flow around a central queue with consumer workers, becomes more auditable.",[368,6270,6272],{"id":6271},"rds-multi-az-self-hosted-postgres-is-reliable","RDS Multi-AZ → self-hosted Postgres is reliable?",[12,6274,6275],{},"Postgres with physical streaming replication and failover via Patroni reaches reliability close to RDS Multi-AZ — provided the team knows how to operate. If nobody on the team masters Postgres in production, the safest path is Neon or Supabase, which deliver managed Postgres with free tier. For teams with SRE or DBA, self-hosted is viable and saves substantially. 
For teams without that competence, the savings don't compensate for the risk — pay for managed.",[368,6277,6279],{"id":6278},"email-ses-who-is-cheaper","Email SES → who is cheaper?",[12,6281,6282],{},"Depends on volume. For up to 10k monthly emails, Postmark at US$15 delivers much more (superior deliverability, better dashboard, responsive support). Between 50k and 100k monthly, Resend at US$20 is the best cost-benefit. Above 500k monthly, Mailgun or Amazon SES compete on price — and SES might make sense to keep even after migrating the rest. Email is one of the few AWS services that it can be rational to keep.",[368,6284,6286],{"id":6285},"dns-all-cloudflare-or-mix","DNS — all Cloudflare or mix?",[12,6288,6289],{},"Cloudflare resolves DNS, CDN, DDoS, WAF and Workers on the free plan. For most stacks, concentrating everything there simplifies operation and cuts cost. The exception is compliance that requires geographic provider separation — some governance frameworks ask for DNS and CDN to be from distinct providers. In that case, Cloudflare DNS + Bunny CDN (or Fastly) fulfills the separation.",[368,6291,6293],{"id":6292},"does-lgpd-compliance-change-anything","Does LGPD compliance change anything?",[12,6295,6296],{},"LGPD doesn't require hosting on Brazilian soil. It requires that you know where the data is and that you have an adequate contract with the operator. Hetzner (Germany), DigitalOcean (multiple regions), Cloudflare R2 (multi-region) and Magalu Cloud (Brazil) are all LGPD-compatible provided the contract is in order. For those preferring data on national soil due to client preference, Magalu Cloud is the direct alternative.",[12,6298,5406],{},[19,6300,6302],{"id":6301},"concrete-next-step","Concrete next step",[12,6304,6305],{},"If you got this far, the next step is the spreadsheet. List the twelve services, mark which your stack uses, note current cost of each, choose an alternative. 
In an afternoon you know if migration is worth the effort.",[12,6307,6308],{},"When you are ready to provision the destination:",[224,6310,6311],{"className":226,"code":5318,"language":228,"meta":229,"style":229},[231,6312,6313],{"__ignoreMap":229},[234,6314,6315,6317,6319,6321,6323],{"class":236,"line":237},[234,6316,1220],{"class":247},[234,6318,2957],{"class":251},[234,6320,5329],{"class":255},[234,6322,2963],{"class":383},[234,6324,2966],{"class":247},[12,6326,6327],{},"Runs on any Linux server with Docker. The first three become quorum for the replicated control plane. You submit jobs via CLI, API or embedded web panel. The cluster decides where to run, does health check, manages rolling deploys, issues Let's Encrypt certificates automatically.",[12,6329,6330,6331,2402,6335,101],{},"For additional context on cost and architecture, also read ",[3336,6332,6334],{"href":6333},"\u002Fen\u002Fblog\u002Faws-ecs-vs-kubernetes-vs-self-hosted","AWS ECS vs Kubernetes vs self-hosted",[3336,6336,6338],{"href":6337},"\u002Fen\u002Fblog\u002Fhow-much-to-host-a-brazilian-saas-2026","How much does it cost to host a Brazilian SaaS in 2026",[12,6340,6341],{},"The migration is more annoying than difficult. 
The hard part is deciding to start.",[3350,6343,4376],{},{"title":229,"searchDepth":244,"depth":244,"links":6345},[6346,6347,6348,6349,6363,6364,6365,6366,6367,6368,6369,6370,6381],{"id":5390,"depth":244,"text":5391},{"id":5409,"depth":244,"text":5410},{"id":5422,"depth":244,"text":5423},{"id":5506,"depth":244,"text":5507,"children":6350},[6351,6352,6353,6354,6355,6356,6357,6358,6359,6360,6361,6362],{"id":5719,"depth":271,"text":5720},{"id":5729,"depth":271,"text":5730},{"id":5752,"depth":271,"text":5753},{"id":5762,"depth":271,"text":5763},{"id":5779,"depth":271,"text":5780},{"id":5789,"depth":271,"text":5790},{"id":5799,"depth":271,"text":5800},{"id":5806,"depth":271,"text":5807},{"id":5816,"depth":271,"text":5817},{"id":5829,"depth":271,"text":5830},{"id":5849,"depth":271,"text":5850},{"id":5859,"depth":271,"text":5860},{"id":5881,"depth":244,"text":5882},{"id":5940,"depth":244,"text":5941},{"id":5979,"depth":244,"text":5980},{"id":6070,"depth":244,"text":6071},{"id":6103,"depth":244,"text":6104},{"id":6134,"depth":244,"text":6135},{"id":6158,"depth":244,"text":6159},{"id":6213,"depth":244,"text":6214,"children":6371},[6372,6373,6374,6375,6376,6377,6378,6379,6380],{"id":6217,"depth":271,"text":6218},{"id":6224,"depth":271,"text":6225},{"id":6231,"depth":271,"text":6232},{"id":6238,"depth":271,"text":6239},{"id":6248,"depth":271,"text":6249},{"id":6271,"depth":271,"text":6272},{"id":6278,"depth":271,"text":6279},{"id":6285,"depth":271,"text":6286},{"id":6292,"depth":271,"text":6293},{"id":6301,"depth":244,"text":6302},"case-study","2026-05-26","Migrating from AWS to a cheaper cloud (Hetzner\u002FDO) or self-hosted seems like a 1-year project. 
In practice, you can do it in 6-8 weeks if you map the 12 AWS-only services your stack actually uses.",{},"\u002Fen\u002Fblog\u002Fleaving-aws-without-rewriting-the-stack","16 min",{"title":5382,"description":6384},{"loc":6386},"en\u002Fblog\u002Fleaving-aws-without-rewriting-the-stack",[6392,6393,6394,888,6395],"aws","migration","cost","guide","sR1IiiNXy6Y_l6sjdp00CGJCMeQQZCQ1jpzt6L_dOxw",{"id":6398,"title":6399,"author":7,"body":6400,"category":3378,"cover":3379,"date":7497,"description":7498,"draft":3382,"extension":3383,"lastReviewed":3379,"meta":7499,"navigation":411,"path":7500,"readingTime":4401,"seo":7501,"sitemap":7502,"stem":7503,"tags":7504,"__hash__":7508},"blog_en\u002Fen\u002Fblog\u002Fredis-in-production-managed-vs-self-hosted.md","Redis (and Valkey) in production: managed vs self-hosted in 2026",{"type":9,"value":6401,"toc":7469},[6402,6416,6418,6438,6442,6445,6448,6454,6461,6465,6468,6474,6480,6486,6489,6495,6500,6505,6508,6514,6519,6524,6527,6533,6538,6548,6552,6555,6599,6603,6606,6610,6613,6645,6648,6652,6655,6669,6676,6680,6694,6697,6701,6734,6737,6741,6744,6758,6765,6769,6772,6853,6856,6860,6863,6869,6875,6882,6885,6889,6892,6924,6927,6931,6938,6941,6944,7235,7239,7242,7268,7272,7275,7301,7305,7308,7329,7338,7341,7344,7348,7373,7387,7393,7402,7408,7414,7420,7426,7428,7431,7437,7440,7456,7467],[12,6403,6404,6405,6408,6409,2402,6412,6415],{},"The question \"managed or self-hosted Redis?\" became another question at the end of March 2024. That's when the company behind Redis switched the license from Apache 2.0 \u002F BSD to a combination of RSAL with SSPL — a pair of \"source available\" licenses designed to prevent cloud providers from offering Redis as a service without commercial licensing. The reaction was quick: the Linux Foundation launched ",[27,6406,6407],{},"Valkey"," as a direct fork from the last BSD version, with AWS, Google, and Oracle backing the development. 
In parallel, projects that already existed — ",[27,6410,6411],{},"KeyDB",[27,6413,6414],{},"Dragonfly"," — started appearing more frequently in benchmarks of companies reassessing their stack.",[19,6417,22],{"id":21},[12,6419,6420,6421,6424,6425,6427,6428,6430,6431,6433,6434,6437],{},"In 2026, \"Redis in production\" became a category with four implementations disputing the same protocol: ",[27,6422,6423],{},"Redis OSS"," (BSD pre-2024 or RSAL post), ",[27,6426,6407],{}," (BSD, drop-in via fork), ",[27,6429,6411],{}," (multi-thread, old fork), and ",[27,6432,6414],{}," (BSL, rewritten from scratch in C++). Self-hosting any of the four costs between R$30 and R$130 per month on Hetzner VPS. The managed path costs from R$75 (ElastiCache micro) to R$1,000\u002Fmonth (13 GB instance), plus Upstash with serverless billing varying US$0–100\u002Fmonth. For Brazilian startup with MRR below R$200k, ",[27,6435,6436],{},"self-hosted Valkey"," on its own cluster saves between R$300 and R$1,500 per month compared to managed, eliminates RSAL license exposure, and maintains full compatibility with Redis clients. Switching the stack after adopting the commercial version is real pain — starting with the OSS-friendly version is the bet with the lowest exit cost. This post compares the four products, the three managed paths (ElastiCache, Upstash, Redis Cloud), and the minimum configuration to run Valkey in production without losing sleep.",[19,6439,6441],{"id":6440},"the-short-story-of-the-license-change","The short story of the license change",[12,6443,6444],{},"Before March 2024, \"Redis\" was the dominant OSS cache: BSD, gigantic ecosystem, present in any stack that had ever fit the word \"Rails\" or \"Node\" on a résumé. The commercial vendor — Redis Inc, formerly Redis Labs — lived well off the managed product and the paid modules (Search, JSON, TimeSeries).",[12,6446,6447],{},"Then came the announcement: version 7.4 onward would ship under RSAL + SSPL, no longer BSD. 
In practical terms, the change directly targeted AWS, Google, and Azure. The reading inside the open source community was different: \"if it happened to Redis, it can happen to any VC-funded project\". It was the third recent case — after Elastic in 2021 and MongoDB in 2018 — where a project that seemed consolidated changed the rules.",[12,6449,6450,6451,6453],{},"The Linux Foundation was quick. Five days after the announcement, ",[27,6452,6407],{}," was formed as a fork of the last BSD version (7.2.4), with independent governance and weighty backers: AWS, Google Cloud, Oracle, Ericsson, Snap. In just over a year, AWS had already migrated ElastiCache's default engine to Valkey. Google Memorystore followed. In 2026, Valkey stopped being an \"experimental fork\" and became a reference in its own right — with 7.x and 8.x versions already incorporating its own optimizations that weren't even offered to Redis OSS.",[12,6455,6456,6457,6460],{},"The operational lesson for those choosing cache today: ",[27,6458,6459],{},"the mainstream moved",". There's no longer the inertia of \"no one was fired for choosing Redis\" — the question in the architecture interview became \"why Redis and not Valkey?\". And the honest answer, in most cases, is \"habit\".",[19,6462,6464],{"id":6463},"what-are-the-four-products-disputing-this-market","What are the four products competing in this market?",[368,6466,6423],{"id":6467},"redis-oss",[12,6469,6470,6473],{},[27,6471,6472],{},"The original."," Versions before 7.4 are still under BSD and remain usable indefinitely — no one revokes a license retroactively. Versions 7.4 onward ship under RSAL\u002FSSPL.",[12,6475,6476,6479],{},[27,6477,6478],{},"Pros",": still huge community, battle-tested in production for over a decade, richer ecosystem (modules, integrations, books, talks). 
Almost every client library is tested against Redis OSS first.",[12,6481,6482,6485],{},[27,6483,6484],{},"Cons",": the RSAL prevents offering-as-a-service without commercial licensing. For those operating Redis for internal use, that's irrelevant — the restriction is about resale. The real risk is strategic: if the vendor changed the license once, they can change it again. Adopting Redis OSS in 2026 means betting that the next critical feature will land in the open branch instead of staying in the commercial product.",[368,6487,6407],{"id":6488},"valkey",[12,6490,6491,6494],{},[27,6492,6493],{},"The Linux Foundation fork."," Took the code from 7.2.4 BSD and kept developing. Drop-in replacement at the protocol level: no client needs to change a line of code to swap Redis for Valkey.",[12,6496,6497,6499],{},[27,6498,6478],{},": permanent BSD guaranteed by neutral governance (it's not a company, it's a foundation). Big backers align incentives to keep the project healthy. Technical parity with Redis 7.x and growing development speed.",[12,6501,6502,6504],{},[27,6503,6484],{},": the brand is still being built — some third-party plugins and very specific SDKs still only list \"Redis\" in the README. In 2026 that's increasingly cosmetic, but it can show up in old integrations needing minor adaptation.",[368,6506,6411],{"id":6507},"keydb",[12,6509,6510,6513],{},[27,6511,6512],{},"The multi-thread fork."," Has existed since 2019, was acquired by Snap in 2022, and lives today as a Snap-Telemetry project. The architectural difference is fundamental: Redis OSS and Valkey are single-thread by design (one main thread processes all commands). KeyDB runs multi-thread by default.",[12,6515,6516,6518],{},[27,6517,6478],{},": on CPUs with 4+ cores, KeyDB delivers 2 to 3 times more throughput than single-thread Redis on the same hardware. The API is compatible, so the client doesn't change. 
For CPU-bound workloads with high volume, it's the obvious choice.",[12,6520,6521,6523],{},[27,6522,6484],{},": smaller community, and the pace of adopting new Redis features usually lags quarters behind. Some new Redis features (Functions, certain extensions) take time to appear in KeyDB.",[368,6525,6414],{"id":6526},"dragonfly",[12,6528,6529,6532],{},[27,6530,6531],{},"The rewrite."," Not a fork — it's a new implementation in modern C++, with a hash table designed for caching (not Redis's generic structure), using io_uring on Linux for asynchronous I\u002FO. Compatibility at the protocol level, not at the code level.",[12,6534,6535,6537],{},[27,6536,6478],{},": claims of 25× throughput in specific benchmarks (heavy pipelines on modern hardware). Real memory efficiency — 2 to 3 times more data in the same RAM as Redis. None of the implicit GIL of a single-thread design; scales vertically on a machine with 32+ cores.",[12,6539,6540,6542,6543,6547],{},[27,6541,6484],{},": BSL (Business Source License) — code stays closed for 4 years before becoming Apache 2.0. It's exactly the same license pattern that caught other projects in the orchestration industry by surprise, which we covered in our post on ",[3336,6544,6546],{"href":6545},"\u002Fen\u002Fblog\u002Fwhy-we-built-heroctl","why we built HeroCtl",". Some commands are still incompatible with Redis in edge cases (complex Lua scripts, certain cluster operations).",[19,6549,6551],{"id":6550},"which-to-choose-for-a-new-project-in-2026","Which to choose for a new project in 2026?",[12,6553,6554],{},"The short decision tree:",[2734,6556,6557,6566,6574,6582,6591],{},[70,6558,6559,6562,6563,6565],{},[27,6560,6561],{},"Sensible default",": ",[27,6564,6407],{},". Permanent BSD, Redis parity, client doesn't need to change, future guaranteed by big backers. 
There's no technical reason to prefer Redis OSS for a new project in 2026.",[70,6567,6568,6562,6571,6573],{},[27,6569,6570],{},"Critical performance",[27,6572,6414],{},", if the application sustains more than 100k operations per second and the team accepts the BSL license risk.",[70,6575,6576,6562,6579,6581],{},[27,6577,6578],{},"Multi-thread without rewrite",[27,6580,6411],{},", if the bottleneck is CPU on big hardware and the team prefers not to migrate to Dragonfly.",[70,6583,6584,6562,6587,6590],{},[27,6585,6586],{},"Extreme simplicity (1 VPS, low volume)",[27,6588,6589],{},"Redis OSS 7.2.4 BSD"," still works perfectly. Crystallized as a stable version; it will run on any Debian\u002FAlpine for the next five years without complaining.",[70,6592,6593,6562,6596,6598],{},[27,6594,6595],{},"Migrating from Redis Labs managed",[27,6597,6407],{}," is drop-in. Zero code changes. The migration is only operational — replication, DNS swap, rollback if necessary.",[19,6600,6602],{"id":6601},"managed-vs-self-hosted-the-math-without-frills","Managed vs self-hosted: the math without frills",[12,6604,6605],{},"The numbers below are list prices as of May 2026, at an exchange rate of R$5\u002FUSD.",[368,6607,6609],{"id":6608},"aws-elasticache","AWS ElastiCache",[12,6611,6612],{},"Pricing grows in steps per instance:",[2734,6614,6615,6624,6630,6639],{},[70,6616,6617,6620,6621],{},[231,6618,6619],{},"cache.t4g.micro"," (1 GB): about US$15\u002Fmonth = ",[27,6622,6623],{},"R$75\u002Fmonth",[70,6625,6626,6629],{},[231,6627,6628],{},"cache.t4g.small"," (2 GB): US$30\u002Fmonth = R$150\u002Fmonth",[70,6631,6632,6635,6636],{},[231,6633,6634],{},"cache.r6g.large"," (13 GB): about US$200\u002Fmonth = ",[27,6637,6638],{},"R$1,000\u002Fmonth",[70,6640,6641,6644],{},[231,6642,6643],{},"cache.r6g.xlarge"," (26 GB): about US$400\u002Fmonth = R$2,000\u002Fmonth",[12,6646,6647],{},"Multi-AZ doubles the price (replica in another zone). Automatic backup is included. 
Real Multi-AZ failover is the main argument — you pay not to have to think about it.",[368,6649,6651],{"id":6650},"upstash","Upstash",[12,6653,6654],{},"Serverless billing per command:",[2734,6656,6657,6660,6663,6666],{},[70,6658,6659],{},"Free tier: 256 MB, 500k commands\u002Fday",[70,6661,6662],{},"Pay-as-you-go: US$0.2 per 100k commands",[70,6664,6665],{},"For a startup with medium volume (1M commands\u002Fday): about US$60\u002Fmonth = R$300\u002Fmonth",[70,6667,6668],{},"For an app with low peaks: can stay between US$0 and US$10\u002Fmonth",[12,6670,6671,6672,6675],{},"The unique operational advantage: ",[27,6673,6674],{},"zero pre-allocated capacity",". If the app sleeps, the bill sleeps. For Vercel\u002FCloudflare Workers, it's the natural complement. For sustained and predictable load, it ends up more expensive than ElastiCache.",[368,6677,6679],{"id":6678},"redis-cloud-direct-offer-from-redis-inc","Redis Cloud (direct offer from Redis Inc)",[2734,6681,6682,6685,6691],{},[70,6683,6684],{},"Essentials Plan 30MB: free",[70,6686,6687,6688],{},"Pro Plan 5GB single-region: about US$50\u002Fmonth = ",[27,6689,6690],{},"R$250\u002Fmonth",[70,6692,6693],{},"Pro Plan 10GB multi-AZ: about US$120\u002Fmonth = R$600\u002Fmonth",[12,6695,6696],{},"Includes commercial modules (Search, JSON, TimeSeries) that don't exist in Valkey or Redis OSS. If you use those modules, there's no direct alternative — it's Redis Cloud, or buy a commercial license and self-host.",[368,6698,6700],{"id":6699},"self-hosted-on-hetzner","Self-hosted on Hetzner",[2734,6702,6703,6713,6719,6728],{},[70,6704,6705,6708,6709,6712],{},[27,6706,6707],{},"CPX21"," (3 vCPU, 4 GB RAM, 80 GB SSD): €7.99 = ",[27,6710,6711],{},"R$44\u002Fmonth",". 
Fits 2 GB Valkey with room to spare.",[70,6714,6715,6718],{},[27,6716,6717],{},"CPX31"," (4 vCPU, 8 GB RAM, 160 GB SSD): €13.99 = R$78\u002Fmonth.",[70,6720,6721,6724,6725,101],{},[27,6722,6723],{},"Cluster of 3 CPX21 for Valkey + Sentinel HA",": 3 × €7.99 = €24\u002Fmonth = ",[27,6726,6727],{},"R$130\u002Fmonth",[70,6729,6730,6733],{},[27,6731,6732],{},"Cluster of 3 CPX31 for serious data",": €42\u002Fmonth = R$230\u002Fmonth.",[12,6735,6736],{},"For DigitalOcean, Linode, Vultr, multiply by approximately 1.5×. For AWS EC2, multiply by 2×. But in any case it stays cheaper than the equivalent managed.",[368,6738,6740],{"id":6739},"practical-difference","Practical difference",[12,6742,6743],{},"For 8 GB cache workload with replication:",[2734,6745,6746,6749,6752,6755],{},[70,6747,6748],{},"ElastiCache Multi-AZ: ~R$1,000\u002Fmonth",[70,6750,6751],{},"Redis Cloud Pro Multi-AZ: ~R$600\u002Fmonth",[70,6753,6754],{},"Self-hosted Valkey on 3× Hetzner CPX31: R$230\u002Fmonth",[70,6756,6757],{},"Single-node Valkey on 1× Hetzner CPX31 + S3 backup: R$80\u002Fmonth",[12,6759,6760,6761,6764],{},"Whoever chooses the managed path pays ",[27,6762,6763],{},"3 to 10 times more"," for the same throughput. The difference is what you buy with that: contractual SLA, automatic multi-AZ failover, absence of 3 a.m. pager. For a small team, that may be worth the price. 
For a team that already operates Linux servers in production, it usually isn't.",[19,6766,6768],{"id":6767},"minimum-production-grade-valkey-stack","Minimum production-grade Valkey stack",[12,6770,6771],{},"Configuration that withstands real production without theater:",[2734,6773,6774,6780,6789,6804,6810,6816,6822,6828],{},[70,6775,6776,6779],{},[27,6777,6778],{},"Container or systemd service on a dedicated VPS."," Don't share the machine with the application — cache and app compete for RAM, and when it goes wrong it goes wrong for both at the same time.",[70,6781,6782,6788],{},[27,6783,6784,6787],{},[231,6785,6786],{},"maxmemory"," configured"," between 50 and 70% of available RAM. Leaving memory for the system and network buffers is more important than having the last megabytes for cache.",[70,6790,6791,6562,6796,6799,6800,6803],{},[27,6792,6793],{},[231,6794,6795],{},"maxmemory-policy",[231,6797,6798],{},"allkeys-lru"," if pure cache mode (throw out old keys when full). ",[231,6801,6802],{},"noeviction"," if storage mode (queue, sessions) — there, prefer a write error over silently losing data.",[70,6805,6806,6809],{},[27,6807,6808],{},"AOF persistence"," if the load is a job queue (Sidekiq, BullMQ, Resque). Without AOF, a restart loses any job that was queued but unprocessed. RDB is insufficient in that scenario because snapshots are periodic.",[70,6811,6812,6815],{},[27,6813,6814],{},"Sufficient RDB"," if the load is pure cache (Rails cache, Django cache). If restarting and losing the cache only means \"slow request for a few seconds while it warms up\", AOF is unnecessary overhead.",[70,6817,6818,6821],{},[27,6819,6820],{},"Async replication to standby"," on a second node. Manual failover with internal DNS swap is acceptable for many cases. Automatic failover requires Sentinel or Cluster.",[70,6823,6824,6827],{},[27,6825,6826],{},"AOF + RDB backup to S3"," or compatible, daily. 
Restic or rclone handle this well.",[70,6829,6830,6833,6834,6837,6838,571,6841,571,6844,571,6847,571,6850,101],{},[27,6831,6832],{},"Monitoring"," with ",[231,6835,6836],{},"redis_exporter"," exporting to Prometheus + alerts on Grafana or similar. Critical metrics: ",[231,6839,6840],{},"connected_clients",[231,6842,6843],{},"used_memory",[231,6845,6846],{},"evicted_keys",[231,6848,6849],{},"keyspace_hits\u002Fmisses",[231,6851,6852],{},"latency_percentiles",[12,6854,6855],{},"This setup runs comfortably on a CPX21 (R$44\u002Fmonth) serving 50k+ ops\u002Fs sustained for an average Brazilian app.",[19,6857,6859],{"id":6858},"sentinel-or-cluster","Sentinel or Cluster?",[12,6861,6862],{},"A question that confuses many teams coming to Redis for the first time.",[12,6864,6865,6868],{},[27,6866,6867],{},"Sentinel",": 1 master + N replicas + 3+ sentinel processes monitoring. Automatic failover when the master falls — a sentinel detects it, the sentinels vote, a replica becomes master, and clients receive the new endpoint via discovery. All on a single shard — the entire dataset fits on one node.",[12,6870,6871,6874],{},[27,6872,6873],{},"Cluster",": dataset partitioned into 16384 slots distributed across 3+ masters. Each master has its own replicas. Multi-shard, horizontal capacity scaling — you can have 100 GB total with no individual node holding more than 20 GB.",[12,6876,6877,6878,6881],{},"The practical rule: ",[27,6879,6880],{},"Sentinel is enough up to a ~100 GB dataset",". Above that, Cluster is necessary. For most Brazilian startups, Sentinel is the right choice for simplicity — Cluster adds real complexity (keys need hashtags for multi-key operations, Lua scripts get restricted to a slot, some clients have bugs in cluster mode).",[12,6883,6884],{},"Don't use Cluster as a status symbol. 
Use Sentinel until the metrics force your hand.",[19,6886,6888],{"id":6887},"sidekiq-bullmq-and-friends-patterns","Sidekiq, BullMQ and friends patterns",[12,6890,6891],{},"Real-world use, not a marketing diagram:",[2734,6893,6894,6900,6906,6912,6918],{},[70,6895,6896,6899],{},[27,6897,6898],{},"Sidekiq Ruby",": Redis needs AOF. Without AOF, any crash loses queued jobs that haven't yet been picked up. Sidekiq Pro adds \"reliable fetch\", which helps — but the backstop is still AOF.",[70,6901,6902,6905],{},[27,6903,6904],{},"BullMQ Node",": similar. AOF is essential for durability. BullMQ uses data structures that depend on Redis transactional atomicity — a restart without AOF can leave the queue in an inconsistent state.",[70,6907,6908,6911],{},[27,6909,6910],{},"Resque Ruby",": the father of all. AOF necessary for the same reasons.",[70,6913,6914,6917],{},[27,6915,6916],{},"Pure cache (Rails.cache, Django cache, Laravel cache)",": can run without AOF, RDB is sufficient. Losing cache on restart is acceptable.",[70,6919,6920,6923],{},[27,6921,6922],{},"Pure pub\u002Fsub",": doesn't even need persistence. Pub\u002Fsub is fire-and-forget by design.",[12,6925,6926],{},"Mixing cache and queue use on the same Redis works — just configure AOF (the most demanding load dictates the configuration). But for a serious workload, separating into two instances (one for cache without AOF, another for queue with AOF) is cleaner. Operationally cheap if there's already an orchestrator running.",[19,6928,6930],{"id":6929},"is-elasticache-sao-paulo-reliable","Is ElastiCache São Paulo reliable?",[12,6932,6933,6934,6937],{},"Yes — 99.99% contractual uptime SLA, multi-AZ in São Paulo region (",[231,6935,6936],{},"sa-east-1","), automatic backup, tested failover. Latency from a Brazilian app to ElastiCache São Paulo stays at 1-3ms, indistinguishable from local Redis for most workloads.",[12,6939,6940],{},"The weak point isn't technical reliability, it's cost and lock-in. 
AWS Brazil charges about 30% more than North American regions for the same resource. And migrating from ElastiCache to another provider later involves dump\u002Frestore + coordinated cutover — not apocalypse, but it's weekend work.",[19,6942,6943],{"id":3836},"Comparison table: 12 criteria",[119,6945,6946,6966],{},[122,6947,6948],{},[125,6949,6950,6952,6954,6956,6958,6960,6962,6964],{},[128,6951,2982],{},[128,6953,6423],{},[128,6955,6407],{},[128,6957,6411],{},[128,6959,6414],{},[128,6961,5445],{},[128,6963,6651],{},[128,6965,5573],{},[141,6967,6968,6993,7017,7039,7064,7084,7106,7127,7150,7173,7194,7215],{},[125,6969,6970,6973,6976,6979,6981,6984,6987,6990],{},[146,6971,6972],{},"License",[146,6974,6975],{},"RSAL\u002FSSPL (7.4+)",[146,6977,6978],{},"BSD",[146,6980,6978],{},[146,6982,6983],{},"BSL → Apache 4 years",[146,6985,6986],{},"Commercial AWS",[146,6988,6989],{},"Commercial Upstash",[146,6991,6992],{},"Permanent BSD",[125,6994,6995,6998,7001,7003,7006,7008,7011,7014],{},[146,6996,6997],{},"Threading",[146,6999,7000],{},"Single",[146,7002,7000],{},[146,7004,7005],{},"Multi",[146,7007,7005],{},[146,7009,7010],{},"Single (engine 7)",[146,7012,7013],{},"Serverless",[146,7015,7016],{},"Configurable",[125,7018,7019,7022,7025,7027,7029,7032,7034,7037],{},[146,7020,7021],{},"Redis client compat.",[146,7023,7024],{},"100%",[146,7026,7024],{},[146,7028,7024],{},[146,7030,7031],{},"95%+",[146,7033,7024],{},[146,7035,7036],{},"100% (subset of commands)",[146,7038,7024],{},[125,7040,7041,7044,7047,7049,7052,7055,7058,7061],{},[146,7042,7043],{},"Baseline throughput",[146,7045,7046],{},"100k ops\u002Fs",[146,7048,7046],{},[146,7050,7051],{},"250k ops\u002Fs",[146,7053,7054],{},"1M+ ops\u002Fs",[146,7056,7057],{},"depends on inst.",[146,7059,7060],{},"depends on plan",[146,7062,7063],{},"100-250k 
ops\u002Fs",[125,7065,7066,7068,7070,7072,7074,7076,7079,7082],{},[146,7067,6808],{},[146,7069,3064],{},[146,7071,3064],{},[146,7073,3064],{},[146,7075,3064],{},[146,7077,7078],{},"Yes (snapshot)",[146,7080,7081],{},"Managed",[146,7083,3064],{},[125,7085,7086,7089,7091,7093,7095,7097,7100,7103],{},[146,7087,7088],{},"Replication",[146,7090,3064],{},[146,7092,3064],{},[146,7094,3064],{},[146,7096,3064],{},[146,7098,7099],{},"Multi-AZ",[146,7101,7102],{},"Multi-region",[146,7104,7105],{},"Yes (manual config)",[125,7107,7108,7111,7114,7116,7118,7120,7123,7125],{},[146,7109,7110],{},"Automatic failover",[146,7112,7113],{},"Sentinel\u002FCluster",[146,7115,7113],{},[146,7117,7113],{},[146,7119,6873],{},[146,7121,7122],{},"Built-in",[146,7124,7122],{},[146,7126,7113],{},[125,7128,7129,7132,7135,7137,7139,7141,7144,7147],{},[146,7130,7131],{},"Cost 8GB\u002Fmonth (R$)",[146,7133,7134],{},"80 (VPS)",[146,7136,7134],{},[146,7138,7134],{},[146,7140,7134],{},[146,7142,7143],{},"1000 (Multi-AZ)",[146,7145,7146],{},"300-500",[146,7148,7149],{},"80-230",[125,7151,7152,7155,7158,7160,7162,7165,7168,7171],{},[146,7153,7154],{},"Lock-in",[146,7156,7157],{},"Medium (license)",[146,7159,3154],{},[146,7161,3154],{},[146,7163,7164],{},"Medium (BSL)",[146,7166,7167],{},"High (AWS)",[146,7169,7170],{},"High (Upstash API)",[146,7172,3154],{},[125,7174,7175,7178,7181,7183,7185,7187,7190,7192],{},[146,7176,7177],{},"Premium modules",[146,7179,7180],{},"Paid",[146,7182,3055],{},[146,7184,3055],{},[146,7186,3055],{},[146,7188,7189],{},"Add-on $$",[146,7191,3061],{},[146,7193,3055],{},[125,7195,7196,7199,7202,7204,7206,7208,7211,7213],{},[146,7197,7198],{},"Operational",[146,7200,7201],{},"You",[146,7203,7201],{},[146,7205,7201],{},[146,7207,7201],{},[146,7209,7210],{},"AWS",[146,7212,6651],{},[146,7214,7201],{},[125,7216,7217,7220,7222,7224,7226,7228,7231,7233],{},[146,7218,7219],{},"SLA 
support",[146,7221,7180],{},[146,7223,4351],{},[146,7225,4351],{},[146,7227,7180],{},[146,7229,7230],{},"Included",[146,7232,7230],{},[146,7234,7201],{},[19,7236,7238],{"id":7237},"when-managed-still-makes-sense","When managed still makes sense",[12,7240,7241],{},"Honesty is the defense mechanism of any technical recommendation. There are four profiles where paying for managed is the right choice:",[2734,7243,7244,7250,7256,7262],{},[70,7245,7246,7249],{},[27,7247,7248],{},"Team without operational capacity for Redis cluster."," If no one in the company knows how to debug a master that no longer responds, or interpret RDB fork latency, or take care of AOF backup — paying AWS to do that is rational. It's not an excuse, it's division of labor.",[70,7251,7252,7255],{},[27,7253,7254],{},"Compliance requiring SOC2\u002FISO certified vendor."," Audit asking for \"certified vendor X\" doesn't accept \"we run Valkey on a Hetzner VPS\". The path is ElastiCache, Redis Cloud, or similar with certifications in the contract.",[70,7257,7258,7261],{},[27,7259,7260],{},"Volume needing instant scale."," Application going from 100 req\u002Fs to 100k req\u002Fs in 5 minutes due to viral campaign — Upstash's serverless path is where it shines. Self-hosted needs reserved capacity beforehand; serverless grows on the fly.",[70,7263,7264,7267],{},[27,7265,7266],{},"Fully serverless application."," If the app runs on Vercel or Cloudflare Workers and Redis also needs to be serverless by billing model, Upstash is practically the only sane option. 
Connecting edge functions to a Redis on a VPS means painful cold-start latency.",[19,7269,7271],{"id":7270},"when-self-hosting-is-obvious","When self-hosting is obvious",[12,7273,7274],{},"And four profiles where paying for managed is a waste:",[2734,7276,7277,7283,7289,7295],{},[70,7278,7279,7282],{},[27,7280,7281],{},"Startup with R$10k–R$200k MRR optimizing cost."," The difference between R$80\u002Fmonth and R$1,000\u002Fmonth of cache is 1% of the total cost of a small SaaS; it's also about 11 hours of a developer's salary. Worth doing the math.",[70,7284,7285,7288],{},[27,7286,7287],{},"Predictable workload."," If cache volume grows 10% per month, there's no advantage in serverless scaling. Reserved capacity on a VPS is cheaper and more predictable.",[70,7290,7291,7294],{},[27,7292,7293],{},"Team has 1+ person comfortable with Linux\u002FDocker."," If there's already someone who operates Postgres, nginx, Docker — Redis\u002FValkey is easier than any of them. The learning curve is days, not weeks.",[70,7296,7297,7300],{},[27,7298,7299],{},"Already have own cluster."," If the company runs an orchestrator (HeroCtl, Coolify, similar platform) with spare nodes, Valkey becomes just another job. Marginal cost close to zero — you already pay for the nodes.",[19,7302,7304],{"id":7303},"heroctl-as-infrastructure-for-valkey","HeroCtl as infrastructure for Valkey",[12,7306,7307],{},"For those operating HeroCtl, running Valkey in production is a short configuration exercise. 
A ~30-line file describes a job with:",[2734,7309,7310,7313,7316,7319,7322],{},[70,7311,7312],{},"Official Valkey 8.x container",[70,7314,7315],{},"Replicated named volume between nodes (data survives a kill -9 of the server)",[70,7317,7318],{},"Reserved resources (RAM and CPU) with hard limits",[70,7320,7321],{},"Health check on Valkey ping",[70,7323,7324,7325,7328],{},"Internal routing between services (the app talks to ",[231,7326,7327],{},"valkey.servico.local"," without exposing a port to the internet)",[12,7330,7331,7332,7334,7335,7337],{},"Automated AOF + RDB backup to S3-compatible storage is available in the ",[27,7333,4355],{}," plan — without setting up external restic, without manual cron on the host. Valkey metrics come out via ",[231,7336,6836],{}," running as a sidecar and appear in the internal Prometheus (already included as a job of the cluster itself, no external stack).",[12,7339,7340],{},"Sentinel failover is integrated with the orchestrator's control plane: if the Valkey master node falls, the cluster detects it in around 7 seconds and the replica is promoted. The app's configuration is updated via service discovery — no manual redeploy.",[12,7342,7343],{},"For a startup with 4 servers running the orchestrator, this setup replaces an entire ElastiCache Multi-AZ setup at zero marginal cost (the servers are already there). The real monthly difference is the salary-equivalent of one person, depending on the size of the operation.",[19,7345,7347],{"id":7346},"questions-we-get","Questions we get",[12,7349,7350,7353,7354,571,7357,571,7360,571,7363,571,7366,571,7369,7372],{},[27,7351,7352],{},"Is Valkey compatible with Redis client libraries?","\nYes, in 100% of practical cases. The protocol is identical — ",[231,7355,7356],{},"redis-cli",[231,7358,7359],{},"node-redis",[231,7361,7362],{},"ioredis",[231,7364,7365],{},"redis-rb",[231,7367,7368],{},"redis-py",[231,7370,7371],{},"go-redis",", all work without changing a line. What changes is just the endpoint. 
In 2026, several libraries already announce explicit support for Valkey in the README, but that's cosmetic — the protocol is the same.",[12,7374,7375,7378,7379,7382,7383,7386],{},[27,7376,7377],{},"Can I migrate from managed Redis Labs to self-hosted Valkey without downtime?","\nYes, with replication. Configure Valkey as Redis Labs replica (",[231,7380,7381],{},"REPLICAOF host port","), wait for sync (a few minutes to hours depending on dataset), promote Valkey to master (",[231,7384,7385],{},"REPLICAOF NO ONE","), do internal DNS cutover, decommission Redis Labs after observation period. Real error window is seconds during the swap.",[12,7388,7389,7392],{},[27,7390,7391],{},"Is Dragonfly worth the BSL risk?","\nDepends on the company's horizon. BSL converts to Apache 2.0 after 4 years by the standard model — so today's code will be open by 2030. The risk is that the company behind it (DragonflyDB Inc) follows the path of Redis Inc and makes the conversion less friendly. For workloads that demand performance Valkey doesn't deliver (above 500k sustained ops\u002Fs), Dragonfly may be the right choice despite the risk. For the rest, Valkey is more conservative.",[12,7394,7395,7398,7399,7401],{},[27,7396,7397],{},"How much RAM does a Redis with 1 GB of useful data consume?","\nPractical math: 1 GB dataset occupies between 1.3 and 2 GB of real RAM (structure overhead, fragmentation, client buffers, replication backlog). Configuring ",[231,7400,6786],{}," at 60% of available RAM is a safe rule — 4 GB instance fits ~2.5 GB of useful data with room to spare.",[12,7403,7404,7407],{},[27,7405,7406],{},"Does Sidekiq really need AOF? Sidekiq docs say it can run without.","\nThe docs say it technically runs. In production, without AOF, any unexpected restart loses queued jobs that were in the buffer. For \"welcome email\" queue, you discover when customer complains. For \"recurring billing\" queue, you discover when the accountant complains. 
AOF is cheap (a 5-10% I\u002FO increment); the cost of not having it is large.",[12,7409,7410,7413],{},[27,7411,7412],{},"Cluster vs Sentinel for an app processing 50k jobs\u002Fday?","\nSentinel. 50k jobs\u002Fday is 0.6 ops\u002Fs on average — it fits in 100 MB of Redis RAM. Cluster is overkill by an order of magnitude. Sentinel solves automatic failover with 1 master + 1 replica + 3 sentinels (3 sentinel processes on separate VPSes, which can coexist with other things).",[12,7415,7416,7419],{},[27,7417,7418],{},"Does ElastiCache São Paulo have good latency for an app running in São Paulo?","\nYes, 1-3ms p99 within the same AZ. The problem isn't latency — it's cost and lock-in. Latency only becomes a topic if the app is on another provider (Hetzner FSN, DigitalOcean NYC) trying to talk to ElastiCache São Paulo — there it rises to 130-200ms and the argument disappears.",[12,7421,7422,7425],{},[27,7423,7424],{},"How to back up self-hosted Valkey to survive disaster?","\nThree layers. First: persistent AOF on local disk (survives restart). Second: daily RDB snapshot copied to S3-compatible storage (Wasabi, Backblaze B2, Cloudflare R2 — all cheaper than AWS S3 for this case). Third: weekly snapshot copied to another storage provider (second region, second vendor). Restic or rclone do the work. Total storage cost for a 4 GB Valkey backup: about US$1\u002Fmonth.",[19,7427,3309],{"id":3308},[12,7429,7430],{},"In 2026, \"Redis in production\" became a question with more nuance than it had in 2023. The original product's license changed, the Linux Foundation fork matured, the multi-thread alternatives are holding up, and the serverless offering has a real use case. Choosing among the four implementations and the three managed paths is an honest exercise — there's no single answer.",[12,7432,7433,7434,7436],{},"Our default recommendation for a Brazilian startup in 2026: ",[27,7435,6436],{}," on its own cluster, Sentinel mode, AOF on if there's a queue, monitoring with Prometheus. 
Cost in the R$80–R$230\u002Fmonth range, against R$600–R$2,000\u002Fmonth for equivalent managed alternatives. Full compatibility with any Redis library. No exposure to RSAL license. Reversible migration if it becomes a problem.",[12,7438,7439],{},"To stand up this stack:",[224,7441,7442],{"className":226,"code":2948,"language":228,"meta":229,"style":229},[231,7443,7444],{"__ignoreMap":229},[234,7445,7446,7448,7450,7452,7454],{"class":236,"line":237},[234,7447,1220],{"class":247},[234,7449,2957],{"class":251},[234,7451,2960],{"class":255},[234,7453,2963],{"class":383},[234,7455,2966],{"class":247},[12,7457,7458,7459,7463,7464,7466],{},"And to read in parallel: ",[3336,7460,7462],{"href":7461},"\u002Fen\u002Fblog\u002Fpostgres-in-production-managed-vs-self-hosted","Postgres in production: managed vs self-hosted"," (same analysis for the database) and ",[3336,7465,6338],{"href":6337}," (the consolidated math of the whole stack).",[3350,7468,4376],{},{"title":229,"searchDepth":244,"depth":244,"links":7470},[7471,7472,7473,7479,7480,7487,7488,7489,7490,7491,7492,7493,7494,7495,7496],{"id":21,"depth":244,"text":22},{"id":6440,"depth":244,"text":6441},{"id":6463,"depth":244,"text":6464,"children":7474},[7475,7476,7477,7478],{"id":6467,"depth":271,"text":6423},{"id":6488,"depth":271,"text":6407},{"id":6507,"depth":271,"text":6411},{"id":6526,"depth":271,"text":6414},{"id":6550,"depth":244,"text":6551},{"id":6601,"depth":244,"text":6602,"children":7481},[7482,7483,7484,7485,7486],{"id":6608,"depth":271,"text":6609},{"id":6650,"depth":271,"text":6651},{"id":6678,"depth":271,"text":6679},{"id":6699,"depth":271,"text":6700},{"id":6739,"depth":271,"text":6740},{"id":6767,"depth":244,"text":6768},{"id":6858,"depth":244,"text":6859},{"id":6887,"depth":244,"text":6888},{"id":6929,"depth":244,"text":6930},{"id":3836,"depth":244,"text":6943},{"id":7237,"depth":244,"text":7238},{"id":7270,"depth":244,"text":7271},{"id":7303,"depth":244,"text":7304},{"id":7346,"depth":244,"text":7347}
,{"id":3308,"depth":244,"text":3309},"2026-05-20","Redis changed its license in 2024, Valkey was born as an OSS fork, Dragonfly hits benchmarks. In 2026, choosing cache is no longer choosing Redis — it's choosing between 4 products. Honest analysis with costs.",{},"\u002Fen\u002Fblog\u002Fredis-in-production-managed-vs-self-hosted",{"title":6399,"description":7498},{"loc":7500},"en\u002Fblog\u002Fredis-in-production-managed-vs-self-hosted",[7505,6488,7506,7507,3378],"redis","cache","self-hosted","yY3ROe_Afo2prSU4Eu_g3AmHDl997tSuL6uAW4GbUDo",{"id":7510,"title":7511,"author":7,"body":7512,"category":8756,"cover":3379,"date":8757,"description":8758,"draft":3382,"extension":3383,"lastReviewed":3379,"meta":8759,"navigation":411,"path":8760,"readingTime":8761,"seo":8762,"sitemap":8763,"stem":8764,"tags":8765,"__hash__":8770},"blog_en\u002Fen\u002Fblog\u002Fgithub-actions-vs-gitlab-ci-vs-drone.md","GitHub Actions vs GitLab CI vs Drone: which CI\u002FCD to pick for a Brazilian startup",{"type":9,"value":7513,"toc":8719},[7514,7521,7524,7528,7544,7550,7556,7562,7565,7568,7572,7575,7578,7602,7605,7609,7616,7620,7652,7656,7659,7704,7707,7710,7713,7717,7720,7723,7743,7746,7750,7753,7757,7793,7797,7800,7805,7854,7857,7863,7877,7880,7884,7887,7891,7933,7937,7963,7967,7970,7977,7981,7984,8104,8111,8114,8118,8121,8182,8185,8187,8191,8370,8373,8377,8380,8384,8390,8394,8400,8403,8407,8413,8416,8420,8426,8430,8436,8440,8443,8446,8472,8482,8485,8489,8493,8496,8512,8516,8519,8530,8533,8537,8540,8543,8547,8550,8568,8571,8573,8578,8581,8586,8589,8594,8597,8602,8605,8610,8613,8618,8621,8626,8633,8638,8641,8646,8649,8651,8655,8658,8690,8693,8698,8701,8712],[12,7515,7516,7517,7520],{},"The CI\u002FCD choice in 2026 is no longer about \"which tool has more features\". All three serious ones — GitHub Actions, GitLab CI, Drone (and its fork Woodpecker) — do the basics well. 
The real choice is about ",[27,7518,7519],{},"where your pain is going to show up first",": on the bill at the end of the month, on workflow complexity when the monorepo grows, or when you have to bring up a runner nobody understands when the senior dev goes on vacation.",[12,7522,7523],{},"This post is an honest comparison for Brazilian tech leads deciding CI\u002FCD in 2026. No artificial ranking, no column where one tool is \"champion\" at everything. Explicit tradeoffs, numbers in reais, and a recommendation per profile at the end.",[19,7525,7527],{"id":7526},"tldr-200-words","TL;DR (200 words)",[12,7529,7530,7531,571,7534,571,7537,7540,7541,101],{},"The CI\u002FCD decision in 2026 follows four forces: ",[27,7532,7533],{},"where the code is hosted",[27,7535,7536],{},"minute cost",[27,7538,7539],{},"workflow complexity",", and ",[27,7542,7543],{},"willingness to operate self-hosted",[12,7545,7546,7549],{},[27,7547,7548],{},"GitHub Actions"," won absolute mindshare for projects on GitHub. It's free on public repos and includes 2000 minutes\u002Fmonth on private ones; after that it costs US$0.008\u002Fmin on a Linux runner — between US$5 and US$30\u002Fmonth for a typical startup (R$25 to R$150). The marketplace has ten thousand ready-made actions. The Achilles heel is the minute pricing when volume grows.",[12,7551,7552,7555],{},[27,7553,7554],{},"GitLab CI"," is more complete: native job dependency graph, parent-child pipelines, better monorepo handling, included image registry, embedded security scanning. Self-hosted (Community Edition) is free but requires 4 to 8 GB of RAM and active operation. SaaS Premium is US$29\u002Fuser\u002Fmonth — expensive for a large team.",[12,7557,7558,7561],{},[27,7559,7560],{},"Drone\u002FWoodpecker"," self-hosted is the option for cutting variable cost to zero. A R$30 to R$80\u002Fmonth server runs CI for five to ten projects. The cost is in ops: you operate the runners.",[12,7563,7564],{},"For a small BR startup on GitHub, start on the Actions free plan. 
When it passes US$30\u002Fmonth, consider Woodpecker self-hosted. For a company that values CI + issue tracker + registry in a single product, GitLab self-hosted.",[12,7566,7567],{},"━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━",[19,7569,7571],{"id":7570},"why-does-this-decision-matter-more-than-it-seems","Why does this decision matter more than it seems?",[12,7573,7574],{},"CI\u002FCD is the most-used infrastructure of any product team: every commit touches the system, every PR depends on it, every deploy goes through it. A wrong choice doesn't break you in the first month — it breaks you in the third year, when migrating costs four weeks of two people and you pay that while the roadmap stalls.",[12,7576,7577],{},"The three symptoms that indicate the choice was wrong:",[67,7579,7580,7586,7596],{},[70,7581,7582,7585],{},[27,7583,7584],{},"The bill grows faster than the team."," If CI cost doubles every six months without deploy volume justifying it, the pricing model isn't yours.",[70,7587,7588,7591,7592,7595],{},[27,7589,7590],{},"Workflows become copy-paste."," If every new project starts with ",[231,7593,7594],{},"cp -r .github\u002Fworkflows\u002F",", the tool has no decent composition.",[70,7597,7598,7601],{},[27,7599,7600],{},"CI failures take more than an hour to debug."," If reproducing the error locally requires running a Docker image nobody knows how to set up, the build isn't portable.",[12,7603,7604],{},"The three main competitors solve these symptoms in different ways. Let's go piece by piece.",[19,7606,7608],{"id":7607},"github-actions-the-de-facto-standard-is-it-worth-the-price","GitHub Actions: the de facto standard — is it worth the price?",[12,7610,7611,7612,7615],{},"If your code is on GitHub, Actions has a structural advantage that can't be ignored: zero integration friction. You create ",[231,7613,7614],{},".github\u002Fworkflows\u002Fci.yml",", push, and you're running. 
No separate signup, no cross-service token, no webhook to configure.",[368,7617,7619],{"id":7618},"what-actions-does-well","What Actions does well",[2734,7621,7622,7628,7634,7640,7646],{},[70,7623,7624,7627],{},[27,7625,7626],{},"Huge marketplace."," More than ten thousand ready actions for common tasks: setup of Node, Python, Go; deploy to AWS, GCP, Azure; image signing; security scanning. Most are maintained by the technology vendor itself (HashiCorp publishes its own, AWS publishes its own, etc).",[70,7629,7630,7633],{},[27,7631,7632],{},"Matrix builds."," Running the same suite against five Node versions or three operating systems takes three lines of YAML.",[70,7635,7636,7639],{},[27,7637,7638],{},"Reusable workflows."," Since 2021 you can extract workflows shared across repos in the same organization — solves the \"copy-paste between projects\" problem for medium teams.",[70,7641,7642,7645],{},[27,7643,7644],{},"Deployment protection rules."," Manual approvals, time windows, branch restriction — all configurable without plugins.",[70,7647,7648,7651],{},[27,7649,7650],{},"Self-hosted runners."," You can run the agent on your own infra and use the Actions UI purely as the orchestrator. 
Solves the minute problem for high-volume teams.",[368,7653,7655],{"id":7654},"what-actions-charges-dearly-for","What Actions charges dearly for",[12,7657,7658],{},"The billing model is per minute, and the numbers matter:",[119,7660,7661,7670],{},[122,7662,7663],{},[125,7664,7665,7668],{},[128,7666,7667],{},"Runner type",[128,7669,136],{},[141,7671,7672,7680,7688,7696],{},[125,7673,7674,7677],{},[146,7675,7676],{},"Linux 2 vCPU (default)",[146,7678,7679],{},"US$0.008\u002Fmin (R$0.04)",[125,7681,7682,7685],{},[146,7683,7684],{},"Windows 2 vCPU",[146,7686,7687],{},"US$0.016\u002Fmin (R$0.08)",[125,7689,7690,7693],{},[146,7691,7692],{},"macOS (required for iOS builds)",[146,7694,7695],{},"US$0.08\u002Fmin (R$0.40)",[125,7697,7698,7701],{},[146,7699,7700],{},"Larger Linux (4 vCPU+)",[146,7702,7703],{},"US$0.016\u002Fmin and above",[12,7705,7706],{},"For a startup on GitHub with five devs and a reasonable workflow (build + test + lint on each PR), typical consumption is 800 to 2500 minutes\u002Fmonth on Linux. That's US$6 to US$20\u002Fmonth — that is, between R$30 and R$100. Fits in the \"dev tools\" line without pain.",[12,7708,7709],{},"When it hurts: heavy workflows (E2E with Playwright, Rust builds, tests that bring up Postgres + Redis on each job) easily pass 10 thousand minutes\u002Fmonth. At US$0.008\u002Fmin that becomes US$80\u002Fmonth — R$400. Multiply by 12 and you're paying R$5 thousand\u002Fyear on CI.",[12,7711,7712],{},"macOS builds are the worst case: US$0.08\u002Fmin is ten times the Linux rate. Teams maintaining iOS apps spend three to four times more on CI than on production infra.",[368,7714,7716],{"id":7715},"do-actions-self-hosted-runners-solve-it","Do Actions self-hosted runners solve it?",[12,7718,7719],{},"Partially. You run the runner binary on your own machine, register it with the repo or organization, and jobs go there instead of the managed pool. 
Minute cost goes to zero — you only pay for the machine.",[12,7721,7722],{},"But three catches:",[67,7724,7725,7731,7737],{},[70,7726,7727,7730],{},[27,7728,7729],{},"Runner maintenance."," The version updates frequently; outdated runners start failing silently. Without automation, it becomes manual ops work.",[70,7732,7733,7736],{},[27,7734,7735],{},"Manual scaling."," If the team has five devs opening 20 simultaneous PRs, one runner serializes everything. You need N runners — and provisioning\u002Fdeprovisioning on demand requires additional tooling.",[70,7738,7739,7742],{},[27,7740,7741],{},"Security on public repos."," Self-hosted runners on a public repo are an open door for any malicious fork to run arbitrary code on your machine. Restrict them to private repos and trusted organizations, always.",[12,7744,7745],{},"The mature solution is Actions Runner Controller (ARC): an operator that brings up runners on-demand on a Kubernetes cluster or similar. Solves scaling, but adds an entire infrastructure layer — not trivial.",[19,7747,7749],{"id":7748},"gitlab-ci-does-the-heavyweight-competitor-still-make-sense","GitLab CI: does the \"heavyweight\" competitor still make sense?",[12,7751,7752],{},"GitLab CI is older than Actions, more complete in features, and less popular outside teams already on the GitLab platform. The right question isn't \"is GitLab CI better than Actions?\", it's \"is it worth migrating to GitLab to use GitLab CI?\"",[368,7754,7756],{"id":7755},"what-gitlab-ci-does-better","What GitLab CI does better",[2734,7758,7759,7769,7775,7781,7787],{},[70,7760,7761,7764,7765,7768],{},[27,7762,7763],{},"Dependency graph (DAG)."," Native, without external tooling. You declare ",[231,7766,7767],{},"needs: [job_a, job_b]"," and jobs run in parallel respecting dependencies. 
For workflows with 30+ jobs (large monorepo, multiple languages, multi-environment deploy), this is the difference between 8 minutes and 25 minutes per pipeline.",[70,7770,7771,7774],{},[27,7772,7773],{},"Parent-child pipelines."," A large pipeline can trigger child pipelines with conditional logic — useful for monorepos where only changed services need to build.",[70,7776,7777,7780],{},[27,7778,7779],{},"Included image registry."," Each project comes with a native private container registry. No configuring secrets for Amazon ECR, Docker Hub, or similar.",[70,7782,7783,7786],{},[27,7784,7785],{},"Pages, security scanning, code quality, dependency scanning"," — all embedded in the platform. In Actions each is a separate marketplace action.",[70,7788,7789,7792],{},[27,7790,7791],{},"Deep merge request integration."," Pipelines appear inside the MR with coverage diff, bundle size comparison, build time comparison. In Actions checks appear as links — in GitLab they're structured data.",[368,7794,7796],{"id":7795},"where-gitlab-ci-charges-dearly","Where GitLab CI charges dearly",[12,7798,7799],{},"Two dimensions.",[12,7801,7802],{},[27,7803,7804],{},"SaaS pricing:",[119,7806,7807,7819],{},[122,7808,7809],{},[125,7810,7811,7814,7816],{},[128,7812,7813],{},"Plan",[128,7815,136],{},[128,7817,7818],{},"Monthly minute limit",[141,7820,7821,7832,7843],{},[125,7822,7823,7826,7829],{},[146,7824,7825],{},"Free",[146,7827,7828],{},"US$0\u002Fuser",[146,7830,7831],{},"400 minutes",[125,7833,7834,7837,7840],{},[146,7835,7836],{},"Premium",[146,7838,7839],{},"US$29\u002Fuser\u002Fmonth (R$145)",[146,7841,7842],{},"10,000 minutes",[125,7844,7845,7848,7851],{},[146,7846,7847],{},"Ultimate",[146,7849,7850],{},"US$99\u002Fuser\u002Fmonth (R$495)",[146,7852,7853],{},"50,000 minutes",[12,7855,7856],{},"For a five-dev team on Premium, that's US$145\u002Fmonth — R$725. That's just the entry ticket; extra minutes cost separately. 
For a team of 20, US$580\u002Fmonth = R$2,900 just on subscription.",[12,7858,7859,7862],{},[27,7860,7861],{},"Self-hosted Community Edition"," is free and removes that license cost — but:",[2734,7864,7865,7868,7871,7874],{},[70,7866,7867],{},"Realistic minimum: 4 vCPU, 8 GB RAM (16 GB if you'll use registry + pages + scanning).",[70,7869,7870],{},"Adequate VPS in Brazil: R$120 to R$250\u002Fmonth.",[70,7872,7873],{},"Ops: 2 to 4 hours\u002Fmonth on updates, backup, monitoring.",[70,7875,7876],{},"Monthly updates. GitLab has a rigid cadence; staying three versions behind opens documented security holes.",[12,7878,7879],{},"In real production, self-hosted GitLab is less work than Kubernetes but more than Actions SaaS. It's a real server that you operate.",[19,7881,7883],{"id":7882},"drone-ci-and-woodpecker-the-minimalist-alternative","Drone CI and Woodpecker: the minimalist alternative",[12,7885,7886],{},"Drone CI was born in 2014 as the \"container-native CI\": each pipeline step is a container, no magic. In 2020 the company behind it (Drone Inc.) was acquired by Harness, and the product gained a commercial Cloud version. The community fork Woodpecker remains 100% open-source, with API compatible with Drone.",[368,7888,7890],{"id":7889},"what-dronewoodpecker-does-well","What Drone\u002FWoodpecker does well",[2734,7892,7893,7902,7915,7921,7927],{},[70,7894,7895,7898,7899,7901],{},[27,7896,7897],{},"Simple YAML."," Each step declares an image and a command. No DSL, no reusable actions with their own semantics. What you run locally with ",[231,7900,2405],{}," is what runs on CI.",[70,7903,7904,7907,7908,2402,7911,7914],{},[27,7905,7906],{},"Container-native."," There's no Java \"executor\", no Python agent running steps. Each step is an isolated container. 
Reproducing the error locally is literal: copy the ",[231,7909,7910],{},"image:",[231,7912,7913],{},"commands:"," from the YAML and run them in the terminal.",[70,7916,7917,7920],{},[27,7918,7919],{},"Self-hosted from day one."," There's no \"free Drone Cloud\" pulling features into the paid version. The server + runners are the whole product.",[70,7922,7923,7926],{},[27,7924,7925],{},"Plugins via container."," Each plugin (SSH deploy, Slack, Docker push, AWS) is a published image. Versioned like any other dependency.",[70,7928,7929,7932],{},[27,7930,7931],{},"Supports multiple code hosts."," GitHub, GitLab, Bitbucket, Gitea, Forgejo — all on the same Drone server.",[368,7934,7936],{"id":7935},"where-dronewoodpecker-charges","Where Drone\u002FWoodpecker charges dearly",[2734,7938,7939,7945,7951,7957],{},[70,7940,7941,7944],{},[27,7942,7943],{},"Smaller community."," When you hit an obscure bug, Stack Overflow has five answers, not fifty. GitHub issues are your main source.",[70,7946,7947,7950],{},[27,7948,7949],{},"Non-trivial operation at scale."," One server + one runner is easy. Five autoscaling runners behind a queue is tooling you assemble — auto-scaling isn't built-in.",[70,7952,7953,7956],{},[27,7954,7955],{},"Drone Cloud is paid."," If you want SaaS, you go to Harness; the free tier is limited. That's why the recommendation is always self-hosted.",[70,7958,7959,7962],{},[27,7960,7961],{},"Modest documentation."," It covers the happy path; edge cases you discover by reading the code.",[368,7964,7966],{"id":7965},"why-woodpecker-instead-of-drone-in-2026","Why Woodpecker instead of Drone in 2026",[12,7968,7969],{},"Vanilla Drone still works, but Harness has prioritized the commercial cloud version. Woodpecker is the community fork of the original Drone — 100% open-source, no paid version pulling features, active monthly releases, engaged community. 
API and YAML compatible with Drone, so migration is trivial: swap the server URL.",[12,7971,7972,7973,7976],{},"For any small team self-hosting in 2026, ",[27,7974,7975],{},"Woodpecker is the better choice than vanilla Drone",". Same architecture, without the overhead of a company controlling the roadmap.",[19,7978,7980],{"id":7979},"which-is-cheaper-in-2026","Which is cheaper in 2026?",[12,7982,7983],{},"Real total monthly cost, considering a five-dev team with medium volume (300 builds\u002Fmonth, average 8-minute Linux builds):",[119,7985,7986,8002],{},[122,7987,7988],{},[125,7989,7990,7993,7996,7999],{},[128,7991,7992],{},"Option",[128,7994,7995],{},"Fixed cost",[128,7997,7998],{},"Variable cost",[128,8000,8001],{},"Estimated total\u002Fmonth",[141,8003,8004,8019,8032,8046,8060,8074,8088],{},[125,8005,8006,8009,8012,8015],{},[146,8007,8008],{},"Woodpecker self-hosted (VPS R$80)",[146,8010,8011],{},"R$80",[146,8013,8014],{},"R$0",[146,8016,8017],{},[27,8018,8011],{},[125,8020,8021,8024,8026,8028],{},[146,8022,8023],{},"Actions public repos (open-source)",[146,8025,8014],{},[146,8027,8014],{},[146,8029,8030],{},[27,8031,8014],{},[125,8033,8034,8037,8039,8042],{},[146,8035,8036],{},"Actions private repos (free tier 2000 min)",[146,8038,8014],{},[146,8040,8041],{},"R$0 to R$50",[146,8043,8044],{},[27,8045,8041],{},[125,8047,8048,8051,8053,8056],{},[146,8049,8050],{},"Actions Linux paid (medium volume)",[146,8052,8014],{},[146,8054,8055],{},"R$50 to R$150",[146,8057,8058],{},[27,8059,8055],{},[125,8061,8062,8065,8068,8070],{},[146,8063,8064],{},"GitLab CI self-hosted (VPS R$200)",[146,8066,8067],{},"R$200",[146,8069,8014],{},[146,8071,8072],{},[27,8073,8067],{},[125,8075,8076,8079,8081,8084],{},[146,8077,8078],{},"Actions with heavy macOS builds",[146,8080,8014],{},[146,8082,8083],{},"R$300 to R$1,500",[146,8085,8086],{},[27,8087,8083],{},[125,8089,8090,8093,8096,8099],{},[146,8091,8092],{},"GitLab CI SaaS Premium (5 
devs)",[146,8094,8095],{},"R$725",[146,8097,8098],{},"R$0 to R$200",[146,8100,8101],{},[27,8102,8103],{},"R$725 to R$925",[12,8105,8106,8107,8110],{},"Absolute cost winner: ",[27,8108,8109],{},"Woodpecker self-hosted"," for a team willing to operate a VPS. Costs the same as a lunch per month and runs CI for ten projects without breaking a sweat.",[12,8112,8113],{},"If ops isn't available, the Actions free plan is the next option. It fits a small team with light workflows; when it goes past US$30\u002Fmonth variable, it's worth at least evaluating self-hosted runners.",[19,8115,8117],{"id":8116},"which-has-the-best-developer-experience","Which has the best developer experience?",[12,8119,8120],{},"DX in CI\u002FCD is measured in three dimensions: time from \"blank yml\" to \"first passing build\", debug capability when it goes wrong, and ability to evolve the workflow when it grows.",[119,8122,8123,8133],{},[122,8124,8125],{},[125,8126,8127,8130],{},[128,8128,8129],{},"Dimension",[128,8131,8132],{},"Winner",[141,8134,8135,8143,8151,8159,8167,8175],{},[125,8136,8137,8140],{},[146,8138,8139],{},"Ready templates \u002F accessibility",[146,8141,8142],{},"GitHub Actions (marketplace + onboarding)",[125,8144,8145,8148],{},[146,8146,8147],{},"Complex workflows \u002F DAG \u002F monorepo",[146,8149,8150],{},"GitLab CI (parent-child + native needs)",[125,8152,8153,8156],{},[146,8154,8155],{},"Local reproduction \u002F conceptual simplicity",[146,8157,8158],{},"Drone\u002FWoodpecker (each step = container)",[125,8160,8161,8164],{},[146,8162,8163],{},"Intermittent failure debug",[146,8165,8166],{},"Drone\u002FWoodpecker (re-running an isolated step is trivial)",[125,8168,8169,8172],{},[146,8170,8171],{},"Cross-project composition",[146,8173,8174],{},"GitHub Actions (reusable workflows + composite actions)",[125,8176,8177,8180],{},[146,8178,8179],{},"Time-to-first-pipeline (zero to hello world)",[146,8181,7548],{},[12,8183,8184],{},"There's no absolute winner. 
For a team that values starting fast, Actions. For a team with complex workflow from day one (monorepo, multiple languages), GitLab CI. For a team that wants to understand exactly what's happening, Drone\u002FWoodpecker.",[12,8186,7567],{},[19,8188,8190],{"id":8189},"comparative-table-12-honest-criteria","Comparative table: 12 honest criteria",[119,8192,8193,8205],{},[122,8194,8195],{},[125,8196,8197,8199,8201,8203],{},[128,8198,2982],{},[128,8200,7548],{},[128,8202,7554],{},[128,8204,7560],{},[141,8206,8207,8220,8234,8248,8262,8276,8290,8304,8318,8329,8342,8356],{},[125,8208,8209,8212,8215,8218],{},[146,8210,8211],{},"BR startup monthly cost (5 devs, medium volume)",[146,8213,8214],{},"R$0 to R$150",[146,8216,8217],{},"R$80 to R$925",[146,8219,8011],{},[125,8221,8222,8225,8228,8231],{},[146,8223,8224],{},"Real free tier (2026)",[146,8226,8227],{},"2000 min\u002Fmonth private, unlimited public",[146,8229,8230],{},"400 min\u002Fmonth SaaS",[146,8232,8233],{},"Unlimited self-hosted",[125,8235,8236,8239,8242,8245],{},[146,8237,8238],{},"Self-hosted available",[146,8240,8241],{},"Yes (runners), SaaS UI",[146,8243,8244],{},"Yes (full CE)",[146,8246,8247],{},"Yes (the only sensible way)",[125,8249,8250,8253,8256,8259],{},[146,8251,8252],{},"Large workflow complexity",[146,8254,8255],{},"Good (reusable workflows)",[146,8257,8258],{},"Excellent (DAG + parent-child)",[146,8260,8261],{},"Modest (linear + matrix)",[125,8263,8264,8267,8270,8273],{},[146,8265,8266],{},"Monorepo support",[146,8268,8269],{},"Medium (paths filter)",[146,8271,8272],{},"Excellent (rules + parent-child)",[146,8274,8275],{},"Medium (when filter)",[125,8277,8278,8281,8284,8287],{},[146,8279,8280],{},"Integrated container registry",[146,8282,8283],{},"No (needs separate GHCR)",[146,8285,8286],{},"Yes, native",[146,8288,8289],{},"No (use external registry)",[125,8291,8292,8295,8298,8301],{},[146,8293,8294],{},"Secret management",[146,8296,8297],{},"Repo + org + environment",[146,8299,8300],{},"Project + 
group + instance",[146,8302,8303],{},"Server + repo",[125,8305,8306,8309,8312,8315],{},[146,8307,8308],{},"Out-of-the-box parallel jobs",[146,8310,8311],{},"Yes (matrix)",[146,8313,8314],{},"Yes (parallel + DAG)",[146,8316,8317],{},"Yes (depends_on)",[125,8319,8320,8323,8325,8327],{},[146,8321,8322],{},"BR community \u002F Portuguese material",[146,8324,4914],{},[146,8326,3159],{},[146,8328,4919],{},[125,8330,8331,8334,8337,8339],{},[146,8332,8333],{},"PT-BR documentation",[146,8335,8336],{},"Partial (official in English)",[146,8338,3139],{},[146,8340,8341],{},"Practically zero",[125,8343,8344,8347,8350,8353],{},[146,8345,8346],{},"GitHub\u002FGitLab\u002FGitea integration",[146,8348,8349],{},"GitHub only",[146,8351,8352],{},"GitLab only (external mirror is workaround)",[146,8354,8355],{},"All three + Bitbucket",[125,8357,8358,8361,8364,8367],{},[146,8359,8360],{},"Ideal usage range",[146,8362,8363],{},"1 to 50 devs on GitHub",[146,8365,8366],{},"5 to 500 devs on a single platform",[146,8368,8369],{},"1 to 30 devs with ops available",[12,8371,8372],{},"No competitor has a column without caveats. The right tool depends on the team profile.",[19,8374,8376],{"id":8375},"decision-by-team-profile","Decision by team profile",[12,8378,8379],{},"Four concrete recommendations, no \"depends\".",[368,8381,8383],{"id":8382},"indie-hacker-or-public-repo-on-github","Indie hacker or public repo on GitHub",[12,8385,8386,8389],{},[27,8387,8388],{},"Use GitHub Actions free plan."," Public repos have unlimited minutes. You have no reason to look for an alternative. If a year from now the project grows, you reassess.",[368,8391,8393],{"id":8392},"early-startup-on-github-private-repos-r10k-to-r50k-mrr","Early startup on GitHub, private repos, R$10k to R$50k MRR",[12,8395,8396,8399],{},[27,8397,8398],{},"Stay on the Actions free plan."," The 2000-minute free tier fits a two- to three-dev team with reasonable workflows. 
When it starts going over, first reduce waste (path filters so you don't run everything on every PR, decent dependency caching) before migrating.",[12,8401,8402],{},"If you consistently go over US$30\u002Fmonth in variable cost, consider migrating to self-hosted runners or Woodpecker in parallel.",[368,8404,8406],{"id":8405},"startup-with-r50k-to-r200k-mrr-on-github-high-ci-volume","Startup with R$50k to R$200k MRR on GitHub, high CI volume",[12,8408,8409,8412],{},[27,8410,8411],{},"Hybrid."," Use Actions for light workflows (lint, unit tests) and self-hosted runners (via ARC) or Woodpecker for heavy workflows (E2E, long builds, deploys). You pay per minute where it pays off and zero where it hurts.",[12,8414,8415],{},"For teams with regular macOS builds, consider a dedicated Mac mini as a self-hosted runner. A R$10 thousand investment pays for itself in under a year if you spend US$200\u002Fmonth (around R$1,000) on macOS Actions today.",[368,8417,8419],{"id":8418},"br-company-on-self-hosted-gitlab","BR company on self-hosted GitLab",[12,8421,8422,8425],{},[27,8423,8424],{},"Use native GitLab CI."," You're already paying the cost of operating GitLab; CI comes along at no additional cost. Migrating to another tool would mean operating two systems in parallel — not worth it.",[368,8427,8429],{"id":8428},"small-team-aggressively-controlling-cost","Small team aggressively controlling cost",[12,8431,8432,8435],{},[27,8433,8434],{},"Woodpecker self-hosted on a R$80 VPS."," It runs CI for ten projects without breaking a sweat. Ops cost: 1 to 2 hours\u002Fmonth. 
If the team has someone comfortable with Unix tools, it's the most economical option and the one with the most predictable bill — you know exactly what it costs every month.",[19,8437,8439],{"id":8438},"where-heroctl-comes-in-as-runner-infrastructure","Where HeroCtl comes in as runner infrastructure",[12,8441,8442],{},"Self-hosted CI\u002FCD is exactly the type of workload that HeroCtl orchestrates well: long-running services (CI server, database that maintains build history), services that scale horizontally (runners that scale up and down with the queue), services with persistence needs (artifact cache).",[12,8444,8445],{},"Instead of operating Docker Compose on a single server — a single point of failure — you describe the setup as a job configuration:",[2734,8447,8448,8454,8460,8466],{},[70,8449,8450,8453],{},[27,8451,8452],{},"Drone\u002FWoodpecker server as a long-running job",", with a single replica and a persistent volume for the history database.",[70,8455,8456,8459],{},[27,8457,8458],{},"N runners as a replicated job",", scaling horizontally. The orchestrator distributes the runners across nodes; if a server dies, the runners migrate to the others.",[70,8461,8462,8465],{},[27,8463,8464],{},"Integrated backup"," for CI state (server database + artifact cache), without setting up external tooling.",[70,8467,8468,8471],{},[27,8469,8470],{},"Integrated metrics and logs"," — you see CPU usage, memory usage, and build times without bringing up a separate observability stack.",[12,8473,8474,8475,2629,8478,8481],{},"Practical difference: instead of operating a CI stack in parallel to your production cluster, it becomes part of the same cluster, with the same high-availability guarantees. If a server falls, the runners migrate. 
If you want to double capacity for a heavy sprint, change ",[231,8476,8477],{},"replicas: 4",[231,8479,8480],{},"replicas: 8"," in the configuration file.",[12,8483,8484],{},"For those on the \"starting simple but going to grow\" frontier, this solves the transition without needing to swap tools mid-path.",[19,8486,8488],{"id":8487},"the-4-expensive-errors-in-self-hosted-cicd-and-how-to-avoid-them","The 4 expensive errors in self-hosted CI\u002FCD (and how to avoid them)",[368,8490,8492],{"id":8491},"error-1-silent-stale-cache","Error 1: silent stale cache",[12,8494,8495],{},"The symptom: build passes locally, fails on CI because of a dependency that exists on the dev machine but not on the fresh image. Worst case: passes on CI too because previous cache contains the dependency, but fails in production when the image is built without cache.",[12,8497,8498,8499,571,8502,571,8505,571,8508,8511],{},"The fix: a decent cache assumes it can be invalidated at any moment. Whenever you change dependency manifest files (",[231,8500,8501],{},"package.json",[231,8503,8504],{},"go.mod",[231,8506,8507],{},"requirements.txt",[231,8509,8510],{},"Cargo.toml","), include them in the cache key. Periodically (weekly), force build without cache to detect drift.",[368,8513,8515],{"id":8514},"error-2-secret-committed-by-accident","Error 2: secret committed by accident",[12,8517,8518],{},"The symptom: someone pasted a token in the CI config \"just to test\", committed, forgot. The repo is public; in 12 hours the token is in use by someone who shouldn't.",[12,8520,8521,8522,8525,8526,8529],{},"The fix: two layered mechanisms. ",[27,8523,8524],{},"Pre-commit hook"," that scans for common key patterns (AWS, Stripe, GitHub PAT). ",[27,8527,8528],{},"Automatic rotation"," of critical tokens (90 days max). If a token leaks, the exposure window is finite.",[12,8531,8532],{},"In GitLab CI, use variables with \"masked\" and \"protected\" flags. 
In Actions, use environment-scoped secrets with approval rules. In Drone\u002FWoodpecker, secrets are scoped per repo and never appear in logs by default.",[368,8534,8536],{"id":8535},"error-3-runner-running-on-the-same-production-server","Error 3: runner running on the same production server",[12,8538,8539],{},"The symptom: a heavy build consumes CPU\u002FRAM, production gets slow, latency rises, an alarm goes off, the on-call wakes up. A common real-world case in small teams trying to save a machine.",[12,8541,8542],{},"The fix: runners on a separate server from production, always. If the budget is tight, a runner on a R$30\u002Fmonth VPS is still cheaper than a production incident during business hours.",[368,8544,8546],{"id":8545},"error-4-workflow-that-doesnt-run-outside-ci","Error 4: workflow that doesn't run outside CI",[12,8548,8549],{},"The symptom: the CI build is a 200-line script inline in the YAML, with 15 environment variables that the system injects. When something goes wrong, nobody can reproduce it locally without reverse-engineering the YAML.",[12,8551,8552,8553,571,8556,8559,8560,8563,8564,8567],{},"The fix: CI should call commands that exist as ",[231,8554,8555],{},"Makefile",[231,8557,8558],{},"script\u002Fbuild",", or ",[231,8561,8562],{},"package.json scripts",". The CI YAML orchestrates; the logic lives in versioned scripts that run in any terminal. If you can't run ",[231,8565,8566],{},"make ci"," locally and see the same result, your CI isn't portable.",[12,8569,8570],{},"Drone\u002FWoodpecker forces this discipline by design (each step is a container). Actions and GitLab CI allow the anti-pattern; it's up to the team to avoid it.",[19,8572,5250],{"id":5249},[12,8574,8575],{},[27,8576,8577],{},"Is GitHub Actions faster than Drone?",[12,8579,8580],{},"In raw build speed, it depends on the runner: the Actions managed pool uses 2-vCPU machines; a self-hosted runner on a 4-vCPU machine is faster. 
In total pipeline time (including queue), Actions wins when there's volume — they have huge idle capacity. Self-hosted (any tool) has queue proportional to the number of runners you provision.",[12,8582,8583],{},[27,8584,8585],{},"Can I use GitLab CI with a repo on GitHub?",[12,8587,8588],{},"Technically yes, via \"pull mirror\" (GitLab mirrors GitHub and runs CI on it). In practice it's fragile: webhooks lag, status checks don't return to GitHub the way the team expects, MRs get confusing. Not worth it. If you're on GitHub, use Actions or Drone\u002FWoodpecker (which accept GitHub as a native source).",[12,8590,8591],{},[27,8592,8593],{},"Are GitHub Actions self-hosted runners worth it?",[12,8595,8596],{},"For private repos with high volume (more than 5000 minutes\u002Fmonth), yes. You save paid minutes in exchange for operating machines. For public repos, no — security risk (malicious forks running code on your machine) outweighs benefit. ARC (Actions Runner Controller) helps at scale, but adds a Kubernetes layer; only makes sense for teams already operating K8s.",[12,8598,8599],{},[27,8600,8601],{},"Is Woodpecker stable enough in 2026?",[12,8603,8604],{},"Yes. Monthly releases, solid codebase (forked from Drone, which had five years of production), active community. In production at hundreds of small and medium-sized companies. It's not the safe bet \"nobody is fired for choosing it\" — that's Actions or GitLab — but in three years of fork there hasn't been a serious community incident. For a small self-hosted team, it's the sensible choice.",[12,8606,8607],{},[27,8608,8609],{},"Do ArgoCD and FluxCD enter this decision?",[12,8611,8612],{},"Not directly. ArgoCD\u002FFluxCD are GitOps tools for Kubernetes, not CI. They watch a Git repo and apply changes to the cluster. CI continues to be Actions\u002FGitLab\u002FDrone generating images; ArgoCD\u002FFlux apply the deploy. If you're not on Kubernetes, ArgoCD\u002FFlux aren't for you. 
Teams on other orchestrators deploy directly from CI or via the orchestrator's APIs.",[12,8614,8615],{},[27,8616,8617],{},"How many simultaneous runners for a team of 5 devs?",[12,8619,8620],{},"Practical rule: one runner per two active developers, plus one extra runner so long builds don't block fast PRs. Five-dev team: three runners is comfortable. At peak times (release day), bring it up to five temporarily. Each runner consumes 1 to 2 GB of RAM under a typical workload; an 8-GB server runs four runners without pain.",[12,8622,8623],{},[27,8624,8625],{},"Dependency cache — which tool handles it best?",[12,8627,8628,8629,8632],{},"GitLab CI has native cache by key\u002Fpath, integrated with its own registry. GitHub Actions has ",[231,8630,8631],{},"actions\u002Fcache"," (free, 10 GB per repo). Drone\u002FWoodpecker depend on an external cache plugin (S3, local MinIO) — more setup but more flexible. At moderate volume, all of them solve it; at high volume (large monorepo), GitLab has an advantage from registry integration.",[12,8634,8635],{},[27,8636,8637],{},"Migrating from GitHub Actions to Drone — how much work?",[12,8639,8640],{},"For simple workflows (build + test + push), 1 to 2 days. For workflows that depend on many marketplace actions, 1 to 2 weeks (you need to rewrite each action as a container). The biggest pain is secrets and environments — export and reimport them carefully. Recommendation: migrate project by project, not all at once.",[12,8642,8643],{},[27,8644,8645],{},"Can I run Actions and Drone\u002FWoodpecker runners on the same server?",[12,8647,8648],{},"Technically yes, both are containers. In practice, keep them separate: runners on different servers keep one heavy build from affecting the other. If the budget is tight, two R$40\u002Fmonth servers are better than one R$80\u002Fmonth server with everything together.",[12,8650,7567],{},[19,8652,8654],{"id":8653},"in-summary","In summary",[12,8656,8657],{},"CI\u002FCD in 2026 has no winning tool. 
It has usage profiles and honest tradeoffs:",[2734,8659,8660,8666,8672,8678,8684],{},[70,8661,8662,8665],{},[27,8663,8664],{},"You're on GitHub and the volume is light to medium?"," Actions, free plan. Don't look for a problem where there isn't one.",[70,8667,8668,8671],{},[27,8669,8670],{},"You're on self-hosted GitLab?"," Native GitLab CI. Already paid.",[70,8673,8674,8677],{},[27,8675,8676],{},"You want predictable cost and have 1-2h\u002Fmonth of ops available?"," Woodpecker self-hosted on a R$80 VPS. The most economical choice.",[70,8679,8680,8683],{},[27,8681,8682],{},"You have a large monorepo with a complex workflow?"," GitLab CI (native DAG) or Actions with reusable workflows.",[70,8685,8686,8689],{},[27,8687,8688],{},"You have high volume and CI-minute pricing pain?"," Hybrid: Actions for light workflows, self-hosted runners for heavy ones.",[12,8691,8692],{},"If you're thinking of running the CI tool as part of the same cluster that serves production — with real high availability, integrated metrics, and backup without setting up a separate stack — install HeroCtl on a server:",[224,8694,8696],{"className":8695,"code":2948,"language":2529},[2527],[231,8697,2948],{"__ignoreMap":229},[12,8699,8700],{},"From there, describing a Woodpecker server with three auto-scalable runners is a fifty-line configuration file. The cluster takes care of the rest: distributes the runners across the nodes, keeps the server available even with machine loss, backs up state, exposes metrics in the embedded panel.",[12,8702,8703,8704,8706,8707,8711],{},"For more context, it's also worth reading ",[3336,8705,3344],{"href":3343}," — it discusses when it makes sense to leave docker-compose for a replicated control plane, with the same honest criteria as this post. 
And for teams thinking of simplifying the entire orchestration stack, ",[3336,8708,8710],{"href":8709},"\u002Fen\u002Fblog\u002Fmigrating-from-kubernetes-to-simpler-stack","Migrating from Kubernetes to a simpler stack — real case"," has numbers from a real migration, with gains and pains.",[12,8713,8714,8715,8718],{},"The CI\u002FCD choice is one of a team's most enduring decisions. It's worth a few days of honest comparison before copying the ",[231,8716,8717],{},".github\u002Fworkflows\u002F"," from the previous project — because three years later, migration costs dearly.",{"title":229,"searchDepth":244,"depth":244,"links":8720},[8721,8722,8723,8728,8732,8737,8738,8739,8740,8747,8748,8754,8755],{"id":7526,"depth":244,"text":7527},{"id":7570,"depth":244,"text":7571},{"id":7607,"depth":244,"text":7608,"children":8724},[8725,8726,8727],{"id":7618,"depth":271,"text":7619},{"id":7654,"depth":271,"text":7655},{"id":7715,"depth":271,"text":7716},{"id":7748,"depth":244,"text":7749,"children":8729},[8730,8731],{"id":7755,"depth":271,"text":7756},{"id":7795,"depth":271,"text":7796},{"id":7882,"depth":244,"text":7883,"children":8733},[8734,8735,8736],{"id":7889,"depth":271,"text":7890},{"id":7935,"depth":271,"text":7936},{"id":7965,"depth":271,"text":7966},{"id":7979,"depth":244,"text":7980},{"id":8116,"depth":244,"text":8117},{"id":8189,"depth":244,"text":8190},{"id":8375,"depth":244,"text":8376,"children":8741},[8742,8743,8744,8745,8746],{"id":8382,"depth":271,"text":8383},{"id":8392,"depth":271,"text":8393},{"id":8405,"depth":271,"text":8406},{"id":8418,"depth":271,"text":8419},{"id":8428,"depth":271,"text":8429},{"id":8438,"depth":244,"text":8439},{"id":8487,"depth":244,"text":8488,"children":8749},[8750,8751,8752,8753],{"id":8491,"depth":271,"text":8492},{"id":8514,"depth":271,"text":8515},{"id":8535,"depth":271,"text":8536},{"id":8545,"depth":271,"text":8546},{"id":5249,"depth":244,"text":5250},{"id":8653,"depth":244,"text":8654},"comparison","2026-05-15","GitHub 
Actions won mindshare but has minute costs. GitLab CI is more complete but heavier. Drone (and Woodpecker) self-hosted runs on a small VPS. Practical comparison.",{},"\u002Fen\u002Fblog\u002Fgithub-actions-vs-gitlab-ci-vs-drone","14 min",{"title":7511,"description":8758},{"loc":8760},"en\u002Fblog\u002Fgithub-actions-vs-gitlab-ci-vs-drone",[8766,8767,8768,8769,8756],"github-actions","gitlab-ci","drone","ci-cd","BvIS4ezr8i7bx--48vwBoRynlj2ckXAEl-aYeNn6diE",{"id":8772,"title":8773,"author":7,"body":8774,"category":3378,"cover":3379,"date":11771,"description":11772,"draft":3382,"extension":3383,"lastReviewed":3379,"meta":11773,"navigation":411,"path":11774,"readingTime":6387,"seo":11775,"sitemap":11776,"stem":11777,"tags":11778,"__hash__":11783},"blog_en\u002Fen\u002Fblog\u002Fmonitoring-stack-prometheus-grafana-loki.md","Complete monitoring stack in 2026: Prometheus + Grafana + Loki step by step",{"type":9,"value":8775,"toc":11751},[8776,8789,8792,8794,8801,8812,8815,8818,8822,8825,8863,8878,8882,8885,8918,8921,8925,8931,8934,8937,8964,8978,8982,8987,8994,9000,9006,9354,9373,9376,9414,9437,9441,9446,9452,9732,9742,9761,9765,9771,9774,9853,9873,9879,9917,9927,9931,9935,9941,10195,10198,10201,10388,10398,10405,10409,10414,10423,10429,10538,10545,10552,10572,10579,10583,10587,10590,10600,10938,10945,11163,11174,11188,11192,11196,11202,11226,11233,11237,11242,11245,11248,11281,11284,11316,11325,11336,11340,11343,11346,11372,11375,11379,11429,11432,11435,11439,11570,11573,11577,11580,11590,11600,11612,11618,11620,11626,11632,11638,11644,11653,11667,11673,11679,11683,11686,11714,11726,11729,11745,11748],[12,8777,8778,8779,571,8782,571,8785,8788],{},"The first time your site crashes at three in the morning, you'll discover something uncomfortable: there's no way to know what happened. There's no CPU graph, there's no log of the container that died, there's no alert that warned beforehand. 
You'll open a terminal, connect to the servers one by one, run ",[231,8780,8781],{},"top",[231,8783,8784],{},"df",[231,8786,8787],{},"journalctl",", and try to reconstruct a crime scene that has already gone cold.",[12,8790,8791],{},"This post is the shortcut so you don't go through that. In four hours, with R$80 to R$120 per month of hardware, you can assemble the open-source observability stack that replaces Datadog, New Relic and CloudWatch in 95% of cases for a startup. The tools are the same ones that run inside companies with tens of thousands of servers — and they fit comfortably on a small VPS for a team starting out.",[19,8793,22],{"id":21},[12,8795,8796,8797,8800],{},"The standard open-source monitoring stack in 2026 — ",[27,8798,8799],{},"Prometheus + Grafana + Loki + Alertmanager"," — fits on a single 4 GB RAM VPS and covers metrics, centralized logs, dashboards and alerts. This tutorial shows step-by-step setup for a 4-to-5-server cluster in approximately four hours, using docker-compose or orchestrator job specs.",[12,8802,8803,8804,8807,8808,8811],{},"For a Brazilian startup, that means ",[27,8805,8806],{},"R$80 to R$120 per month of hardware"," vs ",[27,8809,8810],{},"R$1,000 to R$2,000 per month"," of equivalent observability SaaS. The time cost is honest: four hours of initial setup plus two to four hours per month of ongoing maintenance.",[12,8813,8814],{},"The deliverable at the end of the tutorial: dashboards for CPU, RAM, disk, network and HTTP metrics; searchable logs with 30-day retention; alerts routed to Slack, Discord or email. 
Prerequisites: 1 Linux VPS with 4 GB of RAM and 50 GB SSD, Docker installed, and a domain with DNS controlled by you.",[12,8816,8817],{},"The choice between running this stack on a dedicated VPS outside the production cluster or as a job inside the orchestrator itself is an architectural decision — we cover both options in step 8 and in \"How to run this inside HeroCtl\".",[19,8819,8821],{"id":8820},"what-each-component-does-in-one-sentence","What each component does, in one sentence",[12,8823,8824],{},"Before installing anything, it's worth understanding the role of each piece. The stack has six components; the confusion usually comes from thinking one of them is \"the monitoring system\". It's not. Each one does one thing.",[2734,8826,8827,8833,8839,8845,8851,8857],{},[70,8828,8829,8832],{},[27,8830,8831],{},"Prometheus"," is a time-series database (TSDB) that collects metrics via HTTP scrape — it pulls the numbers, nobody pushes them. Retains 15 days by default.",[70,8834,8835,8838],{},[27,8836,8837],{},"Grafana"," is the visualization layer. Connects to Prometheus, to Loki, to Postgres, to almost any structured source, and draws graphs.",[70,8840,8841,8844],{},[27,8842,8843],{},"Loki"," is the log piece. 
Its query syntax is similar to Prometheus's, it indexes only labels (not log content), and because of that it's roughly ten times cheaper to run than ELK.",[70,8846,8847,8850],{},[27,8848,8849],{},"Promtail"," (or Grafana Agent, which is replacing Promtail in 2026) is the collector that reads log files from each server and ships them to Loki.",[70,8852,8853,8856],{},[27,8854,8855],{},"node_exporter"," runs on each monitored node and exposes an HTTP endpoint with CPU, RAM, disk and network in Prometheus format.",[70,8858,8859,8862],{},[27,8860,8861],{},"Alertmanager"," receives alert rules from Prometheus and handles routing — Slack, email, PagerDuty, arbitrary webhook.",[12,8864,8865,8866,571,8869,571,8872,571,8875,101],{},"Whoever designs their first stack usually confuses Prometheus with \"monitoring\" and Grafana with \"pretty dashboards\". The real separation is: ",[27,8867,8868],{},"Prometheus stores numbers",[27,8870,8871],{},"Loki stores text",[27,8873,8874],{},"Grafana shows both",[27,8876,8877],{},"Alertmanager screams when some number is wrong",[19,8879,8881],{"id":8880},"whats-the-recommended-architecture","What's the recommended architecture?",[12,8883,8884],{},"For a cluster of 3 to 5 servers running production applications, the topology that has worked in practice is to separate the observability server from the rest. A dedicated node, outside the cluster it monitors, with two objectives: not dying together when the cluster dies, and not competing for CPU\u002FRAM with the real application.",[2734,8886,8887,8893,8899,8909],{},[70,8888,8889,8892],{},[27,8890,8891],{},"1 dedicated \"observability\" server",", 4 GB of RAM, 50 GB SSD. 
Runs Prometheus, Grafana, Loki, Alertmanager.",[70,8894,8895,8898],{},[27,8896,8897],{},"Each monitored server"," runs only two lightweight processes: node_exporter (system metrics) and Promtail (log shipping).",[70,8900,8901,8904,8905,8908],{},[27,8902,8903],{},"Your applications"," expose a ",[231,8906,8907],{},"\u002Fmetrics"," endpoint in Prometheus format. If you use a popular framework, there's a ready-made client. If not, it's a library of a few dozen lines.",[70,8910,8911,8913,8914,8917],{},[27,8912,8837],{}," is accessible via a subdomain (",[231,8915,8916],{},"monitor.yourdomain.com",") with automatic TLS and basic authentication in front.",[12,8919,8920],{},"This separation has a cost: you pay for one more VPS. In exchange, when the main cluster goes down, you can still look at the graphs to understand what happened. For a startup, this trade-off almost always pays off — the worst monitoring scenario is discovering that the only thing that stopped along with the site was the system that would have warned you that the site stopped.",[19,8922,8924],{"id":8923},"step-1-how-to-provision-the-observability-vps","Step 1 — How to provision the observability VPS?",[12,8926,8927,8928,101],{},"Estimated time: ",[27,8929,8930],{},"10 minutes",[12,8932,8933],{},"Any cheap provider works. The two with the best cost-benefit for the Brazilian case today are Hetzner (CPX21 at 7.99 EUR per month with 3 vCPUs and 4 GB of RAM, datacenter in Germany) and DigitalOcean (Basic Droplet at US$24 per month with the same configuration, datacenters closer to Brazil). For a monitoring workload, scrape latency from a European datacenter isn't a problem — Prometheus pulls every 15 seconds by default, so 200ms RTT between Hetzner and your servers doesn't hurt.",[12,8935,8936],{},"Provisioning:",[67,8938,8939,8942,8945,8951,8958],{},[70,8940,8941],{},"Create the VPS with Ubuntu 24.04 LTS or Debian 12.",[70,8943,8944],{},"Add your public SSH key on creation. 
Disable password login.",[70,8946,8947,8948,101],{},"Install Docker and the compose plugin: ",[231,8949,8950],{},"curl -fsSL https:\u002F\u002Fget.docker.com | sh && apt install docker-compose-plugin",[70,8952,8953,8954,8957],{},"Configure the firewall: port 22 (SSH) open, port 443 (HTTPS) open, all others closed. Internal ports (3000, 9090, 3100, 9093) stay accessible only via ",[231,8955,8956],{},"localhost"," on the VPS itself — the reverse proxy exposes Grafana via 443.",[70,8959,8960,8961,8963],{},"Point DNS: create an A record ",[231,8962,8916],{}," to the VPS IP.",[12,8965,341,8966,8969,8970,8973,8974,8977],{},[231,8967,8968],{},"docker --version"," returns 26.x or higher; ",[231,8971,8972],{},"dig monitor.yourdomain.com"," returns the correct IP; ",[231,8975,8976],{},"ssh root@monitor.yourdomain.com"," connects without asking for a password.",[19,8979,8981],{"id":8980},"step-2-how-to-bring-up-the-stack-via-docker-compose","Step 2 — How to bring up the stack via docker-compose?",[12,8983,8927,8984,101],{},[27,8985,8986],{},"45 minutes",[12,8988,8989,8990,8993],{},"Create the working directory at ",[231,8991,8992],{},"\u002Fopt\u002Fobservability\u002F"," with the following structure:",[224,8995,8998],{"className":8996,"code":8997,"language":2529},[2527],"\u002Fopt\u002Fobservability\u002F\n├── docker-compose.yml\n├── prometheus\u002F\n│   ├── prometheus.yml\n│   └── alerts.yml\n├── alertmanager\u002F\n│   └── alertmanager.yml\n├── loki\u002F\n│   └── loki-config.yml\n└── grafana\u002F\n    └── provisioning\u002F\n        └── datasources\u002F\n            └── datasources.yml\n",[231,8999,8997],{"__ignoreMap":229},[12,9001,9002,9003,1272],{},"The abbreviated but functional ",[231,9004,9005],{},"docker-compose.yml",[224,9007,9011],{"className":9008,"code":9009,"language":9010,"meta":229,"style":229},"services:\n  prometheus:\n    image: prom\u002Fprometheus:v2.55.0\n    volumes:\n      - 
.\u002Fprometheus:\u002Fetc\u002Fprometheus\n      - prometheus-data:\u002Fprometheus\n    command:\n      - '--config.file=\u002Fetc\u002Fprometheus\u002Fprometheus.yml'\n      - '--storage.tsdb.retention.time=30d'\n      - '--web.enable-lifecycle'  # allows reload via HTTP POST\n    ports:\n      - '127.0.0.1:9090:9090'\n    restart: unless-stopped\n\n  grafana:\n    image: grafana\u002Fgrafana:11.3.0\n    volumes:\n      - grafana-data:\u002Fvar\u002Flib\u002Fgrafana\n      - .\u002Fgrafana\u002Fprovisioning:\u002Fetc\u002Fgrafana\u002Fprovisioning\n    environment:\n      - GF_SECURITY_ADMIN_PASSWORD=${GRAFANA_PASSWORD}\n      - GF_USERS_ALLOW_SIGN_UP=false\n    ports:\n      - '127.0.0.1:3000:3000'\n    restart: unless-stopped\n\n  loki:\n    image: grafana\u002Floki:3.2.0\n    volumes:\n      - .\u002Floki\u002Floki-config.yml:\u002Fetc\u002Floki\u002Fconfig.yml\n      - loki-data:\u002Floki\n    command: -config.file=\u002Fetc\u002Floki\u002Fconfig.yml\n    ports:\n      - '127.0.0.1:3100:3100'\n    restart: unless-stopped\n\n  alertmanager:\n    image: prom\u002Falertmanager:v0.27.0\n    volumes:\n      - .\u002Falertmanager:\u002Fetc\u002Falertmanager\n    ports:\n      - '127.0.0.1:9093:9093'\n    restart: unless-stopped\n\nvolumes:\n  prometheus-data:\n  grafana-data:\n  loki-data:\n","yaml",[231,9012,9013,9022,9029,9039,9046,9054,9061,9068,9075,9082,9092,9099,9106,9116,9120,9127,9136,9142,9149,9156,9163,9170,9177,9183,9190,9198,9202,9209,9218,9224,9231,9238,9247,9253,9260,9268,9272,9279,9288,9294,9301,9307,9314,9322,9326,9333,9340,9347],{"__ignoreMap":229},[234,9014,9015,9019],{"class":236,"line":237},[234,9016,9018],{"class":9017},"services",[234,9020,9021],{"class":387},":\n",[234,9023,9024,9027],{"class":236,"line":244},[234,9025,9026],{"class":9017},"  prometheus",[234,9028,9021],{"class":387},[234,9030,9031,9034,9036],{"class":236,"line":271},[234,9032,9033],{"class":9017},"    
image",[234,9035,6562],{"class":387},[234,9037,9038],{"class":255},"prom\u002Fprometheus:v2.55.0\n",[234,9040,9041,9044],{"class":236,"line":415},[234,9042,9043],{"class":9017},"    volumes",[234,9045,9021],{"class":387},[234,9047,9048,9051],{"class":236,"line":434},[234,9049,9050],{"class":387},"      - ",[234,9052,9053],{"class":255},".\u002Fprometheus:\u002Fetc\u002Fprometheus\n",[234,9055,9056,9058],{"class":236,"line":459},[234,9057,9050],{"class":387},[234,9059,9060],{"class":255},"prometheus-data:\u002Fprometheus\n",[234,9062,9063,9066],{"class":236,"line":464},[234,9064,9065],{"class":9017},"    command",[234,9067,9021],{"class":387},[234,9069,9070,9072],{"class":236,"line":479},[234,9071,9050],{"class":387},[234,9073,9074],{"class":255},"'--config.file=\u002Fetc\u002Fprometheus\u002Fprometheus.yml'\n",[234,9076,9077,9079],{"class":236,"line":484},[234,9078,9050],{"class":387},[234,9080,9081],{"class":255},"'--storage.tsdb.retention.time=30d'\n",[234,9083,9084,9086,9089],{"class":236,"line":490},[234,9085,9050],{"class":387},[234,9087,9088],{"class":255},"'--web.enable-lifecycle'",[234,9090,9091],{"class":240},"  # allows reload via HTTP POST\n",[234,9093,9094,9097],{"class":236,"line":508},[234,9095,9096],{"class":9017},"    ports",[234,9098,9021],{"class":387},[234,9100,9101,9103],{"class":236,"line":529},[234,9102,9050],{"class":387},[234,9104,9105],{"class":255},"'127.0.0.1:9090:9090'\n",[234,9107,9108,9111,9113],{"class":236,"line":535},[234,9109,9110],{"class":9017},"    restart",[234,9112,6562],{"class":387},[234,9114,9115],{"class":255},"unless-stopped\n",[234,9117,9118],{"class":236,"line":546},[234,9119,412],{"emptyLinePlaceholder":411},[234,9121,9122,9125],{"class":236,"line":552},[234,9123,9124],{"class":9017},"  
grafana",[234,9126,9021],{"class":387},[234,9128,9129,9131,9133],{"class":236,"line":557},[234,9130,9033],{"class":9017},[234,9132,6562],{"class":387},[234,9134,9135],{"class":255},"grafana\u002Fgrafana:11.3.0\n",[234,9137,9138,9140],{"class":236,"line":594},[234,9139,9043],{"class":9017},[234,9141,9021],{"class":387},[234,9143,9144,9146],{"class":236,"line":635},[234,9145,9050],{"class":387},[234,9147,9148],{"class":255},"grafana-data:\u002Fvar\u002Flib\u002Fgrafana\n",[234,9150,9151,9153],{"class":236,"line":643},[234,9152,9050],{"class":387},[234,9154,9155],{"class":255},".\u002Fgrafana\u002Fprovisioning:\u002Fetc\u002Fgrafana\u002Fprovisioning\n",[234,9157,9158,9161],{"class":236,"line":659},[234,9159,9160],{"class":9017},"    environment",[234,9162,9021],{"class":387},[234,9164,9165,9167],{"class":236,"line":683},[234,9166,9050],{"class":387},[234,9168,9169],{"class":255},"GF_SECURITY_ADMIN_PASSWORD=${GRAFANA_PASSWORD}\n",[234,9171,9172,9174],{"class":236,"line":695},[234,9173,9050],{"class":387},[234,9175,9176],{"class":255},"GF_USERS_ALLOW_SIGN_UP=false\n",[234,9178,9179,9181],{"class":236,"line":717},[234,9180,9096],{"class":9017},[234,9182,9021],{"class":387},[234,9184,9185,9187],{"class":236,"line":723},[234,9186,9050],{"class":387},[234,9188,9189],{"class":255},"'127.0.0.1:3000:3000'\n",[234,9191,9192,9194,9196],{"class":236,"line":729},[234,9193,9110],{"class":9017},[234,9195,6562],{"class":387},[234,9197,9115],{"class":255},[234,9199,9200],{"class":236,"line":734},[234,9201,412],{"emptyLinePlaceholder":411},[234,9203,9204,9207],{"class":236,"line":771},[234,9205,9206],{"class":9017},"  
loki",[234,9208,9021],{"class":387},[234,9210,9211,9213,9215],{"class":236,"line":776},[234,9212,9033],{"class":9017},[234,9214,6562],{"class":387},[234,9216,9217],{"class":255},"grafana\u002Floki:3.2.0\n",[234,9219,9220,9222],{"class":236,"line":815},[234,9221,9043],{"class":9017},[234,9223,9021],{"class":387},[234,9225,9226,9228],{"class":236,"line":820},[234,9227,9050],{"class":387},[234,9229,9230],{"class":255},".\u002Floki\u002Floki-config.yml:\u002Fetc\u002Floki\u002Fconfig.yml\n",[234,9232,9233,9235],{"class":236,"line":826},[234,9234,9050],{"class":387},[234,9236,9237],{"class":255},"loki-data:\u002Floki\n",[234,9239,9240,9242,9244],{"class":236,"line":846},[234,9241,9065],{"class":9017},[234,9243,6562],{"class":387},[234,9245,9246],{"class":255},"-config.file=\u002Fetc\u002Floki\u002Fconfig.yml\n",[234,9248,9249,9251],{"class":236,"line":859},[234,9250,9096],{"class":9017},[234,9252,9021],{"class":387},[234,9254,9255,9257],{"class":236,"line":872},[234,9256,9050],{"class":387},[234,9258,9259],{"class":255},"'127.0.0.1:3100:3100'\n",[234,9261,9262,9264,9266],{"class":236,"line":898},[234,9263,9110],{"class":9017},[234,9265,6562],{"class":387},[234,9267,9115],{"class":255},[234,9269,9270],{"class":236,"line":913},[234,9271,412],{"emptyLinePlaceholder":411},[234,9273,9274,9277],{"class":236,"line":1886},[234,9275,9276],{"class":9017},"  
alertmanager",[234,9278,9021],{"class":387},[234,9280,9281,9283,9285],{"class":236,"line":1901},[234,9282,9033],{"class":9017},[234,9284,6562],{"class":387},[234,9286,9287],{"class":255},"prom\u002Falertmanager:v0.27.0\n",[234,9289,9290,9292],{"class":236,"line":1920},[234,9291,9043],{"class":9017},[234,9293,9021],{"class":387},[234,9295,9296,9298],{"class":236,"line":1944},[234,9297,9050],{"class":387},[234,9299,9300],{"class":255},".\u002Falertmanager:\u002Fetc\u002Falertmanager\n",[234,9302,9303,9305],{"class":236,"line":1962},[234,9304,9096],{"class":9017},[234,9306,9021],{"class":387},[234,9308,9309,9311],{"class":236,"line":1978},[234,9310,9050],{"class":387},[234,9312,9313],{"class":255},"'127.0.0.1:9093:9093'\n",[234,9315,9316,9318,9320],{"class":236,"line":1984},[234,9317,9110],{"class":9017},[234,9319,6562],{"class":387},[234,9321,9115],{"class":255},[234,9323,9324],{"class":236,"line":1992},[234,9325,412],{"emptyLinePlaceholder":411},[234,9327,9328,9331],{"class":236,"line":2004},[234,9329,9330],{"class":9017},"volumes",[234,9332,9021],{"class":387},[234,9334,9335,9338],{"class":236,"line":2014},[234,9336,9337],{"class":9017},"  prometheus-data",[234,9339,9021],{"class":387},[234,9341,9342,9345],{"class":236,"line":2020},[234,9343,9344],{"class":9017},"  grafana-data",[234,9346,9021],{"class":387},[234,9348,9349,9352],{"class":236,"line":2029},[234,9350,9351],{"class":9017},"  loki-data",[234,9353,9021],{"class":387},[12,9355,9356,9357,9360,9361,9364,9365,9368,9369,9372],{},"Three important points in this file. First, all ports are bound to ",[231,9358,9359],{},"127.0.0.1"," — none of the services is directly accessible from the internet. Second, volumes are named (not bind mounts), so they survive ",[231,9362,9363],{},"docker-compose down",". 
Third, the Grafana password comes from an environment variable: create a ",[231,9366,9367],{},".env"," next to the compose with ",[231,9370,9371],{},"GRAFANA_PASSWORD=something_long_random"," and never commit that file.",[12,9374,9375],{},"Bring up the stack:",[224,9377,9379],{"className":226,"code":9378,"language":228,"meta":229,"style":229},"cd \u002Fopt\u002Fobservability\ndocker compose up -d\ndocker compose ps  # all should be \"Up\" \u002F healthy\n",[231,9380,9381,9389,9402],{"__ignoreMap":229},[234,9382,9383,9386],{"class":236,"line":237},[234,9384,9385],{"class":251},"cd",[234,9387,9388],{"class":255}," \u002Fopt\u002Fobservability\n",[234,9390,9391,9393,9396,9399],{"class":236,"line":244},[234,9392,1118],{"class":247},[234,9394,9395],{"class":255}," compose",[234,9397,9398],{"class":255}," up",[234,9400,9401],{"class":251}," -d\n",[234,9403,9404,9406,9408,9411],{"class":236,"line":271},[234,9405,1118],{"class":247},[234,9407,9395],{"class":255},[234,9409,9410],{"class":255}," ps",[234,9412,9413],{"class":240},"  # all should be \"Up\" \u002F healthy\n",[12,9415,9416,9417,9420,9421,1895,9424,9420,9427,1895,9430,9433,9434,101],{},"Quick validation: ",[231,9418,9419],{},"curl localhost:9090\u002F-\u002Fready"," returns ",[231,9422,9423],{},"Prometheus Server is Ready",[231,9425,9426],{},"curl localhost:3100\u002Fready",[231,9428,9429],{},"ready",[231,9431,9432],{},"curl localhost:3000\u002Fapi\u002Fhealth"," returns JSON with ",[231,9435,9436],{},"\"database\": \"ok\"",[19,9438,9440],{"id":9439},"step-3-how-to-configure-prometheus-scrapes","Step 3 — How to configure Prometheus scrapes?",[12,9442,8927,9443,101],{},[27,9444,9445],{},"30 minutes",[12,9447,352,9448,9451],{},[231,9449,9450],{},"prometheus\u002Fprometheus.yml"," is where you tell Prometheus which endpoints to scrape. 
For a 4-server cluster, it looks like this:",[224,9453,9455],{"className":9008,"code":9454,"language":9010,"meta":229,"style":229},"global:\n  scrape_interval: 15s\n  evaluation_interval: 15s\n\nalerting:\n  alertmanagers:\n    - static_configs:\n        - targets: ['alertmanager:9093']\n\nrule_files:\n  - 'alerts.yml'\n\nscrape_configs:\n  - job_name: 'prometheus'\n    static_configs:\n      - targets: ['localhost:9090']\n\n  - job_name: 'node'\n    static_configs:\n      - targets:\n          - 'server-1.yourdomain.internal:9100'\n          - 'server-2.yourdomain.internal:9100'\n          - 'server-3.yourdomain.internal:9100'\n          - 'worker-1.yourdomain.internal:9100'\n        labels:\n          environment: 'production'\n\n  - job_name: 'apps'\n    static_configs:\n      - targets:\n          - 'api.yourdomain.internal:8080'\n          - 'worker.yourdomain.internal:8080'\n        labels:\n          environment: 'production'\n    metrics_path: '\u002Fmetrics'\n",[231,9456,9457,9464,9474,9483,9487,9494,9501,9511,9528,9532,9539,9547,9551,9558,9570,9577,9590,9594,9605,9611,9619,9627,9634,9641,9648,9655,9665,9669,9680,9686,9694,9701,9708,9714,9722],{"__ignoreMap":229},[234,9458,9459,9462],{"class":236,"line":237},[234,9460,9461],{"class":9017},"global",[234,9463,9021],{"class":387},[234,9465,9466,9469,9471],{"class":236,"line":244},[234,9467,9468],{"class":9017},"  scrape_interval",[234,9470,6562],{"class":387},[234,9472,9473],{"class":255},"15s\n",[234,9475,9476,9479,9481],{"class":236,"line":271},[234,9477,9478],{"class":9017},"  evaluation_interval",[234,9480,6562],{"class":387},[234,9482,9473],{"class":255},[234,9484,9485],{"class":236,"line":415},[234,9486,412],{"emptyLinePlaceholder":411},[234,9488,9489,9492],{"class":236,"line":434},[234,9490,9491],{"class":9017},"alerting",[234,9493,9021],{"class":387},[234,9495,9496,9499],{"class":236,"line":459},[234,9497,9498],{"class":9017},"  
alertmanagers",[234,9500,9021],{"class":387},[234,9502,9503,9506,9509],{"class":236,"line":464},[234,9504,9505],{"class":387},"    - ",[234,9507,9508],{"class":9017},"static_configs",[234,9510,9021],{"class":387},[234,9512,9513,9516,9519,9522,9525],{"class":236,"line":479},[234,9514,9515],{"class":387},"        - ",[234,9517,9518],{"class":9017},"targets",[234,9520,9521],{"class":387},": [",[234,9523,9524],{"class":255},"'alertmanager:9093'",[234,9526,9527],{"class":387},"]\n",[234,9529,9530],{"class":236,"line":484},[234,9531,412],{"emptyLinePlaceholder":411},[234,9533,9534,9537],{"class":236,"line":490},[234,9535,9536],{"class":9017},"rule_files",[234,9538,9021],{"class":387},[234,9540,9541,9544],{"class":236,"line":508},[234,9542,9543],{"class":387},"  - ",[234,9545,9546],{"class":255},"'alerts.yml'\n",[234,9548,9549],{"class":236,"line":529},[234,9550,412],{"emptyLinePlaceholder":411},[234,9552,9553,9556],{"class":236,"line":535},[234,9554,9555],{"class":9017},"scrape_configs",[234,9557,9021],{"class":387},[234,9559,9560,9562,9565,9567],{"class":236,"line":546},[234,9561,9543],{"class":387},[234,9563,9564],{"class":9017},"job_name",[234,9566,6562],{"class":387},[234,9568,9569],{"class":255},"'prometheus'\n",[234,9571,9572,9575],{"class":236,"line":552},[234,9573,9574],{"class":9017},"    
static_configs",[234,9576,9021],{"class":387},[234,9578,9579,9581,9583,9585,9588],{"class":236,"line":557},[234,9580,9050],{"class":387},[234,9582,9518],{"class":9017},[234,9584,9521],{"class":387},[234,9586,9587],{"class":255},"'localhost:9090'",[234,9589,9527],{"class":387},[234,9591,9592],{"class":236,"line":594},[234,9593,412],{"emptyLinePlaceholder":411},[234,9595,9596,9598,9600,9602],{"class":236,"line":635},[234,9597,9543],{"class":387},[234,9599,9564],{"class":9017},[234,9601,6562],{"class":387},[234,9603,9604],{"class":255},"'node'\n",[234,9606,9607,9609],{"class":236,"line":643},[234,9608,9574],{"class":9017},[234,9610,9021],{"class":387},[234,9612,9613,9615,9617],{"class":236,"line":659},[234,9614,9050],{"class":387},[234,9616,9518],{"class":9017},[234,9618,9021],{"class":387},[234,9620,9621,9624],{"class":236,"line":683},[234,9622,9623],{"class":387},"          - ",[234,9625,9626],{"class":255},"'server-1.yourdomain.internal:9100'\n",[234,9628,9629,9631],{"class":236,"line":695},[234,9630,9623],{"class":387},[234,9632,9633],{"class":255},"'server-2.yourdomain.internal:9100'\n",[234,9635,9636,9638],{"class":236,"line":717},[234,9637,9623],{"class":387},[234,9639,9640],{"class":255},"'server-3.yourdomain.internal:9100'\n",[234,9642,9643,9645],{"class":236,"line":723},[234,9644,9623],{"class":387},[234,9646,9647],{"class":255},"'worker-1.yourdomain.internal:9100'\n",[234,9649,9650,9653],{"class":236,"line":729},[234,9651,9652],{"class":9017},"        labels",[234,9654,9021],{"class":387},[234,9656,9657,9660,9662],{"class":236,"line":734},[234,9658,9659],{"class":9017},"          
environment",[234,9661,6562],{"class":387},[234,9663,9664],{"class":255},"'production'\n",[234,9666,9667],{"class":236,"line":771},[234,9668,412],{"emptyLinePlaceholder":411},[234,9670,9671,9673,9675,9677],{"class":236,"line":776},[234,9672,9543],{"class":387},[234,9674,9564],{"class":9017},[234,9676,6562],{"class":387},[234,9678,9679],{"class":255},"'apps'\n",[234,9681,9682,9684],{"class":236,"line":815},[234,9683,9574],{"class":9017},[234,9685,9021],{"class":387},[234,9687,9688,9690,9692],{"class":236,"line":820},[234,9689,9050],{"class":387},[234,9691,9518],{"class":9017},[234,9693,9021],{"class":387},[234,9695,9696,9698],{"class":236,"line":826},[234,9697,9623],{"class":387},[234,9699,9700],{"class":255},"'api.yourdomain.internal:8080'\n",[234,9702,9703,9705],{"class":236,"line":846},[234,9704,9623],{"class":387},[234,9706,9707],{"class":255},"'worker.yourdomain.internal:8080'\n",[234,9709,9710,9712],{"class":236,"line":859},[234,9711,9652],{"class":9017},[234,9713,9021],{"class":387},[234,9715,9716,9718,9720],{"class":236,"line":872},[234,9717,9659],{"class":9017},[234,9719,6562],{"class":387},[234,9721,9664],{"class":255},[234,9723,9724,9727,9729],{"class":236,"line":898},[234,9725,9726],{"class":9017},"    metrics_path",[234,9728,6562],{"class":387},[234,9730,9731],{"class":255},"'\u002Fmetrics'\n",[12,9733,9734,9735,9737,9738,9741],{},"For larger clusters or those that change composition frequently, swap ",[231,9736,9508],{}," for ",[231,9739,9740],{},"file_sd_configs"," pointing to a JSON you generate automatically. For 4 static servers, the file above resolves it.",[12,9743,9744,9745,9748,9749,9752,9753,9756,9757,9760],{},"Reload: ",[231,9746,9747],{},"curl -X POST localhost:9090\u002F-\u002Freload",". Check at ",[231,9750,9751],{},"localhost:9090\u002Ftargets"," if all jobs are ",[231,9754,9755],{},"UP",". 
The ones that are ",[231,9758,9759],{},"DOWN"," haven't been instrumented yet — that's step 4.",[19,9762,9764],{"id":9763},"step-4-how-to-install-node_exporter-on-each-server","Step 4 — How to install node_exporter on each server?",[12,9766,8927,9767,9770],{},[27,9768,9769],{},"15 minutes"," for 4 servers.",[12,9772,9773],{},"On each monitored server, run node_exporter. There are two ways: direct binary via systemd, or Docker container. In 2026 the consensus is container — easier to update and isolate. On each node:",[224,9775,9777],{"className":226,"code":9776,"language":228,"meta":229,"style":229},"docker run -d \\\n  --name node-exporter \\\n  --restart unless-stopped \\\n  --net=\"host\" \\\n  --pid=\"host\" \\\n  -v \"\u002F:\u002Fhost:ro,rslave\" \\\n  prom\u002Fnode-exporter:v1.8.2 \\\n  --path.rootfs=\u002Fhost\n",[231,9778,9779,9792,9802,9812,9822,9831,9841,9848],{"__ignoreMap":229},[234,9780,9781,9783,9786,9789],{"class":236,"line":237},[234,9782,1118],{"class":247},[234,9784,9785],{"class":255}," run",[234,9787,9788],{"class":251}," -d",[234,9790,9791],{"class":383}," \\\n",[234,9793,9794,9797,9800],{"class":236,"line":244},[234,9795,9796],{"class":251},"  --name",[234,9798,9799],{"class":255}," node-exporter",[234,9801,9791],{"class":383},[234,9803,9804,9807,9810],{"class":236,"line":271},[234,9805,9806],{"class":251},"  --restart",[234,9808,9809],{"class":255}," unless-stopped",[234,9811,9791],{"class":383},[234,9813,9814,9817,9820],{"class":236,"line":415},[234,9815,9816],{"class":251},"  --net=",[234,9818,9819],{"class":255},"\"host\"",[234,9821,9791],{"class":383},[234,9823,9824,9827,9829],{"class":236,"line":434},[234,9825,9826],{"class":251},"  --pid=",[234,9828,9819],{"class":255},[234,9830,9791],{"class":383},[234,9832,9833,9836,9839],{"class":236,"line":459},[234,9834,9835],{"class":251},"  -v",[234,9837,9838],{"class":255}," 
\"\u002F:\u002Fhost:ro,rslave\"",[234,9840,9791],{"class":383},[234,9842,9843,9846],{"class":236,"line":464},[234,9844,9845],{"class":255},"  prom\u002Fnode-exporter:v1.8.2",[234,9847,9791],{"class":383},[234,9849,9850],{"class":236,"line":479},[234,9851,9852],{"class":251},"  --path.rootfs=\u002Fhost\n",[12,9854,352,9855,9858,9859,9862,9863,571,9866,2402,9869,9872],{},[231,9856,9857],{},"--net=host"," is necessary for it to see real network interfaces. The bind mount on ",[231,9860,9861],{},"\u002Fhost"," allows reading ",[231,9864,9865],{},"\u002Fproc",[231,9867,9868],{},"\u002Fsys",[231,9870,9871],{},"\u002Fetc\u002Fpasswd"," from the host (read-only) without running the container with root privileges.",[12,9874,9875,9876,1272],{},"Firewall: open port 9100 only to the observability server IP. On Ubuntu with ",[231,9877,9878],{},"ufw",[224,9880,9882],{"className":226,"code":9881,"language":228,"meta":229,"style":229},"ufw allow from \u003COBSERVABILITY_IP> to any port 9100\n",[231,9883,9884],{"__ignoreMap":229},[234,9885,9886,9888,9891,9894,9897,9900,9903,9905,9908,9911,9914],{"class":236,"line":237},[234,9887,9878],{"class":247},[234,9889,9890],{"class":255}," allow",[234,9892,9893],{"class":255}," from",[234,9895,9896],{"class":383}," \u003C",[234,9898,9899],{"class":255},"OBSERVABILITY_I",[234,9901,9902],{"class":387},"P",[234,9904,1935],{"class":383},[234,9906,9907],{"class":255}," to",[234,9909,9910],{"class":255}," any",[234,9912,9913],{"class":255}," port",[234,9915,9916],{"class":251}," 9100\n",[12,9918,9919,9920,9923,9924,101],{},"Validation: from the observability server, ",[231,9921,9922],{},"curl http:\u002F\u002Fserver-1.yourdomain.internal:9100\u002Fmetrics"," should return hundreds of lines starting with ",[231,9925,9926],{},"# HELP node_cpu_seconds_total...",[19,9928,9930],{"id":9929},"step-5-how-to-configure-loki-promtail","Step 5 — How to configure Loki + Promtail?",[12,9932,8927,9933,101],{},[27,9934,9445],{},[12,9936,9937,9938,1272],{},"Loki 
is already running in the compose from step 2. All that's missing is the ",[231,9939,9940],{},"loki-config.yml",[224,9942,9944],{"className":9008,"code":9943,"language":9010,"meta":229,"style":229},"auth_enabled: false\n\nserver:\n  http_listen_port: 3100\n\ncommon:\n  path_prefix: \u002Floki\n  storage:\n    filesystem:\n      chunks_directory: \u002Floki\u002Fchunks\n      rules_directory: \u002Floki\u002Frules\n  replication_factor: 1\n  ring:\n    kvstore:\n      store: inmemory\n\nschema_config:\n  configs:\n    - from: 2024-01-01\n      store: tsdb\n      object_store: filesystem\n      schema: v13\n      index:\n        prefix: index_\n        period: 24h\n\nlimits_config:\n  retention_period: 720h  # 30 days\n  reject_old_samples: true\n  reject_old_samples_max_age: 168h\n",[231,9945,9946,9956,9960,9967,9977,9981,9988,9998,10005,10012,10022,10032,10042,10049,10056,10066,10070,10077,10084,10095,10104,10114,10124,10131,10141,10151,10155,10162,10175,10185],{"__ignoreMap":229},[234,9947,9948,9951,9953],{"class":236,"line":237},[234,9949,9950],{"class":9017},"auth_enabled",[234,9952,6562],{"class":387},[234,9954,9955],{"class":251},"false\n",[234,9957,9958],{"class":236,"line":244},[234,9959,412],{"emptyLinePlaceholder":411},[234,9961,9962,9965],{"class":236,"line":271},[234,9963,9964],{"class":9017},"server",[234,9966,9021],{"class":387},[234,9968,9969,9972,9974],{"class":236,"line":415},[234,9970,9971],{"class":9017},"  http_listen_port",[234,9973,6562],{"class":387},[234,9975,9976],{"class":251},"3100\n",[234,9978,9979],{"class":236,"line":434},[234,9980,412],{"emptyLinePlaceholder":411},[234,9982,9983,9986],{"class":236,"line":459},[234,9984,9985],{"class":9017},"common",[234,9987,9021],{"class":387},[234,9989,9990,9993,9995],{"class":236,"line":464},[234,9991,9992],{"class":9017},"  path_prefix",[234,9994,6562],{"class":387},[234,9996,9997],{"class":255},"\u002Floki\n",[234,9999,10000,10003],{"class":236,"line":479},[234,10001,10002],{"class":9017},"  
storage",[234,10004,9021],{"class":387},[234,10006,10007,10010],{"class":236,"line":484},[234,10008,10009],{"class":9017},"    filesystem",[234,10011,9021],{"class":387},[234,10013,10014,10017,10019],{"class":236,"line":490},[234,10015,10016],{"class":9017},"      chunks_directory",[234,10018,6562],{"class":387},[234,10020,10021],{"class":255},"\u002Floki\u002Fchunks\n",[234,10023,10024,10027,10029],{"class":236,"line":508},[234,10025,10026],{"class":9017},"      rules_directory",[234,10028,6562],{"class":387},[234,10030,10031],{"class":255},"\u002Floki\u002Frules\n",[234,10033,10034,10037,10039],{"class":236,"line":529},[234,10035,10036],{"class":9017},"  replication_factor",[234,10038,6562],{"class":387},[234,10040,10041],{"class":251},"1\n",[234,10043,10044,10047],{"class":236,"line":535},[234,10045,10046],{"class":9017},"  ring",[234,10048,9021],{"class":387},[234,10050,10051,10054],{"class":236,"line":546},[234,10052,10053],{"class":9017},"    kvstore",[234,10055,9021],{"class":387},[234,10057,10058,10061,10063],{"class":236,"line":552},[234,10059,10060],{"class":9017},"      store",[234,10062,6562],{"class":387},[234,10064,10065],{"class":255},"inmemory\n",[234,10067,10068],{"class":236,"line":557},[234,10069,412],{"emptyLinePlaceholder":411},[234,10071,10072,10075],{"class":236,"line":594},[234,10073,10074],{"class":9017},"schema_config",[234,10076,9021],{"class":387},[234,10078,10079,10082],{"class":236,"line":635},[234,10080,10081],{"class":9017},"  configs",[234,10083,9021],{"class":387},[234,10085,10086,10088,10090,10092],{"class":236,"line":643},[234,10087,9505],{"class":387},[234,10089,391],{"class":9017},[234,10091,6562],{"class":387},[234,10093,10094],{"class":251},"2024-01-01\n",[234,10096,10097,10099,10101],{"class":236,"line":659},[234,10098,10060],{"class":9017},[234,10100,6562],{"class":387},[234,10102,10103],{"class":255},"tsdb\n",[234,10105,10106,10109,10111],{"class":236,"line":683},[234,10107,10108],{"class":9017},"      
object_store",[234,10110,6562],{"class":387},[234,10112,10113],{"class":255},"filesystem\n",[234,10115,10116,10119,10121],{"class":236,"line":695},[234,10117,10118],{"class":9017},"      schema",[234,10120,6562],{"class":387},[234,10122,10123],{"class":255},"v13\n",[234,10125,10126,10129],{"class":236,"line":717},[234,10127,10128],{"class":9017},"      index",[234,10130,9021],{"class":387},[234,10132,10133,10136,10138],{"class":236,"line":723},[234,10134,10135],{"class":9017},"        prefix",[234,10137,6562],{"class":387},[234,10139,10140],{"class":255},"index_\n",[234,10142,10143,10146,10148],{"class":236,"line":729},[234,10144,10145],{"class":9017},"        period",[234,10147,6562],{"class":387},[234,10149,10150],{"class":255},"24h\n",[234,10152,10153],{"class":236,"line":734},[234,10154,412],{"emptyLinePlaceholder":411},[234,10156,10157,10160],{"class":236,"line":771},[234,10158,10159],{"class":9017},"limits_config",[234,10161,9021],{"class":387},[234,10163,10164,10167,10169,10172],{"class":236,"line":776},[234,10165,10166],{"class":9017},"  retention_period",[234,10168,6562],{"class":387},[234,10170,10171],{"class":255},"720h",[234,10173,10174],{"class":240},"  # 30 days\n",[234,10176,10177,10180,10182],{"class":236,"line":815},[234,10178,10179],{"class":9017},"  reject_old_samples",[234,10181,6562],{"class":387},[234,10183,10184],{"class":251},"true\n",[234,10186,10187,10190,10192],{"class":236,"line":820},[234,10188,10189],{"class":9017},"  reject_old_samples_max_age",[234,10191,6562],{"class":387},[234,10193,10194],{"class":255},"168h\n",[12,10196,10197],{},"Filesystem storage is enough to start. When you exceed 50 GB of logs per day or want 90+ days of retention, migrate to S3 (or compatible). 
Don't migrate earlier — it complicates operations with no real gain.",[12,10199,10200],{},"On each monitored server, install Promtail (or Grafana Agent) also as a container:",[224,10202,10204],{"className":9008,"code":10203,"language":9010,"meta":229,"style":229},"# \u002Fopt\u002Fpromtail\u002Fpromtail-config.yml on each server\nserver:\n  http_listen_port: 9080\n\nclients:\n  - url: http:\u002F\u002Fmonitor.yourdomain.com:3100\u002Floki\u002Fapi\u002Fv1\u002Fpush\n\nscrape_configs:\n  - job_name: system\n    static_configs:\n      - targets: [localhost]\n        labels:\n          job: varlogs\n          host: ${HOSTNAME}\n          __path__: \u002Fvar\u002Flog\u002F*.log\n\n  - job_name: docker\n    docker_sd_configs:\n      - host: unix:\u002F\u002F\u002Fvar\u002Frun\u002Fdocker.sock\n    relabel_configs:\n      - source_labels: ['__meta_docker_container_name']\n        target_label: 'container'\n",[231,10205,10206,10211,10217,10226,10230,10237,10249,10253,10259,10270,10276,10288,10294,10304,10314,10324,10328,10339,10346,10357,10364,10378],{"__ignoreMap":229},[234,10207,10208],{"class":236,"line":237},[234,10209,10210],{"class":240},"# \u002Fopt\u002Fpromtail\u002Fpromtail-config.yml on each 
server\n",[234,10212,10213,10215],{"class":236,"line":244},[234,10214,9964],{"class":9017},[234,10216,9021],{"class":387},[234,10218,10219,10221,10223],{"class":236,"line":271},[234,10220,9971],{"class":9017},[234,10222,6562],{"class":387},[234,10224,10225],{"class":251},"9080\n",[234,10227,10228],{"class":236,"line":415},[234,10229,412],{"emptyLinePlaceholder":411},[234,10231,10232,10235],{"class":236,"line":434},[234,10233,10234],{"class":9017},"clients",[234,10236,9021],{"class":387},[234,10238,10239,10241,10244,10246],{"class":236,"line":459},[234,10240,9543],{"class":387},[234,10242,10243],{"class":9017},"url",[234,10245,6562],{"class":387},[234,10247,10248],{"class":255},"http:\u002F\u002Fmonitor.yourdomain.com:3100\u002Floki\u002Fapi\u002Fv1\u002Fpush\n",[234,10250,10251],{"class":236,"line":464},[234,10252,412],{"emptyLinePlaceholder":411},[234,10254,10255,10257],{"class":236,"line":479},[234,10256,9555],{"class":9017},[234,10258,9021],{"class":387},[234,10260,10261,10263,10265,10267],{"class":236,"line":484},[234,10262,9543],{"class":387},[234,10264,9564],{"class":9017},[234,10266,6562],{"class":387},[234,10268,10269],{"class":255},"system\n",[234,10271,10272,10274],{"class":236,"line":490},[234,10273,9574],{"class":9017},[234,10275,9021],{"class":387},[234,10277,10278,10280,10282,10284,10286],{"class":236,"line":508},[234,10279,9050],{"class":387},[234,10281,9518],{"class":9017},[234,10283,9521],{"class":387},[234,10285,8956],{"class":255},[234,10287,9527],{"class":387},[234,10289,10290,10292],{"class":236,"line":529},[234,10291,9652],{"class":9017},[234,10293,9021],{"class":387},[234,10295,10296,10299,10301],{"class":236,"line":535},[234,10297,10298],{"class":9017},"          job",[234,10300,6562],{"class":387},[234,10302,10303],{"class":255},"varlogs\n",[234,10305,10306,10309,10311],{"class":236,"line":546},[234,10307,10308],{"class":9017},"          
host",[234,10310,6562],{"class":387},[234,10312,10313],{"class":255},"${HOSTNAME}\n",[234,10315,10316,10319,10321],{"class":236,"line":552},[234,10317,10318],{"class":9017},"          __path__",[234,10320,6562],{"class":387},[234,10322,10323],{"class":255},"\u002Fvar\u002Flog\u002F*.log\n",[234,10325,10326],{"class":236,"line":557},[234,10327,412],{"emptyLinePlaceholder":411},[234,10329,10330,10332,10334,10336],{"class":236,"line":594},[234,10331,9543],{"class":387},[234,10333,9564],{"class":9017},[234,10335,6562],{"class":387},[234,10337,10338],{"class":255},"docker\n",[234,10340,10341,10344],{"class":236,"line":635},[234,10342,10343],{"class":9017},"    docker_sd_configs",[234,10345,9021],{"class":387},[234,10347,10348,10350,10352,10354],{"class":236,"line":643},[234,10349,9050],{"class":387},[234,10351,1650],{"class":9017},[234,10353,6562],{"class":387},[234,10355,10356],{"class":255},"unix:\u002F\u002F\u002Fvar\u002Frun\u002Fdocker.sock\n",[234,10358,10359,10362],{"class":236,"line":659},[234,10360,10361],{"class":9017},"    relabel_configs",[234,10363,9021],{"class":387},[234,10365,10366,10368,10371,10373,10376],{"class":236,"line":683},[234,10367,9050],{"class":387},[234,10369,10370],{"class":9017},"source_labels",[234,10372,9521],{"class":387},[234,10374,10375],{"class":255},"'__meta_docker_container_name'",[234,10377,9527],{"class":387},[234,10379,10380,10383,10385],{"class":236,"line":695},[234,10381,10382],{"class":9017},"        target_label",[234,10384,6562],{"class":387},[234,10386,10387],{"class":255},"'container'\n",[12,10389,10390,10391,10394,10395,10397],{},"Important: the endpoint ",[231,10392,10393],{},"http:\u002F\u002Fmonitor.yourdomain.com:3100\u002Floki\u002Fapi\u002Fv1\u002Fpush"," needs to be accessible from the servers. If you followed step 2 and bound Loki to ",[231,10396,9359],{},", you have two options: expose 3100 via reverse proxy with basic authentication, or open an SSH\u002FWireGuard tunnel between servers. 
The second option is more secure and the one we recommend.",[12,10399,10400,10401,10404],{},"Validation: in Grafana, go to Explore, select the Loki data source, run ",[231,10402,10403],{},"{job=\"varlogs\"}"," and see logs appearing in real time.",[19,10406,10408],{"id":10407},"step-6-how-to-import-grafana-dashboards","Step 6 — How to import Grafana dashboards?",[12,10410,8927,10411,101],{},[27,10412,10413],{},"20 minutes",[12,10415,10416,10417,10420,10421,101],{},"Access ",[231,10418,10419],{},"https:\u002F\u002Fmonitor.yourdomain.com"," (after configuring the reverse proxy from step 8 — you can skip ahead now if you want). Log in as admin with the password from ",[231,10422,9367],{},[12,10424,10425,10426,1272],{},"Add the two data sources via automatic provisioning. In ",[231,10427,10428],{},"grafana\u002Fprovisioning\u002Fdatasources\u002Fdatasources.yml",[224,10430,10432],{"className":9008,"code":10431,"language":9010,"meta":229,"style":229},"apiVersion: 1\ndatasources:\n  - name: Prometheus\n    type: prometheus\n    access: proxy\n    url: http:\u002F\u002Fprometheus:9090\n    isDefault: true\n  - name: Loki\n    type: loki\n    access: proxy\n    url: http:\u002F\u002Floki:3100\n",[231,10433,10434,10443,10450,10462,10472,10482,10492,10501,10512,10521,10529],{"__ignoreMap":229},[234,10435,10436,10439,10441],{"class":236,"line":237},[234,10437,10438],{"class":9017},"apiVersion",[234,10440,6562],{"class":387},[234,10442,10041],{"class":251},[234,10444,10445,10448],{"class":236,"line":244},[234,10446,10447],{"class":9017},"datasources",[234,10449,9021],{"class":387},[234,10451,10452,10454,10457,10459],{"class":236,"line":271},[234,10453,9543],{"class":387},[234,10455,10456],{"class":9017},"name",[234,10458,6562],{"class":387},[234,10460,10461],{"class":255},"Prometheus\n",[234,10463,10464,10467,10469],{"class":236,"line":415},[234,10465,10466],{"class":9017},"    
type",[234,10468,6562],{"class":387},[234,10470,10471],{"class":255},"prometheus\n",[234,10473,10474,10477,10479],{"class":236,"line":434},[234,10475,10476],{"class":9017},"    access",[234,10478,6562],{"class":387},[234,10480,10481],{"class":255},"proxy\n",[234,10483,10484,10487,10489],{"class":236,"line":459},[234,10485,10486],{"class":9017},"    url",[234,10488,6562],{"class":387},[234,10490,10491],{"class":255},"http:\u002F\u002Fprometheus:9090\n",[234,10493,10494,10497,10499],{"class":236,"line":464},[234,10495,10496],{"class":9017},"    isDefault",[234,10498,6562],{"class":387},[234,10500,10184],{"class":251},[234,10502,10503,10505,10507,10509],{"class":236,"line":479},[234,10504,9543],{"class":387},[234,10506,10456],{"class":9017},[234,10508,6562],{"class":387},[234,10510,10511],{"class":255},"Loki\n",[234,10513,10514,10516,10518],{"class":236,"line":484},[234,10515,10466],{"class":9017},[234,10517,6562],{"class":387},[234,10519,10520],{"class":255},"loki\n",[234,10522,10523,10525,10527],{"class":236,"line":490},[234,10524,10476],{"class":9017},[234,10526,6562],{"class":387},[234,10528,10481],{"class":255},[234,10530,10531,10533,10535],{"class":236,"line":508},[234,10532,10486],{"class":9017},[234,10534,6562],{"class":387},[234,10536,10537],{"class":255},"http:\u002F\u002Floki:3100\n",[12,10539,10540,10541,10544],{},"Restart Grafana with ",[231,10542,10543],{},"docker compose restart grafana"," and the sources appear automatically.",[12,10546,10547,10548,10551],{},"Import ready-made dashboards. In ",[27,10549,10550],{},"Dashboards → New → Import",", paste the dashboard ID:",[2734,10553,10554,10560,10566],{},[70,10555,10556,10559],{},[27,10557,10558],{},"1860"," — Node Exporter Full. CPU, RAM, disk, network, filesystem. It's the most used dashboard in the Prometheus community, with good reason.",[70,10561,10562,10565],{},[27,10563,10564],{},"13639"," — Logs \u002F App. 
Basic visualization of Loki logs with filters by job, container, host.",[70,10567,10568,10571],{},[27,10569,10570],{},"15172"," — Cluster overview. Consolidated view per server, useful for a small cluster.",[12,10573,10574,10575,10578],{},"Customize each one to use ",[231,10576,10577],{},"environment=\"production\""," in the default filter. After two weeks of use, you'll want to create your own dashboards for specific workloads — there's no shortcut there, it's time in the chair.",[19,10580,10582],{"id":10581},"step-7-how-to-configure-basic-alerts","Step 7 — How to configure basic alerts?",[12,10584,8927,10585,101],{},[27,10586,8986],{},[12,10588,10589],{},"Alerts are where 80% of teams stumble: either they set up too few and learn about incidents from customers, or they set up dozens and desensitize the team.",[12,10591,10592,10593,10596,10597,1272],{},"Start with ",[27,10594,10595],{},"six essential alerts",". In ",[231,10598,10599],{},"prometheus\u002Falerts.yml",[224,10601,10603],{"className":9008,"code":10602,"language":9010,"meta":229,"style":229},"groups:\n  - name: essentials\n    interval: 30s\n    rules:\n      - alert: ServerDown\n        expr: up{job=\"node\"} == 0\n        for: 2m\n        labels:\n          severity: critical\n        annotations:\n          summary: \"Server {{ $labels.instance }} is down\"\n\n      - alert: HighCPU\n        expr: 100 - (avg by(instance) (rate(node_cpu_seconds_total{mode=\"idle\"}[5m])) * 100) > 80\n        for: 10m\n        labels:\n          severity: warning\n\n      - alert: DiskAlmostFull\n        expr: (node_filesystem_avail_bytes{mountpoint=\"\u002F\"} \u002F node_filesystem_size_bytes{mountpoint=\"\u002F\"}) * 100 \u003C 15\n        for: 5m\n        labels:\n          severity: critical\n\n      - alert: HighMemory\n        expr: (1 - (node_memory_MemAvailable_bytes \u002F node_memory_MemTotal_bytes)) * 100 > 90\n        for: 10m\n        labels:\n          severity: warning\n\n      - alert: HighHTTPErrorRate\n   
     expr: sum(rate(http_requests_total{status=~\"5..\"}[5m])) \u002F sum(rate(http_requests_total[5m])) > 0.05\n        for: 5m\n        labels:\n          severity: critical\n\n      - alert: HighLatency\n        expr: histogram_quantile(0.99, sum(rate(http_request_duration_seconds_bucket[5m])) by (le)) > 2\n        for: 10m\n        labels:\n          severity: warning\n",[231,10604,10605,10612,10623,10633,10640,10652,10662,10672,10678,10688,10695,10705,10709,10720,10729,10738,10744,10753,10757,10768,10777,10786,10792,10800,10804,10815,10824,10832,10838,10846,10850,10861,10870,10878,10884,10892,10896,10907,10916,10924,10930],{"__ignoreMap":229},[234,10606,10607,10610],{"class":236,"line":237},[234,10608,10609],{"class":9017},"groups",[234,10611,9021],{"class":387},[234,10613,10614,10616,10618,10620],{"class":236,"line":244},[234,10615,9543],{"class":387},[234,10617,10456],{"class":9017},[234,10619,6562],{"class":387},[234,10621,10622],{"class":255},"essentials\n",[234,10624,10625,10628,10630],{"class":236,"line":271},[234,10626,10627],{"class":9017},"    interval",[234,10629,6562],{"class":387},[234,10631,10632],{"class":255},"30s\n",[234,10634,10635,10638],{"class":236,"line":415},[234,10636,10637],{"class":9017},"    rules",[234,10639,9021],{"class":387},[234,10641,10642,10644,10647,10649],{"class":236,"line":434},[234,10643,9050],{"class":387},[234,10645,10646],{"class":9017},"alert",[234,10648,6562],{"class":387},[234,10650,10651],{"class":255},"ServerDown\n",[234,10653,10654,10657,10659],{"class":236,"line":459},[234,10655,10656],{"class":9017},"        expr",[234,10658,6562],{"class":387},[234,10660,10661],{"class":255},"up{job=\"node\"} == 0\n",[234,10663,10664,10667,10669],{"class":236,"line":464},[234,10665,10666],{"class":9017},"        
for",[234,10668,6562],{"class":387},[234,10670,10671],{"class":255},"2m\n",[234,10673,10674,10676],{"class":236,"line":479},[234,10675,9652],{"class":9017},[234,10677,9021],{"class":387},[234,10679,10680,10683,10685],{"class":236,"line":484},[234,10681,10682],{"class":9017},"          severity",[234,10684,6562],{"class":387},[234,10686,10687],{"class":255},"critical\n",[234,10689,10690,10693],{"class":236,"line":490},[234,10691,10692],{"class":9017},"        annotations",[234,10694,9021],{"class":387},[234,10696,10697,10700,10702],{"class":236,"line":508},[234,10698,10699],{"class":9017},"          summary",[234,10701,6562],{"class":387},[234,10703,10704],{"class":255},"\"Server {{ $labels.instance }} is down\"\n",[234,10706,10707],{"class":236,"line":529},[234,10708,412],{"emptyLinePlaceholder":411},[234,10710,10711,10713,10715,10717],{"class":236,"line":535},[234,10712,9050],{"class":387},[234,10714,10646],{"class":9017},[234,10716,6562],{"class":387},[234,10718,10719],{"class":255},"HighCPU\n",[234,10721,10722,10724,10726],{"class":236,"line":546},[234,10723,10656],{"class":9017},[234,10725,6562],{"class":387},[234,10727,10728],{"class":255},"100 - (avg by(instance) (rate(node_cpu_seconds_total{mode=\"idle\"}[5m])) * 100) > 
80\n",[234,10730,10731,10733,10735],{"class":236,"line":552},[234,10732,10666],{"class":9017},[234,10734,6562],{"class":387},[234,10736,10737],{"class":255},"10m\n",[234,10739,10740,10742],{"class":236,"line":557},[234,10741,9652],{"class":9017},[234,10743,9021],{"class":387},[234,10745,10746,10748,10750],{"class":236,"line":594},[234,10747,10682],{"class":9017},[234,10749,6562],{"class":387},[234,10751,10752],{"class":255},"warning\n",[234,10754,10755],{"class":236,"line":635},[234,10756,412],{"emptyLinePlaceholder":411},[234,10758,10759,10761,10763,10765],{"class":236,"line":643},[234,10760,9050],{"class":387},[234,10762,10646],{"class":9017},[234,10764,6562],{"class":387},[234,10766,10767],{"class":255},"DiskAlmostFull\n",[234,10769,10770,10772,10774],{"class":236,"line":659},[234,10771,10656],{"class":9017},[234,10773,6562],{"class":387},[234,10775,10776],{"class":255},"(node_filesystem_avail_bytes{mountpoint=\"\u002F\"} \u002F node_filesystem_size_bytes{mountpoint=\"\u002F\"}) * 100 \u003C 15\n",[234,10778,10779,10781,10783],{"class":236,"line":683},[234,10780,10666],{"class":9017},[234,10782,6562],{"class":387},[234,10784,10785],{"class":255},"5m\n",[234,10787,10788,10790],{"class":236,"line":695},[234,10789,9652],{"class":9017},[234,10791,9021],{"class":387},[234,10793,10794,10796,10798],{"class":236,"line":717},[234,10795,10682],{"class":9017},[234,10797,6562],{"class":387},[234,10799,10687],{"class":255},[234,10801,10802],{"class":236,"line":723},[234,10803,412],{"emptyLinePlaceholder":411},[234,10805,10806,10808,10810,10812],{"class":236,"line":729},[234,10807,9050],{"class":387},[234,10809,10646],{"class":9017},[234,10811,6562],{"class":387},[234,10813,10814],{"class":255},"HighMemory\n",[234,10816,10817,10819,10821],{"class":236,"line":734},[234,10818,10656],{"class":9017},[234,10820,6562],{"class":387},[234,10822,10823],{"class":255},"(1 - (node_memory_MemAvailable_bytes \u002F node_memory_MemTotal_bytes)) * 100 > 
90\n",[234,10825,10826,10828,10830],{"class":236,"line":771},[234,10827,10666],{"class":9017},[234,10829,6562],{"class":387},[234,10831,10737],{"class":255},[234,10833,10834,10836],{"class":236,"line":776},[234,10835,9652],{"class":9017},[234,10837,9021],{"class":387},[234,10839,10840,10842,10844],{"class":236,"line":815},[234,10841,10682],{"class":9017},[234,10843,6562],{"class":387},[234,10845,10752],{"class":255},[234,10847,10848],{"class":236,"line":820},[234,10849,412],{"emptyLinePlaceholder":411},[234,10851,10852,10854,10856,10858],{"class":236,"line":826},[234,10853,9050],{"class":387},[234,10855,10646],{"class":9017},[234,10857,6562],{"class":387},[234,10859,10860],{"class":255},"HighHTTPErrorRate\n",[234,10862,10863,10865,10867],{"class":236,"line":846},[234,10864,10656],{"class":9017},[234,10866,6562],{"class":387},[234,10868,10869],{"class":255},"sum(rate(http_requests_total{status=~\"5..\"}[5m])) \u002F sum(rate(http_requests_total[5m])) > 0.05\n",[234,10871,10872,10874,10876],{"class":236,"line":859},[234,10873,10666],{"class":9017},[234,10875,6562],{"class":387},[234,10877,10785],{"class":255},[234,10879,10880,10882],{"class":236,"line":872},[234,10881,9652],{"class":9017},[234,10883,9021],{"class":387},[234,10885,10886,10888,10890],{"class":236,"line":898},[234,10887,10682],{"class":9017},[234,10889,6562],{"class":387},[234,10891,10687],{"class":255},[234,10893,10894],{"class":236,"line":913},[234,10895,412],{"emptyLinePlaceholder":411},[234,10897,10898,10900,10902,10904],{"class":236,"line":1886},[234,10899,9050],{"class":387},[234,10901,10646],{"class":9017},[234,10903,6562],{"class":387},[234,10905,10906],{"class":255},"HighLatency\n",[234,10908,10909,10911,10913],{"class":236,"line":1901},[234,10910,10656],{"class":9017},[234,10912,6562],{"class":387},[234,10914,10915],{"class":255},"histogram_quantile(0.99, sum(rate(http_request_duration_seconds_bucket[5m])) by (le)) > 
2\n",[234,10917,10918,10920,10922],{"class":236,"line":1920},[234,10919,10666],{"class":9017},[234,10921,6562],{"class":387},[234,10923,10737],{"class":255},[234,10925,10926,10928],{"class":236,"line":1944},[234,10927,9652],{"class":9017},[234,10929,9021],{"class":387},[234,10931,10932,10934,10936],{"class":236,"line":1962},[234,10933,10682],{"class":9017},[234,10935,6562],{"class":387},[234,10937,10752],{"class":255},[12,10939,10940,10941,10944],{},"And the ",[231,10942,10943],{},"alertmanager\u002Falertmanager.yml"," pointing to a Slack or Discord webhook:",[224,10946,10948],{"className":9008,"code":10947,"language":9010,"meta":229,"style":229},"route:\n  group_by: ['alertname', 'severity']\n  group_wait: 30s\n  group_interval: 5m\n  repeat_interval: 4h\n  receiver: 'slack-default'\n  routes:\n    - match:\n        severity: critical\n      receiver: 'slack-critical'\n      repeat_interval: 1h\n\nreceivers:\n  - name: 'slack-default'\n    slack_configs:\n      - api_url: 'https:\u002F\u002Fhooks.slack.com\u002Fservices\u002FYOUR\u002FWEBHOOK\u002FHERE'\n        channel: '#alerts'\n        send_resolved: true\n\n  - name: 'slack-critical'\n    slack_configs:\n      - api_url: 'https:\u002F\u002Fhooks.slack.com\u002Fservices\u002FYOUR\u002FWEBHOOK\u002FHERE'\n        channel: '#alerts-critical'\n        send_resolved: true\n",[231,10949,10950,10957,10974,10983,10992,11002,11012,11019,11028,11037,11047,11057,11061,11068,11078,11085,11097,11107,11116,11120,11130,11136,11146,11155],{"__ignoreMap":229},[234,10951,10952,10955],{"class":236,"line":237},[234,10953,10954],{"class":9017},"route",[234,10956,9021],{"class":387},[234,10958,10959,10962,10964,10967,10969,10972],{"class":236,"line":244},[234,10960,10961],{"class":9017},"  
group_by",[234,10963,9521],{"class":387},[234,10965,10966],{"class":255},"'alertname'",[234,10968,571],{"class":387},[234,10970,10971],{"class":255},"'severity'",[234,10973,9527],{"class":387},[234,10975,10976,10979,10981],{"class":236,"line":271},[234,10977,10978],{"class":9017},"  group_wait",[234,10980,6562],{"class":387},[234,10982,10632],{"class":255},[234,10984,10985,10988,10990],{"class":236,"line":415},[234,10986,10987],{"class":9017},"  group_interval",[234,10989,6562],{"class":387},[234,10991,10785],{"class":255},[234,10993,10994,10997,10999],{"class":236,"line":434},[234,10995,10996],{"class":9017},"  repeat_interval",[234,10998,6562],{"class":387},[234,11000,11001],{"class":255},"4h\n",[234,11003,11004,11007,11009],{"class":236,"line":459},[234,11005,11006],{"class":9017},"  receiver",[234,11008,6562],{"class":387},[234,11010,11011],{"class":255},"'slack-default'\n",[234,11013,11014,11017],{"class":236,"line":464},[234,11015,11016],{"class":9017},"  routes",[234,11018,9021],{"class":387},[234,11020,11021,11023,11026],{"class":236,"line":479},[234,11022,9505],{"class":387},[234,11024,11025],{"class":9017},"match",[234,11027,9021],{"class":387},[234,11029,11030,11033,11035],{"class":236,"line":484},[234,11031,11032],{"class":9017},"        severity",[234,11034,6562],{"class":387},[234,11036,10687],{"class":255},[234,11038,11039,11042,11044],{"class":236,"line":490},[234,11040,11041],{"class":9017},"      receiver",[234,11043,6562],{"class":387},[234,11045,11046],{"class":255},"'slack-critical'\n",[234,11048,11049,11052,11054],{"class":236,"line":508},[234,11050,11051],{"class":9017},"      
repeat_interval",[234,11053,6562],{"class":387},[234,11055,11056],{"class":255},"1h\n",[234,11058,11059],{"class":236,"line":529},[234,11060,412],{"emptyLinePlaceholder":411},[234,11062,11063,11066],{"class":236,"line":535},[234,11064,11065],{"class":9017},"receivers",[234,11067,9021],{"class":387},[234,11069,11070,11072,11074,11076],{"class":236,"line":546},[234,11071,9543],{"class":387},[234,11073,10456],{"class":9017},[234,11075,6562],{"class":387},[234,11077,11011],{"class":255},[234,11079,11080,11083],{"class":236,"line":552},[234,11081,11082],{"class":9017},"    slack_configs",[234,11084,9021],{"class":387},[234,11086,11087,11089,11092,11094],{"class":236,"line":557},[234,11088,9050],{"class":387},[234,11090,11091],{"class":9017},"api_url",[234,11093,6562],{"class":387},[234,11095,11096],{"class":255},"'https:\u002F\u002Fhooks.slack.com\u002Fservices\u002FYOUR\u002FWEBHOOK\u002FHERE'\n",[234,11098,11099,11102,11104],{"class":236,"line":594},[234,11100,11101],{"class":9017},"        channel",[234,11103,6562],{"class":387},[234,11105,11106],{"class":255},"'#alerts'\n",[234,11108,11109,11112,11114],{"class":236,"line":635},[234,11110,11111],{"class":9017},"        
send_resolved",[234,11113,6562],{"class":387},[234,11115,10184],{"class":251},[234,11117,11118],{"class":236,"line":643},[234,11119,412],{"emptyLinePlaceholder":411},[234,11121,11122,11124,11126,11128],{"class":236,"line":659},[234,11123,9543],{"class":387},[234,11125,10456],{"class":9017},[234,11127,6562],{"class":387},[234,11129,11046],{"class":255},[234,11131,11132,11134],{"class":236,"line":683},[234,11133,11082],{"class":9017},[234,11135,9021],{"class":387},[234,11137,11138,11140,11142,11144],{"class":236,"line":695},[234,11139,9050],{"class":387},[234,11141,11091],{"class":9017},[234,11143,6562],{"class":387},[234,11145,11096],{"class":255},[234,11147,11148,11150,11152],{"class":236,"line":717},[234,11149,11101],{"class":9017},[234,11151,6562],{"class":387},[234,11153,11154],{"class":255},"'#alerts-critical'\n",[234,11156,11157,11159,11161],{"class":236,"line":723},[234,11158,11111],{"class":9017},[234,11160,6562],{"class":387},[234,11162,10184],{"class":251},[12,11164,11165,11166,11169,11170,11173],{},"Two details that save your sleep. The ",[231,11167,11168],{},"for: 10m"," on CPU prevents short spikes from becoming alerts — the server can hit 95% for 30 seconds and that's normal. The ",[231,11171,11172],{},"repeat_interval: 4h"," for warnings ensures that a warning resolved in one hour doesn't become 60 messages — Alertmanager groups them.",[12,11175,11176,11177,11179,11180,11183,11184,11187],{},"Reload Prometheus (",[231,11178,9747],{},") and test by forcing an alert: ",[231,11181,11182],{},"stress --cpu 4 --timeout 700s"," on one of the servers should trigger ",[231,11185,11186],{},"HighCPU"," within 10 minutes.",[19,11189,11191],{"id":11190},"step-8-how-to-put-reverse-proxy-and-tls-in-front","Step 8 — How to put reverse proxy and TLS in front?",[12,11193,8927,11194,101],{},[27,11195,10413],{},[12,11197,11198,11199,11201],{},"To access Grafana via ",[231,11200,10419],{}," with a valid certificate, you need something in front of port 3000. 
Two options:",[67,11203,11204,11214],{},[70,11205,11206,11209,11210,11213],{},[27,11207,11208],{},"Orchestrator's integrated router"," — if you already have the HeroCtl cluster running, just declare Grafana as a job with ",[231,11211,11212],{},"ingress: { host: monitor.yourdomain.com, tls: true }",". Automatic Let's Encrypt certificate, without additional tool.",[70,11215,11216,11219,11220],{},[27,11217,11218],{},"Caddy standalone"," on the observability VPS itself — also issues Let's Encrypt automatically. Minimum Caddyfile:",[224,11221,11224],{"className":11222,"code":11223,"language":2529},[2527],"monitor.yourdomain.com {\n  reverse_proxy localhost:3000\n  basicauth \u002Flogin {\n    admin \u003Cbcrypt_hash>\n  }\n}\n",[231,11225,11223],{"__ignoreMap":229},[12,11227,11228,11229,11232],{},"For defense in depth, keep Caddy\u002Frouter basic authentication in front of Grafana login — two barriers, not one. The second is especially important because the default Grafana login is ",[231,11230,11231],{},"admin\u002Fadmin"," and the first thing bots do on an exposed Grafana is try that combination.",[19,11234,11236],{"id":11235},"step-9-how-to-instrument-application-metrics","Step 9 — How to instrument application metrics?",[12,11238,8927,11239,101],{},[27,11240,11241],{},"varies according to number of applications",[12,11243,11244],{},"System metrics are half the story. 
The other half is what your application is doing — how many requests per second, what the p99 latency is, how many errors, what the background job queue size is.",[12,11246,11247],{},"Each popular language has an official Prometheus client:",[2734,11249,11250,11258,11266,11273],{},[70,11251,11252,6562,11255],{},[27,11253,11254],{},"Node.js",[231,11256,11257],{},"prom-client",[70,11259,11260,6562,11263],{},[27,11261,11262],{},"Python",[231,11264,11265],{},"prometheus-client",[70,11267,11268,6562,11271],{},[27,11269,11270],{},"Ruby",[231,11272,11265],{},[70,11274,11275,6562,11278],{},[27,11276,11277],{},"Go",[231,11279,11280],{},"github.com\u002Fprometheus\u002Fclient_golang",[12,11282,11283],{},"The minimum standard is three metrics per HTTP endpoint:",[2734,11285,11286,11300,11306],{},[70,11287,11288,11291,11292,571,11295,571,11298,101],{},[231,11289,11290],{},"http_requests_total"," — counter, with labels ",[231,11293,11294],{},"method",[231,11296,11297],{},"path",[231,11299,614],{},[70,11301,11302,11305],{},[231,11303,11304],{},"http_request_duration_seconds"," — histogram, same label set.",[70,11307,11308,11311,11312,11315],{},[231,11309,11310],{},"app_errors_total"," — counter, with label ",[231,11313,11314],{},"kind"," (\"validation\", \"db\", \"external_api\", etc).",[12,11317,11318,11319,11321,11322,11324],{},"Expose all of that in ",[231,11320,8907],{},". Add the endpoint in Prometheus's ",[231,11323,9555],{},". In hours you have dashboards per endpoint, alerts per error rate, and the ability to answer \"what was happening at 3:14 yesterday\" with a graph instead of a guess.",[12,11326,11327,11328,11331,11332,11335],{},"Watch for ",[27,11329,11330],{},"cardinality",". Each unique combination of labels becomes a separate time series. If you put ",[231,11333,11334],{},"user_id"," as label, with 100k users you create 100k series — and Prometheus will consume 8+ GB of RAM just to index that. 
Practical rule: labels should take values from small sets (status code: 5 values; method: 5 values; path: dozens). Unique identifiers go in logs, not in metrics.",[19,11337,11339],{"id":11338},"how-to-run-this-inside-heroctl-instead-of-dedicated-vps","How to run this inside HeroCtl instead of dedicated VPS?",[12,11341,11342],{},"For clusters already running the orchestrator, it makes sense to consider the stack as one more job. Trade-off: you save a VPS, but lose isolation (if the cluster dies, monitoring dies with it).",[12,11344,11345],{},"The topology looks like this:",[2734,11347,11348,11354,11360,11366],{},[70,11349,11350,11353],{},[27,11351,11352],{},"1 single job spec"," with 4 tasks: prometheus, grafana, loki, alertmanager.",[70,11355,11356,11359],{},[27,11357,11358],{},"Replicated volumes"," in the cluster — data survives node failure.",[70,11361,11362,11365],{},[27,11363,11364],{},"Integrated router"," does automatic TLS via subdomain. No need for additional Caddy.",[70,11367,11368,11371],{},[27,11369,11370],{},"Cluster's own metrics"," are already exposed in Prometheus format on the administrative API, so the scrape is direct.",[12,11373,11374],{},"For critical production, we recommend physical separation (dedicated VPS outside the cluster). For personal project, MVP, or small team where \"everything falls together\" is acceptable, running inside is cheaper and operationally simpler. 
The entire job spec sits around 80 lines of manifest.",[19,11376,11378],{"id":11377},"how-much-does-this-stack-cost-per-month-in-brazil","How much does this stack cost per month in Brazil?",[119,11380,11381,11391],{},[122,11382,11383],{},[125,11384,11385,11388],{},[128,11386,11387],{},"Item",[128,11389,11390],{},"Monthly cost (BRL)",[141,11392,11393,11401,11409,11417],{},[125,11394,11395,11398],{},[146,11396,11397],{},"Dedicated observability VPS (4 GB RAM)",[146,11399,11400],{},"R$40 to R$80",[125,11402,11403,11406],{},[146,11404,11405],{},"Object storage for long log retention (optional)",[146,11407,11408],{},"R$30",[125,11410,11411,11414],{},[146,11412,11413],{},"Maintenance time (2 to 4h × hour value)",[146,11415,11416],{},"R$200 to R$400",[125,11418,11419,11424],{},[146,11420,11421],{},[27,11422,11423],{},"Total operational",[146,11425,11426],{},[27,11427,11428],{},"R$300 to R$500",[12,11430,11431],{},"For comparison, a Datadog or New Relic subscription with equivalent coverage (5 hosts, 30-day log retention, alerts, dashboards) goes for around R$1,500 to R$2,000 per month — without counting the automatic overage that appears at month-end when someone forgets a verbose log on.",[12,11433,11434],{},"The difference isn't small: in a year, the open-source self-hosted stack saves between R$12,000 and R$18,000. 
For an early-stage startup, that's half a junior engineer.",[19,11436,11438],{"id":11437},"table-of-ports-resources-and-characteristics-per-component","Table of ports, resources and characteristics per component",[119,11440,11441,11461],{},[122,11442,11443],{},[125,11444,11445,11447,11450,11452,11455,11458],{},[128,11446,130],{},[128,11448,11449],{},"Port",[128,11451,3873],{},[128,11453,11454],{},"Disk",[128,11456,11457],{},"Default retention",[128,11459,11460],{},"Data format",[141,11462,11463,11482,11500,11518,11537,11553],{},[125,11464,11465,11467,11470,11473,11476,11479],{},[146,11466,8831],{},[146,11468,11469],{},"9090",[146,11471,11472],{},"512 MB",[146,11474,11475],{},"10 GB",[146,11477,11478],{},"15 days",[146,11480,11481],{},"binary TSDB",[125,11483,11484,11486,11489,11492,11495,11497],{},[146,11485,8837],{},[146,11487,11488],{},"3000",[146,11490,11491],{},"256 MB",[146,11493,11494],{},"1 GB",[146,11496,3055],{},[146,11498,11499],{},"SQLite or Postgres",[125,11501,11502,11504,11507,11509,11512,11515],{},[146,11503,8843],{},[146,11505,11506],{},"3100",[146,11508,11472],{},[146,11510,11511],{},"30 GB",[146,11513,11514],{},"30 days (configurable)",[146,11516,11517],{},"compressed chunks",[125,11519,11520,11523,11526,11529,11532,11534],{},[146,11521,11522],{},"Promtail \u002F Agent",[146,11524,11525],{},"9080",[146,11527,11528],{},"128 MB",[146,11530,11531],{},"minimum",[146,11533,3055],{},[146,11535,11536],{},"passes by value",[125,11538,11539,11541,11544,11546,11548,11550],{},[146,11540,8861],{},[146,11542,11543],{},"9093",[146,11545,11528],{},[146,11547,11494],{},[146,11549,3055],{},[146,11551,11552],{},"notification log",[125,11554,11555,11557,11560,11563,11565,11567],{},[146,11556,8855],{},[146,11558,11559],{},"9100",[146,11561,11562],{},"64 MB",[146,11564,11531],{},[146,11566,3055],{},[146,11568,11569],{},"scrape endpoint",[12,11571,11572],{},"These are the viable minimums for small cluster. 
In production with 30 servers and real traffic, multiply RAM by 3 and disk by 5.",[19,11574,11576],{"id":11575},"the-four-errors-that-kill-a-new-monitoring-stack","The four errors that kill a new monitoring stack",[12,11578,11579],{},"Teams setting up observability for the first time stumble almost always on the same four errors. Knowing about them beforehand saves months.",[12,11581,11582,11585,11586,11589],{},[27,11583,11584],{},"Not monitoring monitoring."," Prometheus stopped scraping on Thursday; nobody saw it. On Wednesday of the following week a server actually went down and they discovered there was no alert because Prometheus was dead for 6 days. Solution: configure a simple external cron (even free Pingdom serves) that hits ",[231,11587,11588],{},"https:\u002F\u002Fmonitor.yourdomain.com\u002Fapi\u002Fhealth"," every 5 minutes and warns you when Grafana itself falls.",[12,11591,11592,11595,11596,11599],{},[27,11593,11594],{},"No retention strategy."," Disk fills up in three months, Prometheus stops recording, someone deletes everything in despair, loses 90 days of history. Configure ",[231,11597,11598],{},"--storage.tsdb.retention.time=30d"," from day one and establish a housekeeping job.",[12,11601,11602,11605,11606,571,11608,11611],{},[27,11603,11604],{},"High cardinality in labels."," We already covered in step 9, but worth repeating: each ",[231,11607,11334],{},[231,11609,11610],{},"request_id"," or UUID that becomes a label is a number that explosively multiplies Prometheus RAM consumption. Unique identifiers go to Loki, not to Prometheus.",[12,11613,11614,11617],{},[27,11615,11616],{},"Noisy alerts."," The team receives 200 alerts per day. In two weeks, nobody looks anymore. When the site actually crashes, the alert will be in the middle of 199 others. Solution: start with six alerts (those from step 7), audit every two weeks, and exclude everything that fired but didn't require human action. 
Alert without action is noise.",[19,11619,3225],{"id":3224},[12,11621,11622,11625],{},[27,11623,11624],{},"Can I run everything on a 2 GB VPS?","\nTechnically yes, for a cluster of up to 3 servers and few applications. In practice you'll hit the RAM ceiling in 2 to 3 months, especially if you import dense Grafana dashboards. Pay 50 reais more and go straight to 4 GB VPS — the time you save not fighting OOM kills pays for itself.",[12,11627,11628,11631],{},[27,11629,11630],{},"How much disk for 30 days of logs?","\nDepends entirely on your application's log volume. Rough rule for small startup: cluster of 4 servers with normal web applications generates 1 to 5 GB of logs per day after Loki compression. Thirty days gives between 30 and 150 GB. Start with 50 GB SSD, monitor growth for two weeks, expand if necessary. If you go much beyond that, it's time to go to object storage.",[12,11633,11634,11637],{},[27,11635,11636],{},"Grafana Cloud vs self-hosted, which to choose?","\nGrafana Cloud free tier is generous (10k series, 50 GB of logs, 14-day retention) and eliminates the work of maintaining the server. For solo project or very small team, makes sense. From the moment you exceed the free tier, prices scale fast — from US$50\u002Fmonth — and you lose control over the data. Self-hosted costs hardware + time, Cloud costs money + lock-in. For a company that intends to grow and has a DevOps dev on the team, self-hosted wins.",[12,11639,11640,11643],{},[27,11641,11642],{},"Promtail or Grafana Agent?","\nIn 2026, Grafana Agent (renamed to Grafana Alloy) is officially replacing Promtail. For new setup, go straight to Alloy. For setup that has been running Promtail for a long time, no urgency to migrate — Promtail will continue working for years.",[12,11645,11646,11649,11650,11652],{},[27,11647,11648],{},"Where does OpenTelemetry fit in this stack?","\nOTel is the application instrumentation standard that's consolidating. 
Instead of using ",[231,11651,11257],{}," directly, you use OTel's SDK and it exports to Prometheus, Loki and Tempo simultaneously. The big advantage is portability — if you want to swap Prometheus for something else 3 years from now, your application doesn't change a line. For a startup starting today, we recommend OTel from day one.",[12,11654,11655,11658,11659,11662,11663,11666],{},[27,11656,11657],{},"How do I backup Prometheus?","\nPrometheus has snapshot via API: ",[231,11660,11661],{},"curl -X POST localhost:9090\u002Fapi\u002Fv1\u002Fadmin\u002Ftsdb\u002Fsnapshot"," creates a snapshot in the data directory. Do that once a day via cron, do ",[231,11664,11665],{},"tar.gz"," and send to object storage. In case of disaster, what you lose is metrics — and metrics, unlike logs, are typically recoverable in hours (start collecting again and dashboards return). Lost logs are lost forever, so invest more in Loki backup.",[12,11668,11669,11672],{},[27,11670,11671],{},"Tempo (distributed traces) worth installing now?","\nNo. Traces become useful from the moment you have 5+ services talking to each other and debugging latency involves following a request through several hops. For monolithic architecture or few services, traces give disproportionate work to the value. Add when complexity calls for it.",[12,11674,11675,11678],{},[27,11676,11677],{},"Does Loki index full-text like ELK?","\nNo, and that's the feature, not bug. Loki indexes only labels (job, host, container, severity) and log content stays compressed without index. To search text, you filter by labels first and then grep on the resulting chunks. That's what makes Loki ten times cheaper than ELK in RAM and CPU. In exchange, free-text queries across all history are slower. 
For 90% of debugging cases, filtering by job + host + time window already reduces the search space to dozens of MB, where grep flies.",[19,11680,11682],{"id":11681},"next-steps","Next steps",[12,11684,11685],{},"Brought up the stack, have dashboard, have alert, have searchable log? Good. The next three things worth investing in are, in order:",[67,11687,11688,11694,11708],{},[70,11689,11690,11693],{},[27,11691,11692],{},"Custom dashboards per application"," — business metrics (subscriptions created\u002Fhour, jobs processed, email queue) instead of just infrastructure.",[70,11695,11696,11699,11700,11703,11704,11707],{},[27,11697,11698],{},"Runbooks linked in alerts"," — every rule in ",[231,11701,11702],{},"alerts.yml"," should have ",[231,11705,11706],{},"annotations.runbook_url"," pointing to a page explaining what to do. When the alert fires at 3 AM, a half-asleep brain shouldn't have to think.",[70,11709,11710,11713],{},[27,11711,11712],{},"Monthly alert review"," — 30 minutes once a month auditing what fired in the previous month, deleting what became noise, adjusting thresholds.",[12,11715,11716,11717,11721,11722,101],{},"For those who want to go further and understand why we chose this stack instead of managed SaaS, read ",[3336,11718,11720],{"href":11719},"\u002Fen\u002Fblog\u002Fobservability-without-datadog-startup-stack","Observability without Datadog: the Brazilian startup stack",". 
And to close the operations cycle — because there's no point knowing the database fell if you can't restore — it's worth reading ",[3336,11723,11725],{"href":11724},"\u002Fen\u002Fblog\u002Fdatabase-backup-strategies-cluster","Database backup in cluster: strategies for 3 AM",[12,11727,11728],{},"If you want to skip this entire setup and run the stack as a job inside an orchestrator that already takes care of TLS, rolling update deploy and volume replication:",[224,11730,11731],{"className":226,"code":5318,"language":228,"meta":229,"style":229},[231,11732,11733],{"__ignoreMap":229},[234,11734,11735,11737,11739,11741,11743],{"class":236,"line":237},[234,11736,1220],{"class":247},[234,11738,2957],{"class":251},[234,11740,5329],{"class":255},[234,11742,2963],{"class":383},[234,11744,2966],{"class":247},[12,11746,11747],{},"Four hours become forty minutes. The rest is the same work of thinking about which alerts matter — and on that part, no one frees you.",[3350,11749,11750],{},"html pre.shiki code .sPWt5, html code.shiki .sPWt5{--shiki-default:#7EE787}html pre.shiki code .sZEs4, html code.shiki .sZEs4{--shiki-default:#E6EDF3}html pre.shiki code .s9uIt, html code.shiki .s9uIt{--shiki-default:#A5D6FF}html pre.shiki code .sH3jZ, html code.shiki .sH3jZ{--shiki-default:#8B949E}html .default .shiki span {color: var(--shiki-default);background: var(--shiki-default-bg);font-style: var(--shiki-default-font-style);font-weight: var(--shiki-default-font-weight);text-decoration: var(--shiki-default-text-decoration);}html .shiki span {color: var(--shiki-default);background: var(--shiki-default-bg);font-style: var(--shiki-default-font-style);font-weight: var(--shiki-default-font-weight);text-decoration: var(--shiki-default-text-decoration);}html pre.shiki code .sFSAA, html code.shiki .sFSAA{--shiki-default:#79C0FF}html pre.shiki code .sQhOw, html code.shiki .sQhOw{--shiki-default:#FFA657}html pre.shiki code .suJrU, html code.shiki 
.suJrU{--shiki-default:#FF7B72}",{"title":229,"searchDepth":244,"depth":244,"links":11752},[11753,11754,11755,11756,11757,11758,11759,11760,11761,11762,11763,11764,11765,11766,11767,11768,11769,11770],{"id":21,"depth":244,"text":22},{"id":8820,"depth":244,"text":8821},{"id":8880,"depth":244,"text":8881},{"id":8923,"depth":244,"text":8924},{"id":8980,"depth":244,"text":8981},{"id":9439,"depth":244,"text":9440},{"id":9763,"depth":244,"text":9764},{"id":9929,"depth":244,"text":9930},{"id":10407,"depth":244,"text":10408},{"id":10581,"depth":244,"text":10582},{"id":11190,"depth":244,"text":11191},{"id":11235,"depth":244,"text":11236},{"id":11338,"depth":244,"text":11339},{"id":11377,"depth":244,"text":11378},{"id":11437,"depth":244,"text":11438},{"id":11575,"depth":244,"text":11576},{"id":3224,"depth":244,"text":3225},{"id":11681,"depth":244,"text":11682},"2026-05-12","Honest tutorial to spin up metrics, logs and dashboards for your cluster — in 4 hours, without Datadog. Open-source stack that fits in 1 VPS at R$80\u002Fmonth.",{},"\u002Fen\u002Fblog\u002Fmonitoring-stack-prometheus-grafana-loki",{"title":8773,"description":11772},{"loc":11774},"en\u002Fblog\u002Fmonitoring-stack-prometheus-grafana-loki",[11779,11780,11781,11782,3392,3378],"prometheus","grafana","loki","monitoring","qXuCsrBWk65Tau6l18D0_EwAL61sTr4A97-gZfDIzKs",{"id":11785,"title":11786,"author":7,"body":11787,"category":3378,"cover":3379,"date":12737,"description":12738,"draft":3382,"extension":3383,"lastReviewed":3379,"meta":12739,"navigation":411,"path":12740,"readingTime":4401,"seo":12741,"sitemap":12742,"stem":12743,"tags":12744,"__hash__":12749},"blog_en\u002Fen\u002Fblog\u002Fcloudflare-in-front-of-self-hosted-cluster.md","Cloudflare in front of a self-hosted cluster: is it worth it in 
2026?",{"type":9,"value":11788,"toc":12715},[11789,11792,11796,11799,11802,11805,11809,11812,11868,11871,11875,11878,12018,12021,12025,12028,12034,12040,12047,12050,12054,12057,12083,12090,12093,12113,12117,12120,12158,12161,12165,12168,12241,12260,12264,12267,12273,12279,12285,12291,12295,12298,12318,12321,12325,12528,12531,12535,12538,12558,12562,12565,12569,12575,12579,12582,12586,12589,12625,12628,12632,12635,12639,12642,12646,12653,12657,12660,12664,12667,12670,12686,12689,12692,12710,12713],[12,11790,11791],{},"The question comes back every week in a Brazilian DevOps group: \"I brought up my cluster with three servers on DigitalOcean, is it worth putting Cloudflare in front?\". The short answer is \"almost always yes\" — but the \"almost\" carries trade-offs that nobody mentions until the first time something breaks in production and you spend two hours debugging a cache rule that masked a 500 from the app. This post is the long version, with measurable criteria, of the decision you need to make before moving the nameserver.",[19,11793,11795],{"id":11794},"tldr-the-200-word-summary","TL;DR — the 200-word summary",[12,11797,11798],{},"Free Cloudflare became de facto standard for any Brazilian site with traffic: protects against DDoS without contractual limit, issues automatic SSL certificate, caches assets in over 300 cities, hides the server's origin IP and still delivers DNS with sub-10ms. 
For a self-hosted cluster — be it HeroCtl, Coolify, k3s or Docker Swarm — putting Cloudflare in front is an easy decision in around 90% of cases.",[12,11800,11801],{},"The remaining 10% have concrete trade-offs: additional latency of 10 to 30ms on dynamic routes, TLS terminates at Cloudflare by default (no longer end-to-end to your server), cache rules can mask subtle bugs in the app, and lock-in grows as you adopt Workers, R2 and Pages.",[12,11803,11804],{},"Worth it when: you want DDoS protection without paying; global cache to reduce bandwidth cost; hide the server IP from scanners. Not worth it when: financial\u002Fhealth compliance requires truly end-to-end TLS; you need p99 below 50ms on dynamic routes; the cluster already has internal CDN edge in multiple data centers. A cluster with integrated router already covers around 60% of what Cloudflare offers — combining the two is the most common path.",[19,11806,11808],{"id":11807},"what-does-cloudflare-offer-for-free-in-2026","What does Cloudflare offer for free in 2026?",[12,11810,11811],{},"The free offering grew year over year. Today, the free plan covers what was paid plan in 2019:",[2734,11813,11814,11820,11826,11832,11838,11844,11850,11856,11862],{},[70,11815,11816,11819],{},[27,11817,11818],{},"DDoS protection without contractual limit"," — Layer 3, 4 and 7. Cloudflare absorbs attacks of hundreds of Gbps without charging excess.",[70,11821,11822,11825],{},[27,11823,11824],{},"Automatic SSL\u002FTLS certificate"," — issued in minutes by Cloudflare itself, automatically renewed. Wildcard requires the Advanced Certificate Manager plan (US$ 10\u002Fmonth).",[70,11827,11828,11831],{},[27,11829,11830],{},"Global CDN"," — over 300 cities in over 120 countries. 
Includes São Paulo, Rio, Fortaleza, Curitiba and Porto Alegre.",[70,11833,11834,11837],{},[27,11835,11836],{},"Authoritative DNS"," — sub-10ms global average, anycast, with APIs for automation.",[70,11839,11840,11843],{},[27,11841,11842],{},"Basic bot protection"," — blocking known bots and JavaScript challenges on suspicious traffic.",[70,11845,11846,11849],{},[27,11847,11848],{},"Static asset cache"," — recognized extensions (CSS, JS, images, fonts) cached by default.",[70,11851,11852,11855],{},[27,11853,11854],{},"Page Rules"," — three free rules to force HTTPS, extra cache, redirects.",[70,11857,11858,11861],{},[27,11859,11860],{},"Always Online"," — when origin falls, Cloudflare serves the last cached version.",[70,11863,11864,11867],{},[27,11865,11866],{},"Web Analytics"," — RUM metrics (visits, countries, browsers), no cookies.",[12,11869,11870],{},"The cutoff line is generous enough that a 10k visitors\u002Fday site runs 100% on free without any operational problem.",[19,11872,11874],{"id":11873},"and-what-does-cloudflare-charge-extra-for","And what does Cloudflare charge extra for?",[12,11876,11877],{},"Four plans: Free, Pro (US$ 25\u002Fmonth per domain), Business (US$ 250\u002Fmonth per domain) and Enterprise (on consultation, generally above US$ 5k\u002Fmonth).",[119,11879,11880,11897],{},[122,11881,11882],{},[125,11883,11884,11887,11889,11892,11895],{},[128,11885,11886],{},"Resource",[128,11888,7825],{},[128,11890,11891],{},"Pro US$ 25",[128,11893,11894],{},"Business US$ 250",[128,11896,4359],{},[141,11898,11899,11915,11929,11942,11956,11971,11984,12001],{},[125,11900,11901,11904,11906,11909,11912],{},[146,11902,11903],{},"WAF managed rulesets",[146,11905,3058],{},[146,11907,11908],{},"Yes (basic OWASP)",[146,11910,11911],{},"Yes (advanced)",[146,11913,11914],{},"Custom",[125,11916,11917,11920,11922,11925,11927],{},[146,11918,11919],{},"Image Resizing",[146,11921,3058],{},[146,11923,11924],{},"Yes (US$ 
5\u002FM)",[146,11926,3064],{},[146,11928,3064],{},[125,11930,11931,11934,11936,11938,11940],{},[146,11932,11933],{},"Polish (image optimization)",[146,11935,3058],{},[146,11937,3064],{},[146,11939,3064],{},[146,11941,3064],{},[125,11943,11944,11947,11949,11952,11954],{},[146,11945,11946],{},"Argo Smart Routing",[146,11948,3058],{},[146,11950,11951],{},"US$ 5\u002Fmonth add-on",[146,11953,3064],{},[146,11955,3064],{},[125,11957,11958,11961,11963,11966,11968],{},[146,11959,11960],{},"Page Rules included",[146,11962,2698],{},[146,11964,11965],{},"20",[146,11967,5650],{},[146,11969,11970],{},"125+",[125,11972,11973,11976,11978,11980,11982],{},[146,11974,11975],{},"Cache Reserve",[146,11977,3058],{},[146,11979,3058],{},[146,11981,3064],{},[146,11983,3064],{},[125,11985,11986,11989,11992,11995,11998],{},[146,11987,11988],{},"Customer Support SLA",[146,11990,11991],{},"Best-effort",[146,11993,11994],{},"24h",[146,11996,11997],{},"Chat 24\u002F7",[146,11999,12000],{},"Dedicated engineer",[125,12002,12003,12006,12009,12012,12015],{},[146,12004,12005],{},"Log analysis",[146,12007,12008],{},"Last hour",[146,12010,12011],{},"Last 24h",[146,12013,12014],{},"Last 7 days",[146,12016,12017],{},"30 days",[12,12019,12020],{},"Workers and R2 have free tier independent of plan: 100k requests\u002Fday for Workers, 10 GB of storage and 1 million Class A operations\u002Fmonth for R2. For a modest marketing site, you can run image storage on R2 without ever reaching the bill.",[19,12022,12024],{"id":12023},"does-cloudflare-add-latency","Does Cloudflare add latency?",[12,12026,12027],{},"The honest question. Honest answer too: depends on the route.",[12,12029,12030,12033],{},[27,12031,12032],{},"For cached routes"," (static HTML, assets, optimized images), Cloudflare reduces latency. The user in Recife gets the content from the Fortaleza or São Paulo POP in 15 to 40ms, instead of doing round-trip to your server in New Jersey or Frankfurt. 
Typical savings: 150 to 250ms per request.",[12,12035,12036,12039],{},[27,12037,12038],{},"For dynamic routes"," (API, logged dashboard, checkout), traffic passes through the Cloudflare proxy before reaching your server. That adds between 10 and 30ms in normal conditions. The exact number depends on which POP the user is connected to and where the origin server is.",[12,12041,12042,12043,12046],{},"We measured on the public production cluster: the average response time of ",[231,12044,12045],{},"manage.heroctl.com\u002Fv1\u002Fnodes"," is 38ms without Cloudflare proxy and 51ms with proxy enabled, requesting from the same notebook in São Paulo. A delta of 13ms — perceptible in benchmark, invisible to a human.",[12,12048,12049],{},"Latency is only a dealbreaker in three real scenarios: online gaming, high-frequency financial auction, and low-latency WebSocket loads (trading, live collaboration). For the rest, the 13ms disappear in the browser render time.",[19,12051,12053],{"id":12052},"does-cloudflare-break-end-to-end-tls","Does Cloudflare break end-to-end TLS?",[12,12055,12056],{},"By default, yes. See the modes:",[2734,12058,12059,12065,12071,12077],{},[70,12060,12061,12064],{},[27,12062,12063],{},"Flexible"," (NEVER use this) — TLS only between client and Cloudflare. Cloudflare → server connection is plain HTTP. Vulnerable to interception on the inner leg.",[70,12066,12067,12070],{},[27,12068,12069],{},"Full"," — TLS between client and Cloudflare, and separately between Cloudflare and server. But Cloudflare accepts invalid\u002Fself-signed certificate at the server. Risk of man-in-the-middle between Cloudflare and origin.",[70,12072,12073,12076],{},[27,12074,12075],{},"Full (strict)"," — TLS on both legs, and Cloudflare requires valid certificate at origin. 
This is the minimum reasonable configuration.",[70,12078,12079,12082],{},[27,12080,12081],{},"Strict (SSL-Only Origin Pull)"," — Cloudflare verifies that the origin's certificate was issued by a public valid CA for the hostname. More secure than Full strict.",[12,12084,12085,12086,12089],{},"In all these modes, ",[27,12087,12088],{},"Cloudflare decrypts traffic in the middle of the path",". They see request body, headers, cookies — everything. For most cases that is acceptable (the contract with Cloudflare is clear), but in strict compliance (health, financial, government) it can break audit requirements.",[12,12091,12092],{},"The real exit for end-to-end:",[2734,12094,12095,12101,12107],{},[70,12096,12097,12100],{},[27,12098,12099],{},"Authenticated Origin Pulls"," — Cloudflare presents a client certificate when connecting to your origin; the server only accepts connections from that chain. Still decrypts in the middle, but at least only Cloudflare can reach your origin.",[70,12102,12103,12106],{},[27,12104,12105],{},"Cloudflare Tunnel + mTLS client at the endpoint"," — for internal apps, Tunnel replaces public IP and requires client certificate.",[70,12108,12109,12112],{},[27,12110,12111],{},"Gray cloud (DNS only)"," — disables proxy. You lose DDoS protection, cache, WAF — but get direct client-server connection with truly end-to-end TLS. It is a valid option when compliance commands.",[19,12114,12116],{"id":12115},"will-i-get-locked-in-to-cloudflare","Will I get locked-in to Cloudflare?",[12,12118,12119],{},"Depends exclusively on which features you adopt. Let's go layer by layer:",[2734,12121,12122,12128,12134,12140,12146,12152],{},[70,12123,12124,12127],{},[27,12125,12126],{},"DNS"," — trivially reversible. Moving nameserver takes 24 to 48h of propagation and nothing breaks. Zero lock-in.",[70,12129,12130,12133],{},[27,12131,12132],{},"Proxy + cache + WAF"," — reversible in hours. 
You disable the orange cloud, adjust DNS to point directly to the server, reconfigure WAF on your origin (if any). Low lock-in.",[70,12135,12136,12139],{},[27,12137,12138],{},"Workers"," — real lock-in. The Workers API is proprietary; rewriting to Lambda@Edge or Fastly Compute@Edge costs days to weeks depending on the code. It is not the worst case, but count on rework.",[70,12141,12142,12145],{},[27,12143,12144],{},"R2 Object Storage"," — S3-compatible API, so code keeps working. But R2 doesn't charge egress (S3 charges US$ 0.09\u002FGB), so moving to another provider makes the bill more expensive. Economic lock-in, not technical.",[70,12147,12148,12151],{},[27,12149,12150],{},"Pages"," — moderate lock-in. Build process is custom; rewrite to Vercel\u002FNetlify\u002Fgeneric static host takes an afternoon, but it is real rework.",[70,12153,12154,12157],{},[27,12155,12156],{},"Zero Trust"," — high lock-in. Policies, identity, tunnels: complete rewrite to Tailscale\u002FTwingate\u002Fequivalent.",[12,12159,12160],{},"The operational recommendation is: use the Cloudflare core (DNS + proxy + WAF + Page Rules) without hesitation — you can revert in a day. Adopt Workers\u002FR2\u002FPages only with clear awareness that you are accepting lock-in proportional to the value that feature delivers.",[19,12162,12164],{"id":12163},"minimum-recommended-configuration-for-self-hosted-cluster","Minimum recommended configuration for self-hosted cluster",[12,12166,12167],{},"Practical sequence, no secrets:",[67,12169,12170,12176,12185,12195,12201,12207,12213,12223,12229,12235],{},[70,12171,12172,12175],{},[27,12173,12174],{},"Create a Cloudflare account"," and add the domain. The site will scan your current DNS records and copy them to the new zone.",[70,12177,12178,12181,12182,101],{},[27,12179,12180],{},"Change the nameservers"," at the registrar (Hostinger, Registro.br, GoDaddy, wherever you are). Wait 4 to 48 hours for propagation. 
Verify with ",[231,12183,12184],{},"dig NS heroctl.com +short",[70,12186,12187,12190,12191,12194],{},[27,12188,12189],{},"DNS records of the cluster",": create an A record for the root domain pointing to the IP of the server receiving traffic, and a wildcard A record ",[231,12192,12193],{},"*"," pointing to the same IP. Mark both with proxy enabled (orange cloud).",[70,12196,12197,12200],{},[27,12198,12199],{},"SSL\u002FTLS mode",": configure Full (strict). That requires the cluster to have a valid certificate. The HeroCtl integrated router issues Let's Encrypt automatically — works out of the box.",[70,12202,12203,12206],{},[27,12204,12205],{},"Always Use HTTPS",": ON. Redirects any HTTP to HTTPS at the edge.",[70,12208,12209,12212],{},[27,12210,12211],{},"HSTS",": 6 months, include subdomains, no preload for now. Preload is a definitive decision — you can't undo it quickly if something breaks.",[70,12214,12215,12218,12219,12222],{},[27,12216,12217],{},"Page Rule for cache"," of static assets: ",[231,12220,12221],{},"*heroctl.com\u002Fstatic\u002F*"," → Cache Level: Cache Everything, Edge Cache TTL: 1 month.",[70,12224,12225,12228],{},[27,12226,12227],{},"WAF managed ruleset"," (Pro+): enable the Cloudflare Managed Ruleset and OWASP Core Rule Set in Block mode for high-score rules.",[70,12230,12231,12234],{},[27,12232,12233],{},"Security Level",": Medium. Low lets too many bots through; High challenges legitimate people.",[70,12236,12237,12240],{},[27,12238,12239],{},"Bot Fight Mode",": ON on the free plan. 
It controls basic scrapers without making humans solve a CAPTCHA.",[12,12242,12243,12244,12247,12248,12251,12252,12255,12256,12259],{},"After applying all of that, run ",[231,12245,12246],{},"curl -I https:\u002F\u002Fyourdomain.com"," and confirm: header ",[231,12249,12250],{},"cf-ray"," present, header ",[231,12253,12254],{},"server: cloudflare",", header ",[231,12257,12258],{},"strict-transport-security"," with long max-age.",[19,12261,12263],{"id":12262},"when-is-cloudflare-not-worth-it","When is Cloudflare NOT worth it?",[12,12265,12266],{},"Four scenarios where the recommendation changes. They matter more than they seem.",[12,12268,12269,12272],{},[27,12270,12271],{},"Cluster with robust internal CDN\u002Fedge."," If you already run in four or five geographically spread regions, with proximity-based DNS balancing and local cache in each region, Cloudflare's CDN adds latency without gain. It's worth running gray cloud (DNS only) and keeping the rest direct.",[12,12274,12275,12278],{},[27,12276,12277],{},"Financial or health compliance with mandatory end-to-end mTLS."," LGPD by itself doesn't require this; but specific audits (PCI-DSS Level 1 with custom requirements, strict HIPAA certifications, banking frameworks) may require encrypted traffic to never be decrypted at a third party. Since Cloudflare decrypts traffic in the middle of the path, it doesn't pass those audits.",[12,12280,12281,12284],{},[27,12282,12283],{},"Purely internal apps (intranet\u002Fclosed B2B SaaS)."," Free Cloudflare doesn't cover advanced Zero Trust. For an app that exclusively serves employees, Tailscale or native WireGuard deliver more with less.",[12,12286,12287,12290],{},[27,12288,12289],{},"Small sites with little traffic and no obvious adversaries."," A personal blog with 200 visits\u002Fmonth, no payment form, no sensitive data. Direct DNS at Hostinger\u002FRegistro.br + Let's Encrypt from the integrated router serves perfectly. 
Adding Cloudflare is unnecessary ceremony.",[19,12292,12294],{"id":12293},"how-does-cloudflare-interact-with-a-high-availability-cluster","How does Cloudflare interact with a high availability cluster?",[12,12296,12297],{},"Here the design matters. A cluster with three or more nodes serves traffic on all of them — there is no single \"main\" node. The pragmatic configuration is:",[2734,12299,12300,12306,12312],{},[70,12301,12302,12305],{},[27,12303,12304],{},"DNS round-robin with health checks",": register A records for the IPs of all nodes that run the router. Cloudflare runs health checks (Pro+) and removes a broken node from rotation automatically.",[70,12307,12308,12311],{},[27,12309,12310],{},"Cloudflare failover",": ~30 seconds to detect a dead node and remove it from rotation (configurable to 5 seconds on Enterprise).",[70,12313,12314,12317],{},[27,12315,12316],{},"Internal cluster failover",": the HeroCtl integrated router reroutes traffic between healthy nodes in around 5 seconds. New coordinator election happens in ~7 seconds when the leader node goes down.",[12,12319,12320],{},"Combined, real downtime perceived by the user stays below 40 seconds in the worst case (Cloudflare detects + cluster reacts). Without Cloudflare, it stays at ~7 seconds (cluster alone). With Cloudflare and an aggressive monitoring configuration (Pro+), it comes back down to ~10 seconds. The choice is clear: if you don't need DDoS protection, the cluster alone is already faster. 
If you need it, Cloudflare adds 30s of detection in exchange for protection against attacks.",[19,12322,12324],{"id":12323},"comparison-table-12-decision-criteria","Comparison table: 12 decision criteria",[119,12326,12327,12345],{},[122,12328,12329],{},[125,12330,12331,12333,12336,12339,12342],{},[128,12332,2982],{},[128,12334,12335],{},"Without Cloudflare",[128,12337,12338],{},"CF Free",[128,12340,12341],{},"CF Pro US$ 25",[128,12343,12344],{},"CF Business US$ 250",[141,12346,12347,12363,12380,12397,12413,12426,12441,12456,12471,12483,12495,12511],{},[125,12348,12349,12352,12355,12358,12360],{},[146,12350,12351],{},"DDoS Layer 3\u002F4",[146,12353,12354],{},"You handle it",[146,12356,12357],{},"Unlimited",[146,12359,12357],{},[146,12361,12362],{},"Unlimited + SLA",[125,12364,12365,12368,12371,12374,12377],{},[146,12366,12367],{},"DDoS Layer 7",[146,12369,12370],{},"Not available",[146,12372,12373],{},"Basic",[146,12375,12376],{},"Advanced",[146,12378,12379],{},"Advanced + Custom Rules",[125,12381,12382,12385,12388,12391,12394],{},[146,12383,12384],{},"Added latency on dynamic routes",[146,12386,12387],{},"0ms",[146,12389,12390],{},"+13 to 30ms",[146,12392,12393],{},"+10 to 25ms (Argo optional)",[146,12395,12396],{},"+5 to 15ms (Argo included)",[125,12398,12399,12402,12405,12408,12410],{},[146,12400,12401],{},"Global static cache",[146,12403,12404],{},"You build",[146,12406,12407],{},"300+ cities",[146,12409,12407],{},[146,12411,12412],{},"300+ cities + Reserve",[125,12414,12415,12418,12420,12422,12424],{},[146,12416,12417],{},"Hides server IP",[146,12419,3058],{},[146,12421,3064],{},[146,12423,3064],{},[146,12425,3064],{},[125,12427,12428,12431,12433,12436,12438],{},[146,12429,12430],{},"Truly end-to-end TLS",[146,12432,3064],{},[146,12434,12435],{},"No (decrypts)",[146,12437,3058],{},[146,12439,12440],{},"No (but Origin Pulls)",[125,12442,12443,12446,12448,12450,12453],{},[146,12444,12445],{},"Managed 
WAF",[146,12447,12370],{},[146,12449,3058],{},[146,12451,12452],{},"Basic OWASP",[146,12454,12455],{},"Advanced OWASP",[125,12457,12458,12461,12463,12465,12468],{},[146,12459,12460],{},"Bot protection",[146,12462,12404],{},[146,12464,12239],{},[146,12466,12467],{},"Super Bot Fight",[146,12469,12470],{},"Bot Management ML",[125,12472,12473,12475,12477,12479,12481],{},[146,12474,11854],{},[146,12476,3055],{},[146,12478,2698],{},[146,12480,11965],{},[146,12482,5650],{},[125,12484,12485,12487,12489,12491,12493],{},[146,12486,11860],{},[146,12488,3058],{},[146,12490,3064],{},[146,12492,3064],{},[146,12494,3064],{},[125,12496,12497,12500,12503,12505,12508],{},[146,12498,12499],{},"Monthly cost per domain",[146,12501,12502],{},"US$ 0",[146,12504,12502],{},[146,12506,12507],{},"US$ 25",[146,12509,12510],{},"US$ 250",[125,12512,12513,12516,12519,12522,12525],{},[146,12514,12515],{},"Proportional lock-in",[146,12517,12518],{},"Zero",[146,12520,12521],{},"Low (DNS+proxy)",[146,12523,12524],{},"Low to medium",[146,12526,12527],{},"Medium (Workers\u002FR2 come into play)",[12,12529,12530],{},"The rows that decide it for most teams are \"DDoS Layer 7\" and \"Hides server IP\". These two alone justify the free plan. The paid tiers only make sense with high-volume traffic or a formal WAF requirement.",[19,12532,12534],{"id":12533},"does-free-cloudflare-have-a-traffic-limit","Does free Cloudflare have a traffic limit?",[12,12536,12537],{},"There is no contractual bandwidth limit on the free plan for normal web traffic through the proxy. But there are three practical limits worth mentioning:",[2734,12539,12540,12546,12552],{},[70,12541,12542,12545],{},[27,12543,12544],{},"Section 2.8 of the Terms of Service",": the free plan is for sites whose main content is HTML, and Cloudflare reserves the right to ask for an upgrade if you use the service primarily to serve video or large files. 
In practice, they rarely act on this — but if you become a host for 50TB\u002Fmonth of pirated videos, expect to receive an email.",[70,12547,12548,12551],{},[27,12549,12550],{},"Workers free",": 100k requests\u002Fday. Above that, Workers Paid (US$ 5\u002Fmonth) with 10M requests included.",[70,12553,12554,12557],{},[27,12555,12556],{},"R2 free",": 10GB of storage, 1M Class A operations\u002Fmonth, 10M Class B operations\u002Fmonth. Above that, US$ 0.015\u002FGB-month.",[19,12559,12561],{"id":12560},"can-i-use-cloudflare-dns-without-the-proxy","Can I use Cloudflare DNS without the proxy?",[12,12563,12564],{},"Yes — \"DNS only\" mode (gray cloud). You use Cloudflare DNS (fast, free, global anycast) but traffic goes directly to your server without passing through the proxy. You lose DDoS, cache, WAF and IP hiding — and keep only the DNS infrastructure. Useful when: compliance prohibits decryption at third parties; you only want fast DNS without touching the traffic path; you are testing before activating the proxy.",[19,12566,12568],{"id":12567},"does-free-waf-block-sql-injection","Does free WAF block SQL injection?",[12,12570,12571,12572,12574],{},"Cloudflare Free has ",[27,12573,12239],{}," and automatic mitigation rules for obvious patterns, but it doesn't have the complete OWASP Managed Ruleset. For reliable blocking of SQL injection, XSS and known RCE patterns, you need the Pro plan or higher. Alternative: run ModSecurity or your own WAF at your origin — it works, but adds CPU and configuration overhead.",[19,12576,12578],{"id":12577},"does-cloudflare-have-a-datacenter-in-brazil","Does Cloudflare have a datacenter in Brazil?",[12,12580,12581],{},"Yes. In 2026 there are POPs in five Brazilian cities: São Paulo (two POPs), Rio de Janeiro, Fortaleza, Curitiba and Porto Alegre. Typical latency from any city in the Southeast to a POP stays below 20ms. The Fortaleza POP serves the Northeast very well because of the submarine cables that land there (EllaLink, Monet, GlobeNet). 
For the North, it is still a longer path — Manaus reaches Fortaleza in 80 to 120ms.",[19,12583,12585],{"id":12584},"how-do-i-migrate-nameservers-from-hostinger-to-cloudflare","How do I migrate nameservers from Hostinger to Cloudflare?",[12,12587,12588],{},"Four steps. It takes less than an hour of active work, plus up to 48h of propagation:",[67,12590,12591,12597,12609,12615],{},[70,12592,12593,12596],{},[27,12594,12595],{},"Cloudflare",": add the domain. The wizard scans your current DNS and creates the corresponding records in the new zone. Check that everything was copied — MX, TXT (SPF\u002FDKIM\u002FDMARC), CNAME, A. A copy error here can take your email down for a week.",[70,12598,12599,12601,12602,2402,12605,12608],{},[27,12600,12595],{},": it gives you two nameservers (something like ",[231,12603,12604],{},"kim.ns.cloudflare.com",[231,12606,12607],{},"walt.ns.cloudflare.com","). Note them.",[70,12610,12611,12614],{},[27,12612,12613],{},"Hostinger",": panel → Domains → your domain → Nameservers → \"Use custom nameservers\" → paste the two from Cloudflare. Save.",[70,12616,12617,12620,12621,12624],{},[27,12618,12619],{},"Wait for propagation",". Verify with ",[231,12622,12623],{},"dig NS yourdomain.com +short",". When the Cloudflare nameservers appear, the domain is under their management. DNS records continue to be edited on the Cloudflare panel from here on.",[12,12626,12627],{},"Important: while propagation happens, some users will still resolve via Hostinger. Don't turn off the old zone until you confirm 100% of resolvers have switched (24 to 48 hours is safe).",[19,12629,12631],{"id":12630},"where-does-tls-terminate-does-e2e-break","Where does TLS terminate? Does E2E break?",[12,12633,12634],{},"In proxy mode (orange cloud), TLS terminates at Cloudflare. They re-establish another TLS connection to your server (in Full strict mode). Technically: decrypts, processes, re-encrypts. 
For truly end-to-end TLS: gray cloud (DNS only) or Cloudflare Tunnel with custom configuration. For most applications, \"truly end-to-end TLS\" is less important than it seems — the attack this protects against (interception in the middle of the network) requires an attacker already inside the Cloudflare network, an unrealistic scenario.",[19,12636,12638],{"id":12637},"cloudflare-workers-vs-serverless-from-my-cloud-when-is-it-worth-it","Cloudflare Workers vs serverless from my cloud — when is it worth it?",[12,12640,12641],{},"Workers are good for: edge computing where latency \u003C50ms matters (geo-routing, A\u002FB testing, header rewrite); lightweight request\u002Fresponse transformation; auth at the edge (validating a JWT before it reaches the origin). They are not good for: workloads with more than 30 seconds of runtime; heavy integration with relational databases (the DB driver's cold-start latency kills performance); code that needs libraries that depend on the filesystem or process APIs. AWS Lambda remains better for long-running workloads; Workers win at the edge. Use both, don't replace one with the other.",[19,12643,12645],{"id":12644},"can-i-use-cloudflare-r2-with-a-self-hosted-cluster","Can I use Cloudflare R2 with a self-hosted cluster?",[12,12647,12648,12649,12652],{},"Yes — R2 is S3-compatible at the API level. Your app uses ",[231,12650,12651],{},"aws-sdk"," configured with the R2 endpoint and R2 credentials; the code stays the same. Economic advantage: zero egress fee. You can serve heavy downloads (installers, product videos, backups) directly from R2 without paying for outgoing bandwidth. Disadvantage: documented durability is 99.999999999% (11 nines), same as S3, but R2's operational history is shorter. For the critical hot path, some teams prefer to keep S3 and use R2 only for cold storage and static delivery.",[19,12654,12656],{"id":12655},"origin-fell-does-always-online-solve-it","Origin fell — does Always Online solve it?",[12,12658,12659],{},"Partially. 
Always Online serves the last cached version of HTML pages when the server is offline. But: it only works for routes that were already being cached; it only serves the static version (without fresh dynamic data); and it only lasts while Cloudflare keeps the snapshot (usually a few days). It is a good safety net for a static blog and marketing pages. It doesn't replace real cluster high availability — for a dynamic app, what solves it is a cluster with three nodes and automatic election when one goes down.",[19,12661,12663],{"id":12662},"closing-combining-cloudflare-with-self-hosted-cluster","Closing — combining Cloudflare with self-hosted cluster",[12,12665,12666],{},"The combination we recommend for 90% of cases is: self-hosted cluster with three or more nodes (real high availability) + Cloudflare Free at the edge (DDoS, cache, IP hiding). The cluster takes care of internal routing, automatic certificates, failover between nodes in seconds. Cloudflare takes care of public protection, global cache and IP obfuscation. The two layers complement each other — they don't compete.",[12,12668,12669],{},"To start from scratch with this combination:",[224,12671,12672],{"className":226,"code":5318,"language":228,"meta":229,"style":229},[231,12673,12674],{"__ignoreMap":229},[234,12675,12676,12678,12680,12682,12684],{"class":236,"line":237},[234,12677,1220],{"class":247},[234,12679,2957],{"class":251},[234,12681,5329],{"class":255},[234,12683,2963],{"class":383},[234,12685,2966],{"class":247},[12,12687,12688],{},"You end up with a functional cluster on three nodes, an automatic Let's Encrypt certificate on the domain you choose, a web panel to submit jobs, and real high availability. Then add Cloudflare Free in front of the domain and configure as per the \"Minimum configuration\" section of this post. 
Total time: an afternoon.",[12,12690,12691],{},"More reading along these lines:",[2734,12693,12694,12704],{},[70,12695,12696,12699,12700,12703],{},[3336,12697,12698],{"href":3343},"Docker deploy in production: from compose to cluster"," — how to go from ",[231,12701,12702],{},"docker compose up"," to real high availability, with the intermediate steps.",[70,12705,12706,12709],{},[3336,12707,12708],{"href":11719},"Observability without Datadog: stack for a startup"," — metrics, logs and tracing without paying US$ 2,000\u002Fmonth for an observability SaaS.",[12,12711,12712],{},"Cloudflare is one of the few tools where the free tier is so good that refusing it is stubbornness. But like any infra choice, the hard part is understanding exactly where the boundary lies — and, above all, where it passes through your application's encrypted traffic.",[3350,12714,4376],{},{"title":229,"searchDepth":244,"depth":244,"links":12716},[12717,12718,12719,12720,12721,12722,12723,12724,12725,12726,12727,12728,12729,12730,12731,12732,12733,12734,12735,12736],{"id":11794,"depth":244,"text":11795},{"id":11807,"depth":244,"text":11808},{"id":11873,"depth":244,"text":11874},{"id":12023,"depth":244,"text":12024},{"id":12052,"depth":244,"text":12053},{"id":12115,"depth":244,"text":12116},{"id":12163,"depth":244,"text":12164},{"id":12262,"depth":244,"text":12263},{"id":12293,"depth":244,"text":12294},{"id":12323,"depth":244,"text":12324},{"id":12533,"depth":244,"text":12534},{"id":12560,"depth":244,"text":12561},{"id":12567,"depth":244,"text":12568},{"id":12577,"depth":244,"text":12578},{"id":12584,"depth":244,"text":12585},{"id":12630,"depth":244,"text":12631},{"id":12637,"depth":244,"text":12638},{"id":12644,"depth":244,"text":12645},{"id":12655,"depth":244,"text":12656},{"id":12662,"depth":244,"text":12663},"2026-05-08","Free Cloudflare blocks DDoS, caches static assets and hides the server IP. But it adds latency, lock-in and features you may not use. 
When it's worth it and when it's overkill.",{},"\u002Fen\u002Fblog\u002Fcloudflare-in-front-of-self-hosted-cluster",{"title":11786,"description":12738},{"loc":12740},"en\u002Fblog\u002Fcloudflare-in-front-of-self-hosted-cluster",[12745,12746,12747,12748,3378],"cloudflare","cdn","ddos","performance","sh_ab86c1jnH6sTCINkR_53BAVKSC-NxuaXZKkb26gc",1777362201421]