[{"data":1,"prerenderedAt":31502},["ShallowReactive",2],{"blog-en-list":3},[4,3394,4411,5380,6397,7509,8771,11784,12750,13842,14941,15809,16728,17802,18361,19100,19786,20388,21750,22465,23019,23613,24169,25329,25885,26409,27055,27511,28302,29000,29760,30369,31045],{"id":5,"title":6,"author":7,"body":8,"category":3378,"cover":3379,"date":3380,"description":3381,"draft":3382,"extension":3383,"lastReviewed":3379,"meta":3384,"navigation":411,"path":3385,"readingTime":3386,"seo":3387,"sitemap":3388,"stem":3389,"tags":3390,"__hash__":3393},"blog_en\u002Fen\u002Fblog\u002Fzero-downtime-deploy-without-kubernetes.md","Zero-downtime deploy without Kubernetes: a practical tutorial in 2026","HeroCtl team",{"type":9,"value":10,"toc":3353},"minimark",[11,15,18,23,39,42,56,59,63,66,88,91,95,102,105,108,111,115,118,205,208,211,213,217,220,223,275,286,289,314,317,339,346,350,357,367,372,916,920,1000,1004,1080,1094,1097,1101,1104,1140,1143,1208,1211,1236,1239,1243,1246,1249,1252,1266,1273,1363,1369,1372,1434,1440,1457,1460,1464,1467,2346,2356,2382,2385,2443,2446,2450,2453,2456,2505,2508,2519,2524,2532,2535,2537,2541,2544,2606,2610,2617,2623,2637,2709,2712,2726,2730,2733,2754,2757,2761,2764,2782,2785,2789,2792,2940,2943,2946,2967,2970,2974,3167,3170,3174,3222,3226,3231,3238,3243,3246,3251,3254,3259,3265,3270,3273,3278,3284,3289,3292,3297,3304,3306,3310,3313,3316,3332,3346,3349],[12,13,14],"p",{},"There's a persistent myth that zero-downtime deploy is exclusive to those who ran Kubernetes in production. It isn't. The technique has existed since before the colossus had a name — any team that ran a pair of physical servers behind a load balancer last decade was already doing it, with fifty-line scripts and zero CRDs in their lives. What changed was the marketing around the practice, not the practice itself.",[12,16,17],{},"This post is a step-by-step tutorial to set up zero-downtime deploy from scratch, on two Linux machines, with no heavyweight orchestrator, no magic panel. 
At the end you'll have a bash script that swaps one instance at a time, waits for the new one to be healthy, and rolls to the next — exactly the algorithm large orchestrators implement, just without the boilerplate.",[19,20,22],"h2",{"id":21},"tldr","TL;DR",[12,24,25,26,30,31,34,35,38],{},"Zero-downtime deploy depends on three ingredients, not on a specific tool. First: ",[27,28,29],"strong",{},"two or more application instances running in parallel",", behind a basic proxy. Second: ",[27,32,33],{},"a reliable health check endpoint"," that validates real dependencies (database, cache, queue), doesn't just return 200 instantly. Third: ",[27,36,37],{},"a script or orchestrator that replaces one container at a time",", waiting for the new one to be healthy before moving on to the next.",[12,40,41],{},"This tutorial sets up the full setup on two Linux VPS with Docker, Caddy in front as proxy + load balancer, and a fifty-line bash script that does rolling update with active health check, minimum healthy time, and automatic rollback on failure. Result: deploy with no 5xx visible to the user, in less than a minute, with no maintenance window.",[12,43,44,47,48,51,52,55],{},[27,45,46],{},"Prerequisites:"," two Linux VPS with Docker (Hetzner CPX11 at R$30 each), domain with controllable DNS, app with a decent health check. ",[27,49,50],{},"Setup time:"," two to three hours. ",[27,53,54],{},"Monthly cost:"," R$60 (R$75 if you want a third VPS dedicated to the proxy). 
At the end we show the \"robust\" version via HeroCtl for those who want to stop scripting.",[57,58],"hr",{},[19,60,62],{"id":61},"the-three-ingredients-without-these-its-not-zero-downtime","The three ingredients (without these, it's not zero-downtime)",[12,64,65],{},"Before any command, worth fixing the theory — because every more elaborate configuration you'll see on the internet is a variation on these three pieces.",[67,68,69,76,82],"ol",{},[70,71,72,75],"li",{},[27,73,74],{},"Multiple instances of the app running in parallel."," Minimum two. If you only have one, any restart is an error window. There's no working around it with a configuration trick.",[70,77,78,81],{},[27,79,80],{},"A proxy\u002Fload balancer in front, doing health checks."," The proxy decides which instance to send traffic to. If one falls (or was deliberately taken out for the deploy), the proxy only sends to the remaining ones.",[70,83,84,87],{},[27,85,86],{},"A script that swaps instances one at a time."," Never all together. Wait for the new one to be healthy before touching the next. If the new one fails, halt the deploy and keep the old ones serving.",[12,89,90],{},"That's it. The rest — Kubernetes, modern panels, lightweight orchestrators — is wrapping around these three points.",[19,92,94],{"id":93},"why-single-server-is-never-zero-downtime-even-if-its-fast","Why single-server is NEVER zero-downtime (even if it's fast)",[12,96,97,98,101],{},"I see this question every week in the community Discord: \"can I do zero-downtime with a single server, if the deploy is fast enough?\". Short answer: ",[27,99,100],{},"no",".",[12,103,104],{},"On a single machine, the deploy cycle is: stop the old container, bring up the new one. Even if everything happens in three seconds, those three seconds exist. In-flight TCP connections are cut. Requests arriving in that interval get connection refused or 502. 
If you have five requests per second, that's fifteen users seeing errors per deploy.",[12,106,107],{},"There are clever variations — bring the new one up on a different port, switch the local proxy, drop the old one. That improves things but doesn't eliminate them. If the app takes time to close in-flight connections, the cutover still produces errors. If the health check is weak, the proxy points traffic at an app that hasn't finished coming up. There's always a window.",[12,109,110],{},"The only reliable way to eliminate the window is to have at least one instance always available throughout the deploy. That requires two machines. Period.",[19,112,114],{"id":113},"the-minimum-setup-two-vps-a-proxy","The minimum setup (two VPS + a proxy)",[12,116,117],{},"The cheapest topology that delivers real zero-downtime:",[119,120,121,140],"table",{},[122,123,124],"thead",{},[125,126,127,131,134,137],"tr",{},[128,129,130],"th",{},"Component",[128,132,133],{},"Size",[128,135,136],{},"Cost",[128,138,139],{},"Function",[141,142,143,158,170,189],"tbody",{},[125,144,145,149,152,155],{},[146,147,148],"td",{},"VPS A",[146,150,151],{},"2 vCPU \u002F 2 GB RAM",[146,153,154],{},"R$30\u002Fmonth",[146,156,157],{},"App instance 1",[125,159,160,163,165,167],{},[146,161,162],{},"VPS B",[146,164,151],{},[146,166,154],{},[146,168,169],{},"App instance 2",[125,171,172,175,183,186],{},[146,173,174],{},"Proxy",[146,176,177,178,182],{},"running on VPS A ",[179,180,181],"em",{},"or"," third VPS",[146,184,185],{},"R$0 (shared) or R$15\u002Fmonth",[146,187,188],{},"Caddy\u002Fnginx doing balance",[125,190,191,194,199,202],{},[146,192,193],{},"Database",[146,195,196,197,182],{},"managed Postgres ",[179,198,181],{},[146,200,201],{},"varies",[146,203,204],{},"Shared state between A and B",[12,206,207],{},"Keeping the proxy shared on one of the VPS itself saves money but has a trade-off: if the VPS hosting the proxy falls entirely, the site falls with it (even with the other VPS running). 
For a small team this is acceptable. When you grow, the proxy migrates to a dedicated VPS or becomes a redundant pair.",[12,209,210],{},"Your domain's DNS A record points to the proxy IP. Apps on A and B connect to the same database — without that shared part, the two instances diverge and the user sees different results depending on which one answered.",[57,212],{},[19,214,216],{"id":215},"step-1-provision-two-vps-15-min","Step 1 — Provision two VPS (15 min)",[12,218,219],{},"I use Hetzner CPX11 (€4.75 ≈ R$30) as a reference. DigitalOcean Droplet at US$6, Vultr Cloud Compute at US$6, or Linode Nanode at US$5 deliver something similar. What matters is modern Linux (Ubuntu 24.04 LTS or Debian 12) with Docker.",[12,221,222],{},"Provision both machines with the same SSH key:",[224,225,230],"pre",{"className":226,"code":227,"language":228,"meta":229,"style":229},"language-bash shiki shiki-themes github-dark-default","# from your laptop\nssh-keygen -t ed25519 -f ~\u002F.ssh\u002Fdeploy_key -C \"deploy@meudominio.com\"\n# add ~\u002F.ssh\u002Fdeploy_key.pub on the provider console before creating the VPS\n","bash","",[231,232,233,242,269],"code",{"__ignoreMap":229},[234,235,238],"span",{"class":236,"line":237},"line",1,[234,239,241],{"class":240},"sH3jZ","# from your laptop\n",[234,243,245,249,253,257,260,263,266],{"class":236,"line":244},2,[234,246,248],{"class":247},"sQhOw","ssh-keygen",[234,250,252],{"class":251},"sFSAA"," -t",[234,254,256],{"class":255},"s9uIt"," ed25519",[234,258,259],{"class":251}," -f",[234,261,262],{"class":255}," ~\u002F.ssh\u002Fdeploy_key",[234,264,265],{"class":251}," -C",[234,267,268],{"class":255}," \"deploy@meudominio.com\"\n",[234,270,272],{"class":236,"line":271},3,[234,273,274],{"class":240},"# add ~\u002F.ssh\u002Fdeploy_key.pub on the provider console before creating the VPS\n",[12,276,277,278,281,282,285],{},"Create each VPS, note the IPs. 
I'll use ",[231,279,280],{},"203.0.113.10"," (VPS A) and ",[231,283,284],{},"203.0.113.20"," (VPS B) as placeholders for the rest of the post.",[12,287,288],{},"Install Docker on each:",[224,290,292],{"className":226,"code":291,"language":228,"meta":229,"style":229},"ssh root@203.0.113.10 \"curl -fsSL https:\u002F\u002Fget.docker.com | sh\"\nssh root@203.0.113.20 \"curl -fsSL https:\u002F\u002Fget.docker.com | sh\"\n",[231,293,294,305],{"__ignoreMap":229},[234,295,296,299,302],{"class":236,"line":237},[234,297,298],{"class":247},"ssh",[234,300,301],{"class":255}," root@203.0.113.10",[234,303,304],{"class":255}," \"curl -fsSL https:\u002F\u002Fget.docker.com | sh\"\n",[234,306,307,309,312],{"class":236,"line":244},[234,308,298],{"class":247},[234,310,311],{"class":255}," root@203.0.113.20",[234,313,304],{"class":255},[12,315,316],{},"Configure firewall to allow only 22 (SSH) and 8080 (internal port where the app will listen). HTTP\u002FHTTPS traffic only arrives at the proxy:",[224,318,320],{"className":226,"code":319,"language":228,"meta":229,"style":229},"ssh root@203.0.113.10 \"ufw allow 22 && ufw allow 8080\u002Ftcp && ufw --force enable\"\nssh root@203.0.113.20 \"ufw allow 22 && ufw allow 8080\u002Ftcp && ufw --force enable\"\n",[231,321,322,331],{"__ignoreMap":229},[234,323,324,326,328],{"class":236,"line":237},[234,325,298],{"class":247},[234,327,301],{"class":255},[234,329,330],{"class":255}," \"ufw allow 22 && ufw allow 8080\u002Ftcp && ufw --force enable\"\n",[234,332,333,335,337],{"class":236,"line":244},[234,334,298],{"class":247},[234,336,311],{"class":255},[234,338,330],{"class":255},[12,340,341,342,345],{},"Validation: ",[231,343,344],{},"docker run --rm hello-world"," on each machine should complete without errors.",[19,347,349],{"id":348},"step-2-app-with-a-decent-health-check-30-min","Step 2 — App with a decent health check (30 min)",[12,351,352,353,356],{},"The ",[231,354,355],{},"\u002Fhealthz"," endpoint is the heart of the scheme. 
If it returns 200 when the app isn't actually ready, the proxy sends traffic to a broken instance and the user sees an error. If it returns 500 when the app is healthy, the proxy takes the good instance out of balancing. Meaning: the health check is the source of truth for the entire system.",[12,358,359,360,362,363,366],{},"Golden rule: ",[231,361,355],{}," validates ",[27,364,365],{},"real dependencies the app needs to respond",". Minimum: connection to the database. If you have a cache (Redis), include it. If you have a queue (SQS, RabbitMQ), include it. DON'T return 200 right at boot — wait for assets to compile, cache to warm, connections to open.",[368,369,371],"h3",{"id":370},"nodejs-express","Node.js (Express)",[224,373,377],{"className":374,"code":375,"language":376,"meta":229,"style":229},"language-js shiki shiki-themes github-dark-default","import express from \"express\"\nimport { Pool } from \"pg\"\n\nconst app = express()\nconst pool = new Pool({ connectionString: process.env.DATABASE_URL })\n\nlet ready = false\n\n\u002F\u002F warm-up assíncrono — só fica ready quando dependencies validam\n;(async () => {\n  await pool.query(\"SELECT 1\")\n  \u002F\u002F outras inicializações: cache prime, etc.\n  ready = true\n})()\n\napp.get(\"\u002Fhealthz\", async (_req, res) => {\n  if (!ready) return res.status(503).send(\"warming up\")\n  try {\n    await pool.query(\"SELECT 1\")\n    res.status(200).send(\"ok\")\n  } catch (e) {\n    res.status(503).send(\"db down\")\n  }\n})\n\napp.get(\"\u002F\", (_req, res) => res.send(\"Hello v1\"))\n\nconst server = app.listen(8080, () => console.log(\"listening 8080\"))\n\n\u002F\u002F graceful shutdown — drena conexões antes de morrer\nprocess.on(\"SIGTERM\", () => {\n  ready = false  \u002F\u002F health check passa a falhar imediatamente\n  setTimeout(() => {\n    server.close(() => process.exit(0))\n  }, 5000)  \u002F\u002F 5s pro proxy notar e parar de mandar tráfego 
novo\n})\n","js",[231,378,379,395,407,413,432,457,462,477,482,488,506,527,533,544,550,555,592,633,641,657,681,693,715,721,727,732,769,774,813,818,824,844,857,870,896,911],{"__ignoreMap":229},[234,380,381,385,389,392],{"class":236,"line":237},[234,382,384],{"class":383},"suJrU","import",[234,386,388],{"class":387},"sZEs4"," express ",[234,390,391],{"class":383},"from",[234,393,394],{"class":255}," \"express\"\n",[234,396,397,399,402,404],{"class":236,"line":244},[234,398,384],{"class":383},[234,400,401],{"class":387}," { Pool } ",[234,403,391],{"class":383},[234,405,406],{"class":255}," \"pg\"\n",[234,408,409],{"class":236,"line":271},[234,410,412],{"emptyLinePlaceholder":411},true,"\n",[234,414,416,419,422,425,429],{"class":236,"line":415},4,[234,417,418],{"class":383},"const",[234,420,421],{"class":251}," app",[234,423,424],{"class":383}," =",[234,426,428],{"class":427},"sc3cj"," express",[234,430,431],{"class":387},"()\n",[234,433,435,437,440,442,445,448,451,454],{"class":236,"line":434},5,[234,436,418],{"class":383},[234,438,439],{"class":251}," pool",[234,441,424],{"class":383},[234,443,444],{"class":383}," new",[234,446,447],{"class":427}," Pool",[234,449,450],{"class":387},"({ connectionString: process.env.",[234,452,453],{"class":251},"DATABASE_URL",[234,455,456],{"class":387}," })\n",[234,458,460],{"class":236,"line":459},6,[234,461,412],{"emptyLinePlaceholder":411},[234,463,465,468,471,474],{"class":236,"line":464},7,[234,466,467],{"class":383},"let",[234,469,470],{"class":387}," ready ",[234,472,473],{"class":383},"=",[234,475,476],{"class":251}," false\n",[234,478,480],{"class":236,"line":479},8,[234,481,412],{"emptyLinePlaceholder":411},[234,483,485],{"class":236,"line":484},9,[234,486,487],{"class":240},"\u002F\u002F warm-up assíncrono — só fica ready quando dependencies validam\n",[234,489,491,494,497,500,503],{"class":236,"line":490},10,[234,492,493],{"class":387},";(",[234,495,496],{"class":383},"async",[234,498,499],{"class":387}," () 
",[234,501,502],{"class":383},"=>",[234,504,505],{"class":387}," {\n",[234,507,509,512,515,518,521,524],{"class":236,"line":508},11,[234,510,511],{"class":383},"  await",[234,513,514],{"class":387}," pool.",[234,516,517],{"class":427},"query",[234,519,520],{"class":387},"(",[234,522,523],{"class":255},"\"SELECT 1\"",[234,525,526],{"class":387},")\n",[234,528,530],{"class":236,"line":529},12,[234,531,532],{"class":240},"  \u002F\u002F outras inicializações: cache prime, etc.\n",[234,534,536,539,541],{"class":236,"line":535},13,[234,537,538],{"class":387},"  ready ",[234,540,473],{"class":383},[234,542,543],{"class":251}," true\n",[234,545,547],{"class":236,"line":546},14,[234,548,549],{"class":387},"})()\n",[234,551,553],{"class":236,"line":552},15,[234,554,412],{"emptyLinePlaceholder":411},[234,556,558,561,564,566,569,572,574,577,580,582,585,588,590],{"class":236,"line":557},16,[234,559,560],{"class":387},"app.",[234,562,563],{"class":427},"get",[234,565,520],{"class":387},[234,567,568],{"class":255},"\"\u002Fhealthz\"",[234,570,571],{"class":387},", ",[234,573,496],{"class":383},[234,575,576],{"class":387}," (",[234,578,579],{"class":247},"_req",[234,581,571],{"class":387},[234,583,584],{"class":247},"res",[234,586,587],{"class":387},") ",[234,589,502],{"class":383},[234,591,505],{"class":387},[234,593,595,598,600,603,606,609,612,615,617,620,623,626,628,631],{"class":236,"line":594},17,[234,596,597],{"class":383},"  if",[234,599,576],{"class":387},[234,601,602],{"class":383},"!",[234,604,605],{"class":387},"ready) ",[234,607,608],{"class":383},"return",[234,610,611],{"class":387}," res.",[234,613,614],{"class":427},"status",[234,616,520],{"class":387},[234,618,619],{"class":251},"503",[234,621,622],{"class":387},").",[234,624,625],{"class":427},"send",[234,627,520],{"class":387},[234,629,630],{"class":255},"\"warming up\"",[234,632,526],{"class":387},[234,634,636,639],{"class":236,"line":635},18,[234,637,638],{"class":383},"  
try",[234,640,505],{"class":387},[234,642,644,647,649,651,653,655],{"class":236,"line":643},19,[234,645,646],{"class":383},"    await",[234,648,514],{"class":387},[234,650,517],{"class":427},[234,652,520],{"class":387},[234,654,523],{"class":255},[234,656,526],{"class":387},[234,658,660,663,665,667,670,672,674,676,679],{"class":236,"line":659},20,[234,661,662],{"class":387},"    res.",[234,664,614],{"class":427},[234,666,520],{"class":387},[234,668,669],{"class":251},"200",[234,671,622],{"class":387},[234,673,625],{"class":427},[234,675,520],{"class":387},[234,677,678],{"class":255},"\"ok\"",[234,680,526],{"class":387},[234,682,684,687,690],{"class":236,"line":683},21,[234,685,686],{"class":387},"  } ",[234,688,689],{"class":383},"catch",[234,691,692],{"class":387}," (e) {\n",[234,694,696,698,700,702,704,706,708,710,713],{"class":236,"line":695},22,[234,697,662],{"class":387},[234,699,614],{"class":427},[234,701,520],{"class":387},[234,703,619],{"class":251},[234,705,622],{"class":387},[234,707,625],{"class":427},[234,709,520],{"class":387},[234,711,712],{"class":255},"\"db down\"",[234,714,526],{"class":387},[234,716,718],{"class":236,"line":717},23,[234,719,720],{"class":387},"  }\n",[234,722,724],{"class":236,"line":723},24,[234,725,726],{"class":387},"})\n",[234,728,730],{"class":236,"line":729},25,[234,731,412],{"emptyLinePlaceholder":411},[234,733,735,737,739,741,744,747,749,751,753,755,757,759,761,763,766],{"class":236,"line":734},26,[234,736,560],{"class":387},[234,738,563],{"class":427},[234,740,520],{"class":387},[234,742,743],{"class":255},"\"\u002F\"",[234,745,746],{"class":387},", (",[234,748,579],{"class":247},[234,750,571],{"class":387},[234,752,584],{"class":247},[234,754,587],{"class":387},[234,756,502],{"class":383},[234,758,611],{"class":387},[234,760,625],{"class":427},[234,762,520],{"class":387},[234,764,765],{"class":255},"\"Hello 
v1\"",[234,767,768],{"class":387},"))\n",[234,770,772],{"class":236,"line":771},27,[234,773,412],{"emptyLinePlaceholder":411},[234,775,777,779,782,784,787,790,792,795,798,800,803,806,808,811],{"class":236,"line":776},28,[234,778,418],{"class":383},[234,780,781],{"class":251}," server",[234,783,424],{"class":383},[234,785,786],{"class":387}," app.",[234,788,789],{"class":427},"listen",[234,791,520],{"class":387},[234,793,794],{"class":251},"8080",[234,796,797],{"class":387},", () ",[234,799,502],{"class":383},[234,801,802],{"class":387}," console.",[234,804,805],{"class":427},"log",[234,807,520],{"class":387},[234,809,810],{"class":255},"\"listening 8080\"",[234,812,768],{"class":387},[234,814,816],{"class":236,"line":815},29,[234,817,412],{"emptyLinePlaceholder":411},[234,819,821],{"class":236,"line":820},30,[234,822,823],{"class":240},"\u002F\u002F graceful shutdown — drena conexões antes de morrer\n",[234,825,827,830,833,835,838,840,842],{"class":236,"line":826},31,[234,828,829],{"class":387},"process.",[234,831,832],{"class":427},"on",[234,834,520],{"class":387},[234,836,837],{"class":255},"\"SIGTERM\"",[234,839,797],{"class":387},[234,841,502],{"class":383},[234,843,505],{"class":387},[234,845,847,849,851,854],{"class":236,"line":846},32,[234,848,538],{"class":387},[234,850,473],{"class":383},[234,852,853],{"class":251}," false",[234,855,856],{"class":240},"  \u002F\u002F health check passa a falhar imediatamente\n",[234,858,860,863,866,868],{"class":236,"line":859},33,[234,861,862],{"class":427},"  setTimeout",[234,864,865],{"class":387},"(() ",[234,867,502],{"class":383},[234,869,505],{"class":387},[234,871,873,876,879,881,883,886,889,891,894],{"class":236,"line":872},34,[234,874,875],{"class":387},"    server.",[234,877,878],{"class":427},"close",[234,880,865],{"class":387},[234,882,502],{"class":383},[234,884,885],{"class":387}," 
process.",[234,887,888],{"class":427},"exit",[234,890,520],{"class":387},[234,892,893],{"class":251},"0",[234,895,768],{"class":387},[234,897,899,902,905,908],{"class":236,"line":898},35,[234,900,901],{"class":387},"  }, ",[234,903,904],{"class":251},"5000",[234,906,907],{"class":387},")  ",[234,909,910],{"class":240},"\u002F\u002F 5s pro proxy notar e parar de mandar tráfego novo\n",[234,912,914],{"class":236,"line":913},36,[234,915,726],{"class":387},[368,917,919],{"id":918},"python-django-gunicorn","Python (Django + gunicorn)",[224,921,925],{"className":922,"code":923,"language":924,"meta":229,"style":229},"language-python shiki shiki-themes github-dark-default","# health\u002Fviews.py\nfrom django.db import connection\nfrom django.http import JsonResponse, HttpResponse\nimport redis, os\n\n_r = redis.from_url(os.environ[\"REDIS_URL\"])\n\ndef healthz(request):\n    try:\n        with connection.cursor() as c:\n            c.execute(\"SELECT 1\")\n        _r.ping()\n        return HttpResponse(\"ok\", status=200)\n    except Exception as e:\n        return HttpResponse(f\"unhealthy: {e}\", status=503)\n","python",[231,926,927,932,937,942,947,951,956,960,965,970,975,980,985,990,995],{"__ignoreMap":229},[234,928,929],{"class":236,"line":237},[234,930,931],{},"# health\u002Fviews.py\n",[234,933,934],{"class":236,"line":244},[234,935,936],{},"from django.db import connection\n",[234,938,939],{"class":236,"line":271},[234,940,941],{},"from django.http import JsonResponse, HttpResponse\n",[234,943,944],{"class":236,"line":415},[234,945,946],{},"import redis, os\n",[234,948,949],{"class":236,"line":434},[234,950,412],{"emptyLinePlaceholder":411},[234,952,953],{"class":236,"line":459},[234,954,955],{},"_r = redis.from_url(os.environ[\"REDIS_URL\"])\n",[234,957,958],{"class":236,"line":464},[234,959,412],{"emptyLinePlaceholder":411},[234,961,962],{"class":236,"line":479},[234,963,964],{},"def healthz(request):\n",[234,966,967],{"class":236,"line":484},[234,968,969],{},"  
  try:\n",[234,971,972],{"class":236,"line":490},[234,973,974],{},"        with connection.cursor() as c:\n",[234,976,977],{"class":236,"line":508},[234,978,979],{},"            c.execute(\"SELECT 1\")\n",[234,981,982],{"class":236,"line":529},[234,983,984],{},"        _r.ping()\n",[234,986,987],{"class":236,"line":535},[234,988,989],{},"        return HttpResponse(\"ok\", status=200)\n",[234,991,992],{"class":236,"line":546},[234,993,994],{},"    except Exception as e:\n",[234,996,997],{"class":236,"line":552},[234,998,999],{},"        return HttpResponse(f\"unhealthy: {e}\", status=503)\n",[368,1001,1003],{"id":1002},"ruby-rails","Ruby (Rails)",[224,1005,1009],{"className":1006,"code":1007,"language":1008,"meta":229,"style":229},"language-ruby shiki shiki-themes github-dark-default","# config\u002Froutes.rb\nget \"\u002Fhealthz\", to: \"health#show\"\n\n# app\u002Fcontrollers\u002Fhealth_controller.rb\nclass HealthController \u003C ApplicationController\n  def show\n    ActiveRecord::Base.connection.execute(\"SELECT 1\")\n    Rails.cache.read(\"__healthcheck__\")\n    head :ok\n  rescue => e\n    Rails.logger.warn(\"healthcheck failed: #{e.message}\")\n    head :service_unavailable\n  end\nend\n","ruby",[231,1010,1011,1016,1021,1025,1030,1035,1040,1045,1050,1055,1060,1065,1070,1075],{"__ignoreMap":229},[234,1012,1013],{"class":236,"line":237},[234,1014,1015],{},"# config\u002Froutes.rb\n",[234,1017,1018],{"class":236,"line":244},[234,1019,1020],{},"get \"\u002Fhealthz\", to: \"health#show\"\n",[234,1022,1023],{"class":236,"line":271},[234,1024,412],{"emptyLinePlaceholder":411},[234,1026,1027],{"class":236,"line":415},[234,1028,1029],{},"# app\u002Fcontrollers\u002Fhealth_controller.rb\n",[234,1031,1032],{"class":236,"line":434},[234,1033,1034],{},"class HealthController \u003C ApplicationController\n",[234,1036,1037],{"class":236,"line":459},[234,1038,1039],{},"  def show\n",[234,1041,1042],{"class":236,"line":464},[234,1043,1044],{},"    
ActiveRecord::Base.connection.execute(\"SELECT 1\")\n",[234,1046,1047],{"class":236,"line":479},[234,1048,1049],{},"    Rails.cache.read(\"__healthcheck__\")\n",[234,1051,1052],{"class":236,"line":484},[234,1053,1054],{},"    head :ok\n",[234,1056,1057],{"class":236,"line":490},[234,1058,1059],{},"  rescue => e\n",[234,1061,1062],{"class":236,"line":508},[234,1063,1064],{},"    Rails.logger.warn(\"healthcheck failed: #{e.message}\")\n",[234,1066,1067],{"class":236,"line":529},[234,1068,1069],{},"    head :service_unavailable\n",[234,1071,1072],{"class":236,"line":535},[234,1073,1074],{},"  end\n",[234,1076,1077],{"class":236,"line":546},[234,1078,1079],{},"end\n",[12,1081,1082,1083,1086,1087,1090,1091,1093],{},"The detail that separates an amateur from a professional health check is ",[27,1084,1085],{},"graceful shutdown",": on receiving ",[231,1088,1089],{},"SIGTERM",", the app starts returning 503 on ",[231,1092,355],{}," immediately, but keeps accepting in-flight connections for a few more seconds. The proxy notices the 503, stops sending new traffic, and when the app finally closes there's nobody waiting for a response.",[12,1095,1096],{},"Without this, the cutover always leaks some errors even with everything else right.",[19,1098,1100],{"id":1099},"step-3-bring-up-two-docker-instances-15-min","Step 3 — Bring up two Docker instances (15 min)",[12,1102,1103],{},"Build your app into a Docker image. 
For the tutorial I'll use a generic image you replace:",[224,1105,1107],{"className":226,"code":1106,"language":228,"meta":229,"style":229},"# no seu laptop, push pra registry (Docker Hub, ECR, GHCR)\ndocker build -t meuusuario\u002Fmyapp:v1 .\ndocker push meuusuario\u002Fmyapp:v1\n",[231,1108,1109,1114,1130],{"__ignoreMap":229},[234,1110,1111],{"class":236,"line":237},[234,1112,1113],{"class":240},"# no seu laptop, push pra registry (Docker Hub, ECR, GHCR)\n",[234,1115,1116,1119,1122,1124,1127],{"class":236,"line":244},[234,1117,1118],{"class":247},"docker",[234,1120,1121],{"class":255}," build",[234,1123,252],{"class":251},[234,1125,1126],{"class":255}," meuusuario\u002Fmyapp:v1",[234,1128,1129],{"class":255}," .\n",[234,1131,1132,1134,1137],{"class":236,"line":271},[234,1133,1118],{"class":247},[234,1135,1136],{"class":255}," push",[234,1138,1139],{"class":255}," meuusuario\u002Fmyapp:v1\n",[12,1141,1142],{},"Bring up the instance on VPS A:",[224,1144,1146],{"className":226,"code":1145,"language":228,"meta":229,"style":229},"ssh root@203.0.113.10 \"\n  docker pull meuusuario\u002Fmyapp:v1 &&\n  docker run -d --name app --restart=unless-stopped \\\n    -p 8080:8080 \\\n    -e DATABASE_URL='postgres:\u002F\u002Fuser:pass@db.example.com:5432\u002Fapp' \\\n    --health-cmd='curl -f http:\u002F\u002Flocalhost:8080\u002Fhealthz || exit 1' \\\n    --health-interval=5s --health-timeout=2s --health-retries=3 \\\n    meuusuario\u002Fmyapp:v1\n\"\n",[231,1147,1148,1157,1162,1170,1177,1184,1191,1198,1203],{"__ignoreMap":229},[234,1149,1150,1152,1154],{"class":236,"line":237},[234,1151,298],{"class":247},[234,1153,301],{"class":255},[234,1155,1156],{"class":255}," \"\n",[234,1158,1159],{"class":236,"line":244},[234,1160,1161],{"class":255},"  docker pull meuusuario\u002Fmyapp:v1 &&\n",[234,1163,1164,1167],{"class":236,"line":271},[234,1165,1166],{"class":255},"  docker run -d --name app --restart=unless-stopped 
",[234,1168,1169],{"class":383},"\\\n",[234,1171,1172,1175],{"class":236,"line":415},[234,1173,1174],{"class":255},"    -p 8080:8080 ",[234,1176,1169],{"class":383},[234,1178,1179,1182],{"class":236,"line":434},[234,1180,1181],{"class":255},"    -e DATABASE_URL='postgres:\u002F\u002Fuser:pass@db.example.com:5432\u002Fapp' ",[234,1183,1169],{"class":383},[234,1185,1186,1189],{"class":236,"line":459},[234,1187,1188],{"class":255},"    --health-cmd='curl -f http:\u002F\u002Flocalhost:8080\u002Fhealthz || exit 1' ",[234,1190,1169],{"class":383},[234,1192,1193,1196],{"class":236,"line":464},[234,1194,1195],{"class":255},"    --health-interval=5s --health-timeout=2s --health-retries=3 ",[234,1197,1169],{"class":383},[234,1199,1200],{"class":236,"line":479},[234,1201,1202],{"class":255},"    meuusuario\u002Fmyapp:v1\n",[234,1204,1205],{"class":236,"line":484},[234,1206,1207],{"class":255},"\"\n",[12,1209,1210],{},"Repeat for VPS B swapping the IP. Validate:",[224,1212,1214],{"className":226,"code":1213,"language":228,"meta":229,"style":229},"curl http:\u002F\u002F203.0.113.10:8080\u002Fhealthz   # deve retornar \"ok\"\ncurl http:\u002F\u002F203.0.113.20:8080\u002Fhealthz   # deve retornar \"ok\"\n",[231,1215,1216,1227],{"__ignoreMap":229},[234,1217,1218,1221,1224],{"class":236,"line":237},[234,1219,1220],{"class":247},"curl",[234,1222,1223],{"class":255}," http:\u002F\u002F203.0.113.10:8080\u002Fhealthz",[234,1225,1226],{"class":240},"   # deve retornar \"ok\"\n",[234,1228,1229,1231,1234],{"class":236,"line":244},[234,1230,1220],{"class":247},[234,1232,1233],{"class":255}," http:\u002F\u002F203.0.113.20:8080\u002Fhealthz",[234,1235,1226],{"class":240},[12,1237,1238],{},"If both return 200, the base is ready.",[19,1240,1242],{"id":1241},"step-4-caddy-as-reverse-proxy-load-balancer-30-min","Step 4 — Caddy as reverse proxy + load balancer (30 min)",[12,1244,1245],{},"Caddy is easier to start with than nginx because of built-in automatic TLS — Let's Encrypt works out of the 
box, no external bot to configure. nginx is more flexible and has a larger ecosystem; Caddy is simpler for this case. For the tutorial I'll use Caddy.",[12,1247,1248],{},"I'll run Caddy on VPS A, sharing the machine with one of the app instances. If you prefer a dedicated third VPS, swap the IP where relevant.",[12,1250,1251],{},"First, open ports 80 and 443 on VPS A:",[224,1253,1255],{"className":226,"code":1254,"language":228,"meta":229,"style":229},"ssh root@203.0.113.10 \"ufw allow 80 && ufw allow 443\"\n",[231,1256,1257],{"__ignoreMap":229},[234,1258,1259,1261,1263],{"class":236,"line":237},[234,1260,298],{"class":247},[234,1262,301],{"class":255},[234,1264,1265],{"class":255}," \"ufw allow 80 && ufw allow 443\"\n",[12,1267,1268,1269,1272],{},"Create the ",[231,1270,1271],{},"Caddyfile",":",[224,1274,1278],{"className":1275,"code":1276,"language":1277,"meta":229,"style":229},"language-caddyfile shiki shiki-themes github-dark-default","meudominio.com {\n    reverse_proxy 203.0.113.10:8080 203.0.113.20:8080 {\n        lb_policy round_robin\n        health_uri \u002Fhealthz\n        health_interval 5s\n        health_timeout 2s\n        health_status 200\n\n        fail_duration 30s\n        max_fails 2\n        unhealthy_status 5xx\n\n        transport http {\n            dial_timeout 2s\n        }\n    }\n}\n","caddyfile",[231,1279,1280,1285,1290,1295,1300,1305,1310,1315,1319,1324,1329,1334,1338,1343,1348,1353,1358],{"__ignoreMap":229},[234,1281,1282],{"class":236,"line":237},[234,1283,1284],{},"meudominio.com {\n",[234,1286,1287],{"class":236,"line":244},[234,1288,1289],{},"    reverse_proxy 203.0.113.10:8080 203.0.113.20:8080 {\n",[234,1291,1292],{"class":236,"line":271},[234,1293,1294],{},"        lb_policy round_robin\n",[234,1296,1297],{"class":236,"line":415},[234,1298,1299],{},"        health_uri \u002Fhealthz\n",[234,1301,1302],{"class":236,"line":434},[234,1303,1304],{},"        health_interval 
5s\n",[234,1306,1307],{"class":236,"line":459},[234,1308,1309],{},"        health_timeout 2s\n",[234,1311,1312],{"class":236,"line":464},[234,1313,1314],{},"        health_status 200\n",[234,1316,1317],{"class":236,"line":479},[234,1318,412],{"emptyLinePlaceholder":411},[234,1320,1321],{"class":236,"line":484},[234,1322,1323],{},"        fail_duration 30s\n",[234,1325,1326],{"class":236,"line":490},[234,1327,1328],{},"        max_fails 2\n",[234,1330,1331],{"class":236,"line":508},[234,1332,1333],{},"        unhealthy_status 5xx\n",[234,1335,1336],{"class":236,"line":529},[234,1337,412],{"emptyLinePlaceholder":411},[234,1339,1340],{"class":236,"line":535},[234,1341,1342],{},"        transport http {\n",[234,1344,1345],{"class":236,"line":546},[234,1346,1347],{},"            dial_timeout 2s\n",[234,1349,1350],{"class":236,"line":552},[234,1351,1352],{},"        }\n",[234,1354,1355],{"class":236,"line":557},[234,1356,1357],{},"    }\n",[234,1359,1360],{"class":236,"line":594},[234,1361,1362],{},"}\n",[12,1364,1365,1366,1368],{},"Seventeen lines. 
Everything that matters is there: round-robin between the two IPs, active health check every five seconds on ",[231,1367,355],{},", marks as unhealthy after two consecutive failures in 30s, two-second timeout to open a connection.",[12,1370,1371],{},"Bring up Caddy:",[224,1373,1375],{"className":226,"code":1374,"language":228,"meta":229,"style":229},"ssh root@203.0.113.10 \"\n  mkdir -p \u002Fetc\u002Fcaddy &&\n  docker run -d --name caddy --restart=unless-stopped \\\n    --network host \\\n    -v \u002Fetc\u002Fcaddy\u002FCaddyfile:\u002Fetc\u002Fcaddy\u002FCaddyfile \\\n    -v caddy_data:\u002Fdata \\\n    -v caddy_config:\u002Fconfig \\\n    caddy:2-alpine\n\"\n",[231,1376,1377,1385,1390,1397,1404,1411,1418,1425,1430],{"__ignoreMap":229},[234,1378,1379,1381,1383],{"class":236,"line":237},[234,1380,298],{"class":247},[234,1382,301],{"class":255},[234,1384,1156],{"class":255},[234,1386,1387],{"class":236,"line":244},[234,1388,1389],{"class":255},"  mkdir -p \u002Fetc\u002Fcaddy &&\n",[234,1391,1392,1395],{"class":236,"line":271},[234,1393,1394],{"class":255},"  docker run -d --name caddy --restart=unless-stopped ",[234,1396,1169],{"class":383},[234,1398,1399,1402],{"class":236,"line":415},[234,1400,1401],{"class":255},"    --network host ",[234,1403,1169],{"class":383},[234,1405,1406,1409],{"class":236,"line":434},[234,1407,1408],{"class":255},"    -v \u002Fetc\u002Fcaddy\u002FCaddyfile:\u002Fetc\u002Fcaddy\u002FCaddyfile ",[234,1410,1169],{"class":383},[234,1412,1413,1416],{"class":236,"line":459},[234,1414,1415],{"class":255},"    -v caddy_data:\u002Fdata ",[234,1417,1169],{"class":383},[234,1419,1420,1423],{"class":236,"line":464},[234,1421,1422],{"class":255},"    -v caddy_config:\u002Fconfig ",[234,1424,1169],{"class":383},[234,1426,1427],{"class":236,"line":479},[234,1428,1429],{"class":255},"    caddy:2-alpine\n",[234,1431,1432],{"class":236,"line":484},[234,1433,1207],{"class":255},[12,1435,1436,1437,1439],{},"Point your domain's DNS A to 
",[231,1438,280],{},". In a few minutes:",[224,1441,1443],{"className":226,"code":1442,"language":228,"meta":229,"style":229},"curl https:\u002F\u002Fmeudominio.com\u002F\n# should return \"Hello v1\" (alternating between the two instances)\n",[231,1444,1445,1452],{"__ignoreMap":229},[234,1446,1447,1449],{"class":236,"line":237},[234,1448,1220],{"class":247},[234,1450,1451],{"class":255}," https:\u002F\u002Fmeudominio.com\u002F\n",[234,1453,1454],{"class":236,"line":244},[234,1455,1456],{"class":240},"# should return \"Hello v1\" (alternating between the two instances)\n",[12,1458,1459],{},"Caddy issued a Let's Encrypt certificate automatically. This works because the domain resolves to the IP where Caddy is listening on port 80 (HTTP-01 challenge).",[19,1461,1463],{"id":1462},"step-5-bash-deploy-script-60-min","Step 5 — Bash deploy script (60 min)",[12,1465,1466],{},"This is the heart of the tutorial. A script that orchestrates a rolling update across the two VPS:",[224,1468,1470],{"className":226,"code":1469,"language":228,"meta":229,"style":229},"#!\u002Fusr\u002Fbin\u002Fenv bash\n# deploy.sh — rolling deploy zero-downtime entre duas VPS\nset -euo pipefail\n\nIMAGE=\"${1:?Uso: .\u002Fdeploy.sh meuusuario\u002Fmyapp:v2}\"\nHOSTS=(\"203.0.113.10\" \"203.0.113.20\")\nHEALTH_DEADLINE=300   # max segundos esperando health check\nMIN_HEALTHY_TIME=10   # segundos saudável sustentado antes de prosseguir\nSSH_OPTS=\"-o StrictHostKeyChecking=no -o ConnectTimeout=5\"\n\ndeploy_host() {\n  local host=$1\n  local image=$2\n  echo \"==> [${host}] pulling ${image}\"\n  ssh ${SSH_OPTS} \"root@${host}\" \"docker pull ${image}\"\n\n  # guarda imagem antiga pro caso de rollback\n  local old_image\n  old_image=$(ssh ${SSH_OPTS} \"root@${host}\" \"docker inspect app --format '{{.Config.Image}}' 2>\u002Fdev\u002Fnull || echo none\")\n  echo \"==> [${host}] versão atual: ${old_image}\"\n\n  echo \"==> [${host}] substituindo contêiner\"\n  ssh ${SSH_OPTS} \"root@${host}\" \"\n    docker 
stop app 2>\u002Fdev\u002Fnull || true\n    docker rm app 2>\u002Fdev\u002Fnull || true\n    docker run -d --name app --restart=unless-stopped \\\n      -p 8080:8080 \\\n      -e DATABASE_URL='${DATABASE_URL}' \\\n      --health-cmd='curl -f http:\u002F\u002Flocalhost:8080\u002Fhealthz || exit 1' \\\n      --health-interval=5s --health-timeout=2s --health-retries=3 \\\n      ${image}\n  \"\n\n  echo \"==> [${host}] esperando health check (max ${HEALTH_DEADLINE}s)\"\n  local start=$(date +%s)\n  local healthy_since=0\n  while true; do\n    local now=$(date +%s)\n    if (( now - start > HEALTH_DEADLINE )); then\n      echo \"!!  [${host}] healthy_deadline excedido — fazendo rollback pra ${old_image}\"\n      ssh ${SSH_OPTS} \"root@${host}\" \"\n        docker stop app && docker rm app &&\n        docker run -d --name app --restart=unless-stopped \\\n          -p 8080:8080 -e DATABASE_URL='${DATABASE_URL}' \\\n          ${old_image}\n      \"\n      return 1\n    fi\n\n    if curl -sf --max-time 2 \"http:\u002F\u002F${host}:8080\u002Fhealthz\" > \u002Fdev\u002Fnull; then\n      if (( healthy_since == 0 )); then\n        healthy_since=${now}\n        echo \"    [${host}] saudável — confirmando por ${MIN_HEALTHY_TIME}s\"\n      elif (( now - healthy_since >= MIN_HEALTHY_TIME )); then\n        echo \"==> [${host}] saudável sustentado — promovendo\"\n        return 0\n      fi\n    else\n      healthy_since=0\n    fi\n    sleep 2\n  done\n}\n\necho \"### Deploy ${IMAGE} em ${#HOSTS[@]} hosts (rolling, max_parallel=1)\"\nfor host in \"${HOSTS[@]}\"; do\n  if ! deploy_host \"${host}\" \"${IMAGE}\"; then\n    echo \"### Deploy abortado em ${host}. 
Hosts anteriores mantidos como estavam.\"\n    exit 1\n  fi\ndone\necho \"### Deploy completo: todos os hosts em ${IMAGE}\"\n",[231,1471,1472,1477,1482,1493,1497,1550,1567,1580,1593,1603,1607,1615,1628,1640,1660,1683,1687,1692,1699,1724,1740,1744,1755,1769,1774,1779,1786,1793,1805,1812,1819,1828,1833,1837,1853,1872,1884,1899,1918,1942,1960,1976,1982,1990,2002,2012,2018,2027,2033,2038,2073,2093,2104,2123,2144,2156,2165,2171,2177,2187,2192,2201,2207,2212,2217,2245,2273,2300,2314,2322,2328,2334],{"__ignoreMap":229},[234,1473,1474],{"class":236,"line":237},[234,1475,1476],{"class":240},"#!\u002Fusr\u002Fbin\u002Fenv bash\n",[234,1478,1479],{"class":236,"line":244},[234,1480,1481],{"class":240},"# deploy.sh — rolling deploy zero-downtime entre duas VPS\n",[234,1483,1484,1487,1490],{"class":236,"line":271},[234,1485,1486],{"class":251},"set",[234,1488,1489],{"class":251}," -euo",[234,1491,1492],{"class":255}," pipefail\n",[234,1494,1495],{"class":236,"line":415},[234,1496,412],{"emptyLinePlaceholder":411},[234,1498,1499,1502,1504,1507,1510,1513,1516,1518,1521,1524,1527,1529,1532,1535,1537,1540,1542,1545,1548],{"class":236,"line":434},[234,1500,1501],{"class":387},"IMAGE",[234,1503,473],{"class":383},[234,1505,1506],{"class":255},"\"",[234,1508,1509],{"class":251},"${1",[234,1511,1512],{"class":383},":?",[234,1514,1515],{"class":387},"Uso",[234,1517,1272],{"class":383},[234,1519,1520],{"class":255}," .",[234,1522,1523],{"class":383},"\u002F",[234,1525,1526],{"class":387},"deploy",[234,1528,101],{"class":255},[234,1530,1531],{"class":387},"sh",[234,1533,1534],{"class":387}," 
meuusuario",[234,1536,1523],{"class":383},[234,1538,1539],{"class":387},"myapp",[234,1541,1272],{"class":383},[234,1543,1544],{"class":387},"v2",[234,1546,1547],{"class":251},"}",[234,1549,1207],{"class":255},[234,1551,1552,1555,1557,1559,1562,1565],{"class":236,"line":459},[234,1553,1554],{"class":387},"HOSTS",[234,1556,473],{"class":383},[234,1558,520],{"class":387},[234,1560,1561],{"class":255},"\"203.0.113.10\"",[234,1563,1564],{"class":255}," \"203.0.113.20\"",[234,1566,526],{"class":387},[234,1568,1569,1572,1574,1577],{"class":236,"line":464},[234,1570,1571],{"class":387},"HEALTH_DEADLINE",[234,1573,473],{"class":383},[234,1575,1576],{"class":255},"300",[234,1578,1579],{"class":240},"   # max segundos esperando health check\n",[234,1581,1582,1585,1587,1590],{"class":236,"line":479},[234,1583,1584],{"class":387},"MIN_HEALTHY_TIME",[234,1586,473],{"class":383},[234,1588,1589],{"class":255},"10",[234,1591,1592],{"class":240},"   # segundos saudável sustentado antes de prosseguir\n",[234,1594,1595,1598,1600],{"class":236,"line":484},[234,1596,1597],{"class":387},"SSH_OPTS",[234,1599,473],{"class":383},[234,1601,1602],{"class":255},"\"-o StrictHostKeyChecking=no -o ConnectTimeout=5\"\n",[234,1604,1605],{"class":236,"line":490},[234,1606,412],{"emptyLinePlaceholder":411},[234,1608,1609,1612],{"class":236,"line":508},[234,1610,1611],{"class":427},"deploy_host",[234,1613,1614],{"class":387},"() {\n",[234,1616,1617,1620,1623,1625],{"class":236,"line":529},[234,1618,1619],{"class":383},"  local",[234,1621,1622],{"class":387}," host",[234,1624,473],{"class":383},[234,1626,1627],{"class":247},"$1\n",[234,1629,1630,1632,1635,1637],{"class":236,"line":535},[234,1631,1619],{"class":383},[234,1633,1634],{"class":387}," image",[234,1636,473],{"class":383},[234,1638,1639],{"class":247},"$2\n",[234,1641,1642,1645,1648,1651,1654,1657],{"class":236,"line":546},[234,1643,1644],{"class":251},"  echo",[234,1646,1647],{"class":255}," \"==> 
[${",[234,1649,1650],{"class":387},"host",[234,1652,1653],{"class":255},"}] pulling ${",[234,1655,1656],{"class":387},"image",[234,1658,1659],{"class":255},"}\"\n",[234,1661,1662,1665,1668,1671,1673,1676,1679,1681],{"class":236,"line":552},[234,1663,1664],{"class":247},"  ssh",[234,1666,1667],{"class":387}," ${SSH_OPTS} ",[234,1669,1670],{"class":255},"\"root@${",[234,1672,1650],{"class":387},[234,1674,1675],{"class":255},"}\"",[234,1677,1678],{"class":255}," \"docker pull ${",[234,1680,1656],{"class":387},[234,1682,1659],{"class":255},[234,1684,1685],{"class":236,"line":557},[234,1686,412],{"emptyLinePlaceholder":411},[234,1688,1689],{"class":236,"line":594},[234,1690,1691],{"class":240},"  # guarda imagem antiga pro caso de rollback\n",[234,1693,1694,1696],{"class":236,"line":635},[234,1695,1619],{"class":383},[234,1697,1698],{"class":387}," old_image\n",[234,1700,1701,1704,1706,1709,1711,1713,1715,1717,1719,1722],{"class":236,"line":643},[234,1702,1703],{"class":387},"  old_image",[234,1705,473],{"class":383},[234,1707,1708],{"class":387},"$(",[234,1710,298],{"class":247},[234,1712,1667],{"class":387},[234,1714,1670],{"class":255},[234,1716,1650],{"class":387},[234,1718,1675],{"class":255},[234,1720,1721],{"class":255}," \"docker inspect app --format '{{.Config.Image}}' 2>\u002Fdev\u002Fnull || echo none\"",[234,1723,526],{"class":387},[234,1725,1726,1728,1730,1732,1735,1738],{"class":236,"line":659},[234,1727,1644],{"class":251},[234,1729,1647],{"class":255},[234,1731,1650],{"class":387},[234,1733,1734],{"class":255},"}] versão atual: ${",[234,1736,1737],{"class":387},"old_image",[234,1739,1659],{"class":255},[234,1741,1742],{"class":236,"line":683},[234,1743,412],{"emptyLinePlaceholder":411},[234,1745,1746,1748,1750,1752],{"class":236,"line":695},[234,1747,1644],{"class":251},[234,1749,1647],{"class":255},[234,1751,1650],{"class":387},[234,1753,1754],{"class":255},"}] substituindo 
contêiner\"\n",[234,1756,1757,1759,1761,1763,1765,1767],{"class":236,"line":717},[234,1758,1664],{"class":247},[234,1760,1667],{"class":387},[234,1762,1670],{"class":255},[234,1764,1650],{"class":387},[234,1766,1675],{"class":255},[234,1768,1156],{"class":255},[234,1770,1771],{"class":236,"line":723},[234,1772,1773],{"class":255},"    docker stop app 2>\u002Fdev\u002Fnull || true\n",[234,1775,1776],{"class":236,"line":729},[234,1777,1778],{"class":255},"    docker rm app 2>\u002Fdev\u002Fnull || true\n",[234,1780,1781,1784],{"class":236,"line":734},[234,1782,1783],{"class":255},"    docker run -d --name app --restart=unless-stopped ",[234,1785,1169],{"class":383},[234,1787,1788,1791],{"class":236,"line":771},[234,1789,1790],{"class":255},"      -p 8080:8080 ",[234,1792,1169],{"class":383},[234,1794,1795,1798,1800,1803],{"class":236,"line":776},[234,1796,1797],{"class":255},"      -e DATABASE_URL='${",[234,1799,453],{"class":387},[234,1801,1802],{"class":255},"}' ",[234,1804,1169],{"class":383},[234,1806,1807,1810],{"class":236,"line":815},[234,1808,1809],{"class":255},"      --health-cmd='curl -f http:\u002F\u002Flocalhost:8080\u002Fhealthz || exit 1' ",[234,1811,1169],{"class":383},[234,1813,1814,1817],{"class":236,"line":820},[234,1815,1816],{"class":255},"      --health-interval=5s --health-timeout=2s --health-retries=3 ",[234,1818,1169],{"class":383},[234,1820,1821,1824,1826],{"class":236,"line":826},[234,1822,1823],{"class":255},"      ${",[234,1825,1656],{"class":387},[234,1827,1362],{"class":255},[234,1829,1830],{"class":236,"line":846},[234,1831,1832],{"class":255},"  \"\n",[234,1834,1835],{"class":236,"line":859},[234,1836,412],{"emptyLinePlaceholder":411},[234,1838,1839,1841,1843,1845,1848,1850],{"class":236,"line":872},[234,1840,1644],{"class":251},[234,1842,1647],{"class":255},[234,1844,1650],{"class":387},[234,1846,1847],{"class":255},"}] esperando health check (max 
${",[234,1849,1571],{"class":387},[234,1851,1852],{"class":255},"}s)\"\n",[234,1854,1855,1857,1860,1862,1864,1867,1870],{"class":236,"line":898},[234,1856,1619],{"class":383},[234,1858,1859],{"class":387}," start",[234,1861,473],{"class":383},[234,1863,1708],{"class":387},[234,1865,1866],{"class":247},"date",[234,1868,1869],{"class":255}," +%s",[234,1871,526],{"class":387},[234,1873,1874,1876,1879,1881],{"class":236,"line":913},[234,1875,1619],{"class":383},[234,1877,1878],{"class":387}," healthy_since",[234,1880,473],{"class":383},[234,1882,1883],{"class":251},"0\n",[234,1885,1887,1890,1893,1896],{"class":236,"line":1886},37,[234,1888,1889],{"class":383},"  while",[234,1891,1892],{"class":251}," true",[234,1894,1895],{"class":387},"; ",[234,1897,1898],{"class":383},"do\n",[234,1900,1902,1905,1908,1910,1912,1914,1916],{"class":236,"line":1901},38,[234,1903,1904],{"class":383},"    local",[234,1906,1907],{"class":387}," now",[234,1909,473],{"class":383},[234,1911,1708],{"class":387},[234,1913,1866],{"class":247},[234,1915,1869],{"class":255},[234,1917,526],{"class":387},[234,1919,1921,1924,1927,1930,1933,1936,1939],{"class":236,"line":1920},39,[234,1922,1923],{"class":383},"    if",[234,1925,1926],{"class":387}," (( now ",[234,1928,1929],{"class":383},"-",[234,1931,1932],{"class":387}," start ",[234,1934,1935],{"class":383},">",[234,1937,1938],{"class":387}," HEALTH_DEADLINE )); ",[234,1940,1941],{"class":383},"then\n",[234,1943,1945,1948,1951,1953,1956,1958],{"class":236,"line":1944},40,[234,1946,1947],{"class":251},"      echo",[234,1949,1950],{"class":255}," \"!!  
[${",[234,1952,1650],{"class":387},[234,1954,1955],{"class":255},"}] healthy_deadline excedido — fazendo rollback pra ${",[234,1957,1737],{"class":387},[234,1959,1659],{"class":255},[234,1961,1963,1966,1968,1970,1972,1974],{"class":236,"line":1962},41,[234,1964,1965],{"class":247},"      ssh",[234,1967,1667],{"class":387},[234,1969,1670],{"class":255},[234,1971,1650],{"class":387},[234,1973,1675],{"class":255},[234,1975,1156],{"class":255},[234,1977,1979],{"class":236,"line":1978},42,[234,1980,1981],{"class":255},"        docker stop app && docker rm app &&\n",[234,1983,1985,1988],{"class":236,"line":1984},43,[234,1986,1987],{"class":255},"        docker run -d --name app --restart=unless-stopped ",[234,1989,1169],{"class":383},[234,1991,1993,1996,1998,2000],{"class":236,"line":1992},44,[234,1994,1995],{"class":255},"          -p 8080:8080 -e DATABASE_URL='${",[234,1997,453],{"class":387},[234,1999,1802],{"class":255},[234,2001,1169],{"class":383},[234,2003,2005,2008,2010],{"class":236,"line":2004},45,[234,2006,2007],{"class":255},"          ${",[234,2009,1737],{"class":387},[234,2011,1362],{"class":255},[234,2013,2015],{"class":236,"line":2014},46,[234,2016,2017],{"class":255},"      \"\n",[234,2019,2021,2024],{"class":236,"line":2020},47,[234,2022,2023],{"class":383},"      return",[234,2025,2026],{"class":251}," 1\n",[234,2028,2030],{"class":236,"line":2029},48,[234,2031,2032],{"class":383},"    fi\n",[234,2034,2036],{"class":236,"line":2035},49,[234,2037,412],{"emptyLinePlaceholder":411},[234,2039,2041,2043,2046,2049,2052,2055,2058,2060,2063,2066,2069,2071],{"class":236,"line":2040},50,[234,2042,1923],{"class":383},[234,2044,2045],{"class":247}," curl",[234,2047,2048],{"class":251}," -sf",[234,2050,2051],{"class":251}," --max-time",[234,2053,2054],{"class":251}," 2",[234,2056,2057],{"class":255}," \"http:\u002F\u002F${",[234,2059,1650],{"class":387},[234,2061,2062],{"class":255},"}:8080\u002Fhealthz\"",[234,2064,2065],{"class":383}," 
>",[234,2067,2068],{"class":255}," \u002Fdev\u002Fnull",[234,2070,1895],{"class":387},[234,2072,1941],{"class":383},[234,2074,2076,2079,2082,2085,2088,2091],{"class":236,"line":2075},51,[234,2077,2078],{"class":383},"      if",[234,2080,2081],{"class":387}," (( healthy_since ",[234,2083,2084],{"class":383},"==",[234,2086,2087],{"class":251}," 0",[234,2089,2090],{"class":387}," )); ",[234,2092,1941],{"class":383},[234,2094,2096,2099,2101],{"class":236,"line":2095},52,[234,2097,2098],{"class":387},"        healthy_since",[234,2100,473],{"class":383},[234,2102,2103],{"class":387},"${now}\n",[234,2105,2107,2110,2113,2115,2118,2120],{"class":236,"line":2106},53,[234,2108,2109],{"class":251},"        echo",[234,2111,2112],{"class":255}," \"    [${",[234,2114,1650],{"class":387},[234,2116,2117],{"class":255},"}] saudável — confirmando por ${",[234,2119,1584],{"class":387},[234,2121,2122],{"class":255},"}s\"\n",[234,2124,2126,2129,2131,2133,2136,2139,2142],{"class":236,"line":2125},54,[234,2127,2128],{"class":383},"      elif",[234,2130,1926],{"class":387},[234,2132,1929],{"class":383},[234,2134,2135],{"class":387}," healthy_since ",[234,2137,2138],{"class":383},">=",[234,2140,2141],{"class":387}," MIN_HEALTHY_TIME )); ",[234,2143,1941],{"class":383},[234,2145,2147,2149,2151,2153],{"class":236,"line":2146},55,[234,2148,2109],{"class":251},[234,2150,1647],{"class":255},[234,2152,1650],{"class":387},[234,2154,2155],{"class":255},"}] saudável sustentado — promovendo\"\n",[234,2157,2159,2162],{"class":236,"line":2158},56,[234,2160,2161],{"class":383},"        return",[234,2163,2164],{"class":251}," 0\n",[234,2166,2168],{"class":236,"line":2167},57,[234,2169,2170],{"class":383},"      fi\n",[234,2172,2174],{"class":236,"line":2173},58,[234,2175,2176],{"class":383},"    else\n",[234,2178,2180,2183,2185],{"class":236,"line":2179},59,[234,2181,2182],{"class":387},"      
healthy_since",[234,2184,473],{"class":383},[234,2186,1883],{"class":255},[234,2188,2190],{"class":236,"line":2189},60,[234,2191,2032],{"class":383},[234,2193,2195,2198],{"class":236,"line":2194},61,[234,2196,2197],{"class":247},"    sleep",[234,2199,2200],{"class":251}," 2\n",[234,2202,2204],{"class":236,"line":2203},62,[234,2205,2206],{"class":383},"  done\n",[234,2208,2210],{"class":236,"line":2209},63,[234,2211,1362],{"class":387},[234,2213,2215],{"class":236,"line":2214},64,[234,2216,412],{"emptyLinePlaceholder":411},[234,2218,2220,2223,2226,2228,2231,2234,2236,2239,2242],{"class":236,"line":2219},65,[234,2221,2222],{"class":251},"echo",[234,2224,2225],{"class":255}," \"### Deploy ${",[234,2227,1501],{"class":387},[234,2229,2230],{"class":255},"} em ${",[234,2232,2233],{"class":383},"#",[234,2235,1554],{"class":387},[234,2237,2238],{"class":255},"[",[234,2240,2241],{"class":383},"@",[234,2243,2244],{"class":255},"]} hosts (rolling, max_parallel=1)\"\n",[234,2246,2248,2251,2254,2257,2260,2262,2264,2266,2269,2271],{"class":236,"line":2247},66,[234,2249,2250],{"class":383},"for",[234,2252,2253],{"class":387}," host ",[234,2255,2256],{"class":383},"in",[234,2258,2259],{"class":255}," \"${",[234,2261,1554],{"class":387},[234,2263,2238],{"class":255},[234,2265,2241],{"class":383},[234,2267,2268],{"class":255},"]}\"",[234,2270,1895],{"class":387},[234,2272,1898],{"class":383},[234,2274,2276,2278,2281,2284,2286,2288,2290,2292,2294,2296,2298],{"class":236,"line":2275},67,[234,2277,597],{"class":383},[234,2279,2280],{"class":383}," !",[234,2282,2283],{"class":247}," deploy_host",[234,2285,2259],{"class":255},[234,2287,1650],{"class":387},[234,2289,1675],{"class":255},[234,2291,2259],{"class":255},[234,2293,1501],{"class":387},[234,2295,1675],{"class":255},[234,2297,1895],{"class":387},[234,2299,1941],{"class":383},[234,2301,2303,2306,2309,2311],{"class":236,"line":2302},68,[234,2304,2305],{"class":251},"    echo",[234,2307,2308],{"class":255}," \"### Deploy abortado em 
${",[234,2310,1650],{"class":387},[234,2312,2313],{"class":255},"}. Hosts anteriores mantidos como estavam.\"\n",[234,2315,2317,2320],{"class":236,"line":2316},69,[234,2318,2319],{"class":251},"    exit",[234,2321,2026],{"class":251},[234,2323,2325],{"class":236,"line":2324},70,[234,2326,2327],{"class":383},"  fi\n",[234,2329,2331],{"class":236,"line":2330},71,[234,2332,2333],{"class":383},"done\n",[234,2335,2337,2339,2342,2344],{"class":236,"line":2336},72,[234,2338,2222],{"class":251},[234,2340,2341],{"class":255}," \"### Deploy completo: todos os hosts em ${",[234,2343,1501],{"class":387},[234,2345,1659],{"class":255},[12,2347,2348,2349,571,2352,2355],{},"Save as ",[231,2350,2351],{},"deploy.sh",[231,2353,2354],{},"chmod +x",", and:",[224,2357,2359],{"className":226,"code":2358,"language":228,"meta":229,"style":229},"export DATABASE_URL='postgres:\u002F\u002Fuser:pass@db.example.com:5432\u002Fapp'\n.\u002Fdeploy.sh meuusuario\u002Fmyapp:v2\n",[231,2360,2361,2374],{"__ignoreMap":229},[234,2362,2363,2366,2369,2371],{"class":236,"line":237},[234,2364,2365],{"class":383},"export",[234,2367,2368],{"class":387}," DATABASE_URL",[234,2370,473],{"class":383},[234,2372,2373],{"class":255},"'postgres:\u002F\u002Fuser:pass@db.example.com:5432\u002Fapp'\n",[234,2375,2376,2379],{"class":236,"line":244},[234,2377,2378],{"class":247},".\u002Fdeploy.sh",[234,2380,2381],{"class":255}," meuusuario\u002Fmyapp:v2\n",[12,2383,2384],{},"The algorithm is literally what large orchestrators do internally:",[67,2386,2387,2393,2407,2413,2418,2424,2437],{},[70,2388,2389,2392],{},[27,2390,2391],{},"For each host, sequentially"," (max_parallel = 1)",[70,2394,2395,2398,2399,2402,2403,2406],{},[27,2396,2397],{},"Pull the new image"," before touching the container — that way the downtime between ",[231,2400,2401],{},"docker stop"," and ",[231,2404,2405],{},"docker run"," is minimal",[70,2408,2409,2412],{},[27,2410,2411],{},"Save reference to the old image"," for rollback if something goes 
wrong",[70,2414,2415],{},[27,2416,2417],{},"Replace the container",[70,2419,2420,2423],{},[27,2421,2422],{},"Loop waiting for health check"," with a five-minute deadline",[70,2425,2426,2429,2430,2432,2433,2436],{},[27,2427,2428],{},"Min healthy time of ten seconds",": only advances when ",[231,2431,355],{}," returned 200 ",[179,2434,2435],{},"continuously"," for ten seconds (if it fails in between, the count restarts)",[70,2438,2439,2442],{},[27,2440,2441],{},"Automatic rollback"," if the deadline is exceeded",[12,2444,2445],{},"The numbers (max_parallel: 1, min_healthy_time: 10s, healthy_deadline: 300s) are exactly the defaults we use in HeroCtl. It's no coincidence — these are the values that survived years of trial and error. Too short a min healthy time mistakes a transient blip for \"healthy\" and breaks; too long makes the deploy slow with no gain. Ten seconds is the point where the noise disappears and the deploy still finishes quickly.",[19,2447,2449],{"id":2448},"step-6-validate-with-a-load-test-during-deploy-15-min","Step 6 — Validate with a load test during deploy (15 min)",[12,2451,2452],{},"This is the acid test: run sustained load and deploy at the same time. 
If any 5xx appears, some part of the scheme is broken.",[12,2454,2455],{},"On an external machine (your laptop or another VPS):",[224,2457,2459],{"className":226,"code":2458,"language":228,"meta":229,"style":229},"# install hey\ngo install github.com\u002Frakyll\u002Fhey@latest\n\n# sustained load for 60s, 5 concurrent connections\nhey -z 60s -c 5 https:\u002F\u002Fmeudominio.com\u002F\n",[231,2460,2461,2466,2477,2481,2486],{"__ignoreMap":229},[234,2462,2463],{"class":236,"line":237},[234,2464,2465],{"class":240},"# install hey\n",[234,2467,2468,2471,2474],{"class":236,"line":244},[234,2469,2470],{"class":247},"go",[234,2472,2473],{"class":255}," install",[234,2475,2476],{"class":255}," github.com\u002Frakyll\u002Fhey@latest\n",[234,2478,2479],{"class":236,"line":271},[234,2480,412],{"emptyLinePlaceholder":411},[234,2482,2483],{"class":236,"line":415},[234,2484,2485],{"class":240},"# sustained load for 60s, 5 concurrent connections\n",[234,2487,2488,2491,2494,2497,2500,2503],{"class":236,"line":434},[234,2489,2490],{"class":247},"hey",[234,2492,2493],{"class":251}," -z",[234,2495,2496],{"class":255}," 60s",[234,2498,2499],{"class":251}," -c",[234,2501,2502],{"class":251}," 5",[234,2504,1451],{"class":255},[12,2506,2507],{},"In another window, simultaneously:",[224,2509,2511],{"className":226,"code":2510,"language":228,"meta":229,"style":229},".\u002Fdeploy.sh meuusuario\u002Fmyapp:v2\n",[231,2512,2513],{"__ignoreMap":229},[234,2514,2515,2517],{"class":236,"line":237},[234,2516,2378],{"class":247},[234,2518,2381],{"class":255},[12,2520,2521,2522,1272],{},"At the end of ",[231,2523,2490],{},[224,2525,2530],{"className":2526,"code":2528,"language":2529},[2527],"language-text","Status code distribution:\n  [200] 1847 responses\n","text",[231,2531,2528],{"__ignoreMap":229},[12,2533,2534],{},"Only 200s. If a 502 or 503 shows up, one of the three pieces is weak: a health check that returns 200 too early, missing graceful shutdown, or too short a min healthy time. 
Investigate and fix.",[57,2536],{},[19,2538,2540],{"id":2539},"the-six-details-that-separate-real-zero-downtime-from-approximation","The six details that separate real zero-downtime from approximation",[12,2542,2543],{},"We covered most of these throughout the tutorial, but they're worth consolidating, because a single missing one turns the whole scheme into \"mostly zero-downtime\", which is different.",[67,2545,2546,2555,2572,2588,2594,2600],{},[70,2547,2548,2551,2552,2554],{},[27,2549,2550],{},"Connection draining on SIGTERM."," When the container receives the stop signal, the app marks ",[231,2553,355],{}," as failing immediately, but keeps serving in-flight connections for a few seconds. Without it, connections open at the moment of the stop get cut.",[70,2556,2557,2560,2561,2564,2565,2568,2569,101],{},[27,2558,2559],{},"Pre-stop hook if you have an async worker."," Queues that process background jobs need an explicit pause before the process is killed, or the running job is orphaned. In Sidekiq, it's ",[231,2562,2563],{},":quiet"," before ",[231,2566,2567],{},":term",". In Celery, it's ",[231,2570,2571],{},"--soft-time-limit",[70,2573,2574,2577,2578,2581,2582,2584,2585,2587],{},[27,2575,2576],{},"Health check BEFORE promoting, not \"container running\"."," ",[231,2579,2580],{},"docker ps"," shows \"running\" milliseconds after ",[231,2583,2405],{},". It means nothing. Promote only after ",[231,2586,355],{}," returns 200 continuously.",[70,2589,2590,2593],{},[27,2591,2592],{},"Min healthy time of ten sustained seconds."," Don't trust a single 200 and move on — apps with irregular warm-up pass for a moment and fail again.",[70,2595,2596,2599],{},[27,2597,2598],{},"Previous version pre-pulled for fast rollback."," If you rely on the old image still being in Docker's cache, at some point it's cleared by garbage collection and rollback gets slow. 
Keep the last three images explicitly.",[70,2601,2602,2605],{},[27,2603,2604],{},"Auto-revert when the healthy deadline is exceeded."," Without it, the deploy gets stuck in a partial state — half the hosts on v2, half on v1, with nobody to decide what to do.",[19,2607,2609],{"id":2608},"database-migrations-zero-downtime-the-part-that-breaks-experienced-peoples-deploys","Database migrations + zero-downtime (the part that breaks experienced people's deploys)",[12,2611,2612,2613,2616],{},"This is the topic I see senior developers get wrong most often. Rolling deploy assumes that ",[27,2614,2615],{},"both versions of the app run simultaneously in production for some period",". If v2 expects a schema incompatible with what v1 understands, one of the two breaks during the transition window.",[12,2618,2619,2620,101],{},"Non-negotiable golden rule: ",[27,2621,2622],{},"migrations are always backward-compatible",[12,2624,2625,2626,2629,2630,2633,2634,2636],{},"Classic case: you want to rename column ",[231,2627,2628],{},"email"," to ",[231,2631,2632],{},"email_address",". Wrong solution: do the migration that renames directly before the deploy. Result: during the rolling, v1 instances still write to ",[231,2635,2628],{}," (which no longer exists) and break. Right solution, in three deploys:",[119,2638,2639,2652],{},[122,2640,2641],{},[125,2642,2643,2646,2649],{},[128,2644,2645],{},"Deploy",[128,2647,2648],{},"Migration",[128,2650,2651],{},"Code v*",[141,2653,2654,2676,2694],{},[125,2655,2656,2659,2665],{},[146,2657,2658],{},"1",[146,2660,2661,2662,2664],{},"Add ",[231,2663,2632],{}," (nullable). No removal.",[146,2666,2667,2668,2670,2671,2673,2674,101],{},"App writes to ",[231,2669,2628],{}," AND to ",[231,2672,2632],{},"; reads from ",[231,2675,2628],{},[125,2677,2678,2681,2688],{},[146,2679,2680],{},"2",[146,2682,2683,2684,2687],{},"Backfill: ",[231,2685,2686],{},"UPDATE users SET email_address = email WHERE email_address IS NULL",". 
NOT NULL constraint.",[146,2689,2690,2691,2693],{},"App reads from ",[231,2692,2632],{},"; still writes to both.",[125,2695,2696,2699,2704],{},[146,2697,2698],{},"3",[146,2700,2701,2702,101],{},"Drop ",[231,2703,2628],{},[146,2705,2706,2707,101],{},"App only uses ",[231,2708,2632],{},[12,2710,2711],{},"Three deploys, weeks apart. It's tedious, it's the way. Direct column drop always breaks. Direct type change always breaks. Adding NOT NULL without a default directly always breaks.",[12,2713,2714,2715,2402,2718,2721,2722,2725],{},"Tools that help: ",[231,2716,2717],{},"pg-osc",[231,2719,2720],{},"pgroll"," (Postgres), ",[231,2723,2724],{},"gh-ost"," (MySQL) — do online schema change without a long lock. For light migrations, the manual three-step way solves it.",[19,2727,2729],{"id":2728},"patterns-beyond-rolling","Patterns beyond rolling",[12,2731,2732],{},"Rolling is the default and most economical pattern. Others worth knowing:",[2734,2735,2736,2742,2748],"ul",{},[70,2737,2738,2741],{},[27,2739,2740],{},"Blue-green."," Two complete parallel environments — \"blue\" running v1, \"green\" provisioned with v2 empty. You bring up v2 entirely on green, validate, switch DNS (or load balancer cutover). Advantage: instant rollback (return DNS to blue). Disadvantage: costs double the resources during the deploy window.",[70,2743,2744,2747],{},[27,2745,2746],{},"Canary."," Send 5% of traffic to v2, observe metrics (errors, latency, conversion rate), decide whether to promote to 100% or abort. Detects subtle bugs that health check doesn't catch — like regression in checkout conversion. Requires a proxy with weighted routing and decent observability.",[70,2749,2750,2753],{},[27,2751,2752],{},"Rainbow \u002F N+1."," Generalization of blue-green with N coexisting versions. Useful when you want long-running A\u002FB tests between entire versions.",[12,2755,2756],{},"For the tutorial, rolling is what makes sense. 
The others are worth it when the traffic size justifies the extra investment.",[19,2758,2760],{"id":2759},"easy-version-coolify-or-dokploy","\"Easy\" version — Coolify or Dokploy",[12,2762,2763],{},"If you don't want to script, two modern panels do rolling deploy automatically:",[2734,2765,2766,2772],{},[70,2767,2768,2771],{},[27,2769,2770],{},"Coolify"," in multi-server mode does rolling with configurable health check. Multi-server was added in more recent versions — before it was single-server only. Worth checking the version.",[70,2773,2774,2777,2778,2781],{},[27,2775,2776],{},"Dokploy"," on top of Docker Swarm does rolling with ",[231,2779,2780],{},"--update-parallelism 1 --update-delay",". Leverages what Swarm already offers.",[12,2783,2784],{},"Trade-off: you swap the fifty-line script (where you understand everything that happens) for a panel (which is faster to set up, but becomes a black box when something goes wrong). For a small team where one person handles operations part-time, the panel wins. For a team where you need to understand exactly what happened at 3 a.m., the script wins.",[19,2786,2788],{"id":2787},"robust-version-heroctl","\"Robust\" version — HeroCtl",[12,2790,2791],{},"For those who want to stop scripting but don't want a black box, HeroCtl combines automatic rolling deploy with a replicated control plane. 
You describe the service in a configuration file and the orchestrator does the rest:",[224,2793,2797],{"className":2794,"code":2795,"language":2796,"meta":229,"style":229},"language-hcl shiki shiki-themes github-dark-default","job \"minhaapp\" {\n  group \"web\" {\n    count = 2\n\n    task \"app\" {\n      driver = \"docker\"\n      config {\n        image = \"meuusuario\u002Fmyapp:v2\"\n        ports = [\"http\"]\n      }\n\n      service {\n        port = \"http\"\n        check {\n          type     = \"http\"\n          path     = \"\u002Fhealthz\"\n          interval = \"5s\"\n          timeout  = \"2s\"\n        }\n      }\n    }\n\n    update {\n      max_parallel      = 1\n      min_healthy_time  = \"10s\"\n      healthy_deadline  = \"5m\"\n      auto_revert       = true\n    }\n  }\n}\n","hcl",[231,2798,2799,2804,2809,2814,2818,2823,2828,2833,2838,2843,2848,2852,2857,2862,2867,2872,2877,2882,2887,2891,2895,2899,2903,2908,2913,2918,2923,2928,2932,2936],{"__ignoreMap":229},[234,2800,2801],{"class":236,"line":237},[234,2802,2803],{},"job \"minhaapp\" {\n",[234,2805,2806],{"class":236,"line":244},[234,2807,2808],{},"  group \"web\" {\n",[234,2810,2811],{"class":236,"line":271},[234,2812,2813],{},"    count = 2\n",[234,2815,2816],{"class":236,"line":415},[234,2817,412],{"emptyLinePlaceholder":411},[234,2819,2820],{"class":236,"line":434},[234,2821,2822],{},"    task \"app\" {\n",[234,2824,2825],{"class":236,"line":459},[234,2826,2827],{},"      driver = \"docker\"\n",[234,2829,2830],{"class":236,"line":464},[234,2831,2832],{},"      config {\n",[234,2834,2835],{"class":236,"line":479},[234,2836,2837],{},"        image = \"meuusuario\u002Fmyapp:v2\"\n",[234,2839,2840],{"class":236,"line":484},[234,2841,2842],{},"        ports = [\"http\"]\n",[234,2844,2845],{"class":236,"line":490},[234,2846,2847],{},"      }\n",[234,2849,2850],{"class":236,"line":508},[234,2851,412],{"emptyLinePlaceholder":411},[234,2853,2854],{"class":236,"line":529},[234,2855,2856],{},"      
service {\n",[234,2858,2859],{"class":236,"line":535},[234,2860,2861],{},"        port = \"http\"\n",[234,2863,2864],{"class":236,"line":546},[234,2865,2866],{},"        check {\n",[234,2868,2869],{"class":236,"line":552},[234,2870,2871],{},"          type     = \"http\"\n",[234,2873,2874],{"class":236,"line":557},[234,2875,2876],{},"          path     = \"\u002Fhealthz\"\n",[234,2878,2879],{"class":236,"line":594},[234,2880,2881],{},"          interval = \"5s\"\n",[234,2883,2884],{"class":236,"line":635},[234,2885,2886],{},"          timeout  = \"2s\"\n",[234,2888,2889],{"class":236,"line":643},[234,2890,1352],{},[234,2892,2893],{"class":236,"line":659},[234,2894,2847],{},[234,2896,2897],{"class":236,"line":683},[234,2898,1357],{},[234,2900,2901],{"class":236,"line":695},[234,2902,412],{"emptyLinePlaceholder":411},[234,2904,2905],{"class":236,"line":717},[234,2906,2907],{},"    update {\n",[234,2909,2910],{"class":236,"line":723},[234,2911,2912],{},"      max_parallel      = 1\n",[234,2914,2915],{"class":236,"line":729},[234,2916,2917],{},"      min_healthy_time  = \"10s\"\n",[234,2919,2920],{"class":236,"line":734},[234,2921,2922],{},"      healthy_deadline  = \"5m\"\n",[234,2924,2925],{"class":236,"line":771},[234,2926,2927],{},"      auto_revert       = true\n",[234,2929,2930],{"class":236,"line":776},[234,2931,1357],{},[234,2933,2934],{"class":236,"line":815},[234,2935,720],{},[234,2937,2938],{"class":236,"line":820},[234,2939,1362],{},[12,2941,2942],{},"The same parameters as the bash script, declarative. The difference is that the orchestrator coordinates rolling across N servers (not just two), does automatic leader election in around seven seconds if the current node fails, and keeps the control plane distributed across the first three servers. 
Cluster survives the loss of any single server without human intervention.",[12,2944,2945],{},"Installation:",[224,2947,2949],{"className":226,"code":2948,"language":228,"meta":229,"style":229},"curl -sSL https:\u002F\u002Fget.heroctl.com\u002Finstall.sh | sh\n",[231,2950,2951],{"__ignoreMap":229},[234,2952,2953,2955,2958,2961,2964],{"class":236,"line":237},[234,2954,1220],{"class":247},[234,2956,2957],{"class":251}," -sSL",[234,2959,2960],{"class":255}," https:\u002F\u002Fget.heroctl.com\u002Finstall.sh",[234,2962,2963],{"class":383}," |",[234,2965,2966],{"class":247}," sh\n",[12,2968,2969],{},"Community plan is permanently free — no server or job limit, with all the orchestration features described in the tutorial. Business plan adds SSO\u002FSAML, granular RBAC, detailed audit, and SLA-backed support, for teams that have formal platform requirements. Enterprise plan adds source-code escrow, continuity contract, and 24×7 support. Business and Enterprise prices are published on the plans page — no mandatory \"talk to sales\".",[19,2971,2973],{"id":2972},"comparison-five-paths-side-by-side","Comparison: five paths side by side",[119,2975,2976,3001],{},[122,2977,2978],{},[125,2979,2980,2983,2986,2989,2992,2995,2998],{},[128,2981,2982],{},"Criterion",[128,2984,2985],{},"Bash script (2 servers)",[128,2987,2988],{},"Coolify multi-server",[128,2990,2991],{},"Dokploy + Swarm",[128,2993,2994],{},"HeroCtl",[128,2996,2997],{},"Kamal",[128,2999,3000],{},"Kubernetes",[141,3002,3003,3025,3048,3070,3088,3105,3127,3147],{},[125,3004,3005,3008,3011,3014,3017,3020,3022],{},[146,3006,3007],{},"Setup time",[146,3009,3010],{},"2-3h",[146,3012,3013],{},"30 min",[146,3015,3016],{},"1h",[146,3018,3019],{},"5 min",[146,3021,3016],{},[146,3023,3024],{},"4h-4 days",[125,3026,3027,3030,3033,3036,3039,3042,3045],{},[146,3028,3029],{},"Lines of config",[146,3031,3032],{},"~50 
(script)",[146,3034,3035],{},"UI",[146,3037,3038],{},"~20",[146,3040,3041],{},"~50",[146,3043,3044],{},"~40",[146,3046,3047],{},"300+",[125,3049,3050,3053,3056,3059,3062,3065,3067],{},[146,3051,3052],{},"HA of the control plane",[146,3054,3055],{},"N\u002FA",[146,3057,3058],{},"No",[146,3060,3061],{},"Limited",[146,3063,3064],{},"Yes",[146,3066,3055],{},[146,3068,3069],{},"Yes (5+ components)",[125,3071,3072,3075,3078,3080,3082,3084,3086],{},[146,3073,3074],{},"Declarative health check",[146,3076,3077],{},"Manual",[146,3079,3064],{},[146,3081,3064],{},[146,3083,3064],{},[146,3085,3064],{},[146,3087,3064],{},[125,3089,3090,3092,3095,3097,3099,3101,3103],{},[146,3091,2441],{},[146,3093,3094],{},"Manual in script",[146,3096,3064],{},[146,3098,3064],{},[146,3100,3064],{},[146,3102,3064],{},[146,3104,3064],{},[125,3106,3107,3110,3113,3116,3119,3122,3124],{},[146,3108,3109],{},"Target scale",[146,3111,3112],{},"1-3 servers",[146,3114,3115],{},"1-10 servers",[146,3117,3118],{},"1-20 servers",[146,3120,3121],{},"1-500 servers",[146,3123,3115],{},[146,3125,3126],{},"50+ servers",[125,3128,3129,3132,3135,3137,3140,3143,3145],{},[146,3130,3131],{},"Black box?",[146,3133,3134],{},"No (you wrote it)",[146,3136,3064],{},[146,3138,3139],{},"Partial",[146,3141,3142],{},"No (short declarative)",[146,3144,3058],{},[146,3146,3064],{},[125,3148,3149,3152,3155,3157,3160,3162,3164],{},[146,3150,3151],{},"Learning curve",[146,3153,3154],{},"Low",[146,3156,3154],{},[146,3158,3159],{},"Medium",[146,3161,3154],{},[146,3163,3154],{},[146,3165,3166],{},"High",[12,3168,3169],{},"Each column has its niche. Bash script is unbeatable when you want to understand each line. Coolify wins when you just want a panel. HeroCtl wins when you need real HA without setting up an external control plane. 
Kubernetes wins at planetary scale — where the complexity pays off.",[19,3171,3173],{"id":3172},"the-five-most-common-errors","The five most common errors",[67,3175,3176,3188,3194,3204,3216],{},[70,3177,3178,3184,3185,3187],{},[27,3179,3180,3181,3183],{},"Health check on ",[231,3182,1523],{}," returning 200 without validating dependencies."," The app returns 200 before connecting to the database, the proxy promotes, and the user sees a 500 error on the first requests. Solution: ",[231,3186,355],{}," validates database, cache, queue — anything the app needs to actually respond.",[70,3189,3190,3193],{},[27,3191,3192],{},"Min healthy time of 1 second."," Apps with irregular warm-up may return 200 at one moment and 503 right after (cache populating, class being lazy-loaded). The orchestrator promotes on the first good window, and the next request hits a bad state. Ten sustained seconds eliminate ninety percent of these cases.",[70,3195,3196,3199,3200,3203],{},[27,3197,3198],{},"No max_parallel (or max_parallel = N)."," If you swap all instances together, during the cutover window there's nobody healthy serving. It's single-server downtime in disguise. Always ",[231,3201,3202],{},"max_parallel = 1"," to start.",[70,3205,3206,3209,3210,3212,3213,3215],{},[27,3207,3208],{},"Mix of versions in production without schema compat."," v1 writes to ",[231,3211,2628],{},", v2 reads from ",[231,3214,2632],{},", and during the five-minute rolling the two coexist — users hitting v2 don't see data v1 just wrote. Backward-compatible migration in three steps solves it.",[70,3217,3218,3221],{},[27,3219,3220],{},"Stale cache on the client (CDN, browser, service worker)."," Backend is already v2 but the user has the v1 JS in cache, and the old JS calls an API that no longer exists. 
Solution: keep old endpoints for a window; API versioning; strong cache-busting on critical assets.",[19,3223,3225],{"id":3224},"faq","FAQ",[12,3227,3228],{},[27,3229,3230],{},"Can I do zero-downtime with a single server?",[12,3232,3233,3234,3237],{},"No. Every variation that promises this has a measurable error window when you measure with ",[231,3235,3236],{},"hey -c 20",". The only way to have real zero-downtime is to keep at least one instance always healthy throughout the deploy — which requires two machines minimum.",[12,3239,3240],{},[27,3241,3242],{},"Does DNS round-robin work as a load balancer?",[12,3244,3245],{},"It works as a basic load balancer, but not as a health check mechanism. DNS doesn't quickly remove a dead IP from rotation — TTLs caching at ISPs and clients keep the wrong IP in use for minutes or hours. For zero-downtime you need a real proxy (Caddy, nginx, HAProxy) that takes unhealthy instances out of balancing in seconds.",[12,3247,3248],{},[27,3249,3250],{},"Caddy or Traefik — which is better for this setup?",[12,3252,3253],{},"For two servers and a static setup, Caddy is simpler — fifteen-line Caddyfile solves it. Traefik shines when you have dynamic service discovery (like Docker labels or Consul) and many backends changing all the time. nginx sits in the middle: more flexible, no built-in automatic TLS (needs external certbot). For this tutorial, Caddy.",[12,3255,3256],{},[27,3257,3258],{},"Do WebSocket connections survive during rolling?",[12,3260,3261,3262,3264],{},"Connections open on an instance that's being torn down are cut. The client has to reconnect. A good WebSocket library (Socket.IO, Phoenix Channels) reconnects automatically — the user sees a half-second blink in state. Connection draining helps: the instance marks ",[231,3263,355],{}," failing, the proxy stops sending new connections, but existing ones continue until the pre-stop timer. 
Thirty seconds of draining are usually enough for long-lived connections to wind down naturally.",[12,3266,3267],{},[27,3268,3269],{},"Database migrations — what's the golden rule?",[12,3271,3272],{},"Every migration must be backward-compatible. Never drop a column directly. Never rename directly. Never change a type directly. Instead, three deploys: add new structure, backfill, remove the old. Slow, yes. But rolling deploy depends on it to avoid breaking.",[12,3274,3275],{},[27,3276,3277],{},"Automatic rollback — how to implement?",[12,3279,3280,3281,101],{},"Two pieces: deadline (max time waiting for health check) and reference to the previous image pre-pulled. If the deadline passes without becoming healthy, the script reinstalls the previous version. The example in Step 5 does exactly that. In declarative orchestrators, it becomes ",[231,3282,3283],{},"auto_revert = true",[12,3285,3286],{},[27,3287,3288],{},"Do sticky sessions complicate zero-downtime?",[12,3290,3291],{},"Yes. If the app stores session state in process memory, taking down the instance takes down the sessions of users connected to it. Solution: take session out of memory — Redis, Postgres, or signed JWT. Then any instance serves any user, and rolling cuts no session.",[12,3293,3294],{},[27,3295,3296],{},"How long does a complete deploy take?",[12,3298,3299,3300,3303],{},"Two servers, app that comes up in fifteen seconds: about a minute. Breakdown: image pull (5-15s, depends on network and size), container replacement (1s), warm-up + health check (10-30s), 10s min healthy time, total around 30-50s per host, multiplied by two hosts in sequence = 1-2 min. Four servers around 2-4 min. With fifty servers, deploy starts taking ten or fifteen minutes — time to raise ",[231,3301,3302],{},"max_parallel"," to two or three (keeping a rigorous health check).",[57,3305],{},[19,3307,3309],{"id":3308},"closing","Closing",[12,3311,3312],{},"Zero-downtime deploy is architecture, not a tool. 
The three ingredients — multiple instances, proxy with health check, controlled rolling — work with bash and Caddy as well as with a large orchestrator. The difference is in how much of the operation you want to write by hand and how much to delegate.",[12,3314,3315],{},"For a small SaaS, three VPS and a fifty-line script solve it indefinitely. When the cluster grows to dozens of servers or the team needs real HA on the control plane, it's worth stepping up to the declarative orchestrator:",[224,3317,3318],{"className":226,"code":2948,"language":228,"meta":229,"style":229},[231,3319,3320],{"__ignoreMap":229},[234,3321,3322,3324,3326,3328,3330],{"class":236,"line":237},[234,3323,1220],{"class":247},[234,3325,2957],{"class":251},[234,3327,2960],{"class":255},[234,3329,2963],{"class":383},[234,3331,2966],{"class":247},[12,3333,3334,3335,3340,3341,3345],{},"More on the rolling algorithm in ",[3336,3337,3339],"a",{"href":3338},"\u002Fen\u002Fblog\u002Fsafe-rolling-deploys-why-yours-might-not-be","Safe rolling deploy: why yours might not be",". 
For those leaving Compose for a multi-server setup, ",[3336,3342,3344],{"href":3343},"\u002Fen\u002Fblog\u002Fdocker-deploy-production-compose-to-cluster","Docker deploy in production: from compose to a cluster"," covers the intermediate path.",[12,3347,3348],{},"Container orchestration, without ceremony.",[3350,3351,3352],"style",{},"html pre.shiki code .sH3jZ, html code.shiki .sH3jZ{--shiki-default:#8B949E}html pre.shiki code .sQhOw, html code.shiki .sQhOw{--shiki-default:#FFA657}html pre.shiki code .sFSAA, html code.shiki .sFSAA{--shiki-default:#79C0FF}html pre.shiki code .s9uIt, html code.shiki .s9uIt{--shiki-default:#A5D6FF}html .default .shiki span {color: var(--shiki-default);background: var(--shiki-default-bg);font-style: var(--shiki-default-font-style);font-weight: var(--shiki-default-font-weight);text-decoration: var(--shiki-default-text-decoration);}html .shiki span {color: var(--shiki-default);background: var(--shiki-default-bg);font-style: var(--shiki-default-font-style);font-weight: var(--shiki-default-font-weight);text-decoration: var(--shiki-default-text-decoration);}html pre.shiki code .suJrU, html code.shiki .suJrU{--shiki-default:#FF7B72}html pre.shiki code .sZEs4, html code.shiki .sZEs4{--shiki-default:#E6EDF3}html pre.shiki code .sc3cj, html code.shiki 
.sc3cj{--shiki-default:#D2A8FF}",{"title":229,"searchDepth":244,"depth":244,"links":3354},[3355,3356,3357,3358,3359,3360,3365,3366,3367,3368,3369,3370,3371,3372,3373,3374,3375,3376,3377],{"id":21,"depth":244,"text":22},{"id":61,"depth":244,"text":62},{"id":93,"depth":244,"text":94},{"id":113,"depth":244,"text":114},{"id":215,"depth":244,"text":216},{"id":348,"depth":244,"text":349,"children":3361},[3362,3363,3364],{"id":370,"depth":271,"text":371},{"id":918,"depth":271,"text":919},{"id":1002,"depth":271,"text":1003},{"id":1099,"depth":244,"text":1100},{"id":1241,"depth":244,"text":1242},{"id":1462,"depth":244,"text":1463},{"id":2448,"depth":244,"text":2449},{"id":2539,"depth":244,"text":2540},{"id":2608,"depth":244,"text":2609},{"id":2728,"depth":244,"text":2729},{"id":2759,"depth":244,"text":2760},{"id":2787,"depth":244,"text":2788},{"id":2972,"depth":244,"text":2973},{"id":3172,"depth":244,"text":3173},{"id":3224,"depth":244,"text":3225},{"id":3308,"depth":244,"text":3309},"engineering",null,"2026-06-09","You don't need Kubernetes for zero-downtime deploys. 
Full tutorial with 2 servers, Caddy\u002FTraefik in front, and rolling update via script or lightweight orchestrator.",false,"md",{},"\u002Fen\u002Fblog\u002Fzero-downtime-deploy-without-kubernetes","15 min",{"title":6,"description":3381},{"loc":3385},"en\u002Fblog\u002Fzero-downtime-deploy-without-kubernetes",[1526,3391,3392,3378],"zero-downtime","tutorial","lwgFsUuWTJnDZ04WNV5qWhFTbnfIZk7sCm3BW82wMaY",{"id":3395,"title":3396,"author":7,"body":3397,"category":3378,"cover":3379,"date":4397,"description":4398,"draft":3382,"extension":3383,"lastReviewed":3379,"meta":4399,"navigation":411,"path":4400,"readingTime":4401,"seo":4402,"sitemap":4403,"stem":4404,"tags":4405,"__hash__":4410},"blog_en\u002Fen\u002Fblog\u002Fself-hosted-api-gateway-when-to-install.md","Self-hosted API gateway: when it's worth installing Kong, Traefik or similar",{"type":9,"value":3398,"toc":4377},[3399,3402,3405,3409,3416,3419,3426,3430,3433,3515,3522,3582,3585,3589,3592,3596,3599,3605,3611,3617,3621,3624,3629,3634,3639,3643,3646,3651,3656,3661,3665,3668,3673,3678,3683,3687,3690,3695,3700,3705,3709,3712,3750,3753,3757,3760,3812,3815,3819,3822,3828,3831,3834,3838,4124,4131,4135,4138,4144,4150,4156,4162,4168,4171,4175,4178,4184,4190,4196,4202,4206,4209,4241,4245,4255,4261,4267,4277,4283,4289,4295,4309,4315,4319,4322,4325,4328,4331,4347,4361,4371,4374],[12,3400,3401],{},"\"API gateway\" is one of the most overloaded jargon categories in back-end architecture. The term became an umbrella for things a simple reverse proxy has done for twenty years — routing, terminating TLS, balancing between instances — mixed with things that genuinely require a dedicated component: per-client API key validation, per-user request limiting, request body transformation, aggregating multiple back-ends into a single response. The confusion sells a lot of product. 
It also makes a lot of startups install a critical component they didn't need — paying later in latency, RAM, operational complexity and failure surface.",[12,3403,3404],{},"This post separates what each thing covers, lists the five main players with honest resource consumption numbers, and draws a practical ruler: when the reverse proxy embedded in the orchestrator is enough, when it's worth bringing up a standalone Traefik, and when you actually need a Kong with authentication plug-ins. The audience is the tech lead looking at the current stack and trying to decide whether the next pain deserves another component on the critical path — or whether the pain is fake.",[19,3406,3408],{"id":3407},"tldr-installing-a-dedicated-gateway-is-an-expensive-decision-keep-the-ruler-short","TL;DR — installing a dedicated gateway is an expensive decision, keep the ruler short",[12,3410,3411,3412,3415],{},"A simple reverse proxy covers the bulk of the problem: HTTPS terminated, automatic Let's Encrypt certificates, routing by host and path, balancing between back-ends, health check, compression. For a typical B2B SaaS with web app + a few HTTP microservices, ",[27,3413,3414],{},"that's enough",". No need to install Kong, no need for Tyk, no need for KrakenD.",[12,3417,3418],{},"A dedicated API gateway becomes a defensible investment when three signs appear simultaneously: you publish an API for third parties to consume (not just your own web\u002Fmobile), you need request limits per client key (not per IP), and you want interactive documentation with try-it-here for consumers. In that scenario, Kong, Tyk or Traefik with rich middlewares pay for themselves. 
Outside that scenario, you are adding 100–300 MB of RAM on the critical path, 1 to 3 milliseconds of latency per request, and one more component that can fail in production — in exchange for features nobody will use.",[12,3420,3421,3422,3425],{},"The simplest ruler we know: ",[27,3423,3424],{},"if your end client is a person opening a browser, a reverse proxy is enough. If your end client is a developer with an API key, consider the gateway."," Everything else is a variation on top of those two lines.",[19,3427,3429],{"id":3428},"what-a-simple-reverse-proxy-already-covers-and-whats-still-missing","What a simple reverse proxy already covers, and what's still missing",[12,3431,3432],{},"Before comparing gateways, it's worth establishing the baseline. A decent reverse proxy — Caddy, nginx, or the integrated router of a modern orchestrator — delivers a lot for free. This list represents the state of the art in 2026, not the historical minimum:",[2734,3434,3435,3441,3447,3464,3470,3476,3482,3488,3494,3500],{},[70,3436,3437,3440],{},[27,3438,3439],{},"HTTP\u002FHTTPS termination with HTTP\u002F2 and HTTP\u002F3."," The proxy speaks any modern protocol with the client and speaks clean HTTP\u002F1.1 to the back-end if needed.",[70,3442,3443,3446],{},[27,3444,3445],{},"Automatic Let's Encrypt certificates."," Issuance, renewal at 60 days, error recovery. Today this is a commodity — any serious router does it.",[70,3448,3449,2577,3452,3455,3456,3459,3460,3463],{},[27,3450,3451],{},"Routing by host and path.",[231,3453,3454],{},"api.example.com"," goes to one back-end, ",[231,3457,3458],{},"app.example.com"," goes to another, ",[231,3461,3462],{},"\u002Fv1\u002Fusers"," goes to a third. Rules with prefix, regex and priority.",[70,3465,3466,3469],{},[27,3467,3468],{},"Balancing between instances."," Round-robin, least connections, IP hash. 
Enough to distribute load between replicas of the same service.",[70,3471,3472,3475],{},[27,3473,3474],{},"Active and passive health check."," Removes a sick instance from the pool. Re-includes it when it comes back.",[70,3477,3478,3481],{},[27,3479,3480],{},"gzip and brotli compression."," Negotiates with the client, compresses what's worth compressing.",[70,3483,3484,3487],{},[27,3485,3486],{},"Static content cache."," For immutable files, avoids hitting the back-end.",[70,3489,3490,3493],{},[27,3491,3492],{},"Basic per-IP limit."," Thirty requests per second per address, for example. Covers most silly abuse.",[70,3495,3496,3499],{},[27,3497,3498],{},"Timeouts and retries."," Fail fast, retry on an alternative back-end if applicable.",[70,3501,3502,2577,3505,571,3508,571,3511,3514],{},[27,3503,3504],{},"Proxy headers.",[231,3506,3507],{},"X-Forwarded-For",[231,3509,3510],{},"X-Real-IP",[231,3512,3513],{},"X-Forwarded-Proto",". The back-end sees the real client.",[12,3516,3517,3518,3521],{},"That's a lot. For 80% of B2B SaaS web applications, it's all you need on the entry path. What a simple reverse proxy ",[27,3519,3520],{},"does not"," cover is what differentiates a gateway:",[2734,3523,3524,3530,3540,3546,3552,3558,3564,3570,3576],{},[70,3525,3526,3529],{},[27,3527,3528],{},"Per-client API key validation."," Each consumer gets a key, the gateway validates, identifies the client, and uses that identity for limits and auditing.",[70,3531,3532,3535,3536,3539],{},[27,3533,3534],{},"JWT token validation with rotatable keys."," The gateway downloads public keys from the issuer, validates signature and time, exposes ",[231,3537,3538],{},"claims"," to the back-end.",[70,3541,3542,3545],{},[27,3543,3544],{},"Request limits per key\u002Fuser\u002Froute."," Client A can make 1,000 calls\u002Fhour; client B, 100. Per route, per day, with sliding window. 
Hard to do in a simple proxy.",[70,3547,3548,3551],{},[27,3549,3550],{},"Request and response transformation."," Add\u002Fremove headers, rewrite JSON body, translate between API versions.",[70,3553,3554,3557],{},[27,3555,3556],{},"Versioning by header or path."," Coexist with v1 clients while v2 gains traction. Deprecation policy.",[70,3559,3560,3563],{},[27,3561,3562],{},"Back-end aggregation."," Composite endpoint that calls three microservices and returns a unified response (back-end-for-frontend pattern).",[70,3565,3566,3569],{},[27,3567,3568],{},"Request schema validation."," Reject at the gateway what doesn't match the OpenAPI contract before touching the back-end.",[70,3571,3572,3575],{},[27,3573,3574],{},"Documentation portal with try-it-here."," Interactive page for developers to explore the API.",[70,3577,3578,3581],{},[27,3579,3580],{},"Granular metrics per API key."," Who called, how much, when, with what latency. Vital if the API is the product.",[12,3583,3584],{},"Each item on this second list is a feature that costs a lot to do in application code spread out. If you need most of it, a gateway pays. If you need almost nothing — which is the common case in product SaaS — the gateway is dead weight.",[19,3586,3588],{"id":3587},"the-five-players-that-matter-in-2026","The five players that matter in 2026",[12,3590,3591],{},"The market has settled. There are five defensible choices for a self-hosted gateway, with reasonably distinct profiles. The RAM and latency numbers below are measured with default configuration and modest workload (a few dozen calls per second); heavy plug-ins or high volume change everything.",[368,3593,3595],{"id":3594},"kong-lua-based-on-top-of-openresty","Kong (Lua-based, on top of OpenResty)",[12,3597,3598],{},"The best-known name in the category. 
Kong started in 2015 and has the largest plug-in catalog in the space — OAuth authentication, JWT validation, transformation, log to Elasticsearch, integration with external vaults, all pre-built. The open source version covers most cases; the paid one adds a more polished developer portal, fine-grained RBAC, and SLA support.",[12,3600,3601,3604],{},[27,3602,3603],{},"Resources:"," realistic minimum of 200 MB of RAM per instance, plus the database if you don't use db-less mode. Added latency of 1 to 2 milliseconds per request on a simple call. Heavy plug-ins (schema validation with large OpenAPI, complex JSON transformation) can double that.",[12,3606,3607,3610],{},[27,3608,3609],{},"When it makes sense:"," serious public API with multiple external consumers, need for catalog plug-ins, team willing to learn Lua if customization is needed. Payments company, communication platform, any business where the API is the product sold.",[12,3612,3613,3616],{},[27,3614,3615],{},"Gotcha:"," the mode with PostgreSQL puts the database on the critical path. Database down, gateway can't update configuration. Use db-less mode (declarative configuration via file) whenever possible — eliminates that dependency.",[368,3618,3620],{"id":3619},"traefik-written-in-go-speaking-various-orchestrator-proxies","Traefik (written in Go, speaking various orchestrator proxies)",[12,3622,3623],{},"Known as a Kubernetes ingress controller, but has rich enough middlewares to cover many gateway cases. Per-client request limiting, basic JWT validation, header transformation, complex redirects, forward auth (delegating to an external service). The paid version adds commercial plug-ins and a more robust dashboard.",[12,3625,3626,3628],{},[27,3627,3603],{}," 50 to 100 MB of RAM, added latency of 0.5 to 1 millisecond. 
Automatic back-end discovery via container labels is the strong point — you don't write route configuration, it appears when the service comes up.",[12,3630,3631,3633],{},[27,3632,3609],{}," already using Traefik as the entry router and want to avoid adding one more component; need reasonable middlewares but not Kong's giant catalog; value declarative configuration by label rather than database.",[12,3635,3636,3638],{},[27,3637,3615],{}," some advanced patterns (call aggregation, full OpenAPI validation, interactive documentation portal) don't fit in Traefik. If you need that, the temptation to \"stretch\" Traefik via custom plug-ins leads to complexity that Kong would solve more cleanly.",[368,3640,3642],{"id":3641},"tyk-written-in-go-focus-on-developer-portal","Tyk (written in Go, focus on developer portal)",[12,3644,3645],{},"The open source version delivers far more than most — request limiting per key, key management, developer portal, all in the free plan. The paid version adds multi-tenant dashboard, multi-region replication, and support.",[12,3647,3648,3650],{},[27,3649,3603],{}," 100 MB of RAM, added latency of 1 to 2 milliseconds. Database (Redis) is central to the architecture — request limits and counters live there.",[12,3652,3653,3655],{},[27,3654,3609],{}," API with many external consumers, developer portal is part of the product, you want to pay less than what Kong charges for the equivalent in resources. Small teams publishing API for partners have found a good fit here.",[12,3657,3658,3660],{},[27,3659,3615],{}," fewer ready-made plug-ins than Kong. If your expected integration exists in Kong's list but not in Tyk's, the trade-off changes.",[368,3662,3664],{"id":3663},"krakend-written-in-go-no-database-focus-on-aggregation","KrakenD (written in Go, no database, focus on aggregation)",[12,3666,3667],{},"KrakenD is the small gateway that specializes in aggregation. 
100% file configuration, no external state, designed to compose endpoints — the client makes one call, KrakenD calls three back-ends in parallel and returns a combined response. Great for the back-end-for-frontend pattern.",[12,3669,3670,3672],{},[27,3671,3603],{}," 50 MB of RAM, added latency of 0.5 milliseconds. The lightest of the category. No database, no panel — everything is static configuration file.",[12,3674,3675,3677],{},[27,3676,3609],{}," you have multiple microservices and want to expose a cleaner API to the mobile\u002Fweb front-end. You don't need dynamic key management or developer portal. You like immutable configuration: change file, deploy, done.",[12,3679,3680,3682],{},[27,3681,3615],{}," everything is static. Adding a new key is a deploy. For a small team that's simplification; for an API platform with third parties self-registering, it becomes a bottleneck.",[368,3684,3686],{"id":3685},"envoy-gateway-cncf-on-top-of-envoy-proxy","Envoy Gateway (CNCF, on top of Envoy proxy)",[12,3688,3689],{},"The serious newcomer of the list. Envoy is the very-high-performance proxy used in large service meshes. Envoy Gateway is the project that packages Envoy as an API gateway with declarative configuration. Focus on Kubernetes, high throughput, mesh integration.",[12,3691,3692,3694],{},[27,3693,3603],{}," raw Envoy consumes 50 to 100 MB on the data proxy; the control plane weighs more. Low added latency (\u003C 1 millisecond) on a simple call. But operational complexity is the highest on the list.",[12,3696,3697,3699],{},[27,3698,3609],{}," you already run a service mesh with Envoy (Istio, Consul, Linkerd with compatible proxy) and want configuration consistency between mesh and gateway. You operate at high enough scale that Envoy throughput matters (tens of thousands of requests per second).",[12,3701,3702,3704],{},[27,3703,3615],{}," for a startup with 4 servers and a few dozen requests per second, Envoy Gateway is overkill by two or three sizes. 
The configuration complexity doesn't pay off.",[19,3706,3708],{"id":3707},"when-is-a-simple-reverse-proxy-enough","When is a simple reverse proxy enough?",[12,3710,3711],{},"This is the question that saves money. The honest answer is: in the vast majority of the Brazilian B2B SaaS we see running, it is enough. The criteria for \"enough\":",[2734,3713,3714,3720,3726,3732,3738,3744],{},[70,3715,3716,3719],{},[27,3717,3718],{},"The audience for your API is your own application."," Web, mobile, internal integrations. There are no unknown third parties calling endpoints with keys you issued.",[70,3721,3722,3725],{},[27,3723,3724],{},"Authentication happens in the application, not on the path."," Cookie session, JWT token issued by the back-end itself and validated by application middleware, OAuth via a library inside the code. The proxy doesn't need to see the user.",[70,3727,3728,3731],{},[27,3729,3730],{},"The request limit is \"avoid silly abuse\"."," Thirty per second per IP, perhaps. There is no commercial plan that limits Client A to 1,000 calls\u002Fday and Client B to 10,000.",[70,3733,3734,3737],{},[27,3735,3736],{},"You don't need to combine back-ends."," Each front-end call goes to one endpoint, and that endpoint calls what it needs internally. No path-level aggregation.",[70,3739,3740,3743],{},[27,3741,3742],{},"API documentation is internal or non-existent."," No developer portal with try-it-here for third parties.",[70,3745,3746,3749],{},[27,3747,3748],{},"Versioning, if it exists, is managed in code."," The back-end routes internally between v1 and v2 when needed. No formal policy at the gateway.",[12,3751,3752],{},"If five of the six items above are true, installing a dedicated gateway costs more than the real benefit it delivers. 
A reverse proxy embedded in the orchestrator, or standalone Caddy\u002Fnginx, covers everything.",[19,3754,3756],{"id":3755},"when-is-a-dedicated-gateway-worth-it","When is a dedicated gateway worth it?",[12,3758,3759],{},"This is the previous list inverted. A gateway pays off when some of these appear:",[2734,3761,3762,3768,3774,3780,3794,3800,3806],{},[70,3763,3764,3767],{},[27,3765,3766],{},"Public API is part of the product."," You charge (or plan to charge) per API usage. Third parties register, get a key, consume.",[70,3769,3770,3773],{},[27,3771,3772],{},"Limit per key\u002Fuser\u002Froute is a business rule."," The free plan has a ceiling, the paid plan has a higher ceiling, the enterprise plan is negotiated. That limit needs to live somewhere — the gateway is the right place.",[70,3775,3776,3779],{},[27,3777,3778],{},"Multiple back-ends need to be combined into one response."," Back-end-for-frontend pattern, microservice aggregation, fan-out and fan-in. Costly to implement in the application, modest to implement in the gateway.",[70,3781,3782,3785,3786,3789,3790,3793],{},[27,3783,3784],{},"Formal API versioning."," You support v1 and v2 simultaneously, with an announced deprecation date. ",[231,3787,3788],{},"Accept-Version"," header or ",[231,3791,3792],{},"\u002Fv2\u002F"," path. Legacy clients can't break.",[70,3795,3796,3799],{},[27,3797,3798],{},"Complex authentication."," Validation of JWTs issued by a third party, with public keys downloaded and cached, with automatic rotation. OAuth with multiple providers. Authentication by client certificate (mutual TLS) for inter-company integrations.",[70,3801,3802,3805],{},[27,3803,3804],{},"Developer portal with try-it-here."," Interactive documentation, self-service key management, usage panel for consumers.",[70,3807,3808,3811],{},[27,3809,3810],{},"Per-API-key metrics."," Who calls what, when, latency per consumer. 
Commercial dashboards, usage reports, per-client SLA.",[12,3813,3814],{},"With three or more of these criteria true, a gateway is defensible. With one or two, you can still solve it in other ways (authentication in the application, limit per app, structured metrics in the log).",[19,3816,3818],{"id":3817},"heroctl-integrated-router-where-it-sits-on-this-ruler","HeroCtl integrated router — where it sits on this ladder",[12,3820,3821],{},"The router embedded in HeroCtl doesn't try to be a gateway. It covers the well-done reverse proxy side: HTTPS terminated, automatic Let's Encrypt with renewal, routing by host and path, balancing between the replicas the orchestrator brought up, health checks coordinated with the agent on each node, compression, proxy headers, a basic per-IP limit, retry policy on back-end failure.",[12,3823,3824,3825,3827],{},"What the integrated router ",[27,3826,3520],{}," do: per-consumer API key validation, per-key\u002Fuser limits, body transformation, back-end aggregation, OpenAPI schema validation, developer portal. For the 80% of cases where the end client is the company's own browser or mobile app, the embedded router covers the entire entry path — you don't install anything else in front.",[12,3829,3830],{},"For the 20% who need a dedicated gateway, the path is direct: install Kong, standalone Traefik, Tyk or KrakenD as another job in the cluster, behind the embedded router. The router terminates TLS at the edge, the gateway does the gateway work, the back-ends sit behind. No ceremony, no circular dependency.",[12,3832,3833],{},"The HeroCtl control plane occupies between 200 and 400 MB per server — meaning an installed Kong adds practically the same weight as the entire control plane. 
Worth remembering the order of magnitude before \"just install\".",[19,3835,3837],{"id":3836},"comparison-table-12-criteria","Comparison table — 12 criteria",[119,3839,3840,3867],{},[122,3841,3842],{},[125,3843,3844,3846,3849,3852,3855,3858,3861,3864],{},[128,3845,2982],{},[128,3847,3848],{},"Simple reverse proxy (Caddy\u002Fnginx)",[128,3850,3851],{},"HeroCtl router",[128,3853,3854],{},"Standalone Traefik",[128,3856,3857],{},"KrakenD",[128,3859,3860],{},"Tyk OSS",[128,3862,3863],{},"Kong OSS",[128,3865,3866],{},"Envoy Gateway",[141,3868,3869,3895,3919,3940,3959,3978,3997,4017,4036,4057,4078,4099],{},[125,3870,3871,3874,3877,3880,3883,3886,3889,3892],{},[146,3872,3873],{},"Minimum RAM",[146,3875,3876],{},"20–50 MB",[146,3878,3879],{},"embedded in control plane",[146,3881,3882],{},"50–100 MB",[146,3884,3885],{},"~50 MB",[146,3887,3888],{},"~100 MB",[146,3890,3891],{},"~200 MB",[146,3893,3894],{},"~100 MB + control plane",[125,3896,3897,3900,3903,3905,3908,3911,3914,3916],{},[146,3898,3899],{},"Added latency",[146,3901,3902],{},"\u003C 0.5 ms",[146,3904,3902],{},[146,3906,3907],{},"0.5–1 ms",[146,3909,3910],{},"~0.5 ms",[146,3912,3913],{},"1–2 ms",[146,3915,3913],{},[146,3917,3918],{},"\u003C 1 ms (can grow)",[125,3920,3921,3924,3927,3929,3931,3934,3936,3938],{},[146,3922,3923],{},"Automatic certificates",[146,3925,3926],{},"Yes (native Caddy)",[146,3928,3064],{},[146,3930,3064],{},[146,3932,3933],{},"Not direct",[146,3935,3064],{},[146,3937,3064],{},[146,3939,3064],{},[125,3941,3942,3945,3947,3949,3951,3953,3955,3957],{},[146,3943,3944],{},"Host\u002Fpath routing",[146,3946,3064],{},[146,3948,3064],{},[146,3950,3064],{},[146,3952,3064],{},[146,3954,3064],{},[146,3956,3064],{},[146,3958,3064],{},[125,3960,3961,3964,3966,3968,3970,3972,3974,3976],{},[146,3962,3963],{},"Balancing + 
health",[146,3965,3064],{},[146,3967,3064],{},[146,3969,3064],{},[146,3971,3064],{},[146,3973,3064],{},[146,3975,3064],{},[146,3977,3064],{},[125,3979,3980,3983,3985,3987,3989,3991,3993,3995],{},[146,3981,3982],{},"Per-IP limit",[146,3984,3064],{},[146,3986,3064],{},[146,3988,3064],{},[146,3990,3064],{},[146,3992,3064],{},[146,3994,3064],{},[146,3996,3064],{},[125,3998,3999,4002,4004,4006,4009,4011,4013,4015],{},[146,4000,4001],{},"Per key\u002Fuser limit",[146,4003,3058],{},[146,4005,3058],{},[146,4007,4008],{},"Yes (with middleware)",[146,4010,3064],{},[146,4012,3064],{},[146,4014,3064],{},[146,4016,3064],{},[125,4018,4019,4022,4024,4026,4028,4030,4032,4034],{},[146,4020,4021],{},"JWT validation",[146,4023,3058],{},[146,4025,3058],{},[146,4027,3139],{},[146,4029,3064],{},[146,4031,3064],{},[146,4033,3064],{},[146,4035,3064],{},[125,4037,4038,4041,4043,4045,4047,4050,4052,4055],{},[146,4039,4040],{},"Back-end aggregation",[146,4042,3058],{},[146,4044,3058],{},[146,4046,3058],{},[146,4048,4049],{},"Yes (focus)",[146,4051,3139],{},[146,4053,4054],{},"Yes (with plug-in)",[146,4056,3064],{},[125,4058,4059,4062,4064,4066,4068,4071,4073,4076],{},[146,4060,4061],{},"OpenAPI validation",[146,4063,3058],{},[146,4065,3058],{},[146,4067,3058],{},[146,4069,4070],{},"Yes (subscriber)",[146,4072,3064],{},[146,4074,4075],{},"Yes (plug-in)",[146,4077,3064],{},[125,4079,4080,4083,4085,4087,4089,4091,4094,4097],{},[146,4081,4082],{},"Developer portal",[146,4084,3058],{},[146,4086,3058],{},[146,4088,3058],{},[146,4090,3058],{},[146,4092,4093],{},"Yes (included)",[146,4095,4096],{},"Yes (paid in robust OSS)",[146,4098,3058],{},[125,4100,4101,4104,4107,4110,4113,4115,4118,4121],{},[146,4102,4103],{},"Configuration",[146,4105,4106],{},"File",[146,4108,4109],{},"Panel + API",[146,4111,4112],{},"Labels\u002Ffile",[146,4114,4106],{},[146,4116,4117],{},"File + panel",[146,4119,4120],{},"File + panel + database",[146,4122,4123],{},"Custom Resources",[12,4125,4126,4127,4130],{},"The table 
has clear zones. The first three columns solve the entry path with low weight. The last four solve the entry path ",[27,4128,4129],{},"plus"," gateway work, with growing weight and complexity.",[19,4132,4134],{"id":4133},"typical-stack-by-company-stage","Typical stack by company stage",[12,4136,4137],{},"This is the ladder we recommend. Not a strict prescription — it is what we see working in Brazilian SaaS teams.",[12,4139,4140,4143],{},[27,4141,4142],{},"MVP (1 back-end, 1 developer)."," Standalone Caddy on a server, or the embedded router if you're already in an orchestrator. Don't install anything. Don't think about a gateway. Focus on product.",[12,4145,4146,4149],{},[27,4147,4148],{},"Indie hacker (3 to 5 back-ends, team of 1 to 3)."," Embedded router in the orchestrator, period. The entry path already covers what matters. Authentication in the application, a basic per-IP limit on the proxy. Time spent on a gateway is time not spent on product features.",[12,4151,4152,4155],{},[27,4153,4154],{},"Early startup (10 to 20 back-ends, first external API consumers)."," Time to evaluate. If the external API is an experiment that may still die, leave authentication in the application and limit by key in a shared library. If the API is part of the product's core promise, install standalone Traefik with authentication and limit middlewares, or Tyk OSS for the included portal. Kong at this stage is usually too heavy.",[12,4157,4158,4161],{},[27,4159,4160],{},"Mid startup (50+ back-ends, public API platform becomes product)."," Kong OSS or paid Tyk. You need plug-ins, a robust portal, self-service key management, commercial metrics. Kong's weight is now justified — you're charging for API usage and the gateway is revenue, not cost.",[12,4163,4164,4167],{},[27,4165,4166],{},"Large company (hundreds of services, integrations with serious partners)."," Kong Enterprise or Envoy Gateway, depending on context. A dedicated team looking after the gateway. 
Formal versioning, deprecation, per-client SLA policy.",[12,4169,4170],{},"The natural migration — reverse proxy → Traefik\u002FTyk → Kong — works because each step solves a real pain from the previous step. Skipping steps is expensive: installing Kong at the MVP phase is like bringing a truck to deliver a pizza.",[19,4172,4174],{"id":4173},"the-4-most-common-expensive-gateway-mistakes","The 4 most common expensive gateway mistakes",[12,4176,4177],{},"The stumbles we see in production:",[12,4179,4180,4183],{},[27,4181,4182],{},"Installing Kong with PostgreSQL on the critical path without needing to."," db-less mode exists and is perfect for most cases. Declarative configuration via file, no external dependency, no extra failure point. Many teams fall into the default configuration with a database and only discover it when the database becomes unavailable and the gateway can no longer propagate changes. If you need dynamic key management (consumers self-registering), a database pays off. Otherwise, db-less mode.",[12,4185,4186,4189],{},[27,4187,4188],{},"Not monitoring the gateway with the same severity as the back-ends."," A gateway in front becomes an easily-forgotten black box. Latency grows 5 ms, the error rate goes from 0.01% to 0.5%, nobody notices until the client complains. Gateway metrics (per-route latency, 4xx\u002F5xx error rate, memory usage, configuration propagation lag) deserve their own dashboard and alerts, not just a line tucked into the general panel.",[12,4191,4192,4195],{},[27,4193,4194],{},"Custom plug-in in Lua\u002FJS running in production."," Kong allows custom plug-ins in Lua, Tyk in JavaScript. There's a huge temptation to solve \"just this transformation\" at the gateway. A bug in that plug-in takes the entire gateway down — a bug you created, without load testing, on the critical path of everything. If you need a custom transformation, do it in a microservice behind the gateway. 
A custom plug-in only with serious review, load testing, and an automatic rollback plan.",[12,4197,4198,4201],{},[27,4199,4200],{},"Outdated gateway version."," Kong, Envoy and Tyk receive CVEs (security vulnerabilities) regularly. The gateway is exposed to the internet — a relevant attack surface. An 18-month-old version is an accumulation of known vulnerabilities. Make it part of the maintenance cycle: updating the gateway is as important as updating the operating system.",[19,4203,4205],{"id":4204},"real-scenarios-where-you-should-not-install-a-gateway","Real scenarios where you should NOT install a gateway",[12,4207,4208],{},"A strong list. If you are in any of these, avoid the dedicated gateway even if the topic comes up at board level:",[2734,4210,4211,4217,4223,4229,4235],{},[70,4212,4213,4216],{},[27,4214,4215],{},"Web SaaS with 5 endpoints, no public external API."," End client is the browser. Session authentication. Reverse proxy solves the entire entry path. Adding a gateway here is architectural vanity.",[70,4218,4219,4222],{},[27,4220,4221],{},"Small team (1 to 3 developers)."," Kong's learning curve costs two to four weeks of total team productivity. On a 3-person team, that's a quarter of features stalled. Unless the gateway solves concrete pain today, postpone.",[70,4224,4225,4228],{},[27,4226,4227],{},"Workload where sub-10ms latency is a hard requirement."," Low-latency trading, real-time auction, multiplayer game. Every millisecond counts. A gateway adds 1 to 3 ms — in a latency-sensitive workload, that's expensive. Put the intelligence in the application.",[70,4230,4231,4234],{},[27,4232,4233],{},"Monolithic application without aggregation."," The monolith serves the front-end directly, without composition between services. There's nothing to aggregate. 
Gateway is a solution looking for a problem.",[70,4236,4237,4240],{},[27,4238,4239],{},"Compliance that prefers minimal attack surface."," Each component exposed to the internet is one more item for audit, one more patch to apply, one more log to keep. If the audit values minimalism, justify each component — a gateway that doesn't cover a concrete pain counts against you.",[19,4242,4244],{"id":4243},"frequent-questions","Frequently asked questions",[12,4246,4247,4250,4251,4254],{},[27,4248,4249],{},"Is Kong db-less stable in 2026?","\nYes. Declarative mode (",[231,4252,4253],{},"database = off",") is mature, recommended by Kong itself for most cases, and eliminates PostgreSQL as a dependency. You lose dynamic key management via the admin API (changes require deploying a new configuration), but gain enormous operational simplicity. For a small team, the trade is almost always worth it.",[12,4256,4257,4260],{},[27,4258,4259],{},"Does Traefik do everything Kong does?","\nNo. Traefik with middlewares covers the most common cases — basic authentication, simple per-key limits, header transformation, forward auth. It doesn't cover Kong's plug-in catalog, native OpenAPI validation, a robust developer portal, or ready-made commercial integrations. If your pain fits in a Traefik middleware, stay in Traefik (lighter, simpler). If you need something from Kong's catalog, Kong.",[12,4262,4263,4266],{},[27,4264,4265],{},"Can I have two gateways in series?","\nTechnically yes; in practice it is almost always a symptom of confused organization. Two gateways = two configurations to maintain, two latencies stacked, two failure points. The defensible case is: edge router doing TLS and basic routing, dedicated gateway behind it doing specific work (key validation, aggregation). That's different from \"two complete gateways in series\" — it's a split of responsibilities.",[12,4268,4269,4272,4273,101],{},[27,4270,4271],{},"Does an API gateway replace a service mesh?","\nNo. Gateway handles north-south traffic (external client → your system). 
Mesh handles east-west traffic (internal service → internal service). Similar functions (authentication, limits, observability) but different scope. For a medium startup, the gateway solves the part that matters; a complete service mesh only becomes a defensible investment at larger scale. We address that boundary in ",[3336,4274,4276],{"href":4275},"\u002Fen\u002Fblog\u002Fservice-mesh-when-its-worth-for-small-saas","service mesh: when it's worth it for small and medium SaaS",[12,4278,4279,4282],{},[27,4280,4281],{},"How much latency does Kong add on a typical call?","\nOn modern hardware, with default configuration and light plug-ins (key validation + per-key limit): 1 to 2 milliseconds per request. Heavy plug-ins (full OpenAPI validation on large payload, complex JSON transformation, synchronous log to external service) can add 3 to 10 ms. Measure before and after — don't trust generic blog post numbers.",[12,4284,4285,4288],{},[27,4286,4287],{},"Self-hosted OAuth provider — Keycloak or Hydra?","\nKeycloak is the standard for those wanting a robust admin panel, federation with LDAP\u002FSAML, complete user management. Heavier (1 GB of RAM minimum, JVM). Hydra is minimalist, focuses only on OAuth\u002FOIDC, no user management (you integrate with your existing user system). For a small team that already has its own user system, Hydra is more appropriate. For a company that wants a single place for identity, Keycloak. Both speak standard protocols, so the gateway doesn't differentiate between them.",[12,4290,4291,4294],{},[27,4292,4293],{},"Schema validation — OpenAPI or JSON Schema?","\nOpenAPI (formerly Swagger) is the standard for describing HTTP API — covers paths, methods, request and response. Includes JSON Schema for describing payloads. Kong, Tyk and standalone validators speak OpenAPI directly. Pure JSON Schema is more portable (not tied to HTTP) but requires more glue. 
Use OpenAPI when the gateway supports it, and keep the contract schema up to date rather than letting it rot.",[12,4296,4297,4300,4301,4304,4305,4308],{},[27,4298,4299],{},"Can I do limiting in the application instead of the gateway?","\nYou can. Libraries like ",[231,4302,4303],{},"golang.org\u002Fx\u002Ftime\u002Frate"," or Redis with ",[231,4306,4307],{},"INCR"," solve per-user limits at the application level. The question is where the limit is cheaper: at the gateway, before the back-end is touched (saves back-end resources, applies before work begins), or in the application, with business rules closer to the code (easier to reason about, easier to test). For simple limits, the application is enough. For limits per commercial plan with multiple tiers and auditing, the gateway is the right place.",[12,4310,4311,4314],{},[27,4312,4313],{},"Can I use two different gateways on distinct routes?","\nYou can. Some companies run Kong for \"product\" routes (the public API they sell) and Traefik for \"internal\" routes (admin, ops, cron). The justification is that each gateway solves a different pain, and having only one would force a compromise. Worth it when usage profiles actually diverge. Not worth it just for the pleasure of variety — two pieces to maintain.",[19,4316,4318],{"id":4317},"closing-start-with-the-minimum-level-up-when-the-pain-is-real","Closing — start with the minimum, level up when the pain is real",[12,4320,4321],{},"The trap of the \"API gateway\" category is treating the decision as binary — install or not — when it is gradual. A well-done reverse proxy covers 80% of applications. The integrated router in the orchestrator covers the same 80% without a separate component. A dedicated gateway is a defensible investment when three or four signs appear simultaneously: public external API, per-key limits, aggregation, developer portal.",[12,4323,4324],{},"The honest rule: install the minimum until concrete pain forces the next step. 
Skipping steps costs dearly in latency, RAM, complexity, failure surface. A small team that installs Kong \"because they will need it\" spends three weeks configuring something they don't use, and still has one more component to monitor.",[12,4326,4327],{},"HeroCtl delivers the lowest step embedded — integrated router with automatic TLS, balancing, health check, per-IP limit. When gateway pain appears for real, you bring up Kong, standalone Traefik, Tyk or KrakenD as another job in the cluster. Without painful migration, without ceremony.",[12,4329,4330],{},"To bring up a cluster and test:",[224,4332,4333],{"className":226,"code":2948,"language":228,"meta":229,"style":229},[231,4334,4335],{"__ignoreMap":229},[234,4336,4337,4339,4341,4343,4345],{"class":236,"line":237},[234,4338,1220],{"class":247},[234,4340,2957],{"class":251},[234,4342,2960],{"class":255},[234,4344,2963],{"class":383},[234,4346,2966],{"class":247},[12,4348,4349,4352,4353,4356,4357,4360],{},[27,4350,4351],{},"Community"," is free forever, no server ceiling, no job ceiling, no feature gate. ",[27,4354,4355],{},"Business"," adds SSO\u002FSAML, granular RBAC, detailed auditing and SLA support. ",[27,4358,4359],{},"Enterprise"," adds source code escrow, continuity contract and 24×7 support.",[12,4362,4363,4364,2402,4366,4370],{},"Upcoming posts: ",[3336,4365,4276],{"href":4275},[3336,4367,4369],{"href":4368},"\u002Fen\u002Fblog\u002Fmulti-tenant-saas-real-isolation","multi-tenant SaaS: real isolation between clients",". The three topics together cover most of the platform decisions for a Brazilian startup in the 1 to 500 server range.",[12,4372,4373],{},"Container orchestration, without ceremony. 
Gateway only when the pain asks.",[3350,4375,4376],{},"html pre.shiki code .sQhOw, html code.shiki .sQhOw{--shiki-default:#FFA657}html pre.shiki code .sFSAA, html code.shiki .sFSAA{--shiki-default:#79C0FF}html pre.shiki code .s9uIt, html code.shiki .s9uIt{--shiki-default:#A5D6FF}html pre.shiki code .suJrU, html code.shiki .suJrU{--shiki-default:#FF7B72}html .default .shiki span {color: var(--shiki-default);background: var(--shiki-default-bg);font-style: var(--shiki-default-font-style);font-weight: var(--shiki-default-font-weight);text-decoration: var(--shiki-default-text-decoration);}html .shiki span {color: var(--shiki-default);background: var(--shiki-default-bg);font-style: var(--shiki-default-font-style);font-weight: var(--shiki-default-font-weight);text-decoration: var(--shiki-default-text-decoration);}",{"title":229,"searchDepth":244,"depth":244,"links":4378},[4379,4380,4381,4388,4389,4390,4391,4392,4393,4394,4395,4396],{"id":3407,"depth":244,"text":3408},{"id":3428,"depth":244,"text":3429},{"id":3587,"depth":244,"text":3588,"children":4382},[4383,4384,4385,4386,4387],{"id":3594,"depth":271,"text":3595},{"id":3619,"depth":271,"text":3620},{"id":3641,"depth":271,"text":3642},{"id":3663,"depth":271,"text":3664},{"id":3685,"depth":271,"text":3686},{"id":3707,"depth":244,"text":3708},{"id":3755,"depth":244,"text":3756},{"id":3817,"depth":244,"text":3818},{"id":3836,"depth":244,"text":3837},{"id":4133,"depth":244,"text":4134},{"id":4173,"depth":244,"text":4174},{"id":4204,"depth":244,"text":4205},{"id":4243,"depth":244,"text":4244},{"id":4317,"depth":244,"text":4318},"2026-06-03","An API gateway solves auth, rate limiting, transformations and observability — in exchange for one more critical component. When a simple reverse proxy is enough vs. 
when a dedicated gateway is worth it.",{},"\u002Fen\u002Fblog\u002Fself-hosted-api-gateway-when-to-install","13 min",{"title":3396,"description":4398},{"loc":4400},"en\u002Fblog\u002Fself-hosted-api-gateway-when-to-install",[4406,4407,4408,3378,4409],"api-gateway","kong","traefik","architecture","CjqfmOvCmwhSmEqb8jVK10KBwZYkFYPuH8o9g6JGEs0",{"id":4412,"title":4413,"author":7,"body":4414,"category":3378,"cover":3379,"date":5369,"description":5370,"draft":3382,"extension":3383,"lastReviewed":3379,"meta":5371,"navigation":411,"path":4275,"readingTime":4401,"seo":5372,"sitemap":5373,"stem":5374,"tags":5375,"__hash__":5379},"blog_en\u002Fen\u002Fblog\u002Fservice-mesh-when-its-worth-for-small-saas.md","Is service mesh overkill for a Brazilian startup? When Istio\u002FLinkerd is worth installing",{"type":9,"value":4415,"toc":5351},[4416,4419,4423,4426,4429,4433,4436,4526,4529,4533,4536,4575,4578,4582,4585,4619,4622,4626,4629,4705,4708,4734,4737,4741,4744,4761,4764,4771,4775,4778,4781,4801,4804,4818,4821,4825,4828,5026,5033,5037,5040,5066,5069,5073,5076,5108,5112,5115,5141,5144,5148,5151,5171,5174,5178,5181,5201,5204,5208,5211,5247,5251,5257,5266,5272,5278,5284,5290,5296,5302,5308,5310,5313,5316,5334,5346,5349],[12,4417,4418],{},"The question always arrives in the same format. A tech lead from a Brazilian SaaS with six or eight services running reads three English posts on service mesh, sees the entire American industry using Istio, and opens the terminal to install — along with the doubt: \"isn't this too much for the size of my company?\". It probably is. 
But the honest answer requires separating the four problems that service mesh solves, showing the cost in RAM and CPU per server, and describing the exact point where the benefit starts surpassing the overhead.",[19,4420,4422],{"id":4421},"tldr-is-service-mesh-worth-it-for-smallmedium-startup","TL;DR — Is service mesh worth it for small\u002Fmedium startup?",[12,4424,4425],{},"Service mesh (Istio, Linkerd, Cilium Service Mesh, Consul Connect) solves four real problems between the services of a microservices architecture: automatic encryption (calls between pods without TLS leak plaintext traffic by default), retries and circuit breakers (configurable resilience), granular observability (which service calls which, with what latency), and traffic shaping for canary releases. In exchange, it adds a sidecar proxy on each pod (usually Envoy) that consumes between 50 and 100 MB of RAM and adds 5 to 10 ms of latency per internal call.",[12,4427,4428],{},"For a startup with fewer than ten active services and fewer than fifty pods, service mesh is overkill — the operational overhead exceeds the benefit, and the team spends weeks studying a layer that solves a problem it doesn't yet have. For a company with fifty or more microservices where diagnosing \"which service is delaying the call?\" takes hours, mesh pays for itself in productivity. The middle ground is clusters with inter-service encryption built into the control plane itself — they cover about 60% of what mesh offers without the per-pod sidecar, and serve most Brazilian cases up to the thirty-services range.",[19,4430,4432],{"id":4431},"what-service-mesh-solves-in-one-sentence","What service mesh solves, in one sentence",[12,4434,4435],{},"Before discussing cost, it's necessary to be clear on what's being bought. 
Service mesh is a network layer that interposes itself in every call between services and adds six behaviors:",[2734,4437,4438,4451,4463,4475,4502,4511],{},[70,4439,4440,4443,4444,2629,4447,4450],{},[27,4441,4442],{},"Automatic encryption between pods."," Without mesh, a call from ",[231,4445,4446],{},"orders",[231,4448,4449],{},"users"," inside the cluster travels in plain HTTP. Any process with access to the node's network sees the content. With mesh, each call is encrypted with automatically issued certificates, with no change to application code.",[70,4452,4453,4456,4457,4459,4460,4462],{},[27,4454,4455],{},"Automatic retries on internal calls."," When ",[231,4458,4446],{}," calls ",[231,4461,4449],{}," and the first attempt fails due to a 200 ms network flap, the mesh retries. Without mesh, the application needs to implement that logic on each HTTP client it creates.",[70,4464,4465,4468,4469,4471,4472,4474],{},[27,4466,4467],{},"Configurable circuit breakers."," If ",[231,4470,4449],{}," starts responding with five-second latency, the mesh opens the circuit and makes ",[231,4473,4446],{}," fail fast instead of stacking connections. Without mesh, the team needs to add a library to each service.",[70,4476,4477,4480,4481,571,4484,4487,4488,4490,4491,4493,4494,2402,4496,4499,4500,101],{},[27,4478,4479],{},"Automatic distributed tracing."," The mesh propagates correlation headers (",[231,4482,4483],{},"x-request-id",[231,4485,4486],{},"traceparent",") through the entire call chain. 
The team can see, on a panel, that a request entered the ",[231,4489,4406],{},", passed through ",[231,4492,4446],{},", called ",[231,4495,4449],{},[231,4497,4498],{},"inventory",", and spent most of the time in ",[231,4501,4498],{},[70,4503,4504,4507,4508,4510],{},[27,4505,4506],{},"Fine traffic shaping."," Routing 5% of ",[231,4509,4446],{}," traffic to a new version (canary), mirroring 100% to a test version without affecting the customer (mirror), or alternating between two complete versions (blue-green) — all configured declaratively, no code.",[70,4512,4513,4516,4517,2402,4519,4522,4523,4525],{},[27,4514,4515],{},"Authorization policies between services."," Declaring that only ",[231,4518,4446],{},[231,4520,4521],{},"reports"," can call ",[231,4524,4449],{},", and any other service receives 403. It's the basis of so-called \"zero-trust network\" between pods.",[12,4527,4528],{},"Those six behaviors are real and the value is measurable. The question is whether your cluster today has enough volume and complexity to justify paying for them.",[19,4530,4532],{"id":4531},"whats-not-a-service-mesh-problem","What's NOT a service mesh problem",[12,4534,4535],{},"Before advancing, it's worth eliminating four problems many teams confuse with reason to install mesh — and that modern orchestrator already solves alone:",[2734,4537,4538,4550,4556,4565],{},[70,4539,4540,4543,4544,4546,4547,4549],{},[27,4541,4542],{},"Ingress routing (HTTP ingress)."," Receive external traffic, terminate TLS, route ",[231,4545,3454],{}," to a service and ",[231,4548,3458],{}," to another. That's work for the integrated router of the orchestrator, not for mesh.",[70,4551,4552,4555],{},[27,4553,4554],{},"Simple load balancing."," Distribute requests among three replicas of the same service with round-robin. Orchestrator does this with internal DNS and health checks. 
Mesh only adds value when the load balancing policy needs to be sophisticated (region weight, complex sticky sessions).",[70,4557,4558,4561,4562,4564],{},[27,4559,4560],{},"Service discovery."," Find where ",[231,4563,4449],{}," is running. Internal cluster DNS solves it. Mesh brings nothing new here.",[70,4566,4567,4570,4571,4574],{},[27,4568,4569],{},"HTTP\u002FHTTPS termination at the edge."," Ingress controller solves it. Mesh handles traffic ",[179,4572,4573],{},"between"," services, not the entry.",[12,4576,4577],{},"Whoever installs mesh expecting it to solve those four is paying twice for the same work.",[19,4579,4581],{"id":4580},"the-four-main-players","The four main players",[12,4583,4584],{},"Four products dominate this category in 2026. The differences matter when the tradeoff is overhead vs features.",[2734,4586,4587,4597,4607,4613],{},[70,4588,4589,4592,4593,4596],{},[27,4590,4591],{},"Istio."," The oldest, most complete, most documented — and heaviest. Uses Envoy as a sidecar on each pod. The de facto standard at large companies that adopted mesh between 2019 and 2022. The Ambient Mode version (no sidecar, with ",[231,4594,4595],{},"ztunnel"," per node) reduces overhead, but is still stabilizing in production.",[70,4598,4599,4602,4603,4606],{},[27,4600,4601],{},"Linkerd."," Focus on simplicity. Its own proxy written in Rust (",[231,4604,4605],{},"linkerd2-proxy","), much lighter than Envoy. Short learning curve — installation fits in a couple of commands. CNCF graduated, but with a smaller community than Istio.",[70,4608,4609,4612],{},[27,4610,4611],{},"Cilium Service Mesh."," Takes advantage of eBPF in the kernel to implement much of the mesh without a sidecar. Per-pod overhead borders on zero. In exchange, cluster setup needs a recent kernel and a compatible CNI, and some advanced features (like sophisticated L7 authorization) still depend on an auxiliary proxy.",[70,4614,4615,4618],{},[27,4616,4617],{},"Consul Connect."," From HashiCorp. 
Integrates with the company's own secrets vault, and works well in mixed environments (VMs + containers). Brazilian community smaller than Istio\u002FLinkerd.",[12,4620,4621],{},"There are others (Kuma, Open Service Mesh, AWS App Mesh), but concentrating on the quartet above covers 95% of real decisions a Brazilian tech lead will face.",[19,4623,4625],{"id":4624},"how-much-does-it-cost-in-ram-and-cpu","How much does it cost in RAM and CPU?",[12,4627,4628],{},"The question that decides the discussion.",[119,4630,4631,4647],{},[122,4632,4633],{},[125,4634,4635,4638,4641,4644],{},[128,4636,4637],{},"Mesh",[128,4639,4640],{},"RAM per pod",[128,4642,4643],{},"CPU per pod",[128,4645,4646],{},"Additional latency",[141,4648,4649,4663,4677,4691],{},[125,4650,4651,4654,4657,4660],{},[146,4652,4653],{},"Istio (Envoy sidecar)",[146,4655,4656],{},"+80–120 MB",[146,4658,4659],{},"+10–15%",[146,4661,4662],{},"5–10 ms",[125,4664,4665,4668,4671,4674],{},[146,4666,4667],{},"Linkerd (linkerd2-proxy Rust)",[146,4669,4670],{},"+20–40 MB",[146,4672,4673],{},"+3–6%",[146,4675,4676],{},"1–3 ms",[125,4678,4679,4682,4685,4688],{},[146,4680,4681],{},"Cilium Service Mesh (eBPF)",[146,4683,4684],{},"~0 MB per pod",[146,4686,4687],{},"~2% on the node",[146,4689,4690],{},"\u003C1 ms",[125,4692,4693,4696,4699,4702],{},[146,4694,4695],{},"Consul Connect (Envoy sidecar)",[146,4697,4698],{},"+70–110 MB",[146,4700,4701],{},"+8–12%",[146,4703,4704],{},"4–8 ms",[12,4706,4707],{},"In cluster with one hundred active pods:",[2734,4709,4710,4716,4722,4728],{},[70,4711,4712,4715],{},[27,4713,4714],{},"Istio"," consumes about 10 GB of RAM in parallel proxies alone, before any application.",[70,4717,4718,4721],{},[27,4719,4720],{},"Linkerd"," consumes about 3 GB.",[70,4723,4724,4727],{},[27,4725,4726],{},"Cilium"," consumes almost nothing per pod, but requires an agent per node (about 200–400 MB each).",[70,4729,4730,4733],{},[27,4731,4732],{},"Consul Connect"," stays close to Istio.",[12,4735,4736],{},"For 
typical Brazilian startup cluster — four servers with 4 GB of RAM each, totaling 16 GB — Istio alone occupies a third of cluster memory before any line of code runs. Linkerd occupies a fifth. Cilium occupies almost nothing per pod, but requires CNI planning.",[19,4738,4740],{"id":4739},"does-my-startup-need-this","Does my startup need this?",[12,4742,4743],{},"Direct answer: probably not. The honest criteria for \"needs\":",[2734,4745,4746,4749,4752,4755,4758],{},[70,4747,4748],{},"Thirty or more active microservices in production.",[70,4750,4751],{},"Inter-service traffic is more than 50% of the cluster's total HTTP volume.",[70,4753,4754],{},"More than one incident per month related to \"which service fell, delayed, or is busting timeout\".",[70,4756,4757],{},"Formal compliance demands zero-trust network between pods (PCI-DSS level 1, certain contracts with Banco Central, health frameworks).",[70,4759,4760],{},"Team has at least one person dedicated to platform, with time to study and operate the mesh.",[12,4762,4763],{},"If you don't hit at least three of those five criteria, mesh is overkill. The added complexity doesn't return in value — it returns in on-call calls trying to understand why the sidecar is recycling.",[12,4765,4766,4767,4770],{},"Most important and least discussed criterion: ",[27,4768,4769],{},"how much of the traffic is internal?",". Application that receives request at the edge, makes a single database query and responds, spends 95% of the time between external client and database — not between services. Application that receives request at the edge, calls ten internal services to assemble the response, spends most of the time on internal traffic. For the first, mesh adds nothing perceptible. For the second, mesh can cut hours of debugging per month.",[19,4772,4774],{"id":4773},"the-cluster-native-substitute","The cluster-native substitute",[12,4776,4777],{},"Here lives the part the American discourse underestimates. 
In 2026, several modern orchestrators — including HeroCtl and some distributions of the orthodox colossus — come with inter-service encryption built into the control plane. No sidecar, no parallel proxy, no installing additional product.",[12,4779,4780],{},"What this covers:",[2734,4782,4783,4789,4795],{},[70,4784,4785,4788],{},[27,4786,4787],{},"Encryption between services."," Each service receives certificate automatically issued by the cluster. Internal call is encrypted by default.",[70,4790,4791,4794],{},[27,4792,4793],{},"Service identity."," Each service authenticates by certificate, not by IP or DNS.",[70,4796,4797,4800],{},[27,4798,4799],{},"Basic authorization."," Lists of who can call whom, declarative in the service config file.",[12,4802,4803],{},"What this does NOT cover:",[2734,4805,4806,4809,4812,4815],{},[70,4807,4808],{},"Fine traffic shaping (canary with 5% of traffic, mirror).",[70,4810,4811],{},"Completely automatic distributed tracing.",[70,4813,4814],{},"Configurable circuit breakers per call.",[70,4816,4817],{},"Sophisticated retry policies.",[12,4819,4820],{},"For medium startup that was thinking of installing mesh just to have \"encryption between services\", cluster-native is enough. Covers the most common audit topic without costing 10 GB of RAM.",[19,4822,4824],{"id":4823},"side-by-side-no-frills","Side by side, no frills",[12,4826,4827],{},"The table compares Istio, Linkerd, Cilium, and the option of not installing mesh (with cluster-native encryption active) on twelve criteria. 
There's no column without caveat.",[119,4829,4830,4846],{},[122,4831,4832],{},[125,4833,4834,4836,4838,4840,4843],{},[128,4835,2982],{},[128,4837,4714],{},[128,4839,4720],{},[128,4841,4842],{},"Cilium SM",[128,4844,4845],{},"No mesh + cluster-native",[141,4847,4848,4862,4875,4890,4907,4923,4939,4952,4966,4980,4993,5009],{},[125,4849,4850,4853,4855,4857,4860],{},[146,4851,4852],{},"RAM overhead per pod",[146,4854,4656],{},[146,4856,4670],{},[146,4858,4859],{},"~0",[146,4861,4859],{},[125,4863,4864,4867,4869,4871,4873],{},[146,4865,4866],{},"CPU overhead per pod",[146,4868,4659],{},[146,4870,4673],{},[146,4872,4687],{},[146,4874,4859],{},[125,4876,4877,4880,4882,4884,4887],{},[146,4878,4879],{},"Setup complexity",[146,4881,3166],{},[146,4883,3154],{},[146,4885,4886],{},"Medium (kernel)",[146,4888,4889],{},"Minimal",[125,4891,4892,4895,4898,4901,4904],{},[146,4893,4894],{},"Documentation in PT-BR",[146,4896,4897],{},"Good",[146,4899,4900],{},"Reasonable",[146,4902,4903],{},"Little",[146,4905,4906],{},"Embedded in orchestrator",[125,4908,4909,4912,4915,4917,4920],{},[146,4910,4911],{},"Brazilian community",[146,4913,4914],{},"Large",[146,4916,3159],{},[146,4918,4919],{},"Small",[146,4921,4922],{},"Grows with the orchestrator",[125,4924,4925,4928,4931,4934,4937],{},[146,4926,4927],{},"Parallel sidecar",[146,4929,4930],{},"Yes (Envoy)",[146,4932,4933],{},"Yes (Rust)",[146,4935,4936],{},"No (eBPF)",[146,4938,3058],{},[125,4940,4941,4944,4946,4948,4950],{},[146,4942,4943],{},"Automatic encryption between services",[146,4945,3064],{},[146,4947,3064],{},[146,4949,3064],{},[146,4951,3064],{},[125,4953,4954,4957,4959,4961,4963],{},[146,4955,4956],{},"Automatic distributed tracing",[146,4958,3064],{},[146,4960,3064],{},[146,4962,3139],{},[146,4964,4965],{},"No (needs OpenTelemetry)",[125,4967,4968,4971,4973,4975,4977],{},[146,4969,4970],{},"Fine traffic shaping (canary 5%)",[146,4972,3064],{},[146,4974,3064],{},[146,4976,3139],{},[146,4978,4979],{},"Basic (rolling, 
blue-green)",[125,4981,4982,4985,4987,4989,4991],{},[146,4983,4984],{},"Configurable circuit breakers",[146,4986,3064],{},[146,4988,3064],{},[146,4990,3061],{},[146,4992,3058],{},[125,4994,4995,4997,5000,5003,5006],{},[146,4996,3151],{},[146,4998,4999],{},"6–10 weeks",[146,5001,5002],{},"2–4 weeks",[146,5004,5005],{},"4–6 weeks",[146,5007,5008],{},"Days",[125,5010,5011,5014,5017,5020,5023],{},[146,5012,5013],{},"Ideal application range",[146,5015,5016],{},"50+ services",[146,5018,5019],{},"10–50 services",[146,5021,5022],{},"30+ services with new kernel",[146,5024,5025],{},"1–30 services",[12,5027,5028,5029,5032],{},"The column that matters is the last line — ",[27,5030,5031],{},"ideal application range",". Whoever is below the band, pays overhead without return. Whoever is above, feels lacking feature.",[19,5034,5036],{"id":5035},"when-service-mesh-pays-the-price","When service mesh pays the price",[12,5038,5039],{},"Four scenarios where the investment is justified:",[2734,5041,5042,5048,5054,5060],{},[70,5043,5044,5047],{},[27,5045,5046],{},"Thirty or more active microservices."," Operational complexity without mesh becomes worse than with mesh — diagnosing a chain of six internal calls across three different teams is expensive without automatic tracing.",[70,5049,5050,5053],{},[27,5051,5052],{},"Enterprise compliance with zero-trust requirements."," Some audit frameworks ask the stack to have \"zero-trust network nominally\". Mesh formally solves the checkbox.",[70,5055,5056,5059],{},[27,5057,5058],{},"Multi-cluster federation."," Service routing between two or three clusters in different regions, with automatic failover. Mesh facilitates this scenario; cluster-native solves it poorly.",[70,5061,5062,5065],{},[27,5063,5064],{},"Platform team of five or more dedicated people."," You have capacity to extract value from the mesh — operate, evolve, scale its control plane. 
Without that team, mesh becomes a liability.",[12,5067,5068],{},"If you hit two or more of those, start evaluating. Start with Linkerd — it's the option with the least pain relative to the features you give up.",[19,5070,5072],{"id":5071},"when-not-to-install-most-cases","When NOT to install (most cases)",[12,5074,5075],{},"Five scenarios where installing mesh today costs more than it returns:",[2734,5077,5078,5084,5090,5096,5102],{},[70,5079,5080,5083],{},[27,5081,5082],{},"Monolith with five to ten auxiliary microservices."," Zero gain, large cost. The RAM overhead falls directly on the server bill.",[70,5085,5086,5089],{},[27,5087,5088],{},"Small team, fewer than three people on platform."," Operating a mesh requires dedicated on-call. A small team absorbs that cost at the expense of product features.",[70,5091,5092,5095],{},[27,5093,5094],{},"Cluster with fewer than thirty total pods."," Managing thirty pods is human-scale work and doesn't require automatic tracing; the cost of learning a mesh doesn't pay back.",[70,5097,5098,5101],{},[27,5099,5100],{},"Simple HTTP workload without canary requirements."," If you never needed to release 5% of traffic to a new version because rolling update always served, mesh is a solution to a problem that doesn't exist.",[70,5103,5104,5107],{},[27,5105,5106],{},"Cluster cost under pressure."," If every gigabyte of RAM is being counted, spending 10 GB on sidecars is a decision that's hard to defend to an investor.",[19,5109,5111],{"id":5110},"evolutionary-decision-by-stage","Evolutionary decision, by stage",[12,5113,5114],{},"The right decision changes with the size of the system. Four stages:",[2734,5116,5117,5123,5129,5135],{},[70,5118,5119,5122],{},[27,5120,5121],{},"Stage 1 — 1 to 10 services."," No mesh. If you need encryption between services, do TLS in the code (most languages have a ready HTTPS client). Not worth the learning curve. 
Focus on delivering product.",[70,5124,5125,5128],{},[27,5126,5127],{},"Stage 2 — 10 to 30 services."," Cluster with encryption built into the control plane (HeroCtl, some colossus presets). Solves encryption + identity + service discovery without sidecar. Covers most of what mesh offers, without the cost.",[70,5130,5131,5134],{},[27,5132,5133],{},"Stage 3 — 30 to 50 services with platform team."," Evaluate Linkerd first. Short curve, low overhead, solves tracing and circuit breakers. Istio only if advanced features (sophisticated L7 authorization, real multi-cluster federation) are immediate requirement.",[70,5136,5137,5140],{},[27,5138,5139],{},"Stage 4 — 50+ services, enterprise compliance."," Istio or Cilium Service Mesh. Compliance will ask for one of the two; the rest are details.",[12,5142,5143],{},"Going from one stage to the next is a deliberate decision, not gradual. Add the component when the team takes on the learning and the cluster takes on the overhead. Not before.",[19,5145,5147],{"id":5146},"the-lets-install-now-to-be-prepared-trap","The \"let's install now to be prepared\" trap",[12,5149,5150],{},"Argument that appears in every discussion: \"if I'm going to grow to fifty services next year, better install now and learn\". The trap has three faces:",[2734,5152,5153,5159,5165],{},[70,5154,5155,5158],{},[27,5156,5157],{},"Learning mesh costs four to eight weeks per person on the team."," On team of five, that's twenty to forty person-weeks. Multiplied by R$200\u002Fhour, it's between R$160k and R$320k just in learning. That money buys feature or buys runway period.",[70,5160,5161,5164],{},[27,5162,5163],{},"Each new component is one more critical failure point."," Mesh control plane (Istio Pilot, Linkerd controller, Cilium operator) can fail and take internal connectivity with it. More components in quorum, more incident surface. 
Add only when the gain compensates that risk.",[70,5166,5167,5170],{},[27,5168,5169],{},"When you need it, installing takes a week, not a month."," Linkerd in particular is installable in a couple of commands. Cilium in a few hours if the cluster takes recent kernel. Postponing the decision isn't technical debt — it's debt postponed at lower interest.",[12,5172,5173],{},"\"Anticipate to be prepared\" doesn't work. What works is monitoring the objective criteria of the previous section and installing when two or more become reality.",[19,5175,5177],{"id":5176},"how-heroctl-approaches-the-problem","How HeroCtl approaches the problem",[12,5179,5180],{},"Our position is deliberate: service mesh, in most Brazilian cases, is decision for stage three or four. To cover stages one and two, HeroCtl brings built into the control plane:",[2734,5182,5183,5189,5195],{},[70,5184,5185,5188],{},[27,5186,5187],{},"Automatic encryption between services."," Each submitted service receives its own identity. Internal call between two services is encrypted by default, with no change in application code and no parallel sidecar.",[70,5190,5191,5194],{},[27,5192,5193],{},"Distributed tracing via integrated OpenTelemetry exporter."," The cluster propagates correlation headers and exports to any collector that understands OTLP. Not as rich as full mesh (which automatically injects tracing into the sidecars), but covers 80% of real use.",[70,5196,5197,5200],{},[27,5198,5199],{},"Basic embedded traffic shaping."," Rolling update, canary with fixed percentage of traffic, blue-green. Sufficient for startup that does ten deploys a day. Doesn't cover mirror or canary with weight per header — for that, need to install mesh.",[12,5202,5203],{},"For Brazilian startup up to the thirty-services range, this covers about 80% of what a complete mesh delivers — without the sidecar, without the four weeks of learning, without the 10 GB of RAM. 
When the system grows beyond that, installing Linkerd on top of HeroCtl is a documented path.",[19,5205,5207],{"id":5206},"the-four-most-expensive-mistakes-installing-service-mesh","The four most expensive mistakes installing service mesh",[12,5209,5210],{},"For a team that has decided to take the step, four traps that cost from two weeks to three months of rework:",[2734,5212,5213,5219,5235,5241],{},[70,5214,5215,5218],{},[27,5216,5217],{},"Installing before needing."," Unnecessary coverage becomes a liability. A new component in the quorum, RAM cost, learning time — without an equivalent return.",[70,5220,5221,2577,5224,5227,5228,5231,5232,5234],{},[27,5222,5223],{},"Configuring strict encryption on day one without thinking about legacy.",[231,5225,5226],{},"STRICT"," mode breaks any service that hasn't yet been migrated. The correct migration is gradual: ",[231,5229,5230],{},"PERMISSIVE"," mode at the start (accepts encrypted and non-encrypted traffic), switching to ",[231,5233,5226],{}," only when all services are inside the mesh.",[70,5236,5237,5240],{},[27,5238,5239],{},"Not sizing the control plane."," Istio Pilot and similar components need enough RAM and CPU to distribute configuration to all sidecars. In a growing cluster, a control-plane bottleneck is a classic incident for those who didn't plan.",[70,5242,5243,5246],{},[27,5244,5245],{},"Skipping Linkerd for Istio \"because it's more popular\"."," Linkerd solves 80% of cases with 30% of the overhead. Choosing Istio is only justified when a specific feature (sophisticated L7 authorization, integration with an external identity service, multi-cluster federation) is a real requirement, not a résumé preference.",[19,5248,5250],{"id":5249},"frequently-asked-questions","Frequently asked questions",[12,5252,5253,5256],{},[27,5254,5255],{},"Is Linkerd light enough for a small cluster?","\nLighter than Istio by an order of magnitude, but still a parallel sidecar on each pod. 
For a cluster with twenty pods across four 4 GB nodes, Linkerd eats about 600 MB of total RAM — significant but tolerable. For a cluster with ten pods, it's still excessive. Linkerd enters the scene at stage three (30–50 services), not before.",[12,5258,5259,5262,5263,5265],{},[27,5260,5261],{},"Does Istio Ambient Mode (no sidecar) change this decision?","\nIt reduces per-pod overhead (down to one agent per node, ",[231,5264,4595],{},"), but still requires operating the entire Istio control plane. In stable production since 2024, but the Brazilian community is still small — waiting a few more quarters before adopting it in a critical project is prudent.",[12,5267,5268,5271],{},[27,5269,5270],{},"Does Cilium eBPF really have zero overhead?","\nPer pod, yes — there is no parallel sidecar. But the Cilium agent on each node consumes 200 to 400 MB and adds load on the kernel. For a cluster with a modern Linux kernel and a compatible CNI, it's the most efficient option. For a cluster still running an old kernel or a specific CNI, the setup becomes a project of its own.",[12,5273,5274,5277],{},[27,5275,5276],{},"How do I do encryption between services without a service mesh?","\nThree paths. First, TLS in application code — each service exposes HTTPS, each client trusts an internal CA. Works, but requires distributing certificates manually (or via a secrets vault). Second, the orchestrator control plane issuing certificates automatically — HeroCtl and some colossus distributions do this, and it's the cleanest path. Third, a VPN or encrypted overlay network (WireGuard) between nodes — protects traffic inside the cluster, but not service-to-service identity.",[12,5279,5280,5283],{},[27,5281,5282],{},"Does distributed tracing need mesh?","\nNo. An OpenTelemetry SDK in each service, exporting to a central collector (Tempo, Jaeger, or a managed service), covers 90% of use. Mesh automates injection without changing code, which is comfortable — but it's not a requirement. 
For startup, starting with OpenTelemetry in code is cheaper.",[12,5285,5286,5289],{},[27,5287,5288],{},"Service mesh in managed cluster is easier?","\nEasier to install, yes — most providers offer Istio or Linkerd add-on with one click. Easier to operate, no — you still need to understand the control plane, size, debug when a sidecar recycles. Don't gain install time at the expense of operational unpreparedness.",[12,5291,5292,5295],{},[27,5293,5294],{},"Which mesh is most used in Brazilian startup?","\nBy community experience, Istio dominates in companies that adopted between 2020 and 2022 (CNCF fashion effect). Linkerd grows since 2024 among those who migrated or started new, especially mid-size fintechs. Cilium appears in specific cases (very large clusters, cost optimization). Consul Connect very rare in Brazil.",[12,5297,5298,5301],{},[27,5299,5300],{},"Worth it for monolith + 3 microservices?","\nNo. Monolith + three microservices doesn't have internal complexity that mesh helps tame. TLS in code solves encryption. Centralized logs solve visibility. Orchestrator's rolling update solves safe deploy. Installing mesh in that scenario is bringing a problem to solve another problem that doesn't exist.",[12,5303,5304,5307],{},[27,5305,5306],{},"Does HeroCtl completely replace a service mesh?","\nFor stages one and two (up to thirty services), it replaces in about 80% of real use. For stages three and four (above thirty services, or specific compliance), HeroCtl coexists with Linkerd or Istio running as jobs on top. 
HeroCtl's control plane inter-service encryption coexists with the mesh — the mesh takes care of traffic between your pods, HeroCtl takes care of service identity and communication with the control plane.",[19,5309,3309],{"id":3308},[12,5311,5312],{},"The practical rule we recommend for Brazilian tech lead: install mesh when two or more of the objective criteria become reality — thirty active services, more than one incident per month related to internal calls, formal compliance asking for zero-trust, platform team of five people, real multi-cluster federation. Before that, cluster with encryption built into the control plane solves most of what you'd buy with mesh, without the 10 GB of RAM and without the eight weeks of learning.",[12,5314,5315],{},"To start exploring this path — orchestrator with inter-service encryption already included, no parallel sidecar, control plane occupying between 200 and 400 MB per server and coordinator election in about seven seconds when something falls — install on any Linux server and open the panel:",[224,5317,5319],{"className":226,"code":5318,"language":228,"meta":229,"style":229},"curl -sSL get.heroctl.com\u002Finstall.sh | sh\n",[231,5320,5321],{"__ignoreMap":229},[234,5322,5323,5325,5327,5330,5332],{"class":236,"line":237},[234,5324,1220],{"class":247},[234,5326,2957],{"class":251},[234,5328,5329],{"class":255}," get.heroctl.com\u002Finstall.sh",[234,5331,2963],{"class":383},[234,5333,2966],{"class":247},[12,5335,5336,5337,5340,5341,5345],{},"To continue on this line, two direct posts. In ",[3336,5338,5339],{"href":4368},"Multi-tenant SaaS — real isolation or just namespace?"," we deal with the neighbor problem — separating customers within the same cluster without breaking the budget. 
In ",[3336,5342,5344],{"href":5343},"\u002Fen\u002Fblog\u002Fk3s-vs-heroctl-when-each-fits","K3s vs HeroCtl — when each makes sense"," we compare the most common alternative when the team has already decided that the orthodox colossus is excessive.",[12,5347,5348],{},"The choice for service mesh is, deep down, a choice of when to absorb complexity. The right question isn't \"do I need Istio?\" — it's \"what's the smallest system that still solves my current problem?\". For a large part of Brazilian startups, the answer is simpler than the American industry suggests.",[3350,5350,4376],{},{"title":229,"searchDepth":244,"depth":244,"links":5352},[5353,5354,5355,5356,5357,5358,5359,5360,5361,5362,5363,5364,5365,5366,5367,5368],{"id":4421,"depth":244,"text":4422},{"id":4431,"depth":244,"text":4432},{"id":4531,"depth":244,"text":4532},{"id":4580,"depth":244,"text":4581},{"id":4624,"depth":244,"text":4625},{"id":4739,"depth":244,"text":4740},{"id":4773,"depth":244,"text":4774},{"id":4823,"depth":244,"text":4824},{"id":5035,"depth":244,"text":5036},{"id":5071,"depth":244,"text":5072},{"id":5110,"depth":244,"text":5111},{"id":5146,"depth":244,"text":5147},{"id":5176,"depth":244,"text":5177},{"id":5206,"depth":244,"text":5207},{"id":5249,"depth":244,"text":5250},{"id":3308,"depth":244,"text":3309},"2026-05-29","Service mesh solves real problems (mTLS, inter-service observability, traffic shaping). But adds 30-50% RAM\u002FCPU overhead and complexity. 
When it's worth it and when it's overkill.",{},{"title":4413,"description":5370},{"loc":4275},"en\u002Fblog\u002Fservice-mesh-when-its-worth-for-small-saas",[5376,5377,5378,3378,4409],"service-mesh","istio","linkerd","VX4xpWtHCom09sHEcs0-6Nv7h5dXcx4BDqp11jXSWGQ",{"id":5381,"title":5382,"author":7,"body":5383,"category":6382,"cover":3379,"date":6383,"description":6384,"draft":3382,"extension":3383,"lastReviewed":3379,"meta":6385,"navigation":411,"path":6386,"readingTime":6387,"seo":6388,"sitemap":6389,"stem":6390,"tags":6391,"__hash__":6396},"blog_en\u002Fen\u002Fblog\u002Fleaving-aws-without-rewriting-the-stack.md","How to leave AWS without rewriting the whole stack: practical 2026 guide",{"type":9,"value":5384,"toc":6344},[5385,5388,5392,5395,5398,5404,5407,5411,5414,5417,5420,5424,5427,5501,5504,5508,5511,5714,5717,5721,5724,5727,5731,5738,5741,5744,5750,5754,5757,5760,5764,5767,5774,5777,5781,5784,5787,5791,5794,5797,5801,5804,5808,5811,5814,5818,5821,5824,5827,5831,5834,5844,5847,5851,5854,5857,5861,5864,5870,5876,5879,5883,5886,5892,5898,5908,5914,5920,5926,5932,5938,5942,5945,5977,5981,5984,5989,6026,6031,6062,6065,6068,6072,6075,6078,6095,6098,6101,6105,6108,6114,6120,6126,6132,6136,6139,6153,6156,6160,6163,6166,6191,6194,6206,6209,6211,6215,6219,6222,6226,6229,6233,6236,6240,6246,6250,6269,6273,6276,6280,6283,6287,6290,6294,6297,6299,6303,6306,6309,6325,6328,6339,6342],[12,5386,5387],{},"Most Brazilian teams thinking about leaving AWS postpone indefinitely because they believe they are facing a project of \"rewriting the entire stack\". They aren't. It is a mapping project, not a rewrite. And the mapping fits in a twelve-row spreadsheet.",[19,5389,5391],{"id":5390},"tldr-what-youll-read-in-three-minutes","TL;DR — what you'll read in three minutes",[12,5393,5394],{},"A typical Brazilian SaaS stack uses about twelve AWS services, and each of them has a portable alternative that costs between three and seven times less. 
EC2 becomes VPS at any provider (Hetzner, DigitalOcean, Magalu Cloud). RDS becomes Postgres on a dedicated VPS, Neon or Supabase. ElastiCache becomes self-hosted Valkey. S3 becomes Cloudflare R2 or Backblaze B2 — both with S3-compatible API, so the code doesn't even change. SQS becomes a Redis-based queue or RabbitMQ. Lambda becomes an endpoint on the traditional app server or Cloudflare Workers. ALB becomes the orchestrator's integrated router. CloudFront becomes free Cloudflare. IAM becomes secret injection in the orchestrator.",[12,5396,5397],{},"Realistic schedule for a startup with five to ten applications: six to eight weeks, eighty to one hundred sixty hours of development. Typical savings: three to seven times on the infra bill, with payback in less than a month of senior salary.",[12,5399,5400,5403],{},[27,5401,5402],{},"Don't migrate if"," your compliance requires AWS by name, if the team is single and focused on product, or if the stack uses deep lock-in (DynamoDB with specific features, Aurora Serverless v2, complex cross-account IAM).",[12,5405,5406],{},"━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━",[19,5408,5410],{"id":5409},"why-do-so-many-brazilian-teams-postpone-leaving-aws","Why do so many Brazilian teams postpone leaving AWS?",[12,5412,5413],{},"The honest answer is confusion between two different projects. \"Leaving AWS\" became a mental synonym for \"rewriting the application\". It isn't the same thing.",[12,5415,5416],{},"Rewriting the application is changing core technology — relational database for NoSQL, synchronous framework for reactive, monolith for microservices. That does take quarters. Leaving AWS is changing the infra that sustains the application you already have. The domain code stays identical. What changes are database endpoints, credentials, some SDKs and the way to declare deploys.",[12,5418,5419],{},"The confusion lasts because the team looks at the AWS console and sees two hundred services. 
Nobody uses two hundred. The vast majority uses twelve. Map those twelve, find an alternative for each one, and what's left is execution work — not research.",[19,5421,5423],{"id":5422},"the-twelve-aws-services-your-stack-probably-uses","The twelve AWS services your stack probably uses",[12,5425,5426],{},"The starting spreadsheet is this. Anything outside it in your account is probably satellite — a CloudWatch alarm nobody looks at, a forgotten S3 bucket, a dead Lambda function. Focus on the twelve:",[67,5428,5429,5435,5441,5447,5453,5459,5465,5471,5477,5483,5489,5495],{},[70,5430,5431,5434],{},[27,5432,5433],{},"EC2"," — virtual machines running app server and workers",[70,5436,5437,5440],{},[27,5438,5439],{},"RDS"," — managed relational database (Postgres or MySQL)",[70,5442,5443,5446],{},[27,5444,5445],{},"ElastiCache"," — Redis for cache and session",[70,5448,5449,5452],{},[27,5450,5451],{},"S3"," — object storage (uploads, backups, assets)",[70,5454,5455,5458],{},[27,5456,5457],{},"ALB \u002F NLB"," — load balancer in front of the EC2s",[70,5460,5461,5464],{},[27,5462,5463],{},"CloudFront"," — CDN for static assets",[70,5466,5467,5470],{},[27,5468,5469],{},"Route 53"," — authoritative DNS",[70,5472,5473,5476],{},[27,5474,5475],{},"SES"," — transactional email",[70,5478,5479,5482],{},[27,5480,5481],{},"SQS \u002F SNS"," — queue and pub-sub",[70,5484,5485,5488],{},[27,5486,5487],{},"IAM"," — credentials and roles for services to talk to each other",[70,5490,5491,5494],{},[27,5492,5493],{},"CloudWatch"," — metrics and logs",[70,5496,5497,5500],{},[27,5498,5499],{},"Lambda"," — serverless functions",[12,5502,5503],{},"If your account has all twelve, congratulations: you are the median stack. If you have eight or nine, better — less to migrate. 
If you have five very specific services (Aurora Global, DynamoDB with Streams, complex EventBridge), you are on a different path — read the lock-ins section before continuing.",[19,5505,5507],{"id":5506},"service-by-service-mapping-alternative-cost-and-complexity","Service-by-service mapping — alternative, cost and complexity",[12,5509,5510],{},"The table below is the shortcut. Each row has expanded detail afterwards.",[119,5512,5513,5532],{},[122,5514,5515],{},[125,5516,5517,5520,5523,5526,5529],{},[128,5518,5519],{},"AWS service",[128,5521,5522],{},"Portable alternative",[128,5524,5525],{},"Cost before (R$\u002Fmonth)",[128,5527,5528],{},"Cost after (R$\u002Fmonth)",[128,5530,5531],{},"Migration complexity",[141,5533,5534,5550,5566,5582,5597,5612,5626,5641,5656,5672,5685,5699],{},[125,5535,5536,5539,5542,5545,5548],{},[146,5537,5538],{},"EC2 t3.medium",[146,5540,5541],{},"Hetzner CPX21 VPS",[146,5543,5544],{},"150",[146,5546,5547],{},"44",[146,5549,3154],{},[125,5551,5552,5555,5558,5561,5564],{},[146,5553,5554],{},"RDS db.t4g.large",[146,5556,5557],{},"Self-hosted Postgres or Neon",[146,5559,5560],{},"700",[146,5562,5563],{},"50–250",[146,5565,3159],{},[125,5567,5568,5571,5574,5577,5580],{},[146,5569,5570],{},"ElastiCache cache.t4g.micro",[146,5572,5573],{},"Self-hosted Valkey",[146,5575,5576],{},"75",[146,5578,5579],{},"30",[146,5581,3159],{},[125,5583,5584,5587,5590,5593,5595],{},[146,5585,5586],{},"S3 (1TB + egress)",[146,5588,5589],{},"Cloudflare R2",[146,5591,5592],{},"600",[146,5594,5576],{},[146,5596,3154],{},[125,5598,5599,5602,5605,5608,5610],{},[146,5600,5601],{},"ALB",[146,5603,5604],{},"Orchestrator integrated router",[146,5606,5607],{},"110",[146,5609,893],{},[146,5611,3159],{},[125,5613,5614,5616,5619,5622,5624],{},[146,5615,5463],{},[146,5617,5618],{},"Free Cloudflare",[146,5620,5621],{},"400",[146,5623,893],{},[146,5625,3154],{},[125,5627,5628,5630,5633,5636,5638],{},[146,5629,5469],{},[146,5631,5632],{},"Cloudflare 
DNS",[146,5634,5635],{},"25",[146,5637,893],{},[146,5639,5640],{},"Trivial",[125,5642,5643,5645,5648,5651,5654],{},[146,5644,5475],{},[146,5646,5647],{},"Resend or Postmark",[146,5649,5650],{},"50",[146,5652,5653],{},"75–100",[146,5655,5640],{},[125,5657,5658,5660,5663,5666,5669],{},[146,5659,5481],{},[146,5661,5662],{},"Redis Streams or RabbitMQ",[146,5664,5665],{},"80",[146,5667,5668],{},"0 (same VPS)",[146,5670,5671],{},"Medium–high",[125,5673,5674,5676,5679,5681,5683],{},[146,5675,5487],{},[146,5677,5678],{},"Orchestrator secrets",[146,5680,893],{},[146,5682,893],{},[146,5684,3159],{},[125,5686,5687,5689,5692,5695,5697],{},[146,5688,5493],{},[146,5690,5691],{},"Prometheus + Loki",[146,5693,5694],{},"250",[146,5696,5668],{},[146,5698,3159],{},[125,5700,5701,5703,5706,5708,5711],{},[146,5702,5499],{},[146,5704,5705],{},"App server or Cloudflare Workers",[146,5707,669],{},[146,5709,5710],{},"0–60",[146,5712,5713],{},"Variable",[12,5715,5716],{},"FX considered: five reais per dollar. Before-costs assume small-medium SaaS stack with five to ten active applications.",[368,5718,5720],{"id":5719},"ec2-becomes-vps-at-any-provider","EC2 becomes VPS at any provider",[12,5722,5723],{},"The most obvious migration. EC2 t3.medium costs about thirty dollars monthly — one hundred fifty reais. Hetzner CPX21 with the same CPU class and more RAM costs seven euros and ninety-nine — forty-four reais. DigitalOcean sits in the middle. Magalu Cloud is competitive for those prioritizing invoice in real and data on national soil.",[12,5725,5726],{},"The technical path is provisioning the VPS, running your existing Ansible (or a simple bootstrap script), copying the EC2 snapshot or bringing up the image from scratch. For each server, count two to four hours. 
It is not the time-consuming part of the migration.",[368,5728,5730],{"id":5729},"rds-becomes-self-hosted-postgres-or-neonsupabase","RDS becomes self-hosted Postgres or Neon\u002FSupabase",[12,5732,5733,5734,5737],{},"There are three honest paths here. The first is Postgres running on a dedicated VPS, with automated backup via ",[231,5735,5736],{},"pg_dump"," in cron and physical replication to a secondary in another region. It costs the price of the VPS — fifty to one hundred reais monthly — to replace an RDS bill of seven hundred.",[12,5739,5740],{},"The second is Neon. Serverless Postgres with branching, automatic scaling, a generous free plan, and paid plans starting at five dollars. Useful for those wanting to leave AWS without taking on direct database operation.",[12,5742,5743],{},"The third is Supabase, which delivers Postgres with additional APIs (auth, realtime, storage) and a permanent free tier. It makes sense for startups that tolerate Supabase coupling in exchange for simplicity.",[12,5745,5746,5747,5749],{},"The migration itself is ",[231,5748,5736],{}," followed by a restore at the destination, with a short maintenance window, or logical replication with a near-zero-downtime cutover if your Postgres is version 13 or higher. Four to eight hours depending on database size.",[368,5751,5753],{"id":5752},"elasticache-becomes-self-hosted-valkey","ElastiCache becomes self-hosted Valkey",[12,5755,5756],{},"Redis became Valkey after the license change in 2024 — a fork maintained by the Linux Foundation. It runs on any VPS in two clicks. Thirty reais monthly replace an ElastiCache bill of seventy-five.",[12,5758,5759],{},"The migration has two stages. First, bring up the Valkey cluster with Sentinel for automatic failover. Second, populate the cache — a script that reads from AWS and writes to the destination, or simply let the application populate it organically after cutover (a cache cold start of a few minutes). 
Three to six hours of work.",[368,5761,5763],{"id":5762},"s3-becomes-cloudflare-r2-or-backblaze-b2","S3 becomes Cloudflare R2 (or Backblaze B2)",[12,5765,5766],{},"This is the most immediate gain. Cloudflare R2 charges zero for egress — the most expensive slice of S3 when you serve assets to users. Storage costs one and a half cents of a dollar per GB, against 2.3 cents on standard S3. Backblaze B2 is an almost identical alternative, even cheaper for heavy backup workloads.",[12,5768,5769,5770,5773],{},"The technical migration is trivial: ",[231,5771,5772],{},"rclone copy s3:my-bucket r2:my-bucket"," in parallel. One terabyte transfers in around twelve hours depending on bandwidth. The application code changes exactly one line — the S3 client endpoint. Every AWS SDK library accepts a custom endpoint configuration; R2 and B2 both expose an S3-compatible API.",[12,5775,5776],{},"Typical volume for a medium SaaS (fifty GB of user uploads): R$75 monthly on R2 against R$600 on S3 with active egress. The savings pay for a week of migration work in the first month.",[368,5778,5780],{"id":5779},"alb-becomes-orchestrator-integrated-router","ALB becomes orchestrator integrated router",[12,5782,5783],{},"If you are using ALB, it is because you have multiple EC2s behind it. The alternative is the router embedded in the chosen orchestrator — HeroCtl, Caddy, or the equivalent in other self-hosted stacks. The orchestrator discovers running containers, opens ports, terminates TLS via automatic Let's Encrypt, and distributes traffic.",[12,5785,5786],{},"The migration swaps the AWS target group definition for an ingress definition in the orchestrator manifest. Four to eight hours to get the rules right. 
One hundred ten reais saved monthly per balancer, and the orchestrator accepts however many hosts you want without additional charge.",[368,5788,5790],{"id":5789},"cloudfront-becomes-free-cloudflare","CloudFront becomes free Cloudflare",[12,5792,5793],{},"This one deserves special mention. CloudFront charges per GB transferred — those who serve video or heavy downloads bleed. Cloudflare offers a global CDN on its free plan, with configurable cache, basic DDoS mitigation and a rudimentary WAF. For most SaaS cases, it is more than enough.",[12,5795,5796],{},"The migration is changing the domain's nameservers to Cloudflare and configuring cache rules. Two to four hours. The savings can be massive — four hundred reais monthly for those with average traffic volume, thousands for those with high volume.",[368,5798,5800],{"id":5799},"route-53-becomes-cloudflare-dns","Route 53 becomes Cloudflare DNS",[12,5802,5803],{},"DNS at Cloudflare is free and faster than Route 53 in most public measurements. The migration is exporting the zone file, importing it in Cloudflare, validating records, and changing nameservers at the registrar. Thirty minutes. Twenty-five reais monthly that come back to the cash flow.",[368,5805,5807],{"id":5806},"ses-becomes-resend-postmark-or-mailgun","SES becomes Resend, Postmark or Mailgun",[12,5809,5810],{},"SES is cheap for volume sending, but its deliverability requires IP warming and reputation configuration that take time. Resend charges twenty dollars for fifty thousand monthly emails and has superior deliverability out of the box. Postmark charges fifteen for ten thousand. Mailgun covers high-volume non-transactional sending.",[12,5812,5813],{},"The migration is changing the SMTP credentials in the app — one hour of work.",[368,5815,5817],{"id":5816},"sqs-and-sns-become-redis-streams-or-rabbitmq","SQS and SNS become Redis Streams or RabbitMQ",[12,5819,5820],{},"The most delicate migration. 
SQS is a service that does one thing and does it well; replacing it requires choosing a queue technology and refactoring producers and consumers.",[12,5822,5823],{},"The shortest path is Redis Streams, especially if you are already running Valkey for cache. Libraries like Sidekiq (Ruby), BullMQ (Node), RQ (Python) and Asynq (Go) consume Redis natively. RabbitMQ is more robust for complex routing scenarios. NATS is a modern alternative for pub-sub.",[12,5825,5826],{},"For each queue, count one to three days depending on complexity. Simple background job queues are trivial. Queues with fan-out, dead letter queues and custom visibility timeouts require more care. Eighty reais monthly saved, and the queue runs on the same VPS as the cache — zero additional infra cost.",[368,5828,5830],{"id":5829},"iam-becomes-orchestrator-secrets","IAM becomes orchestrator secrets",[12,5832,5833],{},"Here is the non-obvious migration that catches many AWS-experienced teams off guard. On AWS, the application accesses S3 and RDS without explicit credentials in the code — the EC2 inherits an IAM role and the SDK fetches tokens automatically. Outside AWS, that disappears.",[12,5835,5836,5837,5839,5840,5843],{},"The solution is secret injection by the orchestrator. HeroCtl, k3s and similar accept secrets as first-class resources — you declare ",[231,5838,453],{}," or ",[231,5841,5842],{},"S3_ACCESS_KEY"," in the job manifest and the orchestrator injects it as an environment variable in the container. For more sophisticated scenarios, self-hosted HashiCorp Vault does automatic rotation.",[12,5845,5846],{},"The migration is refactoring each IAM role into a set of explicit credentials, created at the destination provider (Cloudflare API token, specific Postgres user, etc.), and declared as secrets. Four to eight hours for a medium stack.",[368,5848,5850],{"id":5849},"cloudwatch-becomes-prometheus-loki","CloudWatch becomes Prometheus + Loki",[12,5852,5853],{},"Metrics become Prometheus + Grafana. 
Logs become Loki + Grafana. Everything runs in containers in the same cluster. Two hundred fifty reais monthly of CloudWatch become zero additional cost.",[12,5855,5856],{},"Initial configuration takes about four hours to be productive: Prometheus with service discovery pointing to the orchestrator agents, Loki receiving via Promtail or directly from the container runtime, Grafana with basic dashboards. There are dedicated posts about this migration on the blog.",[368,5858,5860],{"id":5859},"lambda-the-hardest-part","Lambda — the hardest part",[12,5862,5863],{},"Lambda is the service whose migration complexity varies the most. It depends entirely on how you are using it.",[12,5865,5866,5869],{},[27,5867,5868],{},"Simple HTTP Lambda"," (API Gateway → Lambda → response) is trivial. It becomes an endpoint on your app server. The function code changes little — a framework handler in place of the Lambda handler. One to two hours per function.",[12,5871,5872,5875],{},[27,5873,5874],{},"Event-driven Lambda"," (S3 triggers Lambda, SQS triggers Lambda, EventBridge schedules Lambda) is the expensive part. For S3 events, R2 offers events via Cloudflare Workers — you rewrite the Lambda as a Worker and keep the pattern. For SQS, it becomes a consumer on the app server. For scheduled EventBridge, it becomes a cron job in the orchestrator.",[12,5877,5878],{},"Worst scenario: a complex Lambda with chained EventBridge, Step Functions and dead letter queues. Here it is a redesign. Reserve a week or two and design a simpler event model — usually the system gets better.",[19,5880,5882],{"id":5881},"realistic-six-to-eight-week-schedule","Realistic six-to-eight-week schedule",[12,5884,5885],{},"Order matters. Starting with the database is a temptation and a trap — the database is the last thing to migrate, not the first.",[12,5887,5888,5891],{},[27,5889,5890],{},"Week 1 — Inventory and decision."," List the twelve services, note current cost, identify integrations between them. Choose an alternative for each. 
One-page document with the mapping table. No code yet.",[12,5893,5894,5897],{},[27,5895,5896],{},"Week 2 — Provisioning the destination in parallel."," Bring up the VPS, install the orchestrator (HeroCtl or similar), configure test DNS pointing to a subdomain. Bring up Postgres, Valkey, Cloudflare R2. Everything empty. Smoke test: a \"hello world\" running.",[12,5899,5900,5903,5904,5907],{},[27,5901,5902],{},"Week 3 — Storage migration."," S3 to R2 with ",[231,5905,5906],{},"rclone",". Usually slow (due to volume) but very low risk. The application still reads from S3, but you validate that R2 is synchronized. By the end of the week, enable dual-write — the application writes to both.",[12,5909,5910,5913],{},[27,5911,5912],{},"Week 4 — Database migration."," Set up a logical Postgres replica from RDS to the destination. Cutover in a short maintenance window — usually minutes, not hours, with logical replication working. The application points to the new database. RDS stays as a hot standby for a week.",[12,5915,5916,5919],{},[27,5917,5918],{},"Week 5 — Web application migration."," Apps running on EC2 become jobs in the orchestrator. The integrated router plays the ALB role. DNS points to the orchestrator (or to Cloudflare in front of it). Gradual cutover using weighted DNS.",[12,5921,5922,5925],{},[27,5923,5924],{},"Week 6 — Queues and async jobs."," SQS leaves, Redis Streams or RabbitMQ enters. Workers run in the orchestrator. Run a period of dual-consume to ensure no message is dropped.",[12,5927,5928,5931],{},[27,5929,5930],{},"Week 7 — Lambdas and event-driven workloads."," The most variable week. HTTP Lambdas migrate quickly. Event-driven Lambdas require the redesign discussed above. If you have more than ten complex Lambdas, consider extending to two weeks.",[12,5933,5934,5937],{},[27,5935,5936],{},"Week 8 — Final cutover, intensive monitoring, decommission."," Cloudflare in front replaces CloudFront. Route 53 becomes Cloudflare DNS. CloudWatch goes to Prometheus + Loki. 
Last thing: turn off the old EC2s and close the AWS account — or leave a minimum balance if you still keep some residual service.",[19,5939,5941],{"id":5940},"the-five-lock-ins-that-hurt-most-in-the-migration","The five lock-ins that hurt most in the migration",[12,5943,5944],{},"Honesty matters: not everything migrates easily. Five things require extra work and sometimes change project viability:",[67,5946,5947,5953,5959,5965,5971],{},[70,5948,5949,5952],{},[27,5950,5951],{},"DynamoDB with specific features."," GSI, Streams, scan limits, TTL. There is no direct equivalent. The realistic path is redesign to Postgres with JSONB, or to a self-hosted NoSQL (FoundationDB, ScyllaDB) — re-architecture, not migration.",[70,5954,5955,5958],{},[27,5956,5957],{},"Aurora-only features."," Aurora Serverless v2 with auto-scaling of connections, Aurora Global Database, Aurora I\u002FO optimized. Self-hosted Postgres does almost everything, but doesn't have the instant auto-scaling. For spiky workloads, consider Neon (which offers a similar pattern).",[70,5960,5961,5964],{},[27,5962,5963],{},"Complex cross-service IAM."," Teams using cross-account IAM roles, Service Control Policies and hierarchical account organization have access control embedded in the architecture. Migrating requires reimplementing the hierarchy elsewhere — Vault, Cloudflare Access, or orchestrator secret injection. Count days, not hours.",[70,5966,5967,5970],{},[27,5968,5969],{},"Lambda + complex EventBridge."," Event pipelines with multiple hops, retries, dead letter queues. Doesn't migrate as is. Redesign around queues (RabbitMQ, NATS) and persistent workers. Usually the system gets simpler — but takes time.",[70,5972,5973,5976],{},[27,5974,5975],{},"S3 events triggering Lambda."," Very common pattern, and R2 with Cloudflare Workers covers most cases. 
For workloads that need an exactly-once guarantee or strong ordering, switch to a queue pattern — the producer writes an event to the queue when the file is confirmed, and a worker consumes it.",[19,5978,5980],{"id":5979},"the-savings-calculation-without-optimism","The savings calculation, without optimism",[12,5982,5983],{},"Typical Brazilian SaaS scenario with five applications:",[12,5985,5986],{},[27,5987,5988],{},"Before on AWS:",[2734,5990,5991,5994,5997,6000,6003,6006,6009,6012,6015,6018,6021],{},[70,5992,5993],{},"Five EC2 t3.medium: R$750",[70,5995,5996],{},"RDS db.t4g.large Multi-AZ: R$1,400",[70,5998,5999],{},"ElastiCache cache.t4g.micro: R$75",[70,6001,6002],{},"S3 with 100GB and average egress: R$300",[70,6004,6005],{},"ALB: R$110",[70,6007,6008],{},"CloudFront with average volume: R$400",[70,6010,6011],{},"Route 53 + SES: R$75",[70,6013,6014],{},"CloudWatch logs\u002Fmetrics: R$250",[70,6016,6017],{},"Lambda with average volume: R$200",[70,6019,6020],{},"NAT Gateway: R$200",[70,6022,6023],{},[27,6024,6025],{},"Total: R$3,760\u002Fmonth = R$45,120\u002Fyear",[12,6027,6028],{},[27,6029,6030],{},"After self-hosted:",[2734,6032,6033,6036,6039,6042,6045,6048,6051,6054,6057],{},[70,6034,6035],{},"Four Hetzner CPX21 VPS with orchestrator: R$176",[70,6037,6038],{},"Self-hosted Postgres (included on the VPS): R$0",[70,6040,6041],{},"Valkey (included): R$0",[70,6043,6044],{},"Cloudflare R2 50GB with unlimited egress: R$75",[70,6046,6047],{},"Cloudflare CDN + DNS: R$0",[70,6049,6050],{},"Resend for email: R$100",[70,6052,6053],{},"Prometheus + Loki (included): R$0",[70,6055,6056],{},"Queue workers (included): R$0",[70,6058,6059],{},[27,6060,6061],{},"Total: R$351\u002Fmonth = R$4,212\u002Fyear",[12,6063,6064],{},"Savings: R$3,409\u002Fmonth, R$40,908\u002Fyear. Roughly one month of a senior engineer's salary.",[12,6066,6067],{},"The migration consumes eighty to one hundred sixty hours. At internal senior dev rates, that is between sixteen and thirty-two thousand reais. 
Payback in five to ten months, with perpetual savings afterwards.",[19,6069,6071],{"id":6070},"the-most-non-obvious-migration-secrets-and-credentials","The most non-obvious migration: secrets and credentials",[12,6073,6074],{},"Worth repeating, because it is what most surprises an AWS-experienced team. On AWS you access S3 without credentials in code — the EC2's IAM role resolves it. You access RDS via IAM authentication and Parameter Store via IAM. The team loses awareness that this \"magic\" exists.",[12,6076,6077],{},"Outside AWS, every credential is explicit. The application needs:",[2734,6079,6080,6083,6086,6089,6092],{},[70,6081,6082],{},"Access key and secret for R2 (created in the Cloudflare panel)",[70,6084,6085],{},"Connection string with user and password for Postgres",[70,6087,6088],{},"Valkey URL with password",[70,6090,6091],{},"API key for Resend",[70,6093,6094],{},"Token for the Cloudflare API if you automate DNS",[12,6096,6097],{},"The orchestrator solution is to declare all of that as secrets injected into the container as environment variables. The secret is encrypted at rest in the orchestrator and never appears in logs. For automatic rotation and sophisticated auditing, self-hosted Vault enters the game — but most teams don't need it.",[12,6099,6100],{},"Plan: make a spreadsheet with all the credentials each app needs, create each at the destination provider, declare them as secrets in the orchestrator, and inject them into the container. Four to eight hours for a medium stack.",[19,6102,6104],{"id":6103},"when-not-to-migrate-honest-profiles","When NOT to migrate (honest profiles)",[12,6106,6107],{},"Four situations where leaving AWS is the wrong decision:",[12,6109,6110,6113],{},[27,6111,6112],{},"Compliance that lists AWS by name."," FedRAMP, ITAR, certain American government contracts and some financial certifications require infra to run on pre-approved components — and most lists include AWS, GCP, Azure, and a few additional providers. 
If your client is an American federal agency, AWS resolves a slice of compliance that would cost months to replicate elsewhere.",[12,6115,6116,6119],{},[27,6117,6118],{},"Single team focused on product."," If you are the only dev and are building the product, eight weeks redirected to migration kill the roadmap. Do it when you have a second dev, or when AWS costs come to represent a significant slice of MRR. Before that, AWS is expensive but affordable.",[12,6121,6122,6125],{},[27,6123,6124],{},"AWS costs below 2% of MRR."," A bill of one thousand reais monthly for a startup billing one hundred thousand. The savings are real but the effort isn't worth the focus. Migrate when the bill exceeds five to ten percent of MRR — at that point the gain covers the opportunity cost.",[12,6127,6128,6131],{},[27,6129,6130],{},"Deep lock-in in DynamoDB or Aurora Serverless v2."," Already addressed above. If half your architecture is DynamoDB with Streams, you don't migrate — you re-architect. That's a different project, with a different scope and a different decision.",[19,6133,6135],{"id":6134},"hybrid-strategy-alternative-for-those-not-wanting-to-migrate-everything","Hybrid strategy — alternative for those not wanting to migrate everything",[12,6137,6138],{},"Teams with fifty or more applications on AWS rarely migrate in one block. A hybrid strategy works better:",[2734,6140,6141,6144,6147,6150],{},[70,6142,6143],{},"Keep on AWS what is expensive to move (Aurora with specific features, critical Lambda, DynamoDB)",[70,6145,6146],{},"Move what is cheap to move and expensive to maintain (S3 → R2, CloudFront → Cloudflare, non-critical EC2 → VPS)",[70,6148,6149],{},"Establish a VPN or private connection between the two environments",[70,6151,6152],{},"Partial savings but zero risk of radical migration",[12,6154,6155],{},"Typical result: a cut of forty to sixty percent of the AWS bill, without touching critical pieces. 
For a company paying fifty thousand monthly, that is twenty to thirty thousand back — and the rest migrates organically over the following twelve months, as teams rewrite components for other reasons.",[19,6157,6159],{"id":6158},"heroctl-as-destination-what-changes-in-practice","HeroCtl as destination — what changes in practice",[12,6161,6162],{},"HeroCtl is a container orchestrator that runs on any Linux server with Docker. Four VPS running HeroCtl deliver an operational experience close to what you would have with managed ECS — without managed billing, without lock-in.",[12,6164,6165],{},"What it replaces:",[2734,6167,6168,6173,6179,6185],{},[70,6169,6170,6172],{},[27,6171,5601],{}," becomes the HeroCtl integrated router, with automatic Let's Encrypt TLS",[70,6174,6175,6178],{},[27,6176,6177],{},"Partial CloudWatch"," becomes embedded metrics and native centralized logs",[70,6180,6181,6184],{},[27,6182,6183],{},"RDS automated backups"," becomes managed backup on Business Edition",[70,6186,6187,6190],{},[27,6188,6189],{},"IAM roles in apps"," becomes secret injection in the job manifest",[12,6192,6193],{},"What stays the same: Docker running your app exactly as it runs on ECS. Environment variables, healthchecks, rolling deploys, multi-replicas. The application doesn't notice the difference.",[12,6195,6196,6197,6199,6200,6202,6203,6205],{},"There are three plans. ",[27,6198,4351],{}," is permanent free, no server or job limit — runs the entire stack described above including real high availability, router, certificates, metrics and logs. ",[27,6201,4355],{}," adds SSO, granular RBAC, detailed auditing, managed backup and SLA support — useful for those who already have formal platform requirements. ",[27,6204,4359],{}," adds source code escrow, 24×7 support and dedicated development. 
Business and Enterprise pricing is published on the plans page, without mandatory \"talk to sales\".",[12,6207,6208],{},"The public demo cluster runs on four servers and coordinator election happens in around seven seconds when the current node fails — a measured number, not an estimate.",[12,6210,5406],{},[19,6212,6214],{"id":6213},"questions-we-get-about-leaving-aws","Questions we get about leaving AWS",[368,6216,6218],{"id":6217},"how-long-does-it-really-take-to-migrate-a-medium-stack","How long does it really take to migrate a medium stack?",[12,6220,6221],{},"For a startup with five to ten applications, without deep lock-ins: six to eight weeks with a senior dev devoting half time, or three to four weeks with full dedication. Larger stacks, or stacks with complex event-driven Lambdas: three to four months. Stacks with critical DynamoDB or Aurora Serverless v2: turn it into a re-architecture project, with a six-month timeline or more.",[368,6223,6225],{"id":6224},"does-dynamodb-have-a-good-alternative","Does DynamoDB have a good alternative?",[12,6227,6228],{},"There is no identical substitute. The honest options are: Postgres with JSONB for most cases (resolves eighty percent of DynamoDB uses with excellent performance), self-hosted ScyllaDB or Cassandra for workloads that really need distributed NoSQL, FoundationDB for those needing distributed transactions. None of these is \"change the connection string and done\" — they require changes in the data model.",[368,6230,6232],{"id":6231},"can-i-keep-aws-for-the-database-and-move-compute","Can I keep AWS for the database and move compute?",[12,6234,6235],{},"Yes, and it is the most common hybrid strategy. Aurora or RDS stays on AWS, EC2s become Hetzner or DigitalOcean VPS, S3 becomes R2. You open a VPN between the two environments and the app continues accessing RDS via a private endpoint. 
Savings are typically fifty to seventy percent of the AWS bill.",[368,6237,6239],{"id":6238},"s3-r2-how-much-does-it-cost-to-transfer-1tb","S3 → R2: how much does it cost to transfer 1TB?",[12,6241,6242,6243,6245],{},"R2 charges zero for ingress. AWS charges for S3 egress — approximately nine cents of a dollar per GB on the first 10 TB. One terabyte costs about ninety dollars to leave AWS, R$450. Transfer time: twelve to twenty-four hours with parallelized ",[231,6244,5906],{},", depending on bandwidth. After migration, R$75 monthly storing 50GB with unlimited egress, against R$600 for the same on S3 with active traffic.",[368,6247,6249],{"id":6248},"lambda-how-to-migrate-event-driven","Lambda — how to migrate event-driven?",[12,6251,6252,6253,6256,6257,6260,6261,6264,6265,6268],{},"It depends on the trigger. ",[27,6254,6255],{},"S3 triggering Lambda"," becomes R2 with Cloudflare Workers (same pattern, no radical change). ",[27,6258,6259],{},"SQS triggering Lambda"," becomes a persistent worker on the app server, consuming from the queue — usually simpler code than the original Lambda. ",[27,6262,6263],{},"Scheduled EventBridge"," becomes a cron job in the orchestrator. ",[27,6266,6267],{},"EventBridge with complex rules and chained Step Functions"," requires a redesign — design the flow around a central queue with consumer workers; it becomes more auditable.",[368,6270,6272],{"id":6271},"rds-multi-az-self-hosted-postgres-is-reliable","RDS Multi-AZ → self-hosted Postgres: is it reliable?",[12,6274,6275],{},"Postgres with physical streaming replication and failover via Patroni reaches reliability close to RDS Multi-AZ — provided the team knows how to operate it. If nobody on the team masters Postgres in production, the safest path is Neon or Supabase, which deliver managed Postgres with a free tier. For teams with an SRE or DBA, self-hosted is viable and saves substantially. 
For teams without that competence, the savings don't compensate for the risk — pay for managed.",[368,6277,6279],{"id":6278},"email-ses-who-is-cheaper","Email SES → who is cheaper?",[12,6281,6282],{},"It depends on volume. For up to 10k monthly emails, Postmark at US$15 delivers much more (superior deliverability, a better dashboard, responsive support). Between 50k and 100k monthly, Resend at US$20 offers the best cost-benefit. Above 500k monthly, Mailgun or Amazon SES compete on price — and SES might make sense to keep even after migrating the rest. Email is one of the few AWS services it can be rational to keep.",[368,6284,6286],{"id":6285},"dns-all-cloudflare-or-mix","DNS — all Cloudflare or mix?",[12,6288,6289],{},"Cloudflare resolves DNS, CDN, DDoS, WAF and workers on the free plan. For most stacks, concentrating everything there simplifies operation and cuts cost. The exception is compliance that requires geographic provider separation — some governance frameworks ask for DNS and CDN to come from distinct providers. In that case, Cloudflare DNS + Bunny CDN (or Fastly) fulfills the separation.",[368,6291,6293],{"id":6292},"does-lgpd-compliance-change-anything","Does LGPD compliance change anything?",[12,6295,6296],{},"LGPD doesn't require hosting on Brazilian soil. It requires that you know where the data is and that you have an adequate contract with the operator. Hetzner (Germany), DigitalOcean (multiple regions), Cloudflare R2 (multi-region) and Magalu Cloud (Brazil) are all LGPD-compatible provided the contract is in order. For those who want data on national soil due to client preference, Magalu Cloud is the direct alternative.",[12,6298,5406],{},[19,6300,6302],{"id":6301},"concrete-next-step","Concrete next step",[12,6304,6305],{},"If you got this far, the next step is the spreadsheet. List the twelve services, mark which ones your stack uses, note the current cost of each, and choose an alternative. 
In an afternoon you will know whether the migration is worth the effort.",[12,6307,6308],{},"When you are ready to provision the destination:",[224,6310,6311],{"className":226,"code":5318,"language":228,"meta":229,"style":229},[231,6312,6313],{"__ignoreMap":229},[234,6314,6315,6317,6319,6321,6323],{"class":236,"line":237},[234,6316,1220],{"class":247},[234,6318,2957],{"class":251},[234,6320,5329],{"class":255},[234,6322,2963],{"class":383},[234,6324,2966],{"class":247},[12,6326,6327],{},"It runs on any Linux server with Docker. The first three servers form the quorum for the replicated control plane. You submit jobs via CLI, API or the embedded web panel. The cluster decides where to run them, does health checks, manages rolling deploys, and issues Let's Encrypt certificates automatically.",[12,6329,6330,6331,2402,6335,101],{},"For additional context on cost and architecture, also read ",[3336,6332,6334],{"href":6333},"\u002Fen\u002Fblog\u002Faws-ecs-vs-kubernetes-vs-self-hosted","AWS ECS vs Kubernetes vs self-hosted",[3336,6336,6338],{"href":6337},"\u002Fen\u002Fblog\u002Fhow-much-to-host-a-brazilian-saas-2026","How much does it cost to host a Brazilian SaaS in 2026",[12,6340,6341],{},"The migration is more annoying than difficult. 
The hard part is deciding to start.",[3350,6343,4376],{},{"title":229,"searchDepth":244,"depth":244,"links":6345},[6346,6347,6348,6349,6363,6364,6365,6366,6367,6368,6369,6370,6381],{"id":5390,"depth":244,"text":5391},{"id":5409,"depth":244,"text":5410},{"id":5422,"depth":244,"text":5423},{"id":5506,"depth":244,"text":5507,"children":6350},[6351,6352,6353,6354,6355,6356,6357,6358,6359,6360,6361,6362],{"id":5719,"depth":271,"text":5720},{"id":5729,"depth":271,"text":5730},{"id":5752,"depth":271,"text":5753},{"id":5762,"depth":271,"text":5763},{"id":5779,"depth":271,"text":5780},{"id":5789,"depth":271,"text":5790},{"id":5799,"depth":271,"text":5800},{"id":5806,"depth":271,"text":5807},{"id":5816,"depth":271,"text":5817},{"id":5829,"depth":271,"text":5830},{"id":5849,"depth":271,"text":5850},{"id":5859,"depth":271,"text":5860},{"id":5881,"depth":244,"text":5882},{"id":5940,"depth":244,"text":5941},{"id":5979,"depth":244,"text":5980},{"id":6070,"depth":244,"text":6071},{"id":6103,"depth":244,"text":6104},{"id":6134,"depth":244,"text":6135},{"id":6158,"depth":244,"text":6159},{"id":6213,"depth":244,"text":6214,"children":6371},[6372,6373,6374,6375,6376,6377,6378,6379,6380],{"id":6217,"depth":271,"text":6218},{"id":6224,"depth":271,"text":6225},{"id":6231,"depth":271,"text":6232},{"id":6238,"depth":271,"text":6239},{"id":6248,"depth":271,"text":6249},{"id":6271,"depth":271,"text":6272},{"id":6278,"depth":271,"text":6279},{"id":6285,"depth":271,"text":6286},{"id":6292,"depth":271,"text":6293},{"id":6301,"depth":244,"text":6302},"case-study","2026-05-26","Migrating from AWS to a cheaper cloud (Hetzner\u002FDO) or self-hosted seems like a 1-year project. 
In practice, you can do it in 6-8 weeks if you map the 12 AWS-only services your stack actually uses.",{},"\u002Fen\u002Fblog\u002Fleaving-aws-without-rewriting-the-stack","16 min",{"title":5382,"description":6384},{"loc":6386},"en\u002Fblog\u002Fleaving-aws-without-rewriting-the-stack",[6392,6393,6394,888,6395],"aws","migration","cost","guide","sR1IiiNXy6Y_l6sjdp00CGJCMeQQZCQ1jpzt6L_dOxw",{"id":6398,"title":6399,"author":7,"body":6400,"category":3378,"cover":3379,"date":7497,"description":7498,"draft":3382,"extension":3383,"lastReviewed":3379,"meta":7499,"navigation":411,"path":7500,"readingTime":4401,"seo":7501,"sitemap":7502,"stem":7503,"tags":7504,"__hash__":7508},"blog_en\u002Fen\u002Fblog\u002Fredis-in-production-managed-vs-self-hosted.md","Redis (and Valkey) in production: managed vs self-hosted in 2026",{"type":9,"value":6401,"toc":7469},[6402,6416,6418,6438,6442,6445,6448,6454,6461,6465,6468,6474,6480,6486,6489,6495,6500,6505,6508,6514,6519,6524,6527,6533,6538,6548,6552,6555,6599,6603,6606,6610,6613,6645,6648,6652,6655,6669,6676,6680,6694,6697,6701,6734,6737,6741,6744,6758,6765,6769,6772,6853,6856,6860,6863,6869,6875,6882,6885,6889,6892,6924,6927,6931,6938,6941,6944,7235,7239,7242,7268,7272,7275,7301,7305,7308,7329,7338,7341,7344,7348,7373,7387,7393,7402,7408,7414,7420,7426,7428,7431,7437,7440,7456,7467],[12,6403,6404,6405,6408,6409,2402,6412,6415],{},"The question \"managed or self-hosted Redis?\" became a different question at the end of March 2024. That's when the company behind Redis switched the license from BSD to a combination of RSAL and SSPL — a pair of \"source available\" licenses designed to prevent cloud providers from offering Redis as a service without commercial licensing. The reaction was quick: the Linux Foundation launched ",[27,6406,6407],{},"Valkey"," as a direct fork of the last BSD version, with AWS, Google, and Oracle backing the development. 
In parallel, projects that already existed — ",[27,6410,6411],{},"KeyDB",[27,6413,6414],{},"Dragonfly"," — started appearing more frequently in benchmarks from companies reassessing their stack.",[19,6417,22],{"id":21},[12,6419,6420,6421,6424,6425,6427,6428,6430,6431,6433,6434,6437],{},"In 2026, \"Redis in production\" became a category with four implementations competing over the same protocol: ",[27,6422,6423],{},"Redis OSS"," (BSD pre-2024 or RSAL after), ",[27,6426,6407],{}," (BSD, drop-in via fork), ",[27,6429,6411],{}," (multi-thread, older fork), and ",[27,6432,6414],{}," (BSL, rewritten from scratch in C++). Self-hosting any of the four costs between R$30 and R$130 per month on a Hetzner VPS. The managed path costs from R$75 (ElastiCache micro) to R$1,000\u002Fmonth (13 GB instance), plus Upstash with serverless billing ranging from US$0 to US$100\u002Fmonth. For a Brazilian startup with MRR below R$200k, ",[27,6435,6436],{},"self-hosted Valkey"," on its own cluster saves between R$300 and R$1,500 per month compared to managed, eliminates RSAL license exposure, and maintains full compatibility with Redis clients. Switching the stack after adopting the commercial version is a real pain — starting with the OSS-friendly version is the bet with the lowest exit cost. This post compares the four products, the three managed paths (ElastiCache, Upstash, Redis Cloud), and the minimum configuration to run Valkey in production without losing sleep.",[19,6439,6441],{"id":6440},"the-short-story-of-the-license-change","The short story of the license change",[12,6443,6444],{},"Before March 2024, \"Redis\" was the dominant OSS cache: BSD-licensed, with a gigantic ecosystem, present in any stack that ever had the word \"Rails\" or \"Node\" on its résumé. The commercial vendor — Redis Inc, formerly Redis Labs — lived well off the managed product and the paid modules (Search, JSON, TimeSeries).",[12,6446,6447],{},"Then came the announcement: version 7.4 onward would ship under RSAL + SSPL, no longer BSD. 
In practical terms, the change directly targeted AWS, Google, and Azure. The internal reading from those who produce open source software was different: \"if it happened to Redis, it can happen to any VC-funded project\". It was the third recent case — after Elastic in 2021 and MongoDB in 2018 — where a project that seemed consolidated changed the rules.",[12,6449,6450,6451,6453],{},"The Linux Foundation was quick. Five days after the announcement, ",[27,6452,6407],{}," was formed as a fork of the last BSD version (7.2.4), with independent governance and weighty backers: AWS, Google Cloud, Oracle, Ericsson, Snap. In just over a year, AWS had already migrated ElastiCache's default engine to Valkey. Google Memorystore followed. In 2026, Valkey stopped being \"experimental fork\" to become a growing reference — with 7.x and 8.x versions already incorporating its own optimizations that weren't even offered to Redis OSS.",[12,6455,6456,6457,6460],{},"The operational lesson for those choosing cache today: ",[27,6458,6459],{},"the mainstream moved",". There's no longer the inertia of \"no one was fired for choosing Redis\" — the question in the architecture interview became \"why Redis and not Valkey?\". And the honest answer, in most cases, is \"habit\".",[19,6462,6464],{"id":6463},"what-are-the-four-products-disputing-this-market","What are the four products disputing this market?",[368,6466,6423],{"id":6467},"redis-oss",[12,6469,6470,6473],{},[27,6471,6472],{},"The original."," Versions before 7.4 are still under BSD and remain usable indefinitely — no one revokes a license retroactively. Versions 7.4 onward ship under RSAL\u002FSSPL.",[12,6475,6476,6479],{},[27,6477,6478],{},"Pros",": still huge community, battle-tested in production for over a decade, richer ecosystem (modules, integrations, books, talks). 
Almost every client library is tested against Redis OSS first.",[12,6481,6482,6485],{},[27,6483,6484],{},"Cons",": the RSAL prevents offering-as-a-service without commercial licensing. For those operating Redis for internal use, that's irrelevant — the restriction is about resale. The real risk is strategic: if the vendor changed the license once, they can change it again. Adopting Redis OSS in 2026 means betting that the next critical feature will land in the open branch rather than stay in the commercial product.",[368,6487,6407],{"id":6488},"valkey",[12,6490,6491,6494],{},[27,6492,6493],{},"The Linux Foundation fork."," Took the code from 7.2.4 BSD and kept developing. Drop-in replacement at the protocol level: no client needs to change a line of code to swap Redis for Valkey.",[12,6496,6497,6499],{},[27,6498,6478],{},": permanent BSD guaranteed by neutral governance (it's not a company, it's a foundation). Big backers align incentives to keep the project healthy. Technical parity with Redis 7.x and growing development speed.",[12,6501,6502,6504],{},[27,6503,6484],{},": the brand is still being built — some third-party plugins and very specific SDKs still only list "Redis" in the README. In 2026 that's increasingly cosmetic, but it can show up in old integrations needing minor adaptation.",[368,6506,6411],{"id":6507},"keydb",[12,6509,6510,6513],{},[27,6511,6512],{},"The multi-thread fork."," Has existed since 2019, was acquired by Snap in 2022, and lives today as a Snap-Telemetry project. The architectural difference is fundamental: Redis OSS and Valkey are single-thread by design (one main thread processes all commands). KeyDB runs multi-thread by default.",[12,6515,6516,6518],{},[27,6517,6478],{},": on CPUs with 4+ cores, KeyDB delivers 2 to 3 times more throughput than single-thread Redis on the same hardware. The API is compatible, so the client doesn't change. 
For CPU-bound workloads with high volume, it's the obvious choice.",[12,6520,6521,6523],{},[27,6522,6484],{},": smaller community, pace of adopting new Redis features usually lags quarters behind. Some new Redis features (Functions, certain extensions) take time to appear in KeyDB.",[368,6525,6414],{"id":6526},"dragonfly",[12,6528,6529,6532],{},[27,6530,6531],{},"The rewrite."," Not a fork — it's a new implementation in modern C++, with hash table designed for cache (not Redis's generic structure), using io_uring on Linux for asynchronous I\u002FO. Compatibility at the protocol level, not at the code level.",[12,6534,6535,6537],{},[27,6536,6478],{},": claims of 25× throughput in specific benchmarks (heavy pipelines on modern hardware). Real memory efficiency — 2 to 3 times more data in the same RAM as Redis. No implicit GIL of single-thread; scales vertically on a machine with 32+ cores.",[12,6539,6540,6542,6543,6547],{},[27,6541,6484],{},": BSL (Business Source License) license — stays closed for 4 years before becoming Apache 2.0. It's exactly the same license pattern that caught other projects in the orchestration industry by surprise, which we covered in our post on ",[3336,6544,6546],{"href":6545},"\u002Fen\u002Fblog\u002Fwhy-we-built-heroctl","why we built HeroCtl",". Some commands still incompatible with Redis in edge cases (complex Lua scripts, certain cluster operations).",[19,6549,6551],{"id":6550},"which-to-choose-for-a-new-project-in-2026","Which to choose for a new project in 2026?",[12,6553,6554],{},"The short decision tree:",[2734,6556,6557,6566,6574,6582,6591],{},[70,6558,6559,6562,6563,6565],{},[27,6560,6561],{},"Sensible default",": ",[27,6564,6407],{},". Permanent BSD, Redis parity, client doesn't need to change, future guaranteed by big backers. 
There's no technical reason to prefer Redis OSS for a new project in 2026.",[70,6567,6568,6562,6571,6573],{},[27,6569,6570],{},"Critical performance",[27,6572,6414],{},", if the application sustains above 100k operations per second and the team accepts the BSL license risk.",[70,6575,6576,6579,6581],{},[27,6577,6578],{},"Multi-thread without rewrite",[27,6580,6411],{},", if the bottleneck is CPU on big hardware and the team prefers not to migrate to Dragonfly.",[70,6583,6584,6562,6587,6590],{},[27,6585,6586],{},"Extreme simplicity (1 VPS, low volume)",[27,6588,6589],{},"Redis OSS 7.2.4 BSD"," still works perfectly. It has crystallized as a stable version and will run on any Debian\u002FAlpine for the next five years without complaining.",[70,6592,6593,6562,6596,6598],{},[27,6594,6595],{},"Migrating from Redis Labs managed",[27,6597,6407],{}," is drop-in. Zero code changes. Migration is only operational — replication, DNS swap, rollback if necessary.",[19,6600,6602],{"id":6601},"managed-vs-self-hosted-the-math-without-frills","Managed vs self-hosted: the math without frills",[12,6604,6605],{},"The numbers below are list prices as of May 2026, at an exchange rate of R$5\u002FUSD.",[368,6607,6609],{"id":6608},"aws-elasticache","AWS ElastiCache",[12,6611,6612],{},"Pricing grows in steps per instance:",[2734,6614,6615,6624,6630,6639],{},[70,6616,6617,6620,6621],{},[231,6618,6619],{},"cache.t4g.micro"," (1 GB): about US$15\u002Fmonth = ",[27,6622,6623],{},"R$75\u002Fmonth",[70,6625,6626,6629],{},[231,6627,6628],{},"cache.t4g.small"," (2 GB): US$30\u002Fmonth = R$150\u002Fmonth",[70,6631,6632,6635,6636],{},[231,6633,6634],{},"cache.r6g.large"," (13 GB): about US$200\u002Fmonth = ",[27,6637,6638],{},"R$1,000\u002Fmonth",[70,6640,6641,6644],{},[231,6642,6643],{},"cache.r6g.xlarge"," (26 GB): about US$400\u002Fmonth = R$2,000\u002Fmonth",[12,6646,6647],{},"Multi-AZ doubles the price (replica in another zone). Automatic backup is included. 
Real Multi-AZ failover is the main argument — you pay not to have to think about it.",[368,6649,6651],{"id":6650},"upstash","Upstash",[12,6653,6654],{},"Serverless billing per command:",[2734,6656,6657,6660,6663,6666],{},[70,6658,6659],{},"Free tier: 256 MB, 500k commands\u002Fday",[70,6661,6662],{},"Pay-as-you-go: US$0.2 per 100k commands",[70,6664,6665],{},"For a startup with medium volume (1M commands\u002Fday, ~30M\u002Fmonth): about US$60\u002Fmonth = R$300\u002Fmonth",[70,6667,6668],{},"For an app with low peaks: can stay between US$0 and US$10\u002Fmonth",[12,6670,6671,6672,6675],{},"The unique operational advantage: ",[27,6673,6674],{},"zero pre-allocated capacity",". If the app sleeps, the bill sleeps. For Vercel\u002FCloudflare Workers, it's the natural complement. For sustained and predictable load, it ends up more expensive than ElastiCache.",[368,6677,6679],{"id":6678},"redis-cloud-direct-offer-from-redis-inc","Redis Cloud (direct offer from Redis Inc)",[2734,6681,6682,6685,6691],{},[70,6683,6684],{},"Essentials Plan 30MB: free",[70,6686,6687,6688],{},"Pro Plan 5GB single-region: about US$50\u002Fmonth = ",[27,6689,6690],{},"R$250\u002Fmonth",[70,6692,6693],{},"Pro Plan 10GB multi-AZ: about US$120\u002Fmonth = R$600\u002Fmonth",[12,6695,6696],{},"Includes commercial modules (Search, JSON, TimeSeries) that don't exist in Valkey or Redis OSS. If you use those modules, there's no direct alternative — it's Redis Cloud, or buying a commercial license and self-hosting.",[368,6698,6700],{"id":6699},"self-hosted-on-hetzner","Self-hosted on Hetzner",[2734,6702,6703,6713,6719,6728],{},[70,6704,6705,6708,6709,6712],{},[27,6706,6707],{},"CPX21"," (3 vCPU, 4 GB RAM, 80 GB SSD): €7.99 = ",[27,6710,6711],{},"R$44\u002Fmonth",". 
Fits a 2 GB Valkey with room to spare.",[70,6714,6715,6718],{},[27,6716,6717],{},"CPX31"," (4 vCPU, 8 GB RAM, 160 GB SSD): €13.99 = R$78\u002Fmonth.",[70,6720,6721,6724,6725,101],{},[27,6722,6723],{},"Cluster of 3 CPX21 for Valkey + Sentinel HA",": 3 × €7.99 = €24\u002Fmonth = ",[27,6726,6727],{},"R$130\u002Fmonth",[70,6729,6730,6733],{},[27,6731,6732],{},"Cluster of 3 CPX31 for serious data",": €42\u002Fmonth = R$230\u002Fmonth.",[12,6735,6736],{},"For DigitalOcean, Linode, Vultr, multiply by approximately 1.5×. For AWS EC2, multiply by 2×. But in every case it stays cheaper than the managed equivalent.",[368,6738,6740],{"id":6739},"practical-difference","Practical difference",[12,6742,6743],{},"For an 8 GB cache workload with replication:",[2734,6745,6746,6749,6752,6755],{},[70,6747,6748],{},"ElastiCache Multi-AZ: ~R$1,000\u002Fmonth",[70,6750,6751],{},"Redis Cloud Pro Multi-AZ: ~R$600\u002Fmonth",[70,6753,6754],{},"Self-hosted Valkey on 3× Hetzner CPX31: R$230\u002Fmonth",[70,6756,6757],{},"Single-node Valkey on 1× Hetzner CPX31 + S3 backup: R$80\u002Fmonth",[12,6759,6760,6761,6764],{},"Whoever chooses the managed path pays ",[27,6762,6763],{},"3 to 10 times more"," for the same throughput. The difference is what you buy with that: contractual SLA, automatic multi-AZ failover, no 3 a.m. pages. For a small team, that may be worth the price. 
For a team that already operates Linux servers in production, it usually isn't.",[19,6766,6768],{"id":6767},"minimum-production-grade-valkey-stack","Minimum production-grade Valkey stack",[12,6770,6771],{},"A configuration that withstands real production without theater:",[2734,6773,6774,6780,6789,6804,6810,6816,6822,6828],{},[70,6775,6776,6779],{},[27,6777,6778],{},"Container or systemd service on dedicated VPS."," Don't share the machine with the application — cache and app compete for RAM, and when it goes wrong it goes wrong for both at the same time.",[70,6781,6782,6788],{},[27,6783,6784,6787],{},[231,6785,6786],{},"maxmemory"," configured"," between 50 and 70% of available RAM. Leaving memory for the system and network buffers is more important than having the last megabytes for cache.",[70,6790,6791,6562,6796,6799,6800,6803],{},[27,6792,6793],{},[231,6794,6795],{},"maxmemory-policy",[231,6797,6798],{},"allkeys-lru"," if pure cache mode (throw out old keys when full). ",[231,6801,6802],{},"noeviction"," if storage mode (queue, sessions) — there, prefer a write error to silently losing data.",[70,6805,6806,6809],{},[27,6807,6808],{},"AOF persistence"," if the load is a job queue (Sidekiq, BullMQ, Resque). Without AOF, a restart loses any job that was queued but unprocessed. RDB is insufficient in that scenario because snapshots are periodic.",[70,6811,6812,6815],{},[27,6813,6814],{},"Sufficient RDB"," if the load is pure cache (Rails cache, Django cache). If losing the cache on restart only means \"slow requests for a few seconds while it warms up\", AOF is unnecessary overhead.",[70,6817,6818,6821],{},[27,6819,6820],{},"Async replication to standby"," on a second node. Manual failover with internal DNS swap is acceptable for many cases. Automatic failover requires Sentinel or Cluster.",[70,6823,6824,6827],{},[27,6825,6826],{},"AOF + RDB backup to S3"," or compatible, daily. 
Restic or rclone handle this well.",[70,6829,6830,6833,6834,6837,6838,571,6841,571,6844,571,6847,571,6850,101],{},[27,6831,6832],{},"Monitoring"," with ",[231,6835,6836],{},"redis_exporter"," exporting to Prometheus + alerts on Grafana or similar. Critical metrics: ",[231,6839,6840],{},"connected_clients",[231,6842,6843],{},"used_memory",[231,6845,6846],{},"evicted_keys",[231,6848,6849],{},"keyspace_hits\u002Fmisses",[231,6851,6852],{},"latency_percentiles",[12,6854,6855],{},"This setup runs comfortably on CPX21 (R$44\u002Fmonth) serving 50k+ ops\u002Fs sustained for an average Brazilian app.",[19,6857,6859],{"id":6858},"sentinel-or-cluster","Sentinel or Cluster?",[12,6861,6862],{},"A question that confuses many teams coming to Redis for the first time.",[12,6864,6865,6868],{},[27,6866,6867],{},"Sentinel",": 1 master + N replicas + 3+ sentinel processes monitoring. Automatic failover when the master falls — a sentinel detects it, the sentinels vote, a replica becomes master, clients receive the new endpoint via discovery. All on a single shard — the entire dataset fits on one node.",[12,6870,6871,6874],{},[27,6872,6873],{},"Cluster",": dataset partitioned into 16384 slots distributed across 3+ masters. Each master has its own replicas. Multi-shard, horizontal capacity scaling — you can have 100 GB total with no individual node holding more than 20 GB.",[12,6876,6877,6878,6881],{},"The practical rule: ",[27,6879,6880],{},"Sentinel is enough up to a ~100 GB dataset",". Above that, Cluster is necessary. For most Brazilian startups, Sentinel is the right choice for simplicity — Cluster adds real complexity (keys need hash tags for multi-key operations, Lua scripts are restricted to a slot, some clients have bugs in cluster mode).",[12,6883,6884],{},"Don't adopt Cluster for bragging rights. 
Use Sentinel until the metrics force your hand.",[19,6886,6888],{"id":6887},"sidekiq-bullmq-and-friends-patterns","Sidekiq, BullMQ and friends patterns",[12,6890,6891],{},"Real use, not a marketing diagram:",[2734,6893,6894,6900,6906,6912,6918],{},[70,6895,6896,6899],{},[27,6897,6898],{},"Sidekiq Ruby",": Redis needs AOF. Without AOF, any crash loses queued jobs that haven't yet been picked up. Sidekiq Pro adds \"reliable fetch\", which helps — but the backstop is still AOF.",[70,6901,6902,6905],{},[27,6903,6904],{},"BullMQ Node",": similar. AOF essential for durability. BullMQ uses data structures that depend on Redis transactional atomicity — a restart without AOF can leave the queue in an inconsistent state.",[70,6907,6908,6911],{},[27,6909,6910],{},"Resque Ruby",": the father of them all. AOF necessary for the same reasons.",[70,6913,6914,6917],{},[27,6915,6916],{},"Pure cache (Rails.cache, Django cache, Laravel cache)",": can run without AOF, RDB sufficient. Losing cache on restart is acceptable.",[70,6919,6920,6923],{},[27,6921,6922],{},"Pure pub\u002Fsub",": doesn't even need persistence. Pub\u002Fsub is fire-and-forget by design.",[12,6925,6926],{},"Mixing cache and queue use on the same Redis works — just configure AOF (the \"worst case\" load dictates the config). But for a serious workload, separating into two instances (one for cache without AOF, another for queue with AOF) is cleaner. Operationally cheap if there's already an orchestrator running.",[19,6928,6930],{"id":6929},"is-elasticache-sao-paulo-reliable","Is ElastiCache São Paulo reliable?",[12,6932,6933,6934,6937],{},"Yes — 99.99% contractual uptime SLA, multi-AZ in São Paulo region (",[231,6935,6936],{},"sa-east-1","), automatic backup, tested failover. Latency from a Brazilian app to ElastiCache São Paulo stays at 1-3ms, indistinguishable from local Redis for most workloads.",[12,6939,6940],{},"The weak point isn't technical reliability, it's cost and lock-in. 
AWS Brazil charges about 30% more than North American regions for the same resource. And migrating from ElastiCache to another provider later involves dump\u002Frestore + coordinated cutover — not apocalypse, but it's weekend work.",[19,6942,6943],{"id":3836},"Comparison table: 12 criteria",[119,6945,6946,6966],{},[122,6947,6948],{},[125,6949,6950,6952,6954,6956,6958,6960,6962,6964],{},[128,6951,2982],{},[128,6953,6423],{},[128,6955,6407],{},[128,6957,6411],{},[128,6959,6414],{},[128,6961,5445],{},[128,6963,6651],{},[128,6965,5573],{},[141,6967,6968,6993,7017,7039,7064,7084,7106,7127,7150,7173,7194,7215],{},[125,6969,6970,6973,6976,6979,6981,6984,6987,6990],{},[146,6971,6972],{},"License",[146,6974,6975],{},"RSAL\u002FSSPL (7.4+)",[146,6977,6978],{},"BSD",[146,6980,6978],{},[146,6982,6983],{},"BSL → Apache 4 years",[146,6985,6986],{},"Commercial AWS",[146,6988,6989],{},"Commercial Upstash",[146,6991,6992],{},"Permanent BSD",[125,6994,6995,6998,7001,7003,7006,7008,7011,7014],{},[146,6996,6997],{},"Threading",[146,6999,7000],{},"Single",[146,7002,7000],{},[146,7004,7005],{},"Multi",[146,7007,7005],{},[146,7009,7010],{},"Single (engine 7)",[146,7012,7013],{},"Serverless",[146,7015,7016],{},"Configurable",[125,7018,7019,7022,7025,7027,7029,7032,7034,7037],{},[146,7020,7021],{},"Redis client compat.",[146,7023,7024],{},"100%",[146,7026,7024],{},[146,7028,7024],{},[146,7030,7031],{},"95%+",[146,7033,7024],{},[146,7035,7036],{},"100% (subset of commands)",[146,7038,7024],{},[125,7040,7041,7044,7047,7049,7052,7055,7058,7061],{},[146,7042,7043],{},"Baseline throughput",[146,7045,7046],{},"100k ops\u002Fs",[146,7048,7046],{},[146,7050,7051],{},"250k ops\u002Fs",[146,7053,7054],{},"1M+ ops\u002Fs",[146,7056,7057],{},"depends on inst.",[146,7059,7060],{},"depends on plan",[146,7062,7063],{},"100-250k 
ops\u002Fs",[125,7065,7066,7068,7070,7072,7074,7076,7079,7082],{},[146,7067,6808],{},[146,7069,3064],{},[146,7071,3064],{},[146,7073,3064],{},[146,7075,3064],{},[146,7077,7078],{},"Yes (snapshot)",[146,7080,7081],{},"Managed",[146,7083,3064],{},[125,7085,7086,7089,7091,7093,7095,7097,7100,7103],{},[146,7087,7088],{},"Replication",[146,7090,3064],{},[146,7092,3064],{},[146,7094,3064],{},[146,7096,3064],{},[146,7098,7099],{},"Multi-AZ",[146,7101,7102],{},"Multi-region",[146,7104,7105],{},"Yes (manual config)",[125,7107,7108,7111,7114,7116,7118,7120,7123,7125],{},[146,7109,7110],{},"Automatic failover",[146,7112,7113],{},"Sentinel\u002FCluster",[146,7115,7113],{},[146,7117,7113],{},[146,7119,6873],{},[146,7121,7122],{},"Built-in",[146,7124,7122],{},[146,7126,7113],{},[125,7128,7129,7132,7135,7137,7139,7141,7144,7147],{},[146,7130,7131],{},"Cost 8GB\u002Fmonth (R$)",[146,7133,7134],{},"80 (VPS)",[146,7136,7134],{},[146,7138,7134],{},[146,7140,7134],{},[146,7142,7143],{},"1000 (Multi-AZ)",[146,7145,7146],{},"300-500",[146,7148,7149],{},"80-230",[125,7151,7152,7155,7158,7160,7162,7165,7168,7171],{},[146,7153,7154],{},"Lock-in",[146,7156,7157],{},"Medium (license)",[146,7159,3154],{},[146,7161,3154],{},[146,7163,7164],{},"Medium (BSL)",[146,7166,7167],{},"High (AWS)",[146,7169,7170],{},"High (Upstash API)",[146,7172,3154],{},[125,7174,7175,7178,7181,7183,7185,7187,7190,7192],{},[146,7176,7177],{},"Premium modules",[146,7179,7180],{},"Paid",[146,7182,3055],{},[146,7184,3055],{},[146,7186,3055],{},[146,7188,7189],{},"Add-on $$",[146,7191,3061],{},[146,7193,3055],{},[125,7195,7196,7199,7202,7204,7206,7208,7211,7213],{},[146,7197,7198],{},"Operational",[146,7200,7201],{},"You",[146,7203,7201],{},[146,7205,7201],{},[146,7207,7201],{},[146,7209,7210],{},"AWS",[146,7212,6651],{},[146,7214,7201],{},[125,7216,7217,7220,7222,7224,7226,7228,7231,7233],{},[146,7218,7219],{},"SLA 
support",[146,7221,7180],{},[146,7223,4351],{},[146,7225,4351],{},[146,7227,7180],{},[146,7229,7230],{},"Included",[146,7232,7230],{},[146,7234,7201],{},[19,7236,7238],{"id":7237},"when-managed-still-makes-sense","When managed still makes sense",[12,7240,7241],{},"Honesty is the defense mechanism of any technical recommendation. There are four profiles where paying for managed is the right choice:",[2734,7243,7244,7250,7256,7262],{},[70,7245,7246,7249],{},[27,7247,7248],{},"Team without operational capacity for Redis cluster."," If no one in the company knows how to debug a master that no longer responds, or interpret RDB fork latency, or take care of AOF backup — paying AWS to do that is rational. It's not an excuse, it's division of labor.",[70,7251,7252,7255],{},[27,7253,7254],{},"Compliance requiring SOC2\u002FISO certified vendor."," Audit asking for \"certified vendor X\" doesn't accept \"we run Valkey on a Hetzner VPS\". The path is ElastiCache, Redis Cloud, or similar with certifications in the contract.",[70,7257,7258,7261],{},[27,7259,7260],{},"Volume needing instant scale."," Application going from 100 req\u002Fs to 100k req\u002Fs in 5 minutes due to viral campaign — Upstash's serverless path is where it shines. Self-hosted needs reserved capacity beforehand; serverless grows on the fly.",[70,7263,7264,7267],{},[27,7265,7266],{},"Fully serverless application."," If the app runs on Vercel or Cloudflare Workers and Redis also needs to be serverless by billing model, Upstash is practically the only sane option. 
Connecting edge functions to a Redis on a VPS means bad cold starts.",[19,7269,7271],{"id":7270},"when-self-hosting-is-obvious","When self-hosting is obvious",[12,7273,7274],{},"And four profiles where paying for managed is a waste:",[2734,7276,7277,7283,7289,7295],{},[70,7278,7279,7282],{},[27,7280,7281],{},"Startup with R$10k–R$200k MRR optimizing cost."," The difference between R$80\u002Fmonth and R$1,000\u002Fmonth of cache is 1% of the total cost of a small SaaS; it's also 11 hours of a developer's salary. Worth doing the math.",[70,7284,7285,7288],{},[27,7286,7287],{},"Predictable workload."," If cache volume grows 10% per month, there's no advantage in serverless scaling. Reserved capacity on VPS is cheaper and more predictable.",[70,7290,7291,7294],{},[27,7292,7293],{},"Team has 1+ person comfortable with Linux\u002FDocker."," If there's already someone who operates Postgres, nginx, Docker — Redis\u002FValkey is easier than any of them. The learning curve is days, not weeks.",[70,7296,7297,7300],{},[27,7298,7299],{},"Already running your own cluster."," If the company runs an orchestrator (HeroCtl, Coolify, similar platform) with spare nodes, Valkey becomes just another job. Marginal cost close to zero — you already pay for the nodes.",[19,7302,7304],{"id":7303},"heroctl-as-infrastructure-for-valkey","HeroCtl as infrastructure for Valkey",[12,7306,7307],{},"For those operating HeroCtl, running Valkey in production is a short configuration exercise. 
A ~30-line file describes a job with:",[2734,7309,7310,7313,7316,7319,7322],{},[70,7311,7312],{},"Official Valkey 8.x container",[70,7314,7315],{},"Replicated named volume between nodes (data survives a kill -9 of the server)",[70,7317,7318],{},"Reserved resources (RAM and CPU) with hard limits",[70,7320,7321],{},"Health check on Valkey ping",[70,7323,7324,7325,7328],{},"Internal routing between services (the app talks to ",[231,7326,7327],{},"valkey.servico.local"," without exposing a port to the internet)",[12,7330,7331,7332,7334,7335,7337],{},"Automated AOF + RDB backup to S3-compatible storage is available in the ",[27,7333,4355],{}," plan — without setting up external restic, without manual cron on the host. Valkey metrics come out via ",[231,7336,6836],{}," running as a sidecar and appear in the internal Prometheus (already included as a job of the cluster itself, no external stack).",[12,7339,7340],{},"Sentinel failover is integrated with the orchestrator's control plane: if the Valkey master node goes down, the cluster detects it in around 7 seconds and the replica is promoted. The app's configuration is updated via service discovery — no manual redeploy.",[12,7342,7343],{},"For a startup with 4 servers running the orchestrator, this setup replaces an entire ElastiCache Multi-AZ deployment at zero marginal cost (the servers are already there). The real monthly difference is the salary-equivalent of one person, depending on the size of the operation.",[19,7345,7347],{"id":7346},"questions-we-get","Questions we get",[12,7349,7350,7353,7354,571,7357,571,7360,571,7363,571,7366,571,7369,7372],{},[27,7351,7352],{},"Is Valkey compatible with Redis client libraries?","\nYes, in 100% of practical cases. The protocol is identical — ",[231,7355,7356],{},"redis-cli",[231,7358,7359],{},"node-redis",[231,7361,7362],{},"ioredis",[231,7364,7365],{},"redis-rb",[231,7367,7368],{},"redis-py",[231,7370,7371],{},"go-redis",", all work without changing a line. What changes is just the endpoint. 
In 2026, several libraries already announce explicit support for Valkey in the README, but that's cosmetic — the protocol is the same.",[12,7374,7375,7378,7379,7382,7383,7386],{},[27,7376,7377],{},"Can I migrate from managed Redis Labs to self-hosted Valkey without downtime?","\nYes, with replication. Configure Valkey as a Redis Labs replica (",[231,7380,7381],{},"REPLICAOF host port","), wait for sync (a few minutes to hours depending on the dataset), promote Valkey to master (",[231,7384,7385],{},"REPLICAOF NO ONE","), do the internal DNS cutover, and decommission Redis Labs after an observation period. The real error window is seconds during the swap.",[12,7388,7389,7392],{},[27,7390,7391],{},"Is Dragonfly worth the BSL risk?","\nDepends on the company's horizon. BSL converts to Apache 2.0 after 4 years by the standard model — so today's code will be open by 2030. The risk is that the company behind it (DragonflyDB Inc) follows the path of Redis Inc and makes the conversion less friendly. For workloads that demand performance Valkey doesn't deliver (above 500k sustained ops\u002Fs), Dragonfly may be the right choice despite the risk. For the rest, Valkey is the more conservative choice.",[12,7394,7395,7398,7399,7401],{},[27,7396,7397],{},"How much RAM does a Redis with 1 GB of useful data consume?","\nPractical math: a 1 GB dataset occupies between 1.3 and 2 GB of real RAM (structure overhead, fragmentation, client buffers, replication backlog). Configuring ",[231,7400,6786],{}," at 60% of available RAM is a safe rule — a 4 GB instance fits ~2.5 GB of useful data with room to spare.",[12,7403,7404,7407],{},[27,7405,7406],{},"Does Sidekiq really need AOF? Sidekiq docs say it can run without.","\nThe docs say it technically runs. In production, without AOF, any unexpected restart loses queued jobs that were in the buffer. For a \"welcome email\" queue, you discover when a customer complains. For a \"recurring billing\" queue, you discover when the accountant complains. 
AOF is cheap (5-10% I\u002FO increment); the cost of not having it is high.",[12,7409,7410,7413],{},[27,7411,7412],{},"Cluster vs Sentinel for an app processing 50k jobs\u002Fday?","\nSentinel. 50k jobs\u002Fday is 0.6 ops\u002Fs average — fits in 100 MB of Redis RAM. Cluster is overkill by an order of magnitude. Sentinel solves automatic failover with 1 master + 1 replica + 3 sentinels (3 sentinel processes on separate VPSes, can coexist with other things).",[12,7415,7416,7419],{},[27,7417,7418],{},"Does ElastiCache São Paulo have good latency for an app running in São Paulo?","\nYes, 1-3ms p99 within the same AZ. The problem isn't latency — it's cost and lock-in. Latency only becomes a topic if the app is on another provider (Hetzner FSN, DigitalOcean NYC) trying to talk to ElastiCache São Paulo — there it rises to 130-200ms and the argument disappears.",[12,7421,7422,7425],{},[27,7423,7424],{},"How to back up self-hosted Valkey to survive a disaster?","\nThree layers. First: persistent AOF on local disk (survives restart). Second: daily RDB snapshot copied to S3-compatible storage (Wasabi, Backblaze B2, Cloudflare R2 — all cheaper than AWS S3 for this case). Third: weekly snapshot copied to another storage provider (second region, second vendor). Restic or rclone do the work. Total storage cost for a 4 GB Valkey backup: about US$1\u002Fmonth.",[19,7427,3309],{"id":3308},[12,7429,7430],{},"In 2026, \"Redis in production\" became a question with more nuance than it had in 2023. The original product's license changed, the Linux Foundation fork matured, multi-thread alternatives are solid, the serverless offering has a real use case. Choosing among the four implementations and the three managed paths is an honest exercise — there's no single answer.",[12,7432,7433,7434,7436],{},"Our default recommendation for a Brazilian startup in 2026: ",[27,7435,6436],{}," on its own cluster, Sentinel mode, AOF on if there's a queue, monitoring with Prometheus. 
Cost in the R$80–R$230\u002Fmonth range, against R$600–R$2,000\u002Fmonth for equivalent managed alternatives. Full compatibility with any Redis library. No exposure to RSAL license. Reversible migration if it becomes a problem.",[12,7438,7439],{},"To stand up this stack:",[224,7441,7442],{"className":226,"code":2948,"language":228,"meta":229,"style":229},[231,7443,7444],{"__ignoreMap":229},[234,7445,7446,7448,7450,7452,7454],{"class":236,"line":237},[234,7447,1220],{"class":247},[234,7449,2957],{"class":251},[234,7451,2960],{"class":255},[234,7453,2963],{"class":383},[234,7455,2966],{"class":247},[12,7457,7458,7459,7463,7464,7466],{},"And to read in parallel: ",[3336,7460,7462],{"href":7461},"\u002Fen\u002Fblog\u002Fpostgres-in-production-managed-vs-self-hosted","Postgres in production: managed vs self-hosted"," (same analysis for the database) and ",[3336,7465,6338],{"href":6337}," (the consolidated math of the whole stack).",[3350,7468,4376],{},{"title":229,"searchDepth":244,"depth":244,"links":7470},[7471,7472,7473,7479,7480,7487,7488,7489,7490,7491,7492,7493,7494,7495,7496],{"id":21,"depth":244,"text":22},{"id":6440,"depth":244,"text":6441},{"id":6463,"depth":244,"text":6464,"children":7474},[7475,7476,7477,7478],{"id":6467,"depth":271,"text":6423},{"id":6488,"depth":271,"text":6407},{"id":6507,"depth":271,"text":6411},{"id":6526,"depth":271,"text":6414},{"id":6550,"depth":244,"text":6551},{"id":6601,"depth":244,"text":6602,"children":7481},[7482,7483,7484,7485,7486],{"id":6608,"depth":271,"text":6609},{"id":6650,"depth":271,"text":6651},{"id":6678,"depth":271,"text":6679},{"id":6699,"depth":271,"text":6700},{"id":6739,"depth":271,"text":6740},{"id":6767,"depth":244,"text":6768},{"id":6858,"depth":244,"text":6859},{"id":6887,"depth":244,"text":6888},{"id":6929,"depth":244,"text":6930},{"id":3836,"depth":244,"text":6943},{"id":7237,"depth":244,"text":7238},{"id":7270,"depth":244,"text":7271},{"id":7303,"depth":244,"text":7304},{"id":7346,"depth":244,"text":7347}
,{"id":3308,"depth":244,"text":3309},"2026-05-20","Redis changed its license in 2024, Valkey was born as an OSS fork, Dragonfly hits benchmarks. In 2026, choosing cache is no longer choosing Redis — it's choosing between 4 products. Honest analysis with costs.",{},"\u002Fen\u002Fblog\u002Fredis-in-production-managed-vs-self-hosted",{"title":6399,"description":7498},{"loc":7500},"en\u002Fblog\u002Fredis-in-production-managed-vs-self-hosted",[7505,6488,7506,7507,3378],"redis","cache","self-hosted","yY3ROe_Afo2prSU4Eu_g3AmHDl997tSuL6uAW4GbUDo",{"id":7510,"title":7511,"author":7,"body":7512,"category":8756,"cover":3379,"date":8757,"description":8758,"draft":3382,"extension":3383,"lastReviewed":3379,"meta":8759,"navigation":411,"path":8760,"readingTime":8761,"seo":8762,"sitemap":8763,"stem":8764,"tags":8765,"__hash__":8770},"blog_en\u002Fen\u002Fblog\u002Fgithub-actions-vs-gitlab-ci-vs-drone.md","GitHub Actions vs GitLab CI vs Drone: which CI\u002FCD to pick for a Brazilian startup",{"type":9,"value":7513,"toc":8719},[7514,7521,7524,7528,7544,7550,7556,7562,7565,7568,7572,7575,7578,7602,7605,7609,7616,7620,7652,7656,7659,7704,7707,7710,7713,7717,7720,7723,7743,7746,7750,7753,7757,7793,7797,7800,7805,7854,7857,7863,7877,7880,7884,7887,7891,7933,7937,7963,7967,7970,7977,7981,7984,8104,8111,8114,8118,8121,8182,8185,8187,8191,8370,8373,8377,8380,8384,8390,8394,8400,8403,8407,8413,8416,8420,8426,8430,8436,8440,8443,8446,8472,8482,8485,8489,8493,8496,8512,8516,8519,8530,8533,8537,8540,8543,8547,8550,8568,8571,8573,8578,8581,8586,8589,8594,8597,8602,8605,8610,8613,8618,8621,8626,8633,8638,8641,8646,8649,8651,8655,8658,8690,8693,8698,8701,8712],[12,7515,7516,7517,7520],{},"The CI\u002FCD choice in 2026 is no longer about \"which tool has more features\". All three serious ones — GitHub Actions, GitLab CI, Drone (and its fork Woodpecker) — do the basics well. 
The real choice is about ",[27,7518,7519],{},"where your pain is going to show up first",": on the bill at the end of the month, on workflow complexity when the monorepo grows, or when you have to bring up a runner nobody understands when the senior dev goes on vacation.",[12,7522,7523],{},"This post is an honest comparison for Brazilian tech leads deciding CI\u002FCD in 2026. No artificial ranking, no column where one tool is \"champion\" at everything. Explicit tradeoffs, numbers in reais, and a recommendation per profile at the end.",[19,7525,7527],{"id":7526},"tldr-200-words","TL;DR (200 words)",[12,7529,7530,7531,571,7534,571,7537,7540,7541,101],{},"The CI\u002FCD decision in 2026 follows four forces: ",[27,7532,7533],{},"where the code is hosted",[27,7535,7536],{},"minute cost",[27,7538,7539],{},"workflow complexity",", and ",[27,7542,7543],{},"willingness to operate self-hosted",[12,7545,7546,7549],{},[27,7547,7548],{},"GitHub Actions"," won absolute mindshare for projects on GitHub. It's free up to 2000 minutes\u002Fmonth on public repos; after that costs US$0.008\u002Fmin on a Linux runner — between US$5 and US$30\u002Fmonth for a typical startup (R$25 to R$150). Marketplace has 10 thousand ready actions. The Achilles heel is the minute pricing when volume grows.",[12,7551,7552,7555],{},[27,7553,7554],{},"GitLab CI"," is more complete: native job dependency graph, parent-child pipelines, better monorepo handling, included image registry, embedded security scanning. Self-hosted (Community Edition) is free but requires 4 to 8 GB of RAM and active operation. SaaS Premium is US$29\u002Fuser\u002Fmonth — expensive for a large team.",[12,7557,7558,7561],{},[27,7559,7560],{},"Drone\u002FWoodpecker"," self-hosted is the option for cutting variable cost to zero. A R$30 to R$80\u002Fmonth server runs CI for five to ten projects. Costs in ops: you operate the runners.",[12,7563,7564],{},"For a small BR startup on GitHub, start on the Actions free plan. 
When it passes US$30\u002Fmonth, consider Woodpecker self-hosted. For a company that values CI + issue tracker + registry in a single product, GitLab self-hosted.",[12,7566,7567],{},"━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━",[19,7569,7571],{"id":7570},"why-does-this-decision-matter-more-than-it-seems","Why does this decision matter more than it seems?",[12,7573,7574],{},"CI\u002FCD is the most-used infrastructure of any product team: every commit touches the system, every PR depends on it, every deploy goes through it. A wrong choice doesn't break you in the first month — it breaks you in the third year, when migrating costs four weeks of two people and you pay that while the roadmap stalls.",[12,7576,7577],{},"The three symptoms that indicate the choice was wrong:",[67,7579,7580,7586,7596],{},[70,7581,7582,7585],{},[27,7583,7584],{},"The bill grows faster than the team."," If CI cost doubles every six months without deploy volume justifying it, the pricing model isn't yours.",[70,7587,7588,7591,7592,7595],{},[27,7589,7590],{},"Workflows become copy-paste."," If every new project starts with ",[231,7593,7594],{},"cp -r .github\u002Fworkflows\u002F",", the tool has no decent composition.",[70,7597,7598,7601],{},[27,7599,7600],{},"CI failures take more than an hour to debug."," If reproducing the error locally requires running a Docker image nobody knows how to set up, the build isn't portable.",[12,7603,7604],{},"The three main competitors solve these symptoms in different ways. Let's go piece by piece.",[19,7606,7608],{"id":7607},"github-actions-the-de-facto-standard-is-it-worth-the-price","GitHub Actions: the de facto standard — is it worth the price?",[12,7610,7611,7612,7615],{},"If your code is on GitHub, Actions has a structural advantage that can't be ignored: zero integration friction. You create ",[231,7613,7614],{},".github\u002Fworkflows\u002Fci.yml",", push, and you're running. 
No separate signup, no cross token, no webhook to configure.",[368,7617,7619],{"id":7618},"what-actions-does-well","What Actions does well",[2734,7621,7622,7628,7634,7640,7646],{},[70,7623,7624,7627],{},[27,7625,7626],{},"Huge marketplace."," More than ten thousand ready actions for common tasks: setup of Node, Python, Go; deploy to AWS, GCP, Azure; image signing; security scanning. Most are maintained by the technology vendor itself (HashiCorp publishes its own, AWS publishes its own, etc).",[70,7629,7630,7633],{},[27,7631,7632],{},"Matrix builds."," Running the same suite against five Node versions or three operating systems is a key in three lines.",[70,7635,7636,7639],{},[27,7637,7638],{},"Reusable workflows."," Since 2021 you can extract workflows shared across repos in the same organization — solves the \"copy-paste between projects\" problem for medium teams.",[70,7641,7642,7645],{},[27,7643,7644],{},"Deployment protection rules."," Manual approvals, time windows, branch restriction — all configurable without plugin.",[70,7647,7648,7651],{},[27,7649,7650],{},"Self-hosted runners."," You can run the agent on your own infra and use the Actions UI just as orchestrator. 
Solves the minute problem for high-volume teams.",[368,7653,7655],{"id":7654},"what-actions-charges-dearly-for","What Actions charges dearly for",[12,7657,7658],{},"The billing model is per minute, and the numbers matter:",[119,7660,7661,7670],{},[122,7662,7663],{},[125,7664,7665,7668],{},[128,7666,7667],{},"Runner type",[128,7669,136],{},[141,7671,7672,7680,7688,7696],{},[125,7673,7674,7677],{},[146,7675,7676],{},"Linux 2 vCPU (default)",[146,7678,7679],{},"US$0.008\u002Fmin (R$0.04)",[125,7681,7682,7685],{},[146,7683,7684],{},"Windows 2 vCPU",[146,7686,7687],{},"US$0.016\u002Fmin (R$0.08)",[125,7689,7690,7693],{},[146,7691,7692],{},"macOS (required for iOS builds)",[146,7694,7695],{},"US$0.08\u002Fmin (R$0.40)",[125,7697,7698,7701],{},[146,7699,7700],{},"Larger Linux (4 vCPU+)",[146,7702,7703],{},"US$0.016\u002Fmin and above",[12,7705,7706],{},"For a startup on GitHub with five devs and a reasonable workflow (build + test + lint on each PR), typical consumption is 800 to 2500 minutes\u002Fmonth on Linux. That's US$6 to US$20\u002Fmonth — that is, between R$30 and R$100. Fits in the \"dev tools\" line without pain.",[12,7708,7709],{},"When it hurts: heavy workflows (E2E with Playwright, Rust builds, tests that bring up Postgres + Redis on each job) easily pass 10 thousand minutes\u002Fmonth. At US$0.008\u002Fmin that becomes US$80\u002Fmonth — R$400. Multiply by 12 and you're paying R$5 thousand\u002Fyear on CI.",[12,7711,7712],{},"macOS builds are the worst case: US$0.40\u002Fmin is ten times more than Linux. Teams maintaining iOS apps spend three to four times more on CI than on production infra.",[368,7714,7716],{"id":7715},"do-actions-self-hosted-runners-solve-it","Do Actions self-hosted runners solve it?",[12,7718,7719],{},"Partially. You run the runner binary on a machine of yours, register on the repo or organization, and jobs go there instead of the managed pool. 
Minute cost goes to zero — you only pay for the machine.",[12,7721,7722],{},"But three catches:",[67,7724,7725,7731,7737],{},[70,7726,7727,7730],{},[27,7728,7729],{},"Runner maintenance."," The version updates frequently; outdated runners start failing silently. Without automation, it becomes manual operation work.",[70,7732,7733,7736],{},[27,7734,7735],{},"Manual scaling."," If the team has five devs opening 20 simultaneous PRs, one runner serializes everything. You need N runners — and provisioning\u002Fdeprovisioning on demand requires additional tooling.",[70,7738,7739,7742],{},[27,7740,7741],{},"Security on public repos."," Self-hosted runners on a public repo are an open door for any malicious fork to run arbitrary code on your machine. Always restricted to private repos or trusted organizations.",[12,7744,7745],{},"The mature solution is Actions Runner Controller (ARC): an operator that brings up runners on-demand on a Kubernetes cluster or similar. Solves scaling, but adds an entire infrastructure layer — not trivial.",[19,7747,7749],{"id":7748},"gitlab-ci-does-the-heavyweight-competitor-still-make-sense","GitLab CI: does the \"heavyweight\" competitor still make sense?",[12,7751,7752],{},"GitLab CI is older than Actions, more complete in features, and less popular outside teams already on the GitLab platform. The right question isn't \"is GitLab CI better than Actions?\", it's \"is it worth migrating to GitLab to use GitLab CI?\"",[368,7754,7756],{"id":7755},"what-gitlab-ci-does-better","What GitLab CI does better",[2734,7758,7759,7769,7775,7781,7787],{},[70,7760,7761,7764,7765,7768],{},[27,7762,7763],{},"Dependency graph (DAG)."," Native, without external tooling. You declare ",[231,7766,7767],{},"needs: [job_a, job_b]"," and jobs run in parallel respecting dependencies. 
For workflows with 30+ jobs (large monorepo, multiple languages, multi-environment deploy), this is the difference between 8 minutes and 25 minutes per pipeline.",[70,7770,7771,7774],{},[27,7772,7773],{},"Parent-child pipelines."," A large pipeline can trigger child pipelines with conditional logic — useful for monorepos where only changed services need to build.",[70,7776,7777,7780],{},[27,7778,7779],{},"Included image registry."," Each project comes with a native private container registry. No configuring secrets for Amazon ECR, Docker Hub, or similar.",[70,7782,7783,7786],{},[27,7784,7785],{},"Pages, security scanning, code quality, dependency scanning"," — all embedded in the platform. In Actions each is a separate marketplace action.",[70,7788,7789,7792],{},[27,7790,7791],{},"Deep merge request integration."," Pipelines appear inside the MR with coverage diff, bundle size comparison, build time comparison. In Actions checks appear as links — in GitLab they're structured data.",[368,7794,7796],{"id":7795},"where-gitlab-ci-charges-dearly","Where GitLab CI charges dearly",[12,7798,7799],{},"Two dimensions.",[12,7801,7802],{},[27,7803,7804],{},"SaaS pricing:",[119,7806,7807,7819],{},[122,7808,7809],{},[125,7810,7811,7814,7816],{},[128,7812,7813],{},"Plan",[128,7815,136],{},[128,7817,7818],{},"Monthly minute limit",[141,7820,7821,7832,7843],{},[125,7822,7823,7826,7829],{},[146,7824,7825],{},"Free",[146,7827,7828],{},"US$0\u002Fuser",[146,7830,7831],{},"400 minutes",[125,7833,7834,7837,7840],{},[146,7835,7836],{},"Premium",[146,7838,7839],{},"US$29\u002Fuser\u002Fmonth (R$145)",[146,7841,7842],{},"10,000 minutes",[125,7844,7845,7848,7851],{},[146,7846,7847],{},"Ultimate",[146,7849,7850],{},"US$99\u002Fuser\u002Fmonth (R$495)",[146,7852,7853],{},"50,000 minutes",[12,7855,7856],{},"For a five-dev team on Premium, that's US$145\u002Fmonth — R$725. That's just the entry ticket; extra minutes cost separately. 
For a team of 20, US$580\u002Fmonth = R$2,900 just on subscription.",[12,7858,7859,7862],{},[27,7860,7861],{},"Self-hosted Community Edition"," is free and removes that license cost — but:",[2734,7864,7865,7868,7871,7874],{},[70,7866,7867],{},"Realistic minimum: 4 vCPU, 8 GB RAM (16 GB if you'll use registry + pages + scanning).",[70,7869,7870],{},"Adequate VPS in Brazil: R$120 to R$250\u002Fmonth.",[70,7872,7873],{},"Ops: 2 to 4 hours\u002Fmonth on updates, backup, monitoring.",[70,7875,7876],{},"Monthly updates. GitLab has a rigid cadence; staying three versions behind opens documented security holes.",[12,7878,7879],{},"In real production, self-hosted GitLab is less work than Kubernetes but more than Actions SaaS. It's a real server that you operate.",[19,7881,7883],{"id":7882},"drone-ci-and-woodpecker-the-minimalist-alternative","Drone CI and Woodpecker: the minimalist alternative",[12,7885,7886],{},"Drone CI was born in 2014 as the \"container-native CI\": each pipeline step is a container, no magic. In 2020 the company behind it (Drone Inc.) was acquired by Harness, and the product gained a commercial Cloud version. The community fork Woodpecker remains 100% open-source, with API compatible with Drone.",[368,7888,7890],{"id":7889},"what-dronewoodpecker-does-well","What Drone\u002FWoodpecker does well",[2734,7892,7893,7902,7915,7921,7927],{},[70,7894,7895,7898,7899,7901],{},[27,7896,7897],{},"Simple YAML."," Each step declares an image and a command. No DSL, no reusable actions with their own semantics. What you run locally with ",[231,7900,2405],{}," is what runs on CI.",[70,7903,7904,7907,7908,2402,7911,7914],{},[27,7905,7906],{},"Container-native."," There's no Java \"executor\", no Python agent running steps. Each step is an isolated container. 
Reproducing the error locally is literal: copy the ",[231,7909,7910],{},"image:",[231,7912,7913],{},"commands:"," from YAML and run in the terminal.",[70,7916,7917,7920],{},[27,7918,7919],{},"Self-hosted from day one."," There's no \"free Drone Cloud\" pulling features into the paid version. The server + runners are the whole product.",[70,7922,7923,7926],{},[27,7924,7925],{},"Plugins via container."," Each plugin (SSH deploy, Slack, Docker push, AWS) is a published image. Versioned like any other dependency.",[70,7928,7929,7932],{},[27,7930,7931],{},"Supports multiple code hosts."," GitHub, GitLab, Bitbucket, Gitea, Forgejo — all on the same Drone server.",[368,7934,7936],{"id":7935},"where-dronewoodpecker-charges","Where Drone\u002FWoodpecker charges",[2734,7938,7939,7945,7951,7957],{},[70,7940,7941,7944],{},[27,7942,7943],{},"Smaller community."," When you hit an obscure bug, Stack Overflow has five answers, not fifty. Github issues are your main source.",[70,7946,7947,7950],{},[27,7948,7949],{},"Non-trivial operation at scale."," One server + one runner is easy. Five autoscaling runners behind a queue is tooling you assemble — auto-scaling isn't built-in.",[70,7952,7953,7956],{},[27,7954,7955],{},"Drone Cloud is paid."," If you want SaaS, you go to Harness; the free tier is limited. That's why the recommendation is always self-hosted.",[70,7958,7959,7962],{},[27,7960,7961],{},"Modest documentation."," Covers the happy path; edge cases you discover by reading code.",[368,7964,7966],{"id":7965},"why-woodpecker-instead-of-drone-in-2026","Why Woodpecker instead of Drone in 2026",[12,7968,7969],{},"Vanilla Drone still works, but Harness has prioritized the commercial cloud version. Woodpecker is the community fork of original Drone — 100% open-source, no paid version pulling features, monthly active releases, engaged community. 
API and YAML compatible with Drone, so migration is trivial: swap the server URL.",[12,7971,7972,7973,7976],{},"For any small team self-hosting in 2026, ",[27,7974,7975],{},"Woodpecker is the better choice than vanilla Drone",". Same architecture, without the overhead of a company controlling the roadmap.",[19,7978,7980],{"id":7979},"which-is-cheaper-in-2026","Which is cheaper in 2026?",[12,7982,7983],{},"Real total monthly cost, considering a five-dev team with medium volume (300 builds\u002Fmonth, average 8-minute Linux builds):",[119,7985,7986,8002],{},[122,7987,7988],{},[125,7989,7990,7993,7996,7999],{},[128,7991,7992],{},"Option",[128,7994,7995],{},"Fixed cost",[128,7997,7998],{},"Variable cost",[128,8000,8001],{},"Estimated total\u002Fmonth",[141,8003,8004,8019,8032,8046,8060,8074,8088],{},[125,8005,8006,8009,8012,8015],{},[146,8007,8008],{},"Woodpecker self-hosted (VPS R$80)",[146,8010,8011],{},"R$80",[146,8013,8014],{},"R$0",[146,8016,8017],{},[27,8018,8011],{},[125,8020,8021,8024,8026,8028],{},[146,8022,8023],{},"Actions public repos (open-source)",[146,8025,8014],{},[146,8027,8014],{},[146,8029,8030],{},[27,8031,8014],{},[125,8033,8034,8037,8039,8042],{},[146,8035,8036],{},"Actions private repos (free tier 2000 min)",[146,8038,8014],{},[146,8040,8041],{},"R$0 to R$50",[146,8043,8044],{},[27,8045,8041],{},[125,8047,8048,8051,8053,8056],{},[146,8049,8050],{},"Actions Linux paid (medium volume)",[146,8052,8014],{},[146,8054,8055],{},"R$50 to R$150",[146,8057,8058],{},[27,8059,8055],{},[125,8061,8062,8065,8068,8070],{},[146,8063,8064],{},"GitLab CI self-hosted (VPS R$200)",[146,8066,8067],{},"R$200",[146,8069,8014],{},[146,8071,8072],{},[27,8073,8067],{},[125,8075,8076,8079,8081,8084],{},[146,8077,8078],{},"Actions with heavy macOS builds",[146,8080,8014],{},[146,8082,8083],{},"R$300 to R$1,500",[146,8085,8086],{},[27,8087,8083],{},[125,8089,8090,8093,8096,8099],{},[146,8091,8092],{},"GitLab CI SaaS Premium (5 
devs)",[146,8094,8095],{},"R$725",[146,8097,8098],{},"R$0 to R$200",[146,8100,8101],{},[27,8102,8103],{},"R$725 to R$925",[12,8105,8106,8107,8110],{},"Absolute cost winner: ",[27,8108,8109],{},"Woodpecker self-hosted"," for a team willing to operate a VPS. Costs the same as a lunch per month and runs CI for ten projects without breaking a sweat.",[12,8112,8113],{},"If ops isn't available, the Actions free plan is the next option. It fits a small team with light workflows; when it goes past US$30\u002Fmonth variable, it's worth at least evaluating self-hosted runners.",[19,8115,8117],{"id":8116},"which-has-the-best-developer-experience","Which has the best developer experience?",[12,8119,8120],{},"DX in CI\u002FCD is measured in three dimensions: time from \"blank yml\" to \"first passing build\", debug capability when it goes wrong, and ability to evolve the workflow when it grows.",[119,8122,8123,8133],{},[122,8124,8125],{},[125,8126,8127,8130],{},[128,8128,8129],{},"Dimension",[128,8131,8132],{},"Winner",[141,8134,8135,8143,8151,8159,8167,8175],{},[125,8136,8137,8140],{},[146,8138,8139],{},"Ready templates \u002F accessibility",[146,8141,8142],{},"GitHub Actions (marketplace + onboarding)",[125,8144,8145,8148],{},[146,8146,8147],{},"Complex workflows \u002F DAG \u002F monorepo",[146,8149,8150],{},"GitLab CI (parent-child + native needs)",[125,8152,8153,8156],{},[146,8154,8155],{},"Local reproduction \u002F conceptual simplicity",[146,8157,8158],{},"Drone\u002FWoodpecker (each step = container)",[125,8160,8161,8164],{},[146,8162,8163],{},"Intermittent failure debug",[146,8165,8166],{},"Drone\u002FWoodpecker (re-running an isolated step is trivial)",[125,8168,8169,8172],{},[146,8170,8171],{},"Cross-project composition",[146,8173,8174],{},"GitHub Actions (reusable workflows + composite actions)",[125,8176,8177,8180],{},[146,8178,8179],{},"Time-to-first-pipeline (zero to hello world)",[146,8181,7548],{},[12,8183,8184],{},"There's no absolute winner. 
For a team that values starting fast, Actions. For a team with complex workflow from day one (monorepo, multiple languages), GitLab CI. For a team that wants to understand exactly what's happening, Drone\u002FWoodpecker.",[12,8186,7567],{},[19,8188,8190],{"id":8189},"comparative-table-12-honest-criteria","Comparative table: 12 honest criteria",[119,8192,8193,8205],{},[122,8194,8195],{},[125,8196,8197,8199,8201,8203],{},[128,8198,2982],{},[128,8200,7548],{},[128,8202,7554],{},[128,8204,7560],{},[141,8206,8207,8220,8234,8248,8262,8276,8290,8304,8318,8329,8342,8356],{},[125,8208,8209,8212,8215,8218],{},[146,8210,8211],{},"BR startup monthly cost (5 devs, medium volume)",[146,8213,8214],{},"R$0 to R$150",[146,8216,8217],{},"R$80 to R$925",[146,8219,8011],{},[125,8221,8222,8225,8228,8231],{},[146,8223,8224],{},"Real free tier (2026)",[146,8226,8227],{},"2000 min\u002Fmonth private, unlimited public",[146,8229,8230],{},"400 min\u002Fmonth SaaS",[146,8232,8233],{},"Unlimited self-hosted",[125,8235,8236,8239,8242,8245],{},[146,8237,8238],{},"Self-hosted available",[146,8240,8241],{},"Yes (runners), SaaS UI",[146,8243,8244],{},"Yes (full CE)",[146,8246,8247],{},"Yes (the only sensible way)",[125,8249,8250,8253,8256,8259],{},[146,8251,8252],{},"Large workflow complexity",[146,8254,8255],{},"Good (reusable workflows)",[146,8257,8258],{},"Excellent (DAG + parent-child)",[146,8260,8261],{},"Modest (linear + matrix)",[125,8263,8264,8267,8270,8273],{},[146,8265,8266],{},"Monorepo support",[146,8268,8269],{},"Medium (paths filter)",[146,8271,8272],{},"Excellent (rules + parent-child)",[146,8274,8275],{},"Medium (when filter)",[125,8277,8278,8281,8284,8287],{},[146,8279,8280],{},"Integrated container registry",[146,8282,8283],{},"No (needs separate GHCR)",[146,8285,8286],{},"Yes, native",[146,8288,8289],{},"No (use external registry)",[125,8291,8292,8295,8298,8301],{},[146,8293,8294],{},"Secret management",[146,8296,8297],{},"Repo + org + environment",[146,8299,8300],{},"Project + 
group + instance",[146,8302,8303],{},"Server + repo",[125,8305,8306,8309,8312,8315],{},[146,8307,8308],{},"Out-of-the-box parallel jobs",[146,8310,8311],{},"Yes (matrix)",[146,8313,8314],{},"Yes (parallel + DAG)",[146,8316,8317],{},"Yes (depends_on)",[125,8319,8320,8323,8325,8327],{},[146,8321,8322],{},"BR community \u002F Portuguese material",[146,8324,4914],{},[146,8326,3159],{},[146,8328,4919],{},[125,8330,8331,8334,8337,8339],{},[146,8332,8333],{},"PT-BR documentation",[146,8335,8336],{},"Partial (official in English)",[146,8338,3139],{},[146,8340,8341],{},"Practically zero",[125,8343,8344,8347,8350,8353],{},[146,8345,8346],{},"GitHub\u002FGitLab\u002FGitea integration",[146,8348,8349],{},"GitHub only",[146,8351,8352],{},"GitLab only (external mirror is workaround)",[146,8354,8355],{},"All three + Bitbucket",[125,8357,8358,8361,8364,8367],{},[146,8359,8360],{},"Ideal usage range",[146,8362,8363],{},"1 to 50 devs on GitHub",[146,8365,8366],{},"5 to 500 devs on a single platform",[146,8368,8369],{},"1 to 30 devs with ops available",[12,8371,8372],{},"No competitor has a column without caveats. The right tool depends on the team profile.",[19,8374,8376],{"id":8375},"decision-by-team-profile","Decision by team profile",[12,8378,8379],{},"Four concrete recommendations, no \"depends\".",[368,8381,8383],{"id":8382},"indie-hacker-or-public-repo-on-github","Indie hacker or public repo on GitHub",[12,8385,8386,8389],{},[27,8387,8388],{},"Use GitHub Actions free plan."," Public repos have unlimited minutes. You have no reason to look for an alternative. If a year from now the project grows, you reassess.",[368,8391,8393],{"id":8392},"early-startup-on-github-private-repos-r10k-to-r50k-mrr","Early startup on GitHub, private repos, R$10k to R$50k MRR",[12,8395,8396,8399],{},[27,8397,8398],{},"Stay on the Actions free plan."," The 2000-minute free tier fits a two- to three-dev team with reasonable workflows. 
When it starts going over, first reduce waste (paths filter to not run everything on every PR, decent dependency cache) before migrating.",[12,8401,8402],{},"If you consistently go over US$30\u002Fmonth variable, consider migrating to self-hosted runners or Woodpecker in parallel.",[368,8404,8406],{"id":8405},"startup-with-r50k-to-r200k-mrr-on-github-high-ci-volume","Startup with R$50k to R$200k MRR on GitHub, high CI volume",[12,8408,8409,8412],{},[27,8410,8411],{},"Hybrid."," Use Actions for light workflows (lint, unit tests) and self-hosted runners (via ARC) or Woodpecker for heavy workflows (E2E, long builds, deploys). You pay per minute where it pays off and zero where it hurts.",[12,8414,8415],{},"For teams with regular macOS builds, consider a dedicated Mac mini as a self-hosted runner. R$10 thousand investment pays off in three months if you spend US$200\u002Fmonth on macOS Actions today.",[368,8417,8419],{"id":8418},"br-company-on-self-hosted-gitlab","BR company on self-hosted GitLab",[12,8421,8422,8425],{},[27,8423,8424],{},"Use native GitLab CI."," You're already paying the cost of operating GitLab; CI comes along at no additional cost. Migrating to another tool would mean operating two systems in parallel — not worth it.",[368,8427,8429],{"id":8428},"small-team-aggressively-controlling-cost","Small team aggressively controlling cost",[12,8431,8432,8435],{},[27,8433,8434],{},"Woodpecker self-hosted on R$80 VPS."," Runs CI for ten projects without sweating. Costs in ops 1 to 2 hours\u002Fmonth. 
If the team has someone with affinity for Unix tools, it's the most economical and most predictable option in the bill — you know exactly the cost every month.",[19,8437,8439],{"id":8438},"where-heroctl-comes-in-as-runner-infrastructure","Where HeroCtl comes in as runner infrastructure",[12,8441,8442],{},"Self-hosted CI\u002FCD is exactly the type of workload that HeroCtl orchestrates well: long services (CI server, database that maintains build history), services that scale horizontally (runners that go up and down with the queue), services with persistence needs (artifact cache).",[12,8444,8445],{},"Instead of operating Docker Compose on a single server — single point of failure — you describe the setup as a job configuration:",[2734,8447,8448,8454,8460,8466],{},[70,8449,8450,8453],{},[27,8451,8452],{},"Drone\u002FWoodpecker server as a long job",", with a single replica and persistent volume for the history database.",[70,8455,8456,8459],{},[27,8457,8458],{},"N runners as a replicable job",", scaling horizontally. The orchestrator distributes the runners across nodes; if a server dies, the runners migrate to the others.",[70,8461,8462,8465],{},[27,8463,8464],{},"Integrated backup"," for CI state (server database + artifact cache), without setting up external tooling.",[70,8467,8468,8471],{},[27,8469,8470],{},"Integrated metrics and logs"," — you see CPU, memory, build time usage without bringing up a separate observability stack.",[12,8473,8474,8475,2629,8478,8481],{},"Practical difference: instead of operating a CI stack in parallel to your production cluster, it becomes part of the same cluster, with the same high-availability guarantees. If a server falls, the runners migrate. 
If you want to double capacity for a heavy sprint, change ",[231,8476,8477],{},"replicas: 4",[231,8479,8480],{},"replicas: 8"," in the configuration file.",[12,8483,8484],{},"For those on the \"starting simple but going to grow\" frontier, this solves the transition without needing to swap tools mid-path.",[19,8486,8488],{"id":8487},"the-4-expensive-errors-in-self-hosted-cicd-and-how-to-avoid-them","The 4 expensive errors in self-hosted CI\u002FCD (and how to avoid them)",[368,8490,8492],{"id":8491},"error-1-silent-stale-cache","Error 1: silent stale cache",[12,8494,8495],{},"The symptom: build passes locally, fails on CI because of a dependency that exists on the dev machine but not on the fresh image. Worst case: passes on CI too because previous cache contains the dependency, but fails in production when the image is built without cache.",[12,8497,8498,8499,571,8502,571,8505,571,8508,8511],{},"The fix: a decent cache assumes it can be invalidated at any moment. Whenever you change dependency manifest files (",[231,8500,8501],{},"package.json",[231,8503,8504],{},"go.mod",[231,8506,8507],{},"requirements.txt",[231,8509,8510],{},"Cargo.toml","), include them in the cache key. Periodically (weekly), force build without cache to detect drift.",[368,8513,8515],{"id":8514},"error-2-secret-committed-by-accident","Error 2: secret committed by accident",[12,8517,8518],{},"The symptom: someone pasted a token in the CI config \"just to test\", committed, forgot. The repo is public; in 12 hours the token is in use by someone who shouldn't.",[12,8520,8521,8522,8525,8526,8529],{},"The fix: two layered mechanisms. ",[27,8523,8524],{},"Pre-commit hook"," that scans for common key patterns (AWS, Stripe, GitHub PAT). ",[27,8527,8528],{},"Automatic rotation"," of critical tokens (90 days max). If a token leaks, the exposure window is finite.",[12,8531,8532],{},"In GitLab CI, use variables with \"masked\" and \"protected\" flags. 
In Actions, use environment-scoped secrets with approval rules. In Drone\u002FWoodpecker, secrets are scoped per repo and never appear in logs by default.",[368,8534,8536],{"id":8535},"error-3-runner-running-on-the-same-production-server","Error 3: runner running on the same production server",[12,8538,8539],{},"The symptom: heavy build consumes CPU\u002FRAM, production gets slow, latency rises, alarm goes off, on-call wakes up. Common real case in small teams trying to save a machine.",[12,8541,8542],{},"The fix: runners on a separate server from production, always. If the budget is tight, a runner on a R$30\u002Fmonth VPS is still cheaper than a production incident during business hours.",[368,8544,8546],{"id":8545},"error-4-workflow-that-doesnt-run-outside-ci","Error 4: workflow that doesn't run outside CI",[12,8548,8549],{},"The symptom: the CI build is a 200-line script inline in the YAML, with 15 environment variables that the system injects. When something goes wrong, nobody can reproduce locally without reverse-engineering the YAML.",[12,8551,8552,8553,571,8556,8559,8560,8563,8564,8567],{},"The fix: CI should call commands that exist as ",[231,8554,8555],{},"Makefile",[231,8557,8558],{},"script\u002Fbuild",", or ",[231,8561,8562],{},"package.json scripts",". The CI YAML orchestrates; the logic lives in versioned scripts that run in any terminal. If you can't run ",[231,8565,8566],{},"make ci"," locally and see the same result, your CI isn't portable.",[12,8569,8570],{},"Drone\u002FWoodpecker forces this discipline by design (each step is a container). Actions and GitLab CI allow the anti-pattern; it's up to the team to avoid it.",[19,8572,5250],{"id":5249},[12,8574,8575],{},[27,8576,8577],{},"Is GitHub Actions faster than Drone?",[12,8579,8580],{},"In raw build, depends on the runner: the Actions managed pool uses 2-vCPU machines; a self-hosted runner on a 4-vCPU machine is faster. 
In total pipeline time (including queue), Actions wins when there's volume — they have huge idle capacity. Self-hosted (any tool) has queue proportional to the number of runners you provision.",[12,8582,8583],{},[27,8584,8585],{},"Can I use GitLab CI with a repo on GitHub?",[12,8587,8588],{},"Technically yes, via \"pull mirror\" (GitLab mirrors GitHub and runs CI on it). In practice it's fragile: webhooks lag, status checks don't return to GitHub the way the team expects, MRs get confusing. Not worth it. If you're on GitHub, use Actions or Drone\u002FWoodpecker (which accept GitHub as a native source).",[12,8590,8591],{},[27,8592,8593],{},"Are GitHub Actions self-hosted runners worth it?",[12,8595,8596],{},"For private repos with high volume (more than 5000 minutes\u002Fmonth), yes. You save paid minutes in exchange for operating machines. For public repos, no — security risk (malicious forks running code on your machine) outweighs benefit. ARC (Actions Runner Controller) helps at scale, but adds a Kubernetes layer; only makes sense for teams already operating K8s.",[12,8598,8599],{},[27,8600,8601],{},"Is Woodpecker stable enough in 2026?",[12,8603,8604],{},"Yes. Monthly releases, solid codebase (forked from Drone, which had five years of production), active community. In production at hundreds of small and medium-sized companies. It's not the safe bet \"nobody is fired for choosing it\" — that's Actions or GitLab — but in three years of fork there hasn't been a serious community incident. For a small self-hosted team, it's the sensible choice.",[12,8606,8607],{},[27,8608,8609],{},"Do ArgoCD and FluxCD enter this decision?",[12,8611,8612],{},"Not directly. ArgoCD\u002FFluxCD are GitOps tools for Kubernetes, not CI. They watch a Git repo and apply changes to the cluster. CI continues to be Actions\u002FGitLab\u002FDrone generating images; ArgoCD\u002FFlux apply the deploy. If you're not on Kubernetes, ArgoCD\u002FFlux aren't for you. 
Teams on other orchestrators deploy directly from CI or via the orchestrator's APIs.",[12,8614,8615],{},[27,8616,8617],{},"How many simultaneous runners for a team of 5 devs?",[12,8619,8620],{},"Practical rule: one runner per two active developers, plus one extra runner so long builds don't block fast PRs. Five-dev team: three runners is comfortable. At peak times (release day), bring it up to five temporarily. Each runner consumes 1 to 2 GB of RAM in typical workload; an 8-GB server runs four runners without pain.",[12,8622,8623],{},[27,8624,8625],{},"Dependency cache — which tool handles it best?",[12,8627,8628,8629,8632],{},"GitLab CI has native cache by key\u002Fpath, integrated with the own registry. GitHub Actions has ",[231,8630,8631],{},"actions\u002Fcache"," (free, 10 GB per repo). Drone\u002FWoodpecker depend on external cache plugin (S3, local MinIO) — more setup but more flexible. At moderate volume, all solve it; at high volume (large monorepo), GitLab has an advantage from registry integration.",[12,8634,8635],{},[27,8636,8637],{},"Migrating from GitHub Actions to Drone — how much work?",[12,8639,8640],{},"For simple workflows (build + test + push), 1 to 2 days. For workflows that depend on many marketplace actions, 1 to 2 weeks (need to rewrite each action as container). The biggest pain is secrets and environments — export and reimport carefully. Recommendation: migrate project by project, not all at once.",[12,8642,8643],{},[27,8644,8645],{},"Can I run Actions and Drone\u002FWoodpecker runners on the same server?",[12,8647,8648],{},"Technically yes, both are containers. In practice, isolation improves: runners on separate servers avoid one heavy build affecting the other. If the budget is tight, two R$40\u002Fmonth servers are better than one R$80\u002Fmonth server with everything together.",[12,8650,7567],{},[19,8652,8654],{"id":8653},"in-summary","In summary",[12,8656,8657],{},"CI\u002FCD in 2026 has no winning tool. 
There are only usage profiles and honest tradeoffs:",[2734,8659,8660,8666,8672,8678,8684],{},[70,8661,8662,8665],{},[27,8663,8664],{},"You're on GitHub and the volume is light to medium?"," Actions, free plan. Don't look for a problem where there isn't one.",[70,8667,8668,8671],{},[27,8669,8670],{},"You're on self-hosted GitLab?"," Native GitLab CI. Already paid for.",[70,8673,8674,8677],{},[27,8675,8676],{},"You want predictable cost and have 1-2h\u002Fmonth of ops available?"," Woodpecker self-hosted on a R$80 VPS. The most economical choice.",[70,8679,8680,8683],{},[27,8681,8682],{},"You have a large monorepo with a complex workflow?"," GitLab CI (native DAG) or Actions with reusable workflows.",[70,8685,8686,8689],{},[27,8687,8688],{},"You have high volume and per-minute pricing pain?"," Hybrid: Actions for light workflows, self-hosted runners for heavy ones.",[12,8691,8692],{},"If you're thinking of running the CI tool as part of the same cluster that serves production — with real high availability, integrated metrics, and backup without setting up a separate stack — install HeroCtl on a server:",[224,8694,8696],{"className":8695,"code":2948,"language":2529},[2527],[231,8697,2948],{"__ignoreMap":229},[12,8699,8700],{},"From there, describing a Woodpecker server with three auto-scalable runners is a fifty-line configuration file. The cluster takes care of the rest: it distributes the runners across the nodes, keeps the server available even through machine loss, backs up state, and exposes metrics in the embedded panel.",[12,8702,8703,8704,8706,8707,8711],{},"For more context, it's also worth reading ",[3336,8705,3344],{"href":3343}," — it discusses when it makes sense to leave docker-compose for a replicated control plane, with the same honest criteria as this post. 
And for teams thinking of simplifying the entire orchestration stack, ",[3336,8708,8710],{"href":8709},"\u002Fen\u002Fblog\u002Fmigrating-from-kubernetes-to-simpler-stack","Migrating from Kubernetes to a simpler stack — real case"," has numbers from a real migration, with gains and pains.",[12,8713,8714,8715,8718],{},"The CI\u002FCD choice is one of the most enduring decisions of the team. Worth a few days of honest comparison before copying the ",[231,8716,8717],{},".github\u002Fworkflows\u002F"," from the previous project — because three years later, migration costs dearly.",{"title":229,"searchDepth":244,"depth":244,"links":8720},[8721,8722,8723,8728,8732,8737,8738,8739,8740,8747,8748,8754,8755],{"id":7526,"depth":244,"text":7527},{"id":7570,"depth":244,"text":7571},{"id":7607,"depth":244,"text":7608,"children":8724},[8725,8726,8727],{"id":7618,"depth":271,"text":7619},{"id":7654,"depth":271,"text":7655},{"id":7715,"depth":271,"text":7716},{"id":7748,"depth":244,"text":7749,"children":8729},[8730,8731],{"id":7755,"depth":271,"text":7756},{"id":7795,"depth":271,"text":7796},{"id":7882,"depth":244,"text":7883,"children":8733},[8734,8735,8736],{"id":7889,"depth":271,"text":7890},{"id":7935,"depth":271,"text":7936},{"id":7965,"depth":271,"text":7966},{"id":7979,"depth":244,"text":7980},{"id":8116,"depth":244,"text":8117},{"id":8189,"depth":244,"text":8190},{"id":8375,"depth":244,"text":8376,"children":8741},[8742,8743,8744,8745,8746],{"id":8382,"depth":271,"text":8383},{"id":8392,"depth":271,"text":8393},{"id":8405,"depth":271,"text":8406},{"id":8418,"depth":271,"text":8419},{"id":8428,"depth":271,"text":8429},{"id":8438,"depth":244,"text":8439},{"id":8487,"depth":244,"text":8488,"children":8749},[8750,8751,8752,8753],{"id":8491,"depth":271,"text":8492},{"id":8514,"depth":271,"text":8515},{"id":8535,"depth":271,"text":8536},{"id":8545,"depth":271,"text":8546},{"id":5249,"depth":244,"text":5250},{"id":8653,"depth":244,"text":8654},"comparison","2026-05-15","GitHub 
Actions won mindshare but has minute costs. GitLab CI is more complete but heavier. Drone (and Woodpecker) self-hosted runs on a small VPS. Practical comparison.",{},"\u002Fen\u002Fblog\u002Fgithub-actions-vs-gitlab-ci-vs-drone","14 min",{"title":7511,"description":8758},{"loc":8760},"en\u002Fblog\u002Fgithub-actions-vs-gitlab-ci-vs-drone",[8766,8767,8768,8769,8756],"github-actions","gitlab-ci","drone","ci-cd","BvIS4ezr8i7bx--48vwBoRynlj2ckXAEl-aYeNn6diE",{"id":8772,"title":8773,"author":7,"body":8774,"category":3378,"cover":3379,"date":11771,"description":11772,"draft":3382,"extension":3383,"lastReviewed":3379,"meta":11773,"navigation":411,"path":11774,"readingTime":6387,"seo":11775,"sitemap":11776,"stem":11777,"tags":11778,"__hash__":11783},"blog_en\u002Fen\u002Fblog\u002Fmonitoring-stack-prometheus-grafana-loki.md","Complete monitoring stack in 2026: Prometheus + Grafana + Loki step by step",{"type":9,"value":8775,"toc":11751},[8776,8789,8792,8794,8801,8812,8815,8818,8822,8825,8863,8878,8882,8885,8918,8921,8925,8931,8934,8937,8964,8978,8982,8987,8994,9000,9006,9354,9373,9376,9414,9437,9441,9446,9452,9732,9742,9761,9765,9771,9774,9853,9873,9879,9917,9927,9931,9935,9941,10195,10198,10201,10388,10398,10405,10409,10414,10423,10429,10538,10545,10552,10572,10579,10583,10587,10590,10600,10938,10945,11163,11174,11188,11192,11196,11202,11226,11233,11237,11242,11245,11248,11281,11284,11316,11325,11336,11340,11343,11346,11372,11375,11379,11429,11432,11435,11439,11570,11573,11577,11580,11590,11600,11612,11618,11620,11626,11632,11638,11644,11653,11667,11673,11679,11683,11686,11714,11726,11729,11745,11748],[12,8777,8778,8779,571,8782,571,8785,8788],{},"The first time your site crashes at three in the morning, you'll discover something uncomfortable: there's no way to know what happened. There's no CPU graph, there's no log of the container that died, there's no alert that warned beforehand. 
You'll open a terminal, connect to the servers one by one, run ",[231,8780,8781],{},"top",[231,8783,8784],{},"df",[231,8786,8787],{},"journalctl",", and try to reconstruct a crime scene that has already gone cold.",[12,8790,8791],{},"This post is the shortcut so you don't go through that. In four hours, with R$80 to R$120 per month of hardware, you can assemble the open-source observability stack that replaces Datadog, New Relic and CloudWatch in 95% of cases for a startup. The tools are the same ones that run inside companies with tens of thousands of servers — and they fit comfortably on a small VPS for a team starting out.",[19,8793,22],{"id":21},[12,8795,8796,8797,8800],{},"The standard open-source monitoring stack in 2026 — ",[27,8798,8799],{},"Prometheus + Grafana + Loki + Alertmanager"," — fits on a single 4 GB RAM VPS and covers metrics, centralized logs, dashboards and alerts. This tutorial shows the step-by-step setup for a 4-to-5-server cluster in approximately four hours, using docker-compose or orchestrator job specs.",[12,8802,8803,8804,8807,8808,8811],{},"For a Brazilian startup, that means ",[27,8805,8806],{},"R$80 to R$120 per month of hardware"," vs ",[27,8809,8810],{},"R$1,000 to R$2,000 per month"," of equivalent observability SaaS. The time cost is honest: four hours of initial setup plus two to four hours per month of ongoing maintenance.",[12,8813,8814],{},"The deliverable at the end of the tutorial: dashboards for CPU, RAM, disk, network and HTTP metrics; searchable logs with 30-day retention; alerts routed to Slack, Discord or email. 
Prerequisites: 1 Linux VPS with 4 GB of RAM and 50 GB SSD, Docker installed, and a domain with DNS controlled by you.",[12,8816,8817],{},"The choice between running this stack on a dedicated VPS outside the production cluster or as a job inside the orchestrator itself is an architectural decision — we cover both options in step 8 and in \"How to run this inside HeroCtl\".",[19,8819,8821],{"id":8820},"what-each-component-does-in-one-sentence","What each component does, in one sentence",[12,8823,8824],{},"Before installing anything, it's worth understanding the role of each piece. The stack has six components; the confusion usually comes from thinking one of them is \"the monitoring system\". It's not. Each one does one thing.",[2734,8826,8827,8833,8839,8845,8851,8857],{},[70,8828,8829,8832],{},[27,8830,8831],{},"Prometheus"," is a time-series database (TSDB) that collects metrics via HTTP scrape — it pulls the numbers, nobody pushes them. Retains 15 days by default.",[70,8834,8835,8838],{},[27,8836,8837],{},"Grafana"," is the visualization layer. It connects to Prometheus, to Loki, to Postgres, to almost any structured source, and draws graphs.",[70,8840,8841,8844],{},[27,8842,8843],{},"Loki"," is the log piece. 
Its query syntax is similar to Prometheus's, it indexes only labels (not log content), and because of that it's roughly ten times cheaper to run than ELK.",[70,8846,8847,8850],{},[27,8848,8849],{},"Promtail"," (or Grafana Agent, which is replacing Promtail in 2026) is the collector that reads log files on each server and ships them to Loki.",[70,8852,8853,8856],{},[27,8854,8855],{},"node_exporter"," runs on each monitored node and exposes an HTTP endpoint with CPU, RAM, disk and network in Prometheus format.",[70,8858,8859,8862],{},[27,8860,8861],{},"Alertmanager"," receives the alerts fired by Prometheus rules and handles routing — Slack, email, PagerDuty, arbitrary webhook.",[12,8864,8865,8866,571,8869,571,8872,571,8875,101],{},"Whoever designs their first stack usually confuses Prometheus with \"monitoring\" and Grafana with \"pretty dashboards\". The real separation is: ",[27,8867,8868],{},"Prometheus stores numbers",[27,8870,8871],{},"Loki stores text",[27,8873,8874],{},"Grafana shows both",[27,8876,8877],{},"Alertmanager screams when some number is wrong",[19,8879,8881],{"id":8880},"whats-the-recommended-architecture","What's the recommended architecture?",[12,8883,8884],{},"For a cluster of 3 to 5 servers running production applications, the topology that has worked in practice is to separate the observability server from the rest. A dedicated node, outside the cluster it monitors, with two objectives: not dying together when the cluster dies, and not competing for CPU\u002FRAM with the real application.",[2734,8886,8887,8893,8899,8909],{},[70,8888,8889,8892],{},[27,8890,8891],{},"1 dedicated \"observability\" server",", 4 GB of RAM, 50 GB SSD. 
Runs Prometheus, Grafana, Loki, Alertmanager.",[70,8894,8895,8898],{},[27,8896,8897],{},"Each monitored server"," runs only two lightweight processes: node_exporter (system metrics) and Promtail (log shipping).",[70,8900,8901,8904,8905,8908],{},[27,8902,8903],{},"Your applications"," expose a ",[231,8906,8907],{},"\u002Fmetrics"," endpoint in Prometheus format. If you use a popular framework, there's a ready-made client. If not, it's a library of a few dozen lines.",[70,8910,8911,8913,8914,8917],{},[27,8912,8837],{}," is accessible via subdomain (",[231,8915,8916],{},"monitor.yourdomain.com",") with automatic TLS and basic authentication in front.",[12,8919,8920],{},"This separation has a cost: you pay for one more VPS. In exchange, when the main cluster goes down, you can still look at the graphs to understand what happened. For a startup, this trade-off almost always pays off — the worst monitoring scenario is discovering that the only thing that stopped along with the site was the system that would have warned you that the site stopped.",[19,8922,8924],{"id":8923},"step-1-how-to-provision-the-observability-vps","Step 1 — How to provision the observability VPS?",[12,8926,8927,8928,101],{},"Estimated time: ",[27,8929,8930],{},"10 minutes",[12,8932,8933],{},"Any cheap provider works. The two with the best cost-benefit for the Brazilian case today are Hetzner (CPX21 at 7.99 EUR per month with 3 vCPUs and 4 GB of RAM, datacenter in Germany) and DigitalOcean (Basic Droplet at US$24 per month with the same configuration, datacenters closer to Brazil). For a monitoring workload, scrape latency from a European datacenter isn't a problem — Prometheus pulls every 15 seconds by default, so 200ms of RTT between Hetzner and your servers doesn't get in the way.",[12,8935,8936],{},"Provisioning:",[67,8938,8939,8942,8945,8951,8958],{},[70,8940,8941],{},"Create the VPS with Ubuntu 24.04 LTS or Debian 12.",[70,8943,8944],{},"Add your public SSH key on creation. 
Disable password login.",[70,8946,8947,8948,101],{},"Install Docker and the compose plugin: ",[231,8949,8950],{},"curl -fsSL https:\u002F\u002Fget.docker.com | sh && apt install docker-compose-plugin",[70,8952,8953,8954,8957],{},"Configure the firewall: port 22 (SSH) open, port 443 (HTTPS) open, all others closed. Internal ports (3000, 9090, 3100, 9093) stay accessible only via ",[231,8955,8956],{},"localhost"," of the VPS itself — the reverse proxy exposes Grafana via 443.",[70,8959,8960,8961,8963],{},"Point DNS: create an A record ",[231,8962,8916],{}," to the VPS IP.",[12,8965,341,8966,8969,8970,8973,8974,8977],{},[231,8967,8968],{},"docker --version"," returns 26.x or higher; ",[231,8971,8972],{},"dig monitor.yourdomain.com"," returns the correct IP; ",[231,8975,8976],{},"ssh root@monitor.yourdomain.com"," connects without asking for a password.",[19,8979,8981],{"id":8980},"step-2-how-to-bring-up-the-stack-via-docker-compose","Step 2 — How to bring up the stack via docker-compose?",[12,8983,8927,8984,101],{},[27,8985,8986],{},"45 minutes",[12,8988,8989,8990,8993],{},"Create the working directory at ",[231,8991,8992],{},"\u002Fopt\u002Fobservability\u002F"," with the following structure:",[224,8995,8998],{"className":8996,"code":8997,"language":2529},[2527],"\u002Fopt\u002Fobservability\u002F\n├── docker-compose.yml\n├── prometheus\u002F\n│   ├── prometheus.yml\n│   └── alerts.yml\n├── alertmanager\u002F\n│   └── alertmanager.yml\n├── loki\u002F\n│   └── loki-config.yml\n└── grafana\u002F\n    └── provisioning\u002F\n        └── datasources\u002F\n            └── datasources.yml\n",[231,8999,8997],{"__ignoreMap":229},[12,9001,9002,9003,1272],{},"The abbreviated but functional ",[231,9004,9005],{},"docker-compose.yml",[224,9007,9011],{"className":9008,"code":9009,"language":9010,"meta":229,"style":229},"services:\n  prometheus:\n    image: prom\u002Fprometheus:v2.55.0\n    volumes:\n      - 
.\u002Fprometheus:\u002Fetc\u002Fprometheus\n      - prometheus-data:\u002Fprometheus\n    command:\n      - '--config.file=\u002Fetc\u002Fprometheus\u002Fprometheus.yml'\n      - '--storage.tsdb.retention.time=30d'\n      - '--web.enable-lifecycle'  # allows reload via HTTP POST\n    ports:\n      - '127.0.0.1:9090:9090'\n    restart: unless-stopped\n\n  grafana:\n    image: grafana\u002Fgrafana:11.3.0\n    volumes:\n      - grafana-data:\u002Fvar\u002Flib\u002Fgrafana\n      - .\u002Fgrafana\u002Fprovisioning:\u002Fetc\u002Fgrafana\u002Fprovisioning\n    environment:\n      - GF_SECURITY_ADMIN_PASSWORD=${GRAFANA_PASSWORD}\n      - GF_USERS_ALLOW_SIGN_UP=false\n    ports:\n      - '127.0.0.1:3000:3000'\n    restart: unless-stopped\n\n  loki:\n    image: grafana\u002Floki:3.2.0\n    volumes:\n      - .\u002Floki\u002Floki-config.yml:\u002Fetc\u002Floki\u002Fconfig.yml\n      - loki-data:\u002Floki\n    command: -config.file=\u002Fetc\u002Floki\u002Fconfig.yml\n    ports:\n      - '127.0.0.1:3100:3100'\n    restart: unless-stopped\n\n  alertmanager:\n    image: prom\u002Falertmanager:v0.27.0\n    volumes:\n      - .\u002Falertmanager:\u002Fetc\u002Falertmanager\n    ports:\n      - '127.0.0.1:9093:9093'\n    restart: unless-stopped\n\nvolumes:\n  prometheus-data:\n  grafana-data:\n  loki-data:\n","yaml",[231,9012,9013,9022,9029,9039,9046,9054,9061,9068,9075,9082,9092,9099,9106,9116,9120,9127,9136,9142,9149,9156,9163,9170,9177,9183,9190,9198,9202,9209,9218,9224,9231,9238,9247,9253,9260,9268,9272,9279,9288,9294,9301,9307,9314,9322,9326,9333,9340,9347],{"__ignoreMap":229},[234,9014,9015,9019],{"class":236,"line":237},[234,9016,9018],{"class":9017},"services",[234,9020,9021],{"class":387},":\n",[234,9023,9024,9027],{"class":236,"line":244},[234,9025,9026],{"class":9017},"  prometheus",[234,9028,9021],{"class":387},[234,9030,9031,9034,9036],{"class":236,"line":271},[234,9032,9033],{"class":9017},"    
image",[234,9035,6562],{"class":387},[234,9037,9038],{"class":255},"prom\u002Fprometheus:v2.55.0\n",[234,9040,9041,9044],{"class":236,"line":415},[234,9042,9043],{"class":9017},"    volumes",[234,9045,9021],{"class":387},[234,9047,9048,9051],{"class":236,"line":434},[234,9049,9050],{"class":387},"      - ",[234,9052,9053],{"class":255},".\u002Fprometheus:\u002Fetc\u002Fprometheus\n",[234,9055,9056,9058],{"class":236,"line":459},[234,9057,9050],{"class":387},[234,9059,9060],{"class":255},"prometheus-data:\u002Fprometheus\n",[234,9062,9063,9066],{"class":236,"line":464},[234,9064,9065],{"class":9017},"    command",[234,9067,9021],{"class":387},[234,9069,9070,9072],{"class":236,"line":479},[234,9071,9050],{"class":387},[234,9073,9074],{"class":255},"'--config.file=\u002Fetc\u002Fprometheus\u002Fprometheus.yml'\n",[234,9076,9077,9079],{"class":236,"line":484},[234,9078,9050],{"class":387},[234,9080,9081],{"class":255},"'--storage.tsdb.retention.time=30d'\n",[234,9083,9084,9086,9089],{"class":236,"line":490},[234,9085,9050],{"class":387},[234,9087,9088],{"class":255},"'--web.enable-lifecycle'",[234,9090,9091],{"class":240},"  # allows reload via HTTP POST\n",[234,9093,9094,9097],{"class":236,"line":508},[234,9095,9096],{"class":9017},"    ports",[234,9098,9021],{"class":387},[234,9100,9101,9103],{"class":236,"line":529},[234,9102,9050],{"class":387},[234,9104,9105],{"class":255},"'127.0.0.1:9090:9090'\n",[234,9107,9108,9111,9113],{"class":236,"line":535},[234,9109,9110],{"class":9017},"    restart",[234,9112,6562],{"class":387},[234,9114,9115],{"class":255},"unless-stopped\n",[234,9117,9118],{"class":236,"line":546},[234,9119,412],{"emptyLinePlaceholder":411},[234,9121,9122,9125],{"class":236,"line":552},[234,9123,9124],{"class":9017},"  
grafana",[234,9126,9021],{"class":387},[234,9128,9129,9131,9133],{"class":236,"line":557},[234,9130,9033],{"class":9017},[234,9132,6562],{"class":387},[234,9134,9135],{"class":255},"grafana\u002Fgrafana:11.3.0\n",[234,9137,9138,9140],{"class":236,"line":594},[234,9139,9043],{"class":9017},[234,9141,9021],{"class":387},[234,9143,9144,9146],{"class":236,"line":635},[234,9145,9050],{"class":387},[234,9147,9148],{"class":255},"grafana-data:\u002Fvar\u002Flib\u002Fgrafana\n",[234,9150,9151,9153],{"class":236,"line":643},[234,9152,9050],{"class":387},[234,9154,9155],{"class":255},".\u002Fgrafana\u002Fprovisioning:\u002Fetc\u002Fgrafana\u002Fprovisioning\n",[234,9157,9158,9161],{"class":236,"line":659},[234,9159,9160],{"class":9017},"    environment",[234,9162,9021],{"class":387},[234,9164,9165,9167],{"class":236,"line":683},[234,9166,9050],{"class":387},[234,9168,9169],{"class":255},"GF_SECURITY_ADMIN_PASSWORD=${GRAFANA_PASSWORD}\n",[234,9171,9172,9174],{"class":236,"line":695},[234,9173,9050],{"class":387},[234,9175,9176],{"class":255},"GF_USERS_ALLOW_SIGN_UP=false\n",[234,9178,9179,9181],{"class":236,"line":717},[234,9180,9096],{"class":9017},[234,9182,9021],{"class":387},[234,9184,9185,9187],{"class":236,"line":723},[234,9186,9050],{"class":387},[234,9188,9189],{"class":255},"'127.0.0.1:3000:3000'\n",[234,9191,9192,9194,9196],{"class":236,"line":729},[234,9193,9110],{"class":9017},[234,9195,6562],{"class":387},[234,9197,9115],{"class":255},[234,9199,9200],{"class":236,"line":734},[234,9201,412],{"emptyLinePlaceholder":411},[234,9203,9204,9207],{"class":236,"line":771},[234,9205,9206],{"class":9017},"  
loki",[234,9208,9021],{"class":387},[234,9210,9211,9213,9215],{"class":236,"line":776},[234,9212,9033],{"class":9017},[234,9214,6562],{"class":387},[234,9216,9217],{"class":255},"grafana\u002Floki:3.2.0\n",[234,9219,9220,9222],{"class":236,"line":815},[234,9221,9043],{"class":9017},[234,9223,9021],{"class":387},[234,9225,9226,9228],{"class":236,"line":820},[234,9227,9050],{"class":387},[234,9229,9230],{"class":255},".\u002Floki\u002Floki-config.yml:\u002Fetc\u002Floki\u002Fconfig.yml\n",[234,9232,9233,9235],{"class":236,"line":826},[234,9234,9050],{"class":387},[234,9236,9237],{"class":255},"loki-data:\u002Floki\n",[234,9239,9240,9242,9244],{"class":236,"line":846},[234,9241,9065],{"class":9017},[234,9243,6562],{"class":387},[234,9245,9246],{"class":255},"-config.file=\u002Fetc\u002Floki\u002Fconfig.yml\n",[234,9248,9249,9251],{"class":236,"line":859},[234,9250,9096],{"class":9017},[234,9252,9021],{"class":387},[234,9254,9255,9257],{"class":236,"line":872},[234,9256,9050],{"class":387},[234,9258,9259],{"class":255},"'127.0.0.1:3100:3100'\n",[234,9261,9262,9264,9266],{"class":236,"line":898},[234,9263,9110],{"class":9017},[234,9265,6562],{"class":387},[234,9267,9115],{"class":255},[234,9269,9270],{"class":236,"line":913},[234,9271,412],{"emptyLinePlaceholder":411},[234,9273,9274,9277],{"class":236,"line":1886},[234,9275,9276],{"class":9017},"  
alertmanager",[234,9278,9021],{"class":387},[234,9280,9281,9283,9285],{"class":236,"line":1901},[234,9282,9033],{"class":9017},[234,9284,6562],{"class":387},[234,9286,9287],{"class":255},"prom\u002Falertmanager:v0.27.0\n",[234,9289,9290,9292],{"class":236,"line":1920},[234,9291,9043],{"class":9017},[234,9293,9021],{"class":387},[234,9295,9296,9298],{"class":236,"line":1944},[234,9297,9050],{"class":387},[234,9299,9300],{"class":255},".\u002Falertmanager:\u002Fetc\u002Falertmanager\n",[234,9302,9303,9305],{"class":236,"line":1962},[234,9304,9096],{"class":9017},[234,9306,9021],{"class":387},[234,9308,9309,9311],{"class":236,"line":1978},[234,9310,9050],{"class":387},[234,9312,9313],{"class":255},"'127.0.0.1:9093:9093'\n",[234,9315,9316,9318,9320],{"class":236,"line":1984},[234,9317,9110],{"class":9017},[234,9319,6562],{"class":387},[234,9321,9115],{"class":255},[234,9323,9324],{"class":236,"line":1992},[234,9325,412],{"emptyLinePlaceholder":411},[234,9327,9328,9331],{"class":236,"line":2004},[234,9329,9330],{"class":9017},"volumes",[234,9332,9021],{"class":387},[234,9334,9335,9338],{"class":236,"line":2014},[234,9336,9337],{"class":9017},"  prometheus-data",[234,9339,9021],{"class":387},[234,9341,9342,9345],{"class":236,"line":2020},[234,9343,9344],{"class":9017},"  grafana-data",[234,9346,9021],{"class":387},[234,9348,9349,9352],{"class":236,"line":2029},[234,9350,9351],{"class":9017},"  loki-data",[234,9353,9021],{"class":387},[12,9355,9356,9357,9360,9361,9364,9365,9368,9369,9372],{},"Three important points in this file. First, all ports are bound to ",[231,9358,9359],{},"127.0.0.1"," — none of the services is directly accessible from the internet. Second, volumes are named (not bind mounts), so they survive ",[231,9362,9363],{},"docker-compose down",". 
Third, the Grafana password comes from an environment variable: create a ",[231,9366,9367],{},".env"," next to the compose with ",[231,9370,9371],{},"GRAFANA_PASSWORD=something_long_random"," and never commit that.",[12,9374,9375],{},"Bring up the stack:",[224,9377,9379],{"className":226,"code":9378,"language":228,"meta":229,"style":229},"cd \u002Fopt\u002Fobservability\ndocker compose up -d\ndocker compose ps  # all should be \"Up\" \u002F healthy\n",[231,9380,9381,9389,9402],{"__ignoreMap":229},[234,9382,9383,9386],{"class":236,"line":237},[234,9384,9385],{"class":251},"cd",[234,9387,9388],{"class":255}," \u002Fopt\u002Fobservability\n",[234,9390,9391,9393,9396,9399],{"class":236,"line":244},[234,9392,1118],{"class":247},[234,9394,9395],{"class":255}," compose",[234,9397,9398],{"class":255}," up",[234,9400,9401],{"class":251}," -d\n",[234,9403,9404,9406,9408,9411],{"class":236,"line":271},[234,9405,1118],{"class":247},[234,9407,9395],{"class":255},[234,9409,9410],{"class":255}," ps",[234,9412,9413],{"class":240},"  # all should be \"Up\" \u002F healthy\n",[12,9415,9416,9417,9420,9421,1895,9424,9420,9427,1895,9430,9433,9434,101],{},"Quick validation: ",[231,9418,9419],{},"curl localhost:9090\u002F-\u002Fready"," returns ",[231,9422,9423],{},"Prometheus Server is Ready",[231,9425,9426],{},"curl localhost:3100\u002Fready",[231,9428,9429],{},"ready",[231,9431,9432],{},"curl localhost:3000\u002Fapi\u002Fhealth"," returns JSON with ",[231,9435,9436],{},"\"database\": \"ok\"",[19,9438,9440],{"id":9439},"step-3-how-to-configure-prometheus-scrapes","Step 3 — How to configure Prometheus scrapes?",[12,9442,8927,9443,101],{},[27,9444,9445],{},"30 minutes",[12,9447,352,9448,9451],{},[231,9449,9450],{},"prometheus\u002Fprometheus.yml"," is where you tell Prometheus which endpoints to scrape. 
For a 4-server cluster, it looks like this:",[224,9453,9455],{"className":9008,"code":9454,"language":9010,"meta":229,"style":229},"global:\n  scrape_interval: 15s\n  evaluation_interval: 15s\n\nalerting:\n  alertmanagers:\n    - static_configs:\n        - targets: ['alertmanager:9093']\n\nrule_files:\n  - 'alerts.yml'\n\nscrape_configs:\n  - job_name: 'prometheus'\n    static_configs:\n      - targets: ['localhost:9090']\n\n  - job_name: 'node'\n    static_configs:\n      - targets:\n          - 'server-1.yourdomain.internal:9100'\n          - 'server-2.yourdomain.internal:9100'\n          - 'server-3.yourdomain.internal:9100'\n          - 'worker-1.yourdomain.internal:9100'\n        labels:\n          environment: 'production'\n\n  - job_name: 'apps'\n    static_configs:\n      - targets:\n          - 'api.yourdomain.internal:8080'\n          - 'worker.yourdomain.internal:8080'\n        labels:\n          environment: 'production'\n    metrics_path: '\u002Fmetrics'\n",[231,9456,9457,9464,9474,9483,9487,9494,9501,9511,9528,9532,9539,9547,9551,9558,9570,9577,9590,9594,9605,9611,9619,9627,9634,9641,9648,9655,9665,9669,9680,9686,9694,9701,9708,9714,9722],{"__ignoreMap":229},[234,9458,9459,9462],{"class":236,"line":237},[234,9460,9461],{"class":9017},"global",[234,9463,9021],{"class":387},[234,9465,9466,9469,9471],{"class":236,"line":244},[234,9467,9468],{"class":9017},"  scrape_interval",[234,9470,6562],{"class":387},[234,9472,9473],{"class":255},"15s\n",[234,9475,9476,9479,9481],{"class":236,"line":271},[234,9477,9478],{"class":9017},"  evaluation_interval",[234,9480,6562],{"class":387},[234,9482,9473],{"class":255},[234,9484,9485],{"class":236,"line":415},[234,9486,412],{"emptyLinePlaceholder":411},[234,9488,9489,9492],{"class":236,"line":434},[234,9490,9491],{"class":9017},"alerting",[234,9493,9021],{"class":387},[234,9495,9496,9499],{"class":236,"line":459},[234,9497,9498],{"class":9017},"  
alertmanagers",[234,9500,9021],{"class":387},[234,9502,9503,9506,9509],{"class":236,"line":464},[234,9504,9505],{"class":387},"    - ",[234,9507,9508],{"class":9017},"static_configs",[234,9510,9021],{"class":387},[234,9512,9513,9516,9519,9522,9525],{"class":236,"line":479},[234,9514,9515],{"class":387},"        - ",[234,9517,9518],{"class":9017},"targets",[234,9520,9521],{"class":387},": [",[234,9523,9524],{"class":255},"'alertmanager:9093'",[234,9526,9527],{"class":387},"]\n",[234,9529,9530],{"class":236,"line":484},[234,9531,412],{"emptyLinePlaceholder":411},[234,9533,9534,9537],{"class":236,"line":490},[234,9535,9536],{"class":9017},"rule_files",[234,9538,9021],{"class":387},[234,9540,9541,9544],{"class":236,"line":508},[234,9542,9543],{"class":387},"  - ",[234,9545,9546],{"class":255},"'alerts.yml'\n",[234,9548,9549],{"class":236,"line":529},[234,9550,412],{"emptyLinePlaceholder":411},[234,9552,9553,9556],{"class":236,"line":535},[234,9554,9555],{"class":9017},"scrape_configs",[234,9557,9021],{"class":387},[234,9559,9560,9562,9565,9567],{"class":236,"line":546},[234,9561,9543],{"class":387},[234,9563,9564],{"class":9017},"job_name",[234,9566,6562],{"class":387},[234,9568,9569],{"class":255},"'prometheus'\n",[234,9571,9572,9575],{"class":236,"line":552},[234,9573,9574],{"class":9017},"    
static_configs",[234,9576,9021],{"class":387},[234,9578,9579,9581,9583,9585,9588],{"class":236,"line":557},[234,9580,9050],{"class":387},[234,9582,9518],{"class":9017},[234,9584,9521],{"class":387},[234,9586,9587],{"class":255},"'localhost:9090'",[234,9589,9527],{"class":387},[234,9591,9592],{"class":236,"line":594},[234,9593,412],{"emptyLinePlaceholder":411},[234,9595,9596,9598,9600,9602],{"class":236,"line":635},[234,9597,9543],{"class":387},[234,9599,9564],{"class":9017},[234,9601,6562],{"class":387},[234,9603,9604],{"class":255},"'node'\n",[234,9606,9607,9609],{"class":236,"line":643},[234,9608,9574],{"class":9017},[234,9610,9021],{"class":387},[234,9612,9613,9615,9617],{"class":236,"line":659},[234,9614,9050],{"class":387},[234,9616,9518],{"class":9017},[234,9618,9021],{"class":387},[234,9620,9621,9624],{"class":236,"line":683},[234,9622,9623],{"class":387},"          - ",[234,9625,9626],{"class":255},"'server-1.yourdomain.internal:9100'\n",[234,9628,9629,9631],{"class":236,"line":695},[234,9630,9623],{"class":387},[234,9632,9633],{"class":255},"'server-2.yourdomain.internal:9100'\n",[234,9635,9636,9638],{"class":236,"line":717},[234,9637,9623],{"class":387},[234,9639,9640],{"class":255},"'server-3.yourdomain.internal:9100'\n",[234,9642,9643,9645],{"class":236,"line":723},[234,9644,9623],{"class":387},[234,9646,9647],{"class":255},"'worker-1.yourdomain.internal:9100'\n",[234,9649,9650,9653],{"class":236,"line":729},[234,9651,9652],{"class":9017},"        labels",[234,9654,9021],{"class":387},[234,9656,9657,9660,9662],{"class":236,"line":734},[234,9658,9659],{"class":9017},"          
environment",[234,9661,6562],{"class":387},[234,9663,9664],{"class":255},"'production'\n",[234,9666,9667],{"class":236,"line":771},[234,9668,412],{"emptyLinePlaceholder":411},[234,9670,9671,9673,9675,9677],{"class":236,"line":776},[234,9672,9543],{"class":387},[234,9674,9564],{"class":9017},[234,9676,6562],{"class":387},[234,9678,9679],{"class":255},"'apps'\n",[234,9681,9682,9684],{"class":236,"line":815},[234,9683,9574],{"class":9017},[234,9685,9021],{"class":387},[234,9687,9688,9690,9692],{"class":236,"line":820},[234,9689,9050],{"class":387},[234,9691,9518],{"class":9017},[234,9693,9021],{"class":387},[234,9695,9696,9698],{"class":236,"line":826},[234,9697,9623],{"class":387},[234,9699,9700],{"class":255},"'api.yourdomain.internal:8080'\n",[234,9702,9703,9705],{"class":236,"line":846},[234,9704,9623],{"class":387},[234,9706,9707],{"class":255},"'worker.yourdomain.internal:8080'\n",[234,9709,9710,9712],{"class":236,"line":859},[234,9711,9652],{"class":9017},[234,9713,9021],{"class":387},[234,9715,9716,9718,9720],{"class":236,"line":872},[234,9717,9659],{"class":9017},[234,9719,6562],{"class":387},[234,9721,9664],{"class":255},[234,9723,9724,9727,9729],{"class":236,"line":898},[234,9725,9726],{"class":9017},"    metrics_path",[234,9728,6562],{"class":387},[234,9730,9731],{"class":255},"'\u002Fmetrics'\n",[12,9733,9734,9735,9737,9738,9741],{},"For larger clusters or those that change composition frequently, swap ",[231,9736,9508],{}," for ",[231,9739,9740],{},"file_sd_configs"," pointing to a JSON you generate automatically. For 4 static servers, the file above resolves it.",[12,9743,9744,9745,9748,9749,9752,9753,9756,9757,9760],{},"Reload: ",[231,9746,9747],{},"curl -X POST localhost:9090\u002F-\u002Freload",". Check at ",[231,9750,9751],{},"localhost:9090\u002Ftargets"," if all jobs are ",[231,9754,9755],{},"UP",". 
The ones that are ",[231,9758,9759],{},"DOWN"," haven't been instrumented yet — that's step 4.",[19,9762,9764],{"id":9763},"step-4-how-to-install-node_exporter-on-each-server","Step 4 — How to install node_exporter on each server?",[12,9766,8927,9767,9770],{},[27,9768,9769],{},"15 minutes"," for 4 servers.",[12,9772,9773],{},"On each monitored server, run node_exporter. There are two ways: a direct binary via systemd, or a Docker container. In 2026 the consensus is the container — easier to update and isolate. On each node:",[224,9775,9777],{"className":226,"code":9776,"language":228,"meta":229,"style":229},"docker run -d \\\n  --name node-exporter \\\n  --restart unless-stopped \\\n  --net=\"host\" \\\n  --pid=\"host\" \\\n  -v \"\u002F:\u002Fhost:ro,rslave\" \\\n  prom\u002Fnode-exporter:v1.8.2 \\\n  --path.rootfs=\u002Fhost\n",[231,9778,9779,9792,9802,9812,9822,9831,9841,9848],{"__ignoreMap":229},[234,9780,9781,9783,9786,9789],{"class":236,"line":237},[234,9782,1118],{"class":247},[234,9784,9785],{"class":255}," run",[234,9787,9788],{"class":251}," -d",[234,9790,9791],{"class":383}," \\\n",[234,9793,9794,9797,9800],{"class":236,"line":244},[234,9795,9796],{"class":251},"  --name",[234,9798,9799],{"class":255}," node-exporter",[234,9801,9791],{"class":383},[234,9803,9804,9807,9810],{"class":236,"line":271},[234,9805,9806],{"class":251},"  --restart",[234,9808,9809],{"class":255}," unless-stopped",[234,9811,9791],{"class":383},[234,9813,9814,9817,9820],{"class":236,"line":415},[234,9815,9816],{"class":251},"  --net=",[234,9818,9819],{"class":255},"\"host\"",[234,9821,9791],{"class":383},[234,9823,9824,9827,9829],{"class":236,"line":434},[234,9825,9826],{"class":251},"  --pid=",[234,9828,9819],{"class":255},[234,9830,9791],{"class":383},[234,9832,9833,9836,9839],{"class":236,"line":459},[234,9834,9835],{"class":251},"  -v",[234,9837,9838],{"class":255}," 
\"\u002F:\u002Fhost:ro,rslave\"",[234,9840,9791],{"class":383},[234,9842,9843,9846],{"class":236,"line":464},[234,9844,9845],{"class":255},"  prom\u002Fnode-exporter:v1.8.2",[234,9847,9791],{"class":383},[234,9849,9850],{"class":236,"line":479},[234,9851,9852],{"class":251},"  --path.rootfs=\u002Fhost\n",[12,9854,352,9855,9858,9859,9862,9863,571,9866,2402,9869,9872],{},[231,9856,9857],{},"--net=host"," is necessary for it to see real network interfaces. The bind mount on ",[231,9860,9861],{},"\u002Fhost"," allows reading ",[231,9864,9865],{},"\u002Fproc",[231,9867,9868],{},"\u002Fsys",[231,9870,9871],{},"\u002Fetc\u002Fpasswd"," from the host (read-only) without running the container with root privileges.",[12,9874,9875,9876,1272],{},"Firewall: open port 9100 only to the observability server IP. On Ubuntu with ",[231,9877,9878],{},"ufw",[224,9880,9882],{"className":226,"code":9881,"language":228,"meta":229,"style":229},"ufw allow from \u003COBSERVABILITY_IP> to any port 9100\n",[231,9883,9884],{"__ignoreMap":229},[234,9885,9886,9888,9891,9894,9897,9900,9903,9905,9908,9911,9914],{"class":236,"line":237},[234,9887,9878],{"class":247},[234,9889,9890],{"class":255}," allow",[234,9892,9893],{"class":255}," from",[234,9895,9896],{"class":383}," \u003C",[234,9898,9899],{"class":255},"OBSERVABILITY_I",[234,9901,9902],{"class":387},"P",[234,9904,1935],{"class":383},[234,9906,9907],{"class":255}," to",[234,9909,9910],{"class":255}," any",[234,9912,9913],{"class":255}," port",[234,9915,9916],{"class":251}," 9100\n",[12,9918,9919,9920,9923,9924,101],{},"Validation: from the observability server, ",[231,9921,9922],{},"curl http:\u002F\u002Fserver-1.yourdomain.internal:9100\u002Fmetrics"," should return hundreds of lines starting with ",[231,9925,9926],{},"# HELP node_cpu_seconds_total...",[19,9928,9930],{"id":9929},"step-5-how-to-configure-loki-promtail","Step 5 — How to configure Loki + Promtail?",[12,9932,8927,9933,101],{},[27,9934,9445],{},[12,9936,9937,9938,1272],{},"Loki 
is already running in the compose from step 2. What's missing is the ",[231,9939,9940],{},"loki-config.yml",[224,9942,9944],{"className":9008,"code":9943,"language":9010,"meta":229,"style":229},"auth_enabled: false\n\nserver:\n  http_listen_port: 3100\n\ncommon:\n  path_prefix: \u002Floki\n  storage:\n    filesystem:\n      chunks_directory: \u002Floki\u002Fchunks\n      rules_directory: \u002Floki\u002Frules\n  replication_factor: 1\n  ring:\n    kvstore:\n      store: inmemory\n\nschema_config:\n  configs:\n    - from: 2024-01-01\n      store: tsdb\n      object_store: filesystem\n      schema: v13\n      index:\n        prefix: index_\n        period: 24h\n\nlimits_config:\n  retention_period: 720h  # 30 days\n  reject_old_samples: true\n  reject_old_samples_max_age: 168h\n",[231,9945,9946,9956,9960,9967,9977,9981,9988,9998,10005,10012,10022,10032,10042,10049,10056,10066,10070,10077,10084,10095,10104,10114,10124,10131,10141,10151,10155,10162,10175,10185],{"__ignoreMap":229},[234,9947,9948,9951,9953],{"class":236,"line":237},[234,9949,9950],{"class":9017},"auth_enabled",[234,9952,6562],{"class":387},[234,9954,9955],{"class":251},"false\n",[234,9957,9958],{"class":236,"line":244},[234,9959,412],{"emptyLinePlaceholder":411},[234,9961,9962,9965],{"class":236,"line":271},[234,9963,9964],{"class":9017},"server",[234,9966,9021],{"class":387},[234,9968,9969,9972,9974],{"class":236,"line":415},[234,9970,9971],{"class":9017},"  http_listen_port",[234,9973,6562],{"class":387},[234,9975,9976],{"class":251},"3100\n",[234,9978,9979],{"class":236,"line":434},[234,9980,412],{"emptyLinePlaceholder":411},[234,9982,9983,9986],{"class":236,"line":459},[234,9984,9985],{"class":9017},"common",[234,9987,9021],{"class":387},[234,9989,9990,9993,9995],{"class":236,"line":464},[234,9991,9992],{"class":9017},"  path_prefix",[234,9994,6562],{"class":387},[234,9996,9997],{"class":255},"\u002Floki\n",[234,9999,10000,10003],{"class":236,"line":479},[234,10001,10002],{"class":9017},"  
storage",[234,10004,9021],{"class":387},[234,10006,10007,10010],{"class":236,"line":484},[234,10008,10009],{"class":9017},"    filesystem",[234,10011,9021],{"class":387},[234,10013,10014,10017,10019],{"class":236,"line":490},[234,10015,10016],{"class":9017},"      chunks_directory",[234,10018,6562],{"class":387},[234,10020,10021],{"class":255},"\u002Floki\u002Fchunks\n",[234,10023,10024,10027,10029],{"class":236,"line":508},[234,10025,10026],{"class":9017},"      rules_directory",[234,10028,6562],{"class":387},[234,10030,10031],{"class":255},"\u002Floki\u002Frules\n",[234,10033,10034,10037,10039],{"class":236,"line":529},[234,10035,10036],{"class":9017},"  replication_factor",[234,10038,6562],{"class":387},[234,10040,10041],{"class":251},"1\n",[234,10043,10044,10047],{"class":236,"line":535},[234,10045,10046],{"class":9017},"  ring",[234,10048,9021],{"class":387},[234,10050,10051,10054],{"class":236,"line":546},[234,10052,10053],{"class":9017},"    kvstore",[234,10055,9021],{"class":387},[234,10057,10058,10061,10063],{"class":236,"line":552},[234,10059,10060],{"class":9017},"      store",[234,10062,6562],{"class":387},[234,10064,10065],{"class":255},"inmemory\n",[234,10067,10068],{"class":236,"line":557},[234,10069,412],{"emptyLinePlaceholder":411},[234,10071,10072,10075],{"class":236,"line":594},[234,10073,10074],{"class":9017},"schema_config",[234,10076,9021],{"class":387},[234,10078,10079,10082],{"class":236,"line":635},[234,10080,10081],{"class":9017},"  configs",[234,10083,9021],{"class":387},[234,10085,10086,10088,10090,10092],{"class":236,"line":643},[234,10087,9505],{"class":387},[234,10089,391],{"class":9017},[234,10091,6562],{"class":387},[234,10093,10094],{"class":251},"2024-01-01\n",[234,10096,10097,10099,10101],{"class":236,"line":659},[234,10098,10060],{"class":9017},[234,10100,6562],{"class":387},[234,10102,10103],{"class":255},"tsdb\n",[234,10105,10106,10109,10111],{"class":236,"line":683},[234,10107,10108],{"class":9017},"      
object_store",[234,10110,6562],{"class":387},[234,10112,10113],{"class":255},"filesystem\n",[234,10115,10116,10119,10121],{"class":236,"line":695},[234,10117,10118],{"class":9017},"      schema",[234,10120,6562],{"class":387},[234,10122,10123],{"class":255},"v13\n",[234,10125,10126,10129],{"class":236,"line":717},[234,10127,10128],{"class":9017},"      index",[234,10130,9021],{"class":387},[234,10132,10133,10136,10138],{"class":236,"line":723},[234,10134,10135],{"class":9017},"        prefix",[234,10137,6562],{"class":387},[234,10139,10140],{"class":255},"index_\n",[234,10142,10143,10146,10148],{"class":236,"line":729},[234,10144,10145],{"class":9017},"        period",[234,10147,6562],{"class":387},[234,10149,10150],{"class":255},"24h\n",[234,10152,10153],{"class":236,"line":734},[234,10154,412],{"emptyLinePlaceholder":411},[234,10156,10157,10160],{"class":236,"line":771},[234,10158,10159],{"class":9017},"limits_config",[234,10161,9021],{"class":387},[234,10163,10164,10167,10169,10172],{"class":236,"line":776},[234,10165,10166],{"class":9017},"  retention_period",[234,10168,6562],{"class":387},[234,10170,10171],{"class":255},"720h",[234,10173,10174],{"class":240},"  # 30 days\n",[234,10176,10177,10180,10182],{"class":236,"line":815},[234,10178,10179],{"class":9017},"  reject_old_samples",[234,10181,6562],{"class":387},[234,10183,10184],{"class":251},"true\n",[234,10186,10187,10190,10192],{"class":236,"line":820},[234,10188,10189],{"class":9017},"  reject_old_samples_max_age",[234,10191,6562],{"class":387},[234,10193,10194],{"class":255},"168h\n",[12,10196,10197],{},"Filesystem storage is enough to start. When you exceed 50 GB of logs per day or want 90+ days of retention, migrate to S3 (or compatible). 
Don't migrate earlier — it complicates operations without real gain.",[12,10199,10200],{},"On each monitored server, install Promtail (or Grafana Agent), also as a container:",[224,10202,10204],{"className":9008,"code":10203,"language":9010,"meta":229,"style":229},"# \u002Fopt\u002Fpromtail\u002Fpromtail-config.yml on each server\nserver:\n  http_listen_port: 9080\n\nclients:\n  - url: http:\u002F\u002Fmonitor.yourdomain.com:3100\u002Floki\u002Fapi\u002Fv1\u002Fpush\n\nscrape_configs:\n  - job_name: system\n    static_configs:\n      - targets: [localhost]\n        labels:\n          job: varlogs\n          host: ${HOSTNAME}\n          __path__: \u002Fvar\u002Flog\u002F*.log\n\n  - job_name: docker\n    docker_sd_configs:\n      - host: unix:\u002F\u002F\u002Fvar\u002Frun\u002Fdocker.sock\n    relabel_configs:\n      - source_labels: ['__meta_docker_container_name']\n        target_label: 'container'\n",[231,10205,10206,10211,10217,10226,10230,10237,10249,10253,10259,10270,10276,10288,10294,10304,10314,10324,10328,10339,10346,10357,10364,10378],{"__ignoreMap":229},[234,10207,10208],{"class":236,"line":237},[234,10209,10210],{"class":240},"# \u002Fopt\u002Fpromtail\u002Fpromtail-config.yml on each 
server\n",[234,10212,10213,10215],{"class":236,"line":244},[234,10214,9964],{"class":9017},[234,10216,9021],{"class":387},[234,10218,10219,10221,10223],{"class":236,"line":271},[234,10220,9971],{"class":9017},[234,10222,6562],{"class":387},[234,10224,10225],{"class":251},"9080\n",[234,10227,10228],{"class":236,"line":415},[234,10229,412],{"emptyLinePlaceholder":411},[234,10231,10232,10235],{"class":236,"line":434},[234,10233,10234],{"class":9017},"clients",[234,10236,9021],{"class":387},[234,10238,10239,10241,10244,10246],{"class":236,"line":459},[234,10240,9543],{"class":387},[234,10242,10243],{"class":9017},"url",[234,10245,6562],{"class":387},[234,10247,10248],{"class":255},"http:\u002F\u002Fmonitor.yourdomain.com:3100\u002Floki\u002Fapi\u002Fv1\u002Fpush\n",[234,10250,10251],{"class":236,"line":464},[234,10252,412],{"emptyLinePlaceholder":411},[234,10254,10255,10257],{"class":236,"line":479},[234,10256,9555],{"class":9017},[234,10258,9021],{"class":387},[234,10260,10261,10263,10265,10267],{"class":236,"line":484},[234,10262,9543],{"class":387},[234,10264,9564],{"class":9017},[234,10266,6562],{"class":387},[234,10268,10269],{"class":255},"system\n",[234,10271,10272,10274],{"class":236,"line":490},[234,10273,9574],{"class":9017},[234,10275,9021],{"class":387},[234,10277,10278,10280,10282,10284,10286],{"class":236,"line":508},[234,10279,9050],{"class":387},[234,10281,9518],{"class":9017},[234,10283,9521],{"class":387},[234,10285,8956],{"class":255},[234,10287,9527],{"class":387},[234,10289,10290,10292],{"class":236,"line":529},[234,10291,9652],{"class":9017},[234,10293,9021],{"class":387},[234,10295,10296,10299,10301],{"class":236,"line":535},[234,10297,10298],{"class":9017},"          job",[234,10300,6562],{"class":387},[234,10302,10303],{"class":255},"varlogs\n",[234,10305,10306,10309,10311],{"class":236,"line":546},[234,10307,10308],{"class":9017},"          
host",[234,10310,6562],{"class":387},[234,10312,10313],{"class":255},"${HOSTNAME}\n",[234,10315,10316,10319,10321],{"class":236,"line":552},[234,10317,10318],{"class":9017},"          __path__",[234,10320,6562],{"class":387},[234,10322,10323],{"class":255},"\u002Fvar\u002Flog\u002F*.log\n",[234,10325,10326],{"class":236,"line":557},[234,10327,412],{"emptyLinePlaceholder":411},[234,10329,10330,10332,10334,10336],{"class":236,"line":594},[234,10331,9543],{"class":387},[234,10333,9564],{"class":9017},[234,10335,6562],{"class":387},[234,10337,10338],{"class":255},"docker\n",[234,10340,10341,10344],{"class":236,"line":635},[234,10342,10343],{"class":9017},"    docker_sd_configs",[234,10345,9021],{"class":387},[234,10347,10348,10350,10352,10354],{"class":236,"line":643},[234,10349,9050],{"class":387},[234,10351,1650],{"class":9017},[234,10353,6562],{"class":387},[234,10355,10356],{"class":255},"unix:\u002F\u002F\u002Fvar\u002Frun\u002Fdocker.sock\n",[234,10358,10359,10362],{"class":236,"line":659},[234,10360,10361],{"class":9017},"    relabel_configs",[234,10363,9021],{"class":387},[234,10365,10366,10368,10371,10373,10376],{"class":236,"line":683},[234,10367,9050],{"class":387},[234,10369,10370],{"class":9017},"source_labels",[234,10372,9521],{"class":387},[234,10374,10375],{"class":255},"'__meta_docker_container_name'",[234,10377,9527],{"class":387},[234,10379,10380,10383,10385],{"class":236,"line":695},[234,10381,10382],{"class":9017},"        target_label",[234,10384,6562],{"class":387},[234,10386,10387],{"class":255},"'container'\n",[12,10389,10390,10391,10394,10395,10397],{},"Important: the endpoint ",[231,10392,10393],{},"http:\u002F\u002Fmonitor.yourdomain.com:3100\u002Floki\u002Fapi\u002Fv1\u002Fpush"," needs to be accessible from the servers. If you followed step 2 and bound Loki to ",[231,10396,9359],{},", you have two options: expose 3100 via reverse proxy with basic authentication, or open an SSH\u002FWireGuard tunnel between servers. 
The second option is more secure and is the one we recommend.",[12,10399,10400,10401,10404],{},"Validation: in Grafana, go to Explore, select the Loki data source, run ",[231,10402,10403],{},"{job=\"varlogs\"}"," and see logs appearing in real time.",[19,10406,10408],{"id":10407},"step-6-how-to-import-grafana-dashboards","Step 6 — How to import Grafana dashboards?",[12,10410,8927,10411,101],{},[27,10412,10413],{},"20 minutes",[12,10415,10416,10417,10420,10421,101],{},"Access ",[231,10418,10419],{},"https:\u002F\u002Fmonitor.yourdomain.com"," (after configuring the reverse proxy from step 8 — you can skip ahead now if you want). Log in as admin with the password from ",[231,10422,9367],{},[12,10424,10425,10426,1272],{},"Add the two data sources via automatic provisioning. In ",[231,10427,10428],{},"grafana\u002Fprovisioning\u002Fdatasources\u002Fdatasources.yml",[224,10430,10432],{"className":9008,"code":10431,"language":9010,"meta":229,"style":229},"apiVersion: 1\ndatasources:\n  - name: Prometheus\n    type: prometheus\n    access: proxy\n    url: http:\u002F\u002Fprometheus:9090\n    isDefault: true\n  - name: Loki\n    type: loki\n    access: proxy\n    url: http:\u002F\u002Floki:3100\n",[231,10433,10434,10443,10450,10462,10472,10482,10492,10501,10512,10521,10529],{"__ignoreMap":229},[234,10435,10436,10439,10441],{"class":236,"line":237},[234,10437,10438],{"class":9017},"apiVersion",[234,10440,6562],{"class":387},[234,10442,10041],{"class":251},[234,10444,10445,10448],{"class":236,"line":244},[234,10446,10447],{"class":9017},"datasources",[234,10449,9021],{"class":387},[234,10451,10452,10454,10457,10459],{"class":236,"line":271},[234,10453,9543],{"class":387},[234,10455,10456],{"class":9017},"name",[234,10458,6562],{"class":387},[234,10460,10461],{"class":255},"Prometheus\n",[234,10463,10464,10467,10469],{"class":236,"line":415},[234,10465,10466],{"class":9017},"    
type",[234,10468,6562],{"class":387},[234,10470,10471],{"class":255},"prometheus\n",[234,10473,10474,10477,10479],{"class":236,"line":434},[234,10475,10476],{"class":9017},"    access",[234,10478,6562],{"class":387},[234,10480,10481],{"class":255},"proxy\n",[234,10483,10484,10487,10489],{"class":236,"line":459},[234,10485,10486],{"class":9017},"    url",[234,10488,6562],{"class":387},[234,10490,10491],{"class":255},"http:\u002F\u002Fprometheus:9090\n",[234,10493,10494,10497,10499],{"class":236,"line":464},[234,10495,10496],{"class":9017},"    isDefault",[234,10498,6562],{"class":387},[234,10500,10184],{"class":251},[234,10502,10503,10505,10507,10509],{"class":236,"line":479},[234,10504,9543],{"class":387},[234,10506,10456],{"class":9017},[234,10508,6562],{"class":387},[234,10510,10511],{"class":255},"Loki\n",[234,10513,10514,10516,10518],{"class":236,"line":484},[234,10515,10466],{"class":9017},[234,10517,6562],{"class":387},[234,10519,10520],{"class":255},"loki\n",[234,10522,10523,10525,10527],{"class":236,"line":490},[234,10524,10476],{"class":9017},[234,10526,6562],{"class":387},[234,10528,10481],{"class":255},[234,10530,10531,10533,10535],{"class":236,"line":508},[234,10532,10486],{"class":9017},[234,10534,6562],{"class":387},[234,10536,10537],{"class":255},"http:\u002F\u002Floki:3100\n",[12,10539,10540,10541,10544],{},"Restart Grafana with ",[231,10542,10543],{},"docker compose restart grafana"," and the sources appear automatically.",[12,10546,10547,10548,10551],{},"Import ready-made dashboards. In ",[27,10549,10550],{},"Dashboards → New → Import",", paste the dashboard ID:",[2734,10553,10554,10560,10566],{},[70,10555,10556,10559],{},[27,10557,10558],{},"1860"," — Node Exporter Full. CPU, RAM, disk, network, filesystem. It's the most used dashboard in the Prometheus community, with good reason.",[70,10561,10562,10565],{},[27,10563,10564],{},"13639"," — Logs \u002F App. 
Basic visualization of Loki logs with filters by job, container, host.",[70,10567,10568,10571],{},[27,10569,10570],{},"15172"," — Cluster overview. Consolidated view per server, useful for a small cluster.",[12,10573,10574,10575,10578],{},"Customize each one to use ",[231,10576,10577],{},"environment=\"production\""," in the default filter. After two weeks of use, you'll want to create your own dashboards for specific workloads — there's no shortcut there, only chair time.",[19,10580,10582],{"id":10581},"step-7-how-to-configure-basic-alerts","Step 7 — How to configure basic alerts?",[12,10584,8927,10585,101],{},[27,10586,8986],{},[12,10588,10589],{},"Alerts are where 80% of teams stumble: either they configure too few and discover incidents through customers, or they configure dozens and desensitize the team.",[12,10591,10592,10593,10596,10597,1272],{},"Start with ",[27,10594,10595],{},"six essential alerts",". In ",[231,10598,10599],{},"prometheus\u002Falerts.yml",[224,10601,10603],{"className":9008,"code":10602,"language":9010,"meta":229,"style":229},"groups:\n  - name: essentials\n    interval: 30s\n    rules:\n      - alert: ServerDown\n        expr: up{job=\"node\"} == 0\n        for: 2m\n        labels:\n          severity: critical\n        annotations:\n          summary: \"Server {{ $labels.instance }} is down\"\n\n      - alert: HighCPU\n        expr: 100 - (avg by(instance) (rate(node_cpu_seconds_total{mode=\"idle\"}[5m])) * 100) > 80\n        for: 10m\n        labels:\n          severity: warning\n\n      - alert: DiskAlmostFull\n        expr: (node_filesystem_avail_bytes{mountpoint=\"\u002F\"} \u002F node_filesystem_size_bytes{mountpoint=\"\u002F\"}) * 100 \u003C 15\n        for: 5m\n        labels:\n          severity: critical\n\n      - alert: HighMemory\n        expr: (1 - (node_memory_MemAvailable_bytes \u002F node_memory_MemTotal_bytes)) * 100 > 90\n        for: 10m\n        labels:\n          severity: warning\n\n      - alert: HighHTTPErrorRate\n   
     expr: sum(rate(http_requests_total{status=~\"5..\"}[5m])) \u002F sum(rate(http_requests_total[5m])) > 0.05\n        for: 5m\n        labels:\n          severity: critical\n\n      - alert: HighLatency\n        expr: histogram_quantile(0.99, sum(rate(http_request_duration_seconds_bucket[5m])) by (le)) > 2\n        for: 10m\n        labels:\n          severity: warning\n",[231,10604,10605,10612,10623,10633,10640,10652,10662,10672,10678,10688,10695,10705,10709,10720,10729,10738,10744,10753,10757,10768,10777,10786,10792,10800,10804,10815,10824,10832,10838,10846,10850,10861,10870,10878,10884,10892,10896,10907,10916,10924,10930],{"__ignoreMap":229},[234,10606,10607,10610],{"class":236,"line":237},[234,10608,10609],{"class":9017},"groups",[234,10611,9021],{"class":387},[234,10613,10614,10616,10618,10620],{"class":236,"line":244},[234,10615,9543],{"class":387},[234,10617,10456],{"class":9017},[234,10619,6562],{"class":387},[234,10621,10622],{"class":255},"essentials\n",[234,10624,10625,10628,10630],{"class":236,"line":271},[234,10626,10627],{"class":9017},"    interval",[234,10629,6562],{"class":387},[234,10631,10632],{"class":255},"30s\n",[234,10634,10635,10638],{"class":236,"line":415},[234,10636,10637],{"class":9017},"    rules",[234,10639,9021],{"class":387},[234,10641,10642,10644,10647,10649],{"class":236,"line":434},[234,10643,9050],{"class":387},[234,10645,10646],{"class":9017},"alert",[234,10648,6562],{"class":387},[234,10650,10651],{"class":255},"ServerDown\n",[234,10653,10654,10657,10659],{"class":236,"line":459},[234,10655,10656],{"class":9017},"        expr",[234,10658,6562],{"class":387},[234,10660,10661],{"class":255},"up{job=\"node\"} == 0\n",[234,10663,10664,10667,10669],{"class":236,"line":464},[234,10665,10666],{"class":9017},"        
for",[234,10668,6562],{"class":387},[234,10670,10671],{"class":255},"2m\n",[234,10673,10674,10676],{"class":236,"line":479},[234,10675,9652],{"class":9017},[234,10677,9021],{"class":387},[234,10679,10680,10683,10685],{"class":236,"line":484},[234,10681,10682],{"class":9017},"          severity",[234,10684,6562],{"class":387},[234,10686,10687],{"class":255},"critical\n",[234,10689,10690,10693],{"class":236,"line":490},[234,10691,10692],{"class":9017},"        annotations",[234,10694,9021],{"class":387},[234,10696,10697,10700,10702],{"class":236,"line":508},[234,10698,10699],{"class":9017},"          summary",[234,10701,6562],{"class":387},[234,10703,10704],{"class":255},"\"Server {{ $labels.instance }} is down\"\n",[234,10706,10707],{"class":236,"line":529},[234,10708,412],{"emptyLinePlaceholder":411},[234,10710,10711,10713,10715,10717],{"class":236,"line":535},[234,10712,9050],{"class":387},[234,10714,10646],{"class":9017},[234,10716,6562],{"class":387},[234,10718,10719],{"class":255},"HighCPU\n",[234,10721,10722,10724,10726],{"class":236,"line":546},[234,10723,10656],{"class":9017},[234,10725,6562],{"class":387},[234,10727,10728],{"class":255},"100 - (avg by(instance) (rate(node_cpu_seconds_total{mode=\"idle\"}[5m])) * 100) > 
80\n",[234,10730,10731,10733,10735],{"class":236,"line":552},[234,10732,10666],{"class":9017},[234,10734,6562],{"class":387},[234,10736,10737],{"class":255},"10m\n",[234,10739,10740,10742],{"class":236,"line":557},[234,10741,9652],{"class":9017},[234,10743,9021],{"class":387},[234,10745,10746,10748,10750],{"class":236,"line":594},[234,10747,10682],{"class":9017},[234,10749,6562],{"class":387},[234,10751,10752],{"class":255},"warning\n",[234,10754,10755],{"class":236,"line":635},[234,10756,412],{"emptyLinePlaceholder":411},[234,10758,10759,10761,10763,10765],{"class":236,"line":643},[234,10760,9050],{"class":387},[234,10762,10646],{"class":9017},[234,10764,6562],{"class":387},[234,10766,10767],{"class":255},"DiskAlmostFull\n",[234,10769,10770,10772,10774],{"class":236,"line":659},[234,10771,10656],{"class":9017},[234,10773,6562],{"class":387},[234,10775,10776],{"class":255},"(node_filesystem_avail_bytes{mountpoint=\"\u002F\"} \u002F node_filesystem_size_bytes{mountpoint=\"\u002F\"}) * 100 \u003C 15\n",[234,10778,10779,10781,10783],{"class":236,"line":683},[234,10780,10666],{"class":9017},[234,10782,6562],{"class":387},[234,10784,10785],{"class":255},"5m\n",[234,10787,10788,10790],{"class":236,"line":695},[234,10789,9652],{"class":9017},[234,10791,9021],{"class":387},[234,10793,10794,10796,10798],{"class":236,"line":717},[234,10795,10682],{"class":9017},[234,10797,6562],{"class":387},[234,10799,10687],{"class":255},[234,10801,10802],{"class":236,"line":723},[234,10803,412],{"emptyLinePlaceholder":411},[234,10805,10806,10808,10810,10812],{"class":236,"line":729},[234,10807,9050],{"class":387},[234,10809,10646],{"class":9017},[234,10811,6562],{"class":387},[234,10813,10814],{"class":255},"HighMemory\n",[234,10816,10817,10819,10821],{"class":236,"line":734},[234,10818,10656],{"class":9017},[234,10820,6562],{"class":387},[234,10822,10823],{"class":255},"(1 - (node_memory_MemAvailable_bytes \u002F node_memory_MemTotal_bytes)) * 100 > 
90\n",[234,10825,10826,10828,10830],{"class":236,"line":771},[234,10827,10666],{"class":9017},[234,10829,6562],{"class":387},[234,10831,10737],{"class":255},[234,10833,10834,10836],{"class":236,"line":776},[234,10835,9652],{"class":9017},[234,10837,9021],{"class":387},[234,10839,10840,10842,10844],{"class":236,"line":815},[234,10841,10682],{"class":9017},[234,10843,6562],{"class":387},[234,10845,10752],{"class":255},[234,10847,10848],{"class":236,"line":820},[234,10849,412],{"emptyLinePlaceholder":411},[234,10851,10852,10854,10856,10858],{"class":236,"line":826},[234,10853,9050],{"class":387},[234,10855,10646],{"class":9017},[234,10857,6562],{"class":387},[234,10859,10860],{"class":255},"HighHTTPErrorRate\n",[234,10862,10863,10865,10867],{"class":236,"line":846},[234,10864,10656],{"class":9017},[234,10866,6562],{"class":387},[234,10868,10869],{"class":255},"sum(rate(http_requests_total{status=~\"5..\"}[5m])) \u002F sum(rate(http_requests_total[5m])) > 0.05\n",[234,10871,10872,10874,10876],{"class":236,"line":859},[234,10873,10666],{"class":9017},[234,10875,6562],{"class":387},[234,10877,10785],{"class":255},[234,10879,10880,10882],{"class":236,"line":872},[234,10881,9652],{"class":9017},[234,10883,9021],{"class":387},[234,10885,10886,10888,10890],{"class":236,"line":898},[234,10887,10682],{"class":9017},[234,10889,6562],{"class":387},[234,10891,10687],{"class":255},[234,10893,10894],{"class":236,"line":913},[234,10895,412],{"emptyLinePlaceholder":411},[234,10897,10898,10900,10902,10904],{"class":236,"line":1886},[234,10899,9050],{"class":387},[234,10901,10646],{"class":9017},[234,10903,6562],{"class":387},[234,10905,10906],{"class":255},"HighLatency\n",[234,10908,10909,10911,10913],{"class":236,"line":1901},[234,10910,10656],{"class":9017},[234,10912,6562],{"class":387},[234,10914,10915],{"class":255},"histogram_quantile(0.99, sum(rate(http_request_duration_seconds_bucket[5m])) by (le)) > 
2\n",[234,10917,10918,10920,10922],{"class":236,"line":1920},[234,10919,10666],{"class":9017},[234,10921,6562],{"class":387},[234,10923,10737],{"class":255},[234,10925,10926,10928],{"class":236,"line":1944},[234,10927,9652],{"class":9017},[234,10929,9021],{"class":387},[234,10931,10932,10934,10936],{"class":236,"line":1962},[234,10933,10682],{"class":9017},[234,10935,6562],{"class":387},[234,10937,10752],{"class":255},[12,10939,10940,10941,10944],{},"And the ",[231,10942,10943],{},"alertmanager\u002Falertmanager.yml"," pointing to a Slack or Discord webhook:",[224,10946,10948],{"className":9008,"code":10947,"language":9010,"meta":229,"style":229},"route:\n  group_by: ['alertname', 'severity']\n  group_wait: 30s\n  group_interval: 5m\n  repeat_interval: 4h\n  receiver: 'slack-default'\n  routes:\n    - match:\n        severity: critical\n      receiver: 'slack-critical'\n      repeat_interval: 1h\n\nreceivers:\n  - name: 'slack-default'\n    slack_configs:\n      - api_url: 'https:\u002F\u002Fhooks.slack.com\u002Fservices\u002FYOUR\u002FWEBHOOK\u002FHERE'\n        channel: '#alerts'\n        send_resolved: true\n\n  - name: 'slack-critical'\n    slack_configs:\n      - api_url: 'https:\u002F\u002Fhooks.slack.com\u002Fservices\u002FYOUR\u002FWEBHOOK\u002FHERE'\n        channel: '#alerts-critical'\n        send_resolved: true\n",[231,10949,10950,10957,10974,10983,10992,11002,11012,11019,11028,11037,11047,11057,11061,11068,11078,11085,11097,11107,11116,11120,11130,11136,11146,11155],{"__ignoreMap":229},[234,10951,10952,10955],{"class":236,"line":237},[234,10953,10954],{"class":9017},"route",[234,10956,9021],{"class":387},[234,10958,10959,10962,10964,10967,10969,10972],{"class":236,"line":244},[234,10960,10961],{"class":9017},"  
group_by",[234,10963,9521],{"class":387},[234,10965,10966],{"class":255},"'alertname'",[234,10968,571],{"class":387},[234,10970,10971],{"class":255},"'severity'",[234,10973,9527],{"class":387},[234,10975,10976,10979,10981],{"class":236,"line":271},[234,10977,10978],{"class":9017},"  group_wait",[234,10980,6562],{"class":387},[234,10982,10632],{"class":255},[234,10984,10985,10988,10990],{"class":236,"line":415},[234,10986,10987],{"class":9017},"  group_interval",[234,10989,6562],{"class":387},[234,10991,10785],{"class":255},[234,10993,10994,10997,10999],{"class":236,"line":434},[234,10995,10996],{"class":9017},"  repeat_interval",[234,10998,6562],{"class":387},[234,11000,11001],{"class":255},"4h\n",[234,11003,11004,11007,11009],{"class":236,"line":459},[234,11005,11006],{"class":9017},"  receiver",[234,11008,6562],{"class":387},[234,11010,11011],{"class":255},"'slack-default'\n",[234,11013,11014,11017],{"class":236,"line":464},[234,11015,11016],{"class":9017},"  routes",[234,11018,9021],{"class":387},[234,11020,11021,11023,11026],{"class":236,"line":479},[234,11022,9505],{"class":387},[234,11024,11025],{"class":9017},"match",[234,11027,9021],{"class":387},[234,11029,11030,11033,11035],{"class":236,"line":484},[234,11031,11032],{"class":9017},"        severity",[234,11034,6562],{"class":387},[234,11036,10687],{"class":255},[234,11038,11039,11042,11044],{"class":236,"line":490},[234,11040,11041],{"class":9017},"      receiver",[234,11043,6562],{"class":387},[234,11045,11046],{"class":255},"'slack-critical'\n",[234,11048,11049,11052,11054],{"class":236,"line":508},[234,11050,11051],{"class":9017},"      
repeat_interval",[234,11053,6562],{"class":387},[234,11055,11056],{"class":255},"1h\n",[234,11058,11059],{"class":236,"line":529},[234,11060,412],{"emptyLinePlaceholder":411},[234,11062,11063,11066],{"class":236,"line":535},[234,11064,11065],{"class":9017},"receivers",[234,11067,9021],{"class":387},[234,11069,11070,11072,11074,11076],{"class":236,"line":546},[234,11071,9543],{"class":387},[234,11073,10456],{"class":9017},[234,11075,6562],{"class":387},[234,11077,11011],{"class":255},[234,11079,11080,11083],{"class":236,"line":552},[234,11081,11082],{"class":9017},"    slack_configs",[234,11084,9021],{"class":387},[234,11086,11087,11089,11092,11094],{"class":236,"line":557},[234,11088,9050],{"class":387},[234,11090,11091],{"class":9017},"api_url",[234,11093,6562],{"class":387},[234,11095,11096],{"class":255},"'https:\u002F\u002Fhooks.slack.com\u002Fservices\u002FYOUR\u002FWEBHOOK\u002FHERE'\n",[234,11098,11099,11102,11104],{"class":236,"line":594},[234,11100,11101],{"class":9017},"        channel",[234,11103,6562],{"class":387},[234,11105,11106],{"class":255},"'#alerts'\n",[234,11108,11109,11112,11114],{"class":236,"line":635},[234,11110,11111],{"class":9017},"        
send_resolved",[234,11113,6562],{"class":387},[234,11115,10184],{"class":251},[234,11117,11118],{"class":236,"line":643},[234,11119,412],{"emptyLinePlaceholder":411},[234,11121,11122,11124,11126,11128],{"class":236,"line":659},[234,11123,9543],{"class":387},[234,11125,10456],{"class":9017},[234,11127,6562],{"class":387},[234,11129,11046],{"class":255},[234,11131,11132,11134],{"class":236,"line":683},[234,11133,11082],{"class":9017},[234,11135,9021],{"class":387},[234,11137,11138,11140,11142,11144],{"class":236,"line":695},[234,11139,9050],{"class":387},[234,11141,11091],{"class":9017},[234,11143,6562],{"class":387},[234,11145,11096],{"class":255},[234,11147,11148,11150,11152],{"class":236,"line":717},[234,11149,11101],{"class":9017},[234,11151,6562],{"class":387},[234,11153,11154],{"class":255},"'#alerts-critical'\n",[234,11156,11157,11159,11161],{"class":236,"line":723},[234,11158,11111],{"class":9017},[234,11160,6562],{"class":387},[234,11162,10184],{"class":251},[12,11164,11165,11166,11169,11170,11173],{},"Two details that save sleep. The ",[231,11167,11168],{},"for: 10m"," on CPU prevents short spikes from becoming alerts — the server can hit 95% for 30 seconds and that's normal. The ",[231,11171,11172],{},"repeat_interval: 4h"," for warnings ensures that a warning resolved in one hour doesn't become 60 messages — Alertmanager groups them.",[12,11175,11176,11177,11179,11180,11183,11184,11187],{},"Reload Prometheus (",[231,11178,9747],{},") and test by forcing an alert: ",[231,11181,11182],{},"stress --cpu 4 --timeout 700s"," on some server should trigger ",[231,11185,11186],{},"HighCPU"," in 10 minutes.",[19,11189,11191],{"id":11190},"step-8-how-to-put-reverse-proxy-and-tls-in-front","Step 8 — How to put reverse proxy and TLS in front?",[12,11193,8927,11194,101],{},[27,11195,10413],{},[12,11197,11198,11199,11201],{},"To access Grafana via ",[231,11200,10419],{}," with a valid certificate, you need something in front of port 3000. 
Two options:",[67,11203,11204,11214],{},[70,11205,11206,11209,11210,11213],{},[27,11207,11208],{},"Orchestrator's integrated router"," — if you already have the HeroCtl cluster running, just declare Grafana as a job with ",[231,11211,11212],{},"ingress: { host: monitor.yourdomain.com, tls: true }",". Automatic Let's Encrypt certificate, without additional tool.",[70,11215,11216,11219,11220],{},[27,11217,11218],{},"Caddy standalone"," on the observability VPS itself — also issues Let's Encrypt automatically. Minimum Caddyfile:",[224,11221,11224],{"className":11222,"code":11223,"language":2529},[2527],"monitor.yourdomain.com {\n  reverse_proxy localhost:3000\n  basicauth \u002Flogin {\n    admin \u003Cbcrypt_hash>\n  }\n}\n",[231,11225,11223],{"__ignoreMap":229},[12,11227,11228,11229,11232],{},"For defense in depth, keep Caddy\u002Frouter basic authentication in front of Grafana login — two barriers, not one. The second is especially important because the default Grafana login is ",[231,11230,11231],{},"admin\u002Fadmin"," and the first thing bots do on an exposed Grafana is try that combination.",[19,11234,11236],{"id":11235},"step-9-how-to-instrument-application-metrics","Step 9 — How to instrument application metrics?",[12,11238,8927,11239,101],{},[27,11240,11241],{},"varies according to number of applications",[12,11243,11244],{},"System metrics are half the story. 
The other half is what your application is doing — how many requests per second, what the p99 latency is, how many errors, what the background job queue size is.",[12,11246,11247],{},"Each popular language has an official Prometheus client:",[2734,11249,11250,11258,11266,11273],{},[70,11251,11252,6562,11255],{},[27,11253,11254],{},"Node.js",[231,11256,11257],{},"prom-client",[70,11259,11260,6562,11263],{},[27,11261,11262],{},"Python",[231,11264,11265],{},"prometheus-client",[70,11267,11268,6562,11271],{},[27,11269,11270],{},"Ruby",[231,11272,11265],{},[70,11274,11275,6562,11278],{},[27,11276,11277],{},"Go",[231,11279,11280],{},"github.com\u002Fprometheus\u002Fclient_golang",[12,11282,11283],{},"The minimum standard is three metrics per HTTP endpoint:",[2734,11285,11286,11300,11306],{},[70,11287,11288,11291,11292,571,11295,571,11298,101],{},[231,11289,11290],{},"http_requests_total"," — counter, with labels ",[231,11293,11294],{},"method",[231,11296,11297],{},"path",[231,11299,614],{},[70,11301,11302,11305],{},[231,11303,11304],{},"http_request_duration_seconds"," — histogram, same label set.",[70,11307,11308,11311,11312,11315],{},[231,11309,11310],{},"app_errors_total"," — counter, with label ",[231,11313,11314],{},"kind"," (\"validation\", \"db\", \"external_api\", etc).",[12,11317,11318,11319,11321,11322,11324],{},"Expose all of that in ",[231,11320,8907],{},". Add the endpoint in Prometheus's ",[231,11323,9555],{},". In hours you have dashboards per endpoint, alerts per error rate, and the ability to answer \"what was happening at 3:14 yesterday\" with a graph instead of a guess.",[12,11326,11327,11328,11331,11332,11335],{},"Watch for ",[27,11329,11330],{},"cardinality",". Each unique combination of labels becomes a separate time series. If you put ",[231,11333,11334],{},"user_id"," as label, with 100k users you create 100k series — and Prometheus will consume 8+ GB of RAM just to index that. 
Practical rule: labels should take values from small sets (status code: 5 values; method: 5 values; path: dozens). Unique identifiers go in logs, not in metrics.",[19,11337,11339],{"id":11338},"how-to-run-this-inside-heroctl-instead-of-dedicated-vps","How to run this inside HeroCtl instead of dedicated VPS?",[12,11341,11342],{},"For clusters already running the orchestrator, it makes sense to consider the stack as one more job. Trade-off: you save a VPS, but lose isolation (if the cluster dies, monitoring dies with it).",[12,11344,11345],{},"The topology looks like this:",[2734,11347,11348,11354,11360,11366],{},[70,11349,11350,11353],{},[27,11351,11352],{},"1 single job spec"," with 4 tasks: prometheus, grafana, loki, alertmanager.",[70,11355,11356,11359],{},[27,11357,11358],{},"Replicated volumes"," in the cluster — data survives node failure.",[70,11361,11362,11365],{},[27,11363,11364],{},"Integrated router"," does automatic TLS via subdomain. No need for additional Caddy.",[70,11367,11368,11371],{},[27,11369,11370],{},"Cluster's own metrics"," are already exposed in Prometheus format on the administrative API, so the scrape is direct.",[12,11373,11374],{},"For critical production, we recommend physical separation (dedicated VPS outside the cluster). For personal project, MVP, or small team where \"everything falls together\" is acceptable, running inside is cheaper and operationally simpler. 
The entire job spec sits around 80 lines of manifest.",[19,11376,11378],{"id":11377},"how-much-does-this-stack-cost-per-month-in-brazil","How much does this stack cost per month in Brazil?",[119,11380,11381,11391],{},[122,11382,11383],{},[125,11384,11385,11388],{},[128,11386,11387],{},"Item",[128,11389,11390],{},"Monthly cost (BRL)",[141,11392,11393,11401,11409,11417],{},[125,11394,11395,11398],{},[146,11396,11397],{},"Dedicated observability VPS (4 GB RAM)",[146,11399,11400],{},"R$40 to R$80",[125,11402,11403,11406],{},[146,11404,11405],{},"Object storage for long log retention (optional)",[146,11407,11408],{},"R$30",[125,11410,11411,11414],{},[146,11412,11413],{},"Maintenance time (2 to 4h × hour value)",[146,11415,11416],{},"R$200 to R$400",[125,11418,11419,11424],{},[146,11420,11421],{},[27,11422,11423],{},"Total operational",[146,11425,11426],{},[27,11427,11428],{},"R$300 to R$500",[12,11430,11431],{},"For comparison, a Datadog or New Relic subscription with equivalent coverage (5 hosts, 30-day log retention, alerts, dashboards) goes for around R$1,500 to R$2,000 per month — without counting the automatic overage that appears at month-end when someone forgets a verbose log on.",[12,11433,11434],{},"The difference isn't small: in a year, the open-source self-hosted stack saves between R$12,000 and R$18,000. 
For an early-stage startup, that's half a junior engineer.",[19,11436,11438],{"id":11437},"table-of-ports-resources-and-characteristics-per-component","Table of ports, resources and characteristics per component",[119,11440,11441,11461],{},[122,11442,11443],{},[125,11444,11445,11447,11450,11452,11455,11458],{},[128,11446,130],{},[128,11448,11449],{},"Port",[128,11451,3873],{},[128,11453,11454],{},"Disk",[128,11456,11457],{},"Default retention",[128,11459,11460],{},"Data format",[141,11462,11463,11482,11500,11518,11537,11553],{},[125,11464,11465,11467,11470,11473,11476,11479],{},[146,11466,8831],{},[146,11468,11469],{},"9090",[146,11471,11472],{},"512 MB",[146,11474,11475],{},"10 GB",[146,11477,11478],{},"15 days",[146,11480,11481],{},"binary TSDB",[125,11483,11484,11486,11489,11492,11495,11497],{},[146,11485,8837],{},[146,11487,11488],{},"3000",[146,11490,11491],{},"256 MB",[146,11493,11494],{},"1 GB",[146,11496,3055],{},[146,11498,11499],{},"SQLite or Postgres",[125,11501,11502,11504,11507,11509,11512,11515],{},[146,11503,8843],{},[146,11505,11506],{},"3100",[146,11508,11472],{},[146,11510,11511],{},"30 GB",[146,11513,11514],{},"30 days (configurable)",[146,11516,11517],{},"compressed chunks",[125,11519,11520,11523,11526,11529,11532,11534],{},[146,11521,11522],{},"Promtail \u002F Agent",[146,11524,11525],{},"9080",[146,11527,11528],{},"128 MB",[146,11530,11531],{},"minimum",[146,11533,3055],{},[146,11535,11536],{},"passes by value",[125,11538,11539,11541,11544,11546,11548,11550],{},[146,11540,8861],{},[146,11542,11543],{},"9093",[146,11545,11528],{},[146,11547,11494],{},[146,11549,3055],{},[146,11551,11552],{},"notification log",[125,11554,11555,11557,11560,11563,11565,11567],{},[146,11556,8855],{},[146,11558,11559],{},"9100",[146,11561,11562],{},"64 MB",[146,11564,11531],{},[146,11566,3055],{},[146,11568,11569],{},"scrape endpoint",[12,11571,11572],{},"These are the viable minimums for small cluster. 
In production with 30 servers and real traffic, multiply RAM by 3 and disk by 5.",[19,11574,11576],{"id":11575},"the-four-errors-that-kill-a-new-monitoring-stack","The four errors that kill a new monitoring stack",[12,11578,11579],{},"Teams setting up observability for the first time almost always stumble on the same four errors. Knowing about them beforehand saves months.",[12,11581,11582,11585,11586,11589],{},[27,11583,11584],{},"Not monitoring monitoring."," Prometheus stopped scraping on Thursday; nobody saw it. On Wednesday of the following week a server actually went down and they discovered there was no alert because Prometheus had been dead for 6 days. Solution: configure a simple external cron (even a free Pingdom check works) that hits ",[231,11587,11588],{},"https:\u002F\u002Fmonitor.yourdomain.com\u002Fapi\u002Fhealth"," every 5 minutes and warns you when Grafana itself goes down.",[12,11591,11592,11595,11596,11599],{},[27,11593,11594],{},"No retention strategy."," Disk fills up in three months, Prometheus stops recording, someone deletes everything in despair, loses 90 days of history. Configure ",[231,11597,11598],{},"--storage.tsdb.retention.time=30d"," from day one and establish a housekeeping job.",[12,11601,11602,11605,11606,571,11608,11611],{},[27,11603,11604],{},"High cardinality in labels."," We already covered this in step 9, but it's worth repeating: each ",[231,11607,11334],{},[231,11609,11610],{},"request_id"," or UUID that becomes a label multiplies Prometheus RAM consumption explosively. Unique identifiers go to Loki, not to Prometheus.",[12,11613,11614,11617],{},[27,11615,11616],{},"Noisy alerts."," The team receives 200 alerts per day. In two weeks, nobody looks anymore. When the site actually crashes, the alert will be in the middle of 199 others. Solution: start with six alerts (those from step 7), audit every two weeks, and delete everything that fired but didn't require human action. 
Alert without action is noise.",[19,11619,3225],{"id":3224},[12,11621,11622,11625],{},[27,11623,11624],{},"Can I run everything on a 2 GB VPS?","\nTechnically yes, for a cluster of up to 3 servers and few applications. In practice you'll hit the RAM ceiling in 2 to 3 months, especially if you import dense Grafana dashboards. Pay 50 reais more and go straight to 4 GB VPS — the time you save not fighting OOM kills pays for itself.",[12,11627,11628,11631],{},[27,11629,11630],{},"How much disk for 30 days of logs?","\nDepends entirely on your application's log volume. Rough rule for small startup: cluster of 4 servers with normal web applications generates 1 to 5 GB of logs per day after Loki compression. Thirty days gives between 30 and 150 GB. Start with 50 GB SSD, monitor growth for two weeks, expand if necessary. If you go much beyond that, it's time to go to object storage.",[12,11633,11634,11637],{},[27,11635,11636],{},"Grafana Cloud vs self-hosted, which to choose?","\nGrafana Cloud free tier is generous (10k series, 50 GB of logs, 14-day retention) and eliminates the work of maintaining the server. For solo project or very small team, makes sense. From the moment you exceed the free tier, prices scale fast — from US$50\u002Fmonth — and you lose control over the data. Self-hosted costs hardware + time, Cloud costs money + lock-in. For a company that intends to grow and has a DevOps dev on the team, self-hosted wins.",[12,11639,11640,11643],{},[27,11641,11642],{},"Promtail or Grafana Agent?","\nIn 2026, Grafana Agent (renamed to Grafana Alloy) is officially replacing Promtail. For new setup, go straight to Alloy. For setup that has been running Promtail for a long time, no urgency to migrate — Promtail will continue working for years.",[12,11645,11646,11649,11650,11652],{},[27,11647,11648],{},"Where does OpenTelemetry fit in this stack?","\nOTel is the application instrumentation standard that's consolidating. 
Instead of using ",[231,11651,11257],{}," directly, you use OTel's SDK and it exports to Prometheus, Loki and Tempo simultaneously. The big advantage is portability — if you want to swap Prometheus for something else 3 years from now, your application doesn't change a line. For a startup starting today, we recommend OTel from day one.",[12,11654,11655,11658,11659,11662,11663,11666],{},[27,11656,11657],{},"How do I backup Prometheus?","\nPrometheus supports snapshots via its API: ",[231,11660,11661],{},"curl -X POST localhost:9090\u002Fapi\u002Fv1\u002Fadmin\u002Ftsdb\u002Fsnapshot"," creates a snapshot in the data directory. Do that once a day via cron, create a ",[231,11664,11665],{},"tar.gz"," and send it to object storage. In case of disaster, what you lose is metrics — and metrics, unlike logs, are typically recoverable in hours (start collecting again and the dashboards return). Lost logs are lost forever, so invest more in Loki backup.",[12,11668,11669,11672],{},[27,11670,11671],{},"Is Tempo (distributed traces) worth installing now?","\nNo. Traces become useful from the moment you have 5+ services talking to each other and debugging latency involves following a request through several hops. For a monolithic architecture or few services, traces demand work disproportionate to their value. Add them when complexity calls for it.",[12,11674,11675,11678],{},[27,11676,11677],{},"Does Loki index full-text like ELK?","\nNo, and that's a feature, not a bug. Loki indexes only labels (job, host, container, severity) and log content stays compressed without an index. To search text, you filter by labels first and then grep on the resulting chunks. That's what makes Loki ten times cheaper than ELK in RAM and CPU. In exchange, free-text queries across all history are slower. 
For 90% of debugging cases, filtering by job + host + time window already narrows the search to dozens of MB, where grep flies.",[19,11680,11682],{"id":11681},"next-steps","Next steps",[12,11684,11685],{},"Brought up the stack, have a dashboard, have alerts, have searchable logs? Good. The next three things worth investing in are, in order:",[67,11687,11688,11694,11708],{},[70,11689,11690,11693],{},[27,11691,11692],{},"Custom dashboards per application"," — business metrics (subscriptions created\u002Fhour, jobs processed, email queue) instead of just infrastructure.",[70,11695,11696,11699,11700,11703,11704,11707],{},[27,11697,11698],{},"Runbooks linked in alerts"," — every rule in ",[231,11701,11702],{},"alerts.yml"," should have ",[231,11705,11706],{},"annotations.runbook_url"," pointing to a page explaining what to do. When the alert fires at 3 AM, a sleep-deprived brain shouldn't have to think.",[70,11709,11710,11713],{},[27,11711,11712],{},"Monthly alert review"," — 30 minutes once a month auditing what fired in the previous month, deleting what became noise, adjusting thresholds.",[12,11715,11716,11717,11721,11722,101],{},"For those who want to go further and understand why we chose this stack instead of managed SaaS, read ",[3336,11718,11720],{"href":11719},"\u002Fen\u002Fblog\u002Fobservability-without-datadog-startup-stack","Observability without Datadog: the Brazilian startup stack",". 
And to close the operations cycle — because there's no point knowing the database fell if you can't restore — it's worth reading ",[3336,11723,11725],{"href":11724},"\u002Fen\u002Fblog\u002Fdatabase-backup-strategies-cluster","Database backup in cluster: strategies for 3 AM",[12,11727,11728],{},"If you want to skip this entire setup and run the stack as a job inside an orchestrator that already takes care of TLS, rolling update deploy and volume replication:",[224,11730,11731],{"className":226,"code":5318,"language":228,"meta":229,"style":229},[231,11732,11733],{"__ignoreMap":229},[234,11734,11735,11737,11739,11741,11743],{"class":236,"line":237},[234,11736,1220],{"class":247},[234,11738,2957],{"class":251},[234,11740,5329],{"class":255},[234,11742,2963],{"class":383},[234,11744,2966],{"class":247},[12,11746,11747],{},"Four hours become forty minutes. The rest is the same work of thinking about which alerts matter — and on that part, no one frees you.",[3350,11749,11750],{},"html pre.shiki code .sPWt5, html code.shiki .sPWt5{--shiki-default:#7EE787}html pre.shiki code .sZEs4, html code.shiki .sZEs4{--shiki-default:#E6EDF3}html pre.shiki code .s9uIt, html code.shiki .s9uIt{--shiki-default:#A5D6FF}html pre.shiki code .sH3jZ, html code.shiki .sH3jZ{--shiki-default:#8B949E}html .default .shiki span {color: var(--shiki-default);background: var(--shiki-default-bg);font-style: var(--shiki-default-font-style);font-weight: var(--shiki-default-font-weight);text-decoration: var(--shiki-default-text-decoration);}html .shiki span {color: var(--shiki-default);background: var(--shiki-default-bg);font-style: var(--shiki-default-font-style);font-weight: var(--shiki-default-font-weight);text-decoration: var(--shiki-default-text-decoration);}html pre.shiki code .sFSAA, html code.shiki .sFSAA{--shiki-default:#79C0FF}html pre.shiki code .sQhOw, html code.shiki .sQhOw{--shiki-default:#FFA657}html pre.shiki code .suJrU, html code.shiki 
.suJrU{--shiki-default:#FF7B72}",{"title":229,"searchDepth":244,"depth":244,"links":11752},[11753,11754,11755,11756,11757,11758,11759,11760,11761,11762,11763,11764,11765,11766,11767,11768,11769,11770],{"id":21,"depth":244,"text":22},{"id":8820,"depth":244,"text":8821},{"id":8880,"depth":244,"text":8881},{"id":8923,"depth":244,"text":8924},{"id":8980,"depth":244,"text":8981},{"id":9439,"depth":244,"text":9440},{"id":9763,"depth":244,"text":9764},{"id":9929,"depth":244,"text":9930},{"id":10407,"depth":244,"text":10408},{"id":10581,"depth":244,"text":10582},{"id":11190,"depth":244,"text":11191},{"id":11235,"depth":244,"text":11236},{"id":11338,"depth":244,"text":11339},{"id":11377,"depth":244,"text":11378},{"id":11437,"depth":244,"text":11438},{"id":11575,"depth":244,"text":11576},{"id":3224,"depth":244,"text":3225},{"id":11681,"depth":244,"text":11682},"2026-05-12","Honest tutorial to spin up metrics, logs and dashboards for your cluster — in 4 hours, without Datadog. Open-source stack that fits in 1 VPS at R$80\u002Fmonth.",{},"\u002Fen\u002Fblog\u002Fmonitoring-stack-prometheus-grafana-loki",{"title":8773,"description":11772},{"loc":11774},"en\u002Fblog\u002Fmonitoring-stack-prometheus-grafana-loki",[11779,11780,11781,11782,3392,3378],"prometheus","grafana","loki","monitoring","qXuCsrBWk65Tau6l18D0_EwAL61sTr4A97-gZfDIzKs",{"id":11785,"title":11786,"author":7,"body":11787,"category":3378,"cover":3379,"date":12737,"description":12738,"draft":3382,"extension":3383,"lastReviewed":3379,"meta":12739,"navigation":411,"path":12740,"readingTime":4401,"seo":12741,"sitemap":12742,"stem":12743,"tags":12744,"__hash__":12749},"blog_en\u002Fen\u002Fblog\u002Fcloudflare-in-front-of-self-hosted-cluster.md","Cloudflare in front of a self-hosted cluster: is it worth it in 
2026?",{"type":9,"value":11788,"toc":12715},[11789,11792,11796,11799,11802,11805,11809,11812,11868,11871,11875,11878,12018,12021,12025,12028,12034,12040,12047,12050,12054,12057,12083,12090,12093,12113,12117,12120,12158,12161,12165,12168,12241,12260,12264,12267,12273,12279,12285,12291,12295,12298,12318,12321,12325,12528,12531,12535,12538,12558,12562,12565,12569,12575,12579,12582,12586,12589,12625,12628,12632,12635,12639,12642,12646,12653,12657,12660,12664,12667,12670,12686,12689,12692,12710,12713],[12,11790,11791],{},"The question comes back every week in a Brazilian DevOps group: \"I brought up my cluster with three servers on DigitalOcean, is it worth putting Cloudflare in front?\". The short answer is \"almost always yes\" — but the \"almost\" carries trade-offs that nobody mentions until the first time something breaks in production and you spend two hours debugging a cache rule that masked a 500 from the app. This post is the long version, with measurable criteria, of the decision you need to make before moving the nameserver.",[19,11793,11795],{"id":11794},"tldr-the-200-word-summary","TL;DR — the 200-word summary",[12,11797,11798],{},"Free Cloudflare became de facto standard for any Brazilian site with traffic: protects against DDoS without contractual limit, issues automatic SSL certificate, caches assets in over 300 cities, hides the server's origin IP and still delivers DNS with sub-10ms. 
For a self-hosted cluster — be it HeroCtl, Coolify, k3s or Docker Swarm — putting Cloudflare in front is an easy decision in around 90% of cases.",[12,11800,11801],{},"The remaining 10% have concrete trade-offs: additional latency of 10 to 30ms on dynamic routes, TLS terminates at Cloudflare by default (no longer end-to-end to your server), cache rules can mask subtle bugs in the app, and lock-in grows as you adopt Workers, R2 and Pages.",[12,11803,11804],{},"Worth it when you want: DDoS protection without paying; a global cache to reduce bandwidth cost; the server IP hidden from scanners. Not worth it when: financial\u002Fhealth compliance requires truly end-to-end TLS; you need p99 below 50ms on dynamic routes; the cluster already has an internal CDN edge in multiple data centers. A cluster with an integrated router already covers around 60% of what Cloudflare offers — combining the two is the most common path.",[19,11806,11808],{"id":11807},"what-does-cloudflare-offer-for-free-in-2026","What does Cloudflare offer for free in 2026?",[12,11810,11811],{},"The free offering grew year over year. Today, the free plan covers what was a paid plan in 2019:",[2734,11813,11814,11820,11826,11832,11838,11844,11850,11856,11862],{},[70,11815,11816,11819],{},[27,11817,11818],{},"DDoS protection without contractual limit"," — Layer 3, 4 and 7. Cloudflare absorbs attacks of hundreds of Gbps without charging excess.",[70,11821,11822,11825],{},[27,11823,11824],{},"Automatic SSL\u002FTLS certificate"," — issued in minutes by Cloudflare itself, automatically renewed. Wildcard requires the Advanced Certificate Manager plan (US$ 10\u002Fmonth).",[70,11827,11828,11831],{},[27,11829,11830],{},"Global CDN"," — over 300 cities in over 120 countries. 
Includes São Paulo, Rio, Fortaleza, Curitiba and Porto Alegre.",[70,11833,11834,11837],{},[27,11835,11836],{},"Authoritative DNS"," — sub-10ms global average, anycast, with APIs for automation.",[70,11839,11840,11843],{},[27,11841,11842],{},"Basic bot protection"," — blocking known bots and JavaScript challenges on suspicious traffic.",[70,11845,11846,11849],{},[27,11847,11848],{},"Static asset cache"," — recognized extensions (CSS, JS, images, fonts) cached by default.",[70,11851,11852,11855],{},[27,11853,11854],{},"Page Rules"," — three free rules to force HTTPS, extra cache, redirects.",[70,11857,11858,11861],{},[27,11859,11860],{},"Always Online"," — when origin falls, Cloudflare serves the last cached version.",[70,11863,11864,11867],{},[27,11865,11866],{},"Web Analytics"," — RUM metrics (visits, countries, browsers), no cookies.",[12,11869,11870],{},"The cutoff line is generous enough that a 10k visitors\u002Fday site runs 100% on free without any operational problem.",[19,11872,11874],{"id":11873},"and-what-does-cloudflare-charge-extra-for","And what does Cloudflare charge extra for?",[12,11876,11877],{},"Four plans: Free, Pro (US$ 25\u002Fmonth per domain), Business (US$ 250\u002Fmonth per domain) and Enterprise (on consultation, generally above US$ 5k\u002Fmonth).",[119,11879,11880,11897],{},[122,11881,11882],{},[125,11883,11884,11887,11889,11892,11895],{},[128,11885,11886],{},"Resource",[128,11888,7825],{},[128,11890,11891],{},"Pro US$ 25",[128,11893,11894],{},"Business US$ 250",[128,11896,4359],{},[141,11898,11899,11915,11929,11942,11956,11971,11984,12001],{},[125,11900,11901,11904,11906,11909,11912],{},[146,11902,11903],{},"WAF managed rulesets",[146,11905,3058],{},[146,11907,11908],{},"Yes (basic OWASP)",[146,11910,11911],{},"Yes (advanced)",[146,11913,11914],{},"Custom",[125,11916,11917,11920,11922,11925,11927],{},[146,11918,11919],{},"Image Resizing",[146,11921,3058],{},[146,11923,11924],{},"Yes (US$ 
5\u002FM)",[146,11926,3064],{},[146,11928,3064],{},[125,11930,11931,11934,11936,11938,11940],{},[146,11932,11933],{},"Polish (image optimization)",[146,11935,3058],{},[146,11937,3064],{},[146,11939,3064],{},[146,11941,3064],{},[125,11943,11944,11947,11949,11952,11954],{},[146,11945,11946],{},"Argo Smart Routing",[146,11948,3058],{},[146,11950,11951],{},"US$ 5\u002Fmonth add-on",[146,11953,3064],{},[146,11955,3064],{},[125,11957,11958,11961,11963,11966,11968],{},[146,11959,11960],{},"Page Rules included",[146,11962,2698],{},[146,11964,11965],{},"20",[146,11967,5650],{},[146,11969,11970],{},"125+",[125,11972,11973,11976,11978,11980,11982],{},[146,11974,11975],{},"Cache Reserve",[146,11977,3058],{},[146,11979,3058],{},[146,11981,3064],{},[146,11983,3064],{},[125,11985,11986,11989,11992,11995,11998],{},[146,11987,11988],{},"Customer Support SLA",[146,11990,11991],{},"Best-effort",[146,11993,11994],{},"24h",[146,11996,11997],{},"Chat 24\u002F7",[146,11999,12000],{},"Dedicated engineer",[125,12002,12003,12006,12009,12012,12015],{},[146,12004,12005],{},"Log analysis",[146,12007,12008],{},"Last hour",[146,12010,12011],{},"Last 24h",[146,12013,12014],{},"Last 7 days",[146,12016,12017],{},"30 days",[12,12019,12020],{},"Workers and R2 have free tier independent of plan: 100k requests\u002Fday for Workers, 10 GB of storage and 1 million Class A operations\u002Fmonth for R2. For a modest marketing site, you can run image storage on R2 without ever reaching the bill.",[19,12022,12024],{"id":12023},"does-cloudflare-add-latency","Does Cloudflare add latency?",[12,12026,12027],{},"The honest question. Honest answer too: depends on the route.",[12,12029,12030,12033],{},[27,12031,12032],{},"For cached routes"," (static HTML, assets, optimized images), Cloudflare reduces latency. The user in Recife gets the content from the Fortaleza or São Paulo POP in 15 to 40ms, instead of doing round-trip to your server in New Jersey or Frankfurt. 
Typical savings: 150 to 250ms per request.",[12,12035,12036,12039],{},[27,12037,12038],{},"For dynamic routes"," (API, logged dashboard, checkout), traffic passes through the Cloudflare proxy before reaching your server. That adds between 10 and 30ms in normal conditions. The exact number depends on which POP the user is connected to and where the origin server is.",[12,12041,12042,12043,12046],{},"We measured on the public production cluster: the average response time of ",[231,12044,12045],{},"manage.heroctl.com\u002Fv1\u002Fnodes"," is 38ms without Cloudflare proxy and 51ms with proxy enabled, requesting from the same notebook in São Paulo. A delta of 13ms — perceptible in benchmark, invisible to a human.",[12,12048,12049],{},"Latency is only a dealbreaker in three real scenarios: online gaming, high-frequency financial auction, and low-latency WebSocket loads (trading, live collaboration). For the rest, the 13ms disappear in the browser render time.",[19,12051,12053],{"id":12052},"does-cloudflare-break-end-to-end-tls","Does Cloudflare break end-to-end TLS?",[12,12055,12056],{},"By default, yes. See the modes:",[2734,12058,12059,12065,12071,12077],{},[70,12060,12061,12064],{},[27,12062,12063],{},"Flexible"," (NEVER use this) — TLS only between client and Cloudflare. Cloudflare → server connection is plain HTTP. Vulnerable to interception on the inner leg.",[70,12066,12067,12070],{},[27,12068,12069],{},"Full"," — TLS between client and Cloudflare, and separately between Cloudflare and server. But Cloudflare accepts invalid\u002Fself-signed certificate at the server. Risk of man-in-the-middle between Cloudflare and origin.",[70,12072,12073,12076],{},[27,12074,12075],{},"Full (strict)"," — TLS on both legs, and Cloudflare requires valid certificate at origin. 
This is the minimum reasonable configuration.",[70,12078,12079,12082],{},[27,12080,12081],{},"Strict (SSL-Only Origin Pull)"," — Cloudflare verifies that the origin's certificate was issued by a public valid CA for the hostname. More secure than Full strict.",[12,12084,12085,12086,12089],{},"In all these modes, ",[27,12087,12088],{},"Cloudflare decrypts traffic in the middle of the path",". They see request body, headers, cookies — everything. For most cases that is acceptable (the contract with Cloudflare is clear), but in strict compliance (health, financial, government) it can break audit requirements.",[12,12091,12092],{},"The real exit for end-to-end:",[2734,12094,12095,12101,12107],{},[70,12096,12097,12100],{},[27,12098,12099],{},"Authenticated Origin Pulls"," — Cloudflare presents a client certificate when connecting to your origin; the server only accepts connections from that chain. Still decrypts in the middle, but at least only Cloudflare can reach your origin.",[70,12102,12103,12106],{},[27,12104,12105],{},"Cloudflare Tunnel + mTLS client at the endpoint"," — for internal apps, Tunnel replaces public IP and requires client certificate.",[70,12108,12109,12112],{},[27,12110,12111],{},"Gray cloud (DNS only)"," — disables proxy. You lose DDoS protection, cache, WAF — but get direct client-server connection with truly end-to-end TLS. It is a valid option when compliance commands.",[19,12114,12116],{"id":12115},"will-i-get-locked-in-to-cloudflare","Will I get locked-in to Cloudflare?",[12,12118,12119],{},"Depends exclusively on which features you adopt. Let's go layer by layer:",[2734,12121,12122,12128,12134,12140,12146,12152],{},[70,12123,12124,12127],{},[27,12125,12126],{},"DNS"," — trivially reversible. Moving nameserver takes 24 to 48h of propagation and nothing breaks. Zero lock-in.",[70,12129,12130,12133],{},[27,12131,12132],{},"Proxy + cache + WAF"," — reversible in hours. 
You disable the orange cloud, adjust DNS to point directly to the server, reconfigure WAF on your origin (if any). Low lock-in.",[70,12135,12136,12139],{},[27,12137,12138],{},"Workers"," — real lock-in. The Workers API is proprietary; rewriting to Lambda@Edge or Fastly Compute@Edge costs days to weeks depending on the code. It is not the worst case, but count on rework.",[70,12141,12142,12145],{},[27,12143,12144],{},"R2 Object Storage"," — API S3-compatible, so code keeps working. But R2 doesn't charge egress (S3 charges US$ 0.09\u002FGB), so moving to another provider makes the bill more expensive. Economic lock-in, not technical.",[70,12147,12148,12151],{},[27,12149,12150],{},"Pages"," — moderate lock-in. The build process is custom; a rewrite to Vercel\u002FNetlify\u002Fa generic static host takes an afternoon, but it is still real work.",[70,12153,12154,12157],{},[27,12155,12156],{},"Zero Trust"," — high lock-in. Policies, identity, tunnels: a complete rewrite to Tailscale\u002FTwingate\u002Fequivalent.",[12,12159,12160],{},"The operational recommendation is: use the Cloudflare core (DNS + proxy + WAF + Page Rules) without hesitation — you can revert in a day. Adopt Workers\u002FR2\u002FPages only with clear awareness that you are accepting lock-in proportional to the value that feature delivers.",[19,12162,12164],{"id":12163},"minimum-recommended-configuration-for-self-hosted-cluster","Minimum recommended configuration for self-hosted cluster",[12,12166,12167],{},"Practical sequence, no secrets:",[67,12169,12170,12176,12185,12195,12201,12207,12213,12223,12229,12235],{},[70,12171,12172,12175],{},[27,12173,12174],{},"Create a Cloudflare account"," and add the domain. The site will scan your current DNS records and copy them to the new zone.",[70,12177,12178,12181,12182,101],{},[27,12179,12180],{},"Change the nameservers"," at the registrar (Hostinger, Registro.br, GoDaddy, wherever you are). Wait 4 to 48 hours for propagation. 
Verify with ",[231,12183,12184],{},"dig NS heroctl.com +short",[70,12186,12187,12190,12191,12194],{},[27,12188,12189],{},"DNS records of the cluster",": create an A record for the root domain pointing to the IP of the server receiving traffic, and a wildcard A record ",[231,12192,12193],{},"*"," pointing to the same IP. Mark both with proxy enabled (orange cloud).",[70,12196,12197,12200],{},[27,12198,12199],{},"SSL\u002FTLS mode",": configure Full (strict). That requires the cluster to have a valid certificate. The HeroCtl integrated router issues Let's Encrypt automatically — works out of the box.",[70,12202,12203,12206],{},[27,12204,12205],{},"Always Use HTTPS",": ON. Redirects any HTTP to HTTPS at the edge.",[70,12208,12209,12212],{},[27,12210,12211],{},"HSTS",": 6 months, include subdomains, no preload for now. Preload is a definitive decision — you can't undo it quickly if something breaks.",[70,12214,12215,12218,12219,12222],{},[27,12216,12217],{},"Page Rule for cache"," of static assets: ",[231,12220,12221],{},"*heroctl.com\u002Fstatic\u002F*"," → Cache Level: Cache Everything, Edge Cache TTL: 1 month.",[70,12224,12225,12228],{},[27,12226,12227],{},"WAF managed ruleset"," (Pro+): enable the Cloudflare Managed Ruleset and OWASP Core Rule Set in Block mode for high-score rules.",[70,12230,12231,12234],{},[27,12232,12233],{},"Security Level",": Medium. Low lets too many bots through; High challenges legitimate people.",[70,12236,12237,12240],{},[27,12238,12239],{},"Bot Fight Mode",": ON on the free plan. 
Controls basic scrapers without challenging humans with a CAPTCHA.",[12,12242,12243,12244,12247,12248,12251,12252,12255,12256,12259],{},"After applying all of that, run ",[231,12245,12246],{},"curl -I https:\u002F\u002Fyourdomain.com"," and confirm: header ",[231,12249,12250],{},"cf-ray"," present, header ",[231,12253,12254],{},"server: cloudflare",", header ",[231,12257,12258],{},"strict-transport-security"," with long max-age.",[19,12261,12263],{"id":12262},"when-is-cloudflare-not-worth-it","When is Cloudflare NOT worth it?",[12,12265,12266],{},"Four scenarios where the recommendation changes. They matter more than they seem.",[12,12268,12269,12272],{},[27,12270,12271],{},"Cluster with robust internal CDN\u002Fedge."," If you already run in four or five geographically spread regions, with proximity-based DNS balancing and local cache in each region, Cloudflare's CDN adds latency without gain. It's worth running gray cloud (DNS only) and keeping the rest direct.",[12,12274,12275,12278],{},[27,12276,12277],{},"Financial or health compliance with mandatory end-to-end mTLS."," LGPD by itself doesn't require this; but specific audits (PCI-DSS Level 1 with custom requirements, strict HIPAA certifications, banking frameworks) may require encrypted traffic to never be decrypted at a third party. Since Cloudflare decrypts in the middle of the path, it doesn't pass those audits.",[12,12280,12281,12284],{},[27,12282,12283],{},"Purely internal apps (intranet\u002Fclosed B2B SaaS)."," Free Cloudflare doesn't cover advanced Zero Trust. For an app that exclusively serves employees, Tailscale or native WireGuard deliver more with less.",[12,12286,12287,12290],{},[27,12288,12289],{},"Small sites without traffic and without public enemy."," Personal blog of 200 visits\u002Fmonth, without payment form, without sensitive data. Direct DNS at Hostinger\u002FRegistro.br + Let's Encrypt from the integrated router serves perfectly. 
Adding Cloudflare is unnecessary ceremony.",[19,12292,12294],{"id":12293},"how-does-cloudflare-interact-with-a-high-availability-cluster","How does Cloudflare interact with a high availability cluster?",[12,12296,12297],{},"Here the design matters. A cluster with three or more nodes serves traffic on all of them — there is no single \"main\" node. The pragmatic configuration is:",[2734,12299,12300,12306,12312],{},[70,12301,12302,12305],{},[27,12303,12304],{},"DNS round-robin with health",": register A records for the IP of all nodes that run the router. Cloudflare does health check (Pro+) and removes a broken node from rotation automatically.",[70,12307,12308,12311],{},[27,12309,12310],{},"Cloudflare failover",": ~30 seconds to detect a dead node and remove from rotation (configurable to 5 seconds on Enterprise).",[70,12313,12314,12317],{},[27,12315,12316],{},"Internal cluster failover",": the HeroCtl integrated router reroutes traffic between healthy nodes in around 5 seconds. New coordinator election happens in ~7 seconds when the leader node falls.",[12,12319,12320],{},"Combined, real downtime perceived by the user stays below 40 seconds in the worst case (Cloudflare detects + cluster reacts). Without Cloudflare, stays at ~7 seconds (cluster alone). With Cloudflare and aggressive monitoring configuration (Pro+), back to ~10 seconds. The choice is clear: if you don't need DDoS protection, the cluster alone is already faster. 
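These windows are easy to measure yourself during a failover drill instead of taking them on faith. A hedged sketch — the `/health` path is an assumption, use whatever route your app exposes — that prints one timestamped status per second while you kill a node:

```shell
# Probe a URL once, printing "epoch http_code"; 000 means no answer at all.
probe_once() {
  code=$(curl -s -o /dev/null --max-time 2 -w '%{http_code}' "$1" 2>/dev/null)
  printf '%s %s\n' "$(date +%s)" "${code:-000}"
}

# Run during the drill; the count of consecutive non-200 lines is the
# user-visible downtime in seconds (uncomment to run):
# while true; do probe_once "https://yourdomain.com/health"; sleep 1; done
```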
If you need it, Cloudflare adds 30s of detection in exchange for protection against attacks.",[19,12322,12324],{"id":12323},"comparison-table-12-decision-criteria","Comparison table: 12 decision criteria",[119,12326,12327,12345],{},[122,12328,12329],{},[125,12330,12331,12333,12336,12339,12342],{},[128,12332,2982],{},[128,12334,12335],{},"Without Cloudflare",[128,12337,12338],{},"CF Free",[128,12340,12341],{},"CF Pro US$ 25",[128,12343,12344],{},"CF Business US$ 250",[141,12346,12347,12363,12380,12397,12413,12426,12441,12456,12471,12483,12495,12511],{},[125,12348,12349,12352,12355,12358,12360],{},[146,12350,12351],{},"DDoS Layer 3\u002F4",[146,12353,12354],{},"You handle it",[146,12356,12357],{},"Unlimited",[146,12359,12357],{},[146,12361,12362],{},"Unlimited + SLA",[125,12364,12365,12368,12371,12374,12377],{},[146,12366,12367],{},"DDoS Layer 7",[146,12369,12370],{},"Not available",[146,12372,12373],{},"Basic",[146,12375,12376],{},"Advanced",[146,12378,12379],{},"Advanced + Custom Rules",[125,12381,12382,12385,12388,12391,12394],{},[146,12383,12384],{},"Added latency on dynamic routes",[146,12386,12387],{},"0ms",[146,12389,12390],{},"+13 to 30ms",[146,12392,12393],{},"+10 to 25ms (Argo optional)",[146,12395,12396],{},"+5 to 15ms (Argo included)",[125,12398,12399,12402,12405,12408,12410],{},[146,12400,12401],{},"Global static cache",[146,12403,12404],{},"You build",[146,12406,12407],{},"300+ cities",[146,12409,12407],{},[146,12411,12412],{},"300+ cities + Reserve",[125,12414,12415,12418,12420,12422,12424],{},[146,12416,12417],{},"Hides server IP",[146,12419,3058],{},[146,12421,3064],{},[146,12423,3064],{},[146,12425,3064],{},[125,12427,12428,12431,12433,12436,12438],{},[146,12429,12430],{},"Truly end-to-end TLS",[146,12432,3064],{},[146,12434,12435],{},"No (decrypts)",[146,12437,3058],{},[146,12439,12440],{},"No (but Origin Pulls)",[125,12442,12443,12446,12448,12450,12453],{},[146,12444,12445],{},"Managed 
WAF",[146,12447,12370],{},[146,12449,3058],{},[146,12451,12452],{},"Basic OWASP",[146,12454,12455],{},"Advanced OWASP",[125,12457,12458,12461,12463,12465,12468],{},[146,12459,12460],{},"Bot protection",[146,12462,12404],{},[146,12464,12239],{},[146,12466,12467],{},"Super Bot Fight",[146,12469,12470],{},"Bot Management ML",[125,12472,12473,12475,12477,12479,12481],{},[146,12474,11854],{},[146,12476,3055],{},[146,12478,2698],{},[146,12480,11965],{},[146,12482,5650],{},[125,12484,12485,12487,12489,12491,12493],{},[146,12486,11860],{},[146,12488,3058],{},[146,12490,3064],{},[146,12492,3064],{},[146,12494,3064],{},[125,12496,12497,12500,12503,12505,12508],{},[146,12498,12499],{},"Monthly cost per domain",[146,12501,12502],{},"US$ 0",[146,12504,12502],{},[146,12506,12507],{},"US$ 25",[146,12509,12510],{},"US$ 250",[125,12512,12513,12516,12519,12522,12525],{},[146,12514,12515],{},"Proportional lock-in",[146,12517,12518],{},"Zero",[146,12520,12521],{},"Low (DNS+proxy)",[146,12523,12524],{},"Low to medium",[146,12526,12527],{},"Medium (Workers\u002FR2 begin to enter)",[12,12529,12530],{},"The line that decides for most is \"DDoS Layer 7 + hides IP\". These two alone justify the free plan. Paid lines only make sense with high-volume traffic or formal WAF requirement.",[19,12532,12534],{"id":12533},"does-free-cloudflare-have-a-traffic-limit","Does free Cloudflare have a traffic limit?",[12,12536,12537],{},"There is no contractual bandwidth limit on the free plan for normal web traffic through the proxy. But there are three practical limits worth mentioning:",[2734,12539,12540,12546,12552],{},[70,12541,12542,12545],{},[27,12543,12544],{},"Section 2.8 of the Terms of Service",": the free plan is for sites whose main content is HTML, and Cloudflare reserves the right to ask for upgrade if you use the service primarily to serve video or large files. 
In practice, they rarely act on this — but if you become a host for 50TB\u002Fmonth of pirated videos, expect to receive an email.",[70,12547,12548,12551],{},[27,12549,12550],{},"Workers free",": 100k requests\u002Fday. Above that, Workers Paid (US$ 5\u002Fmonth) with 10M requests included.",[70,12553,12554,12557],{},[27,12555,12556],{},"R2 free",": 10GB of storage, 1M Class A operations\u002Fmonth, 10M Class B operations\u002Fmonth. Above, US$ 0.015\u002FGB-month.",[19,12559,12561],{"id":12560},"can-i-use-cloudflare-dns-without-the-proxy","Can I use Cloudflare DNS without the proxy?",[12,12563,12564],{},"Yes — \"DNS only\" mode (gray cloud). You use Cloudflare DNS (fast, free, anycast global) but traffic goes directly to your server without passing through the proxy. You lose DDoS, cache, WAF, IP hiding — keep only the DNS infrastructure. Useful when: compliance prohibits decryption at third parties; you only want fast DNS without touching the traffic path; you are testing before activating the proxy.",[19,12566,12568],{"id":12567},"does-free-waf-block-sql-injection","Does free WAF block SQL injection?",[12,12570,12571,12572,12574],{},"Cloudflare Free has ",[27,12573,12239],{}," and automatic mitigation rules for obvious patterns, but doesn't have the complete OWASP Managed Ruleset. For reliable blocking of SQL injection, XSS, known RCE patterns, you need the Pro plan or higher. Alternative: run ModSecurity or your own WAF at your origin — works, but adds CPU and configuration.",[19,12576,12578],{"id":12577},"does-cloudflare-have-a-datacenter-in-brazil","Does Cloudflare have a datacenter in Brazil?",[12,12580,12581],{},"Yes. In 2026 there are five Brazilian POPs: São Paulo (two POPs), Rio de Janeiro, Fortaleza, Curitiba and Porto Alegre. Typical latency from any city in the Southeast to a POP stays below 20ms. The Fortaleza POP serves the Northeast very well because of the submarine cables that land there (EllaLink, Monet, GlobeNet). 
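You can check which POP actually serves you: any proxied zone exposes `/cdn-cgi/trace`, and its `colo=` field carries the IATA code of the datacenter that answered (GRU for São Paulo, FOR for Fortaleza). A small sketch:

```shell
# Extract the colo= field from Cloudflare's /cdn-cgi/trace output.
colo_of() {
  printf '%s\n' "$1" | sed -n 's/^colo=//p'
}

# Against a live proxied domain (uncomment to run):
# colo_of "$(curl -s https://yourdomain.com/cdn-cgi/trace)"
```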
For the North, it is still a longer path — Manaus reaches Fortaleza in 80 to 120ms.",[19,12583,12585],{"id":12584},"how-do-i-migrate-nameservers-from-hostinger-to-cloudflare","How do I migrate nameservers from Hostinger to Cloudflare?",[12,12587,12588],{},"Four steps. Less than an hour of active work, plus up to 48h of propagation:",[67,12590,12591,12597,12609,12615],{},[70,12592,12593,12596],{},[27,12594,12595],{},"Cloudflare",": add the domain. The wizard scans your current DNS and creates the corresponding records in the new zone. Check that everything was copied — MX, TXT (SPF\u002FDKIM\u002FDMARC), CNAME, A. A copy error here can take email down for a week.",[70,12598,12599,12601,12602,2402,12605,12608],{},[27,12600,12595],{},": it gives you two nameservers (something like ",[231,12603,12604],{},"kim.ns.cloudflare.com",[231,12606,12607],{},"walt.ns.cloudflare.com","). Note them.",[70,12610,12611,12614],{},[27,12612,12613],{},"Hostinger",": panel → Domains → your domain → Nameservers → \"Use custom nameservers\" → paste the two from Cloudflare. Save.",[70,12616,12617,12620,12621,12624],{},[27,12618,12619],{},"Wait for propagation",". Verify with ",[231,12622,12623],{},"dig NS yourdomain.com +short",". When the Cloudflare nameservers appear, the domain is under their management. DNS records continue to be edited on the Cloudflare panel from here on.",[12,12626,12627],{},"Important: while propagation happens, some users still resolve via Hostinger. Don't turn off the old zone until you confirm 100% of resolvers have already switched (24 to 48 hours is safe).",[19,12629,12631],{"id":12630},"where-does-tls-terminate-does-e2e-break","Where does TLS terminate? Does E2E break?",[12,12633,12634],{},"In proxy mode (orange cloud), TLS terminates at Cloudflare. They re-establish another TLS connection to your server (in Full strict mode). Technically: decrypts, processes, re-encrypts. 
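This is directly observable: the certificate served at the edge is issued by a Cloudflare CA, not by the Let's Encrypt certificate sitting on your origin. A sketch to check it, using only openssl:

```shell
# Print the issuer of a certificate — from a live host, or from a PEM
# file with -f (handy for comparing the edge cert against the origin's).
cert_issuer() {
  if [ "$1" = "-f" ]; then
    openssl x509 -in "$2" -noout -issuer
  else
    openssl s_client -connect "$1:443" -servername "$1" </dev/null 2>/dev/null \
      | openssl x509 -noout -issuer
  fi
}

# cert_issuer yourdomain.com   # behind the orange cloud: a Cloudflare CA
```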
For truly end-to-end: gray cloud (DNS only) or Cloudflare Tunnel with custom configuration. For most applications, \"truly end-to-end TLS\" is less important than it seems — the attack this protects against (interception in the middle of the network) requires an attacker already inside the Cloudflare network, an unrealistic scenario.",[19,12636,12638],{"id":12637},"cloudflare-workers-vs-serverless-from-my-cloud-when-is-it-worth-it","Cloudflare Workers vs serverless from my cloud — when is it worth it?",[12,12640,12641],{},"Workers are good for: edge computing where latency \u003C50ms matters (geo-routing, A\u002FB testing, header rewrite); lightweight request\u002Fresponse transformation; auth at the edge (validating JWT before reaching origin). They are not good for: workloads with more than 30 seconds of runtime; heavy integration with relational databases (cold start latency of DB driver kills); code that needs libraries that depend on filesystem or process. AWS Lambda remains better for long-runtime workload; Workers win at the edge. Use both, don't replace one with the other.",[19,12643,12645],{"id":12644},"can-i-use-cloudflare-r2-with-a-self-hosted-cluster","Can I use Cloudflare R2 with a self-hosted cluster?",[12,12647,12648,12649,12652],{},"Yes — R2 is S3-compatible at the API level. Your app uses ",[231,12650,12651],{},"aws-sdk"," configured with R2 endpoint and R2 credentials; code keeps the same. Economic advantage: zero egress fee. You can serve heavy downloads (installers, product videos, backups) directly from R2 without paying for outgoing bandwidth. Disadvantage: documented durability is 99.999999999% (11 nines), same as S3, but R2's operational history is shorter. For critical hot path, some teams prefer to keep S3 and use R2 only for cold storage and static delivery.",[19,12654,12656],{"id":12655},"origin-fell-does-always-online-solve-it","Origin fell — does Always Online solve it?",[12,12658,12659],{},"Partially. 
Always Online serves the last cached version of HTML pages when the server is offline. But: only works for routes that were being cached; only serves the static version (without updated dynamic data); only lasts while Cloudflare keeps the snapshot (usually a few days). It is a good safety net for static blog and marketing. Doesn't replace real cluster high availability — for a dynamic app, what solves it is the cluster having three nodes and automatic election when one falls.",[19,12661,12663],{"id":12662},"closing-combining-cloudflare-with-self-hosted-cluster","Closing — combining Cloudflare with self-hosted cluster",[12,12665,12666],{},"The combination we recommend for 90% of cases is: self-hosted cluster with three or more nodes (real high availability) + Cloudflare Free at the edge (DDoS, cache, IP hiding). The cluster takes care of internal routing, automatic certificates, failover between nodes in seconds. Cloudflare takes care of public protection, global cache and IP obfuscation. The two layers complement each other — they don't compete.",[12,12668,12669],{},"To start from scratch with this combination:",[224,12671,12672],{"className":226,"code":5318,"language":228,"meta":229,"style":229},[231,12673,12674],{"__ignoreMap":229},[234,12675,12676,12678,12680,12682,12684],{"class":236,"line":237},[234,12677,1220],{"class":247},[234,12679,2957],{"class":251},[234,12681,5329],{"class":255},[234,12683,2963],{"class":383},[234,12685,2966],{"class":247},[12,12687,12688],{},"You end up with a functional cluster on three nodes, automatic Let's Encrypt certificate on the domain you choose, web panel to submit jobs, real high availability. Then add Cloudflare Free in front of the domain and configure as per the \"Minimum configuration\" section of this post. 
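A last smoke test ties the two layers together — the same three headers from the verification section, checked in one function (pass it the output of `curl -sI https://yourdomain.com`):

```shell
# Succeeds only if the response headers show Cloudflare in front:
# cf-ray, server: cloudflare and a strict-transport-security header.
behind_cloudflare() {
  h=$(printf '%s' "$1" | tr '[:upper:]' '[:lower:]')
  case "$h" in *cf-ray*) ;; *) return 1 ;; esac
  case "$h" in *"server: cloudflare"*) ;; *) return 1 ;; esac
  case "$h" in *strict-transport-security*) ;; *) return 1 ;; esac
}

# behind_cloudflare "$(curl -sI https://yourdomain.com)" && echo "edge OK"
```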
Total time: an afternoon.",[12,12690,12691],{},"More reading along this line:",[2734,12693,12694,12704],{},[70,12695,12696,12699,12700,12703],{},[3336,12697,12698],{"href":3343},"Docker deploy in production: from compose to cluster"," — how to leave ",[231,12701,12702],{},"docker compose up"," and reach real high availability, with the intermediate steps.",[70,12705,12706,12709],{},[3336,12707,12708],{"href":11719},"Observability without Datadog: stack for a startup"," — metrics, logs and tracing without paying US$ 2,000\u002Fmonth for an observability SaaS.",[12,12711,12712],{},"Cloudflare is one of the few tools where the free tier is so good that refusing is stubbornness. But like any infra choice, the hard part is understanding exactly where the boundary lies — and, primarily, where it passes through your application's encrypted traffic.",[3350,12714,4376],{},{"title":229,"searchDepth":244,"depth":244,"links":12716},[12717,12718,12719,12720,12721,12722,12723,12724,12725,12726,12727,12728,12729,12730,12731,12732,12733,12734,12735,12736],{"id":11794,"depth":244,"text":11795},{"id":11807,"depth":244,"text":11808},{"id":11873,"depth":244,"text":11874},{"id":12023,"depth":244,"text":12024},{"id":12052,"depth":244,"text":12053},{"id":12115,"depth":244,"text":12116},{"id":12163,"depth":244,"text":12164},{"id":12262,"depth":244,"text":12263},{"id":12293,"depth":244,"text":12294},{"id":12323,"depth":244,"text":12324},{"id":12533,"depth":244,"text":12534},{"id":12560,"depth":244,"text":12561},{"id":12567,"depth":244,"text":12568},{"id":12577,"depth":244,"text":12578},{"id":12584,"depth":244,"text":12585},{"id":12630,"depth":244,"text":12631},{"id":12637,"depth":244,"text":12638},{"id":12644,"depth":244,"text":12645},{"id":12655,"depth":244,"text":12656},{"id":12662,"depth":244,"text":12663},"2026-05-08","Free Cloudflare blocks DDoS, caches static assets and hides the server IP. But adds latency, lock-in and features you may not use. 
When it's worth it and when it's overkill.",{},"\u002Fen\u002Fblog\u002Fcloudflare-in-front-of-self-hosted-cluster",{"title":11786,"description":12738},{"loc":12740},"en\u002Fblog\u002Fcloudflare-in-front-of-self-hosted-cluster",[12745,12746,12747,12748,3378],"cloudflare","cdn","ddos","performance","sh_ab86c1jnH6sTCINkR_53BAVKSC-NxuaXZKkb26gc",{"id":12751,"title":12752,"author":7,"body":12753,"category":3378,"cover":3379,"date":13831,"description":13832,"draft":3382,"extension":3383,"lastReviewed":3379,"meta":13833,"navigation":411,"path":13834,"readingTime":4401,"seo":13835,"sitemap":13836,"stem":13837,"tags":13838,"__hash__":13841},"blog_en\u002Fen\u002Fblog\u002Fsentry-self-hosted-vs-saas-cost-comparison.md","Sentry self-hosted vs SaaS: how much you really save for a Brazilian startup",{"type":9,"value":12754,"toc":13815},[12755,12758,12761,12765,12780,12791,12802,12808,12812,12815,12859,12862,12866,12869,12874,12884,12887,12892,12911,12914,12919,12938,12941,12944,12948,12955,13082,13089,13100,13104,13115,13163,13169,13173,13176,13186,13191,13205,13212,13217,13237,13242,13262,13265,13269,13272,13327,13330,13360,13367,13369,13543,13547,13550,13556,13562,13568,13574,13578,13581,13587,13593,13599,13602,13606,13609,13629,13632,13636,13639,13674,13677,13679,13689,13699,13705,13711,13720,13741,13747,13757,13759,13762,13765,13776,13779,13782,13798,13801,13813],[12,12756,12757],{},"The question reaches the inbox almost every week of those who follow the blog: is it worth dropping Sentry SaaS and self-hosting? The honest answer starts with a \"depends\" — and most content circulating on the subject treats the \"depends\" as if it were evasive. It isn't. There are three hard variables that decide the math — event volume, team operational capacity, and compliance requirement — and each has a number you can measure before making the decision.",[12,12759,12760],{},"This post is the long version of the answer. 
If you're CTO of a Brazilian startup at the stage between five and fifty engineers, watching the Sentry bill rise month by month, what comes below serves as an explained calculator — not as a sermon for or against self-hosted. At the end there's a table with twelve criteria, an FAQ section for questions that didn't fit in the body, and three well-defined profiles of when each path makes sense.",[19,12762,12764],{"id":12763},"tldr-30-second-version","TL;DR — 30-second version",[12,12766,12767,12768,12771,12772,12775,12776,12779],{},"Sentry SaaS starts at ",[27,12769,12770],{},"US$26\u002Fmonth"," on the Team plan — covers 50 thousand errors\u002Fmonth and five users. For a Brazilian startup with serious traffic, the bill rises fast to the ",[27,12773,12774],{},"US$80–200\u002Fmonth"," band (R$400–1,000\u002Fmonth at R$5\u002FUSD exchange rate), and at scale-up easily reaches ",[27,12777,12778],{},"US$300–500\u002Fmonth"," (R$1.5k–2.5k\u002Fmonth). It's predictable and asks nothing of the team beyond the credit card.",[12,12781,12782,12783,12786,12787,12790],{},"Self-hosted is \"free open-source\" in the marketing, but the software runs ",[27,12784,12785],{},"ten to twelve containers"," — PostgreSQL, Redis, Kafka, ZooKeeper, ClickHouse, Symbolicator, and four distinct Sentry processes. It asks for ",[27,12788,12789],{},"8 GB of RAM and 4 vCPUs"," on a dedicated server, plus storage, backup, quarterly updates, and the time of the dev who takes care of all that.",[12,12792,12793,12794,12797,12798,12801],{},"It's worth self-hosting when: (a) volume passed ",[27,12795,12796],{},"1 million errors\u002Fmonth"," and the SaaS bill started to hurt, (b) the team has operational capacity to take care of a complex stack without becoming a bottleneck, or (c) compliance demands data on your own server. 
",[27,12799,12800],{},"It's not worth it"," for a team of one to three people focused on product, or at low volume where the SaaS bill is just budget noise.",[12,12803,12804,12807],{},[27,12805,12806],{},"Interesting middle ground:"," GlitchTip — open-source MIT, compatible with the existing Sentry SDK, runs on Postgres + Redis (no Kafka, no ClickHouse, no ZooKeeper). Covers 80% of Sentry's value with 20% of the operational overhead.",[19,12809,12811],{"id":12810},"what-sentry-saas-does-well","What Sentry SaaS does well",[12,12813,12814],{},"Before discussing cost, it's worth recognizing what you're paying for. Hosted Sentry delivers a combination that's hard to reproduce at home without investing several team quarters:",[2734,12816,12817,12823,12829,12835,12841,12847,12853],{},[70,12818,12819,12822],{},[27,12820,12821],{},"Pre-configured stack."," You create an account, install the SDK in the application, and in fifteen minutes you have errors arriving. Zero server, zero config file, zero backup to schedule.",[70,12824,12825,12828],{},[27,12826,12827],{},"Integrated Performance Monitoring."," Distributed tracing, n+1 query detection, slow database queries — all on the same dashboard where you're already looking at errors.",[70,12830,12831,12834],{},[27,12832,12833],{},"Session Replay."," Anonymized recording of the user session up to the error moment. Worth its weight in gold for debugging scenarios that don't reproduce in dev.",[70,12836,12837,12840],{},[27,12838,12839],{},"Mature alerting."," Integration with Slack, PagerDuty, Microsoft Teams, email, and webhooks. Fine-grained rules — alert only when error rate doubles in 5 minutes, only in production, only for authenticated users.",[70,12842,12843,12846],{},[27,12844,12845],{},"Issue tracking + sync."," Links issue to Linear\u002FJira\u002FGitHub Issues automatically. 
Resolve on the tracker side, close on the Sentry side.",[70,12848,12849,12852],{},[27,12850,12851],{},"Continuous Profiling."," Profile in production without perceptible overhead, discovering CPU bottlenecks without needing to reproduce locally. Available only in the SaaS version — self-hosted doesn't support it yet.",[70,12854,12855,12858],{},[27,12856,12857],{},"Technical support."," Depending on the plan, a human team responding within four business hours.",[12,12860,12861],{},"That whole package costs US$26\u002Fmonth on the entry plan. For a team of three people starting a SaaS, it's literally the best use of R$130\u002Fmonth that exists on the shelf.",[19,12863,12865],{"id":12864},"the-saas-math-line-by-line","The SaaS math, line by line",[12,12867,12868],{},"The Sentry SaaS catch isn't the entry price — it's the curve. Let's detail it for a Brazilian startup at three different stages:",[12,12870,12871],{},[27,12872,12873],{},"Stage 1 — first product, R$10–30k MRR.",[2734,12875,12876,12879],{},[70,12877,12878],{},"Team plan: US$26\u002Fmonth (50 thousand errors\u002Fmonth, 5 users)",[70,12880,12881,12882],{},"Total: ",[27,12883,6727],{},[12,12885,12886],{},"That's the band where no one should be thinking about self-hosted. R$130\u002Fmonth is almost budget noise — spending four CTO hours installing monitoring infrastructure costs more than two years of subscription.",[12,12888,12889],{},[27,12890,12891],{},"Stage 2 — product growing, R$50–100k MRR.",[2734,12893,12894,12897,12900,12903,12906],{},[70,12895,12896],{},"Business plan: US$80\u002Fmonth (300 thousand errors\u002Fmonth included)",[70,12898,12899],{},"Extra performance events: US$15–30\u002Fmonth",[70,12901,12902],{},"Session Replay: US$25–50\u002Fmonth",[70,12904,12905],{},"Additional users (10 devs): US$50\u002Fmonth",[70,12907,12881,12908],{},[27,12909,12910],{},"US$170–210\u002Fmonth = R$850–1,050\u002Fmonth",[12,12912,12913],{},"Here the bill starts appearing in the monthly financial report. 
It doesn't hurt yet, but you notice.",[12,12915,12916],{},[27,12917,12918],{},"Stage 3 — scale-up, R$200k+ MRR, 20+ engineers.",[2734,12920,12921,12924,12927,12930,12933],{},[70,12922,12923],{},"Business plan with adjustments: US$200–300\u002Fmonth base",[70,12925,12926],{},"Performance events: US$50–100\u002Fmonth",[70,12928,12929],{},"Session Replay: US$100\u002Fmonth",[70,12931,12932],{},"Users: US$100\u002Fmonth",[70,12934,12881,12935],{},[27,12936,12937],{},"US$450–600\u002Fmonth = R$2,250–3,000\u002Fmonth",[12,12939,12940],{},"At that stage, R$30k\u002Fyear on error tracking starts competing with partial engineer salary. It's the point where the conversation \"is self-hosting worth it?\" stops being theoretical and becomes scheduled meeting.",[12,12942,12943],{},"The real pain of the Business plan isn't the base price — it's the multiplication by add-ons. Performance, Replay, Profiling, Cron Monitoring, Code Coverage — each comes with its own event count and its own bill. It's predictable, but it's not cheap.",[19,12945,12947],{"id":12946},"sentry-self-hosted-what-it-is-exactly","Sentry self-hosted — what it is exactly",[12,12949,12950,12951,12954],{},"The official ",[231,12952,12953],{},"getsentry\u002Fself-hosted"," folder installs a stack that has the following form, in production:",[119,12956,12957,12969],{},[122,12958,12959],{},[125,12960,12961,12964,12966],{},[128,12962,12963],{},"Service",[128,12965,139],{},[128,12967,12968],{},"Typical RAM",[141,12970,12971,12981,12992,13002,13012,13022,13031,13041,13051,13062,13072],{},[125,12972,12973,12976,12979],{},[146,12974,12975],{},"sentry-web",[146,12977,12978],{},"Django frontend + API",[146,12980,11472],{},[125,12982,12983,12986,12989],{},[146,12984,12985],{},"sentry-worker",[146,12987,12988],{},"Async processing (Celery)",[146,12990,12991],{},"768 MB",[125,12993,12994,12997,13000],{},[146,12995,12996],{},"sentry-cron",[146,12998,12999],{},"Scheduled 
tasks",[146,13001,11491],{},[125,13003,13004,13007,13010],{},[146,13005,13006],{},"relay",[146,13008,13009],{},"Event ingestion",[146,13011,11472],{},[125,13013,13014,13017,13020],{},[146,13015,13016],{},"postgres",[146,13018,13019],{},"Metadata, projects, users",[146,13021,11494],{},[125,13023,13024,13026,13029],{},[146,13025,7505],{},[146,13027,13028],{},"Cache + Celery queues",[146,13030,11472],{},[125,13032,13033,13036,13039],{},[146,13034,13035],{},"kafka",[146,13037,13038],{},"Raw event stream",[146,13040,11494],{},[125,13042,13043,13046,13049],{},[146,13044,13045],{},"zookeeper",[146,13047,13048],{},"Kafka coordination",[146,13050,11491],{},[125,13052,13053,13056,13059],{},[146,13054,13055],{},"clickhouse",[146,13057,13058],{},"Analytic event storage",[146,13060,13061],{},"1.5 GB",[125,13063,13064,13067,13070],{},[146,13065,13066],{},"symbolicator",[146,13068,13069],{},"Native stack trace resolution",[146,13071,11472],{},[125,13073,13074,13077,13080],{},[146,13075,13076],{},"snuba-api \u002F snuba-consumer \u002F snuba-replacer",[146,13078,13079],{},"Layer between Sentry and ClickHouse",[146,13081,11494],{},[12,13083,13084,13085,13088],{},"Sum: about ",[27,13086,13087],{},"8 GB of RAM in steady use"," in a small cluster, 12 containers, and that's the floor. At higher volume, Kafka and ClickHouse get hungry pretty quickly.",[12,13090,13091,13092,13095,13096,13099],{},"The less-discussed catch is the license: since 2019, Sentry licenses the self-hosted product under ",[27,13093,13094],{},"BSL 1.1"," (Business Source License). It's open-source in form — you read the code, modify, contribute — but it has a clause that ",[27,13097,13098],{},"prohibits offering Sentry as a commercial service to third parties",". For a company using it internally, it's irrelevant. 
For an agency that thought of including error tracking in the package sold to the customer, it's prohibitive.",[19,13101,13103],{"id":13102},"cluster-setup-high-level-steps","Cluster setup — high-level steps",[12,13105,13106,13107,13110,13111,13114],{},"The official documentation assumes you'll run ",[231,13108,13109],{},".\u002Finstall.sh"," on an Ubuntu server, and then ",[231,13112,13113],{},"docker-compose up -d"," administers the stack. For those operating modern cluster, the path is slightly different:",[67,13116,13117,13127,13137,13147,13157],{},[70,13118,13119,13122,13123,13126],{},[27,13120,13121],{},"Define job spec with 12 containers."," HeroCtl accepts config file of up to a few thousand lines, and the translation from official compose to job spec is mechanical. Reserve a work shift to do this right — including ",[231,13124,13125],{},"depends_on",", health checks, and boot orders (ZooKeeper before Kafka, Postgres before sentry-web, and so on).",[70,13128,13129,13132,13133,13136],{},[27,13130,13131],{},"Persistent volumes for Postgres, ClickHouse, and Kafka."," All three have data you can't lose. ClickHouse is what grows the most — raw events become analytic rows and the disk fills. Reserve ",[27,13134,13135],{},"50 GB initial SSD",", adjust to 200 GB after six months if volume justifies.",[70,13138,13139,13142,13143,13146],{},[27,13140,13141],{},"Backup, and backup of each one separately."," The most common error in self-hosted is doing backup only of Postgres, forgetting that ",[27,13144,13145],{},"events live in ClickHouse",". Postgres has metadata (project, user, alert configuration); ClickHouse has the error history. Backup of just Postgres recovers the empty interface.",[70,13148,13149,13152,13153,13156],{},[27,13150,13151],{},"Automatic TLS on internal domain."," The panel needs to be accessible by devs with valid certificate — no ",[231,13154,13155],{},"-k"," on curl, no yellow warning in Chrome. 
HeroCtl cluster solves this automatically with Let's Encrypt; on other stacks you add an operator or configure manually.",[70,13158,13159,13162],{},[27,13160,13161],{},"Quarterly updates."," Sentry releases major version every three months, and the self-hosted requires Postgres schema migration + partial ClickHouse reindex. Reserve a maintenance window each release — generally between 15 minutes and two hours, depending on accumulated volume.",[12,13164,13165,13168],{},[27,13166,13167],{},"Total install time:"," 4 to 8 hours for competent dev who knows the product, 2 to 3 days for someone learning from scratch. That's the clean setup. Add 50% for debugging the real case (wrong internal DNS, unmounted volume, Kafka stuck waiting for ZooKeeper to come up).",[19,13170,13172],{"id":13171},"glitchtip-the-sentry-lite-many-forget","GlitchTip — the \"Sentry-lite\" many forget",[12,13174,13175],{},"GlitchTip is the alternative that appears little in comparisons and deserves highlight. It's open-source MIT (not BSL) — you use it for any purpose, including commercial, no clause. It was specifically designed to cover Sentry's 80\u002F20 case.",[12,13177,13178,13181,13182,13185],{},[27,13179,13180],{},"How \"Sentry SDK compatible\" works:"," the Sentry SDK sends events to a standard HTTP endpoint. GlitchTip implements that same endpoint. You change the URL on ",[231,13183,13184],{},"Sentry.init({ dsn: ... })"," of your application to point to GlitchTip and nothing else changes — not code, not dependency, not build. Reverse migration is also direct.",[12,13187,13188],{},[27,13189,13190],{},"GlitchTip stack:",[2734,13192,13193,13196,13199,13202],{},[70,13194,13195],{},"PostgreSQL",[70,13197,13198],{},"Redis (queues)",[70,13200,13201],{},"Django web",[70,13203,13204],{},"Celery worker",[12,13206,13207,13208,13211],{},"It's 4 containers vs 12 of Sentry. Resources: ",[27,13209,13210],{},"2 GB of RAM, 1 vCPU",". Runs comfortably on a US$12\u002Fmonth droplet. 
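The switch itself is one line. Most official Sentry SDKs also read the DSN from the environment, so often nothing in the code changes at all — the hostname below is hypothetical:

```shell
# Before: DSN pointing at sentry.io. After: same shape, GlitchTip host.
# The SDK doesn't know or care which backend answers the ingestion call.
export SENTRY_DSN="https://publickey@glitchtip.internal.example/1"

# restart the app; new events now land in GlitchTip
```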
Accepts the same SDKs (JavaScript, Python, Go, Ruby, PHP, Java, .NET, mobile) without change.",[12,13213,13214],{},[27,13215,13216],{},"What GlitchTip doesn't have:",[2734,13218,13219,13222,13225,13228,13231,13234],{},[70,13220,13221],{},"Session Replay",[70,13223,13224],{},"Continuous Profiling",[70,13226,13227],{},"Performance Monitoring at the level of Sentry detail (has simplified version)",[70,13229,13230],{},"Distributed tracing with full waterfall",[70,13232,13233],{},"Code Coverage",[70,13235,13236],{},"Sophisticated cron monitoring",[12,13238,13239],{},[27,13240,13241],{},"What GlitchTip has:",[2734,13243,13244,13247,13250,13253,13256,13259],{},[70,13245,13246],{},"Complete error tracking, with grouping, frequency, first\u002Flast seen",[70,13248,13249],{},"Stack traces of any language supported by Sentry SDKs",[70,13251,13252],{},"Basic uptime monitoring",[70,13254,13255],{},"Alerting via webhook, email, and popular integrations",[70,13257,13258],{},"Minimal issue tracking — assign, resolve, ignore",[70,13260,13261],{},"Multi-project, multi-team, simple RBAC",[12,13263,13264],{},"For small or medium startup that doesn't use Replay or advanced Profiling, GlitchTip covers what matters with a fraction of the operation. It's the most underestimated case on the shelf.",[19,13266,13268],{"id":13267},"the-self-hosted-math-being-honest","The self-hosted math, being honest",[12,13270,13271],{},"If your premise for self-hosting is \"it'll be free\", you'll be disappointed. 
Real costs of self-hosting Sentry on a cheap provider:",[119,13273,13274,13282],{},[122,13275,13276],{},[125,13277,13278,13280],{},[128,13279,11387],{},[128,13281,11390],{},[141,13283,13284,13292,13299,13307,13315],{},[125,13285,13286,13289],{},[146,13287,13288],{},"Dedicated server 8 GB \u002F 4 vCPU (Hetzner CPX31, €13.49)",[146,13290,13291],{},"R$74",[125,13293,13294,13297],{},[146,13295,13296],{},"S3-compatible backup storage (50 GB)",[146,13298,11408],{},[125,13300,13301,13304],{},[146,13302,13303],{},"Maintenance time (2–4h\u002Fmonth × R$100\u002Fh)",[146,13305,13306],{},"R$200–400",[125,13308,13309,13312],{},[146,13310,13311],{},"Amortized quarterly updates (4h × R$100\u002Fh ÷ 3 months)",[146,13313,13314],{},"R$130",[125,13316,13317,13322],{},[146,13318,13319],{},[27,13320,13321],{},"Honest total",[146,13323,13324],{},[27,13325,13326],{},"R$430–630\u002Fmonth",[12,13328,13329],{},"Compare:",[2734,13331,13332,13341,13351],{},[70,13333,13334,2577,13337,13340],{},[27,13335,13336],{},"Versus SaaS Team (R$130\u002Fmonth):",[27,13338,13339],{},"loss of R$300–500\u002Fmonth."," Self-hosting at this stage is a hobby, not savings.",[70,13342,13343,13346,13347,13350],{},[27,13344,13345],{},"Versus SaaS Business at medium use (R$850\u002Fmonth):"," savings of ",[27,13348,13349],{},"R$220–420\u002Fmonth."," Starts to make sense.",[70,13352,13353,13346,13356,13359],{},[27,13354,13355],{},"Versus SaaS scale-up (R$2,500\u002Fmonth):",[27,13357,13358],{},"R$1,870–2,070\u002Fmonth."," At that scale it's worth the effort.",[12,13361,13362,13363,13366],{},"For GlitchTip, the math is different — the server can be half the size (2 GB \u002F 1 vCPU, R$30\u002Fmonth) and maintenance drops to about 1h\u002Fmonth. Honest total: ",[27,13364,13365],{},"R$150–200\u002Fmonth",". 
That's where it breaks even with SaaS Team.",[19,13368,3837],{"id":3836},[119,13370,13371,13386],{},[122,13372,13373],{},[125,13374,13375,13377,13380,13383],{},[128,13376,2982],{},[128,13378,13379],{},"Sentry SaaS",[128,13381,13382],{},"Sentry self-hosted",[128,13384,13385],{},"GlitchTip self-hosted",[141,13387,13388,13402,13416,13430,13442,13455,13467,13478,13488,13502,13515,13529],{},[125,13389,13390,13393,13396,13399],{},[146,13391,13392],{},"Minimum monthly cost (BRL)",[146,13394,13395],{},"R$130 (Team)",[146,13397,13398],{},"R$430 honest",[146,13400,13401],{},"R$150 honest",[125,13403,13404,13407,13410,13413],{},[146,13405,13406],{},"Cost at 100k errors\u002Fmonth",[146,13408,13409],{},"R$130–250",[146,13411,13412],{},"R$430–500",[146,13414,13415],{},"R$150–200",[125,13417,13418,13421,13424,13427],{},[146,13419,13420],{},"Cost at 1M errors\u002Fmonth",[146,13422,13423],{},"R$1,500–2,500",[146,13425,13426],{},"R$500–700",[146,13428,13429],{},"R$250–350",[125,13431,13432,13434,13436,13439],{},[146,13433,3007],{},[146,13435,9769],{},[146,13437,13438],{},"4–8 hours (first time: 2–3 days)",[146,13440,13441],{},"1–2 hours",[125,13443,13444,13447,13449,13452],{},[146,13445,13446],{},"Minimum cluster resources",[146,13448,12518],{},[146,13450,13451],{},"8 GB RAM \u002F 4 vCPU \u002F 50 GB SSD",[146,13453,13454],{},"2 GB RAM \u002F 1 vCPU \u002F 20 GB SSD",[125,13456,13457,13460,13463,13465],{},[146,13458,13459],{},"Performance Monitoring",[146,13461,13462],{},"Complete",[146,13464,13462],{},[146,13466,12373],{},[125,13468,13469,13471,13473,13476],{},[146,13470,13221],{},[146,13472,3064],{},[146,13474,13475],{},"No (SaaS only)",[146,13477,3058],{},[125,13479,13480,13482,13484,13486],{},[146,13481,13224],{},[146,13483,3064],{},[146,13485,3058],{},[146,13487,3058],{},[125,13489,13490,13493,13496,13499],{},[146,13491,13492],{},"Alerting integrations",[146,13494,13495],{},"Slack, PagerDuty, Teams, Linear, Jira, email, webhook",[146,13497,13498],{},"Same 
set",[146,13500,13501],{},"Slack, email, webhook",[125,13503,13504,13507,13510,13513],{},[146,13505,13506],{},"Compliance \u002F data residency",[146,13508,13509],{},"US\u002FEU datacenters (international transfer)",[146,13511,13512],{},"Your server",[146,13514,13512],{},[125,13516,13517,13520,13523,13526],{},[146,13518,13519],{},"Community \u002F SDKs",[146,13521,13522],{},"The whole industry",[146,13524,13525],{},"Same Sentry SDKs",[146,13527,13528],{},"Compatible Sentry SDKs",[125,13530,13531,13534,13537,13540],{},[146,13532,13533],{},"Ideal range",[146,13535,13536],{},"\u003C500k events\u002Fmonth or small team",[146,13538,13539],{},">1M events\u002Fmonth with 1 dev to run it",[146,13541,13542],{},"Small\u002Fmedium startup without Replay\u002FProfiling",[19,13544,13546],{"id":13545},"when-to-stay-on-saas","When to stay on SaaS",[12,13548,13549],{},"Four profiles where paying Sentry every month is the right decision:",[12,13551,13552,13555],{},[27,13553,13554],{},"Volume below 100 thousand errors\u002Fmonth."," The Team plan at US$26\u002Fmonth covers it, no add-ons. Self-hosting at that size is a hobby project — you spend more time configuring than you save.",[12,13557,13558,13561],{},[27,13559,13560],{},"Team of 1 to 3 devs without operational capacity."," Each hour spent operating Sentry is an hour not spent on product. If you haven't yet hired the first platform engineer, pay for the SaaS and move on. The line \"first hire an SRE to run Sentry\" isn't strategy, it's distraction.",[12,13563,13564,13567],{},[27,13565,13566],{},"You use Session Replay and Profiling."," They're SaaS-only features — the self-hosted version still doesn't offer them. If your debug workflow depends on those two, the discussion ends.",[12,13569,13570,13573],{},[27,13571,13572],{},"Compliance requires only LGPD, no local data residency."," Sentry has a European Union datacenter option, in compliance with GDPR and by extension LGPD. 
If your legal team accepts international transfer of anonymized data, compliance doesn't force self-hosting.",[19,13575,13577],{"id":13576},"when-self-hosting-sentry-makes-sense","When self-hosting Sentry makes sense",[12,13579,13580],{},"Three conditions — you need at least two to justify the effort:",[12,13582,13583,13586],{},[27,13584,13585],{},"Volume has passed 1 million errors\u002Fmonth."," At that point the SaaS bill starts competing with a salary, and the savings pay for the dev's time taking care of the infrastructure.",[12,13588,13589,13592],{},[27,13590,13591],{},"Compliance demands data on a controlled server."," Regulated sectors (health, financial, government, child education) frequently carry a data residency clause that makes SaaS unviable. Self-hosting is the path.",[12,13594,13595,13598],{},[27,13596,13597],{},"Team has 1+ dev with recurring time to run it."," \"Recurring time\" = at least 4 hours per month explicitly allocated, with a named owner and a backup. If it's \"ah, anyone takes care of it\", in three months no one takes care of it and the system becomes a blind spot.",[12,13600,13601],{},"Bonus: you don't need Session Replay or Profiling. Those two stay on SaaS, so self-hosting means giving them up. For many B2B teams with server-side applications, that trade-off is trivial. 
For B2C teams with a complex mobile\u002FSPA app, it can be a deal-breaker.",[19,13603,13605],{"id":13604},"when-glitchtip-is-the-best-choice-of-the-three","When GlitchTip is the best choice of the three",[12,13607,13608],{},"GlitchTip's ideal profile is specific:",[2734,13610,13611,13614,13617,13620,13623,13626],{},[70,13612,13613],{},"Startup with R$10–50k MRR, team of 2 to 5 engineers.",[70,13615,13616],{},"Relatively simple applications — B2B SaaS, web app, mobile backend, API.",[70,13618,13619],{},"Doesn't use Session Replay (or never did, and doesn't miss it).",[70,13621,13622],{},"Doesn't use Continuous Profiling.",[70,13624,13625],{},"Wants self-hosted for control and the MIT license, but doesn't want to operate 12 containers.",[70,13627,13628],{},"Already uses the Sentry SDK and doesn't want to rewrite instrumentation.",[12,13630,13631],{},"If most of those points match your team, GlitchTip probably saves R$500\u002Fmonth versus SaaS Business without the operational overhead of self-hosted Sentry. It's the least-discussed and most frequently useful case.",[19,13633,13635],{"id":13634},"heroctl-as-operational-layer","HeroCtl as operational layer",[12,13637,13638],{},"If you decided to self-host (Sentry or GlitchTip), the cluster where it runs matters almost as much as the product choice. Some observations on running error tracking on HeroCtl:",[2734,13640,13641,13647,13653,13659,13668],{},[70,13642,13643,13646],{},[27,13644,13645],{},"Job spec with persistent volumes"," is native to the product — Postgres, ClickHouse, and Kafka have somewhere to write data without losing anything in a rolling deploy.",[70,13648,13649,13652],{},[27,13650,13651],{},"Managed backup"," is available on the Business plan, covering databases running as jobs in the cluster. 
Postgres + ClickHouse go into the same backup schedule together.",[70,13654,13655,13658],{},[27,13656,13657],{},"Integrated metrics"," of HeroCtl itself show CPU\u002FRAM\u002FIO of each Sentry service — you don't need to set up Prometheus externally just to know if ClickHouse is healthy.",[70,13660,13661,13663,13664,13667],{},[27,13662,11364],{}," with automatic TLS handles ",[231,13665,13666],{},"sentry.yourcompany.com.br"," without any additional certificate operator.",[70,13669,13670,13673],{},[27,13671,13672],{},"Compact control plane"," — 200 to 400 MB per server, leaving plenty of resources for the real workload (Sentry, in this case).",[12,13675,13676],{},"For a 4-server cloud cluster, this means: 8 GB of RAM available on the server hosting Sentry, metrics and backup coming free from the orchestrator, and zero external operators to configure. The ROI changes — because part of the \"honest\" cost of self-hosting is exactly the infra the cluster already offers.",[19,13678,3225],{"id":3224},[12,13680,13681,13684,13685,13688],{},[27,13682,13683],{},"How much RAM does Sentry self-hosted consume?","\nIn a small production setup, ",[27,13686,13687],{},"8 GB of RAM is the firm floor"," — below that, Kafka starts OOM-killing processes under load. Recommended: 12 GB to have some slack. ClickHouse and Kafka are the two largest consumers; together they account for half the total memory.",[12,13690,13691,13694,13695,13698],{},[27,13692,13693],{},"Is GlitchTip compatible with the existing Sentry SDK?","\nYes. GlitchTip implements the same HTTP endpoint on which Sentry receives events. Changing the DSN in ",[231,13696,13697],{},"Sentry.init({ dsn: 'https:\u002F\u002F...' })"," to point to your GlitchTip is enough — the SDK doesn't notice the difference. Reverse migration is also trivial. 
The SDKs covered include JavaScript, Python, Go, Ruby, PHP, Java, .NET, iOS, Android, and React Native.",[12,13700,13701,13704],{},[27,13702,13703],{},"Can I migrate from SaaS to self-hosted without losing error history?","\nTechnically yes, but with caveats. Sentry SaaS offers event export via API, and self-hosted accepts ingestion. In practice, most teams simply don't migrate history — they start from scratch on the new system, and keep SaaS read-only for about 90 days to consult old incidents when necessary. Old error history usually has decreasing value; what matters is what happened in the last 4 weeks.",[12,13706,13707,13710],{},[27,13708,13709],{},"Does Sentry self-hosted have official support?","\nNo, at no level. The self-hosted version is \"best effort\" from the Sentry company — they publish releases and documentation, but technical support exists only on the paid SaaS plans. The community on GitHub and in the official forum is active, and the most common problems have already been solved there. For exotic problems, you're on your own — or you hire specialized consulting.",[12,13712,13713,13716,13719],{},[27,13714,13715],{},"The BSL license — can I use it for a commercial SaaS?",[27,13717,13718],{},"No."," Sentry's BSL 1.1 explicitly prohibits offering Sentry as a commercial service to third parties. You can use it internally without limit, in any company of any size. But if your idea was to include \"dedicated error tracking\" in your customer's package and charge for it, the license blocks that. For that case, GlitchTip (MIT) or other open-source MIT alternatives are the way to go.",[12,13721,13722,13725,13726,13729,13730,13733,13734,13729,13737,13740],{},[27,13723,13724],{},"How long does setup from scratch take?","\nSentry self-hosted: ",[27,13727,13728],{},"4 to 8 hours"," for an experienced dev, ",[27,13731,13732],{},"2 to 3 days"," for someone learning. GlitchTip: ",[27,13735,13736],{},"1 to 2 hours",[27,13738,13739],{},"half a day"," for a beginner. 
Those numbers cover a clean install, working ingestion, a basic alert configured, and TLS ready. They don't include SDK migration (which is trivial) or fine-tuning of alert rules (which takes weeks in any product).",[12,13742,13743,13746],{},[27,13744,13745],{},"Is ClickHouse really necessary, or will SQLite do?","\nFor Sentry self-hosted: ClickHouse is mandatory. The product uses the analytical database for aggregation queries that Postgres can't handle under real volume. For GlitchTip: the product uses only Postgres, and that's part of why it's dramatically lighter. SQLite won't do for either of the two in production.",[12,13748,13749,13752,13753,13756],{},[27,13750,13751],{},"Performance impact on the client app?","\nThe Sentry SDK (and GlitchTip's, which uses the same SDK) has overhead ",[27,13754,13755],{},"below 1ms in most cases"," — it captures the error locally and sends it in the background without blocking. In SPAs, the bundle size adds 30–60 KB gzipped depending on integrations. For server-side apps, the overhead is negligible. Performance Monitoring has a slightly higher cost (10% sampling is usually the default), but under reasonable configuration stays below 5ms p99.",[19,13758,3309],{"id":3308},[12,13760,13761],{},"The question \"Sentry SaaS vs self-hosted\" has a different answer at each company stage. For a small startup, SaaS, always. For a scale-up with high volume and a competent team, self-hosting saves real money. For the middle of the road, GlitchTip is usually the most lucid choice — significant savings, manageable operational complexity, permissive license.",[12,13763,13764],{},"The math is measurable, not philosophical. 
Before deciding, check three numbers:",[67,13766,13767,13770,13773],{},[70,13768,13769],{},"How many errors\u002Fmonth your application generates today (any error tracking dashboard shows this).",[70,13771,13772],{},"What's the projected monthly Sentry bill for the next 12 months (multiply predicted traffic by the corresponding plan).",[70,13774,13775],{},"How many hours\u002Fmonth your team can allocate to operating infrastructure without hurting product work.",[12,13777,13778],{},"If number 1 is above 1 million, number 2 is above R$1,500, and number 3 is above 4 hours — self-hosting is a sane financial decision. Otherwise, stay on SaaS and use the saved time to ship features.",[12,13780,13781],{},"To run self-hosted (Sentry or GlitchTip) on a cluster that takes care of TLS, backup, and metrics without an external operator:",[224,13783,13784],{"className":226,"code":5318,"language":228,"meta":229,"style":229},[231,13785,13786],{"__ignoreMap":229},[234,13787,13788,13790,13792,13794,13796],{"class":236,"line":237},[234,13789,1220],{"class":247},[234,13791,2957],{"class":251},[234,13793,5329],{"class":255},[234,13795,2963],{"class":383},[234,13797,2966],{"class":247},[12,13799,13800],{},"Related posts:",[2734,13802,13803,13808],{},[70,13804,13805],{},[3336,13806,13807],{"href":11719},"Observability without Datadog: honest stack for Brazilian startup",[70,13809,13810],{},[3336,13811,13812],{"href":7461},"Postgres in production: managed vs self-hosted, the real 
math",[3350,13814,4376],{},{"title":229,"searchDepth":244,"depth":244,"links":13816},[13817,13818,13819,13820,13821,13822,13823,13824,13825,13826,13827,13828,13829,13830],{"id":12763,"depth":244,"text":12764},{"id":12810,"depth":244,"text":12811},{"id":12864,"depth":244,"text":12865},{"id":12946,"depth":244,"text":12947},{"id":13102,"depth":244,"text":13103},{"id":13171,"depth":244,"text":13172},{"id":13267,"depth":244,"text":13268},{"id":3836,"depth":244,"text":3837},{"id":13545,"depth":244,"text":13546},{"id":13576,"depth":244,"text":13577},{"id":13604,"depth":244,"text":13605},{"id":13634,"depth":244,"text":13635},{"id":3224,"depth":244,"text":3225},{"id":3308,"depth":244,"text":3309},"2026-05-05","Sentry SaaS starts at US$26\u002Fmonth, scaling fast with volume. Self-hosted is 'free' — but runs Postgres + Redis + Kafka + ClickHouse. Honest analysis of when self-hosting is worth it.",{},"\u002Fen\u002Fblog\u002Fsentry-self-hosted-vs-saas-cost-comparison",{"title":12752,"description":13832},{"loc":13834},"en\u002Fblog\u002Fsentry-self-hosted-vs-saas-cost-comparison",[13839,13840,7507,6394,3378],"sentry","error-tracking","ZWgabXiuZqvhbDqFhZo9ZuVp8oSUa1falcpMt-hN92I",{"id":13843,"title":13844,"author":7,"body":13845,"category":8756,"cover":3379,"date":14927,"description":14928,"draft":3382,"extension":3383,"lastReviewed":3379,"meta":14929,"navigation":411,"path":14930,"readingTime":8761,"seo":14931,"sitemap":14932,"stem":14933,"tags":14934,"__hash__":14940},"blog_en\u002Fen\u002Fblog\u002Fhetzner-vs-digitalocean-vs-magalu-cloud.md","Hetzner vs DigitalOcean vs Magalu Cloud: which VPS to pick for a Brazilian startup in 
2026",{"type":9,"value":13846,"toc":14882},[13847,13850,13853,13857,13860,13864,13867,13871,13874,13920,13927,13931,13934,13937,13969,13972,13976,13979,13999,14003,14035,14039,14042,14046,14049,14096,14103,14107,14110,14136,14139,14143,14181,14185,14210,14214,14217,14221,14224,14250,14253,14257,14271,14275,14301,14305,14337,14341,14506,14510,14513,14516,14519,14523,14526,14564,14567,14571,14574,14627,14630,14633,14637,14640,14643,14663,14666,14670,14674,14695,14699,14716,14720,14737,14741,14744,14747,14750,14753,14757,14760,14763,14766,14768,14771,14774,14778,14781,14785,14788,14792,14795,14799,14802,14806,14809,14813,14816,14820,14823,14826,14829,14831,14834,14837,14840,14856,14859,14862,14877,14880],[12,13848,13849],{},"Choosing a VPS for a Brazilian startup in 2026 isn't a single question — it's four questions that intersect. How much it costs in dollars or euros against revenue that comes in reais. How much latency your product can tolerate without offending users. How much managed service you want to pay for to outsource. And how much maturity the provider has to back you up when something breaks at 3 AM.",[12,13851,13852],{},"The three names that show up in nearly every conversation today are Hetzner, DigitalOcean, and Magalu Cloud. Each one solves one side of the equation very well and loses on the other two. There's no universal winner. There's a winner per usage profile. This post opens up the math, with numbers, and closes with an honest recommendation per scenario.",[19,13854,13856],{"id":13855},"tldr-which-vps-to-pick-for-a-brazilian-startup-in-2026","TL;DR — which VPS to pick for a Brazilian startup in 2026",[12,13858,13859],{},"Hetzner is the right choice when absolute cost matters most and the audience tolerates 200ms of latency — hobby projects, MVPs, internal tooling, async integrations, CI runners, batch jobs, and self-hosted clusters serving an audience outside Brazil. 
The CX11 goes for €4.09 per month, around R$22, and includes 20 TB of outbound traffic. DigitalOcean is the sensible pick for Brazilian indie hackers with a B2C audience that needs sub-100ms response — datacenters in New York and Toronto deliver 120-140ms to São Paulo, the interface is the best in the segment, and the Portuguese-speaking community is extensive, but the price is roughly 2-3× Hetzner's. Magalu Cloud is the right answer when data residency in Brazil is a regulatory requirement — health, financial, government, sectoral LGPD contracts — with 5-15ms latency to São Paulo, billing in reais, and Portuguese support, at the cost of an ecosystem that's still maturing. For a self-hosted cluster of 3-4 nodes running dozens of applications, any of the three works well.",[19,13861,13863],{"id":13862},"hetzner-the-attack-on-your-wallet","Hetzner — the attack on your wallet",[12,13865,13866],{},"Hetzner is a German company with over 25 years of operation. Its model is simple: no-frills VPS, dedicated servers via auction, decent European network, and pricing that beats any competitor on absolute value. 
In 2026 the price table is still essentially the same as always — it goes up little over time, unlike nearly every American cloud.",[368,13868,13870],{"id":13869},"how-much-does-a-hetzner-vps-cost-in-2026","How much does a Hetzner VPS cost in 2026?",[12,13872,13873],{},"The Cloud Servers line has two flavors, x86 and ARM, and the most relevant plans are:",[2734,13875,13876,13886,13895,13903,13911],{},[70,13877,13878,13881,13882,13885],{},[27,13879,13880],{},"CX11"," (1 vCPU x86, 2 GB RAM, 20 GB SSD): €4.09\u002Fmonth — about ",[27,13883,13884],{},"R$22"," at R$5.5\u002Feuro.",[70,13887,13888,13891,13892,101],{},[27,13889,13890],{},"CPX11"," (2 vCPU AMD, 2 GB RAM, 40 GB SSD): €4.75\u002Fmonth — ",[27,13893,13894],{},"R$26",[70,13896,13897,13899,13900,101],{},[27,13898,6707],{}," (3 vCPU AMD, 4 GB RAM, 80 GB SSD): €7.99\u002Fmonth — ",[27,13901,13902],{},"R$44",[70,13904,13905,13907,13908,101],{},[27,13906,6717],{}," (4 vCPU AMD, 8 GB RAM, 160 GB SSD): €14.86\u002Fmonth — ",[27,13909,13910],{},"R$82",[70,13912,13913,13916,13917,101],{},[27,13914,13915],{},"CAX11"," (2 vCPU ARM, 4 GB RAM, 40 GB SSD): €3.79\u002Fmonth — ",[27,13918,13919],{},"R$21",[12,13921,13922,13923,13926],{},"Each VPS includes ",[27,13924,13925],{},"20 TB of outbound traffic per month",". That's an amount that borders on absurd compared to American clouds — the same traffic on AWS São Paulo costs $0.09 per gigabyte after the first 100 GB, meaning 20 TB runs to about $1,800 per month in bandwidth alone.",[368,13928,13930],{"id":13929},"does-hetzner-have-a-datacenter-in-brazil","Does Hetzner have a datacenter in Brazil?",[12,13932,13933],{},"No. Hetzner operates datacenters in Falkenstein, Nuremberg and Helsinki in Europe, Hillsboro (Oregon) and Ashburn (Virginia) in the United States, and Singapore in Asia. 
There is no South American presence.",[12,13935,13936],{},"Typical latency from a server to a user in São Paulo:",[2734,13938,13939,13945,13951,13957,13963],{},[70,13940,13941,13942],{},"Falkenstein, Germany: ",[27,13943,13944],{},"200-220ms",[70,13946,13947,13948],{},"Helsinki, Finland: ",[27,13949,13950],{},"210-240ms",[70,13952,13953,13954],{},"Ashburn, Virginia: ",[27,13955,13956],{},"140-160ms",[70,13958,13959,13960],{},"Hillsboro, Oregon: ",[27,13961,13962],{},"170-190ms",[70,13964,13965,13966],{},"Singapore: ",[27,13967,13968],{},"300-340ms",[12,13970,13971],{},"For context, 200ms is the point where the user starts noticing slowness on a button click. For a heavy SPA already loading browser assets and making three parallel calls, 200ms cumulative on each round-trip turns into five to eight perceptible seconds. A B2C app with a Brazilian audience hosted in Falkenstein is a product that feels \"slow,\" even when the server responds in 5ms.",[368,13973,13975],{"id":13974},"where-hetzner-shines","Where Hetzner shines",[12,13977,13978],{},"The combination of pricing + included bandwidth makes Hetzner the obvious choice for three types of workload:",[67,13980,13981,13987,13993],{},[70,13982,13983,13986],{},[27,13984,13985],{},"Non-latency-critical workloads",": CI runners, batch jobs, async workers, nightly ETL, cron tasks. It doesn't matter if they respond 200ms slower.",[70,13988,13989,13992],{},[27,13990,13991],{},"Applications with audiences outside Brazil",": B2B SaaS with clients in Europe or the US. 
Falkenstein to Berlin is 15ms, Ashburn to New York is 5ms.",[70,13994,13995,13998],{},[27,13996,13997],{},"Self-hosted orchestrator clusters",": running HeroCtl, Coolify or similar on 3-4 Hetzner nodes to host dozens of internal apps costs what a single equivalent droplet would on DigitalOcean.",[368,14000,14002],{"id":14001},"where-hetzner-loses","Where Hetzner loses",[2734,14004,14005,14011,14017,14023,14029],{},[70,14006,14007,14010],{},[27,14008,14009],{},"No Brazilian datacenter"," — end of story for any compliance that requires data residency.",[70,14012,14013,14016],{},[27,14014,14015],{},"Support only in English and German"," — chat and tickets respond fast, but on European business hours.",[70,14018,14019,14022],{},[27,14020,14021],{},"Billing in euros"," — you pay with an international credit card, with 4.38% IOF and bank exchange spread.",[70,14024,14025,14028],{},[27,14026,14027],{},"Limited service marketplace"," — there's no Managed Postgres at DigitalOcean's level nor App Platform serverless. What's there is S3-compatible Object Storage, Load Balancers and Volumes. The rest, you assemble.",[70,14030,14031,14034],{},[27,14032,14033],{},"Account verification can take time"," — first-time users report up to 48 hours to release a new account. In some cases Hetzner asks for a selfie with ID.",[19,14036,14038],{"id":14037},"digitalocean-the-all-around","DigitalOcean — the all-around",[12,14040,14041],{},"DigitalOcean is the option that shows up in front of nearly every Brazilian indie hacker in 2026, and for good reason. The product is 14 years old, the interface is the best in the segment, the documentation is a reference, the Brazilian community is huge. 
The price is higher than Hetzner in absolute value, but it comes wrapped in UX and managed services that save hours of work.",[368,14043,14045],{"id":14044},"how-much-does-a-digitalocean-droplet-cost-in-2026","How much does a DigitalOcean Droplet cost in 2026?",[12,14047,14048],{},"The Basic line has three flavors: Regular Intel, Premium Intel and Premium AMD. At R$5\u002Fdollar:",[2734,14050,14051,14060,14069,14078,14087],{},[70,14052,14053,14056,14057,101],{},[27,14054,14055],{},"Basic 1 GB \u002F 1 vCPU \u002F 25 GB SSD",": $4-6\u002Fmonth — ",[27,14058,14059],{},"R$20-30",[70,14061,14062,14065,14066,101],{},[27,14063,14064],{},"Basic 2 GB \u002F 1 vCPU \u002F 50 GB SSD",": $12\u002Fmonth — ",[27,14067,14068],{},"R$60",[70,14070,14071,14074,14075,101],{},[27,14072,14073],{},"Basic 4 GB \u002F 2 vCPU \u002F 80 GB SSD",": $24\u002Fmonth — ",[27,14076,14077],{},"R$120",[70,14079,14080,14083,14084,101],{},[27,14081,14082],{},"Basic 8 GB \u002F 4 vCPU \u002F 160 GB SSD",": $48\u002Fmonth — ",[27,14085,14086],{},"R$240",[70,14088,14089,14092,14093,101],{},[27,14090,14091],{},"Basic 16 GB \u002F 8 vCPU \u002F 320 GB SSD",": $96\u002Fmonth — ",[27,14094,14095],{},"R$480",[12,14097,14098,14099,14102],{},"All include ",[27,14100,14101],{},"1 TB of outbound traffic per month",", with overage at $0.01\u002FGB — much cheaper than AWS, but far from Hetzner's 20 TB.",[368,14104,14106],{"id":14105},"whats-the-digitalocean-latency-to-sao-paulo","What's the DigitalOcean latency to São Paulo?",[12,14108,14109],{},"DigitalOcean operates 14 regions. 
There's no direct South American presence, but the closest regions have acceptable latency:",[2734,14111,14112,14118,14124,14130],{},[70,14113,14114,14117],{},[27,14115,14116],{},"NYC (New York)",": ~120ms",[70,14119,14120,14123],{},[27,14121,14122],{},"TOR (Toronto)",": ~140ms",[70,14125,14126,14129],{},[27,14127,14128],{},"SFO (San Francisco)",": ~180ms",[70,14131,14132,14135],{},[27,14133,14134],{},"AMS (Amsterdam)",": ~210ms",[12,14137,14138],{},"NYC is the obvious choice for any audience with Brazilians as the main user base. 120ms is the range where a button responding \"right after the click\" still feels immediate to untrained eyes.",[368,14140,14142],{"id":14141},"where-digitalocean-shines","Where DigitalOcean shines",[2734,14144,14145,14151,14157,14163,14169,14175],{},[70,14146,14147,14150],{},[27,14148,14149],{},"Web interface",": the best in the segment. Spins up a Droplet in 60 seconds, sets up a firewall in three clicks, registers a domain, configures DNS, installs a managed database.",[70,14152,14153,14156],{},[27,14154,14155],{},"One-click marketplace",": WordPress, Ghost, GitLab, Mattermost, MongoDB, dozens of ready stacks.",[70,14158,14159,14162],{},[27,14160,14161],{},"Managed Postgres \u002F MySQL \u002F Redis",": managed databases with automatic backup, failover, read replicas. Starting at $15\u002Fmonth — pricey, but it saves an SRE.",[70,14164,14165,14168],{},[27,14166,14167],{},"App Platform",": serverless with Git deploys, autoscaling, automatic Let's Encrypt certificates. 
Starting at $5\u002Fmonth.",[70,14170,14171,14174],{},[27,14172,14173],{},"Active Brazilian community",": Portuguese tutorials, forums, unofficial Discord, ex-employees speaking at conferences.",[70,14176,14177,14180],{},[27,14178,14179],{},"Billing in dollars with an international card",", but with a long-established checkout that accepts BR cards without friction.",[368,14182,14184],{"id":14183},"where-digitalocean-loses","Where DigitalOcean loses",[2734,14186,14187,14193,14198,14204],{},[70,14188,14189,14192],{},[27,14190,14191],{},"Pricing",": 2-3× more expensive than Hetzner per equivalent vCPU.",[70,14194,14195,14197],{},[27,14196,14009],{},": the closest is NYC.",[70,14199,14200,14203],{},[27,14201,14202],{},"Some products pricier than AWS",": Spaces (S3-compatible Object Storage) costs $5\u002Fmonth with 250 GB; the equivalent on S3 costs $1.50\u002Fmonth at the same durability class.",[70,14205,14206,14209],{},[27,14207,14208],{},"Simple networking",": cross-region VPC exists but isn't as polished as on AWS.",[19,14211,14213],{"id":14212},"magalu-cloud-the-national-one","Magalu Cloud — the national one",[12,14215,14216],{},"Magalu Cloud is the infrastructure arm of the Magazine Luiza group. Launched in 2023, it's still maturing, but in 2026 it already has a consistent offering of VPS, Object Storage, managed Kubernetes and Portuguese support. 
It's the viable Brazilian bet for those who need data residency.",[368,14218,14220],{"id":14219},"how-much-does-magalu-cloud-cost-in-2026","How much does Magalu Cloud cost in 2026?",[12,14222,14223],{},"The plans that matter for a startup, with 2026 estimates:",[2734,14225,14226,14232,14238,14244],{},[70,14227,14228,14231],{},[27,14229,14230],{},"vCPU 1 \u002F 1 GB RAM \u002F 25 GB SSD",": ~R$30\u002Fmonth.",[70,14233,14234,14237],{},[27,14235,14236],{},"vCPU 2 \u002F 4 GB RAM \u002F 80 GB SSD",": ~R$80\u002Fmonth.",[70,14239,14240,14243],{},[27,14241,14242],{},"vCPU 4 \u002F 8 GB RAM \u002F 160 GB SSD",": ~R$160\u002Fmonth.",[70,14245,14246,14249],{},[27,14247,14248],{},"vCPU 8 \u002F 16 GB RAM \u002F 320 GB SSD",": ~R$320\u002Fmonth.",[12,14251,14252],{},"Pricing in reais, no currency conversion, no IOF, with NF-e issued.",[368,14254,14256],{"id":14255},"does-magalu-cloud-have-a-datacenter-in-brazil","Does Magalu Cloud have a datacenter in Brazil?",[12,14258,14259,14260,2402,14263,14266,14267,14270],{},"Yes — that's the whole point of the product. Operations in ",[27,14261,14262],{},"Tamboré, São Paulo",[27,14264,14265],{},"Curitiba",", Paraná. Latency from the city of São Paulo to Tamboré: ",[27,14268,14269],{},"5-15ms",". For a web request, effectively local.",[368,14272,14274],{"id":14273},"where-magalu-cloud-shines","Where Magalu Cloud shines",[2734,14276,14277,14283,14289,14295],{},[70,14278,14279,14282],{},[27,14280,14281],{},"Data residency",": data stays in Brazil. 
For regulated sectors (health under sectoral LGPD, financial under the Central Bank, government) this resolves the legal argument all at once.",[70,14284,14285,14288],{},[27,14286,14287],{},"Billing in reais",": invoices to a Brazilian company entity, NF-e issued, simple accounting reconciliation.",[70,14290,14291,14294],{},[27,14292,14293],{},"Portuguese support",": chat and ticket respond on Brazilian business hours, with people who understand the national market vocabulary.",[70,14296,14297,14300],{},[27,14298,14299],{},"Unbeatable latency for BR users",": 5-15ms to anywhere in the southeast.",[368,14302,14304],{"id":14303},"where-magalu-cloud-loses","Where Magalu Cloud loses",[2734,14306,14307,14313,14319,14325,14331],{},[70,14308,14309,14312],{},[27,14310,14311],{},"Maturing ecosystem",": Managed Postgres exists, but with fewer configuration options than DigitalOcean. S3-compatible Object Storage works, but third-party integrations aren't always ready.",[70,14314,14315,14318],{},[27,14316,14317],{},"Small community",": Portuguese tutorials starting to appear in 2025-2026, but far from DigitalOcean's base.",[70,14320,14321,14324],{},[27,14322,14323],{},"Smaller instance range",": specialized machine types (GPU, memory-optimized, compute-intensive) still arrive in waves.",[70,14326,14327,14330],{},[27,14328,14329],{},"SLA maturity",": the first public incidents were well communicated, but the track record is still short.",[70,14332,14333,14336],{},[27,14334,14335],{},"More expensive than Hetzner",": the 4 GB RAM VPS costs R$80 on Magalu vs R$44 on Hetzner — almost double.",[19,14338,14340],{"id":14339},"side-by-side-12-criteria-that-matter","Side by side: 12 criteria that matter",[119,14342,14343,14358],{},[122,14344,14345],{},[125,14346,14347,14349,14352,14355],{},[128,14348,2982],{},[128,14350,14351],{},"Hetzner",[128,14353,14354],{},"DigitalOcean",[128,14356,14357],{},"Magalu 
Cloud",[141,14359,14360,14371,14385,14399,14413,14424,14436,14447,14460,14472,14482,14493],{},[125,14361,14362,14365,14367,14369],{},[146,14363,14364],{},"VPS 4 GB RAM, 2 vCPU (BRL\u002Fmonth)",[146,14366,13902],{},[146,14368,14077],{},[146,14370,8011],{},[125,14372,14373,14376,14379,14382],{},[146,14374,14375],{},"1 TB outbound traffic included",[146,14377,14378],{},"Yes (20 TB)",[146,14380,14381],{},"Yes (1 TB)",[146,14383,14384],{},"Yes (variable)",[125,14386,14387,14390,14393,14396],{},[146,14388,14389],{},"Closest datacenter to SP",[146,14391,14392],{},"Ashburn\u002FVA",[146,14394,14395],{},"NYC\u002FTOR",[146,14397,14398],{},"Tamboré\u002FSP",[125,14400,14401,14404,14407,14410],{},[146,14402,14403],{},"Average SP latency (ms)",[146,14405,14406],{},"150-220",[146,14408,14409],{},"120-140",[146,14411,14412],{},"5-15",[125,14414,14415,14418,14420,14422],{},[146,14416,14417],{},"Datacenter in Brazil",[146,14419,3058],{},[146,14421,3058],{},[146,14423,3064],{},[125,14425,14426,14429,14431,14434],{},[146,14427,14428],{},"Managed Postgres",[146,14430,3058],{},[146,14432,14433],{},"Yes ($15+)",[146,14435,3064],{},[125,14437,14438,14441,14443,14445],{},[146,14439,14440],{},"S3-compatible Object Storage",[146,14442,3064],{},[146,14444,3064],{},[146,14446,3064],{},[125,14448,14449,14452,14455,14458],{},[146,14450,14451],{},"Managed Load Balancer",[146,14453,14454],{},"Yes (€4.90)",[146,14456,14457],{},"Yes ($12)",[146,14459,3064],{},[125,14461,14462,14465,14467,14469],{},[146,14463,14464],{},"BR\u002FPT-BR community",[146,14466,4919],{},[146,14468,4914],{},[146,14470,14471],{},"Growing",[125,14473,14474,14476,14478,14480],{},[146,14475,14293],{},[146,14477,3058],{},[146,14479,3061],{},[146,14481,3064],{},[125,14483,14484,14487,14489,14491],{},[146,14485,14486],{},"Billing in reais (NF-e)",[146,14488,3058],{},[146,14490,3058],{},[146,14492,3064],{},[125,14494,14495,14497,14500,14503],{},[146,14496,13533],{},[146,14498,14499],{},"Hobby, MVP, global 
B2B",[146,14501,14502],{},"Indie hacker, BR B2C",[146,14504,14505],{},"Regulated B2B, sensitive data",[19,14507,14509],{"id":14508},"which-vps-is-cheapest-for-a-brazilian-startup-in-2026","Which VPS is cheapest for a Brazilian startup in 2026?",[12,14511,14512],{},"In absolute pricing per vCPU and gigabyte of RAM, Hetzner wins by a wide margin. A 4 GB RAM server costs R$44 per month on Hetzner vs R$120 on DigitalOcean and R$80 on Magalu Cloud. The difference is 2-3× in DigitalOcean's case and almost 2× for Magalu.",[12,14514,14515],{},"But the total math changes when you include bandwidth. DigitalOcean gives 1 TB outbound included per droplet, Hetzner gives 20 TB. For an application serving heavy assets (video, images, downloads) that would exceed DigitalOcean's cap, overage charges widen the gap further in Hetzner's favor. For a traditional web app without heavy media delivery, both allowances are enough, and Hetzner still remains cheaper.",[12,14517,14518],{},"Magalu Cloud will never be the cheapest in direct comparison. But when you compare against \"DigitalOcean + 4.38% IOF + 2-3% exchange spread + 10-15% currency variation risk over the year\", the financial argument gets much closer. 
If your revenue is also in reais, Magalu removes the risk of becoming a company that pays R$120 today and R$160 next quarter because the real depreciated.",[19,14520,14522],{"id":14521},"which-has-better-latency-for-brazilian-users","Which has better latency for Brazilian users?",[12,14524,14525],{},"The ranking is straightforward:",[67,14527,14528,14534,14540,14546,14552,14558],{},[70,14529,14530,14533],{},[27,14531,14532],{},"Magalu Cloud (Tamboré)",": 5-15ms — in a category of its own.",[70,14535,14536,14539],{},[27,14537,14538],{},"DigitalOcean (NYC)",": 120ms — acceptable for B2C.",[70,14541,14542,14545],{},[27,14543,14544],{},"DigitalOcean (Toronto)",": 140ms — a good NYC alternative.",[70,14547,14548,14551],{},[27,14549,14550],{},"Hetzner (Ashburn, US East)",": 150ms — borderline for B2C.",[70,14553,14554,14557],{},[27,14555,14556],{},"Hetzner (Falkenstein, Germany)",": 200-220ms — only for async B2B or hobby.",[70,14559,14560,14563],{},[27,14561,14562],{},"Hetzner (Singapore)",": 300ms+ — not an option for Brazil.",[12,14565,14566],{},"The practical rule the team uses internally: if the product has a button the user clicks expecting an immediate response, stay on Magalu Cloud Tamboré or DigitalOcean NYC. If the product is async — webhook, background processing, dashboard that reloads every 30 seconds — Hetzner Ashburn or Falkenstein handle it comfortably.",[19,14568,14570],{"id":14569},"which-to-pick-for-running-heroctl-coolify-or-dokploy-on-a-4-vps-cluster","Which to pick for running HeroCtl, Coolify or Dokploy on a 4-VPS cluster?",[12,14572,14573],{},"A self-hosted orchestrator cluster is the case where all three providers shine, because most traffic is internal east-west (between nodes) and client latency only matters for north-south traffic (requests reaching the Traefik or Caddy ingress). 
Absolute monthly cost for three typical configurations:",[119,14575,14576,14588],{},[122,14577,14578],{},[125,14579,14580,14583,14585],{},[128,14581,14582],{},"Provider",[128,14584,4103],{},[128,14586,14587],{},"Monthly cost",[141,14589,14590,14603,14615],{},[125,14591,14592,14594,14597],{},[146,14593,14351],{},[146,14595,14596],{},"4× CPX21 (3 vCPU, 4 GB)",[146,14598,14599,14600],{},"€27.96 = ",[27,14601,14602],{},"R$155",[125,14604,14605,14607,14610],{},[146,14606,14354],{},[146,14608,14609],{},"4× Basic 4 GB (2 vCPU, 4 GB)",[146,14611,14612,14613],{},"$96 = ",[27,14614,14095],{},[125,14616,14617,14619,14622],{},[146,14618,14357],{},[146,14620,14621],{},"4× vCPU 2 \u002F 4 GB",[146,14623,14624],{},[27,14625,14626],{},"R$320",[12,14628,14629],{},"The difference between Hetzner and DigitalOcean is R$325\u002Fmonth — almost R$4,000\u002Fyear. For an indie hacker in pre-MRR or low-MRR phase, that's the difference between a SaaS that pays its own bills and a SaaS that needs funding.",[12,14631,14632],{},"To run HeroCtl specifically: the control plane occupies 200-400 MB of RAM per server, so any of the three providers has plenty of headroom for real workload. Three of the four nodes run the replicated control plane (server); the fourth runs only the agent. The public demo cluster uses this exact topology on four cloud servers, totaling 5 vCPU and 10 GB of RAM, serving five sites with automatic TLS and sixteen active containers.",[19,14634,14636],{"id":14635},"when-does-it-make-sense-to-stay-with-a-traditional-brazilian-provider-locaweb-kinghost-uol-host","When does it make sense to stay with a traditional Brazilian provider (Locaweb, KingHost, UOL Host)?",[12,14638,14639],{},"Almost never, in practice. Traditional Brazilian providers — Locaweb, KingHost, UOL Host, HostGator BR — operate on a model inherited from shared hosting reseller, with VPS offered as a secondary product. 
Pricing is higher than Magalu Cloud's, the infra is less modern, and the ecosystem is smaller.",[12,14641,14642],{},"The three scenarios where it makes sense:",[67,14644,14645,14651,14657],{},[70,14646,14647,14650],{},[27,14648,14649],{},"Compliance that names a specific vendor",": some government or large company contracts require a specific vendor registered in SICAF or internal lists. If your customer requires Locaweb, it's Locaweb.",[70,14652,14653,14656],{},[27,14654,14655],{},"NF-e as absolute priority",": of the three, only Magalu issues NF-e (Hetzner no, DigitalOcean no, Magalu yes), but some B2B contracts require a publicly recognized Brazilian vendor.",[70,14658,14659,14662],{},[27,14660,14661],{},"Customer requires 24\u002F7 support in Portuguese via phone",": few cloud providers offer that. Locaweb does.",[12,14664,14665],{},"For any case outside these three, the Hetzner \u002F DigitalOcean \u002F Magalu Cloud combination covers the ground better, with more modern infra and more predictable pricing.",[19,14667,14669],{"id":14668},"practical-scenarios-three-profiles-with-recommendations","Practical scenarios: three profiles with recommendations",[368,14671,14673],{"id":14672},"profile-1-hobby-project-1-vps-r0-revenue","Profile 1: hobby project \u002F 1 VPS \u002F R$0 revenue",[2734,14675,14676,14682,14689],{},[70,14677,14678,14681],{},[27,14679,14680],{},"Recommendation",": Hetzner CX11 or CAX11.",[70,14683,14684,14686,14687,101],{},[27,14685,136],{},": €4.09\u002Fmonth = ",[27,14688,13884],{},[70,14690,14691,14694],{},[27,14692,14693],{},"Why",": a hobby project's audience generally isn't an end customer demanding an SLA. 200ms latency is tolerable. R$22\u002Fmonth is the limit a project without revenue can sustain without becoming a burden. 
You install a lightweight orchestrator on top — HeroCtl Community Edition or Coolify — and run dozens of apps on the same server.",[368,14696,14698],{"id":14697},"profile-2-indie-hacker-4-vps-cluster-r10k-50k-mrr","Profile 2: indie hacker \u002F 4-VPS cluster \u002F R$10k-50k MRR",[2734,14700,14701,14706,14711],{},[70,14702,14703,14705],{},[27,14704,14680],{},": Hetzner CPX21 (4×) if audience is global or B2B; DigitalOcean Basic 4 GB (4×) if audience is Brazilian B2C.",[70,14707,14708,14710],{},[27,14709,136],{},": R$155\u002Fmonth (Hetzner) or R$480\u002Fmonth (DigitalOcean).",[70,14712,14713,14715],{},[27,14714,14693],{},": at this MRR range, a R$325 difference still matters, but perceived user latency starts impacting conversion. For B2C with BR audience, the rule is DigitalOcean NYC. For B2B where the user is a dev clicking a dashboard that reloads every minute, Hetzner Falkenstein is more than acceptable.",[368,14717,14719],{"id":14718},"profile-3-regulated-b2b-data-residency-4-8-vps-r200k-mrr","Profile 3: regulated B2B \u002F data residency \u002F 4-8 VPS \u002F R$200k+ MRR",[2734,14721,14722,14727,14732],{},[70,14723,14724,14726],{},[27,14725,14680],{},": Magalu Cloud as primary, DigitalOcean NYC as secondary for DR, or AWS São Paulo if compliance requires a vendor with specific certifications.",[70,14728,14729,14731],{},[27,14730,136],{},": R$320-640\u002Fmonth on Magalu, with extras for managed database and Object Storage.",[70,14733,14734,14736],{},[27,14735,14693],{},": at this range, infra cost is lower than the cost of any failed audit. Data residency is worth the premium. Magalu Cloud delivers that with excellent latency, simple billing and Portuguese support. DigitalOcean NYC stays as secondary for regional failover.",[19,14738,14740],{"id":14739},"can-i-use-two-providers-at-the-same-time","Can I use two providers at the same time?",[12,14742,14743],{},"Yes. 
The important question is whether the complexity is worth it.",[12,14745,14746],{},"The configuration that shows up most in practice is multi-provider in layers: production at one provider, staging at another, cheaper one. For example, production on Magalu Cloud Tamboré (BR latency), staging on Hetzner Falkenstein (minimum cost). The dev team accesses staging via VPN to test features; customers never see staging.",[12,14748,14749],{},"Another configuration: distribute a self-hosted cluster across two providers for resilience. Three nodes on Hetzner Ashburn, one node on Hetzner Falkenstein — works because intra-cluster latency tolerates the cross-region (40ms between Hetzner datacenters). Mixing Hetzner with DigitalOcean or Magalu Cloud in the same orchestrator cluster is technically possible but bad in practice: 150-200ms latency between nodes stalls distributed consensus.",[12,14751,14752],{},"Practical rule: if you don't have a concrete reason for multiple providers, stay with one. Operational complexity doubles with each added provider. DNS, billing, IAM, SSH keys, monitoring, backup — all doubled.",[19,14754,14756],{"id":14755},"is-it-worth-running-a-self-hosted-cluster-on-cheap-vps","Is it worth running a self-hosted cluster on cheap VPS?",[12,14758,14759],{},"It is, with caveats. The premise of the modern orchestrator is exactly that: you take 3-4 commodity VPS and the software does the work of replicated control plane, routing, automatic certificates, rolling update deploy. Monthly cost is a fraction of any equivalent managed platform-as-a-service.",[12,14761,14762],{},"The caveat is that cheap VPS comes with little CPU and little RAM. Hetzner CX11 has 1 vCPU and 2 GB of RAM — the control plane occupies 200-400 MB, leaving little for real workload. If your application is Node or Go, it fits. If it's Java with a greedy JVM, it doesn't. 
To run 3-4 cluster nodes, prefer CPX21 or CPX31 — 3-4 vCPU and 4-8 GB of RAM per node give room for a dozen containers per node.",[12,14764,14765],{},"The other caveat is management. Cheap VPS doesn't come with Managed Postgres at DigitalOcean's level. If you need a managed database, either you pay DigitalOcean for them to run Postgres ($15\u002Fmonth minimum) or you run your own Postgres as a container in the cluster — with backup, replication and upgrade responsibility on your hands.",[19,14767,3225],{"id":3224},[368,14769,13930],{"id":14770},"does-hetzner-have-a-datacenter-in-brazil-1",[12,14772,14773],{},"No. Hetzner operates datacenters in Germany (Falkenstein, Nuremberg), Finland (Helsinki), United States (Ashburn in Virginia, Hillsboro in Oregon) and Singapore. There is no South American presence in 2026, and the company has not announced public plans to expand to the region.",[368,14775,14777],{"id":14776},"how-much-does-it-cost-to-run-4-vps-on-each-of-the-three-providers-in-2026","How much does it cost to run 4 VPS on each of the three providers in 2026?",[12,14779,14780],{},"In equivalent configuration of 4 GB RAM and 2-3 vCPU per node: Hetzner runs about R$155\u002Fmonth total (4× CPX21 at €7.99). DigitalOcean runs R$480\u002Fmonth (4× Basic 4 GB at $24). Magalu Cloud runs R$320\u002Fmonth (4× vCPU 2 \u002F 4 GB at R$80). The difference between cheapest and most expensive is 3× in absolute value.",[368,14782,14784],{"id":14783},"can-i-use-two-vps-providers-at-the-same-time","Can I use two VPS providers at the same time?",[12,14786,14787],{},"Yes, and in some scenarios it makes sense. Common configurations: production at one provider + staging at another, cheaper one; cluster distributed across regions of the same provider for resilience; primary at one provider + DR at another for regional failover. Mixing two providers in the same orchestrator cluster is technically possible but cross-provider latency (typically 100-200ms) stalls distributed consensus. 
The practical rule is to stay with one provider per workload unless there's a concrete reason.",[368,14789,14791],{"id":14790},"is-magalu-cloud-reliable-in-2026","Is Magalu Cloud reliable in 2026?",[12,14793,14794],{},"Yes, with the caveat that the track record is still short. Operations started in 2023, so by 2026 it has three years of public production. The incidents that happened in this period were communicated publicly and resolved within typical SLA. The managed services ecosystem is smaller than DigitalOcean's or AWS's, but the basic offering (VPS, Object Storage, Load Balancer, managed Kubernetes) is stable. For workloads that require Brazilian data residency, it's the viable option in 2026.",[368,14796,14798],{"id":14797},"does-digitalocean-have-portuguese-support","Does DigitalOcean have Portuguese support?",[12,14800,14801],{},"Limited. Official documentation is in English with some translated sections. Ticket support accepts Portuguese, but the first response usually comes in English. The Brazilian community (forums, Discord, tutorials) is large and covers common questions in Portuguese well. For enterprise support, there are paid contracts that include service in Portuguese, but they only make sense at monthly spend above US$1,000.",[368,14803,14805],{"id":14804},"whats-the-approximate-latency-to-sao-paulo-on-each-provider","What's the approximate latency to São Paulo on each provider?",[12,14807,14808],{},"Magalu Cloud (Tamboré and Curitiba): 5-15ms. DigitalOcean (NYC): 120ms. DigitalOcean (Toronto): 140ms. Hetzner (Ashburn, US East): 140-160ms. Hetzner (Hillsboro, US West): 170-190ms. Hetzner (Falkenstein, Germany): 200-220ms. Hetzner (Singapore): 300ms+. For B2C with Brazilian audience requiring immediate response, Magalu Cloud or DigitalOcean NYC. 
For async B2B or hobby, any one works.",[368,14810,14812],{"id":14811},"which-provider-accepts-payment-in-reais-with-nf-e","Which provider accepts payment in reais with NF-e?",[12,14814,14815],{},"Only Magalu Cloud among the three. Hetzner bills in euros with international credit card (plus 4.38% IOF and exchange spread). DigitalOcean bills in dollars, also with international card, with checkout that accepts BR card without friction but no NF-e issuance. For those who need NF-e for tax or accounting requirement, Magalu Cloud is the only option of the three. Traditional Brazilian providers (Locaweb, KingHost) also issue NF-e but have less competitive pricing.",[368,14817,14819],{"id":14818},"does-hetzner-accept-brazilian-cards","Does Hetzner accept Brazilian cards?",[12,14821,14822],{},"Yes, but with friction. Brazilian international credit cards (Visa, Mastercard) are accepted. The first charge usually goes through additional verification — Hetzner may ask for proof of identity, and in some cases a selfie with ID. The signup process can take 24-48 hours until first server release. After that, monthly charges go through normally. Prepaid cards and virtual cards are generally not accepted.",[368,14824,14756],{"id":14825},"is-it-worth-running-a-self-hosted-cluster-on-cheap-vps-1",[12,14827,14828],{},"It is, and it's the central use case for lightweight orchestrators like HeroCtl. Four Hetzner CPX21 VPS total 12 vCPU and 16 GB of RAM for R$155\u002Fmonth — capacity equivalent to a single mid-range server on expensive providers, at the cost of a lunch. The control plane occupies 200-400 MB of RAM per server, leaving plenty of room to host dozens of containers. 
The caveats: prefer instances with at least 3 vCPU and 4 GB of RAM (CPX21+ or equivalent); keep the managed database outside the cluster, or run the database as a container with manual backup; and bring your own monitoring plan — cheap VPS doesn't come with observability ready.",[19,14830,3309],{"id":3308},[12,14832,14833],{},"There's no universal winner among Hetzner, DigitalOcean and Magalu Cloud. There's a winner per usage profile. Hetzner for absolute cost. DigitalOcean for UX and reasonable latency for a Brazilian audience. Magalu Cloud for data residency and billing in reais.",[12,14835,14836],{},"Most indie teams start on Hetzner, migrate part of the load to DigitalOcean when the product requires lower latency for B2C, and consider Magalu Cloud when the first contract with a data residency clause appears. The three coexist in the portfolio of many mature startups — each one fulfilling a role.",[12,14838,14839],{},"For any of the three, a self-hosted cluster of 3-4 nodes with a lightweight orchestrator gives you the foundation to run dozens of applications without paying platform-as-a-service per app. HeroCtl Community Edition is free and installs in one command:",[224,14841,14842],{"className":226,"code":2948,"language":228,"meta":229,"style":229},[231,14843,14844],{"__ignoreMap":229},[234,14845,14846,14848,14850,14852,14854],{"class":236,"line":237},[234,14847,1220],{"class":247},[234,14849,2957],{"class":251},[234,14851,2960],{"class":255},[234,14853,2963],{"class":383},[234,14855,2966],{"class":247},[12,14857,14858],{},"The first three servers form a replicated control plane; on top of that you get automatic Let's Encrypt certificates, rolling-update deploys with no maintenance window, and an embedded web panel. 
The public demo cluster runs on four servers totaling 5 vCPU and 10 GB of RAM, hosting five sites in production.",[12,14860,14861],{},"For next steps, two related posts that go deeper on adjacent topics:",[2734,14863,14864,14870],{},[70,14865,14866,14869],{},[3336,14867,14868],{"href":6337},"How much it costs to host a Brazilian SaaS in 2026"," — detailed TCO analysis including bandwidth, database, monitoring.",[70,14871,14872,14876],{},[3336,14873,14875],{"href":14874},"\u002Fen\u002Fblog\u002Fkubernetes-alternative-self-hosted-paas","Kubernetes \u002F PaaS alternative for Brazil"," — comparison between lightweight orchestrators for small teams.",[12,14878,14879],{},"Provider choice is important, but it's a reversible decision. Orchestrator choice is harder to reverse — start with the software that will hold your stack for the next five years, then choose where it will run.",[3350,14881,4376],{},{"title":229,"searchDepth":244,"depth":244,"links":14883},[14884,14885,14891,14897,14903,14904,14905,14906,14907,14908,14913,14914,14915,14926],{"id":13855,"depth":244,"text":13856},{"id":13862,"depth":244,"text":13863,"children":14886},[14887,14888,14889,14890],{"id":13869,"depth":271,"text":13870},{"id":13929,"depth":271,"text":13930},{"id":13974,"depth":271,"text":13975},{"id":14001,"depth":271,"text":14002},{"id":14037,"depth":244,"text":14038,"children":14892},[14893,14894,14895,14896],{"id":14044,"depth":271,"text":14045},{"id":14105,"depth":271,"text":14106},{"id":14141,"depth":271,"text":14142},{"id":14183,"depth":271,"text":14184},{"id":14212,"depth":244,"text":14213,"children":14898},[14899,14900,14901,14902],{"id":14219,"depth":271,"text":14220},{"id":14255,"depth":271,"text":14256},{"id":14273,"depth":271,"text":14274},{"id":14303,"depth":271,"text":14304},{"id":14339,"depth":244,"text":14340},{"id":14508,"depth":244,"text":14509},{"id":14521,"depth":244,"text":14522},{"id":14569,"depth":244,"text":14570},{"id":14635,"depth":244,"text":14636},{"id":14668,"depth"
:244,"text":14669,"children":14909},[14910,14911,14912],{"id":14672,"depth":271,"text":14673},{"id":14697,"depth":271,"text":14698},{"id":14718,"depth":271,"text":14719},{"id":14739,"depth":244,"text":14740},{"id":14755,"depth":244,"text":14756},{"id":3224,"depth":244,"text":3225,"children":14916},[14917,14918,14919,14920,14921,14922,14923,14924,14925],{"id":14770,"depth":271,"text":13930},{"id":14776,"depth":271,"text":14777},{"id":14783,"depth":271,"text":14784},{"id":14790,"depth":271,"text":14791},{"id":14797,"depth":271,"text":14798},{"id":14804,"depth":271,"text":14805},{"id":14811,"depth":271,"text":14812},{"id":14818,"depth":271,"text":14819},{"id":14825,"depth":271,"text":14756},{"id":3308,"depth":244,"text":3309},"2026-04-29","Hetzner is 3-5× cheaper but has no datacenter in Brazil. DigitalOcean has more regions but costs more. Magalu Cloud is national but still maturing. Honest analysis of latency, cost, and when each one makes sense.",{},"\u002Fen\u002Fblog\u002Fhetzner-vs-digitalocean-vs-magalu-cloud",{"title":13844,"description":14928},{"loc":14930},"en\u002Fblog\u002Fhetzner-vs-digitalocean-vs-magalu-cloud",[14935,14936,14937,14938,8756,14939],"hetzner","digitalocean","magalu-cloud","vps","brazil","ooSG7yQS6DIYmBXl2wesTcHX1qINp4yyBCdCVtS6t6c",{"id":14942,"title":14943,"author":7,"body":14944,"category":6382,"cover":3379,"date":15798,"description":15799,"draft":3382,"extension":3383,"lastReviewed":3379,"meta":15800,"navigation":411,"path":6337,"readingTime":6387,"seo":15801,"sitemap":15802,"stem":15803,"tags":15804,"__hash__":15808},"blog_en\u002Fen\u002Fblog\u002Fhow-much-to-host-a-brazilian-saas-2026.md","How much does it cost to host a Brazilian SaaS in 2026: the open 
spreadsheet",{"type":9,"value":14945,"toc":15782},[14946,14949,14952,14956,14959,14962,14969,14976,14979,14983,14986,14990,14993,14999,15004,15126,15129,15135,15138,15142,15145,15150,15154,15239,15246,15251,15255,15258,15263,15332,15343,15348,15351,15355,15358,15362,15409,15412,15417,15424,15427,15431,15434,15440,15446,15455,15461,15467,15473,15477,15483,15605,15608,15611,15615,15618,15624,15630,15636,15642,15646,15649,15655,15661,15667,15673,15677,15680,15687,15694,15701,15708,15715,15717,15723,15729,15735,15741,15747,15753,15759,15761,15764,15767,15772],[12,14947,14948],{},"The first expense that kills a Brazilian SaaS's margin isn't payroll. It isn't taxes. It isn't customer acquisition. It's infrastructure paid in dollars while the customer pays in reais. That mismatch is silent in year one, uncomfortable in year two, and compromises the entire thesis of the business in year three — when the team discovers that each new account brought with it a disproportionate slice of cloud-provider spend.",[12,14950,14951],{},"This post is the open spreadsheet. No buzzwords, no \"depends on the case\", no hypothetical savings. Four Brazilian SaaS scenarios divided by revenue stage, with side-by-side cost tables, an honest decision in each, and the total math at the end. The numbers were measured on real providers in April 2026, with a reference exchange rate of R$5 per dollar — a number that may fluctuate, but serves to calibrate the order of magnitude.",[19,14953,14955],{"id":14954},"the-asymmetry-no-one-explains-in-pitch-decks","The asymmetry no one explains in pitch decks",[12,14957,14958],{},"Imagine a Brazilian SaaS that just hit US$10k MRR. At the current exchange rate, that's about R$50k per month. It seems like a healthy number — pays salaries, pays taxes, leaves something over. 
The founder looks at the balance and breathes a sigh of relief.",[12,14960,14961],{},"Now add a typical modern SaaS stack: Vercel for the front end, managed database on a cloud provider, Datadog for observability, Sentry for errors, premium Redis for queue and cache. Round number: US$1,500 per month. Fifteen percent of MRR. Sounds like a lot? Compared with the American competitor, it's the same proportion: they also have US$10k MRR and spend US$1,500. Technical tie.",[12,14963,14964,14965,14968],{},"Except it's not a tie. Look at payroll: a mid-level Brazilian dev costs around R$15k. An equivalent American costs US$10k, or R$50k. The Brazilian pays, ",[27,14966,14967],{},"proportionally, three times more for hosting relative to their own labor cost",". The American's spreadsheet closes spending 15% on infra and 50% on salaries. The Brazilian's closes spending 15% on infra and 30% on salaries — leaving, after taxes, a narrow fraction to grow.",[12,14970,14971,14972,14975],{},"The conclusion is uncomfortable and most pitch decks avoid it: the infrastructure strategy that works for a Silicon Valley startup ",[27,14973,14974],{},"doesn't work"," for a Brazilian startup. The math is different from day one. Charging in reais, paying in dollars, and still wanting to copy the Sequoia-portfolio stack is an equation that only closes while an investor is subsidizing it.",[12,14977,14978],{},"The good news: infrastructure cost is the expense with the most reduction leverage in the entire P&L of a small or medium SaaS. More even than payroll — because you can't fire a dev and keep delivering the roadmap, but you can swap Render for a VPS and keep exactly the same product. What's missing is seeing the open spreadsheet.",[19,14980,14982],{"id":14981},"the-four-scenarios-by-revenue-stage","The four scenarios by revenue stage",[12,14984,14985],{},"The MRR division is intentional. 
Each band has different operational needs, different uptime demands, and — crucially — a different opportunity cost of the team's time. Treating all Brazilian SaaSes as if they were the same is the root of most wrong decisions.",[368,14987,14989],{"id":14988},"scenario-a-pre-revenue-mvp-r0-to-r5k-mrr","Scenario A — Pre-revenue \u002F MVP (R$0 to R$5k MRR)",[12,14991,14992],{},"This is the stage where every hundred reais saved is worth a thousand. There's no customer demanding SLA, no audit, no large team to coordinate. The goal is to stay up, validate the hypothesis, and make the first real come in.",[12,14994,14995,14998],{},[27,14996,14997],{},"Typical stack:"," Render free tier, Railway hobby, Vercel pro plan, or a single VPS on a cheap provider with Coolify.",[12,15000,15001],{},[27,15002,15003],{},"Detailed table (monthly price):",[119,15005,15006,15024],{},[122,15007,15008],{},[125,15009,15010,15012,15015,15018,15021],{},[128,15011,11387],{},[128,15013,15014],{},"Render",[128,15016,15017],{},"Railway",[128,15019,15020],{},"Vercel",[128,15022,15023],{},"VPS + Coolify",[141,15025,15026,15043,15059,15073,15099],{},[125,15027,15028,15031,15034,15037,15040],{},[146,15029,15030],{},"Web application",[146,15032,15033],{},"free (with limit)",[146,15035,15036],{},"US$5",[146,15038,15039],{},"US$20",[146,15041,15042],{},"included in VPS",[125,15044,15045,15048,15051,15053,15056],{},[146,15046,15047],{},"Postgres database",[146,15049,15050],{},"US$7",[146,15052,15036],{},[146,15054,15055],{},"KV (US$0.50\u002FM ops)",[146,15057,15058],{},"included",[125,15060,15061,15064,15066,15068,15071],{},[146,15062,15063],{},"Redis \u002F cache",[146,15065,15050],{},[146,15067,15036],{},[146,15069,15070],{},"KV US$5",[146,15072,15058],{},[125,15074,15075,15080,15085,15090,15095],{},[146,15076,15077],{},[27,15078,15079],{},"Total monthly 
USD",[146,15081,15082],{},[27,15083,15084],{},"US$14",[146,15086,15087],{},[27,15088,15089],{},"US$15",[146,15091,15092],{},[27,15093,15094],{},"US$25",[146,15096,15097],{},[27,15098,15036],{},[125,15100,15101,15106,15111,15116,15121],{},[146,15102,15103],{},[27,15104,15105],{},"Total monthly BRL",[146,15107,15108],{},[27,15109,15110],{},"R$70",[146,15112,15113],{},[27,15114,15115],{},"R$75",[146,15117,15118],{},[27,15119,15120],{},"R$125",[146,15122,15123],{},[27,15124,15125],{},"R$25–R$30",[12,15127,15128],{},"The raw difference — VPS running Coolify costing one-fifth of the most expensive managed option — is the direct effect of cutting out the middleman. You take on the work of installing Postgres in a container, configuring backup, opening ports on the firewall. In exchange, you pay R$25 instead of R$125.",[12,15130,15131,15134],{},[27,15132,15133],{},"Honest decision:"," the VPS with Coolify is four to five times cheaper. But it costs between two and four hours per month of maintenance (package updates, backup checks, occasional reboot). For a Brazilian MVP with R$0 of real revenue and two founders who still have a daytime CLT job, the time-versus-money equation tilts toward money: saving R$100 per month in the first twelve months is R$1,200 that pays for trademark registration, three months of domain, or the first reais of Google Ads when there's something to invest.",[12,15136,15137],{},"Counterintuitive shortcut: if your only pain is \"I don't want to learn Linux\", hire a managed VPS from Locaweb or KingHost. It's more expensive than Hetzner, cheaper than Vercel, and support speaks Portuguese.",[368,15139,15141],{"id":15140},"scenario-b-indie-hacker-micro-saas-r5k-to-r30k-mrr","Scenario B — Indie hacker \u002F micro-SaaS (R$5k to R$30k MRR)",[12,15143,15144],{},"Here there are customers. There's a bit of unpredictability — peak in business hours, drop at night, seasonal surge at month-end when the bill goes out. 
Single-server starts to hurt because one outage takes down all customers at the same time.",[12,15146,15147,15149],{},[27,15148,14997],{}," Render Pro, Railway scaled, managed Postgres somewhere, basic monitoring.",[12,15151,15152],{},[27,15153,15003],{},[119,15155,15156,15170],{},[122,15157,15158],{},[125,15159,15160,15162,15164,15167],{},[128,15161,14582],{},[128,15163,4103],{},[128,15165,15166],{},"USD",[128,15168,15169],{},"BRL",[141,15171,15172,15185,15198,15212,15225],{},[125,15173,15174,15176,15179,15182],{},[146,15175,15014],{},[146,15177,15178],{},"Pro instance US$25 + Postgres US$25 + Redis US$7",[146,15180,15181],{},"US$57",[146,15183,15184],{},"R$285",[125,15186,15187,15189,15192,15195],{},[146,15188,15017],{},[146,15190,15191],{},"variable usage tier, app + db + Redis",[146,15193,15194],{},"US$30–US$80",[146,15196,15197],{},"R$150–R$400",[125,15199,15200,15203,15206,15209],{},[146,15201,15202],{},"Vercel Team",[146,15204,15205],{},"2 seats + bandwidth + functions",[146,15207,15208],{},"US$80–US$150",[146,15210,15211],{},"R$400–R$750",[125,15213,15214,15216,15219,15222],{},[146,15215,15023],{},[146,15217,15218],{},"1 server 4 vCPU 8 GB",[146,15220,15221],{},"US$10",[146,15223,15224],{},"R$50",[125,15226,15227,15230,15233,15236],{},[146,15228,15229],{},"HeroCtl Community + 3 Hetzner VPS",[146,15231,15232],{},"3 nodes with real high availability",[146,15234,15235],{},"€15",[146,15237,15238],{},"R$90",[12,15240,15241,15242,15245],{},"The last line is where the argument changes nature. For R$90 per month — less than Render's cheapest tier — you have a cluster with ",[27,15243,15244],{},"three real servers",", with replicated control plane, automatic coordinator election in around seven seconds when a node falls, automatic HTTPS certificate, and integrated router. The total stack cost is less than the team's dinner on a Friday. 
And the real uptime, once configured, is better than that of managed single-server, because one node failing leaves the other two serving traffic.",[12,15247,15248,15250],{},[27,15249,15133],{}," simple self-hosted on a single server is three to five times cheaper than hosted, but trades availability for savings. Real high-availability cluster (HeroCtl Community on three VPSes) still costs half of Render Pro single-server, and gives operational guarantee that Render single-server doesn't. The monthly difference of R$200 to R$500 is, over the year, equivalent to a daily lunch for the entire team. For an indie hacker, that's the difference between buying a new course, going to an event, or simply breathing more deeply in the cash flow.",[368,15252,15254],{"id":15253},"scenario-c-early-stage-startup-r30k-to-r200k-mrr","Scenario C — Early stage startup (R$30k to R$200k MRR)",[12,15256,15257],{},"Real SLA requirements appear here. B2B customers start asking about availability, backup, log retention. 
You'll need more serious monitoring, perhaps auditing, and certainly managed backup and recovery processes.",[12,15259,15260],{},[27,15261,15262],{},"Typical managed stack:",[119,15264,15265,15277],{},[122,15266,15267],{},[125,15268,15269,15271,15274],{},[128,15270,4103],{},[128,15272,15273],{},"USD\u002Fmonth",[128,15275,15276],{},"BRL\u002Fmonth",[141,15278,15279,15290,15299,15310,15321],{},[125,15280,15281,15284,15287],{},[146,15282,15283],{},"AWS managed (small cluster + RDS Postgres + ElastiCache + load balancer + NAT + CloudWatch + S3)",[146,15285,15286],{},"US$1,500–US$3,000",[146,15288,15289],{},"R$7,500–R$15,000",[125,15291,15292,15295,15297],{},[146,15293,15294],{},"Equivalent GCP managed (cluster + CloudSQL + Memorystore)",[146,15296,15286],{},[146,15298,15289],{},[125,15300,15301,15304,15307],{},[146,15302,15303],{},"Render Team plan + scaled",[146,15305,15306],{},"US$300–US$600",[146,15308,15309],{},"R$1,500–R$3,000",[125,15311,15312,15315,15318],{},[146,15313,15314],{},"Self-hosted cluster (HeroCtl on 4 Hetzner or DigitalOcean VPS + S3-compatible storage)",[146,15316,15317],{},"US$60–US$120",[146,15319,15320],{},"R$300–R$600",[125,15322,15323,15326,15329],{},[146,15324,15325],{},"Hybrid (self-hosted + critical managed services like transactional database)",[146,15327,15328],{},"US$200–US$400",[146,15330,15331],{},"R$1,000–R$2,000",[12,15333,15334,15335,15338,15339,15342],{},"The difference between AWS managed and self-hosted at this stage is the most significant in the entire spreadsheet. We're talking about ",[27,15336,15337],{},"R$5,000 to R$12,000 per month of recurring savings",". That delta pays, over twelve months, ",[27,15340,15341],{},"a mid-level developer for an entire year",", or two interns, or — for the startup still seeking breakeven — six additional months of capital runway.",[12,15344,15345,15347],{},[27,15346,15133],{}," self-hosting at this stage requires the team to have someone with operational competence. 
It doesn't have to be a full-time large-scale specialist, but it has to be someone who knows how to read logs, restore a backup, and diagnose latency. Usually it's the CTO or the first senior dev. The embedded cost there — four to eight hours per month of that professional — is small compared to the savings. But it exists, and ignoring it is dishonest.",[12,15349,15350],{},"Hybrid is usually the right answer at this stage: the application runs on a self-hosted cluster (because it's easy), the transactional database stays managed (because restoring Postgres with synchronous replication at three a.m. isn't founder weekend work). The hybrid bill comes out around R$1,500 per month — still four times cheaper than full managed AWS.",[368,15352,15354],{"id":15353},"scenario-d-scale-up-r200k-mrr-established-platform-team","Scenario D — Scale-up (R$200k+ MRR, established platform team)",[12,15356,15357],{},"Here the equation inverts. The team has two or three engineers dedicated to infrastructure. Compliance may be on the map. Customers demand contractual SLAs with financial penalties. 
Multi-tenancy with serious isolation is a prerequisite, not a differentiator.",[12,15359,15360],{},[27,15361,14997],{},[119,15363,15364,15374],{},[122,15365,15366],{},[125,15367,15368,15370,15372],{},[128,15369,4103],{},[128,15371,15273],{},[128,15373,15276],{},[141,15375,15376,15387,15398],{},[125,15377,15378,15381,15384],{},[146,15379,15380],{},"Full AWS managed (multi-AZ, multi-region, premium observability, business support)",[146,15382,15383],{},"US$5,000–US$15,000",[146,15385,15386],{},"R$25,000–R$75,000",[125,15388,15389,15392,15395],{},[146,15390,15391],{},"HeroCtl Enterprise + 8 to 12 servers (with Enterprise license + 24×7 support)",[146,15393,15394],{},"servers R$2k–R$5k + license",[146,15396,15397],{},"R$2,000–R$5,000 + license",[125,15399,15400,15403,15406],{},[146,15401,15402],{},"Self-managed Kubernetes on cloud provider (servers + 2 senior engineers on payroll)",[146,15404,15405],{},"US$3,000–US$10,000 servers + R$60,000 payroll",[146,15407,15408],{},"R$75,000–R$110,000 total",[12,15410,15411],{},"Note that the comparison changes. It's no longer \"pure infra\": it's \"infra + team cost to operate\". Self-managed Kubernetes is cheaper on servers, but charges two senior salaries on payroll — it's like buying a cheap car and hiring a full-time driver.",[12,15413,15414,15416],{},[27,15415,15133],{}," at this stage, infrastructure cost becomes negligible compared to team cost. A platform team of three people costs R$50k to R$80k per month in payroll. R$20k a month more or less on servers is statistical noise.",[12,15418,15419,15420,15423],{},"Optimization at this stage is no longer about USD per month, it's about ",[27,15421,15422],{},"time saved by the team",". If your platform team spends two days a month solving a specialized Postgres operator problem, and the managed alternative costs R$5k more — it's worth paying. 
If they spend half an hour a month because the stack is simple, the expensive alternative buys nothing but luxury.",[12,15425,15426],{},"Managed AWS makes sense at this phase when compliance explicitly asks, when serious B2B customers list cloud provider as contract prerequisite, or when a specific certification requires a pre-approved stack. Self-hosted makes sense when the team can extract value from customization — fine telemetry, routing control, more aggressive isolation than managed offers.",[19,15428,15430],{"id":15429},"the-invisible-costs-no-one-calculates","The invisible costs no one calculates",[12,15432,15433],{},"Every cloud provider spreadsheet has the same pattern: the \"shelf price\" is just the beginning. The costs that appear on the bill in the third month — not the first — are what separate those who did the math right from those who'll only discover the problem when the investor asks for the statement.",[12,15435,15436,15439],{},[27,15437,15438],{},"Egress bandwidth."," A traditional American cloud provider charges around US$0.09 per gigabyte of egress. In reais, that's R$0.45 per gigabyte. A modest SaaS with 100 GB of egress traffic per month pays R$45 just for data leaving the data center — and that's the most frequently forgotten category in budget projections. Hetzner includes 20 TB of free bandwidth per server per month. At scale, the difference easily becomes thousands of reais per month.",[12,15441,15442,15445],{},[27,15443,15444],{},"Logs."," AWS's managed log service charges US$0.50 per gigabyte ingested and US$0.03 per gigabyte stored. A typical application generates between 1 GB and 5 GB of logs per month per instance. On five instances with six-month retention, the bill rises to R$50 to R$200 per month — invisible in the initial proposal.",[12,15447,15448,15451,15452,101],{},[27,15449,15450],{},"Monitoring as a service."," Datadog charges US$15 per host per month on the standard configuration. New Relic charges similar rates. 
Five hosts cost R$375 to R$500 per month. For an early-stage Brazilian startup, that's half an intern's salary ",[27,15453,15454],{},"just for monitoring",[12,15456,15457,15460],{},[27,15458,15459],{},"DNS."," Cloud provider managed DNS service charges US$0.50 per zone and US$0.40 per million queries. It's cheap in absolute value, but it's the category that usually falls outside the budget because it seems negligible — until five products in the company each have three zones and you discover US$30 per month leaving secretly.",[12,15462,15463,15466],{},[27,15464,15465],{},"Backup retention."," Daily snapshot with seven days of retention. Weekly snapshot with four weeks. Monthly snapshot with twelve months. The policy multiplies volume by six or seven. Wrong lifecycle management can double storage cost from one hour to the next.",[12,15468,15469,15472],{},[27,15470,15471],{},"Shrinking free tier."," Traditional American cloud provider includes 100 GB of free egress per month on the entire account. Hetzner includes 20 TB per server. The difference at scale is dramatic — and it's one of the reasons hosting in Germany on a European provider usually costs 30 to 50 percent less than hosting in São Paulo on the traditional cloud provider, even counting additional latency.",[19,15474,15476],{"id":15475},"final-aggregated-table-total-year-1-cost-by-scenario","Final aggregated table — total year-1 cost by scenario",[12,15478,15479,15480,101],{},"The table below ties everything together, expressing infra cost as a percentage of MRR. 
It's the metric that really matters to the CFO: ",[27,15481,15482],{},"how much of each real of revenue is going away to pay for servers",[119,15484,15485,15504],{},[122,15486,15487],{},[125,15488,15489,15492,15495,15498,15501],{},[128,15490,15491],{},"Scenario",[128,15493,15494],{},"Stack",[128,15496,15497],{},"Year-1 cost",[128,15499,15500],{},"Year-1 MRR",[128,15502,15503],{},"% of revenue",[141,15505,15506,15523,15539,15556,15572,15589],{},[125,15507,15508,15511,15514,15517,15520],{},[146,15509,15510],{},"MVP on VPS + Coolify",[146,15512,15513],{},"1 cheap server",[146,15515,15516],{},"R$360",[146,15518,15519],{},"R$60,000",[146,15521,15522],{},"0.6% — healthy",[125,15524,15525,15528,15531,15534,15536],{},[146,15526,15527],{},"MVP on expensive hosted",[146,15529,15530],{},"Vercel pro",[146,15532,15533],{},"R$1,500",[146,15535,15519],{},[146,15537,15538],{},"2.5% — still OK",[125,15540,15541,15544,15547,15550,15553],{},[146,15542,15543],{},"Indie hacker on hosted",[146,15545,15546],{},"Render Pro",[146,15548,15549],{},"R$3,500",[146,15551,15552],{},"R$240,000",[146,15554,15555],{},"1.5% — fine",[125,15557,15558,15561,15564,15567,15569],{},[146,15559,15560],{},"Indie hacker on self-hosted HA",[146,15562,15563],{},"HeroCtl Community + 3 VPS",[146,15565,15566],{},"R$1,000",[146,15568,15552],{},[146,15570,15571],{},"0.4% — excellent",[125,15573,15574,15577,15580,15583,15586],{},[146,15575,15576],{},"Startup on managed AWS",[146,15578,15579],{},"EKS + RDS + observability",[146,15581,15582],{},"R$120,000 + 2 SREs (R$720k payroll)",[146,15584,15585],{},"R$2,400,000",[146,15587,15588],{},"35% — hurting",[125,15590,15591,15594,15597,15600,15602],{},[146,15592,15593],{},"Startup on self-hosted",[146,15595,15596],{},"HeroCtl + 4 VPS + 1 part-time dev",[146,15598,15599],{},"R$10,000 + half a dev R$70k",[146,15601,15585],{},[146,15603,15604],{},"3% — healthy",[12,15606,15607],{},"The line that jumps out the most is the fifth. 
Thirty-five percent of R$2.4 million MRR in infra plus dedicated operations team. For a Brazilian startup at that stage, it's literally the point that defines whether the year closes profitable or in the red.",[12,15609,15610],{},"The bottom row, with the same revenue, closes the year at three percent of revenue spent. Thirty-two additional percentage points of operational margin. That's not optimization: it's an architecture decision that changes the business category the company is in.",[19,15612,15614],{"id":15613},"the-false-shortcuts","The false shortcuts",[12,15616,15617],{},"In conversations with Brazilian founders, four phrases come up frequently. Each sounds reasonable and each is a specific trap.",[12,15619,15620,15623],{},[27,15621,15622],{},"\"I'll start with everything managed and migrate later.\""," The intention is good: don't distract from the product now, optimize later. But migration from a full managed stack to self-hosted typically takes four to six months of concentrated work by at least one senior engineer — because you have to rewrite automation, redo the deploy pipeline, validate backups, and train the rest of the team. During those six months, spending keeps rising. Typically, by the time of migration, the company has already wasted between R$50,000 and R$100,000 more than it needed to.",[12,15625,15626,15629],{},[27,15627,15628],{},"\"My team's opportunity cost is greater than infra cost.\""," True for a large team with five senior engineers dedicated to product features. False for a team of two or three people where the real operational load is \"one dev spends four hours a month on servers\". In that scenario, opportunity cost is fictitious — because the time spent on servers is time that would otherwise be spent in a meeting, or on code review, or on a product task that isn't a real priority.",[12,15631,15632,15635],{},[27,15633,15634],{},"\"Free tier is enough until we validate.\""," It was true in 2018. 
In 2026, every provider has cut its free tier year after year — some silently, others with a formal announcement. Render's free tier hibernating after fifteen minutes brought down production for people who discovered it the hard way. Vercel's free tier for personal projects hits bandwidth and function-execution limits surprisingly early. The spreadsheet has to be built assuming a paid tier from day one — if there's money left over, great, it's cash.",[12,15637,15638,15641],{},[27,15639,15640],{},"\"Brazilian cloud is more expensive anyway.\""," It was true in 2020. In 2026, Hetzner Germany comes out 30 to 50 percent cheaper than the traditional cloud provider in São Paulo, and the additional 100 to 150 millisecond latency is negligible for the overwhelming majority of SaaS workloads. Magalu Cloud already competes on price for small and medium loads. Locaweb and KingHost, although no longer an option for scale, still have competitive entry tiers. The premise \"Brazilian cloud is expensive\" became folklore — worth checking the current price before assuming.",[19,15643,15645],{"id":15644},"when-it-makes-sense-to-spend-more-on-infra","When it makes sense to SPEND more on infra",[12,15647,15648],{},"Reverse honesty: there are situations where paying more is the correct decision, and saying otherwise would be selling ideology instead of a solution. Four cases where the premium is worth it.",[12,15650,15651,15654],{},[27,15652,15653],{},"Team of one or two people without any time to take care of servers."," If you're a solo founder and your day is sales + product + support, any hour spent on servers is an hour not spent on activities that generate revenue. Worth paying R$2,000 to R$5,000 per month more for a fully managed stack. 
You're buying focus, not infrastructure.",[12,15656,15657,15660],{},[27,15658,15659],{},"Serious B2B customer demands listed cloud provider."," Some large companies (especially banks, insurers, government) have a contractual clause requiring vendors to host on a specific cloud provider. It's not negotiable; it's a procurement prerequisite. There's no choice — pay the premium and move on.",[12,15662,15663,15666],{},[27,15664,15665],{},"Compliance or audit requiring pre-approved stack."," Specific frameworks (some related to health, payments, or government contracts) list nominally approved tools. If the auditor needs to point to an existing certificate, and your self-hosted product isn't on that list, the right answer is traditional managed. Arguing with an auditor is wasted work.",[12,15668,15669,15672],{},[27,15670,15671],{},"Critical latency and edge network is a real differentiator."," If your product is a game, real-time auction, or trading, and every millisecond counts, the edge infrastructure of the traditional American provider or CloudFlare is genuinely different. Worth the premium. But note: 99 percent of Brazilian B2B SaaSes aren't in that category, and saying they are is usually post-fact justification for a choice that was made by habit.",[19,15674,15676],{"id":15675},"heroctl-in-the-brazilian-budget-specifically","HeroCtl in the Brazilian budget specifically",[12,15678,15679],{},"Five facts relevant to the Brazilian context:",[12,15681,15682,15683,15686],{},"First, the ",[27,15684,15685],{},"Community plan is free permanent",", no artificial feature gate. There's no asterisked limited version — it's the entire product, real high availability, router, automatic certificate, metrics, centralized log. 
Indie hacker and early-stage startup can run everything here forever.",[12,15688,15689,15690,15693],{},"Second, ",[27,15691,15692],{},"runs on any cloud",": Hetzner Germany, DigitalOcean, traditional cloud provider, Magalu Cloud, KingHost, small Brazilian VPS provider. The cluster is the same binary on any Linux with Docker. There's no dependency on a specific provider's managed service — you switch providers without redoing anything.",[12,15695,15696,15697,15700],{},"Third, ",[27,15698,15699],{},"the Business plan is charged in reais",", no exchange rate volatility passed on to the customer. Dollar variation is our problem, not yours. Brazilian company paying Brazilian company in reais, against a contract in reais.",[12,15702,15703,15704,15707],{},"Fourth, ",[27,15705,15706],{},"the price is frozen for existing contracts",". What you sign today continues to apply on the contract anniversary. Rate change only applies to new contract. There's no clause allowing retroactive readjustment.",[12,15709,15710,15711,15714],{},"Fifth, ",[27,15712,15713],{},"support in Portuguese"," on Business and Enterprise plans. Team speaking your language, in your timezone, with Brazilian market context.",[19,15716,5250],{"id":5249},[12,15718,15719,15722],{},[27,15720,15721],{},"Is it cheaper to host outside Brazil?"," In 2026, yes — Hetzner Germany costs half of the traditional provider in São Paulo for equivalent loads. The additional one hundred to one hundred fifty milliseconds of latency is imperceptible for web app, REST API, dashboard, internal tool. It's perceptible for streaming, gaming, real-time voice. For the overwhelming majority of Brazilian B2B SaaS, hosting outside is a pure financial decision.",[12,15724,15725,15728],{},[27,15726,15727],{},"Is Magalu Cloud worth it in 2026?"," For small and medium loads, yes. The price is competitive, stability is acceptable, and Portuguese support helps. 
For loads requiring a deep ecosystem (five orchestrated managed services), there are still gaps. Worth it as a primary provider for a Brazilian startup that values a local vendor; less worth it for a company needing a complete service catalog.",[12,15730,15731,15734],{},[27,15732,15733],{},"When does it make sense to pay for the traditional cloud provider's managed services even though they're expensive?"," When compliance lists specific names. When a B2B customer requires it in the contract. When the team has a large-scale specialist on payroll for other reasons. When edge latency is a real product differentiator. Outside these cases, it's habit more than decision.",[12,15736,15737,15740],{},[27,15738,15739],{},"How long until the migration cost pays for itself?"," A typical migration from an expensive managed provider to self-hosted costs between R$30,000 and R$80,000 in engineering time (one or two people for two to four months). The post-migration monthly savings pay back this investment in three to eight months in the Scenario C range. On a twelve-month horizon, the migration pays for itself and still generates positive cash.",[12,15742,15743,15746],{},[27,15744,15745],{},"Does the free tier still exist in 2026?"," It exists, but more restricted than in 2020. Render keeps its hibernating tier (it doesn't serve production). Railway has initial credits. Vercel has a hobby plan with limits. Hetzner doesn't have a free tier but has servers starting at €4. The updated rule of thumb: always plan for the paid tier — if the free tier covers you, it's a bonus.",[12,15748,15749,15752],{},[27,15750,15751],{},"What if I sell to a customer that requires a specific cloud provider in the architecture?"," You have two options. First: host the main product where it's best for cost, and keep a separate deploy on the required provider to serve that specific customer. Second: HeroCtl runs inside the traditional cloud provider too — you get the provider's name on the contract, but without paying the managed cluster service premium. 
It's a middle ground that serves audit without destroying margin.",[12,15754,15755,15758],{},[27,15756,15757],{},"Does HeroCtl work on small Brazilian provider?"," Yes. The requirement is Linux with Docker. Works on Locaweb, KingHost, Hostinger, Magalu Cloud, any reasonable VPS. The demo clusters run on four servers totaling five vCPUs and ten gigabytes of RAM — any Brazilian provider delivers equivalent configuration for competitive value.",[19,15760,3309],{"id":3308},[12,15762,15763],{},"The open spreadsheet says one thing: the infra strategy that makes sense for a Brazilian startup is different from the one that makes sense for an American startup, and that difference becomes a margin point that separates those who reach the next round from those who burn cash against vendor.",[12,15765,15766],{},"To get started now — three cheap servers, real high availability, automatic certificate, web panel, no recurring license cost:",[224,15768,15770],{"className":15769,"code":5318,"language":2529},[2527],[231,15771,5318],{"__ignoreMap":229},[12,15773,15774,15775,2402,15778,101],{},"Related reading: ",[3336,15776,15777],{"href":14874},"Alternatives to Kubernetes and PaaS in Brazil",[3336,15779,15781],{"href":15780},"\u002Fen\u002Fblog\u002Fkubernetes-overkill-when-you-dont-need-it","Kubernetes is overkill: when you don't need 
it",{"title":229,"searchDepth":244,"depth":244,"links":15783},[15784,15785,15791,15792,15793,15794,15795,15796,15797],{"id":14954,"depth":244,"text":14955},{"id":14981,"depth":244,"text":14982,"children":15786},[15787,15788,15789,15790],{"id":14988,"depth":271,"text":14989},{"id":15140,"depth":271,"text":15141},{"id":15253,"depth":271,"text":15254},{"id":15353,"depth":271,"text":15354},{"id":15429,"depth":244,"text":15430},{"id":15475,"depth":244,"text":15476},{"id":15613,"depth":244,"text":15614},{"id":15644,"depth":244,"text":15645},{"id":15675,"depth":244,"text":15676},{"id":5249,"depth":244,"text":5250},{"id":3308,"depth":244,"text":3309},"2026-04-26","Revenue in reais, cost in dollars. For a Brazilian startup, infra is the first expense that kills margin. Detailed comparison of hosting scenarios with measured numbers.",{},{"title":14943,"description":15799},{"loc":6337},"en\u002Fblog\u002Fhow-much-to-host-a-brazilian-saas-2026",[6394,15805,14939,15806,15807],"saas","infrastructure","budget","FRAgcqpKrZuSeUDceh5zLeH3fjIQXEJAZSB3mcBzO8I",{"id":15810,"title":15811,"author":7,"body":15812,"category":3378,"cover":3379,"date":16717,"description":16718,"draft":3382,"extension":3383,"lastReviewed":3379,"meta":16719,"navigation":411,"path":3343,"readingTime":3386,"seo":16720,"sitemap":16721,"stem":16722,"tags":16723,"__hash__":16727},"blog_en\u002Fen\u002Fblog\u002Fdocker-deploy-production-compose-to-cluster.md","Docker deploy in production: from compose to a high-availability 
cluster",{"type":9,"value":15813,"toc":16703},[15814,15823,15826,15829,15833,15836,15839,15856,15859,15876,15879,15885,15889,15899,15904,15931,15936,15939,15944,15964,15969,15972,15977,16051,16054,16058,16061,16066,16083,16087,16090,16095,16102,16107,16110,16114,16117,16121,16124,16128,16159,16163,16166,16171,16174,16177,16181,16184,16189,16192,16196,16199,16203,16220,16224,16227,16232,16235,16240,16243,16247,16250,16254,16426,16429,16433,16436,16442,16448,16454,16460,16464,16467,16476,16482,16488,16492,16495,16512,16521,16535,16540,16549,16553,16556,16582,16589,16591,16603,16609,16615,16621,16641,16647,16657,16659,16662,16665,16681,16684,16697,16700],[12,15815,15816,15817,15819,15820,15822],{},"The flow is familiar to any dev who came up in the last five years. You write a ",[231,15818,9005],{}," with three services, run ",[231,15821,12702],{}," on your machine, it works. You bring it up on a staging VPS, it works. You move it to production by pointing DNS at it, and it works — until the first Friday at five p.m. when it stops working.",[12,15824,15825],{},"The exact point at which \"works in production\" stops being true depends much less on which tool you picked, and much more on which maturity stage your product reached. This post maps the four stages almost every team passes through, shows the practical signs that you need to step up, and makes it explicit when you don't yet.",[12,15827,15828],{},"It isn't an anti-compose post, nor a pro-cluster post. It's a post about when each thing fits.",[19,15830,15832],{"id":15831},"why-docker-compose-did-so-well-in-development","Why Docker Compose did so well in development",[12,15834,15835],{},"Before talking about the stages, it's worth understanding what Compose was designed to be. 
It solves a very specific problem and solves it well: orchestrating multiple containers on the same machine, declaring dependencies, networks, volumes, and environment variables in a single readable file.",[12,15837,15838],{},"The premises baked into that design are the premises of someone developing:",[2734,15840,15841,15844,15847,15850,15853],{},[70,15842,15843],{},"A single machine. Yours.",[70,15845,15846],{},"A single user. You.",[70,15848,15849],{},"Manual restart. When something breaks, you open the terminal and type.",[70,15851,15852],{},"Ephemeral data. If the database resets, you re-run the seed.",[70,15854,15855],{},"Nobody depends on it staying up. If it falls at three a.m., the world goes on.",[12,15857,15858],{},"In production, all five premises invert:",[2734,15860,15861,15864,15867,15870,15873],{},[70,15862,15863],{},"N machines. You're not alone anymore.",[70,15865,15866],{},"N users. They don't know you.",[70,15868,15869],{},"Automatic restart is the bare minimum. Nobody is going to wake you up at four a.m. Ideally, nobody wakes anyone.",[70,15871,15872],{},"Data matters. Losing the database becomes a reportable incident.",[70,15874,15875],{},"Someone sleeps with this active. Possibly a customer, possibly a contract with penalties.",[12,15877,15878],{},"Docker Compose still works outside its original premises. It makes things work — it just makes them work badly. 
The shortcuts that look innocent in development (shared network between services, volume mounted directly on the host, log going to the terminal) become traps when the environment changes from \"one machine where I know everything that's running\" to \"three machines where something needs to be running twenty-four hours a day\".",[12,15880,15881,15882,15884],{},"The four stages below show the natural curve of someone taking the same ",[231,15883,9005],{}," from a hobby through to a SaaS with a contractual SLA.",[19,15886,15888],{"id":15887},"stage-1-compose-on-a-single-vps","Stage 1: Compose on a single VPS",[12,15890,15891,15892,15894,15895,15898],{},"The most honest entry point for Docker in production. A cheap VPS, a ",[231,15893,9005],{}," file, and ",[231,15896,15897],{},"docker compose up -d"," solving life.",[12,15900,15901],{},[27,15902,15903],{},"Minimum viable setup:",[2734,15905,15906,15909,15912,15919,15925],{},[70,15907,15908],{},"1 VPS with 1–2 vCPUs and 2 GB of RAM (about R$30 per month at a decent Brazilian provider).",[70,15910,15911],{},"Docker and Docker Compose installed via official script.",[70,15913,15914,15915,15918],{},"All services with ",[231,15916,15917],{},"restart: always"," in compose.",[70,15920,15921,15922,622],{},"Named volumes for data (not bind mounts pointing to ",[231,15923,15924],{},"\u002Fopt\u002Fapp\u002Fdata",[70,15926,15927,15928,15930],{},"A daily cron running ",[231,15929,5736],{}," and shipping to S3 or to a Backblaze B2.",[12,15932,15933],{},[27,15934,15935],{},"Who this works for:",[12,15937,15938],{},"Hobby projects, MVPs validating product-market fit, internal tools for the team, private admin dashboards, personal blogs. 
Any application where the phrase \"if it's down for five minutes, nobody dies\" is literally true.",[12,15940,15941],{},[27,15942,15943],{},"Risks you're explicitly accepting:",[2734,15945,15946,15949,15952,15961],{},[70,15947,15948],{},"The VPS goes down (provider maintenance, noisy-neighbor spike, hardware) and your service goes with it. There's no fail-over.",[70,15950,15951],{},"The disk dies, and if you weren't backing up, you lost the data. Cloud-provider SSDs fail less than the old datacenter ones, but they fail. It happens.",[70,15953,15954,15955,2402,15958,15960],{},"Each deploy has a window of about 30 seconds between ",[231,15956,15957],{},"docker compose down",[231,15959,12702],{}," during which the service is down.",[70,15962,15963],{},"You are the sysadmin. Kernel patch, Docker update, log rotation, disk monitoring — all on you.",[12,15965,15966],{},[27,15967,15968],{},"Practical limits:",[12,15970,15971],{},"Comfortably handles 1 to 3 small applications, traffic on the order of 100 requests per second, and tolerance of 5 to 30 minutes of downtime per month. 
If any of those numbers gets pushed up, you're abusing the stage.",[12,15973,15974],{},[27,15975,15976],{},"The backup nobody does and should:",[224,15978,15980],{"className":226,"code":15979,"language":228,"meta":229,"style":229},"# \u002Fetc\u002Fcron.daily\u002Fdb-backup\ndocker exec postgres pg_dump -U app app \\\n  | gzip \\\n  | aws s3 cp - s3:\u002F\u002Fmeu-bucket\u002Fbackups\u002F$(date +%F).sql.gz\n",[231,15981,15982,15987,16009,16019],{"__ignoreMap":229},[234,15983,15984],{"class":236,"line":237},[234,15985,15986],{"class":240},"# \u002Fetc\u002Fcron.daily\u002Fdb-backup\n",[234,15988,15989,15991,15994,15997,16000,16003,16005,16007],{"class":236,"line":244},[234,15990,1118],{"class":247},[234,15992,15993],{"class":255}," exec",[234,15995,15996],{"class":255}," postgres",[234,15998,15999],{"class":255}," pg_dump",[234,16001,16002],{"class":251}," -U",[234,16004,421],{"class":255},[234,16006,421],{"class":255},[234,16008,9791],{"class":383},[234,16010,16011,16014,16017],{"class":236,"line":271},[234,16012,16013],{"class":383},"  |",[234,16015,16016],{"class":247}," gzip",[234,16018,9791],{"class":383},[234,16020,16021,16023,16026,16029,16032,16035,16038,16040,16042,16045,16048],{"class":236,"line":415},[234,16022,16013],{"class":383},[234,16024,16025],{"class":247}," aws",[234,16027,16028],{"class":255}," s3",[234,16030,16031],{"class":255}," cp",[234,16033,16034],{"class":255}," -",[234,16036,16037],{"class":255}," s3:\u002F\u002Fmeu-bucket\u002Fbackups\u002F",[234,16039,1708],{"class":387},[234,16041,1866],{"class":247},[234,16043,16044],{"class":255}," +%F",[234,16046,16047],{"class":387},")",[234,16049,16050],{"class":255},".sql.gz\n",[12,16052,16053],{},"Without this, you're not in production. You're in \"development exposed on the internet\". 
The difference between one and the other is exactly this cron.",[19,16055,16057],{"id":16056},"stage-2-compose-with-auto-update-and-a-router-in-front","Stage 2: Compose with auto-update and a router in front",[12,16059,16060],{},"The first natural evolution. You still have a VPS, but now it has two floors: a router that terminates TLS and distributes requests, and the application's containers behind it.",[12,16062,16063],{},[27,16064,16065],{},"Setup:",[2734,16067,16068,16071,16074,16077,16080],{},[70,16069,16070],{},"1 slightly beefier VPS (2–4 vCPUs, 4–8 GB), running around R$50 to R$80 per month.",[70,16072,16073],{},"Same stack as the previous stage, plus a reverse proxy (Caddy or a standalone Traefik) terminating TLS automatically via Let's Encrypt.",[70,16075,16076],{},"Watchtower (or equivalent) pulling new images from the registry periodically.",[70,16078,16079],{},"Simple pipeline on GitHub Actions or GitLab CI that builds the image, ships it to the registry, and lets Watchtower discover it.",[70,16081,16082],{},"Automated backup like stage 1, now with longer retention.",[12,16084,16085],{},[27,16086,15935],{},[12,16088,16089],{},"Indie hackers with 2 to 5 small apps, first paying customer on a side SaaS, an agency hosting sites for clients with no contractual SLA, internal tools that grew past the \"three people use it\" phase.",[12,16091,16092],{},[27,16093,16094],{},"What improved over stage 1:",[12,16096,16097,16098,16101],{},"Deploy became seamless from the developer's point of view. You ",[231,16099,16100],{},"git push",", the CI builds and publishes, and two minutes later the new version is live without you having SSHed into any server. Automatic TLS solves a pain that used to consume an afternoon per quarter. Multiple apps share the same wildcard certificate via the router.",[12,16103,16104],{},[27,16105,16106],{},"What still hurts:",[12,16108,16109],{},"Watchtower pulls any new image without a second thought. 
There's no rolling deploy — during the swap, the application is unavailable for somewhere between 10 and 30 seconds. There's no real health check before promoting the new version; if you published a broken image, the service is down until you notice and revert manually. And the single point of failure remains: the VPS that goes down takes everything with it.",[12,16111,16112],{},[27,16113,15968],{},[12,16115,16116],{},"5 to 10 apps on the same server, traffic up to 500 requests per second (very dependent on the load shape), tolerance of 5 to 15 minutes of downtime per month. If you started losing sleep because Watchtower updated something at three a.m. and broke it, you've moved past the stage.",[19,16118,16120],{"id":16119},"stage-3-multi-server-with-docker-swarm","Stage 3: Multi-server with Docker Swarm",[12,16122,16123],{},"Here the conversation changes. You go from \"one beefy machine\" to \"three machines getting along together\". Docker Swarm is the natural step for someone already comfortable with Compose: the file is practically the same, the vocabulary is the same, and the conceptual jump is smaller than going straight to Kubernetes.",[12,16125,16126],{},[27,16127,16065],{},[2734,16129,16130,16133,16143,16153,16156],{},[70,16131,16132],{},"3 or more medium-sized VPS. Three is the practical minimum so the cluster survives losing a machine.",[70,16134,16135,16138,16139,16142],{},[231,16136,16137],{},"docker swarm init"," on the first node, ",[231,16140,16141],{},"docker swarm join"," on the other two.",[70,16144,16145,16146,16149,16150,101],{},"Stack file (",[231,16147,16148],{},"docker stack deploy -c stack.yml meuapp",") instead of ",[231,16151,16152],{},"docker-compose up",[70,16154,16155],{},"Router integrated with the cluster (Traefik has a native Swarm mode) listening to daemon events and rebalancing automatically.",[70,16157,16158],{},"Centralized logs and metrics? You bolt them on. 
Not in the box.",[12,16160,16161],{},[27,16162,15935],{},[12,16164,16165],{},"B2B SaaS with a first contract requiring \"best-effort 99%\", team has a dev comfortable with the terminal and willing to learn, application grew past what fits on a single VPS without pain.",[12,16167,16168],{},[27,16169,16170],{},"The elephant in the room:",[12,16172,16173],{},"Docker Swarm has been in maintenance mode since 2019. Docker Inc. doesn't actively invest in new features. Critical bugs are still fixed, but the plugin ecosystem stagnated, and scheduler evolution practically stopped. It works — thousands of companies run Swarm in production without problems today. But you're betting on a technology whose investment trajectory was cut more than five years ago.",[12,16175,16176],{},"The honest version: if you adopt Swarm in 2026, you're adopting it expecting to eventually migrate to something else. It's not an immediate problem — it's a problem that shows up when you need something Swarm will never gain.",[12,16178,16179],{},[27,16180,15968],{},[12,16182,16183],{},"3 to 30 servers, traffic on the order of 5 thousand aggregated requests per second, tolerance of 5 to 30 seconds of failover when a machine drops. Above that, either you complement with external pieces (observability stack, manual autoscaler, GSLB DNS), or you step up to the next stage.",[12,16185,16186],{},[27,16187,16188],{},"Where else it hurts day-to-day:",[12,16190,16191],{},"The overlay network under high load has known edge cases, mainly on cloud-provider networks with non-standard MTU. Recovery after split-brain in some scenarios needs manual intervention — the cluster doesn't recompose itself in 100% of cases. 
And anything involving detailed observability (persisted metrics, structured logs, distributed tracing) you assemble separately, maintaining two or three more products.",[19,16193,16195],{"id":16194},"stage-4-cluster-with-replicated-control-plane","Stage 4: Cluster with replicated control plane",[12,16197,16198],{},"The step where \"production\" starts to mean the same thing it means at mature platform companies. You're no longer running a legacy orchestrator in maintenance, nor depending on a single server for continuity.",[12,16200,16201],{},[27,16202,16065],{},[2734,16204,16205,16208,16211,16214,16217],{},[70,16206,16207],{},"3 to 5 servers in the minimum configuration, with the control plane replicated across the first three. The rest join as agents.",[70,16209,16210],{},"Automatic leader election. If the current one falls, in a few seconds another takes over and the cluster keeps accepting deploys and serving traffic.",[70,16212,16213],{},"Integrated router, automatic certificates, health check before promoting a new version, rolling deploy with configurable windows.",[70,16215,16216],{},"Metrics and logs as internal services of the cluster itself — you don't bolt on five products to get basic observability.",[70,16218,16219],{},"Job submission via CLI, API, or web panel. The cluster decides which server each replica runs on.",[12,16221,16222],{},[27,16223,15935],{},[12,16225,16226],{},"SaaS with a 99.9% contractual SLA, multi-tenant with formal isolation requirements, platform team of 1 to 3 people, B2B contracts where uptime is part of the SOW, any company where \"being out twenty minutes\" generates a refund.",[12,16228,16229],{},[27,16230,16231],{},"Risks that change in nature:",[12,16233,16234],{},"Complexity doesn't go away — it changes shape. Instead of you operating three VPS by hand, you operate a cluster that solves most problems on its own but has more pieces. The learning curve for someone who has never operated a cluster before exists and is real. 
And it is genuinely overkill if you'll never go beyond stage 2: three servers running a hobby app is waste.",[12,16236,16237],{},[27,16238,16239],{},"Concrete numbers to calibrate:",[12,16241,16242],{},"A small, well-configured cluster runs comfortably on 4 servers totaling 5 vCPUs and 10 GB of RAM, with the control plane occupying between 200 and 400 MB per server. Leader election, when the current one falls, takes about 7 seconds until the cluster is back to accepting deploys. By comparison, the equivalent configuration on Kubernetes starts at hundreds of lines of manifest for a \"hello world\" app — and HeroCtl solves the same thing in about 50.",[12,16244,16245],{},[27,16246,15968],{},[12,16248,16249],{},"3 to 500 servers. Above that, the managed Kubernetes ecosystem has tools that a small cluster doesn't need: multi-region federation, advanced scheduler for heterogeneous workloads, deep library of specialized operators for stateful databases. It's not that small clusters don't scale — it's that above 500 nodes you're in a market where other tools have a five-year head start.",[19,16251,16253],{"id":16252},"the-four-stages-side-by-side","The four stages side by side",[119,16255,16256,16274],{},[122,16257,16258],{},[125,16259,16260,16262,16265,16268,16271],{},[128,16261,2982],{},[128,16263,16264],{},"Stage 1 (compose 1 VPS)",[128,16266,16267],{},"Stage 2 (compose + auto-update)",[128,16269,16270],{},"Stage 3 (Docker Swarm)",[128,16272,16273],{},"Stage 4 (replicated cluster)",[141,16275,16276,16291,16305,16320,16334,16351,16365,16381,16394,16410],{},[125,16277,16278,16281,16283,16285,16288],{},[146,16279,16280],{},"Minimum monthly cost (BR, 2026)",[146,16282,11408],{},[146,16284,15224],{},[146,16286,16287],{},"R$150 (3 VPS)",[146,16289,16290],{},"R$200 (4 VPS)",[125,16292,16293,16296,16298,16300,16302],{},[146,16294,16295],{},"Operational 
complexity",[146,16297,4889],{},[146,16299,3154],{},[146,16301,3159],{},[146,16303,16304],{},"Medium-high",[125,16306,16307,16310,16312,16315,16318],{},[146,16308,16309],{},"Time to first deploy",[146,16311,9769],{},[146,16313,16314],{},"1 hour",[146,16316,16317],{},"1 day",[146,16319,16317],{},[125,16321,16322,16325,16327,16329,16332],{},[146,16323,16324],{},"Real high availability",[146,16326,3058],{},[146,16328,3058],{},[146,16330,16331],{},"Yes, with caveats",[146,16333,3064],{},[125,16335,16336,16339,16342,16345,16348],{},[146,16337,16338],{},"Realistic max scale",[146,16340,16341],{},"1–3 apps",[146,16343,16344],{},"5–10 apps",[146,16346,16347],{},"30 servers",[146,16349,16350],{},"500 servers",[125,16352,16353,16356,16358,16361,16363],{},[146,16354,16355],{},"Deploys without downtime",[146,16357,3058],{},[146,16359,16360],{},"Almost",[146,16362,3064],{},[146,16364,3064],{},[125,16366,16367,16370,16373,16376,16379],{},[146,16368,16369],{},"Automatic TLS",[146,16371,16372],{},"Manual or plugin",[146,16374,16375],{},"Yes, built-in",[146,16377,16378],{},"Yes, via router",[146,16380,16375],{},[125,16382,16383,16386,16388,16390,16392],{},[146,16384,16385],{},"Observability in the box",[146,16387,3058],{},[146,16389,3058],{},[146,16391,3058],{},[146,16393,3064],{},[125,16395,16396,16399,16402,16404,16407],{},[146,16397,16398],{},"Minimum team to operate",[146,16400,16401],{},"1 dev (partial)",[146,16403,16401],{},[146,16405,16406],{},"1 dev (dedicated)",[146,16408,16409],{},"1 dev (partial) or 2 (partial)",[125,16411,16412,16414,16417,16420,16423],{},[146,16413,5013],{},[146,16415,16416],{},"Hobby, MVP",[146,16418,16419],{},"Indie hacker, first customer",[146,16421,16422],{},"Early-stage B2B SaaS",[146,16424,16425],{},"SaaS with SLA, multi-tenant",[12,16427,16428],{},"The column that usually surprises is \"minimum team to operate\". Stage 4 with the right tool doesn't require more people than stage 3 — it requires people who think differently. 
The cognitive jump is bigger than the operational jump.",[19,16430,16432],{"id":16431},"the-signs-its-time-to-step-up","The signs it's time to step up",[12,16434,16435],{},"Stepping up before you need to is waste; staying below what's needed is pain. The practical signs of each transition:",[12,16437,16438,16441],{},[27,16439,16440],{},"Stage 1 → Stage 2."," You discovered you need to run more than one application on the same VPS, manual deploys started getting tense (fear of taking down production at nine p.m. on a Friday), the first paying customer showed up and they have expectations — even if not written down — that you won't disappear for thirty minutes mid-business-day.",[12,16443,16444,16447],{},[27,16445,16446],{},"Stage 2 → Stage 3."," A customer asked for an SLA for the first time, even informally (\"how long max can this be down?\"). Or the single VPS went down once and you learned the hard way you needed redundancy. Or the team grew to three or more people and you don't want to be the only one who knows how to deploy. Or you're paying R$300 per month on a giant VPS when three medium VPS would solve the same with fail-over.",[12,16449,16450,16453],{},[27,16451,16452],{},"Stage 3 → Stage 4."," B2B contract requires measurable, auditable uptime (words like \"99.9%\" and \"maintenance window\" started showing up in commercial proposals). Compliance asked for detailed audit and you need to show a trail of who did what. Or — the most common signal today — you're tired of Swarm patches and want a tool with a clear roadmap for the next five years.",[12,16455,16456,16459],{},[27,16457,16458],{},"The universal \"stepped up too early\" signal."," You're spending more time configuring infrastructure than writing product features. Step back one. Seriously. 
Infra exists to support the product, not the other way around, and most startups that die early die because they built a platform without a customer instead of a customer without a platform.",[19,16461,16463],{"id":16462},"the-trajectory-that-doesnt-work","The trajectory that doesn't work",[12,16465,16466],{},"Three common traps teams fall into trying to accelerate the jump:",[12,16468,16469,16472,16473,16475],{},[27,16470,16471],{},"Jumping from compose straight to Kubernetes."," The temptation is genuine: \"if I'm going to migrate once, better migrate to the market-leading tool and never again\". Reality is harsher. Six months in you're still fighting 300-line manifests, specialized operators, operators of operators, and spending half your engineering time on problems that didn't even exist when you ran ",[231,16474,12702],{},". Meanwhile, the simpler competitor shipped twelve features. K8s is worth it at a very specific moment — when you already know you're going to scale to 50+ servers, you have a team to operate it, and the problems it solves are problems you actually have. Before that, it's burned capital.",[12,16477,16478,16481],{},[27,16479,16480],{},"Staying on compose out of pride."," The other extreme. \"Complicated DevOps is overkill, I don't need it, I've always run everything on a VPS and never had a problem\". Reality arrives the first Friday at five p.m. when the VPS disk dies, last month's backup is three weeks old, and you discover simultaneously that (a) you needed HA and (b) you needed a tested recovery procedure. Both lessons in a single weekend are expensive.",[12,16483,16484,16487],{},[27,16485,16486],{},"Buying the stack because it's hype."," Service mesh, complete observability stack with five products, GitOps with two repositories and three pipelines, autoscaler with sophisticated policies — for a three-container app serving 200 active users. You're building a platform without users for the platform. 
The same energy invested in product features would have generated ten times the return. If you're at an earlier stage, it doesn't matter how pretty the next stage's tool is.",[19,16489,16491],{"id":16490},"technical-details-that-hold-at-any-stage","Technical details that hold at any stage",[12,16493,16494],{},"Some decisions hold from stage 1 and keep holding at stage 4. Worth spending three paragraphs on them because each has already caused production pain for a lot of people.",[12,16496,16497,16500,16501,16503,16504,16507,16508,16511],{},[27,16498,16499],{},"Restart policies."," In Compose, ",[231,16502,15917],{}," is the right path for someone who wants the container to come back on its own after any failure. ",[231,16505,16506],{},"on-failure"," is more economical but will bite you when the process exits 0 by mistake. In Swarm, ",[231,16509,16510],{},"restart_policy.condition: any"," plays a similar role. At any stage, thinking about restart policy is part of thinking about application design — it's not a detail.",[12,16513,16514,16517,16518,16520],{},[27,16515,16516],{},"Health checks."," Every application that accepts HTTP needs to expose a ",[231,16519,355],{}," endpoint returning 200 when healthy. Without it, no orchestrator above stage 1 can distinguish \"container started\" from \"container started and is actually serving traffic\". Reasonable timeout: 5 seconds. Reasonable retry: 3 times before marking unhealthy. Without it, you're going to enter restart loops and take hours to understand what's happening.",[12,16522,16523,16526,16527,16530,16531,16534],{},[27,16524,16525],{},"Named volumes versus bind mounts."," Named volumes (",[231,16528,16529],{},"volumes: [meudata:\u002Fvar\u002Flib\u002Fpostgresql\u002Fdata]",") survive container recreation, are managed by Docker, and work consistently across stages. 
Bind mounts (",[231,16532,16533],{},".\u002Fdata:\u002Fvar\u002Flib\u002Fpostgresql\u002Fdata",") depend on the host filesystem, behave strangely with SELinux and AppArmor, and break when the container changes machines (stage 3 and 4). Use bind mounts only for development and for read-only configuration files.",[12,16536,16537,16539],{},[27,16538,15444],{}," Stdout and stderr are the right path, always. An application that writes logs to a file inside the container is an application that will give you a headache. The orchestrator captures stdout, routes it where it needs to go (syslog driver, external aggregator, internal service), and you never need to exec inside the container to see what happened.",[12,16541,16542,16545,16546,16548],{},[27,16543,16544],{},"Secrets."," Environment variables in a ",[231,16547,9367],{}," file are comfortable and dangerous — they leak in logs, in backups, in snapshots. For stage 1 and 2, you can live with them if you're careful. For stage 3 and beyond, use the orchestrator's native secrets mechanism. In newer tools (HeroCtl included), the vault is part of the cluster — you don't bolt on a separate product just to store a password.",[19,16550,16552],{"id":16551},"concrete-cost-per-stage-brazil-2026","Concrete cost per stage (Brazil, 2026)",[12,16554,16555],{},"The raw math, no flourishes:",[2734,16557,16558,16564,16570,16576],{},[70,16559,16560,16563],{},[27,16561,16562],{},"Stage 1."," 1 VPS at R$30 per month = R$360 per year. Initial setup time: an afternoon. Continuous operation time: about 1 hour per month.",[70,16565,16566,16569],{},[27,16567,16568],{},"Stage 2."," 1 VPS at R$50 per month = R$600 per year. Setup time: a day. Operation time: about 2 hours per month, mostly dealing with Watchtower updating something it shouldn't have.",[70,16571,16572,16575],{},[27,16573,16574],{},"Stage 3."," 3 VPS at R$50 = R$150 per month = R$1,800 per year. Setup time: about 3 days until comfortable. 
Operation time: 4 to 8 hours per month, depending on how many jobs run.",[70,16577,16578,16581],{},[27,16579,16580],{},"Stage 4 (HeroCtl Community)."," 4 VPS at R$50 = R$200 per month = R$2,400 per year. Setup time: 1 to 2 days until comfortable. Operation time: comparable to stage 3, but without the manual patches and with observability in the box.",[12,16583,16584,16585,16588],{},"And to calibrate the comparison many people make too early: ",[27,16586,16587],{},"managed Kubernetes for the same scale"," costs between R$700 and R$1,500 per month for control plane and load balancers, so R$8.4k to R$18k per year just on infrastructure — not counting the 1 to 2 SREs (R$25k to R$35k per month each) that this stage starts to require. The difference between stage 4 and managed Kubernetes in total cost is usually a full order of magnitude.",[19,16590,7347],{"id":7346},[12,16592,16593,16599,16600,16602],{},[27,16594,16595,16596,16598],{},"Isn't compose with ",[231,16597,15917],{}," enough?","\nIt's enough until the first thing ",[231,16601,15917],{}," doesn't cover: the entire VPS unavailable, the disk corrupted, the provider's network failure, or a deploy that ships a broken image and enters a loop with nobody to notice. For a hobby project, enough; for a paying customer, it's the starting point, not the finish line.",[12,16604,16605,16608],{},[27,16606,16607],{},"Is Docker Swarm really deprecated?","\nNot in the official sense — Docker Inc. hasn't announced discontinuation. But it's in maintenance, with no relevant new features since 2019, and the plugin ecosystem stopped growing. It works in production today. A defensible choice for someone already on it. 
A questionable choice for someone adopting now in 2026.",[12,16610,16611,16614],{},[27,16612,16613],{},"When is it worth stepping up to managed Kubernetes?","\nWhen you have more than 50 servers, a platform team with 3+ dedicated people, and specific problems the K8s ecosystem solves better (multi-region federation, sophisticated autoscaling, deep library of stateful operators). Before that, you're paying the cost without using the benefit.",[12,16616,16617,16620],{},[27,16618,16619],{},"Is Watchtower safe?","\nReasonably, with caveats. It pulls any new image published on the tag you're pointing at, without distinguishing between \"an update you published\" and \"a compromised image someone pushed via supply chain\". For stage 2, the trade-off is worth it: the operational gain outweighs the risk. For larger stages, prefer mechanisms that validate the image before promoting.",[12,16622,16623,16626,16627,16630,16631,571,16633,16636,16637,16640],{},[27,16624,16625],{},"How do I back up a Docker volume at stage 1?","\nA daily cron running ",[231,16628,16629],{},"docker exec"," on the database container with the native dump utility (",[231,16632,5736],{},[231,16634,16635],{},"mysqldump",", etc.), pipe to ",[231,16638,16639],{},"gzip",", and upload to object storage outside the same provider. The golden rule: the backup must live far from the primary data. If the datacenter goes down, the backup needs to be in another datacenter.",[12,16642,16643,16646],{},[27,16644,16645],{},"Can I jump from stage 1 straight to 4?","\nTechnically yes, especially with tools that make the jump smooth (HeroCtl is one of them, installs in minutes and runs comfortably on 3 servers). Recommended only if you already know you'll need stage 4 within the next six months. 
Otherwise, stage 2 teaches you things (automated deploy, TLS, image registries) you'll use anyway later.",[12,16648,16649,16652,16653,16656],{},[27,16650,16651],{},"What if I don't know Linux deeply?","\nStages 1 and 2 are very accessible with basic knowledge. Stage 3 starts to require network understanding (overlay networks, MTU, occasional iptables). Stage 4, with the right tool, abstracts most of the complexity — but incident debug still requires reading systemd logs, understanding what ",[231,16654,16655],{},"dmesg"," is saying, and diagnosing a full disk. There's no magic that replaces fundamentals when something goes wrong at three a.m.",[19,16658,3309],{"id":3308},[12,16660,16661],{},"Maturity isn't a moral virtue. It isn't \"better\" to be at stage 4 than at stage 1 — it's better to be at the stage that matches the size of the problem you're solving. A hobby project at stage 4 is wasted capital and attention; a SaaS with fifty paying customers at stage 1 is operational negligence.",[12,16663,16664],{},"HeroCtl exists to make stage 4 accessible to people who used to have to choose between the discomfort of Swarm and the cost of Kubernetes. If you feel you've moved past stage 2 and are weighing options:",[224,16666,16667],{"className":226,"code":2948,"language":228,"meta":229,"style":229},[231,16668,16669],{"__ignoreMap":229},[234,16670,16671,16673,16675,16677,16679],{"class":236,"line":237},[234,16672,1220],{"class":247},[234,16674,2957],{"class":251},[234,16676,2960],{"class":255},[234,16678,2963],{"class":383},[234,16680,2966],{"class":247},[12,16682,16683],{},"Installs in minutes, runs comfortably on 3 to 5 servers, and has a permanently free Community plan — no artificial feature gate, no server limit, no contract lock-in. 
Business and Enterprise plans exist for companies with formal SSO, detailed audit, and SLA-backed support requirements, and prices are published without a mandatory \"talk to sales\".",[12,16685,16686,16687,16691,16692,16696],{},"For people comparing tools in the same niche, two complementary posts: ",[3336,16688,16690],{"href":16689},"\u002Fen\u002Fblog\u002Fheroctl-vs-coolify","HeroCtl vs Coolify"," covers the trade-off of adopting a tool with real HA versus an elegant single-server panel; ",[3336,16693,16695],{"href":16694},"\u002Fen\u002Fblog\u002Fheroctl-vs-dokploy","HeroCtl vs Dokploy"," covers the difference between adopting a cluster with a replicated control plane versus a panel that internally runs Swarm.",[12,16698,16699],{},"And if the question is \"which stage matches me right now?\", the honest answer almost always is: the one before the one you think you need. Step back one, stay until it hurts, step up when it really hurts.",[3350,16701,16702],{},"html pre.shiki code .sH3jZ, html code.shiki .sH3jZ{--shiki-default:#8B949E}html pre.shiki code .sQhOw, html code.shiki .sQhOw{--shiki-default:#FFA657}html pre.shiki code .s9uIt, html code.shiki .s9uIt{--shiki-default:#A5D6FF}html pre.shiki code .sFSAA, html code.shiki .sFSAA{--shiki-default:#79C0FF}html pre.shiki code .suJrU, html code.shiki .suJrU{--shiki-default:#FF7B72}html pre.shiki code .sZEs4, html code.shiki .sZEs4{--shiki-default:#E6EDF3}html .default .shiki span {color: var(--shiki-default);background: var(--shiki-default-bg);font-style: var(--shiki-default-font-style);font-weight: var(--shiki-default-font-weight);text-decoration: var(--shiki-default-text-decoration);}html .shiki span {color: var(--shiki-default);background: var(--shiki-default-bg);font-style: var(--shiki-default-font-style);font-weight: var(--shiki-default-font-weight);text-decoration: 
var(--shiki-default-text-decoration);}",{"title":229,"searchDepth":244,"depth":244,"links":16704},[16705,16706,16707,16708,16709,16710,16711,16712,16713,16714,16715,16716],{"id":15831,"depth":244,"text":15832},{"id":15887,"depth":244,"text":15888},{"id":16056,"depth":244,"text":16057},{"id":16119,"depth":244,"text":16120},{"id":16194,"depth":244,"text":16195},{"id":16252,"depth":244,"text":16253},{"id":16431,"depth":244,"text":16432},{"id":16462,"depth":244,"text":16463},{"id":16490,"depth":244,"text":16491},{"id":16551,"depth":244,"text":16552},{"id":7346,"depth":244,"text":7347},{"id":3308,"depth":244,"text":3309},"2026-04-21","Docker Compose solves dev. In production, even a single server with no SLA can do. Beyond that, you need a real cluster. An honest trajectory through the four maturity stages.",{},{"title":15811,"description":16718},{"loc":3343},"en\u002Fblog\u002Fdocker-deploy-production-compose-to-cluster",[1118,1526,16724,16725,16726,3378],"production","cluster","ha","mfqfPpSR1uQ_80pgl39j1VgOwGutdVoxL1ymY7vjuyw",{"id":16729,"title":16730,"author":7,"body":16731,"category":3378,"cover":3379,"date":17792,"description":17793,"draft":3382,"extension":3383,"lastReviewed":3379,"meta":17794,"navigation":411,"path":7461,"readingTime":8761,"seo":17795,"sitemap":17796,"stem":17797,"tags":17798,"__hash__":17801},"blog_en\u002Fen\u002Fblog\u002Fpostgres-in-production-managed-vs-self-hosted.md","Postgres in production: managed vs self-hosted, the honest 
math",{"type":9,"value":16732,"toc":17775},[16733,16736,16739,16743,16746,16752,16758,16764,16770,16776,16782,16788,16791,16795,16798,16804,16810,16816,16822,16828,16834,16838,16844,16848,16924,16927,16931,17023,17030,17034,17148,17155,17159,17162,17175,17181,17187,17190,17194,17197,17203,17213,17229,17235,17245,17251,17255,17258,17264,17285,17295,17305,17311,17315,17318,17332,17338,17344,17350,17359,17364,17367,17371,17634,17637,17641,17647,17653,17659,17665,17669,17678,17687,17702,17711,17717,17727,17733,17735,17738,17741,17744,17760,17770,17773],[12,16734,16735],{},"The \"RDS or run Postgres on my cluster\" decision is the one Brazilian SaaS most postpones. It shows up in a month-one architecture document, becomes a TODO in month three, becomes an internal fight in month six when the AWS bill comes in high three digits. And meanwhile, no one wants to choose — because every blog post on the subject is written by someone with bias. Those who work for a managed vendor say self-hosted will break you. Those who've maintained Postgres for fifteen years say RDS is a rip-off. Both sides leave things out.",[12,16737,16738],{},"This post opens the real spreadsheet. No frills, no taking sides. When RDS makes sense, when it doesn't, how much each scenario costs in reais, how much it costs in engineer-hours, and what the five mistakes are that turn self-hosted into an accident.",[19,16740,16742],{"id":16741},"what-managed-services-do-for-you-no-irony","What managed services do for you (no irony)",[12,16744,16745],{},"Before any comparison, it's honest to recognize what RDS, Cloud SQL, Aurora, and modern Postgres-as-a-Service (Supabase, Neon, Crunchy) actually deliver. The marketing inflated it — but the product is real.",[12,16747,16748,16751],{},[27,16749,16750],{},"Automatic backup with configurable retention."," You say \"keep seven days,\" and it's done. Incremental snapshot, no visible maintenance window, no cron, no babysitter. 
For many teams, this item alone justifies the check.",[12,16753,16754,16757],{},[27,16755,16756],{},"Point-in-time recovery (PITR)."," You discover at eleven a.m. that a deploy at nine deleted an important field. In RDS, you restore to 08:55. No reading the WAL archiving manual, no praying for a transaction log to be intact in a bucket. Just console and button.",[12,16759,16760,16763],{},[27,16761,16762],{},"Automatic security patches."," Postgres minor releases come out every three months, and each has a reasonable CVE. In managed, that applies in a window you define. In self-hosted, you discover you're behind when a compliance check hits.",[12,16765,16766,16769],{},[27,16767,16768],{},"One-click read replica."," Want to scale reads? Turn on the replica, wait for replication, point your application. In self-hosted, you configure streaming replication manually, manage replication slot, monitor lag, define what happens if the connection drops.",[12,16771,16772,16775],{},[27,16773,16774],{},"Automatic Multi-AZ failover."," In RDS Multi-AZ, the secondary instance takes over in 60–120 seconds when the primary dies, and the DNS endpoint routes itself. It's the most expensive and most useful feature of the product.",[12,16777,16778,16781],{},[27,16779,16780],{},"Integrated metrics, centralized logs."," CloudWatch already has everything there. Slow queries, cache hit ratio, active connections, IO. You open the console and see.",[12,16783,16784,16787],{},[27,16785,16786],{},"Hours of operation you don't spend."," This is the invisible item. Each of the features above is an afternoon you didn't spend. Twenty afternoons over the year add up to a whole engineer of part-time dedication.",[12,16789,16790],{},"Recognizing this is the honest starting point. RDS is a serious product. 
It's not hot air.",[19,16792,16794],{"id":16793},"what-managed-services-do-not-do-and-no-one-talks-about","What managed services do NOT do (and no one talks about)",[12,16796,16797],{},"Here lives the asterisk. The limitations below aren't on page one of the documentation.",[12,16799,16800,16803],{},[27,16801,16802],{},"Migrating to another platform becomes a project."," When you're in Aurora, leaving Aurora is a two-to-twelve-week project depending on the size. The dialect isn't pure Postgres — Aurora has its own extensions and behaviors. Leaving Cloud SQL for another cloud requires dump-restore, planned downtime, script rewrite, IAM tuning, redoing monitoring. The exit cost is what funds the entry discount.",[12,16805,16806,16809],{},[27,16807,16808],{},"Some popular extensions simply don't exist."," TimescaleDB doesn't run on RDS (AWS offers its own equivalent that isn't compatible). pg_partman has an old version. pgvector arrived late. If your architecture depends on a specific extension, you may discover three months later that it isn't available in your region, in your version, or at all.",[12,16811,16812,16815],{},[27,16813,16814],{},"Cross-region egress traffic costs."," You decide to put a replica in another region for disaster recovery. Each gigabyte leaving the main region for the secondary pays a toll. In small workloads it's negligible. In workloads with 200 GB of writes per day, it becomes a parallel bill.",[12,16817,16818,16821],{},[27,16819,16820],{},"Latency between app and database if they're in different VPCs."," This is the silent error. You bring up the app on one network and the database on another, with peering. Minimum latency goes from 0.3 ms (same network) to 2–4 ms (peering). Doesn't seem like much until your application makes one hundred and twenty queries per request — then it becomes 350 ms of phantom latency.",[12,16823,16824,16827],{},[27,16825,16826],{},"Detailed auditing costs extra."," Who ran DROP TABLE? 
In RDS, answering that requires Performance Insights at the advanced tier (US$7 per vCPU per month) plus a logging plugin. It isn't on by default.",[12,16829,16830,16833],{},[27,16831,16832],{},"You don't really control the maintenance window."," You \"configure\" a window, but in serious incidents AWS applies patches outside it. It's happened, it'll happen.",[19,16835,16837],{"id":16836},"the-honest-financial-math","The honest financial math",[12,16839,16840,16841,16843],{},"Reference exchange rate: R$5 per dollar. RDS prices in São Paulo region (",[231,16842,6936],{},"), April 2026, on-demand. Self-hosted assumes DigitalOcean \u002F Vultr \u002F Hetzner VPS in São Paulo or Miami.",[368,16845,16847],{"id":16846},"small-scenario-database-under-10-gb-up-to-100-connectionssec","Small scenario: database under 10 GB, up to 100 connections\u002Fsec",[119,16849,16850,16861],{},[122,16851,16852],{},[125,16853,16854,16856,16858],{},[128,16855,11387],{},[128,16857,5439],{},[128,16859,16860],{},"Self-hosted",[141,16862,16863,16874,16888,16898,16909],{},[125,16864,16865,16868,16871],{},[146,16866,16867],{},"Instance",[146,16869,16870],{},"db.t4g.micro (2 vCPU burst, 1 GB RAM)",[146,16872,16873],{},"2 vCPU 4 GB VPS already used by the app",[125,16875,16876,16878,16883],{},[146,16877,14587],{},[146,16879,16880,16881],{},"US$15 = ",[27,16882,15115],{},[146,16884,16885,16887],{},[27,16886,8014],{}," (fits alongside the app)",[125,16889,16890,16893,16896],{},[146,16891,16892],{},"10 GB gp3 storage",[146,16894,16895],{},"US$1.15",[146,16897,15058],{},[125,16899,16900,16903,16906],{},[146,16901,16902],{},"10 GB backup",[146,16904,16905],{},"US$0.95",[146,16907,16908],{},"R$0.50 (S3-compatible)",[125,16910,16911,16914,16919],{},[146,16912,16913],{},"Total",[146,16915,16916],{},[27,16917,16918],{},"R$85\u002Fmonth",[146,16920,16921],{},[27,16922,16923],{},"R$0.50\u002Fmonth",[12,16925,16926],{},"Difference: R$84\u002Fmonth. In a year, R$1k. Doesn't change anyone's life. 
For an MVP, RDS is defensible just for the automatic backup.",[368,16928,16930],{"id":16929},"medium-scenario-50-gb-1k-connectionssec-1-read-replica","Medium scenario: 50 GB, 1k connections\u002Fsec, 1 read replica",[119,16932,16933,16943],{},[122,16934,16935],{},[125,16936,16937,16939,16941],{},[128,16938,11387],{},[128,16940,5439],{},[128,16942,16860],{},[141,16944,16945,16956,16967,16977,16987,16998,17007],{},[125,16946,16947,16950,16953],{},[146,16948,16949],{},"Primary",[146,16951,16952],{},"db.r6g.large (2 vCPU, 16 GB)",[146,16954,16955],{},"Dedicated 4 vCPU, 8 GB VPS — R$120",[125,16957,16958,16961,16964],{},[146,16959,16960],{},"Read replica",[146,16962,16963],{},"db.r6g.large",[146,16965,16966],{},"4 vCPU, 8 GB VPS — R$120",[125,16968,16969,16972,16975],{},[146,16970,16971],{},"50 GB gp3 storage",[146,16973,16974],{},"US$5.75",[146,16976,15058],{},[125,16978,16979,16982,16985],{},[146,16980,16981],{},"3000 provisioned IOPS",[146,16983,16984],{},"US$60",[146,16986,15058],{},[125,16988,16989,16992,16995],{},[146,16990,16991],{},"50 GB backup",[146,16993,16994],{},"US$4.75",[146,16996,16997],{},"R$5 (S3-compatible)",[125,16999,17000,17003,17005],{},[146,17001,17002],{},"Bandwidth",[146,17004,15221],{},[146,17006,15058],{},[125,17008,17009,17013,17018],{},[146,17010,17011],{},[27,17012,16913],{},[146,17014,17015],{},[27,17016,17017],{},"US$280 = R$1,400\u002Fmonth",[146,17019,17020],{},[27,17021,17022],{},"R$245\u002Fmonth",[12,17024,17025,17026,17029],{},"Difference: R$1,155\u002Fmonth = ",[27,17027,17028],{},"R$13.8k\u002Fyear",". Here the conversation begins. Is it worth R$14k not to think about backup? For a team of two engineers, that's a month of one of their work. 
For a team of eight, it's negligible.",[368,17031,17033],{"id":17032},"large-scenario-500-gb-10k-connectionssec-real-high-availability","Large scenario: 500 GB, 10k connections\u002Fsec, real high availability",[119,17035,17036,17048],{},[122,17037,17038],{},[125,17039,17040,17042,17045],{},[128,17041,11387],{},[128,17043,17044],{},"RDS Multi-AZ",[128,17046,17047],{},"Self-hosted cluster",[141,17049,17050,17060,17069,17079,17088,17099,17110,17121,17132],{},[125,17051,17052,17054,17057],{},[146,17053,16949],{},[146,17055,17056],{},"db.r6g.4xlarge (16 vCPU, 128 GB) Multi-AZ",[146,17058,17059],{},"Dedicated 16 vCPU, 64 GB VPS — R$650",[125,17061,17062,17065,17067],{},[146,17063,17064],{},"Multi-AZ sync replica",[146,17066,15058],{},[146,17068,17059],{},[125,17070,17071,17073,17076],{},[146,17072,16960],{},[146,17074,17075],{},"db.r6g.2xlarge",[146,17077,17078],{},"8 vCPU, 32 GB VPS — R$320",[125,17080,17081,17084,17086],{},[146,17082,17083],{},"500 GB io1 storage",[146,17085,16984],{},[146,17087,15058],{},[125,17089,17090,17093,17096],{},[146,17091,17092],{},"10k provisioned IOPS",[146,17094,17095],{},"US$650",[146,17097,17098],{},"local NVMe included",[125,17100,17101,17104,17107],{},[146,17102,17103],{},"500 GB automatic backup",[146,17105,17106],{},"US$48",[146,17108,17109],{},"R$80 (WAL archiving)",[125,17111,17112,17115,17118],{},[146,17113,17114],{},"Performance Insights advanced",[146,17116,17117],{},"US$112",[146,17119,17120],{},"free (Prometheus)",[125,17122,17123,17126,17129],{},[146,17124,17125],{},"Egress bandwidth",[146,17127,17128],{},"US$100",[146,17130,17131],{},"included up to 20 TB",[125,17133,17134,17138,17143],{},[146,17135,17136],{},[27,17137,16913],{},[146,17139,17140],{},[27,17141,17142],{},"US$2,100 = R$10.5k\u002Fmonth",[146,17144,17145],{},[27,17146,17147],{},"R$1,700\u002Fmonth",[12,17149,17150,17151,17154],{},"Difference: R$8.8k\u002Fmonth = ",[27,17152,17153],{},"R$105k\u002Fyear",". 
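The three scenario gaps above can be reproduced as a quick sanity check (values in R$ as quoted in the post; the small scenario's R$0.50 self-hosted cost is rounded up to R$1 to stay in integer arithmetic):

```shell
# Monthly and yearly gap per scenario: RDS São Paulo vs self-hosted VPS.
for row in "small:85:1" "medium:1400:245" "large:10500:1700"; do
  name=${row%%:*}
  rest=${row#*:}
  rds=${rest%%:*}      # RDS total, R$/month
  self=${rest#*:}      # self-hosted total, R$/month
  monthly=$(( rds - self ))
  echo "$name: R\$${monthly}/month, R\$$(( monthly * 12 ))/year"
done
```

The medium scenario lands at R$13,860/year and the large one at R$105,600/year, matching the rounded figures in the text.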
This is where managed becomes hard to defend financially. But the financial math is only half. The other half is time.",[19,17156,17158],{"id":17157},"the-time-math-more-important-than-the-financial","The time math (more important than the financial)",[12,17160,17161],{},"Engineer time in São Paulo costs between R$80 and R$250 per useful hour depending on level. Take R$150\u002Fhour as a weighted average. That's the multiplier to apply to each item below.",[12,17163,17164,17167,17168,1523,17171,17174],{},[27,17165,17166],{},"Initial setup."," RDS via console: thirty minutes. You define instance, storage, security group, parameter group, and it's running. Self-hosted done right: four to eight hours. Postgres + PgBouncer + pgBackRest to S3 + monitoring + tuning of ",[231,17169,17170],{},"shared_buffers",[231,17172,17173],{},"work_mem"," + restore script + restore test. Doing this in half a day requires prior experience. Without experience, it becomes a whole sprint.",[12,17176,17177,17180],{},[27,17178,17179],{},"Ongoing monthly operation."," RDS: zero. You open the console when something screams. Self-hosted done right: two to four hours. Review slow queries, adjust a parameter that got tight, verify backup ran, monthly restore test, update minor version. That's the steady-state regime. If you're spending more than that, something is wrong.",[12,17182,17183,17186],{},[27,17184,17185],{},"When it breaks at three a.m."," In RDS, you open a ticket. AWS Business plan responds in four hours for high severity, one hour for critical. You go to bed and wake up with a workaround. In self-hosted, you are the support. If your monitoring system didn't wake you up, the customer did. If your DR plan is in a document no one has read in six months, you're improvising.",[12,17188,17189],{},"The rule is clear: monitoring, a written disaster recovery plan, and a monthly restore test are not optional in self-hosted. 
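Crossing the two halves of the math above — the medium scenario's monthly savings against the two-to-four hours of operation at the R$150/hour weighted average — can be sketched as:

```shell
# Net monthly savings = cash gap minus engineer time spent operating it.
savings=$(( 1400 - 245 ))   # medium scenario gap, R$/month
rate=150                    # weighted average engineer cost, R$/hour
for hours in 2 4; do
  echo "${hours} h/month of operation -> net R\$$(( savings - hours * rate ))/month"
done
```

Even at the pessimistic four hours per month, self-hosting stays ahead in the medium scenario; the point where it stops being ahead is when operation routinely exceeds the savings divided by the hourly rate.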
It's what separates \"professional self-hosted\" from \"accident waiting to happen\".",[19,17191,17193],{"id":17192},"minimum-stack-for-production-grade-self-hosted-postgres","Minimum stack for production-grade self-hosted Postgres",[12,17195,17196],{},"You can't run Postgres in production without this base. Each component below solves a known failure mode.",[12,17198,17199,17202],{},[27,17200,17201],{},"Main Postgres on dedicated server."," Don't share disk with the application. The engine depends on predictable IOPS, and an uncontrolled growing app log can fill the volume and stop the database. Allocate a VPS just for the database, or a separate volume if it's the same VPS at first.",[12,17204,17205,17208,17209,17212],{},[27,17206,17207],{},"Connection pool with PgBouncer or Pgpool."," Postgres allocates one process per connection. At two hundred direct connections, it consumes more memory than your application. PgBouncer in ",[231,17210,17211],{},"transaction"," mode solves it: dozens of real connections to the database serving thousands of application connections. Without it, you die in the first peak hour.",[12,17214,17215,17218,17219,17221,17222,17224,17225,17228],{},[27,17216,17217],{},"Backup with pgBackRest or WAL-E."," Don't use ",[231,17220,5736],{}," in cron as a sole strategy. ",[231,17223,5736],{}," is a logical dump — good for migrating versions, bad for recovering a large database at a precise moment. You want weekly ",[231,17226,17227],{},"pg_basebackup"," plus continuous WAL archiving to an S3-compatible bucket (Cloudflare R2, Backblaze B2, Wasabi, or S3 itself). pgBackRest does this and validates integrity.",[12,17230,17231,17234],{},[27,17232,17233],{},"Hot standby replica via streaming replication."," A second server receiving the WAL in real time, ready to be promoted if the primary falls. 
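The PgBouncer setup described above fits in a few lines of `pgbouncer.ini`. This is a minimal sketch — the pool sizes, database name, and auth file path are illustrative and should be sized to your workload:

```ini
[databases]
appdb = host=127.0.0.1 port=5432 dbname=appdb

[pgbouncer]
listen_addr = 0.0.0.0
listen_port = 6432
auth_type = scram-sha-256
auth_file = /etc/pgbouncer/userlist.txt
pool_mode = transaction      ; one server connection per transaction, not per client
default_pool_size = 20       ; real connections held open to Postgres
max_client_conn = 2000       ; application-side connections PgBouncer will accept
```

The application then connects to port 6432 instead of 5432; the twenty real connections serve the two thousand client slots because each one is released back to the pool at transaction end.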
Bonus: you use that same server for heavy read queries, offloading the primary.",[12,17236,17237,17244],{},[27,17238,17239,17240,17243],{},"Monitoring with ",[231,17241,17242],{},"postgres_exporter"," + Prometheus + Grafana",", or an equivalent plugin from the orchestrator you use. You want to see: active connections, cache ratio, transaction rate, replication lag, disk space, slow queries. Without this, you're driving with your eyes closed.",[12,17246,17247,17250],{},[27,17248,17249],{},"Automated monthly restore test."," Cron that picks the most recent backup, restores it on a temporary server, validates that some tables have rows. If that fails, alert the team. Backup that's never been restored is placebo. We've seen teams lose a whole week of data because the \"backup\" had been corrupted for three months and no one tested.",[19,17252,17254],{"id":17253},"the-five-mistakes-that-break-self-hosted-postgres","The five mistakes that break self-hosted Postgres",[12,17256,17257],{},"They've been the same five for fifteen years. They don't innovate.",[12,17259,17260,17263],{},[27,17261,17262],{},"Not testing restore."," We repeat because it's the most common item. Backup that's never been restored isn't backup, it's a file. Automated monthly restore is the civilized minimum.",[12,17265,17266,17274,17275,17277,17278,17281,17282,17284],{},[27,17267,17268,17269,2402,17271,17273],{},"Keeping ",[231,17270,17170],{},[231,17272,17173],{}," at default."," Postgres's default is designed to run on a small server without assuming anything. In production, ",[231,17276,17170],{}," should be 25% of RAM, ",[231,17279,17280],{},"effective_cache_size"," 50–75%, ",[231,17283,17173],{}," calculated per simultaneous connection. 
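The tuning percentages above turn into concrete starting values with simple arithmetic. A sketch for a hypothetical dedicated 16 GB server — treat these as first guesses to validate against a pgtune-style calculator, not final numbers:

```shell
# Rule-of-thumb starting values for postgresql.conf on a dedicated box.
ram_mb=16384        # 16 GB server (assumption for the example)
connections=200     # expected simultaneous connections, behind PgBouncer
echo "shared_buffers       = $(( ram_mb / 4 ))MB"             # ~25% of RAM
echo "effective_cache_size = $(( ram_mb * 3 / 4 ))MB"         # ~75% of RAM
echo "work_mem             = $(( ram_mb / 4 / connections ))MB" # rough per-connection slice
```

On 16 GB that yields 4096 MB of shared_buffers and 12288 MB of effective_cache_size; work_mem is the delicate one, since each sort or hash in each connection can use it independently.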
Without tuning, you have 128 MB of cache on a server with 16 GB of RAM and performance is left on the table.",[12,17286,17287,17290,17291,17294],{},[27,17288,17289],{},"Not monitoring slow queries."," A poorly written query by a distracted developer can lock the entire database. ",[231,17292,17293],{},"pg_stat_statements"," enabled, alert for any query going over 500 ms in production. Without this, you discover the problem when the customer opens a ticket.",[12,17296,17297,17300,17301,17304],{},[27,17298,17299],{},"Disk shared with the operating system."," The system log fills ",[231,17302,17303],{},"\u002Fvar",", the database shares that same volume, and Postgres stops accepting writes. Postgres has to be on a dedicated volume. NVMe if possible.",[12,17306,17307,17310],{},[27,17308,17309],{},"A single server without replica."," The server goes down — and it will, sooner or later; hardware fails — and you're looking at one to three hours of downtime restoring from backup. Synchronous replica on another server reduces that to seconds.",[19,17312,17314],{"id":17313},"postgres-on-an-orchestrator-like-heroctl","Postgres on an orchestrator like HeroCtl",[12,17316,17317],{},"This is where operational complexity drops. Not because Postgres got simpler — it's still complex — but because the orchestrator absorbs the plumbing part you'd normally write by hand.",[12,17319,17320,17323,17324,17327,17328,17331],{},[27,17321,17322],{},"Postgres as a cluster task."," The service description is a configuration file of about thirty lines: official Postgres image, named volume for data, environment variables for credentials, reserved CPU and memory, restart policy. No ",[231,17325,17326],{},"systemd unit",", no ",[231,17329,17330],{},"apt install",", no manual firewall.",[12,17333,17334,17337],{},[27,17335,17336],{},"Persistence via replicated named volume."," You say \"this volume is replicated between two servers\", and the orchestrator ensures the data exists on both. 
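The slow-query setup mentioned above is two steps: a preload line in postgresql.conf, then a query like this one. A sketch — the 80-character truncation and ordering are just one reasonable view:

```sql
-- Requires shared_preload_libraries = 'pg_stat_statements' in postgresql.conf
-- and a restart, then the extension enabled once per database:
CREATE EXTENSION IF NOT EXISTS pg_stat_statements;

-- Worst queries by average execution time (column names from pg_stat_statements 1.8+).
SELECT round(mean_exec_time::numeric, 1) AS avg_ms,
       calls,
       left(query, 80) AS query
FROM pg_stat_statements
WHERE mean_exec_time > 500      -- the 500 ms alert threshold from the text
ORDER BY mean_exec_time DESC
LIMIT 10;
```

Running that on a schedule and alerting when it returns rows is the minimum viable slow-query alarm.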
If the server running Postgres goes down, the cluster reschedules on the second server with data already present. Recovery time in seconds, not hours.",[12,17339,17340,17343],{},[27,17341,17342],{},"Integrated automatic backup"," in the Business plan: continuous WAL archiving to S3-compatible object storage, weekly snapshot, configurable retention. The same RDS feature, no check to AWS.",[12,17345,17346,17349],{},[27,17347,17348],{},"Read replica as additional task."," You describe a second service pointing to the first as its replication upstream. Five extra lines in the manifest. No console, no clicks, no manual step.",[12,17351,17352,17355,17356,17358],{},[27,17353,17354],{},"Built-in metrics."," The orchestrator is already collecting CPU, memory, IO from each container. Adding ",[231,17357,17242],{}," is one more fifteen-line task. No assembling a separate Prometheus, no provisioning Grafana, no spinning up another server.",[12,17360,17361,17363],{},[27,17362,7110],{}," if the coordinating server goes down: the cluster elects another coordinator in around seven seconds and continues scheduling. Postgres itself comes back on the remaining servers right after.",[12,17365,17366],{},"The full description of a Postgres with replica + backup + metrics on HeroCtl is around one hundred lines. In Kubernetes, the equivalent is an external operator (CloudNativePG or Zalando) plus 300 lines of manifest, plus a separate monitoring stack, plus cert-manager for internal TLS between nodes. 
For the team of five, the difference is between an afternoon and a sprint.",[19,17368,17370],{"id":17369},"comparison-table","Comparison table",[119,17372,17373,17397],{},[122,17374,17375],{},[125,17376,17377,17379,17382,17385,17388,17391,17394],{},[128,17378,2982],{},[128,17380,17381],{},"RDS São Paulo",[128,17383,17384],{},"Cloud SQL",[128,17386,17387],{},"Supabase",[128,17389,17390],{},"Neon",[128,17392,17393],{},"Simple Postgres VPS",[128,17395,17396],{},"Postgres on HeroCtl",[141,17398,17399,17422,17442,17460,17478,17497,17516,17536,17555,17574,17595,17616],{},[125,17400,17401,17404,17407,17410,17413,17416,17419],{},[146,17402,17403],{},"Minimum cost (50 GB medium)",[146,17405,17406],{},"R$1,400\u002Fmo",[146,17408,17409],{},"R$1,300\u002Fmo",[146,17411,17412],{},"R$125\u002Fmo (Pro)",[146,17414,17415],{},"R$95\u002Fmo (Launch)",[146,17417,17418],{},"R$240\u002Fmo",[146,17420,17421],{},"R$245\u002Fmo",[125,17423,17424,17427,17430,17432,17434,17436,17439],{},[146,17425,17426],{},"Automatic backup",[146,17428,17429],{},"yes",[146,17431,17429],{},[146,17433,17429],{},[146,17435,17429],{},[146,17437,17438],{},"you configure",[146,17440,17441],{},"yes (Business)",[125,17443,17444,17447,17449,17451,17454,17456,17458],{},[146,17445,17446],{},"Point-in-time recovery",[146,17448,17429],{},[146,17450,17429],{},[146,17452,17453],{},"yes (Pro)",[146,17455,17429],{},[146,17457,17438],{},[146,17459,17441],{},[125,17461,17462,17464,17467,17469,17472,17474,17476],{},[146,17463,16324],{},[146,17465,17466],{},"yes (Multi-AZ paid)",[146,17468,17429],{},[146,17470,17471],{},"partial",[146,17473,17429],{},[146,17475,17438],{},[146,17477,17429],{},[125,17479,17480,17483,17486,17488,17490,17492,17495],{},[146,17481,17482],{},"Custom 
extensions",[146,17484,17485],{},"restricted",[146,17487,17485],{},[146,17489,17485],{},[146,17491,17485],{},[146,17493,17494],{},"total",[146,17496,17494],{},[125,17498,17499,17501,17504,17506,17509,17511,17514],{},[146,17500,7154],{},[146,17502,17503],{},"high",[146,17505,17503],{},[146,17507,17508],{},"medium",[146,17510,17508],{},[146,17512,17513],{},"none",[146,17515,17513],{},[125,17517,17518,17521,17524,17526,17529,17531,17534],{},[146,17519,17520],{},"Exit migration",[146,17522,17523],{},"weeks",[146,17525,17523],{},[146,17527,17528],{},"days",[146,17530,17528],{},[146,17532,17533],{},"hours",[146,17535,17533],{},[125,17537,17538,17541,17543,17545,17547,17549,17552],{},[146,17539,17540],{},"Included monitoring",[146,17542,17471],{},[146,17544,17429],{},[146,17546,17429],{},[146,17548,17429],{},[146,17550,17551],{},"you assemble",[146,17553,17554],{},"built-in",[125,17556,17557,17560,17562,17564,17566,17568,17571],{},[146,17558,17559],{},"Minimum expertise",[146,17561,17513],{},[146,17563,17513],{},[146,17565,17513],{},[146,17567,17513],{},[146,17569,17570],{},"senior",[146,17572,17573],{},"mid-level",[125,17575,17576,17579,17582,17584,17587,17590,17593],{},[146,17577,17578],{},"App↔db latency",[146,17580,17581],{},"1–4 ms",[146,17583,17581],{},[146,17585,17586],{},"5–30 ms",[146,17588,17589],{},"10–50 ms",[146,17591,17592],{},"0.3 ms",[146,17594,17592],{},[125,17596,17597,17599,17602,17604,17607,17610,17613],{},[146,17598,13533],{},[146,17600,17601],{},"any",[146,17603,17601],{},[146,17605,17606],{},"up to 50 GB",[146,17608,17609],{},"up to 100 GB",[146,17611,17612],{},"indie",[146,17614,17615],{},"startup to mid-size",[125,17617,17618,17621,17623,17625,17627,17629,17632],{},[146,17619,17620],{},"LGPD compliance via vendor",[146,17622,17429],{},[146,17624,17429],{},[146,17626,17471],{},[146,17628,17471],{},[146,17630,17631],{},"you document",[146,17633,17631],{},[12,17635,17636],{},"No column wins at everything. Each is a coherent set of tradeoffs. 
Anyone trying to sell a column as \"the best\" is selling.",[19,17638,17640],{"id":17639},"honest-decision-by-profile","Honest decision by profile",[12,17642,17643,17646],{},[27,17644,17645],{},"MVP up to 10 GB and up to a hundred connections\u002Fsec."," Postgres as a container alongside the application, on a single VPS. Daily backup to S3-compatible object storage. Total cost, database and all, in the R$10\u002Fmonth range above what you already pay for the VPS. At some point you migrate — and migrating with 10 GB is a Sunday night, not a project. Start simple.",[12,17648,17649,17652],{},[27,17650,17651],{},"Indie hacker between 10 and 100 GB."," Postgres on dedicated VPS, async replica on a second VPS, hourly backup to S3-compatible object (Cloudflare R2 or Backblaze B2 costs cents). Something between R$120 and R$200 per month total. If you have time to dedicate, this is the point where self-hosted pays off a lot.",[12,17654,17655,17658],{},[27,17656,17657],{},"Early startup between 100 and 500 GB."," This is where the decision really lies. Evaluate RDS São Paulo on the LGPD compliance argument (AWS already has the datacenter certifications) — it'll come out in the R$1.5 to R$3k per month range. Or evaluate Postgres in a cluster managed by the orchestrator, on three dedicated VPSes, in the R$400 per month range — but it requires real operational discipline. It's not \"self-hosted made easy\". It's self-hosted with the orchestrator absorbing the plumbing part.",[12,17660,17661,17664],{},[27,17662,17663],{},"Heavy compliance or Enterprise."," Managed makes sense when the audit framework asks for a vendor with specific certification. But read the contract — some RDS regions in Brazil still don't have all the certifications (HIPAA, FedRAMP, PCI level 1) that the American region has. 
If your auditor asks for a specific certificate, confirm the region has it before signing.",[19,17666,17668],{"id":17667},"questions-inexperienced-teams-ask","Questions inexperienced teams ask",[12,17670,17671,17674,17675,17677],{},[27,17672,17673],{},"Can I start self-hosted and migrate to RDS later?"," You can, and it's a valid strategy. Postgres is Postgres. You do ",[231,17676,5736],{}," of the base, restore to RDS, adjust the application endpoint, decommission the old server. Up to 50 GB, it's an operation of a few hours with a short window. The opposite path (leaving RDS for self-hosted) also works, but tools like AWS DMS make ingress easier than egress.",[12,17679,17680,17683,17684,17686],{},[27,17681,17682],{},"Is RDS São Paulo reliable?"," The ",[231,17685,6936],{}," region is one of the oldest AWS regions outside the United States, operating since 2011, and has three independent availability zones. In global AWS incidents, São Paulo usually gets caught up. In regional incidents, it falls alone — which has happened twice in the last five years for a few hours. Reliable enough for production, not reliable enough to skip a plan B.",[12,17688,17689,17695,17696,17698,17699,17701],{},[27,17690,17691,17692,17694],{},"Does backup with ",[231,17693,5736],{}," in cron solve it?"," It solves for MVP, doesn't solve for serious production. ",[231,17697,5736],{}," is a logical dump — doesn't preserve exact state, loses on restore time (slow for large bases), and doesn't allow recovery to minute X. The right combination is weekly physical ",[231,17700,17227],{}," plus continuous WAL archiving. Tool: pgBackRest.",[12,17703,17704,17707,17708,17710],{},[27,17705,17706],{},"When is it worth buying advanced Performance Insights?"," When you're on RDS, have more than five engineers touching the schema, and need to track \"who ran this query?\". 
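The self-hosted-to-RDS migration described in the answer above boils down to a dump-restore pair. A sketch only — every host, user, and database name here is a placeholder, and the RDS endpoint is the one the console hands you:

```shell
# Logical migration sketch. --format=custom produces a compressed archive
# that pg_restore can restore with parallel workers.
pg_dump --format=custom --no-owner --no-privileges \
  --dbname=postgresql://app@old-vps.internal/appdb \
  --file=appdb.dump

# --no-owner / --no-privileges because RDS never gives you superuser;
# --jobs=4 restores four tables' worth of data in parallel.
pg_restore --no-owner --jobs=4 \
  --dbname=postgresql://admin@your-instance.sa-east-1.rds.amazonaws.com/appdb \
  appdb.dump
```

After the restore, point the application at the new endpoint, watch error rates for a day, then decommission the old server.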
On small teams, native ",[231,17709,17293],{}," already delivers 80% of the value — turn it on first and see if you need more.",[12,17712,17713,17716],{},[27,17714,17715],{},"And Supabase, Neon, Crunchy?"," They're different products on top of Postgres. Supabase is Postgres + auth + generated REST API + file storage — good for a project that needs all that integrated, bad for those wanting just a database. Neon separates storage and compute, sleeps when idle, great for staging environment and spiky workload. Crunchy is pure Postgres with enterprise focus and Kubernetes operator. The three have reasonable free tiers for MVP — worth testing before closing with RDS.",[12,17718,17719,17722,17723,17726],{},[27,17720,17721],{},"How do I do real HA without RDS Multi-AZ?"," Synchronous replica on a second server (",[231,17724,17725],{},"synchronous_standby_names"," configured) ensures each commit was written to both before returning OK to the application. Failover via Patroni, or via orchestrator like HeroCtl. The sensitive point is split-brain: the replica can't promote itself without external confirmation. Patroni solves it with etcd as arbiter. HeroCtl solves it with the distributed control plane itself acting as arbiter — without setting up an extra service.",[12,17728,17729,17732],{},[27,17730,17731],{},"Does HeroCtl run heavy Postgres in production for real?"," It does. The public cluster of the documentation itself serves this blog through a stack that includes a self-hosted Postgres as a cluster task, with replica and backup. For workloads above 500 GB or with IOPS requirements in the 50k range, we recommend evaluating managed — not because the orchestrator can't handle it, but because AWS's provisioned IOPS and I\u002FO control in that range start to make real operational difference.",[19,17734,3309],{"id":3308},[12,17736,17737],{},"There's no single answer to \"managed or self-hosted Postgres\". There's your spreadsheet. 
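The synchronous-replication answer above, sketched as configuration. Names and addresses are placeholders; the `standby1` label must match the `application_name` the standby uses when it connects:

```ini
# postgresql.conf on the primary — a commit only returns after the named
# standby confirms it has written the WAL record.
synchronous_standby_names = 'FIRST 1 (standby1)'
synchronous_commit = on

# pg_hba.conf on the primary — allow the standby's replication user:
# host  replication  replicator  10.0.0.2/32  scram-sha-256
```

With this in place, losing the primary loses zero committed transactions; the price is that every commit now waits one network round-trip to the standby.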
If you opened this post looking for confirmation, what you found was numbers — use them.",[12,17739,17740],{},"For profiles where self-hosted pays off but operation scares you, HeroCtl is the layer that reduces friction. Backup, replica, monitoring, and failover described in a hundred lines of configuration, running on your cluster, no check to vendor, no lock-in.",[12,17742,17743],{},"Install with:",[224,17745,17746],{"className":226,"code":5318,"language":228,"meta":229,"style":229},[231,17747,17748],{"__ignoreMap":229},[234,17749,17750,17752,17754,17756,17758],{"class":236,"line":237},[234,17751,1220],{"class":247},[234,17753,2957],{"class":251},[234,17755,5329],{"class":255},[234,17757,2963],{"class":383},[234,17759,2966],{"class":247},[12,17761,17762,17763,17766,17767,101],{},"For more on the total cost of hosting SaaS in Brazil in 2026, read ",[3336,17764,17765],{"href":6337},"how much it costs to host a Brazilian SaaS",". For the practical transition from Docker Compose to a cluster with real high availability, read ",[3336,17768,17769],{"href":3343},"Docker deploy in production, from Compose to cluster",[12,17771,17772],{},"The honest math is the one that fits your spreadsheet. 
Run the numbers before deciding.",[3350,17774,4376],{},{"title":229,"searchDepth":244,"depth":244,"links":17776},[17777,17778,17779,17784,17785,17786,17787,17788,17789,17790,17791],{"id":16741,"depth":244,"text":16742},{"id":16793,"depth":244,"text":16794},{"id":16836,"depth":244,"text":16837,"children":17780},[17781,17782,17783],{"id":16846,"depth":271,"text":16847},{"id":16929,"depth":271,"text":16930},{"id":17032,"depth":271,"text":17033},{"id":17157,"depth":244,"text":17158},{"id":17192,"depth":244,"text":17193},{"id":17253,"depth":244,"text":17254},{"id":17313,"depth":244,"text":17314},{"id":17369,"depth":244,"text":17370},{"id":17639,"depth":244,"text":17640},{"id":17667,"depth":244,"text":17668},{"id":3308,"depth":244,"text":3309},"2026-04-15","RDS starts at US$15\u002Fmonth — ends at US$500. Self-hosting starts at $0 — ends waking you up at 3 a.m. How to decide between the two without lying to yourself.",{},{"title":16730,"description":17793},{"loc":7461},"en\u002Fblog\u002Fpostgres-in-production-managed-vs-self-hosted",[13016,17799,17800,7507,3378],"database","rds","3XZJAM1qeoTtGvzK2RyJe9zIZBX90xtG6dEwdjX6w-s",{"id":17803,"title":17804,"author":7,"body":17805,"category":3378,"cover":3379,"date":18350,"description":18351,"draft":3382,"extension":3383,"lastReviewed":3379,"meta":18352,"navigation":411,"path":11719,"readingTime":8761,"seo":18353,"sitemap":18354,"stem":18355,"tags":18356,"__hash__":18360},"blog_en\u002Fen\u002Fblog\u002Fobservability-without-datadog-startup-stack.md","Observability without Datadog: the alternative stack that fits the Brazilian 
budget",{"type":9,"value":17806,"toc":18335},[17807,17810,17813,17816,17820,17823,17826,17829,17832,17835,17838,17842,17845,17851,17857,17863,17869,17872,17876,17879,17885,17891,17897,17903,17909,17915,17921,17924,17928,17931,17934,17937,17940,17944,17947,17953,17959,17965,17971,17975,17978,17984,17990,17996,18002,18008,18011,18015,18018,18024,18030,18036,18042,18045,18049,18052,18055,18058,18061,18064,18068,18071,18074,18077,18080,18084,18233,18236,18240,18243,18249,18255,18261,18267,18269,18275,18281,18287,18293,18299,18305,18311,18313,18316,18319,18322,18327],[12,17808,17809],{},"The first Datadog bill that exceeds four digits in reais usually arrives at a predictable moment. The team deploys another pair of services, the agent's auto-discovery starts counting containers as billable hosts, someone enables APM on the Node backend, someone else enables RUM on the frontend to investigate page slowness, and at month-end the company card is debited almost R$2k. The founder looks at the spreadsheet, adds Vercel, adds managed database, adds S3, and discovers that infrastructure — once a discreet line in the budget — now eats the equivalent of half a senior salary per month.",[12,17811,17812],{},"The dominant reaction is to grit your teeth and pay. Datadog is, without irony, the best product on the observability market. The dashboards are pretty, the integrations work first try, the APM shows the slow query with depth few competitors reach, the alerting is flexible enough to model SLO policies without becoming an internal project. For a Series B company with revenue above US$5 million annually, the bill of US$2k or US$5k is negligible against the engineering time it buys. The choice is rational.",[12,17814,17815],{},"For a Brazilian startup of five to ten servers, billing between R$30k and R$500k per month, the same choice is financially devastating. 
This post maps the concrete alternative — which tools to use, which server to run on, how much RAM it consumes, how much it costs in storage — to deliver 95% of the same result for about 10% of the price.",[19,17817,17819],{"id":17818},"why-datadog-won-the-market","Why Datadog won the market",[12,17821,17822],{},"You can't honestly talk about alternatives without first explaining why the leader is where it is. Whoever has operated observability at any scale knows: the \"build it yourself\" version has always existed, and yet Datadog grew to US$2 billion in annual revenue. There are real reasons for that.",[12,17824,17825],{},"The first is that the UX set a standard. When you open a Datadog dashboard for the first time, the information hierarchy makes sense right away — host map, service map, traces, logs, all connected by links that cross contexts without friction. Whoever operated observability in the 2010s with Nagios, Cacti and Munin knows that's not free. It was expensive product engineering, over a decade.",[12,17827,17828],{},"The second is the integration library. Postgres, MySQL, Redis, Kafka, RabbitMQ, ElasticSearch, Mongo, Cassandra, dozens of cloud providers, more than six hundred ready targets. Each integration comes with reasonable default dashboard, suggested alerts, relevant metrics collected without you having to read obscure documentation.",[12,17830,17831],{},"The third is accurate APM. Official tracers for popular languages do automatic instrumentation that captures the right level of detail. The slow query in Postgres appears with execution plan. The slow p99 endpoint appears with the stack trace that caused the slowness. That level of visibility requires continuous investment in each runtime.",[12,17833,17834],{},"The fourth is flexible alerting. Simple threshold is easy in any tool. 
Alerting that understands seasonality, that crosses multiple series, that applies anomaly detection on sparse series — that's what Datadog does well because it invested a decade calibrating.",[12,17836,17837],{},"For a company that can pay, hiring Datadog is the right decision. For you who can't, it's worth understanding exactly where the bill explodes before looking for an alternative.",[19,17839,17841],{"id":17840},"the-four-vectors-where-the-bill-explodes","The four vectors where the bill explodes",[12,17843,17844],{},"The Datadog pricing page looks civilized when you look from afar. US$15 per host per month at entry, US$23 on the professional plan, US$31 on the enterprise plan. Five hosts times US$23 gives US$115 — cheap. The problem is that's not the bill that arrives.",[12,17846,17847,17850],{},[27,17848,17849],{},"Per-host charge with containers counting as hosts."," On some plans, each container counts as an additional host for billing purposes. The startup that runs four services in three replicas across three servers thinks it has three hosts — but the bill counts thirty-six telemetry points. The policy has changed several times in recent years and even those who read the documentation carefully miss the estimate.",[12,17852,17853,17856],{},[27,17854,17855],{},"Custom metrics charged per metric per minute."," Each custom metric above the included limit has individual cost. A well-intentioned team adds business metrics — orders per minute, average cart value, conversion per funnel — and the bill goes up US$50, US$100, US$300 depending on tag cardinality. 
High cardinality in custom metrics is the silent fee nobody anticipates.",[12,17858,17859,17862],{},[27,17860,17861],{},"APM Pro as upsell."," Basic APM comes included, but the features you actually want to use — continuous profiler, code-level visibility, deployment tracking, extended trace retention — are in APM Pro, with additional price per host.",[12,17864,17865,17868],{},[27,17866,17867],{},"Logs ingestion plus extra retention."," Logs price is in two dimensions: ingestion (how much comes in) and retention (how long it stays). Five servers generating 1 GB of log per day ingest 150 GB per month. Retaining 30 days is one price tier; retaining 90 days is another; retaining a year is another. And searching old logs costs per query on some plans.",[12,17870,17871],{},"Add Network Performance Monitoring, Real User Monitoring, Database Monitoring, Synthetics, CI Visibility — each one is upsell on top of upsell. The final bill of a startup with five to ten servers typically sits between US$200 and US$400 per month, that is, R$1k to R$2k at R$5 per dollar. R$24k per year is the equivalent of one month of senior person's salary.",[19,17873,17875],{"id":17874},"the-alternative-stack-with-specific-names","The alternative stack, with specific names",[12,17877,17878],{},"Good news: the open-source observability industry has been mature for a long time. It's not a 2015 scenario with Nagios and Munin. Tools today cover each vertical with real quality.",[12,17880,17881,17884],{},[27,17882,17883],{},"Metrics",": Prometheus for collection and storage, Grafana for visualization. The combination has fifteen years of production in thousands of companies, is the de facto standard, and most modern applications already expose metrics in compatible format.",[12,17886,17887,17890],{},[27,17888,17889],{},"Logs",": Loki, from the same team that maintains Grafana. Query syntax is similar to Prometheus's, which reduces cognitive load for those who already use one. 
Per GB stored, it's typically 90% cheaper than Datadog Logs because it indexes only labels, not the full content.",[12,17892,17893,17896],{},[27,17894,17895],{},"Traces",": Tempo (Grafana) or Jaeger or SigNoz. The three speak OpenTelemetry, so the application doesn't get coupled to the choice. Tempo integrates more cleanly with Grafana; Jaeger is the veteran with own UI; SigNoz combines traces, metrics and logs in a single product.",[12,17898,17899,17902],{},[27,17900,17901],{},"APM",": SigNoz is the most mature direct competitor today, with OpenTelemetry-native instrumentation. OpenObserve is a newer alternative with modern architecture. Pyroscope covers continuous profiling — the type of CPU and memory visibility that Datadog APM Pro sells expensively.",[12,17904,17905,17908],{},[27,17906,17907],{},"Errors and exceptions",": self-hosted Sentry is the robust option — same tool as the SaaS version, without the cost. GlitchTip is a lighter alternative, drop-in compatible with Sentry SDKs, great for small teams.",[12,17910,17911,17914],{},[27,17912,17913],{},"Uptime monitoring",": Uptime Kuma covers 95% of cases with five-minute installation. Statping is a similar alternative.",[12,17916,17917,17920],{},[27,17918,17919],{},"Synthetic checks",": Checkly has a generous free tier and covers the case \"run test in browser from various regions\" without you needing to maintain check infra. For those who prefer homegrown, Playwright scripts in GitHub Actions solve it.",[12,17922,17923],{},"The stack above covers all the verticals Datadog sells. The honest question is where each piece falls short in the comparison.",[19,17925,17927],{"id":17926},"what-each-component-does-and-what-it-doesnt","What each component does and what it doesn't",[12,17929,17930],{},"Prometheus and Grafana cover 90% of what Datadog dashboards cover. 
The real difference is in integrations: Datadog has one-click integration for six hundred targets, while Prometheus typically requires writing exporter or using a common exporter — postgres_exporter, redis_exporter, blackbox_exporter, node_exporter. For popular targets, these exporters exist and are well maintained. For exotic targets, you write.",[12,17932,17933],{},"Loki covers logs in 95% of web cases. The trade-off is indexing: Loki indexes only labels, not the full content. For rich log search with complex full-text terms, ELK or OpenSearch fit better. For search by service, host, log level, status code — which is what 95% of teams really do — Loki is cheaper and simpler.",[12,17935,17936],{},"SigNoz and Tempo cover APM with quality. The trade-off is polish. The slow query profile in Datadog APM has more shine — years of UX on the views that matter. SigNoz is close and improves every release; in common use cases (slow endpoint, slow query, error spike) it covers comfortably. For forensic profile investigation of a rare transaction, Datadog still wins on refinement.",[12,17938,17939],{},"Self-hosted Sentry is practically identical to Sentry SaaS — same team maintains both. You install the stack via Docker Compose, spend fifteen minutes configuring, and have error tracking in production. Costs zero in license and two to four hours per month of maintenance.",[19,17941,17943],{"id":17942},"the-concrete-architecture-in-a-small-stack","The concrete architecture in a small stack",[12,17945,17946],{},"For a startup operating five to ten servers, the architecture fits on a single dedicated observability server. Four gigabytes of RAM solves it.",[12,17948,17949,17952],{},[27,17950,17951],{},"Observability server (4 GB RAM)",": Prometheus consumes around 1.5 GB with five-to-ten-node series. Grafana sits at 200 MB. Loki at 1 GB with reasonable retention. Tempo at 500 MB. 
Plenty of room left for Alertmanager (50 MB) and some exporter or additional collector.",[12,17954,17955,17958],{},[27,17956,17957],{},"Storage for metrics",": five servers exposing about 100 metrics per second each, retained for 30 days, generate approximately 10 GB of time series database. Common SSD disk handles it — typically R$30 per month of additional storage on most providers.",[12,17960,17961,17964],{},[27,17962,17963],{},"Storage for logs",": five servers producing 1 GB of log per day, retained for 30 days, are 150 GB. The cheap solution is to point Loki to a S3-compatible backend — Cloudflare R2 charges US$0.015 per GB per month with no egress fee, that is, US$2.25 per month for 150 GB. Backblaze B2 is equivalent. AWS S3 works but has egress that hurts if you read a lot; for observability, R2 or B2 are the obvious choice.",[12,17966,17967,17970],{},[27,17968,17969],{},"Trace sampling",": 100% trace is usually waste. Sampling of 1% to 5% for normal traces, 100% for traces containing error, 100% for specific critical endpoints. Reduces volume by an order of magnitude without losing the signal that matters.",[19,17972,17974],{"id":17973},"honest-setup-steps-without-copy-paste","Honest setup: steps without copy-paste",[12,17976,17977],{},"The difference between blog tutorial and real operation is knowledge of where steps break. Here goes the sequence that works, with the real pitfalls.",[12,17979,17980,17983],{},[27,17981,17982],{},"Step 1: Prometheus in container."," Spin up Prometheus pointing the scrape config to nodes running node_exporter. Each node needs node_exporter running — also in container, port 9100. Initial configuration is twenty lines of YAML. Pitfall: dynamic service discovery requires integration with the truth source of hosts. 
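The sampling policy from the architecture section — a few percent baseline, everything with errors, everything on critical paths — fits in a dozen lines. A minimal sketch in Python; the endpoint list and the rate are illustrative values, not defaults of any tool:

```python
import random

# Endpoints whose traces are always kept (illustrative list)
CRITICAL_ENDPOINTS = {"/checkout", "/payment"}

def keep_trace(endpoint: str, has_error: bool, rate: float = 0.05) -> bool:
    """Sampling decision: keep every error trace and every critical
    endpoint, keep ~5% of everything else."""
    if has_error or endpoint in CRITICAL_ENDPOINTS:
        return True
    return random.random() < rate
```

One caveat: keeping 100% of error traces strictly requires tail sampling — the error is only known when the trace completes — which the OpenTelemetry Collector supports; head sampling like the sketch above approximates it when the application flags the error before the root span closes.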
For a small cluster, a static list is enough; for a cluster that grows, integrate with the orchestrator API.",[12,17985,17986,17989],{},[27,17987,17988],{},"Step 2: Grafana in container."," Add Prometheus as a datasource, import three to five ready-made dashboards from Grafana Marketplace — node_exporter full, container metrics, blackbox uptime are good starting points. In fifteen minutes you have dashboards better than many Datadog setups I've seen in production.",[12,17991,17992,17995],{},[27,17993,17994],{},"Step 3: Loki plus Promtail (or unified Grafana Agent) on each node."," Promtail reads local logs and pushes to Loki. Minimum configuration is about thirty lines — define log paths, labels, and the Loki endpoint. Pitfall: application log output in a free-form format forces you to write parsing regexes. It's worth investing an afternoon to standardize logs as structured JSON before configuring parsing.",[12,17997,17998,18001],{},[27,17999,18000],{},"Step 4: OpenTelemetry SDK in the application."," Each language has its official SDK. You initialize it at application bootstrap, define the Tempo (or SigNoz collector) endpoint, and gain automatic distributed tracing for HTTP, database, cache. Adding custom spans at critical points is trivial.",[12,18003,18004,18007],{},[27,18005,18006],{},"Step 5: Alertmanager."," Receives alert rules from Prometheus and routes them to Slack, email, PagerDuty or a Discord webhook. Classic pitfall: in the first month you'll have alert fatigue from poorly calibrated thresholds. Reserve one hour per week in the first two months to refine rules.",[12,18009,18010],{},"Total time for someone without prior experience: four to eight hours to have the entire stack functional, plus two to three afternoons refining dashboards and alerts over the next two weeks. At R$200 per engineering hour, the total investment is R$1.2k to R$2.5k. It replaces R$1k to R$2k per month of Datadog indefinitely.
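For the static list from Step 1, the whole scrape config really is about twenty lines. A sketch with hypothetical hostnames:

```yaml
# prometheus.yml — minimal static setup for a small cluster
global:
  scrape_interval: 15s

scrape_configs:
  - job_name: "node"
    static_configs:
      - targets:
          - "vps-app-1:9100"   # node_exporter on each machine
          - "vps-app-2:9100"
          - "vps-db-1:9100"
  - job_name: "prometheus"
    static_configs:
      - targets: ["localhost:9090"]  # Prometheus scrapes itself
```

Adding a machine is adding a line; that is exactly the point where a growing cluster pushes you toward service discovery.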
Payback in one to two months.",[19,18012,18014],{"id":18013},"where-self-hosted-falls-short","Where self-hosted falls short",[12,18016,18017],{},"Honesty here is the test of who's selling the alternative in good faith versus who's selling the simplified version of reality.",[12,18019,18020,18023],{},[27,18021,18022],{},"Deep Database Monitoring."," Datadog DBM has detailed visibility into Postgres and Redis, with execution plan per query, lock waits, slow query analysis. postgres_exporter covers basic health metrics — connections, transactions, replication, cache hit ratio. Deep slow query analysis in open source requires pgBadger or manual scraping of pg_stat_statements, with much more work than clicking \"Enable DBM\" in Datadog.",[12,18025,18026,18029],{},[27,18027,18028],{},"Real User Monitoring."," Datadog RUM measures load time perceived by real users, distributed by geography, browser, device. The combination of Sentry with Plausible covers part of the space, but with gaps. If detailed RUM is a central part of the product strategy, Datadog wins today.",[12,18031,18032,18035],{},[27,18033,18034],{},"Network Performance Monitoring."," Datadog NPM has packet visibility in complex networks, especially useful in architectures crossing multiple zones. There's no practical self-hosted equivalent for the general case.",[12,18037,18038,18041],{},[27,18039,18040],{},"Global synthetic monitoring."," Datadog runs checks from over thirty regions. Self-hosted requires you to run checks from multiple regions — viable but laborious. Checkly covers the gap with an accessible intermediate tier.",[12,18043,18044],{},"Summary: 95% of observability cases that a startup needs are covered. 
The 5% that's left out are enterprise features rarely used in a startup.",[19,18046,18048],{"id":18047},"concrete-cost-compared","Concrete cost compared",[12,18050,18051],{},"Worth doing the spreadsheet in reais, with numbers you can reproduce.",[12,18053,18054],{},"Datadog on five hosts, with APM Pro, 100 GB of logs per month, 30 custom metrics and active RUM: about US$400 per month, or R$2k at R$5 per dollar.",[12,18056,18057],{},"Self-hosted stack on a dedicated VPS with 4 GB of RAM (R$80 per month at most Brazilian providers), plus log storage in S3-compatible (R$30 per month for 150 GB on R2 or B2), plus estimated maintenance time value (two hours per month at R$200 per hour, R$400 per month): R$510 per month.",[12,18059,18060],{},"Monthly difference: R$1,490. Annual difference: R$17,880. In three years, R$53k — equivalent to two months salary of a senior person, or to the cost of acquiring a medium customer in B2B sales.",[12,18062,18063],{},"Important: maintenance time is a pessimistic estimate. Teams that standardize the setup typically spend less than one hour per month after the initial investment. In three years, maintenance compounds but doesn't become a continuous project.",[19,18065,18067],{"id":18066},"how-heroctl-fits","How HeroCtl fits",[12,18069,18070],{},"The orchestrator exposes cluster metrics in Prometheus format by default. There's no proprietary agent to install on each server — the cluster exposes aggregate on a single endpoint, and Prometheus scrapes directly.",[12,18072,18073],{},"Logs follow a single embedded writer architecture. Instead of each container producing log that needs to be collected by an agent on each node, the cluster centralizes capture and exposes a query interface. That reduces operational overhead — you don't assemble an agent on each server.",[12,18075,18076],{},"The OSS stack (Prometheus, Grafana, Loki, Tempo, Sentry) runs as jobs on the cluster itself. 
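The spreadsheet above is easy to keep honest in code. Values are the ones from this section, in BRL:

```python
# Monthly costs from the comparison (BRL)
datadog = 2000            # ~US$400 at R$5 per dollar
vps = 80                  # dedicated 4 GB observability VPS
storage = 30              # 150 GB of logs on R2/B2
maintenance = 2 * 200     # two hours/month at R$200/hour

self_hosted = vps + storage + maintenance   # 510
monthly_diff = datadog - self_hosted        # 1490

print(monthly_diff)       # 1490 per month
print(monthly_diff * 12)  # 17880 per year
print(monthly_diff * 36)  # 53640 over three years, the ~R$53k
```

Swap in your own provider prices and hourly rate; the structure of the calculation is the argument.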
You submit the Prometheus manifest like any other service, and the orchestrator handles health check, restart, rolling update deploy and routing. Additional operational overhead: zero.",[12,18078,18079],{},"For a startup already running HeroCtl, enabling complete observability is an afternoon. The cluster already provides all the plumbing — only deciding the dashboards is left.",[19,18081,18083],{"id":18082},"comparison-datadog-vs-new-relic-vs-self-hosted-oss-stack","Comparison: Datadog vs New Relic vs self-hosted OSS Stack",[119,18085,18086,18101],{},[122,18087,18088],{},[125,18089,18090,18092,18095,18098],{},[128,18091,2982],{},[128,18093,18094],{},"Datadog",[128,18096,18097],{},"New Relic",[128,18099,18100],{},"Self-hosted OSS Stack",[141,18102,18103,18117,18130,18143,18156,18168,18182,18196,18208,18220],{},[125,18104,18105,18108,18111,18114],{},[146,18106,18107],{},"Monthly cost for 5 hosts",[146,18109,18110],{},"R$1k-2k",[146,18112,18113],{},"R$800-1.5k",[146,18115,18116],{},"R$80-510",[125,18118,18119,18121,18124,18127],{},[146,18120,17883],{},[146,18122,18123],{},"Excellent, 1-click integrations",[146,18125,18126],{},"Good, strong integrations",[146,18128,18129],{},"Prometheus + Grafana, exporters per target",[125,18131,18132,18134,18137,18140],{},[146,18133,17889],{},[146,18135,18136],{},"Excellent, rich search",[146,18138,18139],{},"Good, rich search",[146,18141,18142],{},"Loki, search by label",[125,18144,18145,18147,18150,18153],{},[146,18146,17901],{},[146,18148,18149],{},"Market-leading depth",[146,18151,18152],{},"Close to Datadog",[146,18154,18155],{},"SigNoz\u002FTempo, 80% of the shine",[125,18157,18158,18160,18163,18165],{},[146,18159,17895],{},[146,18161,18162],{},"Advanced sampling",[146,18164,18162],{},[146,18166,18167],{},"OpenTelemetry, configurable sampling",[125,18169,18170,18173,18176,18179],{},[146,18171,18172],{},"Alerting",[146,18174,18175],{},"Anomaly detection, seasonality",[146,18177,18178],{},"Anomaly 
detection",[146,18180,18181],{},"Threshold + Alertmanager (no AI)",[125,18183,18184,18187,18190,18193],{},[146,18185,18186],{},"Integrations",[146,18188,18189],{},"600+ ready",[146,18191,18192],{},"400+ ready",[146,18194,18195],{},"100+ community exporters",[125,18197,18198,18200,18203,18205],{},[146,18199,17559],{},[146,18201,18202],{},"Low (button on)",[146,18204,18202],{},[146,18206,18207],{},"Medium (config + maintenance)",[125,18209,18210,18212,18215,18217],{},[146,18211,7154],{},[146,18213,18214],{},"High (proprietary format)",[146,18216,18214],{},[146,18218,18219],{},"Zero (open formats)",[125,18221,18222,18224,18227,18230],{},[146,18223,13533],{},[146,18225,18226],{},"Series B+ with revenue",[146,18228,18229],{},"Series A-B with revenue",[146,18231,18232],{},"Bootstrapped, seed, Series A",[12,18234,18235],{},"The last column is what matters for a Brazilian startup. Zero lock-in means that if the OSS stack stops serving, you migrate the dashboards and rules with contained investment — open format runs anywhere.",[19,18237,18239],{"id":18238},"when-to-stay-on-datadog","When to stay on Datadog",[12,18241,18242],{},"Honesty requires pointing out when the alternative doesn't pay off.",[12,18244,18245,18248],{},[27,18246,18247],{},"Series B or larger company with revenue justifying."," Above US$5 million ARR, R$2k per month disappears in the budget. The time you save not assembling a stack is worth more than the cash.",[12,18250,18251,18254],{},[27,18252,18253],{},"Compliance that requires SOC2 or ISO certified vendor nominally."," Some frameworks list pre-approved tools. If you need the name Datadog or New Relic on an audit list, the alternative doesn't fit.",[12,18256,18257,18260],{},[27,18258,18259],{},"Team without capacity to set up stack."," If the engineering team has three people focused on product and zero on infra, setting up Prometheus plus Grafana plus Loki is a four-to-eight-hour distraction the team doesn't have. 
Datadog free tier or New Relic free tier solve the start.",[12,18262,18263,18266],{},[27,18264,18265],{},"Need for enterprise-grade NPM or DBM."," For the 5% of cases where Datadog has irreplaceable feature, staying with it is the correct technical decision.",[19,18268,5250],{"id":5249},[12,18270,18271,18274],{},[27,18272,18273],{},"Can I use Datadog free tier?","\nYes, and it makes sense to start. Five hosts, short retention, no APM, no advanced logs. Works for a two-person team validating an idea. Migration starts when the free tier ends and the cost estimate appears — generally between six and twelve months later.",[12,18276,18277,18280],{},[27,18278,18279],{},"Is Grafana Cloud a good intermediate alternative?","\nIt is. Grafana Cloud free tier offers 10k series, 50 GB of logs, 50 GB of traces. Paid starts at US$8 per month with reasonable volume. Covers the range between \"Datadog is too expensive\" and \"self-hosting is work\". Trade-off is moderate lock-in — formats are open, but you don't control retention and costs are on another spreadsheet.",[12,18282,18283,18286],{},[27,18284,18285],{},"How much does S3-compatible log storage cost in Brazil?","\nCloudflare R2 charges US$0.015 per GB per month, no egress fee. Backblaze B2 charges US$0.005 per GB per month with US$0.01 per GB egress. For 150 GB on R2: US$2.25 per month, or R$11. For 1 TB on B2: US$5 per month plus egress as used. In both cases, the cost is negligible.",[12,18288,18289,18292],{},[27,18290,18291],{},"OpenTelemetry vs StatsD?","\nOpenTelemetry is the current standard and covers metrics, traces and logs. StatsD was the 2010s standard for metrics, still exists, but it's narrow. 
If you're starting, go straight on OpenTelemetry — all modern SDKs support, all modern backends support, and the investment of learning is worth it for years.",[12,18294,18295,18298],{},[27,18296,18297],{},"Is Sentry worth self-hosting?","\nFor a small team, GlitchTip resolves with less overhead — simple installation, same API as Sentry, drop-in compatible with SDKs. For a team that needs advanced features (Performance, Profiling, Replay), self-hosted Sentry is worth the work of setting up Docker Compose. Sentry SaaS free tier is generous and covers the start.",[12,18300,18301,18304],{},[27,18302,18303],{},"How much does the OSS stack consume in RAM and CPU?","\nFor five to ten monitored nodes: Prometheus 1.5 GB of RAM, Grafana 200 MB, Loki 1 GB, Tempo 500 MB. Total around 3.5 GB. Average CPU is low — peak in scrapes of 5 to 10% of one vCPU. Fits in 4 GB VPS with room to spare.",[12,18306,18307,18310],{},[27,18308,18309],{},"Does HeroCtl have ready dashboards?","\nYes. The cluster exposes metrics in Prometheus format, and the embedded administration panel includes basic dashboards per job — CPU usage, memory, replica status, health check latency. For more elaborate dashboards, spin up Grafana as a job in the cluster itself and point to the control plane metrics endpoint.",[19,18312,3309],{"id":3308},[12,18314,18315],{},"The difference between R$2k and R$500 per month isn't detail — it's R$18k per year. For a startup in validation stage, it's what separates hiring an additional person and staying with the current team. For a startup in growth stage, it's the margin that justifies investing in product instead of vendor.",[12,18317,18318],{},"The choice isn't \"Datadog or nothing\". It's \"which tool serves the company's current phase\". In early phase, the self-hosted OSS stack wins on cost with functional parity. In late phase, Datadog wins on productivity with absorbed cost. 
The common error is to keep paying Datadog because it was never reassessed — annual stack audit is mature company practice, even among those who choose to keep paying.",[12,18320,18321],{},"If you run HeroCtl, the OSS stack comes up as a regular job in the cluster. Without extra agent, without infra provisioner, without third vendor. The budget that's left goes to the next hired engineer.",[224,18323,18325],{"className":18324,"code":5318,"language":2529},[2527],[231,18326,5318],{"__ignoreMap":229},[12,18328,18329,18330,2402,18333,101],{},"To continue reading: ",[3336,18331,18332],{"href":6337},"How much it costs to host SaaS in Brazil in 2026",[3336,18334,7462],{"href":7461},{"title":229,"searchDepth":244,"depth":244,"links":18336},[18337,18338,18339,18340,18341,18342,18343,18344,18345,18346,18347,18348,18349],{"id":17818,"depth":244,"text":17819},{"id":17840,"depth":244,"text":17841},{"id":17874,"depth":244,"text":17875},{"id":17926,"depth":244,"text":17927},{"id":17942,"depth":244,"text":17943},{"id":17973,"depth":244,"text":17974},{"id":18013,"depth":244,"text":18014},{"id":18047,"depth":244,"text":18048},{"id":18066,"depth":244,"text":18067},{"id":18082,"depth":244,"text":18083},{"id":18238,"depth":244,"text":18239},{"id":5249,"depth":244,"text":5250},{"id":3308,"depth":244,"text":3309},"2026-04-08","Datadog charges US$15-31\u002Fhost\u002Fmonth. For a startup with 5 servers, that's R$1k\u002Fmonth just on monitoring. 
The self-hosted stack reaches the same place for R$50.",{},{"title":17804,"description":18351},{"loc":11719},"en\u002Fblog\u002Fobservability-without-datadog-startup-stack",[18357,18358,11782,18359,6394],"observability","datadog","open-source","McHaAwonryTCDc0ZTgKeBncvZXaq9eF8-XLiwfzxUNk",{"id":18362,"title":18363,"author":7,"body":18364,"category":3378,"cover":3379,"date":19090,"description":19091,"draft":3382,"extension":3383,"lastReviewed":3379,"meta":19092,"navigation":411,"path":4368,"readingTime":4401,"seo":19093,"sitemap":19094,"stem":19095,"tags":19096,"__hash__":19099},"blog_en\u002Fen\u002Fblog\u002Fmulti-tenant-saas-real-isolation.md","Multi-tenant SaaS with real isolation: 3 patterns and when each one becomes a nightmare",{"type":9,"value":18365,"toc":19077},[18366,18373,18376,18380,18383,18389,18395,18401,18407,18414,18418,18425,18436,18441,18466,18471,18500,18506,18510,18513,18531,18536,18561,18566,18589,18595,18599,18609,18616,18619,18624,18649,18654,18671,18677,18681,18841,18844,18848,18851,18857,18863,18869,18875,18878,18882,18885,18891,18903,18912,18918,18921,18925,18928,18934,18940,18946,18952,18961,18965,18968,18974,18984,18990,18993,18995,19001,19011,19017,19023,19038,19044,19050,19052,19055,19058,19061,19066,19069],[12,18367,18368,18369,18372],{},"The first question from a serious B2B buyer, after the product passes the demo and before legal enters the room, is always the same: \"is my data isolated from other customers?\". If your answer is \"ah, we filter by ",[231,18370,18371],{},"tenant_id"," in each query\", the contract just turned to smoke. The buyer isn't asking for a technical justification — they're asking for a guarantee that survives an audit, an incident, and a junior dev rotation.",[12,18374,18375],{},"There are three real multi-tenant isolation patterns in B2B SaaS. Each one has obvious benefits in the presentation and invisible costs in operation. 
This post maps the three, shows exactly when each one becomes a nightmare, and explains the typical journey of a Brazilian startup that starts with fifty small customers and ends serving a regulated bank.",[19,18377,18379],{"id":18378},"why-this-matters-now-not-next-year","Why this matters now — not next year",[12,18381,18382],{},"Multi-tenancy is an architectural decision that seems postponable until the exact moment when it stops being. Four forces are pushing this decision to the front of the roadmap of Brazilian B2B startups in 2026:",[12,18384,18385,18388],{},[27,18386,18387],{},"LGPD became a practical requirement, not just legal."," The law has been in effect since 2020, but corporate DPOs only started asking for operational evidence in the last two years. The question stopped being \"are you compliant?\" and became \"how do you demonstrate adequate handling of personal data?\". Demonstration requires visible separation, access log, and clear deletion process. Pure pool makes all of that more difficult — not impossible, but harder to justify to an auditor who has never seen your architecture.",[12,18390,18391,18394],{},[27,18392,18393],{},"Large B2B customer requires isolation as a contract prerequisite."," That's the point of the opening paragraph. If your pipeline has a R$50k per month proposal with a regional logistics operation, it's practically certain their information security department will send a one-hundred-eighty-question questionnaire. Almost half the questions are about isolation. Answering \"we share a database with logical filters\" delivers the deal to the competitor that answered \"dedicated schema per customer\" — even if the real difference, in terms of risk, is small.",[12,18396,18397,18400],{},[27,18398,18399],{},"Sectoral compliance may require physical isolation."," Health (LGPD + CFM), financial (Bacen, CVM), private basic education (LGPD + ECA), insurance (SUSEP). 
Regulated sectors occasionally list \"data segregation per customer in physical layer\" as required control. When that appears, schema-per-tenant solves with some effort; pool doesn't solve it.",[12,18402,18403,18406],{},[27,18404,18405],{},"One customer will become ten times bigger than the others."," B2B SaaS usage distribution follows power law. Your largest customer will consume more resources than the fifty smaller ones combined. In pure pool, that customer degrades the experience of everyone else — and you can't charge more for it without a billing model that prices usage, which nobody wants to build before necessary.",[12,18408,18409,18410,18413],{},"And the fifth force, perhaps the most important: ",[27,18411,18412],{},"migrating from one pattern to another after having customers in production is a project of months, not days."," You'll choose wrong on some axis — everyone does — but choosing consciously makes the difference between \"six-week refactor\" and \"six-month rewrite\".",[19,18415,18417],{"id":18416},"pattern-1-pool-shares-everything","Pattern 1 — Pool (shares everything)",[12,18419,18420,18421,18424],{},"Simplest possible setup. One database, one application instance, one infrastructure stack. All customers live in the same tables. Each application query has an additional ",[231,18422,18423],{},"WHERE tenant_id = ?"," filter injected by middleware before reaching the database. Postgres Row-Level Security (RLS) enforces this filter at the database level — a second layer that fires even if the application middleware forgets.",[12,18426,18427,18428,18431,18432,18435],{},"New customer onboarding is literally an ",[231,18429,18430],{},"INSERT"," in the ",[231,18433,18434],{},"tenants"," table. Thirty milliseconds later the customer already has a functional environment. 
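The RLS safety belt mentioned above is a handful of statements. A sketch for a hypothetical shared `orders` table; the session variable name `app.tenant_id` is a convention, not a Postgres built-in:

```sql
ALTER TABLE orders ENABLE ROW LEVEL SECURITY;
ALTER TABLE orders FORCE ROW LEVEL SECURITY;  -- binds even the table owner

-- Every statement on orders is filtered by the tenant set on the session
CREATE POLICY tenant_isolation ON orders
  USING (tenant_id = current_setting('app.tenant_id')::bigint);

-- The middleware runs this once per connection checkout:
-- SET app.tenant_id = '42';
-- A query that forgets WHERE tenant_id now still sees only tenant 42.
```

The discipline cost is real, though: the policy must exist on every tenant-scoped table, and tests must cover every database role that touches them.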
The marginal cost of adding a tenant is practically zero — you're just using more rows in the same database, more bytes in the same cache.",[12,18437,18438],{},[27,18439,18440],{},"Pool's strong points:",[2734,18442,18443,18446,18449,18452,18459],{},[70,18444,18445],{},"Low operational cost. One database to backup, one stack to monitor, one application version running.",[70,18447,18448],{},"Instant onboarding. Credit card signup flow is trivial.",[70,18450,18451],{},"Sublinear scale. A thousand customers don't cost a thousand times more than one.",[70,18453,18454,18455,18458],{},"Cross-tenant analytics are natural. Internal \"monthly active users\" dashboard is a ",[231,18456,18457],{},"SELECT COUNT(*)"," without gymnastics.",[70,18460,18461,18462,18465],{},"Migrations are simple. One ",[231,18463,18464],{},"ALTER TABLE"," applies to all customers at once.",[12,18467,18468],{},[27,18469,18470],{},"Pool's weak points:",[2734,18472,18473,18484,18487,18490,18493],{},[70,18474,18475,18476,18479,18480,18483],{},"SQL bug is a critical incident. A forgotten ",[231,18477,18478],{},"WHERE",", a ",[231,18481,18482],{},"JOIN"," that leaks to another context, a poorly written migration — and customer A's data appears on customer B's screen. That incident has already killed companies (literally: contracts cancelled en masse, irreversible loss of trust). Postgres RLS is the safety belt that drastically reduces that risk, but requires discipline to configure well and test all roles.",[70,18485,18486],{},"Noisy neighbor. A customer that fires a heavy report at 2 PM on Tuesday consumes the connection pool and degrades latency for everyone else. You can add per-tenant query limits, but that's additional work.",[70,18488,18489],{},"Backup is all-or-nothing. Restoring a specific customer after a destructive operation requires snapshot of the entire database, restore in parallel environment, and selective export-import. 
A tedious operation of one to four hours.",[70,18491,18492],{},"Compliance requiring physical separation doesn't fit. If the customer asks \"where physically does my data live?\", the answer is \"in the same data file as all other customers' data\" — a truth that drives away certain customer profiles.",[70,18494,18495,18496,18499],{},"Customization becomes a nullable column. Customer Y needs an extra field. You add it. Another customer doesn't use that field. In six months no one remembers what that ",[231,18497,18498],{},"extra_data_3"," column is for. That accumulation is one of the most predictable symptoms of a mature pool.",[12,18501,18502,18505],{},[27,18503,18504],{},"When pool makes sense:"," SMB B2B SaaS (small and medium customers), tenants relatively similar in usage, low regulatory risk, small engineering team (three to eight people), product still searching for product-market fit. Practically every SaaS starts here — and rightly so. The mistake isn't to start with pool; it's not knowing when to leave.",[19,18507,18509],{"id":18508},"pattern-2-schema-per-tenant-shared-database-separate-schemas","Pattern 2 — Schema-per-tenant (shared database, separate schemas)",[12,18511,18512],{},"Architectural middle ground. You still have one database instance — one Postgres instance running, with its parameters, connections, replication. But inside it, each customer has their own schema. Postgres calls it a schema; MySQL calls it a database; the concept is the same: a named namespace inside the server, with its own tables, indexes, and privileges.",[12,18514,18515,18516,18519,18520,18523,18524,18526,18527,18530],{},"The application selects the correct schema via ",[231,18517,18518],{},"SET search_path TO tenant_acme"," (or equivalent) at the beginning of each connection, based on which tenant is making the request.
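Tenant onboarding in this pattern is schema cloning plus privileges. A sketch with illustrative names, assuming the reference structure lives in `public`:

```sql
CREATE SCHEMA tenant_acme;

-- Copy structure (columns, indexes, defaults) from the reference tables
CREATE TABLE tenant_acme.users  (LIKE public.users  INCLUDING ALL);
CREATE TABLE tenant_acme.orders (LIKE public.orders INCLUDING ALL);

-- A dedicated role keeps privileges circumscribed to the schema
CREATE ROLE acme_app LOGIN;
GRANT USAGE ON SCHEMA tenant_acme TO acme_app;
GRANT ALL ON ALL TABLES IN SCHEMA tenant_acme TO acme_app;
```

Real setups template this with a migration tool rather than hand-written DDL, but the shape is the same: one schema, one role, structure copied from a reference.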
Tables exist with the same structure in all schemas, but are physically separated: schema ",[231,18521,18522],{},"tenant_acme"," has its own ",[231,18525,4449],{}," rows, ",[231,18528,18529],{},"tenant_xyz"," has its own, and queries inside one schema don't see the other without explicit privilege.",[12,18532,18533],{},[27,18534,18535],{},"Schema-per-tenant's strong points:",[2734,18537,18538,18545,18552,18555,18558],{},[70,18539,18540,18541,18544],{},"Strong data isolation by default. No ",[231,18542,18543],{},"WHERE tenant_id"," anywhere — tables are physically separated. A SQL bug stays circumscribed to the current schema.",[70,18546,18547,18548,18551],{},"Per-tenant backup is practical. ",[231,18549,18550],{},"pg_dump --schema=tenant_acme"," exports only that customer. Restoring is the same: bring up a parallel environment, import the schema, and move the specific data.",[70,18553,18554],{},"Resource quotas per schema. Postgres allows limiting connections per role, and roles can be tied to schemas. You can guarantee that a large tenant doesn't consume all database connections.",[70,18556,18557],{},"Clean customization. Customer Y needs an extra field? Add it in their schema only. Other customers don't even know that field exists. The base schema stays clean.",[70,18559,18560],{},"Demonstration of separation becomes obvious to an audit. \"Each customer has their own database namespace with isolated privileges\" is an answer that satisfies most security questionnaires.",[12,18562,18563],{},[27,18564,18565],{},"Schema-per-tenant's weak points:",[2734,18567,18568,18571,18578,18583,18586],{},[70,18569,18570],{},"Migrations multiply by N. A ten-minute migration with a thousand schemas becomes one hundred sixty hours of database work if it runs serially.
You need careful parallelism, migration scripts that know the schema set, and a planned maintenance window — or a non-blocking migration strategy that works schema-by-schema.",[70,18572,18573,18574,18577],{},"Connection pooling gets complicated. If each connection needs ",[231,18575,18576],{},"SET search_path"," per tenant, simple pgBouncer doesn't work — it reuses connections between different customers. Solutions: a pool per schema (which stops scaling as the schema count grows), session-mode pooling (slower), or application middleware that manages the reset.",[70,18579,18580,18581,101],{},"Cross-tenant analytics get expensive. To answer \"how many active users do I have across all customers?\" you need to union a thousand tables. The real solution: daily ETL to a separate warehouse (ClickHouse, BigQuery), with a denormalized ",[231,18582,18371],{},[70,18584,18585],{},"A bug in the switching code is still a risk. If the middleware selects the wrong schema due to a session-leakage bug between requests, you have the same type of leak that pool has. Less common, but possible.",[70,18587,18588],{},"Practical schema limit. Postgres handles tens of thousands of schemas, but the database catalog gets heavy at some point — slow listings, autovacuum competing. Companies running over five thousand schemas in a single instance report pain.",[12,18590,18591,18594],{},[27,18592,18593],{},"When schema-per-tenant makes sense:"," mid-market B2B SaaS, ten to a thousand customers, some high-value customers that justify customization, moderate compliance. It's the \"intermediate\" pattern in the literal sense — you trade some of pool's operational simplicity for stronger isolation and customization flexibility.",[19,18596,18598],{"id":18597},"pattern-3-app-per-tenant-complete-silo","Pattern 3 — App-per-tenant (complete silo)",[12,18600,18601,18602,571,18605,18608],{},"Each customer receives a dedicated instance of everything: application, database, cache, job queue, scheduler.
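The N-schema migration problem reduces to a bounded-parallelism map with failure tracking. A sketch; `run_migration` is a stand-in for "open a connection, SET search_path, apply the DDL":

```python
from concurrent.futures import ThreadPoolExecutor

def migrate_all_schemas(schemas, run_migration, parallelism=8):
    """Apply one migration to every tenant schema, `parallelism` at a time.
    Returns the schemas that failed, so the run can be resumed."""
    failed = []
    with ThreadPoolExecutor(max_workers=parallelism) as pool:
        # pool.map preserves input order, so zip pairs each schema
        # with its own result
        for schema, ok in zip(schemas, pool.map(run_migration, schemas)):
            if not ok:
                failed.append(schema)
    return failed
```

Eight workers cut the one hundred sixty serial hours to roughly twenty — still a maintenance window, which is why the non-blocking, schema-by-schema strategies mentioned above exist.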
All they share is the physical infrastructure: the cluster of machines where containers run. But each workload has its own database with its own data, its own URL (",[231,18603,18604],{},"acme.app.com",[231,18606,18607],{},"customer-xyz.app.com","), and potentially its own version of the application.",[12,18610,18611,18612,18615],{},"Serious implementation of this pattern requires an orchestrator. Without orchestration, provisioning a new customer means manually creating a virtual machine, running database setup, deploying the application, configuring DNS, and issuing a TLS certificate: an hours-long operation nobody will tolerate repeating twenty times a month. With an orchestrator, that's a parameterized job: you trigger a definition that says \"new tenant ",[231,18613,18614],{},"acme",", enterprise plan, isolated database, automatic certificate\", and the cluster allocates, configures, and fires up in one to three minutes.",[12,18617,18618],{},"Kubernetes does that with namespaces and Helm. HeroCtl does it with job templates. Other orchestrators do it with their own primitives. What matters is that the time to onboard a new customer in this architecture (minutes, not seconds like pool) doesn't become human pain, because it's automated.",[12,18620,18621],{},[27,18622,18623],{},"App-per-tenant's strong points:",[2734,18625,18626,18629,18640,18643,18646],{},[70,18627,18628],{},"Maximum isolation. There's no shared code querying data from more than one customer — physically impossible. A SQL bug affects only that instance's customer.",[70,18630,18631,18632,18635,18636,18639],{},"Total customization. Customer A can run version ",[231,18633,18634],{},"2.4"," of the application, customer B version ",[231,18637,18638],{},"2.5",". Useful for gradual release tests, or to serve customers who asked for a specific patch.",[70,18641,18642],{},"Isolated failure. If customer A's database is corrupted, customer B doesn't even notice. 
Customer A has an outage; customer B doesn't.",[70,18644,18645],{},"Heavy compliance becomes feasible. FedRAMP, HIPAA with strict multi-tenant requirements, contracts with a \"dedicated infrastructure\" clause — all pass.",[70,18647,18648],{},"Regional deploys per customer. Brazilian customer with a national-territory data requirement? Run in a São Paulo datacenter. European customer? Frankfurt. The \"run the tenant where they need to be\" primitive starts to exist.",[12,18650,18651],{},[27,18652,18653],{},"App-per-tenant's weak points:",[2734,18655,18656,18659,18662,18665,18668],{},[70,18657,18658],{},"Cost scales linearly. A thousand small customers cost roughly a thousand times more than one customer. No pooling gain. For low-ticket customers, the margin disappears.",[70,18660,18661],{},"Onboarding takes minutes, not seconds. That can be unacceptable for self-service models with credit card signup. It works for assisted sales models where onboarding is a process, not a buying flow.",[70,18663,18664],{},"Operations multiply by N. Each database needs backup, each application needs monitoring, each deploy needs validation. Without centralized orchestration tools, it becomes unfeasible at two dozen customers.",[70,18666,18667],{},"Cross-tenant analytics are expensive. Worse than schema-per-tenant — you have to sync data from completely separate databases. ETL to a common warehouse is even more necessary.",[70,18669,18670],{},"Minimum infrastructure cost per tenant. Each dedicated Postgres has an overhead of two hundred to five hundred megabytes of RAM even when idle. Each Go or Node application, another hundred to two hundred megabytes. The spending floor is real.",[12,18672,18673,18676],{},[27,18674,18675],{},"When app-per-tenant makes sense:"," enterprise SaaS, high revenue per customer (R$10k\u002Fmonth per customer and up is a comfortable reference), demanding compliance, customer customization as a competitive differentiator. 
It also works in contexts of fifty to a thousand customers where the average ticket sustains the cost. Companies that sell self-service on Stripe and charge R$99\u002Fmonth per user don't fit here; the economics don't work.",[19,18678,18680],{"id":18679},"comparative-table","Comparative table",[119,18682,18683,18698],{},[122,18684,18685],{},[125,18686,18687,18689,18692,18695],{},[128,18688,2982],{},[128,18690,18691],{},"Pool",[128,18693,18694],{},"Schema-per-tenant",[128,18696,18697],{},"App-per-tenant",[141,18699,18700,18714,18728,18742,18755,18772,18786,18800,18814,18828],{},[125,18701,18702,18705,18708,18711],{},[146,18703,18704],{},"Cost per tenant",[146,18706,18707],{},"Sublinear (almost zero additional)",[146,18709,18710],{},"Almost linear (small overhead)",[146,18712,18713],{},"Linear (dedicated instance)",[125,18715,18716,18719,18722,18725],{},[146,18717,18718],{},"Onboarding time",[146,18720,18721],{},"Seconds (INSERT)",[146,18723,18724],{},"Seconds to minutes (CREATE SCHEMA + migrate)",[146,18726,18727],{},"Minutes (provision stack)",[125,18729,18730,18733,18736,18739],{},[146,18731,18732],{},"Performance overhead",[146,18734,18735],{},"None (shares cache, etc)",[146,18737,18738],{},"Small (more relations in catalog)",[146,18740,18741],{},"High (overhead per instance)",[125,18743,18744,18747,18750,18753],{},[146,18745,18746],{},"Risk of leak from bug",[146,18748,18749],{},"High (mitigated by RLS)",[146,18751,18752],{},"Medium (mitigated by search_path)",[146,18754,8341],{},[125,18756,18757,18760,18763,18769],{},[146,18758,18759],{},"Per-tenant backup",[146,18761,18762],{},"Hard (full snapshot)",[146,18764,18765,18766,16047],{},"Easy (",[231,18767,18768],{},"pg_dump --schema",[146,18770,18771],{},"Trivial (dedicated backup)",[125,18773,18774,18777,18780,18783],{},[146,18775,18776],{},"Customer customization",[146,18778,18779],{},"Expensive (nullable columns)",[146,18781,18782],{},"Good (extra fields in schema)",[146,18784,18785],{},"Total (own app 
version)",[125,18787,18788,18791,18794,18797],{},[146,18789,18790],{},"Enterprise compliance",[146,18792,18793],{},"Hard to demonstrate",[146,18795,18796],{},"Demonstrable",[146,18798,18799],{},"Strong by construction",[125,18801,18802,18805,18808,18811],{},[146,18803,18804],{},"Ideal tenant range",[146,18806,18807],{},"1 to 10k",[146,18809,18810],{},"10 to 5k",[146,18812,18813],{},"10 to 1,000",[125,18815,18816,18819,18822,18825],{},[146,18817,18818],{},"Cross-tenant analytics",[146,18820,18821],{},"Trivial (one query)",[146,18823,18824],{},"Heavy (UNION N tables or ETL)",[146,18826,18827],{},"Heavy (mandatory ETL)",[125,18829,18830,18832,18835,18838],{},[146,18831,16398],{},[146,18833,18834],{},"2 to 5 devs",[146,18836,18837],{},"4 to 10 devs with basic infra",[146,18839,18840],{},"4 to 10 devs with orchestrator",[12,18842,18843],{},"The upper limits of tenant range are approximate — companies have exceeded all of them with effort. The numbers serve as reference for when it starts to hurt.",[19,18845,18847],{"id":18846},"the-typical-brazilian-saas-journey","The typical Brazilian SaaS journey",[12,18849,18850],{},"Most Brazilian B2B SaaS follow a predictable path, and understanding the path helps choose the current stage without underprovisioning or overprovisioning.",[12,18852,18853,18856],{},[27,18854,18855],{},"Stage 1: zero to fifty customers."," Pool is the obvious choice. Small team, low cost, nobody has asked for compliance yet, all customers are similar in usage. Focus on product-market fit — any hour spent with isolation now is an hour stolen from product. Postgres RLS from day one is the minimum defense investment.",[12,18858,18859,18862],{},[27,18860,18861],{},"Stage 2: fifty to five hundred customers, first mid-market B2B customer arrives."," Here it starts to tighten. That customer with one hundred fifty users consumes six times more resources than the others. The security questionnaire arrives with the question about isolation. 
Evaluating schema-per-tenant becomes rational. Hybrid is also an option: pool for the small ones, a dedicated schema for the bigger ones who explicitly asked for it. Migration at this stage is less painful because the base is still manageable.",[12,18864,18865,18868],{},[27,18866,18867],{},"Stage 3: five hundred customers or the first enterprise customer."," Now the decision is structural. Schema-per-tenant for everyone? App-per-tenant for enterprise and schema for the rest? Hybrid with three layers (pool for free, schema for paid, app for enterprise)? The answer depends on the customer mix — companies with a few very large customers tend toward app-per-tenant; companies with a thousand mid-size customers stay on schema-per-tenant.",[12,18870,18871,18874],{},[27,18872,18873],{},"Stage 4: enterprise mode."," App-per-tenant for high-value customers, with schema-per-tenant or pool sustaining the smaller ones. That's the state of companies like Salesforce (which historically did schema-per-tenant at extreme scale), Notion (highly optimized pool), and newer enterprise tools that adopt app-per-tenant from birth.",[12,18876,18877],{},"The transition between stages is where the most expensive engineering of a SaaS career lives. Whoever has been through it knows the smell.",[19,18879,18881],{"id":18880},"how-heroctl-helps-in-stage-3-and-4","How HeroCtl helps in stage 3 and 4",[12,18883,18884],{},"The app-per-tenant model requires a competent orchestrator. It's non-negotiable: without automated provisioning, operational complexity makes the model unfeasible. Here are four primitives an orchestrator needs to deliver for app-per-tenant to work, and how HeroCtl handles each:",[12,18886,18887,18890],{},[27,18888,18889],{},"Parameterized job templates."," You describe \"tenant\" once: which application runs, which database, which ingress, which environment variables, which CPU and memory quota. For each new customer, you only vary the parameters (name, subdomain, plan). 
In HeroCtl, that's a short job spec with variable placeholders.",[12,18892,18893,2577,18896,18899,18900,18902],{},[27,18894,18895],{},"Onboarding API.",[231,18897,18898],{},"POST \u002Fv1\u002Fjobs"," with the new customer's variables. In seconds to a few minutes, the cluster provisions containers, brings up the database, registers it in the internal router, and issues an automatic TLS certificate for ",[231,18901,18604],{},". No manual operation.",[12,18904,18905,18908,18909,18911],{},[27,18906,18907],{},"Integrated subdomain routing."," Each tenant gets their own subdomain with automatic TLS. The orchestrator's internal router resolves ",[231,18910,18604],{}," to the right container without you configuring DNS per customer — a wildcard DNS record points to the cluster, and the orchestrator does the rest.",[12,18913,18914,18917],{},[27,18915,18916],{},"Per-tenant quotas and auditing."," Each job carries resource limits (CPU, RAM, disk). A customer who tries to consume more than the plan allows saturates at their own limit and doesn't affect neighbors. On the Business plan, a detailed log records who deployed which version of which tenant, and when — useful for internal audits and for answering customer questionnaires.",[12,18919,18920],{},"The HeroCtl public cluster runs today on four servers totaling five vCPUs and ten gigabytes of RAM, sustaining multiple sites with automatic TLS. When a coordinator node goes down, the cluster elects another coordinator in about seven seconds — a window short enough that customers don't notice, and an important operational detail for those operating app-per-tenant in production. 
We're not promising magic: we're describing what already runs.",[19,18922,18924],{"id":18923},"five-expensive-errors-in-multi-tenant","Five expensive errors in multi-tenant",[12,18926,18927],{},"Errors that appear with enough frequency to warrant an explicit warning.",[12,18929,18930,18933],{},[27,18931,18932],{},"Sharing schema from day one without RLS."," Pool without Row-Level Security is just one defense layer: the application middleware. One layer fails at some point. RLS is the second layer — cheap to configure, and the difference between an embarrassing incident and a fatal one. Configure it from the start, even if the team thinks it's overkill.",[12,18935,18936,18939],{},[27,18937,18938],{},"Migrating too late from pool to schema."," A company that grew to ten thousand customers in pool and discovers it needs to migrate to schema-per-tenant has a four-to-eight-month project ahead: middleware rewrite, data migration in windows, validation per customer. Whoever migrated at five hundred tenants spent three weeks; whoever migrated at ten thousand spent a quarter.",[12,18941,18942,18945],{},[27,18943,18944],{},"Ad-hoc customization in pool."," Customer Y asks for an extra field. You add it as a nullable column. Within three months, other customers have asked for three more columns. In six months no one understands the table anymore. What seemed like a shortcut becomes debt that pays interest every sprint. Resist that pattern, or accept that you need schema-per-tenant to serve those customizations cleanly.",[12,18947,18948,18951],{},[27,18949,18950],{},"Backup of main database only."," When you leave pool, backup needs to be rethought. A separate schema needs its own deliberate backup. App-per-tenant needs per-database backups. 
Forgetting that and discovering it during an incident is catastrophic — companies have lost a single customer's data because the global backup didn't cover per-tenant databases.",[12,18953,18954,18957,18958,18960],{},[27,18955,18956],{},"Cross-tenant analytics in schema-per-tenant via UNION."," Works with ten customers, gets heavy at a hundred, becomes unfeasible at a thousand. Build an ETL to a separate warehouse early — ClickHouse or BigQuery with denormalized ",[231,18959,18371],{}," is the standard solution. Trying to keep everything in the transactional database is a recipe for a forty-minute query.",[19,18962,18964],{"id":18963},"lgpd-and-multi-tenancy","LGPD and multi-tenancy",[12,18966,18967],{},"LGPD doesn't require a specific architectural model, but it does require you to demonstrate adequate handling. Each pattern has different implications.",[12,18969,18970,18973],{},[27,18971,18972],{},"Pool:"," you need to demonstrate robust logical separation (RLS configured, tested, audited), a personal data access log (who read what, when), and a deletion process that covers all relevant tables for the right to be forgotten (article 18). All viable, but with more demonstration work.",[12,18975,18976,18979,18980,18983],{},[27,18977,18978],{},"Schema-per-tenant:"," demonstration becomes simpler. \"Each customer has their isolated schema, with their own privileges, and data deletion is ",[231,18981,18982],{},"DROP SCHEMA","\" — a phrase that satisfies auditors without pain. The right to be forgotten is practically trivial in this model.",[12,18985,18986,18989],{},[27,18987,18988],{},"App-per-tenant:"," physical separation is directly demonstrable. Audits become even simpler. The right to be forgotten is destroying the customer's database.",[12,18991,18992],{},"In all models: the personal data access log (article 16, storage requirement) is the application layer's responsibility — independent of the isolation model. 
Build that log early.",[19,18994,3225],{"id":3224},[12,18996,18997,19000],{},[27,18998,18999],{},"Is Postgres RLS reliable in production?","\nYes, and widely used. There are two pitfalls: ensure all roles connecting to the database are non-privileged (a superuser ignores RLS), and test policies with automated tests that run in CI. Whoever configures RLS once and never tests it discovers the holes later.",[12,19002,19003,19006,19007,19010],{},[27,19004,19005],{},"How to automate migrations in schema-per-tenant?","\nA common pattern: a ",[231,19008,19009],{},"tenant_metadata"," table with the list of schemas and each one's current version. A migration job consults it, applies migrations in parallel (with a concurrency limit so as not to saturate the database), and updates the version. Tools like Flyway, or migrate with a custom wrapper, work. Reserve a maintenance window for big migrations even with parallelism.",[12,19012,19013,19016],{},[27,19014,19015],{},"Doesn't app-per-tenant get too expensive to scale?","\nIt does, if the average ticket is low. Practical rule: revenue of R$10k\u002Fmonth per customer comfortably sustains the cost of dedicated infra. Below that, the margin tightens. For small customers, keep pool or schema. App-per-tenant is a weapon for customers who pay for exclusivity.",[12,19018,19019,19022],{},[27,19020,19021],{},"Can I mix models (high-value app-per-tenant, rest pool)?","\nYes, and hybrid is the most common final state in mature SaaS. Operational complexity increases — you operate two architectures, not one — but the savings pay off when high-value customers justify the effort. It requires a team of at least six to ten engineers with operational maturity.",[12,19024,19025,19030,19031,19033,19034,19037],{},[27,19026,19027,19029],{},[231,19028,18371],{}," in path or subdomain?","\nSubdomain (",[231,19032,18604],{},") is usually better for branding and isolated cookies. Path (",[231,19035,19036],{},"app.com\u002Facme",") is simpler in DNS and routing. 
Subdomain combines better with app-per-tenant; path combines well with pool. Choose early, because changing later breaks customer integrations.",[12,19039,19040,19043],{},[27,19041,19042],{},"Is encryption per tenant feasible?","\nIn pool, a per-tenant key in the application layer is the way to go: reasonable overhead, with keys derived from a protected master key. In schema-per-tenant, the same strategy applies. In app-per-tenant, database encryption-at-rest already gives natural isolation. Encryption per tenant is expensive and rarely required — only go there if the customer explicitly asks for it in the contract.",[12,19045,19046,19049],{},[27,19047,19048],{},"How long does onboarding take in each model?","\nPool: thirty to two hundred milliseconds (a database transaction). Schema-per-tenant: two to thirty seconds (CREATE SCHEMA + migrations). App-per-tenant: thirty seconds to three minutes (provision instances, bring up the database, register TLS). That time enters the signup UX flow — credit-card self-service models don't accommodate minutes-long waits without some form of queue or async notification.",[19,19051,3309],{"id":3308},[12,19053,19054],{},"Choosing a multi-tenant pattern isn't technically difficult — it's organizationally difficult. Difficult because it requires anticipating three to five years of product and customer growth, and almost no one anticipates well. The defense isn't choosing perfectly; it's choosing consciously, with a migration trail mapped to the next stage, and with instrumentation that warns when the current model is due for retirement.",[12,19056,19057],{},"Pool is right at the start. Schema-per-tenant is right in the transition to mid-market. App-per-tenant is right when the customer pays for exclusivity or when compliance requires it. 
Hybrid is the common destination.",[12,19059,19060],{},"If you're building Brazilian B2B SaaS in 2026 and the product is reaching the stage where multi-tenancy matters, it's worth knowing an orchestrator that makes app-per-tenant operationally accessible to small teams:",[224,19062,19064],{"className":19063,"code":5318,"language":2529},[2527],[231,19065,5318],{"__ignoreMap":229},[12,19067,19068],{},"Four servers, five vCPUs, ten gigabytes of RAM — the public demo cluster runs on resources that fit in any regional cloud plan. Coordinator election in about seven seconds when a node goes down. Embedded routing and TLS. It's the foundation a small team needs to run isolated tenant architecture.",[12,19070,19071,19072,2402,19074,101],{},"For more on Brazilian SaaS infra, see ",[3336,19073,7462],{"href":7461},[3336,19075,19076],{"href":6337},"How much it costs to host Brazilian SaaS in 2026",{"title":229,"searchDepth":244,"depth":244,"links":19078},[19079,19080,19081,19082,19083,19084,19085,19086,19087,19088,19089],{"id":18378,"depth":244,"text":18379},{"id":18416,"depth":244,"text":18417},{"id":18508,"depth":244,"text":18509},{"id":18597,"depth":244,"text":18598},{"id":18679,"depth":244,"text":18680},{"id":18846,"depth":244,"text":18847},{"id":18880,"depth":244,"text":18881},{"id":18923,"depth":244,"text":18924},{"id":18963,"depth":244,"text":18964},{"id":3224,"depth":244,"text":3225},{"id":3308,"depth":244,"text":3309},"2026-04-01","Pool, schema-per-tenant, app-per-tenant. Each pattern has obvious benefits and invisible costs. 
How to decide before the first serious B2B customer asks 'is my data isolated?'.",{},{"title":18363,"description":19091},{"loc":4368},"en\u002Fblog\u002Fmulti-tenant-saas-real-isolation",[19097,15805,19098,3378,4409],"multi-tenancy","isolation","v1Sv7siy3ZlWmJmgy7pUC2jpQRPh6LgQ9VqhTnqDEHw",{"id":19101,"title":19102,"author":7,"body":19103,"category":6382,"cover":3379,"date":19773,"description":19774,"draft":3382,"extension":3383,"lastReviewed":3379,"meta":19775,"navigation":411,"path":19776,"readingTime":4401,"seo":19777,"sitemap":19778,"stem":19779,"tags":19780,"__hash__":19785},"blog_en\u002Fen\u002Fblog\u002Fstrapi-directus-ghost-self-hosted-guide.md","Self-hosted Strapi, Directus, and Ghost: honest guide for agencies and indie hackers",{"type":9,"value":19104,"toc":19758},[19105,19108,19111,19114,19117,19121,19124,19130,19136,19146,19152,19158,19162,19165,19168,19171,19174,19177,19180,19184,19187,19190,19193,19196,19199,19202,19206,19209,19212,19215,19218,19221,19224,19226,19229,19427,19430,19434,19437,19460,19489,19515,19519,19522,19528,19534,19540,19546,19550,19553,19559,19565,19571,19574,19578,19581,19587,19593,19599,19605,19611,19615,19621,19627,19633,19639,19645,19649,19652,19658,19664,19670,19675,19681,19684,19686,19692,19698,19704,19710,19716,19722,19728,19730,19733,19736,19739,19744,19755],[12,19106,19107],{},"Every Brazilian agency hosting client sites knows the dilemma. You have thirty active accounts, each with a Wordpress on Wordpress.com Business costing between US$25 and US$45 per month — when the customer doesn't demand Wordpress.com VIP, which goes to three digits. Add that, multiply by thirty, divide by the dollar of the month, and the margin disappears. 
The oldest alternative is renting cheap shared hosting and stacking thirty sites on a PHP server that goes down all at once on the first Tuesday of the month: reputation burned to save five hundred reais.",[12,19109,19110],{},"There's a middle path that became viable in the last two years: replace the PHP monolith with a modern self-hosted CMS. Strapi, Directus, and Ghost are the three that appear most in agency projects and in indie SaaS in Brazil. Each solves a different problem, each has its own trap, and each one's official documentation sells the product instead of comparing honestly. This post is the comparison that was missing.",[12,19112,19113],{},"The audience here is dual. On one side, the agency of five to twenty people that delivers a site or editorial platform to a client — that profile needs to decide between managed cloud and self-hosting based on cost per client, not technical hype. On the other, the solo developer or indie hacker who's choosing the stack for their own project and wants to know which CMS scales better without becoming a Saturday headache.",[12,19115,19116],{},"The numbers are the post's skeleton. Costs in dollars were converted to reais using the current band of R$5.00 to R$5.30 per dollar — where the interval matters, it's marked. RAM and CPU requirements were collected from the official documentation and validated on test VPSes running a synthetic workload. If any number seems optimistic, it's because it's the floor — real production usually demands 30 to 50 percent more.",[19,19118,19120],{"id":19119},"why-self-hosted-cms-became-viable-in-2026","Why self-hosted CMS became viable in 2026",[12,19122,19123],{},"Five combined factors unlocked the scenario. 
None of them is new on its own; what changed is that they all matured at the same time.",[12,19125,19126,19129],{},[27,19127,19128],{},"The cost of a virtual machine fell to near-absurd levels."," Hetzner, DigitalOcean, OVH, and even Brazilian providers like Magalu Cloud and UOL Host sell a 2 vCPU, 4 GB RAM VPS for less than R$60 per month. Five years ago, the same capacity cost triple. For an agency that historically outsourced infra to hosting resellers, it now makes more sense to rent a dedicated machine and stack workloads there.",[12,19131,19132,19135],{},[27,19133,19134],{},"Self-hosted orchestration panels cover what was missing in ops."," Coolify, Dokploy, CapRover, and HeroCtl itself deliver what used to be the exclusive domain of expensive vendors: deploy from a config file, automatic TLS certificates, rollback from one version to another, basic metrics. The barrier to running a Strapi in production fell from \"a week of manual provisioning\" to \"five minutes after the server is up\".",[12,19137,19138,19141,19142,19145],{},[27,19139,19140],{},"Modern CMSes publish official and mature Docker images."," It's been three years since you needed to build your own Dockerfile for Strapi in production; today the official team publishes a tested image with a reference docker-compose recipe. Even for Ghost, which historically had its own packaging, the ",[231,19143,19144],{},"ghost:5-alpine"," image is the form recommended by the official team.",[12,19147,19148,19151],{},[27,19149,19150],{},"Brazilian communities stopped being invisible."," The Strapi BR Discord channel has thousands of active members, the Directus official forum responds in English but with high participation from Brazilian devs, and Ghost documentation has been translated piecemeal by local contributors. 
It's not the WordPress community (which is gigantic and full of PT-BR tutorials), but it's enough to get past most problems without having to decipher technical English on the fourth consecutive error.",[12,19153,19154,19157],{},[27,19155,19156],{},"Wordpress.com aggressively raised prices."," Whoever watched Heroku kill its free tier in 2022 recognizes the pattern: a free or cheap service becomes premium, the old plan is discontinued, the legacy account migrates or pays more. Wordpress.com did the equivalent over the last two years — the \"Personal\" tier rose, the \"Premium\" tier rose more, and features that previously came on the medium plan now require the Business tier or higher. Each increase is one more push toward self-hosting.",[19,19159,19161],{"id":19160},"strapi-the-api-first-cms","Strapi — the API-first CMS",[12,19163,19164],{},"Strapi is what most resembles \"modern Wordpress for devs\". You define the content type in the admin interface (post, author, category, product, anything), and Strapi automatically generates a REST API and a GraphQL API to read and write that content. There's no frontend in it — it's a pure headless backend. The frontend is your responsibility, generally a Next.js, Nuxt, or Astro consuming the API.",[12,19166,19167],{},"The stack is Node.js on the backend, a Postgres or MySQL database for persistence, and a React admin panel that comes embedded. The panel is the product's strong point: a non-technical editor can create content without training, organize media, schedule publication, and manage users. For an agency, that's an easy sell — the customer enters the admin and recognizes the \"Wordpress but cleaner\" paradigm.",[12,19169,19170],{},"The realistic minimum requirement in production is 2 vCPU, 2 GB of RAM, and 10 GB of storage. The official documentation says 1 GB, but with any active plugin and traffic beyond local testing, memory blows up. 
On a VPS of R$50 to R$80 per month you run comfortably; on a R$30 VPS (1 GB of RAM) the process dies every time a larger media upload happens.",[12,19172,19173],{},"The strong points are consistent. A rich plugin ecosystem — social authentication, internationalization, S3 integration for media, sitemap generator, all already exist. Native GraphQL without extra configuration, which fits well with a modern frontend. Custom hooks (lifecycle hooks, middlewares, policies) solve business rules without needing a separate microservice. The admin interface is genuinely good — compared with Drupal or Wordpress without an admin plugin, it's another level.",[12,19175,19176],{},"The weak points are also consistent, and worth saying out loud. The transition between major versions usually breaks things — the migration from v4 to v5 was notorious, with incompatible API changes and the need to rewrite custom plugins. If you adopt Strapi for a long-term project, reserve an upgrade window every twelve or eighteen months as a recurring cost, not a surprise. Schema migrations also require discipline — adding a field is easy, but renaming one or changing its type without losing data requires a hand-written migration script. And some features that appear in the marketing only run in Strapi Cloud (their paid version), like live preview between environments — self-hosting, you don't get them out of the box.",[12,19178,19179],{},"When it makes sense to choose Strapi: a SaaS that needs its own blog and knowledge base on the same CMS, an agency that delivers to customers used to \"Wordpress but without PHP\", a headless commerce project where SKUs are modeled as content types, and any scenario where having ready-made GraphQL saves days of work.",[19,19181,19183],{"id":19182},"directus-the-cms-for-existing-data","Directus — the CMS for existing data",[12,19185,19186],{},"Directus is a different creature. Instead of forcing you to create content types from scratch within it, it puts an admin interface on top of any database you already have. 
Point it at a legacy Postgres with twenty existing tables, and Directus shows each table as an editable collection, respecting column types, foreign keys, and even constraints. It's the tool that most resembles a \"universal admin for any SQL database\".",[12,19188,19189],{},"The stack is Node.js on the backend, official support for Postgres, MySQL, MariaDB, SQLite, Oracle, and SQL Server, and an admin panel in Vue. Database support is deliberately broad — the product was designed to adapt, not to impose its own schema. You can use Directus against an empty database and let it create the tables via the interface, or point it at a database with ten years of history and expect everything to appear organized in the admin.",[12,19191,19192],{},"The minimum requirement is lighter than Strapi's. 1 vCPU, 1 GB of RAM, and 5 GB of storage run comfortably for small and medium workloads. On a VPS of R$30 to R$50 per month you can bring up a Directus serving dozens of collections with moderate traffic. For smaller projects, SQLite as the database is sufficient — it fits in a single file, simplifies backup, and avoids having a separate Postgres to manage.",[12,19194,19195],{},"The strong points come from the design. The ability to adopt an existing database without reformulating the schema is genuinely unique — no popular CMS does this as well. Real-time updates via WebSockets come ready, which opens the door to dashboards and internal tools that react to change in real time without needing an additional layer. Granular permissions per collection, per field, and even per row (based on a condition) cover multi-tenancy scenarios without hacks. The documentation is decent, kept active, and the team responds to questions in the forum within reasonable timeframes.",[12,19197,19198],{},"The weak points: the learning curve for advanced customizations (extensions, custom hooks, dashboard panels) is steeper than Strapi's. The plugin ecosystem is smaller — where Strapi has ten SEO plugins, Directus has two or three. 
And for a non-technical editor, the interface is less friendly than Strapi's — Directus prioritizes power and flexibility, not smooth onboarding.",[12,19200,19201],{},"When it makes sense to choose Directus: an agency that took on a customer with a ten-year-old legacy MySQL database and needs to deliver an admin panel without redoing the schema, an internal tool where the modeling is data-driven (custom CRM, inventory management, operations platform), an application whose central entity is \"relational data\", not \"editorial document\". It's also the obvious choice when the customer already has Postgres or MySQL running another system and wants to reuse it.",[19,19203,19205],{"id":19204},"ghost-the-publishing-cms","Ghost — the publishing CMS",[12,19207,19208],{},"Ghost is the opposite of neutrality. It doesn't pretend to be a universal CMS — it's a blog and newsletter platform, specialized in editorial content. Whoever tries to use Ghost for e-commerce products or a SaaS app is using the wrong tool. Whoever uses it for a corporate blog, media site, podcast with membership, or paid newsletter finds a polished and focused product.",[12,19210,19211],{},"The stack is Node.js on the backend, a MySQL or SQLite database (Postgres isn't officially supported), and a Handlebars frontend with themes. The frontend is part of the package — Ghost serves the pages directly, with a theme installed via upload. There's a headless mode (you use only the Content API and assemble the frontend separately), but the common case is Ghost serving everything.",[12,19213,19214],{},"The minimum requirement is the lightest of the three. 1 vCPU, 1 GB of RAM, and 5 GB of storage run Ghost with headroom for a medium-traffic blog. On a R$30 VPS you can run it — taking care to configure external SMTP for the newsletter (sending email from the server itself is a recipe for landing in spam).",[12,19216,19217],{},"The strong points are sharp. 
Out-of-the-box SEO is the best among the three — meta tags, sitemap, schema.org, AMP (when it makes sense), all configured by default. Membership and paywall system comes native: you create subscription levels, charge via Stripe, release paid content automatically. The markdown editor is genuinely good, with cards (callouts, code) that cover the common case without becoming a Word editor. The themes focus on readability and editorial typography — nothing of the generic Wordpress theme aesthetic.",[12,19219,19220],{},"The weak points come from specialization. Plugin ecosystem is closed by design — integration apps exist on Ghost.org as paid product, and installing custom app is harder than in Strapi or Directus. Non-blog is hostile territory — trying to model product, author with rich profile, complex taxonomy bumps into design decisions that prioritize the \"post + author + tag\" case. And official Postgres support doesn't exist — if you have company standard on Postgres, you'll operate parallel MySQL just for Ghost.",[12,19222,19223],{},"When it makes sense to choose Ghost: corporate blog with paywall or premium content, media or independent journalism site, podcast that wants to monetize via membership, content marketing taken seriously with editor that will use the admin every day. 
For anything outside that scope, it's a stretch.",[19,19225,17370],{"id":17369},[12,19227,19228],{},"The three modern CMSes side by side with Wordpress (the inherited reference) and Payload (a recent competitor worth mentioning):",[119,19230,19231,19252],{},[122,19232,19233],{},[125,19234,19235,19237,19240,19243,19246,19249],{},[128,19236,2982],{},[128,19238,19239],{},"Strapi",[128,19241,19242],{},"Directus",[128,19244,19245],{},"Ghost",[128,19247,19248],{},"Wordpress",[128,19250,19251],{},"Payload",[141,19253,19254,19270,19287,19305,19322,19338,19356,19374,19392,19407],{},[125,19255,19256,19259,19262,19264,19266,19268],{},[146,19257,19258],{},"Realistic minimum RAM",[146,19260,19261],{},"2 GB",[146,19263,11494],{},[146,19265,11494],{},[146,19267,11494],{},[146,19269,19261],{},[125,19271,19272,19274,19277,19279,19282,19285],{},[146,19273,16309],{},[146,19275,19276],{},"30–60 min",[146,19278,19276],{},[146,19280,19281],{},"15–30 min",[146,19283,19284],{},"10–20 min",[146,19286,19276],{},[125,19288,19289,19292,19295,19297,19300,19303],{},[146,19290,19291],{},"Headless mode",[146,19293,19294],{},"Yes, default",[146,19296,19294],{},[146,19298,19299],{},"Optional",[146,19301,19302],{},"Optional (REST + GraphQL)",[146,19304,19294],{},[125,19306,19307,19310,19312,19314,19317,19320],{},[146,19308,19309],{},"Native GraphQL",[146,19311,3064],{},[146,19313,3064],{},[146,19315,19316],{},"No (REST)",[146,19318,19319],{},"External plugin",[146,19321,3064],{},[125,19323,19324,19327,19329,19331,19334,19336],{},[146,19325,19326],{},"Easy multi-tenancy",[146,19328,3159],{},[146,19330,4897],{},[146,19332,19333],{},"Hard",[146,19335,19319],{},[146,19337,4897],{},[125,19339,19340,19343,19346,19348,19351,19354],{},[146,19341,19342],{},"Membership \u002F paywall",[146,19344,19345],{},"Plugin",[146,19347,19345],{},[146,19349,19350],{},"Native",[146,19352,19353],{},"Paid plugin",[146,19355,11914],{},[125,19357,19358,19361,19364,19366,19369,19372],{},[146,19359,19360],{},"Plugin 
ecosystem",[146,19362,19363],{},"Rich",[146,19365,3159],{},[146,19367,19368],{},"Weak",[146,19370,19371],{},"Very rich",[146,19373,14471],{},[125,19375,19376,19379,19382,19384,19387,19389],{},[146,19377,19378],{},"Cloud cost (initial USD\u002Fmonth)",[146,19380,19381],{},"15",[146,19383,5635],{},[146,19385,19386],{},"11",[146,19388,5635],{},[146,19390,19391],{},"35",[125,19393,19394,19396,19398,19400,19402,19404],{},[146,19395,4894],{},[146,19397,3139],{},[146,19399,4889],{},[146,19401,4889],{},[146,19403,19371],{},[146,19405,19406],{},"English",[125,19408,19409,19412,19415,19418,19421,19424],{},[146,19410,19411],{},"Ideal use range",[146,19413,19414],{},"API + admin",[146,19416,19417],{},"Admin over data",[146,19419,19420],{},"Editorial content",[146,19422,19423],{},"Generic site",[146,19425,19426],{},"Custom Node.js app",[12,19428,19429],{},"The \"time to first deploy\" column assumes server already provisioned and Docker installed. The \"Cloud cost\" column is the product's entry tier — price scale rises according to traffic, member, or admin seat limits. The \"documentation in PT-BR\" column reflects what exists official plus what the Brazilian community keeps active; none of the three has complete manual translated, but Strapi has the best learning path in Portuguese.",[19,19431,19433],{"id":19432},"self-hosted-setup-at-a-high-level","Self-hosted setup at a high level",[12,19435,19436],{},"The recipe isn't copy-paste — it's the mental script of what will be needed. Specific details change per VPS and per orchestrator choice.",[12,19438,19439,19440,19442,19443,19446,19447,571,19450,571,19453,571,19456,19459],{},"For ",[27,19441,19239],{},", the minimally serious setup is docker-compose with three services: Strapi, Postgres, and Redis (Redis is optional, but speeds up the admin noticeably when there are more than five editors). Named volume for ",[231,19444,19445],{},"\u002Fsrv\u002Fstrapi\u002Fuploads"," (media) and for Postgres data. 
Panel comes up on port 1337 internally, exposed via subdomain with TLS by the orchestrator's router. Critical environment variables: ",[231,19448,19449],{},"APP_KEYS",[231,19451,19452],{},"JWT_SECRET",[231,19454,19455],{},"ADMIN_JWT_SECRET",[231,19457,19458],{},"DATABASE_*",". Forgetting any of those makes the admin not come up or lose session at every restart.",[12,19461,19439,19462,19464,19465,571,19468,571,19471,571,19474,571,19477,19480,19481,19484,19485,19488],{},[27,19463,19242],{},", the setup is similar but lighter. Docker-compose with Directus and database (SQLite fits in a single volume, Postgres if the expectation is multi-user with concurrent writing). No Redis necessary to start. Panel on port 8055. Critical variables: ",[231,19466,19467],{},"KEY",[231,19469,19470],{},"SECRET",[231,19472,19473],{},"ADMIN_EMAIL",[231,19475,19476],{},"ADMIN_PASSWORD",[231,19478,19479],{},"DB_*",". Attention point: if you point Directus to existing database with rich schema, open the admin calmly and configure permissions before giving access to any other user — by default the ",[231,19482,19483],{},"admin"," role sees everything and the ",[231,19486,19487],{},"public"," role sees nothing, which is reasonable; but if you create intermediate role without care, you expose entire collections without intending.",[12,19490,19439,19491,19493,19494,19497,19498,19500,19501,571,19504,19507,19508,19510,19511,19514],{},[27,19492,19245],{},", docker-compose with Ghost and MySQL. SQLite serves for development but is discouraged in production by the official team. Named volume for ",[231,19495,19496],{},"\u002Fvar\u002Flib\u002Fghost\u002Fcontent"," (themes, media, configs) and for MySQL. Configuring external SMTP is mandatory step — Mailgun, Postmark, and Resend have free or cheap tier, any of them serves. Without SMTP, password recovery doesn't work, newsletter doesn't send, member signup is broken. 
Critical variables: ",[231,19499,10243],{}," (public domain with https), ",[231,19502,19503],{},"database__connection__*",[231,19505,19506],{},"mail__*",". Common error: configuring ",[231,19509,10243],{}," as ",[231,19512,19513],{},"http:\u002F\u002Flocalhost"," in production and discovering only later that all email links came out broken.",[19,19516,19518],{"id":19517},"compared-costs","Compared costs",[12,19520,19521],{},"The honest spreadsheet of managed cloud against self-hosted, in current currency (R$5.00 per dollar as reference):",[12,19523,19524,19527],{},[27,19525,19526],{},"Strapi Cloud"," starts at US$15 per month on the Developer tier (R$75), rises to US$99 per month on the Pro tier (R$495) with features like separate staging and production environments, more admin seats, and support. Self-hosted on VPS of R$50 to R$80 per month runs Strapi with slack for small and medium workload. Monthly difference: from R$25 to R$445 depending on which tier you'd compare. For agency with five clients on Strapi, that translates to annual savings between R$1,500 and R$26,700.",[12,19529,19530,19533],{},[27,19531,19532],{},"Directus Cloud"," starts at US$25 per month on the Standard tier (R$125), rises to US$99 per month on the Pro tier (R$495), and has Enterprise tier with price on consultation. Self-hosted on VPS of R$50 per month covers the common case. Difference similar to Strapi's — between R$75 and R$445 per month per instance.",[12,19535,19536,19539],{},[27,19537,19538],{},"Ghost Pro"," starts at US$11 per month on the Starter tier (R$55) with up to 500 members and a single staff seat, scales to US$31 (R$155) with 1,000 members, and reaches US$249 per month (R$1,245) on the tier supporting 50,000 members. Self-hosted on VPS of R$50 to R$80 per month has no member ceiling — you can have 50,000 or 500,000 without changing the server (the only thing that changes is the volume of transactional email, which scales separately). 
For a publication with a growing audience, the annual savings from self-hosting Ghost quickly passes R$10k.",[12,19541,19542,19545],{},[27,19543,19544],{},"Wordpress.com Business"," costs US$25 per month (R$125), and VIP runs into three digits. Comparing with self-hosting Wordpress on a VPS of R$50 is underwhelming — Wordpress is heavy by nature, requires more security and backup care, and the plugin ecosystem is a recurring source of production incidents. For a new project in 2026, it's more sensible to choose between Strapi, Directus, or Ghost than to inherit PHP's complexity.",[19,19547,19549],{"id":19548},"strategy-for-agency-hosting-thirty-clients","Strategy for agency hosting thirty clients",[12,19551,19552],{},"Three options with clear tradeoffs.",[12,19554,19555,19558],{},[27,19556,19557],{},"Option A — one VPS per client."," Total isolation: if a client takes down their server, the other twenty-nine don't feel it. Direct cost: 30 VPS × R$30 to R$50 = R$900 to R$1,500 per month just on infra. Operational cost: thirty times everything — thirty OS updates, thirty certificates to monitor, thirty backups to orchestrate. For an agency with more than ten clients, the operational overhead eats the margin the option had in the first place.",[12,19560,19561,19564],{},[27,19562,19563],{},"Option B — a shared cluster running thirty CMS instances."," Four servers totaling 5 vCPU and 10 GB of RAM (the configuration we run in production here on HeroCtl) comfortably host thirty Strapi\u002FDirectus\u002FGhost instances with typical SME client traffic. Infra cost: about R$300 to R$400 per month for the entire cluster. Operational cost: a single monitoring strategy, a single backup strategy, a single place to look when something weighs on the system. 
Agency margin increases because what you charge stays the same while what you spend falls.",[12,19566,19567,19570],{},[27,19568,19569],{},"Option C — shared cluster with each client on its own subdomain."," A variation of option B, but with explicit routing by subdomain (client1.youragency.com, client2.youragency.com) or the client's own domain (client-shop.com.br). The orchestrator's integrated router handles automatic TLS and traffic routing. Multi-tenancy stays at the DNS + container level, not the CMS level — each client has an isolated instance of Strapi\u002FDirectus\u002FGhost with its own database. For an agency that sells an \"exclusive site\" as a differentiator, it's the way to keep the promise without multiplying VPSes.",[12,19572,19573],{},"Option B with elements of C is what makes the most sense for the typical agency. Shared cluster, isolated instances, subdomain or own domain per client, centralized backup.",[19,19575,19577],{"id":19576},"backup-and-migration-between-cms","Backup and migration between CMS",[12,19579,19580],{},"Migration between CMSes is territory where vendors deliberately omit details. The practical truth:",[12,19582,19583,19586],{},[27,19584,19585],{},"Strapi to Strapi"," (between versions or between instances) has export and import via an official plugin, which generates a JSON file with schema and data. Works well for migration between staging and production; between major versions, it may require manual adjustment in the JSON before the import.",[12,19588,19589,19592],{},[27,19590,19591],{},"Strapi to Directus"," has no ready-made tool. The schemas are different enough to require manual mapping — a Node script reading Strapi's REST API and writing to Directus's REST API, item by item. For a base of one or ten thousand records, it's an afternoon's work; for a larger base, it's worth parallelizing.",[12,19594,19595,19598],{},[27,19596,19597],{},"Wordpress to Strapi"," has third-party tools (wp2strapi and variants), all partial. What migrates well is post + author + category + media. 
What doesn't migrate well is any complex custom post type, SEO plugin with own metadata, or menu structure. Reserve one to three days per site in migration and revise media manually.",[12,19600,19601,19604],{},[27,19602,19603],{},"Ghost to Ghost"," has native export and import in admin — generates JSON with posts, authors, site configurations, members. Works clean between instances and between versions.",[12,19606,19607,19610],{},[27,19608,19609],{},"Database backup"," is the non-negotiable step. Pg_dump (Postgres) or mysqldump (MySQL) daily, copied to object storage outside the server (S3, Backblaze B2, Wasabi). Without this, any incident — corrupted disk, accidental rm, hack — becomes extinction event for the client's data. S3 cost with versioning for a small cluster stays below R$50 per month even keeping thirty days of retention.",[19,19612,19614],{"id":19613},"five-mistakes-that-kill-self-hosted-cms","Five mistakes that kill self-hosted CMS",[12,19616,19617,19620],{},[27,19618,19619],{},"Not updating."," Outdated CMS is open vulnerability. Monthly update cron is the floor — fixed calendar, maintenance window combined with client, smoke test afterward. Not doing this means that sooner or later someone opens the client's admin without credential.",[12,19622,19623,19626],{},[27,19624,19625],{},"Weak admin password."," Admin\u002Fadmin in production keeps happening in 2026. Strong password generated by password manager, two-factor authentication when the CMS supports, separate role for editor (the client doesn't get total admin password).",[12,19628,19629,19632],{},[27,19630,19631],{},"No automatic backup."," Client sees six months of content disappear and the relationship ends. Daily database backup, retained for thirty days minimum, copied to storage outside the server hosting the CMS. 
Test restore at least once per quarter — backup that's never been restored is theory, not backup.",[12,19634,19635,19638],{},[27,19636,19637],{},"Local media storage without CDN."," Large images on small VPS take down the server when a page goes viral. Configure object storage (S3, R2, Spaces) for media from day one, even if traffic is low at the start. Strapi and Directus have official providers for this; Ghost supports via configuration.",[12,19640,19641,19644],{},[27,19642,19643],{},"Untested transactional email."," Strapi and Directus need SMTP configured for password reset to work. Ghost depends on SMTP for the entire newsletter. Configure and test on deploy day — send test email to yourself, check inbox and spam folder, adjust SPF\u002FDKIM if it falls in spam. Without this, the client discovers that the site broke on the day they need to change their own password.",[19,19646,19648],{"id":19647},"heroctl-as-agency-infra","HeroCtl as agency infra",[12,19650,19651],{},"The last part of the guide is honest about how HeroCtl fits into this scenario. We don't pretend to be the only option — Coolify, Dokploy, and CapRover cover similar cases with different tradeoffs. What HeroCtl brings for agency hosting CMS is:",[12,19653,19654,19657],{},[27,19655,19656],{},"Job templates to bring up new CMS in seconds."," Instead of writing docker-compose from scratch for each client, you keep a fifty-line config file with Strapi + Postgres already parameterized, change the domain and database name, and submit. New client enters production in less time than it takes to make coffee.",[12,19659,19660,19663],{},[27,19661,19662],{},"Routing by subdomain with automatic TLS."," Each client on own subdomain (or own domain with DNS pointing) receives Let's Encrypt certificate without intervention. Renewal happens by itself. 
You don't touch a web server config file — the integrated router handles it.",[12,19665,19666,19669],{},[27,19667,19668],{},"Metrics per job."," Which client is weighing on the cluster becomes visible on the panel — CPU, memory, requests per second, latency. When a client passes their contracted volume, you see it before the cluster feels it.",[12,19671,19672,19674],{},[27,19673,13651],{}," (in the Business plan) covers all clients at once. Instead of configuring thirty separate pg_dump scripts, it's a central policy with configurable retention.",[12,19676,19677,19680],{},[27,19678,19679],{},"Detailed audit"," (in the Business plan) covers the LGPD requirement for a client that needs to demonstrate who accessed what and when. For an agency serving clients in health, finance, or education, it stops being a luxury.",[12,19682,19683],{},"The line between what comes in the Community plan (free, no server or job limit) and what's in Business is drawn by the type of requirement that appears when the agency grows. For five or ten clients, Community is enough. For thirty clients where two of them require SSO and one requires an audit report, Business pays for itself in the first month.",[19,19685,7347],{"id":7346},[12,19687,19688,19691],{},[27,19689,19690],{},"Wordpress vs these three — when does Wordpress still win?","\nWhen the client has an internal team used to Wordpress, when the site depends on a specific plugin that only exists in Wordpress (some hyper-localized e-commerce plugins, some LMS), and when the budget is so small that training an editor on a new CMS costs more than the hosting savings. For a new project in 2026 without these restrictions, rarely.",[12,19693,19694,19697],{},[27,19695,19696],{},"Can I run Strapi on a R$30 VPS?","\nTechnically yes; in practice it's a source of incidents. 1 GB of RAM is the floor and any traffic spike or larger media upload takes down the process. 
Bump up to R$50 to R$80 — the difference is less than a lunch, and stability is a different story.",[12,19699,19700,19703],{},[27,19701,19702],{},"Ghost and Strapi on the same server, ok?","\nOn a small VPS (4 GB of RAM or less) it's tight and subject to contention. On a server with 8 GB or more and docker-compose separating resources, it works. On a cluster with an orchestrator, it's the common case — both run on different hosts or share one with process isolation.",[12,19705,19706,19709],{},[27,19707,19708],{},"How do I migrate Strapi v4 to v5 without pulling an all-nighter?","\nDocument the current schema before touching anything. Bring up a staging environment with v5 and a copy of the same database. Run the official migrator and verify everything in the admin. Rewrite custom plugins before promoting to production — they don't migrate automatically. Reserve two to four business days for a medium Strapi. Without a staging environment, don't flip the switch directly in production.",[12,19711,19712,19715],{},[27,19713,19714],{},"Transactional email for Ghost newsletter — which provider is cheapest?","\nMailgun has a free tier up to five thousand emails per month, then charges by volume. Resend has a free tier up to three thousand. Postmark is paid from the first email but is the most reliable in delivery. For a small newsletter (up to two thousand members), free Mailgun or Resend is enough. Above that, Postmark is worth the cost for the delivery rate.",[12,19717,19718,19721],{},[27,19719,19720],{},"Is there a Brazilian agency case scaling like this?","\nThere are several, but those that speak in public are a minority. The typical pattern is an agency with ten to thirty clients, a cluster of three or four servers on a cloud provider, separate instances per client, centralized backup. 
When the agency publishes numbers, it usually talks of fifty to seventy percent savings over the equivalent in managed hosting — which matches the arithmetic above.",[12,19723,19724,19727],{},[27,19725,19726],{},"Large image media — where to store?","\nObject storage outside the server hosting the CMS. AWS S3, Cloudflare R2, Backblaze B2, and DigitalOcean Spaces cover the case. R2 and B2 have better price than pure S3 for read-intensive workload. Configure from day one, even with low traffic — migrating media later is headache that doesn't compensate.",[19,19729,3309],{"id":3308},[12,19731,19732],{},"The three modern CMSes cover three distinct cases. Strapi for those who want polished admin with headless API and plugin for everything. Directus for those who have data and need admin over it. Ghost for those who publish editorial content and want paywall without hack.",[12,19734,19735],{},"Self-hosting became viable because machine became cheap, self-hosted orchestrator became good, and the three products matured Docker packaging. 
For agency with more than five clients, the savings from cloud to self-hosting on shared cluster pays for one team person in a few months.",[12,19737,19738],{},"If you want to test the path of orchestrator on own cluster:",[224,19740,19742],{"className":19741,"code":2948,"language":2529},[2527],[231,19743,2948],{"__ignoreMap":229},[12,19745,19746,19747,19751,19752,19754],{},"Related posts that deepen specific points: ",[3336,19748,19750],{"href":19749},"\u002Fen\u002Fblog\u002Fself-hosted-heroku-2026","Self-hosted Heroku in 2026"," covers the broader panorama of \"fleeing expensive PaaS\", and ",[3336,19753,6338],{"href":6337}," brings the complete infra spreadsheet for digital product starting from scratch.",[12,19756,19757],{},"Without ceremony.",{"title":229,"searchDepth":244,"depth":244,"links":19759},[19760,19761,19762,19763,19764,19765,19766,19767,19768,19769,19770,19771,19772],{"id":19119,"depth":244,"text":19120},{"id":19160,"depth":244,"text":19161},{"id":19182,"depth":244,"text":19183},{"id":19204,"depth":244,"text":19205},{"id":17369,"depth":244,"text":17370},{"id":19432,"depth":244,"text":19433},{"id":19517,"depth":244,"text":19518},{"id":19548,"depth":244,"text":19549},{"id":19576,"depth":244,"text":19577},{"id":19613,"depth":244,"text":19614},{"id":19647,"depth":244,"text":19648},{"id":7346,"depth":244,"text":7347},{"id":3308,"depth":244,"text":3309},"2026-03-25","The three modern open-source CMSes Brazilian devs most self-host. One for each case. 
Comparison table, real requirements, and when it's worth paying for the cloud version.",{},"\u002Fen\u002Fblog\u002Fstrapi-directus-ghost-self-hosted-guide",{"title":19102,"description":19774},{"loc":19776},"en\u002Fblog\u002Fstrapi-directus-ghost-self-hosted-guide",[19781,19782,19783,19784,7507,6395],"cms","strapi","directus","ghost","w0SGNznUB5Hl4gKpjXRhJFVwEuj1lP6cUT-GxG5eGj8",{"id":19787,"title":19788,"author":7,"body":19789,"category":6382,"cover":3379,"date":20377,"description":20378,"draft":3382,"extension":3383,"lastReviewed":3379,"meta":20379,"navigation":411,"path":8709,"readingTime":3386,"seo":20380,"sitemap":20381,"stem":20382,"tags":20383,"__hash__":20387},"blog_en\u002Fen\u002Fblog\u002Fmigrating-from-kubernetes-to-simpler-stack.md","Migrating from Kubernetes to a simpler stack: real case of complexity reduction",{"type":9,"value":19790,"toc":20358},[19791,19798,19805,19808,19812,19815,19818,19824,19830,19836,19842,19848,19854,19857,19861,19864,19870,19876,19882,19888,19894,19897,19901,19904,19914,19923,19932,19938,19944,19950,19953,19957,19960,19966,19972,19978,19984,19987,19991,19994,19998,20001,20004,20007,20011,20014,20090,20093,20096,20100,20103,20106,20109,20113,20116,20119,20122,20125,20129,20140,20143,20147,20150,20160,20166,20172,20178,20184,20197,20201,20204,20210,20216,20222,20226,20229,20261,20264,20268,20271,20274,20277,20280,20284,20290,20296,20302,20308,20314,20320,20326,20328,20331,20334,20337,20342,20345,20355],[12,19792,19793,19794,19797],{},"The public narrative of the past six years has always been in the same direction: everyone migrates ",[27,19795,19796],{},"to"," Kubernetes. Conferences, sponsored posts, SRE jobs, vendor cases — the vector is unique. Came from a bare virtual machine and went up to K8s. Came from Heroku and went up to K8s. Came from Docker Compose and went up to K8s. 
The arrow points in one direction only, and whoever isn't following it must be doing it wrong.",[12,19799,19800,19801,19804],{},"The silent reality nobody publishes at conferences is the inverse vector: hundreds of teams migrate ",[27,19802,19803],{},"out"," of Kubernetes after discovering they paid dearly for complexity they didn't need. It's not a headline, but it happens every month. A fifteen-dev company with a six-node EKS cluster realizes the platform team became half the engineering budget. A startup that adopted K8s on day two discovers that three years later it still spends an entire Friday per month just updating the Helm chart versions of operators. A product team that should be shipping features is debugging an admission controller webhook.",[12,19806,19807],{},"This post is the playbook for this reverse migration, with the real pitfalls we've seen happen. It's not a pitch — it's an operational manual. If you read this and decide to stay on Kubernetes, great. The informed decision to stay is as valuable as the informed decision to leave.",[19,19809,19811],{"id":19810},"the-qualifying-question-should-you-even-consider-this","The qualifying question: should you even consider this?",[12,19813,19814],{},"Before anything, rule out the case. Reverse migration only makes sense for a specific profile — and most teams thinking about leaving K8s aren't in that profile. The others are researching alternatives when they should be hiring one more SRE or simplifying their current K8s use.",[12,19816,19817],{},"The profile where the migration makes sense has six simultaneous signals:",[12,19819,19820,19823],{},[27,19821,19822],{},"Signal 1: the company runs Kubernetes in production for one year or more."," Migration isn't an experiment. If the team has been on K8s for three months and is already complaining, the problem is onboarding, not the platform. 
Wait for the learning cycle to complete before declaring bankruptcy.",[12,19825,19826,19829],{},[27,19827,19828],{},"Signal 2: the platform team has between one and three people."," Companies with five or more engineers dedicated to platform have an operational scale that justifies K8s. Below that, the tool tends to consume the entire team in maintenance.",[12,19831,19832,19835],{},[27,19833,19834],{},"Signal 3: the cluster has fewer than fifty servers in production."," Above that number, the K8s ecosystem gives you tooling — horizontal node scaling, cross-region balancing, multi-cluster federation — that another stack doesn't give you. Below, you're paying overhead for capacity you don't use.",[12,19837,19838,19841],{},[27,19839,19840],{},"Signal 4: the apps are typical."," HTTP web, relational database, in-memory cache, async jobs, some queue. If the stack includes an exotic distributed database operator, service mesh with sophisticated L7 policies, or GPU scheduling for model training, reverse migration gets complicated.",[12,19843,19844,19847],{},[27,19845,19846],{},"Signal 5: 80% of platform time is maintenance, not new feature."," If the platform team spends most weeks updating Helm chart, debugging cluster version upgrade, or fixing a webhook that failed, it's a clear symptom. The platform became an internal product that consumes itself.",[12,19849,19850,19853],{},[27,19851,19852],{},"Signal 6: total platform salary represents more than 5% of MRR."," It's not an absolute rule, but it's a useful metric. When the team that operates infra costs more than a twentieth of monthly recurring revenue, infra is too expensive for the company's current scale.",[12,19855,19856],{},"If your company checks all six, it's worth reading the rest of the post. If it checks three or four, read but decide cautiously. 
If it checks one or two, the problem is probably something else.",[19,19858,19860],{"id":19859},"who-definitely-should-not-migrate","Who definitely should not migrate",[12,19862,19863],{},"Same honesty in reverse. There's a profile where leaving K8s is the wrong decision, and whoever fits it should close this tab.",[12,19865,19866,19869],{},[27,19867,19868],{},"Strong platform team with mature process."," If you have five platform engineers who master K8s, a stable CI\u002FCD pipeline, written runbooks, configured observability — leaving all that to start from scratch on a simpler stack is throwing real investment away. The destination's simplicity doesn't compensate for the reset.",[12,19871,19872,19875],{},[27,19873,19874],{},"Stack that depends on critical operators."," A relational database with automatic replication managed by an operator, a distributed queue with balancing managed by an operator, a columnar database with automatic bootstrap. These operators are real value. Trading them for \"a human takes care of this\" is an operational regression, not a simplification.",[12,19877,19878,19881],{},[27,19879,19880],{},"Compliance that nominally requires Kubernetes."," Some audit frameworks — FedRAMP at certain levels, certain government contracts, some sectoral security seals — list pre-approved tools. If your compliance officer needs to point to an existing certificate, K8s is the answer. Migrating to a tool that's not on the list creates friction that costs more than the savings.",[12,19883,19884,19887],{},[27,19885,19886],{},"Multi-cluster federation in production."," If you run workloads that move between clusters in different regions, with state replication coordinated by a tool like Argo or FluxCD in multi-cluster mode, the K8s ecosystem has primitives other stacks don't have. 
Migrating from that is a six-month project minimum.",[12,19889,19890,19893],{},[27,19891,19892],{},"ML\u002FAI workloads with complex GPU scheduling."," Distributed training, GPU partitioning, scheduling that understands specific hardware affinity. K8s has mature operators and plugins for that. A simpler stack doesn't.",[12,19895,19896],{},"If you fit any of these five, the honest recommendation is to stay where you are and optimize the current K8s use.",[19,19898,19900],{"id":19899},"pre-flight-assessment-one-to-two-days-before-committing","Pre-flight assessment: one to two days before committing",[12,19902,19903],{},"Reverse migration starts with inventory. Before scheduling a \"let's leave K8s\" meeting, the team needs to measure what they have today. Without numbers, the decision is vibes — and vibes don't survive the first unforeseen problem during cutover.",[12,19905,19906,19909,19910,19913],{},[27,19907,19908],{},"Manifest inventory."," Run ",[231,19911,19912],{},"kubectl get all -A --output yaml > all.yaml"," and count. How many files in the manifest repository? How many lines aggregated? How many namespaces? Our informal measurement on small teams: a company with ten typical apps usually has between 1,500 and 4,000 lines of YAML, spread across Deployment, Service, Ingress, ConfigMap, Secret, HorizontalPodAutoscaler, and some NetworkPolicy. Each of these lines is migration work.",[12,19915,19916,2577,19919,19922],{},[27,19917,19918],{},"Helm release inventory.",[231,19920,19921],{},"helm list -A"," shows each installed chart. Each one is a decision. Database operator chart — will it become a regular job at the destination, with manual replication? Ingress chart — will it become integrated router config? Monitoring chart — will it become external agent? The more charts, the more migration time.",[12,19924,19925,2577,19928,19931],{},[27,19926,19927],{},"Operator inventory.",[231,19929,19930],{},"kubectl get crds"," lists each Custom Resource Definition. 
Each CRD is a critical dependency that probably has no direct equivalent at the destination. If the output has three or four CRDs (cert-manager, ingress-nginx, prometheus-operator, sealed-secrets), that's within the expected range for a small team. If it has thirty CRDs, the migration isn't trivial — you built a platform on top of a platform.",[12,19933,19934,19937],{},[27,19935,19936],{},"RBAC and complex policies inventory."," NetworkPolicy declaring isolation between namespaces, configured PodSecurityPolicy or Pod Security Standards, fine-grained RoleBindings. All of that needs an equivalent at the destination, and the equivalent is rarely 1:1.",[12,19939,19940,19943],{},[27,19941,19942],{},"Traffic volume."," Requests per second at peak hours, simultaneous database connections, aggregated outbound throughput. The destination needs to absorb all of that. If you've never measured, measure now — before committing to a migration schedule.",[12,19945,19946,19949],{},[27,19947,19948],{},"Service to Ingress mapping."," Each exposed Service becomes an entry point at the destination. List the domains, associated certificates, configured sticky sessions, and path-based routing rules. Without this list, the migration breaks exactly at cutover time.",[12,19951,19952],{},"This assessment takes one to two days for a competent dev. It's cheap. Skipping this step is the biggest source of migrations that blow through their schedule.",[19,19954,19956],{"id":19955},"target-stack-decision","Target stack decision",[12,19958,19959],{},"Four main options today for a small team. Each with an explicit trade-off.",[12,19961,19962,19965],{},[27,19963,19964],{},"Option A: Docker Swarm."," Direct compatibility with the Compose format, simple multi-server setup, low learning curve. Good for a one-dev team that already knows Compose. Serious limitation: Swarm has been in maintenance mode for a long time, with no active development of new features. 
Runs and works, but you're betting on a tool that doesn't evolve.",[12,19967,19968,19971],{},[27,19969,19970],{},"Option B: Nomad."," Similar to K8s in declarative model, but simpler and with single binary. Good for those who like robust declarative model and want real high availability. Limitation: the license changed to a restricted model since 2023, and the company behind it was acquired in 2025. For new adoption today, it's a path with an asterisk.",[12,19973,19974,19977],{},[27,19975,19976],{},"Option C: HeroCtl."," Independent orchestrator with replicated control plane, single binary, short configuration. Good for those who want operational simplicity and real high availability from day one. Honest limitation: smaller ecosystem than K8s, without a deep library of ready operators.",[12,19979,19980,19983],{},[27,19981,19982],{},"Option D: self-hosted panel (Coolify, Dokploy, similar)."," Web panel that orchestrates Docker on a machine or small set. Good for very small team without formal HA requirements. Limitation: architecture that doesn't support real distributed consensus across multiple servers — grew, became a single point of failure.",[12,19985,19986],{},"The choice depends on the profile. One-dev team without SLA requirement = Option D. Small team with real HA requirement = Option C. Team that prefers robust declarative model and accepts restricted license = Option B. Team already invested in Compose = Option A.",[19,19988,19990],{"id":19989},"the-five-steps-of-the-migration","The five steps of the migration",[12,19992,19993],{},"From here on the playbook assumes the target stack is HeroCtl, but the skeleton applies to any destination. Adjust the conceptual mappings to Swarm\u002FNomad\u002FCoolify according to choice.",[368,19995,19997],{"id":19996},"step-1-setup-the-destination-in-parallel-one-week","Step 1 — Setup the destination in parallel (one week)",[12,19999,20000],{},"Hard rule: never migrate in-place. 
The current K8s cluster keeps running untouched throughout the migration. The destination is provisioned in parallel, on new servers, with a temporary domain or test subdomain.",[12,20002,20003],{},"Three to five new Linux servers. Install the target stack. Validate that the network between servers works, that storage persists after reboot, that certificates are issued automatically, that secrets can be injected into apps. Connect the destination with the same image registry the current K8s uses — that way the same image running in production goes to the destination without rebuild.",[12,20005,20006],{},"This step is deliberately light. The intent is to prove the destination works with a synthetic app before committing to migrating a real app.",[368,20008,20010],{"id":20009},"step-2-migrating-manifests-to-destination-spec-one-to-two-weeks","Step 2 — Migrating manifests to destination spec (one to two weeks)",[12,20012,20013],{},"Most of the effort is here. Each K8s workload needs to be re-expressed in the destination format. The conceptual mapping from K8s to HeroCtl serves as reference:",[2734,20015,20016,20026,20032,20038,20044,20050,20056,20062,20068,20074,20084],{},[70,20017,20018,20021,20022,20025],{},[27,20019,20020],{},"Deployment + ReplicaSet"," → job with ",[231,20023,20024],{},"replicas: N",". The concept is the same: N copies of the same workload, balanced between servers.",[70,20027,20028,20031],{},[27,20029,20030],{},"Service ClusterIP"," → internal service. In HeroCtl you don't need to create — any task has a name resolvable inside the cluster by default.",[70,20033,20034,20037],{},[27,20035,20036],{},"Service LoadBalancer or Ingress"," → integrated ingress. Without external operator, without separate cert-manager, without ingress-nginx — everything embedded in the orchestrator.",[70,20039,20040,20043],{},[27,20041,20042],{},"Pod"," → task. 1:1 concept.",[70,20045,20046,20049],{},[27,20047,20048],{},"PersistentVolume"," → named volume. 
May require data copy, depending on the storage backend used in K8s.",[70,20051,20052,20055],{},[27,20053,20054],{},"ConfigMap"," → env block or file in the spec. There's no separate object.",[70,20057,20058,20061],{},[27,20059,20060],{},"Secret"," → orchestrator's integrated secret. Encrypted at rest in the control plane.",[70,20063,20064,20067],{},[27,20065,20066],{},"HorizontalPodAutoscaler"," → scaling policy in the job spec. Triggered by CPU usage, RAM, or custom metric.",[70,20069,20070,20073],{},[27,20071,20072],{},"DaemonSet"," → job with placement restriction \"1 per node\".",[70,20075,20076,20079,20080,20083],{},[27,20077,20078],{},"CronJob"," → ",[231,20081,20082],{},"periodic"," type job with cron expression.",[70,20085,20086,20089],{},[27,20087,20088],{},"Helm chart"," → custom spec. Doesn't convert 1:1 — re-write by hand.",[12,20091,20092],{},"In raw lines, the reduction is dramatic. A typical web app in K8s has 30 to 50 lines of Deployment, plus 20 of Service, plus 50 to 100 of Ingress + cert-manager + annotations. Total 100 to 170 lines. The equivalent in HeroCtl sits between 30 and 50 lines, aggregating everything in a single file.",[12,20094,20095],{},"Average migration time per app: one to three days for a competent dev. Ten apps in three weeks is a realistic pace. If it goes much slower, there's a hidden operator or undetected complexity in the assessment — stop and re-measure.",[368,20097,20099],{"id":20098},"step-3-database-and-storage-migration-one-to-three-days","Step 3 — Database and storage migration (one to three days)",[12,20101,20102],{},"Two strategies. If the database is managed (RDS, Cloud SQL, equivalent), the destination just points to the new connection string and that's it — the database stays where it was, platform-agnostic. 
If the database is self-hosted on K8s, it's manual dump-and-restore: pg_dump on the old database, pg_restore on the new, with a short maintenance window at cutover time.",[12,20104,20105],{},"Persistent volumes from K8s become named volumes at the destination. May require data copy via rsync or snapshot — depending on the storage backend, this is an additional window.",[12,20107,20108],{},"Secrets are extracted from K8s and re-inserted at the destination. Use a secure channel (kubectl get secret -o yaml is just a means of reading; never commit an intermediate file). In HeroCtl, secrets are submitted via API with TLS and stay encrypted in the control plane.",[368,20110,20112],{"id":20111},"step-4-cutover-one-to-three-hours-usually-overnight","Step 4 — Cutover (one to three hours, usually overnight)",[12,20114,20115],{},"The critical step. Pre-checks before any DNS change: smoke test on the destination — login works, main page loads, database is connected, latency is acceptable, the queue processes a job, metrics arrive at monitoring. If any of the six fails, abort the cutover.",[12,20117,20118],{},"DNS prepared: TTL reduced to 60 seconds twenty-four hours before the window. Without that, propagation takes hours and rollback is painful.",[12,20120,20121],{},"Cutover proper: change the DNS record to point to the destination IPs. Monitor 5xx and latency in a five-minute window. If something breaks significantly in the first thirty minutes, switch DNS back to K8s — complete rollback in sixty seconds of additional propagation.",[12,20123,20124],{},"Keep the K8s cluster running as standby for thirty days. Don't shut down. 
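The smoke-test gate described for the cutover can be sketched as a short script. The probes here are `true` placeholders, not real checks; in practice each would be a command like `curl -fsS https://staging.example.com/health` (a made-up domain) or a one-row database query:

```shell
#!/bin/sh
# Cutover gate: run every smoke check, abort the DNS change if any fails.
# Each probe is a placeholder -- swap in real commands, e.g.:
#   check "main page loads" curl -fsS https://staging.example.com/
set -u
fail=0

check() {
  name=$1; shift
  if "$@" >/dev/null 2>&1; then
    echo "ok    $name"
  else
    echo "FAIL  $name"
    fail=1
  fi
}

check "login works"        true   # placeholder probe
check "main page loads"    true
check "database connected" true
check "latency acceptable" true
check "queue processes"    true
check "metrics arriving"   true

if [ "$fail" -ne 0 ]; then
  echo "abort cutover: at least one pre-check failed"
  exit 1
fi
echo "all pre-checks passed: safe to flip DNS"
```

Flipping any placeholder to `false` makes the script exit nonzero before the DNS step, which is exactly the abort-the-cutover behavior described above.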
The extra cost is justified: if some latent bug appears in week three of the destination, you still have a place to go back to.",[368,20126,20128],{"id":20127},"step-5-decommission-of-k8s-one-to-two-hours-after-thirty-days","Step 5 — Decommission of K8s (one to two hours, after thirty days)",[12,20130,20131,20132,20135,20136,20139],{},"Thirty days after cutover, with no significant incident, it's time to shut down. ",[231,20133,20134],{},"kubeadm reset"," on each node in the self-hosted case, or ",[231,20137,20138],{},"aws eks delete-cluster"," (or the equivalent in other clouds) in the managed case. Cancel managed addons separately — the bill has items that don't disappear with delete-cluster alone.",[12,20141,20142],{},"Prorated refund of the current month of the managed plan, if the provider offers it. Worker instance cancellation. Final backup of the cluster state before the delete, in case of a future audit.",[19,20144,20146],{"id":20145},"the-six-pitfalls-of-the-path","The six pitfalls of the path",[12,20148,20149],{},"Technical assessment covers what you can measure. The pitfalls below are what escape assessment and break the schedule. Each one has already caused a migration to blow through its deadline at some real team.",[12,20151,20152,20155,20156,20159],{},[27,20153,20154],{},"Pitfall 1: hidden operator dependency."," You think you don't have a complex operator, but cert-manager + ingress-nginx + sealed-secrets is already a stack of three operators. And probably more — kube-state-metrics for monitoring, external-dns to update DNS automatically, reloader to restart pods when a ConfigMap changes. Map ",[179,20157,20158],{},"everything",". Each operator is migration work that a superficial assessment misses.",[12,20161,20162,20165],{},[27,20163,20164],{},"Pitfall 2: assuming a Helm chart is rewritable in a day."," A simple chart with five templates is a few-hours rewrite. 
Complex chart with thirty templates, nested values, pre-install\u002Fpost-install hooks, and subchart dependencies can take a week just to map to equivalent spec. Calibrate the estimate by the most complex chart, not by the simplest.",[12,20167,20168,20171],{},[27,20169,20170],{},"Pitfall 3: undocumented sticky sessions."," ingress-nginx in K8s supports persistent session via annotation configuration. If the app depends on that (shopping cart, admin session, persistent websocket) and nobody documented it, the migration breaks exactly at cutover when a user starts switching between two backend servers and loses session state. Audit ingress configuration upfront — don't trust just what the team remembers.",[12,20173,20174,20177],{},[27,20175,20176],{},"Pitfall 4: different resource limits."," K8s uses limit\u002Frequest with precise semantics: request is guarantee, limit is ceiling. The destination may have a different declarative model (hard limit, or aggregated quota per job, or soft-limit semantics). Tuning error here breaks autoscaling — the app stays underprovisioned in production and doesn't scale when it should, or overprovisioned and wastes capacity. Re-measure real consumption after cutover, adjust limits in the first week.",[12,20179,20180,20183],{},[27,20181,20182],{},"Pitfall 5: log format."," Some K8s ingresses emit log in JSON by default — downstream parser (Loki, Datadog, ELK) is configured for that format. Destination may emit log in plain text or different format. Downstream parsing breaks silently — alerts stop firing because the pattern doesn't match anymore. Verify destination's integrated router log format before cutover.",[12,20185,20186,20189,20190,5839,20193,20196],{},[27,20187,20188],{},"Pitfall 6: coupled CI\u002FCD pipeline."," GitOps with ArgoCD or FluxCD pointing to K8s needs to be reworked. 
If the pipeline applies declarative manifest with ",[231,20191,20192],{},"kubectl apply",[231,20194,20195],{},"helm upgrade",", that doesn't work at the destination. Adapter scripts at the deploy stage are necessary — receive the old manifest, translate to new spec, submit via API. Estimate one to two weeks just for the CI\u002FCD pipeline, separate from manifest migration time.",[19,20198,20200],{"id":20199},"realistic-schedule","Realistic schedule",[12,20202,20203],{},"Honest expectation calibration, in three size ranges.",[12,20205,20206,20209],{},[27,20207,20208],{},"Team of one to two devs, five to ten apps:"," four to six weeks total. Decomposition: one week of destination setup, two to three weeks of manifest migration and adjustment, one to three days of cutover, thirty days of parallel operation, one day of decommission. Note: migration work steals focus from product development during this period. Consider feature freeze window.",[12,20211,20212,20215],{},[27,20213,20214],{},"Team of three to five devs, twenty to fifty apps:"," eight to twelve weeks. Multiplication isn't linear — additional apps increase cutover test matrix. Worth dedicating one person full-time to migration and keeping the rest of the team on product.",[12,20217,20218,20221],{},[27,20219,20220],{},"Company with one hundred or more apps:"," four to six month project, with one to two dedicated people. At this size, migration becomes a phase with project manager, biweekly milestones, and status reports. It's not a sprint.",[19,20223,20225],{"id":20224},"typical-post-migration-results","Typical post-migration results",[12,20227,20228],{},"Ranges observed in teams that completed migration. They're not guarantees — they're reference points.",[2734,20230,20231,20237,20243,20249,20255],{},[70,20232,20233,20236],{},[27,20234,20235],{},"Total RAM reduction:"," 30% to 50%. Kubernetes overhead is real, and disappears when you leave. 
Cluster that used 32 GB of aggregated RAM becomes something between 16 and 22 GB for the same workload.",[70,20238,20239,20242],{},[27,20240,20241],{},"Cloud cost reduction:"," 40% to 70%. Comes from three fronts: no managed control plane (US$73\u002Fmonth per cluster leaves the budget), no NAT gateway per subnet (some providers charge per GB), smaller instances possible (platform overhead exits consumption).",[70,20244,20245,20248],{},[27,20246,20247],{},"Deploy time:"," similar or slightly better. Not where the gain is — K8s is reasonably fast in deploy when configured well.",[70,20250,20251,20254],{},[27,20252,20253],{},"Learning time for new dev:"," one week, against four to six in K8s. The mental model is simpler — fewer intermediate abstractions between \"I want to run this\" and \"it's running\".",[70,20256,20257,20260],{},[27,20258,20259],{},"Monthly operation time:"," one to three dev-hours of maintenance, against twenty to forty in K8s. The bigger gain. It's here that ROI materializes.",[12,20262,20263],{},"To calibrate the last metric: our public demo cluster runs on four servers totaling five vCPUs and ten gigabytes of RAM, with control plane occupying between 200 and 400 MB per server. New coordinator election, in case of current one's failure, takes about seven seconds. Typical application spec in HeroCtl has about fifty lines — compared to three hundred lines or more of YAML in Kubernetes for \"hello world\" equivalent with TLS and ingress.",[19,20265,20267],{"id":20266},"the-inevitable-question-will-it-go-back-to-kubernetes-eventually","The inevitable question: will it go back to Kubernetes eventually?",[12,20269,20270],{},"Honesty. Depends on scale.",[12,20272,20273],{},"Team that grows to thirty or more devs, with one hundred or more servers in production, multi-region, with cross-cluster federation requirement, eventually hits the ceiling of a simpler stack. 
At that scale, K8s becomes a rational choice — the ecosystem gives you tools other stacks don't have. The migration back is a months project, not days, but it's a viable path.",[12,20275,20276],{},"For startups that stay sub-fifty servers over five years — the absolute majority of them — it rarely makes sense to go back. The operational gain of the simpler stack holds throughout the product's useful life.",[12,20278,20279],{},"Reverse migration (HeroCtl → K8s) is also a weeks project, not days. It's not a one-way decision. If the company grows much faster than expected, the path back exists — more expensive than staying, but it exists. The decision to migrate now doesn't lock you in forever.",[19,20281,20283],{"id":20282},"questions-we-receive","Questions we receive",[12,20285,20286,20289],{},[27,20287,20288],{},"How long until ROI?","\nFor one-to-two-dev team with small cluster, migration pays in three to six months — the salary-equivalent of recovered maintenance time exceeds migration project cost. For larger teams, depends on how much the platform team consumed in maintenance; typically six to twelve months.",[12,20291,20292,20295],{},[27,20293,20294],{},"Can I keep Kubernetes for a specific workload and migrate the rest?","\nYes, and in some cases it's the correct strategy. Workload with critical operator (distributed database, queue with managed balancing) stays on K8s. The rest goes to a simpler stack. The two clusters coexist with separate domains or path-based routing on an upstream router. Costs a bit more than consolidating, but avoids re-writing what still works well.",[12,20297,20298,20301],{},[27,20299,20300],{},"Complex Helm charts: worth re-writing?","\nCase by case. Third-party operator chart with fifty files: probably not worth it, keep on K8s or change the technology. 
Own chart with twenty templates: worth it, it's a few-days rewrite and eliminates Helm dependency.",[12,20303,20304,20307],{},[27,20305,20306],{},"Does ArgoCD work with HeroCtl?","\nNot directly — ArgoCD was made to apply K8s manifest. But the GitOps concept works: pipeline observes the repository, translates destination spec to API payload, submits via authenticated curl. Native plugin is under consideration; for now it's a fifty-line adapter script.",[12,20309,20310,20313],{},[27,20311,20312],{},"The team that learned Kubernetes — will they be resentful?","\nLegitimate question. K8s learning curve is real investment, and nobody likes seeing investment discarded. Direct conversation: the skill doesn't disappear. K8s remains a market standard for large scale, and a dev who already mastered it remains employable and valuable. The migration is a product decision for current scale, not a verdict on individual knowledge.",[12,20315,20316,20319],{},[27,20317,20318],{},"Is cloud agnostic more or less viable afterward?","\nMore viable, in practice. Simpler stack runs on any Linux server with Docker — bare metal, VPS from any provider, instance from any cloud. Managed K8s ties you to the provider (EKS on AWS, GKE on Google, AKS on Azure) — each with its own flavor. Leaving expands options.",[12,20321,20322,20325],{},[27,20323,20324],{},"Is there a public case of a company that did this migration?","\nSeveral, but most don't publish at conferences (the narrative vector continues to be K8s for everyone). On forums and in informal conversations, it's easy to find a report. If you want to talk to someone who did the migration, write us — we make the bridge.",[19,20327,3309],{"id":3308},[12,20329,20330],{},"The decision to leave Kubernetes for a simpler stack isn't an admission of defeat — it's recognition that the right tool depends on the company's current scale, and that the company's current scale isn't from the colossus marketing book. 
Small team, small cluster, typical apps, platform consuming half of engineering budget: it's exactly the scenario where reverse migration pays.",[12,20332,20333],{},"It's not an afternoon decision. It's a four-to-six-week project for a small team, with inventory, mapping, overnight cutover, thirty days of parallel operation, and careful decommission. But it's a project whose ROI is measured in dev-hours recovered every month — every month, for the next years of the company.",[12,20335,20336],{},"If you want to try HeroCtl as candidate destination:",[224,20338,20340],{"className":20339,"code":5318,"language":2529},[2527],[231,20341,5318],{"__ignoreMap":229},[12,20343,20344],{},"Runs on any Linux server with Docker. Three servers form a replicated control plane with real high availability. Application spec sits between thirty and fifty lines, aggregating everything needed (replication, ingress, automatic certificate, secrets). The permanently free Community plan covers the entire stack described here — only Business and Enterprise add SSO, granular RBAC, detailed auditing, code escrow and SLA support, geared toward companies with formal platform requirements.",[12,20346,20347,20348,20351,20352,20354],{},"For additional context, ",[3336,20349,20350],{"href":5343},"k3s vs HeroCtl: when each one makes sense"," addresses the choice when the team has already decided to leave vanilla K8s but hesitates between lightweight K8s distribution and independent orchestrator. And ",[3336,20353,15781],{"href":15780}," is the underlying argument for those not yet convinced that complexity is unnecessary at current scale.",[12,20356,20357],{},"Reverse migration isn't a conference headline. 
But it's the right decision for more teams than the public narrative admits.",{"title":229,"searchDepth":244,"depth":244,"links":20359},[20360,20361,20362,20363,20364,20371,20372,20373,20374,20375,20376],{"id":19810,"depth":244,"text":19811},{"id":19859,"depth":244,"text":19860},{"id":19899,"depth":244,"text":19900},{"id":19955,"depth":244,"text":19956},{"id":19989,"depth":244,"text":19990,"children":20365},[20366,20367,20368,20369,20370],{"id":19996,"depth":271,"text":19997},{"id":20009,"depth":271,"text":20010},{"id":20098,"depth":271,"text":20099},{"id":20111,"depth":271,"text":20112},{"id":20127,"depth":271,"text":20128},{"id":20145,"depth":244,"text":20146},{"id":20199,"depth":244,"text":20200},{"id":20224,"depth":244,"text":20225},{"id":20266,"depth":244,"text":20267},{"id":20282,"depth":244,"text":20283},{"id":3308,"depth":244,"text":3309},"2026-03-18","When a company adopts K8s too early, everyone pays. The reverse path — leaving K8s for simpler orchestration — is viable and more common than it seems. 
What to validate before, during and after.",{},{"title":19788,"description":20378},{"loc":8709},"en\u002Fblog\u002Fmigrating-from-kubernetes-to-simpler-stack",[20384,6393,20385,20386,6382],"kubernetes","simplification","case","nRDl4sISs2VpNqYi48cJjczppomYKbT0Bv1OkTfd-hI",{"id":20389,"title":20390,"author":7,"body":20391,"category":6382,"cover":3379,"date":21741,"description":21742,"draft":3382,"extension":3383,"lastReviewed":3379,"meta":21743,"navigation":411,"path":21744,"readingTime":6387,"seo":21745,"sitemap":21746,"stem":21747,"tags":21748,"__hash__":21749},"blog_en\u002Fen\u002Fblog\u002Fmigrating-from-heroku-technical-guide.md","Migrating from Heroku to your own cluster: technical guide in 5 steps",{"type":9,"value":20392,"toc":21728},[20393,20396,20399,20403,20406,20409,20463,20466,20469,20473,20476,20481,20494,20497,20500,20505,20523,20526,20531,20547,20559,20564,20580,20589,20592,20606,20611,20614,20643,20646,20651,20680,20683,20686,20689,20693,20696,20702,20705,20711,20714,20720,20723,20726,20730,20733,20876,20879,20884,20942,20947,21003,21006,21011,21014,21038,21043,21054,21057,21061,21064,21069,21078,21098,21116,21121,21129,21134,21258,21261,21266,21269,21272,21277,21280,21298,21301,21305,21308,21313,21316,21321,21327,21332,21367,21370,21375,21382,21387,21390,21398,21401,21406,21409,21413,21416,21419,21473,21476,21479,21483,21486,21500,21517,21535,21543,21559,21565,21571,21577,21581,21584,21615,21618,21621,21624,21626,21632,21638,21654,21660,21666,21672,21685,21687,21690,21693,21709,21712,21715,21725],[12,20394,20395],{},"On November 28, 2022, Salesforce shut down Heroku's free plan. Hundreds of thousands of hobby projects were wiped at once, and the news cycle lasted a couple of months — people migrating to Render, to Fly.io, to Railway, to any VPS. 
What nobody predicted at that moment is what happened next: four years passed, we're in 2026, and there are still thousands of Brazilian SaaS in production paying between US$25 and US$100 per month per dyno just because \"migrating\" is the thirteenth item in the backlog. There's always a more urgent feature. There's always a customer asking when module X will ship. Migrating gives zero new revenue — so it stays.",[12,20397,20398],{},"This post is the plan to fit that migration into a week of work for a part-time dev, and the rest of a month to stabilize. It's not a manifesto, it's not a vendor comparison, it's not \"come to HeroCtl\". It's a runbook. At the end there's a section on destination options, including our product, but if you finish reading and go to Render or Coolify or Fly.io, the post did its job.",[19,20400,20402],{"id":20401},"why-migrating-still-hurts-the-unspoken-truth","Why migrating still hurts (the unspoken truth)",[12,20404,20405],{},"The first thing that needs to be clear: it's not the Dockerfile. Writing a Dockerfile for a Rails or Node app is half an afternoon — there's a ready template for each framework, there are five posts on DEV explaining, there's Copilot writing it. If your resistance is \"we haven't dockerized yet\", that part is the least important.",[12,20407,20408],{},"The pain is in the ecosystem:",[2734,20410,20411,20429,20435,20441,20447,20453],{},[70,20412,20413,20416,20417,571,20419,571,20422,571,20425,20428],{},[27,20414,20415],{},"Postgres with specific extensions"," that you forgot you enabled in 2019. ",[231,20418,17293],{},[231,20420,20421],{},"pgcrypto",[231,20423,20424],{},"hstore",[231,20426,20427],{},"postgis"," — each one is a reason for the migration to break silently.",[70,20430,20431,20434],{},[27,20432,20433],{},"Redis Premium with persistence"," that you use for the Sidekiq queue AND for cache AND for rate limit. For cache it can restart from zero. 
For queue it can't.",[70,20436,20437,20440],{},[27,20438,20439],{},"Stateful Sidekiq workers"," with jobs scheduled months ahead. Migrating while they run is chasing a moving train.",[70,20442,20443,20446],{},[27,20444,20445],{},"Heroku Scheduler"," with that cron nobody has looked at since 2020 but that produces the CEO's monthly report.",[70,20448,20449,20452],{},[27,20450,20451],{},"Papertrail"," integrated, NewRelic instrumented, Bugsnag on every error — three extra SaaS you don't even know if they'll make sense in the new architecture.",[70,20454,20455,20458,20459,20462],{},[27,20456,20457],{},"Buildpack"," that ran for six years without anyone really knowing what it does. There's a ",[231,20460,20461],{},"bin\u002Fpost_compile"," that minifies something, there's an environment variable that defines which Ruby version — somewhere, your application depends on six buildpack behaviors that were never documented.",[12,20464,20465],{},"And there's the human part: you and your team internalized Heroku primitives over years. Procfile, slug compilation, dynos, release phase, config vars. All of that became intuition. When we go to redo outside Heroku, we redo unconsciously — and generally badly, because Heroku had defaults that hide important decisions that are now yours.",[12,20467,20468],{},"The technical migration takes a week. The mental migration takes a month. This post tries to shorten both.",[19,20470,20472],{"id":20471},"pre-flight-check-one-to-two-hours-before-any-commit","Pre-flight check — one to two hours, before any commit",[12,20474,20475],{},"Before opening the editor, you need the inventory. 
Most migrations that go wrong are because of a surprise that could have been discovered in the first hour.",[12,20477,20478],{},[27,20479,20480],{},"Apps inventory:",[224,20482,20484],{"className":226,"code":20483,"language":228,"meta":229,"style":229},"heroku apps\n",[231,20485,20486],{"__ignoreMap":229},[234,20487,20488,20491],{"class":236,"line":237},[234,20489,20490],{"class":247},"heroku",[234,20492,20493],{"class":255}," apps\n",[12,20495,20496],{},"How many apps exist on the account? Which ones are still really in use? Which ones can become a cron-job and die? Which ones were created for a customer that left in 2021? Mark each one in a spreadsheet with three columns: name, status (live\u002Fzombie\u002Fcron), migration priority (high\u002Fmedium\u002Flow).",[12,20498,20499],{},"Most accounts have 30% zombie apps. Migrating zombies has no ROI — destroying them does.",[12,20501,20502],{},[27,20503,20504],{},"Addons inventory per app:",[224,20506,20508],{"className":226,"code":20507,"language":228,"meta":229,"style":229},"heroku addons -a my-app\n",[231,20509,20510],{"__ignoreMap":229},[234,20511,20512,20514,20517,20520],{"class":236,"line":237},[234,20513,20490],{"class":247},[234,20515,20516],{"class":255}," addons",[234,20518,20519],{"class":251}," -a",[234,20521,20522],{"class":255}," my-app\n",[12,20524,20525],{},"Each line is a future decision. Postgres? Redis? Papertrail? Heroku Scheduler? SendGrid? Mailgun? For each one, write in the spreadsheet: will migrate to self-hosted equivalent, will become external SaaS, or will discard. 
If you don't know what it's for, look it up before — not at cutover time.",[12,20527,20528],{},[27,20529,20530],{},"Buildpacks inventory:",[224,20532,20534],{"className":226,"code":20533,"language":228,"meta":229,"style":229},"heroku buildpacks -a my-app\n",[231,20535,20536],{"__ignoreMap":229},[234,20537,20538,20540,20543,20545],{"class":236,"line":237},[234,20539,20490],{"class":247},[234,20541,20542],{"class":255}," buildpacks",[234,20544,20519],{"class":251},[234,20546,20522],{"class":255},[12,20548,20549,20550,571,20553,571,20556,20558],{},"Multi-buildpack? Custom buildpack? If the output has more than one line, read each one. Custom buildpacks usually have hooks (",[231,20551,20552],{},"bin\u002Frelease",[231,20554,20555],{},"bin\u002Fcompile",[231,20557,20461],{},") that execute specific things. You'll need to replicate these steps in the Dockerfile or in a release container.",[12,20560,20561],{},[27,20562,20563],{},"Env vars inventory:",[224,20565,20567],{"className":226,"code":20566,"language":228,"meta":229,"style":229},"heroku config -a my-app\n",[231,20568,20569],{"__ignoreMap":229},[234,20570,20571,20573,20576,20578],{"class":236,"line":237},[234,20572,20490],{"class":247},[234,20574,20575],{"class":255}," config",[234,20577,20519],{"class":251},[234,20579,20522],{"class":255},[12,20581,20582,20583,571,20585,20588],{},"Export everything to a secure file. DO NOT commit. DO NOT send via Slack. DO NOT paste into ChatGPT. This file has ",[231,20584,453],{},[231,20586,20587],{},"SECRET_KEY_BASE",", payment API key. 
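Turning that exported output into a container-friendly `.env` file is worth a tiny script, because `heroku config` prints `KEY: value` pairs rather than `KEY=value`. A minimal sketch over fabricated sample data (the keys and values are invented; in real use you'd pipe in `heroku config -a my-app` instead):

```shell
#!/bin/sh
# Convert `heroku config` output ("KEY: value") into .env lines ("KEY=value").
# Real use:  heroku config -a my-app | ...
# The sample below is fabricated so the sketch runs without a Heroku account.
set -eu

sample='DATABASE_URL: postgres://user:pass@host:5432/db
RACK_ENV:     production
SECRET_KEY_BASE: abc123'

ENV_FILE="${TMPDIR:-/tmp}/app.env"

# Drop the header line `heroku config` prints ("=== my-app Config Vars"),
# then rewrite only the first ": " separator, leaving colons in values alone.
printf '%s\n' "$sample" \
  | grep -v '^===' \
  | sed 's/^\([A-Z0-9_]*\): */\1=/' > "$ENV_FILE"

chmod 600 "$ENV_FILE"   # this file is a credential; lock it down
cat "$ENV_FILE"
```

The anchored sed pattern matters: a naive global `s/: /=/` would also mangle the colons inside `postgres://user:pass@host`. Heroku's CLI also offers `heroku config -s`, which emits `KEY=value` directly; the sed pass is for when all you have is the default colon-separated export.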
Treat as password, because that's exactly what it is.",[12,20590,20591],{},"Watch for two pitfalls:",[2734,20593,20594,20600],{},[70,20595,20596,20597,20599],{},"Variables with ",[231,20598,1272],{}," character in the name (some old libs use it) escape differently in containers.",[70,20601,20602,20605],{},[231,20603,20604],{},"BUNDLE_WITHOUT=development:test"," saved in production is a time bomb after migration.",[12,20607,20608],{},[27,20609,20610],{},"Procfile inventory:",[12,20612,20613],{},"Each Procfile line is a service:",[2734,20615,20616,20622,20628,20634],{},[70,20617,20618,20621],{},[231,20619,20620],{},"web"," becomes the main container.",[70,20623,20624,20627],{},[231,20625,20626],{},"worker"," becomes a second container or separate job.",[70,20629,20630,20633],{},[231,20631,20632],{},"release"," becomes a pre-deploy step (typically migrations).",[70,20635,20636,5839,20639,20642],{},[231,20637,20638],{},"clock",[231,20640,20641],{},"scheduler"," becomes a cron job.",[12,20644,20645],{},"If your Procfile has five lines, you'll have five services at the destination. They're not details — they're the topology design.",[12,20647,20648],{},[27,20649,20650],{},"Current metrics:",[224,20652,20654],{"className":226,"code":20653,"language":228,"meta":229,"style":229},"heroku ps -a my-app\nheroku logs --tail -a my-app\n",[231,20655,20656,20666],{"__ignoreMap":229},[234,20657,20658,20660,20662,20664],{"class":236,"line":237},[234,20659,20490],{"class":247},[234,20661,9410],{"class":255},[234,20663,20519],{"class":251},[234,20665,20522],{"class":255},[234,20667,20668,20670,20673,20676,20678],{"class":236,"line":244},[234,20669,20490],{"class":247},[234,20671,20672],{"class":255}," logs",[234,20674,20675],{"class":251}," --tail",[234,20677,20519],{"class":251},[234,20679,20522],{"class":255},[12,20681,20682],{},"How many dynos running? What type (Standard-1X, Performance-M)? Log volume per minute? Average latency on NewRelic? 
CPU\u002Fmemory peak last month?",[12,20684,20685],{},"These numbers are what you use to size the destination. Migrating and discovering later that memory is half of what's needed is the fastest way to break confidence in the entire project.",[12,20687,20688],{},"At the end of pre-flight you have a spreadsheet with everything. That file is the heart of the migration. Every decision comes back to it.",[19,20690,20692],{"id":20691},"step-1-target-stack-choice-architectural-decision-30-minutes","Step 1 — Target stack choice (architectural decision, 30 minutes)",[12,20694,20695],{},"Three possible paths. I'll be honest about each one.",[12,20697,20698,20701],{},[27,20699,20700],{},"Option A — Single VPS with self-hosted panel.","\nA server on DigitalOcean or Hetzner, install Coolify or Dokploy, deploy your app via the panel. Cost: R$30 to R$50 per month to start, scales well to about 10 apps on a medium server. No high availability — if the server goes down, everything goes down. SLA you can promise: best-effort.",[12,20703,20704],{},"Ideal for: indie hackers, personal projects, MVPs, SaaS without a customer requiring a written SLA.",[12,20706,20707,20710],{},[27,20708,20709],{},"Option B — Cluster with high availability.","\nThree or more servers, an orchestrator coordinating between them, survives the crash of one server without affecting traffic. Cost: R$150 to R$300 per month for a cluster of three modest nodes. Possible SLA: 99.9% without despair.",[12,20712,20713],{},"Ideal for: B2B SaaS with paying customers, any application where half an hour of downtime generates a support ticket.",[12,20715,20716,20719],{},[27,20717,20718],{},"Option C — External managed platform.","\nRender, Railway, Fly.io. You pay more, but zero ops. 
Cost: R$200 to R$500 per month for workload comparable to 2-3 Heroku dynos, linear scaling from there.",[12,20721,20722],{},"Ideal for: team that has absolutely nobody to take care of server and prefers transferring the problem to another company.",[12,20724,20725],{},"Honest decision, in one question: do you have a customer requiring SLA? If not, option A. If yes, B. If the team has nobody willing to learn minimum ops, C. There's no universal right answer — there's a right answer for your context. Mixing the three is also valid: main app on B, internal tool on A, isolated scheduler on C.",[19,20727,20729],{"id":20728},"step-2-dockerization-half-a-day-to-two-days-per-app","Step 2 — Dockerization (half a day to two days per app)",[12,20731,20732],{},"Here the technical work begins. The general logic is the same for any stack:",[224,20734,20738],{"className":20735,"code":20736,"language":20737,"meta":229,"style":229},"language-dockerfile shiki shiki-themes github-dark-default","FROM ruby:3.3-slim AS builder\nWORKDIR \u002Fapp\nCOPY Gemfile Gemfile.lock .\u002F\nRUN bundle install --without development test\nCOPY . 
.\nRUN bundle exec rake assets:precompile\n\nFROM ruby:3.3-slim\nWORKDIR \u002Fapp\nRUN apt-get update && apt-get install -y --no-install-recommends \\\n    libpq5 nodejs && rm -rf \u002Fvar\u002Flib\u002Fapt\u002Flists\u002F*\nCOPY --from=builder \u002Fusr\u002Flocal\u002Fbundle \u002Fusr\u002Flocal\u002Fbundle\nCOPY --from=builder \u002Fapp \u002Fapp\nEXPOSE 3000\nCMD [\"bundle\", \"exec\", \"puma\", \"-C\", \"config\u002Fpuma.rb\"]\n","dockerfile",[231,20739,20740,20754,20762,20770,20778,20785,20792,20796,20803,20809,20816,20821,20828,20835,20843],{"__ignoreMap":229},[234,20741,20742,20745,20748,20751],{"class":236,"line":237},[234,20743,20744],{"class":383},"FROM",[234,20746,20747],{"class":387}," ruby:3.3-slim ",[234,20749,20750],{"class":383},"AS",[234,20752,20753],{"class":387}," builder\n",[234,20755,20756,20759],{"class":236,"line":244},[234,20757,20758],{"class":383},"WORKDIR",[234,20760,20761],{"class":387}," \u002Fapp\n",[234,20763,20764,20767],{"class":236,"line":271},[234,20765,20766],{"class":383},"COPY",[234,20768,20769],{"class":387}," Gemfile Gemfile.lock .\u002F\n",[234,20771,20772,20775],{"class":236,"line":415},[234,20773,20774],{"class":383},"RUN",[234,20776,20777],{"class":387}," bundle install --without development test\n",[234,20779,20780,20782],{"class":236,"line":434},[234,20781,20766],{"class":383},[234,20783,20784],{"class":387}," . 
.\n",[234,20786,20787,20789],{"class":236,"line":459},[234,20788,20774],{"class":383},[234,20790,20791],{"class":387}," bundle exec rake assets:precompile\n",[234,20793,20794],{"class":236,"line":464},[234,20795,412],{"emptyLinePlaceholder":411},[234,20797,20798,20800],{"class":236,"line":479},[234,20799,20744],{"class":383},[234,20801,20802],{"class":387}," ruby:3.3-slim\n",[234,20804,20805,20807],{"class":236,"line":484},[234,20806,20758],{"class":383},[234,20808,20761],{"class":387},[234,20810,20811,20813],{"class":236,"line":490},[234,20812,20774],{"class":383},[234,20814,20815],{"class":387}," apt-get update && apt-get install -y --no-install-recommends \\\n",[234,20817,20818],{"class":236,"line":508},[234,20819,20820],{"class":387},"    libpq5 nodejs && rm -rf \u002Fvar\u002Flib\u002Fapt\u002Flists\u002F*\n",[234,20822,20823,20825],{"class":236,"line":529},[234,20824,20766],{"class":383},[234,20826,20827],{"class":387}," --from=builder \u002Fusr\u002Flocal\u002Fbundle \u002Fusr\u002Flocal\u002Fbundle\n",[234,20829,20830,20832],{"class":236,"line":535},[234,20831,20766],{"class":383},[234,20833,20834],{"class":387}," --from=builder \u002Fapp \u002Fapp\n",[234,20836,20837,20840],{"class":236,"line":546},[234,20838,20839],{"class":383},"EXPOSE",[234,20841,20842],{"class":387}," 3000\n",[234,20844,20845,20848,20851,20854,20856,20859,20861,20864,20866,20869,20871,20874],{"class":236,"line":552},[234,20846,20847],{"class":383},"CMD",[234,20849,20850],{"class":387}," [",[234,20852,20853],{"class":255},"\"bundle\"",[234,20855,571],{"class":387},[234,20857,20858],{"class":255},"\"exec\"",[234,20860,571],{"class":387},[234,20862,20863],{"class":255},"\"puma\"",[234,20865,571],{"class":387},[234,20867,20868],{"class":255},"\"-C\"",[234,20870,571],{"class":387},[234,20872,20873],{"class":255},"\"config\u002Fpuma.rb\"",[234,20875,9527],{"class":387},[12,20877,20878],{},"Multi-stage. Heavy build stays in a stage that's discarded. 
Final image has only what's necessary to run.",[12,20880,20881],{},[27,20882,20883],{},"By language:",[2734,20885,20886,20902,20914,20929],{},[70,20887,20888,6562,20891,20894,20895,571,20898,20901],{},[27,20889,20890],{},"Ruby\u002FRails",[231,20892,20893],{},"ruby:3.x-slim"," as base, multi-stage to reduce size. Heroku's slug compilation became your own lines in the Dockerfile — ",[231,20896,20897],{},"bundle install",[231,20899,20900],{},"assets:precompile",", copy artifacts.",[70,20903,20904,6562,20907,20910,20911,101],{},[27,20905,20906],{},"Node",[231,20908,20909],{},"node:20-alpine"," solves most cases. Watch for deps with native binaries (sharp, bcrypt, sqlite3, canvas) — Alpine uses musl, and some libs require glibc. If it breaks, switch to ",[231,20912,20913],{},"node:20-slim",[70,20915,20916,6562,20919,20922,20923,5839,20925,20928],{},[27,20917,20918],{},"Python\u002FDjango",[231,20920,20921],{},"python:3.x-slim",", gunicorn or uvicorn as server. ",[231,20924,8507],{},[231,20926,20927],{},"pyproject.toml"," in the build stage.",[70,20930,20931,6562,20934,20937,20938,20941],{},[27,20932,20933],{},"Elixir\u002FPhoenix",[231,20935,20936],{},"elixir:1.x-alpine",", release as artifact (",[231,20939,20940],{},"mix release","), runtime image with only erlang.",[12,20943,20944],{},[27,20945,20946],{},"Procfile → Docker mapping:",[119,20948,20949,20959],{},[122,20950,20951],{},[125,20952,20953,20956],{},[128,20954,20955],{},"Procfile",[128,20957,20958],{},"Equivalent at destination",[141,20960,20961,20973,20983,20993],{},[125,20962,20963,20968],{},[146,20964,20965],{},[231,20966,20967],{},"web: bundle exec puma",[146,20969,20970,20972],{},[231,20971,20847],{}," of main container",[125,20974,20975,20980],{},[146,20976,20977],{},[231,20978,20979],{},"worker: bundle exec sidekiq",[146,20981,20982],{},"Separate container, same image, different command",[125,20984,20985,20990],{},[146,20986,20987],{},[231,20988,20989],{},"release: bundle exec rake 
db:migrate",[146,20991,20992],{},"Release job, runs before rolling update deploy",[125,20994,20995,21000],{},[146,20996,20997],{},[231,20998,20999],{},"clock: bundle exec clockwork",[146,21001,21002],{},"Cron job, or singleton container",[12,21004,21005],{},"Most modern orchestrators (HeroCtl, Render, Railway, Coolify) understand these four formats directly.",[12,21007,21008],{},[27,21009,21010],{},"Assets:",[12,21012,21013],{},"Heroku slug compilation does precompile automatically. In Docker you need to think:",[2734,21015,21016,21022,21028],{},[70,21017,21018,21019,20928],{},"Rails: ",[231,21020,21021],{},"RUN bundle exec rake assets:precompile",[70,21023,21024,21025,20928],{},"Node: ",[231,21026,21027],{},"RUN npm run build",[70,21029,21030,21031,2402,21034,21037],{},"Asset host (CDN): if you use CloudFlare or S3 to serve static, configure ",[231,21032,21033],{},"RAILS_SERVE_STATIC_FILES",[231,21035,21036],{},"ASSET_HOST"," correctly.",[12,21039,21040],{},[27,21041,21042],{},"Realistic average time:",[2734,21044,21045,21048,21051],{},[70,21046,21047],{},"Medium Rails app (CRUD with Sidekiq): 1 to 2 days.",[70,21049,21050],{},"Simple Node app (API, no heavy frontend build): 4 hours.",[70,21052,21053],{},"App with 5+ stateful workers and media processing: 3 to 5 days.",[12,21055,21056],{},"The first app takes longer. The second takes half. From the third on, it's mechanical.",[19,21058,21060],{"id":21059},"step-3-database-migration-the-riskiest-part-2-to-8-hours","Step 3 — Database migration (the riskiest part, 2 to 8 hours)",[12,21062,21063],{},"Here lives the fear. The database is the only place where \"going back\" is expensive. Everything else is redeploy.",[12,21065,21066],{},[27,21067,21068],{},"Postgres:",[12,21070,21071,21072,21074,21075,21077],{},"Heroku Postgres exposes direct access via ",[231,21073,5736],{}," if you have the credentials (they're in ",[231,21076,453],{},"). 
Before anything, find out your extensions:",[224,21079,21083],{"className":21080,"code":21081,"language":21082,"meta":229,"style":229},"language-sql shiki shiki-themes github-dark-default","SELECT extname, extversion FROM pg_extension;\n","sql",[231,21084,21085],{"__ignoreMap":229},[234,21086,21087,21090,21093,21095],{"class":236,"line":237},[234,21088,21089],{"class":383},"SELECT",[234,21091,21092],{"class":387}," extname, extversion ",[234,21094,20744],{"class":383},[234,21096,21097],{"class":387}," pg_extension;\n",[12,21099,21100,21101,571,21103,571,21105,571,21107,571,21109,571,21112,21115],{},"Common ones: ",[231,21102,20421],{},[231,21104,20424],{},[231,21106,20427],{},[231,21108,17293],{},[231,21110,21111],{},"uuid-ossp",[231,21113,21114],{},"unaccent",". If the destination doesn't have all, or has them in a different version, you find out before — not in the middle of restore at 3 AM.",[12,21117,21118],{},[27,21119,21120],{},"Possible destination for Postgres:",[2734,21122,21123,21126],{},[70,21124,21125],{},"Postgres running as a job in the cluster itself (smaller RPO\u002FRTO, total control, you take care of backup).",[70,21127,21128],{},"Regional managed Postgres — RDS São Paulo, Neon, Supabase, Aiven. 
More expensive, less ops.",[12,21130,21131],{},[27,21132,21133],{},"Migration with minimum downtime — option A (with window):",[224,21135,21137],{"className":226,"code":21136,"language":228,"meta":229,"style":229},"# Drains traffic: puts app in maintenance, waits for Sidekiq to drain\nheroku maintenance:on -a my-app\n\n# Dump\npg_dump $HEROKU_DATABASE_URL --no-owner --no-privileges --format=custom --file=dump.sql\n\n# Restore at destination\npg_restore --no-owner --no-privileges --dbname=$DEST_DATABASE_URL dump.sql\n\n# Smoke test at destination\npsql $DEST_DATABASE_URL -c 'SELECT count(*) FROM users;'\n\n# DNS cutover, app at destination points to new database\nheroku maintenance:off -a my-app  # optional, just for Heroku to keep serving \u002Fhealthz\n",[231,21138,21139,21144,21155,21159,21164,21183,21187,21192,21211,21215,21220,21234,21238,21243],{"__ignoreMap":229},[234,21140,21141],{"class":236,"line":237},[234,21142,21143],{"class":240},"# Drains traffic: puts app in maintenance, waits for Sidekiq to drain\n",[234,21145,21146,21148,21151,21153],{"class":236,"line":244},[234,21147,20490],{"class":247},[234,21149,21150],{"class":255}," maintenance:on",[234,21152,20519],{"class":251},[234,21154,20522],{"class":255},[234,21156,21157],{"class":236,"line":271},[234,21158,412],{"emptyLinePlaceholder":411},[234,21160,21161],{"class":236,"line":415},[234,21162,21163],{"class":240},"# Dump\n",[234,21165,21166,21168,21171,21174,21177,21180],{"class":236,"line":434},[234,21167,5736],{"class":247},[234,21169,21170],{"class":387}," $HEROKU_DATABASE_URL ",[234,21172,21173],{"class":251},"--no-owner",[234,21175,21176],{"class":251}," --no-privileges",[234,21178,21179],{"class":251}," --format=custom",[234,21181,21182],{"class":251}," --file=dump.sql\n",[234,21184,21185],{"class":236,"line":459},[234,21186,412],{"emptyLinePlaceholder":411},[234,21188,21189],{"class":236,"line":464},[234,21190,21191],{"class":240},"# Restore at 
destination\n",[234,21193,21194,21197,21200,21202,21205,21208],{"class":236,"line":479},[234,21195,21196],{"class":247},"pg_restore",[234,21198,21199],{"class":251}," --no-owner",[234,21201,21176],{"class":251},[234,21203,21204],{"class":251}," --dbname=",[234,21206,21207],{"class":387},"$DEST_DATABASE_URL",[234,21209,21210],{"class":255}," dump.sql\n",[234,21212,21213],{"class":236,"line":484},[234,21214,412],{"emptyLinePlaceholder":411},[234,21216,21217],{"class":236,"line":490},[234,21218,21219],{"class":240},"# Smoke test at destination\n",[234,21221,21222,21225,21228,21231],{"class":236,"line":508},[234,21223,21224],{"class":247},"psql",[234,21226,21227],{"class":387}," $DEST_DATABASE_URL ",[234,21229,21230],{"class":251},"-c",[234,21232,21233],{"class":255}," 'SELECT count(*) FROM users;'\n",[234,21235,21236],{"class":236,"line":529},[234,21237,412],{"emptyLinePlaceholder":411},[234,21239,21240],{"class":236,"line":535},[234,21241,21242],{"class":240},"# DNS cutover, app at destination points to new database\n",[234,21244,21245,21247,21250,21252,21255],{"class":236,"line":546},[234,21246,20490],{"class":247},[234,21248,21249],{"class":255}," maintenance:off",[234,21251,20519],{"class":251},[234,21253,21254],{"class":255}," my-app",[234,21256,21257],{"class":240},"  # optional, just for Heroku to keep serving \u002Fhealthz\n",[12,21259,21260],{},"Typical window: 30 minutes to 2 hours, depending on database size. For a base under 5GB, 30 min is comfortable.",[12,21262,21263],{},[27,21264,21265],{},"Migration with minimum downtime — option B (logical replication):",[12,21267,21268],{},"Postgres logical replication allows you to start the copy while the app continues writing to Heroku. When the replica reaches the current state, do the DNS cutover and the destination becomes the new primary.",[12,21270,21271],{},"Works if the destination can reach Heroku via network. 
For Heroku Postgres you need to whitelist the destination IP (Heroku has a mechanism for that on paid plans). Setup takes an afternoon, cutover lasts seconds.",[12,21273,21274],{},[27,21275,21276],{},"Redis:",[12,21278,21279],{},"Two distinct natures — treat differently:",[2734,21281,21282,21288],{},[70,21283,21284,21287],{},[27,21285,21286],{},"Redis as cache",": simply restart from zero at destination. Cache reheats by itself. Nothing to migrate.",[70,21289,21290,21293,21294,21297],{},[27,21291,21292],{},"Redis as Sidekiq\u002FResque queue with persistence",": here it hurts. Snapshot via ",[231,21295,21296],{},"BGSAVE",", transfer the RDB, restore at destination. Or: pause workers on Heroku, process the queue to completion, do cutover with empty queue.",[12,21299,21300],{},"Heroku Redis Premium has persistence enabled by default; simple Redis at destination may not — check before.",[19,21302,21304],{"id":21303},"step-4-dns-ssl-and-cutover-1-to-3-hours","Step 4 — DNS, SSL and cutover (1 to 3 hours)",[12,21306,21307],{},"Cutover is the moment of truth. Everything before was preparation.",[12,21309,21310],{},[27,21311,21312],{},"24 hours before:",[12,21314,21315],{},"Reduce DNS record TTL to 60 seconds. This ensures that when you point to the destination, propagation is fast. High TTL is what makes cutover become a 6-hour nightmare with half the customers still hitting the old server.",[12,21317,21318],{},[27,21319,21320],{},"Parallel setup:",[12,21322,21323,21324,622],{},"App running in parallel at both destinations. Heroku continues responding on the old domain. 
Destination responds on a temporary domain (e.g., ",[231,21325,21326],{},"app-new.heroctl.com",[12,21328,21329],{},[27,21330,21331],{},"Smoke test at destination:",[224,21333,21335],{"className":226,"code":21334,"language":228,"meta":229,"style":229},"curl https:\u002F\u002Fapp-new.heroctl.com\u002Fhealthz\ncurl https:\u002F\u002Fapp-new.heroctl.com\u002Fapi\u002Fv1\u002Fusers -H \"Authorization: Bearer $TOKEN\"\n# Hit critical endpoints manually, with human eyes\n",[231,21336,21337,21344,21362],{"__ignoreMap":229},[234,21338,21339,21341],{"class":236,"line":237},[234,21340,1220],{"class":247},[234,21342,21343],{"class":255}," https:\u002F\u002Fapp-new.heroctl.com\u002Fhealthz\n",[234,21345,21346,21348,21351,21354,21357,21360],{"class":236,"line":244},[234,21347,1220],{"class":247},[234,21349,21350],{"class":255}," https:\u002F\u002Fapp-new.heroctl.com\u002Fapi\u002Fv1\u002Fusers",[234,21352,21353],{"class":251}," -H",[234,21355,21356],{"class":255}," \"Authorization: Bearer ",[234,21358,21359],{"class":387},"$TOKEN",[234,21361,1207],{"class":255},[234,21363,21364],{"class":236,"line":271},[234,21365,21366],{"class":240},"# Hit critical endpoints manually, with human eyes\n",[12,21368,21369],{},"If something is wrong, find out now. After cutover you'll be dealing with support tickets simultaneously.",[12,21371,21372],{},[27,21373,21374],{},"Cutover:",[12,21376,21377,21378,21381],{},"Change the CNAME (or A record) of the production domain to the destination. Within 60 seconds, new requests go to the new destination. Heroku continues responding on the old domain (the ",[231,21379,21380],{},"*.herokuapp.com"," URL) for 30 days — that's an important safety belt.",[12,21383,21384],{},[27,21385,21386],{},"SSL\u002FTLS:",[12,21388,21389],{},"Heroku had embedded automatic certificate. 
At the destination, depending on the choice:",[2734,21391,21392,21395],{},[70,21393,21394],{},"HeroCtl, Coolify, Render, Railway, Fly.io: automatic certificate via Let's Encrypt, without you having to think about it.",[70,21396,21397],{},"Bare single VPS: you configure a cert-manager equivalent, or Caddy with ACME, or nginx + certbot.",[12,21399,21400],{},"Before DNS cutover, validate that the destination issued the certificate for the domain. Let's Encrypt validates via HTTP-01 or DNS-01 — the HTTP-01 challenge only works after DNS points at the destination, so there's a chicken-and-egg problem. Solution: issue via DNS-01 first (it doesn't need DNS pointing to the destination), or accept 30 seconds of TLS errors at the cutover moment.",[12,21402,21403],{},[27,21404,21405],{},"Sticky sessions:",[12,21407,21408],{},"If your app uses WebSocket, or has in-memory sessions (instead of Redis or database), you need sticky sessions at the load balancer. Heroku didn't do that by default, but some apps end up depending on stable routing without realizing it. At the destination, configure cookie-based session affinity if necessary.",[19,21410,21412],{"id":21411},"step-5-heroku-decommission-1-hour-30-days-later","Step 5 — Heroku decommission (1 hour, 30 days later)",[12,21414,21415],{},"Thirty days is the safety belt. Keep the app on Heroku running, without traffic (after all, DNS already points elsewhere), just in case of emergency. 
Cost: what you were already paying, divided proportionally up to the cancellation date.",[12,21417,21418],{},"Thirty days later, if nothing broke:",[224,21420,21422],{"className":226,"code":21421,"language":228,"meta":229,"style":229},"heroku addons:destroy heroku-postgresql -a my-app\nheroku addons:destroy heroku-redis -a my-app\nheroku addons:destroy papertrail -a my-app\nheroku apps:destroy my-app\n",[231,21423,21424,21438,21451,21464],{"__ignoreMap":229},[234,21425,21426,21428,21431,21434,21436],{"class":236,"line":237},[234,21427,20490],{"class":247},[234,21429,21430],{"class":255}," addons:destroy",[234,21432,21433],{"class":255}," heroku-postgresql",[234,21435,20519],{"class":251},[234,21437,20522],{"class":255},[234,21439,21440,21442,21444,21447,21449],{"class":236,"line":244},[234,21441,20490],{"class":247},[234,21443,21430],{"class":255},[234,21445,21446],{"class":255}," heroku-redis",[234,21448,20519],{"class":251},[234,21450,20522],{"class":255},[234,21452,21453,21455,21457,21460,21462],{"class":236,"line":271},[234,21454,20490],{"class":247},[234,21456,21430],{"class":255},[234,21458,21459],{"class":255}," papertrail",[234,21461,20519],{"class":251},[234,21463,20522],{"class":255},[234,21465,21466,21468,21471],{"class":236,"line":415},[234,21467,20490],{"class":247},[234,21469,21470],{"class":255}," apps:destroy",[234,21472,20522],{"class":255},[12,21474,21475],{},"Each addon must be cancelled separately — some have their own billing that continues even with the app destroyed. Check next month's bill with a magnifying glass.",[12,21477,21478],{},"Heroku does pro-rata refund up to the cancellation day. Don't forget to cancel the entire account if it's the last app — otherwise you pay platform fee every month for nothing.",[19,21480,21482],{"id":21481},"common-pitfalls","Common pitfalls",[12,21484,21485],{},"Most migrations get stuck on these eight things. 
Read everything before starting.",[12,21487,21488,21491,21492,571,21494,571,21496,21499],{},[27,21489,21490],{},"Invisible slug compilation hooks."," Old apps have ",[231,21493,20552],{},[231,21495,20461],{},[231,21497,21498],{},"bin\u002Fpre_compile",". These scripts run inside the buildpack and do things like minifying JS, generating derived files, or running a migration nobody remembers. Before Dockerizing, open each one and replicate it in a Dockerfile step or in a release container.",[12,21501,21502,21505,21506,21509,21510,21512,21513,21516],{},[27,21503,21504],{},"Config vars with broken format."," Heroku accepts ",[231,21507,21508],{},"MY:VAR"," as a variable name (with ",[231,21511,1272],{},"). Containers generally accept it too, but some orchestration tools escape it differently. Rename to ",[231,21514,21515],{},"MY_VAR"," before migrating.",[12,21518,21519,21522,21523,21526,21527,21530,21531,21534],{},[27,21520,21521],{},"Redis URL with variant format."," Heroku uses ",[231,21524,21525],{},"redis:\u002F\u002Fh:password@host:port",". Some clients (old Ruby gems mainly) expect ",[231,21528,21529],{},"redis:\u002F\u002F:password@host:port",". If you see ",[231,21532,21533],{},"Redis::CommandError: WRONGPASS",", that's probably it.",[12,21536,21537,21542],{},[27,21538,21539,21541],{},[231,21540,20604],{}," saved in env."," When you run that same container outside Heroku, it continues without installing development gems. In production, that's fine. In staging, where you need to run tests, it breaks. Clean that variable before using the config dump in another environment.",[12,21544,21545,2577,21548,21551,21552,571,21555,21558],{},[27,21546,21547],{},"Heroku-specific gems.",[231,21549,21550],{},"rails_12factor"," (deprecated but still present in 2014-era apps), ",[231,21553,21554],{},"heroku_san",[231,21556,21557],{},"taps",". Remove them, done. 
If something depends on them, swap in a standard equivalent.",[12,21560,21561,21564],{},[27,21562,21563],{},"DNS with Heroku-DNS-Target."," Heroku recommends using ALIAS or ANAME to point to the app, instead of CNAME, for domain roots. When migrating, switch to an A record pointing directly at the destination IP. ALIAS pointing to Heroku is what will screw you on apex domains.",[12,21566,21567,21570],{},[27,21568,21569],{},"Papertrail \u002F NewRelic \u002F Bugsnag turned off without a substitute."," Logs and observability are easy to leave for later and break in the first hour post-migration. Before cutover, you must have: centralized logs (HeroCtl has a single embedded writer; Render exposes via UI; Coolify has optional Loki), basic metrics (CPU, memory, requests), and some error tool (self-hosted Sentry or SaaS).",[12,21572,21573,21576],{},[27,21574,21575],{},"Sidekiq\u002FResque with in-flight jobs during cutover."," At the cutover moment, some jobs go to the destination queue without having been processed at origin. If your job isn't idempotent (can run twice without side effects), that's a problem. 
Solution: pause Heroku workers 5 minutes before cutover, wait for queue to drain, do cutover with empty queue.",[19,21578,21580],{"id":21579},"realistic-schedule-for-medium-startup-5-to-10-heroku-apps","Realistic schedule for medium startup (5 to 10 Heroku apps)",[12,21582,21583],{},"Small team, one part-time dev:",[2734,21585,21586,21592,21598,21604,21610],{},[70,21587,21588,21591],{},[27,21589,21590],{},"Week 1",": complete pre-flight + stack choice + destination setup (empty cluster running, panel accessible).",[70,21593,21594,21597],{},[27,21595,21596],{},"Week 2",": Dockerization of first low-risk app + database migration in staging environment.",[70,21599,21600,21603],{},[27,21601,21602],{},"Week 3",": cutover of first app in production + 7-day validation.",[70,21605,21606,21609],{},[27,21607,21608],{},"Weeks 4 to 6",": migration of remaining apps in parallel, pace of 1 to 2 per week.",[70,21611,21612,21614],{},[27,21613,16913],{},": 4 to 6 weeks of elapsed time, maybe 80 hours of effective work distributed.",[12,21616,21617],{},"Medium team (3 devs, 20 apps): 8 weeks, 200 hours of effective work.",[12,21619,21620],{},"Large team (cluster of 50+ apps): treat as formal project, with project manager, and calculate quarter.",[12,21622,21623],{},"Rule of thumb: never migrate more than 2 apps in parallel if it's the same dev doing it. Context-switching cost eats the parallelism gain.",[19,21625,3225],{"id":3224},[12,21627,21628,21631],{},[27,21629,21630],{},"How much does the migration cost in person-hours?","\nFor a 5-app SaaS, part-time dev: ~80 hours. At R$200\u002Fh, R$16k. Compared to R$2k\u002Fmonth of Heroku bill you save, payback in 8 months. In the next 4 years, it's just savings.",[12,21633,21634,21637],{},[27,21635,21636],{},"What if I don't have Docker setup?","\nYou don't need to pre-install anything — destination platforms build the image for you (Render, Railway, Fly.io accept Dockerfile direct from git). 
HeroCtl requires the image in a registry, so you push to ECR, GCR, Docker Hub or GHCR. For local use, install Docker Desktop and you're ready.",[12,21639,21640,21643,21644,21646,21647,21649,21650,21653],{},[27,21641,21642],{},"Does Heroku Postgres have an export limit?","\nThere's an IOPS limit during ",[231,21645,5736],{}," on low plans. Databases above 5GB on the Hobby plan may need ",[231,21648,5736],{}," in parallel mode (",[231,21651,21652],{},"-j",") or use logical replication to avoid heavy load. For Standard or higher, there's no relevant problem.",[12,21655,21656,21659],{},[27,21657,21658],{},"Do Sidekiq scheduled jobs survive?","\nThey survive if you migrate Redis with a snapshot (BGSAVE → restore). If you restart Redis from zero at the destination, you lose scheduled jobs. Consider that at cutover: either transfer Redis along with it, or accept manually rescheduling some jobs.",[12,21661,21662,21665],{},[27,21663,21664],{},"Can I test with 1 app first?","\nThat's the recommended path. Take the least critical app (internal, or very low traffic), do the entire migration on it first. Learn from the stumbles there. Then migrate the production ones with confidence. The first migration teaches more than reading 10 posts like this.",[12,21667,21668,21671],{},[27,21669,21670],{},"What if the migration fails?","\nThe 30 days of Heroku running in parallel are your safety net. If the destination breaks irreversibly in the first hour, switch DNS back to Heroku; it takes 60 seconds, and life goes back to normal. The only case where rollback is expensive is if you did the database cutover with writes at the destination — then you need to replicate back. 
That's why the recommendation is simultaneous DNS and database cutover, with short window.",[12,21673,21674,21677,21678,21681,21682,21684],{},[27,21675,21676],{},"Is there an assisted migration path for HeroCtl?","\nFor HeroCtl, yes — we have an experimental converter that reads ",[231,21679,21680],{},"app.json"," + ",[231,21683,20955],{}," and generates an equivalent job manifest. Works for simple apps (web + worker + release), and stumbles on exotic cases (heavy multi-buildpack, custom hooks). If you want to test, send a message.",[19,21686,3309],{"id":3308},[12,21688,21689],{},"Migrating from Heroku four years later is embarrassing — should have left in 2022. But four years becoming five is worse. The compounded cost of not migrating (R$25k to R$100k per year in accumulated Heroku bill, plus the fragility of depending on a product Salesforce already showed has no affection for small users) is greater than the cost of a focused week of work.",[12,21691,21692],{},"If you decide to test HeroCtl, install on any Linux server:",[224,21694,21695],{"className":226,"code":2948,"language":228,"meta":229,"style":229},[231,21696,21697],{"__ignoreMap":229},[234,21698,21699,21701,21703,21705,21707],{"class":236,"line":237},[234,21700,1220],{"class":247},[234,21702,2957],{"class":251},[234,21704,2960],{"class":255},[234,21706,2963],{"class":383},[234,21708,2966],{"class":247},[12,21710,21711],{},"Works on 1 server (simple mode) or on 3+ (real HA mode). The Community plan is free without server limit and without job limit — you don't need to make any commercial decision to do the entire migration.",[12,21713,21714],{},"If you decide on Render, Railway or Coolify, also great. The point of this post isn't to capture you as a customer — it's to get you off Heroku. Four years later, it's time.",[12,21716,21717,21718,21721,21722,101],{},"For additional context on self-hosting in 2026, read ",[3336,21719,21720],{"href":19749},"Self-hosted Heroku: the state of the art in 2026",". 
To understand why we built a new orchestrator instead of adopting an existing one, read ",[3336,21723,21724],{"href":6545},"Why we built HeroCtl",[3350,21726,21727],{},"html pre.shiki code .sQhOw, html code.shiki .sQhOw{--shiki-default:#FFA657}html pre.shiki code .s9uIt, html code.shiki .s9uIt{--shiki-default:#A5D6FF}html .default .shiki span {color: var(--shiki-default);background: var(--shiki-default-bg);font-style: var(--shiki-default-font-style);font-weight: var(--shiki-default-font-weight);text-decoration: var(--shiki-default-text-decoration);}html .shiki span {color: var(--shiki-default);background: var(--shiki-default-bg);font-style: var(--shiki-default-font-style);font-weight: var(--shiki-default-font-weight);text-decoration: var(--shiki-default-text-decoration);}html pre.shiki code .sFSAA, html code.shiki .sFSAA{--shiki-default:#79C0FF}html pre.shiki code .suJrU, html code.shiki .suJrU{--shiki-default:#FF7B72}html pre.shiki code .sZEs4, html code.shiki .sZEs4{--shiki-default:#E6EDF3}html pre.shiki code .sH3jZ, html code.shiki .sH3jZ{--shiki-default:#8B949E}",{"title":229,"searchDepth":244,"depth":244,"links":21729},[21730,21731,21732,21733,21734,21735,21736,21737,21738,21739,21740],{"id":20401,"depth":244,"text":20402},{"id":20471,"depth":244,"text":20472},{"id":20691,"depth":244,"text":20692},{"id":20728,"depth":244,"text":20729},{"id":21059,"depth":244,"text":21060},{"id":21303,"depth":244,"text":21304},{"id":21411,"depth":244,"text":21412},{"id":21481,"depth":244,"text":21482},{"id":21579,"depth":244,"text":21580},{"id":3224,"depth":244,"text":3225},{"id":3308,"depth":244,"text":3309},"2026-03-11","The end of Heroku's free plan in November\u002F2022 turned migration into a priority for hundreds of Brazilian teams. 
Detailed plan with checklist, estimated time, and common pitfalls.",{},"\u002Fen\u002Fblog\u002Fmigrating-from-heroku-technical-guide",{"title":20390,"description":21742},{"loc":21744},"en\u002Fblog\u002Fmigrating-from-heroku-technical-guide",[20490,6393,6395,3392,888],"3KlZIN2qM2cMukpE7Xi4UEWjyIUjfaPo36rBNwc5Awk",{"id":21751,"title":21752,"author":7,"body":21753,"category":8756,"cover":3379,"date":22455,"description":22456,"draft":3382,"extension":3383,"lastReviewed":3379,"meta":22457,"navigation":411,"path":6333,"readingTime":3386,"seo":22458,"sitemap":22459,"stem":22460,"tags":22461,"__hash__":22464},"blog_en\u002Fen\u002Fblog\u002Faws-ecs-vs-kubernetes-vs-self-hosted.md","AWS ECS vs Kubernetes vs self-hosted: three paths to run containers in 2026",{"type":9,"value":21754,"toc":22440},[21755,21758,21761,21764,21768,21771,21774,21798,21801,21804,21808,21811,21814,21820,21826,21832,21838,21844,21847,21851,21854,21868,21872,21875,21878,21891,21895,21901,21927,21931,21934,21951,21955,21958,21961,21978,21981,21985,21988,22014,22018,22021,22202,22205,22209,22212,22218,22224,22230,22236,22242,22246,22249,22318,22321,22342,22345,22348,22352,22358,22364,22370,22376,22382,22388,22394,22396,22399,22402,22405,22408,22424,22427,22435,22438],[12,21756,21757],{},"AWS today sells, at minimum, four different products to run a container in production: ECS (with EC2 or Fargate), EKS, App Runner and Lightsail Containers. It is not catalog redundancy nor internal confusion — it is a direct response to the market. Each covers a distinct slice of those arriving at AWS with the same basic question: how to bring up a container, keep it alive, expose it to the internet, update it without falling, and sleep peacefully.",[12,21759,21760],{},"ECS is AWS's bet for those who don't want Kubernetes. It is not \"simpler Kubernetes\", it is a proprietary alternative to Kubernetes, written by Amazon engineers before K8s became consensus. 
EKS is managed Kubernetes, the same off-the-shelf product as GKE and AKS. Self-hosted is the exit from AWS entirely — you run on any Linux server, pay only the server, and take your containers with you if the provider changes its terms.",[12,21762,21763],{},"The three paths solve the same problem with very different trade-offs. This post puts side by side what each charges, what each locks you into, and in what context each makes sense — without pretending there is a uniform winner.",[19,21765,21767],{"id":21766},"aws-ecs-what-it-is-exactly","AWS ECS: what it is exactly",[12,21769,21770],{},"ECS is Amazon's proprietary orchestrator. It is not open source, doesn't run outside AWS, and has no alternative implementation. It was announced in 2014, before Kubernetes gained traction, and AWS has invested in it since then as the \"AWS-native, no K8s\" entry point to the container world.",[12,21772,21773],{},"The conceptual model is its own:",[2734,21775,21776,21782,21787,21792],{},[70,21777,21778,21781],{},[27,21779,21780],{},"Task definition"," instead of Pod. It is a JSON file describing the container, resources, ports, environment variables, IAM role.",[70,21783,21784,21786],{},[27,21785,12963],{}," instead of Deployment. Keeps N tasks running, runs health checks, integrates with Application Load Balancer.",[70,21788,21789,21791],{},[27,21790,6873],{}," is just a logical grouping — no paid control plane. The control plane is free (AWS manages it internally, you neither see nor maintain it).",[70,21793,21794,21797],{},[27,21795,21796],{},"Capacity provider"," defines where tasks run: EC2 (you manage instances) or Fargate (serverless per vCPU\u002FRAM\u002Fsecond).",[12,21799,21800],{},"Integration with the rest of AWS is the real strong point. A task gets an IAM role directly, no auth sidecar. Logs go to CloudWatch without an agent. Images come from ECR without configuring a pull secret. ALB routes traffic to tasks with automatic service discovery. 
All of that with a decent graphical console, a stable CLI, and SDKs in every language.",[12,21802,21803],{},"Compared to K8s, ECS is deliberately simple. There are no CRDs, no operators, no Helm charts, no formalized sidecar pattern, no admission control. You have task, service, cluster — and that's it. For a team already deep in AWS, that simplicity is the argument.",[19,21805,21807],{"id":21806},"aws-ecs-where-it-hurts","AWS ECS: where it hurts",[12,21809,21810],{},"Lock-in is absolute, and worth naming first. A task definition doesn't run outside AWS. ECR is not a portable registry (you can pull, but IAM ties it back to AWS). ALB is AWS-only. Service discovery via Cloud Map is AWS-only. CloudWatch is AWS-only. You are not adopting \"a way to run containers\" — you are adopting an entire stack that only exists there. Migrating out requires rewriting each piece.",[12,21812,21813],{},"Cost appears in layers nobody adds up in the first evaluation:",[12,21815,21816,21819],{},[27,21817,21818],{},"Fargate",": US$0.04 per vCPU-hour + US$0.0044 per GB-hour. A modest app with 0.5 vCPU + 1 GB, running 24×7, costs US$25\u002Fmonth — R$125 at R$5\u002FUSD. It sounds small until you remember that each microservice is a task, and that typical production has 8 to 15 microservices + queue tasks + cron jobs. Five small applications easily become R$600 of Fargate alone.",[12,21821,21822,21825],{},[27,21823,21824],{},"CloudWatch Logs",": US$0.50 per GB ingested + US$0.03 per GB stored per month. An app logging 5 GB\u002Fmonth comes out to US$2.65 — R$13. Multiplied by ten services, R$130\u002Fmonth in logs alone. And it is the \"cheap\" option — turn on Insights for serious queries and it doubles.",[12,21827,21828,21831],{},[27,21829,21830],{},"Egress",": US$0.09 per GB after the first 100 GB free — R$0.45\u002FGB. An app serving 500 GB of egress per month pays R$180. 
Video streaming, image downloads, heavy public API: egress becomes the largest bill item, frequently exceeding compute.",[12,21833,21834,21837],{},[27,21835,21836],{},"Network",": VPC is free, but NAT Gateway costs US$0.045\u002Fhour — fixed US$32\u002Fmonth, R$160 — just to exist, plus US$0.045 per GB processed. You need NAT for any task in a private subnet that calls the internet (update packages, call an external API, send email via SES). In production with high availability, the recommendation is NAT in two zones — two NAT Gateways, R$320\u002Fmonth baseline before any traffic.",[12,21839,21840,21843],{},[27,21841,21842],{},"Application Load Balancer",": US$0.0225\u002Fhour (US$16\u002Fmonth fixed) + US$0.008 per LCU-hour. For an app with moderate traffic, US$25\u002Fmonth is realistic — R$125.",[12,21845,21846],{},"Realistic sum for a small operation with five apps in Fargate, shared ALB, NAT in one zone only, moderate logs: R$1,000 to R$1,500\u002Fmonth. Grows linearly with the number of tasks. Not expensive by enterprise standards, but multiple times the equivalent cost on dedicated VPS.",[19,21848,21850],{"id":21849},"aws-ecs-who-uses-and-rightly-loves-it","AWS ECS: who uses and rightly loves it",[12,21852,21853],{},"There is a clear profile for whom ECS is the right answer, and we recommend it without reservations for these cases:",[2734,21855,21856,21859,21862,21865],{},[70,21857,21858],{},"A company that is already 100% AWS, with a team trained on the console and IAM policies. Adding ECS is incremental — it doesn't require learning a new tool outside the bubble.",[70,21860,21861],{},"Burst workloads, scheduled jobs, nightly ETLs. Fargate shines when you want 50 tasks running for 12 minutes a day and zero the rest of the time. Paying per second is honest.",[70,21863,21864],{},"Compliance that specifically requires AWS (FedRAMP High, American federal contracts, certain HIPAA configurations with AWS BAA). 
When the audit asks for AWS, ECS gives you the shortest path without installing K8s on top.",[70,21866,21867],{},"A team that prioritizes zero-ops over cost and portability. If you don't have anyone to maintain an EC2 instance, Fargate is genuinely less work — you never see a machine, never patch a kernel, never get paged about disk saturation.",[19,21869,21871],{"id":21870},"kubernetes-what-it-is-exactly","Kubernetes: what it is exactly",[12,21873,21874],{},"The audience knows K8s, so the summary here is short. De facto standard for orchestration since around 2018, with a giant CNCF ecosystem (cert-manager, ingress controllers, operators for practically any database). Consistent API across clouds, which makes multi-cloud genuinely viable (expensive, but viable). Well-documented learning curve — 300+ manifest lines to get a hello world with TLS live.",[12,21876,21877],{},"Operating models:",[2734,21879,21880,21885],{},[70,21881,21882,21884],{},[27,21883,7081],{}," (EKS, GKE, AKS): the provider maintains the control plane. Charges around US$73\u002Fmonth per cluster on the big three — R$365. Plus NAT, ALB, observability, registry. Typical minimum team: 2 SREs.",[70,21886,21887,21890],{},[27,21888,21889],{},"Self-managed"," with k3s, kubeadm, kops, Rancher: you install on VMs or bare metal. No control plane cost, but you become the platform team. Minimum team: 1 very good SRE or 2 average ones.",[19,21892,21894],{"id":21893},"kubernetes-where-it-hurts","Kubernetes: where it hurts",[12,21896,21897,21898,21900],{},"Already covered in depth in ",[3336,21899,15781],{"href":15780},". Direct summary:",[2734,21902,21903,21909,21915,21921],{},[70,21904,21905,21908],{},[27,21906,21907],{},"Operational cost",": 1 to 2 dedicated SREs, R$30-40k\u002Fmonth each as CLT employees in Brazil. 
That's the largest item on the bill — multiply by twelve months and the cluster has passed the R$500k\u002Fyear mark in people alone.",[70,21910,21911,21914],{},[27,21912,21913],{},"Curve",": 6+ months until the team is genuinely productive (not \"delivers a manifest\", but \"debugs a problem at three in the morning without destroying production\").",[70,21916,21917,21920],{},[27,21918,21919],{},"Surrounding stack",": cert-manager, ingress controller, metrics operator, log agent, service mesh if any — each with its own version, its own update policy, its own failure model.",[70,21922,21923,21926],{},[27,21924,21925],{},"Long manifests",": \"hello world\" with namespace + deployment + service + ingress + cert + RBAC sits at 300 lines. Helm reduces duplication but adds a conceptual layer.",[19,21928,21930],{"id":21929},"kubernetes-who-uses-and-justifies","Kubernetes: who uses it and justifies it",[12,21932,21933],{},"The profiles where K8s is the obvious choice, no irony:",[2734,21935,21936,21939,21942,21945,21948],{},[70,21937,21938],{},"A Series B+ company with a platform team of 3 or more dedicated people. The human scale sustains the complexity.",[70,21940,21941],{},"Multi-cloud or vendor neutrality as a real requirement (not as a slide aspiration). You will effectively run on two clouds, and K8s is the only mature abstraction covering both.",[70,21943,21944],{},"Workloads that depend on specific mature operators: Postgres operator with replication, Kafka operator with balancing, Cassandra operator with bootstrap. Rewriting that \"by hand\" costs more than the cluster.",[70,21946,21947],{},"Nominal compliance — some frameworks list Kubernetes by name in controls. If your auditor needs to point to a SOC2 certificate that says \"Kubernetes 1.28\", the tool has to be called Kubernetes.",[70,21949,21950],{},"An operation above 50 servers in sustained production. 
At that size, the CNCF ecosystem gives you tools you would have to build from scratch in smaller alternatives.",[19,21952,21954],{"id":21953},"modern-self-hosted","Modern self-hosted",[12,21956,21957],{},"The third option is what changed in the last two years. Self-hosted stopped being \"Docker Compose on a server with luck\" and became a category with serious products — HeroCtl is one of them, but Coolify, Dokploy, Caprover and others also occupy the space.",[12,21959,21960],{},"The common proposition:",[2734,21962,21963,21966,21969,21972,21975],{},[70,21964,21965],{},"A binary (or simple Docker image) installed on N Linux servers with Docker.",[70,21967,21968],{},"Replicated control plane, with automatic coordinator election. You lose one server, the cluster keeps going.",[70,21970,21971],{},"Embedded router, automatic Let's Encrypt certificates.",[70,21973,21974],{},"No cloud provider dependency — runs on any VPS, any bare metal, any mixture.",[70,21976,21977],{},"Honest commercial model: a permanently free Community tier without artificial feature gates, a paid Business tier with published pricing for those needing SSO\u002Faudit\u002FSLA, and Enterprise for contracts with escrow and 24×7 support.",[12,21979,21980],{},"Cost reduces to two line items: the VPS and the time of the part-time dev looking after it. Three droplets at US$24\u002Fmonth each on DigitalOcean — R$360\u002Fmonth — sustain an operation that on ECS would sit between R$1,500 and R$3,000.",[19,21982,21984],{"id":21983},"self-hosted-where-it-hurts","Self-hosted: where it hurts",[12,21986,21987],{},"Honesty is worth more than a brochure. Where self-hosted is not the answer:",[2734,21989,21990,21996,22002,22008],{},[70,21991,21992,21995],{},[27,21993,21994],{},"You are responsible for everything",". No AWS support to call when things go wrong. 
An active community helps, and paid Business support exists — but the first line of defense is you reading the logs.",[70,21997,21998,22001],{},[27,21999,22000],{},"Healthy scale range: 1 to 500 servers",". Above that, specific Kubernetes tooling still wins. It is not a product defect — it is where the CNCF spent ten years polishing things nobody else has.",[70,22003,22004,22007],{},[27,22005,22006],{},"Specific enterprise compliance",". If your auditor needs the orchestrator to appear on a pre-approved list of suppliers, and that list names AWS\u002FAzure\u002FGCP\u002FK8s, a young self-hosted product leaves you out by default.",[70,22009,22010,22013],{},[27,22011,22012],{},"Native AWS integrations",": Cognito as auth, S3 with IAM directly on the task, RDS with IAM auth — all of that can be adapted, but the adaptation is extra work. On ECS it works without thinking.",[19,22015,22017],{"id":22016},"side-by-side-twelve-criteria-without-caveat","Side by side: twelve criteria without caveats",[12,22019,22020],{},"The table below is the honest version of the decision. 
R$5\u002FUSD FX used in all real estimates.",[119,22022,22023,22037],{},[122,22024,22025],{},[125,22026,22027,22029,22032,22035],{},[128,22028,2982],{},[128,22030,22031],{},"AWS ECS (Fargate)",[128,22033,22034],{},"Kubernetes (EKS)",[128,22036,16860],{},[141,22038,22039,22053,22067,22081,22094,22107,22121,22134,22147,22161,22174,22188],{},[125,22040,22041,22044,22047,22050],{},[146,22042,22043],{},"Minimum BRL\u002Fmonth cost — 5 small apps",[146,22045,22046],{},"R$1,000-1,500",[146,22048,22049],{},"R$1,500-2,500 + SRE team",[146,22051,22052],{},"R$300-500 + part-time dev",[125,22054,22055,22058,22061,22064],{},[146,22056,22057],{},"Predictable cost month over month",[146,22059,22060],{},"No — egress + log vary",[146,22062,22063],{},"No — sum of many lines",[146,22065,22066],{},"Yes — VPS is fixed",[125,22068,22069,22072,22075,22078],{},[146,22070,22071],{},"Lock-in (0-10)",[146,22073,22074],{},"10 — task def is AWS-only",[146,22076,22077],{},"4 — portable manifests with caveats",[146,22079,22080],{},"1 — any Linux VPS",[125,22082,22083,22085,22088,22091],{},[146,22084,16309],{},[146,22086,22087],{},"2-4 hours (with IAM well done)",[146,22089,22090],{},"1-3 days",[146,22092,22093],{},"15-30 minutes",[125,22095,22096,22098,22101,22104],{},[146,22097,3151],{},[146,22099,22100],{},"Medium (own concepts + AWS)",[146,22102,22103],{},"High (6+ months for real productivity)",[146,22105,22106],{},"Low (Heroku model)",[125,22108,22109,22112,22115,22118],{},[146,22110,22111],{},"Operator ecosystem",[146,22113,22114],{},"Restricted to AWS catalog",[146,22116,22117],{},"Hundreds, mature",[146,22119,22120],{},"Limited, growing",[125,22122,22123,22125,22128,22131],{},[146,22124,7102],{},[146,22126,22127],{},"Native (AWS regions)",[146,22129,22130],{},"Native via federation",[146,22132,22133],{},"Manual, with care",[125,22135,22136,22139,22142,22144],{},[146,22137,22138],{},"24\u002F7 support",[146,22140,22141],{},"Paid separately 
(Business+)",[146,22143,22141],{},[146,22145,22146],{},"Paid Enterprise",[125,22148,22149,22152,22155,22158],{},[146,22150,22151],{},"Nominal enterprise compliance",[146,22153,22154],{},"Strong (FedRAMP, HIPAA)",[146,22156,22157],{},"Strong (listed by name)",[146,22159,22160],{},"Under construction",[125,22162,22163,22166,22169,22172],{},[146,22164,22165],{},"Ideal scale range",[146,22167,22168],{},"1-200 tasks",[146,22170,22171],{},"50-100,000 servers",[146,22173,3121],{},[125,22175,22176,22179,22182,22185],{},[146,22177,22178],{},"Minimum team",[146,22180,22181],{},"1 dev + an AWS docs reader",[146,22183,22184],{},"2 SREs",[146,22186,22187],{},"1 part-time dev",[125,22189,22190,22193,22196,22199],{},[146,22191,22192],{},"Migration pain (leaving)",[146,22194,22195],{},"High — rewrite stack",[146,22197,22198],{},"Low — manifests are portable",[146,22200,22201],{},"Minimal — Docker is Docker",[12,22203,22204],{},"The column that matters varies by context, and that is exactly the point: there is no uniform winner.",[19,22206,22208],{"id":22207},"decision-by-context","Decision by context",[12,22210,22211],{},"Practical translation of the tables into direct recommendation. If your scenario fits one of these five, the answer is the indicated one — without flourish.",[12,22213,22214,22217],{},[27,22215,22216],{},"\"We're already on AWS, small team, contracts require AWS.\"","\nECS Fargate. Lock-in is already a done deal, and Fargate eliminates the work of managing instances. You trade predictable cost for zero-ops, which is the right trade-off when the team has few hands and can't stop to patch a kernel.",[12,22219,22220,22223],{},[27,22221,22222],{},"\"Multi-cloud strategy or compliance requires neutrality between vendors.\"","\nKubernetes. If the team is strong, k3s self-managed on VMs drastically reduces control plane cost. If the team is average, managed EKS in a primary cloud and equivalent in the other. 
Don't pay the K8s cost without a real requirement — but when the requirement is real, the tool fits.",[12,22225,22226,22229],{},[27,22227,22228],{},"\"Early-stage startup, costs hurt, no AWS specialist on the team.\"","\nSelf-hosted. HeroCtl, Coolify, Dokploy — the segment matured. Three droplets, a part-time dev, R$400\u002Fmonth of infra, and you have the entire operation under control. When the product gains traction and the company grows, you can reassess — but getting there spending R$400\u002Fmonth is the difference between running out of runway and not.",[12,22231,22232,22235],{},[27,22233,22234],{},"\"Big enterprise, compliance lists K8s by name, platform team with 5+ people.\"","\nManaged EKS. Control plane cost disappears in the budget, the team absorbs the complexity, and audit is satisfied. This is the canonical K8s case — don't try to economize here.",[12,22237,22238,22241],{},[27,22239,22240],{},"\"Solo dev experimenting, side project, MVP.\"","\nRender or Railway on a hosted cloud (pay only what you use, zero ops), or Coolify on a US$5 VPS. Don't build a cluster for a project that may die in three months. When it passes US$1k MRR and it becomes clear it'll survive, migrate to HeroCtl or Dokploy on a three-VPS cluster.",[19,22243,22245],{"id":22244},"migrating-from-ecs-to-self-hosted-practical-path","Migrating from ECS to self-hosted: practical path",[12,22247,22248],{},"For those already on ECS who want to reduce the bill, the conceptual mapping is simpler than it seems. 
The primitives match almost one-to-one:",[119,22250,22251,22260],{},[122,22252,22253],{},[125,22254,22255,22258],{},[128,22256,22257],{},"ECS",[128,22259,2994],{},[141,22261,22262,22269,22276,22282,22289,22296,22303,22310],{},[125,22263,22264,22266],{},[146,22265,21780],{},[146,22267,22268],{},"Job spec",[125,22270,22271,22273],{},[146,22272,12963],{},[146,22274,22275],{},"Group inside the job",[125,22277,22278,22280],{},[146,22279,6873],{},[146,22281,6873],{},[125,22283,22284,22286],{},[146,22285,21818],{},[146,22287,22288],{},"Dedicated server (VPS or bare metal)",[125,22290,22291,22293],{},[146,22292,5601],{},[146,22294,22295],{},"Integrated router, with automatic TLS",[125,22297,22298,22300],{},[146,22299,21824],{},[146,22301,22302],{},"Single embedded writer",[125,22304,22305,22308],{},[146,22306,22307],{},"Task IAM role",[146,22309,5678],{},[125,22311,22312,22315],{},[146,22313,22314],{},"ECR",[146,22316,22317],{},"ECR keeps working, or internal registry",[12,22319,22320],{},"Practical path for a simple app:",[67,22322,22323,22330,22333,22336,22339],{},[70,22324,22325,22326,22329],{},"Bring up three Linux VPS with Docker. Install the orchestrator on one of the three (",[231,22327,22328],{},"curl -sSL get.heroctl.com\u002Finstall.sh | sh","). The other two join as agents — a single command for each.",[70,22331,22332],{},"Take the task definition in JSON. Map: container, resources, ports, environment variables, secrets. It becomes a configuration file of ~50 lines.",[70,22334,22335],{},"Submit via CLI. The cluster decides where to run, opens the port, registers in the router, issues a Let's Encrypt certificate, starts serving.",[70,22337,22338],{},"Point the domain's DNS to the IP of the coordinator server (or to a DNS-based Load Balancer if you have more than one region).",[70,22340,22341],{},"Test in staging for a week. If OK, repeat for production and decommission ECS.",[12,22343,22344],{},"Average time per simple app: 1 to 3 hours, including the test. 
Apps with strong dependence on IAM\u002FCognito\u002FSQS take longer — you need to adapt the AWS calls to go via SDK + access keys (instead of an implicit IAM role). Stateless HTTP apps are almost mechanical.",[12,22346,22347],{},"Typical annual savings for a modest scale-up (15 microservices, 1 ALB, NAT in one zone, moderate logs): R$50,000 to R$150,000 that leave the AWS bill and turn into salary or marketing. The human component also changes — you stop needing a dedicated AWS specialist.",[19,22349,22351],{"id":22350},"questions-that-come-up","Questions that come up",[12,22353,22354,22357],{},[27,22355,22356],{},"Can I use ECS Fargate without ALB to save?","\nTechnically yes, exposing the task with a public IP — but you lose automatic TLS, load balancing, layer-7 health checks, and service discovery. For real production this is not savings, it is debt. Worth it only for internal workloads without ingress (queue jobs, ETLs).",[12,22359,22360,22363],{},[27,22361,22362],{},"Is EKS Anywhere viable outside AWS?","\nIt exists and works, but the license cost is high and integration with the AWS ecosystem is partial — you get the \"EKS\" name without getting native integration. To run K8s outside AWS, k3s or kubeadm offer better cost-benefit in practice.",[12,22365,22366,22369],{},[27,22367,22368],{},"Migrating from ECS to self-hosted, how long realistically?","\nSimple apps: 1-3 hours each. An entire operation with 10-15 services: 2 to 4 weeks if you go carefully, with staging tests. The bottleneck is usually dependencies on AWS-specific services (SQS, SNS, Cognito), not the orchestrator itself.",[12,22371,22372,22375],{},[27,22373,22374],{},"Does keeping both (ECS + self-hosted in parallel) make sense?","\nIt does, during migration and sometimes permanently. Workloads that depend heavily on native IAM (direct S3 access without keys, for example) can stay in ECS. The rest goes to the self-hosted cluster. 
DNS routing solves the traffic split.",[12,22377,22378,22381],{},[27,22379,22380],{},"Compliance requires AWS, can I use HeroCtl on EC2?","\nYou can. HeroCtl runs on any Linux VPS with Docker — including EC2 instances. You lose the advantage of total portability, but keep the simple operational model and predictable cost. It is a good option for teams that need to stay inside AWS by contract but want to escape native complexity.",[12,22383,22384,22387],{},[27,22385,22386],{},"Is AWS App Runner a good alternative?","\nApp Runner is AWS's offer for \"Heroku on top of ECS\". It works for very simple apps (one image, one port, automatic build from Git). It charges more than equivalent Fargate and gives you less control. For a weekend MVP, it is reasonable. For serious production, ECS directly with Fargate gives more flexibility for the same money.",[12,22389,22390,22393],{},[27,22391,22392],{},"GKE Autopilot vs Fargate vs HeroCtl?","\nGKE Autopilot and Fargate occupy the same conceptual niche: serverless per pod\u002Ftask; you don't see the node. GKE Autopilot is generally cheaper for stable workloads, and more expensive for burst. Both have strong lock-in. HeroCtl attacks the problem from the other side — you see the server on purpose, pay for the whole machine, and the orchestrator distributes the workloads. For long-running stable workloads, it comes out cheaper. For extreme burst workloads, serverless wins.",[19,22395,3309],{"id":3308},[12,22397,22398],{},"There is no path that wins on all twelve criteria of the table. ECS wins on AWS integration, loses on portability. Kubernetes wins on ecosystem, loses on simplicity. Self-hosted wins on cost and clarity, loses on specific extreme-scale tooling.",[12,22400,22401],{},"The right choice depends on your team, your compliance, your contracts, and the company stage. 
The wrong choice is not doing the math — adopting ECS because \"it's the AWS standard\" without adding up Fargate + ALB + NAT + CloudWatch + egress; adopting Kubernetes because \"it's what's trending\" without having the 2 SREs; or staying on fragile homemade self-hosted when the operation has already passed 50 servers and the CNCF ecosystem now justifies itself.",[12,22403,22404],{},"If you are reviewing the container orchestration stack in 2026, the practical recommendation is simple: measure the real cost of the previous twelve months, identify which of the five profiles your team fits, and decide. If the profile is \"early startup, small team, cost hurts\", the lowest-risk path is to try self-hosted in parallel for one month before migrating.",[12,22406,22407],{},"To start:",[224,22409,22410],{"className":226,"code":5318,"language":228,"meta":229,"style":229},[231,22411,22412],{"__ignoreMap":229},[234,22413,22414,22416,22418,22420,22422],{"class":236,"line":237},[234,22415,1220],{"class":247},[234,22417,2957],{"class":251},[234,22419,5329],{"class":255},[234,22421,2963],{"class":383},[234,22423,2966],{"class":247},[12,22425,22426],{},"Three Linux VPS, ten minutes per server, and you have a cluster with replicated control plane, integrated router, automatic certificates. From there, it is a matter of moving service by service from the AWS bill to your own cluster.",[12,22428,15774,22429,22431,22432,22434],{},[3336,22430,15781],{"href":15780}," explains in more detail when the colossus doesn't fit; ",[3336,22433,20350],{"href":5343}," compares the two lightweight alternatives within the self-hosted category.",[12,22436,22437],{},"Container orchestration is a long-term decision. 
Make it by the right math, not by inertia.",[3350,22439,4376],{},{"title":229,"searchDepth":244,"depth":244,"links":22441},[22442,22443,22444,22445,22446,22447,22448,22449,22450,22451,22452,22453,22454],{"id":21766,"depth":244,"text":21767},{"id":21806,"depth":244,"text":21807},{"id":21849,"depth":244,"text":21850},{"id":21870,"depth":244,"text":21871},{"id":21893,"depth":244,"text":21894},{"id":21929,"depth":244,"text":21930},{"id":21953,"depth":244,"text":21954},{"id":21983,"depth":244,"text":21984},{"id":22016,"depth":244,"text":22017},{"id":22207,"depth":244,"text":22208},{"id":22244,"depth":244,"text":22245},{"id":22350,"depth":244,"text":22351},{"id":3308,"depth":244,"text":3309},"2026-03-04","ECS is AWS's offer for those escaping Kubernetes. Kubernetes is Kubernetes. Self-hosted is the path out of AWS. Each makes sense in specific contexts — no uniform trade-off.",{},{"title":21752,"description":22456},{"loc":6333},"en\u002Fblog\u002Faws-ecs-vs-kubernetes-vs-self-hosted",[6392,22462,20384,22463,8756,7507],"ecs","fargate","CmcR4T0P7aqHTkKD7laYYeTXUCOtwQwpAa8lPIlZeaM",{"id":22466,"title":22467,"author":7,"body":22468,"category":8756,"cover":3379,"date":23009,"description":23010,"draft":3382,"extension":3383,"lastReviewed":3379,"meta":23011,"navigation":411,"path":5343,"readingTime":8761,"seo":23012,"sitemap":23013,"stem":23014,"tags":23015,"__hash__":23018},"blog_en\u002Fen\u002Fblog\u002Fk3s-vs-heroctl-when-each-fits.md","k3s vs HeroCtl: when you need lightweight Kubernetes and when you don't need 
Kubernetes",{"type":9,"value":22469,"toc":22995},[22470,22481,22484,22488,22495,22502,22505,22512,22516,22527,22541,22544,22548,22551,22557,22563,22569,22575,22581,22585,22592,22595,22598,22601,22605,22611,22617,22623,22629,22635,22639,22645,22651,22657,22663,22669,22673,22676,22818,22821,22825,22828,22834,22840,22846,22852,22858,22862,22865,22871,22877,22883,22886,22888,22891,22897,22903,22906,22909,22911,22917,22923,22929,22935,22941,22947,22953,22957,22963,22966,22969,22972,22977,22980,22992],[12,22471,22472,22473,22476,22477,22480],{},"The question arrives in our inbox almost weekly: \"you're like a k3s, right?\". The short answer is no. The long answer starts by realizing that k3s and HeroCtl get confused because they occupy the same mental space — \"orchestration without the complexity of full Kubernetes\" — but solve problems that only look the same from outside. k3s ",[27,22474,22475],{},"is"," Kubernetes, distilled. HeroCtl ",[27,22478,22479],{},"is not"," Kubernetes, and that difference changes everything that comes after: what you read, what you install, who you hire, what you celebrate on Saturdays.",[12,22482,22483],{},"This post is for tech leads who know K8s well enough to have scars and are considering some lighter alternative. The intent is not to convince anyone to abandon Kubernetes — Kubernetes is the right choice for many cases. The intent is to give you the map to decide between k3s and HeroCtl without mixing the two's premises.",[19,22485,22487],{"id":22486},"what-k3s-is-exactly","What k3s is, exactly",[12,22489,22490,22491,22494],{},"k3s is a Kubernetes distribution maintained by Rancher (now SUSE). It is ",[27,22492,22493],{},"full Kubernetes and CNCF-certified"," — the same API, the same controllers, the same object model. 
What changes is the packaging.",[12,22496,22497,22498,22501],{},"Instead of five to seven separate components running as system services, k3s ships a single binary of about 50 MB that boots API server, scheduler, controller manager, kubelet and container runtime in a single process tree. Default storage is SQLite instead of etcd, Kubernetes' traditional distributed store — you can swap in embedded etcd when you want real high availability, or use an external SQL database cluster via the ",[231,22499,22500],{},"kine"," driver. Cloud provider plugins were removed from the binary — if you need them, you install them separately.",[12,22503,22504],{},"The installation fits in one command, comes up in less than 30 seconds on a modest server, and the minimum RAM requirement is 512 MB. It works on a Raspberry Pi. It works on a $5 VPS. It works on industrial fanless hardware running inside a steel box in a factory.",[12,22506,22507,22508,22511],{},"The most important point: ",[27,22509,22510],{},"kubectl, K8s manifests, operators, Helm charts and everything else you learned about Kubernetes work identically",". A k3s cluster accepts the same YAML files an AWS managed cluster accepts. Migrating from one to the other is, in practice, copying manifests. Compatibility is the feature.",[19,22513,22515],{"id":22514},"what-k3s-does-not-do-even-being-lightweight","What k3s does NOT do, even being \"lightweight\"",[12,22517,22518,22519,22522,22523,22526],{},"The word \"lightweight\" is deceptive. k3s is light in ",[27,22520,22521],{},"footprint"," (RAM, disk, number of processes), not in ",[27,22524,22525],{},"mental model",". What it removes is the installation barrier and the dependency on five external services to come up. What it keeps is everything that makes Kubernetes Kubernetes:",[2734,22528,22529,22532,22535,22538],{},[70,22530,22531],{},"A \"hello world\" manifest still tops 100 lines when you add Service, Ingress and ConfigMap. 
Adding automatic TLS and minimum RBAC, it goes to 300+.",[70,22533,22534],{},"You still need to understand namespaces, services, ingress, persistent volumes, secrets, configmaps, RBAC, network policies, pod disruption budgets, liveness\u002Freadiness probes, init containers, sidecars, taints, tolerations, affinity rules, and so on.",[70,22536,22537],{},"Operators and templating charts continue to be the idiomatic path for anything non-trivial. Replicated Postgres? Operator. Kafka? Operator. Automatic certificates? Operator. Metrics? A three-product stack.",[70,22539,22540],{},"The learning curve is practically the same. k3s removes maybe 10% — the \"install and maintain the control plane\" piece. The remaining 90% — understanding how the system models applications, how controllers reconcile state, how to debug when a probe starts failing — remain there, intact.",[12,22542,22543],{},"If a veteran SRE looks at a k3s manifest, they feel at home. If a product developer who has never touched Kubernetes looks at that same manifest, they feel exactly as lost as if they were looking at a managed cluster manifest.",[19,22545,22547],{"id":22546},"who-should-use-k3s-real-profile","Who should use k3s (real profile)",[12,22549,22550],{},"Let's be concrete. k3s makes sense for:",[12,22552,22553,22556],{},[27,22554,22555],{},"Teams that already speak fluent Kubernetes and want to run on cheap hardware."," Edge computing, IoT, physical stores with local server, factories with industrial gateways, on-prem environments with modest hardware. The team already knows how to operate K8s — k3s only allows them to take that expertise to places where a 4 GB RAM control plane would be unfeasible.",[12,22558,22559,22562],{},[27,22560,22561],{},"Company migrating from managed Kubernetes to self-managed to reduce cost."," Managed cluster on a cloud provider charges about US$73\u002Fmonth just for the control plane, multiplied by the number of clusters. 
Add NAT, load balancers, observability — it gets expensive. Whoever already paid that toll and wants to stop can spin up k3s on commodity VPS and cut the bill by an order of magnitude. Operations don't get simpler; the bill gets smaller.",[12,22564,22565,22568],{},[27,22566,22567],{},"Workloads that depend on the CNCF ecosystem."," Mature operators for Postgres with automatic replication (CloudNativePG, Zalando), Kafka (Strimzi), Cassandra, Elasticsearch — these operators exist because someone invested three years polishing them. If your architecture depends on four of them in production, you want full Kubernetes, and k3s gives you full Kubernetes.",[12,22570,22571,22574],{},[27,22572,22573],{},"Those who want K8s-compatible tools working 1:1."," kubectl, templating charts, ArgoCD for GitOps, image scanning tools, policy tools like OPA Gatekeeper. If your existing CI\u002FCD pipeline uses these tools, k3s keeps all of them working without adaptation.",[12,22576,22577,22580],{},[27,22578,22579],{},"Compliance that requires a CNCF-certified distribution."," Some audit frameworks nominally ask for a certified distribution. k3s appears on that list. HeroCtl doesn't — we are too young to be on any list, and our proposal is different enough that some lists may never include us.",[19,22582,22584],{"id":22583},"what-heroctl-is-exactly","What HeroCtl is, exactly",[12,22586,22587,22588,22591],{},"HeroCtl is an independent orchestrator. It's not a distribution derived from Kubernetes; it doesn't share the API; it doesn't use the same primitives. It's a ",[27,22589,22590],{},"different"," layer that addresses a similar intent — running containers across multiple servers with real high availability — using another vocabulary and other design decisions.",[12,22593,22594],{},"Concretely: a single executable file that you install on N Linux servers. The first three become quorum for the replicated control plane. You submit jobs via CLI, API or embedded web panel. 
A job is a configuration file of about 50 lines that describes the entire application — including replicas, ingress, certificates, secrets. The cluster decides where to run, runs health checks, manages rolling update deploys, issues automatic certificates via the integrated router.",[12,22596,22597],{},"There are no specialized operators to install, there are no observability stacks assembled separately, there is no service mesh configured on the side. Persistent metrics run as a job of the system itself. Logs have a single embedded writer. Encryption between services and key management come ready. Ingress with automatic TLS is part of the binary.",[12,22599,22600],{},"The consequence is a short operational model. Bringing up a new application is describe, submit, wait — and the cluster handles routing, certificate, replication, metrics and health check without you installing anything extra.",[19,22602,22604],{"id":22603},"what-heroctl-does-not-do-honest-limits","What HeroCtl does NOT do (honest limits)",[12,22606,22607,22610],{},[27,22608,22609],{},"It is not compatible with the Kubernetes API."," kubectl doesn't talk to HeroCtl. Templating charts don't run. K8s manifests are not accepted. If your critical dependency is a CNCF ecosystem tool that talks to the Kubernetes API, HeroCtl doesn't replace it — it's a different tool, with its own vocabulary.",[12,22612,22613,22616],{},[27,22614,22615],{},"It has no specialized operator ecosystem."," There's no mature Postgres operator with automatic replication waiting to be installed. You run Postgres as a regular job and take care of backup and replication the way a human would — you don't delegate to an external controller. For many teams that's a relief; for others it's a regression.",[12,22618,22619,22622],{},[27,22620,22621],{},"Recommended scale range goes from 1 to 500 servers."," We tested up to hundreds in the lab, validated some dozens in production. 
Above that, Kubernetes (full or in a distribution like k3s) wins on ecosystem — multi-cluster federation tools, cross-region autoscaling, storage migration primitives between clouds exist there and don't yet exist here.",[12,22624,22625,22628],{},[27,22626,22627],{},"Multi-cluster federation is not native."," If you need multiple regions orchestrated as a single surface, with workloads moving automatically between them, tools like Rancher Fleet or Kubernetes multi-cluster features solve it today. HeroCtl doesn't.",[12,22630,22631,22634],{},[27,22632,22633],{},"Compliance that nominally lists Kubernetes."," If your certification nominally requires a CNCF-certified distribution, HeroCtl doesn't comply — we are a new product, too young to appear on established lists. k3s, OpenShift and Talos comply. That's the path.",[19,22636,22638],{"id":22637},"who-should-use-heroctl-real-profile","Who should use HeroCtl (real profile)",[12,22640,22641,22644],{},[27,22642,22643],{},"Teams that DON'T want to learn Kubernetes but need orchestration with real high availability."," Popular self-hosted panels work well on one server but don't have distributed consensus — when you want three servers tolerating the loss of one without downtime, those panels fall short. Kubernetes would do the job, but it costs an SRE on the team. HeroCtl is the missing middle.",[12,22646,22647,22650],{},[27,22648,22649],{},"Indie hackers and startups up to about R$1M annual revenue."," Typical stack: web application, relational database, async queue, cache. There's no Kafka, there's no Cassandra, there are no seven database operators. For this profile, the CNCF ecosystem is expensive idle capacity — you pay in learning curve and operational complexity without using what you pay for.",[12,22652,22653,22656],{},[27,22654,22655],{},"Typical web applications without exotic dependencies."," HTTP on top of a SQL database and an in-memory cache covers maybe 70% of the SaaS market. 
For that piece, Kubernetes is overkill and HeroCtl is sized.",[12,22658,22659,22662],{},[27,22660,22661],{},"Those who want \"Coolify simplicity with real high availability\"."," Coolify, Dokploy and similar got the experience right but missed high availability. Kubernetes got high availability right but missed the experience. HeroCtl tries to get both right at the cost of not being Kubernetes.",[12,22664,22665,22668],{},[27,22666,22667],{},"LGPD-only compliance."," If your compliance is LGPD and Brazilian commercial contracts, without FedRAMP nor ITAR on the horizon, the absence of specific certifications is not a blocker.",[19,22670,22672],{"id":22671},"side-by-side-no-fluff","Side by side, no fluff",[12,22674,22675],{},"The table below covers the criteria that show up most in the decision. Each row has a caveat — read the text.",[119,22677,22678,22689],{},[122,22679,22680],{},[125,22681,22682,22684,22687],{},[128,22683,2982],{},[128,22685,22686],{},"k3s",[128,22688,2994],{},[141,22690,22691,22702,22713,22724,22734,22744,22754,22764,22775,22786,22796,22807],{},[125,22692,22693,22696,22699],{},[146,22694,22695],{},"Product type",[146,22697,22698],{},"Kubernetes distribution",[146,22700,22701],{},"Independent orchestrator, non-Kubernetes",[125,22703,22704,22707,22710],{},[146,22705,22706],{},"Lines for hello world + TLS + ingress",[146,22708,22709],{},"200–300 (manifests + TLS operator)",[146,22711,22712],{},"~50 (job spec)",[125,22714,22715,22718,22721],{},[146,22716,22717],{},"Minimum total RAM in cluster",[146,22719,22720],{},"512 MB per node (1.5 GB on 3 HA nodes)",[146,22722,22723],{},"~600 MB by control plane (200–400 MB per node × 3)",[125,22725,22726,22728,22731],{},[146,22727,3151],{},[146,22729,22730],{},"8–16 weeks (full K8s curve)",[146,22732,22733],{},"1–2 weeks",[125,22735,22736,22739,22741],{},[146,22737,22738],{},"kubectl + templating chart compatibility",[146,22740,16913],{},[146,22742,22743],{},"None — own 
vocabulary",[125,22745,22746,22748,22751],{},[146,22747,11364],{},[146,22749,22750],{},"No — install separately",[146,22752,22753],{},"Yes, embedded",[125,22755,22756,22759,22762],{},[146,22757,22758],{},"Embedded automatic certificates",[146,22760,22761],{},"No — external operator",[146,22763,22753],{},[125,22765,22766,22769,22772],{},[146,22767,22768],{},"Embedded metrics",[146,22770,22771],{},"No — external 3-product stack",[146,22773,22774],{},"Yes, system's own job",[125,22776,22777,22780,22783],{},[146,22778,22779],{},"Centralized logs",[146,22781,22782],{},"No — external 2-product stack",[146,22784,22785],{},"Yes, single embedded writer",[125,22787,22788,22790,22793],{},[146,22789,22111],{},[146,22791,22792],{},"Vast (hundreds)",[146,22794,22795],{},"None — workloads as regular jobs",[125,22797,22798,22801,22804],{},[146,22799,22800],{},"Recommended scale range",[146,22802,22803],{},"1 node to 10k+",[146,22805,22806],{},"1 to 500 servers",[125,22808,22809,22812,22815],{},[146,22810,22811],{},"Commercial model",[146,22813,22814],{},"Open source (Apache 2.0), supported by SUSE",[146,22816,22817],{},"Free Community + paid Business + Enterprise",[12,22819,22820],{},"The column that matters most varies by context. For a team that already has K8s expertise, \"kubectl compatibility\" weighs a lot. For a team that's starting out, \"lines for hello world\" and \"learning curve\" weigh more.",[19,22822,22824],{"id":22823},"when-both-are-in-the-conversation-practical-decision","When both are in the conversation (practical decision)",[12,22826,22827],{},"Five real scenarios that show up in conversations with readers. The answers are direct because reality is direct.",[12,22829,22830,22833],{},[27,22831,22832],{},"\"We already have managed Kubernetes and it hurts operationally.\"","\nk3s reduces cloud cost because you exit the paid control plane. The operational pain remains — long manifests, TLS operators, external observability stacks. 
You save on the bill but not on time. HeroCtl reduces the pain at the root, but requires learning another tool and re-writing the primitives. If the pain is financial, k3s. If the pain is engineering time, HeroCtl.",[12,22835,22836,22839],{},[27,22837,22838],{},"\"We're just starting, we want something simple.\"","\nHeroCtl. Kubernetes (full or k3s) adds months of learning curve that don't generate product value in the early phase. You spend three months learning templating charts and ingress controllers instead of shipping features. In the early stage, opportunity cost is everything.",[12,22841,22842,22845],{},[27,22843,22844],{},"\"Compliance requires a CNCF-certified distribution.\"","\nk3s or Talos. HeroCtl isn't on that list. It's not pride — it's honesty. When we're ready for those lists, we'll talk again.",[12,22847,22848,22851],{},[27,22849,22850],{},"\"Team has 1 strong SRE who loves Kubernetes.\"","\nk3s. Keeps the SRE happy, preserves all the team's existing knowledge, and still cuts the cloud bill. HeroCtl would force the SRE to re-learn and abandon tools they master — unnecessary friction when the expertise is already paid for.",[12,22853,22854,22857],{},[27,22855,22856],{},"\"Team has 0 SREs and grows through product.\"","\nHeroCtl. Kubernetes without expertise is a recipe for disaster — you'll discover what a pod stuck in CrashLoopBackOff is on a Friday night, without context to debug. HeroCtl is sized for a team that doesn't have a dedicated infra on-call rotation.",[19,22859,22861],{"id":22860},"the-improbable-migration","The improbable migration",[12,22863,22864],{},"Migrating from k3s to HeroCtl, or vice versa, is an operation that seems worse than it is.",[12,22866,22867,22870],{},[27,22868,22869],{},"Conceptually, the two are similar."," Both run containers, both have a notion of replicas, both have ingress, both have health checks, both have rolling update deploys. 
If you know how to do one, you know how to think about the other.",[12,22872,22873,22876],{},[27,22874,22875],{},"Syntactically, they are incompatible."," A Kubernetes manifest doesn't convert 1:1 to a HeroCtl job spec. Fields don't match, abstractions aren't the same, defaults are different. You re-write.",[12,22878,22879,22882],{},[27,22880,22881],{},"Re-writing isn't as expensive as it seems."," For a typical team with 20 to 40 specs in production, the migration takes an afternoon. The reason is that most K8s manifests have huge structural repetition — 80% of fields are standardized, and you discover the mapping quickly. For teams with a few dozen jobs, manual conversion suffices. Above that, we're open to talking about experimental converters.",[12,22884,22885],{},"The migration in the other direction (HeroCtl → k3s) is more expensive, because you're leaving a lean model for a verbose model. You gain ecosystem; you pay in verbosity.",[19,22887,18048],{"id":18047},[12,22889,22890],{},"Scenario: 4 VPS at a low-cost European provider, each with 4 vCPU and 8 GB of RAM. Infrastructure cost: about R$100\u002Fmonth per VPS, R$400\u002Fmonth total.",[12,22892,22893,22896],{},[27,22894,22895],{},"Self-managed k3s in this scenario"," costs R$400\u002Fmonth of infra plus a partial SRE salary. A strong SRE in Brazil costs a fully loaded R$15k to R$25k\u002Fmonth. Even if you allocate only 30% of their time to the cluster — which is optimistic for a small team — that's R$5k to R$7.5k of people cost. Total: R$5.4k to R$7.9k\u002Fmonth.",[12,22898,22899,22902],{},[27,22900,22901],{},"HeroCtl Community in the same scenario"," costs R$400\u002Fmonth of infra plus a part-time dev allocation to the cluster. Since the operational model is shorter, 10% of a senior dev's time suffices — R$1.5k to R$2.5k\u002Fmonth. Total: R$1.9k to R$2.9k\u002Fmonth.",[12,22904,22905],{},"The difference is in the cost of people. k3s asks for more expertise; more expertise costs more. 
The infra is practically the same.",[12,22907,22908],{},"This calculation flips when the team already has an SRE paid independently of the orchestrator choice. If the SRE exists for the rest of the stack, the marginal cost of operating k3s is low, and the CNCF ecosystem becomes worth its weight in gold. It's another type of company.",[19,22910,3225],{"id":3224},[12,22912,22913,22916],{},[27,22914,22915],{},"Is k3s still full Kubernetes?","\nYes. k3s is CNCF-certified as a conformant distribution. The same manifests run, kubectl works identically, the API is the same. What was removed were dependencies and cloud plugins — not the API or its semantics.",[12,22918,22919,22922],{},[27,22920,22921],{},"Can HeroCtl run on Raspberry Pi like k3s?","\nTechnically, yes — HeroCtl runs on any Linux server with Docker, including ARM. Practically, the \"edge on Raspberry Pi\" use case is territory where k3s has years of polish and HeroCtl hasn't been exercised enough yet. If your use is industrial edge on modest ARM hardware, k3s is the more proven choice today. HeroCtl on Pi works for hobby use; for edge production, wait a few more quarters.",[12,22924,22925,22928],{},[27,22926,22927],{},"Does kubectl work on HeroCtl?","\nNo. HeroCtl has its own CLI and its own API. The intent was different from the start — we don't try to be Kubernetes-compatible. Whoever wants kubectl wants Kubernetes; that's the right tool for that person.",[12,22930,22931,22934],{},[27,22932,22933],{},"How to migrate from managed Kubernetes to k3s?","\nMost manifests run directly. The usual exceptions are: cloud provider-specific annotations (load balancer, storage class), native IAM integrations and the occasional ingress controller that assumed cloud infrastructure. You swap for CNCF ecosystem equivalents (MetalLB for LB, Longhorn or local storage for volumes) and redo the annotations. 
For a cluster with a few dozen manifests, the migration takes a few days.",[12,22936,22937,22940],{},[27,22938,22939],{},"Does HeroCtl have multi-region like Rancher Fleet?","\nNot natively. Today the control plane quorum is sized for one cluster per region. You can operate HeroCtl across multiple regions in parallel, each with its cluster, but there's no federation layer today that presents all as a single surface. It's on the roadmap. Whoever needs that today, k3s + Rancher Fleet or full Kubernetes + Karmada are exercised paths.",[12,22942,22943,22946],{},[27,22944,22945],{},"Which scales higher?","\nKubernetes (full or k3s). Companies operate K8s clusters with tens of thousands of nodes in production. HeroCtl aims at the \"1 to 500 servers\" range and doesn't intend to compete above that. If you operate at the level of hundreds of thousands of machines, K8s is the path — not by choice, by sizing.",[12,22948,22949,22952],{},[27,22950,22951],{},"Can I run both side by side?","\nYes. Both are orchestrators that run containers on Linux servers. You can have a k3s cluster for workloads that depend on the CNCF ecosystem and a HeroCtl cluster for typical web apps — they don't conflict, they are different products. Some of our customers do exactly that: HeroCtl for the main product, k3s for a test environment that needs to mimic the end client's managed production cluster.",[19,22954,22956],{"id":22955},"honest-closing","Honest closing",[12,22958,22959,22960,22962],{},"The initial question — \"you're like a k3s, right?\" — now has a long answer. We are not. k3s is Kubernetes packaged to fit modest hardware, keeping the entire learning curve and the entire ecosystem intact. HeroCtl is a ",[27,22961,22590],{}," layer from Kubernetes, with its own vocabulary, shorter operational model, no operator ecosystem and no kubectl compatibility.",[12,22964,22965],{},"If you already speak fluent Kubernetes and want to take that expertise to cheap hardware or edge, k3s is the choice. 
If you never wanted to learn Kubernetes but need orchestration with real high availability, HeroCtl is the choice. If your pain is compliance that lists certified distributions, k3s or Talos. If your pain is engineering time spent on long manifests and external operators, HeroCtl.",[12,22967,22968],{},"There's no universal winner — there are tools that match different contexts. The wrong choice is neither k3s nor HeroCtl; it's adopting either one without understanding what problem you're really solving.",[12,22970,22971],{},"For whoever wants to try HeroCtl on the closest server, there's a single path:",[224,22973,22975],{"className":22974,"code":5318,"language":2529},[2527],[231,22976,5318],{"__ignoreMap":229},[12,22978,22979],{},"Five minutes later you have a single-node cluster running. Add two more servers with the same command + token, and you have real high availability — without installing an external operator, without assembling an observability stack, without learning new manifest vocabulary.",[12,22981,22982,22983,22985,22986,22988,22989,22991],{},"To continue reading, two posts speak directly to this one: ",[3336,22984,15781],{"href":15780}," addresses the general decision not to adopt K8s; ",[3336,22987,6334],{"href":6333}," compares the three families when the cloud server is on the table. 
And to understand why we exist as a separate product instead of being yet another K8s distribution, ",[3336,22990,6546],{"href":6545}," has the complete history.",[12,22993,22994],{},"The intent, as always, is simple: container orchestration, without ceremony — but with honesty about when ceremony is what you need.",{"title":229,"searchDepth":244,"depth":244,"links":22996},[22997,22998,22999,23000,23001,23002,23003,23004,23005,23006,23007,23008],{"id":22486,"depth":244,"text":22487},{"id":22514,"depth":244,"text":22515},{"id":22546,"depth":244,"text":22547},{"id":22583,"depth":244,"text":22584},{"id":22603,"depth":244,"text":22604},{"id":22637,"depth":244,"text":22638},{"id":22671,"depth":244,"text":22672},{"id":22823,"depth":244,"text":22824},{"id":22860,"depth":244,"text":22861},{"id":18047,"depth":244,"text":18048},{"id":3224,"depth":244,"text":3225},{"id":22955,"depth":244,"text":22956},"2026-02-26","k3s is Kubernetes packaged to fit in 512MB. For those who already speak K8s and want to take it somewhere smaller. HeroCtl is a DIFFERENT layer from Kubernetes. 
How to decide between the two without mixing premises.",{},{"title":22467,"description":23010},{"loc":5343},"en\u002Fblog\u002Fk3s-vs-heroctl-when-each-fits",[22686,20384,8756,23016,23017],"edge","lightweight","TI_9rDcZFKAxQFF4QGjbkVxxVlYRUs0wYOmy7yeHrrc",{"id":23020,"title":23021,"author":7,"body":23022,"category":8756,"cover":3379,"date":23600,"description":23601,"draft":3382,"extension":3383,"lastReviewed":3379,"meta":23602,"navigation":411,"path":23603,"readingTime":8761,"seo":23604,"sitemap":23605,"stem":23606,"tags":23607,"__hash__":23612},"blog_en\u002Fen\u002Fblog\u002Fheroctl-vs-nomad.md","HeroCtl vs Nomad: the alternative for those caught by the license change",{"type":9,"value":23023,"toc":23587},[23024,23027,23030,23034,23047,23050,23053,23057,23060,23066,23069,23075,23081,23085,23088,23091,23097,23103,23109,23112,23116,23119,23122,23125,23128,23131,23134,23138,23141,23147,23153,23159,23165,23168,23172,23175,23181,23191,23197,23200,23208,23211,23215,23390,23393,23397,23400,23406,23416,23426,23432,23438,23464,23467,23471,23474,23484,23490,23496,23502,23504,23510,23516,23525,23531,23541,23547,23553,23559,23561,23564,23567,23570,23577,23584],[12,23025,23026],{},"The Nomad story is, in retrospect, a warning. For many people who chose Nomad between 2020 and 2022 — for the argument of \"simpler than Kubernetes, real open-source, open governance\" — the two announcements that came after changed the base of the contract. In August 2023 the license changed. In February 2025 the company was sold. The technical product is still good. The contract around it is no longer the same that was on the table when you signed.",[12,23028,23029],{},"This post is about that difference. It isn't a critique of Nomad's technical core — we'll get to that. 
It's about the structural problem of adopting a tool whose contractual base can change between one reorganization and the next, and what's left for whoever decides today, in April 2026.",[19,23031,23033],{"id":23032},"where-nomad-got-it-right-and-keeps-getting-it-right","Where Nomad got it right (and keeps getting it right)",[12,23035,23036,23037,571,23040,571,23043,23046],{},"Before any criticism, credit where it's due. Nomad is technically good. The replicated control plane works in production. The command-line interface is consistent — ",[231,23038,23039],{},"nomad job run",[231,23041,23042],{},"nomad job stop",[231,23044,23045],{},"nomad alloc logs"," behave predictably across a thousand different clusters. Multi-region operation was actually thought through, with federation between datacenters and routing between them. It supports non-Docker workloads (native binaries, lightweight virtual machines, custom drivers) — a flexibility the larger competitor never prioritized.",[12,23048,23049],{},"Whoever has operated Nomad in serious production rarely complains about the core. The incidents that show up in public post-mortems are almost always at the edge — integration with the configuration manager, integration with the external router, integration with the secret vault — not in the orchestrator proper.",[12,23051,23052],{},"That technical track record is important because it sets the tone for the rest of the post: the Nomad problem isn't the code. The code is good. The problem is what happened around the code.",[19,23054,23056],{"id":23055},"the-timeline-in-absolute-dates","The timeline, in absolute dates",[12,23058,23059],{},"Worth recapping without dramatizing.",[12,23061,23062,23065],{},[27,23063,23064],{},"August 2023."," HashiCorp, then a publicly traded company, announces a license change across all main products — including Terraform, Vault, Consul, and Nomad. 
The license migrates from Mozilla Public License 2.0 (an open license, conforming to the OSI definition) to Business Source License 1.1 — a \"source available\" license. The practical difference is the clause that restricts use in commercial products that compete with the company's paid offering. You can still read the code. You can still run it internally. You can no longer embed it in a commercial product of yours, and can no longer offer it as a managed service to third parties, without separate licensing.",[12,23067,23068],{},"The community reaction was immediate. In a few weeks, a Terraform fork — OpenTofu — got off the ground, with Linux Foundation governance. Months later, Vault gained OpenBao with similar governance. Nomad and Consul were left without an equally strong community fork. There were discussions, halted contributions, some maintainer departures — but nothing the size of what happened with Terraform.",[12,23070,23071,23074],{},[27,23072,23073],{},"February 2025."," IBM completes the acquisition of HashiCorp, announced in April 2024, for approximately US$6.4 billion. HashiCorp becomes part of IBM's portfolio — aligned with IBM's existing offerings around internal platforms, configuration management, and hybrid cloud.",[12,23076,23077,23080],{},[27,23078,23079],{},"Today (April 2026)."," The BSL license remains in force. Official support is exclusively through the IBM channel. The release cadence continues, but the roadmap responds to IBM's internal OKRs now — no longer to an independent product plan. There's no strong community fork of Nomad equivalent to OpenTofu. Whoever is in production, is in production; whoever is adopting today, is adopting a tool whose contractual base changed twice in 30 months.",[19,23082,23084],{"id":23083},"what-this-means-in-practice","What this means in practice",[12,23086,23087],{},"For those who already had Nomad in production before August 2023, the situation is manageable. 
The version immediately prior to the license change is frozen at MPL 2.0 — you can stick with it, take selected patches, wait for a fork to mature. It's not comfortable, but it's workable.",[12,23089,23090],{},"For those deciding in 2026, it's a different decision. Three fine-print asterisks matter:",[12,23092,23093,23096],{},[27,23094,23095],{},"The next critical feature might land only in the paid version."," The license doesn't prevent it, and the company no longer has an independent product plan that keeps strategic features on the open path. It's a choice now, not a principle.",[12,23098,23099,23102],{},[27,23100,23101],{},"The roadmap responds to IBM OKRs."," This may be good — IBM has engineering muscle and will invest — or bad, if IBM's priorities (compliance, government, integration with legacy products) are different from yours. You don't control which of the two happens.",[12,23104,23105,23108],{},[27,23106,23107],{},"Commercial prices may be revisited in the next reorganization."," Large acquisitions usually have a \"value discovery\" period where price tables are reviewed. There's no rule preventing a new round of contractual changes in 12 or 24 months. You have no protection against this.",[12,23110,23111],{},"The technical community notices these signals. The rate of external contributions to Nomad fell after August 2023. Companies that previously showcased public use cases stopped publishing. Conference talks cooled. The ecosystem still exists — but it lost the community-driven traction it had between 2018 and 2022.",[19,23113,23115],{"id":23114},"the-right-lesson-isnt-open-source-or-nothing","The right lesson isn't \"open source or nothing\"",[12,23117,23118],{},"Here comes the part that matters for any infrastructure tool — including HeroCtl.",[12,23120,23121],{},"The Nomad problem wasn't going paid. 
Companies need to be sustainable; charging for software is legitimate; serious commercial software is better for everyone than open software that goes into skeletal maintenance because nobody pays.",[12,23123,23124],{},"The Nomad problem was changing the contract with those who had already bet. Whoever chose Nomad in 2021 based on the open license had a reasonable expectation — that the base would be stable. It wasn't. In three years, the base became something else, and the cost of leaving was — exactly — the amount of work that had been invested in the original choice.",[12,23126,23127],{},"This reveals an asymmetry. Open software that can become commercial at any moment is riskier than commercial software published from day one, with terms frozen for existing contracts. The first looks safer because \"it's open\". That perception is wrong when control of the project is concentrated in a single company, subject to acquisition, IPO, CEO change, or investor pressure. The contract may change; the sense of security was an illusion.",[12,23129,23130],{},"Honest commercial software — published prices, frozen terms, no retroactive change, a continuity mechanism in case the company shuts down — is structurally more predictable.",[12,23132,23133],{},"That's the HeroCtl pivot.",[19,23135,23137],{"id":23136},"how-heroctl-solves-this-structurally","How HeroCtl solves this structurally",[12,23139,23140],{},"HeroCtl was born commercial. There was no \"open source while we grow, commercial later\" phase. The terms have been on the table since day one, published, with explicit anti-rug-pull mechanisms.",[12,23142,23143,23146],{},[27,23144,23145],{},"Permanently free Community plan."," No server limit. No job limit. No artificial feature gate. Runs the entire stack — real high availability, integrated router, automatic certificates, encryption between services, metrics, and logs. 
Individuals and small teams never need to leave Community.",[12,23148,23149,23152],{},[27,23150,23151],{},"No phone-home, no kill-switch."," Once installed, the cluster works without ever talking to the company's server. There's no periodic activation that expires. There's no flag that can be revoked from outside. If office internet goes down forever, the cluster keeps running.",[12,23154,23155,23158],{},[27,23156,23157],{},"Source-code escrow on Enterprise."," Enterprise contracts include an escrow clause — the code is deposited with a third-party custodian. If the company shuts down operations, the code goes to paying customers via the custodian, with license for internal continuity. It isn't \"trust us\"; it's a legal mechanism that survives the company's disappearance.",[12,23160,23161,23164],{},[27,23162,23163],{},"Published prices."," Business and Enterprise have visible prices on the plans page — without mandatory \"talk to sales\" as a qualification tactic. No unilateral revision clause; the agreed price stays frozen for existing contracts.",[12,23166,23167],{},"The difference with what happened in August 2023 and February 2025 isn't a promise. It's a structure.",[19,23169,23171],{"id":23170},"honest-technical-comparison","Honest technical comparison",[12,23173,23174],{},"Setting license aside, what's the technical difference? At a high level, surprisingly less than you'd imagine.",[12,23176,23177,23180],{},[27,23178,23179],{},"Similar primitives."," Nomad organizes work in job → group → task. HeroCtl organizes in job → replica → task. The concepts map almost one-to-one. A job describes the service, the grouping determines replica and location, the task is what actually runs. Both support long-running jobs (services), batch jobs, and periodic jobs.",[12,23182,23183,23186,23187,23190],{},[27,23184,23185],{},"Replicated control plane in both."," Both use consensus between servers for state durability. 
Both tolerate the loss of a minority of servers without unavailability. Both elect a leader automatically. In HeroCtl, the public chaos test shows election in around seven seconds after a ",[231,23188,23189],{},"kill -9"," on the leader.",[12,23192,23193,23196],{},[27,23194,23195],{},"Key difference 1: batteries included vs assemble the mesh."," Here the philosophies diverge. Nomad was designed to be composed with other components from the same ecosystem — external configuration manager for service mesh, external secret vault, and some separate gateway for ingress. This gives huge flexibility in complex scenarios, but means running Nomad in honest production usually requires three products in parallel, each with its own control plane, its own update, its own learning curve.",[12,23198,23199],{},"HeroCtl has integrated router, automatic Let's Encrypt certificates, and encryption between services embedded in the single binary. You don't assemble the mesh — it comes assembled. For small teams that's two months of work saved; for large teams operating dozens of federated datacenters, it's less flexibility.",[12,23201,23202,23205,23206,101],{},[27,23203,23204],{},"Key difference 2: application range."," Nomad targets high scale. There are public clusters running tens of thousands of nodes. The ecosystem is more mature in that range — and whoever is coming down from Nomad because the complexity isn't justified is usually doing the same math we describe in ",[3336,23207,15781],{"href":15780},[12,23209,23210],{},"HeroCtl targets the \"1 to 500 server\" range — single-server for prototyping, three for real high availability, dozens to hundreds for productive scale of medium SaaS. 
Above that, we haven't yet tested at large scale in production; the roadmap gets there, but it isn't this quarter's priority.",[19,23212,23214],{"id":23213},"side-by-side-no-flourishes","Side by side, no flourishes",[119,23216,23217,23228],{},[122,23218,23219],{},[125,23220,23221,23223,23226],{},[128,23222,2982],{},[128,23224,23225],{},"Nomad (under BSL\u002FIBM)",[128,23227,2994],{},[141,23229,23230,23240,23249,23262,23273,23283,23293,23304,23315,23326,23337,23348,23359,23370,23381],{},[125,23231,23232,23234,23237],{},[146,23233,22811],{},[146,23235,23236],{},"Source available (BSL 1.1) + commercial via IBM",[146,23238,23239],{},"Commercial since day 1, permanently free plan",[125,23241,23242,23245,23247],{},[146,23243,23244],{},"Replicated control plane",[146,23246,3064],{},[146,23248,3064],{},[125,23250,23251,23254,23257],{},[146,23252,23253],{},"Leader election",[146,23255,23256],{},"Yes, in seconds",[146,23258,23259,23260],{},"Yes, ~7 seconds after ",[231,23261,23189],{},[125,23263,23264,23267,23270],{},[146,23265,23266],{},"HTTP\u002FTLS router",[146,23268,23269],{},"External (third-party gateway)",[146,23271,23272],{},"Embedded",[125,23274,23275,23277,23280],{},[146,23276,3923],{},[146,23278,23279],{},"External (specialized operator)",[146,23281,23282],{},"Embedded (automatic Let's Encrypt)",[125,23284,23285,23288,23291],{},[146,23286,23287],{},"Encryption between services",[146,23289,23290],{},"External (configuration manager + vault)",[146,23292,23272],{},[125,23294,23295,23298,23301],{},[146,23296,23297],{},"Secret vault",[146,23299,23300],{},"External (dedicated vault)",[146,23302,23303],{},"Embedded (the cluster is the vault)",[125,23305,23306,23309,23312],{},[146,23307,23308],{},"Metrics + logs",[146,23310,23311],{},"External stack",[146,23313,23314],{},"Internal job + embedded single writer",[125,23316,23317,23320,23323],{},[146,23318,23319],{},"Lines for app+ingress+TLS",[146,23321,23322],{},"80–120 lines across multiple 
files",[146,23324,23325],{},"~50 lines in one file",[125,23327,23328,23331,23334],{},[146,23329,23330],{},"Driver ecosystem",[146,23332,23333],{},"Deep (Docker, exec, java, qemu, plugins)",[146,23335,23336],{},"Docker as runtime",[125,23338,23339,23342,23345],{},[146,23340,23341],{},"Maximum tested scale",[146,23343,23344],{},"Tens of thousands of nodes",[146,23346,23347],{},"1–500 nodes (target range)",[125,23349,23350,23353,23356],{},[146,23351,23352],{},"Official support",[146,23354,23355],{},"IBM channel",[146,23357,23358],{},"Direct from manufacturer; SLA on Business",[125,23360,23361,23364,23367],{},[146,23362,23363],{},"Frozen contract for existing customers",[146,23365,23366],{},"No — changed in 2023",[146,23368,23369],{},"Yes, explicit clause",[125,23371,23372,23375,23378],{},[146,23373,23374],{},"Continuity mechanism if the company shuts down",[146,23376,23377],{},"Implicit (BSL converts to MPL after 4 years)",[146,23379,23380],{},"Active escrow on Enterprise",[125,23382,23383,23386,23388],{},[146,23384,23385],{},"Phone-home \u002F kill-switch",[146,23387,3058],{},[146,23389,3058],{},[12,23391,23392],{},"The column that gives the chills is the second-to-last. \"Frozen contract for existing customers\" is exactly what was missing in August 2023 — those who had adhered with expectation of open license were caught by the retroactive. In HeroCtl, this is an explicit clause.",[19,23394,23396],{"id":23395},"migrating-from-nomad-to-heroctl","Migrating from Nomad to HeroCtl",[12,23398,23399],{},"The good news of primitive convergence: migration isn't an architecture redesign. It is, mostly, a file translation.",[12,23401,23402,23405],{},[27,23403,23404],{},"Direct mapping."," A job in Nomad becomes a job in HeroCtl. A group becomes a replica grouping. A task remains the unit that runs on the agent. Placement constraints (node class, datacenter, custom attribute) have direct equivalents. Update strategies (rolling, canary, blue-green) idem. 
Health checks via command or HTTP, likewise.",[12,23407,23408,23411,23412,23415],{},[27,23409,23410],{},"What changes in the file."," The HeroCtl configuration file is smaller — around 50 lines for a common case of a web app with ingress and secrets, compared to 80–120 lines in the equivalent jobspec. The difference comes from fewer intermediate abstractions and from native integration with the router (you describe ",[231,23413,23414],{},"ingress: { host, tls: true }"," in the job itself, not in a separate document).",[12,23417,23418,23421,23422,23425],{},[27,23419,23420],{},"What needs adaptation."," Integrations with the external secret vault need to become ",[231,23423,23424],{},"secrets:"," blocks in the job itself (the cluster is the vault in HeroCtl). Service mesh via an external configuration manager needs to be replaced with the embedded encryption between services — usually that's a simplification, not a regression. Non-Docker drivers (exec, java) don't have a direct equivalent today; applications that depend on them stay with Nomad or get packaged in a container.",[12,23427,23428,23431],{},[27,23429,23430],{},"Experimental converter."," There's a jobspec converter under development — it takes a Nomad file as input and emits a HeroCtl file as output, with warnings for non-covered cases. It's a work in progress; we don't promote it as a finished product. It covers the common cases (service job with Docker, ingress, health check, secrets); edge cases still require manual review. If it's relevant to your migration, write to us.",[12,23433,23434,23437],{},[27,23435,23436],{},"Phased plan."," The path we recommend for those who have production on Nomad:",[2734,23439,23440,23446,23452,23458],{},[70,23441,23442,23445],{},[27,23443,23444],{},"Phase 1 (weeks 1–2)."," Bring up a HeroCtl cluster in parallel, with 3 servers. Migrate a non-critical job first — preferably a batch or periodic job that has no awake user depending on it. 
Validate logs, metrics, and behavior under failure.",[70,23447,23448,23451],{},[27,23449,23450],{},"Phase 2 (weeks 3–4)."," Migrate a web application with few users. Validate certificate issuance, rolling deploys, and behavior during server loss. Compare latency and error rate with the Nomad equivalent.",[70,23453,23454,23457],{},[27,23455,23456],{},"Phase 3 (weeks 5+)."," Cut over application by application, not the entire cluster at once. DNS points to HeroCtl, monitor for a week, move to the next application. Keep Nomad running whatever hasn't migrated yet.",[70,23459,23460,23463],{},[27,23461,23462],{},"Phase 4 (when comfortable)."," Decommission Nomad. Document what you learned in your internal wiki for the next tool — because there's always a next tool.",[12,23465,23466],{},"For teams with a few dozen jobs, the entire migration is an afternoon. For teams with hundreds of jobs and exotic drivers, it's bespoke work — write to us so we can help plan.",[19,23468,23470],{"id":23469},"when-nomad-continues-being-the-right-choice","When Nomad is still the right choice",[12,23472,23473],{},"Honesty first: don't migrate to chase fashion.",[12,23475,23476,23479,23480,23483],{},[27,23477,23478],{},"If you're already running Nomad in production without operational problems, stay."," The license change is manageable for those who don't embed Nomad or offer it as a managed service — you run it internally, update carefully, and wait for the ecosystem to settle. Migrating a stack that works is unnecessary work. For those leaving Nomad in search of something simpler and single-server, it's worth considering ",[3336,23481,23482],{"href":16689},"Coolify as an alternative"," before assuming the path is HeroCtl.",[12,23485,23486,23489],{},[27,23487,23488],{},"If your architecture deeply depends on other components from the same ecosystem."," Secret vault integrated with configuration manager integrated with Nomad, with federated policies and ACLs — undoing that integration is expensive. 
The cost of leaving may be greater than the license asterisk.",[12,23491,23492,23495],{},[27,23493,23494],{},"If you operate multi-region at large scale."," Hundreds of federated datacenters, with workloads moving between them, depending on mature federation between clusters. Nomad has eight years invested in that direction. HeroCtl has a few quarters in real production. The conservative choice for that profile is Nomad — and that's fine.",[12,23497,23498,23501],{},[27,23499,23500],{},"If you have governmental contracts that list vendors by name."," Some frameworks (FedRAMP, ITAR, federal contracts) require listed vendors. IBM now meets that list. HeroCtl is too young. If your compliance officer needs to point to a company name in an audit list, today the answer is Nomad via IBM, not HeroCtl.",[19,23503,7347],{"id":7346},[12,23505,23506,23509],{},[27,23507,23508],{},"Is HeroCtl just a Nomad clone?","\nNo. Convergence of primitives (job, grouping, task, replicated control plane) reflects that these are the right blocks for container orchestration — not a copy. The important differences (batteries included instead of external mesh, frozen commercial contract, application range 1–500) are opposite structural choices, not implementation detail. If it were a clone, the configuration file would have 100+ lines and depend on three parallel products.",[12,23511,23512,23515],{},[27,23513,23514],{},"What if HeroCtl changes its license?","\nFair question — it's exactly the kind of question that matters after August 2023. Three protections: the published commercial contract has a clause of frozen terms for existing customers (the agreed price stays, doesn't change retroactively); the binary has no phone-home or kill-switch (once installed, runs independent); Enterprise contracts have active escrow (code goes to paying customers if the company shuts down). It isn't \"trust us\". 
It's legal and technical structure that survives management change, acquisition, or shutdown.",[12,23517,23518,23521,23522,23524],{},[27,23519,23520],{},"Does the jobspec converter convert integration with secret vault and configuration manager?","\nNot entirely. Integrations with external vault become ",[231,23523,23424],{}," blocks in the HeroCtl job — the converter does the syntactic transformation, but it's up to you to review the logic (rotation, access policy, scope). Service mesh via external manager generally becomes the embedded encryption, with simplification in most cases. Exotic cases (dynamic policies, secrets with coordinated renewal) get a \"review manually\" warning in the converter report.",[12,23526,23527,23530],{},[27,23528,23529],{},"And the Nomad driver plugins? I have a custom driver to run native binaries.","\nToday, HeroCtl runs over Docker as runtime. Exotic drivers (exec, java, qemu, custom) don't have a direct equivalent — the application needs to be packaged in a container. For a lot of things, this is simplification; for some workloads (statically compiled binaries, lightweight virtual machines), it's extra work. If you depend on a non-Docker driver in production, talk to us before migrating.",[12,23532,23533,23536,23537,23540],{},[27,23534,23535],{},"And the periodic and batch jobs?","\nSupported. HeroCtl has long-running jobs, batch jobs (executes, finishes, releases), and periodic jobs (cron-like). The syntax is direct — a ",[231,23538,23539],{},"schedule: \"*\u002F5 * * * *\""," key in the configuration file. What we don't yet have is massive batch fan-out (thousands of parallel tasks with result aggregation) — for that, tools dedicated to data flow remain better.",[12,23542,23543,23546],{},[27,23544,23545],{},"Can I operate HeroCtl behind an existing Nomad?","\nTechnically yes — HeroCtl runs on any Linux server with Docker, so you can describe a Nomad job that brings up the HeroCtl agent. 
But the recommendation is the opposite: run HeroCtl in parallel, migrate application by application, decommission Nomad when comfortable. Nesting two orchestrators means maintaining two control planes, two mental models, and two sets of logs — with no gain.",[12,23548,23549,23552],{},[27,23550,23551],{},"Is the Business price higher or lower than Nomad Enterprise?","\nHeroCtl Business and Enterprise prices are published on the plans page — with no mandatory \"talk to sales\" as a qualification tactic. The Nomad commercial license under IBM is negotiated case by case and has no public price; direct comparisons depend on a quote. The point we highlight isn't \"we're cheaper than IBM\" — it's \"our price is public and frozen for existing contracts\". You know what you'll pay in five years. That predictability is the product.",[12,23554,23555,23558],{},[27,23556,23557],{},"How does support work?","\nCommunity has a public forum and community channels — no SLA. Business has direct official support from the manufacturer, with an hours-level SLA on first response. Enterprise adds 24×7 support and dedicated development for specific extensions. Important: official support is direct — not outsourced via a channel, not dependent on an extra vendor in the chain.",[19,23560,3309],{"id":3308},[12,23562,23563],{},"The Nomad story is less about Nomad and more about how to adopt infrastructure. Open software controlled by a single company is fragile on a scale of years — not in the code, but in the contract. Acquisition, IPO, management change, investor pressure: any of these can rewrite the base under your feet.",[12,23565,23566],{},"The right answer isn't \"reject everything that has a CNPJ behind it\". The right answer is to require the contract to be explicit from day one, with frozen terms for those who already signed and a continuity mechanism in case the company disappears. 
Honest commercial software, with these three properties, is structurally more predictable than open software that may change.",[12,23568,23569],{},"That's the HeroCtl design. Commercial since day one. Permanently free Community plan, no artificial feature gate. Business and Enterprise with published prices. No phone-home, no kill-switch. Escrow on Enterprise.",[12,23571,23572,23573,23576],{},"To install: ",[231,23574,23575],{},"curl -sSL https:\u002F\u002Fget.heroctl.com\u002Finstall.sh | sh"," on a Linux server with Docker. For real high availability, the same command on three servers — they form quorum automatically.",[12,23578,23579,23580,23583],{},"To understand the motivation behind this, read ",[3336,23581,23582],{"href":6545},"why we created HeroCtl"," — explains the three paths that existed in 2026, the gap each left, and what we tried to fill.",[12,23585,23586],{},"If you're on Nomad and want to talk about migration, write. If you're on Nomad and you're well, stay on Nomad — and take advantage to read the contract more carefully next time a renewal arrives. That's the real lesson of August 2023.",{"title":229,"searchDepth":244,"depth":244,"links":23588},[23589,23590,23591,23592,23593,23594,23595,23596,23597,23598,23599],{"id":23032,"depth":244,"text":23033},{"id":23055,"depth":244,"text":23056},{"id":23083,"depth":244,"text":23084},{"id":23114,"depth":244,"text":23115},{"id":23136,"depth":244,"text":23137},{"id":23170,"depth":244,"text":23171},{"id":23213,"depth":244,"text":23214},{"id":23395,"depth":244,"text":23396},{"id":23469,"depth":244,"text":23470},{"id":7346,"depth":244,"text":7347},{"id":3308,"depth":244,"text":3309},"2026-02-19","HashiCorp changed Nomad's license in August 2023 and was acquired by IBM in February 2025. 
For those adopting today, it's a big asterisk.",{},"\u002Fen\u002Fblog\u002Fheroctl-vs-nomad",{"title":23021,"description":23601},{"loc":23603},"en\u002Fblog\u002Fheroctl-vs-nomad",[23608,23609,23610,23611,8756,6393],"nomad","hashicorp","ibm","bsl","dhYOXeaTNkJUjPrkRv0sh5r5THK77SsZlu6enneAw6c",{"id":23614,"title":23615,"author":7,"body":23616,"category":8756,"cover":3379,"date":24156,"description":24157,"draft":3382,"extension":3383,"lastReviewed":3379,"meta":24158,"navigation":411,"path":24159,"readingTime":8761,"seo":24160,"sitemap":24161,"stem":24162,"tags":24163,"__hash__":24168},"blog_en\u002Fen\u002Fblog\u002Frender-vs-railway-vs-fly-io.md","Render vs Railway vs Fly.io: Brazilian comparison 2026",{"type":9,"value":23617,"toc":24144},[23618,23621,23624,23627,23631,23634,23645,23648,23651,23654,23658,23669,23675,23681,23687,23693,23699,23703,23706,23711,23716,23721,23726,23731,23735,23738,23747,23752,23763,23768,23773,23775,23965,23968,23972,23975,23981,23987,23993,23999,24002,24006,24009,24015,24021,24027,24033,24037,24040,24046,24052,24058,24064,24066,24072,24078,24087,24093,24099,24105,24115,24117,24120,24123,24126,24131,24141],[12,23619,23620],{},"In August 2022 Salesforce announced the end of Heroku's free plan. The decision was written in two lines on a corporate blog post, became a meme in three hours, and within three months was rewriting the hosted PaaS map for every indie dev in the world. Brazilians took the hit twice: lost the free tier and also saw the dollar rise from R$5.20 to R$5.90 in the same period.",[12,23622,23623],{},"Three products picked up most of the inheritance: Render, Railway, and Fly.io. Each bets on a different philosophy — predictable fixed pricing, pay-as-you-go with pretty UI, or multi-region edge with low-level primitives. 
For Brazilian devs, the choice mixes DX, cost in USD that becomes reais, latency to São Paulo, and the scale ceiling before the project starts to hurt the wallet.",[12,23625,23626],{},"This article is the honest map. No \"best\" ranking, because there isn't one — there's the best for each stage. And at the end of the line, when the three become expensive, there's the self-hosted path, which is where HeroCtl comes in. But let's go in parts.",[19,23628,23630],{"id":23629},"the-heroku-inheritance","The Heroku inheritance",[12,23632,23633],{},"To understand what each one copied and what each one discarded, it's worth remembering what Heroku did well in its ten glory years.",[12,23635,23636,23637,23640,23641,23644],{},"The first thing was ",[231,23638,23639],{},"git push heroku main",". That command became the mental deploy standard for an entire generation of devs. There was no pipeline to configure, no image registry to populate, no orchestration file to write. You pushed code to the remote repository and the platform took care of the rest — buildpack detected the language, installed dependencies, started the process. Render and Railway inherited this almost intact. Fly.io swapped it for ",[231,23642,23643],{},"fly deploy",", which has the same face but works with explicit container image.",[12,23646,23647],{},"The second thing was the notion of \"dyno\" as a billing and scaling unit. You didn't think in terms of server, vCPU, or RAM — you thought in terms of how many web processes and how many workers. Render kept that simplification on the Starter plan. Railway abandoned it and charges by real consumed resource. Fly.io swapped dyno for VM with declared size.",[12,23649,23650],{},"The third thing was the addons marketplace: Postgres, Redis, MongoDB, transactional email, all one click away. That's the piece where the three successors diverge most. Render has some native managed addons. Railway has a template marketplace covering the basics. 
Fly.io pushes you to run your own Postgres in a dedicated app, with all the responsibility that brings.",[12,23652,23653],{},"And the fourth thing, the painful one, was vendor lock-in disguised as simplicity. Whoever had fifty services on Heroku discovered in 2022 that migrating would cost months. The three successors promise to be better in this regard; we'll see.",[19,23655,23657],{"id":23656},"render-the-predictable-heroku-20","Render: the \"predictable Heroku 2.0\"",[12,23659,23660,23661,23664,23665,23668],{},"Render is the most conservative successor. The premise is simple: you describe the service in a ",[231,23662,23663],{},"render.yaml"," file or via the panel, connect the repository, and each push to ",[231,23666,23667],{},"main"," triggers a deploy. Build, TLS certificate, custom domain, all automatic.",[12,23670,23671,23674],{},[27,23672,23673],{},"Philosophy."," Fixed prices per instance. The cheapest Starter instance costs US$7\u002Fmonth — and costs US$7\u002Fmonth whether you make five requests per day or fifty thousand. There's a free plan, but with a big asterisk: the service sleeps after fifteen minutes without traffic and takes about thirty seconds to wake up on the first request. It's exactly the same tradeoff as 2015 Heroku free. Works for portfolio, doesn't work for serious product.",[12,23676,23677,23680],{},[27,23678,23679],{},"Strong points."," Budget predictability is the great appeal for small teams. You know that three Starter services will cost US$21\u002Fmonth, period. There's no end-of-month surprise because a cron ran in a loop. Managed Postgres and Redis exist as own products, with automatic backup and recent version. The panel is clean, focuses on doing few things well. 
Latency to São Paulo from the Ohio data center stays around 120ms — not ideal, but viable for a typical web app where the bottleneck is usually the database query.",[12,23682,23683,23686],{},[27,23684,23685],{},"Weak points."," The sleeping free tier is sad for anything that isn't a personal showcase. Scaling is expensive because it's linear: four Starter instances cost four times US$7. There's no native multi-region on the basic plan — you're stuck with the region you chose. CDN and edge computing are limited; anyone who needs heavy geographic caching will suffer. And managed Postgres starts at US$7\u002Fmonth, so the real floor for an app with a database is US$14\u002Fmonth, which already stings for an MVP at the zero-revenue stage.",[12,23688,23689,23692],{},[27,23690,23691],{},"Cost in reais."," A simple app without a database: US$7\u002Fmonth = R$36 to R$40 depending on the exchange rate. With minimum Postgres: US$14\u002Fmonth = R$72 to R$80. A production version with two web instances and Standard Postgres (US$20\u002Fmonth): US$34\u002Fmonth = R$170 to R$190. Add domain, transactional email, and monitoring and you're at R$300\u002Fmonth easily for a small SaaS. It's not expensive by American standards. It's expensive by the standards of a Brazilian who pays their own salary out of MRR.",[12,23694,23695,23698],{},[27,23696,23697],{},"When it makes sense."," An indie hacker already on Heroku who wants to migrate with minimal mental friction. A team of one to three people that values a predictable budget more than absolute minimum cost. A traditional web app without low geographic latency requirements. A project where you'd rather pay US$30\u002Fmonth and not think about it than pay US$10\u002Fmonth with the possibility of it exploding to US$80 in a bad month.",[19,23700,23702],{"id":23701},"railway-the-pay-as-you-go-with-premium-ux","Railway: the \"pay-as-you-go with premium UX\"",[12,23704,23705],{},"Railway is the successor that dared to invert the billing model. 
Instead of a fixed instance, you pay for the resources you actually consume — vCPU, RAM, network egress, storage. The premise is that small apps and intermittent jobs come out cheaper on pay-as-you-go than on fixed instances.",[12,23707,23708,23710],{},[27,23709,23673],{}," Granular billing by real usage. You don't choose a machine size; you choose the app and Railway decides how much resource to deliver as demand comes, within a ceiling. Excellent UI — probably the prettiest in the hosted PaaS segment. One-click templates cover dozens of combinations: Postgres, Redis, MongoDB, MeiliSearch, n8n, Strapi.",[12,23712,23713,23715],{},[27,23714,23679],{}," UX wins on first contact. In fifteen minutes you have an app running with a database, Redis, and a real-time logs dashboard. The template marketplace is friendly for those who don't want to write a Dockerfile. For low-traffic apps pay-as-you-go can come out much cheaper than Render's fixed instance — a cron that runs ten seconds a day pays for only those ten seconds, not twenty-four hours.",[12,23717,23718,23720],{},[27,23719,23685],{}," Unpredictable cost is the Achilles heel. Railway has a soft spending ceiling, but not a hard one by default — a job in an infinite loop can burn US$50 over a weekend before you notice. The pricing has fine print: the Hobby plan (US$5\u002Fmonth) includes US$5 of usage, so technically you start from zero every month, but the bill can easily pass that. The original free plan was removed in 2023 — another repetition of the Heroku pattern. The datacenter is US-only; latency to SP stays at the same 120ms as Render. And there's a history of pricing changes that caught users by surprise, like the removal of the more generous trial in mid-2023.",[12,23722,23723,23725],{},[27,23724,23691],{}," Hobby plan: US$5\u002Fmonth fixed = R$25 to R$28. But real usage usually pulls the bill to US$10-20\u002Fmonth on an app with a database and light workers = R$50 to R$110. 
In a bad month, with a traffic spike, it's easy to see the bill near US$30-40 = R$150 to R$220. It's the second standard deviation that scares you.",[12,23727,23728,23730],{},[27,23729,23697],{}," A small team in the experimentation phase, launching dozens of tests per week, where most never get traffic and paying for fixed instances would be wasteful. Genuinely small low-traffic apps where pay-per-use comes out cheaper in the long run. Devs who value a pretty UI and a frictionless deploy flow above rigid budget predictability. Personal projects that run in bursts.",[19,23732,23734],{"id":23733},"flyio-the-edge-multi-region-for-serious-workloads","Fly.io: the \"edge multi-region for serious workloads\"",[12,23736,23737],{},"Fly.io is the most technical of the three. The premise is different at its root: your app doesn't run in one region, it runs in several. When a user in Brazil connects, the request hits the nearest region — and Fly.io has a point of presence in São Paulo, GRU. For the latency perceived by a Brazilian user, that's the most important thing in the PaaS market today.",[12,23739,23740,23742,23743,23746],{},[27,23741,23673],{}," Global application by default. Real VMs (not shared containers) on top of Firecracker. Low-level primitives — you deal with ",[231,23744,23745],{},"fly.toml",", with regional persistent volumes, with a private network between apps via WireGuard, with anycast IPs. More power, more responsibility.",[12,23748,23749,23751],{},[27,23750,23679],{}," The GRU datacenter is a real and measurable advantage: latency to a user in São Paulo drops from 120ms (Render\u002FRailway via the US) to 30-60ms. For an interactive app, that's the difference between \"fast\" and \"instant\". Multi-region is native — you run the same binary on three continents, and routing hits the region closest to the user. Regional persistent volumes allow patterns like Postgres with Litestream replicating to object storage. 
Pricing starts very cheap: US$1.94\u002Fmonth for the smallest VM (shared-cpu-1x, 256MB), plus fractions of a cent per GB transferred.",[12,23753,23754,23756,23757,571,23760,23762],{},[27,23755,23685],{}," A real learning curve. ",[231,23758,23759],{},"flyctl",[231,23761,23745],{},", Machines vs Apps concepts, WireGuard networks — all of that you have to digest. It's not \"open the panel and click deploy\", it's \"read the docs for an afternoon\". Variable pricing like Railway, with the same trap: an app that suddenly grows can triple the cost in a month. The community is smaller than Render\u002FRailway's, especially in PT-BR — finding a recent Brazilian tutorial is a chore. And there were publicly reported stability incidents in 2023-2024 that shook confidence in the platform; things have improved, but the memory still lingers.",[12,23764,23765,23767],{},[27,23766,23691],{}," A small app with the minimum VM and modest storage: US$2-6\u002Fmonth = R$10 to R$32. Postgres run as a dedicated app (you're responsible for backups): plus US$2-10\u002Fmonth = R$10 to R$50. A serious production app with two regions and real bandwidth: US$15-40\u002Fmonth = R$80 to R$220. The low band is unbeatable for a personal project; the high band starts competing with self-hosted on a dedicated VPS.",[12,23769,23770,23772],{},[27,23771,23697],{}," A B2B SaaS with customers distributed across multiple regions — an app that needs to be in US-East and in GRU simultaneously without you operating two clusters. An app where latency to the Brazilian user is a competitive differentiator. A team comfortable with primitives like VMs, mesh networks, regional block storage — typically devs with a DevOps background or those who migrated from bare-metal. 
Personal hobby project where the US$2\u002Fmonth pays for five apps at once.",[19,23774,17370],{"id":17369},[119,23776,23777,23790],{},[122,23778,23779],{},[125,23780,23781,23783,23785,23787],{},[128,23782,2982],{},[128,23784,15014],{},[128,23786,15017],{},[128,23788,23789],{},"Fly.io",[141,23791,23792,23806,23820,23834,23848,23861,23875,23886,23898,23911,23925,23939,23953],{},[125,23793,23794,23797,23800,23803],{},[146,23795,23796],{},"Minimum cost USD\u002Fmonth (real app)",[146,23798,23799],{},"US$7 (Starter)",[146,23801,23802],{},"US$5 (Hobby plan) + usage",[146,23804,23805],{},"US$2-3 (shared-cpu-1x)",[125,23807,23808,23811,23814,23817],{},[146,23809,23810],{},"Minimum cost R$\u002Fmonth (rate R$5.50)",[146,23812,23813],{},"~R$40",[146,23815,23816],{},"~R$30 + variable usage",[146,23818,23819],{},"~R$15",[125,23821,23822,23825,23828,23831],{},[146,23823,23824],{},"Billing model",[146,23826,23827],{},"Fixed instance",[146,23829,23830],{},"Pay-as-you-go",[146,23832,23833],{},"Declared VM + usage",[125,23835,23836,23839,23842,23845],{},[146,23837,23838],{},"Latency to São Paulo",[146,23840,23841],{},"~120ms (Ohio)",[146,23843,23844],{},"~120ms (US)",[146,23846,23847],{},"~30-60ms (GRU)",[125,23849,23850,23853,23856,23858],{},[146,23851,23852],{},"Native multi-region",[146,23854,23855],{},"Not on base plan",[146,23857,3058],{},[146,23859,23860],{},"Yes, central in product",[125,23862,23863,23866,23869,23872],{},[146,23864,23865],{},"Real free tier",[146,23867,23868],{},"Yes, with sleep",[146,23870,23871],{},"Removed in 2023",[146,23873,23874],{},"No, but very low floor",[125,23876,23877,23879,23881,23883],{},[146,23878,14417],{},[146,23880,3058],{},[146,23882,3058],{},[146,23884,23885],{},"Yes (GRU)",[125,23887,23888,23891,23893,23895],{},[146,23889,23890],{},"Preview deploys per PR",[146,23892,3064],{},[146,23894,3064],{},[146,23896,23897],{},"Yes, via CLI",[125,23899,23900,23902,23905,23908],{},[146,23901,14428],{},[146,23903,23904],{},"Yes, 
US$7\u002Fmonth",[146,23906,23907],{},"Yes, via template",[146,23909,23910],{},"Not native (you run)",[125,23912,23913,23916,23919,23922],{},[146,23914,23915],{},"Auto scaling",[146,23917,23918],{},"Manual on Starter",[146,23920,23921],{},"Yes, vertical and horizontal",[146,23923,23924],{},"Yes, via Machines API",[125,23926,23927,23930,23933,23936],{},[146,23928,23929],{},"Platform lock-in",[146,23931,23932],{},"Medium (some proprietary addons)",[146,23934,23935],{},"High (custom templates)",[146,23937,23938],{},"Low (standard Dockerfile)",[125,23940,23941,23944,23947,23950],{},[146,23942,23943],{},"Audience focus",[146,23945,23946],{},"Predictable ex-Heroku",[146,23948,23949],{},"UI-first indie hacker",[146,23951,23952],{},"Edge-first technical dev",[125,23954,23955,23957,23960,23962],{},[146,23956,14464],{},[146,23958,23959],{},"Small, growing",[146,23961,4919],{},[146,23963,23964],{},"Very small",[12,23966,23967],{},"The table hides nuance worth making explicit: none of the three is \"best\" at everything. Render wins on predictability. Railway wins on UX. Fly.io wins on performance for Brazilian user. The choice is a function of what you're optimizing.",[19,23969,23971],{"id":23970},"what-the-three-have-in-common-and-where-heroctl-comes-in","What the three have in common (and where HeroCtl comes in)",[12,23973,23974],{},"Four points the three share, and each one is a reason to eventually leave.",[12,23976,23977,23980],{},[27,23978,23979],{},"First: billing in USD."," None of the three bills in reais. Exchange rate goes up, your bill goes up with it, and you discover in February that the infra that cost R$300 in January costs R$340 with no technical change. For bootstrapped team without exchange rate protection, that's exposure no one asked to have.",[12,23982,23983,23986],{},[27,23984,23985],{},"Second: you don't control the server."," You don't choose the kernel, you don't have SSH access, you don't run background process beyond what the platform allows. 
For ninety percent of cases that's an advantage. For the remaining ten percent — when you need fine-tuning, a custom daemon, some specific OS capability — you're limited.",[12,23988,23989,23992],{},[27,23990,23991],{},"Third: free tiers shrinking year after year."," Heroku killed free in 2022. Railway removed the generous trial in 2023. Render keeps free with sleep but has reduced limits. The pattern is clear: the free tier serves to acquire devs at the zero-revenue stage; when the product grows, the company needs to monetize and the free tier shrinks. It's not betrayal, it's economics. But it implies your long-term strategy can't depend on the free tier continuing as it is.",[12,23994,23995,23998],{},[27,23996,23997],{},"Fourth and most important: when the startup grows, cost scales disproportionately."," A small US$15\u002Fmonth app is negligible. A real app with five services, two databases, Redis, and two environments (staging + prod) starts at US$80-150\u002Fmonth — between R$450 and R$850. Add workers, background jobs, monitoring, and you pass US$200\u002Fmonth easily. That's the point where self-hosted stops being a DevOps hobby and becomes clear savings.",[12,24000,24001],{},"The typical band where the tradeoff turns: when your hosted PaaS bill passes US$50\u002Fmonth of consistent spending, three Hetzner VPSes of US$5 each (~R$80 total) running an orchestration platform start to make financial sense. You trade convenience for savings, and you gain server control as a bonus. HeroCtl is designed for that range: simple installation, real high availability between multiple servers, automatic certificates, web panel. 
No operational ceremony, no SRE team.",[19,24003,24005],{"id":24004},"decision-by-project-stage","Decision by project stage",[12,24007,24008],{},"The right question isn't \"what's the best PaaS\", it's \"what's the best for the stage I'm at now\".",[12,24010,24011,24014],{},[27,24012,24013],{},"Hobby stage, zero revenue."," Render free tier (with sleep), or Fly.io taking advantage of the minimum tier (US$2-3\u002Fmonth covers a personal project), or a US$5\u002Fmonth VPS with Coolify running solo if you enjoy tinkering. Railway lost its spot here after removing the original free tier.",[12,24016,24017,24020],{},[27,24018,24019],{},"Indie hacker stage, up to US$5k MRR."," Render if you prioritize a predictable budget and the app has an \"instance running 24\u002F7 with constant traffic\" profile. Railway if you're experimenting a lot, want a pretty UI, and pay-as-you-go will favor you. Cost in this band stays at US$15-50\u002Fmonth, manageable.",[12,24022,24023,24026],{},[27,24024,24025],{},"Early startup stage, US$5k to US$50k MRR."," Time to evaluate seriously. If latency to the Brazilian user matters (B2C, interactive app), Fly.io becomes a strong candidate because of GRU. If the team already has a dev comfortable with basic infra, simple self-hosted (Coolify on a server, or HeroCtl on three for high availability) starts to pay off. The bill in this band on hosted PaaS runs US$80-300\u002Fmonth — on self-hosted, R$150-400\u002Fmonth with more headroom.",[12,24028,24029,24032],{},[27,24030,24031],{},"Startup stage with first serious B2B customer, SLA required."," Here hosted PaaS starts breaking in other ways: you need a contractual SLA, redundancy across multiple servers, a predictable maintenance window, audit logs. Render and Railway don't offer a strong SLA on the standard plan. Fly.io does, but multi-region implies operational complexity the team will have to learn anyway. 
That's the point where HeroCtl with a three-server cluster comes in as an alternative: real high availability, admin panel, audit, without the US$300+ monthly of hosted PaaS for the same level of robustness. The other option is the opposite path: managed AWS if some customer brings specific compliance requirements.",[19,24034,24036],{"id":24035},"when-not-to-leave-render-railway-or-flyio","When NOT to leave Render, Railway, or Fly.io",[12,24038,24039],{},"Migration is expensive, and most of the time it's premature optimization. Four clear situations to stay where you are.",[12,24041,24042,24045],{},[27,24043,24044],{},"Team of one or two people without any time for operational stuff."," If you're a solo founder on the product and every hour spent on infra is an hour not spent on sales, keeping the PaaS is the right decision. R$200\u002Fmonth of hosted is cheaper than seventeen hours of your month spent tinkering with servers.",[12,24047,24048,24051],{},[27,24049,24050],{},"Platform cost is negligible compared to revenue."," Simple rule: if infra is less than two percent of MRR, don't optimize. A SaaS with US$50k MRR spending US$500\u002Fmonth on hosted is paying one percent on infra. That's excellent. Tinkering there is micro-optimizing to save pocket change.",[12,24053,24054,24057],{},[27,24055,24056],{},"You use a proprietary addon without an easy substitute."," If you depend on some specific Railway addon that has no obvious equivalent outside — some custom template with a proprietary integration, some unique feature — migration isn't just a re-deploy, it's a re-architecture. Evaluate the total cost first.",[12,24059,24060,24063],{},[27,24061,24062],{},"Migration would take more than two weeks and the company can't stop for it."," There are moments when the product is in hyper-growth, or in a critical funding cycle, or simply in a feature sprint that matters to the customer. Don't migrate infra in those windows. 
Note the technical debt and come back later.",[19,24065,3225],{"id":3224},[12,24067,24068,24071],{},[27,24069,24070],{},"Can I use Render for production and Railway for staging?","\nYou can, and people do. The justification is simple: production demands predictability (Render shines there), staging has bursty traffic and ideally should cost little when no one is using it (Railway pay-as-you-go wins). The cost of maintaining two vendors is the mental overhead of two dashboards and two sets of credentials. Makes sense for a disciplined team, hampers a small team that prefers monoculture.",[12,24073,24074,24077],{},[27,24075,24076],{},"Is Fly.io GRU region reliable?","\nToday yes, with an asterisk. It has worked well since 2023, and measured latency to SP\u002FRJ is what it promises (30-60ms typically). The asterisk is that GRU is a smaller region in Fly's portfolio, so capacity can get tight on peaks and new features usually arrive in US-East first. For serious production, it's worth running in at least two regions (GRU + a US region) with failover.",[12,24079,24080,24083,24084,24086],{},[27,24081,24082],{},"How do I migrate from Render to HeroCtl without downtime?","\nPragmatic roadmap: provision three VPSes, install HeroCtl, bring up the application in parallel pointing to a temporary domain. Replicate the database with ",[231,24085,5736],{}," + an initial load, then keep in sync with logical replication until switchover. When ready, change the DNS of the main domain to the new cluster with a low TTL. The risk window stays around the DNS TTL — typically five to ten minutes. Small databases (up to a few GB) migrate in an afternoon; large databases call for a coordinated window.",[12,24088,24089,24092],{},[27,24090,24091],{},"Is managed Postgres on these three good?","\nRender Postgres is decent, with automatic backups and an up-to-date version — equivalent to classic Heroku Postgres. Railway Postgres via template works, but backups are more manual and the default configuration is conservative. 
Fly.io doesn't have its own managed Postgres; you run it as an app, which gives you both control and responsibility. For a team that doesn't want to take care of a database, Render takes this category. For a team that prefers control, Fly.io.",[12,24094,24095,24098],{},[27,24096,24097],{},"Which of these has support in Portuguese?","\nNone officially. Documentation is all in English, support via chat\u002Femail is in English. There are unofficial PT-BR communities on Discord and Twitter for each, but most serious tickets you open in English. For a team uncomfortable with that, it's a factor that carries real weight.",[12,24100,24101,24104],{},[27,24102,24103],{},"And performance for Rails \u002F Django \u002F Node app?","\nAll three PaaSes work well for all three frameworks. Render has a mature Rails buildpack and runs Sidekiq easily. Railway detects Django and Node automatically via templates. Fly.io has specific documentation for Rails and Phoenix, with official deploy guides. The practical difference appears in details: Rails with a heavy asset pipeline builds faster on Render because of its predictable build cache; Node with background workers wins on Railway via pay-as-you-go; any framework wins on Fly.io if the user is in Brazil because of GRU.",[12,24106,24107,24110,24111,24114],{},[27,24108,24109],{},"Do preview deploys per PR work on the three?","\nYes on all three, with nuances. Render has the best implementation: each PR becomes a temporary URL automatically, no extra configuration. Railway offers it via repository integration, with simple configuration. Fly.io does it via CLI (",[231,24112,24113],{},"fly deploy --config preview.toml",") and requires a bit more manual setup or a custom CI workflow. For a team that prioritizes preview deploys as part of code review, Render is the most fluid.",[19,24116,3309],{"id":3308},[12,24118,24119],{},"The choice between Render, Railway, and Fly.io has no universal answer. Render is the predictable conservative. Railway is the experimenter with the pretty UI. 
Fly.io is the technical pick with a real latency advantage for Brazil. The three are honest about what they are, and the three have the same structural destiny: from a certain growth point, cost in USD scales faster than your revenue in reais, and self-hosted starts to make sense.",[12,24121,24122],{},"When that point arrives, consider HeroCtl. Three Linux servers, real high availability between them, automatic certificates, built-in web panel, no operational ceremony. Community plan free forever, Business and Enterprise plans for when the team needs SSO, audit, and SLA support. No retroactive contract changes.",[12,24124,24125],{},"To get started:",[224,24127,24129],{"className":24128,"code":5318,"language":2529},[2527],[231,24130,5318],{"__ignoreMap":229},[12,24132,24133,24134,24136,24137,24140],{},"If you want to read first, two complementary posts: ",[3336,24135,19750],{"href":19749}," covers the general thesis of when to leave hosted PaaS, and ",[3336,24138,24139],{"href":14874},"Brazilian alternatives to Kubernetes"," gets into which orchestrator makes sense for a small team outside the three PaaSes of this article.",[12,24142,24143],{},"The good choice is the one that fits your stage now, not the one that seems sophisticated. Start simple, migrate when the cost justifies it, and never before.",{"title":229,"searchDepth":244,"depth":244,"links":24145},[24146,24147,24148,24149,24150,24151,24152,24153,24154,24155],{"id":23629,"depth":244,"text":23630},{"id":23656,"depth":244,"text":23657},{"id":23701,"depth":244,"text":23702},{"id":23733,"depth":244,"text":23734},{"id":17369,"depth":244,"text":17370},{"id":23970,"depth":244,"text":23971},{"id":24004,"depth":244,"text":24005},{"id":24035,"depth":244,"text":24036},{"id":3224,"depth":244,"text":3225},{"id":3308,"depth":244,"text":3309},"2026-02-12","The three hosted PaaSes Brazilian devs most use to escape Heroku. Each has a different tradeoff. 
Honest analysis of when to stay on each and when to move to self-hosted.",{},"\u002Fen\u002Fblog\u002Frender-vs-railway-vs-fly-io",{"title":23615,"description":24157},{"loc":24159},"en\u002Fblog\u002Frender-vs-railway-vs-fly-io",[24164,24165,24166,24167,8756,14939],"render","railway","fly-io","paas","FDB9uihRYCQkdNYC5q0QV-zsLycvipK1M6KL7eGeBd4",{"id":24170,"title":24171,"author":7,"body":24172,"category":8756,"cover":3379,"date":25317,"description":25318,"draft":3382,"extension":3383,"lastReviewed":3379,"meta":25319,"navigation":411,"path":25320,"readingTime":4401,"seo":25321,"sitemap":25322,"stem":25323,"tags":25324,"__hash__":25328},"blog_en\u002Fen\u002Fblog\u002Fself-hosted-vercel-alternative.md","A self-hosted alternative to Vercel: hosting Next.js without lock-in",{"type":9,"value":24173,"toc":25290},[24174,24177,24180,24184,24187,24219,24222,24225,24229,24233,24236,24239,24242,24245,24249,24252,24255,24265,24276,24293,24304,24308,24311,24314,24338,24341,24345,24348,24352,24355,24361,24367,24373,24376,24380,24383,24386,24396,24399,24402,24406,24409,24412,24418,24421,24424,24428,24431,24648,24651,24655,24658,24664,24670,24676,24682,24685,24689,24692,24696,24734,24737,24741,24755,24799,24806,24915,24918,24922,24925,24964,24967,24971,24974,25010,25014,25017,25024,25027,25031,25034,25065,25068,25072,25075,25096,25099,25103,25106,25111,25131,25136,25152,25158,25161,25163,25169,25178,25192,25202,25224,25239,25245,25247,25250,25253,25256,25272,25275,25284,25287],[12,24175,24176],{},"Vercel is the best DX product out there for Next.js. Automatic build on every push, preview deploys with a unique URL per commit, edge runtime running close to the user, ISR working with one line of config, analytics and Web Vitals integrated without installing a thing. For someone starting a Next.js application solo, it is objectively the best technical choice.",[12,24178,24179],{},"The bill starts to hurt at three predictable points. 
This post maps those points with real numbers, defends Vercel where it gets things right, and shows three exit routes — each with its trade-off. At the end, a step-by-step technical migration and a concrete calculation of how much a Brazilian team saves by leaving.",[19,24181,24183],{"id":24182},"where-vercel-gets-it-right","Where Vercel gets it right",[12,24185,24186],{},"Worth starting with the hard-to-admit part. Vercel solves real problems that other providers don't solve with the same elegance:",[2734,24188,24189,24195,24201,24207,24213],{},[70,24190,24191,24194],{},[27,24192,24193],{},"Zero-config for Next.js."," The Vercel team maintains the framework. Each new release has tested support on the provider before it goes GA. You don't need to configure adapter, runtime, build cache, anything.",[70,24196,24197,24200],{},[27,24198,24199],{},"Preview deploys per commit."," Each pull request opens an isolated public URL with that commit's build. Designer reviews, PM approves, QA tests — without bringing up shared staging.",[70,24202,24203,24206],{},[27,24204,24205],{},"Global edge functions."," Code runs in over 30 regions simultaneously. For a user in São Paulo, cold start latency is lower than many dedicated servers in GRU.",[70,24208,24209,24212],{},[27,24210,24211],{},"ISR and SSG out-of-the-box."," A static page with scheduled revalidation works without configuring an external CDN, without manual invalidation.",[70,24214,24215,24218],{},[27,24216,24217],{},"Native analytics and Web Vitals."," No third-party script to install, no extra weight on the bundle, real Core Web Vitals metrics in production.",[12,24220,24221],{},"For a Brazilian solo dev with a SaaS at US$5k MRR, with a small app and controlled traffic, Vercel costs US$20\u002Fmonth and frees the developer to write product. It is the right choice. 
There's no irony in that sentence.",[12,24223,24224],{},"The problem is what happens after the product grows.",[19,24226,24228],{"id":24227},"the-three-points-where-it-hurts","The three points where it hurts",[368,24230,24232],{"id":24231},"point-1-cost-scaled-in-usd","Point 1 — Cost scaled in USD",[12,24234,24235],{},"The Pro plan costs US$20 per developer per month as a floor. In reais, with the exchange rate around R$5, that becomes roughly R$100 per dev per month. A team of five starts at R$500\u002Fmonth in licenses alone, before any traffic or compute.",[12,24237,24238],{},"From there, cost is by usage. Serverless functions charge per GB-second of execution plus an invocation charge per request. Egress traffic charges per GB. Vercel KV, Vercel Postgres, Vercel Blob — each managed service has its own pricing table, all in USD.",[12,24240,24241],{},"The operational consequence is an unpredictable bill. Seasonal traffic — Black Friday, a feature launch, a press mention — multiplies that month's cost tenfold. In USD, with the exchange rate moving 6% in a bad month, you find out you budgeted at R$5.00 and closed at R$5.30. The math is clear: 10x traffic plus 6% additional FX becomes a bill 10.6x larger in reais.",[12,24243,24244],{},"For a Brazilian startup with revenue in reais and cost in dollars, the margin spread vanishes first. For an agency that bills clients in reais and pays infra in dollars, margin vanishes second.",[368,24246,24248],{"id":24247},"point-2-lock-in-of-primitives","Point 2 — Lock-in of primitives",[12,24250,24251],{},"This is the point nobody sees until they need to leave.",[12,24253,24254],{},"ISR (Incremental Static Regeneration) is a Next.js feature, but Vercel's optimized implementation uses a proprietary CDN with global tag invalidation. Self-hosted, ISR works locally — each node has its own cache copy on disk. 
To invalidate a tag across three nodes you need explicit orchestration.",[12,24256,24257,24258,17327,24261,24264],{},"Edge runtime uses Cloudflare Workers-style primitives — global ",[231,24259,24260],{},"fetch",[231,24262,24263],{},"fs"," access, no native Node modules. Code written for the edge doesn't run directly on traditional Node without refactoring.",[12,24266,24267,24268,24271,24272,24275],{},"Image Optimization runs on Vercel's infra. You ship ",[231,24269,24270],{},"\u003CImage src=\"\u002Ffoo.jpg\" \u002F>"," and the provider delivers a resized WebP, with global cache. Self-hosted, you need to run ",[231,24273,24274],{},"sharp"," at build time, or use a dedicated image proxy, or disable the feature.",[12,24277,24278,24279,2629,24282,24284,24285,24288,24289,24292],{},"Vercel KV, Postgres and Blob are wrappers around Upstash Redis, Neon Postgres and S3-compatible storage with their own SDK. Migrating from ",[231,24280,24281],{},"@vercel\u002Fkv",[231,24283,7362],{}," directly is an afternoon of refactoring. Migrating from ",[231,24286,24287],{},"@vercel\u002Fpostgres"," to a standard client is another afternoon. Migrating from ",[231,24290,24291],{},"@vercel\u002Fblob"," to S3 means revisiting every upload in the app.",[12,24294,24295,24296,24299,24300,24303],{},"None of these barriers is insurmountable. But leaving Vercel isn't ",[231,24297,24298],{},"git remote set-url"," followed by ",[231,24301,24302],{},"vercel logout",". It's a refactoring sprint for a medium-sized app.",[368,24305,24307],{"id":24306},"point-3-bandwidth-and-functions-out-of-control","Point 3 — Bandwidth and functions out of control",[12,24309,24310],{},"Vercel has no hard cap by default on the Pro plan. 
You define budget alerts, but the application keeps serving until the limit is reached — and the limit is configurable upward, not downward.",[12,24312,24313],{},"The three bad scenarios are predictable:",[2734,24315,24316,24322,24328],{},[70,24317,24318,24321],{},[27,24319,24320],{},"Light DDoS."," A thousand requests per second for an hour hit a volume of bandwidth and invocations that routinely exceeds US$200 on a normal plan. Vercel has protection against massive attacks, but the threshold to trigger the defense is high.",[70,24323,24324,24327],{},[27,24325,24326],{},"Viral post."," Your page hits Hacker News or the Reddit front page, and five hundred thousand people access it in a day. The cost falls on you, not on whoever shared it.",[70,24329,24330,24333,24334,24337],{},[27,24331,24332],{},"Bug in a loop."," A function in production with a ",[231,24335,24336],{},"setInterval"," that someone forgot to clear, or a route that calls itself recursively in SSR — discovered on the bill.",[12,24339,24340],{},"You discover the damage when the credit card statement arrives. The appeals path exists, but it is case-by-case and depends on provider goodwill. It is not a contractual cap guarantee.",[19,24342,24344],{"id":24343},"the-three-exit-routes","The three exit routes",[12,24346,24347],{},"Leaving Vercel doesn't mean jumping to Kubernetes. There's a spectrum, and each step trades one thing for another.",[368,24349,24351],{"id":24350},"route-a-hosted-more-predictable","Route A — Hosted, more predictable",[12,24353,24354],{},"Render, Railway and Fly.io occupy this band. You still pay per instance in USD, still trust the provider for availability, still have a web panel and integrated CI\u002FCD. The difference is the billing model.",[12,24356,24357,24360],{},[27,24358,24359],{},"Render."," Fixed price per instance per month. Basic web service US$7\u002Fmonth, larger instance US$25-85\u002Fmonth. Decent latency to São Paulo via Render-Cloud in the eastern US region. 
Has a limited free tier for personal projects. Direct support for Next.js standalone, without a custom adapter.",[12,24362,24363,24366],{},[27,24364,24365],{},"Railway."," Usage-based model (CPU + memory + bandwidth) with configurable cap. Predictable pricing because you set the ceiling. Good for an MVP running cheap, scales up when needed. The console UX is excellent.",[12,24368,24369,24372],{},[27,24370,24371],{},"Fly.io."," Multi-region edge without separate per-function billing. You ship the application and it runs in N regions at the same price. For an app that needs global presence but doesn't want to pay a function table, it is the most obvious choice.",[12,24374,24375],{},"Trade-off of Route A: you still pay in USD, still depend on a single provider for availability, still have to accept their pricing table when it changes. But you left the serverless-per-invocation model and gained cost predictability. For many teams, that's enough.",[368,24377,24379],{"id":24378},"route-b-simple-self-hosted","Route B — Simple self-hosted",[12,24381,24382],{},"Modern orchestration panels became a category in the past two years. Coolify, Dokploy, Kamal — each with its philosophy, all sharing the same model: install the panel on a single server, connect a repo, deploy the application.",[12,24384,24385],{},"The numbers shift regime in this route. A cloud server on Hetzner costs around €5\u002Fmonth — close to R$30. That single server comfortably hosts five medium-sized Next.js applications, plus a small Postgres database, plus a Redis. The DX drops to \"install panel, connect repo, choose domain, deploy.\" There's no global build cache, no automatic preview deploy on every commit (depends on extra configuration), no global edge.",[12,24387,24388,24389,2629,24392,24395],{},"The technical detail that enables the route is Next.js standalone build. 
Adding ",[231,24390,24391],{},"output: 'standalone'",[231,24393,24394],{},"next.config.js",", the build generates a compact Node server with all required dependencies copied in. The resulting Docker image lands around 150 MB. Each instance of the application consumes roughly 100 MB of RAM at idle, scaling with traffic. Five Next.js apps on a 4 GB server have memory to spare.",[12,24397,24398],{},"Trade-off of Route B: you lose global edge. A user in Tokyo accessing your app in SP will feel latency. You lose automatic preview deploy per commit (need to configure manually, or use a panel that supports it). You gain total cost predictability: the server bill is the same every month, regardless of traffic.",[12,24400,24401],{},"For a Brazilian team with Brazilian clients, the loss of global edge is, in practice, irrelevant. Latency São Paulo → São Paulo is lower than latency São Paulo → Vercel-edge-São Paulo, in most measurements.",[368,24403,24405],{"id":24404},"route-c-self-hosted-with-high-availability","Route C — Self-hosted with high availability",[12,24407,24408],{},"This is where HeroCtl lives. The difference from Route B is the kind of guarantee you can give the client.",[12,24410,24411],{},"Simple self-hosted panels are, by construction, single-server. When that server goes down, the client goes down with it. For a personal app or MVP, that's acceptable. For a B2B contract with a written SLA, it isn't.",[12,24413,24414,24415,24417],{},"Route C removes that single point of failure by placing 3 or 4 servers in the same cluster, with the control plane replicated across them. If one server dies, the others keep serving — and the cluster automatically reschedules containers from the dead node onto other healthy nodes. 
New coordinator election takes about 7 seconds after a ",[231,24416,23189],{}," on the leading server.",[12,24419,24420],{},"The integrated router issues Let's Encrypt certificates automatically, performs rolling deploys without a maintenance window, and runs health checks on every container. You don't assemble five products to get ingress + TLS + metrics + logs — it all comes in the same binary.",[12,24422,24423],{},"The application range is specific: when a startup needs a written SLA from a client (usually above US$10k MRR or in a serious B2B contract), Route B starts to get risky. A single server, even a reliable one, is a hard narrative to defend when the client asks \"and what if that server falls?\". Route C solves that without becoming Kubernetes.",[19,24425,24427],{"id":24426},"side-by-side","Side by side",[12,24429,24430],{},"The table is the honest version of the decision. There's no column without caveats.",[119,24432,24433,24449],{},[122,24434,24435],{},[125,24436,24437,24439,24441,24443,24445,24447],{},[128,24438,2982],{},[128,24440,15020],{},[128,24442,15014],{},[128,24444,15017],{},[128,24446,2770],{},[128,24448,2994],{},[141,24450,24451,24471,24487,24503,24520,24537,24554,24568,24582,24601,24615,24631],{},[125,24452,24453,24456,24459,24462,24465,24468],{},[146,24454,24455],{},"Minimum BRL\u002Fmonth cost",[146,24457,24458],{},"~R$100\u002Fdev",[146,24460,24461],{},"~R$35",[146,24463,24464],{},"~R$25",[146,24466,24467],{},"~R$30 (1 VPS)",[146,24469,24470],{},"~R$120 (3-4 VPS)",[125,24472,24473,24476,24478,24480,24483,24485],{},[146,24474,24475],{},"Predictable cost",[146,24477,3058],{},[146,24479,3064],{},[146,24481,24482],{},"Yes (with cap)",[146,24484,3064],{},[146,24486,3064],{},[125,24488,24489,24492,24495,24497,24499,24501],{},[146,24490,24491],{},"Global edge",[146,24493,24494],{},"Yes (30+ 
regions)",[146,24496,3058],{},[146,24498,3058],{},[146,24500,3058],{},[146,24502,3058],{},[125,24504,24505,24508,24511,24514,24516,24518],{},[146,24506,24507],{},"Next.js ISR",[146,24509,24510],{},"Native, optimized",[146,24512,24513],{},"Works locally",[146,24515,24513],{},[146,24517,24513],{},[146,24519,24513],{},[125,24521,24522,24525,24528,24531,24533,24535],{},[146,24523,24524],{},"Image Optimization",[146,24526,24527],{},"Hosted",[146,24529,24530],{},"Build\u002Fproxy",[146,24532,24530],{},[146,24534,24530],{},[146,24536,24530],{},[125,24538,24539,24542,24545,24548,24550,24552],{},[146,24540,24541],{},"Preview deploys",[146,24543,24544],{},"Automatic per commit",[146,24546,24547],{},"Manual\u002Fbranch",[146,24549,24547],{},[146,24551,3077],{},[146,24553,3077],{},[125,24555,24556,24558,24560,24562,24564,24566],{},[146,24557,16369],{},[146,24559,3064],{},[146,24561,3064],{},[146,24563,3064],{},[146,24565,3064],{},[146,24567,3064],{},[125,24569,24570,24572,24574,24576,24578,24580],{},[146,24571,7102],{},[146,24573,3064],{},[146,24575,3061],{},[146,24577,3061],{},[146,24579,3058],{},[146,24581,7016],{},[125,24583,24584,24587,24590,24593,24596,24598],{},[146,24585,24586],{},"Contractual SLA",[146,24588,24589],{},"99.99% (Enterprise)",[146,24591,24592],{},"99.95%",[146,24594,24595],{},"99.9%",[146,24597,11991],{},[146,24599,24600],{},"Configurable (you operate)",[125,24602,24603,24605,24607,24609,24611,24613],{},[146,24604,16324],{},[146,24606,3064],{},[146,24608,3064],{},[146,24610,3064],{},[146,24612,3058],{},[146,24614,3064],{},[125,24616,24617,24620,24622,24624,24626,24628],{},[146,24618,24619],{},"Support in PT",[146,24621,3058],{},[146,24623,3058],{},[146,24625,3058],{},[146,24627,4351],{},[146,24629,24630],{},"Yes (Business)",[125,24632,24633,24636,24639,24641,24643,24646],{},[146,24634,24635],{},"Lock-in of primitives",[146,24637,24638],{},"High 
(KV\u002FPostgres\u002FBlob\u002FEdge)",[146,24640,3154],{},[146,24642,3154],{},[146,24644,24645],{},"None",[146,24647,24645],{},[12,24649,24650],{},"The column that matters changes by stage. Solo dev looks at the first row. A growing team looks at \"predictable cost\". A startup with a B2B client looks at \"contractual SLA\" and \"real high availability\".",[19,24652,24654],{"id":24653},"when-to-stay-on-vercel","When to stay on Vercel",[12,24656,24657],{},"Honesty is the defense mechanism of any comparison. Four scenarios where leaving is a loss:",[12,24659,24660,24663],{},[27,24661,24662],{},"Solo dev running a small SaaS in USD with healthy revenue."," If the app already bills in dollars and revenue passes US$30k MRR, US$100-300\u002Fmonth of Vercel is accounting noise. The time spent migrating is worth more than the savings.",[12,24665,24666,24669],{},[27,24667,24668],{},"Low-complexity Next.js marketing site."," Static page with a contact form. Vercel does it for free on the Hobby plan, and the free tier has no hard limit for that traffic profile. Switching to self-hosted is moving a problem instead of solving it.",[12,24671,24672,24675],{},[27,24673,24674],{},"Small team without anyone to look after infra, with revenue justifying it."," Vercel is, ultimately, outsourced operations. If your margin supports the price, and your only senior dev needs to be writing product, keeping Vercel is a time-allocation decision, not a technology one.",[12,24677,24678,24681],{},[27,24679,24680],{},"Global edge critical for UX."," Application with users on three continents where sub-50ms latency globally is part of the product. Self-hosted with global presence is expensive and operationally complicated. Vercel solves it.",[12,24683,24684],{},"If you are in any of these four profiles, close this tab and go back to code. 
The rest of the post isn't for you yet.",[19,24686,24688],{"id":24687},"technical-migration-from-vercel-to-self-hosted","Technical migration from Vercel to self-hosted",[12,24690,24691],{},"For those who decided to leave, the path has seven steps. Each takes between an afternoon and two days, depending on the size of the app.",[368,24693,24695],{"id":24694},"_1-inventory","1. Inventory",[12,24697,24698,24699,24702,24703,6562,24705,571,24707,571,24709,571,24711,571,24714,24717,24718,24721,24722,24725,24726,24729,24730,24733],{},"Before moving anything, map what's in use. List of environment variables in the Vercel project — copy everything into a versioned ",[231,24700,24701],{},".env.example"," file. List of Vercel-only dependencies that appear in ",[231,24704,8501],{},[231,24706,24281],{},[231,24708,24287],{},[231,24710,24291],{},[231,24712,24713],{},"@vercel\u002Fanalytics",[231,24715,24716],{},"@vercel\u002Fspeed-insights",". List of Next.js features that depend on a specific runtime: ISR (search for ",[231,24719,24720],{},"revalidate"," in the code), middleware (does ",[231,24723,24724],{},"middleware.ts"," exist at root?), edge runtime (",[231,24727,24728],{},"export const runtime = 'edge'","), Image Optimization (",[231,24731,24732],{},"\u003CImage \u002F>"," on how many routes?).",[12,24735,24736],{},"The inventory changes nothing. But it decides the order of the next steps.",[368,24738,24740],{"id":24739},"_2-standalone-build","2. Standalone build",[12,24742,2661,24743,2629,24745,24747,24748,24751,24752,622],{},[231,24744,24391],{},[231,24746,24394],{},". 
This mode makes the build copy to ",[231,24749,24750],{},".next\u002Fstandalone\u002F"," only the production dependencies actually used, plus a minimal Node server (",[231,24753,24754],{},"server.js",[224,24756,24758],{"className":374,"code":24757,"language":376,"meta":229,"style":229},"\u002F\u002F next.config.js\nmodule.exports = {\n  output: 'standalone',\n  \u002F\u002F other options\n}\n",[231,24759,24760,24765,24779,24790,24795],{"__ignoreMap":229},[234,24761,24762],{"class":236,"line":237},[234,24763,24764],{"class":240},"\u002F\u002F next.config.js\n",[234,24766,24767,24770,24772,24775,24777],{"class":236,"line":244},[234,24768,24769],{"class":251},"module",[234,24771,101],{"class":387},[234,24773,24774],{"class":251},"exports",[234,24776,424],{"class":383},[234,24778,505],{"class":387},[234,24780,24781,24784,24787],{"class":236,"line":271},[234,24782,24783],{"class":387},"  output: ",[234,24785,24786],{"class":255},"'standalone'",[234,24788,24789],{"class":387},",\n",[234,24791,24792],{"class":236,"line":415},[234,24793,24794],{"class":240},"  \u002F\u002F other options\n",[234,24796,24797],{"class":236,"line":434},[234,24798,1362],{"class":387},[12,24800,24801,24802,24805],{},"Local build with ",[231,24803,24804],{},"next build"," produces a folder of about 150 MB. The Dockerfile is short:",[224,24807,24809],{"className":20735,"code":24808,"language":20737,"meta":229,"style":229},"FROM node:20-alpine AS builder\nWORKDIR \u002Fapp\nCOPY package.json pnpm-lock.yaml .\u002F\nRUN corepack enable && pnpm install --frozen-lockfile\nCOPY . 
.\nRUN pnpm build\n\nFROM node:20-alpine\nWORKDIR \u002Fapp\nCOPY --from=builder \u002Fapp\u002F.next\u002Fstandalone .\u002F\nCOPY --from=builder \u002Fapp\u002F.next\u002Fstatic .\u002F.next\u002Fstatic\nCOPY --from=builder \u002Fapp\u002Fpublic .\u002Fpublic\nEXPOSE 3000\nCMD [\"node\", \"server.js\"]\n",[231,24810,24811,24822,24828,24835,24842,24848,24855,24859,24866,24872,24879,24886,24893,24899],{"__ignoreMap":229},[234,24812,24813,24815,24818,24820],{"class":236,"line":237},[234,24814,20744],{"class":383},[234,24816,24817],{"class":387}," node:20-alpine ",[234,24819,20750],{"class":383},[234,24821,20753],{"class":387},[234,24823,24824,24826],{"class":236,"line":244},[234,24825,20758],{"class":383},[234,24827,20761],{"class":387},[234,24829,24830,24832],{"class":236,"line":271},[234,24831,20766],{"class":383},[234,24833,24834],{"class":387}," package.json pnpm-lock.yaml .\u002F\n",[234,24836,24837,24839],{"class":236,"line":415},[234,24838,20774],{"class":383},[234,24840,24841],{"class":387}," corepack enable && pnpm install --frozen-lockfile\n",[234,24843,24844,24846],{"class":236,"line":434},[234,24845,20766],{"class":383},[234,24847,20784],{"class":387},[234,24849,24850,24852],{"class":236,"line":459},[234,24851,20774],{"class":383},[234,24853,24854],{"class":387}," pnpm build\n",[234,24856,24857],{"class":236,"line":464},[234,24858,412],{"emptyLinePlaceholder":411},[234,24860,24861,24863],{"class":236,"line":479},[234,24862,20744],{"class":383},[234,24864,24865],{"class":387}," node:20-alpine\n",[234,24867,24868,24870],{"class":236,"line":484},[234,24869,20758],{"class":383},[234,24871,20761],{"class":387},[234,24873,24874,24876],{"class":236,"line":490},[234,24875,20766],{"class":383},[234,24877,24878],{"class":387}," --from=builder \u002Fapp\u002F.next\u002Fstandalone .\u002F\n",[234,24880,24881,24883],{"class":236,"line":508},[234,24882,20766],{"class":383},[234,24884,24885],{"class":387}," --from=builder \u002Fapp\u002F.next\u002Fstatic 
.\u002F.next\u002Fstatic\n",[234,24887,24888,24890],{"class":236,"line":529},[234,24889,20766],{"class":383},[234,24891,24892],{"class":387}," --from=builder \u002Fapp\u002Fpublic .\u002Fpublic\n",[234,24894,24895,24897],{"class":236,"line":535},[234,24896,20839],{"class":383},[234,24898,20842],{"class":387},[234,24900,24901,24903,24905,24908,24910,24913],{"class":236,"line":546},[234,24902,20847],{"class":383},[234,24904,20850],{"class":387},[234,24906,24907],{"class":255},"\"node\"",[234,24909,571],{"class":387},[234,24911,24912],{"class":255},"\"server.js\"",[234,24914,9527],{"class":387},[12,24916,24917],{},"Final image around 180 MB. Runs the same in any environment that supports containers.",[368,24919,24921],{"id":24920},"_3-storage-substitution","3. Storage substitution",[12,24923,24924],{},"Each Vercel managed service has a direct alternative:",[2734,24926,24927,24938,24952],{},[70,24928,24929,24932,24933,2629,24935,24937],{},[27,24930,24931],{},"Vercel KV → Redis."," You bring up a Redis in the cluster (HeroCtl runs it as a regular job) or use hosted Upstash. Client switches from ",[231,24934,24281],{},[231,24936,7362],{},". The API is similar; the adapter can be hidden behind a function.",[70,24939,24940,24943,24944,2629,24946,5839,24949,101],{},[27,24941,24942],{},"Vercel Postgres → Postgres."," Postgres in the cluster (regular job) or hosted Supabase\u002FNeon. Migration scripts stay the same. Client switches from ",[231,24945,24287],{},[231,24947,24948],{},"pg",[231,24950,24951],{},"postgres.js",[70,24953,24954,24957,24958,2629,24960,24963],{},[27,24955,24956],{},"Vercel Blob → S3-compatible."," Cloudflare R2 (no egress charge), Backblaze B2, or MinIO in the cluster itself. 
Client switches from ",[231,24959,24291],{},[231,24961,24962],{},"@aws-sdk\u002Fclient-s3"," pointing to the custom endpoint.",[12,24965,24966],{},"General rule: do the substitution on a separate branch, with integration tests running against the new service, before touching production.",[368,24968,24970],{"id":24969},"_4-image-optimization","4. Image Optimization",[12,24972,24973],{},"Three paths, pick one:",[2734,24975,24976,24987,24995],{},[70,24977,24978,24983,24984,24986],{},[27,24979,24980,24982],{},[231,24981,24274],{}," directly on the server."," Next.js detects ",[231,24985,24274],{}," installed and uses it for local Image Optimization. Works, but consumes CPU from the same process serving the application.",[70,24988,24989,24994],{},[27,24990,24991,101],{},[231,24992,24993],{},"next-image-export-optimizer"," Pre-optimizes all images at build time. Good for a blog or site with static images. Unfeasible for an app with user upload.",[70,24996,24997,2577,25000,5839,25003,25006,25007,25009],{},[27,24998,24999],{},"Dedicated image proxy.",[231,25001,25002],{},"imgproxy",[231,25004,25005],{},"imageflow"," running as a separate service. The ",[231,25008,24732],{}," URL points to that proxy. Solves any use case, costs one extra job in the cluster.",[368,25011,25013],{"id":25012},"_5-isr","5. ISR",[12,25015,25016],{},"Self-hosted, ISR works — Next.js standalone implements the local cache on disk. The fragile point is multi-region invalidation.",[12,25018,25019,25020,25023],{},"A 3-node cluster means 3 disk cache copies, each with its own expiration. For a blog or site whose content changes a few times a day, that's acceptable: a few-second inconsistency between nodes is invisible to the user. For an e-commerce dashboard with prices updating every minute, you need coordinated invalidation — usually via a webhook calling ",[231,25021,25022],{},"revalidatePath"," on all nodes simultaneously.",[12,25025,25026],{},"Most cases fall in the first profile. 
It doesn't become the problem it seems at first glance.",[368,25028,25030],{"id":25029},"_6-cicd","6. CI\u002FCD",[12,25032,25033],{},"Replace Vercel auto-deploy with your own pipeline:",[2734,25035,25036,25049,25055],{},[70,25037,25038,25041,25042,571,25045,25048],{},[27,25039,25040],{},"Build:"," GitHub Actions (or GitLab CI, or Jenkins) runs on each push. ",[231,25043,25044],{},"pnpm install",[231,25046,25047],{},"pnpm build",", generates Docker image.",[70,25050,25051,25054],{},[27,25052,25053],{},"Push:"," image registry (ECR, Docker Hub, GHCR). Tag by commit SHA or date.",[70,25056,25057,25060,25061,25064],{},[27,25058,25059],{},"Deploy:"," API call against the orchestrator (",[231,25062,25063],{},"heroctl deploy job.json"," or equivalent). Rolling update without downtime.",[12,25066,25067],{},"Pipeline time for a medium app sits around 4-6 minutes. Vercel does it in 2-3 minutes. The difference is real, but not catastrophic.",[368,25069,25071],{"id":25070},"_7-cutover","7. Cutover",[12,25073,25074],{},"Last step, and the most delicate:",[2734,25076,25077,25084,25087,25090,25093],{},[70,25078,25079,25080,25083],{},"Bring up the self-hosted version pointing to a temporary domain (",[231,25081,25082],{},"new.yourapp.com",", for example).",[70,25085,25086],{},"Run in parallel for 7 days. Internal users test. Canary traffic directed by flag.",[70,25088,25089],{},"Compare metrics: error rate, p95 latency, projected infra cost.",[70,25091,25092],{},"If parity is OK, switch the main DNS to point to the new backend. Low TTL (60s) helps with quick rollback.",[70,25094,25095],{},"Keep Vercel on for another 7 days. Only deactivate the project after confirming nobody is on the old DNS.",[12,25097,25098],{},"Total migration for a medium app takes 2-3 weeks with a dedicated dev. For a small app, one week. 
For a giant Next.js monolith with 50 routes and complex middleware, a quarter.",[19,25100,25102],{"id":25101},"concrete-calculation-for-a-brazilian-team","Concrete calculation for a Brazilian team",[12,25104,25105],{},"Number to close the argument. Five Brazilian devs with a medium-sized Next.js app (50 routes, Postgres database, image storage, traffic of 2 million requests\u002Fmonth).",[12,25107,25108],{},[27,25109,25110],{},"Vercel scenario:",[2734,25112,25113,25116,25119,25122,25125],{},[70,25114,25115],{},"5 × Pro (US$20\u002Fdev\u002Fmonth) = US$100\u002Fmonth",[70,25117,25118],{},"Bandwidth and function invocations (estimate with given traffic): US$50-200\u002Fmonth",[70,25120,25121],{},"Vercel Postgres (small production instance): US$30\u002Fmonth",[70,25123,25124],{},"Vercel Blob (50 GB stored, 100 GB transfer): US$20\u002Fmonth",[70,25126,25127,25130],{},[27,25128,25129],{},"Total: US$200-400\u002Fmonth = R$1,000 to R$2,000\u002Fmonth"," at the current FX.",[12,25132,25133],{},[27,25134,25135],{},"HeroCtl Community scenario on 4 Hetzner servers:",[2734,25137,25138,25141,25144,25147],{},[70,25139,25140],{},"4 × CX22 (€5.18\u002Fmonth each) = €21\u002Fmonth",[70,25142,25143],{},"Cloudflare R2 (50 GB stored, no egress charge): ~€5\u002Fmonth",[70,25145,25146],{},"Postgres running as a job in the cluster itself: zero additional",[70,25148,25149,25130],{},[27,25150,25151],{},"Total: €25-30\u002Fmonth = R$150-180\u002Fmonth",[12,25153,25154,25157],{},[27,25155,25156],{},"Difference: R$850 to R$1,850\u002Fmonth."," Over 12 months, R$10,000 to R$22,000 in savings. Equivalent to one month of mid-level developer salary in the intermediate band of the Brazilian market.",[12,25159,25160],{},"The savings pay for a migration done in four weeks in the first year, and remain available as operational margin in the years that follow. Over 36 months, R$30k-66k of difference. 
It is a budget line that deserves to show up in the finance meeting.",[19,25162,7347],{"id":7346},[12,25164,25165,25168],{},[27,25166,25167],{},"Does HeroCtl run Next.js directly?","\nYes. Standalone build generates a Docker image, and HeroCtl orchestrates any image. There's no custom adapter, no specific template — the Dockerfile shown above works without modification.",[12,25170,25171,25174,25175,25177],{},[27,25172,25173],{},"And ISR without global CDN?","\nWorks locally on each node of the cluster. A 3-node cluster means 3 independent caches with their own expiration. For coordinated multi-node invalidation, you use ",[231,25176,25022],{}," called via webhook on all nodes. For most cases (blog, institutional site, dashboard with revalidation every minute), the transient inconsistency is invisible.",[12,25179,25180,25183,25184,25187,25188,25191],{},[27,25181,25182],{},"How do I do preview deploys?","\nHeroCtl doesn't have a native automatic preview deploy per commit, but it supports multiple versions of the same job running side by side. Common setup: the pipeline creates a job with the branch suffix (",[231,25185,25186],{},"my-app-feature-x","), with a temporary domain (",[231,25189,25190],{},"feature-x.preview.yourapp.com","), automatic TLS by the integrated router. When the branch is merged and deleted, the job is torn down. If you want exactly Vercel's DX, you can rebuild it in 100-200 lines of pipeline.",[12,25193,25194,25197,25198,25201],{},[27,25195,25196],{},"Do edge functions survive?","\nEdge functions use Cloudflare Workers-style primitives and don't run on traditional Node. Self-hosted, you convert them to normal server-side routes (",[231,25199,25200],{},"export const runtime = 'nodejs'",") or split into their own services. 
Refactoring is per file, usually between 10 minutes and 2 hours depending on the code.",[12,25203,25204,25207,25209,25210,25213,25214,1523,25216,25218,25219,21681,25221,25223],{},[27,25205,25206],{},"What if I use Vercel Postgres?",[231,25208,24287],{}," is a wrapper around Neon Postgres. You switch to ",[231,25211,25212],{},"@neondatabase\u002Fserverless"," (keeping Neon hosted), or ",[231,25215,24948],{},[231,25217,24951],{}," pointing to a Postgres in the cluster. Schema migrates directly via ",[231,25220,5736],{},[231,25222,21196],{},". For 95% of apps, it is an afternoon of work.",[12,25225,25226,25229,25230,25232,25233,25235,25236,25238],{},[27,25227,25228],{},"Is there a substitute for Vercel Image Optimization?","\nThree options: ",[231,25231,24274],{}," directly on the server (works, consumes local CPU), ",[231,25234,24993],{}," at build (good for static images), dedicated proxy like ",[231,25237,25002],{}," running as a separate service (handles any case). For an app with user upload, the third option is the best choice.",[12,25240,25241,25244],{},[27,25242,25243],{},"How long does migration take for a medium app?","\nA Next.js application with 50 routes, Postgres, storage and middleware: 2-3 weeks with a dedicated dev following this post's step-by-step. Small application (10-15 routes, no managed storage): one week. Giant monolith with complex middleware and strong dependence on edge runtime: a full quarter.",[19,25246,3309],{"id":3308},[12,25248,25249],{},"Vercel is a good choice. For many cases, it is the right choice. The point of this post is not \"Vercel is bad\" — it is \"Vercel is not the only choice\". Most Brazilian teams looking at the monthly bill and sighing aren't looking elsewhere because their product is worse. They are looking because the savings in reais, at company scale, are large enough to pay for a calm migration with cash to spare.",[12,25251,25252],{},"The choice between the three routes depends on where you are. 
Render and Railway solve the predictability problem without changing the operational model much. Coolify and Dokploy solve cost radically, in exchange for a single server. HeroCtl solves cost and keeps real high availability, in exchange for operating 3-4 servers.",[12,25254,25255],{},"If you want to test Route C through the shortest path:",[224,25257,25258],{"className":226,"code":5318,"language":228,"meta":229,"style":229},[231,25259,25260],{"__ignoreMap":229},[234,25261,25262,25264,25266,25268,25270],{"class":236,"line":237},[234,25263,1220],{"class":247},[234,25265,2957],{"class":251},[234,25267,5329],{"class":255},[234,25269,2963],{"class":383},[234,25271,2966],{"class":247},[12,25273,25274],{},"Bring up 3 small servers, install on each, point the domain. Bring up the Next.js application as a job. Verify that the integrated router issued a certificate, that the rolling deploy worked, that killing a server didn't take the site down. Then decide if the savings are worth it.",[12,25276,25277,25278,25280,25281,25283],{},"For more reading: ",[3336,25279,21724],{"href":6545}," explains the general thesis behind the product, and ",[3336,25282,19750],{"href":19749}," covers the adjacent use range — when you want Heroku-like DX running on your infra, without needing HeroCtl's HA level.",[12,25285,25286],{},"The intent, as always, is the same: container orchestration, without ceremony.",[3350,25288,25289],{},"html pre.shiki code .sH3jZ, html code.shiki .sH3jZ{--shiki-default:#8B949E}html pre.shiki code .sFSAA, html code.shiki .sFSAA{--shiki-default:#79C0FF}html pre.shiki code .sZEs4, html code.shiki .sZEs4{--shiki-default:#E6EDF3}html pre.shiki code .suJrU, html code.shiki .suJrU{--shiki-default:#FF7B72}html pre.shiki code .s9uIt, html code.shiki .s9uIt{--shiki-default:#A5D6FF}html .default .shiki span {color: var(--shiki-default);background: var(--shiki-default-bg);font-style: var(--shiki-default-font-style);font-weight: 
var(--shiki-default-font-weight);text-decoration: var(--shiki-default-text-decoration);}html .shiki span {color: var(--shiki-default);background: var(--shiki-default-bg);font-style: var(--shiki-default-font-style);font-weight: var(--shiki-default-font-weight);text-decoration: var(--shiki-default-text-decoration);}html pre.shiki code .sQhOw, html code.shiki .sQhOw{--shiki-default:#FFA657}",{"title":229,"searchDepth":244,"depth":244,"links":25291},[25292,25293,25298,25303,25304,25305,25314,25315,25316],{"id":24182,"depth":244,"text":24183},{"id":24227,"depth":244,"text":24228,"children":25294},[25295,25296,25297],{"id":24231,"depth":271,"text":24232},{"id":24247,"depth":271,"text":24248},{"id":24306,"depth":271,"text":24307},{"id":24343,"depth":244,"text":24344,"children":25299},[25300,25301,25302],{"id":24350,"depth":271,"text":24351},{"id":24378,"depth":271,"text":24379},{"id":24404,"depth":271,"text":24405},{"id":24426,"depth":244,"text":24427},{"id":24653,"depth":244,"text":24654},{"id":24687,"depth":244,"text":24688,"children":25306},[25307,25308,25309,25310,25311,25312,25313],{"id":24694,"depth":271,"text":24695},{"id":24739,"depth":271,"text":24740},{"id":24920,"depth":271,"text":24921},{"id":24969,"depth":271,"text":24970},{"id":25012,"depth":271,"text":25013},{"id":25029,"depth":271,"text":25030},{"id":25070,"depth":271,"text":25071},{"id":25101,"depth":244,"text":25102},{"id":7346,"depth":244,"text":7347},{"id":3308,"depth":244,"text":3309},"2026-02-04","Vercel charges in USD, scales serverless cost per request, and pulls you into its primitives. For Brazilian teams, the bill turns ugly fast. 
How to run Next.js elsewhere.",{},"\u002Fen\u002Fblog\u002Fself-hosted-vercel-alternative",{"title":24171,"description":25318},{"loc":25320},"en\u002Fblog\u002Fself-hosted-vercel-alternative",[25325,25326,7507,8756,25327],"vercel","next-js","lock-in","5OLinVuLSTvNUtES7uRenVP71Dob-Vr9jU8QXwltovY",{"id":25330,"title":25331,"author":7,"body":25332,"category":8756,"cover":3379,"date":25872,"description":25873,"draft":3382,"extension":3383,"lastReviewed":3379,"meta":25874,"navigation":411,"path":25875,"readingTime":4401,"seo":25876,"sitemap":25877,"stem":25878,"tags":25879,"__hash__":25884},"blog_en\u002Fen\u002Fblog\u002Fheroctl-vs-kamal.md","HeroCtl vs Kamal: when you need more than one server",{"type":9,"value":25333,"toc":25855},[25334,25337,25343,25346,25350,25353,25356,25363,25369,25372,25376,25379,25399,25402,25405,25409,25412,25416,25419,25422,25425,25429,25435,25438,25452,25455,25459,25462,25465,25468,25472,25475,25478,25485,25489,25492,25512,25515,25518,25522,25525,25528,25531,25534,25537,25539,25652,25655,25659,25662,25668,25678,25684,25690,25694,25697,25707,25719,25732,25740,25767,25776,25778,25788,25794,25800,25806,25818,25827,25833,25835,25838,25841,25846,25849],[12,25335,25336],{},"In 2023 37signals published numbers that didn't match the industry's narrative. The company left the public cloud to host its own products on dedicated servers, projected approximately three million dollars in annual savings, and open-sourced the tooling it used to make that migration. That tooling got a name — Kamal — and quickly became a reference for a kind of team that was tired of the complexity of planetary orchestrators.",[12,25338,25339,25340,101],{},"DHH, partner at 37signals, popularized the thesis around the tool with a short phrase: \"you don't need orchestration\". The argument is elegant, factual, and — for most teams it describes — true. Rails app deploy had become a master's degree in distributed systems without anyone asking for it. 
Kamal is the legitimate response to a legitimate frustration, and occupies its own niche within the ",[3336,25341,25342],{"href":19749},"self-hosted Heroku segment in 2026",[12,25344,25345],{},"This post isn't about taking down that thesis. It's about the exact moment it stops describing your reality. Somewhere between the first and the tenth serious customer, the phrase changes from \"you don't need orchestration\" to \"you started needing it and didn't notice yet\". The symptom is usually an email at three in the morning.",[19,25347,25349],{"id":25348},"the-kamal-philosophy-as-it-deserves-to-be-described","The Kamal philosophy, as it deserves to be described",[12,25351,25352],{},"Before any criticism, worth recording what Kamal gets right — because it gets a lot right.",[12,25354,25355],{},"You have an application inside a Dockerfile. You have a list of servers where you want it to run. Kamal takes that list, connects to each server via SSH, pulls the new image, swaps the old container for the new, and updates the internal router. There's no persistent control plane. There's no resident agent on each server with its own process. There's no state database outside the set of servers you already had.",[12,25357,25358,25359,25362],{},"The configuration lives in a single ",[231,25360,25361],{},"deploy.yml"," file with maybe forty lines. You run a command, and Kamal does the work in parallel on the hosts. When it finishes, it disappears — there's no service listening on port 8080 waiting for future commands. It's deploy as an SSH transaction with checkpoints.",[12,25364,25365,25366,25368],{},"That minimization of external state is the central virtue. For teams of one to three people running a monolithic application on a single server, it's hard to beat. ",[3336,25367,2770],{"href":16689}," and the other modern panels ask for their own database, their own agent, their own web interface. Kamal asks for two ingredients: SSH and Docker. 
You already had both.",[12,25370,25371],{},"We estimate that seventy-five percent of web app teams in Brazil in 2026 never need more than that. If you're one of them, close this post and install Kamal. Seriously.",[19,25373,25375],{"id":25374},"when-you-dont-need-orchestration-is-true","When \"you don't need orchestration\" is true",[12,25377,25378],{},"Some markers are honest. If your operation fits in this list, Kamal beats any comparison you can make against it:",[2734,25380,25381,25384,25387,25390,25393,25396],{},[70,25382,25383],{},"A single server with resource headroom",[70,25385,25386],{},"Monolithic application, no internal dependencies in private network between services",[70,25388,25389],{},"Predictable traffic, no spikes that demand emergency horizontal scaling",[70,25391,25392],{},"Occasional deploy window, with up to thirty seconds of degradation acceptable",[70,25394,25395],{},"End customer who doesn't charge a formal availability contract",[70,25397,25398],{},"One to three people taking care of infra, half-time",[12,25400,25401],{},"For this profile, any tool with a persistent control plane is unjustified overhead. Kamal is, literally, a configuration file plus a client binary. There's nothing to maintain on the server beyond the Docker you'd install anyway.",[12,25403,25404],{},"The elegance of this approach is such that many teams with a profile slightly larger than the one described above remain happy — and they're right. Migrating to an orchestrator before the pain appears is infrastructure gold-plating. The right question isn't \"do I have a cluster?\", it's \"what pain can I no longer ignore?\".",[19,25406,25408],{"id":25407},"when-you-dont-need-orchestration-starts-to-hurt","When \"you don't need orchestration\" starts to hurt",[12,25410,25411],{},"The pain doesn't arrive all at once. 
It arrives in four stages, generally in this order.",[368,25413,25415],{"id":25414},"first-stage-the-customer-demands-an-sla","First stage: the customer demands an SLA",[12,25417,25418],{},"At some point a service contract enters with an availability clause. The most common numbers are 99% (3.65 days of permitted downtime per year), 99.9% (8.7 hours), and 99.95% (4.4 hours). The customer wants to see the number in the contract.",[12,25420,25421],{},"Here Kamal on a single server no longer fits. Not because Kamal is bad — it's because a single server never gives 99.9% without external maneuver. Cloud provider maintenance, disk failure, kernel update: each of these events fully consumes your annual downtime budget. In 2026 no serious provider guarantees 99.9% uptime for an individual instance.",[12,25423,25424],{},"The natural response is \"I'll put two servers\". And that's where Kamal starts to need external scaffolding.",[368,25426,25428],{"id":25427},"second-stage-two-servers-and-the-cluster-illusion","Second stage: two servers and the cluster illusion",[12,25430,25431,25432,25434],{},"Kamal accepts a two-IP list in ",[231,25433,25361],{}," without complaint. But what it does with that list isn't orchestration — it's repetition. The two servers are parallel deploy destinations, not members of a cluster.",[12,25436,25437],{},"Concretely: if one of the two falls, Kamal doesn't reallocate traffic. Kamal has no opinion on traffic between deploys. You need to set up, on the side:",[2734,25439,25440,25443,25446,25449],{},[70,25441,25442],{},"A load balancer (cloud provider or self-hosted)",[70,25444,25445],{},"Periodic health check that takes the inactive server out of the balance",[70,25447,25448],{},"DNS resolution configured on top of the balancer, not the direct servers",[70,25450,25451],{},"Some notification mechanism when something goes off",[12,25453,25454],{},"That's four new products you start operating. 
Each has its own panel, its own account, its own bill, its own way to break at three a.m. The tooling that was a configuration file became a diagram.",[368,25456,25458],{"id":25457},"third-stage-real-rolling-deploy-gets-fragile","Third stage: real rolling deploy gets fragile",[12,25460,25461],{},"With two servers, Kamal offers sequential deploy: first one, then the other. Pretty in description. The problem lives in the error case.",[12,25463,25464],{},"Imagine the deploy on the first server succeeds, the second hangs midway. Kamal has no consolidated view of the state of the two servers after the error. You're left with a new version running on one side, old version running on the other, and no centralized system to reconcile the divergence. Reconciliation becomes manual intervention: you open both servers, decide which version stays, do rollback or advance what failed.",[12,25466,25467],{},"For a three-person team, intervening manually once a quarter is tolerable. For a team that ships four deploys a week across three different applications, manual divergence becomes the main job of one of the three people. That was exactly the trigger that motivated orchestrators being born in the 2010s.",[368,25469,25471],{"id":25470},"fourth-stage-encryption-and-routing-between-services","Fourth stage: encryption and routing between services",[12,25473,25474],{},"Sooner or later the application grows and gains a second service — maybe a queue worker, maybe a separate image processing service, maybe an auxiliary API consumed by the main frontend. These services need to talk to each other, ideally with encrypted traffic and control over who can call whom.",[12,25476,25477],{},"Kamal has no opinion on this. It's your job to set up — usually with external proxy, manually issued certificates, hand-written firewall rules. On one server it was trivial (everyone is localhost). 
On three servers running six services, it becomes a one-week side project.",[12,25479,25480,25481,25484],{},"The router embedded in Kamal (",[231,25482,25483],{},"kamal-proxy",") elegantly solves the inbound HTTP traffic part — TLS termination, atomic switch between versions, headers under control. But traffic between services, on a private network, is your problem.",[19,25486,25488],{"id":25487},"the-necessary-leap","The necessary leap",[12,25490,25491],{},"Looking at the four points above together, it's clear that to cover them you need, in practice:",[67,25493,25494,25497,25500,25503,25506,25509],{},[70,25495,25496],{},"A replicated control plane between multiple servers — so the fall of one doesn't take down the deploy capability",[70,25498,25499],{},"Automatic election of the coordinating server, without human intervention",[70,25501,25502],{},"Integrated balancing and health check, without depending on the provider's external balancer",[70,25504,25505],{},"Consolidated state of services, with automatic reconciliation when something diverges",[70,25507,25508],{},"Integrated router with automatic certificates",[70,25510,25511],{},"Encryption between services embedded, without setting up another product",[12,25513,25514],{},"Many teams' temptation, on reaching that list, is to fall into the planetary orchestrator and end the matter. Then they install the colossus, write three hundred lines of manifest to bring up what was forty lines in Kamal, hire a senior-paid specialist just to keep that breathing, and discover they swapped the right problem for a bigger problem.",[12,25516,25517],{},"The reasonable answer is a tool that offers the six items above without asking for a dedicated team. That's exactly where HeroCtl lives.",[19,25519,25521],{"id":25520},"heroctl-as-kamal-with-a-real-cluster","HeroCtl as Kamal with a real cluster",[12,25523,25524],{},"The conceptual simplicity is the same. 
You describe the service in a short configuration file — around fifty lines for a complete application with routing rules, secrets, and automatic certificate. Submit the service through the command-line client or the embedded web panel. The cluster decides where to run.",[12,25526,25527],{},"The difference is under the hood. Instead of transactional SSH, HeroCtl maintains a control plane replicated between three servers. Those three talk to each other all the time to keep a consolidated view of everything that's running, on which server, in which version, in which health. When the coordinating server falls, in around seven seconds one of the other two takes over — without human intervention, without a pager alert for anyone. The application's containers that were running on the fallen server are reallocated on the survivors.",[12,25529,25530],{},"The daily operation looks exactly like Kamal: you change the image version, submit, the cluster orchestrates the substitution. The difference appears in the bad cases. If the partial deploy fails midway, reconciliation is automatic — the desired state is recorded in the control plane, and the agents on each server converge to it without you opening a terminal. If a server falls during deploy, the others take over its work. If the port that was going to be used is stuck by a zombie container, the cluster waits or redirects — it doesn't fail the deploy.",[12,25532,25533],{},"And the embedded tooling covers the other items on the list: router with automatic Let's Encrypt certificates, encryption between services without setting up anything externally, web panel to see what's running, centralized metrics and logs without an external stack.",[12,25535,25536],{},"The installation is the same gesture Kamal asks for: Linux server with Docker, single setup command. There's no new infrastructure requirement. There's no external database. There's no managed cloud service. 
The cluster lives on the servers you already have.",[19,25538,23214],{"id":23213},[119,25540,25541,25551],{},[122,25542,25543],{},[125,25544,25545,25547,25549],{},[128,25546,2982],{},[128,25548,2997],{},[128,25550,2994],{},[141,25552,25553,25564,25575,25585,25594,25603,25612,25622,25632,25642],{},[125,25554,25555,25558,25561],{},[146,25556,25557],{},"Philosophy",[146,25559,25560],{},"Minimalist SSH deploy, no external state",[146,25562,25563],{},"Cluster with replicated control plane",[125,25565,25566,25569,25572],{},[146,25567,25568],{},"Ideal server range",[146,25570,25571],{},"1 (excellent), 2-3 (with external scaffolding)",[146,25573,25574],{},"3 to 500",[125,25576,25577,25579,25582],{},[146,25578,16324],{},[146,25580,25581],{},"Not native — requires external balancer",[146,25583,25584],{},"Embedded; automatic election in ~7s",[125,25586,25587,25590,25592],{},[146,25588,25589],{},"Web panel",[146,25591,3058],{},[146,25593,23272],{},[125,25595,25596,25598,25601],{},[146,25597,3923],{},[146,25599,25600],{},"Yes, via embedded router",[146,25602,25600],{},[125,25604,25605,25607,25610],{},[146,25606,23287],{},[146,25608,25609],{},"Not native",[146,25611,23272],{},[125,25613,25614,25617,25619],{},[146,25615,25616],{},"Persistent metrics",[146,25618,25609],{},[146,25620,25621],{},"Internal job",[125,25623,25624,25626,25629],{},[146,25625,22779],{},[146,25627,25628],{},"Not native — collect externally",[146,25630,25631],{},"Embedded single writer",[125,25633,25634,25637,25640],{},[146,25635,25636],{},"Automatic reconciliation after partial failure",[146,25638,25639],{},"No — requires manual intervention",[146,25641,3064],{},[125,25643,25644,25646,25649],{},[146,25645,22811],{},[146,25647,25648],{},"Open, no associated paid product",[146,25650,25651],{},"Permanently free plan + paid Business\u002FEnterprise",[12,25653,25654],{},"Kamal's column isn't punishment — it's honesty. The product was designed for a specific use case and meets it elegantly. 
When your use case goes beyond what it covers, the right answer is to swap tools, not force Kamal to become what it never wanted to be.",[19,25656,25658],{"id":25657},"stay-on-kamal-if","Stay on Kamal if...",[12,25660,25661],{},"The virtue of a specialized tool is admitting where it wins. Here are four profiles where we recommend Kamal without hesitation, even though HeroCtl exists.",[12,25663,25664,25667],{},[27,25665,25666],{},"You run a single server, no formal SLA pressure."," The conceptual cost of a cluster — understanding quorum, replicated control plane, election — is unjustified for a single server. Kamal gives you ninety-nine percent of what matters with five percent of the concept.",[12,25669,25670,25673,25674,25677],{},[27,25671,25672],{},"You're a one- to three-person team with strong Rails culture and no time to spare for learning new tooling."," Kamal is, in practice, a natural extension of ",[231,25675,25676],{},"bin\u002Frails",". Adopting it costs an afternoon. Adopting anything else costs a week — and that week, in your context, is worth more than the operational improvement that would come later.",[12,25679,25680,25683],{},[27,25681,25682],{},"Your applications are internal or staging, where five minutes of monthly downtime cause no real damage."," Kamal's operation is so direct that it remains the best choice even for large teams that have a set of secondary applications with high failure tolerance.",[12,25685,25686,25689],{},[27,25687,25688],{},"You're DHH."," With sincere respect. The thesis that orchestration is overkill was defended by someone who operates public products with millions of users and proves daily that you can do without a cluster, with just well-configured dedicated servers. If you're on a team where this philosophy is part of the identity, Kamal isn't just a tool — it's a statement. 
There's no technical reason to abandon it.",[19,25691,25693],{"id":25692},"migrating-from-kamal-to-heroctl-when-it-makes-sense","Migrating from Kamal to HeroCtl when it makes sense",[12,25695,25696],{},"For those who reached the moment where the pain described above became routine, migration is lighter than it seems, because most of the work Kamal already did remains valid.",[12,25698,352,25699,25702,25703,25706],{},[27,25700,25701],{},"Dockerfile that Kamal uses to package the application serves with no change",". HeroCtl consumes the same image that Kamal pushes to the registry. There's no need to adapt ",[231,25704,25705],{},"ENTRYPOINT",", expected environment variables, exposed ports — the container's contract is preserved.",[12,25708,25709,25712,25713,25715,25716,25718],{},[27,25710,25711],{},"Environment variables migrate with the same keys."," HeroCtl has its own secrets system, but the variable names the application consumes remain the same. You import the contents of the ",[231,25714,9367],{}," that was in ",[231,25717,25361],{}," directly into the cluster's internal vault, and the application doesn't notice the swap.",[12,25720,25721,25724,25725,25728,25729,25731],{},[27,25722,25723],{},"Named volumes are kept",", because Docker is Docker. If you had a volume called ",[231,25726,25727],{},"app_storage"," in Kamal to persist uploads, in HeroCtl it continues to be called ",[231,25730,25727],{}," and lives in the same place on the server. The difference is that the cluster knows where it is and respects that pinning when deciding where to run the application.",[12,25733,25734,25739],{},[27,25735,352,25736,25738],{},[231,25737,25361],{}," doesn't convert one-to-one to HeroCtl's job spec",", but the conceptual mapping is almost mechanical:",[2734,25741,25742,25749,25755,25761],{},[70,25743,25744,25745,25748],{},"Kamal's ",[231,25746,25747],{},"servers"," block becomes the notion of cluster with N nodes in HeroCtl. 
You don't list IPs in the application file — the cluster already knows itself.",[70,25750,25744,25751,25754],{},[231,25752,25753],{},"proxy"," block becomes the integrated routing configuration. Instead of describing the proxy explicitly, you describe the domain name and the rules; the embedded router applies.",[70,25756,352,25757,25760],{},[231,25758,25759],{},"accessories"," block (Postgres, Redis, and similar alongside the application) becomes auxiliary jobs in the same cluster, managed like any other service.",[70,25762,352,25763,25766],{},[231,25764,25765],{},"env"," block becomes HeroCtl's secrets system, with the same keys.",[12,25768,25769,25772,25773,25775],{},[27,25770,25771],{},"Honest estimate:"," one to three hours for a medium-complexity application, counting time reading documentation, converting the file, first validation deploy. Applications with many ",[231,25774,25759],{}," or unusual routing rules can reach half a day. Above that, write to us — we have an experimental converter that covers the common cases.",[19,25777,7347],{"id":7346},[12,25779,25780,25787],{},[27,25781,25782,25783,25786],{},"Does HeroCtl have the equivalent of ",[231,25784,25785],{},"kamal accessories","?","\nYes. In HeroCtl you describe Postgres, Redis, or any other auxiliary service as an ordinary job, managed by the same cluster that runs the main application. The practical difference is that those auxiliary services get the same treatment as the application: high availability when it makes sense, automatic reconciliation, certificates if they're publicly exposed, metrics and logs in the central panel. You don't operate a separate set of \"things on the side\".",[12,25789,25790,25793],{},[27,25791,25792],{},"What if I use Kamal for a small app and HeroCtl for a larger one, can I?","\nYes. Both worlds coexist without conflict. 
We even recommend this model during migration: you keep what works on Kamal exactly as it is, and use HeroCtl for the applications where the real cluster matters. The one caveat is not to run Kamal and HeroCtl on the same server — they compete for Docker resources and ports, and the operational gain isn't worth the confusion. Different servers, different worlds.",[12,25795,25796,25799],{},[27,25797,25798],{},"Does HeroCtl run on any Linux VPS like Kamal?","\nYes. The premise is the same: a Linux server with Docker. Major cloud provider, smaller provider, dedicated server, virtual machine in the office — wherever Kamal works, HeroCtl works. There's no managed private network requirement, no external provider balancer requirement, no special filesystem requirement. The minimum condition is three servers that can see each other on the network and have Docker installed.",[12,25801,25802,25805],{},[27,25803,25804],{},"How much more does it consume than Kamal?","\nThe control plane uses between 200 and 400 MB per server. On three servers of modest configuration, that's less than five percent of total available memory. Compared to Kamal — which consumes zero because it has no resident process — it's more. Compared to the colossus, whose managed version starts at around 700 MB per master node before any application comes up, it's half. The right question isn't \"how much more\", it's \"how much does that give me in return\". For our public demonstration cluster, which runs four servers, sixteen containers, and five sites on five vCPUs and ten gigabytes of RAM in total, the answer is: real high availability, without needing to double the hardware.",[12,25807,25808,25814,25815,25817],{},[27,25809,25810,25811,25813],{},"And ",[231,25812,25483],{},", which is the integrated router underneath?","\nHeroCtl has its own integrated router, with atomic switching between versions, automatic TLS termination via Let's Encrypt, and routing by domain name. 
The concept is the same as ",[231,25816,25483],{},". The difference is that HeroCtl's router knows the cluster state — it knows when an entire server has fallen, when a container is in rolling, when a new version hasn't yet passed the health check. An isolated router does routing; a router integrated with the control plane does informed routing.",[12,25819,25820,25823,25824,25826],{},[27,25821,25822],{},"Do I need to learn a new language to use HeroCtl?","\nNo. The configuration file is plain text, similar to Kamal's ",[231,25825,25361],{}," in structure and size. The command-line commands follow the verb-noun pattern that anyone who has run Docker knows. The web panel covers ninety percent of operations without you needing to open a terminal. There's no templating language, there's no parallel package system, there's no ceremonial manifest. The internal rule is the same as Kamal's: the configuration has to fit on one screen.",[12,25828,25829,25832],{},[27,25830,25831],{},"When isn't HeroCtl the right answer?","\nWhen you operate a single server with no SLA pressure, Kamal is better — and we said this above. When you operate hundreds of thousands of machines at planetary scale, the colossus is better — and we said this in another post. When your compliance officer needs to point to an existing certificate with a product name on the list, today the answer is the colossus or a mature commercial orchestrator. In these three conditions, HeroCtl doesn't compete well and we don't try to force it. For the range between three and five hundred servers, with real availability pressure and scarce time to set up a stack, that's where the tool was designed to shine.",[19,25834,3309],{"id":3308},[12,25836,25837],{},"The right question isn't Kamal or HeroCtl. It's: has your second serious customer already shown up? If not yet, stay on Kamal and be happy — anything different from that is distraction. 
If it has, and the sleepless night over a downed server has already happened at least once, the answer starts to tip.",[12,25839,25840],{},"The test path is a single command:",[224,25842,25844],{"className":25843,"code":5318,"language":2529},[2527],[231,25845,5318],{"__ignoreMap":229},[12,25847,25848],{},"Run it on three servers, bring up the application, kill one of them by force, watch the cluster reallocate. If the feeling of relief is proportional to the pain of your last incident, it's settled. If it's not, go back to Kamal without guilt — it means your moment hasn't come yet, and respecting that is as important as adopting the right tool when the moment comes.",[12,25850,25851,25852,101],{},"The longer story of why we built this, and the honest reading of the three paths that existed before, is in ",[3336,25853,25854],{"href":6545},"Why we created HeroCtl",{"title":229,"searchDepth":244,"depth":244,"links":25856},[25857,25858,25859,25865,25866,25867,25868,25869,25870,25871],{"id":25348,"depth":244,"text":25349},{"id":25374,"depth":244,"text":25375},{"id":25407,"depth":244,"text":25408,"children":25860},[25861,25862,25863,25864],{"id":25414,"depth":271,"text":25415},{"id":25427,"depth":271,"text":25428},{"id":25457,"depth":271,"text":25458},{"id":25470,"depth":271,"text":25471},{"id":25487,"depth":244,"text":25488},{"id":25520,"depth":244,"text":25521},{"id":23213,"depth":244,"text":23214},{"id":25657,"depth":244,"text":25658},{"id":25692,"depth":244,"text":25693},{"id":7346,"depth":244,"text":7347},{"id":3308,"depth":244,"text":3309},"2026-01-29","Kamal is brilliant for a VPS running Rails. When the second serious customer asks for redundancy, the architecture has to change. 
Where Kamal stops and HeroCtl starts.",{},"\u002Fen\u002Fblog\u002Fheroctl-vs-kamal",{"title":25331,"description":25873},{"loc":25875},"en\u002Fblog\u002Fheroctl-vs-kamal",[25880,25881,25882,25883,8756],"kamal","rails","multi-server","37signals","c_D2DYqgrWi1-zyby-NfK6tHcI6S_5Wewiirv1bTqDs",{"id":25886,"title":25887,"author":7,"body":25888,"category":8756,"cover":3379,"date":26398,"description":26399,"draft":3382,"extension":3383,"lastReviewed":3379,"meta":26400,"navigation":411,"path":16694,"readingTime":26401,"seo":26402,"sitemap":26403,"stem":26404,"tags":26405,"__hash__":26408},"blog_en\u002Fen\u002Fblog\u002Fheroctl-vs-dokploy.md","HeroCtl vs Dokploy: an honest comparison",{"type":9,"value":25889,"toc":26385},[25890,25896,25899,25902,25906,25909,25915,25921,25927,25933,25939,25942,25946,25953,25964,25967,25969,25972,25975,25978,25982,25985,25988,25991,25995,25998,26007,26013,26019,26025,26031,26036,26045,26047,26050,26197,26200,26204,26207,26210,26213,26219,26222,26225,26229,26232,26235,26238,26241,26244,26247,26251,26254,26257,26260,26263,26266,26270,26273,26276,26279,26286,26292,26295,26298,26300,26306,26312,26318,26324,26330,26336,26341,26343,26346,26349,26358,26361,26377,26380,26383],[12,25891,25892,25893,25895],{},"In DevOps sub-Reddits in 2026, among the self-hosted tools that appeared after the ",[3336,25894,2770],{"href":16689}," wave, the most-discussed name is Dokploy. Clean panel. Fast install. A community that grew at a speed most orchestration projects of the last decade haven't seen.",[12,25897,25898],{},"And a technical decision that defines everything else: Dokploy runs Docker Swarm as the cluster engine. It's not a detail — it's the foundation. Everything good it delivers comes from that choice, and so does everything that limits it.",[12,25900,25901],{},"This text is the honest reading of that choice, and the point-by-point comparison with the path HeroCtl took. No attack on Dokploy, no rose-tinting of HeroCtl. 
Both products solve similar problems with different philosophies, and the decision between them comes down to more than panel preference.",[19,25903,25905],{"id":25904},"where-dokploy-got-it-right","Where Dokploy got it right",[12,25907,25908],{},"Before the contrast, the credit. Teams evaluating Dokploy today find a product that does five things right and does them well.",[12,25910,25911,25914],{},[27,25912,25913],{},"The UX is cohesive."," The panel has its own visual identity, linear flows, and the buttons do what they promise. For those coming from Heroku or the old commercial panel that got expensive in recent years, Dokploy gives off the familiar feel of \"open, click, deploy\". It's the kind of polish that only appears after many iterations on top of real user feedback.",[12,25916,25917,25920],{},[27,25918,25919],{},"Installation is fast."," One command, five minutes, panel up. For individual projects and small teams, that minimal friction is what separates \"I'll try it\" from \"I'll give up and stick with the expensive cloud\".",[12,25922,25923,25926],{},[27,25924,25925],{},"Multi-server is out of the box."," You add a second and a third server through the panel and Swarm underneath takes care of distributing containers. For someone who had never thought about high availability, becoming HA by configuration is a big step up from panels that only run on a single server.",[12,25928,25929,25932],{},[27,25930,25931],{},"The integrated router works."," There's an embedded router that terminates TLS, issues Let's Encrypt certificates automatically, and forwards traffic to containers. You don't need to set up and maintain a separate reverse proxy, write virtual host configuration by hand, or memorize certificate renewal flags.",[12,25934,25935,25938],{},[27,25936,25937],{},"Good support for common stacks."," Node, Django, Rails apps, projects with a plain Dockerfile, projects via docker compose — everything runs without adjustments. 
The community publishes one-click plugins for the most common databases, queue tools, basic observability.",[12,25940,25941],{},"That list is honest. Whoever says Dokploy is \"just another wrapper\" isn't looking carefully. It's a serious product, with traction, made by people who listened to users.",[19,25943,25945],{"id":25944},"the-fundamental-technical-choice-docker-swarm","The fundamental technical choice — Docker Swarm",[12,25947,25948,25949,25952],{},"The point that changes the conversation is the engine. Dokploy didn't invent its own cluster control plane — it consumes the Swarm that already comes inside Docker. The panel talks to the Swarm API, which coordinates the agents on each server. When you add a server through the panel, it's a ",[231,25950,25951],{},"swarm join"," underneath.",[12,25954,25955,25956,25959,25960,25963],{},"That decision has real virtues. Swarm has been stable for nearly a decade. The API is consistent with the ",[231,25957,25958],{},"docker compose"," that most teams already know — declaring services has the same shape in both. Leader election is embedded, comes free with ",[231,25961,25962],{},"swarm init",". And because it's part of Docker, any server already with Docker installed can join the cluster with one command.",[12,25965,25966],{},"The problem is what happened to Swarm since 2019. Docker Inc. decided that year to focus almost all orchestration engineering on other products of the company, and Swarm entered what internal communications at the time described as \"maintenance mode\". Practical translation: security fixes continue, bug-fix releases continue, but new features stopped. There's no public roadmap of evolution. No one from the Docker team has presented at a conference, in the last five years, scheduling improvements, network improvements, encryption between services, integration with new runtimes.",[12,25968,23068],{},[12,25970,25971],{},"Swarm isn't abandoned — it's in stasis. 
And stasis has a compound cost.",[12,25973,25974],{},"When an overlay network between nodes has an edge case on specific cloud providers, that case sits waiting for someone outside to propose a patch. When a new container runtime pattern emerges — lightweight runtimes, confidential containers, isolation improvements — Swarm doesn't absorb it. When you want encryption between services that goes beyond Swarm's optional encrypted overlay, the answer is \"set up a separate product on top\".",[12,25976,25977],{},"Dokploy inherits this profile. Each Swarm limitation becomes a Dokploy limitation with no evolution path of its own. It's not Dokploy's fault — it's the mathematical consequence of building on top of an engine that stopped evolving.",[19,25979,25981],{"id":25980},"heroctls-technical-choice","HeroCtl's technical choice",[12,25983,25984],{},"HeroCtl made the opposite decision: build the control plane from scratch, without depending on Swarm or on the popular orchestration colossus. It's not \"not invented here\" purism. It's the realization that orchestration in the \"1 to 500 server\" range is a different problem from what both Swarm and the colossus solve, and neither is going to refocus on that niche.",[12,25986,25987],{},"The practical consequence is roadmap freedom. When the team decides that encryption between services needs to be native — not a plugin, not an optional overlay, not an external operator — it just implements it. When we decided that persistent metrics should run as an internal job of the cluster itself, without setting up three external products, it was an architecture decision with no conditions attached. When we needed to optimize the deploy path so that a thousand containers enter rotation in a few minutes, there was no waiting in line behind Docker Inc. or the foundation that maintains the colossus.",[12,25989,25990],{},"The honest counterpart: HeroCtl is newer. It has less mileage than Swarm and a smaller community. The one-click plugin pile is shorter. 
That's the trade-off of building the entire control plane instead of consuming a ready-made one. The first six months of closed production — four servers, five total vCPUs, ten gigabytes of RAM, sixteen active containers, five sites with automatic TLS — showed the core holds up. The panel is still in visual catch-up.",[19,25992,25994],{"id":25993},"operational-comparison","Operational comparison",[12,25996,25997],{},"Side by side, in matters that matter to whoever's going to operate.",[12,25999,26000,26003,26004,26006],{},[27,26001,26002],{},"Installation."," Dokploy: five minutes to panel up — installs Docker if it's not there, does ",[231,26005,25962],{},", brings up the panel. HeroCtl: five minutes too — downloads an executable file, registers the service, brings up the agent. Operational tie.",[12,26008,26009,26012],{},[27,26010,26011],{},"Real multi-server."," Dokploy depends on Swarm's control plane — three coordinating servers or more so the loss of one doesn't bring down the cluster. HeroCtl has its own replicated control plane, also in three servers or more. Both deliver real HA. The difference is who you're depending on: the current state of Swarm or the current state of HeroCtl.",[12,26014,26015,26018],{},[27,26016,26017],{},"Panel."," Both have one. In 2026, Dokploy's is more visually polished — more years of iteration and a larger community giving feedback. HeroCtl's covers the same use cases (deploy, metrics, logs, cluster topology, certificates, secrets, audit) but in aesthetic catch-up. Product honesty: if the panel's aesthetic is decisive in the decision and weighs more than architecture, Dokploy wins today.",[12,26020,26021,26024],{},[27,26022,26023],{},"Plugins and marketplace."," Dokploy has more one-click plugins today — Postgres, Redis, MinIO, observability. HeroCtl runs any container as a job, with the same uniform interface; what's missing is the \"click and have it\" showcase. 
For those who prefer to describe a job in a configuration file and version it in the repository, HeroCtl gets to the same place. For those who prefer click and have, Dokploy gets there faster.",[12,26026,26027,26030],{},[27,26028,26029],{},"Metrics and logs."," Dokploy: stack via external plugins — Prometheus, Grafana, Loki or similar set up separately. HeroCtl: metrics and logs as internal jobs of the cluster itself, with embedded single writer. The difference is less about who delivers better data, more about how many products you're maintaining. Small teams usually value the shorter pile; teams with SRE usually prefer the flexibility of plugging in the stack they already know.",[12,26032,26033,26035],{},[27,26034,4787],{}," Dokploy doesn't bring it by default. Swarm has overlay network with optional encryption, but Dokploy doesn't promote that path as first-class. For real mutual encryption between all services, it's one more product on top. HeroCtl brings it embedded — all communication between cluster containers is encrypted by default, with automatic PKI, no external operator. It's the difference between \"has option\" and \"comes turned on\".",[12,26037,26038,26041,26042,26044],{},[27,26039,26040],{},"Control plane observability."," Dokploy: you inspect Swarm state via ",[231,26043,1118],{}," commands on the server plus panel for the app. HeroCtl: uniform control plane API exposes job, agent, election, certificate state, in own endpoints — audited.",[19,26046,18680],{"id":18679},[12,26048,26049],{},"The honest version of the decision. 
As always, every orchestrator is a set of tradeoffs.",[119,26051,26052,26062],{},[122,26053,26054],{},[125,26055,26056,26058,26060],{},[128,26057,2982],{},[128,26059,2776],{},[128,26061,2994],{},[141,26063,26064,26075,26085,26095,26105,26115,26125,26136,26147,26157,26165,26176,26186],{},[125,26065,26066,26069,26072],{},[146,26067,26068],{},"Cluster engine",[146,26070,26071],{},"Docker Swarm (in maintenance since 2019)",[146,26073,26074],{},"Own control plane, active evolution",[125,26076,26077,26080,26083],{},[146,26078,26079],{},"Installation",[146,26081,26082],{},"~5 min",[146,26084,26082],{},[125,26086,26087,26089,26092],{},[146,26088,16324],{},[146,26090,26091],{},"Yes (3+ Swarm coordinators)",[146,26093,26094],{},"Yes (3+ servers with replicated control plane)",[125,26096,26097,26099,26102],{},[146,26098,25589],{},[146,26100,26101],{},"Yes, more polished in 2026",[146,26103,26104],{},"Yes, in aesthetic catch-up",[125,26106,26107,26110,26113],{},[146,26108,26109],{},"Router + automatic TLS",[146,26111,26112],{},"Embedded (reverse proxy underneath)",[146,26114,23272],{},[125,26116,26117,26119,26122],{},[146,26118,23287],{},[146,26120,26121],{},"Optional via encrypted overlay",[146,26123,26124],{},"Embedded and default",[125,26126,26127,26130,26133],{},[146,26128,26129],{},"Metrics \u002F logs",[146,26131,26132],{},"External plugins",[146,26134,26135],{},"Internal cluster jobs",[125,26137,26138,26141,26144],{},[146,26139,26140],{},"Plugin marketplace",[146,26142,26143],{},"More mature",[146,26145,26146],{},"Shorter, any container as job",[125,26148,26149,26151,26154],{},[146,26150,22811],{},[146,26152,26153],{},"Open source",[146,26155,26156],{},"Permanently free Community + Business + Enterprise",[125,26158,26159,26161,26163],{},[146,26160,19679],{},[146,26162,3061],{},[146,26164,24630],{},[125,26166,26167,26170,26173],{},[146,26168,26169],{},"Source-code escrow",[146,26171,26172],{},"Not applicable",[146,26174,26175],{},"Yes 
(Enterprise)",[125,26177,26178,26180,26183],{},[146,26179,5013],{},[146,26181,26182],{},"1–50 servers",[146,26184,26185],{},"1–500 servers",[125,26187,26188,26191,26194],{},[146,26189,26190],{},"Orchestration roadmap",[146,26192,26193],{},"Conditioned on Swarm",[146,26195,26196],{},"Independent",[12,26198,26199],{},"The column that drives the decision the most is the first. Everything else derives from it.",[19,26201,26203],{"id":26202},"when-dokploy-is-the-right-choice","When Dokploy is the right choice",[12,26205,26206],{},"Honesty demands the section. There are scenarios where recommending Dokploy is the correct answer.",[12,26208,26209],{},"You like Docker Swarm and it meets what you need today. Typical web apps, database managed outside the cluster, low requirement of internal encryption, small team that prefers predictability over platform evolution. Swarm will hold this for years. Building on top of it means building on top of something tested and stable, even if stopped.",[12,26211,26212],{},"The most polished panel in the self-hosted segment matters a lot to your team. If the visual interface is part of the internal sale to the rest of the company, if your CTO will show it to the CEO and the impression matters, Dokploy comes out ahead in 2026. HeroCtl is closing this distance, but hasn't closed it yet.",[12,26214,26215,26216,26218],{},"You already deploy via ",[231,26217,25958],{}," and want minimum migration path. Dokploy accepts compose with little friction. Bringing an entire team accustomed to compose to a new job spec model is an organizational cost that not every project justifies.",[12,26220,26221],{},"You want an active community of one-click plugins. If your flow is \"I need Postgres with replication, click, done\", Dokploy delivers that today. 
HeroCtl delivers the same Postgres, but you describe the job and version it in the repository.",[12,26223,26224],{},"You don't have formal requirements for native encryption between services or detailed audit. For a five-person team with a SaaS that hasn't yet sold to its first Enterprise customer, Dokploy is answer enough. Migrating later is work — but it's work you may never need to do.",[19,26226,26228],{"id":26227},"when-heroctl-is-the-right-choice","When HeroCtl is the right choice",[12,26230,26231],{},"The symmetric profiles.",[12,26233,26234],{},"You want a control plane that evolves with its own decisions. For projects where the execution platform is part of the product — not just incidental infra — depending on an engine in maintenance is a strategic risk. Building on something that evolves is different from building on something that is merely maintained.",[12,26236,26237],{},"Encryption between services needs to be native. If your architecture has dozens of services talking to each other, sensitive data moving between them, and you don't want to set up a separate service mesh nor trust \"cloud provider private network\" as the only layer, it makes a difference to have encryption by default.",[12,26239,26240],{},"You need detailed audit. Whoever signs Business knows who did what and when — who promoted a job version, who ran which administrative command, who rotated which secret. For teams with growing compliance requirements, this isn't optional.",[12,26242,26243],{},"You want source-code escrow as continuity insurance. Enterprise includes a contract with a third-party custodian: if the company behind HeroCtl ceases operations, the code is delivered to paying customers with an internal continuity license. For organizations that can't afford the \"what if the company folds\" scenario, that structure is what unlocks procurement approval.",[12,26245,26246],{},"You run in the 3-to-500-server range with formal requirements. 
It's the range where neither Swarm in maintenance nor the colossus designed for tens of thousands of machines serves well. It's exactly where HeroCtl aims.",[19,26248,26250],{"id":26249},"the-swarm-in-production-question","The Swarm in production question",[12,26252,26253],{},"Worth being fair here — and being specific.",[12,26255,26256],{},"Swarm remains stable. For the absolute majority of use cases, it will deliver what it promises for a few more years. Typical web apps, microservices of medium complexity, rolling deploys, healthchecks, basic service discovery — all of this runs. Stories of Swarm clusters running in production for five or six years without a serious incident are plentiful.",[12,26258,26259],{},"The point isn't \"Swarm will break\". It's \"Swarm won't improve\". Building on top of it in 2026 is tacitly accepting that the set of capabilities it has today is the set you'll have forever. For projects where this is an acceptable trade-off, no problem. For projects where encryption between services, native observability, integration with new runtimes, or scheduler extensibility will matter in the next three years, it's worth considering a stack that keeps evolving.",[12,26261,26262],{},"There's another angle. When Swarm has an edge case — overlay network with intermittent loss on specific cloud providers, strange scheduling when a node returns after a long partition, unexpected health check behavior on slow-starting containers — these cases are now debugged by the outside community. Docker Inc. isn't on call. Fixing it becomes your team's project. In HeroCtl, these cases are handled by the team that wrote the code — you open a report, we ship a fix. It's a different support model because engineering investment continues to happen.",[12,26264,26265],{},"It isn't an ideological argument that \"new code is better\". Swarm is stable precisely because it's old. 
The argument is practical: the evolution of the product you'll use in the next five years depends on who's investing. In Dokploy, part of the evolution depends on people who stopped touching Swarm in 2019. In HeroCtl, it depends on people touching the control plane today.",[19,26267,26269],{"id":26268},"migration-between-them","Migration between them",[12,26271,26272],{},"Conceptual path, not recipe. Each project has its own details — write to us if you want specific help.",[12,26274,26275],{},"Docker images serve in both. The Dockerfile you use in Dokploy is the same one you use in HeroCtl. There's no special build, no customized runtime.",[12,26277,26278],{},"Environment variables migrate with the same keys. Where you have an env vars block in Dokploy, you have an equivalent block in the HeroCtl job spec. The names don't change.",[12,26280,26281,26282,26285],{},"Named volumes are kept. Volume mounted in ",[231,26283,26284],{},"\u002Fvar\u002Flib\u002Fpostgresql\u002Fdata"," continues mounted there. The concept of persistent volume between restarts is the same.",[12,26287,352,26288,26291],{},[231,26289,26290],{},"compose"," that Dokploy accepts doesn't convert 1:1 to the HeroCtl job spec, but the mapping is direct. Service becomes task. Network becomes network policy. Deploy strategy becomes rolling update strategy. The first migration takes an afternoon per application; after that, copy and paste with substitutions.",[12,26293,26294],{},"Ingress: Dokploy's router and HeroCtl's integrated router both end in configuration-as-code. You describe host, redirect, certificate in very few lines in either of the two. The translation is mechanical.",[12,26296,26297],{},"For teams with up to ten applications, manual migration in an afternoon. Above that, an experimental converter covers the common cases — write to us.",[19,26299,7347],{"id":7346},[12,26301,26302,26305],{},[27,26303,26304],{},"Is HeroCtl more mature than Dokploy?","\nNot in all dimensions. 
Dokploy has more community time, more plugins, a more visually polished panel. HeroCtl has its own control plane with active evolution, real high availability tested in a documented chaos battery, and a roadmap independent of any other project. \"Mature\" depends on the axis. In panel aesthetics and marketplace, Dokploy. In platform architecture and explicit commercial contract, HeroCtl.",[12,26307,26308,26311],{},[27,26309,26310],{},"Is Dokploy's panel better?","\nIn 2026, yes — more visually polished, more years of iteration, more incorporated community feedback. HeroCtl's covers the same use cases and is catching up aesthetically. The important question is whether the visual difference is the decisive criterion for you. For some teams it is. For others, architecture weighs more.",[12,26313,26314,26317],{},[27,26315,26316],{},"Which consumes fewer resources?","\nDokploy adds the panel overhead plus the Swarm overhead inside Docker — Swarm itself is light, but the panel is a reasonably complete web app. HeroCtl has a control plane between 200 and 400 MB per server, including embedded web panel. In a small cluster, both fit comfortably on modest servers. The difference isn't decisive.",[12,26319,26320,26323],{},[27,26321,26322],{},"And the database backup?","\nIn both, the database is an explicit responsibility of whoever operates it. Dokploy has one-click plugins for Postgres and similar, but automatic backup is additional configuration — you usually set up a cron job or a separate backup plugin. HeroCtl treats the database like any other job, with persistent volume; backup is a parallel job you define. Business includes managed backup with windows and retention. In neither is \"set and forget\" an honest answer — the database deserves attention in both.",[12,26325,26326,26329],{},[27,26327,26328],{},"Dokploy is open source, HeroCtl isn't, does this worry me?","\nFair question. 
HeroCtl has Community Edition free forever, with no server limit, no job limit, no artificial feature gates. The binary has no mandatory phone-home or remote kill-switch — once installed, the cluster works offline indefinitely. Enterprise includes source-code escrow with a third-party custodian, so if the company behind the product ceases operations, the code is delivered to paying customers with an internal continuity license. It's not the same as open source, but it solves what open source solves in this context: a guarantee against vendor disappearance. The commercial contract has been published since day one and is frozen for whoever signs today — there's no retroactive change clause.",[12,26331,26332,26335],{},[27,26333,26334],{},"Can I use both, one for dev and the other for prod?","\nTechnically yes, but it rarely makes sense. Docker images are portable between both, so the path works. In practice, two different orchestrators in adjacent environments doubles the operational knowledge the team has to maintain. We recommend choosing one and sticking with it. If the question is fundamentally which to choose, write to us — discussing case by case is more useful than generic advice.",[12,26337,26338,26340],{},[27,26339,22945],{},"\nAbove a thousand servers, neither is the obvious choice — that range is the territory of the colossus designed for tens of thousands of machines. HeroCtl was designed specifically for the 1 to 500 server range. Dokploy scales well within what Swarm scales — typically up to a few dozen nodes in production without gymnastics. Above that, Swarm starts to require care that wasn't in the product's original proposal.",[19,26342,3309],{"id":3308},[12,26344,26345],{},"The choice between Dokploy and HeroCtl isn't between a good product and a bad one. Both are serious. The choice is between different platform philosophies.",[12,26347,26348],{},"Dokploy chose to build a UX and product layer on top of a mature but static engine. 
HeroCtl chose to build the entire control plane to control evolution. Real trade-offs in both directions.",[12,26350,26351,26352,26354,26355,26357],{},"If you're on Dokploy today and not feeling the limitations, stay. It's a good product. If you're evaluating both in greenfield, read the post on ",[3336,26353,23582],{"href":6545}," to understand the motivation behind the decision to build the control plane from scratch. If you're coming from Heroku or commercial panels that got expensive, also worth the post ",[3336,26356,19750],{"href":19749}," — the market context helps.",[12,26359,26360],{},"To try HeroCtl in three minutes:",[224,26362,26363],{"className":226,"code":5318,"language":228,"meta":229,"style":229},[231,26364,26365],{"__ignoreMap":229},[234,26366,26367,26369,26371,26373,26375],{"class":236,"line":237},[234,26368,1220],{"class":247},[234,26370,2957],{"class":251},[234,26372,5329],{"class":255},[234,26374,2963],{"class":383},[234,26376,2966],{"class":247},[12,26378,26379],{},"One executable file, one server to start, two more when you want real HA. Embedded panel, automatic certificates, encryption between services by default. No phone-home, no kill-switch, frozen commercial contract.",[12,26381,26382],{},"The intent is simple: container orchestration, without ceremony.",[3350,26384,4376],{},{"title":229,"searchDepth":244,"depth":244,"links":26386},[26387,26388,26389,26390,26391,26392,26393,26394,26395,26396,26397],{"id":25904,"depth":244,"text":25905},{"id":25944,"depth":244,"text":25945},{"id":25980,"depth":244,"text":25981},{"id":25993,"depth":244,"text":25994},{"id":18679,"depth":244,"text":18680},{"id":26202,"depth":244,"text":26203},{"id":26227,"depth":244,"text":26228},{"id":26249,"depth":244,"text":26250},{"id":26268,"depth":244,"text":26269},{"id":7346,"depth":244,"text":7347},{"id":3308,"depth":244,"text":3309},"2026-01-22","Dokploy is the self-hosted segment's bet after Coolify's growth. 
Honest comparison: where they overlap, where they diverge.",{},"12 min",{"title":25887,"description":26399},{"loc":16694},"en\u002Fblog\u002Fheroctl-vs-dokploy",[26406,8756,7507,26407],"dokploy","docker-swarm","Ev4T74mOQT6kdIxhCCdO02QWPc3GSxQIBSQ5YZeqZQo",{"id":26410,"title":26411,"author":7,"body":26412,"category":8756,"cover":3379,"date":27044,"description":27045,"draft":3382,"extension":3383,"lastReviewed":3379,"meta":27046,"navigation":411,"path":27047,"readingTime":8761,"seo":27048,"sitemap":27049,"stem":27050,"tags":27051,"__hash__":27054},"blog_en\u002Fen\u002Fblog\u002Fcaprover-vs-coolify-vs-dokploy.md","CapRover vs Coolify vs Dokploy: the simple segment compared in 2026",{"type":9,"value":26413,"toc":27031},[26414,26423,26426,26433,26440,26447,26450,26454,26457,26460,26466,26476,26482,26486,26489,26492,26497,26506,26512,26516,26519,26522,26527,26532,26538,26542,26545,26724,26731,26735,26741,26744,26747,26758,26775,26778,26781,26813,26816,26820,26823,26826,26829,26841,26844,26848,26851,26883,26886,26890,26893,26899,26912,26921,26927,26930,26934,26937,26954,26957,26960,26962,26968,26974,26980,26986,26992,26998,27004,27006,27009,27012,27015,27020],[12,26415,26416,26417,5839,26419,26422],{},"Three products today dispute the same niche: the developer who wants a Heroku running on their own VPS, without learning a cluster orchestrator and without paying US$25 per dyno. CapRover, Coolify and Dokploy. The three make the same promise — ",[231,26418,16100],{},[231,26420,26421],{},"docker pull",", web panel, automatic certificate, deploy in minutes. The three deliver that promise. And even so, choosing wrong between them will cost you three months of your next quarter.",[12,26424,26425],{},"The difference is not in what each one does. It is in what each one bets on.",[12,26427,26428,26429,26432],{},"CapRover bet on ",[27,26430,26431],{},"stability",". 
It has existed since 2017, weathered three hype waves without losing identity, and prefers aging to reinventing itself.",[12,26434,26435,26436,26439],{},"Coolify bet on ",[27,26437,26438],{},"feature richness",". Template marketplace, native docker-compose support, multi-server, dozens of ready-made integrations. It is the most popular panel of the segment in January 2026.",[12,26441,26442,26443,26446],{},"Dokploy bet on ",[27,26444,26445],{},"lightness with modern UX",". Uses Docker Swarm underneath to gain multi-server out of the box, copies what worked in Coolify and discards what weighed it down.",[12,26448,26449],{},"The choice between the three defines the next pain you will have — and it isn't installation pain. It is the pain that appears in the eighth month, when the panel is already running your two or three production SaaS and you need to decide whether to trust it on a Thursday night. This post is the map to choose knowing which pain comes.",[19,26451,26453],{"id":26452},"caprover-the-veteran","CapRover, the veteran",[12,26455,26456],{},"CapRover came out in 2017, before Coolify existed, before the self-hosting hype came back into fashion, before docker-compose was synonymous with \"local dev\". It was written in Node, uses Docker Swarm internally (yes, before Dokploy made that a banner), and never tried to be pretty.",[12,26458,26459],{},"The project's philosophy is visible in every decision: do one thing well, keep the code simple, prefer biannual releases to weekly ones. The result is one of the most reliable tools in this segment. People who installed CapRover in 2019 still run that same installation in 2026 with incremental upgrades and zero complete reinstallation.",[12,26461,26462,26465],{},[27,26463,26464],{},"What it does well."," Idle of approximately 150 MB of RAM — the lightest of the three by a comfortable margin. Installation on a US$5\u002Fmonth VPS runs smoothly, with 600 MB free for real workload. 
Functional panel without frills, simple \"App\" abstraction (a Docker image, environment variables, domain, certificate), deploy support via CLI, via tarball push, via Dockerfile. Small but loyal community — the forum has answers from 2018 that still make sense in 2026.",[12,26467,26468,26471,26472,26475],{},[27,26469,26470],{},"Where it loses."," The \"App\" model is primarily single-container. You can run auxiliary services (Postgres, Redis) as separate Apps, but integration between them is via internal DNS and manual variables. docker-compose support is limited — it works for simple cases and breaks on complex named volumes or elaborate healthchecks. The plugin\u002Ftemplate ecosystem (\"one-click apps\") exists and works, but curation has stagnated; many templates point to image versions from 2022. The UI has the visual style from when it was written — functional, but it dates the product. And the most critical point: ",[27,26473,26474],{},"there is no real multi-server",". You run on one VPS. Want redundancy? CapRover doesn't help you.",[12,26477,26478,26481],{},[27,26479,26480],{},"When CapRover is the right choice."," Solo dev with one VPS. Monolithic application without dozens of microservices. Tight budget (its lightness saves you a plan upgrade). Team of one to three people who already made all the important decisions and want a panel that simply doesn't die. You value stability above new features. CapRover ages well; in January 2026 it remains the pragmatic choice for that specific profile.",[19,26483,26485],{"id":26484},"coolify-the-popular-feature-rich","Coolify, the popular feature-rich",[12,26487,26488],{},"Coolify entered the market late — first relevant release in 2021 — and gained dizzying traction in the past three years. In January 2026 it accumulates tens of thousands of stars in the public repository, an active community, monthly releases, conference presence, and a creator personally known to the public. 
It is the panel that an indie hacker on Twitter recommends first.",[12,26490,26491],{},"The philosophy is clear: build the most complete panel on the market. Support everything. Marketplace, multi-server, integrated monitoring, support for multiple destinations (cloud provider A, cloud provider B, your VPS), notifications across multiple channels, integration with multiple git providers, preview deploys, branch-per-environment, pull request review. If a competitor has a feature, Coolify wants to have it too — and usually does.",[12,26493,26494,26496],{},[27,26495,26464],{}," Modern, pleasant UI. Native and robust docker-compose support — you paste your existing compose, adjust domains, deploy. One-click database templates for Postgres, MySQL, MariaDB, MongoDB, Redis, and a dozen more. Multi-server: you register a remote destination and Coolify provisions apps there. Automatic backup integrated for the provisioned databases. Support for multiple git providers. Active community that responds quickly — active Discord, releases that incorporate requests.",[12,26498,26499,26501,26502,26505],{},[27,26500,26470],{}," Heavy. Idle between 500 and 700 MB of RAM, depending on the version and number of active auxiliary services. On a 1 GB VPS you are tight before bringing up a single application container. Complexity grew in the past two years faster than documentation kept up — new features debut with video tutorial on Twitter and take time to become consultable static pages. But the most serious point is what ",[27,26503,26504],{},"multi-server doesn't mean in Coolify",": it is not high availability. It is \"a central panel deploys on N remote hosts\". If the server hosting the panel falls, you lose the ability to deploy, see logs, review status — the remote apps keep running, but you are blind. The panel is a single point of failure by design.",[12,26507,26508,26511],{},[27,26509,26510],{},"When Coolify is the right choice."," Indie hacker with 2 to 5 applications. 
Wants a mature panel, with ready-made features that save configuration time. Values template marketplace and quick database setup. Has budget latitude for hardware — a 2 to 4 GB RAM VPS dedicated to the panel is not a problem. No formal SLA requirement. It is the most complete \"self-hosted Heroku\" you can get today, and for that profile it delivers exactly.",[19,26513,26515],{"id":26514},"dokploy-the-lightweight-challenger","Dokploy, the lightweight challenger",[12,26517,26518],{},"Dokploy is the newest of the three. Public launch in 2024, gained traction quickly in 2025, in January 2026 accumulates around ten thousand stars in the public repository. It was explicitly positioned as \"what Coolify should have been, without the weight\".",[12,26520,26521],{},"The philosophy: copy what Coolify got right in UX and discard what weighed it down. Uses Docker Swarm as the orchestration layer — controversial and founding decision of the project. Swarm gives you multi-server out of the box: you add a worker node and it joins the pool without manual network configuration. In exchange, you inherit Swarm's destiny as a product.",[12,26523,26524,26526],{},[27,26525,26464],{}," Idle approximately 350 MB of RAM — significantly lighter than Coolify, heavier than CapRover. Clean and modern UI, clearly inspired by Coolify but with more economical choices (fewer tabs, less initial config). Real multi-server via Swarm: register a worker, it appears in the pool, jobs distribute. Native Docker Compose support (Dokploy calls it \"stacks\"). Community growth: active Discord, frequent releases, project incorporated important requests in months, not years.",[12,26528,26529,26531],{},[27,26530,26470],{}," Newer means less battle-tested. Bugs appear in edge cases that Coolify and CapRover have already documented. Coupled to Swarm — and Swarm is a technology in slow maintenance by the original maintainer. 
It receives security and stability fixes, but there is no clear indication that it will gain new features. For a 2024 project, choosing Swarm is a defensible bet (stable, simple, embedded in Docker), but you are building on top of a layer that no longer evolves. The plugin\u002Ftemplate ecosystem is still shallow compared to Coolify's — you get the basics, not the curated library of specific templates. Documentation in Brazilian Portuguese is practically nonexistent.",[12,26533,26534,26537],{},[27,26535,26536],{},"When Dokploy is the right choice."," Small team (3 to 6 people) that values modern UX but doesn't want to pay Coolify's price in RAM. Real multi-server out of the box is a requirement (you want to run on 2 to 4 VPS from the start). New project without historical dependence on specific templates or plugins. You have no aversion to Docker Swarm as the underlying technology. For that profile, in January 2026, Dokploy is the choice that ages best among the three.",[19,26539,26541],{"id":26540},"the-three-side-by-side","The three side by side",[12,26543,26544],{},"The table below is the compressed version of the decision. 
Each row is a dimension that often becomes a forum argument; side by side, it becomes a criterion.",[119,26546,26547,26560],{},[122,26548,26549],{},[125,26550,26551,26553,26556,26558],{},[128,26552,2982],{},[128,26554,26555],{},"CapRover",[128,26557,2770],{},[128,26559,2776],{},[141,26561,26562,26576,26590,26604,26618,26631,26643,26657,26671,26684,26697,26711],{},[125,26563,26564,26567,26570,26573],{},[146,26565,26566],{},"Idle RAM (observed average)",[146,26568,26569],{},"~150 MB",[146,26571,26572],{},"500 to 700 MB",[146,26574,26575],{},"~350 MB",[125,26577,26578,26581,26584,26587],{},[146,26579,26580],{},"Idle CPU on idle VPS",[146,26582,26583],{},"\u003C 1%",[146,26585,26586],{},"1 to 3%",[146,26588,26589],{},"1 to 2%",[125,26591,26592,26595,26598,26601],{},[146,26593,26594],{},"Typical installation time",[146,26596,26597],{},"5 to 8 minutes",[146,26599,26600],{},"8 to 15 minutes",[146,26602,26603],{},"5 to 10 minutes",[125,26605,26606,26609,26612,26615],{},[146,26607,26608],{},"UI (subjective impression)",[146,26610,26611],{},"dated, functional",[146,26613,26614],{},"modern, dense",[146,26616,26617],{},"modern, lean",[125,26619,26620,26623,26625,26628],{},[146,26621,26622],{},"Native docker-compose",[146,26624,17471],{},[146,26626,26627],{},"yes, robust",[146,26629,26630],{},"yes, \"stacks\"",[125,26632,26633,26636,26638,26641],{},[146,26634,26635],{},"Real multi-server",[146,26637,100],{},[146,26639,26640],{},"yes, no panel HA",[146,26642,26640],{},[125,26644,26645,26648,26651,26654],{},[146,26646,26647],{},"Template marketplace",[146,26649,26650],{},"present, stagnant",[146,26652,26653],{},"large, active",[146,26655,26656],{},"basic, growing",[125,26658,26659,26662,26665,26668],{},[146,26660,26661],{},"Community (stars in January 2026)",[146,26663,26664],{},"~13k",[146,26666,26667],{},"~40k",[146,26669,26670],{},"~10k",[125,26672,26673,26676,26679,26682],{},[146,26674,26675],{},"Releases in last 6 
months",[146,26677,26678],{},"~3",[146,26680,26681],{},"monthly",[146,26683,26681],{},[125,26685,26686,26689,26692,26694],{},[146,26687,26688],{},"Portuguese documentation",[146,26690,26691],{},"rare",[146,26693,17471],{},[146,26695,26696],{},"practically none",[125,26698,26699,26702,26705,26708],{},[146,26700,26701],{},"Production maturity (years)",[146,26703,26704],{},"8+",[146,26706,26707],{},"4+",[146,26709,26710],{},"1+",[125,26712,26713,26715,26718,26721],{},[146,26714,5013],{},[146,26716,26717],{},"1 VPS, 1 to 3 apps",[146,26719,26720],{},"1 to 2 VPS, 2 to 5 apps",[146,26722,26723],{},"2 to 4 VPS, small team",[12,26725,26726,26727,26730],{},"The most informative column is not visible in this table: ",[27,26728,26729],{},"which pain comes third",". CapRover gives you immediate stability and the pain comes when you grow and need multi-server. Coolify gives you rich features and the pain comes in the VPS bill and growing complexity. Dokploy gives you modern multi-server and the pain comes from long-term Swarm coupling. Choosing is honestly answering: which of these three pains fits best into your next year?",[19,26732,26734],{"id":26733},"the-pain-the-three-share","The pain the three share",[12,26736,26737,26738,101],{},"There is a pain common to the three that deserves its own name, because it is where the honest transition happens: ",[27,26739,26740],{},"none of the three has real high availability",[12,26742,26743],{},"I'll be precise about what that means.",[12,26745,26746],{},"CapRover is single-server by design. Doesn't try. You install, run on one VPS, period. If the server goes down, the product goes down.",[12,26748,26749,26750,26753,26754,26757],{},"Coolify and Dokploy offer multi-server, but the term both use is misleading. Multi-server, in their vocabulary, means that a ",[27,26751,26752],{},"central panel"," deploys containers on ",[27,26755,26756],{},"N remote hosts",". The panel lives on one machine; the hosts are deploy targets. 
When the panel server goes down:",[2734,26759,26760,26763,26766,26769,26772],{},[70,26761,26762],{},"The remote apps keep running (Docker on them keeps working).",[70,26764,26765],{},"You lose access to centralized logs, metrics, deploys, environment variable editing.",[70,26767,26768],{},"Update the panel? Wait.",[70,26770,26771],{},"Redeploy a stuck app? Wait.",[70,26773,26774],{},"See why the checkout service started returning 500? Wait.",[12,26776,26777],{},"For a hobby project, \"wait\" is tolerable. For a startup whose first B2B client demands 99.5% contractual uptime, \"wait\" is an SLA violation. The three products share that wall.",[12,26779,26780],{},"The wall is signaled when:",[2734,26782,26783,26789,26795,26801,26807],{},[70,26784,26785,26788],{},[27,26786,26787],{},"Client demands explicit SLA."," Something like ≥ 99.5% in contract. Best effort doesn't get past the client's legal team.",[70,26790,26791,26794],{},[27,26792,26793],{},"The panel has been going down once a week, every week."," It could be a failed update, a kernel panic, an OOM kill at peak traffic — the cause matters less than the frequency.",[70,26796,26797,26800],{},[27,26798,26799],{},"You have 2 or more important applications in production."," Centralized backup becomes a requirement; coordination between apps becomes a requirement; unified observability becomes a requirement.",[70,26802,26803,26806],{},[27,26804,26805],{},"Team grew to 3 or more people and the panel became a bottleneck."," One deploy blocks another. Simultaneous access confuses state.",[70,26808,26809,26812],{},[27,26810,26811],{},"Panel backup is manual and hasn't been tested in months."," Can you restore the entire VPS in 30 minutes? Have you tested that procedure in prod?",[12,26814,26815],{},"When two or more of these points hold at the same time, you have passed the phase where CapRover, Coolify and Dokploy solve the problem. 
The next decision is structural.",[19,26817,26819],{"id":26818},"heroctl-as-a-natural-next-step","HeroCtl as a natural next step",[12,26821,26822],{},"HeroCtl is a single executable file you install on N Linux servers with Docker. The first three servers form the replicated control plane — there is no \"central server\" that can go down. Coordination between them survives the loss of any one, with automatic election in around seven seconds without human action.",[12,26824,26825],{},"The user experience to bring up an application is the same one you already know from the three panels: you describe the service, submit via CLI or web panel, the cluster decides where to run, opens the port, registers in the integrated router, issues an automatic Let's Encrypt certificate and starts serving traffic. Updating means changing the image version and submitting again — a rolling deploy with no maintenance window.",[12,26827,26828],{},"The difference is in what happens when something goes wrong. A server goes down? The workload migrates. There is no \"central\" panel — any of the servers serves the UI and API. You are never blind.",[12,26830,26831,26832,26834,26835,26837,26838,26840],{},"The commercial model is explicit from day one. ",[27,26833,4351],{}," is permanently free, no server limit, no artificial feature gates — runs the entire stack above including high availability, router, certificates, metrics and logs. Indie hackers and small teams never need to leave here. ",[27,26836,4355],{}," adds SSO\u002FSAML, granular access control, detailed auditing, managed backup and SLA support — for when your client starts demanding formal controls. ",[27,26839,4359],{}," adds source code escrow, continuity contract and 24×7 support. Business and Enterprise prices are published — without mandatory \"talk to sales\".",[12,26842,26843],{},"HeroCtl doesn't try to replace CapRover, Coolify or Dokploy in the bands where they shine. 
It tries to be the logical step after them, when the high availability wall appears. The three remain excellent for the profiles we described above — there is no reason to switch before it's time.",[19,26845,26847],{"id":26846},"decision-by-profile","Decision by profile",[12,26849,26850],{},"To resolve indecision in one sentence per profile:",[2734,26852,26853,26859,26865,26871,26877],{},[70,26854,26855,26858],{},[27,26856,26857],{},"Solo dev, 1 VPS, hobby project or early product."," CapRover. Lighter, more mature, less likely to be abandoned.",[70,26860,26861,26864],{},[27,26862,26863],{},"Indie hacker with 2 to 5 apps on 1 or 2 VPS, no formal SLA requirement."," Coolify. Marketplace and ready-made features save time.",[70,26866,26867,26870],{},[27,26868,26869],{},"Small team valuing modern UX, multi-server, new project without template legacy, 3 or more VPS."," Dokploy. Bets on modern lightness.",[70,26872,26873,26876],{},[27,26874,26875],{},"Startup with first B2B client demanding contractual SLA, 3 or more servers, real high availability requirement."," HeroCtl. The panel stops being a single point of failure.",[70,26878,26879,26882],{},[27,26880,26881],{},"Solo dev with no time at all to look after a server."," Hosted: Render, Railway, Vercel, Fly.io. Self-hosting is a commitment — if you don't have time, pay.",[12,26884,26885],{},"There is no shame in changing categories. Starting in CapRover, migrating to Coolify when complexity asks for more features, eventually leaving for HeroCtl when the client asks for SLA — that is a healthy path, not a failure of initial planning.",[19,26887,26889],{"id":26888},"migration-between-the-three","Migration between the three",[12,26891,26892],{},"The good news: what survives between the three is what matters.",[12,26894,26895,26898],{},[27,26896,26897],{},"Docker images (Dockerfile) work on the three."," If you built for CapRover, the same build works for Coolify and Dokploy. 
There is no runtime lock-in.",[12,26900,26901,26904,26905,26907,26908,26911],{},[27,26902,26903],{},"Named volumes survive."," You can copy a volume's content to a new host via ",[231,26906,2405],{}," with bind mount, or via ",[231,26909,26910],{},"docker cp",". Postgres dump and restore is the same as always.",[12,26913,26914,26917,26918,26920],{},[27,26915,26916],{},"Environment variables migrate literally."," The three accept the same keys. Copying a ",[231,26919,9367],{}," is trivial.",[12,26922,26923,26926],{},[27,26924,26925],{},"What needs manual adaptation:"," Coolify one-click templates (the internal config is specific, you reinstall the service from scratch on the destination), Dokploy Swarm stacks (similar, you rewrite), CapRover single-container Apps (you fragment into compose if going to Coolify\u002FDokploy). Domain, certificate and ingress configurations migrate conceptually but each panel's UI is specific — redoing takes an afternoon, not a week.",[12,26928,26929],{},"To migrate from the three to HeroCtl, the path is the same: same images, same variables, ingress configuration rewritten once in the HeroCtl format (which is a fifty-line configuration file, not three hundred).",[19,26931,26933],{"id":26932},"the-inevitable-question-which-is-the-most-popular","The inevitable question: \"which is the most popular?\"",[12,26935,26936],{},"In January 2026, in stars on the public repository (which is a weak proxy of real use, but is the only measurable proxy):",[2734,26938,26939,26944,26949],{},[70,26940,26941,26943],{},[27,26942,2770],{},": approximately 40k stars.",[70,26945,26946,26948],{},[27,26947,26555],{},": approximately 13k stars.",[70,26950,26951,26953],{},[27,26952,2776],{},": approximately 10k stars, growing fast.",[12,26955,26956],{},"In real production use, hard to measure. Coolify has the most active public discussion — Twitter, Discord, conferences. 
CapRover has the older, quieter installed base: people who installed in 2018 and don't talk about it because they have nothing to complain about. Dokploy has the fastest growth in percentage terms.",[12,26958,26959],{},"Popularity is not a direct proxy for \"right for you\". CapRover serves half a million devs in silence without making headlines. Choosing by popularity takes you to Coolify by inertia — and Coolify might even be right, but let it be by analysis, not by tweet count.",[19,26961,7347],{"id":7346},[12,26963,26964,26967],{},[27,26965,26966],{},"Can I migrate from CapRover to Coolify without redoing everything?","\nYou redo the configuration on the new panel (an afternoon of work for two or three apps), but Docker images and volumes survive. It is not code refactoring — it is re-registration on the panel.",[12,26969,26970,26973],{},[27,26971,26972],{},"Dokploy depends on Docker Swarm. Is that a problem?","\nIt depends on the horizon. For the next two years, no. Swarm is stable and functional. For a five-year horizon, it is betting on a layer that hasn't received new features in a long time. If your project is meant to last, it's worth considering.",[12,26975,26976,26979],{},[27,26977,26978],{},"Which uses less disk?","\nCapRover installs around 1.5 GB. Dokploy around 2 to 3 GB. Coolify around 3 to 5 GB depending on how many auxiliary services you enable. On a 25 GB VPS, any one fits; on a 10 GB one, CapRover is more comfortable.",[12,26981,26982,26985],{},[27,26983,26984],{},"Is there Brazilian Portuguese support in any?","\nCapRover has some partial translation and tutorials in Portuguese scattered around. Coolify has partial UI translation but documentation predominantly in English. Dokploy is predominantly in English. 
For a Brazilian team that prefers material in Portuguese, the tiebreaker is between CapRover and Coolify; HeroCtl is natively in PT-BR.",[12,26987,26988,26991],{},[27,26989,26990],{},"Does HeroCtl replace one of these or is it different?","\nDifferent in proposition. CapRover, Coolify and Dokploy solve \"Heroku on 1 VPS\" (or multiple VPS without panel high availability). HeroCtl solves \"Heroku in cluster with replicated control plane\". The range where it makes sense is after you outgrow the three — not before.",[12,26993,26994,26997],{},[27,26995,26996],{},"Is it worth running panel + cluster orchestrator together?","\nNo. You will get port conflicts, reverse proxy conflicts, and operational headaches. Choose one. If you are on CapRover\u002FCoolify\u002FDokploy today and want to migrate to HeroCtl, retire the previous one first — don't run them in parallel on the same machine.",[12,26999,27000,27003],{},[27,27001,27002],{},"And for an agency hosting 30 client sites?","\nCoolify is the popular choice in that profile today, for project isolation and billing-adjacent features (partial multi-tenancy, separate users). Dokploy starts to compete in that niche. HeroCtl is the choice when the agency grows to the point where the panel going down leaves 30 clients unable to deploy at once — at that point the single point of failure becomes unacceptable.",[19,27005,3309],{"id":3308},[12,27007,27008],{},"The three products of the simple segment are good at what they promise. CapRover ages well if you don't grow too much. Coolify delivers the most complete package if you have RAM to spare. Dokploy makes the most modern bet if you accept Docker Swarm as the base. There is no wrong choice for the profiles we described — only an expensive one: choosing for the wrong reasons.",[12,27010,27011],{},"The day the three stop being enough is specific and identifiable: client demanding SLA, panel becoming a bottleneck, team growing, backup becoming an obligation. 
When that day arrives, the next tool in the sequence is what we are building.",[12,27013,27014],{},"If you are choosing now between the three and the cluster is still distant, choose by the profile above and move on. If the cluster is becoming a requirement, install on three VPS:",[224,27016,27018],{"className":27017,"code":5318,"language":2529},[2527],[231,27019,5318],{"__ignoreMap":229},[12,27021,27022,27023,27025,27026,27028,27029,101],{},"For additional reading: the direct comparison ",[3336,27024,16690],{"href":16689},", the comparison ",[3336,27027,16695],{"href":16694},", and the panorama ",[3336,27030,19750],{"href":19749},{"title":229,"searchDepth":244,"depth":244,"links":27032},[27033,27034,27035,27036,27037,27038,27039,27040,27041,27042,27043],{"id":26452,"depth":244,"text":26453},{"id":26484,"depth":244,"text":26485},{"id":26514,"depth":244,"text":26515},{"id":26540,"depth":244,"text":26541},{"id":26733,"depth":244,"text":26734},{"id":26818,"depth":244,"text":26819},{"id":26846,"depth":244,"text":26847},{"id":26888,"depth":244,"text":26889},{"id":26932,"depth":244,"text":26933},{"id":7346,"depth":244,"text":7347},{"id":3308,"depth":244,"text":3309},"2026-01-19","The three dominant panels for running 'Heroku on 1 VPS'. Each bets on a different philosophy — maturity, feature richness, or low weight. 
An honest comparison to choose without regret.",{},"\u002Fen\u002Fblog\u002Fcaprover-vs-coolify-vs-dokploy",{"title":26411,"description":27045},{"loc":27047},"en\u002Fblog\u002Fcaprover-vs-coolify-vs-dokploy",[27052,27053,26406,7507,24167,8756],"caprover","coolify","M08fo9CJtl0Lj3tnyLNqhsNxfYp6AlkX-_yir6ZfprQ",{"id":27056,"title":27057,"author":7,"body":27058,"category":8756,"cover":3379,"date":27502,"description":27503,"draft":3382,"extension":3383,"lastReviewed":3379,"meta":27504,"navigation":411,"path":16689,"readingTime":4401,"seo":27505,"sitemap":27506,"stem":27507,"tags":27508,"__hash__":27510},"blog_en\u002Fen\u002Fblog\u002Fheroctl-vs-coolify.md","HeroCtl vs Coolify: when a single-server panel isn't enough",{"type":9,"value":27059,"toc":27491},[27060,27063,27066,27070,27073,27076,27081,27091,27094,27098,27101,27104,27107,27121,27124,27127,27131,27134,27137,27140,27143,27160,27163,27166,27170,27173,27176,27179,27182,27185,27187,27190,27317,27320,27323,27327,27330,27336,27342,27348,27354,27360,27363,27367,27370,27376,27382,27388,27394,27397,27400,27402,27408,27414,27420,27426,27432,27438,27444,27450,27452,27455,27458,27461,27464,27480,27486,27489],[12,27061,27062],{},"The question was never \"is Coolify good?\". It is good. The question is \"how long will Coolify be enough?\".",[12,27064,27065],{},"This post isn't about tearing anything down. It's about drawing a line — where the panel on a single server stops being the right answer and starts being a silent trap that shows up at the worst possible moment: in the first meeting with the first customer who asks for the SLA number.",[19,27067,27069],{"id":27068},"why-coolify-won-the-segment","Why Coolify won the segment",[12,27071,27072],{},"It's worth starting by acknowledging what Coolify got right, without irony.",[12,27074,27075],{},"Five-minute installation on a US$10\u002Fmonth VPS. Clean panel, with vocabulary familiar to anyone who has touched Heroku or Vercel. 
Active plugin ecosystem, engaged community, frequent releases. The free contract for self-hosting never changed.",[12,27077,27078,27079,101],{},"For an indie hacker in April 2026, with a 4 GB RAM server running a US$5k MRR SaaS, Coolify is the best option that exists in the market. Better than Dokku, better than Caprover, better than Dokploy in maturity — and infinitely better than orchestrating the same workload on managed Kubernetes, which would charge US$73 per month just for the control plane, before NAT, load balancer, and anything that delivers value — a topic we unpack in ",[3336,27080,15781],{"href":15780},[12,27082,27083,27084,27087,27088,27090],{},"The category \"",[3336,27085,27086],{"href":19749},"simple self-hosted Heroku","\" was invented by this generation of tools. Coolify, ",[3336,27089,2776],{"href":16694},", Caprover share that podium. Coolify leads in adoption, in community, and in release pace. It's not by chance.",[12,27092,27093],{},"The point of this article is specific: there is a wall that appears when your product grows, and that wall has no elegant solution within Coolify. It's an architectural limit, not a bug to be fixed.",[19,27095,27097],{"id":27096},"the-invisible-wall-the-single-server","The invisible wall: the single server",[12,27099,27100],{},"Coolify was designed around a central server. The panel lives there. The configuration database lives there. Deploy decisions originate there.",[12,27102,27103],{},"That's a coherent design choice — for someone on a single VPS, that centralization is exactly what makes the product so simple. There's no cluster network to debug, no certificate between nodes, no state replication to understand. You spin up a machine, install, deploy. Five minutes.",[12,27105,27106],{},"The question is what happens in the scenarios below:",[2734,27108,27109,27112,27115,27118],{},[70,27110,27111],{},"The VPS disk starts showing I\u002FO errors. The provider schedules maintenance to \"remediate\". 
The machine becomes unavailable for three hours on a Wednesday morning. Everything that was running on it is down.",[70,27113,27114],{},"A kernel update is pushed by the provider and the machine reboots itself. The Coolify panel comes back — but three containers enter a restart loop because they depend on a volume mount still being remounted. You discover this forty minutes later, via a customer's tweet.",[70,27116,27117],{},"The provider's entire datacenter has a regional network incident. It happened in January 2024 with a major European provider, in October 2024 with an American one, in February 2026 with a Latin American one. Your machine isn't even broken — it just became unreachable.",[70,27119,27120],{},"A bad deploy consumes all the machine's memory, the OOM killer starts taking down processes, and the Coolify panel itself enters a loop. You have nowhere to click to revert because the interface controlling the cluster is inside the machine that needs to be controlled.",[12,27122,27123],{},"In all these scenarios, the operational answer is the same: hope the machine comes back, or open a ticket, or bring up a new machine and restore backup. There's no automatic failover. There's no other panel replica to take over. There's no leader election among surviving servers — because there's only one server.",[12,27125,27126],{},"When your first serious customer asks \"what's the SLA?\", the honest answer is \"best-effort, our main machine has three nines of historical uptime from the provider\". It's an answer that works for some customers. It doesn't work for customers who have their own SLA to honor. 
And those are exactly the customers who pay annual contracts.",[19,27128,27130],{"id":27129},"coolify-multi-server-isnt-real-high-availability","Coolify multi-server isn't real high availability",[12,27132,27133],{},"The most common confusion in this conversation is Coolify's \"remote servers\" feature.",[12,27135,27136],{},"Coolify does allow you to connect additional machines as deploy targets. You register an IP, configure an SSH key, and the panel knows it can bring containers up on that remote machine. For those who need to separate staging from production, or to put workers on a cheaper machine, it's genuinely useful.",[12,27138,27139],{},"But it isn't high availability. It's load distribution.",[12,27141,27142],{},"The difference is in where the system's brain lives. In Coolify, the brain is the main machine — where the panel runs, where Coolify's SQLite or Postgres database lives, where the internal queues are. The remote machines are arms. If the brain falls, the arms keep holding what they had in their hands at the moment of the fall — already-running containers don't die. But you lose:",[2734,27144,27145,27148,27151,27154,27157],{},[70,27146,27147],{},"The web panel, entirely.",[70,27149,27150],{},"The ability to deploy, revert, change environment variable, read centralized log.",[70,27152,27153],{},"The ability to redirect traffic between replicas if one falls.",[70,27155,27156],{},"Automatic certificate renewal (if the main machine was the one responding to the ACME challenge).",[70,27158,27159],{},"Health checks that would restart problem containers.",[12,27161,27162],{},"You're left in a zombie state: the customer's site might keep responding for a few hours, but you've lost control of the orchestrator. Restarting the main machine becomes the only useful operation — and if the reason for the fall was a corrupted disk, you're restoring backup at four in the morning.",[12,27164,27165],{},"This isn't Coolify's fault. 
It's a direct consequence of the architectural design — a design that makes total sense for someone who will never need anything beyond it. The problem is when your company grows and customer expectations change. The tool doesn't grow along.",[19,27167,27169],{"id":27168},"the-natural-next-step","The natural next step",[12,27171,27172],{},"HeroCtl starts exactly where Coolify stops. Same simple-installation promise — one command, five minutes, web panel ready. But the brain of the system is replicated across three or more servers from the very first moment.",[12,27174,27175],{},"In practical terms: you install the same binary on three Linux machines with Docker. The three servers combine state with each other through consensus between servers. Important decisions (which container goes to which node, which version is active, which certificate has been renewed) are written to the replicated log and confirmed only after the majority agreed. If one of the three falls — kill -9, power outage, network partition — the remaining two elect a new leader in about seven seconds and continue serving traffic.",[12,27177,27178],{},"It's not magic. It's a technique known for twenty years, used in production by banks, by messaging systems, by distributed databases. HeroCtl's novelty is wrapping this in a package you install in five minutes, with embedded web panel and without requiring a specialized operator.",[12,27180,27181],{},"The integrated router distributes incoming traffic among healthy replicas automatically. If a server is down, it stops receiving requests — without you needing to touch DNS, without you needing to wake up at dawn. 
Let's Encrypt certificates live in the replicated log, so any surviving server can renew and serve them; there's no \"main machine\" responsible for TLS.",[12,27183,27184],{},"The chaos test battery covers real scenarios: kill -9 on the leader (election in seven seconds, no read-request loss), 30-second network partition (cluster reconverges on its own), momentary quorum loss (system enters read-only mode preserving existing traffic, resumes writes when quorum returns), disk wipe on a node (rejoins the cluster and downloads state from the replicated log), forced drain (workloads migrate to surviving nodes in seconds). All five scenarios survived in the public cluster that serves this blog.",[19,27186,23214],{"id":23213},[12,27188,27189],{},"The table below is the honest version. There's no column without caveats — every orchestration tool is a set of tradeoffs, and HeroCtl is too.",[119,27191,27192,27202],{},[122,27193,27194],{},[125,27195,27196,27198,27200],{},[128,27197,2982],{},[128,27199,2770],{},[128,27201,2994],{},[141,27203,27204,27214,27224,27234,27245,27255,27264,27273,27282,27290,27299,27307],{},[125,27205,27206,27209,27212],{},[146,27207,27208],{},"Installation time",[146,27210,27211],{},"5 minutes",[146,27213,27211],{},[125,27215,27216,27218,27221],{},[146,27217,25589],{},[146,27219,27220],{},"Yes, central",[146,27222,27223],{},"Yes, embedded in all servers",[125,27225,27226,27228,27231],{},[146,27227,3923],{},[146,27229,27230],{},"Yes (on main machine)",[146,27232,27233],{},"Yes, replicated across servers",[125,27235,27236,27239,27242],{},[146,27237,27238],{},"Multi-server",[146,27240,27241],{},"Yes, as deploy targets",[146,27243,27244],{},"Yes, as control plane cluster",[125,27246,27247,27249,27252],{},[146,27248,16324],{},[146,27250,27251],{},"No — panel is single point of failure",[146,27253,27254],{},"Yes — survives loss of servers",[125,27256,27257,27259,27261],{},[146,27258,23253],{},[146,27260,26172],{},[146,27262,27263],{},"Yes, ~7s after 
a failure",[125,27265,27266,27268,27271],{},[146,27267,23287],{},[146,27269,27270],{},"Not embedded",[146,27272,23272],{},[125,27274,27275,27277,27280],{},[146,27276,25616],{},[146,27278,27279],{},"Plugin\u002Fexternal integration",[146,27281,25621],{},[125,27283,27284,27286,27288],{},[146,27285,22779],{},[146,27287,27279],{},[146,27289,25631],{},[125,27291,27292,27294,27297],{},[146,27293,22811],{},[146,27295,27296],{},"Open-source with optional paid cloud",[146,27298,25651],{},[125,27300,27301,27303,27305],{},[146,27302,16398],{},[146,27304,22187],{},[146,27306,22187],{},[125,27308,27309,27311,27314],{},[146,27310,5013],{},[146,27312,27313],{},"1 server (up to 3 with remote servers)",[146,27315,27316],{},"3 to 500 servers",[12,27318,27319],{},"The line that matters most for this conversation is the fifth — real high availability. The others are consequences of it. Without consensus between servers, you can't have a resilient panel. Without a resilient panel, you can't promise an SLA. Without an SLA, some customers don't sign a contract.",[12,27321,27322],{},"The last line also deserves attention. Coolify is amazing with one server. HeroCtl is amazing with three to five hundred. They aren't products in the same range — they are products that cover adjacent ranges of the same problem.",[19,27324,27326],{"id":27325},"when-to-stay-on-coolify","When to stay on Coolify",[12,27328,27329],{},"This section exists because honesty is the defense mechanism of a new tool. If we said \"everyone should use HeroCtl\", we'd be wrong. There are five profiles where we firmly recommend staying on Coolify.",[12,27331,27332,27335],{},[27,27333,27334],{},"You have one server and don't plan to have more.","\nIf your architecture fits entirely on a 4 or 8 GB RAM VPS, and your business model doesn't require a contractual SLA, Coolify is the right answer. 
HeroCtl running on a single server works, but you're paying the overhead of a coordination layer that will never need to coordinate with anyone. It's like buying a five-seater car to make only solo trips — it's not wrong, it's just unnecessary.",[12,27337,27338,27341],{},[27,27339,27340],{},"You're running internal or development applications.","\nStaging environments, internal dashboards, tools that only your team uses, experimental side-projects — none of these cases justify multiplying servers. A two-hour staging outage costs nothing beyond annoyance. Coolify delivers the best cost-benefit for that kind of workload, and we openly recommend it.",[12,27343,27344,27347],{},[27,27345,27346],{},"You're a hobby developer.","\nPersonal projects, blogs, portfolios, experiments with new APIs — stay on Coolify. The mental cost of operating a three-server cluster is bigger than the perceived gain. You want to ship, not administrate infrastructure. Coolify was designed for this profile. We won't try to push a tool with more operational ceremony than the case calls for.",[12,27349,27350,27353],{},[27,27351,27352],{},"You have strong dependence on plugins from the Coolify ecosystem.","\nIf your flow depends on three specific Coolify plugins to integrate with third-party services for which we don't yet have native integration, migrating now is premature. Wait for HeroCtl to mature those integrations or assess whether the specific integration is worth the work to rewrite. We'd rather you wait for the integration to exist than migrate and discover a hole mid-path.",[12,27355,27356,27359],{},[27,27357,27358],{},"You're not feeling pain.","\nThis is the most important and most underestimated criterion. If Coolify never failed for you, if your main machine has three years of uptime, if no customer has ever charged you on SLA — don't migrate out of fashion. Migrating an orchestration tool has real cost (one to two weeks of a senior engineer, plus adjustments). 
Pay that cost only when the pain justifies it. Migrating before feeling pain is an elegant way to burn engineering time that could be turning into product.",[12,27361,27362],{},"The simple mental rule: if the phrase \"if this machine falls at three a.m., I have a serious problem with a customer\" describes your situation, it's time to assess. If the phrase \"if this machine falls at three a.m., I wake up tomorrow, restart it, and everything goes on\" describes it, stay where you are.",[19,27364,27366],{"id":27365},"how-to-migrate-when-it-makes-sense","How to migrate when it makes sense",[12,27368,27369],{},"The migration doesn't need to be a big bang. The path we recommend is gradual, in four steps, and keeps Coolify running the whole time.",[12,27371,27372,27375],{},[27,27373,27374],{},"Step 1 — Bring up the HeroCtl cluster in parallel."," Three new servers, in the same region, without touching the existing Coolify. Run the installer on each, join the cluster, open the panel. Basic validation: panel responds at the three IPs, test certificate is issued, a \"hello world\" container comes up and responds. Average time: an afternoon.",[12,27377,27378,27381],{},[27,27379,27380],{},"Step 2 — Migrate a low-risk application."," Choose an internal or very low-traffic application. Something that, if down for ten minutes, doesn't cause panic. Rewrite the configuration file in the HeroCtl format (in general, fifty lines replace what was a set of fields in the Coolify panel). Bring it up on the new cluster. Point a test subdomain. Validate for a week.",[12,27383,27384,27387],{},[27,27385,27386],{},"Step 3 — Migrate workloads with real traffic, one at a time."," For each migrated app, do blue-green: bring up the version in HeroCtl, point DNS to HeroCtl, keep Coolify running the old version for 24 to 72 hours as fallback. If something goes wrong, rollback is swapping DNS back. 
When everything stabilizes, shut down the application on Coolify.",[12,27389,27390,27393],{},[27,27391,27392],{},"Step 4 — Shut down Coolify."," When all applications have migrated and remained stable for a week or two, shut down the Coolify machine. Keep the Coolify database backup for another quarter, just in case. Then yes, terminate the VPS.",[12,27395,27396],{},"The high point of this path is that at no moment are you without a rollback option. Coolify keeps running until the last day. If migration goes wrong at any step, you take a step back, adjust, try again. There's no \"point of no return\" forced by architecture.",[12,27398,27399],{},"Typical total time, for a portfolio of ten to twenty applications: two to three weeks, with one engineer dedicating half their time. For larger portfolios, it scales linearly.",[19,27401,7347],{"id":7346},[12,27403,27404,27407],{},[27,27405,27406],{},"Coolify deploys to N servers too. What's the real difference?","\nCoolify deploys containers to N servers. HeroCtl runs the orchestrator on N servers. The difference is where the brain lives. In Coolify, the brain lives on the main machine — remote machines are arms. In HeroCtl, the brain is distributed: three or more servers combine state by consensus, and any one of them can take on the leader role if the current one falls. It's the same difference between \"having copies of a document on multiple computers\" and \"having a real-time collaborative document\". Both have multiple machines; only the second survives the loss of one of them without losing control.",[12,27409,27410,27413],{},[27,27411,27412],{},"Can I use Coolify and HeroCtl together?","\nYes, and during a migration it's the recommendation. Both run Docker as runtime, so they don't compete for resources at the container level. What changes is which orchestrator looks at which set of machines. 
Technically, you can keep old applications on Coolify forever and use HeroCtl just for new workloads with HA requirements — some teams choose this approach and never get to shut down Coolify. There's no forced coupling.",[12,27415,27416,27419],{},[27,27417,27418],{},"How much does it cost to migrate?","\nThe direct cost is engineering time: two to three weeks of half-time for a typical portfolio. The indirect cost is the learning curve of the new configuration file — usually half a day for a senior engineer to get the hang of it. There's no license cost to migrate (HeroCtl Community is permanently free, with no server or job limit), nor an exit cost from Coolify (you simply stop using it). The opportunity cost is the engineer not doing something else during those two weeks.",[12,27421,27422,27425],{},[27,27423,27424],{},"What do I lose from Coolify?","\nThe Coolify visual interface is friendlier for someone who has never seen any orchestration before — it's a real trade. Some plugins from the Coolify ecosystem don't yet have an equivalent in HeroCtl, especially integrations with niche services. The Coolify community is larger in volume — forums have more answers for specific questions. And certain UI conveniences (preconfigured templates for famous applications) aren't yet at the same parity. These points are real and we acknowledge them.",[12,27427,27428,27431],{},[27,27429,27430],{},"And the Coolify plugins I use?","\nEach becomes an individual analysis. For integrations with managed databases (Postgres, Redis), HeroCtl runs these services as ordinary jobs — without needing a specific plugin. For integrations with external services (email sending, SaaS observability), it's generally an environment variable pointing to the vendor's API — replicable in any orchestrator. For very specific plugins, write to us — in some cases, native equivalents are already on the roadmap.",[12,27433,27434,27437],{},[27,27435,27436],{},"I only have one server now. 
Is it worth starting with HeroCtl?","\nHonestly: no. If you have one server and no intention of adding more in the next six months, Coolify is better for your case. HeroCtl running on a single server works, but you're paying the overhead of coordination that has nobody to coordinate with. The right time to start with HeroCtl is when you know you'll have three or more servers — whether because traffic grew, because the customer asked for an SLA, or because you want to sleep better. Before that, stay on Coolify.",[12,27439,27440,27443],{},[27,27441,27442],{},"When are Business and Enterprise prices coming out?","\nThey're already published on the plans page, with no \"talk to sales\". Business adds SSO\u002FSAML, granular RBAC, detailed audit, managed backup, and SLA-backed support — for teams with formal platform requirements. Enterprise adds source-code escrow, continuity contract, and 24×7 support. The contract is frozen for existing customers — there's no retroactive change clause. Community remains permanently free, with no server limit, no job limit, no artificial feature gates. Individuals and small teams never need to leave Community.",[12,27445,27446,27449],{},[27,27447,27448],{},"What if I want to go back to Coolify after migrating?","\nPossible. The containers run Docker in both cases — the images are the same. What changes is the configuration file and the panel. Going back means recreating the configurations on Coolify and repointing DNS. We never locked anyone in: the HeroCtl binary has no mandatory phone-home, there's no remote kill-switch, and the installed cluster keeps working offline forever. If your decision is to shut down and go back, just shut down and go back. 
Exit freedom is what makes trust honest.",[19,27451,3309],{"id":3308},[12,27453,27454],{},"The choice between Coolify and HeroCtl is not technical, it's situational.",[12,27456,27457],{},"Coolify is the right answer for a specific phase of a product's growth: solo dev, initial MRR, one server, no SLA pressure. HeroCtl is the right answer for the next phase: small team, growing MRR, three or more servers, first SLA contracts, first customers who charge on availability. Both tools serve adjacent profiles of the same problem.",[12,27459,27460],{},"If you're in the first phase, install Coolify and be happy. If you're entering the second phase and feel the wall coming, HeroCtl meets you on the other side.",[12,27462,27463],{},"To start, on any of three Linux servers with Docker:",[224,27465,27466],{"className":226,"code":2948,"language":228,"meta":229,"style":229},[231,27467,27468],{"__ignoreMap":229},[234,27469,27470,27472,27474,27476,27478],{"class":236,"line":237},[234,27471,1220],{"class":247},[234,27473,2957],{"class":251},[234,27475,2960],{"class":255},[234,27477,2963],{"class":383},[234,27479,2966],{"class":247},[12,27481,27482,27483,101],{},"Installation runs in five minutes, the panel comes up on each of the servers, and the cluster is ready to receive jobs. 
For more on why HeroCtl exists and what problem it tries to solve, ",[3336,27484,27485],{"href":6545},"read the founding post",[12,27487,27488],{},"The intent is simple: container orchestration, without ceremony — now with a cluster.",[3350,27490,4376],{},{"title":229,"searchDepth":244,"depth":244,"links":27492},[27493,27494,27495,27496,27497,27498,27499,27500,27501],{"id":27068,"depth":244,"text":27069},{"id":27096,"depth":244,"text":27097},{"id":27129,"depth":244,"text":27130},{"id":27168,"depth":244,"text":27169},{"id":23213,"depth":244,"text":23214},{"id":27325,"depth":244,"text":27326},{"id":27365,"depth":244,"text":27366},{"id":7346,"depth":244,"text":7347},{"id":3308,"depth":244,"text":3309},"2026-01-13","Coolify solves solo dev brilliantly. When the customer asks for an SLA and the single server becomes a single point of failure, the story changes. Honest comparison.",{},{"title":27057,"description":27503},{"loc":16689},"en\u002Fblog\u002Fheroctl-vs-coolify",[27053,27509,8756,7507],"high-availability","GnS69XLncnP1cjkMTFvUnHFqHcmAyeT57JiQnTL7sKo",{"id":27512,"title":27513,"author":7,"body":27514,"category":3378,"cover":3379,"date":28291,"description":28292,"draft":3382,"extension":3383,"lastReviewed":3379,"meta":28293,"navigation":411,"path":11724,"readingTime":8761,"seo":28294,"sitemap":28295,"stem":28296,"tags":28297,"__hash__":28301},"blog_en\u002Fen\u002Fblog\u002Fdatabase-backup-strategies-cluster.md","Database backup in a cluster: strategies that survive 3 
a.m.",{"type":9,"value":27515,"toc":28276},[27516,27523,27529,27532,27535,27539,27545,27548,27551,27555,27558,27564,27573,27576,27582,27586,27593,27599,27608,27614,27633,27636,27642,27646,27649,27659,27664,27669,27674,27693,27696,27699,27704,27708,27711,27721,27727,27733,27739,27744,27753,27758,27776,27779,27784,27788,27791,27794,27799,27804,27809,27814,27821,27824,27829,27833,27836,27841,27846,27851,27856,27859,27864,27866,27869,28068,28071,28075,28078,28084,28090,28096,28102,28108,28112,28115,28124,28130,28136,28142,28148,28152,28155,28158,28161,28164,28167,28170,28172,28184,28190,28196,28210,28219,28225,28235,28237,28240,28243,28246,28263,28271,28274],[12,27517,27518,27519,27522],{},"Three in the morning. The alert wakes you because the health check endpoint has been returning 500 for twelve minutes. You open the terminal half-asleep, connect to the production database, and the first query returns ",[231,27520,27521],{},"ERROR: invalid page in block 4421 of relation base\u002F16384\u002F24576",". Physical corruption. Postgres still answers some requests, but it serves rotten data whenever a query touches the damaged pages.",[12,27524,27525,27526,27528],{},"You remember the ",[231,27527,5736],{}," cron that runs at 3 a.m. You look at the clock: 3:12. The cron started twelve minutes ago. Which means the backup being written to S3 right now is a snapshot of the corruption. The previous backup is 24 hours old. You have 24 hours of orders, payments, messages and uploads to lose, or a corrupted database to restore.",[12,27530,27531],{},"This is the scenario where every backup decision gets paid for. Not at the architecture meeting. Not at the sprint retro. At three in the morning, alone in a terminal, deciding between two bad options.",[12,27533,27534],{},"This post breaks down the five backup models for a clustered database, with honest numbers for how much each loses and how long each takes to come back. 
Each strategy has a SaaS range where it is the right choice — and a range where it is negligence. The difference is the stage your company is at, not the taste of whoever configured it.",[19,27536,27538],{"id":27537},"the-phrase-that-defines-everything-a-backup-that-has-never-been-restored-is-placebo","The phrase that defines everything: a backup that has never been restored is placebo",[12,27540,27541,27542,27544],{},"Most teams have confident opinions on backup and fragile practice. \"There's a ",[231,27543,5736],{}," cron to S3\" is the operational equivalent of \"there's a fire extinguisher but I never tested that it works\". Confidence comes from the file being there in the bucket. Fragility comes from never actually pulling that file, restoring it in an isolated environment, validating row counts, checking checksums.",[12,27546,27547],{},"Backup is the kind of system where the version that \"works until it stops working\" is indistinguishable from the version that \"really works\" — until the exact moment you need it. The first two years of a SaaS are lived in the Schrödinger state: the backup is alive and dead at the same time, and nobody has opened the box.",[12,27549,27550],{},"This text is the exercise of opening the box before the incident. The five strategies below are ordered by increasing complexity and guarantee. For each, the question is: how much data do I lose? how long am I down? how much does it cost? and — the criterion nobody takes seriously until they need it — how much internal competence does this require?",[19,27552,27554],{"id":27553},"rpo-and-rto-without-buzzwords","RPO and RTO without buzzwords",[12,27556,27557],{},"Before comparing strategies, two numbers need to be on the table. They have annoying acronyms but the concept is simple.",[12,27559,27560,27563],{},[27,27561,27562],{},"RPO — Recovery Point Objective."," How much data you accept losing. It is the distance between the \"now\" of the incident and the last consistent point you can restore. 
Daily backup at 3 a.m. means RPO of up to 24 hours — if corruption happens at 2 a.m. the next day, you lose 23 hours of transactions. Continuous backup means RPO of seconds.",[12,27565,27566,27569,27570,27572],{},[27,27567,27568],{},"RTO — Recovery Time Objective."," How long you accept being offline. It is the distance between the \"incident started\" and the \"serving traffic again\". ",[231,27571,5736],{}," restore on a 50 GB database takes between 30 and 60 minutes on a decent machine. Failover to a streaming replica takes 30 seconds.",[12,27574,27575],{},"Both cost money in different ways. Low RPO costs storage and continuous bandwidth (each commit needs to become durable bytes somewhere before being confirmed to the client). Low RTO costs redundant hardware — a hot replica is literally another database running, consuming CPU, RAM and disk in parallel, without serving traffic at normal times.",[12,27577,27578,27579,27581],{},"Defining RPO and RTO before implementing avoids the most common mistake: spending a lot to solve the wrong side. A team that pays US$300 per month for a read replica and still loses 24 hours of data when the disk corrupts spent badly — bought low RTO and ignored RPO. A team that does fortnightly ",[231,27580,5736],{}," to a cross-region encrypted bucket also spent badly — bought extreme durability and accepted RTO of hours that the B2B client doesn't tolerate.",[19,27583,27585],{"id":27584},"strategy-1-cron-pg_dump-to-s3","Strategy 1 — Cron + pg_dump to S3",[12,27587,27588,27589,27592],{},"The MVP version. A cron job on the database server, runs ",[231,27590,27591],{},"pg_dump | gzip | aws s3 cp"," at three in the morning. Cross-region bucket with lifecycle policy: files older than 30 days migrate to cold storage, older than 90 days are deleted.",[12,27594,27595,27598],{},[27,27596,27597],{},"Real RPO:"," 24 hours on the normal path. 
Can reach 48 if the cron fails one night and nobody notices.",[12,27600,27601,27604,27605,27607],{},[27,27602,27603],{},"Real RTO:"," 30 to 60 minutes for databases up to 50 GB. ",[231,27606,21196],{}," of a compressed dump runs close to disk I\u002FO speed — a machine with decent SSD does 1 to 2 GB per minute, counting index creation time.",[12,27609,27610,27613],{},[27,27611,27612],{},"Works for:"," personal project, MVP, internal tool, app where 24 hours of data is humanly recoverable. Registration platform where the client redoes what they lost. Cache tool that rehydrates from upstream. B2C SaaS in the first 100 users, where \"we lost a day, sorry\" is an acceptable answer.",[12,27615,27616,27619,27620,6833,27622,27625,27626,27628,27629,27632],{},[27,27617,27618],{},"Where it hurts:"," hot backup on a transactional database can capture inconsistent state between related tables if you don't use the right flags. ",[231,27621,5736],{},[231,27623,27624],{},"--serializable-deferrable"," solves transactional consistency but can block heavy writes. The parallel version with ",[231,27627,21652],{}," is faster but requires ",[231,27630,27631],{},"--format=directory",", which complicates the direct pipe to S3 — you end up needing local temporary disk.",[12,27634,27635],{},"And restore is slow in a way that surprises. A 200 GB database that seemed \"just a bit large\" becomes three hours of restore with indexes being rebuilt. Three hours your platform is offline.",[12,27637,27638,27641],{},[27,27639,27640],{},"Cost:"," R$5 to R$30 per month of S3-compatible storage for reasonable retention. Human time: zero after setup, ten minutes a month to check the cron log.",[19,27643,27645],{"id":27644},"strategy-2-pg_basebackup-wal-archiving","Strategy 2 — pg_basebackup + WAL archiving",[12,27647,27648],{},"The first time RPO drops from \"a day\" to \"a few minutes\". 
The idea is to separate two things: the complete snapshot of the database (basebackup) and the change history since the last snapshot (WAL — Write-Ahead Log, the file where Postgres writes each commit before applying it to the data files).",[12,27650,27651,27652,27654,27655,27658],{},"Setup: weekly ",[231,27653,17227],{}," records a complete snapshot of the data directory. In parallel, the Postgres ",[231,27656,27657],{},"archive_command"," copies each WAL file (16 MB each) to S3 as it is closed. Under medium load, a WAL segment fills and is closed within seconds.",[12,27660,27661,27663],{},[27,27662,27597],{}," 1 to 5 minutes. It is the interval between the last WAL shipped and the moment the disk died.",[12,27665,27666,27668],{},[27,27667,27603],{}," 15 to 45 minutes. Restore the basebackup, replay the WALs to the desired point.",[12,27670,27671,27673],{},[27,27672,27612],{}," SaaS with 10 to 100 GB of database, first hundred paying clients. Stage where losing 24 hours is catastrophic but the team still doesn't have a dedicated DBA.",[12,27675,27676,27678,27679,27681,27682,27684,27685,27688,27689,27692],{},[27,27677,27618],{}," the ",[231,27680,27657],{}," needs to be treated as production code, not a weekend script. If it fails and nobody notices, WAL accumulates on the database disk until it bursts, and when it bursts the database stops accepting writes. I have seen a cluster go down at four in the morning because the ",[231,27683,27657],{}," called ",[231,27686,27687],{},"aws s3 cp"," without ",[231,27690,27691],{},"--no-progress"," and the progress output blocked stdout in an environment without a TTY.",[12,27694,27695],{},"WAL volume surprises people. A database with average traffic generates 5 to 50 GB of WAL per day, depending on what it writes. Multiplied by 30 days of retention, that becomes terabytes. Cheap storage is still cheap, but the bucket lifecycle policy has to be sharp.",[12,27697,27698],{},"And restore requires sequential replay. You cannot skip WALs. 
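The two halves sketch out like this — paths and bucket names are illustrative, and the archive script must exit non-zero on failure so Postgres retries the segment:

```sh
# postgresql.conf — ship each WAL segment the moment it is closed
archive_mode = on
archive_command = '/usr/local/bin/archive_wal.sh %p %f'

# /usr/local/bin/archive_wal.sh — note --no-progress, per the war story above
#!/bin/sh
set -eu
exec aws s3 cp --no-progress "$1" "s3://acme-wal/$2"

# crontab — weekly full snapshot, streamed straight to the bucket
# (-D - writes a tar to stdout; -X none because WAL is archived separately)
0 2 * * 0 postgres pg_basebackup -D - -Ft -X none | gzip | aws s3 cp - "s3://acme-base/base-$(date +\%F).tar.gz"
```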
If an intermediate file was lost (archive bug, bucket deleted by mistake, anything), restore stops at that point and everything after it is unreadable. Periodic verification is mandatory.",[12,27700,27701,27703],{},[27,27702,27640],{}," R$50 to R$200 per month of storage depending on retention. Human time: 2 to 4 hours for a well-done initial setup, plus the first half hour of each incident spent understanding which WAL is needed.",[19,27705,27707],{"id":27706},"strategy-3-pgbackrest-wal-e-restic-xtrabackup","Strategy 3 — pgBackRest, WAL-E, restic, xtrabackup",[12,27709,27710],{},"For when strategy 2 has become too important to keep running by hand. Dedicated tools that combine basebackup, WAL archiving, retention, compression, encryption and — crucially — automatic verification.",[12,27712,27713,27716,27717,27720],{},[231,27714,27715],{},"pgBackRest"," is the most mature name for Postgres. Does everything strategy 2 does, but with parallelization (multiple processes sending WAL simultaneously), checksum validation on each file, point-in-time recovery (PITR) without pain, and — the big operational gain — a single command ",[231,27718,27719],{},"pgbackrest restore"," that knows how to find the most recent backup, download what's needed, and replay up to the moment you ask for.",[12,27722,27723,27726],{},[231,27724,27725],{},"WAL-E"," is older but simpler if all you want is to push to S3 and pull back.",[12,27728,27729,27732],{},[231,27730,27731],{},"restic"," is generic (not Postgres-specific), but has deduplication that is especially useful when you do a full dump multiple times a day — the second dump only sends the blocks that changed.",[12,27734,27735,27738],{},[231,27736,27737],{},"xtrabackup"," is the equivalent for MySQL, with the same philosophy: hot backup without locks, incremental, PITR.",[12,27740,27741,27743],{},[27,27742,27597],{}," minutes, same as strategy 2. 
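A pgBackRest sketch of the same pipeline — stanza name, paths and bucket are illustrative, and a real setup also needs S3 endpoint and credential settings:

```sh
# /etc/pgbackrest/pgbackrest.conf
[global]
repo1-type=s3
repo1-s3-bucket=acme-pgbackrest
repo1-retention-full=2
process-max=4                          # parallel workers for backup and archiving

[main]                                 # the "stanza": one named Postgres cluster
pg1-path=/var/lib/postgresql/16/main

# postgresql.conf hands WAL shipping to the tool:
#   archive_command = 'pgbackrest --stanza=main archive-push %p'

pgbackrest --stanza=main stanza-create # one-time initialization
pgbackrest --stanza=main backup        # full or incremental, chosen automatically
pgbackrest --stanza=main check         # verifies the archive pipeline end to end
```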
The difference is that here the \"5 minutes\" is more reliable because the tool continuously verifies that the pipeline is working.",[12,27745,27746,27748,27749,27752],{},[27,27747,27603],{}," 10 to 30 minutes with direct PITR. The command ",[231,27750,27751],{},"pgbackrest restore --type=time --target='2025-12-11 03:42:00'"," solves what in strategy 2 requires half an hour of scripts.",[12,27754,27755,27757],{},[27,27756,27612],{}," serious SaaS, database from 100 GB to 1 TB, team that takes formal responsibility for monthly restore tests.",[12,27759,27760,27762,27763,27765,27766,571,27769,571,27772,27775],{},[27,27761,27618],{}," learning the tool. ",[231,27764,27715],{}," has decent documentation but the vocabulary is specific — ",[231,27767,27768],{},"stanza",[231,27770,27771],{},"repo",[231,27773,27774],{},"archive-async",". Well-done initial configuration takes 2 to 4 hours, plus another day to run the first full + incrementals and be sure everything matches.",[12,27777,27778],{},"The trick here is that the tool hides the mechanism's complexity, doesn't remove it. When something goes wrong, you need to understand that underneath it is still basebackup + WAL. The difference is that the common bugs of strategy 2 (archive failing silently, missing WAL) are detected and alerted by the tool before the incident.",[12,27780,27781,27783],{},[27,27782,27640],{}," storage proportional to volume + part-time DBA time. I say \"part-time DBA\" even if no one with that title exists on the team — it is the time someone on call needs to invest monthly to run the restore test.",[19,27785,27787],{"id":27786},"strategy-4-streaming-replication-with-automatic-failover","Strategy 4 — Streaming replication with automatic failover",[12,27789,27790],{},"The first strategy where RTO drops below one minute. Instead of copying the database after it has finished writing, you keep a replica receiving the WAL stream in real time. 
When the primary dies, an orchestrator promotes the replica to primary, updates routing, and the service comes back without human intervention.",[12,27792,27793],{},"Patroni is the most common name to orchestrate this in Postgres. It solves leader election between nodes, manages replication slots, fences the dead node to avoid two simultaneous writes, and exposes an endpoint that your load balancer queries to know which node is the current primary.",[12,27795,27796,27798],{},[27,27797,27597],{}," seconds. Synchronous replication can reach zero (commit on the primary only confirms after the copy reaches the replica) but costs latency on each write — a choice made transaction by transaction in most systems.",[12,27800,27801,27803],{},[27,27802,27603],{}," 30 seconds to 5 minutes. The 30 seconds is automatic failover without human-in-the-loop. The 5 minutes is the scenario where the orchestrator detects the problem, decides it is not a false positive, promotes the replica, and the clients' DNS cache expires.",[12,27805,27806,27808],{},[27,27807,27612],{}," SaaS with first serious B2B client, contractual SLA equal to or greater than 99.5%. Platform where 5 minutes of window equals contractual penalty.",[12,27810,27811,27813],{},[27,27812,27618],{}," split-brain during network partitioning. If the primary becomes isolated but keeps accepting writes, and the replica is promoted by the orchestrator on the other side of the partition, you end up with two divergent truths that need to be manually reconciled when the network comes back. Serious orchestrators use a third witness node to avoid this, but the configuration is demanding.",[12,27815,27816,27817,27820],{},"Worse: the replica copies corruption. If the primary wrote a rotten page, the replica received the identical WAL and is equally rotten. 
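That endpoint is what the balancer keys on. An HAProxy fragment as a sketch — addresses and ports are illustrative; Patroni's REST API (port 8008 by default) answers 200 on /primary only on the current leader:

```sh
# haproxy.cfg — writes always follow whoever Patroni currently reports as primary
listen postgres_write
    bind *:5000
    option httpchk GET /primary
    http-check expect status 200
    default-server inter 3s fall 3 rise 2 on-marked-down shutdown-sessions
    server node1 10.0.0.1:5432 check port 8008
    server node2 10.0.0.2:5432 check port 8008
```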
Streaming replication protects you against primary hardware failure; it doesn't protect against an application bug that wrote garbage, against a wrong ",[231,27818,27819],{},"DROP TABLE",", against ransomware.",[12,27822,27823],{},"That's why this strategy never replaces the previous three — it adds to them. Very low RTO for hardware failure, plus traditional backup for logical failure.",[12,27825,27826,27828],{},[27,27827,27640],{}," two or three databases running all the time (one active primary, one or two standbys). Storage and CPU doubled or tripled relative to a single database. Human time: one week of initial setup, plus a quarterly battery of failover tests.",[19,27830,27832],{"id":27831},"strategy-5-backup-managed-by-a-third-party","Strategy 5 — Backup managed by a third party",[12,27834,27835],{},"The strategy where you buy your way out of the problem. RDS on AWS, Cloud SQL on Google, Neon, Crunchy Bridge, or a Brazilian provider that delivers automatic backup with PITR. You define the retention window; they take care of the rest.",[12,27837,27838,27840],{},[27,27839,27597],{}," 5 minutes is the typical published number.",[12,27842,27843,27845],{},[27,27844,27603],{}," 30 seconds to 5 minutes for failover within the same region. Cross-region restore is another animal — it can be an hour, depending on the provider.",[12,27847,27848,27850],{},[27,27849,27612],{}," team without internal expertise OR compliance that requires a vendor with SOC 2 \u002F ISO 27001 certification. Company that would prefer to pay dearly to not think about the subject.",[12,27852,27853,27855],{},[27,27854,27618],{}," cost. A database that would cost R$300\u002Fmonth in hardware becomes R$1,500 to R$3,000\u002Fmonth in equivalent RDS, depending on replica configuration. Provider lock-in — leaving RDS is a complicated migration because the way backup is done is product-specific. 
And cross-region restore (which is the part the B2B client demands in the contract) is frequently harder than the sales slides make it look.",[12,27857,27858],{},"The real advantage is zero ops under normal conditions. The real disadvantage is that when you are in an incident, escalating through provider support can take longer than it would take you to fix a well-configured self-hosted setup yourself.",[12,27860,27861,27863],{},[27,27862,27640],{}," US$50 to US$500 per month for small and medium databases. Above that, pricing is proportional to size.",[19,27865,17370],{"id":17369},[12,27867,27868],{},"The honest version side by side. Each column is one of the strategies above.",[119,27870,27871,27892],{},[122,27872,27873],{},[125,27874,27875,27877,27880,27883,27886,27889],{},[128,27876,2982],{},[128,27878,27879],{},"1: Cron + pg_dump",[128,27881,27882],{},"2: basebackup + WAL",[128,27884,27885],{},"3: pgBackRest",[128,27887,27888],{},"4: Streaming + failover",[128,27890,27891],{},"5: Managed",[141,27893,27894,27911,27930,27947,27964,27980,27996,28014,28032,28049],{},[125,27895,27896,27899,27901,27903,27906,27909],{},[146,27897,27898],{},"Typical RPO",[146,27900,11994],{},[146,27902,3019],{},[146,27904,27905],{},"1-5 min",[146,27907,27908],{},"seconds",[146,27910,3019],{},[125,27912,27913,27916,27919,27922,27925,27928],{},[146,27914,27915],{},"Typical RTO",[146,27917,27918],{},"30-60 min",[146,27920,27921],{},"15-45 min",[146,27923,27924],{},"10-30 min",[146,27926,27927],{},"30s-5 min",[146,27929,27927],{},[125,27931,27932,27935,27937,27940,27943,27945],{},[146,27933,27934],{},"Ideal database size",[146,27936,17606],{},[146,27938,27939],{},"10-100 GB",[146,27941,27942],{},"100 GB - 1 TB",[146,27944,17601],{},[146,27946,17601],{},[125,27948,27949,27952,27955,27957,27959,27962],{},[146,27950,27951],{},"Storage cost",[146,27953,27954],{},"very low",[146,27956,17508],{},[146,27958,17508],{},[146,27960,27961],{},"high 
(replica)",[146,27963,17503],{},[125,27965,27966,27969,27972,27974,27976,27978],{},[146,27967,27968],{},"Human time cost",[146,27970,27971],{},"~zero",[146,27973,17508],{},[146,27975,17508],{},[146,27977,17503],{},[146,27979,27971],{},[125,27981,27982,27984,27987,27990,27992,27994],{},[146,27983,4879],{},[146,27985,27986],{},"trivial",[146,27988,27989],{},"moderate",[146,27991,27989],{},[146,27993,17503],{},[146,27995,27986],{},[125,27997,27998,28001,28004,28006,28009,28012],{},[146,27999,28000],{},"Restore complexity",[146,28002,28003],{},"low",[146,28005,27989],{},[146,28007,28008],{},"low (PITR)",[146,28010,28011],{},"n\u002Fa (failover)",[146,28013,28003],{},[125,28015,28016,28018,28021,28023,28026,28029],{},[146,28017,7102],{},[146,28019,28020],{},"manual",[146,28022,28020],{},[146,28024,28025],{},"native",[146,28027,28028],{},"requires config",[146,28030,28031],{},"depends on provider",[125,28033,28034,28036,28038,28041,28044,28047],{},[146,28035,17446],{},[146,28037,100],{},[146,28039,28040],{},"yes, manual",[146,28042,28043],{},"yes, single command",[146,28045,28046],{},"n\u002Fa",[146,28048,17429],{},[125,28050,28051,28054,28057,28059,28062,28065],{},[146,28052,28053],{},"Brazilian SaaS range",[146,28055,28056],{},"MVP",[146,28058,17612],{},[146,28060,28061],{},"early\u002Fmid",[146,28063,28064],{},"startup with SLA",[146,28066,28067],{},"enterprise \u002F no expertise",[12,28069,28070],{},"Notice that no row has an absolute winner. Strategy 5 wins on \"human time\" and loses on \"storage cost\". Strategy 4 wins on RPO and loses on \"setup complexity\". The choice is always about which column you prioritize, given the company stage.",[19,28072,28074],{"id":28073},"the-five-mistakes-that-kill-the-colleagues-backup","The five mistakes that kill the colleague's backup",[12,28076,28077],{},"From here it is operational folklore. Each of these mistakes was made by a team confident in the backup they had. 
Each became a postmortem.",[12,28079,28080,28083],{},[27,28081,28082],{},"Never restored."," A backup that has never been restored in an isolated environment is a hypothesis. The fix is a monthly cron that pulls the most recent backup from the bucket, brings it up in an ephemeral database (a temporary machine you turn off afterwards), validates the row count of critical tables, checks checksums of some samples, and emails the team confirming success. The backup this cron validated is the only backup you are sure will work. Everything before it is faith.",[12,28085,28086,28089],{},[27,28087,28088],{},"Backup on the same disk or same region."," Disk dies, backup goes with it. An entire cloud provider region goes down (it has happened several times in the last five years), backup goes with it. Cross-region is the minimum. Cross-provider — main backup on one provider, secondary copy on another — is what separates \"prepared\" from \"fan in the stands\".",[12,28091,28092,28095],{},[27,28093,28094],{},"No long logical retention."," Seven days of retention seems comfortable until you discover corruption started eight days ago and nobody noticed. A subtle application bug that writes invalid data on one row per minute doesn't trigger an alert, but in two weeks it has poisoned half a million records. Short retention is negligence dressed as savings. Reasonable minimum policy: 7 days of hourly backup, 30 days of daily, 12 months of monthly.",[12,28097,28098,28101],{},[27,28099,28100],{},"No failure alert."," The silent cron is the most common killer. It failed thirty days ago because of an S3 credential change, nobody looked at the log, everyone slept peacefully. An alert integrated with the team's notification system (not email lost in the inbox, but Slack or equivalent that someone will read) is mandatory. 
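Such an alert can be sketched as two pieces — a failure notification in the backup cron itself, and a separate freshness watchdog that catches the cron that never ran at all; webhook URL, paths and thresholds are illustrative:

```sh
#!/bin/sh
# (a) In the backup cron: record every success, notify on failure.
notify() { curl -fsS -X POST -d "{\"text\":\"$1\"}" "$SLACK_WEBHOOK_URL" >/dev/null; }
#   backup.sh && date +%s > /var/backups/last_success || notify "backup FAILED"

# (b) Freshness watchdog on its own hourly cron.
stale() {  # stale FILE HOURS -> succeeds if the last recorded success is older than HOURS
  last=$(cat "$1" 2>/dev/null || echo 0)
  last=${last:-0}
  [ $(( $(date +%s) - last )) -gt $(( $2 * 3600 )) ]
}
#   stale /var/backups/last_success 26 && notify "no successful backup in 26 hours"
```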
And the alert has to be double — alert when it fails, and alert when there has been no success for more than N hours (to detect the case where the cron didn't even run).",[12,28103,28104,28107],{},[27,28105,28106],{},"Backup without encryption."," A public S3 bucket leak becomes a leak of the entire database. The press incidents of recent years all follow the same script: \"the bucket was public for two days and held the complete backup\". What's needed: encryption at rest (server-side encryption on the bucket is the minimum, client-side encryption with a key managed by you is adequate), encryption in transit, and — separate from that — bucket access control following the principle of least privilege.",[19,28109,28111],{"id":28110},"the-brazilian-saas-maturity-trail","The Brazilian SaaS maturity trail",[12,28113,28114],{},"The practical question: which strategy for which stage. Using Brazilian monthly recurring revenue (MRR) as the reference metric.",[12,28116,28117,28120,28121,28123],{},[27,28118,28119],{},"MVP up to R$5k MRR."," Strategy 1. ",[231,28122,5736],{}," cron to S3-compatible storage, 30-day retention, basic failure alert. Costs practically nothing and protects against the main risk at this stage, which is \"the server vanished\". A 24-hour RPO is acceptable because the client at this stage has the expectations of a product that's just starting.",[12,28125,28126,28129],{},[27,28127,28128],{},"Indie from R$5k to R$30k."," Strategy 2. Adds WAL archiving, drops RPO from 24 hours to 5 minutes. The complexity jump is compensated by the guarantee jump. A client paying R$500 per month for a B2B subscription starts asking SLA questions — you need a better answer than \"daily\".",[12,28131,28132,28135],{},[27,28133,28134],{},"Early startup from R$30k to R$200k."," Strategy 3. pgBackRest configured with an automatic monthly restore test. Here it is no longer optional — you have a serious client, a serious contract, and the fragility of strategy 2 done by hand is concrete operational risk. 
Half a day of setup pays itself in three months by the first incident that doesn't become a public postmortem.",[12,28137,28138,28141],{},[27,28139,28140],{},"Startup with contractual SLA."," Strategy 4 added to 3. Streaming replication with automatic failover takes care of hardware; pgBackRest takes care of logical data. The two together solve the two failure modes that serious B2B contracts charge for: momentary unavailability and data loss.",[12,28143,28144,28147],{},[27,28145,28146],{},"Enterprise or heavy compliance."," Strategy 5 OR strategy 4 with detailed audit. Here the decision is less technical and more regulatory. If audit requires a vendor with X certification, you buy managed. If audit accepts self-hosted with documented runbook, you operate strategies 3 and 4 and invest in the audit trail.",[19,28149,28151],{"id":28150},"how-heroctl-simplifies-this","How HeroCtl simplifies this",[12,28153,28154],{},"The product motivation is exactly the scheme above — they are layers that appear repeatedly in every SaaS, and each team spends an entire season rebuilding the same plumbing. HeroCtl solves the transport, orchestration and observation of these layers. Keeps what is database-specific with the team, automates what is generic.",[12,28156,28157],{},"Concretely, on the Community plan (free), you run Postgres as a regular job in the cluster, with encrypted persistence at rest. The backup cron becomes another job, with automatic retry, integrated failure alert (no need to set up Alertmanager separately), and metrics that appear on the panel — dump duration, file size, time since last success.",[12,28159,28160],{},"On Business, managed Postgres and MySQL backup enters. The difference vs. \"doing it yourself\" is that integrity verification, monthly restore test, client-side encryption with a key managed by you, and three-tier retention policy (hourly\u002Fdaily\u002Fmonthly) come pre-configured. 
You define the window and the rest is our problem.",[12,28162,28163],{},"On Enterprise, orchestration of streaming replication between cluster nodes enters — a job describes the topology (primary here, standby there, who promotes whom), and the cluster takes care of failover when the primary dies. Chaos test battery runs monthly against your cluster, with a report.",[12,28165,28166],{},"On all plans, the restore test can be configured as an internal cron job: the cluster brings up an ephemeral database, restores the most recent backup, validates what you defined as \"healthy\" (a checksum query, a row count, whatever makes sense), reports success or failure, and turns off the ephemeral database. The cost is the test's CPU minutes, not an engineer's workday.",[12,28168,28169],{},"The philosophy is the same as the rest of the product: what is generic to every SaaS is already done; what is specific to your domain stays with you. Backup is generic. The database content is yours.",[19,28171,3225],{"id":3224},[12,28173,28174,28180,28181,28183],{},[27,28175,28176,28177,28179],{},"Is ",[231,28178,5736],{}," enough to start?","\nYes, for MVP up to first paying clients. The practical rule: if losing 24 hours of data costs less than three hours of engineer time, ",[231,28182,5736],{}," is the right choice. When it inverts (losing data costs more than migrating to strategy 2), migrate.",[12,28185,28186,28189],{},[27,28187,28188],{},"How much does S3-compatible storage cost in Brazil in 2026?","\nCurrent range between R$0.15 and R$0.40 per GB-month depending on provider and tier. Backup of a 50 GB database with 30-day retention plus some old weekly dumps = ~3 TB-month equivalent compressed = R$50 to R$120\u002Fmonth. Cross-region practically doubles it. 
Cold tier for long monthly retention drops it to half.",[12,28191,28192,28195],{},[27,28193,28194],{},"How do I test restore without affecting production?","\nCron job in the cluster that: 1) downloads the most recent backup from the bucket; 2) brings up a temporary Postgres on another node (or ephemeral container) without exposing public port; 3) restores the dump; 4) runs validation queries (per-table count, checksum of samples, critical domain query); 5) compares with expected baselines; 6) reports the result on the alerts channel; 7) destroys the environment. Runs monthly, at minimum. Never touches the production database at any point.",[12,28197,28198,28201,28202,28205,28206,28209],{},[27,28199,28200],{},"Encrypted backup on S3 — what's the right way?","\nThree layers. Server-side encryption of the bucket (",[231,28203,28204],{},"AES256"," or KMS) is the base and is free. Client-side encryption with a key managed by you (you encrypt the file before uploading, with a key that lives outside the cloud provider) protects against credential compromise scenarios. Bucket policy with ",[231,28207,28208],{},"aws:SecureTransport=true"," and explicit public access blocking. The three together, no exception. One alone is not enough.",[12,28211,28212,28215,28216,28218],{},[27,28213,28214],{},"Does a read replica count as a backup?","\nNo. A read replica protects against primary hardware failure. Doesn't protect against wrong ",[231,28217,27819],{},", against a bug that writes invalid data, against logical corruption that replicates. Every logical corruption that enters the primary enters the replica in seconds. Read replica is high availability, not backup. The two coexist; neither replaces the other.",[12,28220,28221,28224],{},[27,28222,28223],{},"Is cross-region replication necessary?","\nFor serious B2B client, yes. For MVP, no. The practical boundary: if a single cloud provider falling for 4 hours is enough to break a contract, cross-region is minimum. 
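Those seven steps sketch out as a monthly job — image, bucket, the baseline and the notify helper are illustrative placeholders:

```sh
#!/bin/sh
set -eu
# notify: hypothetical helper posting to the team's alerts channel
# 1) newest dump in the bucket (date-named keys sort chronologically)
key=$(aws s3 ls s3://acme-backups/mydb/ | sort | tail -n 1 | awk '{print $4}')
aws s3 cp "s3://acme-backups/mydb/$key" /tmp/latest.sql.gz
# 2) ephemeral Postgres, no published port
docker run -d --name restore-test -e POSTGRES_PASSWORD=test postgres:16
until docker exec restore-test pg_isready -U postgres >/dev/null 2>&1; do sleep 1; done
# 3) restore
gunzip -c /tmp/latest.sql.gz | docker exec -i restore-test psql -U postgres -q
# 4-5) validate against expected baselines
rows=$(docker exec restore-test psql -U postgres -tAc 'SELECT count(*) FROM users')
[ "$rows" -ge 10000 ] || { notify "restore test FAILED: users=$rows"; exit 1; }
# 6) report
notify "restore test OK: users=$rows"
# 7) destroy the environment
docker rm -f restore-test
```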
If 4 hours of downtime can still be explained by email, cross-region can wait.",[12,28226,28227,28230,28231,28234],{},[27,28228,28229],{},"How long does it take to restore 100 GB of Postgres?","\nDepends a lot on disk. Local NVMe SSD: 20 to 40 minutes counting index creation. Remote SSD (cloud provider volume): 40 to 90 minutes. HDD: you don't want to know. Dump compression typically drops to half the original size; 100 GB database becomes a 30 to 50 GB dump. Create indexes in parallel (",[231,28232,28233],{},"pg_restore -j",") to speed up 30 to 50%.",[19,28236,3309],{"id":3308},[12,28238,28239],{},"Backup is less about technical strategy and more about test discipline. The five strategies above are all decent at some stage. The difference between the team that survives the incident and the team that loses a client is not in which they chose — it is in having restored a backup recently enough to trust.",[12,28241,28242],{},"If your answer to \"when was the last restore test?\" begins with \"let me see\", you have work for this week. Not the next sprint. This week. The three a.m. scenario isn't scheduled.",[12,28244,28245],{},"HeroCtl runs on any Linux server with Docker. You install in a lab, bring up a Postgres as a job, configure automatic backup, schedule monthly restore test, and in an afternoon you have the entire strategy 3 scheme working. Without setting up five different products, without learning specialized orchestrator vocabulary.",[224,28247,28249],{"className":28248,"code":5318,"language":1531,"meta":229,"style":229},"language-sh shiki shiki-themes github-dark-default",[231,28250,28251],{"__ignoreMap":229},[234,28252,28253,28255,28257,28259,28261],{"class":236,"line":237},[234,28254,1220],{"class":247},[234,28256,2957],{"class":251},[234,28258,5329],{"class":255},[234,28260,2963],{"class":383},[234,28262,2966],{"class":247},[12,28264,28265,28266,28268,28269,101],{},"For context on when it makes sense to run Postgres in your own cluster vs. 
buying managed, read ",[3336,28267,7462],{"href":7461},". To understand why the deploy window is the other side of the backup coin (because a bad deploy is the most common way to generate the incident that will require a restore), read ",[3336,28270,3339],{"href":3338},[12,28272,28273],{},"A backup that has never been restored is placebo. The difference between placebo and medicine appears exactly once, exactly when you can't choose.",[3350,28275,4376],{},{"title":229,"searchDepth":244,"depth":244,"links":28277},[28278,28279,28280,28281,28282,28283,28284,28285,28286,28287,28288,28289,28290],{"id":27537,"depth":244,"text":27538},{"id":27553,"depth":244,"text":27554},{"id":27584,"depth":244,"text":27585},{"id":27644,"depth":244,"text":27645},{"id":27706,"depth":244,"text":27707},{"id":27786,"depth":244,"text":27787},{"id":27831,"depth":244,"text":27832},{"id":17369,"depth":244,"text":17370},{"id":28073,"depth":244,"text":28074},{"id":28110,"depth":244,"text":28111},{"id":28150,"depth":244,"text":28151},{"id":3224,"depth":244,"text":3225},{"id":3308,"depth":244,"text":3309},"2025-12-11","A backup that has never been restored is placebo. 
Five strategies with real recovery time (RTO) and honest acceptable data loss (RPO), for each Brazilian SaaS stage.",{},{"title":27513,"description":28292},{"loc":11724},"en\u002Fblog\u002Fdatabase-backup-strategies-cluster",[28298,13016,28299,3378,28300],"backup","disaster-recovery","rpo-rto","WZ1PINJEZ8NIBCZejlBGpCsKYzhJ_0jdhdx_hDTt1Uc",{"id":28303,"title":28304,"author":7,"body":28305,"category":3378,"cover":3379,"date":28991,"description":28992,"draft":3382,"extension":3383,"lastReviewed":3379,"meta":28993,"navigation":411,"path":3338,"readingTime":4401,"seo":28994,"sitemap":28995,"stem":28996,"tags":28997,"__hash__":28999},"blog_en\u002Fen\u002Fblog\u002Fsafe-rolling-deploys-why-yours-might-not-be.md","Safe rolling deploys: why yours probably isn't",{"type":9,"value":28306,"toc":28976},[28307,28310,28313,28316,28320,28330,28333,28346,28351,28354,28358,28361,28368,28371,28374,28378,28385,28388,28391,28395,28401,28408,28414,28417,28425,28428,28437,28440,28447,28451,28454,28457,28467,28470,28474,28477,28661,28667,28671,28674,28709,28729,28750,28768,28774,28780,28790,28794,28797,28802,28807,28813,28819,28823,28826,28832,28838,28855,28861,28863,28869,28884,28895,28905,28911,28917,28929,28931,28934,28937,28940,28942,28958,28970,28973],[12,28308,28309],{},"Every engineering team operating containers in production, sooner or later, writes a similar sentence in the status channel: \"deploy completed, no downtime\". The sentence is optimistic. 
In at least half the cases we've audited — in homemade scripts, in popular self-hosted panels, in official tutorials that became references — what actually happened was a 5 to 30 second window where the load balancer returned 502, some uploads cut off mid-way, and no one noticed because monitoring samples once a minute.",[12,28311,28312],{},"Rolling deploy seems like the simplest orchestration problem: you have N containers running an old version, you want to have N containers running a new version, and you want the app to keep responding during the swap. The conceptual recipe fits in three lines. Replace an old container with a new one, one at a time, keeping traffic always directed to whoever's alive. What makes this hard isn't the strategy — it's the set of six details that need to be right at the same time. Each alone seems like an implementation detail. The six together are the difference between \"real no-downtime deploy\" and \"deploy that seems no-downtime on Friday morning, but has thirty seconds of error in the middle of the 5 p.m. peak\".",[12,28314,28315],{},"This post maps the six. At the end there's a recipe in spec format, an honest comparison of who implements what, and a tests section you can run on your current system to discover if it's bad — before your customers discover it first.",[19,28317,28319],{"id":28318},"detail-1-health-check-before-promoting-the-new-container","Detail 1 — Health check before promoting the new container",[12,28321,28322,28323,28326,28327,28329],{},"The most common error in homemade rolling deploy is trusting the ",[231,28324,28325],{},"running"," state the container runtime reports. You bring up a new container, Docker (or equivalent) marks it as ",[231,28328,28325],{}," in milliseconds, and your script considers that proof it can kill the old. Kills the old. The new one, internally, is still starting — waiting for a database connection, loading cache in memory, downloading feature flag configuration, opening its thread pool. 
During that interval, the load balancer routes traffic to a process that isn't yet ready to receive and returns 502 or 503.",[12,28331,28332],{},"The window is short — usually between 5 and 30 seconds per container —, and that's why it fools monitoring. If your error metric is sampled every minute and you swap five containers in sequence, each with 10 seconds of \"is running but not ready\", the spike doesn't always fall exactly on a collection window. You're left with the statistical impression that everything went well.",[12,28334,28335,28336,571,28338,28341,28342,28345],{},"What rolling should do, instead, is separate two concepts: \"the process is running\" and \"the process is ready to serve traffic\". Running is runtime state; ready is affirmative response from an endpoint that the app itself exposes — ",[231,28337,355],{},[231,28339,28340],{},"\u002Freadyz",", or equivalent. The orchestrator does HTTP GET on that endpoint of the new container, waits to receive 200, waits for that response sustained for a period (",[231,28343,28344],{},"min_healthy_time",", typically 10 seconds), and only then removes the old container from the load balancer.",[12,28347,352,28348,28350],{},[231,28349,28344],{}," is the detail many people skip. It exists because a single 200 means nothing — it could be that the app responded before closing a critical connection, and will start failing in the next second. Waiting for 10 consecutive seconds of healthy responses filters those false positives without lengthening the deploy absurdly.",[12,28352,28353],{},"Classic Watchtower — popular script for updating containers from new image tags — does none of this. It pulls, stops the old, starts the new. Coolify and Dokploy implement partially, depending on the application configuration and the type of health check you enabled. 
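That gate can be sketched as a small loop — endpoint and thresholds are illustrative, and the probe command is parameterized only so the loop itself can be exercised without a live service:

```sh
#!/bin/sh
# Promote the new container only after its health endpoint has answered 200
# for min_healthy consecutive seconds, within an absolute deadline.
probe() { ${PROBE_CMD:-curl -fsS -o /dev/null --max-time 2} "$1"; }

wait_healthy() {  # wait_healthy URL MIN_HEALTHY DEADLINE
  url=$1; min_healthy=${2:-10}; deadline=${3:-300}
  elapsed=0; streak=0
  while [ "$elapsed" -lt "$deadline" ]; do
    if probe "$url"; then
      streak=$((streak + 1))
      [ "$streak" -ge "$min_healthy" ] && return 0  # sustained health: safe to drain the old one
    else
      streak=0                                      # one failure resets the streak
    fi
    sleep 1; elapsed=$((elapsed + 1))
  done
  return 1                                          # deadline hit: abort, keep the old container
}
```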
A serious cluster orchestrator treats this as minimum requirement.",[19,28355,28357],{"id":28356},"detail-2-connection-draining-and-graceful-shutdown","Detail 2 — Connection draining and graceful shutdown",[12,28359,28360],{},"Even when you order the router to stop sending new connections to the old container, there are still in-flight connections — file uploads, large downloads, long requests waiting for database response, open websocket, event streaming. If you simply send SIGKILL (or let the runtime send, after a too-short timeout), all those connections cut off mid-way. The user sees network error at the exact moment of deploy.",[12,28362,28363,28364,28367],{},"The correct flow has four ordered steps. First, you signal the router that the old container should no longer receive new connections — that's usually a load balancer removal or a node ",[231,28365,28366],{},"drain",". Second, you send SIGTERM to the process. SIGTERM is a catchable signal; the app can handle it and start graceful shutdown. Third, you wait. The timeout depends on the application profile — 30 to 60 seconds covers the vast majority of web apps; APIs with large file upload may need 120 or more. Fourth, and only after that timeout, you send SIGKILL to whatever hasn't finished.",[12,28369,28370],{},"There's a known pitfall in this step: the app itself needs to handle SIGTERM. Node, Rails, Django, Go HTTP server — all have middleware or helpers for this, but none of them comes on by default in basic template. If your application doesn't catch SIGTERM, the signal becomes no-op and the orchestrator will wait the full timeout before killing with SIGKILL. The result is slow deploy and, even so, cut connections — because the app only realized it would die at the moment of KILL.",[12,28372,28373],{},"Check your app. Specifically: when it receives SIGTERM, does it stop accepting new connections, wait for in-flight connections to end, and only then close? 
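Orchestrator-side, the four steps sketch out as follows — remove_from_balancer is a hypothetical hook for whatever proxy you run, and $DOCKER is parameterized only so the flow can be dry-run:

```sh
#!/bin/sh
: "${DOCKER:=docker}"

drain_and_stop() {  # drain_and_stop CONTAINER GRACE_SECONDS
  cid=$1; grace=${2:-60}
  remove_from_balancer "$cid"           # 1) stop routing new connections to it
  $DOCKER kill -s TERM "$cid"           # 2) SIGTERM — app starts graceful shutdown
  waited=0
  while [ "$waited" -lt "$grace" ]; do  # 3) wait for in-flight work to finish
    $DOCKER inspect -f '{{.State.Running}}' "$cid" | grep -q true || return 0
    sleep 1; waited=$((waited + 1))
  done
  $DOCKER kill -s KILL "$cid"           # 4) hard stop whatever is left
}
```

Note that `docker stop --time 60` already bundles steps 2 to 4; the sketch spells them out only to make the ordering with step 1 explicit.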
If the answer is \"I don't know\", your rolling deploy is broken on this detail.",[19,28375,28377],{"id":28376},"detail-3-previous-image-pre-pulled-for-fast-rollback","Detail 3 — Previous image pre-pulled for fast rollback",[12,28379,28380,28381,28384],{},"Critical bug in production, three minutes after deploy. You need to revert to the previous version now. The ",[231,28382,28383],{},"pull"," command for the old image takes 30 to 60 seconds per node — because, of course, you \"cleaned up\" old images to save disk, or the node's image cache has already been rotated. Multiply by the number of replicas, add the orchestration time of the swap itself, and your five-minute incident becomes fifteen. Fifteen minutes is the difference between \"momentary instability\" in the postmortem and \"incident reported to customer\".",[12,28386,28387],{},"The fix is trivial and almost no one implements it: keep the N-1 image pre-pulled on the nodes that ran it. Rollback becomes changing the pointed-to tag and restarting — an operation of about 10 seconds per container, dominated by the old container's health check coming back to life.",[12,28389,28390],{},"The more sophisticated version is keeping a snapshot of the job's complete state — not just the image, but environment variables, network configurations, associated secrets, allocated resources. Partial rollback (image only) covers most cases, but doesn't cover a regression introduced in a feature flag or in a connection string. A full snapshot is what separates fast rollback from complete rollback.",[19,28392,28394],{"id":28393},"detail-4-automatic-failure-detection-and-auto-revert","Detail 4 — Automatic failure detection and auto-revert",[12,28396,28397,28398,28400],{},"A common scenario with homemade scripts: the deploy goes up, the new container enters a crash-loop (returns to ",[231,28399,28325],{},", dies in 5 seconds, returns again, dies again). The system sits waiting for someone to see the problem and abort manually. 
If that happens at four in the morning and the alert goes to a Slack no one's watching, the downtime extends until someone wakes up.",[12,28402,28403,28404,28407],{},"What rolling should do is define a ",[231,28405,28406],{},"healthy_deadline"," — an absolute ceiling within which the new container needs to enter and remain in a healthy state. Our default is 300 seconds; five minutes covers apps that take time to initialize (heavy Java apps, apps with cache warm-up) without giving indefinite margin. If the deadline passes and the container isn't healthy, the orchestrator automatically reverts to the previous version. It alerts the team afterward, without urgency — because the system already protected itself.",[12,28409,28410,28411,28413],{},"The practical implementation combines two signals: the restart count on the new container (more than 3 in 60 seconds means a crash-loop) and total elapsed time without a sustained positive health check. Either of the two firing before ",[231,28412,28344],{}," is reached aborts the deploy of that replica and triggers the revert.",[12,28415,28416],{},"A subtle detail: auto-revert only makes sense if detail 3 (previous image pre-pulled) is implemented. Reverting to an image that needs to be downloaded again during auto-revert nullifies the gain.",[19,28418,28420,28421,28424],{"id":28419},"detail-5-max_parallel-1-in-multi-instance-cluster","Detail 5 — ",[231,28422,28423],{},"max_parallel: 1"," in multi-instance cluster",[12,28426,28427],{},"You have five replicas of the same service. It's tempting to swap all five at the same time: parallel deploy, wait for all to be healthy, done. That path has three problems. First, during the window when all are being swapped, all traffic passes through the new version — if it has a bug, 100% of users feel it, with no fallback. Second, the resource usage peak during the swap is 2× (because old and new coexist for a moment), which can blow out node memory and generate cascading OOM. 
Third, you lose the cheap opportunity to detect a regression before it propagates.",[12,28429,28430,28431,28433,28434,28436],{},"Rolling should swap one replica at a time (",[231,28432,28423],{},") — or, in very large clusters, a small fraction (10-25%). Swap the first, wait for it to be healthy, wait for ",[231,28435,28344],{},", swap the second, and so on. Total service capacity is maintained throughout the deploy window: if five replicas handled the traffic before, four new + one old (or vice versa) handle it too.",[12,28438,28439],{},"The tradeoff is time. Ten replicas with 30 seconds each of swap + min_healthy_time = five minutes of total deploy window. It's not fast. In exchange, you gain cheap rollback: if the first new replica fails, the orchestrator stops swapping the others. You're left with nine old + one failed new; discard the failed one and you're back to the previous state without any capacity impact. The cost of a slow deploy is paid back by the safety of not having the entire cluster running a bug before someone notices.",[12,28441,28442,28443,28446],{},"There's an extra knob that helps: ",[231,28444,28445],{},"stagger",", the interval between consecutive swaps. A reasonable default is 30 seconds. That delay lets the metrics and logs of the recently swapped container be collected and evaluated before moving to the next — the minimum window to detect a bug that only appears under real traffic.",[19,28448,28450],{"id":28449},"detail-6-pre-stop-hooks-for-long-running-jobs","Detail 6 — Pre-stop hooks for long-running jobs",[12,28452,28453],{},"This detail specifically affects apps that have async workers — Sidekiq, Celery, Resque, RQ, BullMQ, any background job queue. The worker container picked up a job that takes 30 minutes to process (bulk email sending, report generation, payment processing). In the middle of processing, here comes the SIGTERM from the deploy. 
If the worker doesn't have proper handling, the job is lost — it stays in an intermediate state, or returns to the queue and is processed in duplicate, or simply disappears.",[12,28455,28456],{},"The correct flow is more elaborate than detail 2's. Even before SIGTERM, the orchestrator needs to execute a pre-stop hook that signals the worker to enter drain mode: stop accepting new jobs, but finish those already picked up. The orchestrator then waits (with a configurable timeout — 60 to 300 seconds is the normal range) for the local queue to drain. Only then does it send SIGTERM to the process.",[12,28458,28459,28460,5839,28463,28466],{},"The implementation varies. More sophisticated apps expose a ",[231,28461,28462],{},"\u002Fpause",[231,28464,28465],{},"\u002Fdrain"," endpoint that puts the worker in graceful mode. Simpler apps use a sentinel file — the pre-stop hook creates a file, and the worker checks for the file on each loop and stops picking up new jobs if it exists. In both cases, the key is that the orchestrator needs to wait for confirmation that the local queue drained before sending SIGTERM.",[12,28468,28469],{},"Without this, your async job failure rate during deploys is directly proportional to average job processing time × number of swapped workers. In apps that process payments or email sends, that failure rate becomes a serious problem fast.",[19,28471,28473],{"id":28472},"the-complete-recipe","The complete recipe",[12,28475,28476],{},"The six details compose into a recognizable specification. 
In spec format:",[224,28478,28480],{"className":9008,"code":28479,"language":9010,"meta":229,"style":229},"update:\n  max_parallel: 1\n  min_healthy_time: 10    # 10s sustained healthy\n  healthy_deadline: 300   # 5 min max to be healthy\n  auto_revert: true       # if past deadline, revert\n  stagger: 30             # 30s between swapped replicas\ntasks:\n  - name: web\n    healthcheck:\n      path: \u002Fhealthz\n      interval: 5s\n      timeout: 2s\n      retries: 3\n    lifecycle:\n      pre_stop:\n        timeout: 60         # 60s for worker to drain\n        command: [\"\u002Fbin\u002Fsh\", \"-c\", \"kill -TERM 1; sleep 30\"]\n",[231,28481,28482,28489,28498,28510,28522,28535,28547,28554,28565,28572,28582,28592,28602,28612,28619,28626,28639],{"__ignoreMap":229},[234,28483,28484,28487],{"class":236,"line":237},[234,28485,28486],{"class":9017},"update",[234,28488,9021],{"class":387},[234,28490,28491,28494,28496],{"class":236,"line":244},[234,28492,28493],{"class":9017},"  max_parallel",[234,28495,6562],{"class":387},[234,28497,10041],{"class":251},[234,28499,28500,28503,28505,28507],{"class":236,"line":271},[234,28501,28502],{"class":9017},"  min_healthy_time",[234,28504,6562],{"class":387},[234,28506,1589],{"class":251},[234,28508,28509],{"class":240},"    # 10s sustained healthy\n",[234,28511,28512,28515,28517,28519],{"class":236,"line":415},[234,28513,28514],{"class":9017},"  healthy_deadline",[234,28516,6562],{"class":387},[234,28518,1576],{"class":251},[234,28520,28521],{"class":240},"   # 5 min max to be healthy\n",[234,28523,28524,28527,28529,28532],{"class":236,"line":434},[234,28525,28526],{"class":9017},"  auto_revert",[234,28528,6562],{"class":387},[234,28530,28531],{"class":251},"true",[234,28533,28534],{"class":240},"       # if past deadline, revert\n",[234,28536,28537,28540,28542,28544],{"class":236,"line":459},[234,28538,28539],{"class":9017},"  
stagger",[234,28541,6562],{"class":387},[234,28543,5579],{"class":251},[234,28545,28546],{"class":240},"             # 30s between swapped replicas\n",[234,28548,28549,28552],{"class":236,"line":464},[234,28550,28551],{"class":9017},"tasks",[234,28553,9021],{"class":387},[234,28555,28556,28558,28560,28562],{"class":236,"line":479},[234,28557,9543],{"class":387},[234,28559,10456],{"class":9017},[234,28561,6562],{"class":387},[234,28563,28564],{"class":255},"web\n",[234,28566,28567,28570],{"class":236,"line":484},[234,28568,28569],{"class":9017},"    healthcheck",[234,28571,9021],{"class":387},[234,28573,28574,28577,28579],{"class":236,"line":490},[234,28575,28576],{"class":9017},"      path",[234,28578,6562],{"class":387},[234,28580,28581],{"class":255},"\u002Fhealthz\n",[234,28583,28584,28587,28589],{"class":236,"line":508},[234,28585,28586],{"class":9017},"      interval",[234,28588,6562],{"class":387},[234,28590,28591],{"class":255},"5s\n",[234,28593,28594,28597,28599],{"class":236,"line":529},[234,28595,28596],{"class":9017},"      timeout",[234,28598,6562],{"class":387},[234,28600,28601],{"class":255},"2s\n",[234,28603,28604,28607,28609],{"class":236,"line":535},[234,28605,28606],{"class":9017},"      retries",[234,28608,6562],{"class":387},[234,28610,28611],{"class":251},"3\n",[234,28613,28614,28617],{"class":236,"line":546},[234,28615,28616],{"class":9017},"    lifecycle",[234,28618,9021],{"class":387},[234,28620,28621,28624],{"class":236,"line":552},[234,28622,28623],{"class":9017},"      pre_stop",[234,28625,9021],{"class":387},[234,28627,28628,28631,28633,28636],{"class":236,"line":557},[234,28629,28630],{"class":9017},"        timeout",[234,28632,6562],{"class":387},[234,28634,28635],{"class":251},"60",[234,28637,28638],{"class":240},"         # 60s for worker to drain\n",[234,28640,28641,28644,28646,28649,28651,28654,28656,28659],{"class":236,"line":594},[234,28642,28643],{"class":9017},"        
command",[234,28645,9521],{"class":387},[234,28647,28648],{"class":255},"\"\u002Fbin\u002Fsh\"",[234,28650,571],{"class":387},[234,28652,28653],{"class":255},"\"-c\"",[234,28655,571],{"class":387},[234,28657,28658],{"class":255},"\"kill -TERM 1; sleep 30\"",[234,28660,9527],{"class":387},[12,28662,28663,28664,28666],{},"That configuration — in text, in orchestrator config file, or implicit in deploy code — is the minimum viable safe rolling deploy. Covering the six details is what differentiates serious orchestration from a series of ",[231,28665,1118],{}," commands queued up.",[19,28668,28670],{"id":28669},"who-implements-what-the-honest-version","Who implements what (the honest version)",[12,28672,28673],{},"The grid below covers the most common ecosystem in the self-hosted\u002Fsmall-cluster niche. It's not exhaustive — there are excellent tools left out —, but it's honest about the ones that show up in typical architecture decisions.",[12,28675,28676,28679,28680,28683,28684,21681,28687,28690,28691,28694,28695,21681,28698,28701,28702,21681,28705,28708],{},[27,28677,28678],{},"Kubernetes."," Implements all six when the manifest is complete: ",[231,28681,28682],{},"readinessProbe"," covers detail 1, ",[231,28685,28686],{},"terminationGracePeriodSeconds",[231,28688,28689],{},"preStop"," covers 2 and 6, ",[231,28692,28693],{},"imagePullPolicy"," + local cache covers 3, ",[231,28696,28697],{},"progressDeadlineSeconds",[231,28699,28700],{},"revisionHistoryLimit"," covers 4, ",[231,28703,28704],{},"maxUnavailable",[231,28706,28707],{},"maxSurge"," covers 5. The problem isn't what it does; it's the size of the manifest needed to do this, and the number of fields whose default isn't what you need.",[12,28710,28711,28714,28715,28718,28719,571,28722,571,28725,28728],{},[27,28712,28713],{},"Docker Swarm."," Implements rolling via ",[231,28716,28717],{},"docker service update",". 
The primitives exist (",[231,28720,28721],{},"--update-parallelism",[231,28723,28724],{},"--update-delay",[231,28726,28727],{},"--update-failure-action","), but the defaults are too aggressive — high default parallelism, no mandatory health check, and auto-revert behavior that is opt-in via a specific flag. It needs to be tuned per service; it rarely is.",[12,28730,28731,28734,28735,28737,28738,571,28740,571,28742,571,28744,571,28747,28749],{},[27,28732,28733],{},"Nomad."," Implements natively, with sensible defaults. The ",[231,28736,28486],{}," block has ",[231,28739,3302],{},[231,28741,28344],{},[231,28743,28406],{},[231,28745,28746],{},"auto_revert",[231,28748,28445],{}," — basically the same fields as the spec above, because that nomenclature isn't a coincidence: it's the inheritance of best practices the major orchestrators converged on.",[12,28751,28752,28755,28756,571,28758,571,28761,571,28764,28767],{},[27,28753,28754],{},"HeroCtl."," Implements all six details natively, with configuration practically identical to the spec above. The control plane coordinator election takes about 7 seconds when the leader node goes down, so even a deploy executed in the middle of a coordinator swap is resilient — the new coordinator resumes the health check cycle from where it was. The defaults are ",[231,28757,28423],{},[231,28759,28760],{},"min_healthy_time: 10",[231,28762,28763],{},"healthy_deadline: 300",[231,28765,28766],{},"auto_revert: true",". If you submit a job without configuring anything, those are the values you get.",[12,28769,28770,28773],{},[27,28771,28772],{},"Watchtower."," No. Watchtower is a useful tool for a specific case — automatic container updates in an environment where you accept short downtime and connection loss as the cost of not having a deploy pipeline. In serious production, it fails on five of the six details. 
It's not criticism of the project; it's criticism of using it in the wrong context.",[12,28775,28776,28779],{},[27,28777,28778],{},"Coolify and Dokploy."," Both implement this partially. A health check exists but needs to be configured per application. Connection draining depends on the app catching SIGTERM (shared responsibility). Auto-revert is manual in both. A generic pre-stop hook isn't a first-class primitive. For a single server, it's enough; for a cluster, it's fragile.",[12,28781,28782,28785,28786,28789],{},[27,28783,28784],{},"Homemade scripts."," That ",[231,28787,28788],{},"docker pull && docker stop && docker run"," combination in a shell script managed by cron or triggered by a GitHub webhook. Zero of the six. The honest description of what that is: a deploy with short downtime, not a rolling deploy.",[19,28791,28793],{"id":28792},"the-four-patterns-beyond-rolling","The four patterns beyond rolling",[12,28795,28796],{},"Rolling isn't the only strategy. Depending on requirements and budget, four others make sense in specific contexts.",[12,28798,28799,28801],{},[27,28800,2740],{}," Two complete parallel environments, each with the entire stack. A deploy means bringing up the alternative environment (green) with the new version, validating it in parallel, and switching via DNS or load balancer — a single atomic moment when all traffic changes over. Safer than rolling because you can validate the new version with synthetic traffic before any real user sees it. It costs 2× capacity during the deploy window. Recommended for apps where the cost of a bug in production is high and the cost of extra capacity is low.",[12,28803,28804,28806],{},[27,28805,2746],{}," Send 5% (or 1%, or any small fraction) of traffic to the new version, monitor key metrics for an observation period (15 minutes to a few hours), and scale gradually — 5%, 25%, 50%, 100%. It detects regressions before they affect the bulk of users. 
The prerequisite is reliable, high-sensitivity metrics; without them, canary just delays the deploy without gaining safety. It combines well with rolling: rolling is the mechanism, canary is the promotion strategy.",[12,28808,28809,28812],{},[27,28810,28811],{},"Rainbow."," Several versions coexisting simultaneously in production, with traffic routed by customer key or tenant type. A rare use case, usually in B2B with version-per-contract requirements. Almost never the first option.",[12,28814,28815,28818],{},[27,28816,28817],{},"Recreate."," Stop everything, bring up the new version. Explicit and accepted downtime. Acceptable for internal apps with a maintenance window or for development environments. Surprisingly appropriate in specific cases: a deploy involving a database migration that breaks the schema, or a deploy of an app whose architecture doesn't support two versions coexisting. When recreate is the right choice, it's the right choice — there's no prize for doing rolling for everything.",[19,28820,28822],{"id":28821},"how-to-detect-that-your-rolling-is-bad","How to detect that your rolling is bad",[12,28824,28825],{},"Four direct tests. The first two are metrics you can look at in monitoring you already have; the next two are experiments that need to be done with intent.",[12,28827,28828,28831],{},[27,28829,28830],{},"5xx rate during the deploy window."," If your 5xx rate in production is statistically different from zero during the deploy window, it's bad. \"Statistically different\" means: take 30 consecutive deploys, measure the 5xx rate in the minute preceding each deploy and in the minute of the deploy. If the second mean is higher, there's real error, and the window is cutting connections.",[12,28833,28834,28837],{},[27,28835,28836],{},"p99 latency during deploy."," If p99 rises 3× or more during the deploy window, it's bad. 
A latency spike indicates that requests are being internally retried, or that the load balancer is routing connections to containers that are slow to respond.",[12,28839,28840,28843,28844,28847,28848,28851,28852,28854],{},[27,28841,28842],{},"Forced crash test."," Before a scheduled deploy, force the app in the new container to fail — ",[231,28845,28846],{},"chmod 000"," on the binary, or an environment variable that triggers ",[231,28849,28850],{},"process.exit(1)"," on startup. Does the system automatically revert within ",[231,28853,28406],{},"? If it stays stuck waiting for human intervention, detail 4 is broken.",[12,28856,28857,28860],{},[27,28858,28859],{},"Friday 5 p.m. deploy with real traffic."," The social test. Do a non-trivial deploy at peak hour, on a day no one on the team is actively watching. If your app's metrics during that window are indistinguishable from a random window, your rolling deploy is safe. If intervention was required, or if the status channel registered something, it isn't.",[19,28862,3225],{"id":3224},[12,28864,28865,28868],{},[27,28866,28867],{},"Is Watchtower safe for production?","\nFor a small production environment with explicit tolerance for short downtime and a fallback tool (fast manual rollback), yes. For production with paying customers and SLA expectations, no. Watchtower was made for a different problem.",[12,28870,28871,28877,28878,28880,28881,28883],{},[27,28872,3180,28873,5839,28875,25786],{},[231,28874,355],{},[231,28876,28340],{},"\nThe convention that helps most in practice: ",[231,28879,355],{}," indicates \"the process is alive\" (liveness — used by the orchestrator to decide whether to restart the container) and ",[231,28882,28340],{}," indicates \"I'm ready to serve traffic\" (readiness — used by the orchestrator to decide whether to include the container in the load balancer). For rolling deploy, what matters is readiness; the orchestrator only promotes a container to the traffic pool when readiness returns 200. 
If you have only one endpoint and it responds 200 immediately after the process starts, your readiness isn't measuring what it should.",[12,28885,28886,28892,28893,101],{},[27,28887,28888,28889,28891],{},"How much should ",[231,28890,28344],{}," be?","\nThe typical band is 10 to 30 seconds. Shorter than that (3 seconds, 5 seconds) lets false positives through — apps that respond 200 right at the start but start failing when real traffic arrives. Longer than 60 seconds becomes an operational impediment without proportional gain. If your application has a complex warm-up (in-memory cache, slow third-party connection), the place to cover that is in the health check itself — making it only respond 200 after warm-up — not inflating ",[231,28894,28344],{},[12,28896,28897,28900,28901,28904],{},[27,28898,28899],{},"How do I do pre-stop in a Rails app?","\nRails responds natively to SIGTERM with a graceful shutdown since 5.x — the server stops accepting new connections and finishes the in-flight ones. For Sidekiq, the correct signal is SIGTSTP (pause the workers), followed by SIGTERM after ",[231,28902,28903],{},"Sidekiq.redis { |c| c.llen(\"queue:default\") }"," reaches zero. In practice, the pre-stop hook executes a small script that sends SIGTSTP, polls the queue for up to N seconds, and returns — the orchestrator then does the conventional SIGTERM.",[12,28906,28907,28910],{},[27,28908,28909],{},"Do sticky sessions and rolling deploy go well together?","\nBadly. Sticky sessions mean your architecture is delegating state to the load balancer, and during a rolling deploy that state is discarded when the replica that held the session is swapped. Result: the user is logged out, loses a form midway, or sees inconsistent behavior. 
If you need sticky, that's a symptom — refactor to external state (Redis, database) and rolling deploy becomes trivial.",[12,28912,28913,28916],{},[27,28914,28915],{},"Database migration on rolling deploy?","\nThe practical rule that avoids 90% of the pain: every migration needs to be compatible with the previous version of the app during the deploy window. Adding a nullable column: fine. Removing a column: do it in two releases (release N stops using the column; release N+1 removes it). Renaming a column: likewise, with the new column as a copy. This allows old and new replicas to coexist, which is exactly what rolling presupposes.",[12,28918,28919,28922,28923,5839,28925,28928],{},[27,28920,28921],{},"Can I test rolling deploy locally?","\nYou can. Bring up three local replicas with docker-compose, simulate a load balancer (nginx or caddy) in front, fire ",[231,28924,2490],{},[231,28926,28927],{},"wrk"," with sustained traffic, and execute a swap script the way your pipeline would execute it in production. Measure 5xx during the window. It's an imperfect test (the traffic is synthetic, there's a single node, there's no real network between nodes), but it catches the gross bugs in details 1, 2, and 3 before they leak to production.",[19,28930,3309],{"id":3308},[12,28932,28933],{},"Safe rolling deploy isn't a button; it's a set of six coordinated behaviors, and most tools that promise \"zero downtime\" cover three or four of them. The practical difference becomes visible Friday 5 p.m., or in Wednesday morning's incident, or in the postmortem where someone asks why three users reported errors during the last deploy.",[12,28935,28936],{},"The recipe is above. If your current tool covers the six, great. If not, either you take on the cost of covering the missing ones manually, or you swap for something that covers them natively.",[12,28938,28939],{},"HeroCtl covers them natively. The Community plan is permanently free, with no server or job limits, and with the rolling deploy configuration described here as the default. 
Business plan adds SSO, RBAC, audit, and SLA support for teams with formal platform requirements. Enterprise plan adds source code escrow and continuity contract.",[12,28941,24125],{},[224,28943,28944],{"className":226,"code":2948,"language":228,"meta":229,"style":229},[231,28945,28946],{"__ignoreMap":229},[234,28947,28948,28950,28952,28954,28956],{"class":236,"line":237},[234,28949,1220],{"class":247},[234,28951,2957],{"class":251},[234,28953,2960],{"class":255},[234,28955,2963],{"class":383},[234,28957,2966],{"class":247},[12,28959,28960,28961,28963,28964,2402,28967,101],{},"If you want to see the other sides of the topic, read ",[3336,28962,21724],{"href":6545}," for the product context, and in the next posts we'll cover ",[3336,28965,28966],{"href":3343},"Docker deploy in production, from compose to cluster",[3336,28968,28969],{"href":11724},"database backup strategies in cluster for 3 a.m.",[12,28971,28972],{},"The intent remains the same: container orchestration, without ceremony — and without theater.",[3350,28974,28975],{},"html pre.shiki code .sPWt5, html code.shiki .sPWt5{--shiki-default:#7EE787}html pre.shiki code .sZEs4, html code.shiki .sZEs4{--shiki-default:#E6EDF3}html pre.shiki code .sFSAA, html code.shiki .sFSAA{--shiki-default:#79C0FF}html pre.shiki code .sH3jZ, html code.shiki .sH3jZ{--shiki-default:#8B949E}html pre.shiki code .s9uIt, html code.shiki .s9uIt{--shiki-default:#A5D6FF}html .default .shiki span {color: var(--shiki-default);background: var(--shiki-default-bg);font-style: var(--shiki-default-font-style);font-weight: var(--shiki-default-font-weight);text-decoration: var(--shiki-default-text-decoration);}html .shiki span {color: var(--shiki-default);background: var(--shiki-default-bg);font-style: var(--shiki-default-font-style);font-weight: var(--shiki-default-font-weight);text-decoration: var(--shiki-default-text-decoration);}html pre.shiki code .sQhOw, html code.shiki .sQhOw{--shiki-default:#FFA657}html pre.shiki code .suJrU, html 
code.shiki .suJrU{--shiki-default:#FF7B72}",{"title":229,"searchDepth":244,"depth":244,"links":28977},[28978,28979,28980,28981,28982,28984,28985,28986,28987,28988,28989,28990],{"id":28318,"depth":244,"text":28319},{"id":28356,"depth":244,"text":28357},{"id":28376,"depth":244,"text":28377},{"id":28393,"depth":244,"text":28394},{"id":28419,"depth":244,"text":28983},"Detail 5 — max_parallel: 1 in multi-instance cluster",{"id":28449,"depth":244,"text":28450},{"id":28472,"depth":244,"text":28473},{"id":28669,"depth":244,"text":28670},{"id":28792,"depth":244,"text":28793},{"id":28821,"depth":244,"text":28822},{"id":3224,"depth":244,"text":3225},{"id":3308,"depth":244,"text":3309},"2025-12-04","Swapping containers without downtime sounds simple — pull new image, kill old, start new. Works until the first Friday at 5 p.m. The 6 details that separate real rolling deploy from theater.",{},{"title":28304,"description":28992},{"loc":3338},"en\u002Fblog\u002Fsafe-rolling-deploys-why-yours-might-not-be",[28998,1526,3378,3391,16724],"rolling-deploy","ZJeMejStmPw7wf16dRstpRv7ESGeIw0bMGCEwHor6IA",{"id":29001,"title":29002,"author":7,"body":29003,"category":6382,"cover":3379,"date":29751,"description":29752,"draft":3382,"extension":3383,"lastReviewed":3379,"meta":29753,"navigation":411,"path":14874,"readingTime":3386,"seo":29754,"sitemap":29755,"stem":29756,"tags":29757,"__hash__":29759},"blog_en\u002Fen\u002Fblog\u002Fkubernetes-alternative-self-hosted-paas.md","A Kubernetes alternative in 2026: self-hosted PaaS for Brazilian 
teams",{"type":9,"value":29004,"toc":29731},[29005,29008,29011,29014,29017,29021,29024,29050,29053,29078,29081,29085,29088,29091,29114,29121,29131,29137,29140,29144,29147,29174,29180,29183,29186,29190,29193,29196,29199,29202,29205,29207,29210,29213,29216,29219,29223,29226,29229,29232,29235,29239,29242,29245,29248,29251,29255,29258,29261,29264,29267,29270,29274,29277,29280,29294,29297,29300,29303,29307,29310,29316,29322,29328,29334,29337,29341,29344,29385,29388,29392,29395,29572,29575,29579,29582,29588,29594,29600,29604,29607,29613,29619,29624,29630,29635,29641,29645,29651,29657,29663,29669,29675,29681,29687,29689,29692,29695,29698,29714,29717,29726,29729],[12,29006,29007],{},"The technical literature on container orchestration is mostly American. And mostly assumes a set of premises that doesn't hold for Brazil in 2026.",[12,29009,29010],{},"The median American SRE salary hovers around US$150k per year. To the CFO in San Francisco, that's \"an expensive person\". In São Paulo, at today's exchange rate, it equals three full people. The math comes out differently. So does the conclusion.",[12,29012,29013],{},"US$140 per month of managed platform cost becomes noise to the Mountain View CFO. To the founder in Belo Horizonte, it is one tenth of the salary of the mid-level dev they just hired. The same architectural decision — \"let's just pay for the managed thing and move on\" — has completely different financial weight on the two sides of the hemisphere.",[12,29015,29016],{},"This post recalibrates the decision for Brazilian reality. It is not a manifesto against Kubernetes. It is a spreadsheet.",[19,29018,29020],{"id":29019},"the-brazilian-budget-reality-no-detours","The Brazilian budget reality, no detours",[12,29022,29023],{},"Before comparing platforms, it is worth aligning current numbers. 
The ranges below are Brazilian market medians as of April 2026, considering a remote B2B SaaS team in a metropolitan region.",[2734,29025,29026,29032,29038,29044],{},[70,29027,29028,29031],{},[27,29029,29030],{},"Mid-level full-stack dev",": R$10k to R$15k CLT (with charges, total cost ~1.8× to the company). PJ in the R$15k to R$20k monthly range.",[70,29033,29034,29037],{},[27,29035,29036],{},"Senior full-stack dev",": R$18k to R$28k PJ.",[70,29039,29040,29043],{},[27,29041,29042],{},"SRE with serious Kubernetes in production",": R$25k to R$40k PJ. It's rare to find one below R$20k — people with Kubernetes on the resume learned fast to charge for what they learned.",[70,29045,29046,29049],{},[27,29047,29048],{},"24×7 on-call",": two SREs minimum, or you burn the only one in three months. The math of sustainable on-call doesn't change by country: nobody can be the sole person on the pager for longer than that.",[12,29051,29052],{},"Common cloud hosting in Brazil in 2026 follows four main patterns:",[2734,29054,29055,29060,29066,29072],{},[70,29056,29057,29059],{},[27,29058,14354],{},": no region in Brazil, the closest datacenter is New York. São Paulo latency in the 120 ms range. Pricing in dollars, simple and predictable. Popular for small teams.",[70,29061,29062,29065],{},[27,29063,29064],{},"AWS São Paulo (sa-east-1)",": good latency for Brazilian clients, but pricing 30 to 40% above us-east-1. Fewer services available than in the American regions.",[70,29067,29068,29071],{},[27,29069,29070],{},"Hetzner (Germany)",": 30 to 50% cheaper than AWS. ~200 ms latency to São Paulo. Good for workloads that are not latency-critical.",[70,29073,29074,29077],{},[27,29075,29076],{},"Brazilian provider (Locaweb, UOL Host, KingHost, Magalu Cloud)",": billing in real, support in Portuguese, national invoice for Brazilian accounting. Per-GB and per-vCPU pricing is usually worse than international cloud, but zero FX exposure.",[12,29079,29080],{},"Above all of that hovers the exchange rate. 
The dollar in 2026 swings around R$5 with annual volatility of roughly 10%. That means a US$200 per month cost in January can become US$220-equivalent in October without you raising anything. Whoever budgets in real and pays in dollars carries silent FX risk, and that risk is proportional to the size of the bill.",[19,29082,29084],{"id":29083},"the-managed-kubernetes-bill-for-a-small-brazilian-team","The managed Kubernetes bill for a small Brazilian team",[12,29086,29087],{},"Let's put the most common scenario on the table. B2B startup in São Paulo, 4 devs, first product in production, 50 paying clients, monthly recurring revenue in real.",[12,29089,29090],{},"The \"industry standard\" choice is EKS in the São Paulo region. The minimum bill:",[2734,29092,29093,29096,29099,29102,29105],{},[70,29094,29095],{},"EKS cluster: US$73\u002Fmonth.",[70,29097,29098],{},"Egress gateway (NAT): ~US$40\u002Fmonth.",[70,29100,29101],{},"Load balancer (ALB): ~US$25\u002Fmonth.",[70,29103,29104],{},"Egress traffic: variable, conservative US$30\u002Fmonth.",[70,29106,29107,29109,29110,29113],{},[27,29108,16913],{},": ~US$170\u002Fmonth = ",[27,29111,29112],{},"~R$850\u002Fmonth"," at R$5\u002FUSD.",[12,29115,29116,29117,29120],{},"That is platform alone. The machines where the application runs are extra. On 4 m5.large nodes in the São Paulo region, another ~US$400\u002Fmonth. Realistic total for small production: ~US$570\u002Fmonth = ",[27,29118,29119],{},"R$2,850\u002Fmonth"," in infra alone.",[12,29122,29123,29124,29127,29128,101],{},"And the real cost is still missing: the team. Two junior-mid SREs CLT, R$30k\u002Fmonth each with charges, multiplied by 13 (including vacation and 13th salary): ",[27,29125,29126],{},"R$780k\u002Fyear"," in platform payroll alone. 
As PJ, two seniors in the same range: ",[27,29129,29130],{},"R$600k\u002Fyear",[12,29132,29133,29134,101],{},"Conservative year 1 total, platform + payroll: ",[27,29135,29136],{},"R$650k to R$820k before the first Enterprise client pays",[12,29138,29139],{},"For reference: 4 full-stack devs at R$15k PJ deliver product for R$780k\u002Fyear. The \"managed Kubernetes + 2 SREs\" choice costs almost the same as having the entire product team duplicated. Only delivering platform — not delivering features to the client.",[19,29141,29143],{"id":29142},"the-self-hosted-bill-in-brazil","The self-hosted bill in Brazil",[12,29145,29146],{},"Same scenario, different decision. 3 to 4 Linux VPS with Docker, replicated control plane, integrated router, automatic certificates.",[2734,29148,29149,29155,29161,29168,29171],{},[70,29150,29151,29152,101],{},"4 DigitalOcean VPS (4 vCPU, 8 GB RAM each): US$48\u002Fmonth × 4 = US$192\u002Fmonth = ",[27,29153,29154],{},"R$960\u002Fmonth",[70,29156,29157,29158,101],{},"Hetzner alternative: US$32\u002Fmonth × 4 = US$128\u002Fmonth = ",[27,29159,29160],{},"R$640\u002Fmonth",[70,29162,29163,29164,29167],{},"Magalu Cloud (BR) alternative: R$200\u002Fmonth × 4 = ",[27,29165,29166],{},"R$800\u002Fmonth",", in real, without FX volatility.",[70,29169,29170],{},"Backup S3 São Paulo region: ~US$15\u002Fmonth = R$75\u002Fmonth.",[70,29172,29173],{},"Full-stack dev time looking after the platform: 20% of one R$15k PJ person = R$3k\u002Fmonth allocated.",[12,29175,29176,29177,101],{},"Year 1 total: ",[27,29178,29179],{},"R$50k to R$72k",[12,29181,29182],{},"Difference vs. managed Kubernetes scenario: 9 to 13 times less. That delta — somewhere between R$580k and R$770k per year — becomes two additional full-stack devs, or one senior designer, or three months of runway. It is the margin between closing the year in profit and closing in loss.",[12,29184,29185],{},"The honest objection to this math is: \"but it'll take work to operate\". 
The answer must be honest too. Yes, it will take some work. The right question is whether the extra work fits in 20% of one team member's time. If the answer is yes — and at the 4-server, 16-container range it usually is — the ROI is direct.",[19,29187,29189],{"id":29188},"brazilian-hosting-in-practice-a-panorama-as-of-april-2026","Brazilian hosting in practice: a panorama as of April 2026",[12,29191,29192],{},"Each provider that matters in the Brazilian context has specific trade-offs. Short summary for each.",[368,29194,29064],{"id":29195},"aws-sao-paulo-sa-east-1",[12,29197,29198],{},"The oldest region on Brazilian soil, active since 2011. Excellent latency for São Paulo clients (1 to 5 ms intra-region, 30 to 50 ms to the South and Center-West). Pricing 30 to 40% above us-east-1 — an m5.large instance that costs US$70\u002Fmonth in Virginia costs US$95\u002Fmonth in São Paulo. Not every AWS service is available: some new products take 6 to 18 months to arrive in sa-east-1.",[12,29200,29201],{},"Good for: Brazilian B2B product with clients demanding data residency in contract, team large enough to absorb AWS complexity, company with revenue predominantly in real.",[12,29203,29204],{},"Watch out for: cross-region data transfer costs (expensive). Whoever replicates database between sa-east-1 and us-east-1 pays dearly for egress.",[368,29206,14354],{"id":14936},[12,29208,29209],{},"No region in Brazil. Closest datacenter is New York (NYC1, NYC3) or Toronto. Average latency to São Paulo in the 110 to 130 ms range — perceptible in interactive applications, irrelevant for a JSON API with 200 ms of its own processing.",[12,29211,29212],{},"Simple pricing, in USD, predictable. 4 vCPU + 8 GB RAM VPS for US$48\u002Fmonth. 
No billing surprises — you know on day 1 what you'll spend on day 30.",[12,29214,29215],{},"Good for: small team with tolerable latency (B2B SaaS, dashboards, APIs), startup prioritizing billing simplicity, personal project that becomes a product.",[12,29217,29218],{},"Watch out for: user-facing applications where 120 ms extra latency costs conversion (e-commerce, games, video chat). FX: everything in dollars.",[368,29220,29222],{"id":29221},"hetzner-germany-finland","Hetzner (Germany, Finland)",[12,29224,29225],{},"European provider with the best price-performance ratio on the market in 2026. 4 vCPU + 8 GB VPS for ~US$15\u002Fmonth — half the price of DigitalOcean. Dedicated servers for US$50\u002Fmonth with 64 GB of RAM.",[12,29227,29228],{},"Latency to São Paulo in the 200 to 230 ms range. Unfeasible for any synchronous interaction with a Brazilian user — perceptible as slowness. Feasible for background workloads, batch processing, data analysis, staging environments nobody uses in real time.",[12,29230,29231],{},"Good for: non-customer-facing workloads, build infrastructure, secondary database for analytics, archiving.",[12,29233,29234],{},"Watch out for: anything with a human waiting on the other side.",[368,29236,29238],{"id":29237},"brazilian-providers-locaweb-uol-host-kinghost","Brazilian providers (Locaweb, UOL Host, KingHost)",[12,29240,29241],{},"Billing in real, Brazilian NF-e, commercial Portuguese support. For Brazilian accounting, simplifies bookkeeping. For a company that needs to demonstrate a national supplier for public audit or public contract, it is a requirement.",[12,29243,29244],{},"Per vCPU and GB pricing tends to be 20 to 50% worse than international cloud. 
Product offering is also more limited — more emphasis on traditional hosting, less on modern primitives like block storage, snapshot and consistent programmatic API.",[12,29246,29247],{},"Good for: company with contractual obligation for a national supplier, team that values Portuguese support during incidents, public project with Brazilian supplier rules.",[12,29249,29250],{},"Watch out for: APIs and tooling generally less polished. Documentation in Portuguese is an advantage; incomplete Portuguese documentation becomes a trap when the problem is specific.",[368,29252,29254],{"id":29253},"emerging-national-cloud-magalu-cloud-dloud-similar","Emerging national cloud (Magalu Cloud, dloud, similar)",[12,29256,29257],{},"The newest category, maturing throughout 2025 and 2026. Brazilian operators offering VPS, object storage and basic managed services with pricing in real and a datacenter on national soil.",[12,29259,29260],{},"Main draw: LGPD with data explicitly remaining in Brazil, no international transfer to demonstrate at audit. Billing in real eliminates FX exposure.",[12,29262,29263],{},"Maturity still under construction. Service catalog much more limited than AWS São Paulo. Documentation sometimes incomplete. Intra-Brazil latency is good by construction.",[12,29265,29266],{},"Good for: company where LGPD with local data residency is a competitive differentiator, Brazilian public projects, teams wanting zero FX exposure.",[12,29268,29269],{},"Watch out for: catalog immaturity. Missing advanced features that have existed in AWS for five years.",[19,29271,29273],{"id":29272},"lgpd-and-self-hosted","LGPD and self-hosted",[12,29275,29276],{},"The General Data Protection Law (Law 13.709\u002F2018) has been in force since September 2020 and the active enforcement phase by ANPD began in 2022. By 2026 there are already substantial fines applied and initial case law.",[12,29278,29279],{},"LGPD does not require explicit national residency of data. 
But it requires:",[2734,29281,29282,29285,29288,29291],{},[70,29283,29284],{},"Adequate treatment of personal data (documented legal basis, clear purpose, justified retention).",[70,29286,29287],{},"In case of international transfer, contractual safeguards with the receiving entity.",[70,29289,29290],{},"Capacity to demonstrate internal audit on incident response: who accessed what, when, for what purpose.",[70,29292,29293],{},"Incident logging and ANPD notification within a reasonable timeframe (current interpretation: 48 to 72 hours for serious incidents).",[12,29295,29296],{},"Self-hosted on a provider with a region in Brazil simplifies three things. First: no international transfer to demonstrate — data never left national soil. Second: the cluster is yours, so the record of who accessed what is internally controllable, without depending on third parties. Third: managed backup within the cluster itself reduces exposure surface with additional vendors.",[12,29298,29299],{},"The HeroCtl Business Edition includes detailed audit that records each administrative action by user, with timestamp and context. In incident response, that record becomes evidence of operational good faith.",[12,29301,29302],{},"LGPD is not a universal argument for self-hosted — global company with revenue in dollars and clients in the US already has to worry about GDPR, with American framework, and contractual safeguards in many directions. For these, complexity already exists. For a Brazilian company serving Brazilian clients, simplifying the legal data topology is a concrete gain.",[19,29304,29306],{"id":29305},"when-managed-kubernetes-makes-sense-for-a-brazilian-team","When managed Kubernetes makes sense for a Brazilian team",[12,29308,29309],{},"Not never. There are scenarios where the math flips.",[12,29311,29312,29315],{},[27,29313,29314],{},"Company with revenue predominantly in dollars and global clients",": the platform's cost in USD is an expense in hard currency, not FX risk. 
The volatility becomes a natural hedge, not exposure.",[12,29317,29318,29321],{},[27,29319,29320],{},"International compliance already mapped on Kubernetes",": companies in the SOC 2 Type II or ISO 27001 process often have consultants and auditors who know Kubernetes. The path is shorter than presenting an alternative stack to each auditor — even when the alternative stack is technically equivalent or superior.",[12,29323,29324,29327],{},[27,29325,29326],{},"Platform team with 3 or more dedicated people",": the Kubernetes ecosystem rewards operational scale. With 3+ dedicated engineers, you have capacity to extract real value from specialized operators, service mesh, advanced observability. Below that, all of it becomes weight.",[12,29329,29330,29333],{},[27,29331,29332],{},"Workload above 50 servers",": in this range, the colossus's primitives start to pay off. Real multi-tenancy, namespace isolation, cross-cluster federation — things nobody needs at 4 servers, but that matter at 50.",[12,29335,29336],{},"Otherwise: probably overkill for the Brazilian context. The right question is not \"is Kubernetes good?\" — it is \"is Kubernetes good for the size of the problem I have today, and for the next 18 months?\". For a pre-Series A Brazilian startup, the honest answer is usually no.",[19,29338,29340],{"id":29339},"typical-recommended-stack-for-a-small-brazilian-team","Typical recommended stack for a small Brazilian team",[12,29342,29343],{},"Practical recipe for those starting from scratch or migrating from an expensive platform.",[2734,29345,29346,29352,29358,29363,29368,29374,29380],{},[70,29347,29348,29351],{},[27,29349,29350],{},"3 to 4 Linux VPS with Docker",": DigitalOcean, Hetzner or Brazilian provider, depending on the latency and FX trade-off that makes sense. R$500 to R$1,000\u002Fmonth range.",[70,29353,29354,29357],{},[27,29355,29356],{},"HeroCtl Community",": free, no server limit. 
Configures the cluster with control plane replicated across 3 or more servers, so loss of any single server doesn't take the cluster down.",[70,29359,29360,29362],{},[27,29361,193],{},": Postgres as a job in the cluster itself for projects where compliance allows it. For cases with strong regulatory requirements, managed RDS São Paulo, connected via VPN or authorized IP.",[70,29364,29365,29367],{},[27,29366,11364],{},": automatic Let's Encrypt certificates, ingress without extra operator setup.",[70,29369,29370,29373],{},[27,29371,29372],{},"Metrics and logs",": internal jobs of the system itself. No Datadog, no external New Relic — these charge in USD at US$15 to US$31 per host per month, which for 4 hosts already exceeds R$600\u002Fmonth in observability alone.",[70,29375,29376,29379],{},[27,29377,29378],{},"Backup",": weekly rotation to an S3 bucket in São Paulo or CloudFlare R2 (R2 has free egress, which makes a difference for restore).",[70,29381,29382,29384],{},[27,29383,12126],{},": Cloudflare free or Hostinger DNS for the Brazilian case. Both have stable programmatic APIs.",[12,29386,29387],{},"Operational total of this stack in the R$600 to R$1,100\u002Fmonth infra range, plus 10 to 20% of one team member's time. Supports zero to a few hundred thousand requests per day without needing to rethink architecture.",[19,29389,29391],{"id":29390},"comparison-table-adapted-to-brazil","Comparison table adapted to Brazil",[12,29393,29394],{},"Four paths, ten criteria. 
No column without caveats.",[119,29396,29397,29415],{},[122,29398,29399],{},[125,29400,29401,29403,29406,29409,29412],{},[128,29402,2982],{},[128,29404,29405],{},"Managed K8s (EKS-SP)",[128,29407,29408],{},"External managed PaaS (Render\u002FRailway)",[128,29410,29411],{},"Simple self-hosted (Coolify)",[128,29413,29414],{},"HA self-hosted (HeroCtl)",[141,29416,29417,29434,29448,29463,29478,29492,29507,29520,29537,29553],{},[125,29418,29419,29422,29425,29428,29431],{},[146,29420,29421],{},"Minimum platform cost\u002Fmonth",[146,29423,29424],{},"~R$850",[146,29426,29427],{},"R$0 to R$200 (free tier + first apps)",[146,29429,29430],{},"R$200 to R$500 (1 VPS)",[146,29432,29433],{},"R$500 to R$1,000 (3-4 VPS)",[125,29435,29436,29439,29441,29443,29446],{},[146,29437,29438],{},"Currency",[146,29440,15166],{},[146,29442,15166],{},[146,29444,29445],{},"Mixed (VPS USD or BRL)",[146,29447,29445],{},[125,29449,29450,29452,29455,29458,29461],{},[146,29451,23838],{},[146,29453,29454],{},"1-5 ms",[146,29456,29457],{},"100-200 ms (US servers)",[146,29459,29460],{},"depends on VPS",[146,29462,29460],{},[125,29464,29465,29467,29470,29473,29476],{},[146,29466,16398],{},[146,29468,29469],{},"1-2 dedicated SREs",[146,29471,29472],{},"0.1 dev part-time",[146,29474,29475],{},"1 dev part-time",[146,29477,29475],{},[125,29479,29480,29482,29484,29487,29490],{},[146,29481,16324],{},[146,29483,3064],{},[146,29485,29486],{},"Yes (managed by provider)",[146,29488,29489],{},"No — single-server",[146,29491,3064],{},[125,29493,29494,29497,29500,29502,29505],{},[146,29495,29496],{},"LGPD with data on BR soil",[146,29498,29499],{},"Yes (sa-east-1)",[146,29501,25609],{},[146,29503,29504],{},"Yes, if VPS is BR",[146,29506,29504],{},[125,29508,29509,29511,29513,29515,29517],{},[146,29510,14293],{},[146,29512,3061],{},[146,29514,3058],{},[146,29516,4351],{},[146,29518,29519],{},"Yes (Business\u002FEnterprise)",[125,29521,29522,29525,29528,29531,29534],{},[146,29523,29524],{},"Risk of terms 
change",[146,29526,29527],{},"Medium (provider has changed policy before)",[146,29529,29530],{},"High (provider may end free tier)",[146,29532,29533],{},"Low (open-source)",[146,29535,29536],{},"Low (price contractually frozen)",[125,29538,29539,29542,29544,29547,29550],{},[146,29540,29541],{},"Config lines for app+TLS+ingress",[146,29543,3047],{},[146,29545,29546],{},"5 to 10 (UI)",[146,29548,29549],{},"20 to 30 (UI)",[146,29551,29552],{},"~50 (file)",[125,29554,29555,29558,29561,29564,29570],{},[146,29556,29557],{},"Path to grow to 50 servers",[146,29559,29560],{},"Direct",[146,29562,29563],{},"Costly (price grows linearly)",[146,29565,29566,29567,16047],{},"Redo architecture (",[3336,29568,29569],{"href":16689},"leaving Coolify single-server",[146,29571,29560],{},[12,29573,29574],{},"The minimum platform cost line by itself doesn't decide. The line that most often decides is the minimum team to operate — because team costs, on average, 100 times more than platform for the operation size we're discussing.",[19,29576,29578],{"id":29577},"when-not-to-go-self-hosted","When NOT to go self-hosted",[12,29580,29581],{},"The thesis of this post is direct, but there are three scenarios where the recommendation flips.",[12,29583,29584,29587],{},[27,29585,29586],{},"Team of 1 to 2 devs with no time at all to look after a server",": in that case, Render, Railway or Heroku are the right answer. You pay in USD, but trade time (which you don't have) for money (which you have enough of to cover at this stage). When the team grows to 4+ devs and the bill becomes uncomfortable, migrating is feasible. For now, focus on the product.",[12,29589,29590,29593],{},[27,29591,29592],{},"Application whose platform cost is trivial vs. revenue",": B2B SaaS with R$200k MRR and platform bill of R$3k\u002Fmonth. Don't optimize prematurely. 
Use the team's time to build features that increase MRR, not to save 0.5% of revenue on infra.",[12,29595,29596,29599],{},[27,29597,29598],{},"Compliance that requires a nominally listed and third-party audited supplier",": some specific frameworks (government, regulated healthcare) require the supplier to be on a pre-approved list. These lists change slowly. If you need the AWS or Microsoft name in the contract, self-hosted on a generic VPS doesn't qualify. Wait for HeroCtl to reach formal lists or use the stack already listed.",[19,29601,29603],{"id":29602},"heroctl-in-the-brazilian-context-specifically","HeroCtl in the Brazilian context specifically",[12,29605,29606],{},"The discussion so far has been about architecture. Worth closing with what HeroCtl does specifically for the Brazilian context.",[12,29608,29609,29612],{},[27,29610,29611],{},"Permanent free Community plan",", no USD license to budget, no subscription, no server or job limit. Runs the entire stack described above — real high availability, router, automatic certificates, metrics, logs.",[12,29614,29615,29618],{},[27,29616,29617],{},"Runs on any Linux VPS with Docker."," Any Brazilian or international provider works. The cluster doesn't know or care if it is on DigitalOcean New York, AWS São Paulo, Hetzner Germany, Magalu Cloud Brazil or mixing providers. The primitive is Linux operating system + Docker, and that runs everywhere.",[12,29620,29621,29623],{},[27,29622,14293],{}," in Business and Enterprise editions. Product and support team aligned with the Brazilian time zone — your 2pm incident doesn't become \"we'll look at it tomorrow morning\".",[12,29625,29626,29629],{},[27,29627,29628],{},"Business and Enterprise pricing published in real",", contractually frozen for existing clients. No retroactive increase clause, no mid-flight license change like happened with a competing vendor in 2023. 
The contract you sign today is the contract that holds.",[12,29631,29632,29634],{},[27,29633,8333],{}," from the start, not as a later translation. Errors, log messages, admin panel, everything in Brazilian Portuguese as a first language.",[12,29636,29637,29640],{},[27,29638,29639],{},"No mandatory phone-home",", no remote kill-switch. Once installed, your cluster works offline indefinitely. Enterprise editions include source code escrow: if the company behind the product ceases operations, the code is delivered to paying clients via a third-party custodian.",[19,29642,29644],{"id":29643},"questions-we-get-from-brazilian-teams","Questions we get from Brazilian teams",[12,29646,29647,29650],{},[27,29648,29649],{},"Is managed Kubernetes in São Paulo good enough for LGPD?","\nTechnically yes — the sa-east-1 region keeps data on national soil. Operationally, it depends. Whoever uses managed services (RDS, S3, CloudWatch) needs to configure each one to stay exclusively in sa-east-1, and demonstrate that at audit. Self-hosted on a Brazilian VPS simplifies the demonstration: the entire cluster has one IP address, in a known datacenter, and that's it.",[12,29652,29653,29656],{},[27,29654,29655],{},"Can I run HeroCtl on a small Brazilian VPS (Hostinger, KingHost, Locaweb)?","\nYes. The minimum requirement is Linux with Docker and 1 GB of RAM per server (recommended 2 GB+). Works on R$50\u002Fmonth VPS from a Brazilian provider for a test cluster, or for development environment. For sustainable production, 4 GB+ per server is recommended.",[12,29658,29659,29662],{},[27,29660,29661],{},"How much RAM and CPU does it consume on an R$50\u002Fmonth server?","\nThe control plane occupies between 200 and 400 MB of RAM per server. On a 1 GB VPS, ~600 MB remain for workload. On a 2 GB VPS, ~1.6 GB remain. 
For reference, the control plane of a managed Kubernetes version starts at ~700 MB per master node before any application starts, and rarely runs on a VPS smaller than 4 GB.",[12,29664,29665,29668],{},[27,29666,29667],{},"Is there Portuguese support?","\nYes, in Business and Enterprise. Community uses documentation and forum in Portuguese, no response SLA. Business has direct commercial Portuguese support with response SLA. Enterprise adds extended hours and a dedicated channel.",[12,29670,29671,29674],{},[27,29672,29673],{},"And to scale to 50+ servers in the future?","\nThe tested application range is from 1 to 500 servers. Above 500, the Kubernetes ecosystem offers tools we don't have yet. Between 50 and 500, HeroCtl runs comfortably — doesn't require redoing architecture as happens when leaving Coolify single-server for real HA. The migration is continuation, not restart.",[12,29676,29677,29680],{},[27,29678,29679],{},"What about 24×7 support in Brazilian business hours?","\nEnterprise includes 24×7 support with a real person responding in Portuguese. For a Brazilian team that has an incident at 11pm on a Wednesday, it is the operational equivalent of the support American clients receive in American hours — only for Brasília time zone.",[12,29682,29683,29686],{},[27,29684,29685],{},"Can I use real to pay?","\nYes. Business and Enterprise are billed in real, with Brazilian NF-e. No FX exposure, no international card conversion, no IOF on the invoice. Brazilian accounting processes it like any other national supplier.",[19,29688,3309],{"id":3308},[12,29690,29691],{},"The right question for a Brazilian team in 2026 isn't \"what's the best orchestrator?\". It is \"what orchestrator makes sense on my cost spreadsheet, in my time zone, with my team, serving my client, under the law that governs my data?\".",[12,29693,29694],{},"The answer varies. For some companies, it is managed Kubernetes in sa-east-1. 
For others, it is Render or Railway paying in USD while MRR justifies. For most pre-Series A Brazilian startups — budget in real, lean team, Brazilian client — the answer is self-hosted on VPS, with a truly replicated control plane.",[12,29696,29697],{},"For that case, we built HeroCtl. Installation:",[224,29699,29700],{"className":226,"code":2948,"language":228,"meta":229,"style":229},[231,29701,29702],{"__ignoreMap":229},[234,29703,29704,29706,29708,29710,29712],{"class":236,"line":237},[234,29705,1220],{"class":247},[234,29707,2957],{"class":251},[234,29709,2960],{"class":255},[234,29711,2963],{"class":383},[234,29713,2966],{"class":247},[12,29715,29716],{},"In 5 minutes you have a cluster with 3 servers, replicated control plane, integrated router and automatic Let's Encrypt certificates. From there, it is just submitting applications.",[12,29718,29719,29720,29722,29723,29725],{},"For additional context, read ",[3336,29721,6546],{"href":6545}," (the story of the gap that none of the three existing alternatives filled) and ",[3336,29724,15781],{"href":15780}," (the general version, not Brazil-specific, of the same argument).",[12,29727,29728],{},"Container orchestration, without ceremony. 
In real.",[3350,29730,4376],{},{"title":229,"searchDepth":244,"depth":244,"links":29732},[29733,29734,29735,29736,29743,29744,29745,29746,29747,29748,29749,29750],{"id":29019,"depth":244,"text":29020},{"id":29083,"depth":244,"text":29084},{"id":29142,"depth":244,"text":29143},{"id":29188,"depth":244,"text":29189,"children":29737},[29738,29739,29740,29741,29742],{"id":29195,"depth":271,"text":29064},{"id":14936,"depth":271,"text":14354},{"id":29221,"depth":271,"text":29222},{"id":29237,"depth":271,"text":29238},{"id":29253,"depth":271,"text":29254},{"id":29272,"depth":244,"text":29273},{"id":29305,"depth":244,"text":29306},{"id":29339,"depth":244,"text":29340},{"id":29390,"depth":244,"text":29391},{"id":29577,"depth":244,"text":29578},{"id":29602,"depth":244,"text":29603},{"id":29643,"depth":244,"text":29644},{"id":3308,"depth":244,"text":3309},"2025-11-25","Brazilian teams operate under different constraints: budget in real, hosting in DigitalOcean or AWS São Paulo, LGPD instead of GDPR. 
How that changes the orchestrator choice.",{},{"title":29002,"description":29752},{"loc":14874},"en\u002Fblog\u002Fkubernetes-alternative-self-hosted-paas",[14939,20384,24167,29758,6394,15807],"lgpd","GK0Yk5CwUXxnD3yshlmuHJ_N73Nfavcxq7M8C7SrCDY",{"id":29761,"title":29762,"author":7,"body":29763,"category":8756,"cover":3379,"date":30360,"description":30361,"draft":3382,"extension":3383,"lastReviewed":3379,"meta":30362,"navigation":411,"path":19749,"readingTime":3386,"seo":30363,"sitemap":30364,"stem":30365,"tags":30366,"__hash__":30368},"blog_en\u002Fen\u002Fblog\u002Fself-hosted-heroku-2026.md","Self-hosted Heroku in 2026: the state of the segment",{"type":9,"value":29764,"toc":30339},[29765,29768,29771,29774,29778,29784,29787,29790,29796,29802,29808,29811,29814,29818,29822,29825,29831,29837,29842,29845,29849,29852,29864,29867,29870,29874,29881,29884,29888,29891,29896,29901,29906,29909,29913,29916,29921,29924,29926,30141,30144,30148,30151,30157,30163,30169,30175,30179,30183,30186,30189,30192,30195,30199,30202,30205,30208,30212,30215,30218,30221,30225,30228,30234,30240,30246,30249,30253,30256,30259,30262,30265,30268,30270,30276,30282,30288,30294,30300,30306,30312,30314,30317,30320,30323,30328,30337],[12,29766,29767],{},"On November 28, 2022, Salesforce shut down the Heroku free plan. The company that bought Heroku in 2010 carried out the notice published three months earlier — all free dynos, all hobby databases, all free redises were terminated in the same window. Hundreds of thousands of hobby projects, MVPs, portfolio demos, and university prototypes disappeared from the air all at once.",[12,29769,29770],{},"The reaction was predictable. Whoever had a card on file migrated to the paid plan and went on. Whoever didn't looked elsewhere. And in the following weeks, a movement that had existed in dormant state since 2013 exploded: \"self-hosted Heroku\".",[12,29772,29773],{},"In 2026, three and a half years later, the segment matured. 
There are at least six serious products competing for attention, plus a handful of hosted projects that sell themselves as \"Heroku-like\" without the self. This post is the map.",[19,29775,29777],{"id":29776},"why-self-hosted-heroku-became-a-category","Why \"self-hosted Heroku\" became a category",[12,29779,29780,29781,29783],{},"The first thing that needs to be said is that Heroku solved the right problem. In 2010, web app deploy had two forms: upload a tarball to a server you administered, or pay a lot for someone to administer it for you. Heroku invented the third path — ",[231,29782,23639],{},", scale slider, embedded managed database, automatically issued certificate, working subdomain in seconds.",[12,29785,29786],{},"That pattern stuck. The concept of \"deploy is a push, scaling is a slider, TLS is automatic\" became the base expectation of any developer trained after 2012. Everything that came after — Render, Railway, Fly.io, Vercel for frontend, Cloud Run, App Runner — is a variation on top of that model.",[12,29788,29789],{},"But three things changed between 2010 and 2022.",[12,29791,29792,29795],{},[27,29793,29794],{},"Cloud bare-metal got cheap."," In 2010, a decent virtual server cost US$40\u002Fmonth. In 2026, US$5\u002Fmonth buys a VPS with 1 vCPU, 2 GB of RAM, 50 GB of disk, and 2 TB of traffic — more than enough to run five small applications. The economics of paying dynos versus running your own containers reversed.",[12,29797,29798,29801],{},[27,29799,29800],{},"Docker became standard."," The original Heroku's great virtue was \"buildpacks\" — recipes that took your Ruby or Node code and produced an isolated artifact ready to run. Docker made that encapsulation a commodity. Today anyone produces a reproducible image in three lines of Dockerfile, and any server with 100 MB of free RAM can host it.",[12,29803,29804,29807],{},[27,29805,29806],{},"The community learned to operate."," In 2010, \"running a Linux server\" was an SRE craft. 
In 2026, any full-stack developer has already configured nginx, dealt with certbot, set up a systemd unit, debugged the OOM-killer through dmesg. The average level rose. The work that justified paying Heroku US$25\u002Fmonth to take care of it became an afternoon exercise.",[12,29809,29810],{},"When the free plan died, setting up \"your own Heroku\" stopped being a hacker pride exercise and became back-of-the-envelope math. US$10\u002Fmonth for a VPS against US$25\u002Fmonth minimum on paid Heroku — and that's per application. Five applications on Heroku cost US$125\u002Fmonth. Five applications on a US$10\u002Fmonth VPS keep costing US$10\u002Fmonth.",[12,29812,29813],{},"The category that responded to that math has five distinct subgenres today. Worth separating.",[19,29815,29817],{"id":29816},"the-segment-in-2026","The segment in 2026",[368,29819,29821],{"id":29820},"single-server-simple-i-install-on-a-vps-and-forget","Single-server simple — \"I install on a VPS and forget\"",[12,29823,29824],{},"The oldest and most populated subgenre. The premise is direct: a single server, an installer, a panel or CLI, and you have dynos. No cluster, no high availability, no complication.",[12,29826,29827,29830],{},[27,29828,29829],{},"Dokku"," is the grandpa of the segment, active since 2013. The engine is Bash plus Docker. UX is mostly CLI — you push code via git remote, it builds with Heroku-compatible buildpacks, brings up the container, registers in the internal router. The community is small but loyal, the product is stable, and the learning curve is steep in the first days and flat after that. Whoever got past those first days rarely switches. It's out of fashion in the sense that the new community prefers web panels — but the product remains solid, with more than twelve years of real production operation in thousands of installations around the world.",[12,29832,29833,29836],{},[27,29834,29835],{},"Caprover"," occupies the middle of the spectrum. 
Web panel, plugin system, reasonably easy installation. The product has about thirteen thousand stars on public repositories and an active community, though smaller than that of newer competitors. Evolution is slower — releases come at a monthly or bimonthly cadence, and big features take time. For those who prioritize stability over novelty, it's a defensible choice.",[12,29838,29839,29841],{},[27,29840,2770],{}," is the current mindshare leader. About forty thousand stars, modern web panel, active plugin ecosystem, noisy community in forums and chat. The product evolved fast between 2023 and 2025, adding support for embedded managed database, deploy via git, automatic certificates, container monitoring. It's the default recommendation circulating in indie hacker forums today.",[12,29843,29844],{},"The main defect of Coolify, and the reason it appears also in the section on traps further below, is architectural: it was designed single-server first. Multi-server was added later, but the central panel remains a single process on a single server. If that server falls, you lose access to all the others.",[368,29846,29848],{"id":29847},"single-server-modern-deploy-without-panel","Single-server modern — \"deploy without panel\"",[12,29850,29851],{},"Newer subgenre, with philosophy opposite to the previous. Instead of web panel, command-line tool that operates over SSH.",[12,29853,29854,29856,29857,29860,29861,101],{},[27,29855,2997],{}," is the almost exclusive representative. It came out of the 37signals team in 2024, written by people close to DHH. The premise is radical: no panel, no control plane, no resident agent on the servers. You write a configuration file, run ",[231,29858,29859],{},"kamal deploy",", it SSHs into each server, pulls the image, swaps the container, and continues. DHH published in 2024 that he saved about three million dollars per year migrating Basecamp's and HEY's own apps from cloud to own hardware with Kamal. 
Where the \"no orchestration\" thesis starts to hurt is in ",[3336,29862,29863],{"href":25875},"HeroCtl vs Kamal",[12,29865,29866],{},"The virtue is absolute transparency — there's nothing happening that you don't see in the terminal. The defect is that multi-server isn't orchestration, it's parallel deploy. There's no leader election, no rebalancing, no failover. Each server is an independent destination. If one goes down, your monitoring notifies you and you redo the deploy excluding that host.",[12,29868,29869],{},"For a small team operating two or three applications on three to five servers, with disciplined deploy habits, Kamal is elegant. For anything that needs \"if a server goes down, the cluster decides what to do on its own\", it isn't the right tool.",[368,29871,29873],{"id":29872},"cloud-native-modern-heroku-rewrapped","Cloud-native modern — \"Heroku rewrapped\"",[12,29875,29876,29878,29879,101],{},[27,29877,2776],{}," is the most recent product to enter the conversation. Around ten thousand stars, growing fast, UX visually similar to Coolify but with an underlying architecture on Docker Swarm. Main attraction: real multi-server \"out of the box\", without needing to set it up by hand. The point-by-point reading of the technical choice Dokploy made is in ",[3336,29880,16695],{"href":16694},[12,29882,29883],{},"The structural defect is the foundation. Docker Swarm has been in maintenance mode for years — Docker Inc. hasn't invested in new features since 2019, and the public roadmap is essentially \"keep functioning\". Building a new product on top of technology in maintenance mode is a bet. If Swarm is formally discontinued, Dokploy will need to migrate the entire foundation or rewrite it — and the user pays that bill midway. 
The plugin ecosystem is still smaller than Coolify's, but growing fast.",[368,29885,29887],{"id":29886},"hosted-but-i-prefer-self-hosted-vercelrender-but-on-my-server","Hosted-but-I-prefer-self-hosted — \"Vercel\u002FRender but on my server\"",[12,29889,29890],{},"Technically outside the post title's category, but worth mentioning because many teams compare them. Whoever seeks a \"Heroku alternative\" frequently ends up not in self-hosted, but in another hosted.",[12,29892,29893,29895],{},[27,29894,15014],{}," is the most direct successor to Heroku in spirit. Clean UX, predictable prices, generous free tier (but not unlimited — it suspends inactive apps automatically). Managed Postgres and Redis databases, deploy via git, build logs in the panel. The price rises linearly with real use, without big traps. It's the obvious choice for those who want \"Heroku that works in 2026\" without worrying about a server.",[12,29897,29898,29900],{},[27,29899,15017],{}," is hosted, with a stronger focus on solo devs, priced by resource use (CPU\u002FRAM\u002Ftraffic) instead of a fixed tier. Works well for hobby projects that won't scale; can get expensive fast if you leave a worker running by accident.",[12,29902,29903,29905],{},[27,29904,23789],{}," has a different proposal: distributed hosting in multiple regions, rawer primitives, closer to \"VM as a service with automatic TLS\" than to \"PaaS in the Heroku style\". It's the choice for those who want low global latency without setting things up by hand. The learning curve is steeper than Render's or Railway's.",[12,29907,29908],{},"The three are legitimate options. 
The important caveat comes in the traps section: hosted free tiers shrink every year, and Heroku's historical path — it started free and became US$25\u002Fmonth minimum — is the default forecast for any free plan from a private company.",[368,29910,29912],{"id":29911},"real-cluster-i-need-high-availability","Real cluster — \"I need high availability\"",[12,29914,29915],{},"A short category, with few serious products. Here the premise isn't \"run deploys on more than one server\", it's \"if a server goes down, the cluster keeps working on its own without human intervention\". The difference is big, and most of the segment doesn't cross that line.",[12,29917,29918,29920],{},[27,29919,2994],{}," is the product we're building. A control plane replicated across three or more servers from day one. Automatic leader election in about seven seconds when the leader goes down. Integrated router, automatic certificates, metrics, and logs embedded in the binary itself. Commercial model with a permanently free Community tier, and paid Business and Enterprise with published pricing. Ideal range: from one to five hundred servers.",[12,29922,29923],{},"The operational difference matters when the first serious customer arrives. 
While the central panel of a multi-server Coolify is a single point of failure, in HeroCtl there is no center — any of the first three servers can lead, and the transition between them is automatic.",[19,29925,18680],{"id":18679},[119,29927,29928,29948],{},[122,29929,29930],{},[125,29931,29932,29934,29936,29938,29940,29942,29944,29946],{},[128,29933,2982],{},[128,29935,29829],{},[128,29937,29835],{},[128,29939,2770],{},[128,29941,2997],{},[128,29943,2776],{},[128,29945,15014],{},[128,29947,2994],{},[141,29949,29950,29970,29988,30006,30024,30042,30060,30078,30097,30119],{},[125,29951,29952,29954,29956,29959,29961,29963,29965,29968],{},[146,29953,27208],{},[146,29955,3013],{},[146,29957,29958],{},"10 min",[146,29960,3019],{},[146,29962,3019],{},[146,29964,29958],{},[146,29966,29967],{},"n\u002Fa (hosted)",[146,29969,3019],{},[125,29971,29972,29974,29976,29978,29980,29982,29984,29986],{},[146,29973,25589],{},[146,29975,3058],{},[146,29977,3064],{},[146,29979,3064],{},[146,29981,3058],{},[146,29983,3064],{},[146,29985,3064],{},[146,29987,3064],{},[125,29989,29990,29992,29994,29996,29998,30000,30002,30004],{},[146,29991,26635],{},[146,29993,3058],{},[146,29995,3139],{},[146,29997,3139],{},[146,29999,3139],{},[146,30001,3064],{},[146,30003,28046],{},[146,30005,3064],{},[125,30007,30008,30010,30012,30014,30016,30018,30020,30022],{},[146,30009,16324],{},[146,30011,3058],{},[146,30013,3058],{},[146,30015,3058],{},[146,30017,3058],{},[146,30019,3139],{},[146,30021,3064],{},[146,30023,3064],{},[125,30025,30026,30028,30030,30032,30034,30036,30038,30040],{},[146,30027,3923],{},[146,30029,3064],{},[146,30031,3064],{},[146,30033,3064],{},[146,30035,3064],{},[146,30037,3064],{},[146,30039,3064],{},[146,30041,3064],{},[125,30043,30044,30046,30048,30050,30052,30054,30056,30058],{},[146,30045,23287],{},[146,30047,3058],{},[146,30049,3058],{},[146,30051,3058],{},[146,30053,3058],{},[146,30055,3058],{},[146,30057,3064],{},[146,30059,3064],{},[125,30061,30062,30064,30066,30068,30070,3007
2,30074,30076],{},[146,30063,22768],{},[146,30065,3058],{},[146,30067,19345],{},[146,30069,3064],{},[146,30071,3058],{},[146,30073,3064],{},[146,30075,3064],{},[146,30077,3064],{},[125,30079,30080,30083,30085,30087,30089,30091,30093,30095],{},[146,30081,30082],{},"Embedded logs",[146,30084,3058],{},[146,30086,19345],{},[146,30088,3064],{},[146,30090,3058],{},[146,30092,3064],{},[146,30094,3064],{},[146,30096,3064],{},[125,30098,30099,30101,30104,30106,30109,30111,30113,30116],{},[146,30100,22811],{},[146,30102,30103],{},"Open-source",[146,30105,30103],{},[146,30107,30108],{},"Open-source + paid cloud",[146,30110,30103],{},[146,30112,30103],{},[146,30114,30115],{},"Hosted paid",[146,30117,30118],{},"Free Community + Business\u002FEnterprise",[125,30120,30121,30123,30126,30129,30131,30134,30137,30139],{},[146,30122,13533],{},[146,30124,30125],{},"1 server",[146,30127,30128],{},"1–3 servers",[146,30130,30128],{},[146,30132,30133],{},"1–10 servers",[146,30135,30136],{},"3–10 servers",[146,30138,28046],{},[146,30140,26185],{},[12,30142,30143],{},"The column that splits the segment into two halves is \"real high availability\". To its left, all products share the same premise: multi-server is a deploy destination, not a cluster with consensus. To its right, the panel\u002Fcontrol plane is replicated and survives the loss of any server.",[19,30145,30147],{"id":30146},"decision-by-usage-profile","Decision by usage profile",[12,30149,30150],{},"Four profiles cover most cases.",[12,30152,30153,30156],{},[27,30154,30155],{},"Solo dev, hobby project, one VPS."," Dokku if you like a CLI and want stability. Coolify if you prefer a web panel. Kamal if you're on a Rails or Node stack and already work well with SSH and configuration files. Any of the three solves it. The choice is more about taste than technical capability.",[12,30158,30159,30162],{},[27,30160,30161],{},"Indie hacker with one to three small SaaS, still one server."," Coolify or Dokploy. 
The practical difference is the plugin ecosystem (Coolify has more) and the technical foundation (Dokploy bets on Swarm). For the next twelve months, either works; migration between them is feasible because both run standard Docker containers. The important architectural decision is a different one: when you go from one server to two, you'll feel the panel's single point of failure — and that's the time to assess whether the next migration is multi-server Dokploy or a real cluster.",[12,30164,30165,30168],{},[27,30166,30167],{},"Startup with its first serious customer, contractual SLA coming into force."," HeroCtl. Here the single-server panel becomes a legal liability — any SLA written into a commercial contract assumes that the infrastructure survives the loss of a node, and no single-server panel does. You can try to set up manual redundancy on top of Coolify or Dokploy, but the result will be fragile and costly to operate. The simple rule is: when the customer contract mentions \"uptime\", the consensus cluster stops being a luxury.",[12,30170,30171,30174],{},[27,30172,30173],{},"Established company, fifty servers or more, platform team with three dedicated people."," Here the conversation changes. Managed K8s on a cloud provider becomes the sensible option, because the operator ecosystem is bigger and the team has the competence to absorb the complexity. HeroCtl runs in this range too — we tested hundreds of nodes in the laboratory, dozens in customer production — but above one hundred servers the ceiling of our specialized-operator library starts to appear.",[19,30176,30178],{"id":30177},"the-segments-three-traps","The segment's three traps",[368,30180,30182],{"id":30181},"multi-server-doesnt-mean-real-high-availability","\"Multi-server\" doesn't mean \"real high availability\"",[12,30184,30185],{},"The most expensive confusion. Most panels list \"multi-server\" as a feature, and the casual reader interprets that as \"if a server goes down, the system keeps working\". 
That isn't what's being offered. Multi-server in most panels means: the central panel, running on a single server, is capable of deploying to multiple remote servers.",[12,30187,30188],{},"When the panel server goes down, you lose control. The containers in production keep running — Docker doesn't stop because of this — but you can no longer deploy, read centralized logs, restart a service, or scale. You sit waiting for it to come back.",[12,30190,30191],{},"Real high availability requires consensus between multiple servers: at least three panel processes running, automatic leader election, state replication between them. If the leader goes down, another takes over in seconds. That's a different architecture, more expensive to build and more expensive to operate. That's why few products in the segment deliver it.",[12,30193,30194],{},"The concrete question to ask when evaluating any product: \"if the server where the panel runs is shut down right now, how long until the system accepts deploys again, and is that recovery automatic or manual?\". If the answer involves a human opening SSH somewhere, it isn't high availability.",[368,30196,30198],{"id":30197},"plugin-ecosystem-can-be-disguised-dependency","\"Plugin ecosystem\" can be a disguised dependency",[12,30200,30201],{},"Panels with plugin stores look complete: you install a plugin to get managed Postgres, another for Redis, another for a Sentry-like error tracker, another for automatic backup, another for monitoring. Each one solves a piece, and the set adds up to a Heroku.",[12,30203,30204],{},"The problem appears two years later. The backup plugin was written by a volunteer in 2024 and stopped receiving commits in 2025. The new panel version broke compatibility with it and nobody updated it. You find out at the moment you need to restore a backup — and the restore was never tested with the current version.",[12,30206,30207],{},"That pattern repeats for each plugin. 
The more functionality depends on the external ecosystem, the bigger the risk surface. The structural defense is simple: prefer products with batteries included — where Postgres, metrics, logs, certificates, and routing are part of the main product and maintained by the same team that maintains the rest. Plugins are convenient in the short term and costly in the medium term.",[368,30209,30211],{"id":30210},"hosted-free-tier-isnt-gratis-long-term","Hosted \"free tier\" isn't free long term",[12,30213,30214],{},"Render, Railway, and Fly.io have generous free plans today. Heroku had one in 2021. The segment's history shows a consistent pattern: free tiers from private companies shrink with every capital-raise round. First it suspends for inactivity, then it reduces quota, then it adds an hour limit, then it turns into a thirty-day trial, then it ends.",[12,30216,30217],{},"It's not malice — it's business math. Hosting workloads costs money, and investors demand returns. The only structural exception is hosting subsidized by another product from the same company (a cloud covering a free PaaS to attract developers to the main cloud), and even those arrangements change when the CFO changes.",[12,30219,30220],{},"Self-hosting is the only structural defense. You pay the VPS bill directly to the infrastructure provider, without an intermediary. When the intermediary disappears, your application doesn't disappear with it.",[19,30222,30224],{"id":30223},"when-to-stay-on-heroku-render-or-railway-without-irony","When to stay on Heroku, Render, or Railway without irony",[12,30226,30227],{},"Worth saying clearly: not every team needs to leave managed hosting. 
There are three situations where staying is the right decision.",[12,30229,30230,30233],{},[27,30231,30232],{},"Small team without operational competence available to take care of a server."," If the entire team is two product developers, neither with prior experience in Linux\u002FDocker\u002Fnetworking, the cognitive cost of operating infrastructure is greater than the monthly savings. Pay Render the US$200\u002Fmonth and keep the focus on product.",[12,30235,30236,30239],{},[27,30237,30238],{},"Application whose platform cost is negligible compared to revenue."," If the company makes US$50k\u002Fmonth and the Heroku bill is US$300, optimizing that bill is poorly allocated work. The marginal return of migrating is low, and the operational risk doesn't pay off.",[12,30241,30242,30245],{},[27,30243,30244],{},"Team allocated to product, not to infra."," Some startups are so dependent on rapid iteration on the product that any hour spent on infra is an hour stolen from the competitive differentiator. For these, the trade-off of paying more to not think about a server is real value, not waste.",[12,30247,30248],{},"The simple rule: if infra is an invisible commodity for your business, let someone charge you to keep it invisible. If infra is a capability that differentiates the product (low latency, specific regions, specific compliance, contractual uptime), controlling it is worth the work.",[19,30250,30252],{"id":30251},"heroctl-in-the-segment","HeroCtl in the segment",[12,30254,30255],{},"Honest positioning: HeroCtl doesn't compete with Dokku or Coolify in the case of a hobby project on one VPS. For that case, it's more machinery than needed. An indie hacker with a Django application on a US$5\u002Fmonth server should use Dokku or Coolify and keep going.",[12,30257,30258],{},"Where HeroCtl competes is where Coolify multi-server, Dokploy, and Nomad also compete: the case of a serious customer with an SLA, where single-server becomes a legal liability. 
Here the difference we offer is a cluster with consensus from day one, batteries included instead of five products to set up (router, certificates, metrics, logs, and encryption between services already in the binary), and a commercial contract that's published and frozen — no retroactive changes of terms.",[12,30260,30261],{},"The demonstration cluster runs four servers totaling five vCPUs and ten gigabytes of RAM, with sixteen active containers serving five sites. The control plane occupies between 200 and 400 MB per server. By comparison, the control plane of a managed version of the large orchestrator starts at about 700 MB per master node before any application comes up.",[12,30263,30264],{},"The typical job spec has about fifty lines — it describes service, ingress, secrets, resources. The equivalent on the large orchestrator runs past three hundred lines to cover the same functionality.",[12,30266,30267],{},"HeroCtl doesn't compete with managed cloud at scales of one hundred nodes or more. The ideal range is one to five hundred servers. Above that, the external ecosystem of specialized operators still gives the advantage to the large orchestrator, and being honest about that is part of the contract.",[19,30269,20283],{"id":20282},[12,30271,30272,30275],{},[27,30273,30274],{},"Can I migrate from Heroku directly to HeroCtl?","\nYes, with some adaptations. A stateless web application with a separate Postgres migrates easily — you containerize it with a Dockerfile, describe the job in fifty lines, bring it up. Separate workers (Sidekiq, Celery) become additional jobs in the same cluster. What needs to be rethought is what depended on managed add-ons.",[12,30277,30278,30281],{},[27,30279,30280],{},"And the add-ons (Postgres, Redis, Sentry)?","\nPostgres you run as a job in the cluster itself, with a persistent volume, and you take care of backups the way a human takes care of them — there's no automatic operator that does this better than you doing it right. Same goes for Redis. 
Self-hosted Sentry exists and runs on any Docker cluster — and there's a hosted commercial product if you prefer not to operate it. The general rule: critical data runs in the cluster, observability can run outside.",[12,30283,30284,30287],{},[27,30285,30286],{},"How does the cost compare?","\nTake as a baseline a startup with five small applications: paid Heroku comes out around US$125\u002Fmonth minimum, without add-ons. Render comes out between US$50 and US$150 depending on usage. An own cluster on three VPS nodes comes out at US$30 to US$60\u002Fmonth total at the infrastructure provider. The direct savings are real, and grow more significant as the applications grow.",[12,30289,30290,30293],{},[27,30291,30292],{},"What if I'm already on Coolify?","\nThere's no urgency to migrate while you operate with a single server. The time to consider it is when the single-server panel becomes a contractual single point of failure — the first serious customer with an SLA in writing. Until then, Coolify works well.",[12,30295,30296,30299],{},[27,30297,30298],{},"And for a Django app with Celery, or Rails with Sidekiq?","\nWorks naturally. You define a job for the web process and another job for the worker process, both sharing the same image but with different commands. The cluster orchestrates the two independently, and the broker (Redis or similar) is one more job in the same cluster.",[12,30301,30302,30305],{},[27,30303,30304],{},"And for a Node.js app with separate workers?","\nSame story. A worker is just another process, defined as another job. There's no architectural distinction between \"web\" and \"worker\" at the orchestrator level — they are containers that run code.",[12,30307,30308,30311],{},[27,30309,30310],{},"When do Business prices come out?","\nThe plans page publishes the prices. The cutoff is designed so you only pay for Business when the company is large enough that SSO, granular RBAC, and detailed audit logs are real requirements — not preferences. 
For everything else, Community solves it, and Community is permanently free with no artificial feature gates.",[19,30313,3309],{"id":3308},[12,30315,30316],{},"The \"self-hosted Heroku\" segment has matured. In 2026, there are serious products for each usage profile, and the decision depends less on \"which is best\" and more on \"which fits my case\". A hobby project doesn't need a consensus cluster. A serious customer with an SLA doesn't fit on a single-server panel.",[12,30318,30319],{},"For those deciding now, three final recommendations. First, read the commercial contract before adopting — avoid terms that allow retroactive changes. Second, prefer batteries included over plugin ecosystems where possible — smaller risk surface. Third, test the failure path before the real incident — shut down a server and watch what happens, calmly, during the day.",[12,30321,30322],{},"To start with HeroCtl on three Linux servers:",[224,30324,30326],{"className":30325,"code":5318,"language":2529},[2527],[231,30327,5318],{"__ignoreMap":229},[12,30329,30330,30331,30333,30334,30336],{},"If you want to read more first, there are two adjacent posts: ",[3336,30332,16690],{"href":16689}," explains the direct comparison with the mindshare leader of the single-server segment, and ",[3336,30335,25854],{"href":6545}," explains the reasoning that led to the product's 
existence.",[12,30338,3348],{},{"title":229,"searchDepth":244,"depth":244,"links":30340},[30341,30342,30349,30350,30351,30356,30357,30358,30359],{"id":29776,"depth":244,"text":29777},{"id":29816,"depth":244,"text":29817,"children":30343},[30344,30345,30346,30347,30348],{"id":29820,"depth":271,"text":29821},{"id":29847,"depth":271,"text":29848},{"id":29872,"depth":271,"text":29873},{"id":29886,"depth":271,"text":29887},{"id":29911,"depth":271,"text":29912},{"id":18679,"depth":244,"text":18680},{"id":30146,"depth":244,"text":30147},{"id":30177,"depth":244,"text":30178,"children":30352},[30353,30354,30355],{"id":30181,"depth":271,"text":30182},{"id":30197,"depth":271,"text":30198},{"id":30210,"depth":271,"text":30211},{"id":30223,"depth":244,"text":30224},{"id":30251,"depth":244,"text":30252},{"id":20282,"depth":244,"text":20283},{"id":3308,"depth":244,"text":3309},"2025-11-19","Since Salesforce killed the Heroku free plan in November\u002F2022, dozens of self-hosted alternatives emerged. An honest map of the segment and how to choose.",{},{"title":29762,"description":30361},{"loc":19749},"en\u002Fblog\u002Fself-hosted-heroku-2026",[20490,7507,24167,8756,30367],"segment","H5F-C9_vJqNK_qxH22BnlbM121JYpgw4xh0fOHM7_NM",{"id":30370,"title":30371,"author":7,"body":30372,"category":3378,"cover":3379,"date":31035,"description":31036,"draft":3382,"extension":3383,"lastReviewed":3379,"meta":31037,"navigation":411,"path":15780,"readingTime":3386,"seo":31038,"sitemap":31039,"stem":31040,"tags":31041,"__hash__":31044},"blog_en\u002Fen\u002Fblog\u002Fkubernetes-overkill-when-you-dont-need-it.md","Kubernetes is overkill: when you don't need the 
colossus",{"type":9,"value":30373,"toc":31022},[30374,30377,30380,30383,30387,30390,30397,30400,30403,30406,30408,30414,30425,30428,30433,30436,30443,30446,30453,30460,30463,30466,30468,30471,30478,30485,30502,30509,30512,30515,30519,30522,30560,30563,30566,30570,30573,30610,30617,30623,30629,30633,30636,30653,30659,30662,30665,30672,30675,30679,30682,30828,30831,30835,30838,30841,30851,30858,30864,30887,30891,30894,30926,30929,30931,30937,30943,30949,30955,30961,30975,30981,30983,30986,30989,30992,31008,31011,31017,31020],[12,30375,30376],{},"The right question in 2026 isn't \"is Kubernetes good?\". That debate is over — yes, it is. The right question is another one, and almost no startup CTO asks it out loud: \"do I need this for what I'm building?\".",[12,30378,30379],{},"For most teams deciding architecture now, the honest answer is no. But the industry incentive system pushes in the opposite direction. Recruiters filter résumés by keyword. Conferences reward those who show YAML on a projector. Series A investors ask \"are you on K8s?\" as if it were a seriousness seal. The collective result is a giant complexity layer adopted by teams that didn't have the problem this complexity solves.",[12,30381,30382],{},"This post is the explicit defense of \"you don't need it\", with the math in hand. It's not about Kubernetes being bad — it's about choosing the right tool for the actual size of the problem you have today, not for the picture you want to project.",[19,30384,30386],{"id":30385},"the-seduction-is-real-and-has-a-name","The seduction is real (and has a name)",[12,30388,30389],{},"The first thing to admit is that excessive Kubernetes adoption isn't irrational. It makes total sense — for the individual. The name of the force operating here is resume-driven development.",[12,30391,30392,30393,30396],{},"A platform engineer who learns Kubernetes earns, on average, ",[27,30394,30395],{},"30% more"," at the next job. 
It's the technical skill with the highest salary premium in 2026 job listings among everything operations-related. Whoever puts \"operated multi-region cluster in production\" on LinkedIn gets thirty recruiter messages in a week. Whoever puts \"operated self-hosted panel on three servers\" gets two.",[12,30398,30399],{},"For the engineer, it's rational to learn Kubernetes even when the work doesn't require it. For the CTO, it's rational to adopt Kubernetes even when the product doesn't require it — it signals to the market that the company \"is ready to scale\", which is the vocabulary investors understand. For the platform team, it's rational to suggest Kubernetes on every new project — it guarantees relevance for the next five years.",[12,30401,30402],{},"Each of these individual calculations is defensible. The aggregate is disastrous. The company pays the bill for a decision nobody made thinking about the company.",[12,30404,30405],{},"The starting point of this post is to acknowledge that tension. You, reading, may be being pressured by one of the three forces above. Recognizing the pressure is the first step to deciding based on the real problem, not on the incentives of whoever is advising you.",[19,30407,16837],{"id":16836},[12,30409,30410,30411,101],{},"Let's put the numbers on the table. Real, Brazilian, 2026 numbers. For the version of this math with the full Brazilian lens — exchange rate, LGPD, national hosting — there's a specific post at ",[3336,30412,30413],{"href":14874},"Kubernetes alternative in 2026: self-hosted PaaS for Brazilian teams",[12,30415,30416,30417,30420,30421,30424],{},"A senior platform engineer in São Paulo or Florianópolis, capable of operating Kubernetes in production without supervision, costs today between ",[27,30418,30419],{},"R$25k and R$40k per month CLT",", or between ",[27,30422,30423],{},"US$8k and US$12k per month PJ",". 
Let's use R$30k as a conservative average — it's the salary at which you can close with someone competent who hasn't yet hit the market ceiling.",[12,30426,30427],{},"But R$30k is just one person. A production cluster needs 24×7 on-call. Brazilian labor law doesn't allow a single person to carry indefinite on-call — and even if it did, the mental health of a sole operator collapses in six months. You need at least two people rotating, with real time off between shifts.",[12,30429,30430],{},[27,30431,30432],{},"R$30k × 2 people × 12 months = R$720k per year just in payroll.",[12,30434,30435],{},"On top of that payroll, add charges (CLT runs around 70-100% above the base salary; PJ varies but generally 30-40% if you want to treat the person decently). And on top of that, the real costs of a platform team: auxiliary tool licenses, training, the certifications the market asks for, conferences, the occasional one-off consulting hire when something breaks in a way nobody in-house has seen before.",[12,30437,30438,30439,30442],{},"On infrastructure proper, the math also isn't cheap. The managed version of the colossus (EKS, GKE, AKS) charges about ",[27,30440,30441],{},"US$73 per month per cluster"," just for the control plane — that's about R$370 per month. On top of that come the NAT Gateway (US$40 per month minimum), the Application Load Balancer (US$25 per month minimum), outbound traffic billed per gigabyte, disk snapshots, and observability costs that grow with the number of pods and exported metrics.",[12,30444,30445],{},"Serious companies run at least two environments (staging and production), and many have a third dedicated dev or QA environment. Multiply.",[12,30447,30448,30449,30452],{},"Total estimate for year 1, with two platform engineers and two managed environments: ",[27,30450,30451],{},"between R$770k and R$800k",". 
That's before the first serious customer, before the first real of recurring revenue, before the second product hire.",[12,30454,30455,30456,30459],{},"Make the inverse comparison. A team of three full-stack developers at R$15k each: ",[27,30457,30458],{},"R$540k per year",". With charges, let's say R$720k. It's exactly the same range.",[12,30461,30462],{},"The concrete question is this: do you prefer to have three more people shipping code the customer will use, or do you prefer to have two people maintaining a platform the customer will never see?",[12,30464,30465],{},"There's no universal answer. But almost always, at the stage when this decision is made — pre-Series A company, team of five to fifteen people, product searching for product-market fit — the correct answer is to ship product.",[19,30467,17158],{"id":17157},[12,30469,30470],{},"Money you capture or save. Time you don't recover. The time math weighs more than the R$ math.",[12,30472,30473,30474,30477],{},"Setting up managed Kubernetes for minimally decent production takes ",[27,30475,30476],{},"2 to 4 weeks",". It's not installing the cluster — that's minutes. It's configuring the ingress controller, the certificate issuer, the monitoring stack (generally three coordinated products), the centralized log system, automatic backup of persistent volumes, alerts, network policy between pods, namespace segregation. Each component has five legitimate choices and three known pitfalls.",[12,30479,30480,30481,30484],{},"First real application deploy after that: another ",[27,30482,30483],{},"week",". Manifests for the application, manifests for auxiliary components, health probe definitions (liveness, readiness, startup), resource request vs limit tuning, load tests to calibrate autoscaling, failure tests to calibrate restart policy.",[12,30486,30487,30488,30491,30492,30495,30496,30499,30500,101],{},"Each new platform feature after that becomes a project. 
Automatically rotated secrets: ",[27,30489,30490],{},"one to two weeks",". Custom metrics integrated into the dashboard: ",[27,30493,30494],{},"one week",". Service mesh with end-to-end observability: ",[27,30497,30498],{},"two to three weeks",", and that's if you choose one of the stable options upfront. Granular network policy per namespace with auditing: ",[27,30501,30494],{},[12,30503,30504,30505,30508],{},"You spend the first ",[27,30506,30507],{},"six months"," assembling a platform. That's the optimistic scenario, with competent people and no major incidents.",[12,30510,30511],{},"Meanwhile, the smaller competitor that chose a simpler stack is shipping features. When you finish the \"minimally decent platform\", they already have three features you don't, eight customers who made buying decisions based on those features, and a running feedback cycle that will shape their next features with more precision.",[12,30513,30514],{},"In markets where the winner is decided in the first twelve months — almost every B2B SaaS market — that delay is fatal. You can have the most robust infrastructure in the segment and lose the entire segment because you arrived three months late with the feature that matters.",[19,30516,30518],{"id":30517},"the-profile-that-does-not-need-kubernetes","The profile that DOES NOT need Kubernetes",[12,30520,30521],{},"Here lives the majority of teams deciding this in 2026. If you fit all the criteria below, Kubernetes is overkill — almost certainly:",[2734,30523,30524,30530,30536,30542,30548,30554],{},[70,30525,30526,30529],{},[27,30527,30528],{},"Team of 1 to 15 engineers."," You can't dedicate two people just to platform work without compromising 15-30% of delivery capacity.",[70,30531,30532,30535],{},[27,30533,30534],{},"1 to 50 total production servers."," At this size, simple abstractions still work. 
You can list all servers on one page.",[70,30537,30538,30541],{},[27,30539,30540],{},"1 to 100 distinct services or applications."," The colossus ecosystem shines when you have thousands of microservices with cross-dependencies. Below 100, the orchestrator complexity is greater than the complexity of the system it orchestrates.",[70,30543,30544,30547],{},[27,30545,30546],{},"Predictable traffic."," No 100× spike in 30 seconds. If your load oscillates two or three times between valley and peak throughout the day, you don't need millimetric autoscaling.",[70,30549,30550,30553],{},[27,30551,30552],{},"Typical HTTP\u002Fweb workloads."," Web application, REST API, queue worker, managed database. Without exotic needs like GPU sharding, large-scale ML inference, petabyte-per-night batch processing.",[70,30555,30556,30559],{},[27,30557,30558],{},"Operation in 1 to 3 cloud regions."," You serve one country, or at most two close continents. You're not orchestrating workloads that migrate between five datacenters in real time.",[12,30561,30562],{},"This is the overwhelming majority of teams adopting Kubernetes today. They fit all six criteria and still choose the tool designed to solve the opposite of each criterion.",[12,30564,30565],{},"If you recognize yourself here, the next section may be uncomfortable — but read until the end before dismissing it.",[19,30567,30569],{"id":30568},"the-profile-that-needs-kubernetes","The profile that NEEDS Kubernetes",[12,30571,30572],{},"To avoid falling into \"Kubernetes never works\", here's the honest side. These are the profiles where adopting the colossus is the right choice, not overkill:",[2734,30574,30575,30581,30587,30593,30598,30604],{},[70,30576,30577,30580],{},[27,30578,30579],{},"Multi-tenant SaaS with namespace isolation, with resource quota requirements, granular network policy, and audit per tenant."," Banks, regulated fintechs, platforms that sell hosting to other developers. 
The colossus namespace model is the best primitive available for that problem.",[70,30582,30583,30586],{},[27,30584,30585],{},"Real multi-region operation with workloads moving between datacenters in response to failure or demand."," It's not \"I have replicas in two regions for DR\". It's \"I move active workloads between four regions with automatic state and latency management\". This is very-large-company territory, with money to match.",[70,30588,30589,30592],{},[27,30590,30591],{},"Dependency on mature specialized operators that are worth the overhead."," Postgres with automatic replication, Kafka with continuous balancing, Cassandra with orchestrated bootstrap, Spark in batch. If your architecture has three or more of these components operated by specialized automation, the colossus ecosystem is the place where this automation matured.",[70,30594,30595,30597],{},[27,30596,5064],{}," You have real human capacity to maintain the system without compromising product delivery. It's not one person doing platform work \"on the side\"; it's a department.",[70,30599,30600,30603],{},[27,30601,30602],{},"Compliance that requires pre-approved tool names."," FedRAMP, ITAR, some government contracts, some health frameworks in the US. If your auditor needs to point to an existing certificate on a list, the colossus is on that list. Other alternatives generally are not on it yet.",[70,30605,30606,30609],{},[27,30607,30608],{},"Scale of hundreds to thousands of nodes."," Above 500 servers you exit the range where simpler alternatives shine and enter the range where the colossus was designed to work. There's no serious substitute here.",[12,30611,30612,30613,30616],{},"If you have ",[27,30614,30615],{},"three or more"," of these, Kubernetes is the right choice, not overkill. 
Adopt without guilt, hire the team, spend the money — it's aligned.",[12,30618,30612,30619,30622],{},[27,30620,30621],{},"one or two",", watch out: it can probably be solved with something simpler and migrated later, if you actually scale to the point of needing more. Evaluate calmly.",[12,30624,30612,30625,30628],{},[27,30626,30627],{},"zero",", any Kubernetes adoption is a résumé decision, not a product decision. Be honest with the team and with yourself.",[19,30630,30632],{"id":30631},"the-gray-zone-where-the-most-common-error-lives","The gray zone — where the most common error lives",[12,30634,30635],{},"There's an intermediate range where the decision is harder, and it's where the most common error happens. Let's describe it precisely:",[2734,30637,30638,30641,30644,30647,30650],{},[70,30639,30640],{},"Team of 5 to 15 engineers",[70,30642,30643],{},"5 to 20 servers",[70,30645,30646],{},"10 to 50 services or applications",[70,30648,30649],{},"1 or 2 cloud regions",[70,30651,30652],{},"Expected but not explosive growth (let's say, doubling in size in 18 months)",[12,30654,30655,30656,101],{},"Here lives 70% of pre-Series B SaaS teams in Brazil in 2026. And here lives the fatal phrase: ",[27,30657,30658],{},"\"we'll grow fast so it's better to start with Kubernetes already\"",[12,30660,30661],{},"The fallacy is twofold. First: you won't grow 100× in six months. Real growth of successful SaaS sits between 2× and 5× per year in the first three years. You have time to migrate if you really need to.",[12,30663,30664],{},"Second, and more important: the stack you choose for a five-person team is rarely the stack you'll run with fifty. Priorities change, requirements change, trade-offs change. Companies that scaled from five to fifty people and kept the same stack are the exception, not the rule. 
You'll migrate anyway — the only question is when, and what you'll have shipped by then.",[12,30666,30667,30668,30671],{},"The right heuristic is: ",[27,30669,30670],{},"optimize for today + 18 months, not for a 5-year hypothesis",". If in the next 18 months your infra won't require Kubernetes, don't adopt now. When the problem appears, you'll have revenue to hire the team that operates it, and clarity about real requirements (which will be different from what you imagine today).",[12,30673,30674],{},"Engineering that ages well is the kind that solves the current problem with room to spare, not the kind that tries to anticipate every hypothetical problem.",[19,30676,30678],{"id":30677},"what-you-gain-on-each-path","What you gain on each path",[12,30680,30681],{},"To organize the decision, three paths side by side on ten criteria. The ratings are honest — every path has caveats.",[119,30683,30684,30698],{},[122,30685,30686],{},[125,30687,30688,30690,30693,30696],{},[128,30689,2982],{},[128,30691,30692],{},"Managed K8s",[128,30694,30695],{},"Simple self-hosted panel",[128,30697,2994],{},[141,30699,30700,30714,30728,30740,30753,30765,30777,30790,30801,30815],{},[125,30701,30702,30705,30708,30711],{},[146,30703,30704],{},"Year 1 platform cost (R$)",[146,30706,30707],{},"700-800k",[146,30709,30710],{},"40-80k (1 part-time dev)",[146,30712,30713],{},"60-120k (1 part-time dev + plan)",[125,30715,30716,30719,30722,30725],{},[146,30717,30718],{},"Time to first app in production",[146,30720,30721],{},"2-4 weeks",[146,30723,30724],{},"5 minutes to 1 day",[146,30726,30727],{},"5 minutes to 1 hour",[125,30729,30730,30733,30736,30738],{},[146,30731,30732],{},"Minimum dedicated team",[146,30734,30735],{},"2 SREs on-call",[146,30737,22187],{},[146,30739,22187],{},[125,30741,30742,30745,30748,30751],{},[146,30743,30744],{},"Real high availability (survives server crash)",[146,30746,30747],{},"Yes, with 5+ coordinated components",[146,30749,30750],{},"No (single 
server)",[146,30752,22753],{},[125,30754,30755,30758,30760,30762],{},[146,30756,30757],{},"Maximum scale validated in production",[146,30759,23344],{},[146,30761,30125],{},[146,30763,30764],{},"Hundreds of nodes",[125,30766,30767,30770,30773,30775],{},[146,30768,30769],{},"HTTP router + automatic certificates",[146,30771,30772],{},"External operator (install separately)",[146,30774,23272],{},[146,30776,23272],{},[125,30778,30779,30782,30785,30788],{},[146,30780,30781],{},"Metrics with reasonable retention",[146,30783,30784],{},"External stack (3+ products)",[146,30786,30787],{},"Optional plugin",[146,30789,25621],{},[125,30791,30792,30794,30797,30799],{},[146,30793,22779],{},[146,30795,30796],{},"External stack (2+ products)",[146,30798,30787],{},[146,30800,22302],{},[125,30802,30803,30806,30809,30812],{},[146,30804,30805],{},"Secrets encrypted at rest",[146,30807,30808],{},"External component or dedicated vault",[146,30810,30811],{},"Environment variable",[146,30813,30814],{},"Embedded in control plane",[125,30816,30817,30820,30823,30825],{},[146,30818,30819],{},"Safe rolling update deploy",[146,30821,30822],{},"Yes (configure yourself)",[146,30824,3061],{},[146,30826,30827],{},"Embedded with health check",[12,30829,30830],{},"The middle column solves the case \"one server, no SLA pressure\" — and solves it well. The left column solves \"hundreds of nodes, dedicated team, complex requirements\" — and solves it well. The right column solves the range between the two extremes, which is where most SaaS teams in Brazil today live.",[19,30832,30834],{"id":30833},"heroctl-as-the-honest-middle-ground","HeroCtl as the honest middle ground",[12,30836,30837],{},"To be direct with you: this blog is HeroCtl's, and the pitch exists. But it only makes sense for those who fit.",[12,30839,30840],{},"HeroCtl preserves the operational model of simple self-hosted panels — an executable file, install on Linux servers, manage everything from an embedded panel. 
The learning curve is hours, not weeks. You don't need to learn new vocabulary, abstract concepts, mandatory adjacent tools. Spin up the application, open the panel, see what's running.",[12,30842,30843,30844,30847,30848,30850],{},"But, unlike simple panels, HeroCtl runs as a real replicated cluster from day one. Three or more servers and the control plane survives the crash of any of them, without manual intervention. The coordinator election happens in ",[27,30845,30846],{},"about 7 seconds"," after a hard crash — tested in lab with repeated ",[231,30849,23189],{},". You answer \"what's the SLA?\" to the first serious customer with a straight face.",[12,30852,30853,30854,30857],{},"And without the colossus complexity. No 300-line manifests — the equivalent in HeroCtl runs in ",[27,30855,30856],{},"about 50 lines",". No specialized operators for each subsystem — router, certificates, metrics and logs come embedded. No assembling three observability products — the public stack lives inside the cluster itself.",[12,30859,30860,30861,30863],{},"Practical application range: ",[27,30862,22806],{},". Above that, the colossus ecosystem gives you tools we don't yet have, and we're honest about that. Below that, the operational savings are large — it generally swaps two on-call platform engineers for one part-time developer managing the entire stack.",[12,30865,30866,30867,30870,30871,30874,30875,30878,30879,30882,30883,30886],{},"The numbers from the public cluster to calibrate expectations: ",[27,30868,30869],{},"4 servers totaling 5 vCPUs and 10 GB of RAM",", running ",[27,30872,30873],{},"16 containers"," that serve ",[27,30876,30877],{},"5 distinct sites"," with automatic TLS. The control plane occupies ",[27,30880,30881],{},"between 200 and 400 MB per server",". 
By comparison, the control plane of a managed version of the colossus starts at about ",[27,30884,30885],{},"700 MB per master node"," before any application comes up.",[19,30888,30890],{"id":30889},"practical-decision-decision-tree","Practical decision (decision tree)",[12,30892,30893],{},"For those who need an answer today, without reading the entire post again:",[67,30895,30896,30902,30908,30914,30920],{},[70,30897,30898,30901],{},[27,30899,30900],{},"Do you have a dedicated platform team of 3 or more people, today, hired?"," If yes → Kubernetes is viable. If no → continue.",[70,30903,30904,30907],{},[27,30905,30906],{},"Do you have a formal compliance requirement that lists Kubernetes nominally?"," If yes → Kubernetes is mandatory. If no → continue.",[70,30909,30910,30913],{},[27,30911,30912],{},"Do you operate 100 or more servers in production, today?"," If yes → Kubernetes or an equivalent orchestrator is the right choice. If no → continue.",[70,30915,30916,30919],{},[27,30917,30918],{},"Do you have a single server and zero SLA pressure?"," If yes → simple self-hosted panel solves it. If no → continue.",[70,30921,30922,30925],{},[27,30923,30924],{},"Are you between 2 and 500 servers, with a real SLA requirement, and want to avoid the colossus platform bill?"," → HeroCtl is the honest middle ground.",[12,30927,30928],{},"Almost every pre-Series B SaaS team in Brazil in 2026 falls into case 5. Whoever falls into case 1 or 3 already knows it and probably isn't reading this post. Whoever falls into case 2 has regulation dictating the answer.",[19,30930,20283],{"id":20282},[12,30932,30933,30936],{},[27,30934,30935],{},"But what if I grow and need the colossus later?","\nGrow first. Migrate later. Migrating 50 servers running HeroCtl to Kubernetes is a one-quarter project with a small team — much cheaper than carrying the colossus platform for two years without needing it. 
And when you actually need it, you'll have revenue to hire whoever operates it.",[12,30938,30939,30942],{},[27,30940,30941],{},"Can I use Kubernetes for only part of the stack?","\nYou can, and sometimes it makes sense. A specific workload that benefits from a mature specialized operator (Spark batch, large-scale ML inference) can run in a dedicated colossus cluster, while the rest of the product runs on something simpler. The cost is maintaining two operational models — worth it if the isolated subset really pays off.",[12,30944,30945,30948],{},[27,30946,30947],{},"What if my customer requires Kubernetes contractually?","\nThen Kubernetes becomes a sales requirement, not an architecture decision. If the revenue justifies it, adopt for the contract in question. But require the item in writing — often the customer \"requires Kubernetes\" because their technical team assumed it, without anyone writing it into a clause. Worth checking.",[12,30950,30951,30954],{},[27,30952,30953],{},"What if I already invested 2 years learning Kubernetes?","\nThe knowledge isn't lost. Kubernetes will continue to be relevant for decades — it's the standard infrastructure of large companies, and large companies don't disappear. You have applicable intellectual capital when the company grows, or when changing jobs to one that actually needs it. The choice not to use it now doesn't invalidate the learning; it only postpones the use.",[12,30956,30957,30960],{},[27,30958,30959],{},"Are there cases of migrating from Kubernetes to something simpler?","\nYes, and it's more common than publicly discussed. Companies that adopted early, realized requirements didn't justify it, and migrated to reduce operational cost and regain delivery speed. It generally stays out of blog posts because admitting \"we backtracked\" is uncomfortable. 
But it happens, and almost always the team reports greater productivity afterward.",[12,30962,30963,30966,30967,30970,30971,30974],{},[27,30964,30965],{},"How long does it take a team to learn HeroCtl versus Kubernetes?","\nHeroCtl: a competent developer runs the entire stack in production in ",[27,30968,30969],{},"one day",". The documentation fits in a long post, and the core concepts number about five. Kubernetes: the official tutorial takes a week, and the time to \"operate confidently in production\" varies from ",[27,30972,30973],{},"three to twelve months",", depending on the complexity of the adjacent stack. It's not the same scale.",[12,30976,30977,30980],{},[27,30978,30979],{},"And for those who can't pay anything for a platform?","\nThe HeroCtl Community plan is permanently free. No artificial server limit, no feature gate on high availability, no expiration. You run the entire stack described here — coordination, router, certificates, metrics and logs — without paying anything. Paid plans (Business and Enterprise) add things that only matter when the company grows: corporate SSO, detailed auditing, source code escrow, SLA support. The current price contract is frozen for those who sign today — no clause that allows retroactive change.",[19,30982,3309],{"id":3308},[12,30984,30985],{},"The question we opened with isn't rhetorical. \"Do I need this for what I'm building?\" deserves an honest answer sized to your real problem, without the filter of résumé pressure, investor meetings, or last week's conference talk.",[12,30987,30988],{},"For most pre-Series B SaaS teams in Brazil in 2026, the honest answer is no. Not because Kubernetes is bad — it's excellent for what it was designed for. But because what it was designed to solve isn't your current problem. You're planting a seedling; it's an excavator.",[12,30990,30991],{},"There's a middle path. 
If you want to try:",[224,30993,30994],{"className":226,"code":5318,"language":228,"meta":229,"style":229},[231,30995,30996],{"__ignoreMap":229},[234,30997,30998,31000,31002,31004,31006],{"class":236,"line":237},[234,30999,1220],{"class":247},[234,31001,2957],{"class":251},[234,31003,5329],{"class":255},[234,31005,2963],{"class":383},[234,31007,2966],{"class":247},[12,31009,31010],{},"Runs on any Linux server with Docker. No signup, no card, no phone-home. If you like it, scale to three servers and gain real high availability. If you don't like it, uninstall and go back to what you were using — without locked-in data, without embedded dependency.",[12,31012,31013,31014,31016],{},"The long history of why we chose to build this, instead of pointing to one of the existing alternatives, is at ",[3336,31015,6545],{"href":6545},". It's the complementary reading for those who want to understand the reasoning before the tool.",[12,31018,31019],{},"The intent is simple: container orchestration, without ceremony. And without the cost of the colossus when you don't need it.",[3350,31021,4376],{},{"title":229,"searchDepth":244,"depth":244,"links":31023},[31024,31025,31026,31027,31028,31029,31030,31031,31032,31033,31034],{"id":30385,"depth":244,"text":30386},{"id":16836,"depth":244,"text":16837},{"id":17157,"depth":244,"text":17158},{"id":30517,"depth":244,"text":30518},{"id":30568,"depth":244,"text":30569},{"id":30631,"depth":244,"text":30632},{"id":30677,"depth":244,"text":30678},{"id":30833,"depth":244,"text":30834},{"id":30889,"depth":244,"text":30890},{"id":20282,"depth":244,"text":20283},{"id":3308,"depth":244,"text":3309},"2025-11-17","80% of teams adopting Kubernetes don't need it. The math is direct: SRE salary × time to first feature shipped. 
When it's worth it to skip the colossus.",{},{"title":30371,"description":31036},{"loc":15780},"en\u002Fblog\u002Fkubernetes-overkill-when-you-dont-need-it",[20384,31042,31043,6394],"complexity","architectural-decision","Y4NSepMIaUKCg3MLmlWk9ZUXitHONbOpo_SfK6Iwfcg",{"id":31046,"title":21724,"author":7,"body":31047,"category":20632,"cover":3379,"date":31493,"description":31494,"draft":3382,"extension":3383,"lastReviewed":3379,"meta":31495,"navigation":411,"path":6545,"readingTime":26401,"seo":31496,"sitemap":31497,"stem":31498,"tags":31499,"__hash__":31501},"blog_en\u002Fen\u002Fblog\u002Fwhy-we-built-heroctl.md",{"type":9,"value":31048,"toc":31482},[31049,31052,31056,31059,31064,31073,31076,31080,31086,31089,31092,31096,31099,31110,31113,31116,31120,31123,31154,31157,31159,31162,31322,31329,31333,31336,31339,31342,31345,31348,31352,31355,31361,31367,31373,31379,31381,31387,31393,31399,31405,31420,31423,31429,31435,31439,31442,31477,31480],[12,31050,31051],{},"Every team running a cluster in production today has to choose between three paths, and none of them is good enough for a team of five trying to ship a SaaS.",[19,31053,31055],{"id":31054},"the-painful-path-kubernetes","The painful path: Kubernetes",[12,31057,31058],{},"You open a \"hello world\" manifest and it has 300 lines. You add a templating manager to organize it — now it's 300 lines plus 200 lines of templates. You decide to use a managed cloud version to avoid maintaining the control plane — you pay US$73\u002Fmonth per cluster, plus NAT, plus Application Load Balancer. Need automatic TLS? Install a specialized operator. Metrics? Another operator. Routing between services with encryption? Two more operators and two days studying service mesh. Centralized logs? Yet another stack.",[12,31060,31061,31062,101],{},"The complexity isn't accidental. The system is a platform for building platforms — built by a team that needed to orchestrate 100,000 machines. 
When a startup with four servers adopts the same tool, it's using an excavator to plant a sapling. We covered the general thesis in ",[3336,31063,15781],{"href":15780},[31065,31066,31067],"blockquote",{},[12,31068,31069,31070],{},"\"Most development teams find Kubernetes overkill for dev environments.\"\n— ",[179,31071,31072],{},"Top 13 Kubernetes Alternatives 2026",[12,31074,31075],{},"The real cost isn't infra, it's the team. Serious operators of this system charge six-figure salaries. You need at least one on staff — preferably two for on-call. They're your first hire after the CTO. Before the designer, before the second product dev, before anything that delivers value to the user.",[19,31077,31079],{"id":31078},"the-easy-path-modern-self-hosted-panels","The easy path: modern self-hosted panels",[12,31081,31082,31083,101],{},"A single install command on one server, open the panel, deploy in five minutes. It works. The two leaders of this segment together have 80,000 stars in public repositories. The community exploded over the last two years because it solved the right problem: most teams don't need the colossus, they need ",[3336,31084,31085],{"href":19749},"self-hosted Heroku",[12,31087,31088],{},"The problem only shows up later. You grow, a customer asks about SLA, the single server becomes a single point of failure. You try to replicate it across two or three servers — these panels have no distributed consensus, no leader election. They're web applications on top of Docker. Elegant for one server; fragile for three.",[12,31090,31091],{},"When your first serious customer asks \"what's the SLA?\", you'll have to answer \"best-effort\" or start migrating — probably to the colossus. Starting from scratch in the company's second year.",[19,31093,31095],{"id":31094},"the-technical-path-that-existed","The technical path that existed",[12,31097,31098],{},"There's an orchestrator that's technically what you want. 
Single binary, real distributed consensus, multi-tenant, scales to thousands of nodes. The vendor spent eight years polishing it, and those who ran it in production have nothing to complain about regarding the core.",[12,31100,31101,31102,31105,31106,31109],{},"But in ",[27,31103,31104],{},"August 2023"," the vendor changed the license from a legitimate OSS one to a \"source available\" license that restricts commercial use. In ",[27,31107,31108],{},"February 2025",", the company was acquired by a conglomerate historically known for five-year contracts and platform lock-in. Today that orchestrator is part of the conglomerate's portfolio — and the license prevents you from offering the technology as a service or embedding it in a product without commercial licensing.",[12,31111,31112],{},"For companies that already had it in production, it's a manageable problem. For you adopting today in 2026, it's a big asterisk: the next critical feature might ship only to the paid version, or the license might change again in the next reorganization.",[12,31114,31115],{},"The lesson we drew from this isn't \"open source or nothing\" — it's \"publish the commercial contract from day one, no retroactive change\". Honest commercial software is better than open software that turns commercial halfway through. 
The technical orchestrator's problem wasn't going paid; it was changing the rules for those who had already bet on it.",[19,31117,31119],{"id":31118},"the-gap","The gap",[12,31121,31122],{},"None of the three paths combines:",[2734,31124,31125,31131,31136,31142,31148],{},[70,31126,31127,31130],{},[27,31128,31129],{},"Single binary"," (operationally simple)",[70,31132,31133,31135],{},[27,31134,16324],{}," (consensus across multiple servers, leader election, durability)",[70,31137,31138,31141],{},[27,31139,31140],{},"Heroku-like experience"," (no extensive orchestration files, web panel, automatic certificates)",[70,31143,31144,31147],{},[27,31145,31146],{},"Explicit commercial contract from day one"," (permanent free plan, published paid plans — no retroactive change of terms)",[70,31149,31150,31153],{},[27,31151,31152],{},"Batteries included"," (routing, service mesh, metrics — without assembling five products)",[12,31155,31156],{},"Modern panels have the experience and the free contract, but lose on high availability. The technical orchestrator has HA but changed the contract with those who had already adopted and never prioritized experience. The colossus has all of this only if you assemble it manually — and \"manually\" costs a team.",[19,31158,4824],{"id":4823},[12,31160,31161],{},"The table below is the honest version of the decision. 
There's no column without a caveat — every orchestrator is a set of tradeoffs, and ours is too.",[119,31163,31164,31181],{},[122,31165,31166],{},[125,31167,31168,31170,31173,31176,31179],{},[128,31169,2982],{},[128,31171,31172],{},"Colossus (K8s)",[128,31174,31175],{},"Self-hosted panel",[128,31177,31178],{},"Ex-OSS orchestrator",[128,31180,2994],{},[141,31182,31183,31197,31211,31225,31238,31251,31263,31276,31292,31306],{},[125,31184,31185,31188,31191,31193,31195],{},[146,31186,31187],{},"Install time",[146,31189,31190],{},"4 hours to 4 days",[146,31192,27211],{},[146,31194,16314],{},[146,31196,27211],{},[125,31198,31199,31201,31203,31206,31209],{},[146,31200,29541],{},[146,31202,3047],{},[146,31204,31205],{},"30 (UI)",[146,31207,31208],{},"80–120",[146,31210,3041],{},[125,31212,31213,31215,31218,31221,31223],{},[146,31214,16324],{},[146,31216,31217],{},"Yes, with 5+ components",[146,31219,31220],{},"No (single-server)",[146,31222,3064],{},[146,31224,3064],{},[125,31226,31227,31229,31232,31234,31236],{},[146,31228,26109],{},[146,31230,31231],{},"External operator",[146,31233,7122],{},[146,31235,31231],{},[146,31237,7122],{},[125,31239,31240,31242,31245,31247,31249],{},[146,31241,23287],{},[146,31243,31244],{},"Specialized operator",[146,31246,3058],{},[146,31248,31244],{},[146,31250,7122],{},[125,31252,31253,31255,31257,31259,31261],{},[146,31254,25616],{},[146,31256,30784],{},[146,31258,19345],{},[146,31260,23311],{},[146,31262,25621],{},[125,31264,31265,31267,31269,31271,31273],{},[146,31266,22779],{},[146,31268,30796],{},[146,31270,19345],{},[146,31272,23311],{},[146,31274,31275],{},"Built-in single writer",[125,31277,31278,31280,31283,31286,31289],{},[146,31279,22811],{},[146,31281,31282],{},"Free + high operational cost",[146,31284,31285],{},"Free (single-server)",[146,31287,31288],{},"Restricted commercial (was free until 2023)",[146,31290,31291],{},"Permanent free plan + paid 
Business\u002FEnterprise",[125,31293,31294,31296,31299,31301,31304],{},[146,31295,16398],{},[146,31297,31298],{},"1–2 dedicated SREs",[146,31300,22187],{},[146,31302,31303],{},"1 dedicated SRE",[146,31305,22187],{},[125,31307,31308,31310,31313,31316,31319],{},[146,31309,5013],{},[146,31311,31312],{},"50+ machines",[146,31314,31315],{},"1 machine",[146,31317,31318],{},"5–500 machines",[146,31320,31321],{},"1–500 machines",[12,31323,31324,31325,31328],{},"The column that matters is the second-to-last: ",[27,31326,31327],{},"minimum team to operate",". That's where the real cost lives. The other criteria are the explanation of why.",[19,31330,31332],{"id":31331},"what-were-building","What we're building",[12,31334,31335],{},"HeroCtl is a single executable file that you install on N Linux servers with Docker. The first three become the quorum for the replicated control plane. You submit jobs via CLI, API, or built-in web panel — the cluster decides where to run, performs health checks, manages rolling deployments, automatically issues Let's Encrypt certificates via the integrated router.",[12,31337,31338],{},"No CRDs, no specialized operators, no charts. The job spec is a simple configuration file (50 lines for app+ingress+secrets, not 300). Encryption between services and automatic PKI come built-in. Persistent metrics run as a job of the system itself. Logs with single-writer architecture (no assembling Fluentd, no assembling Loki).",[12,31340,31341],{},"Today the public stack runs in production: four nodes on a cloud provider, five sites with automatic TLS, sixteen containers, zero downtime on rolling deployments. The cluster survived a complete chaos battery: kill -9 on the coordinating server (election in seven seconds), 30-second network partition, quorum loss, disk wipe, forced drain. Each of these scenarios becomes its own post.",[12,31343,31344],{},"The practical result is a radically shorter operational model. 
Bringing up a new application is three steps: you describe the service in a fifty-line config file, submit via CLI, and the cluster decides where to run, opens a port, registers with the router, issues a Let's Encrypt certificate, and starts serving traffic. Updating is a fourth step: change the image version in the file, submit again, and the cluster orchestrates the rolling replacement — no maintenance window, no feature flag, no manual traffic migration.",[12,31346,31347],{},"Debugging is the real test of any orchestrator. When something goes wrong at three in the morning, you need a short path between \"the site is down\" and \"I know exactly what happened\". In HeroCtl, that path runs through one place: the panel shows which container failed, on which server it was running, the last log before it died, metrics from the last few minutes, version history. No grepping through three different products, no reconstituting context from five dashboards, no switching between tools from different vendors just to understand a failure.",[19,31349,31351],{"id":31350},"when-heroctl-isnt-for-you","When HeroCtl isn't for you",[12,31353,31354],{},"Honesty is the defense mechanism of a new tool: saying where it doesn't fit is what keeps the product focused. Four profiles where we recommend a different path.",[12,31356,31357,31360],{},[27,31358,31359],{},"You operate at the level of hundreds of thousands of machines.","\nCompanies that run ten thousand nodes or more chose the colossus for a real reason: it was designed for that size. HeroCtl is honest about the ceiling: we tested up to hundreds of nodes in the lab, validated several dozen in customer production, and the roadmap targets the \"1 to 500 servers\" range. 
Above that, the colossus ecosystem gives you tools we don't yet have — and building them just to serve 0.1% of cases isn't a priority.",[12,31362,31363,31366],{},[27,31364,31365],{},"You have compliance requirements that list tools by name.","\nSome audit frameworks (FedRAMP, ITAR, certain government contracts) require the stack to run on specific pre-approved components. HeroCtl is too young to be on those lists. If your compliance officer needs to point to an existing certificate, today the right answer is the colossus or the ex-OSS orchestrator. But if you need a tool's name on the audit list, it's not HeroCtl yet.",[12,31368,31369,31372],{},[27,31370,31371],{},"You need a deep library of specialized operators.","\nThe colossus ecosystem has hundreds of off-the-shelf operators — Postgres with automatic replication, Kafka with balancing, Cassandra with bootstrap. If your architecture depends on four of these operators running in production from day one, HeroCtl doesn't replace them. Our proposal is different: you run your Postgres as a regular job, handling backup and replication like a human does — not delegating to an operator that took three years to stabilize.",[12,31374,31375,31378],{},[27,31376,31377],{},"You want multi-cloud with workloads moving between providers in real time.","\nHeroCtl runs on any Linux server with Docker, so technically you can mix providers. But the primitives to move encrypted storage between regions, replicate databases to another provider with automatic failover, or orchestrate virtual networks between clouds — the colossus ecosystem solves that better today. It's on our roadmap, not in the current version.",[19,31380,7347],{"id":7346},[12,31382,31383,31386],{},[27,31384,31385],{},"Is HeroCtl just another Docker wrapper?","\nNo. Docker wrappers don't do consensus between servers, don't elect a coordinator, don't survive node loss with automatic work redistribution. 
HeroCtl is a replicated control plane that coordinates agents on each server. Docker stays as the container runtime — an implementation choice, not the product's substance.",[12,31388,31389,31392],{},[27,31390,31391],{},"What if the company behind HeroCtl goes under?","\nThree protections, technical and contractual. First, the binary has no mandatory phone-home — once installed, your cluster keeps working without ever talking to our server. There's no remote kill-switch, no periodic activation that expires. Second, Enterprise contracts include source code escrow: if the company ceases operations, the code is delivered to paying customers via a third-party custodian, with a license for internal continuity. Third, the price contract you sign today is frozen — there's no clause allowing retroactive change of terms. What happened to the technical orchestrator in 2023 and in 2025 is structurally prevented here.",[12,31394,31395,31398],{},[27,31396,31397],{},"How much RAM and CPU does it consume on a small cluster?","\nThe public demo cluster runs on four servers totaling five vCPUs and ten gigabytes of RAM, with sixteen active containers serving five sites. The control plane occupies between 200 and 400 MB per server — leaving plenty for real workloads. For comparison, the control plane of a managed version of the colossus starts at about 700 MB per master node before any application comes up.",[12,31400,31401,31404],{},[27,31402,31403],{},"Can I migrate from the ex-OSS orchestrator to HeroCtl?","\nYes. The primitives are similar (job, group, task; cluster with replicated control plane; agents on each server). The big difference is in the configuration file — ours is shorter and has fewer abstractions. For teams with a few dozen jobs, migration is manual and takes an afternoon. Above that, we have an experimental converter that covers the common cases. 
Write to us if that's your case.",[12,31406,31407,31410,31411,31413,31414,31416,31417,31419],{},[27,31408,31409],{},"How does payment work?","\nThree plans with a clear line between them. ",[27,31412,4351],{}," is free forever, with no server limit, no job limit, and no artificial feature gates — it runs the entire stack described above, including HA, router, automatic certificates, metrics, and logs. Individuals and small teams never need to leave it. ",[27,31415,4355],{}," adds SSO\u002FSAML, granular RBAC, detailed auditing, managed backup, and SLA support — for teams with formal platform requirements. ",[27,31418,4359],{}," adds source code escrow, a continuity contract, 24×7 support, and dedicated development.",[12,31421,31422],{},"Business and Enterprise prices are published on the plans page — no mandatory \"talk to sales\". The cutoff line is drawn so you only pay when the company is large enough that SSO and auditing are real requirements, not a preference.",[12,31424,31425,31428],{},[27,31426,31427],{},"Is it production-ready?","\nIt's been running the public stack for six months, survived a documented battery of chaos scenarios, and supports the blog you're reading now. \"Ready\" depends on your risk appetite and the size of your team. For an indie hacker with three servers and a US$10k MRR SaaS, it's more than ready. For a bank regulated by three agencies, wait a few more quarters and talk to us about Business Edition first.",[12,31430,31431,31434],{},[27,31432,31433],{},"Where does sensitive data (secrets, certificates, configurations) live?","\nIn the cluster itself, encrypted at rest. The cluster is the vault — there's no mandatory external vault service. 
If you want to integrate with an external cloud vault (cloud provider KMS), there's an extension point, but the default configuration is self-sufficient.",[19,31436,31438],{"id":31437},"whats-coming-in-the-next-posts","What's coming in the next posts",[12,31440,31441],{},"The blog's intent is technical and direct: no marketing fluff.",[2734,31443,31444,31450,31465,31471],{},[70,31445,31446,31449],{},[27,31447,31448],{},"Engineering",": how consensus is configured, how the defense against zombie containers holding ports works, why we chose an in-memory snapshot over a persisted bitmap for port allocation",[70,31451,31452,6562,31455,571,31457,571,31460,571,31462,31464],{},[27,31453,31454],{},"Comparisons",[3336,31456,16690],{"href":16689},[3336,31458,31459],{"href":23603},"HeroCtl vs Nomad",[3336,31461,29863],{"href":25875},[3336,31463,16695],{"href":16694}," — real numbers, not opinion",[70,31466,31467,31470],{},[27,31468,31469],{},"Case studies",": setup with 1 server (replacing a simple panel), 3 servers (real HA), 10+ servers (scale)",[70,31472,31473,31476],{},[27,31474,31475],{},"Releases",": narrative changelog of features that ship",[12,31478,31479],{},"If you're a developer feeling that the colossus is too much and the self-hosted panel is too little, stick around. If you operate the technical orchestrator and are uncertain about the post-acquisition future, write to us — there's a migration path.",[12,31481,26382],{},{"title":229,"searchDepth":244,"depth":244,"links":31483},[31484,31485,31486,31487,31488,31489,31490,31491,31492],{"id":31054,"depth":244,"text":31055},{"id":31078,"depth":244,"text":31079},{"id":31094,"depth":244,"text":31095},{"id":31118,"depth":244,"text":31119},{"id":4823,"depth":244,"text":4824},{"id":31331,"depth":244,"text":31332},{"id":31350,"depth":244,"text":31351},{"id":7346,"depth":244,"text":7347},{"id":31437,"depth":244,"text":31438},"2025-11-12","Kubernetes demands an SRE team. Simple panels lack real high availability. 
The closest technical competitor changed its license and was acquired. An alternative was missing — so we built one.",{},{"title":21724,"description":31494},{"loc":6545},"en\u002Fblog\u002Fwhy-we-built-heroctl",[31500,20384,7507,27509],"manifesto","dkaPiqvddmpD5mozdQ3ltHs2pyEMxIwmDMDys_wIx0g",1777362202794]