502 Bad Gateway, Defined for Developers Who Have to Fix One

A 502 Bad Gateway means the server you are talking to was acting as a gateway or proxy and got an invalid response from the server behind it. That is the formal definition. The working definition is “something in front of my app is asking something behind my app for a response, the response is broken, and the front has to admit to you that it is broken.” The error is a confession, not a diagnosis. The diagnosis is what came back that was bad, and from where.

“Define 502 bad gateway” is its own question, and not just “an error,” because the error is a load-bearing signal. It tells the developer that the front door and the back end are out of sync: a deploy rolled back, a health check is missing, or a timeout expired before the upstream answered. The shape of the bug is usually “the response was not a valid HTTP response” or “the response never came.” The shape of the fix depends on which one.

The short version
The four shapes a 502 actually takes
The five places a 502 is born
The first ten minutes of debugging a 502
The seven fixes that work in production
The mistakes that turn a 502 into a 502 storm
How 502s shape the platform decision
FAQ

The short version

A 502 is the gateway admitting that the upstream did not give it a usable response. The gateway is the front door: usually a load balancer, a reverse proxy, a CDN, or the platform’s edge layer. The upstream is the back end: usually your application, your API, your worker, or another service in the chain. “Invalid” can mean that the response was a non-HTTP blob, that it was an HTTP response with a malformed status line, that the connection was reset before the response came, or that the response never came at all. (Strictly, a pure timeout is a 504 Gateway Timeout, but some gateways report it as a 502.) The gateway returns 502 to the client so the client knows the failure is on the back end, not in its own request.

The four shapes a 502 actually takes

The error code is one number, but the underlying failure has four distinct shapes. The shape changes what the developer should look at first.

The upstream was unreachable. The gateway tried to open a TCP connection, but the connection was refused. The upstream is down, the port is wrong, the network ACL is blocking the connection, or the platform has not yet started the new instance. The gateway never gets a usable socket and returns 502.

The upstream was too slow. The gateway opened a TCP connection, sent the request, and waited. The upstream never finished the response before the gateway’s timeout expired. By the HTTP specification this case is a 504 Gateway Timeout, and many gateways report it that way, but some gateways and platforms return 502 instead, so it belongs on this list. The upstream might be slow, the database might be slow, the upstream might be in a deploy loop, or the upstream might be CPU-starved.

The upstream reset the connection. The gateway sent the request, the upstream accepted it, and then mid-response the upstream closed the connection (RST or FIN). The gateway gets a partial response it cannot use, returns 502. The upstream likely crashed, hit a memory limit, or was killed by a process supervisor.

The upstream sent garbage. The gateway got a response, but the response was not parseable as HTTP. The response might be HTML from a captive portal, a JSON blob from a debug endpoint, a raw stack trace, or a non-HTTP service listening on the port. The gateway returns 502 because the response was technically a response, but not one the gateway can forward.

The four shapes map to four different fixes. The first is “the upstream is not there.” The second is “the upstream is too slow.” The third is “the upstream is crashing.” The fourth is “something else is on that port.” Conflating them is the most common mistake.
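The four shapes can be told apart with a single direct probe. The sketch below is one possible way to do it with the standard library; the function name `classify_upstream` and the returned labels are illustrative assumptions, and a real gateway applies its own, more detailed rules.

```python
import socket


def classify_upstream(host: str, port: int, timeout: float = 5.0) -> str:
    """Probe an upstream once and name the failure shape it shows.

    Returns "unreachable", "slow", "reset", "garbage", or "ok".
    """
    request = (
        f"GET / HTTP/1.1\r\nHost: {host}\r\nConnection: close\r\n\r\n"
    ).encode("ascii")
    try:
        with socket.create_connection((host, port), timeout=timeout) as sock:
            sock.sendall(request)
            first_bytes = sock.recv(4096)
    except ConnectionRefusedError:
        return "unreachable"  # nothing is listening on that port
    except socket.timeout:
        return "slow"  # the connect or the first read exceeded the timeout
    except (ConnectionResetError, BrokenPipeError):
        return "reset"  # the peer dropped the connection on us
    except OSError:
        return "unreachable"  # DNS failure, no route to host, and similar
    if not first_bytes:
        return "reset"  # closed before sending a single byte
    # A valid HTTP/1.x response starts with a status line like "HTTP/1.1 200 OK".
    return "ok" if first_bytes.startswith(b"HTTP/") else "garbage"
```

The probe only looks at the first bytes of the response, so it will not catch a connection that is reset halfway through a large body. It is meant to narrow the search, not to replace the gateway log.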

The five places a 502 is born

The error code is a single number, but it can be born in any layer of the stack. The place changes the diagnostic.

At the CDN edge. Cloudflare, Fastly, CloudFront, or another CDN sits in front of the origin. The CDN tried to talk to the origin, the origin gave a bad response, the CDN returns 502 to the client. The fix lives in the origin, not the CDN. The CDN’s “502 error” page is a hint that the origin is the problem.

At the load balancer. NGINX, HAProxy, Envoy, or the platform’s load balancer. The load balancer tried to talk to the upstream pool, one or more backends gave a bad response, the load balancer returns 502. The fix is in the backend, in the load balancer config, or in the health check.

At the reverse proxy. NGINX in front of an app server, Caddy in front of a Node process, or the platform’s reverse proxy. The proxy tried to talk to the app, the app gave a bad response, the proxy returns 502. The fix is in the app, in the proxy config, or in the upstream timeout.

At the platform’s edge. The platform that hosts the app, the API, or the worker. The platform tried to talk to the user’s service, the service was not ready, the platform returns 502. The fix is in the service’s health check, in the start command, or in the readiness probe.

At the application itself. The app made an outbound HTTP call to a third party, the third party gave a bad response, the app returned 502 to its caller. This is the only place where the “upstream” is not the developer’s own infrastructure. The fix is in the third party, in the timeout, in the retry policy, or in the circuit breaker.
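For the last case, where the application is itself the gateway, the mapping from outbound failure to status code might look like the sketch below. The function name `call_third_party` and the response bodies are hypothetical, and turning a third party's 5xx into a 502 is a design choice, not a rule.

```python
import http.client
import socket
import urllib.error
import urllib.request


def call_third_party(url: str, timeout: float = 3.0) -> tuple[int, bytes]:
    """Call a third party and return (status, body) for our own caller."""
    try:
        with urllib.request.urlopen(url, timeout=timeout) as resp:
            return resp.status, resp.read()
    except urllib.error.HTTPError as exc:
        # The third party answered with a valid HTTP error response.
        if exc.code >= 500:
            return 502, b"upstream returned a server error"
        return exc.code, exc.read()
    except urllib.error.URLError as exc:
        # Raised while connecting: refused, DNS failure, connect timeout.
        if isinstance(exc.reason, socket.timeout):
            return 504, b"upstream timed out"
        return 502, b"upstream unreachable"
    except socket.timeout:
        # Raised while waiting for the response after the request was sent.
        return 504, b"upstream timed out"
    except (http.client.HTTPException, ConnectionError):
        # Malformed status line, reset, or disconnect before a response.
        return 502, b"upstream sent an invalid response"
```

`HTTPError` is caught before `URLError` because it is a subclass of it; reversing the order would hide the third party's real status code.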

The first ten minutes of debugging a 502

The clock is ticking. The site is throwing 502s. The team is in a channel. The first ten minutes are the time to gather signal, not to deploy a fix. The signal is what tells the team which of the four shapes they are looking at, and which of the five places the error is born.

Step 1: read the gateway log. The CDN log, the load balancer log, the platform’s edge log. The log says which upstream was hit, what the response was, how long the gateway waited, and whether the connection was reset. The line is a hint, not a verdict, but it is the starting point.

Step 2: hit the upstream directly. Bypass the gateway. Curl the upstream on the port the gateway is using, with the same headers, and see what comes back. If the upstream returns a clean 200, suspect the gateway, its config, or the network path between the two. If the upstream returns garbage, or nothing at all, the bug is in the upstream.

Step 3: read the upstream log. The app log, the worker log, the start log. The log says whether the request was received, what the app did with it, what the database said, and whether the process crashed. The log is the source of truth for the upstream’s view.

Step 4: check the deploy. A 502 storm that started five minutes after a deploy is almost always the deploy. Roll back. The rollback is the diagnostic. If the rollback clears the storm, the deploy was the cause. If the rollback does not clear the storm, the deploy was not the cause and the team has more signal to gather.

Step 5: check the resource limits. Memory, CPU, file descriptors, connection pool. A 502 that starts during traffic growth is usually resource exhaustion. The fix is more memory, fewer connections per upstream, or a tighter timeout. The platform’s metrics dashboard is the place to look first.
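Step 2 above is usually done with curl, but the same direct probe can be scripted so the result lands in the incident channel as data. This is a minimal sketch using the standard library; the function name `probe_upstream` and the shape of the returned dictionary are assumptions made for illustration.

```python
import http.client
import time


def probe_upstream(host, port, path="/", headers=None, timeout=10.0):
    """Send one GET straight to the upstream, bypassing the gateway."""
    conn = http.client.HTTPConnection(host, port, timeout=timeout)
    started = time.monotonic()
    try:
        # Pass the headers the gateway would send, including Host if needed.
        conn.request("GET", path, headers=headers or {})
        resp = conn.getresponse()
        body = resp.read()
        return {
            "status": resp.status,
            "bytes": len(body),
            "seconds": time.monotonic() - started,
            "error": None,
        }
    except (OSError, http.client.HTTPException) as exc:
        # Refused, timed out, reset, or not parseable as HTTP.
        return {
            "status": None,
            "bytes": 0,
            "seconds": time.monotonic() - started,
            "error": f"{type(exc).__name__}: {exc}",
        }
    finally:
        conn.close()
```

A call such as `probe_upstream("10.0.0.12", 8080, headers={"Host": "example.com"})`, with a placeholder address and host name, returns either a status and a timing or the name of the exception, which is often enough to pick the failure shape.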

The seven fixes that work in production

A short, opinionated list of fixes that have actually worked in real production outages. None of them are exotic. They are the boring ones.

Add a health check. A health check is a small endpoint that returns 200 when the service is ready to take traffic. The platform hits the endpoint on the new version, waits for the 200, and only then shifts traffic. A service without a health check is a service the platform cannot deploy safely. The 502 that appears after every deploy is the symptom of a missing health check.
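A health check endpoint can be very small. The sketch below uses only the standard library; the `/healthz` path, the `READY` flag, and the default port are assumptions, and the path the platform actually probes is whatever its configuration says.

```python
import threading
from http.server import BaseHTTPRequestHandler, ThreadingHTTPServer

# Set this once startup work (migrations, cache warm-up, ...) has finished.
READY = threading.Event()


class HealthHandler(BaseHTTPRequestHandler):
    def do_GET(self):
        if self.path != "/healthz":
            self.send_error(404)
            return
        # 503 tells the platform "alive, but do not send traffic yet".
        status, body = (200, b"ok") if READY.is_set() else (503, b"starting")
        self.send_response(status)
        self.send_header("Content-Type", "text/plain")
        self.send_header("Content-Length", str(len(body)))
        self.end_headers()
        self.wfile.write(body)

    def log_message(self, format, *args):
        pass  # keep probe traffic out of the application log


def make_server(port: int = 8080) -> ThreadingHTTPServer:
    """Build the server; the caller decides when to run serve_forever()."""
    return ThreadingHTTPServer(("0.0.0.0", port), HealthHandler)
```

The application would call `READY.set()` at the end of its startup sequence. Until then the endpoint answers 503, so a platform that waits for a 200 does not shift traffic early.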

Tighten the upstream timeout. A gateway that waits 60 seconds for an upstream that normally responds in 2 seconds holds every doomed request, and its connection, for 58 seconds longer than it needs to before returning 502. The fix is to set the timeout to slightly more than the upstream’s p99 response time. The 502 that appears during traffic spikes is the symptom of a too-long timeout.
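“Slightly more than p99” can be computed from measured latencies. This is a rough sketch; the function name `suggest_timeout` and the 25% headroom are assumptions, and the right headroom depends on how spiky the upstream is.

```python
import statistics


def suggest_timeout(latencies_s, headroom=1.25):
    """Return a gateway timeout in seconds: p99 latency plus some headroom."""
    if len(latencies_s) < 2:
        raise ValueError("need at least two latency samples")
    # quantiles(n=100) returns 99 cut points; index 98 is the 99th percentile.
    p99 = statistics.quantiles(latencies_s, n=100, method="inclusive")[98]
    return p99 * headroom
```

The samples should come from normal traffic over a representative window. Samples taken during an outage will inflate the p99 and produce a timeout that is too generous.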

Lower the connection pool on the gateway. A gateway that opens 10,000 connections to a back-end that can handle 1,000 is a gateway that is forcing the back-end to time out on 9,000 of them. The fix is to bound the connection pool to a number the back-end can actually handle. The 502 that appears during traffic growth is the symptom of an unbounded pool.
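Most gateways expose this bound as a configuration setting. Where the gateway is application code, the same idea can be sketched with a semaphore; `UpstreamLimiter` and `UpstreamBusyError` are hypothetical names, and failing fast instead of queueing is one choice among several.

```python
import threading
from contextlib import contextmanager


class UpstreamBusyError(Exception):
    """Raised when every upstream slot is already in use."""


class UpstreamLimiter:
    """Caps the number of requests in flight to one upstream."""

    def __init__(self, max_in_flight: int):
        self._slots = threading.BoundedSemaphore(max_in_flight)

    @contextmanager
    def slot(self):
        # Do not queue: a fast local error beats a slow upstream timeout.
        if not self._slots.acquire(blocking=False):
            raise UpstreamBusyError("upstream is at capacity")
        try:
            yield
        finally:
            self._slots.release()
```

Each upstream call is wrapped in `with limiter.slot():`. When the limit is reached the caller gets `UpstreamBusyError` immediately and can return a fast error to its own client.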

Add retries with a circuit breaker. A gateway that retries every 502 forever is a gateway that turns a brief upstream hiccup into a 502 storm. The fix is to retry a small number of times with backoff, then trip a circuit breaker and return a fast error. The 502 that turns a 1-minute hiccup into a 30-minute outage is the symptom of a missing circuit breaker.
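The breaker half of that fix can be sketched in a few lines. The class below is a simplified illustration, not a production implementation: the names, the thresholds, and the single-trial half-open behavior are assumptions, and it is not thread-safe.

```python
import time


class CircuitOpenError(Exception):
    """Raised when the breaker is open and calls are failing fast."""


class CircuitBreaker:
    def __init__(self, failure_threshold=5, reset_after_s=30.0,
                 clock=time.monotonic):
        self._failure_threshold = failure_threshold
        self._reset_after_s = reset_after_s
        self._clock = clock
        self._failures = 0
        self._opened_at = None  # None means the circuit is closed

    def call(self, func, *args, **kwargs):
        if self._opened_at is not None:
            if self._clock() - self._opened_at < self._reset_after_s:
                raise CircuitOpenError("upstream marked unhealthy")
            # Cool-down elapsed: let one trial call through (half-open).
        try:
            result = func(*args, **kwargs)
        except Exception:
            self._failures += 1
            tripped = self._failures >= self._failure_threshold
            if self._opened_at is not None or tripped:
                self._opened_at = self._clock()  # open, or re-open
            raise
        # Success closes the circuit and clears the failure count.
        self._failures = 0
        self._opened_at = None
        return result
```

While the circuit is open, callers get `CircuitOpenError` without touching the upstream, which gives the upstream room to recover instead of absorbing a wall of retries.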

Make the start command block on the health check. A platform that starts routing traffic the moment the process is launched is a platform that returns 502 until the process is actually ready. The fix is to make the start command block on the health check, so the platform only starts routing after the health check returns 200. The 502 that appears for the first 5–30 seconds of every deploy is the symptom of a too-eager start.
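Blocking on the health check amounts to a polling loop with a deadline. The sketch below shows one way to write it; the function name `wait_until_ready`, the defaults, and the health URL are assumptions, and many platforms provide this as a built-in readiness probe instead.

```python
import http.client
import time
import urllib.request


def wait_until_ready(url, deadline_s=30.0, interval_s=0.5, request_timeout_s=2.0):
    """Poll url until it returns 200; return False if the deadline passes."""
    give_up_at = time.monotonic() + deadline_s
    while time.monotonic() < give_up_at:
        try:
            with urllib.request.urlopen(url, timeout=request_timeout_s) as resp:
                if resp.status == 200:
                    return True
        except (OSError, http.client.HTTPException):
            # Refused, timed out, non-2xx, or malformed: not ready yet.
            # urllib's URLError and HTTPError are both OSError subclasses.
            pass
        time.sleep(interval_s)
    return False
```

A start script would launch the process, call `wait_until_ready` against the process's health endpoint, and exit non-zero if it returns `False`, so a version that never becomes ready fails the deploy instead of serving 502s.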

Pin the upstream port. A reverse proxy that proxies to localhost:5000 is fine until the app changes to localhost:8080 and the proxy is not updated. The fix is to keep the proxy config in version control, in the same repo as the app, and to make the port an environment variable. The 502 that appears after a port change is the symptom of a port mismatch.

Add a request id and a trace. A gateway that returns 502 with no request id is a gateway that asks the developer to guess which request failed. The fix is to add a request id at the gateway, pass it to the upstream, and log it in both. The 502 that takes an hour to debug is the symptom of a missing request id.
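The gateway side of that fix is small: reuse an incoming id if one exists, otherwise mint one, and forward it. The sketch below assumes the header is called `X-Request-Id`, which is a common convention and not a standard; the function name is hypothetical.

```python
import uuid

REQUEST_ID_HEADER = "X-Request-Id"  # assumed name; use whatever your stack uses


def ensure_request_id(headers: dict[str, str]) -> tuple[dict[str, str], str]:
    """Return (headers to forward upstream, request id to log)."""
    wanted = REQUEST_ID_HEADER.lower()
    # Header names are case-insensitive, so compare in lower case.
    existing = next(
        (v.strip() for k, v in headers.items()
         if k.lower() == wanted and v.strip()),
        None,
    )
    request_id = existing or uuid.uuid4().hex
    # Drop any differently-cased or empty copies before setting the header.
    forwarded = {k: v for k, v in headers.items() if k.lower() != wanted}
    forwarded[REQUEST_ID_HEADER] = request_id
    return forwarded, request_id
```

The gateway logs `request_id` on its own line for the request, and the upstream logs the value it receives in the header. A search for the id then returns both sides of the same failed request.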

The mistakes that turn a 502 into a 502 storm

A 502 is a single error. A 502 storm is the same error repeated across every request, every minute, for thirty minutes. The storm is the failure mode that wakes the on-call. The storm is the failure mode that the team is going to be talking about on Monday.

The retry that retries the storm. A client that retries every 502 in a tight loop is a client that is amplifying the storm. The fix is exponential backoff, a small retry count, and a circuit breaker on the client side. The storm that does not clear is the symptom of an unbounded retry.
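A bounded client-side retry with exponential backoff might look like the sketch below. The function name, the defaults, and the use of full jitter (a random delay between zero and the current ceiling) are assumptions chosen for illustration.

```python
import random
import time


def retry_with_backoff(func, max_attempts=3, base_delay_s=0.2, max_delay_s=5.0,
                       sleep=time.sleep, rng=random.random):
    """Call func, retrying failures a bounded number of times."""
    for attempt in range(1, max_attempts + 1):
        try:
            return func()
        except Exception:
            if attempt == max_attempts:
                raise  # out of attempts: surface the last error
            # The ceiling doubles each attempt, capped at max_delay_s.
            ceiling = min(max_delay_s, base_delay_s * 2 ** (attempt - 1))
            # Jitter spreads clients out so they do not retry in lockstep.
            sleep(rng() * ceiling)
```

The retry count stays small on purpose: three attempts from every client already triples the load on a struggling upstream. Retrying is generally only safe for requests that are idempotent.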

The deploy that deploys the bug. A team that ships a deploy that introduces a memory leak, then deploys the bug to every instance, is a team that is going to watch the 502s climb until the instances all OOM. The fix is a canary deploy that catches the bug on 1% of traffic before it hits 100%. The storm that appears five minutes after a deploy is the symptom of a missing canary.

The health check that lies. A health check that returns 200 even when the database is down is a health check that lets the platform shift traffic to instances that cannot serve it. The fix is to make the health check exercise the actual dependencies. The storm that appears when the database wobbles is the symptom of a shallow health check.
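A health check that exercises its dependencies can be sketched as a set of named probes whose results decide the status code. The function name `health_report` and the idea of passing checks as callables are assumptions; the probes themselves (a cheap database query, a cache ping) are left to the application.

```python
from typing import Callable


def health_report(checks: dict[str, Callable[[], None]]) -> tuple[int, dict[str, str]]:
    """Run every dependency check; return (HTTP status, per-check results).

    Each check is a callable that raises if its dependency is unusable.
    """
    results: dict[str, str] = {}
    for name, check in checks.items():
        try:
            check()
            results[name] = "ok"
        except Exception as exc:
            results[name] = f"failed: {exc}"
    # One failed dependency is enough to take the instance out of rotation.
    status = 200 if all(v == "ok" for v in results.values()) else 503
    return status, results
```

The checks should be cheap and have their own short timeouts. A health check that hangs on a slow database turns the probe itself into a source of failed deploys.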

The platform that hides the cause. A platform that returns a generic 502 with no upstream address, no request id, and no log line is a platform that turns a 30-second fix into a 30-minute investigation. The fix is to pick a platform that surfaces the operational truth. The storm that the team could not diagnose is the symptom of a platform that hides the logs.

How 502s shape the platform decision

A team that has been burned by 502s is a team that has opinions about the platform. The opinions are not about the dashboard. They are about the operational behavior. The team wants a platform where the health check is honored, where the deploy is observable, where the rollback is one click, where the logs are searchable, and where the request id flows from the edge to the app. The team does not want a platform that hides the cause.

The services layer is the part of the platform that runs the long-lived API the 502 is happening in front of. The database layer is the part that holds the data the API is slow to query. The static layer is the part that hosts the static site that is on the other side of the CDN. The environment variables are the part that holds the secrets the API needs at runtime.

A platform that handles 502s well is a platform where the team can debug the bug in ten minutes, not in two hours. A platform that handles 502s well is a platform where the rollback is one click, where the health check is the actual health check, where the request id is on every log line, and where the gateway’s log says which upstream was hit and what the response was.

For a team that wants to see the full cost of the project before it commits, the RunxBuild hosting calculator shows the line items together. The API, the database, the storage, the worker, the bandwidth — each one is a separate number, and the team’s mental model for the platform is the sum of those numbers.

FAQ

What does 502 bad gateway mean in simple terms?

A 502 bad gateway means the server you contacted was a middleman (a gateway or proxy), and the server it was trying to reach on your behalf gave a bad response. The middleman is honest enough to tell you the upstream was the problem, not itself. The actual cause is one of four shapes: the upstream was unreachable, the upstream was too slow, the upstream reset the connection mid-response, or the upstream sent a non-HTTP response.

Is 502 the user’s fault?

Almost never. A 502 is nearly always a server-side failure. Reloading may get past a transient 502, but the cause lives on the back end. The exception is when a user is hitting a captive portal, a misconfigured VPN, or a transparent proxy that is intercepting traffic, in which case the “502” is a side effect of the user’s network, not the developer’s app.

How is 502 different from 500?

A 500 is the application itself returning an error. A 502 is a middleman (gateway, proxy, load balancer, CDN) returning an error because the application behind it returned a bad response. The 500 means the developer is in the application’s stack trace. The 502 means the developer is in the gateway’s logs and the upstream’s logs at the same time.

How is 502 different from 504?

A 502 means the upstream gave a bad response. A 504 means the upstream gave no timely response: the gateway timed out waiting. The 502 is the upstream crashing, sending malformed data, or being unreachable. The 504 is the upstream being too slow. Some gateways blur the line and report a timeout as a 502, so check the gateway’s log for the actual cause. The two have different fixes, even though both look like “the upstream is broken” from the client.

Why am I getting 502 errors after a deploy?

A 502 that appears after a deploy is usually one of three things: the new version is not ready yet (the start command did not block on the health check), the new version is crashing on the first request (a missing dependency, a misconfigured env var, a code path that only fires on real traffic), or the new version is too slow to respond within the gateway’s timeout (a slow database query, a synchronous startup task). The fix for the first is a health check. The fix for the second is to roll back and read the logs. The fix for the third is to either make the start faster or make the timeout longer.

Can Cloudflare give a 502 that is not my fault?

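Yes. Cloudflare is a middleman. If Cloudflare cannot get a usable response from the origin (the origin is down, the origin’s IP changed, the firewall is blocking Cloudflare, the TLS cert on the origin expired), it returns an error page of its own. Many of these cases surface as Cloudflare-specific 52x codes rather than a plain 502, and a 502 can also be passed through from the origin unchanged. In all of these cases the error is an honest report that the origin is the problem, not Cloudflare itself, and the fix is on the origin. A 502 caused by a problem inside Cloudflare’s own network is possible but much rarer.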

What is the fastest way to debug a 502?

Read the gateway log, hit the upstream directly with curl, read the upstream log, check the deploy, check the resource limits. The order matters: signal first, fix second. The fastest path is the one that finds the underlying shape (unreachable, slow, reset, garbage) before deploying a guess. A 502 that takes an hour to debug is almost always a 502 where the team deployed a guess before reading the log.