The 504 Gateway Timeout Error, in Plain English

A 504 gateway timeout means the server you contacted was acting as a gateway or proxy, sent a request to the server behind it, and did not get a response back in time. That is the formal definition. The working definition is “I asked the back-end for a response, the back-end was too slow (or did not answer at all), and I am giving up.” The 504 is a timeout, not a crash, not a malformed response, not an auth failure. The 504 is a clock running out.

“The web server reported a gateway time-out error” deserves its own answer, rather than being filed under generic errors, because the 504 is the most performance-sensitive of the 5xx codes. It is what happens when a slow upstream, a slow database, or a slow third-party API becomes slow enough to trip the gateway’s timeout. The shape of the bug is usually “the upstream took longer than the gateway was willing to wait,” and the shape of the fix depends on which timeout, which upstream, and which side the developer can change.

The short version
The four shapes a 504 actually takes
The five places a 504 is born
The first ten minutes of debugging a 504
The seven fixes that work in production
The mistakes that turn a 504 into an outage
How this fits the rest of the stack
FAQ

The short version

A 504 is the gateway admitting that the upstream did not answer before the gateway’s timeout expired. The gateway is the front door: usually a load balancer, a reverse proxy, a CDN, or the platform’s edge. The upstream is the back-end: usually the application, the API, the worker, or another service in the chain. The “did not answer” can mean that the upstream was too slow on a single request, that the upstream was down and the connection was never established, that the upstream accepted the connection but never wrote a response, or that the upstream’s response was streaming and the gateway timed out waiting for the next byte. The 504 is a timeout, and the fix is in the timeout, the upstream, or the network between them.

The four shapes a 504 actually takes

The error code is one number, but the underlying cause has four distinct shapes. The shape changes what the developer should look at first.

The upstream was slow on a single request. The gateway opened a TCP connection, sent the request, and the upstream took longer than the gateway’s timeout to respond. The upstream might be slow on a slow database query, a slow third-party API call, a slow synchronous startup task, or a slow regex on user input. The gateway waited, gave up, returned 504. Other requests through the same gateway kept succeeding; this one alone tripped the timeout.

The upstream is down. The gateway tried to open a TCP connection, but the connection was refused or hung in the TCP handshake. The upstream is not running, the port is wrong, the firewall is blocking the connection, or the platform has not yet started the new instance. A hung handshake runs out the gateway’s connect timeout and comes back as a 504; an outright refusal fails fast, and many gateways (NGINX among them) report it as a 502 instead. Either way, every request is hitting the same wall.

The upstream is in a death spiral. The gateway opened a TCP connection, sent the request, the upstream accepted it, and the upstream never wrote a response. The upstream is CPU-starved, the upstream is in a GC pause, the upstream is in a deadlock, or the upstream is waiting on a lock held by another process. The gateway’s read timeout expired, returned 504. The same request against a healthy instance would succeed.

The response is streaming and stalled. The upstream wrote the response headers, started writing the body, and then stopped. The body is half-written, the connection is alive, the gateway is waiting for the next byte, and the gateway’s idle timeout expires. The upstream is in a “the response was 90% done and then nothing” state. The 504 is the gateway giving up on a half-streamed response.

The four shapes map to four different fixes. The first is “speed up the slow path.” The second is “bring the upstream back.” The third is “fix the resource exhaustion.” The fourth is “fix the streaming bug.” Conflating them is the most common mistake.
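The triage above can be sketched as a tiny classifier. This is only an illustration: the `Observation` fields and the `classify_504` name are hypothetical, and a real gateway exposes these facts through its logs rather than through a neat object.

```python
from dataclasses import dataclass


@dataclass
class Observation:
    """What the gateway saw for one failed request (hypothetical fields)."""
    connected: bool            # did the TCP connection to the upstream succeed?
    headers_received: bool     # did the upstream send response headers?
    healthy_instance_ok: bool  # does the same request succeed on a healthy instance?


def classify_504(obs: Observation) -> str:
    """Map an observation to one of the four shapes described above."""
    if not obs.connected:
        return "upstream down"
    if obs.headers_received:
        return "stalled stream"
    if obs.healthy_instance_ok:
        # The request itself is fine; this particular instance is starved or stuck.
        return "death spiral"
    return "slow request"
```

The order of the checks mirrors the order in which a connection progresses: connect first, headers second, and only then the question of whether the request or the instance is at fault.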

The five places a 504 is born

The error code is a single number, but it can be born in any layer of the stack. The place changes the diagnostic.

At the CDN edge. Cloudflare, Fastly, CloudFront, or another CDN. The CDN tried to talk to the origin, the origin was too slow, the CDN’s origin-read timeout expired, and the CDN returned a timeout error to the client. Cloudflare’s default origin-read timeout is 100 seconds, and Cloudflare reports that particular expiry under its own 524 code rather than a plain 504. The fix is in the origin’s performance or in the CDN’s timeout configuration.

At the load balancer. NGINX, HAProxy, Envoy, or the platform’s load balancer. The load balancer tried to talk to the upstream pool, the upstreams were too slow, the load balancer’s upstream timeout expired, and the load balancer returned 504. NGINX’s default proxy_read_timeout is 60 seconds, measured between two successive reads rather than over the whole response. The fix is in the backend’s performance or in the load balancer’s timeout configuration.

At the reverse proxy. NGINX in front of an app server, Caddy in front of a Node process, or the platform’s reverse proxy. The proxy tried to talk to the app, the app was too slow, the proxy’s upstream timeout expired, and the proxy returned 504. The fix is in the app’s performance or in the proxy’s timeout configuration.

At the platform’s edge. The platform that hosts the app, the API, or the worker. The platform tried to talk to the user’s service, the service was not ready, the platform’s startup timeout expired, the platform returns 504. The fix is in the service’s health check, in the start command, or in the readiness probe.

At the application itself. The app made an outbound HTTP call to a third party, the third party was too slow, the app’s outbound timeout expired, and the app returned 504 to its caller. This is the only place where the “upstream” is not the developer’s own infrastructure. The fix is in the third party, in the timeout, in the retry policy, or in the circuit breaker.
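The last case, where the application is itself the gateway, might look like the sketch below. The `fetch` callable is a hypothetical stand-in for whatever HTTP client the app uses; the only assumption is that it raises `TimeoutError` when the third party is too slow.

```python
from typing import Callable, Tuple


def proxy_third_party(fetch: Callable[[float], str], timeout_s: float) -> Tuple[int, str]:
    """Call a third party and translate a timeout into a 504 for our caller.

    `fetch` is a hypothetical callable that performs the outbound request with
    the given timeout and raises TimeoutError if the third party is too slow.
    """
    try:
        body = fetch(timeout_s)
    except TimeoutError:
        # The app is acting as the gateway here, so 504 is the honest status.
        return 504, "upstream third party timed out"
    return 200, body
```

Returning 504 rather than a generic 500 tells the caller that the app itself is healthy and the slowness is one hop further away.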

The first ten minutes of debugging a 504

The clock is ticking. The site is throwing 504s. The team is in a channel. The first ten minutes are the time to gather signal, not to deploy a fix. The signal is what tells the team which of the four shapes they are looking at, and which of the five places the error is born.

Step 1: read the gateway log. The CDN log, the load balancer log, the platform’s edge log. The log says which upstream was hit, what the gateway’s timeout was, how long the gateway waited, and what state the connection was in when the timeout fired. That log line is the starting point.

Step 2: hit the upstream directly. Bypass the gateway. Curl the upstream on the port the gateway is using, with the same headers, and see how long the upstream takes. If the upstream responds in 50 ms, the bug is in the gateway or in the network between the gateway and the upstream. If the upstream takes 30 seconds, the bug is in the upstream.

Step 3: read the upstream log. The app log, the worker log, the start log. The log says whether the request was received, what the app did with it, what the database said, how long the database call took, and whether the request finished. The log is the source of truth for the upstream’s view.

Step 4: check the deploy. A 504 storm that started five minutes after a deploy is almost always the deploy. Roll back. The rollback is the diagnostic. If the rollback clears the storm, the deploy was the cause. If the rollback does not clear the storm, the deploy was not the cause.

Step 5: check the third party. A 504 that started at 3 p.m. on a Tuesday is often a third party having a bad day. Check the status pages of the services the app depends on. The third party is the upstream that the developer does not control, and the third party is the upstream that the team forgets to check.
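Step 2 is usually done with curl, but the same measurement can be scripted. The sketch below is one way to do it with the Python standard library; the `time_upstream` name is made up for this example, and it assumes the upstream speaks plain HTTP on an address reachable from where the script runs.

```python
import time
import urllib.error
import urllib.request
from typing import Optional, Tuple

# An opener with an empty ProxyHandler ignores proxy settings from the
# environment, so the request goes straight to the address given.
_DIRECT = urllib.request.build_opener(urllib.request.ProxyHandler({}))


def time_upstream(url: str, timeout_s: float = 30.0) -> Tuple[Optional[int], float]:
    """Request `url` directly and return (status or None, elapsed seconds).

    A status of None means no HTTP response arrived (timeout, refusal, DNS failure).
    """
    started = time.monotonic()
    status: Optional[int]
    try:
        with _DIRECT.open(url, timeout=timeout_s) as resp:
            resp.read()              # include the body transfer in the timing
            status = resp.status
    except urllib.error.HTTPError as err:
        status = err.code            # a 4xx/5xx still counts as an answer
    except OSError:
        status = None                # URLError and socket timeouts are OSErrors
    return status, time.monotonic() - started
```

Calling it as `time_upstream("http://10.0.0.12:8080/")` (a placeholder address) and comparing the elapsed time with the gateway’s timeout answers the step 2 question: is the upstream slow, or is the problem in front of it?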

The seven fixes that work in production

A short, opinionated list of fixes that have actually worked in real production outages. None of them are exotic. They are the boring ones.

Raise the timeout to a number that matches reality. A gateway that times out at 30 seconds is a gateway that fails every request that takes 31 seconds. The fix is to set the timeout to a number that covers the p99 of the upstream’s response time, with a margin. The 504 that fires on long-but-valid requests is the symptom of a too-short timeout.
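“The p99 with a margin” is a small calculation. The sketch below uses the nearest-rank definition of a percentile; the function names and the 1.5x margin are assumptions for illustration, not a recommendation for every workload.

```python
import math
from typing import Sequence


def percentile(samples: Sequence[float], pct: int) -> float:
    """Nearest-rank percentile: the smallest sample with at least pct% at or below it."""
    if not samples:
        raise ValueError("need at least one sample")
    ordered = sorted(samples)
    rank = max(1, math.ceil(pct * len(ordered) / 100))
    return ordered[rank - 1]


def suggested_timeout(samples: Sequence[float], margin: float = 1.5) -> float:
    """p99 of observed upstream latency, multiplied by a safety margin."""
    return percentile(samples, 99) * margin
```

Feed it the upstream’s response times in seconds from a representative window of traffic; a window that only covers quiet hours will produce a timeout that fails at peak.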

Speed up the slow path. A request that takes 30 seconds is a request that has a slow database query, a slow third-party call, or a slow synchronous task. The fix is to optimize the query, cache the third-party call, or move the synchronous task out of the request path. The 504 that fires on the same request every time is the symptom of a slow upstream.

Add a health check. A health check is a small endpoint that returns 200 when the service is ready to take traffic. The platform hits the endpoint on the new version, waits for the 200, and only then shifts traffic. A service without a health check is a service the platform cannot deploy safely. The 504 that appears after every deploy is the symptom of a missing health check.

Bound the connect timeout separately from the read timeout. A gateway that uses one timeout for everything is a gateway that conflates “the upstream is unreachable” with “the upstream is slow.” The fix is a short connect timeout (2–5 seconds) and a longer read timeout (30–60 seconds). The 504 that fires on an unreachable upstream is the symptom of a single combined timeout.
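The two budgets are easy to see at the socket level. The following is a minimal sketch, assuming a plain-HTTP upstream; `probe_upstream` and its return strings are invented for this example.

```python
import socket


def probe_upstream(host: str, port: int,
                   connect_timeout_s: float = 3.0,
                   read_timeout_s: float = 30.0) -> str:
    """Return which phase failed when talking to an upstream over plain HTTP."""
    try:
        # Phase 1: the connect timeout only bounds the TCP handshake.
        sock = socket.create_connection((host, port), timeout=connect_timeout_s)
    except OSError:
        return "unreachable"          # refused, timed out, or no route
    with sock:
        # Phase 2: a separate, longer budget for the upstream to start answering.
        sock.settimeout(read_timeout_s)
        try:
            sock.sendall(b"GET / HTTP/1.0\r\nHost: " + host.encode() + b"\r\n\r\n")
            first_bytes = sock.recv(1)
        except socket.timeout:
            return "connected but silent"
        except OSError:
            return "connection dropped"
    return "responded" if first_bytes else "closed without response"
```

With one combined timeout, “unreachable” and “connected but silent” would both take the full 30 seconds to surface; with two, the unreachable case fails in 3.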

Add retries with a circuit breaker. A gateway that retries every 504 in a tight loop is a gateway that amplifies a slow upstream into an outage. The fix is to retry a small number of times with backoff, then trip a circuit breaker and return a fast error. The 504 that turns a 1-minute slow upstream into a 30-minute outage is the symptom of a missing circuit breaker.
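A circuit breaker is less machinery than it sounds. The sketch below covers only the breaker half (retry count and backoff would wrap around it); the class name, the thresholds, and the injected `clock` are assumptions made for illustration.

```python
import time
from typing import Callable, Optional


class CircuitOpen(Exception):
    """Raised instead of calling the upstream while the breaker is open."""


class CircuitBreaker:
    def __init__(self, failure_threshold: int = 3, reset_after_s: float = 30.0,
                 clock: Callable[[], float] = time.monotonic) -> None:
        self.failure_threshold = failure_threshold
        self.reset_after_s = reset_after_s
        self.clock = clock
        self.failures = 0
        self.opened_at: Optional[float] = None

    def call(self, fn: Callable[[], str]) -> str:
        if self.opened_at is not None:
            if self.clock() - self.opened_at < self.reset_after_s:
                raise CircuitOpen("failing fast; upstream recently timed out")
            self.opened_at = None      # cool-down over: allow one trial call
        try:
            result = fn()
        except TimeoutError:
            self.failures += 1
            if self.failures >= self.failure_threshold:
                self.opened_at = self.clock()
            raise
        self.failures = 0              # any success closes the breaker fully
        return result
```

While the breaker is open the upstream receives no traffic at all, which is the point: a struggling upstream gets room to recover instead of a growing queue of retries.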

Cache the slow endpoint. A request that hits a slow upstream on every call is a request that should hit a cache. The fix is to add a cache layer (in-memory, Redis, edge) in front of the slow endpoint, and to invalidate the cache on the upstream’s write path. The 504 that fires on the same slow endpoint under traffic is the symptom of a missing cache.
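As a rough illustration of the in-memory variant, here is a small time-to-live cache with explicit invalidation. The `TTLCache` class is hypothetical, and a production cache would also need size limits and thread safety, which this sketch leaves out.

```python
import time
from typing import Callable, Dict, Tuple


class TTLCache:
    """Tiny in-memory cache: entries expire after ttl_s or on explicit invalidation."""

    def __init__(self, ttl_s: float, clock: Callable[[], float] = time.monotonic) -> None:
        self.ttl_s = ttl_s
        self.clock = clock
        self._store: Dict[str, Tuple[float, str]] = {}

    def get_or_load(self, key: str, load: Callable[[], str]) -> str:
        entry = self._store.get(key)
        if entry is not None and self.clock() - entry[0] < self.ttl_s:
            return entry[1]            # fresh hit: the slow upstream is not called
        value = load()                 # miss or stale: pay the slow call once
        self._store[key] = (self.clock(), value)
        return value

    def invalidate(self, key: str) -> None:
        """Call this from the write path so the next read reloads the value."""
        self._store.pop(key, None)
```

The time-to-live is the safety net and the invalidation is the primary mechanism: if a write path forgets to invalidate, the stale entry still ages out.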

Set the read timeout to a number that is shorter than the client’s patience. A gateway that times out at 120 seconds and a client that gives up at 30 seconds is a gateway that is wasting 90 seconds of resources on a request the client has already abandoned. The fix is to set the gateway’s timeout shorter than the client’s timeout, so the gateway can free the resources before the client has given up. The 504 that the client never sees is the right kind of 504.

The mistakes that turn a 504 into an outage

A 504 is a single error. A 504 storm is the same error repeated across every request, every minute, for thirty minutes. The storm is the failure mode that wakes the on-call. The storm is the failure mode that the team is going to be talking about on Monday.

The retry that retries the storm. A client that retries every 504 in a tight loop is a client that is amplifying the storm. The fix is exponential backoff, a small retry count, and a circuit breaker on the client side. The storm that does not clear is the symptom of an unbounded retry.
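Exponential backoff with a small retry count fits in a few lines. The sketch below adds full jitter (each delay is random between zero and the exponential ceiling) so that many clients do not retry in lockstep; the function name and default values are assumptions.

```python
import random
from typing import Callable, List, Optional


def backoff_delays(max_retries: int = 3, base_s: float = 0.5, cap_s: float = 8.0,
                   rng: Optional[Callable[[], float]] = None) -> List[float]:
    """Delays before each retry: exponential growth, capped, with full jitter."""
    rng = rng or random.random
    delays = []
    for attempt in range(max_retries):
        ceiling = min(cap_s, base_s * (2 ** attempt))
        delays.append(rng() * ceiling)   # full jitter: uniform in [0, ceiling)
    return delays
```

The list is finite on purpose. Once it is exhausted the client should stop and report the failure, not start over.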

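The timeouts that are out of order. A gateway that times out at 60 seconds in front of a database that kills queries at 30 seconds is fine only if the application reports the database’s failure straight away; if the application swallows the error or blindly retries the query, the gateway holds a connection for another 30 seconds on a request that is already dead. The fix is to order the timeouts so the innermost layer gives up first (database, then application, then gateway, then client) and to make the application return an error as soon as the database does. The 504 that wastes connection pool capacity is the symptom of a misaligned timeout.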

The health check that lies. A health check that returns 200 even when the database is unreachable is a health check that lets the platform shift traffic to instances that cannot serve it. The fix is to make the health check exercise the actual dependencies. The storm that appears when the database wobbles is the symptom of a shallow health check.
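A health check that exercises its dependencies can be sketched as a function over a set of named checks. Everything here is illustrative: `health_status` is a made-up name, and the individual checks (a database ping, a cache ping) are whatever cheap probes the service already has.

```python
from typing import Callable, Dict, Tuple


def health_status(checks: Dict[str, Callable[[], bool]]) -> Tuple[int, Dict[str, bool]]:
    """Run every dependency check; return 200 only if all of them pass."""
    results: Dict[str, bool] = {}
    for name, check in checks.items():
        try:
            results[name] = bool(check())
        except Exception:
            results[name] = False      # a check that raises counts as a failure
    status = 200 if all(results.values()) else 503
    return status, results
```

Returning the per-dependency results alongside the status means the health endpoint doubles as a diagnostic: the on-call can see which dependency failed without opening a log.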

The platform that hides the cause. A platform that returns a generic 504 with no upstream address, no request id, and no log line is a platform that turns a 30-second fix into a 30-minute investigation. The fix is to pick a platform that surfaces the operational truth. The storm that the team could not diagnose is the symptom of a platform that hides the logs.

The deploy that deploys the bug. A team that ships a deploy that introduces a slow code path, then deploys the slow path to every instance, is a team that is going to watch the 504s climb as traffic grows. The fix is a canary deploy that catches the bug on 1% of traffic before it hits 100%. The storm that appears five minutes after a deploy is the symptom of a missing canary.
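The canary decision itself is a comparison of two error rates. A minimal sketch, assuming the platform can report 504 counts separately for the canary and for the stable fleet; the function name, the one-percentage-point tolerance, and the minimum sample size are all assumptions.

```python
def canary_should_rollback(canary_504s: int, canary_total: int,
                           baseline_504s: int, baseline_total: int,
                           max_extra_rate: float = 0.01,
                           min_requests: int = 100) -> bool:
    """True when the canary's 504 rate exceeds the baseline's by more than max_extra_rate."""
    if canary_total < min_requests:
        return False                   # not enough traffic yet to judge
    canary_rate = canary_504s / canary_total
    baseline_rate = baseline_504s / baseline_total if baseline_total else 0.0
    return canary_rate - baseline_rate > max_extra_rate
```

Comparing against the baseline rather than against zero matters: a fleet that already sees the occasional 504 from a flaky third party should not block every deploy.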

How this fits the rest of the stack

A 504 is a performance symptom, not a feature. The platform that handles 504s well is the platform where the team can see the slow query, the slow upstream, the long timeout, and the request id in one place. The team does not want a platform that hides the cause.

The services layer is the part of the platform that runs the long-lived API the 504 is happening in front of. The database layer is the part that holds the data the API is slow to query. The static layer is the part that hosts the static site that is on the other side of the CDN. The environment variables are the part that holds the secrets the API needs at runtime.

A platform that handles 504s well is a platform where the timeout is configurable, the health check is honored, the slow query is visible, the rollback is one click, and the request id flows from the edge to the app. A platform that handles 504s well is a platform where the team’s debugging time goes to fixing the bug, not to finding the bug.
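“The request id flows from the edge to the app” comes down to one rule at every hop: reuse the id if one arrived, mint one if not. The sketch below assumes the common `X-Request-ID` header convention, which is widespread but not a formal standard; the helper name is made up.

```python
import uuid
from typing import Dict, Mapping


def with_request_id(incoming: Mapping[str, str], header: str = "X-Request-ID") -> Dict[str, str]:
    """Return headers for the upstream call, reusing or minting a request id."""
    outgoing = dict(incoming)
    # HTTP header names are case-insensitive, so look for any casing.
    existing = next((v for k, v in incoming.items() if k.lower() == header.lower()), None)
    if not existing:
        outgoing[header] = uuid.uuid4().hex
    return outgoing
```

If every layer logs that id, the gateway’s 504 line and the upstream’s slow-query line can be joined with a single search.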

For a team that wants to see the full cost of the project before it commits, the RunxBuild hosting calculator shows the line items together. The API, the database, the storage, the worker, the bandwidth — each one is a separate number, and the team’s mental model for the platform is the sum of those numbers.

FAQ

What is a 504 gateway timeout error?

A 504 gateway timeout means the server you contacted was a middleman (a gateway or proxy), sent a request to the server behind it, and did not get a response back before the middleman’s timeout expired. The middleman is giving up on the back-end, not on you. The actual cause is one of four shapes: the upstream was slow on a single request, the upstream is down, the upstream is in a death spiral, or the response is streaming and stalled.

How is 504 different from 502?

A 504 means the upstream gave no usable response in time: the gateway timed out waiting. A 502 means the upstream gave a bad response: malformed, reset, or otherwise unusable. The 504 is the upstream being too slow. The 502 is the upstream being broken. The two have different fixes, even though both look like “the upstream is the problem” from the client’s side.
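From the gateway’s side, the distinction is simply which kind of failure it caught. A rough sketch using Python’s built-in exception types as stand-ins; the `gateway_status_for` helper and the choice of `ValueError` to represent an unparseable response are assumptions for illustration.

```python
def gateway_status_for(error: Exception) -> int:
    """Pick the status a gateway would return for an upstream failure."""
    if isinstance(error, TimeoutError):
        return 504                     # the upstream did not answer in time
    if isinstance(error, (ConnectionResetError, ValueError)):
        return 502                     # a reset or an unparseable response
    return 500                         # not classified by this sketch
```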

How is 504 different from 500?

A 500 is the application itself returning an error. A 504 is a middleman returning an error because the application behind it did not respond in time. The 500 means the developer is in the application’s stack trace. The 504 means the developer is in the gateway’s logs and the upstream’s logs at the same time.

Why am I getting 504 errors after a deploy?

A 504 that appears after a deploy is usually one of three things: traffic was shifted before the new version was ready (nothing made the rollout wait on a health check), the new version is slower per request than the old one (a slow database query, a new external call), or the new version takes longer to start than the platform’s startup timeout allows (a synchronous startup task). The fix for the first is a health check. The fix for the second is to optimize the slow path or to raise the gateway’s timeout. The fix for the third is to either make the start faster or make the startup timeout longer.

How long should a gateway timeout be?

Shorter than the client’s timeout, and longer than the upstream’s p99 response time with a margin. A reasonable default for a public API is 30 seconds. A reasonable default for a long-running job API is 300 seconds. A reasonable default for a CDN origin is 100 seconds. The number is a function of the workload, not a single global value.
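The two bounds can be checked mechanically. In the sketch below the function name, the 1.5x margin, and the one-second headroom under the client’s timeout are all assumptions; the useful part is the error case, which signals that no timeout value can satisfy both constraints.

```python
def pick_gateway_timeout(upstream_p99_s: float, client_timeout_s: float,
                         margin: float = 1.5, headroom_s: float = 1.0) -> float:
    """Choose a timeout above p99 * margin and below the client's own timeout."""
    floor = upstream_p99_s * margin
    ceiling = client_timeout_s - headroom_s
    if floor > ceiling:
        raise ValueError("upstream p99 is too slow for this client timeout; "
                         "speed up the upstream or make the call asynchronous")
    return floor                       # the tightest value that still covers p99
```

When the function raises, tuning the gateway will not help: the workload needs a faster upstream or a different interaction pattern.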

Can Cloudflare give a 504 that is not my fault?

Yes, in the sense that Cloudflare is the one displaying it, but the error usually still points at the origin. Cloudflare is a middleman. If the origin does not answer within Cloudflare’s origin-read timeout (default 100 seconds), Cloudflare serves its own error page; it labels that specific expiry 524, and a 504 on a Cloudflare page is the same kind of report. The error is Cloudflare’s honest report that the origin was the slow one, not Cloudflare itself. The fix is in the origin’s performance or in Cloudflare’s origin timeout.

What is the fastest way to debug a 504?

Read the gateway log, hit the upstream directly with curl, time the upstream’s response, read the upstream log, check the deploy, check the third party. The order matters: signal first, fix second. The fastest path is the one that finds the underlying shape (slow, down, death spiral, stalled stream) before deploying a guess. A 504 that takes an hour to debug is almost always a 504 where the team deployed a guess before reading the log.