A health check response protocol is the contract your app uses to tell load balancers, orchestrators, monitors, and deployment platforms whether it can receive traffic. At minimum, the endpoint should return a fast 200 OK when the service is ready and a clear non-2xx status when it is not. For production APIs, the useful version also separates liveness from readiness, keeps dependency checks bounded, and returns a small JSON body that humans and automation can both understand.
That sounds simple until a deploy goes sideways at 2 a.m. The endpoint says “healthy,” the database pool is exhausted, the queue worker is wedged, and the only thing your monitor knows is that /health still returns a cheerful green dot. A bad health check is worse than no health check because it gives everyone permission to ignore the fire.
This guide is not about decorating an endpoint with a fancy response. It is about designing a health check response protocol that helps your app make better traffic decisions, catch broken deploys faster, and avoid turning every outage into a guessing game.
Table of contents
- The contract your health check should make
- Liveness, readiness, and startup are not the same check
- A practical JSON response format
- Which HTTP status codes should a health endpoint return?
- What to check, and what not to check
- Examples for Node, Python, and containers
- How to use health checks during deployment
- Common mistakes that make health checks lie
- FAQ
The contract your health check should make
A good health endpoint answers one operational question quickly: should this process receive traffic right now?
That is different from “is every dependency perfect?” and very different from “can the app render the entire homepage?” The health check response protocol should be small enough to run often, boring enough to trust, and explicit enough to explain a failed deploy without digging through five dashboards.
A useful contract usually includes:
- a stable endpoint such as /health, /ready, or /live
- a simple status code that machines can act on
- a compact JSON body for humans and logs
- a timeout budget measured in milliseconds, not vibes
- dependency checks that are bounded and intentional
- no secrets, stack traces, connection strings, or internal topology
Here is the opinionated version: your public health endpoint should be more like a traffic signal than a diagnostic console. Green means route traffic. Red means stop. If you need a full medical chart, put it behind authentication or in your observability stack.
If you are deploying services on RunxBuild, the health check belongs in the same mental model as your service configuration, environment variables, logs, and routes. The endpoint tells the platform whether a service is ready; the logs explain why it is not.
Liveness, readiness, and startup are not the same check
One of the easiest ways to build a fragile health check is to make one endpoint do three jobs.
Liveness: is the process alive?
A liveness check answers whether the process should be restarted. It should be cheap. It should not wait on a slow external database unless the process truly cannot recover without a restart.
Good liveness checks test things like:
- the HTTP server can answer
- the event loop is not completely blocked
- the process has not entered a known fatal state
They should avoid expensive dependency calls. Restarting a healthy API because a third-party API had a 300 ms hiccup is not resilience. It is a panic button wired to a doorbell.
Readiness: should this instance receive traffic?
A readiness check answers whether the instance can handle real requests. This is where dependency checks make sense, but they still need boundaries.
A readiness check may include:
- database connectivity
- required environment variables loaded
- migrations completed
- cache or queue connectivity when required for request handling
- enough local startup work finished to serve traffic
If readiness fails, the platform should keep the process running but stop sending it traffic. That is useful during startup, deploy rollouts, migrations, and temporary dependency issues.
Startup: has the service finished booting?
Startup checks protect slow-starting apps from being killed too early. Some apps need to load models, warm caches, compile assets, or run initialization logic. If your app has a real boot phase, make it visible instead of hoping the orchestrator guesses correctly.
The Kubernetes documentation on liveness, readiness, and startup probes is a strong reference here. Even if you are not writing raw Kubernetes manifests, the distinction is worth stealing.
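To make the split concrete, here is a minimal sketch of the three checks as plain functions that return a status code and a short label. The ServiceState class and its field names are hypothetical; a real service would set these flags from its own boot and dependency logic.

```python
class ServiceState:
    """Hypothetical flags that one process updates as it boots and runs."""

    def __init__(self):
        self.boot_complete = False
        self.fatal_error = False
        self.database_reachable = False


def liveness(state):
    # Cheap: no dependency calls. Fails only when a restart is the fix.
    return (503, "fail") if state.fatal_error else (200, "ok")


def startup(state):
    # Stays at 503 until initialization (cache warmup, model load) finishes.
    return (200, "ok") if state.boot_complete else (503, "starting")


def readiness(state):
    # Ready only when boot is done and required dependencies are reachable.
    if state.boot_complete and state.database_reachable and not state.fatal_error:
        return (200, "ok")
    return (503, "fail")
```

Note what happens when the database drops out: readiness flips to 503 while liveness stays at 200, so traffic stops but the process is not restarted.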
A practical JSON response format
For many APIs, a plain 200 OK is enough for a load balancer. For humans, deploy logs, and incident response, a small JSON response is better.
Use a shape like this:
{
  "status": "ok",
  "version": "2026.06.07-1",
  "uptimeSeconds": 1842,
  "checks": {
    "database": {
      "status": "ok",
      "latencyMs": 12
    },
    "queue": {
      "status": "ok",
      "latencyMs": 8
    }
  }
}
When the service is not ready, keep the response just as clear:
{
  "status": "fail",
  "version": "2026.06.07-1",
  "checks": {
    "database": {
      "status": "fail",
      "reason": "connection_timeout"
    }
  }
}
The expired IETF draft for Health Check Response Format for HTTP APIs popularized a structured approach with fields such as status, version, releaseId, checks, and links. You do not need to copy every field. You need consistency.
For a small production API, I like this minimum:
| Field | Purpose |
|---|---|
| status | ok, degraded, or fail |
| version | Build, commit, or release identifier |
| uptimeSeconds | Quick clue for restart loops |
| checks | Named dependency checks with status and latency |
Keep the body stable over time. Monitoring tools, deploy scripts, and humans all start to depend on it.
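One way to keep the body stable is to build it in a single place. The sketch below assembles the four fields from named check results. The function name, the PROCESS_STARTED constant, and the rule that any failed check makes the overall status fail are assumptions made for illustration, not part of any standard.

```python
import time

# Captured once at import time so uptime can be derived later.
PROCESS_STARTED = time.monotonic()


def build_health_body(checks, version, now=None):
    """Assemble the response body from named check results.

    `checks` maps a name to a dict such as {"status": "ok", "latencyMs": 12}.
    `now` is injectable so the function is easy to test.
    """
    now = time.monotonic() if now is None else now
    failed = any(check["status"] == "fail" for check in checks.values())
    return {
        "status": "fail" if failed else "ok",
        "version": version,
        "uptimeSeconds": int(now - PROCESS_STARTED),
        "checks": checks,
    }
```

Because every handler goes through the same function, the field names and their order cannot drift between releases without someone noticing.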
Which HTTP status codes should a health endpoint return?
Use HTTP status codes as the machine-readable signal and JSON as the human-readable detail.
For most services:
- return 200 OK when the instance is ready for traffic
- return 503 Service Unavailable when the instance is alive but not ready
- return 500 Internal Server Error only when the health endpoint itself failed unexpectedly
- avoid redirects, auth challenges, and HTML error pages on health endpoints
A 204 No Content can work for a minimal liveness check, but it gives you no body to inspect. For readiness, a compact JSON body is usually worth the bytes.
The MDN HTTP status code reference is useful when deciding how your endpoint should behave, but do not overcomplicate this. Health checks need to be boring. Boring is how they become dependable.
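The mapping from the JSON status field to the HTTP code fits in a few lines. This is a sketch under one loud assumption: whether degraded should count as ready is a policy choice, so it is exposed here as a hypothetical degraded_serves_traffic parameter rather than decided for you.

```python
from http import HTTPStatus


def http_status_for(body_status, degraded_serves_traffic=True):
    """Map the JSON "status" field to the HTTP code that probes act on."""
    if body_status == "ok":
        return HTTPStatus.OK
    if body_status == "degraded":
        # Policy choice: keep routing traffic, or pull the instance.
        if degraded_serves_traffic:
            return HTTPStatus.OK
        return HTTPStatus.SERVICE_UNAVAILABLE
    if body_status == "fail":
        return HTTPStatus.SERVICE_UNAVAILABLE
    # Anything else means the health endpoint itself misbehaved.
    return HTTPStatus.INTERNAL_SERVER_ERROR
```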
What to check, and what not to check
The best health checks are selective. They check the things that decide whether this instance can safely handle traffic. They do not check every system your company has ever heard of.
Good readiness checks
Check the dependencies that are required for normal request handling:
- Can the app reach the primary database?
- Are required environment variables present?
- Has the app completed migrations or startup initialization?
- Can the app reach a cache or queue if the request path requires it?
- Is the local disk or mounted storage available when the app depends on it?
If your API stores user submissions in Postgres, the readiness check should catch a broken database URL before the first user finds it. Pair that with the RunxBuild managed database docs and network security controls so the check tests the same production path your app actually uses.
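The environment variable check is the cheapest item on that list, so it makes a good first example. In this sketch the function name and the missing_env reason code are hypothetical. The names of missing variables go to the log, and the response carries only a generic reason.

```python
import logging
import os

log = logging.getLogger("health")


def check_required_env(required, environ=None):
    """Fail when a required variable is unset or empty. Never echo values."""
    environ = os.environ if environ is None else environ
    missing = sorted(name for name in required if not environ.get(name))
    if missing:
        # Variable names are logged; the public response stays generic.
        log.warning("missing required environment variables: %s", ", ".join(missing))
        return {"status": "fail", "reason": "missing_env"}
    return {"status": "ok"}
```

For example, check_required_env(["DATABASE_URL"]) would fail readiness on an instance that was deployed without its database URL, before any request tries to open a connection.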
Checks to avoid
Avoid checks that make the health endpoint slow, flaky, or noisy:
- calling third-party APIs that are not required for every request
- running full SQL reports or expensive joins
- checking every downstream microservice recursively
- testing email delivery, payment providers, or analytics scripts
- returning secrets, hostnames, passwords, tokens, or stack traces
A health endpoint is not a full synthetic transaction. If you need end-to-end monitoring, build that separately and run it less often.
Examples for Node, Python, and containers
The exact implementation depends on the stack, but the pattern is the same: answer fast, set the right status code, and keep dependency checks bounded.
Node and Express
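// Assumes an Express app and a db client whose query() returns a promise.
app.get('/health', async (req, res) => {
  const started = Date.now();
  const version = process.env.RELEASE_ID || 'local';
  let timer;
  try {
    // Bound the probe so a hung database cannot hang the health check.
    const timeout = new Promise((_, reject) => {
      timer = setTimeout(() => reject(new Error('timeout')), 250);
    });
    await Promise.race([db.query('select 1'), timeout]);
    res.status(200).json({
      status: 'ok',
      version,
      uptimeSeconds: Math.round(process.uptime()),
      checks: {
        database: {
          status: 'ok',
          latencyMs: Date.now() - started
        }
      }
    });
  } catch (error) {
    // Log the real error; keep the public reason generic.
    console.error('health check failed', error);
    res.status(503).json({
      status: 'fail',
      version,
      checks: {
        database: {
          status: 'fail',
          reason: 'unavailable'
        }
      }
    });
  } finally {
    clearTimeout(timer);
  }
});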
If your service is a Node API, keep the health endpoint close to the same runtime you deploy. The RunxBuild Node service guide is the right place to line up build commands, start commands, environment variables, and production behavior.
Python and FastAPI
from fastapi import FastAPI, Response

app = FastAPI()


@app.get('/health')
async def health(response: Response):
    # check_database is your own helper: it should return a bool and
    # give up once timeout_ms has elapsed.
    database_ok = await check_database(timeout_ms=250)
    if not database_ok:
        response.status_code = 503
        return {
            'status': 'fail',
            'checks': {
                'database': {'status': 'fail'}
            }
        }
    return {
        'status': 'ok',
        'checks': {
            'database': {'status': 'ok'}
        }
    }
For Python services, the same rule applies: do not let the health check become the slowest request in the app. A timeout is part of the protocol.
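The FastAPI example above calls check_database without showing it. One possible shape, sketched with asyncio.wait_for, is below. The ping_database coroutine is a hypothetical placeholder; swap in a cheap query through whatever async database client you actually use.

```python
import asyncio


async def ping_database():
    # Hypothetical placeholder: replace with a cheap query such as
    # "select 1" issued through your own async database client.
    await asyncio.sleep(0)


async def check_database(timeout_ms=250, ping=ping_database):
    """Return True if `ping` completes within the budget, else False."""
    try:
        await asyncio.wait_for(ping(), timeout=timeout_ms / 1000)
        return True
    except Exception:
        # Covers both the timeout and any driver or connection error.
        return False
```

The broad except is deliberate in this sketch: from the health endpoint's point of view, a refused connection and a slow one mean the same thing.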
Docker health checks
In Docker, you can wire the endpoint into the image:
HEALTHCHECK --interval=30s --timeout=3s --retries=3 \
  CMD wget -qO- http://127.0.0.1:3000/health || exit 1
For containerized services, review the RunxBuild Docker service docs and make sure the container listens on the expected port. A perfect health endpoint cannot save a container that binds to localhost when the platform expects 0.0.0.0.
How to use health checks during deployment
A health check response protocol becomes valuable when deployment automation can act on it.
During a rollout, the platform should be able to:
- start the new instance
- wait for startup work to finish
- call readiness until it returns healthy
- route traffic only after the instance is ready
- stop or roll back if readiness never succeeds
That flow protects users from half-started releases. It also gives developers a cleaner failure mode. Instead of “deploy succeeded but every request fails,” you get “deploy did not become ready; check the database migration step.” That is a much better Tuesday.
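If your platform does not do this for you, the readiness gate in that flow is a short polling loop. The sketch below is a hypothetical deploy-script helper, not any platform's real API: probe is any callable that returns an HTTP status code, and the attempt count and delay are illustrative defaults.

```python
import time


def wait_until_ready(probe, attempts=30, delay_seconds=2.0, sleep=time.sleep):
    """Poll `probe` until it returns 200 or the attempts run out.

    Returns True when traffic can be routed, False when the rollout
    should stop or roll back.
    """
    for attempt in range(1, attempts + 1):
        try:
            if probe() == 200:
                return True
        except OSError:
            pass  # e.g. connection refused while the server is still booting
        if attempt < attempts:
            sleep(delay_seconds)
    return False
```

A probe built on urllib.request.urlopen works here as written: a 503 surfaces as urllib.error.HTTPError, which is an OSError subclass, so the loop treats it as "not ready yet" and keeps waiting.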
If a release changes runtime size, database usage, or always-on service needs, the RunxBuild hosting calculator is a practical checkpoint before you scale by instinct. Health checks tell you whether the app can receive traffic. The calculator helps you avoid paying for more runtime than the app actually needs.
Common mistakes that make health checks lie
Returning 200 for everything
This is the classic mistake. The endpoint catches errors, logs them, then still returns 200 OK. Your monitor stays green while users get errors. If the service is not ready, return a non-2xx status.
Checking too much
The opposite mistake is making /health call every dependency in the company. Now your health check fails because a newsletter API blinked, even though the core product still works. Readiness should reflect traffic safety, not organizational anxiety.
No timeout budget
Every dependency check needs a timeout. Without one, the health endpoint can hang and become part of the outage. A 250 ms or 500 ms budget is often enough for local database and cache checks; choose a number that fits your app.
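When readiness covers several dependencies, running them concurrently under one shared deadline keeps the whole endpoint inside the budget instead of adding the timeouts together. A minimal sketch, assuming each probe is a coroutine function that raises on failure (the names and the 500 ms default are illustrative):

```python
import asyncio


async def run_checks(probes, budget_ms=500):
    """Run all probes concurrently; anything past the deadline fails.

    `probes` maps a check name to a coroutine function.
    """
    if not probes:
        return {}
    tasks = {name: asyncio.create_task(probe()) for name, probe in probes.items()}
    _done, pending = await asyncio.wait(tasks.values(), timeout=budget_ms / 1000)
    for task in pending:
        task.cancel()
    # Let cancelled tasks finish unwinding before reporting.
    await asyncio.gather(*pending, return_exceptions=True)

    results = {}
    for name, task in tasks.items():
        if task in pending:
            results[name] = {"status": "fail", "reason": "timeout"}
        elif task.exception() is not None:
            results[name] = {"status": "fail", "reason": "unavailable"}
        else:
            results[name] = {"status": "ok"}
    return results
```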
Exposing too much detail
Never return credentials, internal hostnames, stack traces, or SQL errors from a public health endpoint. If the endpoint is public, keep details generic. Put sensitive diagnostics in logs.
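A simple way to enforce this is to translate exceptions into a small, fixed set of reason codes before anything reaches the response. The codes and the exception mapping below are hypothetical; the point of the sketch is that the full error goes to the log and only the generic code goes out.

```python
import logging

log = logging.getLogger("health")

# Generic, stable reason codes that are safe to show publicly.
_PUBLIC_REASONS = (
    (TimeoutError, "connection_timeout"),
    (ConnectionError, "unavailable"),
    (PermissionError, "auth_failed"),
)


def public_reason(check_name, exc):
    """Log the full error, return only a generic reason code."""
    log.error("health check %s failed", check_name, exc_info=exc)
    for exc_type, reason in _PUBLIC_REASONS:
        if isinstance(exc, exc_type):
            return reason
    return "error"
```

An exception whose message contains a hostname or a connection string still ends up in the logs for debugging, but the caller of /health only ever sees something like connection_timeout.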
Forgetting deploy context
A health check that works locally but fails in production usually points to a difference in environment variables, network access, ports, or startup commands. That is why health checks, deploy logs, and environment configuration should be treated as one system.
A simple production checklist
Before you ship, ask:
- Does the endpoint return 200 only when the instance should receive traffic?
- Does it return 503 when required dependencies are unavailable?
- Is the response fast under normal conditions?
- Do all dependency checks have timeouts?
- Are liveness and readiness separated when the platform supports it?
- Does the JSON body avoid secrets and stack traces?
- Can deploy logs show why readiness failed?
- Is the endpoint stable across releases?
Health checks are not glamorous. Good. Glamour is how you get clever failure modes. A clean health check response protocol gives your app a simple way to say, “send traffic here,” or “not yet.” That one honest answer can save a release.
FAQ
What should a health check endpoint return?
A health check endpoint should return 200 OK when the service is ready to receive traffic and a non-2xx status, usually 503 Service Unavailable, when it is not. A small JSON body with status, version, and named dependency checks helps humans understand the result.
Should health checks use GET or HEAD?
Most teams use GET /health because it works with load balancers, monitors, browsers, and simple command-line tools. HEAD can work for minimal checks, but a JSON body from GET is easier to inspect during deploy debugging.
What is the difference between liveness and readiness?
Liveness checks decide whether a process should be restarted. Readiness checks decide whether the process should receive traffic. A service can be alive but not ready, especially during startup, migrations, or a temporary database issue.
Should a health check test the database?
A readiness check should test the database if normal requests require database access. Keep the check cheap, such as select 1, and enforce a timeout. A liveness check usually should not restart the process just because the database had a brief issue.
Is it safe to expose a health endpoint publicly?
It can be safe if the response is minimal and contains no secrets, stack traces, private hostnames, or detailed topology. If you need deep diagnostics, put them behind authentication or in your logging and monitoring tools.
How often should health checks run?
Common intervals range from 10 to 60 seconds depending on the platform and failure tolerance. Short intervals detect problems faster, but they also add traffic. Keep the endpoint lightweight so frequent checks do not become load.
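The interval also sets how long a failure can go unnoticed. As a rough model only, assume each probe starts one interval after the previous one finishes, can take up to the full timeout, and a verdict needs a fixed number of consecutive failures. Real platforms differ in the details, so treat the result as an estimate.

```python
def worst_case_detection_seconds(interval, timeout, retries):
    """Rough upper bound on time from failure to an "unhealthy" verdict.

    Assumes a failure right after a passing probe, probes that each wait
    `interval` seconds and then run for up to `timeout` seconds, and
    `retries` consecutive failures before the verdict.
    """
    return retries * (interval + timeout)


# Values from the Docker example earlier: 30 s interval, 3 s timeout, 3 retries.
print(worst_case_detection_seconds(30, 3, 3))  # 99
```

Under this model, the Docker settings shown earlier can take about a minute and a half to flag a dead endpoint, which is worth knowing before you pick an interval by habit.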