Error Logs: The Part That Is the Operational Truth, and the Part That Is Just Noise

Error logs are the record of every error the application, the framework, the runtime, the database, the proxy, and the platform produced while handling requests, and they are the closest thing the team has to the operational truth of what actually happened in production. The error log is the source of truth for “why did the request fail,” for “when did the deploy break,” and for “what is the rate of 500s.” The interesting part is the four layers that produce errors (application, framework, runtime, platform), the four log levels that actually matter (ERROR, WARN, INFO, DEBUG), the three structured fields that turn noise into queryable truth (timestamp, request_id, error_type), and the seven mistakes that quietly turn logs into a cost problem and a privacy problem.

The reason “error logs” is its own question and not just “logging” or “monitoring” is that the error log is the most important log a team has. The error log is the one the on-call reads at 3 a.m. The error log is the one that tells the team the deploy was a bad idea. The error log is the one that tells the team the third-party API is having a bad day. The error log is the source of truth, and the error log is the lever that turns “I do not know what is wrong” into “I know what is wrong.”

The short version
The four layers that produce errors
The four log levels that actually matter
The three structured fields that turn noise into truth
The seven mistakes that quietly turn logs into a problem
The four log services a real team uses
The three patterns a real team standardizes on
How this fits the rest of the stack
FAQ

The short version

An error log is a record of an error that happened during a request. The record has a timestamp, a request id, an error type, a message, and a stack trace. The error is produced by one of four layers (application, framework, runtime, platform), logged at one of four levels (ERROR, WARN, INFO, DEBUG), and shipped to a log service for storage, search, and alerting. The right error log is one the on-call can read in 30 seconds; the wrong error log is one that takes 30 minutes of grep before the relevant line turns up.

The four layers that produce errors

A short, opinionated list of the four layers that produce errors. The four are the ones a developer has to know to debug a request, and the four are the ones a developer should think about when designing the logging.

The application layer. The application code is the layer that throws exceptions, returns error responses, and writes error logs. It is the layer a developer has the most control over, the layer a developer can instrument with custom log lines, and the layer with the most context (the user, the request, the business logic, the data). Application logs are the right answer for a developer who wants to debug a specific bug, and they are the lever that turns “I do not know what the application is doing” into “I know what the application is doing.”

The framework layer. The framework (Django, Flask, Express, Spring, Rails, FastAPI) is the layer that catches unhandled exceptions, returns error responses, and writes error logs. It is the layer a developer has less control over (the framework decides what to log and when) but can still configure (the log level, the log format, the log destination), and it is the layer that has the request context (the URL, the method, the headers, the body). Framework logs are the right answer for a developer who wants to debug an unhandled exception, and they are the lever that turns “I do not know what the framework is doing” into “I know what the framework is doing.”

The runtime layer. The runtime (the Python interpreter, the Node.js runtime, the JVM, the Go runtime, the PHP runtime) is the layer that produces errors when the application crashes, when the memory runs out, when the CPU spikes, when the file descriptor limit is hit, when the garbage collector is stuck. It is the layer a developer has the least control over, it produces the most dramatic errors (segfaults, OOMs, panics, fatal errors), and it is one the on-call needs to see in the error log. Runtime logs are the right answer for a developer who wants to debug a crash, and they are the lever that turns “I do not know why the application crashed” into “I know why the application crashed.”

The platform layer. The platform (the operating system, the container runtime, the reverse proxy, the load balancer, the cloud provider) is the layer that produces errors when the disk is full, when the network is down, when the load balancer is misconfigured, when the cloud provider is having an outage. It is the layer a developer has the least visibility into, it produces the most surprising errors, and it is one the on-call needs to see in the error log. Platform logs are the right answer for a developer who wants to debug a platform issue, and they are the lever that turns “I do not know why the platform is failing” into “I know why the platform is failing.”

The four are the floor. There is also the “database layer” (the database driver throws an error, the database rejects a query, the database connection drops), the “third-party layer” (the third-party API returns a 5xx, the third-party SDK throws an error), and the “DNS layer” (the DNS lookup fails, the DNS server is down). The four are the ones a developer should know first, and the four are the ones a developer can think about when designing the logging.

The four log levels that actually matter

A short, opinionated list of the four log levels that actually matter. The four are the ones a developer should configure, and the four are the ones that turn “too much log” into “just enough log.”

ERROR — the “something is broken” level. A log at the ERROR level means the application, the framework, the runtime, or the platform failed in a way that affected the request. The log is the right answer for an unhandled exception, a 5xx response, a database connection failure, a third-party API 5xx. The log is the one the on-call wants to see, the log is the one that should trigger an alert, and the log is the one that should be shipped to a log service with a high retention (30-90 days, depending on the team).

WARN — the “something might be broken” level. A log at the WARN level means the application, the framework, the runtime, or the platform did something unexpected but recovered. The log is the right answer for a deprecated API call, a slow database query, a retry, a fallback. The log is the one the team wants to see, the log is the one that should not trigger an alert (but should be reviewable), and the log is the one that should be shipped to a log service with a medium retention (7-30 days).

INFO — the “something happened” level. A log at the INFO level means the application, the framework, the runtime, or the platform did something expected. The log is the right answer for a successful request, a successful deploy, a successful database migration. The log is the one the team wants to see in aggregate (requests per second, success rate, latency), the log is the one that should not trigger an alert, and the log is the one that should be shipped to a log service with a low retention (1-7 days) or to a metrics service (Datadog, Prometheus, Grafana Cloud).

DEBUG — the “here is the detail” level. A log at the DEBUG level means the application, the framework, the runtime, or the platform is producing the detail that a developer needs to debug a specific issue. The log is the right answer for a variable value, a query parameter, a function entry/exit, a conditional branch. The log is the one a developer wants to see when they are debugging, the log is the one that should not be enabled in production (or should be enabled for a specific request, with a debug header), and the log is the one that should not be shipped to a log service (it stays in the local log file or the local log buffer).

The four are the floor. There is also the TRACE level (even more detail than DEBUG), the FATAL level (the application is about to crash), and the custom levels (a SECURITY level, a COMPLIANCE level, a BILLING level). The four are the ones a developer should know first, and the four are the ones a developer can configure in every framework and every log service.
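
A minimal sketch of configuring the level per environment, using Python's standard logging module. The APP_ENV variable name and the LEVEL_BY_ENV mapping are assumptions made for illustration; no framework defines them.

```python
import logging
import os

# Hypothetical mapping: environment name -> minimum level that gets emitted.
LEVEL_BY_ENV = {
    "production": logging.INFO,
    "staging": logging.DEBUG,
    "development": logging.DEBUG,
}

def configure_logger(name: str, env: str) -> logging.Logger:
    """Return a logger whose threshold matches the environment."""
    logger = logging.getLogger(name)
    # Unknown environments fall back to INFO, the quieter default.
    logger.setLevel(LEVEL_BY_ENV.get(env, logging.INFO))
    return logger

# APP_ENV is an assumed variable name, not a standard one.
log = configure_logger("app", os.environ.get("APP_ENV", "production"))
log.debug("cache lookup key=%s", "user:42")  # dropped when the threshold is INFO
log.error("payment provider returned 503")   # emitted at every threshold above
```

The threshold lives in one place, so switching a service from INFO to DEBUG is a configuration change rather than a code change.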

The three structured fields that turn noise into truth

A short, opinionated list of the three structured fields that turn a noisy log into a queryable truth. The three are the ones a developer should add to every log line, and the three are the ones that turn “I cannot find the error” into “I can find the error in 5 seconds.”

timestamp — when the error happened. The timestamp is the most important field in the log line, the timestamp is the one a developer uses to correlate the error with the deploy, the request, the database, the third party, and the timestamp is the one a log service uses to display the logs in order. The pattern is to use ISO 8601 with milliseconds (2026-06-10T14:23:45.123Z), and the pattern is the right answer for a developer who wants the log to be sortable and parseable.
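
A small sketch of producing that timestamp format in Python. The helper name iso_timestamp is hypothetical; the formatting itself uses only the standard datetime module and assumes the input is a timezone-aware datetime.

```python
from datetime import datetime, timedelta, timezone

def iso_timestamp(moment: datetime) -> str:
    """Format an aware datetime as UTC ISO 8601 with milliseconds and a Z suffix."""
    utc = moment.astimezone(timezone.utc)
    return utc.isoformat(timespec="milliseconds").replace("+00:00", "Z")

utc_moment = datetime(2026, 6, 10, 14, 23, 45, 123000, tzinfo=timezone.utc)
print(iso_timestamp(utc_moment))    # 2026-06-10T14:23:45.123Z

# The same instant expressed in a UTC+2 zone normalizes to the same string.
local_moment = datetime(2026, 6, 10, 16, 23, 45, 123000,
                        tzinfo=timezone(timedelta(hours=2)))
print(iso_timestamp(local_moment))  # 2026-06-10T14:23:45.123Z
```

Normalizing to UTC before formatting is what keeps lines from different hosts sortable as plain strings.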

request_id — which request the error belongs to. The request id is the unique identifier for the request, the request id is the one a developer uses to correlate the error with the request logs, the database logs, the third-party logs, and the request id is the one a log service uses to group the logs by request. The pattern is to generate the request id in the reverse proxy or the framework, the pattern is to add the request id to every log line (the application log, the framework log, the database log, the third-party log), and the pattern is the right answer for a developer who wants to debug a request end-to-end.
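
One way to get the request id onto every log line in Python is a context variable plus a logging filter. This is a sketch under assumptions: the request_id_var name, the "-" placeholder, and the "req-7f3a" value are illustrative, and in a real service the variable would be set by middleware at the start of each request.

```python
import contextvars
import io
import logging

# Holds the id of the request currently being handled; "-" outside a request.
request_id_var = contextvars.ContextVar("request_id", default="-")

class RequestIdFilter(logging.Filter):
    """Copy the current request id onto every record that passes through."""
    def filter(self, record: logging.LogRecord) -> bool:
        record.request_id = request_id_var.get()
        return True

stream = io.StringIO()  # stands in for stdout so the output can be inspected
handler = logging.StreamHandler(stream)
handler.addFilter(RequestIdFilter())
handler.setFormatter(
    logging.Formatter("%(levelname)s request_id=%(request_id)s %(message)s")
)

logger = logging.getLogger("checkout")
logger.addHandler(handler)
logger.setLevel(logging.INFO)
logger.propagate = False  # keep the demo output in one place

request_id_var.set("req-7f3a")  # normally done once per request by middleware
logger.error("card declined")
print(stream.getvalue().strip())  # ERROR request_id=req-7f3a card declined
```

Because the filter sits on the handler, application code never has to pass the id around by hand.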

error_type — what kind of error happened. The error type is the categorization of the error, the error type is the one a developer uses to filter the logs by kind (e.g. ValidationError, DatabaseError, ThirdPartyError, TimeoutError), the error type is the one a log service uses to power the dashboards (e.g. “Top 10 error types in the last hour”), and the error type is the one a developer uses to write the alert (“alert me when there are more than 100 DatabaseError per minute”). The pattern is to use a stable, machine-readable string, the pattern is to be consistent across the application (the same error should always have the same error_type), and the pattern is the right answer for a developer who wants to filter and alert on errors.
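
A sketch of keeping error_type stable by deriving it from the exception class. The exception classes and the ERROR_TYPES registry below are hypothetical; the point is only that subclasses inherit the parent's error_type, so a new, more specific exception does not fragment the dashboards.

```python
from collections import Counter

class ValidationError(Exception):
    """Raised when user input fails validation (illustrative)."""

class DatabaseError(Exception):
    """Raised when a query or connection fails (illustrative)."""

class ConnectionLost(DatabaseError):
    """A more specific database failure that shares the parent's error_type."""

# Hypothetical registry: exception class -> stable, machine-readable string.
ERROR_TYPES = {
    ValidationError: "ValidationError",
    DatabaseError: "DatabaseError",
    TimeoutError: "TimeoutError",
}

def error_type_for(exc: BaseException) -> str:
    """Walk the class hierarchy and return the first registered error_type."""
    for cls in type(exc).__mro__:
        name = ERROR_TYPES.get(cls)
        if name is not None:
            return name
    return "UnknownError"

failures = [ConnectionLost("db gone"), ValidationError("bad email"), DatabaseError("deadlock")]
counts = Counter(error_type_for(exc) for exc in failures)
print(counts.most_common(1))  # [('DatabaseError', 2)]
```

The Counter at the end is the same aggregation a “top error types” dashboard performs, just in memory.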

The three are the floor. There is also the user_id (which user the request was for), the trace_id (which distributed trace the request belongs to), the service (which service the log came from), the environment (which environment the log came from), and the host (which host the log came from). The three are the ones a developer should know first, and the three are the ones a developer can add to every log line.

The seven mistakes that quietly turn logs into a problem

A short, opinionated list of mistakes that have actually turned real error logs into a cost problem, a privacy problem, or a debug problem. None of them are dramatic. They are the boring ones.

Logging PII (email, phone, password, credit card, SSN). An application that logs a user’s email, phone, password, credit card, or SSN is an application whose logs are a privacy violation. The fix is to scrub the PII before logging (a logger that automatically redacts known fields), and the fix is the lever that turns “the logs are a privacy violation” into “the logs are PII-free.”
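
A minimal sketch of field-based redaction, applied to the structured fields before they reach the logger. The PII_FIELDS set and the scrub helper are assumptions for illustration; a real list would come from the team's own data inventory, and keys are assumed to be strings.

```python
REDACTED = "[REDACTED]"

# Assumed field names; extend to match the application's own schema.
PII_FIELDS = {"email", "phone", "password", "credit_card", "ssn"}

def scrub(fields: dict) -> dict:
    """Return a copy of `fields` with known PII keys replaced, recursing into dicts."""
    clean = {}
    for key, value in fields.items():
        if key.lower() in PII_FIELDS:
            clean[key] = REDACTED
        elif isinstance(value, dict):
            clean[key] = scrub(value)
        else:
            clean[key] = value
    return clean

event = {
    "user_id": 42,
    "email": "ada@example.com",
    "profile": {"phone": "555-0100", "plan": "pro"},
}
print(scrub(event))
```

The function returns a copy, so the original event can still be used by the application while only the scrubbed version is logged. Redacting by key name does not catch PII embedded in free-text messages.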

Logging sensitive data (API keys, tokens, session cookies). An application that logs an API key, a token, a session cookie, or a password is an application whose logs are a security violation. The fix is to scrub the secrets before logging, and the fix is the lever that turns “the logs are a security violation” into “the logs are secret-free.”
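
Secrets often show up inside the message text (a URL with a key in the query string, a dumped header), so a pattern-based pass is a useful second layer. The sketch below uses a standard logging filter; the two regular expressions are illustrative assumptions and will not match every secret format.

```python
import logging
import re

# Illustrative patterns only; real deployments need patterns for their own secrets.
SECRET_PATTERNS = [
    re.compile(r"(?i)(authorization:\s*bearer\s+)\S+"),
    re.compile(r"(?i)((?:api_key|token|session)=)[^\s&]+"),
]

def scrub_secrets(text: str) -> str:
    """Replace the secret part of each match, keeping the label for context."""
    for pattern in SECRET_PATTERNS:
        text = pattern.sub(r"\1[REDACTED]", text)
    return text

class SecretScrubFilter(logging.Filter):
    """Scrub the fully rendered message so secrets passed as %-args are caught too."""
    def filter(self, record: logging.LogRecord) -> bool:
        record.msg = scrub_secrets(record.getMessage())
        record.args = ()  # the message is already rendered
        return True

print(scrub_secrets("GET /v1/items?api_key=sk_live_123&page=2"))
# GET /v1/items?api_key=[REDACTED]&page=2
```

Attaching the filter to the handler (handler.addFilter(SecretScrubFilter())) applies it to every logger that writes through that handler.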

Logging at DEBUG in production. An application that logs at DEBUG in production is an application whose logs are 10-100x bigger than they need to be. The fix is to log at INFO in production, and the fix is the lever that turns “the log service costs are too high” into “the log service costs are reasonable.”

Logging without a request id. An application that logs without a request id is an application whose logs are not correlated with the request. The fix is to add a request id to every log line, and the fix is the lever that turns “I cannot find the error” into “I can find the error in 5 seconds.”

Logging to stdout with nothing collecting it. An application that writes to stdout on a platform that does not forward stdout anywhere is an application whose logs disappear when the container is recycled. The fix is to ship the logs to a log service (Datadog, Splunk, Loki, ELK, CloudWatch, Better Stack), either directly or by having the platform forward stdout, and the fix is the lever that turns “the logs are gone” into “the logs are searchable.”

Logging without structured fields. An application that logs without structured fields (e.g. logger.info("User signed up: user_id=" + str(user_id))) is an application whose logs are not queryable. The fix is to use structured fields (logger.info("User signed up", extra={"user_id": user_id})), and the fix is the lever that turns “I cannot filter the logs” into “I can filter the logs by any field.” The field in the example is a user id rather than an email on purpose: a structured field is still a log field, and PII does not belong in either form.
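
With Python's standard logging module, fields passed through extra are attached to the record but are not printed unless a formatter emits them. A minimal JSON formatter, sketched here with a hypothetical JsonFormatter class, closes that gap. The timestamp and stack trace are left out to keep the example short.

```python
import io
import json
import logging

# Attributes every LogRecord carries; anything else arrived through `extra`.
_STANDARD_ATTRS = set(vars(logging.LogRecord("", 0, "", 0, "", None, None))) | {
    "message",
    "asctime",
}

class JsonFormatter(logging.Formatter):
    """Render a record as one JSON object, including fields passed via `extra`."""
    def format(self, record: logging.LogRecord) -> str:
        payload = {
            "level": record.levelname,
            "logger": record.name,
            "message": record.getMessage(),
        }
        for key, value in vars(record).items():
            if key not in _STANDARD_ATTRS:
                payload[key] = value
        return json.dumps(payload, default=str)

stream = io.StringIO()  # stands in for stdout so the output can be inspected
handler = logging.StreamHandler(stream)
handler.setFormatter(JsonFormatter())

logger = logging.getLogger("signup")
logger.addHandler(handler)
logger.setLevel(logging.INFO)
logger.propagate = False

logger.info("User signed up", extra={"user_id": 42, "plan": "pro"})
print(stream.getvalue().strip())  # one JSON object per log line
```

Each line is now a self-describing object, so a log service can index user_id and plan without parsing the message text.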

Logging the same error 1000 times per second. An application that logs the same error 1000 times per second is an application whose logs are dominated by one error. The fix is to add rate limiting to the log (log the first 10 occurrences, then suppress the rest, with a count), and the fix is the lever that turns “the log service is overwhelmed” into “the log service is happy.”
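
The “first 10, then suppress with a count” idea can be sketched as a logging filter. The RepeatLimitFilter class is hypothetical, and it is deliberately simplified: it counts for the lifetime of the process, whereas a production version would reset the counters on a time window and emit the suppressed count periodically.

```python
import logging

class RepeatLimitFilter(logging.Filter):
    """Pass the first `limit` records per message template and count the rest."""
    def __init__(self, limit: int = 10):
        super().__init__()
        self.limit = limit
        self.seen = {}

    def filter(self, record: logging.LogRecord) -> bool:
        # Key on the unformatted template so varying arguments still group together.
        key = (record.name, record.levelno, record.msg)
        count = self.seen.get(key, 0) + 1
        self.seen[key] = count
        return count <= self.limit

    def suppressed(self) -> dict:
        """Return how many records were dropped for each over-limit key."""
        return {key: n - self.limit for key, n in self.seen.items() if n > self.limit}

limiter = RepeatLimitFilter(limit=10)
passed = 0
for _ in range(1000):
    record = logging.LogRecord("db", logging.ERROR, "app.py", 1,
                               "connection refused", None, None)
    passed += limiter.filter(record)

print(passed)                               # 10
print(list(limiter.suppressed().values()))  # [990]
```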

The seven are the floor. There is also the “logging to a file and not rotating the file” mistake (the disk fills up), the “logging to a remote service without backpressure” mistake (the application blocks on the log call), and the “logging with the wrong timezone” mistake (the timestamps are off by hours). The seven are the ones a developer should know first, and the seven are the ones a developer can fix in the application code or the log service config.

The four log services a real team uses

A short, opinionated list of the four log services a real team uses. The four are the ones a developer will see most often, and the four are the ones a developer should know.

Datadog Logs. A managed log service with a powerful query language, dashboards, alerts, and integrations with the rest of the Datadog platform (APM, metrics, synthetics). The service is the right answer for a team that is already on Datadog, and the service is the right answer for a team that wants a one-stop shop for logs, metrics, and traces. The service is on the expensive side (roughly $0.10/GB ingested and $0.04/GB archived, with 15-day retention as the default; the figures change, so check the current price list before budgeting).

Grafana Loki + Grafana. An open-source log service that is designed to be cost-effective at scale (it indexes labels, not the full text), with a powerful query language (LogQL) and a tight integration with Grafana for dashboards. The service is the right answer for a team that is already on Prometheus and Grafana, the service is the right answer for a team that wants an open-source solution, and the service is the right answer for a team that has a high log volume (Loki is much cheaper than Datadog Logs at scale).

Better Stack Logs (Logtail). A managed log service with a focus on developer experience (a clean UI, a powerful query builder, fast search), with built-in collaboration (share a log line with a teammate, comment on a log line) and a free tier (5 GB/month, 30-day retention). The service is the right answer for a small team that wants a managed solution, the service is the right answer for a team that values developer experience, and the service is the right answer for a team that is moving off the free tier of a more expensive service.

ELK Stack (Elasticsearch, Logstash, Kibana). The original open-source log service, with a powerful query language (Lucene), a powerful visualization layer (Kibana), and a powerful ingestion layer (Logstash). The stack is the right answer for a team that needs full-text search on logs, the stack is the right answer for a team that is already on Elasticsearch, and the stack is the right answer for a team that has the operational capacity to run Elasticsearch (the stack is not for the faint of heart).

The four are the floor. There is also Splunk (the enterprise incumbent, very expensive, very powerful), CloudWatch Logs (the AWS-native option, the right answer for a team that is all-in on AWS), and Sentry (the right answer for application error tracking, not general-purpose logging). The four are the ones a developer should know first, and the four are the ones a developer will see most often.

The three patterns a real team standardizes on

A short, opinionated list of the three patterns a real team standardizes on. The patterns are the ones that make the error log consistent across services, queryable across time, and actionable for the on-call.

The “request id in every log line” pattern. A reverse proxy or a framework middleware generates a request id for every request, the request id is added to every log line (the application log, the framework log, the database log, the third-party log), the request id is returned in the response header (X-Request-Id), and the request id is searchable in the log service. The pattern is the right answer for any team that needs to debug a specific request, and the pattern is the lever that turns “I cannot find the error” into “I can find the error in 5 seconds.”
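
A sketch of the generate-and-return half of this pattern as plain WSGI middleware. The RequestIdMiddleware class and the "app.request_id" environ key are assumptions for illustration; most frameworks have their own middleware hook that would do the same job.

```python
import uuid

class RequestIdMiddleware:
    """Reuse the proxy's request id if present, otherwise generate one."""
    def __init__(self, app):
        self.app = app

    def __call__(self, environ, start_response):
        # WSGI exposes the X-Request-Id request header as HTTP_X_REQUEST_ID.
        request_id = environ.get("HTTP_X_REQUEST_ID") or uuid.uuid4().hex
        environ["app.request_id"] = request_id  # hypothetical key for downstream code

        def start_with_id(status, headers, exc_info=None):
            # Echo the id back so a client can quote it in a bug report.
            headers = list(headers) + [("X-Request-Id", request_id)]
            return start_response(status, headers, exc_info)

        return self.app(environ, start_with_id)

def hello_app(environ, start_response):
    """Tiny WSGI app used only to exercise the middleware."""
    start_response("200 OK", [("Content-Type", "text/plain")])
    return [environ["app.request_id"].encode()]

captured = {}

def fake_start_response(status, headers, exc_info=None):
    captured["status"] = status
    captured["headers"] = dict(headers)

body = RequestIdMiddleware(hello_app)({"HTTP_X_REQUEST_ID": "req-123"}, fake_start_response)
print(captured["headers"]["X-Request-Id"], body)  # req-123 [b'req-123']
```

Honoring an incoming id keeps the proxy's access log and the application log on the same identifier.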

The “structured logging everywhere” pattern. The team standardizes on structured logging (JSON or key-value), the team standardizes on a log level per environment (INFO in production, DEBUG in staging), the team standardizes on a log destination (stdout, which is then shipped to the log service by the platform). The pattern is the right answer for any team that needs to filter the logs by field, and the pattern is the lever that turns “I cannot filter the logs” into “I can filter the logs by any field.”

The “alert on error rate, not on individual errors” pattern. The team writes an alert that fires when the error rate exceeds a threshold (e.g. “more than 100 errors per minute, or more than 1% of requests”), the alert is wired to PagerDuty or Slack, and the alert links to a runbook that says what to check first. The pattern is the right answer for any team that needs to be paged on errors, and the pattern is the lever that turns “I am paged on every individual error” into “I am paged when the error rate is high.”
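
The threshold in the example can be written as a small predicate over one minute of counts. The function name and parameters are hypothetical; in practice the same condition would be expressed in the log or metrics service's own alert syntax.

```python
def should_alert(errors: int, requests: int,
                 max_errors: int = 100, max_ratio: float = 0.01) -> bool:
    """Return True when one minute of traffic breaches either threshold.

    errors and requests are counts for the same one-minute window.
    """
    if errors > max_errors:
        return True
    # Guard against division by zero on an idle service.
    return requests > 0 and errors / requests > max_ratio

print(should_alert(errors=101, requests=1_000_000))  # True: absolute count exceeded
print(should_alert(errors=50, requests=1_000))       # True: 5% of requests failed
print(should_alert(errors=5, requests=10_000))       # False: 0.05%, under both limits
```

Using both an absolute count and a ratio covers the two failure shapes: a busy service with a small but large-in-number failure rate, and a quiet service where a handful of errors is most of the traffic.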

The three are the floor. There is also the “log retention by error level” pattern (retain ERROR for 90 days, retain WARN for 30 days, retain INFO for 7 days), the “log sampling in production” pattern (log 100% of errors, 10% of INFO, 0% of DEBUG), and the “log redaction” pattern (scrub known PII fields before logging). The three are the ones a developer should know first, and the three are the ones a real team standardizes on.
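
The sampling pattern mentioned above can be sketched as a logging filter that keeps a fixed fraction of records per level. The SamplingFilter class is hypothetical, and the WARNING rate is an assumption (the pattern as stated only fixes the rates for errors, INFO, and DEBUG).

```python
import logging
import random

# Fraction of records to keep per level; the WARNING rate is an assumption.
SAMPLE_RATES = {
    logging.ERROR: 1.0,
    logging.WARNING: 1.0,
    logging.INFO: 0.1,
    logging.DEBUG: 0.0,
}

class SamplingFilter(logging.Filter):
    """Keep each record with the probability configured for its level."""
    def __init__(self, rates: dict, rng: random.Random = None):
        super().__init__()
        self.rates = rates
        self.rng = rng or random.Random()

    def filter(self, record: logging.LogRecord) -> bool:
        # Levels without an entry (e.g. CRITICAL) are always kept.
        rate = self.rates.get(record.levelno, 1.0)
        return self.rng.random() < rate

sampler = SamplingFilter(SAMPLE_RATES, rng=random.Random(0))  # seeded for a repeatable demo

def make_record(level: int) -> logging.LogRecord:
    return logging.LogRecord("app", level, "app.py", 1, "event", None, None)

kept_info = sum(sampler.filter(make_record(logging.INFO)) for _ in range(1000))
print(kept_info)  # roughly 100 of 1000
```

Sampling INFO this way trades exact per-request history for lower volume, so any counts derived from sampled logs need to be scaled back up by the sampling rate.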

How this fits the rest of the stack

An error log rarely lives in isolation. The error log is usually part of a stack (an application, a database, a static site, a worker) that runs on a platform. The platform that handles the error log should make the rest of the stack feel like part of the same conversation.

The services layer is the part of the platform that runs the long-lived API the errors are coming from. The database layer is the part that holds the data the API is querying. The static layer is the part that hosts the static site the API serves. The environment variables are the part that holds the secrets the API needs at runtime.

A platform that handles error logs well is a platform where the logs are searchable, the request id is on every log line, the log levels are configurable per environment, the logs are shipped to a log service by default, and the error rate is alerted on. A platform that handles error logs well is a platform where the team’s debugging time goes to fixing the bug, not to finding the bug.

For a team that wants to see the full cost of the project before it commits, the RunxBuild hosting calculator shows the line items together. The API, the database, the storage, the worker, the bandwidth, the log volume — each one is a separate number, and the team’s mental model for the platform is the sum of those numbers.

FAQ

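What is an error log?

An error log is a record of an error that happened during a request. The record has a timestamp, a request id, an error type, a message, and a stack trace. It is produced by one of four layers (application, framework, runtime, platform), logged at one of four levels (ERROR, WARN, INFO, DEBUG), and shipped to a log service for storage, search, and alerting.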

What is the difference between logging and monitoring?

Logging is the act of recording events (errors, warnings, info, debug) to a log service for later search. Monitoring is the act of aggregating metrics (request rate, error rate, latency) and alerting when the metrics cross a threshold. Logging is for “what happened to this specific request” (debugging), monitoring is for “what is happening to the system as a whole” (alerting). A real team needs both.

What are the four log levels?

The four log levels that actually matter are ERROR (something is broken, the request failed, alert on it), WARN (something might be broken, the request recovered, review periodically), INFO (something happened, the request succeeded, aggregate to metrics), and DEBUG (here is the detail, the request is being debugged, off in production unless enabled for a specific request). The four are the ones nearly every framework and log service supports, and the four are the ones a developer should configure.

How do I add structured logging?

Use a structured logger (Python: structlog or loguru, Node.js: pino or winston, Go: zap or zerolog, Java: logback with the JSON encoder, Ruby: semantic_logger). Pass the structured fields alongside the message rather than concatenating them into it. The exact call depends on the logger: Python's standard logging module takes the fields through the extra argument (logger.error("User not found", extra={"user_id": user_id})), while structlog takes them as keyword arguments (logger.error("User not found", user_id=user_id)). A JSON formatter or renderer serializes the log line, the JSON is shipped to the log service, and the log service indexes the fields. The pattern is the right answer for a developer who wants to filter the logs by field, and the pattern is the lever that turns “I cannot filter the logs” into “I can filter the logs by any field.”

What is a request id and why is it important?

A request id is a unique identifier for a request, generated by the reverse proxy or the framework. The request id is added to every log line (the application log, the framework log, the database log, the third-party log), the request id is returned in the response header (X-Request-Id), and the request id is searchable in the log service. The pattern is the right answer for a developer who wants to debug a specific request, and the pattern is the lever that turns “I cannot find the error” into “I can find the error in 5 seconds.”

How long should I retain error logs?

It depends. A common pattern is 30-90 days for ERROR, 7-30 days for WARN, 1-7 days for INFO, and 0 (or local only) for DEBUG. The retention is a balance between cost (log storage is not free) and debuggability (the longer the retention, the more incidents the team can debug). The pattern is the right answer for a developer who wants to debug an incident that happened last week, and the pattern is the lever that turns “I cannot debug last week’s incident” into “I can debug last week’s incident.”

How do I prevent logs from leaking PII?

Use a log scrubber (a logger that automatically redacts known fields, or a log service that automatically redacts known patterns), do not log fields that contain PII (email, phone, password, credit card, SSN), and use a code review checklist that includes “does this log line contain PII?” The pattern is the right answer for a developer who wants to keep the logs privacy-compliant, and the pattern is the lever that turns “the logs are a privacy violation” into “the logs are PII-free.”