Observability: Metrics, Logs, and Traces—Building a Complete Observability Stack

What Is Observability?

Observability is the capability to understand a system’s internal state by analyzing the data it generates. Unlike traditional monitoring, which tracks predefined metrics and alerts when thresholds are breached, observability enables engineers to ask arbitrary questions about system behavior without making manual instrumentation changes for each new question. As the field has evolved, the industry has converged on a three-pillar model to achieve observability at scale.

The distinction matters: monitoring tells you when something is wrong; observability tells you what’s happening, why, and how to fix it. For teams operating 500+ businesses like Vilee LLC, this difference translates directly to incident resolution speed and operational cost.

The Three Pillars of Observability

Modern observability platforms rely on three complementary data streams to provide complete system visibility:

1. Metrics: Aggregated Signals of System Health

Metrics are numerical, time-series measurements that aggregate system behavior over time. They answer the question: “Is the system performing normally?” A metric typically consists of a name, a timestamp, and a numerical value, often with labels providing additional context.

  • Example: http_request_duration_seconds{endpoint="/checkout", status="200"}
  • Use case: Detecting latency spikes, CPU exhaustion, or elevated error rates
  • Cost: Low cardinality when properly designed; high cardinality (unique label combinations) can explode storage and costs
  • Retention: Typically 15 days to 2 years depending on resolution and storage capacity

Metrics excel at providing at-a-glance system health. A single dashboard query can aggregate millions of data points into trends, percentiles (p50, p95, p99), or rate calculations. For e-commerce, metrics capture transactions per second, cart abandonment rates, and inventory levels.
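
To make the percentile idea concrete, here is a minimal sketch that computes p50, p95, and p99 from raw latency samples using the nearest-rank method. The function name and the sample values are invented for illustration; Prometheus itself estimates quantiles from histogram buckets rather than from raw samples, so treat this as the arithmetic only.

```python
import math

def percentile(samples, p):
    """Nearest-rank percentile: the smallest sample with at least p% of samples at or below it."""
    if not samples:
        raise ValueError("samples must not be empty")
    if not 0 < p <= 100:
        raise ValueError("p must be in (0, 100]")
    ordered = sorted(samples)
    rank = math.ceil(p / 100 * len(ordered))  # 1-based rank
    return ordered[rank - 1]

# Hypothetical request latencies in seconds (illustrative values only).
latencies = [0.05, 0.06, 0.07, 0.08, 0.09, 0.10, 0.12, 0.15, 0.40, 2.00]

summary = {f"p{p}": percentile(latencies, p) for p in (50, 95, 99)}
print(summary)
```

With only ten samples the p95 and p99 both land on the single slowest request, which is why tail percentiles are usually computed over much larger windows.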

2. Logs: Detailed Event Records

Logs are unstructured or semi-structured text records of specific events occurring within applications and infrastructure. They answer: “What exactly happened?” Logs provide context—the “why” behind metric anomalies.

  • Example: "2026-06-06T14:23:45Z ERROR payment_gateway timeout after 30s; order_id=12345; user_id=67890"
  • Use case: Debugging failed requests, understanding error messages, auditing user actions
  • Volume: Massive—a single API server can generate gigabytes per hour
  • Correlation: Logs need request IDs or trace IDs to correlate events across services

Logs are the detective’s notebook: they contain raw, unfiltered information. Without them, a 99.5% availability metric tells you nothing about which users were affected or why their payments failed. Structured logging (JSON formatted with fixed field names) dramatically improves searchability and cost.
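
As one possible way to produce structured logs, the sketch below uses Python’s standard logging module with a small JSON formatter. The class name JsonFormatter and the chosen field names (order_id, user_id, trace_id) are assumptions made for this example, not a required schema.

```python
import json
import logging

class JsonFormatter(logging.Formatter):
    """Render each log record as one JSON object with fixed field names."""

    def format(self, record):
        payload = {
            "timestamp": self.formatTime(record, "%Y-%m-%dT%H:%M:%S%z"),
            "level": record.levelname,
            "logger": record.name,
            "message": record.getMessage(),
        }
        # Values passed via `extra=` become attributes on the record.
        for key in ("order_id", "user_id", "trace_id"):
            if hasattr(record, key):
                payload[key] = getattr(record, key)
        return json.dumps(payload)

logger = logging.getLogger("payment_gateway")
handler = logging.StreamHandler()
handler.setFormatter(JsonFormatter())
logger.addHandler(handler)
logger.setLevel(logging.INFO)

# Emits a single JSON line that a log backend can filter by order_id or user_id.
logger.error("timeout after 30s", extra={"order_id": 12345, "user_id": 67890})
```

Because every record shares the same keys, a query such as "all ERROR records for order_id 12345" becomes a field match instead of a free-text search.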

3. Traces: End-to-End Request Journey

Distributed traces map the complete path of a single request through a microservices architecture. They answer: “How did this request flow through my system, and where did it slow down?”

  • Example: A user clicks “Buy Now,” triggering a trace that spans API gateway → inventory service → payment processor → warehouse system
  • Components: Spans (units of work), parent-child relationships, timing, and span tags (metadata)
  • Use case: Identifying bottleneck services, analyzing latency, understanding distributed failures
  • Cost: High volume; sample-based collection (1% or 10% of traces) is standard practice

Traces are invaluable in microservices architectures. A latency spike visible in metrics becomes actionable when a trace shows that the payment service’s p99 latency jumped from 50ms to 2 seconds—pinpointing the exact bottleneck.
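
The following sketch shows, under simplified assumptions, how span timing and parent-child relationships can point to a bottleneck. The Span class, the service names, and the timings are hypothetical, and the self-time calculation assumes child spans run sequentially without overlapping.

```python
from dataclasses import dataclass
from typing import Optional

@dataclass
class Span:
    span_id: str
    parent_id: Optional[str]  # None marks the root span
    name: str
    start_ms: float
    end_ms: float

    @property
    def duration_ms(self):
        return self.end_ms - self.start_ms

def self_time_ms(span, spans):
    """Time spent in the span itself, excluding its direct children."""
    child_total = sum(s.duration_ms for s in spans if s.parent_id == span.span_id)
    return span.duration_ms - child_total

def slowest_span(spans):
    """Return the span with the largest self time."""
    return max(spans, key=lambda s: self_time_ms(s, spans))

# One illustrative "Buy Now" trace; all numbers are made up.
trace = [
    Span("a", None, "api-gateway", 0, 2100),
    Span("b", "a", "inventory-service", 10, 60),
    Span("c", "a", "payment-processor", 60, 2060),
    Span("d", "a", "warehouse-system", 2060, 2090),
]

bottleneck = slowest_span(trace)
print(bottleneck.name, bottleneck.duration_ms)
```

Using self time rather than total duration keeps the root span from always looking like the slowest one, since its duration includes every child.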

Monitoring vs. Observability: A Critical Distinction

Teams often confuse these terms. Here’s the difference:

| Dimension | Monitoring | Observability |
| --- | --- | --- |
| Approach | Reactive: alert on known issues | Proactive: investigate unknown issues |
| Data model | Predefined dashboards and alerts | Arbitrary queries on raw signals |
| Incident response | “CPU at 95%, restart service” | “Trace shows query timeout in DB layer; investigate connection pool” |
| Tool focus | Prometheus, Grafana dashboards | Prometheus + Grafana + Jaeger + log aggregation |
| Scalability | Fixed metrics per service | Scales with cardinality control |

Most organizations use monitoring for detection (alerts) and layer in observability for investigation. This hybrid approach combines quick detection with deep investigation capabilities.

OpenTelemetry: The Industry Standard for Observability

For years, observability was fragmented: Prometheus for metrics, Datadog/New Relic SDKs for proprietary tracing, ELK Stack for logs. This created vendor lock-in and integration complexity. OpenTelemetry (OTel) changed this landscape by providing a vendor-agnostic standard for collecting all telemetry signals.

What OpenTelemetry Provides

OpenTelemetry graduated as a CNCF project in 2026, solidifying its status as the de facto observability standard. It unifies:

  • Metrics, Logs, Traces, and Profiles under one SDK and protocol (OTLP)
  • Language-specific SDKs for Python, Go, Node.js, Java, Ruby, PHP, and 10+ other languages
  • Semantic conventions: standardized attribute names (e.g., http.response.status_code and db.query.text, which replaced the older http.status_code and db.statement) that keep telemetry consistent across services
  • Instrumentation libraries auto-instrumenting popular frameworks without code changes (Flask, Express, Django, Spring Boot)

The Three-Component Architecture

  1. API/SDK: Injected into applications to capture traces, metrics, and logs
  2. OTLP (OpenTelemetry Protocol): A standard wire format for transmitting telemetry, replacing vendor-specific protocols
  3. OpenTelemetry Collector: A standalone agent that receives, processes, and exports telemetry to any backend (Prometheus, Grafana Tempo, Datadog, etc.)

A 2026 survey showed OTel’s adoption has grown so rapidly that it now natively integrates with Google Cloud, AWS X-Ray, Azure Monitor, Datadog, Honeycomb, and hundreds of other platforms without vendor-specific code. This vendor portability eliminates the cost and effort of replatforming.

Building Your Observability Stack: Components and Architecture

A production-grade observability stack combines open-source components into a cohesive pipeline:

Metrics: Prometheus + Grafana

Prometheus is the de facto standard for metrics collection. It operates on a pull model, actively scraping metrics from services every 15-30 seconds rather than waiting for services to push data. This approach provides a built-in liveness signal: if a service stops responding, the next scrape fails and Prometheus records the target as down (its up metric drops to 0).

  • Storage: Time-series database optimized for high write throughput and efficient compression
  • Query language: PromQL, a powerful functional query language for aggregating metrics
  • Capacity: A single Prometheus instance scales to 2-5 million active time series; larger deployments use sharding
  • Retention: Configurable; typically 15 days (local storage) or longer with remote storage backends

Grafana provides the visualization layer, creating dashboards that query Prometheus metrics. Grafana’s role extends beyond pretty charts—it provides alert rule management, template variables for dynamic dashboards, and (crucially) the ability to correlate metrics with logs and traces.

Logs: Loki (Grafana) or ELK Stack

Grafana Loki is a log aggregation system designed for Kubernetes environments. Unlike traditional solutions (ELK, Splunk), Loki does not index log content. Instead, it indexes only labels (service name, pod name, region), storing log lines in object storage. This architectural choice reduces cost by 10x compared to Elasticsearch.

  • Label-based indexing: Loki queries use labels; full-text search is available but slower
  • Integration: Built into Grafana; clicking a series on a metrics dashboard can surface related logs
  • Cost: Dramatically lower than Elasticsearch or Splunk for typical workloads

For organizations already invested in the ELK Stack (Elasticsearch-Logstash-Kibana), Loki provides a lighter-weight alternative. Both are valid; the choice depends on existing infrastructure and search requirements.

Distributed Tracing: Grafana Tempo or Jaeger

Grafana Tempo is a distributed tracing backend launched in 2021 and now widely adopted for production use. Tempo accepts traces from any OpenTelemetry-compatible source and stores them in object storage (S3, GCS, Azure Blob).

  • Cost-efficiency: By avoiding traditional databases and indexing, Tempo costs 50-80% less than Jaeger or proprietary solutions
  • Scale: Handles billions of spans per day at global scale
  • Protocol support: Accepts Jaeger, Zipkin, and OpenTelemetry span formats
  • Integration: Deeply integrated with Grafana; drilling from metrics to logs to traces is seamless

Jaeger remains a solid alternative, especially if you need advanced search capabilities on span tags. Both are open-source and widely used; Tempo is increasingly favored for cost-conscious teams.

OpenTelemetry Collector: The Nervous System

The OpenTelemetry Collector is a standalone daemon that sits between your applications and observability backends. It provides:

  • Receivers: Accepts telemetry from applications (OTLP, Prometheus scrape targets, Jaeger spans, etc.)
  • Processors: Transforms telemetry (sampling, filtering, enrichment, batching)
  • Exporters: Ships telemetry to backends (Prometheus, Tempo, Loki, cloud vendors, etc.)

Deploying the Collector in a sidecar pattern (one per pod in Kubernetes) or daemon mode (one per node) adds operational resilience. If the Collector crashes, applications automatically retry; once it recovers, telemetry resumes flowing.

Instrumentation: What to Measure in E-Commerce

Blindly collecting all metrics or traces is wasteful. Strategic instrumentation focuses on business-critical signals:

For WooCommerce and E-Commerce Platforms

  • Checkout flow: Track each step (product view, add-to-cart, shipping calculation, payment processing, order confirmation). Measure latency and error rates at each stage.
  • Payment gateway interaction: Capture payment success/failure rates, latency (p50, p95, p99), and timeout incidents. Payment delays directly impact revenue.
  • Inventory synchronization: Monitor stock updates from suppliers, overselling incidents, and sync latency. A trace through your inventory service reveals bottlenecks.
  • Product search and filtering: Measure search latency, result accuracy, and filter application time. Slow search drives cart abandonment.
  • Customer authentication: Track login success rates, password reset latency, and session management. Authentication failures block revenue.
  • Admin operations: Monitor order fulfillment workflows, refund processing, and reports generation for operational health.
  • Third-party integrations: Track API call latency, error rates, and rate-limit consumption for shipping, tax, and fraud-detection services.

Each of these areas should be instrumented with metrics (success/failure counts, latency histograms) and traces (detailed request flow) to enable rapid diagnosis when incidents occur.

SLIs, SLOs, and Error Budgets: Connecting Observability to Reliability

Raw metrics are useful, but they need a framework to drive decision-making. Google’s SRE methodology provides this framework:

SLI (Service Level Indicator)

An SLI is a quantitative measure of system reliability. Examples:

  • Fraction of requests that complete successfully (availability)
  • Fraction of requests under 100ms latency (speed)
  • Fraction of transactions that don’t encounter data corruption (correctness)

SLIs are derived directly from observability data. For example, an availability SLI might be: (successful_requests) / (total_requests)—calculated from metrics and trace data.
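
A minimal sketch of how such SLIs could be computed from request counts and latency samples follows. The function names are hypothetical, and treating a period with no traffic as fully compliant is an assumption of this example rather than a rule.

```python
def availability_sli(successful_requests, total_requests):
    """Fraction of requests that completed successfully."""
    if total_requests == 0:
        return 1.0  # assumption: no traffic counts as fully available
    return successful_requests / total_requests

def latency_sli(latencies_ms, threshold_ms=100):
    """Fraction of requests that finished under the latency threshold."""
    if not latencies_ms:
        return 1.0  # same assumption as above
    fast = sum(1 for latency in latencies_ms if latency < threshold_ms)
    return fast / len(latencies_ms)

# Illustrative numbers only.
print(availability_sli(999_000, 1_000_000))
print(latency_sli([50, 80, 120, 90]))
```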

SLO (Service Level Objective)

An SLO is the target reliability value over a time window. Example:

  • “99.9% availability (availability SLI ≥ 99.9%) over any rolling 30-day window”
  • “p95 latency ≤ 100ms (latency SLI ≥ 95% of requests under 100ms) over any hour”

SLOs directly connect observability to business priorities. If your SLO is 99.9%, you are targeting at least 99.9% availability over the window and explicitly accepting up to 0.1% unavailability; you are not promising more than that.

Error Budget and Budget-Driven Development

The error budget is 1 minus the SLO. A 99.9% SLO has a 0.1% error budget—meaning:

  • In 30 days (2,592,000 seconds), you can afford 2,592 seconds (43 minutes) of downtime
  • Once that budget is exhausted, all non-critical deployments freeze until the budget recovers
  • This creates a powerful incentive: teams balance feature velocity with reliability investment
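
The budget arithmetic above can be reproduced with a few lines. This is a sketch with hypothetical function names; it treats the budget purely as seconds of full downtime, which is a simplification of request-based SLOs.

```python
def error_budget_seconds(slo, window_days):
    """Allowed downtime in seconds for an SLO (e.g. 0.999) over a window."""
    window_seconds = window_days * 24 * 60 * 60
    return (1 - slo) * window_seconds

def remaining_budget_seconds(slo, window_days, downtime_seconds):
    """Budget left after the downtime already observed in the window."""
    return error_budget_seconds(slo, window_days) - downtime_seconds

budget = error_budget_seconds(0.999, 30)
print(f"budget: {budget:.0f} s ({budget / 60:.1f} min)")

# Hypothetical incident history: 30 minutes of downtime so far this window.
left = remaining_budget_seconds(0.999, 30, downtime_seconds=30 * 60)
print("deployments frozen" if left <= 0 else f"{left / 60:.1f} min of budget left")
```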

Google’s SRE practice treats error budgets as a primary mechanism for aligning engineering culture with reliability goals. Teams stop treating reliability as a luxury and start treating it as a real constraint on shipping velocity.

Alerting Based on Burn Rate

A naive alerting strategy triggers when error rates exceed SLO thresholds—but this generates false alarms and delays detection. Google SRE recommends multi-burn-rate alerting:

  • Fast burn rate: Detect sustained errors in 1 hour (page on-call immediately)
  • Slow burn rate: Detect low-level sustained errors over 6 hours (create a ticket for investigation)

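Example: A 99.9% SLO with a 30-day window has a total budget of about 43 minutes of downtime, or roughly 1.44 minutes (86.4 seconds) per day. Burn rate measures how fast that budget is being spent relative to a pace that would exhaust it exactly at the end of the window, so a 1x burn rate uses 100% of the budget in 30 days. A 14.4x burn rate consumes 2% of the 30-day budget per hour, a clear sign of an urgent problem. A 0.144x burn rate needs 100 hours to consume 2%; because it is below 1x, it stays within budget over the full window and does not warrant an alert on its own.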
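
The burn-rate arithmetic can be checked with a short sketch. The function names are hypothetical; the 14.4x fast threshold comes from the text above, while the 6x threshold for the 6-hour window is an assumption borrowed from the example table in the Google SRE Workbook.

```python
def burn_rate(error_rate, slo):
    """How fast the budget is being spent; 1.0 exhausts it exactly at window end."""
    return error_rate / (1 - slo)

def budget_consumed(burn, hours, window_days=30):
    """Fraction of the whole window's budget spent after `hours` at this burn rate."""
    return burn * hours / (window_days * 24)

def alert_level(error_rate_1h, error_rate_6h, slo, fast=14.4, slow=6.0):
    """Page on a fast burn over 1 hour, open a ticket on a slower burn over 6 hours."""
    if burn_rate(error_rate_1h, slo) >= fast:
        return "page"
    if burn_rate(error_rate_6h, slo) >= slow:
        return "ticket"
    return "ok"

print(f"{budget_consumed(14.4, hours=1):.1%} of the 30-day budget per hour at 14.4x")
print(alert_level(error_rate_1h=0.02, error_rate_6h=0.004, slo=0.999))
```

Production setups usually pair each long window with a shorter confirmation window so alerts clear quickly once the problem stops; that refinement is left out here.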

Cardinality: The Silent Cost Killer

High cardinality is observability’s most expensive trap. Cardinality refers to the number of unique label combinations in a metric:

  • Low cardinality: http_requests_total{endpoint="/api/checkout", method="POST", status="200"} → maybe 50 unique combinations per service
  • High cardinality: http_requests_total{endpoint="/api/checkout", user_id="", request_id="", status=...} → millions of unique combinations

High-cardinality labels are tempting but destructive. Adding user_id or request_id as a label:

  • Multiplies storage cost 10-100x (in managed Prometheus services)
  • Consumes excessive memory in Prometheus TSDB indices
  • Slows dashboards and queries to a crawl
  • Triggers cardinality warnings and series limits in managed backends

Best practice: Put high-cardinality data in logs and traces, not metrics. For metrics, use only low-cardinality labels (service, endpoint, method, status). Depending on the backend’s pricing, this can be the difference between a bill of roughly $100/month and one of roughly $10,000/month.
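
To see why one extra label matters so much, the sketch below estimates the worst-case number of time series for a metric as the product of distinct values per label. The label counts are invented for illustration, and real series counts are usually lower than this upper bound because not every combination occurs.

```python
from math import prod

def series_count(label_values):
    """Upper bound on series for one metric: product of distinct values per label."""
    return prod(label_values.values())

# Hypothetical numbers of distinct values per label.
low_cardinality = {"endpoint": 10, "method": 3, "status": 5}
high_cardinality = {**low_cardinality, "user_id": 100_000}

print(series_count(low_cardinality))
print(series_count(high_cardinality))
```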

Cost Optimization and Cardinality Management

Running a large observability stack requires discipline:

Sampling Traces

Collecting 100% of traces is infeasible at scale. Typical sampling strategies:

  • 1% sampling: Captures enough data for investigation while reducing storage 100x
  • Adaptive sampling: Sample 100% of error traces, 10% of slow traces, 1% of fast/normal traces
  • Head-based vs. tail-based: Decide which traces to sample at the client (head) or at the collector (tail). Tail-based allows sampling based on latency or error status, which is more useful.
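
A tail-based decision of the kind described above might look like the following sketch. It is not the Collector’s tail_sampling implementation; the function names, the 1000 ms slow threshold, and the hashing scheme are assumptions chosen so that the same trace ID always gets the same decision.

```python
import hashlib

def sample_fraction(trace_id):
    """Map a trace ID to a stable value in [0, 1) so decisions are repeatable."""
    digest = hashlib.sha256(trace_id.encode("utf-8")).digest()
    return int.from_bytes(digest[:4], "big") / 2**32

def keep_trace(trace_id, has_error, duration_ms, slow_threshold_ms=1000):
    """Keep every error trace, 10% of slow traces, and 1% of normal traces."""
    if has_error:
        return True
    rate = 0.10 if duration_ms >= slow_threshold_ms else 0.01
    return sample_fraction(trace_id) < rate

# Illustrative check of the effective rate for normal traffic.
kept = sum(keep_trace(f"trace-{i}", False, 50) for i in range(10_000))
print(f"kept {kept} of 10000 normal traces")
```

The decision needs the finished trace (error status and total duration), which is why this logic can only run at the collector and not in the client at the start of a request.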

Log Filtering

Not all logs are equally valuable. Filter at the Collector level:

  • Drop debug logs in production (save for development environments)
  • Drop health-check logs (e.g., Kubernetes probes hitting /health every second)
  • Drop overly verbose third-party library logs (raise the minimum log level, for example to WARN, for noisy dependencies)
  • Structure logs as JSON so they can be filtered by field and searched efficiently
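
The filtering rules above can be expressed as a small predicate. This is a sketch over plain dictionaries rather than Collector configuration; the field names (level, path, logger), the health-check paths, and the list of noisy loggers are assumptions for illustration.

```python
HEALTH_PATHS = {"/health", "/healthz", "/ready"}   # hypothetical probe endpoints
NOISY_LOGGER_PREFIXES = ("urllib3", "botocore")    # example third-party loggers

def should_keep(record, environment="production"):
    """Return False for log records that the rules above would drop."""
    level = record.get("level", "INFO").upper()
    if environment == "production" and level == "DEBUG":
        return False
    if record.get("path") in HEALTH_PATHS:
        return False
    # Noisy dependencies are only kept at WARNING and above.
    if record.get("logger", "").startswith(NOISY_LOGGER_PREFIXES) and level == "INFO":
        return False
    return True

records = [
    {"level": "DEBUG", "logger": "checkout", "message": "cart state dump"},
    {"level": "INFO", "logger": "http", "path": "/health", "message": "200 OK"},
    {"level": "INFO", "logger": "urllib3.connectionpool", "message": "new connection"},
    {"level": "ERROR", "logger": "payment_gateway", "message": "timeout after 30s"},
]
kept = [r for r in records if should_keep(r)]
print([r["message"] for r in kept])
```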

Metric Retention Policies

  • High-resolution (15-30s scrape interval): Keep for 15 days locally
  • Medium-resolution (1m): Keep for 90 days with remote storage
  • Low-resolution (5m or 1h): Keep for 2+ years for trend analysis and year-over-year comparisons

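Tiered retention relies on downsampling older data into lower-resolution series, which reduces storage costs while maintaining historical insight. Prometheus does not downsample on its own; this step usually happens in the remote storage layer.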
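
One way to picture how lower-resolution tiers save space is the averaging sketch below, which collapses raw samples into fixed-width time buckets. The function name and sample data are hypothetical, and real systems typically keep several aggregates per bucket (such as min, max, sum, and count) rather than only the mean.

```python
from collections import defaultdict

def downsample(samples, bucket_seconds):
    """Average (timestamp, value) samples into buckets keyed by bucket start time."""
    buckets = defaultdict(list)
    for timestamp, value in samples:
        buckets[timestamp - timestamp % bucket_seconds].append(value)
    return [(start, sum(values) / len(values)) for start, values in sorted(buckets.items())]

# Five samples at a 15 s scrape interval (illustrative values).
raw = [(0, 1.0), (15, 3.0), (30, 5.0), (45, 7.0), (60, 2.0)]

print(downsample(raw, bucket_seconds=60))
```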

OpenTelemetry Collector Configuration Best Practices

Deploying the Collector requires careful configuration to avoid data loss and resource exhaustion:

Essential Processors

  • memory_limiter (FIRST processor): Prevents out-of-memory crashes. Set limit_mib to 80% of container memory.
  • batch: Batches telemetry for efficient transport and reduced API calls
  • filter: Drops unwanted telemetry (debug logs, health checks, internal spans)
  • attributes: Adds or modifies labels (e.g., environment=production, region=us-west-2)
  • tail_sampling: Implements adaptive sampling decisions based on trace content
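
To illustrate what batching buys, here is a simplified sketch of a size-or-timeout batcher. It is not the Collector’s batch processor: the class and parameter names are hypothetical, and unlike a real implementation it only checks the timeout when a new item arrives instead of running a background timer.

```python
import time

class Batcher:
    """Buffer items and hand them to `export` in batches."""

    def __init__(self, export, max_batch_size=512, timeout_s=5.0, clock=time.monotonic):
        self._export = export
        self._max_batch_size = max_batch_size
        self._timeout_s = timeout_s
        self._clock = clock
        self._buffer = []
        self._last_flush = clock()

    def add(self, item):
        self._buffer.append(item)
        full = len(self._buffer) >= self._max_batch_size
        stale = self._clock() - self._last_flush >= self._timeout_s
        if full or stale:
            self.flush()

    def flush(self):
        if self._buffer:
            self._export(list(self._buffer))  # one backend call per batch
            self._buffer.clear()
        self._last_flush = self._clock()

exported = []
batcher = Batcher(exported.append, max_batch_size=3, timeout_s=1000)
for span_number in range(7):
    batcher.add(span_number)
batcher.flush()  # drain whatever is left, e.g. at shutdown
print(exported)
```

Seven items result in three export calls instead of seven, which is the reduction in API calls that the batch processor is meant to provide.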

Networking and Security

  • Bind to localhost only: For internal use. Never bind to 0.0.0.0 without authentication.
  • Use TLS/mTLS: For communication between Collector and backend. Store credentials in secret managers, not config files.
  • Set retry policies: Gracefully handle backend unavailability without data loss. Configure retry_on_failure and queue settings.
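  • Throughput control: Use the batch processor (send_batch_size, send_batch_max_size, timeout) to smooth bursts and cap batch size so backends are not overwhelmed; note that it is not a strict rate limiter.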
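
The idea behind a retry policy can be sketched as exponential backoff around an export call. The function and parameter names below are hypothetical and are not the Collector’s retry_on_failure settings; the sleep function is injectable so the example runs without real delays.

```python
import time

def export_with_retry(send, payload, max_attempts=5, initial_delay_s=1.0,
                      multiplier=2.0, sleep=time.sleep):
    """Call send(payload), retrying on ConnectionError with exponential backoff."""
    delay = initial_delay_s
    for attempt in range(1, max_attempts + 1):
        try:
            return send(payload)
        except ConnectionError:
            if attempt == max_attempts:
                raise  # give up; the caller decides whether to queue or drop
            sleep(delay)
            delay *= multiplier

# Hypothetical backend that fails twice before accepting the payload.
attempts = {"count": 0}

def flaky_backend(payload):
    attempts["count"] += 1
    if attempts["count"] < 3:
        raise ConnectionError("backend unavailable")
    return "accepted"

waits = []
print(export_with_retry(flaky_backend, b"telemetry", sleep=waits.append), waits)
```

Retries alone only cover short outages; surviving longer ones also needs a bounded queue so that pending telemetry has somewhere to wait.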

Implementation Checklist

| Phase | Tasks | Timeline |
| --- | --- | --- |
| Week 1-2: Planning | Define SLIs/SLOs for critical services, inventory existing monitoring, plan OpenTelemetry SDK instrumentation points | 2 weeks |
| Week 3-4: Setup Infrastructure | Deploy Prometheus, Grafana, Loki, Tempo in dev/staging. Configure PersistentVolumes for retention policies. | 2 weeks |
| Week 5-6: Instrument Applications | Integrate OpenTelemetry SDK into payment, checkout, and inventory services. Deploy OpenTelemetry Collector as sidecar. | 2 weeks |
| Week 7-8: Testing and Dashboards | Validate metrics/logs/traces in staging. Create dashboards for key metrics. Set up error budget alerts. | 2 weeks |
| Week 9-10: Production Rollout | Gradual production deployment. Monitor Collector memory and cardinality. Tune sampling rates. | 2 weeks |
| Week 11+: Optimization | Analyze costs, reduce cardinality, tune retention policies. Iterate on alerting rules. | Ongoing |

Real-World Cost Example

For a mid-sized e-commerce platform processing 100M requests/month:

  • Prometheus metrics (low-cardinality): ~500K series, 100GB/year storage → ~$30/month on AWS
  • Loki logs (structured, health-check filtered): 10GB/day logs, ~10% sampled → ~$150/month on AWS
  • Tempo traces (1% sampling, 30-day retention): ~300M traces/month, 5TB storage → ~$100/month on AWS
  • OpenTelemetry Collector (2 replicas, 512MB each): Kubernetes Compute → ~$50/month
  • Grafana Cloud (optional, fully managed): ~$200-500/month
  • Total: ~$500-800/month for complete observability

By comparison, managed Datadog or New Relic for the same volume costs $10,000-30,000/month. Self-hosted observability stacks deliver 10-50x better ROI for organizations with engineering capacity to maintain them.

Conclusion

Observability is no longer a luxury for large enterprises—it’s a prerequisite for operating reliable e-commerce platforms at scale. By combining metrics (Prometheus), logs (Loki), and traces (Tempo) under the OpenTelemetry standard, teams can build cost-effective, vendor-agnostic observability stacks that scale from hundreds to billions of requests per day.

The key to success is strategic instrumentation: focus on business-critical signals, control cardinality, implement error budgets, and use multi-burn-rate alerting. Organizations that master observability gain a competitive advantage—faster incident resolution, better customer experience, and lower operational costs.

Whether you’re running a single WooCommerce store or managing a portfolio of 500+ businesses like Vilee LLC, the principles remain the same. Start with metrics and dashboards, add logs for context, integrate traces for distributed systems, and iterate based on production experience.

Frequently Asked Questions

What's the difference between monitoring and observability?

Monitoring is reactive—it tracks predefined metrics and alerts when thresholds are breached. Observability is proactive—it uses metrics, logs, and traces to enable engineers to investigate arbitrary questions about system behavior without predefined instrumentation. Monitoring answers ‘is something wrong?’; observability answers ‘what’s wrong and why?’

How do I avoid the cardinality explosion trap?

Cardinality explodes when you add high-cardinality labels (user_id, request_id) to metrics. The solution: put high-cardinality data in logs and traces instead. For metrics, use only low-cardinality labels (service, endpoint, method, status). This single rule can reduce your observability bill by 10-100x.

What's the best open-source observability stack for 2026?

The modern standard is Prometheus (metrics) + Grafana (visualization) + Loki (logs) + Tempo (traces), with OpenTelemetry Collector for data collection and processing. This stack is vendor-agnostic, cost-effective ($500-800/month for mid-scale), and handles 100M+ requests/month. Jaeger is a solid alternative to Tempo for distributed tracing.
