Introduction
In 2026, Java applications are evolving toward complex distributed architectures like microservices and cloud-native environments. OpenTelemetry (OTel), the open standard born from the merger of OpenTracing and OpenCensus, has become essential for observability. Unlike proprietary tools, OTel unifies traces, metrics, and logs into a single framework, exportable to any backend (Jaeger, Prometheus, Grafana).
Why is it crucial? Imagine a Java e-commerce app where a customer request spans 10 services: without traces, pinpointing latency is a nightmare. OTel captures these flows automatically, reducing MTTR (Mean Time To Resolution) by 50% according to CNCF. This beginner tutorial is primarily conceptual and lays the foundations: concepts, architecture, and practices, with concrete analogies to help you visualize implementation. By the end, you'll be able to structure Java observability like a pro.
Prerequisites
- Basic Java knowledge (JDK 17+ recommended in 2026).
- Understanding of observability: monitoring vs. tracing.
- Familiarity with microservices or Spring Boot (no deep expertise needed).
- Access to a backend like Jaeger or Zipkin for future tests.
Step 1: Understand OpenTelemetry's Core Pillars
OpenTelemetry rests on three main signals, like a pilot's three senses: view (traces), dashboard (metrics), logbook (logs).
- Traces: Distributed request tracking. Analogy: a pizza delivery through multiple steps (kitchen, packaging, delivery). Each span (segment) records its duration and carries attributes (e.g., a 'db.query' span lasting 150 ms).
- Metrics: Numeric aggregates (counters, histograms). Real-world example: HTTP requests per minute in a Java service, with percentiles to spot spikes.
- Logs: Contextual text events. Link to traces: a 'DB error' log attaches to a span for correlation.
| Signal | Objective | Java Example |
|---|---|---|
| Traces | Distributed flow | Spring Boot request spanning Feign clients |
| Metrics | Trends | JVM heap usage over 24h |
| Logs | Details | Exception stack trace linked to a trace |
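To make the trace vocabulary concrete, here is a minimal plain-Java model of the pizza-delivery trace as a root span with child spans. It is a teaching sketch, not the OpenTelemetry API: the `SpanModel` record, the `PizzaTrace` class, and all field names and values are assumptions made up for illustration.

```java
import java.time.Duration;
import java.util.Comparator;
import java.util.List;
import java.util.Map;

// Hypothetical, simplified model of a span. This is NOT the OpenTelemetry API.
record SpanModel(String traceId, String spanId, String parentSpanId,
                 String name, Duration duration, Map<String, String> attributes) {

    // A root span has no parent.
    boolean isRoot() {
        return parentSpanId == null;
    }
}

final class PizzaTrace {
    // Builds one trace: a root "order" span and three child steps sharing the same trace id.
    static List<SpanModel> build() {
        String traceId = "trace-1"; // illustrative id, real trace ids are 32 hex characters
        return List.of(
            new SpanModel(traceId, "s1", null, "order", Duration.ofMillis(900), Map.of()),
            new SpanModel(traceId, "s2", "s1", "kitchen", Duration.ofMillis(500), Map.of("oven", "wood")),
            new SpanModel(traceId, "s3", "s1", "packaging", Duration.ofMillis(100), Map.of()),
            new SpanModel(traceId, "s4", "s1", "delivery", Duration.ofMillis(300), Map.of("vehicle", "bike")));
    }

    // Returns the slowest non-root span, i.e. the step where the request spent the most time.
    static SpanModel slowestChild(List<SpanModel> spans) {
        return spans.stream()
            .filter(span -> !span.isRoot())
            .max(Comparator.comparing(SpanModel::duration))
            .orElseThrow();
    }
}
```

Calling `PizzaTrace.slowestChild(PizzaTrace.build())` picks out the "kitchen" span, which is the kind of question a tracing backend answers visually.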
Step 2: Decode the OpenTelemetry Architecture
Think of OTel as a production line: API (abstract), SDK (implementation), Instrumentation (automatic), Exporter (output).
- API: Stable, language-independent interface. In Java, `io.opentelemetry.api` defines Tracer, Meter, Logger.
- SDK: Java engine (`io.opentelemetry:opentelemetry-sdk`). Configures processors, samplers.
- Instrumentation: Magic libraries for Spring, JDBC, Kafka. E.g., `@WithSpan` on a Java method traces without boilerplate.
- Exporter: Bridge to backends. OTLP (gRPC/HTTP) is the 2026 standard for Prometheus, Elastic.
Java App → API/SDK → Processors (Batch/Sampler) → Exporter → Collector → Backend
The OTel Collector (standalone component) aggregates, filters, and routes data, avoiding network overload in Java Kubernetes clusters.
Step 3: Master Context Propagation
In Java microservices, trace context travels like a passport. Baggage (custom data) and TraceContext (traceId, spanId) propagate via HTTP headers (e.g., traceparent).
Real-world example: Service A (Spring Boot) calls B via RestTemplate. OTel auto-injects headers; B extracts them for a child span.
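The header exchange between Service A and Service B can be sketched in plain Java. The `traceparent` layout used here (version `00`, a 32-hex-character trace id, a 16-hex-character parent span id, 2 hex characters of flags) follows the W3C Trace Context format; the `TraceParent` class and its method names are hypothetical, and the sketch skips some checks a real propagator performs (for example, rejecting all-zero ids).

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Simplified traceparent handling. Class and method names are hypothetical,
// not part of the OpenTelemetry API.
final class TraceParent {
    // "00" version, 32 hex chars trace id, 16 hex chars parent span id, 2 hex chars flags.
    private static final Pattern FORMAT =
        Pattern.compile("00-([0-9a-f]{32})-([0-9a-f]{16})-([0-9a-f]{2})");

    final String traceId;
    final String parentSpanId;
    final boolean sampled;

    TraceParent(String traceId, String parentSpanId, boolean sampled) {
        this.traceId = traceId;
        this.parentSpanId = parentSpanId;
        this.sampled = sampled;
    }

    // What service B does on an incoming request: extract the caller's context.
    static TraceParent parse(String header) {
        Matcher matcher = FORMAT.matcher(header);
        if (!matcher.matches()) {
            throw new IllegalArgumentException("Malformed traceparent: " + header);
        }
        // The lowest bit of the flags byte is the "sampled" flag.
        boolean sampled = (Integer.parseInt(matcher.group(3), 16) & 0x01) != 0;
        return new TraceParent(matcher.group(1), matcher.group(2), sampled);
    }

    // What service B sends downstream: same trace id, its own span id as the new parent.
    String childHeader(String ownSpanId) {
        return "00-" + traceId + "-" + ownSpanId + "-" + (sampled ? "01" : "00");
    }
}
```

The key point is that the trace id never changes along the call chain, while the span id in the header is replaced at every hop.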
Span states:
- Active: In progress; the span is still recording (wall-clock time, not CPU time).
- Ended: Finalized, with status (OK/ERROR).
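Samplers control volume: AlwaysOn (all traces), Probabilistic (a fixed ratio such as 1%, called TraceIdRatioBased in the SDK), ParentBased (follows the parent's decision). In high-traffic Java production, sampling 1% of traces or less helps avoid generating 10 GB or more of trace data per day.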
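A rough sketch of how ratio-based and parent-based decisions can work is shown below. It is a simplified model under stated assumptions, not the SDK's sampler classes: `SamplerSketch` and its methods are hypothetical names, and the exact arithmetic in the real SDK may differ. The idea it illustrates is that the decision is derived from the trace id itself, so every service seeing the same trace reaches the same answer.

```java
// Simplified models of sampling decisions. NOT the OpenTelemetry SDK classes.
final class SamplerSketch {

    // Ratio sampling: a deterministic decision computed from the 32-hex-char trace id.
    static boolean ratioSample(String traceIdHex, double ratio) {
        if (traceIdHex.length() != 32) {
            throw new IllegalArgumentException("trace id must be 32 hex characters");
        }
        if (ratio >= 1.0) {
            return true;  // equivalent to AlwaysOn
        }
        if (ratio <= 0.0) {
            return false; // nothing is sampled
        }
        // Take the low 64 bits of the trace id and clear the sign bit
        // so the value lies in [0, Long.MAX_VALUE].
        long low = Long.parseUnsignedLong(traceIdHex.substring(16), 16) & Long.MAX_VALUE;
        long upperBound = (long) (ratio * Long.MAX_VALUE);
        return low < upperBound;
    }

    // Parent-based sampling: follow the parent's decision when there is a parent,
    // otherwise fall back to ratio sampling for the new root span.
    // parentSampled is null when the span has no parent.
    static boolean parentBased(Boolean parentSampled, String traceIdHex, double rootRatio) {
        if (parentSampled != null) {
            return parentSampled;
        }
        return ratioSample(traceIdHex, rootRatio);
    }
}
```

Following the parent's decision is what keeps traces complete: a child service never drops spans for a trace that the entry service decided to keep.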
Step 4: Integrate Metrics and Logs
Metrics: Four types in Java.
| Type | Description | Java Usage |
|---|---|---|
| Counter | Incremental | Requests processed |
| UpDownCounter | + or - | Connection pool |
| Histogram | Distribution | API latency (P50/P95/P99) |
| Gauge | Snapshot | Active threads |
Benefit: In Grafana, pivot from a 'high latency' metric to the causal trace, then detailed logs.
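The instrument types in the table can be modeled with a few lines of standard Java. This is a sketch of the semantics only, not the OpenTelemetry metrics API: `MetricsSketch` and its method names are hypothetical. Note also that a real histogram instrument typically aggregates values into buckets and lets the backend estimate percentiles, whereas this sketch computes a nearest-rank percentile directly from raw samples.

```java
import java.util.Arrays;
import java.util.concurrent.atomic.LongAdder;

// Plain-Java model of metric instrument semantics. NOT the OpenTelemetry API.
final class MetricsSketch {
    private final LongAdder requests = new LongAdder();        // Counter: only ever goes up
    private final LongAdder openConnections = new LongAdder(); // UpDownCounter: goes up and down

    void requestProcessed() { requests.increment(); }
    void connectionOpened() { openConnections.increment(); }
    void connectionClosed() { openConnections.decrement(); }

    long requestCount() { return requests.sum(); }
    long openConnectionCount() { return openConnections.sum(); }

    // Nearest-rank percentile over raw latency samples, with p in (0, 100].
    static long percentile(long[] latenciesMs, double p) {
        if (latenciesMs.length == 0) {
            throw new IllegalArgumentException("no samples");
        }
        if (p <= 0 || p > 100) {
            throw new IllegalArgumentException("p must be in (0, 100]");
        }
        long[] sorted = latenciesMs.clone();
        Arrays.sort(sorted);
        int rank = (int) Math.ceil(p / 100.0 * sorted.length); // 1-based rank
        return sorted[Math.max(rank, 1) - 1];
    }
}
```

With ten samples where nine are under 100 ms and one is 1000 ms, P50 stays low while P95 and P99 jump to the outlier, which is why percentiles expose spikes that an average hides.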
Step 5: Configure for Production
In 2026, configure via properties or env vars. E.g., OTEL_SERVICE_NAME=my-java-app, OTEL_TRACES_EXPORTER=otlp, OTEL_METRICS_EXPORTER=prometheus.
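Properties and environment variables name the same settings. As far as the Java autoconfiguration documents it, the environment variable is the property name upper-cased with `.` and `-` replaced by `_`. The helper below sketches that rule; `OtelConfigNames` is a hypothetical class, not something shipped with OpenTelemetry.

```java
import java.util.Locale;

// Hypothetical helper illustrating the property-to-environment-variable naming rule.
final class OtelConfigNames {

    // Example: "otel.service.name" becomes "OTEL_SERVICE_NAME".
    static String toEnvVar(String systemProperty) {
        return systemProperty
            .toUpperCase(Locale.ROOT)
            .replace('.', '_')
            .replace('-', '_');
    }
}
```

So `-Dotel.traces.exporter=otlp` on the JVM command line and `OTEL_TRACES_EXPORTER=otlp` in the environment are two spellings of the same setting.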
Key processors:
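- Batch: Groups spans before export (by default, a 5 s schedule delay and at most 512 spans per batch).
- Attributes: Limited to 128 per span by default, which bounds span size; cardinality is controlled separately, by the attribute values you choose.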
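The batching idea can be sketched in a few lines. This is a deliberately reduced model, assuming size-based flushing only: the real SDK processor also flushes on a timer from a background thread and bounds its queue. `BatchSketch`, `onEnd`, and `flush` are names chosen for this illustration, and spans are represented by plain strings.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.function.Consumer;

// Simplified size-based batching. NOT the SDK's batch span processor.
final class BatchSketch {
    private final int maxBatchSize;
    private final Consumer<List<String>> exporter;
    private final List<String> buffer = new ArrayList<>();

    BatchSketch(int maxBatchSize, Consumer<List<String>> exporter) {
        if (maxBatchSize <= 0) {
            throw new IllegalArgumentException("maxBatchSize must be positive");
        }
        this.maxBatchSize = maxBatchSize;
        this.exporter = exporter;
    }

    // Called when a span ends: buffer it and export once the batch is full.
    synchronized void onEnd(String spanName) {
        buffer.add(spanName);
        if (buffer.size() >= maxBatchSize) {
            flush();
        }
    }

    // Exports whatever is buffered; a real processor would also call this on a timer
    // and at shutdown.
    synchronized void flush() {
        if (buffer.isEmpty()) {
            return;
        }
        exporter.accept(List.copyOf(buffer)); // hand the exporter an immutable snapshot
        buffer.clear();
    }
}
```

Ending 1,000 spans with a batch size of 512 results in two exports instead of 1,000 network calls, which is the whole point of batching.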
Java resources: The OTel Java agent collects JVM metrics (GC, threads) out of the box, and a Micrometer bridge can forward metrics from existing Micrometer instrumentation.
Best Practices
- Start with traces: 80% immediate value for distributed debugging.
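- Limit cardinality: Keep low-cardinality attributes (HTTP method, status code); drop or hash high-cardinality ones (userId, request.body).
- Use semantic conventions: `http.method=POST`, `db.statement` for interoperability (named `http.request.method` and `db.query.text` in the newer stable conventions).
- Adaptive sampling: Head-based sampling decides before a request's outcome is known, so use tail-based sampling (typically in the Collector) in prod to keep errors while sampling routine traffic (1:1000 ratio).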
- Central Collector: Avoid direct exporters from Java pods for scalability.
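For the "drop or hash" advice on high-cardinality attributes, one possible approach is sketched below using only the JDK. `AttributeHasher` is a hypothetical helper, not part of OpenTelemetry. Note the difference between its two methods: a plain hash only shortens and obscures a value (the number of distinct values stays the same), while bucketing actually caps cardinality at a fixed number of values.

```java
import java.nio.ByteBuffer;
import java.nio.charset.StandardCharsets;
import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;

// Hypothetical helper for taming high-cardinality attribute values.
final class AttributeHasher {

    // First 8 bytes of SHA-256 as 16 hex chars: hides the raw value, same cardinality.
    static String shortHash(String value) {
        byte[] digest = sha256(value);
        StringBuilder hex = new StringBuilder(16);
        for (int i = 0; i < 8; i++) {
            hex.append(String.format("%02x", digest[i]));
        }
        return hex.toString();
    }

    // Maps any value to one of `buckets` stable bucket numbers: bounded cardinality.
    static int bucket(String value, int buckets) {
        if (buckets <= 0) {
            throw new IllegalArgumentException("buckets must be positive");
        }
        int firstFourBytes = ByteBuffer.wrap(sha256(value)).getInt();
        return Math.floorMod(firstFourBytes, buckets);
    }

    private static byte[] sha256(String value) {
        try {
            return MessageDigest.getInstance("SHA-256")
                .digest(value.getBytes(StandardCharsets.UTF_8));
        } catch (NoSuchAlgorithmException e) {
            // SHA-256 is a required algorithm on every Java platform.
            throw new IllegalStateException("SHA-256 unavailable", e);
        }
    }
}
```

An attribute such as a bucket number between 0 and 15 is safe on a metric; the raw userId is better kept on spans or logs, or dropped.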
Common Pitfalls to Avoid
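- No-op by default: Without an SDK or the Java agent installed, API calls do nothing. Also check that `OTEL_SDK_DISABLED` is not set to true (it should be `false` or unset), or everything stays silent.
- Explosive cardinality: Embedding unique values such as timestamps in log messages → billions of distinct series; use message templates with placeholders such as `{timestamp}`.
- Lost context: Forgetting propagation in async code (CompletableFuture) → orphan traces.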
- Ignored overhead: Measure CPU (+5-10%); optimize with batching/small spans.
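The lost-context pitfall is easiest to see with a thread-bound value. The sketch below uses a plain `ThreadLocal` as a stand-in for the current trace context; `AsyncContextSketch` and its `wrap` helper are hypothetical and only model the idea. In real OpenTelemetry code the equivalent step is wrapping the task or executor with the library's context (for example `Context.current().wrap(...)`), or relying on the Java agent's executor instrumentation.

```java
import java.util.concurrent.CompletableFuture;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.function.Supplier;

// Models thread-bound trace context with a ThreadLocal. NOT the OpenTelemetry Context API.
final class AsyncContextSketch {
    // Stand-in for "the trace context of the current thread".
    static final ThreadLocal<String> CURRENT_TRACE_ID = new ThreadLocal<>();

    // Captures the caller's trace id now and restores it around the task
    // on whichever thread eventually runs it.
    static <T> Supplier<T> wrap(Supplier<T> task) {
        String captured = CURRENT_TRACE_ID.get();
        return () -> {
            String previous = CURRENT_TRACE_ID.get();
            CURRENT_TRACE_ID.set(captured);
            try {
                return task.get();
            } finally {
                CURRENT_TRACE_ID.set(previous);
            }
        };
    }

    // Returns {trace id seen by an unwrapped task, trace id seen by a wrapped task}.
    static String[] demo() {
        ExecutorService pool = Executors.newSingleThreadExecutor();
        try {
            CURRENT_TRACE_ID.set("trace-abc");
            Supplier<String> readContext = CURRENT_TRACE_ID::get;
            // The worker thread has its own (empty) ThreadLocal: the context is lost.
            String lost = CompletableFuture.supplyAsync(readContext, pool).join();
            // Wrapping carries the caller's context onto the worker thread.
            String kept = CompletableFuture.supplyAsync(wrap(readContext), pool).join();
            return new String[] {lost, kept};
        } finally {
            CURRENT_TRACE_ID.remove();
            pool.shutdown();
        }
    }
}
```

In `demo()`, the unwrapped task sees no trace id at all, which is exactly how an orphan span is born; the wrapped task sees the caller's id.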
Next Steps
- Official docs: OpenTelemetry Java.
- CNCF case studies: Netflix, Uber migrations.
- Tools: Jaeger UI for trace visualization.
- Expert training: Check out our Learni observability courses for Java.