Go 可观测性:OpenTelemetry 完全指南

全面讲解如何在 Go 项目中集成 OpenTelemetry,实现 Tracing、Metrics、Logging 三大支柱的可观测性,覆盖自定义 Span、上下文传播、自动埋点、导出到 Jaeger/Tempo、Prometheus 集成和 Grafana 仪表盘

Go 可观测性:OpenTelemetry 完全指南

你有没有经历过这样的夜晚:凌晨两点被报警电话叫醒,服务出了故障,但你盯着满屏的日志,完全搞不清楚请求到底经过了哪些服务、在哪个环节出了问题、为什么突然变慢?如果你有微服务架构的经验,这个场景一定不陌生。

传统的日志监控在单体应用中还算够用,但到了分布式系统里,光靠日志就像拿着手电筒找黑猫——你根本不知道猫在哪里。我们需要的是可观测性(Observability),一种让你无需改变系统行为就能理解其内部状态的能力。

今天,我们就来深入探讨 OpenTelemetry——这个由 CNCF 维护的可观测性标准框架,看看如何在 Go 项目中实现全链路追踪、指标收集和日志关联。

可观测性的三大支柱

在动手之前,先理清概念。可观测性有三大支柱:

  1. Traces(链路追踪):一个请求从进入系统到返回响应的完整路径,包括经过的每个服务、每次数据库查询、每次外部调用
  2. Metrics(指标):系统行为的数值度量,比如 QPS、延迟、错误率、CPU 使用率
  3. Logs(日志):离散的事件记录,比如"用户 A 在 10:00:01 登录失败"

OpenTelemetry 的厉害之处在于:它将这三者统一到一个框架中,并且提供了跨语言、跨平台的标准化实现。你不再需要分别为 Zipkin、Prometheus、ELK 各自接入不同的 SDK。

核心概念速览

在 OpenTelemetry 的世界里,有几个关键概念你必须了解:

  • Trace:一次完整的请求链路,由一个或多个 Span 组成
  • Span:Trace 中的一个工作单元,有名称、起止时间、属性和事件
  • Context Propagation:上下文传播机制,确保 Span 之间的关系能跨服务传递
  • Exporter:将遥测数据发送到后端(Jaeger、Tempo、Prometheus 等)
  • Collector:可选的中间代理,负责接收、处理和转发遥测数据
  • Instrumentation:埋点,分为自动埋点(auto-instrumentation)和手动埋点(manual instrumentation)

项目准备

让我们创建一个示例项目来演示 OpenTelemetry 的集成:

mkdir otel-demo
cd otel-demo
go mod init github.com/yourusername/otel-demo

安装 OpenTelemetry Go SDK 及其依赖:

# 核心 SDK
go get go.opentelemetry.io/otel
go get go.opentelemetry.io/otel/sdk

# Trace 相关
go get go.opentelemetry.io/otel/trace
go get go.opentelemetry.io/otel/sdk/trace
go get go.opentelemetry.io/otel/exporters/otlp/otlptrace/otlptracegrpc
go get go.opentelemetry.io/otel/exporters/otlp/otlptrace/otlptracehttp

# Metrics 相关
go get go.opentelemetry.io/otel/metric
go get go.opentelemetry.io/otel/sdk/metric
go get go.opentelemetry.io/otel/exporters/otlp/otlpmetric/otlpmetricgrpc
go get go.opentelemetry.io/otel/exporters/prometheus

# 上下文传播
go get go.opentelemetry.io/otel/propagation

# Jaeger 导出器(用于本地开发)
go get go.opentelemetry.io/otel/exporters/jaeger

# 自动埋点(HTTP、gRPC 等)
go get go.opentelemetry.io/contrib/instrumentation/net/http/otelhttp
go get go.opentelemetry.io/contrib/instrumentation/google.golang.org/grpc/otelgrpc
go get go.opentelemetry.io/contrib/instrumentation/github.com/gorilla/mux/otelmux
go get go.opentelemetry.io/contrib/instrumentation/runtime

初始化 TracerProvider

一切的起点是创建 TracerProvider。它是 Trace 数据的管理中心,负责创建 Span 并将它们导出到后端。

// internal/telemetry/tracer.go
package telemetry

import (
	"context"
	"fmt"
	"log"
	"time"

	"go.opentelemetry.io/otel"
	"go.opentelemetry.io/otel/exporters/otlp/otlptrace/otlptracegrpc"
	"go.opentelemetry.io/otel/propagation"
	"go.opentelemetry.io/otel/sdk/resource"
	sdktrace "go.opentelemetry.io/otel/sdk/trace"
	semconv "go.opentelemetry.io/otel/semconv/v1.24.0"
	"google.golang.org/grpc"
	"google.golang.org/grpc/credentials/insecure"
)

// InitTracer 初始化 TracerProvider
func InitTracer(ctx context.Context, serviceName, collectorEndpoint string) (func(context.Context) error, error) {
	// 创建 OTLP gRPC 导出器
	conn, err := grpc.NewClient(
		collectorEndpoint,
		grpc.WithTransportCredentials(insecure.NewCredentials()),
	)
	if err != nil {
		return nil, fmt.Errorf("failed to create gRPC connection: %w", err)
	}

	exporter, err := otlptracegrpc.New(ctx,
		otlptracegrpc.WithGRPCConn(conn),
		otlptracegrpc.WithInsecure(),
	)
	if err != nil {
		return nil, fmt.Errorf("failed to create trace exporter: %w", err)
	}

	// 创建资源描述(标识服务身份)
	res, err := resource.New(ctx,
		resource.WithAttributes(
			semconv.ServiceNameKey.String(serviceName),
			semconv.ServiceVersionKey.String("1.0.0"),
			semconv.DeploymentEnvironmentKey.String("development"),
		),
		resource.WithHost(),
		resource.WithProcess(),
		resource.WithTelemetrySDK(),
	)
	if err != nil {
		return nil, fmt.Errorf("failed to create resource: %w", err)
	}

	// 创建 TracerProvider
	tp := sdktrace.NewTracerProvider(
		sdktrace.WithBatcher(exporter,
			sdktrace.WithBatchTimeout(5*time.Second),
			sdktrace.WithMaxExportBatchSize(512),
		),
		sdktrace.WithResource(res),
		sdktrace.WithSampler(sdktrace.AlwaysSample()), // 生产环境建议用 TraceIDRatioBased
	)

	// 设置全局 TracerProvider
	otel.SetTracerProvider(tp)

	// 设置全局上下文传播器(W3C TraceContext + Baggage)
	otel.SetTextMapPropagator(propagation.NewCompositeTextMapPropagator(
		propagation.TraceContext{},
		propagation.Baggage{},
	))

	log.Printf("TracerProvider initialized for service: %s", serviceName)

	// 返回 cleanup 函数
	return func(ctx context.Context) error {
		log.Println("Shutting down TracerProvider...")
		return tp.Shutdown(ctx)
	}, nil
}

创建自定义 Span

有了 TracerProvider,我们就可以创建 Span 了。Span 是追踪的基本单元,它记录了一个操作的开始和结束。

// internal/handler/order_handler.go
package handler

import (
	"context"
	"encoding/json"
	"net/http"
	"time"

	"go.opentelemetry.io/otel"
	"go.opentelemetry.io/otel/attribute"
	"go.opentelemetry.io/otel/codes"
	"go.opentelemetry.io/otel/trace"
)

var tracer = otel.Tracer("order-service")

// OrderHandler 订单处理器
type OrderHandler struct {
	orderService  OrderService
	paymentClient PaymentClient
	inventoryRepo InventoryRepo
}

func (h *OrderHandler) CreateOrder(w http.ResponseWriter, r *http.Request) {
	// 从请求的 context 中提取父 Span(如果有)
	ctx := r.Context()

	// 创建一个新 Span,自动关联到父 Span
	ctx, span := tracer.Start(ctx, "CreateOrder",
		trace.WithSpanKind(trace.SpanKindServer),
		trace.WithAttributes(
			attribute.String("http.method", r.Method),
			attribute.String("http.url", r.URL.String()),
		),
	)
	defer span.End()

	// 解析请求
	var req CreateOrderRequest
	if err := json.NewDecoder(r.Body).Decode(&req); err != nil {
		span.RecordError(err)
		span.SetStatus(codes.Error, "invalid request body")
		http.Error(w, err.Error(), http.StatusBadRequest)
		return
	}

	// 为 Span 添加业务属性
	span.SetAttributes(
		attribute.String("order.user_id", req.UserID),
		attribute.Int("order.item_count", len(req.Items)),
	)

	// 添加 Span 事件
	span.AddEvent("parsing request completed",
		trace.WithAttributes(
			attribute.String("order.id", req.OrderID),
		),
	)

	// 调用下游服务(自动创建子 Span)
	order, err := h.processOrder(ctx, &req)
	if err != nil {
		span.RecordError(err)
		span.SetStatus(codes.Error, "failed to process order")
		http.Error(w, err.Error(), http.StatusInternalServerError)
		return
	}

	span.SetStatus(codes.Ok, "order created")
	span.SetAttributes(attribute.String("order.result_id", order.ID))

	w.Header().Set("Content-Type", "application/json")
	json.NewEncoder(w).Encode(order)
}

// processOrder 处理订单(内部方法,创建子 Span)
func (h *OrderHandler) processOrder(ctx context.Context, req *CreateOrderRequest) (*Order, error) {
	ctx, span := tracer.Start(ctx, "processOrder",
		trace.WithAttributes(
			attribute.String("order.id", req.OrderID),
		),
	)
	defer span.End()

	// 第一步:验证库存
	if err := h.checkInventory(ctx, req.Items); err != nil {
		span.RecordError(err)
		return nil, fmt.Errorf("inventory check failed: %w", err)
	}

	// 第二步:扣减库存
	if err := h.deductInventory(ctx, req.Items); err != nil {
		span.RecordError(err)
		return nil, fmt.Errorf("inventory deduction failed: %w", err)
	}

	// 第三步:处理支付
	paymentID, err := h.processPayment(ctx, req.UserID, req.TotalAmount)
	if err != nil {
		span.RecordError(err)
		return nil, fmt.Errorf("payment failed: %w", err)
	}

	span.SetAttributes(attribute.String("payment.id", paymentID))

	// 第四步:创建订单记录
	order := &Order{
		ID:          req.OrderID,
		UserID:      req.UserID,
		Items:       req.Items,
		TotalAmount: req.TotalAmount,
		PaymentID:   paymentID,
		Status:      "confirmed",
		CreatedAt:   time.Now(),
	}

	span.AddEvent("order created successfully")
	return order, nil
}

// checkInventory 检查库存(创建更深层的子 Span)
func (h *OrderHandler) checkInventory(ctx context.Context, items []OrderItem) error {
	ctx, span := tracer.Start(ctx, "checkInventory")
	defer span.End()

	for _, item := range items {
		span.AddEvent("checking item",
			trace.WithAttributes(
				attribute.String("item.sku", item.SKU),
				attribute.Int("item.quantity", item.Quantity),
			),
		)

		available, err := h.inventoryRepo.GetStock(ctx, item.SKU)
		if err != nil {
			span.RecordError(err,
				trace.WithAttributes(attribute.String("item.sku", item.SKU)),
			)
			return err
		}

		if available < item.Quantity {
			err := fmt.Errorf("insufficient stock for %s: need %d, have %d",
				item.SKU, item.Quantity, available)
			span.RecordError(err)
			return err
		}
	}

	return nil
}

// processPayment 处理支付
func (h *OrderHandler) processPayment(ctx context.Context, userID string, amount float64) (string, error) {
	ctx, span := tracer.Start(ctx, "processPayment",
		trace.WithAttributes(
			attribute.String("payment.user_id", userID),
			attribute.Float64("payment.amount", amount),
		),
	)
	defer span.End()

	// 模拟调用外部支付服务
	paymentID, err := h.paymentClient.Charge(ctx, userID, amount)
	if err != nil {
		span.RecordError(err)
		span.SetStatus(codes.Error, "payment gateway error")
		return "", err
	}

	span.SetAttributes(attribute.String("payment.id", paymentID))
	span.SetStatus(codes.Ok, "payment successful")
	return paymentID, nil
}

上下文传播(Context Propagation)

在微服务架构中,一个请求可能经过多个服务。上下文传播确保所有服务的 Span 能串联成一条完整的 Trace。

// internal/client/payment_client.go
package client

import (
	"context"
	"fmt"
	"net/http"

	"go.opentelemetry.io/otel"
	"go.opentelemetry.io/otel/attribute"
	"go.opentelemetry.io/otel/codes"
	"go.opentelemetry.io/otel/propagation"
	"go.opentelemetry.io/otel/trace"
)

var tracer = otel.Tracer("payment-client")

// PaymentClient 支付客户端
type PaymentClient struct {
	httpClient *http.Client
	baseURL    string
}

func NewPaymentClient(baseURL string) *PaymentClient {
	return &PaymentClient{
		httpClient: &http.Client{},
		baseURL:    baseURL,
	}
}

// Charge 调用支付网关扣款
func (c *PaymentClient) Charge(ctx context.Context, userID string, amount float64) (string, error) {
	ctx, span := tracer.Start(ctx, "PaymentClient.Charge",
		trace.WithSpanKind(trace.SpanKindClient),
		trace.WithAttributes(
			attribute.String("payment.user_id", userID),
			attribute.Float64("payment.amount", amount),
			attribute.String("payment.currency", "USD"),
		),
	)
	defer span.End()

	// 构建 HTTP 请求
	req, err := http.NewRequestWithContext(ctx, "POST",
		fmt.Sprintf("%s/api/v1/charge", c.baseURL), nil)
	if err != nil {
		span.RecordError(err)
		return "", err
	}

	// 关键步骤:注入上下文到 HTTP Header
	// 这会将 TraceID 和 SpanID 写入请求头,使得下游服务能关联到同一个 Trace
	otel.GetTextMapPropagator().Inject(ctx, propagation.HeaderCarrier(req.Header))

	// 发送请求
	resp, err := c.httpClient.Do(req)
	if err != nil {
		span.RecordError(err)
		span.SetStatus(codes.Error, "HTTP request failed")
		return "", err
	}
	defer resp.Body.Close()

	if resp.StatusCode != http.StatusOK {
		err := fmt.Errorf("payment service returned status %d", resp.StatusCode)
		span.RecordError(err)
		span.SetStatus(codes.Error, resp.Status)
		return "", err
	}

	// 解析响应...
	paymentID := "pay_abc123" // 模拟
	span.SetAttributes(attribute.String("payment.id", paymentID))

	return paymentID, nil
}

下游服务(支付服务)接收时提取上下文:

// 下游支付服务的处理器
func (h *PaymentHandler) Charge(w http.ResponseWriter, r *http.Request) {
	// 从 HTTP Header 中提取传播的上下文
	ctx := otel.GetTextMapPropagator().Extract(
		r.Context(),
		propagation.HeaderCarrier(r.Header),
	)

	// 创建 Span 时会自动关联到上游 Span
	ctx, span := tracer.Start(ctx, "PaymentHandler.Charge",
		trace.WithSpanKind(trace.SpanKindServer),
	)
	defer span.End()

	// 处理支付逻辑...
	span.AddEvent("payment processed")
}

如果你使用的是标准 HTTP 客户端和服务器,OpenTelemetry 提供了自动注入/提取的包装器:

import (
	"go.opentelemetry.io/contrib/instrumentation/net/http/otelhttp"
)

// 自动传播上下文的 HTTP 客户端
client := &http.Client{
	Transport: otelhttp.NewTransport(http.DefaultTransport),
}

// 自动传播上下文的 HTTP 服务器
handler := otelhttp.NewHandler(yourHandler, "my-server")
http.ListenAndServe(":8080", handler)

Metrics 指标收集

除了 Tracing,OpenTelemetry 也提供了完善的 Metrics 支持。让我们看看如何在 Go 应用中收集指标。

// internal/telemetry/meter.go
package telemetry

import (
	"context"
	"log"
	"time"

	"go.opentelemetry.io/otel"
	"go.opentelemetry.io/otel/exporters/otlp/otlpmetric/otlpmetricgrpc"
	"go.opentelemetry.io/otel/sdk/metric"
	"go.opentelemetry.io/otel/sdk/resource"
	semconv "go.opentelemetry.io/otel/semconv/v1.24.0"
)

// InitMeter 初始化 MeterProvider
func InitMeter(ctx context.Context, serviceName, collectorEndpoint string) (func(context.Context) error, error) {
	// 创建 OTLP gRPC 导出器
	exporter, err := otlpmetricgrpc.New(ctx,
		otlpmetricgrpc.WithEndpoint(collectorEndpoint),
		otlpmetricgrpc.WithInsecure(),
	)
	if err != nil {
		return nil, err
	}

	// 创建资源
	res, err := resource.New(ctx,
		resource.WithAttributes(
			semconv.ServiceNameKey.String(serviceName),
		),
	)
	if err != nil {
		return nil, err
	}

	// 创建 MeterProvider
	mp := metric.NewMeterProvider(
		metric.WithResource(res),
		metric.WithReader(metric.NewPeriodicReader(exporter,
			metric.WithInterval(10*time.Second),
		)),
	)

	// 设置全局 MeterProvider
	otel.SetMeterProvider(mp)

	log.Printf("MeterProvider initialized for service: %s", serviceName)

	return func(ctx context.Context) error {
		return mp.Shutdown(ctx)
	}, nil
}

自定义 Metrics

OpenTelemetry 提供了多种 Metric 类型:Counter、UpDownCounter、Histogram、Gauge。

// internal/metrics/order_metrics.go
package metrics

import (
	"context"
	"time"

	"go.opentelemetry.io/otel"
	"go.opentelemetry.io/otel/attribute"
	"go.opentelemetry.io/otel/metric"
)

// OrderMetrics 订单相关指标
type OrderMetrics struct {
	orderCounter      metric.Int64Counter
	orderAmount       metric.Float64Histogram
	activeOrders      metric.Int64UpDownCounter
	orderProcessingMs metric.Int64Histogram
	cancelCounter     metric.Int64Counter
}

// NewOrderMetrics 创建订单指标收集器
func NewOrderMetrics() (*OrderMetrics, error) {
	meter := otel.Meter("order-service")

	orderCounter, err := meter.Int64Counter("orders.total",
		metric.WithDescription("Total number of orders created"),
		metric.WithUnit("{order}"),
	)
	if err != nil {
		return nil, err
	}

	orderAmount, err := meter.Float64Histogram("orders.amount",
		metric.WithDescription("Distribution of order amounts"),
		metric.WithUnit("USD"),
		metric.WithExplicitBucketBoundaries(10, 50, 100, 200, 500, 1000, 5000),
	)
	if err != nil {
		return nil, err
	}

	activeOrders, err := meter.Int64UpDownCounter("orders.active",
		metric.WithDescription("Number of currently active orders"),
		metric.WithUnit("{order}"),
	)
	if err != nil {
		return nil, err
	}

	orderProcessingMs, err := meter.Int64Histogram("orders.processing_duration",
		metric.WithDescription("Time taken to process an order"),
		metric.WithUnit("ms"),
		metric.WithExplicitBucketBoundaries(10, 50, 100, 250, 500, 1000, 2500, 5000),
	)
	if err != nil {
		return nil, err
	}

	cancelCounter, err := meter.Int64Counter("orders.cancelled",
		metric.WithDescription("Total number of cancelled orders"),
		metric.WithUnit("{order}"),
	)
	if err != nil {
		return nil, err
	}

	return &OrderMetrics{
		orderCounter:      orderCounter,
		orderAmount:       orderAmount,
		activeOrders:      activeOrders,
		orderProcessingMs: orderProcessingMs,
		cancelCounter:     cancelCounter,
	}, nil
}

// RecordOrderCreated 记录订单创建
func (m *OrderMetrics) RecordOrderCreated(ctx context.Context, userID string, amount float64, itemCount int) {
	attrs := metric.WithAttributes(
		attribute.String("user.id", userID),
		attribute.Int("order.item_count", itemCount),
	)
	m.orderCounter.Add(ctx, 1, attrs)
	m.orderAmount.Record(ctx, amount, attrs)
	m.activeOrders.Add(ctx, 1, attrs)
}

// RecordOrderCompleted 记录订单完成
func (m *OrderMetrics) RecordOrderCompleted(ctx context.Context, userID string, duration time.Duration) {
	attrs := metric.WithAttributes(
		attribute.String("user.id", userID),
		attribute.String("order.status", "completed"),
	)
	m.activeOrders.Add(ctx, -1, attrs)
	m.orderProcessingMs.Record(ctx, duration.Milliseconds(), attrs)
}

// RecordOrderCancelled 记录订单取消
func (m *OrderMetrics) RecordOrderCancelled(ctx context.Context, reason string) {
	attrs := metric.WithAttributes(
		attribute.String("order.cancel_reason", reason),
	)
	m.cancelCounter.Add(ctx, 1, attrs)
	m.activeOrders.Add(ctx, -1, attrs)
}

在业务代码中使用 Metrics

// internal/service/order_service.go
package service

import (
	"context"
	"fmt"
	"time"

	"go.opentelemetry.io/otel"
	"go.opentelemetry.io/otel/attribute"
	"go.opentelemetry.io/otel/codes"
	"go.opentelemetry.io/otel/trace"

	"github.com/yourusername/otel-demo/internal/metrics"
)

var tracer = otel.Tracer("order-service")

type OrderService struct {
	repo    OrderRepo
	metrics *metrics.OrderMetrics
}

func (s *OrderService) PlaceOrder(ctx context.Context, req *CreateOrderRequest) (*Order, error) {
	ctx, span := tracer.Start(ctx, "OrderService.PlaceOrder",
		trace.WithAttributes(
			attribute.String("order.user_id", req.UserID),
		),
	)
	defer span.End()

	start := time.Now()

	// 记录指标:订单创建
	s.metrics.RecordOrderCreated(ctx, req.UserID, req.TotalAmount, len(req.Items))

	// 业务逻辑...
	order, err := s.repo.Save(ctx, req)
	if err != nil {
		span.RecordError(err)
		span.SetStatus(codes.Error, "failed to save order")
		s.metrics.RecordOrderCancelled(ctx, "save_failed")
		return nil, err
	}

	duration := time.Since(start)
	s.metrics.RecordOrderCompleted(ctx, req.UserID, duration)

	span.SetStatus(codes.Ok, "order placed")
	return order, nil
}

Prometheus 集成

如果你的监控系统基于 Prometheus,可以直接使用 Prometheus Exporter:

// internal/telemetry/prometheus.go
package telemetry

import (
	"context"
	"log"

	"github.com/prometheus/client_golang/prometheus"
	"github.com/prometheus/client_golang/prometheus/collectors"
	prom "go.opentelemetry.io/otel/exporters/prometheus"
	"go.opentelemetry.io/otel/sdk/metric"
	"go.opentelemetry.io/otel/sdk/resource"
	semconv "go.opentelemetry.io/otel/semconv/v1.24.0"

	"go.opentelemetry.io/otel"
)

// InitPrometheusMeter 初始化 Prometheus 导出器
func InitPrometheusMeter(ctx context.Context, serviceName string) (*prom.Exporter, error) {
	// 创建资源
	res, err := resource.New(ctx,
		resource.WithAttributes(
			semconv.ServiceNameKey.String(serviceName),
		),
	)
	if err != nil {
		return nil, err
	}

	// 创建 Prometheus 导出器
	// 它同时实现了 metric.Reader 和 prometheus.Collector
	exporter, err := prom.New(
		prom.WithRegisterer(prometheus.DefaultRegisterer),
	)
	if err != nil {
		return nil, err
	}

	// 注册 Go 运行时指标
	prometheus.MustRegister(collectors.NewGoCollector())
	prometheus.MustRegister(collectors.NewProcessCollector(collectors.ProcessCollectorOpts{}))

	// 创建 MeterProvider
	mp := metric.NewMeterProvider(
		metric.WithResource(res),
		metric.WithReader(exporter),
	)

	otel.SetMeterProvider(mp)

	log.Printf("Prometheus MeterProvider initialized for service: %s", serviceName)
	return exporter, nil
}

在 HTTP 服务器中暴露 /metrics 端点:

// main.go
package main

import (
	"context"
	"log"
	"net/http"

	"github.com/prometheus/client_golang/prometheus/promhttp"

	"github.com/yourusername/otel-demo/internal/telemetry"
)

func main() {
	ctx := context.Background()

	// 初始化 Prometheus 指标
	_, err := telemetry.InitPrometheusMeter(ctx, "order-service")
	if err != nil {
		log.Fatalf("Failed to init Prometheus meter: %v", err)
	}

	mux := http.NewServeMux()

	// Prometheus 指标端点
	mux.Handle("/metrics", promhttp.Handler())

	// 业务端点
	mux.HandleFunc("/api/orders", orderHandler.CreateOrder)

	// 健康检查
	mux.HandleFunc("/health", func(w http.ResponseWriter, r *http.Request) {
		w.WriteHeader(http.StatusOK)
		w.Write([]byte(`{"status":"ok"}`))
	})

	log.Println("Server starting on :8080")
	log.Fatal(http.ListenAndServe(":8080", mux))
}

日志关联(Log Correlation)

日志需要和 Trace 关联起来,这样你在看日志时就能快速跳转到对应的 Trace。

// internal/logging/logger.go
package logging

import (
	"context"
	"log/slog"
	"os"

	"go.opentelemetry.io/otel/trace"
)

// NewLogger 创建带 Trace 上下文的 slog logger
func NewLogger(serviceName string) *slog.Logger {
	handler := &traceLogHandler{
		handler: slog.NewJSONHandler(os.Stdout, &slog.HandlerOptions{
			Level: slog.LevelInfo,
		}),
	}

	return slog.New(handler).With(
		slog.String("service", serviceName),
	)
}

// traceLogHandler 自定义 slog Handler,自动注入 TraceID 和 SpanID
type traceLogHandler struct {
	handler slog.Handler
}

func (h *traceLogHandler) Enabled(ctx context.Context, level slog.Level) bool {
	return h.handler.Enabled(ctx, level)
}

func (h *traceLogHandler) Handle(ctx context.Context, record slog.Record) error {
	// 从 context 中提取 Span 信息
	span := trace.SpanFromContext(ctx)
	if span.SpanContext().IsValid() {
		record.AddAttrs(
			slog.String("trace_id", span.SpanContext().TraceID().String()),
			slog.String("span_id", span.SpanContext().SpanID().String()),
			slog.Bool("trace_flags_sampled", span.SpanContext().IsSampled()),
		)
	}

	return h.handler.Handle(ctx, record)
}

func (h *traceLogHandler) WithAttrs(attrs []slog.Attr) slog.Handler {
	return &traceLogHandler{handler: h.handler.WithAttrs(attrs)}
}

func (h *traceLogHandler) WithGroup(name string) slog.Handler {
	return &traceLogHandler{handler: h.handler.WithGroup(name)}
}

// FromContext 从上下文获取 logger(带 Trace 信息)
func FromContext(ctx context.Context, logger *slog.Logger) *slog.Logger {
	span := trace.SpanFromContext(ctx)
	if !span.SpanContext().IsValid() {
		return logger
	}

	return logger.With(
		slog.String("trace_id", span.SpanContext().TraceID().String()),
		slog.String("span_id", span.SpanContext().SpanID().String()),
	)
}

在业务代码中使用关联日志:

func (s *OrderService) PlaceOrder(ctx context.Context, req *CreateOrderRequest) (*Order, error) {
	logger := logging.FromContext(ctx, s.logger)

	logger.InfoContext(ctx, "placing order",
		slog.String("user_id", req.UserID),
		slog.Float64("amount", req.TotalAmount),
	)

	// ... 业务逻辑 ...

	if err != nil {
		logger.ErrorContext(ctx, "failed to place order",
			slog.String("error", err.Error()),
		)
		return nil, err
	}

	logger.InfoContext(ctx, "order placed successfully",
		slog.String("order_id", order.ID),
	)

	return order, nil
}

输出示例(JSON 格式):

{
  "time": "2025-02-15T16:10:00+08:00",
  "level": "INFO",
  "service": "order-service",
  "trace_id": "4bf92f3577b34da6a3ce929d0e0e4736",
  "span_id": "00f067aa0ba902b7",
  "msg": "placing order",
  "user_id": "user_123",
  "amount": 299.99
}

有了 trace_id,你可以在 Jaeger 或 Grafana Tempo 中直接搜索这条日志对应的完整 Trace。

自动埋点(Auto-Instrumentation)

手动给每个函数加 Span 太累?OpenTelemetry 提供了常见库的自动埋点插件。

HTTP 服务器自动埋点

import (
	"net/http"
	"go.opentelemetry.io/contrib/instrumentation/net/http/otelhttp"
)

func main() {
	mux := http.NewServeMux()
	mux.HandleFunc("/api/orders", handleOrders)
	mux.HandleFunc("/api/users", handleUsers)

	// 用 otelhttp 包装,自动为每个请求创建 Span
	handler := otelhttp.NewHandler(mux, "my-http-server",
		otelhttp.WithFilter(func(r *http.Request) bool {
			// 过滤掉健康检查和 metrics 端点
			return r.URL.Path != "/health" && r.URL.Path != "/metrics"
		}),
	)

	http.ListenAndServe(":8080", handler)
}

gRPC 自动埋点

import (
	"go.opentelemetry.io/contrib/instrumentation/google.golang.org/grpc/otelgrpc"
	"google.golang.org/grpc"
)

// gRPC 服务器
server := grpc.NewServer(
	grpc.StatsHandler(otelgrpc.NewServerHandler()),
)

// gRPC 客户端
conn, err := grpc.NewClient("localhost:50051",
	grpc.WithStatsHandler(otelgrpc.NewClientHandler()),
)

数据库自动埋点

import (
	"go.opentelemetry.io/contrib/instrumentation/github.com/lib/pq/otelpq"
	"database/sql"
)

// 注册 OpenTelemetry 包装的 PostgreSQL 驱动
sql.Register("otelpostgres", otelpq.Wrap(&pq.Driver{}))

// 使用注册后的驱动名
db, err := sql.Open("otelpostgres", "postgres://user:pass@localhost/db?sslmode=disable")

Gin/Echo/Chi 等框架

// Chi 中间件
import "go.opentelemetry.io/contrib/instrumentation/github.com/go-chi/chi/v5/otelchi"

r := chi.NewRouter()
r.Use(otelchi.Middleware("my-server", otelchi.WithChiRoutes(r)))

// Gorilla Mux
import "go.opentelemetry.io/contrib/instrumentation/github.com/gorilla/mux/otelmux"

r := mux.NewRouter()
r.Use(otelmux.Middleware("my-server"))

采样策略

在生产环境中,不可能采样 100% 的请求——那会产生海量数据。OpenTelemetry 提供了多种采样策略:

// internal/telemetry/sampler.go
package telemetry

import (
	sdktrace "go.opentelemetry.io/otel/sdk/trace"
	"go.opentelemetry.io/otel/trace"
)

// NewProductionSampler 生产环境的采样策略
func NewProductionSampler() sdktrace.Sampler {
	// 组合多种采样策略
	return sdktrace.ParentBased(
		// 根 Span 使用基于比率的采样
		sdktrace.TraceIDRatioBased(0.1), // 采样 10%

		// 有父 Span 时跟随父 Span 的采样决策
		sdktrace.WithRemoteParentSampled(sdktrace.AlwaysSample()),
		sdktrace.WithRemoteParentNotSampled(sdktrace.NeverSample()),
		sdktrace.WithLocalParentSampled(sdktrace.AlwaysSample()),
		sdktrace.WithLocalParentNotSampled(sdktrace.NeverSample()),
	)
}

// NewErrorAwareSampler 错误请求 100% 采样的策略
func NewErrorAwareSampler() sdktrace.Sampler {
	return &errorAwareSampler{
		base: sdktrace.TraceIDRatioBased(0.05), // 正常请求采样 5%
	}
}

type errorAwareSampler struct {
	base sdktrace.Sampler
}

func (s *errorAwareSampler) ShouldSample(p sdktrace.SamplingParameters) sdktrace.SamplingResult {
	// 先使用基础采样策略
	result := s.base.ShouldSample(p)

	// 如果有错误属性,强制采样
	for _, attr := range p.Attributes {
		if attr.Key == "error" && attr.Value.AsBool() {
			return sdktrace.SamplingResult{
				Decision:   sdktrace.RecordAndSample,
				Tracestate: trace.SpanContextFromContext(p.ParentContext).TraceState(),
			}
		}
	}

	return result
}

func (s *errorAwareSampler) Description() string {
	return "ErrorAwareSampler(sampling errors at 100%, normal at 5%)"
}

导出到 Jaeger

Jaeger 是最流行的分布式追踪系统之一。本地开发时可以用 Docker 快速启动:

# 启动 Jaeger All-in-One
docker run -d --name jaeger \
  -e COLLECTOR_OTLP_ENABLED=true \
  -p 16686:16686 \
  -p 4317:4317 \
  -p 4318:4318 \
  jaegertracing/all-in-one:latest

配置导出到 Jaeger:

// 使用 OTLP gRPC 导出到 Jaeger
cleanup, err := telemetry.InitTracer(ctx, "order-service", "localhost:4317")
if err != nil {
	log.Fatalf("Failed to init tracer: %v", err)
}
defer cleanup(context.Background())

// Jaeger UI: http://localhost:16686

导出到 Grafana Tempo

Grafana Tempo 是 Grafana 生态中的分布式追踪后端,与 Grafana 仪表盘深度集成:

# docker-compose.yml(Grafana + Tempo + Prometheus)
version: '3.8'

services:
  tempo:
    image: grafana/tempo:latest
    command: ["-config.file=/etc/tempo.yaml"]
    volumes:
      - ./tempo.yaml:/etc/tempo.yaml
    ports:
      - "4317:4317"   # OTLP gRPC
      - "3200:3200"   # Tempo API

  prometheus:
    image: prom/prometheus:latest
    volumes:
      - ./prometheus.yml:/etc/prometheus/prometheus.yml
    ports:
      - "9090:9090"

  grafana:
    image: grafana/grafana:latest
    environment:
      - GF_AUTH_ANONYMOUS_ENABLED=true
      - GF_AUTH_ANONYMOUS_ORG_ROLE=Admin
    ports:
      - "3000:3000"
    volumes:
      - ./grafana-datasources.yml:/etc/grafana/provisioning/datasources/datasources.yml

Tempo 配置文件:

# tempo.yaml
server:
  http_listen_port: 3200

distributor:
  receivers:
    otlp:
      protocols:
        grpc:
          endpoint: "0.0.0.0:4317"

storage:
  trace:
    backend: local
    local:
      path: /tmp/tempo/blocks

Grafana 数据源配置:

# grafana-datasources.yml
apiVersion: 1

datasources:
  - name: Tempo
    type: tempo
    access: proxy
    url: http://tempo:3200
    isDefault: true

  - name: Prometheus
    type: prometheus
    access: proxy
    url: http://prometheus:9090

完整的 main.go 组装

将所有组件组装在一起:

// main.go
package main

import (
	"context"
	"log"
	"net/http"
	"os"
	"os/signal"
	"syscall"
	"time"

	"github.com/prometheus/client_golang/prometheus/promhttp"
	"go.opentelemetry.io/contrib/instrumentation/net/http/otelhttp"

	"github.com/yourusername/otel-demo/internal/handler"
	"github.com/yourusername/otel-demo/internal/logging"
	"github.com/yourusername/otel-demo/internal/metrics"
	"github.com/yourusername/otel-demo/internal/telemetry"
)

func main() {
	ctx := context.Background()
	logger := logging.NewLogger("order-service")

	// 初始化 TracerProvider
	traceCleanup, err := telemetry.InitTracer(ctx,
		"order-service",
		getEnv("OTEL_EXPORTER_OTLP_ENDPOINT", "localhost:4317"),
	)
	if err != nil {
		log.Fatalf("Failed to init tracer: %v", err)
	}
	defer traceCleanup(ctx)

	// 初始化 Prometheus 指标
	_, err = telemetry.InitPrometheusMeter(ctx, "order-service")
	if err != nil {
		log.Fatalf("Failed to init meter: %v", err)
	}

	// 创建业务指标
	orderMetrics, err := metrics.NewOrderMetrics()
	if err != nil {
		log.Fatalf("Failed to create metrics: %v", err)
	}

	// 创建 Handler
	orderHandler := handler.NewOrderHandler(orderMetrics, logger)

	// 路由配置
	mux := http.NewServeMux()
	mux.HandleFunc("/api/orders", orderHandler.CreateOrder)
	mux.HandleFunc("/health", func(w http.ResponseWriter, r *http.Request) {
		w.Write([]byte(`{"status":"ok"}`))
	})
	mux.Handle("/metrics", promhttp.Handler())

	// 用 otelhttp 包装,自动创建 Span
	wrappedMux := otelhttp.NewHandler(mux, "order-service",
		otelhttp.WithFilter(func(r *http.Request) bool {
			return r.URL.Path != "/health" && r.URL.Path != "/metrics"
		}),
	)

	srv := &http.Server{
		Addr:         ":8080",
		Handler:      wrappedMux,
		ReadTimeout:  10 * time.Second,
		WriteTimeout: 30 * time.Second,
		IdleTimeout:  60 * time.Second,
	}

	// 优雅关闭
	go func() {
		logger.Info("Server starting on :8080")
		if err := srv.ListenAndServe(); err != nil && err != http.ErrServerClosed {
			logger.Error("Server error", "error", err)
		}
	}()

	quit := make(chan os.Signal, 1)
	signal.Notify(quit, syscall.SIGINT, syscall.SIGTERM)
	<-quit

	logger.Info("Shutting down server...")
	shutdownCtx, cancel := context.WithTimeout(context.Background(), 10*time.Second)
	defer cancel()

	if err := srv.Shutdown(shutdownCtx); err != nil {
		logger.Error("Server forced to shutdown", "error", err)
	}

	logger.Info("Server exited")
}

func getEnv(key, fallback string) string {
	if val := os.Getenv(key); val != "" {
		return val
	}
	return fallback
}

Grafana 仪表盘配置

在 Grafana 中,你可以创建强大的仪表盘来可视化 Trace 和 Metrics。以下是几个关键面板的建议:

面板数据源PromQL / 查询
请求 QPSPrometheusrate(http_requests_total[5m])
P50/P95/P99 延迟Prometheushistogram_quantile(0.99, rate(http_request_duration_seconds_bucket[5m]))
错误率Prometheusrate(http_requests_total{status=~"5.."}[5m]) / rate(http_requests_total[5m])
订单创建速率Prometheusrate(orders_total[5m])
活跃订单数Prometheusorders_active
GC PausePrometheusrate(go_gc_duration_seconds_sum[5m])

你还可以利用 Grafana 的 Trace to Logs 功能,在 Trace 视图中直接关联到 Loki 的日志——这就是可观测性三大支柱互相关联的魅力所在。

生产环境部署建议

将 OpenTelemetry 部署到生产环境时,建议遵循以下最佳实践:

1. 使用 OpenTelemetry Collector

不要让你的服务直接导出数据到 Jaeger/Prometheus,而是经过 Collector 中转:

# otel-collector-config.yaml
receivers:
  otlp:
    protocols:
      grpc:
        endpoint: 0.0.0.0:4317
      http:
        endpoint: 0.0.0.0:4318

processors:
  batch:
    timeout: 5s
    send_batch_size: 1024

  memory_limiter:
    check_interval: 1s
    limit_mib: 512
    spike_limit_mib: 128

  filter/drop-health:
    error_mode: ignore
    traces:
      span:
        - 'attributes["http.target"] == "/health"'
        - 'attributes["http.target"] == "/metrics"'

exporters:
  otlp/jaeger:
    endpoint: jaeger:4317
    tls:
      insecure: true

  prometheus:
    endpoint: 0.0.0.0:8889

  otlp/tempo:
    endpoint: tempo:4317
    tls:
      insecure: true

service:
  pipelines:
    traces:
      receivers: [otlp]
      processors: [memory_limiter, filter/drop-health, batch]
      exporters: [otlp/jaeger, otlp/tempo]
    metrics:
      receivers: [otlp]
      processors: [memory_limiter, batch]
      exporters: [prometheus]

2. Kubernetes 部署 Collector

# k8s/otel-collector.yaml
apiVersion: apps/v1
kind: Deployment
metadata:
  name: otel-collector
spec:
  replicas: 2
  selector:
    matchLabels:
      app: otel-collector
  template:
    metadata:
      labels:
        app: otel-collector
    spec:
      containers:
      - name: otel-collector
        image: otel/opentelemetry-collector-contrib:latest
        args: ["--config=/etc/otel/collector.yaml"]
        ports:
        - containerPort: 4317
        - containerPort: 4318
        - containerPort: 8889
        resources:
          requests:
            cpu: 100m
            memory: 256Mi
          limits:
            cpu: 500m
            memory: 512Mi
        volumeMounts:
        - name: config
          mountPath: /etc/otel
      volumes:
      - name: config
        configMap:
          name: otel-collector-config
---
apiVersion: v1
kind: Service
metadata:
  name: otel-collector
spec:
  selector:
    app: otel-collector
  ports:
  - name: otlp-grpc
    port: 4317
  - name: otlp-http
    port: 4318
  - name: prometheus
    port: 8889

3. 你的 Go 服务只需要导出到本地 Collector

// 在 Kubernetes 中,每个 Pod 运行一个 Collector Sidecar
// Go 服务只需要导出到 localhost
cleanup, err := telemetry.InitTracer(ctx,
	"order-service",
	"localhost:4317", // 指向 sidecar collector
)

4. 采样策略配置

// 根据环境动态调整采样率
func getSampler(env string) sdktrace.Sampler {
	switch env {
	case "development":
		return sdktrace.AlwaysSample()
	case "staging":
		return sdktrace.TraceIDRatioBased(0.5)
	case "production":
		return sdktrace.ParentBased(
			sdktrace.TraceIDRatioBased(0.1),
		)
	default:
		return sdktrace.AlwaysSample()
	}
}

实战:完整的分布式追踪示例

让我们构建一个包含多个服务的完整分布式系统,看看 OpenTelemetry 如何在真实场景中工作。

服务 1:API Gateway

// services/gateway/main.go
package main

import (
	"context"
	"log"
	"net/http"
	"os"
	"time"

	"go.opentelemetry.io/contrib/instrumentation/net/http/otelhttp"
	"go.opentelemetry.io/otel"
	"go.opentelemetry.io/otel/attribute"
	"go.opentelemetry.io/otel/codes"
	"go.opentelemetry.io/otel/propagation"
	"go.opentelemetry.io/otel/trace"

	"github.com/yourusername/otel-demo/internal/telemetry"
)

var tracer = otel.Tracer("gateway-service")

func main() {
	ctx := context.Background()

	// 初始化追踪
	cleanup, err := telemetry.InitTracer(ctx, "gateway-service", "localhost:4317")
	if err != nil {
		log.Fatalf("Failed to init tracer: %v", err)
	}
	defer cleanup(ctx)

	// 创建 HTTP 客户端(自动传播上下文)
	httpClient := &http.Client{
		Transport: otelhttp.NewTransport(http.DefaultTransport),
		Timeout:   10 * time.Second,
	}

	// API Gateway 处理器
	handler := http.HandlerFunc(func(w http.ResponseWriter, r *http.Request) {
		ctx := r.Context()
		span := trace.SpanFromContext(ctx)

		span.SetAttributes(
			attribute.String("gateway.path", r.URL.Path),
			attribute.String("gateway.method", r.Method),
		)

		// 根据路径路由到不同服务
		switch {
		case r.URL.Path == "/api/orders":
			span.AddEvent("routing to order-service")
			resp, err := httpClient.Get("http://order-service:8080/orders")
			if err != nil {
				span.RecordError(err)
				span.SetStatus(codes.Error, "order-service unavailable")
				http.Error(w, "Service unavailable", http.StatusServiceUnavailable)
				return
			}
			defer resp.Body.Close()

			span.SetAttributes(attribute.Int("order_service.status", resp.StatusCode))

		case r.URL.Path == "/api/users":
			span.AddEvent("routing to user-service")
			resp, err := httpClient.Get("http://user-service:8080/users")
			if err != nil {
				span.RecordError(err)
				http.Error(w, "Service unavailable", http.StatusServiceUnavailable)
				return
			}
			defer resp.Body.Close()

		default:
			span.SetStatus(codes.Error, "unknown route")
			http.NotFound(w, r)
			return
		}

		w.WriteHeader(http.StatusOK)
		w.Write([]byte(`{"status":"ok"}`))
	})

	// 用 otelhttp 包装(自动创建 Span)
	wrappedHandler := otelhttp.NewHandler(handler, "gateway",
		otelhttp.WithPropagators(propagation.NewCompositeTextMapPropagator(
			propagation.TraceContext{},
			propagation.Baggage{},
		)),
	)

	log.Println("Gateway starting on :8080")
	log.Fatal(http.ListenAndServe(":8080", wrappedHandler))
}

服务 2:Order Service

// services/order-service/main.go
package main

import (
	"context"
	"encoding/json"
	"log"
	"net/http"
	"time"

	"go.opentelemetry.io/contrib/instrumentation/net/http/otelhttp"
	"go.opentelemetry.io/otel"
	"go.opentelemetry.io/otel/attribute"
	"go.opentelemetry.io/otel/codes"
	"go.opentelemetry.io/otel/trace"

	"github.com/yourusername/otel-demo/internal/telemetry"
)

var tracer = otel.Tracer("order-service")

type Order struct {
	ID        string    `json:"id"`
	UserID    string    `json:"user_id"`
	Amount    float64   `json:"amount"`
	Status    string    `json:"status"`
	CreatedAt time.Time `json:"created_at"`
}

func main() {
	ctx := context.Background()

	cleanup, err := telemetry.InitTracer(ctx, "order-service", "localhost:4317")
	if err != nil {
		log.Fatalf("Failed to init tracer: %v", err)
	}
	defer cleanup(ctx)

	// HTTP 客户端(自动传播)
	httpClient := &http.Client{
		Transport: otelhttp.NewTransport(http.DefaultTransport),
	}

	mux := http.NewServeMux()

	mux.HandleFunc("/orders", func(w http.ResponseWriter, r *http.Request) {
		ctx := r.Context()

		// 创建子 Span
		ctx, span := tracer.Start(ctx, "ProcessOrderRequest",
			trace.WithAttributes(
				attribute.String("order.handler", "list"),
			),
		)
		defer span.End()

		// 模拟数据库查询
		orders, err := fetchOrdersFromDB(ctx)
		if err != nil {
			span.RecordError(err)
			span.SetStatus(codes.Error, "database error")
			http.Error(w, err.Error(), http.StatusInternalServerError)
			return
		}

		span.SetAttributes(attribute.Int("orders.count", len(orders)))

		// 调用用户服务获取用户信息
		for i, order := range orders {
			userInfo, err := fetchUserInfo(ctx, httpClient, order.UserID)
			if err != nil {
				span.AddEvent("failed to fetch user info",
					trace.WithAttributes(
						attribute.String("user.id", order.UserID),
						attribute.String("error", err.Error()),
					),
				)
				continue
			}
			span.AddEvent("enriched order with user info",
				trace.WithAttributes(
					attribute.String("order.id", order.ID),
					attribute.String("user.name", userInfo.Name),
				),
			)
			_ = i // 使用 userInfo 更新 order
		}

		w.Header().Set("Content-Type", "application/json")
		json.NewEncoder(w).Encode(orders)
	})

	// 包装处理器
	handler := otelhttp.NewHandler(mux, "order-service")

	log.Println("Order service starting on :8081")
	log.Fatal(http.ListenAndServe(":8081", handler))
}

func fetchOrdersFromDB(ctx context.Context) ([]Order, error) {
	ctx, span := tracer.Start(ctx, "DatabaseQuery",
		trace.WithAttributes(
			attribute.String("db.system", "postgresql"),
			attribute.String("db.statement", "SELECT * FROM orders"),
		),
	)
	defer span.End()

	// 模拟数据库延迟
	time.Sleep(50 * time.Millisecond)

	span.AddEvent("query completed",
		trace.WithAttributes(attribute.Int("db.rows_affected", 3)),
	)

	return []Order{
		{ID: "ord_1", UserID: "user_123", Amount: 99.99, Status: "paid"},
		{ID: "ord_2", UserID: "user_456", Amount: 149.50, Status: "pending"},
		{ID: "ord_3", UserID: "user_789", Amount: 299.00, Status: "shipped"},
	}, nil
}

func fetchUserInfo(ctx context.Context, client *http.Client, userID string) (*struct{ Name string }, error) {
	ctx, span := tracer.Start(ctx, "FetchUserInfo",
		trace.WithSpanKind(trace.SpanKindClient),
		trace.WithAttributes(
			attribute.String("user.id", userID),
			attribute.String("peer.service", "user-service"),
		),
	)
	defer span.End()

	resp, err := client.Get("http://user-service:8082/users/" + userID)
	if err != nil {
		span.RecordError(err)
		return nil, err
	}
	defer resp.Body.Close()

	if resp.StatusCode != http.StatusOK {
		err := fmt.Errorf("user service returned %d", resp.StatusCode)
		span.SetStatus(codes.Error, resp.Status)
		return nil, err
	}

	var user struct{ Name string }
	if err := json.NewDecoder(resp.Body).Decode(&user); err != nil {
		span.RecordError(err)
		return nil, err
	}

	span.SetAttributes(attribute.String("user.name", user.Name))
	return &user, nil
}

服务 3:User Service

// services/user-service/main.go
package main

import (
	"context"
	"encoding/json"
	"log"
	"net/http"
	"strings"
	"time"

	"go.opentelemetry.io/otel"
	"go.opentelemetry.io/otel/attribute"
	"go.opentelemetry.io/otel/codes"
	"go.opentelemetry.io/otel/propagation"
	"go.opentelemetry.io/otel/trace"

	"go.opentelemetry.io/contrib/instrumentation/net/http/otelhttp"

	"github.com/yourusername/otel-demo/internal/telemetry"
)

var tracer = otel.Tracer("user-service")

type User struct {
	ID    string `json:"id"`
	Name  string `json:"name"`
	Email string `json:"email"`
}

func main() {
	ctx := context.Background()

	cleanup, err := telemetry.InitTracer(ctx, "user-service", "localhost:4317")
	if err != nil {
		log.Fatalf("Failed to init tracer: %v", err)
	}
	defer cleanup(ctx)

	mux := http.NewServeMux()

	mux.HandleFunc("/users/", func(w http.ResponseWriter, r *http.Request) {
		// 从 Header 提取传播的上下文
		ctx := propagation.NewCompositeTextMapPropagator(
			propagation.TraceContext{},
			propagation.Baggage{},
		).Extract(r.Context(), propagation.HeaderCarrier(r.Header))

		// 创建 Span(会自动关联到上游)
		ctx, span := tracer.Start(ctx, "GetUser",
			trace.WithSpanKind(trace.SpanKindServer),
		)
		defer span.End()

		// 从 URL 提取用户 ID
		userID := strings.TrimPrefix(r.URL.Path, "/users/")
		span.SetAttributes(attribute.String("user.id", userID))

		// 查询用户
		user, err := getUserFromDB(ctx, userID)
		if err != nil {
			span.RecordError(err)
			span.SetStatus(codes.Error, "user not found")
			http.NotFound(w, r)
			return
		}

		span.SetAttributes(
			attribute.String("user.name", user.Name),
			attribute.String("user.email", user.Email),
		)
		span.SetStatus(codes.Ok, "user found")

		w.Header().Set("Content-Type", "application/json")
		json.NewEncoder(w).Encode(user)
	})

	handler := otelhttp.NewHandler(mux, "user-service")

	log.Println("User service starting on :8082")
	log.Fatal(http.ListenAndServe(":8082", handler))
}

func getUserFromDB(ctx context.Context, userID string) (*User, error) {
	_, span := tracer.Start(ctx, "DatabaseQuery",
		trace.WithAttributes(
			attribute.String("db.system", "postgresql"),
			attribute.String("db.statement", "SELECT * FROM users WHERE id = $1"),
		),
	)
	defer span.End()

	// 模拟数据库查询
	time.Sleep(30 * time.Millisecond)

	// 模拟用户数据
	users := map[string]*User{
		"user_123": {ID: "user_123", Name: "Alice", Email: "alice@example.com"},
		"user_456": {ID: "user_456", Name: "Bob", Email: "bob@example.com"},
		"user_789": {ID: "user_789", Name: "Charlie", Email: "charlie@example.com"},
	}

	user, ok := users[userID]
	if !ok {
		return nil, fmt.Errorf("user %s not found", userID)
	}

	return user, nil
}

使用 Docker Compose 启动完整环境

# docker-compose.yml
version: '3.8'

services:
  jaeger:
    image: jaegertracing/all-in-one:latest
    environment:
      - COLLECTOR_OTLP_ENABLED=true
    ports:
      - "16686:16686"
      - "4317:4317"
      - "4318:4318"

  gateway:
    build:
      context: ./services/gateway
    ports:
      - "8080:8080"
    environment:
      - OTEL_EXPORTER_OTLP_ENDPOINT=jaeger:4317
      - OTEL_SERVICE_NAME=gateway-service
    depends_on:
      - jaeger
      - order-service
      - user-service

  order-service:
    build:
      context: ./services/order-service
    environment:
      - OTEL_EXPORTER_OTLP_ENDPOINT=jaeger:4317
      - OTEL_SERVICE_NAME=order-service
    depends_on:
      - jaeger
      - user-service

  user-service:
    build:
      context: ./services/user-service
    environment:
      - OTEL_EXPORTER_OTLP_ENDPOINT=jaeger:4317
      - OTEL_SERVICE_NAME=user-service
    depends_on:
      - jaeger

启动后访问 http://localhost:8080/api/orders,然后在 Jaeger UI (http://localhost:16686) 查看完整的分布式追踪链路。你会看到:

  1. Gateway 接收请求
  2. Gateway 路由到 Order Service
  3. Order Service 查询数据库
  4. Order Service 调用 User Service
  5. User Service 查询数据库
  6. 所有 Span 串联成一条完整的 Trace

高级追踪模式

Baggage:跨服务传递业务上下文

有时候你需要在整条链路中传递业务信息(比如用户 ID、租户 ID),而不仅仅是追踪信息。OpenTelemetry 的 Baggage 机制就是为此设计的:

import (
	"go.opentelemetry.io/otel/baggage"
	"go.opentelemetry.io/otel/attribute"
)

func handleRequest(ctx context.Context, r *http.Request) {
	// 添加 baggage(会自动传播到下游服务)
	ctx = baggage.ContextWithBaggage(ctx,
		baggage.New(
			baggage.Member{Key: "user.id", Value: "user_123"},
			baggage.Member{Key: "tenant.id", Value: "tenant_456"},
			baggage.Member{Key: "request.priority", Value: "high"},
		),
	)

	// 下游服务可以读取这些 baggage
	span := trace.SpanFromContext(ctx)
	span.SetAttributes(
		attribute.String("user.id", baggage.FromContext(ctx).Member("user.id").Value()),
	)
}

Span Links:关联相关的 Trace

有时候两个 Trace 在逻辑上是相关的(比如批量处理中的单个任务),但又不属于同一个 Trace。这时可以使用 Span Links:

func processBatchItem(ctx context.Context, batchTraceID string, item Item) {
	// 创建关联到批次 Trace 的 Span
	_, span := tracer.Start(ctx, "ProcessItem",
		trace.WithLinks(
			trace.Link{
				SpanContext: trace.NewSpanContext(trace.SpanContextConfig{
					TraceID: parseTraceID(batchTraceID),
				}),
				Attributes: []attribute.KeyValue{
					attribute.String("link.type", "batch_parent"),
				},
			},
		),
	)
	defer span.End()

	// 处理逻辑...
}

常见陷阱与最佳实践

1. 不要在循环中创建过多 Span

// ❌ 错误:循环中创建大量 Span
for i := 0; i < 10000; i++ {
	ctx, span := tracer.Start(ctx, "ProcessItem")
	processItem(i)
	span.End()
}

// ✅ 正确:为整个批次创建一个 Span
ctx, span := tracer.Start(ctx, "ProcessBatch",
	trace.WithAttributes(attribute.Int("batch.size", 10000)),
)
defer span.End()

for i := 0; i < 10000; i++ {
	// 使用事件而不是子 Span
	span.AddEvent("processing item",
		trace.WithAttributes(attribute.Int("item.index", i)),
	)
	processItem(i)
}

2. 及时调用 span.End()

忘记调用 span.End() 会导致 Span 永远不会结束,造成内存泄漏和追踪数据不完整:

// ❌ 错误:忘记 End
func bad() {
	ctx, span := tracer.Start(context.Background(), "MyOperation")
	// ... 逻辑 ...
	// 忘记调用 span.End()
}

// ✅ 正确:使用 defer
func good() {
	ctx, span := tracer.Start(context.Background(), "MyOperation")
	defer span.End()
	// ... 逻辑 ...
}

3. 合理设置 Span 属性

不要过度使用属性,只记录对调试有价值的数据:

// ❌ 错误:记录敏感信息
span.SetAttributes(
	attribute.String("user.password", user.Password),
	attribute.String("credit_card.number", card.Number),
)

// ✅ 正确:只记录必要信息
span.SetAttributes(
	attribute.String("user.id", user.ID),
	attribute.String("order.id", order.ID),
	attribute.Bool("order.is_premium", order.IsPremium),
)

故障排查指南

问题 1:Jaeger 中看不到 Trace

检查清单:

  1. 确认 Collector 是否正常运行
  2. 确认服务是否能连接到 Collector(检查网络和端口)
  3. 确认采样率设置(AlwaysSample vs TraceIDRatioBased
  4. 查看服务日志是否有导出错误
# 检查 Collector 状态
kubectl get pods -l app=otel-collector

# 查看 Collector 日志
kubectl logs -l app=otel-collector

# 测试 Collector 端口连通性
nc -zv localhost 4317

问题 2:Trace 链路不完整

通常是上下文传播出了问题:

// 检查是否正确注入/提取上下文
// 在 HTTP 客户端
req, _ := http.NewRequest("GET", url, nil)
otel.GetTextMapPropagator().Inject(ctx, propagation.HeaderCarrier(req.Header))
// 打印 Header 确认是否有 traceparent
log.Printf("Headers: %v", req.Header)

// 在 HTTP 服务端
ctx = otel.GetTextMapPropagator().Extract(ctx, propagation.HeaderCarrier(r.Header))
// 检查是否有有效的 Span Context
if span := trace.SpanFromContext(ctx); span.SpanContext().IsValid() {
	log.Println("Valid parent span found")
} else {
	log.Println("No valid parent span")
}

问题 3:Prometheus 指标不更新

检查指标注册是否正确:

// 确保使用同一个 MeterProvider
meter := otel.Meter("my-service")

// 检查指标名称是否符合 Prometheus 规范
// Prometheus 要求指标名只能包含字母、数字、下划线
counter, _ := meter.Int64Counter("http_requests_total") // ✅
counter, _ := meter.Int64Counter("http-requests-total") // ❌

性能优化建议

在生产环境中使用 OpenTelemetry 时,需要注意性能影响:

1. 异步导出

使用异步批量导出,避免阻塞业务逻辑:

// 配置批量导出器
tp := sdktrace.NewTracerProvider(
	sdktrace.WithBatcher(exporter,
		sdktrace.WithBatchTimeout(5*time.Second),     // 最多等待 5 秒
		sdktrace.WithMaxExportBatchSize(512),          // 每批最多 512 个 Span
		sdktrace.WithMaxQueueSize(2048),               // 队列最大容量
		sdktrace.WithExportTimeout(30*time.Second),    // 导出超时
	),
)

2. 减少字符串分配

Span 属性和事件名称应该复用,避免每次创建新字符串:

// 定义常量属性键
var (
	userIDKey   = attribute.Key("user.id")
	orderIDKey  = attribute.Key("order.id")
	statusKey   = attribute.Key("http.status")
)

// 使用常量键
span.SetAttributes(
	userIDKey.String("user_123"),
	orderIDKey.String("order_456"),
	statusKey.Int(200),
)

3. 条件性记录详细信息

只在需要时记录详细信息,避免不必要的开销:

func handleRequest(ctx context.Context, req *Request) error {
	ctx, span := tracer.Start(ctx, "HandleRequest")
	defer span.End()

	// 只在调试模式下记录详细请求体
	if os.Getenv("OTEL_DEBUG") == "true" {
		body, _ := json.Marshal(req)
		span.SetAttributes(attribute.String("request.body", string(body)))
	}

	// 只在错误时记录详细信息
	if err := processRequest(req); err != nil {
		span.RecordError(err)
		span.SetAttributes(
			attribute.String("error.details", err.Error()),
			attribute.String("error.stack", string(debug.Stack())),
		)
		return err
	}

	return nil
}

4. 使用 Span 过滤器

在 Collector 层面过滤不需要的 Span,减少存储成本:

# otel-collector-config.yaml
processors:
  filter/drop-health:
    error_mode: ignore
    traces:
      span:
        # 丢弃健康检查相关的 Span
        - 'attributes["http.target"] == "/health"'
        - 'attributes["http.target"] == "/metrics"'
        - 'attributes["http.target"] == "/ready"'

  tail_sampling:
    policies:
      # 只采样有错误的 Trace
      - name: errors-policy
        type: status_code
        status_code: {status_codes: [ERROR]}

      # 只采样慢请求(延迟 > 1s)
      - name: latency-policy
        type: latency
        latency: {threshold_ms: 1000}

      # 采样特定服务的 Trace
      - name: service-policy
        type: string_attribute
        string_attribute:
          key: service.name
          values: [order-service, payment-service]

总结

恭喜你完成了 OpenTelemetry 的学习之旅!让我们来回顾一下核心要点:

  1. Tracing:使用 Span 记录请求的完整链路,支持跨服务上下文传播
  2. Metrics:Counter、Histogram、Gauge 等类型覆盖各种度量场景
  3. Logging:将 TraceID 和 SpanID 注入日志,实现日志与链路追踪的关联
  4. Auto-Instrumentation:HTTP、gRPC、数据库驱动等常见组件都有自动埋点插件
  5. 采样策略:生产环境使用 ParentBased + TraceIDRatio 控制数据量
  6. 导出器:支持 Jaeger、Tempo、Prometheus 等主流后端
  7. Collector:作为中间代理统一管理遥测数据的接收和转发

可观测性不是锦上添花,而是分布式系统的必需品。下次当你的服务在凌晨两点出故障时,有了 OpenTelemetry,你就能在 Jaeger 里一眼看到问题的根源,而不是在日志的海洋里苦苦搜索。

希望这篇文章能帮助你掌握 Go 项目中 OpenTelemetry 的集成方法。下一篇,我们将深入 Kubernetes Operator 的世界!

继续阅读

探索更多技术文章

浏览归档,发现更多关于系统设计、工具链和工程实践的内容。

全部文章 返回首页