golang

性能分析：pprof 和 trace 工具

学习使用 Go 的性能分析工具 pprof 和 trace，找出性能瓶颈并优化代码

Leeting Yan 2021-04-22 5 分钟阅读 2127 字

性能分析：pprof 和 trace 工具

“我的程序很慢”——这是开发者最常遇到的问题之一。但"慢"是一个模糊的描述，真正的瓶颈在哪里？是 CPU 计算太慢？是内存分配太多？是 I/O 等待太久？还是锁竞争太严重？

Go 提供了强大的性能分析工具：pprof 和 trace，帮你精确定位性能问题。

pprof 简介

pprof 是 Go 标准库中的性能分析工具，可以收集和分析以下信息：

CPU profile：哪些函数占用了最多的 CPU 时间
Memory profile：哪些函数分配了最多的内存
Block profile：哪些操作导致了阻塞
Mutex profile：哪些锁竞争最严重
Goroutine profile：当前所有 goroutine 的堆栈

使用 pprof

方式一：命令行工具

package main

import (
    "os"
    "runtime/pprof"
    "time"
)

func main() {
    // CPU profile
    f, _ := os.Create("cpu.prof")
    pprof.StartCPUProfile(f)
    defer pprof.StopCPUProfile()
    
    // 你的程序逻辑
    doWork()
    
    // Memory profile
    memF, _ := os.Create("mem.prof")
    defer memF.Close()
    pprof.WriteHeapProfile(memF)
}

func doWork() {
    for i := 0; i < 1000000; i++ {
        s := make([]byte, 1024)
        _ = s
    }
    time.Sleep(time.Second)
}

运行程序后，分析 profile：

# 分析 CPU profile
go tool pprof cpu.prof

# 分析 Memory profile
go tool pprof mem.prof

方式二：HTTP 服务

package main

import (
    "net/http"
    _ "net/http/pprof"  // 自动注册 pprof handlers
)

func main() {
    go func() {
        // 你的服务器
        http.ListenAndServe(":8080", nil)
    }()
    
    // pprof 服务（通常放在不同的端口）
    http.ListenAndServe(":6060", nil)
}

访问 pprof 端点：

# 30 秒 CPU profile
curl http://localhost:6060/debug/pprof/profile?seconds=30 > cpu.prof

# Heap profile
curl http://localhost:6060/debug/pprof/heap > mem.prof

# Goroutine profile
curl http://localhost:6060/debug/pprof/goroutine > goroutine.prof

# Block profile
curl http://localhost:6060/debug/pprof/block > block.prof

# Mutex profile
curl http://localhost:6060/debug/pprof/mutex > mutex.prof

方式三：testing 包

package main

import (
    "testing"
)

func BenchmarkMyFunction(b *testing.B) {
    b.ReportAllocs()  // 报告内存分配
    
    for i := 0; i < b.N; i++ {
        MyFunction()
    }
}

func MyFunction() {
    s := make([]int, 100)
    for i := range s {
        s[i] = i * i
    }
}

运行基准测试并生成 profile：

go test -bench=. -cpuprofile=cpu.prof -memprofile=mem.prof
go tool pprof cpu.prof

pprof 交互式命令

进入 pprof 后，可以使用以下命令：

(pprof) top          # 显示最耗资源的函数
(pprof) top 20       # 显示前 20 个函数
(pprof) top -cum     # 按累计时间排序

(pprof) list FuncName  # 查看函数源码和耗时分布
(pprof) web          # 生成 SVG 调用图（需要 graphviz）
(pprof) png          # 生成 PNG 图片
(pprof) pdf          # 生成 PDF

(pprof) tree         # 树形显示调用关系
(pprof) callgrind    # 导出为 callgrind 格式（用于 KCachegrind）

(pprof) help         # 查看所有命令
(pprof) quit         # 退出

实战：分析 CPU 瓶颈

package main

import (
    "fmt"
    "math"
    "os"
    "runtime/pprof"
)

func main() {
    f, _ := os.Create("cpu.prof")
    pprof.StartCPUProfile(f)
    defer pprof.StopCPUProfile()
    
    result := calculate()
    fmt.Println("Result:", result)
}

func calculate() float64 {
    sum := 0.0
    for i := 0; i < 10000000; i++ {
        sum += expensive(i)
    }
    return sum
}

func expensive(n int) float64 {
    result := 0.0
    for i := 0; i < 100; i++ {
        result += math.Sqrt(float64(n * i))
    }
    return result
}

func slow(n int) float64 {
    result := 0.0
    for i := 0; i < 1000; i++ {
        result += math.Log(float64(n + i))
    }
    return result
}

分析结果：

go run main.go
go tool pprof cpu.prof

(pprof) top
Showing nodes accounting for 10.23s, 99.80% of 10.25s total
      flat  flat%   sum%        cum   cum%
    8.45s 82.44% 82.44%     8.45s 82.44%  main.expensive
    1.78s 17.37% 99.80%    10.23s 99.80%  main.calculate
         0     0% 99.80%    10.23s 99.80%  main.main

从输出可以看出，expensive 函数占用了 82% 的 CPU 时间，是优化的重点。

查看具体代码：

(pprof) list expensive
Total: 10.25s
ROUTINE ======================== main.expensive
    8.45s     8.45s (flat, cum) 82.44% of Total
         .          .     15:func expensive(n int) float64 {
         .          .     16:    result := 0.0
    1.23s      1.23s     17:    for i := 0; i < 100; i++ {
    7.22s      7.22s     18:        result += math.Sqrt(float64(n * i))
         .          .     19:    }
         .          .     20:    return result
         .          .     21:}

第 18 行的 math.Sqrt 调用占用了大部分时间。

实战：分析内存分配

package main

import (
    "fmt"
    "os"
    "runtime/pprof"
)

func main() {
    doWork()
    
    f, _ := os.Create("mem.prof")
    pprof.WriteHeapProfile(f)
    f.Close()
    
    fmt.Println("Done")
}

func doWork() {
    var data [][]byte
    
    // 问题 1：频繁分配小对象
    for i := 0; i < 10000; i++ {
        b := make([]byte, 1024)
        data = append(data, b)
    }
    
    // 问题 2：字符串拼接
    s := ""
    for i := 0; i < 10000; i++ {
        s += "x"
    }
    
    // 问题 3：持有不必要的引用
    keep := data[:10]  // 只使用前 10 个，但整个 data 都无法被 GC
    _ = keep
}

分析内存分配：

go run main.go
go tool pprof -alloc_space mem.prof

(pprof) top
Showing nodes accounting for 19.54MB, 99.69% of 19.60MB total
      flat  flat%   sum%        cum   cum%
   10.02MB 51.10% 51.10%    10.02MB 51.10%  main.doWork
    9.52MB 48.59% 99.69%     9.52MB 48.59%  strings.concat

doWork 中的循环分配了 10MB，字符串拼接分配了 9.5MB。

trace 工具

trace 提供了更细粒度的执行跟踪，可以看到：

Goroutine 的创建和调度
系统调用
GC 活动
网络 I/O

使用 trace

package main

import (
    "os"
    "runtime/trace"
    "sync"
    "time"
)

func main() {
    f, _ := os.Create("trace.out")
    trace.Start(f)
    defer trace.Stop()
    
    var wg sync.WaitGroup
    
    trace.WithRegion(context.Background(), "main", func() {
        for i := 0; i < 10; i++ {
            wg.Add(1)
            go func(id int) {
                defer wg.Done()
                trace.WithRegion(context.Background(), "worker", func() {
                    time.Sleep(time.Millisecond * 100)
                })
            }(i)
        }
    })
    
    wg.Wait()
}

查看 trace：

go run main.go
go tool trace trace.out

这会打开一个 Web 界面，显示：

Goroutine 分析：每个 goroutine 的生命周期
堆栈跟踪：goroutine 的调用栈
GC 活动：垃圾回收的暂停时间
系统调用：syscall 的耗时
用户自定义区域：trace.WithRegion 标记的区域

trace 的关键视图

Goroutine 视图：

显示所有 goroutine 的时间线
可以看到 goroutine 何时运行、何时阻塞

Heap 视图：

显示堆内存的使用情况
可以看到 GC 的触发时机

Threads 视图：

显示操作系统线程的活动
可以看到线程的调度情况

实战：完整的性能优化流程

package main

import (
    "fmt"
    "math"
    "sync"
)

// 优化前：慢且有大量分配
func processSlow(data []float64) []float64 {
    results := make([]float64, 0)
    
    for _, v := range data {
        // 问题 1：每次循环都分配新的切片
        result := []float64{v}
        for i := 0; i < 100; i++ {
            result = append(result, math.Sqrt(v*float64(i)))
        }
        results = append(results, result...)
    }
    
    return results
}

// 优化后：快且分配少
func processFast(data []float64) []float64 {
    // 优化 1：预分配容量
    results := make([]float64, 0, len(data)*101)
    
    // 优化 2：复用临时切片
    temp := make([]float64, 101)
    
    for _, v := range data {
        temp[0] = v
        for i := 0; i < 100; i++ {
            temp[i+1] = math.Sqrt(v * float64(i))
        }
        results = append(results, temp...)
    }
    
    return results
}

// 进一步优化：并行处理
func processParallel(data []float64) []float64 {
    numWorkers := 4
    chunkSize := len(data) / numWorkers
    
    results := make([]float64, len(data)*101)
    var wg sync.WaitGroup
    
    for i := 0; i < numWorkers; i++ {
        wg.Add(1)
        go func(start int) {
            defer wg.Done()
            end := start + chunkSize
            if i == numWorkers-1 {
                end = len(data)
            }
            
            temp := make([]float64, 101)
            for j := start; j < end; j++ {
                v := data[j]
                temp[0] = v
                for k := 0; k < 100; k++ {
                    temp[k+1] = math.Sqrt(v * float64(k))
                }
                copy(results[j*101:], temp)
            }
        }(i * chunkSize)
    }
    
    wg.Wait()
    return results
}

func main() {
    data := make([]float64, 10000)
    for i := range data {
        data[i] = float64(i)
    }
    
    // 测试不同版本
    // result1 := processSlow(data)
    // result2 := processFast(data)
    // result3 := processParallel(data)
    
    _ = data
}

基准测试：

func BenchmarkProcessSlow(b *testing.B) {
    data := make([]float64, 10000)
    for i := range data {
        data[i] = float64(i)
    }
    b.ResetTimer()
    
    for i := 0; i < b.N; i++ {
        processSlow(data)
    }
}

func BenchmarkProcessFast(b *testing.B) {
    data := make([]float64, 10000)
    for i := range data {
        data[i] = float64(i)
    }
    b.ResetTimer()
    
    for i := 0; i < b.N; i++ {
        processFast(data)
    }
}

func BenchmarkProcessParallel(b *testing.B) {
    data := make([]float64, 10000)
    for i := range data {
        data[i] = float64(i)
    }
    b.ResetTimer()
    
    for i := 0; i < b.N; i++ {
        processParallel(data)
    }
}

运行基准测试：

go test -bench=. -benchmem

预期结果：

processFast 比 processSlow 快 2-3 倍，分配次数减少 90%
processParallel 比 processFast 快 2-4 倍（取决于 CPU 核心数）

总结

性能分析是优化的第一步。没有测量，就没有优化。

pprof 用于：

找出 CPU 热点函数
找出内存分配大户
分析锁竞争和阻塞

trace 用于：

理解程序的执行流程
分析并发行为
观察 GC 和调度

优化流程：

使用 pprof 收集 profile
分析 top 函数，找出瓶颈
使用 list 查看具体代码
优化代码
使用基准测试验证改进
重复以上步骤

记住 Donald Knuth 的名言："过早优化是万恶之源"。先让代码正确运行，再用工具分析和优化性能瓶颈。

继续阅读

探索更多技术文章

浏览归档，发现更多关于系统设计、工具链和工程实践的内容。

全部文章返回首页