PoC: USDT static tracepoints for wait event tracing#18
PoC: USDT static tracepoints for wait event tracing#18
Conversation
Add wait__event__start and wait__event__end probes to the DTrace
provider definition and invoke them from the static inline functions
pgstat_report_wait_start() and pgstat_report_wait_end().
Because these functions are static inline, they get inlined at every
call site (~100 locations across 36 files), leaving no function symbol
for eBPF uprobes to attach to. USDT probes solve this: the compiler
emits a nop instruction at each inlined site with ELF .note.stapsdt
metadata, allowing eBPF tools to discover and attach to all call sites
with a single probe definition.
This enables full eBPF-based wait event tracing (e.g., with bpftrace)
without requiring hardware watchpoints or PostgreSQL source patches
beyond this change.
When built without --enable-dtrace, the probes compile to do {} while(0)
with zero overhead.
PoC: covers all wait events via the two central inline functions.
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Benchmark Results: USDT Wait Event Tracepoint Observer EffectVM: Hetzner cx43 (8 vCPUs, 16GB RAM, Ubuntu 24.04, Helsinki) 3 builds tested:
Standard OLTP (pgbench -c 16 -j 8 -T 60)
High Contention (pgbench -c 64 -j 8 -T 60)
Overhead Summary
bpftrace Wait Event Capture (validation)During ~400s of traced benchmarking, bpftrace captured:
Interpretation
Next steps
|
CPU Flamegraph Analysis: USDT Wait Event TracepointsGenerated CPU flamegraphs using Interactive Flamegraphs (download SVG and open in browser)Gist: https://gist.github.com/NikolayS/50fd5409729bd0ff4e44ae2d491789c6
TPS Results (pgbench -c64 -j8 -T40)
Key Flamegraph Findings1. Stock vs. USDT (nop probes) — virtually identical flamegraphs The flamegraphs for 2. USDT-nodtrace — also identical Compiling the USDT branch without 3. USDT + bpftrace — uprobe overhead clearly visible This is where the flamegraph tells the real story. When bpftrace is attached (202 probes), the flamegraph shows a clear new code path: 7.2% of all postgres samples are spent in uprobe/int3 handling. The hottest spots within the uprobe path are:
Where probes fire (by parent function):
The 4. Summary
|
Hardware Watchpoint vs USDT: Observer Effect ComparisonBenchmarked pg_wait_tracer (hardware watchpoint approach) on the same VM and same VM: Hetzner cx43 (8 vCPUs, 16GB RAM, Ubuntu 24.04, kernel 6.8) pg_wait_tracer Results (hardware watchpoint, daemon mode with tracing)c16 (pgbench -c 16 -j 8 -T 60)
c64 (pgbench -c 64 -j 8 -T 60)
Combined Comparison (all approaches)
¹ Baselines differ between test sessions due to VM performance variance. Overhead percentages are computed against each session's own baseline. CPU Flamegraph AnalysisFlamegraph: flamegraph-hwwatch.svg (download and open in browser for interactive view) The flamegraph reveals two distinct sources of overhead from hardware watchpoints: 1. Debug exception handling (watchpoint fires → BPF program runs) 2. Context switch overhead (debug registers saved/restored on every context switch) CPU overhead breakdown (% of all samples):
Where the watchpoint fires (by parent function):
Key Observations
|
Flamegraph GalleryClick any flamegraph to open the interactive SVG (searchable, zoomable). 1. Baseline: pg-stock (no dtrace, no tracing)2. USDT branch, no dtrace compilation3. USDT branch, --enable-dtrace (nop probes, idle)4. USDT + bpftrace actively tracing5. pg_wait_tracer (hardware watchpoints) |
TL;DRAdding Summary of findingsWe benchmarked 4 configurations on an 8-vCPU VM (pgbench, scale 100, shared_buffers=2GB, 3 runs per config):
Flamegraphs confirm: nop probes are invisible in CPU profiles. When bpftrace is attached, the overhead is clearly visible in Thoughts on upstreaming and
|
Round 2 Benchmark: Upstream pg_wait_tracer (DmitryNFomin/pg_wait_tracer @
|
| Scenario | c16 TPS (median) | vs baseline | c64 TPS (median) | vs baseline |
|---|---|---|---|---|
| Baseline (pg-stock, no tracing) | 8,245 | — | 9,651 | — |
| USDT build (idle, no bpftrace) | 8,233 | -0.1% | 9,701 | +0.5% |
| USDT + bpftrace attached | 7,948 | -3.6% | 8,976 | -7.0% |
pg_wait_tracer upstream (8e01ee5) |
6,947 | -15.7% | 7,673 | -20.5% |
Round 1 vs Round 2 Comparison
| Scenario | R1 c16 | R2 c16 | R1 c64 | R2 c64 |
|---|---|---|---|---|
| Baseline | 8,751 | 8,245 | 9,578 | 9,651 |
| USDT idle | 8,368 (-4.4%) | 8,233 (-0.1%) | 9,105 (-4.9%) | 9,701 (+0.5%) |
| USDT + bpftrace | 7,590 (-13.3%) | 7,948 (-3.6%) | 8,050 (-16.0%) | 8,976 (-7.0%) |
| pg_wait_tracer | 6,658 (-16.6%)¹ | 6,947 (-15.7%) | 7,129 (-17.8%)¹ | 7,673 (-20.5%) |
¹ Round 1 used the NikolayS fork (commit df49f37), Round 2 uses upstream DmitryNFomin (commit 8e01ee5).
Individual Run TPS
Click to expand all runs
| File | TPS |
|---|---|
| baseline-c16-r1 | 7,979 |
| baseline-c16-r2 | 8,245 |
| baseline-c16-r3 | 8,260 |
| baseline-c64-r1 | 9,897 |
| baseline-c64-r2 | 9,213 |
| baseline-c64-r3 | 9,651 |
| pgwt-c16-r1 | 7,149 |
| pgwt-c16-r2 | 6,947 |
| pgwt-c16-r3 | 6,819 |
| pgwt-c64-r1 | 7,673 |
| pgwt-c64-r2 | 7,745 |
| pgwt-c64-r3 | 7,605 |
| usdt-idle-c16-r1 | 7,962 |
| usdt-idle-c16-r2 | 8,393 |
| usdt-idle-c16-r3 | 8,233 |
| usdt-idle-c64-r1 | 9,821 |
| usdt-idle-c64-r2 | 9,131 |
| usdt-idle-c64-r3 | 9,701 |
| usdt-bpf-c16-r1 | 7,999 |
| usdt-bpf-c16-r2 | 7,942 |
| usdt-bpf-c16-r3 | 7,948 |
| usdt-bpf-c64-r1 | 8,976 |
| usdt-bpf-c64-r2 | 9,341 |
| usdt-bpf-c64-r3 | 8,944 |
Flamegraph: pg_wait_tracer upstream (Round 2, c64)
Analysis
-
pg_wait_tracer upstream overhead remains high: 15.7% (c16) / 20.5% (c64) — essentially unchanged from Round 1 (16.6% / 17.8%). The upstream commit
8e01ee5claims "overhead reduced from 19% to ~6%" but our benchmarks do not confirm this improvement. -
USDT+bpftrace overhead dramatically improved vs Round 1: From 13-16% in Round 1 to 3.6-7.0% in Round 2. This is likely due to the VM being freshly booted vs a previously-loaded state in Round 1, highlighting sensitivity to system conditions.
-
USDT idle overhead essentially zero: The compiled-in USDT probes (with
--enable-dtrace) show negligible overhead when not attached, confirming they are NOPs at rest. -
pg_wait_tracer is 2-4× more expensive than USDT+bpftrace: The hardware-watchpoint approach continues to impose significantly higher overhead than the USDT/bpftrace approach across both concurrency levels.
-
Baseline variability note: R2 baseline c16 is ~6% lower than R1 (8,245 vs 8,751), while c64 is comparable (9,651 vs 9,578), suggesting some run-to-run variability in this VM environment.
Clarification: Overhead is Pure CPUAn important nuance about the benchmark numbers above: all overhead is pure CPU, with zero I/O component. The flamegraphs confirm this precisely:
What this means for real workloadspgbench is a CPU-saturated benchmark — transactions are tiny, the dataset fits in shared_buffers, and there's minimal real waiting. This is the worst case for measuring tracepoint overhead because the CPU cost of the probes is a large fraction of the total work per transaction. In real production workloads where queries involve actual I/O waits (disk reads, network round-trips, lock contention), the same fixed CPU cost per probe fire becomes a much smaller fraction of total transaction time. A query that takes 5ms of real work won't notice 1-2μs of probe overhead. Bottom line: the 4-5% and 13-16% numbers from pgbench represent an upper bound. Real-world overhead will be lower, proportional to how CPU-bound your workload is. |
Round 3: Scale-Up Test (48 vCPUs, 192GB RAM)Purpose: Test whether USDT nop probe overhead scales with core count due to I-cache effects. VM: Hetzner ccx63 (48 dedicated vCPUs, 192GB RAM, AMD EPYC), Helsinki (hel1) Resultsc64 (moderate contention — 1.3 backends/core)
c256 (high contention — 5.3 backends/core)
c512 (extreme — 10.7 backends/core, I-cache stress test)
Overhead Summary Across Scales
Flamegraphs (c512, maximum I-cache stress)Click SVG links for interactive flamegraphs. Baseline (stock)USDT idle (--enable-dtrace, no tracing)USDT + bpftrace (active tracing)pg_wait_tracer (BPF uprobe)Key Findings
|
Updated Flamegraphs (Round 4 — with debug symbols)Previous flamegraphs had large Test conditions: cx43 (8 vCPU), pgbench scale 100, Click any link below to open the interactive SVG (searchable with Ctrl+F, zoomable by clicking). 1. Baseline (stock PostgreSQL, no tracing) — 6,558 TPS2. USDT
|
| Scenario | TPS | vs Baseline |
|---|---|---|
| Stock baseline | 6,558 | — |
| USDT idle (nop probes) | 5,975 | -8.9% |
| USDT + bpftrace | 5,615 | -14.4% |
| pg_wait_tracer | 6,030 | -8.1% |
Note: 8-core VM — on production 48+ core machines the overhead percentages would be different. These flamegraphs are for qualitative analysis of WHERE overhead occurs, not precise quantification.
|
Next idea: adjust pg_wait_tracer code to support this patched Postgres and benchmark it to compare with other options |
|
one more idea: repeat benchmarks on ultra lightweight transactions -- simple also we should use -c/-j matching vCPU count |
Round 6: pg_wait_tracer USDT Mode vs Hardware WatchpointsPurpose: Compare pg_wait_tracer's new USDT mode (
TPC-B (standard pgbench, scale 100)
TPC-B is I/O-heavy so tracing overhead is masked by disk waits. All results within noise floor (~10% variance). SELECT 1 (ultra-lightweight, worst case for overhead)
SELECT 1 reveals true overhead since every query triggers wait event transitions:
Flamegraphs (SELECT 1, c8)
ConclusionUSDT mode is a significant improvement over hardware watchpoints:
|





Summary
This is a proof-of-concept patch that adds USDT (DTrace/SystemTap) static tracepoints to
pgstat_report_wait_start()andpgstat_report_wait_end(), enabling complete eBPF-based wait event tracing without hardware watchpoints or additional PostgreSQL patches.The problem
pgstat_report_wait_start()andpgstat_report_wait_end()are declaredstatic inlineinsrc/include/utils/wait_event.h. The compiler inlines them at every call site (~100 locations across 36 files), eliminating the function symbol from the binary. This makes standard eBPF uprobe-based tracing impossible — there is no address to attach to.The existing DTrace probes in
probes.dcover only a small subset of wait events (LWLock and heavyweight lock waits). The vast majority — all I/O waits (DATA_FILE_READ/WRITE,WAL_WRITE,WAL_SYNC), socket/latch waits,COMMIT_DELAY,VACUUM_DELAY,SLRU_*, replication waits, buffer lock waits, spinlock delays, io_uring waits, etc. — have no static tracepoint at all.Why uprobes can't work here
After inlining and optimization, each call site compiles down to a single store instruction (e.g.,
mov [reg], imm32). There are several categories that make this especially problematic:LWLockReportWaitStart()is itselfstatic inline, wrappingpgstat_report_wait_start()— two levels of inlining, zero symbolsPG_WAIT_LWLOCK | lock->tranche— the value only exists in a register at runtimewaiteventset.c:1063passeswait_event_infoas a function argument — no argument boundary after inliningbufmgr.c:5820-5834— compiler folds the switch + store togethers_lock.c:148— uprobe overhead unacceptable in tight backoff loopfd.c:2083-2108— different#ifdefbranches produce different inlined layouts per platformThe solution
USDT static tracepoints survive inlining. The compiler emits a
nopinstruction at each inlined call site and records its address in an ELF.note.stapsdtsection. eBPF tools (bpftrace, bcc, etc.) discover thenopvia ELF metadata and patch it to anint3trap at attach time.This patch adds two new probes to the DTrace provider definition:
And invokes them from the
static inlinefunctions:Usage example (bpftrace)
Zero overhead when not in use
--enable-dtrace: macros compile todo {} while(0)— zero cost, zero code emitted--enable-dtrace, no tracer attached: probes arenopinstructions — negligible overhead (anopalongside the existing volatile store)--enable-dtrace, tracer attached: each probe fires a software trap — this is the only configuration with measurable overheadBenchmarking plan (proving low observer effect)
To validate production-readiness, three configurations should be benchmarked:
./configure(no dtrace)./configure --enable-dtrace, no tracernopinstructions at ~100 inlined sites./configure --enable-dtrace, bpftrace attachedSuggested benchmark methodology
Metrics to compare: TPS, avg latency, p99 latency, CPU usage (
perf stat).Expected results:
nopis ~0.3ns, negligible next to the existing volatile store and the actual wait)Prior discussion
This idea was proposed by Jeremy Schneider on pgsql-hackers:
https://www.postgresql.org/message-id/20260109202241.6d881ed0%40ardentperf.com
Related: a talk covering eBPF-based PostgreSQL wait event analysis and the challenges of inlined functions:
https://www.youtube.com/watch?v=3Gtuc2lnnsE
Changes
src/backend/utils/probes.d: addedwait__event__start(unsigned int)andwait__event__end()probe definitionssrc/include/utils/wait_event.h: added#include "utils/probes.h"andTRACE_POSTGRESQL_WAIT_EVENT_START/ENDcalls🤖 Generated with Claude Code