wait_event_timing — corrected benchmark results

Independent overhead measurement of Dmitry Fomin's Oracle-style wait-event timing patch · PostgreSQL 19beta1 · Hetzner CCX43 (16 dedicated vCPU) · v3 — reviewer-corrected2026-06-17

Two independent review rounds applied. The first review rejected the original 2-round numbers (broken CPU-idle gate, unmeasured noise floor, fixed variant order, easy workload only). The corrected harness below fixed all of that. A second 3-specialty review then caught a real statistical error in the interim page — the latency significance test was wrong — and this page applies all six of its fixes: the design-correct paired test, multiple-comparison accounting, an honest one-sided framing of the throughput result, a hypothesis (not fact) framing of the stub effect, latency percentiles, and removal of an unsupported mechanism claim.

TL;DR

Method & how every p-value is computed

Throughput — no resolvable overhead (3 regimes)

Throughput capture cost vs floors (paired test)
Within-binary capture cost (paired test) against the A/A floor (±0.33%) and the cross-binary build-variance floor (~1%). Every bar is inside the floors; no paired p<0.18.
regime (wait profile)stats vs offtrace vs offstub vs vanilla¹
W1 cached point-lookup (CPU-bound)−0.07% (p=.81)−0.55% (p=.25)+0.36%
W2 ClientRead-saturated (−c64)−0.74% (p=.25)−0.34% (p=.56)+0.96%
W3 buffer-miss short-wait (81% DataFileRead @3.9µs, 20M)−0.27% (p=.38)−0.26% (p=.38)+1.21%

No individual capture comparison is statistically significant (all paired p≥0.25). But note the direction is consistent: all six within-binary deltas are negative (capture slightly slower), a 6/6 sign pattern (sign-test two-sided p=0.031). So the honest statement is a one-sided upper bound — overhead is small and positive in direction, ≤0.75%, below our resolving power — rather than “indistinguishable from zero,” which would wrongly imply capture could be free-or-faster with equal likelihood.

¹ Stub caveat (hypothesis, not conclusion). The stub should cost zero, yet it is the fastest of all five variants in all three workloads (P=(1/5)³=0.008 under pure noise; per-workload paired p as low as 0.06). The most likely explanation is binary code-layout variance (the Mytkowicz effect), and it bounds how finely any cross-binary comparison can resolve — but a systematic stub advantage cannot be fully excluded at n=5. Pinning it down would need link-order randomization / LTO-PGO-controlled builds. The clean conclusion does not depend on it: the same-binary OFF/STA/TRA comparison has no such confound.

W3 buffer-miss throughput
W3 (short-wait storm). The stub — which should be zero — is the high outlier; that is the cross-binary variance, not a capture effect.

Latency under headroom — trace adds ~1.5%, stats is free

Re-tested at n=10 (two identical-config cached replicates) with the paired within-round test: trace latency is +5.6µs (+1.5%), positive in 10/10 rounds, p=0.002 — it clears the ~0.004 Bonferroni bar and is the single effect in the whole study that does. stats is latency-neutral (+0.7µs, 4/10, p=0.53). A wait-heavy replication (W5) is directionally consistent (trace 4/5 rounds).
Latency: trace +1.5%, stats neutral

Mechanism — an honest open question

Percentiles say the trace cost is not tail jitter: the median shifts +5µs (mean +5.6µs), p95 +22µs, p99 +27µs — the whole distribution moves. But it is also not a simple per-transaction CPU cost. The +5.6µs is +1.5% of the ~373µs end-to-end latency (dominated by the client round-trip at -c16), but it is ~7.6% of the ~74µs of actual CPU-time per transaction at saturation (16 cores ÷ 217k tps). If trace spent +5.6µs of CPU per transaction, throughput would fall ~7% — yet measured throughput moves <0.55% (and W3's 20M waits/run independently bound any per-wait CPU cost to <0.5µs). A cost that appears in latency-under-headroom but leaves saturated throughput untouched is genuinely puzzling and not explained by this data — the earlier “DSA-ring-write per wait” wording was wrong and is removed. Resolving it needs perf profiling of the trace path; that is future work.

Latency percentiles OFF/STA/TRA
Trace shifts the whole distribution (median through p99); stats overlaps off. 2% pgbench sampling, ~260k transactions/run.

The instrumentation really runs

Live-captured wait histogram
Live mid-run 32-bucket log₂ histogram (ClientRead, one client backend). At W2 peak ~12M client waits across 65 backends; trace ring filled to 131,072 records; W3 drove ~20M sub-4µs DataFileRead waits/run.

Caveats & what's still open

Reproduce / raw material

Hacking Postgres session, 2026-06-17. Charts: matplotlib. Stats: exact paired (within-round) sign-flip permutation tests, Bonferroni-aware. Two independent AI review rounds applied (statistics · PostgreSQL-internals · adversarial-skeptic). Independent measurement, not a vendor claim — numbers and method open for scrutiny.