wait_event_timing
Benchmark & Testing Design
Methodology for benchmarking Dmitry Fomin's Oracle-style wait-event instrumentation patch series.
📊 Corrected benchmark results (v3) → A/A floor, randomized order, paired stats, stub rung; 3 throughput regimes + latency (n=10) + percentiles.
1. Feature under test
wait_event_timing (WET) brings Oracle-style wait-event instrumentation to
PostgreSQL. Each time a backend enters or exits a wait event, a small record is written.
Aggregates and per-session traces are exposed through new catalog views and functions.
Patch series
7 reviewable patches in PR DmitryNFomin/postgres #2.
The series applies cleanly onto upstream master (89eafad297a);
series HEAD is 983f0ec on branch wet-series.
Capture levels
Controlled by the GUC wait_event_capture:
| Level | What is recorded | Storage |
|---|---|---|
off |
Nothing — feature compiled in but dormant. | — |
stats |
Per-(backend, event) count, total duration, max duration, plus a 32-bucket log₂ histogram of individual wait durations. | Shared memory (one slot per backend per event type). |
trace |
Everything in stats, plus a per-session DSA ring buffer of individual wait entries. Query-attribution markers (ExecStart/ExecEnd, QueryStart/QueryEnd) are interleaved to correlate waits with SQL statements. |
DSA ring (per session) + shared memory aggregates. |
Compile-time gate
The feature is guarded by a compile-time flag:
# meson (preferred)
meson setup build -Dwait_event_timing=true ...
# autoconf equivalent
./configure --enable-wait-event-timing ...
Without the flag the instrumentation points are #ifdef'd out entirely.
With the flag but wait_event_capture=off,
the GUC is present and the code paths compile in, but no data is written at runtime.
Key new catalog objects
pg_stat_wait_event_timing— view exposing per-(backend, event) aggregates.- DSA-backed ring functions for trace retrieval.
2. Two cost concepts (keep them distinct)
Conflating these is the most common mistake in instrumentation benchmarks. They must be measured and reported separately.
Static / baseline overhead
The constant cost of having the feature compiled in and/or enabled, independent of whether anyone reads or resets the data. Sources include:
- Branch prediction misses on the inline
#ifdef/if (capture_level)gate in every backend loop. - Additional shared-memory footprint touching more cache lines.
- Per-wait-event entry writes into shared-memory aggregates or DSA ring under
stats/trace.
This is measured by comparing TPS/latency across the measurement ladder rungs (see Section 3) at a fixed, low observer frequency (or zero reads from outside).
Observer effect
The dynamic measurement error introduced by the act of monitoring itself. It grows with:
- Sampling/probing frequency — how often an external agent reads the views.
- Reset operations — clearing the stats table (e.g., calling the reset function analogous to
pg_stat_reset()) under load acquires a lock on the shared hash table; every backend that tries to re-register a wait event must contend for that lock. Higher reset frequency ↠ worse observer effect.
pg_stat_statements under a heavy write workload is not free: clearing
the shared hash forces every backend that next executes a query to re-register its entry,
all contending on the same spinlock/LWLock. The same dynamic applies to
wait_event_timing stats. A benchmark that resets stats frequently will
inflate overhead numbers beyond what a typical production observer would see.
The observer-effect microbenchmark (a future pass) will sweep read/reset frequency as an independent variable while holding workload constant.
3. Measurement ladder
Each rung isolates one additional source of cost. Deltas between consecutive rungs attribute cost to specific causes. All rungs run on the same hardware against the same workload within the same benchmark session (interleaved rounds — see Section 7).
-
1
Vanilla Postgres (upstream master, no patch applied) — the true baseline. Every other rung is compared to this.Δ = reference
-
2
Patched, compiled withoutΔ(1→2) source-change cost
--enable-wait-event-timing(stub build) — the patch is applied but all instrumentation is#ifdef'd out. Should be statistically indistinguishable from rung 1. Any delta here reveals a source-code-change cost (e.g., changed hot-path data structures, new function calls outside the guard). -
3
Patched + compiled with flag,Δ(2→3) compiled-in gate cost
capture=off— the feature is compiled in and the GUC is present, but no wait events are recorded. Delta over rung 2 is the cost of the compiled-in inline gate (branch, possible cache-line touch). -
4
Δ(3→4) stats capture cost
capture=stats— per-backend shared-memory aggregates are maintained (counts, durations, histogram buckets updated on every wait-event transition). -
5
Δ(4→5) trace capture cost
capture=trace— DSA ring writes per event plus query-attribution markers. Heaviest mode.
A further axis — observer read/reset frequency — is orthogonal to the rungs and will be swept separately in the observer-effect microbenchmark.
4. Validity gate
--enable-wait-event-timing still exposes
the wait_event_capture GUC, but silently downgrades any non-off
value back to off. If this is not caught, a benchmark comparing
off, stats, and trace against the un-instrumented
build will produce a false ~0% overhead — all three levels are actually identical.
Mandatory pre-benchmark check
Before accepting any numbers from a build, run this gate. Failure means the build is mis-configured and the run must be discarded.
These are the exact commands run by /root/go.sh before every benchmark session:
# Check 1: stats level must stick
WET=/opt/pg-wet/bin
su - pgbench -c "$WET/psql -p 5441 -d postgres -tAc \
\"SET wait_event_capture=stats; SELECT current_setting('wait_event_capture');\""
# Expected output: stats
# If output is 'off': build is NOT instrumented — ABORT
# Check 2: trace level must stick
su - pgbench -c "$WET/psql -p 5441 -d postgres -tAc \
\"SET wait_event_capture=trace; SELECT current_setting('wait_event_capture');\""
# Expected output: trace
# Check 3: the view must exist and be queryable
su - pgbench -c "$WET/psql -p 5441 -d postgres -tAc \
\"SELECT count(*) FROM pg_stat_wait_event_timing;\""
# Expected output: a number (0 or more) — not an error
# go.sh only proceeds if STATS='stats' AND TRACE='trace'; otherwise it exits 1
# and writes: "NOT VERIFIED — will NOT produce misleading numbers."
Lesson
Any instrumentation benchmark must assert that the feature is active, not merely that the GUC is settable. A GUC that silently no-ops is a common footgun in compile-gated features.
5. Workloads
All three workloads use pgbench -S (SELECT-only). Read-write would add
WAL/lock noise to the overhead signal; SELECT-only is sufficient to exercise the
wait-event instrumentation path cleanly. Uniform pgbench flags across all runs:
-c 16 -j 16 -T 180 -P 15 -r -n
(clients=jobs=cores; 3-minute runs for stable numbers per Dmitry; 15s progress
intervals to confirm steady-state; per-statement latency report; no vacuum).
| ID | Description | pgbench command | Bottleneck / wait profile | Headline metric |
|---|---|---|---|---|
| W1 | Read-only, fully cached | pgbench -S -c 16 -j 16 -T 180 -P 15 -r -nscale 100, shared_buffers=8GB(data ~1.6 GB, fits in buffer pool) |
CPU-bound; very few waits per transaction. | TPS. Measures the pure inline-gate cost — waits rarely fire, isolating constant overhead of the instrumented code path. |
| W2 | Wait-saturated (key test) | pgbench -S -c 16 -j 16 -T 180 -P 15 -r -nscale 400, shared_buffers=256MB(data ~6.4 GB ≫ SB) |
Constant buffer-mapping / I/O / LWLock waits. Working set far exceeds buffer pool — maximum wait-event transition rate. | TPS + wait-event proof (top events by count, total transitions). Per-event cost of stats/trace is most visible here. |
| W3 | Latency at 50% capacity | pgbench -S -c 16 -j 16 -T 180 -P 15 -r -n -R <rate>Same setup as W2 (scale 400, shared_buffers=256MB)rate = 50% of W2 baseline median TPS |
Regular load-testing (~50% capacity, not stress-testing) in the same wait-heavy environment as W2. Server is not saturated — overhead surfaces as added latency per call rather than TPS drop. Production-realistic load level. | Latency (avg, p95, p99). CPU-headroom check (vmstat idle%) required to confirm the server is genuinely at 50% capacity. Rate derived from W2 results. |
W2 proof obligation
Before trusting W2 numbers, confirm the workload is genuinely wait-bound (not CPU-bound). Run these while pgbench is active:
-- Top wait events by transition count:
SELECT wait_event_type, wait_event,
sum(count) AS total_count,
sum(total_wait_time_us) / 1000.0 AS total_wait_ms
FROM pg_stat_wait_event_timing
GROUP BY wait_event_type, wait_event
ORDER BY total_count DESC
LIMIT 10;
-- Mid-run snapshot of what backends are waiting on:
SELECT coalesce(wait_event_type, 'CPU') AS type,
coalesce(wait_event, 'running') AS event,
count(*)
FROM pg_stat_activity
WHERE state = 'active'
GROUP BY 1, 2
ORDER BY 3 DESC
LIMIT 8;
-- Confirm CPU is not maxed out (high idle% = IO/wait bound):
vmstat 1 5
If vmstat shows CPU idle near 0% and wait-event counts are low, the
workload is CPU-bound and W2 is not exercising the instrumentation path — reduce
shared_buffers or increase scale until buffer/IO waits dominate.
6. Load modes — why two perspectives
W1 and W2 run at full concurrency (stress mode); W3 runs rate-limited at ~50% capacity. Both perspectives are necessary.
Stress mode — W1 and W2
The system is driven to the bottleneck resource (CPU for W1, I/O/buffer for W2). Metric: TPS. Overhead from the feature shows as reduced throughput.
Caveat: when the bottleneck is I/O, additional CPU work from the instrumentation may hide behind I/O wait time — overhead can look smaller than it is in a CPU-saturated scenario. W1 (CPU-bound) and W2 (IO/wait-bound) together expose overhead under both bottleneck types.
Rate-limited mode — W3
Run the same wait-heavy environment as W2 but hold TPS at ~50% of the W2 baseline
via pgbench -R <rate>. The server is not saturated, so the
feature's per-wait-event cost surfaces as added latency per transaction
rather than reduced throughput. Metric: avg / p95 / p99 latency.
This is more production-realistic: most production systems run well below saturation. It also eliminates the CPU-masking effect and reduces pgbench co-location contention (pgbench and Postgres share the same 16 cores; at full load pgbench competes with the server).
-R to 50% of that value. The same rate is used for all four W3 variants
(baseline / off / stats / trace) so latency differences are directly comparable.
7. Run discipline
Interleaving (cancels temporal drift)
Within each round, cycle through all variants in sequence before moving to the next round. Do not run all trials for one variant then all for another — thermal and memory-pressure drift will bias inter-variant comparisons.
for round in 1 2 3; do
for variant in baseline off stats trace; do
pgbench -S -c 16 -j 16 -T 180 -P 15 -r -n postgres
done
done
Run parameters
- Duration: 180 seconds (3 minutes) per run — chosen by Dmitry for stable numbers.
- Rounds: 3 rounds per (workload × variant).
- Warmup: discard round 1; report on rounds 2–3.
- Progress reporting:
-P 15(every 15 s) to confirm the run has reached steady-state before accepting the result. - Latency report:
-rgives per-statement avg latency in the pgbench summary — the primary latency metric for W3. - Summary statistics: median TPS (W1, W2); avg / p95 / p99 latency (W3).
--log(-l) is intentionally NOT used — per-transaction file logging is itself an observer-effect source (one write per transaction to disk) and would inflate overhead numbers unless the log lives on a RAM disk.
Between-run resets
To isolate static overhead from observer effect, do not reset
pg_stat_wait_event_timing between variants during the primary benchmark.
A separate reset-sweep pass will quantify the observer effect independently.
8. Hardware & build
Hardware
| Parameter | Value |
|---|---|
| Provider | Hetzner Cloud |
| Instance type | CCX43 |
| CPU | AMD EPYC Milan, 16 dedicated vCPU |
| RAM | 64 GB |
| OS | Ubuntu 24.04 LTS |
Build
| Parameter | Value |
|---|---|
| Compiler | gcc (ccache wrapper), -O2 |
| Build system | autoconf (./configure) |
| Upstream base | 89eafad297a (master) |
| Patch branch | wet-series |
| Series HEAD | 983f0ec |
| Instrumented flag | -Dwait_event_timing=true (rungs 3–5); omitted for rung 2 |
Build commands (actual — autoconf)
The benchmark uses autoconf (./configure), not meson.
Two separate installs are maintained side by side.
# Rung 1: vanilla baseline (upstream master, no patch)
# installed at /opt/pg-baseline port 5440
./configure --prefix=/opt/pg-baseline CC="ccache gcc" CFLAGS="-O2"
make -j16
make install
# Rung 2: stub build (patch applied, flag ABSENT) — PLANNED, not yet built
# same configure line as vanilla but on wet-series branch
./configure --prefix=/opt/pg-stub CC="ccache gcc" CFLAGS="-O2"
make -j16
make install
# Rungs 3-5: instrumented build (patch + flag present)
# installed at /opt/pg-wet port 5441
# Source: /root/rebuild_wet.sh
cd /root/pg-src
git checkout -f wet-series
git clean -fdx -q
./configure --prefix=/opt/pg-wet --enable-wait-event-timing \
CC="ccache gcc" CFLAGS="-O2"
make -j16
make install
# Sanity check after install: flag must appear in config.log
grep -- '--enable-wait-event-timing' config.log
postgresql.conf settings (both clusters)
Appended by /root/setup_clusters.sh and /root/go.sh:
# Baseline cluster (/data/pg-baseline, port 5440)
port = 5440
max_connections = 300
max_wal_size = 8GB
track_activities = on # required for query attribution in trace mode
compute_query_id = on # required for query attribution in trace mode
# shared_buffers set per-workload at runtime (sed replacement in run_bench.sh)
# fsync left at default (on) — workload is SELECT-only, no writes
# Wet cluster (/data/pg-wet, port 5441)
port = 5441
max_connections = 300
max_wal_size = 8GB
track_activities = on # required for query attribution in trace mode
compute_query_id = on # required for query attribution in trace mode
wait_event_capture = off # changed to stats/trace per-run by run_bench.sh
# shared_buffers set per-workload at runtime
# fsync left at default (on) — workload is SELECT-only, no writes
9. Results (populated on completion, ~2h)
/root/bench.log when the benchmark run finishes (~2h).
Do not treat any cell as final until the run completes and the validity gate (Section 4)
has been passed for each build.
W1 — Read-only cached (stress, TPS)
Config: scale=100, shared_buffers=8GB, pgbench -S -c 16 -j 16 -T 180 -P 15 -r -n. Metric: median TPS of rounds 2–3.
| Variant | Median TPS | %Δ vs baseline | %Δ vs off |
|---|---|---|---|
| 1 — vanilla (baseline) | — | reference | — |
| 3 — off | — | — | reference |
| 4 — stats | — | — | — |
| 5 — trace | — | — | — |
W2 — Wait-saturated (stress, TPS)
Config: scale=400, shared_buffers=256MB, pgbench -S -c 16 -j 16 -T 180 -P 15 -r -n. Metric: median TPS of rounds 2–3.
| Variant | Median TPS | %Δ vs baseline | %Δ vs off |
|---|---|---|---|
| 1 — vanilla (baseline) | — | reference | — |
| 3 — off | — | — | reference |
| 4 — stats | — | — | — |
| 5 — trace | — | — | — |
W2 — Wait-event proof
Collected during a dedicated W2 stats run. Confirms workload is wait-bound, not CPU-bound.
| wait_event_type | wait_event | total_count | total_wait_ms |
|---|---|---|---|
| — populated from pg_stat_wait_event_timing query — | |||
Total wait transitions in 180s: — (~— /sec). CPU idle% (vmstat): —%.
W3 — Latency at 50% capacity
Config: same as W2 but rate-limited with -R <50% of W2 baseline TPS>. Rate used: — TPS. Metric: avg / p95 / p99 latency.
| Variant | Avg lat (ms) | p95 lat (ms) | p99 lat (ms) | %Δ avg vs baseline | %Δ avg vs off |
|---|---|---|---|---|---|
| 1 — vanilla (baseline) | — | — | — | reference | — |
| 3 — off | — | — | — | — | reference |
| 4 — stats | — | — | — | — | — |
| 5 — trace | — | — | — | — | — |
Source: /root/bench.log on the CCX43 benchmark host.
Results will be posted to the PR thread once all runs complete.
Rung 2 (stub build — patched without --enable-wait-event-timing) is planned but not yet built; its row will be added once available.
10. Exact commands per step
These are verbatim invocations extracted from the scripts on the benchmark host
(23.88.42.150). Scripts: /root/rebuild_wet.sh,
/root/setup_clusters.sh, /root/go.sh,
/root/run_bench.sh.
Step 0 — Data initialization
Run once per workload tier before the interleaved benchmark rounds. W3 reuses the W2 dataset (same scale=400, same shared_buffers=256MB) — no separate init required.
BASELINE_BIN=/opt/pg-baseline/bin
WET_BIN=/opt/pg-wet/bin
# W1: scale=100, run with shared_buffers=8GB
$BASELINE_BIN/pgbench -p 5440 -i -s 100 --no-vacuum postgres
$WET_BIN/pgbench -p 5441 -i -s 100 --no-vacuum postgres
# W2 + W3: scale=400, run with shared_buffers=256MB
# (W3 rate-limited uses the same dataset; no separate init)
$BASELINE_BIN/pgbench -p 5440 -i -s 400 --no-vacuum postgres
$WET_BIN/pgbench -p 5441 -i -s 400 --no-vacuum postgres
Step 1 — Per-round startup sequence
For each round and each variant, run_bench.sh stops all clusters,
edits postgresql.conf via Python regex to set shared_buffers
and wait_event_capture, then starts the appropriate cluster:
# Set shared_buffers and wait_event_capture in postgresql.conf at runtime:
python3 -c "
import re
with open('/data/pg-wet/postgresql.conf', 'r') as f:
content = f.read()
content = re.sub(r'^#?shared_buffers\s*=.*',
\"shared_buffers = '8GB'\", content, flags=re.MULTILINE)
content = re.sub(r'^#?wait_event_capture\s*=.*',
\"wait_event_capture = 'stats'\", content, flags=re.MULTILINE)
with open('/data/pg-wet/postgresql.conf', 'w') as f:
f.write(content)
"
# Start cluster and confirm settings:
su - pgbench -c "/opt/pg-wet/bin/pg_ctl -D /data/pg-wet -l /tmp/wet.log start"
su - pgbench -c "/opt/pg-wet/bin/psql -p 5441 -t \
-c 'SHOW wait_event_capture;' postgres"
Step 2 — pgbench measured-run command (canonical)
The measured run command is identical for every workload and every variant. It is defined once here; all workload descriptions reference it.
# Canonical measured-run command (W1, W2, W3 — all variants):
pgbench -S -c 16 -j 16 -T 180 -P 15 -r -n postgres
# Run via su - pgbench against the appropriate port:
# baseline cluster → -p 5440 (/opt/pg-baseline/bin/pgbench)
# wet cluster → -p 5441 (/opt/pg-wet/bin/pgbench)
# shared_buffers and wait_event_capture are set in postgresql.conf before each start.
Flag rationale:
-S— SELECT-only script (all three workloads).-c 16 -j 16— clients = worker threads = vCPUs; avoids pgbench oversubscribing the 16 dedicated cores when co-located with Postgres.-T 180— 3-minute run duration for stable numbers (per Dmitry).-P 15— progress report every 15 s; used to confirm steady-state before accepting results.-r— per-statement latency statistics in the summary; the primary metric for W3.-n— skip vacuum.--log/-lintentionally omitted — per-transaction file logging is itself an observer-effect source (one write per transaction) and would inflate overhead numbers unless the log lives on a RAM disk.
The only two justified additions to the canonical command:
-
W3 only: append
-R <rate>whererate= 50% of the W2 baseline median TPS (derived after W2 stress runs complete).pgbench -S -c 16 -j 16 -T 180 -P 15 -r -n -R $RATE postgres -
W2 diagnostic pass only (not a measured run): a separate short run
under
capture=statsto collect wait-event evidence. See Step 4 below. This run is not included in the TPS/latency results.
Step 3 — capture level changes between variants
wait_event_capture is changed by rewriting postgresql.conf and
doing a full stop/start cycle — not a live ALTER SYSTEM. This guarantees
a clean server state and eliminates residual DSA ring state from a prior trace
run.
Each variant is a full stop/start cycle of the appropriate cluster with the target
wait_event_capture value written to postgresql.conf before
start. The measured-run command (see Step 2) is identical for every variant;
only the port and cluster binary differ.
# Sequence within one round — W2 example (same pattern for W1 and W3):
# 1. baseline (separate binary, /opt/pg-baseline, port 5440)
# set shared_buffers=256MB on baseline cluster, start
su - pgbench -c "/opt/pg-baseline/bin/pgbench \
-p 5440 -S -c 16 -j 16 -T 180 -P 15 -r -n postgres"
# 2. off (wet binary, /opt/pg-wet, port 5441)
# set shared_buffers=256MB + wait_event_capture=off, start wet cluster
su - pgbench -c "/opt/pg-wet/bin/pgbench \
-p 5441 -S -c 16 -j 16 -T 180 -P 15 -r -n postgres"
# 3. stats (set wait_event_capture=stats, restart)
su - pgbench -c "/opt/pg-wet/bin/pgbench \
-p 5441 -S -c 16 -j 16 -T 180 -P 15 -r -n postgres"
# 4. trace (set wait_event_capture=trace, restart)
su - pgbench -c "/opt/pg-wet/bin/pgbench \
-p 5441 -S -c 16 -j 16 -T 180 -P 15 -r -n postgres"
Step 4 — W2 wait-event proof queries (diagnostic pass, not a measured run)
Run once, separately from the interleaved rounds, with capture=stats.
Purpose: confirm the W2 workload is genuinely wait-bound. Results go into the
"W2 wait-event proof" table in Section 9; they are not included in TPS results.
# Reset stats counters before the diagnostic run:
su - pgbench -c "/opt/pg-wet/bin/psql -p 5441 \
-c 'SELECT pg_stat_reset_wait_event_timing_all();' postgres"
# Start a 180s diagnostic load run in background, sample midway:
# (same canonical command as measured runs — no --log)
su - pgbench -c "/opt/pg-wet/bin/pgbench \
-p 5441 -S -c 16 -j 16 -T 180 -P 15 -r -n postgres" &
sleep 90
# CPU idle% (high idle% = IO/wait bound, not CPU-bound):
vmstat 1 5
# Active waits from pg_stat_activity:
/opt/pg-wet/bin/psql -p 5441 -t postgres -c "
SELECT coalesce(wait_event_type,'CPU') AS type,
coalesce(wait_event,'running') AS event,
count(*)
FROM pg_stat_activity
WHERE state = 'active'
GROUP BY 1, 2
ORDER BY 3 DESC
LIMIT 8;"
wait # let pgbench finish
# Aggregate from pg_stat_wait_event_timing:
/opt/pg-wet/bin/psql -p 5441 postgres -c "
SELECT wait_event_type, wait_event,
sum(count) AS total_count,
sum(total_wait_time_us) / 1000.0 AS total_wait_ms
FROM pg_stat_wait_event_timing
GROUP BY wait_event_type, wait_event
ORDER BY total_count DESC
LIMIT 10;"
# Total wait transitions in the 180s run:
/opt/pg-wet/bin/psql -p 5441 -t postgres -c \
"SELECT sum(count) FROM pg_stat_wait_event_timing;"
Open questions / next steps
-
Stub-build rung (rung 2) — the patched binary compiled
without
--enable-wait-event-timingis not yet built. Expected delta vs vanilla: ~0%; any non-trivial delta would reveal a hot-path structural change in the patch outside the compile guard. - W3 CPU-headroom check — sample CPU (vmstat idle%) during the rate-limited W3 run to confirm the server is genuinely not saturated. This proves W3 is regular load-testing at 50% capacity, distinct from W2 stress.
- Reset/observer-effect microbenchmark — sweep reset frequency (0, 1/s, 10/s) as an independent variable while holding the W2 stress workload constant. Isolates observer-effect cost (locking, re-registration) from the static overhead measured in the primary runs.
-
Iteration 2 — pg_stat_statements loaded — add
shared_preload_libraries = 'pg_stat_statements'andCREATE EXTENSION pg_stat_statementson both clusters to introduce realistic shared-hash / LWLock contention (more wait events in flight). Protocol: resetpg_stat_statementsbefore each run; cross-check itscallscount (should equal ≈ TPS × 180) to confirm the workload ran as expected; dump top queries alongside the wait-event summary. -
W4 (future) — read-side observer effect — while one pgbench
drives the SELECT-only workload, a second client polls
pg_stat_wait_event_timingat increasing frequency (0, 1 Hz, 10 Hz). Measures how reading the instrumentation perturbs the workload being measured — the instrumentation creates the problem it is trying to observe. - DSA ring sizing — what happens when the trace ring overflows under very high wait-event rates? Is overflow behavior documented and tested?