Demo 1 — Image Strategy: UBI, ubi-micro, multi-stage, LTO, PGO
The full source for this demo lives in examples/demo-01-image-strategy/. Clone the repo, cd into that directory, and run ./demo.sh.
Tutorial sections: §4 Container Strategy + §5 Compile-Time Wins
Builds the same trivial C++23 HTTP service three different ways and compares the results. Adds a Profile-Guided Optimization pass on top of the best variant and measures the additional delta. Every optimization is something you’d actually do in production; the demo just makes the deltas visible.
Why this matters
Image strategy is the lowest-hanging fruit for both performance and security in containerized C++. Every choice (base image, multi-stage boundary, LTO, PGO) compounds across three concerns a production team cares about:
- Registry pull time. A 689 MB image takes ~5× longer to pull on a cold node than a 26 MB one, which directly extends cold-start latency every time the orchestrator schedules a new replica.
- Security surface. The single-stage image ships GCC, ld, the C library headers, and dozens of build dependencies into production. None of those are needed at runtime, and every one of them is a potential CVE vector. Multi-stage drops them entirely.
- Runtime performance. LTO lets the optimizer inline across translation-unit boundaries that a per-TU compile can’t see past; PGO biases branch and code-layout decisions toward measured hot and cold paths. On request-handler hot paths, the combined gain is typically 4-7% for this kind of code: small, real, and free once the build pipeline is in place.
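LTO’s cross-TU inlining needs more than one file to demonstrate, but the hot/cold idea behind PGO fits in a few lines. As a minimal sketch, assuming a hypothetical router with /echo and /healthz endpoints (illustrative names, not taken from the demo’s src/main.cpp), C++20’s [[likely]] and [[unlikely]] attributes are the hand-written version of the branch-weight information PGO measures for you:

```cpp
#include <cassert>
#include <string_view>

// Hypothetical routes for a trivial HTTP service (illustration only).
enum class Route { Health, Echo, NotFound };

// [[likely]] / [[unlikely]] are manual guesses about which branch is hot.
// PGO derives the same kind of information from a measured training run and
// applies it to every branch, not just the ones someone remembered to annotate.
Route route_for(std::string_view path) {
    assert(!path.empty() && path.front() == '/');  // precondition: origin-form path
    if (path == "/echo") [[likely]] {
        return Route::Echo;      // assumed hot path in production traffic
    } else if (path == "/healthz") {
        return Route::Health;    // orchestrator probe, comparatively rare
    } else [[unlikely]] {
        return Route::NotFound;  // assumed cold path
    }
}
```

The catch mirrors a bad training profile: if the annotation is wrong, the compiler lays out the real hot path as cold. PGO replaces these guesses with counts from an actual run.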
§4 and §5 of the tutorial develop the underlying mechanics; this demo makes the numbers visible.
What this demo shows
Three Containerfile strategies for the same C++23 source, plus a PGO pass:
- ubi-multistage: UBI builder + UBI-minimal runtime, multi-stage, LTO on. The recommended production default.
- ubi-micro: UBI builder; the runtime is ubi9/ubi-micro (~30 MB) with libstdc++ statically linked into the binary. The minimum surface area for a typical C++ service.
- single-stage-naive: a single-stage build that ships the toolchain in the runtime image, with no LTO and no multi-stage split. The “what not to do” baseline.
- ubi-multistage + PGO: a two-pass build. Instrument the multi-stage variant, run a synthetic training workload to collect a profile, then rebuild with the profile applied. Shows what PGO buys on top of LTO.
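The two-pass PGO build can be sketched outside the container with plain g++ commands. This assumes GCC’s -fprofile-generate and -fprofile-use flags; the demo’s Containerfile.pgo and CMake presets may spell the steps differently. The handler, the request mix, and the file name trainer.cpp are all hypothetical:

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <string_view>

// Sketch of the PGO workflow on a stand-alone file (hypothetical trainer.cpp =
// this code plus a driver such as: int main() { run_training(1'000'000); }).
//
//   g++ -std=c++23 -O2 -flto -fprofile-generate trainer.cpp -o trainer  # pass 1: instrument
//   ./trainer                                                           # training run: writes .gcda data
//   g++ -std=c++23 -O2 -flto -fprofile-use trainer.cpp -o trainer       # pass 2: rebuild with profile
//
// Keeping the -o name identical in both passes avoids profile file-name mismatches.

// Toy handler: the branch counts recorded during training are what
// -fprofile-use later uses to decide which code is hot and which is cold.
std::uint64_t handle(std::string_view path, std::uint64_t payload) {
    if (path == "/echo") {
        return payload;   // hot in the assumed request mix
    }
    if (path == "/healthz") {
        return 1;         // occasional liveness probe
    }
    return 0;             // unknown path, cold
}

// Replays an assumed 98/1/1 mix of /echo, /healthz and unknown paths.
// This mix is the part that has to resemble real production traffic.
std::uint64_t run_training(std::size_t requests) {
    assert(requests > 0);
    std::uint64_t checksum = 0;
    for (std::size_t i = 0; i < requests; ++i) {
        std::string_view path = (i % 100 == 0) ? "/healthz"
                              : (i % 100 == 1) ? "/missing"
                              : "/echo";
        checksum += handle(path, i);
    }
    return checksum;  // returned so the work is observable, not optimized away
}
```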
Each variant is built and timed; each is then driven with hey to
collect p50/p95/p99 latency under load.
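For reference, a p-th percentile is the latency that p% of requests stayed at or under. A minimal nearest-rank sketch, assuming raw per-request latencies in milliseconds (hey’s own reporting may compute its percentiles differently, so this illustrates the meaning rather than reproducing hey):

```cpp
#include <algorithm>
#include <cassert>
#include <cmath>
#include <cstddef>
#include <vector>

// Nearest-rank percentile: the smallest sample such that at least p% of all
// samples are less than or equal to it. Takes the vector by value so it can sort.
double percentile(std::vector<double> latencies_ms, double p) {
    assert(!latencies_ms.empty() && p > 0.0 && p <= 100.0);
    std::sort(latencies_ms.begin(), latencies_ms.end());
    // 1-based rank; multiply before dividing to keep round percentages exact.
    auto rank = static_cast<std::size_t>(
        std::ceil(p * static_cast<double>(latencies_ms.size()) / 100.0));
    return latencies_ms[rank - 1];
}
```

With only five samples, p99 is simply the maximum; percentiles only separate from each other once a load test has collected many requests.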
How to run
./demo.sh
Expected runtime: 5-10 minutes on a fresh cache, ~1 minute on a
warm cache. The script prints a small comparison table: image
size, build time, and a hey benchmark for each build.
What you’ll see
Representative output on a Fedora 44 host with gcc-toolset-14 and Podman 5.x:
variant               size     build   p50 / p95 / p99 (ms)
single-stage-naive    689 MB   14 s    0.81 / 1.91 / 4.20
ubi-multistage        114 MB   38 s    0.79 / 1.85 / 4.08
ubi-multistage + PGO  114 MB   78 s    0.74 / 1.71 / 3.78
ubi-micro             26 MB    45 s    0.79 / 1.86 / 4.06
How to read the output
The headline numbers — what to look for first:
- 26× size drop from naive single-stage to ubi-micro. Almost all of that is “the toolchain leaving production”.
- ~7% p99 improvement from PGO on top of LTO (4.08 → 3.78 ms), with p50 and p95 improving by a similar 6-8%. Modest in absolute terms but real; on a service taking 10K rps in production, it’s a meaningful shift.
- No measurable p50 difference between ubi-multistage and ubi-micro. The runtime cost of static-vs-dynamic libstdc++ is invisible at this scale.
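The headline ratios can be recomputed directly from the table. A small sketch; the helper names are hypothetical:

```cpp
#include <cassert>

// Fractional improvement of `after` relative to `before` (0.05 means 5% lower).
double relative_improvement(double before, double after) {
    assert(before > 0.0);
    return (before - after) / before;
}

// How many times larger `before_mb` is than `after_mb`.
double size_ratio(double before_mb, double after_mb) {
    assert(after_mb > 0.0);
    return before_mb / after_mb;
}
```

size_ratio(689, 26) returns 26.5, and relative_improvement(4.08, 3.78) returns roughly 0.0735, the p99 gain of the PGO build over the LTO-only multi-stage build.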
A few rules of thumb when you’re reading the table:
- If single-stage-naive is faster than the multi-stage builds, something is wrong with the multi-stage LTO configuration; investigate the build flags rather than declaring multi-stage useless.
- If PGO is slower than the LTO-only build, the training workload was unrepresentative. PGO with a wrong profile can be worse than no PGO at all, because it optimizes the wrong paths as hot and actively pessimizes the real hot path.
- ubi-micro is the better production choice unless you need glibc features the static-libstdc++ build doesn’t pick up (NSS modules, dlopen of glibc-dependent libraries). For a typical C++ service it is worth switching to, once you have tested it against your real workload.
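The two latency rules of thumb are mechanical enough to check in a script after each run. A sketch, assuming the p50/p95/p99 triples from the table have already been parsed into a struct; the type and function names are hypothetical, and only p99 is compared:

```cpp
#include <cassert>
#include <string>
#include <vector>

// p50/p95/p99 latency for one build variant, in milliseconds (hypothetical type).
struct Latency {
    double p50;
    double p95;
    double p99;
};

// Percentiles from a single run can never decrease from p50 to p99.
bool well_formed(const Latency& l) {
    return l.p50 <= l.p95 && l.p95 <= l.p99;
}

// Applies the two rules of thumb to one benchmark run and returns any warnings.
// Only p99 is compared; extending the checks to p50 and p95 is straightforward.
std::vector<std::string> check_results(const Latency& naive,
                                       const Latency& lto,
                                       const Latency& lto_pgo) {
    assert(well_formed(naive) && well_formed(lto) && well_formed(lto_pgo));
    std::vector<std::string> warnings;
    if (naive.p99 < lto.p99) {
        warnings.emplace_back("single-stage-naive beats LTO multi-stage: check the LTO build flags");
    }
    if (lto_pgo.p99 > lto.p99) {
        warnings.emplace_back("PGO slower than LTO alone: training profile is probably unrepresentative");
    }
    return warnings;
}
```

With the representative numbers from the table, check_results returns no warnings.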
Files
- Containerfile.ubi-multistage: preferred default
- Containerfile.ubi-micro: minimal-image variant (UBI-micro runtime, static libstdc++)
- Containerfile.single-stage-naive: anti-pattern baseline
- Containerfile.pgo: instrumented build for PGO step 1
- CMakePresets.json: the three release configurations
- conanfile.txt: pinned deps (httplib for the HTTP side)
- src/main.cpp: the trivial service
- demo.sh: orchestration; runs all three builds, the PGO pass, and hey
Caveats and gotchas
- PGO training workload matters. The training run uses a synthetic hey pattern that may not match your real workload. In production, replay a recorded sample of real traffic (e.g. a packet capture replayed with tcpreplay) against the instrumented binary instead.
- ubi-micro lacks a package manager. If your application calls into NSS, glibc locale data, or anything else that expects the full glibc runtime, the static-libstdc++ ubi-micro build may surprise you at runtime. Test against your real workload before switching.
- Image size measurements include all layers. The numbers here are podman images --format "{{.Size}}" output, which counts every layer, base image included. Squashing layers (podman build --squash) only shrinks that number when later layers delete or overwrite files from earlier ones, and it gives up layer caching.
- PGO build time roughly doubles (38 s → 78 s in the table). The instrumented build, the training run, and the rebuild with the profile applied all happen sequentially. If your CI budget can’t absorb that, do PGO only on release branches.
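Before switching to ubi-micro, a small probe run inside the runtime image can catch missing locale data or failing NSS lookups early. A minimal sketch, assuming you build it with the same toolchain and flags as the service; the function names are hypothetical and en_US.UTF-8 is only an example locale name:

```cpp
#include <cassert>
#include <locale>
#include <stdexcept>
#include <string>

#include <pwd.h>     // getpwuid (resolved through NSS)
#include <unistd.h>  // getuid

// True if the C++ runtime can construct the named locale. On a minimal runtime
// image, locale data beyond "C"/"POSIX" may simply not be installed.
bool locale_available(const std::string& name) {
    assert(!name.empty());  // "" would mean "the environment's locale", not a named one
    try {
        [[maybe_unused]] std::locale loc(name);
        return true;
    } catch (const std::runtime_error&) {
        return false;       // libstdc++ throws for unknown or uninstalled locales
    }
}

// True if the current uid resolves to a passwd entry through NSS.
bool current_user_resolvable() {
    return getpwuid(getuid()) != nullptr;
}
```

Call locale_available("en_US.UTF-8") and current_user_resolvable() from a throwaway main() inside the ubi-micro image. A false result from current_user_resolvable() can also just mean the container runs as a uid with no passwd entry, so read it alongside the image’s /etc/passwd.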
Source materials
This demo deepens material from the project’s bibliography:
- Andrist & Sehr, C++ High Performance 2e, ch. 5 — compile- and link-time optimizations, the LTO / PGO mechanics
- Iglberger, C++ Software Design, ch. 1 — the case for measurable performance work as a design discipline, not a last-mile activity
Linked tutorial sections
- §4 Container Strategy: base image choice (UBI vs ubi-micro vs scratch) and the multi-stage build pattern. This demo’s three Containerfiles are §4’s worked example.
- §5 Compile-Time Wins: what LTO actually does, when PGO is worth the build-time tax, constexpr and the related compile-time tools. The PGO step in this demo is §5’s measurable claim.
- §13 Reproducibility & ABI: image labels and reproducible-image discipline. This demo’s Containerfiles use the labeling pattern §13 develops.