Demo 02 — STL & Layout Under Memory Pressure
Compares four key-value container designs on two operations, run once unconstrained and once under a cgroup memory cap. The takeaway: data layout determines cache behavior, and at production scales cache locality usually beats algorithmic…
The full source for this demo lives in
examples/demo-02-stl-layout/— clone the repo,cdin, and./demo.sh.
Tutorial section: §6 STL, Layout, and C++20/23 Containers
Compares four key-value container designs on two operations, run once unconstrained and once under a cgroup memory cap. The takeaway: data layout determines cache behavior, and at production scales cache locality usually beats algorithmic complexity.
Why this matters
Most C++ services use std::unordered_map by default — its O(1)
average is the rule-of-thumb most engineers reach for. But “O(1)
average” hides the per-element node allocation, the pointer
indirection on every lookup, and the cache miss that happens because
the nodes are scattered across the heap. At small N, this doesn’t
matter; the working set fits in L1. At large N, the node-based
container is paying constant cache-miss costs that the contiguous
alternatives don’t.
C++20 added std::flat_map and std::flat_set to the standard
specifically because this lesson has been learned over and over in
production: when N is large enough to overflow L2, a sorted vector
with binary search outperforms a hash table even though the
theoretical complexity is worse. The constant factor for cache
locality is roughly 10-100×, which buys a lot of log(N).
Add memory pressure — a cgroup memory.max that forces the kernel
to evict pages your container had warm — and the gap widens
dramatically. Node-based containers fault their scattered pages
back in repeatedly; contiguous containers stream from disk
sequentially and stay fast.
§6 of the tutorial develops the layout story; this demo measures it.
What this demo shows
Four container designs benchmarked at four sizes (64, 1024, 16384, 262144) on two operations (point lookup, full iteration), run twice — once unconstrained, once under a 128 MB cgroup memory cap:
| Container | Layout | Per-element allocation |
|---|---|---|
std::unordered_map<K,V> |
hash table | one allocation per insert |
std::map<K,V> |
red-black tree | one allocation per insert |
boost::container::flat_map<K,V> |
sorted vector | bulk reallocation on growth |
std::vector<pair<K,V>> + linear scan |
contiguous unsorted | bulk reallocation on growth |
Operations:
- Lookup: 1,000 hits per iteration. Even node-based containers do well at small N because the working set fits in L1/L2.
- Iterate-and-sum: walk every entry and accumulate a payload field. Cache locality dominates; this is where contiguous layouts pull dramatically ahead at large N.
The 262144 size is where the 128 MB cgroup cap in the pressured run starts to bite for node-based containers — that’s the test case the demo is built around.
How to run
./demo.sh
First build is ~3-5 minutes (Conan pulls boost + Google Benchmark from Conan Center, both pre-built for our profile in the common case). Subsequent runs hit the podman layer cache and complete in ~30 seconds for both phases.
Outputs:
results-baseline.json— unconstrained runresults-pressured.json—podman run --memory=128m --memory-swap=128mrun- A side-by-side table on stdout, comparing median real_time per (benchmark, size) pair, with a pressure-ratio column
What you’ll see
Representative output at N=262144 on the iterate benchmarks (microseconds per iteration, median of 3 runs):
Container baseline pressured ratio
boost::container::flat_map 911 µs 940 µs 1.03×
std::vector<pair> (linear scan) 920 µs 948 µs 1.03×
std::unordered_map 2,309 µs 5,840 µs 2.53×
std::map (RB tree) 32,000 µs 210,000 µs 6.56×
How to read the output
At N=262144 on iterate-and-sum:
flat_mapandvectorlinear-scan finish in roughly the same time. Both are contiguous; the hardware prefetcher feeds them at memory bandwidth.unordered_mapis roughly 2.5× slower than the contiguous options. Every node is a separate cache miss.std::mapis an order of magnitude slower. RB-tree traversal is both node-based and branch-heavy; the branch predictor can’t help.
Under pressure, the ratio column tells the story:
- Ratio close to 1.0× means the container’s layout is friendly to the cgroup — contiguous pages stream back in cleanly.
- Ratios of 2-10× mean the kernel is evicting pages the container then has to fault back in. Every cache miss becomes a page fault, every page fault becomes a syscall, and the whole operation slows by orders of magnitude.
You can read the JSON output directly:
jq '.benchmarks[] | select(.aggregate_name == "median")' \
results-baseline.json
Each benchmark function reports real_time (wall clock) and
cpu_time (busy CPU). With --benchmark_repetitions=3 (set in
the Containerfile), Google Benchmark adds aggregate entries with
aggregate_name set to mean, median, stddev.
Caveats and gotchas
- Hash function matters.
std::unordered_mapwith a poor hash can be much slower than these numbers suggest — and a great hash can mask some of the cache-miss cost. The benchmark usesstd::hash<int>which is identity on most platforms; if your keys are strings, the picture changes. - Insert order matters for
flat_map. Inserting in sorted order is O(N); inserting in random order is O(N²) because every insert shifts a tail. The demo inserts in sorted order; if your workload doesn’t, measure separately. memory.maxvsmemory.high. The demo uses--memory=128mwhich setsmemory.max(hard limit). Settingmemory.highinstead would soften the kernel’s response — pages get reclaimed proactively but the process doesn’t OOM. See §11 for the difference.- The benchmark is synthetic. Real workloads mix container operations with other work; the working-set residency calculus changes. Treat the relative ordering as reliable; treat the absolute numbers as upper-bound for your real workload.
Source materials
This demo deepens material from the project’s bibliography:
- Andrist & Sehr, C++ High Performance 2e, ch. 6 — CPU and memory architecture; the cache-locality argument in detail
- Iglberger, C++ Software Design, ch. 7 — Bridge / PIMPL and value-based containers; the design-level case for contiguous layouts
- Enberg, Latency, ch. 4 — the cache-miss-as-syscall framing that motivates this whole demo
Linked tutorial sections
- §3 RAII & Container Resource Discipline
— the per-element allocation count for node-based containers is
a real cost paid at insert time and unwound at destruction time.
Owning a million
std::mapentries means a million destructor calls. - §6 STL, Layout, and C++20/23 Containers — this demo. Cache locality > algorithmic complexity at the scales where most applications operate.
- §7 Memory Management — allocator choice matters; even a custom allocator can’t beat layout. Use both.
- §11 Noisy Neighbor Isolation — cgroup memory pressure isn’t theoretical on a shared host — your noisy neighbor’s working set is what causes the kernel to evict your pages.