GC-Induced HPA Thrash Cycle
Used after slides 10–11, 21
Why CPU-based HPA and JVM GC are a dangerous combination. The feedback loop that takes 3 pods to 20 with zero real load.
Speaker Notes
The most important diagram in the talk for platform engineers. Shows the six-step domino chain: GC pause → CPU spike → HPA scales out → new pods also GC → HPA scales again → cost explosion.
When to use this diagram
- After slide 10 (GC in Containers) to draw the full thrash chain
- As a Q&A answer when someone asks “why can’t I just use CPU-based HPA?”
- After slide 21 (HPA with JVM-Aware Metrics): the fix is on that slide, the diagram shows the “why”
Opening line
“I want to show you why CPU-based HPA and the JVM are a dangerous combination. This is a loop that I have seen take a cluster from 3 pods to 20 pods with zero actual load increase.”
Walk box by box
① Normal Operation: App running, GC under 200 ms, requests being served.
② GC Pause: GC fires. Stop-the-world halts application threads. GC threads spike CPU to 100% for 50–200 ms.
③ HPA Scales Out: HPA scrapes CPU at 30 s intervals, sees the spike, adds three pods. Wrong response: the spike was GC, not load.
④ New Pods Also GC: Cold pods do JVM bootstrap + JIT warmup. All spike CPU during startup GC. HPA sees sustained high CPU.
⑤ HPA Scales Again: More pods. More GC. More spikes.
⑥ Cost Explosion: 20 pods for a 3-pod workload. On-call engineer has no idea why.
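The six boxes can be replayed as a rough simulation. This is a minimal sketch, not the HPA controller itself: it uses the replica formula from the Kubernetes HPA documentation (desired = ceil(current × metric / target), with the default 10% tolerance) and assumes, as a deliberate simplification of boxes ② and ④, that every scrape lands on GC or cold-start CPU and reads 100%. The 50% CPU target, the 300 RPS total load, the 100 RPS per-pod target, and the 20-pod ceiling are illustrative values, not taken from a real cluster.

```python
import math


def desired_replicas(current_replicas, current_value, target_value, tolerance=0.1):
    """Replica formula from the Kubernetes HPA docs, with the default tolerance."""
    ratio = current_value / target_value
    if abs(ratio - 1.0) <= tolerance:
        return current_replicas  # close enough to target: no scaling
    return math.ceil(current_replicas * ratio)


def simulate(read_metric, target, syncs, start=3, max_replicas=20):
    """Replay `syncs` HPA evaluations; read_metric(replicas) is the per-pod average."""
    replicas = start
    history = [replicas]
    for _ in range(syncs):
        wanted = desired_replicas(replicas, read_metric(replicas), target)
        replicas = min(max_replicas, wanted)
        history.append(replicas)
    return history


TOTAL_RPS = 300.0  # assumed constant real load: "zero actual load increase"

# Assumption: each scrape catches GC / startup CPU at 100% against a 50% target.
cpu_history = simulate(lambda replicas: 100.0, target=50.0, syncs=3)

# Per-pod RPS only moves with real traffic, so the ratio never exceeds 1.
rps_history = simulate(lambda replicas: TOTAL_RPS / replicas, target=100.0, syncs=3)

print(cpu_history)  # [3, 6, 12, 20]
print(rps_history)  # [3, 3, 3, 3]
```

Under these assumptions the CPU-driven run adds three pods on the first evaluation and hits the 20-pod ceiling on the third, while the RPS-driven run never moves.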
The fix
“Two things together solve this. Change the HPA metric to RPS, which is completely unaffected by GC CPU spikes. Add stabilizationWindowSeconds: 120 on scaleUp, which is longer than any normal GC pause. Together these make HPA work correctly for Java.”
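To see why the window helps, here is a simplified model of scale-up stabilization. It is a sketch under stated assumptions, not the controller's code: the HPA is modeled as acting on the lowest recommendation seen inside the window, scale-down is ignored, and the 30-second evaluation period is taken from the walk-through above. The function name replay and the sample recommendation lists are hypothetical.

```python
SYNC_PERIOD = 30  # seconds between evaluations (assumption, from the walk-through)


def replay(raw_recommendations, window_seconds, start_replicas=3):
    """Apply a scale-up stabilization window to a series of raw recommendations.

    Simplified model: the replica count only rises to the minimum
    recommendation seen inside the window. Scale-down is not modeled.
    """
    replicas = start_replicas
    seen = []      # (timestamp, recommendation) pairs from earlier evaluations
    history = []
    for i, rec in enumerate(raw_recommendations):
        now = i * SYNC_PERIOD
        in_window = [r for t, r in seen if now - t < window_seconds]
        candidate = min([rec] + in_window)  # current recommendation always counts
        replicas = max(replicas, candidate)
        seen.append((now, rec))
        history.append(replicas)
    return history


# One GC-inflated reading among otherwise calm evaluations.
spike = [3, 3, 3, 3, 6, 3, 3, 3]
# A real, sustained need for 6 pods starting at the third evaluation.
sustained = [3, 3, 6, 6, 6, 6, 6, 6, 6, 6]

print(replay(spike, window_seconds=0))        # [3, 3, 3, 3, 6, 6, 6, 6]
print(replay(spike, window_seconds=120))      # [3, 3, 3, 3, 3, 3, 3, 3]
print(replay(sustained, window_seconds=120))  # [3, 3, 3, 3, 3, 6, 6, 6, 6, 6]
```

In this model the isolated spike never triggers a scale-out with a 120-second window, while sustained demand still does, only later. That delay is the trade-off to mention if someone asks about it in Q&A.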