GC-Induced HPA Thrash Cycle

Used after slide 10–11, 21

Why CPU-based HPA and JVM GC are a dangerous combination. The feedback loop that takes 3 pods to 20 with zero real load.

Speaker Notes

The most important diagram in the talk for platform engineers. Shows the six-step domino chain: GC pause → CPU spike → HPA scales out → new pods also GC → HPA scales again → cost explosion.

When to use this diagram

  • After slide 10 (GC in Containers) to draw the full thrash chain
  • As a Q&A answer when someone asks “why can’t I just use CPU-based HPA?”
  • After slide 21 (HPA with JVM-Aware Metrics): the fix is on that slide; the diagram shows the “why”

Opening line

“I want to show you why CPU-based HPA and the JVM are a dangerous combination. This is a loop that I have seen take a cluster from 3 pods to 20 pods with zero actual load increase.”

Walk box by box

① Normal Operation: App running, GC under 200ms, requests being served.

② GC Pause: GC fires. Stop-the-world halts application threads. GC threads spike CPU to 100% for 50–200ms.

③ HPA Scales Out: HPA scrapes CPU at 30s intervals, sees the spike, adds three pods. Wrong response: the spike was GC, not load.

④ New Pods Also GC: Cold pods do JVM bootstrap + JIT warmup. All spike CPU during startup GC. HPA sees sustained high CPU.

⑤ HPA Scales Again: More pods. More GC. More spikes.

⑥ Cost Explosion: 20 pods for a 3-pod workload. On-call engineer has no idea why.
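
Backup sketch for Q&A

If someone asks how three pods can become twenty, the chain above can be sketched with the replica rule HPA documents for each sync: desired = ceil(current × observed / target), skipped when the ratio is within a 10% tolerance of 1.0. This is a minimal Python sketch, not the controller's actual code. The 50% CPU target, the 90% GC-inflated reading, and the 20-pod ceiling are illustrative assumptions, and the function names are hypothetical. It also assumes every sync sees the same inflated reading, which is the worst case the diagram describes.

```python
import math


def desired_replicas(current: int, observed_pct: float, target_pct: float,
                     max_replicas: int, tolerance: float = 0.1) -> int:
    """Replica rule from the HPA docs: ceil(current * observed / target)."""
    ratio = observed_pct / target_pct
    if abs(ratio - 1.0) <= tolerance:
        return current  # close enough to target: HPA leaves the count alone
    wanted = math.ceil(current * observed_pct / target_pct)
    return min(max_replicas, max(1, wanted))


def simulate_thrash(start: int = 3, target_pct: float = 50.0,
                    gc_inflated_pct: float = 90.0, max_replicas: int = 20,
                    syncs: int = 4) -> list[int]:
    """Replay HPA syncs where GC and startup work keep CPU above target.

    Real load never changes; only the CPU reading is inflated (assumed 90%).
    """
    history = [start]
    for _ in range(syncs):
        history.append(
            desired_replicas(history[-1], gc_inflated_pct, target_pct, max_replicas)
        )
    return history


if __name__ == "__main__":
    print(simulate_thrash())  # [3, 6, 11, 20, 20]
```

Under these assumed numbers the first sync adds three pods (3 → 6), and the cluster reaches the 20-pod ceiling two syncs later, because each new batch of cold pods keeps the average reading above target.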

The fix

“Two things together solve this. Change the HPA metric to RPS, which is completely unaffected by GC CPU spikes. Add stabilizationWindowSeconds: 120 on scaleUp, longer than any normal GC pause. Together these make HPA work correctly for Java.”
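
Backup sketch for the fix

The two changes can be sketched as an `autoscaling/v2` HorizontalPodAutoscaler manifest, built here as a Python dict and printed as JSON. Treat it as a sketch under stated assumptions: the Deployment name `java-app`, the metric name `http_requests_per_second`, and the target of 100 requests per second per pod are hypothetical, and a `Pods` metric like this only works if a custom metrics adapter exposes it in the cluster. The 120-second scale-up window is the value from the talk.

```python
import json


def rps_hpa_manifest(name: str, rps_metric: str, target_rps_per_pod: int,
                     min_replicas: int = 3, max_replicas: int = 20) -> dict:
    """Build an autoscaling/v2 HPA that scales on per-pod RPS instead of CPU."""
    return {
        "apiVersion": "autoscaling/v2",
        "kind": "HorizontalPodAutoscaler",
        "metadata": {"name": name},
        "spec": {
            "scaleTargetRef": {
                "apiVersion": "apps/v1",
                "kind": "Deployment",
                "name": name,  # hypothetical Deployment name
            },
            "minReplicas": min_replicas,
            "maxReplicas": max_replicas,
            "metrics": [{
                "type": "Pods",
                "pods": {
                    # Hypothetical metric; must be served by a custom metrics adapter.
                    "metric": {"name": rps_metric},
                    "target": {
                        "type": "AverageValue",
                        "averageValue": str(target_rps_per_pod),
                    },
                },
            }],
            "behavior": {
                # Wait 120 s before acting on a scale-up recommendation,
                # longer than any normal GC pause.
                "scaleUp": {"stabilizationWindowSeconds": 120},
            },
        },
    }


manifest = rps_hpa_manifest("java-app", "http_requests_per_second", 100)

if __name__ == "__main__":
    print(json.dumps(manifest, indent=2))
```

The printed JSON can be saved to a file and applied with `kubectl apply -f`, which accepts JSON as well as YAML.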