60-Minute Deep Dive
TAMING THE JVM
Optimizing Java Workloads on OpenShift & Kubernetes
Quarkus 3.33.1 LTS
Java 21
G1GC / ZGC / Shenandoah
AppCDS
Virtual Threads
Based on: Optimizing Cloud Native Java | SRE with Java Microservices | Quarkus 3.33.1 LTS
github.com/patterncatalyst/quarkus-optimization
Welcome. This talk is about closing the gap between how Java was designed — owning the whole machine — and how it actually runs in Kubernetes — sharing a cgroup with 20 other pods.
Everything in this talk has a live demo. All slides, code, and demos are in the GitHub repo on screen.
Agenda
01 Container-Native JVM Fundamentals
02 Right-Sizing Java Workloads
03 Garbage Collection Optimization
04 Startup Time Reduction (AppCDS)
05 Observability & Instrumentation
06 Autoscaling Integration
07 Systematic Tuning & Cost ROI
Bonus Leyden · gRPC · Latency · Panama · Valhalla
Seven sections, three live demos in the core 60 minutes, six bonus demos for extended sessions. The repo has all nine.
THE PROBLEM
Why Java + Kubernetes = Complexity
60% of Java apps overprovision memory
4–8s typical JVM cold start on Kubernetes
2–3× infrastructure waste from poor bin-packing
$$$ unnecessary cloud spend each month
Default JVM reads /proc/meminfo and sees the NODE's full RAM — claims 64 GB heap inside a 512 MB container → OOMKill
These four statistics come from real customer environments. The $$$ line has an actual number for each of them — and it's usually five or six figures annually.
The root cause of most of these problems is the same: the JVM was designed before containers existed.
SECTION 01
Container-Native JVM Fundamentals
❌ Before
# Hardcoded — breaks with resize / VPA
-Xms512m -Xmx2048m
JVM reads /proc/meminfo → host RAM
Claims 64GB inside 512MB container
✅ Java 21
-XX:MaxRAMPercentage=75.0
-XX:InitialRAMPercentage=50.0
-XX:MinRAMPercentage=25.0
-XX:NativeMemoryTracking=summary
UseContainerSupport is ON by default in Java 21. Reads cgroup limits correctly.
cgroup v2 (RHEL 9 / OCP 4.14+): reads /sys/fs/cgroup/memory.max
UseContainerSupport is the foundational fix. It's been on by default since Java 10 but most teams don't know about MaxRAMPercentage.
The old -Xmx approach breaks silently whenever a VPA changes the container limit or the cluster admin resizes the node.
Reference: Optimizing Cloud Native Java Ch. 3 — Container Memory Management.
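A quick way to confirm what the JVM actually believes about its container is to ask it from inside the pod. The sketch below is illustrative only: the class name ContainerView is hypothetical, and it uses nothing beyond the standard Runtime and RuntimeMXBean APIs. With a 512 MB limit and MaxRAMPercentage=75.0, the reported max heap should land near 384 MiB.

```java
import java.lang.management.ManagementFactory;

// Hypothetical helper: prints the limits the running JVM has derived for itself.
final class ContainerView {

    // Max heap the JVM will grow to, in MiB (reflects -Xmx or MaxRAMPercentage).
    static long maxHeapMiB() {
        return Runtime.getRuntime().maxMemory() / (1024 * 1024);
    }

    // CPU count the JVM sizes its thread pools and GC threads from.
    static int cpus() {
        return Runtime.getRuntime().availableProcessors();
    }

    // One-line summary including the JVM flags that were actually passed in.
    static String describe() {
        return "maxHeapMiB=" + maxHeapMiB()
                + " cpus=" + cpus()
                + " jvmArgs=" + ManagementFactory.getRuntimeMXBean().getInputArguments();
    }

    public static void main(String[] args) {
        System.out.println(describe());
    }
}
```

If the numbers match the node rather than the container limit, the heap flags are not taking effect.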
SECTION 01
JVM Memory Regions — Six Buckets, Not One
Region Typical Size Controlled By
Heap (Old + Young Gen) 50–75% MaxRAMPercentage
Metaspace 50–200 MB -XX:MaxMetaspaceSize=256m
Platform Thread Stacks 1 MB/thread -Xss or Virtual Threads
Native Memory (JIT, GC) 100–300 MB —
Direct ByteBuffers Varies Netty / NIO config
GC Bookkeeping 50–100 MB —
Java 21 Virtual Threads: stacks live in heap as tiny continuations — eliminates 1MB/thread platform thread stack budget for I/O-bound workloads
MaxRAMPercentage=75 only controls the heap. The remaining 25% must cover five other regions.
Setting 90% starves Metaspace and Netty buffers — OOMKills even when your heap metric looks fine.
Measure with: jcmd <pid> VM.native_memory summary (requires -XX:NativeMemoryTracking=summary, as set on the previous slide)
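The 75% versus 90% argument is simple arithmetic, sketched below. This is an approximation, not the JVM's exact ergonomics (the real heap size is also aligned to region boundaries). The class name OffHeapBudget and the off-heap estimates passed in are illustrative assumptions taken loosely from the table ranges above; substitute your own Native Memory Tracking numbers.

```java
// Hypothetical budget check: does the non-heap remainder cover the off-heap regions?
final class OffHeapBudget {

    // Approximate max heap for a container limit and MaxRAMPercentage value.
    static long heapMiB(long limitMiB, double maxRamPercentage) {
        return (long) (limitMiB * maxRamPercentage / 100.0);
    }

    // Whatever the heap does not take is all that Metaspace, stacks, JIT, etc. can use.
    static long offHeapMiB(long limitMiB, double maxRamPercentage) {
        return limitMiB - heapMiB(limitMiB, maxRamPercentage);
    }

    // True when the summed off-heap estimates fit in the remainder.
    static boolean fits(long limitMiB, double maxRamPercentage, long... offHeapEstimatesMiB) {
        long needed = 0;
        for (long estimate : offHeapEstimatesMiB) {
            needed += estimate;
        }
        return needed <= offHeapMiB(limitMiB, maxRamPercentage);
    }

    public static void main(String[] args) {
        // Assumed estimates: 256 MiB Metaspace cap, 100 MiB native, 100 MiB GC bookkeeping.
        System.out.println("75%: fits=" + fits(2048, 75.0, 256, 100, 100));
        System.out.println("90%: fits=" + fits(2048, 90.0, 256, 100, 100));
    }
}
```

With a 2 GiB limit, 75% leaves 512 MiB for the other five regions; 90% leaves about 205 MiB, which the assumed 456 MiB of off-heap usage cannot fit into.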
SECTION 02
Right-Sizing Java Workloads
requests — Scheduling Guarantee
Scheduler uses this to find a node
Set to P50 steady-state RSS
Too high → pods can't schedule
Too low → CPU throttle on full node
limits — Hard Ceiling
Memory exceeded → OOMKill (exit 137)
CPU exceeded → throttled (not killed)
Set memory limit 25-30% above P99 RSS
Set CPU limit 2-4× request to absorb GC spikes
Demo 07: 7-workload analysis · 4 nodes → 2 nodes · +67% pod density · $6,720/year saving · 17× ROI
The key insight: requests and limits serve different purposes. Requests are for the scheduler. Limits are for safety.
Most teams set them equal — that gives you Guaranteed QoS (good for CPU Manager) but no GC surge headroom.
Demo 07 shows this analysis run on a real 7-service cluster.
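The sizing rules on this slide reduce to a few multiplications. The sketch below is one possible encoding, not the Demo 07 implementation: the RightSizer class, the Sizing record, and the parameter names are all hypothetical, and the inputs are assumed to come from your own RSS percentiles.

```java
// Hypothetical calculator for the request/limit rules of thumb on this slide.
final class RightSizer {

    record Sizing(long memoryRequestMiB, long memoryLimitMiB,
                  long cpuRequestMillis, long cpuLimitMillis) {}

    // memHeadroom: 0.25 to 0.30 per the slide; cpuBurstFactor: 2.0 to 4.0 per the slide.
    static Sizing recommend(long p50RssMiB, long p99RssMiB, long cpuRequestMillis,
                            double memHeadroom, double cpuBurstFactor) {
        long memoryLimit = (long) Math.ceil(p99RssMiB * (1.0 + memHeadroom)); // above P99 RSS
        long cpuLimit = (long) Math.ceil(cpuRequestMillis * cpuBurstFactor);  // absorbs GC spikes
        return new Sizing(p50RssMiB, memoryLimit, cpuRequestMillis, cpuLimit); // request = P50 RSS
    }

    public static void main(String[] args) {
        // Example inputs only: P50 300 MiB, P99 400 MiB, 500m CPU request.
        System.out.println(recommend(300, 400, 500, 0.25, 2.0));
    }
}
```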
SECTION 03
GC in Containers: Four Challenges
CPU Throttling Extends GC
CPU limits throttle GC threads mid-pause. 100ms G1GC → 400ms under throttle. Set limit ≥ 2× request.
ParallelGCThreads Default
JVM defaults to host CPU count. 64-core node + 4 CPU limit = 64 threads competing for 4 CPUs.
GC-Induced HPA Thrash
GC pause → CPU spike → HPA fires → new pods GC → repeat. Scale on RPS, not CPU.
Heap Sizing vs GC Pressure
Small heap = frequent GC. Too large = infrequent but long GC. Start at MaxRAMPercentage=75.
The ParallelGCThreads default is the one that surprises most people. Write this down:
-XX:ParallelGCThreads=N where N equals whatever is in your resources.requests.cpu.
This costs nothing and immediately improves GC pause duration on any shared-node cluster.
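Deriving N from the CPU request can be scripted. The sketch below parses a Kubernetes CPU quantity such as "500m" or "2" and emits the flag. The GcThreads class is hypothetical, and rounding fractional cores up with a floor of one thread is an assumption of this sketch, not something the slide prescribes.

```java
// Hypothetical helper: turns resources.requests.cpu into a ParallelGCThreads flag.
final class GcThreads {

    // Accepts millicore form ("2500m") or core form ("2", "1.5").
    static int fromCpuQuantity(String cpu) {
        String quantity = cpu.trim();
        double cores = quantity.endsWith("m")
                ? Double.parseDouble(quantity.substring(0, quantity.length() - 1)) / 1000.0
                : Double.parseDouble(quantity);
        // Assumption: round up, and never go below one GC thread.
        return Math.max(1, (int) Math.ceil(cores));
    }

    static String flag(String cpu) {
        return "-XX:ParallelGCThreads=" + fromCpuQuantity(cpu);
    }

    public static void main(String[] args) {
        System.out.println(flag("4"));
        System.out.println(flag("500m"));
    }
}
```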
SECTION 03
GC Selection Guide
Collector | Pause | Best For | Key Flags
G1GC | 50–300ms | General purpose, Temurin/Corretto default | -XX:+UseG1GC -XX:MaxGCPauseMillis=200
Shenandoah | 1–20ms | UBI9 default (Red Hat images ship this) | -XX:+UseShenandoahGC
ZGC (Gen) | <1ms | Low-latency APIs, any heap size, HPA stability | -XX:+UseZGC -XX:+ZGenerational
Serial GC | STW | CLI tools, batch, <256MB heap only | -XX:+UseSerialGC
Note: UBI9 ships Shenandoah. Demos 02 and 06 override to -XX:+UseG1GC / -XX:+UseZGC for clean comparison.
Tip from SRE with Java Microservices: monitor jvm.gc.pause via Micrometer.
If P99 pause > 500ms, switch from G1GC to ZGC or Shenandoah. Don't tune G1GC parameters hoping to get there — switch the algorithm.
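Because the base image can change the default collector, it helps to check which one is actually running before comparing pause numbers. The sketch below uses the standard GarbageCollectorMXBean API; the class name GcInventory is hypothetical. Note that these beans expose cumulative time only, so they complement the Micrometer pause histogram rather than replace it.

```java
import java.lang.management.GarbageCollectorMXBean;
import java.lang.management.ManagementFactory;
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper: reports the active collectors and their cumulative GC time.
final class GcInventory {

    // Bean names reveal the collector in use, e.g. names starting with "G1" or "ZGC".
    static List<String> collectorNames() {
        List<String> names = new ArrayList<>();
        for (GarbageCollectorMXBean bean : ManagementFactory.getGarbageCollectorMXBeans()) {
            names.add(bean.getName());
        }
        return names;
    }

    // Sum of collection time across collectors; -1 means "undefined" and is skipped.
    static long totalCollectionMillis() {
        long total = 0;
        for (GarbageCollectorMXBean bean : ManagementFactory.getGarbageCollectorMXBeans()) {
            total += Math.max(0, bean.getCollectionTime());
        }
        return total;
    }

    public static void main(String[] args) {
        System.out.println(collectorNames() + " totalMs=" + totalCollectionMillis());
    }
}
```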
SECTION 04
Startup Time Reduction
Spring Boot vs Quarkus Baseline
Spring Boot 4.0.5 ~4–8s
Quarkus 3.33.1 JVM ~0.3–0.8s
Quarkus + AppCDS (JDK 21) ~0.15–0.4s
Quarkus + Leyden (JDK 25) ~148ms (Demo 04)
One Property
# application.properties
quarkus.package.jar.aot.enabled=true
# Build + train
mvn verify # (not package)
# → runs @QuarkusIntegrationTest
# → writes target/quarkus-app/app.aot
Quarkus is already 5-10x faster than Spring Boot before any optimisation. That's because Quarkus moves classpath scanning and DI wiring to build time.
AppCDS gives ~5% on Quarkus vs ~40% on Spring Boot — the small Quarkus improvement proves the point: less work to cache because less work happens at runtime.
Leyden on JDK 25: 609ms → 148ms, a 75% reduction. Verified in Demo 04.
SECTION 04
Virtual Threads — @RunOnVirtualThread
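import io.smallrye.common.annotation.RunOnVirtualThread;
import jakarta.enterprise.context.ApplicationScoped;
import jakarta.ws.rs.GET;
import jakarta.ws.rs.Path;
import jakarta.ws.rs.QueryParam;

@Path("/allocate")
@ApplicationScoped
public class GcResource {

    @GET
    @RunOnVirtualThread // ← One annotation. Done.
    public AllocResponse allocate(@QueryParam("mb") int mb) {
        return doHeavyWork(mb);
    }
}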
Container sizing impact
Platform thread stacks: 1MB each
200 threads = 200MB off-heap
Virtual thread stacks → in heap
10,000 concurrent I/O tasks, same memory
resources:
  requests:
    memory: "256Mi" # Was 512Mi
  limits:
    memory: "512Mi"
JEP 444, Java 21. One annotation in Quarkus. Avoid synchronized + I/O (pins carrier thread) — use ReentrantLock instead.
The container sizing impact is real: a REST service that needed 512Mi for 200 platform threads can handle 10,000 virtual threads with the same memory.
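The ReentrantLock advice is easiest to see outside Quarkus. The plain Java 21 sketch below runs blocking work on virtual threads while holding a ReentrantLock instead of a synchronized block. The VirtualThreadCounter class is hypothetical, and Thread.sleep stands in for real I/O.

```java
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.locks.ReentrantLock;

// Hypothetical demo: guarded blocking work on virtual threads without pinning on Java 21.
final class VirtualThreadCounter {

    private final ReentrantLock lock = new ReentrantLock();
    private int count;

    // Blocking while holding a ReentrantLock lets the virtual thread unmount;
    // the same body inside a synchronized block would pin its carrier on Java 21.
    void increment() throws InterruptedException {
        lock.lock();
        try {
            Thread.sleep(1); // stand-in for a blocking I/O call
            count++;
        } finally {
            lock.unlock();
        }
    }

    static int run(int tasks) {
        VirtualThreadCounter counter = new VirtualThreadCounter();
        // One virtual thread per task; close() waits for all tasks to finish.
        try (ExecutorService executor = Executors.newVirtualThreadPerTaskExecutor()) {
            for (int i = 0; i < tasks; i++) {
                executor.submit(() -> {
                    counter.increment();
                    return null;
                });
            }
        }
        return counter.count;
    }

    public static void main(String[] args) {
        System.out.println("completed=" + run(200));
    }
}
```

Running with -Djdk.tracePinnedThreads=full on JDK 21 is one way to find remaining synchronized blocks that pin.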
SECTION 05
Observability — You Can't Tune What You Can't See
JFR (JDK Flight Recorder)
Built-in, <1% overhead. GC events, allocations, I/O. Start with: jcmd <pid> JFR.start
Cryostat (OpenShift)
OpenShift-native JFR management via Kubernetes operator. Auto-discover pods via annotation.
OTel → Grafana LGTM
quarkus-micrometer-opentelemetry — single extension, all telemetry via OTLP
Essential Metrics
jvm_gc_pause_seconds P99 >500ms → switch GC
jvm_memory_used_bytes heap + off-heap
Required: quarkus.micrometer.distribution.percentiles-histogram.jvm.gc.pause=true — without this, Grafana GC panels show no data
The histogram configuration is not optional. The jvm.gc.pause counter tells you "GC happened 40 times". The histogram tells you "GC P99 was 800ms — fire an alert". Set this before your next deployment.
quarkus-micrometer-opentelemetry replaces separate prometheus registry + otel extensions — one unified OTLP pipeline.
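JFR can also be started from code, which is handy in a test or a one-off diagnostic endpoint. The sketch below uses the standard jdk.jfr.Recording API to capture GC events into a temporary file. The GcFlightRecording class and the allocation loop are illustrative assumptions; in a cluster, Cryostat or jcmd would normally drive this instead.

```java
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.file.Files;
import java.nio.file.Path;
import jdk.jfr.Recording;

// Hypothetical demo: record GC events programmatically and dump them to a .jfr file.
final class GcFlightRecording {

    static Path record() {
        try {
            Path output = Files.createTempFile("gc-demo", ".jfr");
            try (Recording recording = new Recording()) {
                recording.enable("jdk.GarbageCollection"); // one event per collection
                recording.enable("jdk.GCHeapSummary");     // heap usage before/after GC
                recording.start();

                // Churn some short-lived garbage so there is something to record.
                byte[][] garbage = new byte[64][];
                for (int i = 0; i < 10_000; i++) {
                    garbage[i % 64] = new byte[64 * 1024];
                }
                System.gc(); // a request only; may be ignored if explicit GC is disabled

                recording.stop();
                recording.dump(output);
            }
            return output;
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        System.out.println("Recording written to " + record());
    }
}
```

The resulting file opens in JDK Mission Control or with the jfr print command.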
SECTION 06
Autoscaling — HPA with JVM-Aware Metrics
spec:
  minReplicas: 2 # NEVER 1: single pod + GC STW = 100% downtime
  behavior:
    scaleUp:
      stabilizationWindowSeconds: 120 # Absorb GC CPU spikes up to 2min
      policies:
        - type: Pods
          value: 2
          periodSeconds: 60
    scaleDown:
      stabilizationWindowSeconds: 300
  metrics:
    - type: External
      external:
        metric: { name: http_requests_per_second }
        target: { type: AverageValue, averageValue: "50" }
    - type: External
      external:
        metric: { name: jvm_memory_used_ratio }
        target: { type: AverageValue, averageValue: "0.80" }
Scale on RPS not CPU. GC pauses create CPU spikes — CPU-based HPA treats those as load signals and scales out.
The stabilisation window of 120s is longer than any normal GC pause. The memory ratio metric scales out before OOMKill.
minReplicas:2 is the single cheapest reliability improvement — one extra pod, zero downtime during GC pause.
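To reason about what the HPA will do with an RPS target of 50, it helps to run the core scaling formula by hand: desired = ceil(current × currentMetric / target), clamped to the replica bounds. The sketch below is a simplification under stated assumptions: it ignores the controller's tolerance band, stabilization windows, and scaling policies, and both the HpaMath class and the maxReplicas value of 10 are hypothetical (the slide does not show maxReplicas).

```java
// Hypothetical calculator for the core HPA replica formula (no tolerance, no policies).
final class HpaMath {

    static int desiredReplicas(int currentReplicas, double currentMetric, double targetMetric,
                               int minReplicas, int maxReplicas) {
        int desired = (int) Math.ceil(currentReplicas * (currentMetric / targetMetric));
        return Math.max(minReplicas, Math.min(maxReplicas, desired));
    }

    public static void main(String[] args) {
        // 2 pods averaging 80 RPS each against a 50 RPS target.
        System.out.println(desiredReplicas(2, 80.0, 50.0, 2, 10));
    }
}
```

With the slide's scaleUp policy of 2 pods per 60 seconds, a jump from 2 to 4 replicas fits inside a single policy period once the stabilization window has passed.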
SECTION 07
Systematic Tuning Workflow
40-60% Memory reduction after right-sizing
2-3× Pod density per node
55% Startup reduction with AppCDS
$$$ Node savings from bin-packing
One change at a time. Measure before and after. Commit or revert based on data.
If you accumulate five JVM flags without measuring each one, you can't attribute any improvement to any flag.
OpenShift Cost Management tracks cost per namespace before and after to show business value.
DEMO 01
Container-Aware Heap Sizing
Run WITHOUT UseContainerSupport → JVM claims host RAM
Run WITH UseContainerSupport + MaxRAMPercentage=75 → respects 512MB
Live jcmd output showing heap sizes before and after
OOMKill simulation when JVM ignores container limits
cd demo-01-heap-sizing
./demo.sh
Demo 01 is the foundational fix. Everything else in this talk builds on getting this right first.
DEMO 02
GC Monitoring with Prometheus
Quarkus 3.33.1 + quarkus-micrometer-opentelemetry + Grafana LGTM
Live GC pause histograms at /q/metrics
Generate GC pressure — watch metrics AND traces simultaneously
G1GC vs Generational ZGC side-by-side pause comparison
Virtual threads: 500 concurrent tasks, minimal platform thread count
cd quarkus-demo-02-gc-monitoring
./demo.sh # starts podman-compose stack
DEMO 03
AppCDS Startup Acceleration
Quarkus baseline: ~0.3-0.8s (already 5-10× faster than Spring Boot)
quarkus.package.jar.aot.enabled=true — one property
Maven plugin handles training on @QuarkusIntegrationTest suite
Quarkus + AOT Cache: ~0.15-0.4s (30-50% additional gain)
Progression: AppCDS (JDK 21) → Leyden -XX:AOTCache (JDK 25)
cd quarkus-demo-03-appcds
./demo.sh
Key Takeaways
Always enable UseContainerSupport + MaxRAMPercentage — hardcoded -Xmx is a container anti-pattern
Right-size first, then tune — measure RSS + off-heap before setting requests/limits
Match GC to workload — G1GC general, ZGC/Shenandoah for latency-sensitive APIs
Quarkus AppCDS: one property — quarkus.package.jar.aot.enabled=true. Already 5-10× faster than Spring Boot
Observe before you tune — JFR + Cryostat + Prometheus validates every change
Autoscale on RPS not CPU — GC pauses lie to HPA. Use @RunOnVirtualThread
Quantify savings — track cost per namespace to show business value from engineering work
Read each one slowly. This is the audience's callback for their own environments.
Resources & Q&A
Questions?
Slides and all demos are in the GitHub repo. PRs welcome — especially if a demo breaks on your platform.
LEYDEN
Project Leyden — JVM AOT Cache
Release JEP What it adds Gain
JDK 24 483 AOT class loading & linking ~40% startup
JDK 25 LTS 514+515 Ergonomics + JIT method profiles ~75% startup (Demo 04: 609→148ms)
JDK 26 516 ZGC support — no longer have to choose +ZGC compat
Future — Pre-compiled native code in cache Instant peak perf
Quarkus 3.33.1: quarkus.package.jar.aot.enabled=true — one property, all JDK versions, cache automatically richer on each upgrade
The key message: you configure this once with one property. Upgrade the JDK at your own pace. The cache automatically becomes richer on each JDK upgrade.
Leyden vs GraalVM Native: Leyden stays on the JVM — full reflection, dynamic loading, JIT all continue to work. Native is the closed-world AOT option.
DEMO 04 · JDK 25 LTS
Quarkus + Project Leyden AOT Cache
One property: quarkus.package.jar.aot.enabled=true
Build trains on @QuarkusIntegrationTest suite
Output: app.aot alongside quarkus-run.jar
609ms → 148ms startup (−75%)
cd quarkus-demo-04-leyden
./demo.sh # JDK 25 required (in container)
gRPC
REST vs gRPC — Inside the Cluster
REST / JSON
HTTP/1.1 (or 2)
JSON text (~400 bytes)
New connection per request
SSE / WebSocket only for streaming
curl friendly ✅
Browser native ✅
gRPC / Protobuf
HTTP/2 always
Binary Protobuf (~40 bytes)
Multiplexed, persistent
Built-in streaming (4 modes) ✅
Needs grpcurl / Postman
Generated stubs
⚠️ Localhost caveat: gRPC unary is SLOWER than REST on localhost — network cost is zero. gRPC wins streaming and high concurrency (c=500) even locally.
The localhost result is expected. Show it — hiding it would be dishonest. In production with pod-to-pod latency, gRPC wins 3-4x on throughput and 73% on p50 latency.
The streaming comparison is real regardless of where you run it.
LOW LATENCY
Why the JVM Breaks Latency SLAs
G1GC — Default
Young GC pause: 10–200ms
Mixed GC pause: 50–500ms
Full GC (worst): 1–10s
Pauses SCALE with heap size
CPU spike → HPA false scale-out
ZGC Generational — JDK 21+
All pauses: <1ms
Scales with thread count, not heap
Load barrier overhead: ~5-15%
Smooth CPU profile → no HPA thrash
-XX:+UseZGC
-XX:+ZGenerational
Same app, same heap, same Quarkus config. Different GC = different production behaviour.
ZGC will show lower throughput in the hey load test — that's the load barrier cost. The meaningful metric is the GC pause delta.
RIGHT-SIZING
Cost Impact Analysis & Business Case
$6,720
annual saving · 2 nodes × $0.384/hr × 8,760 hrs · this cluster alone
💰 Direct savings
2 nodes eliminated · $1,120 → $560/month
⏱ Engineering cost
~4 hours · rolling restarts · 17× ROI
📈 Indirect benefits
HPA stability · VPA trustworthy · correct thresholds
🏢 At scale
10 clusters = $67,200/year · OpenShift Cost Management
$6,720/year from one cluster, one afternoon of analysis. That's not a rounding error.
The ROI argument: $6,720 saving for ~$400 engineering time = 17x return.
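The business case is worth recomputing with your own node price. The sketch below reproduces the slide's formula, 2 nodes × $0.384/hr × 8,760 hrs, which comes to about $6,728 per year; the slide's $6,720 is the rounded 12 × $560 monthly figure. The CostCase class is hypothetical, and the $400 engineering cost is the figure quoted in the ROI note.

```java
// Hypothetical calculator for the node-savings and ROI arithmetic on this slide.
final class CostCase {

    static final int HOURS_PER_YEAR = 8_760;

    // Annual saving from removing nodes billed at a flat hourly rate.
    static double annualSaving(int nodesRemoved, double hourlyRatePerNode) {
        return nodesRemoved * hourlyRatePerNode * HOURS_PER_YEAR;
    }

    // Simple return multiple: saving divided by what the work cost.
    static double roi(double saving, double engineeringCost) {
        return saving / engineeringCost;
    }

    public static void main(String[] args) {
        System.out.printf("annual saving: $%.2f%n", annualSaving(2, 0.384));
        System.out.printf("ROI: %.1fx%n", roi(6_720, 400));
    }
}
```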
PANAMA
Project Panama — The End of JNI
JNI (1996)
Write Java + C header + C wrapper
Compile C per platform/arch
Manual native memory — leaks kill JVM
JNI crash = no Java stack trace
sun.misc.Unsafe: private API, breaks each JDK
Panama FFM (JDK 22 — finalized)
try (Arena arena = Arena.ofConfined()) {
    MemorySegment data =
        arena.allocateFrom(JAVA_DOUBLE, arr);
    int result = (int) methodHandle
        .invoke(data, arr.length, outP99);
} // freed here, zero leaks possible
FFM is production-ready in JDK 22, stable in JDK 25. No --enable-preview required.
The Arena is the key safety feature. Everything allocated in a confined arena is freed when it closes. You cannot leak if you use try-with-resources.
allocateFrom() — not allocateArray(). The preview API was renamed at GA.
VALHALLA
Project Valhalla — Closing the 30-Year Gap
Today: Record (Heap Object)
// Heap object — pointer in array
// 8-byte header per element
// GC-tracked — every allocation
record Point(double x, double y) {}
Valhalla (preview JDK 25+)
// Stored inline — no header
// x0,y0,x1,y1 densely packed
// GC never sees it
value class Point {
    double x; double y;
}
📉 Memory List<double>: 1× vs List<Double>: 3×. Pod requests cut up to 50%.
♻️ GC Pressure Value types: zero heap allocation, zero GC tracking. HPA stays quiet.
⚡ Cache Performance Sequential memory access. L1/L2 cache-friendly. SIMD-friendly layout.
📅 Timeline Preview JDK 25+. Universal generics (List<int>) after primitive classes. Stable ~JDK 27-29.
Valhalla doesn't change how you write code — it changes what the JVM does with your code.
A List<Point> written today will automatically get better memory layout on a Valhalla JVM if Point is a value class.
ANTI-PATTERNS
Common JVM Anti-Patterns on Kubernetes
🧠 Memory
❌ Hardcoded -Xmx/-Xms
❌ MaxRAMPercentage=90 — starves off-heap
❌ No -XX:MaxMetaspaceSize
⚙️ GC & CPU
❌ Default ParallelGCThreads on large node
❌ CPU-based HPA with Java workloads
❌ minReplicas: 1 in HPA
❌ No stabilizationWindowSeconds
🚀 AOT / Startup
❌ @QuarkusTest for AOT training
❌ Manual -XX:AOTCache in Dockerfile
❌ mvn package instead of mvn verify
❌ Ignoring JDK version on rebuild
👁 Observability
❌ No GC pause histogram
❌ Separate prometheus + otel extensions
❌ No PrometheusRule on jvm_gc_pause
❌ Tuning JVM flags without baseline
Pick 2-3 from each category that resonate with your audience. Ask for a show of hands — usually half the room has seen rows 1-3 firsthand.
ANTI-PATTERNS
Anti-Pattern Remediation
✅ Memory Fixes
→ -XX:+UseContainerSupport -XX:MaxRAMPercentage=75.0
→ Use 75%, not 90% — reserve 25% for off-heap
→ Add -XX:MaxMetaspaceSize=256m
✅ GC & CPU Fixes
→ -XX:ParallelGCThreads=N (= CPU request)
→ HPA on RPS, not CPU (KEDA or Prometheus Adapter)
→ minReplicas: 2 minimum
→ stabilizationWindowSeconds: 120
✅ AOT / Startup Fixes
→ Use @QuarkusIntegrationTest, not @QuarkusTest
→ Don't add -XX:AOTCache — Quarkus sets it automatically
→ Run mvn verify, not mvn package
→ Pin JDK minor version in Dockerfile FROM
✅ Observability Fixes
→ percentiles-histogram.jvm.gc.pause=true
→ Use quarkus-micrometer-opentelemetry (unified)
→ PrometheusRule: GC P99 >500ms for 2m → alert
→ Baseline first. Change one flag. Measure again.
Everything on this slide is a drop-in change. No application code changes. No architectural redesign.
Configuration and build pipeline changes only. The golden rule: one change at a time, measure before and after.