Demo 09 — AI Inference: LangChain4j + ONNX + Panama
Quarkus 3.33.1 / JDK 25 LTS
⏱ ~10 min
Run this demo
cd quarkus-demo-09-onnx
chmod +x demo.sh
./demo.sh
In-process AI inference via LangChain4j → ONNX Runtime → Panama FFM → native .so. The all-MiniLM-L6-v2 sentence embedding model (~25MB) runs inside the JVM process. No Python sidecar. No gRPC. No subprocess.
The stack
Quarkus REST → LangChain4j EmbeddingModel
→ ONNX Runtime Java (OrtSession)
→ Panama FFM (MethodHandle + Arena)
→ native libonnxruntime.so
→ optimised BLAS inference kernels
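The "MethodHandle + Arena" step in the stack can be illustrated in isolation. The sketch below is not taken from the demo or from ONNX Runtime's bindings; it only shows the general Panama FFM pattern by calling the C standard library's strlen. It assumes JDK 22 or later (where the FFM API is final) and a 64-bit platform where size_t fits in a Java long. The class name FfmStrlenSketch is hypothetical.

```java
import java.lang.foreign.Arena;
import java.lang.foreign.FunctionDescriptor;
import java.lang.foreign.Linker;
import java.lang.foreign.MemorySegment;
import java.lang.foreign.ValueLayout;
import java.lang.invoke.MethodHandle;

// Hypothetical illustration class; not part of the demo source.
final class FfmStrlenSketch {

    /** Calls the native C function strlen(const char*) through a downcall MethodHandle. */
    static long nativeStrlen(String text) {
        Linker linker = Linker.nativeLinker();

        // Look up the address of strlen in the default (libc) lookup.
        MemorySegment strlenAddress = linker.defaultLookup().find("strlen").orElseThrow();

        // Describe the native signature: size_t strlen(const char*).
        // Assumption: 64-bit platform, so size_t maps to JAVA_LONG.
        MethodHandle strlen = linker.downcallHandle(
                strlenAddress,
                FunctionDescriptor.of(ValueLayout.JAVA_LONG, ValueLayout.ADDRESS));

        // The Arena owns the off-heap memory; it is freed when the block exits.
        try (Arena arena = Arena.ofConfined()) {
            MemorySegment cString = arena.allocateFrom(text); // NUL-terminated UTF-8 copy
            return (long) strlen.invokeExact(cString);
        } catch (Throwable t) {
            throw new RuntimeException("Native call failed", t);
        }
    }

    public static void main(String[] args) {
        System.out.println(nativeStrlen("OutOfMemoryError"));
    }
}
```

On recent JDKs this may print a restricted-method warning unless the JVM is started with --enable-native-access=ALL-UNNAMED. A real binding would follow the same shape: look up a symbol in the native library, describe its signature, and pass Arena-allocated memory for tensors instead of a string.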
Single Maven dependency
<!-- Bundles: model + ONNX Runtime + Panama bindings — no download at runtime -->
<dependency>
<groupId>dev.langchain4j</groupId>
<artifactId>langchain4j-embeddings-all-minilm-l6-v2</artifactId>
<version>0.36.2</version>
</dependency>
Endpoints
# 384-dimension float vector
curl "http://localhost:8080/embed?text=OutOfMemoryError+heap+space"
# Cosine similarity — related (~0.85) vs unrelated (~0.15)
curl "http://localhost:8080/similarity?a=JVM+OOM&b=heap+exhausted"
# Classify an alert into an ops category
curl "http://localhost:8080/classify?alert=Pod+OOMKilled+exit+code+137"
# Rank past incidents by similarity — foundation of incident-aware RAG
curl -X POST http://localhost:8080/rank \
-H "Content-Type: application/json" \
-d '{"reference":"Quarkus OOMKilled","candidates":["heap exhausted","DB timeout","CPU throttle"]}'
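The similarity, classify, and rank endpoints above all reduce to comparing embedding vectors. The following is a minimal sketch of that vector math in plain Java, not the demo's actual implementation. The nearest-prototype approach in classify is an assumption about how such an endpoint could work, the class name EmbeddingOpsSketch and the category labels are hypothetical, and the 3-dimensional vectors are made-up stand-ins for the model's 384-dimensional output.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Hypothetical illustration class; not part of the demo source.
final class EmbeddingOpsSketch {

    /** A candidate text paired with its similarity to the reference. */
    record Scored(String text, double score) {}

    /** Cosine similarity: dot(a, b) / (|a| * |b|), a value in [-1, 1]. */
    static double cosine(float[] a, float[] b) {
        if (a.length != b.length) {
            throw new IllegalArgumentException("Vectors must have the same dimension");
        }
        double dot = 0.0, normA = 0.0, normB = 0.0;
        for (int i = 0; i < a.length; i++) {
            dot += (double) a[i] * b[i];
            normA += (double) a[i] * a[i];
            normB += (double) b[i] * b[i];
        }
        if (normA == 0.0 || normB == 0.0) {
            return 0.0; // similarity with a zero vector is undefined; treat as 0
        }
        return dot / (Math.sqrt(normA) * Math.sqrt(normB));
    }

    /** Assumed approach: pick the label whose prototype vector is closest to the alert. */
    static String classify(float[] alert, Map<String, float[]> prototypes) {
        String best = null;
        double bestScore = Double.NEGATIVE_INFINITY;
        for (Map.Entry<String, float[]> entry : prototypes.entrySet()) {
            double score = cosine(alert, entry.getValue());
            if (score > bestScore) {
                bestScore = score;
                best = entry.getKey();
            }
        }
        return best;
    }

    /** Orders candidates from most to least similar to the reference vector. */
    static List<Scored> rank(float[] reference, Map<String, float[]> candidates) {
        List<Scored> scored = new ArrayList<>();
        for (Map.Entry<String, float[]> entry : candidates.entrySet()) {
            scored.add(new Scored(entry.getKey(), cosine(reference, entry.getValue())));
        }
        scored.sort((x, y) -> Double.compare(y.score(), x.score()));
        return scored;
    }

    public static void main(String[] args) {
        // Toy vectors, NOT real model output.
        float[] reference = {0.9f, 0.1f, 0.0f};
        Map<String, float[]> candidates = new LinkedHashMap<>();
        candidates.put("heap exhausted", new float[] {0.8f, 0.2f, 0.1f});
        candidates.put("DB timeout", new float[] {0.1f, 0.9f, 0.1f});
        candidates.put("CPU throttle", new float[] {0.0f, 0.2f, 0.9f});

        rank(reference, candidates)
                .forEach(s -> System.out.printf("%.3f  %s%n", s.score(), s.text()));

        // Hypothetical category prototypes for the classify step.
        Map<String, float[]> prototypes = new LinkedHashMap<>();
        prototypes.put("memory", new float[] {1.0f, 0.0f, 0.0f});
        prototypes.put("database", new float[] {0.0f, 1.0f, 0.0f});
        prototypes.put("cpu", new float[] {0.0f, 0.0f, 1.0f});
        System.out.println(classify(reference, prototypes));
    }
}
```

In a real service the float arrays would come from the embedding model rather than being hard-coded, and the prototypes could be embeddings of a short description for each ops category.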
First run: the image build downloads ~300MB (ONNX Runtime + model). Subsequent runs reuse the Podman layer cache.