GraalVM Native Images in Production: From 5-Second Startup to 50ms
Key Takeaways
- Native images cut startup from 4–11s to 40–120ms, which makes a 2-second Kubernetes readiness SLA achievable where it is effectively out of reach on a stock JVM. The closed-world assumption is the price.
- At steady state, native-image throughput is ~10–25% lower than JIT because the AOT compiler can't speculate on runtime behavior: JIT observes hot paths and optimizes for them, while AOT must be conservative.
- Reflection, dynamic class loading, and proxy generation require explicit hints. A native image shipped without the right GraalVM metadata files builds cleanly and then fails at runtime, in production.
- Class Data Sharing (CDS) on Spring Boot 3.3+ cuts JVM startup to 2–3s with zero code changes and no reflection hints, which often makes it the right middle ground between startup and throughput.
The classic case is JVM startup time busting a production SLA. A Spring Boot service has a 2-second readiness SLA on Kubernetes. Cold start measures 7 seconds. Rolling deployments and autoscaling events trigger 502 bursts because pods miss the readiness window. We migrated this exact shape to GraalVM native image on multiple production teams: 7 seconds → 80 milliseconds, memory 480 MB → 130 MB, throughput 10 to 25 percent lower at steady state.
GraalVM Native Images: Instant Startup, Reflection Landmines
A common SLA for Kubernetes-hosted services is a two-second restart window. For most Spring Boot services, this is impossible to meet on the JVM. Startup times of 4–11 seconds are common, and those seconds matter during rolling deployments and autoscaling events.
GraalVM native images eliminate the JVM at runtime[GraalVM Native Image Docs], cutting startup from 4–11 seconds to 40–120ms in our migrations. Trade-off: the closed-world assumption requires explicit reflection hints, and throughput at steady state is typically 10–25% lower than JIT. Use native images for Kubernetes scale-to-zero workloads; skip them for heavy reflection or batch processing.
- Instant startup (40–120ms) lets K8s readiness delays drop from 15+ seconds to about 1 second
- Memory footprint drops 60–70% in our experience, typically 80–150MB vs 350–550MB for the JVM
- Reflection hints are mandatory: the closed-world build breaks dynamic code paths such as JPA and dynamic proxies unless they are declared
When to Use Native Image: The Quick-Start Table
The decision to migrate to native images is fundamentally a question of your deployment model and constraints [GraalVM Native Image Docs]. Before migrating any service, evaluate it against this matrix:
| Factor | JVM JAR | Native Image | Containerized JVM (CDS) |
|---|---|---|---|
| Startup time | 4–11s | 40–120ms | 1.5–3s |
| Peak throughput | Highest (after warmup) | 10–25% lower | Same as JVM |
| Memory (RSS) | 350–550MB | 80–150MB | 300–450MB |
| Image size | 300–420MB | 55–95MB | 250–350MB |
| Build time | 10–30s | 8–15min | 30–60s |
| Reflection support | Works out-of-box | Requires explicit hints | Works out-of-box |
| Debugging tools | Full JVM ecosystem | Limited (thread dumps, partial JFR) | Full JVM ecosystem |
| Best use case | Long-running services, throughput-critical | K8s scale-in-out, serverless, memory-constrained | Cost-conscious services |
Use native images when: pod lifetime is under 15 minutes, startup SLA is strict (< 2 seconds), or you're running at scale where memory per pod compounds into infrastructure cost.
Use the JVM when: services run for hours/days (batch, background jobs), you need peak throughput (long warm-up acceptable), or you rely heavily on reflection (complex JPA, AspectJ load-time weaving).
Consider Class Data Sharing (CDS) as a middle ground. Spring Boot 3.3+ supports CDS: a one-time training run records the loaded classes into an archive that the JVM memory-maps on subsequent starts — no code changes, full JVM compatibility. Spring measured ~1.5× faster startup on a minimal app; class-heavy services gain more, typically landing in the 2–3s range (the training run is automated when you build with Buildpacks and set BP_JVM_CDS_ENABLED=true). It won't match native image startup, but it preserves debugging tooling and doesn't require reflection hints — often the sweet spot for teams unsure about the native image commitment.
Two efforts are closing the startup gap without the closed-world tax. Project Leyden extends CDS into an AOT cache: JEP 483 (JDK 24) caches loaded-and-linked classes, JEP 515 (JDK 25) adds method profiles so the JIT starts compiling hot paths at boot, and JEP 516 (JDK 26) makes the object cache GC-agnostic — unblocking ZGC. These run on the ordinary HotSpot JVM, so reflection, dynamic class loading, and full debugging tooling keep working; a training run produces the cache. CRaC (Coordinated Restore at Checkpoint) takes a different route — snapshot a warmed-up JVM and restore it in milliseconds with JIT-compiled code already in place; Spring Boot, Micronaut, and Quarkus support it, though it's Linux-only and not yet GA.
This doesn't make native image obsolete — it sharpens the choice. Native image still wins on absolute memory footprint (80–150MB RSS vs the JVM's 300MB+) and a single self-contained binary with no JVM at runtime. Reach for Leyden/CDS or CRaC when you want faster starts but can't pay the reflection-metadata and lost-tooling cost; reach for native image when memory density and minimal attack surface are the goal.
How Native Image Works: Static Analysis vs JIT
Traditional JVM (JIT): Source is compiled to bytecode at build time, and the JVM starts out interpreting that bytecode at runtime. The JIT profiles hot paths and compiles them to optimized machine code, so performance improves over time: a 10-minute-old JVM is significantly faster than one that just started.
GraalVM Native Image (AOT): GraalVM performs static analysis at build time (the "closed-world assumption")[GraalVM Native Image Docs], builds a reachability graph, and produces a self-contained native binary. No JVM, no interpreter, no JIT, no class loading. All code is pre-compiled to optimized machine code. The binary starts instantly.
graph TD
subgraph JVM ["Traditional JVM - JIT"]
B1[Source Code] -->|javac| B2[Bytecode]
B2 --> B3[JVM Startup<br/>~4-11 seconds]
B3 --> B4[Interpreter] -->|Hot Paths| B5[JIT Compiler]
B5 --> B6[Optimized Machine Code<br/>Peak at ~5min]
end
subgraph Native ["GraalVM Native - AOT"]
N1[Source Code] -->|javac| N2[Bytecode]
N2 -->|native-image<br/>Static Analysis| N3[Closed-World Build<br/>8-15 minutes]
N3 --> N4[Native Binary<br/>Self-contained]
N4 --> N5[Pre-Optimized Code<br/>Ready immediately]
end
The closed-world assumption is the fundamental constraint. The compiler must know about everything at build time. Normal method calls, inheritance, and allocations work fine. But reflection breaks — the compiler cannot know which classes will be instantiated via Class.forName() at runtime. Same issue applies to JNI, dynamic class loading, and runtime proxy generation. This is where most teams hit their first wall. [GraalVM Native Image Docs]
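To make the constraint concrete, here is a minimal sketch using only the JDK. The class `ClosedWorldDemo` and its method names are illustrative, not from any library. The first method contains an ordinary constructor call that a reachability analysis can follow; the second receives the class name as data, so nothing in the code tells the compiler which class to keep.

```java
import java.util.ArrayList;
import java.util.List;

/** Illustrates why reflective instantiation is hard for static analysis to see. */
final class ClosedWorldDemo {

    // Visible to static analysis: the constructor call appears directly in the code.
    static List<String> direct() {
        return new ArrayList<>();
    }

    // Not visible in general: the class name is only known when the method runs.
    static Object reflective(String className) {
        try {
            Class<?> type = Class.forName(className);
            return type.getDeclaredConstructor().newInstance();
        } catch (ReflectiveOperationException e) {
            throw new IllegalStateException("cannot instantiate " + className, e);
        }
    }

    public static void main(String[] args) {
        // Hypothetical input: in a real service this might come from config or a message header.
        String name = args.length > 0 ? args[0] : "java.util.ArrayList";
        System.out.println(reflective(name).getClass().getName());
    }
}
```

On the JVM, running this with no arguments prints `java.util.ArrayList`. In a native image, the `reflective` path would be expected to work only for classes that were registered for reflection at build time, which is the gap the hints discussed later are meant to close.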
The Throughput Reality: JIT vs AOT vs PGO
JIT warm-up produces better peak throughput because the JIT observes actual runtime behavior — which branches are taken, which methods are hot, which types flow through call sites. The JVM makes speculative optimizations based on observed patterns. But this comes at a cost. In a Spring Boot microbenchmark (Java 21, single endpoint, 8-core host) we typically see a curve like:
- 0–10s: ~4,000 req/s (interpreting bytecode, minimal optimization)
- 10–30s: ~12,000 req/s (C1 compiler kicks in, basic optimizations)
- 30s–2min: ~20,000 req/s (C2 compiler, inlining, escape analysis)
- 5min+: ~28,000 req/s (fully warmed, speculative optimizations such as profile-guided inlining and pruning of untaken branches)
For long-running services (hours/days), peak JIT throughput is typically 10–25% higher than AOT in our experience. The JIT compiler does things the static compiler cannot: it speculates on runtime behavior, eliminates allocations via escape analysis, devirtualizes virtual calls — all based on actual execution data[GraalVM Native Image Docs].
AOT (native image) in the same benchmark delivers ~22,000 req/s immediately, with no warm-up. The ceiling is lower because the AOT compiler must make conservative decisions. For services scaling in/out frequently (K8s HPA), the requests a JIT pod forgoes while warming up at reduced throughput may exceed what it later gains from its higher steady-state rate. If pod lifetime is under 15 minutes, AOT often serves more total requests than JIT in the same window.
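The trade-off can be checked with simple arithmetic. The sketch below integrates the warm-up curve quoted above for a single saturated pod and compares it with a flat ~22,000 req/s. One loud assumption: the text gives no rate for the 2–5 minute window, so the code holds the 30s–2min rate (20,000 req/s) until the 5-minute mark. The class and method names are illustrative.

```java
/** Cumulative requests served by one saturated pod: JIT warm-up curve vs flat AOT rate. */
final class WarmupBreakEven {

    // Phase end times (seconds) and rates (req/s) taken from the benchmark curve.
    // Assumption: the 20,000 req/s phase is extended to 300s because the
    // 2-5 minute window is not given in the text.
    static final long[] PHASE_END = {10, 30, 300, Long.MAX_VALUE};
    static final long[] JIT_RATE = {4_000, 12_000, 20_000, 28_000};
    static final long AOT_RATE = 22_000;

    static long jitRequests(long seconds) {
        long total = 0;
        long start = 0;
        for (int i = 0; i < PHASE_END.length && start < seconds; i++) {
            long end = Math.min(seconds, PHASE_END[i]);
            total += (end - start) * JIT_RATE[i];
            start = end;
        }
        return total;
    }

    static long aotRequests(long seconds) {
        return seconds * AOT_RATE;
    }

    /** First whole second at which the JIT pod has caught up, or -1 if it never does within the limit. */
    static long breakEvenSecond(long limitSeconds) {
        for (long t = 1; t <= limitSeconds; t++) {
            if (jitRequests(t) >= aotRequests(t)) {
                return t;
            }
        }
        return -1;
    }

    public static void main(String[] args) {
        for (long t : new long[] {30, 120, 300, 900}) {
            System.out.printf("t=%4ds  jit=%,d  aot=%,d%n", t, jitRequests(t), aotRequests(t));
        }
        System.out.println("break-even second: " + breakEvenSecond(3_600));
    }
}
```

With these inputs the native pod stays ahead for roughly the first seven and a half minutes (break-even at second 454). The crossover moves with the assumed curve and with how saturated the pod really is, so it is worth rerunning with measured rates before relying on any fixed pod-lifetime threshold.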
PGO (Profile-Guided Optimization) — available in Oracle GraalVM — narrows the gap by feeding real execution profiles back into the AOT compiler:
# Step 1: Build an instrumented binary that records execution profiles
native-image --pgo-instrument -jar app.jar -o app-instrumented
# Step 2: Run under realistic load to generate profiles
./app-instrumented &
k6 run load-test.js # Exercise all code paths — mimic production traffic
# Profiles written to default.iprof on exit
# Step 3: Build the optimized binary using the collected profiles
native-image --pgo=default.iprof -jar app.jar -o app-optimized
PGO gives the AOT compiler the same kind of runtime profile data that the JIT uses: hot methods, taken branches, type profiles at call sites [GraalVM Native Image Docs]. In our benchmarks, PGO recovered 30–50% of the JIT throughput gap, bringing native images to within 5–15% of peak JIT performance while keeping the instant startup. Trade-off: you need a representative workload for profiling. If your production traffic patterns differ significantly from the profiling run, the optimization may not help, or could regress performance on uncommon paths.
The Closed-World Problem: Reflection Hints in Practice
The biggest production trap is reflection [GraalVM Native Image Docs]. Here's a concrete example. A service uses Jackson to deserialize JSON from Kafka messages, which is standard stuff. The Jackson ObjectMapper uses reflection internally to discover fields and constructors on your model classes. In a standard JVM, this works fine because the JVM allows dynamic discovery at runtime. In a native image, GraalVM's static analysis happens at build time and cannot "see" that these classes will be instantiated via reflection, so the metadata needed to instantiate them reflectively is left out of the binary.
Runtime result: ClassNotFoundException: OrderEvent when the first message arrives, at 2 AM, in production.
Reflection is pervasive in Java frameworks. Jackson does it for JSON deserialization. Hibernate does it to discover entity fields and their mappings. Spring AOP uses bytecode generation (CGLIB) to create proxies. Spring Data repository interfaces are created via reflection. All of these "just work" on the JVM because reflection is allowed. On native image, they require explicit hints.
The fix: tell GraalVM what gets accessed via reflection. You have two main approaches:
import org.springframework.aot.hint.MemberCategory;
import org.springframework.aot.hint.RuntimeHints;
import org.springframework.aot.hint.RuntimeHintsRegistrar;
import org.springframework.aot.hint.annotation.RegisterReflectionForBinding;
import org.springframework.context.annotation.Configuration;
import org.springframework.context.annotation.ImportRuntimeHints;

// Each public class below lives in its own source file.

// Option 1: Annotation-based (cleanest for Spring)
@RegisterReflectionForBinding(OrderEvent.class)
@Configuration
public class KafkaConfig { }

// Option 2: RuntimeHintsRegistrar for third-party classes you don't own
@Configuration
@ImportRuntimeHints(MyRuntimeHints.class)
public class AppConfig { }

public class MyRuntimeHints implements RuntimeHintsRegistrar {
    @Override
    public void registerHints(RuntimeHints hints, ClassLoader classLoader) {
        // Register classes for reflection
        hints.reflection()
            .registerType(ThirdPartyDto.class,
                MemberCategory.INVOKE_DECLARED_CONSTRUCTORS,
                MemberCategory.DECLARED_FIELDS);
        // Register resource files
        hints.resources()
            .registerPattern("email-templates/*.html")
            .registerPattern("db/migration/*.sql");
    }
}
Spring Boot 3.x has done substantial work here. The Spring AOT (Ahead-Of-Time) processor auto-generates hints for most Spring-managed beans: @Component, @Service, @Repository, @Entity, @ConfigurationProperties. If you use only Spring beans and don't do anything exotic, you might get away with minimal hints.
The problem is everything around the Spring beans — third-party libraries, internal utility code that uses reflection, anything that was "working" on the JVM by relying on runtime class discovery. Every library upgrade changes reflection patterns. A minor version bump in Jackson might add new reflective access paths that you haven't registered. Budget time for metadata audits on every dependency upgrade — this is the hidden tax of native images.
Discovering Hints: The Tracing Agent
The GraalVM tracing agent is your primary tool for discovering what metadata your application needs:
# Attach the agent and run your app on the JVM
java -agentlib:native-image-agent=config-output-dir=src/main/resources/META-INF/native-image \
-jar target/app.jar
# Exercise ALL code paths while the agent is recording
# Run your full integration test suite
# Hit every endpoint
# Execute error paths
# The agent records every reflective access, proxy creation, resource load
The critical limitation: the tracing agent only records paths that are actually executed. If you miss an endpoint in your test run, its reflection needs won't be captured. For production safety, follow this procedure:
- Run the agent against your full integration test suite
- Run the agent again against manual exploratory testing (have a person click through the UI)
- Merge the results using native-image-agent=config-merge-dir=... to combine multiple runs
- Audit the generated JSON files for completeness
- Before writing custom metadata, check the GraalVM Reachability Metadata Repository — it has pre-built hints for hundreds of libraries (Jackson, Hibernate, Netty, Spring, etc.). If your library is there, the hints are automatically applied during native compilation.
The Production Build Pipeline
Building a native image requires the GraalVM native-image compiler and a lot of memory. A simple local setup works, but for CI we recommend Docker-based builds to avoid installing GraalVM on every CI runner.
Here's a production-grade Gradle setup for Spring Boot 3.x:
plugins {
    id("org.springframework.boot") version "3.4.2"
    id("org.graalvm.buildtools.native") version "0.10.4"
    kotlin("jvm") version "2.1.0"
    kotlin("plugin.spring") version "2.1.0"
}
dependencies {
    // No AOT-specific dependency is needed: the Spring Boot plugin configures
    // AOT processing once the native build tools plugin is applied.
    // Your application dependencies go here.
}
graalvmNative {
    binaries {
        named("main") {
            imageName.set("order-service")
            // Tell the compiler to initialize these at build time
            // to avoid runtime overhead
            buildArgs.add("--initialize-at-build-time=org.slf4j")
            buildArgs.add("--initialize-at-build-time=ch.qos.logback")
            // Useful for debugging
            buildArgs.add("-H:+ReportExceptionStackTraces")
            // Enforce strict checks
            buildArgs.add("--strict-image-heap")
        }
    }
    // Enable GraalVM's community metadata repository
    // This pulls pre-written hints for hundreds of libraries
    metadataRepository {
        enabled.set(true)
    }
}
Build locally with ./gradlew nativeCompile (requires GraalVM JDK installed), or via Docker for CI:
# Docker build — no GraalVM install needed on CI runner
./gradlew bootBuildImage --imageName=order-service:native
Docker builds use Spring's Buildpacks infrastructure and download GraalVM internally. Build times on our CI: 8–12 minutes per service. Not fast, but predictable and repeatable across environments, because the toolchain is pinned inside the builder image rather than installed on each runner. That repeatability is valuable for supply chain security.
Starting with 25.1 (first monthly release in June 2026), GraalVM moved to a monthly release train — explicitly to keep up with the AI-driven pace of development — while quarterly releases still fold in the latest JDK Critical Patch Update (reflected in the version's SECURITY digit, e.g. 25.1.3). The previous major (Oracle GraalVM 25.0) stays the stable train, receiving security and minor bug fixes. Practical impact: pin an explicit GraalVM version in your build image rather than tracking a floating tag, so a monthly bump never silently changes the compiler under a reproducible build.
For the final container, use a multi-stage Dockerfile to keep the image small:
# Stage 1: Build the native image
FROM ghcr.io/graalvm/native-image-community:21 AS builder
WORKDIR /app
COPY . .
# Build the native binary
# --no-daemon prevents gradle daemon from staying alive
# -x test skips tests during build (run them separately in CI)
RUN ./gradlew nativeCompile --no-daemon -x test
# Stage 2: Runtime image with just the binary
# Distroless images are tiny and have minimal attack surface
FROM gcr.io/distroless/base-debian12
WORKDIR /app
# Copy the native binary from builder
COPY --from=builder /app/build/native/nativeCompile/order-service /app/order-service
# No JVM, no package manager, no shell — just the binary
EXPOSE 8080
ENTRYPOINT ["/app/order-service"]
Image sizes across deployment strategies tell the story:
| Approach | Base Image | App Binary/JAR | Total Size |
|---|---|---|---|
| Fat JAR + JRE | Alpine + JRE (180MB) | 45MB | ~380MB |
| Jlink custom JRE | Distroless (20MB) | 80MB | ~145MB |
| Native + distroless | Distroless (20MB) | 48MB | ~68MB |
Consider a cluster pulling 1,000 pods of 380MB images: that's 380GB of bandwidth. The same pods as native images pull 68GB, about 5.6 times less data. This matters for deployment speed, node startup time, and bandwidth costs. The distroless base (no shell, no package manager) also reduces the attack surface for container security.
Real Production Numbers After Migrating Stateless Services
Across the stateless Spring Boot microservices we've migrated to native image (REST + Kafka workers, no JPA), the typical before/after looks like this:
| Metric | Before (JVM) | After (Native) | Improvement |
|---|---|---|---|
| Startup time | 4.2–11.3s | 48–120ms | 40–100× faster |
| Memory RSS | 380–520MB | 85–140MB | 60–70% reduction |
| Image size | 320–420MB | 55–95MB | 70–80% reduction |
| Peak throughput | ~28k req/s (after warmup) | ~22k req/s (immediate) | ~20% lower |
| K8s initialDelaySeconds | 15–30 | 1 | 15–30× faster readiness |
| HPA scale-up time | 2–3 min | <30 sec | 4–6× faster scaling |
The throughput regression (~20% lower at steady state) is real, but for frequent scale-in/out workloads, the calculation flips. During JVM warm-up, pods serve requests at reduced throughput. If they are scaled back in before warm-up completes, those instances never reach peak throughput. Over a deployment window, native images can then serve more total requests. [GraalVM Native Image Docs]
Memory reduction from ~450MB to ~110MB RSS allows significantly more instances per node. Infrastructure cost drops. Rolling deployments compress from multi-minute windows to under 30 seconds.
The Production Gotchas
Dynamic Proxies: Internal libraries that generate dynamic proxies for service interfaces (similar to Spring AOP) completely break under native image because dynamic proxy generation at runtime is incompatible with the closed-world model — the compiler cannot know which interfaces will be proxied at build time. Solution: switch to compile-time proxy generation using an annotation processor. The work is painful but you only do it once.
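For readers who have not looked inside such a library, this is the runtime mechanism in question, shown as a minimal JDK-only sketch. `PriceService` and `counting` are hypothetical names standing in for an internal library's API. The proxy class is generated inside `Proxy.newProxyInstance` while the program runs, so the set of proxied interfaces has to be known to the native build some other way.

```java
import java.lang.reflect.InvocationHandler;
import java.lang.reflect.Proxy;

final class ProxyDemo {

    // Hypothetical service interface, standing in for an internal library's API.
    interface PriceService {
        int priceInCents(String sku);
    }

    /** Wraps a target in a JDK dynamic proxy that counts calls. */
    static PriceService counting(PriceService target, int[] counter) {
        InvocationHandler handler = (proxy, method, args) -> {
            counter[0]++;
            return method.invoke(target, args);
        };
        // The proxy class for {PriceService} is generated here, at runtime.
        return (PriceService) Proxy.newProxyInstance(
                PriceService.class.getClassLoader(),
                new Class<?>[] {PriceService.class},
                handler);
    }

    public static void main(String[] args) {
        int[] calls = {0};
        PriceService service = counting(sku -> sku.length() * 100, calls);
        System.out.println(service.priceInCents("abc") + " cents, calls=" + calls[0]);
    }
}
```

In this sketch the interface array is a literal, which is the easy case. Libraries that assemble the interface list from configuration or classpath scanning are the ones that tend to need explicit proxy metadata or a move to compile-time generation.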
Logback Configuration: Logback uses XML parsing and reflection to load configuration files. Your application compiles successfully but then crashes at runtime because Logback cannot find logback-spring.xml. Requires explicit hints via RuntimeHintsRegistrar to register resource patterns. This is one-time overhead if you get it right during the tracing agent phase.
Hibernate/JPA: The most blocking production issue. Services that use Spring Data JPA heavily require significant effort; expect weeks of work for complex entity graphs. Hibernate uses aggressive reflection to discover entity fields and bytecode enhancement for lazy loading. Requires spring.jpa.properties.hibernate.bytecode.provider=none and individual entity classes annotated with @RegisterReflectionForBinding or registered via hints. For services with complex JPA usage, honestly evaluate whether you need JPA at all. Switching to a lighter data-access layer (jOOQ, JDBI, or Spring's JdbcClient) makes native compilation dramatically simpler and faster.
Flyway Java Migrations: Flyway SQL migrations work fine with native image. But Flyway Java-based migrations (implementing BaseJavaMigration) need reflection hints for each migration class. If you have dozens of them, expect tedious annotation work. Conversion path: switch all future migrations to SQL, and add blanket hints for existing Java migrations via RuntimeHintsRegistrar.
When NOT to Use Native Images
Be honest about whether your service fits the native image profile. Forcing native images on services that don't benefit wastes time on metadata maintenance without the payoff.
Long-running batch processors (6+ hours): A nightly ETL job that runs for 6 hours gets enormous benefit from JIT warm-up. After 5 minutes of execution, the JVM's JIT throughput is 10–25% higher than AOT. Over 6 hours, that compounds to processing millions more records. A native image saves you 4 seconds of startup time. That's irrelevant for a 6-hour batch job. Stick with the JVM. [GraalVM Native Image Docs]
Heavy reflection frameworks: If your service deeply uses Hibernate with complex entity graphs and lazy loading, AspectJ load-time weaving, or runtime bytecode generation via CGLIB or Javassist, the metadata maintenance burden will exceed any operational savings. Every library upgrade becomes a potential native build breakage. You'll find yourself writing reflection hints for code you didn't write and don't fully understand. For a small team, this overhead is unjustifiable.
Rapid development iteration: Native compilation takes 8–15 minutes locally. During active development, this destroys your feedback loop. You make a change, rebuild (8 min), test (2 min), change again (8 min). Compare that to: change, ./gradlew bootRun (10 sec), test (2 min). The build step alone is roughly 50x faster on the JVM (10 seconds versus 8 minutes). Use JVM during development. Reserve native compilation for CI/staging/production only. Never run nativeCompile as part of your local dev cycle; configure your IDE to run JVM mode locally. [GraalVM Native Image Docs]
Plugin architectures: If your service loads code dynamically at runtime (OSGi, custom classloaders, Service Provider Interface implementations discovered from JARs added after the build), native images fundamentally cannot support this pattern. The closed-world assumption is absolute. There's no configuration option to relax it. The entire model depends on knowing what code exists at build time. If code is only discovered at runtime, native images will never work.
The trade-offs across all of these dimensions collapse into a single routing decision. Walk a candidate service through the flowchart below before committing engineering time to a migration — the wrong call here costs weeks of metadata work for negligible operational gain.
flowchart TD
Start([New service candidate]) --> Plugin{Loads code dynamically<br/>at runtime?<br/>OSGi / custom classloaders / SPI}
Plugin -->|Yes| StayJVM[Stay on JVM<br/>Closed-world incompatible]
Plugin -->|No| Lifetime{Pod lifetime<br/>under 15 min?}
Lifetime -->|No, runs hours/days| Batch{Batch / long-running<br/>throughput-critical?}
Batch -->|Yes| StayJVM2[Stay on JVM<br/>JIT warm-up wins long-term]
Batch -->|No, steady REST traffic| CDS[Consider CDS<br/>2–3s startup, full JVM tooling]
Lifetime -->|Yes, scales in/out| SLA{Strict startup SLA?<br/>K8s readiness less than 2s}
SLA -->|No| CDS
SLA -->|Yes| Reflection{Heavy reflection?<br/>complex JPA / AspectJ /<br/>runtime bytecode gen}
Reflection -->|Yes, deep JPA graph| Refactor{Can refactor to<br/>JdbcClient / jOOQ?}
Refactor -->|No| StayJVM3[Stay on JVM<br/>Metadata burden too high]
Refactor -->|Yes| Native
Reflection -->|No, mostly Spring beans| Native([Migrate to Native Image<br/>40-120ms startup, 60-70% memory cut])
style Native fill:#22c55e,stroke:#16a34a,color:#fff
style StayJVM fill:#ef4444,stroke:#dc2626,color:#fff
style StayJVM2 fill:#ef4444,stroke:#dc2626,color:#fff
style StayJVM3 fill:#ef4444,stroke:#dc2626,color:#fff
style CDS fill:#f59e0b,stroke:#d97706,color:#fff
The two terminal cases on the right (Native Image, CDS) represent the bulk of stateless Spring Boot services we've migrated. The "stay on JVM" branches typically catch 20–30% of any service portfolio — usually batch processors, JPA-heavy domain services, and anything with custom classloader logic inherited from older codebases.
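The same routing logic can be written down as a small function, which is handy for a portfolio review script. This is a sketch that mirrors the flowchart's questions one to one; the record fields and enum names are illustrative, not part of any tool.

```java
final class NativeImageDecision {

    enum Recommendation { STAY_ON_JVM, CONSIDER_CDS, MIGRATE_TO_NATIVE }

    /** Inputs mirror the questions in the flowchart; the record itself is illustrative. */
    record ServiceProfile(
            boolean loadsCodeAtRuntime,      // OSGi, custom classloaders, runtime SPI
            boolean podLifetimeUnder15Min,
            boolean throughputCriticalBatch,
            boolean strictStartupSla,        // K8s readiness under 2 seconds
            boolean heavyReflection,         // complex JPA, AspectJ, runtime bytecode gen
            boolean canRefactorDataAccess) { // e.g. to JdbcClient or jOOQ
    }

    static Recommendation recommend(ServiceProfile s) {
        if (s.loadsCodeAtRuntime()) {
            return Recommendation.STAY_ON_JVM;
        }
        if (!s.podLifetimeUnder15Min()) {
            return s.throughputCriticalBatch()
                    ? Recommendation.STAY_ON_JVM
                    : Recommendation.CONSIDER_CDS;
        }
        if (!s.strictStartupSla()) {
            return Recommendation.CONSIDER_CDS;
        }
        if (s.heavyReflection() && !s.canRefactorDataAccess()) {
            return Recommendation.STAY_ON_JVM;
        }
        return Recommendation.MIGRATE_TO_NATIVE;
    }

    public static void main(String[] args) {
        var restWorker = new ServiceProfile(false, true, false, true, false, false);
        var nightlyEtl = new ServiceProfile(false, false, true, false, true, false);
        System.out.println("REST worker: " + recommend(restWorker));
        System.out.println("Nightly ETL: " + recommend(nightlyEtl));
    }
}
```

The order of the checks matters: the dynamic-loading question comes first because it is a hard blocker, while the reflection question comes last because it can sometimes be refactored away.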
Kubernetes Integration: Where Native Images Shine
The real payoff comes in Kubernetes. The combination of instant startup and low memory usage unlocks deployment patterns that are impossible with JVM images.
With instant startup, you can use aggressive probe timings that were previously unthinkable:
apiVersion: apps/v1
kind: Deployment
metadata:
  name: order-service
spec:
  replicas: 3
  # selector and pod labels are required by apps/v1; the label name is an assumption
  selector:
    matchLabels:
      app: order-service
  template:
    metadata:
      labels:
        app: order-service
    spec:
      containers:
        - name: order-service
          image: registry.example.com/order-service:native
          resources:
            requests:
              memory: "128Mi"
              cpu: "100m"
            limits:
              memory: "256Mi"
              cpu: "500m"
          readinessProbe:
            httpGet:
              path: /actuator/health/readiness
              port: 8080
            # These timings are possible with native images
            # Were 15–30 seconds with JVM images
            initialDelaySeconds: 1
            periodSeconds: 2
            failureThreshold: 3
          livenessProbe:
            httpGet:
              path: /actuator/health/liveness
              port: 8080
            initialDelaySeconds: 2
            periodSeconds: 10
Memory requests are cut from 512Mi to 128Mi. The initialDelaySeconds on the readiness probe goes from 15–30 to 1–2. This has cascading effects.
Impact on rolling deployments: With JVM images taking 15+ seconds to become ready, rolling deployments require a maintenance window. You have to coordinate: drain old pods, wait for new pods to warm up, then route traffic. With native images, rolling deployments can happen during business hours without performance impact. New pods are ready in 1 second. You can update all replicas in sequence without dropping requests.
Impact on HPA (Horizontal Pod Autoscaling): When a traffic spike triggers scale-up (e.g., 3 → 12 pods during a sale event), new pods must start serving traffic immediately or requests queue up. With JVM images, scale-up took 2–3 minutes as new pods warmed up. With native images, it's under 30 seconds. For a payment processing service, this is the difference between handling the spike gracefully and dropping requests or timing out.
Impact on node consolidation: With memory usage cut by 60–70%, you run significantly more pods per node. In our experience consolidating a stateless-service estate, going native typically allowed roughly half the nodes for the same replica count, with better density and redundancy. Infrastructure cost drop is direct and measurable.
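A back-of-the-envelope calculation shows why the node count does not simply shrink by the same factor as memory. The sketch below uses the memory requests mentioned above (512Mi on the JVM, 128Mi native) and the 100m CPU request from the manifest; the node size (16 GiB, 8 vCPU allocatable) and the replica count of 160 are hypothetical.

```java
final class NodeDensity {

    /** Pods that fit on one node, limited by whichever resource runs out first. */
    static long podsPerNode(long nodeMemMi, long nodeCpuMilli, long podMemMi, long podCpuMilli) {
        return Math.min(nodeMemMi / podMemMi, nodeCpuMilli / podCpuMilli);
    }

    /** Nodes needed for a replica count (ceiling division). */
    static long nodesNeeded(long replicas, long podsPerNode) {
        return (replicas + podsPerNode - 1) / podsPerNode;
    }

    public static void main(String[] args) {
        // Hypothetical node: 16 GiB memory and 8 vCPU allocatable.
        long nodeMemMi = 16 * 1024;
        long nodeCpuMilli = 8_000;
        long replicas = 160; // hypothetical estate size

        long jvmPods = podsPerNode(nodeMemMi, nodeCpuMilli, 512, 100);
        long nativePods = podsPerNode(nodeMemMi, nodeCpuMilli, 128, 100);

        System.out.println("JVM:    " + jvmPods + " pods/node, "
                + nodesNeeded(replicas, jvmPods) + " nodes");
        System.out.println("Native: " + nativePods + " pods/node, "
                + nodesNeeded(replicas, nativePods) + " nodes");
    }
}
```

With these inputs the JVM pods are memory-bound at 32 per node, while the native pods become CPU-bound at 80 per node instead of the 128 that memory alone would allow. Once memory stops being the bottleneck, CPU requests and scheduling headroom set the ceiling, which is one reason observed consolidation is smaller than the raw memory reduction suggests.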
Production Debugging Without the JVM
The JVM ecosystem has decades of mature debugging tooling that does not exist in the native world. Before you migrate, understand the trade-offs:
| Tool | Purpose | Native Alternative |
|---|---|---|
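| jstack | Thread dumps | kill -3 <pid> (binary built with --enable-monitoring=threaddump) |
| jmap / jcmd heap dump | Memory analysis | /proc/<pid>/smaps, or a signal-triggered dump when built with --enable-monitoring=heapdump (no live jmap attach) |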
| Arthas / BTrace | Live attach, method tracing | None |
| JFR | Production profiling | Partial support via --enable-monitoring=jfr |
| VisualVM / JMC | GUI profiling | Platform profilers (perf, async-profiler) |
You lose the ability to attach tools at runtime and introspect the heap. This is a real limitation. We mitigated by building debug builds (with -g flag) for staging and production troubleshooting, and by instrumenting aggressively with Micrometer.
Mitigation strategy: instrument at the application level. Micrometer works identically in JVM and native mode. Set up alerts on these metrics:
- jvm_memory_used_bytes{area="heap"} at 80% of max (native images have fixed max heap, no dynamic expansion)
- jvm_gc_pause_seconds_max at 200ms (serial GC) or 50ms (G1 GC): serial is stop-the-world; G1 has shorter pauses
- process_resident_memory_bytes at 80% of K8s memory limit (OOMKilled with no heap dump is painful to debug)
- http_server_requests_seconds{quantile="0.99"} set to your SLO, which catches throughput regression vs JVM baseline
For thread dumps, enable signal-based inspection at build time:
graalvmNative {
    binaries {
        named("main") {
            buildArgs.add("-g") // Include debug symbols
            // Enable signal-based thread dumps; replaces the deprecated -H:+AllowVMInspection
            buildArgs.add("--enable-monitoring=threaddump")
            buildArgs.add("--enable-monitoring=jfr") // Optional: enable JFR recording
        }
    }
}
Then get a thread dump: kill -3 $(pgrep order-service). The output goes to stderr.
For JFR (Java Flight Recorder), build with --enable-monitoring=jfr and start recording at runtime: ./order-service -XX:StartFlightRecording=filename=recording.jfr,duration=60s. JFR support in native images is partial — you get GC events, thread events, allocation tracking — but not class loading or JIT compilation events (those don't apply to AOT).
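Application-defined events are a useful complement to the built-in ones, since they carry business context that thread and GC events lack. The sketch below defines a custom event with the standard `jdk.jfr` API. The event name, fields, and the `process` helper are hypothetical, and it assumes custom events are recorded by a binary built with --enable-monitoring=jfr in the same way they are on the JVM.

```java
import jdk.jfr.Category;
import jdk.jfr.Event;
import jdk.jfr.Label;
import jdk.jfr.Name;

final class OrderJfr {

    // Hypothetical event name and fields, chosen for illustration only.
    @Name("com.example.OrderProcessed")
    @Label("Order Processed")
    @Category("Application")
    static class OrderProcessedEvent extends Event {
        @Label("Order ID")
        String orderId;

        @Label("Item Count")
        int itemCount;
    }

    /** Times a unit of work and commits an event if a recording is active. */
    static OrderProcessedEvent process(String orderId, int itemCount, Runnable work) {
        OrderProcessedEvent event = new OrderProcessedEvent();
        event.begin();
        work.run();
        event.orderId = orderId;
        event.itemCount = itemCount;
        event.commit(); // does nothing when no recording is running
        return event;
    }

    public static void main(String[] args) {
        OrderProcessedEvent e = process("order-42", 3, () -> { });
        System.out.println("committed event for " + e.orderId);
    }
}
```

Started with the -XX:StartFlightRecording option shown above, events like this one would be expected to appear in the recording alongside the GC and thread events, viewable in JMC under the category given in the annotation.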
Production Patterns: What Works
Pattern 1: Spring Boot 3 Auto-Hints — Spring AOT auto-generates hints for @Component, @Service, @Entity, @ConfigurationProperties. Stick to Spring conventions; manual hints needed only for third-party DTOs and exotic reflection.
Pattern 2: Virtual Threads + Native Image — Virtual threads[JEP 444, 2023] work in native images. Use Executors.newVirtualThreadPerTaskExecutor() for both instant startup and high concurrency without thread pool exhaustion. In our microbenchmarks, sustained throughput exceeds the ~22k baseline because virtual threads reduce context switch overhead on I/O-bound paths.
Pattern 3: Separate Native Test Stage in CI — Run native tests in a separate gate, not on every commit. Fast feedback on JVM (5-10 sec), reserve native tests for pre-deployment checks. This tests the actual production binary without destroying dev iteration speed.
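For Pattern 2, the executor usage looks like the following minimal sketch, which needs Java 21 or later. `fanOut` is an illustrative helper, and the `Thread.sleep` call stands in for a blocking I/O call such as an HTTP request or a Kafka poll.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;

final class VirtualThreadFanOut {

    /** Runs `tasks` blocking calls concurrently, one virtual thread each, and sums the results. */
    static int fanOut(int tasks, long blockMillis) {
        // try-with-resources: close() waits for submitted tasks to finish.
        try (ExecutorService executor = Executors.newVirtualThreadPerTaskExecutor()) {
            List<Future<Integer>> futures = new ArrayList<>();
            for (int i = 0; i < tasks; i++) {
                futures.add(executor.submit(() -> {
                    Thread.sleep(blockMillis); // stand-in for a blocking I/O call
                    return 1;
                }));
            }
            int sum = 0;
            for (Future<Integer> f : futures) {
                sum += f.get();
            }
            return sum;
        } catch (InterruptedException | ExecutionException e) {
            throw new IllegalStateException("fan-out failed", e);
        }
    }

    public static void main(String[] args) {
        System.out.println("completed tasks: " + fanOut(1_000, 50));
    }
}
```

Each task gets its own virtual thread, so a thousand concurrent blocking calls do not require a thousand platform threads or a tuned pool size. Whether this lifts throughput in a given service depends on how I/O-bound the hot path is.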
Migration Checklist
Budget 3–5 days per service for the first migration, and 1–2 days for later ones once the team has built up experience.
Phase 1: Assessment
- Audit dependencies at GraalVM reachability metadata repo
- Identify reflection-heavy libs; evaluate replacements (JPA → JdbcClient, dynamic proxies → compile-time processor)
- Verify GraalVM JDK matches target Java version (17 or 21)
Phase 2: Build
- Add the org.graalvm.buildtools.native plugin; enable metadata repository
- Run tracing agent against full integration test suite
- Achieve successful nativeCompile locally; write smoke test
Phase 3: Validate
- Run integration tests against native binary (nativeTest)
- Load test: compare throughput, latency (P50/P95/P99), memory vs JVM baseline
- Verify logging, actuator endpoints, graceful shutdown
Phase 4: Deploy
- Create multi-stage Dockerfile with distroless base
- Reduce K8s memory requests by 60–70%; tighten probe timings
- Canary to staging, then production (10% → 50% → 100%)
Production Checklist
- Reflection hints exhaustively documented
- Tracing agent run against full test suite + manual testing
- Native tests passing in CI
- K8s initialDelaySeconds reduced to 1–2
- Memory requests reduced by 60–70%
- Thread dump extraction documented (kill -3 <pid>)
- JFR recording setup verified
- Micrometer alerts configured for heap, GC, RSS
- Canary deployment plan written
Is It Worth It?
Native images shift costs: higher CI build time (8–15 min) for lower memory and faster autoscaling. For a team running ~20 services with several deployments per day, the extra CI compute is a real line item — but memory savings and infrastructure consolidation typically recover it in our experience.
Use native images if you scale frequently (K8s HPA, serverless), need aggressive probe timings, or memory costs compound at scale. Skip them if you rely on heavy JPA, third-party reflection libraries, long-running batch jobs, or rapid dev iteration. The JVM is dramatically faster for local iteration — seconds versus the 8–15 minutes a native rebuild takes.
For stateless microservices: memory reduction (~450MB → ~110MB RSS) allows consolidation onto fewer nodes, cutting infrastructure cost. Rolling deployments compress from multi-minute windows to under 30 seconds. The payoff compounds daily.
Frequently Asked Questions
What is the GraalVM closed-world assumption?
GraalVM native image performs static analysis at build time and includes only the code it can prove is reachable. Reflection, dynamic class loading, and JNI are not visible to static analysis, so they must be declared explicitly in configuration files or the code that uses them will fail at runtime.
How much faster is GraalVM native image startup vs JVM?
Native images typically start in 40-120ms compared to 4-11 seconds for a standard Spring Boot JVM application. This makes them ideal for Kubernetes environments with strict readiness probe SLAs and frequent autoscaling events.
Is GraalVM native image throughput lower than JVM?
Yes, typically 10-25% lower at steady state because AOT compilation makes conservative optimizations without runtime profiling data. Profile-Guided Optimization (PGO) in Oracle GraalVM can recover 30-50% of this gap, bringing native images within 5-15% of peak JIT performance. [GraalVM Native Image Docs]
When should I use GraalVM native image vs a regular JVM?
Use native images for services with short pod lifetimes (under 15 minutes), strict startup SLAs, or memory-constrained environments. Use the JVM for long-running services where peak throughput matters, services that rely heavily on reflection or dynamic class loading, or when build time (native image builds take 8-15 minutes) is a constraint.
Keep Reading
- Spring Boot REST: JPA, Validation, Exception Handling, and Testing — The Spring Boot patterns that work seamlessly with native compilation
- Java Virtual Threads: Project Loom, Pinning Hazards, and Production Migration — Combine native images with virtual threads for instant startup and high concurrency
- Go vs Java in 2026: An Honest Performance Comparison for Backend Services — How GraalVM native images change the startup and memory comparison with Go
Engineering Team
A multidisciplinary team of backend engineers, architects, and DevOps practitioners shipping deep dives into distributed systems and production infrastructure.