Spring AI in Production: RAG Pipelines, Reliability, and Observability for Java Backends
Key Takeaways
- The classic LLM-hallucinated-policy failure mode is the model confidently inventing company policy from training-set fragments. The fix is circuit breakers, PII scrubbing, and retrieval-score gating, not removing the AI
- Spring AI 1.1 brings a production-grade Java API for RAG pipelines with pgvector, Micrometer observability, and token cost tracking
- Every AI response is traceable to the specific knowledge base chunks that informed it, which is critical for compliance and quality review
- Gate LLM answers on retrieval similarity score (≥0.75), not model confidence: the model is trained to sound confident even when hallucinating, whereas the retrieval score reflects knowledge base coverage
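- Scrub PII (email, phone, card digits, SSN) before any external LLM call. The scrubber in this article is a small custom service that runs ahead of the ChatClient call; unscrubbed input leaves your trust boundary and falls under the provider's retention and training policies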
The classic LLM-hallucinated-policy failure pattern. A team integrates an LLM into their customer support flow. Within weeks, ticket volume jumps — not because the AI is broken, but because it's confidently fabricating refund and warranty policies. Customers receive refunds the AI promised "per standard policy" when no such policy exists. The LLM hallucinated coherent-sounding text from training-set fragments, and the application had no guardrails to gate the response. This is OWASP LLM06 (excessive agency) and LLM02 (sensitive information disclosure)[OWASP LLM Top 10] firing in production.
The fix isn't removing the AI — it's wrapping it in the same production engineering patterns you'd apply to any unreliable downstream service: circuit breakers, output validation, PII scrubbing, bounded caches, and observability that tracks token costs before they become a surprise on the cloud bill.
Spring AI 1.1 is production-ready for RAG-powered support chatbots. The critical pattern: scrub PII before calling the LLM, validate retrieval similarity scores instead of asking the model to rate itself, use circuit breakers with fallback chains for API outages, and instrument token budgets[Prometheus Best Practices] to prevent cost surprises.
- Scrub PII (email, phone, card digits) from user input before any external LLM call
- Gate LLM answers on retrieval similarity score (≥0.75), not model confidence
- Implement 3-tier fallback: GPT-4 → cheaper model → static FAQ
- Track ai.tokens.* and ai.escalations.* metrics per customer
graph TD
User[User question] --> PII[PII scrubber:<br/>strip email / phone / card / SSN]
PII --> Retr[Vector retrieval<br/>pgvector similarity]
Retr --> Gate{similarity ≥ 0.75?}
Gate -->|No, low confidence| Fallback[Static FAQ<br/>or 'I don't know']
Gate -->|Yes| LLM[LLM call<br/>w/ retrieved context]
LLM --> Out[Output validator:<br/>schema + policy check]
Out --> Resp[Response to user]
CB[Circuit breaker<br/>per provider] -.->|opens on failure| LLM
CB -.->|fall through to| Fallback
Cost[Cost meter<br/>per tenant] -.->|hard cap| LLM
style PII fill:#fee
style Gate fill:#eef
style Out fill:#fee
style CB fill:#fee
style Cost fill:#fee
The diagram is the production discipline: every LLM call is gated front (PII + retrieval-score) and back (output validator), with circuit breaker and cost meter as kill switches outside the call path. Most "Spring AI hello world" tutorials skip every red box on this diagram. Production is the red boxes.
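The ordering in the diagram can be sketched without any framework. The block below is a minimal illustration, not the article's implementation: GatedPipeline and its four collaborators are hypothetical names, the LLM is a plain function, and a thrown exception stands in for an open circuit breaker.

```java
import java.util.function.Predicate;
import java.util.function.ToDoubleFunction;
import java.util.function.UnaryOperator;

/** Minimal sketch of the gate ordering; every name here is illustrative. */
final class GatedPipeline {
    static final double SIMILARITY_THRESHOLD = 0.75;
    static final String FALLBACK = "I don't know. Routing you to a human agent.";

    private final UnaryOperator<String> scrubber;          // front gate 1: PII scrubbing
    private final ToDoubleFunction<String> topSimilarity;  // front gate 2: best retrieval score
    private final UnaryOperator<String> llm;               // the expensive, unreliable call
    private final Predicate<String> outputValidator;       // back gate: schema and policy check

    GatedPipeline(UnaryOperator<String> scrubber, ToDoubleFunction<String> topSimilarity,
                  UnaryOperator<String> llm, Predicate<String> outputValidator) {
        this.scrubber = scrubber;
        this.topSimilarity = topSimilarity;
        this.llm = llm;
        this.outputValidator = outputValidator;
    }

    String answer(String rawQuestion) {
        String question = scrubber.apply(rawQuestion);
        if (topSimilarity.applyAsDouble(question) < SIMILARITY_THRESHOLD) {
            return FALLBACK; // weak retrieval: the LLM is never called
        }
        String draft;
        try {
            draft = llm.apply(question);
        } catch (RuntimeException providerFailure) {
            return FALLBACK; // stands in for the circuit-breaker path
        }
        return outputValidator.test(draft) ? draft : FALLBACK;
    }
}
```

The point of the shape is that the model call sits in the middle of the method, so no code path can reach it without passing both front gates, and no draft can leave without passing the back gate.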
Spring AI 1.x Capabilities Matrix
| Capability | Spring AI | Code Example | Trade-off |
|---|---|---|---|
| ChatClient | Unified API to OpenAI, Anthropic, Ollama, Bedrock | chatClient.prompt().user(...).call().content() | Portable across providers; minimal per-provider tuning |
| Embeddings | Vector generation for semantic search | embeddingModel.embed(text) → float[] | ~$0.02 per 1M tokens (text-embedding-3-small); benefit from caching |
| Vector Store (pgvector) | Similarity search on 1M+ vectors at ~2ms | vectorStore.similaritySearch(query) | Requires PostgreSQL extension; index tuning needed for scale |
| Tool Calling | Let LLM invoke typed Java methods | @Tool annotation (1.0 GA) or legacy Function callback | Cost: LLM must decide when to call; must validate auth inside function |
| RAG Retrieval | Combine LLM with external context | Chunking strategy + similarity scoring | LLM hallucinates less; knowledge base staleness is new risk |
Project Setup
<dependencyManagement>
<dependencies>
<dependency>
<groupId>org.springframework.ai</groupId>
<artifactId>spring-ai-bom</artifactId>
<version>1.1.7</version>
<type>pom</type>
<scope>import</scope>
</dependency>
</dependencies>
</dependencyManagement>
<dependencies>
<dependency>
<groupId>org.springframework.ai</groupId>
<artifactId>spring-ai-starter-model-openai</artifactId>
</dependency>
<dependency>
<groupId>org.springframework.ai</groupId>
<artifactId>spring-ai-starter-vector-store-pgvector</artifactId>
</dependency>
<!-- Resilience4j for circuit breakers and rate limiting -->
<dependency>
<groupId>io.github.resilience4j</groupId>
<artifactId>resilience4j-spring-boot3</artifactId>
</dependency>
<!-- Caffeine for bounded in-memory caching -->
<dependency>
<groupId>com.github.ben-manes.caffeine</groupId>
<artifactId>caffeine</artifactId>
</dependency>
<!-- Micrometer for observability -->
<dependency>
<groupId>io.micrometer</groupId>
<artifactId>micrometer-registry-prometheus</artifactId>
</dependency>
</dependencies>
spring:
  ai:
    openai:
      api-key: ${OPENAI_API_KEY}
      chat:
        options:
          model: gpt-4o
          temperature: 0.3 # lower temperature = more deterministic support answers
          max-tokens: 500
      embedding:
        options:
          model: text-embedding-3-small
    vectorstore:
      pgvector:
        initialize-schema: false # manage schema with Flyway, not Spring AI
        dimensions: 1536
        distance-type: COSINE_DISTANCE
        index-type: HNSW
resilience4j:
  circuitbreaker:
    instances:
      openai:
        sliding-window-size: 20
        failure-rate-threshold: 50
        wait-duration-in-open-state: 30s
        permitted-number-of-calls-in-half-open-state: 5
  ratelimiter:
    instances:
      ai-chat:
        limit-for-period: 30 # shared by every caller of this instance; a per-customer limit needs one limiter per customer key
        limit-refresh-period: 60s
        timeout-duration: 0s
Use initialize-schema: false in production and manage the pgvector schema with your migration tool (Flyway or Liquibase). Spring AI's auto-schema creation is convenient for local development but gives you no control over index tuning, vacuum scheduling, or rollback.
PII Scrubbing and Prompt Injection Defense
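Scrub email, phone, card digits, and SSN patterns before any external LLM call. Then validate input for injection patterns and output for leaked instructions. [OWASP LLM Top 10]
// PiiScrubber.java
import java.util.regex.Pattern;
import org.springframework.stereotype.Service;
@Service
public class PiiScrubber {
    private static final Pattern EMAIL = Pattern.compile("[a-zA-Z0-9+_.-]+@[a-zA-Z0-9.-]+\\.[a-zA-Z]{2,}");
    private static final Pattern PHONE = Pattern.compile("\\b(\\+?1[-. ]?)?\\(?\\d{3}\\)?[-. ]?\\d{3}[-. ]?\\d{4}\\b");
    // 13 to 16 digits with optional single separators between them; no trailing separator is consumed
    private static final Pattern CARD = Pattern.compile("\\b\\d(?:[ -]?\\d){12,15}\\b");
    // US SSN in the common 123-45-6789 form; unformatted nine-digit runs are not caught
    private static final Pattern SSN = Pattern.compile("\\b\\d{3}-\\d{2}-\\d{4}\\b");
    // Regex heuristics only: expect both false positives and misses
    public String scrub(String text) {
        // Longest digit pattern first, so shorter patterns never see digits that belong to a card number
        return CARD.matcher(text).replaceAll("[CARD]")
            .transform(s -> SSN.matcher(s).replaceAll("[SSN]"))
            .transform(s -> PHONE.matcher(s).replaceAll("[PHONE]"))
            .transform(s -> EMAIL.matcher(s).replaceAll("[EMAIL]"));
    }
}
// PromptGuard.java
import java.util.List;
import java.util.regex.Pattern;
import org.springframework.stereotype.Service;
@Service
public class PromptGuard {
    private static final List<Pattern> INJECTION_PATTERNS = List.of(
        Pattern.compile("(?i)ignore\\s+(all\\s+)?previous\\s+instructions"),
        Pattern.compile("(?i)you\\s+are\\s+now\\s+a"),
        Pattern.compile("(?i)reveal\\s+(your|the)\\s+(instructions|prompt)")
    );
    public boolean isInjectionAttempt(String input) {
        return INJECTION_PATTERNS.stream().anyMatch(p -> p.matcher(input).find());
    }
    // These markers are assumed to be section labels from your own system prompt;
    // seeing them in a completion suggests the prompt is being echoed back
    public boolean isOutputSafe(String output) {
        return !output.contains("Knowledge Base:") && !output.contains("Guidelines:");
    }
}
For regulated industries, add a dedicated PII detection service and review LLM provider data processing agreements before sending customer data.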
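A 13-to-16-digit regex will also redact order numbers and tracking IDs. One common refinement, which the article itself does not use, is to confirm a candidate with the Luhn checksum before treating it as a card number. The sketch below assumes that approach; LuhnCheck is a hypothetical helper name.

```java
/** Luhn checksum, used here only to decide whether a digit run is plausibly a card number. */
final class LuhnCheck {
    static boolean isPlausibleCard(String candidate) {
        String digits = candidate.replaceAll("[ -]", ""); // drop the separators the regex allows
        if (!digits.matches("\\d{13,16}")) {
            return false;
        }
        int sum = 0;
        boolean doubleIt = false; // the rightmost digit is not doubled
        for (int i = digits.length() - 1; i >= 0; i--) {
            int d = digits.charAt(i) - '0';
            if (doubleIt) {
                d *= 2;
                if (d > 9) {
                    d -= 9; // same as summing the two digits of the product
                }
            }
            sum += d;
            doubleIt = !doubleIt;
        }
        return sum % 10 == 0;
    }
}
```

Whether to redact on a failed checksum is a policy choice: skipping the redaction keeps more context for the model, while redacting anyway is the safer default when a leaked card number costs more than a mangled order ID.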
Caches, Budgets, and Fallbacks
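Use Caffeine with explicit bounds (never an unbounded ConcurrentHashMap). Enforce per-customer token budgets to prevent cost surprises. Implement a 3-tier fallback: GPT-4 → cheaper model → static FAQ. [Redis Docs]
// Imports used below: com.github.benmanes.caffeine.cache.{Cache, Caffeine}, java.time.Duration,
// java.util.concurrent.atomic.AtomicLong, org.slf4j.{Logger, LoggerFactory},
// io.github.resilience4j.circuitbreaker.annotation.CircuitBreaker
@Bean
public Cache<String, float[]> embeddingCache() {
    return Caffeine.newBuilder()
        .maximumSize(50_000)
        .expireAfterAccess(Duration.ofHours(12))
        .recordStats()
        .build();
}
@Service
public class TokenBudgetEnforcer {
    private static final Logger log = LoggerFactory.getLogger(TokenBudgetEnforcer.class);
    private static final long DAILY_LIMIT_PER_CUSTOMER = 50_000;
    private static final long WARN_AT = DAILY_LIMIT_PER_CUSTOMER * 8 / 10;
    // Counters expire 24 hours after a customer's first call, which approximates a daily window.
    // The size bound is illustrative; an evicted customer starts again from zero.
    private final Cache<String, AtomicLong> dailyUsage = Caffeine.newBuilder()
        .maximumSize(100_000)
        .expireAfterWrite(Duration.ofDays(1))
        .build();
    public boolean hasRemainingBudget(String customerId) {
        AtomicLong used = dailyUsage.get(customerId, k -> new AtomicLong(0));
        return used.get() < DAILY_LIMIT_PER_CUSTOMER;
    }
    public void recordUsage(String customerId, long promptTokens, long completionTokens) {
        long added = promptTokens + completionTokens;
        long newTotal = dailyUsage.get(customerId, k -> new AtomicLong(0)).addAndGet(added);
        // Warn once, on the call that crosses 80%, rather than on every later call
        if (newTotal >= WARN_AT && newTotal - added < WARN_AT) {
            log.warn("Customer {} at 80% of daily token budget", customerId);
        }
    }
}
@Service
public class ResilientSupportAssistant {
    private final IntelligentSupportAssistant primaryAssistant;
    private final StaticFaqService staticFaqService;
    public ResilientSupportAssistant(IntelligentSupportAssistant primaryAssistant,
                                     StaticFaqService staticFaqService) {
        this.primaryAssistant = primaryAssistant;
        this.staticFaqService = staticFaqService;
    }
    @CircuitBreaker(name = "openai", fallbackMethod = "fallbackToLighterModel")
    public SupportResponse answerQuestion(String question, String customerId) {
        return primaryAssistant.answerQuestion(question, customerId);
    }
    // Resilience4j finds the fallback by name: same parameters as the guarded method, plus the exception
    private SupportResponse fallbackToLighterModel(String question, String customerId, Exception cause) {
        // Fall to GPT-4o-mini, then static FAQ.
        // answerWithModel is assumed to be a variant of answerQuestion that overrides the model name.
        try {
            return primaryAssistant.answerWithModel(question, customerId, "gpt-4o-mini");
        } catch (Exception e) {
            return staticFaqService.findBestMatch(question)
                .map(SupportResponse::fromFaq)
                .orElse(SupportResponse.escalate(question, "Escalated to human"));
        }
    }
}
RAG Pipeline and Tool Calling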
The full RAG flow with similarity gating + circuit-breaker fallback — every step has a failure mode the next step handles:
graph TD
User[User question] --> Scrub[PiiScrubber.scrub<br/>strip emails, SSNs, PII]
Scrub --> Guard{PromptGuard<br/>check?}
Guard -->|injection detected| Reject[Return safe-decline<br/>log signature]
Guard -->|safe| Embed[Embed via OpenAI<br/>or local model]
Embed --> Retrieve[VectorStore.similaritySearch<br/>top-4 chunks]
Retrieve --> Gate{Top similarity<br/>>= 0.75?}
Gate -->|No — low confidence| Fallback[Return fallback answer<br/>do not call LLM]
Gate -->|Yes — confident| Budget{Per-session<br/>budget OK?}
Budget -->|No| Reject2[Return budget-exceeded]
Budget -->|Yes| LLM[ChatClient with<br/>retrieved context]
LLM -->|success| Response[Return LLM answer<br/>+ source citations]
LLM -->|circuit-breaker open| Cache[Return cached answer<br/>or graceful fallback]
style Reject fill:#fdd
style Reject2 fill:#fdd
style Fallback fill:#ffd
style Cache fill:#ffd
style Response fill:#dfd
Three production rules visible in the flow: (1) similarity gate goes BEFORE the LLM call so a bad retrieval cannot waste tokens; (2) budget check runs even on similarity-gated paths so loops cannot bypass it; (3) circuit-breaker open returns a cached or graceful response, never a 500.
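Rule (1) depends on what the similarity score measures. With COSINE_DISTANCE configured, the score is assumed here to be cosine similarity between the query embedding and each chunk embedding. The sketch below computes it by hand on toy vectors to show what the 0.75 gate is comparing; SimilarityGate is a hypothetical name, and in the real pipeline the vector store does this work.

```java
import java.util.List;

/** Hand-rolled cosine similarity and the threshold gate, for illustration only. */
final class SimilarityGate {
    static final double THRESHOLD = 0.75;

    static double cosine(float[] a, float[] b) {
        if (a.length != b.length) {
            throw new IllegalArgumentException("dimension mismatch");
        }
        double dot = 0, normA = 0, normB = 0;
        for (int i = 0; i < a.length; i++) {
            dot += a[i] * b[i];
            normA += a[i] * a[i];
            normB += b[i] * b[i];
        }
        if (normA == 0 || normB == 0) {
            return 0.0; // a zero vector has no direction, so treat it as no match
        }
        return dot / (Math.sqrt(normA) * Math.sqrt(normB));
    }

    /** True only if the best-matching chunk clears the threshold. */
    static boolean shouldCallLlm(float[] query, List<float[]> chunks) {
        double best = chunks.stream().mapToDouble(c -> cosine(query, c)).max().orElse(0.0);
        return best >= THRESHOLD;
    }
}
```

An empty result list yields a best score of 0.0, so a knowledge base with no coverage for the question takes the fallback path without any special case.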
Chunk knowledge base documents at 512 tokens (fixed-size, simple strategy for FAQ-style documents). Retrieve top-4 chunks by similarity, then gate the LLM response on similarity score (≥0.75), not on asking the model to rate itself.
@Service
public class IntelligentSupportAssistant {
private static final double SIMILARITY_THRESHOLD = 0.75;
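private final ChatClient chatClient;
private final VectorStore vectorStore;
private final PiiScrubber piiScrubber;
private final PromptGuard promptGuard;
// Spring AI auto-configures a ChatClient.Builder bean; build the client once here
public IntelligentSupportAssistant(ChatClient.Builder chatClientBuilder, VectorStore vectorStore,
PiiScrubber piiScrubber, PromptGuard promptGuard) {
this.chatClient = chatClientBuilder.build();
this.vectorStore = vectorStore;
this.piiScrubber = piiScrubber;
this.promptGuard = promptGuard;
}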
public SupportResponse answerQuestion(String rawQuestion, String customerId) {
String question = piiScrubber.scrub(rawQuestion);
if (promptGuard.isInjectionAttempt(question)) {
return SupportResponse.escalate(question, "Query flagged");
}
// Retrieve context
List<Document> retrieved = vectorStore.similaritySearch(
SearchRequest.builder()
.query(question)
.topK(4)
.similarityThreshold(SIMILARITY_THRESHOLD)
.build()
);
if (retrieved.isEmpty()) {
return SupportResponse.escalate(question, "No knowledge base match");
}
String context = retrieved.stream()
.map(Document::getText)
.collect(Collectors.joining("\n\n---\n\n"));
String answer = chatClient.prompt()
.system("Answer using only the provided knowledge base.")
.user(context + "\n\nQuestion: " + question)
.call()
.content();
if (!promptGuard.isOutputSafe(answer)) {
return SupportResponse.escalate(question, "Response flagged");
}
return SupportResponse.automated(answer, retrieved);
}
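}
For live data (order status, account balance), use tool calling: the LLM invokes typed Java methods, and each tool must validate authorization itself rather than trust the arguments the model supplies:
@Component
public class OrderStatusTool {
    // OrderService is the application's own service; its type name is assumed here
    private final OrderService orderService;
    public OrderStatusTool(OrderService orderService) {
        this.orderService = orderService;
    }
    // @Tool takes named attributes; the description is what the model sees when choosing a tool
    @Tool(description = "Retrieve order status by order number")
    public OrderStatusResponse getOrderStatus(String orderNumber) {
        // Authorization belongs here: confirm the order is owned by the authenticated customer
        // before returning anything, because the order number comes from the model
        return orderService.findByOrderNumber(orderNumber)
            .map(o -> new OrderStatusResponse(o.getStatus().name(), o.getTrackingNumber()))
            .orElseThrow(() -> new OrderNotFoundException(orderNumber));
    }
    public record OrderStatusResponse(String status, String trackingNumber) {}
}
Note: The @Tool annotation landed in Spring AI 1.0.0-M6 and is the preferred approach for defining tools in 1.0 GA and later. The older pattern (a @Component implementing a typed Function with @Description) still works but is superseded by the ToolCallback API.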
Production Observability
Track these domain-level metrics beyond the framework's built-in Micrometer integration [OpenTelemetry Sampling]:
@Component
public class AiMetrics {
    private final MeterRegistry registry;
    public AiMetrics(MeterRegistry registry) {
        this.registry = registry;
    }
    public void recordTokenUsage(Usage usage, String model) {
        registry.counter("ai.tokens", "type", "prompt", "model", model)
            .increment(usage.getPromptTokens());
        // Spring AI 1.x Usage exposes getCompletionTokens(); getGenerationTokens() is the pre-1.0 name
        registry.counter("ai.tokens", "type", "completion", "model", model)
            .increment(usage.getCompletionTokens());
    }
    public void recordEscalation(String reason) {
        registry.counter("ai.escalations", "reason", reason).increment();
    }
    public void recordLatency(long millis, String outcome) {
        registry.timer("ai.request.duration", "outcome", outcome).record(Duration.ofMillis(millis));
    }
}
Key alerts: escalation rate >25% (knowledge base gaps), token spike >500k/hour (runaway loop or abuse), p95 latency >8s (context size or model choice).
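In practice these thresholds would live in Prometheus alert rules, but the logic is small enough to spell out. The following is a hedged sketch of the three conditions as plain Java; AlertRules and its alert names are hypothetical, and the p95 uses the simple nearest-rank method rather than whatever your metrics backend computes from histogram buckets.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

/** The three alert conditions from the text, evaluated over one window of raw numbers. */
final class AlertRules {
    static List<String> evaluate(long requests, long escalations, long tokensLastHour, long[] latenciesMs) {
        List<String> firing = new ArrayList<>();
        if (requests > 0 && (double) escalations / requests > 0.25) {
            firing.add("escalation-rate"); // likely knowledge base gaps
        }
        if (tokensLastHour > 500_000) {
            firing.add("token-spike"); // runaway loop or abuse
        }
        if (p95(latenciesMs) > 8_000) {
            firing.add("p95-latency"); // context size or model choice
        }
        return firing;
    }

    /** Nearest-rank 95th percentile; returns 0 for an empty sample. */
    static long p95(long[] samples) {
        if (samples.length == 0) {
            return 0;
        }
        long[] sorted = samples.clone();
        Arrays.sort(sorted);
        int rank = (int) Math.ceil(0.95 * sorted.length);
        return sorted[rank - 1];
    }
}
```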
Production Checklist
- PII scrubber wired before any LLM call (regex heuristics + audit logging)
- Prompt injection guard: input validation for known patterns + output validation for leaked instructions
- Token budget enforcer: per-customer daily limits (50K tokens) with 80% alerts
- Bounded Caffeine caches: 50K embeddings max, 12-hour expiry
- Circuit breaker + fallback chain: GPT-4 → GPT-4o-mini → static FAQ
- Similarity-threshold gating: escalate if top chunk <0.75, not by asking model to self-rate
- Metrics: ai.tokens.* (prompt, completion), ai.escalations.* (reason), ai.request.duration (outcome)
- Alerts: escalation rate >25%, token spike >500k/hour, p95 latency >8s
For knowledge base updates: run eval suite before + after, rollback if pass rate drops >5%. Version documents and tag with version ID.
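The rollback rule reduces to one comparison once the pass rates are known. The sketch below reads ">5%" as a drop of more than five percentage points, which is an assumption (a relative 5% drop is the other reading); KnowledgeBaseReleaseGate is a hypothetical name and the eval suite itself is out of scope.

```java
/** Compares eval pass rates before and after a knowledge base update. */
final class KnowledgeBaseReleaseGate {
    static final double MAX_DROP_POINTS = 5.0; // percentage points, by assumption

    static double passRate(int passed, int total) {
        if (total <= 0) {
            throw new IllegalArgumentException("eval suite must contain at least one case");
        }
        return 100.0 * passed / total;
    }

    /** True when the new version regressed by more than the allowed margin. */
    static boolean shouldRollback(double beforePct, double afterPct) {
        return beforePct - afterPct > MAX_DROP_POINTS;
    }
}
```

Tagging each document with its version ID is what makes the rollback itself cheap: the retriever can filter on the previous version while the regression is investigated.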
Fallback chain that survives a primary-model outage
The "GPT-4 → GPT-4o-mini → static FAQ" line in the checklist needs an actual implementation. The pattern below routes around a primary-model outage without leaking errors to the user, with the structured-output schema enforced at every layer so a degraded fallback can't silently violate the API contract:
@Service
public class ResilientChatService {
    private final ChatClient primary; // GPT-4
    private final ChatClient fallback; // GPT-4o-mini
    private final StaticFaqService faq; // pre-canned answers
    private final MeterRegistry meters;
    public ResilientChatService(ChatClient.Builder builder, MeterRegistry meters,
                                StaticFaqService faq) {
        // clone() keeps the two clients from sharing one mutable builder
        this.primary = builder.clone()
            .defaultOptions(ChatOptions.builder().model("gpt-4").build())
            .build();
        this.fallback = builder.clone()
            .defaultOptions(ChatOptions.builder().model("gpt-4o-mini").build())
            .build();
        this.faq = faq;
        this.meters = meters;
    }
    public Answer ask(String question) {
        try {
            return primary.prompt(question).call().entity(Answer.class);
        } catch (RuntimeException e) {
            // Broad on purpose in this sketch; narrow it to the exceptions your provider
            // client raises for rate limits and outages
            countFallback("secondary", e.getClass().getSimpleName());
            try {
                return fallback.prompt(question).call().entity(Answer.class);
            } catch (RuntimeException inner) {
                countFallback("static", inner.getClass().getSimpleName());
                return faq.bestMatch(question) // never throws
                    .orElse(Answer.escalate("model unavailable"));
            }
        }
    }
    // Micrometer tags are fixed when the counter is looked up; increment() takes no tags
    private void countFallback(String tier, String reason) {
        meters.counter("ai.fallback.invocations", "tier", tier, "reason", reason).increment();
    }
}
The corresponding schema validation: Spring AI's structured output conversion fails on completions that are not valid JSON for the target type, so a degraded fallback model that returns "I think it might be X" instead of the JSON contract surfaces as an exception at the conversion step, where the fallback chain can catch it, rather than as a 500:
public record Answer(
@NotBlank @JsonPropertyDescription("Answer text. Empty string if escalating.")
String text,
@NotNull @JsonPropertyDescription("Confidence 0.0-1.0. Trigger escalation below 0.75.")
Double confidence,
@JsonPropertyDescription("Source document IDs that backed this answer.")
String[] citations,
@JsonPropertyDescription("Why the model could not answer; null on success.")
String escalationReason
) {
public static Answer escalate(String reason) {
return new Answer("", 0.0, new String[0], reason);
}
public boolean shouldEscalate() {
return confidence < 0.75 || escalationReason != null;
}
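}
The fallback chain only helps if the contract holds, which is why Answer is a typed record rather than an untyped Map blob the controller has to second-guess. The JSON shape is checked when the completion is converted to the record; @NotBlank and @NotNull are Bean Validation constraints, so they take effect only if the service also passes the result through a jakarta.validation Validator.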
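The same idea can be shown with no framework at all: each tier is tried in order, and a result that violates the contract is treated exactly like a tier that threw. This is a minimal sketch under that assumption; FallbackChain is a hypothetical class, and the contract is a plain predicate standing in for schema and constraint validation.

```java
import java.util.List;
import java.util.function.Predicate;
import java.util.function.Supplier;

/** Tries each tier in order; a contract violation counts as a failure of that tier. */
final class FallbackChain<T> {
    private final List<Supplier<T>> tiers;
    private final Predicate<T> contract;
    private final T lastResort;

    FallbackChain(List<Supplier<T>> tiers, Predicate<T> contract, T lastResort) {
        this.tiers = tiers;
        this.contract = contract;
        this.lastResort = lastResort;
    }

    T call() {
        for (Supplier<T> tier : tiers) {
            try {
                T result = tier.get();
                if (result != null && contract.test(result)) {
                    return result;
                }
                // Result broke the contract: fall through to the next tier
            } catch (RuntimeException tierUnavailable) {
                // Tier failed outright: fall through to the next tier
            }
        }
        return lastResort; // every tier failed or violated the contract
    }
}
```

Validating inside the loop is the design choice that matters: a cheaper model that answers in the wrong shape never reaches the caller, it simply hands over to the next tier.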
Tool-Calling Reliability: Hallucinated Tools and Schema Retries
The dirty secret of LLM tool calling is that models hallucinate tool names that look plausible but do not exist in the registered set. A model that has seen getOrderStatus thousands of times in training will happily emit get_order_details, lookupOrder, or fetchOrderTrackingInfo when the user phrases the question slightly differently — none of which are registered. The framework's default behavior is to either throw or silently drop the call, both of which produce a degraded user experience without any signal to the operator.
The fix is a registry-aware dispatcher that treats the LLM's tool call as untrusted input and validates it against the actual registry before dispatch. When a hallucinated tool name arrives, the dispatcher does not throw — it injects a corrective system message back into the conversation listing the valid tools and asks the model to retry. Two retries is the sweet spot: more than that and the model is stuck in a confusion loop the dispatcher cannot break.
// Tool, ToolCall, ToolResult, ConversationState, and ConversationClient are application-defined
// types in this sketch, not Spring AI classes. ConversationClient stands for a thin wrapper that
// sends the conversation back to the model and returns the tool call it proposes next.
@Service
public class ToolCallDispatcher {
    private static final int MAX_RETRIES = 2;
    private final Map<String, Tool> registry;
    private final ConversationClient model;
    private final MeterRegistry meters;
    public ToolCallDispatcher(List<Tool> tools, ConversationClient model, MeterRegistry meters) {
        this.registry = tools.stream().collect(Collectors.toMap(Tool::name, t -> t));
        this.model = model;
        this.meters = meters;
    }
    public ToolResult dispatch(ToolCall call, ConversationState conv) {
        for (int attempt = 0; attempt <= MAX_RETRIES; attempt++) {
            Tool tool = registry.get(call.name());
            if (tool == null) {
                // Tag by failure kind, not by the invented name, to keep metric cardinality bounded
                countHallucination("invented_name");
                conv.addSystemMessage("Tool '" + call.name()
                    + "' does not exist. Valid tools: " + String.join(", ", registry.keySet())
                    + ". Retry with one of these or respond directly.");
                call = model.continueConversation(conv).extractToolCall();
                continue;
            }
            try {
                tool.validateArguments(call.arguments());
                return tool.invoke(call.arguments());
            } catch (SchemaValidationException sve) {
                countHallucination("bad_arguments");
                conv.addSystemMessage("Argument schema invalid: " + sve.getMessage()
                    + ". Required schema: " + tool.schemaJson());
                call = model.continueConversation(conv).extractToolCall();
            }
        }
        return ToolResult.escalate("Tool resolution failed after " + MAX_RETRIES + " retries");
    }
    private void countHallucination(String kind) {
        meters.counter("ai.tool.hallucinations", "kind", kind).increment();
    }
}
Schema validation deserves the same retry treatment. When the model emits {"orderNumber": 12345} for a tool that requires a string, do not let Jackson throw a 500. Feed the validation error back to the model with the expected schema and let it correct itself. The ai.tool.hallucinations counter with a kind tag separates "invented tool name" from "wrong argument type" so the operator can tell the difference between a registry-coverage problem (add the missing tool) and a schema-clarity problem (rename a field, add an example).
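The retry bound is easiest to check in isolation. Below is a self-contained sketch of the name-validation half of the dispatcher, with the model reduced to a function from a corrective message to the next proposed tool name. RetryingToolDispatcher, the "ESCALATE" sentinel, and the string-in, string-out tool signature are all simplifications invented for this example.

```java
import java.util.Map;
import java.util.function.Function;

/** Validates a proposed tool name against the registry, asking the model to retry at most twice. */
final class RetryingToolDispatcher {
    static final int MAX_RETRIES = 2;
    static final String ESCALATE = "ESCALATE";

    private final Map<String, Function<String, String>> registry; // tool name -> implementation

    RetryingToolDispatcher(Map<String, Function<String, String>> registry) {
        this.registry = registry;
    }

    /** @param model maps a corrective message to the tool name the model proposes next */
    String dispatch(String firstProposal, String argument, Function<String, String> model) {
        String proposed = firstProposal;
        for (int attempt = 0; attempt <= MAX_RETRIES; attempt++) {
            Function<String, String> tool = registry.get(proposed);
            if (tool != null) {
                return tool.apply(argument);
            }
            if (attempt == MAX_RETRIES) {
                break; // out of retries, so do not spend another model call
            }
            String correction = "Tool '" + proposed + "' does not exist. Valid tools: "
                + String.join(", ", registry.keySet());
            proposed = model.apply(correction);
        }
        return ESCALATE;
    }
}
```

With MAX_RETRIES set to 2, a model that never recovers costs exactly two extra model calls before the request is escalated.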
RAG Federation Across Multiple Knowledge Bases
Real support flows do not have a single knowledge base. Refund policies live in one repository, product documentation in another, and engineering runbooks in a third — each with its own access controls, refresh cadence, and chunking strategy. The naive approach is to concatenate all three into one giant pgvector table and hope the similarity search picks the right one. In practice this destroys retrieval quality: refund policy chunks compete with product manual chunks for the top-4 slots, and the LLM ends up with mixed context that produces confused answers.
The federation pattern routes the question to the right knowledge base first, then retrieves only from that base. A lightweight classifier — either a small fine-tuned model or a keyword router — picks the namespace, and the vector store query is scoped to that namespace alone. When the classifier is uncertain, the retriever fans out to all eligible bases in parallel and re-ranks the union by similarity score, so the system gracefully degrades from "scoped retrieval" to "broad retrieval" rather than failing.
@Service
public class FederatedRetriever {
private final Map<KnowledgeBase, VectorStore> stores;
private final NamespaceClassifier classifier;
private final Executor parallelExecutor;
private static final double UNCERTAIN_THRESHOLD = 0.6;
public List<Document> retrieve(String question, AuthContext ctx) {
ClassificationResult routing = classifier.classify(question);
if (routing.confidence() >= UNCERTAIN_THRESHOLD) {
VectorStore store = stores.get(routing.target());
return store.similaritySearch(SearchRequest.builder()
.query(question)
.topK(4)
.filterExpression(authFilter(ctx))
.build());
}
List<CompletableFuture<List<Document>>> fanOut = stores.entrySet().stream()
.filter(e -> ctx.canAccess(e.getKey()))
.map(e -> CompletableFuture.supplyAsync(
() -> e.getValue().similaritySearch(
SearchRequest.builder()
.query(question)
.topK(2)
.filterExpression(authFilter(ctx))
.build()),
parallelExecutor))
.toList();
// Re-rank the union by retrieval score, highest first; a chunk without a score sorts last
return fanOut.stream()
.map(CompletableFuture::join)
.flatMap(List::stream)
.sorted(Comparator.comparingDouble(
(Document d) -> d.getScore() == null ? 0.0 : d.getScore()).reversed())
.limit(4)
.toList();
}
private Filter.Expression authFilter(AuthContext ctx) {
FilterExpressionBuilder b = new FilterExpressionBuilder();
// Both predicates must hold: tenant matches AND classification is within clearance
return b.and(
b.in("tenant_id", ctx.tenantId()),
b.lte("classification", ctx.clearanceLevel()))
.build();
}
}
The auth filter on every retrieval is non-negotiable. Without it, a federated retriever is the easiest way in the codebase to leak data across tenants: a refund-policy retrieval for tenant A returns chunks from tenant B because they happen to be more semantically similar to the question. The filter pushes the access-control predicate into the vector store query itself rather than filtering after the fact, which both preserves correctness and keeps the top-4 slots full of authorized chunks.
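The difference between filtering inside the query and filtering afterwards is easy to demonstrate on an in-memory list. The sketch below is illustrative only: TenantScopedSearch and Chunk are hypothetical, and the scores are made-up numbers standing in for vector similarity.

```java
import java.util.Comparator;
import java.util.List;

/** Contrasts filter-then-rank with rank-then-filter on pre-scored chunks. */
final class TenantScopedSearch {
    record Chunk(String tenantId, String text, double score) {}

    /** Filter first, then take the top K: every slot holds an authorized chunk. */
    static List<Chunk> filterThenTopK(List<Chunk> scored, String tenantId, int k) {
        return scored.stream()
            .filter(c -> c.tenantId().equals(tenantId))
            .sorted(Comparator.comparingDouble(Chunk::score).reversed())
            .limit(k)
            .toList();
    }

    /** Take the top K, then filter: other tenants' chunks use up slots before being dropped. */
    static List<Chunk> topKThenFilter(List<Chunk> scored, String tenantId, int k) {
        return scored.stream()
            .sorted(Comparator.comparingDouble(Chunk::score).reversed())
            .limit(k)
            .filter(c -> c.tenantId().equals(tenantId))
            .toList();
    }
}
```

When another tenant owns the highest-scoring chunks, the post-filter variant can return nothing at all even though authorized chunks exist, which is the "empty slots" failure the paragraph above describes.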
Cost Attribution Per Spring Profile
Token cost surprises are operational failures, not accounting failures. The team wakes up to a bill that is 4x higher than last month with no idea which environment generated it — was it the staging soak test that ran overnight, the dev profile that someone left enabled with gpt-4 instead of gpt-4o-mini, or the production tenant who started a runaway loop? Without attribution at the profile and tenant level, the answer is "everyone has to investigate everything," which is how a 12-hour incident review starts.
Spring profiles are the right granularity because they map cleanly to environments, and MeterRegistry tags are how you carry that attribution into Prometheus without rewriting your metrics layer. The profile name comes from Environment.getActiveProfiles() and goes onto every token-usage counter as a tag, so a single PromQL query can break down spend by environment, model, and tenant simultaneously — which is the dimension you actually need when paging the on-call.
@Component
public class ProfiledCostMeter {
    private final MeterRegistry registry;
    private final String activeProfile;
    private final ModelPricing pricing;
    public ProfiledCostMeter(Environment env, MeterRegistry registry, ModelPricing pricing) {
        this.registry = registry;
        String joined = String.join(",", env.getActiveProfiles());
        this.activeProfile = joined.isEmpty() ? "default" : joined; // no active profile set
        this.pricing = pricing;
    }
    public void recordCall(String tenantId, String model, Usage usage) {
        // Divide in floating point: integer division would round small calls down to zero cents
        double promptCents = pricing.promptCentsPer1k(model) * usage.getPromptTokens() / 1000.0;
        double completionCents = pricing.completionCentsPer1k(model) * usage.getCompletionTokens() / 1000.0;
        increment(tenantId, model, "prompt", promptCents);
        increment(tenantId, model, "completion", completionCents);
    }
    // Registers against the injected registry, not Metrics.globalRegistry, so the counter is
    // exported alongside the rest of the application's meters
    private void increment(String tenantId, String model, String kind, double cents) {
        Counter.builder("ai.cost.usd.cents")
            .description("AI cost in USD cents, attributed by profile, model, and tenant")
            .tags("profile", activeProfile, "model", model, "tenant", tenantId, "kind", kind)
            .register(registry)
            .increment(cents);
    }
}
Pair this with a daily Prometheus recording rule that aggregates sum by (profile, tenant) (increase(ai_cost_usd_cents_total[24h])) and a Grafana panel that flags any profile crossing the 80%-of-monthly-budget threshold mid-month. The recording rule is what turns "we got a bill surprise" into "the staging profile crossed budget on day 14, the on-call got paged on day 14, the runaway test was killed on day 14", which is the only cost-incident timeline that does not end in a postmortem. [Prometheus Best Practices]
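The arithmetic behind the counter is worth checking by hand, because whole-cent integer math silently drops small calls. The sketch below compares exact decimal math with truncating long math and adds the 80% budget check; CostMath is a hypothetical helper and the prices in the usage note are made-up numbers, not any provider's rates.

```java
import java.math.BigDecimal;

/** Per-call cost arithmetic and the 80%-of-budget check. */
final class CostMath {
    /** Exact cost in cents; dividing by 1000 always terminates, so no rounding mode is needed. */
    static BigDecimal cents(long tokens, BigDecimal centsPer1kTokens) {
        return centsPer1kTokens.multiply(BigDecimal.valueOf(tokens)).divide(BigDecimal.valueOf(1000));
    }

    /** The same calculation in whole cents, shown only to illustrate the truncation. */
    static long truncatedCents(long tokens, long centsPer1kTokens) {
        return centsPer1kTokens * tokens / 1000;
    }

    /** True once month-to-date spend reaches 80% of the monthly budget. */
    static boolean crossedThreshold(BigDecimal monthToDate, BigDecimal monthlyBudget) {
        return monthToDate.compareTo(monthlyBudget.multiply(new BigDecimal("0.8"))) >= 0;
    }
}
```

At a hypothetical 3 cents per 1k tokens, a 200-token call costs 0.6 cents exactly, while the whole-cent version records 0; across many short support replies that gap is most of the bill.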
Frequently Asked Questions
What is Spring AI and how does it differ from LangChain?
Spring AI is Spring's official framework for integrating AI/LLM capabilities into Java applications. Unlike LangChain (Python/JS), Spring AI follows Spring conventions — dependency injection, auto-configuration, and the Spring ecosystem. It provides a portable API across OpenAI, Azure OpenAI, Ollama, and other providers.
How do I prevent prompt injection in Spring AI applications?
Use input validation to reject known injection patterns, implement output guardrails that verify LLM responses against expected schemas, set strict system prompts that instruct the model to ignore user-injected instructions, and apply token budgets to prevent resource exhaustion from adversarial inputs.
Can Spring AI work with local models like Ollama?
Yes, Spring AI supports Ollama out of the box via the spring-ai-starter-model-ollama dependency, which follows the same starter naming as the OpenAI starter in the project setup. Configure the base URL and model name in application properties, and the same ChatClient API works for both local and cloud-hosted models.
Engineering Team
A multidisciplinary team of backend engineers, architects, and DevOps practitioners shipping deep dives into distributed systems and production infrastructure.