#java #spring-ai #ai #llm #rag #vector-database #spring-boot #observability

Spring AI in Production: RAG Pipelines, Reliability, and Observability for Java Backends

BackendBytes Engineering Team

Feb 17, 2026

14 min read

Part of Series: AI Engineering in Production

Lesson 4 of 6

Key Takeaways

→The model confidently invents company policy from training-set fragments and ships it as fact — the fix is circuit breakers, PII scrubbing, and retrieval-score gating, not removing the AI
→Spring AI 1.1 brings a production-grade Java API for RAG pipelines with pgvector, Micrometer observability, and token cost tracking
→Every AI response is traceable to the specific knowledge base chunks that informed it — critical for compliance and quality review
→Gate LLM answers on retrieval similarity score (≥0.75), not model confidence: the model is trained to sound confident even when hallucinating, whereas the retrieval score reflects how well the knowledge base actually covers the question
→Scrub PII (email, phone, card digits, SSN) before any external LLM call, using a small application-level scrubber placed in front of the ChatClient; unscrubbed input can end up in provider logs or training data if it is captured

A support bot promises a refund "per standard policy." No such policy exists. A team wires an LLM into their customer support flow. Within weeks, ticket volume jumps, not because the AI is broken, but because it confidently fabricates refund and warranty policies. Customers collect on refunds the model invented from training-set fragments, and nothing gates the response. In OWASP LLM Top 10 (2025) terms the fabrication itself is LLM09 (misinformation), and the same ungated pipeline is exposed to LLM06 (excessive agency) and LLM02 (sensitive information disclosure)^{[OWASP LLM Top 10]}.

The fix isn't removing the AI — it's wrapping it in the same production engineering patterns you'd apply to any unreliable downstream service: circuit breakers, output validation, PII scrubbing, bounded caches, and observability that tracks token costs before they become a surprise on the cloud bill.

TL;DR

Spring AI 1.1 is production-ready for RAG-powered support chatbots. The critical pattern: scrub PII before calling the LLM, validate retrieval similarity scores instead of asking the model to rate itself, use circuit breakers with fallback chains for API outages, and instrument token budgets^{[Prometheus Best Practices]} to prevent cost surprises.

Scrub PII (email, phone, card digits) from user input before any external LLM call
Gate LLM answers on retrieval similarity score (≥0.75), not model confidence
Implement 3-tier fallback: GPT-4 → cheaper model → static FAQ
Track ai.tokens.* and ai.escalations.* metrics per customer

graph TD
    User[User question] --> PII[PII scrubber:<br/>strip email / phone / card / SSN]
    PII --> Retr[Vector retrieval<br/>pgvector similarity]
    Retr --> Gate{similarity ≥ 0.75?}
    Gate -->|No, low confidence| Fallback[Static FAQ<br/>or 'I don't know']
    Gate -->|Yes| LLM[LLM call<br/>w/ retrieved context]
    LLM --> Out[Output validator:<br/>schema + policy check]
    Out --> Resp[Response to user]
    CB[Circuit breaker<br/>per provider] -.->|opens on failure| LLM
    CB -.->|fall through to| Fallback
    Cost[Cost meter<br/>per tenant] -.->|hard cap| LLM
    style PII fill:#fee
    style Gate fill:#eef
    style Out fill:#fee
    style CB fill:#fee
    style Cost fill:#fee

The diagram is the production discipline: every LLM call is gated front (PII + retrieval-score) and back (output validator), with circuit breaker and cost meter as kill switches outside the call path. Most "Spring AI hello world" tutorials skip every red box on this diagram. Production is the red boxes.

Spring AI 1.x Capabilities Matrix

| Capability | Spring AI | Code Example | Trade-off |
| --- | --- | --- | --- |
| ChatClient | Unified API to OpenAI, Anthropic, Ollama, Bedrock | `chatClient.prompt().user(...).call().content()` | Portable across providers; minimal per-provider tuning |
| Embeddings | Vector generation for semantic search | `embeddingModel.embed(text)` → `float[]` | ~$0.02 per 1M tokens (text-embedding-3-small); benefit from caching |
| Vector Store (pgvector) | Similarity search on 1M+ vectors at ~2ms | `vectorStore.similaritySearch(query)` | Requires PostgreSQL extension; index tuning needed for scale |
| Tool Calling | Let LLM invoke typed Java methods | `@Tool` annotation (1.0 GA) or legacy `Function` callback | Cost: LLM must decide when to call; must validate auth inside function |
| RAG Retrieval | Combine LLM with external context | Chunking strategy + similarity scoring | LLM hallucinates less; knowledge base staleness is new risk |

Project Setup

<dependencyManagement>
    <dependencies>
        <dependency>
            <groupId>org.springframework.ai</groupId>
            <artifactId>spring-ai-bom</artifactId>
            <version>1.1.7</version>
            <type>pom</type>
            <scope>import</scope>
        </dependency>
    </dependencies>
</dependencyManagement>
 
<dependencies>
    <dependency>
        <groupId>org.springframework.ai</groupId>
        <artifactId>spring-ai-starter-model-openai</artifactId>
    </dependency>
    <dependency>
        <groupId>org.springframework.ai</groupId>
        <artifactId>spring-ai-starter-vector-store-pgvector</artifactId>
    </dependency>
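    <!-- Resilience4j for circuit breakers and rate limiting. The Spring Boot BOM does not manage
         its version, so pin one or import a BOM that does. The @CircuitBreaker annotation also
         needs spring-boot-starter-aop on the classpath. -->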
    <dependency>
        <groupId>io.github.resilience4j</groupId>
        <artifactId>resilience4j-spring-boot3</artifactId>
    </dependency>
    <!-- Caffeine for bounded in-memory caching -->
    <dependency>
        <groupId>com.github.ben-manes.caffeine</groupId>
        <artifactId>caffeine</artifactId>
    </dependency>
    <!-- Micrometer for observability -->
    <dependency>
        <groupId>io.micrometer</groupId>
        <artifactId>micrometer-registry-prometheus</artifactId>
    </dependency>
</dependencies>

spring:
  ai:
    openai:
      api-key: ${OPENAI_API_KEY}
      chat:
        options:
          model: gpt-4o
          temperature: 0.3 # lower temperature = more deterministic support answers
          max-tokens: 500
      embedding:
        options:
          model: text-embedding-3-small
 
    vectorstore:
      pgvector:
        initialize-schema: false # manage schema with Flyway, not Spring AI
        dimensions: 1536
        distance-type: COSINE_DISTANCE
        index-type: HNSW
 
resilience4j:
  circuitbreaker:
    instances:
      openai:
        sliding-window-size: 20
        failure-rate-threshold: 50
        wait-duration-in-open-state: 30s
        permitted-number-of-calls-in-half-open-state: 5
  ratelimiter:
    instances:
      ai-chat:
        limit-for-period: 30 # shared by every caller of this instance; per-customer limits need one limiter per customer ID
        limit-refresh-period: 60s
        timeout-duration: 0s
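
A single named limiter instance such as ai-chat is shared by everything that calls it, so a per-customer limit needs state keyed by customer ID. The following is a minimal, dependency-free sketch of that idea as a fixed-window counter; it is not Resilience4j's implementation. The class name `PerCustomerLimiter`, the 10,000-customer bound, and the injected clock value are assumptions made for illustration. With Resilience4j itself, the equivalent would be to look up one limiter per customer ID from a `RateLimiterRegistry`.

```java
import java.util.LinkedHashMap;
import java.util.Map;

/** Fixed-window limiter keyed by customer ID (illustrative sketch, not Resilience4j). */
class PerCustomerLimiter {
    // Hypothetical bound: keeps the map from growing without limit, evicting least-recently-used customers.
    private static final int MAX_TRACKED_CUSTOMERS = 10_000;

    private static final class Window {
        long index; // which period this counter belongs to
        int count;  // calls seen in that period
    }

    private final int limitForPeriod;
    private final long periodMillis;
    private final Map<String, Window> windows =
        new LinkedHashMap<>(16, 0.75f, true) {
            @Override
            protected boolean removeEldestEntry(Map.Entry<String, Window> eldest) {
                return size() > MAX_TRACKED_CUSTOMERS;
            }
        };

    PerCustomerLimiter(int limitForPeriod, long periodMillis) {
        this.limitForPeriod = limitForPeriod;
        this.periodMillis = periodMillis;
    }

    /** Returns true if the call is allowed. The clock is passed in so the logic is testable. */
    synchronized boolean tryAcquire(String customerId, long nowMillis) {
        long index = nowMillis / periodMillis;
        Window w = windows.computeIfAbsent(customerId, id -> new Window());
        if (w.index != index) { // a new period started: reset this customer's counter
            w.index = index;
            w.count = 0;
        }
        if (w.count >= limitForPeriod) {
            return false;
        }
        w.count++;
        return true;
    }
}
```

One customer exhausting its window does not affect another customer, which is the property the shared YAML instance alone cannot give you.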

Schema Management

Use initialize-schema: false in production and manage the pgvector schema with your migration tool (Flyway or Liquibase). Spring AI's auto-schema creation is convenient for local development but gives you no control over index tuning, vacuum scheduling, or rollback.

PII Scrubbing and Prompt Injection Defense

^{[OWASP LLM Top 10]}

Scrub email, phone, card digits, and SSN patterns before any external LLM call. Then validate input for injection patterns and output for leaked instructions.

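import java.util.regex.Pattern;

@Service
public class PiiScrubber {
    private static final Pattern EMAIL = Pattern.compile("[a-zA-Z0-9+_.-]+@[a-zA-Z0-9.-]+\\.[a-zA-Z]{2,}");
    private static final Pattern PHONE = Pattern.compile("\\b(\\+?1[-. ]?)?\\(?\\d{3}\\)?[-. ]?\\d{3}[-. ]?\\d{4}\\b");
    private static final Pattern CARD = Pattern.compile("\\b(?:\\d[ -]?){13,16}\\b");
    // US SSN written as 123-45-6789
    private static final Pattern SSN = Pattern.compile("\\b\\d{3}-\\d{2}-\\d{4}\\b");

    public String scrub(String text) {
        // Longest digit runs first, so a card number is never half-matched as a phone number
        return EMAIL.matcher(text).replaceAll("[EMAIL]")
            .transform(s -> CARD.matcher(s).replaceAll("[CARD]"))
            .transform(s -> SSN.matcher(s).replaceAll("[SSN]"))
            .transform(s -> PHONE.matcher(s).replaceAll("[PHONE]"));
    }
}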
 
@Service
public class PromptGuard {
    private static final List<Pattern> INJECTION_PATTERNS = List.of(
        Pattern.compile("(?i)ignore\\s+(all\\s+)?previous\\s+instructions"),
        Pattern.compile("(?i)you\\s+are\\s+now\\s+a"),
        Pattern.compile("(?i)reveal\\s+(your|the)\\s+(instructions|prompt)")
    );
 
    public boolean isInjectionAttempt(String input) {
        return INJECTION_PATTERNS.stream().anyMatch(p -> p.matcher(input).find());
    }
 
    public boolean isOutputSafe(String output) {
        return !output.contains("Knowledge Base:") && !output.contains("Guidelines:");
    }
}

For regulated industries, add a dedicated PII detection service and review LLM provider data processing agreements before sending customer data.
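
Regex heuristics for card numbers also match order numbers and other long digit runs. One common refinement, sketched below under stated assumptions, is to replace a candidate only when it passes the Luhn checksum that payment card numbers satisfy. `CardScrubber` is a hypothetical helper, and its pattern is a variant of the CARD regex that cannot end on a separator. The trade-off is that a mistyped card number fails the checksum and is left in place, so this suits cases where false positives are the bigger problem.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

/** Replaces 13-16 digit runs with [CARD] only when they pass the Luhn checksum (sketch). */
class CardScrubber {
    // 13-16 digits with optional single space/dash separators between digits.
    private static final Pattern CANDIDATE = Pattern.compile("\\b\\d(?:[ -]?\\d){12,15}\\b");

    /** Luhn checksum over a string that contains digits only. */
    static boolean passesLuhn(String digits) {
        int sum = 0;
        boolean doubleIt = false; // every second digit, counting from the right, is doubled
        for (int i = digits.length() - 1; i >= 0; i--) {
            int d = digits.charAt(i) - '0';
            if (doubleIt) {
                d *= 2;
                if (d > 9) d -= 9;
            }
            sum += d;
            doubleIt = !doubleIt;
        }
        return sum % 10 == 0;
    }

    static String scrub(String text) {
        Matcher m = CANDIDATE.matcher(text);
        StringBuilder out = new StringBuilder();
        while (m.find()) {
            String digits = m.group().replaceAll("[ -]", "");
            // Keep the original text when the checksum fails (likely not a card number).
            String replacement = passesLuhn(digits) ? "[CARD]" : m.group();
            m.appendReplacement(out, Matcher.quoteReplacement(replacement));
        }
        m.appendTail(out);
        return out.toString();
    }
}
```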

Caches, Budgets, and Fallbacks

^{[Redis Docs]}

Use Caffeine with explicit bounds (never unbounded ConcurrentHashMap). Enforce per-customer token budgets to prevent cost surprises. Implement 3-tier fallback: GPT-4 → cheaper model → static FAQ.

@Bean
public Cache<String, float[]> embeddingCache() {
    return Caffeine.newBuilder()
        .maximumSize(50_000)
        .expireAfterAccess(Duration.ofHours(12))
        .recordStats()
        .build();
}
 
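@Service
public class TokenBudgetEnforcer {
    private static final Logger log = LoggerFactory.getLogger(TokenBudgetEnforcer.class);
    private static final long DAILY_LIMIT_PER_CUSTOMER = 50_000;

    // Bounded like the other caches (the size is an assumption). Entries expire 24h after a
    // customer's first call, which approximates a daily window; a midnight reset needs a scheduled clear.
    // State is per JVM: with several instances, move the counter to a shared store.
    private final Cache<String, AtomicLong> dailyUsage = Caffeine.newBuilder()
        .maximumSize(100_000)
        .expireAfterWrite(Duration.ofHours(24))
        .build();

    public boolean hasRemainingBudget(String customerId) {
        AtomicLong used = dailyUsage.get(customerId, k -> new AtomicLong(0));
        return used.get() < DAILY_LIMIT_PER_CUSTOMER;
    }

    public void recordUsage(String customerId, long promptTokens, long completionTokens) {
        AtomicLong used = dailyUsage.get(customerId, k -> new AtomicLong(0));
        long added = promptTokens + completionTokens;
        long newTotal = used.addAndGet(added);
        long warnAt = DAILY_LIMIT_PER_CUSTOMER * 8 / 10;
        // Warn once, on the call that crosses 80%, instead of on every later call
        if (newTotal >= warnAt && newTotal - added < warnAt) {
            log.warn("Customer {} at 80% of daily token budget", customerId);
        }
    }
}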
 
@Service
public class ResilientSupportAssistant {
    @CircuitBreaker(name = "openai", fallbackMethod = "fallbackToLighterModel")
    public SupportResponse answerQuestion(String question, String customerId) {
        return primaryAssistant.answerQuestion(question, customerId);
    }
 
    private SupportResponse fallbackToLighterModel(String question, String customerId, Exception cause) {
        // Fall to GPT-4o-mini, then static FAQ
        try {
            return primaryAssistant.answerWithModel(question, customerId, "gpt-4o-mini");
        } catch (Exception e) {
            return staticFaqService.findBestMatch(question).map(SupportResponse::fromFaq).orElse(SupportResponse.escalate(question, "Escalated to human"));
        }
    }
}

RAG Pipeline and Tool Calling

The full RAG flow with similarity gating + circuit-breaker fallback — every step has a failure mode the next step handles:

graph TD
    User[User question] --> Scrub[PiiScrubber.scrub<br/>strip emails, SSNs, PII]
    Scrub --> Guard{PromptGuard<br/>check?}
    Guard -->|injection detected| Reject[Return safe-decline<br/>log signature]
    Guard -->|safe| Embed[Embed via OpenAI<br/>or local model]
    Embed --> Retrieve[VectorStore.similaritySearch<br/>top-4 chunks]
    Retrieve --> Gate{Top similarity<br/>>= 0.75?}
    Gate -->|No — low confidence| Fallback[Return fallback answer<br/>do not call LLM]
    Gate -->|Yes — confident| Budget{Per-session<br/>budget OK?}
    Budget -->|No| Reject2[Return budget-exceeded]
    Budget -->|Yes| LLM[ChatClient with<br/>retrieved context]
    LLM -->|success| Response[Return LLM answer<br/>+ source citations]
    LLM -->|circuit-breaker open| Cache[Return cached answer<br/>or graceful fallback]
    style Reject fill:#fdd
    style Reject2 fill:#fdd
    style Fallback fill:#ffd
    style Cache fill:#ffd
    style Response fill:#dfd

Three production rules visible in the flow: (1) similarity gate goes BEFORE the LLM call so a bad retrieval cannot waste tokens; (2) budget check runs even on similarity-gated paths so loops cannot bypass it; (3) circuit-breaker open returns a cached or graceful response, never a 500.
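
The ordering in rules (1) and (2) can be expressed as a small pure function, which makes it easy to unit-test separately from any LLM client. This is a sketch under assumptions: `GatedPipeline`, the `Decision` enum, and the way scores and token counts are passed in are illustrative names, not part of Spring AI or of the services shown in this article.

```java
import java.util.List;

/** Decides what happens to a request before any tokens are spent (illustrative sketch). */
class GatedPipeline {
    enum Decision { FALLBACK_LOW_SIMILARITY, BUDGET_EXCEEDED, CALL_LLM }

    static final double SIMILARITY_THRESHOLD = 0.75;

    static Decision decide(List<Double> chunkScores, long tokensUsedToday, long dailyLimit) {
        // No retrieved chunks counts as similarity 0.0, so it falls through to the fallback.
        double top = chunkScores.stream().mapToDouble(Double::doubleValue).max().orElse(0.0);
        if (top < SIMILARITY_THRESHOLD) {
            return Decision.FALLBACK_LOW_SIMILARITY; // rule 1: a bad retrieval never reaches the LLM
        }
        if (tokensUsedToday >= dailyLimit) {
            return Decision.BUDGET_EXCEEDED;         // rule 2: the budget is checked on every gated path
        }
        return Decision.CALL_LLM;
    }
}
```

Rule (3) stays outside this function, because the circuit breaker wraps the LLM call itself rather than the decision to make it.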

Chunk knowledge base documents at 512 tokens (fixed-size, simple strategy for FAQ-style documents). Retrieve top-4 chunks by similarity, then gate the LLM response on similarity score (≥0.75), not on asking the model to rate itself.
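
A minimal sketch of fixed-size chunking follows. It approximates tokens with whitespace-separated words, which is an assumption: real token counts depend on the model's tokenizer, and Spring AI ships a `TokenTextSplitter` for that purpose. `FixedSizeChunker` is a hypothetical name used only for this illustration.

```java
import java.util.ArrayList;
import java.util.List;

/** Splits text into chunks of at most maxTokens whitespace-separated words (sketch). */
class FixedSizeChunker {
    static List<String> chunk(String text, int maxTokens) {
        if (maxTokens <= 0) {
            throw new IllegalArgumentException("maxTokens must be positive");
        }
        List<String> chunks = new ArrayList<>();
        String trimmed = text.strip();
        if (trimmed.isEmpty()) {
            return chunks; // nothing to index
        }
        String[] words = trimmed.split("\\s+");
        for (int start = 0; start < words.length; start += maxTokens) {
            int end = Math.min(start + maxTokens, words.length);
            chunks.add(String.join(" ", List.of(words).subList(start, end)));
        }
        return chunks;
    }
}
```

Calling `FixedSizeChunker.chunk(document, 512)` gives the fixed-size strategy described above; it has no overlap between chunks, so an answer that straddles a boundary can be split across two chunks.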

@Service
public class IntelligentSupportAssistant {
    private static final double SIMILARITY_THRESHOLD = 0.75;
    private final ChatClient chatClient;
    private final VectorStore vectorStore;
    private final PiiScrubber piiScrubber;
    private final PromptGuard promptGuard;
 
    public SupportResponse answerQuestion(String rawQuestion, String customerId) {
        String question = piiScrubber.scrub(rawQuestion);
        if (promptGuard.isInjectionAttempt(question)) {
            return SupportResponse.escalate(question, "Query flagged");
        }
 
        // Retrieve context
        List<Document> retrieved = vectorStore.similaritySearch(
            SearchRequest.builder()
                .query(question)
                .topK(4)
                .similarityThreshold(SIMILARITY_THRESHOLD)
                .build()
        );
 
        if (retrieved.isEmpty()) {
            return SupportResponse.escalate(question, "No knowledge base match");
        }
 
        String context = retrieved.stream()
            .map(Document::getText)
            .collect(Collectors.joining("\n\n---\n\n"));
 
        String answer = chatClient.prompt()
            .system("Answer using only the provided knowledge base.")
            .user(context + "\n\nQuestion: " + question)
            .call()
            .content();
 
        if (!promptGuard.isOutputSafe(answer)) {
            return SupportResponse.escalate(question, "Response flagged");
        }
 
        return SupportResponse.automated(answer, retrieved);
    }
}

For live data (order status, account balance), use tool calling: the LLM invokes typed Java methods. Always validate authorization inside the tool, because the model decides when to call it and with which arguments:

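@Component
public class OrderStatusTool {
    private final OrderService orderService;

    public OrderStatusTool(OrderService orderService) {
        this.orderService = orderService;
    }

    // @Tool has no value attribute; the text the model sees goes in description
    @Tool(description = "Retrieve order status by order number")
    public OrderStatusResponse getOrderStatus(String orderNumber) {
        // Authorization belongs here: confirm the order is owned by the authenticated
        // customer (taken from the server-side session, never from model output)
        return orderService.findByOrderNumber(orderNumber)
            .map(o -> new OrderStatusResponse(o.getStatus().name(), o.getTrackingNumber()))
            .orElseThrow(() -> new OrderNotFoundException(orderNumber));
    }

    public record OrderStatusResponse(String status, String trackingNumber) {}
}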

Note: The @Tool annotation landed in Spring AI 1.0.0-M6 and is the preferred approach for defining tools in 1.0 GA and later. The older pattern — a @Component implementing a typed Function with @Description — still works but is superseded by the ToolCallback API.

Production Observability

^{[OpenTelemetry Sampling]}

Track these domain-level metrics beyond the framework's built-in Micrometer integration:

@Component
public class AiMetrics {
    private final MeterRegistry registry;

    public AiMetrics(MeterRegistry registry) {
        this.registry = registry;
    }

    public void recordTokenUsage(Usage usage, String model) {
        registry.counter("ai.tokens", "type", "prompt", "model", model)
            .increment(usage.getPromptTokens());
        // Usage.getGenerationTokens() was renamed to getCompletionTokens() before 1.0 GA
        registry.counter("ai.tokens", "type", "completion", "model", model)
            .increment(usage.getCompletionTokens());
    }
 
    public void recordEscalation(String reason) {
        registry.counter("ai.escalations", "reason", reason).increment();
    }
 
    public void recordLatency(long millis, String outcome) {
        registry.timer("ai.request.duration", "outcome", outcome).record(Duration.ofMillis(millis));
    }
}

Key alerts: escalation rate >25% (knowledge base gaps), token spike >500k/hour (runaway loop or abuse), p95 latency >8s (context size or model choice).
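
In practice these thresholds would live in alerting rules on the metrics backend. The sketch below only encodes the same three conditions in Java so they can be checked in a test or a health endpoint. `AlertRules` and `Snapshot` are hypothetical names, and the snapshot fields are assumed to be computed elsewhere from the counters and timers above.

```java
import java.util.ArrayList;
import java.util.List;

/** Evaluates the three alert thresholds against a metrics snapshot (illustrative sketch). */
class AlertRules {
    record Snapshot(long requests, long escalations, long tokensLastHour, double p95LatencySeconds) {}

    static List<String> firing(Snapshot s) {
        List<String> alerts = new ArrayList<>();
        // Avoid dividing by zero when there has been no traffic.
        double escalationRate = s.requests() == 0 ? 0.0 : (double) s.escalations() / s.requests();
        if (escalationRate > 0.25) {
            alerts.add("escalation-rate"); // likely knowledge base gaps
        }
        if (s.tokensLastHour() > 500_000) {
            alerts.add("token-spike");     // runaway loop or abuse
        }
        if (s.p95LatencySeconds() > 8.0) {
            alerts.add("p95-latency");     // context size or model choice
        }
        return alerts;
    }
}
```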

Production Checklist

PII scrubber wired before any LLM call (regex heuristics + audit logging)
Prompt injection guard: input validation for known patterns + output validation for leaked instructions
Token budget enforcer: per-customer daily limits (50K tokens) with 80% alerts
Bounded Caffeine caches: 50K embeddings max, 12-hour expiry
Circuit breaker + fallback chain: GPT-4 → GPT-4o-mini → static FAQ
Similarity-threshold gating: escalate if top chunk <0.75, not by asking model to self-rate
Metrics: ai.tokens.* (prompt, completion), ai.escalations.* (reason), ai.request.duration (outcome)
Alerts: escalation rate >25%, token spike >500k/hour, p95 latency >8s

For knowledge base updates: run the eval suite before and after, and roll back if the pass rate drops by more than 5%. Version documents and tag them with the version ID.
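
The rollback rule can be sketched as a small comparison of two eval runs. The example assumes that "5%" means five percentage points of pass rate rather than a relative drop, which the text leaves open. `KnowledgeBaseEvalGate` is a hypothetical name.

```java
/** Compares eval pass rates before and after a knowledge base update (illustrative sketch). */
class KnowledgeBaseEvalGate {
    // Assumption: the threshold is in percentage points, not a relative change.
    static final double MAX_DROP_PERCENTAGE_POINTS = 5.0;

    static double passRatePercent(int passed, int total) {
        if (total <= 0) {
            throw new IllegalArgumentException("eval suite must contain at least one case");
        }
        return passed * 100.0 / total;
    }

    static boolean shouldRollBack(double beforePercent, double afterPercent) {
        return beforePercent - afterPercent > MAX_DROP_PERCENTAGE_POINTS;
    }
}
```

For example, a suite that passed 46 of 50 cases before the update (92%) and 43 of 50 after it (86%) has dropped six points, so the update would be rolled back.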

Fallback chain that survives a primary-model outage

The "GPT-4 → GPT-4o-mini → static FAQ" line in the checklist needs an actual implementation. The pattern below routes around a primary-model outage without leaking errors to the user, with the structured-output schema enforced at every layer so a degraded fallback can't silently violate the API contract:

@Service
public class ResilientChatService {
 
    private final ChatClient primary;     // GPT-4
    private final ChatClient fallback;    // GPT-4o-mini
    private final StaticFaqService faq;   // pre-canned answers
    private final MeterRegistry meters;   // resolve tagged counters at call time
 
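    public ResilientChatService(ChatClient.Builder builder, MeterRegistry meters,
                                StaticFaqService faq) {
        // clone() so the two clients do not share one mutable builder;
        // defaultOptions takes a ChatOptions instance, not a lambda
        this.primary  = builder.clone()
                               .defaultOptions(ChatOptions.builder().model("gpt-4").build())
                               .build();
        this.fallback = builder.clone()
                               .defaultOptions(ChatOptions.builder().model("gpt-4o-mini").build())
                               .build();
        this.faq = faq;
        this.meters = meters;
    }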
 
    public Answer ask(String question) {
        try {
            return primary.prompt(question).call().entity(Answer.class);
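        } catch (RateLimitException | ServiceUnavailableException e) {
            // Application-defined exception types; Spring AI's default error handling raises
            // TransientAiException / NonTransientAiException, which would need mapping to these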
            meters.counter("ai.fallback.invocations", "tier", "secondary",
                           "reason", e.getClass().getSimpleName()).increment();
            try {
                return fallback.prompt(question).call().entity(Answer.class);
            } catch (Exception inner) {
                meters.counter("ai.fallback.invocations", "tier", "static").increment();
                return faq.bestMatch(question)              // never throws
                          .orElse(Answer.escalate("model unavailable"));
            }
        }
    }
}

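The corresponding schema: entity(Answer.class) derives a JSON schema from the record, adds it to the prompt as format instructions, and deserializes the completion with Jackson. A degraded fallback model that returns "I think it might be X" instead of the JSON contract therefore fails conversion with an exception the service can catch and escalate, rather than handing a half-filled object to the controller. The Bean Validation annotations are not checked by that conversion; they take effect only when a Validator runs on the result: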

public record Answer(
    // @NotNull rather than @NotBlank: escalate() deliberately produces an empty text
    @NotNull @JsonPropertyDescription("Answer text. Empty string if escalating.")
    String text,
 
    @NotNull @JsonPropertyDescription("Confidence 0.0-1.0. Trigger escalation below 0.75.")
    Double confidence,
 
    @JsonPropertyDescription("Source document IDs that backed this answer.")
    String[] citations,
 
    @JsonPropertyDescription("Why the model could not answer; null on success.")
    String escalationReason
) {
    public static Answer escalate(String reason) {
        return new Answer("", 0.0, new String[0], reason);
    }
 
    public boolean shouldEscalate() {
        // confidence is a nullable Double; treat a missing value as "escalate"
        return confidence == null || confidence < 0.75 || escalationReason != null;
    }
}

The fallback chain only helps if the contract holds, which is why Answer is a typed record rather than an untyped Map blob the controller has to second-guess: Jackson rejects completions that do not fit the record at deserialization, and the constraint annotations state what a Validator should enforce on top of that.
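
Because constraint annotations do nothing until something evaluates them, the contract needs an explicit check after conversion. The sketch below is a dependency-free stand-in for that check, not the Bean Validation API. `ParsedAnswer` mirrors the fields of the Answer record, `AnswerContract` is a hypothetical helper, and the rule that non-escalated answers must carry at least one citation is an assumption drawn from the traceability goal stated earlier.

```java
import java.util.ArrayList;
import java.util.List;

/** Mirrors the fields of the Answer record so the example is self-contained. */
record ParsedAnswer(String text, Double confidence, String[] citations, String escalationReason) {}

/** Explicit post-deserialization checks on the structured output (illustrative sketch). */
class AnswerContract {
    static List<String> violations(ParsedAnswer a) {
        List<String> found = new ArrayList<>();
        boolean escalating = a.escalationReason() != null;
        if (a.confidence() == null || a.confidence() < 0.0 || a.confidence() > 1.0) {
            found.add("confidence must be between 0.0 and 1.0");
        }
        if (!escalating && (a.text() == null || a.text().isBlank())) {
            found.add("text must not be blank unless escalating");
        }
        // Assumption: every automated answer must be traceable to at least one source chunk.
        if (!escalating && (a.citations() == null || a.citations().length == 0)) {
            found.add("non-escalated answers need at least one citation");
        }
        return found;
    }
}
```

An answer with any violation can be routed to the same escalation path as a failed conversion, so a degraded model never reaches the user with an out-of-contract response.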

Tool-Calling Reliability: Hallucinated Tools and Schema Retries

The dirty secret of LLM tool calling is that models hallucinate tool names that look plausible but do not exist in the registered set. A model that has seen getOrderStatus thousands of times in training will happily emit get_order_details, lookupOrder, or fetchOrderTrackingInfo when the user phrases the question slightly differently — none of which are registered. The framework's default behavior is to either throw or silently drop the call, both of which produce a degraded user experience without any signal to the operator.

The fix is a registry-aware dispatcher that treats the LLM's tool call as untrusted input and validates it against the actual registry before dispatch. When a hallucinated tool name arrives, the dispatcher does not throw — it injects a corrective system message back into the conversation listing the valid tools and asks the model to retry. Two retries is the sweet spot: more than that and the model is stuck in a confusion loop the dispatcher cannot break.

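// Sketch: Tool, ToolCall, ToolResult, ConversationState, continueConversation() and
// extractToolCall() are application-level abstractions here, not Spring AI types
@Service
public class ToolCallDispatcher {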
 
    private final Map<String, Tool> registry;
    private final ChatClient chatClient;
    private final Counter hallucinationCounter;
    private static final int MAX_RETRIES = 2;
 
    public ToolCallDispatcher(List<Tool> tools, ChatClient client, MeterRegistry meters) {
        this.registry = tools.stream().collect(Collectors.toMap(Tool::name, t -> t));
        this.chatClient = client;
        this.hallucinationCounter = meters.counter("ai.tool.hallucinations");
    }
 
    public ToolResult dispatch(ToolCall call, ConversationState conv) {
        for (int attempt = 0; attempt <= MAX_RETRIES; attempt++) {
            Tool tool = registry.get(call.name());
            if (tool == null) {
                hallucinationCounter.increment(); // don't tag by invented name — unbounded cardinality
                String correction = "Tool '" + call.name()
                    + "' does not exist. Valid tools: " + String.join(", ", registry.keySet())
                    + ". Retry with one of these or respond directly.";
                conv.addSystemMessage(correction);
                call = chatClient.continueConversation(conv).extractToolCall();
                continue;
            }
            try {
                tool.validateArguments(call.arguments());
                return tool.invoke(call.arguments());
            } catch (SchemaValidationException sve) {
                conv.addSystemMessage("Argument schema invalid: " + sve.getMessage()
                    + ". Required schema: " + tool.schemaJson());
                call = chatClient.continueConversation(conv).extractToolCall();
            }
        }
        return ToolResult.escalate("Tool resolution failed after " + MAX_RETRIES + " retries");
    }
}

Schema validation deserves the same retry treatment. When the model emits {"orderNumber": 12345} for a tool that requires a string, do not let Jackson throw a 500; feed the validation error back to the model with the expected schema and let it correct itself within the same retry budget. The dispatcher above counts only invented tool names in ai.tool.hallucinations. Adding a bounded tag such as kind=unknown_tool or kind=invalid_arguments (or a second counter) separates "invented tool name" from "wrong argument type", so the operator can tell the difference between a registry-coverage problem (add the missing tool) and a schema-clarity problem (rename a field, add an example).

RAG Federation Across Multiple Knowledge Bases

Real support flows do not have a single knowledge base. Refund policies live in one repository, product documentation in another, and engineering runbooks in a third — each with its own access controls, refresh cadence, and chunking strategy. The naive approach is to concatenate all three into one giant pgvector table and hope the similarity search picks the right one. In practice this destroys retrieval quality: refund policy chunks compete with product manual chunks for the top-4 slots, and the LLM ends up with mixed context that produces confused answers.

The federation pattern routes the question to the right knowledge base first, then retrieves only from that base. A lightweight classifier — either a small fine-tuned model or a keyword router — picks the namespace, and the vector store query is scoped to that namespace alone. When the classifier is uncertain, the retriever fans out to all eligible bases in parallel and re-ranks the union by similarity score, so the system gracefully degrades from "scoped retrieval" to "broad retrieval" rather than failing.

@Service
public class FederatedRetriever {
 
    private final Map<KnowledgeBase, VectorStore> stores;
    private final NamespaceClassifier classifier;
    private final Executor parallelExecutor;
    private static final double UNCERTAIN_THRESHOLD = 0.6;
 
    public List<Document> retrieve(String question, AuthContext ctx) {
        ClassificationResult routing = classifier.classify(question);
 
        if (routing.confidence() >= UNCERTAIN_THRESHOLD) {
            VectorStore store = stores.get(routing.target());
            return store.similaritySearch(SearchRequest.builder()
                .query(question)
                .topK(4)
                .filterExpression(authFilter(ctx))
                .build());
        }
 
        List<CompletableFuture<List<Document>>> fanOut = stores.entrySet().stream()
            .filter(e -> ctx.canAccess(e.getKey()))
            .map(e -> CompletableFuture.supplyAsync(
                () -> e.getValue().similaritySearch(
                    SearchRequest.builder()
                        .query(question)
                        .topK(2)
                        .filterExpression(authFilter(ctx))
                        .build()),
                parallelExecutor))
            .toList();
 
        return fanOut.stream()
            .map(CompletableFuture::join)
            .flatMap(List::stream)
            .sorted(Comparator.comparingDouble(
                (Document d) -> d.getScore() != null ? d.getScore() : 0.0)
                .reversed()) // highest similarity first; comparable only if all stores share one embedding model
            .limit(4)
            .toList();
    }
 
    private Filter.Expression authFilter(AuthContext ctx) {
        // and(...) is a method on the builder that combines two operands
        FilterExpressionBuilder b = new FilterExpressionBuilder();
        return b.and(b.eq("tenant_id", ctx.tenantId()),
                     b.lte("classification", ctx.clearanceLevel()))
                .build();
    }
}

The auth filter on every retrieval is non-negotiable. Without it, a federated retriever is the easiest way in the codebase to leak data across tenants — a refund-policy retrieval for tenant A returns chunks from tenant B because they happen to be more semantically similar to the question. The filter pushes the access-control predicate into the vector index itself rather than filtering after the fact, which both preserves correctness and keeps the top-4 slots full of authorized chunks.
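
The keyword-router option for the classifier can be sketched without any model at all. This is an illustration under assumptions: `KeywordNamespaceRouter` and `Routing` are hypothetical names, keywords are expected in lower case, and confidence is defined here as the share of all keyword hits that belong to the winning knowledge base, which is only one of several reasonable definitions.

```java
import java.util.List;
import java.util.Locale;
import java.util.Map;

/** Routes a question to a knowledge base by counting keyword hits (illustrative sketch). */
class KeywordNamespaceRouter {
    record Routing(String target, double confidence) {}

    private final Map<String, List<String>> keywordsByBase;

    KeywordNamespaceRouter(Map<String, List<String>> keywordsByBase) {
        this.keywordsByBase = keywordsByBase;
    }

    Routing classify(String question) {
        String q = question.toLowerCase(Locale.ROOT);
        String best = null;
        int bestHits = 0;
        int totalHits = 0;
        for (Map.Entry<String, List<String>> entry : keywordsByBase.entrySet()) {
            int hits = 0;
            for (String keyword : entry.getValue()) {
                if (q.contains(keyword)) {
                    hits++;
                }
            }
            totalHits += hits;
            if (hits > bestHits) { // ties keep the first base in iteration order
                bestHits = hits;
                best = entry.getKey();
            }
        }
        if (totalHits == 0) {
            return new Routing(null, 0.0); // nothing matched: the caller should fan out
        }
        return new Routing(best, (double) bestHits / totalHits);
    }
}
```

A question that hits keywords from two bases equally gets a confidence of 0.5, which is below the 0.6 threshold used by the retriever and so triggers the fan-out path.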

Cost Attribution Per Spring Profile

Token cost surprises are operational failures, not accounting failures. The team wakes up to a bill that is 4x higher than last month with no idea which environment generated it — was it the staging soak test that ran overnight, the dev profile that someone left enabled with gpt-4 instead of gpt-4o-mini, or the production tenant who started a runaway loop? Without attribution at the profile and tenant level, the answer is "everyone has to investigate everything," which is how a 12-hour incident review starts.

Spring profiles are the right granularity because they map cleanly to environments, and MeterRegistry tags are how you carry that attribution into Prometheus without rewriting your metrics layer. The profile name comes from Environment.getActiveProfiles() and goes onto every token-usage counter as a tag, so a single PromQL query can break down spend by environment, model, and tenant simultaneously — which is the dimension you actually need when paging the on-call.

@Component
public class ProfiledCostMeter {

    private final MeterRegistry registry;
    private final String activeProfile;
    private final ModelPricing pricing;

    public ProfiledCostMeter(Environment env, MeterRegistry registry, ModelPricing pricing) {
        this.registry = registry; // use the injected registry, not Metrics.globalRegistry
        this.activeProfile = String.join(",", env.getActiveProfiles());
        this.pricing = pricing;
    }

    public void recordCall(String tenantId, String model, Usage usage) {
        // Floating-point math: integer division would round most single calls down to 0 cents
        double promptCents = pricing.promptCentsPer1k(model) * usage.getPromptTokens() / 1000.0;
        double completionCents = pricing.completionCentsPer1k(model) * usage.getCompletionTokens() / 1000.0;

        costCounter(tenantId, model, "prompt").increment(promptCents);
        costCounter(tenantId, model, "completion").increment(completionCents);
    }

    // A fresh builder per lookup: Counter.Builder.tags(...) appends, so reusing one builder
    // would pile up duplicate tags. register() returns the existing counter if there is one.
    private Counter costCounter(String tenantId, String model, String kind) {
        return Counter.builder("ai.cost.usd.cents")
            .description("AI cost in USD cents, attributed by profile, model, and tenant")
            .tags("profile", activeProfile, "model", model, "tenant", tenantId, "kind", kind)
            .register(registry);
    }
}

Pair this with a daily Prometheus recording rule that aggregates sum by (profile, tenant) (increase(ai_cost_usd_cents_total[24h])) and a Grafana panel that flags any profile crossing the 80%-of-monthly-budget threshold mid-month. Use increase rather than rate here: increase gives cents spent over the window, while rate gives a per-second average. The recording rule is what turns "we got a bill surprise" into "the staging profile crossed budget on day 14, the on-call got paged on day 14, the runaway test was killed on day 14", which is the only cost-incident timeline that does not end in a postmortem. ^{[Prometheus Best Practices]}

Frequently Asked Questions

What is Spring AI and how does it differ from LangChain?

Spring AI is Spring's official framework for integrating AI/LLM capabilities into Java applications. Unlike LangChain (Python/JS), Spring AI follows Spring conventions — dependency injection, auto-configuration, and the Spring ecosystem. It provides a portable API across OpenAI, Azure OpenAI, Ollama, and other providers.

How do I prevent prompt injection in Spring AI applications?

Use input validation to reject known injection patterns, implement output guardrails that verify LLM responses against expected schemas, set strict system prompts that instruct the model to ignore user-injected instructions, and apply token budgets to prevent resource exhaustion from adversarial inputs.

Can Spring AI work with local models like Ollama?

Yes, Spring AI supports Ollama out of the box via the spring-ai-starter-model-ollama dependency (the 1.x starter naming, matching spring-ai-starter-model-openai above). Configure the base URL and model name in application properties, and the same ChatClient API works for local and cloud-hosted models.

Keep Reading

Vector Databases Compared: pgvector vs Pinecone vs Weaviate — Benchmarks, scaling limits, and the migration thresholds for choosing the right vector store for your RAG pipeline
Building Production RAG Pipelines in Go — The Go equivalent: chunking strategies, embedding pipelines, pgvector operations, and retrieval evaluation
Spring Boot REST Microservice Patterns — The foundational Spring Boot patterns for the REST layer your AI-powered endpoints sit behind
