Embedding an LLM feels like a sprint; keeping cost bounded is at least a quarter-long job. POCs look cheap: few users, short prompts, one model. When traffic arrives, the bill and latency grow together — and "we'll fix it later" gets expensive.
A three-layer approach works for me. First, prompt cache: don't re-send identical system prompts and RAG chunks every request. Redis or in-app cache with a hash of prompt + model + temperature is enough. Second, model routing: simple classification or summaries to a small model, complex analysis to a large one. Routing rules belong in code, not slides. Third, hard ceiling: daily token cap per user or tenant; on exceed, degrade gracefully instead of crashing.
When the primary model times out, I prefer returning a cached or fallback summary instead of 503. The user sees "summary mode for now"; core flows like checkout stay up. Same idea on rate limits: intentional UX, not a bare 429.
Design "model didn't answer" as carefully as the happy path. Cost alerts should let you feature-flag the LLM off and keep operating.
