Srijith Ravikumar is a Principal Engineer at Amazon building AI-powered recommendation systems at scale. Published researcher at AAAI.

For over a decade, the Holy Grail of e-commerce and digital retail has been the "segment of one." It’s a compelling marketing tagline, but for those of us tasked with engineering these systems to handle millions of daily interactions, it was largely a polite fiction.
Operating under strict, sub-100 millisecond latency budgets forces architectural compromises. You simply cannot compute true, individualized intent on the fly at scale. Today, every technology leader faces a boardroom mandate to inject Large Language Models (LLMs) into their customer experience to finally achieve n=1 personalization. But as we transition from pre-computed assumptions to real-time generative reasoning, we are colliding with a harsh new reality: true personalization is no longer merely a technology problem; it is a unit economics problem.
The 'Doppelganger' Era And Lossy Abstractions
For years, the gold standard of personalization was collaborative filtering and matrix factorization. At its core, this was an exercise in finding a customer’s statistical doppelganger. To hit latency targets, we relied on offline pre-computation: massive batch jobs, using techniques such as Singular Value Decomposition and item-to-item similarity, produced lookup structures that we loaded into fast key-value stores.
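To make the offline/online split concrete, here is a minimal sketch of the pattern. It uses co-purchase cosine similarity rather than Singular Value Decomposition to keep the example short, and the purchase data, function names and the in-memory dict standing in for a key-value store are all illustrative assumptions.

```python
from collections import defaultdict
from itertools import combinations
from math import sqrt

# Hypothetical purchase histories: user -> set of item IDs.
purchases = {
    "u1": {"tent", "stove", "lantern"},
    "u2": {"tent", "stove"},
    "u3": {"tent", "lantern"},
    "u4": {"stove", "kettle"},
}

def build_item_neighbors(purchases, top_k=2):
    """Offline batch step: cosine similarity between items over co-purchases."""
    item_count = defaultdict(int)
    pair_count = defaultdict(int)
    for items in purchases.values():
        for item in items:
            item_count[item] += 1
        for a, b in combinations(sorted(items), 2):
            pair_count[(a, b)] += 1

    neighbors = defaultdict(list)
    for (a, b), together in pair_count.items():
        score = together / sqrt(item_count[a] * item_count[b])
        neighbors[a].append((b, score))
        neighbors[b].append((a, score))

    # Keep the top_k neighbors per item, highest score first (ties broken by name).
    return {
        item: sorted(scored, key=lambda pair: (-pair[1], pair[0]))[:top_k]
        for item, scored in neighbors.items()
    }

# This dict stands in for the fast key-value store.
ITEM_NEIGHBORS = build_item_neighbors(purchases)

def recommend(item_id):
    """Online step: a single key lookup, no model inference."""
    return [other for other, _ in ITEM_NEIGHBORS.get(item_id, [])]
```

The online path never sees the user's current intent. It returns whatever the last batch run decided, and an item the batch has never seen returns nothing.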
These pre-computed structures were incredibly fast and cheap, but they were ultimately lossy abstractions. As noted in recent comprehensive reviews like The Application of Large Language Models in Recommendation Systems, traditional collaborative filtering struggles with data sparsity and cold-start problems. We compressed the chaotic reality of human behavior into fixed mathematical vectors. Consequently, traditional systems operate as black boxes, entirely incapable of explaining why an item was recommended beyond the opaque rationale of "people like you bought this."
The Generative Shift And The Output Surcharge
In the modern era, finding a user's doppelganger is insufficient. We must process unstructured data (real-time search semantics, conversational intent and session context) to dynamically synthesize recommendations. In this architecture, the model ranks candidate items for a specific user at request time instead of reading from a static, pre-computed table.
However, technology leaders must approach this shift with economic pragmatism. The generative nature of modern recommendations inherently triggers an "output token surcharge." Across the industry, LLM API output tokens are typically priced at 3x to 10x the rate of input tokens. While servicing one million database read requests costs mere pennies, running one million generative interactions through a frontier reasoning model (like Claude Sonnet 4.5 or GPT-5) can easily cost over $15,000 in inference alone. Furthermore, with the global AI sector's energy footprint projected to reach 85 to 134 terawatt-hours per year by 2027, power constraints are making massive model deployments increasingly difficult to scale.
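The surcharge is easiest to see as arithmetic. The sketch below is a back-of-the-envelope model, not a quote of any provider's price list: the token counts per interaction and the per-million-token prices are assumptions chosen only so that the output price is 5x the input price, inside the 3x to 10x range above.

```python
def inference_cost_usd(requests, input_tokens, output_tokens,
                       input_price_per_m, output_price_per_m):
    """Return (input_cost, output_cost) in USD for a batch of requests."""
    input_cost = requests * input_tokens / 1_000_000 * input_price_per_m
    output_cost = requests * output_tokens / 1_000_000 * output_price_per_m
    return input_cost, output_cost

# Illustrative assumptions only: token counts and prices are hypothetical.
input_cost, output_cost = inference_cost_usd(
    requests=1_000_000,
    input_tokens=2_000,        # prompt: user context plus candidate items
    output_tokens=600,         # generated ranking and explanation
    input_price_per_m=3.00,    # USD per million input tokens
    output_price_per_m=15.00,  # USD per million output tokens (5x input)
)

total_cost = input_cost + output_cost
output_share_of_cost = output_cost / total_cost
output_share_of_tokens = 600 / (2_000 + 600)
```

Under these assumptions the total lands at about $15,000 per million interactions, and output tokens account for roughly 23% of the tokens but 60% of the bill. Trimming the generated response is therefore usually a stronger cost lever than trimming the prompt.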
If the infrastructure cost of generating a hyper-personalized recommendation outpaces the marginal return on the conversion it creates, the system is fundamentally broken. AI might drive the click, but it destroys the margin.
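That break-even condition can be written down directly. In this sketch the cost per call, the margin per order and the conversion lifts are hypothetical inputs; the point is the shape of the check, not the numbers.

```python
def break_even_lift(cost_per_call_usd, margin_per_order_usd):
    """Minimum incremental conversion rate at which one generated call pays for itself."""
    return cost_per_call_usd / margin_per_order_usd

def net_margin_per_call(cost_per_call_usd, conversion_lift, margin_per_order_usd):
    """Expected incremental margin from one recommendation, net of inference cost."""
    return conversion_lift * margin_per_order_usd - cost_per_call_usd

# Hypothetical inputs for illustration.
COST_PER_CALL = 0.015    # USD of inference per generated recommendation
MARGIN_PER_ORDER = 6.00  # USD of gross margin on a converted order

required_lift = break_even_lift(COST_PER_CALL, MARGIN_PER_ORDER)
net_at_low_lift = net_margin_per_call(COST_PER_CALL, 0.002, MARGIN_PER_ORDER)
net_at_high_lift = net_margin_per_call(COST_PER_CALL, 0.004, MARGIN_PER_ORDER)
```

With these inputs the recommendation must add a quarter of a percentage point of conversion just to break even. A lift of 0.2 points loses money on every call even though it drives more clicks.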
The Imperative Of Semantic Caching
The primary defense against runaway generative costs is modernizing the API gateway through semantic caching. When transitioning to generative personalization, retailers quickly realize that many user queries are semantically identical despite slight phrasing variations.
By converting user queries and catalog SKUs into high-dimensional vector embeddings, systems can perform real-time vector similarity searches. If a new query falls within a strict similarity threshold of a previously cached response, the system intercepts the request and returns the generative response instantly.
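A minimal sketch of that lookup follows. The three-dimensional vectors are hand-written stand-ins for what an embedding model would produce, the 0.95 threshold is an arbitrary assumption, and the linear scan would be replaced by a vector index in any real deployment.

```python
from math import sqrt

def cosine_similarity(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    norm = sqrt(sum(x * x for x in a)) * sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0

class SemanticCache:
    def __init__(self, threshold=0.95):
        self.threshold = threshold
        self.entries = []  # list of (embedding, response)

    def lookup(self, embedding):
        """Return the closest cached response if it clears the threshold, else None."""
        best_score, best_response = 0.0, None
        for cached_embedding, response in self.entries:
            score = cosine_similarity(embedding, cached_embedding)
            if score > best_score:
                best_score, best_response = score, response
        return best_response if best_score >= self.threshold else None

    def store(self, embedding, response):
        self.entries.append((embedding, response))

def answer(query_embedding, cache, call_llm):
    """Serve from the cache when possible; otherwise call the model and cache the result."""
    cached = cache.lookup(query_embedding)
    if cached is not None:
        return cached, "cache"
    response = call_llm()
    cache.store(query_embedding, response)
    return response, "llm"

# Toy embeddings (hypothetical): two phrasings of one intent, plus an unrelated query.
WATERPROOF_BOOTS = [0.90, 0.10, 0.00]      # "waterproof hiking boots"
BOOTS_REPHRASED = [0.88, 0.12, 0.01]       # "hiking boots that are waterproof"
ESPRESSO_MACHINE = [0.00, 0.20, 0.95]      # "espresso machine"
```

The threshold is the main tuning decision. Set it too low and users receive answers to a question they did not ask; set it too high and the hit rate collapses toward exact-match caching.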
The operational impacts are staggering. In massive e-commerce deployments, such as those implemented by Walmart, semantic caching has transformed search economics, achieving cache hit rates of approximately 50% for long-tail queries. Across the enterprise landscape, semantic caching gateways are reducing AI API costs by 40% to 70% while dropping response times from an 850 ms LLM call to a 120 ms cache hit. Mastering Time-To-Live (TTL) policies and cache invalidation strategies is now just as critical as prompt engineering.
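As a rough illustration of what TTL and invalidation mean for cached generative responses, the sketch below expires entries after a fixed lifetime and also evicts them on demand, for example when a recommended SKU goes out of stock. The class, the 60-second TTL used in the example and the shape of the cached value are assumptions made for the example.

```python
import time

class TTLCache:
    def __init__(self, ttl_seconds, clock=time.monotonic):
        self.ttl_seconds = ttl_seconds
        self.clock = clock  # injectable so expiry can be tested without sleeping
        self._store = {}    # key -> (expires_at, value)

    def set(self, key, value):
        self._store[key] = (self.clock() + self.ttl_seconds, value)

    def get(self, key):
        """Return the cached value, or None if it is missing or expired."""
        entry = self._store.get(key)
        if entry is None:
            return None
        expires_at, value = entry
        if self.clock() >= expires_at:
            del self._store[key]
            return None
        return value

    def invalidate(self, predicate):
        """Evict every entry whose value matches the predicate; return how many were evicted."""
        stale = [key for key, (_, value) in self._store.items() if predicate(value)]
        for key in stale:
            del self._store[key]
        return len(stale)

# Example: evict any cached response that recommends a SKU that just went out of stock.
def mentions_sku(sku):
    return lambda value: sku in value["skus"]
```

Time-based expiry bounds how stale a response can get, while event-driven invalidation handles the cases a clock cannot, such as price changes and stock-outs.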
The 'Intelligent Planner': SLMs And Dynamic Routing
The future of personalization relies on an intelligent routing layer—a dynamic "Personalization Planner"—that orchestrates multi-model architectures. The 2026 reality is defined by the strategic deployment of highly quantized Small Language Models (SLMs) working in tandem with frontier models.
Instead of treating AI as a monolith, modern routers evaluate the complexity of a user's intent in real time. High-volume, routine queries are routed to budget-tier SLMs, which operate at a fraction of a dollar per million tokens. For complex, ambiguous intents requiring deep reasoning, the router dynamically escalates the request to a frontier model. Advanced architectures utilizing contextual bandits and reinforcement learning have demonstrated the ability to deliver over 97% of a frontier model's generation quality while consuming less than 25% of the computational cost.
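The routing decision can be sketched with a fixed heuristic. This example deliberately swaps in a hand-written complexity score and a static threshold in place of the learned policies (contextual bandits, reinforcement learning) mentioned above; the model names, prices, marker phrases and threshold are all hypothetical.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class ModelTier:
    name: str
    usd_per_million_tokens: float

# Hypothetical tiers and prices, for illustration only.
SLM = ModelTier("small-quantized-model", 0.20)
FRONTIER = ModelTier("frontier-model", 15.00)

# Hypothetical phrases that hint at ambiguous or multi-constraint intent.
AMBIGUITY_MARKERS = ("not sure", "something like", "for someone who", "compare", "gift for")

def complexity_score(query: str) -> float:
    """Crude 0..1 score: longer queries with more ambiguity markers score higher."""
    words = query.lower().split()
    text = " ".join(words)
    length_signal = min(len(words) / 20, 1.0)
    marker_signal = sum(marker in text for marker in AMBIGUITY_MARKERS) / len(AMBIGUITY_MARKERS)
    return 0.5 * length_signal + 0.5 * marker_signal

def route(query: str, threshold: float = 0.3) -> ModelTier:
    """Send routine queries to the SLM and escalate complex ones to the frontier model."""
    return FRONTIER if complexity_score(query) >= threshold else SLM

def blended_price(escalation_rate: float) -> float:
    """Average USD per million tokens when a given share of traffic is escalated."""
    return escalation_rate * FRONTIER.usd_per_million_tokens + \
        (1 - escalation_rate) * SLM.usd_per_million_tokens

routine = route("red running shoes size 10")
complex_intent = route(
    "not sure what to get, something like a gift for someone who hikes and hates heavy gear"
)
```

Under these assumed prices, escalating 20% of traffic gives a blended price of about $3.16 per million tokens, roughly a fifth of sending everything to the frontier tier. Whether quality holds up at that escalation rate is an empirical question the heuristic cannot answer, which is why learned routers are evaluated against a frontier-only baseline.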
We see this hybrid approach actively driving revenue in the field. At the KDD 2025 PARIS Workshop, DoorDash presented a revolutionary personalization framework utilizing Hierarchical Retrieval-Augmented Generation (RAG) and Semantic IDs. This allows them to balance familiarity, affordability and generative novelty across millions of SKUs without violating latency constraints. Similarly, Starbucks’ Deep Brew platform relies on contextual orchestration to deliver over 2.3 billion personalized experiences annually, driving a reported 30% return on investment. Meanwhile, legacy brands like Nordstrom are utilizing agentic architectures to handle the heavy personalization lifting, freeing human stylists to focus on high-touch relationship building.
Mastering The Math Of Scale
We are no longer bound by the rigid, pre-computed data structures of the past, but we are absolutely constrained by the economic realities of the present. While research firms project inference costs will drop significantly over the next decade, engineering leaders must build for the margins of today.
The next great competitive moat in digital retail will belong to the engineering teams that build the smartest routing architectures. Building a sustainable generative engine requires three tactical steps: deploying semantic caching at the infrastructure gateway, routing routine queries to heavily quantized SLMs, and reserving frontier models strictly for complex intent reasoning. By mastering dynamic orchestration, we can finally deliver the magic of the "segment of one" while protecting our unit economics with the brutal efficiency of scale.
