eazyware
Engineering·May 28, 2025·11 min read

Retail personalization: beyond "customers who bought"

Amazon-style recs are table stakes. The next layer combines embeddings, session context, and real-time LLM scoring.

Kushal R.
Engineering lead

'Customers who bought this also bought...' is a decades-old pattern. It works, and it accounts for significant revenue at every major retailer, but it's a floor, not a ceiling. The next generation of retail personalization combines embeddings, real-time session context, and LLM-based scoring to deliver recommendations that feel personal rather than popular. When done well, these systems lift conversion 10-20% above collaborative-filtering baselines. When done badly, they're expensive toys.

Figure: Three-layer retail personalization stack (build bottom-up; each layer gates the next)
  • Layer 3 · LLM session reranker: intent-aware scoring on the top 100 candidates, per session (+5–10 pts lift)
  • Layer 2 · Content embeddings: product + user embeddings for cold-start and long-tail (+3–6 pts lift)
  • Layer 1 · Collaborative filtering: matrix factorization, "customers also bought" (foundation baseline)
CF is the foundation (~70% of value). Embeddings solve cold-start. LLM session reranker captures real-time intent. Each layer adds marginal lift on top of the previous.

The three layers of modern retail personalization

Layer 1: Collaborative filtering baseline

The foundation remains CF: who bought what, aggregated across users, used to suggest 'customers like you bought this.' The usual options are matrix factorization, item embeddings, and neural CF variants. This layer covers roughly 70% of the personalization value. Don't skip it, and don't rebuild it with an LLM. CF at scale is a solved problem.
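As a rough illustration of the "who bought what" signal, here is a minimal sketch that uses plain co-occurrence counting. This is a simpler stand-in for the matrix factorization mentioned above, not a replacement for it. The basket data and the function names (build_cooccurrence, also_bought) are made up for the example.

```python
from collections import Counter, defaultdict
from itertools import combinations


def build_cooccurrence(baskets):
    """Count how often each pair of items appears in the same basket."""
    co = defaultdict(Counter)
    for basket in baskets:
        # set() removes duplicate items; sorted() makes pair order stable.
        for a, b in combinations(sorted(set(basket)), 2):
            co[a][b] += 1
            co[b][a] += 1
    return co


def also_bought(co, item, k=3):
    """Return up to k items most often bought together with `item`."""
    counts = co.get(item, Counter())
    # Highest count first; ties are broken alphabetically.
    ranked = sorted(counts.items(), key=lambda kv: (-kv[1], kv[0]))
    return [other for other, _ in ranked[:k]]


# Hypothetical purchase baskets.
baskets = [
    ["tent", "stove", "headlamp"],
    ["tent", "stove"],
    ["tent", "sleeping_bag"],
    ["stove", "fuel"],
]
co = build_cooccurrence(baskets)
print(also_bought(co, "tent"))  # ['stove', 'headlamp', 'sleeping_bag']
```

Raw counts favor popular items, which is one reason production systems normalize or factorize the interaction matrix instead of stopping here.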

Layer 2: Content embeddings for cold-start

CF fails for new products and for new users, because neither has any purchase history. Content embeddings solve both: embed each product's description, image, and category, and embed each user's interaction history as a vector in the same space. Similarity between the user vector and the product vectors ranks products the user hasn't seen, including products with no purchase signal yet. This layer matters more as catalog size grows and long-tail items multiply.
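A minimal sketch of the ranking step, assuming embeddings already exist. The three-dimensional vectors below are hand-written toys (real embeddings come from a text or image model and have hundreds of dimensions), and averaging the history is just one simple way to build a user vector. All names are illustrative.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def user_vector(history_vectors):
    """Average the embeddings of the items a user interacted with."""
    n = len(history_vectors)
    dims = len(history_vectors[0])
    return [sum(v[i] for v in history_vectors) / n for i in range(dims)]


def rank_unseen(user_vec, product_vecs, seen, k=2):
    """Rank products the user has not seen by similarity to the user vector."""
    scored = [
        (cosine(user_vec, vec), pid)
        for pid, vec in product_vecs.items()
        if pid not in seen
    ]
    scored.sort(key=lambda s: (-s[0], s[1]))
    return [pid for _, pid in scored[:k]]


# Toy embeddings; the axes are loosely (outdoor, kids, kitchen).
product_vecs = {
    "trail_shoes": [0.9, 0.1, 0.0],
    "kids_raincoat": [0.6, 0.8, 0.0],
    "new_tent": [1.0, 0.0, 0.1],  # new product with no purchases yet
    "blender": [0.0, 0.0, 1.0],
}
seen = {"trail_shoes"}
u = user_vector([product_vecs["trail_shoes"]])
print(rank_unseen(u, product_vecs, seen))  # ['new_tent', 'kids_raincoat']
```

The new tent ranks first even though nobody has bought it, which is the cold-start case CF cannot handle.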

Layer 3: LLM-based session scoring

The newest layer: at session time, combine user history, current session signals (what they're viewing, how long, what they're skipping), and product descriptions, and let an LLM score candidate products for this specific session. Not for the whole catalog — for the top 100 candidates from layers 1 and 2. The LLM reranks for the current intent.

Example: a user's history suggests they like outdoor gear. Their current session shows they're looking at kids' products. A CF-only system might show them technical outdoor gear for themselves; the LLM can notice 'they're shopping for their kids, show outdoor gear for kids.' This kind of intent-aware personalization lifts session-level conversion measurably.
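The reranking step might look roughly like the sketch below. The prompt wording, the Candidate type, and the score_fn hook are assumptions made for illustration. stub_llm_score is a keyword placeholder standing in for a real model call, so the example runs without any LLM client.

```python
from dataclasses import dataclass
from typing import Callable, List


@dataclass
class Candidate:
    product_id: str
    description: str


def build_prompt(history: List[str], session: List[str], cand: Candidate) -> str:
    """Assemble one scoring prompt per candidate (wording is illustrative)."""
    return (
        "Rate 0-10 how well this product fits the shopper's current intent.\n"
        f"Long-term interests: {', '.join(history)}\n"
        f"Current session views: {', '.join(session)}\n"
        f"Product: {cand.description}\n"
        "Answer with a single number."
    )


def rerank(candidates: List[Candidate], history: List[str], session: List[str],
           score_fn: Callable[[str], float], top_n: int = 100) -> List[Candidate]:
    """Score only the top_n candidates and sort them by score, highest first."""
    shortlist = candidates[:top_n]
    scored = [
        (score_fn(build_prompt(history, session, c)), i, c)
        for i, c in enumerate(shortlist)
    ]
    # Ties keep the upstream (CF / embedding) order.
    scored.sort(key=lambda t: (-t[0], t[1]))
    return [c for _, _, c in scored]


def stub_llm_score(prompt: str) -> float:
    """Placeholder for a real model call: favors products that mention kids."""
    product_line = next(
        line for line in prompt.splitlines() if line.startswith("Product:")
    )
    return 8.0 if "kids" in product_line.lower() else 3.0


candidates = [
    Candidate("p1", "Technical alpine jacket for adults"),
    Candidate("p2", "Kids' waterproof hiking boots"),
    Candidate("p3", "Ultralight trekking poles"),
]
ranked = rerank(
    candidates,
    history=["outdoor gear", "climbing"],
    session=["kids' rain jacket", "kids' backpack"],
    score_fn=stub_llm_score,
)
print([c.product_id for c in ranked])  # ['p2', 'p1', 'p3']
```

Passing the scorer in as a function keeps the reranker testable and makes it easy to swap models or add a cache in front of the call.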

Implementation architecture

  1. Offline: train CF model, compute product embeddings, compute user embeddings from history.
  2. Session start: pull user's CF recommendations and embedding-based candidates. Cache.
  3. Session interaction: maintain a short session-context vector (pages viewed, time on page, scroll depth).
  4. At each recommendation slot: combine candidates from CF and embeddings, rerank with LLM using user context + session context.
  5. Log everything for training next-day models.
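Steps 3 and 4 of the flow above can be sketched as follows, under the assumption that candidates arrive as ranked lists of product IDs. The SessionContext fields, the 10-second engagement threshold, and the round-robin merge are illustrative choices, not a prescribed design.

```python
from dataclasses import dataclass, field
from itertools import zip_longest


@dataclass
class SessionContext:
    """Step 3: rolling record of what the shopper did in this session."""
    events: list = field(default_factory=list)

    def record(self, product_id, seconds_on_page, scroll_depth):
        self.events.append(
            {"product_id": product_id, "seconds": seconds_on_page, "scroll": scroll_depth}
        )

    def engaged_products(self, min_seconds=10):
        """Products the shopper actually dwelt on (threshold is arbitrary)."""
        return [e["product_id"] for e in self.events if e["seconds"] >= min_seconds]


def merge_candidates(cf_list, emb_list, limit=100):
    """Step 4, first half: interleave both sources and drop duplicates."""
    merged, seen = [], set()
    for pair in zip_longest(cf_list, emb_list):
        for pid in pair:
            if pid is not None and pid not in seen:
                seen.add(pid)
                merged.append(pid)
    return merged[:limit]


session = SessionContext()
session.record("kids_boots", seconds_on_page=42, scroll_depth=0.8)
session.record("blender", seconds_on_page=3, scroll_depth=0.1)

candidates = merge_candidates(["a", "b", "c"], ["b", "d"])
print(candidates)                  # ['a', 'b', 'd', 'c']
print(session.engaged_products())  # ['kids_boots']
```

The merged list and the engaged products are the two inputs the LLM reranker would receive at a recommendation slot.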

Latency constraints

Retail personalization runs at high volume — every product grid, every cart, every email has recommendations. Latency budgets are tight, usually 100-300ms end-to-end. This rules out per-request LLM calls for the main catalog. Strategies:

  • LLM rerank only top-of-page modules where the ROI justifies the latency.
  • Cache LLM outputs aggressively, keyed by (user segment, session intent).
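  • Use smaller/faster models for reranking (GPT-4o-mini class, sub-500ms). Even these can exceed a 100-300ms budget, so reserve them for modules that can load asynchronously or be served from cache.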
  • Pre-compute reranked lists for predictable contexts (email campaigns, notifications).
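The caching strategy in the list above can be sketched as a small in-process TTL cache keyed by (user segment, session intent). The class name, the 300-second TTL, and the segment and intent labels are assumptions for the example. A production setup would more likely use a shared cache service.

```python
import time


class RerankCache:
    """TTL cache for reranked lists, keyed by (user_segment, session_intent)."""

    def __init__(self, ttl_seconds=300, clock=time.monotonic):
        self.ttl = ttl_seconds
        self.clock = clock  # injectable so expiry can be tested
        self._store = {}

    def get_or_compute(self, segment, intent, compute):
        key = (segment, intent)
        now = self.clock()
        hit = self._store.get(key)
        if hit is not None and now - hit[0] < self.ttl:
            return hit[1]  # fresh entry: skip the expensive call
        value = compute()
        self._store[key] = (now, value)
        return value


calls = []


def slow_rerank():
    """Stand-in for an LLM rerank call; records how often it runs."""
    calls.append(1)
    return ["kids_boots", "kids_raincoat"]


cache = RerankCache(ttl_seconds=300)
first = cache.get_or_compute("outdoor_enthusiast", "shopping_for_kids", slow_rerank)
second = cache.get_or_compute("outdoor_enthusiast", "shopping_for_kids", slow_rerank)
print(len(calls))  # 1: the second request was served from cache
```

Keying on segment and intent rather than on individual users trades some precision for a much higher hit rate, which is what keeps the LLM off the hot path.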

Measuring personalization

Offline: precision@k on held-out purchases. Online: A/B test with revenue and conversion metrics. Always run online — offline metrics and online metrics diverge frequently, and online is ground truth. Budget 2-4 weeks per significant change for statistical significance.
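Both measurements can be sketched in a few lines. The numbers below are invented for illustration. The two-proportion z-test is one common way to read an A/B conversion result and is an assumption here, not necessarily the test your experimentation platform uses.

```python
import math


def precision_at_k(recommended, purchased, k):
    """Fraction of the top-k recommendations that were actually purchased."""
    if k <= 0:
        return 0.0
    hits = sum(1 for pid in recommended[:k] if pid in purchased)
    return hits / k


def two_proportion_z(conv_a, n_a, conv_b, n_b):
    """z statistic for the difference in conversion rate between B and A."""
    p_a, p_b = conv_a / n_a, conv_b / n_b
    pooled = (conv_a + conv_b) / (n_a + n_b)
    se = math.sqrt(pooled * (1 - pooled) * (1 / n_a + 1 / n_b))
    return (p_b - p_a) / se


# Offline: one user's held-out purchases against a recommended list.
recommended = ["a", "b", "c", "d"]
purchased = {"b", "d", "x"}
print(round(precision_at_k(recommended, purchased, k=3), 3))  # 0.333

# Online: control converts 500 of 10,000 sessions, treatment 575 of 10,000.
z = two_proportion_z(500, 10_000, 575, 10_000)
print(round(z, 2))
```

In this made-up example the treatment shows a 15% relative lift and z comes out at about 2.35, above the 1.96 threshold for two-sided significance at the 5% level. With less traffic or a smaller lift the same test would need more days of data, which is where the multi-week budget comes from.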

Cold start

New users are 5-15% of active sessions at most retailers. Without personalization they convert dramatically worse. Content embeddings help: given any signal at all (referrer, time of day, location, current session clicks), rank products by embedding similarity to inferred intent. See our Kora case study for the cold-start architecture we deployed at a mid-market e-commerce client.
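A minimal sketch of inferring intent for a user with no history: start from a prior tied to the referrer and blend in the embeddings of whatever they click during the session. The referrer priors, the 0.7 click weight, and the toy vectors are all hypothetical. This is not the architecture from the case study.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def infer_intent(click_vecs, referrer_prior, click_weight=0.7):
    """Blend a referrer-based prior with the mean of session click embeddings."""
    if not click_vecs:
        return list(referrer_prior)  # no clicks yet: fall back to the prior
    dims = len(referrer_prior)
    mean = [sum(v[i] for v in click_vecs) / len(click_vecs) for i in range(dims)]
    return [click_weight * m + (1 - click_weight) * p
            for m, p in zip(mean, referrer_prior)]


def rank_for_new_user(intent, product_vecs, k=2):
    """Rank all products by similarity to the inferred intent vector."""
    scored = sorted(product_vecs.items(), key=lambda kv: (-cosine(intent, kv[1]), kv[0]))
    return [pid for pid, _ in scored[:k]]


# Toy vectors; the axes are loosely (outdoor, kids, kitchen).
REFERRER_PRIORS = {"camping_blog": [1.0, 0.0, 0.0]}
product_vecs = {
    "tent": [1.0, 0.0, 0.0],
    "kids_tent": [0.7, 0.7, 0.0],
    "blender": [0.0, 0.0, 1.0],
}

prior = REFERRER_PRIORS["camping_blog"]
print(rank_for_new_user(infer_intent([], prior), product_vecs))
# ['tent', 'kids_tent']

kids_click = [0.2, 1.0, 0.0]  # embedding of a kids' product the user clicked
print(rank_for_new_user(infer_intent([kids_click], prior), product_vecs))
# ['kids_tent', 'tent']
```

One click is enough to move the kids' tent to the top, which shows how quickly session signals can override a weak prior.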

Watch-outs

  • Over-personalization surfaces too few products. Users want discovery, not just confirmation.
  • Stale embeddings. Catalog changes fast; re-compute at least weekly, ideally daily.
  • Bias amplification. CF amplifies popular items; embeddings amplify catalog bias. Audit for this.
  • Latency blowups. LLM rerank at every slot is unsustainable. Budget deliberately.
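The first and third watch-outs can be audited from logged recommendation lists. The sketch below computes two simple metrics: catalog coverage, and the share of impressions going to a predefined set of already-popular "head" items. The metric names and the sample data are assumptions. What counts as an acceptable value depends on the catalog.

```python
from collections import Counter


def audit_recommendations(rec_lists, catalog_size, head_items):
    """Summarize how concentrated a batch of recommendation lists is."""
    shown = Counter(pid for recs in rec_lists for pid in recs)
    total = sum(shown.values())
    head = sum(count for pid, count in shown.items() if pid in head_items)
    return {
        # Share of the catalog that appeared in at least one list.
        "catalog_coverage": len(shown) / catalog_size,
        # Share of all impressions that went to already-popular items.
        "head_share": head / total if total else 0.0,
    }


# Hypothetical logged lists for three sessions, catalog of 10 products.
rec_lists = [["a", "b"], ["a", "c"], ["a", "b"]]
report = audit_recommendations(rec_lists, catalog_size=10, head_items={"a"})
print(report)  # {'catalog_coverage': 0.3, 'head_share': 0.5}
```

Tracking both numbers over time shows whether a new model narrows what users see (falling coverage) or leans harder on bestsellers (rising head share).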

Closing

Retail personalization isn't one technique; it's a stack. CF, embeddings, and LLM scoring each cover a different failure mode of the others. Build the stack in order: CF first, embeddings second, and LLM reranking third, only for high-ROI surfaces. Most retailers over-invest in the third layer before the first two are tight, and the result is an expensive system that barely beats baseline.
