Glossary

Every AI term, plainly explained.

40 terms covering LLMs, RAG, agents, and production AI engineering — written for humans, cross-referenced for depth.

/ A

Agent

An AI system that takes actions toward a goal, not just responds with text.

An LLM-powered system that plans, uses tools, and takes multi-step actions autonomously. Agents loop through reasoning, acting, and observing results until a goal is met. Distinct from chatbots because they perform tasks rather than just converse.
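The reason-act-observe loop can be sketched in a few lines. This is a minimal illustration, not a production agent: `fake_model`, `TOOLS`, and `run_agent` are hypothetical names, and `fake_model` stands in for a real LLM call.

```python
from typing import Callable

# Hypothetical tool registry: tool name -> plain Python function.
TOOLS: dict[str, Callable[[str], str]] = {
    "add": lambda arg: str(sum(int(x) for x in arg.split("+"))),
}

def fake_model(goal: str, observations: list[str]) -> dict:
    """Stand-in for an LLM call: picks the next step from what it has seen."""
    if not observations:
        return {"action": "add", "input": goal}
    return {"action": "finish", "input": observations[-1]}

def run_agent(goal: str, max_steps: int = 5) -> str:
    observations: list[str] = []
    for _ in range(max_steps):
        step = fake_model(goal, observations)           # reason
        if step["action"] == "finish":
            return step["input"]
        result = TOOLS[step["action"]](step["input"])   # act
        observations.append(result)                     # observe
    raise RuntimeError("step budget exhausted before the goal was met")
```

The `max_steps` cap is one simple way to keep a looping agent from running indefinitely.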

Agentic workflow

A process where AI takes multiple autonomous steps with tools.

Workflows where an LLM directs tool calls, data lookups, and branching decisions to complete complex tasks. Contrasts with single-prompt inference. Requires orchestration (LangGraph, Temporal) and error-handling infrastructure.

/ C

Chain of thought

Prompting the LLM to reason step-by-step.

Adding "think step by step" or providing reasoning-style examples improves performance on math, logic, and complex reasoning. Modern models (o1, Claude Sonnet 4) do this internally. Can be exposed to users or hidden.

Chunking

Splitting documents into smaller pieces for retrieval.

Documents are chunked (typically 200-1000 tokens) before embedding. Chunk size affects retrieval quality — too small loses context, too large dilutes relevance. Strategies: fixed-size, sentence-boundary, semantic chunking, parent-document.
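As a rough sketch of the fixed-size strategy, the function below splits on whitespace and overlaps neighboring chunks so that context at a boundary is not lost. It counts words rather than model tokens, which is an assumption made to keep the example dependency-free; `chunk_words` is a hypothetical name.

```python
def chunk_words(text: str, size: int = 200, overlap: int = 20) -> list[str]:
    """Split text into chunks of `size` words, each sharing `overlap` words
    with the previous chunk."""
    if size <= 0 or not 0 <= overlap < size:
        raise ValueError("need size > 0 and 0 <= overlap < size")
    words = text.split()
    step = size - overlap
    chunks: list[str] = []
    for start in range(0, len(words), step):
        chunks.append(" ".join(words[start:start + size]))
        if start + size >= len(words):  # last chunk reached the end
            break
    return chunks
```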

Context engineering

Designing what information goes into an LLM prompt.

Superset of prompt engineering. Includes retrieval, ranking, ordering, compression, and caching of context. The term gained traction in 2025, reflecting the recognition that prompt quality depends on context assembly, not just phrasing.

Context window

The maximum input length an LLM can process at once.

Measured in tokens. Current frontier models range from roughly 128k (GPT-4 Turbo, GPT-4o) to 1M+ (Gemini, and some Claude models in long-context mode). Larger windows admit longer documents but cost more and often show degraded recall in the middle of the context (the "lost in the middle" problem).

/ E

Embedding

A numerical vector representation of text, capturing semantic meaning.

Typically 768-3072 dimensional. Texts with similar meaning have nearby embeddings in vector space. Used for semantic search, clustering, and RAG. Generated by specialized models (OpenAI text-embedding-3, Cohere embed-v3, open-source BGE).
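"Nearby in vector space" is usually measured with cosine similarity. A minimal sketch, using toy 2-dimensional vectors in place of real embeddings (`cosine_similarity` is a hypothetical helper name):

```python
import math

def cosine_similarity(a: list[float], b: list[float]) -> float:
    """Cosine of the angle between two vectors: 1.0 = same direction, 0.0 = orthogonal."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0  # convention chosen here for zero vectors
    return dot / (norm_a * norm_b)
```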

Eval (evaluation)

Automated tests that verify LLM output quality.

Evals check model outputs against expected criteria: correctness, format adherence, safety, style. Required infrastructure for production AI — without evals, prompt changes become uncontrolled experiments. Frameworks: Braintrust, Promptfoo, custom pytest.
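A custom eval harness can be very small. The sketch below assumes each case pairs a prompt with a check function and reports a pass rate; `run_evals` and `fake_generate` are hypothetical names, and `fake_generate` stands in for a real model call.

```python
from typing import Callable

EvalCase = tuple[str, Callable[[str], bool]]  # (prompt, check on the output)

def run_evals(generate: Callable[[str], str], cases: list[EvalCase]) -> float:
    """Run every case through `generate` and return the fraction that pass."""
    if not cases:
        raise ValueError("at least one eval case is required")
    passed = sum(1 for prompt, check in cases if check(generate(prompt)))
    return passed / len(cases)

def fake_generate(prompt: str) -> str:
    """Stand-in for an LLM call; simply upper-cases the prompt."""
    return prompt.upper()
```

In practice the pass rate would be tracked per prompt version, so that a change that lowers it is caught before release.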

Evals-driven development

Building AI features like TDD — tests first, then prompts.

Define success criteria and build an eval set before writing prompts. Then iterate prompts until evals pass. The AI equivalent of TDD. Without this, prompt changes regress silently.

/ F

Few-shot prompting

Giving the LLM 2-5 examples of desired input-output.

Examples in the prompt dramatically improve format adherence and task performance. Especially useful for classification, extraction, and style matching. Trade-off: increases input tokens and cost.
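One plausible way to assemble a few-shot prompt is shown below. The `Input:`/`Output:` layout is an assumption for illustration, not a required format, and `build_few_shot_prompt` is a hypothetical helper.

```python
def build_few_shot_prompt(
    instruction: str, examples: list[tuple[str, str]], query: str
) -> str:
    """Place worked input-output pairs before the real query."""
    lines = [instruction, ""]
    for text, label in examples:
        lines += [f"Input: {text}", f"Output: {label}", ""]
    lines += [f"Input: {query}", "Output:"]  # the model completes this line
    return "\n".join(lines)
```

Each added example lengthens the prompt, which is the token and cost trade-off mentioned above.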

Fine-tuning

Training an existing LLM on your specific data.

Produces a specialized model version that performs better on narrow tasks. Generally less useful than RAG for most business problems: it adds cost and training time and reduces flexibility. Appropriate when tone, format, or domain-specific reasoning needs deep customization.

Function calling

LLMs invoking your code with structured arguments.

Also called "tool use." The model returns a JSON-structured function call that your code executes (e.g., searching a database, sending an email). Enables agents and reliable structured output. Supported by GPT-4, Claude, and most modern models.
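On the application side, handling a function call mostly means parsing the JSON and dispatching to real code. The sketch below assumes a `{"name": ..., "arguments": {...}}` payload; actual field names differ between providers, and `search_orders`, `REGISTRY`, and `dispatch` are hypothetical.

```python
import json

def search_orders(customer_id: str, limit: int = 5) -> list[str]:
    """Hypothetical tool: a real one would query a database."""
    return [f"{customer_id}-order-{i}" for i in range(1, limit + 1)]

# Only functions listed here can be invoked by the model.
REGISTRY = {"search_orders": search_orders}

def dispatch(tool_call_json: str):
    """Parse the model's tool call and run the matching function."""
    call = json.loads(tool_call_json)
    name = call["name"]
    if name not in REGISTRY:
        raise ValueError(f"unknown tool: {name}")
    return REGISTRY[name](**call["arguments"])
```

An explicit registry keeps the model from calling anything that was not deliberately exposed.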

/ G

Grounding

Anchoring LLM responses in verified data sources.

Technique to reduce hallucinations by forcing the LLM to cite or reference retrieved documents. Core to RAG. Quality grounding requires good retrieval and prompt engineering — 'answer only from the context below' with citation enforcement.
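A minimal sketch of "answer only from the context" with citation enforcement follows. The prompt wording, the `[n]` citation style, and both function names are assumptions for illustration.

```python
import re

def build_grounded_prompt(question: str, passages: list[str]) -> str:
    """Number each retrieved passage so the model can cite it as [n]."""
    sources = "\n".join(f"[{i}] {p}" for i, p in enumerate(passages, start=1))
    return (
        "Answer only from the context below. Cite sources as [n]. "
        "If the context does not contain the answer, say so.\n\n"
        f"Context:\n{sources}\n\nQuestion: {question}\nAnswer:"
    )

def citations_are_valid(answer: str, n_passages: int) -> bool:
    """True if the answer cites at least one source and every citation exists."""
    cited = [int(m) for m in re.findall(r"\[(\d+)\]", answer)]
    return bool(cited) and all(1 <= c <= n_passages for c in cited)
```

A check like `citations_are_valid` only confirms that citations point at real passages; it does not verify that the cited passage supports the claim.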

Guardrails

Input and output filters that prevent unsafe LLM behavior.

Technical controls that block harmful inputs (prompt injection, PII) and validate outputs (flagging likely hallucinations, enforcing a compliant format). Libraries: Guardrails AI, NVIDIA NeMo Guardrails. Essential for consumer-facing AI.

/ H

Hallucination

LLMs generating factually incorrect information confidently.

Models invent facts, citations, and details that sound plausible but are wrong. Reduction strategies: RAG, lower temperature, chain-of-thought, verification loops, and post-generation fact-checking.

Hybrid RAG

RAG that combines multiple retrieval strategies.

Uses keyword, semantic, and metadata filtering together. Often combined with reranking. Produces significantly better retrieval than naive vector-only RAG. Slightly more complex to implement.

Hybrid search

Combining keyword and semantic search for better retrieval.

Keyword search (BM25) excels at exact matches and rare terms. Semantic search (embeddings) excels at concept matching. Hybrid search merges the two result lists, typically with reciprocal rank fusion, which combines each document's rank positions rather than its raw scores. Produces better retrieval than either alone.
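Reciprocal rank fusion is short enough to sketch directly: each document earns 1 / (k + rank) from every list it appears in, and the sums decide the final order. The constant k = 60 is a commonly used default; the function name and the alphabetical tie-break are choices made for this example.

```python
def reciprocal_rank_fusion(rankings: list[list[str]], k: int = 60) -> list[str]:
    """Fuse several ranked lists of document IDs into one ranking."""
    scores: dict[str, float] = {}
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] = scores.get(doc_id, 0.0) + 1.0 / (k + rank)
    # Highest fused score first; ties broken alphabetically for stable output.
    return sorted(scores, key=lambda d: (-scores[d], d))

keyword_hits = ["a", "b", "c"]    # e.g. from BM25
semantic_hits = ["b", "d", "a"]   # e.g. from embedding search
fused = reciprocal_rank_fusion([keyword_hits, semantic_hits])
# "b" and "a" appear in both lists, so they outrank "d" and "c".
```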

/ I

Inference

The process of an LLM generating a response to an input.

Distinct from training. Each API call to an LLM is inference. Costs are measured per token (input + output). Inference time depends on model size, output length, and provider load. Usually 50-3000ms for frontier models.

/ K

Knowledge base

A curated collection of documents made AI-searchable.

Foundation of RAG applications. Contents: internal docs, manuals, tickets, wiki pages, PDFs. Quality depends on cleanliness, chunking strategy, and freshness. Regular reindexing required.

/ L

Latency

Time from request to complete response.

Measured in milliseconds. Key components: time to first token (TTFT) and per-token generation speed. Smaller models (Haiku, GPT-4o-mini) have 100-500ms TTFT. Streaming responses reduce perceived latency by showing text as it is generated.

LLM (Large Language Model)

Neural networks trained to predict and generate text.

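Trained on trillions of tokens from books, web, and code. Open-weight models such as Llama range from around a billion to hundreds of billions of parameters; parameter counts for closed models (GPT-4, Claude, Gemini) are not officially published. Perform language tasks zero-shot or few-shot without task-specific training.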

/ M

MCP (Model Context Protocol)

Open standard for connecting LLMs to tools and data sources.

Anthropic-published protocol that standardizes how LLM applications access external resources. Reduces vendor lock-in and integration complexity. Adoption growing across enterprise AI tooling.

Model card

Documentation describing an LLM's capabilities and limits.

Published by model providers. Covers training data, intended use cases, limitations, safety evaluations, and known biases. Essential reading before deploying to production.

Multi-model routing

Picking the right LLM per task for cost/accuracy.

Dynamically routing requests between providers (OpenAI, Anthropic, Meta, local) based on task complexity, budget, and latency needs. Simple tasks go to cheap fast models (Haiku, GPT-4o-mini); complex reasoning to flagship models. Typical 40-60% cost savings.

/ O

Observability

Tools that let you see what your AI is doing in production.

Logging, tracing, cost dashboards, and error alerting for LLM applications. Tools: Langfuse, Helicone, LangSmith. Without observability, production AI is a black box. Essential infrastructure before launch.

Output parsing

Converting LLM text output into structured data.

Strategies: JSON mode, function calling (tool use), regex extraction. Modern models reliably produce JSON when asked. Validation with Zod or Pydantic prevents malformed data from breaking downstream code.
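The validation step can be sketched with the standard library alone; Pydantic or Zod would express the same rules more compactly. The `Ticket` schema and its 1-5 priority range are hypothetical.

```python
import json
from dataclasses import dataclass

@dataclass
class Ticket:
    category: str
    priority: int

def parse_ticket(raw: str) -> Ticket:
    """Parse model output into a Ticket, rejecting anything malformed."""
    try:
        data = json.loads(raw)
    except json.JSONDecodeError as exc:
        raise ValueError("model output is not valid JSON") from exc
    if not isinstance(data, dict):
        raise ValueError("expected a JSON object")
    category = data.get("category")
    priority = data.get("priority")
    if not isinstance(category, str) or not category:
        raise ValueError("category must be a non-empty string")
    # bool is a subclass of int in Python, so exclude it explicitly.
    if isinstance(priority, bool) or not isinstance(priority, int) or not 1 <= priority <= 5:
        raise ValueError("priority must be an integer from 1 to 5")
    return Ticket(category=category, priority=priority)
```

Raising on bad output gives the caller a clear point at which to retry the model or fall back.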

/ P

Prompt engineering

Crafting LLM inputs to produce reliable outputs.

Involves role setup, example selection (few-shot), output format specification, and chain-of-thought prompting. Modern production prompts are versioned, evaluated, and tested like code. Over-emphasized in 2023, now just one part of the stack.

Prompt injection

An attack where user input manipulates LLM behavior.

User inputs like 'ignore previous instructions' that subvert system prompts. Main mitigations: input sanitization, strict role separation, output validation, and least-privilege tool access. Not fully solvable — architectural defense in depth is required.

/ Q

Query rewriting

Rephrasing user queries to improve retrieval.

An LLM transforms a conversational user query into a search-optimized form. Essential for multi-turn conversations where context is in prior messages. Also helps with typos, abbreviations, and domain-specific terminology.

/ R

RAG (Retrieval-Augmented Generation)

Combining search with LLMs to answer from your data.

Retrieves relevant documents from a knowledge base, passes them as context to an LLM, then generates a grounded answer. Essential pattern for business AI — enables models to answer from private data without fine-tuning. Quality depends on retrieval, not just model.

Reranker

A second-pass model that improves search result ordering.

Takes the top 20-100 results from initial retrieval and reorders them by relevance using cross-encoder models, which score each query-document pair jointly. Cohere Rerank and BAAI BGE Reranker are common. Typically +10-20% retrieval accuracy at moderate cost.

Retrieval

The process of finding relevant documents from a knowledge base.

First half of RAG. Quality of retrieval determines quality of RAG. Improvements come from: better embeddings, hybrid search, reranking, query rewriting, and metadata filtering — in that typical order of impact.

/ S

Semantic search

Search that matches by meaning, not just keywords.

Uses embeddings to find documents with similar meaning to a query. Works when user phrasing differs from document text. Foundation of RAG and modern knowledge systems. At scale, usually backed by a vector database (Pinecone, pgvector, Weaviate).

Speculative decoding

Using a smaller model to predict tokens that a larger model verifies.

Inference optimization technique. A small "draft" model generates candidate tokens; the large "target" model verifies them in parallel. Preserves the target model's output (identical tokens under greedy decoding, the same distribution under sampling) while typically running 2-3x faster.
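The draft-then-verify idea can be shown with a simplified greedy variant. Everything here is a toy: `target_model` and `draft_model` are hypothetical deterministic functions over integer tokens, and verification runs in a plain loop where a real system would use one batched forward pass of the target model.

```python
from typing import Callable

NextToken = Callable[[list[int]], int]

def target_model(prefix: list[int]) -> int:
    """Toy 'large' model: a deterministic next-token rule."""
    return (sum(prefix) * 7 + len(prefix)) % 10

def draft_model(prefix: list[int]) -> int:
    """Toy 'small' model: agrees with the target except at some positions."""
    guess = target_model(prefix)
    return (guess + 1) % 10 if len(prefix) % 3 == 0 else guess

def greedy_decode(prompt: list[int], next_token: NextToken, n_tokens: int) -> list[int]:
    out = list(prompt)
    for _ in range(n_tokens):
        out.append(next_token(out))
    return out[len(prompt):]

def speculative_decode(prompt: list[int], draft: NextToken, target: NextToken,
                       n_tokens: int, k: int = 4) -> list[int]:
    out = list(prompt)
    while len(out) - len(prompt) < n_tokens:
        # 1. The draft model proposes k tokens cheaply.
        ctx, proposal = list(out), []
        for _ in range(k):
            token = draft(ctx)
            proposal.append(token)
            ctx.append(token)
        # 2. The target checks each proposed position; its own token is always kept.
        for token in proposal:
            expected = target(out)
            out.append(expected)
            if expected != token or len(out) - len(prompt) >= n_tokens:
                break  # mismatch (discard the rest of the draft) or done
    return out[len(prompt):]
```

Because every emitted token comes from the target model, the result matches plain greedy decoding with the target; the speedup in a real system comes from verifying several drafted tokens per target pass.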

Streaming

Returning LLM tokens as they are generated.

Instead of waiting for the full response, tokens arrive incrementally. Dramatically improves perceived latency for chat and long-form outputs. Requires SSE or WebSocket support in your frontend.

System prompt

Persistent instructions that shape LLM behavior across a conversation.

Sets role, tone, rules, and output format. Modern models (Claude, GPT-4) give the system prompt higher priority than user messages. Production system prompts are versioned and evaluated.

/ T

Temperature

A setting that controls LLM output randomness.

0.0 = effectively deterministic: the highest-probability token is chosen at each step (small run-to-run differences can still occur in practice). 1.0 = creative, higher variance. Use 0.0-0.3 for factual tasks, extraction, classification. Use 0.7-1.0 for creative writing. Most production systems use 0.0-0.3.
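Mechanically, temperature divides the model's logits before the softmax. The sketch below uses made-up logits and treats a temperature of 0 as "put all probability on the top token", which is a common convention rather than a universal rule; `softmax_with_temperature` is a hypothetical name.

```python
import math

def softmax_with_temperature(logits: list[float], temperature: float) -> list[float]:
    """Convert logits to probabilities; lower temperature sharpens the distribution."""
    if temperature <= 0:
        # Greedy limit: all probability mass on the highest logit.
        best = max(range(len(logits)), key=lambda i: logits[i])
        return [1.0 if i == best else 0.0 for i in range(len(logits))]
    scaled = [x / temperature for x in logits]
    peak = max(scaled)                            # subtract max for numerical stability
    exps = [math.exp(x - peak) for x in scaled]
    total = sum(exps)
    return [e / total for e in exps]
```

With the same logits, the top token's probability rises as temperature falls, which is why low settings suit extraction and classification.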

Token

The unit of text LLMs process — roughly 0.75 words.

1 token ≈ 4 characters of English. Pricing and context limits are measured in tokens. Tokenization varies by model. 1000 tokens ≈ 750 words ≈ 3-4 paragraphs. Input and output tokens often have different prices.
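The rules of thumb above translate into simple arithmetic. This sketch uses the 4-characters-per-token approximation for English, so it is an estimate only; a model's real tokenizer gives exact counts. The prices in the usage line are placeholders, not any provider's actual rates.

```python
import math

def estimate_tokens(text: str) -> int:
    """Rough token count for English text: about 4 characters per token."""
    return math.ceil(len(text) / 4)

def estimate_cost(input_tokens: int, output_tokens: int,
                  input_price_per_million: float,
                  output_price_per_million: float) -> float:
    """Cost of one call when input and output tokens are priced separately."""
    return (input_tokens / 1_000_000 * input_price_per_million
            + output_tokens / 1_000_000 * output_price_per_million)

# Hypothetical prices: 3.00 per million input tokens, 15.00 per million output tokens.
example_cost = estimate_cost(1_000_000, 500_000, 3.0, 15.0)
```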

/ V

Vector database

A database optimized for similarity search over embeddings.

Stores high-dimensional vectors and returns nearest neighbors in milliseconds. Options: Pinecone (managed, simple), Weaviate (feature-rich), pgvector (Postgres extension), Qdrant (open source, fast). Foundation of RAG systems.
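Conceptually, a vector database answers "which stored vectors are closest to this query?". The brute-force sketch below shows that contract with toy 2-dimensional vectors; real systems use approximate indexes to stay fast at scale, and `TinyVectorIndex` is a hypothetical class, not any product's API.

```python
import heapq
import math

def cosine(a: list[float], b: list[float]) -> float:
    dot = sum(x * y for x, y in zip(a, b))
    norms = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norms if norms else 0.0

class TinyVectorIndex:
    """In-memory index that scans every vector on each search."""

    def __init__(self) -> None:
        self._items: list[tuple[str, list[float]]] = []

    def add(self, doc_id: str, vector: list[float]) -> None:
        self._items.append((doc_id, vector))

    def search(self, query: list[float], k: int = 3) -> list[tuple[str, float]]:
        """Return the k most similar (doc_id, score) pairs, best first."""
        scored = ((cosine(query, vec), doc_id) for doc_id, vec in self._items)
        return [(doc_id, score) for score, doc_id in heapq.nlargest(k, scored)]
```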

/ Z

Zero-shot

Asking an LLM to perform a task with no examples.

Frontier LLMs perform many tasks zero-shot because their training data includes similar patterns. For harder or more formatted tasks, few-shot prompting (providing 2-5 examples) usually improves reliability.

/ Next step

Got a term we're missing?

Send it to hello@theeazyware.com. We'll add it with a proper explanation — with credit.
