Asking an LLM for JSON and getting JSON back is harder than it used to be — or rather, it's still as fragile as it used to be, but the fragility is better hidden. Structured output modes and function calling have made the happy path smooth and the failure path subtle. This post is the playbook for making structured outputs actually work in production.
The four techniques and when to use each
Constrained decoding (JSON mode, strict mode)
OpenAI's structured outputs, Anthropic's tool use with forced JSON, and open-model equivalents (Outlines, LMQL) all constrain the decoder so it can only generate tokens consistent with a schema. In theory this gives you 100% valid output. In practice, expect roughly 99.5%: edge cases (schema ambiguity, token-boundary issues, deeply nested structures) still slip through. But it's dramatically better than unconstrained generation with a "return JSON" instruction, which was our baseline three years ago.
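To make the mechanism concrete, here is a toy sketch of the masking step. It is not any vendor's implementation: the vocabulary, the scores, and the three-string "schema" are all invented for illustration, and real engines compile a schema into a grammar over the tokenizer's actual vocabulary.

```python
import json

# Hypothetical "schema": the output must be exactly one of these strings.
ALLOWED = [
    '{"sentiment": "positive"}',
    '{"sentiment": "negative"}',
    '{"sentiment": "neutral"}',
]

# Hypothetical vocabulary and fixed model preferences (higher score wins).
VOCAB = ["Sure", "!", '{"', "sentiment", '": "', "positive", "negative", "neutral", '"}']
SCORES = {"Sure": 5.0, "negative": 3.0, "positive": 2.0}


def is_valid_prefix(text: str) -> bool:
    """True if some allowed output starts with `text`."""
    return any(option.startswith(text) for option in ALLOWED)


def constrained_decode(max_steps: int = 20) -> str:
    out = ""
    for _ in range(max_steps):
        if out in ALLOWED:
            return out
        # Mask every token that would take the output off the schema.
        legal = [tok for tok in VOCAB if is_valid_prefix(out + tok)]
        if not legal:
            raise ValueError(f"no legal continuation after {out!r}")
        # Among the legal tokens, take the one the "model" likes best.
        out += max(legal, key=lambda tok: SCORES.get(tok, 1.0))
    raise ValueError("step budget exhausted before the output was complete")


result = constrained_decode()
print(json.loads(result))
```

Left unconstrained, this toy model would open with "Sure", its highest-scoring token. The mask removes that option at the first step, so the model's preference only matters where the schema leaves a real choice.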
Function calling / tool use
When you define tools, the model calls them with typed parameters. This is structured output by another name, and it's often the cleanest way to express "the model should return an object matching this schema." Use it whenever your downstream consumer is action-oriented ("book a meeting with these params") rather than content-oriented ("summarize this document into these fields").
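A minimal sketch of the "book a meeting" case, assuming a JSON Schema style tool definition. The envelope field names vary by provider, and `book_meeting`, `TOOLS`, and `dispatch` are hypothetical names used only for illustration.

```python
import json

# Hypothetical tool definition; providers differ on the exact field names.
BOOK_MEETING_TOOL = {
    "name": "book_meeting",
    "description": "Book a meeting on the user's calendar.",
    "parameters": {
        "type": "object",
        "properties": {
            "title": {"type": "string"},
            "start": {"type": "string", "description": "ISO 8601 start time"},
            "duration_minutes": {"type": "integer"},
        },
        "required": ["title", "start", "duration_minutes"],
    },
}


def book_meeting(title: str, start: str, duration_minutes: int) -> str:
    # Stand-in for the real side effect (calendar API call, database write).
    return f"Booked {title!r} at {start} for {duration_minutes} min"


# Map each tool name to its definition and its handler.
TOOLS = {"book_meeting": (BOOK_MEETING_TOOL, book_meeting)}


def dispatch(tool_name: str, arguments_json: str) -> str:
    """Route a model-issued tool call to the matching Python function."""
    definition, handler = TOOLS[tool_name]
    args = json.loads(arguments_json)
    missing = [key for key in definition["parameters"]["required"] if key not in args]
    if missing:
        raise ValueError(f"missing arguments: {missing}")
    return handler(**args)


print(dispatch(
    "book_meeting",
    '{"title": "Sync", "start": "2025-01-06T10:00:00Z", "duration_minutes": 30}',
))
```

The required-field check is deliberately crude. It is there to show that the arguments still deserve validation on your side before anything with side effects runs.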
Schema in the prompt
This still works, and it is still necessary for models without structured output support or when you need flexibility the schema modes don't give you. Include the schema as JSON or TypeScript in the prompt, ask for output matching it, and parse the response. The failure rate is higher (typically 2-5% malformed output on modern frontier models, more on smaller or older models), so the repair loop matters more here.
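As a rough sketch of that flow, assuming a small hypothetical summary schema: `SCHEMA`, `build_prompt`, and `extract_json` are illustrative names, and the model call itself is left out.

```python
import json

# Hypothetical schema for a document summary.
SCHEMA = {
    "type": "object",
    "properties": {
        "title": {"type": "string"},
        "key_points": {"type": "array", "items": {"type": "string"}},
    },
    "required": ["title", "key_points"],
}


def build_prompt(document: str) -> str:
    """Embed the schema in the prompt and ask for matching JSON only."""
    return (
        "Summarize the document into a JSON object matching this JSON Schema.\n"
        "Return only the JSON object, with no commentary.\n\n"
        f"Schema:\n{json.dumps(SCHEMA, indent=2)}\n\n"
        f"Document:\n{document}"
    )


def extract_json(response: str) -> dict:
    """Parse the outermost {...} span, ignoring any preamble or code fence."""
    start, end = response.find("{"), response.rfind("}")
    if start == -1 or end < start:
        raise ValueError("no JSON object found in response")
    return json.loads(response[start:end + 1])


# A typical chatty response that a plain json.loads() would reject.
reply = 'Here you go:\n{"title": "Q3", "key_points": ["revenue up"]}\nHope that helps!'
print(extract_json(reply))
```

Slicing from the first `{` to the last `}` is a heuristic, not a parser. It handles the common "helpful preamble" case and leaves anything stranger to the repair loop.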
Grammar-constrained generation
For non-JSON outputs (SQL, specific DSLs, structured text formats), grammar-constrained generation lets you define a formal grammar and guarantee conformance. Outlines and guidance support this. Rare but very useful when it applies.
The repair loop
Parse the output. If it fails schema validation, send a repair prompt: the original query, the malformed output, the specific validation errors, and a request to fix. Repeat up to two times. If it is still failing, fall through. Two retries is the sweet spot: one isn't enough, and a third rarely helps and adds latency. The repair prompt should be short and mechanical; don't re-explain the original task, just show the error.
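The loop can be sketched as follows. `call_model` is a hypothetical callable that takes a prompt and returns the raw model text, and `validate` is a hand-rolled stand-in for whatever schema validator you actually use.

```python
import json
from typing import Callable, List, Optional

MAX_REPAIRS = 2  # two retries, per the guidance above


def validate(data) -> List[str]:
    """Hypothetical validator: returns a list of human-readable errors."""
    if not isinstance(data, dict):
        return ["top-level value must be an object"]
    errors = []
    if not isinstance(data.get("title"), str):
        errors.append("'title' must be a string")
    if not isinstance(data.get("score"), int):
        errors.append("'score' must be an integer")
    return errors


def check(raw: str):
    """Return (data, errors); data is None whenever there are errors."""
    try:
        data = json.loads(raw)
    except json.JSONDecodeError as exc:
        return None, [f"invalid JSON: {exc}"]
    errors = validate(data)
    return (None, errors) if errors else (data, [])


def generate_with_repair(query: str, call_model: Callable[[str], str]) -> Optional[dict]:
    raw = call_model(query)
    for attempt in range(MAX_REPAIRS + 1):
        data, errors = check(raw)
        if data is not None:
            return data
        if attempt == MAX_REPAIRS:
            break
        # Short and mechanical: query, bad output, errors, request to fix.
        repair_prompt = (
            f"Original request:\n{query}\n\n"
            f"Your previous output:\n{raw}\n\n"
            "Validation errors:\n"
            + "\n".join(f"- {error}" for error in errors)
            + "\n\nReturn only the corrected JSON."
        )
        raw = call_model(repair_prompt)
    return None  # caller falls through to its fallback path
```

Returning `None` rather than raising is a design choice for this sketch: it keeps the fall-through decision with the caller, who knows whether a default, a human queue, or an error is the right response.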
The edge cases that break everything
Unicode corner cases. LLMs occasionally emit tokens that break JSON parsers (unpaired surrogates, control characters). Use a lenient parser that can recover — jsonrepair, json5 — before treating parse failure as a real failure.
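If you are in Python and would rather not add a dependency, the standard library covers part of this. The sketch below is a stdlib-only stand-in for the lenient parsers named above, not a replacement for them: it retries with `strict=False` to accept raw control characters inside strings, then replaces any unpaired surrogates so the result can be safely encoded as UTF-8. The function names are hypothetical.

```python
import json


def lenient_loads(raw: str):
    """Try a strict parse first, then allow control characters in strings."""
    try:
        return json.loads(raw)
    except json.JSONDecodeError:
        # strict=False lets the stdlib decoder accept raw control characters
        # (such as a literal newline) inside string values.
        return json.loads(raw, strict=False)


def scrub_surrogates(value):
    """Recursively replace unpaired surrogates, which cannot be UTF-8 encoded."""
    if isinstance(value, str):
        return value.encode("utf-8", "replace").decode("utf-8")
    if isinstance(value, list):
        return [scrub_surrogates(item) for item in value]
    if isinstance(value, dict):
        return {scrub_surrogates(k): scrub_surrogates(v) for k, v in value.items()}
    return value


# A literal newline inside a string value is invalid in strict JSON.
with_newline = lenient_loads('{"note": "line one\nline two"}')

# A lone high surrogate parses fine but fails later when encoded.
with_surrogate = scrub_surrogates(lenient_loads('{"emoji": "\\ud83d"}'))
```

The surrogate case is the nastier of the two because parsing succeeds and the failure only appears downstream, when something tries to write the string to a database or a socket.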
Floats vs integers. LLMs don't reliably distinguish the two. If your schema says "integer" and the model returns `42.0`, some parsers choke. Coerce permissively at the parse layer; enforce strictly at the business layer.
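One way to split the two layers, as a sketch: `coerce_int` accepts anything that is unambiguously a whole number, and `validate_quantity` applies a business rule on top. Both names and the 1 to 100 range are hypothetical.

```python
def coerce_int(value) -> int:
    """Parse layer: accept 42, 42.0, "42", and "42.0"; reject everything else."""
    if isinstance(value, bool):
        # bool is a subclass of int in Python, so rule it out explicitly.
        raise TypeError(f"cannot coerce {value!r} to an integer")
    if isinstance(value, int):
        return value
    if isinstance(value, float) and value.is_integer():
        return int(value)
    if isinstance(value, str):
        text = value.strip()
        try:
            return int(text)
        except ValueError:
            pass
        try:
            number = float(text)
        except ValueError:
            number = None
        if number is not None and number.is_integer():
            return int(number)
    raise TypeError(f"cannot coerce {value!r} to an integer")


def validate_quantity(value) -> int:
    """Business layer: strict about what the integer is allowed to be."""
    quantity = coerce_int(value)
    if not 1 <= quantity <= 100:
        raise ValueError(f"quantity out of range: {quantity}")
    return quantity


print(validate_quantity(42.0))
```

Note that `42.5` is rejected rather than rounded. Permissive coercion means accepting different spellings of the same value, not silently changing the value.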
Empty arrays vs missing fields. The model may return `{"items": []}` or omit the field entirely. Your schema should be explicit about required-vs-optional and your downstream code should handle both. This is a common source of runtime errors.
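A small sketch of handling both shapes at the boundary, so the rest of the code only ever sees a list. `get_items` is a hypothetical helper; treating an explicit `null` the same as a missing field is an extra choice made here, not something every schema should do.

```python
def get_items(payload: dict) -> list:
    """Normalize a missing, null, or empty "items" field to an empty list."""
    items = payload.get("items")
    if items is None:
        # Covers both an omitted field and an explicit null.
        return []
    if not isinstance(items, list):
        raise TypeError(f'"items" must be a list, got {type(items).__name__}')
    return items


for payload in ({"items": []}, {}, {"items": ["a", "b"]}):
    print(len(get_items(payload)))
```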
Streaming structured outputs. If you need to stream partial results while the model is generating, you need a streaming JSON parser (partial-json, clarinet) that yields valid partial states. Buffer-and-parse-at-end works but defeats the point of streaming. See our streaming UX post.
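To show the idea behind those parsers, here is a deliberately naive stdlib sketch, not how partial-json or clarinet work internally. It closes any open string, array, or object in the buffered prefix and tries to parse the result, returning `None` when the prefix cannot yet be completed that simply (for example, a key with no value). The function names are hypothetical.

```python
import json


def close_partial(prefix: str) -> str:
    """Append the closers needed to balance an incomplete JSON prefix."""
    closers = []          # stack of pending "}" and "]"
    in_string = False
    escaped = False
    for ch in prefix:
        if in_string:
            if escaped:
                escaped = False
            elif ch == "\\":
                escaped = True
            elif ch == '"':
                in_string = False
        elif ch == '"':
            in_string = True
        elif ch == "{":
            closers.append("}")
        elif ch == "[":
            closers.append("]")
        elif ch in "}]" and closers:
            closers.pop()
    return prefix + ('"' if in_string else "") + "".join(reversed(closers))


def try_partial(prefix: str):
    """Return a parsed snapshot of the prefix, or None if it is not ready."""
    try:
        return json.loads(close_partial(prefix))
    except json.JSONDecodeError:
        return None


# Simulated stream: re-parse the growing buffer after every chunk.
chunks = ['{"title": "Str', 'eaming", "tags": ["a"', ', "b"]}']
buffer = ""
snapshots = []
for chunk in chunks:
    buffer += chunk
    snapshots.append(try_partial(buffer))
```

Re-parsing the whole buffer on each chunk is quadratic in the output length, which is one reason to reach for a real incremental parser once the idea is proven.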
Test structured outputs as code
Include schema conformance in your prompt test suite. For every test case, assert that the output parses, matches schema, and that key semantic fields contain expected values. Track parse-failure rate as a core metric alongside task quality. A regression from 0.3% malformed to 1.5% malformed on a high-volume endpoint is a real incident even if the "correct" outputs are still correct.
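A sketch of what that can look like in a plain assert-based test. The recorded outputs, the required keys, and the 0.5% budget are all hypothetical; in a real suite the outputs would come from running your prompt against a fixed set of test cases.

```python
import json

MAX_MALFORMED_RATE = 0.005  # hypothetical budget: 0.5% of outputs


def is_well_formed(raw: str, required_keys: set) -> bool:
    """True if the output parses to an object containing every required key."""
    try:
        data = json.loads(raw)
    except json.JSONDecodeError:
        return False
    return isinstance(data, dict) and required_keys.issubset(data)


def malformed_rate(outputs: list, required_keys: set) -> float:
    """Fraction of outputs that fail to parse or miss required keys."""
    if not outputs:
        return 0.0
    bad = sum(1 for raw in outputs if not is_well_formed(raw, required_keys))
    return bad / len(outputs)


def test_review_outputs():
    # Hypothetical recorded outputs for a sentiment-extraction prompt.
    outputs = [
        '{"title": "Great phone", "sentiment": "positive"}',
        '{"title": "Battery died", "sentiment": "negative"}',
    ]
    # Structural check: parse-failure rate stays inside the budget.
    assert malformed_rate(outputs, {"title", "sentiment"}) <= MAX_MALFORMED_RATE
    # Semantic check: a key field holds the expected value.
    assert json.loads(outputs[0])["sentiment"] == "positive"


test_review_outputs()
```

Keeping the rate as a number, rather than a pass/fail flag per case, is what lets you chart it over time and alert on a regression.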