Opinionated Habits for Structured Output That Holds Up

There is no shortage of structured-output advice that amounts to "use a schema and validate the output." True, but useless—it tells you what without why, and the why is where the judgment lives. This piece takes positions. Each practice below comes with the reasoning that justifies it, so you can adapt it rather than cargo-cult it.

These are habits drawn from running structured output at volume, where the difference between a good practice and a generic one shows up in your error logs. Where a practice trades something off, we say so. The goal is not a tidy checklist of platitudes but a set of defensible defaults you can argue with.

If you want the ordered build sequence rather than the principles behind it, the step-by-step approach is the companion piece. This one is about judgment.

Treat the Schema as Code, Not Configuration

The strongest single practice is to define your schema as a typed object in your codebase—Pydantic, Zod, or equivalent—and derive everything else from it.

Why It Wins

When the schema is code, your validator and your model instruction come from the same place and cannot drift apart. You get autocomplete, type checking, and a single file to change when requirements shift. Treating the schema as a string pasted into a prompt forfeits all of that and guarantees eventual divergence between what you ask for and what you accept.

The trade-off is a little upfront setup. It pays back the first time a requirement changes and you update one definition instead of hunting for three.
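
As a rough illustration of the single-source idea, here is a minimal sketch using only the Python standard library. Pydantic or Zod would do this far more thoroughly; the `Ticket` class, its fields, and the helper names `json_schema` and `parse` are hypothetical and exist only to show the schema sent to the model and the validator coming from one definition.

```python
import json
from dataclasses import dataclass, fields

# Hypothetical schema: the one definition everything else derives from.
@dataclass
class Ticket:
    summary: str
    priority: int
    resolved: bool

# Map Python types to JSON Schema type names (only the types used above).
_JSON_TYPES = {str: "string", int: "integer", bool: "boolean"}

def json_schema(cls) -> dict:
    """Derive the schema that goes to the model from the dataclass."""
    return {
        "type": "object",
        "properties": {f.name: {"type": _JSON_TYPES[f.type]} for f in fields(cls)},
        "required": [f.name for f in fields(cls)],
        "additionalProperties": False,
    }

def parse(cls, raw: str):
    """Validate model output against the same dataclass."""
    data = json.loads(raw)
    if not isinstance(data, dict):
        raise ValueError("expected a JSON object")
    expected = {f.name: f.type for f in fields(cls)}
    if set(data) != set(expected):
        raise ValueError(f"fields {sorted(data)} do not match {sorted(expected)}")
    for name, typ in expected.items():
        # bool is a subclass of int in Python, so compare exact types.
        if type(data[name]) is not typ:
            raise ValueError(f"{name}: expected {typ.__name__}")
    return cls(**data)
```

Adding a field to `Ticket` changes both `json_schema(Ticket)` and what `parse` accepts, so there is no second copy to forget.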

Scope Every Schema as Tightly as Possible

Ask the model for the minimum number of fields your application actually consumes. Not the fields you might want someday—the ones you use today.

Why It Wins

Every field is a surface for error and a draw on the model's attention. Smaller schemas produce higher per-field accuracy and cost fewer tokens. A speculative field you do not yet consume adds risk and cost for zero benefit. When you genuinely need many fields, splitting into focused calls often beats one sprawling request, because each call lets the model concentrate.

The trade-off is more calls. Usually worth it; measure if you are unsure.
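
One way to picture the split is sketched below. The field groups and the `ask` callable (standing in for whatever function sends a document and a list of wanted fields to your model) are assumptions for illustration, not a prescribed interface.

```python
from typing import Callable

# Hypothetical field groups: each call asks only for closely related fields.
FIELD_GROUPS = [
    ["name", "email"],
    ["order_id", "total"],
]

def extract_in_groups(document: str, ask: Callable[[str, list], dict]) -> dict:
    """Run one focused call per group and merge the results."""
    merged = {}
    for wanted in FIELD_GROUPS:
        result = ask(document, wanted)
        # Keep only the fields requested in this call; drop anything extra.
        merged.update({k: v for k, v in result.items() if k in wanted})
    return merged
```

Each call sees a small schema, and the merge step discards fields the call was not asked for, so a stray extra key from one call cannot overwrite another call's answer.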

Validate Semantics, Always, Separately

Structural validation and semantic validation are different jobs, and you need both. Run the schema validator to confirm shape, then run your own checks for meaning.

Why It Wins

Schema enforcement and structural validation cannot encode your business rules. A discount can be a valid number and still exceed your company's maximum. A date can be valid JSON and still be in the past when it must be in the future. Only domain-specific validation catches these, and they are exactly the errors that look clean and slip through. The common mistakes piece shows how often skipped semantic checks become production incidents.
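
The discount and date examples can be sketched as two separate passes. This is a minimal standard-library sketch; the `offer` shape, the `MAX_DISCOUNT` value, and the function names are hypothetical.

```python
from datetime import date

MAX_DISCOUNT = 0.30  # Hypothetical company maximum, for illustration only.

def check_structure(offer: dict) -> list:
    """Shape only: are the fields present and of the right type?"""
    errors = []
    discount = offer.get("discount")
    # bool is a subclass of int in Python, so exclude it explicitly.
    if isinstance(discount, bool) or not isinstance(discount, (int, float)):
        errors.append("discount: must be a number")
    try:
        date.fromisoformat(offer.get("expires"))
    except (TypeError, ValueError):
        errors.append("expires: must be an ISO date such as 2030-01-31")
    return errors

def check_semantics(offer: dict, today: date) -> list:
    """Meaning: do well-formed values respect the business rules?"""
    errors = []
    if not 0 <= offer["discount"] <= MAX_DISCOUNT:
        errors.append(f"discount: {offer['discount']} is outside 0..{MAX_DISCOUNT}")
    if date.fromisoformat(offer["expires"]) <= today:
        errors.append("expires: must be in the future")
    return errors

def validate(offer: dict, today: date) -> list:
    """Run the structural pass first; semantic checks assume a valid shape."""
    structural = check_structure(offer)
    if structural:
        return structural
    return check_semantics(offer, today)
```

An offer with a discount of 0.5 passes `check_structure` without complaint and is caught only by `check_semantics`, which is the kind of clean-looking error the paragraph above describes.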

Put the Reasoning in Field Descriptions

Use the description field of your schema to explain, in plain language, what belongs in each field and how to resolve edge cases.

Why It Wins

Descriptions are prompt instructions that travel with the schema, so they stay attached to the exact field they govern. A good description on an enum field—"use 'urgent' only when the customer mentions a deadline within 24 hours"—does more for accuracy than a paragraph of general prompt instructions, because it is anchored to the decision it affects.

The trade-off is a slightly larger schema and a few more tokens. Negligible against the accuracy gain.
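
For illustration, a schema fragment in JSON Schema form might carry the enum rule directly on the field. The field names and the `missing_descriptions` helper are hypothetical; the helper is just one possible way to keep undescribed fields from creeping in.

```python
# Hypothetical schema fragment; the field names are illustrative.
TICKET_SCHEMA = {
    "type": "object",
    "properties": {
        "priority": {
            "type": "string",
            "enum": ["normal", "urgent"],
            # The edge-case rule lives on the field it governs.
            "description": (
                "Use 'urgent' only when the customer mentions a deadline "
                "within 24 hours; otherwise use 'normal'."
            ),
        },
        "summary": {
            "type": "string",
            "description": "One sentence describing the customer's problem.",
        },
    },
    "required": ["priority", "summary"],
}

def missing_descriptions(schema: dict) -> list:
    """Return the names of fields that have no usable description."""
    return [
        name
        for name, spec in schema["properties"].items()
        if not spec.get("description", "").strip()
    ]
```

Running `missing_descriptions` in a unit test is a cheap way to notice when a new field ships without guidance.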

Retry With Context, Not Blindly

When validation fails, do not simply re-run the identical request. Feed the specific error back to the model.

Why It Wins

A blind retry asks the model to make the same mistake again with the same information, so it often does. A retry that says "your previous answer set status to 'pending', but the allowed values are 'open' and 'closed'" gives the model what it needs to correct course. Most fixable failures resolve on the first informed retry, which keeps your fallback path rare and cheap.

The trade-off is slightly more complex retry code. It is the difference between a pipeline that recovers and one that just fails twice.
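
A minimal sketch of an informed retry loop follows. The `call_model` parameter stands in for whatever function sends a prompt to your provider and returns its text; it, the allowed status values, and the wording of the feedback message are assumptions for illustration.

```python
import json
from typing import Callable, Optional

ALLOWED_STATUS = {"open", "closed"}  # Hypothetical allowed values.

def find_error(raw: str) -> Optional[str]:
    """Return a specific error message, or None if the output is valid."""
    try:
        data = json.loads(raw)
    except json.JSONDecodeError as exc:
        return f"output was not valid JSON ({exc.msg})"
    if not isinstance(data, dict):
        return "output must be a JSON object"
    status = data.get("status")
    if not isinstance(status, str) or status not in ALLOWED_STATUS:
        return (
            f"status was {status!r}, but the allowed values are "
            f"{sorted(ALLOWED_STATUS)}"
        )
    return None

def generate_with_retry(
    prompt: str, call_model: Callable[[str], str], max_attempts: int = 2
) -> dict:
    """Retry with the specific validation error appended to the prompt."""
    attempt_prompt = prompt
    error = None
    for _ in range(max_attempts):
        raw = call_model(attempt_prompt)
        error = find_error(raw)
        if error is None:
            return json.loads(raw)
        # Give the next attempt new information instead of repeating the request.
        attempt_prompt = (
            f"{prompt}\n\nYour previous answer was rejected: {error}. "
            "Return corrected JSON only."
        )
    raise ValueError(f"still invalid after {max_attempts} attempts: {error}")
```

The original prompt is kept intact and the feedback is appended fresh on each attempt, so error messages do not pile up across retries.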

Measure Failures Over Time

Log every validation failure and every retry, then review the logs weekly.

Why It Wins

Structured-output quality is tunable, but only if you can see where it breaks. The logs reveal which fields the model struggles with, which descriptions need rewriting, and whether a cheaper model would suffice for the easy cases. Teams that skip measurement keep paying for the same recurring failures because they never see the pattern. The framework for structured output builds this feedback loop in as a stage rather than an afterthought.
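
As a sketch of what the weekly review might compute, the snippet below tallies failures per field and how often a retry recovered them. The `FailureRecord` shape and the `weak_spots` helper are hypothetical; a real pipeline would read these records from its own log store.

```python
from collections import Counter
from dataclasses import dataclass

@dataclass
class FailureRecord:
    field: str             # Which field failed validation.
    value: object          # What the model produced for it.
    retry_succeeded: bool  # Whether a retry fixed it.

def weak_spots(records: list) -> list:
    """Per field: (name, failure count, share recovered by retry), worst first."""
    failures = Counter(r.field for r in records)
    recovered = Counter(r.field for r in records if r.retry_succeeded)
    return [(name, n, recovered[name] / n) for name, n in failures.most_common()]
```

A field with many failures and a low recovery share is a candidate for a rewritten description; a field whose failures almost always recover on retry may only need a clearer first prompt.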

Prefer the Strongest Enforcement, Then Verify Anyway

Use the strictest enforcement your provider offers—but never treat it as a license to skip validation.

Why It Wins

Strong enforcement dramatically reduces malformed output, which makes your pipeline faster and cheaper because retries become rare. But enforcement is not infallible across every edge case and provider, and it says nothing about semantics. The belt-and-suspenders posture—strong enforcement plus full validation—costs little and removes a whole category of late-night surprises. The tooling survey covers which providers offer the strongest guarantees.

The Underlying Principle

Every practice here flows from one stance: the model's output is untrusted input until your code has verified it. That stance dictates the single schema source, the tight scoping, the mandatory semantic validation, the informed retries, and the measurement. Adopt the stance and the practices follow naturally; adopt the practices without the stance and you will skip the inconvenient ones.

Frequently Asked Questions

Is it really worth splitting a schema into multiple calls?

Often, yes, when accuracy on a large schema is suffering. Each call lets the model concentrate on fewer fields, which raises per-field accuracy. The trade-off is more requests and a bit more orchestration. Measure your per-field accuracy before and after splitting; if it improves meaningfully, the extra calls are justified.
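
One way to run that before-and-after comparison is sketched below, assuming you have a small set of hand-labeled examples. The `per_field_accuracy` function and its inputs are hypothetical.

```python
def per_field_accuracy(predictions: list, expected: list) -> dict:
    """Fraction of examples where each field matches the hand-labeled value."""
    totals = {}
    correct = {}
    for pred, gold in zip(predictions, expected):
        for name, want in gold.items():
            totals[name] = totals.get(name, 0) + 1
            # A missing field counts as wrong.
            if pred.get(name) == want:
                correct[name] = correct.get(name, 0) + 1
    return {name: correct.get(name, 0) / n for name, n in totals.items()}
```

Compute it once on the output of the single large call and once on the merged output of the split calls, then compare field by field.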

Why not just trust strict schema enforcement and skip validation?

Because enforcement guarantees structure, not meaning, and across providers and edge cases it is not infallible. Validation is cheap and catches the semantic errors that enforcement cannot see. The combined cost is low and the downside it prevents (bad data flowing silently downstream) is high, so the belt-and-suspenders approach wins on expected value.

How detailed should field descriptions be?

Detailed enough to resolve the edge cases that actually arise, no more. For a simple unambiguous field, a short phrase suffices. For an enum or a field with tricky boundaries, spell out exactly when to choose each option. The test is whether someone unfamiliar with the task could fill the field correctly using only the description.

What should I actually log for measurement?

Log each validation failure with the field that failed, the value the model produced, and the retry outcome. Over a week this reveals which fields are weak spots and whether retries are succeeding. You do not need to log every successful response in full—the failures and retries carry the signal you tune against.

Does tight scoping conflict with wanting rich output?

Not really. Tight scoping means asking only for what you consume, not making output thin for its own sake. If you genuinely consume many fields, ask for them—just split the request when one large schema hurts accuracy. The discipline is against speculative fields you do not yet use, not against richness you actually need.

Key Takeaways

Define the schema as typed code and derive the validator and instruction from it so they never drift.
Scope schemas to the fields you actually consume; smaller schemas raise accuracy and cut cost.
Run structural and semantic validation separately, every time—enforcement cannot encode your business rules.
Retry with the specific error fed back to the model rather than blindly re-running.
Log and review failures weekly; structured-output quality is tunable only when you can see where it breaks.

Agency Script Editorial
