Best-practice lists are usually where good advice goes to die. They collapse into generic encouragement: "test thoroughly," "consider performance," "follow the docs." None of that is wrong, and none of it helps when you are staring at a feature that costs too much and breaks on real inputs. The practices below are the opposite. Each one is specific, each one is opinionated, and each comes with the reasoning so you can tell when to break it.
These practices come from watching AI model input and output modalities behave in production rather than in demos. The recurring lesson is that modality is not a feature you add; it is a constraint you design around. The teams that internalize that ship features that stay cheap, fast, and reliable as traffic grows. The teams that do not end up rewriting.
Treat this as a set of defaults. Adopt them unless you have a concrete reason not to, and when you do deviate, deviate on purpose.
Default to the Fewest Modalities That Solve the Problem
Every modality you add multiplies your surface area for cost, latency, and failure. The right number is the smallest one that connects what the user has to what they need.
Why minimalism wins
A text-only feature has one input path to break and one cost line to watch. Add image input and you inherit resolution handling, blur tolerance, and a token cost that scales with pixels. Each addition should earn its place. When in doubt, ship without it and add it later on evidence, as our step-by-step process recommends.
Make Structured Output the Default, Not the Exception
If anything other than a human reads the output, constrain it to a schema. Free-form prose is for conversation; structured data is for everything else.
The reasoning
Structured output is parseable, testable, and storable. It turns the model from a chat partner into a reliable component you can build automation on. The cost of adding a schema is minutes; the cost of parsing free-form text downstream is permanent fragility. This is the single highest-leverage default in the entire list, and the common-mistakes article shows how often skipping it backfires.
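As a rough illustration of what "constrain it to a schema" can look like in code, here is a minimal sketch using only the Python standard library. The `ReceiptSummary` record and its three fields are hypothetical, chosen only to make the example concrete; a real project might use a schema library or a provider's structured-output feature instead.

```python
import json
from dataclasses import dataclass


@dataclass(frozen=True)
class ReceiptSummary:
    # Hypothetical schema for illustration; these fields are assumptions.
    merchant: str
    total_cents: int
    currency: str


def parse_receipt(raw: str) -> ReceiptSummary:
    """Parse model output into a typed record, raising ValueError on any mismatch."""
    try:
        data = json.loads(raw)
    except json.JSONDecodeError as exc:
        raise ValueError(f"output is not valid JSON: {exc}") from exc
    if not isinstance(data, dict):
        raise ValueError("output must be a JSON object")

    expected = {"merchant": str, "total_cents": int, "currency": str}
    if set(data) != set(expected):
        raise ValueError(f"unexpected keys: {sorted(data)}")
    for key, kind in expected.items():
        # bool is a subclass of int in Python, so reject it explicitly.
        if isinstance(data[key], bool) or not isinstance(data[key], kind):
            raise ValueError(f"field {key!r} must be {kind.__name__}")
    return ReceiptSummary(**data)
```

Downstream code then works with `ReceiptSummary` fields rather than searching prose for a dollar amount, and a free-form reply fails loudly at the parse step instead of somewhere later.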
Validate at the Boundary, Always
Treat every model output as untrusted until it passes validation. For structured output, validate against the schema. For text, check the failure patterns specific to your task.
Pair validation with a defined fallback
Validation without a fallback is theater. Decide in advance what happens when output fails: retry with a tweaked prompt, fall back to a safe default, or surface a clear error. The goal is that bad output never silently reaches a customer or a downstream system.
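One way to make the fallback explicit is to wrap the model call in a small helper that retries and then returns a safe default. The sketch below is an assumption about how that could be wired, not a prescribed design: `generate` stands in for whatever function calls your model, and `validate` is any check that returns a parsed value or `None`.

```python
from __future__ import annotations

from typing import Callable, Optional, TypeVar

T = TypeVar("T")


def call_with_fallback(
    generate: Callable[[int], str],
    validate: Callable[[str], Optional[T]],
    default: T,
    max_attempts: int = 2,
) -> tuple[T, bool]:
    """Return (value, used_fallback).

    generate(attempt) is a stand-in for a model call; the attempt number lets
    the caller tweak the prompt on retries. validate returns a parsed value,
    or None when the output is unacceptable.
    """
    for attempt in range(max_attempts):
        value = validate(generate(attempt))
        if value is not None:
            return value, False
    # Every attempt failed validation: return the safe default and say so.
    return default, True
```

Returning the `used_fallback` flag alongside the value lets the caller log or surface the failure, so a default never passes for a real answer.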
Budget Cost and Latency Per Modality, Not Per Feature
Do not think about "the cost of the feature." Think about the cost of each modality in it, because those costs can differ by orders of magnitude.
Concrete guidance
- Text is your cheap, fast baseline; lean on it.
- Image input scales cost with resolution and count; cap both.
- Video is the most expensive input by far; sample frames sparingly.
- Non-text output (images, speech) can add seconds of latency; generate it lazily, only when it is actually needed.
Knowing these per-modality costs lets you make trade-offs deliberately rather than discovering them on an invoice. The definitive guide explains the underlying mechanics of why density drives cost.
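A per-modality budget can be as simple as tagging each call with its modality and aggregating. The following sketch assumes you already record a measured cost and latency per call; the `Usage` record, the function names, and the budget dictionary are all hypothetical.

```python
from __future__ import annotations

from collections import defaultdict
from dataclasses import dataclass


@dataclass(frozen=True)
class Usage:
    modality: str     # e.g. "text", "image", "video"
    cost_usd: float   # measured cost of one call, from your own billing data
    latency_s: float  # measured latency of that call


def summarize(usages: list[Usage]) -> dict[str, dict[str, float]]:
    """Aggregate total cost and worst-case latency per modality."""
    totals: dict[str, dict[str, float]] = defaultdict(
        lambda: {"cost_usd": 0.0, "max_latency_s": 0.0}
    )
    for u in usages:
        row = totals[u.modality]
        row["cost_usd"] += u.cost_usd
        row["max_latency_s"] = max(row["max_latency_s"], u.latency_s)
    return dict(totals)


def over_budget(
    summary: dict[str, dict[str, float]], budgets_usd: dict[str, float]
) -> list[str]:
    """List modalities whose total cost exceeds their budget.

    A modality with no budget entry is treated as having a budget of zero,
    so any unplanned spend shows up in the result.
    """
    return sorted(
        m for m, row in summary.items()
        if row["cost_usd"] > budgets_usd.get(m, 0.0)
    )
```

Treating a missing budget as zero is a deliberate choice in this sketch: a modality nobody planned for gets flagged the first time it costs anything.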
Design for the Worst Input You Will Actually Receive
Do not design for the clean demo input. Design for the blurry, rotated, partially obscured input your real users will send.
How to operationalize it
Build a small corpus of deliberately bad inputs and make passing them a release gate. This forces your validation and fallback logic to be real rather than aspirational. A feature that only works on clean inputs is not a feature; it is a demo.
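A release gate over a bad-input corpus could look something like the sketch below. It assumes each case carries its own definition of acceptable handling, which for a degraded input is often "flag for review" rather than "extract correctly". The `BadInputCase` type and `run_release_gate` function are hypothetical names.

```python
from __future__ import annotations

from dataclasses import dataclass
from typing import Callable


@dataclass(frozen=True)
class BadInputCase:
    name: str
    payload: bytes                        # the deliberately degraded input
    acceptable: Callable[[object], bool]  # what counts as handling it correctly


def run_release_gate(
    handler: Callable[[bytes], object], cases: list[BadInputCase]
) -> list[str]:
    """Return the names of failing cases; an empty list means the gate passes."""
    failures = []
    for case in cases:
        try:
            result = handler(case.payload)
        except Exception:
            # An unhandled exception on a bad input is itself a failure.
            failures.append(case.name)
            continue
        if not case.acceptable(result):
            failures.append(case.name)
    return failures
```

In a CI job, a non-empty result would fail the build, which is what turns the corpus from documentation into a gate.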
Keep Modality Handling Modular
Isolate each modality's input handling and output validation behind clean boundaries so you can add, remove, or swap one without touching the others.
The payoff
Modularity is what makes later expansion cheap. When you decide to add image input six months in, a modular design lets you slot it in rather than rewrite the core. It also makes debugging tractable, because a failure in audio handling stays contained to audio handling.
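One common way to get that isolation is a registry that maps each modality to its own handler, so adding a modality means registering one function rather than editing the core. This is a minimal sketch of that pattern with hypothetical names (`register`, `dispatch`, `handle_text`); real handlers would do far more than decode bytes.

```python
from __future__ import annotations

from typing import Callable

Handler = Callable[[bytes], str]

# Registry mapping a modality name to the function that handles it.
_HANDLERS: dict[str, Handler] = {}


def register(modality: str) -> Callable[[Handler], Handler]:
    """Decorator that registers a handler for one modality."""
    def decorator(fn: Handler) -> Handler:
        _HANDLERS[modality] = fn
        return fn
    return decorator


@register("text")
def handle_text(payload: bytes) -> str:
    # Text handling lives here and nowhere else.
    return payload.decode("utf-8").strip()


def dispatch(modality: str, payload: bytes) -> str:
    """Route a payload to its modality handler, failing clearly if none exists."""
    try:
        handler = _HANDLERS[modality]
    except KeyError:
        raise ValueError(f"unsupported modality: {modality}") from None
    return handler(payload)
```

Adding image input later would mean writing one more `@register("image")` function; `dispatch` and the text handler stay untouched.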
Choose Models by Modality Fit, Not Reputation
The most capable model overall is not always the right one for your modality mix. A model with a stellar text reputation may handle your specific image task poorly.
Test on your actual task
Run your real inputs through candidate models and compare on the modalities you depend on. Reputation is a prior, not a result. The tools survey walks through how to evaluate candidates against your specific modality requirements.
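The comparison itself needs very little machinery. The sketch below assumes you have already run your own inputs through each candidate and recorded a pass or fail per case; the tuple layout and function names are assumptions made for illustration.

```python
from __future__ import annotations

from collections import defaultdict
from typing import Iterable


def accuracy_by_modality(
    results: Iterable[tuple[str, str, bool]]
) -> dict[tuple[str, str], float]:
    """Compute the pass rate for each (model, modality) pair.

    results holds (model, modality, passed) tuples from your own eval runs.
    """
    counts: dict[tuple[str, str], list[int]] = defaultdict(lambda: [0, 0])
    for model, modality, passed in results:
        counts[(model, modality)][0] += int(passed)  # passed cases
        counts[(model, modality)][1] += 1            # total cases
    return {key: p / t for key, (p, t) in counts.items()}


def best_model_for(scores: dict[tuple[str, str], float], modality: str) -> str:
    """Pick the model with the highest pass rate on one modality.

    On a tie, the model seen first in the results wins.
    """
    candidates = {m: s for (m, mod), s in scores.items() if mod == modality}
    if not candidates:
        raise ValueError(f"no results for modality: {modality}")
    return max(candidates, key=candidates.get)
```

Scoring per modality rather than overall is the point: a model can lead on text and still lose on the image task you actually depend on.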
Treat Prompts as Part of the Modality Decision
How you frame a request shapes what modality you get back, and teams that ignore this fight their tools unnecessarily. The same model can return prose or structured data, a terse answer or an exhaustive one, depending entirely on how you ask.
Be explicit about the output you expect
If you need structured data, say so and supply the exact shape. If you need a non-text output, request it unambiguously rather than hoping the model infers it. Vague requests produce vague modalities, and the cost lands on your downstream code, which then has to guess at what it received. A precise request is the cheapest reliability investment available, because it costs nothing extra per call and removes an entire class of parsing failures.
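To make "supply the exact shape" concrete, a prompt builder can embed the expected keys and types directly in the request. This is a sketch under assumptions: the wording of the instructions and the `build_extraction_prompt` helper are hypothetical, and the phrasing that works best will depend on the model you use.

```python
from __future__ import annotations

import json


def build_extraction_prompt(task: str, fields: dict[str, str]) -> str:
    """Build a prompt that names every expected key and its value type."""
    # Show the shape as a JSON skeleton, e.g. {"merchant": "<string>"}.
    skeleton = {name: f"<{kind}>" for name, kind in fields.items()}
    return (
        f"{task}\n"
        "Respond with a single JSON object and nothing else.\n"
        "Use exactly these keys and value types:\n"
        f"{json.dumps(skeleton, indent=2)}\n"
        "Use null for any field you cannot read with confidence."
    )
```

Keeping the field list in one dictionary also means the prompt and the downstream validator can be generated from the same source, so they cannot drift apart.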
Match prompt effort to input quality
When inputs are messy, give the model more guidance about what to extract and what to ignore. A blurry receipt benefits from a prompt that names the fields you care about, so the model focuses its limited certainty on the data that matters. This pairs directly with designing for the worst input: the prompt is one of your levers for making a hard input tractable, and it costs far less than switching models. The connection between input quality and reliable extraction runs through our real-world examples, where the messiest inputs always demanded the most explicit prompts.
How These Practices Reinforce Each Other
These defaults are not independent; they compound. Minimalism reduces how much you have to validate. Structured output makes validation trivial. Per-modality budgeting tells you which modalities to cut under the minimalism rule. Modularity makes it cheap to act on what your budgeting reveals. Adopt them as a system and each one makes the others easier to follow.
The throughline is intentionality. Every practice here exists to replace an accidental decision with a deliberate one. Modality choices made by accident are how features become slow, expensive, and brittle. Made on purpose, the same choices become the reason your feature scales.
Frequently Asked Questions
When is it acceptable to skip structured output?
Only when a human reads the output directly and no software consumes it, such as a conversational reply. The moment any downstream code touches the result, structure stops being optional and becomes the default that prevents fragile parsing.
How small should my minimal modality set be?
As small as possible while still connecting what the user has to what they genuinely need. If you can solve the problem with text alone, do that and add richer modalities only when real usage proves they are required.
Is the most capable model always the safest default?
No. Overall capability is a weak predictor of performance on your specific modality task. Test candidate models on your actual inputs, because a model that excels at text may underperform on the exact image or audio task you depend on.
What makes validation "real" rather than theater?
A defined fallback. Validation that detects bad output but has no plan for it accomplishes nothing. Pair every check with a concrete action (a retry, a safe default, or a surfaced error) so failures are handled rather than merely noticed.
Why budget cost per modality instead of per feature?
Because modalities can differ in cost by orders of magnitude. A single feature might mix cheap text with expensive video, and a per-feature average hides which part is driving spend. Per-modality budgeting tells you exactly where to cut.
Key Takeaways
- Default to the fewest modalities that solve the problem; each one adds cost, latency, and failure modes.
- Make schema-constrained output the default whenever software consumes the result.
- Validate every output at the boundary and pair each check with a defined fallback.
- Budget cost and latency per modality, since they differ by orders of magnitude.
- Design for the worst real input, keep modality handling modular, and pick models by modality fit, not reputation.