Predictions about AI age badly when they reach for breakthroughs that haven't happened. This piece does the opposite: it reads the signals already visible in how AI model input and output modalities are evolving and projects them one step forward. No promises about artificial general intelligence, no invented timelines; just a thesis about where the modality story is clearly heading and what that should change about your decisions today.
The short version of the thesis: the boundaries between modalities are dissolving, the cost structure is shifting in ways that favor heavier media use, and the strategic advantage is moving from access to a model toward orchestration of many. If that's right, the teams that win won't be the ones with the fanciest single model. They'll be the ones who built a clean process for routing, optimizing, and validating across modalities early.
For the grounding this projection builds on, the Complete Guide is the right starting point. Here we look forward.
Signal 1: Modality Boundaries Are Blurring
The clearest trend is that the hard lines between text, image, audio, and video inputs are softening. Models increasingly treat all input as sequences of embeddings, which means a single model can reason across a screenshot, a caption, and a voice note as one coherent context rather than three separate problems.
What this implies
- Cross-modal reasoning — answering a text question about an image, or summarizing a video into structured notes — becomes the default capability rather than a specialty.
- The mental model of "pick a modality" gives way to "pick the inputs that carry signal," regardless of their type.
- Architectures that assumed rigid separation between media types will feel increasingly awkward.
This doesn't mean specialized models disappear. It means the general case gets broader, and the bar for reaching for a specialized model rises.
There's a design consequence worth naming. If you currently maintain separate code paths for "the image feature" and "the text feature," that separation is becoming a liability rather than a convenience. The future-leaning move is to treat all inputs as candidate context for a single reasoning step and let the model decide what's relevant, reserving hard separation only for the cases where one modality genuinely needs isolated, specialized handling. Teams that internalize this early will spend less time refactoring later.
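One way to picture that design move is a single input type that carries any modality, with isolation as an explicit opt-in. The sketch below is illustrative only: `ContextPart`, `split_context`, and the `isolated_modalities` parameter are hypothetical names, not part of any particular model API.

```python
from dataclasses import dataclass
from typing import FrozenSet, List, Tuple, Union


@dataclass(frozen=True)
class ContextPart:
    """One piece of input, whatever its type (hypothetical structure)."""
    modality: str              # e.g. "text", "image", "audio", "video"
    payload: Union[str, bytes]
    label: str = ""


def split_context(
    parts: List[ContextPart],
    isolated_modalities: FrozenSet[str] = frozenset(),
) -> Tuple[List[ContextPart], List[ContextPart]]:
    """Return (shared, isolated).

    Everything goes into the shared context by default. Only modalities
    named in isolated_modalities are held back for specialized handling.
    """
    shared = [p for p in parts if p.modality not in isolated_modalities]
    isolated = [p for p in parts if p.modality in isolated_modalities]
    return shared, isolated
```

With this shape, there is no separate "image feature" code path: a screenshot, a caption, and a voice note are all just parts, and pulling one modality out becomes a deliberate decision stated at the call site.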
Signal 2: The Cost Curve Is Bending
Today, media tokens are expensive relative to text, which constrains how freely teams use images and audio. The visible trend is downward pressure on those costs as encoders get more efficient and providers compete.
Why this matters strategically
If media becomes meaningfully cheaper, use cases that are marginal today become viable tomorrow: routinely processing screenshots, transcribing every call, analyzing visual content at scale. Teams that build clean media-optimization pipelines now will be positioned to scale them when the economics flip, while teams that bolted media on as an afterthought will have to untangle an ad hoc pipeline before they can scale at all.
This is the practical reason to invest in the optimization discipline covered in the Best Practices guide before you strictly need it. You're building the foundation for a cheaper future.
There's a caution embedded here, though. Falling costs reward teams that already have clean pipelines and punish teams that respond to cheaper media by sending everything at full resolution. Cheaper media doesn't excuse waste; it widens the gap between disciplined and careless systems, because the careless ones simply scale their inefficiency. The right read of this signal is to keep optimizing aggressively while expanding what you attempt, not to relax because each request got cheaper.
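As a small illustration of not sending everything at full resolution, the sketch below computes the dimensions an image would be downscaled to so that its longer edge fits a budget. The function name `fit_within` and the `max_long_edge` parameter are assumptions made for this example; the right budget depends on your provider and task, and the actual resizing would be done by whatever image library you already use.

```python
from typing import Tuple


def fit_within(width: int, height: int, max_long_edge: int) -> Tuple[int, int]:
    """Return target dimensions whose longer edge is at most max_long_edge.

    Aspect ratio is preserved (up to rounding), and images that already
    fit are returned unchanged rather than upscaled.
    """
    if width <= 0 or height <= 0 or max_long_edge <= 0:
        raise ValueError("dimensions and budget must be positive")
    long_edge = max(width, height)
    if long_edge <= max_long_edge:
        return width, height
    scale = max_long_edge / long_edge
    # Clamp to 1 so extreme aspect ratios never collapse an edge to zero.
    return max(1, round(width * scale)), max(1, round(height * scale))
```

For example, a 4000x3000 photo with a budget of 1024 comes out at 1024x768, while an 800x600 image is left alone.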
Signal 3: Output Modalities Catch Up to Input
For a while, "multimodal" mostly meant multimodal input with text-only output. The visible movement is toward richer output — models that generate images, edit them, and increasingly produce or manipulate audio and video as first-class outputs.
The shift this creates
- Products can close the loop entirely inside one model: read a brief, generate the visual, refine it on feedback.
- Output validation becomes a bigger concern, because generated media is harder to verify than text.
- The human-in-the-loop question moves from "should we?" to "where exactly?" as output volume grows.
The teams that thrive here will have already solved per-modality output validation, a topic the Common Mistakes article treats as foundational rather than optional.
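A minimal sketch of what per-modality output validation could look like is a registry that maps each modality to its own checker. The checks here are deliberately shallow (UTF-8 decoding, the standard PNG file signature) and the names `VALIDATORS` and `validate_output` are hypothetical; real validation would go much further, and anything without a registered validator is flagged for human review rather than passed silently.

```python
from typing import Callable, Dict, List

# The 8-byte signature that begins every PNG file.
PNG_SIGNATURE = b"\x89PNG\r\n\x1a\n"


def validate_text(data: bytes) -> List[str]:
    """Return a list of problems found in a text output (empty if none)."""
    try:
        text = data.decode("utf-8")
    except UnicodeDecodeError:
        return ["not valid UTF-8"]
    return [] if text.strip() else ["empty text"]


def validate_png(data: bytes) -> List[str]:
    """Structural check only: says nothing about what the image depicts."""
    return [] if data.startswith(PNG_SIGNATURE) else ["missing PNG signature"]


# Hypothetical registry: one validator per output modality.
VALIDATORS: Dict[str, Callable[[bytes], List[str]]] = {
    "text": validate_text,
    "image": validate_png,
}


def validate_output(modality: str, data: bytes) -> List[str]:
    validator = VALIDATORS.get(modality)
    if validator is None:
        return [f"no validator registered for {modality}; route to human review"]
    return validator(data)
```

The design choice worth noting is the fallback: a new output modality fails closed until someone decides how it should be checked.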
Signal 4: Orchestration Becomes the Differentiator
When everyone has access to capable multimodal models, the model itself stops being the advantage. What separates products is how well they orchestrate — routing inputs intelligently, optimizing media, choosing single-model versus pipeline per task, and validating output rigorously.
What to build now
- A routing and optimization layer that treats modality as a first-class concern.
- A documented decision process for single-model versus pipeline that you can revisit as models improve.
- Observability that tracks cost and quality per modality, so you can move fast when the landscape shifts.
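The observability item above can start very small. The following sketch keeps running totals per modality and reports average cost and failure rate; the class name `ModalityMetrics`, the `cost` unit, and the boolean `ok` quality signal are all assumptions for illustration, and a production system would export these numbers to its existing metrics stack instead of holding them in memory.

```python
from collections import defaultdict
from dataclasses import dataclass
from typing import Dict


@dataclass
class ModalityStats:
    requests: int = 0
    cost: float = 0.0
    failures: int = 0


class ModalityMetrics:
    """In-memory per-modality cost and quality counters (illustrative)."""

    def __init__(self) -> None:
        self._stats: Dict[str, ModalityStats] = defaultdict(ModalityStats)

    def record(self, modality: str, cost: float, ok: bool) -> None:
        stats = self._stats[modality]
        stats.requests += 1
        stats.cost += cost
        if not ok:
            stats.failures += 1

    def summary(self) -> Dict[str, Dict[str, float]]:
        # Every modality present here has at least one recorded request,
        # so the divisions below cannot divide by zero.
        return {
            modality: {
                "requests": s.requests,
                "avg_cost": s.cost / s.requests,
                "failure_rate": s.failures / s.requests,
            }
            for modality, s in self._stats.items()
        }
```

Calling `record("image", cost, ok)` after each request is enough to answer the two questions that matter when the landscape shifts: which modality is costing the most, and which one is failing the most.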
This is why the Framework and a documented workflow matter more, not less, as models get better. The model is the commodity; the orchestration is the product.
Consider the parallel with earlier waves of software. When databases became reliable and cheap, the advantage moved to how you modeled and queried data, not which database you ran. When cloud compute became ubiquitous, the advantage moved to architecture, not access. Multimodal models are following the same arc. The capability is rapidly becoming table stakes; what you build around it — the routing, the validation, the cost discipline, the graceful handling of failure — is where durable differentiation lives. Betting your roadmap on having a marginally better model is betting on a commodity. Betting it on superior orchestration is betting on something you actually control.
What This Thesis Means for Your Roadmap
If the signals hold, the right posture is to build modality-agnostic infrastructure today rather than wiring your product to the quirks of one current model. Assume media will get cheaper, output will get richer, and the boundaries will keep blurring. Design so that adding a modality or swapping a model is a configuration change, not a rewrite.
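To make "a configuration change, not a rewrite" concrete, here is a minimal sketch in which the mapping from modality to model lives in plain data. The model names, the stand-in handler functions, and the `route` helper are all hypothetical; in a real system each handler would wrap a provider client behind the same call signature.

```python
from typing import Callable, Dict

Handler = Callable[[str], str]


def model_a(prompt: str) -> str:
    # Stand-in for a real model call.
    return "A:" + prompt


def model_b(prompt: str) -> str:
    # Stand-in for a different model behind the same interface.
    return "B:" + prompt


# Registry of available handlers, keyed by a configurable name.
REGISTRY: Dict[str, Handler] = {"model-a": model_a, "model-b": model_b}

# The only thing that changes when a model is swapped or a modality added.
CONFIG: Dict[str, str] = {"text": "model-a", "image": "model-b"}


def route(
    modality: str,
    prompt: str,
    config: Dict[str, str] = CONFIG,
    registry: Dict[str, Handler] = REGISTRY,
) -> str:
    if modality not in config:
        raise ValueError(f"no model configured for modality: {modality}")
    return registry[config[modality]](prompt)
```

Swapping the text model is then `route("text", prompt, config={"text": "model-b"})`, or an edit to `CONFIG`, with no change to calling code.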
The teams that struggle will be those who treated multimodal as a feature to bolt on. The teams that win will have treated it as an architecture to invest in early — and the gap between those two groups is going to widen.
Frequently Asked Questions
Is it worth investing in multimodal infrastructure before I need it?
If your roadmap touches images, audio, or video at all, yes. The infrastructure — routing, optimization, validation — is reusable across models and use cases, and building it under pressure later is far more painful than building it deliberately now.
Will specialized single-modality models become obsolete?
No. As general multimodal models broaden, specialized models still win where one modality demands top-tier quality. The bar for choosing them rises, but the choice doesn't disappear — it stays a deliberate per-task decision.
How should falling media costs change my decisions today?
Build your media-optimization pipeline now so you can scale media use when it gets cheaper. But don't design use cases that only work if costs fall: base today's decisions on today's economics and let the pipeline absorb the upside later.
What's the biggest risk in this forward-looking view?
Over-coupling your product to one model's current behavior. If you hard-wire prompts, formats, and assumptions to a specific model, you forfeit the flexibility to adopt better options as the landscape shifts. Design for swappability.
Does richer output change my validation strategy?
Yes. As models generate more images, audio, and video, automated verification gets harder and human-in-the-loop placement becomes a core design decision rather than an edge case. Solve output validation per modality before output volume grows.
Key Takeaways
- Modality boundaries are blurring; cross-modal reasoning is becoming the default rather than a specialty.
- Media costs are trending down, which will make today's marginal use cases viable — build optimization pipelines now to capture that.
- Output modalities are catching up to input, raising the stakes on per-modality output validation.
- As models commoditize, orchestration — routing, optimization, selection, validation — becomes the real differentiator.
- Build modality-agnostic, swappable infrastructure so adding a modality or changing a model is configuration, not a rewrite.