Most teams build their first AI API feature by intuition. They write a prompt, parse the response, ship it, and then patch problems as production surfaces them. That works for a prototype and falls apart as the feature grows, because there is no shared structure to reason about. What was missing was a model, a repeatable way to think about every integration so the same problems get solved deliberately instead of discovered painfully.
An AI API is a hosted model endpoint you send a request to and receive a generated response from. The framework below, which we call CRAFT, organizes the work of building on one into five stages: Contract, Retrieve, Apply, Filter, Track. It is not a rigid pipeline you must implement in full every time. It is a checklist of concerns, and the value is in knowing which stages your particular feature needs and which it can safely skip.
Walk any proposed integration through these five stages and the weak points reveal themselves before users find them.
Stage 1: Contract — Define the Interface
Before any model call, decide exactly what goes in and what must come out. This is the contract between your code and the model, and skipping it is the root of most downstream pain.
What this stage covers
- The system prompt: stable instructions that define the model's role and rules.
- The input shape: how user data is structured and delimited so it cannot be confused with instructions.
- The output shape: the exact structure you will parse, ideally enforced with structured output mode.
When to apply it
Always. Every integration has a contract whether or not you wrote it down, and the undocumented ones break unpredictably. The discipline here mirrors our best practices on treating the prompt as an interface.
Stage 2: Retrieve — Supply the Right Context
The model only knows what you tell it plus its training data. For anything factual, current, or specific to your domain, you must retrieve and supply the relevant information.
What this stage covers
- Fetching the documents, records, or data the task needs.
- Selecting only the relevant portion to control cost and improve focus.
- Formatting that context clearly within the prompt.
When to apply it
Whenever the task depends on facts the model could not reliably know, customer data, product details, current information. Skip it for purely generative tasks like brainstorming. The retrieve-then-generate pattern is the same one driving the semantic search build in our real-world examples.
Stage 3: Apply — Make the Call Resilient
This is the actual model invocation, but the stage is about resilience, not just sending bytes. The endpoint is unreliable, so the call must be too.
What this stage covers
- Retries with exponential backoff for rate limits and transient errors.
- A request timeout so nothing hangs indefinitely.
- Distinguishing retryable errors from terminal ones.
- A fallback for when the call ultimately fails.
When to apply it
Always, in any production system. The naive single call is the source of most "the AI is down" incidents, as detailed in our common mistakes guide.
Stage 4: Filter — Validate Before You Trust
The response is plausible text, not guaranteed-correct text. This stage is the gate between model output and your application acting on it.
What this stage covers
- Parsing defensively and validating against a schema.
- Checking the output is within allowed bounds, a valid category, a sensible value.
- Routing low-confidence or malformed responses to a human or a safe default.
- Requiring confirmation for high-stakes actions.
When to apply it
Always, with the rigor scaled to the stakes. A brainstorming tool needs light filtering; a system that books payments needs strict validation and human sign-off.
Stage 5: Track — Measure and Improve
Without measurement, the feature drifts. This stage makes quality and cost observable so you can improve deliberately.
What this stage covers
- Logging prompts, models, parameters, tokens, and responses.
- Running an evaluation set on every prompt or model change.
- Monitoring cost, latency, and quality over time.
When to apply it
Always for anything you intend to maintain. The specific signals to watch are covered in our metrics that matter, and they turn vague impressions into trends you can act on.
Putting CRAFT to Work
The framework's value is as a conversation tool. When a teammate proposes an AI feature, walk it through the five stages: What is the contract? What must we retrieve? How will the call survive failure? How do we filter output? What do we track? The gaps surface immediately, and you fix them in design rather than in an incident.
Not every feature needs every stage at full depth. A low-stakes internal summarizer might have a light Filter and Track stage. A customer-facing financial assistant needs all five at maximum rigor. CRAFT tells you which dials to turn, not that they must all be turned to eleven.
A worked example
Picture a feature that drafts replies to customer support emails. Walking CRAFT: the Contract is a system prompt defining tone and rules, the incoming email delimited as untrusted content, and a structured draft as output. Retrieve pulls the customer's account history and relevant help-center articles so the reply is grounded in facts, not invention. Apply wraps the call in retries and a timeout with a fallback to a templated holding reply. Filter validates the draft, checks it does not promise anything outside policy, and routes anything uncertain to a human. Track logs tokens and runs the evaluation set whenever the prompt changes. Five short questions, and the whole design falls out, including the parts a naive build would have missed.
How CRAFT Maps to Common Failures
The framework is easier to trust once you see that each stage corresponds to a class of failure teams actually hit. CRAFT is, in effect, a map of where things go wrong.
Each stage prevents a specific failure
- Skip Contract and you get inconsistent output and prompt injection, because instructions and user data blur together.
- Skip Retrieve and you get confident hallucination, because the model answers from memory instead of your facts.
- Skip Apply and you get the "the AI is down" incidents that are really unhandled rate limits and timeouts.
- Skip Filter and a malformed or wrong response flows straight into your application and your users.
- Skip Track and quality drifts silently over weeks of well-meaning edits with no one able to see it.
Read that list and you have a diagnostic tool: when an AI feature misbehaves, the symptom usually points at the stage that was shortchanged. Confident wrong facts mean Retrieve. Crashes on parsing mean Filter. Mysterious quality decay means Track.
Why a named framework helps a team
The deeper value of giving the stages names is shared language. When everyone on a team knows what Contract, Retrieve, Apply, Filter, and Track mean, design reviews get faster and more rigorous. "Where is our Filter stage for this?" is a more useful question than a vague worry that the feature might be risky. The names turn scattered best practices into a checklist a team can run together, which is the same reason our pre-launch checklist exists alongside this framework.
Frequently Asked Questions
What is an AI API, and how does CRAFT relate to it?
An AI API is a hosted model endpoint you send requests to and receive generated responses from. CRAFT is a five-stage framework, Contract, Retrieve, Apply, Filter, Track, for building reliable integrations on top of that endpoint. It organizes the concerns that determine whether a feature survives production.
Do I have to implement all five stages?
No. Contract, Apply, and a baseline Track stage apply to virtually every production feature, but Retrieve is unnecessary for purely generative tasks and the depth of Filter scales with the stakes. CRAFT helps you decide which stages your feature needs, not force all of them.
What is the most overlooked stage?
Track. Teams ship a working prompt and never build the evaluation set or logging, so quality drifts silently as they make edits. Without measurement, you cannot tell whether a change helped or hurt, which makes every improvement a guess.
How is the Filter stage different from the Contract stage?
Contract defines the output shape you ask for up front; Filter verifies the response actually matches it and is safe to act on. The model usually honors the contract but not always, so Filter is the runtime gate that catches the exceptions before they reach users.
Can CRAFT work with embeddings, not just text generation?
Yes. For an embeddings-based feature, Retrieve becomes the core stage, the embeddings power retrieval, while Contract, Apply, Filter, and Track still apply to any generation step that summarizes the retrieved results.
Key Takeaways
- CRAFT organizes AI API work into five stages: Contract, Retrieve, Apply, Filter, Track.
- Contract defines the interface, Retrieve supplies relevant context, Apply makes the call resilient.
- Filter validates output before your app acts on it, with rigor scaled to the stakes.
- Track makes cost and quality observable so improvement is deliberate, not accidental.
- Use the framework as a design conversation to surface gaps before production does.