A good system prompt is not a flash of inspiration. It is the output of a process that another person could run and get a comparable result. When prompt work lives only in one engineer's head, every change is risky and every handoff is a rewrite. The fix is a workflow: a documented sequence of stages, each with an input, an output, and an exit condition.
This article lays out that workflow end to end. It is deliberately concrete. By the end you should be able to take a vague request like "make the assistant more helpful" and run it through stages that produce a tested, reviewable change. If you want the conceptual grounding first, the complete guide covers what a system prompt is and does.
The workflow has six stages: intake, draft, structure, test, review, and ship. Each one has a clear handoff. Skip a stage and you reintroduce the chaos the workflow exists to remove.
Stage 1: Intake
Input: A request, usually vague. Output: A list of concrete behavior statements.
The first job is translation. "Make it friendlier" is not actionable. Turn every request into statements of the form "when the user does X, the assistant should do Y." This forces specificity and surfaces disagreements early.
What good intake looks like
- Vague: "Handle angry customers better."
- Concrete: "When a user expresses frustration, the assistant should acknowledge it once, avoid defensive language, and offer the escalation path."
You cannot test a vague request, and you cannot hand it off. The behavior statement is the unit of work for everything downstream.
Stage 2: Draft
Input: Behavior statements. Output: A first prompt, rough and untested.
Now write. Do not optimize yet. The goal of the draft is to get the role, the rules, and the format on the page so you have something to react to. Drafting against a blank page is hard; drafting against your behavior statements is mechanical.
Write the role sentence first, then translate each behavior statement into an instruction. Keep the draft messy. You will fix structure in the next stage. Trying to write a clean, final prompt in one pass is how people stall.
Stage 3: Structure
Input: A rough draft. Output: A clean, organized prompt.
This is where the prompt becomes maintainable. Group related instructions, name the sections, and put any examples in their own block. A structured prompt is one the next person can read and change without fear.
A reliable structure
- Role and scope.
- Hard rules and refusal conditions.
- Domain context and policies.
- Output format.
- Few-shot examples, if any.
This ordering puts the most authoritative content first, which tends to be weighted more heavily, and keeps examples at the end where they will not get confused with rules. The framework article goes deeper on why this order works.
Stage 4: Test
Input: A structured prompt. Output: A pass or fail against the test set.
No prompt ships untested. Run the prompt against your test set, the collection of representative and adversarial inputs with known-correct behaviors. This is the stage that separates a workflow from guessing.
If you do not have a test set yet, build one here. Twenty cases is enough to start: a handful of common requests, a few edge cases, and a couple of adversarial attempts to break the rules. Run them, diff the outputs against expectations, and note every failure. A failing test is information, not a setback.
Stage 5: Review
Input: A tested prompt with results. Output: An approved change, or notes for revision.
A second person reads the prompt and the test results. The reviewer is not there to admire the writing; they are there to catch the rule that contradicts another rule, the instruction that will leak, and the behavior that looks fine in isolation but is wrong for the product.
What the reviewer checks
- Do any two rules contradict each other?
- Could any line expose information that should stay private?
- Do the test cases actually cover the behavior change?
- Is the prompt shorter than the last version, or did it grow without reason?
Review is cheap and catches expensive mistakes. Even an informal second read removes most regressions before they reach users. For the patterns reviewers should look for, see Real-World Examples and Use Cases.
Stage 6: Ship and Record
Input: An approved prompt. Output: A deployed change and a record of why.
Deploy the prompt the way you deploy code: versioned, with a note explaining what changed and which behavior statement drove it. The record is what makes the next change easy. Six months from now, someone will ask why a rule exists, and the answer should be findable.
Keep the previous version. Prompt changes can have effects that only show up at scale, and the ability to roll back quickly is worth the small storage cost. Treat each shipped prompt as a versioned artifact, not a string you overwrote.
Making the Workflow Hand-Off-Able
The point of a workflow is that someone else can run it. That requires three things to be written down, not memorized.
- The test set, with inputs and expected behaviors, in a file anyone can open.
- The current prompt, versioned, with a changelog.
- The intake template, so requests arrive as behavior statements every time.
With those three artifacts, a new team member can run the full workflow on their first week. Without them, prompt work stays trapped in one person's head and breaks the moment they take vacation. The whole value of the workflow is that it survives the person who built it.
Frequently Asked Questions
How big should the test set be before I trust the workflow?
Start with twenty cases and grow it every time something breaks. The right size is "covers the behaviors you care about," not a fixed number. High-stakes assistants end up with hundreds of cases; a simple internal tool may need thirty. The growth pattern matters more than the starting size: every incident and every behavior change should add at least one case.
Can I compress the six stages for small changes?
For a typo fix, sure. For any behavior change, run at least draft, test, and review. The stages most often skipped under time pressure are structure and review, and those are exactly the ones that prevent long-term decay. The workflow is fastest when it is followed, because it prevents the rework that skipping it causes.
Who owns the workflow if multiple people touch the prompt?
One person owns the canonical prompt and the test set; others contribute through the workflow. Multiple contributors are fine, but they should all route changes through the same stages and the same approver. Shared editing without a single owner is how the structure and review stages quietly get dropped.
What tools do I need to run this?
Less than you think. A version-controlled file for the prompt, a file or simple script for the test set, and a place to record changes. You can run the entire workflow with a code repository and a spreadsheet. Dedicated prompt-management tools help at scale, but the workflow is what matters, not the tooling.
How do I handle a behavior change that breaks an old test case?
Stop and decide which behavior is correct, because both cannot be. Sometimes the old case is outdated and you update it; sometimes the new request is wrong for the product and you push back. The test failure is doing its job by forcing the decision. Never resolve it by quietly deleting the old case to make the suite pass.
Key Takeaways
- A repeatable workflow turns prompt work from inspiration into a process anyone can run.
- The six stages are intake, draft, structure, test, review, and ship, each with a clear handoff.
- Intake translates vague requests into testable behavior statements, the unit of all downstream work.
- Testing against a known-good set is what separates a workflow from guessing.
- Review by a second person catches contradictions and leaks before users do.
- Hand-off depends on three written artifacts: the test set, the versioned prompt, and the intake template.