A checklist is only useful if you understand what each item protects against. This one is built to be used, not just read. Before any system prompt goes in front of users, run it through these items and confirm each holds. Every entry comes with a one-line justification so the list teaches as it verifies.
Print it, paste it into your review notes, or keep it open while you edit. The structure follows the natural life of a prompt: clarity first, then constraints, then output, then edges, then process. Work top to bottom and you will catch the failures that matter most before they reach production.
Treat any unchecked box as a blocker, not a suggestion. The whole value of a checklist is that it refuses to let you skip the boring step that turns out to be the important one.
A checklist beats memory because memory fails under pressure. Experienced prompt authors know all of these items, yet they still ship prompts with gaps: under deadline pressure the mind cuts corners it would never cut with time to spare. A checklist is external memory that does not get tired or rushed. It is most valuable precisely when you are least able to be careful, which is to say when you are shipping fast. Used honestly, it raises your worst-case attention to your everyday standard.
Clarity and Scope
These items confirm the prompt knows what it is and what it is for.
- The assistant has a single clear role stated in one sentence. A diffuse mandate produces a mediocre, unpredictable assistant.
- The scope of what it handles is explicit. Undefined scope invites the model to wander into untested territory.
- What falls outside its lane is stated. Naming the out-of-scope cases prevents confident answers on topics it was never built for.
- There are no contradictory instructions about tone or behavior. Conflicting rules make the model pick for you, inconsistently.
If any of these fail, fix scope before going further. Everything downstream depends on the assistant knowing its job, a point developed in System Prompts: Real-World Examples and Use Cases.
Constraints and Rules
These items confirm the non-negotiables are real and enforceable.
- Every hard rule is written as a command, not a suggestion. Hedged phrasing reads as optional and gets followed inconsistently.
- Every rule is checkable against an output. If you cannot objectively verify compliance, the rule cannot be tested or trusted.
- The single most important rule is restated at the end of the prompt. Instructions in the closing position tend to carry extra weight, which reduces violations.
- Priority between conflicting instructions is explicit, usually system over user. Without a stated hierarchy the model improvises one.
These four items resolve the most common reliability problems, the same set catalogued in 7 Common Mistakes with System Prompts (and How to Avoid Them).
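One way to make "checkable against an output" concrete is to express each hard rule as a small pass/fail function. The sketch below is only an illustration under stated assumptions: the `Rule` type, the two example rules, and the `restated_at_end` helper are hypothetical names invented here, and the rules belong to an imaginary support assistant rather than any real product.

```python
import re
from typing import Callable, NamedTuple


class Rule(NamedTuple):
    name: str
    passes: Callable[[str], bool]  # True when the output complies


# Hypothetical rules for an imaginary support assistant.
RULES = [
    Rule("never quotes a price", lambda out: not re.search(r"[$€£]\s?\d", out)),
    Rule("never promises a refund", lambda out: "refund is guaranteed" not in out.lower()),
]


def violations(output: str, rules: list[Rule] = RULES) -> list[str]:
    """Return the names of every rule the output breaks."""
    return [rule.name for rule in rules if not rule.passes(output)]


def restated_at_end(prompt: str, key_rule: str, tail_lines: int = 3) -> bool:
    """Check that the key rule text appears in the last few non-empty lines."""
    lines = [line for line in prompt.splitlines() if line.strip()]
    tail = " ".join(lines[-tail_lines:]).lower()
    return key_rule.lower() in tail
```

If a rule cannot be written as a function like this, or at least as a yes/no question a reviewer can answer from the output alone, that is a hint the rule is not yet checkable.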
Output and Format
These items confirm responses will be usable by whatever or whoever consumes them.
- The required output shape is described precisely. Vague format guidance produces inconsistent, hard-to-parse responses.
- For any specific style, one concrete example is included. A single example teaches tone and structure better than adjectives.
- Unwanted wrapper text (like "Here is your answer") is explicitly suppressed if downstream systems consume the output. Preamble breaks parsing.
- Length expectations are stated where they matter. "Be concise" is interpreted loosely; a sentence limit is not.
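The output items above can be confirmed by reading responses, but teams that consume output programmatically may prefer automated checks. The following is a minimal sketch, not a definitive implementation: the preamble phrases, the function names, and the rough sentence splitter are all assumptions made for illustration, and the JSON shape check applies only if your required format happens to be a JSON object.

```python
import json
import re

# Hypothetical preamble openers; extend with whatever your model tends to emit.
PREAMBLE = re.compile(r"^\s*(here is|here's|sure|certainly|of course)\b", re.IGNORECASE)


def has_preamble(output: str) -> bool:
    """True when the response opens with conversational wrapper text."""
    return bool(PREAMBLE.match(output))


def sentence_count(output: str) -> int:
    """Rough count: split on ., !, or ? followed by whitespace or end of text."""
    parts = re.split(r"[.!?]+(?:\s+|$)", output.strip())
    return len([part for part in parts if part])


def matches_shape(output: str, required_keys: set[str]) -> bool:
    """True when the output is a JSON object containing every required key."""
    try:
        data = json.loads(output)
    except json.JSONDecodeError:
        return False
    return isinstance(data, dict) and required_keys.issubset(data.keys())
```

A check such as `sentence_count(reply) <= 3` gives a pass or fail where "be concise" leaves room for argument. The splitter is deliberately crude and will miscount abbreviations and decimals, so treat it as a tripwire rather than a measurement.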
Edge Cases and Safety
These items confirm the prompt survives contact with messy reality.
- There is a defined behavior for when the model does not know. The absence of a graceful unknown manufactures confident fabrication.
- There is handling for empty, contradictory, or malformed input. Unhandled edges are a leading cause of visible failures.
- There is an escalation or hand-off path for cases beyond the assistant's authority. Some requests should reach a human, not an improvised answer.
- High-stakes outputs have a backstop beyond the prompt. A prompt is one layer of defense, not the whole wall.
The cost of skipping this section is dramatized in Case Study: System Prompts in Practice, where a single missing edge handler broke a live assistant.
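The edge items in this section lend themselves to a small probe: feed the assistant deliberately awkward inputs and confirm each reply uses an agreed fallback. The sketch below assumes a hypothetical `assistant` callable that takes user text and returns a reply; the fallback phrases, the edge inputs, and `stub_assistant` are invented for illustration and stand in for a real model call.

```python
from typing import Callable

# Hypothetical fallback phrases the prompt instructs the assistant to use.
FALLBACK_MARKERS = ("i don't know", "could you clarify", "connect you with a human")

# Hypothetical edge inputs: empty, whitespace-only, contradictory, malformed.
EDGE_INPUTS = [
    "",
    "   ",
    "Cancel my order but do not cancel it.",
    '{"order_id": ',
]


def is_graceful(reply: str) -> bool:
    """True when the reply uses one of the agreed fallback phrases."""
    lowered = reply.lower()
    return any(marker in lowered for marker in FALLBACK_MARKERS)


def ungraceful_cases(
    assistant: Callable[[str], str],
    inputs: list[str] = EDGE_INPUTS,
) -> list[str]:
    """Return every edge input that did not produce a graceful reply."""
    return [text for text in inputs if not is_graceful(assistant(text))]


def stub_assistant(user_input: str) -> str:
    """Stand-in for a real model call, used only to make the sketch runnable."""
    if not user_input.strip():
        return "Could you clarify what you need help with?"
    return "Your order has been cancelled."
```

Here the stub handles empty input gracefully but answers the contradictory and malformed inputs with confidence, so `ungraceful_cases(stub_assistant)` returns those two inputs. Phrase matching is a blunt test; a human reading the same replies is an equally valid way to run this check.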
Process and Maintenance
These items confirm the prompt can be trusted over time, not just today.
- A regression test set exists and the prompt passes it. Without one, every change risks silent breakage.
- The prompt is versioned with a note on what changed and why. History is the fastest path to diagnosing future drift.
These last two are the practices that keep all the others honest, and they anchor the broader discipline in System Prompts: Best Practices That Actually Work.
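A regression set and a version record can be very small and still do their job. The sketch below shows one possible shape, assuming a hypothetical `assistant` callable that takes a system prompt and a user message; `PromptVersion`, `Case`, `run_regression`, the stub, and the example versions are all illustrative names and data, not part of any real library or incident.

```python
from dataclasses import dataclass
from typing import Callable


@dataclass(frozen=True)
class PromptVersion:
    version: str
    text: str
    change_note: str  # what changed and why


@dataclass(frozen=True)
class Case:
    user_input: str
    check: Callable[[str], bool]  # True when the reply is acceptable


def run_regression(
    assistant: Callable[[str, str], str],
    prompt: PromptVersion,
    cases: list[Case],
) -> list[str]:
    """Return the inputs whose replies fail their check under this prompt version."""
    return [c.user_input for c in cases if not c.check(assistant(prompt.text, c.user_input))]


def stub_assistant(system_prompt: str, user_input: str) -> str:
    """Stand-in for a real model call; it admits ignorance only when told it may."""
    if "say you do not know" in system_prompt.lower():
        return "I do not know."
    return "The warranty lasts ten years."


# Hypothetical version history for an imaginary product assistant.
V1 = PromptVersion("1.0", "You answer product questions.", "Initial version.")
V2 = PromptVersion(
    "1.1",
    V1.text + " If you are unsure, say you do not know.",
    "Added a graceful-unknown rule after a fabricated warranty answer.",
)
CASES = [Case("How long is the warranty?", lambda reply: "do not know" in reply.lower())]
```

Running `run_regression(stub_assistant, V1, CASES)` reports the warranty question as a failure, while the same call with `V2` returns an empty list. The change note on `V2` records why the rule was added, which is the history a later reviewer needs when behavior drifts.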
How to Use This Checklist in a Review
Run the list as a gate, not a formality. When reviewing a prompt, go item by item and demand evidence for each check rather than a quick nod. For the output and edge sections especially, the evidence should be an actual test run, not an assertion that it "should" work.
For prompts that change often, re-run the full list on every meaningful revision. It takes minutes once you are familiar with it, and it is far cheaper than discovering a regression in production. Over time the checklist becomes muscle memory, and you start writing prompts that pass it on the first draft.
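The checklist works on paper, but if your review already runs in tooling, the gate can be encoded directly. This is a minimal sketch under the assumption that each item carries a checked flag and a free-text evidence field; `Item`, `blockers`, and `can_ship` are hypothetical names chosen for illustration.

```python
from dataclasses import dataclass


@dataclass
class Item:
    section: str
    text: str
    checked: bool = False
    evidence: str = ""  # e.g. a link to a test run; empty means unproven


def blockers(items: list[Item]) -> list[Item]:
    """An item blocks release if it is unchecked or checked without evidence."""
    return [item for item in items if not item.checked or not item.evidence.strip()]


def can_ship(items: list[Item]) -> bool:
    """The prompt ships only when no item is blocking."""
    return not blockers(items)
```

Treating a checked box with no evidence as a blocker is the design choice that matters here: it turns the quick nod into a failed gate.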
Adapt the List to Your Stakes
Not every prompt warrants the same rigor, and the checklist should scale with what is at risk. A throwaway internal tool that summarizes meeting notes can move quickly through the output and process items. An assistant that talks to customers, touches money, or gives advice in a regulated field deserves every item run with evidence, plus domain-specific additions of your own.
Calibrating that rigor is itself a skill. Underusing the checklist on a high-stakes assistant invites the failures it exists to prevent. Overusing it on a trivial tool wastes time you could spend elsewhere. Match the thoroughness to the consequences of being wrong, and the checklist becomes a sharp instrument rather than a blunt ritual.
Frequently Asked Questions
Do I really need to run every item for a small prompt?
The clarity, constraint, and edge items apply even to small prompts, because those are where small prompts most often fail. You can move faster through the output and process items for a trivial use case, but do not skip the graceful-unknown check regardless of size.
What is the most commonly failed item?
The defined behavior for when the model does not know. Many prompts simply never tell the model it is allowed to say "I don't know," which is precisely how confident fabrication enters production. It is the highest-value box on the list.
How does the checklist relate to a regression test set?
The checklist verifies the prompt's structure; the test set verifies its behavior. They are complementary. Several checklist items, particularly in the output and edge sections, can only be honestly confirmed by running the test set, not by reading the prompt.
Can I add my own items?
Absolutely. This is a starting point, not a fixed standard. Domain-specific risks, like compliance disclaimers in finance or safety language in health, deserve their own checks. Add items that map to the failures your specific context can produce.
Should non-engineers use this checklist?
Yes. Nothing on it requires code. A subject-matter expert reviewing an assistant's behavior can run the clarity, constraint, output, and edge items entirely from reading prompts and outputs. The process items benefit from light tooling but are still understandable to anyone.
Key Takeaways
- Treat each unchecked box as a blocker; the value of a checklist is refusing to skip the boring-but-important step.
- Confirm clarity and scope first, since a diffuse mandate undermines everything downstream.
- Verify that hard rules are imperative, checkable, and prioritized, and that the most important one is restated at the end.
- The most commonly failed and highest-value item is defining what the model does when it does not know.
- Back the structural checks with a regression test set and versioning, the practices that keep the prompt trustworthy over time.