Knowing that you should evaluate prompts is easy. Actually sitting down and doing it is where most people stall, because the abstract advice rarely tells you what to do on Monday morning. This guide is the opposite. It is a sequence of concrete steps you can execute today, in order, to produce a real evaluation of any prompt you care about.
You will not need exotic tooling. A spreadsheet and access to your model are enough for the first pass, and you can graduate to scripts once the process is familiar. The point is to give you a path with no gaps, so you always know what the next action is.
Follow the steps in order. Each one produces an artifact the next step depends on, so skipping ahead tends to send you back. By the end you will have a pass rate, a list of failures with causes, and a defensible decision about whether the prompt is ready.
Step 1: Write the Success Criteria
Before touching the prompt, write down what a correct output must contain. Be specific and testable. "Helpful summary" is useless; "three bullet points, each under 20 words, covering the decision, the owner, and the deadline" is something you can actually check.
This document is your rubric. Every later judgment refers back to it, which prevents you from quietly lowering the bar when an output charms you.
Step 2: Assemble a Test Set
Collect inputs that represent what the prompt will face in the real world. Pull them from actual data if you can.
- Include 15 to 30 inputs to start.
- Cover the common cases that dominate your traffic.
- Add edge cases: empty fields, very long inputs, unusual formats.
- Add a few adversarial inputs that try to break the prompt or push it off topic.
Save these in a spreadsheet, one input per row. This set is the foundation; protect it and do not edit the prompt to fit these specific examples.
Step 3: Define How Each Output Will Be Scored
Decide your scoring method per criterion before you generate anything. Structured requirements like "valid JSON" get a programmatic or visual pass/fail. Subjective requirements like tone get a rubric with a short scale, such as 1 to 3. Writing this down now keeps your scoring consistent across dozens of rows.
Step 4: Generate Outputs for Every Input
Run the prompt against each input in your test set and record the raw output in its own column. To capture consistency, run each input two or three times rather than once. This is tedious by hand, which is the first sign you may want to script it, but even doing it manually for 20 inputs is worth the effort.
Step 5: Score Each Output Against the Rubric
Go row by row and mark each output pass or fail, or assign a rubric score, based strictly on the criteria from step one. Resist the urge to give partial credit for outputs that are charming but wrong. The discipline here is what makes the result trustworthy.
Tally a pass rate at the end. That single number is now your baseline.
Step 6: Diagnose the Failures
This is where the value lives. Read every failing output and group the failures by cause.
- Are failures clustered on a specific input type, like long messages?
- Do they share a behavior, like ignoring the format instruction?
- Are they inconsistent across runs of the same input?
The pattern points directly at the fix. A prompt failing only on long inputs needs different handling than one failing randomly across the board.
Step 7: Revise and Re-Run the Whole Set
Make one targeted change based on your diagnosis, then re-run the entire test set, not just the cases that failed. Re-running everything catches regressions, where a fix for one problem quietly breaks cases that previously passed. Compare the new pass rate against your baseline to confirm real improvement.
For the dimensions worth tracking as you iterate, see What Separates a Reliable Prompt From a Lucky One.
A Note on Iterating Without Cheating
There is a temptation, during this revise-and-rerun loop, to keep tweaking until every single test case passes. Resist scoring your success purely on the examples you have been staring at. If you tune the prompt to satisfy these specific inputs, you may simply be memorizing them rather than improving the prompt. The honest move is to hold back a handful of inputs you do not look at while editing, and check the prompt against those only at the end. If it performs well on inputs it has never been tuned against, you have evidence of real improvement rather than overfitting.
Step 8: Decide and Document
Once the pass rate clears the floor your task requires, make the call to ship, keep iterating, or escalate. Then document the test set, the final pass rate, and the decision so the next person can reproduce your reasoning.
To turn this sequence into a repeatable model your team can adopt, read A Framework for Evaluating Prompt Quality. To see the process applied end to end on a real problem, read Case Study: Evaluating Prompt Quality in Practice.
When to Automate the Process
The first time through, doing every step by hand is valuable because it builds intuition for what the prompt actually does. By the third or fourth iteration, the manual work starts to drag, and that drag is the signal to automate. The steps most worth scripting are output generation and programmatic scoring, since those are the repetitive parts that benefit from running hundreds of times.
A practical progression looks like this. Begin in a spreadsheet. Once you are re-running the same test set after every edit, write a small script that loops over your inputs, calls the model, and records the outputs. Then add automated checks for anything structured, leaving only the subjective criteria for human review. Each step of automation buys back time you can spend on diagnosis and decision-making, which are the parts of evaluation that genuinely need a human. The process stays the same; you are simply removing the toil.
Frequently Asked Questions
How long does a first evaluation take?
For a 20-input test set scored by hand, expect one to three hours including writing your criteria and diagnosing failures. The criteria and test set are the slow parts, but they are reusable, so every evaluation after the first is much faster. Automating output generation cuts the time further.
Should I run each input more than once?
Yes, when consistency matters. Running each input two or three times reveals whether the prompt behaves the same way every time or swings between good and bad answers. If you only run once, a lucky pass can hide an unreliable prompt that will fail unpredictably in production.
What pass rate is good enough to ship?
It depends on the stakes. A low-risk background task might ship at 90 percent, while a customer-facing or high-consequence task might need 98 percent or more plus human review. Set the floor based on the cost of a mistake before you evaluate, so the number drives the decision rather than your enthusiasm.
Why re-run the whole test set after a change?
Because fixes cause regressions. A change that solves your failing cases can quietly break cases that already passed, and you will only catch that by re-running everything. Comparing the full pass rate before and after is the only reliable way to confirm you actually improved the prompt.
Key Takeaways
- Start by writing specific, testable success criteria before touching the prompt.
- Build a 15 to 30 input test set covering common, edge, and adversarial cases.
- Decide scoring methods per criterion up front to keep judgments consistent.
- Run each input multiple times to capture consistency, then score strictly against the rubric.
- Diagnose failures by grouping them by cause, then make one targeted change and re-run the whole set.
- Document the test set, pass rate, and decision so the evaluation is reproducible.