8.2 The "hypothesize → test → iterate" loop
The core debugging loop
Debugging with AI works best when you force structure:
- Reproduce the failure reliably.
- Hypothesize a small set of plausible causes.
- Test hypotheses with quick checks.
- Iterate based on results (discard wrong hypotheses).
- Fix with the smallest diff that resolves the issue.
- Lock with a regression test.
This is “scientific method for software,” scaled down to minutes.
Without structure, the model will propose plausible fixes and you’ll try them one by one. That feels like progress but often wastes time. The loop makes the model do diagnostic work, not just code generation.
Step 1: reproduce reliably
If you can’t reproduce, you can’t debug. Your first job is to make the failure happen on demand:
- reduce concurrency, retries, and “randomness,”
- use the same input every time,
- write a failing test if possible,
- capture exact error output.
Once you have a one-command reproduction, you’ve already done most of the hard part.
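As a concrete sketch, a one-file, one-command reproduction might look like the following. The function names (`parse_amount`, `format_total`) and the bug itself are invented for illustration; substitute your own code and the exact input captured from the failing run.

```python
# repro.py -- a toy one-command reproduction of an "empty input" crash.
# parse_amount and format_total are stand-ins for whatever you are actually debugging.

def parse_amount(text: str):
    """Toy stand-in for the real function: silently returns None on empty input (the bug)."""
    if not text.strip():
        return None          # returns None instead of a number
    return float(text)


def format_total(text: str) -> str:
    # The caller assumes parse_amount always returns a number.
    return f"Total: {parse_amount(text):.2f}"


if __name__ == "__main__":
    # Same input every time; `python repro.py` makes the failure happen on demand:
    # TypeError: unsupported format string passed to NoneType.__format__
    print(format_total(""))
```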
Step 2: generate hypotheses (without guessing randomly)
Ask the model for a short ranked list of hypotheses. Importantly, require each hypothesis to be connected to evidence.
Good hypotheses are specific:
- “This function returns `None` when input is empty, and the caller doesn’t handle it.”
- “The parser treats unary minus as binary minus in this token sequence.”
- “The CLI exits with code 0 because an exception is swallowed.”
Bad hypotheses are vague:
- “It’s probably a bug in your code.”
- “Maybe the environment is wrong.”
- “Try reinstalling dependencies.”
Ask the model to rank hypotheses by likelihood and impact. Ranking prevents it from listing 20 ideas with no prioritization.
Step 3: design tests to confirm/deny
For each hypothesis, demand both a confirming check and a denying check:
- Confirming test: “If hypothesis is true, we should observe X.”
- Denying test: “If hypothesis is false, we should observe Y.”
Examples of quick tests:
- add one log line to confirm code path,
- add one unit test for a suspected edge case,
- inspect a value at a boundary (before/after parsing),
- run a minimal reproduction with a modified input.
If you apply changes without a hypothesis and an observation, you’re doing random walk debugging.
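Continuing the toy example from Step 1 (names are illustrative, not a real codebase), a quick check for the “returns `None` on empty input” hypothesis is a single unit test that confirms or denies it in one run:

```python
# test_hypothesis.py -- quick confirming/denying check for hypothesis #1.
# Run with: python -m pytest test_hypothesis.py -q
from repro import parse_amount


def test_empty_input_returns_a_number():
    # If the hypothesis is TRUE, parse_amount("") returns None and this test FAILS (confirming).
    # If the hypothesis is FALSE, it returns a number and this test PASSES (denying).
    result = parse_amount("")
    assert isinstance(result, float), f"expected a float, got {result!r}"
```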
Step 4: iterate with evidence
Run one check at a time. Then update the model with results:
- what you ran,
- what you observed,
- which hypotheses are now less likely,
- what the next check should be.
This keeps the model anchored to reality and avoids “narrative debugging.”
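A made-up evidence update, continuing the toy example, might read like this:

```text
Ran: python -m pytest test_hypothesis.py -q
Observed: the confirming test failed; parse_amount("") returned None, not a float.
Update: hypothesis #1 (None on empty input, unhandled by the caller) is confirmed;
        the other hypotheses are now unlikely and can be dropped.
Next check: none needed -- proceed to the smallest fix in parse_amount.
```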
Step 5: implement the smallest fix
Once a hypothesis is confirmed, fix it with minimal scope:
- diff-only changes,
- avoid refactors during the fix,
- prefer changing one function over re-architecting,
- preserve behavior outside the bug.
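In the toy example, the smallest fix changes one line inside `parse_amount` and leaves the caller untouched; this is a sketch of scope, not a prescription for your bug:

```python
# repro.py (after the fix) -- only parse_amount changes; format_total is untouched.

def parse_amount(text: str) -> float:
    """Empty or blank input now parses to 0.0 instead of silently returning None."""
    if not text.strip():
        return 0.0           # the one-line fix: return a number, not None
    return float(text)
```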
Step 6: lock the fix (regression test)
If you don’t lock the fix, it will regress. Locking means:
- add a test that fails before the fix,
- keep the test after the fix (forever),
- run tests in CI (eventually) so regressions get caught immediately.
A good regression test explains “what went wrong” and “what must never happen again.” That’s how teams build reliability over time.
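Carrying the toy names forward, a regression test for that fix would fail before the change (with a TypeError) and must keep passing afterward:

```python
# test_regression.py -- locks the "None on empty input" fix.
# Run with: python -m pytest test_regression.py -q
from repro import format_total


def test_empty_input_formats_without_crashing():
    # Regression guard: empty input once crashed the caller with a TypeError
    # because parse_amount returned None. It must now format as a zero total.
    assert format_total("") == "Total: 0.00"
```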
Copy-paste prompt templates
Template A: hypotheses + tests only (no code)
We have a bug. Do NOT write code yet.
Goal:
[expected behavior]
Reproduction:
```sh
[command]
```
Actual output:
```text
[output]
```
Relevant code:
(paste minimal)
Task:
1) Provide 3–5 ranked hypotheses for the root cause.
2) For each hypothesis, propose a confirming test and a denying test.
3) Stop and wait for my results.
Template B: implement the smallest fix
Based on the confirmed hypothesis (#N), implement the smallest fix.
Constraints:
- Diff-only changes
- No refactors beyond what’s required
- Add/keep a regression test
Output:
- Unified diff only