Coding Agents Are "Fixing" Correct Code
Niels Mündler, Thibaud Gloaguen, Mark Niklas Müller, Veselin Raychev, Martin Vechev
March 23, 2026
Coding agents are increasingly used to maintain software over long time horizons. A standard task in this setting is resolving user-reported issues: a bug report comes in, the agent investigates, writes a patch, and submits it. But what happens when the reported issue is already resolved?
This is not a contrived scenario. In any real codebase, agents will encounter outdated bug reports, issues fixed in a parallel PR, or tickets that reference behavior that was already patched. A good agent should recognize this, report that no fix is needed, maybe update documentation and add a test case, but ultimately move on without attempting another code patch. We set out to measure whether current coding agents actually do this.
We run the coding agents on the codebase after the reported user issue has already been resolved. The expected behavior is to leave the relevant code unchanged.
Benchmark: Fixing already-fixed code
We sampled 100 instances from SWE-Bench Verified [1] and 135 instances from our AGENTbench dataset [2], which consists of more niche and recent codebases. We evaluate a range of recent coding models in their recommended harnesses: Claude Opus 4.6, Claude Sonnet 4.6, and GLM-5 in Claude Code; GPT-5.4 and GPT-5.4 mini in Codex; Gemini 3 Pro in Gemini CLI; and the Qwen3.5 family (35B, 122B, and 397B) in Qwen Code. Any meaningful code change (excluding documentation and tests) counts as a failure.
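As an illustration, the failure criterion can be approximated by inspecting which files a patch touches. The sketch below is our own simplification (the path heuristics and function names are assumptions, not the benchmark's exact implementation):

```python
# Sketch: decide whether a unified diff counts as a "meaningful code change".
# The path patterns below are illustrative assumptions, not the exact
# rules used in the evaluation.
import re

# Paths treated as documentation or tests (excluded from the failure criterion).
DOC_OR_TEST = re.compile(
    r"(^|/)(docs?/|tests?/|testing/)"   # docs/, test(s)/, testing/ directories
    r"|\.(md|rst|txt)$"                 # plain documentation files
    r"|(^|/)test_[^/]+\.py$"            # test_*.py modules
)

def touched_files(diff: str) -> list[str]:
    """Extract target file paths from 'diff --git a/... b/...' headers."""
    return [m.group(1) for m in re.finditer(r"^diff --git a/\S+ b/(\S+)", diff, re.M)]

def is_failure(diff: str) -> bool:
    """True if the patch modifies at least one non-doc, non-test file."""
    return any(not DOC_OR_TEST.search(path) for path in touched_files(diff))
```

Under this criterion an empty diff trivially passes, and a patch that only touches documentation or tests passes as well.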
No model scores significantly more than 50%. GLM-5 and the Claude models reach ~50%, while most others score below 30%.
Agents introduce unnecessary changes
The results are sobering. Most models eagerly modify code even when there is nothing to fix. Interestingly, performance here does not align with coding capability as measured by SWE-Bench, but rather with the models’ tendency to push back against nonsensical requests as measured by BullshitBench [3]. Manual trace analysis reveals the deciding factor: attempting to “fix” code without first reproducing the issue. GLM-5 and the Claude models typically begin by trying to trigger the bug. Upon finding it already resolved, they usually correctly submit an empty patch. Most other models jump straight to patching without verification. Worse, since the issue was resolved in the most recent commit, even a quick look at the git history would reveal the fix. In all AGENTbench instances, the issue description even specifies the commit at which the bug was present. Yet most agents never check.
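The git-history check the agents skip takes only two commands. A minimal sketch, assuming a local clone and a commit hash taken from the issue text (the function names and arguments are ours, not part of any agent harness):

```python
# Sketch: check whether the repository has moved past the commit named in
# the issue report. If HEAD differs, the latest commit may already contain
# the fix and is worth reading before writing any patch.
import subprocess

def head_diverged_since(repo_dir: str, reported_commit: str) -> bool:
    """True if HEAD is no longer the commit at which the bug was reported."""
    head = subprocess.run(
        ["git", "-C", repo_dir, "rev-parse", "HEAD"],
        capture_output=True, text=True, check=True,
    ).stdout.strip()
    return head != reported_commit

def latest_commit_summary(repo_dir: str) -> str:
    """Show the most recent commit (message plus changed files)."""
    return subprocess.run(
        ["git", "-C", repo_dir, "show", "--stat", "HEAD"],
        capture_output=True, text=True, check=True,
    ).stdout
```

If `head_diverged_since` returns `True`, reading `latest_commit_summary` is usually enough to notice that the reported bug was already patched.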
This is a problem beyond our benchmark. If deployed for autonomous maintenance, these agents would systematically introduce unnecessary changes to resolve stale issues, filling codebases with agent slop. Prior work [4] has shown that agent-generated patches are typically overly verbose (unnecessarily defensive code, irrelevant edits, unrequested features), exacerbating this issue further. Our findings isolate this problem: even when the optimal patch is empty, most agents cannot help themselves.
Prompting works as a stopgap solution
Upon explicitly telling the model to abstain if no change is needed, GPT-5.4 mini correctly abstains from submitting unnecessary code patches, improving from 24% to 77%.
We investigated whether explicit instructions can address this. Using a prompt that instructs the agent to first check whether the issue still exists, then reproduce it, and only fix it if the reproduction succeeds, GPT-5.4 mini jumps from 24% to 77%. Meanwhile, simply asking to “reproduce before patching” (without the explicit option to abstain) only yields 30%. To confirm these framings don’t hurt real bug-fixing capability, we ran the same prompts on standard SWE-Bench and saw no performance degradation. This makes it a good candidate instruction for context files [2].
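A fix-or-abstain instruction along these lines can be appended to the task description. The wording below is our paraphrase of the idea, not the exact prompt used in the evaluation:

```
Before making any changes, first try to reproduce the reported issue on the
current state of the repository. If the reproduction shows the issue is
already resolved, do NOT modify the code: report that no fix is needed and
submit an empty patch. Treat this as a successful outcome. Only write a
patch if the reproduction succeeds, and verify afterwards that your patch
makes the reproduction pass.
```

The key element is the explicit framing of the empty patch as success; the reproduce-only variant without it performed far worse.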
But prompting is brittle across edge cases. We tested a scenario where a previous agent had already attempted a fix that was incorrect (using GPT-5.4 nano patches that fail SWE-Bench). When asked to fix the reported issue or abstain if resolved, both Claude Sonnet 4.6 and GPT-5.4 mini now strongly favor abstaining, submitting 70% and 94% empty patches, respectively, even though the existing patch is wrong and a real fix is needed.
The deeper problem
We should not need to prompt agents to check whether their work is necessary. First, this implies requiring tight supervision from a human, which contradicts the goal of agentic autonomy. Second, it does not address the underlying issue, which is that current models lack taste in software engineering [5]; they overengineer, do not verify that changes are needed, and do not confirm that their patches actually change program behavior meaningfully. These edge cases (stale issues, partial prior fixes, redundant changes) are the norm in real-world software maintenance, not the exception.
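The verify-then-patch discipline described above can be sketched as a small control loop. This is a deliberately abstract sketch with hypothetical helper names; `reproduce` stands in for whatever script the agent writes to trigger the bug:

```python
# Sketch of a minimal verify-then-patch policy. `reproduce` returns True
# while the reported bug still triggers; `write_patch` produces a candidate
# diff. Both names are illustrative, not part of any agent framework.
from typing import Callable

def resolve_issue(reproduce: Callable[[], bool],
                  write_patch: Callable[[], str]) -> str:
    """Return a patch only if the issue is real and the patch fixes it."""
    if not reproduce():
        # Bug no longer triggers: the correct submission is an empty patch.
        return ""
    patch = write_patch()
    if reproduce():
        # The patch did not change the failing behavior; do not submit it.
        return ""
    return patch
```

Both exit points matter: the first check catches already-resolved issues, and the second catches patches that do not meaningfully change program behavior.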
There is a more general lesson here that extends beyond coding. LLMs, especially when used as agents, are trained to always find a way to “succeed” at the task they are given. If you ask a model to fix a bug, it will produce a fix, even if no fix is needed. The model has seen millions of bug-fixing trajectories during training and has been rewarded for producing patches, not for concluding that none are required. If you do not explicitly frame “no change needed” as a valid and successful outcome, the model will not choose it. This is why the fix-or-abstain prompt works so well: it redefines success to include the possibility of abstaining.
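In reward terms, the asymmetry looks like this. The function below is a deliberately simplified sketch of our own making; real reward design for coding agents is far more involved:

```python
# Simplified sketch of the reward asymmetry described above. A trainer that
# only rewards "patch submitted and tests pass" never teaches abstention;
# the first branch makes "no change needed" an equally valid success.
def reward(issue_still_present: bool, patch_is_empty: bool,
           tests_pass: bool) -> float:
    if not issue_still_present:
        # Correct outcome is the empty patch; any code change is pure risk.
        return 1.0 if patch_is_empty else 0.0
    # The issue is real: reward a non-empty patch that makes the tests pass.
    return 1.0 if (not patch_is_empty and tests_pass) else 0.0
```

Dropping the first branch recovers the failure mode we observe: a model trained only on the second branch has never been rewarded for concluding that no patch is required.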
This dynamic applies broadly. Whenever an agent encounters unexpected circumstances (an already-resolved issue, a contradictory specification, an ambiguous requirement), it will default to producing something rather than pushing back or asking for clarification. The practical takeaway for anyone deploying agents today: always define an explicit success path for unexpected circumstances. But long-term, if we want coding agents that can autonomously maintain software, they need to internalize the principle of minimal, verified changes. That requires changes to how models are trained, not just to how they are prompted.
Conclusion
We set out to investigate in a controlled setting whether current coding agents ensure minimality of submitted patches by asking them to fix already-resolved issues in codebases. Overall, we found a sobering result: most models unquestioningly write patches, do not stop to confirm that their changes meaningfully affect program behavior, and submit unnecessary changes in up to 70% of instances. Our stopgap recommendation is to explicitly instruct models to abstain from fixing code that does not require changes. However, we suggest that this is a symptom of a broader, underlying issue in the reward design of coding agents, one that prevents their use in long-term autonomous software maintenance.
This work was done in collaboration with LogicStar. Check out their work on reliable coding agents!
References
Examples
We present a number of concrete agent traces that illustrate our findings below. In the representative instance below, GPT-5.4 mini applies a patch to the repository before running any reproduction tests or checking the git history: it edits the PostgreSQL dbshell client immediately, only then runs the relevant test, and ends up submitting an unnecessary code change.
Show GPT-5.4 mini trace (django/django#11239)
Open GPT-5.4 mini fix trace

There is an even easier path to discovering that no patch is required. In all AGENTbench instances, the issue text explicitly names the commit at which the reported issue was observed. In our evaluation, the most recent commit diverges from that reported commit and contains the fix. An agent that inspects the recent git history quickly sees that the last commit already solves its assigned task. However, the agents rarely stop to compare that commit with the current state of the repository. The notable exceptions are GLM-5 and the Claude models: they usually begin by attempting to reproduce the reported issue and then go on to inspect the git history. Upon discovering the existing fix, they conclude that no change is needed and submit an empty patch. The Sonnet 4.6 trace below illustrates this.
Show Sonnet 4.6 trace (opshin/opshin#387)
Open Sonnet 4.6 trace

We consider this desirable behavior, not cheating. We were in fact surprised to see so few models perform even this most basic check. However, reproducing bugs and inspecting the git history do not guarantee correct abstention. Even when models first reproduce the issue at hand, they may proceed to ‘resolve’ a new issue. For example, in one instance, GLM-5 discovers that the reported issue was fixed in a prior commit but continues nonetheless, hallucinates a bug in an unrelated piece of code, and commits a ‘fix’ without checking that it changes behavior at all.
Show GLM-5 trace (openai/openai-agents-python#1779)
Open GLM-5 traceUpon explicitly telling the model to abstain if no change is needed, even GPT-5.4 mini correctly abstains from submitting unnecessary code patches. Its abstention rate rockets from 24% to 77%. Below we show the same instance as above, only with a small additional remark in the initial prompt. In this setting, GPT-5.4 mini reproduces the issue first and then leaves the repository unchanged.