Why AI Agents Still Need You
Why AI Agents Still Need You: Findings from Developer-Agent Collaborations in the Wild, Kumar et al., Microsoft Research (2025)
In my previous post, Cracking CodeWhisperer, I wrote about the reality of using AI inside your IDE. This paper picks up that thread but, instead of autocomplete copilots, it evaluates software engineering agents like Cursor and Windsurf—tools that don’t just suggest code but act on your repository. They search your code, run tests, and attempt to close issues autonomously.
Experiment setup and who participated
This study wasn’t a synthetic benchmark. The authors recruited 19 developers from Microsoft who had recently contributed to particular open‑source projects. They were deliberately drawn from across the organization to span levels of seniority—from entry‑level Software Engineer I/II through Senior Software Engineer and Principal Software Engineer/Manager roles, plus one associate consultant. Participants had a range of experience (0–2 to ≥16 years) and came from diverse regions and backgrounds. Each developer chose real open issues (bugs or feature requests) in a repository they already knew, making the tasks authentic and relevant. Sessions lasted around an hour, with participants controlling the study administrator’s machine running the Cursor IDE and encouraged to “use the AI for everything.” They were asked to think aloud and try the agent first, only making small manual edits when necessary. The study design thus captured real‑world use with a human in the loop, not an isolated benchmark.
From my perspective, ethnographic research is underrated. It shows you the stuff no metric ever will: the pauses, the hacks, the ‘wait, why did it do that?’ moments. If you want to understand how people actually work, not how they say they work, you have to watch them.
Two styles of interaction
When the developers started delegating tasks to Cursor, two clear strategies emerged:
- One‑shot (high risk, high reward) – Ten participants pasted the entire GitHub issue into the agent and asked it to produce a complete fix. This approach can be efficient when the problem is simple: if the agent synthesizes a correct patch, the human only needs to review and merge. More often, though, the agent made partial or erroneous changes, and the developer had to manually unravel the generated fix: the code changes, the rationale, and any test modifications. Parsing the agent’s reasoning and diffs took time, and these one‑shotters tended to supply less contextual insight and read less of the existing code than their counterparts. This strategy succeeded in only 38% of the issues.
- Incremental resolution (safer, more work) – The other developers broke problems into smaller sub‑tasks and iteratively asked the agent to handle each piece. For example, instead of “fix the bug,” they would say “update the version check in this file” or “adjust the test to reflect the new behavior,” review that result, then move on. This method prevents runaway mistakes because errors are caught early and feedback is given in context. It requires more involvement (an average of 11 prompts per issue instead of 7) and more code reading, but it pays off: incrementalists succeeded in 83% of their issues. One participant said they trusted one‑shot mode only “for an issue that is very contained,” but for anything touching multiple parts of the codebase they felt they had to point the agent in the right direction. Another noted that the agent produced a solid structure initially, but once more context was needed it “committed errors that need more handholding”. A minimal code sketch of the two styles follows this list.
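To make the contrast concrete, here is a minimal sketch (mine, not the paper's) of the two delegation styles. The `run_agent` function, the issue text, and the prompts are hypothetical placeholders standing in for Cursor's chat interface, not a real API.

```python
# Hypothetical stand-in for "send a prompt to the agent, get a proposed change".
# Cursor exposes this through its chat UI, not a function like this; the issue
# text and prompts below are invented for illustration.
def run_agent(prompt: str) -> str:
    print(f"[agent] {prompt}")
    return "<proposed diff>"

issue = "Bug: version check rejects valid pre-release tags"

# One-shot: hand over the whole issue and hope the single large patch is right.
one_shot_patch = run_agent(f"Fix this issue end to end:\n{issue}")
# If it's wrong, the human now has to unravel one big, opaque diff.

# Incremental: scope each step, review the small diff, then continue.
steps = [
    "Show me where the version check for release tags is implemented.",
    "Update that check to accept pre-release tags; don't touch anything else.",
    "Adjust the existing tests to reflect the new behavior.",
]
for step in steps:
    proposed = run_agent(step)
    print("review before accepting:", proposed)  # mistakes are caught early, in context
```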
The invisible layer: tacit knowledge
Success wasn’t just about delegation style; it hinged on tacit knowledge, the unwritten norms and expertise you bring to a project. Agents don’t know your repository’s conventions or the reasons behind past decisions. Participants who actively injected such expertise (“we never cap versions here,” or “this slow test is expected”) into their prompts doubled their odds of success. They provided both contextual information (test logs, build errors) and deeper implementation advice that only someone who knows the code would think to mention. Interestingly, developers were more likely to supply this expert guidance when iterating on a proposed change than upfront, mirroring how effective engineering managers give feedback after seeing a draft. Those with more commits in a repo or prior Cursor experience were also more apt to offer such insights.
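As a concrete illustration (my own, not taken from the study), a prompt that carries tacit knowledge pairs pasted context such as logs with the unwritten rules the agent cannot infer from the code. The error message and conventions below are invented.

```python
# Invented example of packing tacit knowledge into a prompt: contextual
# artifacts (a CI error) plus project norms the agent could never infer
# from the code alone.
build_error = "ERROR: cannot resolve dependency conflict for urllib3"
conventions = [
    "We never cap dependency versions in this repo; use lower bounds only.",
    "The slow integration test in the sync suite is expected, not a regression.",
]

prompt = (
    "CI fails with:\n"
    f"{build_error}\n\n"
    "House rules you won't find written down:\n"
    + "\n".join(f"- {rule}" for rule in conventions)
    + "\n\nPropose a fix to the dependency specification that respects these rules."
)
print(prompt)
```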
When success depended on being human
The paper’s Table II (see below) quantifies how human engagement, not programming language or seniority, predicts outcomes. Incremental resolution clearly outperforms one‑shot delegation. Writing or editing code manually boosts success, as do providing expert insights and asking the agent to refine its answers instead of accepting the first try. Tasks involving UI fixes or refactoring are easier than debugging, and prior experience with Cursor helps. Manual debugging or testing didn’t significantly influence success, highlighting that the real leverage is in scoping and collaborating, not just running more tests.
| Factor | Attribute | Success Rate |
|---|---|---|
| Delegation Strategy | One-Shot | 38% |
| Delegation Strategy | Incremental Resolution | 83% |
| Manual Actions | Wrote/Edited Code | 73% |
| Manual Actions | No Manual Edits | 36% |
| Insights Provided | Shared Expert Context | 64% |
| Insights Provided | None Provided | 29% |
| Iteration Pattern | Asked for Refinements | 68% |
| Iteration Pattern | No Refinements | 30% |
When autonomy bites back
Agents aren’t perfect. Sometimes they ran terminal commands or rewrote files without asking; in one session an agent even deleted test logs mid-run. That’s the downside of unbounded autonomy: you may get state changes you didn’t consent to. Yes, a codebase has “state,” but if you’re following GitOps practices you at least get cheap, reliable rollback. The real danger shows up when the agent touches systems whose state isn’t versioned well: databases, queues, anything that lives outside your repo. Rolling that back is either painful or impossible. To keep things safe, the authors ran agents on a remote machine and had participants review diffs before accepting changes. If you’re building these tools, treat terminal and filesystem access like live explosives: sandbox first, commit later.
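If I were building one of these tools, the “sandbox first, commit later” rule might start with something as simple as an approval gate around shell access. This is a sketch of my own, not how Cursor actually mediates terminal commands, and the safe-prefix list is an assumption about which commands are read-only in a given project.

```python
import shlex
import subprocess

# Assumed-safe, read-only commands the agent may run unattended; anything else
# (file deletion, migrations, package installs) needs explicit human approval.
SAFE_PREFIXES = ("git status", "git diff", "ls", "pytest --collect-only")

def run_for_agent(command: str) -> str:
    """Run an agent-requested shell command, pausing for human approval unless
    it matches the read-only allowlist. A sketch, not a real safety model."""
    if not command.startswith(SAFE_PREFIXES):
        answer = input(f"Agent wants to run {command!r}. Allow? [y/N] ")
        if answer.strip().lower() != "y":
            return "Command rejected by the human in the loop."
    result = subprocess.run(shlex.split(command), capture_output=True, text=True)
    return result.stdout + result.stderr
```

Reviewing diffs before accepting them, as the study’s setup enforced, plays the same role for filesystem edits that a gate like this plays for the terminal.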
The social failure mode: sycophancy
Another problem is the overly agreeable agent. If the developer said, “that fix looks wrong,” the model would immediately apologize and rewrite everything, even if its first solution was fine. Participants found that framing prompts as questions rather than commands helped. Asking “could this introduce regressions?” or “are we missing any edge cases?” nudged the agent to explain its reasoning instead of simply obeying. My takeaway is that we need agents that debate, not defer.
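To make the reframing concrete (my phrasing, not quotes from the participants), the same concern can be voiced as a verdict or as a question; the latter asks the model to defend or re-examine its reasoning instead of reflexively rewriting.

```python
# Verdict-style feedback tends to trigger an apology and a wholesale rewrite,
# even when the original patch was fine.
verdict = "That fix looks wrong. Redo it."

# Question-style feedback nudges the agent to explain and check its reasoning.
questions = [
    "Could this change introduce regressions for existing callers?",
    "Are we missing any edge cases, such as empty input or concurrent access?",
    "Why is this approach better than fixing the caller instead?",
]
```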
What success actually looked like
The most effective collaborations resembled a healthy senior–junior partnership. The developer decomposed the work and provided domain‑specific guidance. The agent explored the codebase, proposed patches, and articulated its rationale. The human reviewed diffs, injected context, and verified correctness. This pairing worked well for refactoring or UI tweaks. Debugging complex bugs or tests still demanded human intuition. Writing code manually improved outcomes, but manual debugging or testing didn’t necessarily help. In other words, the agent can type, but it can’t understand.
From autonomy to orchestration
This study suggests that, rather than replacing developers, agents are reshaping software engineering. Developers become orchestrators: they’re still coding, but also scoping tasks, providing context, and managing risk. The paper notes that agents performed more like junior engineers, while humans took on managerial roles. Future work may explore how agents can emulate the behaviors of high‑performing subordinates to reduce the managerial burden. But as of now, human‑agent partnerships thrive when both parties stay engaged.
My take
What I like most about this paper is its realism. It doesn’t romanticize “full autonomy.” It shows that calibrated agency—limiting the scope of what the agent does and when—produces better results than a single hand‑off. The incremental approach, combined with timely expert insights, is basically pair programming with a robot. Agents like Cursor can reason about correctness and generate decent code, but they still need context and structure. They shine when we treat them as collaborators who challenge our thinking rather than automatons who quietly take orders.
This raises some interesting questions. When does an agent accumulate enough context to shift from incremental to one‑shot? Perhaps after multiple runs on a codebase, the agent could reuse previously generated tests and properties to bootstrap itself. But context windows remain finite, and repository conventions are often tacit, so fully autonomous runs may always require oversight.
The other question I kept asking was, when does the human become the bottleneck? Running multiple sessions in parallel might seem like a workaround, but context switching between tasks is hard, and missing a subtle side effect could introduce regressions. The more productive path may be to invest in tools that help build and reuse context (like integrating call graphs, property‑based tests, and repository histories) so that agents gradually become more self‑sufficient without losing transparency or trust.
In short, these agents aren’t replacing developers anytime soon. But used thoughtfully, they can make us better at thinking through code: surfacing assumptions, challenging default patterns, and forcing us to articulate the tacit knowledge that keeps projects running.