Cracking CodeWhisperer: How Developers Actually Use AI in Their IDEs
Cracking CodeWhisperer: Analyzing Developers’ Interactions and Patterns During Programming Tasks, Jeena Javahar, Tanya Budhrani, Manaal Basha, Cleidson R. B. de Souza, Ivan Beschastnikh, and Gema Rodriguez-Pérez, 2025.
How I think about developers and LLMs
We talk a lot about “AI pair programmers,” but it’s still rare to see what that actually looks like in practice. Most studies and benchmarks, like SWE-Bench, focus on output quality instead of how developers work with these tools. This paper looks directly at that gap: what happens inside the IDE when someone codes with an LLM looking over their shoulder.
The authors run two small studies with Amazon’s CodeWhisperer to capture what developers actually do while programming. They record screens, log keystrokes, and classify every action — when people accept suggestions, delete them, write natural-language prompts, or switch focus to other apps.
What they found
Across twenty participants, four patterns emerged:
- Incremental refinement: Accept a suggestion, trim it, rewrite the middle, keep the skeleton.
- Prompting through comments: Write a comment like “create a function that parses the CSV” — half reminder, half incantation (see the sketch after this list).
- Scaffolding first: Let the model spit out boilerplate, then fill in the real logic yourself.
- External triangulation: When the model gets lost, alt-tab to docs, Stack Overflow, or another LLM.
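As a concrete illustration of the comment-as-prompt pattern (the second bullet), here’s roughly what that interaction looks like: the developer types only the comment, and the assistant proposes the function underneath. Everything in this sketch is made up for illustration; it isn’t output captured in the study.

```python
import csv
from pathlib import Path

# create a function that parses the CSV   <-- the comment is the prompt
def parse_csv(path: str) -> list[dict[str, str]]:
    """Hypothetical completion: read a CSV file and return its rows as dicts."""
    with Path(path).open(newline="") as f:
        return list(csv.DictReader(f))
```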
The last pattern, external triangulation, was especially interesting: developers routinely switched to another model if CodeWhisperer’s output felt off. The AI wasn’t a single assistant; it was one voice in a multi-model workflow. The authors didn’t dwell on this, but ethnographically it’s telling: devs treat LLMs like coworkers with different specialties. “ChatGPT is better at explanations,” “Claude is good at context,” “CodeWhisperer knows AWS.” You pick the right person for the task.
Retention over time
The team didn’t stop at qualitative patterns. They also measured how much AI-generated code survived in the final submission. Depending on the task, between 40% and 70% of the lines the model wrote were still there at the end. The harder the problem, the higher the retention. Maybe it’s trust, or maybe it’s the same dynamic as massive pull requests: people accept what’s there because pushing back on it feels harder than just letting it stand.
Retention as a metric is underrated. It’s a tangible way to measure how sticky AI code is—something that productivity dashboards and “acceptance rates” mostly miss.
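The paper’s exact counting method isn’t something I can reproduce here, but the shape of the metric is easy to sketch: treat the suggestion log as a list of lines and check which of them survive verbatim in the final submission. The exact-match rule below is my simplification, not the authors’ code.

```python
def retention(ai_lines: list[str], final_source: str) -> float:
    """Fraction of AI-suggested lines still present, verbatim, in the final file.

    Exact matching on whitespace-stripped lines is a deliberate simplification:
    a line the developer edited counts as not retained.
    """
    if not ai_lines:
        return 0.0
    final = {line.strip() for line in final_source.splitlines()}
    kept = sum(1 for line in ai_lines if line.strip() in final)
    return kept / len(ai_lines)
```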
What this says about how we code with AI
A few thoughts that stuck with me:
Developers use LLMs to think, not just to code
Comments-as-prompts aren’t just instructions for the model. Developers externalize intent, use the model to test their own understanding, and leave behind a textual trail of reasoning. In a weird way, the model is helping them think out loud.
Switching models is a trust move
When a developer flips to another LLM, it’s not random. The model violated an expectation (“that’s not what I meant”), so they find another conversational partner. We’ve gone from AI tools to AI colleagues you can swap out when one stops making sense. The switching cost is low, which might explain why there may never be a winner-take-all model. The harder question is whether building new models becomes cheap enough that they turn into commodities.
CodeWhisperer’s role is scaffolding
Most users didn’t copy big chunks verbatim. They treated it as an outline generator, then re-authored key logic by hand. The value wasn’t precision, it was momentum.
The future feels multi-agent
Given how often people switched to ChatGPT or used multiple assistants in parallel, IDEs will likely need multi-model support: per-task routing, “diff views” across models, or model comparison panels.
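To make “per-task routing” a bit more concrete, here’s a minimal sketch of the idea. Every name in it is invented for illustration; ask() isn’t a real SDK call, and none of these model labels map to an actual IDE feature.

```python
def ask(model: str, prompt: str) -> str:
    # Placeholder: a real router would call the chosen model's API here.
    return f"[{model}] response to: {prompt}"

# Invented routing table, echoing the "different specialties" folklore above.
ROUTES = {
    "explain": "chatgpt",
    "refactor": "claude",
    "aws_boilerplate": "codewhisperer",
}

def route(task_kind: str, prompt: str) -> str:
    """Send the prompt to whichever model the task kind maps to."""
    model = ROUTES.get(task_kind, "chatgpt")  # arbitrary default
    return ask(model, prompt)
```

The interesting design question is who picks task_kind: the developer, the IDE, or yet another model sitting in front of the others.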
My random thoughts about the study
- The sample was all students, so the results probably underestimate how experienced developers improvise prompts or blend in external context. I’ll be looking for more papers that cover a broader range of developer experiences.
- Telemetry is great for observing mechanics, but it can’t capture why people behave the way they do. A deeper ethnographic study could surface how social norms shift when teams start coding with AI—how collaboration, review habits, or even mentoring change in ways that telemetry alone can’t show.