Agentic Property Testing
Agentic Property-Based Testing: Finding Bugs Across the Python Ecosystem, Maaz et al., NeurIPS 2025 (Anthropic + Northeastern)
How I think of property-based testing
Property-based testing (PBT) has always felt like one of those great ideas that never quite took off. You describe what should always hold true for a function and let the computer try to falsify it. The hard part is knowing what “should always hold” in the first place — you need enough context about the code’s intent to express meaningful properties. Most developers don’t have the time or patience for that, so we fall back to example-based tests. If you’re testing something simple like a single function call, it’s fairly straightforward to enumerate the properties that should hold when you invoke it. But take something more complex, like code that deserializes data: many edge cases exist, and knowing all of them ahead of time is hard.
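To make that concrete, here’s a minimal Hypothesis sketch of the deserialization example: generate arbitrary JSON-safe values and assert that decoding inverts encoding. The strategy and test are my own illustration, not something from the paper.

```python
import json
from hypothesis import given, strategies as st

# JSON-safe values: NaN is excluded because NaN != NaN would break the
# equality check, and dictionary keys must be strings.
json_values = st.recursive(
    st.none() | st.booleans() | st.integers()
    | st.floats(allow_nan=False, allow_infinity=False) | st.text(),
    lambda children: st.lists(children)
    | st.dictionaries(st.text(), children),
)

@given(json_values)
def test_json_round_trip(value):
    # One property covers every edge case the generator can reach:
    # decoding a serialized value must reproduce the original.
    assert json.loads(json.dumps(value)) == value
```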
Let an LLM do the thinking
This paper asks what happens if you let an LLM handle the step of finding all the invariants. The authors build an agent that crawls Python packages, reads source and docstrings, proposes invariants, writes Hypothesis tests, runs them, and decides whether failing cases are real bugs or false alarms. When it’s confident, it even writes a reproducible report and a candidate patch. It’s part fuzzing, part code reviewer, part intern who doesn’t sleep.
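The paper doesn’t ship the agent itself, so take this as a toy, single-pass sketch of the loop using the Anthropic SDK: the prompt, model name, and file handling are my assumptions, and the real agent adds the crawling, reflection, and triage steps described above.

```python
import pathlib
import subprocess
import sys

import anthropic  # pip install anthropic; reads ANTHROPIC_API_KEY from the env

PROMPT = (
    "Read the Python module below. Propose invariants its public functions "
    "should satisfy (round-trips, idempotence, documented contracts) and "
    "reply with ONLY a runnable pytest file of Hypothesis tests, no prose, "
    "no markdown fences.\n\n{source}"
)

def propose_and_run(module_path: str) -> int:
    """Ask the model for property tests on one module, then run them."""
    source = pathlib.Path(module_path).read_text()
    client = anthropic.Anthropic()
    response = client.messages.create(
        model="claude-sonnet-4-5",  # illustrative model name
        max_tokens=4000,
        messages=[{"role": "user", "content": PROMPT.format(source=source)}],
    )
    test_file = pathlib.Path("test_llm_properties.py")
    test_file.write_text(response.content[0].text)
    # A non-zero exit code is only a *candidate* bug. The agent in the paper
    # re-runs the failure, reflects on whether it is meaningful, and only
    # then writes a report (and sometimes a patch).
    return subprocess.run([sys.executable, "-m", "pytest", str(test_file)]).returncode
```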
They ran this “agentic” tester on 100 popular Python packages – 933 modules total, from `json` and `pathlib` to `numpy`, `pandas`, and `requests`. Each run took about an hour and cost roughly $5. The agent generated 984 bug reports, and after manual review about 56% were real issues; a third of those were the kind you’d actually file upstream. That works out to around ten bucks per genuine bug.
What the agent found
Some of the finds were non-trivial:
- `numpy.random.wald` sometimes returned negative values (now patched)
- AWS Lambda Powertools had a dictionary-slicing bug that duplicated chunks
- A CloudFormation plugin hashed all lists to the same value because of a misplaced `.sort()`
- Hugging Face Tokenizers emitted malformed HSL strings missing a parenthesis
Even `python-dateutil` showed calendar quirks.
The approach works because the agent doesn’t just dump random tests; it reasons about what the code claims to do. It mines relationships like encode ↔ decode, parse ↔ format, or mathematical invariants such as idempotence and commutativity. When a test fails, it reflects, re-runs, and decides whether it’s meaningful before escalating to a report.
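Those relationship classes map directly onto one-line Hypothesis properties. A few stdlib-flavored illustrations of the patterns the agent reportedly mines (my own examples, not tests the agent wrote):

```python
import base64
from hypothesis import given, strategies as st

# Round-trip: decode should invert encode.
@given(st.binary())
def test_base64_round_trip(data):
    assert base64.b64decode(base64.b64encode(data)) == data

# Idempotence: applying the operation twice equals applying it once.
@given(st.lists(st.integers()))
def test_sorted_is_idempotent(xs):
    assert sorted(sorted(xs)) == sorted(xs)

# Commutativity: argument order shouldn't matter.
@given(st.sets(st.integers()), st.sets(st.integers()))
def test_union_commutes(a, b):
    assert a | b == b | a
```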
Ranking which bugs matter
The obvious weakness is intent. The agent can’t tell when a behavior is deliberate, so about half the reports are “bugs” only in spirit. Still, that’s a high hit rate for fully automated analysis. At current API prices the economics already make sense, and as models get cheaper, this sort of continuous background auditing starts to look viable.
To keep the noise manageable, the authors built a scoring rubric to decide which bugs were worth showing to developers. Each report was scored out of 15 points across three dimensions:
| Dimension | What It Measures | Examples / Criteria |
|---|---|---|
| Reproducibility | Can the failure be deterministically reproduced? | Minimal failing input, consistent behavior, clear reproduction script. |
| Legitimacy | Does the input reflect realistic usage and a genuine property claim? | Avoids contrived edge cases, tests a property actually implied by the code or docs. |
| Impact | Would this affect real users or violate documented behavior? | Crashes, silent corruption, or contract violations score higher than minor edge cases. |
High-scoring bugs (a perfect 15/15) were almost always valid. Of the 21 top-scoring reports, 86% were genuine and 81% were worth reporting. Many of them included clean reproduction steps and even proposed patches. It’s a subtle but important step — not just finding bugs, but ranking which ones deserve human attention.
Random thoughts about the approach
A few details stood out and are worth unpacking.
Autonomy and heuristics
First, the authors describe the agent as being able to “crawl through and understand entire codebases” to identify high-value properties. That autonomy is impressive, but I wonder whether an explicit call graph would outperform the model’s current heuristic approach. Having a structural view of function relationships might surface richer and more accurate properties than inferring them statistically from docstrings and call patterns.
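For what it’s worth, the structural view is cheap to obtain: Python’s `ast` module can extract caller-to-callee edges, from which paired names like encode/decode or parse/format could be read off directly. A rough sketch of that idea (mine, not the paper’s):

```python
import ast

def call_edges(source: str) -> dict[str, list[str]]:
    """Map each function in a module's source to the plain names it calls."""
    tree = ast.parse(source)
    edges: dict[str, list[str]] = {}
    for node in ast.walk(tree):
        if isinstance(node, ast.FunctionDef):
            callees = {
                call.func.id
                for call in ast.walk(node)
                if isinstance(call, ast.Call) and isinstance(call.func, ast.Name)
            }
            edges[node.name] = sorted(callees)
    return edges

# Example: functions that call paired helpers (pack/unpack) are natural
# round-trip candidates.
print(call_edges("def encode(x):\n    return pack(x)\n\ndef decode(b):\n    return unpack(b)\n"))
```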
False discovery rate and signal-to-noise
Second, their review puts the false discovery rate at roughly one in two, with a 95% confidence interval of [30%, 58%]. In other words, about half of the agent’s reports point to genuine bugs. It’s decent signal-to-noise for an autonomous system, but still far from replacing a human reviewer. The economics are fine for a large codebase audit, but human judgment remains essential to confirm what’s real.
Cost model realism
And third, while the paper repeatedly cites the cost per bug report, the more interesting number is the bootstrap cost per repository. Each repo run was expensive, but future runs wouldn’t start from zero — the generated property-based tests could be reused, with new runs only covering deltas in the call graph. The savings would depend on how much the code’s interfaces shift between versions, but it’s easy to imagine this evolving into a kind of “continuous fuzzing agent” for large, stable codebases.
What could happen next
What I did find genuinely interesting is how property-based testing teaches a model what correctness means without relying on natural language docs. It made me wonder whether saturating a codebase with properties could serve as an alternative to verbose documentation — giving models behavioral context instead of textual context. Probably overkill, but it would be fascinating to see on a very large, collaborative codebase.
In short, I’d recommend asking your favorite LLM to do property-based testing, but in a manner that’s directed at specific parts of your codebase rather than the whole thing. But hey, if you want to brute-force testing so that you feel better at the end of the day, go for it.