Thread
Tweet 1
It's so funny that OpenAI made SWE-bench Verified and no one has made the obvious observation: since agent performance on SWE-bench UNverified is decent, the whole thing is contaminated to hell.
---
Tweet 2
It's also funny that the actual issues that did make it into Verified are often hopelessly underspecified.
---
Tweet 3
The only benchmark that matters is: is the agent writing itself? If it's not, then it doesn't work.
---
Tweet 4
@paulgauthier's "aider wrote x% of this release" is the most important number in the field
---