12/13/2024

Thread

Tweet 1

It's so funny that Devin popularized SWE-bench, which no one had ever heard of. And now even Anthropic and Google report their numbers. But not Devin. Because it's a garbage benchmark and no one cares, so why would you?

---

Tweet 2

It's so funny that OpenAI made SWE-bench Verified and no one has made the obvious observation that, since agent performance on SWE-bench UNverified is decent, the whole thing is contaminated to hell.

---

Tweet 3

It's so funny that the actual issues that did make it into Verified are often also hopelessly underspecified.

---

Tweet 4

The only benchmark that matters is: is the agent writing itself? If it's not, then it doesn't work.

---

Tweet 5

@paulgauthier's "aider wrote x% of this release" is the most important number in the field

---