Thread
Tweet 1
Even benchmarks that don't look saturated, e.g. SWE-bench, are so low quality that further improvements probably aren't even desirable
---
Tweet 2
The labs have the strongest incentive to produce high-quality benchmarks, but no incentive to publish them: 1. No one will take "we're the best at our own benchmark" seriously 2. Publishing would help other labs improve
---
Tweet 3
One funny episode: when @bio_bootloader published LoCoDiff, @alexalbert__ retweeted it and then quickly deleted the retweet. There's no way to know why, and the most likely explanation is that on reflection he thought the work wasn't of a quality he wanted to promote...
---
Tweet 4
But I like an alternative explanation: the benchmark is extremely good and extensively used within Anthropic. That's why they're the best at it. And he doesn't want other labs to get the idea to train on it
---
Tweet 5
LoCoDiff tests something every codegen agent has to do: understand the state of a file after editing it several times
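
To make "the state of a file after editing it several times" concrete, here's a rough sketch of the bookkeeping involved. This is a toy illustration, not the benchmark's actual format or harness: the `Edit` type, `apply_edits`, and `exact_match` are all hypothetical names, and the edits are simple search-and-replace operations I made up.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Edit:
    """Hypothetical edit: replace the first occurrence of `old` with `new`."""
    old: str
    new: str


def apply_edits(initial: str, edits: list[Edit]) -> str:
    """Replay the edits in order and return the resulting file contents."""
    content = initial
    for edit in edits:
        if edit.old not in content:
            # An edit that no longer applies means the tracked state has drifted.
            raise ValueError(f"edit target not found: {edit.old!r}")
        content = content.replace(edit.old, edit.new, 1)
    return content


def exact_match(predicted: str, expected: str) -> bool:
    """Assumed scoring rule for this sketch: the whole file must match exactly."""
    return predicted == expected


initial = "def add(a, b):\n    return a + b\n"
edits = [
    Edit("add", "total"),
    Edit("a, b", "a, b, c"),
    Edit("a + b", "a + b + c"),
]
final_state = apply_edits(initial, edits)
```

An agent has to do the equivalent of `apply_edits` in its head: one stale assumption about an earlier edit and every later one lands on the wrong text.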
---