12/20/2022

Thread

Tweet 1

@sailaunderscore 2 already highlights the problem: You're imagining the AI as having motivations other than the reward we've specified. So maybe it understands it could take the reward but it's also altruistic and wants to help us.

---

Tweet 2

@sailaunderscore If you can imagine that, you should also be able to imagine it understands the reward but wants to secure influence over the future because the inner optimizer wants something else.

---