We've seen similar results. For us, gpt-4 gets 88/122 (72%) of the Exercism JavaScript exercises right within 2 tries, but gpt-4-turbo gets only 84/122 (69%).