I'm always disappointed when I compare answers to the same queries from 2.5 Pro vs. o4-mini/o3. But trying the same query in AI Studio gives much better results, closer to OpenAI's models.
What is wrong with 2.5 Pro in the Gemini app? I can't believe that the model in their consumer app would produce the same benchmark results as 2.5 Pro in the API or AI Studio.
The models in the Gemini app are nerfed in comparison to those in AI Studio: they get a smaller thinking budget, output fewer tokens, and have various safety filters applied. There's certainly a trade-off between using AI Studio for its better performance and using the API or the Gemini app in a way that doesn't involve Google keeping your data for training purposes.
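For what it's worth, the API exposes all three of these knobs directly, so you can see how much each one matters. A minimal sketch with the google-genai Python SDK; the specific budget and limit values are just illustrative, and whether the consumer app actually sets them lower is my speculation, not anything documented:

```python
# pip install google-genai
from google import genai
from google.genai import types

client = genai.Client(api_key="YOUR_API_KEY")  # placeholder key

response = client.models.generate_content(
    model="gemini-2.5-pro",
    contents="Explain the difference between a mutex and a semaphore.",
    config=types.GenerateContentConfig(
        # Thinking budget in tokens; a constrained budget is my guess at what
        # a "nerfed" app-side configuration might look like.
        thinking_config=types.ThinkingConfig(thinking_budget=2048),
        # Cap on output tokens, another lever the app could set lower.
        max_output_tokens=4096,
        # Safety filters are configurable per-category via the API.
        safety_settings=[
            types.SafetySetting(
                category=types.HarmCategory.HARM_CATEGORY_DANGEROUS_CONTENT,
                threshold=types.HarmBlockThreshold.BLOCK_NONE,
            )
        ],
    ),
)
print(response.text)
```

Running the same prompt with a generous vs. a tight thinking budget is a quick way to check how much of the quality gap comes from that one setting alone.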
I don't have any inside information, but I'm sure there are different system prompts used in the Gemini chat interface vs the API. On OpenAI/ChatGPT they're sometimes dramatically different.