I've been trying out o3-mini in Cursor today. It seems "smarter," but it tends to overthink things, and if it isn't given perfect context it's prone to hallucinating. Overall I still prefer Sonnet. It has a certain magic of always making reasonable assumptions and finding simple solutions.
Agreed that Sonnet still feels like the best all-round model. The new ones are at least on par with it for pure coding, or exceed it (r1 and o1 both do IME), but they don't generalize as well, especially to tasks with subjective answers. I find the latest Gemini 2.0 Flash Thinking to be the closest to Sonnet on those.
There's a suite of code-related tasks that I run against every new release. It covers a range of areas (DevOps, media manipulation, etc.) and is derived from issues I have faced over the years. No model has solved the whole set in one go, but Claude still remains the best.
An example of the sort of problems in the suite:
> I have an MP4 file with a subtle encoding issue (something I ran into a couple of years ago while fixing a bug in a computer vision pipeline). In the prompt I also pass the output of ffprobe and ask for the ffmpeg command that will fix it. Only Claude has figured out the real underlying issue (after 4 interactions).
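
For anyone who wants to set up a similar test, here is a rough sketch of how a prompt like that could be assembled. It is not the actual prompt or file from the suite, which isn't shown here. The helper names (`ffprobe_args`, `probe`, `build_prompt`), the prompt wording, and the file name `clip.mp4` are all made up for illustration. It assumes `ffprobe` is installed and on the PATH.

```python
import json
import subprocess


def ffprobe_args(path: str) -> list[str]:
    """Build an ffprobe invocation that dumps container and stream metadata as JSON."""
    return [
        "ffprobe", "-v", "error",        # only print errors, not the banner
        "-show_format", "-show_streams",  # container-level and per-stream metadata
        "-of", "json",                    # machine-readable output
        path,
    ]


def probe(path: str) -> dict:
    """Run ffprobe (assumed to be on PATH) and parse its JSON output."""
    result = subprocess.run(
        ffprobe_args(path), capture_output=True, text=True, check=True
    )
    return json.loads(result.stdout)


def build_prompt(metadata: dict, symptom: str) -> str:
    """Combine a description of the symptom with the ffprobe metadata.

    The wording is a hypothetical example, not the prompt from the suite.
    """
    return (
        f"{symptom}\n\n"
        "ffprobe output:\n"
        f"{json.dumps(metadata, indent=2)}\n\n"
        "What ffmpeg command fixes this file?"
    )
```

Usage would be something like `build_prompt(probe("clip.mp4"), "Frames decode out of order in my pipeline.")`, with the result pasted into whichever model is being tested.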