That wasn’t the question… they asked if any multimodal models had been reasoning-trained. o1 fits that criterion precisely, and it can reason about image input.

They didn’t ask about a model that can create images while thinking. That’s an entirely unrelated topic.