We humans live in a 3D world and our training set is a continuous stereo stream of a constant scene from different angles. Sora, on the other hand, learned the world by watching TV. It needs to play more video games, in order to learn 3D scenes (implicit representation of a world) and taking their pictures (rendering). Maybe that was the case I don’t know.