The key takeaway for me is that there is a decent improvement in all categories - about 10% on average, with a few outliers. However, the footprint of this model is much larger, so the performance bump ends up being underwhelming in my opinion. I would expect about the same improvement if they released a 13B version without the MoE. It may be too early to definitively say that MoE is not the whole secret sauce behind GPT4, but at least with this implementation it does not seem to lift performance dramatically.
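To make the footprint point concrete, here is a rough back-of-the-envelope sketch of why a sparse-MoE model's memory footprint grows much faster than its per-token compute. All dimensions and the 8-expert/top-2 routing are illustrative assumptions (roughly Mistral-7B-like numbers), not the released model's actual architecture:

```python
# Rough sketch: sparse MoE stores all experts but only runs top-k per token.
# Every number here is an assumption for illustration, not the real model.

def ffn_params(d_model, d_ff):
    # A plain two-matrix feed-forward block: up- and down-projection.
    return 2 * d_model * d_ff

d_model, d_ff = 4096, 14336   # assumed hidden/FFN dims
n_experts, top_k = 8, 2       # assumed: 8 experts, 2 routed per token

dense = ffn_params(d_model, d_ff)   # one FFN in a dense model
total = n_experts * dense           # all experts must sit in memory
active = top_k * dense              # only top-k experts run per token

print(f"stored FFN params per layer: {total / 1e6:.0f}M")
print(f"active FFN params per layer: {active / 1e6:.0f}M")
print(f"memory-to-compute ratio: {total / active:.1f}x")
```

Under these assumptions you pay for all eight experts in RAM while only two do work per token, which is why the model can feel closer to 13B in quality while costing far more to host.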
Good question. If you believe the results on the HuggingFace leaderboard (https://huggingface.co/spaces/HuggingFaceH4/open_llm_leaderb...), which I find very hard to navigate, Mistral was not even the best 7B model there, and there is huge variance as well. For comparisons I prefer to rely on benchmarks run by the same group of known individuals over time, as I think it's still too easy to game benchmark results - especially if you are just releasing something anonymously.
You are right - upon closer inspection, even models that were not previously Mistral finetunes are now using Mistral in their later versions. I wasn't aware of this before because I could not filter results in the leaderboard (it doesn't even load at all for me now).