
I think comparisons to base LLaMA are not so interesting, as almost no one is using those models. The most informative comparison is between Mistral 7B and 8x7B, provided in this picture: https://mistral.ai/images/news/mixtral-of-experts/open_model...

The key takeaway for me is that there is a decent improvement in all categories - about 10% on average, with a few outliers. However, the footprint of this model is much larger, so the performance bump ends up being underwhelming in my opinion. I would expect about the same improvement if they released a 13B version without the MoE. It may be too early to say definitively that MoE is not the whole secret sauce behind GPT-4, but at least with this implementation it does not seem to lift performance dramatically.
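To make the footprint point concrete, here is a rough sketch of a sparse mixture-of-experts feed-forward layer along the lines of what Mixtral reportedly uses (8 experts, top-2 routing per token). This is my own simplified PyTorch illustration, not Mistral's code: the expert is a plain two-layer MLP rather than their SwiGLU block, and the dimensions are placeholders. The point it shows is that all 8 experts sit in memory (hence the large footprint), while each token only passes through 2 of them (hence per-token compute closer to a ~13B dense model).

    import torch
    import torch.nn as nn
    import torch.nn.functional as F

    class MoEFeedForward(nn.Module):
        """Sparse MoE FFN: every expert is stored (big memory footprint),
        but each token is routed to only top_k of them (modest compute)."""

        def __init__(self, d_model=4096, d_ff=14336, n_experts=8, top_k=2):
            super().__init__()
            self.experts = nn.ModuleList(
                nn.Sequential(nn.Linear(d_model, d_ff), nn.SiLU(), nn.Linear(d_ff, d_model))
                for _ in range(n_experts)
            )
            self.gate = nn.Linear(d_model, n_experts, bias=False)  # router
            self.top_k = top_k

        def forward(self, x):                      # x: (n_tokens, d_model)
            logits = self.gate(x)                  # (n_tokens, n_experts)
            weights, idx = logits.topk(self.top_k, dim=-1)
            weights = F.softmax(weights, dim=-1)   # renormalise over the chosen experts
            out = torch.zeros_like(x)
            for k in range(self.top_k):
                for e, expert in enumerate(self.experts):
                    mask = idx[:, k] == e          # tokens whose k-th choice is expert e
                    if mask.any():
                        out[mask] += weights[mask, k].unsqueeze(-1) * expert(x[mask])
            return out

    layer = MoEFeedForward()
    y = layer(torch.randn(10, 4096))   # 10 tokens in, 10 tokens out

So the parameter count scales with the number of experts, but the per-token FLOPs only scale with top_k - which is why comparing it against a dense ~13B model is the fair comparison.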



How does it compare to existing 13b models on benchmarks?


Good question. If you believe the results on the HuggingFace leaderboard (https://huggingface.co/spaces/HuggingFaceH4/open_llm_leaderb...), which I find very hard to navigate, Mistral was not even the best 7B model there, and there is huge variance between models as well. I prefer to rely on benchmarks run by the same group of known individuals over time, as I think it's still too easy to game benchmark results - especially if you are just releasing something anonymously.


Most of the top 7B models on the leaderboard are finetuned Mistral 7B models.


You are right - upon closer inspection, even models that were not previously Mistral finetunes are now using Mistral in their later versions. I wasn't aware of this before because I could not filter results in the leaderboard (it doesn't even load at all for me now).



