
After reading a lot of that thread, my understanding is that YaRN scaling is intentionally disabled by default in the GGUFs, because it would degrade outputs for contexts that already fit in 32k. So the only change is enabling YaRN scaling at 4x, which is just a configuration setting. GGUF embeds these configuration settings in the file format for ease of use, but you should be able to override them without downloading an entire duplicate set of weights (12 to 35 GB!). (It looks like llama.cpp's override-kv option can be used for this, but I haven't tried it yet.)
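
For anyone curious, here's a rough, untested sketch of what that override might look like. The metadata key names assume the GGUF convention of {arch}.rope.scaling.* with a "qwen3" architecture prefix, which is a guess on my part; check what keys your model actually prints when it loads. llama.cpp also has dedicated flags (--rope-scaling yarn, --rope-scale, --yarn-orig-ctx) that cover the same ground without touching metadata directly.

  # Enable 4x YaRN scaling at load time instead of re-downloading a re-converted GGUF.
  # Key names and the "qwen3." prefix are assumptions; adjust to your model's arch.
  ./llama-cli -m model.gguf -c 131072 \
    --override-kv qwen3.rope.scaling.type=str:yarn \
    --override-kv qwen3.rope.scaling.factor=float:4.0 \
    --override-kv qwen3.rope.scaling.original_context_length=int:32768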


Oh, super interesting, I didn’t know you could override this with a flag in llama.cpp.



