Cool, well let me know when Opus 4.5 level performance is available locally, at speeds that serve everyday use, and 100% I'm right there with you.
Until then, I'm going to keep sending my JSON to the server farm in Virginia because it's the only place that can serve me a model that actually works for my uses.
They may well do, but in practice, if you want to embody the hacker spirit, the best thing is to hack rather than try to get some clearly inadequate local LLM to do it.
I experiment a lot with local models, and I agree.
I have a lot of fun with the local models and seeing what they can do.
I appreciate the SOTA models even more after my local experiments. The local models are really impressive these days, but the gap to SOTA is huge for complex tasks.
Reasoning over a large codebase is only one use case for large models. For the use cases in the article (summarizing, classifying, basic text rewrites) most phones can handle them just fine.
DeepSeek V4 with a 1 million token context window is pretty powerful, although still not there. There's hope that Opus 4.5 level performance locally is not that far away.
That is true, but it is a 1.6T-parameter model, so it requires a great deal of memory. I also heard there's a 2-bit quantization that works well on Apple Metal.
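For a rough sense of scale, here's a back-of-the-envelope sketch of weight memory at different quantization levels. It takes the 1.6T parameter figure above at face value and counts only the weights: KV cache, activations, and per-block quantization metadata are ignored, so real usage would be higher. The function name is just something I made up for illustration.

```python
def weight_memory_gb(params: float, bits_per_weight: float) -> float:
    """Approximate memory for the weights alone, in decimal gigabytes.

    Assumption: every parameter is stored at the same bit width, with no
    overhead for quantization scales, KV cache, or activations.
    """
    return params * bits_per_weight / 8 / 1e9


PARAMS = 1.6e12  # the 1.6T figure quoted in the thread, not a verified spec

for bits in (16, 8, 4, 2):
    print(f"{bits:>2}-bit: ~{weight_memory_gb(PARAMS, bits):,.0f} GB")
```

So under these assumptions, even at 2 bits per weight you're looking at roughly 400 GB just for the weights, before any context.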
Well, it depends on the task. For agentic coding, more is more, but for the tasks normal consumers use them for, there really is a ceiling. OCR, text-to-speech, that type of thing doesn't really improve when going to a SOTA model, so you'd just be wasting your money. I think local LLMs have more value than software engineers give them credit for.
Opus is probably somewhere in the 5T parameter range and needs terabytes of GPU memory.
The economics of running SOTA locally just do not make sense, because you're not using it 24/7 at 80%+ utilization while the cloud-based providers can.
Next year there will be Opus 4.5 level performance available in open source models, so theoretically you may be able to run it locally, but in reality it will be too expensive (e.g. maybe 2 x Mac Studio with 512GB RAM each) for “normal” users.
I don't even use Sonnet anymore. The current one feels worse than Claude 3.5 did a couple of years ago. Have they quantized it that much? Switched to GPT 5.5, let's see how long it will stay good.