Hacker News | DCKing's comments

The moat right now is model performance and what that means for how many tokens and additional time you spend.

I say this as a relatively frequent user of Kimi models and generally a big fan. But on not-yet-gamed benchmarks like DeepSWE, Kimi K2.6 is beaten soundly by Claude Sonnet 4.6 ($3 / $15) and even slightly by GPT 5.4 Mini ($0.75 / $4.50).

There's no question Kimi models are very good for a lot of code tasks. They're the best quality open weight model. But to get similar overall outcomes as on Sonnet/Opus, on average you'll spend many more tokens and will have to do more managing of the model. You shouldn't look at price per token, you should look at how much you pay for the entire process.
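A rough sketch of that whole-process arithmetic in Python. The Sonnet prices are the per-million-token figures quoted above; the second model's prices, all token counts, and the attempt counts are made-up placeholders, not measurements.

```python
from dataclasses import dataclass


@dataclass
class Run:
    """One task; prices are USD per million tokens."""
    input_price: float
    output_price: float
    input_tokens: int
    output_tokens: int
    attempts: int = 1  # how often you had to re-run or steer the model

    def cost(self) -> float:
        per_attempt = (
            self.input_tokens * self.input_price
            + self.output_tokens * self.output_price
        ) / 1_000_000
        return per_attempt * self.attempts


# Sonnet prices from the comment above; token counts are hypothetical.
sonnet = Run(3.00, 15.00, input_tokens=400_000, output_tokens=30_000)
# Hypothetical cheaper model: placeholder prices, more tokens, more attempts.
cheaper = Run(1.00, 4.00, input_tokens=700_000, output_tokens=60_000, attempts=2)

print(f"sonnet:  ${sonnet.cost():.2f}")
print(f"cheaper: ${cheaper.cost():.2f}")
```

With these invented numbers the model that is cheaper per token ends up costing more per finished task. Plug in your own token counts and retry rates before drawing any conclusion.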


I'm more interested in how much effort I have to put in, at least while I'm paying in the range of current subscriptions (so ~€100-€200 a month or so). If the prices go up much more than that I'll have to switch to caring more about token efficiency. But at current pricing the bottleneck is my attention, not model efficiency. As such, even a small improvement in model quality - and hence, a decrease in how much attention I have to spend on it - makes a big difference.


I personally don't put any weight on DeepSWE. Other than 5.5 being directionally the best model, it gets the others pretty wrong in my experience. FrontierCode from Cognition looks interesting.


I'm not sure I would put too much weight on DeepSWE as a benchmark, given that GPT-5.4-mini ended up close to Opus 4.6 there.


Any benchmark is iffy and has weird results, but this is the best we got at the moment. Most people working with Opus and Kimi would likely tell you they're much further apart than the numbers that were quoted for Kimi K2.6, and DeepSWE seems to capture that gap better.

One major thing DeepSWE has going for it that all other benchmarks (including those quoted by MoonshotAI on this page) don't: the other benchmarks are completely gamed. Their answers are public and part of each model's training data. This benchmark may still be iffy, but at least it's not gamed.


Somehow the internet has also forgotten that cheating to get ahead in China is basically a norm and expected behavior.


American labs also use gamed and cherry-picked benchmarks extensively. Anthropic used them in their Fable announcement and avoided DeepSWE because it doesn't beat GPT-5.5 in that one. Google's numbers for Gemini 3.5 Flash recently did not at all line up with people's subjective experience using these models, and this also happened with Gemini 3.1 Pro before it.

Everybody has incentives to manipulate benchmark results to show their models in the best light.


It's for predictability in upgrades. Homebrew allows you to separate system packages (from apt or dnf) from user packages (from homebrew) [1]. Running apt upgrade or dnf upgrade can render your system unbootable if you're unlucky (or unstable or degraded if you're less unlucky). Running brew upgrade can at worst break some of your own user's setup or tools.

Since everybody runs their own unique permutation of apt or dnf packages, adding as little as possible will keep you as close as possible to what distro maintainers test. There are even OSes like Fedora Silverblue or Bluefin or SteamOS that ship with a fully baked _image_ - where installing system level packages is strongly discouraged - which helps ensure predictability and stable upgradeability.

Homebrew packages also tend to be more recent (this depends on your distro of course) and don't require elevated permissions to install.

[1]: Other unprivileged package managers like Mise or Nix do the same of course


There will be more of this going forward, I think. Systemd is really not just an init system, it's a full cohesive management system for Linux distros and they've never pretended otherwise. A modular one but still a comprehensive one. Because of that its mere existence is an affront to many people with traditional opinions on Linux and Unix.

systemd-appd sounds like it could make some inroads in the threat model that Windows and Linux still have in 2026 (and macOS is still reeling from): anything that runs as my user, can access anything running as _my_ user. I don't think this threat model was tenable in 2016, much less in 2026. But moving away from that also breaks with the Unix tradition.

Systemd as the system management layer is becoming a centerpoint for moving Linux forward, on servers but especially so on the desktop, and it does so at the cost of breaking with traditional views. It's kind of hard to watch: I want Linux to move forward, and there's just a lot of good ideas there. But it will be painful for a large Linux community to break with traditions.


> systemd-appd sounds like it could make some inroads in the threat model that Windows and Linux still have in 2026 (and macOS is still reeling from): anything that runs as my user, can access anything running as _my_ user. I don't think this threat model was tenable in 2016, much less in 2026.

Are we talking about open source operating systems here or app delivery mechanisms?


> Systemd is not an init system: it's a full cohesive management system for Linux distros.

Exactly. If you look back at the old discussions, you see how people tried to claim systemd is merely an init system, but it never was. So all comparisons to e.g. sysinit and whatnot were unfair. Dishonest. The systemd devs were not interested in fair discussions. They wanted more control. And they very ruthlessly went forward with it - also thanks to corporate support. Just look at Poettering censoring discussions and stopping them whenever he could.

> But moving away from that also breaks with the Unix tradition.

Systemd never cared about UNIX. Poettering does not even understand UNIX on top of that.

> Systemd as the system management layer is becoming a centerpoint for moving Linux forward

Forward to ...? I don't really see it as moving "forward". I see it as more top-down control singularized into one crew that manages the software here.

> on servers but especially so on the desktop, and it does so at the cost of breaking with traditional views

Well, I would not call it "traditional", as the name is loaded. I see it more as a way to gain more control over the whole ecosystem. We see the same happen with wayland, but on a smaller scale, as wayland does not try to integrate a billion features and functionality.

> It's kind of hard to watch: I want Linux to move forward, and there's just a lot of good ideas there. But it will be painful for a large Linux community to break with traditions.

I don't like systemd, but I view this more realistically. I saw how the non-systemd distributions struggled and eventually most went extinct or were converted to systemd. Only a few remain strong, and those few are often also dead - like Slackware. And yes I know the spin-offs, but seriously, Slackware is a dead man walking. Void is not dead, but yikes, it's not moving forward either.

It is not only systemd though. The whole linux stack got a lot bigger and more complicated. Nowadays you often need python, meson, llvm, mesa and so forth to compile things. Everything got bigger too. A lot of software was abandoned downstream, such as fluxbox - may be irrelevant to most folks, but this is one example of sooo many more. At the base of this problem sits the funding issue. Corporations have a lot more net-control over the ecosystem nowadays. Due to the funding. I think we need to solve this issue of funding, because otherwise we'll end up with systemd-like projects sitting at the key areas.


If two things hold up - 1) this is actually a 200-300B parameter model and 2) this is actually competitive with frontier OpenAI and Anthropic models (and not just benchmaxing) - the implications are pretty big. It would mean you could run "frontier level" performance in one box at home.

300B models at least fit in a single maxed out Mac Studio or a small stack of DGX Sparks or AMD Strix Halo boxes.

For comparison, DeepSeek V4 Flash is all the rage now for small efficient models. It's very good for its size but far from the performance of the latest GPT Pro and Opus models. The vanilla variant has 284B parameters. It fits on both 256GB and 512GB Mac Studios and hits about 20-30 tokens/second.
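As a back-of-the-envelope check on what "fits", here is a small Python sketch of the weight footprint alone. It ignores KV cache, activations and OS overhead, and the bit widths are assumptions: the comment doesn't say which quantization was used.

```python
def weight_gb(params_billion: float, bits_per_weight: float) -> float:
    """Approximate size of the weights alone, in decimal gigabytes."""
    # params_billion * 1e9 weights * bits / 8 bytes, divided by 1e9 bytes per GB
    return params_billion * bits_per_weight / 8


for bits in (16, 8, 4):
    print(f"284B @ {bits}-bit: {weight_gb(284, bits):.0f} GB")
```

By this estimate a 284B model needs roughly 4-bit weights to fit in 256GB and 8-bit or lower to fit in 512GB, before accounting for anything else that has to live in memory.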

The implication of all this is that you could have a (somewhat sluggish) Opus in a small box at home. At least once competing models and the hardware to run them are available (high end Mac Studios have been discontinued).

Something tells me that this means that Google's performance numbers here are inflated.


Opus is estimated to be around 4T parameters, and 5.5 around 9T. [1] And while 3.5 at least qualifies to be in the same neighborhood, which is stunning if these numbers are all true, it may be that closing that last ~10% difference needs 50x more parameters.

[1]: https://arxiv.org/pdf/2604.24827


Note that this paper is vibe-coded and overestimating due to incorrect analysis, though "the core idea behind the paper is largely sound".

https://x.com/justanotherlaw/status/2050399317782155726 https://www.lesswrong.com/posts/veFMEzDDyWaer2Sms/sanity-che...


Their methods are only calibrated on open models (of course) and they admit very broad confidence bounds. You can also just see from comparing their estimates of the same models at different reasoning levels that there are major confounders to this. I would err on the absolute lowest side of their estimates for frontier models (e.g. 3T for GPT-5.5, 1.5-2T for Opus 4.5+).


> the implications are pretty big. It would mean you could run "frontier level" performance in one box at home.

That wouldn't surprise me at all, actually: models like Qwen3.6-35B are comparable to frontier level models from a year ago and I wouldn't be surprised if we had self-hostable open weight models matching Opus 4.7 in a year. Assuming that Google has a one-year lead over Chinese labs isn't far-fetched given how many resources it has compared to its Chinese competitors.


I think there was a leap around Opus 4/4.1 that hasn't quite been equalled by self hostable models yet. Perhaps full Kimi K2.6 and Deepseek V4 Pro can achieve Opus 4.1 levels (it's hard to compare anyway, benchmarks are largely a game nowadays), but both of these are also north of 1000B parameters and therefore really impractical to run at home for the foreseeable future.

It's not yet obvious to me that you can achieve the breakthrough performance of say Opus 4.1/4.5 in a number of parameters you can swing at home.


> It's not yet obvious to me that you can achieve the breakthrough performance of say Opus 4.1/4.5 in a number of parameters you can swing at home.

People used to believe the same about GPT-4, and I'm not convinced this is going to be different this time.

You do need a very big model if you want something that remembers random trivia about everything, but I'm not convinced this is needed to do meaningful work.


> 300B models at least fit in a single maxed out Mac Studio or a small stack of DGX Sparks or AMD Strix Halo boxes.

I run a 2.54 BPW GGUF of the 397B Qwen 3.5 on a 128GB Mac Studio at 20 tokens/second generation and 200 tokens/second prompt processing. I'm not suggesting it matches the performance of the full BF16 model, but I did run some benchmarks locally and the results were pretty good:

- MMLU: 87.96%

- GPQA diamond: 86.36%

- IfEval: 91.13%

- GSM8k: 92.57%

So I think we have been at the "frontier capabilities at home" for a few months now.
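To put those two speeds in perspective, a minimal Python sketch of per-request wall-clock time. The 200 and 20 tokens/second defaults are the figures quoted above; the request sizes are hypothetical, and the sketch assumes both rates stay constant as the context grows, which is optimistic.

```python
def response_seconds(prompt_tokens: int, output_tokens: int,
                     prefill_tps: float = 200.0, decode_tps: float = 20.0) -> float:
    """Time to process the prompt plus time to generate the answer."""
    return prompt_tokens / prefill_tps + output_tokens / decode_tps


# Hypothetical request sizes, not measurements.
print(response_seconds(1_000, 500))     # short chat turn
print(response_seconds(50_000, 2_000))  # coding step with a large context
```

Under these assumptions a short chat turn takes about half a minute, while a step with a 50,000-token context takes several minutes, most of it spent reading the prompt.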


Since I started using Qwen-3.6 35B A3B, I believe frontier-like capability will be more than enough in these smaller models within a year or two, at least for coding. They don't need to memorize facts into their weights, which likely has very interesting implications that I'm not going to speculatively decode.


TurboQuant. They can fit more in less now


TurboQuant is a runtime optimization for a model's KV cache and doesn't allow for reduction in model size.


TurboQuant reduces the runtime memory needed for the model's KV cache.

This reduces both the memory bandwidth needed for inference (at the cost of slightly increasing the amount of compute needed), and the amount of VRAM used overall, meaning more VRAM can be allocated for more weights on the same hardware.

You were replying to a comment estimating model params from hardware. I am saying the param count could be higher for the same hardware.
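A sketch of that trade-off in Python, using the usual KV cache size estimate for standard attention (keys plus values, per layer, per KV head, per token). The model shape and the 4-bit figure are placeholders: they are not the dimensions of any model named here, nor a claim about the compression TurboQuant actually achieves.

```python
def kv_cache_gb(layers: int, kv_heads: int, head_dim: int,
                context_tokens: int, bits_per_value: float) -> float:
    """Keys and values for every layer and token, in decimal gigabytes."""
    values = 2 * layers * kv_heads * head_dim * context_tokens  # 2 = K and V
    return values * bits_per_value / 8 / 1e9


# Hypothetical model shape, chosen only for illustration.
shape = dict(layers=60, kv_heads=8, head_dim=128, context_tokens=128_000)

full = kv_cache_gb(**shape, bits_per_value=16)
compressed = kv_cache_gb(**shape, bits_per_value=4)
print(f"16-bit cache: {full:.1f} GB, 4-bit cache: {compressed:.1f} GB")
print(f"freed for weights: {full - compressed:.1f} GB")
```

Whatever the real numbers are, memory saved on the cache at a given context length is memory that can hold more weights on the same box.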


Deepseek V4 came out three weeks ago: https://huggingface.co/deepseek-ai/DeepSeek-V4-Pro

Kimi K2.5 has also been superseded by a finer tuned Kimi K2.6 three weeks ago. Moonshot's Kimi models appear to be the favored Chinese model, at least for coding, and not Deepseek V4. z.AI's GLM 5.1 is also worth mentioning as rather competent for coding, also released in April.

Those models too will not be beating US AI labs by your metrics (although for coding, Kimi K2.6 might beat the very uneven Gemini depending on the situation), but in your criticism at least consider the state of the art in your comparisons.


I have been using Deepseek v4 pro for personal projects and home infra related work for the last couple of weeks. Its quality of work is not bad at all, it is fairly fast, and given the fraction of the cost compared to Claude, I can keep going, which makes it a very compelling option. Looking forward to trying out Kimi 2.6, thanks for the recommendation.


Also they have a pretty big token discount running this month: https://api-docs.deepseek.com/quick_start/pricing/

Even without the discount, I'll have to think about whether I need the 100 EUR tier of Anthropic Max, or whether downgrading to Pro and using DeepSeek is good enough. And they're also up on OpenRouter and other places.

Been using those models, not quite comparable with Opus 4.6/4.7 but with max reasoning, pretty good for a variety of dev tasks! Only big problem is no ability to process images, so can't really do browser use for some semi-automated testing, I'd have to write Playwright tests even when I don't want to.


I've been using OpenCode Go ($10/month) for personal projects (I have Claude subscription for $DAYJOB) and for the tinkering around that I do for myself the quality of the open weight models and the limits of the OpenCode plan are sufficient. I agree that for a lot of dev tasks they're quite good!


I've been using Deepseek 4 Pro (instead of Sonnet 4.6) as the developer LLM (Opus is the planner) and it's been great. Not super fast, with all the reasoning, but has been writing good code, and I think I paid $5 so far (whereas with Sonnet I'd have run out of the weekly limits on Max for weeks now).

Definitely recommended, though it's crucial that you have GPT 5.5 review the code afterwards.


> Why should Apple have done this?

For money, probably.

Apple is presumably leaving a lot of money on the table by not trying to sell Apple Silicon for AI inference and training. They're the only ones who can attach reasonably large GPUs (M3 Ultra) to very large amounts of cheaper memory (512GB of unified memory per GPU). Apple could e.g. sell server SKUs of Mac Studios, heck they can sell M3 Ultra chips on PCIe cards. And they could further develop Apple Silicon in that direction. Presumably they would be seen as a very legit competitor to Nvidia that way, perhaps more so than Intel and AMD. I'd assume that in the current climate this would be extremely lucrative.

Now, actually doing this would disrupt Apple's own supply chain as well as force it to spend significant internal resources and undergo cultural change for this kind of product line. There's a good argument to be made it would disproportionately negatively affect its Mac business, so this would be a very risky move.

But given that AI hardware is likely much higher margin than the Mac business an argument could probably (sadly) be made that it'd be lucrative for them to try it. I personally don't think Apple is inclined to take this kind of risk to jeopardize the Mac, but I'm sure some people at Apple have considered this.


I guess I mean that for Apple to remain Apple, they would not do this, due to company culture.


systemd nowadays has a lot of sandboxing built in [0]! You can achieve jails using just systemd and no separate container manager.

[0]: https://wiki.archlinux.org/title/Systemd/Sandboxing
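A minimal sketch of what that can look like, in Python: it builds a `systemd-run` command line that starts a program as a transient service with a handful of sandboxing properties. The property names are real systemd directives; the particular selection and the helper name `sandboxed_argv` are just illustrative, and the script only prints the command rather than running it.

```python
import shlex


def sandboxed_argv(command: list[str]) -> list[str]:
    """Wrap a command in a transient, sandboxed systemd service."""
    properties = [
        "ProtectSystem=strict",  # mount almost the whole file system read-only
        "ProtectHome=yes",       # make /home, /root and /run/user inaccessible
        "PrivateTmp=yes",        # private /tmp and /var/tmp
        "PrivateNetwork=yes",    # only a loopback interface
        "NoNewPrivileges=yes",   # no privilege gain via setuid binaries etc.
    ]
    argv = ["systemd-run", "--wait", "--pipe"]
    for prop in properties:
        argv += ["-p", prop]
    return argv + command


argv = sandboxed_argv(["ls", "/home"])
print(shlex.join(argv))
# To actually run it (typically as root): subprocess.run(argv, check=True)
```

The same properties can go in the `[Service]` section of a regular unit file; the wiki page linked above lists many more.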


> they should really look at kernel CVE database

When quoting kernel CVEs as evidence of insecurity, especially so seemingly authoritatively, please make sure you're informed about what Linux kernel CVEs mean.

A CVE (for any product) does not automatically mean there is actually a vulnerability there, or that one is exploitable, unless explicitly noted (in the CVE or credibly by someone else). Proofs of concept, reproducibility or even any kind of verification are not a part of the CVE process.

For the Linux kernel in particular, the CVE process is explicitly meant to be "overly cautious" [1]. In practice, this means the Linux security team requests a CVE for anything that has a mere whiff of being theoretically exploitable. Of course that doesn't mean the bug that was fixed was actually exploitable, even theoretically, let alone in practice.

As a result, you can't use CVEs reported by the Linux kernel to make claims about the (lack of) practical security of any Linux system, including your desktop. The CVEs reported by the Linux kernel are there to prompt very well informed users of the kernel to do further risk assessments, not to be taken at face value as a sign of insecurity. [The latter is true for the entire CVE system - they're not to be taken at face value as signs something is wrong. But it's especially true for the kernel.]

[1]: https://docs.kernel.org/process/cve.html#process


This is a common complaint with the whole CVE process to begin with, and isn't even a Linux thing.


You're right. I review each one carefully, so here I mean only the real ones. It's still a massive number of vulnerabilities, even after excluding obscure drivers or features that aren't used on headless systems.


The opsec reason I use Safari as a work browser today is that Safari has a much more blunt tool to disrupt cookie stealers: Safari and macOS do not permit (silent) access to Safari's local storage to user level processes. If malware attempts to access Safari, its access is either denied or the user gets presented a popup to grant access.

I wish other browsers implemented this kind of self protection, but I suppose that is difficult to do for third party browsers. This seems like a great improvement as well, but it seems this is quite overengineered to work around security limitations of desktop operating systems.


Seems like a very weak mitigation, if this is to protect against malware running in your user session, alongside your browser. Can't it already do all kinds of nefarious keylogging/screen recording/network tracing/config file editing enabling impersonation and so on?

I mean, if my threat model starts with "I have a mal/spyware running alongside my browser with access to all my local files", I would pretty much call it game over.


> I mean, if my threat model starts with "I have a mal/spyware running alongside my browser with access to all my local files", I would pretty much call it game over.

This is a big problem I have with desktop security - people just give up when faced with something so trivial as user privileged malware. I consider it a huge flaw in desktop security that user privilege malware can get away with so many things.

macOS is really the only desktop OS that doesn't just give up when faced with same user privileged malware (in good and bad ways). So there it's likely a good mitigation - macOS also doesn't permit same user privileged processes to silently key log, screen record, network trace and various other things that are possible on Windows and common Linux configurations.


Yeah, I'm siding with the sceptics on this one. Adding more layers of indirection against malware running under a user session seems like a good idea in general, but in practice, you showed how ineffective the macOS approach is: under this model, every application is left to defend itself in an ad-hoc and specific manner. That doesn't generalise well: you can't expect every software, tool, widget, … vendor to be held to the same level of security as Apple.

Another approach is to police everything behind rules (the way selinux or others do), which is even better in theory. In practice, you waste a ton of time bending those policies to your specific needs. A typical user won't take that.

Then there is the flatpak+portal isolation model, which is probably the most pragmatic, but not without its own compromises and limitations.

The attitude of trusting by default, and chrooting/jailing in case of doubt, probably still has decades to live.


> under this model, every application is left to defend itself in an ad-hoc and specific manner.

This description of the macOS model doesn't really apply so I'm not sure if I'm misunderstanding you or you're misunderstanding the model.

> Another approach is to police everything behind rules (the way selinux or others do), which is even better in theory. In practice, you waste a ton of time bending those policies to your specific needs. A typical user won't take that.

While SELinux could probably provide this kind of data protection on Linux, the method of technical enforcement is only one part. There's a lot of UI involved to get right, and that will require far more effort.

> Then there is the flatpak+portal isolation model, which is probably the most pragmatic, but not without its own compromises and limitations.

That model doesn't really apply here. Flatpak et al allow applications to self confine in order to protect the other things the user is doing. What I'm talking about is for an app to have some protections of its own data from the other things the user is doing. I'm not talking about sandboxing; this is data protection.


>> under this model, every application is left to defend itself in an ad-hoc and specific manner.

> This description of the macOS model doesn't really apply so I'm not sure if I'm misunderstanding you or you're misunderstanding the model.

I admit I might be misunderstanding, since, again, I don't use macOS. But from your description:

>>> Safari and macOS do not permit (silent) access to Safari's local storage to user level processes. If malware attempts to access Safari, its access is either denied or the user gets presented a popup to grant access.

it sounds like Safari detects that a foreign application is trying to read its data, warns the user and lets them call the shots on that. I don't see how that isn't very specific to Safari and to one specific type of mitigation. Unless the same prompt shows up for every program trying to access every other one's configuration? Then I suppose we hit the usability nightmare I'm on about, with utilities like ncdu, borg and others just unable to do their job.

> While SELinux could probably provide this kind of data protection on Linux, the method of technical enforcement is only one part. There's a lot of UI involved to get right, and that will require far more effort.

My experience with SELinux was not that of a problematic UI or ecosystem of utilities around it, but more one of incurred fatigue working against rules: once you've hit your tenth AVC denial trying to get something to run, you might as well want to disable SELinux altogether. Or maybe that's what you call UI? Either way, I don't think there is a viable "fix" for it.

>> Then there is the flatpak+portal isolation model

> That model doesn't really apply here.

I mean, I was merely stating facts about what's existing out there. Anyhow

> What I'm talking about is for an app to have some protections of its own data

This isolates applications and their data from one another, in that aspect they are relatable.


On macOS, basically all of these are extra permissions that you have to grant to an application - you'll get prompted with a popup when they try to do it.

E.g. local network access, access to the documents and desktop folder, screen recording, microphone access, accessibility access (for keylogging), and full disk access all require you to grant permission.


Strix Halo is impressive, but it isn't AMD going all out on the concept. Strix Halo's die area (300mm2 ish) is roughly the same as estimates for Apple's M3 Pro die area. The M3 Max and M3 Ultra are twice or four times the size.

In a next iteration AMD could look into doubling or quadrupling the memory channels and GPU die area, as Apple has done. AMD is already a pioneer in the chiplet technology Apple is also using to scale up. So there's lots of room to grow for even higher costs.

