More

homebrewer · 2026-05-14T15:05:05 1778771105

In poor countries like mine (and looks like GP's too), IT positions are very limited indeed. Nevertheless, it has been one of the very few sectors open to nobodies, helping us to pull ourselves out of poverty, open to those who weren't born to the right family with the right connections, or to a sugar daddy who can cover the first 25 years of our lives to go get a good education in Europe or the US.

Looks like it's being slowly taken away from us to make a few billionaires into proper trillionaires. Can't see this ending well for humanity.

And the common advice you hear on this site ("just migrate to country X") doesn't really apply to most of us. Even if you can name many examples of people doing just that, you're seeing a very narrow slice of the population; I can find many more counterexamples for each one of them.

Your weak passport won't impress anybody, almost all of the world is closed to you, you can't travel anywhere (forget migrate) without going through a lengthy and expensive process where you're treated with suspicion, and can be denied with no compensation, on every step of the way. I'm still talking about traveling here; finding work is much more difficult.

So it's really hard to move anywhere decent if you're not at the top of your profession, which in large part depends on your innate abilities, not just how many hours you put in.

I've become jaded and extremely cynical; if worst comes to worst, there's always one universal way out, which is what keeps me going for now.

homebrewer · 2026-05-08T17:21:39 1778260899

Lots of supposedly technically advanced users switched to Chrome en masse and promoted it on every occasion they could, because it was so much faster, simpler, safer, etc etc. Don't excuse useful idiots from their share of the blame. People warned about dangers of Chrome's growing domination for about as long as I can remember, back to at least 2012, only to be dismissed as paranoid.

homebrewer · 2026-05-08T07:16:36 1778224596

IT is (was?) one of the very few ways for us in third-world countries to pull ourselves out of poverty by our own bootstraps, since social mobility is quite limited if you lack the right connections. I'm pleased with you being so happy about it being taken away to make more money for billionaires.

homebrewer · 2026-05-08T07:06:34 1778223994

Has everyone here already forgotten about the WireGuard tire fire?

https://lwn.net/Articles/850098

https://news.ycombinator.com/item?id=26507507

tl;dr: deeply insecure WireGuard implementation committed directly into the FreeBSD kernel with zero review.

Was this process problem fixed?

wolvoleo · 2026-05-09T22:16:34 1778364994

Ars also has a great writeup on this: https://arstechnica.com/gadgets/2021/03/buffer-overruns-lice...

But yeah as I understand lessons were learned.

homebrewer · 2026-04-30T19:22:55 1777576975

Previously:

https://news.ycombinator.com/item?id=47941590

homebrewer · 2026-04-27T18:30:57 1777314657

Go ahead. We've been self-hosting Gitea with Drone/Woodpecker for years; either it or Forgejo will do fine if you're okay with their feature set. I sometimes wander into these GitHub threads to have a laugh; our Gitea instance has had several minutes of downtime combined over the last few years, all of them planned (to upgrade Gitea) and in the middle of the night.

lioeters · 2026-04-27T20:30:22 1777321822

Ooh, Woodpecker CI works with Gitea and Forgejo. https://woodpecker-ci.org/ That might be last piece I needed to migrating Git repos from GitHub to a self-hosted forge.

Edit: Actually there's Gitea Actions and Forgejo Actions, that might be enough for my use case.

https://docs.gitea.com/usage/actions/

https://forgejo.org/docs/next/user/actions/reference/

dymk · 2026-04-28T02:23:27 1777343007

I’ve found gitea actions (based on ACT, so it’s nearly identical to a GitHub action runner) to work great. Migrating a GitHub workflow is mostly just a file name change.

lioeters · 2026-04-28T02:49:40 1777344580

Good to know!

MiracleRabbit · 2026-04-27T18:53:41 1777316021

Gitea Upgrading.. replacing binary, restarting. I love it.

Same for Forgejo.

scottyah · 2026-04-27T20:01:58 1777320118

I struggled with Woodpecker for a bit, but now gitea has Actions that work wonderfully for my use case (and one less tool to support). I believe they also highlight compatibility with a github action protocol of sorts. Might be worth looking into.

homebrewer · 2026-04-25T10:36:12 1777113372

It is opt-in. The amount of FUD in these threads is unbelievable, both against Mozilla, Brave, or anything else really.

homebrewer · 2026-04-16T19:31:05 1776367865

I've been using gemma4 for translating Mongolian to English. It runs circles around Google Translate for that language pair, it's not even close.

homebrewer · 2026-04-16T14:12:07 1776348727

Already quantized/converted into a sane format by Unsloth:

https://huggingface.co/unsloth/Qwen3.6-35B-A3B-GGUF

Aurornis · 2026-04-16T15:32:41 1776353561

Unsloth is great for uploading quants quickly to experiment with, but everyone should know that they almost always revise their quants after testing.

If you download the release day quants with a tool that doesn’t automatically check HF for new versions you should check back again in a week to look for updated versions.

Some times the launch day quantizations have major problems which leads to early adopters dismissing useful models. You have to wait for everyone to test and fix bugs before giving a model a real evaluation.

danielhanchen · 2026-04-16T15:46:54 1776354414

We re-uploaded Gemma4 4 times - 3 times were due to 20 llama.cpp bug fixes, which we helped solve some as well. The 4th is an official Gemma chat template improvement from Google themselves, so these are out of our hands. All providers had to re-fix their uploads, so not just us.

For MiniMax 2.7 - there were NaNs, but it wasn't just ours - all quant providers had it - we identified 38% of bartowski's had NaNs. Ours was 22%. We identified a fix, and have already fixed ours see https://www.reddit.com/r/LocalLLaMA/comments/1slk4di/minimax.... Bartowski has not, but is working on it. We share our investigations always.

For Qwen3.5 - we shared our 7TB research artifacts showing which layers not to quantize - all provider's quants were not optimal, not broken - ssm_out and ssm_* tensors were the issue - we're now the best in terms of KLD and disk space - see https://www.reddit.com/r/LocalLLaMA/comments/1rgel19/new_qwe...

On other fixes, we also fixed bugs in many OSS models like Gemma 1, Gemma 3, Llama chat template fixes, Mistral, and many more.

It might seem these issues are due to us, but it's because we publicize them and tell people to update. 95% of them are not related to us, but as good open source stewards, we should update everyone.

evilduck · 2026-04-16T16:36:13 1776357373

I just wanted to express gratitude to you guys, you do great work. However, it is a little annoying to have to redownload big models though and keeping up with the AI news and community sentiment is a full time job. I wish there was some mechanism somewhere (on your site or Huggingface or something) for displaying feedback or confidence in a model being "ready for general use" before kicking off 100+ GB model downloads.

danielhanchen · 2026-04-16T16:55:42 1776358542

Hey thanks - yes agreed - for now we do:

1. Split metadata into shard 0 for huge models so 10B is for chat template fixes - however sometimes fixes cause a recalculation of the imatrix, which means all quants have to be re-made

2. Add HF discussion posts on each model talking about what changed, and on our Reddit and Twitter

3. Hugging Face XET now has de-duplication downloading of shards, so generally redownloading 100GB models again should be much faster - it chunks 100GB into small chunks and hashes them, and only downloads the shards which have changed

ssrshh · 2026-04-17T01:37:35 1776389855

If you would know - is this also why LM Studio and Ollama model downloads often fail with a signature mismatch error?

danielhanchen · 2026-04-17T08:25:24 1776414324

Probably yes

evilduck · 2026-04-16T19:40:56 1776368456

Ah thanks, I wasn't aware of #3, that should be a huge boon.

danielhanchen · 2026-04-17T08:25:55 1776414355

Oh yes! This only applies if one uses hf download / snapshot_download - other normal download methods sadly won't have XET

CamperBob2 · 2026-04-16T17:42:51 1776361371

Best policy is to just wait a couple of weeks after a major model is released. It's frustrating to have to re-download tens or hundreds of GB every few days, but the quant producers have no choice but to release early and often if they want to maintain their reputation.

Ideally the labs releasing the open models would work with Unsloth and the llama.cpp maintainers in advance to work out the bugs up front. That does sometimes happen, but not always.

danielhanchen · 2026-04-16T18:04:29 1776362669

Yep agreed at least 1 week is a good idea :)

We do get early access to nearly all models, and we do find the most pressing issues sometimes. But sadly some issues are really hard to find and diagnose :(

sowbug · 2026-04-16T15:57:20 1776355040

Please publish sha256sums of the merged GGUFs in the model descriptions. Otherwise it's hard to tell if the version we have is the latest.

danielhanchen · 2026-04-16T16:01:22 1776355282

Yep we can do that probs add a table - in general be post in discussions of model pages - for eg https://huggingface.co/unsloth/MiniMax-M2.7-GGUF/discussions...

HF also provides SHA256 for eg https://huggingface.co/unsloth/MiniMax-M2.7-GGUF/blob/main/U... is 92986e39a0c0b5f12c2c9b6a811dad59e3317caaf1b7ad5c7f0d7d12abc4a6e8

But agreed it's probs better to place them in a table

sowbug · 2026-04-16T16:12:15 1776355935

Thanks! I know about HF's chunk checksums, but HF doesn't publish (or possibly even know) the merged checksums.

danielhanchen · 2026-04-16T16:56:42 1776358602

Oh for multi files? Hmm ok let me check that out

zargon · 2026-04-16T17:56:07 1776362167

Why do you merge the GGUFs? The 50 GB files are more manageable (IMO) and you can verify checksums as you say.

sowbug · 2026-04-16T19:19:54 1776367194

I admit it's a habit that's probably weeks out of date. Earlier engines barfed on split GGUFs, but support is a lot better now. Frontends didn't always infer the model name correctly from the first chunk's filename, but once llama.cpp added the models.ini feature, that objection went away.

The purist in me feels the 50GB chunks are a temporary artifact of Hugging Face's uploading requirements, and the authoritative model file should be the merged one. I am unable to articulate any practical reason why this matters.

dist-epoch · 2026-04-16T16:18:37 1776356317

What do you think about creating a tool which can just patch the template embedded in the .gguf file instead of forcing a re-download? The whole file hash can be checked afterwards.

danielhanchen · 2026-04-16T16:58:05 1776358685

Sadly it's not always chat template fixes :( But yes we now split the first shard as pure metadata (10MB) for huge models - these include the chat template etc - so you only need to download that.

For serious fixes, sadly we have to re-compute imatrix since the activation patterns have changed - this sadly makes the entire quant change a lot, hence you have to re-download :(

magicalhippo · 2026-04-16T19:31:38 1776367898

Appreciate the work of your team very much.

Though chat templates seem like they need a better solution. So many issues, seems quite fragile.

danielhanchen · 2026-04-17T08:27:03 1776414423

Thank you! Agreed on chat template issue

solomatov · 2026-04-16T22:58:06 1776380286

Just curious, the fixes are not about weights but about templates, am I right?

danielhanchen · 2026-04-17T08:27:25 1776414445

Yes so chat templates and the actual implementations

embedding-shape · 2026-04-16T15:46:02 1776354362

Not to mention that almost every model release has some (at least) minor issue in the prompt template and/or the runtime itself, so even if they (not talking unsloth specifically, in general) claim "Day 0 support", do pay extra attention to actual quality as it takes a week or two before issues been hammered out.

danielhanchen · 2026-04-16T15:50:00 1776354600

Yes this is fair - we try our best to communicate issues - I think we're mostly the only ones doing the communication that model A or B has been fixed etc.

We try our best as model distributors to fix them on day 0 or 1, but 95% of issues aren't our issues - as you mentioned it's the chat template or runtime etc

alfiedotwtf · 2026-04-17T14:38:30 1776436710

I have to ask - what do you run locally on your laptop (model, backend, and agentic cli)?

Feature request:

A leader board with filtering so you can enter your machine specs and it will sort all models along with all the various quantisation and then rank them all - because so far model ranking site either don’t include all available quants, don’t compare apples to apples (ie was one model tested with Claude code while another benchmark done with opencode) etc

Oh - and as bonus, scoring also ranked by which agentic CLI :)

fuddle · 2026-04-16T16:56:52 1776358612

I don't understand why the open source model providers don't also publish the quantized version?

danielhanchen · 2026-04-16T16:58:39 1776358719

They sometimes do! Qwen, Google etc do them!

i5heu · 2026-04-16T19:39:09 1776368349

Thank you very much for this comment! I was not aware of that.

torginus · 2026-04-16T17:48:58 1776361738

Why doesn't Qwen itself release the quantized model? My impression is that quantization is a highly nontrivial process that can degrade the model in non-obvious ways, thus its best handled by people who actually built the model, otherwise the results might be disappointing.

Users of the quantized model might be even made to think that the model sucks because the quantized version does.

bityard · 2026-04-16T18:23:34 1776363814

Model developers release open-weight models for all sorts of reasons, but the most common reason is to share their work with the greater AI research community. Sure, they might allow or even encourage personal and commercial use of the model, but they don't necessarily want to be responsible for end-user support.

An imperfect analogy might be the Linux kernel. Linus publishes official releases as a tagged source tree but most people who use Linux run a kernel that has been tweaked, built, and packaged by someone else.

That said, models often DO come from the factory in multiple quants. Here's the FP8 quant for Qwen3.6 for example: https://huggingface.co/Qwen/Qwen3.6-35B-A3B-FP8

Unsloth and other organizations produce a wider variety of quants than upstream to fit a wider variety of hardware, and so end users can make their own size/quality trade-offs as needed.

halJordan · 2026-04-16T18:44:18 1776365058

Quantization is an extraordinarily trivial process. Especially if you're doing it with llama.cpp (which unsloth obviously does).

Qwen did release an fp8 version, which is a quantized version.

sander1095 · 2026-04-16T15:58:13 1776355093

I sense that I don't really understand enough of your comment to know why this is important. I hope you can explain some things to me:

- Why is Qwen's default "quantization" setup "bad" - Who is Unsloth? - Why is his format better? What gains does a better format give? What are the downsides of a bad format? - What is quantization? Granted, I can look up this myself, but I thought I'd ask for the full picture for other readers.

danielhanchen · 2026-04-16T16:12:42 1776355962

Oh hey - we're actually the 4th largest distributor of OSS AI models in GB downloads - see https://huggingface.co/unsloth

https://unsloth.ai/docs/basics/unsloth-dynamic-2.0-ggufs is what might be helpful. You might have heard 1bit dynamic DeepSeek quants (we did that) - not all layers can be 1bit - important ones are in 8bit or 16bit, and we show it still works well.

dist-epoch · 2026-04-16T16:20:49 1776356449

The default Qwen "quantization" is not "bad", it's "large".

Unsloth releases lower-quality versions of the model (Qwen in this case). Think about taking a 95% quality JPEG and converting it to a 40% quality JPEG.

Models are quantized to lower quality/size so they can run on cheaper/consumer GPUs.

danielhanchen · 2026-04-17T08:27:49 1776414469

Love the JPEG analogy :)

est · 2026-04-16T16:30:51 1776357051

hey you can do a bit research yourself and tell your results to us!

palmotea · 2026-04-16T14:40:30 1776350430

How much VRAM does it need? I haven't run a local model yet, but I did recently pick up a 16GB GPU, before they were discontinued.

WithinReason · 2026-04-16T14:49:05 1776350945

It's on the page:

  Precision  Quantization Tag File Size
  1-bit      UD-IQ1_M         10 GB
  2-bit      UD-IQ2_XXS       10.8 GB
             UD-Q2_K_XL       12.3 GB
  3-bit      UD-IQ3_XXS       13.2 GB
             UD-Q3_K_XL       16.8 GB
  4-bit      UD-IQ4_XS        17.7 GB
             UD-Q4_K_XL       22.4 GB
  5-bit      UD-Q5_K_XL       26.6 GB
  16-bit     BF16             69.4 GB

Aurornis · 2026-04-16T15:28:07 1776353287

Additional VRAM is needed for context.

This model is a MoE model with only 3B active parameters per expert which works well with partial CPU offload. So in practice you can run the -A(N)B models on systems that have a little less VRAM than you need. The more you offload to the CPU the slower it becomes though.

Glemllksdf · 2026-04-16T15:52:53 1776354773

Isn't that some kind of gambling if you offload random experts onto the CPU?

Or is it only layers but that would affect all Experts?

dragonwriter · 2026-04-16T16:22:44 1776356564

Pretty sure all partial offload systems I’ve seen work by layers, but there might be something else out there.

namibj · 2026-04-17T14:55:41 1776437741

Speculative decoding is already gambling.

est · 2026-04-16T16:34:39 1776357279

I really want to know what does M, K, XL XS mean in this context and how to choose.

I searched all unsloth doc and there seems no explaination at all.

tredre3 · 2026-04-16T20:14:56 1776370496

Q4_K is a type of quantization. It means that all weights will be at a minimum 4bits using the K method.

But if you're willing to give more bits to only certain important weights, you get to preserve a lot more quality for not that much more space.

The S/M/L/XL is what tells you how many tensors get to use more bits.

The difference between S and M is generally noticeable (on benchmarks). The difference between M and L/XL is less so, let alone in real use (ymmv).

Here's an example of the contents of a Q4_K_:

    S
    llama_model_loader: - type  f32:  392 tensors
    llama_model_loader: - type q4_K:  136 tensors
    llama_model_loader: - type q5_0:   43 tensors
    llama_model_loader: - type q5_1:   17 tensors
    llama_model_loader: - type q6_K:   15 tensors
    llama_model_loader: - type q8_0:   55 tensors
    M
    llama_model_loader: - type  f32:  392 tensors
    llama_model_loader: - type q4_K:  106 tensors
    llama_model_loader: - type q5_0:   32 tensors
    llama_model_loader: - type q5_K:   30 tensors
    llama_model_loader: - type q6_K:   15 tensors
    llama_model_loader: - type q8_0:   83 tensors
    L
    llama_model_loader: - type  f32:  392 tensors
    llama_model_loader: - type q4_K:  106 tensors
    llama_model_loader: - type q5_0:   32 tensors
    llama_model_loader: - type q5_K:   30 tensors
    llama_model_loader: - type q6_K:   14 tensors
    llama_model_loader: - type q8_0:   84 tensors

huydotnet · 2026-04-16T16:45:02 1776357902

They are different quantization types, you can read more here https://huggingface.co/docs/hub/gguf#quantization-types

arcanemachiner · 2026-04-16T22:26:57 1776378417

Just start with q4_k_m and figure out the rest later.

palmotea · 2026-04-16T15:00:04 1776351604

Thanks! I'd scanned the main content but I'd been blind to the sidebar on the far right.

JKCalhoun · 2026-04-16T16:19:55 1776356395

"16-bit BF16 69.4 GB"

Is that (BF16) a 16-bit float?

adrian_b · 2026-04-16T22:12:18 1776377538

The IEEE standard FP16 is an older 16-bit format, which has balanced exponent and significand sizes.

It has been initially supported by GPUs, where it is useful especially for storing the color components of pixels. For geometry data, FP32 is preferred.

In CPUs, some support has been first added in 2012, in Intel Ivy Bridge. Better support is provided in some server CPUs, and since next year also in the desktop AMD Zen 6 and Intel Nova Lake.

BF16 is a format introduced by Google, intended only for AI/ML applications, not for graphics, so initially it was implemented in some of the Intel server CPUs and only later in GPUs. Unlike FP16, which is balanced, BF16 has great dynamic range, but very low precision. This is fine for ML but inappropriate for any other applications.

Nowadays, most LLMs are trained preponderantly using BF16, with a small number of parameters using FP32, for higher precision.

Then from the biggest model that uses BF16, smaller quantized models are derived, which use 8 bits or less per parameter, trading off accuracy for speed.

mtklein · 2026-04-16T16:37:40 1776357460

Yes, it's a "Brain float", basically an ordinary 32-bit float with the low 16 mantissa bits cut off. Exact same range as fp32, lower precision, and not the same as the other fp16, which has less exponent and more mantissa.

Gracana · 2026-04-16T16:35:52 1776357352

https://en.wikipedia.org/wiki/Bfloat16_floating-point_format

Yes, however it’s a different format from standard fp16, it trades precision for greater dynamic range.

WithinReason · 2026-04-16T16:49:09 1776358149

yes, it has 8 exponent bits like float32 instead of 6 like float16

tommy_axle · 2026-04-16T15:34:38 1776353678

Pick a decent quant (4-6KM) then use llama-fit-params and try it yourself to see if it's giving you what you need.

gunalx · 2026-04-16T18:27:45 1776364065

I habe found llama-fit sometimes just selects a way to conservative load with VRAM to spare.

zozbot234 · 2026-04-16T14:49:26 1776350966

Should run just fine with CPU-MoE and mmap, but inference might be a bit slow if you have little RAM.

Ladioss · 2026-04-16T15:33:09 1776353589

You can run 25-30b model easily if you use Q3 or Q4 quants and llama-server with a pretty long list of options.

trvz · 2026-04-16T14:43:59 1776350639

If you have to ask then your GPU is too small.

With 16 GB you'll be only able to run a very compressed variant with noticable quality loss.

coder543 · 2026-04-16T15:17:14 1776352634

Not true. With a MoE, you can offload quite a bit of the model to CPU without losing a ton of performance. 16GB should be fine to run the 4-bit (or larger) model at speeds that are decent. The --n-cpu-moe parameter is the key one on llama-server, if you're not just using -fit on.

boppo1 · 2026-04-16T18:57:35 1776365855

I've been way out of the local game for a while now, what's the best way to run models for a fairly technical user? I was using llama.cpp in the command line before and using bash files for prompts.

adrian_b · 2026-04-16T22:17:17 1776377837

Running llama-server (it belongs to llama.cpp) starts a HTTP server on a specified port.

You can connect to that port with any browser, for chat.

Or you can connect to that port with any application that supports the OpenAI API, e.g. a coding assistant harness.

palmotea · 2026-04-16T14:46:45 1776350805

> If you have to ask then your GPU is too small.

What's the minimum memory you need to run a decent model? Is it pretty much only doable by people running Macs with unified memory?

giobox · 2026-04-16T15:02:15 1776351735

It's worth noting now there are other machines than just Apple that combine a powerful SoC with a large pool of unified memory for local AI use:

> https://www.dell.com/en-us/shop/cty/pdp/spd/dell-pro-max-fcm...

> https://marketplace.nvidia.com/en-us/enterprise/personal-ai-...

> https://frame.work/products/desktop-diy-amd-aimax300/configu...

etc.

But yes, a modern SoC-style system with large unified memory pool is still one of the best ways to do it.

jchw · 2026-04-16T15:00:55 1776351655

32 GiB of VRAM is possible to acquire for less than $1000 if you go for the Arc Pro B70. I have two of them. The tokens/sec is nowhere near AMD or NVIDIA high end, but its unexpectedly kind of decent to use. (I probably need to figure out vLLM though as it doesn't seem like llama.cpp is able to do them justice even seemingly with split mode = row. But still, 30t/s on Gemma 4 (on 26B MoE, not dense) is pretty usable, and you can do fit a full 256k context.)

When I get home today I totally look forward to trying the unsloth variants of this out (assuming I can get it working in anything.) I expect due to the limited active parameter count it should perform very well. It's obviously going to be a long time before you can run current frontier quality models at home for less than the price of a car, but it does seem like it is bound to happen. (As long as we don't allow general purpose computers to die or become inaccessible. Surely...)

dist-epoch · 2026-04-16T16:24:31 1776356671

NVIDIA 5070 Ti can run Gemma 4 26B at 4-bit at 120 tk/s.

Arc Pro B70 seems unexpectedely slow? Or are you using 8-bit/16-bit quants.

jchw · 2026-04-16T17:17:57 1776359877

Unfortunately it really is running this slow with Llama.cpp, but of course that's with Vulkan mode. The VRAM capacity is definitely where it shines, rather than compute power. I am pretty sure that this isn't really optimal use of the cards, especially since I believe we should be able to get decent, if still sublinear, scaling with multiple cards. I am not really a machine learning expert, I'm curious to see if I can manage to trace down some performance issues. (I've already seen a couple issues get squashed since I first started testing this.)

I've heard that vLLM performs much better, scaling particularly better in the multi GPU case. The 4x B70 setup may actually be decent for the money given that, but probably worth waiting on it to see how the situation progresses rather than buying on a promise of potential.

A cursory Google search does seem to indicate that in my particular case interconnect bandwidth shouldn't actually be a constraint, so I doubt tensor level parallelism is working as expected.

nyrikki · 2026-04-17T00:05:25 1776384325

Parallelism can be tricky and always has a cost, but don't discount the 3090 which is more expensive these days in that price bracket.

3090 llama.cpp (container in VM)

    unsloth/Qwen3.6-35B-A3B-GGUF:UD-Q4_K_XL  105 t/s
    unsloth/gemma-4-26B-A4B-it-GGUF:UD-Q4_K_XL  103 t/s

Still slow compaired to the

    ggml-org/gpt-oss-20b-GGUF 206 t/s

But on my 3x 1080 Ti 1x TITAN V getto machine I learned that multi gpu takes a lot of tuning no matter what. With the B70, where Vulkan has the CPU copy problem, and SYCL doesn't have a sponsor or enough volunteers, it will probably take a bit of profiling on your part.

There are a lot of variables, but PCIe bus speed doesn't matter that much for inference, but the internal memory bandwidth does, and you won't match that with PCIe ever.

To be clear, multicard Vulkan and absolutely SYCL have a lot of optimizations that could happen, but the only time two GPUs are really faster for inference is when one doesn't have enough ram to fit the entire model.

A 3090 has 936.2 GB/s of (low latency) internal bandwidth, while 16xPCIe5 only has 504.12, may have to be copied through the CPU, have locks, atomic operations etc...

For LLM inference, the bottleneck just usually going to be memory bandwidth which is why my 3090 is so close to the 5070ti above.

LLM next token prediction is just a form of autoregressive decoding and will primarily be memory bound.

As I haven't used the larger intel GPUs I can't comment on what still needs to be optimized, but just don't expect multiple GPUs to increase performance without some nvlink style RDMA support _unless_ your process is compute and not memory bound.

zozbot234 · 2026-04-16T15:05:42 1776351942

New versions of llama.cpp have experimental split-tensor parallelism, but it really only helps with slow compute and a very fast interconnect, which doesn't describe many consumer-grade systems. For most users, pipeline parallelism will be their best bet for making use of multi-GPU setups.

jchw · 2026-04-16T15:45:15 1776354315

Yeah, I was doing split tensor and it seemed like a wash. The Arc B70s are not huge on compute.

Right now I'm only able to run them in PCI-e 5.0 x8 which might not be sufficient. But, a cheap older Xeon or TR seems silly since PCI-e 4.0 x16 isn't theoretically more bandwidth than PCI-e 5.0 x8. So it seems like if that is really still bottlenecked, I'll just have to bite the bullet and set up a modern HEDT build. With RAM prices... I am not sure there is a world where it could ever be worth it. At that point, seems like you may as well go for an obscenely priced NVIDIA or AMD datacenter card instead and retrofit it with consumer friendly thermal solutions. So... I'm definitely a bit conflicted.

I do like the Arc Pro B70 so far. Its not a performance monster, but it's quiet and relatively low power, and I haven't run into any instability. (The AMDGPU drivers have made amazing strides, but... The stability is not legendary. :)

I'll have to do a bit of analysis and make sure there really is an interconnect bottleneck first, versus a PEBKAC. Could be dropping more lanes than expected for one reason or another too.

mjsabby · 2026-04-23T04:52:05 1776919925

I've joined the B70 bandwagon as well, and have started profiling, will keep this thread updated. I suspect if we get enough of us to get even 80% of theoretical perf we'll be in a good spot. I also have two B70s.

zozbot234 · 2026-04-16T15:52:01 1776354721

You could fit your HEDT with minimum RAM and a combination of Optane storage (for swapping system RAM with minimum wear) and fast NAND (for offloading large read-only data). If you have abundant physical PCIe slots it ought to be feasible.

angoragoats · 2026-04-16T14:57:42 1776351462

Macs with unified memory are economical in terms of $/GB of video memory, and they match an optimized/home built GPU setup in efficiency (W/token), but they are slow in terms of absolute performance.

With this model, since the number of active parameters is low, I would think that you would be fine running it on your 16GB card, as long as you have, say 32GB of regular system memory. Temper your expectations about speed with this setup, as your system memory and CPU are multiple times slower than the GPU, so when layers spill over you will slow down.

To avoid this, there's no need to buy a Mac -- a second 16GB GPU would do the trick just fine, and the combined dual GPU setup will likely be faster than a cheap mac like a Mac mini. Pay attention to your PCIe slots, but as long as you have at least an x4 slot for the second GPU, you'll be fine (LLM inference doesn't need x8 or x16).

TechSquidTV · 2026-04-16T15:03:26 1776351806

My Mac Studio with 96GB of RAM is maybe just at the low end of passable. It's actually extremely good for local image generation. I could somewhat replace something like Nano Banana comfortably on my machine.

But I don't need Nano Banana very much, I need code. While it can, there's no way I would ever opt to use a local model on my machine for code. It makes so much more sense to spend $100 on Codex, it's genuinely not worth discussing.

For non-thinking tasks, it would be a bit slower, but a viable alternative for sure.

slopinthebag · 2026-04-16T16:29:28 1776356968

You just need to adjust your workflow to use the smaller models for coding. It's primarily just a case of holding them wrong if you end up with worse outputs.

layer8 · 2026-04-16T14:57:49 1776351469

It’s also doable with AMD Strix Halo.

bfivyvysj · 2026-04-16T14:47:52 1776350872

A bit like asking how long is a piece of string.

latentsea · 2026-04-16T15:01:59 1776351719

It's twice as long as from one end to the middle.

palmotea · 2026-04-16T14:52:06 1776351126

More like "about how long of a string do I need to run between two houses in the densest residential neighborhood of single-family homes in the US?"

littlestymaar · 2026-04-16T14:51:56 1776351116

No, GP is excessively restrictive. Llama.cpp supports RAM offloading out of the box.

It's going to be slower than if you put everything on your GPU but it would work.

And if it's too slow for your taste you can try the quantized version (some Q3 variant should fit) and see how well it works for you.

utilize1808 · 2026-04-16T14:51:43 1776351103

Obviously going to depend on your definition of "decent". My impression so far is that you will need between 90GB to 100GB of memory to run medium sized (31B dense or ~110B MoE) models with some quantization enabled.

cjbgkagh · 2026-04-16T14:57:01 1776351421

I’m running Gemma4 31B (Q8) on my 2 4090s (48GB) with no problem.

Glemllksdf · 2026-04-16T15:56:36 1776354996

I have the same setup but tried paperclip ai with it and it seems to me that either i'm unable to setup it properly or multiply agents struggle with this setup. Especially as it seems that paperclip ai and opencode (used for connection) is blowing up the context to 20-30k

Any tips around your setup running this?

I use lmstudio with default settings and prioritization instead of split.

cjbgkagh · 2026-04-16T17:14:57 1776359697

I asked AI for help setting it up. I use 128k context for 31B and 256k context for 26B4A. Ollama worked out of the box for me but I wanted more control with llama.cpp.

My command for llama-server:

llama-server -m /models/gemma-4-26B-A4B-it-UD-Q8_K_XL.gguf -ngl 99 -sm layer -ts 10,12 --jinja --flash-attn on --cont-batching -np 1 -c 262144 -b 4096 -ub 512 -ctk q8_0 -ctv q8_0 --host 0.0.0.0 --port 8080 --timeout 18000

FusionX · 2026-04-16T14:59:34 1776351574

Aren't 4bits model decent? Since, this is an MOE model, I'm assuming it should have respectable tk/s, similar to previous MOE models.

gunalx · 2026-04-16T18:31:04 1776364264

Running q3 xss with full and quantizised context as options on a 16gb gpu and still has pretty decent quality and fitting fine with up to 64k context.

halJordan · 2026-04-16T18:45:53 1776365153

There's absolutely nothing wrong it insane with a safetensors file. It might be less convenient than a single file gguf. But that's just laziness not insanity

Zetaphor · 2026-04-17T13:08:48 1776431328

Quantization is the major appeal, we can't all run full precision

lta · 2026-04-17T12:56:44 1776430604

As long as they're not releasing 32bit .pt :)

txtsd · 2026-04-16T14:35:41 1776350141

So I can use this in claude code with `ollama run claude`?

nunodonato · 2026-04-16T16:38:09 1776357489

https://sleepingrobots.com/dreams/stop-using-ollama/

txtsd · 2026-04-17T01:29:09 1776389349

Thank you, I had no idea ollama was so shady! I will start using llama.cpp directly.

Ladioss · 2026-04-16T15:31:01 1776353461

More like `ollama launch claude --model qwen3.6:latest`

Also you need to check your context size, Ollama default to 4K if <24 Gb of VRAM and you need 64K minimum if you want claude to be able to at least lift a finger.

Patrick_Devine · 2026-04-16T17:47:25 1776361645

If you're on a Mac, use the MLX backend versions which are considerably faster than the GGML based versions (including llama.cpp) and you don't need to fiddle with the context size. The models are `qwen3.6:35b-a3b-nvfp4`, `qwen3.6:35b-a3b-mxfp8`, and `qwen3.6:35b-a3b-mlx-bf16`.

egorfine · 2026-04-17T11:11:38 1776424298

I was comparing various models at M5 Pro 48GB RAM MLX vs GGUF and found that MLX models have a higher time to first token (sometimes by an order of magnitude) while tokens/sec and memory usage is same as GGUF.

Gemma 3 27B q4:

* MLX: 16.7 t/s, 1220ms ttft

* GGUF: 16.4 t/s, 760ms ttft

Gemma 4 31B q8:

* MLX: 8.3 t/s, 25000ms ttft

* GGUF: 8.4 t/s, 1140ms ttft

Gemma 4 A4B q8:

* MLX: 52 t/s, 1790ms ttft

* GGUF: 51 t/s, 380ms ttft

All comparisons done in LM Studio, all versions of everything are the latest.

txtsd · 2026-04-16T18:29:33 1776364173

I only have 16GB VRAM, and my system uses ~4GB from that. What are my options? I got this one: `Qwen3.6-35B-A3B-UD-IQ2_XXS.gguf`

Ladioss · 2026-04-17T09:22:44 1776417764

My system has 16 Gb VRAM / 32 Gb RAM, and ollama runs qwen3.6:latest at decent speed just fine. The 35b model is a moe, so I guess the whole model is offloaded.

pj_mukh · 2026-04-16T14:45:04 1776350704

have you found a model that does this with usable speeds on an M2/M3?

postalcoder · 2026-04-16T14:53:57 1776351237

On a M4 MBP ollama's qwen3.5:35b-a3b-coding-nvfp4 runs incredibly fast when in the claude/codex harness. M2/M3 should be similar.

It's incomparably faster than any other model (i.e. it's actually usable without cope). Caching makes a huge difference.

littlestymaar · 2026-04-17T14:14:45 1776435285

> converted into a sane format

Having implemented a GGUF parser, I'd beg to differ on the “sane format” qualifier.

terataiijo · 2026-04-16T14:36:17 1776350177

lmao they are so fast yooo

ttul · 2026-04-16T14:40:39 1776350439

Yes. How do they do it? Literally they must have PagerDuty set up to alert the team the second one of the labs releases anything.

beernet · 2026-04-16T14:43:44 1776350624

They obviously collaborate with some of the labs prior to the official release date.

sigbottle · 2026-04-16T14:46:35 1776350795

That... is a more plausible explanation I didn't think of.

danielhanchen · 2026-04-16T15:07:24 1776352044

Yes we collab with them!

qskousen · 2026-04-16T18:13:04 1776363184

Sorry this is a bit of a tangent, but I noticed you also released UD quants of ERNIE-Image the same day it released, which as I understand requires generating a bunch of images. I've been working to do something similar with my CLI program ggufy, and was curious of you had any info you could share on the kind of compute you put into that, and if you generate full images or look at latents?

danielhanchen · 2026-04-17T08:34:41 1776414881

Yes we have started doing diffusion GGUFs but it's in it's infancy :) But yes we do generate images to test quants out!

sigbottle · 2026-04-16T14:46:03 1776350763

Is quantization a mostly solved pipeline at this point? I thought that architectures were varied and weird enough where you can't just click a button, say "go optimize these weights", and go. I mean new models have new code that they want to operate on, right, so you'd have to analyze the code and insert the quantization at the right places, automatically, then make sure that doesn't degrade perf?

Maybe I just don't understand how quantization works, but I thought quantization was a very nasty problem involving a lot of plumbing

Readerium · 2026-04-17T01:26:16 1776389176

that is true. gguf does not support any Architecture.

for the most recent example, as of April 16, 2026 (today)

Turboquant isnt still added to GGUF

bildung · 2026-04-16T15:01:50 1776351710

Bad QA :/ They had a bunch of broken quantizations in the last releases

danielhanchen · 2026-04-16T15:13:52 1776352432

1. Gemma-4 we re-uploaded 4 times - 3 times were 10-20 llama.cpp bug fixes - we had to notify people to upload the correct ones. The 4th is an official Gemma chat template improvement from Google themselves.

2. Qwen3.5 - we shared our 7TB research artifacts showing which layers not to quantize - all provider's quants were under optimized, not broken - ssm_out and ssm_* tensors were the issue - we're now the best in terms of KLD and disk space

3. MiniMax 2.7 - we swiftly fixed it due to NaN PPL - we found the issue in all quants regardless of provider - so it affected everyone not just us. We wrote a post on it, and fixed it - others have taken our fix and fixed their quants, whilst some haven't updated.

Note we also fixed bugs in many OSS models like Gemma 1, Gemma 3, Llama chat template fixes, Mistral, and many more.

Unfortunately sometimes quants break, but we fix them quickly, and 95% of times these are out of our hand.

We swiftly and quickly fix them, and write up blogs on what happened. Other providers simply just take our blogs and fixes and re-apply, re-use our fixes.

bildung · 2026-04-16T15:21:06 1776352866

Fair enough, appreciate the detailed response! Can you elaborate why other quantizations weren't affected (e.g. bartowski)? Simply because they were straight Q4 etc. for every layer?

danielhanchen · 2026-04-16T15:26:44 1776353204

No Bartowski's are more affected - (38% NaN) than ours (22%) - for MiniMax 2.7 see https://www.reddit.com/r/LocalLLaMA/comments/1slk4di/minimax...

We already fixed ours. Bart hasn't yet but is still working on it following our findings.

blk.61.ffn_down_exps in Q4_K or Q5_K failed - it must be in Q6_K otherwise it overflows.

For the others, yes layers in some precision don't work. For eg Qwen3.5 ssm_out must be minimum Q4-Q6_K.

ssm_alpha and ssm_beta must be Q8_0 or higher.

Again Bart and others apply our findings - see https://www.reddit.com/r/LocalLLaMA/comments/1rgel19/new_qwe...

bildung · 2026-04-16T15:34:00 1776353640

Thanks again, TIL

danielhanchen · 2026-04-16T15:52:12 1776354732

Thanks!

rohansood15 · 2026-04-16T15:41:14 1776354074

Thanks for all the amazing work Daniel. I remember you guys being late to OH because you were working on weights released the night before - and it's great to see you guys keep up the speed!

danielhanchen · 2026-04-16T15:52:06 1776354726

Oh thanks haha :) We try our best to get model releases out the door! :) Hope you're doing great!

ekianjo · 2026-04-16T14:55:12 1776351312

yeah and often their quants are broken. They had to update their Gemma4 quants like 4 times in the past 2 weeks.

danielhanchen · 2026-04-16T15:09:10 1776352150

No it's not our fault - re our 4 uploads - the first 3 are due to llama.cpp fixing bugs - this was out of our control (we're llama.cpp contributors, but not the main devs) - we could have waited, but it's best to update when multiple (10-20) bugs are fixed.

The 4th is Google themselves improving the chat template for tool calling for Gemma.

https://github.com/ggml-org/llama.cpp/issues/21255 was another issue CUDA 13.2 was broken - this was NVIDIA's CUDA compiler itself breaking - fully out of our hands - but we provided a solution for it.

homebrewer · 2026-04-14T15:18:51 1776179931

Not necessarily. I used jj for a couple of weeks and found it to be a complete waste of time.

For an advanced user, it did not offer anything I cannot quickly solve in git. Which is probably the wrong thing to optimize in the first place, because even though I frequently rewrite history and split commits during normal worklfow, it takes so little time that improving something else would yield greater returns.

We (not royal we) don't usually go out of our way repeating negative experiences with these tools, so you build a very skewed view of their adoption.