Hacker News | shihab's comments

Citizens United is an existential threat to the USA. You cannot have Israeli-American dual citizens pouring $200 million into elections. And that’s just her alone. This is simply not sustainable.

Or one South African-Canadian-American triple citizen pouring $300 million into elections. I am shocked that campaign donations are legal.

I mean, some of the stuff actually wasn't legal. But accountability for wealthy elites is limited to a strongly worded letter.

https://www.bbc.com/news/articles/c748l0zv4x8o

Just look at the fallout from the Epstein files where at best we can hope people will be embarrassed into resigning their current position.


Ideally we would completely restructure the government to have multimember districts and change the Senate.

Within the current structure, we need to implement ranked/scored voting to break the two party system and the implied complete control it has over our government. It's so much easier for big money to control the narrative, control the candidates, and play off extreme polar politics when the voting system makes people choose the "lesser of two evils".

Were I king for a day in the US, and could only do one thing to help America, changing our voting system to some kind of rank/scored system would be it. Ending gerrymandering and Citizens United are also important but honestly less so than this.


Can we not with the blatant antisemitic dogwhistles...?

Exactly what part of my statement was dog whistling? Can you stop throwing around this serious accusation of antisemitism without any attempt to substantiate your claim?

"Israeli-American dual citizens"

Making a big deal out of Israelis—especially wealthy ones—having dual citizenship is a classic antisemitic tactic, used to sow the idea that they aren't "real Americans" or their primary loyalty is to another country.

Also: yes, Citizens United is a big problem. But phrasing your comment as if the primary problem with it is "Israeli-American dual citizens" pouring millions of dollars into politics is perpetuating the antisemitic ideas that a) all or most Jews are wealthy, and b) Jews are controlling our country/the world.

Whether or not you meant it as antisemitic, it played directly and very clearly into multiple antisemitic tropes that are frequently used to try to smear and harm Jewish people.


I brought up Israeli-American donors because that’s what is relevant in the context of the story we’re discussing. We are talking about a war many right-wing Israelis wanted for decades. If it were a general discussion about Citizens United and I focused on lobbying from only this group, perhaps your argument would have held water.

Anyway, here’s Trump himself detailing the extraordinary access to White House this lobbying bought Adelsons:

https://www.reuters.com/world/us/trump-salutes-mega-donor-mi...


Another Middle East war entirely on Israel’s behalf, another war Americans will pay taxes for and die for, just so Israel can keep grabbing a few parcels of land from Palestine.

I think there are two steps here: converting video to sensor-data input, and using that sensor data to drive. Only the second step will be handled by cars on the road; the first one is purely for training.


The article strictly talks about people who were pals with him _after_ his pedophilia conviction. And please don't do this "evil person eating babies" strawman; nobody sane is claiming that.


Just take a look at the photos of his residences: they're filled with big black redaction squares over the "artwork" and photos on his walls, and the little bits that can be seen lend huge credence to his prior conviction.

Absolutely bonkers to say he's a "somewhat flawed person" when there's gigs of PDFs of some of the most heinous things imaginable.


Roman Polanski's crime was 1000x worse: he actually (in the literal sense of the word) raped a 13-year-old.

Whereas Epstein was convicted of soliciting sex from a 16-year-old. There's a mountain of a difference between them.

Here are the people who signed the petition against Polanski's conviction:

* Woody Allen

* Wes Anderson

* Jean-Jacques Annaud

* Asia Argento (later expressed regret)

* Darren Aronofsky

* Paul Auster

* Monica Bellucci

* Gael García Bernal

* Adrien Brody

* Penélope Cruz

* Alfonso Cuarón

* Guillermo del Toro

* Jonathan Demme

* Alexandre Desplat

* Xavier Dolan (later expressed regret)

* Stephen Frears

* Harrison Ford

* Terry Gilliam

* Taylor Hackford

* Buck Henry

* Alejandro G. Iñárritu

* Jeremy Irons

* Neil Jordan

* Harmony Korine

* John Landis

* David Lynch (his daughter claimed posthumous regret in 2025)

* Michael Mann

* Sam Mendes

* Mike Nichols

* Alexander Payne

* Natalie Portman (later expressed regret)

* Brett Ratner

* Walter Salles

* Jerry Schatzberg

* Julian Schnabel

* Martin Scorsese

* Steven Soderbergh

* Tilda Swinton

* Kristin Scott Thomas

* Tom Tykwer

* Emma Thompson (later expressed regret)

* Pedro Almodóvar

* Wong Kar-wai

* Salman Rushdie

* Milan Kundera

* Diane von Furstenberg

* Neil Gaiman (signed a related petition)

* Meryl Streep (publicly supported without signing)

* Whoopi Goldberg (publicly supported without signing)

* Debra Winger (publicly supported without signing)

* Harvey Weinstein (publicly defended him)

Why don't we hold these people to the same standards?


I'm not a fan of Roman Polanski, to put it mildly, but are you trying to claim that supporting a pedophile is as bad as being a pedophile?

Also all those people can fuck themselves. There’s more than enough blame to go around.


Do you agree that you have to hold all these people to the same standard?


Yeah, so the guy who actually committed pedophilia is being judged more harshly than someone who defended him but hasn't committed it himself.

You, for instance, seem to be in the same boat as the people on this list, so I won’t judge you as harshly as Epstein.


Do you judge the people Epstein was in touch with the same way you judge the people I listed above?


If they were in touch with him after knowing his crimes, then yeah, like that UK minister who got sacked.

Why are you carrying so much weight for Epstein and associates?


I work with GPUs and I'm also trying to understand the motivations here.

Side note and a hot take: that sort of abstraction never really existed for GPUs, and it's going to be even harder now as Nvidia et al. race to put more and more specialized hardware bits inside GPUs.


To the author (or anyone from vectorware team), can you please give me, admittedly a skeptic, a motivating example of a "GPU-native" application?

That is, where does it truly make a difference to dispatch non-parallel work/syscalls etc. from the GPU to the CPU, instead of dispatching the parallel part of the code from the CPU to the GPU?

From the "Announcing VectorWare" page:

> Even after opting in, the CPU is in control and orchestrates work on the GPU.

Isn't it better to let CPUs be in control and orchestrate things as GPUs have much smaller, dumber cores?

> Furthermore, if you look at the software kernels that run on the GPU they are simplistic with low cyclomatic complexity.

Again, there's an obvious reason why people don't put branch-y code on the GPU.

Genuinely curious what I'm missing.


Not OP, but I'm currently making a city-builder computer game with a large procedurally generated world. The terrain height at any point in the world is defined by a function that takes a small number of constant parameters plus the horizontal position in the world, and gives the height of the terrain at that position.

I need the heights on the GPU so I can modify the terrain meshes to fit the terrain. I need the heights on the CPU so I can know when the player is clicking the terrain and where to place things.

Rather than generating a heightmap on the CPU and passing a large heightmap texture to the GPU, I have implemented identical height-generating functions in Rust (CPU) and WebGL (GPU). As you might imagine, it's very easy for these to diverge, so I have to maintain a large set of tests that verify the generated heights are identical between implementations.

Being able to write this implementation once and run it on both the CPU and GPU would give me much better guarantees that the results will be the same. (Although, because of architecture differences and floating-point handling, the results will never match perfectly; I just need them to be within an acceptable tolerance.)
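For illustration, a sketch of the kind of cross-implementation tolerance test described above (the height function here is a made-up stand-in, not the game's actual noise function):

```python
import math

# Hypothetical stand-ins for the two implementations: in the real project one
# lives in Rust (CPU) and one in WebGL (GPU); here both are plain Python.
def height_cpu(x, z, amplitude=40.0, frequency=0.01):
    return amplitude * math.sin(frequency * x) * math.cos(frequency * z)

def height_gpu(x, z, amplitude=40.0, frequency=0.01):
    # Imagine this is the GPU port; float32 rounding makes it diverge slightly.
    return amplitude * math.sin(frequency * x) * math.cos(frequency * z)

def heights_match(samples, tol=1e-3):
    """Check both implementations agree to within an acceptable tolerance."""
    return all(abs(height_cpu(x, z) - height_gpu(x, z)) <= tol
               for x, z in samples)

samples = [(x * 7.3, z * 11.1) for x in range(20) for z in range(20)]
assert heights_match(samples)
```

Writing the function once for both targets would replace this whole test harness with a single source of truth.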


That's a good application, but likely not one requiring a full standard library on the GPU? Procedurally generated data on the GPU isn't uncommon, AFAIK. It wasn't when I was dabbling in GPGPU stuff ~10 years ago.

If you wrote it in OpenCL, or via Intel's libraries, or via Torch or ArrayFire or whatever, you could dispatch it to both CPU and GPU at will.


There are GPU-based picking algorithms. You really should not have to maintain parallel data generation systems on both the GPU and CPU just to support picking. Maybe you have a different issue that would require it, but picking alone shouldn't be it.


In large simulation systems, peer-to-peer communication matters: you avoid involving the CPU in any way, because the CPU is doing work as well and you do not want it to sync every result if that result is partial.

One example is PME decomposition in GROMACS.


The killer app here is likely LLM inference loops. Currently you pay a PCIe latency penalty for every single token generated because the CPU has to handle the sampling and control logic. Moving that logic to the GPU and keeping the whole generation loop local avoids that round trip, which turns out to be a major bottleneck for interactive latency.


I don't know what the pros are doing but I'd be a bit shocked if it isn't already done this way in real production systems. And it doesn't feel like porting the standard library is necessary for this, it's just some logic.


Raw CUDA works for the heavy lifting but I suspect it gets messy once you implement things like grammar constraints or beam search. You end up with complex state machines during inference and having standard library abstractions seems pretty important to keep that logic from becoming unmaintainable.
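A minimal sketch of what a grammar constraint adds to the sampling step (a hypothetical toy grammar, not any real library's API): a state machine decides which tokens are legal next, and illegal logits are masked out before picking.

```python
# Toy grammar: legal tokens must alternate "a" -> "b" -> "a" -> ...
# (hypothetical example, not a real tokenizer or grammar engine)
TRANSITIONS = {"start": {"a"}, "a": {"b"}, "b": {"a"}}

def constrained_pick(logits, state):
    """Greedy sampling restricted to tokens the grammar allows from `state`."""
    allowed = TRANSITIONS[state]
    legal = {tok: score for tok, score in logits.items() if tok in allowed}
    tok = max(legal, key=legal.get)  # mask illegal tokens, then argmax
    return tok, tok                  # in this toy grammar, next state == token

tok, state = constrained_pick({"a": 0.1, "b": 2.5}, state="start")
assert tok == "a"  # "b" scored higher but is illegal from the start state
```

Multiply this by batching and beam search and the per-step control logic grows quickly, which is where library abstractions start to pay off.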


I was thinking mainly about the standard autoregressive loop; yes, I can see that grammars would make it a bit more complicated, especially when considering batching.


Turns out how? Where are the numbers?


It is less about the raw transfer speed and more about the synchronization and kernel launch overheads. If you profile a standard inference loop with a batch size of 1 you see the GPU spending a lot of time idle waiting for the CPU to dispatch the next command. That is why optimizations like CUDA graphs exist, but moving the control flow entirely to the device is the cleaner solution.
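To make the claim concrete, a back-of-envelope sketch with made-up (not measured) numbers: if a batch-1 decode step launches hundreds of small kernels and each launch costs a few microseconds, the dispatch overhead becomes a visible fraction of the step.

```python
# Illustrative, not measured: fraction of a batch-1 decode step that goes to
# kernel-launch overhead when the CPU dispatches every kernel individually.
def overhead_fraction(n_kernels, launch_us, compute_us):
    launch_total = n_kernels * launch_us
    return launch_total / (launch_total + compute_us)

# e.g. ~300 small kernels per token at ~5 us each vs ~10 ms of actual GPU math
frac = overhead_fraction(n_kernels=300, launch_us=5.0, compute_us=10_000.0)
assert 0.10 < frac < 0.15  # roughly 13% of the step spent on dispatch
```

CUDA graphs amortize the per-launch cost; device-side control flow removes it entirely.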


I'm not convinced. (A bit of advice: if you wish to make a statement about performance, always start by measuring. Then, when somebody asks you for proof/data, you'll already have it.) If what you're saying were true it would be a big deal; unfortunately, it isn't.

Dispatch has overhead, but it's largely insignificant. Where it would otherwise be significant:

1. Fused kernels exist

2. CUDA graphs (and other forms of work-submission pipelining) exist


CUDA graphs are pretty slow at synchronizing things.


> For example, NEON ... can hold up to 32 128-bit vectors to perform your operations without having to touch the "slow" memory.

Something I recently learnt: the actual number of physical registers in modern x86 CPUs is significantly larger, even for 512-bit SIMD. Zen 5 CPUs actually have 384 vector registers: 384 × 512 bits = 24 KB!


This is true, but if you run out of the 32 register names you’ll still need to spill to memory. The large register file is there to allow multiple instructions to execute in parallel, among other things.


They’re used by the internal register renamer/allocator: if it sees you storing a result to memory and then reusing the named register for a new result, it allocates a new physical register so your instruction doesn’t stall waiting for the previous write to go through.


I do not understand what you want to say.

The register renamer allocates a new physical register when you attempt to write the same register as a previous instruction, as otherwise you would have to wait for that instruction to complete, and you would also have to wait for any instructions that would want to read the value from that register.

When you store a value into memory, the register renamer does nothing, because you do not attempt to modify any register.

The only optimization is that if a following instruction attempts to read the value stored in the memory, that instruction does not wait for the previous store to complete, in order to be able to load the stored value from the memory, but it gets the value directly from the store queue. But this has nothing to do with register renaming.

Thus if your algorithm has already used all the visible register numbers, and you will still need in the future all the values that occupy the registers, then you have to store one register into the memory, typically in the stack, and the register renamer cannot do anything to prevent this.

This is why Intel will increase the number of architectural general-purpose registers in x86-64 from 16 to 32 (matching Arm AArch64 and IBM POWER) with the APX ISA extension, which will be available in the Nova Lake desktop/laptop CPUs and the Diamond Rapids server CPUs, both expected by the end of this year.

Register renaming is a typical example of the general strategy that is used when shared resources prevent concurrency: the shared resources must be multiplied, so that each concurrent task uses its private resource.
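The renaming behavior described above can be illustrated with a toy model (a deliberately simplified Python sketch, not how real hardware is organized): every write to an architectural register name allocates a fresh physical register, so a later writer never waits on an earlier one.

```python
# Toy model of register renaming: each write to an architectural register
# name is given a fresh physical register, so write-after-write hazards
# vanish and readers always see the newest producer's register.
class Renamer:
    def __init__(self, n_physical):
        self.free = list(range(n_physical))  # free physical registers
        self.rat = {}  # register alias table: arch name -> physical reg

    def write(self, arch_reg):
        phys = self.free.pop(0)    # allocate a new physical register
        self.rat[arch_reg] = phys  # later reads are redirected here
        return phys

    def read(self, arch_reg):
        return self.rat[arch_reg]

r = Renamer(n_physical=8)
first = r.write("xmm0")   # first producer of xmm0
second = r.write("xmm0")  # same architectural name, new physical register
assert first != second and r.read("xmm0") == second
```

Note the limit the toy makes visible: renaming multiplies *writes* to the same name, but once all architectural names hold live values, a spill to memory is still unavoidable.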


> When you store a value into memory, the register renamer does nothing, because you do not attempt to modify any register.

you are of course correct about everything. But the extreme pedant in me can't avoid pointing out that there are in fact a few mainstream CPUs [1] that can rename memory to physical registers, at least in some cases. This is done explicitly to mitigate the cost of spilling. Edit: this is different from the store-forwarding optimization you mentioned.

[1] Ryzen for example: https://www.agner.org/forum/viewtopic.php?t=41


That feature does not exist in every AMD Zen, but only in certain Zen generations, and not in successive ones: the optimization has been introduced and then removed a couple of times. Therefore it is not an optimization whose presence you can count on in a processor.

I believe that it is not useful to group such an optimization with register renaming. The effect of register renaming is to replace a single register shared by multiple instructions with multiple registers, so that each instruction may use its own private register without interfering with the other instructions.

On the other hand, the optimization mentioned by you is better viewed as an enhancement of the optimization mentioned by me, which is implemented in all modern CPUs: after a store instruction, the stored value persists for some time in the store queue, and subsequent instructions can access it there instead of going to memory.

With this additional optimization, the stored values that are needed by subsequent instructions are retained in some temporary registers even after the store queue is drained to the memory as long as they are still needed.

Unlike with register renaming, here the purpose is not to multiply the memory locations that store a value so that they can be accessed independently. Here the purpose is to cache the value close to the execution units, to be available quickly, instead of taking it from the far away memory.

As mentioned at your link, the most frequent case where this optimization pays off is when arguments are pushed onto the stack before invoking a function and the invoked function then loads the arguments into registers. On the CPUs where this optimization is implemented, the passing of arguments to the function bypasses the stack, becoming much faster.

However this calling convention is important mainly for legacy 32-bit applications, because the 64-bit programs pass most arguments inside registers, so they do not need this optimization. Therefore this optimization is more important for Windows, where it is more frequent to use ancient 32-bit executables, which have not been recompiled to 64-bit.


Yes, it is not in all Zen cpus.

I don't think it makes sense to distinguish it from renaming. It is effectively aliasing a memory location (or rather, an offset off the stack pointer) with a physical register, effectively treating named stack offsets as additional architectural registers. AFAIK this is done at the renaming stage.


The named stack offsets are treated as additional hidden registers, not as additional architectural registers.

You do not access them using architectural register numbers, as you would do with the renamed physical registers, but you access them with an indexed memory addressing mode.

The aliasing between a stack location and a hidden register is of the same nature as the aliasing between a stack location's true address in main memory and the location in the L1 cache where stack locations are normally cached in any other modern CPU.

This optimization present in some Zen CPUs just caches some locations from the stack even closer to the execution units of the CPU core than the L1 cache used for the same purpose in other CPUs, allowing those stack locations to be accessed as fast as the registers.


The stack offset (or, in general, the memory location's address [1]) has a name (its unique address), exactly like an architectural register; how can it be a hidden register?

In any case, as far as I know the feature is called memory renaming, and it was discussed in academia decades before it showed up in actual consumer CPUs. It uses the renaming hardware and behaves more like renaming (zero-latency movs resolved at rename time, in the front end) than an actual cache (which involves an address-generation unit and a load unit and is resolved in the execution stages, in the OoO backend).

[1] More precisely, the feature seems to use address expressions to name the stack slots, instead of actual addresses, although it can handle offset changes after push/pop/call/ret, probably thanks to the stack engine that canonicalizes the offsets at the decode stage.


In the register file or named registers?

And the critical matrix tiling size is often set by SRAM, i.e. the unified L3 cache.


Hi, I just wanted to note that e3nn is more of an academic package that is a bit high-level by design. A better baseline for comparison would be Nvidia's cuEquivariance, which does pretty much the same thing you did: take e3nn and optimize it for the GPU.

As an HPC developer, it breaks my heart how much worse academic software performance is compared to vendor libraries (from Intel or Nvidia). We need to start aiming much higher.


I took a lot longer than I should have to finish my PhD because I wanted to beat well written/properly used vendor code. I wouldn’t recommend it, TBH.

It did make my defense a lot easier because I could just point at the graphs and say “see I beat MKL, whatever I did must work.” But I did a lot of little MPI tricks and tuning, which doesn’t add much to the scientific record. It was fun though.

I don’t know. Mixed feelings. To some extent I don’t really see how somebody could put all the effort into getting a PhD and not go on a little “I want to tune the heck out of these MPI routines” jaunt.


To be practically useful, we don't need to beat vendors; just getting close would be enough, by virtue of being open source (and often portable). But I found, as an example, PETSc to be ~10x slower than MKL on CPU and CUDA on GPU; it still doesn't have native shared-memory parallelism support on CPU, etc.


Oh dang, thanks for the heads up. I was looking at them for the “next version” of my code.

The lack of "BLAS/LAPACK/sparse equivalents that can dispatch to GPU or CPU" is really annoying. You'd think this would be somewhat "easy" (lol, nothing is easy), in the sense that we've got a bunch of big chunky operations…


I should note PETSc is a big piece of software that does a lot of things. It also wraps many libraries, and those might ultimately dictate actual performance depending on what you plan on doing.


I would love to do this in the future, but knowing me, I'd get caught up making sure I'm benchmarking properly rather than actually writing code.


OpenEquivariance [1] is another good baseline, with kernels for the Clebsch-Gordan tensor product and convolution, and it is fully open source. Both kernel implementations have been successfully integrated into existing machine-learning interatomic potentials, e.g. [2,3].

[1] https://github.com/PASSIONLab/OpenEquivariance

[2] https://arxiv.org/abs/2504.16068

[3] https://arxiv.org/abs/2508.16067


> As an HPC developer, it breaks my heart how much worse academic software performance is compared to vendor libraries (from Intel or Nvidia). We need to start aiming much higher.

They're optimising for different things really.

Intel/Nvidia have the resources to (a) optimise across a wide range of hardware in their libraries (b) often use less well documented things (c) don't have to make their source code publicly accessible.

Take MKL for example - it's a great library, but implementing dynamic dispatch for all the different processor types is why it gets such good performance across x86-64 machines, it's not running the same code on each processor. No academic team can really compete with that.


I'm not asking an academic program first published 8 years ago (e3nn) to beat the actively developed cuEquivariance library. An academic proposing new algorithms doesn't need to worry too much about performance. But any new work that focuses on performance, and that includes this blog post and a huge number of academic papers published every year, should absolutely use the latest vendor libraries as a baseline.


I think this is the difference between research and industry. Industry should try to grind out obvious improvements through brute force iteration. I really wish the culture of academia was more of an aim towards moonshots (high risk, high reward).


cuEquivariance is unfortunately closed-source (the actual .cu kernels), but OP's work targets a consumer GPU and also a very small particle system, so it's hard to compare anyway.


Hi, I actually mentioned ISPC several times there. And although I strenuously avoided crowning one approach "better" than the other, it is worth pointing out that 1) many of the benefits of ISPC can be had from explicit SIMD libraries like Google's Highway, and 2) ISPC (or any SIMT model) is a departure from how the underlying hardware works, and as the AI community is discovering with GPUs, this abstraction can sometimes be a lot more headache than it's worth.


No. Assuming `k` is small enough, which it often is in practice, the arithmetic intensity of this kernel is 25-90 Flops/Byte, way above the roofline knee of any modern CPU.
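The roofline comparison can be spelled out in a few lines (the peak-FLOPs and bandwidth numbers below are hypothetical, for illustration only): the machine's "knee" is peak FLOP/s divided by memory bandwidth, and a kernel whose arithmetic intensity exceeds it is compute-bound rather than memory-bound.

```python
# A kernel is compute-bound when its arithmetic intensity (Flops/Byte)
# exceeds the machine's roofline knee: peak FLOP/s divided by bandwidth.
def roofline_knee(peak_gflops, bandwidth_gbs):
    return peak_gflops / bandwidth_gbs  # Flops/Byte at the ridge point

def is_compute_bound(ai_flops_per_byte, peak_gflops, bandwidth_gbs):
    return ai_flops_per_byte > roofline_knee(peak_gflops, bandwidth_gbs)

# A hypothetical CPU with 2 TFLOP/s fp32 peak and 100 GB/s DRAM bandwidth
# has its knee at 20 Flops/Byte, so an intensity of 25-90 sits above it.
assert is_compute_bound(25, peak_gflops=2000.0, bandwidth_gbs=100.0)
assert not is_compute_bound(5, peak_gflops=2000.0, bandwidth_gbs=100.0)
```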

