this was a huge inspiration for the post! i tried to highlight it in the blog but it might have gotten buried
there are a few things i wasn't able to figure out how to access, or wasn't sure were even possible. for example, a lot of Simon's article takes advantage of the warp scheduler and warp tiling.
i had a hard time finding information on whether that's even possible on my M2/Metal, or on the general memory access patterns. CUDA seems to have much better documentation in this regard
at least on my M2, the compiled kernel ends up using fast math anyway, so using WGSL's fma didn't change anything about the actual kernel that gets run
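concretely, the swap i'm describing looks like this (a simplified elementwise sketch rather than the full matmul kernel; the buffer names are illustrative):

```wgsl
@group(0) @binding(0) var<storage, read> a : array<f32>;
@group(0) @binding(1) var<storage, read> b : array<f32>;
@group(0) @binding(2) var<storage, read_write> out : array<f32>;

@compute @workgroup_size(64)
fn main(@builtin(global_invocation_id) gid : vec3<u32>) {
    let i = gid.x;
    // plain form: out[i] = out[i] + a[i] * b[i];
    // fma form — with fast math enabled, both appear to compile to the same fused multiply-add:
    out[i] = fma(a[i], b[i], out[i]);
}
```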
i tried using workgroup shared memory and found it slower than just recomputing everything in each thread, although i may have been doing something dumb
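for context, the kind of tiled kernel i mean is roughly this shape (a simplified sketch, not the exact kernel from the post; buffer names, hardcoded dimensions, and the 8x8 tile size are all illustrative):

```wgsl
@group(0) @binding(0) var<storage, read> A : array<f32>;
@group(0) @binding(1) var<storage, read> B : array<f32>;
@group(0) @binding(2) var<storage, read_write> C : array<f32>;

// dims would normally come from a uniform; hardcoded here for the sketch
// (assumed to be multiples of the tile size)
const K : u32 = 256u;
const N : u32 = 256u;

var<workgroup> tileA : array<f32, 64>;  // 8x8 tile of A
var<workgroup> tileB : array<f32, 64>;  // 8x8 tile of B

@compute @workgroup_size(8, 8)
fn main(@builtin(local_invocation_id) lid : vec3<u32>,
        @builtin(global_invocation_id) gid : vec3<u32>) {
    var acc : f32 = 0.0;
    for (var t : u32 = 0u; t < K / 8u; t = t + 1u) {
        // each thread stages one element of each tile into workgroup memory
        tileA[lid.y * 8u + lid.x] = A[gid.y * K + (t * 8u + lid.x)];
        tileB[lid.y * 8u + lid.x] = B[(t * 8u + lid.y) * N + gid.x];
        workgroupBarrier();
        for (var k : u32 = 0u; k < 8u; k = k + 1u) {
            acc = acc + tileA[lid.y * 8u + k] * tileB[k * 8u + lid.x];
        }
        workgroupBarrier();
    }
    C[gid.y * N + gid.x] = acc;
}
```

the two workgroupBarrier() calls are where the cost shows up: every thread in the workgroup has to sync twice per tile, which is presumably part of why it lost to straight recomputation on my hardware.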
great question, to me WebGPU sits a hair higher level than CUDA or Vulkan. so you don't have the exact same level of control, but you can get to ~80% of the performance without having to write different kernels specific to the hardware
It's been a huge boost over using Copilot. I accidentally was using Copilot instead of Codeium and was confused why the generations took so long until I realized! Great product