this was a huge inspiration for the post! i tried to highlight it in the blog but it might have gotten buried
there are a few things i wasn't able to figure out how to access, or wasn't sure were even possible. for example, a lot of Simon's article takes advantage of the warp scheduler and warp tiling.
i had a hard time finding information on whether that's even possible on my M2/Metal, or on the general memory access patterns. CUDA seems to have much better documentation in this regard
at least on my M2, the compiled kernel ends up using fast math anyway, so using WGSL's fma didn't change anything about the actual kernel that gets run
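concretely, the swap i'm describing looks like this (a simplified elementwise sketch rather than the full matmul kernel; the buffer names are illustrative):

```wgsl
@group(0) @binding(0) var<storage, read> a : array<f32>;
@group(0) @binding(1) var<storage, read> b : array<f32>;
@group(0) @binding(2) var<storage, read_write> out : array<f32>;

@compute @workgroup_size(64)
fn main(@builtin(global_invocation_id) gid : vec3<u32>) {
    let i = gid.x;
    // plain form: out[i] = out[i] + a[i] * b[i];
    // fma form — with fast math enabled, both appear to compile to the same fused multiply-add:
    out[i] = fma(a[i], b[i], out[i]);
}
```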
i tried using workgroup shared memory and found it slower than just recomputing everything in each thread, although i may have been doing something dumb
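for context, the kind of tiled kernel i mean is roughly this shape (a simplified sketch, not the exact kernel from the post; buffer names, hardcoded dimensions, and the 8x8 tile size are all illustrative):

```wgsl
@group(0) @binding(0) var<storage, read> A : array<f32>;
@group(0) @binding(1) var<storage, read> B : array<f32>;
@group(0) @binding(2) var<storage, read_write> C : array<f32>;

// dims would normally come from a uniform; hardcoded here for the sketch
// (assumed to be multiples of the tile size)
const K : u32 = 256u;
const N : u32 = 256u;

var<workgroup> tileA : array<f32, 64>;  // 8x8 tile of A
var<workgroup> tileB : array<f32, 64>;  // 8x8 tile of B

@compute @workgroup_size(8, 8)
fn main(@builtin(local_invocation_id) lid : vec3<u32>,
        @builtin(global_invocation_id) gid : vec3<u32>) {
    var acc : f32 = 0.0;
    for (var t : u32 = 0u; t < K / 8u; t = t + 1u) {
        // each thread stages one element of each tile into workgroup memory
        tileA[lid.y * 8u + lid.x] = A[gid.y * K + (t * 8u + lid.x)];
        tileB[lid.y * 8u + lid.x] = B[(t * 8u + lid.y) * N + gid.x];
        workgroupBarrier();
        for (var k : u32 = 0u; k < 8u; k = k + 1u) {
            acc = acc + tileA[lid.y * 8u + k] * tileB[k * 8u + lid.x];
        }
        workgroupBarrier();
    }
    C[gid.y * N + gid.x] = acc;
}
```

the two workgroupBarrier() calls are where the cost shows up: every thread in the workgroup has to sync twice per tile, which is presumably part of why it lost to straight recomputation on my hardware.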
great question, to me WebGPU sits a hair higher level than CUDA or Vulkan. so you don't have the exact same level of control, but you can get to ~80% of the performance without having to write different kernels specific to the hardware
It's been a huge boost over using Copilot. I accidentally was using Copilot instead of Codeium and was confused why the generations took so long until I realized! Great product