This was fun to work on. LLMs still have a long way to go at writing kernels, but it's honestly a little surprising how decent they are now. I guess I've been pretty consistently "surprised" by codegen for a while now (meaning the last two years).
This is the first step towards fully automated GPU performance optimization. The idea is to automatically generate GPU kernels, then automatically integrate them into vLLM/SGLang/PyTorch.