totally agreed. One of our main findings during this process is that there's still a lot of alpha in finding the right kernels for the job/model. We're hoping `torch.compile` matures in the future, because the current performance docs, at least on the PyTorch side, definitely leave us wanting more.