echoing this sentiment. I really do believe at some level there is strength/wisdom in being able to step away from a problem and return to it from a new perspective, despite the narratives being pushed online by hustle
MFU is probably the best metric, but it requires application logic. You can also export metrics at the infra level, like SM efficiency. We explain a bit about how we used it to do some optimization.
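For anyone wondering what "requires application logic" means here, a rough sketch of an MFU calculation follows. It assumes the common approximation of about 6 FLOPs per parameter per token for a dense transformer training step (forward plus backward, ignoring attention FLOPs). The function name and the example numbers are made up for illustration, not taken from our setup.

```python
def model_flops_utilization(n_params, tokens_per_sec, peak_flops_per_gpu, n_gpus=1):
    """Rough MFU estimate: achieved model FLOP/s divided by hardware peak FLOP/s.

    Assumption: ~6 * n_params FLOPs per trained token (dense transformer,
    forward + backward, attention FLOPs ignored).
    """
    achieved_flops_per_sec = 6.0 * n_params * tokens_per_sec
    peak_flops_per_sec = peak_flops_per_gpu * n_gpus
    return achieved_flops_per_sec / peak_flops_per_sec


# Hypothetical run: 7B parameters, 10k tokens/s across 8 GPUs,
# each with a 312 TFLOP/s dense BF16 peak (the A100 spec-sheet figure).
mfu = model_flops_utilization(7e9, 10_000, 312e12, n_gpus=8)
print(f"MFU: {mfu:.1%}")
```

The catch is that the parameter count and token throughput have to come from the training code itself, which is why you can't get this purely from infra-level exporters.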
I liked John Ousterhout's "A Philosophy of Software Design". It was supposed to be assigned reading for Berkeley's data structures class, CS61B, and I don't think I really internalized the lessons at the time, but after re-reading it recently I appreciated it a lot more and found that the material goes beyond how to write code and applies to how to architect things as well.
power is also a good proxy. For example, we've had distributed runs that we monitored on WandB where one of our workers died in the middle and the rest were basically stalling on the dead worker. On WandB, we were only logging GPU stats on one worker and that one had 100% util but basically no excess power draw compared to having nothing running, which is how I found out something was stalling. Restarting fixed it and got the power draw up to normal, but even with high power draw, we were still having some sections of code with low SM efficiency (~20%) for that training.
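A minimal sketch of that check, in case it's useful: flag a GPU that reports high utilization while drawing close to idle power. The thresholds and function names are hypothetical and would need tuning per GPU model. The sample format assumes output from `nvidia-smi --query-gpu=utilization.gpu,power.draw --format=csv,noheader,nounits`.

```python
def parse_sample(csv_line):
    """Parse one 'utilization, power' line, e.g. '100, 68.50'."""
    util_pct, power_w = (float(field) for field in csv_line.split(","))
    return util_pct, power_w


def looks_stalled(util_pct, power_w, idle_power_w,
                  util_threshold=90.0, power_margin_w=30.0):
    """True when utilization is high but power draw is near idle.

    Both thresholds are illustrative guesses, not measured values.
    """
    high_util = util_pct >= util_threshold
    near_idle_power = power_w <= idle_power_w + power_margin_w
    return high_util and near_idle_power


# Hypothetical readings: 100% util at ~idle power suggests a stall.
util, power = parse_sample("100, 68.50")
print(looks_stalled(util, power, idle_power_w=60.0))
```

This only catches the "waiting on a dead worker" pattern; it says nothing about low SM efficiency when power draw is normal.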
totally agreed. One thing we kept finding during this process is that there's still a lot of alpha in finding the right kernels for the job/model. We're hoping `torch.compile` will become more mature in the future, because the current docs on performance, at least on the PyTorch side, definitely leave us wanting more
I really hope this makes it easier to install/upgrade NVIDIA drivers on Linux. It's a nightmare to figure out version mismatches between drivers, utils, container-runtime...
From my limited experience with their open-sourcing of kernel modules so far: It doesn't make things easier; but - the silver lining is that, for the most part, it doesn't make installation and configuration harder! Which is no small thing actually.