To recap, over the last 8 parts of this series we:
- Developed an optimized CUDA kernel for 1D Convolution
- Developed a fused CUDA kernel for Group Normalization + Mish
- Developed a CUDA kernel for the denoising step of our Diffusion model
- Fused the whole U-Net into a CUDA graph to eliminate CPU/Pytorch overhead
The final result is a ~3.4x reduction in U-Net inference time over Pytorch eager mode (the original paper implementation), and a ~2.65x reduction over Pytorch compile mode (the fastest performance possible with native Pytorch)! The cleaned-up end product can be found here. The library implements the optimizations we developed in this blog post series and includes modes to run inference evals to see how much faster the U-Net is, as well as end-to-end evals to see how much faster the overall policy (including the CPU simulation environment) is. You can install the library and play with it using 'pip install diffusion-policy-accelerated'. The CLI command 'diffusion-policy-accelerated --mode inference-eval --evals 2000' will run 2000 forward passes through the U-Net with both Pytorch eager mode and our custom kernels, and the command 'diffusion-policy-accelerated --mode policy-eval --evals 5' will do the same comparison over 5 episodes of end-to-end policy evaluation.
Looking at the Pytorch profiles of a single forward pass through the U-Net in each of the two modes, we find a few noteworthy things.
Notice how disparate and spread out the kernel launches are in eager mode, whereas with CUDA graphs we can eliminate essentially all CPU overhead and run kernels back to back. In hindsight, I probably should have captured the forward pass through the vision encoder (Resnet-50) in the graph as well. The few short kernel launches you see preceding the CUDA graph are copying the output of the vision encoder into the memory space allocated for the U-Net; we could likely have fused these into the larger graph too.
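For anyone who hasn't used the capture API before: the mechanism behind this is PyTorch's torch.cuda.CUDAGraph / torch.cuda.graph stream capture. The sketch below shows its overall shape; the tiny stand-in network, tensor shapes, and call signature are placeholders rather than the repo's actual code.

```python
import torch
import torch.nn as nn

# Placeholder network standing in for the U-Net (the real one also takes timestep/conditioning).
unet = nn.Sequential(
    nn.Conv1d(2, 64, 5, padding=2), nn.Mish(), nn.Conv1d(64, 2, 5, padding=2)
).cuda().eval()
static_sample = torch.zeros(1, 2, 16, device="cuda")  # placeholder shape

# Warm up on a side stream so one-time setup (autotuning, lazy init) isn't baked into the graph.
s = torch.cuda.Stream()
s.wait_stream(torch.cuda.current_stream())
with torch.cuda.stream(s), torch.no_grad():
    for _ in range(3):
        unet(static_sample)
torch.cuda.current_stream().wait_stream(s)

# Capture one forward pass into a CUDA graph.
g = torch.cuda.CUDAGraph()
with torch.cuda.graph(g), torch.no_grad():
    static_out = unet(static_sample)

# At inference time: copy fresh inputs into the captured tensor (analogous to the short
# copy kernels visible right before the graph in the trace) and replay the whole graph.
def unet_forward(sample):
    static_sample.copy_(sample)
    g.replay()
    return static_out.clone()

print(unet_forward(torch.randn(1, 2, 16, device="cuda")).shape)
```

The key constraint is that a captured graph replays against fixed memory addresses, which is exactly why the copy into the U-Net's pre-allocated input memory shows up right before the graph replay in the trace.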
Additionally, we find that the three kernels associated with the 1D Convolution in Pytorch cumulatively take ~120 µs for the most common launch configuration. The same launch configuration with our custom kernels takes just 26 µs! The accelerated 1D Conv and the CUDA graph account for the vast majority of the drop in inference latency.
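If you want to poke at these numbers yourself, torch.profiler gives the same kind of per-kernel breakdown. A minimal sketch, with a stand-in conv layer rather than the actual U-Net, looks like this:

```python
import torch
import torch.nn as nn
from torch.profiler import profile, ProfilerActivity

# Stand-in module matching the common launch config; swap in the real U-Net and inputs.
model = nn.Conv1d(1024, 1024, 5, padding=2).cuda().eval()
x = torch.randn(1, 1024, 4, device="cuda")

with torch.no_grad(), profile(activities=[ProfilerActivity.CPU, ProfilerActivity.CUDA]) as prof:
    for _ in range(10):
        model(x)
    torch.cuda.synchronize()

# Per-kernel CUDA times; sorting by total CUDA time surfaces the dominant kernels
# and their launch configurations.
print(prof.key_averages().table(sort_by="cuda_time_total", row_limit=20))
# prof.export_chrome_trace("trace.json")  # open in chrome://tracing or Perfetto for a timeline
```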
Key Takeaways
- The most common and compute-intensive kernel in our U-Net involves a 1D conv with an input length of 4, 1024 input channels, and 1024 output channels. Using our FLOPs math from Part 5, we find that this kernel performs ~21M FP32 multiplies and ~20M FP32 adds, and loads ~45M bytes of FP32 data from DRAM. Referring to our energy use breakdown from Part 1, we find that this kernel spends ~99.7% of its energy moving bytes from DRAM onto the chip, and just ~0.3% on actually computing things (a rough sketch of this arithmetic is included after these takeaways). It's pretty insane how much energy a GPU spends moving bytes around compared to actually computing. I am very curious how dedicated AI inference hardware, designed with an emphasis on minimizing memory-movement latency and energy, will shape up in the coming years. As ML models provide more utility to end-users, the ratio of inference to training compute will shift drastically, and dedicated inference hardware will make increasingly more sense to spend chip-design resources on. It's not impossible to imagine entire chips taped out around a specific model, given how much optimization could be opened up by that sort of specialization.
- It's really important to deeply understand your workload before trying to optimize it! When I profiled my Pytorch program in Part 3, I did identify that the 1D Convolution operations were taking most of the program's duration, but I did not spend much time quantifying the light-speed run-time for the 1D Conv operation before setting out to optimize it. There is an alternate world where the 1D Convolution is already optimal in Pytorch, and I spend my time trying to optimize something that can't get any faster. Just because something takes a long time doesn't mean it's worth your time to optimize it! Take time to identify not just what the slowest part of your program is, but also how far it is from its optimal run-time.
- This is true of all engineering projects, but especially of performance-optimization work: keep a rigorous work-log of the changes you made, why you made them, and the associated correctness checks and trace profiles. It's way too easy to iterate sloppily, find yourself facing some weird race condition or floating-point rounding error, and wonder which of your changes caused it.
- Nvidia's compute-sanitizer is an awesome tool! It comes bundled with the CUDA toolkit, and you can run it against any kernel or Python program to sniff out exactly which lines are causing race conditions or illegal memory accesses. You do need to compile the relevant kernels with the '-lineinfo' flag to get a line-by-line breakdown in compute-sanitizer. Half the time I use compute-sanitizer it crashes my computer, so try to kill all non-essential processes when using it.
- I found it helpful to first implement kernels in a Jupyter notebook (with block-size = 1 and "virtual" thread registers and shared memory) to ensure correctness and avoid incorrect addressing before moving things to CUDA. Python's interpreted nature makes the iteration loop much tighter (a toy sketch of this pattern is included after these takeaways). You can see some examples of this for my 1D conv kernel here.
- Never doubt your ability to pick up a new skill or break through an abstraction layer. GPT-4 makes it easier than ever to learn anything you want to get good at. Just make sure it's something you are excited about so you don't quit prematurely, and keep at it until you make progress 🙂 Also, feel free to reach out to me if you think I can help (whether it's about GPUs or something else)! Thanks for reading if you've made it this far.
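As promised in the first takeaway, here is a rough sketch of the FLOPs / DRAM-traffic / energy / light-speed arithmetic in Python. The kernel size, per-operation energies, and DRAM bandwidth below are my own ballpark assumptions rather than the exact numbers from Parts 1 and 5, so treat the outputs as order-of-magnitude estimates.

```python
# Rough arithmetic for the most common conv launch: Conv1d with input length 4,
# 1024 input channels, 1024 output channels. Kernel size 5 is an assumption;
# it's what makes the multiply count land at ~21M.
L, C_IN, C_OUT, K = 4, 1024, 1024, 5

mults = L * C_OUT * C_IN * K            # ~21.0M FP32 multiplies
adds  = L * C_OUT * (C_IN * K - 1)      # ~21.0M FP32 adds (rounded to ~20M above)

# DRAM traffic: the weights alone are C_OUT * C_IN * K * 4 bytes (~21 MB); the
# ~45M-byte figure quoted above also covers activation/intermediate traffic,
# so we just take it as given here.
dram_bytes = 45e6

# Ballpark per-op energies (order-of-magnitude figures, not the values from Part 1).
E_MULT, E_ADD   = 3.7e-12, 0.9e-12      # joules per FP32 multiply / add
E_DRAM_PER_BYTE = 160e-12               # joules per byte fetched from DRAM

e_compute = mults * E_MULT + adds * E_ADD
e_memory  = dram_bytes * E_DRAM_PER_BYTE
# Lands around ~99% memory; the exact split depends on the per-op energies you assume.
print(f"memory share of energy: {e_memory / (e_memory + e_compute):.1%}")

# "Light-speed" lower bound: a DRAM-bandwidth-bound kernel can't run faster than
# bytes_moved / bandwidth. Plug in your GPU's measured bandwidth, and note that a
# better implementation can also simply move fewer bytes than Pytorch's kernels do.
DRAM_BW = 900e9                          # bytes/s, placeholder
print(f"light-speed run-time: {dram_bytes / DRAM_BW * 1e6:.0f} us")
```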
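And here is a toy illustration of the notebook-first approach from the takeaways: a 1D conv "kernel" written as plain Python loops over virtual blocks and threads, with an array standing in for shared memory. This is not the actual notebook code linked above, just a minimal sketch of the pattern with made-up (small) sizes.

```python
import numpy as np

def conv1d_kernel_emulated(x, w, bias, out, grid_dim, block_dim):
    """Emulate a CUDA-style 1D conv kernel in pure Python: the two outer loops
    stand in for blockIdx.x / threadIdx.x, `smem` for __shared__ memory, and
    `acc` for a per-thread register."""
    C_out, C_in, K = w.shape
    L = x.shape[1]
    pad = K // 2
    row = L + 2 * pad
    for block_idx in range(grid_dim):            # one "block" per output channel
        # Virtual shared-memory tile: the padded input, flattened the way the real
        # kernel would lay it out (real threads would load this cooperatively).
        smem = np.pad(x, ((0, 0), (pad, pad))).reshape(-1)
        for thread_idx in range(block_dim):      # one "thread" per output position
            acc = bias[block_idx]                # virtual register
            for c in range(C_in):
                for k in range(K):
                    acc += w[block_idx, c, k] * smem[c * row + thread_idx + k]
            out[block_idx, thread_idx] = acc

# Small sizes so the pure-Python loops stay fast; bump them up once the indexing is right.
C_in, C_out, L, K = 32, 16, 4, 5
x = np.random.randn(C_in, L).astype(np.float32)
w = np.random.randn(C_out, C_in, K).astype(np.float32)
b = np.random.randn(C_out).astype(np.float32)
out = np.zeros((C_out, L), dtype=np.float32)
conv1d_kernel_emulated(x, w, b, out, grid_dim=C_out, block_dim=L)

# Cross-check the kernel-style indexing against straightforward numpy slicing.
x_pad = np.pad(x, ((0, 0), (K // 2, K // 2)))
ref = np.array([[b[o] + np.sum(w[o] * x_pad[:, t:t + K]) for t in range(L)] for o in range(C_out)])
assert np.allclose(out, ref, atol=1e-4)
```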