To recap, over the last 8 parts of this series we:
- Developed an optimized CUDA kernel for 1D Convolution
- Developed a fused CUDA kernel for Group Normalization + Mish
- Developed a CUDA kernel for the denoising step of our Diffusion model
- Fused the whole U-Net into a CUDA graph to eliminate CPU/Pytorch overhead
The final result is a ~3.4x reduction in U-Net inference time over Pytorch eager mode (the original paper implementation), and a ~2.65x reduction over Pytorch compile mode (the fastest performance possible with native Pytorch)! The cleaned-up end product can be found here. The library implements the optimizations we developed in this blog post series and includes modes to run inference evals to see how much faster the U-Net is, as well as end-to-end evals to see how much faster the overall policy (including the CPU simulation environment) is. You can install the library and play with it using 'pip install diffusion-policy-accelerated'. The CLI command 'diffusion-policy-accelerated --mode inference-eval --evals 2000' will run 2000 forward passes through the U-Net with both Pytorch eager mode and our custom kernels, and the command 'diffusion-policy-accelerated --mode policy-eval --evals 5' will do the same comparison over 5 episodes of end-to-end policy evaluation.
Looking at the Pytorch profiles of a single forward pass through the U-Net in each of the two modes, we find a few noteworthy things.
Notice how disparate and spread out the kernel launches are in eager mode, whereas with CUDA graphs we can eliminate essentially all CPU overhead and run kernels back to back. In hindsight, I probably should have captured the forward pass through the vision encoder (Resnet-50) in the graph as well. The few short kernel launches you see preceding the CUDA graph are copying the output of the vision encoder into the memory space allocated for the U-Net; we could likely have fused these into the larger graph too.
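For anyone who hasn't used the capture API before: the mechanism behind this is PyTorch's torch.cuda.CUDAGraph / torch.cuda.graph stream capture. The sketch below shows its overall shape; the tiny stand-in network, tensor shapes, and call signature are placeholders rather than the repo's actual code.

```python
import torch
import torch.nn as nn

# Placeholder network standing in for the U-Net (the real one also takes timestep/conditioning).
unet = nn.Sequential(
    nn.Conv1d(2, 64, 5, padding=2), nn.Mish(), nn.Conv1d(64, 2, 5, padding=2)
).cuda().eval()
static_sample = torch.zeros(1, 2, 16, device="cuda")  # placeholder shape

# Warm up on a side stream so one-time setup (autotuning, lazy init) isn't baked into the graph.
s = torch.cuda.Stream()
s.wait_stream(torch.cuda.current_stream())
with torch.cuda.stream(s), torch.no_grad():
    for _ in range(3):
        unet(static_sample)
torch.cuda.current_stream().wait_stream(s)

# Capture one forward pass into a CUDA graph.
g = torch.cuda.CUDAGraph()
with torch.cuda.graph(g), torch.no_grad():
    static_out = unet(static_sample)

# At inference time: copy fresh inputs into the captured tensor (analogous to the short
# copy kernels visible right before the graph in the trace) and replay the whole graph.
def unet_forward(sample):
    static_sample.copy_(sample)
    g.replay()
    return static_out.clone()

print(unet_forward(torch.randn(1, 2, 16, device="cuda")).shape)
```

The key constraint is that a captured graph replays against fixed memory addresses, which is exactly why the copy into the U-Net's pre-allocated input memory shows up right before the graph replay in the trace.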
Additionally, we find that the three kernels associated with the 1D Convolution in Pytorch cumulatively take ~120 µs for the most common launch configuration. The same launch configuration with our custom kernels takes just 26 µs! The accelerated 1D Conv and the CUDA graph account for the vast majority of the drop in inference latency.
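If you want to poke at these numbers yourself, torch.profiler gives the same kind of per-kernel breakdown. A minimal sketch, with a stand-in conv layer rather than the actual U-Net, looks like this:

```python
import torch
import torch.nn as nn
from torch.profiler import profile, ProfilerActivity

# Stand-in module matching the common launch config; swap in the real U-Net and inputs.
model = nn.Conv1d(1024, 1024, 5, padding=2).cuda().eval()
x = torch.randn(1, 1024, 4, device="cuda")

with torch.no_grad(), profile(activities=[ProfilerActivity.CPU, ProfilerActivity.CUDA]) as prof:
    for _ in range(10):
        model(x)
    torch.cuda.synchronize()

# Per-kernel CUDA times; sorting by total CUDA time surfaces the dominant kernels
# and their launch configurations.
print(prof.key_averages().table(sort_by="cuda_time_total", row_limit=20))
# prof.export_chrome_trace("trace.json")  # open in chrome://tracing or Perfetto for a timeline
```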
Key Takeaways
- The most common and compute-intensive kernel in our U-Net involves a 1D conv with an input length of 4, 1024 input channels, and 1024 output channels. Using our FLOPs math from Part 5, we find that this kernel performs ~21M FP32 multiplies and ~20M FP32 adds, and loads ~45M bytes of FP32 data from DRAM. Referring to our energy use breakdown from Part 1, we find that this kernel spends ~99.7% of its energy moving bytes from DRAM onto the chip, and just ~0.3% on actually computing things (a rough sketch of this arithmetic is included after these takeaways). It's pretty insane how much energy a GPU spends moving bytes around compared to actually computing. I am very curious how dedicated AI inference hardware, designed with an emphasis on minimizing memory-movement latency and energy, will shape up in the coming years. As ML models provide more utility to end-users, the ratio of inference to training compute will shift drastically, and dedicated inference hardware will make increasingly more sense to spend chip-design resources on. It's not impossible to imagine entire chips taped out around a specific model, given how much optimization could be opened up by that sort of specialization.
- It's really important to deeply understand your workload before trying to optimize it! When I profiled my Pytorch program in Part 3, I did identify that the 1D Convolution operations were taking most of the program's duration, but I did not spend much time quantifying the light-speed run-time for the 1D Conv operation before setting out to optimize it. There is an alternate world where the 1D Convolution is already optimal in Pytorch, and I spend my time trying to optimize something that can't get any faster. Just because something takes a long time doesn't mean it's worth your time to optimize it! Take time to identify not just what the slowest part of your program is, but also how far it is from its optimal run-time.
- This is true of all engineering projects, but especially of performance-optimization work: keep a rigorous work-log of the changes you made, why you made them, and the associated correctness checks and trace profiles. It's way too easy to iterate sloppily, find yourself facing some weird race condition or floating-point rounding error, and wonder which of your changes caused it.
- Nvidia's compute-sanitizer is an awesome tool! It comes bundled with the CUDA toolkit, and you can run it against any kernel or Python program to sniff out exactly which lines are causing race conditions or illegal memory accesses. You do need to compile the relevant kernels with the '-lineinfo' flag to get a line-by-line breakdown in compute-sanitizer. Half the time I use compute-sanitizer it crashes my computer, so try to kill all non-essential processes when using it.
- I found it helpful to first implement kernels in a Jupyter notebook (with block-size = 1 and "virtual" thread registers and shared memory) to ensure correctness and avoid incorrect addressing before moving things to CUDA. Python's interpreted nature makes the iteration loop much tighter (a toy sketch of this pattern is included after these takeaways). You can see some examples of this for my 1D conv kernel here.
- Never doubt your ability to pick up a new skill or break through an abstraction layer. GPT-4 makes it easier than ever to learn anything you want to get good at. Just make sure it's something you are excited about so you don't quit prematurely, and keep at it until you make progress 🙂 Also, feel free to reach out to me if you think I can help (whether it's about GPUs or something else)! Thanks for reading if you've made it this far.
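As promised in the first takeaway, here is a rough sketch of the FLOPs / DRAM-traffic / energy / light-speed arithmetic in Python. The kernel size, per-operation energies, and DRAM bandwidth below are my own ballpark assumptions rather than the exact numbers from Parts 1 and 5, so treat the outputs as order-of-magnitude estimates.

```python
# Rough arithmetic for the most common conv launch: Conv1d with input length 4,
# 1024 input channels, 1024 output channels. Kernel size 5 is an assumption;
# it's what makes the multiply count land at ~21M.
L, C_IN, C_OUT, K = 4, 1024, 1024, 5

mults = L * C_OUT * C_IN * K            # ~21.0M FP32 multiplies
adds  = L * C_OUT * (C_IN * K - 1)      # ~21.0M FP32 adds (rounded to ~20M above)

# DRAM traffic: the weights alone are C_OUT * C_IN * K * 4 bytes (~21 MB); the
# ~45M-byte figure quoted above also covers activation/intermediate traffic,
# so we just take it as given here.
dram_bytes = 45e6

# Ballpark per-op energies (order-of-magnitude figures, not the values from Part 1).
E_MULT, E_ADD   = 3.7e-12, 0.9e-12      # joules per FP32 multiply / add
E_DRAM_PER_BYTE = 160e-12               # joules per byte fetched from DRAM

e_compute = mults * E_MULT + adds * E_ADD
e_memory  = dram_bytes * E_DRAM_PER_BYTE
# Lands around ~99% memory; the exact split depends on the per-op energies you assume.
print(f"memory share of energy: {e_memory / (e_memory + e_compute):.1%}")

# "Light-speed" lower bound: a DRAM-bandwidth-bound kernel can't run faster than
# bytes_moved / bandwidth. Plug in your GPU's measured bandwidth, and note that a
# better implementation can also simply move fewer bytes than Pytorch's kernels do.
DRAM_BW = 900e9                          # bytes/s, placeholder
print(f"light-speed run-time: {dram_bytes / DRAM_BW * 1e6:.0f} us")
```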
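And here is a toy illustration of the notebook-first approach from the takeaways: a 1D conv "kernel" written as plain Python loops over virtual blocks and threads, with an array standing in for shared memory. This is not the actual notebook code linked above, just a minimal sketch of the pattern with made-up (small) sizes.

```python
import numpy as np

def conv1d_kernel_emulated(x, w, bias, out, grid_dim, block_dim):
    """Emulate a CUDA-style 1D conv kernel in pure Python: the two outer loops
    stand in for blockIdx.x / threadIdx.x, `smem` for __shared__ memory, and
    `acc` for a per-thread register."""
    C_out, C_in, K = w.shape
    L = x.shape[1]
    pad = K // 2
    row = L + 2 * pad
    for block_idx in range(grid_dim):            # one "block" per output channel
        # Virtual shared-memory tile: the padded input, flattened the way the real
        # kernel would lay it out (real threads would load this cooperatively).
        smem = np.pad(x, ((0, 0), (pad, pad))).reshape(-1)
        for thread_idx in range(block_dim):      # one "thread" per output position
            acc = bias[block_idx]                # virtual register
            for c in range(C_in):
                for k in range(K):
                    acc += w[block_idx, c, k] * smem[c * row + thread_idx + k]
            out[block_idx, thread_idx] = acc

# Small sizes so the pure-Python loops stay fast; bump them up once the indexing is right.
C_in, C_out, L, K = 32, 16, 4, 5
x = np.random.randn(C_in, L).astype(np.float32)
w = np.random.randn(C_out, C_in, K).astype(np.float32)
b = np.random.randn(C_out).astype(np.float32)
out = np.zeros((C_out, L), dtype=np.float32)
conv1d_kernel_emulated(x, w, b, out, grid_dim=C_out, block_dim=L)

# Cross-check the kernel-style indexing against straightforward numpy slicing.
x_pad = np.pad(x, ((0, 0), (K // 2, K // 2)))
ref = np.array([[b[o] + np.sum(w[o] * x_pad[:, t:t + K]) for t in range(L)] for o in range(C_out)])
assert np.allclose(out, ref, atol=1e-4)
```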