This multi-post series covers what I learned while trying to optimize the inference latency of the recent Diffusion Policy paper from Toyota Research Institute. I dive into the intricacies of GPU architecture and apply those lessons to speed up the U-Net from the TRI paper. Posts that have accompanying code can be found in the GitHub repo for this series here.