This multi-post series covers what I learned while trying to optimize the inference latency of the recent Diffusion Policy paper from Toyota Research Institute. I dive into the intricacies of GPU architecture and apply those lessons to speed up the U-Net from the TRI paper. Posts that have accompanying code can be found in the GitHub repo for this series here.