CUDA Kernel Study

  Vector Addition
  Vectorized Copy
  Matrix Transpose
  Warp Shuffle Intrinsics
  LDG
  Shared Memory
  Parallel Reduction
  Asynchronous Copy
  SGEMM

References

  CUDA C Programming Guide
  Inline PTX Assembly