Home | Core Velocity Lab

Nanobenchmarking: cycle accurate benchmarking of CUDA kernels

#CUDA #GPU #GPU Programming #Microbenchmarking

This post focuses on accurately measuring the number of cycles needed to execute a particular CUDA device code snippet. We will use the clock() function for the measurement and focus on adjusting the compiled device code using an …

FlashAttention-2 in Vulkan with Tensor Cores support

#Ai #Attention #Deep Learning #FlashAttention #FlashAttention-2 #GLSL #GPU Programming #Machine-Learning #Scaled Dot Product Attention #SDPA #Vulkan

UPDATE 06. 06. 2026. Note that this implementation does not support Multi-Query Attention (MQA). Since this code was published, a good implementation that also supports MQA has become available in the Vulkan backend of llama.cpp. I renamed the article …

Gradient of the attention op

#Ai #Attention #Automatic Differentiation #Deep Learning #Gradients #Machine-Learning #Math #Mathematics #Numpy #Pytorch

In this post, the gradient of the attention op is derived from a single rule used to implement reverse-mode automatic differentiation. The attention mechanism is the foundational building block of the transformer architecture that is the …