
    Kernel Case Study: Flash Attention

    By FinanceStarGate | April 3, 2025 | 16 min read


    The attention mechanism is at the core of modern transformers. But scaling the context window of these transformers was a major challenge, and it still is, even though we are in the era of 1,000,000+ token context windows (Qwen 2.5 [1]). There are considerable compute-bound and memory-bound complexities in these models when we scale the context window (a naive attention mechanism scales quadratically in both compute and memory requirements). Revisiting Flash Attention lets us understand the complexities of optimizing the underlying operations on GPUs and, more importantly, gives us a better grip on thinking about what's next.

    Let's quickly revisit a naive attention algorithm to see what's going on.

    Attention Algorithm. Image by Author

    As you can see, if we are not careful we end up materializing a full NxM attention matrix in the GPU HBM. This means the memory requirement grows quadratically with increasing context length.
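    To make the memory pattern concrete, here is a minimal pure-Python sketch of the naive algorithm on nested lists. It is an illustration only, not the kernel discussed later: the name `naive_attention` is made up for this example, and the usual 1/sqrt(d) scaling and any masking are left out for brevity. The point to notice is that both `S` and `P` are full N x M matrices.

```python
import math

def naive_attention(Q, K, V):
    """Naive attention on nested lists: Q is N x D, K and V are M x D."""
    d_v = len(V[0])
    # Step 1: full N x M score matrix S = Q K^T (this is what would land in HBM)
    S = [[sum(q * k for q, k in zip(q_row, k_row)) for k_row in K] for q_row in Q]
    # Step 2: row-wise softmax, producing a second full N x M matrix P
    P = []
    for row in S:
        m = max(row)
        exps = [math.exp(s - m) for s in row]
        total = sum(exps)
        P.append([e / total for e in exps])
    # Step 3: O = P V, an N x D output
    O = [[sum(p * v_row[c] for p, v_row in zip(p_row, V)) for c in range(d_v)]
         for p_row in P]
    return O, P

Q = [[1.0, 0.0], [0.0, 1.0]]
K = [[1.0, 0.0], [0.0, 1.0]]
V = [[1.0, 2.0], [3.0, 4.0]]
O, P = naive_attention(Q, K, V)
print(len(P), "x", len(P[0]), "attention matrix materialized")
```

    Doubling the number of rows in Q and K quadruples the number of entries in `S` and `P`, which is the quadratic growth described above.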

    If you want to learn more about the GPU memory hierarchy and its differences, my previous post on Triton is a good place to start. It will also be helpful as we go along in this post, when we get to implementing the Flash Attention kernel in Triton. The Flash Attention paper also has a really good introduction to this.

    Additionally, when we look at the steps involved in executing this algorithm and its pattern of accessing the slow HBM (which, as explained later in the post, can be a major bottleneck as well), we notice a few things:

    1. We have Q, K and V in the HBM initially
    2. We need to access Q and K from the HBM to compute the dot product
    3. We write the output scores back to the HBM
    4. We access them again to execute the softmax, and optionally for causal attention, as in the case of LLMs, we have to mask this output before the softmax. The resulting full attention matrix is written again into the HBM
    5. We access the HBM again to execute the final dot product, reading both the attention weights and the Value matrix, and write the output back to the slow GPU memory

    I think you get the point. We could read from and write to the HBM more smartly, avoiding redundant operations, to make some potential gains. This is exactly the primary motivation for the original Flash Attention algorithm.

    Flash Attention initially came out in 2022 [2], then came out a year later, in 2023, with some much needed improvements as Flash Attention v2 [3], and again in 2024 with more improvements for Nvidia Hopper and Blackwell GPUs [4] as Flash Attention v3 [5]. The original Flash Attention paper identified that the attention operation is still limited by memory bandwidth rather than compute. (In the past, there have been attempts to reduce the computational complexity of attention from O(N**2) to O(NlogN) and lower via approximate algorithms.)

    Flash Attention proposed a fused kernel which does all of the above attention operations in one go, block-wise, to get the final attention output without ever having to materialize the full N**2 attention matrix in memory, making the algorithm significantly faster. The term `fused` simply means we combine multiple operations in the GPU SRAM before making the much slower trip to the GPU HBM, which makes the algorithm performant. All the while, it provides the exact attention output without any approximations.

    This lecture, from Stanford CS139, demonstrates brilliantly how we can think about the impact a well thought out memory access pattern can have on an algorithm. I highly recommend you check this one out if you haven't already.

    Before we start diving into Flash Attention (shall we call it FA?) in Triton, there is something else that I wanted to get out of the way.

    Numerical Stability in exponents

    Let's take the example of FP32 numbers. float32 (standard 32-bit float) uses 1 sign bit, 8 exponent bits, and 23 mantissa bits [6]. The largest power of two representable in float32 is 2^127 ≈ 1.7×10^38 (the largest finite value is just under 2^128 ≈ 3.4×10^38). This means that when we look at exponentials, e^88 ≈ 1.65×10^38, so with anything close to 88 (although in reality it would be much lower to keep it safe) we are in trouble, as we could easily overflow. Here's a very interesting chat with OpenAI o1 shared by folks at AllenAI in their OpenInstruct repo. Although it talks about stabilizing KL divergence calculations in the setting of RLHF/RL, the ideas translate exactly to exponentials as well. So to deal with the softmax situation in attention, what we do is the following:

    Softmax with rescaling. Image by Author

    TRICK: Let's also note the following. If you do this:

    Rescaling Trick. Image by Author

    then you can rescale/readjust values without affecting the final softmax value. This is really useful when you have an initial estimate for the maximum value, but that estimate might change when we encounter a new set of values. I know, I know, stay with me and let me explain.
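    One way to write the trick down is the identity below. The symbols m_old (the current estimate of the maximum) and m_new (the updated one) are notation assumed here for illustration. It follows directly from e^{x - m_new} = e^{x - m_old} · e^{m_old - m_new}:

```latex
\sum_j e^{x_j - m_{\mathrm{new}}}
  = e^{\,m_{\mathrm{old}} - m_{\mathrm{new}}} \sum_j e^{x_j - m_{\mathrm{old}}}
```

    So a sum that was accumulated against an old maximum can be moved onto a new maximum with a single multiplication, without revisiting the individual x_j.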

    Setting the scene

    Let’s take a small detour into matrix multiplication.

    Blocked Matrix Multiplication. Image by Author

    This shows a toy example of a blocked matrix multiplication, except we have blocks only on the rows of A (green) and columns of B (Orange? Beige?). As you can see above, the outputs O1, O2, O3 and O4 are complete (those positions need no more calculations). We just need to fill in the remaining columns in the initial rows by using the remaining columns of B. Like below:

    The next set of blocks fills the remaining spaces up. Image by Author

    So we can fill these places in the output with a block of columns from B and a block of rows from A at a time.
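    The same idea as a short pure-Python sketch: the output is filled one tile at a time, where each tile comes from a block of rows of A and a block of columns of B. The function name and the block-size parameters are hypothetical, chosen only for this illustration.

```python
def blocked_matmul(A, B, block_rows, block_cols):
    """Compute A @ B one (row block of A, column block of B) tile at a time."""
    n, k, m = len(A), len(B), len(B[0])
    O = [[0.0] * m for _ in range(n)]
    for r0 in range(0, n, block_rows):          # block of rows from A
        for c0 in range(0, m, block_cols):      # block of columns from B
            for i in range(r0, min(r0 + block_rows, n)):
                for j in range(c0, min(c0 + block_cols, m)):
                    # Each entry in this tile is complete after one pass over the inner dimension
                    O[i][j] = sum(A[i][t] * B[t][j] for t in range(k))
    return O

A = [[1, 2], [3, 4], [5, 6], [7, 8]]
B = [[1, 0, 2, 1], [0, 1, 1, 3]]
print(blocked_matmul(A, B, block_rows=2, block_cols=2))
```

    Because the blocks span the whole inner dimension, every tile is final as soon as it is computed, which is the property the toy figure relies on.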

    Connecting the dots

    When I introduced FA, I said that we never have to compute the full attention matrix and store the whole thing. So here's what we do:

    1. Compute a block of the attention matrix using a block of rows from Q and a block of columns from K (transposed). Once you get the partial attention matrix, compute a few statistics and keep them in memory.
    Computing block attention scores S_b, and computing the row-wise maximums. Image by Author

    I have greyed out O5 to O12 because we don't know those values yet, as they need to come from the subsequent blocks. We then transform S_b like below:

    Keeping track of the current row-sum and row-maxes. Image by Author
    Exponents with the scaling trick. Image by Author

    Now you have the setup for a partial softmax:

    Partial softmax, as the denominator is still a partial sum. Image by Author
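    In symbols, the per-block quantities described above can be written as follows. This is a sketch: the subscript b and the rowmax/rowsum operators are notation assumed here to match the captions, and the exponential is applied element-wise with m_b broadcast across each row.

```latex
S_b = Q_b K_b^{\top}, \qquad
m_b = \operatorname{rowmax}(S_b), \qquad
P_b = \exp(S_b - m_b), \qquad
l_b = \operatorname{rowsum}(P_b), \qquad
\text{partial softmax} = \operatorname{diag}(l_b)^{-1} P_b
```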

    But:

    1. What if the true maximum is in the Oi's that are yet to come?
    2. The sum is still local, so we need to update it every time we see new Pi's. We know how to keep track of a sum, but what about rebasing it to the true maximum?

    Recall the trick above. All we have to do is keep track of the maximum values we encounter for each row, and iteratively update them as we see new maximums from the remaining blocks of columns from K for the same set of rows from Q.

    Two consecutive blocks and their row-max manipulations. Image by Author
    Updating the estimate of our current sum with rescaling. Image by Author
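    A minimal sketch of this bookkeeping for a single row, in plain Python. The function name `online_softmax_stats` is hypothetical; it takes the scores of one row already split into blocks and keeps only a running maximum `m` and a running sum `l`, rebasing the sum whenever the maximum changes.

```python
import math

def online_softmax_stats(row_blocks):
    """Running max m and running exp-sum l over the score blocks of one row."""
    m = float("-inf")
    l = 0.0
    for block in row_blocks:
        m_block = max(block)                                   # local row max
        l_block = sum(math.exp(s - m_block) for s in block)    # local exp-sum
        m_new = max(m, m_block)
        # Rebase both partial sums onto the new maximum before adding them
        l = math.exp(m - m_new) * l + math.exp(m_block - m_new) * l_block
        m = m_new
    return m, l

blocks = [[1.0, 3.0], [2.0, 5.0], [0.5, 4.0]]
m, l = online_softmax_stats(blocks)
print(m, l)
```

    After the last block, `m` and `l` equal the maximum and the softmax denominator you would get from seeing the whole row at once.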

    We still don't want to write our partial softmax matrix into HBM. We keep it for the next step.

    The final dot product

    The last step in our attention computation is our dot product with V. To begin, we would have initialized a matrix full of 0's in our HBM as our output of shape NxD, where N is the number of queries as above. We use the same block size for V as we had for K, except we apply it row-wise like below (the subscripts just denote that this is only a block and not the full matrix):

    A single block of attention scores creating a partial output. Image by Author
    The full output requires the sum of all these dot products, some of which will be filled in by the blocks to come. Image by Author

    Notice how we need the attention scores from all the blocks to get the final product. But if we calculate the local score and `accumulate` it, like we did to get the actual Ls, we can form the full output at the end of processing all the blocks of columns (Kb) for a given row block (Qb).
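    Written out, and assuming the same block notation as before (S_b, m_b, l_b, with m and l the row-wise global maximum and sum), the full output is a sum of per-block products that share one normalizer:

```latex
O = \operatorname{diag}(l)^{-1} \sum_{b} \exp(S_b - m)\, V_b,
\qquad m = \max_b m_b,
\qquad l = \sum_b e^{\,m_b - m}\, l_b
```

    Each term of the sum only needs one block of scores and one block of V, which is why the output can be accumulated block by block.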

    Putting it all together

    Let's put all these ideas together to form the final algorithm:

    Flash Attention V1 Algorithm. Source: Tri Dao et al. [2]

    To understand the notation, _ij means the local values for a given block of columns and rows, and _i means the values for the global output rows and Query blocks. The only part we haven't explained so far is the final update to Oi. That's where we use all the ideas from above to get the right scaling.
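    For reference, the update step can be written as below. This follows the update in Algorithm 1 of the Flash Attention paper [2] as I read it; tildes mark the local (_ij) quantities and the exponentials act element-wise on the per-row vectors.

```latex
\begin{aligned}
m_i^{\mathrm{new}} &= \max(m_i, \tilde{m}_{ij}), \\
l_i^{\mathrm{new}} &= e^{\,m_i - m_i^{\mathrm{new}}}\, l_i + e^{\,\tilde{m}_{ij} - m_i^{\mathrm{new}}}\, \tilde{l}_{ij}, \\
O_i &\leftarrow \operatorname{diag}(l_i^{\mathrm{new}})^{-1}
      \left( \operatorname{diag}(l_i)\, e^{\,m_i - m_i^{\mathrm{new}}}\, O_i
           + e^{\,\tilde{m}_{ij} - m_i^{\mathrm{new}}}\, \tilde{P}_{ij} V_j \right)
\end{aligned}
```

    The first term undoes the old normalization of O_i and rebases it onto the new maximum; the second adds the contribution of the current block; the leading factor renormalizes with the updated sum.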

    The entire code is available as a gist here.

    Let's see what these initializations look like in torch:

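    def flash_attn_v1(Q, K, V, Br, Bc):
      """Flash Attention V1"""
      # Relies on the imports (torch, numpy as np) and on flash_attn_v1_kernel from the kernel listing
      B, N, D = Q.shape
      M = K.shape[1]
      Nr = int(np.ceil(N/Br))  # number of row blocks of Q
      Nc = int(np.ceil(M/Bc))  # number of column blocks of K and V (K has M rows)
      
      Q = Q.to('cuda')
      K = K.to('cuda')
      V = V.to('cuda')
      
      batch_stride = Q.stride(0)
      
      # Output plus the running statistics: one exp-sum (l) and one max (m) per query row,
      # stored per row block. The maxes start at -inf, the sums at 0.
      O = torch.zeros_like(Q).to('cuda')
      lis = torch.zeros((B, Nr, int(Br)), dtype=torch.float32).to('cuda')
      mis = torch.ones((B, Nr, int(Br)), dtype=torch.float32).to('cuda')*-torch.inf
      
      # One program per batch element
      grid = (B, )
      flash_attn_v1_kernel[grid](
          Q, K, V,
          N, M, D,
          Br, Bc,
          Nr, Nc,
          batch_stride,
          Q.stride(1),
          K.stride(1),
          V.stride(1),
          lis, mis,
          O,
          O.stride(1),
      )
      return O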

    If you are unsure about the launch grid, check out my introduction to Triton.

    Take a closer look at how we initialized our Ls and Ms. We are keeping one for each row block of Output/Query, each of size Br. There are Nr such blocks in total.

    In the example above I was simply using Br = 2 and Bc = 2. But in the above code the initialization is based on the device capacity. I have included the calculation for a T4 GPU. For any other GPU, we need to get the SRAM capacity and adjust these numbers accordingly. Now for the actual kernel implementation:

    # Flash Attention V1
    import triton
    import triton.language as tl
    import torch
    import numpy as np
    
    @triton.jit
    def flash_attn_v1_kernel(
        Q, K, V,
        N: tl.constexpr, M: tl.constexpr, D: tl.constexpr,
        Br: tl.constexpr,
        Bc: tl.constexpr,
        Nr: tl.constexpr,
        Nc: tl.constexpr,
        batch_stride: tl.constexpr,
        q_rstride: tl.constexpr,
        k_rstride: tl.constexpr, 
        v_rstride: tl.constexpr,
        lis, mis,
        O,
        o_rstride: tl.constexpr):
        
        """Flash Attention V1 kernel"""
        
        # One program per batch element
        pid = tl.program_id(0)
        
    
        # Outer loop over the Nc blocks of K and V
        for j in range(Nc):
            # Element offsets of the (Bc x D) block j of K inside batch element pid
            k_offset = ((tl.arange(0, Bc) + j*Bc) * k_rstride)[:, None] + (tl.arange(0, D))[None, :] + pid * M * D
            # Using k_rstride and v_rstride as we are looking at the entire row at once, for each k v block 
            v_offset = ((tl.arange(0, Bc) + j*Bc) * v_rstride)[:, None] + (tl.arange(0, D))[None, :] + pid * M * D
            k_mask = k_offset  # the listing breaks off here, mid-expression; the full kernel is in the gist linked above

    Let's understand what's happening here:

    1. Create 1 kernel instance for each NxD matrix in the batch. In reality we would have one more dimension to parallelize across, the head dimension. But for understanding the implementation I think this would suffice.
    2. In each kernel we do the following:
      1. For each block of columns in K and V we load the relevant part of the matrix (Bc x D) into the GPU SRAM (current total SRAM usage = 2BcD). This stays in the SRAM until we are done with all the row blocks
      2. For each row block of Q, we load the block onto SRAM as well (current total SRAM usage = 2BcD + BrD)
      3. On chip we compute the dot product (sij), the local row-maxes (mij), the exp (pij), and the expsum (lij)
      4. We load the running stats for the ith row block: two vectors of size Br x 1, which hold the current global row-maxes (mi) and the expsum (li). (Current SRAM usage: 2BcD + BrD + 2Br)
      5. We get the new estimates for the global mi and li.
      6. We load the part of the output for this block of Q and update it using the new running stats and the exponent trick; we then write this back into the HBM. (Current SRAM usage: 2BcD + 2BrD + 2Br)
      7. We write the updated running stats into the HBM as well.
    3. For a matrix of any size, aka any context length, we never materialize the full attention matrix at any one time, only a part of it.
    4. We managed to fuse all the ops into a single kernel, reducing HBM access considerably.
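    The steps above can be sketched in plain Python for a single N x D problem. This is a readable reference for the loop order and the update, not the Triton kernel: the function name is hypothetical, lists stand in for HBM tensors, rows inside a block are processed one at a time, and the 1/sqrt(d) scaling and masking are left out.

```python
import math

def flash_attn_v1_reference(Q, K, V, Br, Bc):
    """Pure-Python walk through the V1 loop order for one N x D problem."""
    N, M, D = len(Q), len(K), len(Q[0])
    O = [[0.0] * D for _ in range(N)]     # output, lives in "HBM"
    l = [0.0] * N                         # running exp-sums, one per query row
    m = [float("-inf")] * N               # running row maxes, one per query row
    for c0 in range(0, M, Bc):            # outer loop: blocks of K and V
        Kb, Vb = K[c0:c0 + Bc], V[c0:c0 + Bc]
        for r0 in range(0, N, Br):        # inner loop: blocks of Q
            for i in range(r0, min(r0 + Br, N)):
                s = [sum(q * k for q, k in zip(Q[i], k_row)) for k_row in Kb]
                m_ij = max(s)                              # local row max
                p = [math.exp(x - m_ij) for x in s]        # local exponentials
                l_ij = sum(p)                              # local exp-sum
                m_new = max(m[i], m_ij)
                alpha = math.exp(m[i] - m_new)             # rescales the old statistics
                beta = math.exp(m_ij - m_new)              # rescales this block
                l_new = alpha * l[i] + beta * l_ij
                pv = [sum(p_j * v_row[c] for p_j, v_row in zip(p, Vb))
                      for c in range(D)]
                # Undo the old normalization, add the new block, renormalize
                O[i] = [(l[i] * alpha * o + beta * x) / l_new
                        for o, x in zip(O[i], pv)]
                m[i], l[i] = m_new, l_new                  # write stats back
    return O

Q = [[0.1, 0.2], [0.3, -0.1], [1.0, 0.5]]
K = [[0.5, 0.1], [-0.2, 0.4], [0.3, 0.3], [0.9, -0.7], [0.0, 1.0]]
V = [[1.0, 0.0], [0.0, 1.0], [2.0, 2.0], [-1.0, 3.0], [0.5, 0.5]]
print(flash_attn_v1_reference(Q, K, V, Br=2, Bc=2))
```

    At no point does the sketch hold more than one block of scores (`s`, `p`) at a time, yet the result matches ordinary softmax attention.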

    The final SRAM usage, though, stands at 4BD + 2B, where B was initially calculated as M/4d and M is the SRAM capacity. Not sure if I am missing something here. Please comment if you know why this is the case!
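    The arithmetic in the paragraph above can be checked with a few lines of Python. This is only a sketch: `block_size_and_usage` is a hypothetical helper, and the 64 KB figure is an assumed SRAM budget used for illustration, not a number taken from the code above, so substitute your device's value.

```python
import math

def block_size_and_usage(sram_elements, d):
    """Block size B = ceil(M / 4d) and the footprint 4*B*d + 2*B, both in elements."""
    B = math.ceil(sram_elements / (4 * d))
    usage = 4 * B * d + 2 * B
    return B, usage

# Assumption for illustration only: 64 KB of shared memory holding 4-byte floats
M = 64 * 1024 // 4
B, usage = block_size_and_usage(M, d=64)
print(B, usage, M)
```

    With B = M/4d the four B x d tiles alone account for M elements, so the 2B running statistics are exactly the amount by which the total exceeds M.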

    Block Sparse Attention and V2 and V3

    I'll keep this short, as these versions keep the core idea but figured out better and better ways to do the same.

    For Block Sparse Attention,

    1. Imagine we had masks for each block, like in the case of causal attention. If for a given block the mask is all set to zero, then we can simply skip the entire block without computing anything at all, saving FLOPs. This is where the major gains were seen. To put this into perspective, in the case of BERT pre-training the algorithm gets a 15% boost over the best performing training setup at the time, while for GPT-2 we get 3x over the Hugging Face training implementation and ~2x over a Megatron setup.
    Performance gain for autoregressive models, where we have a sparse mask. Source: Tri Dao et al. [2]

    2. You can literally get the same performance in GPT-2 in a fraction of the time, literally shaving days off the training run, which is awesome!
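    To illustrate the block-skipping idea for the causal case, the sketch below classifies every (row block, column block) tile as skippable, fully visible, or straddling the diagonal. The function name and the string labels are invented for this example; it assumes a square problem where query row r may attend to key columns c <= r.

```python
def causal_block_schedule(N, Br, Bc):
    """Classify each (row block, column block) tile under a causal mask."""
    schedule = {}
    for i, r0 in enumerate(range(0, N, Br)):
        r1 = min(r0 + Br, N) - 1               # last query row in the block
        for j, c0 in enumerate(range(0, N, Bc)):
            c1 = min(c0 + Bc, N) - 1           # last key column in the block
            if c0 > r1:
                schedule[(i, j)] = "skip"      # every key is in the future: fully masked
            elif c1 <= r0:
                schedule[(i, j)] = "full"      # every key is visible: no mask needed
            else:
                schedule[(i, j)] = "partial"   # straddles the diagonal: apply the mask
    return schedule

sched = causal_block_schedule(N=4, Br=2, Bc=2)
print(sched)
```

    For long sequences roughly half of the tiles fall in the "skip" class, which is where the saved FLOPs come from.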

    In V2:

    1. Notice how currently we can only parallelize over the batch and head dimensions. But if you simply flip the loop order to look at all the column blocks for a given row block, then we get the following advantages:
      1. Each row block becomes embarrassingly parallel. You can see this by looking at the illustrations above. You need all the column blocks for a given row block to fully form the attention output. If you were to run all the column blocks in parallel, you would end up with a race condition, with several blocks trying to update the same rows of the output at the same time. But not if you do it the other way around. Although there are atomic add operators in Triton which could help, they may potentially set us back.
      2. We can avoid hitting the HBM to get the global Ms and Ls. We can initialize one on the chip for each kernel.
      3. Also, we do not have to scale all the output update terms with the new estimate of L. We can just compute everything without dividing by L and, at the end of all the column blocks, simply divide the output by the latest estimate of L, saving some FLOPs again!
    2. Much of the improvement also comes in the form of the backward kernel. I am omitting all the backward kernels from this post. But they are a fun exercise to try to implement, although they are significantly more complex.
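    A minimal sketch of the flipped loop order for a single query row, in plain Python. It is meant only to show the two changes described above: the statistics stay in local variables for the whole pass, and the division by L happens once at the end. The function name is hypothetical and scaling and masking are omitted.

```python
import math

def attention_row_v2_style(q, K, V, Bc):
    """One query row, streaming over K/V blocks, normalizing only at the end."""
    D = len(V[0])
    m = float("-inf")      # running max, kept "on chip"
    l = 0.0                # running exp-sum, kept "on chip"
    acc = [0.0] * D        # unnormalized output accumulator
    for c0 in range(0, len(K), Bc):
        Kb, Vb = K[c0:c0 + Bc], V[c0:c0 + Bc]
        s = [sum(a * b for a, b in zip(q, k_row)) for k_row in Kb]
        m_new = max(m, max(s))
        scale = math.exp(m - m_new)            # rebases what was accumulated so far
        p = [math.exp(x - m_new) for x in s]
        l = scale * l + sum(p)
        acc = [scale * a + sum(p_j * v_row[c] for p_j, v_row in zip(p, Vb))
               for c, a in enumerate(acc)]
        m = m_new
    return [a / l for a in acc]                # single division by L at the end

K = [[0.5, 0.1], [-0.2, 0.4], [0.3, 0.3], [0.9, -0.7], [0.0, 1.0]]
V = [[1.0, 0.0], [0.0, 1.0], [2.0, 2.0], [-1.0, 3.0], [0.5, 0.5]]
print(attention_row_v2_style([1.0, 0.5], K, V, Bc=2))
```

    Since each row (or row block) only reads K and V and writes its own output, different rows can be handed to different programs without any write conflicts.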

    Here are some benchmarks:

    Performance benchmark of FA v2 against existing attention algorithms. Source: Tri Dao et al. [3]

    The actual implementations of these kernels need to take into account various nuances that we encounter in the real world. I have tried to keep it simple. But do check them out here.

    More recently in V3:

    1. Newer GPUs, especially the Hopper and Blackwell GPUs, have low precision modes (FP8 in Hopper and FP4 in Blackwell), which can double and quadruple the throughput for the same power and chip area, as well as more specialized GEMM (General Matrix Multiply) kernels, which the previous version of the algorithm fails to capitalize on. This is because there are many operations which are non-GEMM, like softmax, which reduce the utilization of these specialized GPU kernels.
    2. FA v1 and v2 are essentially synchronous. Recall that in the v2 description I mentioned that we are limited when column blocks try to write to the same output pointers, or when we have to go step by step using the output from the previous steps. Well, these modern GPUs can make use of special instructions to break this synchrony.

    "We overlap the comparatively low-throughput non-GEMM operations involved in softmax, such as floating point multiply-add and exponential, with the asynchronous WGMMA instructions for GEMM. As part of this, we rework the FlashAttention-2 algorithm to circumvent certain sequential dependencies between softmax and the GEMMs. For example, in the 2-stage version of our algorithm, while softmax executes on one block of the scores matrix, WGMMA executes in the asynchronous proxy to compute the next block."

    Flash Attention v3, Shah et al. [5]

    3. They also adapted the algorithm to target the specialized low precision Tensor Cores on these new devices, significantly increasing the achieved FLOPS.

    Some more benchmarks:

    FA v3 performance gain over v2. Source: Shah et al. [5]

    Conclusion

    There is much to admire in their work here. The bar for this level of technical skill has often seemed high, owing to the low-level details. But hopefully tools like Triton can change the game and get more people into this! The future is bright.

    References

    [1] Qwen 2.5-7B-Instruct-1M Huggingface Model Page

    [2] Tri Dao, Daniel Y. Fu, Stefano Ermon, Atri Rudra, and Christopher Re, FlashAttention: Fast and Memory-Efficient Exact Attention with IO-Awareness

    [3] Tri Dao, FlashAttention-2: Faster Attention with Better Parallelism and Work Partitioning

    [4] NVIDIA Hopper Architecture Page

    [5] Jay Shah, Ganesh Bikshandi, Ying Zhang, Vijay Thakkar, Pradeep Ramani, Tri Dao, FlashAttention-3: Fast and Accurate Attention with Asynchrony and Low-precision

    [6] Single-precision floating-point format, Wikipedia


