
    Breaking the Hardware Barrier: Software FP8 for Older GPUs

    By Editor Times Featured · December 28, 2025


    As deep learning models grow larger and datasets expand, practitioners face an increasingly common bottleneck: GPU memory bandwidth. While cutting-edge hardware offers FP8 precision to accelerate training and inference, most data scientists and ML engineers work with older GPUs that lack this capability.

    This gap in the ecosystem is what motivated me to build Feather, an open-source library that uses a software-based approach to deliver FP8-like performance improvements on widely available hardware. I created this tool to make efficient deep learning more accessible to the broader ML community, and I welcome contributions.

    Notation & Abbreviations

    • FPX: X-bit floating point number
    • UX: X-bit unsigned integer
    • GPU: Graphics processing unit
    • SRAM: Static RAM (on-chip GPU cache)
    • HBM: High-bandwidth memory (GPU VRAM)
    • GEMV: General Matrix-Vector multiplication

    Motivation

    FP8 processing has proven effective in the deep learning community [1]; however, only specific recent hardware architectures (Ada and Blackwell) support it, limiting its benefits for practitioners and researchers. I have an `Nvidia RTX 3050 6GB Laptop GPU`, which unfortunately does not support FP8 operations at the hardware level.

    Inspired by software-based solutions (such as software-accelerated rendering on computers that lack native hardware acceleration for gaming), this article proposes a solution that harnesses the power of FP8 datatypes.


    Packing FP8 & FP16 in FP32 containers

    Inspired by bitwise operations and packing techniques, this article presents an algorithm that packs two FP16s or four FP8s into a single FP32. This fits twice or four times as many values into the same memory, benefiting from a lower memory footprint while sacrificing only a small amount of precision.

    One might argue that we are performing redundant computation: "Pack -> Load -> Unpack -> Compute." However, consider deep learning operations: most of the time, they are memory-bound rather than compute-bound. This is the same bottleneck that algorithms like FlashAttention address; however, FlashAttention uses tiling to keep data in fast SRAM, whereas Feather compresses data to reduce memory traffic.


    GPU Memory Hierarchy

    GPU Memory Hierarchy & Bandwidth Chart. (Adapted from FlashAttention) (Note: the values shown do not represent RTX 3050 cards)

    Take a look at this diagram. SRAM is the fastest accessible GPU memory region and has the highest bandwidth (excluding the registers themselves), but is limited to only about 20 MB. HBM can be thought of as the VRAM of the GPU itself, with roughly 1/7th the bandwidth of SRAM.

    The GPU cores are fast enough to finish the computation almost instantly, but they spend most of their time sitting idle, waiting for data to finish loading and writing back. That is what I mean by memory-bound: the bottleneck here isn't the math, but the data transfer between the levels of the GPU memory hierarchy.


    Lower-Precision Types & Bandwidth

    Most of the time, values during computation are restricted to ranges around zero due to normalisation. Engineers therefore developed lower-precision types such as FP8 and FP16, which allow for higher effective bandwidth. One might wonder how lowering the precision allows for higher bandwidth: if we take a closer look, we are effectively loading two values in the place of one for the FP16 type and four values in the place of one for the FP8 type. We are trading off precision for higher bandwidth to address memory-bound operations.

    Hardware-Level Support

    Just like AVX-512 instructions, which are supported only on a limited number of hardware platforms, FP8 and FP16 instructions and registers are also restricted by hardware and are available only on recent architectures. If you are on an RTX 30 or RTX 20 series GPU from Nvidia, you will not be able to take advantage of the lower-precision FP8 type. That is exactly the problem Feather attempts to solve.


    Packing Methodology

    Using bitwise operators, one can easily pack the FP16 type into an FP32. The algorithm is described below.

    Packing FP16

    • Cast the input FP32 into an FP16; this step can be done easily using NumPy's astype function.
    • Cast them to U16 and then to U32; this sets the upper 16 bits to 0s and the lower 16 bits to the actual FP16 bits.
    • Shift one of them left by 16 using the bitwise LSHIFT operator, and combine both of them using the bitwise OR operator.

    Unpacking FP16

    • Extract the lower 16 bits using the bitwise AND operator with the mask 0xFFFF.
    • Extract the upper 16 bits using an RSHIFT by 16, then perform a bitwise AND with the mask 0xFFFF.
    • Cast both U16 values back to FP16, and to FP32 if needed.
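    A minimal NumPy sketch of both directions, assuming flat float32 arrays of even length (the function names are illustrative, not Feather's actual API):

    import numpy as np

    def pack_fp16(x: np.ndarray) -> np.ndarray:
        # Pack pairs of float32 values into one uint32 holding two FP16 halves.
        halves = x.astype(np.float16).view(np.uint16).astype(np.uint32)
        lo, hi = halves[0::2], halves[1::2]   # even indices go to the lower 16 bits
        return lo | (hi << 16)

    def unpack_fp16(packed: np.ndarray) -> np.ndarray:
        # Reverse of pack_fp16: recover float32 values (at FP16 precision).
        lo = (packed & 0xFFFF).astype(np.uint16).view(np.float16)
        hi = ((packed >> 16) & 0xFFFF).astype(np.uint16).view(np.float16)
        out = np.empty(packed.size * 2, dtype=np.float32)
        out[0::2], out[1::2] = lo, hi
        return out

    x = np.random.randn(8).astype(np.float32)
    assert np.allclose(unpack_fp16(pack_fp16(x)), x, rtol=1e-3, atol=1e-3)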

    Packing FP8 

    FP8 has two widely used formats: E5M2 and E4M3. One cannot reuse the exact algorithm used for packing two FP16s into an FP32, because the CPU does not support FP8 types natively, though it does support FP16 (half precision); this is the reason np.float8 does not exist.

    FP8-E5M2 & FP16 formats (Adapted from Half-Precision)

    Casting an FP16 to FP8-E5M2 is straightforward, as seen in the figure, because both have the same number of exponent bits and differ only in their fraction bits.

    FP8-E5M2 Packing

    • Cast the input FP32 into an FP16; this step can be done easily using NumPy's astype function, or take the input as FP16 directly.
    • Cast to U16 and RSHIFT by 8 to isolate the upper 8 bits (sign, exponent and the two most significant fraction bits), which form the FP8-E5M2 value.
    • Do this for all four FP32s or FP16s.
    • Now, using the LSHIFT operator, shift them by 0, 8, 16 and 24 bits and combine them using the bitwise OR operator.

    Once again, unpacking should be straightforward; it is the exact reverse of packing.
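
    A minimal NumPy sketch of the E5M2 round trip, assuming the input length is a multiple of four (the names are illustrative, not Feather's API; note that truncating the FP16 fraction like this rounds magnitudes toward zero rather than to nearest):

    import numpy as np

    def pack_fp8_e5m2(x: np.ndarray) -> np.ndarray:
        # Keep only the upper byte of each FP16: sign, 5 exponent bits, 2 fraction bits.
        fp8 = (x.astype(np.float16).view(np.uint16) >> 8).astype(np.uint32)
        a, b, c, d = fp8[0::4], fp8[1::4], fp8[2::4], fp8[3::4]
        return a | (b << 8) | (c << 16) | (d << 24)

    def unpack_fp8_e5m2(packed: np.ndarray) -> np.ndarray:
        # Put each byte back into the upper half of a 16-bit lane and reinterpret as FP16.
        out = np.empty(packed.size * 4, dtype=np.float32)
        for i, s in enumerate((0, 8, 16, 24)):
            lane = (((packed >> s) & 0xFF) << 8).astype(np.uint16).view(np.float16)
            out[i::4] = lane
        return out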

    Packing an FP8-E4M3 is not as straightforward as packing an FP16 or FP8-E5M2, because the exponent bit counts do not match.

    FP8-E4M3 format (Adapted from Minifloat)

    Instead of implementing it from scratch, the library uses the ml_dtypes library, which already handles the casting math.

    The ml_dtypes library provides support for commonly used FP8 standards, such as E5M2 and E4M3 casting, for NumPy arrays. Using the same astype function, we can perform the cast just as we did for FP16 types. The packing algorithm is otherwise identical to how we pack FP16, so I am skipping it here.
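
    For illustration, a short sketch of how the E4M3 cast could look with ml_dtypes before the same byte-level packing (a hypothetical helper, not Feather's exact code):

    import numpy as np
    import ml_dtypes  # pip install ml-dtypes

    def pack_fp8_e4m3(x: np.ndarray) -> np.ndarray:
        # Cast float32 -> FP8-E4M3 via ml_dtypes, then pack four bytes per uint32.
        fp8 = x.astype(ml_dtypes.float8_e4m3fn).view(np.uint8).astype(np.uint32)
        a, b, c, d = fp8[0::4], fp8[1::4], fp8[2::4], fp8[3::4]
        return a | (b << 8) | (c << 16) | (d << 24)

    x = np.random.randn(16).astype(np.float32)
    packed = pack_fp8_e4m3(x)  # four FP8-E4M3 values per uint32
    # On a little-endian host, viewing the packed words as bytes recovers the FP8 values in order.
    roundtrip = packed.view(np.uint8).view(ml_dtypes.float8_e4m3fn).astype(np.float32)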


    Triton GPU Kernels

    Once we pack, we need an algorithm (kernel) that consumes this packed datatype and performs the computation. Passing the packed datatype to a kernel implemented for FP32 or FP64 will result in undefined computation, because we have already repurposed the bits of the FP32 or FP64 being passed. Writing a kernel that takes the packed datatype as input in CUDA is not an easy task and is error-prone. This is exactly where Triton shines; it is a Domain-Specific Language library that leverages a custom intermediate representation for GPU kernels. In layman's terms, it allows one to write GPU kernels in Python itself, without the need to write CUDA kernels in C.

    The Triton kernels do exactly what was described previously; the algorithm is as follows:

    • Load the packed array into memory
    • Unpack the values and upcast them to FP32 for accumulation
    • Perform the computation

    The reader should note that upcasting is used during the computation to prevent overflows. Therefore, from a computational perspective, there is no advantage. However, from a bandwidth perspective, we are effectively loading two or four times as many values for the same memory traffic.

    Triton Kernel Implementation (pseudocode)
    @triton.jit
    def gemv_fp8_kernel(packed_matrix_ptr, packed_vector_ptr, out_ptr): 
        # Get the current row to process
        row_id = get_program_id()
        
        # Initialize the accumulator for the dot product
        accumulator = 0
        
        # Iterate over the row in blocks
        for each block in row:
            # Load packed FP32 values (each contains four FP8s)
            packed_matrix = load(packed_matrix_ptr)
            packed_vector = load(packed_vector_ptr)
            
            # Unpack each FP32 into four FP8 values
            m_a, m_b, m_c, m_d = unpack_fp8(packed_matrix)
            v_a, v_b, v_c, v_d = unpack_fp8(packed_vector)
            
            # Upcast to FP32 and compute partial dot products
            accumulator += (m_a * v_a) + (m_b * v_b) + (m_c * v_c) + (m_d * v_d)
        
        # Store the final result
        store(out_ptr, accumulator)
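
    For completeness, here is a rough sketch of what the E5M2 unpack step could look like in actual Triton code (an illustrative helper built on the byte layout described earlier, not necessarily how Feather's kernels are written):

    import triton
    import triton.language as tl

    @triton.jit
    def unpack_fp8_e5m2(packed):
        # packed: a block of 32-bit words, each holding four FP8-E5M2 bytes.
        # Moving each byte into the upper half of a 16-bit lane and bitcasting
        # yields the equivalent float16 value, ready for upcasting and accumulation.
        a = (((packed >> 0) & 0xFF) << 8).to(tl.uint16).to(tl.float16, bitcast=True)
        b = (((packed >> 8) & 0xFF) << 8).to(tl.uint16).to(tl.float16, bitcast=True)
        c = (((packed >> 16) & 0xFF) << 8).to(tl.uint16).to(tl.float16, bitcast=True)
        d = (((packed >> 24) & 0xFF) << 8).to(tl.uint16).to(tl.float16, bitcast=True)
        return a, b, c, d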

    Results

    Hardware: NVIDIA GeForce RTX 3050 (6 GB VRAM)

    CUDA Version: 13.0

    Python Version: 3.13.9

    GEMV Benchmark (M = 16384, N = 16384) (MxN matrix)

    Implementation       | Time (microseconds) | Speedup
    PyTorch (FP32)       | 5,635               | (Baseline)
    Feather (FP8-E4M3)   | 2,703               | 2.13x
    Feather (FP8-E5M2)   | 1,679               | 3.3x

    The theoretical performance boost that can be achieved is 4x; 3.3x compares very well with that, with the remaining overhead primarily stemming from pack/unpack operations and kernel launch costs.

    E5M2 is faster than E4M3 because of its simpler unpacking, while E4M3 offers better precision. However, E4M3 is significantly more complex to unpack (Feather uses a separate GPU kernel to unpack the E4M3 format).

    FlashAttention Benchmark (Sequence Length = 8192, Embedding Dimension = 512)

    Implementation       | Time (microseconds) | Speedup
    PyTorch (FP32)       | 33,290              | (Baseline)
    Feather (FP8-E5M2)   | 9,887               | ~3.3x

    Accuracy & Precision

    Testing with random matrices (integer distributions in the range [-3, 3] and standard normal distributions) shows that both E4M3 and E5M2 keep numerical results within practical tolerances for deep learning operations. The accumulated errors remain manageable for typical workload sizes; however, users requiring strict numerical precision should validate their specific use case.


    When should you use Feather?

    Use cases for Feather are not limited; one can use Feather wherever FP8 packing and unpacking provide an advantage, such as:

    • Large matrix-vector products, where loading and storing data are the bottlenecks.
    • Attention-like memory-bound kernels.
    • Inference or fine-tuning on local RTX 30 or 20 series GPUs.
    • Batch processing, where the packing overhead is amortised.

    When should you not use Feather?

    • You have an RTX 40-series or H100 GPU (native FP8 is faster).
    • Your workloads are compute-bound rather than bandwidth- or memory-bound.
    • You need guaranteed precision.

    Limitations of Feather

    Feather is currently in the early stages of prototyping, with several areas for improvement.

    • Limited operation support; currently, Feather supports only the dot product, the GEMV subroutine and FlashAttention.
    • Accuracy validation for full ML workloads; currently, Feather's accuracy is validated only for individual operations, not for end-to-end ML workloads.
    • Integration is currently limited; Feather is a standalone implementation. Integration with PyTorch and support for autograd would make it more production-ready.

    The project is open source, and community contributions are welcome! You can try out the code by simply following the instructions on GitHub.

    Image License: All images are made by the author. Adaptation sources are clearly mentioned in the respective captions.



