
    Optimizing PyTorch Model Inference on AWS Graviton

    By Editor Times Featured | December 10, 2025 | 11 Mins Read


    AI/ML models can be an extremely costly endeavor. Many of our posts have focused on a wide variety of tips, tricks, and techniques for analyzing and optimizing the runtime performance of AI/ML workloads. Our argument has been twofold:

    1. Performance analysis and optimization must be an integral part of every AI/ML development project, and,
    2. Achieving meaningful performance boosts and cost reductions does not require a high degree of specialization. Any AI/ML developer can do it. Every AI/ML developer should do it.

    In a previous post, we addressed the challenge of optimizing an ML inference workload on an Intel® Xeon® processor. We began by reviewing a number of scenarios in which a CPU may be the best choice for AI/ML inference, even in an era of multiple dedicated AI inference chips. We then introduced a toy image-classification PyTorch model and proceeded to demonstrate a wide range of techniques for boosting its runtime performance on an Amazon EC2 c7i.xlarge instance, powered by 4th Generation Intel Xeon Scalable processors. In this post, we extend our discussion to AWS's homegrown Arm-based Graviton CPUs. We will revisit many of the optimizations we discussed in our earlier posts (some of which will require adaptation to the Arm processor) and assess their impact on the same toy model. Given the profound differences between the Arm and Intel processors, the path to the best-performing configuration may look quite different.

    AWS Graviton

    AWS Graviton is a family of processors based on Arm Neoverse CPUs, customized and built by AWS for optimal price-performance and energy efficiency. Their dedicated engines for vector processing (NEON and SVE/SVE2) and matrix multiplication (MMLA), and their support for Bfloat16 operations (as of Graviton3), make them a compelling candidate for running compute-intensive workloads such as AI/ML inference. To facilitate high-performance AI/ML on Graviton, the entire software stack has been optimized for it:

    • Low-level compute kernels from the Arm Compute Library (ACL) are highly optimized to leverage the Graviton hardware accelerators (e.g., SVE and MMLA).
    • ML middleware libraries such as oneDNN and OpenBLAS route deep learning and linear algebra operations to the specialized ACL kernels.
    • AI/ML frameworks like PyTorch and TensorFlow are compiled and configured to use these optimized backends.

    In this post we will use an Amazon EC2 c8g.xlarge instance (four AWS Graviton4 vCPUs) and an AWS ARM64 PyTorch Deep Learning AMI (DLAMI).
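    Before optimizing, it can help to record a few basic facts about the runtime. The helper below is our own convenience sketch (not part of the DLAMI) and uses only the standard library; on a Graviton instance, `platform.machine()` should report `aarch64`.

```python
import os
import platform

def runtime_report():
    """Collect basic facts about the host relevant to CPU inference."""
    return {
        "arch": platform.machine(),        # expected: 'aarch64' on Graviton
        "python": platform.python_version(),
        "vcpus": os.cpu_count(),           # 4 on a c8g.xlarge
    }

print(runtime_report())
```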

    The intention of this post is to demonstrate techniques for improving performance on an AWS Graviton instance. Importantly, our intention is not to draw a comparison between AWS Graviton and alternative chips, nor is it to advocate for the use of one chip over another. The best choice of processor depends on a whole host of considerations beyond the scope of this post. One of the most important considerations is the maximal runtime performance of your model on each chip. In other words: how much "bang" do we get for our buck? Thus, making an informed decision about the best processor is one of the motivations for optimizing runtime performance on each one.

    Another motivation for optimizing our model's performance on multiple inference devices is to increase its portability. The playing field of AI/ML is extremely dynamic, and resilience to changing circumstances is crucial for success. It is not uncommon for compute instances of certain types to suddenly become unavailable or scarce. Conversely, an increase in the capacity of AWS Graviton instances could mean their availability at steep discounts, e.g., in the Amazon EC2 Spot Instance market, presenting cost-saving opportunities you would not want to miss.

    Disclaimers

    The code blocks we share, the optimization steps we discuss, and the results we reach are intended as examples of the benefits you may see from ML performance optimization on an AWS Graviton instance. They may differ considerably from the results you see with your own model and runtime environment. Please do not rely on the accuracy or optimality of the contents of this post, and please do not interpret the mention of any library, framework, or platform as an endorsement of its use.

    Inference Optimization on AWS Graviton

    As in our previous post, we will demonstrate the optimization steps on a toy image-classification model:

    import torch, torchvision
    import time
    
    
    def get_model(channels_last=False, compile=False):
        model = torchvision.models.resnet50()
    
        if channels_last:
            model = model.to(memory_format=torch.channels_last)
    
        model = model.eval()
    
        if compile:
            model = torch.compile(model)
    
        return model
    
    def get_input(batch_size, channels_last=False):
        batch = torch.randn(batch_size, 3, 224, 224)
        if channels_last:
            batch = batch.to(memory_format=torch.channels_last)
        return batch
    
    def get_inference_fn(model, enable_amp=False):
        def infer_fn(batch):
            with torch.inference_mode(), torch.amp.autocast(
                    'cpu',
                    dtype=torch.bfloat16,
                    enabled=enable_amp
            ):
                output = model(batch)
            return output
        return infer_fn
    
    def benchmark(infer_fn, batch):
        # warm-up
        for _ in range(20):
            _ = infer_fn(batch)
    
        iters = 100
    
        start = time.time()
        for _ in range(iters):
            _ = infer_fn(batch)
        end = time.time()
    
        return (end - start) / iters
    
    
    batch_size = 1
    model = get_model()
    batch = get_input(batch_size)
    infer_fn = get_inference_fn(model)
    avg_time = benchmark(infer_fn, batch)
    print(f"\nAverage samples per second: {(batch_size/avg_time):.2f}")

    The initial throughput is 12 samples per second (SPS).

    Upgrade to the Most Recent PyTorch Release

    While the version of PyTorch in our DLAMI is 2.8, the latest version of PyTorch, at the time of this writing, is 2.9. Given the rapid pace of development in the field of AI/ML, it is highly advisable to use the most up-to-date library packages. As our first step, we upgrade to PyTorch 2.9, which includes key updates to its Arm backend.

    pip3 install -U torch torchvision --index-url https://download.pytorch.org/whl/cpu

    In the case of our model in its initial configuration, upgrading the PyTorch version does not have any effect. However, this step is important for getting the most out of the optimization techniques that we will assess.
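    To confirm that the upgrade took effect, we can compare the installed version string (available as `torch.__version__`) against the minimum we expect. A pure-Python sketch of our own, assuming a simple `major.minor[.patch]` scheme with an optional local suffix such as `+cpu`:

```python
def meets_minimum(installed, required):
    """True if a dotted version string satisfies a minimum, e.g. '2.9.0+cpu' >= '2.9'."""
    parse = lambda v: tuple(int(p) for p in v.split("+")[0].split(".")[:3])
    return parse(installed) >= parse(required)

print(meets_minimum("2.9.0+cpu", "2.9"))   # True
print(meets_minimum("2.8.0", "2.9"))       # False
```

In practice you would call `meets_minimum(torch.__version__, "2.9")` after the upgrade.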

    Batched Inference

    To reduce kernel-launch overheads and improve the utilization of the hardware accelerators, we group samples together and apply batched inference. The table below demonstrates how the model throughput varies as a function of batch size:

    Inference Throughput for Various Batch Sizes (by Author)
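    The sweep behind the table can be reproduced with a small harness around the `benchmark` logic shown earlier. The version below is deliberately self-contained, using a stand-in workload so it runs anywhere; substituting `get_inference_fn(get_model())` and `get_input(bs)` gives the actual measurement.

```python
import time

def throughput(infer_fn, batch, batch_size, iters=100, warmup=20):
    """Average samples/sec for one inference function and batch."""
    for _ in range(warmup):
        infer_fn(batch)
    start = time.time()
    for _ in range(iters):
        infer_fn(batch)
    return batch_size * iters / (time.time() - start)

# stand-in workload: replace with the PyTorch model to reproduce the table
for bs in (1, 2, 4, 8, 16, 32):
    batch = [float(i) for i in range(bs)]
    sps = throughput(lambda b: sum(x * x for x in b), batch, bs, iters=10, warmup=2)
    print(f"batch_size={bs}: {sps:.2f} SPS")
```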

    Memory Optimizations

    We apply a number of techniques from our earlier post for optimizing memory allocation and usage. These include the channels-last memory format, automatic mixed precision with the bfloat16 data type (supported from Graviton3), the TCMalloc allocation library, and transparent huge page allocation. Please see our previous post for details. We also enable the fast math mode of the ACL GEMM kernels, and caching of the kernel primitives, two optimizations that appear in the official guidelines for running PyTorch inference on Graviton.

    The command-line instructions required to enable these optimizations are shown below:

    # install TCMalloc
    sudo apt-get install google-perftools
    
    # preload TCMalloc
    export LD_PRELOAD=/usr/lib/aarch64-linux-gnu/libtcmalloc.so.4
    
    # enable transparent huge page memory allocation
    export THP_MEM_ALLOC_ENABLE=1
    
    # enable the fast math mode of the GEMM kernels
    export DNNL_DEFAULT_FPMATH_MODE=BF16
    
    # set the LRU cache capacity for caching kernel primitives
    export LRU_CACHE_CAPACITY=1024
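    A mistyped `LD_PRELOAD` path is easy to miss: the loader prints a warning but the program still runs with the default allocator. A quick Linux-only check of our own devising, scanning the process memory map for the library:

```python
def tcmalloc_loaded():
    """Return True if a tcmalloc shared library is mapped into this process (Linux)."""
    try:
        with open("/proc/self/maps") as maps:
            return "tcmalloc" in maps.read()
    except OSError:
        return False

print(tcmalloc_loaded())
```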

    The following table captures the impact of the memory optimizations, applied successively:

    ResNet-50 Memory Optimization Results (by Author)

    In the case of our toy model, the channels-last and bfloat16 mixed-precision optimizations had the greatest impact. After applying all of the memory optimizations, the average throughput is 53.03 SPS.

    Model Compilation

    PyTorch compilation support for AWS Graviton is an area of focused effort of the AWS Graviton team. However, in the case of our toy model, it results in a slight reduction in throughput, from 53.03 SPS to 52.23.

    Multi-Worker Inference

    While typically used in settings with many more than four vCPUs, we demonstrate the implementation of multi-worker inference by modifying our script to support core pinning:

    if __name__ == '__main__':
        # pin CPUs according to worker rank
        import os, psutil
        rank = int(os.environ.get('RANK', '0'))
        world_size = int(os.environ.get('WORLD_SIZE', '1'))
        cores = list(range(psutil.cpu_count(logical=True)))
        num_cores = len(cores)
        cores_per_process = num_cores // world_size
        start_index = rank * cores_per_process
        end_index = (rank + 1) * cores_per_process
        pid = os.getpid()
        p = psutil.Process(pid)
        p.cpu_affinity(cores[start_index:end_index])
    
        batch_size = 8
        model = get_model(channels_last=True)
        batch = get_input(batch_size, channels_last=True)
        infer_fn = get_inference_fn(model, enable_amp=True)
        avg_time = benchmark(infer_fn, batch)
        print(f"\nAverage samples per second: {(batch_size/avg_time):.2f}")

    We note that contrary to other AWS EC2 CPU instance types, each Graviton vCPU maps directly to a single physical CPU core. We use the torchrun utility to start up four workers, each running on a single CPU core:

    export OMP_NUM_THREADS=1 # set one OpenMP thread per worker
    torchrun --nproc_per_node=4 main.py

    This results in a throughput of 55.15 SPS, a 4% improvement over our previous best result.
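    The pinning arithmetic in the script above is worth factoring out, since an uneven division of cores across workers silently leaves cores idle. A pure function of our own making, equivalent to the rank-based slicing used in the script:

```python
def core_slice(rank, world_size, num_cores):
    """Core ids for one worker: a contiguous block of num_cores // world_size cores."""
    per_worker = num_cores // world_size
    return list(range(rank * per_worker, (rank + 1) * per_worker))

# four workers on four cores: one core each
print([core_slice(r, 4, 4) for r in range(4)])   # [[0], [1], [2], [3]]
```

Note that with, say, four cores and three workers, one core is left unassigned; the same is true of the script above.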

    INT8 Quantization for Arm

    Another area of active development and continuous improvement on Arm is INT8 quantization. INT8 quantization tools are typically closely tied to the target instance type. In our previous post we demonstrated PyTorch 2 Export Quantization with the X86 backend via Inductor, using the TorchAO (0.12.1) library. Fortunately, recent versions of TorchAO include a dedicated quantizer for Arm. The updated quantization sequence is shown below. As in our previous post, we are interested only in the potential performance impact. In practice, INT8 quantization can have a significant impact on the quality of the model and may necessitate a more sophisticated quantization strategy.

    from torchao.quantization.pt2e.quantize_pt2e import prepare_pt2e, convert_pt2e
    import torchao.quantization.pt2e.quantizer.arm_inductor_quantizer as aiq
    
    def quantize_model(model):
        x = torch.randn(4, 3, 224, 224).contiguous(
                                memory_format=torch.channels_last)
        example_inputs = (x,)
        batch_dim = torch.export.Dim("batch")
        with torch.no_grad():
            exported_model = torch.export.export(
                model,
                example_inputs,
                dynamic_shapes=((batch_dim,
                                 torch.export.Dim.STATIC,
                                 torch.export.Dim.STATIC,
                                 torch.export.Dim.STATIC),
                                )
            ).module()
        quantizer = aiq.ArmInductorQuantizer()
        quantizer.set_global(aiq.get_default_arm_inductor_quantization_config())
        prepared_model = prepare_pt2e(exported_model, quantizer)
        prepared_model(*example_inputs)
        converted_model = convert_pt2e(prepared_model)
        optimized_model = torch.compile(converted_model)
        return optimized_model
    
    
    batch_size = 8
    model = get_model(channels_last=True)
    model = quantize_model(model)
    batch = get_input(batch_size, channels_last=True)
    infer_fn = get_inference_fn(model, enable_amp=True)
    avg_time = benchmark(infer_fn, batch)
    print(f"\nAverage samples per second: {(batch_size/avg_time):.2f}")

    The resultant throughput is 56.77 SPS, a 7.1% improvement over the bfloat16 solution.
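    To see why INT8 can cost model quality, consider the rounding step at its core. The toy sketch below (plain Python, not the TorchAO path) applies symmetric per-tensor fake-quantization: every value is snapped to a grid of spacing `scale`, so the worst-case error is about `scale / 2`, which grows with the tensor's dynamic range.

```python
def quant_dequant(values, bits=8):
    """Symmetric per-tensor fake-quantization: round onto an int grid and back."""
    qmax = 2 ** (bits - 1) - 1                          # 127 for int8
    scale = max(abs(v) for v in values) / qmax or 1.0   # grid spacing
    return [round(v / scale) * scale for v in values]

vals = [0.013, -0.92, 0.5, 1.27]
deq = quant_dequant(vals)
err = max(abs(a - b) for a, b in zip(vals, deq))
print(f"max round-trip error: {err:.4f}")
```

Outliers stretch the grid and hurt small values, which is one reason more sophisticated schemes (per-channel scales, calibration) are often needed.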

    AOT Compilation Using ONNX and OpenVINO

    In our previous post, we explored ahead-of-time (AOT) model compilation techniques using Open Neural Network Exchange (ONNX) and OpenVINO. Both libraries include dedicated support for running on AWS Graviton (e.g., see here and here). The experiments in this section require the following library installations:

    pip set up onnxruntime onnxscript openvino nncf

    The following code block demonstrates model compilation and execution on Arm using ONNX:

    def export_to_onnx(model, onnx_path="resnet50.onnx"):
        dummy_input = torch.randn(4, 3, 224, 224)
        batch = torch.export.Dim("batch")
        torch.onnx.export(
            model,
            dummy_input,
            onnx_path,
            input_names=["input"],
            output_names=["output"],
            dynamic_shapes=((batch,
                             torch.export.Dim.STATIC,
                             torch.export.Dim.STATIC,
                             torch.export.Dim.STATIC),
                            ),
            dynamo=True
        )
        return onnx_path
    
    def onnx_infer_fn(onnx_path):
        import onnxruntime as ort
    
        # session options must be passed to the session at creation time
        sess_options = ort.SessionOptions()
        sess_options.add_session_config_entry(
                   "mlas.enable_gemm_fastmath_arm64_bfloat16", "1")
        sess = ort.InferenceSession(
            onnx_path,
            sess_options=sess_options,
            providers=["CPUExecutionProvider"]
        )
        input_name = sess.get_inputs()[0].name
    
        def infer_fn(batch):
            result = sess.run(None, {input_name: batch})
            return result
        return infer_fn
    
    batch_size = 8
    model = get_model()
    onnx_path = export_to_onnx(model)
    batch = get_input(batch_size).numpy()
    infer_fn = onnx_infer_fn(onnx_path)
    avg_time = benchmark(infer_fn, batch)
    print(f"\nAverage samples per second: {(batch_size/avg_time):.2f}")

    It should be noted that ONNX Runtime supports a dedicated ACL execution provider for running on Arm, but this requires a custom ONNX Runtime build (as of the time of this writing), which is out of the scope of this post.

    Alternatively, we can compile the model using OpenVINO. The code block below demonstrates its use, including an option for INT8 quantization using NNCF:

    import openvino as ov
    import nncf
    
    def openvino_infer_fn(compiled_model):
        def infer_fn(batch):
            result = compiled_model([batch])[0]
            return result
        return infer_fn
    
    class RandomDataset(torch.utils.data.Dataset):
        def __len__(self):
            return 10000
    
        def __getitem__(self, idx):
            return torch.randn(3, 224, 224)
    
    quantize_model = False
    batch_size = 8
    model = get_model()
    calibration_loader = torch.utils.data.DataLoader(RandomDataset())
    calibration_dataset = nncf.Dataset(calibration_loader)
    
    if quantize_model:
        # quantize the PyTorch model
        model = nncf.quantize(model, calibration_dataset)
    
    ovm = ov.convert_model(model, example_input=torch.randn(1, 3, 224, 224))
    ovm = ov.compile_model(ovm)
    batch = get_input(batch_size).numpy()
    infer_fn = openvino_infer_fn(ovm)
    avg_time = benchmark(infer_fn, batch)
    print(f"\nAverage samples per second: {(batch_size/avg_time):.2f}")

    In the case of our toy model, OpenVINO compilation results in a further increase in throughput, to 63.48 SPS, but the NNCF quantization disappoints, yielding just 55.18 SPS.

    Results

    The results of our experiments are summarized in the table below:

    ResNet-50 Inference Optimization Results (by Author)

    As in our previous post, we reran our experiments on a second model, a Vision Transformer (ViT) from the timm library, to demonstrate how the impact of the runtime optimizations we discussed can vary based on the details of the model. The results are captured below:

    ViT Inference Optimization Results (by Author)

    Summary

    In this post, we reviewed a number of relatively simple optimization techniques and applied them to two toy PyTorch models. As the results demonstrated, the impact of each optimization step can vary greatly based on the details of the model, and the journey toward peak performance can take many different paths. The steps we presented in this post were just an appetizer; there are undoubtedly many more optimizations that can unlock even greater performance.

    Along the way, we noted the many AI/ML libraries that have introduced deep support for the Graviton architecture, and the seemingly continuous community effort of ongoing optimization. The performance gains we achieved, combined with this apparent commitment, demonstrate that AWS Graviton is firmly in the "big leagues" when it comes to running compute-intensive AI/ML workloads.


