grows, so does the criticality of optimizing their runtime efficiency. While the degree to which AI models will outperform human intelligence remains a heated topic of debate, their need for powerful and costly compute resources is unquestionable, and even infamous.
In previous posts, we covered the topic of AI model optimization, primarily in the context of model training, and demonstrated how it can have a decisive impact on the cost and speed of AI model development. In this post, we turn our attention to AI model inference, where model optimization has an additional objective: to minimize the latency of inference requests and improve the user experience of the model consumer.
In this post, we will assume that the platform on which model inference is performed is a 4th Gen Intel® Xeon® Scalable processor, more specifically, an Amazon EC2 c7i.xlarge instance (with 4 Intel Xeon vCPUs) running a dedicated Deep Learning Ubuntu (22.04) AMI and a CPU build of PyTorch 2.8.0. Of course, the choice of a model deployment platform is among the many important decisions taken when designing an AI solution, along with the choice of model architecture, development framework, training accelerator, data format, deployment strategy, and so on, each of which must be made with consideration of the associated costs and runtime speed. The choice of a CPU for running model inference may seem surprising in an era in which the number of dedicated AI inference accelerators is continuously growing. However, as we will see, there are occasions when the best (and cheapest) option may very well be just a good old-fashioned CPU.
We will introduce a toy image-classification model and proceed to demonstrate some of the optimization opportunities for AI model inference on an Intel® Xeon® CPU. The deployment of an AI model typically includes a full inference server solution, but for the sake of simplicity, we will limit our discussion to just the model's core execution. For a primer on model inference serving, please see our earlier post: The Case for Centralized AI Model Inference Serving.
Our intention in this post is to demonstrate that: 1) several simple optimization techniques can result in meaningful performance gains, and 2) achieving such results does not require specialized expertise in performance analyzers (such as Intel® VTune™ Profiler) or in the inner workings of the low-level compute kernels. Importantly, the process of AI model optimization can differ considerably based on the model architecture and runtime environment. Optimizing for training will differ from optimizing for inference. Optimizing a transformer model will differ from optimizing a CNN model. Optimizing a 22-billion-parameter model will differ from optimizing a 100-million-parameter model. Optimizing a model to run on a GPU will differ from optimizing it for a CPU. Even different generations of the same CPU family may have different compute elements and, consequently, different optimization strategies. While the high-level steps for optimizing a given model on a given instance are fairly standard, the specific course the process will take and the end result can vary greatly based on the project at hand.
The code snippets we share are intended for demonstration purposes. Please do not rely on their accuracy or optimality. Please do not interpret our mention of any tool or technique as an endorsement of its use. Ultimately, the best design choices for your use case will depend greatly on the details of your project and, given the extent of the potential impact on performance, should be evaluated with the appropriate time and attention.
Why CPU?
With the ever-increasing number of hardware solutions for executing AI/ML model inference, our choice of a CPU may seem surprising. In this section, we describe some scenarios in which a CPU may be the preferred platform for inference.
- Accessibility: The use of dedicated AI accelerators, such as GPUs, typically requires dedicated deployment and maintenance or, alternatively, access to such instances on a cloud service platform. CPUs, on the other hand, are everywhere. Designing a solution to run on a CPU provides much greater flexibility and increases the opportunities for deployment.
- Availability: Even if your algorithm can access an AI accelerator, there is the question of availability. AI accelerators are in extremely high demand, and even if/when you are able to acquire one, whether on-prem or in the cloud, you may choose to prioritize it for tasks that are even more resource intensive, such as AI model training.
- Reduced Latency: There are many situations in which your AI model is just one component in a pipeline of software algorithms running on a standard CPU. While the AI model may run significantly faster on an AI accelerator, once you account for the time required to send an inference request over the network, it is quite possible that running it on the same CPU will be faster overall.
- Underuse of Accelerator: AI accelerators are typically quite expensive. To justify their cost, your goal should be to keep them fully occupied, minimizing their idle time. In some cases, the inference load simply will not justify the cost of an expensive AI accelerator.
- Model Architecture: These days, we tend to automatically assume that AI models will perform significantly better on AI accelerators than on CPUs. And while more often than not this is indeed the case, your model may include layers that perform better on CPU. For example, sequential algorithms such as Non-Maximum Suppression (NMS) and the Hungarian matching algorithm tend to perform better on CPU than GPU and are often offloaded onto the CPU even when a GPU is available (e.g., see here). If your model contains many such layers, running it on a CPU might not be such a bad option.
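To see why an algorithm like NMS resists GPU acceleration, consider a minimal pure-Python sketch of greedy NMS (illustrative only; in practice you would use torchvision.ops.nms). Each iteration depends on the outcome of the previous one, which is exactly the kind of sequential, data-dependent control flow that CPUs handle well:

```python
def iou(a, b):
    # boxes given as (x1, y1, x2, y2)
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0

def nms(boxes, scores, iou_threshold=0.5):
    # greedily keep the highest-scoring box, suppress the boxes it overlaps
    order = sorted(range(len(boxes)), key=lambda i: scores[i], reverse=True)
    keep = []
    while order:
        best = order.pop(0)
        keep.append(best)
        order = [i for i in order if iou(boxes[best], boxes[i]) < iou_threshold]
    return keep
```

Note that the surviving candidate list after each step is only known once the previous step completes, so the loop cannot be trivially parallelized across iterations.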
Why Intel Xeon?
Intel® Xeon® Scalable processors come with built-in accelerators for the matrix and convolution operators that are common in typical AI/ML workloads. These include AVX-512 (introduced in Gen1), the VNNI extension (Gen2), and AMX (Gen4). The AMX engine, in particular, includes specialized hardware instructions for executing AI models using bfloat16 and int8 precision data types. These acceleration engines are tightly integrated with Intel's optimized software stack, which includes oneDNN, OpenVINO, and the Intel Extension for PyTorch (IPEX). These libraries leverage the dedicated Intel® Xeon® hardware capabilities to optimize model execution with minimal code changes.
Despite the arguments made in this section, the choice of inference vehicle should be made only after considering all available options and assessing the opportunities for optimization on each. In the next sections, we introduce a toy experiment and explore some of the optimization opportunities on CPU.
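One practical way to check which of these instruction-set extensions a given host exposes is to inspect the CPU flags that Linux reports in /proc/cpuinfo. The sketch below parses such a dump; the flag names (avx512f, avx512_vnni, amx_tile) follow the Linux kernel's naming conventions:

```python
def detect_ai_isa(cpuinfo_text):
    """Report which Intel AI ISA extensions appear in a /proc/cpuinfo dump."""
    flags = set()
    for line in cpuinfo_text.splitlines():
        if line.startswith("flags"):
            flags.update(line.split(":", 1)[1].split())
            break
    return {
        "avx512": any(f.startswith("avx512") for f in flags),
        "vnni": "avx512_vnni" in flags,
        "amx": "amx_tile" in flags,
    }

# On a Linux host, you might call it as:
# with open("/proc/cpuinfo") as f:
#     print(detect_ai_isa(f.read()))
```

On a c7i instance, all three entries should come back True; on older Xeon generations, only a subset will.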
Inference Experiment
In this section, we define a toy AI model inference experiment comprising a ResNet-50 image classification model, a randomly generated input batch, and a simple benchmarking utility that we use to report the average number of input samples processed per second (SPS).
import torch, torchvision
import time

def get_model():
    model = torchvision.models.resnet50()
    model = model.eval()
    return model

def get_input(batch_size):
    batch = torch.randn(batch_size, 3, 224, 224)
    return batch

def get_inference_fn(model):
    def infer_fn(batch):
        with torch.inference_mode():
            output = model(batch)
        return output
    return infer_fn

def benchmark(infer_fn, batch):
    # warm-up
    for _ in range(10):
        _ = infer_fn(batch)
    iters = 100
    start = time.time()
    for _ in range(iters):
        _ = infer_fn(batch)
    end = time.time()
    return (end - start) / iters

batch_size = 1
model = get_model()
batch = get_input(batch_size)
infer_fn = get_inference_fn(model)
avg_time = benchmark(infer_fn, batch)
print(f"\nAverage samples per second: {(batch_size/avg_time):.2f}")
The baseline performance of our toy model is 22.76 samples per second (SPS).
Model Inference Optimization
In this section, we apply a number of optimizations to our toy experiment and assess their impact on runtime performance. Our focus will be on optimization techniques that can be applied with relative ease. While it is quite likely that additional performance gains can be achieved, those may require much greater specialization and a more significant time investment.
Our focus will be on optimizations that do not change the model architecture; optimization techniques such as model distillation and model pruning are outside the scope of this post. Also out of scope are methods for optimizing specific model components, e.g., by implementing custom PyTorch operators.
In a previous post, we discussed AI model optimization on Intel Xeon CPUs in the context of training workloads. In this section, we revisit some of the techniques mentioned there, this time in the context of AI model inference. We supplement these with optimization techniques that are unique to inference settings, including model compilation for inference, INT8 quantization, and multi-worker inference.
The order in which we present the optimization methods is not binding. In fact, some of the techniques are interdependent; for example, increasing the number of inference workers may affect the optimal choice of batch size.
Optimization 1: Batched Inference
A common strategy for increasing resource utilization while reducing the average inference response time is to group input samples into batches. In real-world scenarios, we would need to cap the batch size so that we meet the service-level response-time requirements, but for the purposes of our experiment we ignore this constraint. Experimenting with different batch sizes, we find that a batch size of 8 results in a throughput of 26.28 SPS, 15% higher than the baseline result.
Note that when the shapes of the input samples vary, batching requires additional handling (e.g., see here).
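One common way to handle variable-shaped inputs is to pad all samples in the batch to a common shape and track the padding with a mask. Here is a minimal sketch of the idea on 1D sequences; image pipelines pad height and width the same way, typically inside a custom DataLoader collate function:

```python
def pad_and_batch(samples, pad_value=0):
    """Pad variable-length samples to a common length and return a validity mask."""
    max_len = max(len(s) for s in samples)
    batch = [list(s) + [pad_value] * (max_len - len(s)) for s in samples]
    mask = [[1] * len(s) + [0] * (max_len - len(s)) for s in samples]
    return batch, mask
```

The mask lets downstream logic (or the model itself) ignore the padded positions when computing outputs.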
Optimization 2: Channels-Last Memory Format
By default in PyTorch, 4D tensors are stored in NCHW format, i.e., the four dimensions represent the batch size, channels, height, and width, respectively. However, the channels-last, or NHWC, format (i.e., batch size, height, width, and channels) exhibits better performance on CPU. Adjusting our inference script to apply the channels-last optimization is a simple matter of setting the memory format of both the model and the input to torch.channels_last, as shown below:
def get_model(channels_last=False):
    model = torchvision.models.resnet50()
    if channels_last:
        model = model.to(memory_format=torch.channels_last)
    model = model.eval()
    return model

def get_input(batch_size, channels_last=False):
    batch = torch.randn(batch_size, 3, 224, 224)
    if channels_last:
        batch = batch.to(memory_format=torch.channels_last)
    return batch

batch_size = 8
model = get_model(channels_last=True)
batch = get_input(batch_size, channels_last=True)
infer_fn = get_inference_fn(model)
avg_time = benchmark(infer_fn, batch)
print(f"\nAverage samples per second: {(batch_size/avg_time):.2f}")
Applying the channels-last memory optimization results in a further boost of 25% in throughput.
The impact of this optimization is most noticeable on models with many convolutional layers. It is not expected to make a noticeable difference on other model architectures (e.g., transformer models).
Please see the PyTorch documentation for more details on the memory-format optimization and the Intel documentation for details on how this is implemented internally in oneDNN.
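The benefit of channels-last is easiest to see from the address arithmetic. The sketch below computes the flat buffer offset of element (n, c, h, w) under each layout; in NHWC, all channels of a single pixel sit adjacent in memory, which is the unit-stride access pattern that convolution kernels favor:

```python
def nchw_offset(n, c, h, w, C, H, W):
    # element (n, c, h, w) in a contiguous NCHW buffer
    return ((n * C + c) * H + h) * W + w

def nhwc_offset(n, c, h, w, C, H, W):
    # same element in a contiguous NHWC (channels-last) buffer
    return ((n * H + h) * W + w) * C + c
```

Stepping through channels at a fixed pixel moves one element in NHWC but a full H*W plane in NCHW, which is why channels-last improves cache behavior for convolutions.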
Optimization 3: Automatic Mixed Precision
Modern Intel® Xeon® Scalable processors (from Gen3) include native support for the bfloat16 data type, a 16-bit floating-point alternative to the standard float32. We can take advantage of this by applying PyTorch's automatic mixed precision package, torch.amp, as demonstrated below:
def get_inference_fn(model, enable_amp=False):
    def infer_fn(batch):
        with torch.inference_mode(), torch.amp.autocast(
            'cpu',
            dtype=torch.bfloat16,
            enabled=enable_amp
        ):
            output = model(batch)
        return output
    return infer_fn

batch_size = 8
model = get_model(channels_last=True)
batch = get_input(batch_size, channels_last=True)
infer_fn = get_inference_fn(model, enable_amp=True)
avg_time = benchmark(infer_fn, batch)
print(f"\nAverage samples per second: {(batch_size/avg_time):.2f}")
The result of applying mixed precision is a throughput of 86.95 samples per second, 2.6 times the previous experiment and 3.8 times the baseline result.
Note that the use of a reduced-precision floating-point type can affect numerical accuracy, and its effect on model quality must be evaluated.
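To get a feel for the precision loss involved, here is a pure-Python sketch of rounding a float32 value to bfloat16 (which keeps float32's 8-bit exponent but only a 7-bit stored mantissa). This illustrates the format only; actual hardware rounding behavior may differ in edge cases:

```python
import struct

def to_bfloat16(x):
    """Round-to-nearest-even truncation of a float32 value to bfloat16."""
    bits = struct.unpack("<I", struct.pack("<f", x))[0]
    # add a rounding bias, then keep only the top 16 bits
    bits += 0x7FFF + ((bits >> 16) & 1)
    bits &= 0xFFFF0000
    return struct.unpack("<f", struct.pack("<I", bits))[0]
```

Values like 1.0 survive exactly, while something like pi loses its low mantissa bits, which gives a sense of the relative error (about 0.4% worst case) that each bfloat16 activation can carry.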
Optimization 4: Memory Allocation Optimization
Typical AI/ML workloads require the allocation and access of large blocks of memory. A number of optimization techniques are aimed at tuning the way memory is allocated and used during model execution. One common step is to replace the default system allocator (ptmalloc) with an alternative memory allocation library, such as Jemalloc or TCMalloc, which have been shown to perform better on common AI/ML workloads (e.g., see here). To install TCMalloc, run:
sudo apt-get install google-perftools
We program its use via the LD_PRELOAD environment variable:
LD_PRELOAD=/usr/lib/x86_64-linux-gnu/libtcmalloc.so.4 python main.py
This optimization results in another significant performance boost: 117.54 SPS, 35% higher than our previous experiment!
Optimization 5: Enable Huge Page Allocations
By default, the Linux kernel allocates memory in blocks of 4 KB, known as pages. The mapping between virtual and physical memory addresses is managed by the CPU's Memory Management Unit (MMU), which uses a small hardware cache called the Translation Lookaside Buffer (TLB). The TLB is limited in the number of entries it can hold. When a program touches many small pages (as with large neural network models), the number of TLB misses can climb quickly, increasing latency and slowing down execution. A common way to address this is to use "huge pages", blocks of 2 MB (or 1 GB) per page. This reduces the number of TLB entries required, improving memory-access efficiency and lowering allocation latency. In PyTorch, we can enable transparent huge pages for large allocations by setting an environment variable:
export THP_MEM_ALLOC_ENABLE=1
In the case of our model, the impact is negligible. However, this is an important optimization for many AI/ML workloads.
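Some back-of-the-envelope arithmetic shows why page size matters at this scale. Assuming roughly 25.6 million float32 parameters for ResNet-50 (about 102 MB of weights alone), the number of pages, and hence potential TLB entries, needed to map the weights drops dramatically with 2 MB huge pages:

```python
def pages_needed(num_bytes, page_bytes):
    # ceiling division: a partial page still occupies a full page
    return -(-num_bytes // page_bytes)

weights_bytes = 25_600_000 * 4  # ~25.6M float32 parameters (approximate)
small_pages = pages_needed(weights_bytes, 4 * 1024)         # 4 KB pages
huge_pages = pages_needed(weights_bytes, 2 * 1024 * 1024)   # 2 MB huge pages
print(small_pages, huge_pages)
```

Going from tens of thousands of 4 KB pages down to a few dozen 2 MB pages means the working set can fit comfortably in the TLB.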
Optimization 6: IPEX
Intel® Extension for PyTorch (IPEX) is a library extension for PyTorch containing the latest performance optimizations for Intel hardware. To install it, we run:
pip install intel_extension_for_pytorch
In the code block below, we demonstrate the basic use of the ipex.optimize API.
import intel_extension_for_pytorch as ipex

def get_model(channels_last=False, ipex_optimize=False):
    model = torchvision.models.resnet50()
    if channels_last:
        model = model.to(memory_format=torch.channels_last)
    model = model.eval()
    if ipex_optimize:
        model = ipex.optimize(model, dtype=torch.bfloat16)
    return model
The resulting throughput is 159.31 SPS, for another 36% performance boost.
Please see the official documentation for more details on the many optimizations that IPEX has to offer.
Optimization 7: Model Compilation
Another popular PyTorch optimization is torch.compile. Introduced in PyTorch 2.0, this just-in-time (JIT) compilation feature performs kernel fusion and other optimizations. In a previous post, we covered PyTorch compilation in great detail, including some of its many features, controls, and limitations. Here we demonstrate its basic use:
def get_model(channels_last=False, ipex_optimize=False, compile=False):
    model = torchvision.models.resnet50()
    if channels_last:
        model = model.to(memory_format=torch.channels_last)
    model = model.eval()
    if ipex_optimize:
        model = ipex.optimize(model, dtype=torch.bfloat16)
    if compile:
        model = torch.compile(model)
    return model
Applying torch.compile to the IPEX-optimized model results in a throughput of 144.5 SPS, which is lower than our previous experiment. In the case of our model, IPEX and torch.compile do not coexist well. When applying just torch.compile, the throughput is 133.36 SPS.
The general takeaway from this experiment is that, for a given model, any two optimization techniques may interfere with one another. This necessitates evaluating the impact of multiple configurations on the runtime performance of a given model in order to find the best one.
Optimization 8: Auto-Tune the Environment With torch.backends.xeon.run_cpu
There are a number of environment settings that control thread and memory management and can be used to further fine-tune the runtime performance of an AI/ML workload. Rather than setting these manually, PyTorch offers the torch.backends.xeon.run_cpu script, which does this automatically. In preparation for using this script, we install Intel's threading and multiprocessing libraries, TBB and Intel OpenMP. We also add a symbolic link to our TCMalloc installation.
# install TBB
sudo apt install -y libtbb12
# install Intel OpenMP
pip install intel-openmp
# link to tcmalloc
sudo ln -sf /usr/lib/x86_64-linux-gnu/libtcmalloc.so.4 /usr/lib/libtcmalloc.so
In the case of our toy model, using torch.backends.xeon.run_cpu increases the throughput to 162.15 SPS, a slight improvement over our previous maximum of 159.31 SPS.
Please see the PyTorch documentation for more features of torch.backends.xeon.run_cpu and more details on the environment variables it applies.
Optimization 9: Multi-worker Inference
Another popular technique for increasing resource utilization and scale is to load multiple instances of the AI model and run them in parallel in separate processes. Although this technique is more commonly applied on machines with many CPUs (separated into multiple NUMA nodes), not on our small 4-vCPU instance, we include it here for the sake of demonstration. In the command below, we run two instances of our model in parallel:
python -m torch.backends.xeon.run_cpu --ninstances 2 main.py
This results in a throughput of 169.4 SPS, a further modest but meaningful 4% increase.
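The scheduling pattern behind multi-worker inference can be sketched with Python's standard multiprocessing module. Here each worker process stands in for a separate model instance; the `infer_one` body is a dummy placeholder for a real forward pass (run_cpu additionally pins each instance to its own set of cores, which this sketch does not do):

```python
import multiprocessing as mp

def infer_one(batch):
    # placeholder for a real model forward pass inside a worker process
    return sum(batch)

def run_workers(num_workers, all_batches):
    # each worker process would hold its own copy of the model;
    # the pool distributes pending batches across idle workers
    with mp.Pool(processes=num_workers) as pool:
        return pool.map(infer_one, all_batches)

if __name__ == "__main__":
    print(run_workers(2, [[1, 2], [3, 4], [5, 6]]))
```

The trade-off is memory: each process carries a full copy of the model weights, so the number of instances is bounded by available RAM as well as core count.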
Optimization 10: INT8 Quantization
INT8 quantization is another common technique for accelerating AI model inference. In INT8 quantization, the floating-point data types of the model weights and activations are replaced by 8-bit integers. Intel's Xeon processors include dedicated accelerators for processing INT8 operations (e.g., see here). INT8 quantization can yield a meaningful increase in speed and a lower memory footprint. Importantly, the reduced bit precision can have a significant impact on the quality of the model output. There are many different approaches to INT8 quantization, some of which involve calibration or retraining, and a wide variety of tools and libraries for applying it. A full discussion of quantization is beyond the scope of this post.
Since in this post we are interested only in the potential performance impact, we demonstrate one quantization scheme using TorchAO, without consideration of the impact on model quality. In the code block below, we implement PyTorch 2 Export Quantization with the X86 Backend through Inductor. Please see the documentation for the full details:
from torchao.quantization.pt2e.quantize_pt2e import prepare_pt2e, convert_pt2e
import torchao.quantization.pt2e.quantizer.x86_inductor_quantizer as xiq

def quantize_model(model):
    x = torch.randn(4, 3, 224, 224).contiguous(
        memory_format=torch.channels_last)
    example_inputs = (x,)
    batch_dim = torch.export.Dim("batch")
    with torch.no_grad():
        exported_model = torch.export.export(
            model,
            example_inputs,
            dynamic_shapes=((batch_dim,
                             torch.export.Dim.STATIC,
                             torch.export.Dim.STATIC,
                             torch.export.Dim.STATIC),
                            )
        ).module()
    quantizer = xiq.X86InductorQuantizer()
    quantizer.set_global(xiq.get_default_x86_inductor_quantization_config())
    prepared_model = prepare_pt2e(exported_model, quantizer)
    prepared_model(*example_inputs)  # calibration pass
    converted_model = convert_pt2e(prepared_model)
    optimized_model = torch.compile(converted_model)
    return optimized_model

batch_size = 8
model = get_model(channels_last=True)
model = quantize_model(model)
batch = get_input(batch_size, channels_last=True)
infer_fn = get_inference_fn(model, enable_amp=True)
avg_time = benchmark(infer_fn, batch)
print(f"\nAverage samples per second: {(batch_size/avg_time):.2f}")
This results in a throughput of 172.67 SPS.
Please see here for more details on quantization in PyTorch.
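To build intuition for what the quantization step does numerically, here is a pure-Python sketch of symmetric per-tensor INT8 quantization, one common scheme among the many mentioned above (real tools also support asymmetric and per-channel variants):

```python
def quantize_int8(values):
    # map floats to int8 via x ~= q * scale, with scale = max|x| / 127
    scale = (max(abs(v) for v in values) / 127.0) or 1.0
    q = [max(-128, min(127, round(v / scale))) for v in values]
    return q, scale

def dequantize_int8(q, scale):
    return [v * scale for v in q]
```

The round trip introduces a bounded error of at most half a quantization step, which is the source of the accuracy degradation discussed above.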
Optimization 11: Graph Compilation and Execution With ONNX
A number of third-party libraries specialize in compiling PyTorch models into graph representations and optimizing them for runtime performance on target inference devices. One of the most popular is Open Neural Network Exchange (ONNX). ONNX performs ahead-of-time compilation of AI/ML models and executes them using a dedicated runtime library.
While ONNX compilation support is included in PyTorch, we require the following library for executing an ONNX model:
pip install onnxruntime
In the code block below, we demonstrate ONNX compilation and model execution:
def export_to_onnx(model, onnx_path="resnet50.onnx"):
    dummy_input = torch.randn(4, 3, 224, 224)
    batch = torch.export.Dim("batch")
    torch.onnx.export(
        model,
        dummy_input,
        onnx_path,
        input_names=["input"],
        output_names=["output"],
        dynamic_shapes=((batch,
                         torch.export.Dim.STATIC,
                         torch.export.Dim.STATIC,
                         torch.export.Dim.STATIC),
                        ),
        dynamo=True
    )
    return onnx_path

def onnx_infer_fn(onnx_path):
    import onnxruntime as ort
    sess = ort.InferenceSession(
        onnx_path,
        providers=["CPUExecutionProvider"]
    )
    input_name = sess.get_inputs()[0].name
    def infer_fn(batch):
        result = sess.run(None, {input_name: batch})
        return result
    return infer_fn

batch_size = 8
model = get_model()
onnx_path = export_to_onnx(model)
batch = get_input(batch_size).numpy()
infer_fn = onnx_infer_fn(onnx_path)
avg_time = benchmark(infer_fn, batch)
print(f"\nAverage samples per second: {(batch_size/avg_time):.2f}")
The resulting throughput is 44.92 SPS, far lower than in our previous experiments. In the case of our toy model, the ONNX runtime does not provide a benefit.
Optimization 12: Graph Compilation and Execution with OpenVINO
Another open-source toolkit aimed at deploying highly performant AI solutions is OpenVINO. OpenVINO is highly optimized for model execution on Intel hardware, e.g., by fully leveraging the Intel AMX instructions. A common way to apply OpenVINO in PyTorch is to first convert the model to ONNX:
from openvino import Core

def compile_openvino_model(onnx_path):
    core = Core()
    model = core.read_model(onnx_path)
    compiled = core.compile_model(model, "CPU")
    return compiled

def openvino_infer_fn(compiled_model):
    def infer_fn(batch):
        result = compiled_model([batch])[0]
        return result
    return infer_fn

batch_size = 8
model = get_model()
onnx_path = export_to_onnx(model)
ovm = compile_openvino_model(onnx_path)
batch = get_input(batch_size).numpy()
infer_fn = openvino_infer_fn(ovm)
avg_time = benchmark(infer_fn, batch)
print(f"\nAverage samples per second: {(batch_size/avg_time):.2f}")
The result of this optimization is a throughput of 297.33 SPS, nearly twice as fast as our previous best experiment!
Please see the official documentation for more details on OpenVINO.
Optimization 13: INT8 Quantization in OpenVINO with NNCF
As our final optimization, we revisit INT8 quantization, this time in the framework of OpenVINO compilation. As before, there are a variety of methods for performing quantization, each aimed at minimizing the impact on model quality. Here we demonstrate the basic flow using the NNCF library, as documented here.
class RandomDataset(torch.utils.data.Dataset):
    def __len__(self):
        return 10000

    def __getitem__(self, idx):
        return torch.randn(3, 224, 224)

def nncf_quantize(onnx_path):
    import nncf
    core = Core()
    onnx_model = core.read_model(onnx_path)
    calibration_loader = torch.utils.data.DataLoader(RandomDataset())
    input_name = onnx_model.inputs[0].get_any_name()
    transform_fn = lambda data_item: {input_name: data_item.numpy()}
    calibration_dataset = nncf.Dataset(calibration_loader, transform_fn)
    quantized_model = nncf.quantize(onnx_model, calibration_dataset)
    return core.compile_model(quantized_model, "CPU")

batch_size = 8
model = get_model()
onnx_path = export_to_onnx(model)
q_model = nncf_quantize(onnx_path)
batch = get_input(batch_size).numpy()
infer_fn = openvino_infer_fn(q_model)
avg_time = benchmark(infer_fn, batch)
print(f"\nAverage samples per second: {(batch_size/avg_time):.2f}")
This results in a throughput of 482.46(!!) SPS, another drastic improvement and over 18 times faster than our baseline experiment.
Results
We summarize the results of our experiments in the table below:
In the case of our toy model, the optimization steps we demonstrated resulted in huge performance gains. Importantly, the impact of each optimization can vary greatly based on the details of the model. You may find that some of these techniques do not apply to your model, or do not result in improved performance. For example, when we reapply the same sequence of optimizations to a Vision Transformer (ViT) model, the resulting performance boost is 8.41X, still significant, but less than the 18.36X of our experiment. Please see the appendix to this post for details.
Our focus has been on runtime performance, but it is critical that you also evaluate the impact of each optimization on other metrics that are important to you, most importantly model quality.
There are, undoubtedly, many more optimization techniques that can be applied; we have merely scratched the surface. Hopefully, the techniques covered here will provide a useful starting point for your own optimization efforts.
Summary
This post continues our series on the important topic of AI/ML model runtime performance analysis and optimization. Our focus in this post was on model inference on Intel® Xeon® CPU processors. Given the ubiquity and prevalence of CPUs, the ability to execute models on them in a reliable and performant manner can be extremely compelling. As we have shown, by applying a number of relatively simple techniques, we can achieve considerable gains in model performance, with profound implications for inference costs and inference latency.
Please don't hesitate to reach out with comments, questions, or corrections.
Appendix: Vision Transformer Optimization
To demonstrate how the impact of the runtime optimizations we discussed depends on the details of the AI/ML model, we reran our experiment on a Vision Transformer (ViT) model from the popular timm library:
from timm.models.vision_transformer import VisionTransformer

def get_model(channels_last=False, ipex_optimize=False, compile=False):
    model = VisionTransformer()
    if channels_last:
        model = model.to(memory_format=torch.channels_last)
    model = model.eval()
    if ipex_optimize:
        model = ipex.optimize(model, dtype=torch.bfloat16)
    if compile:
        model = torch.compile(model)
    return model
One modification in this experiment was to apply OpenVINO compilation directly to the PyTorch model rather than to an intermediate ONNX model. This was due to the fact that OpenVINO compilation failed on the ViT ONNX model. The revised NNCF quantization and OpenVINO compilation sequence is shown below:
import openvino as ov
import nncf

batch_size = 8
model = get_model()
calibration_loader = torch.utils.data.DataLoader(RandomDataset())
calibration_dataset = nncf.Dataset(calibration_loader)

# quantize the PyTorch model
model = nncf.quantize(model, calibration_dataset)
ovm = ov.convert_model(model, example_input=torch.randn(1, 3, 224, 224))
ovm = ov.compile_model(ovm)

batch = get_input(batch_size).numpy()
infer_fn = openvino_infer_fn(ovm)
avg_time = benchmark(infer_fn, batch)
print(f"\nAverage samples per second: {(batch_size/avg_time):.2f}")
The table below summarizes the results of the optimizations discussed in this post when applied to the ViT model:


