NumPy API on a GPU?

Is this the future of Python numerical computation?

Late last year, NVIDIA made a significant announcement concerning the future of Python-based numerical computing. I wouldn’t be surprised if you missed it. After all, every other announcement from every AI company, then and now, seems mega-important.

That announcement introduced the cuNumeric library, a drop-in replacement for the ubiquitous NumPy library, built on top of the Legate framework.

Who are Nvidia?

Most people will probably know Nvidia from the ultra-fast chips that power computers and data centres all over the world. You may also be familiar with Nvidia’s charismatic, leather jacket-loving CEO, Jensen Huang, who seems to pop up on the stage of every AI conference these days.

What many people don’t know is that Nvidia also designs and creates innovative device architectures and related software. One of its most prized products is the Compute Unified Device Architecture (CUDA). CUDA is NVIDIA’s proprietary parallel-computing platform and programming model. Since its launch in 2007, it has evolved into a comprehensive ecosystem comprising drivers, runtime, compilers, math libraries, debugging and profiling tools, and container images. The result is a tightly tuned hardware and software loop that keeps NVIDIA GPUs at the centre of modern high-performance and AI workloads.

What is Legate?

Legate is an NVIDIA-led open-source runtime layer that lets you run familiar Python data-science libraries (NumPy, cuNumeric, Pandas-style APIs, sparse linear-algebra kernels, …) on multi-core CPUs, single or multi-GPU nodes, and even multi-node clusters without changing your Python code. It translates high-level array operations into a graph of fine-grained tasks and hands that graph to the C++ Legion runtime, which schedules the tasks, partitions the data, and moves tiles between CPUs, GPUs and network links for you.

In a nutshell, Legate lets familiar single-node Python libraries scale transparently to multi-GPU, multi-node machines.

What is cuNumeric?

cuNumeric is a drop-in replacement for NumPy whose array operations are executed by Legate’s task engine and accelerated on one or many NVIDIA GPUs (or, if no GPU is present, on all CPU cores). In practice, you install it and need only change one import line to start using it in place of your regular NumPy code. For example …

# old
import numpy as np
...
...

# new
import cupynumeric as np     # everything else stays the same
...
...

… and run your script on the terminal with the legate command.
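Because the switch is a single import, one way to keep a script runnable on machines without cuNumeric is to pick the array module at start-up. The sketch below is only an illustration: the helper name first_available is made up for this article, and it assumes the two libraries are importable under the module names cupynumeric and numpy.

```python
import importlib
import importlib.util


def first_available(candidates):
    """Return the first importable module from a list of module names.

    Hypothetical helper for illustration; not part of cuNumeric or Legate.
    """
    for name in candidates:
        # find_spec returns None when a top-level module is not installed
        if importlib.util.find_spec(name) is not None:
            return importlib.import_module(name)
    raise ImportError(f"none of these modules are installed: {candidates}")


# Prefer cuNumeric, fall back to plain NumPy when it is missing:
# np = first_available(["cupynumeric", "numpy"])
```

The rest of the script keeps using np either way; only the launch command differs (legate for cuNumeric, python for NumPy).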

Behind the scenes, cuNumeric converts each NumPy call you make (for example np.sin, np.linalg.svd, fancy indexing, broadcasting, reductions, and so on) into Legate tasks. These tasks will:

Partition your arrays into tiles sized to fit GPU memory.
Schedule each tile on the best available device (GPU or CPU).
Overlap compute with communication when the workload spans multiple GPUs or nodes.
Spill tiles to NVMe/SSD automatically when your dataset outgrows GPU RAM.
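To give a feel for the partition-then-combine idea, here is a conceptual sketch in plain Python. It is not Legate’s actual implementation: the function names and the tile size are invented for illustration, and the "devices" are just iterations of a loop.

```python
def split_into_tiles(values, tile_size):
    """Cut a flat sequence into consecutive tiles of at most tile_size items."""
    return [values[i:i + tile_size] for i in range(0, len(values), tile_size)]


def tiled_sum_of_squares(values, tile_size):
    """Compute sum(v * v) tile by tile, then combine the partial results."""
    tiles = split_into_tiles(values, tile_size)
    # Each tile could be handed to a different device; here it is just a loop.
    partials = [sum(v * v for v in tile) for tile in tiles]
    # A final reduction combines the per-tile partial results.
    return sum(partials)


data = list(range(10))
print(split_into_tiles(data, 4))      # [[0, 1, 2, 3], [4, 5, 6, 7], [8, 9]]
print(tiled_sum_of_squares(data, 4))  # 285
```

The result is the same as computing the sum over the whole list at once; tiling only changes where and when each piece of work happens.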

Because the API of cuNumeric mirrors NumPy’s nearly 1-for-1, existing scientific or data-science code can scale from a laptop to a multi-GPU cluster without a rewrite.

Performance benefits

So, this all seems great, right? But it only makes sense if it results in tangible performance improvements over using NumPy, and Nvidia is making some strong claims that this is the case. As data scientists, machine learning engineers and data engineers typically use NumPy a lot, we can appreciate that this can be a crucial aspect of the systems we write and maintain.

Now, I don’t have a cluster of GPUs or a supercomputer to test this on, but my desktop PC does have an Nvidia GeForce RTX 4070 Ti GPU, and we’re going to use that to test out some of Nvidia’s claims.

(base) tom@tpr-desktop:~$ nvidia-smi
Sun Jun 15 15:26:36 2025
+-----------------------------------------------------------------------------------------+
| NVIDIA-SMI 565.75                 Driver Version: 566.24         CUDA Version: 12.7     |
|-----------------------------------------+------------------------+----------------------+
| GPU  Name                 Persistence-M | Bus-Id          Disp.A | Volatile Uncorr. ECC |
| Fan  Temp   Perf          Pwr:Usage/Cap |           Memory-Usage | GPU-Util  Compute M. |
|                                         |                        |               MIG M. |
|=========================================+========================+======================|
|   0  NVIDIA GeForce RTX 4070 Ti     On  |   00000000:01:00.0  On |                  N/A |
| 32%   29C    P8              9W /  285W |    1345MiB /  12282MiB |      2%      Default |
|                                         |                        |                  N/A |
+-----------------------------------------+------------------------+----------------------+

+-----------------------------------------------------------------------------------------+
| Processes:                                                                              |
|  GPU   GI   CI        PID   Type   Process name                              GPU Memory |
|        ID   ID                                                               Usage      |
|=========================================================================================|
|  No running processes found                                                             |
+-----------------------------------------------------------------------------------------+

I’ll install cuNumeric and NumPy on my PC to conduct comparative tests. This will help us assess whether Nvidia’s claims are accurate and understand the performance differences between the two libraries.

Setting up a development environment

As always, I like to set up a separate development environment to run my tests. That way, nothing I do in that environment will affect any of my other projects. At the time of writing, cuNumeric isn’t available to install on Windows, so I’ll be using WSL2 Ubuntu for Windows instead.

I’ll be using Miniconda to set up my environment, but feel free to use whichever tool you’re comfortable with.

$ conda create -n cunumeric-env python=3.10 -c conda-forge
$ conda activate cunumeric-env
$ conda install -c conda-forge -c legate cupynumeric
$ conda install -c conda-forge ucx cuda-cudart cuda-version=12

Code example 1: A simple matrix multiplication

Matrix multiplication is the bread and butter of the mathematical operations that underpin so many AI systems, so it makes sense to try that operation out first.

Note that in all my examples, I’ll run the NumPy and cuNumeric code snippets five times in a row and average the time taken for each. I also perform a “warm-up” step on the GPU before the timing run to account for overheads such as just-in-time (JIT) compilation.
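The warm-up-then-average pattern can be captured in a small generic helper. This is a minimal sketch and not the code used for the benchmarks in this article: the name time_runs and its parameters are invented here, and it uses time.perf_counter rather than time.time.

```python
import time


def time_runs(fn, runs=5, warmup=1):
    """Call fn a few times untimed, then time `runs` calls and average them.

    Hypothetical helper for illustration. For GPU work, fn itself must block
    until the result is ready, otherwise only the launch is timed.
    """
    for _ in range(warmup):
        fn()  # untimed: absorbs one-off costs such as JIT compilation

    durations = []
    for _ in range(runs):
        start = time.perf_counter()
        fn()
        durations.append(time.perf_counter() - start)

    return sum(durations) / len(durations), durations


avg, durations = time_runs(lambda: sum(i * i for i in range(100_000)))
print(f"average over {len(durations)} runs: {avg:.6f}s")
```

The scripts in this article follow the same shape by hand, and copy the result back to a host-side NumPy array before stopping the clock on the GPU side.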

import time
import gc
import argparse
import sys

def benchmark_numpy(n, runs):
    """Runs the matrix multiplication benchmark using standard NumPy on the CPU."""
    import numpy as np

    print(f"--- NumPy (CPU) Benchmark ---")
    print(f"Multiplying two {n}×{n} matrices ({runs} runs)\n")

    # 1. Generate data ONCE before the timing loop.
    print(f"Generating two {n}x{n} random matrices on CPU...")
    A = np.random.rand(n, n).astype(np.float32)
    B = np.random.rand(n, n).astype(np.float32)

    # 2. Perform one untimed warm-up run.
    print("Performing warm-up run...")
    _ = np.matmul(A, B)
    print("Warm-up complete.\n")

    # 3. Perform the timed runs.
    times = []
    for i in range(runs):
        start = time.time()
        # The operation being timed. The @ operator is a convenient
        # shorthand for np.matmul.
        C = A @ B
        end = time.time()

        duration = end - start
        times.append(duration)
        print(f"Run {i+1}: time = {duration:.4f}s")
        del C # Clean up the result matrix
        gc.collect()

    avg = sum(times) / len(times)
    print(f"\nNumPy average: {avg:.4f}s\n")
    return avg

def benchmark_cunumeric(n, runs):
    """Runs the matrix multiplication benchmark using cuNumeric on the GPU."""
    import cupynumeric as cn
    import numpy as np # Import numpy for the canonical sync

    print(f"--- cuNumeric (GPU) Benchmark ---")
    print(f"Multiplying two {n}×{n} matrices ({runs} runs)\n")

    # 1. Generate data ONCE on the GPU before the timing loop.
    print(f"Generating two {n}x{n} random matrices on GPU...")
    A = cn.random.rand(n, n).astype(np.float32)
    B = cn.random.rand(n, n).astype(np.float32)

    # 2. Perform a crucial untimed warm-up run for JIT compilation.
    print("Performing warm-up run...")
    C_warmup = cn.matmul(A, B)
    # Best practice for synchronization: force a copy back to the CPU.
    _ = np.array(C_warmup)
    print("Warm-up complete.\n")

    # 3. Perform the timed runs.
    times = []
    for i in range(runs):
        start = time.time()

        # Launch the operation on the GPU
        C = A @ B

        # Synchronize by converting the result to a host-side NumPy array.
        np.array(C)

        end = time.time()

        duration = end - start
        times.append(duration)
        print(f"Run {i+1}: time = {duration:.4f}s")
        del C
        gc.collect()

    avg = sum(times) / len(times)
    print(f"\ncuNumeric average: {avg:.4f}s\n")
    return avg

if __name__ == "__main__":
    parser = argparse.ArgumentParser(
        description="Benchmark matrix multiplication on NumPy (CPU) vs. cuNumeric (GPU)."
    )
    parser.add_argument(
        "-n", "--n", type=int, default=3000, help="Matrix size (n x n)"
    )
    parser.add_argument(
        "-r", "--runs", type=int, default=5, help="Number of timing runs"
    )
    parser.add_argument(
        "--cunumeric", action="store_true", help="Run the cuNumeric (GPU) version"
    )

    # parse_known_args() tolerates any extra arguments passed along by Legate
    args, unknown = parser.parse_known_args()

    # The dispatcher logic
    if args.cunumeric or "--cunumeric" in unknown:
        benchmark_cunumeric(args.n, args.runs)
    else:
        benchmark_numpy(args.n, args.runs)

Running the NumPy side of things uses the regular python example1.py command-line syntax. Running under Legate is more involved. The command shown in the output below (LEGATE_AUTO_CONFIG=0 legate --cpus 1 --gpus 1 --omps 0 example1.py --cunumeric) disables Legate’s automatic configuration and then launches the example1.py script under Legate with one CPU, one GPU, and zero OpenMP threads, using the cuNumeric backend.

Here is the output.

(cunumeric-env) tom@tpr-desktop:~$ python example1.py
--- NumPy (CPU) Benchmark ---
Multiplying two 3000×3000 matrices (5 runs)

Generating two 3000x3000 random matrices on CPU...
Performing warm-up run...
Warm-up complete.

Run 1: time = 0.0976s
Run 2: time = 0.0987s
Run 3: time = 0.0957s
Run 4: time = 0.1063s
Run 5: time = 0.0989s

NumPy average: 0.0994s

(cunumeric-env) tom@tpr-desktop:~$ LEGATE_AUTO_CONFIG=0 legate --cpus 1 --gpus 1 --omps 0 example1.py --cunumeric
[0 - 7f2e8fcc8480]    0.000000 {5}{module_config}: Module numa cannot detect resources.
[0 - 7f2e8fcc8480]    0.000000 {4}{topology}: cannot open /sys/devices/system/node/
[0 - 7f2e8fcc8480]    0.000049 {4}{threads}: reservation ('GPU ctxsync 0x55cd5fd34530') can't be satisfied
--- cuNumeric (GPU) Benchmark ---
Multiplying two 3000×3000 matrices (5 runs)

Generating two 3000x3000 random matrices on GPU...
Performing warm-up run...
Warm-up complete.

Run 1: time = 0.0113s
Run 2: time = 0.0089s
Run 3: time = 0.0086s
Run 4: time = 0.0090s
Run 5: time = 0.0087s

cuNumeric average: 0.0093s

Well, that’s an impressive start. cuNumeric is registering a roughly 10x speedup over NumPy.

The warnings that Legate is outputting can be ignored. These are informational, indicating that Legate couldn’t find details about the machine’s CPU/memory layout (NUMA) or enough CPU cores to manage the GPU.

Code example 2: Logistic regression

Logistic regression is a foundational tool in data science because it provides a simple, interpretable way to model and predict binary outcomes (yes/no, pass/fail, click/no-click). In this example, we’ll measure how long it takes to train a simple binary classifier on synthetic data. The script first generates N samples with D features (X) and a corresponding random 0/1 label vector (y). For each of the five runs, it initialises the weight vector w to zeros, then performs 500 iterations of batch gradient descent: computing the linear predictions z = X.dot(w), applying the sigmoid p = 1/(1+exp(-z)), computing the gradient grad = X.T.dot(p - y) / N, and updating the weights with w -= 0.1 * grad. The script records the elapsed time for each run, cleans up memory, and finally prints the average training time.
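To make the update rule concrete before the full benchmark, here is the same gradient step written out in plain Python on a four-sample toy dataset. This is an illustrative sketch only: the function names and the toy data are made up, and the benchmark script itself uses array operations instead of Python loops.

```python
import math


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def gradient_step(X, y, w, alpha):
    """One batch gradient descent step for logistic regression."""
    n = len(X)
    # z = X.dot(w) followed by p = sigmoid(z), one value per sample
    p = [sigmoid(sum(x_j * w_j for x_j, w_j in zip(row, w))) for row in X]
    # grad = X.T.dot(p - y) / N, one value per feature
    grad = [sum(X[i][j] * (p[i] - y[i]) for i in range(n)) / n
            for j in range(len(w))]
    # w -= alpha * grad
    return [w_j - alpha * g_j for w_j, g_j in zip(w, grad)]


# Toy data: a constant bias column plus one feature; labels flip at x >= 2
X = [[1.0, 0.0], [1.0, 1.0], [1.0, 2.0], [1.0, 3.0]]
y = [0.0, 0.0, 1.0, 1.0]

w = [0.0, 0.0]
for _ in range(500):
    w = gradient_step(X, y, w, alpha=0.1)
print(w)
```

With w starting at zero, every prediction is 0.5 on the first step, so the first update only moves the weight of the second feature.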

import time
import gc
import argparse
import sys

# --- Reusable Training Function ---
# By putting the training loop in its own function, we avoid code duplication.
# The `np` argument allows us to pass in either the numpy or cupynumeric module.
def train_logistic_regression(np, X, y, iters, alpha):
    """Performs a fixed number of gradient descent iterations."""
    # Ensure w starts on the right device (CPU or GPU)
    w = np.zeros(X.shape[1])

    for _ in range(iters):
        z = X.dot(w)
        p = 1.0 / (1.0 + np.exp(-z))
        grad = X.T.dot(p - y) / X.shape[0]
        w -= alpha * grad

    return w

def benchmark_numpy(n_samples, n_features, iters, alpha):
    """Runs the logistic regression benchmark using standard NumPy on the CPU."""
    import numpy as np

    print(f"--- NumPy (CPU) Benchmark ---")
    print(f"Training on {n_samples} samples, {n_features} features for {iters} iterations\n")

    # 1. Generate data ONCE before the timing loop.
    print("Generating random dataset on CPU...")
    X = np.random.rand(n_samples, n_features)
    y = (np.random.rand(n_samples) > 0.5).astype(np.float64)

    # 2. Perform one untimed warm-up run.
    print("Performing warm-up run...")
    _ = train_logistic_regression(np, X, y, iters, alpha)
    print("Warm-up complete.\n")

    # 3. Perform the timed runs.
    # Note: `args` is the module-level namespace parsed in the __main__ block.
    times = []
    for i in range(args.runs):
        start = time.time()
        # The operation being timed
        _ = train_logistic_regression(np, X, y, iters, alpha)
        end = time.time()

        duration = end - start
        times.append(duration)
        print(f"Run {i+1}: time = {duration:.3f}s")
        gc.collect()

    avg = sum(times) / len(times)
    print(f"\nNumPy average: {avg:.3f}s\n")
    return avg

def benchmark_cunumeric(n_samples, n_features, iters, alpha):
    """Runs the logistic regression benchmark using cuNumeric on the GPU."""
    import cupynumeric as cn
    import numpy as np # Also import numpy for the canonical synchronization

    print(f"--- cuNumeric (GPU) Benchmark ---")
    print(f"Training on {n_samples} samples, {n_features} features for {iters} iterations\n")

    # 1. Generate data ONCE on the GPU before the timing loop.
    print("Generating random dataset on GPU...")
    X = cn.random.rand(n_samples, n_features)
    y = (cn.random.rand(n_samples) > 0.5).astype(np.float64)

    # 2. Perform a crucial untimed warm-up run for JIT compilation.
    print("Performing warm-up run...")
    w_warmup = train_logistic_regression(cn, X, y, iters, alpha)
    # Best practice for synchronization: force a copy back to the CPU.
    _ = np.array(w_warmup)
    print("Warm-up complete.\n")

    # 3. Perform the timed runs.
    # Note: `args` is the module-level namespace parsed in the __main__ block.
    times = []
    for i in range(args.runs):
        start = time.time()

        # Launch the operation on the GPU
        w = train_logistic_regression(cn, X, y, iters, alpha)

        # Synchronize by converting the final result back to a NumPy array.
        np.array(w)

        end = time.time()

        duration = end - start
        times.append(duration)
        print(f"Run {i+1}: time = {duration:.3f}s")
        del w
        gc.collect()

    avg = sum(times) / len(times)
    print(f"\ncuNumeric average: {avg:.3f}s\n")
    return avg

if __name__ == "__main__":
    # A more robust argument parsing setup
    parser = argparse.ArgumentParser(
        description="Benchmark logistic regression on NumPy (CPU) vs. cuNumeric (GPU)."
    )
    # Hyperparameters for the model
    parser.add_argument(
        "-n", "--n_samples", type=int, default=2_000_000, help="Number of data samples"
    )
    parser.add_argument(
        "-d", "--n_features", type=int, default=10, help="Number of features"
    )
    parser.add_argument(
        "-i", "--iters", type=int, default=500, help="Number of gradient descent iterations"
    )
    parser.add_argument(
        "-a", "--alpha", type=float, default=0.1, help="Learning rate"
    )
    # Benchmark control
    parser.add_argument(
        "-r", "--runs", type=int, default=5, help="Number of timing runs"
    )
    parser.add_argument(
        "--cunumeric", action="store_true", help="Run the cuNumeric (GPU) version"
    )

    args, unknown = parser.parse_known_args()

    # Dispatcher logic
    if args.cunumeric or "--cunumeric" in unknown:
        benchmark_cunumeric(args.n_samples, args.n_features, args.iters, args.alpha)
    else:
        benchmark_numpy(args.n_samples, args.n_features, args.iters, args.alpha)

And the outputs.

(cunumeric-env) tom@tpr-desktop:~$ python example2.py
--- NumPy (CPU) Benchmark ---
Training on 2000000 samples, 10 features for 500 iterations

Generating random dataset on CPU...
Performing warm-up run...
Warm-up complete.

Run 1: time = 12.292s
Run 2: time = 11.830s
Run 3: time = 11.903s
Run 4: time = 12.843s
Run 5: time = 11.964s

NumPy average: 12.166s

(cunumeric-env) tom@tpr-desktop:~$ LEGATE_AUTO_CONFIG=0 legate --cpus 1 --gpus 1 --omps 0 example2.py --cunumeric
[0 - 7f04b535c480]    0.000000 {5}{module_config}: Module numa cannot detect resources.
[0 - 7f04b535c480]    0.000000 {4}{topology}: cannot open /sys/devices/system/node/
[0 - 7f04b535c480]    0.001149 {4}{threads}: reservation ('GPU ctxsync 0x55fb037cf140') can't be satisfied
--- cuNumeric (GPU) Benchmark ---
Training on 2000000 samples, 10 features for 500 iterations

Generating random dataset on GPU...
Performing warm-up run...
Warm-up complete.

Run 1: time = 1.964s
Run 2: time = 1.957s
Run 3: time = 1.968s
Run 4: time = 1.955s
Run 5: time = 1.960s

cuNumeric average: 1.961s

Not quite as impressive as our first example, but a roughly 6x speedup on an already fast NumPy program is not to be sniffed at.

Code example 3: Solving linear equations

This script benchmarks how long it takes to solve a dense 3000×3000 system of linear equations. This is a fundamental operation in linear algebra: solving an equation of the form Ax = b, where A is a large grid of numbers (a 3000×3000 matrix in this case), and b is a list of numbers (a vector).

The goal is to find the unknown list of numbers x that makes the equation true. This is a computationally intensive task that’s at the heart of many scientific simulations, engineering problems, financial models, and even some AI algorithms.
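For intuition about what "solving Ax = b" involves, here is a tiny worked example in plain Python using Gaussian elimination with partial pivoting. It is a teaching sketch only: the function name is made up, and it is not how NumPy or cuNumeric implement their solvers (NumPy, for instance, delegates to LAPACK).

```python
def solve_linear_system(A, b):
    """Solve A x = b for a small dense system by Gaussian elimination."""
    n = len(A)
    # Work on an augmented copy [A | b] so the inputs are left untouched
    M = [[float(v) for v in row] + [float(rhs)] for row, rhs in zip(A, b)]

    for col in range(n):
        # Partial pivoting: bring the largest remaining entry into place
        pivot = max(range(col, n), key=lambda r: abs(M[r][col]))
        if M[pivot][col] == 0.0:
            raise ValueError("matrix is singular")
        M[col], M[pivot] = M[pivot], M[col]
        # Eliminate this column from every row below the pivot row
        for r in range(col + 1, n):
            factor = M[r][col] / M[col][col]
            for c in range(col, n + 1):
                M[r][c] -= factor * M[col][c]

    # Back substitution on the resulting upper-triangular system
    x = [0.0] * n
    for r in range(n - 1, -1, -1):
        known = sum(M[r][c] * x[c] for c in range(r + 1, n))
        x[r] = (M[r][n] - known) / M[r][r]
    return x


# 2x + y = 3 and x + 3y = 5 have the solution x = 0.8, y = 1.4
print(solve_linear_system([[2, 1], [1, 3]], [3, 5]))
```

The work grows roughly with the cube of the matrix size, which is why a 3000×3000 system makes a reasonable benchmark.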

import time
import gc
import argparse
import sys # Import sys to inspect arguments

# Note: The library imports (numpy and cupynumeric) are now done *inside*
# their respective functions to keep them separate and avoid import errors.

def benchmark_numpy(n, runs):
    """Runs the linear solve benchmark using standard NumPy on the CPU."""
    import numpy as np

    print(f"--- NumPy (CPU) Benchmark ---")
    print(f"Solving {n}×{n} A x = b ({runs} runs)\n")

    # 1. Generate data ONCE before the timing loop.
    print("Generating random system on CPU...")
    A = np.random.randn(n, n).astype(np.float32)
    b = np.random.randn(n).astype(np.float32)

    # 2. Perform one untimed warm-up run. This is good practice even for
    # the CPU to ensure caches are warm and any one-time setup is done.
    print("Performing warm-up run...")
    _ = np.linalg.solve(A, b)
    print("Warm-up complete.\n")

    # 3. Perform the timed runs.
    times = []
    for i in range(runs):
        start = time.time()
        # The operation being timed
        x = np.linalg.solve(A, b)
        end = time.time()

        duration = end - start
        times.append(duration)
        print(f"Run {i+1}: time = {duration:.6f}s")
        # Clean up the result to be safe with memory
        del x
        gc.collect()

    avg = sum(times) / len(times)
    print(f"\nNumPy average: {avg:.6f}s\n")
    return avg

def benchmark_cunumeric(n, runs):
    """Runs the linear solve benchmark using cuNumeric on the GPU."""
    import cupynumeric as cn
    import numpy as np # Also import numpy for the canonical synchronization

    print(f"--- cuNumeric (GPU) Benchmark ---")
    print(f"Solving {n}×{n} A x = b ({runs} runs)\n")

    # 1. Generate data ONCE on the GPU before the timing loop.
    # This ensures we aren't timing the data transfer in our main loop.
    print("Generating random system on GPU...")
    A = cn.random.randn(n, n).astype(np.float32)
    b = cn.random.randn(n).astype(np.float32)

    # 2. Perform a crucial untimed warm-up run. This handles JIT
    # compilation and other one-time GPU setup costs.
    print("Performing warm-up run...")
    x_warmup = cn.linalg.solve(A, b)
    # Best practice for synchronization: force a copy back to the CPU.
    _ = np.array(x_warmup)
    print("Warm-up complete.\n")

    # 3. Perform the timed runs.
    times = []
    for i in range(runs):
        start = time.time()

        # Launch the operation on the GPU
        x = cn.linalg.solve(A, b)

        # Synchronize by converting the result to a host-side NumPy array.
        # This is guaranteed to block until the GPU has finished.
        np.array(x)

        end = time.time()

        duration = end - start
        times.append(duration)
        print(f"Run {i+1}: time = {duration:.6f}s")
        # Clean up the GPU array result
        del x
        gc.collect()

    avg = sum(times) / len(times)
    print(f"\ncuNumeric average: {avg:.6f}s\n")
    return avg

if __name__ == "__main__":
    # A more robust argument parsing setup
    parser = argparse.ArgumentParser(
        description="Benchmark linear solve on NumPy (CPU) vs. cuNumeric (GPU)."
    )
    parser.add_argument(
        "-n", "--n", type=int, default=3000, help="Matrix size (n x n)"
    )
    parser.add_argument(
        "-r", "--runs", type=int, default=5, help="Number of timing runs"
    )

    # Use parse_known_args() to handle potential extra arguments from Legate
    args, unknown = parser.parse_known_args()

    # The dispatcher logic: check if "--cunumeric" is in the command line.
    # This is a simple and effective way to switch between modes.
    if "--cunumeric" in sys.argv or "--cunumeric" in unknown:
        benchmark_cunumeric(args.n, args.runs)
    else:
        benchmark_numpy(args.n, args.runs)

The outputs.

(cunumeric-env) tom@tpr-desktop:~$ python example4.py
--- NumPy (CPU) Benchmark ---
Solving 3000×3000 A x = b (5 runs)

Generating random system on CPU...
Performing warm-up run...
Warm-up complete.

Run 1: time = 0.133075s
Run 2: time = 0.126129s
Run 3: time = 0.135849s
Run 4: time = 0.137383s
Run 5: time = 0.138805s

NumPy average: 0.134248s

(cunumeric-env) tom@tpr-desktop:~$ LEGATE_AUTO_CONFIG=0 legate --cpus 1 --gpus 1 --omps 0 example4.py --cunumeric
[0 - 7f29f42ce480]    0.000000 {5}{module_config}: Module numa cannot detect resources.
[0 - 7f29f42ce480]    0.000000 {4}{topology}: cannot open /sys/devices/system/node/
[0 - 7f29f42ce480]    0.000053 {4}{threads}: reservation ('GPU ctxsync 0x562e88c28700') can't be satisfied
--- cuNumeric (GPU) Benchmark ---
Solving 3000×3000 A x = b (5 runs)

Generating random system on GPU...
Performing warm-up run...
Warm-up complete.

Run 1: time = 0.009685s
Run 2: time = 0.010043s
Run 3: time = 0.009966s
Run 4: time = 0.009739s
Run 5: time = 0.009383s

cuNumeric average: 0.009763s

That is a big result. The Nvidia cuNumeric run is roughly 14x faster than the NumPy run (an average of 0.134s versus 0.0098s).

Code example 4: Sorting

Sorting is such a fundamental part of everything that happens in computing, and modern computers are so fast that most developers don’t even think about it. But let’s see how much of a difference using cuNumeric can make to this ubiquitous operation. We’ll sort a large 1D array of 30,000,000 numbers.

# benchmark_sort.py
import time
import sys
import gc

# Array size
n = 30_000_000 # 30 million elements

def benchmark_numpy():
    import numpy as np
    print(f"Sorting an array of {n} elements with NumPy (5 runs)\n")

    times = []
    for i in range(5):
        data = np.random.randn(n).astype(np.float32)
        start = time.time()
        _ = np.sort(data)
        end = time.time()

        duration = end - start
        times.append(duration)
        print(f"Run {i+1}: time = {duration:.6f}s")
        del data
        gc.collect()

    avg = sum(times) / len(times)
    print(f"\nNumPy average: {avg:.6f}s\n")

def benchmark_cunumeric():
    import cupynumeric as np
    print(f"Sorting an array of {n} elements with cuNumeric (5 runs)\n")

    times = []
    for i in range(5):
        data = np.random.randn(n).astype(np.float32)
        start = time.time()
        _ = np.sort(data)
        # Force GPU sync
        _ = np.linalg.norm(np.zeros(()))
        end = time.time()

        duration = end - start
        times.append(duration)
        print(f"Run {i+1}: time = {duration:.6f}s")
        del data
        gc.collect()
        _ = np.linalg.norm(np.zeros(()))

    avg = sum(times) / len(times)
    print(f"\ncuNumeric average: {avg:.6f}s\n")

if __name__ == "__main__":
    if "--cunumeric" in sys.argv:
        benchmark_cunumeric()
    else:
        benchmark_numpy()

The outputs.

(cunumeric-env) tom@tpr-desktop:~$ python example5.py
--- NumPy (CPU) Benchmark ---
Sorting an array of 30000000 elements (5 runs)

Creating random array on CPU...
Performing warm-up run...
Warm-up complete.

Run 1: time = 0.588777s
Run 2: time = 0.586813s
Run 3: time = 0.586745s
Run 4: time = 0.586525s
Run 5: time = 0.583783s

NumPy average: 0.586529s
-----------------------------

(cunumeric-env) tom@tpr-desktop:~$ LEGATE_AUTO_CONFIG=0 legate --cpus 1 --gpus 1 --omps 0 example5.py --cunumeric
[0 - 7fd9e4615480]    0.000000 {5}{module_config}: Module numa cannot detect resources.
[0 - 7fd9e4615480]    0.000000 {4}{topology}: cannot open /sys/devices/system/node/
[0 - 7fd9e4615480]    0.000082 {4}{threads}: reservation ('GPU ctxsync 0x564489232fd0') can't be satisfied
--- cuNumeric (GPU) Benchmark ---
Sorting an array of 30000000 elements (5 runs)

Creating random array on GPU...
Performing warm-up run...
Warm-up complete.

Run 1: time = 0.010857s
Run 2: time = 0.007927s
Run 3: time = 0.007921s
Run 4: time = 0.008240s
Run 5: time = 0.007810s

cuNumeric average: 0.008551s
-------------------------------

Yet another hugely impressive performance from cuNumeric and Legate, at roughly 70x faster than NumPy.

Summary

This article introduced cuNumeric, an NVIDIA library designed as a high-performance, drop-in replacement for NumPy. The key takeaway is that data scientists can accelerate their existing Python code on NVIDIA GPUs with minimal effort, often by simply changing a single import line and running the script with the ‘legate’ command.

Two main components power the technology:

Legate: An open-source runtime layer from NVIDIA that automatically translates high-level Python operations into tasks. It manages the distribution of these tasks across single or multiple GPUs, handling data partitioning, memory management (even spilling to disk if needed), and optimising communication.
cuNumeric: The user-facing library that mirrors the NumPy API. When you make a call like np.matmul(), cuNumeric converts it into a task for the Legate engine to execute on the GPU.

I was able to validate Nvidia’s performance claims by running four benchmark tests on my desktop PC (with an NVIDIA RTX 4070 Ti GPU), comparing standard NumPy on the CPU against cuNumeric on the GPU.

The results demonstrate significant performance gains for cuNumeric:

Matrix Multiplication: ~10x faster than NumPy.
Logistic Regression Training: ~6x faster.
Solving Linear Equations: ~14x faster.
Sorting a Large Array: The biggest improvement, running roughly 70x faster.
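These factors can be recomputed from the average run times printed in the benchmark outputs. The snippet below is a small sketch that does exactly that; the dictionary keys and the speedup helper are just illustrative names.

```python
# Average run times in seconds (NumPy, cuNumeric), copied from the outputs above
averages = {
    "Matrix multiplication": (0.0994, 0.0093),
    "Logistic regression": (12.166, 1.961),
    "Linear solve": (0.134248, 0.009763),
    "Sort": (0.586529, 0.008551),
}


def speedup(numpy_seconds, cunumeric_seconds):
    """How many times faster the cuNumeric run was than the NumPy run."""
    return numpy_seconds / cunumeric_seconds


speedups = {name: speedup(cpu, gpu) for name, (cpu, gpu) in averages.items()}
for name, factor in speedups.items():
    print(f"{name}: {factor:.1f}x")
```

Keep in mind these are single-machine numbers from one desktop GPU, so the ratios will vary with hardware and problem size.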

In conclusion, I showed that cuNumeric successfully delivers on its promise, making the immense computational power of GPUs accessible to the broader Python data science community without requiring a steep learning curve or a complete code rewrite.

For more information and links to related resources, check out the original Nvidia announcement on cuNumeric here.
