    Optimizing Data Transfer in AI/ML Workloads

    By Editor Times Featured | January 3, 2026 | 16 Mins Read


    In a typical training setup, a deep learning model is executed on a dedicated GPU accelerator using input data batches it receives from a CPU host. Ideally, the GPU — the more expensive resource — should be maximally utilized, with minimal periods of idle time. In particular, this means that whenever it completes its execution on a batch, the next batch should be “ripe and ready” for processing. When this doesn’t happen, the GPU idles while waiting for input data — a common performance bottleneck often referred to as GPU starvation.

    In previous posts (e.g., see A Caching Strategy for Identifying Bottlenecks on the Data Input Pipeline), we discussed common causes of this issue, including inefficient storage retrieval, CPU resource exhaustion, and host-to-device transfer bottlenecks. In this post, we zoom in on data transfer bottlenecks and revisit their identification and resolution — this time with the help of NVIDIA Nsight™ Systems (nsys), a performance profiler designed for analyzing the system-wide activity of workloads running on NVIDIA GPUs.

    NVIDIA Nsight vs. PyTorch Profiler

    Readers familiar with our work may be surprised at the mention of the NVIDIA Nsight profiler rather than PyTorch Profiler. In our previous posts we have advocated strongly for the use of PyTorch Profiler in AI/ML model development as a tool for identifying and optimizing runtime performance. Time and again, we have demonstrated its application to a wide variety of performance issues. Its use does not require any special installations and it can be run without special OS permissions. The NVIDIA Nsight profiler, on the other hand, requires a dedicated system setup (or a dedicated NVIDIA container) and — for some of its features — elevated permissions, making it less accessible and more complicated to use than PyTorch Profiler.

    The two profilers differ in their focus: PyTorch Profiler is a framework profiler, tightly coupled with PyTorch and heavily focused on how models use the PyTorch software stack and supporting libraries. The NVIDIA Nsight profiler is a system-level profiler; it knows neither the details of the model being run nor which framework is being used, but rather how the components of the entire system are being used and utilized. While PyTorch Profiler excels at tracing the low-level operations of a PyTorch model execution, nsys provides a detailed view of the activity of the entire system (GPU hardware, CUDA streams, OS interrupts, network, PCIe, etc.). For many performance issues PyTorch Profiler is sufficient for identifying and fixing the source of the bottleneck, but some situations call for the nsys profiler — the “big guns” — to derive deeper insights into the inner workings of the underlying system.

    In this post we intend to demonstrate some of the unique capabilities of the nsys profiler and their application to the common data-transfer bottleneck.

    Outline

    To facilitate our discussion we will define a toy ML workload with a data-transfer performance bottleneck and proceed to introduce a number of successive optimizations in an attempt to solve it. Throughout the process, we will use the nsys profiler to analyze the system performance and assess the impact of the code changes.

    Setup

    We will run our experiments on an Amazon EC2 g6e.2xlarge instance with an NVIDIA L40S GPU running an AWS Deep Learning (Ubuntu 24.04) AMI with PyTorch (2.8). To install the nsys-cli profiler (version 2025.6.1) we follow the official NVIDIA guidelines:

    wget https://developer.nvidia.com/downloads/assets/tools/secure/nsight-systems/2025_6/NsightSystems-linux-cli-public-2025.6.1.190-3689520.deb
    sudo apt install ./NsightSystems-linux-cli-public-2025.6.1.190-3689520.deb

    The NVIDIA Tools Extension (NVTX) library allows us to annotate our code with human-readable labels to increase the readability and comprehension of the performance trace. While PyTorch offers built-in NVTX support via its torch.cuda.nvtx APIs, we will use the standalone nvtx package (version 0.2.14), which supports color-coding the trace timeline for better visual analysis:

    pip install nvtx

    Disclaimers

    The code we will share is intended for demonstrative purposes; please do not rely on its correctness or optimality. Please do not interpret our use of any library, tool, or platform as an endorsement of its use. The impact of the optimizations we will cover can vary greatly based on the details of the model and the runtime environment. Please be sure to assess their effect on your own use case before adopting them.

    Many thanks to Yitzhak Levi and Gilad Wasserman for their contributions to this post.

    A Toy PyTorch Model

    We introduce a training script deliberately designed to contain a bottleneck on the data-input pipeline.

    In the code block below we define a simple image classification model with a ResNet-18 backbone.

    import time, torch, torchvision
    
    DEVICE = "cuda"
    model = torchvision.models.resnet18().to(DEVICE).train()
    optimizer = torch.optim.Adam(model.parameters())

    Next, we define a synthetic dataset which we will use to train our toy model.

    from torch.utils.data import Dataset, DataLoader
    
    WARMUP_STEPS = 10
    PROFILE_STEPS = 3
    COOLDOWN_STEPS = 1
    TOTAL_STEPS = WARMUP_STEPS + PROFILE_STEPS + COOLDOWN_STEPS
    BATCH_SIZE = 64
    TOTAL_SAMPLES = TOTAL_STEPS * BATCH_SIZE
    IMG_SIZE = 512
    
    # A synthetic Dataset with random images and labels
    class FakeDataset(Dataset):
    
        def __len__(self):
            return TOTAL_SAMPLES
    
        def __getitem__(self, index):
            img = torch.randn((3, IMG_SIZE, IMG_SIZE))
            label = torch.tensor(index % 10)
            return img, label
    
    train_loader = DataLoader(
        FakeDataset(),
        batch_size=BATCH_SIZE
    )
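    As a sanity check on the sizing of the experiment, the batching arithmetic above can be reproduced with a stdlib-only sketch. The constants mirror the script; `iter_batches` is a hypothetical stand-in for the DataLoader's batching, not a PyTorch API:

```python
# Pure-Python sketch of how the synthetic dataset maps onto batches
# (no torch required; constants copied from the training script).
WARMUP_STEPS = 10
PROFILE_STEPS = 3
COOLDOWN_STEPS = 1
TOTAL_STEPS = WARMUP_STEPS + PROFILE_STEPS + COOLDOWN_STEPS
BATCH_SIZE = 64
TOTAL_SAMPLES = TOTAL_STEPS * BATCH_SIZE

def iter_batches(num_samples, batch_size):
    """Yield lists of sample indices, one list per batch."""
    for start in range(0, num_samples, batch_size):
        yield list(range(start, min(start + batch_size, num_samples)))

batches = list(iter_batches(TOTAL_SAMPLES, BATCH_SIZE))
print(len(batches))     # 14 — exactly one batch per training step
print(len(batches[0]))  # 64 samples per batch
```

    Because TOTAL_SAMPLES is an exact multiple of BATCH_SIZE, the dataset is sized to yield exactly one batch per step, with no ragged final batch.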

    Finally, we define a standard training step programmed to run the nsys profiler for three steps using the torch.cuda.profiler.start and stop commands — intended for use in conjunction with the nsys CLI. We highlight the components of the training step using the nvtx.annotate utility. Please refer to the official documentation for more details on profiling with nsys in PyTorch.

    import nvtx
    from torch.cuda import profiler
    
    def copy_data(batch):
        data, targets = batch
        data_gpu = data.to(DEVICE)
        targets_gpu = targets.to(DEVICE)
        return data_gpu, targets_gpu
    
    
    def compute_step(model, batch, optimizer):
        data, targets = batch
        output = model(data)
        loss = torch.nn.functional.cross_entropy(output, targets)
        loss.backward()
        optimizer.step()
        optimizer.zero_grad()
        return loss
    
    
    data_iter = iter(train_loader)
    
    for i in range(TOTAL_STEPS):
    
        if i == WARMUP_STEPS:
            # start nsys profiler
            torch.cuda.synchronize()
            start_time = time.perf_counter()
            profiler.start()
        elif i == WARMUP_STEPS + PROFILE_STEPS:
            # stop nsys profiler
            torch.cuda.synchronize()
            profiler.stop()
            end_time = time.perf_counter()
    
        with nvtx.annotate(f"Batch {i}", color="blue"):
            with nvtx.annotate("get batch", color="red"):
                batch = next(data_iter)
            with nvtx.annotate("copy batch", color="yellow"):
                batch = copy_data(batch)
            with nvtx.annotate("Compute", color="green"):
                compute_step(model, batch, optimizer)
    
    total_time = end_time - start_time
    throughput = PROFILE_STEPS / total_time
    print(f"Throughput: {throughput:.2f} steps/sec")

    We run our script using the cudaProfilerApi option to start and stop the profiler programmatically. Please see the official documentation for full details on profiling from the nsys CLI.

    nsys profile \
      --capture-range=cudaProfilerApi \
      --trace=cuda,nvtx,osrt \
      --output=baseline \
      python train.py

    This results in a baseline.nsys-rep trace file that we copy over to our development machine for analysis.

    To draw a comparison with PyTorch Profiler, we define an alternative training loop programmed with PyTorch Profiler and annotated with the torch.profiler.record_function utility:

    from torch.profiler import (
        profile, record_function, schedule, tensorboard_trace_handler
    )
    
    with profile(
        schedule=schedule(wait=0, warmup=WARMUP_STEPS,
                          active=PROFILE_STEPS, repeat=1),
        on_trace_ready=tensorboard_trace_handler('./baseline'),
        record_shapes=True,
        with_stack=True
    ) as prof:
        for i in range(TOTAL_STEPS):
            with record_function("get batch"):
                batch = next(data_iter)
            with record_function("copy batch"):
                batch = copy_data(batch)
            with record_function("compute"):
                compute_step(model, batch, optimizer)
            prof.step()

    The throughput of our baseline experiment is 2.97 steps per second. In the following sections we will use the profile traces to identify performance bottlenecks in our training step and attempt to improve on this result.

    Baseline Performance Analysis

    To analyze the resultant nsys trace file, we open it in the Nsight Systems GUI application. In the image below we zoom in on the timeline of two of the training steps captured by the profiler:

    Baseline Nsight Systems Profiler Trace (by Author)

    The trace contains a wealth of information, only a subset of which we will touch on in this post. Please see the nsys documentation for more functionalities and features.

    The timeline is divided into two parts: the CUDA section, which reports GPU activity, and the threads section, which reports CPU activity. The CUDA section makes a clear distinction between the GPU kernel (compute) activity (90.9%) and memory activity (9.1%). The top bars in each section report the utilization of each of the resources, and both sections include an NVTX section with the colored annotations we included in our training step. We note the following observations:

    1. The GPU is idle for roughly 50% of each training step. This can be seen from the portion of time taken by each batch (in blue) in the GPU NVTX bar and the large blocks of whitespace between them.
    2. The GPU activity for each batch begins immediately after the “get batch” activity has completed on the CPU. It begins with the host-to-device memory copy, marked in light green, and continues with the kernel computations, marked in light blue.
    3. Once the CPU has launched the GPU memory and compute commands for batch N, it proceeds to the next batch in the training loop — resulting in a partial overlap of batch N+1 on the CPU with batch N on the GPU.
    4. The vast majority of the CPU thread is spent on the “get batch” activity. This constitutes the primary bottleneck in our baseline experiment.

    The profiling trace points to a clear culprit — the dataloader. By default, PyTorch performs single-process data loading: a single CPU process is used to load the next input data batch, copy it to the GPU, and launch the compute kernels — all in a sequential manner. This typically results in severe under-utilization of the CPU resources by: 1) limiting dataloading to just a single process, and 2) making the loading of the next batch contingent on the completion of the CPU processing (i.e., kernel loading) of the previous batch. Our irresponsible use of our CPU resources has resulted in our GPU being starved for input data.
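    The cost of fully sequential loading can be illustrated with a small stdlib simulation — a toy stand-in for the DataLoader, not its actual implementation (real PyTorch workers are separate processes, while the sketch below uses threads around a sleep-based fake load):

```python
import time
from concurrent.futures import ThreadPoolExecutor

LOAD_TIME = 0.02   # pretend each batch takes ~20 ms of I/O-bound work
NUM_BATCHES = 8

def load_batch(i):
    time.sleep(LOAD_TIME)  # stands in for decode/augment work
    return i

# Single-process style: each batch is loaded only when requested.
t0 = time.perf_counter()
serial = [load_batch(i) for i in range(NUM_BATCHES)]
serial_time = time.perf_counter() - t0

# Multi-worker style: batches are prepared concurrently, ahead of use.
t0 = time.perf_counter()
with ThreadPoolExecutor(max_workers=4) as pool:
    parallel = list(pool.map(load_batch, range(NUM_BATCHES)))
parallel_time = time.perf_counter() - t0

print(serial == parallel)            # True — same batches, same order
print(parallel_time < serial_time)   # True — concurrent loading wins
```

    The serial variant pays the full load time once per batch; the concurrent variant overlaps the loads, which is exactly the effect num_workers buys us below.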

    The same conclusion could have been reached using the PyTorch Profiler trace shown below:

    Baseline PyTorch Profiler Trace (by Author)

    Here too, we can see long periods of GPU underutilization that are caused by the long “get batch” blocks on the CPU side.

    Optimization 1: Multi-Process Data Loading

    The first step is to modify the data input pipeline to use multi-process data loading. We set the number of workers to match the 8 vCPUs available on our Amazon EC2 g6e.2xlarge instance. In a real-world scenario, this value should be tuned for optimal throughput:

    NUM_WORKERS = 8
    
    train_loader = DataLoader(
        FakeDataset(),
        batch_size=BATCH_SIZE,
        num_workers=NUM_WORKERS
    )

    Following this change our throughput jumps to 4.81 steps per second — a 62% improvement over our baseline result. The corresponding nsys profiler trace is shown below:

    Multiproc Dataloading Nsight Systems Profiler Timeline (by Author)

    Note that the red “get batch” segment has become just a tiny sliver of each step in the NVTX bar. Instead, the yellow “copy batch” block now takes center stage. As a result of our use of multi-process dataloading, there is now always a new batch ready for processing — but can we do better?

    Taking a closer look at the GPU section we see that there is still a significant portion (~290 milliseconds) of idle time between the memory operation and the kernel compute. This idle time is perfectly aligned with an “munmap” operation in the OS runtime bar. The “munmap” block is a CPU-side memory cleanup operation performed just after the CUDA memory copy is complete. It occurs at the tail end of the long yellow “copy batch” operation. The compute kernels are launched onto the GPU only after the memory cleanup has completed. This is a clear pattern of synchronous host-to-device memory copy: the CPU cannot proceed with kernel loading until the data copy operation has fully completed, and the GPU remains idle until the CPU loads the kernels.

    The PyTorch Profiler trace reveals the same GPU idle time but it does not provide the same “munmap” hint. This is our first example of the advantage of the system-wide visibility of the nsys profiler.

    Multiproc Dataloading PyTorch Profiler Trace (by Author)

    With our discovery of the data-copy performance bottleneck in hand, we proceed to our next optimization.

    Optimization 2: Asynchronous Data Transfer

    The solution to the bottleneck we have found is to program our training step to load data asynchronously. This allows the CPU to launch the compute kernels immediately after sending the memory copy command — without waiting for the memory copy to complete. This way the GPU can begin processing the kernels as soon as the CUDA memory copy is finished. Enabling asynchronous data copy requires two changes: first, we must program the dataloader to use pinned memory (instead of pageable memory), and second, we must pass the non_blocking=True argument to the to() operations:

    NUM_WORKERS = 8
    ASYNC_DATATRANSFER = True
    
    
    train_loader = DataLoader(
        FakeDataset(),
        batch_size=BATCH_SIZE,
        num_workers=NUM_WORKERS,
        pin_memory=ASYNC_DATATRANSFER
    )
    
    def copy_data(batch):
        data, targets = batch
        data_gpu = data.to(DEVICE, non_blocking=ASYNC_DATATRANSFER)
        targets_gpu = targets.to(DEVICE, non_blocking=ASYNC_DATATRANSFER)
        return data_gpu, targets_gpu
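    The effect of a non-blocking copy can be mimicked with a CUDA-free, thread-based sketch in which the copy of batch N+1 overlaps the compute of batch N. The timings are invented stand-ins, not measurements:

```python
import time
import threading

COPY_TIME = 0.015     # ~15 ms host-to-device copy (stand-in)
COMPUTE_TIME = 0.05   # ~50 ms of kernel work (stand-in)
STEPS = 4

def copy_batch():
    time.sleep(COPY_TIME)

def compute():
    time.sleep(COMPUTE_TIME)

# Blocking copy: each step pays copy + compute in sequence.
t0 = time.perf_counter()
for _ in range(STEPS):
    copy_batch()
    compute()
blocking = time.perf_counter() - t0

# Non-blocking copy: the copy of batch N+1 overlaps the compute of batch N.
t0 = time.perf_counter()
copier = threading.Thread(target=copy_batch)  # copy of the first batch
copier.start()
for _ in range(STEPS):
    copier.join()                              # wait for batch N's copy
    copier = threading.Thread(target=copy_batch)
    copier.start()                             # start copying batch N+1
    compute()                                  # while computing batch N
copier.join()
overlapped = time.perf_counter() - t0

print(overlapped < blocking)  # True — copies hidden behind compute
```

    Because the simulated copy is shorter than the simulated compute, the overlapped loop pays for the copy only once, at the very start — the same shape we will see in the real trace below.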

    Using asynchronous dataloading results in a throughput of 5.91 steps per second — a further 23% improvement and a 99% improvement overall. The resultant profiling trace is shown below:

    Async Dataloading Nsight Systems Profiler Timeline (by Author)

    We now see all of the CPU operations bunched together at the start of the trace. We have removed all performance obstacles on the CPU side, allowing it to freely load the data and kernels to the GPU. In the GPU section, we see continuous activity without any idle time. We do, however, see a clear separation between CUDA memory activities (in light green) and CUDA kernel activities (in light blue). PyTorch Profiler, in contrast, does not make this distinction clear. This is another advantage of the hardware-centric profiler and, in the case of our toy experiment, is what informs the next steps of our optimization.

    Async Dataloading PyTorch Profiler Trace (by Author)

    Optimization 3: Pipelining With CUDA Streams

    Our final optimizations derive from the fact that modern GPUs, such as the NVIDIA L40S, use independent engines for copying memory (the DMA engines) and executing compute kernels (the SMs). We can take advantage of this by parallelizing the distinct memory and kernel activities we observed in the nsys profiler trace. We will program this through the use of CUDA streams.

    In a previous post, we expanded on the opportunity for optimizing AI/ML workloads using CUDA streams. Here, we apply a similar pipelining strategy: we define two distinct “copy” and “compute” CUDA streams and program the “copy” stream to copy batch N+1 at the same time that the “compute” stream is processing batch N:

    # define two CUDA streams
    compute_stream = torch.cuda.Stream()
    copy_stream = torch.cuda.Stream()
    
    
    # extract the first batch
    next_batch = next(data_iter)
    with torch.cuda.stream(copy_stream):
        next_batch = copy_data(next_batch)
    
    for i in range(TOTAL_STEPS):
    
        if i == WARMUP_STEPS:
            torch.cuda.synchronize()
            start_time = time.perf_counter()
            profiler.start()
        elif i == WARMUP_STEPS + PROFILE_STEPS:
            torch.cuda.synchronize()
            profiler.stop()
            end_time = time.perf_counter()
    
        with nvtx.annotate(f"Batch {i}", color="blue"):
            # wait for the copy stream to complete the copy of batch N
            compute_stream.wait_stream(copy_stream)
            batch = next_batch
    
            # prefetch and copy batch N+1 on the copy stream
            try:
                with nvtx.annotate("get batch", color="red"):
                    next_batch = next(data_iter)
                with torch.cuda.stream(copy_stream):
                    with nvtx.annotate("copy batch", color="yellow"):
                        next_batch = copy_data(next_batch)
            except StopIteration:
                # reached end of dataset
                next_batch = None
    
            # execute the model on batch N on the compute stream
            with torch.cuda.stream(compute_stream):
                with nvtx.annotate("Compute", color="green"):
                    compute_step(model, batch, optimizer)
    
    total_time = end_time - start_time
    throughput = PROFILE_STEPS / total_time
    print(f"Throughput: {throughput:.2f} steps/sec")

    This optimization results in a throughput of 6.44 steps per second — a 9% improvement over our previous experiment. We note that the impact of this optimization is capped by the duration of the longer of the two operation types. In our previous profile trace, the memory block took 15.5 milliseconds and the kernel block took 155 milliseconds. In the current profile trace, the entire GPU step takes 155 milliseconds, which indicates that the memory copy time is completely hidden by the kernel compute time and that our optimization reaches the maximum possible result.
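    The back-of-the-envelope reasoning above can be spelled out with the numbers from the trace: a serial step pays copy plus compute, while a pipelined step pays only the longer of the two:

```python
# Per-step GPU times read off the previous trace (milliseconds).
copy_ms = 15.5
compute_ms = 155.0

# Serial execution pays both; pipelined execution pays only the max.
serial_step = copy_ms + compute_ms
pipelined_step = max(copy_ms, compute_ms)

print(pipelined_step)                              # 155.0 — copy fully hidden
print(round(serial_step / pipelined_step - 1, 2))  # 0.1 — ~10% ceiling on the gain
```

    The ~10% ceiling is consistent with the 9% improvement we measured: once the copy is hidden, only shrinking the 155 ms compute block can help further.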

    The use of the CUDA streams and its impact on GPU utilization can be seen in the traces of both profilers:

    Pipelined Nsight Systems Profiler Timeline (by Author)
    Pipelined PyTorch Profiler Trace (by Author)

    Optimization 4: Prefetching to CUDA

    For our final step, we move the data copying from the main training loop process to the data loading process: rather than explicitly calling the copy function inside the training loop, we assume that the batches returned from the data iterator are already placed on the GPU.

    In the code block below, we wrap our dataloader with a CUDA-prefetching iterator class. Note that this is a simplified implementation intended for the purposes of demonstration. More work may be required for more complex scenarios (e.g., DDP training). Alternatively, you may consider a third-party implementation such as torchtnt.utils.data.data_prefetcher.CudaDataPrefetcher:

    class DataPrefetcher:
        def __init__(self, loader):
            self.loader = iter(loader)
            self.stream = torch.cuda.Stream()
            self.next_batch = None
            self.preload()
    
        def preload(self):
            try:
                data, targets = next(self.loader)
    
                with torch.cuda.stream(self.stream):
                    with nvtx.annotate("copy batch", color="yellow"):
                        next_data = data.to(DEVICE, non_blocking=True)
                        next_targets = targets.to(DEVICE, non_blocking=True)
                self.next_batch = (next_data, next_targets)
            except StopIteration:
                self.next_batch = (None, None)
    
        def __iter__(self):
            return self
    
        def __next__(self):
            torch.cuda.current_stream().wait_stream(self.stream)
            data, targets = self.next_batch
            self.preload()
            return data, targets
    
    
    data_iter = DataPrefetcher(train_loader)
    
    for i in range(TOTAL_STEPS):
        if i == WARMUP_STEPS:
            torch.cuda.synchronize()
            start_time = time.perf_counter()
            profiler.start()
        elif i == WARMUP_STEPS + PROFILE_STEPS:
            torch.cuda.synchronize()
            profiler.stop()
            end_time = time.perf_counter()
    
        with nvtx.annotate(f"Batch {i}", color="blue"):
            with nvtx.annotate("get batch", color="red"):
                batch = next(data_iter)
            with nvtx.annotate("Compute", color="green"):
                loss = compute_step(model, batch, optimizer)
    
    total_time = end_time - start_time
    throughput = PROFILE_STEPS / total_time
    print(f"Throughput: {throughput:.2f} steps/sec")
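    The same prefetching pattern is useful outside of CUDA. Below is a simplified, stdlib-only variant — ThreadedPrefetcher is our own illustrative class, not a torch API — that keeps the next item ready on a background thread:

```python
import threading
import queue

class ThreadedPrefetcher:
    """Keep upcoming items of an iterable ready on a background thread,
    so __next__ rarely has to wait for the producer."""

    def __init__(self, iterable, depth=2):
        self._queue = queue.Queue(maxsize=depth)
        self._sentinel = object()
        self._thread = threading.Thread(
            target=self._fill, args=(iter(iterable),), daemon=True)
        self._thread.start()

    def _fill(self, it):
        for item in it:
            self._queue.put(item)        # blocks when the buffer is full
        self._queue.put(self._sentinel)  # signal exhaustion

    def __iter__(self):
        return self

    def __next__(self):
        item = self._queue.get()
        if item is self._sentinel:
            raise StopIteration
        return item

batches = ThreadedPrefetcher(range(5))
print(list(batches))  # [0, 1, 2, 3, 4] — order is preserved
```

    Like the CUDA prefetcher above, it decouples “produce the next item” from “consume the current item”; the bounded queue plays the role the copy stream plays in the CUDA version.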

    This optimization results in a throughput of 6.44 steps per second — the same as our previous experiment. This should not surprise us, since we have already seen that the throughput is bound by the 155-millisecond GPU compute and our optimization has done nothing to reduce the kernel compute time.

    More generally, despite the removal of the copy call from the main loop, you will have a hard time finding a scenario where this change has a meaningful impact on performance, since the copy is already being performed asynchronously. However, given the minimal changes to the training loop, you may find this solution to be cleaner and/or more applicable for use with high-level libraries that do not enable fine-grained control of the training loop.

    Unsurprisingly, the profile traces for this experiment appear nearly identical to the previous ones. The main difference is the placement of the yellow “copy batch” block in the NVTX row of the CPU section.

    Data Prefetching Nsight Systems Profiler Timeline (by Author)
    Data Prefetching PyTorch Profiler Trace (by Author)

    Results

    The table below summarizes the results of our experiments:

    Experiment Results (by Author)

    The optimizations, which were driven by the Nsight Systems profiler, resulted in an overall increase of 2.17X in runtime performance.
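    The overall figure follows directly from the baseline and final throughputs reported above:

```python
baseline = 2.97   # steps/sec before any optimization
final = 6.44      # steps/sec after all four optimizations

print(round(final / baseline, 2))  # 2.17 — the overall speedup
```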

    Summary

    GPU starvation is a common performance bottleneck that can have a devastating impact on the efficiency and costs of AI/ML workloads. In this post, we demonstrated how to use the Nsight Systems profiler to study the causes of a performance bottleneck and take informed steps towards its resolution. Along the way, we emphasized the unique capabilities of the Nsight Systems profiler when compared to the built-in, framework-centric PyTorch Profiler — specifically its deep system-level visibility.

    Our focus in this post has been on the host-to-device data copy that typically occurs at the start of the training step. However, data-transfer bottlenecks can appear at different stages of training. In a sequel to this post we intend to repeat our nsys profiling analysis on data copies going in the opposite direction — from the device to the host. Stay tuned!


