
    How I Won the “Mostly AI” Synthetic Data Challenge

    By Editor Times Featured, August 11, 2025


    I participated in the Mostly AI Prize and won both the FLAT and SEQUENTIAL data challenges. The competition was a fantastic learning experience, and in this post, I want to share some insights into my winning solution.

    The Competition

    The goal of the competition was to generate a synthetic dataset with the same statistical properties as a source dataset, without copying the data.

    Source: https://www.mostlyaiprize.com/.

    The competition was split into two independent challenges:

    1. FLAT Data Challenge: Generate 100,000 records with 80 columns.
    2. SEQUENTIAL Data Challenge: Generate 20,000 sequences (groups) of records.

    To measure the quality of the synthetic data, the competition used an Overall Accuracy metric. This score measures the similarity between the synthetic and source distributions for single columns (univariates), pairs of columns (bivariates), and triples of columns (trivariates) using the L1 distance. Additionally, privacy metrics like DCR (Distance to Closest Record) and NNDR (Nearest Neighbor Distance Ratio) were used to ensure submissions weren’t just overfitting or copying the training data.
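
    To make the accuracy idea concrete, here is a minimal sketch of an L1-based score for one column or one group of columns. It assumes the score is one minus half the L1 distance between the two normalized frequency tables, and it skips details such as binning numeric columns; the function name and the row-as-dict layout are illustrative, not the competition's official evaluation code.

```python
from collections import Counter

def l1_accuracy(source_rows, synthetic_rows, columns):
    """Return 1 - 0.5 * L1 distance between the joint frequency tables.

    Rows are dicts mapping column name to a (categorical) value.
    `columns` may hold one name (univariate), two (bivariate), and so on.
    """
    p = Counter(tuple(row[c] for c in columns) for row in source_rows)
    q = Counter(tuple(row[c] for c in columns) for row in synthetic_rows)
    n_p, n_q = len(source_rows), len(synthetic_rows)
    # A Counter returns 0 for cells that are missing on one side.
    l1 = sum(abs(p[k] / n_p - q[k] / n_q) for k in p.keys() | q.keys())
    return 1.0 - 0.5 * l1

source = [{"a": "x", "b": 1}, {"a": "x", "b": 2},
          {"a": "y", "b": 1}, {"a": "y", "b": 2}]
skewed = [{"a": "x", "b": 1}, {"a": "x", "b": 2},
          {"a": "x", "b": 1}, {"a": "x", "b": 2}]

print(l1_accuracy(source, source, ["a", "b"]))  # identical tables
print(l1_accuracy(source, skewed, ["a"]))       # column "a" is off
```

    Averaging such scores over all single columns, pairs, and triples would give an overall number in the spirit of the metric described above.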

    A sample of the training dataset for the FLAT challenge. Image by author.

    Solution Design

    Initially, my goal was to create an ensemble of multiple different state-of-the-art models and combine their generated data. I experimented a lot with different models, but the results didn’t improve as much as I had hoped.

    I pivoted my approach and focused on post-processing. First, I trained a single generative model from the Mostly AI SDK, and instead of generating the required number of samples for the submission, I oversampled to create a large pool of candidate samples. From this pool, I then selected the final output in a way that matches the statistical properties of the source dataset much more closely.

    This approach led to a substantial jump in the leaderboard score. For the FLAT data challenge, the raw synthetic data from the model scored around 0.96, but after post-processing, the score jumped to 0.992. I used a modified version of this approach for the SEQUENTIAL data challenge, which yielded a similar improvement.

    My final pipeline for the FLAT challenge consisted of three main steps:

    1. Iterative Proportional Fitting (IPF) to select an oversized, high-quality subset.
    2. Greedy Trimming to reduce the subset to the target size by removing the worst-fitting samples.
    3. Iterative Refinement to polish the final dataset by swapping samples for better-fitting ones.

    The impact of each post-processing step on the final accuracy score for the FLAT challenge. Image by author.

    Step 1: Iterative Proportional Fitting (IPF)

    The first step in my post-processing pipeline was to get a strong initial subset from the oversampled pool (2.5 million generated rows). For this, I used Iterative Proportional Fitting (IPF).

    IPF is a classical statistical algorithm used to adjust a sample distribution to match a known set of marginals. In this case, I wanted the synthetic data’s bivariate (2-column) distributions to match those of the original data. I also tested uni- and trivariate distributions, but I found that focusing on the bivariate relationships yielded the best performance while being computationally fast.

    Here’s how it worked:

    1. I identified the 5,000 most correlated column pairs in the training data using mutual information. These are the most important relationships to preserve.
    2. IPF then calculated fractional weights for each of the 2.5 million synthetic rows. The weights were adjusted iteratively so that the weighted sums of the bivariate distributions in the synthetic pool matched the target distributions from the training data.
    3. Finally, I used an expectation-rounding approach to convert these fractional weights into an integer count of how many times each row should be selected. This resulted in an oversized subset of 125,000 rows (1.25x the required size) that already had very strong bivariate accuracy.
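
    The weighting and rounding steps above can be sketched roughly as follows. This is a toy version under stated assumptions, not the code from my solution: each constraint table is encoded as one integer cell id per pool row, the targets are expressed as desired counts, and "expectation rounding" is assumed to mean stochastic rounding (round up with probability equal to the fractional part). The names `ipf_row_weights` and `expectation_round` are illustrative.

```python
import numpy as np

def ipf_row_weights(cell_ids, targets, n_iter=50):
    """Rake per-row weights so weighted cell counts match the targets.

    cell_ids: list of int arrays; cell_ids[k][i] is the cell of pool row i
              in the k-th constraint table (e.g. one column pair).
    targets:  list of arrays; targets[k][c] is the desired weighted count
              of cell c in the k-th table. All tables share the same total.
    """
    targets = [np.asarray(t, dtype=float) for t in targets]
    n_rows = len(cell_ids[0])
    w = np.full(n_rows, targets[0].sum() / n_rows)
    for _ in range(n_iter):
        for ids, target in zip(cell_ids, targets):
            current = np.bincount(ids, weights=w, minlength=len(target))
            # Cells without any pool row keep ratio 1; no row refers to them.
            ratio = np.divide(target, current,
                              out=np.ones_like(target), where=current > 0)
            w *= ratio[ids]
    return w

def expectation_round(weights, rng):
    """Integer counts whose expected value equals the fractional weights."""
    base = np.floor(weights)
    bump = rng.random(len(weights)) < (weights - base)
    return (base + bump).astype(np.int64)

# Toy pool of 4 rows with two constraint tables of 2 cells each.
cell_ids = [np.array([0, 0, 1, 1]), np.array([0, 1, 0, 1])]
targets = [np.array([3.0, 1.0]), np.array([2.0, 2.0])]
weights = ipf_row_weights(cell_ids, targets)
counts = expectation_round(weights, np.random.default_rng(0))
print(weights, counts)
```

    With 5,000 column pairs and millions of rows, the same loop structure applies, but the memory layout and data types matter far more than in this toy setup.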

    The IPF step provided a high-quality starting point for the next phase.

    Step 2: Trimming

    Generating an oversized subset of 125,000 rows from IPF was a deliberate choice that enabled this additional trimming step to remove samples that didn’t fit well.

    I used a greedy approach that iteratively calculates the “error contribution” of each row in the current subset. The rows that contribute the most to the statistical distance from the target distribution are identified and removed. This process repeats until only 100,000 rows remain, ensuring that the worst 25,000 rows are discarded.
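
    A simplified sketch of such a greedy trim is shown below. It assumes a dense 0/1 membership matrix (which cells each row occupies) and integer target counts for the final size; the scoring rule is one plausible reading of “error contribution”, and the function name and `batch` parameter are hypothetical.

```python
import numpy as np

def greedy_trim(membership, target_counts, n_keep, batch=1):
    """Drop the rows that sit most heavily in over-represented cells.

    membership:    (n_rows, n_cells) 0/1 matrix of cell occupancy per row.
    target_counts: integer target count per cell at the final size n_keep.
    Returns a boolean mask of the rows that are kept.
    """
    membership = np.asarray(membership, dtype=np.int64)
    keep = np.ones(membership.shape[0], dtype=bool)
    counts = membership.sum(axis=0)
    while keep.sum() > n_keep:
        excess = counts - target_counts
        # Removing a row lowers the L1 error by 1 for each over-filled cell
        # it occupies and raises it by 1 for every other cell it occupies.
        gain = membership @ np.where(excess > 0, 1, -1)
        gain[~keep] = np.iinfo(np.int64).min   # never pick a dropped row again
        n_drop = min(batch, int(keep.sum()) - n_keep)
        worst = np.argsort(gain)[-n_drop:]
        keep[worst] = False
        counts = counts - membership[worst].sum(axis=0)
    return keep

# Six rows, one column with two categories; cell 0 is over-represented.
membership = np.array([[1, 0], [1, 0], [1, 0], [1, 0], [0, 1], [0, 1]])
target_counts = np.array([2, 2])
mask = greedy_trim(membership, target_counts, n_keep=4)
print(membership[mask].sum(axis=0))
```

    With `batch` greater than 1 the gains are only approximate, because rows removed in the same batch can share cells; that trades a little accuracy for far fewer iterations.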

    Step 3: Refinement (Swapping)

    The final step was an iterative refinement process to swap rows from the subset with better rows from the much larger, unused data pool (the remaining 2.4 million rows).

    In every iteration, the algorithm:

    1. Identifies the worst rows within the current 100k subset (those contributing most to the L1 error).
    2. Searches for the best replacement candidates from the outside pool that would reduce the L1 error if swapped in.
    3. Performs the swap if it results in a better overall score.
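
    The loop above can be sketched like this, again on a toy scale. The sketch assumes a dense 0/1 membership matrix and integer target counts, swaps a single pair per round, and stops at the first round without improvement; all names are illustrative, and a version for millions of rows would work in batches on sparse data.

```python
import numpy as np

def l1_error(counts, target_counts):
    return int(np.abs(counts - target_counts).sum())

def refine_by_swapping(membership, in_subset, target_counts, n_rounds=100):
    """Swap one subset row for one pool row while the L1 error improves."""
    membership = np.asarray(membership, dtype=np.int64)
    in_subset = in_subset.copy()
    counts = membership[in_subset].sum(axis=0)
    for _ in range(n_rounds):
        inside = np.flatnonzero(in_subset)
        outside = np.flatnonzero(~in_subset)
        if len(inside) == 0 or len(outside) == 0:
            break
        # Estimated error reduction from removing / adding each row.
        remove_gain = membership @ np.where(counts > target_counts, 1, -1)
        add_gain = membership @ np.where(counts < target_counts, 1, -1)
        worst = inside[np.argmax(remove_gain[inside])]
        best = outside[np.argmax(add_gain[outside])]
        # Check the real error, since both rows may touch the same cells.
        new_counts = counts - membership[worst] + membership[best]
        if l1_error(new_counts, target_counts) >= l1_error(counts, target_counts):
            break
        in_subset[worst], in_subset[best] = False, True
        counts = new_counts
    return in_subset

membership = np.array([[1, 0], [1, 0], [1, 0], [0, 1], [0, 1]])
start = np.array([True, True, True, False, False])
target_counts = np.array([2, 1])
final = refine_by_swapping(membership, start, target_counts)
print(membership[final].sum(axis=0))
```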

    Because the accuracy of the synthetic sample is already quite high at this point, the additional gain from this process is rather small.

    Adapting for the Sequential Challenge

    The SEQUENTIAL challenge required a similar approach, but with two changes. First, a sample consists of multiple rows, connected by the group ID. Second, the competition metric adds a measure for coherence. This means that not only do the statistical distributions need to match, but the sequences of events also need to be similar to the source dataset.

    A sample of the training dataset for the SEQUENTIAL challenge. Image by author.

    My post-processing pipeline was adapted to handle groups and also optimize for coherence:

    1. Coherence-Based Pre-selection: Before optimizing for statistical accuracy, I ran a specialized refinement step. This algorithm iteratively swapped entire groups (sequences) to specifically match the coherence metrics of the original data, such as the distribution of “unique categories per sequence” and “sequences per category”. This ensured that the rest of the post-processing started from a sound sequential structure.
    2. Refinement (Swapping): The 20,000 groups selected for coherence then went through the same statistical refinement process as the flat data. The algorithm swapped entire groups with better ones from the pool to minimize the L1 error of the uni-, bi-, and trivariate distributions. A secret ingredient was to include the “Sequence Length” as a feature, so the group lengths are also considered in the swapping.
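
    As an illustration of what such a coherence statistic might look like, the sketch below computes the distribution of “unique categories per sequence” for one categorical column and compares two datasets with an L1 distance. The exact definition used by the competition's evaluation may differ; the function names and the flat list inputs are assumptions made for this example.

```python
from collections import Counter, defaultdict

def categories_per_sequence_distribution(group_ids, categories):
    """Share of sequences having 1, 2, 3, ... distinct categories."""
    seen = defaultdict(set)
    for g, c in zip(group_ids, categories):
        seen[g].add(c)
    hist = Counter(len(s) for s in seen.values())
    n_groups = len(seen)
    return {k: v / n_groups for k, v in sorted(hist.items())}

def l1_distance(p, q):
    return sum(abs(p.get(k, 0.0) - q.get(k, 0.0)) for k in p.keys() | q.keys())

source_dist = categories_per_sequence_distribution(
    ["a", "a", "b", "b", "b"], ["x", "y", "x", "x", "x"])
synthetic_dist = categories_per_sequence_distribution(
    ["s1", "s1", "s2", "s2"], ["x", "y", "x", "y"])

print(source_dist)
print(l1_distance(source_dist, synthetic_dist))
```

    A group-level swap would then be accepted when it moves the synthetic distribution closer to the source one.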

    This two-stage approach ensured the final dataset was strong in both statistical accuracy and sequential coherence. Interestingly, the IPF-based approach that worked so well for the flat data was less effective for the sequential challenge. Therefore, I removed it to focus computing time on the coherence and swapping algorithms, which yielded better results.

    Making It Fast: Key Optimizations

    The post-processing strategy by itself was computationally expensive, and making it run within the competition time limit was a challenge in itself. To succeed, I relied on a few key optimizations.

    First, I reduced the data types wherever possible to handle the massive sample data pool without running out of memory. Changing the numerical type of a large matrix from 64-bit to 32- or 16-bit greatly reduces the memory footprint.
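
    A small example of the effect, using a hypothetical helper `downcast_int` that picks the narrowest signed integer type able to hold the values (it assumes a non-empty integer array):

```python
import numpy as np

def downcast_int(arr):
    """Return arr in the smallest signed integer dtype that fits its range."""
    for dtype in (np.int8, np.int16, np.int32):
        info = np.iinfo(dtype)
        if arr.min() >= info.min and arr.max() <= info.max:
            return arr.astype(dtype)
    return arr

codes64 = np.array([0, 300, -5], dtype=np.int64)
codes_small = downcast_int(codes64)

print(codes64.dtype, codes64.nbytes)          # 8 bytes per value
print(codes_small.dtype, codes_small.nbytes)  # 2 bytes per value here
```

    Checking the value range first matters, because a blind cast to a narrower type silently wraps values that do not fit.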

    Second, when changing the data type was not enough, I used sparse matrices from SciPy. This approach allowed me to store the statistical contributions of each sample in an extremely memory-efficient way.
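
    For illustration, here is a tiny row-by-cell contribution matrix in SciPy's CSR format. The layout (one row per sample, one column per distribution cell, a 1 where the sample falls into the cell) is an assumption for this sketch; only the non-zero entries are stored.

```python
import numpy as np
from scipy import sparse

# Three samples, five distribution cells; each sample occupies a few cells.
rows = np.array([0, 0, 1, 2, 2])
cols = np.array([1, 4, 1, 0, 4])
vals = np.ones(len(rows), dtype=np.int8)
contrib = sparse.csr_matrix((vals, (rows, cols)), shape=(3, 5))

# Cell counts of a subset are the column sums of the selected rows.
subset_counts = np.asarray(contrib[[0, 2]].sum(axis=0)).ravel()

print(contrib.nnz, "stored values instead of", contrib.shape[0] * contrib.shape[1])
print(subset_counts)
```

    The `data`, `indices`, and `indptr` arrays of a CSR matrix are plain NumPy arrays, which is what makes it possible to hand them to a compiled loop.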

    Finally, the core refinement loop involved a lot of specialized calculations, some of which were very slow with NumPy. To overcome this, I used Numba. By extracting the bottlenecks in my code into specialized functions with the @numba.njit decorator, Numba automatically translated them into highly optimized machine code that runs at speeds comparable to C.

    Here is an example of how I needed to speed up the summation of rows in sparse matrices, which was a major bottleneck in the original NumPy version.

    import numpy as np
    import numba

    # This will make the logic run hundreds of times faster.
    @numba.njit
    def _rows_sum_csr_int32(data, indices, indptr, rows, K):
        """
        Sum CSR rows into a dense 1-D vector without creating
        intermediate scipy / numpy objects.
        """
        out = np.zeros(K, dtype=np.int32)
        for r in rows:
            # Non-zero entries of row r live in data[start:end].
            start = indptr[r]
            end = indptr[r + 1]
            for p in range(start, end):
                out[indices[p]] += data[p]
        return out

    However, Numba is not a silver bullet; it is helpful for numerical, loop-heavy code, but for most calculations, it is faster and easier to stick to vectorized NumPy operations. I advise you to only try it when a NumPy approach does not reach the required speed.

    Final Thoughts

    The top 5 submissions for each challenge. Source: https://github.com/mostly-ai/the-prize-eval/.

    Even though ML models are getting increasingly stronger, I think that for most problems that Data Scientists are trying to solve, the secret ingredient is often not in the model. Of course, a strong model is an integral part of a solution, but the pre- and postprocessing are equally important. For these challenges, a post-processing pipeline targeted specifically at the evaluation metric led me to the winning solution, without any additional ML.

    I learned a lot in this challenge, and I want to thank Mostly AI and the jury for their great job in organizing this fantastic competition.

    My code and solutions for both challenges are open-source and can be found here:


