    Data Pruning MNIST: How I Hit 99% Accuracy Using Half the Data | by Chris Lettieri | Jan, 2025

    By Editor Times Featured | January 31, 2025 | 5 Mins Read


    Notice how the chosen samples capture more varied writing styles and edge cases.

    In some examples, like clusters 1, 3, and 8, the furthest point simply looks like a more varied version of the prototypical center.

    Cluster 6 is an interesting case, showing how some images are hard even for a human to identify. But you can still see how this one could end up in a cluster whose centroid is an 8.

    Recent research on neural scaling laws helps explain why data pruning with a “furthest-from-centroid” approach works, particularly on the MNIST dataset.

    Data Redundancy

    Many training examples in large datasets are highly redundant.

    Think about MNIST: how many nearly identical ‘7’s do we really need? The key to data pruning isn’t having more examples; it’s having the right examples.

    Selection Strategy vs. Dataset Size

    One of the most interesting findings from the paper above is how the optimal data selection strategy changes based on your dataset size:

    • With “plenty” of data: select harder, more diverse examples (furthest from cluster centers).
    • With scarce data: select easier, more typical examples (closest to cluster centers).

    This explains why our “furthest-from-centroid” strategy worked so well.

    With MNIST’s 60,000 training examples, we were in the “ample data” regime, where selecting diverse, challenging examples proved most helpful.
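    To make the two regimes concrete, here is a minimal NumPy sketch (not the author’s code) of selecting points from a single cluster by their distance to the centroid; the toy data, `n_keep`, and the `abundant_data` flag are illustrative assumptions.

```python
import numpy as np

def select_by_distance(points, centroid, n_keep, abundant_data=True):
    """Keep n_keep points from one cluster: furthest from the centroid
    when data is abundant, closest to it when data is scarce."""
    dists = np.linalg.norm(points - centroid, axis=1)  # distance of every point to the centroid
    order = np.argsort(dists)                          # indices sorted closest-first
    return order[-n_keep:] if abundant_data else order[:n_keep]

# Toy usage: 200 two-dimensional points around a single center.
rng = np.random.default_rng(0)
points = rng.normal(size=(200, 2))
centroid = points.mean(axis=0)
hard_examples = select_by_distance(points, centroid, n_keep=20, abundant_data=True)   # edge cases
easy_examples = select_by_distance(points, centroid, n_keep=20, abundant_data=False)  # prototypes
```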

    Inspiration and Objectives

    I was inspired by these two recent papers (and the fact that I’m a data engineer):

    Both explore various ways we can use data selection techniques to train performant models on less data.

    Methodology

    I used LeNet-5 as my model architecture.
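    For reference, here is a minimal LeNet-5-style network for 28×28 MNIST inputs, written as a PyTorch sketch; the author’s exact implementation (activations, pooling, initialization) may differ and lives in the linked GitHub repo.

```python
import torch
import torch.nn as nn

class LeNet5(nn.Module):
    """Classic LeNet-5-style CNN for 28x28 grayscale MNIST images."""
    def __init__(self, num_classes: int = 10):
        super().__init__()
        self.features = nn.Sequential(
            nn.Conv2d(1, 6, kernel_size=5, padding=2),  # 28x28 -> 28x28
            nn.Tanh(),
            nn.AvgPool2d(2),                            # -> 14x14
            nn.Conv2d(6, 16, kernel_size=5),            # -> 10x10
            nn.Tanh(),
            nn.AvgPool2d(2),                            # -> 5x5
        )
        self.classifier = nn.Sequential(
            nn.Flatten(),
            nn.Linear(16 * 5 * 5, 120), nn.Tanh(),
            nn.Linear(120, 84), nn.Tanh(),
            nn.Linear(84, num_classes),
        )

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return self.classifier(self.features(x))
```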

    Then, using one of the strategies below, I pruned the MNIST training dataset and trained a model. Testing was done against the full test set.

    Due to time constraints, I only ran 5 tests per experiment.

    Full code and results are available here on GitHub.

    Strategy #1: Baseline, Full Dataset

    • Standard LeNet-5 architecture
    • Trained using 100% of the training data

    Strategy #2: Random Sampling

    • Randomly sample individual images from the training dataset
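    As a sketch, this baseline amounts to nothing more than a uniform subsample of the training indices; the 50% fraction below matches the comparison reported later, but is otherwise an illustrative choice.

```python
import numpy as np

rng = np.random.default_rng(42)
n_train = 60_000                     # size of the MNIST training split
keep_fraction = 0.5                  # keep half the data
random_subset = rng.choice(n_train, size=int(n_train * keep_fraction), replace=False)
```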

    Strategy #3: K-means Clustering with Different Selection Strategies

    Here’s how this worked (a sketch of the full pipeline follows the notes below):

    1. Preprocess the images with PCA to reduce the dimensionality. This simply means each image was reduced from 784 values (28×28 pixels) to only 50 values. PCA does this while retaining the most important patterns and removing redundant information.
    2. Cluster using k-means. The number of clusters was fixed at 50 and 500 in different tests. My poor CPU couldn’t handle much beyond 500 given all the experiments.
    3. I then tested different selection methods once the data was clustered:
    • Closest-to-centroid: these represent a “typical” example of the cluster.
    • Furthest-from-centroid: more representative of edge cases.
    • Random from each cluster: randomly select within each cluster.
    Example of Clustering Selection. Image by author.
    • PCA reduced noise and computation time. At first I was just flattening the images. The results and compute both improved with PCA, so I kept it for the full experiment.
    • I switched from standard K-means to MiniBatchKMeans clustering for better speed. The standard algorithm was too slow for my CPU given all the tests.
    • Setting up a proper test harness was key. Moving experiment configs to YAML, automatically saving results to a file, and having o1 write my visualization code made life much easier.
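    Below is a minimal sketch of that pipeline using scikit-learn’s PCA and MiniBatchKMeans. The component count, cluster count, and 50% keep fraction mirror the numbers in the text, but this is an illustrative reconstruction, not the author’s actual code (which is on GitHub).

```python
import numpy as np
from sklearn.datasets import fetch_openml
from sklearn.decomposition import PCA
from sklearn.cluster import MiniBatchKMeans

# Load MNIST (70,000 x 784); the first 60,000 rows are the standard training split.
X, y = fetch_openml("mnist_784", version=1, return_X_y=True, as_frame=False)
X_train = X[:60_000].astype(np.float32) / 255.0

# Step 1: PCA reduces each image from 784 pixel values to 50 components.
X_pca = PCA(n_components=50, random_state=0).fit_transform(X_train)

# Step 2: MiniBatchKMeans is much faster than standard k-means on a CPU.
kmeans = MiniBatchKMeans(n_clusters=50, n_init=3, random_state=0).fit(X_pca)
dist_to_centroid = np.linalg.norm(X_pca - kmeans.cluster_centers_[kmeans.labels_], axis=1)

# Step 3: per-cluster selection: closest-to-centroid, furthest-from-centroid, or random.
def select_indices(fraction=0.5, method="furthest", seed=0):
    rng = np.random.default_rng(seed)
    keep = []
    for c in range(kmeans.n_clusters):
        members = np.where(kmeans.labels_ == c)[0]
        n_keep = max(1, int(len(members) * fraction))
        if method == "random":
            keep.append(rng.choice(members, size=n_keep, replace=False))
        else:
            ordered = members[np.argsort(dist_to_centroid[members])]  # closest first
            keep.append(ordered[:n_keep] if method == "closest" else ordered[-n_keep:])
    return np.concatenate(keep)

pruned_indices = select_indices(fraction=0.5, method="furthest")  # subset to train LeNet-5 on
```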

    Median Accuracy & Run Time

    Here are the median results, comparing our baseline LeNet-5 trained on the full dataset with two different strategies that used 50% of the dataset.

    Median Results. Image by author.
    Median Accuracies. Image by author.

    Accuracy vs. Run Time: Full Results

    The charts below show the results of my four pruning strategies compared to the baseline in pink.

    Median Accuracy across Data Pruning methods. Image by author.
    Median Run Time across Data Pruning methods. Image by author.

    Key findings across multiple runs:

    • Furthest-from-centroid consistently outperformed the other methods.
    • There is definitely a sweet spot between compute time and model accuracy if you want to find it for your use case. More work needs to be done here.

    I’m still surprised that simply reducing the dataset at random gives acceptable results if efficiency is what you’re after.

    Future Plans

    1. Test this on my second brain. I want to fine-tune an LLM on my full Obsidian vault and test data pruning together with hierarchical summarization.
    2. Explore other embedding methods for clustering. I could try training an autoencoder to embed the images rather than using PCA.
    3. Test this on larger, more complex datasets (CIFAR-10, ImageNet).
    4. Experiment with how model architecture affects the performance of data pruning strategies.

    These findings suggest we need to rethink our approach to dataset curation:

    1. More data isn’t always better; there appear to be diminishing returns to bigger datasets and bigger models.
    2. Strategic pruning can actually improve results.
    3. The optimal strategy depends on your starting dataset size.

    As people start sounding the alarm that we’re running out of data, I can’t help but wonder whether less data is actually the key to useful, cost-effective models.

    I intend to keep exploring this space, so please reach out if you find this interesting; happy to connect and talk more 🙂



    Source link
