
    Choosing the Best Model Size and Dataset Size under a Fixed Budget for LLMs

    By Editor Times Featured, October 29, 2025


    Introduction

    When training large language models (LLMs), we are always constrained by budgets. This constraint leads to a fundamental trade-off: if you fix a compute budget, increasing the model size means that you must reduce the dataset size you can train on, and vice versa. So you are asking the question:

    Should we allocate more to a model with more parameters, or should we train it on more data?

    In particular, LLMs' performance and efficiency are largely influenced by this trade-off. It is thus crucial to find an optimal balance between the number of parameters of a model and the number of tokens it is trained on.

    The total training compute of a transformer roughly scales as C∝N×D, where

    • N is the number of model parameters.
    • D is the number of training tokens.
    • C is the fixed compute budget.

    It is easy to see that for a fixed C, N and D are inversely proportional to each other.
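As a quick illustration of this inverse relationship, the following sketch fixes a compute budget and shows how the affordable number of tokens shrinks as the model grows. The budget value and the function name `tokens_for_budget` are illustrative assumptions, and the proportionality constant in C∝N×D is taken to be 1.

```python
def tokens_for_budget(compute_budget: float, n_params: int) -> float:
    """Tokens affordable under C = N * D (proportionality constant assumed to be 1)."""
    if n_params <= 0:
        raise ValueError("n_params must be positive")
    return compute_budget / n_params


# Hypothetical budget, chosen only for illustration.
C = 1e9
for n_params in (10_000, 100_000, 1_000_000):
    d_tokens = tokens_for_budget(C, n_params)
    print(f"N = {n_params:>9,d}  ->  D = {d_tokens:>12,.0f} tokens")
```

Each tenfold increase in N cuts the token allowance by the same factor.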

    Earlier studies (Kaplan et al., 2020; Hoffmann et al., 2022) have found that the training loss of language models follows a power law with compute, L(C)∝C^{−α}, and that the optimal model size and dataset size scale with compute as N_opt∝C^a and D_opt∝C^b for some positive values a and b.
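Under the approximation C∝N×D, the two exponents a and b are not independent. A short derivation, assuming the optimal pair always spends the full budget and writing k, k_N, k_D for unknown positive constants, is:

```latex
\begin{aligned}
C &= k\, N_{\text{opt}}\, D_{\text{opt}}, \qquad
N_{\text{opt}} = k_N\, C^{a}, \qquad
D_{\text{opt}} = k_D\, C^{b} \\
\Rightarrow\quad C &= k\, k_N\, k_D\, C^{a+b} \quad \text{for all } C \\
\Rightarrow\quad a + b &= 1 .
\end{aligned}
```

In other words, whatever share of a growing budget is not absorbed by a larger model has to go to more training tokens.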

    In this article, we will use tiny Transformers to explore how to balance N and D under a fixed compute C.

    Experiment Setup

    We design a minimal transformer model, which we call a "tiny transformer", with the following configurable properties that determine the model's parameter count:

    • Model dimension (d_model)
    • MLP dimension (d_mlp)
    • Number of layers (n_layers)

    We want to train transformers of different configurations on tokenized sequences of length 64 from the WikiText-2 dataset.

    To study the effect of scaling, we define a grid of models from very small (16 hidden units, 1 layer) to relatively large (128 hidden units, 4 layers) and combine them with a range of token budgets from 5k to 1M. See the code below:

    model_configs = [
        {"d_model": 16,  "d_mlp": 64,   "n_layers": 1},  
        {"d_model": 24,  "d_mlp": 96,   "n_layers": 1},   
        {"d_model": 32,  "d_mlp": 128,  "n_layers": 2},
        {"d_model": 48,  "d_mlp": 192,  "n_layers": 2},
        {"d_model": 64,  "d_mlp": 256,  "n_layers": 3},
        {"d_model": 96,  "d_mlp": 384,  "n_layers": 3},
        {"d_model": 128, "d_mlp": 512,  "n_layers": 4},   
    ]
    # number of tokens (D) we train on, simulated via few steps × batch × seq_len
    token_budgets = [5e3, 1e4, 3e4, 5e4, 1e5, 3e5, 5e5, 1e6]  # small for demo
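To get a feel for how the three settings in `model_configs` drive the parameter count N, here is a rough estimate. It assumes a standard transformer block (four d_model×d_model attention projections plus a two-layer MLP) and ignores biases, layer norms, and embeddings, which can dominate at this scale depending on the vocabulary size. The helper `approx_block_params` is hypothetical; the article's own `TinyTransformer` and `count_params` may count differently.

```python
def approx_block_params(d_model: int, d_mlp: int, n_layers: int) -> int:
    """Rough weight count for the transformer blocks only (assumed layout)."""
    attention = 4 * d_model * d_model  # Q, K, V and output projections
    mlp = 2 * d_model * d_mlp          # up- and down-projection
    return n_layers * (attention + mlp)


# Smallest and largest configurations from the grid.
smallest = approx_block_params(d_model=16, d_mlp=64, n_layers=1)
largest = approx_block_params(d_model=128, d_mlp=512, n_layers=4)
print(smallest, largest)
```

Under these assumptions the block weights span a bit more than two orders of magnitude across the grid.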

    Approximating the compute cost as C≈N×D, our idea is to measure the training loss for each (N,D) pair and, for a given C, find the pair with which the model reaches the minimum loss: that is the balance we are looking for.

    Implementation and observations

    We use the code below to train the model for a fixed number of steps with each (N,D) pair and record the results.

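    import torch
    from torch.utils.data import DataLoader

    # TinyTransformer, tokenizer, count_params, collate_fn, train_one,
    # tokenized_dataset and SEQ_LEN are defined elsewhere (not shown here).
    BATCH_SIZE = 16
    results = []
    device = "cuda" if torch.cuda.is_available() else "cpu"

    for cfg in model_configs:
        for D in token_budgets:
            # Re-create the model for every (N, D) pair so each run starts from
            # fresh weights rather than continuing the previous budget's training.
            model = TinyTransformer(vocab_size=len(tokenizer), **cfg)
            N_params = count_params(model)
            # One optimizer step consumes SEQ_LEN * BATCH_SIZE tokens.
            steps = int(D // (SEQ_LEN * BATCH_SIZE))
            dataloader = DataLoader(
                tokenized_dataset["train"].shuffle(seed=0),
                batch_size=BATCH_SIZE,
                collate_fn=collate_fn,
            )
            avg_loss = train_one(model, dataloader, steps=steps, device=device)
            compute = N_params * D  # C ≈ N × D
            results.append({
                "N": N_params,
                "D": D,
                "C": compute,
                "loss": avg_loss,
            })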
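The step count in the loop follows from the token budget: with sequences of length 64 and a batch size of 16, one step consumes 1,024 tokens. The sketch below reuses those two values and the `token_budgets` list to show how many steps each budget translates to; the dictionary name `steps_per_budget` is introduced here only for illustration.

```python
SEQ_LEN = 64     # sequence length stated in the article
BATCH_SIZE = 16  # batch size assumed in the training loop

token_budgets = [5e3, 1e4, 3e4, 5e4, 1e5, 3e5, 5e5, 1e6]
tokens_per_step = SEQ_LEN * BATCH_SIZE  # 1024 tokens consumed per step

steps_per_budget = {}
for D in token_budgets:
    steps = int(D // tokens_per_step)
    steps_per_budget[int(D)] = steps
    seen = steps * tokens_per_step
    print(f"D = {int(D):>9,d}  steps = {steps:>4d}  tokens seen = {seen:>9,d}")
```

Because of the integer division, the number of tokens actually seen is slightly below the nominal D (for example, 4 steps × 1,024 = 4,096 tokens for the 5k budget).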

    We then plot the final loss against the compute (N×D):

    Image by author: training loss vs. compute

    We make the following important observations:

    1. For small compute budgets, small models trained on most of the available data perform better than larger models trained on very little data.
    2. For large compute budgets, larger models become better when enough data is available.
    3. The optimal model size does not grow linearly with the compute budget. For example, doubling the compute does not double the optimal number of parameters.

    The plot below gives the efficient frontier across model sizes, that is, the set of model sizes that achieve the lowest loss for a given compute.

    Image by author: efficient frontier
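One way such a frontier could be extracted from the recorded runs is sketched below. It groups runs into logarithmic compute bins and keeps the lowest-loss run in each bin. This is an assumption about the procedure, not necessarily how the plotted frontier was built; the function name `efficient_frontier`, the `bins_per_decade` parameter, and the demo records are all made up for illustration.

```python
import math


def efficient_frontier(results, bins_per_decade=2):
    """Keep the lowest-loss run in each logarithmic compute bin.

    `results` is a list of dicts with keys "N", "D", "C", "loss",
    matching the records collected in the training loop.
    """
    best = {}
    for run in results:
        bin_id = math.floor(math.log10(run["C"]) * bins_per_decade)
        if bin_id not in best or run["loss"] < best[bin_id]["loss"]:
            best[bin_id] = run
    return [best[b] for b in sorted(best)]


# Made-up records, only to exercise the function.
demo_results = [
    {"N": 1_000, "D": 2e4, "C": 2.0e7, "loss": 6.0},
    {"N": 5_000, "D": 5e3, "C": 2.5e7, "loss": 6.8},
    {"N": 1_000, "D": 2e6, "C": 2.0e9, "loss": 5.1},
    {"N": 5_000, "D": 5e5, "C": 2.5e9, "loss": 4.7},
]
demo_frontier = efficient_frontier(demo_results)
print([(run["N"], run["D"]) for run in demo_frontier])
```

In the made-up data, the small model wins the low-compute bin and the larger model wins the high-compute bin, mirroring the first two observations.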

    “Best” Model

    To determine the “best” model, we pick the pair of model size and number of tokens that minimizes the loss at a fixed budget.

    We assume both follow a power-law relationship, N_opt∝C^α and D_opt∝C^β, and we want to estimate the unknown exponents α and β with the following steps:

    1. Take the logarithm of both quantities: log(N_opt) = α·log(C) + const, log(D_opt) = β·log(C) + const.
    2. Fit a linear regression. The slope of the regression is exactly the power-law exponent.

    The following code performs such a regression:

    import numpy as np
    import scipy.stats as st

    # Fit log-log linear regressions; each slope is a power-law exponent
    a_slope, a_intercept, *_ = st.linregress(np.log(frontier.C), np.log(frontier.N))
    b_slope, b_intercept, *_ = st.linregress(np.log(frontier.C), np.log(frontier.D))
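To check that a log-log fit really recovers a power-law exponent, the sketch below generates synthetic points from a known law and fits them with an ordinary least-squares line. It is a plain-Python stand-in for `st.linregress`, not the article's code; the function `fit_power_law` and the law N = 3·C^0.25 are made up for the demonstration.

```python
import math


def fit_power_law(xs, ys):
    """Least-squares fit of log(y) = slope * log(x) + intercept."""
    log_x = [math.log(x) for x in xs]
    log_y = [math.log(y) for y in ys]
    n = len(log_x)
    mean_x = sum(log_x) / n
    mean_y = sum(log_y) / n
    sxx = sum((x - mean_x) ** 2 for x in log_x)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(log_x, log_y))
    slope = sxy / sxx
    return slope, mean_y - slope * mean_x


# Synthetic data generated from a known law N = 3 * C^0.25 (made-up values).
compute = [1e6, 1e7, 1e8, 1e9, 1e10]
n_opt = [3.0 * c ** 0.25 for c in compute]

slope, intercept = fit_power_law(compute, n_opt)
print(f"recovered exponent = {slope:.3f}, prefactor = {math.exp(intercept):.3f}")
```

The slope is the exponent and the exponential of the intercept is the prefactor, which is why only the slopes matter for the scaling argument.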

    In our toy experiment, we found that N_opt∝C^0.14 and D_opt∝C^0.86. This result may not reveal the whole picture because we ran the experiment on simplified models and configurations. Still, we can see that more compute leads to a larger optimal model size, but at a diminishing rate. The rest of the budget should go to more training tokens.

    Moreover, the exponents above imply that the optimal ratio scales as N_opt/D_opt∝C^{−0.72}. This means that when you increase compute, you should mostly add more training tokens rather than increase the model size.
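To make these exponents concrete, the following sketch computes how much the optimal model size and token count would grow when the compute budget is multiplied by a given factor. It uses only the two exponents reported for the toy experiment; the function name `growth_factors` is introduced here for illustration, and the numbers should not be read as a rule for real LLMs.

```python
ALPHA = 0.14  # exponent for N_opt reported in the toy experiment
BETA = 0.86   # exponent for D_opt reported in the toy experiment


def growth_factors(compute_multiplier: float) -> tuple[float, float]:
    """Return (model-size factor, token factor) for a scaled compute budget."""
    return compute_multiplier ** ALPHA, compute_multiplier ** BETA


for multiplier in (2, 10, 100):
    n_factor, d_factor = growth_factors(multiplier)
    print(f"compute x{multiplier:<3d} -> N x{n_factor:.2f}, D x{d_factor:.2f}")
```

With these exponents, a 10× larger budget suggests roughly 1.4× more parameters and about 7.2× more tokens.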

    Practical Takeaways

    From this experiment, though a toy case, we can extract several insights:

    1. For a fixed budget, a medium-sized model with more data can outperform a very large model with limited data.
    2. Optimal model size and data size both grow with compute. Do not train a model with many parameters if you have a small budget.
    3. When the budget increases, first consider the optimal ratio N_opt/D_opt to decide whether you should increase the model size or add more training data.

    Conclusion

    In this blog post, we studied the trade-off between model size and data under a fixed compute budget for LLMs using a toy case. The experiment shows that we can find the optimal pair of model size and number of tokens to achieve the best model performance within a given budget, allowing researchers and practitioners to design LLMs wisely and get the most out of their compute.

    References

    [1] Kaplan, J., McCandlish, S., Henighan, T., Brown, T. B., Chess, B., Child, R., Gray, S., Radford, A., Wu, J., & Amodei, D. (2020). Scaling Laws for Neural Language Models.

    [2] Hoffmann, J., Borgeaud, S., Mensch, A., Buchatskaya, E., Cai, T., Rutherford, E., de Las Casas, D., Hendricks, L. A., Welbl, J., Clark, A., Hennigan, T., Noland, E., Millican, K., van den Driessche, G., Damoc, B., Guy, A., Osindero, S., Simonyan, K., Elsen, E., … Sifre, L. (2022). Training Compute-Optimal Large Language Models.


