    How to Set the Number of Trees in Random Forest

    By Editor Times Featured · May 19, 2025


    Scientific publication

    T. M. Lange, M. Gültas, A. O. Schmitt & F. Heinrich (2025). optRF: Optimising random forest stability by determining the optimal number of trees. BMC Bioinformatics, 26(1), 95.

    Follow this LINK to the original publication.

    Random Forest — A Powerful Tool for Anyone Working With Data

    What Is Random Forest?

    Have you ever wished you could make better decisions using data — like predicting the risk of diseases, crop yields, or patterns in customer behaviour? That's where machine learning comes in, and one of the most accessible and powerful tools in this field is something called Random Forest.

    So why is random forest so popular? For one, it's incredibly versatile. It works well with many kinds of data, whether numbers, categories, or both. It's also widely used in many fields — from predicting patient outcomes in healthcare to detecting fraud in finance, from improving online shopping experiences to optimising agricultural practices.

    Despite the name, random forest has nothing to do with trees in a forest — but it does use something called Decision Trees to make smart predictions. You can think of a decision tree as a flowchart that guides you through a series of yes/no questions based on the data you give it. A random forest creates a whole bunch of these trees (hence the "forest"), each slightly different, and then combines their results to make one final decision. It's a bit like asking a group of experts for their opinion and then going with the majority vote, as the small sketch below illustrates.
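
    To make the majority-vote intuition concrete, here is a tiny toy sketch in R (purely illustrative, not part of the original analysis): 1,000 simulated "experts" each make a noisy yes/no call that is right only 60% of the time, yet the majority vote is far more reliable than any single expert.

    > # 1,000 noisy yes/no votes, each correct with probability 0.6
    > set.seed(1)
    > votes = rbinom(1000, size = 1, prob = 0.6)
    > # the ensemble goes with the majority vote
    > mean(votes) > 0.5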

    But until recently, one question was unanswered: How many decision trees do I actually need? If every decision tree can lead to different results, averaging many trees should lead to better and more reliable results. But how many are enough? Fortunately, the optRF package answers this question!

    So let's have a look at how to optimise Random Forest for predictions and variable selection!

    Making Predictions with Random Forests

    To optimise and to use random forest for making predictions, we can use the open-source statistical software R. Once we open R, we have to install the two R packages "ranger", which allows us to use random forests in R, and "optRF", which optimises random forests. Both packages are open-source and available via the official R repository CRAN. In order to install and load these packages, the following lines of R code can be run:

    > install.packages("ranger")
    > install.packages("optRF")
    > library(ranger)
    > library(optRF)

    Now that the packages are installed and loaded into the library, we can use the functions that these packages contain. Additionally, we can also use the data set included in the optRF package, which is free to use under the GPL licence (just like the optRF package itself). This data set, called SNPdata, contains in its first column the yield of 250 wheat plants, followed by 5000 genomic markers (so-called single nucleotide polymorphisms, or SNPs) that can take either the value 0 or 2.

    > SNPdata[1:5,1:5]
                Yield SNP_0001 SNP_0002 SNP_0003 SNP_0004
      ID_001 670.7588        0        0        0        0
      ID_002 542.5611        0        2        0        0
      ID_003 591.6631        2        2        0        2
      ID_004 476.3727        0        0        0        0
      ID_005 635.9814        2        2        0        2

    This data set is an example of genomic data and can be used for genomic prediction, a very important tool for breeding high-yielding crops and, thus, for fighting world hunger. The idea is to predict the yield of crops using genomic markers. And exactly for this purpose, random forest can be used! That means that a random forest model is used to describe the relationship between the yield and the genomic markers. Afterwards, we can predict the yield of wheat plants where we only have genomic markers.

    Therefore, let's imagine that we have 200 wheat plants where we know both the yield and the genomic markers. This is the so-called training data set. Let's further assume that we have 50 wheat plants where we know the genomic markers but not their yield. This is the so-called test data set. Thus, we split the data frame SNPdata so that the first 200 rows are saved as training data and the last 50 rows, without their yield, are saved as test data:

    > Training = SNPdata[1:200,]
    > Test = SNPdata[201:250,-1]

    With these data sets, we can now have a look at how to make predictions using random forests!

    First, we have to calculate the optimal number of trees for random forest. Since we want to make predictions, we use the function opt_prediction from the optRF package. Into this function we have to insert the response from the training data set (in this case the yield), the predictors from the training data set (in this case the genomic markers), and the predictors from the test data set. Before we run this function, we can use the set.seed function to ensure reproducibility, even though this is not necessary (we'll see later why reproducibility is an issue here):

    > set.seed(123)
    > optRF_result = opt_prediction(y = Training[,1], 
    +                               X = Training[,-1], 
    +                               X_Test = Test)
      Recommended number of trees: 19000

    All the results from the opt_prediction function are now saved in the object optRF_result; however, the most important information was already printed in the console: For this data set, we should use 19,000 trees.
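
    If you want to see what else was stored, you can inspect the top-level structure of the result object (a quick sketch; the exact components depend on the installed optRF version):

    > # list the components stored in the opt_prediction result
    > str(optRF_result, max.level = 1)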

    With this information, we can now use random forest to make predictions. Therefore, we use the ranger function to derive a random forest model that describes the relationship between the genomic markers and the yield in the training data set. Also here, we have to insert the response in the y argument and the predictors in the x argument. Furthermore, we can set the write.forest argument to TRUE and we can insert the optimal number of trees in the num.trees argument:

    > RF_model = ranger(y = Training[,1], x = Training[,-1], 
    +                   write.forest = TRUE, num.trees = 19000)

    And that's it! The object RF_model contains the random forest model that describes the relationship between the genomic markers and the yield. With this model, we can now predict the yield for the 50 plants in the test data set where we have the genomic markers but don't know the yield:

    > predictions = predict(RF_model, data=Test)$predictions
    > predicted_Test = data.frame(ID = row.names(Test), predicted_yield = predictions)

    The data frame predicted_Test now contains the IDs of the wheat plants together with their predicted yield:

    > head(predicted_Test)
          ID predicted_yield
      ID_201        593.6063
      ID_202        596.8615
      ID_203        591.3695
      ID_204        589.3909
      ID_205        599.5155
      ID_206        608.1031

    Variable Selection with Random Forests

    A different approach to analysing such a data set would be to find out which variables are most important for predicting the response. In this case, the question would be which genomic markers are most important for predicting the yield. This, too, can be done with random forests!

    If we tackle such a task, we don't need a training and a test data set. We can simply use the entire data set SNPdata and see which of the variables are the most important ones. But before we do that, we should again determine the optimal number of trees using the optRF package. Since we are interested in calculating the variable importance, we use the function opt_importance:

    > set.seed(123)
    > optRF_result = opt_importance(y=SNPdata[,1], 
    +                               X=SNPdata[,-1])
      Recommended number of trees: 40000

    One can see that the optimal number of trees is now higher than it was for predictions. This is actually often the case. However, with this number of trees, we can now use the ranger function to calculate the importance of the variables. Therefore, we use the ranger function as before, but we change the number of trees in the num.trees argument to 40,000 and we set the importance argument to "permutation" (other options are "impurity" and "impurity_corrected").

    > set.seed(123) 
    > RF_model = ranger(y=SNPdata[,1], x=SNPdata[,-1], 
    +                   write.forest = TRUE, num.trees = 40000,
    +                   importance="permutation")
    > D_VI = data.frame(variable = names(SNPdata)[-1], 
    +                   importance = RF_model$variable.importance)
    > D_VI = D_VI[order(D_VI$importance, decreasing=TRUE),]

    The data frame D_VI now contains all the variables, that is, all the genomic markers, and next to each one, its importance. Also, we have directly ordered this data frame so that the most important markers are at the top and the least important markers are at the bottom. This means that we can have a look at the most important variables using the head function:

    > head(D_VI)
      variable importance
      SNP_0020   45.75302
      SNP_0004   38.65594
      SNP_0019   36.81254
      SNP_0050   34.56292
      SNP_0033   30.47347
      SNP_0043   28.54312

    And that's it! We have used random forest to make predictions and to estimate the most important variables in a data set. Furthermore, we have optimised random forest using the optRF package!

    Why Do We Need Optimisation?

    Now that we've seen how easy it is to use random forest and how quickly it can be optimised, it's time to take a closer look at what's happening behind the scenes. Specifically, we'll explore how random forest works and why the results might change from one run to another.

    To do this, we'll use random forest to calculate the importance of each genomic marker, but instead of optimising the number of trees beforehand, we'll stick with the default settings in the ranger function. By default, ranger uses 500 decision trees. Let's try it out:

    > set.seed(123) 
    > RF_model = ranger(y=SNPdata[,1], x=SNPdata[,-1], 
    +                   write.forest = TRUE, importance="permutation")
    > D_VI = data.frame(variable = names(SNPdata)[-1], 
    +                   importance = RF_model$variable.importance)
    > D_VI = D_VI[order(D_VI$importance, decreasing=TRUE),]
    > head(D_VI)
      variable importance
      SNP_0020   80.22909
      SNP_0019   60.37387
      SNP_0043   50.52367
      SNP_0005   43.47999
      SNP_0034   38.52494
      SNP_0015   34.88654

    As expected, everything runs smoothly — and quickly! In fact, this run was considerably faster than when we previously used 40,000 trees. But what happens if we run the exact same code again, this time with a different seed?

    > set.seed(321) 
    > RF_model2 = ranger(y=SNPdata[,1], x=SNPdata[,-1], 
    +                    write.forest = TRUE, importance="permutation")
    > D_VI2 = data.frame(variable = names(SNPdata)[-1], 
    +                    importance = RF_model2$variable.importance)
    > D_VI2 = D_VI2[order(D_VI2$importance, decreasing=TRUE),]
    > head(D_VI2)
      variable importance
      SNP_0050   60.64051
      SNP_0043   58.59175
      SNP_0033   52.15701
      SNP_0020   51.10561
      SNP_0015   34.86162
      SNP_0019   34.21317

    Once again, everything seems to work fine, but take a closer look at the results. In the first run, SNP_0020 had the highest importance score at 80.23, but in the second run, SNP_0050 takes the top spot and SNP_0020 drops to fourth place with a much lower importance score of 51.11. That's a significant shift! So what changed?
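
    One way to quantify this shift beyond eyeballing the top rows (an illustrative addition, not part of the original analysis) is to compare the two full importance rankings with a rank correlation, where 1 would mean an identical ordering:

    > # merge the two runs by marker name and correlate the rankings
    > both = merge(D_VI, D_VI2, by = "variable", suffixes = c("_run1", "_run2"))
    > cor(both$importance_run1, both$importance_run2, method = "spearman")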

    The answer lies in something called non-determinism. Random forest, as the name suggests, involves a lot of randomness: it randomly selects data samples and subsets of variables at various points during training. This randomness helps prevent overfitting, but it also means that results can vary slightly every time you run the algorithm — even with the exact same data set. That's where the set.seed() function comes in. It acts like a bookmark in a shuffled deck of cards. By setting the same seed, you ensure that the random decisions made by the algorithm follow the same sequence every time you run the code. But when you change the seed, you effectively change the random path the algorithm follows. That's why, in our example, the most important genomic markers came out differently in each run. This behaviour — where the same process can yield different results due to internal randomness — is a classic example of non-determinism in machine learning.
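
    You can see the effect of seeding in isolation with three one-liners (a minimal sketch using base R's sample function, independent of random forest):

    > set.seed(123); sample(1:10, 3)  # some draw of three numbers
    > set.seed(123); sample(1:10, 3)  # identical to the line above
    > set.seed(321); sample(1:10, 3)  # a different draw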

    As we just saw, random forest models can produce slightly different results every time you run them, even when using the same data, because of the algorithm's built-in randomness. So, how can we reduce this randomness and make our results more stable?

    One of the simplest and most effective ways is to increase the number of trees. Each tree in a random forest is trained on a random subset of the data and variables, so the more trees we add, the better the model can "average out" the noise caused by individual trees. Think of it like asking 10 people for their opinion versus asking 1,000 — you're far more likely to get a reliable answer from the larger group.
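
    The same intuition can be simulated in two lines (a toy sketch, unrelated to the SNPdata set): the average opinion of a large group fluctuates far less across repeats than the average of a small one.

    > set.seed(42)
    > sd(replicate(1000, mean(rnorm(10))))    # small group: noisy average
    > sd(replicate(1000, mean(rnorm(1000))))  # large group: much more stable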

    With more trees, the model's predictions and variable importance rankings tend to become more stable and reproducible, even without setting a specific seed. In other words, adding more trees helps to tame the randomness. However, there's a catch. More trees also mean more computation time. Training a random forest with 500 trees might take a few seconds, but training one with 40,000 trees could take several minutes or more, depending on the size of your data set and your computer's performance.
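
    If you want to see this trade-off on your own machine, base R's system.time gives a rough comparison (a hedged sketch; the timings are entirely machine-dependent):

    > # elapsed time for a small versus a large forest
    > system.time(ranger(y = SNPdata[,1], x = SNPdata[,-1], num.trees = 500))
    > system.time(ranger(y = SNPdata[,1], x = SNPdata[,-1], num.trees = 40000))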

    However, the relationship between the stability and the computation time of random forest is non-linear. While going from 500 to 1,000 trees can significantly improve stability, going from 5,000 to 10,000 trees might bring only a tiny improvement in stability while doubling the computation time. At some point, you hit a plateau where adding more trees gives diminishing returns — you pay more in computation time but gain very little in stability. That's why it's essential to find the right balance: enough trees to ensure stable results, but not so many that your analysis becomes unnecessarily slow.

    And this is exactly what the optRF package does: it analyses the relationship between the stability and the number of trees in random forests and uses this relationship to determine the optimal number of trees — the number that leads to stable results and beyond which adding more trees would unnecessarily increase the computation time.
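
    To build intuition for what such a stability analysis measures, here is a simplified do-it-yourself sketch (not how optRF works internally; the package models the stability-trees relationship properly): fit two forests per tree count with different seeds and check how strongly their importance rankings agree. As the number of trees grows, the rank correlation between the two runs should climb towards 1 and then plateau.

    > for (n in c(500, 2000, 10000)) {
    +   # two forests with identical settings but different seeds
    +   # (seed is ranger's own argument for its internal RNG)
    +   m1 = ranger(y = SNPdata[,1], x = SNPdata[,-1], num.trees = n, 
    +               importance = "permutation", seed = 1)
    +   m2 = ranger(y = SNPdata[,1], x = SNPdata[,-1], num.trees = n, 
    +               importance = "permutation", seed = 2)
    +   # agreement between the two importance rankings
    +   print(cor(m1$variable.importance, m2$variable.importance, 
    +             method = "spearman"))
    + }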

    Above, we have already used the opt_importance function and saved the results as optRF_result. This object contains the information about the optimal number of trees, but it also contains information about the relationship between the stability and the number of trees. Using the plot_stability function, we can visualise this relationship. Therefore, we have to insert the name of the optRF object, which measure we are interested in (here, the "importance"), the interval we want to visualise on the X axis, and whether the recommended number of trees should be added:

    > plot_stability(optRF_result, measure="importance", 
    +                from=0, to=50000, add_recommendation=FALSE)
    [Figure: the output of the plot_stability function, visualising the stability of random forest depending on the number of decision trees]

    This plot clearly shows the non-linear relationship between stability and the number of trees. With 500 trees, random forest only reaches a stability of around 0.2, which explains why the results changed drastically when random forest was repeated with a different seed. With the recommended 40,000 trees, however, the stability is close to 1 (which indicates perfect stability). Adding more than 40,000 trees would push the stability even closer to 1, but this increase would be very small while the computation time would keep growing. That is why 40,000 trees is the optimal number of trees for this data set.

    The Takeaway: Optimise Random Forest to Get the Most Out of It

    Random forest is a powerful ally for anyone working with data — whether you're a researcher, analyst, student, or data scientist. It's easy to use, remarkably flexible, and highly effective across a wide range of applications. But like any tool, using it well means understanding what's happening under the hood. In this post, we've uncovered one of its hidden quirks: the randomness that makes it strong can also make it unstable if not carefully managed. Fortunately, with the optRF package, we can strike the right balance between stability and performance, ensuring we get reliable results without wasting computational resources. Whether you're working in genomics, medicine, economics, agriculture, or any other data-rich field, mastering this balance will help you make smarter, more confident decisions based on your data.


