    The Machine Learning “Advent Calendar” Day 1: k-NN Regressor in Excel

    By Editor Times Featured | December 1, 2025 | 16 Mins Read


    Welcome to this “Advent Calendar” of machine learning and deep learning in Excel.

    For Day 1, we start with the k-NN (k-Nearest Neighbors) regressor algorithm. As you will see, it is probably the simplest model, and it is a good way to begin.

    For those who already know this model, here are some questions for you. Hopefully, they will make you want to continue reading. And there are some subtle lessons that are not taught in traditional courses.

    • Is scaling of continuous features important for this model?
    • How can categorical features be handled?
    • What should be done to the continuous features to improve model performance?
    • What types of distance measures can be more suitable in certain situations? For example, when predicting house prices where geographical location matters?

    Spoiler: with a naive k-NN, you cannot get the best scaling automatically.

    It is also an opportunity, if you are not familiar with Excel formulas, to practice formulas such as RANK, IF, SUMPRODUCT, and other useful Excel functions.

    You can use this link to get the Excel/Google Sheets file, and my advice is to follow the article and do some testing with the file to understand it better.

    k-NN regressor in Excel – image by author

    The principle of k-NN

    If you want to sell or buy an apartment, how would you estimate the price?

    Please think of a very realistic approach, not some complex model that you would have to spend hours building.

    Something that you can actually do.

    Well, you would probably ask your neighbors who have an apartment of the same or similar size, and you would calculate the average price of those apartments.

    Yes, that is exactly the idea of k-NN, short for k-Nearest Neighbors: search for the most similar examples, and use their values to estimate the new one.

    To illustrate this task with a concrete example of house price estimation, we will use the well-known California Housing dataset. It consists of Census data from California block groups, used to predict the median house value.

    California Housing Dataset – License: MIT

    Each observation is not an individual house, but it is still an interesting example to use.

    Here is a quick description of the variables.

    The target variable is MedHouseVal, the median house value, in units of 100,000 USD (example: 3.2 means 320,000 dollars).

    The feature variables are as follows:

    1. MedInc: median income (in units of 10,000 USD)
    2. HouseAge: median age of the houses
    3. AveRooms: average number of rooms per household
    4. AveBedrms: average number of bedrooms per household
    5. Population: number of people living in the block group
    6. AveOccup: average number of occupants per household
    7. Latitude: geographic latitude
    8. Longitude: geographic longitude

    k-NN with One Continuous Feature

    Before we use several features to find the neighbors, let's first use only one feature and a few observations.

    Even though the approach for one continuous feature will be quite simple, we will still follow every step. We first explore our dataset, then we train the model with a hyperparameter, and finally we use the model to predict.

    Training dataset

    Here is the plot of this simple dataset of 10 observations. The x-axis is the continuous feature, and the y-axis is the target variable.

    One simple feature for k-NN – image by author

    Now, imagine that we have to predict the value for a new observation x=10. How do we do that?

    Model training?

    The first step for almost all machine learning models is training.

    But for k-NN, your model is your entire dataset. In other words, you do not have to train the model; you use the original dataset directly.

    So in scikit-learn, when you call model.fit on a k-NN estimator, nothing really happens.

    Some may ask: what about k?

    Well, k is the hyperparameter. So you have to choose a value for k, and it can be tuned.
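    To make this concrete, here is a minimal sketch (the class and names are mine, not scikit-learn's) where "fitting" does nothing more than store the training data:

```python
class KNNRegressor:
    """Minimal k-NN regressor skeleton: k is the only hyperparameter."""

    def __init__(self, k=3):
        self.k = k

    def fit(self, X, y):
        # No optimization, no learned parameters:
        # "training" just keeps a copy of the dataset.
        self.X_train = list(X)
        self.y_train = list(y)
        return self


model = KNNRegressor(k=3).fit([1, 2, 5, 8], [1.0, 1.5, 2.5, 3.5])
print(model.k, len(model.X_train))  # 3 4
```

    All the real work happens later, at prediction time.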

    Prediction for one new observation

    For the hyperparameter k, we will use k=3, since the dataset is very small.

    For one feature variable, the distance can trivially be the absolute value of the difference between the new observation and each of the others.

    In the sheet "algo1D", you can change the value of the new observation and use the filter on distance column C to sort the dataset in increasing order; the 3 nearest neighbors will be plotted.

    To make the calculation more automatic, we can use the RANK function to find the observations with the smallest distances.

    We can also create a column of indicators (column G), with indicator = 1 if the observation belongs to the k nearest neighbors.

    Finally, for the prediction, we can use SUMPRODUCT to calculate the average of all y values with indicator = 1.

    In the plot,

    • the light blue dots represent the dataset
    • the red dot represents the new observation with the predicted y value
    • the yellow dots represent the 3 nearest neighbors of the new observation (in red)
    k-NN regressor in Excel with one feature – image by author

    Let's recap: the prediction phase consists of the following steps.

    • For a given new observation, calculate the distance between this new observation and all the observations in the training dataset.
    • Identify the k observations with the shortest distances. In Excel, we can use the filter to sort the training dataset manually, or we can use RANK (and an indicator column) to get the top k observations.
    • Calculate the predicted value as the average of the target variable over these neighbors, using SUMPRODUCT.
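    The steps above can be sketched in a few lines of Python (with made-up training values, not the exact data from the sheet):

```python
# Toy 1D training data (hypothetical values, not the exact sheet data)
x_train = [1, 2, 3, 5, 6, 8, 9, 11, 13, 15]
y_train = [1.0, 1.2, 1.1, 2.0, 2.3, 3.1, 3.0, 4.2, 4.8, 5.5]


def knn_predict_1d(x_new, k=3):
    # Step 1: distance to every training observation (absolute difference)
    dist = [abs(x_new - x) for x in x_train]
    # Step 2: indices of the k smallest distances (the RANK / indicator step)
    nearest = sorted(range(len(x_train)), key=lambda i: dist[i])[:k]
    # Step 3: average of the neighbors' target values (the SUMPRODUCT step)
    return sum(y_train[i] for i in nearest) / k


print(knn_predict_1d(10, k=3))  # average y of the 3 points closest to x=10
```

    The three code comments map one-to-one onto the three bullet points above.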

    Prediction for an interval of new observations

    In the sheet "algo1D f" (f for final), I plotted the prediction for a list of new observations, ranging from 1 to 17.

    With a programming language, we could easily do this in a loop, and for a larger number of new observations, so the curve could be denser.

    With Excel, I manually repeated the following steps:

    • enter a value for x
    • sort the distance column
    • copy-paste the prediction
    k-NN regressor in Excel with one feature – image by author

    Effect of the hyperparameter k

    The hyperparameter used in k-NN is the number of neighbors that we take into account for the calculation of the average value.

    We usually use the following graph to explain how a model can be underfitted or overfitted.

    k-NN regressor underfitting and overfitting – image by author

    In our case, if k is small, there is a risk of overfitting.

    If k is large, there is a risk of underfitting.

    The extreme case of a very large k is when k equals the total size of the training dataset. Then the prediction is the same for every new observation: it is the global average.
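    A tiny sketch (with hypothetical data) shows this extreme case: when k equals the size of the training set, the prediction no longer depends on the new observation at all.

```python
x_train = [1, 3, 5, 7, 9]
y_train = [1.0, 2.0, 2.5, 3.5, 4.0]


def knn_predict(x_new, k):
    # Average the target values of the k training points closest to x_new
    nearest = sorted(range(len(x_train)), key=lambda i: abs(x_new - x_train[i]))[:k]
    return sum(y_train[i] for i in nearest) / k


# k = 1 follows each training point exactly (risk of overfitting);
# k = len(x_train) returns the global average, whatever x is (underfitting).
k_max = len(x_train)
print(knn_predict(2, k=k_max), knn_predict(8, k=k_max))  # both 2.6
```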

    k-NN regressor in Excel with the effect of k – image by author

    So, we can say that k-NN refines the idea of predicting with an average value, by averaging only over the observations that are close to the new observation.

    k-NN with Two Continuous Features

    Now, we will study the case of two continuous feature variables x1 and x2, and we will only discuss the differences from the previous scenario with one feature variable.

    Two continuous feature variables

    Since Excel cannot plot in 3D, when we have two feature variables, the plot shows x1 on the x-axis and x2 on the y-axis.

    So do not confuse this with the previous dataset, where the y-axis represented the target value y.

    k-NN with two features in Excel – image by author

    Prediction with the Euclidean distance

    Now that we have two features, we have to take both of them into account.

    One natural distance we can use is the Euclidean distance.

    Then we apply the same process to find the top k observations with the smallest distance to the new observation.
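    In code, the Euclidean distance over two features is a one-liner (a minimal sketch with made-up points, not the values from the sheet):

```python
import math


def euclidean(a, b):
    # Square root of the sum of squared differences over the features
    return math.sqrt(sum((ai - bi) ** 2 for ai, bi in zip(a, b)))


new_obs = (2.0, 3.0)  # hypothetical new observation (x1, x2)
train = [(1.0, 1.0), (2.5, 3.5), (4.0, 0.5)]
dists = [euclidean(new_obs, p) for p in train]
print(dists)  # the second training point is the closest
```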

    k-NN regressor in Excel with two features – image by author

    To get a visual plot, we can use the same colors:

    • blue for the training dataset
    • red for the new observation
    • yellow for the k nearest neighbors found
    k-NN regressor in Excel plot with two features – image by author

    Impact of the scale of the variables

    When you have two features, one question we can ask is how the scale of each feature impacts the prediction.

    First, let's see a simple example: I multiplied the feature x2 by 10.

    k-NN regressor in Excel with different scales – image by author

    Will this scaling impact the predictions? The answer is of course yes.

    And we can easily compare them, as in the following image.

    k-NN regressor in Excel with different scales – image by author

    It is easy enough to understand that the Euclidean distance sums the squared differences of the features, regardless of their scales.

    As a result, the feature with the larger scale will dominate the distance.

    When it comes to feature scaling, one common operation is standardization (also called centering and reducing) or min–max scaling. The idea is to put all features on a comparable scale.
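    Both operations can be sketched in a few lines of Python (the sample values are hypothetical):

```python
def standardize(values):
    # Center and reduce: subtract the mean, divide by the standard deviation
    n = len(values)
    mean = sum(values) / n
    std = (sum((v - mean) ** 2 for v in values) / n) ** 0.5
    return [(v - mean) / std for v in values]


def min_max(values):
    # Rescale linearly so the minimum becomes 0 and the maximum becomes 1
    lo, hi = min(values), max(values)
    return [(v - lo) / (hi - lo) for v in values]


x2 = [10, 20, 30, 40]  # a feature on a large scale (hypothetical values)
print(min_max(x2))     # values between 0 and 1
print(standardize(x2)) # mean 0, standard deviation 1
```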

    BUT, let's think about this case: what if one feature is expressed in dollars, and the other in yen?

    In the real world, the correct relation between the two scales is about 1 dollar = 156 yen (as of November 2025). We know this because we understand the meaning of the units.

    How would the model know this? It DOES NOT.

    The only hyperparameter is k, and the model does not adjust anything to correct for differences in units or scales. k-NN has no internal mechanism for recognizing that two features have different units.

    And this is only the beginning of the problems…

    k-NN with the California Housing dataset

    Now, let's finally use the real-world California Housing dataset.

    With the one-feature dataset, we got the basic idea of how k-NN works. With the two-feature dataset, we saw that the scale of the features matters.

    Now, with this real-world dataset, we will see that the heterogeneous nature of the features makes the Euclidean distance meaningless.

    We will also see some other important ideas that appear when we use k-NN in practice.

    Naive application of the k-NN regressor

    Since all the features in this dataset are continuous, we can simply calculate the Euclidean distance. And we define a number k to calculate the average value of the target variable, here MedHouseVal.

    In Excel, you can easily do this yourself, or you can support me here and get all the files.

    k-NN regressor in Excel with the California Housing dataset – image by author

    Notion of distance based on different features

    I said that the previous application is naive because, if you look closer, you will see these problems.

    MedInc (median income) is expressed in units of 10,000 USD. If we decide to express it in units of 100,000 USD or 1,000 USD instead, the prediction will change, because k-NN is sensitive to the scale of the features. We saw this problem before.

    Moreover, each feature has a different nature:

    • MedInc is an amount of money (in dollars).
    • HouseAge is an age in years.
    • AveRooms is a count of rooms.
    • Population is a number of people.
    • Latitude and longitude are geographic coordinates.

    Therefore, the Euclidean distance is doomed.

    Different types of distances

    The most common choice is the Euclidean distance, but it is not the only one.

    We can also use the Manhattan distance when features represent grid-like movements, and the cosine distance when only the direction matters (as with text embeddings).

    Each distance changes how "nearest" is defined, and therefore can change which neighbors k-NN selects.

    Depending on the data, other distances can be more appropriate.
    For example, with latitude and longitude, we can use the true geographical distance (in meters) instead of a simple Euclidean distance on degrees.

    In the California Housing dataset, this is especially useful because we have the exact latitude and longitude of each district.

    However, as soon as we try to combine these geographical distances with other variables (such as median income, number of rooms, or population), the problem becomes more complex, because the variables have very different natures and scales.
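    One common way to compute such a geographical distance is the haversine formula, sketched below (the article does not specify which formula the sheet uses, so this is an illustration):

```python
import math


def haversine_km(lat1, lon1, lat2, lon2):
    # Great-circle distance between two (latitude, longitude) points, in km
    r = 6371.0  # mean Earth radius in km
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    dphi = math.radians(lat2 - lat1)
    dlam = math.radians(lon2 - lon1)
    a = (math.sin(dphi / 2) ** 2
         + math.cos(phi1) * math.cos(phi2) * math.sin(dlam / 2) ** 2)
    return 2 * r * math.asin(math.sqrt(a))


# Approximate coordinates of San Francisco and Los Angeles
print(haversine_km(37.77, -122.42, 34.05, -118.24))  # roughly 560 km
```

    Unlike a Euclidean distance on raw degrees, this respects the fact that one degree of longitude covers fewer kilometers at higher latitudes.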

    k-NN regressor in Excel with geographical distance – image by author

    In the cartography renderings below, I used k-NN as a smoothing function to refine the values associated with different areas of Paris.

    On the left, each area has only a single value, so from one quarter to its neighboring quarters, there can be a discontinuity in the variable.

    On the right, k-NN allows me to estimate a value for every specific address by smoothing the information based on nearby areas.

    Moreover, for indicators such as the proportion of certain professional categories, I also applied population-based weighting so that larger areas have a stronger influence in the smoothing process.

    k-NN regressor in Excel with smoothing – image by author

    In conclusion, when the situation allows it, choosing a more specific distance can help us better capture the underlying reality.

    By linking the distance to the nature of the data, we can make k-NN much more meaningful: geographical distance for coordinates, cosine distance for embeddings, and so on. The choice of distance is not just a technical detail; it changes how the model "sees" the world and which neighbors it considers similar.

    How Categorical Features can be Modeled

    You may hear that categorical features cannot be handled by k-NN models.

    But this is not completely true.

    k-NN can work with categorical variables as long as we can define a distance between two observations.

    Many people will say: "just use one-hot encoding."

    Others mention label encoding, or ordinal encoding.

    But these methods behave very differently in a distance-based model.

    To make this clear, we will use another dataset: the diamond price dataset (CC BY 4.0 license), which includes several features such as carat, cut, color, and clarity.

    For simplicity, we will use only carat (numerical) and clarity (categorical) to demonstrate several results.

    Predicting Prices with Carat

    First, we will start with carat, since you probably know that the price of a diamond depends primarily on the size (carat) of the stone.

    The graphic below shows how k-NN can find diamonds of similar size to estimate the price.

    k-NN regressor in Excel with the diamond price dataset – image by author

    One-Hot Encoding for the Clarity Feature

    Now let us look at clarity.

    Below is the table of categories with their meanings; we apply one-hot encoding to transform each category into a binary vector.

    Clarity	Meaning
    IF	Internally Flawless
    VVS1	Very Very Slightly Included 1
    VVS2	Very Very Slightly Included 2
    VS1	Very Slightly Included 1
    VS2	Very Slightly Included 2
    SI1	Slightly Included 1
    SI2	Slightly Included 2
    I1	Included 1

    In this table, we see that for the new diamond with clarity VVS2, the nearest neighbors are all diamonds from the same clarity category.

    The numerical feature carat has very little influence on the distance, although it is a more important feature, as you can see in the price column.

    k-NN regressor in Excel with one-hot encoding for clarity in the diamond price dataset – image by author

    Key issue 1: all categories are equally distant

    When using the Euclidean distance on one-hot vectors:

    • IF vs VVS1 → distance = √2
    • IF vs SI2 → distance = √2
    • IF vs I1 → distance = √2

    Every pair of different categories is at exactly the same distance.
    This does not reflect the real diamond grading scale.
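    We can verify this in a few lines of Python:

```python
import math

clarity_levels = ["IF", "VVS1", "VVS2", "VS1", "VS2", "SI1", "SI2", "I1"]


def one_hot(level):
    # Binary vector with a single 1 at the position of the category
    return [1.0 if level == c else 0.0 for c in clarity_levels]


def euclidean(a, b):
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


# Every pair of distinct categories is at exactly the same distance: sqrt(2)
print(euclidean(one_hot("IF"), one_hot("VVS1")))
print(euclidean(one_hot("IF"), one_hot("I1")))
```

    Adjacent grades (IF vs VVS1) and extreme grades (IF vs I1) get the same distance, which is exactly the problem.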

    Key issue 2: scaling problem with continuous variables

    Because we combine one-hot clarity with carat (a continuous feature), we face another problem:

    • carat values in our example are below 1
    • clarity vectors have differences of √2 → clarity dominates the distance calculation

    So even small changes in clarity overpower the effect of carat.
    This is exactly the same scaling issue we face with multiple continuous features, but even stronger.

    Ordinal Encoding for Clarity

    Now we can try encoding the clarity feature with numerical labels. But instead of using the classic labels 1, 2, 3… we use expert-based labels that reflect the real grading scale.

    The idea is to translate the clarity levels into values that behave more like a continuous feature, similar to carat, even if clarity is not strictly continuous.
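    Such an expert-based encoding can be sketched as a simple lookup table (the numeric scores below are hypothetical, not the exact values used in the sheet):

```python
# Hypothetical expert-based scores for the clarity grades, chosen so that
# they live on a scale comparable to carat (these are NOT the sheet's values)
clarity_score = {
    "IF": 0.90, "VVS1": 0.80, "VVS2": 0.70, "VS1": 0.55,
    "VS2": 0.45, "SI1": 0.30, "SI2": 0.20, "I1": 0.05,
}


def encode(carat, clarity):
    # Each diamond becomes a 2D point with both coordinates on similar scales
    return (carat, clarity_score[clarity])


print(encode(0.4, "VVS2"))  # (0.4, 0.7)
```

    With the scores ordered like the grading scale, adjacent grades are now closer to each other than distant ones, unlike with one-hot encoding.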

    k-NN regressor in Excel with clarity encoding – image by author

    With this expert-based encoding, the distances become more meaningful.

    Carat and clarity are now on comparable scales, so neither feature completely dominates the distance calculation.

    So we obtain a better balance between size and clarity when selecting neighbors, which gives more realistic predictions.

    k-NN regressor in Excel with ordinal encoding for the clarity feature in the diamond price dataset – image by author

    Conclusions

    In conclusion, the k-NN regressor is a highly non-linear, local estimator. It is so local that only the k closest observations are actually used.

    After implementing the k-NN regressor in Excel, I think we can really ask this question: is the k-NN regressor really a machine learning model?

    • There is no model training
    • When predicting, the selection of the neighbor observations does not depend on the value of the target variable

    On the other hand, it is so easy to understand that with Excel we can implement the whole algorithm. Moreover, we can adjust the distance as we wish.

    Businesspeople can grasp the idea immediately: to predict a value, we look at similar observations.

    The real problems with k-NN, and with all models based on distance, are:

    • the scale of the features
    • the heterogeneous nature of the features, which makes the sum meaningless
    • the specific distance that should be defined in concrete situations
    • for categorical features, label/ordinal encoding could be optimized if we could find the optimal scaling.

    So in short, the problem is the scaling of the features. We might think that the scales can be tuned as hyperparameters, but then the tuning would require too much time.

    We will see later that this is exactly the motivation behind another family of models.

    Here, the notion of scale is also equivalent to the notion of feature importance, because in k-NN the importance of each feature is defined before using the model.

    So this is only the beginning of our journey. We will discover together other models that can do better, starting from this simple model and improving in different directions: feature scaling, from distance to probability, splitting to better model each class…


