The Machine Learning “Advent Calendar” Day 11: Linear Regression in Excel

Regression, finally!

For Day 11: I waited many days to present this model. It marks the start of a new journey in this “Advent Calendar”.

Until now, we mostly looked at models based on distances, neighbors, or local density. As you may know, for tabular data, decision trees, especially ensembles of decision trees, are very performant.

But starting today, we switch to another point of view: the weighted approach.

Linear Regression is our first step into this world.
It looks simple, but it introduces the core ingredients of modern ML: loss functions, gradients, optimization, scaling, collinearity, and interpretation of coefficients.

Now, when I say Linear Regression, I mean Ordinary Least Squares Linear Regression. As we progress through this “Advent Calendar” and explore related models, you will see why it is important to specify this, because the name “linear regression” can be confusing.

Some people say that Linear Regression is not machine learning.

Their argument is that machine learning is a “new” field, whereas Linear Regression existed long before, so it cannot be considered ML.

This is misleading.
Linear Regression fits perfectly within machine learning because:

it learns parameters from data,
it minimizes a loss function,
it makes predictions on new data.

In other words, Linear Regression is one of the oldest models, but also one of the most fundamental in machine learning.

This weighted approach is the one used in:

Linear Regression,
Logistic Regression,
and, later, Neural Networks and LLMs.

For deep learning, this weighted, gradient-based approach is the one that is used everywhere.

And in modern LLMs, we are no longer talking about a few parameters. We are talking about billions of weights.

In this article, our Linear Regression model has exactly 2 weights.

A slope and an intercept.

That’s all.

But we have to start somewhere, right?

And here are a few questions you can keep in mind as we progress through this article, and in the ones to come.

We will try to interpret the model. With one feature, y=ax+b, everyone knows that a is the slope and b is the intercept. But how do we interpret the coefficients when there are 10, 100 or more features?
Why is collinearity between features such a problem for linear regression? And what can we do to solve this issue?
Is scaling important for linear regression?
Can Linear Regression be overfitted?
And how are the other models of this weighted family (Logistic Regression, SVM, Neural Networks, Ridge, Lasso, etc.) all connected to the same underlying ideas?

These questions form the thread of this article and will naturally lead us toward future topics in the “Advent Calendar”.

Understanding the Trend Line in Excel

Starting with a Simple Dataset

Let us begin with a very simple dataset that I generated with one feature.

In the graph below, you can see the feature variable x on the horizontal axis and the target variable y on the vertical axis.

The goal of Linear Regression is to find two numbers, a and b, such that we can write the relationship:

y=a x +b

Once we know a and b, this equation becomes our model.

Linear regression in Excel – all images by author

Creating the Trend Line in Excel

In Google Sheets or Excel, you can simply add a trend line to visualize the best linear fit.

That already gives you the result of Linear Regression.

Linear regression in Excel – all images by author

But the purpose of this article is to compute these coefficients ourselves.

If we want to use the model to make predictions, we need to implement it directly.
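
As a minimal sketch of what implementing the model directly could look like outside a spreadsheet, the prediction is a single line of arithmetic. The function name predict and the coefficient values below are illustrative assumptions, not values taken from the article's dataset.

```python
def predict(x, a, b):
    """Return the model's prediction y = a*x + b for one feature value x."""
    return a * x + b


# Hypothetical coefficients, chosen only for illustration.
a, b = 2.0, 1.0
predictions = [predict(x, a, b) for x in [0.0, 1.0, 2.5]]
print(predictions)
```

In a spreadsheet, the same idea is one formula per row that multiplies the x cell by the slope cell and adds the intercept cell.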

Introducing Weights and the Cost Function

A Note on Weight-Based Models

This is the first time in the Advent Calendar that we introduce weights.

Models that learn weights are often called parametric discriminant models.

Why discriminant?
Because they learn a rule that directly separates or predicts, without modeling how the data was generated.

Before this chapter, we already saw models that had parameters, but they were not discriminant, they were generative.

Let us recap quickly.

Decision Trees use splits, or rules, and so there are no weights to learn. So they are non-parametric models.
k-NN is not a model. It keeps the whole dataset and uses distances at prediction time.

However, when we move from Euclidean distance to Mahalanobis distance, something interesting happens…

LDA and QDA do estimate parameters:

means of each class
covariance matrices
priors

These are real parameters, but they are not weights.
These models are generative because they model the density of each class, and then use it to make predictions.

So even though they are parametric, they do not belong to the weight-based family.

And as you can see, these are all classifiers, and they estimate parameters for each class.

Linear Regression is our first example of a model that learns weights to build a prediction.

This is the beginning of a new family in the Advent Calendar:
models that rely on weights + a loss function to make predictions.

The Cost Function

How do we obtain the parameters a and b?

Well, the optimal values for a and b are those minimizing the cost function, which is the Squared Error of the model.

So for each data point, we can calculate the Squared Error.

Squared Error = (prediction - real value)² = (a*x + b - real value)²

Then we can calculate the MSE, or Mean Squared Error, which is the average of the Squared Error over all data points.

As we can see in Excel, the trend line gives us the optimal coefficients. If you manually change these values, even slightly, the MSE will increase.

This is exactly what “optimal” means here: any other combination of a and b makes the error worse.
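
The same experiment can be sketched in a few lines of Python. The data below is a hypothetical toy dataset (not the one used in the spreadsheet), and the helper name mse is an assumption made for this example. For this toy data, the least-squares coefficients are a = 0.6 and b = 2.2.

```python
def mse(xs, ys, a, b):
    """Mean of the squared errors (a*x + b - y)^2 over all data points."""
    errors = [(a * x + b - y) ** 2 for x, y in zip(xs, ys)]
    return sum(errors) / len(errors)


# Hypothetical toy data (not the article's dataset).
xs = [1.0, 2.0, 3.0, 4.0, 5.0]
ys = [2.0, 4.0, 5.0, 4.0, 5.0]

# MSE at the least-squares coefficients for this toy data.
best = mse(xs, ys, 0.6, 2.2)

# Nudging either coefficient away from the optimum increases the MSE.
nudged_slope = mse(xs, ys, 0.7, 2.2)
nudged_intercept = mse(xs, ys, 0.6, 2.0)
print(best, nudged_slope, nudged_intercept)
```

Both nudged values come out larger than the first one, which is the behavior described above for the trend line coefficients.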

The classic closed-form solution

Now that we know what the model is, and what it means to minimize the squared error, we can finally answer the key question:

How do we compute the two coefficients of Linear Regression, the slope a and the intercept b?

There are two ways to do it:

the exact algebraic solution, known as the closed-form solution,
or gradient descent, which we will explore just after.

If we take the definition of the MSE and differentiate it with respect to a and b, something beautiful happens: everything simplifies into two very compact formulas.

These formulas only use:

the average of x and y,
how x varies (its variance),
and how x and y vary together (their covariance).

So even without knowing any calculus, and with only basic spreadsheet functions, we can reproduce the exact solution used in statistics textbooks.
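
Written out, the two formulas are a = cov(x, y) / var(x) and b = mean(y) - a * mean(x). Here is a minimal Python sketch of them, using a hypothetical toy dataset rather than the article's data; the function name fit_closed_form is an assumption made for this example.

```python
def fit_closed_form(xs, ys):
    """Ordinary least squares for one feature: returns (slope, intercept)."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # Covariance and variance share the same 1/n factor, which cancels
    # in the ratio, so plain sums of centered products are enough.
    sum_xy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sum_xx = sum((x - mean_x) ** 2 for x in xs)
    a = sum_xy / sum_xx
    b = mean_y - a * mean_x
    return a, b


# Hypothetical toy data (not the article's dataset).
xs = [1.0, 2.0, 3.0, 4.0, 5.0]
ys = [2.0, 4.0, 5.0, 4.0, 5.0]
a, b = fit_closed_form(xs, ys)
print(a, b)  # approximately 0.6 and 2.2
```

Note that the sketch would divide by zero if all x values were identical, which is the one-feature version of a degenerate dataset.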

How to interpret the coefficients

For one feature, interpretation is straightforward and intuitive:

The slope a
It tells us how much y changes when x increases by one unit.
If the slope is 1.2, it means:
“when x goes up by 1, the model expects y to go up by about 1.2.”

The intercept b
It is the predicted value of y when x = 0.
Often, x = 0 does not exist in the real context of the data, so the intercept is not always meaningful by itself.
Its role is mostly to position the line correctly to match the center of the data.

This is usually how Linear Regression is taught:
a slope, an intercept, and a straight line.

With one feature, interpretation is easy.

With two, still manageable.

But as soon as we start adding many features, it becomes more difficult.

Tomorrow, we will discuss interpretation further.

Today, we will do the gradient descent.

Gradient Descent, Step by Step

After seeing the classic algebraic solution for Linear Regression, we can now explore the other essential tool behind modern machine learning: optimization.

The workhorse of optimization is Gradient Descent.

Understanding it on a very simple example makes the logic much clearer once we apply it to Linear Regression.

A Gentle Warm-Up: Gradient Descent on a Single Variable

Before implementing gradient descent for Linear Regression, we can first do it for a simple function: f(x) = (x-2)^2.

Everyone knows the minimum is at x=2.

But let us pretend we do not know that, and let the algorithm discover it on its own.

The idea is to find the minimum of this function using the following process:

First, we randomly choose an initial value.
Then, at each step, we calculate the value of the derivative df at the current x: df(x) = 2(x-2).
And the next value of x is obtained by subtracting the derivative multiplied by a step size: x = x - step_size*df(x)
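
The three steps above can be sketched in Python as follows. The function names, the starting value, and the step size are assumptions chosen for illustration, and the initial value is fixed rather than random so that the run is reproducible.

```python
def df(x):
    """Derivative of f(x) = (x - 2)^2."""
    return 2 * (x - 2)


def gradient_descent(x0, step_size, n_steps):
    """Repeat x = x - step_size * df(x), starting from x0."""
    x = x0
    history = [x]
    for _ in range(n_steps):
        x = x - step_size * df(x)
        history.append(x)
    return history


# Illustrative settings: start at 10, step size 0.1, 100 iterations.
history = gradient_descent(x0=10.0, step_size=0.1, n_steps=100)
print(history[-1])  # very close to 2
```

Keeping the whole history makes it easy to plot x against the iteration number, much like the spreadsheet column of successive x values.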

You can modify the two parameters of the gradient descent: the initial value of x and the step size.

Yes, even with an initial value of 100, or 1000. It is quite surprising to see how well it works.

But, in some cases, gradient descent will not work. For example, if the step size is too big, the value of x can explode.
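
For this particular function, the reason is easy to see: since df(x) = 2(x-2), each update multiplies the distance between x and 2 by (1 - 2*step_size). That factor shrinks the distance when the step size is between 0 and 1, and grows it when the step size is above 1. The sketch below illustrates both cases; the function name run and the specific values are assumptions chosen for this example.

```python
def run(x0, step_size, n_steps):
    """Gradient descent on f(x) = (x - 2)^2, returning the final x."""
    x = x0
    for _ in range(n_steps):
        x = x - step_size * 2 * (x - 2)
    return x


# A far-away start still converges with a reasonable step size.
far_start = run(x0=1000.0, step_size=0.1, n_steps=200)

# A step size above 1 makes each update overshoot further than the last.
too_big = run(x0=3.0, step_size=1.1, n_steps=50)

print(far_start, too_big)
```

The first value ends up very close to 2, while the second moves thousands of units away from it even though it started only one unit from the minimum.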

Gradient descent for linear regression

The principle of the gradient descent algorithm is the same for linear regression: we have to calculate the partial derivatives of the cost function with respect to the parameters a and b. Let’s denote them da and db.

Squared Error = (prediction - real value)² = (a*x + b - real value)²

da = 2(a*x + b - real value)*x

db = 2(a*x + b - real value)

And then, we can update the coefficients, exactly as we did for x:

a = a - step_size*da
b = b - step_size*db

With these tiny updates, step by step, the optimal values will be found after a number of iterations.
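
Here is a minimal Python sketch of this loop, under a few assumptions: the data is a hypothetical toy dataset (not the 10 points of the spreadsheet), the starting point is a = b = 0, and the derivatives are averaged over the observations rather than summed, which only changes the effective step size by a constant factor.

```python
def gradient_descent_fit(xs, ys, step_size=0.05, n_steps=2000):
    """Fit y = a*x + b by gradient descent on the mean squared error."""
    a, b = 0.0, 0.0  # arbitrary starting point
    n = len(xs)
    for _ in range(n_steps):
        # Average the per-point derivatives da and db over all observations.
        da = sum(2 * (a * x + b - y) * x for x, y in zip(xs, ys)) / n
        db = sum(2 * (a * x + b - y) for x, y in zip(xs, ys)) / n
        # Update both coefficients using the derivatives at the current point.
        a = a - step_size * da
        b = b - step_size * db
    return a, b


# Hypothetical toy data (not the article's dataset).
xs = [1.0, 2.0, 3.0, 4.0, 5.0]
ys = [2.0, 4.0, 5.0, 4.0, 5.0]
a, b = gradient_descent_fit(xs, ys)
print(a, b)  # approximately 0.6 and 2.2
```

For this toy data, the result matches the closed-form coefficients (a = 0.6, b = 2.2). The step size of 0.05 was picked by hand for this dataset; a noticeably larger value makes the updates diverge, as in the single-variable case.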

In the following graph, you can see how a and b converge towards the target values.

We can also see all the details of y hat, the residuals, and the partial derivatives.

We can fully appreciate the beauty of gradient descent, visualized in Excel.

For these two coefficients, we can observe how quick the convergence is.

Now, in practice, we have many observations, and this has to be done for each data point. That is where things become crazy in Google Sheets. So, we use only 10 data points.

You will see that I first created a sheet with long formulas to calculate da and db, which contain the sum of the derivatives over all the observations. Then I created another sheet to show all the details.

Conclusion

Linear Regression may look simple, but it introduces almost everything that modern machine learning relies on.
With just two parameters, a slope and an intercept, it teaches us:

how to define a cost function,
how to find optimal parameters, numerically,
and how optimization behaves when we adjust learning rates or initial values.

The closed-form solution shows the elegance of the mathematics.
Gradient Descent shows the mechanics behind the scenes.
Together, they form the foundation of the “weights + loss function” family that includes Logistic Regression, SVM, Neural Networks, and even today’s LLMs.

New Paths Ahead

You may think Linear Regression is simple, but with its foundations now clear, you can extend it, refine it, and reinterpret it through many different perspectives:

Change the loss function
Replace squared error with logistic loss, hinge loss, or other functions, and new models appear.
Move to classification
Linear Regression itself can separate two classes (0 and 1), but more robust versions lead to Logistic Regression and SVM. And what about multiclass classification?
Model nonlinearity
Through polynomial features or kernels, linear models suddenly become nonlinear in the original space.
Scale to many features
Interpretation becomes harder, regularization becomes essential, and new numerical challenges appear.
Primal vs dual
Linear models can be written in two ways. The primal view learns the weights directly. The dual view rewrites everything using dot products between data points.
Understand modern ML
Gradient Descent, and its variants, are the core of neural networks and large language models.
What we learned here with two parameters generalizes to billions.

Everything in this article stays within the boundaries of Linear Regression, yet it prepares the ground for an entire family of future models.
Day after day, the Advent Calendar will show how all these ideas connect.
