In this blog, we’re going to focus on not only how but also why gradient descent and stochastic gradient descent are used.
We already know about linear regression, and I recently wrote about it in the context of vectors and projections.
Now, we will try to understand gradient descent with the help of a linear regression problem.
But before that, I just want to briefly recall what we already know about linear regression and the math behind it, so that anyone starting out finds it easy to follow.
If you already know the basic math behind linear regression, you can start directly from the section titled Why Do We Need Gradient Descent?
Let’s say we started our machine learning journey, and the first thing we did was implement a linear regression model using Python.
We implemented it successfully and got the best values for the slope and intercept.
Now we have a question: What’s actually happening behind this algorithm?
We want to understand the math behind it.
Linear Regression Recap
For that, let’s consider this data.

The simple linear regression equation is:
\[ \hat{y} = \beta_0 + \beta_1 x \]
We come across these formulas for the slope and intercept. The slope formula is:
\[ \beta_1 = \frac{\sum_{i=1}^{n} (x_i - \bar{x})(y_i - \bar{y})}{\sum_{i=1}^{n} (x_i - \bar{x})^2} \]
The intercept formula is:
\[ \beta_0 = \bar{y} - \beta_1\bar{x} \]
Now, by using these formulas, we calculate the slope and intercept.
The dataset is:
\[ x = [1.2, 1.4, 1.6, 2.1, 2.3, 3.0, 3.1, 3.3, 3.3, 3.8] \]
\[ y = [39344, 46206, 37732, 43526, 39892, 56643, 60151, 54446, 64446, 57190] \]
Compute the mean of x:
\[ \bar{x} = \frac{1.2+1.4+1.6+2.1+2.3+3.0+3.1+3.3+3.3+3.8}{10} = \frac{25.1}{10} = 2.51 \]
Compute the mean of y:
\[ \bar{y} = \frac{39344+46206+37732+43526+39892+56643+60151+54446+64446+57190}{10} = \frac{499576}{10} = 49957.6 \]
Now compute the numerator of the slope formula. After substitution and calculation:
\[ \sum(x_i-\bar{x})(y_i-\bar{y}) = 67555.54 \]
Now compute the denominator. After calculation:
\[ \sum(x_i-\bar{x})^2 = 7.489 \]
Now compute the slope:
\[ \beta_1 = \frac{67555.54}{7.489} \approx 9020.64 \]
Now compute the intercept, carrying the unrounded slope \( \beta_1 \approx 9020.6356 \) to avoid rounding error:
\[ \beta_0 = 49957.6-(9020.6356)(2.51) \approx 27315.80 \]
Therefore:
\[ \beta_0 \approx 27315.80, \qquad \beta_1 \approx 9020.64 \]
Final regression equation:
\[ \hat{y} = 27315.80+9020.64x \]
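The calculation above can be reproduced with a few lines of Python. This is a minimal sketch using only the standard library; the function name `fit_simple_linear_regression` is made up for illustration and is not from any library.

```python
# Years of experience (x) and salary (y), copied from the dataset above.
x = [1.2, 1.4, 1.6, 2.1, 2.3, 3.0, 3.1, 3.3, 3.3, 3.8]
y = [39344, 46206, 37732, 43526, 39892, 56643, 60151, 54446, 64446, 57190]


def fit_simple_linear_regression(xs, ys):
    """Return (intercept, slope) from the closed-form least-squares formulas."""
    n = len(xs)
    x_mean = sum(xs) / n
    y_mean = sum(ys) / n
    # Numerator: sum of products of deviations from the means.
    s_xy = sum((xi - x_mean) * (yi - y_mean) for xi, yi in zip(xs, ys))
    # Denominator: sum of squared deviations of x.
    s_xx = sum((xi - x_mean) ** 2 for xi in xs)
    slope = s_xy / s_xx
    intercept = y_mean - slope * x_mean
    return intercept, slope


intercept, slope = fit_simple_linear_regression(x, y)
print(f"intercept = {intercept:.2f}, slope = {slope:.2f}")
```

Working from unrounded intermediate values, as the code does, avoids the small drift that appears when rounded numbers are carried from one step to the next by hand.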
We got the values using the formulas, but we’re not satisfied and want to go deeper.
Now our goal is to find out how we got these formulas.
To understand that, we will now look at a 3D bowl-shaped curve. We get that bowl curve when we plot all the possible combinations of \( \beta_0 \), \( \beta_1 \), and the mean squared error (MSE).

Now, by looking at the curve, we understand that we want the mean squared error to be as low as possible, and it reaches its minimum when the gradient becomes zero.
We already know that to find the slope of any curve, we need differentiation.
Next, we perform differentiation on the loss function, since the bowl curve is the 3D representation of it, and you will notice that here we have two variables.
So, we perform partial differentiation and then solve further to get the formulas for the slope and intercept.
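To get a feel for the bowl without plotting it, the sketch below evaluates the MSE at the least-squares solution and at a few surrounding points. The grid offsets of 500 are arbitrary values chosen for illustration, and `mse` is a helper defined here, not a library function.

```python
# Years of experience (x) and salary (y) from the dataset used in this post.
x = [1.2, 1.4, 1.6, 2.1, 2.3, 3.0, 3.1, 3.3, 3.3, 3.8]
y = [39344, 46206, 37732, 43526, 39892, 56643, 60151, 54446, 64446, 57190]


def mse(b0, b1):
    """Mean squared error of the line y_hat = b0 + b1 * x over the dataset."""
    return sum((yi - (b0 + b1 * xi)) ** 2 for xi, yi in zip(x, y)) / len(x)


# Bottom of the bowl: the closed-form least-squares solution.
n = len(x)
x_mean, y_mean = sum(x) / n, sum(y) / n
b1_best = (sum((xi - x_mean) * (yi - y_mean) for xi, yi in zip(x, y))
           / sum((xi - x_mean) ** 2 for xi in x))
b0_best = y_mean - b1_best * x_mean
best = mse(b0_best, b1_best)

# Sample a small grid around the optimum (offsets are arbitrary).
offsets = (-500.0, 0.0, 500.0)
neighbours = [mse(b0_best + d0, b1_best + d1)
              for d0 in offsets for d1 in offsets if (d0, d1) != (0.0, 0.0)]

print(f"loss at optimum: {best:.1f}")
print(f"smallest loss on the surrounding grid: {min(neighbours):.1f}")
```

Every neighbouring point has a higher loss than the optimum, which is what the bowl shape implies: moving away from the bottom in any direction makes the error larger.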
Deriving the Formulas for Slope and Intercept
Start with the Mean Squared Error (MSE) loss function:
\[ MSE(\beta_0,\beta_1) = \frac{1}{n} \sum_{i=1}^{n} (y_i-(\beta_0+\beta_1x_i))^2 \]
Rearrange the inner expression:
\[ = \frac{1}{n} \sum_{i=1}^{n} (y_i-\beta_0-\beta_1x_i)^2 \]
Now take the partial derivative with respect to \( \beta_0 \):
\[ \frac{\partial MSE}{\partial \beta_0} = \frac{\partial}{\partial \beta_0} \left( \frac{1}{n} \sum_{i=1}^{n} (y_i-\beta_0-\beta_1x_i)^2 \right) \]
Take the constant outside:
\[ = \frac{1}{n} \frac{\partial}{\partial \beta_0} \sum_{i=1}^{n} (y_i-\beta_0-\beta_1x_i)^2 \]
Move the derivative inside the summation:
\[ = \frac{1}{n} \sum_{i=1}^{n} \frac{\partial}{\partial \beta_0} (y_i-\beta_0-\beta_1x_i)^2 \]
Apply the chain rule:
\[ = \frac{1}{n} \sum_{i=1}^{n} 2(y_i-\beta_0-\beta_1x_i) \cdot \frac{\partial}{\partial \beta_0} (y_i-\beta_0-\beta_1x_i) \]
Apply the derivative rules:
\[ \frac{d}{d\beta_0}(y_i)=0 \]
\[ \frac{d}{d\beta_0}(-\beta_0)=-1 \]
\[ \frac{d}{d\beta_0}(-\beta_1x_i)=0 \]
So the inner derivative becomes:
\[ \frac{\partial}{\partial \beta_0} (y_i-\beta_0-\beta_1x_i) = -1 \]
Substitute back:
\[ \frac{\partial MSE}{\partial \beta_0} = \frac{1}{n} \sum_{i=1}^{n} 2(y_i-\beta_0-\beta_1x_i)(-1) \]
Simplify:
\[ = -\frac{2}{n} \sum_{i=1}^{n} (y_i-\beta_0-\beta_1x_i) \]
Set the derivative equal to zero:
\[ -\frac{2}{n} \sum_{i=1}^{n} (y_i-\beta_0-\beta_1x_i) = 0 \]
Multiply both sides by \( -\frac{n}{2} \):
\[ \sum_{i=1}^{n} (y_i-\beta_0-\beta_1x_i) = 0 \]
Expand:
\[ \sum_{i=1}^{n}y_i - n\beta_0 - \beta_1\sum_{i=1}^{n}x_i = 0 \]
Rearrange:
\[ n\beta_0 = \sum_{i=1}^{n}y_i - \beta_1\sum_{i=1}^{n}x_i \]
Divide by \( n \):
\[ \beta_0 = \frac{1}{n}\sum_{i=1}^{n}y_i - \beta_1 \frac{1}{n}\sum_{i=1}^{n}x_i \]
Using the means:
\[ \bar{x} = \frac{1}{n}\sum_{i=1}^{n}x_i \]
\[ \bar{y} = \frac{1}{n}\sum_{i=1}^{n}y_i \]
Final intercept formula:
\[ \beta_0 = \bar{y} - \beta_1\bar{x} \]
Now take the partial derivative with respect to \( \beta_1 \):
\[ \frac{\partial MSE}{\partial \beta_1} = \frac{\partial}{\partial \beta_1} \left( \frac{1}{n} \sum_{i=1}^{n} (y_i-\beta_0-\beta_1x_i)^2 \right) \]
Take the constant outside:
\[ = \frac{1}{n} \frac{\partial}{\partial \beta_1} \sum_{i=1}^{n} (y_i-\beta_0-\beta_1x_i)^2 \]
Move the derivative inside the summation:
\[ = \frac{1}{n} \sum_{i=1}^{n} \frac{\partial}{\partial \beta_1} (y_i-\beta_0-\beta_1x_i)^2 \]
Apply the chain rule:
\[ = \frac{1}{n} \sum_{i=1}^{n} 2(y_i-\beta_0-\beta_1x_i) \cdot \frac{\partial}{\partial \beta_1} (y_i-\beta_0-\beta_1x_i) \]
Apply the derivative rules:
\[ \frac{d}{d\beta_1}(y_i)=0 \]
\[ \frac{d}{d\beta_1}(-\beta_0)=0 \]
\[ \frac{d}{d\beta_1}(-\beta_1x_i)=-x_i \]
So the inner derivative becomes:
\[ \frac{\partial}{\partial \beta_1} (y_i-\beta_0-\beta_1x_i) = -x_i \]
Substitute back:
\[ \frac{\partial MSE}{\partial \beta_1} = \frac{1}{n} \sum_{i=1}^{n} 2(y_i-\beta_0-\beta_1x_i)(-x_i) \]
Simplify:
\[ = -\frac{2}{n} \sum_{i=1}^{n} x_i(y_i-\beta_0-\beta_1x_i) \]
Set the derivative equal to zero:
\[ -\frac{2}{n} \sum_{i=1}^{n} x_i(y_i-\beta_0-\beta_1x_i) = 0 \]
Multiply both sides by \( -\frac{n}{2} \):
\[ \sum_{i=1}^{n} x_i(y_i-\beta_0-\beta_1x_i) = 0 \]
Expand:
\[ \sum_{i=1}^{n}x_iy_i - \beta_0\sum_{i=1}^{n}x_i - \beta_1\sum_{i=1}^{n}x_i^2 = 0 \]
Substitute \( \beta_0 = \bar{y} - \beta_1\bar{x} \) into the equation:
\[ \sum_{i=1}^{n}x_iy_i - (\bar{y}-\beta_1\bar{x}) \sum_{i=1}^{n}x_i - \beta_1\sum_{i=1}^{n}x_i^2 = 0 \]
Expand:
\[ \sum_{i=1}^{n}x_iy_i - \bar{y}\sum_{i=1}^{n}x_i + \beta_1\bar{x}\sum_{i=1}^{n}x_i - \beta_1\sum_{i=1}^{n}x_i^2 = 0 \]
Since:
\[ \sum_{i=1}^{n}x_i=n\bar{x} \]
Substitute:
\[ \sum_{i=1}^{n}x_iy_i - n\bar{x}\bar{y} + \beta_1n\bar{x}^2 - \beta_1\sum_{i=1}^{n}x_i^2 = 0 \]
Group the \( \beta_1 \) terms:
\[ \beta_1 \left(n\bar{x}^2-\sum_{i=1}^{n}x_i^2\right) = n\bar{x}\bar{y} - \sum_{i=1}^{n}x_iy_i \]
Multiply both sides by -1:
\[ \beta_1 \left(\sum_{i=1}^{n}x_i^2-n\bar{x}^2\right) = \sum_{i=1}^{n}x_iy_i - n\bar{x}\bar{y} \]
Final slope formula:
\[ \beta_1 = \frac{\sum_{i=1}^{n}x_iy_i - n\bar{x}\bar{y}}{\sum_{i=1}^{n}x_i^2 - n\bar{x}^2} \]
Because \( \sum(x_i-\bar{x})(y_i-\bar{y}) = \sum x_iy_i - n\bar{x}\bar{y} \) and \( \sum(x_i-\bar{x})^2 = \sum x_i^2 - n\bar{x}^2 \), this is the same as the equivalent covariance form:
\[ \beta_1 = \frac{\sum_{i=1}^{n}(x_i-\bar{x})(y_i-\bar{y})}{\sum_{i=1}^{n}(x_i-\bar{x})^2} \]
Finally, substitute the computed value of \( \beta_1 \) into the intercept equation:
\[ \beta_0 = \bar{y} - \beta_1\bar{x} \]
Thus, the final regression equation becomes:
\[ \hat{y} = \beta_0 + \beta_1x \]
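One way to sanity-check the derivation is to compare the two partial derivatives against a numerical approximation, and to confirm that both vanish at the closed-form solution. The sketch below does this with central finite differences; the step size `h` and the test point (2, 5) are arbitrary choices, and the helper names are illustrative only.

```python
x = [1.2, 1.4, 1.6, 2.1, 2.3, 3.0, 3.1, 3.3, 3.3, 3.8]
y = [39344, 46206, 37732, 43526, 39892, 56643, 60151, 54446, 64446, 57190]
n = len(x)


def mse(b0, b1):
    """Mean squared error for the line y_hat = b0 + b1 * x."""
    return sum((yi - b0 - b1 * xi) ** 2 for xi, yi in zip(x, y)) / n


def analytic_gradient(b0, b1):
    """Partial derivatives derived above: (dMSE/db0, dMSE/db1)."""
    residuals = [yi - b0 - b1 * xi for xi, yi in zip(x, y)]
    d_b0 = -2.0 / n * sum(residuals)
    d_b1 = -2.0 / n * sum(xi * r for xi, r in zip(x, residuals))
    return d_b0, d_b1


def numeric_gradient(b0, b1, h=0.01):
    """Central finite differences, used only as an independent check."""
    d_b0 = (mse(b0 + h, b1) - mse(b0 - h, b1)) / (2 * h)
    d_b1 = (mse(b0, b1 + h) - mse(b0, b1 - h)) / (2 * h)
    return d_b0, d_b1


# Closed-form solution from the slope and intercept formulas.
x_mean, y_mean = sum(x) / n, sum(y) / n
slope = (sum((xi - x_mean) * (yi - y_mean) for xi, yi in zip(x, y))
         / sum((xi - x_mean) ** 2 for xi in x))
intercept = y_mean - slope * x_mean

grad_at_start = analytic_gradient(2.0, 5.0)
check_at_start = numeric_gradient(2.0, 5.0)
grad_at_optimum = analytic_gradient(intercept, slope)

print("analytic vs numeric at (2, 5):", grad_at_start, check_at_start)
print("analytic gradient at the closed-form solution:", grad_at_optimum)
```

The analytic and numeric gradients should agree closely at the test point, and both components should be zero (up to floating-point noise) at the closed-form solution.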
Now we have seen how we got the formulas for the slope and intercept.
But one thing we need to consider here is that we derived these formulas for a case where we only have one feature, and even for one feature, we can see how complex the math was.
What if we have more than one feature, as most real-world datasets do?
The math becomes more complex, and this is where we use the matrix form to represent the equations. Using matrix notation, we can derive the normal equation, which generalizes to any number of features.
Deriving the Normal Equation
In Simple Linear Regression, we derived one intercept and one slope:
\[ \hat{y} = \beta_0+\beta_1x \]
However, real-world problems usually contain multiple features.
For example:
years of experience
education level
age
In such cases, Linear Regression becomes:
\[ \hat{y} = \beta_0 + \beta_1x_1 + \beta_2x_2 + \beta_3x_3 + \cdots + \beta_px_p \]
where:
\( \beta_0 \) is the intercept and
\( \beta_1,\beta_2,\beta_3,\dots,\beta_p \) are slopes for different features
As the number of features increases, solving separate equations for every parameter becomes difficult.
To handle this more easily, Linear Regression is rewritten using matrix notation.
Suppose we have \( n \) observations and \( p \) features.
First define the target vector:
\[ Y = \begin{bmatrix} y_1 \\ y_2 \\ y_3 \\ \vdots \\ y_n \end{bmatrix} \]
Now define the feature matrix.
The first column contains only 1s to represent the intercept term.
\[
X =
\begin{bmatrix}
1 & x_{11} & x_{12} & \cdots & x_{1p} \\
1 & x_{21} & x_{22} & \cdots & x_{2p} \\
1 & x_{31} & x_{32} & \cdots & x_{3p} \\
\vdots & \vdots & \vdots & \ddots & \vdots \\
1 & x_{n1} & x_{n2} & \cdots & x_{np}
\end{bmatrix}
\]
Now define the parameter vector:
\[ \beta = \begin{bmatrix} \beta_0 \\ \beta_1 \\ \beta_2 \\ \vdots \\ \beta_p \end{bmatrix} \]
Using matrix multiplication:
\[
X\beta =
\begin{bmatrix}
1 & x_{11} & x_{12} & \cdots & x_{1p} \\
1 & x_{21} & x_{22} & \cdots & x_{2p} \\
1 & x_{31} & x_{32} & \cdots & x_{3p} \\
\vdots & \vdots & \vdots & \ddots & \vdots \\
1 & x_{n1} & x_{n2} & \cdots & x_{np}
\end{bmatrix}
\begin{bmatrix} \beta_0 \\ \beta_1 \\ \beta_2 \\ \vdots \\ \beta_p \end{bmatrix}
\]
Performing the multiplication:
\[
=
\begin{bmatrix}
\beta_0+\beta_1x_{11}+\beta_2x_{12}+\cdots+\beta_px_{1p} \\
\beta_0+\beta_1x_{21}+\beta_2x_{22}+\cdots+\beta_px_{2p} \\
\beta_0+\beta_1x_{31}+\beta_2x_{32}+\cdots+\beta_px_{3p} \\
\vdots \\
\beta_0+\beta_1x_{n1}+\beta_2x_{n2}+\cdots+\beta_px_{np}
\end{bmatrix}
\]
This gives the prediction vector:
\[ \hat{Y}=X\beta \]
Now define the residual vector.
Residuals are the differences between actual and predicted values.
\[ Y-\hat{Y} \]
Substituting \( \hat{Y}=X\beta \):
\[ Y-X\beta \]
The Mean Squared Error (MSE) becomes:
\[ MSE = \frac{1}{n} (Y-X\beta)^T(Y-X\beta) \]
The transpose is required because \( (Y-X\beta) \) is a column vector.
Multiplying its transpose by the vector itself converts the expression into a scalar sum of squared residuals.
Now expand the expression.
\[ MSE = \frac{1}{n} (Y-X\beta)^T(Y-X\beta) \]
\[ = \frac{1}{n} \left( Y^TY - Y^TX\beta - (X\beta)^TY + (X\beta)^TX\beta \right) \]
Using the transpose property:
\[ (X\beta)^T = \beta^TX^T \]
Substitute into the equation:
\[ MSE = \frac{1}{n} \left( Y^TY - Y^TX\beta - \beta^TX^TY + \beta^TX^TX\beta \right) \]
Notice that \( Y^TX\beta \) is a scalar.
Scalars are equal to their transpose.
Therefore:
\[ Y^TX\beta = \beta^TX^TY \]
So the middle two terms combine:
\[ MSE = \frac{1}{n} \left( Y^TY - 2\beta^TX^TY + \beta^TX^TX\beta \right) \]
To minimize the MSE, take the derivative with respect to \( \beta \).
The derivative of \( Y^TY \) is zero because it does not contain \( \beta \).
The derivative of \( -2\beta^TX^TY \) is:
\[ -2X^TY \]
The derivative of \( \beta^TX^TX\beta \) is (using the fact that \( X^TX \) is symmetric):
\[ 2X^TX\beta \]
Therefore:
\[ \frac{\partial MSE}{\partial \beta} = \frac{1}{n} \left( -2X^TY + 2X^TX\beta \right) \]
Simplify:
\[ = \frac{-2}{n}X^TY + \frac{2}{n}X^TX\beta \]
Set the derivative equal to zero for minimization:
\[ \frac{-2}{n}X^TY + \frac{2}{n}X^TX\beta = 0 \]
Multiply both sides by \( \frac{n}{2} \):
\[ -X^TY + X^TX\beta = 0 \]
Rearrange:
\[ X^TX\beta = X^TY \]
Now, assuming \( X^TX \) is invertible, multiply both sides by \( (X^TX)^{-1} \):
\[ (X^TX)^{-1}X^TX\beta = (X^TX)^{-1}X^TY \]
Using the identity matrix property:
\[ (X^TX)^{-1}(X^TX)=I \]
we get:
\[ I\beta = (X^TX)^{-1}X^TY \]
Since:
\[ I\beta=\beta \]
the final Normal Equation becomes:
\[ \beta = (X^TX)^{-1}X^TY \]
This equation simultaneously computes the intercept and all slopes, that is, the optimal parameters that minimize the Mean Squared Error.
Usually, the normal equation is derived by minimizing the RSS (Residual Sum of Squares). However, since MSE is simply RSS divided by the number of observations, minimizing MSE produces the same normal equation.
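As a rough illustration of how the normal equation handles several features at once, here is a small pure-Python sketch. Two things are assumptions of this example rather than part of the derivation: the tiny two-feature dataset is made up (its targets follow y = 1 + 2·x1 + 3·x2 exactly), and the system \( X^TX\beta = X^TY \) is solved with Gaussian elimination instead of forming the inverse explicitly, which gives the same \( \beta \). The helper names are hypothetical.

```python
def transpose(m):
    """Return the transpose of a matrix stored as a list of rows."""
    return [list(col) for col in zip(*m)]


def matmul(a, b):
    """Multiply two matrices stored as lists of rows."""
    return [[sum(p * q for p, q in zip(row, col)) for col in zip(*b)]
            for row in a]


def solve(a, b):
    """Solve a @ z = b by Gaussian elimination with partial pivoting."""
    n = len(a)
    aug = [row[:] + [b_i] for row, b_i in zip(a, b)]  # augmented matrix
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(aug[r][col]))
        if abs(aug[pivot][col]) < 1e-12:
            raise ValueError("X^T X is singular or nearly singular")
        aug[col], aug[pivot] = aug[pivot], aug[col]
        for r in range(col + 1, n):
            factor = aug[r][col] / aug[col][col]
            for c in range(col, n + 1):
                aug[r][c] -= factor * aug[col][c]
    # Back substitution on the upper-triangular system.
    z = [0.0] * n
    for i in range(n - 1, -1, -1):
        known = sum(aug[i][j] * z[j] for j in range(i + 1, n))
        z[i] = (aug[i][n] - known) / aug[i][i]
    return z


def normal_equation(features, targets):
    """Return [beta_0, beta_1, ..., beta_p] solving X^T X beta = X^T Y."""
    X = [[1.0] + list(row) for row in features]  # prepend the column of 1s
    Xt = transpose(X)
    XtX = matmul(Xt, X)
    XtY = [sum(v * t for v, t in zip(row, targets)) for row in Xt]
    return solve(XtX, XtY)


# Hypothetical data generated from y = 1 + 2*x1 + 3*x2 (no noise).
features = [[1, 2], [2, 1], [3, 4], [4, 3], [5, 5]]
targets = [1 + 2 * x1 + 3 * x2 for x1, x2 in features]

beta = normal_equation(features, targets)
print([round(b, 6) for b in beta])
```

Because the made-up targets are an exact linear function of the features, the recovered parameters should be the intercept 1 and the slopes 2 and 3, up to floating-point error.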
Now we have the normal equation. Let’s solve for the slope and intercept once again using this equation.
Solving for Slope and Intercept Using the Normal Equation
The matrix form of Linear Regression is:
\[ \beta=(X^TX)^{-1}X^TY \]
Construct the feature matrix.
The first column contains 1s for the intercept term.
\[
X =
\begin{bmatrix}
1 & 1.2 \\ 1 & 1.4 \\ 1 & 1.6 \\ 1 & 2.1 \\ 1 & 2.3 \\
1 & 3.0 \\ 1 & 3.1 \\ 1 & 3.3 \\ 1 & 3.3 \\ 1 & 3.8
\end{bmatrix}
\]
Construct the target vector:
\[
Y =
\begin{bmatrix}
39344 \\ 46206 \\ 37732 \\ 43526 \\ 39892 \\
56643 \\ 60151 \\ 54446 \\ 64446 \\ 57190
\end{bmatrix}
\]
Parameter vector:
\[ \beta = \begin{bmatrix} \beta_0 \\ \beta_1 \end{bmatrix} \]
Now compute the transpose:
\[
X^T =
\begin{bmatrix}
1 & 1 & 1 & 1 & 1 & 1 & 1 & 1 & 1 & 1 \\
1.2 & 1.4 & 1.6 & 2.1 & 2.3 & 3.0 & 3.1 & 3.3 & 3.3 & 3.8
\end{bmatrix}
\]
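Compute:
\[ X^TX = \begin{bmatrix} 10 & 25.1 \\ 25.1 & 70.49 \end{bmatrix} \]
Now compute the inverse. The determinant is \( (10)(70.49)-(25.1)^2 = 74.89 \), so:
\[ (X^TX)^{-1} = \frac{1}{74.89} \begin{bmatrix} 70.49 & -25.1 \\ -25.1 & 10 \end{bmatrix} \approx \begin{bmatrix} 0.9412 & -0.3352 \\ -0.3352 & 0.1335 \end{bmatrix} \]
Now compute:
\[ X^TY = \begin{bmatrix} 499576 \\ 1321491.3 \end{bmatrix} \]
Substitute into the Normal Equation, keeping the exact factor \( \frac{1}{74.89} \) so that rounding in the inverse does not distort the result:
\[ \beta = \frac{1}{74.89} \begin{bmatrix} 70.49 & -25.1 \\ -25.1 & 10 \end{bmatrix} \begin{bmatrix} 499576 \\ 1321491.3 \end{bmatrix} \]
After multiplication:
\[ \beta = \frac{1}{74.89} \begin{bmatrix} 2045680.61 \\ 675555.4 \end{bmatrix} \approx \begin{bmatrix} 27315.80 \\ 9020.64 \end{bmatrix} \]
Therefore:
\[ \beta_0 \approx 27315.80, \qquad \beta_1 \approx 9020.64 \]
Final regression equation:
\[ \hat{y} = 27315.80+9020.64x \]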
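For a single feature, the matrices involved are only 2×2, so the normal equation can be evaluated directly. The following sketch builds the entries of \( X^TX \) and \( X^TY \) from sums and applies the 2×2 inverse formula; the variable names are chosen for this example.

```python
x = [1.2, 1.4, 1.6, 2.1, 2.3, 3.0, 3.1, 3.3, 3.3, 3.8]
y = [39344, 46206, 37732, 43526, 39892, 56643, 60151, 54446, 64446, 57190]
n = len(x)

# X^T X for X = [1, x] is [[n, sum(x)], [sum(x), sum(x^2)]].
sum_x = sum(x)
sum_xx = sum(xi * xi for xi in x)

# X^T Y is [sum(y), sum(x * y)].
sum_y = sum(y)
sum_xy = sum(xi * yi for xi, yi in zip(x, y))

# Inverse of a 2x2 matrix [[a, b], [c, d]] is 1/(ad - bc) * [[d, -b], [-c, a]].
det = n * sum_xx - sum_x * sum_x
inv = [[sum_xx / det, -sum_x / det],
       [-sum_x / det, n / det]]

# beta = (X^T X)^{-1} X^T Y
beta0 = inv[0][0] * sum_y + inv[0][1] * sum_xy
beta1 = inv[1][0] * sum_y + inv[1][1] * sum_xy

print(f"X^T X = [[{n}, {sum_x:.1f}], [{sum_x:.1f}, {sum_xx:.2f}]]")
print(f"X^T Y = [{sum_y}, {sum_xy:.1f}]")
print(f"beta0 = {beta0:.2f}, beta1 = {beta1:.2f}")
```

The result should agree with the slope and intercept obtained from the closed-form formulas, since both methods minimize the same loss.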
Why Do We Need Gradient Descent?
Now, after getting the normal equation for linear regression, we might think that we can solve for the optimal parameters even when we have many features.
But one thing we need to note here is that this method works well only for small or medium-sized problems. When we have very large datasets, solving the normal equation becomes computationally expensive.
Let’s look at the normal equation:
\[ \beta = (X^TX)^{-1}X^Ty \]
From the equation, we can see the inverse calculation, and this is where solving for the slope and intercept using the normal equation becomes computationally expensive. With \( n \) observations and \( p \) features, forming \( X^TX \) takes on the order of \( np^2 \) operations, and inverting it takes on the order of \( p^3 \).
This works well for small datasets, but in the real world, we often have thousands of features and millions of data points.
In such cases, solving the normal equation becomes slow and requires a lot of computational power.
This is where gradient descent is used, because instead of directly solving for the solution, we gradually move toward the optimal solution step by step.
Now, to understand how gradient descent works, let’s look at the math behind it.
The Math Behind Gradient Descent
When we were deriving the normal equation, we arrived at this equation.
\[ \frac{\partial MSE}{\partial \beta} = \frac{2}{n}X^T(X\beta-Y) \]
This equation represents the gradient (slope) of the bowl-shaped loss curve.
We set it equal to zero and then solved further to get the normal equation, which is used to find the optimal solution.
But in gradient descent, we stop at this equation and initialize some random values for \( \beta \). Using these values, we calculate the gradient (slope) and gradually move toward the minimum loss step by step.
Let’s assume we initialize \( \beta_0 = 2 \) and \( \beta_1 = 5 \):
\[ \beta^{(0)} = \begin{bmatrix} \beta_0 \\ \beta_1 \end{bmatrix} = \begin{bmatrix} 2 \\ 5 \end{bmatrix} \]
Next, we calculate the slope of the bowl curve by substituting these values into the gradient equation.
We already know that the gradient equation is:
\[ \frac{\partial MSE}{\partial \beta} = \frac{-2}{n}X^Ty + \frac{2}{n}X^TX\beta \]
The initialized parameter values are:
\[ \beta^{(0)} = \begin{bmatrix} 2 \\ 5 \end{bmatrix} \]
These are just the starting values from which Gradient Descent begins searching for the minimum loss.
Now let’s construct the feature matrix.
Since we have one feature, the matrix \( X \) becomes:
\[
X =
\begin{bmatrix}
1 & 1.2 \\ 1 & 1.4 \\ 1 & 1.6 \\ 1 & 2.1 \\ 1 & 2.3 \\
1 & 3.0 \\ 1 & 3.1 \\ 1 & 3.3 \\ 1 & 3.3 \\ 1 & 3.8
\end{bmatrix}
\]
The first column contains ones for the intercept term.
Now calculate \( X^T \):
\[
X^T =
\begin{bmatrix}
1 & 1 & 1 & 1 & 1 & 1 & 1 & 1 & 1 & 1 \\
1.2 & 1.4 & 1.6 & 2.1 & 2.3 & 3.0 & 3.1 & 3.3 & 3.3 & 3.8
\end{bmatrix}
\]
Now calculate \( X^TX \):
\[ X^TX = \begin{bmatrix} 10 & 25.1 \\ 25.1 & 70.49 \end{bmatrix} \]
Next, let the target vector be:
\[
y =
\begin{bmatrix}
39344 \\ 46206 \\ 37732 \\ 43526 \\ 39892 \\
56643 \\ 60151 \\ 54446 \\ 64446 \\ 57190
\end{bmatrix}
\]
Now calculate \( X^Ty \):
\[ X^Ty = \begin{bmatrix} 499576 \\ 1321491.3 \end{bmatrix} \]
Our dataset contains \( n=10 \) observations.
Now substitute all the values into the gradient equation:
\[ \frac{\partial MSE}{\partial \beta} = \frac{-2}{10} \begin{bmatrix} 499576 \\ 1321491.3 \end{bmatrix} + \frac{2}{10} \begin{bmatrix} 10 & 25.1 \\ 25.1 & 70.49 \end{bmatrix} \begin{bmatrix} 2 \\ 5 \end{bmatrix} \]
First, calculate the matrix multiplication:
\[ \begin{bmatrix} 10 & 25.1 \\ 25.1 & 70.49 \end{bmatrix} \begin{bmatrix} 2 \\ 5 \end{bmatrix} = \begin{bmatrix} (10)(2)+(25.1)(5) \\ (25.1)(2)+(70.49)(5) \end{bmatrix} = \begin{bmatrix} 20+125.5 \\ 50.2+352.45 \end{bmatrix} = \begin{bmatrix} 145.5 \\ 402.65 \end{bmatrix} \]
Now multiply by \( \frac{2}{10} \):
\[ \frac{2}{10} \begin{bmatrix} 145.5 \\ 402.65 \end{bmatrix} = \begin{bmatrix} 29.1 \\ 80.53 \end{bmatrix} \]
Next, calculate:
\[ \frac{-2}{10} \begin{bmatrix} 499576 \\ 1321491.3 \end{bmatrix} = \begin{bmatrix} -99915.2 \\ -264298.26 \end{bmatrix} \]
Now substitute everything back:
\[ \frac{\partial MSE}{\partial \beta} = \begin{bmatrix} -99915.2 \\ -264298.26 \end{bmatrix} + \begin{bmatrix} 29.1 \\ 80.53 \end{bmatrix} \]
Finally:
\[ \frac{\partial MSE}{\partial \beta} = \begin{bmatrix} -99886.1 \\ -264217.73 \end{bmatrix} \]
This gradient represents the slope of the bowl-shaped MSE loss curve at the current parameter values.
Here, \( -99886.1 \) is the slope with respect to \( \beta_0 \), and \( -264217.73 \) is the slope with respect to \( \beta_1 \).
Since both values are negative, the loss decreases as \( \beta_0 \) and \( \beta_1 \) increase, so Gradient Descent moves toward the right (toward larger parameter values) to reduce the loss.
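The gradient at the starting point can also be computed directly from the data, without building the matrices by hand. This is a minimal sketch of the same formula, \( \frac{2}{n}X^T(X\beta-y) \), written out component by component; the function name `gradient` is an illustrative choice.

```python
x = [1.2, 1.4, 1.6, 2.1, 2.3, 3.0, 3.1, 3.3, 3.3, 3.8]
y = [39344, 46206, 37732, 43526, 39892, 56643, 60151, 54446, 64446, 57190]


def gradient(beta0, beta1):
    """Gradient of the MSE with respect to (beta0, beta1) on the full dataset."""
    n = len(x)
    # Entries of X @ beta - y: prediction minus actual value for each row.
    errors = [(beta0 + beta1 * xi) - yi for xi, yi in zip(x, y)]
    # First row of X^T is all ones, second row is the x values.
    g0 = 2.0 / n * sum(errors)
    g1 = 2.0 / n * sum(xi * e for xi, e in zip(x, errors))
    return g0, g1


g0, g1 = gradient(2.0, 5.0)
print(f"dMSE/dbeta0 = {g0:.2f}")
print(f"dMSE/dbeta1 = {g1:.2f}")
```

Both components come out negative at (2, 5), which matches the observation that the parameters need to increase to reduce the loss.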
Now, instead of directly solving for the optimal parameters mathematically, Gradient Descent gradually updates the parameter values step by step until it reaches the minimum point of the bowl-shaped loss curve.
This update is performed using the Gradient Descent update equation:
\[ \beta:=\beta-\alpha\frac{\partial MSE}{\partial \beta} \]
where \( \alpha \) is called the learning rate and controls how large each update step should be.
The update equation can be understood step by step.
\( \beta \) represents the current parameter values.
\( \frac{\partial MSE}{\partial \beta} \) represents the slope (gradient) of the bowl-shaped loss curve at the current point.
The gradient tells us the direction in which the loss increases the fastest.
Therefore, to reduce the loss, we move in the opposite direction of the gradient.
That is why the update equation subtracts the gradient, scaled by \( \alpha \).
If the gradient is positive, Gradient Descent moves toward the left (the parameter decreases).
If the gradient is negative, Gradient Descent moves toward the right (the parameter increases).
By repeatedly calculating gradients and updating parameters, Gradient Descent gradually moves toward the minimum point of the bowl-shaped loss curve.
After updating the parameters, the entire process is repeated until the loss stops decreasing and the model reaches the optimal parameters.
Notice that there is no inverse calculation involved.
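Putting the gradient and the update rule together gives the full loop. The sketch below is one possible way to write batch gradient descent for this dataset; the starting point (2, 5) and the learning rate 0.01 follow the text, while the number of steps (20,000) is an assumption chosen so that the loop has plenty of time to settle.

```python
x = [1.2, 1.4, 1.6, 2.1, 2.3, 3.0, 3.1, 3.3, 3.3, 3.8]
y = [39344, 46206, 37732, 43526, 39892, 56643, 60151, 54446, 64446, 57190]


def batch_gradient_descent(alpha=0.01, steps=20000, beta0=2.0, beta1=5.0):
    """Repeat beta := beta - alpha * gradient, using every row at each step."""
    n = len(x)
    for _ in range(steps):
        # Prediction error for every observation at the current parameters.
        errors = [(beta0 + beta1 * xi) - yi for xi, yi in zip(x, y)]
        g0 = 2.0 / n * sum(errors)
        g1 = 2.0 / n * sum(xi * e for xi, e in zip(x, errors))
        # Move against the gradient, scaled by the learning rate.
        beta0 -= alpha * g0
        beta1 -= alpha * g1
    return beta0, beta1


b0, b1 = batch_gradient_descent()
print(f"beta0 = {b0:.2f}, beta1 = {b1:.2f}")
```

With these settings the loop ends up very close to the solution of the normal equation, using only additions and multiplications and no matrix inverse.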
Learning Rate
One important thing we need to understand here is the learning rate.
Let’s assume:
\[ \alpha = 0.01 \]
and the calculated gradient is:
\[ \frac{\partial MSE}{\partial \beta} = \begin{bmatrix} -99886.1 \\ -264217.73 \end{bmatrix} \]
Now substitute these values into the update equation:
\[ \beta = \begin{bmatrix} 2 \\ 5 \end{bmatrix} - 0.01 \begin{bmatrix} -99886.1 \\ -264217.73 \end{bmatrix} \]
First, multiply the learning rate by the gradient:
\[ 0.01 \begin{bmatrix} -99886.1 \\ -264217.73 \end{bmatrix} = \begin{bmatrix} -998.861 \\ -2642.1773 \end{bmatrix} \]
Now substitute back:
\[ \beta = \begin{bmatrix} 2 \\ 5 \end{bmatrix} - \begin{bmatrix} -998.861 \\ -2642.1773 \end{bmatrix} = \begin{bmatrix} 2+998.861 \\ 5+2642.1773 \end{bmatrix} \]
Finally:
\[ \beta = \begin{bmatrix} 1000.861 \\ 2647.1773 \end{bmatrix} \]
After one iteration of Gradient Descent, \( \beta_0 \) changed from 2 to 1000.861, and \( \beta_1 \) changed from 5 to 2647.1773.
These updated parameter values move us closer to the minimum point of the bowl-shaped MSE loss curve.
Now, using these updated values, the entire process is repeated:
\[ \text{Predictions} \rightarrow \text{Residuals} \rightarrow \text{Loss} \rightarrow \text{Gradient} \rightarrow \text{Parameter Update} \]
This iterative process continues until the loss stops decreasing and the model reaches the optimal parameters.
Now let’s understand why choosing the learning rate is important.
If the learning rate is very small:
\[ \alpha = 0.000001 \]
then the updates become extremely small.
As a result, learning is very slow, and Gradient Descent may require thousands of iterations to reach the minimum point.
On the other hand, if the learning rate is very large:
\[ \alpha = 10 \]
then the updates become extremely large.
As a result, Gradient Descent may overshoot the minimum point repeatedly and fail to reach the solution.
Therefore, choosing a proper learning rate is important for efficient optimization.
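The effect of the learning rate is easy to see by running the same loop with the three values mentioned above and comparing the loss afterwards. In this sketch the budget of 50 steps is an arbitrary assumption, kept small so that the diverging run does not overflow; `run` and `mse` are helper names made up for the example.

```python
x = [1.2, 1.4, 1.6, 2.1, 2.3, 3.0, 3.1, 3.3, 3.3, 3.8]
y = [39344, 46206, 37732, 43526, 39892, 56643, 60151, 54446, 64446, 57190]
n = len(x)


def mse(b0, b1):
    """Mean squared error of the line y_hat = b0 + b1 * x."""
    errors = [(b0 + b1 * xi) - yi for xi, yi in zip(x, y)]
    return sum(e * e for e in errors) / n


def run(alpha, steps=50, b0=2.0, b1=5.0):
    """Run batch gradient descent and return the loss after the last step."""
    for _ in range(steps):
        errors = [(b0 + b1 * xi) - yi for xi, yi in zip(x, y)]
        g0 = 2.0 / n * sum(errors)
        g1 = 2.0 / n * sum(xi * e for xi, e in zip(x, errors))
        b0 -= alpha * g0
        b1 -= alpha * g1
    return mse(b0, b1)


initial_loss = mse(2.0, 5.0)
losses = {alpha: run(alpha) for alpha in (0.000001, 0.01, 10)}

print(f"initial loss: {initial_loss:.3e}")
for alpha, loss in losses.items():
    print(f"alpha = {alpha}: loss after 50 steps = {loss:.3e}")
```

On this dataset, the tiny learning rate barely moves the loss, 0.01 reduces it substantially, and 10 makes the loss grow far beyond where it started, which is the overshooting behaviour described above.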

Stochastic Gradient Descent
Now we have an idea of what gradient descent actually is.
In this method, we used the entire dataset to calculate the gradients before updating the parameters.
This process can become slow for very large datasets, and this approach is called batch gradient descent because it uses the entire dataset for every update step.
Now imagine a dataset containing millions of data points.
For every single update step, Gradient Descent would need to:
process the entire dataset
calculate the loss
calculate the gradients
and then finally update the parameters.
This repeated computation becomes computationally expensive and time-consuming.
This is where Stochastic Gradient Descent (SGD) comes into the picture.
Instead of calculating gradients using the entire dataset, SGD randomly selects just one observation at a time and immediately updates the parameters.
The update equation still remains the same:
\[ \beta:=\beta-\alpha\frac{\partial MSE}{\partial \beta} \]
The only difference is that the gradient is now calculated using a single observation instead of the entire dataset.
We can understand this by using one data point from our dataset.
The parameter values are:
\[ \beta^{(0)} = \begin{bmatrix} 2 \\ 5 \end{bmatrix} \]
and the learning rate is:
\[ \alpha = 0.01 \]
Now let’s say SGD randomly selected the following training example from our dataset:
\[ (x,y)=(3.0,56643) \]
For this single observation:
\[ X = \begin{bmatrix} 1 & 3.0 \end{bmatrix} \]
and
\[ y = \begin{bmatrix} 56643 \end{bmatrix} \]
Now calculate:
\[ X^T = \begin{bmatrix} 1 \\ 3.0 \end{bmatrix} \]
Next calculate \( X^TX \):
\[ X^TX = \begin{bmatrix} 1 \\ 3.0 \end{bmatrix} \begin{bmatrix} 1 & 3.0 \end{bmatrix} = \begin{bmatrix} 1 & 3.0 \\ 3.0 & 9.0 \end{bmatrix} \]
Now calculate \( X^Ty \):
\[ X^Ty = \begin{bmatrix} 1 \\ 3.0 \end{bmatrix} \begin{bmatrix} 56643 \end{bmatrix} = \begin{bmatrix} 56643 \\ 169929 \end{bmatrix} \]
Since SGD is using only one observation:
\[ n=1 \]
Now substitute everything into the gradient equation:
\[ \frac{\partial MSE}{\partial \beta} = \frac{-2}{n}X^Ty + \frac{2}{n}X^TX\beta \]
Substituting:
\[ = \frac{-2}{1} \begin{bmatrix} 56643 \\ 169929 \end{bmatrix} + \frac{2}{1} \begin{bmatrix} 1 & 3.0 \\ 3.0 & 9.0 \end{bmatrix} \begin{bmatrix} 2 \\ 5 \end{bmatrix} \]
First calculate the matrix multiplication:
\[ \begin{bmatrix} 1 & 3.0 \\ 3.0 & 9.0 \end{bmatrix} \begin{bmatrix} 2 \\ 5 \end{bmatrix} = \begin{bmatrix} (1)(2)+(3.0)(5) \\ (3.0)(2)+(9.0)(5) \end{bmatrix} = \begin{bmatrix} 2+15 \\ 6+45 \end{bmatrix} = \begin{bmatrix} 17 \\ 51 \end{bmatrix} \]
Now multiply by \( \frac{2}{1} \):
\[ \frac{2}{1} \begin{bmatrix} 17 \\ 51 \end{bmatrix} = \begin{bmatrix} 34 \\ 102 \end{bmatrix} \]
Now calculate:
\[ \frac{-2}{1} \begin{bmatrix} 56643 \\ 169929 \end{bmatrix} = \begin{bmatrix} -113286 \\ -339858 \end{bmatrix} \]
Now substitute everything back:
\[ \frac{\partial MSE}{\partial \beta} = \begin{bmatrix} -113286 \\ -339858 \end{bmatrix} + \begin{bmatrix} 34 \\ 102 \end{bmatrix} \]
Finally:
\[ \frac{\partial MSE}{\partial \beta} = \begin{bmatrix} -113252 \\ -339756 \end{bmatrix} \]
This gradient represents the slope of the bowl-shaped loss curve for this single training example.
Now update the parameters using:
\[ \beta:=\beta-\alpha\frac{\partial MSE}{\partial \beta} \]
Substituting the values:
\[ \beta = \begin{bmatrix} 2 \\ 5 \end{bmatrix} - 0.01 \begin{bmatrix} -113252 \\ -339756 \end{bmatrix} \]
First multiply by the learning rate:
\[ = \begin{bmatrix} 2 \\ 5 \end{bmatrix} - \begin{bmatrix} -1132.52 \\ -3397.56 \end{bmatrix} \]
Now subtract:
\[ = \begin{bmatrix} 2+1132.52 \\ 5+3397.56 \end{bmatrix} \]
Finally:
\[ \beta = \begin{bmatrix} 1134.52 \\ 3402.56 \end{bmatrix} \]
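The single-observation update above reduces to a handful of scalar operations, which is what makes each SGD step so cheap. Here is a minimal sketch of one step; the function name `sgd_step` is hypothetical, and the chosen observation and learning rate are the ones used in the worked example.

```python
def sgd_step(beta0, beta1, xi, yi, alpha):
    """One SGD update using the squared error of a single observation."""
    # Prediction error for this one row (X @ beta - y with n = 1).
    error = (beta0 + beta1 * xi) - yi
    # Gradient of the squared error: 2 * error * [1, xi].
    g0 = 2.0 * error
    g1 = 2.0 * error * xi
    return beta0 - alpha * g0, beta1 - alpha * g1


new_beta0, new_beta1 = sgd_step(2.0, 5.0, xi=3.0, yi=56643, alpha=0.01)
print(f"beta0 = {new_beta0:.2f}, beta1 = {new_beta1:.2f}")
```

No matrices are needed here: with a single row, \( X^TX\beta - X^Ty \) collapses to the prediction error multiplied by the row itself.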
After processing just one observation, the parameters are updated immediately.
Now SGD randomly selects another observation from the dataset and repeats the same process.
Unlike batch gradient descent, which processes the entire dataset before updating the parameters, SGD updates the parameters after every single training example.
Because each update is so cheap, SGD often makes progress faster than batch gradient descent on very large datasets.
We can see how simple the calculation becomes when using just one observation.
SGD keeps updating the parameters using different training examples until the loss becomes minimal or stops changing significantly.
However, because each gradient comes from a single example, the path toward the minimum point is noisy and zig-zags.
Despite this noise, the low cost per update makes SGD highly useful for modern machine learning and deep learning problems involving very large datasets.
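Repeating that single-observation update many times, with a randomly chosen row each time, gives the full SGD loop. The sketch below is one simple way to write it; the step count (5,000) and the fixed random seed are assumptions made so the example is reproducible, and a constant learning rate means the parameters keep jittering around the minimum rather than settling exactly on it.

```python
import random

x = [1.2, 1.4, 1.6, 2.1, 2.3, 3.0, 3.1, 3.3, 3.3, 3.8]
y = [39344, 46206, 37732, 43526, 39892, 56643, 60151, 54446, 64446, 57190]


def mse(b0, b1):
    """Mean squared error over the whole dataset (used only for reporting)."""
    return sum(((b0 + b1 * xi) - yi) ** 2 for xi, yi in zip(x, y)) / len(x)


def sgd(alpha=0.01, steps=5000, beta0=2.0, beta1=5.0, seed=0):
    """Stochastic gradient descent: one randomly chosen row per update."""
    rng = random.Random(seed)
    for _ in range(steps):
        i = rng.randrange(len(x))
        error = (beta0 + beta1 * x[i]) - y[i]
        beta0 -= alpha * 2.0 * error
        beta1 -= alpha * 2.0 * error * x[i]
    return beta0, beta1


initial_loss = mse(2.0, 5.0)
b0, b1 = sgd()
final_loss = mse(b0, b1)
print(f"beta0 = {b0:.1f}, beta1 = {b1:.1f}")
print(f"loss went from {initial_loss:.3e} to {final_loss:.3e}")
```

The final parameters land near the least-squares solution but not exactly on it, and changing the seed shifts them slightly, which is the noisy, zig-zag behaviour in practice.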
Conclusion
Now we have an idea of both gradient descent and stochastic gradient descent.
First, we derived the normal equation, and then we found that the inverse matrix calculation becomes computationally expensive and memory usage becomes high for large datasets.
To solve this problem, we used gradient descent, which is not limited to linear regression but is also used in many machine learning and deep learning algorithms.
Next, we found that even the first form of gradient descent that we used, called batch gradient descent, can become slow for very large datasets because it uses the entire dataset before updating the parameters.
This led us to stochastic gradient descent (SGD), which updates the parameters using one training example at a time and works faster than batch gradient descent for large datasets.
There is also another variation called mini-batch gradient descent, in which we use a small batch of training examples from the dataset, such as 32 or 64 rows, before updating the parameters.
In this way, it becomes faster than batch gradient descent and more stable than stochastic gradient descent.
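As a rough sketch of the mini-batch idea on the same ten-row dataset: since 32 or 64 rows are not available here, the batch size of 4 is an assumption made purely for illustration, as are the epoch count and the fixed random seed.

```python
import random

x = [1.2, 1.4, 1.6, 2.1, 2.3, 3.0, 3.1, 3.3, 3.3, 3.8]
y = [39344, 46206, 37732, 43526, 39892, 56643, 60151, 54446, 64446, 57190]


def mse(b0, b1):
    """Mean squared error over the whole dataset (used only for reporting)."""
    return sum(((b0 + b1 * xi) - yi) ** 2 for xi, yi in zip(x, y)) / len(x)


def minibatch_gd(alpha=0.01, epochs=2000, batch_size=4,
                 beta0=2.0, beta1=5.0, seed=0):
    """Mini-batch gradient descent: average the gradient over a few rows."""
    rng = random.Random(seed)
    indices = list(range(len(x)))
    for _ in range(epochs):
        rng.shuffle(indices)  # new random batches every pass over the data
        for start in range(0, len(indices), batch_size):
            batch = indices[start:start + batch_size]
            errors = [(beta0 + beta1 * x[i]) - y[i] for i in batch]
            g0 = 2.0 / len(batch) * sum(errors)
            g1 = 2.0 / len(batch) * sum(e * x[i] for e, i in zip(errors, batch))
            beta0 -= alpha * g0
            beta1 -= alpha * g1
    return beta0, beta1


initial_loss = mse(2.0, 5.0)
b0, b1 = minibatch_gd()
final_loss = mse(b0, b1)
print(f"beta0 = {b0:.1f}, beta1 = {b1:.1f}, loss = {final_loss:.3e}")
```

Averaging over a few rows smooths out some of the noise of single-row updates while still avoiding a full pass over the data for every step.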
Although linear regression has a closed-form solution, we often need to use gradient descent when working with large datasets containing millions of observations because the normal equation becomes computationally expensive and impractical.
In deep learning, however, closed-form solutions usually don’t exist, which makes optimization algorithms like gradient descent even more important.
Dataset License
The dataset used in this blog is the Salary dataset.
It is publicly available on Kaggle and is licensed under the Creative Commons Zero (CC0 Public Domain) license. This means it can be freely used, modified, and shared for both non-commercial and commercial purposes without restriction.
I hope you now have a better understanding of what gradient descent and stochastic gradient descent actually are.
If you’d like to read more of my writing, you can also find it on Medium and LinkedIn.
I recently wrote a detailed breakdown of Lasso Regression from a geometric and intuitive perspective.
You can read it here.
Thanks for reading!

