
Lasso Regression: Why the Solution Lives on a Diamond

By Editor Times Featured · April 23, 2026


In my previous blog on linear regression, we solved a linear regression problem using the concepts of vectors and projections instead of calculus.

Now, in this blog, we once again use those same concepts of vectors and projections to understand Lasso Regression.

While I was learning this topic, I was stuck at explanations like "we add a penalty term" and "Lasso shrinks the coefficients to zero."

I was unable to grasp what was actually happening behind this method.

I'm sure many of you have felt the same, and I think it's common for beginners and, for that matter, anyone solving real-world problems using linear regression.

But today, we are once again taking a new approach to this classic topic so that we can clearly see what is really happening behind the scenes.


When a Perfect Model Starts to Fail

Before proceeding further, let's get a basic idea of why we actually use Lasso regression.

For example, imagine we have some data, we apply linear regression to it, and we get zero error.

We might think we have a perfect model, but when we test that model on new data, we get predicted values that are unreliable or inconsistent with reality.

In this case, we can say that our model has low bias and high variance.

Generally, we use Lasso when there are many features, especially when their number is comparable to or greater than the number of observations, which can lead to overfitting.

This means the model, instead of learning patterns from the data, simply memorizes it.

Lasso helps by selecting only the important features, shrinking the coefficients of the rest to zero.

Now, to make the model more reliable, we use Lasso regression, and you will understand it in detail once we solve an actual problem.


Let's say we have this house data. We have to build a model that predicts the price of a house using its size and age.

Image by Author

Let's Build the Model First

First, let's use Python to build this linear regression model.

Code:

import numpy as np
from sklearn.linear_model import LinearRegression

# Data
# Features: Size (1000 sqft), Age (years)
X = np.array([
    [1, 1],
    [2, 3],
    [3, 2]
])

# Target: Price ($100k)
y = np.array([4, 8, 9])

# Create model
model = LinearRegression()

# Fit model
model.fit(X, y)

# Coefficients
print("Intercept:", model.intercept_)
print("Coefficients [Size, Age]:", model.coef_)

Result:

Image by Author

We got the result: β₀ = 1, β₁ = 2, β₂ = 1


Understanding Regression as Movement in Space

Now, let's solve this using vectors and projections.

We already know how to solve this linear regression problem using vectors, and now we will use that knowledge to understand the geometry behind it.

We also already know how to do the math to find the solution, which we discussed in part 2 of my linear regression blog.

So we will not do the math here, as we already have the solution, which we found using Python.

Let's understand the actual geometry behind this data.


If you remember, we used this same data when we discussed linear regression using vectors.

Image by Author

Let's consider this data as our old data.

Now, to explain Lasso regression, we will use this data.

Image by Author

We just added a new feature, "Age", to our data.

Now, let's look at this GIF for our old data.

GIF by Author

From Lines to Planes

Let's recall what we did there. We considered each house as an axis, plotted the points, and treated them as vectors.

We got the price vector and the size vector, and we realized the need for an intercept, so we added the intercept vector.

Now we had two directions in which we could move to reach the tip of the price vector. Based on these two directions, there are many possible points we can reach, and those points form a plane.

Our target point, the tip of the price vector, is not on this plane, so we need to find the point on the plane that is closest to it.

We calculate that closest point using the concept of projection: the shortest distance occurs when the error is perpendicular to the plane.

To find that point, we use orthogonal projection, where the dot product between two orthogonal vectors is zero.

Projection is the key here, and that is how we find the closest point on the plane when we later do the math.


Now, let's look at the GIF below for our new data.

GIF by Author

What Changes When We Add One More Feature

We have the same goal here as well.

We want to reach the tip of the price vector, but now we have a new direction to move in, the direction of the age vector, which means we can now move in three different directions to reach our destination.

In our old data, we had two directions, and by combining both directions to reach the tip of the price vector, we got many points which together formed a 2D plane in that 3D space.

But now we have three directions to move in that 3D space. What does that mean?

It means that if these directions are independent, we can reach every point in that 3D space, including the tip of the price vector itself.

In this particular case, since the feature vectors span the space containing the target, we can reach it exactly, with no need for projection.

We already have β₀ = 1, β₁ = 2, β₂ = 1.

\[
\text{Now, let's represent our new data in matrix form.}
\]

\[
X =
\begin{bmatrix}
1 & 1 & 1 \\
1 & 2 & 3 \\
1 & 3 & 2
\end{bmatrix}
\quad
y =
\begin{bmatrix}
4 \\ 8 \\ 9
\end{bmatrix}
\quad
\beta =
\begin{bmatrix}
b_0 \\ b_1 \\ b_2
\end{bmatrix}
\]
\[
\text{Here, the columns of } X \text{ represent the base, size, and age directions.}
\]
\[
\text{And we are trying to combine them to reach } y.
\]
\[
\hat{y} = X\beta
\]
\[
=
b_0
\begin{bmatrix} 1 \\ 1 \\ 1 \end{bmatrix}
+
b_1
\begin{bmatrix} 1 \\ 2 \\ 3 \end{bmatrix}
+
b_2
\begin{bmatrix} 1 \\ 3 \\ 2 \end{bmatrix}
\]
\[
\text{Let's check if we can reach } y \text{ directly.}
\]
\[
\text{Using the values } b_0 = 1,\; b_1 = 2,\; b_2 = 1
\]
\[
\hat{y} =
1
\begin{bmatrix} 1 \\ 1 \\ 1 \end{bmatrix}
+
2
\begin{bmatrix} 1 \\ 2 \\ 3 \end{bmatrix}
+
1
\begin{bmatrix} 1 \\ 3 \\ 2 \end{bmatrix}
\]
\[
=
\begin{bmatrix} 1 \\ 1 \\ 1 \end{bmatrix}
+
\begin{bmatrix} 2 \\ 4 \\ 6 \end{bmatrix}
+
\begin{bmatrix} 1 \\ 3 \\ 2 \end{bmatrix}
=
\begin{bmatrix} 4 \\ 8 \\ 9 \end{bmatrix}
= y
\]
\[
\text{This shows that we can reach the target vector exactly using these directions.}
\]
\[
\text{So there is no need to find a closest point or perform a projection.}
\]
\[
\text{We have directly reached the destination.}
\]
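As a quick numerical check (a sketch in NumPy, using the same matrices defined above), we can confirm that these coefficients reproduce the target exactly:

```python
import numpy as np

# Design matrix: intercept column, size, age (same data as above)
X = np.array([[1, 1, 1],
              [1, 2, 3],
              [1, 3, 2]])
y = np.array([4, 8, 9])          # prices in $100k

beta = np.array([1, 2, 1])       # beta_0, beta_1, beta_2 from the fit

# With three independent directions in 3D space, X @ beta lands on y exactly
y_hat = X @ beta
print(y_hat)                     # [4 8 9]
print(np.allclose(y_hat, y))     # True
```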


From this, we can say that if we go 1 unit in the direction of the intercept vector, 2 units in the direction of the size vector, and 1 unit in the direction of the age vector, we reach the tip of the price vector exactly.

Okay, we have now built our linear regression model, and it appears to be a perfect model. But we know that a perfect model doesn't exist, so let's test it.

A Perfect Fit… That Fails Completely

Now let's consider a new house, House D.

Image by Author

Now, let's use our model to predict the price of House D.

\[
X_D =
\begin{bmatrix}
1 & 1.5 & 20
\end{bmatrix}
\quad
\beta =
\begin{bmatrix}
1 \\ 2 \\ 1
\end{bmatrix}
\]

\[
\text{We use our model to predict the price of this house.}
\]
\[
\hat{y}_D = X_D \beta
\]
\[
= 1 \cdot 1 + 2 \cdot 1.5 + 1 \cdot 20
\]
\[
= 1 + 3 + 20
\]
\[
= 24
\]
\[
\text{So the predicted price is 24 (in \$100k units).}
\]
\[
\text{But the actual price is 5.5, which shows a large difference.}
\]
\[
\text{This suggests that the model may not generalize well.}
\]
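The same prediction takes a couple of lines in NumPy (a sketch; House D's features are the ones given above):

```python
import numpy as np

beta = np.array([1, 2, 1])       # [intercept, size, age] from the OLS fit
x_d = np.array([1, 1.5, 20])     # House D: base term, size 1.5 (1000 sqft), age 20

pred = x_d @ beta                # 1*1 + 2*1.5 + 1*20
print(pred)                      # 24.0
print(abs(pred - 5.5))           # 18.5, the gap to the actual price
```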

We can observe the difference between the actual price and the predicted price.

From this, we can say that the model has high variance. It used all the available directions to fit the training data.

Instead of finding patterns in the data, the model memorized it; we call this overfitting.

This usually happens when we have a large number of features compared to the number of observations, or when the model has too much flexibility (more directions = more flexibility).

In practice, we decide whether a model is overfitting based on its performance on a set of new data points, not just one.

Here, we are considering a single point only to build intuition and understand how Lasso Regression works.


So What's the Problem?

How can we make this model perform well on unseen data?

One way to handle this is using Lasso.

But what actually happens when we apply Lasso?

For our new data we got β₀ = 1, β₁ = 2, β₂ = 1, which, as we discussed, means 1 unit in the direction of the intercept vector, 2 units in the direction of the size vector, and 1 unit in the direction of the age vector.

Image by Author

Breaking Down the Price Vector

Now let's consider our target price vector (4, 8, 9). We need to reach the tip of that fixed price vector, and for that we have three directions.

In part 2 of my linear regression blog, we discussed the need for a base vector, which adds a base value: even when size or age is zero, we still have a base price.

Now, for our price vector (4, 8, 9), which represents the prices of houses A, B, and C, the average value is 7.

We can write our price vector as (7, 7, 7) + (−3, 1, 2), which equals (4, 8, 9).

We can rewrite this as 7(1, 1, 1) + (−3, 1, 2).

What do we observe from this?

To reach the tip of our price vector, we need to move 7 units in the direction of the intercept vector and then adjust using the vector (−3, 1, 2).

Here, (−3, 1, 2) represents the deviation of the prices from the average. We don't get any slope values here because we are not expressing the price vector in terms of feature directions; we are merely separating it into average and variation.

So, if we only consider this representation, we would need to move 7 units in the direction of the intercept vector.

But when we applied the linear regression model to our data, we got a different intercept value, β₀ = 1.

Why is this happening?

We get an intercept value of 7 only when we have no other directions, meaning the size and age vectors are not present.

But when we include these feature directions, they also contribute to reaching the price vector.


Where Did the Intercept Go?

We got β₀ = 1, β₁ = 2, β₂ = 1. This means we move only 1 unit in the direction of the intercept vector. Then how do we still reach the price vector?

Let's see.

We also have two more directions: the size vector (1, 2, 3) and the age vector (1, 3, 2).

First, consider the size vector (1, 2, 3).
We can write it as (2, 2, 2) + (−1, 0, 1), which equals 2(1, 1, 1) + (−1, 0, 1).

This shows that when we move along the size vector, we are also partially moving in the direction of the intercept vector.

If we move 2 units in the direction of the size vector, we get (2, 4, 6), which can be written as 4(1, 1, 1) + (−2, 0, 2).

So the size vector has a component along the intercept direction.

Now consider the age vector (1, 3, 2).
We can write it as (2, 2, 2) + (−1, 1, 0), which equals 2(1, 1, 1) + (−1, 1, 0).

The age vector also has a component along the intercept direction.

Now, if we look carefully: to reach the price vector, we effectively move a total of 7 units in the direction of the intercept vector, but this movement is distributed across the intercept, size, and age directions.
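We can verify this accounting in NumPy. The component of any movement along the intercept direction (1, 1, 1) is just its mean, and the three means add up to 7 (a sketch using the vectors above):

```python
import numpy as np

one = np.array([1, 1, 1])
size = np.array([1, 2, 3])
age = np.array([1, 3, 2])

# Movement along each direction, using beta0 = 1, beta1 = 2, beta2 = 1
moves = {"intercept": 1 * one, "size": 2 * size, "age": 1 * age}

# The component of a vector v along (1, 1, 1) is v . 1 / (1 . 1), i.e. its mean
along_intercept = {name: v.mean() for name, v in moves.items()}
print(along_intercept)                # {'intercept': 1.0, 'size': 4.0, 'age': 2.0}
print(sum(along_intercept.values()))  # 7.0
```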


Introducing the Constraint (This Is Lasso)

Now we apply Lasso to generalize the model.

Earlier, we saw that we could reach the target by moving freely in any direction; the model could use any amount of movement along each direction.

But now, we introduce a limit.

This means the coefficients can no longer take arbitrary values; they are restricted to stay within a certain total budget.

For example, we have β₀ = 1, β₁ = 2, β₂ = 1, and if we add their absolute values, we get |β₀| + |β₁| + |β₂| = 4.

This 4 represents the total contribution used across all directions.

Now don't get confused. Earlier, we said we moved 7 units in the intercept direction, and now we are saying 4 units in total.

These are completely different.

Earlier, we expressed the price vector in terms of its average and deviations, where the intercept took care of the entire average.

But now, we are expressing the same vector using feature directions like size and age.

Because of that, part of the movement is already handled by these feature directions, so the intercept no longer has to take full responsibility.

We are restricting how much the model can move in total, but why do we do this?

In the real world, we often have many features, and the Ordinary Least Squares method tries to assign a coefficient to every feature, even those that are not useful.

This makes the model complex, unstable, and prone to overfitting.

Lasso addresses this by adding a constraint. When we limit the total contribution, coefficients start shrinking, and some shrink all the way to zero.

When a coefficient becomes zero, that feature is effectively removed from the model.

That is how Lasso performs feature selection: not by choosing features directly, but by forcing the model to stay within a limited budget.

Our goal is not just to fit the data perfectly, but to capture the true pattern using only the most important directions.
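To watch this shrinkage and selection happen, here is a small sketch using scikit-learn's `Lasso` on our house data. Note that scikit-learn uses the penalty form of Lasso (a penalty weight `alpha`) rather than the explicit budget we discuss here; the two are equivalent, but the mapping between `alpha` and the budget depends on the data, so the alpha values below are just illustrative choices for this tiny dataset:

```python
import numpy as np
from sklearn.linear_model import Lasso

X = np.array([[1, 1],
              [2, 3],
              [3, 2]])   # features: size, age
y = np.array([4, 8, 9])  # price

# As the penalty grows, coefficients shrink, and some hit exactly zero
for alpha in [0.5, 1.4, 2.0]:
    model = Lasso(alpha=alpha).fit(X, y)
    print(alpha, model.coef_)
```

For this data, `alpha=1.4` drives the age coefficient exactly to zero (the feature is dropped), and `alpha=2.0` removes both features, leaving only the intercept.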


Are We Using This Limit Wisely?

Now let's say we set the limit to 2.

Before that, we need to understand one important thing. When we apply Lasso, we are shrinking the coefficients.

Here, the coefficients are β₀ = 1, β₁ = 2, β₂ = 1.

β₀ represents the intercept. But think about this for a second. Why should we shrink the intercept? What is the need?

The intercept represents the average level of the target. It does not tell us how the price changes with features like size and age.

What we actually care about is how much the price depends on these features, which is captured by β₁ and β₂. These should reflect the pure effect of each feature.

If the data is not adjusted, the intercept mixes with the feature contributions, and we don't get a clean picture of how each feature influences the target.

Also, since we are putting a limit on the total coefficients, we only have limited movement. So why waste it by moving in the intercept direction? We should spend this limited budget moving along the actual deviation directions, size and age, with respect to the price.


The Fix: Centering the Data

So what do we do?

We separate the baseline from the variations. This is done using a process called centering, where we subtract the mean from each vector.

For the price vector (4, 8, 9), the mean is 7, so the centered vector becomes (4, 8, 9) − (7, 7, 7) = (−3, 1, 2).

For the size vector (1, 2, 3), the mean is 2, so the centered vector becomes (1, 2, 3) − (2, 2, 2) = (−1, 0, 1).

For the age vector (1, 3, 2), the mean is 2, so the centered vector becomes (1, 3, 2) − (2, 2, 2) = (−1, 1, 0).

Now we have three centered vectors: price (−3, 1, 2), size (−1, 0, 1), and age (−1, 1, 0).

At this stage, the intercept is removed from the problem because everything is expressed relative to the mean.

We now build the model using these centered vectors, focusing only on how the features explain deviations from the average.

Once the model is built, we bring back the intercept by adding the mean of the target to the predictions.


GIF by Author

Now let's solve this once again, without Lasso.

This time, without the intercept vector.

We know that here we have two directions with which to reach the target of price deviations.

Here, we are modeling the deviations in the data.

We already know that a 2D plane is formed in that 3D space by the different combinations of β₁ and β₂.

This time, let's do the math first.

\[
\text{Now we solve OLS again, but using centered vectors.}
\]

\[
y =
\begin{bmatrix} -3 \\ 1 \\ 2 \end{bmatrix}
\quad
x_1 =
\begin{bmatrix} -1 \\ 0 \\ 1 \end{bmatrix}
\quad
x_2 =
\begin{bmatrix} -1 \\ 1 \\ 0 \end{bmatrix}
\]
\[
X =
\begin{bmatrix}
-1 & -1 \\
0 & 1 \\
1 & 0
\end{bmatrix}
\]
\[
\text{We use the normal equation again.}
\]
\[
\beta = (X^T X)^{-1} X^T y
\]
\[
X^T =
\begin{bmatrix}
-1 & 0 & 1 \\
-1 & 1 & 0
\end{bmatrix}
\]
\[
X^T X =
\begin{bmatrix}
2 & 1 \\
1 & 2
\end{bmatrix}
\]
\[
X^T y =
\begin{bmatrix} 5 \\ 4 \end{bmatrix}
\]
\[
\text{Now compute the inverse.}
\]
\[
(X^T X)^{-1}
=
\frac{1}{2 \cdot 2 - 1 \cdot 1}
\begin{bmatrix}
2 & -1 \\
-1 & 2
\end{bmatrix}
=
\frac{1}{3}
\begin{bmatrix}
2 & -1 \\
-1 & 2
\end{bmatrix}
\]
\[
\text{Now multiply with } X^T y.
\]
\[
\beta =
\frac{1}{3}
\begin{bmatrix}
2 & -1 \\
-1 & 2
\end{bmatrix}
\begin{bmatrix} 5 \\ 4 \end{bmatrix}
=
\frac{1}{3}
\begin{bmatrix} 10 - 4 \\ -5 + 8 \end{bmatrix}
=
\frac{1}{3}
\begin{bmatrix} 6 \\ 3 \end{bmatrix}
=
\begin{bmatrix} 2 \\ 1 \end{bmatrix}
\]
\[
\text{So the centered solution is: } \beta_1 = 2,\; \beta_2 = 1
\]
\[
\hat{y} = 2x_1 + 1x_2
\]

We get the same values because centering only removes the average, not the relationship between the features and the target.

GIF by Author
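The same normal-equation computation can be checked in NumPy (a sketch; the centered vectors are the ones just derived):

```python
import numpy as np

# Centered features (size, age) and centered target
Xc = np.array([[-1, -1],
               [ 0,  1],
               [ 1,  0]])
yc = np.array([-3, 1, 2])

# Normal equation on centered data; no intercept column is needed
beta = np.linalg.solve(Xc.T @ Xc, Xc.T @ yc)
print(beta)   # [2. 1.]
```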

\[
\text{Now we bring back the intercept to get actual predictions.}
\]

\[
\text{We know that centering was done by subtracting the mean.}
\]
\[
y_{\text{centered}} = y - \bar{y}
\]
\[
\text{So the original vector can be written as:}
\]
\[
y = y_{\text{centered}} + \bar{y}
\]
\[
\text{Similarly, our prediction also follows the same idea.}
\]
\[
\hat{y} = \hat{y}_{\text{centered}} + \bar{y}
\]
\[
\text{From earlier, we have:}
\]
\[
\hat{y}_{\text{centered}} = 2x_1 + 1x_2
\]
\[
\text{Note: these centered vectors are obtained by subtracting the mean from each feature.}
\]
\[
x_1 - \bar{x}_1 = x_1 - 2, \quad x_2 - \bar{x}_2 = x_2 - 2
\]
\[
\text{So instead of using } x_1 \text{ and } x_2, \text{ we are using } (x_1 - 2) \text{ and } (x_2 - 2).
\]
\[
\text{Now substitute the centered vectors.}
\]
\[
\hat{y}_{\text{centered}} =
2
\begin{bmatrix} -1 \\ 0 \\ 1 \end{bmatrix}
+
1
\begin{bmatrix} -1 \\ 1 \\ 0 \end{bmatrix}
\]
\[
=
\begin{bmatrix} -2 \\ 0 \\ 2 \end{bmatrix}
+
\begin{bmatrix} -1 \\ 1 \\ 0 \end{bmatrix}
=
\begin{bmatrix} -3 \\ 1 \\ 2 \end{bmatrix}
\]
\[
\text{Now add back the mean of } y.
\]
\[
\bar{y} = 7
\quad
\Rightarrow
\quad
\bar{y}\mathbf{1} =
\begin{bmatrix} 7 \\ 7 \\ 7 \end{bmatrix}
\]
\[
\hat{y} =
\begin{bmatrix} -3 \\ 1 \\ 2 \end{bmatrix}
+
\begin{bmatrix} 7 \\ 7 \\ 7 \end{bmatrix}
=
\begin{bmatrix} 4 \\ 8 \\ 9 \end{bmatrix}
\]
\[
\text{So we recover the actual prediction by adding back the intercept.}
\]


We got β₁ = 2 and β₂ = 1.

In total, we used 3 units to reach our target.

Now we apply Lasso.

Let's say we put a limit of 2 units. This means that across both directions combined, we only have 2 units of movement available.

We can distribute this in different ways. For example, we can use 1 unit in the size direction and 1 unit in the age direction, or we can use all 2 units in either the size direction or the age direction.

Let's see all the possible values of β₁ and β₂ using a plot.

Image by Author

We can observe that when we plot all possible combinations of β₁ and β₂ under this constraint, they form a diamond shape, and our solution lies on this diamond.

Now let's go back to the centered vector space and see where we can reach on the plane under this constraint.

GIF by Author

From the above visual, we can get a clear idea.

We already know that a 2D plane is formed in 3D space, and our target lies on that plane.

After applying Lasso, movement on this plane is restricted. We can see this restricted region in the visual, and our solution now lies inside it.

So how do we reach that solution?

Let's think. The movements are restricted. The target lies on the plane, but we can't reach it directly because we've put a limit on the movement.

So what is the best we can do?

We can get as close as possible to the target, right?

Yes, and that is our solution. Now the question is: how do we know which point in the restricted region is closest to our target on that plane?

Let's see.


Solving Lasso Along a Constraint Boundary

Let's once again look at our diamond plot, which lives in coefficient space.

We obtain this diamond by considering all combinations of coefficients that satisfy the condition

\[
|\beta_1| + |\beta_2| \leq 2
\]

This gives us a restricted region on the plane within which we are allowed to move.

Points inside this region mean we are not using the full limit of 2, while points on the boundary mean we are using the full limit.

Now we are looking for the point in our restricted region closest to the OLS solution.

We can observe that this closest point lies on the boundary of the restricted region.

GIF by Author

The Lasso constraint gives us a diamond shape in coefficient space. This diamond has four edges, and each edge represents a situation where we are fully using the limit.

When we are on an edge, the coefficients are no longer free. They are tied together by the equation β₁ + β₂ = 2. This means we cannot move in any direction we want; we are forced to move along that edge.

When we translate this into data space, something interesting happens. Each edge turns into a line of possible predictions. So instead of thinking about a full region, we can think in terms of these lines.

If we look at where the OLS solution lies, we can see that it is closest to the boundary β₁ + β₂ = 2. So we now focus on this boundary.

Image by Author

Since this boundary is fixed, all the predictions we can make along it lie on a single line. So instead of searching everywhere, we just move along this line.

Now the problem becomes simple. We take our target and project it onto this line to find the closest point. That point gives us the Lasso solution.

Now that we understand what Lasso is doing, let's work through the math to find the solution.

\[
\textbf{Solving Lasso Using Projection onto a Boundary}
\]

\[
\text{Now that we understand the boundaries, let us find the solution using the nearest one.}
\]

\[
\text{From the constraint, we have:}
\quad
\beta_1 + \beta_2 = 2
\]

\[
\text{This means the two coefficients are no longer independent.}
\]

\[
\text{We can express one coefficient in terms of the other:}
\quad
\beta_2 = 2 - \beta_1
\]

\[
\text{Now substitute this into the model:}
\]

\[
\hat{y} = \beta_1 x_1 + (2 - \beta_1)x_2
\]

\[
\text{Rearranging terms:}
\]

\[
\hat{y} = 2x_2 + \beta_1(x_1 - x_2)
\]

\[
\text{This shows that all predictions lie on a line.}
\]

\[
\text{We can write this as:}
\quad
\hat{y} = \text{fixed point} + \beta_1 \cdot \text{direction}
\]

\[
\text{where}
\quad
\text{fixed point} = 2x_2,
\quad
d = x_1 - x_2
\]

\[
\text{Compute the direction vector:}
\]

\[
d =
\begin{bmatrix} -1 \\ 0 \\ 1 \end{bmatrix}
-
\begin{bmatrix} -1 \\ 1 \\ 0 \end{bmatrix}
=
\begin{bmatrix} 0 \\ -1 \\ 1 \end{bmatrix}
\]

\[
\text{Compute the starting point:}
\quad
2x_2 =
2
\begin{bmatrix} -1 \\ 1 \\ 0 \end{bmatrix}
=
\begin{bmatrix} -2 \\ 2 \\ 0 \end{bmatrix}
\]

\[
\text{So any point on this boundary is:}
\]

\[
\hat{y} =
\begin{bmatrix} -2 \\ 2 \\ 0 \end{bmatrix}
+
\beta_1
\begin{bmatrix} 0 \\ -1 \\ 1 \end{bmatrix}
\]

\[
\text{Now we find the point on this line closest to } y.
\]

\[
y =
\begin{bmatrix} -3 \\ 1 \\ 2 \end{bmatrix}
\]

\[
\text{We use the projection formula:}
\quad
\beta_1 =
\frac{(y - 2x_2) \cdot d}{d \cdot d}
\]

\[
\text{Compute the shifted vector:}
\]

\[
y - 2x_2 =
\begin{bmatrix} -1 \\ -1 \\ 2 \end{bmatrix}
\]

\[
\text{Compute } d \cdot d:
\quad
d \cdot d = 2
\]

\[
\text{Compute } (y - 2x_2) \cdot d:
\quad
(-1)(0) + (-1)(-1) + (2)(1) = 3
\]

\[
\text{So we get:}
\quad
\beta_1 = \frac{3}{2}
\]

\[
\text{Now compute } \beta_2:
\quad
\beta_2 = 2 - \frac{3}{2} = \frac{1}{2}
\]

\[
\text{Substitute back to get the closest point on the line:}
\]

\[
\hat{y} =
\begin{bmatrix} -2 \\ 2 \\ 0 \end{bmatrix}
+
\frac{3}{2}
\begin{bmatrix} 0 \\ -1 \\ 1 \end{bmatrix}
=
\begin{bmatrix} -2 \\ 0.5 \\ 1.5 \end{bmatrix}
\]

\[
\textbf{Closest point to } y \textbf{ on this boundary is:}
\quad
\hat{y} =
\begin{bmatrix} -2 \\ 0.5 \\ 1.5 \end{bmatrix}
\]

\[
\text{Residual:}
\quad
y - \hat{y} =
\begin{bmatrix} -1 \\ 0.5 \\ 0.5 \end{bmatrix}
\]

\[
\text{Squared error} = \|y - \hat{y}\|^2 = 1.5
\]

\[
\textbf{Final Lasso solution:}
\quad
\beta_1 = 1.5,
\quad
\beta_2 = 0.5
\]

\[
\text{This shows that the 2D problem reduces to finding the closest point on a line.}
\]
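The projection above is easy to verify numerically. This sketch redoes the computation in NumPy using the centered vectors:

```python
import numpy as np

# Centered vectors
y = np.array([-3, 1, 2])
x1 = np.array([-1, 0, 1])
x2 = np.array([-1, 1, 0])

# On the active edge beta1 + beta2 = 2, predictions trace the line
# y_hat = 2*x2 + beta1 * (x1 - x2); project y onto that line
d = x1 - x2                       # direction of the line
b1 = (y - 2 * x2) @ d / (d @ d)   # projection coefficient
b2 = 2 - b1

print(b1, b2)                     # 1.5 0.5
print(2 * x2 + b1 * d)            # the closest point, (-2, 0.5, 1.5)
```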

If you follow the above calculation, here is what we actually did.

We started with the full 2D plane, where predictions can lie anywhere in the region formed by the features.

Then we focused on the closest boundary of the Lasso constraint, β₁ + β₂ = 2, instead of the full region. This ties the coefficients together and removes their independence.

When we substitute this into the model, the plane collapses into a line of possible predictions.

This line represents all the predictions we can get along that boundary.

So the problem reduced to projecting the target onto this line.

Once we reduce the problem to a line, the solution is just a projection.

Image by Author

Previously, we got β₁ = 2 and β₂ = 1.

Now, after applying Lasso, we have β₁ = 1.5 and β₂ = 0.5.

We can observe that the coefficients have shrunk.

Now, let's predict the price of House D.

Image by Author

Until now, we worked with centered data. Now we convert the solution back to the original scale.

    \[
    \textbf{Centering the Data}
    \]

    \[
    \text{We first centered the features and target:}
    \]

    \[
    x_1' = x_1 - \bar{x}_1, \quad
    x_2' = x_2 - \bar{x}_2, \quad
    y' = y - \bar{y}
    \]

    \[
    \text{After centering, the model becomes:}
    \quad
    y' = \beta_1 x_1' + \beta_2 x_2'
    \]

    \[
    \text{Since the data is centered, the intercept becomes zero.}
    \]

    \[
    \textbf{Solving the Model}
    \]

    \[
    \text{From Lasso, we obtained:}
    \quad
    \beta_1 = 1.5, \quad \beta_2 = 0.5
    \]

    \[
    \textbf{Returning to the Original Scale}
    \]

    \[
    \text{We now express the model in terms of the original variables:}
    \]

    \[
    y - \bar{y} = \beta_1 (x_1 - \bar{x}_1) + \beta_2 (x_2 - \bar{x}_2)
    \]

    \[
    \text{Expanding:}
    \]

    \[
    y = \beta_1 x_1 + \beta_2 x_2 + \bar{y} - \beta_1 \bar{x}_1 - \beta_2 \bar{x}_2
    \]

    \[
    \text{Comparing with } \hat{y} = \beta_0 + \beta_1 x_1 + \beta_2 x_2:
    \]

    \[
    \beta_0 = \bar{y} - \beta_1 \bar{x}_1 - \beta_2 \bar{x}_2
    \]

    \[
    \textbf{Compute the Means}
    \]

    \[
    \bar{y} = \frac{4 + 8 + 9}{3} = 7
    \]

    \[
    \bar{x}_1 = \frac{1 + 2 + 3}{3} = 2, \quad
    \bar{x}_2 = \frac{1 + 3 + 2}{3} = 2
    \]

    \[
    \textbf{Compute the Intercept}
    \]

    \[
    \beta_0 = 7 - (1.5 \cdot 2) - (0.5 \cdot 2)
    \]

    \[
    \beta_0 = 7 - 3 - 1 = 3
    \]

    \[
    \textbf{Final Model}
    \]

    \[
    \hat{y} = 3 + 1.5 x_1 + 0.5 x_2
    \]

    \[
    \textbf{Prediction for House D}
    \]

    \[
    x_1 = 1.5, \quad x_2 = 20
    \]

    \[
    \hat{y} = 3 + 1.5(1.5) + 0.5(20)
    \]

    \[
    \hat{y} = 3 + 2.25 + 10 = 15.25
    \]
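    The un-centering arithmetic can likewise be checked numerically, a small sketch using the example's original values:

    ```python
    import numpy as np

    # Original (uncentered) data from the example
    x1 = np.array([1.0, 2.0, 3.0])
    x2 = np.array([1.0, 3.0, 2.0])
    y = np.array([4.0, 8.0, 9.0])

    beta1, beta2 = 1.5, 0.5  # Lasso solution found on the centered data

    # Recover the intercept from the means
    beta0 = y.mean() - beta1 * x1.mean() - beta2 * x2.mean()
    print(beta0)  # 3.0

    # Prediction for House D (x1 = 1.5, x2 = 20)
    y_D = beta0 + beta1 * 1.5 + beta2 * 20
    print(y_D)  # 15.25
    ```
    
    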

    Before applying Lasso, we predicted the price of House D as 24, which is far from the actual price of 5.5.

    After applying Lasso, the predicted price becomes 15.25.

    This happens because we don't allow the model to freely fit the target data, but instead force it to stay within a restricted region.

    As a result, the model becomes more stable and relies less on any single feature.

    This may increase the bias on the training data, but it reduces the variance on unseen data.

    But how do we choose the best limit to apply?

    We can find it using cross-validation, by trying different values.

    Ultimately, we need to balance the bias and variance of the model to make it suitable for future predictions.

    In some cases, depending on the data and the limit we choose, some coefficients may become exactly zero.

    This effectively removes those features from the model and helps it generalize better to new data.
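    The article doesn't show code for this, but here is a minimal sketch of the cross-validation idea using scikit-learn's LassoCV, which tries a grid of penalty strengths and keeps the one with the best held-out error. Note that scikit-learn parametrizes Lasso by a penalty rather than a hard limit; the two forms are equivalent, with each limit corresponding to some penalty. The synthetic data here is purely illustrative, not the house example:

    ```python
    import numpy as np
    from sklearn.linear_model import LassoCV

    rng = np.random.default_rng(0)
    X = rng.normal(size=(100, 5))  # five candidate features
    # Only the first two features actually matter
    y = 2.0 * X[:, 0] + 0.5 * X[:, 1] + rng.normal(scale=0.1, size=100)

    # LassoCV fits the model along a grid of penalties and picks the
    # penalty with the best cross-validated error
    model = LassoCV(cv=5).fit(X, y)
    print("chosen penalty:", model.alpha_)
    print("coefficients:", model.coef_)  # irrelevant features shrink toward zero
    ```
    
    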


    What Really Changed After Applying Lasso?

    Here we should note one important thing.

    Without Lasso, we predicted the price of House D as 24, whereas with Lasso we obtained 15.25.

    What happened here?

    The actual price of the house is 5.5, but our model overfits the training data and predicts a much higher value. It incorrectly learns that age increases the price of a house.

    Now consider a real-world situation. Suppose we see a house that was built 30 years ago and is priced low. Then we see another house of the same age, but recently renovated, and it is priced much higher.

    From this, we can understand that age alone is not a reliable feature. We cannot rely too heavily on it while predicting house prices.

    Instead, features like size may play a more consistent role.

    When we apply Lasso, it reduces the influence of both features, especially those that are less reliable. As a result, the prediction becomes 15.25, which is closer to the actual value, though still not perfect.

    If we increase the strength of the constraint further, for example by lowering the limit, the coefficient of age may become zero, effectively removing it from the model.

    You might think that Lasso shrinks all coefficients equally, but that's rarely the case. It depends entirely on the geometry of your data.
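    As a hypothetical illustration of this uneven shrinkage (not part of the article's worked example), we can fit the three-house data with scikit-learn's Lasso at increasing penalty strengths. scikit-learn uses a penalty alpha instead of an explicit limit, but the effect is the same: as alpha grows, the less reliable feature (age, x2) is driven to exactly zero before size (x1):

    ```python
    import numpy as np
    from sklearn.linear_model import Lasso

    X = np.array([[1.0, 1.0],   # house A: size, age
                  [2.0, 3.0],   # house B
                  [3.0, 2.0]])  # house C
    y = np.array([4.0, 8.0, 9.0])

    for alpha in [0.01, 1.2, 2.0]:
        coef = Lasso(alpha=alpha).fit(X, y).coef_
        print(alpha, coef)
    # At a tiny penalty the coefficients stay near (2, 1); at alpha = 1.2 the
    # age coefficient is exactly zero while size survives; at alpha = 2.0
    # both coefficients are zero.
    ```

    The age coefficient hits zero first because its correlation with the target is weaker, which is exactly the "hidden geometry" at work.
    
    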


    By the way, the full form of LASSO is Least Absolute Shrinkage and Selection Operator.

    I hope this gave you a clearer understanding of what Lasso Regression really is and the geometry behind it.

    I have also written a detailed blog on solving linear regression using vectors and projections.

    If you're interested, you can check it out here.

    Feel free to share your thoughts.

    Thanks for reading!


