
    5 Statistical Concepts You Need to Know Before Your Next Data Science Interview

    By Editor Times Featured | May 26, 2025 | 8 Mins Read


    I have been on my own data science job search journey and have been very fortunate to have gotten the opportunity to interview with many companies.

    These interviews have been a mix of technical and behavioral when meeting with real people, and I’ve also gotten my fair share of assessment tasks to complete on my own.

    Going through this process, I’ve done a lot of research about what kinds of questions are commonly asked during data science interviews. These are concepts you should not only be familiar with, but also know how to explain.

    1. P-value

    Image by author

    When you run a statistical test, you will typically have a null hypothesis H0 and an alternative hypothesis H1.

    Let’s say you’re running an experiment to determine the effectiveness of a weight-loss medication. Group A took a placebo and Group B took the medication. You then calculate the mean number of pounds lost over six months for each group and want to see whether the weight lost by Group B is statistically significantly higher than that lost by Group A. In this case, the null hypothesis H0 would be that there is no difference in the mean number of lbs lost between the groups, meaning that the medication had no real effect on weight loss. H1 would be that there is a difference and Group B lost more weight because of the medication.

    To recap:

    • H0: Mean lbs lost Group A = Mean lbs lost Group B
    • H1: Mean lbs lost Group A < Mean lbs lost Group B

    You’d then conduct a t-test to compare the means and get a p-value. This can be done in Python or other statistical software. The p-value is the probability, assuming the null hypothesis is true, of observing a difference at least as extreme as the one in your data. However, prior to getting a p-value, you’d first choose an alpha (α) value (aka significance level) that you will compare the p-value to.

    The typical alpha value chosen is 0.05, which means that the probability of a Type I error (saying that there is a difference in means when there isn’t) is 0.05 or 5%.

    If your p-value is < alpha, you can reject your null hypothesis. Otherwise, if p ≥ alpha, you fail to reject your null hypothesis.
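
As a rough illustration of the steps above, here is a minimal sketch using SciPy's `scipy.stats.ttest_ind`. The weight-loss numbers are made up for the example, and the choice of Welch's variant (`equal_var=False`) and of a one-sided alternative (which needs SciPy 1.6 or newer) are assumptions, not something the experiment description specifies.

```python
from scipy import stats

# Hypothetical pounds lost over six months (invented numbers, for illustration only)
group_a = [2.1, 0.5, 3.0, 1.2, 2.8, 0.9, 1.7, 2.4]  # placebo
group_b = [6.3, 4.8, 7.1, 5.5, 8.0, 4.2, 6.9, 5.7]  # medication

# Choose the significance level before looking at the p-value
alpha = 0.05

# One-sided test matching H1: mean(A) < mean(B).
# equal_var=False selects Welch's t-test (an assumption made for this sketch).
result = stats.ttest_ind(group_a, group_b, equal_var=False, alternative="less")
t_stat, p_value = result.statistic, result.pvalue

reject_h0 = p_value < alpha
print(f"t = {t_stat:.2f}, p = {p_value:.6f}")
print("Reject H0" if reject_h0 else "Fail to reject H0")
```

With groups this clearly separated the test rejects H0; with noisier or smaller samples the same code could just as easily fail to reject it.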

    2. Z-score (and other outlier detection methods)

    Z-score is a measure of how far a data point lies from the mean, expressed in standard deviations, and is one of the most common outlier detection methods.

    In order to understand the z-score you need to understand basic statistical concepts such as:

    • Mean: the average of a set of values
    • Standard deviation: a measure of spread between values in a dataset in relation to the mean (also the square root of variance). In other words, it shows how far apart values in the dataset are from the mean.

    A z-score of 2 for a given data point means that the value is 2 standard deviations above the mean. A z-score of -1.5 means that the value is 1.5 standard deviations below the mean.

    Typically, a data point with a z-score of >3 or <-3 is considered an outlier.
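
The rule above can be sketched in a few lines of standard-library Python. The readings are invented for the example, and using the population standard deviation (`pstdev`) rather than the sample version is an assumption; either is common in practice.

```python
from statistics import mean, pstdev

def z_scores(values):
    """Return the z-score (x - mean) / std for every value."""
    mu = mean(values)
    sigma = pstdev(values)  # population standard deviation (assumption)
    return [(v - mu) / sigma for v in values]

def find_outliers(values, threshold=3.0):
    """Return the values whose absolute z-score exceeds the threshold."""
    return [v for v, z in zip(values, z_scores(values)) if abs(z) > threshold]

# Hypothetical readings: nineteen values near 50 and one extreme value
readings = [48, 52, 50, 49, 51, 47, 53, 50, 49, 51,
            50, 48, 52, 50, 49, 51, 50, 50, 49, 120]

print(find_outliers(readings))  # -> [120]
```

Note that a single extreme value inflates both the mean and the standard deviation, which is one reason alternatives such as IQR or the modified z-score are sometimes preferred.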

    Outliers are a common problem in data science, so it’s important to know how to identify them and deal with them.

    To learn more about some other simple outlier detection methods, check out my article on z-score, IQR, and modified z-score.

    3. Linear Regression

    Image by author

    Linear regression is one of the most fundamental ML and statistical models, and understanding it is crucial to being successful in any data science role.

    On a high level, linear regression aims to model the relationship between one or more independent variables and a dependent variable, and attempts to use the independent variable(s) to predict the value of the dependent variable. It does so by fitting a “line of best fit” to the dataset: a line that minimizes the sum of squared differences between the actual values and the predicted values.

    An example of this is modeling the relationship between temperature and electrical energy consumption. When measuring the electrical consumption of a building, the temperature often influences usage: because electricity is commonly used for cooling, buildings use more energy to cool their spaces as the temperature goes up.

    So we can use a regression model to capture this relationship, where the independent variable is temperature and the dependent variable is consumption (since the usage depends on the temperature and not vice versa).

    Linear regression will output an equation in the format y=mx+b, where m is the slope of the line and b is the y-intercept. To make a prediction for y, you’d plug your x value into the equation.
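
To make the y=mx+b idea concrete, here is a small sketch that fits the line of best fit by ordinary least squares using only the standard library. The temperature and consumption figures are hypothetical, and `fit_line` is a helper name introduced for this example.

```python
def fit_line(xs, ys):
    """Least-squares slope m and intercept b for y = m*x + b."""
    n = len(xs)
    x_bar = sum(xs) / n
    y_bar = sum(ys) / n
    # Slope = covariance of x and y divided by variance of x
    sxy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    sxx = sum((x - x_bar) ** 2 for x in xs)
    m = sxy / sxx
    b = y_bar - m * x_bar  # the fitted line passes through (x_bar, y_bar)
    return m, b

# Hypothetical daily temperatures (°C) and building consumption (kWh)
temps = [18, 21, 24, 27, 30, 33, 36]
usage = [200, 215, 236, 250, 271, 284, 300]

m, b = fit_line(temps, usage)
predicted = m * 29 + b  # plug x = 29 into y = m*x + b
print(f"usage = {m:.2f} * temp + {b:.2f}")
print(f"predicted usage at 29 degrees: {predicted:.1f} kWh")
```

In practice you would more likely reach for a library routine, but the closed-form version shows exactly what "minimizing the sum of squared differences" produces for a single independent variable.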

    Regression makes four assumptions about the underlying data, which can be remembered by the acronym LINE:

    L: Linear relationship between the independent variable x and the dependent variable y.

    I: Independence of the residuals. Residuals don’t influence one another. (A residual is the difference between the value predicted by the line and the actual value.)

    N: Normal distribution of the residuals. The residuals follow a normal distribution.

    E: Equal variance of residuals across different x values.

    The most common performance metric when it comes to linear regression is R², which tells you the proportion of variance in the dependent variable that can be explained by the independent variable. An R² of 1 indicates a perfect linear fit, while an R² of 0 means the model explains none of the variance and has no predictive ability for this dataset. A good R² tends to be 0.75 or above, but this varies depending on the type of problem you’re solving.
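
One common way to compute R² is as 1 minus the ratio of the residual sum of squares to the total sum of squares. The sketch below follows that definition; the function name `r_squared` and the toy numbers are assumptions made for illustration.

```python
def r_squared(y_true, y_pred):
    """R² = 1 - SS_res / SS_tot."""
    y_bar = sum(y_true) / len(y_true)
    # Variation left unexplained by the model
    ss_res = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))
    # Total variation of the data around its mean
    ss_tot = sum((t - y_bar) ** 2 for t in y_true)
    return 1 - ss_res / ss_tot

actual = [1, 2, 3]

print(r_squared(actual, [1, 2, 3]))  # perfect predictions -> 1.0
print(r_squared(actual, [2, 2, 2]))  # always predicting the mean -> 0.0
```

The second call shows what an R² of 0 means in practice: the model does no better than guessing the average every time.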

    Linear regression is different from correlation. Correlation between two variables gives you a numeric value between -1 and 1 which tells you the strength and direction of the relationship between the two variables. Regression gives you an equation which can be used to predict future values based on the line of best fit for past values.
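
A quick way to see the difference is to change the units of one variable. In the sketch below (hypothetical data, helper names introduced for the example), the Pearson correlation stays the same when consumption is expressed in Wh instead of kWh, while the regression slope scales by 1000.

```python
from math import sqrt

def _sums(xs, ys):
    """Return (Sxy, Sxx, Syy), the centered sums of products and squares."""
    n = len(xs)
    x_bar = sum(xs) / n
    y_bar = sum(ys) / n
    sxy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    sxx = sum((x - x_bar) ** 2 for x in xs)
    syy = sum((y - y_bar) ** 2 for y in ys)
    return sxy, sxx, syy

def pearson_r(xs, ys):
    """Correlation: unitless strength and direction, between -1 and 1."""
    sxy, sxx, syy = _sums(xs, ys)
    return sxy / sqrt(sxx * syy)

def regression_slope(xs, ys):
    """Regression slope: change in y per unit of x, in the units of y."""
    sxy, sxx, _ = _sums(xs, ys)
    return sxy / sxx

# Hypothetical temperatures (°C) and consumption in kWh and in Wh
temps = [18, 21, 24, 27, 30, 33, 36]
usage_kwh = [200, 215, 236, 250, 271, 284, 300]
usage_wh = [1000 * u for u in usage_kwh]

r_kwh, r_wh = pearson_r(temps, usage_kwh), pearson_r(temps, usage_wh)
slope_kwh = regression_slope(temps, usage_kwh)
slope_wh = regression_slope(temps, usage_wh)

print(f"correlation: {r_kwh:.4f} (kWh) vs {r_wh:.4f} (Wh)")
print(f"slope: {slope_kwh:.2f} kWh per degree vs {slope_wh:.2f} Wh per degree")
```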

    4. Central limit theorem

    The Central Limit Theorem (CLT) is a fundamental concept in statistics that states that the distribution of the sample mean will approach a normal distribution as the sample size becomes larger, regardless of the original distribution of the data.

    A normal distribution, also known as the bell curve, is a symmetric distribution that is fully described by its mean and standard deviation. The special case in which the mean is 0 and the standard deviation is 1 is called the standard normal distribution.

    CLT is based on these assumptions:

    • Data are independent
    • Population of data has a finite level of variance
    • Sampling is random

    A sample size of ≥ 30 is typically seen as the minimum for the CLT to hold reasonably well, although this is a rule of thumb rather than a strict cutoff. As you increase the sample size, the distribution of the sample mean will look more and more like a bell curve.
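
A small simulation can make this tangible. The sketch below draws repeated samples from an exponential distribution (strongly right-skewed, with mean 1 and standard deviation 1) and looks at the distribution of the sample means. The choice of distribution, the number of repetitions, and the seed are all assumptions made for the example.

```python
import random
from statistics import pstdev

def sample_means(sample_size, n_samples=5000, seed=0):
    """Means of n_samples random samples drawn from Exponential(rate=1)."""
    rng = random.Random(seed)
    means = []
    for _ in range(n_samples):
        sample = [rng.expovariate(1.0) for _ in range(sample_size)]
        means.append(sum(sample) / sample_size)
    return means

def skewness(values):
    """Skewness: 0 for a symmetric distribution such as the normal."""
    mu = sum(values) / len(values)
    sigma = pstdev(values)
    return sum(((v - mu) / sigma) ** 3 for v in values) / len(values)

means_small = sample_means(sample_size=2)
means_large = sample_means(sample_size=30)

for label, means in [("n=2", means_small), ("n=30", means_large)]:
    avg = sum(means) / len(means)
    print(f"{label}: mean={avg:.3f}, std={pstdev(means):.3f}, "
          f"skewness={skewness(means):.3f}")
```

The sample means stay centered on the population mean, their spread shrinks toward 1/sqrt(n), and the skewness moves toward 0 as n grows, which is the bell-curve behavior the theorem describes.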

    CLT allows statisticians to make inferences about population parameters using the normal distribution, even when the underlying population is not normally distributed. It forms the basis for many statistical methods, including confidence intervals and hypothesis testing.

    5. Overfitting and underfitting

    Image by author

    When a model underfits, it has not been able to capture patterns in the training data properly. Because of this, not only does it perform poorly on the training dataset, it performs poorly on unseen data as well.

    How to know if a model is underfitting:

    • The model has a high error on the train, cross-validation and test sets

    When a model overfits, it has learned the training data too closely. Essentially it has memorized the training data and is great at predicting it, but it cannot generalize to unseen data when it comes time to predict new values.

    How to know if a model is overfitting:

    • The model has a low error on the entire train set, but a high error on the test and cross-validation sets

    Additionally:

    A model that underfits has high bias.

    A model that overfits has high variance.

    Finding a good balance between the two is called the bias-variance tradeoff.
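
The train-versus-test pattern described above can be reproduced with three deliberately simple models on synthetic data. Everything here is an assumption made for illustration: the data are generated from a noisy straight line, a constant "predict the mean" model stands in for underfitting, and a 1-nearest-neighbor model that memorizes the training points stands in for overfitting.

```python
import random

def mse(y_true, y_pred):
    """Mean squared error."""
    return sum((t - p) ** 2 for t, p in zip(y_true, y_pred)) / len(y_true)

def fit_mean(xs, ys):
    """Underfit: ignore x and always predict the training mean."""
    avg = sum(ys) / len(ys)
    return lambda x: avg

def fit_line(xs, ys):
    """Reasonable fit: least-squares line y = m*x + b."""
    n = len(xs)
    x_bar, y_bar = sum(xs) / n, sum(ys) / n
    m = (sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
         / sum((x - x_bar) ** 2 for x in xs))
    b = y_bar - m * x_bar
    return lambda x: m * x + b

def fit_nearest_neighbor(xs, ys):
    """Overfit: memorize training points, return the closest one's y."""
    pairs = list(zip(xs, ys))
    return lambda x: min(pairs, key=lambda p: abs(p[0] - x))[1]

# Synthetic data: y = 3x + 5 plus Gaussian noise (assumed for the demo)
rng = random.Random(42)
xs = [rng.uniform(0, 10) for _ in range(100)]
ys = [3 * x + 5 + rng.gauss(0, 2) for x in xs]
x_train, y_train = xs[:60], ys[:60]
x_test, y_test = xs[60:], ys[60:]

results = {}
for name, fit in [("mean", fit_mean), ("line", fit_line),
                  ("nearest neighbor", fit_nearest_neighbor)]:
    model = fit(x_train, y_train)
    train_err = mse(y_train, [model(x) for x in x_train])
    test_err = mse(y_test, [model(x) for x in x_test])
    results[name] = (train_err, test_err)
    print(f"{name:>16}: train MSE={train_err:.2f}, test MSE={test_err:.2f}")
```

The constant model has a high error on both sets (high bias), the nearest-neighbor model has zero training error but a clearly worse test error (high variance), and the line keeps the two errors close together.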

    Conclusion

    This is by no means a comprehensive list. Other important topics to review include:

    • Decision Trees
    • Type I and Type II Errors
    • Confusion Matrices
    • Regression vs Classification
    • Random Forests
    • Train/test split
    • Cross validation
    • The ML Life Cycle


    It’s normal to feel overwhelmed when reviewing these concepts, especially if you haven’t seen many of them since your data science courses in school. But what’s more important is ensuring that you’re up to date with what’s most relevant to your own experience (e.g. the basics of time series modeling if that’s your specialty), and simply having a basic understanding of these other concepts.

    Also, remember that the best way to explain these concepts in an interview is to use an example and walk the interviewers through the relevant definitions as you talk through your scenario. This will help you remember everything better too.

    Thanks for reading



