    Hands On Time Series Modeling of Rare Events, with Python

By Editor Times Featured | September 8, 2025 | 12 min read


Have you ever heard this:

We now have these very large values in this time series, but these are just “outliers” and they happen only [insert small number here]% of the time

Throughout my Data Science years, I have heard that sentence a lot. In meetings, during product reviews, and on calls with clients, this sentence has been said to reassure everyone that some very large (or small) unwanted values that may appear are not “normal”, that they don’t belong to the “regular process” (all vague terms I have heard used), and that they don’t constitute a problem for the system we are trying to build (for reasons X, Y, and Z).

In production settings, these very small or very large values (known as extreme values) are accompanied by guardrails to “fail gracefully” in case an extreme value is measured. That is usually enough for cases where you just need your system to work, and you want to be confident that it works at all times, even when the unwanted/abnormal/rude, crazy, and annoying extreme values happen.

However, when analyzing a time series, we can do something more than “fixing” the extreme value with guardrails and if/else thresholds: we can actually model extreme values so that we can understand them.

In time series, extreme values actually represent something in the system of interest. For example, if your time series describes the energy consumption of a city, an unreasonably high energy level might indicate worrisome energy consumption in a specific area, which may require action. If you are dealing with financial data, the highs and lows have an obvious, crucial meaning, and understanding their behavior is extremely important.

In this blog post, we will be dealing with weather data, where the time series represents the temperature (in Kelvin). The data, as we will see, covers multiple cities, and every city yields a time series. If you select a specific city, you get a time series like the one below:

Image generated by the author [data source]

So, in this kind of dataset, it is quite important to model the maxima and minima, because they mean it is either as hot as a furnace or extremely cold. Hopefully, at this point, you are asking yourself:

“What do we mean by modeling the maxima and minima?”

When you are dealing with a time series dataset, it is reasonable to expect a Gaussian-like distribution, like the one you see here:

Image by the author [data source]

But if you consider only the extreme values, the distribution is far from that. And as you will see in a moment, there are several layers of complexity involved in extracting the distribution of extreme values. Three of them are:

1. Defining an extreme value: how do you know something is extreme? We will define our working definition of an extreme value as a first step.
2. Defining the distributions that describe these events: there are multiple possible distributions of extreme values. The three distributions treated in this blog post are the Generalized Extreme Value (GEV), Weibull, and Gumbel (a special case of GEV) distributions.
3. Choosing the best distribution: there are multiple metrics we can use to determine the “best fitting” distribution. We will focus on the Akaike Information Criterion, the log-likelihood, and the Bayesian Information Criterion.

All things we will talk about in this article 🥹

Looks like we have a lot of ground to cover. Let’s get started.

0. Data and Script Source

The language we will use is Python. The source code can be found in the PieroPaialungaAI/RareEvents folder. The data source can be found in this open source Kaggle Dataset. Mind you, if you clone the GitHub folder, you won’t have to download the dataset: it is inside the RawData folder in the RareEvents GitHub main folder (you’re welcome 😉).

1. Preliminary Data Exploration

In order to keep everything simple during the exploration phase, I wrapped the data handling in a helper class, which gives us maximum versatility in the notebook without writing hundreds of lines of code. The code that does that [data.py] is the following:

This code does all of the dirty work data-wise, so we can perform all of the following steps in just a few lines of code.

The first thing we can do is simply display some of the rows of the dataset. We can do it with this code:

Notice that there are 36 columns/cities in the dataset, not just 4. I displayed 4 to keep the table nicely formatted. 🙂

A few things to notice:

• Every column except “datetime” is a city and represents a time series, where every value corresponds to the datetime, which represents the time axis
• Every value in a city column represents the Kelvin temperature at the time in the corresponding datetime row. For example, index = 3 for column = ‘Vancouver’ tells us that, at datetime 2012-10-01 15:00:00, the temperature was 284.627 K

I also developed a function that allows you to plot a city column. For example, if you want to peek at what happens in New York, you can use this:
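A minimal sketch of such a plotting helper (the `plot_city` name and styling are my assumptions):

```python
import pandas as pd
import matplotlib
matplotlib.use("Agg")  # headless backend, so this also runs without a display
import matplotlib.pyplot as plt

def plot_city(data, city):
    """Plot one city's temperature time series against the datetime axis."""
    fig, ax = plt.subplots(figsize=(12, 4))
    ax.plot(pd.to_datetime(data["datetime"]), data[city], linewidth=0.5)
    ax.set_xlabel("Date")
    ax.set_ylabel("Temperature (K)")
    ax.set_title(city)
    return fig

# Example: plot_city(data, "New York")
```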

Image by the author, generated with the code above

Now, the datetime column is just a string column, but it would actually be helpful to have the specific month, day, and year in separate columns. Also, we have some NaN values that we should take care of. All these boring preprocessing steps live inside the `.clean_and_preprocess()` method.
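A sketch of what that method plausibly does (the fill strategy here is my assumption, not necessarily the repo's):

```python
import pandas as pd

def clean_and_preprocess(data):
    """Parse datetimes, add year/month/day columns, and fill missing temperatures."""
    out = data.copy()
    out["datetime"] = pd.to_datetime(out["datetime"])
    out["year"] = out["datetime"].dt.year
    out["month"] = out["datetime"].dt.month
    out["day"] = out["datetime"].dt.day
    # Hourly temperatures change slowly, so a forward/backward fill is a
    # reasonable (assumed) way to patch the NaN values
    city_cols = [c for c in out.columns if c not in ("datetime", "year", "month", "day")]
    out[city_cols] = out[city_cols].ffill().bfill()
    return out
```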

This is the output:

2. Detecting Extreme Events

Now, the most important question:

What is an extreme event? And how are we going to detect it?

There are two main ways to define an “extreme event”. For example, if we want to identify the maxima, we can apply:

1. The first definition: Peak Over Threshold (POT). Given a threshold, everything above that threshold is a maximum point (extreme event).
2. The second definition: extreme within a region (block maxima). Given a window, we define the maximum value within the window as an extreme event.

In this blog post, we’ll use the second approach. For instance, if we use daily windows, we scan through the dataset and extract the highest value for each day. This gives us a lot of points, as our dataset spans more than 5 years. Or we could use monthly or yearly windows. This gives us fewer points, but perhaps richer information.

That is exactly the power of this method: we have control over the number of points and their “quality”. For this study, arguably the best window size is the daily one. For another dataset, feel free to adjust based on the quantity of your points; for example, you might want to reduce the window size if you have a very short natural sampling interval (e.g., you collect data every second), or increase it if you have a very large dataset (e.g., you have 50+ years of data and a weekly window is more appropriate).

This definition of maximum value is implemented within the RareEventsToolbox class, in the [rare_events_toolbox.py] script (take a look at the extract_block_max function).
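With pandas resampling, the idea reduces to a few lines; this sketch assumes the DataFrame layout shown earlier and is not the repo's exact implementation:

```python
import pandas as pd

def extract_block_max(data, city, freq="D"):
    """Block maxima: the highest temperature within each window of size freq.

    freq follows pandas offset aliases, e.g. "D" (daily) or "W" (weekly).
    """
    series = pd.Series(data[city].values, index=pd.to_datetime(data["datetime"]))
    return series.resample(freq).max().dropna()
```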

And we can quickly display the distribution of rare events at different window sizes using the following block of code:
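A self-contained sketch comparing daily and weekly windows side by side (the window choices and styling are mine):

```python
import pandas as pd
import matplotlib
matplotlib.use("Agg")  # headless backend
import matplotlib.pyplot as plt

def plot_block_max_hists(data, city, freqs=("D", "W")):
    """Histogram of the block maxima for several window sizes, one panel each."""
    fig, axes = plt.subplots(1, len(freqs), figsize=(5 * len(freqs), 3.5))
    series = pd.Series(data[city].values, index=pd.to_datetime(data["datetime"]))
    for ax, freq in zip(axes, freqs):
        block_max = series.resample(freq).max().dropna()
        ax.hist(block_max, bins=30)
        ax.set_title(f"{city}, window = {freq}")
        ax.set_xlabel("Temperature (K)")
    return fig
```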

Image by the author, generated with the code above

3. Extreme Event Distributions

Before diving into code, let’s take a step back. Generally, extreme value distributions do not exhibit the beautiful Gaussian bell behavior that you saw earlier (the Gaussian distribution for San Francisco). From a theoretical perspective, the two distributions to know are the Generalized Extreme Value (GEV) distribution and the Weibull distribution.

GEV (Generalized Extreme Value)

• The GEV is the foundation of extreme value theory and provides a family of distributions tailored for modeling block maxima or minima. A special case is the Gumbel distribution.
• Its flexibility comes from a shape parameter that determines the “tail behavior.” Depending on this parameter, the GEV can mimic different kinds of extremes (e.g., moderate, heavy-tailed).
• The derivation of the GEV distribution is very elegant: just as the Central Limit Theorem (CLT) says, “if you average a bunch of i.i.d. random variables, the distribution of the average tends to a Gaussian”, Extreme Value Theory (EVT) says, “if you take the maximum (or minimum) of a bunch of i.i.d. random variables, the distribution of that maximum tends to a GEV.”

    Weibull

• The Weibull is one of the most widely used distributions in reliability engineering, meteorology, and environmental modeling.
• It is especially useful for describing data with a sense of “bounded” or tapered-off extremes.
• Unlike the GEV distribution(s), the Weibull formulation is empirical. Waloddi Weibull, a Swedish engineer, first proposed the distribution in 1939 to model the breaking strength of materials.

So we have three possibilities: GEV, Gumbel, and Weibull. Now, which one is the best? The short answer is “it depends,” and another short answer is “it’s best just to try all of them and see which one performs best”.

So now we have another question:

How do we evaluate the quality of the fit between a distribution function and a set of data?

Three metrics to use are the following:

• Log-Likelihood (LL). It measures how likely the observed data is under the fitted distribution: higher is better.

    LL(θ) = Σᵢ log f(xᵢ; θ)

    where f is the probability density (or mass) function of the distribution with parameters θ, and xᵢ is the i-th observed data point.

• Akaike Information Criterion (AIC). AIC balances two forces: fit quality (via the log-likelihood L) and simplicity (it penalizes models with too many parameters, number of parameters = k):

    AIC = 2k − 2L

• Bayesian Information Criterion (BIC). Similar in spirit to AIC, but harsher on complexity (dataset size = n):

    BIC = k ln(n) − 2L

The standard suggestion is to use one of AIC or BIC, as they account for both the log-likelihood and the complexity.

The implementation of the three distribution fits, and the corresponding LL, AIC, and BIC values, is the following:
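A sketch with scipy.stats, where genextreme, gumbel_r, and weibull_min are scipy's counterparts of GEV, Gumbel, and Weibull; the function names here are mine, not necessarily the repo's:

```python
import numpy as np
from scipy import stats

CANDIDATES = {
    "gev": stats.genextreme,         # Generalized Extreme Value
    "gumbel": stats.gumbel_r,        # Gumbel (right-skewed, for maxima)
    "weibull_min": stats.weibull_min,
}

def fit_and_score(block_max):
    """Fit each candidate by maximum likelihood and compute LL, AIC, and BIC."""
    x = np.asarray(block_max, dtype=float)
    n = len(x)
    results = {}
    for name, dist in CANDIDATES.items():
        params = dist.fit(x)                    # MLE estimates
        ll = np.sum(dist.logpdf(x, *params))    # log-likelihood
        k = len(params)                         # number of fitted parameters
        results[name] = {
            "param": params,
            "log_likelihood": ll,
            "aic": 2 * k - 2 * ll,
            "bic": k * np.log(n) - 2 * ll,
        }
    return results

def best_fit(block_max, criterion="aic"):
    """Return (name, result) of the candidate with the lowest AIC (or BIC)."""
    results = fit_and_score(block_max)
    name = min(results, key=lambda m: results[m][criterion])
    return name, results[name]
```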

Then we can display our fitted distribution using the following:
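A sketch that overlays the fitted density on a normalized histogram; it takes any fitted scipy distribution and its parameters (the function name is my assumption):

```python
import numpy as np
import matplotlib
matplotlib.use("Agg")  # headless backend
import matplotlib.pyplot as plt

def plot_fit(block_max, dist, params):
    """Overlay a fitted scipy distribution's pdf on the normalized histogram."""
    x = np.asarray(block_max, dtype=float)
    fig, ax = plt.subplots(figsize=(6, 4))
    ax.hist(x, bins=30, density=True, alpha=0.5, label="block maxima")
    grid = np.linspace(x.min(), x.max(), 200)
    ax.plot(grid, dist.pdf(grid, *params), "r-", label="fitted pdf")
    ax.set_xlabel("Temperature (K)")
    ax.set_ylabel("Density")
    ax.legend()
    return fig
```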

Image by the author

Pretty good fit, right? While it visually looks good, we can be a little more quantitative and look at the Q-Q plot, which displays the quantile match between the data and the fitted distribution:
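A generic Q-Q plot sketch (the plotting positions and styling are my choices):

```python
import numpy as np
import matplotlib
matplotlib.use("Agg")  # headless backend
import matplotlib.pyplot as plt

def qq_plot(block_max, dist, params):
    """Q-Q plot: empirical quantiles of the data vs. quantiles of a fitted scipy distribution."""
    x = np.sort(np.asarray(block_max, dtype=float))
    # Hazen plotting positions for the empirical quantiles
    probs = (np.arange(1, len(x) + 1) - 0.5) / len(x)
    theoretical = dist.ppf(probs, *params)
    fig, ax = plt.subplots(figsize=(5, 5))
    ax.scatter(theoretical, x, s=8, label="data")
    lims = [min(theoretical[0], x[0]), max(theoretical[-1], x[-1])]
    ax.plot(lims, lims, "r--", label="perfect fit")
    ax.set_xlabel("Theoretical quantiles")
    ax.set_ylabel("Empirical quantiles")
    ax.legend()
    return fig
```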

Image by the author

This shows that our distribution matches the provided dataset very well. Now, notice how, if you had tried a standard distribution (e.g., a Gaussian curve), you would certainly have failed: the distribution of the data is heavily skewed (as expected, because we are dealing with extreme values, and extreme values require extreme value distributions; this feels weirdly motivational 😁).

Now, the cool thing is that, since we structured the code, we can also run this for every city in the dataset using the following block of code:
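A self-contained sketch of such a loop, using daily block maxima and AIC for model selection (the function name and the criterion default are my assumptions):

```python
import numpy as np
import pandas as pd
from scipy import stats

def best_distribution_per_city(data, criterion="aic"):
    """Fit GEV/Gumbel/Weibull to each city's daily maxima and keep the best model."""
    candidates = {"gev": stats.genextreme,
                  "gumbel": stats.gumbel_r,
                  "weibull_min": stats.weibull_min}
    time_index = pd.to_datetime(data["datetime"])
    summary = {}
    for city in data.columns.drop("datetime"):
        daily_max = (pd.Series(data[city].values, index=time_index)
                     .resample("D").max().dropna().values)
        scored = {}
        for name, dist in candidates.items():
            params = dist.fit(daily_max)
            ll = np.sum(dist.logpdf(daily_max, *params))
            k, n = len(params), len(daily_max)
            scored[name] = {"param": params, "log_likelihood": ll,
                            "aic": 2 * k - 2 * ll,
                            "bic": k * np.log(n) - 2 * ll}
        best = min(scored, key=lambda m: scored[m][criterion])
        summary[city] = {"dist_type": best,
                         "param": scored[best]["param"],
                         "metrics": {m: scored[best][m]
                                     for m in ("log_likelihood", "aic", "bic")}}
    return summary
```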

And the output will look like this:

{'Dallas': {'dist_type': 'gev',
  'param': (0.5006578789482107, 296.2415220841758, 9.140132853556741),
  'dist': <scipy.stats frozen distribution object>,
  'metrics': {'log_likelihood': -6602.222429209462,
   'aic': 13210.444858418923,
   'bic': 13227.07308905503}},
 'Pittsburgh': {'dist_type': 'gev',
  'param': (0.5847547512518895, 287.21064374616327, 11.190557085335278),
  'dist': <scipy.stats frozen distribution object>,
  'metrics': {'log_likelihood': -6904.563305593636,
   'aic': 13815.126611187272,
   'bic': 13831.754841823378}},
 'New York': {'dist_type': 'weibull_min',
  'param': (6.0505720895039445, 238.93568735311248, 55.21556483095677),
  'dist': <scipy.stats frozen distribution object>,
  'metrics': {'log_likelihood': -6870.265288196851,
   'aic': 13746.530576393701,
   'bic': 13763.10587863208}},
 'Kansas City': {'dist_type': 'gev',
  'param': (0.5483246490879885, 290.4564464294219, 11.284265203196664),
  'dist': <scipy.stats frozen distribution object>,
  'metrics': {'log_likelihood': -6949.785968553707,
   'aic': 13905.571937107414,
   'bic': 13922.20016774352}}}

4. Summary

Thank you for spending time with me so far, it means a lot ❤️

Let’s recap what we did. Instead of hand-waving away “outliers,” we treated extremes as first-class signals. Specifically:

• We took a dataset representing the temperature of cities around the world
• We defined our extreme events using block maxima on a fixed window
• We modeled city-level temperature highs with three candidate families (GEV, Gumbel, and Weibull)
• We selected the best fit using log-likelihood, AIC, and BIC, then verified the fits with Q-Q plots

The results show that the “best” family varies by city: for example, Dallas, Pittsburgh, and Kansas City leaned GEV, while New York fit a Weibull.

This kind of approach is crucial when extreme values matter in your system and you need to show how the system statistically behaves under rare and extreme circumstances.

    5. Conclusions

Thank you again for your time. It means a lot ❤️

My name is Piero Paialunga, and I’m this guy here:

Image by the author

I’m a Ph.D. candidate at the University of Cincinnati, Aerospace Engineering Department. I talk about AI and Machine Learning in my blog posts, on LinkedIn, and here on TDS. If you liked the article and want to know more about machine learning and follow my studies, you can:

A. Follow me on LinkedIn, where I publish all my stories
B. Follow me on GitHub, where you can see all my code
C. For questions, you can send me an email at piero.paialunga@hotmail


