
    Building a Monitoring System That Actually Works

    By Editor Times Featured | October 28, 2025 | 17 min read


    When building and managing products, it’s essential to make sure they’re performing as expected and that everything is running smoothly. We typically rely on metrics to gauge the health of our products, and many factors can influence our KPIs, from internal changes such as UI updates, pricing adjustments, or incidents, to external factors like competitor actions or seasonal trends. That’s why it’s crucial to continuously monitor your KPIs so you can respond quickly when something goes off track. Otherwise, it might take several weeks to realise that your product was completely broken for 5% of customers or that conversion dropped by 10 percentage points after the last release.

    To gain this visibility, we create dashboards with key metrics. But let’s be honest: dashboards that nobody actively monitors offer little value. We either need people constantly watching dozens or even hundreds of metrics, or we need an automated alerting and monitoring system, and I strongly prefer the latter. So, in this article, I’ll walk you through a practical approach to building an effective monitoring system for your KPIs. You’ll learn about different monitoring approaches, build your first statistical monitoring system, and see what challenges you’re likely to encounter when deploying it in production.

    Setting up monitoring

    Let’s start with the big picture of how to architect your monitoring system, then we’ll dive into the technical details. There are a few key decisions you need to make when setting up monitoring:

    • Sensitivity. You need to find the right balance between missing important anomalies (false negatives) and getting bombarded with false alerts 100 times a day (false positives). We’ll discuss the levers you have to adjust this later on.
    • Dimensions. The segments you choose to monitor also affect your sensitivity. If there’s a problem in a small segment (like a specific browser or country), your system is more likely to catch it if you’re monitoring that segment’s metrics directly. But here’s the catch: the more segments you monitor, the more false positives you’ll have to deal with, so you need to find the sweet spot.
    • Time granularity. If you have plenty of data and can’t afford delays, it might be worth monitoring minute-by-minute data. If you don’t have enough data, you can aggregate it into 5–15 minute buckets and monitor those instead. Either way, it’s always a good idea to have higher-level daily, weekly, or monthly monitoring alongside your real-time monitoring to keep an eye on longer-term trends.
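    As a quick sketch of the aggregation option from the last bullet, minute-level data can be rolled up into 5-minute buckets with pandas `resample` (the data and column name below are synthetic, invented for illustration):

```python
import numpy as np
import pandas as pd

# one day of synthetic minute-level trip counts
idx = pd.date_range("2025-07-01", periods=24 * 60, freq="min")
rides = pd.DataFrame(
    {"trips": np.random.default_rng(42).poisson(100, size=len(idx))},
    index=idx,
)

# aggregate into 5-minute buckets: same total volume, fewer and less noisy points to monitor
rides_5m = rides.resample("5min").sum()

print(len(rides), len(rides_5m))  # 1440 minute rows become 288 five-minute buckets
```

    The same idea extends to daily or weekly rollups for the longer-term monitoring mentioned above.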

    However, monitoring isn’t just about the technical solution. It’s also about the processes you have in place:

    • You need someone who’s responsible for monitoring and responding to alerts. We used to handle this with an on-call rotation in my team, where each week one person would be in charge of reviewing all the alerts.
    • Beyond automated monitoring, it’s worth doing some manual checks too. You can set up TV dashboards in the office, or at the very least, have a process where someone (like the on-call person) reviews the metrics once a day or week.
    • You need to establish feedback loops. When you’re reviewing alerts and looking back at incidents you might have missed, take the time to fine-tune your monitoring system’s settings.
    • The value of a change log (a record of all changes affecting your KPIs) can’t be overstated. It helps you and your team always have context on what happened to your KPIs and when. Plus, it gives you a valuable dataset for evaluating the real impact of changes to your monitoring system (like figuring out what share of past anomalies a new setup would actually catch).
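    To make the change-log idea concrete, here is one possible sketch (the entries, windows, and column names are all invented) of excluding logged anomaly periods from the data used to compute statistics:

```python
import pandas as pd

# hypothetical change log: known incidents / demand spikes with their time windows
change_log = pd.DataFrame([
    {"start": "2025-07-14 18:00", "end": "2025-07-14 23:59", "note": "demand spike (concert)"},
    {"start": "2025-06-02 09:00", "end": "2025-06-02 11:00", "note": "app outage"},
])
change_log[["start", "end"]] = change_log[["start", "end"]].apply(pd.to_datetime)

def drop_logged_anomalies(df, ts_col="pickup_datetime"):
    """Remove rows that fall inside any logged anomaly window."""
    mask = pd.Series(False, index=df.index)
    for _, row in change_log.iterrows():
        mask |= df[ts_col].between(row["start"], row["end"])
    return df[~mask]

data = pd.DataFrame({"pickup_datetime": pd.to_datetime(
    ["2025-07-14 19:30", "2025-07-07 19:30", "2025-06-02 10:00"])})
clean = drop_logged_anomalies(data)
print(len(clean))  # only the 2025-07-07 row survives
```

    The same filter can be reused later as a set of labelled test cases when you evaluate new monitoring configurations.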

    Now that we’ve covered the high-level picture, let’s dig into the technical details of how to actually detect anomalies in time series data.

    Frameworks for monitoring 

    There are many out-of-the-box frameworks you can use for monitoring. I’d break them down into two main groups.

    The first group involves making a forecast with confidence intervals. Here are some options:

    • You can use statsmodels and its classical implementations of ARIMA-like models for time series forecasting.
    • Another option that typically works quite well out of the box is Prophet by Meta. It’s a simple additive model that returns uncertainty intervals.
    • There’s also GluonTS, a deep learning-based forecasting framework from AWS.

    The second group focuses on anomaly detection, and here are some popular libraries:

    • PyOD: the most popular Python outlier/anomaly detection toolbox, with 50+ algorithms (including time series and deep learning methods).
    • ADTK (Anomaly Detection Toolkit): built for unsupervised/rule-based time series anomaly detection, with easy integration into pandas dataframes.
    • Merlion: combines forecasting and anomaly detection for time series using both classical and ML approaches.

    I’ve only mentioned a few examples here; there are many more libraries out there. You can certainly try them with your data and see how they perform. However, I want to share a much simpler approach to monitoring that I usually start with. Even though it’s simple enough to implement with a single SQL query, it works surprisingly well in many cases. Another significant advantage of this simplicity is that you can implement it in virtually any tool, whereas deploying more complex ML approaches can be tricky in some systems.

    Statistical approach to monitoring

    The core idea behind monitoring is simple: use historical data to build a confidence interval (CI) and detect when current metrics fall outside the expected behaviour. We estimate this confidence interval using the mean and standard deviation of past data. It’s just basic statistics.

    \[
    \text{Confidence Interval} = \left(\text{mean} - \text{coef}_1 \times \text{std},\; \text{mean} + \text{coef}_2 \times \text{std}\right)
    \]

    Image by author

    However, the effectiveness of this approach depends on several key parameters, and the choices you make here will significantly impact the accuracy of your alerts.

    The first decision is how to define the data sample used to calculate your statistics. Typically, we compare the current metric to the same time period on previous days. This involves two main components:

    • Time window: I usually take a window of ±10–30 minutes around the current timestamp to account for short-term fluctuations.
    • Historical days: I prefer using the same weekday over the past 3–5 weeks. This accounts for weekly seasonality, which is usually present in business data. However, depending on your seasonality patterns, you might choose a different approach (for example, splitting days into two groups: weekdays and weekends).

    Another important parameter is the coefficient used to set the width of the confidence interval. I usually use three standard deviations, since that covers 99.7% of observations for distributions close to normal.
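    The 99.7% figure is easy to verify empirically: for an approximately normal metric, almost all points fall within mean ± 3·std.

```python
import numpy as np

rng = np.random.default_rng(0)
sample = rng.normal(loc=100, scale=10, size=1_000_000)

# fraction of points inside the three-standard-deviation band
mean, std = sample.mean(), sample.std()
inside = (sample > mean - 3 * std) & (sample < mean + 3 * std)
print(f"{inside.mean():.4f}")  # close to the theoretical 0.9973
```

    Real business metrics are rarely exactly normal, which is one reason the coefficient is worth tuning rather than fixing at 3.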

    As you can see, there are several choices to make, and there’s no one-size-fits-all answer. The most reliable way to determine the optimal settings is to experiment with different configurations on your own data and pick the one that delivers the best performance for your use case. So this is a perfect moment to put the approach into action and see how it performs on real data.

    Example: monitoring the number of taxi rides

    To check this out, we’ll use the popular NYC Taxi Data dataset (). I loaded data from May to July 2025 and focused on rides related to high-volume for-hire vehicles. Since we have hundreds of trips every minute, we can use minute-by-minute data for monitoring.

    Image by author

    Building the first version

    So, let’s try our approach and build confidence intervals based on real data. I started with a default set of key parameters:

    • A time window of ±15 minutes around the current timestamp,
    • Data from the current day plus the same weekday from the previous three weeks,
    • A confidence band defined as ±3 standard deviations.

    Now, let’s create a couple of functions with the business logic to calculate the confidence interval and check whether our value falls outside of it.

    import pandas as pd
    import tqdm

    # assumes df is already loaded with columns 'pickup_datetime' and 'values'
    # returns the dataset of historic data
    def get_distribution_for_ci(param, ts, n_weeks=3, n_mins=15): 
      tmp_df = df[['pickup_datetime', param]].rename(columns={param: 'value', 'pickup_datetime': 'dt'})
      
      tmp = [] 
      for n in range(n_weeks + 1):
        lower_bound = (pd.to_datetime(ts) - pd.Timedelta(weeks=n, minutes=n_mins)).strftime('%Y-%m-%d %H:%M:%S')
        upper_bound = (pd.to_datetime(ts) - pd.Timedelta(weeks=n, minutes=-n_mins)).strftime('%Y-%m-%d %H:%M:%S')
        tmp.append(tmp_df[(tmp_df.dt >= lower_bound) & (tmp_df.dt <= upper_bound)])
    
      base_df = pd.concat(tmp)
      base_df = base_df[base_df.dt < ts]
      return base_df
    
    # calculates mean and std needed to calculate confidence intervals
    def get_ci_statistics(param, ts, n_weeks=3, n_mins=15):
      base_df = get_distribution_for_ci(param, ts, n_weeks, n_mins)
      std = base_df.value.std()
      mean = base_df.value.mean()
      return mean, std
    
    # iterating through all the timestamps in historic data
    ci_tmp = []
    for ts in tqdm.tqdm(df.pickup_datetime):
      ci = get_ci_statistics('values', ts, n_weeks=3, n_mins=15)
      ci_tmp.append(
        {
            'pickup_datetime': ts,
            'mean': ci[0],
            'std': ci[1],
        }
      )
    
    ci_df = df[['pickup_datetime', 'values']].copy()
    ci_df = ci_df.merge(pd.DataFrame(ci_tmp), how='left', on='pickup_datetime')
    
    # defining CI
    ci_df['ci_lower'] = ci_df['mean'] - 3 * ci_df['std']
    ci_df['ci_upper'] = ci_df['mean'] + 3 * ci_df['std']
    
    # defining whether value is outside of CI
    ci_df['outside_of_ci'] = (ci_df['values'] < ci_df['ci_lower']) | (ci_df['values'] > ci_df['ci_upper'])

    Analysing results

    Let’s look at the results. First, we’re seeing quite a few false positive triggers (one-off points outside the CI that seem to be due to normal variability).

    Image by author

    There are two ways we can adjust our algorithm to account for this:

    • The CI doesn’t need to be symmetric. We might be less concerned about increases in the number of trips, so we could use a higher coefficient for the upper bound (for example, use 5 instead of 3).
    • The data is quite volatile, so there will be occasional anomalies where a single point falls outside the confidence interval. To reduce such false positive alerts, we can use more robust logic and only trigger an alert when multiple points are outside the CI (for example, at least 4 out of the last 5 points, or 8 out of 10).
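    The "k out of the last n points" rule from the second bullet boils down to a rolling sum over a boolean flag. A minimal sketch with a hand-made flag series:

```python
import pandas as pd

# per-minute flags: True when the value at that minute is outside the CI
outside = pd.Series([False, False, True, True, False, True, True, True, False, False])

K, N = 4, 5  # alert only when at least 4 of the last 5 points are outside the CI
alert = outside.astype(int).rolling(N).sum() >= K

print(alert.tolist())  # alerts fire only at positions 6 and 7
```

    Isolated one-off excursions never accumulate 4 hits in a 5-point window, so they no longer trigger alerts.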

    However, there’s another potential problem with our current CIs. As you can see, there are quite a few cases where the CI is excessively wide. This looks off and could reduce the sensitivity of our monitoring.

    Let’s look at one example to understand why this happens. The distribution we’re using to estimate the CI at this point is bimodal, which leads to a higher standard deviation and a wider CI. That’s because the number of trips on the evening of July 14th is significantly higher than in other weeks.

    Image by author
    Image by author

    So we’ve encountered an anomaly in the past that’s affecting our confidence intervals. There are two ways to address this issue:

    • If we’re doing constant monitoring, we know there was anomalously high demand on July 14th, and we can exclude these periods when constructing our CIs. This approach requires some discipline to track these anomalies, but it pays off with more accurate results.
    • However, there’s always a quick-and-dirty approach too: we can simply drop or cap outliers when constructing the CI.

    Improving the accuracy

    So after the first iteration, we identified several potential improvements for our monitoring approach:

    • Use a higher coefficient for the upper bound since we care less about increases. I used 6 standard deviations instead of 3.
    • Deal with outliers to filter out past anomalies. I experimented with removing or capping the top 10–20% of outliers and found that capping at 20% alongside increasing the period to 5 weeks worked best in practice.
    • Raise an alert only when 4 out of the last 5 points are outside the CI to reduce the number of false positive alerts caused by normal volatility.

    Let’s see how this looks in code. We’ve updated the logic in get_ci_statistics to account for different strategies for handling outliers.

    def get_ci_statistics(param, ts, n_weeks=3, n_mins=15, show_vis = False, filter_outliers_strategy = 'none', 
                       filter_outliers_perc = None):
      assert filter_outliers_strategy in ['none', 'clip', 'remove'], "filter_outliers_strategy must be one of 'none', 'clip', 'remove'"
      base_df = get_distribution_for_ci(param, ts, n_weeks, n_mins)
      if filter_outliers_strategy != 'none': 
        p_upper = base_df.value.quantile(1 - filter_outliers_perc)
        p_lower = base_df.value.quantile(filter_outliers_perc)
        if filter_outliers_strategy == 'clip':
          base_df['value'] = base_df['value'].clip(lower=p_lower, upper=p_upper)
        if filter_outliers_strategy == 'remove':
          base_df = base_df[(base_df.value >= p_lower) & (base_df.value <= p_upper)]
      std = base_df.value.std()
      mean = base_df.value.mean()
      return mean, std

    We also need to update the way we define the outside_of_ci parameter.

    anomalies = []  # collect timestamps flagged as anomalous
    for ts in tqdm.tqdm(ci_df.pickup_datetime):
      tmp_df = ci_df[(ci_df.pickup_datetime <= ts)].tail(5).copy()
      tmp_df = tmp_df[~tmp_df.ci_lower.isna() & ~tmp_df.ci_upper.isna()]
      if tmp_df.shape[0] < 5: 
        continue
      tmp_df['outside_of_ci'] = (tmp_df['values'] < tmp_df['ci_lower']) | (tmp_df['values'] > tmp_df['ci_upper'])
      if tmp_df.outside_of_ci.map(int).sum() >= 4:
        anomalies.append(ts) 
    
    ci_df['outside_of_ci'] = ci_df.pickup_datetime.isin(anomalies)

    We can see that the CI is now significantly narrower (no more anomalously wide CIs), and we’re also getting far fewer alerts since we increased the upper bound coefficient.

    Image by author

    Let’s investigate the two alerts we found. These two alerts from the last 2 weeks look plausible when we compare the traffic to previous weeks.

    Image by author

    Practical tip: This chart also reminds us that ideally we should account for public holidays and either exclude them or treat them as weekends when calculating the CI.

    Image by author

    So our new monitoring approach makes total sense. However, there’s a drawback: by only looking for cases where 4 out of 5 minutes fall outside the CI, we’re delaying alerts in situations where everything is completely broken. To address this problem, you can actually use two CIs:

    • Doomsday CI: A broad confidence interval where even a single point falling outside means it’s time to panic.
    • Incident CI: The one we built earlier, where we might wait 5–10 minutes before triggering an alert, since the drop in the metric isn’t as critical.
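    Here is a minimal sketch of the two-tier check. The coefficients and the k-out-of-n rule mirror the choices above, but the specific numbers are illustrative, and for simplicity this version uses a symmetric band rather than the asymmetric one discussed earlier:

```python
def check_alerts(values, mean, std, doomsday_coef=6, incident_coef=3, k=4, n=5):
    """values: the n most recent observations, oldest first."""
    latest = values[-1]
    # doomsday CI: a single point outside the wide band is enough to page someone
    if abs(latest - mean) > doomsday_coef * std:
        return "doomsday"
    # incident CI: require k of the last n points outside the narrower band
    outside = [abs(v - mean) > incident_coef * std for v in values[-n:]]
    if sum(outside) >= k:
        return "incident"
    return None

# usage: with mean=100, std=5 the doomsday band is 70..130, the incident band 85..115
print(check_alerts([99, 101, 100, 98, 60], mean=100, std=5))   # doomsday
print(check_alerts([80, 82, 81, 100, 83], mean=100, std=5))    # incident
print(check_alerts([99, 101, 100, 98, 102], mean=100, std=5))  # None
```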

    Let’s define 2 CIs for our case.

    Image by author

    It’s a balanced approach that gives us the best of both worlds: we can react quickly when something is completely broken while still keeping false positives under control. With that, we’ve achieved a good result and we’re ready to move on.

    Testing our monitoring on anomalies

    We’ve confirmed that our approach works well for business-as-usual cases. However, it’s also worth doing some stress testing by simulating anomalies we want to catch and checking how the monitoring performs. In practice, it’s worth testing against previously known anomalies to see how it would handle real-world examples.

    In our case, we don’t have a change log of previous anomalies, so I simulated a 20% drop in the number of trips, and our approach caught it immediately.

    Image by author

    These kinds of step changes can be tricky in real life. Imagine we lost one of our partners, and that lower level becomes the new normal for the metric. In that case, it’s worth adjusting our monitoring as well. If it’s possible to recalculate the historical metric based on the current state (for example, by filtering out the lost partner), that would be ideal since it would bring the monitoring back to normal. If that’s not feasible, we can either adjust the historical data (say, subtract 20% of traffic as our estimate of the change) or drop all data from before the change and use only the new data to construct the CI.

    Image by author

    Let’s look at another tricky real-world example: gradual decay. If your metric is slowly dropping day after day, it likely won’t be caught by our real-time monitoring since the CI will be shifting along with it. To catch situations like this, it’s worth having less granular monitoring (like daily, weekly, or even monthly).
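    To see why coarser granularity helps, consider a synthetic metric decaying by about 1% per day: each daily step stays well inside a real-time CI, but comparing the latest week against an earlier baseline makes the drift obvious. (The numbers below are made up purely for illustration.)

```python
import numpy as np

# four weeks of synthetic daily totals, decaying ~1% per day
daily = [10_000 * (0.99 ** d) for d in range(28)]

baseline = np.mean(daily[:21])  # first three weeks as the baseline
current = np.mean(daily[-7:])   # most recent week
drop = 1 - current / baseline

print(f"{drop:.1%}")  # roughly a 13% cumulative drop, invisible day over day
```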

    Image by author

    You can find the full code on GitHub.

    Operational challenges

    We’ve covered the mathematics behind alerting and monitoring systems. However, there are several other nuances you’ll likely encounter once you start deploying your system in production, so I’d like to cover them before wrapping up.

    Lagging data. We don’t face this problem in our example since we’re working with historical data, but in real life you need to deal with data lags. It usually takes some time for data to reach your data warehouse, so you need a way to distinguish between cases where the data simply hasn’t arrived yet and actual incidents affecting the customer experience. The most straightforward approach is to look at historical data, determine the typical lag, and filter out the last 5–10 data points.
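    The lag filter described above can be a one-liner once the typical delay is known; the 15-minute figure below is an assumed value you would measure from your own pipeline:

```python
import pandas as pd

TYPICAL_LAG = pd.Timedelta(minutes=15)  # measured from historical arrival times (assumed here)

def drop_laggy_tail(df, now, ts_col="pickup_datetime"):
    """Exclude the most recent points that may not have fully arrived yet."""
    return df[df[ts_col] <= now - TYPICAL_LAG]

data = pd.DataFrame(
    {"pickup_datetime": pd.date_range("2025-07-14 17:00", periods=60, freq="min")}
)
fresh = drop_laggy_tail(data, now=pd.Timestamp("2025-07-14 17:59"))
print(len(fresh))  # 45 of the 60 minutes remain; the incomplete tail is ignored
```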

    Different sensitivity for different segments. You’ll likely want to monitor not just the main KPI (the number of trips), but also break it down by several segments (like partners, regions, and so on). Adding more segments is always useful since it helps you spot smaller changes in specific segments (for example, a problem in Manhattan). However, as mentioned above, there’s a downside: more segments mean more false positive alerts to deal with. To keep this under control, you can use different sensitivity levels for different segments (say, 3 standard deviations for the main KPI and 5 for segments).

    Smarter alerting system. Also, when you’re monitoring many segments, it’s worth making your alerting a bit smarter. Say you have monitoring for the main KPI and 99 segments. Now imagine a global outage where the number of trips drops everywhere. Within the next 5 minutes, you’ll (hopefully) get 100 notifications that something is broken. That’s not a great experience. To avoid this situation, I’d build logic to filter out redundant notifications. For example:

    • If we sent the same notification within the last 3 hours, don’t fire another alert.
    • If there’s a notification about a drop in the main KPI plus more than 3 segments, only alert about the main KPI change.
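    The two rules above can be sketched as a small filter in front of the notifier (the alert keys, cooldown, and thresholds are all illustrative):

```python
from datetime import datetime, timedelta

last_sent = {}  # alert key -> time it was last sent

def should_notify(alerts, now, cooldown=timedelta(hours=3), max_segments=3):
    """alerts: list of alert keys fired this cycle, e.g. ['main_kpi', 'manhattan']."""
    # rule 2: a main-KPI alert plus many segment alerts suggests a global issue
    segments = [a for a in alerts if a != "main_kpi"]
    if "main_kpi" in alerts and len(segments) > max_segments:
        alerts = ["main_kpi"]
    # rule 1: don't resend the same alert within the cooldown window
    to_send = [a for a in alerts if now - last_sent.get(a, datetime.min) >= cooldown]
    for a in to_send:
        last_sent[a] = now
    return to_send

now = datetime(2025, 7, 14, 18, 0)
print(should_notify(["main_kpi", "manhattan", "brooklyn", "queens", "bronx"], now))
# a global outage collapses into a single main-KPI notification
print(should_notify(["main_kpi"], now + timedelta(hours=1)))
# empty: still within the cooldown window
```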

    Overall, alert fatigue is real, so it’s worth minimising the noise.

    And that’s it! We’ve covered the entire alerting and monitoring topic, and hopefully you’re now fully equipped to set up your own system.

    Summary

    We’ve covered a lot of ground on alerting and monitoring. Let me wrap up with a step-by-step guide on how to start monitoring your KPIs.

    • The first step is to gather a change log of past anomalies. You can use it both as a set of test cases for your system and to filter out anomalous periods when calculating CIs.
    • Next, build a prototype and run it on historical data. I’d start with the highest-level KPI, try out several possible configurations, and see how well each one catches previous anomalies and whether it generates a lot of false alerts. At this point, you should have a viable solution.
    • Then try it out in production, since that’s where you’ll have to deal with data lags and see how the monitoring actually performs in practice. Run it for 2–4 weeks and tweak the parameters to make sure it’s working as expected.
    • After that, share the monitoring with your colleagues and start expanding the scope to include other segments. Don’t forget to keep adding all anomalies to the change log, and establish feedback loops to improve your system continuously.

    And that’s it! Now you can rest easy knowing that automation is keeping an eye on your KPIs (but still check in on them from time to time, just in case).

    Thanks for reading. I hope this article was insightful. Remember Einstein’s advice: “The important thing is not to stop questioning. Curiosity has its own reason for existing.” May your curiosity lead you to your next great insight.


