
    Building a Monitoring System That Actually Works

    By Editor Times Featured | October 28, 2025 | 17 min read


    When building and managing products, it’s essential to make sure they’re performing as expected and that everything is running smoothly. We typically rely on metrics to gauge the health of our products, and many factors can influence our KPIs, from internal changes such as UI updates, pricing adjustments, or incidents, to external factors like competitor actions or seasonal trends. That’s why it’s crucial to continuously monitor your KPIs so you can respond quickly when something goes off track. Otherwise, it might take several weeks to realise that your product was completely broken for 5% of customers or that conversion dropped by 10 percentage points after the last release.

    To gain this visibility, we create dashboards with key metrics. But let’s be honest: dashboards that nobody actively monitors offer little value. We either need people constantly watching dozens or even hundreds of metrics, or we need an automated alerting and monitoring system, and I strongly prefer the latter. So, in this article, I’ll walk you through a practical approach to building an effective monitoring system for your KPIs. You’ll learn about different monitoring approaches, build your first statistical monitoring system, and see what challenges you’re likely to encounter when deploying it in production.

    Setting up monitoring

    Let’s start with the big picture of how to architect your monitoring system, then we’ll dive into the technical details. There are a few key decisions you need to make when setting up monitoring:

    • Sensitivity. You need to find the right balance between missing important anomalies (false negatives) and getting bombarded with false alerts 100 times a day (false positives). We’ll discuss the levers you have to adjust this later on.
    • Dimensions. The segments you choose to monitor also affect your sensitivity. If there’s a problem in a small segment (like a specific browser or country), your system is more likely to catch it if you’re monitoring that segment’s metrics directly. But here’s the catch: the more segments you monitor, the more false positives you’ll have to deal with, so you need to find the sweet spot.
    • Time granularity. If you have plenty of data and can’t afford delays, it might be worth monitoring minute-by-minute data. If you don’t have enough data, you can aggregate it into 5–15 minute buckets and monitor those instead. Either way, it’s always a good idea to have higher-level daily, weekly, or monthly monitoring alongside your real-time monitoring to keep an eye on longer-term trends.
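    As a quick sketch of the aggregation option from the last bullet, minute-level data can be rolled up into 5-minute buckets with pandas `resample` (the data and column name below are synthetic, invented for illustration):

```python
import numpy as np
import pandas as pd

# one day of synthetic minute-level trip counts
idx = pd.date_range("2025-07-01", periods=24 * 60, freq="min")
rides = pd.DataFrame(
    {"trips": np.random.default_rng(42).poisson(100, size=len(idx))},
    index=idx,
)

# aggregate into 5-minute buckets: same total volume, fewer and less noisy points to monitor
rides_5m = rides.resample("5min").sum()

print(len(rides), len(rides_5m))  # 1440 minute rows become 288 five-minute buckets
```

    The same idea extends to daily or weekly rollups for the longer-term monitoring mentioned above.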

    However, monitoring isn’t just about the technical solution. It’s also about the processes you have in place:

    • You need someone who’s responsible for monitoring and responding to alerts. We used to handle this with an on-call rotation in my team, where each week one person would be in charge of reviewing all the alerts.
    • Beyond automated monitoring, it’s worth doing some manual checks too. You can set up TV dashboards in the office, or at the very least, have a process where someone (like the on-call person) reviews the metrics once a day or week.
    • You need to establish feedback loops. When you’re reviewing alerts and looking back at incidents you might have missed, take the time to fine-tune your monitoring system’s settings.
    • The value of a change log (a record of all changes affecting your KPIs) can’t be overstated. It helps you and your team always have context on what happened to your KPIs and when. Plus, it gives you a valuable dataset for evaluating the real impact of changes to your monitoring system (like figuring out what share of past anomalies a new setup would actually catch).
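    To make the change-log idea concrete, here is one possible sketch (the entries, windows, and column names are all invented) of excluding logged anomaly periods from the data used to compute statistics:

```python
import pandas as pd

# hypothetical change log: known incidents / demand spikes with their time windows
change_log = pd.DataFrame([
    {"start": "2025-07-14 18:00", "end": "2025-07-14 23:59", "note": "demand spike (concert)"},
    {"start": "2025-06-02 09:00", "end": "2025-06-02 11:00", "note": "app outage"},
])
change_log[["start", "end"]] = change_log[["start", "end"]].apply(pd.to_datetime)

def drop_logged_anomalies(df, ts_col="pickup_datetime"):
    """Remove rows that fall inside any logged anomaly window."""
    mask = pd.Series(False, index=df.index)
    for _, row in change_log.iterrows():
        mask |= df[ts_col].between(row["start"], row["end"])
    return df[~mask]

data = pd.DataFrame({"pickup_datetime": pd.to_datetime(
    ["2025-07-14 19:30", "2025-07-07 19:30", "2025-06-02 10:00"])})
clean = drop_logged_anomalies(data)
print(len(clean))  # only the 2025-07-07 row survives
```

    The same filter can be reused later as a set of labelled test cases when you evaluate new monitoring configurations.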

    Now that we’ve covered the high-level picture, let’s dig into the technical details of how to actually detect anomalies in time series data.

    Frameworks for monitoring 

    There are many out-of-the-box frameworks you can use for monitoring. I’d break them down into two main groups.

    The first group involves making a forecast with confidence intervals. Here are some options:

    • You can use statsmodels and its classical implementations of ARIMA-like models for time series forecasting.
    • Another option that typically works quite well out of the box is Prophet by Meta. It’s a simple additive model that returns uncertainty intervals.
    • There’s also GluonTS, a deep learning-based forecasting framework from AWS.

    The second group focuses on anomaly detection, and here are some popular libraries:

    • PyOD: the most popular Python outlier/anomaly detection toolbox, with 50+ algorithms (including time series and deep learning methods).
    • ADTK (Anomaly Detection Toolkit): built for unsupervised/rule-based time series anomaly detection, with easy integration into pandas dataframes.
    • Merlion: combines forecasting and anomaly detection for time series using both classical and ML approaches.

    I’ve only mentioned a few examples here; there are many more libraries out there. You can certainly try them with your data and see how they perform. However, I want to share a much simpler approach to monitoring that I usually start with. Even though it’s simple enough to implement with a single SQL query, it works surprisingly well in many cases. Another significant advantage of this simplicity is that you can implement it in virtually any tool, whereas deploying more complex ML approaches can be tricky in some systems.

    Statistical approach to monitoring

    The core idea behind monitoring is simple: use historical data to build a confidence interval (CI) and detect when current metrics fall outside the expected behaviour. We estimate this confidence interval using the mean and standard deviation of past data. It’s just basic statistics.

    \[
    \text{Confidence Interval} = \left(\text{mean} - \text{coef}_1 \times \text{std},\; \text{mean} + \text{coef}_2 \times \text{std}\right)
    \]

    Image by author

    However, the effectiveness of this approach depends on several key parameters, and the choices you make here will significantly impact the accuracy of your alerts.

    The first decision is how to define the data sample used to calculate your statistics. Typically, we compare the current metric to the same time period on previous days. This involves two main components:

    • Time window: I usually take a window of ±10–30 minutes around the current timestamp to account for short-term fluctuations.
    • Historical days: I prefer using the same weekday over the past 3–5 weeks. This accounts for weekly seasonality, which is usually present in business data. However, depending on your seasonality patterns, you might choose a different approach (for example, splitting days into two groups: weekdays and weekends).

    Another important parameter is the coefficient used to set the width of the confidence interval. I usually use three standard deviations, since that covers 99.7% of observations for distributions close to normal.
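    The 99.7% figure is easy to verify empirically: for an approximately normal metric, almost all points fall within mean ± 3·std.

```python
import numpy as np

rng = np.random.default_rng(0)
sample = rng.normal(loc=100, scale=10, size=1_000_000)

# fraction of points inside the three-standard-deviation band
mean, std = sample.mean(), sample.std()
inside = (sample > mean - 3 * std) & (sample < mean + 3 * std)
print(f"{inside.mean():.4f}")  # close to the theoretical 0.9973
```

    Real business metrics are rarely exactly normal, which is one reason the coefficient is worth tuning rather than fixing at 3.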

    As you can see, there are several choices to make, and there’s no one-size-fits-all answer. The most reliable way to determine the optimal settings is to experiment with different configurations on your own data and pick the one that delivers the best performance for your use case. So this is a perfect moment to put the approach into action and see how it performs on real data.

    Example: monitoring the number of taxi rides

    To check this out, we’ll use the popular NYC Taxi Data dataset (). I loaded data from May to July 2025 and focused on rides related to high-volume for-hire vehicles. Since we have hundreds of trips every minute, we can use minute-by-minute data for monitoring.

    Image by author

    Building the first version

    So, let’s try our approach and build confidence intervals based on real data. I started with a default set of key parameters:

    • A time window of ±15 minutes around the current timestamp,
    • Data from the current day plus the same weekday from the previous three weeks,
    • A confidence band defined as ±3 standard deviations.

    Now, let’s create a couple of functions with the business logic to calculate the confidence interval and check whether our value falls outside of it.

    import pandas as pd
    import tqdm

    # assumes df is already loaded with columns 'pickup_datetime' and 'values'
    # returns the dataset of historic data
    def get_distribution_for_ci(param, ts, n_weeks=3, n_mins=15): 
      tmp_df = df[['pickup_datetime', param]].rename(columns={param: 'value', 'pickup_datetime': 'dt'})
      
      tmp = [] 
      for n in range(n_weeks + 1):
        lower_bound = (pd.to_datetime(ts) - pd.Timedelta(weeks=n, minutes=n_mins)).strftime('%Y-%m-%d %H:%M:%S')
        upper_bound = (pd.to_datetime(ts) - pd.Timedelta(weeks=n, minutes=-n_mins)).strftime('%Y-%m-%d %H:%M:%S')
        tmp.append(tmp_df[(tmp_df.dt >= lower_bound) & (tmp_df.dt <= upper_bound)])
    
      base_df = pd.concat(tmp)
      base_df = base_df[base_df.dt < ts]
      return base_df
    
    # calculates mean and std needed to calculate confidence intervals
    def get_ci_statistics(param, ts, n_weeks=3, n_mins=15):
      base_df = get_distribution_for_ci(param, ts, n_weeks, n_mins)
      std = base_df.value.std()
      mean = base_df.value.mean()
      return mean, std
    
    # iterating through all the timestamps in historic data
    ci_tmp = []
    for ts in tqdm.tqdm(df.pickup_datetime):
      ci = get_ci_statistics('values', ts, n_weeks=3, n_mins=15)
      ci_tmp.append(
        {
            'pickup_datetime': ts,
            'mean': ci[0],
            'std': ci[1],
        }
      )
    
    ci_df = df[['pickup_datetime', 'values']].copy()
    ci_df = ci_df.merge(pd.DataFrame(ci_tmp), how='left', on='pickup_datetime')
    
    # defining CI
    ci_df['ci_lower'] = ci_df['mean'] - 3 * ci_df['std']
    ci_df['ci_upper'] = ci_df['mean'] + 3 * ci_df['std']
    
    # defining whether value is outside of CI
    ci_df['outside_of_ci'] = (ci_df['values'] < ci_df['ci_lower']) | (ci_df['values'] > ci_df['ci_upper'])

    Analysing results

    Let’s look at the results. First, we’re seeing quite a few false positive triggers (one-off points outside the CI that seem to be due to normal variability).

    Image by author

    There are two ways we can adjust our algorithm to account for this:

    • The CI doesn’t need to be symmetric. We might be less concerned about increases in the number of trips, so we could use a higher coefficient for the upper bound (for example, use 5 instead of 3).
    • The data is quite volatile, so there will be occasional anomalies where a single point falls outside the confidence interval. To reduce such false positive alerts, we can use more robust logic and only trigger an alert when multiple points are outside the CI (for example, at least 4 out of the last 5 points, or 8 out of 10).
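    The "k out of the last n points" rule from the second bullet boils down to a rolling sum over a boolean flag. A minimal sketch with a hand-made flag series:

```python
import pandas as pd

# per-minute flags: True when the value at that minute is outside the CI
outside = pd.Series([False, False, True, True, False, True, True, True, False, False])

K, N = 4, 5  # alert only when at least 4 of the last 5 points are outside the CI
alert = outside.astype(int).rolling(N).sum() >= K

print(alert.tolist())  # alerts fire only at positions 6 and 7
```

    Isolated one-off excursions never accumulate 4 hits in a 5-point window, so they no longer trigger alerts.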

    However, there’s another potential problem with our current CIs. As you can see, there are quite a few cases where the CI is excessively wide. This looks off and could reduce the sensitivity of our monitoring.

    Let’s look at one example to understand why this happens. The distribution we’re using to estimate the CI at this point is bimodal, which leads to a higher standard deviation and a wider CI. That’s because the number of trips on the evening of July 14th is significantly higher than in other weeks.

    Image by author
    Image by author

    So we’ve encountered an anomaly in the past that’s affecting our confidence intervals. There are two ways to address this issue:

    • If we’re doing constant monitoring, we know there was anomalously high demand on July 14th, and we can exclude these periods when constructing our CIs. This approach requires some discipline to track these anomalies, but it pays off with more accurate results.
    • However, there’s always a quick-and-dirty approach too: we can simply drop or cap outliers when constructing the CI.

    Improving the accuracy

    So after the first iteration, we identified several potential improvements for our monitoring approach:

    • Use a higher coefficient for the upper bound since we care less about increases. I used 6 standard deviations instead of 3.
    • Deal with outliers to filter out past anomalies. I experimented with removing or capping the top 10–20% of outliers and found that capping at 20% alongside increasing the period to 5 weeks worked best in practice.
    • Raise an alert only when 4 out of the last 5 points are outside the CI to reduce the number of false positive alerts caused by normal volatility.

    Let’s see how this looks in code. We’ve updated the logic in get_ci_statistics to account for different strategies for handling outliers.

    def get_ci_statistics(param, ts, n_weeks=3, n_mins=15, show_vis = False, filter_outliers_strategy = 'none', 
                       filter_outliers_perc = None):
      assert filter_outliers_strategy in ['none', 'clip', 'remove'], "filter_outliers_strategy must be one of 'none', 'clip', 'remove'"
      base_df = get_distribution_for_ci(param, ts, n_weeks, n_mins)
      if filter_outliers_strategy != 'none': 
        p_upper = base_df.value.quantile(1 - filter_outliers_perc)
        p_lower = base_df.value.quantile(filter_outliers_perc)
        if filter_outliers_strategy == 'clip':
          base_df['value'] = base_df['value'].clip(lower=p_lower, upper=p_upper)
        if filter_outliers_strategy == 'remove':
          base_df = base_df[(base_df.value >= p_lower) & (base_df.value <= p_upper)]
      std = base_df.value.std()
      mean = base_df.value.mean()
      return mean, std

    We also need to update the way we define the outside_of_ci parameter.

    anomalies = []  # collect timestamps flagged as anomalous
    for ts in tqdm.tqdm(ci_df.pickup_datetime):
      tmp_df = ci_df[(ci_df.pickup_datetime <= ts)].tail(5).copy()
      tmp_df = tmp_df[~tmp_df.ci_lower.isna() & ~tmp_df.ci_upper.isna()]
      if tmp_df.shape[0] < 5: 
        continue
      tmp_df['outside_of_ci'] = (tmp_df['values'] < tmp_df['ci_lower']) | (tmp_df['values'] > tmp_df['ci_upper'])
      if tmp_df.outside_of_ci.map(int).sum() >= 4:
        anomalies.append(ts) 
    
    ci_df['outside_of_ci'] = ci_df.pickup_datetime.isin(anomalies)

    We can see that the CI is now significantly narrower (no more anomalously wide CIs), and we’re also getting far fewer alerts since we increased the upper bound coefficient.

    Image by author

    Let’s investigate the two alerts we found. These two alerts from the last 2 weeks look plausible when we compare the traffic to previous weeks.

    Image by author

    Practical tip: This chart also reminds us that ideally we should account for public holidays and either exclude them or treat them as weekends when calculating the CI.

    Image by author

    So our new monitoring approach makes total sense. However, there’s a drawback: by only looking for cases where 4 out of 5 minutes fall outside the CI, we’re delaying alerts in situations where everything is completely broken. To address this problem, you can actually use two CIs:

    • Doomsday CI: A broad confidence interval where even a single point falling outside means it’s time to panic.
    • Incident CI: The one we built earlier, where we might wait 5–10 minutes before triggering an alert, since the drop in the metric isn’t as critical.
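    Here is a minimal sketch of the two-tier check. The coefficients and the k-out-of-n rule mirror the choices above, but the specific numbers are illustrative, and for simplicity this version uses a symmetric band rather than the asymmetric one discussed earlier:

```python
def check_alerts(values, mean, std, doomsday_coef=6, incident_coef=3, k=4, n=5):
    """values: the n most recent observations, oldest first."""
    latest = values[-1]
    # doomsday CI: a single point outside the wide band is enough to page someone
    if abs(latest - mean) > doomsday_coef * std:
        return "doomsday"
    # incident CI: require k of the last n points outside the narrower band
    outside = [abs(v - mean) > incident_coef * std for v in values[-n:]]
    if sum(outside) >= k:
        return "incident"
    return None

# usage: with mean=100, std=5 the doomsday band is 70..130, the incident band 85..115
print(check_alerts([99, 101, 100, 98, 60], mean=100, std=5))   # doomsday
print(check_alerts([80, 82, 81, 100, 83], mean=100, std=5))    # incident
print(check_alerts([99, 101, 100, 98, 102], mean=100, std=5))  # None
```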

    Let’s define 2 CIs for our case.

    Image by author

    It’s a balanced approach that gives us the best of both worlds: we can react quickly when something is completely broken while still keeping false positives under control. With that, we’ve achieved a good result and we’re ready to move on.

    Testing our monitoring on anomalies

    We’ve confirmed that our approach works well for business-as-usual cases. However, it’s also worth doing some stress testing by simulating anomalies we want to catch and checking how the monitoring performs. In practice, it’s worth testing against previously known anomalies to see how it would handle real-world examples.

    In our case, we don’t have a change log of previous anomalies, so I simulated a 20% drop in the number of trips, and our approach caught it immediately.

    Image by author

    These kinds of step changes can be tricky in real life. Imagine we lost one of our partners, and that lower level becomes the new normal for the metric. In that case, it’s worth adjusting our monitoring as well. If it’s possible to recalculate the historical metric based on the current state (for example, by filtering out the lost partner), that would be ideal since it would bring the monitoring back to normal. If that’s not feasible, we can either adjust the historical data (say, subtract 20% of traffic as our estimate of the change) or drop all data from before the change and use only the new data to construct the CI.

    Image by author

    Let’s look at another tricky real-world example: gradual decay. If your metric is slowly dropping day after day, it likely won’t be caught by our real-time monitoring since the CI will be shifting along with it. To catch situations like this, it’s worth having less granular monitoring (like daily, weekly, or even monthly).
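    To see why coarser granularity helps, consider a synthetic metric decaying by about 1% per day: each daily step stays well inside a real-time CI, but comparing the latest week against an earlier baseline makes the drift obvious. (The numbers below are made up purely for illustration.)

```python
import numpy as np

# four weeks of synthetic daily totals, decaying ~1% per day
daily = [10_000 * (0.99 ** d) for d in range(28)]

baseline = np.mean(daily[:21])  # first three weeks as the baseline
current = np.mean(daily[-7:])   # most recent week
drop = 1 - current / baseline

print(f"{drop:.1%}")  # roughly a 13% cumulative drop, invisible day over day
```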

    Image by author

    You can find the full code on GitHub.

    Operational challenges

    We’ve covered the mathematics behind alerting and monitoring systems. However, there are several other nuances you’ll likely encounter once you start deploying your system in production, so I’d like to cover them before wrapping up.

    Lagging data. We don’t face this problem in our example since we’re working with historical data, but in real life you need to deal with data lags. It usually takes some time for data to reach your data warehouse, so you need a way to distinguish between cases where the data simply hasn’t arrived yet and actual incidents affecting the customer experience. The most straightforward approach is to look at historical data, determine the typical lag, and filter out the last 5–10 data points.
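    The lag filter described above can be a one-liner once the typical delay is known; the 15-minute figure below is an assumed value you would measure from your own pipeline:

```python
import pandas as pd

TYPICAL_LAG = pd.Timedelta(minutes=15)  # measured from historical arrival times (assumed here)

def drop_laggy_tail(df, now, ts_col="pickup_datetime"):
    """Exclude the most recent points that may not have fully arrived yet."""
    return df[df[ts_col] <= now - TYPICAL_LAG]

data = pd.DataFrame(
    {"pickup_datetime": pd.date_range("2025-07-14 17:00", periods=60, freq="min")}
)
fresh = drop_laggy_tail(data, now=pd.Timestamp("2025-07-14 17:59"))
print(len(fresh))  # 45 of the 60 minutes remain; the incomplete tail is ignored
```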

    Different sensitivity for different segments. You’ll likely want to monitor not just the main KPI (the number of trips), but also break it down by several segments (like partners, regions, and so on). Adding more segments is always useful since it helps you spot smaller changes in specific segments (for example, a problem in Manhattan). However, as mentioned above, there’s a downside: more segments mean more false positive alerts to deal with. To keep this under control, you can use different sensitivity levels for different segments (say, 3 standard deviations for the main KPI and 5 for segments).

    Smarter alerting system. Also, when you’re monitoring many segments, it’s worth making your alerting a bit smarter. Say you have monitoring for the main KPI and 99 segments. Now imagine a global outage where the number of trips drops everywhere. Within the next 5 minutes, you’ll (hopefully) get 100 notifications that something is broken. That’s not a great experience. To avoid this situation, I’d build logic to filter out redundant notifications. For example:

    • If we sent the same notification within the last 3 hours, don’t fire another alert.
    • If there’s a notification about a drop in the main KPI plus more than 3 segments, only alert about the main KPI change.
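    The two rules above can be sketched as a small filter in front of the notifier (the alert keys, cooldown, and thresholds are all illustrative):

```python
from datetime import datetime, timedelta

last_sent = {}  # alert key -> time it was last sent

def should_notify(alerts, now, cooldown=timedelta(hours=3), max_segments=3):
    """alerts: list of alert keys fired this cycle, e.g. ['main_kpi', 'manhattan']."""
    # rule 2: a main-KPI alert plus many segment alerts suggests a global issue
    segments = [a for a in alerts if a != "main_kpi"]
    if "main_kpi" in alerts and len(segments) > max_segments:
        alerts = ["main_kpi"]
    # rule 1: don't resend the same alert within the cooldown window
    to_send = [a for a in alerts if now - last_sent.get(a, datetime.min) >= cooldown]
    for a in to_send:
        last_sent[a] = now
    return to_send

now = datetime(2025, 7, 14, 18, 0)
print(should_notify(["main_kpi", "manhattan", "brooklyn", "queens", "bronx"], now))
# a global outage collapses into a single main-KPI notification
print(should_notify(["main_kpi"], now + timedelta(hours=1)))
# empty: still within the cooldown window
```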

    Overall, alert fatigue is real, so it’s worth minimising the noise.

    And that’s it! We’ve covered the entire alerting and monitoring topic, and hopefully you’re now fully equipped to set up your own system.

    Summary

    We’ve covered a lot of ground on alerting and monitoring. Let me wrap up with a step-by-step guide on how to start monitoring your KPIs.

    • The first step is to gather a change log of past anomalies. You can use it both as a set of test cases for your system and to filter out anomalous periods when calculating CIs.
    • Next, build a prototype and run it on historical data. I’d start with the highest-level KPI, try out several possible configurations, and see how well each one catches previous anomalies and whether it generates a lot of false alerts. At this point, you should have a viable solution.
    • Then try it out in production, since that’s where you’ll have to deal with data lags and see how the monitoring actually performs in practice. Run it for 2–4 weeks and tweak the parameters to make sure it’s working as expected.
    • After that, share the monitoring with your colleagues and start expanding the scope to include other segments. Don’t forget to keep adding all anomalies to the change log, and establish feedback loops to improve your system continuously.

    And that’s it! Now you can rest easy knowing that automation is keeping an eye on your KPIs (but still check in on them from time to time, just in case).

    Thanks for reading. I hope this article was insightful. Remember Einstein’s advice: “The important thing is not to stop questioning. Curiosity has its own reason for existing.” May your curiosity lead you to your next great insight.


