mainstream. We first saw them in language, then vision, and now also in video and speech. The recipe is by now familiar: first, pretrain a huge neural net on a large enough dataset, then apply the model to downstream tasks without any per-task adaptation.
For many industrial applications, time series is an essential modality. We frequently need to do forecasting, anomaly detection, and classification using different kinds of recorded data. The current practice is usually to build a dedicated model for the one specific problem at hand. That can work, but it involves quite a bit of “reinventing the wheel” and may deliver suboptimal performance if the dataset for the current problem is small.
Naturally, we’d like to ask: can we apply the same recipe here, that is, pretrain a large time-series foundation model and use it for any downstream task, out of the box?
That’s the bet behind time series foundation models, or TSFMs.
In fact, a lot of work has already gone down this path, and we now see a zoo of such models, to name just a few: TimesFM from Google, MOIRAI from Salesforce, Lag-Llama, TimeGPT, and the Chronos family from AWS.
In this post, we look at Chronos-2 [1], the latest model in the Chronos line, released in October 2025. We’ll walk through five questions one might ask when encountering the model for the first time:
- What’s a time series foundation model, and how does it change the analytics workflow?
- Why would a foundation model even work for time series?
- What’s Chronos-2, specifically?
- What new things can we actually do with Chronos-2?
- Where does zero-shot stop being enough?
For question 4, we’ll get hands-on with a case study on a synthetic building electricity demand dataset.
1. What’s a time series foundation model, and how does it change the analytics workflow?
As implied by its name, a TSFM is a single neural network pretrained on a large, diverse collection of time series. Its promise is the same as that of LLMs for text: instead of training a fresh model every time a new forecasting problem comes up, you load one pretrained model and ask it to forecast.
That’s a big shift in the workflow.
Let’s say we’d like to do week-ahead energy demand forecasts for buildings. If we follow the traditional workflow, we’d start with preparing the data, followed by choosing forecasting models (think of ARIMA, gradient-boosted trees, LSTM, TCN, N-BEATS), and then spend most of the project time on training, hyperparameter tuning, and validation. The output is a model that (hopefully) solves this one problem on this one dataset.
Six months later, a new forecasting task arrives, and the cycle restarts almost from scratch.
Now with a TSFM, most of what I described above is compressed into a single inference call. The workflow becomes: take the historical series (and, if available, related covariates; we’ll discuss that later), feed it to the pretrained TSFM, set the desired forecast horizon, run inference, and get back a forecast.
What’s also nice is that you won’t just get a point forecast; you usually also get predictive quantiles that quantify the uncertainty.
So what does this imply?
Well, the first thing is that the cost of just trying out a forecast drops a lot. If it works, great. If not, you’ve learned something useful in just ten minutes.
Then, cold-start is no longer a big issue. In the past, you might have had to stop a project simply because “we don’t really have enough data yet.” With a pretrained model, that “little data” may already be sufficient to deliver something meaningful. The model has already seen a lot of demand-, traffic-, and sensor-like patterns. It brings prior knowledge that your tiny dataset can’t fully represent.
Finally, who can do this changes too. It used to take an ML expert to do proper forecasting. A TSFM, of course, doesn’t make any of that knowledge obsolete, but it does mean a domain expert with some Python knowledge can get a credible forecast without years of ML background.
None of this is free, though. You’re now depending on somebody else’s model. Inference gets more expensive. For some domains, zero-shot probably won’t be good enough. And careful evaluation and validation become even more important.
2. Why would a foundation model even work for time series?
“I don’t believe in TSFMs. This shouldn’t really work.”
That’s what I hear most often from my colleagues, and that skepticism makes sense. Language is bounded and has a finite vocabulary. “Apple” means roughly the same thing in a novel or a grocery list.
Numbers aren’t like that.
Numbers are continuous, and their meaning can vary widely across contexts. A “100” in retail demand means something very different from a “100” in a heart-rate trace.
So why should we hope a pretrained model can work across different contexts?
Well, the model isn’t really learning your specific data; it’s learning shapes such as cycles, trends, level shifts, and recurring spikes, and those shapes recur across time series from various domains. The shapes are the “vocabulary” here, and there are far fewer of them than there are possible numeric values. A model that has seen enough of them at enough scales and frequencies can hopefully recognize them in your series, even though it has never seen or been trained on your series before.
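To make the “shapes, not numbers” point concrete, here is a minimal sketch. It is only an illustration and not Chronos-2’s actual preprocessing: the two series, their units, and the `standardize` helper are assumptions made up for this example. Both series share one daily cycle at very different levels and amplitudes, and after standardization they become indistinguishable.

```python
import numpy as np

def standardize(series):
    """Rescale a series to zero mean and unit variance."""
    series = np.asarray(series, dtype=float)
    return (series - series.mean()) / series.std()

t = np.arange(7 * 24)                      # one week of hourly steps
daily_cycle = np.sin(2 * np.pi * t / 24)   # the shared "shape"

retail_demand = 100 + 40 * daily_cycle     # hypothetical units sold per hour
heart_rate = 70 + 8 * daily_cycle          # hypothetical beats per minute

# After standardization the two series coincide up to floating-point error.
max_gap = np.max(np.abs(standardize(retail_demand) - standardize(heart_rate)))
print(f"max difference after scaling: {max_gap:.2e}")
```

Real series are of course noisier than this, but the idea carries over: once scale is factored out, what remains is a shape the model may have seen many times before.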
Empirically, we have concrete numbers to support this: Chronos-2 currently holds the leading position in zero-shot accuracy across several benchmarks. In addition, recent work shows that Chronos-2 actually beats classical statistical baselines and specialized deep learning architectures, especially at longer horizons, with no task-specific tuning [2].
Of course, that doesn’t mean it always works. Some domains really are unlike anything in the pretraining mix. We’ll come back to that in question 5. But something to keep in mind: TSFM zero-shot is now the baseline to beat, not the other way around.
3. What’s Chronos-2, specifically?
In this section, we briefly discuss important aspects of Chronos-2: how it’s used at inference, how it’s built, what it was trained on, and some practical specs. For more detailed technical discussions, please refer to the original paper [1].
3.1 How is it used at inference?
Practitioners usually care the most about how to actually use the model. So let’s start with that.
Chronos-2 offers different usage patterns for forecasting. The good news is: you don’t really need to pick different configurations for different forecasting tasks. Instead, you organize the inputs to let the model know what to do.
The key mechanism is the “group ID”.
In the Chronos-2 framework, a “group” is a concept used to represent relatedness. Every time series fed into Chronos-2 belongs to a “group”, identified by an ID. Based on how you assign these IDs at inference time, you get the following four patterns:
- Univariate forecasting: This is when you want to forecast one single target. You assign each series its own group ID, and Chronos-2 will simply treat each series independently.
- Multivariate forecasting: This is when you want to forecast multiple targets at the same time because they might move together or you want their predictions to be mutually informed. To achieve that, you need to give one shared group ID to all related series.
- Covariate-informed forecasting: This is when you have additional series that influence your target. These extra series are commonly called covariates, and they can be known in the past, and sometimes also known into the future. To perform covariate-informed forecasting, you need to assign the target and its covariates to the same group ID, with the target identified as the series to forecast and the others provided as known context.
- Cross-learning: This is when you have many related series and want the forecast for one series to benefit from patterns shown in the other series. This scenario is different from covariate-informed forecasting, because every series in the group is itself a target. They’re peers that inform each other, not auxiliary inputs like covariates. This scenario is also semantically different from multivariate forecasting. In multivariate forecasting, the series in a group are usually different target dimensions of the same problem (e.g., different load components). In cross-learning, the grouped series are peers (e.g., total loads from different buildings). Nonetheless, the underlying mechanism is the same: you assign all related series the same group ID, so group attention can let information flow across series.
So, same model, same configuration. The only thing that changes is how the inputs are organized.
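As a conceptual sketch of what a group ID means (this is not the library’s internal code, and `group_mask` is a hypothetical helper introduced only for illustration), we can view the IDs as defining which series are allowed to exchange information:

```python
import numpy as np

def group_mask(group_ids):
    """Boolean matrix: entry (i, j) is True when series i and j share a group ID."""
    ids = np.asarray(group_ids)
    return ids[:, None] == ids[None, :]

# Univariate: every series has its own ID, so each series only "sees" itself.
univariate = group_mask([0, 1, 2])

# Multivariate, covariate-informed, or cross-learning: one shared ID per group.
# Here series 0-2 form one group and series 3 stands alone.
grouped = group_mask([0, 0, 0, 1])

print(univariate.astype(int))
print(grouped.astype(int))
```

The univariate case yields an identity matrix, while the grouped case yields a block structure. The three grouped patterns differ only in which members of a block are forecast and which are provided as context.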
3.2 How was it built?
Chronos-2 is an encoder-only Transformer with 120M parameters, which is quite small if you judge by today’s LLM standards. A couple of design choices are worth highlighting here:
1. Chronos-2 uses continuous patch embeddings, not discrete vocabulary tokens.
If you know about the original Chronos, you may remember that it encodes a time series by first scaling its values and then quantizing them into a fixed set of bins. These bins are naturally treated as “tokens.”
Chronos-2, however, drops this approach.
It groups consecutive observations into a “patch” and embeds the whole patch directly as a continuous vector. If you are familiar with Vision Transformers (ViT), you’ll immediately see the similarities. As a result, the model processes far fewer items per series, which leads to faster inference, and it avoids the precision loss caused by quantization.
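The patching step can be sketched in a few lines of NumPy. This is a simplified illustration under stated assumptions: the patch length of 16 and the choice to drop leftover values (rather than pad) are made up for this example and are not taken from the Chronos-2 implementation, and the learned embedding layer that would follow is omitted.

```python
import numpy as np

def to_patches(series, patch_length):
    """Split a 1-D series into non-overlapping patches, dropping the oldest
    leftover values so the length divides evenly."""
    series = np.asarray(series, dtype=float)
    n_patches = len(series) // patch_length
    trimmed = series[len(series) - n_patches * patch_length:]
    return trimmed.reshape(n_patches, patch_length)

hourly = np.arange(1080, dtype=float)           # 45 days of hourly values
patches = to_patches(hourly, patch_length=16)   # patch length is an assumption
print(patches.shape)                            # (67, 16)
```

Instead of 1080 individual values, the encoder would see 67 items for this series, which is where the speed-up comes from.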
2. Inside the encoder, Chronos-2 has two kinds of attention, and it alternates between them.
Chronos-2 has time attention and group attention mechanisms.
Time attention, as implied by its name, lets each series attend to its own past, thus capturing the temporal structure.
Group attention works across series. Specifically, at each time position, all series sharing a group ID can attend to one another.
Concretely, if you think of the input as a matrix where the rows are the time series and the columns are the time steps, then time attention happens within a single row, whereas group attention happens within a single column.
The two attention mechanisms alternate layer by layer. As a result, information eventually flows both temporally within each series and across the group.
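The row-versus-column picture can be sketched with plain dot-product attention. This is a toy version, assuming a single head, no learned projections, and all four series belonging to one group; the real model is considerably richer. The only point is which axis the attention runs over.

```python
import numpy as np

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def self_attention(x):
    """Plain dot-product self-attention over the second-to-last axis.
    x has shape (..., tokens, dim); no learned projections, for illustration."""
    scores = x @ np.swapaxes(x, -1, -2) / np.sqrt(x.shape[-1])
    return softmax(scores, axis=-1) @ x

rng = np.random.default_rng(0)
x = rng.normal(size=(4, 10, 8))    # (series, time patches, embedding dim)

# Time attention: each series (row) attends over its own time positions.
time_out = self_attention(x)

# Group attention: at each time position (column), series attend to each other.
# Swap axes so that "series" becomes the token axis, then swap back.
group_out = np.swapaxes(self_attention(np.swapaxes(x, 0, 1)), 0, 1)

print(time_out.shape, group_out.shape)
```

Both outputs keep the input shape, so the two operations can be stacked one after the other, which is what lets information spread along both axes.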
3. For the output, Chronos-2 employs a direct quantile regression head instead of autoregressive token generation.
In Chronos-2, the prediction is not done in an autoregressive fashion. Instead, a regression head outputs all 21 quantile-level predictions for all steps in the forecast horizon in a single shot.
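Quantile regression heads are typically trained with the quantile (pinball) loss. The sketch below shows that loss in NumPy. The evenly spaced grid of 21 levels is an assumption for illustration; the model’s actual grid may differ, and `pinball_loss` is a hypothetical helper, not a function from the Chronos package.

```python
import numpy as np

def pinball_loss(y_true, y_pred, quantile_levels):
    """Average quantile (pinball) loss.
    y_true: (horizon,), y_pred: (n_quantiles, horizon)."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    q = np.asarray(quantile_levels, dtype=float)[:, None]
    error = y_true[None, :] - y_pred
    return float(np.mean(np.maximum(q * error, (q - 1.0) * error)))

# An assumed, evenly spaced grid of 21 levels; the model's actual grid may differ.
levels = np.linspace(0.01, 0.99, 21)

y_true = np.array([10.0, 12.0, 11.0])
perfect = np.tile(y_true, (len(levels), 1))   # every quantile equals the truth
shifted = perfect + 1.0                       # every quantile over-forecasts by 1

print(pinball_loss(y_true, perfect, levels))
print(pinball_loss(y_true, shifted, levels))
```

The loss penalizes over- and under-prediction differently at each level, which is what pushes each output toward its own quantile rather than toward the mean.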
Put together, these choices make Chronos-2 both fast and probabilistic by default.
3.3 What was it trained on?
For a time-series foundation model, training data is a crucial ingredient.
Because real-world corpora are usually scarce, Chronos-2 relies heavily on a large amount of synthetic data, which comes in two tracks:
For univariate forecasting, synthetic series are generated from three generators: Gaussian-process curves (KernelSynth, inherited from Chronos-1), random mixtures of trend/seasonality/irregularity (TSI), and series sampled from random temporal causal graphs (TCM).
For multivariate (and covariate-informed) settings, all the training data is synthetic. The Chronos-2 team developed a tool called “multivariatizers”, which takes multiple univariate series from the generators above and imposes dependencies between them, such as same-time correlations or time-shifted ones (lead-lag, cointegration).
In fact, the most striking finding from the paper is that synthetic data alone is almost enough: a variant trained only on synthetic data performed only slightly worse than the final model.
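To give a feel for what “imposing a dependency” between univariate series can look like, here is a deliberately simple lead-lag sketch. It is not the paper’s multivariatizer: `impose_lead_lag`, the white-noise inputs, and the blending weights are all assumptions chosen for illustration.

```python
import numpy as np

def impose_lead_lag(leader, follower, lag, weight):
    """Blend a lagged copy of `leader` into `follower` (lag must be positive).
    The first `lag` steps of the follower are left unchanged."""
    leader = np.asarray(leader, dtype=float)
    mixed = np.asarray(follower, dtype=float).copy()
    mixed[lag:] = (1 - weight) * mixed[lag:] + weight * leader[:-lag]
    return mixed

rng = np.random.default_rng(42)
n = 500
leader = rng.normal(size=n)      # stand-ins for two independent univariate series
follower = rng.normal(size=n)

coupled = impose_lead_lag(leader, follower, lag=3, weight=0.8)

# Correlation between the leader and the follower shifted back by the lag.
before = np.corrcoef(leader[:-3], follower[3:])[0, 1]
after = np.corrcoef(leader[:-3], coupled[3:])[0, 1]
print(f"lagged correlation before: {before:.2f}, after: {after:.2f}")
```

Two initially unrelated series end up with a strong time-shifted relationship, which is the kind of structure a model needs to see during training in order to exploit it at inference time.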
3.4 Practical specs
Finally, a few practical specs worth knowing if you want to use the model:
- Its maximum context length is 8192 steps.
- Its maximum forecast horizon is 1024 steps (well over a year of daily data, or about six weeks of hourly data).
- Its license is Apache-2.0.
- Both CPU and GPU inference are supported.
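A quick bit of arithmetic shows what these step limits mean in calendar time at a few sampling frequencies. The limits are the ones listed above; the 15-minute frequency is added purely as an extra example.

```python
MAX_CONTEXT_STEPS = 8192
MAX_HORIZON_STEPS = 1024

# How many steps one day contains at each sampling frequency.
steps_per_day = {"daily": 1, "hourly": 24, "15-minute": 96}

for name, per_day in steps_per_day.items():
    context_days = MAX_CONTEXT_STEPS / per_day
    horizon_days = MAX_HORIZON_STEPS / per_day
    print(f"{name:>9}: context {context_days:7.1f} days, horizon {horizon_days:6.1f} days")

# The case study in this post uses 45 days of hourly context and a one-week horizon.
case_context_steps = 45 * 24   # 1080
case_horizon_steps = 7 * 24    # 168
fits = case_context_steps <= MAX_CONTEXT_STEPS and case_horizon_steps <= MAX_HORIZON_STEPS
print("case study fits within limits:", fits)
```

At hourly resolution, the horizon limit works out to roughly 42.7 days, and the context limit to a little under a year.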
That’s Chronos-2 in theory.
4. What new things can we actually do with Chronos-2?
In this section, let’s get hands-on and do some actual forecasting with Chronos-2.
Here, we consider a small case study: a synthetic building electricity-demand forecasting problem. Specifically, we want hourly electricity demand forecasts one week ahead. For most buildings, we have 45 days of recent recorded data. For one newly onboarded building, only a few days are available. This lets us test a cold-start setting later.
For this case study, we use physically simulated data. The main target is the total demand, which is the sum of base load, plug load, lighting load, and HVAC load. Physically, the plug and lighting loads follow weekday occupancy patterns, and the HVAC load responds to outdoor temperature and the building’s thermal dynamics. For details of the data simulation and generation, please refer to the notebook linked at the end of this post.

4.1 Setting up the Chronos-2 model
On the tooling side, we’ll use the Chronos-2 model through the chronos-forecasting Python package. We’ll need PyTorch, pandas, and the usual scientific Python stack:
pip install chronos-forecasting pandas numpy matplotlib
The model weights themselves are hosted on Hugging Face under amazon/chronos-2. You can instantiate the pipeline with:
from chronos import Chronos2Pipeline
pipeline = Chronos2Pipeline.from_pretrained("amazon/chronos-2", device_map="cuda") # or device_map="cpu"
The first from_pretrained call downloads the weights into your local Hugging Face cache (~/.cache/huggingface/), while subsequent calls load from disk. The model takes about 478 MB on disk.
Chronos-2 can run on CPU, but GPU inference is usually preferred.
I’m using my personal laptop with an NVIDIA RTX 2000 Ada (8GB VRAM). For this notebook’s workload (hourly data, 45-day context, 168-hour horizon, 8 buildings), the univariate forecast (8 series) completes in ~0.07s, the multivariate forecast (8 buildings × 4 targets = 32 series) takes ~0.22s, and the covariate-informed forecast takes ~0.27s. Peak GPU memory stays under 1GB throughout. A production workload at a very different scale (say, more series, or a longer context or forecast horizon) would have a very different throughput.
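Since these timings depend heavily on hardware, it may be worth measuring them on your own machine. Below is a minimal, generic timing sketch; `time_call` is a hypothetical helper, and the workload shown is only a stand-in. In practice, you would wrap your own forecasting call in the lambda.

```python
import time

def time_call(fn, repeats=5):
    """Run fn() once as warm-up, then return the best wall-clock time in seconds."""
    fn()
    timings = []
    for _ in range(repeats):
        start = time.perf_counter()
        fn()
        timings.append(time.perf_counter() - start)
    return min(timings)

# Stand-in workload; replace the lambda body with your own forecasting call.
elapsed = time_call(lambda: sum(i * i for i in range(100_000)))
print(f"best of 5: {elapsed:.4f}s")
```

The warm-up run matters because the first call often includes one-off costs such as loading weights or initializing the GPU.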
4.2 Univariate forecasting
Can Chronos-2 forecast building demand zero-shot?
The first thing we want to try is the simplest setup possible: we hand Chronos-2 each building’s recent demand history and ask for a week-ahead forecast. That’s it, nothing fancy.
We first generate the synthetic data:
full_df = make_dataset()
This is how the data looks:
          building            timestamp  ...  solar_irradiance  is_weekend
0      Building 01  2025-03-01 00:00:00  ...               0.0           1
1      Building 01  2025-03-01 01:00:00  ...               0.0           1
2      Building 01  2025-03-01 02:00:00  ...               0.0           1
3      Building 01  2025-03-01 03:00:00  ...               0.0           1
4      Building 01  2025-03-01 04:00:00  ...               0.0           1
...            ...                  ...  ...               ...         ...
32635  Building 08  2025-08-17 19:00:00  ...               0.0           1
32636  Building 08  2025-08-17 20:00:00  ...               0.0           1
32637  Building 08  2025-08-17 21:00:00  ...               0.0           1
32638  Building 08  2025-08-17 22:00:00  ...               0.0           1
32639  Building 08  2025-08-17 23:00:00  ...               0.0           1
[32640 rows x 11 columns]
This dataset contains the following columns:
['building', 'timestamp', 'total_load_kw', 'hvac_load_kw', 'plug_load_kw',
'lighting_load_kw', 'indoor_temp_c', 'outdoor_temp_c', 'occupancy',
'solar_irradiance', 'is_weekend']
The building column is the group ID column; total_load_kw is the target column that we aim to forecast.
Then, we can prepare the historical context and make the forecast with the predict_df API:
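import pandas as pd

# Dates inferred from the outputs shown in this post: the forecast starts on
# 2025-07-14 and the context covers the 45 days before it.
cutoff_date = pd.Timestamp("2025-07-14")
context_start_date = cutoff_date - pd.Timedelta(days=45)

history_df = full_df[
    (full_df["timestamp"] >= context_start_date)
    & (full_df["timestamp"] < cutoff_date)
].copy()
context_univariate = history_df[["building", "timestamp", "total_load_kw"]]
pred_univariate = pipeline.predict_df(
    context_univariate,
    prediction_length=168,  # one week of hourly forecasts
    quantile_levels=[0.025, 0.5, 0.975],  # central 95% prediction interval
    id_column="building",
    timestamp_column="timestamp",
    target="total_load_kw",
)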
Here is what pred_univariate looks like:
         building            timestamp  ...         0.5       0.975
0     Building 01  2025-07-14 00:00:00  ...  175.027161  194.386108
1     Building 01  2025-07-14 01:00:00  ...  177.673050  198.921997
2     Building 01  2025-07-14 02:00:00  ...  175.633270  199.574677
3     Building 01  2025-07-14 03:00:00  ...  167.960052  192.789505
4     Building 01  2025-07-14 04:00:00  ...  154.674759  178.479599
...           ...                  ...  ...         ...         ...
1339  Building 08  2025-07-20 19:00:00  ...  103.391228  164.987076
1340  Building 08  2025-07-20 20:00:00  ...  116.543739  185.808151
1341  Building 08  2025-07-20 21:00:00  ...  135.177704  202.919937
1342  Building 08  2025-07-20 22:00:00  ...  150.139679  216.866089
1343  Building 08  2025-07-20 23:00:00  ...  160.572784  219.451172
[1344 rows x 7 columns]
with the following columns:
['building', 'timestamp', 'target_name', 'predictions', '0.025', '0.5', '0.975']
The median forecast is stored in predictions, and the requested 2.5% and 97.5% quantiles for each (building, hour) are in the 0.025 and 0.975 columns, respectively.
To visualize the results, we pick Building 03 as an example:

We can see that, without any fine-tuning, the forecast captures both the daily occupancy cycle and the weekday/weekend rhythm well, purely from the 45-day context window. The figure also shows the 95% prediction interval, which largely covers the ground truth.
Note that what we just did above is effectively batch forecasting for all eight buildings. Across these buildings, zero-shot Chronos-2 produces a weighted absolute percentage error (WAPE) of 8.6%. This certainly won’t be the best achievable performance for this specific dataset, but it is a credible result for very little effort.
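For reference, here is a minimal sketch of the two evaluation quantities mentioned above: WAPE and the empirical coverage of a prediction interval. The helper names and the toy numbers are assumptions for illustration and are not taken from the case study.

```python
import numpy as np

def wape(y_true, y_pred):
    """Weighted absolute percentage error: sum of absolute errors over sum of absolute actuals."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    return float(np.abs(y_true - y_pred).sum() / np.abs(y_true).sum())

def interval_coverage(y_true, lower, upper):
    """Fraction of actual values that fall inside [lower, upper]."""
    y_true = np.asarray(y_true, dtype=float)
    inside = (y_true >= np.asarray(lower)) & (y_true <= np.asarray(upper))
    return float(inside.mean())

# Toy numbers, not taken from the case study.
actual = np.array([100.0, 120.0, 80.0, 100.0])
median = np.array([110.0, 116.0, 80.0, 90.0])
lower = median - 8.0
upper = median + 8.0

print(f"WAPE: {wape(actual, median):.1%}")                          # 6.0%
print(f"coverage: {interval_coverage(actual, lower, upper):.0%}")   # 50%
```

Because WAPE normalizes by the total actual demand, large buildings weigh more than small ones, which is usually what you want when the forecasts feed into energy procurement or capacity planning.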
4.3 Multivariate forecasting
Can Chronos-2 forecast multiple targets simultaneously?
Next, we use Chronos-2 to forecast the individual components of demand, i.e., HVAC, plug, and lighting. In our current setup, these components share underlying driving factors, and they’re correlated. A model that treats them as a system can hopefully leverage these correlations to deliver more accurate predictions.
The code is almost the same as the univariate version; only the target argument changes from a string to a list. Also, all four targets of a building share one group ID, so the model can attend across them at each time position.
target_columns = ["total_load_kw", "hvac_load_kw", "plug_load_kw", "lighting_load_kw"]
context_multivariate = history_df[["building", "timestamp"] + target_columns]
pred_multivariate = pipeline.predict_df(
    context_multivariate,
    prediction_length=168,
    quantile_levels=[0.025, 0.5, 0.975],
    id_column="building",
    timestamp_column="timestamp",
    target=target_columns,  # now a list
)
This is what the resulting pred_multivariate looks like:
         building            timestamp  ...         0.5       0.975
0     Building 01  2025-07-14 00:00:00  ...  170.219849  185.517609
1     Building 01  2025-07-14 01:00:00  ...  175.033524  191.951599
2     Building 01  2025-07-14 02:00:00  ...  175.106644  193.513306
3     Building 01  2025-07-14 03:00:00  ...  169.450287  189.806625
4     Building 01  2025-07-14 04:00:00  ...  159.008575  177.918198
...           ...                  ...  ...         ...         ...
5371  Building 08  2025-07-20 19:00:00  ...   19.697739   24.007296
5372  Building 08  2025-07-20 20:00:00  ...   19.775898   23.811647
5373  Building 08  2025-07-20 21:00:00  ...   19.995640   24.352007
5374  Building 08  2025-07-20 22:00:00  ...   19.610260   23.614372
5375  Building 08  2025-07-20 23:00:00  ...   19.025314   22.950117
[5376 rows x 7 columns]
with the following columns:
['building', 'timestamp', 'target_name', 'predictions', '0.025', '0.5', '0.975']
The important difference is target_name, which now takes four values:
['total_load_kw', 'hvac_load_kw', 'plug_load_kw', 'lighting_load_kw']
We have 1344 rows for each of the targets, which is why the total row count for pred_multivariate is 5376.
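Because the output is in long format (one row per building, timestamp, and target), it is often convenient to reshape it so that each target gets its own column. The sketch below does this on a small toy frame that mimics the column layout described above; the numeric values are made up.

```python
import pandas as pd

# Toy long-format frame with the same column layout as the prediction output.
toy_pred = pd.DataFrame({
    "building": ["Building 01"] * 4,
    "timestamp": pd.to_datetime(["2025-07-14 00:00", "2025-07-14 01:00"] * 2),
    "target_name": ["total_load_kw"] * 2 + ["hvac_load_kw"] * 2,
    "predictions": [170.2, 175.0, 60.1, 62.3],   # made-up values
})

# One column per target, one row per (building, timestamp).
wide = toy_pred.pivot(
    index=["building", "timestamp"], columns="target_name", values="predictions"
)
print(wide)

# Row-count arithmetic from the text: 8 buildings x 168 hours x 4 targets.
total_rows = 8 * 168 * 4
print(total_rows)   # 5376
```

The wide layout makes it easy to plot the components side by side or to compare each one against its ground truth.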
Next, we check the Building 03 results again. In the figure below, each panel shows the forecast for one component:

In our synthetic case, the plug and lighting loads are driven largely by routines that follow the weekday occupancy schedule, and the model picks them up easily from the 45-day context. HVAC load is more variable because it’s driven by outdoor temperature dynamics that the model has to infer from the demand pattern alone (keep in mind that it doesn’t see temperature explicitly yet; we’ll fix that later). As a result, we see some clear discrepancies in the HVAC load predictions.
Component-wise, we have a WAPE of 15.4% for HVAC load, 4.6% for lighting load, 2.2% for plug load, and 5.4% for the total load.
It’s actually quite interesting to see that the total-load WAPE in the multivariate setup is also lower than the univariate baseline of 8.6% from the previous section. A plausible explanation is that, in the multivariate case, the model leverages correlation patterns between the different components to better infer what the total load will look like.
From a practical perspective, it is also nice that a single predict_df call returns forecasts for the whole load breakdown with consistent treatment. This can be very useful for many downstream operations, as the operator now knows not just how much demand to expect but also where it’s coming from. This can inform the design of effective HVAC scheduling, lighting controls, and peak-shaving strategies.
4.4 Covariate-informed forecasting
Can Chronos-2 use known future weather and operating schedules?
Many real-world forecasting problems come with information about the future that we already know. For our building demand problem, we know the future weather (in practice, a weather forecast) and the operating schedule. Therefore, we should hand them to Chronos-2 and let them inform its predictions.
This is Chronos-2’s covariate-informed mode from section 3.1. Here, the target and covariates share the same group ID, but only the target gets predicted. The covariates must be supplied for both the historical window, so the model can learn their relationship to demand, and the forecast horizon, so it can condition on their known future values.

The figure above shows the known-future signals (i.e., outdoor temperature, occupancy schedule, and solar irradiance) we’ll condition on for Building 03. Additionally, is_weekend is included as a categorical covariate.
In a real deployment, these would come from a weather service and a building management system. Here, we produce them using the same simulator that generated the demand history.
The code requires two changes from the univariate case: the historical context now includes the covariate columns (alongside the target), and a future_df argument holds the covariate values for the forecast horizon.
future_truth_df = full_df[
(full_df["timestamp"] >= cutoff_date)
& (full_df["timestamp"] < cutoff_date + pd.Timedelta(hours=168))
].copy()
future_covariates_df = future_truth_df[
["building", "timestamp", "outdoor_temp_c", "occupancy", "solar_irradiance", "is_weekend"]
].copy()
known_future_columns = ["outdoor_temp_c", "occupancy", "solar_irradiance", "is_weekend"]
context_with_covariates = history_df[["building", "timestamp", "total_load_kw"] + known_future_columns]
One concrete row from context_with_covariates:
   building   timestamp  total_load_kw  outdoor_temp_c  occupancy  solar_irradiance  is_weekend
Building 01  2025-05-30     190.671989       30.232122   0.000553               0.0           0
One concrete row from future_covariates_df:
   building   timestamp  outdoor_temp_c  occupancy  solar_irradiance  is_weekend
Building 01  2025-07-14       30.669441   0.000553               0.0           0
Notice that we supply both dataframes to the API:
pred_with_covariates = pipeline.predict_df(
    context_with_covariates,
    future_df=future_covariates_df,  # covariate values for the forecast horizon
    prediction_length=168,
    quantile_levels=[0.025, 0.5, 0.975],
    id_column="building",
    timestamp_column="timestamp",
    target="total_load_kw",  # still a single string: one target
)
First few rows of pred_with_covariates:
   building            timestamp    target_name  predictions       0.025         0.5       0.975
Building 01  2025-07-14 00:00:00  total_load_kw   169.025482  156.235291  169.025482  183.735748
Building 01  2025-07-14 01:00:00  total_load_kw   174.691132  161.001785  174.691132  190.913406
Building 01  2025-07-14 02:00:00  total_load_kw   174.087845  158.777023  174.087845  191.526764
Building 01  2025-07-14 03:00:00  total_load_kw   170.675781  155.457169  170.675781  188.504807
Building 01  2025-07-14 04:00:00  total_load_kw   160.536011  145.712753  160.536011  177.670593
Building 01  2025-07-14 05:00:00  total_load_kw   154.637436  140.687515  154.637436  170.052383
Building 01  2025-07-14 06:00:00  total_load_kw   155.766968  141.554367  155.766968  170.922150
Building 01  2025-07-14 07:00:00  total_load_kw   167.152252  152.965271  167.152252  182.419205
The figure below shows the new prediction results for Building 03:

We can see clear improvements in forecasting accuracy when Chronos-2 has access to the informative covariates. Across all eight buildings, the WAPE drops to 4%, a big improvement over the uninformed 8.6%.
The practical takeaway is this: when reliable information about the future is available, hand it to the model.
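One practical caveat is that the future covariates need to cover every series and every step of the horizon. A small sanity check along the following lines can catch gaps before calling the model. This is a sketch under assumptions: `missing_future_rows` is a hypothetical helper, and the toy frame uses made-up values.

```python
import pandas as pd

def missing_future_rows(future_df, ids, start, periods, step=pd.Timedelta(hours=1),
                        id_column="building", timestamp_column="timestamp"):
    """Return the (id, timestamp) pairs expected in the horizon but absent from future_df."""
    horizon = pd.date_range(start=start, periods=periods, freq=step)
    expected = pd.MultiIndex.from_product([ids, horizon], names=[id_column, timestamp_column])
    present = pd.MultiIndex.from_frame(future_df[[id_column, timestamp_column]])
    return expected.difference(present)

# Toy example: a 3-hour horizon for two buildings, with one row missing.
toy_future = pd.DataFrame({
    "building": ["Building 01"] * 3 + ["Building 02"] * 2,
    "timestamp": pd.to_datetime(
        ["2025-07-14 00:00", "2025-07-14 01:00", "2025-07-14 02:00",
         "2025-07-14 00:00", "2025-07-14 01:00"]
    ),
    "outdoor_temp_c": [30.7, 30.1, 29.8, 30.7, 30.1],   # made-up values
})

gaps = missing_future_rows(
    toy_future, ids=["Building 01", "Building 02"], start="2025-07-14", periods=3
)
print(list(gaps))
```

Here the check reports the one missing row for Building 02 at 02:00. In a real deployment, the same check would run against whatever the weather service and building management system actually delivered.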
4.5 Cross-learning
Can related buildings help a newly metered building?
The final scenario we investigate here is the one that a TSFM is best positioned to handle in principle: a cold start.
Imagine Building 06 has just been connected to the monitoring platform, so we have only three days of meter history for it. Three days are usually not enough to fit a reasonable building-specific model the traditional way. But now that we have a TSFM, a natural question is: can the other seven buildings, each with 45 days of history, help forecast Building 06?
This is Chronos-2’s cross-learning mode that we discussed in section 3.1. Implementation-wise, all eight buildings should share one group ID. The model doesn’t need to be told how the buildings are related; it picks up usable patterns through group attention across the series. Also, in this study, we deliberately drop future covariates, so no future weather or schedule is passed in. This way, we know that any improvement has to come from peer histories alone.
We build two dataframes:
short_building = "Building 06"
short_history_start = cutoff_date - pd.Timedelta(days=3)
context_univariate = history_df[["building", "timestamp", "total_load_kw"]].copy()
# Full 45-day history for the seven peers, last three days only for the new building
cold_context = pd.concat(
    [
        context_univariate[context_univariate["building"] != short_building],
        context_univariate[
            (context_univariate["building"] == short_building)
            & (context_univariate["timestamp"] >= short_history_start)
        ],
    ],
    ignore_index=True,
).sort_values(["building", "timestamp"])
# Isolated baseline: the new building's three days on their own
cold_context_new_only = cold_context[cold_context["building"].eq(short_building)].copy()
Here are the first few rows of cold_context:
   building            timestamp  total_load_kw
Building 01  2025-05-30 00:00:00     190.671989
Building 01  2025-05-30 01:00:00     181.611690
Building 01  2025-05-30 02:00:00     177.875806
Building 01  2025-05-30 03:00:00     166.297421
Building 01  2025-05-30 04:00:00     154.846159
Building 01  2025-05-30 05:00:00     151.078626
Building 01  2025-05-30 06:00:00     157.557114
Building 01  2025-05-30 07:00:00     155.563899
With the following row counts per building:
building
Building 01    1080
Building 02    1080
Building 03    1080
Building 04    1080
Building 05    1080
Building 06      72
Building 07    1080
Building 08    1080
Here are the first few rows of cold_context_new_only:
   building            timestamp  total_load_kw
Building 06  2025-07-11 00:00:00     101.528455
Building 06  2025-07-11 01:00:00     117.270784
Building 06  2025-07-11 02:00:00     111.178600
Building 06  2025-07-11 03:00:00     110.586007
Building 06  2025-07-11 04:00:00      98.715046
Building 06  2025-07-11 05:00:00     100.550960
Building 06  2025-07-11 06:00:00     114.863499
Building 06  2025-07-11 07:00:00     125.766400
We run the following A/B test:
# Isolated: only Building 06's 3-day history
pred_isolated = pipeline.predict_df(
    cold_context_new_only,
    prediction_length=168,
    quantile_levels=[0.025, 0.5, 0.975],
    id_column="building",
    timestamp_column="timestamp",
    target="total_load_kw",
    cross_learning=False,
)
# Cross-learning: Building 06's 3-day history + 7 siblings' 45-day histories
pred_cross = pipeline.predict_df(
    cold_context,  # includes all 8 buildings
    prediction_length=168,
    quantile_levels=[0.025, 0.5, 0.975],
    id_column="building",
    timestamp_column="timestamp",
    target="total_load_kw",
    cross_learning=True,  # attend across the group
)
The results are shown below:

The lower panel shows the two forecasts together with the ground truth. The “isolated forecast” is the result when Chronos-2 only uses the three days of data as context. We can see that it manages to capture the daily cycle to some extent, but misses the weekly rhythm and underestimates the peaks. The cross-learning version, on the other hand, effectively learns to pull the weekday/weekend shape and peak magnitude from the other buildings, thus yielding better demand predictions. In terms of WAPE, the error drops from 22.2% with the isolated strategy to 16.7% with the cross-learning strategy.
Note that cross-learning here does not mean the model peeked at other buildings’ futures, because only histories are in the model’s context. What the model is doing is in-context learning: it sees the seven other buildings in this portfolio, cross-checks them against the kind of patterns Building 06’s three days show, and finally projects forward accordingly.
5. Where does zero-shot stop being enough?
Before getting too excited about the new Chronos-2 capabilities we just saw in the case study, we should always keep this in mind: zero-shot is a good default; it isn’t a universal answer.
So, where does zero-shot stop being enough? I believe the following four signals are important to watch for:
- Your data looks unlike anything in the pretraining mix. For example, specialized scientific signals or niche sensor types.
- You have lots of clean history that isn’t being used. Chronos-2 pays no attention to anything beyond the context window. If that history exists and contains patterns Chronos-2 hasn’t seen, you probably need fine-tuning to explicitly encode it.
- You see systematic errors the model keeps making. For issues like that, no amount of context engineering is likely to fix it. You need targeted adaptation to bridge the gap.
- You need behavior the zero-shot objective doesn’t optimize for. If your downstream cost is asymmetric, e.g., under-forecasting demand costs you ten times what over-forecasting does, fine-tuning with your specific loss function may be the way to go.
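To make the last point concrete, here is a minimal sketch of an asymmetric cost in which under-forecasting is ten times as expensive as over-forecasting. The function name and the weights are assumptions for illustration, not a loss used by Chronos-2.

```python
import numpy as np

def asymmetric_cost(y_true, y_pred, under_weight=10.0, over_weight=1.0):
    """Mean absolute error where under-forecasts are penalized more than over-forecasts."""
    error = np.asarray(y_true, dtype=float) - np.asarray(y_pred, dtype=float)
    cost = np.where(error > 0, under_weight * error, -over_weight * error)
    return float(cost.mean())

actual = np.array([100.0, 100.0])
print(asymmetric_cost(actual, actual - 5.0))   # under-forecast by 5 -> 50.0
print(asymmetric_cost(actual, actual + 5.0))   # over-forecast by 5  -> 5.0
```

A cost of this form is proportional to a pinball loss at the quantile level under_weight / (under_weight + over_weight), here 10/11 ≈ 0.91. Evaluating forecasts under your true cost, rather than only under a symmetric metric, is a useful way to tell whether the zero-shot output is already good enough or whether adaptation is warranted.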
This is where Part 2 picks up! In the next post, we’ll discuss how to fine-tune Chronos-2.
You can find the full notebook here: https://github.com/ShuaiGuo16/chronos-2-forecasting/blob/main/01_chronos2_zero_shot_building_demand_demo.ipynb
References
[1] Chronos-2: From Univariate to Universal Forecasting, arXiv, 2025.
[2] Time Series Foundation Models as Strong Baselines in Transportation Forecasting: A Large-Scale Benchmark Analysis, arXiv, 2026.

