TL;DR

The authors created a live benchmark that continuously tests time‑series forecasting models on daily GitHub event streams instead of relying on the usual one‑time static train‑test split. Under this rolling‑window protocol, many advanced forecasting models perform roughly 12% worse in mean absolute error and their errors become more erratic. The result suggests that static benchmarks overstate performance and underscores the need for evaluations that reflect the shifting nature of real‑world data, which in turn can guide the development of more robust forecasting systems.

Impermanent demonstrates that static train‑test splits in time‑series forecasting routinely inflate reported accuracy, masking the true temporal fragility of foundation models.

Garza, Rosillo, Mendoza‑Smith and colleagues, in a preprint posted to arXiv, set out to build a live benchmark that mirrors the open‑world dynamics of real data streams. They selected the top 400 GitHub repositories by star count and converted four event streams (issues opened, pull requests opened, push events, and new stargazers) into daily time series. Instead of fixing a single train‑test split, the authors evaluate models over a rolling window, scoring each day’s forecast against the next day’s actual observations. The evaluation pipeline runs continuously, automatically ingests new events as they arrive, and publishes the results on a public dashboard (https://impermanent.timecopilot.dev). In this way, Impermanent exposes models to the distributional shifts that occur when software projects evolve, contributors change, and platform tooling is updated.
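The rolling protocol described above can be illustrated with a minimal sketch. It is not the authors' pipeline: the function names, the `min_history` parameter, the last-value baseline, and the sample counts are all assumptions made for illustration. The point it shows is that each forecast is produced from data available up to a given day only, and is then scored against the following day's observation.

```python
from statistics import mean
from typing import Callable, List, Sequence


def naive_last_value(history: Sequence[float]) -> float:
    """Hypothetical baseline: predict the next day's count as the latest observed count."""
    return history[-1]


def rolling_one_step_errors(
    series: Sequence[float],
    forecaster: Callable[[Sequence[float]], float],
    min_history: int = 7,
) -> List[float]:
    """Score one-step-ahead forecasts, one day at a time.

    For each day t, the forecaster sees only series[:t] (the past),
    and its prediction is compared with the actual value series[t].
    """
    errors = []
    for t in range(min_history, len(series)):
        history = series[:t]              # nothing from day t onward is visible
        prediction = forecaster(history)
        errors.append(abs(series[t] - prediction))
    return errors


# Made-up daily counts of "issues opened" for a single repository.
daily_issues = [3, 5, 4, 6, 8, 7, 5, 9, 12, 10, 11, 15]

errors = rolling_one_step_errors(daily_issues, naive_last_value)
live_mae = mean(errors)
print(errors, live_mae)
```

In a live setting the series would keep growing as new events arrive, and the loop would simply continue from the last scored day, so no forecast is ever evaluated on data the model could have seen.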

The study’s key finding is that many state‑of‑the‑art foundation models, which are often touted as “broadly generalizable,” perform well on frozen test sets but degrade markedly under Impermanent’s sequential protocol. The authors document that the static splits used in prior benchmarks allow leakage: models can inadvertently train on future data, or have their hyperparameters tuned against test scores, and either path inflates reported performance. In contrast, the live benchmark forces models to generalize to genuinely unseen future observations. Across a suite of models that includes ARIMA, Prophet, and transformer‑based architectures, performance degrades by an average of 12%, as measured by mean absolute error, when moving from a static to a live evaluation. Moreover, the authors observe that the variance of errors increases over time, signaling that foundation models lack the stability required for sustained forecasting.
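The two quantities mentioned above, the relative change in mean absolute error between static and live evaluation and the growth of error variance over time, can be computed as in the following sketch. The helper names, the window size, and all numbers are illustrative assumptions; none of the values come from the paper.

```python
from statistics import mean, pvariance
from typing import List, Sequence


def relative_mae_change(static_errors: Sequence[float], live_errors: Sequence[float]) -> float:
    """Relative change in MAE from static to live evaluation (0.25 means 25% worse)."""
    static_mae = mean(static_errors)
    live_mae = mean(live_errors)
    return (live_mae - static_mae) / static_mae


def windowed_error_variance(errors: Sequence[float], window: int) -> List[float]:
    """Population variance of absolute errors in consecutive, non-overlapping windows."""
    return [
        pvariance(errors[i:i + window])
        for i in range(0, len(errors) - window + 1, window)
    ]


# Synthetic absolute errors, chosen only to make the arithmetic easy to follow.
static_abs_errors = [1.0, 2.0, 3.0, 2.0]   # MAE = 2.0
live_abs_errors = [2.0, 3.0, 1.0, 4.0]     # MAE = 2.5

change = relative_mae_change(static_abs_errors, live_abs_errors)

# Errors that are stable in the first window and more erratic in the second.
error_stream = [1.0, 1.0, 1.0, 1.0, 0.0, 2.0, 0.0, 4.0]
variances = windowed_error_variance(error_stream, window=4)
print(change, variances)
```

A rising sequence of windowed variances is one simple way to make the "errors become more erratic over time" observation measurable; other choices, such as overlapping windows or sample variance, would work equally well.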

This matters because the field has leaned heavily on foundation models and their promise of universal applicability. If evaluation protocols are too permissive, researchers may overestimate a model’s usefulness in production, where data streams are continuous and non‑stationary. Impermanent’s live evaluation framework forces a more honest appraisal of temporal robustness and distributional shift, bringing research benchmarks closer to the conditions of actual deployment. By providing an open‑source, continuously updated dataset and a reproducible leaderboard, the authors lower the barrier for other teams to test their models in a realistic setting, potentially accelerating the discovery of truly resilient forecasting architectures.

Ultimately, Impermanent invites the community to rethink how we certify generalization in time‑series work. Will future foundation models be engineered to maintain performance across shifting regimes, or will they remain brittle when confronted with the inevitable evolution of real data? The live benchmark offers a concrete path forward, but it also raises open questions about the design of training curricula and the role of continual learning in achieving durable forecasting performance.