Statistical Models vs. Machine Learning


1. Introductory Remarks

To be able to succeed in time series forecasting, a repeated past pattern is needed, and that pattern has to continue into the future to some extent. The higher the regularity of this pattern, the higher its expected forecastability. In other words, the higher the disorder is in the history of the time series, the more difficult it is to predict. The disorder of a time series can be formally measured in terms of its entropy (approximate entropy and sample entropy), e.g., see Ref. [1]. The concept of entropy comes from the field of physics. In information theory, the Shannon interpretation of entropy is used to measure the amount of information required to accurately send and receive messages (e.g., see the explanation and gedanken experiment in [2]). In data analysis, the concept of entropy has been used to analyze patterns in medical data, such as heart rate; later, this spread to applications in physiology, finance, climate sciences, and a number of other phenomena. In practice, the entropy in a forecasting situation is seldom computed explicitly, but it is indirectly reflected in the uncertainty of a time series forecast. Uncertainty can be given in terms of a probability interval, a confidence interval, or, in more detail, as a forecast distribution, which attaches probabilities to a range of possible future values. Even though a forecast is often just given by a point forecast, which may be interpreted as the most likely value of a time series variable at a future time, it seems to be increasingly common nowadays to let the point forecast be accompanied by a measure of uncertainty. This, in a sense, can be thought of as reflecting the entropy of the series. It is explained in more detail at the end of Section 5.2. A few more comments on entropy and forecastability are also given in Section 4.4 on forecasting competitions.

The history of time series forecasting goes very far back, at least to the age of antiquity, with more or less accurate predictions of astronomic events and future weather. Presently, forecasts are used virtually everywhere, ranging from, say, economics and opinion polls to long- and short-term weather forecasts. Some forecasts are more or less purely statistical in nature, e.g., demand forecasting for groups of items based entirely on past sales, but other types of forecasts, such as financial predictions, are aided by economic theory, whereas weather forecasting may depend critically on a geophysical theory of the atmosphere.

At present, forecasting is helped by an enormous increase in the availability of data. Ours is truly the age of Big Data. However, how can this help us to make better predictions? Is it easier or more difficult to see patterns in the data that can be exploited in forecasting? In parallel with the increase in data, there has been an increase in the availability and speed of computers. Computers can certainly detect patterns in complex data that humans or simpler analysis techniques cannot detect. There are machine learning (hereafter denoted ML) techniques for handling such patterns, but are they useful for forecasting? Or does the computer see patterns in the noise and, in that way, make misleading forecasts?

I will treat these questions in this paper. It is a review where I seek to include very recent contributions. I emphasize attempts to find an answer and to pinpoint situations where ML does particularly well or where it does not work so well. A frequent criticism of machine learning is that it is a black box method. Nevertheless, can something be said when it comes to finding reasons for it doing well in some situations, e.g., beating simple parametric models, and not so well in other situations, where it is beaten by other types of models? See the discussion in Section 4.3 and at the end of Section 6.

What is the potential for ML methods in the future? Are these methods improving in quality while parametric models are at a standstill? Moreover, how well can purely algorithmic ML methods be integrated into models to a large degree based on physics and non-statistical features, such as in weather forecasting?

In part, I try to give my review a somewhat personal touch. It is colored by my research background, which has not been focused directly on forecasting but has been centered on topics close to the problem of forecasting. Hence, I will discuss nonstationarity and the patterns of nonstationarity that are statistically forecastable, such as cointegration. This has also led to looking at lead-lag relationships and Granger causality, as outlined in [3]. In addition, I will mention the possible potential of local Gaussian models presented in [4,5]. Further, I briefly look at GARCH-like structures, not only in volatility prediction for financial series, but also in integer time series modeling (so-called Poisson autoregression) and in similar recursive schemes for continuous-time data described by stochastic differential equations, leading to the connection with Hawkes processes. There seem to be only limited applications of ML methods in these contexts.
To compare individual forecasting methods and types of methods is a challenging task. What kind of criteria or metrics should be used? A number of forecasting competitions have been arranged, and I report on some of the main ones in Section 4.4. In most of these competitions, forecasting accuracy has been used as a yardstick, but then several measures of accuracy exist, and the ranking of methods may depend on the measure used. However, clearly, there are other key metrics in judging forecasting algorithms, such as training time, interpretability, and robustness against noisy data. I will briefly discuss each of these criteria for the methods treated in the present paper.
Here is an outline of the paper: In Section 2, I mention the Holt-Winters models, which, in many ways, may be said to have started modern applied forecasting. These models are intuitive, provide good results, and are still used. A related, so-called “θ-model” has been doing well in forecasting competitions. Actually, this model was declared the winner of the M3 forecasting competition; see [6]. The well-known ARIMA models, also briefly discussed in Section 2, are related to the Holt-Winters models. Both point forecasts and interval forecasts can be provided in the Gaussian case in particular.
All of these models are linear (and quite often Gaussian). Section 3 is devoted to simple nonlinear parametric models and to nonparametric forecasting.
Section 4 is, in some sense, the main part of the paper. Here, I discuss the various types of ML forecasts, such as those based on neural networks and random forest, which are possibly the two main contenders for simpler statistical models. What are their strengths and weaknesses? How and when can they be strengthened further by combining various ML forecasts with classic statistical models? How well do they do in forecasting competitions (covered in Section 4.4) compared to more traditional methods? Further, parametric models can be analyzed theoretically. How about ML methods?
Section 5 continues the discussion of ML methods. I treat multivariate time series and panels, nonstationary effects, and cointegration as an alternative instrument to differencing and removing deterministic trends. Further, I discuss forecasting distributions, including scores, forecasting count time series via GARCH-type structures, and analogs in continuous time with potential applications in multivariate financial forecasting and contagion. For some of these problems, ML methods seem less well developed than in forecasting stationary one-dimensional series.
Finally, in Section 6, I look at weather forecasting. Recently, ML methods have been introduced at various stages of the forecasting process, sometimes eminently successful but at other times, a failure. What is the future of ML forecasting here? This is discussed in light of the very recent ML methods of GraphCast and GenCast.
I do not give any explicit forecasting examples using my own real data. For a great number of such examples, I refer, instead, to the discussion and references in Section 4.4.

2. Classical Parametric Forecasts

The earliest forecasting technique that is still being used is probably the exponential smoothing technique based on work in [7,8]. Exponential smoothing is an example of a recursive scheme, with such schemes being extensively used in increasingly complex forms in modern nonlinear and ML forecasting. The basic idea goes back to at least Poisson as an extension of a numerical analysis technique from the 17th century, and it was later adopted by the signal processing community in the 1940s.
Let y t be a time series for which predictions y ^ t + 1 at time t + 1 are sought. Then, the simplest form of an exponential smoothing forecast is given by

$$\hat{y}_{t+1} = \alpha y_t + (1 - \alpha)\hat{y}_t.$$

By solving recursively, $\hat{y}_{t+1}$ is expressed as an exponentially weighted average of the previous observations of $y_t$, from which the appellation exponential smoothing is derived. The parameter $\alpha$ ($0 \le \alpha \le 1$) must be estimated, usually using least squares, and the initial value $\hat{y}_0$ must be stipulated by, for example, taking an average of the early values of $y_t$, which are needed for starting the recursive scheme.

Simple exponential smoothing does not work well if there is a trend involved. Including an (instantaneous) trend estimate, $\hat{d}_t$, results in a double exponential smoothing scheme:

$$\hat{y}_{t+1} = \alpha y_t + (1 - \alpha)\hat{y}_t + \hat{d}_t$$

$$\hat{d}_{t+1} = \beta(\hat{y}_{t+1} - \hat{y}_t) + (1 - \beta)\hat{d}_t$$

where $\alpha$ is the data smoothing factor, and $\beta$ ($0 \le \beta \le 1$) is the trend smoothing factor. These factors and the initial conditions can be determined as indicated above for the simple scheme in (1). The inclusion of a trend also makes it easier to forecast several steps ahead.
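To make the recursions concrete, here is a minimal NumPy sketch of the simple and double exponential smoothing schemes (1)-(3). The initialization choices, the smoothing parameters, and the simulated series are illustrative assumptions; in practice, the smoothing factors would be estimated, e.g., by least squares, as described above.

```python
import numpy as np

def simple_exp_smoothing(y, alpha, y0=None):
    """One-step-ahead forecasts from the recursion (1)."""
    yhat = np.empty(len(y) + 1)
    yhat[0] = np.mean(y[:5]) if y0 is None else y0   # stipulated initial value
    for t in range(len(y)):
        yhat[t + 1] = alpha * y[t] + (1 - alpha) * yhat[t]
    return yhat                                       # yhat[-1] is the next-step forecast

def double_exp_smoothing(y, alpha, beta):
    """Forecasts with an instantaneous trend estimate, as in (2) and (3)."""
    yhat = np.empty(len(y) + 1)
    dhat = np.empty(len(y) + 1)
    yhat[0], dhat[0] = y[0], y[1] - y[0]              # simple start-up values
    for t in range(len(y)):
        yhat[t + 1] = alpha * y[t] + (1 - alpha) * yhat[t] + dhat[t]
        dhat[t + 1] = beta * (yhat[t + 1] - yhat[t]) + (1 - beta) * dhat[t]
    return yhat, dhat

rng = np.random.default_rng(0)
y = np.cumsum(rng.normal(0.3, 1.0, size=200))         # a toy series with drift
yhat, dhat = double_exp_smoothing(y, alpha=0.4, beta=0.1)
print("next one-step forecast:", yhat[-1])
```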

With the help of a student (Peter Winters), Holt extended the scheme further to contain seasonal indices, the Holt-Winters forecasts, involving triple exponential smoothing. See also [9]. Exponential smoothing can be extended to multivariate time series; see [10].
Probably the most well-known class of parametric time series models is the class of ARIMA (autoregressive integrated moving average) models. The major breakthrough of this class of models came in the form of a book [11]. The authors introduced, in a time series context, the classical sequence of statistical modeling: identification, estimation, and diagnostic checking. The ARIMA models are so well-known that it seems unnecessary to say much about them here, but a few facts will be mentioned for completeness. The simplest member of this class is the purely recursive autoregressive model (the models are assumed to have a zero mean),

$$y_t = a_1 y_{t-1} + \cdots + a_p y_{t-p} + e_t,$$

where $a_1, \ldots, a_p$ are unknown parameters in a pth-order model, and $e_t$ is a series of uncorrelated (in inference, these are often assumed to be iid (independent identically distributed)) variables. The process is stationary if the roots of the characteristic polynomial $A(z) = 1 - a_1 z - \cdots - a_p z^p$ are located outside the unit circle in the z-plane. The one-step-ahead forecast is given by $\hat{y}_{t+1} = a_1 y_t + \cdots + a_p y_{t-p+1}$, and higher-order forecasts are given by reiterating this equation. The estimation of the parameters is carried out using maximum likelihood or quasi-maximum likelihood, and the asymptotic confidence intervals and asymptotic distribution of the forecasts are easily established under weak regularity conditions. The order is determined by inspecting the so-called partial autocorrelation function or by using an AIC-type criterion.

The class of ARMA(p,q) models is formed by replacing the noise or innovation term $e_t$ with a moving average term, resulting in

$$y_t - a_1 y_{t-1} - \cdots - a_p y_{t-p} = e_t + b_1 e_{t-1} + \cdots + b_q e_{t-q}.$$

It is somewhat more cumbersome to obtain parameter estimates and one-step or multi-step forecasts using maximum likelihood or quasi-maximum likelihood, but standard computer packages are available that also compute forecast confidence intervals.
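As an illustration of such packages, the sketch below fits an ARMA(1,1) model and produces point forecasts with 95% intervals using the statsmodels library (assuming it is installed); the simulated series and the chosen order are arbitrary choices for the example.

```python
import numpy as np
from statsmodels.tsa.arima.model import ARIMA

rng = np.random.default_rng(1)
# Simulate an ARMA(1,1)-like series just to have something to fit.
e = rng.normal(size=300)
y = np.zeros(300)
for t in range(1, 300):
    y[t] = 0.6 * y[t - 1] + e[t] + 0.3 * e[t - 1]

model = ARIMA(y, order=(1, 0, 1))        # ARMA(1,1): p=1, d=0, q=1
fit = model.fit()                         # (quasi-)maximum likelihood estimation
fc = fit.get_forecast(steps=5)            # multi-step forecasts
print(fc.predicted_mean)                  # point forecasts
print(fc.conf_int(alpha=0.05))            # 95% forecast intervals
```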

ARMA models can be further extended to encompass nonstationary models with a linear stochastic or deterministic trend. This is carried out by reducing the series to a stationary series by differencing the observations up to d times, resulting in an ARIMA(p,d,q) model. Often, d = 1 is sufficient. The models can be extended by including a seasonal component that, in itself, can be modeled by an ARIMA-type process. In the vector time series case, the ARIMA is based on the individual differencing of the component series, but presently, this structure is predominantly replaced by a cointegration approach; this will be briefly mentioned at the end of Section 5.3.

Forecasts with uncertainty intervals can be computed using standard statistical packages. The ARIMA models have tended to do well in forecasting competitions. They have been compared to Holt-Winters exponential smoothing models, but it is not correct that the latter are merely a subset of ARIMA models; this is because Holt-Winters models contain instantaneous estimates of trends and seasonal components. These quantities are recursively updated in time, which is not the case for ARIMA models.

The ARIMA models are easily extended to multivariate series $y_t$ by replacing the scalar parameters $a_1, \ldots, a_p$ and $b_1, \ldots, b_q$ with matrices. The identification, estimation, and validation steps are more complex. A useful reference is [12]. There are several program packages for more or less automatically processing multivariate ARIMA models (see, e.g., a Python 3.13 package on forecasting for ARIMA models and more general machine learning forecasts regarding multivariate time series). These models can also be allowed to contain exogenous variables, $x_t$, in addition to the multivariate innovation term $e_t$. These are the so-called ARIMAX models. Again, forecasts with uncertainty intervals can be obtained from program packages.
The winner of the Makridakis M3 [6] forecasting competition was a forecast obtained using the so-called θ (theta) method. The method was introduced in [13]. It is much more completely described in the monograph [14]. It can be seen as a generalization of Holt-Winters-type models. The choice of weight parameters and the number of trend functions (and their nature) has been researched extensively; this is reflected in, e.g., Ref. [15], in a comparison with machine learning forecasts.
In the general literature on theta models, there are two strands. The first one gives probabilistic properties of the method, starting in [16]; subsequently, a number of theoretical results were collected in [14]. The latter provides a complete analysis of the theta method under a unit root-generating process, explaining its success in the M3 Makridakis competition in 2000 [6]. The authors of Ref. [14] also introduced the multivariate theta method and looked at its relationship to cointegration.
The second strand expands and details various implementations of the method, optimizing theta models and pointing out their relationship with state space models (see [17]).
A recent generalization of the theta method is given in [18]. Connections to adaptive learning forecasts were pursued in [19].
The connection of the theta method to the perhaps more well-known class of state space models prompts a short mention of the structure of the latter. In fact, state space models are very powerful and versatile tools for working with time series and their forecasting. They were initially developed by engineers but have been widely adopted and extended in other fields, notably econometrics. Two central textbooks are [20,21]. State space models are formulated by means of two different equations. The first is the state equation, formulated in terms of the states, $\alpha_t$, which are a set of variables that are usually unobserved; typical examples are trends, seasonal components, or time-varying parameters. The state equation describes the dynamic law governing the states at two adjacent points in time. The second is the observation equation, which specifies the relationship between the observed data and the unobserved states. A linear Gaussian version of such a system could look like this:

$$\alpha_{t+1} = T_t \alpha_t + \Gamma_t + R_t \eta_t, \qquad \eta_t \sim N(0, Q_t)$$

$$y_t = E_t \alpha_t + D_t + C_t \varepsilon_t, \qquad \varepsilon_t \sim N(0, H_t)$$

with the initial condition $\alpha_1 \sim N(\alpha_0, P_0)$. In these equations, $\eta_t$ and $\varepsilon_t$ are the state and observational vectors of zero-mean Gaussian noises. The matrices $T_t$, $\Gamma_t$, $R_t$, $Q_t$, $E_t$, $D_t$, $C_t$, and $H_t$ are the so-called (generally time-varying) system matrices, and $\alpha_0$ and $P_0$ are the initial state mean and state covariance matrix, respectively.

Once such a system is fully specified, the core problem is to provide optimal predictions of states and their covariances over time. This can be carried out recursively in time using the well-known Kalman filter.
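A minimal sketch of the Kalman prediction and update recursions for the system (6) and (7), written here for time-invariant system matrices, is given below; the function name, the notation, and the local level usage example are my own illustrative choices, not a full implementation.

```python
import numpy as np

def kalman_filter(y, T, Gamma, R, Q, E, D, C, H, a0, P0):
    """Kalman filter for alpha_{t+1} = T alpha_t + Gamma + R eta_t,
    y_t = E alpha_t + D + C eps_t, with eta ~ N(0, Q), eps ~ N(0, H)."""
    a, P = a0.copy(), P0.copy()
    preds = []
    for yt in y:
        # One-step observation prediction and its covariance
        y_pred = E @ a + D
        F = E @ P @ E.T + C @ H @ C.T
        preds.append((y_pred, F))
        # Update step with the new observation (Kalman gain)
        K = P @ E.T @ np.linalg.inv(F)
        a = a + K @ (yt - y_pred)
        P = P - K @ E @ P
        # Prediction step for the state at the next time point
        a = T @ a + Gamma
        P = T @ P @ T.T + R @ Q @ R.T
    return preds

# Usage: a local level model, alpha_{t+1} = alpha_t + eta_t, y_t = alpha_t + eps_t
I = np.eye(1)
obs = [np.array([x]) for x in np.random.default_rng(2).normal(size=50).cumsum()]
preds = kalman_filter(obs, T=I, Gamma=np.zeros(1), R=I, Q=0.1 * I,
                      E=I, D=np.zeros(1), C=I, H=1.0 * I,
                      a0=np.zeros(1), P0=10.0 * I)
print(preds[-1])   # last one-step prediction and its variance
```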

Given any set of data and a specific model, the system is in most cases not fully specified because the system matrices depend on unknown parameters. These parameters are estimated using maximum likelihood, defined via the prediction error decomposition [20].
A more recent text book is [22]. The forecasting aspect is covered in, e.g., Ref. [23].
Concerning the key metrics mentioned in the Introduction, the classical parametric forecasting methods reported in the present section must be said to be easy to interpret, possibly with the exception of the theta-type models; see [15,24] for a simplification and rephrasing of the model in more understandable terms. The training of the parametric models consists of estimating the parameters. This is very simply carried out for univariate models, whereas high-dimensional ARIMA models and state space models would require more effort. Usually, parametric models would contain a safeguard against overfitting by using an AIC-type criterion or a penalty factor, thus limiting the number of allowed parameters. Classical parametric models have high forecasting accuracy for data generated by linear models, but it is not difficult to find nonlinear models where they fail completely.

3. Nonlinear and Nonparametric Forecasts

Up to now, I have been mostly concerned with linear and often Gaussian models. A claim of the ML methods is that (at least sometimes) they are capable of handling nonlinearities better than classical parametric models. This means that, in a sense, ML methods try to capture more of the probabilistic structure of the data. In comparison, ARIMA models are based solely on statistical moments, e.g., the covariance structure. This is fine if the data are Gaussian, but in the non-Gaussian case, one would expect that information may have been lost.

Before I go over to the main theme of this paper—the construction of ML methods, including their claimed ability to handle nonlinearities in a better way—it is instructive to take a look at classical parametric nonlinear models, including a look at the classical nonparametric methodology.

To make things simple, I concentrate on the first-order scalar model, where only one lag is included. Going from model to forecast is then trivial.

The most popular nonlinear model is probably the threshold model introduced in [25]. In its simplest form, it is given by

$$y_t = \theta_1 y_{t-1} 1\{s_{t-1} \le c\} + \theta_2 y_{t-1} 1\{s_{t-1} > c\} + e_t.$$

The process moves as a mixture of two autoregressive regimes, with autoregressive coefficients $\theta_1$ and $\theta_2$, respectively. The regime is determined by the state process, $s_t$, and a threshold, c. When $s_{t-1} \le c$, the process moves according to $\theta_1$; when $s_{t-1} > c$, according to $\theta_2$. The state process, $s_t$, may be equal to $y_{t-d}$ for a certain delay, d. An inference theory for the threshold process is given in [26].
There are numerous generalizations and applications of the threshold model; see, e.g., Refs. [27,28]. Usually, a threshold process is assumed to be stationary. For a nonstationary cointegration-type threshold process, see [29]. For a recent paper that covers the use of thresholds in neural networks, see [30].
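To make Equation (8) concrete, the following sketch simulates a first-order threshold model with $s_t = y_t$ (delay d = 1) and computes a naive one-step forecast; the parameter values are arbitrary, and in practice $\theta_1$, $\theta_2$, and c would be estimated from the data.

```python
import numpy as np

rng = np.random.default_rng(3)
theta1, theta2, c = 0.7, -0.4, 0.0        # regime coefficients and threshold

# Simulate the threshold model (8) with state s_t = y_t (delay d = 1)
n = 500
y = np.zeros(n)
for t in range(1, n):
    coef = theta1 if y[t - 1] <= c else theta2
    y[t] = coef * y[t - 1] + rng.normal()

# One-step forecast: pick the regime implied by the last observation
coef = theta1 if y[-1] <= c else theta2
print("one-step-ahead forecast:", coef * y[-1])
```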
The threshold model is sometimes criticized, perhaps especially in econometrics, for its abrupt change between regimes. There is a class of models with continuous transition between the θ 1 and θ 2 regimes, the so-called STAR models (smooth transition models). In the LSTAR model, the transition is modeled by a logistic function,

$$y_t = (\theta_1 + \theta_2 G(\gamma, c, s_{t-1}))\, y_{t-1} + e_t$$

with G having the logistic form

$$G(\gamma, c, s_{t-1}) = \left(1 + \exp\{-\gamma(s_{t-1} - c)\}\right)^{-1}, \qquad \gamma > 0.$$

Alternatively, an exponential form, ESTAR, can be used:

$$G(\gamma, c, s_{t-1}) = 1 - \exp\{-\gamma(s_{t-1} - c)^2\}, \qquad \gamma > 0.$$

An earlier version of these models was introduced in [31]. Later, the author of Ref. [32] defined a family of STAR models that included both the LSTAR and ESTAR models and devised a data-driven strategy with the aim of, among other things, helping the user choose between these two alternatives. A number of generalizations of the simple models (10) and (11) exist. An update can be found in [33].

The last class of parametric nonlinear models is the hidden Markov chain autoregressive process. In contrast to the threshold and STAR models, where the state process is typically given by $s_t = y_{t-d}$, a delayed version of the observed process, in the hidden Markov chain autoregressive process $s_t$ is an unobserved Markov chain, so that

$$y_t = s_{t-1} y_{t-1} + e_t,$$

where, in the simplest case, $s_t$ can only take two values, $\theta_1$ and $\theta_2$, say, and where $s_t$ varies between these two values according to a 2 × 2 Markov transition matrix. Both $\theta_1$ and $\theta_2$ and the entries in the probability transition matrix are unknown and have to be estimated. The EM algorithm is often used for this task [34]. Additionally, the mean and variance of the series could be the subject of Markov regime shifts [35]; see also the more recent textbook reference [36].

For all of these classes of processes, it is possible to do asymptotic inference on the parameters, and asymptotic confidence intervals can be secured under certain regularity conditions. In principle, these can be transferred to asymptotic uncertainty intervals for the forecasts. However, it seems more difficult to produce a distributional theory of forecasts, unlike the nonparametric models examined next.

The well-known kernel density estimate for the marginal density, $p(y)$, of a stationary time series, $y_t$, given n observations $y_1, \ldots, y_n$, is given by

$$\hat{p}(y) = n^{-1} \sum_{t=1}^{n} K_h(y_t - y),$$

where $K(\cdot)$ is a kernel function, h is the bandwidth, and $K_h(\cdot) = h^{-1} K(h^{-1}(\cdot))$. Under nonrestrictive conditions, consistency and asymptotic normality can be proved. It is well-known that the optimal (with respect to least squares deviation) predictor $\hat{y}_{t+1}$ is given by the conditional mean of $y_{t+1}$, given the past value. The conditional mean $m(y_{t+1} \mid y)$ of $y_{t+1}$, given $y_t = y$, can be estimated using the Nadaraya-Watson estimator

$$\hat{m}(y_{t+1} \mid y_t = y) = \frac{\sum_{s \le t} y_s K_h(y_{s-1} - y)}{\sum_{s \le t} K_h(y_{s-1} - y)}.$$

This will also be an estimate of the optimal predictor if the past of $y_{t+1}$ can be represented by $y_t$, e.g., in the possibly nonlinear autoregressive AR(1) case $y_{t+1} = g(y_t) + e_t$ for some function, g. In the case of the AR(p) process with $y_{t+1} = g(y_t, \ldots, y_{t-p}) + e_t$, the function g can be estimated, and consequently, $y_{t+1}$ can be forecasted, but one quickly runs into the curse of dimensionality as p increases.
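A sketch of the Nadaraya-Watson one-step forecast above, using a Gaussian kernel, is given below; the bandwidth value and the simulated nonlinear AR(1) series are arbitrary choices for illustration.

```python
import numpy as np

def nw_forecast(y, x, h):
    """Nadaraya-Watson estimate of E(y_{t+1} | y_t = x) with a Gaussian kernel."""
    K = lambda u: np.exp(-0.5 * u ** 2) / np.sqrt(2 * np.pi)
    lagged, target = y[:-1], y[1:]            # pairs (y_{s-1}, y_s)
    w = K((lagged - x) / h) / h               # kernel weights K_h(y_{s-1} - x)
    return np.sum(w * target) / np.sum(w)

rng = np.random.default_rng(4)
e = rng.normal(size=1000)
y = np.zeros(1000)
for t in range(1, 1000):
    y[t] = np.sin(y[t - 1]) + 0.5 * e[t]      # a nonlinear AR(1) series

print("forecast given y_t = %.2f:" % y[-1], nw_forecast(y, y[-1], h=0.3))
```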

To avoid the curse of dimensionality, one can use an additive model, where it is assumed that $g(y_1, \ldots, y_p) = g_1(y_1) + \cdots + g_p(y_p)$, resulting in a model:

$$y_{t+1} = g_1(y_t) + \cdots + g_p(y_{t-p}) + e_t$$

for some component functions, $g_1, \ldots, g_p$. The model can be estimated by so-called backfitting, which is thoroughly explained in [37]. There are extensions to models with mild forms of interactions between the components; see, e.g., Ref. [38]. The relevant variables can be selected by using the lasso routine, as explained in [39].

The conditional mean forecast is optimal in the least squares error sense, but it should be noted that there are alternative measures of error for forecasting, which are, in fact, often used as error measures in comparing forecasts and in forecasting competitions. The most used are the mean absolute error (MAE), which is minimized by the median, and the mean absolute percentage error (MAPE). For a discussion of these and related topics, I refer the reader to [40].
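For concreteness, the two error measures just mentioned can be computed as follows (a small sketch, using the definitions as they are commonly stated).

```python
import numpy as np

def mae(y, yhat):
    """Mean absolute error."""
    return np.mean(np.abs(y - yhat))

def mape(y, yhat):
    """Mean absolute percentage error (the actual values y must be nonzero)."""
    return 100 * np.mean(np.abs((y - yhat) / y))
```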

Nonlinear parametric models certainly have an advantage if the data are nonlinearly generated, but, of course, they lose out to linear models if the data are linearly generated. In general, the danger of overfitting increases with nonlinear models, which can show up as far worse out-of-sample forecasts. The training (estimating of parameters) naturally takes longer due to the more involved estimation algorithms. These may be iterative, and it is not always obvious when the iteration should be stopped. When they are regression-type models, they are usually not difficult to interpret.

The nonparametric forecasting algorithms are based on estimating the conditional mean and are, therefore, easy to interpret. Their scalability is a problem due to the curse of dimensionality, but this can partly be circumvented using simplifications such as additive models.

4. Machine Learning Forecasts

As emphasized in the Introduction, to be able to make forecasts, it is necessary that an observed pattern has features that are preserved in time when moving into the forecasting period. Since we are dealing with time series, this seems to require some kind of stationarity in the series. However, strict or wide-sense stationarity of the series is not an absolute prerequisite. Multivariate time series may be nonstationary; nevertheless, it may be possible to forecast if there is a time-invariant pattern governing the nonstationarity, for example, via cointegration, where there are one or more time-invariant linear combinations of the series, which result in a (possibly) multivariate stationary time series.

For a univariate time series, there could be a linear stochastic trend, which may be removed by differencing, or there could be a possibly nonlinear deterministic trend, which may be removed by a regression device. Moreover, heterogeneity may be removed using a Box-Cox-type transformation [41]. In a local Gaussian approach, further analysis is facilitated by transforming the series into a Gaussian series; see [4].

A Gaussian AR series is the series that is easiest to handle regarding prediction; everything is based on the autocorrelation function, and the optimal least squares forecast, the conditional mean, is linear and can be expressed in terms of the autocorrelation function, which, in turn, gives a complete description of the probability structure in the Gaussian case.

A key aspect of the autoregressive process is its recursive definition, where forecasts can be obtained easily by regressing using the series itself, and forecasts with a more distant time horizon can be built up recursively. The recursive property can also be built up for the ARMA and ARIMA series and, as seen in Section 2, for exponential smoothing models using Equations (2) and (3). The recursive mechanism can also be extended to seasonal forecasts.

For nonparametric and nonlinear forecasts, similar but more complicated recursive mechanisms may, in many cases, be relied on.

Common to all of these mechanisms, and a prerequisite for them to work well, is a training period, where the model is built up by identification, estimation, and diagnostic checking, not only in the ARIMA case but also for the exponential smoothing models. In the training period, the parameters need to be estimated, and this is similar for nonlinear parametric models. The nonparametric method does not contain parameters as such, but the number of lags and the bandwidth need to be determined.

For the machine learning methods dealt with in this section, training is absolutely essential and is, in many ways, the central concept of the methodology; it is not only used for forecasting but, more generally, for finding a pattern that describes the data, a pattern that can be recognized and used in the analysis when new data come in. There are several ways of doing this, but I will concentrate on two methods: neural network analysis and random forest, which seem to have been the most successful machine learning (ML) forecasting methods.

4.1. Forecasting Using Artificial Neural Networks

Neural networks are used for a number of problems in prediction, classification, and clustering. The success and impact of neural network-like approaches in machine learning have recently been recognized by the 2024 Nobel Physics Prize awarded to John Hopfield and Geoffrey Hinton. Currently, there is intense activity in this field, not least in so-called deep neural network modeling. These are networks with several hidden layers. A detailed survey of developments up to 2015 is given in [42].

I start with a single-layer network. Assume a given data vector, $x$, which may be thought of as a section of a time series, $y_1, \ldots, y_t$. In a neural network approach, one is interested in transforming $x$ via linear combinations of its components and a nonlinear activation function, often of sigmoidal form, of the linear combinations. These linear transformations represent what is called a hidden layer.

The first step in constructing the hidden layer is to form linear combinations:

$$h_i = w_0 + \sum_{j=1}^{n} w_{ij} x_j$$

and then possibly subject these terms to an activation function. In the case of just one hidden layer, an output layer in terms of the hidden layer is given by

$$z_i = \sum_{j=1}^{m} w'_{ij} h_j$$

followed by another possibly nonlinear function, depending on the problem. It has been shown, e.g., Ref. [43], that this system of linear combinations and activation functions is able to approximate general classes of functions. In an s-step time series forecasting problem, the output may successively be built up to obtain the forecast $\hat{y}_{t+s}$ at step s from the input set $y_1, \ldots, y_t$. When using a training set, the weights $w_{ij}$ and $w'_{ij}$ are determined by minimizing a loss function involving $(y_{t+s}, \hat{y}_{t+s})$, where $\hat{y}_{t+s}$ is based on the output layer. The error function is evaluated for each of the samples coming in the input window. Subsequently, the gradient of the error function with respect to the weights is evaluated, and the weights are recomputed and updated in the direction of the negative gradient using stochastic gradient descent. The weights $w'_{ij}$ for the output layer are computed first. Then, $w_{ij}$ is adjusted via the chain differentiation rule using so-called backpropagation.
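As a hedged illustration of a single-hidden-layer forecast, the sketch below builds lagged inputs from a series and fits a small feed-forward network with scikit-learn's MLPRegressor (assuming scikit-learn is available); the lag order, hidden-layer size, and simulated series are arbitrary choices.

```python
import numpy as np
from sklearn.neural_network import MLPRegressor

def make_lagged(y, p):
    """Input rows (y_{t-1}, ..., y_{t-p}) and targets y_t."""
    X = np.array([y[t - p:t][::-1] for t in range(p, len(y))])
    return X, y[p:]

rng = np.random.default_rng(5)
e = rng.normal(size=800)
y = np.zeros(800)
for t in range(1, 800):
    y[t] = 0.8 * y[t - 1] - 0.3 * y[t - 1] ** 2 / (1 + y[t - 1] ** 2) + e[t]

p = 3
X, target = make_lagged(y, p)
net = MLPRegressor(hidden_layer_sizes=(16,), max_iter=2000, random_state=0)
net.fit(X[:-1], target[:-1])                  # train on all but the last point
x_last = y[-p:][::-1].reshape(1, -1)          # most recent p values as input
print("one-step-ahead forecast:", net.predict(x_last)[0])
```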

The simple, one-hidden-layer forward scheme has met with considerable success. Applications such as the embedding of graphs and text processing are described in [44], with emphasis on the work carried out in [45,46]. However, it has turned out that in many problems, including forecasting, improved performance can be obtained using deep, multiple-layer networks, where the learning process is called deep learning.
A prime advantage of deep learning is that hidden units can be connected to each other recursively in time. This is taken up among several other sources in the literature in [47]. They consider a somewhat extended multivariate, one-step-ahead forecast framework, where for component i,

$$\hat{y}_{i,t+1} = f(y_{i,t:p},\; x_{i,t:q},\; s_i)$$

where $y_{i,t:p} = \{y_{i,t}, \ldots, y_{i,t-p}\}$ are lagged observations of component i, $x_{i,t:q} = \{x_{i,t}, \ldots, x_{i,t-q}\}$ are exogenous variables of component i over their respective look-back windows of lengths p and q, and $s_i$ represents static meta-data, e.g., the location of a meteorological station, for climate data. The function f is the prediction function learned by the model. It contains weight functions and hidden units of the hidden layers, which, as will be seen below, can be chosen in various ways.

Convolution neural networks (CNNs) and (TCNs): Convolution neural networks (CNNs) [48] were originally designed for spatial data; however, they have been adapted to a time series situation. For an intermediate layer, l, each convolution is a causal filter. In terms of hidden vector units, it takes the form

$$h_t^{\ell} = A\big((w * h)(\ell, t)\big)$$

with the convolution between weights and hidden units given by

$$(w * h)(\ell, t) = \sum_{\tau=0}^{p} w(\ell, \tau)\, h_{t-\tau}^{\ell-1}.$$

In (13), $h_t^{\ell}$ is an intermediate state at layer $\ell$, and w is a fixed matrix of convolution weights at layer $\ell$. The function A is an activation function, e.g., a sigmoidal function.

In a time series context, convolution neural networks are often named temporal convolutional networks (TCNs). The contrast with the spatial CNNs mainly consists of their focus on the data types and the structure of their convolutional layers. In particular, TCNs employ temporal and dilated convolutions to capture more long-range dependencies and complex temporal patterns, whereas CNNs use spatial convolutions to detect local patterns and features in images.

A basic TCN publication is [49], which is concerned with action segmentation in computer vision. The TCN algorithm has been compared to other recurrent networks like RNN and LSTM, which will be treated next. An advantage of TCNs is that, due to the parallel nature of the convolutions, they can train faster and more efficiently than LSTMs. Examples of comparisons of forecasting potential for TCN, RNN, and LSTM are [50,51]. The outcome of these investigations and similar experiments is that TCNs can compete on par with RNN and LSTM, but this somewhat depends on the data type.

A single, causal CNN layer is equivalent to an autoregressive model.

Recurrent neural networks (RNNs): Recurrent neural networks (RNNs) do not require an explicit fixed look-back window, as is the case for CNNs. Given the natural interpretation of time series forecasts as sequences of inputs and targets, many applications for temporal time series prediction have been developed [52,53,54]. At their core, RNN cell units contain an internal memory state, which acts as a compact memory of past information. The memory state is recursively updated from new observations at each time step.
Recurrent neural networks were originally based on work in the 1980s. Since then, a number of different RNN architectures have been developed. Just to give a flavor of these networks, I present the model equations for the three-layer (one hidden layer) Elman network. They are

$$h_t = \sigma_h(W_h x_t + U_h h_{t-1} + b_h),$$

$$z_t = \sigma_z(W_z h_t + b_z)$$

where $x_t$ is an input vector, $h_t$ is a hidden layer vector, $z_t$ is an output vector, W and U are parameter matrices, b is a parameter vector, and $\sigma_h$ and $\sigma_z$ are activation functions. Equations (14) and (15) show the recursive structure of the network.
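The following is a minimal NumPy sketch of one forward pass through the Elman recursions (14) and (15); the weight matrices are random here purely for illustration, whereas in practice they would be trained by backpropagation through time.

```python
import numpy as np

def elman_forward(x_seq, Wh, Uh, bh, Wz, bz):
    """Run the Elman recursions (14)-(15) over an input sequence."""
    sigma_h = np.tanh                          # hidden activation
    sigma_z = lambda v: v                      # linear output for a regression target
    h = np.zeros(Wh.shape[0])
    outputs = []
    for x_t in x_seq:
        h = sigma_h(Wh @ x_t + Uh @ h + bh)    # Equation (14)
        outputs.append(sigma_z(Wz @ h + bz))   # Equation (15)
    return np.array(outputs)

rng = np.random.default_rng(6)
d_in, d_hidden, d_out = 1, 8, 1
Wh = 0.1 * rng.normal(size=(d_hidden, d_in))
Uh = 0.1 * rng.normal(size=(d_hidden, d_hidden))
bh = np.zeros(d_hidden)
Wz = 0.1 * rng.normal(size=(d_out, d_hidden))
bz = np.zeros(d_out)

x_seq = rng.normal(size=(20, d_in))            # a toy input sequence
print(elman_forward(x_seq, Wh, Uh, bh, Wz, bz)[-1])   # output at the last time step
```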

The Elman network and the similar Jordan network are known as simple recurrent networks (SRNs). There exist considerably more complicated recurrent networks with a number of layers, as in CNN networks, but the idea of recursion in time is kept.

Long short-term memory (LSTM) recurrent networks: The long-term gradients that are backpropagated in classic RNNs can tend to zero or explode. The problem is computational. The so-called long short-term memory (LSTM) recurrent network is designed to counter this problem. LSTM units practically solve the vanishing gradient problem because LSTM units allow gradients to flow unchanged. LSTM can still sometimes suffer from the exploding gradient problem.

The vanishing gradient problem was first analyzed in a diploma thesis by Sepp Hochreiter at the Technical University of Munich under the guidance of Jürgen Schmidhuber. After having been published as a technical report and as a conference proceeding, the full LSTM paper appeared in 1997 in Neural Computation [55]. Since then, there have been substantial theoretical and applied advances regarding the method. Now, the original LSTM paper stands with a hefty count of more than 67,000 citations. The paper has become the most cited neural network article of the 20th century, and it has been applied with considerable success to topics such as unsegmented, connected handwriting recognition, speech recognition, machine translation, robot control, video games, and healthcare. LSTM is particularly well-suited to classifying, processing, and making predictions of time series data since there can be lags of unknown duration between important events in time series.

A common LSTM unit is composed of a cell unit, an input gate, an output gate, and a forget gate. The cell remembers values over arbitrary time intervals, and the three gates regulate the flow of information into and out of the cell.
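As a hedged sketch of how an LSTM can be set up as a one-step forecaster, the snippet below maps a window of past values to a forecast using PyTorch (assuming it is available); the class name LSTMForecaster, the window length, and the hidden size are my own illustrative choices, and a real application would add a training loop (e.g., MSE loss with an optimizer).

```python
import torch
import torch.nn as nn

class LSTMForecaster(nn.Module):
    """A small LSTM that maps a window of past values to a one-step-ahead forecast."""
    def __init__(self, hidden_size=32):
        super().__init__()
        self.lstm = nn.LSTM(input_size=1, hidden_size=hidden_size, batch_first=True)
        self.head = nn.Linear(hidden_size, 1)

    def forward(self, x):                  # x: (batch, window_length, 1)
        out, _ = self.lstm(x)
        return self.head(out[:, -1, :])    # forecast from the last hidden state

model = LSTMForecaster()
window = torch.randn(4, 24, 1)             # 4 toy windows of 24 past observations
print(model(window).shape)                 # torch.Size([4, 1])
```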

Transformers: In 2017, a paper with the endearing title “Attention is all you need” [56] was published. “Attention” is here a technical machine learning term referring to a method that determines the relative importance of each component in a sequence relative to the other components in that sequence. The term has its origin in natural language processing, with the importance being represented by assigning weights to each word in a sentence (more generally, attention encodes vectors called token embeddings across a fixed-width sequence that can range from tens to millions of tokens in size). The title of the [56] paper refers to the fact that the authors were able to do away with the recursive structure of machine learning methods like RNN and LSTM and instead build entirely on the attention mechanism. This meant that parallelization could be implemented with enormous savings in processing time and could also be said to lay the foundation for artificial intelligence routines like ChatGPT.
The methodology presented in [56] has recently had a very big impact. This model of a neural network that learns context and, thus, meaning in sequential data, like the words in a sentence, has been named the transformer model (with the added fact that transformers can use parallel processing to increase speed). Transformers are, in many cases, replacing convolutional and recurrent neural networks (CNNs and RNNs) as the most popular deep learning models. Indeed, 70 percent of arXiv papers on AI posted in the last two years mention transformers. One example of further development of an aspect of transformer methodology is [57].
The technology is, in principle, valid for any sequence of data, including time series and molecular sequences and their application in medicine, leading to a large growth in use in a number of fields. The structure of transformers also allows them to model long-range dependence [58]; see [59].
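To give a flavor of the attention mechanism at the core of transformers, here is a sketch of single-head scaled dot-product attention in the spirit of [56]; the dimensions and the random weight matrices are arbitrary choices for illustration.

```python
import numpy as np

def scaled_dot_product_attention(Q, K, V):
    """Each output row is a weighted average of the rows of V, with weights
    determined by the similarity of the corresponding query to all keys."""
    d_k = Q.shape[-1]
    scores = Q @ K.T / np.sqrt(d_k)                            # query-key similarities
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights = weights / weights.sum(axis=-1, keepdims=True)    # softmax over keys
    return weights @ V

rng = np.random.default_rng(7)
seq_len, d_model = 10, 16
X = rng.normal(size=(seq_len, d_model))                        # e.g., embedded series values
Wq, Wk, Wv = (rng.normal(size=(d_model, d_model)) for _ in range(3))
out = scaled_dot_product_attention(X @ Wq, X @ Wk, X @ Wv)
print(out.shape)                                               # (10, 16)
```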
I will return to transformer forecasting in the section on weather forecasting—Section 6.

4.2. Random Forest and Gradient Boosting

Random forest and gradient boosting seem to be the main ML competitors to neural network-based models in forecasting. Both methods can be used for classification and for prediction. They are described well in chapters 10, 15, and 16 in [60]. Both of these methods are random tree methods and use bootstrap realizations efficiently to form new trees. Random forest uses averages of forecasts generated by independent bootstrap-generated trees and can be run in parallel. Gradient boosting runs sequentially, where the new tree model seeks to correct the errors of the previous one.

The procedure of random forest is a special case of a bagging procedure, where a final predictor or classifier is obtained by using the average of predictors/classifiers obtained by bootstrapping the original data and creating a regression tree and a forecast for each bootstrap realization. A block bootstrap has to be used in time series forecasting. The advantage of this “ensemble averaging” lies in the possible variance reduction in taking averages. The variance reduction is largest if the bootstrapped trees are independent. In the random forest algorithm, one tries to reduce dependence by using a random mechanism in the growth of each bootstrap tree.

For forecasting purposes, the tree structure used is regression trees. The range of the explanatory variables is randomly split into regions (sequentially) in time; when the tree is fully grown, the predicted value for each sub-region (possibly obtained using several splits) is given by a constant determined by a least squares algorithm on the training set for each bootstrap realization of the data. The final predicted value for a given value, x , of the explanatory variables is then obtained by taking the average of the regionalized values for the regions x belongs to for the various bootstrapped trees. Since the bootstrap realizations can be run and worked on in parallel, efficient computation time may not be too long, even for a large number of bootstraps.
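As an illustration of this regression-tree averaging, the sketch below fits a random forest to lagged values of a simulated series and evaluates one-step forecasts on a holdout (assuming scikit-learn is available); note that RandomForestRegressor uses the standard row bootstrap rather than the block bootstrap mentioned above, so this is a simplified illustration, and the lag order and forest size are arbitrary.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor

rng = np.random.default_rng(8)
e = rng.normal(size=600)
y = np.zeros(600)
for t in range(2, 600):
    y[t] = 0.5 * y[t - 1] - 0.3 * y[t - 2] + 0.2 * np.abs(y[t - 1]) + e[t]

p = 4                                                    # number of lagged features
X = np.array([y[t - p:t][::-1] for t in range(p, len(y))])
target = y[p:]

rf = RandomForestRegressor(n_estimators=500, random_state=0)
rf.fit(X[:-100], target[:-100])                           # train on the earlier part
pred = rf.predict(X[-100:])                               # one-step forecasts on the holdout
print("holdout MAE:", np.mean(np.abs(pred - target[-100:])))
```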

Where the point of random forest is variance reduction, the purpose of gradient boosting is more directed towards bias reduction. Again, the aim is to obtain suitable regression trees. However, in this case, the bootstrap realizations are not just run in parallel. Now, a regression tree is sought by correcting the prediction error in the training period of the previous regression tree. This is a sequential procedure, and it is continued until the entire validation part of the training data set is accurately predicted or a predefined maximum number of models is reached. During this process, the weight of the original data points is changed so that points that are incorrectly predicted are given more weight. In this way, an adaptive boosting procedure is constructed. The authors of Ref. [61] developed the first version of AdaBoost, and they received the prestigious 2003 Gödel prize for their work. Compared to random forest, the results of gradient boosting are more sensitive to parameter settings during training. However, using the correct parameter settings, the user expects to obtain better test results than random forest.
The sequential procedure involved in gradient boosting may make the procedure impractical if a large set of data with many bootstrap generations is involved. Approximate methods and more time-efficient algorithms, e.g., the so-called light-GBM, have been suggested; see, in particular, [62].

4.3. XAI

There are a number of reports on the general scalability of machine learning algorithms; see, e.g., Ref. [63]. The same is the case for the order of training times required; a table is given in [64].
There are other ML methods that can be applied to forecasting: support vector machines, ridge regression, and the lasso. The latter includes regularization terms to avoid overfitting. It is claimed that an ensemble method like random forest itself avoids overfitting by using the ensemble operation. Many more details can be found in [60]. A brief overview of the overfitting problem is contained in [65].

As is mentioned repeatedly in this paper, the largest problem of machine learning methods is their black box nature. Both neural network and random forest prediction algorithms produce forecasts, but they are not simply expressible in terms of the input variables (features) of the model. It is understandable that skepticism arises towards machine learning when it is difficult to understand or (intuitively) explain how each feature contributes to the model. The situation is very different from forecasts formed from generalized linear regression models, where one can find a natural measure of how each input variable contributes to the forecast. Is it possible to find similar measures for the contribution of each input feature in a machine learning context?

Work on this problem has actually started and has, in fact, spawned a whole new discipline: “Explainable artificial intelligence”, XAI, where the aim is to “explain” black box models. A recent survey of this field is given in [66], where the problem is discussed broadly and partly from a philosophical point of view. There are also papers devoted to this problem in special research areas, e.g., opening the black box in applications in genetics, such as in [67], where three general techniques are used to probe the kinds of changes in the output that are caused by alternative weightings of structures in the model. Sensitivity analysis can be used to examine the effects of modifying the input features. The model in [68], which is treated in Section 5.3, was tested by undergoing such sensitivity analyses. A final strategy mentioned in [67] is the surrogate strategy, where one seeks to replace the black box model with a simpler, more interpretable model that still provides a good approximation.
There are also techniques that seek to attribute a concrete value to the importance of each input feature, thus constituting, in a sense, a generalization of the importance measure of each input variable in a generalized linear regression model. The most used technique is probably the SHAP value, which is related to the Shapley value in competitive games. The Shapley value was introduced by Lloyd Shapley in 1951, see [69], for which he was later awarded the Nobel Prize in economics. To put this into our present context (as in [70]), let F be the set of all features. The Shapley method, in principle, requires retraining a model on all feature subsets, $S \subseteq F$. The method assigns an importance value to each feature, i, that represents the effect on the model prediction of including that feature. To compute this effect for the ith feature, a model, $f_{S \cup \{i\}}$, is trained with that feature present, and another model, $f_S$, is trained with the feature withheld. The predictions from the two models on the current input are then compared, forming the difference $f_{S \cup \{i\}}(x_{S \cup \{i\}}) - f_S(x_S)$, where $x_S$ represents the values of the input features in the set S. Since the effect of withholding a feature depends on the other features in the model, the preceding differences are computed for all possible subsets, $S \subseteq F \setminus \{i\}$. The Shapley values are then computed and used as feature attributions (importance). They are weighted averages of all possible differences

$$\phi_i = \sum_{S \subseteq F \setminus \{i\}} \frac{|S|!\,(|F| - |S| - 1)!}{|F|!} \left[ f_{S \cup \{i\}}(x_{S \cup \{i\}}) - f_S(x_S) \right]$$

where $|S|$ is the number of elements in S and $|F|$ is the number of elements in F.
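The following sketch computes the Shapley attributions in the formula above by brute force for a small feature set, retraining a simple model on every subset. It is meant only to make the weighting explicit; the use of a linear model and the convention that the empty-set model predicts the mean of y are illustrative assumptions, and packages such as shap provide efficient approximations.

```python
import numpy as np
from math import factorial
from itertools import combinations
from sklearn.linear_model import LinearRegression

def shapley_values(X, y, x_new):
    """Brute-force Shapley attributions phi_i for the prediction at x_new.
    A model f_S is retrained on every feature subset S."""
    n_features = X.shape[1]

    def f_S(S):
        if not S:
            return np.mean(y)                       # empty-set model: predict the mean
        model = LinearRegression().fit(X[:, S], y)
        return model.predict(x_new[S].reshape(1, -1))[0]

    phi = np.zeros(n_features)
    for i in range(n_features):
        others = [j for j in range(n_features) if j != i]
        for k in range(len(others) + 1):
            for S in combinations(others, k):
                S = list(S)
                w = factorial(len(S)) * factorial(n_features - len(S) - 1) / factorial(n_features)
                phi[i] += w * (f_S(S + [i]) - f_S(S))   # weighted marginal contribution
    return phi

rng = np.random.default_rng(9)
X = rng.normal(size=(200, 3))
y = 2 * X[:, 0] - 1 * X[:, 1] + 0.1 * rng.normal(size=200)   # feature 2 is irrelevant
print(shapley_values(X, y, X[0]))
```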

In [70] (also see [71]), the Shapley values are integrated in an additive regression-type framework to create SHAP values (SHapley Additive exPlanations). This framework also serves as a common framework for five other explanation models, with LIME (local interpretable model-agnostic explanations) described in [72] among them. Some stability issues are highlighted in [73]. Based on the unification attained in [70], the authors indicate that the new methods show improved computational performance and/or better consistency with human intuition than previous approaches. As an example among numerous applications of SHAP in explaining machine learning models, real hospital data are examined in [74].
It should be remarked that the above examples mark only the beginning of work on the problem of opening the black box for machine learning models. A rather critical evaluation of XAI is contained in [75]. Moreover, rather limited attention has been given to this issue in the forecasting competitions treated in the next subsection. One reason for this is probably the rather late appearance of XAI methods. There is surely a need for competitions that compare XAI possibilities for various models.

4.4. Statistical vs. ML Methods: Evidence from Forecasting Competitions

A number of forecasting competitions have been arranged to find the “best” forecasts and to evaluate statistical forecasts against ML forecasts and simple models vs. more complicated ones. Spyros Makridakis, in particular, has organized a number of forecasting competitions, starting in 1982 with the M1 competition and then continuing with the M2–M6 competitions, with the results published in 1982, 1993, 2000, 2020, 2022, and 2024 (see below for exact references). The scope has been somewhat varied, and so has the number of participants and the number and versions of methods represented. For the most recent ones, major ML and statistical methods have been included. In the M1 competition, the number of time series sought to be predicted was 1001; in M3, it was 3003, and in M4, it was 100,000. A history of forecasting competitions up to 2020 is given in [76]. The forecasting competitions described by the forecast organizers themselves can be found in [6,77,78,79,80,81].
The general tendency over the course of these competitions is that ML methods have become increasingly successful. In the first competition, simple statistical models of the exponential smoothing type did better than the more complex ARIMA models. In M3 in the year 2000, several ML methods took part, but the winner was the theta method described in Section 2, a relatively simple parametric extension of exponential smoothing-type models. The M1–M3 competitions had a relatively small number of time series available for prediction. The M4 competition represented a very significant extension, featuring 100,000 time series, and it is also (to date) the competition with the most general types of series; it was also the first where prediction intervals were evaluated in addition to point forecasts. It seems reasonable to give a brief review of that contest. The time series differed in sampling frequency: yearly, quarterly, monthly, weekly, daily, and hourly, with about half of the series sampled on a monthly basis. The series were taken from the following areas: business, economics, industry, demographics, education, labor and wages, government, households, bonds, stocks, insurance, loans, real estate, transportation, natural resources, and the environment. The sheer amount of time series and their spread is, of course, commendable, but, at the same time, this could be discouraging for the competitors due to the amount of labor required. There was also a relatively large number of dropouts because the forecasting teams were discouraged when they compared their results to simple benchmark forecasting methods supplied by the organizers. Nevertheless, 61 forecasting methods ended up being represented.

The length of the time series varied considerably, with the minimum number of observations in the training set being 13 for yearly, 16 for quarterly, 42 for monthly, 80 for weekly, 93 for daily, and 700 for hourly series. However, on average, the length of the series was much longer than the series in the M3 contest. The forecasts were evaluated in several ways, mostly based on relative absolute deviation, with an average taken over forecasting horizons, which varied from 3 for yearly data to 48 for hourly data.

The winner of the competition was Slawek Smyl, a data scientist at Uber Technologies. He mixed exponential smoothing-type models with RNN recursive deep learning methods, which are described in more detail in [82]. His hybrid method perhaps represented a main result of the competition, namely the improved numerical accuracy of combining forecasting methods and the superiority of a hybrid approach that utilizes both statistical and ML features.
Combining has long been considered a useful practice in the forecasting literature, going all the way back to the classical paper of [83] for purely statistical methods and up to the ensemble methods for random forest and gradient boosting. Intuitively, it is understandable that the combining principle can lead to improvements, as different methods pick up different forecasting potentials of the data, but the exact theory that explains the improvement is more scarce; see, e.g., Ref. [84]. More specifically, for the M4 competition, there was only 1 pure point forecast statistical method among the top 10 most accurate methods, while all of the 10 least accurate methods were either purely statistical or purely ML. Similar results were obtained for interval forecasting.

It seems that most of the methods were applied to individual series, with only a few attempts to exploit multiple series to extrapolate the individual ones. Similarly, exogenous variables do not seem to have been used (possibly not the point of the competition and perhaps not so easily available).

Another issue not dealt with by Makridakis et al. and by other similar contests is that there was no emphasis on so-called black swans, i.e., in the evaluation of the forecasts, it was not registered whether a forecasting method in some cases did disastrously badly; this would be missed in an evaluation based on average performance over the series. The “black swans” may, in some cases, be caused by thick-tailed distributions [85]. The organizers remark that this phenomenon was outside the scope of the M4 competition but stated that another forecast competition is needed for closer examination to analyze the implications of black swans for risk management and decision-making.
A total of five ML methods, four pure and one combination of solely ML methods, did not do well. These results were in agreement with [24]. However, it could be possible that this is partly due to the ML methods used in the competition. For example, it does not seem that the best-developed random forest or gradient boosting processes participated.
These types of criticism (partly pointed out by the organizers themselves) and some additional ones are taken up in [86]. They refer to the results of a recent competition of 145,000 series on web traffic data run by the Kaggle corporation and hosted at about the same time. The time series in the Kaggle data had a much higher spectral entropy (defined by $-\int_{-\pi}^{\pi} f(\lambda) \log f(\lambda)\, d\lambda$, where $f(\lambda)$ is the spectral density of the series). This means that the series were noisier and more difficult to predict. Further, many more series had (much) shorter time intervals between the samples. For the Makridakis data mentioned above, the emphasis is on monthly data (48%), whereas only 0.4% were weekly, 4.2% daily, and 0.4% hourly.
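For readers who want to compute such a measure themselves, here is a sketch that approximates the spectral entropy from the normalized periodogram of a series; the discrete approximation, the FFT-based periodogram, and the example series are my own illustrative choices.

```python
import numpy as np

def spectral_entropy(y):
    """Spectral entropy -sum(p * log p) of the normalized periodogram,
    a discrete approximation of the integral definition above."""
    y = np.asarray(y) - np.mean(y)
    pgram = np.abs(np.fft.rfft(y)) ** 2           # periodogram ordinates
    p = pgram / pgram.sum()                        # normalize to a probability distribution
    p = p[p > 0]
    return -np.sum(p * np.log(p))

rng = np.random.default_rng(10)
noise = rng.normal(size=1000)                                          # high entropy
smooth = np.sin(np.arange(1000) / 10) + 0.1 * rng.normal(size=1000)    # low entropy
print(spectral_entropy(noise), spectral_entropy(smooth))
```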
In [86], Fry and Brundage state that at Google, as is the case with many other net-based companies, there is an increasing number of situations where one has to deal with short time intervals, e.g., even 5 min intervals, and sometimes irregular time intervals. For these data with higher spectral entropy, it has turned out that ML methods have been doing increasingly well, and in [86], Fry and Brundage state that ML methods have, in fact, been a game-changer for this kind of forecast. At the same time, they reiterate that more emphasis should be put on multivariate and hierarchies of time series, as well as on predicting extreme events. They state that perhaps more attention should be put on predicting higher quantiles in a forecast distribution than on producing a single-point forecast.
Perhaps as a reaction to this, a new M5 forecast competition was launched [80], which focused on retail sales forecasting applications and used real-life hierarchically structured sales data with intermittent and erratic characteristics. The competition attracted many participants who were eager to experiment with effective forecasting solutions in real-life situations faced by numerous retail companies on a daily basis. Many of the series in the competition were characterized by intermittency, involving sporadic unit sales with many zeroes.

The data set used comprised the unit sales of 3049 products sold by Walmart in the USA. The data were organized in the form of grouped time series aggregated based on their type (category and department) and selling locations (stores and states), with a total of 42,480 series in 12 cross-sectional aggregation levels. As a performance measure, a scaled least squares criterion averaged over several forecasting horizons was used.

Again, simple methods like exponential smoothing did fairly well. However, this time, they were clearly beaten by the best ML methods. The light-GBM version of gradient boosting mentioned in Section 4.2 turned out to have superior forecasting capabilities compared to all other alternatives, and it was used in practice by all of the top 50 competitors. This is consistent with the claim of [86] regarding the superiority of ML methods for this and similar types of series. Somewhat curiously, the first prize was won by a senior undergraduate at Kyung Hee University of South Korea with the team appellation YJ.STU. He used an equal-weighted average of various light-GBM algorithms.
The most recent of the Makridakis forecasting competitions is M6 [81]. Participants were asked to forecast 100 financial assets and to make corresponding financial investments. Surprisingly, there was virtually no correlation between the best forecasts and the best investment decisions; it turned out to be very difficult to beat the standard S&P market index. The authors do warn that the participants to a limited degree carried out the study according to the intentions of the organizers. It was intended that the competitions should have 12 non-overlapping submission periods, with each 28 days after the previous one. As a result, the competition lasted for almost one year. However, only a small percentage of the teams updated their forecast regularly during the duration of the competition. Many teams simply submitted once or twice at the very beginning of the competition. Again, a winning team was directed by a graduate student.
It is satisfying that ML techniques have the potential to produce better forecasts, and they do enlarge the tool box of forecasters. However, there are reasons for some reservations and criticisms directed towards ML methods. A primary reason, as mentioned, is that they are black box forecasts based on pattern recognition in the data. Some attempts have been made to look into this, see, e.g., Ref. [68] (who obtain their best results using random forest routines), by comparing particular cases with benchmark forecasts, changing the parameters of an ML forecast, and seeking to explain the corresponding changes in the forecasts. See Section 5.3 for further mention of this paper. Basically, ML methods are not immediately interpretable in the way that, e.g., an AR time series model is; however, as indicated in Section 4.3, attempts to open the black box have started to emerge. Another distinct difference is the lack of theoretical results on the asymptotics of ML methods. The beginnings of such a theoretical development may be represented by [87].

Summing up the results from the Makridakis competitions and similar competitions such as the Kaggle competitions, there seems to be a definite trend: ML forecasts are improving, and at least in some areas, such as those with high spectral entropy, they are superior to statistical forecasts. However, limitations of these competitions, beyond the black box criticism of ML methods, have been pointed out. For example, they have essentially been carried out for univariate series. They started out using more or less pure point forecasts; interval prediction has increasingly been included, but there is still a lack of clear results on forecast distributions and extreme quantiles. Moreover, there is a lack of systematic evaluation of multivariate forecasts, of the influence of exogenous variables, and of the effect of nonstationarity in multivariate systems. The next section is devoted to some of these themes.

6. Weather Forecasting

I will finish this paper by looking at a problem where the main thrust of the existing forecasts depends on a (numerical) physical model, namely weather forecasting. What are the interactions, if any, between this numerical physical model and ML techniques like neural networks and random forest? Is it possible to do away with numerical weather prediction (NWP) completely and base a weather forecast entirely on ML? NWP is very time-consuming and requires extensive computer resources. With the fastest and most effective ML methods, one might expect a gain in processing time, and, of course, the hope is that, ultimately, ML may also increase the accuracy of weather forecasting.

Manual NWP was first attempted in [148] in Britain in 1922, and early computer-aided weather forecasts were produced in the 1950s by a group led by John von Neumann at the Institute for Advanced Study in Princeton; see [149]. The early history is also outlined in [150,151]. Modern weather prediction relies extensively on massive numerical simulation systems that are routinely operated by national weather agencies all over the world. Statistical analysis plays a major part in the stepwise procedure that NWP consists of.
In order to perform NWP, a variety of observations of the atmosphere, land, and ocean are collected all over the world, creating a global observation network, not least in terms of satellite measurements. Although millions of different direct and indirect measurements are obtained each day, these raw data must be transformed and refined before they can be input into the numerical weather model. This stage is called data assimilation (DA). The data are projected onto a discrete model grid (interpolated in time) and adjusted to be consistent in terms of state variables (e.g., temperature, pressure, wind, etc.). DA also takes care of measurement errors, such as biases between different spatial instruments. The initial Earth-system state obtained after the DA step constitutes an optimized estimate of the real conditions, cf. [152].

Given these initial conditions, the core of the NWP model consists of numerically solving the coupled partial differential system (the Navier-Stokes equations) describing the atmosphere in terms of momentum, mass, and enthalpy.

The Navier-Stokes equations for the fluid motion $\mathbf{v}$ underlying atmospheric flow are the following, cf. [153], p. 14:

$$\rho\left(\frac{\partial \mathbf{v}}{\partial t} + \mathbf{v}\cdot\nabla\mathbf{v}\right) = -\nabla p + \mu\nabla^{2}\mathbf{v} + \rho\mathbf{g},$$

where $\rho$ is the density, $p$ is the pressure, $\mathbf{g}$ is the gravitational acceleration, and $\mu$ is a viscosity parameter. The Navier-Stokes equations are accompanied by a set of thermodynamic equations that interrelate temperature, pressure, and humidity within the atmosphere:

$$\frac{\partial \rho}{\partial t} + \nabla\cdot(\rho\mathbf{v}) = 0 \qquad \text{(Continuity equation)}$$

$$\frac{\partial T}{\partial t} + \mathbf{v}\cdot\nabla T = \frac{q}{c_p} \qquad \text{(Energy equation)}$$

$$\frac{Dp}{Dt} = -\rho c_p\,\nabla\cdot\mathbf{v} \qquad \text{(Pressure equation)}$$

where $T$ is the temperature, $q$ is the diabatic heating rate, and $c_p$ is the specific heat at constant pressure. The resulting solution gives the atmospheric state in each model grid cell.

This is very demanding computationally, and the solution can only be given over a relatively coarse grid. A postprocessing stage (mainly statistical in nature, possibly involving ML) is, therefore, needed to obtain data on a finer grid.

Over the past decades, the ability of NWP models to predict a future atmospheric state has continuously improved. Contemporary global NWP models are not only able to predict the synoptic-scale weather pattern for several days ahead, but they have also improved in their ability to predict regional events, e.g., major precipitation events.

The increasing success of operational NWP models is due to improvements in all the steps involved in the NWP workflow and the new capabilities (e.g., satellite measurements) of the global Earth observation system. The NWP model improvements can, therefore, be related in part to resolution enhancement. The continuous refinement of grid spacing has also required reformulating the dynamical cores of the NWP models, where the discretized Navier-Stokes equations are solved. Simulating the atmosphere at the kilometer scale brings with it the demand for highly parallelizable algorithms for the dynamical core [154]. Simultaneously, remarkable progress has been achieved in designing discretization approaches that enable the grid scale dynamics to follow the conservation laws of energy, entropy, and mass. Such physical conservation laws may cause problems in a pure ML approach. An extensive overview of contemporary dynamical core architecture and spatio-temporal forecasting can be found in [155].
The general development of ML, and in particular of deep learning (DL), spanning much more than weather forecasting, has had a history of advances and setbacks; see, e.g., Ref. [156]. Nowadays, ML has seen several instances of success. Recent developments are due to vastly increased computer capabilities, not least due to massive parallel processing. This is the case with, for example, TCNs and transformers, as reported in Section 4.1. This has carried with it the hope that an ML system for weather forecasting may be faster than the present NWP system, allowing for more efficient processing of very large data sets. Finally, large benchmark data sets have become available on the internet. As a result of this development, highly complex neural network systems with more than $10^6$ parameters have enabled breakthroughs in image recognition, speech recognition, gaming, and video analysis.
The weather and climate community is, of course, aware of the powerful advances in ML. However, according to Ref. [156], there have still been reservations about ML methods, possibly due to their black box nature and also to the lack of implementation of physical conservation laws, as mentioned above; see [157]. Moreover, Ref. [158] argued that traditional ML methods may not yet be fully suitable for Earth System data; they claim that new network topologies are needed that exploit not only local neighborhoods (even at different scales) but also long-range relationships, for example, significant relationships or links between weather phenomena at widely separated locations on Earth, such as the effects from the El Niño Southern Oscillation (ENSO). It may be of interest that both TCNs and transformers represent potential tools for modeling long-range relationships. See [159] for an example of transformers being used in long-term forecasting.
For some time now, there have been a number of uses of ML methods, but these have mainly been within more isolated parts of the weather prediction problem, such as the preprocessing and postprocessing stages mentioned earlier. In the preprocessing stage, ML methods can, for instance, be used in the interpolation of data to a regular grid. The postprocessing techniques have been reviewed in several papers, such as [160,161].
One survey [160] discusses both the use of parametric statistical distributions and ML methods. Distribution-based postprocessing approaches specify a parametric model for a forecast distribution. The parameters of the forecast distributions are then linked to the predictors from the NWP system via regression equations. Flexible additive regression functions are suggested in [162]. Despite these powerful variants, the need to select a suitable parametric family to describe the distribution of a target variable remains a limitation for parametric postprocessing methods and often necessitates elaborate fine-tuning.
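As a stylized example of such a regression link (a minimal Gaussian sketch in the spirit of ensemble model output statistics; the function name and the simple linear link are illustrative choices, not a specific method from [160]), the mean and variance of the forecast distribution can be tied to the ensemble mean and ensemble variance and fitted by maximum likelihood:

```python
import numpy as np
from scipy.optimize import minimize
from scipy.stats import norm

def fit_gaussian_postprocessing(ens_mean, ens_var, obs):
    """Fit the forecast distribution N(a + b * ens_mean, c + d * ens_var)
    to past observations by maximizing the Gaussian log-likelihood;
    returns the fitted coefficients (a, b, c, d)."""
    def nll(theta):
        a, b, c, d = theta
        mu = a + b * ens_mean
        sigma = np.sqrt(np.maximum(c + d * ens_var, 1e-6))
        return -np.sum(norm.logpdf(obs, loc=mu, scale=sigma))
    start = np.array([0.0, 1.0, 1.0, 0.1])
    return minimize(nll, start, method="Nelder-Mead").x
```

Replacing the Gaussian family with, say, a truncated or skewed distribution for precipitation is exactly the kind of manual choice and fine-tuning alluded to above.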
When it comes to ML methods, random forest has been used in a variety of postprocessing situations; see, e.g., Ref. [163]. One challenge is, again, the interpretability of ML approaches. However, in weather forecasting, attempts have also been made to break open the black box; see, e.g., Ref. [164], which reviews methods for deciding which predictors are most important for a model and for a particular forecast. A more recent survey is contained in [157]. The authors discuss applications of SHAP and LIME, as mentioned in Section 4.3, to meteorological ML routines. They indicate how these interpretative aids may provide more insight into predictions, uncovering novel meteorological relationships captured by machine learning. Further, they point out challenges in achieving deeper mechanistic interpretations aligned with physical principles. Refs. [165,166] should also be mentioned.
Combining forecasts (also in the postprocessing stage) has been considered as well. Ref. [167] does this in the context of an ensemble Kalman filter. Again, one challenge in blending forecasts is to maintain the realistic physical features of the NWP model. There is, though, a clear shift [160] from physical modeling approaches to data-driven approaches. This shift is perhaps particularly apparent for the recent method of transformers, as discussed in Section 4.1. In [168], a transformer-based neural network is used in the postprocessing phase to predict near-surface temperature, and in [169], hierarchical transformers are used in the postprocessing of temperature and precipitation forecasts. Moreover, transformers are used for precipitation nowcasting in [170].
These examples are restricted in that they are not full weather forecasts. This is also the case with some TCN (again, see Section 4.1) applications. Ref. [171] used TCN and LSTM for weather forecasting using time series data from local weather stations. TCN is also an important component in [172] in a hybrid model for wind power forecasting.
Despite its publication in the Annals of Applied Statistics, Ref. [161] represents a more theoretical approach to postprocessing. The authors stress that accurate modeling of multivariate dependencies is crucial in many practical applications in general and in NWP in particular. In several approaches to postprocessing, ensemble predictions are first postprocessed separately in each margin, and multivariate dependencies are then established via copulas. Ref. [161] proposes a multivariate postprocessing method based on generative machine learning, with samples produced as the output of a general neural network, in order to avoid some of the limitations of the two-stage copula approach. This results in a new class of nonparametric, data-driven distributional regression models.
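For concreteness, the two-stage construction that [161] seeks to move beyond can be sketched as follows (a minimal empirical-copula reordering in the style of ensemble copula coupling; the generative approach of [161] instead draws multivariate samples directly from a neural network):

```python
import numpy as np

def copula_reorder(raw_ensemble, calibrated_samples):
    """Stage 1 is assumed done: each margin has been postprocessed
    separately, giving calibrated_samples of shape (m, d). Stage 2
    restores multivariate dependence by reordering each margin's
    calibrated samples according to the rank order of the raw ensemble
    (also of shape (m, d)) in that margin."""
    out = np.empty_like(calibrated_samples, dtype=float)
    for j in range(raw_ensemble.shape[1]):
        ranks = np.argsort(np.argsort(raw_ensemble[:, j]))  # ranks 0..m-1
        out[:, j] = np.sort(calibrated_samples[:, j])[ranks]
    return out
```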

While this kind of postprocessing still requires an NWP core or is restricted to one or a few meteorological variables, intensive work has simultaneously been going on to establish weather forecasting schemes based entirely on statistical and/or ML approaches. Very recently, there has been remarkable progress in this area, which may even amount to a breakthrough.

The group behind this breakthrough is the Google DeepMind group. They have taken ideas from the transformer mechanism and used them in a wider context, extending the framework to graphs, which seem to be a more fitting modeling environment for weather forecasts, since the observations (nodes) are scattered over space, with dependence not only between neighboring points but also over larger distances corresponding to long-range meteorological phenomena.

A gentle introduction to graph neural networks is given in [173]. The connection between transformers and graph neural networks, with the former as a special case of the latter, is described in a Graph Deep Learning Lab report [174].
The GraphCast module developed for weather forecasting by the Google DeepMind group is described in a Science 2023 article [175]; see also [176]. GraphCast is able to make medium-range weather forecasts with unprecedented accuracy. According to the authors, it predicts weather conditions up to 10 days in advance more accurately and much faster (a speed inherited from the fast transformer-type set-up) than the industry gold standard weather simulation system, the High Resolution Forecast (HRES) produced by the European Centre for Medium-Range Weather Forecasts. GraphCast can also offer earlier warnings of extreme weather events.

GraphCast makes forecasts at the high resolution of 0.25 degrees longitude/latitude (28 km × 28 km at the equator). That is more than a million grid points covering the entire surface of the Earth. At each gridpoint, the model predicts five Earth-surface variables—including temperature, wind speed and direction, and mean sea level pressure—and six atmospheric variables. GraphCast is autoregressive; it can be “rolled out” by feeding its own predictions back in as input to generate an arbitrarily long trajectory of weather states.

The model was trained using 39 years (1979–2017) of historical data from existing archives. The training objective was the mean squared error between GraphCast’s predicted states and the corresponding archived data, averaged over a set of autoregressive steps. Many more details can be found in the Science paper.
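Schematically, the rollout and the multi-step objective can be sketched as follows (a hedged illustration only: the step function and states are placeholders, and GraphCast’s actual loss includes latitude-, level-, and variable-dependent weights described in the paper):

```python
import numpy as np

def rollout(step_fn, state0, n_steps):
    """Autoregressive rollout: feed each predicted state back as input."""
    states, state = [], state0
    for _ in range(n_steps):
        state = step_fn(state)      # one forecast step (e.g., a few hours)
        states.append(state)
    return states

def multistep_mse(step_fn, state0, archived_states):
    """Mean squared error between the rolled-out predictions and the
    archived (reanalysis) states, averaged over the autoregressive steps."""
    preds = rollout(step_fn, state0, len(archived_states))
    return np.mean([np.mean((p - a) ** 2)
                    for p, a in zip(preds, archived_states)])
```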

What makes the Google DeepMind group especially impressive is that it has come up with another module that seems to be on the level of GraphCast, if not better. It is described in a late 2024 Nature paper [177]. It goes under the appellation GenCast and is based on conditional probabilities, a form of conditional independence, using diffusion modeling. It has the same resolution as GraphCast and roughly the same training period; it beats the ENS model, the ensemble forecast of the European Centre for Medium-Range Weather Forecasts, in speed and accuracy on 97.2% of 1320 targets.
Do these advances mean that numerical weather prediction models can be discarded in favor of machine learning methods? This is not a viewpoint held by scientists in this area. In [178], GraphCast is criticized for its possible lack of physical consistency. Furthermore, explainability, in spite of recent advances, is still an issue that needs further attention. Rare events are hard to model because of the limited amount of relevant data available in the training period.
Can machine learning methods be extended to cover really long-range periods and even give clues to possible climate modeling? This question is raised, among others, in a CSIRO report [179]. It is pointed out that while weather prediction is a “big-data” problem, climate projections represent a “small-data” problem, with relatively little available data.
On the other hand, there seems to be every reason to continue the search for good machine learning methods in weather prediction, not least as an extremely valuable (and fast) complement to numerical weather prediction. Hybrid combinations of physical models and AI models may be of special interest. In a recent Nature paper, the authors of Ref. [180] discuss the possibility of coupling the physics of general circulation models with machine learning models to obtain a hybrid model that may be able to produce useful simulations of long-range weather or even climate processes. Similar arguments are used in [158].
Ref. [156] concludes that the research on pure ML methods in weather forecasting is still in its infancy; the specific properties of weather data require the development of new approaches beyond the classical concepts from computer vision, speech recognition, and other typical ML tasks. The author’s conclusion is that there may be potential for end-to-end (“pure”) ML weather forecast applications to produce equal or better quality forecasts for specific end-user demands, especially if these systems can exploit small-scale patterns in the observational data that are not resolved in the traditional NWP model chain. Whether ML and DL will evolve enough to replace most or all of the current NWP systems cannot be answered at this point. In any case, according to Ref. [158], one might argue for more interaction between physical and deep learning systems.


