Last modified: September 21, 2024


Forecasting with Time Series

Time series forecasting is a technique used to predict future values based on historical data. It is widely used in various fields, such as finance, economics, and meteorology. In this section, we will discuss the basics of time series forecasting.

Components of a Time Series

A time series can be decomposed into four main components:

I. Trend represents the long-term progression of the series, signifying a persistent, general direction of the data over a long period. The trend can be upward, downward, or flat (stable).


II. Seasonality refers to patterns that repeat at regular intervals, such as daily, monthly, or quarterly. This component reflects the influence of seasonal factors on the time series.


III. Cyclical patterns, unlike seasonality, occur at less regular intervals. These fluctuations are often linked to economic, political, or even environmental factors and can span multiple years.


IV. The random (or irregular) component captures the 'noise' or random variation in the data. It represents the unpredictable, erratic factors affecting the time series after the trend, seasonality, and cyclical components have been accounted for.

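These components can be inspected programmatically. Below is a minimal sketch using statsmodels' seasonal_decompose on a synthetic monthly series; the series and its parameters are purely illustrative.

import numpy as np
import pandas as pd
from statsmodels.tsa.seasonal import seasonal_decompose

# Synthetic monthly series: upward trend + yearly seasonality + random noise
rng = np.random.default_rng(42)
idx = pd.date_range("2015-01-01", periods=96, freq="MS")
values = (0.5 * np.arange(96)                           # trend
          + 5 * np.sin(2 * np.pi * np.arange(96) / 12)  # seasonality
          + rng.normal(0, 1, 96))                       # random component
series = pd.Series(values, index=idx)

# Additive decomposition into trend, seasonal, and residual components
result = seasonal_decompose(series, model="additive", period=12)
print(result.trend.dropna().head())
print(result.seasonal.head())
print(result.resid.dropna().head())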

Forecasting Methods

There are various methods for time series forecasting, each suited to specific scenarios and data characteristics. Here are some commonly used methods:

Naive Forecast

This method assumes that the next value in the time series will be equal to the most recent value.

If $y_t$ denotes the value of the series at time $t$, the naive forecast for time $t+1$ is simply the value observed at time $t$:

$$ \hat{y}_{t+1} = y_t $$
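
As a quick illustration, here is a minimal sketch of the naive forecast in Python; the series values are made up.

import pandas as pd

# Toy series: the naive forecast for each step is simply the previous observation
y = pd.Series([112, 118, 132, 129, 121, 135])
naive_forecast = y.shift(1)   # \hat{y}_{t+1} = y_t
print(naive_forecast)

# One-step-ahead forecast beyond the observed data
print("Forecast for the next period:", y.iloc[-1])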

Simple Exponential Smoothing (SES)

Simple Exponential Smoothing (SES) is a method used for forecasting univariate time series data without a trend or seasonal component. Unlike methods that weight each past observation equally, SES assigns exponentially decreasing weights to past observations, giving more importance to recent data. The method is especially suitable for data that follows a pattern that is approximately flat with noise around a constant level.

The formula for SES is:

$$ \hat{x}_{t+1} = \alpha x_t + (1 - \alpha) \hat{x}_t $$

where:

- $\hat{x}_{t+1}$ is the forecast for the next period,
- $x_t$ is the observed value at time $t$,
- $\hat{x}_t$ is the forecast that was made for time $t$,
- $\alpha \in (0, 1]$ is the smoothing parameter.

SES can be thought of as a weighted average of past observations, where the weights decrease exponentially as we move further into the past. This means that more recent observations are weighted more heavily than older ones.

For example, for any time $t$, we can recursively substitute the previous forecasts:

$$ \hat{x}_{t+1} = \alpha x_t + (1 - \alpha)\left[\alpha x_{t-1} + (1 - \alpha)\hat{x}_{t-1}\right] $$

Expanding this equation:

$$ \hat{x}_{t+1} = \alpha x_t + \alpha(1 - \alpha) x_{t-1} + \alpha(1 - \alpha)^2 x_{t-2} + \ldots $$

This shows that the forecast is a weighted average of all previous observations, with the weights decreasing exponentially at the rate $1 - \alpha$.

The sum of the weights converges to 1, ensuring the method remains stable.

Initial Condition

To initialize the process, we need a starting point for the forecast, $\hat{x}_1$. One common approach is to set the initial forecast equal to the first data point:

$$ \hat{x}_1 = x_1 $$

Alternatively, we can use the average of the first few data points as the initial value.

Forecast Error

The forecast error at any time $t$ is the difference between the actual observation and the forecast made at time $t-1$:

$$ e_t = x_t - \hat{x}_t $$

The aim of SES is to minimize the sum of squared errors over time. We can use this to find the optimal value of $\alpha$.

Sum of Squared Errors (SSE)

The Sum of Squared Errors (SSE) is a measure of the total error in the model, which we aim to minimize when choosing the best smoothing parameter $\alpha$. The SSE is defined as:

$$ SSE(\alpha) = \sum_{t=1}^{n} (x_t - \hat{x}_t)^2 $$

For different values of $\alpha$, we compute the SSE and select the $\alpha$ that minimizes this sum.

Choosing the Optimal Smoothing Parameter

The choice of $\alpha$ determines how much weight we give to recent observations versus older ones:

- A value of $\alpha$ close to 1 reacts quickly to recent changes but produces noisier forecasts.
- A value of $\alpha$ close to 0 produces smoother forecasts that adapt slowly to changes in the level.

In practice, $\alpha$ is usually chosen by minimizing the SSE using a grid search or another optimization technique.

Recursive Form of SES

SES is often expressed in a recursive form, which is computationally efficient:

$$ \hat{x}_{t+1} = \alpha x_t + (1 - \alpha) \hat{x}_t $$

This recursive equation updates the forecast at time $t+1$ based on the observed value at time $t$ and the forecast made for time $t$. It requires minimal computational resources and is easy to implement programmatically.
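
To make this concrete, here is a minimal sketch that applies the recursive SES update and chooses $\alpha$ by grid search over the SSE; the data values and the grid of candidate $\alpha$ values are illustrative.

import numpy as np

def ses_forecasts(x, alpha):
    """Return one-step-ahead SES forecasts; the forecast for t=1 equals x[0]."""
    forecasts = np.empty(len(x))
    forecasts[0] = x[0]                      # initial condition: \hat{x}_1 = x_1
    for t in range(1, len(x)):
        forecasts[t] = alpha * x[t - 1] + (1 - alpha) * forecasts[t - 1]
    return forecasts

def sse(x, alpha):
    """Sum of squared one-step-ahead forecast errors for a given alpha."""
    return np.sum((x - ses_forecasts(x, alpha)) ** 2)

x = np.array([3.0, 5.0, 9.0, 20.0, 12.0, 17.0, 22.0, 23.0, 51.0, 41.0])

# Grid search for the alpha that minimizes the sum of squared errors
alphas = np.linspace(0.01, 1.0, 100)
best_alpha = min(alphas, key=lambda a: sse(x, a))
print("Best alpha:", round(best_alpha, 2))

# Forecast for the next period using the recursive update
fitted = ses_forecasts(x, best_alpha)
print("Forecast for t+1:", best_alpha * x[-1] + (1 - best_alpha) * fitted[-1])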

Holt’s Linear Trend Method (Double Exponential Smoothing)

Holt's method extends SES to data that exhibit a linear trend. The model has two smoothing equations, one for the level and one for the trend, together with a forecast equation.

Level equation:

$$ \ell_t = \alpha x_t + (1 - \alpha)(\ell_{t-1} + b_{t-1}) $$

where:

- $\ell_t$ is the estimated level of the series at time $t$,
- $\alpha$ is the smoothing parameter for the level,
- $b_{t-1}$ is the trend estimate from the previous period.

Trend equation:

$$ b_t = \beta (\ell_t - \ell_{t-1}) + (1 - \beta) b_{t-1} $$

where:

- $b_t$ is the estimated trend (slope) at time $t$,
- $\beta$ is the smoothing parameter for the trend.

Forecast equation:

$$ \hat{x}_{t+h} = \ell_t + h b_t $$

where $h$ is the number of periods ahead to forecast. A common initialization sets $\ell_1 = x_1$ and

$$ b_1 = x_2 - x_1 $$
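
A minimal sketch of Holt's method implementing the equations above; the series and the smoothing parameters ($\alpha = 0.8$, $\beta = 0.2$) are illustrative.

import numpy as np

def holt_forecast(x, alpha, beta, h):
    """Holt's linear trend method; returns forecasts for h steps ahead."""
    level, trend = x[0], x[1] - x[0]          # initialization: l_1 = x_1, b_1 = x_2 - x_1
    for t in range(1, len(x)):
        prev_level = level
        level = alpha * x[t] + (1 - alpha) * (prev_level + trend)   # level equation
        trend = beta * (level - prev_level) + (1 - beta) * trend    # trend equation
    return np.array([level + step * trend for step in range(1, h + 1)])  # forecast equation

x = np.array([10.0, 12.0, 13.5, 15.2, 16.8, 18.1, 20.3])
print(holt_forecast(x, alpha=0.8, beta=0.2, h=3))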

Holt-Winters Seasonal Method (Triple Exponential Smoothing)

Holt-Winters extends Holt's method with a third smoothing equation for the seasonal component. The multiplicative form is appropriate when the seasonal swings grow with the level of the series, while the additive form is appropriate when they stay roughly constant.

Multiplicative Model

Level equation:

$$ \ell_t = \alpha \frac{x_t}{s_{t-L}} + (1 - \alpha)(\ell_{t-1} + b_{t-1}) $$

where:

- $s_{t-L}$ is the seasonal component from the same point in the previous cycle,
- $L$ is the length of the seasonal cycle (e.g., 12 for monthly data with yearly seasonality),
- $\alpha$ is the smoothing parameter for the level.

Trend equation:

$$ b_t = \beta (\ell_t - \ell_{t-1}) + (1 - \beta) b_{t-1} $$

Seasonality equation:

$$ s_t = \gamma \frac{x_t}{\ell_t} + (1 - \gamma) s_{t-L} $$

where:

- $s_t$ is the seasonal component at time $t$,
- $\gamma$ is the smoothing parameter for the seasonality.

Forecast equation:

$$ \hat{x}_{t+h} = (\ell_t + h b_t) s_{t+h-L(k+1)} $$

where:

- $h$ is the forecast horizon,
- $k$ is the integer part of $(h-1)/L$, which ensures the seasonal index is taken from the most recent complete cycle.

Additive Model

Level equation:

$$ \ell_t = \alpha (x_t - s_{t-L}) + (1 - \alpha)(\ell_{t-1} + b_{t-1}) $$

Trend equation:

$$ b_t = \beta (\ell_t - \ell_{t-1}) + (1 - \beta) b_{t-1} $$

Seasonality equation:

$$ s_t = \gamma (x_t - \ell_t) + (1 - \gamma) s_{t-L} $$

Forecast equation:

$$ \hat{x}_{t+h} = \ell_t + h b_t + s_{t+h-L(k+1)} $$
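
In practice both variants are available in statsmodels. A minimal sketch follows, using a synthetic monthly series with yearly seasonality; the series, the additive configuration, and seasonal_periods=12 are illustrative choices.

import numpy as np
import pandas as pd
from statsmodels.tsa.holtwinters import ExponentialSmoothing

# Synthetic monthly series with trend and yearly seasonality (illustrative)
rng = np.random.default_rng(1)
idx = pd.date_range("2015-01-01", periods=96, freq="MS")
series = pd.Series(0.5 * np.arange(96)
                   + 5 * np.sin(2 * np.pi * np.arange(96) / 12)
                   + rng.normal(0, 1, 96), index=idx)

# Additive trend and additive seasonality; use seasonal='mul' for the multiplicative variant
hw_fit = ExponentialSmoothing(series, trend='add', seasonal='add', seasonal_periods=12).fit()

# Forecast the next 12 months
print(hw_fit.forecast(12))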

Machine Learning for Time Series Forecasting

Machine learning (ML) has become increasingly popular for time series forecasting, especially as data grows more complex and requires non-linear methods. Traditional statistical approaches like ARIMA work well for simpler datasets, but machine learning can handle complex patterns, high-dimensionality, and non-linear relationships better. In this overview, we'll cover key machine learning models used for time series forecasting, with detailed explanations of each approach.

Key Challenges in Time Series Forecasting

Machine learning algorithms for time series forecasting face several challenges:

- Observations are ordered and autocorrelated, so they cannot be treated as independent samples.
- Trends, seasonality, and structural breaks make the data non-stationary, so patterns learned from the past may not hold in the future.
- Most algorithms expect a fixed-length feature vector, so the series must first be converted into features such as lags and rolling statistics.
- Model validation must respect time order (no random shuffling), otherwise information from the future leaks into training.

To address these challenges, various machine learning models and techniques can be employed.

1. Supervised Learning Approach for Time Series

In a supervised learning framework for time series forecasting, we aim to transform the time series problem into a regression task where:

- the input features are past observations and quantities derived from them (lags, rolling statistics, calendar variables), and
- the target is the value of the series at a future time step.

For example, to predict the value at time $t$, the features might be the values at times $t-1, t-2, \ldots, t-n$.

Data Preparation (Feature Engineering for Time Series)
  1. The lag features are essential in time series analysis and include past values of the time series. Example: To predict the value of $y_t$, we can use past values like $y_{t-1}, y_{t-2}, \ldots$ as input features.
  2. Windowed features summarize past values over a specific window, such as calculating the mean, variance, or sum over the last 7 days.
  3. Time-based features include details like month, year, weekday, and hour, helping capture seasonality or periodic trends.
  4. Using rolling/aggregated statistics, such as moving averages or rolling sums, provides useful features by aggregating data over time windows (see the sketch after this list).
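
Here is a minimal sketch of these feature types with pandas; the daily series is synthetic and stands in for your own data.

import numpy as np
import pandas as pd

# Illustrative daily series; in practice 'df' would hold your own data
rng = np.random.default_rng(0)
idx = pd.date_range("2023-01-01", periods=60, freq="D")
df = pd.DataFrame({"value": rng.normal(100, 10, 60).cumsum()}, index=idx)

# 1. Lag features
df["lag1"] = df["value"].shift(1)
df["lag7"] = df["value"].shift(7)

# 2. / 4. Windowed and rolling statistics over the last 7 days
df["rolling_mean_7"] = df["value"].rolling(window=7).mean()
df["rolling_std_7"] = df["value"].rolling(window=7).std()

# 3. Time-based (calendar) features
df["month"] = df.index.month
df["weekday"] = df.index.weekday

# Drop rows made incomplete by shifting and rolling windows
df = df.dropna()
print(df.head())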

2. Machine Learning Models for Time Series Forecasting

2.1 Random Forest

A random forest averages many decision trees fit on lagged features, which captures non-linear relationships without assuming a particular functional form.

Example:

from sklearn.ensemble import RandomForestRegressor

# Assuming 'data' is a pandas DataFrame with the time series in the column 'value'
# Create lag features
data['lag1'] = data['value'].shift(1)
data['lag2'] = data['value'].shift(2)
data.dropna(inplace=True)

# Chronological train-test split (no shuffling for time series)
train_size = int(len(data) * 0.8)
train, test = data.iloc[:train_size], data.iloc[train_size:]

X_train, y_train = train[['lag1', 'lag2']], train['value']
X_test, y_test = test[['lag1', 'lag2']], test['value']

# Train Random Forest Regressor
rf_model = RandomForestRegressor(n_estimators=100, random_state=42)
rf_model.fit(X_train, y_train)

# Predict future values
predictions = rf_model.predict(X_test)

2.2 Gradient Boosting Machines (GBM, XGBoost, LightGBM, CatBoost)

Example Using XGBoost:

import xgboost as xgb

# Train XGBoost Regressor on the same lag features (X_train, y_train) created above
xgb_model = xgb.XGBRegressor(n_estimators=100, max_depth=3, learning_rate=0.1)
xgb_model.fit(X_train, y_train)

# Predict future values
predictions = xgb_model.predict(X_test)

2.3 Support Vector Machines (SVM)

Example:

from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVR

# SVR is sensitive to feature scale, so standardize the lag features before fitting
svr_model = make_pipeline(StandardScaler(), SVR(kernel='rbf', C=100, gamma=0.1))
svr_model.fit(X_train, y_train)

# Predict future values
predictions = svr_model.predict(X_test)

2.4 Artificial Neural Networks (ANN)

Example:

from sklearn.neural_network import MLPRegressor

# Train a Multi-Layer Perceptron Regressor (ANN)
mlp_model = MLPRegressor(hidden_layer_sizes=(100,), activation='relu', solver='adam', max_iter=1000)
mlp_model.fit(X_train, y_train)

# Predict future values
predictions = mlp_model.predict(X_test)

3. Advanced Neural Network Models for Time Series

3.1 Recurrent Neural Networks (RNN)

A simple RNN processes the input sequence one time step at a time, maintaining a hidden state that summarizes past observations.

Example:

import tensorflow as tf

# Reshape the lag-feature matrix to (samples, timesteps, features) for the RNN
X_train_rnn = X_train.values.reshape((X_train.shape[0], X_train.shape[1], 1))

# Define a simple RNN model
model = tf.keras.models.Sequential([
    tf.keras.layers.SimpleRNN(50, activation='relu', input_shape=(X_train_rnn.shape[1], 1)),
    tf.keras.layers.Dense(1)
])

model.compile(optimizer='adam', loss='mse')
model.fit(X_train_rnn, y_train, epochs=50, batch_size=32)

3.2 Long Short-Term Memory (LSTM)

Example Using LSTM:

import tensorflow as tf

# Reshape the data for LSTM (samples, timesteps, features)
X_train_reshaped = X_train.values.reshape((X_train.shape[0], X_train.shape[1], 1))
X_test_reshaped = X_test.values.reshape((X_test.shape[0], X_test.shape[1], 1))

# Define LSTM model
model = tf.keras.models.Sequential([
    tf.keras.layers.LSTM(50, activation='relu', input_shape=(X_train_reshaped.shape[1], X_train_reshaped.shape[2])),
    tf.keras.layers.Dense(1)
])

model.compile(optimizer='adam', loss='mse')
model.fit(X_train_reshaped, y_train, epochs=100, batch_size=32)

# Predict future values
predictions = model.predict(X_test_reshaped)

3.3 Gated Recurrent Units (GRU)

Example Using GRU:

import tensorflow as tf

# Define a GRU model (reusing X_train_reshaped and X_test_reshaped from the LSTM example above)
model = tf.keras.models.Sequential([
    tf.keras.layers.GRU(50, activation='relu', input_shape=(X_train_reshaped.shape[1], X_train_reshaped.shape[2])),
    tf.keras.layers.Dense(1)
])

model.compile(optimizer='adam', loss='mse')
model.fit(X_train_reshaped, y_train, epochs=50, batch_size=32)

# Predict future values
predictions = model.predict(X_test_reshaped)

4. Hybrid Models

In practice, many machine learning approaches to time series forecasting combine traditional statistical models with machine learning models. For instance:

- A statistical model such as ARIMA or exponential smoothing captures the linear trend and seasonal structure, while a machine learning model (for example, gradient boosting or an LSTM) is trained on its residuals to capture the remaining non-linear patterns.
- Forecasts or fitted values from a statistical model can be fed to a machine learning model as additional input features.
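
A minimal sketch of the residual-based hybrid idea, assuming statsmodels and scikit-learn are available; the synthetic series, the ARIMA order (2, 1, 2), and the choice of a Random Forest for the residuals are all illustrative.

import numpy as np
import pandas as pd
from sklearn.ensemble import RandomForestRegressor
from statsmodels.tsa.arima.model import ARIMA

# Illustrative series: trend + seasonality + noise
rng = np.random.default_rng(0)
y = pd.Series(0.5 * np.arange(200)
              + 10 * np.sin(2 * np.pi * np.arange(200) / 12)
              + rng.normal(0, 2, 200))
train, test = y[:160], y[160:]

# Step 1: ARIMA captures the linear structure
arima_fit = ARIMA(train, order=(2, 1, 2)).fit()
arima_forecast = arima_fit.forecast(steps=len(test))

# Step 2: a Random Forest learns the residual (non-linear) structure from lagged residuals
resid = arima_fit.resid
lags = pd.concat({f"lag{i}": resid.shift(i) for i in (1, 2, 3)}, axis=1).dropna()
rf = RandomForestRegressor(n_estimators=200, random_state=0)
rf.fit(lags.values, resid.loc[lags.index].values)

# Step 3: recursively forecast the residuals and add them to the ARIMA forecast
last = list(resid.iloc[-3:])
resid_forecast = []
for _ in range(len(test)):
    r_hat = rf.predict(np.array([[last[-1], last[-2], last[-3]]]))[0]
    resid_forecast.append(r_hat)
    last.append(r_hat)

hybrid_forecast = arima_forecast.values + np.array(resid_forecast)
print(hybrid_forecast[:5])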

Comparison of Various Models

The table below compares several time series forecasting algorithms, highlighting their descriptions, advantages, disadvantages, and whether they can be applied locally or globally:

| Algorithm | Description | Pros | Cons | Local vs Global |
|---|---|---|---|---|
| ARIMA (AutoRegressive Integrated Moving Average) | A statistical model for analyzing and forecasting time series data by using the dependencies between an observation and a number of lagged observations. | Flexible and capable of handling a wide range of time series patterns; suitable for univariate time series data. | Complex to understand and implement; requires the data to be stationary; sensitive to the chosen parameters; AutoARIMA can alleviate some implementation challenges but still requires expertise. | Local only; cannot be used globally. |
| Prophet | Developed by Facebook, tailored for forecasting time series with daily observations, strong seasonal effects, and outliers. | Easy to use with intuitive parameter settings; automatically handles missing data and outliers well; provides simple, interpretable results. | Less effective for non-daily data or data without strong seasonality; requires domain knowledge to fine-tune accurately; has faced criticism after issues with high-profile predictions (e.g., the Zillow collapse). | Local only; cannot be used globally. |
| LSTM (Long Short-Term Memory) | A type of recurrent neural network (RNN) well suited to learning from sequences and capable of capturing long-term dependencies. | Excellent at capturing long-term dependencies and patterns in time series data; can handle large and complex datasets. | Requires substantial amounts of training data; computationally intensive and can be slow to train; complex architecture that requires careful hyperparameter tuning. | Can be used locally or globally. |
| Holt-Winters Method | A forecasting method that accounts for level, trend, and seasonality by applying exponential smoothing. | Good for data with trend and seasonal patterns; straightforward and relatively easy to implement. | May not perform well on non-seasonal data; sensitive to parameter choices and initial settings. | Local only; cannot be used globally. |
| SARIMA (Seasonal ARIMA) | An extension of ARIMA that includes seasonal components, enabling it to handle seasonal effects in the data. | Handles both trend and seasonality; flexible model structure that can be tailored to specific time series characteristics. | Complex to configure and requires a thorough understanding of time series analysis; data must be stationary, often requiring transformations; AutoARIMA can assist but still demands expertise. | Local only; cannot be used globally. |
| Exponential Smoothing | A forecasting technique that applies weighted averages to past observations, with the weights decaying exponentially over time. | Simple to implement and use; effective for data without clear trend or seasonal patterns. | May not be accurate for more complex data involving trends and seasonality; struggles with data exhibiting sudden changes or volatility. | Local only; cannot be used globally. |
| Random Forest | An ensemble learning method using multiple decision trees to improve predictive performance and robustness. | Handles a wide variety of data types and is robust to outliers; can detect complex interactions and dependencies in the data. | Computationally intensive, especially for large datasets; can overfit if not properly tuned; does not extrapolate beyond the range of the training data. | Can be used locally or globally. |
| XGBoost | A highly efficient and scalable implementation of gradient boosting, particularly effective for structured data. | High performance with excellent predictive power; handles a wide range of data types, including complex seasonality; offers extensive tuning options and regularization techniques. | Can be complex to tune and requires careful parameter selection to avoid overfitting; computationally demanding; cannot predict values outside the range of the training data. | Can be used locally or globally. |

Model Evaluation

When evaluating the accuracy and performance of time series forecasting models like Simple Exponential Smoothing (SES), Holt-Winters, ARIMA, and others, there are several widely used metrics that help assess how well the model predictions match the actual data. Here are the key evaluation metrics commonly employed:

| Metric | Definition | Formula | Interpretation |
|---|---|---|---|
| Mean Absolute Error (MAE) | Measures the average of the absolute differences between predicted and actual values. | $MAE = \frac{1}{n} \sum_{t=1}^{n} \lvert y_t - \hat{y}_t \rvert$ | Gives a straightforward sense of the average magnitude of errors. Easy to interpret, but does not penalize large errors as heavily as MSE. |
| Mean Squared Error (MSE) | The average of the squared differences between predicted and actual values. | $MSE = \frac{1}{n} \sum_{t=1}^{n} (y_t - \hat{y}_t)^2$ | Penalizes larger errors more than MAE by squaring them, which is useful when large errors are particularly undesirable. Can be sensitive to outliers. |
| Root Mean Squared Error (RMSE) | The square root of the mean squared error, bringing the error metric back to the original scale of the data. | $RMSE = \sqrt{\frac{1}{n} \sum_{t=1}^{n} (y_t - \hat{y}_t)^2}$ | More sensitive to large errors and outliers than MAE, and expressed in the same units as the data, which makes it easier to interpret in real-world contexts. |
| Mean Absolute Percentage Error (MAPE) | Measures the percentage error by taking the ratio of the absolute forecast error to the actual value. | $MAPE = \frac{100}{n} \sum_{t=1}^{n} \left\lvert \frac{y_t - \hat{y}_t}{y_t} \right\rvert$ | Expresses the prediction error as a percentage, which is useful for comparing performance across datasets. Can give very high values when actual values are close to zero. |
| Symmetric Mean Absolute Percentage Error (sMAPE) | A variation of MAPE that symmetrizes the error measurement, avoiding division by very small actual values. | $sMAPE = \frac{100}{n} \sum_{t=1}^{n} \frac{\lvert y_t - \hat{y}_t \rvert}{(\lvert y_t \rvert + \lvert \hat{y}_t \rvert)/2}$ | Mitigates the division-by-zero problem in MAPE by averaging the actual and predicted values, giving a more balanced view of prediction errors. |
| Akaike Information Criterion (AIC) | Used for model selection; balances goodness-of-fit against model complexity, penalizing models with more parameters. Lower values indicate better models. | $AIC = 2k - 2\ln(L)$ | Helps compare models by accounting for both fit and complexity, favoring models that explain the data well without overfitting. |
| Bayesian Information Criterion (BIC) | Similar to AIC but with a stronger penalty for models with more parameters, making it more suitable when the number of data points is small. | $BIC = k \ln(n) - 2 \ln(L)$ | Penalizes model complexity more than AIC, making it more conservative and often more appropriate for smaller datasets. |
| Ljung-Box Q-test | A statistical test that checks whether the residuals of a forecasting model exhibit remaining autocorrelation; if the residuals are white noise, the model is adequate. | N/A | Significant autocorrelation in the residuals suggests the model has not fully captured the structure of the series and may need improvement. |
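
A minimal sketch computing the main error metrics above with NumPy; the actual and predicted values are illustrative.

import numpy as np

y_true = np.array([112.0, 118.0, 132.0, 129.0, 121.0])
y_pred = np.array([110.0, 120.0, 130.0, 131.0, 119.0])

errors = y_true - y_pred
mae = np.mean(np.abs(errors))
mse = np.mean(errors ** 2)
rmse = np.sqrt(mse)
mape = 100 * np.mean(np.abs(errors / y_true))
smape = 100 * np.mean(np.abs(errors) / ((np.abs(y_true) + np.abs(y_pred)) / 2))

print(f"MAE={mae:.2f}, MSE={mse:.2f}, RMSE={rmse:.2f}, MAPE={mape:.2f}%, sMAPE={smape:.2f}%")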
