What is Multiple-Resolution Tokenization (MRT)?
Time series forecasting plays a crucial role in data science, impacting everything from finance to healthcare. Practitioners have long relied on models like ARIMA and LSTM, as well as more advanced methods like DeepAR, each with its own strengths and weaknesses. Multiple-Resolution Tokenization (MRT) is a newer technique that aims to improve on them by capturing patterns at multiple resolutions at once, making it a promising development for handling complex time series data.
Multiple-Resolution Tokenization (MRT) is an emerging transformer-based architecture designed to tackle the unique challenges of time series forecasting. It's particularly useful in fields with high variability, non-stationary data, and limited sample sizes, where traditional methods often fall short. MRT tokenizes time series data at multiple resolutions, producing tokens that capture crucial patterns over time and enhance the model's ability to forecast effectively. This is especially important in contexts where additional variables, such as pricing in the retail world, play a significant role in shaping future trends.
What makes MRT stand out from other transformer-based models is that it explicitly integrates auxiliary data and cross-series information into its architecture. This allows it to learn from data at various resolutions and scales such as hourly, daily, and monthly patterns, capturing both local trends (fine resolution) and long-term trends (coarser resolution). Think of it like viewing a landscape through different lenses: using a microscope to see fine details and a telescope to observe the broader view. This multi-scale perspective allows MRT to improve forecasting accuracy, especially when dealing with complex datasets where patterns exist at various time intervals.
How MRT Works: Detailed Breakdown
MRT uses a multi-resolution patching technique, which splits the time series into various segments across different resolutions. The model treats each segment as a token and processes these tokens in parallel using the transformer’s attention mechanism. This ensures that the model can simultaneously capture relationships across fine-grained short-term patterns and longer-term trends.
Here’s how MRT operates in detail:
Multiple-Resolution Patching: This is the heart of MRT. The time series is divided into several resolutions (e.g., large patches for long-term trends and smaller patches for short-term behaviors). This patching enables the model to learn patterns at different scales simultaneously. For each resolution, tokens are created that capture both past observations and auxiliary data.
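To make the patching concrete, here is a minimal PyTorch sketch. The patch sizes (4, 16, 64) and the embedding width are illustrative assumptions, not values prescribed by MRT:

```python
# Minimal multi-resolution patching sketch (patch sizes are assumed).
import torch
import torch.nn as nn

class MultiResolutionPatcher(nn.Module):
    def __init__(self, patch_sizes=(4, 16, 64), d_model=128):
        super().__init__()
        self.patch_sizes = patch_sizes
        # One linear projection per resolution: a patch of p consecutive
        # observations becomes a single d_model-dimensional token.
        self.projections = nn.ModuleList(
            [nn.Linear(p, d_model) for p in patch_sizes]
        )

    def forward(self, x):
        # x: (batch, seq_len); seq_len is assumed divisible by every patch size.
        tokens = []
        for p, proj in zip(self.patch_sizes, self.projections):
            patches = x.unfold(1, p, p)   # (batch, seq_len // p, p)
            tokens.append(proj(patches))  # (batch, seq_len // p, d_model)
        # Concatenate tokens from all resolutions along the sequence axis
        # so the transformer can attend across scales in a single pass.
        return torch.cat(tokens, dim=1)

patcher = MultiResolutionPatcher()
series = torch.randn(8, 256)     # batch of 8 series, 256 time steps
print(patcher(series).shape)     # torch.Size([8, 84, 128]) -> 64 + 16 + 4 tokens
```

Each resolution gets its own projection, so coarse and fine patches land in the same token space and can be attended over jointly.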
Handling Auxiliary Variables: In MRT, auxiliary variables such as pricing, day of the week, or seasonality are treated separately from the time series itself. This allows for greater interpretability and flexibility. These auxiliary tokens are processed in parallel with the time series tokens, ensuring that the model can capture complex relationships between different data types.
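A hedged sketch of that parallel auxiliary stream, assuming one feature vector per time step (the feature names below are hypothetical examples, not MRT's required inputs):

```python
# Auxiliary variables (e.g., price, day-of-week, promotion flag) are
# embedded into the same token space as the series tokens so that one
# attention stack can relate the two streams.
import torch
import torch.nn as nn

class AuxiliaryTokenizer(nn.Module):
    def __init__(self, n_aux_features=3, d_model=128):
        super().__init__()
        self.proj = nn.Linear(n_aux_features, d_model)

    def forward(self, aux):
        # aux: (batch, seq_len, n_aux_features)
        return self.proj(aux)

aux = torch.randn(8, 256, 3)            # price, day-of-week, promo flag
aux_tokens = AuxiliaryTokenizer()(aux)  # (8, 256, 128)
# In a full model these tokens would be concatenated with the
# multi-resolution series tokens before the transformer encoder.
```

Keeping the auxiliary tokens separate, rather than stacking the features into each series patch, is what preserves the interpretability and flexibility described above.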
Cross-Series Information: Unlike many traditional methods that treat different time series as independent, MRT can capture cross-series dependencies. This is particularly useful in scenarios where multiple related time series (e.g., sales across different stores) need to be forecasted simultaneously. MRT uses a channel mixer module that learns these dependencies, ensuring the model has a holistic view across all time series.
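The exact design of the channel mixer isn't spelled out here, so the sketch below uses an MLP-Mixer-style feed-forward layer applied across the series axis as one plausible realization; treat it as an assumption rather than MRT's actual module:

```python
# Channel-mixer sketch: a small MLP applied across the series (channel)
# axis, letting each store's tokens borrow information from the others.
import torch
import torch.nn as nn

class ChannelMixer(nn.Module):
    def __init__(self, n_series, hidden=64):
        super().__init__()
        self.mix = nn.Sequential(
            nn.Linear(n_series, hidden),
            nn.GELU(),
            nn.Linear(hidden, n_series),
        )

    def forward(self, tokens):
        # tokens: (batch, n_series, n_tokens, d_model)
        # Move the series axis last, mix across it, then restore the layout.
        mixed = self.mix(tokens.permute(0, 2, 3, 1))  # (batch, n_tokens, d_model, n_series)
        return tokens + mixed.permute(0, 3, 1, 2)     # residual connection

tokens = torch.randn(8, 5, 84, 128)            # 5 related series, e.g., 5 stores
print(ChannelMixer(n_series=5)(tokens).shape)  # torch.Size([8, 5, 84, 128])
```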
Reverse Splitting Output Head: MRT’s innovative output head improves scaling efficiency by reversing the multi-resolution patching process. Instead of flattening the transformer’s output (which can become computationally expensive), MRT intelligently reconstructs the time series forecast from the processed tokens, scaling more efficiently with large datasets.
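A minimal sketch of such a head, under the same assumed patch sizes as before: each resolution's tokens are projected back to patch-sized forecast segments, and the per-resolution forecasts are averaged, so the head's parameter count stays independent of the total token count:

```python
# Reverse-splitting output head sketch (patch sizes and horizon assumed).
import torch
import torch.nn as nn

class ReverseSplittingHead(nn.Module):
    def __init__(self, patch_sizes=(4, 16, 64), d_model=128, horizon=64):
        super().__init__()
        self.patch_sizes = patch_sizes
        # One small projection per resolution: token -> forecast patch.
        self.heads = nn.ModuleList(
            [nn.Linear(d_model, p) for p in patch_sizes]
        )

    def forward(self, token_groups):
        # token_groups: one (batch, horizon // p, d_model) tensor per
        # resolution, covering the forecast horizon with horizon // p tokens.
        forecasts = []
        for tokens, head in zip(token_groups, self.heads):
            patches = head(tokens)                                 # (batch, horizon // p, p)
            forecasts.append(patches.reshape(tokens.size(0), -1))  # (batch, horizon)
        # Average the per-resolution forecasts instead of learning one huge
        # weight matrix over all flattened tokens.
        return torch.stack(forecasts).mean(dim=0)

head = ReverseSplittingHead()
groups = [torch.randn(8, 64 // p, 128) for p in (4, 16, 64)]
print(head(groups).shape)   # torch.Size([8, 64])
```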
How MRT Differs from Other Time Series Forecasting Techniques
Time series forecasting is a well-researched domain, with popular models like ARIMA, SARIMA, LSTM, Prophet, and DeepAR being some of the commonly used methods. However, MRT stands out due to its transformer architecture and its specialized focus on tokenization strategies. Below are key distinctions:
- MRT vs FB Prophet:
- FB Prophet: Prophet is robust and user-friendly, designed to capture seasonality and trends with human-friendly parameters, but it lacks the deep learning capabilities of MRT.
- MRT: MRT's transformer-based approach lets it model highly nonlinear relationships and a much broader data context than Prophet can, especially when handling auxiliary variables like price and weather.
- MRT vs ARIMA/SARIMA:
- ARIMA/SARIMA: These statistical models are excellent for linear patterns and univariate time series data, but they struggle with non-stationary data and multiple seasonalities.
- MRT: MRT, on the other hand, leverages deep learning to model non-linearities, and by analyzing different resolutions simultaneously it can handle non-stationary series with multiple seasonal patterns without extensive pre-transformation.
- MRT vs LSTM:
- LSTM: Long Short-Term Memory (LSTM) networks have been a popular choice for deep learning in time series, but they often struggle with long-term dependencies unless extensively tuned.
- MRT: MRT processes data at multiple scales simultaneously, and its transformer architecture excels by design at handling both short- and long-term dependencies.
- MRT vs XGBoost:
- XGBoost: Though extremely powerful for tabular data, XGBoost struggles with sequential dependencies and multivariate forecasting.
- MRT: MRT handles these complexities natively, especially through its cross-series tokenization.
- MRT vs DeepAR:
- DeepAR: DeepAR uses autoregressive recurrent networks for probabilistic forecasting but can be complex to implement.
- MRT: MRT's multiple-resolution patching widens the effective context window, giving it an edge in more complex forecasting scenarios like markdown pricing.
Advantages of MRT
- Contextual Awareness: One of the standout features of MRT is its ability to broaden the context window by processing data at multiple resolutions simultaneously. This means it can capture both short-term fluctuations and long-term trends within a single model. By looking at the data through different lenses at the same time, MRT gains a deeper understanding of the patterns within the time series, leading to more accurate and insightful forecasts.
- Cross-Series Learning: The ability to model relationships between different time series makes MRT ideal for applications where multiple related series need to be forecasted, such as predicting sales across various stores.
- Handling Non-Stationarity: MRT excels in handling non-stationary time series data, a common challenge for traditional models like ARIMA and SARIMA. Its deep learning architecture adapts to changes in the data-generating process without the need for extensive pre-processing.
- Incorporating Auxiliary Variables: Unlike many traditional methods, MRT can easily include additional data types (e.g., prices, weather, promotions) into its predictions, making it a more versatile tool for real-world applications like retail sales forecasting.
- Scalability: MRT’s design, especially its reverse splitting output head, allows for efficient scaling. It can handle large datasets with multiple time series without becoming computationally expensive, a significant advantage over traditional methods.
Constraints of MRT
- Complexity in Training: MRT introduces a high number of hyperparameters, especially in terms of the number of resolutions to consider. This increases the difficulty of hyperparameter tuning and model optimization.
- Overfitting Risks: Due to the large number of tokens and auxiliary data incorporated into the model, MRT can be prone to overfitting, especially when applied to small datasets.
- Quadratic Scaling with Token Count: The transformer’s self-attention mechanism scales quadratically with the number of tokens. While MRT addresses this through efficient tokenization, the method still becomes computationally expensive as the number of tokens grows (a rough cost illustration follows this list).
- Computational Overhead: The increased complexity of multiple resolutions and auxiliary tokens means MRT can be more computationally intensive compared to simpler models like ARIMA or Prophet.
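As a rough illustration of the quadratic-scaling constraint noted above (the patch sizes here are assumptions, not values MRT prescribes), the token count N and the resulting attention cost for a history of length L with R resolutions and patch sizes p_r can be written as:

```latex
N = \sum_{r=1}^{R} \frac{L}{p_r}, \qquad \text{attention cost} = O\!\left(N^{2} \cdot d\right)
```

For instance, with L = 512 and patch sizes of 4, 16, and 64, N = 128 + 32 + 8 = 168, so attention compares roughly 168² ≈ 28,000 token pairs instead of 512² ≈ 262,000 for unpatched, point-wise attention; adding finer resolutions pushes N back up.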
Conclusion: Why MRT is a Game-Changer
MRT represents a significant leap forward in time series forecasting, offering capabilities that surpass many traditional methods. Its ability to model time series at multiple resolutions, incorporate auxiliary data, and learn from cross-series relationships makes it a powerful tool for complex forecasting tasks. While it has some constraints in terms of computational cost and complexity, its advantages in handling real-world, noisy, and highly variable data make it a compelling choice for industries like retail, finance, and logistics.
As time series data continues to grow in complexity and volume, techniques like MRT will become increasingly important. Future research and development could focus on automating the resolution selection process and optimizing computational efficiency.
Author: Aman Gupta, Consultant, Data Science