What is Multiple-Resolution Tokenization (MRT)?
Time series forecasting plays a crucial role in data science, impacting everything from finance to healthcare. Practitioners have long relied on models like ARIMA and LSTM, as well as more advanced methods like DeepAR, each with its own strengths and weaknesses. Multiple-Resolution Tokenization (MRT) is a newer technique that aims to improve on them by capturing patterns at multiple resolutions at once, making it a promising development for handling complex time series data.
Multiple-Resolution Tokenization (MRT) is an emerging transformer-based architecture designed to tackle the unique challenges of time series forecasting. It's particularly useful in fields with high variability, non-stationary data, and limited sample sizes, where traditional methods often fall short. MRT tokenizes time series data at multiple resolutions, producing tokens that capture crucial patterns over time and enhance the model's ability to forecast effectively. This is especially important in contexts where additional variables, such as pricing in the retail world, play a significant role in shaping future trends.
What makes MRT stand out from other transformer-based models is that it explicitly integrates auxiliary data and cross-series information into its architecture. This allows it to learn from data at various resolutions and scales such as hourly, daily, and monthly patterns, capturing both local trends (fine resolution) and long-term trends (coarser resolution). Think of it like viewing a landscape through different lenses: using a microscope to see fine details and a telescope to observe the broader view. This multi-scale perspective allows MRT to improve forecasting accuracy, especially when dealing with complex datasets where patterns exist at various time intervals.
How MRT Works: Detailed Breakdown
MRT uses a multi-resolution patching technique, which splits the time series into various segments across different resolutions. The model treats each segment as a token and processes these tokens in parallel using the transformer’s attention mechanism. This ensures that the model can simultaneously capture relationships across fine-grained short-term patterns and longer-term trends.
Here’s how MRT operates in detail:
Multiple-Resolution Patching: This is the heart of MRT. The time series is divided into several resolutions (e.g., large patches for long-term trends and smaller patches for short-term behaviors). This patching enables the model to learn patterns at different scales simultaneously. For each resolution, tokens are created that capture both past observations and auxiliary data.
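To make the patching concrete, here is a minimal PyTorch sketch. The patch sizes (4, 16, 64) and the embedding width are illustrative assumptions, not values prescribed by MRT:

```python
# Minimal multi-resolution patching sketch (patch sizes are assumed).
import torch
import torch.nn as nn

class MultiResolutionPatcher(nn.Module):
    def __init__(self, patch_sizes=(4, 16, 64), d_model=128):
        super().__init__()
        self.patch_sizes = patch_sizes
        # One linear projection per resolution: a patch of p consecutive
        # observations becomes a single d_model-dimensional token.
        self.projections = nn.ModuleList(
            [nn.Linear(p, d_model) for p in patch_sizes]
        )

    def forward(self, x):
        # x: (batch, seq_len); seq_len is assumed divisible by every patch size.
        tokens = []
        for p, proj in zip(self.patch_sizes, self.projections):
            patches = x.unfold(1, p, p)   # (batch, seq_len // p, p)
            tokens.append(proj(patches))  # (batch, seq_len // p, d_model)
        # Concatenate tokens from all resolutions along the sequence axis
        # so the transformer can attend across scales in a single pass.
        return torch.cat(tokens, dim=1)

patcher = MultiResolutionPatcher()
series = torch.randn(8, 256)     # batch of 8 series, 256 time steps
print(patcher(series).shape)     # torch.Size([8, 84, 128]) -> 64 + 16 + 4 tokens
```

Each resolution gets its own projection, so coarse and fine patches land in the same token space and can be attended over jointly.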
Handling Auxiliary Variables: In MRT, auxiliary variables such as pricing, day of the week, or seasonality are treated separately from the time series itself. This allows for greater interpretability and flexibility. These auxiliary tokens are processed in parallel with the time series tokens, ensuring that the model can capture complex relationships between different data types.
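A hedged sketch of that parallel auxiliary stream, assuming one feature vector per time step (the feature names below are hypothetical examples, not MRT's required inputs):

```python
# Auxiliary variables (e.g., price, day-of-week, promotion flag) are
# embedded into the same token space as the series tokens so that one
# attention stack can relate the two streams.
import torch
import torch.nn as nn

class AuxiliaryTokenizer(nn.Module):
    def __init__(self, n_aux_features=3, d_model=128):
        super().__init__()
        self.proj = nn.Linear(n_aux_features, d_model)

    def forward(self, aux):
        # aux: (batch, seq_len, n_aux_features)
        return self.proj(aux)

aux = torch.randn(8, 256, 3)            # price, day-of-week, promo flag
aux_tokens = AuxiliaryTokenizer()(aux)  # (8, 256, 128)
# In a full model these tokens would be concatenated with the
# multi-resolution series tokens before the transformer encoder.
```

Keeping the auxiliary tokens separate, rather than stacking the features into each series patch, is what preserves the interpretability and flexibility described above.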
Cross-Series Information: Unlike many traditional methods that treat different time series as independent, MRT can capture cross-series dependencies. This is particularly useful in scenarios where multiple related time series (e.g., sales across different stores) need to be forecasted simultaneously. MRT uses a channel mixer module that learns these dependencies, ensuring the model has a holistic view across all time series.
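The exact design of the channel mixer isn't spelled out here, so the sketch below uses an MLP-Mixer-style feed-forward layer applied across the series axis as one plausible realization; treat it as an assumption rather than MRT's actual module:

```python
# Channel-mixer sketch: a small MLP applied across the series (channel)
# axis, letting each store's tokens borrow information from the others.
import torch
import torch.nn as nn

class ChannelMixer(nn.Module):
    def __init__(self, n_series, hidden=64):
        super().__init__()
        self.mix = nn.Sequential(
            nn.Linear(n_series, hidden),
            nn.GELU(),
            nn.Linear(hidden, n_series),
        )

    def forward(self, tokens):
        # tokens: (batch, n_series, n_tokens, d_model)
        # Move the series axis last, mix across it, then restore the layout.
        mixed = self.mix(tokens.permute(0, 2, 3, 1))  # (batch, n_tokens, d_model, n_series)
        return tokens + mixed.permute(0, 3, 1, 2)     # residual connection

tokens = torch.randn(8, 5, 84, 128)            # 5 related series, e.g., 5 stores
print(ChannelMixer(n_series=5)(tokens).shape)  # torch.Size([8, 5, 84, 128])
```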
Reverse Splitting Output Head: MRT’s innovative output head improves scaling efficiency by reversing the multi-resolution patching process. Instead of flattening the transformer’s output (which can become computationally expensive), MRT intelligently reconstructs the time series forecast from the processed tokens, scaling more efficiently with large datasets.
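A minimal sketch of such a head, under the same assumed patch sizes as before: each resolution's tokens are projected back to patch-sized forecast segments, and the per-resolution forecasts are averaged, so the head's parameter count stays independent of the total token count:

```python
# Reverse-splitting output head sketch (patch sizes and horizon assumed).
import torch
import torch.nn as nn

class ReverseSplittingHead(nn.Module):
    def __init__(self, patch_sizes=(4, 16, 64), d_model=128, horizon=64):
        super().__init__()
        self.patch_sizes = patch_sizes
        # One small projection per resolution: token -> forecast patch.
        self.heads = nn.ModuleList(
            [nn.Linear(d_model, p) for p in patch_sizes]
        )

    def forward(self, token_groups):
        # token_groups: one (batch, horizon // p, d_model) tensor per
        # resolution, covering the forecast horizon with horizon // p tokens.
        forecasts = []
        for tokens, head in zip(token_groups, self.heads):
            patches = head(tokens)                                 # (batch, horizon // p, p)
            forecasts.append(patches.reshape(tokens.size(0), -1))  # (batch, horizon)
        # Average the per-resolution forecasts instead of learning one huge
        # weight matrix over all flattened tokens.
        return torch.stack(forecasts).mean(dim=0)

head = ReverseSplittingHead()
groups = [torch.randn(8, 64 // p, 128) for p in (4, 16, 64)]
print(head(groups).shape)   # torch.Size([8, 64])
```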
How MRT Differs from Other Time Series Forecasting Techniques
Time series forecasting is a well-researched domain, with popular models like ARIMA, SARIMA, LSTM, Prophet, and DeepAR being some of the commonly used methods. However, MRT stands out due to its transformer architecture and its specialized focus on tokenization strategies. Below are key distinctions:
- MRT vs FB Prophet:
- FB Prophet: Prophet is robust and user-friendly, designed to capture seasonality and trends with human-friendly parameters, but it lacks the deep learning capabilities of MRT.
- MRT: MRT's transformer-based approach lets it model highly nonlinear relationships and a much broader data context than Prophet can, especially when handling auxiliary variables like price and weather.
- MRT vs ARIMA/SARIMA:
- ARIMA/SARIMA: These statistical models are excellent for linear patterns and univariate time series data, but they struggle with non-stationary data and multiple seasonalities.
- MRT: MRT, on the other hand, leverages deep learning to model non-linearities, and by analyzing different resolutions simultaneously it can handle non-stationary series with multiple seasonal patterns without extensive pre-transformation.
- MRT vs LSTM:
- LSTM: Long Short-Term Memory (LSTM) networks have been a popular choice for deep learning in time series, but they often struggle with long-term dependencies unless extensively tuned.
- MRT: MRT processes data at multiple scales simultaneously, and its transformer architecture excels by design at handling both short- and long-term dependencies.
- MRT vs XGBoost:
- XGBoost: Though extremely powerful for tabular data, XGBoost struggles with sequential dependencies and multivariate forecasting.
- MRT: MRT handles these complexities natively, especially through its cross-series tokenization.
- MRT vs DeepAR:
- DeepAR: DeepAR uses autoregressive recurrent networks for probabilistic forecasting but can be complex to implement.
- MRT: MRT's multiple-resolution patching widens the effective context window, giving it an edge in more complex forecasting scenarios like markdown pricing.
Advantages of MRT
- Contextual Awareness: One of the standout features of MRT is its ability to broaden the context window by processing data at multiple resolutions simultaneously. This means it can capture both short-term fluctuations and long-term trends within a single model. By looking at the data through different lenses at the same time, MRT gains a deeper understanding of the patterns within the time series, leading to more accurate and insightful forecasts.
- Cross-Series Learning: The ability to model relationships between different time series makes MRT ideal for applications where multiple related series need to be forecasted, such as predicting sales across various stores.
- Handling Non-Stationarity: MRT excels in handling non-stationary time series data, a common challenge for traditional models like ARIMA and SARIMA. Its deep learning architecture adapts to changes in the data-generating process without the need for extensive pre-processing.
- Incorporating Auxiliary Variables: Unlike many traditional methods, MRT can easily include additional data types (e.g., prices, weather, promotions) into its predictions, making it a more versatile tool for real-world applications like retail sales forecasting.
- Scalability: MRT’s design, especially its reverse splitting output head, allows for efficient scaling. It can handle large datasets with multiple time series without becoming computationally expensive, a significant advantage over traditional methods.
Constraints of MRT
- Complexity in Training: MRT introduces a high number of hyperparameters, especially in terms of the number of resolutions to consider. This increases the difficulty of hyperparameter tuning and model optimization.
- Overfitting Risks: Due to the large number of tokens and auxiliary data incorporated into the model, MRT can be prone to overfitting, especially when applied to small datasets.
- Quadratic Scaling with Token Count: The transformer’s self-attention mechanism scales quadratically with the number of tokens. While MRT addresses this through efficient tokenization, the method still becomes computationally expensive as the number of tokens grows (a rough cost illustration follows this list).
- Computational Overhead: The increased complexity of multiple resolutions and auxiliary tokens means MRT can be more computationally intensive compared to simpler models like ARIMA or Prophet.
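As a rough illustration of the quadratic-scaling constraint noted above (the patch sizes here are assumptions, not values MRT prescribes), the token count N and the resulting attention cost for a history of length L with R resolutions and patch sizes p_r can be written as:

```latex
N = \sum_{r=1}^{R} \frac{L}{p_r}, \qquad \text{attention cost} = O\!\left(N^{2} \cdot d\right)
```

For instance, with L = 512 and patch sizes of 4, 16, and 64, N = 128 + 32 + 8 = 168, so attention compares roughly 168² ≈ 28,000 token pairs instead of 512² ≈ 262,000 for unpatched, point-wise attention; adding finer resolutions pushes N back up.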
Conclusion: Why MRT is a Game-Changer
MRT represents a significant leap forward in time series forecasting, offering capabilities that surpass many traditional methods. Its ability to model time series at multiple resolutions, incorporate auxiliary data, and learn from cross-series relationships makes it a powerful tool for complex forecasting tasks. While it has some constraints in terms of computational cost and complexity, its advantages in handling real-world, noisy, and highly variable data make it a compelling choice for industries like retail, finance, and logistics.
As time series data continues to grow in complexity and volume, techniques like MRT will become increasingly important. Future research and development could focus on automating the resolution selection process and optimizing computational efficiency.
Author: Aman Gupta, Consultant, Data Science